In [1], we create NER datasets containing short, low-context sentences and queries. These include training, development, and test sets (LOWNER) derived from Wikipedia sentences. We also create two test sets, MSQ-NER and ORCAS-NER, extracted from MS-MARCO (natural language questions) and ORCAS (search queries), respectively. All released sets contain text with aligned entity annotations in CoNLL format. The release also includes gazetteer data composed of 1.67 million entities from the English Wikidata knowledge base.
The dataset is stored in a public Amazon S3 bucket: lowcontext-ner-gaz. See more in Open Data on AWS.
You will need to install the AWS Command Line Interface to access the dataset. For example, to download the dataset, you can use:
aws s3 cp s3://lowcontext-ner-gaz ./ --recursive --no-sign-request
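Once the files are downloaded, the CoNLL-format annotations can be read with a few lines of Python. The following is a minimal sketch, not the authors' loader. It assumes that each token line is whitespace-separated with the token in the first column and the entity tag in the last, that blank lines separate sentences, and that lines starting with "# id" are sentence headers. The function name read_conll and the sample lines are hypothetical; check the downloaded files for the actual column layout.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag) pairs.

    Assumed layout (verify against the real files): token in the first
    column, tag in the last column, blank line between sentences.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("# id"):
            # Assumed sentence-header line; carries no token.
            continue
        fields = line.split()
        sentence.append((fields[0], fields[-1]))
    if sentence:
        # Flush the last sentence when the input lacks a trailing blank line.
        yield sentence


if __name__ == "__main__":
    # Made-up sample lines for illustration only, not taken from the dataset.
    sample = ["# id 1", "Paris _ _ B-LOC", "is _ _ O", "", "nice _ _ O"]
    for sent in read_conll(sample):
        print(sent)
```

To read a downloaded file, pass an open file object instead of the sample list, e.g. `read_conll(open(path, encoding="utf-8"))`, where `path` is whichever file you copied from the bucket.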