Low Context Named Entity Recognition (NER) Datasets with Gazetteer

Description

In [1], we create NER datasets containing short, low-context sentences and queries. These include training, development, and test sets (LOWNER) derived from Wikipedia sentences. We also create two test sets, MSQ-NER and ORCAS-NER, extracted from MS-MARCO (natural language questions) and ORCAS (search queries), respectively. All released sets contain text with aligned entity annotations in CoNLL format. The release also includes gazetteer data composed of 1.67 million entities from the English Wikidata knowledge base.
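Because the annotations are released in CoNLL format, a small reader is often the first thing needed. The following is a minimal sketch, not the authors' loader. It assumes one token per line with whitespace-separated columns, the token in the first column, the tag in the last column, and a blank line between sentences. The function name read_conll and the tags in the sample are illustrative only; check the actual files for their exact column layout and label set.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences of (token, tag) pairs.

    Assumptions (not confirmed by the dataset description):
    columns are whitespace-separated, the token is the first column,
    the tag is the last column, and blank lines separate sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    # Keep the last sentence when the input lacks a trailing blank line.
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample; the tags shown here are placeholders.
sample = """\
who O
wrote O
the B-CW
hobbit I-CW

paris B-LOC
"""

sentences = read_conll(sample.splitlines())
for sentence in sentences:
    print(sentence)
```

To read a downloaded file, pass an open file handle instead of the sample lines, for example read_conll(open(path, encoding="utf-8")), where path is whichever file you want to load.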

License

CC BY 4.0

How to Download

The dataset is stored in a public Amazon S3 bucket: lowcontext-ner-gaz. See more in Open Data on AWS.

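You will need to install the AWS Command Line Interface (AWS CLI) to access the dataset. For example, to download the entire dataset into the current directory, run the following command. The --no-sign-request option lets you access the bucket without AWS credentials.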

aws s3 cp s3://lowcontext-ner-gaz ./ --recursive --no-sign-request

Reference

  1. GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input. 2021. Tao Meng, Anjie Fang, Oleg Rokhlenko and Shervin Malmasi. In Proceedings of NAACL.