@Juan0001
Created April 24, 2018 12:40
CoNLL-2003

CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

The CoNLL-2003 shared task (Sang et al. 2003) concerns language-independent named entity recognition, with data in two languages: English and German.

Dataset

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. The tag B-TYPE is used only when two phrases of the same type immediately follow each other: the first word of the second phrase is tagged B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase.
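The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official reader: the function names `read_conll` and `iob1_spans` are hypothetical, the sample lines are illustrative, and skipping `-DOCSTART-` marker lines is an assumption about the distributed files rather than something stated above.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NE tag)


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group four-column lines into sentences; an empty line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # Assumption: document-boundary marker lines start with -DOCSTART-.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        current.append((fields[0], fields[1], fields[2], fields[3]))
    if current:
        sentences.append(current)
    return sentences


def iob1_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Turn I-TYPE / B-TYPE / O tags into (type, start, end_exclusive) spans."""
    spans: List[Tuple[str, int, int]] = []
    start, current_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes an open span
        prefix, _, etype = tag.partition("-")
        # A span ends at O, at a B- tag, or when the entity type changes.
        if current_type is not None and (
            tag == "O" or prefix == "B" or etype != current_type
        ):
            spans.append((current_type, start, i))
            start, current_type = None, None
        if tag != "O" and current_type is None:
            start, current_type = i, etype
    return spans


# Illustrative sample in the four-column layout.
SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

sentences = read_conll(SAMPLE.splitlines())
ner_tags = [token[3] for token in sentences[0]]
print(iob1_spans(ner_tags))
```

The same `iob1_spans` helper works on the chunk column (index 2), since chunk tags follow the same I-TYPE/B-TYPE/O scheme.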

The English data is a collection of news wire articles from the Reuters Corpus. The annotation was done by people of the University of Antwerp. For copyright reasons, only the annotations are made available. To build the complete data sets you will need access to the Reuters Corpus, which can be obtained from NIST for research purposes without any charge.

The German data is a collection of articles from the Frankfurter Rundschau. The named entities were annotated by people of the University of Antwerp. Only the annotations are available here. To build these data sets you need access to the ECI Multilingual Text Corpus, which can be ordered from the Linguistic Data Consortium.

Results

| Reference | Method | F1 |
| --- | --- | --- |
| Florian et al. (2003) | Combination of various machine-learning classifiers | 88.76 |
| Ando et al. (2005) | Semi-supervised approach | 89.31 |
| Ratinov et al. (2009) | Word-class Model | 90.80 |
| Lin et al. (2009) | W500 + P125 + P64 | 90.90 |
| Collobert et al. (2011) | NN+SLL+LM2 | 88.67 |
| Collobert et al. (2011) | NN+SLL+LM2+Gazetteer | 89.59 |
| Suzuki et al. (2011) | L1CRF | 91.02 |
| Passos et al. (2014) | Baseline + Gaz + LexEmb | 90.90 |
| Huang et al. (2015) | BI-LSTM-CRF | 90.10 |
| Chiu et al. (2015) | BLSTM-CNN + emb + lex | 91.62 |
| Luo et al. (2015) | JERL | 91.20 |

References

  • Named Entity Recognition with Bidirectional LSTM-CNNs (CL'15), JPC Chiu et al.
  • Bidirectional LSTM-CRF Models for Sequence Tagging (EMNLP'15), Z Huang et al.
  • Joint entity recognition and disambiguation (EMNLP'15), G Luo et al.
  • Lexicon infused phrase embeddings for named entity resolution (ACL'14), A Passos et al.
  • Learning condensed feature representations from large unsupervised data sets for supervised learning (ACL'11), J Suzuki et al.
  • Natural Language Processing (Almost) from Scratch (CL'11), R Collobert et al.
  • Design Challenges and Misconceptions in Named Entity Recognition (CoNLL'09), L Ratinov et al.
  • Phrase Clustering for Discriminative Learning (ACL'09), D Lin et al.
  • A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data (JMLR'05), RK Ando et al.
  • Named Entity Recognition through Classifier Combination (HLT-NAACL'03), R Florian et al.
  • Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition (CoNLL'03), EFTK Sang et al.
