Juan L. Kehoe Juan0001

Juan0001 / CoNLL-2003.md
Created April 24, 2018 12:40
CoNLL-2003

CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

The CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) deals with language-independent named entity recognition, with data provided for English and German.

Dataset

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is the word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only when two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase.
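The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names `read_conll` and `extract_entities` are hypothetical, and the sample sentence is only illustrative. It also assumes that document-boundary lines beginning with `-DOCSTART-` (present in the released files but not described above) should be skipped.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NE tag)


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group four-column lines into sentences; blank lines end a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # Assumption: -DOCSTART- marker lines are treated like sentence breaks.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Turn I-TYPE / B-TYPE / O tags into (type, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, cur = ("O", None) if tag == "O" else tuple(tag.split("-", 1))
        # Close the open phrase on O, on a type change, or on an explicit B- tag.
        if start is not None and (cur != etype or prefix == "B"):
            spans.append((etype, start, i))
            start, etype = None, None
        if cur is not None and start is None:
            start, etype = i, cur
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans


sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

sentences = read_conll(sample.splitlines())
ner_tags = [token[3] for token in sentences[0]]
entities = extract_entities(ner_tags)
print(entities)
```

Note that `extract_entities` starts a new span on a B-TYPE tag even when the type is unchanged, which is how two adjacent phrases of the same type stay separate.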

The English data is a collection of news wire articles from the Reuters Corpus. The annotation has been done

This post examines the features of [R Markdown](http://www.rstudio.org/docs/authoring/using_markdown)
using [knitr](http://yihui.name/knitr/) in RStudio 0.96.
This combination of tools provides an exciting improvement in usability for
[reproducible analysis](http://stats.stackexchange.com/a/15006/183).
Specifically, this post
(1) discusses getting started with R Markdown and `knitr` in RStudio 0.96;
(2) provides a basic example of producing console output and plots using R Markdown;
(3) highlights several code chunk options such as caching and controlling how input and output are displayed;
(4) demonstrates use of standard Markdown notation as well as the extended features of formulas and tables; and
(5) discusses the implications of R Markdown.