This page provides links to a few of the projects started at the HLTCOE. Additional resources can be found on the HLTCOE GitHub page.


Concrete is a data serialization format for NLP. It replaces ad-hoc XML, CSV, or programming language-specific serialization as a way of storing document- and sentence- level annotations.

Concretely Annotated Corpora

Under the heading Concretely Annotated, we processed a variety of standard corpora with multiple popular NLP tool-chains, collected together under a single data schema we have created that we refer to as Concrete. We envision a multimodal workflow, where, e.g., knowledge can be extracted from both text and audio. We developed Concrete to record and share annotations on structured human language data — both text and speech.

The Wikipedia portion of Concretely Annotated Corpora is available from the JHU data archive . The Gigaword and Annotated NYT portions will be available through the LDC.

For the full description, see Ferraro et al. (2014).