This page provides links to a few of the projects started at the HLTCOE. Additional resources can be found on the HLTCOE GitHub page.


Concrete is a data serialization format for NLP. It replaces ad-hoc XML, CSV, or programming language-specific serialization as a way of storing document- and sentence-level annotations.
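To illustrate the general idea of a typed schema for document- and sentence-level annotations, here is a minimal Python sketch. It is not Concrete's actual schema or API: the names `Annotation`, `Sentence`, `Document`, `to_json`, and `from_json`, the tool names, and the use of JSON as the wire format are all assumptions made for this example only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Hypothetical types for illustration only; NOT Concrete's real schema.


@dataclass
class Annotation:
    tool: str   # name of the tool that produced the annotation
    kind: str   # annotation type, e.g. "pos" or "language"
    value: str  # the annotation payload


@dataclass
class Sentence:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    sentences: List[Sentence] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)


def to_json(doc: Document) -> str:
    """Serialize a document and all nested annotations to a JSON string."""
    return json.dumps(asdict(doc), sort_keys=True)


def from_json(payload: str) -> Document:
    """Rebuild the typed document from its JSON form."""
    raw = json.loads(payload)
    return Document(
        doc_id=raw["doc_id"],
        sentences=[
            Sentence(
                text=s["text"],
                annotations=[Annotation(**a) for a in s["annotations"]],
            )
            for s in raw["sentences"]
        ],
        annotations=[Annotation(**a) for a in raw["annotations"]],
    )


# One document-level and one sentence-level annotation from two made-up tools.
doc = Document(
    doc_id="example-001",
    sentences=[
        Sentence(
            text="Concrete stores annotations.",
            annotations=[Annotation("tagger-a", "pos", "NNP VBZ NNS .")],
        )
    ],
    annotations=[Annotation("langid-b", "language", "eng")],
)
restored = from_json(to_json(doc))
```

Because every annotation records which tool produced it, output from several tool-chains can sit side by side on the same document without custom per-tool file formats.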

Concretely Annotated Corpora - Coming 2017

Under the heading Concretely Annotated, we are processing a variety of standard corpora with multiple popular NLP tool-chains and collecting the output under a single data schema that we created, called Concrete. We developed Concrete to record and share annotations on structured human language data, both text and speech. We envision a multimodal workflow in which, for example, knowledge can be extracted from both text and audio.

For the full description, see Ferraro et al. (2014). Because the individual datasets are large, we plan to release the following corpora incrementally through the LDC:

  • English Gigaword
  • Annotated NYT
  • TAC Cold Start
  • Wikipedia