Thrift definitions, making HLT data specifications concrete

Getting started

This document gets users started with Concrete. It is not intended to be a comprehensive tutorial, but rather a starting point for exploration.

What is Concrete?

Concrete is a data serialization format for NLP. It replaces ad-hoc XML, CSV, or programming language-specific serialization as a way of storing document- and sentence-level annotations. Concrete is based on Apache Thrift and thus works cross-platform and in almost all popular programming languages, including JavaScript, C++, Java, and Python. To learn more about design considerations and motivation, have a look at the white paper, Ferraro et al. (2014). The details of the data format (schema) are described here.

In addition to data serialization, we provide Concretely annotated data! We’ve done the hard work of running a variety of tools, such as Stanford’s NLP pipeline and the HLTCOE/JHU NLP tools, on an abundance of data. For more information on data releases, see the Concrete homepage.

Data Format

Capturing Document Structure

A Communication is the primary document model. A Communication’s full document text is stored in the Communication.text string field; a Communication’s id may be a headline, URL, or some other identifying/characterizing feature. Communications have Sections, which themselves have Sentences. A Sentence has a Tokenization, which is where DependencyParses, Constituent Parses, and other sentence-level (syntactic) structures are stored. Token-level annotations, like part of speech and named entity labels, are stored as TokenTaggings within a Tokenization. All of these structures and annotation objects have a unique identifier (UUID). UUIDs act as pointers: they allow annotations to be cross-referenced with others.
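
The nesting described above can be sketched with plain Python dataclasses. This is a simplified illustration, not the Thrift-generated Concrete classes: the class and field names below (and the `index_by_uuid` helper) are assumptions chosen to mirror the prose, and a UUID is represented as a plain string.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


def new_uuid() -> str:
    """Return a fresh identifier string (a stand-in for Concrete's UUID struct)."""
    return str(uuid.uuid4())


@dataclass
class TokenTagging:
    tagging_type: str  # e.g. "POS" or "NER"
    tags: List[str]    # one tag per token, in token order
    uuid: str = field(default_factory=new_uuid)


@dataclass
class Tokenization:
    tokens: List[str]
    taggings: List[TokenTagging] = field(default_factory=list)
    uuid: str = field(default_factory=new_uuid)


@dataclass
class Sentence:
    tokenization: Tokenization
    uuid: str = field(default_factory=new_uuid)


@dataclass
class Section:
    sentences: List[Sentence]
    uuid: str = field(default_factory=new_uuid)


@dataclass
class Communication:
    id: str
    text: str
    sections: List[Section]
    uuid: str = field(default_factory=new_uuid)


def index_by_uuid(comm: Communication) -> Dict[str, object]:
    """Map every UUID in the document to its object, so UUIDs can act as pointers."""
    index: Dict[str, object] = {comm.uuid: comm}
    for section in comm.sections:
        index[section.uuid] = section
        for sentence in section.sentences:
            index[sentence.uuid] = sentence
            tokenization = sentence.tokenization
            index[tokenization.uuid] = tokenization
            for tagging in tokenization.taggings:
                index[tagging.uuid] = tagging
    return index


# Build a one-sentence document and resolve a UUID back to its object.
pos = TokenTagging("POS", ["NNP", "POS", "NN", "NNP", "VBD", "NN", "."])
tokenization = Tokenization(
    ["John", "’s", "daughter", "Mary", "expressed", "sorrow", "."], [pos]
)
comm = Communication(
    id="example-doc",
    text="John’s daughter Mary expressed sorrow.",
    sections=[Section([Sentence(tokenization)])],
)
index = index_by_uuid(comm)
print(index[tokenization.uuid].tokens)
```

Keeping a UUID-to-object index like this is one way to follow cross-references cheaply; the real library may expose its own helpers for the same purpose.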

Global Annotations

Semantic, discourse and coreference annotations can cut across different sentences. Therefore, they are stored at the Communication level. Semantic and discourse annotations, like frame semantic parses, are stored as SituationMentions within SituationMentionSets, while individual mentions of entities are stored as EntityMentions within EntityMentionSets. While EntityMentions and SituationMentions both can ground out in specific tokens (using UUIDs to cross-reference), SituationMentions can ground out in EntityMentions or, recursively, other SituationMentions. If coreference decisions are made, then individual mentions can be clustered together into either SituationSets or EntitySets.
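
The cross-referencing described above can be sketched as follows. This is a hedged, simplified model rather than the Concrete API: the dataclasses, the `Argument` type, the `mention_text` helper, and the role and kind labels are all assumptions made for illustration. It shows mentions grounding out in tokens via a tokenization UUID, a situation mention pointing at an entity mention, and a coreference cluster holding mention UUIDs.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def new_uuid() -> str:
    """Return a fresh identifier string (a stand-in for Concrete's UUID struct)."""
    return str(uuid.uuid4())


@dataclass
class TokenRefSequence:
    tokenization_id: str      # UUID of the Tokenization being pointed at
    token_indices: List[int]  # positions of the referenced tokens


@dataclass
class EntityMention:
    tokens: TokenRefSequence
    entity_type: str
    uuid: str = field(default_factory=new_uuid)


@dataclass
class Argument:
    role: str
    # An argument points at an entity mention or, recursively, a situation mention.
    entity_mention_id: Optional[str] = None
    situation_mention_id: Optional[str] = None


@dataclass
class SituationMention:
    kind: str
    tokens: TokenRefSequence
    arguments: List[Argument] = field(default_factory=list)
    uuid: str = field(default_factory=new_uuid)


@dataclass
class Entity:
    mention_ids: List[str]  # UUIDs of coreferent EntityMentions
    uuid: str = field(default_factory=new_uuid)


def mention_text(mention, tokenizations: Dict[str, List[str]]) -> str:
    """Resolve a mention's token references back to the surface words."""
    ref = mention.tokens
    tokens = tokenizations[ref.tokenization_id]
    return " ".join(tokens[i] for i in ref.token_indices)


tok_id = new_uuid()
tokenizations = {
    tok_id: ["John", "’s", "daughter", "Mary", "expressed", "sorrow", "."]
}

daughter = EntityMention(TokenRefSequence(tok_id, [0, 1, 2]), "PERSON")
mary = EntityMention(TokenRefSequence(tok_id, [3]), "PERSON")
entity_mentions = {m.uuid: m for m in (daughter, mary)}

# A situation mention whose argument grounds out in an entity mention.
expressed = SituationMention(
    kind="EXPRESS",  # hypothetical label
    tokens=TokenRefSequence(tok_id, [4]),
    arguments=[Argument(role="agent", entity_mention_id=mary.uuid)],
)

# A coreference decision: both mentions refer to the same person.
person = Entity(mention_ids=[daughter.uuid, mary.uuid])

for arg in expressed.arguments:
    target = entity_mentions[arg.entity_mention_id]
    print(arg.role, "->", mention_text(target, tokenizations))
print([mention_text(entity_mentions[m], tokenizations) for m in person.mention_ids])
```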

Quick Start

Now we’re going to step through how to look at some Concrete data. This is meant to get our feet wet; it will not cover all of the annotation types listed above in Data Format. It relies on the concrete-python utility library, which also includes a number of useful command-line scripts.

We’ve also provided a Docker image containing the latest Concrete along with its Java and Python libraries. It can be found on Docker Hub:

$ docker pull hltcoe/concrete
$ docker run -i -t hltcoe/concrete:latest /bin/bash

Step 0: Install concrete-python

First, install the Python utility library, either directly on your machine:

$ pip install concrete

or by running the Docker image, as above.

Step 1: Get some data.

wget '' -O example.concrete

Step 2: What’s in this file?

2.1 Quicklime Communication viewer

Regardless of how you’re ingesting the data, it’s probably a good idea to view the contents of a .comm file using Quicklime. It stands up a mini-webserver that visualizes the data. Install Quicklime using:

pip install quicklime

To view a Concrete file:

$ <path-to>/example.concrete
Listening on http://localhost:8080/
Hit Ctrl-C to quit.

Now open your web browser and go to the link printed to the screen (http://localhost:8080/). For more information about the Quicklime project, check out the Quicklime GitHub repo.

2.2 Command-line tools

In addition to Quicklime, concrete-python provides a command-line tool for viewing the contents of Concrete files. This utility was made available when you installed concrete-python, so you can use it in any directory. Some example usage is shown below. For further usage, see the script’s --help option.

2.2.1 CoNLL-style output

$ example.concrete --pos --ner --lemmas --dependency
----- -----     -----    --- ---    ----
1     John      John     NNP PERSON 4
2     ’s        ’s       POS O      1
3     daughter  daughter NN  O      4
4     Mary      Mary     NNP PERSON 5
5     expressed express  VBD O      0
6     sorrow    sorrow   NN  O      5
7     .         .        .   O
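
A table like the one above is essentially a set of parallel per-token tag lists zipped together by token index. The following is a minimal sketch of that idea, not the implementation used by the command-line tool; `conll_rows` and `format_table` are hypothetical helper names, and the input lists are typed in by hand rather than read from a Concrete file.

```python
from typing import List


def conll_rows(tokens: List[str], taggings: List[List[str]]) -> List[List[str]]:
    """Zip tokens with parallel tag lists into rows with 1-based token indices."""
    for tags in taggings:
        if len(tags) != len(tokens):
            raise ValueError("each tagging needs exactly one tag per token")
    rows = []
    for i, token in enumerate(tokens):
        row = [str(i + 1), token]
        for tags in taggings:
            row.append(tags[i])
        rows.append(row)
    return rows


def format_table(rows: List[List[str]]) -> str:
    """Left-align every column to the width of its longest cell."""
    widths = [max(len(row[c]) for row in rows) for c in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)


tokens = ["John", "’s", "daughter", "Mary", "expressed", "sorrow", "."]
pos = ["NNP", "POS", "NN", "NNP", "VBD", "NN", "."]
ner = ["PERSON", "O", "O", "PERSON", "O", "O", "O"]

rows = conll_rows(tokens, [pos, ner])
print(format_table(rows))
```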

2.2.2 Parse tree

$ example.concrete --treebank
  (S (NP (NP (NNP John)
             (POS ’s))
         (NN daughter)
         (NNP Mary))
     (VP (VBD expressed)
         (NP (NN sorrow)))
     (. .)))
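
Bracketed output like this can be produced from a flat list of constituents that refer to their children by id. The sketch below assumes a simplified `Constituent` record in which a leaf carries the word itself as its tag; this is an illustrative model and not necessarily how Concrete stores or prints parses, and `to_brackets` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Constituent:
    id: int
    tag: str  # a label such as "NP", or the word itself for a leaf
    children: List[int] = field(default_factory=list)  # ids of child constituents


def to_brackets(constituents: List[Constituent], root_id: int = 0) -> str:
    """Render a flat, id-linked constituent list as a bracketed tree string."""
    by_id: Dict[int, Constituent] = {c.id: c for c in constituents}

    def render(cid: int) -> str:
        node = by_id[cid]
        if not node.children:
            return node.tag
        inner = " ".join(render(child) for child in node.children)
        return "(" + node.tag + " " + inner + ")"

    return render(root_id)


# The verb phrase from the example sentence, as a flat list.
vp = [
    Constituent(0, "VP", [1, 3]),
    Constituent(1, "VBD", [2]),
    Constituent(2, "expressed"),
    Constituent(3, "NP", [4]),
    Constituent(4, "NN", [5]),
    Constituent(5, "sorrow"),
]
print(to_brackets(vp))
```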

Programming with Concrete



If you’re using Java, you would use the following dependencies from Maven Central, which correspond to the v.4.4.4 tag of the concrete-java GitHub repo:


Read a Communication from a file

You can read in a Concrete file to a Communication object as follows.

// commFile is a java.io.File pointing at a serialized Communication, e.g. example.concrete.
CompactCommunicationSerializer ser = new CompactCommunicationSerializer();
Communication comm = ser.fromPathString(commFile.getAbsolutePath());

Iterate over sentences

From there, you can walk the Communication object as you would any other Java object. The Thrift spec defines the data structures. For example, you could iterate through the sentences as follows.

for (Section cSection : comm.getSectionList()) {
    for (Sentence cSent : cSection.getSentenceList()) {
        Tokenization cToks = cSent.getTokenization();
        // Do something with the sentence...
    }
}

Get the entities and situations

To get the entities and situations you’d do something like this:

// Get the entity mentions.
List<EntityMentionSet> cEmsList = comm.getEntityMentionSetList();
EntityMentionSet cEms = cEmsList.get(0); // Assumes there is exactly one set, as in this example.
// Get the situation mentions (e.g., relations).
List<SituationMentionSet> cSmsList = comm.getSituationMentionSetList();
SituationMentionSet cSms = cSmsList.get(0); // Assumes there is exactly one set, as in this example.