Getting started
In this document, we’ll work toward getting users started with Concrete. This document is not intended to be a comprehensive tutorial, but rather a starting point for exploration.
What is Concrete?
Concrete is a data serialization format for NLP. It replaces ad-hoc XML, CSV, or programming language-specific serialization as a way of storing document- and sentence-level annotations. Concrete is based on Apache Thrift and thus works cross-platform and in almost all popular programming languages, including JavaScript, C++, Java, and Python. To learn more about design considerations and motivation, have a look at the white paper, Ferraro et al. (2014). The details of the data format (schema) are described here.
In addition to data serialization, we provide Concretely annotated data! We’ve done the hard work of running a variety of tools, such as Stanford’s NLP pipeline and HLTCOE/JHU NLP tools, on an abundance of data. For more information on the data release, see the Concrete homepage.
Data Format
Capturing Document Structure
A Communication is the primary document model. A Communication’s full document text is stored in the Communication.text string field; a Communication’s id may be a headline, URL, or some other identifying/characterizing feature. Communications have Sections, which themselves have Sentences. A Sentence has a Tokenization, which is where DependencyParses, Constituent Parses, and other sentence-level (syntactic) structures are stored. Token-level annotations, like part of speech and named entity labels, are stored as TokenTaggings within a Tokenization. All of these structures and annotation objects have a unique identifier (UUID). UUIDs act as pointers: they allow annotations to be cross-referenced with others.
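The nesting described above can be sketched in a few lines of Java. This is only an illustration: the classes below are simplified, hypothetical stand-ins written for this example, not the Thrift-generated Concrete classes, and the field names are assumptions. The tokens and part-of-speech tags are taken from the example sentence used later in this document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Simplified stand-ins for illustration; NOT the Thrift-generated Concrete classes.
class DocumentStructureSketch {

    static final class Tokenization {
        final String uuid = UUID.randomUUID().toString();
        final List<String> tokens;
        // Token-level annotations: tagging type (e.g. "POS") -> one tag per token.
        final Map<String, List<String>> tokenTaggings = new HashMap<>();

        Tokenization(List<String> tokens) { this.tokens = tokens; }
    }

    static final class Sentence {
        final String uuid = UUID.randomUUID().toString();
        final Tokenization tokenization;

        Sentence(Tokenization tokenization) { this.tokenization = tokenization; }
    }

    static final class Section {
        final String uuid = UUID.randomUUID().toString();
        final List<Sentence> sentences = new ArrayList<>();
    }

    static final class Communication {
        final String id;
        final String text;
        final List<Section> sections = new ArrayList<>();

        Communication(String id, String text) { this.id = id; this.text = text; }
    }

    // Walk Communication -> Section -> Sentence and index each Tokenization by
    // its UUID, so that other annotations can point at it.
    static Map<String, Tokenization> indexTokenizations(Communication comm) {
        Map<String, Tokenization> byUuid = new HashMap<>();
        for (Section section : comm.sections) {
            for (Sentence sentence : section.sentences) {
                byUuid.put(sentence.tokenization.uuid, sentence.tokenization);
            }
        }
        return byUuid;
    }

    // Build a one-section, one-sentence example document.
    static Communication example() {
        Communication comm = new Communication(
                "example-doc", "John's daughter Mary expressed sorrow.");
        Tokenization tok = new Tokenization(Arrays.asList(
                "John", "'s", "daughter", "Mary", "expressed", "sorrow", "."));
        tok.tokenTaggings.put("POS", Arrays.asList(
                "NNP", "POS", "NN", "NNP", "VBD", "NN", "."));
        Section section = new Section();
        section.sentences.add(new Sentence(tok));
        comm.sections.add(section);
        return comm;
    }

    public static void main(String[] args) {
        Communication comm = example();
        for (Tokenization tok : indexTokenizations(comm).values()) {
            List<String> pos = tok.tokenTaggings.get("POS");
            for (int i = 0; i < tok.tokens.size(); i++) {
                System.out.println(tok.tokens.get(i) + "\t" + pos.get(i));
            }
        }
    }
}
```

The UUID index is the part worth noticing: any annotation that stores a Tokenization's UUID can be resolved back to the tokens it describes with a single map lookup.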
Global Annotations
Semantic, discourse and coreference annotations can cut across different sentences. Therefore, they are stored at the Communication level. Semantic and discourse annotations, like frame semantic parses, are stored as SituationMentions within SituationMentionSets, while individual mentions of entities are stored as EntityMentions within EntityMentionSets. While EntityMentions and SituationMentions both can ground out in specific tokens (using UUIDs to cross-reference), SituationMentions can ground out in EntityMentions or, recursively, other SituationMentions. If coreference decisions are made, then individual mentions can be clustered together into either SituationSets or EntitySets.
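As a rough illustration of how mentions ground out in tokens and are then clustered, here is a minimal Java sketch. The classes are hypothetical stand-ins written for this example, not the Thrift-generated Concrete classes; the identifiers such as "tok-1" are readable placeholders for real UUIDs, and the second sentence ("She cried .") is invented for the example. It covers only entity mentions and entities; situation mentions would follow the same pointer pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration; NOT the Thrift-generated Concrete classes.
class GlobalAnnotationSketch {

    // A mention points at tokens by naming a Tokenization UUID plus token indexes.
    static final class EntityMention {
        final String uuid;
        final String tokenizationUuid;
        final List<Integer> tokenIndexes;

        EntityMention(String uuid, String tokenizationUuid, List<Integer> tokenIndexes) {
            this.uuid = uuid;
            this.tokenizationUuid = tokenizationUuid;
            this.tokenIndexes = tokenIndexes;
        }
    }

    // A coreference cluster: an entity lists the UUIDs of its mentions.
    static final class Entity {
        final String uuid;
        final List<String> mentionUuids;

        Entity(String uuid, List<String> mentionUuids) {
            this.uuid = uuid;
            this.mentionUuids = mentionUuids;
        }
    }

    // Resolve a mention to its surface text by following the Tokenization UUID.
    static String mentionText(EntityMention mention,
                              Map<String, List<String>> tokensByTokenizationUuid) {
        List<String> tokens = tokensByTokenizationUuid.get(mention.tokenizationUuid);
        StringBuilder text = new StringBuilder();
        for (int index : mention.tokenIndexes) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(tokens.get(index));
        }
        return text.toString();
    }

    // Resolve every mention in a cluster, possibly across different sentences.
    static List<String> entityTexts(Entity entity,
                                    Map<String, EntityMention> mentionsByUuid,
                                    Map<String, List<String>> tokensByTokenizationUuid) {
        List<String> texts = new ArrayList<>();
        for (String mentionUuid : entity.mentionUuids) {
            texts.add(mentionText(mentionsByUuid.get(mentionUuid), tokensByTokenizationUuid));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Two tokenized sentences, keyed by placeholder Tokenization UUIDs.
        Map<String, List<String>> tokens = new HashMap<>();
        tokens.put("tok-1", Arrays.asList(
                "John", "'s", "daughter", "Mary", "expressed", "sorrow", "."));
        tokens.put("tok-2", Arrays.asList("She", "cried", "."));

        EntityMention mary = new EntityMention("m-1", "tok-1", Arrays.asList(3));
        EntityMention she = new EntityMention("m-2", "tok-2", Arrays.asList(0));
        Map<String, EntityMention> mentions = new HashMap<>();
        mentions.put(mary.uuid, mary);
        mentions.put(she.uuid, she);

        // One entity whose mentions live in two different sentences.
        Entity entity = new Entity("e-1", Arrays.asList("m-1", "m-2"));
        System.out.println(entityTexts(entity, mentions, tokens));
    }
}
```

Because the cluster stores only identifiers, a single entity can collect mentions from any sentence in the Communication, which is why these annotations live at the Communication level rather than inside a Sentence.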
Quick Start
Now we’re going to step through how to look at some Concrete data. This is meant to get our feet wet; it will not cover all of the annotation types listed above in Data Format. It relies on the concrete-python utility library, which also includes a number of useful command-line scripts.
We’ve also provided a Docker image containing the latest Concrete along with the Java and Python libraries. It can be found on Docker Hub:
$ docker pull hltcoe/concrete
$ docker run -i -t hltcoe/concrete:latest /bin/bash
#
Step 0: Install concrete-python
First, install the Python utility library, either directly on your machine
$ pip install concrete
or by running the Docker image, as above.
Step 1: Get some data.
$ wget 'https://github.com/hltcoe/quicklime/blob/master/agiga_dog-bites-man.concrete?raw=true' -O example.concrete
Step 2: What’s in this file?
2.1 Quicklime Communication viewer
Regardless of how you’re ingesting the data, it’s probably a good idea to view the contents of a Concrete file using Quicklime, which stands up a small web server that visualizes the data. Install Quicklime using:
$ pip install quicklime
To view a Concrete file:
$ qlook.py <path-to>/example.concrete
Listening on http://localhost:8080/
Hit Ctrl-C to quit.
Now, open your web browser and go to the link printed to the screen, http://localhost:8080/. For more information about the Quicklime project, check out the Quicklime GitHub repo.
2.2 Command-line tools
In addition to Quicklime, concrete-inspect.py is another tool for viewing the contents of Concrete files. This utility was made available when you installed concrete-python, so you can use it in any directory. Below is some example usage. For further usage, use the script’s --help option.
2.2.1 CoNLL-style output.
$ concrete-inspect.py example.concrete --pos --ner --lemmas --dependency
INDEX  TOKEN      LEMMA     POS  NER     HEAD
-----  -----      -----     ---  ---     ----
1      John       John      NNP  PERSON  4
2      ’s         ’s        POS  O       1
3      daughter   daughter  NN   O       4
4      Mary       Mary      NNP  PERSON  5
5      expressed  express   VBD  O       0
6      sorrow     sorrow    NN   O       5
7      .          .         .    O
2.2.2 Parse tree
$ concrete-inspect.py example.concrete --treebank
(ROOT
  (S (NP (NP (NNP John)
             (POS ’s))
         (NN daughter)
         (NNP Mary))
     (VP (VBD expressed)
         (NP (NN sorrow)))
     (. .)))
Programming with Concrete
Java
Installation
If you’re using Java, add the following dependencies from Maven Central. They correspond to the v4.4.4 tag of the concrete-java GitHub repo: https://github.com/hltcoe/concrete-java
<dependency>
  <groupId>edu.jhu.hlt</groupId>
  <artifactId>concrete-core</artifactId>
  <version>4.4.4</version>
</dependency>
<dependency>
  <groupId>edu.jhu.hlt</groupId>
  <artifactId>concrete-util</artifactId>
  <version>4.4.4</version>
</dependency>
Read a Communication from a file
You can read in a Concrete file to a Communication object as follows.
// commFile is assumed to be a java.io.File pointing at a Concrete file, e.g. example.concrete.
CompactCommunicationSerializer ser = new CompactCommunicationSerializer();
Communication comm = ser.fromPathString(commFile.getAbsolutePath());
Iterate over sentences
From there, you can walk the Communication object as you would any other Java object. The Thrift spec defines the data structures. For example, you could iterate through the sentences as follows.
for (Section cSection : comm.getSectionList()) {
  for (Sentence cSent : cSection.getSentenceList()) {
    Tokenization cToks = cSent.getTokenization();
    // ...do something with the sentence...
  }
}
Get the entities and situations
To get the entities and situations you’d do something like this:
// Get the entity mentions.
List<EntityMentionSet> cEmsList = comm.getEntityMentionSetList();
EntityMentionSet cEms = cEmsList.get(0); // Since there's only one set in this example.
// Get the situation mentions (e.g. relations).
List<SituationMentionSet> cSmsList = comm.getSituationMentionSetList();
SituationMentionSet cSms = cSmsList.get(0); // Since there's only one set in this example.