Excellent Human Language Technology

This page links to a few of the software projects started at the Human Language Technology Center of Excellence. Additional resources can be found on our GitHub page.

Turkle

Turkle is a Django web application that clones Amazon's Mechanical Turk service in your local environment, so you can collect annotations from local experts with the same templates and data files you use for crowd annotation. Our pip-installable ProtoTurk server can be used to rapidly prototype new templates and data files.
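As a rough illustration of how a template and a data file fit together, the sketch below fills Mechanical Turk style `${column}` placeholders in an HTML template from the rows of a CSV file, producing one task per row. It is not Turkle's or ProtoTurk's actual rendering code; the template, the CSV contents, and the `render_tasks` helper are hypothetical and use only the Python standard library.

```python
import csv
import html
import io
from string import Template

# Hypothetical task template: ${text} is filled from the CSV column named "text".
TEMPLATE_HTML = """<p>${text}</p>
<label><input type="radio" name="sentiment" value="pos"> Positive</label>
<label><input type="radio" name="sentiment" value="neg"> Negative</label>"""

# Hypothetical data file: the header row names the template variables.
DATA_CSV = """id,text
1,The service was quick.
2,The battery died in an hour.
"""


def render_tasks(template_html, data_csv):
    """Return one rendered HTML task per CSV row (illustrative helper)."""
    template = Template(template_html)
    reader = csv.DictReader(io.StringIO(data_csv))
    tasks = []
    for row in reader:
        # Escape cell values so the data cannot inject markup into the page.
        safe_row = {key: html.escape(value) for key, value in row.items()}
        # substitute() raises KeyError if the template uses a missing column.
        tasks.append(template.substitute(safe_row))
    return tasks


tasks = render_tasks(TEMPLATE_HTML, DATA_CSV)
print(tasks[0])
```

Because the same template and CSV pair drives both crowd and local annotation, a quick check like this can catch a placeholder that has no matching column before a batch is published.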

Getting Started | ProtoTurk

Patapsco

Patapsco is a scalable Python framework for reproducible cross-language information retrieval (CLIR) experiments.

Repository | Colab Demo

Concrete

Concrete is a cross-platform data serialization format and communication protocol for language annotations. It replaces ad-hoc TSV, XML, JSON, and other formats for storing document- and sentence-level language annotations. We developed Concrete to record and share annotations on structured human language data, including both text and speech.

Getting Started | Python | JavaScript | Java

Concretely Annotated Corpora

Under the heading Concretely Annotated, we have processed a variety of standard corpora with multiple popular NLP toolchains and stored the resulting annotations using the Concrete data schema.

Wikipedia | English Gigaword | The New York Times