Thrift module: structure

Module: structure

Data types:
Arc
Constituent
ConstituentRef
Dependency
DependencyParse
DependencyParseStructure
LatticePath
Parse
Section
Sentence
SpanLink
TaggedToken
Token
TokenLattice
TokenList
TokenRefSequence
TokenTagging
Tokenization
TokenizationKind

Enumerations

Enumeration: TokenizationKind

Enumerated types of Tokenizations


TOKEN_LIST 1
TOKEN_LATTICE 2

Data structures

Struct: Constituent

Key Field Type Description Requiredness Default value
1 id i32 A parse-relative identifier for this constituent. Together with the UUID for a Parse, this can be used to define pointers to specific constituents. required
2 tag string A description of this constituency node, e.g. the category "NP". For leaf nodes, this should be a word and for pre-terminal nodes this should be a POS tag. optional
3 childList list< i32 > required
4 headChildIndex i32 The index of the head child of this constituent. I.e., the head child of constituent c is c.childList[c.headChildIndex] . A value of -1 indicates that no head child was identified. optional -1
5 start i32 The first token (inclusive) of this constituent in the parent Tokenization. Almost certainly should be populated. optional
6 ending i32 The last token (exclusive) of this constituent in the parent Tokenization. Almost certainly should be populated. optional

A single parse constituent (or "phrase").
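
The headChildIndex and start/ending conventions can be illustrated with a minimal sketch. The dataclass below is a plain-Python stand-in that only mirrors the documented field names; it is not the generated Thrift class. It also assumes that childList holds the ids of child constituents, which the table above does not state. The helper names head_child and covered_tokens are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Constituent:
    """Plain-Python stand-in mirroring the documented Constituent fields."""
    id: int
    childList: List[int] = field(default_factory=list)  # assumed: child constituent ids
    tag: Optional[str] = None
    headChildIndex: int = -1   # -1 means no head child was identified
    start: Optional[int] = None   # first token, inclusive
    ending: Optional[int] = None  # last token, exclusive


def head_child(c: Constituent) -> Optional[int]:
    """Return the childList entry for the head child, or None if unset."""
    if c.headChildIndex == -1:
        return None
    return c.childList[c.headChildIndex]


def covered_tokens(c: Constituent) -> List[int]:
    """Token indices covered by the constituent: start inclusive, ending exclusive."""
    return list(range(c.start, c.ending))


# "the dog": an NP over tokens 0 and 1 whose second child is the head.
np_node = Constituent(id=1, tag="NP", childList=[2, 3],
                      headChildIndex=1, start=0, ending=2)
```

Here head_child(np_node) yields the second entry of childList, and covered_tokens(np_node) covers tokens 0 and 1 but not 2, because ending is exclusive.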

Struct: Parse

Key Field Type Description Requiredness Default value
1 uuid uuid.UUID required
2 metadata metadata.AnnotationMetadata required
3 constituentList list< Constituent > required

A theory about the syntactic parse of a sentence.

Note: If we add support for parse forests in the future, then it
will most likely be done by adding a new field (e.g.
"forest_root") that uses a new struct type to encode the
forest. A "kind" field might also be added (analogous to
Tokenization.kind) to indicate whether a parse is encoded
using a simple tree or a parse forest.

Struct: ConstituentRef

Key Field Type Description Requiredness Default value
1 parseId uuid.UUID The UUID of the Parse that this Constituent belongs to. required
2 constituentIndex i32 The index in the constituent list of this Constituent. required

A reference to a Constituent within a Parse.

Struct: Dependency

Key Field Type Description Requiredness Default value
1 gov i32 The governor or the head token. 0 indexed. optional -1
2 dep i32 The dependent token. 0 indexed. required
3 edgeType string The relation that holds between gov and dep. optional

A syntactic edge between two tokens in a tokenized sentence.

Struct: DependencyParseStructure

Key Field Type Description Requiredness Default value
1 isAcyclic bool True iff there are no cycles in the dependency graph. required
2 isConnected bool True iff the dependency graph forms a single connected component. required
3 isSingleHeaded bool True iff every node in the dependency parse has at most one head/parent/governor. required
4 isProjective bool True iff there are no crossing edges in the dependency parse. required

Information about the structure of a dependency parse.
This information is computable from the list of dependencies,
but this allows the consumer to make (verified) assumptions
about the dependencies being processed.
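
Since these four flags are computable from the list of dependencies, one possible way to derive them is sketched below. This is an illustration under stated assumptions, not the reference implementation: edges are plain (gov, dep) tuples rather than Dependency objects, edges with gov < 0 are treated as root markers and skipped, and the caller supplies the token count. The name structure_info is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (gov, dep), both 0-indexed token positions


def structure_info(edges: List[Edge], n_tokens: int) -> Dict[str, bool]:
    """Derive the four structural flags from a list of (gov, dep) edges."""
    # Assumption: gov < 0 marks a root edge with no governor token; skip it.
    real = [(g, d) for g, d in edges if g >= 0]

    # Single-headed: no token appears as a dependent more than once.
    head_count = defaultdict(int)
    for _, d in real:
        head_count[d] += 1
    single_headed = all(n <= 1 for n in head_count.values())

    # Acyclic: Kahn's algorithm removes every node iff there is no cycle.
    children = defaultdict(list)
    indegree = [0] * n_tokens
    for g, d in real:
        children[g].append(d)
        indegree[d] += 1
    queue = [t for t in range(n_tokens) if indegree[t] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    acyclic = removed == n_tokens

    # Connected: union-find over edges treated as undirected.
    parent = list(range(n_tokens))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g, d in real:
        parent[find(g)] = find(d)
    connected = len({find(t) for t in range(n_tokens)}) <= 1

    # Projective: no two edges cross when drawn above the sentence.
    spans = [(min(g, d), max(g, d)) for g, d in real]
    projective = not any(a < c < b < d for a, b in spans for c, d in spans)

    return {"isAcyclic": acyclic, "isConnected": connected,
            "isSingleHeaded": single_headed, "isProjective": projective}


# "the dog barks": barks is the root, dog depends on barks, the on dog.
tree = structure_info([(-1, 2), (2, 1), (1, 0)], n_tokens=3)
crossing = structure_info([(0, 2), (1, 3)], n_tokens=4)
```

For the three-token tree all four flags come out true. The second example has two crossing, disconnected edges, so isProjective and isConnected are false.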

Struct: DependencyParse

Key Field Type Description Requiredness Default value
1 uuid uuid.UUID required
2 metadata metadata.AnnotationMetadata required
3 dependencyList list< Dependency > required
4 structureInformation DependencyParseStructure optional

Represents a dependency parse with typed edges.

Struct: Token

Key Field Type Description Requiredness Default value
1 tokenIndex i32 A 0-based tokenization-relative identifier for this token that represents the order that this token appears in the sentence. Together with the UUID for a Tokenization, this can be used to define pointers to specific tokens. If a Tokenization object contains multiple Token objects with the same id (e.g., in different n-best lists), then all of their other fields *must* be identical as well. required
2 text string The text associated with this token. Note - we may have a destructive tokenizer (e.g., Stanford rewriting) and as a result, we want to maintain this field. optional
3 textSpan spans.TextSpan Location of this token in this perspective's text (.text field). In cases where this token does not correspond directly with any text span in the text (such as word insertion during MT), this field may be given a value indicating "approximately" where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the document, but is the annotation's best effort at such a representation. optional
4 rawTextSpan spans.TextSpan Location of this token in the original, raw text (.originalText field). In cases where this token does not correspond directly with any text span in the original text (such as word insertion during MT), this field may be given a value indicating "approximately" where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation's best effort at such a representation. optional
5 audioSpan spans.AudioSpan Location of this token in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional

A single token (typically a word) in a communication. The exact
definition of what counts as a token is left up to the tools that
generate token sequences.

Usually, each token will include at least a text string.

Struct: TokenRefSequence

Key Field Type Description Requiredness Default value
1 tokenIndexList list< i32 > The tokenization-relative identifiers for each token that is included in this sequence. required
2 anchorTokenIndex i32 An optional field that can be used to describe the root of a sentence (if this sequence is a full sentence), the head of a constituent (if this sequence is a constituent), or some other form of "canonical" token in this sequence if, for instance, it is not easy to map this sequence to another annotation that has a head. This field is defined with respect to the Tokenization given by tokenizationId, and not to this object's tokenIndexList. optional -1
3 tokenizationId uuid.UUID The UUID of the tokenization that contains the tokens. required
4 textSpan spans.TextSpan The text span in the main text (.text field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional
5 rawTextSpan spans.TextSpan The text span in the original text (.originalText field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation's best effort at such a representation. optional
6 audioSpan spans.AudioSpan The audio span associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional
7 dependencies list< Dependency > Use this field to reference a dependency tree fragment such as a shortest path or all the dependents in a constituent. optional
8 constituent ConstituentRef Use this field to specify an entire constituent in a parse tree. Prefer textSpan over this field unless a node in a tree is needed. optional

A list of pointers to tokens that all belong to the same
tokenization.
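
The point that anchorTokenIndex is defined with respect to the Tokenization, and not to tokenIndexList, is easy to get wrong. A small sketch with plain Python lists standing in for the Thrift objects (the sentence and variable names are invented for illustration):

```python
# Token texts of a hypothetical Tokenization, indexed by tokenIndex.
tokens = ["The", "quick", "brown", "fox", "jumps"]

# A TokenRefSequence covering "quick brown fox", anchored on "fox".
token_index_list = [1, 2, 3]
anchor_token_index = 3  # an index into the Tokenization, not into token_index_list

# Correct: resolve the anchor against the Tokenization's tokens.
anchor_text = tokens[anchor_token_index]

# Resolve the sequence itself the same way, one tokenIndex at a time.
sequence_text = [tokens[i] for i in token_index_list]

# Incorrect reading: token_index_list[anchor_token_index] would treat the
# anchor as a position within the sequence; here that is out of range.
anchor_is_valid_position = anchor_token_index < len(token_index_list)
```

In this example the anchor resolves to "fox" when looked up in the Tokenization, whereas using it as a position within tokenIndexList would fail.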

Struct: TaggedToken

Key Field Type Description Requiredness Default value
1 tokenIndex i32 A pointer to the token being tagged. Token indices are 0-based. optional
2 tag string A string containing the annotation. If the tag set you are using is not case sensitive, then all part of speech tags should be normalized to upper case. optional
3 confidence double Confidence of the annotation. optional
4 tagList list< string > A list of strings that represent a distribution of possible tags for this token. If populated, the 'tag' field should also be populated with the "best" value from this list. optional
5 confidenceList list< double > A list of doubles that represent confidences associated with the tags in the 'tagList' field. If populated, the 'confidence' field should also be populated with the confidence associated with the "best" tag in 'tagList'. optional
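
As a rough sketch of how 'tag' and 'confidence' might be derived from 'tagList' and 'confidenceList': the helper below assumes the two lists are parallel (same length, same order) and that "best" means highest confidence. Neither assumption is spelled out in the table above, and the function name best_tag is hypothetical.

```python
from typing import List, Tuple


def best_tag(tag_list: List[str], confidence_list: List[float]) -> Tuple[str, float]:
    """Pick the tag with the highest confidence from two parallel lists."""
    if not tag_list or len(tag_list) != len(confidence_list):
        raise ValueError("tagList and confidenceList must be non-empty and equal in length")
    best = max(range(len(tag_list)), key=lambda i: confidence_list[i])
    return tag_list[best], confidence_list[best]


# Values that would populate the 'tag' and 'confidence' fields.
tag, confidence = best_tag(["NN", "VB", "JJ"], [0.2, 0.7, 0.1])
```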

Struct: TokenTagging

Key Field Type Description Requiredness Default value
1 uuid uuid.UUID The UUID of this TokenTagging object. required
2 metadata metadata.AnnotationMetadata Information about where the annotation came from. This should be used to tell between gold-standard annotations and automatically generated theories about the data. required
3 taggedTokenList list< TaggedToken > The mapping from tokens to annotations. This may be a partial mapping. required
4 taggingType string An ontology-backed string that represents the type of token taggings this TokenTagging object produces. optional

A theory about some token-level annotation.
The TokenTagging consists of a mapping from tokens
(using token ids) to string tags (e.g. part-of-speech tags or lemmas).

The mapping defined by a TokenTagging may be partial --
i.e., some tokens may not be assigned any part of speech tags.

For lattice tokenizations, you may need to create multiple
part-of-speech taggings (for different paths through the lattice),
since the appropriate tag for a given token may depend on the path
taken. For example, you might define a separate
TokenTagging for each of the top K paths, which leaves all
tokens that are not part of the path unlabeled.

Currently, we use strings to encode annotations. In
the future, we may add fields for encoding specific tag sets
(e.g., treebank tags), or for adding compound tags.
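
Because the mapping may be partial, consumers should not assume one tag per token. A minimal sketch, using a plain dict from tokenIndex to tag in place of the taggedTokenList (the helper name tags_for_tokens is hypothetical):

```python
from typing import Dict, List, Optional


def tags_for_tokens(tagged: Dict[int, str], n_tokens: int) -> List[Optional[str]]:
    """Align a partial tokenIndex -> tag mapping with every token position."""
    # Tokens without an entry get None instead of raising KeyError.
    return [tagged.get(i) for i in range(n_tokens)]


# Only tokens 0 and 2 of a three-token sentence were tagged.
pos_tags = {0: "DT", 2: "VBZ"}
aligned = tags_for_tokens(pos_tags, n_tokens=3)
```

The untagged token at index 1 shows up as None in the aligned list.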

Struct: LatticePath

Key Field Type Description Requiredness Default value
1 weight double optional
2 tokenList list< Token > required

Struct: Arc

Key Field Type Description Requiredness Default value
1 src i32 optional
2 dst i32 optional
3 token Token optional
4 weight double optional

Type for arcs. For epsilon edges, leave 'token' blank.

Struct: TokenLattice

Key Field Type Description Requiredness Default value
1 startState i32 optional 0
2 endState i32 optional 0
3 arcList list< Arc > required
4 cachedBestPath LatticePath optional

A lattice structure that assigns scores to a set of token
sequences.  The lattice is encoded as an FSA, where states are
identified by integers, and each arc is annotated with an
optional token and a weight.  (Arcs with no token are
"epsilon" arcs.)  The lattice has a single start state and a
single end state.  (You can use epsilon edges to simulate
multiple start states or multiple end states, if desired.)

The score of a path through the lattice is the sum of the weights
of the arcs that make up that path.  A path with a lower score
is considered "better" than a path with a higher score.

If possible, path scores should be negative log likelihoods
(with base e -- e.g. if P=1, then weight=0; and if P=0.5, then
weight=0.693).  Furthermore, if possible, the path scores should
be globally normalized (i.e., they should encode probabilities).
This will allow for them to be combined with other information
in a reasonable way when determining confidences for system
outputs.

TokenLattices should never contain any paths with cycles.  Every
arc in the lattice should be included in some path from the start
state to the end state.
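
The scoring rules above can be sketched in a few lines. The example below is illustrative only: arcs are plain (src, dst, token, weight) tuples rather than Arc objects, token is a string (None for an epsilon arc), and best_path is a hypothetical helper that relies on the lattice being acyclic, as required above.

```python
import math
from typing import List, Optional, Tuple

# (src state, dst state, token text or None for an epsilon arc, weight)
ArcTuple = Tuple[int, int, Optional[str], float]


def best_path(arcs: List[ArcTuple], start: int, end: int) -> Tuple[float, List[str]]:
    """Return (score, tokens) for the lowest-scoring path from start to end."""
    best = {start: (0.0, [])}
    # Repeated relaxation; an acyclic lattice settles within len(arcs) rounds.
    for _ in range(len(arcs)):
        changed = False
        for src, dst, token, weight in arcs:
            if src not in best:
                continue
            score, tokens = best[src]
            candidate = score + weight  # path score is the sum of arc weights
            if dst not in best or candidate < best[dst][0]:
                step = [token] if token is not None else []
                best[dst] = (candidate, tokens + step)
                changed = True
        if not changed:
            break
    return best[end]


# Weights are negative natural-log probabilities: P=0.5 gives about 0.693.
arcs = [
    (0, 1, "I", -math.log(0.4)),
    (1, 3, "scream", 0.0),
    (0, 2, "ice", -math.log(0.6)),
    (2, 3, "cream", 0.0),
]
score, tokens = best_path(arcs, start=0, end=3)
probability = math.exp(-score)  # recover the path probability from its score
```

The lower-scoring path wins, so the "ice cream" reading is selected, and exponentiating the negated score recovers its probability of 0.6.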

Struct: TokenList

Key Field Type Description Requiredness Default value
1 tokenList list< Token > required

A wrapper around a list of tokens.

Struct: SpanLink

Key Field Type Description Requiredness Default value
1 tokens TokenRefSequence The tokens that make up this SpanLink object. required
2 concreteTarget uuid.UUID optional
3 externalTarget string optional
4 linkType string required

A collection of tokens that represent a link to another resource.
This resource might be another Concrete object (e.g., another
Concrete Communication), represented with the 'concreteTarget'
field, or it could link to a resource outside of Concrete via the
'externalTarget' field.

Struct: Tokenization

Key Field Type Description Requiredness Default value
1 uuid uuid.UUID required
2 metadata metadata.AnnotationMetadata Information about where this tokenization came from. required
3 tokenList TokenList A wrapper around an ordered list of the tokens in this tokenization. This may also give easy access to the "reconstructed text" associated with this tokenization. This field should only have a value if kind==TOKEN_LIST. optional
4 lattice TokenLattice A lattice that compactly describes a set of token sequences that might make up this tokenization. This field should only have a value if kind==TOKEN_LATTICE. optional
5 kind TokenizationKind Enumerated value indicating whether this tokenization is implemented using a token list or a lattice. required
6 tokenTaggingList list< TokenTagging > optional
7 parseList list< Parse > optional
8 dependencyParseList list< DependencyParse > optional
9 spanLinkList list< SpanLink > optional

A theory (or set of alternative theories) about the sequence of
tokens that make up a sentence.

This message type is used to record the output not just of
tokenizers, but also of a wide variety of other tools, including
machine translation systems, text normalizers, part-of-speech
taggers, and stemmers.

Each Tokenization is encoded using either a TokenList
or a TokenLattice. (If you want to encode an n-best list, then
you should store it as n separate Tokenization objects.) The
"kind" field is used to indicate whether this Tokenization contains
a list of tokens or a TokenLattice.

The confidence value for each sequence is determined by combining
the confidence from the "metadata" field with confidence
information from individual token sequences as follows:

  • For n-best lists: metadata.confidence
  • For lattices: metadata.confidence * exp(-sum(arc.weight))

Note: in some cases (such as the output of a machine translation tool), the order of the tokens in a token sequence may not correspond with the order of their original text span offsets.
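
The confidence combination for the two kinds of Tokenization can be written as a short function. This is a sketch with hypothetical names: sequence_confidence takes the metadata confidence as a float and, for lattices, the weights of the arcs along one path.

```python
import math
from typing import Optional, Sequence


def sequence_confidence(metadata_confidence: float,
                        arc_weights: Optional[Sequence[float]] = None) -> float:
    """Combine metadata confidence with path weights as described above."""
    if arc_weights is None:
        # Token list: the metadata confidence is used as-is.
        return metadata_confidence
    # Lattice: scale by exp(-sum of arc weights) along the chosen path.
    return metadata_confidence * math.exp(-sum(arc_weights))


list_confidence = sequence_confidence(0.9)
# A path whose arcs carry probabilities 0.5 and 1.0 (weights -ln 0.5 and 0).
lattice_confidence = sequence_confidence(0.9, [-math.log(0.5), 0.0])
```

With weights stored as negative log likelihoods, the exponential term is the path probability, so the lattice case here works out to 0.9 times 0.5.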

Struct: Sentence

Key Field Type Description Requiredness Default value
1 uuid uuid.UUID required
2 tokenization Tokenization Theory about the tokens that make up this sentence. For text communications, these tokenizations will typically be generated by a tokenizer. For audio communications, these tokenizations will typically be generated by an automatic speech recognizer. The "Tokenization" message type is also used to store the output of machine translation systems and text normalization systems. optional
3 textSpan spans.TextSpan Location of this sentence in the communication text. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional
4 rawTextSpan spans.TextSpan Location of this sentence in the raw text. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional
5 audioSpan spans.AudioSpan Location of this sentence in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional

A single sentence or utterance in a communication.

Struct: Section

Key Field Type Description Requiredness Default value
1 uuid uuid.UUID The unique identifier for this section. required
2 sentenceList list< Sentence > The sentences of this "section." optional
3 textSpan spans.TextSpan Location of this section in the communication text. NOTE: This text span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional
4 rawTextSpan spans.TextSpan Location of this section in the raw text. NOTE: This text span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional
9 audioSpan spans.AudioSpan Location of this section in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. optional
5 kind string A short, sometimes corpus-specific term characterizing the nature of the section; may change in a future version of concrete. This often acts as a coarse-grained descriptor that is used for filtering. For example, Gigaword uses the section kind "passage" to distinguish content-bearing paragraphs in the body of an article from other paragraphs, such as the headline and dateline. required
6 label string The name of the section. For example, a title of a section on Wikipedia. optional
7 numberList list< i32 > Position within the communication with respect to other Sections: The section number, E.g., 3, or 3.1, or 3.1.2, etc. Aimed at Communications with content organized in a hierarchy, such as a Book with multiple chapters, then sections, then paragraphs. Or even a dense Wikipedia page with subsections. Sections should still be arranged linearly, where reading these numbers should not be required to get a start-to-finish enumeration of the Communication's content. optional
8 lidList list< language.LanguageIdentification > An optional field to be used for multi-language documents. This field should be populated when a section is inside of a document that contains multiple languages. Minimally, each block of text in one language should be its own section. For example, if a paragraph is in English and the paragraph afterwards is in French, these should be separated into two different sections, allowing language-specific analytics to run on appropriate sections. optional

A single "section" of a communication, such as a paragraph. Each
section is defined using a text or audio span, and can optionally
contain a list of sentences.