Thrift module: structure

Module	Services	Data types	Constants
structure		Arc Constituent ConstituentRef Dependency DependencyParse DependencyParseStructure LatticePath Parse Section Sentence SpanLink TaggedToken Token TokenLattice TokenList TokenRefSequence TokenTagging Tokenization TokenizationKind

Struct: Constituent

Key	Field	Type	Description	Requiredness	Default value
1	id	`i32`	A parse-relative identifier for this constituent. Together with the UUID for a Parse, this can be used to define pointers to specific constituents.	required
2	tag	`string`	A description of this constituency node, e.g. the category "NP". For leaf nodes, this should be a word and for pre-terminal nodes this should be a POS tag.	optional
3	childList	`list< i32 >`		required
4	headChildIndex	`i32`	The index of the head child of this constituent. I.e., the head child of constituent `c` is `c.childList[c.headChildIndex]`. A value of -1 indicates that no head child was identified.	optional	`-1`
5	start	`i32`	The first token (inclusive) of this constituent in the parent Tokenization. Almost certainly should be populated.	optional
6	ending	`i32`	The last token (exclusive) of this constituent in the parent Tokenization. Almost certainly should be populated.	optional

A single parse constituent (or "phrase").
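
The relationship between `childList`, `headChildIndex`, and the token span can be sketched as follows. This is a minimal illustration, not the generated Thrift class: the dataclass is a stand-in, the helper `head_child` is hypothetical, and it assumes that `childList` holds the ids of other constituents in the same Parse.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Constituent:
    """Stand-in for the Thrift struct; field names follow the table above."""
    id: int
    childList: List[int] = field(default_factory=list)
    tag: Optional[str] = None
    headChildIndex: int = -1
    start: Optional[int] = None
    ending: Optional[int] = None


def head_child(c: Constituent, by_id: Dict[int, Constituent]) -> Optional[Constituent]:
    """Return the head child of `c`, or None if no head child was identified.

    Assumption: childList holds ids of constituents in the same Parse.
    """
    if c.headChildIndex == -1:
        return None
    return by_id[c.childList[c.headChildIndex]]


# "the dog": an NP over tokens [0, 2) whose head is its second child.
np_node = Constituent(id=0, tag="NP", childList=[1, 2], headChildIndex=1, start=0, ending=2)
det = Constituent(id=1, tag="DT", childList=[3], start=0, ending=1)
noun = Constituent(id=2, tag="NN", childList=[4], headChildIndex=0, start=1, ending=2)
the = Constituent(id=3, tag="the", start=0, ending=1)
dog = Constituent(id=4, tag="dog", start=1, ending=2)
by_id = {c.id: c for c in (np_node, det, noun, the, dog)}

print(head_child(np_node, by_id).tag)   # NN
print(np_node.ending - np_node.start)   # 2, because `ending` is exclusive
```

Because `ending` is exclusive, the number of tokens covered by a constituent is simply `ending - start`.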

Struct: Parse

Key	Field	Type	Requiredness
1	uuid	`uuid.UUID`	required
2	metadata	`metadata.AnnotationMetadata`	required
3	constituentList	`list< Constituent >`	required

A theory about the syntactic parse of a sentence.

Note: If we add support for parse forests in the future, then it
will most likely be done by adding a new field (e.g.
"forest_root") that uses a new struct type to encode the
forest. A "kind" field might also be added (analogous to
Tokenization.kind) to indicate whether a parse is encoded
using a simple tree or a parse forest.

Struct: ConstituentRef

Key	Field	Type	Description	Requiredness	Default value
1	parseId	`uuid.UUID`	The UUID of the Parse that this Constituent belongs to.	required
2	constituentIndex	`i32`	The index in the constituent list of this Constituent.	required

A reference to a Constituent within a Parse.

Struct: Dependency

Key	Field	Type	Description	Requiredness	Default value
1	gov	`i32`	The governor or the head token. 0 indexed.	optional	`-1`
2	dep	`i32`	The dependent token. 0 indexed.	required
3	edgeType	`string`	The relation that holds between gov and dep.	optional

A syntactic edge between two tokens in a tokenized sentence.

Struct: DependencyParseStructure

Key	Field	Type	Description	Requiredness
1	isAcyclic	`bool`	True iff there are no cycles in the dependency graph.	required
2	isConnected	`bool`	True iff the dependency graph forms a single connected component.	required
3	isSingleHeaded	`bool`	True iff every node in the dependency parse has at most one head/parent/governor.	required
4	isProjective	`bool`	True iff there are no crossing edges in the dependency parse.	required

Information about the structure of a dependency parse.
This information is computable from the list of dependencies,
but this allows the consumer to make (verified) assumptions
about the dependencies being processed.
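
As a rough sketch of how these four flags could be derived from a dependency list, the following works on plain `(gov, dep)` pairs rather than the generated Thrift classes. The function name `analyze` and the `n_tokens` parameter are hypothetical, and the sketch assumes that an edge with `gov == -1` marks a root attachment and is ignored by all four checks.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (gov, dep); gov == -1 is assumed to mark a root attachment


def analyze(edges: List[Edge], n_tokens: int) -> Dict[str, bool]:
    real = [(g, d) for g, d in edges if g != -1]

    # Single-headed: no token is the dependent of more than one governor.
    head_count = defaultdict(int)
    for _, d in real:
        head_count[d] += 1
    single_headed = all(count <= 1 for count in head_count.values())

    # Acyclic: Kahn's algorithm removes every node only if no directed cycle exists.
    children = defaultdict(list)
    indegree = [0] * n_tokens
    for g, d in real:
        children[g].append(d)
        indegree[d] += 1
    queue = [t for t in range(n_tokens) if indegree[t] == 0]
    removed = 0
    while queue:
        node = queue.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    acyclic = removed == n_tokens

    # Connected: union-find over the edges, treated as undirected.
    parent = list(range(n_tokens))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g, d in real:
        parent[find(g)] = find(d)
    connected = len({find(t) for t in range(n_tokens)}) <= 1

    # Projective: no two edges cross when drawn above the sentence.
    spans = [(min(g, d), max(g, d)) for g, d in real]
    projective = not any(
        a < c < b < d or c < a < d < b
        for (a, b), (c, d) in combinations(spans, 2)
    )

    return {
        "isAcyclic": acyclic,
        "isConnected": connected,
        "isSingleHeaded": single_headed,
        "isProjective": projective,
    }


# "She saw the dog": saw is the root, She and dog depend on saw, the depends on dog.
print(analyze([(-1, 1), (1, 0), (1, 3), (3, 2)], n_tokens=4))
```

A producer could run a check like this once and store the result in `structureInformation`, so that consumers do not have to recompute it.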

Struct: DependencyParse

Key	Field	Type	Requiredness
1	uuid	`uuid.UUID`	required
2	metadata	`metadata.AnnotationMetadata`	required
3	dependencyList	`list< Dependency >`	required
4	structureInformation	`DependencyParseStructure`	optional

Represents a dependency parse with typed edges.

Struct: Token

Key	Field	Type	Description	Requiredness
1	tokenIndex	`i32`	A 0-based tokenization-relative identifier for this token that represents the order that this token appears in the sentence. Together with the UUID for a Tokenization, this can be used to define pointers to specific tokens. If a Tokenization object contains multiple Token objects with the same id (e.g., in different n-best lists), then all of their other fields must be identical as well.	required
2	text	`string`	The text associated with this token. Note - we may have a destructive tokenizer (e.g., Stanford rewriting) and as a result, we want to maintain this field.	optional
3	textSpan	`spans.TextSpan`	Location of this token in this perspective's text (.text field). In cases where this token does not correspond directly with any text span in the text (such as word insertion during MT), this field may be given a value indicating "approximately" where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the document, but is the annotation's best effort at such a representation.	optional
4	rawTextSpan	`spans.TextSpan`	Location of this token in the original, raw text (.originalText field). In cases where this token does not correspond directly with any text span in the original text (such as word insertion during MT), this field may be given a value indicating "approximately" where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation's best effort at such a representation.	optional
5	audioSpan	`spans.AudioSpan`	Location of this token in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional

A single token (typically a word) in a communication. The exact
definition of what counts as a token is left up to the tools that
generate token sequences.

Usually, each token will include at least a text string.

Struct: TokenRefSequence

Key	Field	Type	Description	Requiredness	Default value
1	tokenIndexList	`list< i32 >`	The tokenization-relative identifiers for each token that is included in this sequence.	required
2	anchorTokenIndex	`i32`	An optional field that can be used to describe the root of a sentence (if this sequence is a full sentence), the head of a constituent (if this sequence is a constituent), or some other form of "canonical" token in this sequence if, for instance, it is not easy to map this sequence to another annotation that has a head. This field is defined with respect to the Tokenization given by tokenizationId, and not to this object's tokenIndexList.	optional	`-1`
3	tokenizationId	`uuid.UUID`	The UUID of the tokenization that contains the tokens.	required
4	textSpan	`spans.TextSpan`	The text span in the main text (.text field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional
5	rawTextSpan	`spans.TextSpan`	The text span in the original text (.originalText field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation's best effort at such a representation.	optional
6	audioSpan	`spans.AudioSpan`	The audio span associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional
7	dependencies	`list< Dependency >`	Use this field to reference a dependency tree fragment such as a shortest path or all the dependents in a constituent.	optional
8	constituent	`ConstituentRef`	Use this field to specify an entire constituent in a parse tree. Prefer textSpan over this field unless a node in a tree is needed.	optional

A list of pointers to tokens that all belong to the same
tokenization.
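
The point that `anchorTokenIndex` indexes into the Tokenization, not into `tokenIndexList`, is easy to get wrong. The sketch below illustrates it with plain lists; the helper `resolve` is hypothetical and the token texts stand in for the Token objects of the referenced Tokenization.

```python
from typing import List, Optional, Tuple


def resolve(
    token_texts: List[str],
    token_index_list: List[int],
    anchor_token_index: int = -1,
) -> Tuple[List[str], Optional[str]]:
    """Look up the referenced tokens and the anchor in the full tokenization."""
    selected = [token_texts[i] for i in token_index_list]
    # The anchor is an index into the tokenization, NOT into token_index_list.
    anchor = token_texts[anchor_token_index] if anchor_token_index != -1 else None
    return selected, anchor


tokens = ["The", "quick", "brown", "fox", "jumps"]
selected, anchor = resolve(tokens, token_index_list=[1, 2, 3], anchor_token_index=3)
print(selected)  # ['quick', 'brown', 'fox']
print(anchor)    # fox
```

Note that `token_index_list[3]` would be out of range here, which is why the anchor has to be read against the tokenization itself.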

Struct: TaggedToken

Key	Field	Type	Description	Requiredness
1	tokenIndex	`i32`	A pointer to the token being tagged. Token indices are 0-based.	optional
2	tag	`string`	A string containing the annotation. If the tag set you are using is not case sensitive, then all part of speech tags should be normalized to upper case.	optional
3	confidence	`double`	Confidence of the annotation.	optional
4	tagList	`list< string >`	A list of strings that represent a distribution of possible tags for this token. If populated, the 'tag' field should also be populated with the "best" value from this list.	optional
5	confidenceList	`list< double >`	A list of doubles that represent confidences associated with the tags in the 'tagList' field. If populated, the 'confidence' field should also be populated with the confidence associated with the "best" tag in 'tagList'.	optional
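
One way to keep `tag` and `confidence` consistent with `tagList` and `confidenceList` is sketched below. The helper `best_tag` is hypothetical, and it assumes that the "best" tag is the one with the highest confidence and that the two lists are parallel.

```python
from typing import List, Tuple


def best_tag(tag_list: List[str], confidence_list: List[float]) -> Tuple[str, float]:
    """Pick the (tag, confidence) pair with the highest confidence."""
    if not tag_list or len(tag_list) != len(confidence_list):
        raise ValueError("tagList and confidenceList must be non-empty and the same length")
    best = max(range(len(tag_list)), key=lambda i: confidence_list[i])
    return tag_list[best], confidence_list[best]


tag, confidence = best_tag(["VB", "NN", "JJ"], [0.2, 0.7, 0.1])
print(tag, confidence)  # NN 0.7
```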

Struct: TokenTagging

Key	Field	Type	Description	Requiredness
1	uuid	`uuid.UUID`	The UUID of this TokenTagging object.	required
2	metadata	`metadata.AnnotationMetadata`	Information about where the annotation came from. This should be used to distinguish between gold-standard annotations and automatically generated theories about the data.	required
3	taggedTokenList	`list< TaggedToken >`	The mapping from tokens to annotations. This may be a partial mapping.	required
4	taggingType	`string`	An ontology-backed string that represents the type of token taggings this TokenTagging object produces.	optional

A theory about some token-level annotation.
The TokenTagging consists of a mapping from tokens
(using token ids) to string tags (e.g. part-of-speech tags or lemmas).

The mapping defined by a TokenTagging may be partial --
i.e., some tokens may not be assigned any part of speech tags.

For lattice tokenizations, you may need to create multiple
part-of-speech taggings (for different paths through the lattice),
since the appropriate tag for a given token may depend on the path
taken. For example, you might define a separate
TokenTagging for each of the top K paths, which leaves all
tokens that are not part of the path unlabeled.

Currently, we use strings to encode annotations. In
the future, we may add fields for encoding specific tag sets
(e.g., treebank tags), or for adding compound tags.

Struct: LatticePath

Key	Field	Type	Description	Requiredness	Default value
1	weight	`double`		optional
2	tokenList	`list< Token >`		required

Struct: Arc

Key	Field	Type	Requiredness
1	src	`i32`	optional
2	dst	`i32`	optional
3	token	`Token`	optional
4	weight	`double`	optional

Type for arcs. For epsilon edges, leave 'token' blank.

Struct: TokenLattice

Key	Field	Type	Requiredness	Default value
1	startState	`i32`	optional	`0`
2	endState	`i32`	optional	`0`
3	arcList	`list< Arc >`	required
4	cachedBestPath	`LatticePath`	optional

A lattice structure that assigns scores to a set of token
sequences.  The lattice is encoded as an FSA, where states are
identified by integers, and each arc is annotated with an
optional token and a weight.  (Arcs with no token are
"epsilon" arcs.)  The lattice has a single start state and a
single end state.  (You can use epsilon edges to simulate
multiple start states or multiple end states, if desired.)

The score of a path through the lattice is the sum of the weights
of the arcs that make up that path.  A path with a lower score
is considered "better" than a path with a higher score.

If possible, path scores should be negative log likelihoods
(with base e -- e.g. if P=1, then weight=0; and if P=0.5, then
weight=0.693).  Furthermore, if possible, the path scores should
be globally normalized (i.e., they should encode probabilities).
This will allow for them to be combined with other information
in a reasonable way when determining confidences for system
outputs.

TokenLattices should never contain any paths with cycles.  Every
arc in the lattice should be included in some path from the start
state to the end state.
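
The scoring rules above can be illustrated with a small sketch that finds the lowest-scoring path through an acyclic lattice. Arcs are represented as plain `(src, dst, token, weight)` tuples rather than the generated `Arc` class, the function `best_path` is hypothetical, and the example weights are made-up negative log probabilities. The recursion relies on the lattice being acyclic, as required above.

```python
import math
from collections import defaultdict
from functools import lru_cache
from typing import List, Optional, Tuple

ArcTuple = Tuple[int, int, Optional[str], float]  # (src, dst, token or None, weight)


def best_path(arcs: List[ArcTuple], start: int, end: int) -> Tuple[float, List[str]]:
    """Return (score, tokens) for the lowest-scoring path from start to end."""
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc[0]].append(arc)

    @lru_cache(maxsize=None)
    def solve(state: int) -> Tuple[float, Tuple[str, ...]]:
        if state == end:
            return 0.0, ()
        best: Tuple[float, Tuple[str, ...]] = (math.inf, ())
        for _, dst, token, weight in outgoing[state]:
            score, tokens = solve(dst)
            candidate = weight + score  # path score is the sum of arc weights
            if candidate < best[0]:     # lower is better
                # Epsilon arcs (token is None) contribute weight but no token.
                prefix = (token,) if token is not None else ()
                best = (candidate, prefix + tokens)
        return best

    score, tokens = solve(start)
    return score, list(tokens)


# Two competing readings, followed by an epsilon arc into the end state.
arcs = [
    (0, 1, "ice", -math.log(0.6)),
    (1, 3, "cream", 0.0),
    (0, 2, "I", -math.log(0.4)),
    (2, 3, "scream", 0.0),
    (3, 4, None, 0.0),
]
score, tokens = best_path(arcs, start=0, end=4)
print(tokens, round(score, 3))  # ['ice', 'cream'] 0.511
```

With weights stored as negative natural-log probabilities, `math.exp(-score)` recovers the path probability (0.6 in this example).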

Struct: TokenList

Key	Field	Type	Description	Requiredness	Default value
1	tokenList	`list< Token >`		required

A wrapper around a list of tokens.

Struct: SpanLink

Key	Field	Type	Description	Requiredness
1	tokens	`TokenRefSequence`	The tokens that make up this SpanLink object.	required
2	concreteTarget	`uuid.UUID`		optional
3	externalTarget	`string`		optional
4	linkType	`string`		required

A collection of tokens that represent a link to another resource.
This resource might be another Concrete object (e.g., another
Concrete Communication), represented with the 'concreteTarget'
field, or it could link to a resource outside of Concrete via the
'externalTarget' field.

Struct: Tokenization

Key	Field	Type	Description	Requiredness
1	uuid	`uuid.UUID`		required
2	metadata	`metadata.AnnotationMetadata`	Information about where this tokenization came from.	required
3	tokenList	`TokenList`	A wrapper around an ordered list of the tokens in this tokenization. This may also give easy access to the "reconstructed text" associated with this tokenization. This field should only have a value if kind==TOKEN_LIST.	optional
4	lattice	`TokenLattice`	A lattice that compactly describes a set of token sequences that might make up this tokenization. This field should only have a value if kind==TOKEN_LATTICE.	optional
5	kind	`TokenizationKind`	Enumerated value indicating whether this tokenization is implemented using a token list or a lattice.	required
6	tokenTaggingList	`list< TokenTagging >`		optional
7	parseList	`list< Parse >`		optional
8	dependencyParseList	`list< DependencyParse >`		optional
9	spanLinkList	`list< SpanLink >`		optional

A theory (or set of alternative theories) about the sequence of
tokens that make up a sentence.

This message type is used to record the output not just of
tokenizers, but also of a wide variety of other tools, including
machine translation systems, text normalizers, part-of-speech
taggers, and stemmers.

Each Tokenization is encoded using either a TokenList
or a TokenLattice. (If you want to encode an n-best list, then
you should store it as n separate Tokenization objects.) The
"kind" field is used to indicate whether this Tokenization contains
a list of tokens or a TokenLattice.

The confidence value for each sequence is determined by combining
the confidence from the "metadata" field with confidence
information from individual token sequences as follows:

For n-best lists: metadata.confidence
For lattices: metadata.confidence * exp(-sum(arc.weight))

Note: in some cases (such as the output of a machine translation
tool), the order of the tokens in a token sequence may not
correspond with the order of their original text span offsets.
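
The confidence combination described above can be written as a short function. This is a sketch only: `sequence_confidence` is a hypothetical helper, and it takes the metadata confidence and the arc weights of one lattice path as plain numbers rather than reading them from Concrete objects.

```python
import math
from typing import Optional, Sequence


def sequence_confidence(
    metadata_confidence: float,
    arc_weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine metadata confidence with the weights of one lattice path.

    Pass arc_weights=None for a token list, where only the metadata
    confidence applies.
    """
    if arc_weights is None:
        return metadata_confidence
    return metadata_confidence * math.exp(-sum(arc_weights))


# A path whose arcs encode probabilities 0.5 and 1.0 as negative log likelihoods.
path_weights = [-math.log(0.5), -math.log(1.0)]
print(round(sequence_confidence(0.9, path_weights), 6))  # 0.45
print(sequence_confidence(0.9))                          # 0.9
```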

Struct: Sentence

Key	Field	Type	Description	Requiredness
1	uuid	`uuid.UUID`		required
2	tokenization	`Tokenization`	Theory about the tokens that make up this sentence. For text communications, these tokenizations will typically be generated by a tokenizer. For audio communications, these tokenizations will typically be generated by an automatic speech recognizer. The "Tokenization" message type is also used to store the output of machine translation systems and text normalization systems.	optional
3	textSpan	`spans.TextSpan`	Location of this sentence in the communication text. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional
4	rawTextSpan	`spans.TextSpan`	Location of this sentence in the raw text. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional
5	audioSpan	`spans.AudioSpan`	Location of this sentence in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional

A single sentence or utterance in a communication.

Struct: Section

Key	Field	Type	Description	Requiredness
1	uuid	`uuid.UUID`	The unique identifier for this section.	required
2	sentenceList	`list< Sentence >`	The sentences of this "section."	optional
3	textSpan	`spans.TextSpan`	Location of this section in the communication text. NOTE: This text span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional
4	rawTextSpan	`spans.TextSpan`	Location of this section in the raw text. NOTE: This text span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional
9	audioSpan	`spans.AudioSpan`	Location of this section in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation.	optional
5	kind	`string`	A short, sometimes corpus-specific term characterizing the nature of the section; may change in a future version of concrete. This often acts as a coarse-grained descriptor that is used for filtering. For example, Gigaword uses the section kind "passage" to distinguish content-bearing paragraphs in the body of an article from other paragraphs, such as the headline and dateline.	required
6	label	`string`	The name of the section. For example, a title of a section on Wikipedia.	optional
7	numberList	`list< i32 >`	Position within the communication with respect to other Sections: the section number, e.g., 3, or 3.1, or 3.1.2, etc. Aimed at Communications with content organized in a hierarchy, such as a Book with multiple chapters, then sections, then paragraphs. Or even a dense Wikipedia page with subsections. Sections should still be arranged linearly, where reading these numbers should not be required to get a start-to-finish enumeration of the Communication's content.	optional
8	lidList	`list< language.LanguageIdentification >`	An optional field to be used for multi-language documents. This field should be populated when a section is inside of a document that contains multiple languages. Minimally, each block of text in one language should be its own section. For example, if a paragraph is in English and the paragraph afterwards is in French, these should be separated into two different sections, allowing language-specific analytics to run on appropriate sections.	optional

A single "section" of a communication, such as a paragraph. Each
section is defined using a text or audio span, and can optionally
contain a list of sentences.

Enumeration: TokenizationKind

`TOKEN_LIST`	`1`
`TOKEN_LATTICE`	`2`