Data structures
Struct: Constituent
Key | Field | Type | Description | Requiredness | Default value |
1 | id | i32 | A parse-relative identifier for this constituent. Together with the UUID for a Parse, this can be used to define pointers to specific constituents. | required | |
2 | tag | string | A description of this constituency node, e.g. the category "NP". For leaf nodes, this should be a word and for pre-terminal nodes this should be a POS tag. | optional | |
3 | childList | list<i32> | | required | |
4 | headChildIndex | i32 | The index of the head child of this constituent. I.e., the head child of constituent c is c.children[c.head_child_index]. A value of -1 indicates that no child head was identified. | optional | -1 |
5 | start | i32 | The first token (inclusive) of this constituent in the parent Tokenization. Almost certainly should be populated. | optional | |
6 | ending | i32 | The last token (exclusive) of this constituent in the parent Tokenization. Almost certainly should be populated. | optional | |
A single parse constituent (or "phrase").
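As a rough illustration of how headChildIndex and the start/ending span fit together, here is a minimal Python sketch. The Constituent dataclass is a hypothetical stand-in that mirrors the field names in the table, not the Thrift-generated class, and it assumes that childList holds the ids of other constituents in the same parse (the table gives no description for that field).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Constituent:
    """Hypothetical stand-in for the Thrift struct; not the generated class."""
    id: int
    tag: Optional[str] = None
    childList: List[int] = field(default_factory=list)
    headChildIndex: int = -1
    start: Optional[int] = None
    ending: Optional[int] = None


def head_child(by_id: Dict[int, Constituent], c: Constituent) -> Optional[Constituent]:
    """Follow c.childList[c.headChildIndex]; -1 means no head was identified."""
    if c.headChildIndex == -1:
        return None
    # Assumption: childList entries are ids of constituents in the same parse.
    return by_id[c.childList[c.headChildIndex]]


# Parse of "the dog": NP -> DT NN, with one leaf node per word.
nodes = [
    Constituent(0, "NP", [1, 3], headChildIndex=1, start=0, ending=2),
    Constituent(1, "DT", [2], headChildIndex=0, start=0, ending=1),
    Constituent(2, "the", [], start=0, ending=1),
    Constituent(3, "NN", [4], headChildIndex=0, start=1, ending=2),
    Constituent(4, "dog", [], start=1, ending=2),
]
by_id = {c.id: c for c in nodes}
np_head = head_child(by_id, by_id[0])
```

Because start is inclusive and ending is exclusive, ending - start gives the number of tokens the constituent covers.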
Struct: Parse
A theory about the syntactic parse of a sentence.
Note: If we add support for parse forests in the future, then it
will most likely be done by adding a new field (e.g.
"forest_root") that uses a new struct type to encode the
forest. A "kind" field might also be added (analogous to
Tokenization.kind) to indicate whether a parse is encoded
using a simple tree or a parse forest.
Struct: ConstituentRef
Key | Field | Type | Description | Requiredness | Default value |
1 | parseId | uuid.UUID | The UUID of the Parse that this Constituent belongs to. | required | |
2 | constituentIndex | i32 | The index in the constituent list of this Constituent. | required | |
A reference to a Constituent within a Parse.
Struct: Dependency
Key | Field | Type | Description | Requiredness | Default value |
1 | gov | i32 | The governor or the head token. 0 indexed. | optional | -1 |
2 | dep | i32 | The dependent token. 0 indexed. | required | |
3 | edgeType | string | The relation that holds between gov and dep. | optional | |
A syntactic edge between two tokens in a tokenized sentence.
Struct: DependencyParseStructure
Key | Field | Type | Description | Requiredness | Default value |
1 | isAcyclic | bool | True iff there are no cycles in the dependency graph. | required | |
2 | isConnected | bool | True iff the dependency graph forms a single connected component. | required | |
3 | isSingleHeaded | bool | True iff every node in the dependency parse has at most one head/parent/governor. | required | |
4 | isProjective | bool | True iff there are no crossing edges in the dependency parse. | required | |
Information about the structure of a dependency parse.
This information is computable from the list of dependencies,
but this allows the consumer to make (verified) assumptions
about the dependencies being processed.
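Since these four flags are computable from the list of dependencies, the following Python sketch shows one way a producer might derive them. The function name parse_structure and the (gov, dep) tuple encoding are hypothetical. It also assumes that edges with gov == -1 are root edges, which are ignored for the cycle, connectivity and projectivity checks but counted as a head for isSingleHeaded; the source does not specify how root edges should be treated.

```python
from itertools import combinations


def parse_structure(num_tokens, edges):
    """edges: list of (gov, dep) pairs; gov == -1 is treated as a root edge."""
    real = [(g, d) for g, d in edges if g >= 0]

    # Single-headed: no token appears as a dependent more than once.
    deps = [d for _, d in edges]
    is_single_headed = len(deps) == len(set(deps))

    # Acyclic: Kahn's algorithm removes every node iff there is no directed cycle.
    indegree = [0] * num_tokens
    children = [[] for _ in range(num_tokens)]
    for g, d in real:
        children[g].append(d)
        indegree[d] += 1
    ready = [t for t in range(num_tokens) if indegree[t] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    is_acyclic = removed == num_tokens

    # Connected: union-find over the undirected version of the graph.
    parent = list(range(num_tokens))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g, d in real:
        parent[find(g)] = find(d)
    is_connected = len({find(t) for t in range(num_tokens)}) <= 1

    # Projective: no two edges cross when drawn above the sentence.
    spans = [tuple(sorted(e)) for e in real]
    is_projective = not any(
        a < c < b < d or c < a < d < b
        for (a, b), (c, d) in combinations(spans, 2)
    )

    return {
        "isAcyclic": is_acyclic,
        "isConnected": is_connected,
        "isSingleHeaded": is_single_headed,
        "isProjective": is_projective,
    }


# "the dog barks": barks is the root, dog depends on barks, the depends on dog.
tree = parse_structure(3, [(-1, 2), (2, 1), (1, 0)])
crossing = parse_structure(4, [(0, 2), (1, 3)])
cyclic = parse_structure(2, [(0, 1), (1, 0)])
```

The pairwise projectivity check is quadratic in the number of edges, which is usually acceptable for sentence-sized inputs.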
Struct: DependencyParse
Represents a dependency parse with typed edges.
Struct: Token
Key | Field | Type | Description | Requiredness | Default value |
1 | tokenIndex | i32 | A 0-based tokenization-relative identifier for this token that represents the order that this token appears in the sentence. Together with the UUID for a Tokenization, this can be used to define pointers to specific tokens. If a Tokenization object contains multiple Token objects with the same id (e.g., in different n-best lists), then all of their other fields *must* be identical as well. | required | |
2 | text | string | The text associated with this token. Note - we may have a destructive tokenizer (e.g., Stanford rewriting) and as a result, we want to maintain this field. | optional | |
3 | textSpan | spans.TextSpan | Location of this token in this perspective's text (.text field). In cases where this token does not correspond directly with any text span in the text (such as word insertion during MT), this field may be given a value indicating "approximately" where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the document, but is the annotation's best effort at such a representation. | optional | |
4 | rawTextSpan | spans.TextSpan | Location of this token in the original, raw text (.originalText field). In cases where this token does not correspond directly with any text span in the original text (such as word insertion during MT), this field may be given a value indicating "approximately" where the token comes from. A span covering the entire sentence may be used if no more precise value seems appropriate. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation's best effort at such a representation. | optional | |
5 | audioSpan | spans.AudioSpan | Location of this token in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
A single token (typically a word) in a communication. The exact
definition of what counts as a token is left up to the tools that
generate token sequences.
Usually, each token will include at least a text string.
Struct: TokenRefSequence
Key | Field | Type | Description | Requiredness | Default value |
1 | tokenIndexList | list<i32> | The tokenization-relative identifiers for each token that is included in this sequence. | required | |
2 | anchorTokenIndex | i32 | An optional field that can be used to describe the root of a sentence (if this sequence is a full sentence), the head of a constituent (if this sequence is a constituent), or some other form of "canonical" token in this sequence if, for instance, it is not easy to map this sequence to another annotation that has a head. This field is defined with respect to the Tokenization given by tokenizationId, and not to this object's tokenIndexList. | optional | -1 |
3 | tokenizationId | uuid.UUID | The UUID of the tokenization that contains the tokens. | required | |
4 | textSpan | spans.TextSpan | The text span in the main text (.text field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
5 | rawTextSpan | spans.TextSpan | The text span in the original text (.originalText field) associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original raw document, but is the annotation's best effort at such a representation. | optional | |
6 | audioSpan | spans.AudioSpan | The audio span associated with this TokenRefSequence. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
7 | dependencies | list<Dependency> | Use this field to reference a dependency tree fragment such as a shortest path or all the dependents in a constituent. | optional | |
8 | constituent | ConstituentRef | Use this field to specify an entire constituent in a parse tree. Prefer textSpan over this field unless a node in a tree is needed. | optional | |
A list of pointers to tokens that all belong to the same
tokenization.
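One point that is easy to get wrong is that anchorTokenIndex indexes into the Tokenization, not into tokenIndexList. The short Python sketch below illustrates the difference using plain lists; the helper names sequence_text and anchor_text are hypothetical and are not part of any Concrete library.

```python
# Token texts of the parent Tokenization, in tokenIndex order.
tokens = ["The", "quick", "brown", "fox", "jumps"]

# A TokenRefSequence covering "quick brown fox", anchored on "fox".
token_index_list = [1, 2, 3]
anchor_token_index = 3  # position in the Tokenization, not in token_index_list


def sequence_text(tokens, token_index_list):
    """Resolve each tokenization-relative index to its token text."""
    return [tokens[i] for i in token_index_list]


def anchor_text(tokens, anchor_token_index):
    """Resolve the anchor against the Tokenization; -1 is the 'unset' default."""
    if anchor_token_index == -1:
        return None
    return tokens[anchor_token_index]


covered = sequence_text(tokens, token_index_list)
anchor = anchor_text(tokens, anchor_token_index)
```

Looking up token_index_list[anchor_token_index] here would raise an IndexError, which is a quick way to spot the mistaken interpretation.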
Struct: TaggedToken
Key | Field | Type | Description | Requiredness | Default value |
1 | tokenIndex | i32 | A pointer to the token being tagged. Token indices are 0-based, so these indices are 0-based as well. | optional | |
2 | tag | string | A string containing the annotation. If the tag set you are using is not case sensitive, then all part of speech tags should be normalized to upper case. | optional | |
3 | confidence | double | Confidence of the annotation. | optional | |
4 | tagList | list<string> | A list of strings that represent a distribution of possible tags for this token. If populated, the 'tag' field should also be populated with the "best" value from this list. | optional | |
5 | confidenceList | list<double> | A list of doubles that represent confidences associated with the tags in the 'tagList' field. If populated, the 'confidence' field should also be populated with the confidence associated with the "best" tag in 'tagList'. | optional | |
Struct: TokenTagging
Key | Field | Type | Description | Requiredness | Default value |
1 | uuid | uuid.UUID | The UUID of this TokenTagging object. | required | |
2 | metadata | metadata.AnnotationMetadata | Information about where the annotation came from. This should be used to distinguish between gold-standard annotations and automatically generated theories about the data. | required | |
3 | taggedTokenList | list<TaggedToken> | The mapping from tokens to annotations. This may be a partial mapping. | required | |
4 | taggingType | string | An ontology-backed string that represents the type of token taggings this TokenTagging object produces. | optional | |
A theory about some token-level annotation.
The TokenTagging consists of a mapping from tokens
(using token ids) to string tags (e.g. part-of-speech tags or lemmas).
The mapping defined by a TokenTagging may be partial --
i.e., some tokens may not be assigned any part of speech tags.
For lattice tokenizations, you may need to create multiple
part-of-speech taggings (for different paths through the lattice),
since the appropriate tag for a given token may depend on the path
taken. For example, you might define a separate
TokenTagging for each of the top K paths, which leaves all
tokens that are not part of the path unlabeled.
Currently, we use strings to encode annotations. In
the future, we may add fields for encoding specific tag sets
(e.g., treebank tags), or for adding compound tags.
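To make the partial mapping concrete, here is a small Python sketch that uses plain dictionaries in place of TaggedToken objects. The helper names are hypothetical, and the sketch assumes that the "best" entry of tagList is the one with the highest value in confidenceList, which the source does not state explicitly.

```python
# Two tagged tokens for a three-token sentence; token 1 is left untagged.
tagged_tokens = [
    {"tokenIndex": 0, "tag": "DT"},
    {
        "tokenIndex": 2,
        "tag": "VBZ",
        "confidence": 0.9,
        "tagList": ["NNS", "VBZ"],
        "confidenceList": [0.1, 0.9],
    },
]


def tags_by_token(tagged_tokens, num_tokens):
    """Expand a partial mapping into one slot per token (None = untagged)."""
    tags = [None] * num_tokens
    for tagged in tagged_tokens:
        tags[tagged["tokenIndex"]] = tagged["tag"]
    return tags


def best_tag(tag_list, confidence_list):
    """Assumption: the 'best' tag is the one with the highest confidence."""
    best = max(range(len(tag_list)), key=lambda i: confidence_list[i])
    return tag_list[best], confidence_list[best]


dense = tags_by_token(tagged_tokens, 3)
top = best_tag(tagged_tokens[1]["tagList"], tagged_tokens[1]["confidenceList"])
```

A producer could use a check like best_tag to confirm that 'tag' and 'confidence' agree with the two list fields before serializing.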
Struct: LatticePath
Key | Field | Type | Description | Requiredness | Default value |
1 | weight | double | | optional | |
2 | tokenList | list<Token> | | required | |
Struct: Arc
Key | Field | Type | Description | Requiredness | Default value |
1 | src | i32 | | optional | |
2 | dst | i32 | | optional | |
3 | token | Token | | optional | |
4 | weight | double | | optional | |
Type for arcs. For epsilon edges, leave 'token' blank.
Struct: TokenLattice
Key | Field | Type | Description | Requiredness | Default value |
1 | startState | i32 | | optional | 0 |
2 | endState | i32 | | optional | 0 |
3 | arcList | list<Arc> | | required | |
4 | cachedBestPath | LatticePath | | optional | |
A lattice structure that assigns scores to a set of token
sequences. The lattice is encoded as an FSA, where states are
identified by integers, and each arc is annotated with an
optional token and a weight. (Arcs with no token are
"epsilon" arcs.) The lattice has a single start state and a
single end state. (You can use epsilon edges to simulate
multiple start states or multiple end states, if desired.)
The score of a path through the lattice is the sum of the weights
of the arcs that make up that path. A path with a lower score
is considered "better" than a path with a higher score.
If possible, path scores should be negative log likelihoods
(with base e -- e.g. if P=1, then weight=0; and if P=0.5, then
weight=0.693). Furthermore, if possible, the path scores should
be globally normalized (i.e., they should encode probabilities).
This will allow for them to be combined with other information
in a reasonable way when determining confidences for system
outputs.
TokenLattices should never contain any paths with cycles. Every
arc in the lattice should be included in some path from the start
state to the end state.
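The scoring rule above (sum the arc weights, lower is better) can be sketched in Python as a shortest-path search. The function best_path and the (src, dst, token, weight) tuple encoding are hypothetical, and the relaxation loop is one simple choice that relies on the lattice being acyclic; it is not necessarily how cachedBestPath is computed in practice.

```python
import math


def best_path(start_state, end_state, arcs):
    """Return (score, tokens) for the lowest-scoring path through the lattice.

    arcs: list of (src, dst, token, weight); token is None for epsilon arcs.
    """
    states = {start_state, end_state}
    for src, dst, _, _ in arcs:
        states.update((src, dst))

    best = {state: (math.inf, []) for state in states}
    best[start_state] = (0.0, [])

    # In an acyclic lattice no path uses more than len(states) - 1 arcs,
    # so that many relaxation rounds are enough to settle every state.
    for _ in range(len(states) - 1):
        for src, dst, token, weight in arcs:
            score, tokens = best[src]
            if score + weight < best[dst][0]:
                emitted = [token] if token is not None else []
                best[dst] = (score + weight, tokens + emitted)
    return best[end_state]


# Weights are negative natural-log probabilities, as recommended above.
arcs = [
    (0, 1, "ice", -math.log(0.7)),
    (1, 3, "cream", 0.0),
    (0, 2, "I", -math.log(0.3)),
    (2, 3, "scream", 0.0),
]
score, tokens = best_path(0, 3, arcs)
```

With negative log likelihoods as weights, exp(-score) recovers the probability of the chosen path.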
Struct: TokenList
Key | Field | Type | Description | Requiredness | Default value |
1 | tokenList | list<Token> | | required | |
A wrapper around a list of tokens.
Struct: SpanLink
Key | Field | Type | Description | Requiredness | Default value |
1 | tokens | TokenRefSequence | The tokens that make up this SpanLink object. | required | |
2 | concreteTarget | uuid.UUID | | optional | |
3 | externalTarget | string | | optional | |
4 | linkType | string | | required | |
A collection of tokens that represent a link to another resource.
This resource might be another Concrete object (e.g., another
Concrete Communication), represented with the 'concreteTarget'
field, or it could link to a resource outside of Concrete via the
'externalTarget' field.
Struct: Tokenization
Key | Field | Type | Description | Requiredness | Default value |
1 | uuid | uuid.UUID | | required | |
2 | metadata | metadata.AnnotationMetadata | Information about where this tokenization came from. | required | |
3 | tokenList | TokenList | A wrapper around an ordered list of the tokens in this tokenization. This may also give easy access to the "reconstructed text" associated with this tokenization. This field should only have a value if kind==TOKEN_LIST. | optional | |
4 | lattice | TokenLattice | A lattice that compactly describes a set of token sequences that might make up this tokenization. This field should only have a value if kind==LATTICE. | optional | |
5 | kind | TokenizationKind | Enumerated value indicating whether this tokenization is implemented using an n-best list or a lattice. | required | |
6 | tokenTaggingList | list<TokenTagging> | | optional | |
7 | parseList | list<Parse> | | optional | |
8 | dependencyParseList | list<DependencyParse> | | optional | |
9 | spanLinkList | list<SpanLink> | | optional | |
A theory (or set of alternative theories) about the sequence of
tokens that make up a sentence.
This message type is used to record the output not just of
tokenizers, but also of a wide variety of other tools, including
machine translation systems, text normalizers, part-of-speech
taggers, and stemmers.
Each Tokenization is encoded using either a TokenList
or a TokenLattice. (If you want to encode an n-best list, then
you should store it as n separate Tokenization objects.) The
"kind" field is used to indicate whether this Tokenization contains
a list of tokens or a TokenLattice.
The confidence value for each sequence is determined by combining
the confidence from the "metadata" field with confidence
information from individual token sequences as follows:
- For n-best lists: metadata.confidence
- For lattices: metadata.confidence * exp(-sum(arc.weight))
Note: in some cases (such as the output of a machine translation
tool), the order of the tokens in a token sequence may not
correspond with the order of their original text span offsets.
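As a worked instance of the confidence rule above, the following Python sketch combines a metadata confidence with the arc weights along one lattice path. The function name sequence_confidence is hypothetical, and the arc weights are assumed to be negative natural-log probabilities, as the TokenLattice description recommends.

```python
import math


def sequence_confidence(metadata_confidence, arc_weights=None):
    """Confidence of one token sequence.

    With no arc weights (a token list), the metadata confidence is used as-is.
    For a lattice path, it is scaled by exp(-sum(arc.weight)).
    """
    if arc_weights is None:
        return metadata_confidence
    return metadata_confidence * math.exp(-sum(arc_weights))


list_confidence = sequence_confidence(0.8)
# A path whose arcs have probabilities 0.5 and 1.0.
path_confidence = sequence_confidence(0.8, [-math.log(0.5), 0.0])
```

Here the lattice path has probability 0.5, so its confidence is half of the metadata confidence.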
Struct: Sentence
Key | Field | Type | Description | Requiredness | Default value |
1 | uuid | uuid.UUID | | required | |
2 | tokenization | Tokenization | Theory about the tokens that make up this sentence. For text communications, these tokenizations will typically be generated by a tokenizer. For audio communications, these tokenizations will typically be generated by an automatic speech recognizer. The "Tokenization" message type is also used to store the output of machine translation systems and text normalization systems. | optional | |
3 | textSpan | spans.TextSpan | Location of this sentence in the communication text. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
4 | rawTextSpan | spans.TextSpan | Location of this sentence in the raw text. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
5 | audioSpan | spans.AudioSpan | Location of this sentence in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
A single sentence or utterance in a communication.
Struct: Section
Key | Field | Type | Description | Requiredness | Default value |
1 | uuid | uuid.UUID | The unique identifier for this section. | required | |
2 | sentenceList | list<Sentence> | The sentences of this "section." | optional | |
3 | textSpan | spans.TextSpan | Location of this section in the communication text. NOTE: This text span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
4 | rawTextSpan | spans.TextSpan | Location of this section in the raw text. NOTE: This text span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
9 | audioSpan | spans.AudioSpan | Location of this section in the original audio. NOTE: This span represents a best guess, or 'provenance': it cannot be guaranteed that this text span matches the _exact_ text of the original document, but is the annotation's best effort at such a representation. | optional | |
5 | kind | string | A short, sometimes corpus-specific term characterizing the nature of the section; may change in a future version of concrete. This often acts as a coarse-grained descriptor that is used for filtering. For example, Gigaword uses the section kind "passage" to distinguish content-bearing paragraphs in the body of an article from other paragraphs, such as the headline and dateline. | required | |
6 | label | string | The name of the section. For example, a title of a section on Wikipedia. | optional | |
7 | numberList | list<i32> | Position within the communication with respect to other Sections: The section number, E.g., 3, or 3.1, or 3.1.2, etc. Aimed at Communications with content organized in a hierarchy, such as a Book with multiple chapters, then sections, then paragraphs. Or even a dense Wikipedia page with subsections. Sections should still be arranged linearly, where reading these numbers should not be required to get a start-to-finish enumeration of the Communication's content. | optional | |
8 | lidList | list<language.LanguageIdentification> | An optional field to be used for multi-language documents. This field should be populated when a section is inside of a document that contains multiple languages. Minimally, each block of text in one language should be its own section. For example, if a paragraph is in English and the paragraph afterwards is in French, these should be separated into two different sections, allowing language-specific analytics to run on appropriate sections. | optional | |
A single "section" of a communication, such as a paragraph. Each
section is defined using a text or audio span, and can optionally
contain a list of sentences.
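For the numberList field, a plausible reading is that section 3.1.2 is stored as the list [3, 1, 2]. The Python sketch below rests on that assumption, and the helper name section_number is hypothetical.

```python
def section_number(number_list):
    """Render a numberList such as [3, 1, 2] as the dotted string "3.1.2"."""
    return ".".join(str(n) for n in number_list)


# Sections already stored in linear reading order, as the description requires.
number_lists = [[2], [3], [3, 1], [3, 1, 2]]
labels = [section_number(n) for n in number_lists]
```

Python compares lists element by element, so sorting such number lists reproduces the hierarchical order, although consumers should not need to sort because sections are expected to be stored linearly.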