Provides library and main methods for creating and searching a Lucene index over Communication objects.
Communication objects must have the following:

- `text` field
- `Section`s
- `Sentence`s with valid `TextSpan`s

If using the components that support pretokenization, the `Sentence` objects must additionally have `Tokenization` set.
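For illustration, here is a minimal sketch of a Communication that satisfies these requirements, built with the Thrift-generated Concrete classes. The `randomUuid` helper and the id, type, and kind values are illustrative choices, not requirements of this library:

```java
import edu.jhu.hlt.concrete.AnnotationMetadata;
import edu.jhu.hlt.concrete.Communication;
import edu.jhu.hlt.concrete.Section;
import edu.jhu.hlt.concrete.Sentence;
import edu.jhu.hlt.concrete.TextSpan;
import edu.jhu.hlt.concrete.UUID;

public class MinimalCommExample {
  // Illustrative helper: wraps a random java.util.UUID in Concrete's UUID struct.
  static UUID randomUuid() {
    return new UUID(java.util.UUID.randomUUID().toString());
  }

  public static Communication build() {
    String text = "Canada is large.";

    Communication comm = new Communication();
    comm.setId("example-comm-1");
    comm.setUuid(randomUuid());
    comm.setType("document");
    comm.setText(text); // required: the text field
    comm.setMetadata(new AnnotationMetadata()
        .setTool("example")
        .setTimestamp(System.currentTimeMillis() / 1000L));

    // One Sentence with a TextSpan covering the whole text.
    Sentence sentence = new Sentence();
    sentence.setUuid(randomUuid());
    sentence.setTextSpan(new TextSpan(0, text.length()));

    // One Section wrapping the Sentence.
    Section section = new Section();
    section.setUuid(randomUuid());
    section.setKind("passage");
    section.setTextSpan(new TextSpan(0, text.length()));
    section.addToSentenceList(sentence);

    comm.addToSectionList(section);
    return comm;
  }

  public static void main(String[] args) {
    System.out.println(build().getId());
  }
}
```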
To build an index using Lucene’s default tokenization, use `LuceneCommunicationIndexer`:
```java
try (LuceneCommunicationIndexer indexer = new NaiveConcreteLuceneIndexer(directoryPath)) {
  indexer.add(comm1);
  indexer.add(comm2);
}
```
To search over the index, use `ConcreteLuceneSearcher`:
```java
try (ConcreteLuceneSearcher search = new ConcreteLuceneSearcher(directoryPath)) {
  List<Document> docs = search.searchDocuments("Canada", 50);
  System.out.println(docs.get(0).get(ConcreteLuceneConstants.COMM_ID_FIELD));
}
```
The examples above assume you are using the standard Lucene analyzer and its tokenization. For pretokenized Communication objects, use `TokenizedCommunicationIndexer` and `TokenizedCommunicationSearcher` instead. The searcher uses a whitespace tokenizer on the query strings; see the sketch below.
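A hedged sketch of the pretokenized components, assuming they mirror the constructor and method patterns of the examples above; the exact class signatures may differ, so check the sources before relying on this:

```java
// Sketch only: assumes these classes mirror the constructor and method
// patterns of NaiveConcreteLuceneIndexer and ConcreteLuceneSearcher above.
try (TokenizedCommunicationIndexer indexer = new TokenizedCommunicationIndexer(directoryPath)) {
  indexer.add(comm1); // each Sentence must have a Tokenization set
}

try (TokenizedCommunicationSearcher search = new TokenizedCommunicationSearcher(directoryPath)) {
  // Query terms are split on whitespace, not run through an analyzer.
  List<Document> docs = search.searchDocuments("canada", 50);
}
```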
To build the command-line tools, create the assembly jar:

```sh
mvn clean compile assembly:single
```
The `NaiveConcreteLuceneIndexer` requires Communication objects that satisfy the requirements above. Given a `.tar.gz` file of such Communication objects, run the following:
```sh
java -cp target/*.jar \
  edu.jhu.hlt.concrete.lucene.TarGzCommunicationIndexer \
  --input-path /your/comms.tar.gz \
  --output-folder /a/folder/for/the/index
```
Use the `--help` flag for all parameters.
Given a folder containing a built Lucene index, run the following:
```sh
java -cp target/*.jar \
  edu.jhu.hlt.concrete.lucene.ConcreteLuceneSearcher \
  --index-path /a/folder/for/the/index \
  "your query terms"
```
Use the `--help` flag for all parameters.
Lucene modifies both the text to be indexed and the query text. These modifications are performed by an `Analyzer`, and compatible analyzers must be used for indexing and searching. The default analyzer is `StandardAnalyzer`, which has its own stop word list, lowercases the text, and tokenizes using an implementation of Unicode text segmentation. This tokenizer treats Chinese text as unigrams. Lucene provides many different tokenizers and filters for building analyzers; explore the Lucene repository for more information.
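As an illustration of what an analyzer does to text, the following standalone snippet runs Lucene's `StandardAnalyzer` over a string and prints the resulting terms; the exact stop-word behavior depends on your Lucene version's defaults:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("text", "The Prime Minister of Canada")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // Prints lowercased tokens; words like "the" and "of" are dropped
        // if the analyzer's default stop word list is in effect.
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
```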
The pre-tokenization code lowercases the text and removes a small set of English stop words.