Genie Flow Invoker Document Process

The invokers in this package involve the processing of documents. Documents can be anything from a Microsoft Word document, a PowerPoint presentation or a PDF to a simple text file. The following process steps are implemented:

  • PARSE - turning a binary document file into text
  • CLEAN - removing spurious elements that should not be part of the text
  • CHUNK - breaking up a text into smaller parts, following different strategies
  • EMBED - adding a vector embedding to a piece of text
  • SEARCH - given a search text, find its nearest neighbours among a list of embedded chunks

Installing

Installing is done through a normal pip install using the appropriate package registry. After installing, one needs to download a number of NLTK corpora. This can be done by executing the command init_docproc. This script will download the required corpora and place them in the standard directory (see https://www.nltk.org/data.html), which is a directory called nltk_data in the user's home directory, or if the environment variable NLTK_DATA is set, into the directory specified.

The Chunked Document class

The core of this set of invokers revolves around the ChunkedDocument object. This is a class that contains the filename of the original document, some possible metadata (a dict with key-value pairs) and a list of document chunks.

Every DocumentChunk contains text, the original span (starting and ending position of that text within the original document), a hierarchy level and a parent id.

Hierarchy within chunks

Every chunk within a ChunkedDocument sits somewhere in a tree. The root of this tree is the original text, and flowing down from it are ever smaller chunks. Each smaller chunk is taken from its 'parent' chunk.

The ChunkedDocument contains a list of DocumentChunks, and for every document chunk it is recorded at what level in the tree it sits and which chunk is its parent. The root of the tree (the complete document) sits at level 0, the chunks one level down (the first level of chunks) all have hierarchy level 1, chunks taken out of these (even smaller chunks) have level 2, and so on.
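
As a minimal sketch, the two classes can be pictured as follows; the field names (including chunk_id) are taken from the descriptions above, and the actual definitions in the package may differ:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DocumentChunk:
    chunk_id: str                    # unique id, referenced by child chunks via parent_id
    text: str
    original_span: tuple             # start and end position within the original document
    hierarchy_level: int             # 0 = complete document, 1 = first split, 2 = smaller, ...
    parent_id: Optional[str] = None  # id of the chunk this chunk was taken from
    vector: Optional[list] = None    # set later by the embedding invoker

@dataclass
class ChunkedDocument:
    filename: str
    metadata: dict = field(default_factory=dict)
    chunks: list = field(default_factory=list)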

Operation level

A concept for some of the invokers is the "operation level". This is the level inside the tree to which the invoker is applied. For instance, splitting up all hierarchy level 1 chunks into smaller chunks is done by specifying operation_level=1 in the splitter invoker. As a result, the ChunkedDocument will be extended with new chunks at hierarchy level 2, with their parents set to the respective level 1 chunks they were sourced from.

Not specifying an operation level will execute the invoker on all existing chunks.
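
In other words, an invoker with an operation level set behaves as if it first selects the chunks to work on, along these lines (an illustrative sketch, not the package's actual code):

def chunks_at_level(document, operation_level=None):
    # No operation level: the invoker is executed on all existing chunks.
    if operation_level is None:
        return document.chunks
    return [c for c in document.chunks if c.hierarchy_level == operation_level]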

Parsing a Document - the DocumentParseInvoker

For parsing we use Apache Tika, a general purpose parsing engine that can turn many known binary document formats into plain text.

The input expected for this invoker is the filename and document_data, a base64-encoded representation of the binary file.

The output is a ChunkedDocument that contains the same filename, some further metadata and one chunk containing the complete text that has been parsed from the document.

When parsing an empty document, the resulting ChunkedDocument will only contain the filename, an empty metadata dictionary and an empty list of chunks.
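
For example, the input payload for this invoker could be prepared from a file on disk as follows (a sketch; the exact envelope expected by your deployment may differ):

import base64

def make_parse_input(path):
    # Read the binary file and base64-encode it into document_data.
    with open(path, "rb") as file:
        document_data = base64.b64encode(file.read()).decode("ascii")
    return {"filename": path, "document_data": document_data}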

Cleaning a Document - the DocumentCleanInvoker

The DocumentCleanInvoker takes in a ChunkedDocument and "cleans" the content of its chunks. The following cleaning operations have been defined:

clean_multiple_newlines
This will reduce any sequence of two or more newlines back to just one newline character.
clean_multiple_spaces
This will reduce any sequence of two or more whitespace characters back to just one space.
clean_tabs
This will reduce any sequence of one or more tab characters back to just one space.
clean_numbers
This will replace numbers larger than 9 with a sequence of # characters of a length equal to the number of digits of the original number. Beyond five digits, the length of the sequence remains five (so #####). If a number is split by , or . characters, each group of digits is treated separately, and the , and . characters are left in place. For example: the number 3.14159265 would be replaced by 3.#####.
special_term_replacements
This will replace any predefined term with something else. If a value of True is given, the default replacements are used. Alternatively, a dictionary of from:to pairs can be specified. The default replacements are:
SPECIAL_TERMS = {
    "i.e.": "%%IE%%",
    "e.g.": "%%EG%%",
    "etc.": "%%ETC%%",
    ".com": "%%DOTCOM%%",
    "www.": "%%WWW%%",
}
tokenize_detokenize
This complex cleaner uses NLTK tokenizers to split a text into tokens, and then recompiles these tokens back into a sentence.
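
As an illustration, the clean_numbers behaviour described above can be approximated with a regular expression (a sketch, not the package's actual implementation):

import re

def clean_numbers(text):
    # Replace every run of two or more digits (i.e. numbers larger than 9)
    # with # characters, capped at five; ',' and '.' are left alone.
    return re.sub(r"\d{2,}", lambda m: "#" * min(len(m.group(0)), 5), text)

print(clean_numbers("the number 3.14159265"))  # -> the number 3.#####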

Tokenize - detokenize cleaning

We use a number of tokenizers from the NLTK package.

The aim of this cleaning strategy is to ensure sentences are properly identified as full sentences. This takes into account, for instance, that a period after an abbreviation does not indicate the end of a sentence, and that two sentences are still recognised as separate when there is no period between them but the second starts with a word that regularly starts a new sentence. A similar approach is followed for the words within each of these sentences.

After applying this cleaning strategy, the resulting text will contain full sentences that are properly constructed and consistently spaced.

Breaking a text up into sentences is done using the PunktSentenceTokenizer.

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

The resulting sentences are then tokenized using the TreebankWordTokenizer.

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This tokenizer performs the following steps:

  • split standard contractions, e.g. don't -> do n't and they'll -> they 'll
  • treat most punctuation characters as separate tokens
  • split off commas and single quotes, when followed by whitespace
  • separate periods that appear at the end of a line

The resulting list of lists (sentences of words) is then de-tokenized, first using the TreebankWordDetokenizer to reconstruct the sentences. These sentences are then joined together with a newline character.
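
Put together, the tokenize-detokenize round trip looks roughly like this (a sketch using the NLTK classes named above; in practice the pretrained punkt model downloaded by init_docproc would be used):

from nltk.tokenize import PunktSentenceTokenizer, TreebankWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

def tokenize_detokenize(text):
    sentences = PunktSentenceTokenizer().tokenize(text)  # text -> sentences
    tokenizer = TreebankWordTokenizer()
    detokenizer = TreebankWordDetokenizer()
    # Tokenize every sentence into words, rebuild it, and join with newlines.
    return "\n".join(
        detokenizer.detokenize(tokenizer.tokenize(sentence))
        for sentence in sentences
    )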

Chunking a Document

For chunking texts, this package currently implements two strategies: the FixedWordCountSplitterInvoker that splits texts into fixed-size chunks, and the LexicalDensitySplitInvoker which uses the concept of lexical density to create chunks that contain a similar information density.

The input into a chunking invoker is a ChunkedDocument. Depending on whether an operation_level has been set, the chunking is applied to all chunks or only to the chunks at the given level of the hierarchy.

The output of these invokers is the same ChunkedDocument, with new chunks added to the list of chunks. Each newly added chunk gets a hierarchy level one higher than that of the chunk it is based on, and its parent is set to the id of that chunk.

Fixed Word splitting

The FixedWordsSplitter splits a text into windows of a fixed size, then moves the window a predefined number of words and creates a new chunk.

This splitter has the ability to ignore stop words. These words are not counted towards the number of words that are included in the chunk. Stopwords are taken from the NLTK corpus for English stopwords (see https://www.nltk.org/nltk_data/).

When the window cannot be filled to the required number of words, because there are no words left in the remainder of the sentence, the default behaviour is to produce smaller chunks at the trailing end of the sentence. With the drop_trailing flag, one can prevent this and only produce chunks that have the configured number of words.

Words are extracted using the TreebankWordTokenizer and new chunks are recomposed by its counterpart, the TreebankWordDetokenizer.

The following settings can be made:

max_words
The maximum number of words that should fit into one chunk.
overlap
The number of words to skip each time for a new chunk to be created.
ignore_stop_words
(default False) A boolean indicating if (English) stopwords should be ignored when counting.
drop_trailing
(default False) Whether to drop trailing windows that do not contain the required number of words (max_words).
operation_level
(default None) The level in the hierarchy that this invoker needs to be applied to.
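
The sliding window behaviour can be sketched as follows (simplified: it works on a plain word list and ignores the stop word handling and detokenization described above):

def fixed_word_windows(words, max_words, overlap, drop_trailing=False):
    # Move a window of max_words words through the text,
    # advancing `overlap` words at every step.
    chunks = []
    for start in range(0, len(words), overlap):
        window = words[start:start + max_words]
        if not window:
            break
        if drop_trailing and len(window) < max_words:
            break  # do not emit trailing windows that cannot be filled
        chunks.append(window)
    return chunks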

Lexical Density Splitting

The concept of lexical density determines how much "real information" is contained in a chunk.

Lexical density estimates the linguistic complexity in a written or spoken composition from the functional words (grammatical units) and content words (lexical units, lexemes).

The lexical density is calculated in this package by taking the fraction of lexical words over the total number of words. Lexical words are Nouns, Adjectives, Verbs and Adverbs. The package uses NLTK Part of Speech tagging to assign POS tags to all words in a chunk.

The user of this invoker should set a minimum lexical density to be reached, as well as a minimum and maximum number of words in a chunk. The shortest and longest strategies then find the shortest or the longest chunk that fits within these bounds (more than min_words, fewer than max_words, and with a lexical density larger than target_density). The alternative best strategy finds the chunk that has the highest density.

The LexicalDensitySplitter has the following configurations:

min_words
The minimal number of words that should go into a new chunk.
max_words
The maximum number of words that go into a chunk.
overlap
The number of words to skip forward before the next chunk window is started.
target_density
A number between 0.0 and 1.0 that sets the threshold for the lexical density. Chunks that do not have a lexical density of more than this will not be created.
strategy
(default best) a string that can be either shortest, best or longest
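
The density calculation itself can be sketched with NLTK part-of-speech tagging (the actual implementation in the package may differ in detail):

import nltk

LEXICAL_TAG_PREFIXES = ("NN", "JJ", "VB", "RB")  # nouns, adjectives, verbs, adverbs

def lexical_density(words):
    # Fraction of lexical (content) words over the total number of words.
    if not words:
        return 0.0
    tagged = nltk.pos_tag(words)
    lexical = sum(1 for _, tag in tagged if tag.startswith(LEXICAL_TAG_PREFIXES))
    return lexical / len(words)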

Transcript Splitting

A transcript in the form of a WebVTT file will be chunked into a chunk per speaker. So, a parent chunk containing the following:

WEBVTT

ec4ac57d-840f-4f63-a0c1-808e3e218362/13-0
00:00:03.317 --> 00:00:04.797
<v Jochem Michiels>Maybe my English is a bit Dutch but.</v>

ec4ac57d-840f-4f63-a0c1-808e3e218362/32-0
00:00:06.807 --> 00:00:10.249
<v Hanna Berg>OK. But before we start,
I thought it might be good to have a</v>

will be split into two child chunks, each pertaining to one of the captions in the example.

Consecutive captions of the same speaker are merged into one DocumentChunk.

NOTE: The splitter expects a correctly formatted WebVTT file. Running any other chunker before this one will wreck the WebVTT format and result in empty child chunks and warnings.

Every child chunk will have an original_span, being the start and end seconds of the chunk. Also, the following custom attributes (available to Genie Flow invokers from version 0.6.4 of the genie-flow-invoker package) will be added:

party_name
the name of the speaker, if given in the transcript, or "UNKNOWN" if no name is given.
seconds_start
the second from the start of the transcript in which the statement started being made
seconds_end
the second from the start of the transcript in which the statement finished being made
duration
the number of seconds the statement took to be made
identifier
an optional identifier if provided in the transcript
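
For the first caption in the example above, the custom attributes would look along these lines (illustrative values):

{
    "party_name": "Jochem Michiels",
    "seconds_start": 3.317,
    "seconds_end": 4.797,
    "duration": 1.48,
    "identifier": "ec4ac57d-840f-4f63-a0c1-808e3e218362/13-0",
}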

Embedding a Document - the DocumentEmbedInvoker

Embedding, or Word Embedding, is the process of creating a vector that represents a word or piece of text. Based on these vectors, texts that are semantically close to each other are likely to be related.

The EmbedInvoker uses an external service to do the actual vectorisation. Typically, these are the containers that Weaviate has put around the Huggingface text-to-vector (t2v) models. When using this embedding invoker there needs to be a service available that has the interface that Weaviate implemented.

Input into this invoker can be either a ChunkedDocument or a string. If the input is a string, the result is the vector embedding of that string -- a list of floating values. If the input is a ChunkedDocument, the result is the same ChunkedDocument, but with a vector embedding for each and every chunk in that document.

As playfully laid out in "The Art of Pooling Embeddings", there are a number of different pooling strategies. The default implemented by Weaviate is masked_mean, but others are available and configurable in this invoker.

The following configuration parameters are available:

text2vec_url
The URL of the text-to-vector service.
pooling_strategy
(default None) the optional pooling strategy.
backoff_max_time
The maximum time this invoker should wait before re-trying a failed request.
backoff_max_tries
The maximum number of retries to the service.
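
Calling such a text-to-vector service directly can be sketched as follows; the /vectors endpoint and payload shape shown here are those of Weaviate's t2v containers, and should be verified against the service actually deployed:

import requests

def embed_text(text2vec_url, text):
    # POST the text to the vectoriser and return the embedding vector.
    response = requests.post(f"{text2vec_url}/vectors", json={"text": text})
    response.raise_for_status()
    return response.json()["vector"]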

Searching a Document - the SimilaritySearchInvoker

The SimilaritySearchInvoker implements the search for similar chunks in a ChunkedDocument.

Input into this invoker is a SimilaritySearch object; the output is a SimilarityResults object.

SimilaritySearch

This object contains the relevant information for conducting a similarity search: naturally, the chunks to search in and the query to search for, but also parameters that inform the invoker how to limit the results, how to deal with chunk parents and what distance metric to use.

The list of chunks is expected to be the same as the list of DocumentChunks that is maintained in a ChunkedDocument, so chunked_document.chunks can go straight into that field.

The query should be passed as a vector. This can easily be obtained using the DocumentEmbedInvoker with just the search string as input. The output of that invoker (a list of floating point values) is the search query vector.

We can also instruct this invoker to only work at a certain operation level. This means that the search will only be conducted within chunks that have the given hierarchy level.

By default, this invoker will return all chunks that are searched, in ascending order of distance from the search query vector (nearest first). These results can be limited by specifying a horizon and/or a top. The former specifies the maximum distance that is still acceptable; the latter specifies the maximum number of results to return. If both are specified, the result is limited by both factors.

Two parent strategies have been implemented: include and replace. The former will include any parent of the chunks that are retrieved. The latter will drop the children and replace them with their parents.
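
The core of the search can be pictured as follows (a simplified sketch using cosine distance, without the parent strategies):

import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def similarity_search(chunks, query_embedding, horizon=None, top=None):
    # Rank all embedded chunks by distance to the query vector, nearest first.
    results = sorted(
        ((cosine_distance(chunk.vector, query_embedding), chunk) for chunk in chunks),
        key=lambda pair: pair[0],
    )
    if horizon is not None:
        results = [(d, c) for d, c in results if d <= horizon]  # within the horizon
    if top is not None:
        results = results[:top]  # at most `top` results
    return results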

Configuration

Configuration of this invoker follows a slightly different pattern from that of the other invokers in this package. The meta.yaml file does not have to contain any parameters. There are no services to configure or other meta information to pass. All the configuration is done via the input that is sent to the invoker.

However, when parameters are specified in meta.yaml, they form a default value and can be left out of the input JSON. So, if the meta.yaml specifies method: cosine, this does not have to be specified in the input.

The following parameters are available:

filename
The name of the originating file.
chunks
the list of DocumentChunk objects to search within. These chunks need to be embedded (meaning that their vector field has been set).
query_embedding
The embedding of the query to search for.
operation_level
The optional hierarchy level of the chunks that should be searched within.
horizon
The optional maximum distance a chunk may have to still be included in the result.
top
The optional maximum number of result chunks that will be returned.
parent_strategy
The optional strategy to follow for including any parents: include or replace.
method
The distance metric to be used. Can be cosine (the default), euclidean or manhattan.
include_vector
(default False) A boolean indicating if the embedding vector of the resulting chunks should also be included in the output.

SimilarityResults

The resulting object contains a list of the chunks that have been found, in order of distance. These chunks are represented by the DocumentChunk object, accompanied by their distance to the search vector.

This information is represented in an object with the fields chunk and distance: the first contains the full DocumentChunk data, the second a floating point value representing the distance.

If the configuration parameter include_vector is False (the default), then the resulting DocumentChunks will not contain their embedding vector.
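
Schematically, a single entry in the results therefore looks like this (field values illustrative):

{
    "chunk": {
        "text": "a chunk of text from the document",
        "hierarchy_level": 1,
        "parent_id": "...",
    },
    "distance": 0.12,
}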