Working with doctable Parsetrees

Here I'll show you how to extract and use parsetrees in your doctable using spaCy + doctable. The motivation is that the parsetree information in raw spaCy Document objects is very large and not suitable for storage when working with large corpora. We solve this by converting the spaCy Document object to a tree data structure built from Python lists and dictionaries, and we use the ParseTree object to serialize, de-serialize, and interact with the tree structure.
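To make the idea of a tree built from lists and dictionaries concrete, here is a minimal sketch that does not use spaCy or doctable at all. The `parsed` records and the `to_tree` helper are hypothetical stand-ins I made up for illustration; the actual layout doctable uses internally may differ.

```python
# Stand-in for a parsed sentence: (token text, dependency label, index of head).
# As in spaCy, the root token is its own head.
parsed = [
    ("Luke", "nsubj", 1),
    ("trains", "ROOT", 1),
    ("with", "prep", 1),
    ("Yoda", "pobj", 2),
]

def to_tree(parsed):
    """Hypothetical helper: build a nested dict/list tree from flat records."""
    nodes = [
        {"i": i, "text": text, "dep": dep, "childs": []}
        for i, (text, dep, _) in enumerate(parsed)
    ]
    root = None
    for i, (_, _, head) in enumerate(parsed):
        if head == i:
            root = nodes[i]                        # the root points to itself
        else:
            nodes[head]["childs"].append(nodes[i])  # attach token to its head
    return root

tree = to_tree(parsed)
print(tree["text"], [child["text"] for child in tree["childs"]])
```

The result contains only dictionaries, lists, strings, and integers, which is what makes this kind of structure cheap to store compared to a full parser object.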

We access this feature through the get_parsetrees pipeline component, which runs after the spaCy parser. Check the docs to learn more about this function. You can see more examples of creating parse pipelines in our overview examples.

First we define some example text documents, Star Wars themed.

Creating ParseTreeDoc Objects

The most direct way of creating a parsetree is to parse the desired text using the spaCy language model, then use ParseTreeDoc.from_spacy() to construct the ParseTreeDoc. The ParseTreeDoc object is a container for ParseTree objects, one for each of the sentences identified by the spaCy parser.

The most important arguments to parse_tok_func are text_parse_func and userdata_map.

  1. text_parse_func determines the mapping from a spacy doc object to the text representation of each token accessed through token.text. By default this parameter is set to lambda d: d.text.

  2. userdata_map is a dictionary mapping an attribute name to a function. You can, for instance, extract info from the original spacy doc object through this method. I'll explain later how these attributes can be accessed and used.
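As a rough sketch of how these two arguments behave, the example below applies a text_parse_func and a userdata_map to each token. The `build_token_records` helper and the `SimpleNamespace` tokens are hypothetical stand-ins of my own, not doctable's implementation; only the argument names and the default `lambda d: d.text` come from the description above.

```python
from types import SimpleNamespace

# Stand-in tokens carrying a few attributes that spaCy tokens also expose.
tokens = [
    SimpleNamespace(text="Jedi", lemma_="jedi", is_stop=False),
    SimpleNamespace(text="are", lemma_="be", is_stop=True),
]

def build_token_records(tokens, text_parse_func=lambda d: d.text, userdata_map=None):
    """Hypothetical helper: apply both arguments to every token."""
    userdata_map = userdata_map or {}
    records = []
    for tok in tokens:
        records.append({
            # text_parse_func decides what is stored as the token text
            "text": text_parse_func(tok),
            # each userdata_map entry maps an attribute name to a function
            "userdata": {name: func(tok) for name, func in userdata_map.items()},
        })
    return records

records = build_token_records(
    tokens,
    text_parse_func=lambda tok: tok.text.lower(),
    userdata_map={"lemma": lambda tok: tok.lemma_, "is_stop": lambda tok: tok.is_stop},
)
print(records)
```

The point of passing functions rather than attribute names is that anything computable from the original token can be captured at parse time, before the heavy parser object is discarded.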

Working With ParseTrees

ParseTreeDoc objects represent sequences of ParseTree objects identified by the spaCy language parser. We can access individual sentence parsetrees using numerical indexing or through iteration.

Now we will show how to work with ParseTree objects. These objects are collections of tokens that can be accessed either as a tree (based on the structure of the dependency tree produced by spacy), or as an ordered sequence. We can use numerical indexing or iteration to interact with individual tokens.

We can work with the tree structure of a ParseTree object using the root property.

We can access the children of a given token using the childs property. The following tokens are children of the root token.
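The two ways of accessing tokens, as an ordered sequence and as a tree, can be sketched with a small stand-in class. `MiniToken` and `MiniTree` are hypothetical names I use here for illustration; they only mimic the indexing, iteration, root, and childs behavior described in this section and are not the doctable classes themselves.

```python
class MiniToken:
    """Hypothetical stand-in for a parsetree token."""
    def __init__(self, i, text):
        self.i = i
        self.text = text
        self.childs = []   # same name as the property described in this section

class MiniTree:
    """Hypothetical stand-in that keeps tokens both in order and as a tree."""
    def __init__(self, texts, heads):
        self.tokens = [MiniToken(i, text) for i, text in enumerate(texts)]
        self._root = None
        for tok, head in zip(self.tokens, heads):
            if head == tok.i:
                self._root = tok                  # root is its own head
            else:
                self.tokens[head].childs.append(tok)

    @property
    def root(self):
        return self._root

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, ind):
        return self.tokens[ind]

    def __iter__(self):
        return iter(self.tokens)

tree = MiniTree(["Vader", "is", "Luke", "'s", "father"], [1, 1, 4, 2, 1])
in_order = [tok.text for tok in tree]                 # sequence access
root_children = [tok.text for tok in tree.root.childs]  # tree access
print(in_order, tree.root.text, root_children)
```

Keeping a flat list alongside the tree links means that sentence order is never lost, even though the dependency structure reorders tokens.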

These objects can be serialized using the .as_dict() method and de-serialized using the .from_dict() method.
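A round trip through a dictionary can be sketched as follows. The `Node` class is a hypothetical stand-in; I borrow the method names as_dict and from_dict from the text above, but the keys in the dictionary are my own assumption and need not match what doctable produces.

```python
import json

class Node:
    """Hypothetical tree node with dict-based serialization."""
    def __init__(self, text, childs=None):
        self.text = text
        self.childs = childs or []

    def as_dict(self):
        # Recursively convert the node and its subtree to plain dicts/lists.
        return {"text": self.text, "childs": [c.as_dict() for c in self.childs]}

    @classmethod
    def from_dict(cls, data):
        # Rebuild the node and its subtree from the plain representation.
        return cls(data["text"], [cls.from_dict(c) for c in data["childs"]])

tree = Node("destroyed", [
    Node("Rebels"),
    Node("Star", [Node("the"), Node("Death")]),
])

payload = json.dumps(tree.as_dict())           # e.g. a string stored in a database
restored = Node.from_dict(json.loads(payload))
print(restored.as_dict() == tree.as_dict())
```

Because the intermediate form is plain data, any storage format that handles nested lists and dictionaries (JSON here, as one option) would work for the stored payload.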

More About Tokens

Each token in a ParseTree is represented by a Token object. These objects maintain the tree structure of a parsetree, and each node contains some default information as well as optional and custom information. These are the most important member variables:

Member Variables

Optional Member Variables

The following are provided if the associated spacy parser component was enabled.

We can also access the custom token properties provided to the ParseTreeDoc.from_spacy() method earlier.

Recursive Functions on Parsetrees

We can also navigate the tree structure of parsetrees using recursive functions. Here I simply print out the trajectory of this recursive function.
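For readers without doctable installed, the same kind of recursion can be sketched on a plain nested dictionary. The `tree` literal and the `walk` function are hypothetical illustrations of mine; with real Token objects you would follow the childs property instead of a "childs" key.

```python
# A hand-written dependency tree for "Vader is Luke's father".
tree = {"text": "is", "childs": [
    {"text": "Vader", "childs": []},
    {"text": "father", "childs": [
        {"text": "Luke", "childs": [{"text": "'s", "childs": []}]},
    ]},
]}

def walk(node, depth=0, visited=None):
    """Depth-first walk that prints each token indented by its depth."""
    visited = [] if visited is None else visited
    print("  " * depth + node["text"])
    visited.append((depth, node["text"]))
    for child in node["childs"]:
        walk(child, depth + 1, visited)   # recurse into each subtree
    return visited

trajectory = walk(tree)
```

Returning the visited list in addition to printing makes the trajectory easy to inspect or reuse, for instance to compute the depth of each token.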

Create Using ParsePipelines

The most common use case, however, probably involves creating a ParsePipeline whose end result is a ParseTreeDoc. We build this using the get_parsetrees pipeline component, and here we show several of the possible arguments.

You can see that the parser provides the same output as we got before with ParseTreeDoc.from_spacy().