ParsePipeline Basics

Here I demonstrate the basics of tokenizing text with spaCy and doctable. spaCy does most of the heavy lifting of actually parsing each document, and doctable methods handle the conversion from the spaCy Doc object to a sequence of string tokens (words).

1. Build a ParsePipeline for Tokenization

ParsePipeline makes it easy to define a processing pipeline as a list of functions (called components) that are applied sequentially to each document in your corpus. You can use the .parsemany() method to run the pipeline on many documents in parallel, or simply use the .parse() method to parse a single document.

Our most basic pipeline uses a lambda function to split each text document by whitespace.

We then use the .parse() method to apply the pipeline to a single document.
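To make the idea concrete, here is a minimal pure-Python sketch of a pipeline that applies a list of components in order. MiniPipeline is a hypothetical stand-in I wrote for illustration, not doctable's implementation; it only mirrors the behavior described above, with a single lambda that splits on whitespace.

```python
from typing import Any, Callable, Sequence


class MiniPipeline:
    """Hypothetical stand-in: applies each component to the document in order."""

    def __init__(self, components: Sequence[Callable[[Any], Any]]):
        self.components = list(components)

    def parse(self, doc: Any) -> Any:
        # The output of one component becomes the input of the next.
        for component in self.components:
            doc = component(doc)
        return doc


# The most basic pipeline: one component that splits on whitespace.
split_pipeline = MiniPipeline([lambda text: text.split()])

tokens = split_pipeline.parse("The quick  brown fox")
print(tokens)  # ['The', 'quick', 'brown', 'fox']
```

Because every component is just a callable that takes one value and returns one value, adding a step means adding one more function to the list.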

We can also use the .parsemany() method to parse all of our texts at once. To parallelize the work, set the workers parameter to the number of processes to use.
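The sketch below shows the general shape of applying one parsing function to many texts with an optional worker count. The names parse_many and split_on_whitespace are hypothetical, and I use a thread pool from the standard library to keep the example short; this is an assumption for illustration and may differ from how doctable distributes work across processes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence


def split_on_whitespace(text: str) -> List[str]:
    """A named function (not a lambda) so it could also be pickled for processes."""
    return text.split()


def parse_many(texts: Sequence[str], workers: Optional[int] = None) -> List[List[str]]:
    """Hypothetical helper: parse every text, optionally with a pool of workers."""
    if workers is None or workers <= 1:
        # Sequential fallback when no parallelism is requested.
        return [split_on_whitespace(t) for t in texts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(split_on_whitespace, texts))


texts = ["first document here", "second one", "and a third"]
parsed = parse_many(texts, workers=2)
print(parsed)
```

One design note: process-based pools need to pickle the functions they run, and the standard pickle module cannot serialize lambdas, which is why the component here is a named module-level function.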

2. Use doctable Parsing Components

doctable has some built-in methods for pre- and post-processing spaCy documents. These are the functions in the doctable.parse namespace, and you can access them through the doctable.Comp function.
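One way to picture a function like Comp is as a lookup by name followed by binding keyword arguments, so the result is a ready-to-use pipeline component. The sketch below illustrates that pattern with functools.partial. The registry, the comp helper, and the strip_tags and lowercase functions are all hypothetical names for illustration; I am not claiming this is how doctable implements it.

```python
import re
from functools import partial
from typing import Callable, Dict


def strip_tags(text: str, replacement: str = "") -> str:
    """Replace anything that looks like an XML/HTML tag."""
    return re.sub(r"<[^>]+>", replacement, text)


def lowercase(text: str) -> str:
    return text.lower()


# Hypothetical registry standing in for a namespace of parsing functions.
REGISTRY: Dict[str, Callable] = {"strip_tags": strip_tags, "lowercase": lowercase}


def comp(name: str, **kwargs) -> Callable:
    """Look up a function by name and pre-bind its keyword arguments."""
    return partial(REGISTRY[name], **kwargs)


clean = comp("strip_tags", replacement=" ")
result = clean("<b>Hello</b> world")
print(result.split())  # ['Hello', 'world']
```

The benefit of this pattern is that a configured component is a single-argument callable, which is exactly what a list-of-functions pipeline expects.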

Now we show a pipeline that uses the doctable preprocess method to remove XML tags and URLs, the spaCy nlp model to parse the document, and the built-in tokenize method to convert the spaCy Doc object to a list of tokens.
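The three stages can be sketched without any dependencies as follows. Everything here is a stand-in: preprocess uses simple regular expressions, fake_nlp replaces the spaCy model with a regex splitter that returns small token objects, and tokenize converts those objects to strings. The function names, the Tok class, and the replace_url and replace_xml parameters are assumptions for illustration, not the doctable or spaCy APIs.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    """Hypothetical token object standing in for a parsed token."""
    text: str


def preprocess(text: str, replace_url: str = "_URL_", replace_xml: str = "") -> str:
    # Remove tags first so a closing tag glued to a URL is not swallowed by \S+.
    text = re.sub(r"<[^>]+>", replace_xml, text)
    return re.sub(r"https?://\S+", replace_url, text)


def fake_nlp(text: str) -> List[Tok]:
    # Stand-in for a real language model: words and single punctuation marks.
    return [Tok(t) for t in re.findall(r"\w+|[^\w\s]", text)]


def tokenize(doc: List[Tok]) -> List[str]:
    # Convert token objects to plain strings.
    return [tok.text for tok in doc]


def run_pipeline(text: str) -> List[str]:
    result = text
    for step in (preprocess, fake_nlp, tokenize):
        result = step(result)
    return result


tokens = run_pipeline("<p>See https://example.com for details.</p>")
print(tokens)  # ['See', '_URL_', 'for', 'details', '.']
```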

3. More Complicated Pipelines

Now we show a more complicated example. The tokenize function also takes two additional function arguments: keep_tok_func determines whether a spaCy token should be included in the final document, and parse_tok_func determines how each spaCy token object should be converted to a string. We access the doctable keep_tok and parse_tok methods using the same Comp function, which lets us create nested parameter lists.
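The nesting described above can be sketched as a tokenize function that accepts a filter function and a conversion function, each configured separately. The Tok class and the keep_stop, keep_punct, and lowercase parameters are hypothetical choices for this illustration (is_stop and is_punct mirror attribute names found on spaCy tokens); the actual doctable parameters may differ.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, List


@dataclass
class Tok:
    """Hypothetical token with two flags named after spaCy token attributes."""
    text: str
    is_stop: bool = False
    is_punct: bool = False


def keep_tok(tok: Tok, keep_stop: bool = True, keep_punct: bool = True) -> bool:
    """Decide whether a token belongs in the output."""
    if tok.is_stop and not keep_stop:
        return False
    if tok.is_punct and not keep_punct:
        return False
    return True


def parse_tok(tok: Tok, lowercase: bool = False) -> str:
    """Decide how a kept token is rendered as a string."""
    return tok.text.lower() if lowercase else tok.text


def tokenize(doc: List[Tok],
             keep_tok_func: Callable[[Tok], bool] = keep_tok,
             parse_tok_func: Callable[[Tok], str] = parse_tok) -> List[str]:
    return [parse_tok_func(tok) for tok in doc if keep_tok_func(tok)]


doc = [Tok("The", is_stop=True), Tok("Cat"), Tok("sat"), Tok(".", is_punct=True)]

# Nested configuration: each inner function gets its own parameters.
tokens = tokenize(
    doc,
    keep_tok_func=partial(keep_tok, keep_stop=False, keep_punct=False),
    parse_tok_func=partial(parse_tok, lowercase=True),
)
print(tokens)  # ['cat', 'sat']
```

Separating the "keep or drop" decision from the "how to render" decision keeps each function small and lets you swap either one without touching the other.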

These are the fundamentals of building ParsePipelines in doctable. While these tools are totally optional, I believe they make it easier to structure your code for text analysis applications.