Vignette 3: Storing Parsed Documents
Here I'll show how to build a DocTable for storing NSS documents at the paragraph level, and how to parse the documents into parsetrees for storage. For context, check out Example 1 - here we'll just reuse some shortcuts for code introduced there, which come from util.py in the repo's examples folder.
import sys
sys.path.append('..')
#import util
import doctable
import spacy
from tqdm import tqdm
# automatically clean up temp folder after python ends
import tempfile
tempdir = tempfile.TemporaryDirectory()
tmpfolder = tempdir.name
tmpfolder
'/tmp/tmp1isfmada'
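As a side note, tempfile.TemporaryDirectory removes the folder and everything in it when .cleanup() is called or the object is finalized, which is why we can use it as a scratch location for the database and pickle files without cleaning up manually. A minimal stdlib illustration:

```python
import os
import tempfile

# the folder exists while the TemporaryDirectory object is alive
td = tempfile.TemporaryDirectory()
path = td.name
print(os.path.isdir(path))  # True

# cleanup() deletes the folder and its contents (also happens at finalization)
td.cleanup()
print(os.path.isdir(path))  # False
```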
First we define the metadata and download the text data.
import urllib.request

def download_nss(year):
    '''Simple helper function for downloading texts from my nssdocs repo.'''
    baseurl = 'https://raw.githubusercontent.com/devincornell/nssdocs/master/docs/{}.txt'
    url = baseurl.format(year)
    text = urllib.request.urlopen(url).read().decode('utf-8')
    return text
document_metadata = [
    {'year': 2000, 'party': 'D', 'president': 'Clinton'},
    {'year': 2006, 'party': 'R', 'president': 'W. Bush'},
    {'year': 2015, 'party': 'D', 'president': 'Obama'},
    {'year': 2017, 'party': 'R', 'president': 'Trump'},
]

sep = '\n\n'
first_n = 10 # keep only the first 10 paragraphs of each document
for md in document_metadata:
    text = download_nss(md['year'])
    md['text'] = sep.join(text.split(sep)[:first_n])
print(f"{len(document_metadata[0]['text'])=}")
len(document_metadata[0]['text'])=6695
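The truncation line above splits each document on the double-newline paragraph separator and keeps only the first first_n paragraphs. A minimal illustration of that idiom:

```python
# split on the paragraph separator, keep the first n paragraphs, and rejoin
sep = '\n\n'
first_n = 2

text = 'First paragraph.\n\nSecond paragraph.\n\nThird paragraph.'
truncated = sep.join(text.split(sep)[:first_n])
print(truncated)
# First paragraph.
#
# Second paragraph.
```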
1. Define the DocTable Schema
Now we define a doctable schema using the doctable.schema class decorator and a pickle file column type so that the parsetrees can be stored as binary data on disk.
# to be used as a database row representing a single NSS document
@doctable.schema
class NSSDoc:
    __slots__ = [] # include so that doctable.schema can create a slot class
    id: int = doctable.IDCol() # this is an alias for doctable.Col(primary_key=True, autoincrement=True)
    year: int = doctable.Col()
    party: str = doctable.Col()
    president: str = doctable.Col()
    text: str = doctable.Col()
    doc: doctable.ParseTreeDoc = doctable.ParseTreeFileCol(f'{tmpfolder}/parsetree_pickle_files')
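The idea behind a file column is that large binary payloads live as pickle files on disk while the database row stores only a reference to them. The following is a rough conceptual sketch of that pattern using only the standard library - it is not doctable's actual implementation, and the table layout and helper names here are made up for illustration:

```python
# sketch of the "file column" idea: pickle the object to its own file
# and keep only the file path in the database row
import os
import pickle
import sqlite3
import tempfile

folder = tempfile.mkdtemp()
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, doc_path TEXT)')

def insert_doc(doc_id, parsed_obj):
    path = os.path.join(folder, f'{doc_id}.pkl')
    with open(path, 'wb') as f:
        pickle.dump(parsed_obj, f)  # large object goes to disk, not the db
    conn.execute('INSERT INTO docs VALUES (?, ?)', (doc_id, path))

def read_doc(doc_id):
    path, = conn.execute('SELECT doc_path FROM docs WHERE id=?', (doc_id,)).fetchone()
    with open(path, 'rb') as f:
        return pickle.load(f)  # load the object back transparently

insert_doc(1, [['fake', 'parsetree', 'tokens']])
print(read_doc(1))
# [['fake', 'parsetree', 'tokens']]
```

This keeps the database file small and fast to query while still letting selects return the full parsed object.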
And a class to represent an NSS DocTable.
class NSSDocTable(doctable.DocTable):
    _tabname_ = 'nss_documents'
    _schema_ = NSSDoc

nss_table = NSSDocTable(target=f'{tmpfolder}/nss_3.db', new_db=True)
print(nss_table.count())
nss_table.schema_table()
0
/DataDrive/code/doctable/examples/../doctable/doctable.py:402: UserWarning: Method .count() is depricated. Please use .q.count() instead.
warnings.warn('Method .count() is depricated. Please use .q.count() instead.')
| | name | type | nullable | default | autoincrement | primary_key |
|---|---|---|---|---|---|---|
| 0 | id | INTEGER | False | None | auto | 1 |
| 1 | year | INTEGER | True | None | auto | 0 |
| 2 | party | VARCHAR | True | None | auto | 0 |
| 3 | president | VARCHAR | True | None | auto | 0 |
| 4 | text | VARCHAR | True | None | auto | 0 |
| 5 | doc | VARCHAR | True | None | auto | 0 |
for md in document_metadata:
    nss_table.insert(md)
nss_table.head()
/DataDrive/code/doctable/examples/../doctable/doctable.py:364: UserWarning: Method .insert() is depricated. Please use .q.insert_single(), .q.insert_single_raw(), .q.insert_multi(), or .q.insert_multi() instead.
warnings.warn('Method .insert() is depricated. Please use .q.insert_single(), '
/DataDrive/code/doctable/examples/../doctable/doctable.py:390: UserWarning: .insert_single() is depricated: please use .q.insert_single() or .q.insert_single_raw()
warnings.warn(f'.insert_single() is depricated: please use .q.insert_single() or '
/DataDrive/code/doctable/examples/../doctable/doctable.py:407: UserWarning: Method .head() is depricated. Please use .q.select_head() instead.
warnings.warn('Method .head() is depricated. Please use .q.select_head() instead.')
/DataDrive/code/doctable/examples/../doctable/connectengine.py:69: SAWarning: TypeDecorator ParseTreeDocFileType() will not produce a cache key because the ``cache_ok`` attribute is not set to True. This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions. Set this attribute to True if this type object's state is safe to use in a cache key, or False to disable this warning. (Background on this error at: https://sqlalche.me/e/14/cprf)
return self._engine.execute(query, *args, **kwargs)
| | id | year | party | president | text | doc |
|---|---|---|---|---|---|---|
| 0 | 1 | 2000 | D | Clinton | As we enter the new millennium, we are blessed... | None |
| 1 | 2 | 2006 | R | W. Bush | My fellow Americans, \n\nAmerica is at war. Th... | None |
| 2 | 3 | 2015 | D | Obama | Today, the United States is stronger and bette... | None |
| 3 | 4 | 2017 | R | Trump | An America that is safe, prosperous, and free ... | None |
2. Create a Parser Class Using a Pipeline
Now we create a small NSSParser class that holds a doctable.ParsePipeline object for doing the actual text processing. As you can see from the init method, instantiating the class loads a spacy model into memory and constructs the pipeline from the selected components. We also create a thin wrapper over the pipeline's .parse method. Here we define, instantiate, and view the components of NSSParser.
class NSSParser:
    '''Handles text parsing for NSS documents.'''
    def __init__(self):
        nlp = spacy.load('en_core_web_sm')

        # this determines all settings for tokenizing
        self.pipeline = doctable.ParsePipeline([
            nlp, # first run spacy parser
            doctable.Comp('merge_tok_spans', merge_ents=True),
            doctable.Comp('get_parsetrees', **{
                'text_parse_func': doctable.Comp('parse_tok', **{
                    'format_ents': True,
                    'num_replacement': 'NUM',
                })
            })
        ])

    def parse(self, text):
        return self.pipeline.parse(text)

parser = NSSParser() # creates a parser instance
parser.pipeline.components
[<spacy.lang.en.English at 0x7fedee1c2cd0>,
functools.partial(<function merge_tok_spans at 0x7fedf2d8f040>, merge_ents=True),
functools.partial(<function get_parsetrees at 0x7fedf2d8f1f0>, text_parse_func=functools.partial(<function parse_tok at 0x7fedf2d82ee0>, format_ents=True, num_replacement='NUM'))]
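As the component list suggests, a parse pipeline is conceptually just a sequence of callables where each component's output feeds the next. A minimal stdlib sketch of that idea (the component functions here are invented for illustration, not doctable components):

```python
# a pipeline is callables applied in order: each output feeds the next stage
def lowercase(text):
    return text.lower()

def tokenize(text):
    return text.split()

def drop_short(tokens):
    return [t for t in tokens if len(t) > 2]

components = [lowercase, tokenize, drop_short]

def run_pipeline(components, value):
    for comp in components:
        value = comp(value)
    return value

print(run_pipeline(components, 'The United States IS stronger'))
# ['the', 'united', 'states', 'stronger']
```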
Now we parse the text of each document and store the resulting parsetrees back into the table.
for doc in tqdm(nss_table.select(['id','year','text'])):
    parsed = parser.parse(doc.text)
    print(nss_table['doc'])
    nss_table.update({'doc': parsed}, where=nss_table['id']==doc.id, verbose=True)
/DataDrive/code/doctable/examples/../doctable/doctable.py:443: UserWarning: Method .select() is depricated. Please use .q.select() instead.
warnings.warn('Method .select() is depricated. Please use .q.select() instead.')
0%| | 0/4 [00:00<?, ?it/s]/DataDrive/code/doctable/examples/../doctable/doctable.py:489: UserWarning: Method .update() is depricated. Please use .q.update() instead.
warnings.warn('Method .update() is depricated. Please use .q.update() instead.')
25%|███████████████████████████████████▌ | 1/4 [00:00<00:00, 3.83it/s]
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4.89it/s]
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
3. Work With Parsetrees
Now that the parsed documents are stored as pickle files referenced by the database, we can work with the parsetrees directly. This example shows the 5 most common nouns in each national security strategy document, which is possible because the doctable.ParseTree data structures retain the part-of-speech (pos) information originally provided by the spacy parser. Using the file-based column type lets us store the pickled binary data efficiently, so we can perform these kinds of analyses at scale.
from collections import Counter # used to count tokens

for nss in nss_table.select():
    noun_counts = Counter([tok.text for pt in nss.doc for tok in pt if tok.pos == 'NOUN'])
    print(f"{nss.president} ({nss.year}): {noun_counts.most_common(5)}")
Clinton (2000): [('world', 9), ('security', 9), ('prosperity', 7), ('threats', 5), ('efforts', 5)]
W. Bush (2006): [('people', 4), ('world', 3), ('war', 2), ('security', 2), ('strategy', 2)]
Obama (2015): [('security', 15), ('world', 9), ('opportunities', 7), ('strength', 7), ('challenges', 7)]
Trump (2017): [('government', 5), ('principles', 4), ('peace', 3), ('people', 3), ('world', 3)]
/DataDrive/code/doctable/examples/../doctable/connectengine.py:69: SAWarning: TypeDecorator ParseTreeDocFileType() will not produce a cache key because the ``cache_ok`` attribute is not set to True. This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions. Set this attribute to True if this type object's state is safe to use in a cache key, or False to disable this warning. (Background on this error at: https://sqlalche.me/e/14/cprf)
return self._engine.execute(query, *args, **kwargs)
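The counting pattern above is plain collections.Counter: tally the filtered tokens, then take the few most frequent. For reference:

```python
from collections import Counter

# count tokens and take the most frequent few, as in the noun counts above
tokens = ['security', 'world', 'security', 'threats', 'world', 'security']
counts = Counter(tokens)
print(counts.most_common(2))
# [('security', 3), ('world', 2)]
```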
Definitely check out this example on parsetreedocs if you're interested in more applications.
And that is all for this vignette! See the list of vignettes at the top of this page for more examples.