Vignette 3: Storing Parsed Documents
Here I'll show how to build a DocTable for storing NSS documents at the paragraph level, and how to parse the documents into parsetrees for storage. For context, check out Example 1 - here we'll just reuse some shortcuts for code introduced there, which come from util.py in the repo's examples folder.
import sys
sys.path.append('..')
#import util
import doctable
import spacy
from tqdm import tqdm
# automatically clean up temp folder after python ends
import tempfile
tempdir = tempfile.TemporaryDirectory()
tmpfolder = tempdir.name
tmpfolder
'/tmp/tmp1isfmada'
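As a side note, tempfile.TemporaryDirectory removes the folder and everything in it when .cleanup() is called or the object is finalized, which is why we can use it as a scratch location for the database and pickle files without cleaning up manually. A minimal stdlib illustration:

```python
import os
import tempfile

# the folder exists while the TemporaryDirectory object is alive
td = tempfile.TemporaryDirectory()
path = td.name
print(os.path.isdir(path))  # True

# cleanup() deletes the folder and its contents (also happens at finalization)
td.cleanup()
print(os.path.isdir(path))  # False
```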
First we define the metadata and download the text data.
import urllib.request

def download_nss(year):
    '''Simple helper function for downloading texts from my nssdocs repo.'''
    baseurl = 'https://raw.githubusercontent.com/devincornell/nssdocs/master/docs/{}.txt'
    url = baseurl.format(year)
    text = urllib.request.urlopen(url).read().decode('utf-8')
    return text
document_metadata = [
    {'year': 2000, 'party': 'D', 'president': 'Clinton'},
    {'year': 2006, 'party': 'R', 'president': 'W. Bush'},
    {'year': 2015, 'party': 'D', 'president': 'Obama'},
    {'year': 2017, 'party': 'R', 'president': 'Trump'},
]

sep = '\n\n'
first_n = 10 # keep only the first 10 paragraphs of each document
for md in document_metadata:
    text = download_nss(md['year'])
    md['text'] = sep.join(text.split(sep)[:first_n])
print(f"{len(document_metadata[0]['text'])=}")
len(document_metadata[0]['text'])=6695
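The truncation line above splits each document on the double-newline paragraph separator and keeps only the first first_n paragraphs. A minimal illustration of that idiom:

```python
# split on the paragraph separator, keep the first n paragraphs, and rejoin
sep = '\n\n'
first_n = 2

text = 'First paragraph.\n\nSecond paragraph.\n\nThird paragraph.'
truncated = sep.join(text.split(sep)[:first_n])
print(truncated)
# First paragraph.
#
# Second paragraph.
```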
1. Define the DocTable Schema
Now we define a doctable schema using the doctable.schema class decorator and a pickle file column type so that the parsetrees can be stored as binary data on disk.
# to be used as a database row representing a single NSS document
@doctable.schema
class NSSDoc:
    __slots__ = [] # include so that doctable.schema can create a slot class
    id: int = doctable.IDCol() # this is an alias for doctable.Col(primary_key=True, autoincrement=True)
    year: int = doctable.Col()
    party: str = doctable.Col()
    president: str = doctable.Col()
    text: str = doctable.Col()
    doc: doctable.ParseTreeDoc = doctable.ParseTreeFileCol(f'{tmpfolder}/parsetree_pickle_files')
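The idea behind a file column is that large binary payloads live as pickle files on disk while the database row stores only a reference to them. The following is a rough conceptual sketch of that pattern using only the standard library - it is not doctable's actual implementation, and the table layout and helper names here are made up for illustration:

```python
# sketch of the "file column" idea: pickle the object to its own file
# and keep only the file path in the database row
import os
import pickle
import sqlite3
import tempfile

folder = tempfile.mkdtemp()
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, doc_path TEXT)')

def insert_doc(doc_id, parsed_obj):
    path = os.path.join(folder, f'{doc_id}.pkl')
    with open(path, 'wb') as f:
        pickle.dump(parsed_obj, f)  # large object goes to disk, not the db
    conn.execute('INSERT INTO docs VALUES (?, ?)', (doc_id, path))

def read_doc(doc_id):
    path, = conn.execute('SELECT doc_path FROM docs WHERE id=?', (doc_id,)).fetchone()
    with open(path, 'rb') as f:
        return pickle.load(f)  # load the object back transparently

insert_doc(1, [['fake', 'parsetree', 'tokens']])
print(read_doc(1))
# [['fake', 'parsetree', 'tokens']]
```

This keeps the database file small and fast to query while still letting selects return the full parsed object.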
And a class to represent an NSS DocTable.
class NSSDocTable(doctable.DocTable):
    _tabname_ = 'nss_documents'
    _schema_ = NSSDoc

nss_table = NSSDocTable(target=f'{tmpfolder}/nss_3.db', new_db=True)
print(nss_table.count())
nss_table.schema_table()
0
/DataDrive/code/doctable/examples/../doctable/doctable.py:402: UserWarning: Method .count() is depricated. Please use .q.count() instead.
warnings.warn('Method .count() is depricated. Please use .q.count() instead.')
| | name | type | nullable | default | autoincrement | primary_key |
|---|---|---|---|---|---|---|
| 0 | id | INTEGER | False | None | auto | 1 |
| 1 | year | INTEGER | True | None | auto | 0 |
| 2 | party | VARCHAR | True | None | auto | 0 |
| 3 | president | VARCHAR | True | None | auto | 0 |
| 4 | text | VARCHAR | True | None | auto | 0 |
| 5 | doc | VARCHAR | True | None | auto | 0 |
for md in document_metadata:
    nss_table.insert(md)
nss_table.head()
/DataDrive/code/doctable/examples/../doctable/doctable.py:364: UserWarning: Method .insert() is depricated. Please use .q.insert_single(), .q.insert_single_raw(), .q.insert_multi(), or .q.insert_multi() instead.
warnings.warn('Method .insert() is depricated. Please use .q.insert_single(), '
/DataDrive/code/doctable/examples/../doctable/doctable.py:390: UserWarning: .insert_single() is depricated: please use .q.insert_single() or .q.insert_single_raw()
warnings.warn(f'.insert_single() is depricated: please use .q.insert_single() or '
/DataDrive/code/doctable/examples/../doctable/doctable.py:407: UserWarning: Method .head() is depricated. Please use .q.select_head() instead.
warnings.warn('Method .head() is depricated. Please use .q.select_head() instead.')
/DataDrive/code/doctable/examples/../doctable/connectengine.py:69: SAWarning: TypeDecorator ParseTreeDocFileType() will not produce a cache key because the ``cache_ok`` attribute is not set to True. This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions. Set this attribute to True if this type object's state is safe to use in a cache key, or False to disable this warning. (Background on this error at: https://sqlalche.me/e/14/cprf)
return self._engine.execute(query, *args, **kwargs)
| | id | year | party | president | text | doc |
|---|---|---|---|---|---|---|
| 0 | 1 | 2000 | D | Clinton | As we enter the new millennium, we are blessed... | None |
| 1 | 2 | 2006 | R | W. Bush | My fellow Americans, \n\nAmerica is at war. Th... | None |
| 2 | 3 | 2015 | D | Obama | Today, the United States is stronger and bette... | None |
| 3 | 4 | 2017 | R | Trump | An America that is safe, prosperous, and free ... | None |
2. Create a Parser Class Using a Pipeline
Now we create a small NSSParser class that holds a doctable.ParsePipeline object for doing the actual text processing. As you can see from the init method, instantiating the class loads a spacy model into memory and constructs the pipeline from the selected components. We also create a thin wrapper over the pipeline's .parse method. Here we define, instantiate, and view the components of NSSParser.
class NSSParser:
    '''Handles text parsing for NSS documents.'''
    def __init__(self):
        nlp = spacy.load('en_core_web_sm')

        # this determines all settings for tokenizing
        self.pipeline = doctable.ParsePipeline([
            nlp, # first run spacy parser
            doctable.Comp('merge_tok_spans', merge_ents=True),
            doctable.Comp('get_parsetrees', **{
                'text_parse_func': doctable.Comp('parse_tok', **{
                    'format_ents': True,
                    'num_replacement': 'NUM',
                })
            })
        ])

    def parse(self, text):
        return self.pipeline.parse(text)

parser = NSSParser() # creates a parser instance
parser.pipeline.components
[<spacy.lang.en.English at 0x7fedee1c2cd0>,
functools.partial(<function merge_tok_spans at 0x7fedf2d8f040>, merge_ents=True),
functools.partial(<function get_parsetrees at 0x7fedf2d8f1f0>, text_parse_func=functools.partial(<function parse_tok at 0x7fedf2d82ee0>, format_ents=True, num_replacement='NUM'))]
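As the component list suggests, a parse pipeline is conceptually just a sequence of callables where each component's output feeds the next. A minimal stdlib sketch of that idea (the component functions here are invented for illustration, not doctable components):

```python
# a pipeline is callables applied in order: each output feeds the next stage
def lowercase(text):
    return text.lower()

def tokenize(text):
    return text.split()

def drop_short(tokens):
    return [t for t in tokens if len(t) > 2]

components = [lowercase, tokenize, drop_short]

def run_pipeline(components, value):
    for comp in components:
        value = comp(value)
    return value

print(run_pipeline(components, 'The United States IS stronger'))
# ['the', 'united', 'states', 'stronger']
```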
Now we parse the text of each document and store the resulting parsetrees back into the table.
for doc in tqdm(nss_table.select(['id','year','text'])):
    parsed = parser.parse(doc.text)
    print(nss_table['doc'])
    nss_table.update({'doc': parsed}, where=nss_table['id']==doc.id, verbose=True)
/DataDrive/code/doctable/examples/../doctable/doctable.py:443: UserWarning: Method .select() is depricated. Please use .q.select() instead.
warnings.warn('Method .select() is depricated. Please use .q.select() instead.')
0%| | 0/4 [00:00<?, ?it/s]/DataDrive/code/doctable/examples/../doctable/doctable.py:489: UserWarning: Method .update() is depricated. Please use .q.update() instead.
warnings.warn('Method .update() is depricated. Please use .q.update() instead.')
25%|███████████████████████████████████▌ | 1/4 [00:00<00:00, 3.83it/s]
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4.89it/s]
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
nss_documents.doc
DocTable: UPDATE nss_documents SET doc=? WHERE nss_documents.id = ?
3. Work With Parsetrees
Now that the parsed documents are stored as pickle files referenced by the database, we can work with the parsetrees directly. This example shows the 5 most common nouns in each national security strategy document, which is possible because the doctable.ParseTree data structures retain the part-of-speech (pos) information originally provided by the spacy parser. Using the file-based column type lets us store the pickled binary data efficiently, so we can perform these kinds of analyses at scale.
from collections import Counter # used to count tokens

for nss in nss_table.select():
    noun_counts = Counter([tok.text for pt in nss.doc for tok in pt if tok.pos == 'NOUN'])
    print(f"{nss.president} ({nss.year}): {noun_counts.most_common(5)}")
Clinton (2000): [('world', 9), ('security', 9), ('prosperity', 7), ('threats', 5), ('efforts', 5)]
W. Bush (2006): [('people', 4), ('world', 3), ('war', 2), ('security', 2), ('strategy', 2)]
Obama (2015): [('security', 15), ('world', 9), ('opportunities', 7), ('strength', 7), ('challenges', 7)]
Trump (2017): [('government', 5), ('principles', 4), ('peace', 3), ('people', 3), ('world', 3)]
/DataDrive/code/doctable/examples/../doctable/connectengine.py:69: SAWarning: TypeDecorator ParseTreeDocFileType() will not produce a cache key because the ``cache_ok`` attribute is not set to True. This can have significant performance implications including some performance degradations in comparison to prior SQLAlchemy versions. Set this attribute to True if this type object's state is safe to use in a cache key, or False to disable this warning. (Background on this error at: https://sqlalche.me/e/14/cprf)
return self._engine.execute(query, *args, **kwargs)
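The counting pattern above is plain collections.Counter: tally the filtered tokens, then take the few most frequent. For reference:

```python
from collections import Counter

# count tokens and take the most frequent few, as in the noun counts above
tokens = ['security', 'world', 'security', 'threats', 'world', 'security']
counts = Counter(tokens)
print(counts.most_common(2))
# [('security', 3), ('world', 2)]
```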
Definitely check out this example on parsetreedocs if you're interested in more applications.
And that is all for this vignette! See the list of vignettes at the top of this page for more examples.