Vignette 1: Storing Document Metadata
In this example, I'll show how to create and manipulate two linked tables for storing document metadata using US National Security Strategy document metadata as an example.
These are the vignettes I have created:
import sys
sys.path.append('..')
import doctable
import spacy
from tqdm import tqdm
import pandas as pd
import os
from pprint import pprint
import urllib.request # used for downloading nss docs
# automatically clean up temp folder after python ends
#tmpfolder = doctable.TempFolder('tmp')
import tempfile
tempdir = tempfile.TemporaryDirectory()
tmpfolder = tempdir.name
tmpfolder
'/tmp/tmplxxguo16'
Introduction to NSS Corpus
This dataset is the plain text version of the US National Security Strategy documents. I compiled the metadata you see below from a page hosted by the historical dept of the secretary's office. In short, each US President must release at least one NSS per term, up to one per-year. This is the metadata we will be inserting into the table:
# information about each NSS document
document_metadata = [
{'year': 2000, 'party': 'D', 'president': 'Clinton'},
{'year': 2002, 'party': 'R', 'president': 'W. Bush'},
{'year': 2006, 'party': 'R', 'president': 'W. Bush'},
{'year': 2010, 'party': 'D', 'president': 'Obama'},
{'year': 2015, 'party': 'D', 'president': 'Obama'},
{'year': 2017, 'party': 'R', 'president': 'Trump'},
]
Create database schemas
The first step will be to define a database schema that is appropriate for the data in document_metadata
. We define an NSSDoc
class to represent a single document. The doctable.schema
decorator will convert the row objects into dataclasses
with slots enabled, and inherit from doctable.DocTableRow to add some additional functionality. The type hints associated with each variable will be used in the schema definition for the new tables, and arguments to doctable.Col
will mostly be passed to dataclasses.field
(see docs for more detail), so all dataclass functionality is maintained.
Also note that a method called .is_old()
was defined. This method will not be included in a database schema, but I'll show later how it can be useful.
# to be used as a database row representing a single NSS document
@doctable.schema
class NSSDoc:
__slots__ = [] # include so that doctable.schema can create a slot class
id: int = doctable.Col(primary_key=True, autoincrement=True) # can also use doctable.IDCol() as a shortcut
year: int = None
party: str = None
president: str = None
def is_old(self):
'''Return whether the document is old or not.'''
return self.year < 2010
We can see that these are regular dataclass methods because their constructors are defined. Note that the dataclass defaults the values to None, so take note of this when inserting or retrieving from a database.
NSSDoc(year=1999)
NSSDoc(year=1999, party=None, president=None)
And we will also likely want to create a class that inherits from DocTable
to statically define the table name, schema object, and any indices or constraints that should be associated with our table. We set the table name and the schema definition class using the reserved member variables _tabname_
and _schema_
, respectively. Note that the NSSDoc
class is provided as the schema.
We also can use this definition to create indices and constraints using the _indices_
and _constraints_
member variables. The indices are provided as name->columns pairs, and the constraints are tuples of the form (constraint_type, constraint_details)
. In this case, we limit the values for check
to R or D.
class NSSDocTable(doctable.DocTable):
_tabname_ = 'nss_documents'
_schema_ = NSSDoc
_indices_ = (
doctable.Index('party_index', 'party'),
)
_constraints_ = (
doctable.Constraint('check', 'party in ("R", "D")'), # party can only take on values R or D.
)
And then we create an instance of the NSSDocTable
table using DocTable
\'s default constructor. We set target=f'{tmp}/nss_1.db'
to indicate we want to access an sqlite database at that path. We also use the new_db=True
to indicate that the database does not exist, so we should create a new one.
fname = f'{tmpfolder}/nss_1.db'
# clean up any old databases
try:
os.remove(fname)
except:
pass
docs_table = NSSDocTable(target=fname, new_db=True)
docs_table
<__main__.NSSDocTable at 0x7f2a55b9b400>
We can use .schema_table()
to see information about the database schema. Note that doctable inferred column types based on the type hints.
docs_table.schema_table()
name | type | nullable | default | autoincrement | primary_key | |
---|---|---|---|---|---|---|
0 | id | INTEGER | False | None | auto | 1 |
1 | year | INTEGER | True | None | auto | 0 |
2 | party | VARCHAR | True | None | auto | 0 |
3 | president | VARCHAR | True | None | auto | 0 |
We are now ready to insert data into the new table. We simply add each document as a dictionary, and show the first n
rows using .head()
.
docs_table.delete() # remove old entries if needed
for doc in document_metadata:
print(doc)
docs_table.insert(doc)
docs_table.head()
{'year': 2000, 'party': 'D', 'president': 'Clinton'}
{'year': 2002, 'party': 'R', 'president': 'W. Bush'}
{'year': 2006, 'party': 'R', 'president': 'W. Bush'}
{'year': 2010, 'party': 'D', 'president': 'Obama'}
{'year': 2015, 'party': 'D', 'president': 'Obama'}
{'year': 2017, 'party': 'R', 'president': 'Trump'}
/DataDrive/code/doctable/examples/../doctable/doctable.py:494: UserWarning: Method .delete() is depricated. Please use .q.delete() instead.
warnings.warn('Method .delete() is depricated. Please use .q.delete() instead.')
/DataDrive/code/doctable/examples/../doctable/doctable.py:364: UserWarning: Method .insert() is depricated. Please use .q.insert_single(), .q.insert_single_raw(), .q.insert_multi(), or .q.insert_multi() instead.
warnings.warn('Method .insert() is depricated. Please use .q.insert_single(), '
/DataDrive/code/doctable/examples/../doctable/doctable.py:390: UserWarning: .insert_single() is depricated: please use .q.insert_single() or .q.insert_single_raw()
warnings.warn(f'.insert_single() is depricated: please use .q.insert_single() or '
/DataDrive/code/doctable/examples/../doctable/doctable.py:407: UserWarning: Method .head() is depricated. Please use .q.select_head() instead.
warnings.warn('Method .head() is depricated. Please use .q.select_head() instead.')
id | year | party | president | |
---|---|---|---|---|
0 | 1 | 2000 | D | Clinton |
1 | 2 | 2002 | R | W. Bush |
2 | 3 | 2006 | R | W. Bush |
3 | 4 | 2010 | D | Obama |
4 | 5 | 2015 | D | Obama |
We can verify that the constraint was defined by attempting to insert a row with an unknown party code.
import sqlalchemy
try:
docs_table.insert({'party':'whateva'})
except sqlalchemy.exc.IntegrityError as e:
print(e)
/DataDrive/code/doctable/examples/../doctable/doctable.py:364: UserWarning: Method .insert() is depricated. Please use .q.insert_single(), .q.insert_single_raw(), .q.insert_multi(), or .q.insert_multi() instead.
warnings.warn('Method .insert() is depricated. Please use .q.insert_single(), '
/DataDrive/code/doctable/examples/../doctable/doctable.py:390: UserWarning: .insert_single() is depricated: please use .q.insert_single() or .q.insert_single_raw()
warnings.warn(f'.insert_single() is depricated: please use .q.insert_single() or '
And we can use all the expected select (see select examples) methods.
democrats = docs_table.select(where=docs_table['party']=='D')
democrats
/DataDrive/code/doctable/examples/../doctable/doctable.py:443: UserWarning: Method .select() is depricated. Please use .q.select() instead.
warnings.warn('Method .select() is depricated. Please use .q.select() instead.')
[NSSDoc(id=1, year=2000, party='D', president='Clinton'),
NSSDoc(id=4, year=2010, party='D', president='Obama'),
NSSDoc(id=5, year=2015, party='D', president='Obama')]
clinton_doc = docs_table.select_first(where=docs_table['president']=='Clinton')
clinton_doc
/DataDrive/code/doctable/examples/../doctable/doctable.py:426: UserWarning: Method .select_first() is depricated. Please use .q.select_first() instead.
warnings.warn('Method .select_first() is depricated. Please use .q.select_first() instead.')
NSSDoc(id=1, year=2000, party='D', president='Clinton')
Along with the methods we defined on the schema objects.
clinton_doc.is_old()
True
Adding political party data
Of course, relational database schemas often involve the use of more than one linked table. Now we'll attempt to integrate the data in party_metadata
into our schema.
# full name of party (we will use later)
party_metadata = [
{'code': 'R', 'name': 'Republican'},
{'code': 'D', 'name': 'Democrat'},
]
First, we create the Party
dataclass just as before.
# to be used as a database row representing a single political party
@doctable.schema
class Party:
__slots__ = []
id: int = doctable.Col(primary_key=True, autoincrement=True) # can also use doctable.IDCol() as a shortcut
code: str = None
name: str = None
And then define a DocTable
with a 'foreignkey' constraint that indicates it\'s relationship to the document table. We can use the reference to the "party" column using nss_documents.party
.
class PartyTable(doctable.DocTable):
_tabname_ = 'political_parties'
_schema_ = Party
_indices_ = {
doctable.Index('code_index', 'code')
}
_constraints_ = (
doctable.Constraint('foreignkey', ('code',), ('nss_documents.party',)),
)
party_table = PartyTable(target=fname)
party_table
<__main__.PartyTable at 0x7f2a55afa310>
party_table.delete() # remove old entries if needed
for party in party_metadata:
print(party)
party_table.insert(party)
party_table.head()
{'code': 'R', 'name': 'Republican'}
{'code': 'D', 'name': 'Democrat'}
/DataDrive/code/doctable/examples/../doctable/doctable.py:494: UserWarning: Method .delete() is depricated. Please use .q.delete() instead.
warnings.warn('Method .delete() is depricated. Please use .q.delete() instead.')
/DataDrive/code/doctable/examples/../doctable/doctable.py:407: UserWarning: Method .head() is depricated. Please use .q.select_head() instead.
warnings.warn('Method .head() is depricated. Please use .q.select_head() instead.')
id | code | name | |
---|---|---|---|
0 | 1 | R | Republican |
1 | 2 | D | Democrat |
Performing "join" select queries
In contrast to sql, the type of join is inferred from the way the select query is used. Using a select
method with columns for both tables will issue an outer join in lieu of other parameters. Also note that we must use as_dataclass
to indicate the data should not use a dataclass for the results, since joined results includes fields from both
party_table.select(['name', docs_table['president']], as_dataclass=False)
/DataDrive/code/doctable/examples/../doctable/doctable.py:443: UserWarning: Method .select() is depricated. Please use .q.select() instead.
warnings.warn('Method .select() is depricated. Please use .q.select() instead.')
/DataDrive/code/doctable/examples/../doctable/doctable.py:445: UserWarning: The "as_dataclass" parameter has been depricated: please set get_raw=True or select_raw to specify that you would like to retrieve a raw RowProxy pobject.
warnings.warn(f'The "as_dataclass" parameter has been depricated: please set get_raw=True or '
/DataDrive/code/doctable/examples/../doctable/connectengine.py:69: SAWarning: SELECT statement has a cartesian product between FROM element(s) "nss_documents" and FROM element "political_parties". Apply join condition(s) between each element to resolve.
return self._engine.execute(query, *args, **kwargs)
/DataDrive/code/doctable/examples/../doctable/doctable.py:453: UserWarning: Conversion from row to object failed according to the following error. Please use .q.select_raw() next time in the future to avoid this issue. e=RowDataConversionFailed("Conversion from <class 'sqlalchemy.engine.row.LegacyRow'> to <class '__main__.Party'> failed.")
warnings.warn(f'Conversion from row to object failed according to the following '
[('Republican', 'Clinton'),
('Republican', 'W. Bush'),
('Republican', 'W. Bush'),
('Republican', 'Obama'),
('Republican', 'Obama'),
('Republican', 'Trump'),
('Republican', None),
('Democrat', 'Clinton'),
('Democrat', 'W. Bush'),
('Democrat', 'W. Bush'),
('Democrat', 'Obama'),
('Democrat', 'Obama'),
('Democrat', 'Trump'),
('Democrat', None)]
To perform an inner join, use a where conditional indicating the columns to be matched.
docs_table.select(['year', 'president', party_table['name']], as_dataclass=False, where=docs_table['party']==party_table['code'])
/DataDrive/code/doctable/examples/../doctable/doctable.py:453: UserWarning: Conversion from row to object failed according to the following error. Please use .q.select_raw() next time in the future to avoid this issue. e=RowDataConversionFailed("Conversion from <class 'sqlalchemy.engine.row.LegacyRow'> to <class '__main__.NSSDoc'> failed.")
warnings.warn(f'Conversion from row to object failed according to the following '
[(2000, 'Clinton', 'Democrat'),
(2002, 'W. Bush', 'Republican'),
(2006, 'W. Bush', 'Republican'),
(2010, 'Obama', 'Democrat'),
(2015, 'Obama', 'Democrat'),
(2017, 'Trump', 'Republican')]
And this works approximately the same when we switch the tables being selected.
party_table.select(['code', 'name', docs_table['president']], as_dataclass=False, where=docs_table['party']==party_table['code'])
[('R', 'Republican', 'Trump'),
('R', 'Republican', 'W. Bush'),
('R', 'Republican', 'W. Bush'),
('D', 'Democrat', 'Clinton'),
('D', 'Democrat', 'Obama'),
('D', 'Democrat', 'Obama')]
And that is all for this vignette! See the list of vignettes at the top of this page for more examples.