Vignette 1: Storing Document Metadata

In this example, I'll show how to create and manipulate two linked tables for storing document metadata using US National Security Strategy document metadata as an example.

These are the vignettes I have created:

Introduction to NSS Corpus

This dataset is the plain text version of the US National Security Strategy documents. I compiled the metadata you see below from a page hosted by the historical dept of the secretary's office. In short, each US President must release at least one NSS per term, up to one per-year. This is the metadata we will be inserting into the table:

Create database schemas

The first step will be to define a database schema that is appropriate for the data in document_metadata. We define an NSSDoc class to represent a single document. The doctable.schema decorator will convert the row objects into dataclasses with slots enabled, and inherit from doctable.DocTableRow to add some additional functionality. The type hints associated with each variable will be used in the schema definition for the new tables, and arguments to doctable.Col will mostly be passed to dataclasses.field (see docs for more detail), so all dataclass functionality is maintained.

Also note that a method called .is_old() was defined. This method will not be included in a database schema, but I'll show later how it can be useful.

We can see that these are regular dataclass methods because their constructors are defined. Note that the dataclass defaults the values to None, so take note of this when inserting or retrieving from a database.

And we will also likely want to create a class that inherits from DocTable to statically define the table name, schema object, and any indices or constraints that should be associated with our table. We set the table name and the schema definition class using the reserved member variables _tabname_ and _schema_, respectively. Note that the NSSDoc class is provided as the schema.

We also can use this definition to create indices and constraints using the _indices_ and _constraints_ member variables. The indices are provided as name->columns pairs, and the constraints are tuples of the form (constraint_type, constraint_details). In this case, we limit the values for check to R or D.

And then we create an instance of the NSSDocTable table using DocTable\'s default constructor. We set target='tmp/nss_1.db' to indicate we want to access an sqlite database at that path. We also use the new_db=True to indicate that the database does not exist, so we should create a new one.

We can use .schema_table() to see information about the database schema. Note that doctable inferred column types based on the type hints.

We are now ready to insert data into the new table. We simply add each document as a dictionary, and show the first n rows using .head().

We can verify that the constraint was defined by attempting to insert a row with an unknown party code.

And we can use all the expected select (see docs) methods.

Along with the methods we defined on the schema objects.

Adding political party data

Of course, relational database schemas often involve the use of more than one linked table. Now we'll attempt to integrate the data in party_metadata into our schema.

First, we create the Party dataclass just as before.

And then define a DocTable with a 'foreignkey' constraint that indicates it\'s relationship to the document table. We can use the reference to the "party" column using

Performing "join" select queries

In contrast to sql, the type of join is inferred from the way the select query is used. Using a select method with columns for both tables will issue an outer join in lieu of other parameters. Also note that we must use as_dataclass to indicate the data should not use a dataclass for the results, since joined results includes fields from both

To perform an inner join, use a where conditional indicating the columns to be matched.

And this works approximately the same when we switch the tables being selected.

And that is all for this vignette! See the list of vignettes at the top of this page for more examples.