DocTable Schemas

Your database table column names and types come from a schema class defined using the @doctable.schema decorator. In addition to providing a schema definition, this class can be used to encapsulate data when inserting or retrieving from the database.

At its most basic, your schema class operates like a dataclass that uses slots for efficiency and allows for custom methods that will not affect the database schema.

Introduction

This is an example of a basic doctable schema. Note the use of the decorator @doctable.schema, the inclusion of __slots__ = [], and the type hints of the member variables - I will explain each of these later in this document.

This class represents a database schema that includes two columns: name (an int) and age (a str).

The schema class definition is then provided to the doctable constructor to create the database table. Here we create an in-memory sqlite table and show the schema resulting from our custom class. Note that doctable automatically inferred that name should be a VARCHAR and age should be an INTEGER based on the provided type hints.

We can also use the schema class to insert data into our DocTable. We simply create a new Record and pass it to the DocTable.insert() method. Using .head(), we see the contents of the database so far. Note that you may also pass a dictionary to insert data - this is just one way of inserting data.

And perhaps more usefully, we can use it to encapsulate results from .select() queries. Note that the returned object is exactly the same as the one we put in. Slot classes are more memory-efficient than dictionaries for storing data, but there is cpu time overhead from inserting that data into the slots.

But, of course, the data can be returned in its raw format by passing the parameter as_dataclass=False.

The doctable.schema Decorator

The @doctable.schema decorator does the work to convert your custom class into a schema class. It transforms your schema class in three ways:

  1. create slots: First, slot variable names will be added to __slots__ automatically based on the fields in your class definition. This is why the default functionality requires you to add __slots__ = [] with no variable names. You may also turn slots off by passing require_slots=False to the decorator (i.e. @doctable.schema(require_slots=False)), otherwise an exception will be raised.

  2. convert to dataclass: Second, your schema class will be converted to a dataclass that generates __init__, __repr__, and other boilerplate methods meant for classes that primarily store data. Any keyword arguments passed to the schema decorator, with the exception of require_slots, will be passed directly to the @dataclasses.dataclass decorator so you have control over the dataclass definition.

  3. inherit from DocTableSchema: Lastly, your schema class will inherit from doctable.DocTableSchema, which provides additional accessors that are used for storage in a DocTable and fine-grained control over retreived data. More on this later.

Column names and types will be inferred from the type hints in your schema class definition. Because DocTable is built on sqlalchemy core, all fields will eventually be converted to sqlalchemy column objects and added to the DocTable metadata. This table shows the type mappings implemented in doctable:

For example, see this example of the most basic possible schema class that can be used to create a doctable. We use static defaulted parameters and type hints including str, int, datetime, and Any, which you can see are converted to VARCHAR, INTEGER, DATETIME, and BLOB column types, respectively. BLOB was used because the provided type hint Any has no entry in the above table.

You can see that this class operates much like a regular dataclass with slots. Thus, these defaulted parameters are applied in the constructor of the schema class, and NOT as the default value in the database schema.

Use doctable.Col For More Control Over Schema Creation

Using doctable.Col() as a default value in the schema class definition can give you more control over schema definitions.

Firstly, this function returns a dataclass field object that can be used to set parameters like default_factory or compare as used by the dataclass. Pass arguments meant for field through the Col parameter field_kwargs=dict(..). Other data passed to Col will be used to create the DocTable schema, which is stored as metadata inside the field.

This example shows how Col can be used to set some parameters meant for field. These will affect your schema class behavior without affecting the produced DocTable schema.

Col also allows you to explicitly specify a column type using a string, sqlalchemy type definition, or sqlalchemy instance passed to column_type. You can then pass arguments meant for the sqlalchemy type constructor through type_kwargs. You may also use type_kwargs with the column type inferred from the type hint.

A full list of string -> sqlalchemy type mappings is shown below:

Finally, Col allows you to pass keyword arguments directly to the sqlalchemy Column constructor. This includes flags like primary_key or default, which are both used to construct the database schema but do not affect the python dataclass. Note that I recreated the classic id column below.

I also included some shortcut Col functions like IDCol, AddedCol, and UpdatedCol - see below.

In this way, Col allows you to give fine-grained control to both the schema class behavior and the sql schema definition.

Working With Schema Objects

Using Col default parameters also has some additional side effects, primarily due to the inherited class DocTableSchema. Among other things, the Col method defines the default dataclass value to be a doctable.EmptyValue() object, which is essentially a placeholder for data that was not inserted into the class upon construction. The __repr__ defined in DocTableSchema dictates that member objects containing this value not appear when printing the class, and furthermore, member variables with the value EmptyValue() will not be provided in the database insertion. This means that the database schema is allowed to use its own default value - an effect which is most obviously useful when inserting an object that does not have an id or other automatically provided values.

The example below shows the new_record.id contains EmptyValue() as a default, and that the id column is not included in the insert query - only name.

Yet when we go to retrieve the inserted data, we can see that the value has been replaced by the defaulted value in the database. This is a useful feature if your pipeline involves the insertion of schema objects directly (as opposed to inserting dictionaries for each row).

The EmptyValue() feature is also useful when issuing select queries involving only a subset of columns. See here we run a select query where we just retrieve the name data, yet the result is still stored in a Record object.

To avoid working with EmptyValue() objects directly, it is recommended that you use the __getitem__ string subscripting to access column data. When using this subscript, the schema object will raise an exception if the returned value is an EmptyValue().

Indices and Constraints

Indices and constraints are provided to the DocTable constructor or definition, as it is not part of the schema class. Here I create custom schema and table definitions where the table has some defined indices and constraints. doctable.Index is really just a direct reference to sqlalchemy.Index, and doctable.Constraint is a mapping to an sqlalchemy constraint type, with the first argument indicating which one.

And we can see that the constraints are working when we try to insert a record where age is less than 1.

This is a full list of the mappings between constraint names and the associated sqlalchemy objects.

Conclusions

In this guide, I tried to show some exmaples and give explanations for the ways that schema classes can be used to create doctables. The design is fairly efficent and flexible, and brings a more object-focused approach compared to raw sql queries without the overhead of ORM.