Skip to content

Dataclass Schema Example

In this vignette I'll show how to use a Python dataclass (introduced in Python 3.7) to specify a schema for a DocTable. The advantage of this schema format is that you can use custom classes to represent each row, and easily convert your existing python objects into a format that it easy to store in a sqlite database.

from datetime import datetime
from pprint import pprint
import pandas as pd
import sys
sys.path.append('..')
import doctable

Basic dataclass usage

For our first example, we show how a basic dataclass object can be used as a DocTable schema. First we create a python dataclass using the @dataclass decorator. This object has three members, each defaulted to None. We can create this object using the constructor provided by dataclass.

from dataclasses import dataclass
@doctable.schema
class User:
    __slots__ = []
    name: str = None
    age: int = None
    height: float = None
User()
User(name=None, age=None, height=None)

And it is relatively easy to create a new doctable using the schema provided by our dataclass User by providing the class definition to the schema argument. We can see that DocTable uses the dataclass schema to create a new table that follows the specified Python datatypes.

db = doctable.DocTable(schema=User, target=':memory:')
db.schema_table()
name type nullable default autoincrement primary_key
0 name VARCHAR True None auto 0
1 age INTEGER True None auto 0
2 height FLOAT True None auto 0

Now we insert several new objects into the table and view them using DocTable.head(). Note that the datbase actually inserted the object's defaulted values into the table.

db.insert([User('kevin'), User('tyrone', age=12), User('carlos', age=25, height=6.5)])
db.head()
name age height
0 kevin NaN NaN
1 tyrone 12.0 NaN
2 carlos 25.0 6.5

Using a normal select(), we can extract the results as the original objects. With no parameters, the select statement extracts all columns as they are stored and they exactly match the original data we entered. As expected from the python object, we can access these as properties of the object. Due to the base class doctable.DocTableRow, we can also access properties using the __getitem__ indexing. I'll show why there is a difference btween the two later.

users = db.select()
for user in users:
    print(f"{user.name}:\n\tage: {user.age}\n\theight: {user['height']}")
kevin:
    age: None
    height: None
tyrone:
    age: 12
    height: None
carlos:
    age: 25
    height: 6.5

Example using doctable.Col

In this example, we will show how to create a dataclass with functionality that supports more complicated database operations. A key to this approach is to use the doctable.Col function as default values for our parameters. Note that when we initialize the object, the default values of all columns except for name are set to EmptyValue. This is important, because EmptyValue will indicate values that are not meant to be inserted into the database or are not retrieved from the database after selecting.

@doctable.schema
class User:
    __slots__ = []
    name: str = doctable.Col()
    age: int = doctable.Col()
    height: float = doctable.Col()
User()
User()

Given that the type specifications are the same as the previous example, we get exactly the same database schema. We insert entries just as before. The User data contained EmptyValues, and so that column data was not presented to the database at all - instead, the schema's column defaults were used. Consistent with our schema (not the object defaults, the default values were set to None.

db = doctable.DocTable(schema=User, target=':memory:')
print(db.schema_table())
db.insert([User('kevin'), User('tyrone', age=12), User('carlos', age=25, height=6.5)])
for user in db.select():
    print(f"{user}")
     name     type  nullable default autoincrement  primary_key
0    name  VARCHAR      True    None          auto            0
1     age  INTEGER      True    None          auto            0
2  height    FLOAT      True    None          auto            0
User(name='kevin', age=None, height=None)
User(name='tyrone', age=None, height=None)
User(name='carlos', age=None, height=None)

Now let's try to select only a subset of the columns - in this case, 'name' and 'age'.

users = db.select(['name', 'age'])
users[0]
User(name='kevin', age=None)

Note that the user height was set to EmptyValue. When we try to access height as an index, we get an error indicating that the data was not retrived in the select statement.

try:
    users[0]['height']
except KeyError as e:
    print(e)
'The column "height" was not retreived in the select statement.'

On the contrary, if we try to access as an attribute, the actual EmptyValue object is retrieved. Object properties work as they always have, but indexing into columns will check for errors in the program logic. This implementation shows how dataclass schemas walk the line between regular python objects and database rows, and thus accessing these values can be done differently depending on how much the table entries should be treated like regular objects vs database rows. This is all determined based on how the dataclass columns are configured.

users[0].height
EmptyValue()

Special column types

Now I'll introduce two special data column types provided by doctable: IDCol(), which represents a regular id column in sqlite with autoindex and primary_key parameters set, and UpdatedCol(), which records the datetime that an object was added to the database. When we create a new user using the dataclass constructor, these values are set to EmptyValue, and are relevant primarily to the database. By setting the repr parameter in the @dataclass decorator, we can use the __repr__ of the DocTableRow base class, which hides EmptyValue columns. This is optional.

from dataclasses import field, fields
@doctable.schema(repr=False)
class User:
    __slots__ = []
    id: int = doctable.IDCol() # shortcut for autoindex, primary_key column.
    updated: datetime = doctable.UpdatedCol() # shortcut for automatically 

    name: str = doctable.Col(nullable=False)
    age: int = doctable.Col(None) # accessing sqlalchemy column keywords arguments

user = User(name='carlos', age=15)
user
User(name='carlos', age=15)

And we can see the relevance of those columns by inserting them into the database and selecting them again. You can see from the result of .head() that the primary key id and the updated columns were appropriately filled upon insertion. After selecting, these objects also contain valid values.

db = doctable.DocTable(schema=User, target=':memory:')
print(db.schema_table())
db.insert([User(name='kevin'), User(name='tyrone', age=12), User(name='carlos', age=25)])
db.head()
      name      type  nullable default autoincrement  primary_key
0       id   INTEGER     False    None          auto            1
1  updated  DATETIME      True    None          auto            0
2     name   VARCHAR     False    None          auto            0
3      age   INTEGER      True    None          auto            0
id updated name age
0 1 2021-07-21 19:03:04.550804 kevin NaN
1 2 2021-07-21 19:03:04.550809 tyrone 12.0
2 3 2021-07-21 19:03:04.550811 carlos 25.0

This was just an example of how regular Python dataclass objects can contain additional data which is relevant to the database, but which is otherwise unneeded. After retrieving from database, we can also use .update() to modify the entry.

user = db.select_first()
user.age = 10
db.update(user, where=db['id']==user['id'])
db.head()
id updated name age
0 1 2021-07-21 19:03:04.550804 kevin 10
1 2 2021-07-21 19:03:04.550809 tyrone 12
2 3 2021-07-21 19:03:04.550811 carlos 25

We can use the convenience function update_dataclass() to update a single row corresponding to the object.

user.age = 11
db.update_dataclass(user)
db.head()
id updated name age
0 1 2021-07-21 19:03:04.550804 kevin 11
1 2 2021-07-21 19:03:04.550809 tyrone 12
2 3 2021-07-21 19:03:04.550811 carlos 25