Dataclass Schema Example
In this vignette I'll show how to use a Python dataclass (introduced in Python 3.7) to specify a schema for a DocTable. The advantage of this schema format is that you can use custom classes to represent each row, and easily convert your existing python objects into a format that it easy to store in a sqlite database.
from datetime import datetime
from pprint import pprint
import pandas as pd
import sys
sys.path.append('..')
import doctable
Basic dataclass usage
For our first example, we show how a basic dataclass object can be used as a DocTable schema. First we create a python dataclass using the @dataclass
decorator. This object has three members, each defaulted to None
. We can create this object using the constructor provided by dataclass
.
from dataclasses import dataclass
@doctable.schema
class User:
__slots__ = []
name: str = None
age: int = None
height: float = None
User()
User(name=None, age=None, height=None)
And it is relatively easy to create a new doctable using the schema provided by our dataclass User
by providing the class definition to the schema
argument. We can see that DocTable
uses the dataclass schema to create a new table that follows the specified Python datatypes.
db = doctable.DocTable(schema=User, target=':memory:')
db.schema_table()
name | type | nullable | default | autoincrement | primary_key | |
---|---|---|---|---|---|---|
0 | name | VARCHAR | True | None | auto | 0 |
1 | age | INTEGER | True | None | auto | 0 |
2 | height | FLOAT | True | None | auto | 0 |
Now we insert several new objects into the table and view them using DocTable.head()
. Note that the datbase actually inserted the object's defaulted values into the table.
db.insert([User('kevin'), User('tyrone', age=12), User('carlos', age=25, height=6.5)])
db.head()
name | age | height | |
---|---|---|---|
0 | kevin | NaN | NaN |
1 | tyrone | 12.0 | NaN |
2 | carlos | 25.0 | 6.5 |
Using a normal select()
, we can extract the results as the original objects. With no parameters, the select statement extracts all columns as they are stored and they exactly match the original data we entered. As expected from the python object, we can access these as properties of the object. Due to the base class doctable.DocTableRow
, we can also access properties using the __getitem__
indexing. I'll show why there is a difference btween the two later.
users = db.select()
for user in users:
print(f"{user.name}:\n\tage: {user.age}\n\theight: {user['height']}")
kevin:
age: None
height: None
tyrone:
age: 12
height: None
carlos:
age: 25
height: 6.5
Example using doctable.Col
In this example, we will show how to create a dataclass with functionality that supports more complicated database operations. A key to this approach is to use the doctable.Col
function as default values for our parameters. Note that when we initialize the object, the default values of all columns except for name
are set to EmptyValue
. This is important, because EmptyValue
will indicate values that are not meant to be inserted into the database or are not retrieved from the database after selecting.
@doctable.schema
class User:
__slots__ = []
name: str = doctable.Col()
age: int = doctable.Col()
height: float = doctable.Col()
User()
User()
Given that the type specifications are the same as the previous example, we get exactly the same database schema. We insert entries just as before. The User
data contained EmptyValue
s, and so that column data was not presented to the database at all - instead, the schema's column defaults were used. Consistent with our schema (not the object defaults, the default values were set to None.
db = doctable.DocTable(schema=User, target=':memory:')
print(db.schema_table())
db.insert([User('kevin'), User('tyrone', age=12), User('carlos', age=25, height=6.5)])
for user in db.select():
print(f"{user}")
name type nullable default autoincrement primary_key
0 name VARCHAR True None auto 0
1 age INTEGER True None auto 0
2 height FLOAT True None auto 0
User(name='kevin', age=None, height=None)
User(name='tyrone', age=None, height=None)
User(name='carlos', age=None, height=None)
Now let's try to select only a subset of the columns - in this case, 'name' and 'age'.
users = db.select(['name', 'age'])
users[0]
User(name='kevin', age=None)
Note that the user height was set to EmptyValue
. When we try to access height as an index, we get an error indicating that the data was not retrived in the select statement.
try:
users[0]['height']
except KeyError as e:
print(e)
'The column "height" was not retreived in the select statement.'
On the contrary, if we try to access as an attribute, the actual EmptyValue
object is retrieved. Object properties work as they always have, but indexing into columns will check for errors in the program logic. This implementation shows how dataclass schemas walk the line between regular python objects and database rows, and thus accessing these values can be done differently depending on how much the table entries should be treated like regular objects vs database rows. This is all determined based on how the dataclass columns are configured.
users[0].height
EmptyValue()
Special column types
Now I'll introduce two special data column types provided by doctable: IDCol()
, which represents a regular id column in sqlite with autoindex and primary_key parameters set, and UpdatedCol()
, which records the datetime that an object was added to the database. When we create a new user using the dataclass constructor, these values are set to EmptyValue
, and are relevant primarily to the database. By setting the repr
parameter in the @dataclass
decorator, we can use the __repr__
of the DocTableRow
base class, which hides EmptyValue
columns. This is optional.
from dataclasses import field, fields
@doctable.schema(repr=False)
class User:
__slots__ = []
id: int = doctable.IDCol() # shortcut for autoindex, primary_key column.
updated: datetime = doctable.UpdatedCol() # shortcut for automatically
name: str = doctable.Col(nullable=False)
age: int = doctable.Col(None) # accessing sqlalchemy column keywords arguments
user = User(name='carlos', age=15)
user
User(name='carlos', age=15)
And we can see the relevance of those columns by inserting them into the database and selecting them again. You can see from the result of .head()
that the primary key id
and the updated
columns were appropriately filled upon insertion. After selecting, these objects also contain valid values.
db = doctable.DocTable(schema=User, target=':memory:')
print(db.schema_table())
db.insert([User(name='kevin'), User(name='tyrone', age=12), User(name='carlos', age=25)])
db.head()
name type nullable default autoincrement primary_key
0 id INTEGER False None auto 1
1 updated DATETIME True None auto 0
2 name VARCHAR False None auto 0
3 age INTEGER True None auto 0
id | updated | name | age | |
---|---|---|---|---|
0 | 1 | 2021-07-21 19:03:04.550804 | kevin | NaN |
1 | 2 | 2021-07-21 19:03:04.550809 | tyrone | 12.0 |
2 | 3 | 2021-07-21 19:03:04.550811 | carlos | 25.0 |
This was just an example of how regular Python dataclass objects can contain additional data which is relevant to the database, but which is otherwise unneeded. After retrieving from database, we can also use .update()
to modify the entry.
user = db.select_first()
user.age = 10
db.update(user, where=db['id']==user['id'])
db.head()
id | updated | name | age | |
---|---|---|---|---|
0 | 1 | 2021-07-21 19:03:04.550804 | kevin | 10 |
1 | 2 | 2021-07-21 19:03:04.550809 | tyrone | 12 |
2 | 3 | 2021-07-21 19:03:04.550811 | carlos | 25 |
We can use the convenience function update_dataclass()
to update a single row corresponding to the object.
user.age = 11
db.update_dataclass(user)
db.head()
id | updated | name | age | |
---|---|---|---|---|
0 | 1 | 2021-07-21 19:03:04.550804 | kevin | 11 |
1 | 2 | 2021-07-21 19:03:04.550809 | tyrone | 12 |
2 | 3 | 2021-07-21 19:03:04.550811 | carlos | 25 |