DocTable Schemas

There are two ways to define schemas for a DocTable:

  1. dataclass schema: column names and types come from a class created using the @doctable.schema decorator. This class represents a single row, and is returned by default when a select query is executed. doctable provides a thin layer over dataclasses with slots to reduce memory overhead from returned results. Custom methods can also be defined on the class that will not affect the database schema. When using this method, constraints and indices must be provided at the time of DocTable instantiation (or in the definition of an inheriting DocTable).

  2. list schema: column names and types come from sequences of strings according to a custom doctable format. This method requires less knowledge of doctable objects but otherwise has no advantages over dataclass schemas.
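The dataclass mechanism can be illustrated with a stdlib-only sketch (plain `dataclasses` here, not doctable's own decorator, which this only approximates; the `Record` class and field names are hypothetical):

```python
import dataclasses
import datetime

# Hypothetical row class standing in for one created with @doctable.schema;
# the field names and annotated types play the role of column names and types.
@dataclasses.dataclass
class Record:
    id: int = None
    title: str = ''
    updated_on: datetime.datetime = None

# A schema layer can read column names and Python types straight from the
# dataclass fields, then map each Python type to a sqlalchemy column type.
columns = {f.name: f.type for f in dataclasses.fields(Record)}
print(columns)  # field name -> annotated Python type
```

Because the row class is an ordinary dataclass, custom methods added to it never appear in `dataclasses.fields()` and therefore never affect the derived schema.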

The doctable package builds on sqlalchemy, so both types of schema specifications ultimately result in a sequence of sqlalchemy column types that will be used to construct (or interface with) the database table.

from datetime import datetime
from pprint import pprint
import pandas as pd

import sys
import doctable

Schema Type Mappings

There are two lookup tables used to relate schema specifications to sqlalchemy column types. The first is a map from Python datatypes to sqlalchemy types; this alone is sufficient for the simplest possible dataclass schema specification.

The second is a string lookup provided for the list schema format. You can see that it offers a larger number of types than the Python type conversion.

Several other custom column types are also included for convenience.

{'biginteger': sqlalchemy.sql.sqltypes.BigInteger,
 'boolean': sqlalchemy.sql.sqltypes.Boolean,
 'date': sqlalchemy.sql.sqltypes.Date,
 'datetime': sqlalchemy.sql.sqltypes.DateTime,
 'enum': sqlalchemy.sql.sqltypes.Enum,
 'float': sqlalchemy.sql.sqltypes.Float,
 'integer': sqlalchemy.sql.sqltypes.Integer,
 'interval': sqlalchemy.sql.sqltypes.Interval,
 'largebinary': sqlalchemy.sql.sqltypes.LargeBinary,
 'numeric': sqlalchemy.sql.sqltypes.Numeric,
 'smallinteger': sqlalchemy.sql.sqltypes.SmallInteger,
 'string': sqlalchemy.sql.sqltypes.String,
 'text': sqlalchemy.sql.sqltypes.Text,
 'time': sqlalchemy.sql.sqltypes.Time,
 'unicode': sqlalchemy.sql.sqltypes.Unicode,
 'unicodetext': sqlalchemy.sql.sqltypes.UnicodeText,
 'json': doctable.schemas.custom_coltypes.JSONType,
 'pickle': doctable.schemas.custom_coltypes.CpickleType,
 'parsetree': doctable.schemas.custom_coltypes.ParseTreeDocFileType,
 'picklefile': doctable.schemas.custom_coltypes.PickleFileType,
 'textfile': doctable.schemas.custom_coltypes.TextFileType}
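The Python-type lookup itself is not printed above; the sketch below is a hedged illustration of what such a mapping looks like (the real table lives inside doctable, and the entries here are assumptions keyed by the string lookup printed above):

```python
import datetime

# Illustrative Python-datatype -> type-name mapping, NOT doctable's actual
# internal table; values are drawn from the string lookup shown above.
python_type_lookup = {
    int: 'integer',
    float: 'float',
    str: 'string',
    bool: 'boolean',
    datetime.datetime: 'datetime',
}
print(python_type_lookup[str])  # a str annotation would become a String column
```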

List Schemas

This example shows the list schema format.

schema = (
    # standard id column
    #SQLAlchemy: Column('id', Integer, primary_key = True, autoincrement=True), 
    ('integer', 'id', dict(primary_key=True, autoincrement=True)),
    # short form (can't provide any additional args though): ('idcol', 'id')

    # make a category column; its values are restricted to "FICTION" and
    # "NONFICTION" by a check constraint defined below
    #SQLAlchemy: Column('category', String, nullable=False)
    ('string', 'category', dict(nullable=False)),

    # make a non-null title column
    #SQLAlchemy: Column('title', String, nullable=False)
    ('string', 'title', dict(nullable=False)),

    # make an abstract where the default is an empty string instead of null
    #SQLAlchemy: Column('abstract', String, default='')
    ('string', 'abstract', dict(default='')),

    # make an age column; a check constraint below requires age > 0
    #SQLAlchemy: Column('age', Integer)
    ('integer', 'age'),

    # make a column that keeps track of row updates
    #SQLAlchemy: Column('updated_on', DateTime(), onupdate=datetime.now)
    ('datetime', 'updated_on', dict(onupdate=datetime.now)),
    # short form to auto-record update date: ('date_updated', 'updated_on')

    # a similar column can auto-record the insertion date
    #SQLAlchemy: Column('added_on', DateTime(), default=datetime.now)
    #('datetime', 'added_on', dict(default=datetime.now)),
    # short form to auto-record insertion date: ('date_added', 'added_on')

    # make a string column with max of 500 characters
    #SQLAlchemy: Column('text', String(500))
    ('string', 'text', dict(), dict(length=500)),

    ##### Custom DocTable Column Types #####

    # uses json.dump to convert python object to json when storing and
    # json.load to convert json back to python when querying
    #SQLAlchemy: Column('json_data', JSONType)
    ('json', 'json_data'),

    # stores pickled python object directly in table as BLOB
    #SQLAlchemy: Column('tokenized', CpickleType)
    ('pickle', 'tokenized'),

    # store pickled data into a separate file, recording only filename directly in table
    # the 'fpath' argument can specify where the files should be placed, but by
    # default they are stored in <dbname>_<tablename>_<columnname>
    #('picklefile', 'pickle_obj', dict(), dict(fpath='folder_for_picklefiles')),

    # very similar to above, but use only when storing text data
    #('textfile', 'text_file'), # similar to above

    ##### Constraints #####

    #SQLAlchemy: CheckConstraint('category in ("FICTION","NONFICTION")', name='salary_check')
    ('check_constraint', 'category in ("FICTION","NONFICTION")', dict(name='salary_check')),

    #SQLAlchemy: CheckConstraint('age > 0')
    ('check_constraint', 'age > 0'),

    # make sure each category/title entry is unique
    #SQLAlchemy:  UniqueConstraint('category', 'title', name='work_key')
    ('unique_constraint', ['category','title'], dict(name='work_key')),

    # makes a foreign key from the 'subkey' column of this table to the 'id'
    # column of ANOTHERDOCTABLE, setting the SQL onupdate and ondelete foreign key constraints
    #('foreignkey_constraint', [['subkey'], [ANOTHERDOCTABLE['id']]], {}, dict(onupdate="CASCADE", ondelete="CASCADE")),
    #NOTE: Can't show here because we didn't make ANOTHERDOCTABLE

    ##### Indexes ######

    # make an index on the category/title pair
    # SQLAlchemy: Index('ind0', 'category', 'title', unique=True)
    ('index', 'ind0', ('category','title'), dict(unique=True)),
)

md = doctable.DocTable(target=':memory:', schema=schema, verbose=True)
   name        type          nullable  default  autoincrement  primary_key
0  id          INTEGER       False     None     auto           1
1  category    VARCHAR       False     None     auto           0
2  title       VARCHAR       False     None     auto           0
3  abstract    VARCHAR       True      None     auto           0
4  age         INTEGER       True      None     auto           0
5  updated_on  DATETIME      True      None     auto           0
6  text        VARCHAR(500)  True      None     auto           0
7  json_data   VARCHAR       True      None     auto           0
8  tokenized   BLOB          True      None     auto           0
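The list-schema tuples above can be dispatched by their first element: known constraint and index keywords produce table-level objects, while everything else produces a column. The stdlib-only sketch below illustrates that split (the function name and keyword set are assumptions, not doctable's actual internals):

```python
# Keywords that produce table-level objects rather than columns (assumed set).
NON_COLUMN_KEYWORDS = {
    'check_constraint', 'unique_constraint', 'foreignkey_constraint', 'index',
}

def split_schema(schema):
    '''Separate column entries from constraint/index entries.'''
    columns, table_objects = [], []
    for entry in schema:
        target = table_objects if entry[0] in NON_COLUMN_KEYWORDS else columns
        target.append(entry)
    return columns, table_objects

example = (
    ('integer', 'id', dict(primary_key=True, autoincrement=True)),
    ('string', 'title', dict(nullable=False)),
    ('check_constraint', 'age > 0'),
    ('index', 'ind0', ('category', 'title'), dict(unique=True)),
)
cols, objs = split_schema(example)
print([e[1] for e in cols])  # ['id', 'title']
```

In the real package, each column tuple would then be resolved through the string lookup printed earlier to build the corresponding sqlalchemy Column.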