DocTable File Column Types

It is often good advice to avoid storing large binary data in an SQL table because it can significantly slow reads across the entire table. I find, however, that storing such data can be extremely useful in text analysis applications as a way to keep track of a large number of models with associated metadata. As an alternative to storing binary data in the table directly, DocTable includes a number of custom column types that transparently store data on the filesystem and keep track of it using the schema definitions.

I provide two file storage column types: (1) TextFileCol for storing text data, and (2) PickleFileCol for storing any Python data that requires pickling.

Now I create a new table representing a matrix. Notice that I use the PickleFileCol column shortcut to create the column. This column is equivalent to Col(None, coltype='picklefile', type_args=dict(folder=folder)). See that to SQLite, this column simply looks like a text column.

Now we insert a new array. It appears to be inserted the same as any other object.

But when we actually look at the filesystem, we see that files have been created to store the array.

If we want to see the raw data stored in the table, we can create a new DocTable without a defined schema. See that only the raw filenames have been stored in the database. Recall that the directory indicating where to find these files was provided in the schema itself.
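The idea of keeping only a filename in the table while the data lives on disk can be illustrated without DocTable. The following is a minimal sketch using only the Python standard library; it is not DocTable's actual implementation. The helper names insert_pickled and load_pickled, the table name, and the filename scheme are all assumptions made for illustration.

```python
import pickle
import sqlite3
import tempfile
import uuid
from pathlib import Path

# Hypothetical stand-in for a file-backed column; not DocTable's code.
folder = Path(tempfile.mkdtemp())          # where the pickled payloads live
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matrices (id INTEGER PRIMARY KEY, array TEXT)")

def insert_pickled(obj):
    """Pickle obj into the folder and store only its filename in the table."""
    fname = f"{uuid.uuid4().hex}.pic"      # assumed naming scheme
    with open(folder / fname, "wb") as f:
        pickle.dump(obj, f)
    conn.execute("INSERT INTO matrices (array) VALUES (?)", (fname,))
    return fname

def load_pickled(row_id):
    """Look up the stored filename and unpickle the payload it points to."""
    (fname,) = conn.execute(
        "SELECT array FROM matrices WHERE id = ?", (row_id,)
    ).fetchone()
    with open(folder / fname, "rb") as f:
        return pickle.load(f)

matrix = [[1, 2], [3, 4]]                  # plain lists stand in for an array
stored_name = insert_pickled(matrix)

# The raw column value is just a filename; the data itself is on disk.
raw_value = conn.execute("SELECT array FROM matrices").fetchone()[0]
restored = load_pickled(1)
```

In this sketch the database only ever sees a short text value, so table reads stay cheap, and the payload is loaded from the folder only when a row is actually requested.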

Data Folder Consistency

Now we try to delete a row from the database. We can see that it was deleted as expected.

However, when we check the folder where the data was stored, we find that the file was not deleted. For technical reasons, deleting a row does not remove the file it points to.

We can, however, clean up the unused files using clean_col_files(). Note that the specific column to clean must be provided.
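To make the cleanup step concrete, here is a rough standard-library sketch of what removing unreferenced files involves. It does not call clean_col_files() and makes no claim about how DocTable implements it; the function remove_orphans, the table layout, and the filenames are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path

folder = Path(tempfile.mkdtemp())
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matrices (id INTEGER PRIMARY KEY, array TEXT)")

# Two rows, each pointing at a file in the folder.
for name in ("a.pic", "b.pic"):
    (folder / name).write_bytes(b"payload")
    conn.execute("INSERT INTO matrices (array) VALUES (?)", (name,))

# Deleting the row leaves its file behind on disk.
conn.execute("DELETE FROM matrices WHERE array = ?", ("a.pic",))

def remove_orphans(conn, folder):
    """Delete files in folder that no row references; return their names."""
    referenced = {row[0] for row in conn.execute("SELECT array FROM matrices")}
    removed = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name not in referenced:
            path.unlink()
            removed.append(path.name)
    return removed

removed = remove_orphans(conn, folder)
remaining = sorted(p.name for p in folder.iterdir())
```

The sketch compares the folder contents against a single column, which mirrors why a specific column has to be named: each file-backed column has its own set of referenced filenames.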

There may be a situation where DocTable cannot find the file associated with an existing row. This most often occurs when the wrong folder is specified in the schema after the data folder has been moved. We can also use clean_col_files() to check for missing data. For example, we delete all the pickle files in the directory and then run clean_col_files().
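The reverse check, rows whose files are gone, can be sketched the same way. This is an illustrative standard-library example only, not DocTable's implementation; find_missing and the "moved" subfolder are hypothetical names.

```python
import sqlite3
import tempfile
from pathlib import Path

folder = Path(tempfile.mkdtemp())
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matrices (id INTEGER PRIMARY KEY, array TEXT)")
for name in ("a.pic", "b.pic"):
    (folder / name).write_bytes(b"payload")
    conn.execute("INSERT INTO matrices (array) VALUES (?)", (name,))

def find_missing(conn, folder):
    """Return filenames referenced by rows that do not exist in folder."""
    rows = conn.execute("SELECT array FROM matrices ORDER BY id")
    return [row[0] for row in rows if not (folder / row[0]).is_file()]

missing_before = find_missing(conn, folder)     # every file is present

# Simulate data loss by deleting every file in the folder.
for path in list(folder.iterdir()):
    path.unlink()
missing_after = find_missing(conn, folder)

# Pointing at the wrong folder produces the same symptom.
missing_wrong_folder = find_missing(conn, folder / "moved")
```

Note that deleted files and a misconfigured folder are indistinguishable from the table's point of view: in both cases every referenced filename fails to resolve.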

Text File Types

We can also store text files in a similar way. For this, use TextFileCol in the schema and specify the folder where the files should be stored.

See that the text files were created. They are ordinary text files, so we can read them as usual.
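As a final illustration, the text variant of the same idea is sketched below with the standard library alone. It is an assumption-laden example rather than DocTable's code: insert_text, the docs table, and the filename doc1.txt are invented for the purpose.

```python
import sqlite3
import tempfile
from pathlib import Path

folder = Path(tempfile.mkdtemp())
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT)")

def insert_text(conn, folder, fname, text):
    """Write text to a plain file and store only the filename in the table."""
    (folder / fname).write_text(text, encoding="utf-8")
    conn.execute("INSERT INTO docs (text) VALUES (?)", (fname,))

insert_text(conn, folder, "doc1.txt", "Hello world. Pretty silly.")

# The stored value is a filename; the file is readable with ordinary tools.
fname = conn.execute("SELECT text FROM docs").fetchone()[0]
content = (folder / fname).read_text(encoding="utf-8")
```

Because nothing is pickled, the files remain usable outside Python, for example with a text editor or command-line tools.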