DocTable Simple Example
In this notebook, I show how to define a DocTable as a python class, populate the DocTable using the .add() and .addmany() commands, query data through generators and pandas dataframes, and finally update DocTable entries.
from pprint import pprint
from timeit import default_timer as timer
from .legacy_helper import get_sklearn_newsgroups # for this example
import sys
sys.path.append('..')
import doctable as dt # this will be the table object we use to interact with our database.
Get News Data From sklearn.datasets
Then parses into a dataframe.
ddf = get_sklearn_newsgroups()
print(ddf.info())
ddf.head(3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11314 entries, 0 to 11313
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 filename 11314 non-null object
1 target 11314 non-null object
2 text 11314 non-null object
dtypes: object(3)
memory usage: 265.3+ KB
None
filename | target | text | |
---|---|---|---|
0 | 21379 | soc.religion.christian | From: kbanner@philae.sas.upenn.edu (Ken Banner... |
1 | 20874 | soc.religion.christian | From: simon@monu6.cc.monash.edu.au\nSubject: S... |
2 | 58936 | sci.med | From: jeffp@vetmed.wsu.edu (Jeff Parke)\nSubje... |
Define DocTable Class
This class definition will contain the columns, datatypes, unique constraints, and index commands needed for your DocTable. In moving your data from DataFrame to DocTable, you should consider column data types and custom indices carefully.
# this class will represent the doctable. It inherits from DocTable a number of add/query/remove functions.
# of course, you can add any additional methods to this class definition as you find useful.
class SimpleNewsGroups(dt.DocTableLegacy):
def __init__(self, fname):
'''
This includes examples of init variables. See DocTable class for complete list of options.
Inputs:
fname: fname is the name of the new sqlite database that will be used for this class.
'''
tabname = 'simplenewsgroups'
super().__init__(
fname=fname,
tabname=tabname,
colschema=(
'id integer primary key autoincrement',
'file_id int',
'category string',
'raw_text string',
)
)
# this section defines any other commands that should be executed upon init
# NOTICE: references tabname defined in the above __init__ function
# extra commands to create index tables for fast lookup
self.query("create index if not exists idx1 on "+tabname+"(file_id)")
self.query("create index if not exists idx2 on "+tabname+"(category)")
Create a connection to the database by constructing an instance of the class. If this is the first time you've run this code, it will create a new sqlite database file with no entries.
sng = SimpleNewsGroups('simple_news_group.db')
print(sng)
<Documents ct: 0>
Adding Data
There are two common ways to add data to your DocTable.
(1) Add in rows individually
(2) Add in bulk with or without specifying column names
# adds data one row at a time. Takes longer than bulk version
start = timer()
for ind,dat in ddf.iterrows():
row = {'file_id':int(dat['filename']), 'category':dat['target'], 'raw_text':dat['text']}
sng.add(row, ifnotunique='replace')
print((timer() - start)*1000, 'mil sec.')
print(sng)
1629.2546929325908 mil sec.
<Documents ct: 11314>
# adds tuple data in bulk by specifying columns we are adding
start = timer()
col_order = ('file_id','category','raw_text')
data = [(dat['filename'],dat['target'],dat['text']) for ind,dat in ddf.iterrows()]
sng.addmany(data,keys=col_order, ifnotunique='replace')
print((timer() - start)*1000, 'mil sec.')
print(sng)
1269.7400120086968 mil sec.
<Documents ct: 22628>
Querying Data
There are two primary ways of querying data from a DocTable:
(1) retrieve one-by-one from generator using ".get()" function. (2) retrieve all data in Pandas DataFrame suing ".getdf()" function.
result = sng.get(
sel=('file_id','raw_text'),
where='category == "rec.motorcycles"',
orderby='file_id ASC',
limit=3,
)
for row in result:
print(str(row['file_id'])+':', row['raw_text'][:50])
72052: From: ivan@erich.triumf.ca (Ivan D. Reid)
Subject:
72052: From: ivan@erich.triumf.ca (Ivan D. Reid)
Subject:
101725: Subject: Re: Lexan Polish?
From: jeff@mri.com (Jon
result_df = sng.getdf(
sel=('file_id','raw_text'),
where='category == "rec.motorcycles"',
orderby='file_id ASC',
limit=5,
)
result_df
file_id | raw_text | |
---|---|---|
0 | 72052 | From: ivan@erich.triumf.ca (Ivan D. Reid)\nSub... |
1 | 72052 | From: ivan@erich.triumf.ca (Ivan D. Reid)\nSub... |
2 | 101725 | Subject: Re: Lexan Polish?\nFrom: jeff@mri.com... |
3 | 101725 | Subject: Re: Lexan Polish?\nFrom: jeff@mri.com... |
4 | 102616 | From: blgardne@javelin.sim.es.com (Dances With... |
Updating Data in DocTable
The ".update()" function will change entries in the DocTable.
sng.update({'category':'nevermind',},where='file_id == "103121"')
sng.getdf(where='file_id == "103121"') # to see update, look at "category" column entry
id | file_id | category | raw_text | |
---|---|---|---|---|
0 | 395 | 103121 | nevermind | From: MJMUISE@1302.watstar.uwaterloo.ca (Mike ... |
1 | 11709 | 103121 | nevermind | From: MJMUISE@1302.watstar.uwaterloo.ca (Mike ... |