DocTable (slightly more) Advanced Example

In this notebook, I show how to define a DocTable with blob data types, add new rows, and then iterate through rows to populate previously empty fields.

import email
from .legacy_helper import get_sklearn_newsgroups # for this example

import sys
sys.path.append('..')
import doctable as dt # this will be the table object we use to interact with our database.

tempfolder = dt.TempFolder('tmp')

Get News Data From sklearn.datasets

Then parses into a dataframe.

ddf = get_sklearn_newsgroups()
print(ddf.shape)
ddf.head(3)

(11314, 3)

	filename	target	text
0	21379	soc.religion.christian	From: kbanner@philae.sas.upenn.edu (Ken Banner...
1	20874	soc.religion.christian	From: simon@monu6.cc.monash.edu.au\nSubject: S...
2	58936	sci.med	From: jeffp@vetmed.wsu.edu (Jeff Parke)\nSubje...

Define NewsGroups DocTable

This definition includes fields file_id, category, raw_text, subject, author, and tokenized_text. The extra columns compared to example_simple.ipynb are for storing extracted metadata.

class NewsGroups(dt.DocTableLegacy):
    def __init__(self, fname):
        '''
            DocTable class.
            Inputs:
                fname: fname is the name of the new sqlite database that will be used for instances of class.
        '''
        tabname = 'newsgroups'
        super().__init__(
            fname=fname, 
            tabname=tabname, 
            colschema=(
                'id integer primary key autoincrement',
                'file_id int', 
                'category string',
                'raw_text string',
                'subject string', 
                'author string', 
                'tokenized_text blob', 
            ),
            constraints=('UNIQUE(file_id)',)
        )

        # create indices on file_id and category
        self.query("create index if not exists idx1 on "+tabname+"(file_id)")
        self.query("create index if not exists idx2 on "+tabname+"(category)")

sng = NewsGroups(f'{tmp}/news_groupssss.db')
print(sng)

<Documents ct: 0>

# add in raw data
col_order = ('file_id','category','raw_text')
data = [(dat['filename'],dat['target'],dat['text']) for ind,dat in ddf.iterrows()]
sng.addmany(data,keys=col_order, ifnotunique='ignore')
sng.getdf(limit=2)

	id	file_id	category	raw_text	subject	author	tokenized_text
0	1	21379	soc.religion.christian	From: kbanner@philae.sas.upenn.edu (Ken Banner...	None	None	None
1	2	20874	soc.religion.christian	From: simon@monu6.cc.monash.edu.au\nSubject: S...	None	None	None

Update "tokenized_text" Column

Use .get() to loop through rows in the database, and .update() to add in the newly extracted data. In this case, we simply tokenize the text using the python builtin split() function.

query = sng.get(sel=('file_id','raw_text',))
for row in query:

    dat = {'tokenized_text':row['raw_text'].split(),}
    sng.update(dat, 'file_id == {}'.format(row['file_id']))
sng.getdf(limit=2)

	id	file_id	category	raw_text	subject	author	tokenized_text
0	1	21379	soc.religion.christian	From: kbanner@philae.sas.upenn.edu (Ken Banner...	None	None	[From:, kbanner@philae.sas.upenn.edu, (Ken, Ba...
1	2	20874	soc.religion.christian	From: simon@monu6.cc.monash.edu.au\nSubject: S...	None	None	[From:, simon@monu6.cc.monash.edu.au, Subject:...

Extract Email Metadata

This example takes it even further by using the "email" package to parse apart the blog files. It then uses the extracted information to populate the corresponding fields in the DocTable.

query = sng.get(sel=('file_id','raw_text',), asdict=False)
for fid,text in query:
    e = email.message_from_string(text)
    auth = e['From'] if 'From' in e.keys() else ''
    subj = e['Subject'] if 'Subject' in e.keys() else ''
    tok = e.get_payload().split()
    dat = {
        'tokenized_text':tok,
        'author':auth,
        'subject':subj,
    }

    sng.update(dat, 'file_id == {}'.format(fid))
sng.getdf(limit=2)

	id	file_id	category	raw_text	subject	author	tokenized_text
0	1	21379	soc.religion.christian	From: kbanner@philae.sas.upenn.edu (Ken Banner...	Re: SATANIC TOUNGES	kbanner@philae.sas.upenn.edu (Ken Banner)	[In, article, <May.5.02.53.10.1993.28880@athos...
1	2	20874	soc.religion.christian	From: simon@monu6.cc.monash.edu.au\nSubject: S...	Saint Story St. Aloysius Gonzaga	simon@monu6.cc.monash.edu.au	[Heres, a, story, of, a, Saint, that, people, ...

sng.getdf().to_csv('newsgroup20.csv')