In-Database Text Analysis | Intro to Data Science

This blog post captures my experience with the second assignment of the online big data course. It seems pretty straightforward, needing only a couple of SQL commands...

The assignment instructs us to use the class virtual machine to access SQLite; however, as I prefer doing things in my own environment, I just installed sqlite3 from here.

I've already gotten the two database files from the first time I set up the git repo. A simple 

$> git fetch course
$> git diff master remotes/course/master ./assignment2/
showed that there were no updates to the course material. I can therefore continue right where I left off last time. ("course" is the remote pointing to the course repository).
 
The assignment has us working with a table "Frequency" in a given database file. Let's see some information about this table and the database.
 
First let's enter sqlite with this db file
$> sqlite3 reuters.db
Now that we are in the database, let's see the tables that exist:
> .tables
Frequency
 
Now the basic information about the table:
> .schema Frequency
CREATE TABLE Frequency (
docid VARCHAR(255),
term VARCHAR(255),
count int,
PRIMARY KEY(docid, term));

> select count(*) from Frequency;
161802
 
> select * from Frequency limit 1;
10000_txt_earn|net|1
 
This means:
1. The schema consists of a column "docid" and a column "term", each a variable-length string of up to 255 characters, along with an integer column "count". The primary key (the uniqueness constraint across rows) is the combination of docid and term.
 
2. There are 161,802 total rows in the table.
 
3. The first row/entry looks like:
docid = 10000_txt_earn
term = net
count = 1
 
This seems to be a table keeping a count of the occurrences of each word in each document. Let's read along in the instructions to learn more...
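To make the queries below easy to replay without the course database, here is a minimal sketch in Python using the standard-library sqlite3 module; the table mirrors the schema above, but the rows are made up, not taken from reuters.db:

```python
import sqlite3

# In-memory database standing in for reuters.db (toy rows, not the real corpus)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Frequency (
        docid VARCHAR(255),
        term VARCHAR(255),
        count int,
        PRIMARY KEY (docid, term))
""")
conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", [
    ("10000_txt_earn", "net", 1),
    ("10000_txt_earn", "profit", 3),
    ("20000_txt_trade", "net", 2),
])

print(conn.execute("SELECT count(*) FROM Frequency").fetchone()[0])  # 3
print(conn.execute("SELECT * FROM Frequency LIMIT 1").fetchone())
```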
 
 
 

Problem 1

The first problem asks us to query the table for certain information and (in most cases) return the number of rows that meet the constraint. Let's go through it step by step, each posted as the relational algebra statement and the corresponding SQL query:

a) σ_{docid = '10398_txt_earn'}(Frequency)

select count(*) from (select * from Frequency where docid = '10398_txt_earn');

 

b) π_{term}(σ_{docid = '10398_txt_earn' and count = 1}(Frequency))

select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1);

 

c) π_{term}(σ_{docid = '10398_txt_earn' and count = 1}(Frequency)) ∪ π_{term}(σ_{docid = '925_txt_trade' and count = 1}(Frequency))

select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1 union select term from Frequency where docid = '925_txt_trade' and count = 1);

This results in the value: 324.

However, when we do:

select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1);

we get a value of 110

and when we do:

select count(*) from (select term from Frequency where docid = '925_txt_trade' and count = 1); 

we get a value of 225.

So should we not get 110 + 225 = 335 as the answer instead of 324?

This is not necessarily the case, because we are counting the projection onto the column "term", not the entire record. If a term shows up in both result sets, the union counts it only once (UNION removes duplicates). That is evidently what happened here, since we got a value less than 335. With a little math we can also deduce the number of terms common to both statements: 335 - 324 = 11. This can be confirmed by replacing "union" with "intersect":

select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1 intersect select term from Frequency where docid = '925_txt_trade' and count = 1);

Sure enough, we get the answer 11.

Not surprisingly, if we do not include the projection, we get 335 as the result, like below (note how it is select * in both cases instead of select term; the full rows are all distinct because the two docids differ):

select count(*) from (select * from Frequency where docid = '10398_txt_earn' and count = 1 union select * from Frequency where docid = '925_txt_trade' and count = 1); 
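The inclusion-exclusion relationship observed above (|A ∪ B| = |A| + |B| − |A ∩ B|) can be checked on a toy table in Python with the standard-library sqlite3 module; docids and terms here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Frequency (docid TEXT, term TEXT, count INT, "
             "PRIMARY KEY (docid, term))")
conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", [
    ("doc_a", "net", 1), ("doc_a", "tax", 1), ("doc_a", "oil", 1),
    ("doc_b", "tax", 1), ("doc_b", "gold", 1),
])

def one(sql):
    return conn.execute(sql).fetchone()[0]

# UNION on the term projection removes duplicates across the two result sets
union_n = one("""SELECT count(*) FROM (
    SELECT term FROM Frequency WHERE docid = 'doc_a' AND count = 1
    UNION
    SELECT term FROM Frequency WHERE docid = 'doc_b' AND count = 1)""")

a_n = one("SELECT count(*) FROM Frequency WHERE docid = 'doc_a' AND count = 1")
b_n = one("SELECT count(*) FROM Frequency WHERE docid = 'doc_b' AND count = 1")
inter_n = one("""SELECT count(*) FROM (
    SELECT term FROM Frequency WHERE docid = 'doc_a' AND count = 1
    INTERSECT
    SELECT term FROM Frequency WHERE docid = 'doc_b' AND count = 1)""")

print(union_n, a_n, b_n, inter_n)        # 4 3 2 1
assert union_n == a_n + b_n - inter_n    # inclusion-exclusion holds
```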

 

(d) count: Write a SQL statement to count the number of documents containing the word "parliament".

A simple query to answer this question would be:

select count(*) from Frequency where term = 'parliament';

However, in other tables we could have multiple rows with the same docid and term, which would give us duplicates. That cannot happen here, since the primary key (docid, term) guarantees uniqueness in this regard. Nevertheless, for completeness, here is the query that would handle that case as well:

select count(distinct docid) from Frequency where term = 'parliament';
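As a quick sketch of why DISTINCT matters, here is a toy Python example (standard-library sqlite3, made-up rows) using a Frequency-like table without the primary key, so duplicate (docid, term) pairs can occur:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key here, so duplicate (docid, term) rows are allowed
conn.execute("CREATE TABLE Frequency (docid TEXT, term TEXT, count INT)")
conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", [
    ("d1", "parliament", 1),
    ("d1", "parliament", 2),   # duplicate docid/term pair
    ("d2", "parliament", 1),
])

naive = conn.execute(
    "SELECT count(*) FROM Frequency WHERE term = 'parliament'").fetchone()[0]
dedup = conn.execute(
    "SELECT count(DISTINCT docid) FROM Frequency WHERE term = 'parliament'").fetchone()[0]
print(naive, dedup)  # 3 2
```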

 

(e) big documents: Write a SQL statement to find all documents that have more than 300 total terms, including duplicate terms. (Hint: You can use the HAVING clause, or you can use a nested query. Another hint: Remember that the count column contains the term frequencies, and you want to consider duplicates.) (docid, term_count)

We can get the total number of terms in all documents with:

select sum(count) from Frequency;

Similarly, we can get the total number of terms in a given document as:

select sum(count) from Frequency where docid = '925_txt_trade';

In order to construct a query to answer the given question, we will need to make it a little more complex with a nested select statement.

select count(*) from (select docid, sum(count) as sum from Frequency group by docid) where sum > 300;

The nested select statement gives us a table with the sum of the counts for each docid. The outer query simply counts those rows that have a cumulative count (the "sum" column) of more than 300.

An alternative way to write this would be to use the HAVING clause:

select count(*) from (select docid, sum(count) as sum from Frequency group by docid having sum > 300);

This gives us the same result.
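A quick Python check (standard-library sqlite3, toy data with one document over the threshold) that the nested-query and HAVING versions agree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Frequency (docid TEXT, term TEXT, count INT, "
             "PRIMARY KEY (docid, term))")
conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", [
    ("big_doc", "a", 200), ("big_doc", "b", 150),    # total 350 > 300
    ("small_doc", "a", 10), ("small_doc", "b", 20),  # total 30
])

# Version 1: sum per docid in a subquery, filter in the outer query
nested = conn.execute("""SELECT count(*) FROM (
    SELECT docid, sum(count) AS total FROM Frequency GROUP BY docid)
    WHERE total > 300""").fetchone()[0]

# Version 2: filter the groups directly with HAVING
having = conn.execute("""SELECT count(*) FROM (
    SELECT docid FROM Frequency GROUP BY docid
    HAVING sum(count) > 300)""").fetchone()[0]

print(nested, having)  # 1 1
assert nested == having
```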

 

(f) two words: Write a SQL statement to count the number of unique documents that contain both the word 'transactions' and the word 'world'.

If we write a statement to get a table of all rows where either of the two terms appears:

select docid from Frequency where term = 'transactions' or term = 'world';

we will get a list of docids with repeats. But we want only those docids that appear for both terms, i.e. the intersection of the docids where the term is 'transactions' and the docids where the term is 'world'. For this, we keep the projection and split the query into two parts, combining them with an intersect:

select docid from Frequency where term = 'transactions' intersect select docid from Frequency where term = 'world';

This lists out those docids that contain both words. Now to get the count we can just add a select count(*) around it:

select count(*) from (select docid from Frequency where term = 'transactions' intersect select docid from Frequency where term = 'world');
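Here is a toy Python check of the intersect approach (standard-library sqlite3; docids are made up, with only one document containing both words):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Frequency (docid TEXT, term TEXT, count INT, "
             "PRIMARY KEY (docid, term))")
conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", [
    ("d1", "transactions", 1), ("d1", "world", 2),  # contains both words
    ("d2", "transactions", 1),                      # only one
    ("d3", "world", 4),                             # only one
])

# INTERSECT keeps only docids present in both projections
both = conn.execute("""SELECT count(*) FROM (
    SELECT docid FROM Frequency WHERE term = 'transactions'
    INTERSECT
    SELECT docid FROM Frequency WHERE term = 'world')""").fetchone()[0]
print(both)  # 1
```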

 


Problem 2: Matrix Multiplication in SQL

Now we need to work with the other database, matrix.db.

Running a ".tables" command in sqlite tells us we have two tables, a and b, each representing a matrix.

Running a ".schema a" command tells us we have the following schema:

CREATE TABLE a (
row_num int,
col_num int,
value int,
primary key (row_num, col_num));
 
The schema is the same for the other table. Let's run some basic analysis on the tables representing matrices.
 
To find the number of rows:
select max(row_num) + 1 from A;
 
To find the number of columns:
select max(col_num) + 1 from A;
 
Both queries returned the value 5. This tells us that it is a square matrix of 5 columns and rows. This applies to the table B as well. Note: We have put a + 1 since the i,j index values are 0-indexed.
 
Now the question from the instructions:
(g) multiply: Express A × B as a SQL query, referring to the class lecture for hints. (I need to turn in cell (2,3) of the product matrix.)
 
The instructions suggest taking a hint from the lectures. I've worked with matrices before so I'm going to try this one on my own first, and if my solution is incorrect I will go to the lecture for help. So here goes...
 
Since I need to return row 2, column 3 of the final matrix, I know that means multiplying row 2 of A with column 3 of B. Let me first try to get row 2 of A:
select * from A where row_num = 2;
Now doing the same with column 3 of B gives:
select * from B where col_num = 3;
 
Taking a step back and looking at matrix multiplication, I need to have 5 terms since this is a square matrix of side 5. The final value is of the form:
A(2,0) * B(0,3) + A(2,1) * B(1,3) + ... + A(2,4) * B(4,3)
 
Using the property of 0 — anything multiplied by 0 is 0 — we only need to consider those terms in the above expression where both factors are present in the tables (entries that are not stored are presumably 0).
 
Keeping that in mind, we need to select those values where the row in A is 2, the column in B is 3, and the column of A equals the row of B. That looks a lot like an equi-join (inner join), so I will use that approach; let's see how it goes:
select a.value * b.value from A a, B b where a.row_num = 2 and b.col_num=3 and a.col_num=b.row_num;
This query gives me all the terms; now I need to sum them up:
select sum(mul) from (select a.value * b.value as mul from A a, B b where a.row_num = 2 and b.col_num=3 and a.col_num=b.row_num);
I got 2874, let's see how I did.
So my answer was correct, but it seems there is another way to do it because the solution hint said "Join columns to rows, group by rows and columns, then filter to get the cell you want.", which is different from what I did. I may come back and try this one later.
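To convince myself the join works in general, here is a toy Python sketch (standard-library sqlite3; small made-up sparse matrices, not the assignment's 5×5 ones) computing one cell directly and also via the grouped join the hint describes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("a", "b"):
    conn.execute(f"CREATE TABLE {t} (row_num int, col_num int, value int, "
                 "PRIMARY KEY (row_num, col_num))")

# Toy sparse matrices: only nonzero cells are stored (values are made up)
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", [(1, 0, 2), (1, 1, 3)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", [(0, 2, 5), (1, 2, 7), (2, 2, 11)])

# Cell (1,2) of A*B: sum over k of A(1,k) * B(k,2) = 2*5 + 3*7 = 31
cell = conn.execute("""
    SELECT sum(a.value * b.value)
    FROM a, b
    WHERE a.row_num = 1 AND b.col_num = 2 AND a.col_num = b.row_num
""").fetchone()[0]
print(cell)  # 31

# The hint's variant: join columns to rows, group by (row, col), filter later
full = conn.execute("""
    SELECT a.row_num, b.col_num, sum(a.value * b.value)
    FROM a JOIN b ON a.col_num = b.row_num
    GROUP BY a.row_num, b.col_num
""").fetchall()
print(full)  # [(1, 2, 31)]
```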
 

Problem 3: Working with a Term-Document Matrix

The third problem set revolves around using matrix multiplication to compute the similarity of documents in a normalized form.
 
The problem set explains the logic and formula behind computing this similarity. The concept is a little new to me, and since I prefer to learn the theory before implementing things, I will defer to a reference. The class lectures, unfortunately, do not cover this topic, so I will look at the Wikipedia articles on the document-term matrix and on tf-idf for computing the importance of terms in a document.
 
Now that that is out of the way, let's move on to the problem set...
 
(h) Similarity Matrix
This asks me to compute the similarity of two documents, given that they are represented as a document-term matrix in the sqlite database reuters.db, with the schema discussed in Problem 1 above.
 
select sum(a.count*b.count) from Frequency a, Frequency b where a.docid='10080_txt_crude' and b.docid='17035_txt_earn' and a.term=b.term;
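This is a dot product over the shared terms of the two documents. A small Python sketch of the same self-join (standard-library sqlite3; docids and counts are made up, with "oil" as the only shared term, contributing 2*4 = 8):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Frequency (docid TEXT, term TEXT, count INT, "
             "PRIMARY KEY (docid, term))")
conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", [
    ("doc_x", "oil", 2), ("doc_x", "price", 3),
    ("doc_y", "oil", 4), ("doc_y", "tax", 1),
])

# Dot product of the two term vectors: only shared terms contribute
sim = conn.execute("""SELECT sum(a.count * b.count)
    FROM Frequency a, Frequency b
    WHERE a.docid = 'doc_x' AND b.docid = 'doc_y' AND a.term = b.term
""").fetchone()[0]
print(sim)  # 8
```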
 
(i) Keyword Search
For this we need to find the best matching document to the keyword query "washington taxes treasury". The problem set suggests adding this set of keywords to the document matrix (corpus) with a union of scalar queries, so that's what I'll do.
 
select docid, sum(mult) as total from (select a.docid, a.term, a.count, b.count, a.count*b.count as mult from Frequency a, (select 'q' as docid, 'washington' as term, 1 as count UNION select 'q' as docid, 'taxes' as term, 1 as count UNION select 'q' as docid, 'treasury' as term, 1 as count) b where a.term=b.term) group by docid order by total asc;

(With ascending order the best match is the last row printed; ordering by total desc would put it first.)
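A toy Python reproduction of the same idea (standard-library sqlite3; docids and counts are made up), with the three-keyword query vector injected as a union of scalar selects and the results ordered best-first:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Frequency (docid TEXT, term TEXT, count INT, "
             "PRIMARY KEY (docid, term))")
conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", [
    ("d1", "washington", 2), ("d1", "taxes", 1),
    ("d2", "treasury", 5),
    ("d3", "oil", 7),          # matches no keyword, so it drops out
])

# The query vector is a UNION of three one-row scalar selects
ranked = conn.execute("""
    SELECT a.docid, sum(a.count * b.count) AS total
    FROM Frequency a,
         (SELECT 'washington' AS term, 1 AS count
          UNION SELECT 'taxes', 1
          UNION SELECT 'treasury', 1) b
    WHERE a.term = b.term
    GROUP BY a.docid
    ORDER BY total DESC
""").fetchall()
print(ranked)  # [('d2', 5), ('d1', 3)]
```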
 
This concludes the assignment involving In-Database Text Analysis. And now I can move on to the cool stuff, i.e. MapReduce!