## In-Database Text Analysis | Intro to Data Science

Filed under

This blog captures my experience with the second assignment of the online big data course. It seems pretty straightforward needing a couple of sql commands...

The assignment instructs us to use the class virtual machine in order to access sqlite, however, as I prefer doing things in my own environment, I just installed sqlite3 from here

I've already gotten the two database files from the first time I set up the git repo. A simple

```
$> git fetch course
$> git diff master remotes/course/master ./assignment2/
```

```
$> sqlite3 reuters.db
```

```
> .tables
Frequency
```

```
> .schema Frequency
CREATE TABLE Frequency (
docid VARCHAR(255),
term VARCHAR(255),
count int,
PRIMARY KEY(docid, term));
> select count(*) from Frequency;
161802
> select * from Frequency limit 1;
10000_txt_earn|net|1
```

### Problem 1

The first problem asks to query the table for certain information and (in most cases) return the number of rows that meet the constraint. Let's get into it step by step, posted as the relational algebra statement and the corresponding sql query:

a) σ_{docid=}_{10398_txt_earn}(frequency)

`select count(*) from (select * from Frequency where docid = '10398_txt_earn');`

b) π_{term}( σ_{docid=}_{10398_txt_earn and count=1}(frequency))

`select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1);`

c) π_{term}( σ_{docid=}_{10398_txt_earn and count=1}(frequency)) U π_{term}( σ_{docid=}_{925_txt_trade and count=1}(frequency))

`select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1 union select term from Frequency where docid = '925_txt_trade' and count = 1);`

This results in the value: 324.

However, when we do:

`select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1);`

we get a value of 110

and when we do:

`select count(*) from (select term from Frequency where docid = '925_txt_trade' and count = 1); `

we get a value of 225.

so should we not get 110 + 225 = 335 as the answer instead of 324?

This is not necessarily the case because we are counting the projection of the column "term" and not the entire record. This means that if we have overlapping terms from both result sets of the select statements, only one will be counted. This seems to have been the case here as we got a value less than 335. So with some math, we could also conclude the common terms from both statements, 335 - 324 = 11. this can be confirmed by replacing "union" with "intersect.

`select count(*) from (select term from Frequency where docid = '10398_txt_earn' and count = 1 intersect select term from Frequency where docid = '925_txt_trade' and count = 1);`

sure enough, we get the answer as 11.

Not surprisingly, if we do not include the projection, we get 335 as the result like below (note how it is select * in both cases instead of select term):

`select count(*) from (select * from Frequency where docid = '10398_txt_earn' and count = 1 union select * from Frequency where docid = '925_txt_trade' and count = 1); `

**(d) count:** Write a SQL statement to count the number of documents containing the word “parliament”

A simple query to answer this question would be:

`select count(*) from Frequency where term = 'parliament';`

However, in other tables we could have multiple rows where the docid and terms are the same, which would give us duplicates. This cannot be the case here since the primary key consists of only (docid, term), which guarantees uniqueness in this regard. Nevertheless, for completeness here is the query that would satisfy that case as well:

`select count(distinct docid) from Frequency where term = 'parliament';`

**(e) big documents** Write a SQL statement to find all documents that have more than 300 total terms, including duplicate terms. (Hint: You can use the HAVING clause, or you can use a nested query. Another hint: Remember that the count column contains the term frequencies, and you want to consider duplicates.) (docid, term_count)

We can get the total number of terms in all documents with:

`select sum(count) from Frequency;`

Similarly, we can get the total number of terms in a given document as:

`select sum(count) from Frequency where docid = '925_txt_trade';`

In order to construct a query to answer the given question, we will need to make it a little more complex with a nested select statement.

`select count(*) from (select docid, sum(count) as sum from Frequency group by docid) where sum > 300;`

The nested select statement gives us a table with the sum of all the counts for the rows with each docid. the outer loop simply counts those rows that have a cumulative count (noted as the "sum" column) of more than 300.

An alternative way to write this would be to use the having command as:

`select count(*) from (select docid, sum(count) as sum from Frequency group by docid having sum > 300);`

This gives us the same result.

**(f) two words:** Write a SQL statement to count the number of unique documents that contain both the word 'transactions' and the word 'world'.

If we write a statement to get a table of all rows where either of the two terms appear:

`select docid from Frequency where term = 'transactions' or term = 'world';`

we will get a list of docids with repeats. we only want those docids with repeats. i.e. we want to get an intersection of those rows where the term is 'transactions' or the term is 'world' and the docid is the same. For this, we will need to keep the projection and separate out the query into two parts, combining it with an intersect

`select docid from Frequency where term = 'transactions' intersect select docid from Frequency where term = 'world';`

This lists out those docids that contain both words. Now to get the count we can just add a select count(*) around it:

`select count(*) from (select docid from Frequency where term = 'transactions' intersect select docid from Frequency where term = 'world');`

###

Problem 2: Matrix Multiplication in SQL

Now we need to work with the other database, matrix.db.

running a ".tables" command in sqlite tells us we have two tables, a and b, both representing different matrices

running a ".schema a" in sqlite tells us we have the following schema:

`CREATE TABLE a (`

`row_num int,`

`col_num int,`

`value int,`

`primary key (row_num, col_num));`

`select max(row_num) + 1 from A`

`select max(col_num) + 1 from A`

**(g) multiply:**Express A X B as a SQL query, referring to the class lecture for hints. (I need to turn in the cell(2,3) from the multiplied table).

`select * from A where row_num = 2;`

`select * from B where col_num = 3;`

`select a.value * b.value from A a, B b where a.row_num = 2 and b.col_num=3 and a.col_num=b.row_num;`

`select sum(mul) from (select a.value * b.value as mul from A a, B b where a.row_num = 2 and b.col_num=3 and a.col_num=b.row_num);`

### Problem 3: Working with a Term-Document Matrix

**(h) Similarity Matrix**

`select sum(a.count*b.count) from Frequency a, Frequency b where a.docid='10080_txt_crude' and b.docid='17035_txt_earn' and a.term=b.term;`

**(i) Keyword Search**

`select docid, sum(mult) as total from (select a.docid, a.term, a.count, b.count, a.count*b.count as mult from Frequency a, (select 'q' as docid, 'washington' as term, 1 as count UNION select 'q' as docid, 'taxes' as term, 1 as count UNION select 'q' as docid, 'treasury' as term, 1 as count) b where a.term=b.term) group by docid order by total asc;`

Jun2