By: Taylor W. Brown + Josh R. Bruce



What does this tutorial entail?

This is Part 3 of a three-part blog post on relational databases with my colleague and datadeuce, Josh R. Bruce.

Part 1) Explain SQL and relational databases vis-a-vis traditional R data storage

Part 2) Set up a SQLite database using R with WordNet data and demonstrate how to make basic SQL queries within R

Part 3) Conducts a bit of analysis using RSQLite and WordNet

*NOTE: The following tutorial assumes basic understanding of R.*



In Part 1 of this series we introduced the notion of relational databases and explained that they can be queried with a language called SQL. In Part 2, we showed how to load a relational database into R and query it with SQL-like syntax using the WordNet database. In this third and final part, we go over how to extract adjectives from WordNet and connect them to other semantically related adjectives, which we then visualize as a semantic network using Gephi.

Getting Adjectives and Their Semantic Relationships Out of WordNet

The first task is to load the WordNet database, which is discussed in detail in Part 2.

# Set working directory to the folder that contains WordNet SQL files
setwd("~/Google Drive/Projects Working/Measuring_Morality_(WordNet)")

# Load R library for querying SQL databases
library("RSQLite")

# Load the WordNet data
sqlite    <- dbDriver("SQLite")
WordNetdb <- dbConnect(sqlite,"sqlite-31.db")

Having set up the WordNet database, we want to extract two tables: one with information on every word in the database, and another with all semantically linked definitions. In WordNet, senses is the term of art for individual definitions, so we want to extract the “sensesXsemlinksXsenses” table. Importantly for our goal of creating a semantic network, this second table includes source and taget parts of speech (POS), so we know how each sense operates in English language. (NOTE: many of the following six steps can be executed using purely RSQLite syntax, which we’re happy to dictate upon request.)

# Make data frames of WordNet tables
all_words <- dbGetQuery(WordNetdb, 'select * from words')

# Extract table with semantically linked edge list, which includes all POS
semantically_linked_edge_list <- dbGetQuery(WordNetdb, 'select * from sensesXsemlinksXsenses')

Now that we have a data frame with the sematnically linked senses, we create a column that pastes the source and target POS for every row.

Create variable of all possible POS linkages

semantically_linked_edge_list$pos_linkages <-
  paste0(semantically_linked_edge_list$spos,'',semantically_linked_edge_list$dpos)

Then take a look at all possible POS linkages in the WordNet database…

Table of POS linkages and frequencies

as.data.frame(table(semantically_linked_edge_list$pos_linkages))

##    Var1   Freq
## 1   a_a   4129
## 2   a_n   3508
## 3   a_s  25554
## 4   n_a   3502
## 5   n_n 852998
## 6   n_r    342
## 7   n_s   2549
## 8   n_v   4230
## 9   r_n    342
## 10  s_a  25554
## 11  s_n   2549
## 12  v_n   4230
## 13  v_v 149636

From this, we can see that some types of POS ties appear reciprocal (e.g., there are 25,554 adjective –> satellite ties, and 25,554 satellite –> adjective ties), though this is not the case for adjective <–> adjective ties, which are an odd number. For some reason, adverbs are only linked to nouns, not to other adverbs or to adjectives. In the reduced edge list used later to make network visuals, we exclude adverbs.

For now, lets focus on adjectives and their satelites. “Satelites” in the WordNet terminology refer to peripheral concepts that orbit a central, defining concept. For example, as you’ll see below, an example of satelites are count descriptives (e.g., “first”, “second”, “third”, etc.) which are related to a central adjective (e.g., “ordinal”).

Reduce edge list to adjective and adjective satellites

adj_linked <- subset(semantically_linked_edge_list,
                     pos_linkages == 'a_a' |
                       pos_linkages == 'a_s' |
                       pos_linkages == 's_a')

# Rename columns to source and target for Gephi
adj_edge_list <- subset(adj_linked, select = c(swordid, dwordid))
colnames(adj_edge_list)[1] <- 'source'
colnames(adj_edge_list)[2] <- 'target' <br><br>

What this creates is the source and target ID numbers for the adjectives that are semantically linked in WordNet. These ID numbers are related to words, which we will get to a little farther down (the actual words are called lemmas in WordNet). The linkage is called an edge in network terminology. Edges connect nodes or the hubs in a network. So, individual adjectives are nodes, and the connections created by their semantic linkages are edges (for a handy graphic explaining these terms and more, check this out). (Also, see footnote).

We export this “edge list” as a CSV file, which then gets fed into Gephi to make the network visualizations you’ll see below.



Export edge list to CSV for use in Gephi

 write.csv(adj_edge_list_for_csv, file = 'adj_edge_list_26july16.csv',
          row.names = FALSE) <br><br>

Making the Node Information

The file we just exported is an “edge list”–or a list of connections between all adjectives and satellites in the WordNet database. In addition to those connections, Gephi–the network visualization tool that we will be using– requires that we create a “node list” that tells it the ID information used in the edge list (the source and target variables) as well as any information we want to have about those IDs. In this case, we want to label the ID numbers with the words they represent, rather than the sense ID numbers used by WordNet.

One of the things that we also have to account for when making the node list is that some nodes only appear on one side of an edge (i.e., they are not reciprocally named in the network). So we’ll combine all the words that are either sources or targets, and then get rid of duplicate entries.

# create node list from linked adjectives and adjective satellites,
# making sure to account for some nodes only being on one side or the other
node_list_1 <- subset(adj_linked, select = c(swordid))
colnames(node_list_1)[1] <- 'Id'

node_list_2 <- subset(adj_linked, select = c(dwordid))
colnames(node_list_2)[1] <- 'Id'

combined_node_lists <- rbind.data.frame(node_list_1,node_list_2)
combined_node_lists$duplicate <- duplicated(combined_node_lists$Id)

# see how many duplicates there are
table(combined_node_lists$duplicate)

##
## FALSE  TRUE
## 16817 93657

# retain the non-duplicate words
node_list <- subset(combined_node_lists, select = c(Id), duplicate == 'FALSE')

# create column for wordid names (i.e. 'lemmas'), called 'label' for Gephi
node_list$label <- ''

# lookup names for each word in node_list
for(i in 1:nrow(node_list)){
  node_list$label[i] <- all_words$lemma[which(all_words$wordid == node_list$Id[i])]
} <br><br>

Export node list to CSV for use in Gephi

write.csv(node_list, file = 'adj_node_list_26july16.csv', row.names = FALSE)



Visualizing in Gephi

We now have the edge and nodes lists of the entire adjective + satelite WordNet database! While we don’t go over how to visualize the semantic network in Gephi, the CSV files we created are easily imported to make network visualizations. Here’s one of many tutorials on getting data into Gephi.

Whole Adjective + Satellites Network



Zoomed In Portion of Adjective + Satellites Network



Single Head Adjective and Satellites

Below we zoom in and look at one head ajective “chromatic” and all of it’s satellies (word concepts like “green”, “yellow” “blue green”, etc)

Adjusting to look just adjectives, no satellites

While these make for cool visuals, and they give us a great way to see how WordNet structures semantic data in a more interpretable manner, they also generate a distorted idea of what the semantic network looks like. This comes from the large number of satelites around a few concepts, such as “chromatic,” seen above. By being connected to so many other nodes, despite those nodes having no outside connections, “chromatic” and similar concepts are overly priviledged in the network.

So… what if we got rid of those pesky satelites and just looked at the core adjectives? To do that, we just recreate the code above, but only retain “a_a” part of speech linkages in the sense-sense semantically linked data table.

Edges

# Reduce edge list to just adjective and adjective satellites  
adj_head_linked <- subset(semantically_linked_edge_list, pos_linkages == 'a_a')

# Rename columns to source and target for Gephi
adj_head_linked <- subset(adj_head_linked, select = c(swordid, dwordid))
colnames(adj_head_linked)[1] <- 'source'
colnames(adj_head_linked)[2] <- 'target'

# Remove duplicates
adj_head_linked$lookup <- paste0(adj_head_linked$source,'',adj_head_linked$target)
adj_head_linked$duplicate_indicator <- duplicated(adj_head_linked$lookup)
table(adj_head_linked$duplicate_indicator)

##
## FALSE  TRUE
##  4031    98

adj_head_linked_no_duplicates <- subset(adj_head_linked, duplicate_indicator == 'FALSE')

# create weight varialbe for the few that occurred more than once
adj_head_linked_no_duplicates$weight <- 0
for(i in 1:nrow(adj_head_linked_no_duplicates)){
  adj_head_linked_no_duplicates$weight[i] <- length(which(adj_head_linked$lookup == adj_head_linked_no_duplicates$lookup[i]))
}

adj_head_for_csv <- subset(adj_head_linked_no_duplicates,
                           select = c(source, target, weight))

# write.csv(adj_head_for_csv, file = 'adj_heads_only.csv', row.names = FALSE)

Nodes

# create node list from linked adjectives heads only,
# making sure to account for some nodes only being on one side or the other
head_node_list_1 <- subset(adj_head_linked, select = c(source))
colnames(head_node_list_1)[1] <- 'Id'
head_node_list_2 <- subset(adj_head_linked, select = c(target))
colnames(head_node_list_2)[1] <- 'Id'

combined_head_node_lists <- rbind.data.frame(head_node_list_1,head_node_list_2)
combined_head_node_lists$duplicate <- duplicated(combined_head_node_lists$Id)
table(combined_head_node_lists$duplicate)

##
## FALSE  TRUE
##  1570  6688

head_adj_node_list <- subset(combined_head_node_lists,
                             select = c(Id), duplicate == 'FALSE')

# create column for wordid lemmas, called 'label' for Gephi
head_adj_node_list$label <- ''

# lookup lemma for each word in head_adj_node_list
for(i in 1:nrow(head_adj_node_list)){
  head_adj_node_list$label[i] <- all_words$lemma[which(all_words$wordid == head_adj_node_list$Id[i])]
}

# export head_adj_node_list to CSV for use in Gephi
# write.csv(head_adj_node_list, file = 'head_adj_node_list_26july16.csv', row.names = FALSE)

Visualizing in Gephi, again

Adjective-Only Semantic Network

Ah, much better! This semantic network visual (created using Gephi’s built-in Fruchterman-Reingold graph drawing algorithm) shows us that “honorable” and “dishonoralbe” are two of the key adjective concepts in the WordNet data base. Nodes here are sized by their degree, or the number of connections, both in- and out-going, each word has. We can see that there are a lot of connected concepts, but also a decent number of small clusters around the periphery.

Thank you for playing!

We hope this tutorial on RSQLite and WordNet has proven helpful for getting started with relational databases in R, as well as giving a sense of what is possible with the information they contain.


Footnote: In WordNet, multiple senses are often represented by the same word. For example, “good” may mean “having desirable or positive qualities especially those suitable for a thing specified” or “morally admirable” (both are actual definitions in WordNet). These will have unique sense ID numbers, but the same word ID number for “good.” Making things even trickier, “good” with its multiple meanings may be tied to another adjective (e.g., “bad”) that has multiple meanings of its own. We could visualize these connections using the sense ID number, which would create individual nodes for every possible meaning, which would then all be labeled with the same words, “good” and “bad.” We opted not to do this for visual simplicity, instead leaving duplicate edges in the edge list to account for the fact that two words may be linked by multiple semantic connections.