By: Taylor W. Brown + Josh R. Bruce

What does this tutorial entail?

This is Part 2 of a three-part blog post on relational databases with my colleague and datadeuce, Josh R. Bruce.

Part 1) Explain SQL and relational databases vis-a-vis traditional R data storage

Part 2) Set up a SQLite database using R with WordNet data and demonstrate how to make basic SQL queries within R

Part 3) Conduct a bit of analysis using RSQLite and WordNet

*NOTE: The following tutorial assumes basic understanding of R.*

In Part 1 of this series we introduced the notion of relational databases and explained that they can be queried with a language called SQL. In this, Part 2, we will show how to load a relational database into R and query it with SQL-like syntax. To do so, we will be using the WordNet database.

What is WordNet?

WordNet is a large English lexical database produced by the Cognitive Science Laboratory of Princeton University, with help from funding by the National Science Foundation (thanks, NSF !). It organizes 155,287 nouns, verbs, adjectives and adverbs into 117,659 sets of cognitive synonyms–what they call “synsets”–each of which describes a distinct concept. For example, the concept/synset defined as “the act of departing politely” is captured by words like “farewell”, “leave”, “leave-taking.”

As you can tell, WordNet is a pretty cool database with a lot of potential for studying concepts in discourse and/or communities of meaning in the English language. (For those Python users out there, some of the NLTK algorithms are based on WordNet word-similarities. That is neat.) But how do you set up WordNet in R!?

Getting started with WordNet using RSQLite

There are several ways to work with the WordNet database in R, but as this is a tutorial on relational databses, that’s what we’ll be using.

Download SQL files

The first thing we have to do is download WordNet as a set of SQL files, or essentially a bunch of tables: download here. Then we point our working directory to those files and load the necessary R packages for interacting with them.

# Set working directory to the folder that contains WordNet SQL files
setwd("~/Google Drive/Projects Working/Measuring_Morality_(WordNet)")

# Load R library for querying SQL databases

RSQLite is a package that allows R users to work with relational databases and execute SQL-like commands. It’s built off of SQLite, which is a serverless database engine–meaning that it uses SQL to read and write files directly on your local computer. There are other options (e.g. RMySQL for working with MySQL, RPostgreSQL for working with PostgreSQL), but RSQLite will get us started without dealing with client-server stuff.

We won’t go into detail of SQL syntax, as there are already great tutorials for that – at Khan Academy or Code Academy, for example–which we highly recommend going through before tackling RSQLite. Rather, we’ll give you a sense of how to interact with a relational database using RSQLite.

Connect to database

Ok, now that we’re ready to interact with the data, let’s connect to it in our R environment:

# Load the WordNet data
sqlite    <- dbDriver("SQLite")
WordNetdb <- dbConnect(sqlite,"sqlite-31.db")

Remember, relational databases disperse data across several linked tables. In this case, we have 98 WordNet tables in our WordNetdb database, to be exact. One of the benefits of working with relational databases like this is that you don’t have to load all of the data into your environment in order to examine it. To see a list of our tables, we use the following RSQLite command:


##  [1] "adjectiveswithpositions"  "adjpositions"
##  [3] "adjpositiontypes"         "bncconvtasks"
##  [5] "bncimaginfs"              "bncs"
##  [7] "bncspwrs"                 "casedwords"
##  [9] "dict"                     "glfs"
## [11] "ilfs"                     "legacy20"
## [13] "legacy21"                 "lexdomains"
## [15] "lexlinks"                 "linktypes"
## [17] "logs"                     "morphmaps"
## [19] "morphology"               "morphs"
## [21] "postypes"                 "samples"
## [23] "samplesets"               "semlinks"
## [25] "sensemaps2021"            "sensemaps2030"
## [27] "sensemaps2130"            "sensemaps3031"
## [29] "senses"                   "senses20"
## [31] "senses21"                 "senses30"
## [33] "sensesXlexlinksXsenses"   "sensesXsemlinksXsenses"
## [35] "sensesXsynsets"           "sqlite_sequence"
## [37] "sumoarg0maps"             "sumoarg1maps"
## [39] "sumoarg2maps"             "sumoarg3maps"
## [41] "sumodisjointformulas"     "sumodisjoints"
## [43] "sumodomainformulas"       "sumofiles"
## [45] "sumoformulas"             "sumoinstanceformulas"
## [47] "sumoinstances"            "sumomaps"
## [49] "sumoparsemaps"            "sumopartitionformulas"
## [51] "sumorelations"            "sumorules"
## [53] "sumosubclasses"           "sumosubclassformulas"
## [55] "sumosubrelationformulas"  "sumosubrelations"
## [57] "sumotermattrs"            "sumoterms"
## [59] "synsetmaps2031"           "synsetmaps2131"
## [61] "synsetmaps3031"           "synsets"
## [63] "synsets20"                "synsets21"
## [65] "synsetsXsemlinksXsynsets" "verbnetframes"
## [67] "verbnetroles"             "verbnetrolesframes"
## [69] "verbswithframes"          "vframemaps"
## [71] "vframes"                  "vframesentencemaps"
## [73] "vframesentences"          "vnclasses"
## [75] "vnclassmembers"           "vnclassmembersenses"
## [77] "vnexamplemaps"            "vnexamples"
## [79] "vnframemaps"              "vnframenames"
## [81] "vnframes"                 "vnframesubnames"
## [83] "vngroupingmaps"           "vngroupings"
## [85] "vnpredicatemaps"          "vnpredicates"
## [87] "vnrolemaps"               "vnroles"
## [89] "vnroletypes"              "vnselrestrs"
## [91] "vnselrestrtypes"          "vnsemantics"
## [93] "vnsyntaxes"               "words"
## [95] "wordsXsenses"             "wordsXsensesXsynsets"
## [97] "xwnparselfts"             "xwnwsds"

Interact with database

Now, we could load each of these tables as a dataframe and simply execute traditional R code. Easy as that is, however (and we will be doing some of it when we work with this database in our next tutorial), it misses the point of structuring the data across 98 tables. That’s where SQL syntax comes in. As you’ll see throughout the rest of this tutorial, RSQLite commands (aka queries) are structured like SQL queries: command(<databasename> , "<query>")

For example, we can use SQL-like commands to explore some of these 98 tables. Maybe the synsets table is particularly important, since synsets themselves are a central component of WordNet (remember, synsets == synonym sets).

# List all of the fields (i.e. columns) in this table
dbListFields(WordNetdb, "synsets")

## [1] "synsetid"    "pos"         "lexdomainid" "definition"

So it looks like this table provides all of the synonym group definitions within WordNet. Each row in the synset table has a synsetid, a definition, a pos (parts of speech field), and a lexdomainid (which links to the lexdomain table).

Now, if we wanted to extract just the synsetid and definitions from this table, our command would look something like this:

# Get the synsetid and definition as a dataframe
all_synsets <- dbGetQuery(WordNetdb,'SELECT synsetid, definition FROM synsets' )
synsetid definition
100001740 that which is perceived or known or inferred to have its own distinct existence (living or nonliving)
100001930 an entity that has physical existence
100002137 a general concept formed by extracting common features from specific examples
100002452 a separate and self-contained entity
100002684 a tangible and visible entity; an entity that can cast a shadow
100003553 an assemblage of parts that is regarded as a single entity
100003993 a whole (a thing or person) of the same kind or category as another
100004258 a living (or once living) entity
100004475 a living thing that has (or can develop) the ability to act or function independently
100005787 organisms (plants and animals) that live at or near the bottom of a sea

How many synsets are there total? 117,791


## [1] 117791

What if we wanted to see all synsets with a definition that includes a notion of ‘space’?

# Extract synsets with a definition that includes something like 'space'
space_synsets <- dbGetQuery(WordNetdb, "SELECT synsetid, definition FROM synsets WHERE definition LIKE '%space%'")

# Print the first 50 definitions meeting this criteria:

##  [1] "that which has mass and occupies space"
##  [2] "a point or extent in space"
##  [3] "physical space independent of what occupies it"
##  [4] "(physics) an ideal space in which the coordinate dimensions represent the variables that are required to describe a system or substance"
##  [5] "the act of putting two things together with no space between them"
##  [6] "the launching of a missile or spacecraft to a specified destination"
##  [7] "the launching of a spacecraft to the moon"
##  [8] "the act of forcing yourself (or being forced) into or through a restricted space"
##  [9] "any kind of physical activity outside a spacecraft by one of the crew"
## [10] "a landing of a spacecraft in the sea at the end of a space flight"
## [11] "injection of an anesthetic substance into the epidural space of the spinal cord in order to produce epidural anesthesia"
## [12] "act of extending over a wider scope or expanse of space or time"
## [13] "a game in which two players alternately put crosses and circles in one of the compartments of a square grid of nine spaces; the object is to get a row of three crosses or three circles before the opponent does"
## [14] "performance of duties or provision of space and equipment helpful to others"
## [15] "counting backward from an arbitrary number to indicate the time remaining before some event (such as launching a space vehicle)"
## [16] "a method of controlling the flight of a missile or spacecraft by reference to the positions of celestial bodies"
## [17] "roentgenography of the spinal cord to detect possible lesions (usually after injection of a contrast medium into the subarachnoid space)"
## [18] "removal by centesis of fluid from the subarachnoid space of the lumbar region of the spinal cord for diagnostic or therapeutic purposes"
## [19] "the act of turning a vehicle around in a limited space by moving in a series of back and forward arcs"
## [20] "a United States Air Force defense laboratory responsible for discovering and developing and integrating fighting technologies for aerospace forces"
## [21] "a chamber that provides access to space where air is under pressure"
## [22] "an arrangement of aerials spaced to give desired directional characteristics"
## [23] "a navigational device that automatically keeps ships or planes or spacecraft on a steady course"
## [24] "one of a number of closely spaced supports for a railing"
## [25] "a car that has a long body and rear door with space behind rear seat"
## [26] "the enclosed compartment of an aircraft or spacecraft where passengers are carried"
## [27] "a hidden storage space (for money or provisions or weapons)"
## [28] "the space in a ship or aircraft for storing cargo"
## [29] "(printing) the receptacle in which a compositor has his type, which is divided into compartments for the different letters, spaces, or numbers"
## [30] "a wall formed of two thicknesses of masonry with a space between them"
## [31] "the overhead upper surface of a covered space"
## [32] "storage space where wines are stored"
## [33] "a natural or artificial enclosed space"
## [34] "structure consisting of a row of evenly spaced columns"
## [35] "a space module in which astronauts can live and control the spacecraft and communicate with earth"
## [36] "a space into which an area is subdivided"
## [37] "a tower with an elevated workspace enclosed in glass for the visual observation of aircraft around an airport"
## [38] "a space heater that transfers heat to the surrounding air by convection"
## [39] "game equipment (as a piece of wood, plastic, or ivory) used for keeping a count or reserving a space in various card or board games"
## [40] "a vehicle designed for navigation in or on water or air or through outer space"
## [41] "a notch or open space between two merlons in a crenelated battlement"
## [42] "a board with pegs and regularly spaced holes for holding the pegs; used for keeping the score in a game of cribbage"
## [43] "a small room (or recess) or cabinet used for storage space"
## [44] "a soft thread for cleaning the spaces between the teeth"
## [45] "a personal computer small enough to fit conveniently in an individual workspace"
## [46] "the entrance (the space in a wall) through which you enter or leave a room or building; the space that a door can close"
## [47] "a window with two panes of glass and a space between them; reduces heat and noise transmission through the window"
## [48] "an enclosed space for producing reverberation of a sound"
## [49] "a small electric space heater"
## [50] "(printing) the narrowest of the spaces used to separate words or letters"

Well that’s cool, but obviously it’s not as much fun to only see the synonym group (i.e. synset) definitions and not the words associated with them. If we had the words, we could see which were in the same synonym group. For that we will need to link with two other tables out of the 98 in the WordNet database: the words table and the senses table.

The words table includes two fields: wordid and lemma (base word):

# List all of the fields in this table
dbListFields(WordNetdb, "words")

## [1] "wordid" "lemma"

# Get the words table as a dataframe
all_words <- dbGetQuery(WordNetdb,'SELECT * from words' )
wordid lemma
1000 1000 acceptance sampling
1001 1001 acceptant
1002 1002 acceptation
1003 1003 accepted
1004 1004 accepting
1005 1005 acceptive
1006 1006 acceptor
1007 1007 acceptor rna
1008 1008 access
1009 1009 access code
1010 1010 access road

The synsets and the words tables are linked together via the senses table. Remember, this inter-table structure is the key point of relational databases. So let’s take a look at the senses table:

# List all of the fields (i.e. columns) in this table
dbListFields(WordNetdb, "senses")

## [1] "wordid"      "casedwordid" "synsetid"    "senseid"     "sensenum"
## [6] "lexid"       "tagcount"    "sensekey"

# Get the senses table as a dataframe
all_senses <- dbGetQuery(WordNetdb,'SELECT * from senses' )
wordid casedwordid synsetid senseid sensenum lexid tagcount sensekey
100 144 23 106446496 202 1 0 0 2_samuel%1:10:00::
101 178 NA 110588455 205 1 0 0 2nd_lieutenant%1:18:00::
102 179 NA 113766184 206 1 0 9 3%1:23:00::
103 180 24 106631444 209 1 0 0 3-d%1:10:00::
104 180 24 105948666 208 2 0 0 3-d%1:09:00::
105 182 NA 100476414 211 1 0 1 3-hitter%1:04:00::
106 184 NA 113771547 213 1 0 4 30%1:23:00::
107 185 NA 115252958 217 1 0 15 30_minutes%1:28:00::
108 200 NA 115226966 230 1 0 0 365_days%1:28:00::
109 201 NA 115227065 231 1 0 0 366_days%1:28:00::
110 209 25 106631444 240 1 0 0 3d%1:10:00::

We’ll not go further into the details of WordNet here (e.g. what the sensekey, sensenum, etc. columns are). Obviously there are 95 other tables and tons of variables to explore, all of which may hold content of interest to you. The point of this tutorial is just to show you how to load and get the hang of these sorts of relational databases. For right now, then, let’s just link the synset table with the words table via the senses table so that we can see words that are synonyms. This sort of inter-table querying is SQLs strong suit, so get excited!

SQL has an amazing ability to interact with the relational tables by making inter-table commands that essentially say things like “IF a unit has X value in table 1 and EXISTS in table 2 THEN EXTRACT the Y variable from table 3 and summarize it by group Z.” RSQLite doesn’t have the same level of functionality, but it’s pretty close.

SOOOOOOOO….what can we do now? We want to link our synsets table with our words table, and for that we need the senses table. Next we’ll show a few examples of what you can do to accomplish this–one that uses base R code and one that is purely SQL via RSQLite.

Get synonyms out of WordNet

OPTION 1: Using a function and base R code

Define a function that takes the synsetid of interest and returns the words associated with it

getWords <- function(syn_id){
  wordids_df <- subset(all_senses, synsetid == syn_id)
  wordids <- wordids_df$wordid
  words_df <- subset(all_words, wordid %in% wordids)
  words <- words_df$lemma

Pick a synsetid by looking in the synsets table (i.e. the all_synsets dataframe that we created above), then enter that id into the function. For example, ‘100056907’ is the synsetid for the synonym group with the following definition: migration from a place (especially migration from your native country in order to settle in another).

Let’s input ‘100056907’ into our function and see what words are associated with it:


## [1] "emigration"    "expatriation"  "out-migration"

Makes sense to us!

Using this function we could find all of the words that have something to do with the notion of ‘saving’:

# (1) Make a dataframe of all synsets related to "saving"
saving_synsets_df <- dbGetQuery(WordNetdb, "SELECT synsetid, definition FROM synsets WHERE definition LIKE '%saving%'")

# (2) Extract the synsetids related to these synsets
saving_synsetids <- saving_synsets_df$synsetid

# (3) For each synsetid, run the function defined above and add the result to a list:
saving_words <- c()

for (i in 1:length(saving_synsetids)){
  temp_words <- getWords(saving_synsetids[i])
  saving_words <- c(saving_words, temp_words)

# (4) Print the list of words associated with the notion of "saving"

##  [1] "recovery"                   "retrieval"
##  [3] "lifesaving"                 "redemption"
##  [5] "salvation"                  "salvage"
##  [7] "salvation"                  "economy of scale"
##  [9] "banking"                    "401-k"
## [11] "401-k plan"                 "aba transit number"
## [13] "bank identification number" "bin"
## [15] "rocket"                     "skyrocket"
## [17] "credit union"               "thrift institution"
## [19] "building society"           "msb"
## [21] "mutual savings bank"        "federal savings bank"
## [23] "fsb"                        "luddite"
## [25] "disposable income"          "m3"
## [27] "savings account trust"      "savings bank trust"
## [29] "totten trust"               "trust account"
## [31] "trustee account"            "deposit account"
## [33] "time deposit account"       "dormant account"
## [35] "passbook savings account"

Kinda neat, huh!? If you also wanted to see the synset definitions associated with those words:


##  [1] "the act of regaining or saving something lost (or in danger of becoming lost)"
##  [2] "saving the lives of drowning persons"
##  [3] "(theology) the act of delivering from sin or saving from evil"
##  [4] "the act of saving goods or property that were in danger of damage or destruction"
##  [5] "saving someone or something from harm or from an unpleasant situation"
##  [6] "the saving in cost of production that is due to mass production"
##  [7] "engaging in the business of keeping money for savings and checking accounts or for exchange or for issuing loans and credit etc."
##  [8] "a retirement savings plan that is funded by employee contributions and (often) matching contributions from the employer; contributions are made from your salary before taxes and the funds grow tax-free until they are withdrawn, at which point they can be converted into an IRA; funds can be transferred if you change employers and you can (to some extent) manage the investments yourself"
##  [9] "an identification number consisting of a two-part code assigned to banks and savings associations; the first part shows the location and the second identifies the bank itself"
## [10] "propels bright light high in the sky, or used to propel a lifesaving line or harpoon"
## [11] "a cooperative depository financial institution whose members can obtain loans from their combined savings"
## [12] "a depository financial institution intended to encourage personal savings and home buying"
## [13] "British equivalent of United States savings and loan association"
## [14] "a state-chartered savings bank owned by its depositors and managed by a board of trustees"
## [15] "a federally chartered savings bank"
## [16] "one of the 19th century English workmen who destroyed laborsaving machinery that they thought would cause unemployment"
## [17] "income (after taxes) that is available to you for saving or spending"
## [18] "a measure of the money supply; M2 plus deposits at institutions that are not banks (such as savings and loan associations)"
## [19] "a savings account deposited by someone who makes themselves the trustee for a beneficiary and who controls it during their lifetime; afterward the balance is payable to the previously named beneficiary"
## [20] "a savings account in which the deposit is held for a fixed term or in which withdrawals can be made only after giving notice or with loss of interest"
## [21] "a savings account showing no activity (other than posting interest) for some specified period"
## [22] "a savings account in which deposits and withdrawals are recorded in the depositor's passbook"

Ok, we’ve accomplished our task, but we did it with base R code, not RSQLite. How would it look if we wanted to practice our SQL chops?

OPTION 2: Using RSQLite and SQL-like code

Now, remember that the words table only contains wordids and names of words (lemmas), the synsets table only contains synsetids and definitions of synsets, and the senses table contains wordids and their associated synsetid. What we need is a table that contains all of these variables together. That is, we need a table that gives us wordid, synsetid, and the definition of that synsetid. If we can get that, then all we’d have to do to see synonyms is subset by synsetid.

So, how do we create such a table? We JOIN the tables. Joining can get pretty confusing, so we’ll just show you one form (explicit join) and encourage you to review JOINing in one of the SQL tutorials that we recommended above.

words_and_synsets <- dbGetQuery(WordNetdb, 'SELECT synsetid, wordid, lemma FROM words JOIN senses USING(wordid)')
synsetid wordid lemma
108659519 1 'hood
108970180 2 's gravenhage
400252367 3 'tween
400500491 4 'tween decks
104510146 5 .22
303157978 6 .22 caliber
303157978 7 .22 calibre
303157978 8 .22-caliber
303157978 9 .22-calibre
303158270 10 .38 caliber
303158270 11 .38 calibre
303158270 12 .38-caliber
303158270 13 .38-calibre
303158563 14 .45 caliber
303158563 15 .45 calibre
303158563 16 .45-caliber
303158563 17 .45-calibre
113764498 18 0
302193771 18 0
113764713 19 1

Basically the command above says the following:

  • SELECT synsetid, wordid, lemma FROM words : select synsetid, wordid, lemma columns from the table words

  • JOIN senses USING(wordid) : then join that table with the senses table using wordid as the unique identifier

That got us part of the way, but we still don’t have synset definitions. For that we’ll need two JOIN commands–the one above and another that includes the synsets table:

words_and_synsetdefs <- dbGetQuery(WordNetdb, 'SELECT synsetid, wordid, lemma, definition FROM (SELECT synsetid, wordid, lemma FROM words JOIN senses USING(wordid)) JOIN synsets USING(synsetid)')
synsetid wordid lemma definition
100 302205377 87 165 being five more than one hundred sixty
101 302218965 88 165th the ordinal number of one hundred sixty-five in counting order
102 302211800 89 16th coming next after the fifteenth in position
103 113769865 90 17 the cardinal number that is the sum of sixteen and one
104 302195634 90 17 being one more than sixteen
105 108059288 91 17 november a Marxist-Leninist terrorist organization in Greece that is violently opposed to imperialism and capitalism and NATO and the United States; an active terrorist group during the 1980s
106 302205497 92 170 being ten more than one hundred sixty
107 302219105 93 170th the ordinal number of one hundred seventy in counting order
108 113773298 94 1728 a cardinal number equal to one dozen gross
109 302205613 95 175 being five more than one hundred seventy
110 115174893 96 1750s the decade from 1750 to 1759
111 302219241 97 175th the ordinal number of one hundred seventy-five in counting order
112 115175264 98 1760s the decade from 1760 to 1769
113 115175750 99 1770s the decade from 1770 to 1779
114 115176135 100 1780s the decade from 1780 to 1789
115 115176521 101 1790s the decade from 1790 to 1799
116 302211906 102 17th coming next after the sixteenth in position
117 113770005 103 18 the cardinal number that is the sum of seventeen and one
118 302195747 103 18 being one more than seventeen
119 114663447 104 18-karat gold an alloy that contains 75 per cent gold
120 302205738 105 180 being ten more than one hundred seventy

This command said:

  • SELECT synsetid, wordid, lemma, definition : select synsetid, wordid, lemma, definition columns

  • FROM (SELECT synsetid, wordid, lemma FROM words JOIN senses USING(wordid)) : from the first join dataframe that we made above

  • JOIN synsets USING(synsetid) : and join that with our synsets table using synsetid as the unique identifier

Great! We have the table we wanted. Now all we have to do to get synonyms is subset by whichever synsetid is of interest (which could also be executed with a WHERE command in the SQL above):

migration <- subset(words_and_synsetdefs, synsetid == '100056907')
synsetid wordid lemma definition
59111 100056907 41967 emigration migration from a place (especially migration from your native country in order to settle in another)
63119 100056907 44864 expatriation migration from a place (especially migration from your native country in order to settle in another)
130429 100056907 94886 out-migration migration from a place (especially migration from your native country in order to settle in another)

Well that’s it with WordNet for now. We will be conducting more in-depth analysis of WordNet, using RSQLite and some network analysis, in Part 3 of the tutorial. There are still a couple more tricks associated with RSQLite that might be useful, however, so let’s take a look at them.

More RSQLite tricks

RSQLite includes one SQLite database that contains all 86 dataframes in the R datasets package.

# First, disconnect from the WordNet database

# Now connect with the database of all datasets from the datasets package
db <- datasetsDb()

# List the tables included:

##  [1] "BOD"              "CO2"              "ChickWeight"
##  [4] "DNase"            "Formaldehyde"     "Indometh"
##  [7] "InsectSprays"     "LifeCycleSavings" "Loblolly"
## [10] "Orange"           "OrchardSprays"    "PlantGrowth"
## [13] "Puromycin"        "Theoph"           "ToothGrowth"
## [16] "USArrests"        "USJudgeRatings"   "airquality"
## [19] "anscombe"         "attenu"           "attitude"
## [22] "cars"             "chickwts"         "esoph"
## [25] "faithful"         "freeny"           "infert"
## [28] "iris"             "longley"          "morley"
## [31] "mtcars"           "npk"              "pressure"
## [34] "quakes"           "randu"            "rock"
## [37] "sleep"            "stackloss"        "swiss"
## [40] "trees"            "warpbreaks"       "women"

Subset tables

Let’s look only at the “USArrests” dataset, which contains statistics, in arrests per 100,000 residents, for each of the 50 US states in 1973.

dbReadTable(db, "USArrests")

##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6

In particular, let’s subset this dataset to only see states where the rape rate was higher than 30 arrests per 100,000.

dbGetQuery(db, "SELECT row_names, Rape FROM USArrests WHERE Rape > 30")

##    row_names Rape
## 1     Alaska 44.5
## 2    Arizona 31.0
## 3 California 40.6
## 4   Colorado 38.7
## 5    Florida 31.9
## 6   Michigan 35.1
## 7     Nevada 46.0
## 8 New Mexico 32.1

Obviously this approach is extendable to all sorts of subsetting. Just practice.

Summarize/Compute data

The initExtension extension in RSQLite offers a bunch of functions for summarizing and computing statistics about data. For example, let’s find the mean assault rate across all states:


dbGetQuery(db, "SELECT stdev(Assault) FROM USArrests")

##   stdev(Assault)
## 1       83.33766

Or maybe we want the min and max:

AssaultRange <- dbGetQuery(db,"SELECT min(Assault),max(Assault) FROM USArrests")

There is a lot you can do with these summary/compute extensions–especially when you combine them with the JOINing that we explored above. The available extensions are:

  • Math functions : acos, acosh, asin, asinh, atan, atan2, atanh, atn2, ceil, cos, cosh, cot, coth, degrees, difference, exp, floor, log, log10, pi, power, radians, sign, sin, sinh, sqrt, square, tan, tanh

  • String functions: charindex, leftstr, ltrim, padc, padl, padr, proper, replace, replicate, reverse, rightstr, rtrim, strfilter, trim

  • Aggregate functions stdev, variance, mode, median, lowerquartile, upperquartile

Add a table of your own

So, our ‘db’ database comes with all sorts of tables/datasets, but what if we wanted to add one of our own? For example, the PASWR package in R has an example dataset called “Soccer” that contains how many goals were scored in the regulation 90 minute periods of World Cup soccer matches from 1990 to 2002. Let’s load this dataset and add it to our db database:

# Load the package and dataset

# Add the dataset to our database
dbWriteTable(db, "Soccer", Soccer)

# Check that the table now exists in our databset
dbExistsTable(db, "Soccer")

## [1] "TRUE"

Bravo! For other datasets that you might want to add, see this great resource that lists the sample datasets included in various R packages.

If we wanted to write our own table from scratch and add it to our database, the syntax would look something like this (with your specific information filled in):

dbWriteTable(databasename, newtablename,
             value, row.names = NA, overwrite = FALSE, append = FALSE,
             field.types = NULL)


That was a lot of information, but hopefully it gave you a sense of how to work with relational databases in R. As we’ve already said, you don’t have to use RSQLite to do this; you can also just use base R code or, likely better, dataframe manipulation packages like dplyr. That said, we maintain that there is a lot of potential for computational social scientists in starting to use–or at least knowing how to use– relational databases. Not only do they hold great potential for computational social scientists (see Part 1 of this tutorial series), but also because many large datasets of interest to social scientists are only available in SQL form or something messier (e.g. the entire Wikipedia corpus, which Taylor will walk you though in a later blogpost).

So give it a try! And see Part 3 of this series for a demonstration of how these methods can be applied in the context of network analysis, again using the WordNet database.