Part 1 of a three-part series on relational databases for social
scientists, produced with my colleague and datadeuce, Josh R.
What does this tutorial series entail?
Part 1) Explain SQL and relational databases vis-a-vis traditional R data storage
Part 2) Set up a SQLite database using R with WordNet data and demonstrate how to make basic SQL queries within R
Part 3) Conduct a bit of analysis using RSQLite and WordNet
What is SQL?
It’s a language for querying relational databases
SQL (Structured Query Language) is a programming language specifically designed for storing and
retrieving the data held in relational databases. Now, sociologists have a
long history with the word ‘relational,’ and we should be clear that the
designers of relational databases do not care. So, as best you can
(sociologists), leave your opinions and clever quips about the merits of
White, Tilly, et al. behind.
What is a relational database?
In short, it’s a database that structures information across many tables
(think dataframes or spreadsheets), each of which is linked by a set of
unique ids (often called ‘keys’).
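To make that concrete, here is a minimal sketch in R (with invented names and columns, not real data) of two tables linked by a shared id column:

```r
# A 'persons' table: one row per suffragette, each with a unique id
persons <- data.frame(
  id         = c(1, 2, 3),
  name_first = c("Emmeline", "Ida", "Alice")
)

# An 'activities' table: one row per activity, pointing back at
# persons via the shared id column
activities <- data.frame(
  id       = c(1, 1, 2),   # refers to persons$id
  activity = c("march", "petition", "boycott")
)

# The shared ids let us recombine the tables whenever we need to
linked <- merge(persons, activities, by = "id")
linked
```

Because each table stores one kind of thing, the merge (in SQL terms, the join) is what reunites them on demand.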
How is this different from traditional R data storage?
To be honest, when social scientists conduct network analysis in R, they likely store their data in a manner very similar to a relational database. Let’s assume that we’re talking to non-network scientists, however–people using R in the most traditional format, with data stored in a dataframe (essentially a rectangular grid with rows and columns)…like this COMPLETELY MADE-UP data about international suffragettes and their protest activities:
With this storage setup, an R user might ‘reshape’ the dataframe using
basic R code or special packages like
dplyr. Say, for instance, that they
wanted to get a list of women’s club presidents and a count of their activities.
In that case, the syntax might look something like this:
```r
# OPTION 1: using base R commands:
df$activities_sum <- rowSums(df[c(8:13)])    # sum the activities
presidents <- subset(df, club_president==1)  # select only presidents
president_acts1 <- presidents[c(2:3, 14)]    # select only the presidents + activity count

# OPTION 2: using dplyr:
library(dplyr)
president_acts2 <- df %>%
  # select only club presidents
  filter(club_president == 1) %>%
  # create sum of their activities
  mutate(activities_sum = march + petition + boycott +
           public_speech + club_member + sit_in) %>%
  # select only the presidents + activity count
  select(name_first, name_second, activities_sum)

# Either way, the output looks like this:
knitr::kable(president_acts2)
```
Now let’s turn to relational databases and SQL. A relational database would take the same data above and structure it across several linked ‘tables.’ It could look, possibly, like this:
Users would then use SQL commands to ‘construct queries’ and ‘return results sets.’ Something like this (which completes the same task as above):
```sql
SELECT name_first, name_second, activities_sum
FROM (SELECT * FROM persons WHERE club_president = 1) persons
LEFT JOIN (
    SELECT id, COUNT(DISTINCT id_activity) AS activities_sum
    FROM person_activity
    GROUP BY id
) person_activity
ON persons.id = person_activity.id
```
In both storage styles (dataframe vs. relational tables) and syntax
types (R vs. SQL) the user is doing the same thing: simply modifying the
structure of the data grids without changing the underlying data itself.
In this case the SQL syntax may look much more complex, but that’s not
inevitable. When datasets become huge, complex, and dispersed (see
below), SQL can be much clearer than R code.
Why might a social scientist want to use a relational database and SQL?
There are several strengths to relational databases and SQL syntax, many of which will not apply to your average social scientist. For a computational social scientist, however, they may be clutch. Let’s take a look…
(1) Non-repeating (i.e. efficient) data storage
Look again at the tables above. In the first, traditional R, example we had a column for each suffragette activity (e.g. march, petition, etc.), along with an entry for each individual. In the relational database, we condensed these columns into a separate table with an activity name and a unique id for each activity; this table was then linked to a table with information about the suffragettes via those unique ids. Network scientists do this kind of edge-listing all the time in R or SAS to save memory space and quicken computing time. In relational databases, this formatting becomes second nature.
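As a rough sketch (using a tiny toy dataframe with assumed column names, not the full example above), here is how wide activity columns can be condensed into that kind of long, linked table in base R:

```r
# Wide format: one indicator column per activity (toy data)
df <- data.frame(
  id       = c(1, 2),
  march    = c(1, 0),
  petition = c(1, 1)
)

# Long ('edge list') format: one row per person-activity pair, so each
# activity is a value rather than a repeated column
long <- reshape(df, direction = "long",
                varying = c("march", "petition"),
                v.names = "did_it", timevar = "activity",
                times   = c("march", "petition"))
person_activity <- long[long$did_it == 1, c("id", "activity")]
person_activity
```

The resulting `person_activity` table only stores the pairs that actually occurred, which is where the memory savings come from.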
(2) Storing complex, multi-modal data
It takes a bit of imagination at first, but I’m sure you can see how a relational database might be useful–particularly for very large, complex datasets with meta information about several objects or units of analysis (e.g. individuals, organizations, AND countries or states, etc). For example, what if in addition to the metadata on individuals that we displayed above, we also had metadata for each protest activity (e.g. a qualitative description, a date, a location, a participant count, etc). Which storage approach do you think would handle this better? And what if we had even more information about the friends of each individual, or the background of protest activity locations (e.g. population, demographics, laws, etc)? Wouldn’t those be more easily stored in additional tables than added onto one dataframe? Not only would this clean up the appearance and efficiency of your data storage, but if you ever wanted to just analyze the locations, say, then a table without all the other data would be ready and waiting for you.
Such data complexity is tightly related to so-called ‘multimodal’ datasets in network science. In multimodal networks people can be tied together in more ways than one (e.g. through friendship-, kinship-, professional-, activity-, AND geographic- ties). These sorts of networks and their analysis are at the cutting edge of network science, and with the rise of social media and open-source data, they are becoming increasingly accessible and ginormous (e.g. millions of Facebook users, hospital patients, site visitors, etc). Relational databases can make them manageable.
I think you get the point…for the modern sociologist trying to deal with the onslaught of data coming at us, there is potential here.
(3) Complex querying support
This is what the language SQL is for!
(4) Sensitive data protection
In relational databases, it’s simple to keep private or sensitive
information hidden in one table during analysis, or when you have to
share data without revealing the identities of participants.
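Here is a minimal sketch of that idea in R (all table and column names are our own invention): identifying details live in one table, analysis variables in another, and only the de-identified table is shared.

```r
# Identifying info stays in one restricted table...
identities <- data.frame(id   = c(1, 2),
                         name = c("Emmeline Pankhurst", "Ida B. Wells"))

# ...while the table you analyze or share carries only anonymous ids
responses <- data.frame(id = c(1, 2), n_activities = c(4, 6))

# Re-identification requires a join that only someone with access to
# the 'identities' table can perform
deanonymized <- merge(identities, responses, by = "id")
```

Collaborators who receive only `responses` can run the full analysis without ever seeing a name.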
If you think a relational database might be useful to you, we strongly encourage completing (not just enrolling in ;)) a free SQL tutorial, such as those offered by Khan Academy or Codecademy. You might also look into NoSQL databases, such as MongoDB (which we will probably touch on with Python in a future post).
In the end, you might not want to use relational databases but still
find the language of SQL useful for reshaping and extracting information
from single R dataframes (i.e. as a replacement or addendum to dplyr).
In that case, there is the sqldf package,
which simply allows the use of SQL syntax on ordinary dataframes.
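For instance, with the sqldf package (assuming it is installed), a SQL query can be run directly against an ordinary dataframe, with no database setup at all:

```r
library(sqldf)  # runs SQL against dataframes via a temporary SQLite db

df <- data.frame(name_first     = c("Emmeline", "Ida"),
                 club_president = c(1, 0),
                 activities_sum = c(4, 6))

# Plain SQL, no database required
presidents <- sqldf("SELECT name_first, activities_sum
                     FROM df
                     WHERE club_president = 1")
presidents
```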
S-Q-L or “Sequel”?
In a nutshell, whichever you prefer.
Josh says “S-Q-L.”
I say “Sequel.”
We somehow remain friends.
What is RSQLite?
RSQLite is an R package that allows users to work with relational databases and execute SQL commands from within R. It’s built on SQLite, a serverless database engine, meaning there is no separate database server: everything is read from and written to a single file on your computer. SQLite is arguably the most widely used database engine in the world: Android and iOS devices use it, as do the Firefox, Chrome, and Safari web browsers, and both Apple and Microsoft include it in their operating systems.
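As a tiny preview (using a throwaway in-memory database rather than the WordNet data from the next post), here is what talking to SQLite from R looks like via RSQLite and the DBI interface:

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver for DBI

# Connect to a temporary in-memory database: no file, no server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy an ordinary dataframe in as a table...
dbWriteTable(con, "persons",
             data.frame(name           = c("Emmeline", "Ida"),
                        club_president = c(1, 0)))

# ...and query it with plain SQL
presidents <- dbGetQuery(con,
  "SELECT name FROM persons WHERE club_president = 1")
presidents

dbDisconnect(con)
```

Swapping `":memory:"` for a file path (e.g. `"wordnet.sqlite"`) is all it takes to work with a database stored on disk instead.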
Great, now that we generally understand (1) how relational databases function, (2) that relational databases can be queried using a language called SQL, and (3) that RSQLite brings relational databases and SQL into R, it’s time to put that knowledge into practice! Tune in for the next post, where we will load a WordNet relational database. As a sneak peek, here is a visualization of the data, aggregated via RSQLite and visualized with Gephi: