Reposted with permission from AALL Spectrum, Volume 25, Number 5 (May/June 2021), pgs. 16-19.
By Sarah Lin, Information Architect & Digital Librarian at RStudio, PBC
As law librarians, many of us scrutinize the data we have access to with Excel and out-of-the-box visualization tools. Whether that data is from docket activity, research databases, websites, or online catalogs, what we have can generally be described as “usage data.” But what one skill set would allow us to do so much more with that data, to better understand and communicate what our users are doing and what they need? Enter data science.
Broadly speaking, data science brings opportunities to work more quickly and easily with data. It provides better reporting formats by incorporating outside data from various sources, and can even turn text into data that can be displayed visually. Even though legal information isn’t always associated with data, science, or data science, data science skills enable law librarians to do their jobs with greater efficiency. With data science skills, we are able to show new value for our teams and organizations, so it is definitely worth the time invested.
Even in a year when time has been both condensed and stretched (when many of us picked up new hobbies, such as baking), learning to code for just one use case, such as replacing Excel as a data analysis tool, doesn’t make sense. Luckily, data science skills are useful for more than just data manipulation, and learning to code serves many more use cases than creating better data visualizations for management. Cooking is a useful metaphor for data science: while it’s completely possible to eat take-out, frozen food, box mixes, and cereal for dinner, you can create healthier meals with the right tools, enhanced cooking skills, and a better understanding of ingredients. For example, pre-cut vegetables are available in grocery stores, but a chef’s knife and some practice allow you to customize any meal you make as well as lower costs. Similarly, while you can do your job with Excel and a commercial tool such as Tableau or PowerBI, learning data science opens a window onto new and improved skills that do more than create better graphics for reports or budget projections.
The following 10 data science skills and techniques, along with descriptions of the amazing deliverables associated with them, are listed in a progressive skill-building sequence, and they will provide you with a fully stocked data science kitchen. Keep in mind that the examples in this article focus on the R programming language, even though data science can also be done in Python (which has similar and sometimes compatible resources for you to use). The power of data science in R or Python comes from the skills and techniques these languages enable, transforming how you work with data in your day-to-day job. It’s time to graduate from Excel and start cooking with gas!
1. Learn to Code with R
Data science requires data, which most librarians already work with, but it also requires learning to code. Much like learning the MARC (MAchine-Readable Cataloging) format, talking to a computer in the language it can understand can be very challenging for humans. Yet once mastered, it opens up a window into understanding how the tools we need to do our jobs actually work, and how to use them more efficiently. You can use a library catalog without knowing MARC and you can use a computer without knowing how to code, but knowing MARC and a programming language provides experiential knowledge of how the search utilities work in the databases you use every day.
Resources exist for professionals who are willing to work on their own through tutorials, primers, and online textbooks such as Hands-On Programming with R and R for Data Science. For those who learn best in groups, there are, thankfully, R user groups and chapters of the global R-Ladies organization where professionals of all kinds can (virtually) meet for mutual skill development, tips, and support.
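To give a flavor of the language, here is a minimal sketch of some first R commands; the monthly search counts are invented for illustration.

```r
# A first taste of R: a small vector of (hypothetical) monthly search counts.
searches <- c(120, 95, 143, 88, 201, 167)
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")

mean(searches)                # average searches per month
max(searches)                 # the busiest month's count
months[which.max(searches)]   # which month was busiest
```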
2. Tidy Your Data
“Tidy” might be an unfamiliar word to associate with data; the things you have to do to your data before you can use it can be called tidying, munging, cleaning, etc. These steps, such as normalizing columns, handling missing data, formatting dates and times, joining data from several sources, extracting a subset of data, splitting cell values, and standardizing data contents, often constitute the bulk of time spent during data analysis.
R’s solution for those data “cleaning” tasks is a constellation of packages known as the Tidyverse. Because R as a programming language was developed by and for statisticians, packages like the Tidyverse were developed specifically for data analysis. While law librarians are no strangers to working with inconsistent data, the skills and processes that Excel requires don’t go quite far enough for data science. For example, tidy data has only one value per row/column pair, meaning that columns with months or years as headers need to be translated into a single date column for each piece of data. This is an example of a step that might seem cumbersome, but it really is a required foundation for the subsequent data science skills in this article. It also allows you to join and split data into different combinations, enabling you to look at the particular subset of data relevant to your analysis.
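As a brief sketch of what tidying looks like in practice, the following code uses the Tidyverse’s pivot_longer() to reshape month-per-column data into one row per database/month pair; the databases and counts are hypothetical.

```r
library(tidyverse)

# A hypothetical usage report laid out the untidy way: one column per month.
usage <- tribble(
  ~database, ~Jan, ~Feb, ~Mar,
  "Westlaw",  120,   95,  143,
  "Lexis",     88,  201,  167
)

# Tidy it: one value per row/column pair, with month as its own column.
tidy_usage <- usage %>%
  pivot_longer(cols = Jan:Mar, names_to = "month", values_to = "searches")

tidy_usage
```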
3. Create Data Visualizations
Data visualizations are the most ubiquitous products of data science, and for good reason: pictures speak louder than data. While bar graphs and pie charts are de rigueur in library statistics, tidy data can be visualized in many formats: scatter plots, box plots, line graphs, heat maps, time series, and geospatial maps. But visualizations created by code also provide more options for both the output format and the visualization details, such as annotations, facets, data groups, quartiles, clusters, or even combining multiple graphs into one visualization.
While there are specific R packages for some of the more unique visualizations, the most popular R graphics package is ggplot2. Because of its popularity, resources abound, though the most generally useful are probably Hadley Wickham’s ggplot2 and Winston Chang’s R Graphics Cookbook, which, with 15 chapters on frequently asked questions, gives you a sense of the vast possibilities related to visualizations. As you can tell from the title of Chang’s book, the metaphor of cooking and data science isn’t unique to this article.
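As a minimal sketch (again with invented usage numbers), a few lines of ggplot2 produce a grouped bar chart that would take considerably more clicking to build by hand:

```r
library(ggplot2)

# Hypothetical tidy data: one row per database/month pair.
tidy_usage <- data.frame(
  database = rep(c("Westlaw", "Lexis"), each = 3),
  month    = rep(c("Jan", "Feb", "Mar"), times = 2),
  searches = c(120, 95, 143, 88, 201, 167)
)

ggplot(tidy_usage, aes(x = month, y = searches, fill = database)) +
  geom_col(position = "dodge") +
  labs(title = "Database searches by month", x = NULL, y = "Searches")
```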
4. Use R Markdown to Create Dynamic Analysis Documents
The R Markdown package is a central tool for data scientists, largely because of the many different file output formats it supports. The article you are reading was written in R Markdown and published to Microsoft Word, but it could have also been published (output) to HTML, PDF, R Notebook, RTF, or PowerPoint, among others (see Flex Dashboard below). Additionally, R Markdown is the basis for other packages eminently useful for knowledge workers: bookdown (for writing tutorials, guides, and books), blogdown (for creating websites, such as bit.ly/MJ21rbind), and dashboards (see below for points on Flex Dashboard and Shiny).
The beauty of writing in R Markdown and using these packages is that they are designed to integrate text and code (though code isn’t required; it wasn’t needed in this article, for example), allowing for better storytelling. Whether you’re producing a journal article or a report, dynamically weaving code and text means that your work is reproducible (helpful for sharing and for repeatability) and neatly created in one document. Indeed, the same document could be produced in several formats simultaneously. The R Markdown website (bit.ly/MJ21rmarkdown) is the best place to get started, and it also links to the online book (built using bookdown) about the package.
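For a sense of the format, here is a minimal sketch of an R Markdown source file; the title and chunk are illustrative, and the output line could just as easily read html_document or powerpoint_presentation. Rendering it (for example with rmarkdown::render()) produces the finished document.

````markdown
---
title: "Monthly Usage Report"
output: word_document
---

Usage rose again this quarter, as the chart below shows.

```{r usage-chart, echo=FALSE}
# Placeholder chart built from an R sample dataset; a real report
# would plot your own tidied usage data here.
plot(pressure)
```
````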
5. Use Flex Dashboard to Publish Interactive Data Visualizations
Flex Dashboard is another R Markdown output, but one that allows a level of interaction with the data. A standard R Markdown report intersperses text and data, whereas the Flex Dashboard package extends the output possibilities to include widgets, gauges, graphics, and a limited ability to filter, page, and sort different data views. The real beauty of R Markdown, including the interactive output options, is that code you write once to analyze a dataset doesn’t have to be rewritten when additional data is added.
For example, monthly usage statistics, once tidied, can simply be added to the Flex Dashboard dataset and the code re-executed with one click, considerably reducing the preparatory work. Similarly, the code written once for annual budget projections can be reused in subsequent years; even though the data values may change from year to year, the analysis required generally doesn’t and those efforts can be recycled. Information about using the Flex Dashboard package is available from the R Markdown resources mentioned previously, but the Examples portion of the R Markdown website (bit.ly/MJ21flexdashboard) is a good place to start to see what’s possible with this tool.
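A skeleton dashboard source file, sketched here with a placeholder chart and hypothetical title, shows how little is needed to get started:

````markdown
---
title: "Library Usage Dashboard"
output: flexdashboard::flex_dashboard
---

Column
-------------------------------------

### Searches by Month

```{r}
# Placeholder chart; swap in code that reads this month's tidied data,
# then re-render to refresh the entire dashboard in one step.
plot(pressure)
```
````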
6. Build Interactive Web Applications with Shiny
Shiny is an R package that makes interactive applications that allow users to adjust data variables, changing the visualization on the fly. In effect, it lets data consumers play with your data and ask, then answer, their own questions as they arise, without any effort on your part. An application where a manager or administrator could, at their leisure, look at real-time data from the library budget would likely save hours, if not days, of staff time preparing, presenting, and then adjusting and re-presenting budget data. Learning Shiny does take a bit more skill, as it requires writing the functions that return the variables and parameters you want to share, but the interactivity is a game-changer.
Many of the COVID-19 dashboards created around the world recently were built using Shiny, including one from the London School of Hygiene and Tropical Medicine (bit.ly/MJ21vac). The shinyapps.io site is a place to host Shiny applications (with both free and paid accounts), many of which end up being shared publicly. Additionally, the Gallery on the Shiny website (bit.ly/MJ21shiny) illustrates the breadth of interactivity Shiny provides end users. Hadley Wickham is currently working on a new Shiny monograph, Mastering Shiny; the pre-publication manuscript is available at bit.ly/MJ21mastershiny.
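To illustrate the shape of a Shiny app, here is a classic minimal sketch: a slider that controls a histogram of randomly generated data. A real application would swap in your own budget or usage data.

```r
library(shiny)

# The user interface: one input (a slider) and one output (a plot).
ui <- fluidPage(
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# The server: re-draws the histogram whenever the slider moves.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Sample distribution")
  })
}

shinyApp(ui = ui, server = server)
```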
7. Scrape Webpages, Use APIs, and Parse Data
Collecting and connecting disparate data sources is a perennial challenge for librarians, due to the size of data sets, incompatibility of data formats, challenges of obtaining the data, and the technical solutions required to interface with those data sources and formats. Being able to scrape and parse webpages for data, use APIs (Application Programming Interfaces) to connect data sources (e.g., a research or HR database), and parse XML or JSON requires coding skills. Combined with the earlier skills mentioned, such as creating visualizations, using code to interact with data outside of standard formats like Excel opens up a world of possibilities for data analysis.
Different R packages are available for different websites and APIs, such as rvest for general website scraping and googlesheets4 to access Google Drive’s API. Sites such as StackOverflow and RStudio’s Community are a wealth of information on specific packages and data challenges, including tutorials on various data types or APIs, and are often just a Google search away. R users might also benefit from a resource like J.D. Long & Paul Teetor’s R Cookbook, which groups frequent challenges by type, such as getting data out of different sources and into a format read by R.
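As a hedged sketch (the URL here is a placeholder, not a real page), pulling a table out of a webpage with rvest can be remarkably short:

```r
library(rvest)

# Read the page, locate its first table, and convert it to a data frame.
# Substitute a page you have permission to scrape.
page <- read_html("https://example.com/fee-schedule")

fees <- page %>%
  html_element("table") %>%
  html_table()
```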
8. Enhance Visualizations With Maps
Geographic information is also not something readily associated with law library data, even though so many organizations and their users are geographically dispersed. Whether it’s law firm office locations or a geographic locator from Google Analytics, where users are is often information that is valuable to report but hard to present visually without the skills to work with geospatial data. Additionally, geospatial data is a key feature of some empirical research for which law librarians may need to provide support and resources.
There are a number of R packages used in geospatial analysis and visualization, like plotly, leaflet, ggmap, and choroplethr. Several of these packages are explained with accessible, inspirational examples in Sharon Machlis’ Practical R for Mass Communication and Journalism, which is a great book to work through (and also has web scraping examples). Mapping budget spend by city or resource use by branch elevates data visualization, allowing more effective communication with stakeholders.
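A minimal leaflet sketch, with hypothetical office coordinates, shows how quickly point data becomes an interactive map:

```r
library(leaflet)

# Hypothetical office locations with latitude/longitude coordinates.
offices <- data.frame(
  name = c("Chicago", "New York"),
  lat  = c(41.88, 40.71),
  lng  = c(-87.63, -74.01)
)

# An interactive map: base tiles plus a clickable marker per office.
leaflet(offices) %>%
  addTiles() %>%
  addMarkers(~lng, ~lat, popup = ~name)
```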
9. Text Mining & Analysis
The bulk of the law is text, which makes the ability to do textual processing, mining, and analysis an eminently useful skill for law librarians. Whether looking at documents or document metadata, text mining can answer questions about topic, document relevancy, sentiment, and word frequency. On its face, the answers to these questions might seem unimportant, but they could, for example, allow for analysis based on gender representation (Are female judicial nominees questioned more about their family than male nominees?) or political even-handedness (Do law reviews have a liberal or conservative bias?). Learning to do text mining can be a lot of fun when there are R packages based on personal interests, such as schrute (The Office), janeaustenr (the collected works of Jane Austen), and scotus (SCOTUS opinions), plus interesting data sets such as a file of Presidential State of the Union Addresses (bit.ly/MJ21presidency) and the script for Jurassic Park (bit.ly/MJ21github), among many others. For a step-by-step guide to text mining, Julia Silge & David Robinson’s book Text Mining with R is a great place to start and is also available for free online.
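Following the pattern taught in Text Mining with R, a short tidytext sketch tokenizes the collected Austen novels from the janeaustenr package, removes common stop words, and counts word frequencies:

```r
library(tidytext)
library(janeaustenr)
library(dplyr)

# One row per word, minus stop words, counted and sorted by frequency.
austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```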
10. Machine Learning
Machine learning is a type of artificial intelligence (AI), and it powers most of the applications we encounter in our personal and professional lives—it is simply another type of data science. In machine learning, the data involved is Big Data (data larger than your computer will hold) and it’s used to predict future actions. While learning to construct Boolean searches based on “classic” legal research database metadata, such as database codes, made us better users of that software, using today’s legal research databases will be more powerful and effective if you’re able to exploit the structure of the algorithms in combination with your legal knowledge.
Supervised Machine Learning for Text Analysis in R by Emil Hvitfeldt and Julia Silge works with the scotus package and a Consumer Financial Protection Bureau Consumer Complaint Database dataset (among others) to provide a guided introduction and instructions for doing machine learning. The pre-publication manuscript is freely available online (bit.ly/MJ21smltar), and the first two sections are of most utility, focusing on natural language processing and machine learning. Working through the text and examples in this book builds upon the previous nine skills and demonstrates how legal text is analyzed and classified, the same processes that are at work in the legal research tools law librarians and lawyers use every day.
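As a hedged sketch of the general workflow (not the book’s exact pipeline), the tidymodels and textrecipes packages can turn labeled documents into a fitted text classifier; the documents and labels below are invented for illustration, and real work would add training/test splits, resampling, and evaluation.

```r
library(tidymodels)
library(textrecipes)  # text preprocessing steps for recipes

# Invented toy data: a text column and a binary label to predict.
docs <- tibble(
  text  = c("motion to dismiss granted", "fee schedule updated",
            "summary judgment denied", "new database trial begins"),
  label = factor(c("case_law", "admin", "case_law", "admin"))
)

# Preprocess: tokenize the text, keep the most frequent tokens,
# and convert them to term-frequency features.
rec <- recipe(label ~ text, data = docs) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 100) %>%
  step_tf(text)

# Bundle the recipe with a logistic regression model and fit it.
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

fit(wf, data = docs)
```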
Working Smarter
We are endlessly asked to “work smarter, not harder,” and at times it can seem as if there isn’t anything else smart left to do. Learning data science, however, is definitely a smart strategic move for law librarians and legal information professionals. We all work with data and rely upon it, so it is only sensible to find ways to improve those skills (much as being stuck at home has presented an opportunity to improve culinary techniques). Learning to program in R and working through the aforementioned data, visualization, and analysis tools will give you the skills you need to do better work with less effort. Learning R is a low-cost endeavor with a supportive global community there to help you debug and troubleshoot your code, which means the only thing left to do is to get started.
Additional Resources
- ggplot2 bit.ly/MJ21ggplot2
- Hands-On Programming with R and R for Data Science bit.ly/MJ21library
- RStudio bit.ly/MJ21rstudio
- Text Mining with R bit.ly/MJ21textmining
- Tidyverse bit.ly/MJ21tidy