Reposted with permission from AALL Spectrum, Volume 26, Number 2 (November/December 2021), pgs 12-15.
By Erik Adams, Manager of Library Digital Initiatives, Sidley Austin LLP; Martin Korn, Director of Research and Knowledge Services, Sheppard, Mullin, Richter & Hampton LLP; and Casandra Laskowski, Head of Research, Data & Instruction, University of Arizona College of Law Library
Tips and tools for mastering the basics of statistics and analytics to create your own data project.
Analytics is the use of math and computers to mine data for insights and knowledge. Many tools are now available that make it possible to do analytics with little more than a basic knowledge of statistics, some data, a personal computer, and the right software. You don’t have to know how to calculate a standard deviation or have an advanced degree in computer science to do your own analytics. It is not necessary to run surveys to gather data. This article discusses some basic concepts in statistics, where to find data, and which tools to use for manipulating that data. It also makes some recommendations for librarians and legal information professionals on how to get involved in data projects.
But first, what’s wrong with Microsoft Excel? Once you get serious about analytics, you will encounter a variety of speed bumps that are handled better by other products. Excel has limits on the amount and kinds of data it can import and manipulate; other products make dealing with large and complex data comparatively easy. Excel’s formulas and macro language are not as expressive or sophisticated as those found in R or Python, which both allow for more options. Similarly, OpenRefine, Power BI, and Tableau make it possible to automate a lot of the drudgery of data preparation and cleanup. Excel may be the de facto product people use to manage and share tabular data, but that does not mean it is the best tool for the job. There are things that it is very good at, but there are many tasks that are better done with other tools. You could use a hammer to drive in a bolt, but a wrench will do the job better. Similarly, you can do analytics with Excel, but you will be more efficient using other programs.
This article was developed from a program at the 2021 American Association of Law Libraries Virtual Conference. The session had a companion workbook that is still available for download (visit bit.ly/ND21DIYworkbook). The workbook provides a walkthrough of different kinds of analytics, using a fictional data set.
Do not skip this section!
Good data analysis answers questions and uncovers new questions you did not realize you should be asking. But to be good at data analysis, one cannot simply build charts or run averages on data. It is critical to understand what the data is and is not saying—what it can and cannot tell you.
The goal of this section is to describe a few basic statistical methods in a way that is useful and not intimidating. We cover data distribution, the 3 Ms (mean, median, and mode), and scaling data (normalization and standardization). We do not expect you to be statisticians by the end, but we do believe that each of you will be able to leverage your existing information literacy skills to become critical and curious data consumers.
One of the most common and recognizable symbols in data analysis is the bell curve. When data is displayed and forms a bell curve, it is said to be normally distributed. When data is normally distributed, the mean, median, and mode are all the same value. We do not live in an ideal world, so data are rarely normally distributed. That is why we need to look at mean, median, and mode together to help understand the data we have.
Mean, also known as average, is one of the most widely used and well-known statistical concepts. For our purposes, the mean is the sum of all values divided by the number of values. It is often erroneously used to help predict “what is most likely.” However, mean is EXTREMELY susceptible to outliers—values that deviate far from others. For example, if I told you that the average starting salary at a job where you are applying was $75,000, you might be excited. That is, until you realize one person gets $300,000 while the other five get $30,000.
This is why you should always look at more than the mean. Median is the middle value in a range of data. When median and mean are not close, you know the data is skewed, often by an outlier. In the example above, the median value is $30,000, though the mean is $75,000.
Add in the mode, which is the most common value in a dataset, and you get a clearer picture. In our example, the median and mode are the same, further confirming that we are not likely to get $75,000. That said, not all data will have only one common value. For example, the salary data for new associates (bit.ly/ND21zip) has two modes—the data is bimodal. Since this would be hard to tell from the raw data alone, visualizing data is a useful step in analysis.
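The salary example above can be checked in a few lines with Python’s built-in statistics module (the figures are the hypothetical salaries from the example):

```python
import statistics

# Six hypothetical starting salaries: one outlier and five typical values
salaries = [300_000, 30_000, 30_000, 30_000, 30_000, 30_000]

mean = statistics.mean(salaries)      # pulled far upward by the single outlier
median = statistics.median(salaries)  # the middle value resists the outlier
mode = statistics.mode(salaries)      # the most common value

print(mean, median, mode)  # 75000 30000 30000

# A bimodal dataset has more than one mode; multimode() returns them all
bimodal = [160_000, 160_000, 60_000, 60_000, 75_000]
print(statistics.multimode(bimodal))  # [160000, 60000]
```

Seeing the mean ($75,000) sit so far from the median and mode ($30,000) is exactly the warning sign described above.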
It is common for leaders to look to the data to see where their organization stands in relation to others. Normalization and standardization are key tools when attempting to compare across institutions. Normalization scales (transforms) the data so that it ranges from 0 to 1 (a rate), or 0 percent to 100 percent when converted to a percentage. Standardization scales the data so that it is centered on zero and measured in standard deviations.
It is hard to compare the state of poverty in Wyoming with that in California by simply looking at the raw number of individuals living in poverty in each state, since they have such disparate total populations. However, if we divide the number of individuals living in poverty by the total population of each state, we get a rate of poverty (often displayed as a percentage of the total population living in poverty) that we can use for some comparison.
However, normalized data is not necessarily normally distributed (bell curved). So, while you may be able to compare your library to another institution, you do not know when doing so if they or you are outliers among the full group of institutions. By standardizing data, we begin to see where we fall within the fuller picture. When data is normally distributed, roughly 68 percent of all values are near the peak of the bell curve, within one standard deviation of the mean, whereas roughly 95 percent are just a bit further out, within two standard deviations.
The law school grading curve is standardized grade data. You are translating the number of points students get so most students get Bs, with fewer students getting As and Cs. The higher the number of standard deviations away from the mean a value is, the more of an outlier it is relative to the regular distribution of the data. This is most useful when trying to set realistic goals for your organization. You do not want to compare yourself with an outlier and therefore aim for a bar that is unrealistically high.
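Under the conventional definitions (min-max scaling to the 0-to-1 range, and z-scoring to center data on zero in units of standard deviations), both transformations take only a few lines of Python. The salary figures below are hypothetical:

```python
import statistics

def min_max_scale(values):
    """Rescale values to the 0-to-1 range (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on zero, measured in standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

raw = [30_000, 45_000, 60_000, 75_000, 300_000]
print(min_max_scale(raw))  # every value now falls between 0 and 1
print(z_score(raw))        # the outlier sits far from zero
```

The z-scores make outliers easy to spot: any value several standard deviations from zero is the kind of institution you probably should not benchmark yourself against.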
Sources of Data
At the most basic level, useful data can be found in two sources: internal and external repositories. Some of these sources are reasonably well known, others may involve a bit of effort to locate, and some are hidden and may require some novel thinking to discover them.
The best sources of data are those that already exist, as opposed to those in which data must first be created. Using existing information simplifies any project. Creating new data is invariably messy—think of the last time you ran a survey and the 13 percent response rate you received.
When looking for internal data, one of the key considerations is whether the data is useful and usable. Your reference tracking system may have several well-known data points (the number of unique patrons, the number of requests, the reference staff involved, etc.). Such data is a good starting point for an analytics project: the reports have to be generated anyway, so their utility is known, and the data itself is fairly simple and not subject to much interpretation.
When thinking about external data, the possibilities are truly endless. Consider sources such as the courts, federal and state government agencies, legislative bodies, trade organizations, academia, and legal publisher content. While many of these data sets have been thoroughly mined for information and analysis, it is still easy to find untapped useful information. Some of the more common data sources in this category include EDGAR, PACER, USPTO, and data.gov. The latter includes well over 300,000 datasets, and it points to more than 400 additional state and local data portals. The information hiding in data.gov and related government data portals covers everything from incarceration statistics and taxes collected to potholes filled and even the number of trees, by species, owned by a municipal government. The vast bulk of the data is provided in relatively clean formats that make it easy to use in analytics projects.
Ultimately, selecting a data source can be straightforward, but it requires careful thought about how useful the data will be once processed—and whether it will help drive understanding and decision-making.
In its 2021 survey of data scientists, Anaconda Inc. found that data professionals spent almost 40 percent of their time on data cleaning and data preparation. (View the report at bit.ly/ND21report.) Typos and formatting errors happen all the time, and data cleanup is a necessary first step in many projects. A good text editor such as Notepad++ or BBEdit is essential, but anything that can automate the process will make you more productive. In many cases, data cleanup can be done with the tool itself—Python is well suited to cleaning data files, and Tableau has a product called Tableau Prep. There is also OpenRefine, a desktop application originally created by Google. The entire purpose of OpenRefine is to clean and prepare data for import into other systems. The OpenRefine website (bit.ly/ND21OpenRefine) has a series of videos that document some of the data problems likely to be encountered and are highly recommended, even for someone not using OpenRefine.
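As a small illustration, the most typical cleanup steps—trimming stray whitespace, unifying inconsistent casing, and coercing numbers stored as text—can be scripted with nothing more than the Python standard library. The records below are invented for the example:

```python
# Hypothetical messy rows, as they might come out of a tracking-system export
rows = [
    {"Department": "  litigation ", "Requests": "42"},
    {"Department": "LITIGATION", "Requests": " 7 "},
    {"Department": "Tax", "Requests": "13"},
]

def clean(row):
    """Trim whitespace, normalize casing, and convert counts to integers."""
    return {
        "Department": row["Department"].strip().title(),
        "Requests": int(row["Requests"].strip()),
    }

cleaned = [clean(r) for r in rows]
# The two "litigation" variants now collapse into one consistent label
print(cleaned)
```

OpenRefine performs these same kinds of transformations interactively, without any programming.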
R Studio and the R programming language are very popular among data scientists. Over the last year, AALL member Sarah Lin has made several presentations on R Studio, and recordings of them are still available for viewing. (Watch Sarah’s AALL Webinar at bit.ly/AALL92420.) R is designed for performing statistical analysis and making that analysis as quick and straightforward as possible. It has built-in support for complicated data structures and statistical operations. R has an active community of users with multiple forums on the internet (including a sub-Reddit at bit.ly/ND21reddit). There are guides on the internet and classes available on LinkedIn Learning, Coursera, and other educational sites.
Another popular programming language in analytics and data science is Python. Python was designed with readability in mind, because when programming you often spend as much time reading code as you do writing it. Python is a general-purpose programming language, but people have created add-ons for Python, called modules, that are used for analyzing data. Python is an open-source language, which means you can download and install it at no cost, other than your time. There are products like Anaconda (bit.ly/ND21anaconda) and Homebrew (bit.ly/ND21brew) that make installation and configuration relatively painless. Python is, in one author’s opinion, easier to learn than R, but if you have found that programming is not your cup of tea, Python is unlikely to change that.
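To give a taste of how approachable the language is, the sketch below tallies hypothetical reference-request categories using only the standard library’s collections module—no third-party installs required (well-known third-party analytics modules include pandas and NumPy):

```python
from collections import Counter

# Hypothetical reference-request log entries
requests = ["docket pull", "cite check", "docket pull",
            "legislative history", "docket pull", "cite check"]

counts = Counter(requests)
print(counts.most_common(2))  # [('docket pull', 3), ('cite check', 2)]
```

Three readable lines of logic already answer a real question: which request types dominate the reference desk’s workload.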
Jupyter Notebooks is an open-source project that lets you run programs and present results in a single document that displays in a web browser and is easily shared with other Jupyter Notebook users. (Learn more at bit.ly/ND21Jupyter.) Programs can be written in Python, R, and other languages, and annotated using HTML or Markdown. Installation can be difficult, although Anaconda and Homebrew make it much easier. There is also an iOS application called Juno, available in the App Store, that is as easy to install as any other iOS app. It is not free, but on a modern iPad it is very easy to use and very fast. Finally, there are services that will host Jupyter Notebooks, with no local software installation required.
Microsoft’s analytics product is called Power BI. Microsoft has designed it for an entire analytics team working within an analytics ecosystem: developers to prepare data, database or SharePoint administrators to host it, analysts who create dashboards, and business owners who use the dashboards. The biggest advantage of Power BI over the other products discussed in this article is that if you work in a large organization that is serious about business analytics, there is a good chance the organization already has Power BI and all the team roles filled. All you need to do is take advantage of it. (Learn more at bit.ly/ND21Power.)
Finally, there is Tableau. The main Tableau desktop application is available for Windows, Macintosh, and as a web app. There is a server product, which makes it possible to share interactive dashboards. Unlike Power BI, Tableau recognizes that often the person gathering the data and doing the analyzing will be working alone and builds a lot of functionality right into the desktop application. You can easily connect to data, clean it, and output your workbooks to PDF, all from a single computer. Tableau is not free, in any sense of the word, but it is the easiest product to use of the ones discussed here. It also has an amazingly enthusiastic community—search YouTube for the annual “Iron Viz” competition for a sample. (Learn more at bit.ly/ND21Tableau.)
Roles for Legal Information Professionals
Years of experience in analytics projects suggest that there are three levels of involvement for librarians and legal information professionals. The first is “observation,” where the professional is introduced to the project at the very end of development. Often their reaction is understandably less than enthusiastic or, to put it bluntly, one of horror. There are probably few legal information professionals who haven’t witnessed the demonstration of a new, internally developed product and thought to themselves, “This is interesting, but why was it built?” or, even worse, “We already pay a vendor to do this, and they do it better.”
The middle level of involvement is “invitation.” This is where a data project is initiated elsewhere but it was determined, hopefully early on, that a legal information professional should be integrated into the project team. This is a recognition of the value gained by including someone familiar with taxonomies, databases, normalized content, user experience, existing commercial products, and more. Do not wait to be invited to participate! If rumors of a data project are circulating, seek out the source and do not be shy to invite yourself. The end product will be all the better.
Finally, the preferred level of data project involvement is “initiation.” Do not wait to attend a product launch or be invited to offer some suggestions: START THE PROJECT YOURSELF! There is no reason that legal information professionals should be avoiding launching data projects. We are already performing data projects—the next step is making them known to the institutions at large.