The DSSG Earmarks Dataset

What Is an Earmark?

“Among practitioners and scholars, no single definition of the term earmark is universally accepted.” For this project, we treat congressionally directed budget items allocated to specific people, places, or projects as earmarks.

Note that our operating definition does not depend on a value judgment (good or bad) about earmarks. Earmarks may be good, conditionally good, or bad. We hope this dataset can shed light on which, but we leave that judgment to others.

Why Did You Create Another Earmarks Dataset?

Our goal was to build a tool that finds earmarks in congressional texts, not to build a dataset of earmarks. A dataset would offer a snapshot of the past, but we wanted to build something that will prove useful in the future. Our tool can create historical datasets and update those datasets with minimal time and effort. As the saying goes, give a man a fish and he eats for a day; teach him to fish and he eats for a lifetime.

Sharing our algorithm also brings transparency to the process. Porter and Walsh have expressed concerns about opaque coding decisions; an algorithm helps solve that problem by providing replicable results and a way to trace its decision-making process.

That being said, we can save you the trouble of creating an earmarks database from our scripts by providing a copy of our data in CSV format here.

Our approach comes with costs. To simplify the problem, we decided to focus on tables, where 85% of observable earmarks reside. We ignored free text in the bills and reports as well as phone calls and letters. We cannot estimate how many earmarks we’re missing because the call logs and letters are not public. Requesting funding by letter is known as lettermarking. No one knows exactly how often it happens, but Senator Al Franken’s office says it’s “fairly routine,” and it’s likely becoming more common given Congress’s stated ban on earmarks since 2010.

Another challenge we faced was the variety of table formats used. The Government Printing Office provides congressional bills and reports as plain text files where tables appear as blocks of formatted text. The tables are laid out with indentation, whitespace, and occasionally dots and dashes. Our program finds the tables by searching for these text patterns, but Congress could easily hide a table from our script by, say, using semicolons instead. We do not know how often this happens.
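To make the idea concrete, here is a minimal sketch of pattern-based table detection. It is an illustration only, not the code in our repository: the regular expression, the function name find_table_rows, and the sample lines are all assumptions made for this example. It treats a line as a table row when a label is followed by a run of dots or dashes, or by a wide gap of whitespace, and then by an amount at the end of the line.

```python
import re

# Hypothetical pattern (an assumption for illustration, not the project's own):
# a label, then 3+ dots/dashes or 2+ whitespace characters, then an amount.
ROW_PATTERN = re.compile(
    r"^\s*(?P<label>\S.*?)"              # row label, matched lazily
    r"(?:\s*[.\-]{3,}\s*|\s{2,})"        # dot/dash leaders or a wide gap
    r"(?P<amount>\$?[\d,]+)\s*$"         # trailing amount such as 1,500,000
)


def find_table_rows(text):
    """Return (label, amount) pairs for lines that look like table rows."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            rows.append((match.group("label"), match.group("amount")))
    return rows


# Made-up sample text in the style of a plain-text report.
sample = "\n".join([
    "The committee recommends funding for these activities.",
    "    Bridge repair, Springfield ........ 1,500,000",
    "  Harbor dredging          750,000",
    "Water project; Dayton; 250,000",
])

rows = find_table_rows(sample)
```

In this sketch the semicolon-separated line is skipped, which is exactly the weakness described above: a row formatted in a way the pattern does not anticipate goes unnoticed.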

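In short, we likely missed more than 15% of earmarks: the roughly 15% of observable earmarks that sit outside tables, plus any tables our script failed to detect and any earmarks made through non-public channels.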

Additional effort can reduce our error rate. We welcome outside help and invite improvements to our algorithm. You can fork our code from our GitHub repository.

Where Do I Download the Dataset?

You can find the datasets here:

How Do I Use the Dataset?

Using our dataset demands more of the user than other earmark datasets. We provide a list of earmark candidates rather than a list of earmarks. Each candidate comes with a score generated by a support vector machine. The user needs to choose a cutoff: scores above it indicate an earmark, and scores below it do not.

This choice has consequences: the higher the cutoff, the fewer candidates count as earmarks, so you find fewer true earmarks and fewer false ones. Thinking about the extremes helps clarify the concept:

  • Few scores will exceed a very high cutoff. The good news is that you will not find many false earmarks; the bad news is that you will also miss many of the true earmarks.
  • Most scores will exceed a very low cutoff. The good news is that you will find almost all the earmarks; the bad news is that you will also find lots of false earmarks.
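The two extremes can be sketched in a few lines of Python. This is an illustration under stated assumptions: the Candidate fields, the function name select_earmarks, and the scores are hypothetical and do not reflect the column names or values in our CSV files.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    # Hypothetical fields for illustration; the real files may differ.
    description: str
    score: float


def select_earmarks(candidates, cutoff):
    """Keep only the candidates whose score is above the cutoff."""
    return [c for c in candidates if c.score > cutoff]


# Made-up candidates with made-up scores.
candidates = [
    Candidate("Project A", 1.8),
    Candidate("Project B", 0.4),
    Candidate("Agency total line", -0.3),
    Candidate("Program subtotal", -1.5),
]

strict = select_earmarks(candidates, cutoff=1.0)    # very high cutoff
lenient = select_earmarks(candidates, cutoff=-1.0)  # very low cutoff
```

With these made-up scores, the strict cutoff keeps a single candidate and the lenient cutoff keeps three, including one line that is probably not an earmark.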

While it would be convenient for us to release a dataset based on a cutoff of our choice, giving the user flexibility has advantages. First, it brings transparency to the process. Unlike the human-coded datasets, the cutoff provides a clear standard for categorization and enables discussion of model sensitivity (for example, what happens to the results when you choose a different cutoff?). Second, it encourages the development of reasoned cutoff choices. See this paper for further discussion.
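One way to explore sensitivity is to hand-label a small sample of candidates and see how precision and recall move as the cutoff changes. The sketch below assumes you have such a sample. The scores, the labels, and the function name precision_recall are invented for illustration and are not results from our data.

```python
def precision_recall(scores, labels, cutoff):
    """Compute precision and recall when scores above the cutoff count as earmarks.

    labels holds hand-coded truth values (True means a real earmark).
    Returns None for a measure whose denominator is zero.
    """
    predicted = [s > cutoff for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical hand-labeled sample: invented scores and labels.
scores = [2.1, 1.3, 0.6, 0.2, -0.4, -1.2]
labels = [True, True, False, True, False, False]

# Sweep a few cutoffs to see the trade-off.
results = {cutoff: precision_recall(scores, labels, cutoff)
           for cutoff in (1.0, 0.0, -1.0)}

for cutoff, (precision, recall) in results.items():
    print(cutoff, precision, recall)
```

In this invented sample, lowering the cutoff raises recall and lowers precision, which is the trade-off described above. A reasoned cutoff depends on which kind of error matters more for your question.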

What Is DSSG?

Data Science for Social Good is a fellowship program at the University of Chicago connecting aspiring data scientists and non-profits. Our purpose is social good. We create open-source tools that help communities, organizations, and individuals use data intelligently for their benefit and for the benefit of others. Click here to learn more about our fellowship program.

Who Are We?