Tracing Policy Ideas From Lobbyists Through State Legislatures

Fellows: Matthew Burgess, Eugenia Giraudy, Julian Katz-Samuels
Data Science Mentor(s): Joe Walsh
Project Manager: Lauren Haynes
Project Partner: Sunlight Foundation
[Github Repository]

Legislators often lack the time to write bills, so they frequently rely on outside groups to help. Researchers and concerned citizens would like to know who’s writing legislative bills, but reading those bills, let alone tracing their source, is tedious and time-consuming. This is especially true at the state and local levels, where arguably more important policy decisions are made every day.

To increase state-level transparency, DSSG and DSaPP team members have built the Legislative Influence Detector (LID). LID quickly and accurately locates and traces the proliferation of similar language across large datasets. It is essentially a plagiarism detector designed for state legislation. State legislators introduce over 45,000 bills a year, but because they lack the time, staff, and expertise required to write all those bills from scratch, they often borrow language from other state bills and even from model legislation written by lobbyists. LID helps uncover these relationships by finding matches between state bills and model legislation, thereby making information easier to find and harder to hide (a lobbyist would have to rewrite a bill 50 times to get it passed in all 50 states without detection). In doing so, LID sheds light on the state legislative process, helps citizens learn about the bills under consideration by tracking similar passages across these documents, and increases democratic accountability.

In its current state, LID lets the user enter the text of a bill and returns potentially matching documents, each with a score indicating the strength of the match. It comes with a frontend that highlights similar sections in those documents, allowing the user to quickly evaluate the similarities. We ran LID on 500,000 state bills introduced between 2010 and 2015 and on 2,400 pieces of model legislation, and conservatively found 45,000 matches between bills and 14,000 matches between model legislation and bills. We posted all the data to our website so others can explore and analyze them. These are the most comprehensive data on legislative text re-use available to the public.
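
To make the "enter a bill, get back scored matches" workflow concrete, here is a minimal Python sketch of one common way to detect text re-use: compare overlapping word windows (shingles) and score each candidate document by Jaccard similarity. This is an illustrative assumption, not a description of LID's actual matching method; the function names (`shingles`, `jaccard`, `rank_matches`), the window size `k`, and the `threshold` are all hypothetical.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_matches(query, corpus, k=5, threshold=0.1):
    """Score every document in corpus against query.

    corpus maps a document id to its text (a hypothetical in-memory
    stand-in for a real bill database). Returns (doc_id, score) pairs
    at or above threshold, strongest match first.
    """
    query_shingles = shingles(query, k)
    scored = [
        (doc_id, jaccard(query_shingles, shingles(text, k)))
        for doc_id, text in corpus.items()
    ]
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_matches(bill_text, {"model_1": model_text, "bill_2": other_bill_text})` would return the candidate documents ordered by score. Comparing a query against every document this way does not scale to hundreds of thousands of bills, so a real system would presumably narrow the candidate set first and then align passages more carefully.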

More information about how LID works and our next steps can be found on our blog.
You can find the code here.
You can follow updates on this project on Twitter @InfluenceDetect.