Data Resources
Data made available for this RCN are in the public domain and originate from open resources. However, in accordance with established academic norms, please let us know that you are using the data and cite them accordingly.
Overview
A tremendous expansion in digitized information about U.S. Government activities at the federal, state and local levels is a boon for research. For many scholars, however, this virtual firehose of data is seemingly inaccessible without computational support. For those who possess the skills required to work with large and complex datasets, the research benefits of working with government data are not always apparent.
Our Research Coordination Network (RCN) will demonstrate how researching government data is a benefit to both computational scientists and social scientists.
We will build a central collection in a standard file format. We will encourage researchers to use our collection as a starting point with explicit instructions on how it could be augmented. For instance, we will provide links to the primary sources of additional related data. Ideally, as our interdisciplinary network grows, so will the collection.
Data Formats
Government information is now available in a wide variety of digital formats. The struggle for many researchers is to convert raw data into forms useful for comparison and analysis. Many government entities (GPO, FEC) and non-government entities (NYT, Sunlight) already convert pockets of data. We intend to build a rational collection that includes semi-structured documents, structured text and multimedia formats. Semi-structured documents are organized by field but are not in tabular form. Examples include webpages in HTML, the Federal Register in XML, and Congressional floor events available through APIs. Structured text is organized in tables or in relational database format. Examples include financial information. Multimedia formats include audio and video and their associated standards. In addition, a significant portion of government information is available only as raw unstructured text or as PDF files.
We will encourage researchers to use our core structured data, and to convert semi-structured and raw data to formats that enable linkages to existing core datasets. Given the rich availability of multiple types of data formats, the focus of this RCN is on the spectrum of federal lawmaking.
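As a rough illustration of converting a semi-structured document into a structured form, the sketch below flattens a small XML fragment into rows with a fixed set of columns and writes them as CSV. The XML element and attribute names, the sample values, and the function names are all invented for this example and do not correspond to any actual government schema; only the "Cong-BillType-BillNum" shape of the bill identifier follows the core data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input. The element and attribute names are
# invented for illustration and do not match any real government schema.
SAMPLE_XML = """
<floor_events chamber="House">
  <event time="10:02">
    <bill>113-HR-1234</bill>
    <action>Motion to reconsider laid on the table</action>
  </event>
  <event time="10:15">
    <action>The House adjourned</action>
  </event>
</floor_events>
""".strip()

FIELDS = ["chamber", "time", "bill", "action"]


def flatten_events(xml_text):
    """Turn each <event> element into a flat dict with a fixed set of keys."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.iter("event"):
        rows.append({
            "chamber": root.get("chamber", ""),
            "time": event.get("time", ""),
            # findtext returns the default when the child element is absent,
            # so optional fields become empty cells rather than errors.
            "bill": event.findtext("bill", default=""),
            "action": event.findtext("action", default=""),
        })
    return rows


def to_csv(rows):
    """Serialize the flat rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten_events(SAMPLE_XML)
print(to_csv(rows))
```

Keeping the bill identifier in the same form as the core BillID is what would let rows like these be linked back to the core datasets.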
Congressional Bill Core Data
A relational MySQL database focused on lawmaking and organized around the congressional bill forms the core of the PoliInformatics project. Each of over 400,000 bills is related to extensive information about its progress through the legislative process, its topic, and its sponsor. The database will also include the full content of every version of every bill and resolution introduced from 1990 forward, parsed by section and (for enrolled bills) linked to the U.S. Code and the part of the code it amends. The variables in the existing relational MySQL database are described under Core Data Description below.
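To give a sense of what parsing bill text by section might involve, here is a minimal sketch that splits text on headings of the form "SECTION 1." and "SEC. 2.". It is not the project's parser: the regular expression, the function name, and the sample text are assumptions made for illustration, and real bill text varies more than this pattern allows.

```python
import re

# Matches headings such as "SECTION 1. SHORT TITLE." or "SEC. 2. FINDINGS."
# at the start of a line. This pattern is an assumption for illustration.
SECTION_HEADING = re.compile(
    r"^(?:SECTION|SEC\.)\s+(\d+)\.\s*(.*)$", re.MULTILINE
)


def split_sections(bill_text):
    """Return a list of (section_number, heading, body) tuples."""
    matches = list(SECTION_HEADING.finditer(bill_text))
    sections = []
    for i, match in enumerate(matches):
        body_start = match.end()
        # A section body runs until the next heading or the end of the text.
        if i + 1 < len(matches):
            body_end = matches[i + 1].start()
        else:
            body_end = len(bill_text)
        sections.append((
            int(match.group(1)),
            match.group(2).strip(),
            bill_text[body_start:body_end].strip(),
        ))
    return sections


# Invented sample text, not taken from any actual bill.
SAMPLE_BILL = """A BILL
To illustrate section parsing.

SECTION 1. SHORT TITLE.
This Act may be cited as the "Example Act".

SEC. 2. FINDINGS.
Congress finds that examples are useful.
"""

for number, heading, body in split_sections(SAMPLE_BILL):
    print(number, heading, "->", body)
```

Text before the first heading (the preamble) is ignored in this sketch; a fuller parser would likely keep it as its own unit.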
Extensions
Importantly, we view these data as the starting point for what will be an unprecedented open source data collection project tailored to the needs of academic researchers. Someone studying congressional hearings may discover that it would be helpful to integrate information about the members participating in a hearing. The core database provides an easy source of such information. After the researcher “cross-links” the two datasets to incorporate member information for her purposes, future researchers will then be able to use the same record links to integrate hearings activity into their research. Similarly, a researcher studying congressional bill text may find that he needs to develop a “stop word” list specifically for legislation. Future research projects will then benefit from the availability of this stop word list.
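The cross-linking idea can be sketched as a join on a shared member key. The example below is illustrative only: it uses Python's built-in sqlite3 module as a stand-in for the project's MySQL database, and the hearings and hearing_members tables, the sample rows, and the function names are all hypothetical. Only the MemberID, NameFull and Majority column names come from the core data description.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Subset of the core member variables.
        CREATE TABLE members (
            MemberID TEXT PRIMARY KEY,  -- "PooleID-Cong-Party"
            NameFull TEXT,
            Majority INTEGER            -- 1 if member of majority party
        );
        -- Hypothetical table a hearings researcher might bring.
        CREATE TABLE hearings (
            HearingID TEXT PRIMARY KEY,
            Title TEXT
        );
        -- The reusable cross-link: one row per member per hearing.
        CREATE TABLE hearing_members (
            HearingID TEXT REFERENCES hearings(HearingID),
            MemberID TEXT REFERENCES members(MemberID)
        );
    """)
    conn.executemany(
        "INSERT INTO members VALUES (?, ?, ?)",
        [
            ("10001-113-100", "Member A", 1),
            ("10002-113-200", "Member B", 0),
            ("10003-113-100", "Member C", 1),
        ],
    )
    conn.execute(
        "INSERT INTO hearings VALUES (?, ?)", ("H-001", "Example hearing")
    )
    conn.executemany(
        "INSERT INTO hearing_members VALUES (?, ?)",
        [("H-001", "10001-113-100"), ("H-001", "10002-113-200")],
    )
    conn.commit()
    return conn


def members_at_hearing(conn, hearing_id):
    """Look up core member information for everyone linked to a hearing."""
    cursor = conn.execute(
        """
        SELECT m.NameFull, m.Majority
        FROM hearing_members AS hm
        JOIN members AS m ON m.MemberID = hm.MemberID
        WHERE hm.HearingID = ?
        ORDER BY m.NameFull
        """,
        (hearing_id,),
    )
    return cursor.fetchall()


conn = build_demo_db()
print(members_at_hearing(conn, "H-001"))
```

The link table is the part worth sharing: once it exists, a later project can join in the other direction, from members to the hearings they attended, without redoing the matching.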
House Committees
- http://agriculture.house.gov Agriculture
- http://appropriations.house.gov Appropriations
- http://armedservices.house.gov Armed Services
- http://budget.house.gov Budget
- http://edworkforce.house.gov Education and the Workforce
- http://energycommerce.house.gov Energy and Commerce
- http://ethics.house.gov Ethics
- http://financialservices.house.gov Financial Services
- http://foreignaffairs.house.gov Foreign Affairs
- http://homeland.house.gov Homeland Security
- http://cha.house.gov House Administration
- http://judiciary.house.gov Judiciary
- http://naturalresources.house.gov Natural Resources
- http://oversight.house.gov Oversight and Government Reform
- http://rules.house.gov Rules
- http://science.house.gov Science, Space, and Technology
- http://smallbusiness.house.gov Small Business
- http://transportation.house.gov Transportation and Infrastructure
- http://veterans.house.gov Veterans’ Affairs
- http://waysandmeans.house.gov Ways and Means
- http://intelligence.house.gov Intelligence
- http://www.jec.senate.gov Joint Economic Committee
- http://jcl.house.gov Joint Committee on the Library
- http://www.house.gov/jcp Joint Committee on Printing
- http://www.jct.gov Joint Committee on Taxation
Sample Data Sets
House Committee on Science, Space & Technology
Has the following subcommittees:
- Energy
- Space
- Research
- Technology
- Oversight
- Environment
Hearings: Archived webcasts in WMX (Windows Media Audio/Video Playlist file) format are available from January 2009. The actual video is streamed over Microsoft’s MMS (Microsoft Media Server) protocol from Akamai servers. Video streams earlier than this (though available) were unplayable in Firefox 18.0.1 as well as in IE9 on a Windows PC.
Markups: Available in PDF from March 2009 to date. These generally consist of opening statements, amendment rosters and voting results.
Bills: Available as PDF from March 2009 to date. The form of the bill text varies. Some bills are linked to thomas.loc.gov; others are linked from republican.science.house.gov and may be authenticated documents from the GPO; still others are non-searchable, scanned PDF files.
House Committee on Education & the Workforce
Uses FDsys for printed hearings. Videos are available as “archived webcasts” through an external site, “http://edworkforcehouse.granicus.com/”. Following the link to this site brings up a disclaimer: “You are now leaving the Committee on Education and the Workforce website. Thank you for visiting. Neither this office nor the U.S. House of Representatives is responsible for the content of the non-House site you are about to access.”
To view the webcast, one needs one of the following:
- the Silverlight plug-in;
- the Flash plug-in on Droid-enabled devices; or
- HTML5 on iPads and iPhones
US Senate Committee on Energy and Natural Resources
Hearings and Business Meetings: archived webcasts going back to April 2005, not downloadable.
Documents available through FDsys: GPO’s Federal Digital System Printed Hearings
- Congressional Hearings: PDF, TEXT; 1997 – 2012
- Congressional Committee Prints: PDF, TEXT; 2001 – 2012
- Legislative Publications: PDF, TEXT; 1997 – 2010
US Senate Committee on Finance
- Archived webcasts, RealMedia format: 2006 – 2008
- Archived webcasts, Flash format: 2009 – date
- Downloadable documents, PDF: 2001 – date
Core Data Description
Below are the name, type and description of each variable in the existing relational MySQL database.
Variable Name | Type | Description |
BillID | Text | In the form “Cong-BillType-BillNum” |
BillNum | Unsigned integer | Bill number |
BillType | Text | Bill type (“HR” or “S”). |
ByReq | Boolean | Introduction by request of an agency? |
Chamber | Boolean | 0 for House, 1 for Senate |
Commem | Boolean | Commemorative bill? |
Cong | Unsigned integer | Congress of introduction |
Cosponsr | Unsigned integer | Number of cosponsors |
IntrDate | Date | Date of introduction |
Major | Unsigned integer | Major topic code |
Minor | Unsigned integer | Minor topic code |
Month | Unsigned integer | Month of introduction |
Mult | Boolean | Multiple referrals? |
MultNo | Unsigned integer | Number of referrals |
PassH | Boolean | Passed House? |
PassS | Boolean | Passed Senate? |
PLaw | Boolean | Public law (passed both and signed)? |
PLawDate | Date | Date bill became public law |
PLawNum | Text | Public law number of bill |
Private | Boolean | Private issue bill? |
ReferArr | Array of booleans | Referred to numbered committee? |
ReportH | Boolean | Reported by a House committee? |
ReportS | Boolean | Reported by a Senate committee? |
Title | Text | Full bill title |
Veto | Boolean | Vetoed by the President? |
Year | Unsigned integer | Year of introduction |
Age | Unsigned integer | Age of member in year |
Class | Unsigned integer | Class of introducing Senator |
ComCArr | Array of booleans | Chair of numbered committee? |
ComC | Boolean | Chair of any committee? |
ComMArr | Array of booleans | Member of numbered committee? |
ComRArr | Array of booleans | Ranking member of numbered committee? |
ComR | Boolean | Ranking member of any committee? |
CumHServ | Floating point | Cumulative years of House service at intro. |
CumSServ | Floating point | Cumulative years of Senate service at intro. |
Delegate | Boolean | Delegate (from Guam, D.C., etc.)? |
District | Unsigned integer | District of House member |
DW1 | Floating point | First dimension NOMINATE score |
DW2 | Floating point | Second dimension NOMINATE score |
FrstConH | Unsigned integer | First Congress of House service |
FrstConS | Unsigned integer | First Congress of Senate service |
Gender | Boolean | 0 for male, 1 for female |
LeadCham | Boolean | Leader of a chamber? |
LeadComm | Boolean | Leader of any committee? |
LeadSubC | Boolean | Leader of any subcommittee? |
Majority | Boolean | Member of majority party? |
MemberID | Text | In the form “PooleID-Cong-Party” |
MRef | Boolean | Member of bill’s referral committee? |
NameFirst | Text | First name |
NameFull | Text | Full name |
NameLast | Text | Last name |
Party | Unsigned integer | Party code |
PooleID | Unsigned integer | Poole’s revised ICPSR number |
Postal | Text | Postal code of home state (e.g. WA) |
State | Unsigned integer | Numeric code of home state |
SubCArr | Array of booleans | Chair of a subcommittee under numbered committee? |
SubC | Boolean | Chair of any subcommittee? |
SubRArr | Array of booleans | Ranking member of a subcommittee under numbered committee? |
SubR | Boolean | Ranking member of any subcommittee? |
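The two composite identifiers in the table, BillID (“Cong-BillType-BillNum”) and MemberID (“PooleID-Cong-Party”), can be split back into their typed components. The sketch below assumes that the parts are separated by plain hyphens and that the numeric parts are written as ordinary integers, which the table does not state explicitly. The class and function names and the sample values are hypothetical, and the mapping from BillType to the Chamber code is inferred from the table's coding (0 for House, 1 for Senate).

```python
from typing import NamedTuple


class BillKey(NamedTuple):
    cong: int       # Congress of introduction
    bill_type: str  # "HR" or "S"
    bill_num: int   # Bill number


class MemberKey(NamedTuple):
    poole_id: int   # Poole's revised ICPSR number
    cong: int       # Congress
    party: int      # Party code


def _split3(value, label):
    """Split an identifier into exactly three hyphen-separated parts."""
    parts = value.split("-")
    if len(parts) != 3:
        raise ValueError(
            f"{label} should have three hyphen-separated parts: {value!r}"
        )
    return parts


def parse_bill_id(bill_id):
    """Parse a BillID of the assumed form 'Cong-BillType-BillNum'."""
    cong, bill_type, bill_num = _split3(bill_id, "BillID")
    return BillKey(int(cong), bill_type.upper(), int(bill_num))


def parse_member_id(member_id):
    """Parse a MemberID of the assumed form 'PooleID-Cong-Party'."""
    poole_id, cong, party = _split3(member_id, "MemberID")
    return MemberKey(int(poole_id), int(cong), int(party))


def chamber_code(bill_type):
    """Map BillType to the Chamber coding: 0 for House, 1 for Senate."""
    return {"HR": 0, "S": 1}[bill_type]


# Hypothetical sample identifiers, not taken from the database.
bill = parse_bill_id("103-HR-2")
print(bill, chamber_code(bill.bill_type))
print(parse_member_id("10001-103-100"))
```

Parsing the identifiers into typed fields makes it straightforward to check them against the separate Cong, BillType, BillNum, PooleID and Party columns when linking new data to the core tables.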
- Wayback Machine / Archive-It / End of Term
- FRASER (Federal Reserve)
- Science.gov / MetaLab / FedStats / Data.gov
Acknowledgements
- E. Scott Adler and John Wilkerson, The Congressional Bills Project
- Josh Tauberer, GovTrack
- Keith Poole and Howard Rosenthal, NOMINATE
- Charles Stewart III and Jonathan Woon, Congressional Committee Data