Data Resources

Data made available for this RCN are in the public domain and originate from open resources. However, in accordance with established academic norms, please let us know that you are using the data and cite it accordingly.

 

Overview

A tremendous expansion in digitized information about U.S. Government activities at the federal, state and local levels is a boon for research. For many scholars, however, this virtual firehose of data is seemingly inaccessible without computational support. For those who possess the skills required to work with large and complex datasets, the research benefits of working with government data are not always apparent.

Our Research Coordination Network (RCN) will demonstrate how researching government data is a benefit to both computational scientists and social scientists.

We will build a central collection in a standard file format. We will encourage researchers to use our collection as a starting point with explicit instructions on how it could be augmented. For instance, we will provide links to the primary sources of additional related data. Ideally, as our interdisciplinary network grows, so will the collection.

Data Formats

Government information is now available in a wide variety of digital formats. The struggle for many researchers is to convert raw data into forms useful for comparison and analysis. Many government entities (GPO, FEC) and non-government entities (NYT, Sunlight) already convert pockets of data. We intend to build a rational collection that includes semi-structured documents, structured text and multimedia formats. Semi-structured documents are organized into fields but are not in tabular form; examples include webpages in HTML, the Federal Register in XML, and Congressional floor events available through APIs. Structured text is organized in tables or in relational database format; examples include financial information. Multimedia formats include audio and video and their associated standards. In addition, a significant portion of government information is available only as raw unstructured text or as PDF files.

We will encourage researchers to use our core structured data, and to convert semi-structured and raw data to formats that enable linkages to existing core datasets. Given the rich availability of multiple types of data formats, the focus of this RCN is on the spectrum of federal lawmaking.
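As a rough illustration of the conversion described above, the following Python sketch flattens a semi-structured XML document into tabular rows and writes them out as CSV. The element names (`notices`, `notice`, `agency`, `date`, `title`) and the sample content are hypothetical and chosen only for illustration; they are not the schema of the Federal Register or of any source named here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input; element names are illustrative only.
SAMPLE_XML = """
<notices>
  <notice>
    <agency>Department of Energy</agency>
    <date>2012-03-01</date>
    <title>Example notice A</title>
  </notice>
  <notice>
    <agency>NASA</agency>
    <title>Example notice B</title>
  </notice>
</notices>
"""

FIELDS = ["agency", "date", "title"]


def xml_to_rows(xml_text, record_tag="notice", fields=FIELDS):
    """Flatten each <record_tag> element into a dict with one key per field."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for record in root.iter(record_tag):
        # findtext returns the default when a field is absent from a record.
        rows.append({f: record.findtext(f, default="") for f in fields})
    return rows


def rows_to_csv(rows, fields=FIELDS):
    """Serialize the flattened rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = xml_to_rows(SAMPLE_XML)
print(rows_to_csv(rows))
```

Missing fields become empty cells rather than errors, which keeps every record the same width so that the result can later be loaded into a table and linked to other datasets.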

Congressional Bill Core Data

A relational MySQL database focused on lawmaking, with the congressional bill as its central unit, forms the core of the PoliInformatics project. Each of over 400,000 bills is related to extensive information about its progress through the legislative process, its topic, and its sponsor. The database will also include the full content of every version of every bill and resolution introduced from 1990 forward, parsed by section and (for enrolled bills) linked to the U.S. Code and the part of the code it amends. The variables in the existing relational MySQL database, with their types and descriptions, are listed under Core Data Description.
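To make the bill-to-sponsor relationship concrete, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for the MySQL database. The column names (BillID, Title, PassH, MemberID, NameFull, Majority) come from the Core Data Description, but the table names `bills` and `members`, the presence of a sponsor MemberID on each bill row, and all sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (
    MemberID TEXT PRIMARY KEY,   -- "PooleID-Cong-Party"
    NameFull TEXT,
    Majority INTEGER             -- boolean stored as 0/1
);
CREATE TABLE bills (
    BillID   TEXT PRIMARY KEY,   -- "Cong-BillType-BillNum"
    Title    TEXT,
    PassH    INTEGER,            -- boolean stored as 0/1
    MemberID TEXT REFERENCES members(MemberID)  -- sponsor link (assumed)
);
""")

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [("99901-111-100", "Member A", 1), ("99902-111-200", "Member B", 0)],
)
conn.executemany(
    "INSERT INTO bills VALUES (?, ?, ?, ?)",
    [
        ("111-HR-1", "Example bill one", 1, "99901-111-100"),
        ("111-HR-2", "Example bill two", 0, "99901-111-100"),
        ("111-S-1", "Example bill three", 0, "99902-111-200"),
    ],
)

# Count bills introduced, and bills that passed the House, per sponsor.
query = """
SELECT m.NameFull, COUNT(*) AS n_bills, SUM(b.PassH) AS n_passed_house
FROM bills AS b
JOIN members AS m ON m.MemberID = b.MemberID
GROUP BY m.MemberID, m.NameFull
ORDER BY m.NameFull
"""
results = conn.execute(query).fetchall()
for name, n_bills, n_passed in results:
    print(name, n_bills, n_passed)
```

The same join pattern would apply to any table keyed on BillID or MemberID, which is what lets additional datasets attach to the core.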

Extensions

Importantly, we view these data as the starting point for what will be an unprecedented open source data collection project tailored to the needs of academic researchers. Someone studying congressional hearings may discover that it would be helpful to integrate information about the members participating in a hearing. The core database provides an easy source of such information. After the researcher “cross-links” the two datasets to incorporate member information for her purposes, future researchers will be able to use the same record links to integrate hearings activity into their own research. Similarly, a researcher studying congressional bill text may find that he needs to develop a “stop word” list specifically for legislation. Future research projects will then benefit from the availability of this stop word list.
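One simple way a legislation-specific stop word list might be bootstrapped is by document frequency: words that appear in nearly every bill carry little topical information. The Python sketch below illustrates that idea only. The function names, the threshold parameter, and the three sample texts are invented for illustration and do not describe a method used by the project.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_stop_words(documents, min_doc_fraction=0.8):
    """Return words occurring in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, not once per occurrence.
        doc_freq.update(set(tokenize(doc)))
    threshold = min_doc_fraction * len(documents)
    return sorted(w for w, n in doc_freq.items() if n >= threshold)


def remove_stop_words(text, stop_words):
    """Tokenize the text and drop every token found in stop_words."""
    stop = set(stop_words)
    return [t for t in tokenize(text) if t not in stop]


# Invented sample texts in the style of bill titles.
SAMPLE_BILLS = [
    "A bill to amend the Clean Air Act, and for other purposes.",
    "A bill to amend the Internal Revenue Code, and for other purposes.",
    "A bill to authorize funding for space research, and for other purposes.",
]

stop_words = candidate_stop_words(SAMPLE_BILLS, min_doc_fraction=1.0)
print(stop_words)
print(remove_stop_words(SAMPLE_BILLS[0], stop_words))
```

A list produced this way is only a set of candidates; a researcher would still review it by hand before sharing it for reuse.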

 

House Committees

Sample Data Sets

House Committee on Science, Space & Technology

Has the following subcommittees:

  • Energy
  • Space
  • Research
  • Technology
  • Oversight
  • Environment

Hearings

Archived webcasts in WMX (Windows Media Audio/Video Playlist file) format are available from January 2009. The actual video is streamed over Microsoft’s MMS (Microsoft Media Server) protocol from Akamai servers. Earlier video streams, though available, were unplayable in Firefox 18.0.1 and in IE9 on a Windows PC.

Markups

Available in PDF from March 2009 to date. Generally, these consist of opening statements, amendment rosters and voting results.

Bills

Bills are available as PDF from March 2009 to date. The form of the bill text varies. Some bills are linked to thomas.loc.gov; others are linked from republican.science.house.gov and may be authenticated documents from the GPO; still others are non-searchable, scanned PDF files.

 

House Committee on Education & the Workforce

Hearings

The committee uses FDsys for printed hearings. Videos are available as “archived webcasts” through an external site, “http://edworkforcehouse.granicus.com/”. Following the link to this site displays a disclaimer: “You are now leaving the Committee on Education and the Workforce website. Thank you for visiting. Neither this office nor the U.S. House of Representatives is responsible for the content of the non-House site you are about to access.”

To view the webcast, one needs one of the following:

  • Silverlight plug-in; or
  • Flash plug-in on Droid enabled devices; or
  • HTML5 on iPads and iPhones

 

US Senate Committee on Energy and Natural Resources

Archived webcasts of hearings and business meetings go back to April 2005; they are not downloadable.

Printed hearings and other documents are available through FDsys, GPO’s Federal Digital System:

  • Congressional Hearings PDF, TEXT 1997 – 2012
  • Congressional Committee Prints PDF, TEXT 2001 – 2012
  • Legislative Publications PDF, TEXT 1997 – 2010

 

US Senate Committee on Finance

Hearings

  • Archived webcasts in RealMedia format, 2006 – 2008
  • Archived webcasts in Flash format, 2009 – date
  • Downloadable documents in PDF, 2001 – date

 

Core Data Description

The table below lists the name, type and description of each variable in the existing relational MySQL database.

Variable Name | Type | Description
BillID | Text | In the form “Cong-BillType-BillNum”
BillNum | Unsigned integer | Bill number
BillType | Text | Bill type (“HR” or “S”)
ByReq | Boolean | Introduction by request of an agency?
Chamber | Boolean | 0 for House, 1 for Senate
Commem | Boolean | Commemorative bill?
Cong | Unsigned integer | Congress of introduction
Cosponsr | Unsigned integer | Number of cosponsors
IntrDate | Date | Date of introduction
Major | Unsigned integer | Major topic code
Minor | Unsigned integer | Minor topic code
Month | Unsigned integer | Month of introduction
Mult | Boolean | Multiple referrals?
MultNo | Unsigned integer | Number of referrals
PassH | Boolean | Passed House?
PassS | Boolean | Passed Senate?
PLaw | Boolean | Public law (passed both and signed)?
PLawDate | Date | Date bill became public law
PLawNum | Text | Public law number of bill
Private | Boolean | Private issue bill?
ReferArr | Array of booleans | Referred to numbered committee?
ReportH | Boolean | Reported by a House committee?
ReportS | Boolean | Reported by a Senate committee?
Title | Text | Full bill title
Veto | Boolean | Vetoed by the President?
Year | Unsigned integer | Year of introduction
Age | Unsigned integer | Age of member in year
Class | Unsigned integer | Class of introducing Senator
ComCArr | Array of booleans | Chair of numbered committee?
ComC | Boolean | Chair of any committee?
ComMArr | Array of booleans | Member of numbered committee?
ComRArr | Array of booleans | Ranking member of numbered committee?
ComR | Boolean | Ranking member of any committee?
CumHServ | Floating point | Cumulative years of House service at introduction
CumSServ | Floating point | Cumulative years of Senate service at introduction
Delegate | Boolean | Delegate (from Guam, D.C., etc.)?
District | Unsigned integer | District of House member
DW1 | Floating point | First dimension NOMINATE score
DW2 | Floating point | Second dimension NOMINATE score
FrstConH | Unsigned integer | First Congress of House service
FrstConS | Unsigned integer | First Congress of Senate service
Gender | Boolean | 0 for male, 1 for female
LeadCham | Boolean | Leader of a chamber?
LeadComm | Boolean | Leader of any committee?
LeadSubC | Boolean | Leader of any subcommittee?
Majority | Boolean | Member of majority party?
MemberID | Text | In the form “PooleID-Cong-Party”
MRef | Boolean | Member of bill’s referral committee?
NameFirst | Text | First name
NameFull | Text | Full name
NameLast | Text | Last name
Party | Unsigned integer | Party code
PooleID | Unsigned integer | Poole’s revised ICPSR number
Postal | Text | Postal code of home state (e.g. WA)
State | Unsigned integer | Numeric code of home state
SubCArr | Array of booleans | Chair of a subcommittee under numbered committee?
SubC | Boolean | Chair of any subcommittee?
SubRArr | Array of booleans | Ranking member of a subcommittee under numbered committee?
SubR | Boolean | Ranking member of any subcommittee?
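Because BillID and MemberID are composite text keys, it can be convenient to split them back into their components when linking outside data. The Python sketch below assumes the parts are joined by plain hyphens exactly as written in the descriptions (“Cong-BillType-BillNum” and “PooleID-Cong-Party”) and that Party appears in the key as its numeric code. The helper names and the example identifiers are hypothetical, and the mapping from BillType to the Chamber code is an inference from the descriptions above rather than a documented rule.

```python
from typing import NamedTuple


class BillKey(NamedTuple):
    cong: int
    bill_type: str
    bill_num: int


class MemberKey(NamedTuple):
    poole_id: int
    cong: int
    party: int


def parse_bill_id(bill_id: str) -> BillKey:
    """Split a BillID of the assumed form 'Cong-BillType-BillNum'."""
    parts = bill_id.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three hyphen-separated parts: {bill_id!r}")
    cong, bill_type, bill_num = parts
    if bill_type not in ("HR", "S"):
        raise ValueError(f"unexpected bill type: {bill_type!r}")
    return BillKey(int(cong), bill_type, int(bill_num))


def parse_member_id(member_id: str) -> MemberKey:
    """Split a MemberID of the assumed form 'PooleID-Cong-Party'."""
    parts = member_id.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected three hyphen-separated parts: {member_id!r}")
    poole_id, cong, party = (int(p) for p in parts)
    return MemberKey(poole_id, cong, party)


def chamber_code(bill_type: str) -> int:
    """Map bill type to the Chamber coding: 0 for House, 1 for Senate (inferred)."""
    return 0 if bill_type == "HR" else 1


# Hypothetical identifiers, for illustration only.
bill = parse_bill_id("106-HR-25")
member = parse_member_id("12345-106-100")
print(bill, chamber_code(bill.bill_type))
print(member)
```

Both keys share the Cong component, so a parsed MemberID can be checked against the Congress of the bill it is being matched to.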

wayback/archiveit/endofterm

fraser (fed reserve)

science.gov/metalab/fedstats/data.gov

 

 

 

 

Acknowledgements

  • E. Scott Adler and John Wilkerson, The Congressional Bills Project
  • Josh Tauberer, GovTrack
  • Keith Poole and Howard Rosenthal, NOMINATE
  • Charles Stewart III and Jonathan Woon, Congressional Committee Data
