The release of the Mueller report this past week has brought with it renewed interest in the practice of “redaction,” in which the government blacks out portions of officially released documents to preserve confidential information. The topic has received more media attention on television in the past week than it has in the past decade, and at its peak on Thursday as much as 2% of online news coverage worldwide mentioned the term. Yet the rise of massive centralized FOIA archives, digitized news archives and a bit of statistical analysis can help scholars readily peer through those dark markings and fill in the redacted blanks.
One of the great weaknesses of the governmental redaction process is the lack of centralized government-wide coordination in determining just what is sensitive enough to warrant obscuring from public view. One agency’s most sensitive secret is another agency’s public information.
This leads to a situation in which multiple government agencies may release the same declassified document with different redactions. One agency might redact the entire first paragraph and leave the remaining text untouched, while another might leave the first page untouched and heavily redact the rest.
Historically such discrepancies were difficult for historians and the public to exploit because of the lack of open centralized databases of declassified document archives and FOIA collections.
As non-profits, private companies and academic institutions have focused on assembling vast archives of government documents in recent decades, it has become steadily easier to look across the totality of a government’s publicly released output for patterns.
Simple document similarity clustering can instantly group together all of the versions of a given document that have been released over the years by different agencies. A rudimentary “diff” over each group of documents can help fill in redacted passages, in rare cases even restoring the entire document, exploiting the uncoordinated declassification process.
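To make the idea concrete, here is a minimal sketch of the two steps in Python, using only the standard library's difflib. It is an illustration under stated assumptions, not the method any particular archive uses: it assumes the documents are already plain text, that every blacked-out span has been transcribed as the hypothetical placeholder token `[REDACTED]`, and that a word-overlap ratio of 0.6 is enough to treat two texts as versions of the same document. The function names, the threshold and the sample sentences are all invented for the example.

```python
from difflib import SequenceMatcher

# Hypothetical placeholder standing in for a blacked-out span (an assumption).
REDACTED = "[REDACTED]"


def similarity(a, b):
    """Word-level similarity between two texts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def cluster(docs, threshold=0.6):
    """Greedy grouping: a document joins the first group whose first
    member it resembles closely enough, otherwise it starts a new group."""
    groups = []
    for doc in docs:
        for group in groups:
            if similarity(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups


def fill_redactions(base, other):
    """Where `base` has only redaction markers and `other` has different
    words in the same position, copy the words from `other`."""
    a, b = base.split(), other.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and all(tok == REDACTED for tok in a[i1:i2]):
            out.extend(b[j1:j2])   # recover text from the other release
        else:
            out.extend(a[i1:i2])   # otherwise keep the base text as-is
    return " ".join(out)


def merge_group(group):
    """Fold every version in a group into the first one."""
    merged = group[0]
    for other in group[1:]:
        merged = fill_redactions(merged, other)
    return merged


if __name__ == "__main__":
    # Invented sample texts, not taken from any real release.
    release_a = "[REDACTED] met the ambassador in Vienna on 3 May to discuss the treaty"
    release_b = "The envoy met the ambassador in [REDACTED] on 3 May to discuss the treaty"
    unrelated = "Quarterly budget figures for the regional office"
    for group in cluster([release_a, unrelated, release_b]):
        print(merge_group(group))
```

On the two invented releases, each version supplies the words the other hides, so the merged text has no markers left. Real collections would need more care: scanned pages must be OCR'd first, redactions rarely arrive as a clean token, and comparing every document against every group is too slow at archive scale, where an approximate technique such as shingling with MinHash would be a more realistic way to find candidate groups.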