Landscape organizes everything within sight.

Monday, March 19, 2012

Understanding paper machines

...macroscopes for parsing power and resistance in the utopian twentieth century, an opportunity to hack...

A team of techy humanists is looking for a creative computer scientist to co-create with us a digital tool that brings a large corpus of historical paper to life. Our mission is to parse a large number of traditional and non-traditional documents and extract the story of the utopian twentieth century. The project would be ideal for someone interested in exploring ground for their own visualization or information master's or PhD thesis project.

We're hoping to engage a creative, visual mind to capitalize upon the array of existing open-source tools for parsing large stores of textual data. These tools will be applied to a hand-curated, 6000-PDF library of texts about the rise of utopian movements in the twentieth century. There we hope to stress-test the tools against questions about what's important in the large paper archive; to play, experiment, and fail; and to arrive collaboratively at new questions about what's important in the data and what a savvy computationalist can extract from it. These algorithms set us up to work with large bodies of text: immediately finding all the diagrams, suggesting connections among different texts, showing connections between different authors, cataloguing citations, creating databases out of extracted information, and analyzing word choice.

We want our computer expert to be able to dig for data and to play with, experiment with, and break these tools. We want them to try different ways of analyzing and portraying results, so that the intuitive capabilities of the human user are set to best advantage against the large-scale analysis of the computer, and to explore how creative visualization can enhance textual analysis. Together we hope to draw from these experiments creative solutions to historical problems that are beyond the reach of human-driven research alone.

On a historical level, our task is to address questions about utopian projects in the twentieth century through the major artifact they left behind: paper.  I'm interested in the broadcast but lesser-known radical movements that permeated government at the United Nations, USAID and the World Bank, that is, the movements identified as land reform, agrarian reform, and appropriate technology, which together aimed to open up the question of holdings in land, the balance of economic power between rich and poor, and the role of engineers in redistributing wealth.

Some of the questions about the life of these utopian projects can be answered analytically and quantitatively by finding new ways of measuring the amount of paper these movements produced (data held in card catalogs and extracted by the open-source application Stackview). Paper has been both a tool to broaden conversation and an instrument to overwhelm. Initially, paper facilitated the inclusion of multiple voices and created an unlimited flow of information. It was a tool for keeping records, preserving information, and disseminating ideas to faraway places. But in modern history, revolutions in bureaucracy and the limitation of political participation have frequently reflected the number of pages that experts produce: paper is a form of social and political clout with which to disarm and outrank political opponents. In debates where peasants with oral traditions are faced down by civil engineers with reams of paper, the civil engineers always win. In the 1950s and 1960s, as the rise of NGOs came to ring conversations about urban planning and international development with coordinated institutions (all of them creating more paper), even the process of finding one's way to the beginning of an argument became a labor reserved for the few and privileged. Introductory texts to these problems increasingly included an organizational flow-chart, a diagram borrowed from business texts, which served as a map to tell would-be activists where, in the vast continent of NGOs, banks, and government organizations, they entered the conversation.

In short, the power struggles of the twentieth century have produced a mass of paper, too much to be read by a single scholar or even a group of scholars. The secrets of the paper archive are the record of the power struggles that determined the rise and fall of utopian movements in the twentieth century. The larger questions I'm interested in are thus ones about power and struggle. Was so much paper produced as to make participation by non-elites in the developing world and on the wrong side of town nearly impossible, thanks to the disproportionate amount of time in education and reading required to sort through it all? That is a historical question, and it can be answered by solving a quantitative question in the stacks: How much paper was produced, and by whom?

The experimental application Stackview allows a user to visualize the production of paper in a certain subject area as the width of volumes in a stack. It does not yet allow the comparison of different subject areas over time. There is no method to compare the number of pages produced in town planning between 1900 and 1950, or the number produced in economics textbooks printed in India and South America against those printed in the West. I want to know what can be gained by visualizing and experimenting with the abundant data hidden in the world of paper.
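To make the comparison concrete, here is a minimal sketch in Python of the kind of tally I have in mind. It is not Stackview's code and does not use its data format; the record fields (`subject`, `year`, `pages`), the function names, and the sample numbers are all invented for illustration, standing in for whatever a real catalog export would provide.

```python
from collections import defaultdict

# Invented sample records (not real catalog data); a real run would
# load these from a library catalog export in whatever shape it has.
RECORDS = [
    {"subject": "town planning", "year": 1905, "pages": 220},
    {"subject": "town planning", "year": 1948, "pages": 410},
    {"subject": "economics", "year": 1948, "pages": 350},
    {"subject": "economics", "year": 1962, "pages": 500},
]


def pages_by_subject_and_decade(records):
    """Sum page counts for each (subject, decade) pair."""
    totals = defaultdict(int)
    for rec in records:
        decade = rec["year"] // 10 * 10  # e.g. 1948 -> 1940
        totals[(rec["subject"], decade)] += rec["pages"]
    return dict(totals)


def pages_in_range(records, subject, start, end):
    """Total pages for one subject between two years, inclusive."""
    return sum(
        rec["pages"]
        for rec in records
        if rec["subject"] == subject and start <= rec["year"] <= end
    )


if __name__ == "__main__":
    for (subject, decade), pages in sorted(pages_by_subject_and_decade(RECORDS).items()):
        print(f"{subject:15} {decade}s {pages:6} pages")
    print(pages_in_range(RECORDS, "town planning", 1900, 1950))
```

Grouping by decade is only one choice; the same tally could be keyed by place of publication to set texts printed in India and South America against those printed in the West.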

Further steps beyond Stackview include other forms of data extraction and analysis that lean more heavily upon the ability to extract data from a large pool of PDFs. One example is the question of which utopians influenced each other and which ideas they shared, a problem that can be solved by tracing connections in the text of the digital versions (a social networking question that requires digging data from a pile of PDFs). The tools we imagine starting with are open-source and already tailored to work with text and library databases, among them the paper-width-measurer Stackview, the image-extractor Filejuicer, the named-entity extractor Open Calais, the geoparser Textgrounder, and the terminology calculator Bookworm. Some combination of expertise in Python, Java, and C++ is required. The ideal collaborator for this project can dig data, do batch processing, work with APIs competently, and experiment with a variety of out-of-the-box, open-source visualization scripts that deal with texts and images (including geoparsers, named entity recognizers, and NLP analysis).
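As a rough illustration of the social networking question, the sketch below counts how often two names appear in the same document. It assumes the text has already been extracted from the PDFs (that step is not shown) and that a list of names is already in hand, for example from a named-entity extractor. It does not call Open Calais or any of the other tools named above; the function name, the placeholder authors, and the sample sentences are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Placeholder texts and names, invented for illustration. In practice
# the texts would come from the PDF library and the names from a
# named-entity extractor.
SAMPLE_DOCS = {
    "doc1.txt": "Author A responds to Author B on land reform.",
    "doc2.txt": "Author B and Author C discuss appropriate technology; Author A is cited.",
    "doc3.txt": "Author C writes alone.",
}
SAMPLE_NAMES = ["Author A", "Author B", "Author C"]


def co_mentions(documents, names):
    """Count, for each pair of names, the documents mentioning both.

    Matching is a naive case-insensitive substring test, so it will
    miss spelling variants and can over-match on short names.
    """
    edges = Counter()
    for text in documents.values():
        lowered = text.lower()
        present = sorted(name for name in names if name.lower() in lowered)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    for (a, b), weight in co_mentions(SAMPLE_DOCS, SAMPLE_NAMES).most_common():
        print(f"{a} -- {b}: {weight} shared document(s)")
```

The resulting pairs and counts form a weighted edge list, which is the usual input for network visualization tools. Co-mention is only a crude proxy for influence; tracing actual citations would be a stricter test.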

At a conceptual level, we hope that this tool will be a powerful resource for examining large quantities of information, and that it will allow knowledge seekers to consider a broader, richer, often ignored corpus of text. In doing so, we hope to enlist the power of digital humanities to tame the pile of paper and redistribute the power that “official” paper took away.

