Landscape organizes everything within sight.

Tuesday, March 27, 2007

New Tools in Old Disciplines: Working Magic with Google Books, cont'd


The earlier post about Google Books on this site is creating quite a buzz among librarian and historian communities online -- partly because famed tech blogger Tim O'Reilly reported my having "dissed" the experience of libraries for virtual research, partly because Google Books is so hot, and partly because the image of libraries disappearing for computers raises hackles among academics everywhere.

The fears start flying. Will historians neglect the skills of traditional research because they've discovered the internet? Probably not. We spend years training in arcane research methods, and we make our names by doing something new, which even today generally means finding some measure of unknown documents in the archive. Will the material archives disappear? That's a fear, because any time a university can cut funds, it will, and then, as Rick Prelinger can testify, entire corpuses of periodicals and log books from the eighteenth century are jettisoned in the dumpster. Are internet archives going to be exhaustive? Definitely not, and in no case is every last spare bit of paper -- the forms, the doodles, the enormous maps -- getting scanned. Some of the fears are legitimate, and some of the fears are false. All give evidence of a rapidly changing world.

The real excitement around Google Books is the possibility of applying analytic tools that are simply not available for printed text. The word-count and documentation databases I mentioned are still only a dream -- Google's caution with copyright law puts them out of reach for the moment. But should they become possible, they will open up avenues of research that are now only experimental in the humanities.

To give but one example, it is now possible in the text-searchable, online Oxford English Dictionary to find all words with "road" or "walking" in the definition that had their origin between 1810 and 1840. I discovered a variety of pieces of slang pertaining specifically to the way people walk down the new streets -- suggesting that they were parading, performing, acting in some way so new to the culture that an entire vocabulary had to be invented to explain what they were doing. By traditional methods, most of these would never have turned up; they're too far apart in occurrence, we tend to focus on polemic rather than slang texts, and the shift would have escaped me. This data from the OED is now a major piece of evidence in one of my chapters, allowing me to advance conclusions I would not have been able to make before.
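For readers who want to see the shape of that kind of query, here is a minimal sketch in Python. It is not the OED's own search interface; the `Entry` record, the `find_entries` function, and the sample rows are all invented for illustration, and they assume each dictionary entry carries a headword, a year of first attestation, and a definition.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    """A hypothetical dictionary record; real OED entries are far richer."""
    word: str
    first_attested: int  # year of the earliest known quotation
    definition: str


def find_entries(entries, keywords, start, end):
    """Return entries first attested in [start, end] whose definition
    contains any of the keywords as a whole word."""
    # Whole-word match, so "road" does not also match "broad".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [
        e for e in entries
        if start <= e.first_attested <= end and pattern.search(e.definition)
    ]


# Invented placeholder rows, not real OED data.
SAMPLE = [
    Entry("alpha", 1815, "A manner of walking along a street."),
    Entry("beta", 1790, "A person walking for pleasure."),
    Entry("gamma", 1832, "A surface laid on a road."),
    Entry("delta", 1825, "A kind of broad hat."),
]

matches = find_entries(SAMPLE, ["road", "walking"], 1810, 1840)
print([e.word for e in matches])
```

The point of the sketch is only that two filters, one on date and one on the wording of the definition, are enough to surface words that would otherwise be scattered across the whole dictionary.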

Similar searches on the Dictionary of National Biography have allowed me to perform acrobatics with the networks of different professionals in the 1780s, people like artisans and innkeepers who rarely turn up in traditional historiography, about whom the data is scarce. These professions make brief appearances in the DNB, and by tracing the lives of a hundred innkeepers in the 1780s, patterns of politics, religious belief, and marriage emerge that suggest that innkeepers, with their access to horses and carriages and strangers, were among the best-connected and most political people in the nation. We are only beginning to see what this kind of research can do.

Doing this sort of number crunching on texts yields amazing results. In the future, historians will demand access to the full text of Google Books for exactly this reason. If Google doesn't provide it, many of its competitors -- including the Internet Archive -- may. So a fertile world of sorting searches is ahead of us.

The rosiest scenario includes tech geeks and academic researchers teaming up to talk about framing the search queries. The raw text in the Dictionary of National Biography, for example, has no fields except the entry for "name" and "years." I have to sort through each entry myself to find the number of children, the profession, the religion, the political beliefs, and the books the subject wrote. But sorting these variables against each other in searches is immensely powerful. Did Quakers have more children than Catholics? I don't know, but the archive does. And if the DNB has too few variables to be the right resource for this sort of search, a variety of local archives and court records around Britain are now going online, with exactly that potential. These include the entire proceedings of the Old Bailey court in London, 1674-1834; the census, 1801-1964; the British Parliamentary Papers, 1688-1905; and the LSE Booth Archive (maps of poverty in the 1860s). More often than not, historians like myself with no technical background are in charge of creating the data fields and search algorithms. We rarely find what we want, because we don't know how to use the technology to get what we want. The marriage of technocrats and historians could be a happy one.
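Once the fields exist, the Quaker-versus-Catholic question above becomes a few lines of code. The following is a rough Python sketch under stated assumptions: the field names `religion` and `children`, the function `mean_children_by`, and every record in it are made up for illustration, since the DNB does not currently expose such fields.

```python
from collections import defaultdict
from statistics import mean

# Invented placeholder records; the DNB has no such structured fields today.
RECORDS = [
    {"name": "Person A", "religion": "Quaker", "children": 6},
    {"name": "Person B", "religion": "Quaker", "children": 4},
    {"name": "Person C", "religion": "Catholic", "children": 3},
    {"name": "Person D", "religion": "Catholic", "children": 5},
    {"name": "Person E", "religion": "Quaker", "children": None},  # unknown
]


def mean_children_by(records, field):
    """Average the 'children' count for each distinct value of `field`,
    skipping records where either piece of information is missing."""
    groups = defaultdict(list)
    for record in records:
        if record.get("children") is not None and record.get(field):
            groups[record[field]].append(record["children"])
    return {value: mean(counts) for value, counts in groups.items()}


averages = mean_children_by(RECORDS, "religion")
print(averages)
```

Swapping `"religion"` for a hypothetical `"profession"` or `"politics"` field would answer a different question with the same function, which is why agreeing on the fields matters more than the code itself.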

Right now, none of these archives talk to each other, none allow tagging or comments from researchers, and those that have tried to provide fields or tags have done so by hand, over years, at immense expense with little to show. For this sort of labor, open databases offer a vista of solutions. GoogleBase and Freebase are the two important ones now. In open databases, even small archives can contribute the raw data from their holdings, and anyone -- from the genealogist to the professional historian to the computer scientist looking for a master's thesis project -- can start putting together the evidence into interesting patterns, and then sharing those tools with others. Analytic tools like Swivel, Pipes, and DabbleDB can start finding patterns immediately. The miracles to come will happen when data starts talking to data, bubbling into new patterns yet undiscovered -- when we start getting entire life histories of shoemakers and Quaker populations out of the traces they left across a dozen government and local databases, when we start discovering shoemakers across vast swathes of England who knew each other and were talking, and when we start following the spread of religious or political ideas across those networks. We must believe that there are patterns locked in the data that are burning to get out, and we must apply all the tools we have to release them.
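What "data talking to data" might look like at its very simplest is sketched below in Python. This is an assumption-laden toy, not a description of how any of the archives or services named above work: the two lists, the field names, and the functions `normalize` and `link_records` are hypothetical, and matching on a tidied-up name plus a birth year is far cruder than real record linkage would need to be.

```python
def normalize(name):
    """Lower-case a name, drop full stops, and collapse runs of spaces."""
    return " ".join(name.lower().replace(".", "").split())


def link_records(left, right):
    """Pair up records from two sources that share a normalized name
    and a birth year. Returns a list of (left_record, right_record)."""
    index = {}
    for record in right:
        key = (normalize(record["name"]), record["born"])
        index.setdefault(key, []).append(record)

    links = []
    for record in left:
        key = (normalize(record["name"]), record["born"])
        for match in index.get(key, []):
            links.append((record, match))
    return links


# Invented placeholder rows standing in for two unrelated archives.
COURT = [
    {"name": "John  Smith", "born": 1760, "trade": "shoemaker"},
    {"name": "Mary Jones", "born": 1771, "trade": "innkeeper"},
]
CENSUS = [
    {"name": "john smith", "born": 1760, "parish": "Parish X"},
    {"name": "Ann Brown", "born": 1765, "parish": "Parish Y"},
]

linked = link_records(COURT, CENSUS)
for court_row, census_row in linked:
    print(court_row["trade"], "->", census_row["parish"])
```

Even this naive join turns two disconnected mentions into the beginning of a life history; the hard part, and the place where historians' judgment is needed, is deciding when two records really describe the same person.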

Google Books blew my socks off because it was able to contribute something new to my research after I had already circled the world for this information, pillaging a variety of specialty libraries, among them Harvard's Dumbarton Oaks landscape collection, the Maps Collection and Center for British Art at Yale, the Royal Institute of British Architects Collection, the Victoria and Albert, the British Museum, the Cambridge libraries, and the Public Record Office. I was also going through whatever ILL could bring me through the well-organized mechanisms of the University of California. I've seen ephemera and political documents pertaining to the road that were never looked at by any of the thirty major historians who wrote about the road in the course of the twentieth century. It is an utter delight, then, to encounter other books that did not turn up in my exhaustive ramble through the traditional methods. New tools in old disciplines can do us a world of magic.
