History As Big Data: 500 Years Of Book Images And Mapping Millions Of Books

This article is more than 8 years old.

The terms “big data” and “massive data analytics” likely conjure thoughts of the modern world, of hundreds of millions of tweets or billions of Facebook posts streaming in real time into gleaming data centers filled with blinking lights. Libraries, on the other hand, filled with endless rows of dusty books, are likely not the first thing that comes to mind. Yet, what if we could use libraries to reimagine our past, creating a gallery of all the images from half a millennium of books or creating a 215-year animated map of human history as seen through millions of books?

Libraries have reinvented themselves in the digital era and one library in particular, the Internet Archive, stands at the forefront of the big data era. The Archive, most famous for its historical archive of the Internet, today holds more than 23 petabytes of historical data, a collection that is growing at a rate of 50-60 terabytes per week. On its servers reside more than 436 billion web pages back to 1996, 750,000 television shows back to 2009, over 100,000 pieces of software dating back 30 years, and over half a billion pages of books dating back 500 years from over 1,000 libraries around the world. It is that last collection, of millions of books dating to the year 1500, that we will explore further here.

What would it look like to reimagine the book not as pages of text, but as a global distributed gallery of illustrations, drawings, charts, maps, and photographs that together comprise one of the world’s greatest art collections? In Fall 2013 I approached the Internet Archive with the idea of using computer algorithms to extract every image found on all 600 million pages of their digitized book collection, along with the text surrounding each image and the basic metadata about the book. In just over a month I did precisely that, creating a massive gallery that is slowly being uploaded to Flickr.

Browse the archive for yourself on Flickr, where more than 2.5 million of the images are already available, with more added every few weeks. If you’re interested in nature, try searching for “bird” or “butterfly,” or for the more historically-minded, try “railroad” or the “telephone.”  Emblem books can be particularly beautiful, with their exquisitely detailed renderings of moral stories and daily life.

Suddenly we can look across the centuries of images and create an interactive zoomable montage of 500 years of books, as seen in the image above. Beginning in the upper left with the year 1500 and proceeding by row from left to right and top to bottom, each image represents an illustration from a book published that year. Notice how styles and themes change over time, and the rise of colored prints in the nineteenth century. While books might not seem on the surface like “big data,” the ability to reach across over 1,000 libraries from around the world and make a searchable archive and zoomable collage of centuries of illustrations represents a new way of “seeing” our past.
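The reading order of the montage (upper left first, then left to right and top to bottom) amounts to a simple row-major grid. The sketch below shows how a publication year could be turned into a grid cell under that ordering. It assumes one image per year, and the function name `montage_cell` and the column count of 25 are hypothetical values chosen for illustration, not details of the actual montage.

```python
def montage_cell(year, start_year=1500, columns=25):
    """Return the (row, column) of a year's image in a row-major grid.

    Assumptions (not from the article): one image per year and a
    hypothetical grid width of 25 columns.
    """
    index = year - start_year
    if index < 0:
        raise ValueError("year precedes the start of the montage")
    # divmod gives the row (quotient) and the column (remainder).
    row, col = divmod(index, columns)
    return row, col


if __name__ == "__main__":
    # The first year sits in the upper-left corner.
    print(montage_cell(1500))
    # A later year lands further down the grid.
    print(montage_cell(1850))
```

A different grid width only changes the `columns` argument; the ordering itself stays the same.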

Yet, images show us only one dimension of what all of these books tell us about the world. What if we could turn to massive data mining algorithms once again, but this time have them “read” all of these books and create maps of all the locations mentioned within?

Over the last several weeks I have been applying powerful algorithms to process over 3.5 million English-language books dating back to 1800 from the Internet Archive and HathiTrust (which holds a mirror of portions of Google Books). (I looked only at books back to 1800, since books published before then tend to use older spellings and grammatical rules that are too difficult for modern data mining algorithms to understand.)

My GDELT Project, which is a non-profit initiative supported by Google Ideas to try to catalog and understand the global world using open data, used 160 processors and a terabyte of RAM from Google Cloud over a period of just two weeks to process all 3.5 million English-language books published since 1800 and compute an array of information about each book, from the list of people and organizations mentioned, to millions of themes and thousands of emotions, to a list of locations mentioned. All of this is freely available in Google BigQuery, where you can search 215 years of books in seconds using simple SQL queries.

Using online mapping platform CartoDB to visualize the results from BigQuery, I then created the animated map above, showing all locations worldwide mentioned at least 30 times overall in books published each year, from 1800 to present, using the Internet Archive’s American book collection. Click on the map to view the interactive animated version.
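The filtering step behind such a map can be sketched in a few lines. The example below is only an illustration, not the actual BigQuery or CartoDB workflow: it assumes the location mentions have already been extracted as (year, location) pairs, and it reads the 30-mention threshold as applying to all books published in a given year. The function name `frequent_locations` and the sample data are hypothetical.

```python
from collections import Counter


def frequent_locations(mentions, min_count=30):
    """Keep locations mentioned at least `min_count` times in a year.

    `mentions` is an iterable of (year, location) pairs, one pair per
    mention. This input format is an assumption made for illustration.
    Returns {year: {location: count}}.
    """
    counts = Counter(mentions)
    by_year = {}
    for (year, location), n in counts.items():
        if n >= min_count:
            by_year.setdefault(year, {})[location] = n
    return by_year


if __name__ == "__main__":
    # Hypothetical sample: Paris falls one mention short of the threshold.
    sample = (
        [(1800, "London")] * 30
        + [(1800, "Paris")] * 29
        + [(1801, "Boston")] * 31
    )
    print(frequent_locations(sample))
```

Each year's surviving locations would then become one frame of the animation.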

Watch the westward expansion of the United States through the nineteenth century or the spread beyond Europe in the twentieth. Keep a close eye on the map as it changes from 1922 to 1923. Notice how most of the map suddenly disappears - that’s the copyright era kicking in. Libraries have focused the bulk of their digitization efforts on the “public domain” era that runs up to 1922, in which the copyright on most books has expired. A quick glance at this map confirms just how much we are missing about the world from all those books that were published after 1922 but before the digital era, and that have gone out of print but can’t be digitized because they are still in copyright.

You can also compare with the same map compiled from the HathiTrust collection, which is largely a mirror of Google Books.  Note that the two are extremely similar, suggesting that our understanding of the past, at least as seen through books, is not heavily influenced by the collection we use in the pre-copyright era.

What if instead of trying to map every location from every book, we looked just at the locations mentioned in books about a particular subject? Click on the map above to access an interactive zoomable map of all locations mentioned in the 7,715 Internet Archive books published 1855 to 1875 about the American Civil War, Abraham Lincoln, Slavery, or Reconstruction. This map features every location mentioned in those books, so locations elsewhere in the world, especially Europe, are seen on the map, reflecting their contextualization in the discussion of these topics. Yet, as you zoom in, you will notice that the overall contours of the map align closely with the contours of the Civil War and Lincoln’s life.

Turning to the 13,684 HathiTrust books published from 1900 to 1920 about World War I (capturing the environment leading to war) using OCLC subject tags, the map below repeats this process, capturing the global extent of the “Great War.”

Being able to visualize what was written in the past, the firsthand experiences of historical events as they actually happened, offers us an incredible lens to reexamine our understanding of our history. In particular, it allows us for the first time to compare what was written at the time period with what is written today.

In 2012 I collaborated with Silicon Graphics to map world history as seen through the eyes of Wikipedia, using the same approach to identify textual mentions of location in Wikipedia articles and associate them with the nearest date reference. Locations mentioned with respect to a given year are showcased together on the map, with all locations appearing together in an article being linked and the color of each point/line being the average emotion of all mentions of that location or pair of locations within that year, from bright green (highly positive) to bright red (highly negative). Since Wikipedia is a modern encyclopedia of the world, it offers us the incredible ability to compare for the first time our present understanding of the past with what was really said at the time.
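The coloring scheme described above, an average emotion per location and year mapped onto a green-to-red scale, could be approximated as follows. This is a minimal sketch under stated assumptions rather than the method used for the Wikipedia maps: it assumes each mention carries a tone score between -1 (highly negative) and +1 (highly positive), and it uses a simple linear blend between red and green. The names `average_tone` and `tone_to_rgb` are hypothetical.

```python
from collections import defaultdict


def average_tone(mentions):
    """Average the tone of all mentions of a location within a year.

    `mentions` is an iterable of (year, location, tone) triples, with
    tone assumed to lie in [-1, 1]. Returns {(year, location): mean}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for year, location, tone in mentions:
        entry = totals[(year, location)]
        entry[0] += tone
        entry[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


def tone_to_rgb(tone):
    """Map a tone in [-1, 1] to an (r, g, b) color.

    -1 gives bright red, +1 gives bright green, and values in between
    are blended linearly. The linear blend is an assumption.
    """
    t = max(-1.0, min(1.0, tone))  # clamp out-of-range scores
    red = int(255 * (1 - t) / 2)
    green = int(255 * (1 + t) / 2)
    return red, green, 0


if __name__ == "__main__":
    # Hypothetical mentions with made-up tone scores.
    sample = [
        (1863, "Gettysburg", -1.0),
        (1863, "Gettysburg", -0.5),
        (1863, "Washington", 1.0),
    ]
    for key, tone in average_tone(sample).items():
        print(key, tone, tone_to_rgb(tone))
```

The same averaging would apply to pairs of locations linked within an article, with the pair taking the place of the single location in the key.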

As more and more of our history is digitized and preserved, the past will compete with the present as a source of “big data” and we will be able to apply the incredible tools of the modern era to reimagine how we understand the world around us and how we got where we are today.