Collecting and Preserving User Generated Content
The Library of Congress (LC) is beginning to collect user generated content (UGC) and has begun including blogs in many of its digital collections. Blogs are easy to capture because they have a fairly uniform format and tend to exist for a long time.
Abigail Grotke and Martha Anderson described some of LC’s recent efforts. They have archived many of the sites from the last several US national elections. In the early days of archiving, images were not a priority for crawlers. In 2000, images, audio, and video were not archived. By 2002, images started to be captured, but the sites were simple. In 2004, the websites had become more sophisticated and included blogs, donation buttons, video ads, and interactive features. Improved crawlers allowed the capture of multimedia. In 2008, there were many connection options, a great deal of content hosted on third-party sites, and social media communities. (They did not ask permission from Facebook and other sites to archive content but assumed it was owned by the candidates.) By 2010, social media use had grown; there were more focused lists and heavier use of Facebook and Twitter. Foursquare and Friendster appeared, and use of widgets (which are hard to archive) had grown. Web 2.0 technologies are a continuing challenge, and Facebook has implemented new restrictions.
Many members of Congress now use social media to communicate with constituents. They link to 17 social media sites from their pages. Only 36 members are not using social media.
Martha Anderson said that we used to think of publishers producing material and archivists storing it. Now, all of these functions might be done by the same person. Web pages are artifacts of cultural and technological change, and UGC is one of those. A conference at LC last year, “Preserving digital news: Today’s news, tomorrow’s history,” made the points that we need to:
- Think about citizen journalism,
- Develop shared understandings of long-term value, and propose criteria for assessing it, and
- Suggest new organizational models and collaborate to best support this content’s stewardship.
The Twitter archiving project has generated much interest and many inquiries. It is an opportunity to learn. The project is very large, about 20 terabytes of material. Even Twitter cannot index the whole corpus (it amounts to too many rows in a database). Here are the subjects of the inquiries that LC has received:
Twitter is now part of the historical record of communication, news reporting, and social trends. It is a direct record of important events and also serves as a news feed with minute-by-minute headlines. It is also a platform for citizen journalism. The collection therefore has cultural value and is worth preserving.
CIL 2013 Blog Coordinator