Cody Hennessy, E-Learning and Information Studies Librarian at the University of California-Berkeley (UCB), noted that the amount of digitized text available has grown enormously. Google Books now holds over 15 million books (12% of all published books, with 500 billion words, 361 billion of them in English), which makes a great deal of digitized text accessible at once. The HathiTrust Digital Library has started the HathiTrust Research Center to give access to text and allow researchers to ask questions about it. Scanned journals and newspapers are now in licensed databases, so you can, for example, trace how a word has been used over time.
A collection the size of Google Books cannot be read by a human, so analysis questions are being asked of such collections across many disciplines.
Professor Marti Hearst at UCB wrote an interesting article in 2003 that includes a definition of text mining, summarized here.
The goal of text mining is to discover previously unknown information, not to retrieve what you already know is there. Franco Moretti proposed the term “distant reading” for approaches that focus on units much smaller or much larger than the individual text, typically by writing Python or R code to run statistical analyses on it.
There are thousands of books that nobody has ever read or that are no longer being read. Distant reading allows us to look at those texts as primary sources. People are now pairing text mining and distant reading. For example, we can count the number of novels written in the first person or the third person over time. Increasingly, people are also looking at syntax, using code written in languages such as Python or R to run statistical analyses of word frequencies or to train neural networks.
For example, here is a word cloud of the topic labeled “female fashion”.
The conclusion from this research is that female authors are more than twice as likely as male authors to write about women’s clothing or “fashion”.
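As a rough illustration of the kind of counting described earlier (first-person versus third-person narration), the sketch below compares pronoun counts in a text. It is only a simplified heuristic under stated assumptions: the pronoun lists and the function names `pronoun_counts` and `guess_narrative_person` are hypothetical, and a real study would need a more careful definition of narrative voice (dialogue alone can skew the counts).

```python
import re
from collections import Counter

# Illustrative pronoun lists (an assumption, not taken from any cited study).
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
THIRD_PERSON = {"he", "him", "his", "himself",
                "she", "her", "hers", "herself",
                "they", "them", "their", "theirs", "themselves"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_counts(text):
    """Return (first_person_count, third_person_count) for a text."""
    counts = Counter(tokenize(text))
    first = sum(counts[word] for word in FIRST_PERSON)
    third = sum(counts[word] for word in THIRD_PERSON)
    return first, third


def guess_narrative_person(text):
    """Guess 'first', 'third', or 'unclear' from raw pronoun counts."""
    first, third = pronoun_counts(text)
    if first == third:
        return "unclear"
    return "first" if first > third else "third"


if __name__ == "__main__":
    sample = "I walked to my house and I saw it."
    print(pronoun_counts(sample), guess_narrative_person(sample))
```

Running a function like this over a dated corpus and grouping the results by publication year would give the kind of over-time count described above.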
Using sentiment analysis, a study found that tweets from Trump’s Android account used 40-80% more words related to disgust, sadness, fear, anger, and other negative sentiments than tweets from his iPhone account did, which led to speculation that the iPhone account was being used by a staff member.
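A comparison of this kind can be sketched with a simple lexicon-based count. The example below is a minimal illustration, not the method of the study mentioned above: the word list `NEGATIVE_WORDS` is a tiny made-up lexicon, the sample texts are invented, and the function names are hypothetical. A real analysis would use a published sentiment lexicon and far more data.

```python
import re

# Tiny made-up lexicon for illustration only (an assumption).
NEGATIVE_WORDS = {"bad", "sad", "angry", "terrible", "fear", "disgusting", "weak"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def negative_rate(texts, lexicon=NEGATIVE_WORDS):
    """Share of all tokens in `texts` that appear in the negative lexicon."""
    tokens = [token for text in texts for token in tokenize(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def relative_increase(rate_a, rate_b):
    """Percentage by which rate_a exceeds rate_b."""
    if rate_b == 0:
        raise ValueError("rate_b must be non-zero to compute a relative increase")
    return (rate_a - rate_b) / rate_b * 100


if __name__ == "__main__":
    # Invented sample texts standing in for two sources being compared.
    source_a = ["bad sad day", "terrible news"]
    source_b = ["a bad day for all"]
    rate_a = negative_rate(source_a)
    rate_b = negative_rate(source_b)
    print(f"Source A uses {relative_increase(rate_a, rate_b):.0f}% more negative words")
```

Normalizing by the total number of tokens matters here, because the two sources being compared rarely contain the same amount of text.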
What does this have to do with libraries?
The Computational Text Analysis Working Group (CTAWG) of the D-Lab is dedicated to curating and developing innovative methods of computational text analysis for the research community at UCB. Roles in the group are informal, and no expertise is required. “It’s OK Not To Know” (IOKN2K) is a guiding principle. There is no curriculum; if you keep showing up, you become part of the group. Here are some opportunities for libraries:
- Connecting researchers with sources such as the Congressional Record:
  - A large amount of data
  - In the public domain
  - Spoken and written text
  - Related to other data: metadata about members of Congress, voting records, funding, committees
- Reviewing vendor opportunities
- Auditing library collections
- Identifying new research opportunities
Data sources include:
- ProQuest Congressional (1789-present)
- gpo.gov (1994-present)
- ICPSR: 104th to 110th Congress
- An XML database of the complete Congressional Record which was purchased from ProQuest
Most major vendors do not know how to support this kind of research yet (which may mean that they don’t know how to sell it). Some vendors are open to experimentation. What to ask of vendors:
- How good is the OCR? (Ask for examples and for the dates the files were scanned, because OCR has improved greatly over time. Do they do OCR correction?)
- Do they follow data management best practices? (Is there a Readme file? Are the files organized consistently? Is there documentation of how the data was created?)
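When a vendor supplies sample files, one rough way to get a first impression of OCR quality is to measure what share of the tokens are recognizable words. The sketch below is only a crude proxy under stated assumptions: `ocr_quality` is a hypothetical helper, and `SAMPLE_VOCABULARY` is a made-up mini word list standing in for a real dictionary. It does not replace comparing the OCR output against a hand-corrected transcription.

```python
import re


def ocr_quality(text, vocabulary):
    """Return the share (0.0 to 1.0) of alphabetic tokens found in `vocabulary`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)


# Made-up mini word list for illustration; a real check would load a full dictionary.
SAMPLE_VOCABULARY = {"the", "senate", "met", "at", "noon",
                     "and", "was", "called", "to", "order"}

if __name__ == "__main__":
    clean = "The Senate met at noon and was called to order"
    noisy = "Tbe Senatc met at n00n and was callcd to ordcr"
    print(f"clean sample: {ocr_quality(clean, SAMPLE_VOCABULARY):.2f}")
    print(f"noisy sample: {ocr_quality(noisy, SAMPLE_VOCABULARY):.2f}")
```

A noticeably lower score on older scans than on recent ones would be a reason to ask the vendor when the files were scanned and whether any OCR correction was done.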
A library guide to text mining is available.