- Andrew J. Torget: General Editor
- Debbie Liles: Associate Editor
- Ben W. Brumfield: Lead Back End Developer
- Jason Ellis: Lead User Interface Developer and Data Visualizations
- Sara Carlstead Brumfield: Semantic Analysis and Additional Development
- William Hicks: User Interface Development and Data Visualizations
- Stephen Mues: Database and User Interface Development
- Karin Dalziel: Information Architecture
- Deborah Kilgore: Transcription and TEI markup
- Cameron Sinclair: Transcription and TEI markup
About the Project
- Austin Papers in Manuscript and Print
- Creating the Digital Edition
- Transcription and Markup
- TEI and XML
- Data Transformations for the Online Browse and Search Interfaces
- DAP Browsing Interface
- DAP Search Interface
- Sentiment Analysis
- Future Development of DAP
The Digital Austin Papers is an ongoing effort to build a digital edition of the surviving correspondence of Stephen F. Austin. During the 1820s and early 1830s, Austin served as the most prominent land agent working with the government of Mexico to bring colonists from the United States into the Texas borderlands. As such, his voluminous correspondence offers a remarkable window into the ideas and movements of both Mexicans and Americans during those turbulent decades that preceded the U.S.-Mexican War.
In addition to making these papers available in digital form, a central goal of the project is to experiment with new methods for exploring and discovering meaningful patterns embedded within historical documents. To that end, the project offers a variety of digital methods for searching and discovering patterns spread across the collection, such as text-mining, digital mapping, and sentiment analysis.
The vast majority of the surviving Austin manuscripts are housed in the Dolph Briscoe Center for American History (DB-CAH) at the University of Texas, which also holds extensive manuscript collections connected to other members of the Austin family and key associates of Austin’s. The General Land Office of Texas also holds important collections of his papers. Various other Austin manuscripts may also be found scattered among a variety of archives and collections.
During the 1920s, Eugene C. Barker published an edition of the Austin Papers in three volumes:
- Volume I (published by the Government Printing Office in two parts in 1924) offered transcriptions of Austin’s correspondence through 1827.
- Volume II (published by the Government Printing Office in 1928) offered transcriptions of Austin’s correspondence through 1834.
- Volume III (published by the University of Texas Press in 1927) offered transcriptions of Austin’s correspondence through 1836.
By necessity, the Barker Edition left out numerous documents associated with Austin. Roughly a thousand Austin letters in the DB-CAH collections, for example, were left out of the Barker Edition due to the financial constraints of publishing such a voluminous collection. Other Austin manuscript collections have been discovered in the decades since the 1920s, and several selections of Austin correspondence that did not appear in the Barker Edition have also since appeared in print in other venues.
The Digital Austin Papers (DAP) currently consists of 2,183 letters. That collection represents all the English-language documents transcribed and published in the 1920s Barker Edition. It also includes numerous English-language documents which were left out of the Barker Edition but are available in both transcript and manuscript form in the collections of the DB-CAH. Scanned transcripts of those documents may be found in the “Moses and Stephen F. Austin Papers” collection of UNT’s Portal to Texas History.
The DAP project chose to begin with the Barker Edition and the transcripts of the DB-CAH for several reasons. Foremost, constraints of time and funding meant that scanning and digitizing the existing transcriptions, rather than creating a new set, offered the best return on the project’s limited resources. In addition, our comparisons of the Barker Edition against the original manuscripts revealed a remarkably high level of accuracy in the transcriptions. Our intention is to compare all of these transcriptions against the original manuscripts in a future iteration of the project, as resources allow.
Digital scans were made of the Barker Edition, which were then run through optical character recognition (OCR) software by the Digital Projects Lab of the University of North Texas Libraries. Transcripts of Austin documents not included in the Barker Edition went through the same process. The OCR output was then reviewed, corrected, and scrubbed.
The corrected text of the Austin documents was then marked up in XML using TEI-P5 guidelines. The project assigned various metadata fields (such as titles, dates, author, recipient) to each document, added several project-specific metadata fields (such as the location of both the document’s creation and its destination), and tagged every identifiable person and location mentioned within the documents. Summaries of each letter were also paired with the documents, usually using summaries contained in the Barker Edition.
A project-specific xml2tei Perl script was used to create the TEI files, which were then validated against the version 2.3.0 P5 DTD at http://www.tei-c.org/Vault/P5/2.3.0/.
These marked-up versions of the Austin papers are available through the DAP search and browsing interfaces. But because we take seriously Peter Robinson's admonition that "your interface is everyone else's enemy," we also decided to expose the TEI version of the papers in two additional ways:
- Bulk Download: A GitHub repository containing the editor's transcripts, the TEI-P5 XML files, and the programs used to convert the transcripts to TEI-XML is at AustinTranscripts and can be downloaded for analysis or any other re-use.
- Direct Download: Each page on the Digital Austin Papers site has a direct link to download the document in XML format from the GitHub repository.
The DAP online interface required further processing to create the derivative data structures that would support online browsing, searching and analysis. When a TEI XML file is loaded into the system, the title, date and summary of the document are extracted and added along with the TEI source to a document record in the MySQL database powering the online interfaces of DAP. Place names and personal names are extracted from the document and added to tables containing pointers back to the ID of the document in which they appear. In addition, the text of the document (excluding mark-up) is passed through a Porter stemmer and aggregated, creating a distribution of word stem frequencies for each document.
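As a rough illustration of this loading step, the following Python sketch pulls a title, tagged names, and stem frequencies out of a TEI string. It is not DAP's code: the sample document is invented and heavily simplified, the function and field names are hypothetical, only the title is extracted from the header, and `crude_stem` is a simple suffix stripper standing in for the Porter stemmer the project actually used.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented, heavily simplified sample; real DAP files carry far more metadata.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><persName>Stephen F. Austin</persName> wrote from
      <placeName>San Felipe de Austin</placeName> while settlers
      settled and were settling nearby.</p>
    </body>
  </text>
</TEI>"""


def crude_stem(word):
    """Strip a few common suffixes. A stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tagged_names(element, tag):
    """Return the whitespace-normalized text of every <tag> under element."""
    return [
        " ".join("".join(el.itertext()).split())
        for el in element.findall(f".//tei:{tag}", TEI_NS)
    ]


def extract_record(tei_source):
    """Build a dictionary resembling one document record plus its indexes."""
    root = ET.fromstring(tei_source)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    text_el = root.find("tei:text", TEI_NS)
    # itertext() yields only character data, so the mark-up is excluded.
    plain = "".join(text_el.itertext())
    words = re.findall(r"[a-z]+", plain.lower())
    return {
        "title": title,
        "people": tagged_names(text_el, "persName"),
        "places": tagged_names(text_el, "placeName"),
        "stem_counts": Counter(crude_stem(w) for w in words),
    }


record = extract_record(SAMPLE_TEI)
print(record["title"], record["people"], record["places"])
print(record["stem_counts"].most_common(3))
```

In a database-backed system, `people` and `places` would become rows pointing back to the document's ID rather than lists held in memory.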
After these derivative data structures are loaded, emendations are applied to support further analysis. Each document's text is extracted and passed through the sentiment analysis library TextMood (detailed below) to generate a sentiment score for that document, which is then added as an attribute to the document record in the database. Each toponym is passed through GeoNames to obtain latitude and longitude coordinates. Additional quality controls are applied to the place names and personal names in the correspondence metadata identifying sender, recipient, and locations of composition. This is facilitated by parsing names from the Barker Edition titles and summaries and leveraging the Barker Edition’s normalizations.
The DAP browsing interface allows users to explore the collection by any particular date, the authors or recipients of documents, and the geographical origins or destinations of documents. Clicking on the title of any associated document will bring up the digital version of that document.
The DAP search interface allows users to search the collection by any combination of keyword, author, recipient, date range, locations and sentiment scores (detailed below).
Because a driving goal of DAP is to provide users with multiple tools for exploring patterns embedded within the collection, users may explore the results of any given search in four different views:
- Document list: Here users may view a list of the documents associated with their search, which can be sorted by date, relevance, and sentiment scores (detailed below). Clicking on any of the document titles will bring up the XML-TEI version of that letter.
- Timeline/Sentiment: Here users may view a histogram that represents either:
  - The overall frequency of documents in their search results over time, or
  - The search results in a given year as a proportion of all documents from that year.
  - In both cases, the histogram also shows the percentage of documents in a given year that corresponds to particular sentiment scores (detailed below).
  - Clicking on either the year of the histogram, or a particular sentiment bar in a particular year, will bring up the associated documents.
- Geography: Here users may view the geographic patterns embedded in their search results, as the letters are plotted on a map that shows the origins and destinations for each document. Zooming into the map allows users to click on individual locations to access the documents associated with particular places.
- Word Counts: Here users may view ranked lists of the most frequently occurring words within their search results, grouped by total words, named people, and named locations. Clicking on individual words or names brings up the associated documents.
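The two histogram modes in the Timeline/Sentiment view come down to simple per-year arithmetic. A minimal Python sketch, assuming each document is reduced to its year; the function name and input shape are illustrative and not taken from DAP:

```python
from collections import Counter


def timeline(result_years, all_years):
    """For each year in the collection, report the number of matching
    documents and their share of all documents from that year."""
    results = Counter(result_years)
    totals = Counter(all_years)
    return {
        year: {"count": results[year], "share": results[year] / totals[year]}
        for year in sorted(totals)
    }


# Hypothetical data: six documents in the collection, three of them in the results.
all_years = [1830, 1830, 1830, 1830, 1831, 1831]
result_years = [1830, 1831, 1831]
for year, stats in timeline(result_years, all_years).items():
    print(year, stats["count"], stats["share"])
```

The raw count answers "when were these documents written?", while the share corrects for years in which the collection as a whole is simply larger.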
Sentiment analysis is a computational linguistics approach to determining the emotional content of text. In DAP, we adopted a method in which every word in a document is assigned a score based on a positive or negative weight in a dictionary. Scores range from +1 to -1, with 0 being neutral, 1 being strongly positive, and -1 being strongly negative. If the word was not in the dictionary it was considered neutral. Totaling the scores of all words in the text provides a classification of the text itself.
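This word-by-word scoring can be sketched in a few lines of Python. The weights below are invented for illustration and are not TextMood's dictionary; the tokenization rule is also an assumption.

```python
import re

# Hypothetical weights in the range -1 to +1; NOT the dictionary DAP used.
WEIGHTS = {"glad": 0.5, "friend": 0.25, "feud": -0.75, "duel": -0.5}


def sentiment_score(text, weights=WEIGHTS):
    """Sum the weight of every word; words missing from the dictionary count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(weights.get(word, 0.0) for word in words)


print(sentiment_score("I am glad to call you a friend"))
print(sentiment_score("The feud ended in a duel"))
print(sentiment_score("The river is wide"))
```

Because the score is a plain sum, a long document containing many weighted words can score well outside the -1 to +1 range that applies to individual words.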
Since it is a measurement of emotion, sentiment analysis is by nature always an approximation. The fact that human readers tend to disagree about the overall sentiment of a given text about 20 percent of the time demonstrates that sentiment scoring of any kind should be taken as simply a rough gauge for the general emotional direction of any given document. Our use of sentiment analysis in DAP is, therefore, an experiment that is part of our larger goal of exploring new methodologies for language analysis.
After experimenting with six different open-source Ruby sentiment analysis libraries from GitHub, DAP settled on TextMood. Each document in the collection was run against the program’s dictionary and given a sentiment score based on the cumulative positive and negative weight of the words within that document.
We used three approaches to check the quality of the results after applying sentiment analysis to the DAP corpus. First, we reviewed the most common words in the corpus to check the coverage and accuracy of the dictionary. We found that 95 percent of the nineteenth-century vocabulary had been scored correctly. Second, we hand-checked the most negative and the most positive scored documents and discovered that those results met our expectations. For example, the most negative described a feud, altercation, and duel between Stephen F. Austin and Joshua Pilcher. The most positive, by contrast, was Austin's glowing sales letter to a Swiss group considering immigration to Texas. Third, we compared the overall trends in sentiment scores over time against what we knew about Austin’s life to see if particular spikes in the sentiment scores matched the historical record of particularly stressful moments for Austin. The spikes in negative sentiment that emerged in documents from 1830 and 1831 matched our expectations, as historians have long recognized those years as particularly difficult periods in both Austin’s personal and professional life.
Overall, the purpose of using sentiment analysis in DAP is to offer users of the project a rough index of the emotional context of the documents in the collection. In order to avoid conveying false impressions of precision, the “Documents” list in the search results converts the numeric score to “positive,” “neutral,” and “negative” (although the precise score and the range of scores among the documents are still available to users).
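A minimal sketch of that conversion, in Python. The cutoff is an assumption: the text above does not state where DAP draws the line between neutral and positive or negative, so `neutral_band` is a hypothetical parameter.

```python
def sentiment_label(score, neutral_band=0.0):
    """Map a numeric score to a coarse label.

    Scores within +/- neutral_band of zero are reported as neutral.
    The default band of 0.0 is an illustrative choice, not DAP's setting.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


for score in (0.75, 0.0, -1.25):
    print(score, sentiment_label(score))
```

Widening the band trades sensitivity for stability: with `neutral_band=0.5`, a faintly positive score of 0.25 would be reported as neutral.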
We anticipate that work on DAP will proceed along two fronts that support the two driving goals of the overall project:
- Expanding the collection to include as many Austin documents as possible, including:
  - Processing and incorporating all Spanish-language documents from the Barker Edition.
  - Processing and incorporating all other known documents – in any language – left out of the Barker Edition.
- Developing and refining the available tools for searching and exploring the datasets.
Digital Austin Papers is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.