As we begin planning Digitizing Enlightenment IV, which will take place in the context of the ISECS Congress in Edinburgh in July 2019, we are keen to broaden the scope and breadth of the Digitizing Enlightenment community in order to highlight new, and existing, digital projects across the interdisciplinary spectrum of eighteenth-century studies. This post, based on work presented at the Digitizing Enlightenment III workshop held in Oxford in July 2018, demonstrates how to identify text reuse – citations, borrowings, plagiarisms – as well as other techniques for leveraging freely available large data-sets from the 18C.
– Glenn Roe, Voltaire Lab
The incredible richness of the Newberry Library’s French Revolution Collection (FRC) has long been known. It consists of more than 30,000 pamphlets and more than 23,000 issues of 180 periodicals published between 1780 and 1810. These represent the opinions of all the factions that opposed and defended the monarchy during the turbulent period between 1789 and 1799, and the collection also contains innumerable ephemeral publications of the early First Republic. The Newberry has released digital copies of more than 35,000 pamphlets totalling approximately 850,000 pages. Not only has the Newberry made the collection available to the public, but it has also released a data feed of the entire collection, consisting of the Library’s exceptional metadata describing each object, the OCR text data, and links to the digital facsimiles accessible from the Internet Archive. The aim is to encourage researchers and instructors to incorporate the digital collection in new kinds of scholarship and engagement.
In order to facilitate experimental work at ARTFL on this unparalleled resource, we have loaded two versions of this collection, both based on a download from the Newberry’s GitHub repository in November 2017, into PhiloLogic4, the latest release of ARTFL’s text analysis software. The full version contains all 38,377 documents, dating from the 16th century to the end of the 19th century. The second build attempts to eliminate duplicate documents and is restricted to the period 1787-1799; it contains 26,445 documents. Additional implementation information and full open access to both versions of the FRC collection are available online. The quality and coverage of the FRC texts make the collection an ideal environment for testing experiments and algorithms that enhance access and open up new approaches, and we use the 1787-99 sample data for this work. At the bottom of the ARTFL FRC page, we have provided links to several different models for examining the collection, all based on extensions to the PhiloLogic4 package.
The simplest model is a document-level search that returns matching documents ranked by relevance, using the Python library Whoosh. This functions somewhat like a Google search on the collection, with links to the page images of each document or to specific instances of the search words in context. For example, the results of a search for “conspirateurs aristocrates ennemis étrangères royalistes” can be seen here.
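To give a rough sense of what relevance ranking does, here is a minimal sketch in plain Python. It does not use Whoosh and is not the scoring used in the ARTFL build; it substitutes a simple TF-IDF score, and the document identifiers and texts are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs, best match first, using a simple TF-IDF sum."""
    counts_by_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in set(tokenize(query)):
        # Document frequency: how many documents contain the term at all.
        df = sum(1 for counts in counts_by_doc.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(1 + n_docs / df)  # rarer terms weigh more
        for doc_id, counts in counts_by_doc.items():
            if counts[term]:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy documents, not taken from the FRC.
docs = {
    "pamphlet_1": "Les conspirateurs et les aristocrates sont les ennemis du peuple",
    "pamphlet_2": "Les royalistes conspirent avec les ennemis",
    "pamphlet_3": "Discours sur les assignats et la monnaie",
}
results = rank("conspirateurs aristocrates ennemis royalistes", docs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Documents that share no term with the query are left out of the results, and a document matching several rare query terms rises to the top, which is the behaviour one would expect from a ranked search form.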
The second approach is the application of a Topic Model algorithm to the collection. Topic Models, which are widely used in digital humanities, are a set of unsupervised learning algorithms that divide a collection into a specified number of clusters based on the vocabulary of each document. The results of the Topic Model have been added to the metadata of the PhiloLogic4 build of the 1787-99 sample data. Each document is identified as having a first and second topic, denoted as A or B, with a number from 00-49 as listed in this TABLE. The first column is the topic number, and the second is one or more English keywords, which can also be searched. The third column gives the top 3 weighted words (features) of that topic, and the fourth column gives the rest of the top 10, all shown in order of relative weight. Thus, A29 will return the documents that have money assignats as the top weighted topic. Searching for "money" in topic models will return documents with this as either the first or second topic. An alternative use of this data is to copy some or all of the terms in columns 3 and 4 into the Whoosh search form to retrieve documents ranked by relevance.
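The A/B labelling scheme can be sketched as follows. This is an illustration only, under the assumption that each document comes with one weight per topic; the function names and the example weights are hypothetical and are not taken from the PhiloLogic4 build.

```python
def topic_labels(weights):
    """Return ('Ann', 'Bnn') codes for the two heaviest topics in a list of weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    first, second = ranked[0], ranked[1]
    return f"A{first:02d}", f"B{second:02d}"


def matches_topic(labels, number):
    """True if the topic number is either the first (A) or the second (B) topic."""
    return f"A{number:02d}" in labels or f"B{number:02d}" in labels


# Hypothetical document: 50 topic weights, dominated by topics 29 and 7.
weights = [0.0] * 50
weights[29] = 0.6
weights[7] = 0.3

labels = topic_labels(weights)
print(labels)
print(matches_topic(labels, 29), matches_topic(labels, 3))
```

Searching on a code such as A29 then corresponds to an exact match on the first label, while a keyword search that accepts either position corresponds to `matches_topic`.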
Our first presentation at Digitizing Enlightenment III showed results from applying the latest version of our sequence aligner to detect text reuse (citations, borrowings, plagiarisms, and so on) of pre-Revolutionary documents during the Revolutionary period. Sequence alignment is a family of algorithms used in a surprising range of disciplines, from genetics to text analysis, to identify similar segments of arbitrary length. For this work, we aligned the FRC 1787-99 sample against ARTFL’s Frantext pre-1788 collection. The Frantext sample contains 1,263 documents and is particularly strong in 18th century holdings. We loaded the results of the alignment run into a dedicated database, which can be queried in a variety of ways, such as by source and/or target metadata as well as by words in matching passages.
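As a small illustration of the idea behind sequence alignment, the sketch below finds runs of identical words shared by two texts using Python's standard difflib module. It is not the ARTFL aligner: it only catches exact matches, whereas a production aligner has to tolerate spelling variants, OCR errors and small insertions. The function names, the threshold and the target sentence are invented for the example.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def shared_passages(source, target, min_words=4):
    """Return word runs of at least min_words that occur in both texts."""
    src, tgt = words(source), words(target)
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # The final block always has size 0 and is skipped by the threshold.
        if block.size >= min_words:
            passages.append(" ".join(src[block.a:block.a + block.size]))
    return passages


source = "L'homme est né libre, et partout il est dans les fers."
# Hypothetical later text that quotes the source sentence.
target = ("On nous répète que l'homme est né libre et partout il est dans les fers, "
          "mais qui le croit ?")

found = shared_passages(source, target)
print(found)
```

Working on word lists rather than characters makes punctuation and capitalisation differences irrelevant, which is usually what one wants when looking for borrowed passages.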
The public database (June 22, 2018 build) contains 8,937 aligned passages, of which around 1,000 were identified algorithmically as banalities. Filtering out shorter alignments, those of fewer than 10 words, results in just under 7,000 passages. It is important to note that these numbers are approximate, since they can vary significantly depending on the approach we use to identify and, where appropriate, merge longer passages. The general frequencies are not particularly surprising. The following is a table of the number of borrowed passages in the FRC by author.
Montesquieu – 1,315
Rousseau – 1,133
Voltaire – 979
Mably – 303
Aulnoy – 263
Racine – 168
Helvétius – 167
D’Holbach* – 146
Saint-Simon – 135
Bossuet – 110
La Fontaine – 94
Diderot – 85
Corneille – 72
Mirabeau – 71
Boileau – 69
Bernardin – 67
Montaigne – 65
*D’Holbach appears as two entries due to slight metadata differences.
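A tally of this kind can be sketched in a few lines. The record layout below (author, words, banality) is an assumption made for the example and is not the schema of the alignment database; whether banalities are dropped before counting is left as a parameter, since that choice changes the totals.

```python
from collections import Counter

# Hypothetical alignment records, not real results.
passages = [
    {"author": "Rousseau", "words": 42, "banality": False},
    {"author": "Rousseau", "words": 7, "banality": False},
    {"author": "Montesquieu", "words": 15, "banality": False},
    {"author": "Voltaire", "words": 12, "banality": True},
    {"author": "Montesquieu", "words": 30, "banality": False},
]


def tally_by_author(records, min_words=10, drop_banalities=True):
    """Count passages per source author after filtering short or banal alignments."""
    kept = [
        r for r in records
        if r["words"] >= min_words and not (drop_banalities and r["banality"])
    ]
    return Counter(r["author"] for r in kept)


counts = tally_by_author(passages)
for author, n in counts.most_common():
    print(f"{author} – {n}")
```

Normalising author names before counting would also merge entries that differ only in their metadata, such as the two D’Holbach entries noted above.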
The yearly distribution of borrowings from the top three Enlightenment authors again follows a reasonable pattern.
The annual distribution in the FRC of the 536 passages derived from Rousseau’s Contrat Social seems reasonable and is consistent with what we know from other sources.
While the global numbers are interesting, if not very surprising, there are a number of specific texts and authors that warrant further investigation. There are numerous chapbooks, such as the Calendrier moral, 1794, which are interesting because of their selection of inspiring passages from various authors. Jean-Jacques Barthélemy’s L'Accord de la religion et de la liberté (1791) features some 25 long extracts from d’Holbach’s Système social.
The alignment database is available to the public and has a variety of useful features. This link will run a search for all of the aligned passages in the FRC from Rousseau’s Contrat Social that are longer than 10 words. The report is laid out chronologically (in this case by FRC year). Each instance shows the matching passages with available metadata, links to the context of each passage, and a button to highlight the differences in each matching pair. The facets on the right allow you to get frequencies by author, title, year and so on. Clicking on a facet returns the corresponding text pairs.
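The difference highlighting can be approximated with the standard difflib module, as in the sketch below. This is only an illustration of the general technique and not the code behind the database interface; the function name, the bracket notation and the altered passage are made up for the example.

```python
from difflib import SequenceMatcher


def mark_differences(source, target):
    """Return the target passage with words that differ from the source in [brackets]."""
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    out = []
    for tag, _a1, _a2, b1, b2 in matcher.get_opcodes():
        chunk = tgt[b1:b2]
        if not chunk:
            continue  # words present only in the source leave nothing to mark here
        text = " ".join(chunk)
        out.append(text if tag == "equal" else f"[{text}]")
    return " ".join(out)


source = "la volonté générale est toujours droite"
# Hypothetical reuse that changes the last word.
target = "la volonté générale est toujours juste"

marked = mark_differences(source, target)
print(marked)
```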
We anticipate further experimental work on the FRC, most notably in using the excellent subject information as a way to assess the accuracy of Topic Modelling, and in considering supervised learning algorithms to further classify the collection by subject.
It is our pleasure to acknowledge that the Newberry Library has released this extraordinary resource under the Open Data Commons Attribution License, ODC-BY 1.0. We believe that this splendid collection and the Newberry’s release of all of the data will facilitate a generation of ground-breaking work in Revolutionary studies. If you find the collection useful, please do contact the Newberry Library to congratulate them on this wonderful initiative and to let them know how their efforts contribute to your research.
We would love to hear from you. Please send comments, suggestions and problem reports to email@example.com.
– Clovis Gladstone and Mark Olsen