Some of you may be familiar with the novel Democracy: An American Novel.
It is a novel published anonymously in 1880 that was very popular in the United States. At the time, many people tried to identify the author through all kinds of comparative analyses between the novel and texts by known authors. The identity of the true author was eventually revealed after his death: it was Henry Adams. For this lab, you will develop a method for measuring the proximity between two texts and use it to identify the author of a mystery text.
You will have access to several novels by French authors whose works are now in the public domain. The authors are: Jules Verne, Victor Hugo, Honoré de Balzac, Voltaire, the Comtesse de Ségur and Émile Zola. You will process these texts to extract a signature for each author. The signatures will be used to compare the authors with one another and to compare the mystery text against each author. All novels by the same author are grouped together in a single text file.
The signature you are going to compute is based on the frequency of word doublets. A word doublet is simply a sequence of two consecutive words. For example, in the sentence "The dog eats the cat and the cat eats the mouse.", we find the following doublets: