Word2Vec ANALYSIS
19TH CENTURY NEWSPAPERS
ON GNADENHUTTEN MASSACRE
The source code for this project can be found here.
Word2Vec is a popular word embedding technique that models words as vectors in a high-dimensional space, going beyond simple frequency counts. Its advantage is that it can capture the "contexts" in which a word appears within a specific corpus. I trained a Word2Vec model on 9 newspaper reports on the Gnadenhutten Massacre, which happened on March 8, 1782. I am interested in how the different sides involved in this massacre were discussed in public discourse. Specifically, I am interested in the words most closely associated with the Moravian Indians and the American militia.
Methodology
Python's gensim implementation of the Word2Vec model is used to train word vectors with 500 dimensions. The text pre-processing steps taken before training include removing stopwords, stemming, tokenizing words, and removing dirty text (spelling mistakes and random characters).
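The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the stopword list is a tiny hypothetical one, `crude_stem` is a simple suffix stripper standing in for a real stemmer (such as Porter), and spelling mistakes are not corrected here, only non-alphabetic characters are dropped.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "were", "was", "by", "at"}

# Suffixes removed by the crude stand-in stemmer below (not a real Porter stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Turn raw text into a list of cleaned, stemmed tokens."""
    # Tokenize: lowercase and keep alphabetic runs only, which also drops
    # digits, punctuation and stray characters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords and single-letter fragments, then stem what remains.
    return [crude_stem(t) for t in tokens if t not in STOPWORDS and len(t) > 1]


tokens = preprocess("The Indians were killed by the militia at Gnadenhutten, 1782 @#")
print(tokens)  # ['indian', 'kill', 'militia', 'gnadenhutten']
```

Each cleaned sentence becomes one list of tokens, which is the input shape that Word2Vec training expects.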
For the actual training of Word2Vec, the following hyperparameters were used:
- Skip-gram is used instead of CBOW (Continuous Bag of Words), since skip-gram generally performs better on small datasets
- Dimension of word vectors: 500
Google's Embedding Projector and Plotly are used to visualize the results.
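To load vectors into the Embedding Projector, they are typically uploaded as a tab-separated file of vectors plus a matching metadata file of labels. The helper below is a hypothetical sketch of that export step; the function name and the toy vectors are made up for illustration, and it assumes a single-column metadata file with no header row.

```python
def to_projector_tsv(vectors):
    """Return (vectors_tsv, metadata_tsv) strings for a {word: [floats]} mapping."""
    vector_lines = []
    label_lines = []
    for word, values in vectors.items():
        # One row per word: its components separated by tabs.
        vector_lines.append("\t".join(str(v) for v in values))
        # The label on the same row number of the metadata file.
        label_lines.append(word)
    return "\n".join(vector_lines) + "\n", "\n".join(label_lines) + "\n"


# Made-up 3-dimensional vectors, only to show the file layout.
toy_vectors = {"massacre": [0.1, 0.2, 0.3], "militia": [0.4, 0.5, 0.6]}
vectors_tsv, metadata_tsv = to_projector_tsv(toy_vectors)
print(vectors_tsv)
print(metadata_tsv)
```

The two strings can then be written to files such as `vectors.tsv` and `metadata.tsv` (file names are an assumption) and uploaded through the projector's load dialog.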
Findings
Word2Vec can be used to construct a dimension from words by adding and subtracting their vectors. For this project, I look at a dimension for White American without British. Mathematically speaking, imagine the vector White + American - British. The findings are quite revealing of the Gnadenhutten Massacre.
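The arithmetic behind such a dimension can be illustrated with a small, self-contained sketch. The 3-dimensional vectors below are invented for the example (the real model uses 500 dimensions), and `nearest` is a hypothetical helper: it builds the query vector White + American - British and ranks the remaining words by cosine similarity to it.

```python
import math

# Made-up toy vectors; real Word2Vec vectors are learned from the corpus.
toy = {
    "white":    [1.0, 0.0, 0.0],
    "american": [0.0, 1.0, 0.0],
    "british":  [0.0, 0.0, 1.0],
    "militia":  [0.9, 0.8, 0.1],
    "war":      [0.3, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Leave the query words themselves out of the ranking.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = nearest(toy, positive=["white", "american"], negative=["british"])
print(ranked)
```

With a trained gensim model, a comparable lookup is `model.wv.most_similar(positive=["white", "american"], negative=["british"], topn=10)`. gensim normalizes the vectors before combining them, so its scores can differ from this plain sum.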
Some of the significant words that come up along this dimension include:
- Massacre
- Moravian Indians
- War
- Missionary
- Ohio
- Many lives
These words reconstruct what we know about the Gnadenhutten Massacre: "The Moravian Indians, who were not allies of Britain, were massacred by American militia in Gnadenhutten, Ohio. Many lives were lost (specifically, 96 Christian Indians were killed) in 1782 against the backdrop of the American Revolutionary War."
Acts of Violence - Why, How and What
To gain a more nuanced understanding of why and how these acts of violence were carried out against the neutral Moravian Indians, I explore the dimensions of several words, including:
- Massacre
- Murder
- Execution
- Slaughter
- Slaughter House
Violence by American Militia - Why and How
Limitation
Word2Vec is an amazing tool for visualizing and interpreting our corpus; however, one of the biggest limitations of this approach is that Word2Vec requires a large amount of data to give an accurate interpretation.
For future research, more newspaper articles on the Gnadenhutten Massacre can be collected so that Word2Vec can give us a more nuanced and richer interpretation.