
Tracking How Words Changed Meaning over Time

We designed and implemented an online service that lets users investigate how the meaning of any word has developed historically. It uses Google Books n-gram data, a dataset of several terabytes of text compiled from around 5% of all books ever published.

Human language constantly evolves as the world changes and as speakers seek easier forms of expression and communication. Every word therefore has its own history. For example, “nice” used to mean silly, foolish, or simple; “silly” once meant worthy or blessed; and “meat” denoted food in general. As language is our most important communication tool, understanding changes in the meaning and usage of words matters to many professionals who work with historical texts, such as linguists, historians, librarians, and social scientists.

Most prior approaches to establishing a word’s history relied on manual analysis of old texts. Advances in dataset creation and in text processing and understanding have now made it possible to generate detailed summaries of the evolution of any word. Given this, easy-to-use and intuitive interfaces can let users investigate the histories of their chosen words [1]. We have built an interactive framework for semantic change analysis for precisely that purpose. It lets users analyze the evolution of arbitrary words in detail through a rich online interface that supports complex analytics over large-scale historical textual data, covering changes in meaning, context, and word relationships across time. This multi-perspective online system [2] is derived from our previous work [3] and is accessible at: http://tinyurl.com/WordEvolutionStudy

Using word representations based on their immediate contexts (i.e., their co-occurring words), our tool supports investigating the evolution of words at four levels: word analysis, contrastive word pair analysis, multi-word analysis, and temporal context analysis.

 

Screenshot of the tool showing filtering and analysis options and graphs
Figure 1 Word analysis level visualization for an example target word "protocol" from 1700 to 2000 using Google Books n-gram data.

In the first type of analysis, word analysis, the user enters a target word and the system evaluates how much its context changes across time. Figure 1 shows a snapshot of the word analysis results for the target word “protocol”. Both the frequency and the inter-decade self-similarity of the word are visualized as time series plots. Users can click on any decade to see the list of the top 50 context words co-occurring with the target word, together with their counts in that decade. They can then contrast these words with those of other decades: clicking on any other decade opens a new list in which differing words are automatically colored. Figure 2 shows such a side-by-side comparison of the context of “protocol” over three different decades.
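The post does not state which similarity measure the tool uses. As a minimal sketch, assuming each decade's context is stored as a dictionary of word counts and that self-similarity is the cosine similarity between consecutive decades, the time series could be computed along these lines (the function names and the toy counts are hypothetical):

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty context has no defined direction
    return dot / (norm_u * norm_v)


def self_similarity_series(vectors_by_decade):
    """Map each decade to its similarity with the preceding decade.

    Assumption: consecutive decades are compared; the real tool may
    compare against a fixed reference decade instead.
    """
    decades = sorted(vectors_by_decade)
    return {
        later: cosine(vectors_by_decade[earlier], vectors_by_decade[later])
        for earlier, later in zip(decades, decades[1:])
    }


# Hypothetical toy counts, not taken from Google Books n-gram data.
toy_contexts = {
    1840: {"treaty": 8, "signed": 5, "conference": 3},
    1950: {"treaty": 4, "signed": 2, "experimental": 6},
    2000: {"internet": 9, "network": 7, "experimental": 2},
}
series = self_similarity_series(toy_contexts)
```

A sharp drop in this series marks a decade in which the word's typical context shifted, which is the kind of signal the self-similarity plot is meant to surface.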

 

Three-column table showing a comparison of word frequencies in the 2000s, 1950s and 1840s
Figure 2: Comparison of the top words co-occurring with the target word "protocol" in three different decades (2000s, 1950s, 1840s) using Google Books n-gram data. Colored words indicate new terms.
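The comparison in Figure 2 can be approximated in a few lines. This is a sketch only: it assumes per-decade context counts are available as dictionaries, and it treats a word as "new" when it is in the top list of one decade but not in the top list of the decade it is compared with. The helper names and the toy counts are hypothetical.

```python
from collections import Counter


def top_context_words(counts, k=50):
    """Return the k most frequent context words with their counts."""
    return Counter(counts).most_common(k)


def new_terms(counts, reference_counts, k=50):
    """Top-k words of one decade that are absent from the reference top-k."""
    current = [word for word, _ in top_context_words(counts, k)]
    reference = {word for word, _ in top_context_words(reference_counts, k)}
    return [word for word in current if word not in reference]


# Hypothetical toy counts for two decades.
counts_1950 = {"treaty": 4, "signed": 2, "experimental": 6}
counts_2000 = {"internet": 9, "network": 7, "experimental": 2}

highlighted = new_terms(counts_2000, counts_1950, k=3)
```

With these toy counts, `highlighted` contains "internet" and "network", the words an interface like the one described would color in the 2000s column.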

Contrastive word pair analysis and multi-word analysis are further mechanisms for quantifying change in word meaning; they work by comparing the evolution of two or more words. Finally, temporal context analysis summarizes the common words that appear together with the target word over time (i.e., the top co-occurring words). We use word clouds and associated frequency plots to represent the top co-occurring words of a target word and their prevalence over time (see Figure 3 for an example of the word “mouse” computed over the period from 1900 to 2000).

 

Word cloud of the co-occurring words of "mouse" with small graphs showing their development over time
Figure 3: Example of temporal context analysis. A temporal word cloud of words co-occurring with the target word "mouse", computed over the period from 1900 to 2000 using Google Books n-gram data.
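The data behind such a temporal word cloud can be sketched as follows, assuming that the top co-occurring words are chosen by their total count over the whole period and that each word then gets one count per decade for its small frequency plot. Both choices are assumptions for illustration, as are the function name and the toy counts.

```python
from collections import Counter


def temporal_context(counts_by_decade, k=10):
    """Pick the k most frequent context words overall and return,
    for each, its count in every decade (in chronological order)."""
    total = Counter()
    for counts in counts_by_decade.values():
        total.update(counts)  # adds the counts of this decade
    decades = sorted(counts_by_decade)
    return {
        word: [counts_by_decade[decade].get(word, 0) for decade in decades]
        for word, _ in total.most_common(k)
    }


# Hypothetical toy counts for the target word "mouse".
mouse_contexts = {
    1900: {"cat": 5, "trap": 3},
    1950: {"cat": 4, "trap": 2, "laboratory": 3},
    2000: {"cat": 2, "click": 6, "keyboard": 4},
}
cloud = temporal_context(mouse_contexts, k=3)
```

The overall totals would set the size of each word in the cloud, while the per-decade lists show when a context word such as "click" entered the picture.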

Altogether, these four analysis modes provide users with a visual explanation of semantic change for any target word. Combined, they should make it possible to tell the story of a word's evolution through visual analytics. We use two common datasets as underlying data: COHA [4] and the Google Books n-gram dataset, which span 1810 to 2010 and 1600 to 2010, respectively.

To represent words, we adopt a simple and common approach from Natural Language Processing: distributional semantics, according to which a word’s meaning is captured by its co-occurring words. For a given target word w in a decade d, we collect all n-grams that contain w from books published in d. We then sum the counts of all the context words. The representation of w in d is a vector whose size is the number of unique words found in the dataset. The weights in this vector are the normalized counts of the context words co-occurring with w in d.
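The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system's actual code: the record format `(tokens, decade, count)` is hypothetical, normalization is assumed to mean dividing by the total context count, and the vector is kept sparse (words that never co-occur with w are simply omitted rather than stored as zeros).

```python
from collections import Counter


def context_vector(ngram_records, target, decade):
    """Build a normalized context vector for `target` in `decade`.

    ngram_records: iterable of (tokens, decade, count) tuples
    (a hypothetical format chosen for this sketch).
    """
    counts = Counter()
    for tokens, record_decade, count in ngram_records:
        if record_decade != decade or target not in tokens:
            continue
        for token in tokens:
            if token != target:
                counts[token] += count  # sum counts of context words
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumed normalization: weights sum to 1 within the decade.
    return {word: c / total for word, c in counts.items()}


# Hypothetical toy records, not real n-gram data.
records = [
    (("the", "protocol", "was", "signed"), 1840, 10),
    (("protocol", "of", "the", "conference"), 1840, 5),
    (("internet", "protocol", "suite"), 2000, 7),
]
vector_1840 = context_vector(records, "protocol", 1840)
```

Normalizing within each decade keeps vectors comparable even though far more text is available for recent decades than for early ones.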

Note that while neural network-based word embeddings have recently been used for diachronic sense detection [5], we use a simpler and more intuitive word representation, leaving the addition of other solutions for later.

Overall, this blog post has introduced a simple yet flexible interactive system for studying the evolution of arbitrary words, which can be used online and requires no programming. Readers are encouraged to try the system at http://tinyurl.com/WordEvolutionStudy with words of their own choice.

 

References:

  1. A. Jatowt, N. Tahmasebi and L. Borin: Computational Approaches to Lexical Semantic Change: Visualization Systems and Novel Applications. In: Tahmasebi et al. (Eds.): Computational Approaches to Semantic Change, Language Variation Series, Language Science Press, pp. 311-340 (2021) https://langsci-press.org/catalog/view/303/3036/2383-1
  2. A. Jatowt, R. Campos, S. Bhowmick, N. Tahmasebi, A. Doucet: Every Word has its History: Interactive Exploration and Visualization of Word Sense Evolution. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM), ACM, pp. 1899-1902 (2018) https://dl.acm.org/doi/10.1145/3269206.3269218
  3. A. Jatowt, K. Duh: A Framework for Analyzing Semantic Change of Words across Time. In Proceedings of IEEE/ACM Joint Conference on Digital Libraries, JCDL 2014, IEEE, pp. 229-238 (2014) https://ieeexplore.ieee.org/document/6970173?arnumber=6970173
  4. M. Davies: The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English. Literary and Linguistic Computing 25.4, pp. 447-464 (2010) https://academic.oup.com/dsh/article-abstract/25/4/447/997323?redirectedFrom=fulltext
  5. W. L. Hamilton, J. Leskovec, D. Jurafsky: Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1489-1501 (2016) https://aclanthology.org/P16-1141/

 


Portrait of Adam Jatowt
Credit: Grazyna Jatowt

Written by Adam Jatowt in April 2022

Professor at DiSC & Department of Computer Science

University of Innsbruck


 

About the author

My research is related to knowledge extraction and information retrieval from unstructured text collections such as news article collections, financial/legal documents or social media posts. In particular, I am interested in novel approaches towards retrieving useful knowledge from long-term news archives and other types of temporal document collections, as well as in applications of text mining techniques for digital history.

 

Research area

Natural Language Processing and Information Retrieval Methods

 
