Workshop: Build your own Semantle-clone

Introduction to Distributional Semantics and Word Embeddings for Linguists

An innocent letter-guessing game called Wordle went viral at the beginning of 2022. One
clone of this game, called Semantle, is particularly interesting from a linguistic
perspective: the player is tasked with guessing a secret word based on its semantic similarity to
other words. Semantle is premised on the Distributional Hypothesis (Firth, 1957), which states
that the meaning of a lexical item can be approximated by knowing the linguistic contexts in
which it is used. Distributional Semantics concerns itself with building accurate distributional
lexical representations from language corpora. A very intuitive and successful way of representing
words distributionally is to use vector spaces (Clark, 2015; Turney & Pantel, 2010). Today, due
to their attractive mathematical properties, vector-based representations of lexical items (word
embeddings) are an indispensable tool for virtually all NLP tasks, for example question answering
(Karpukhin et al., 2020) and co-reference resolution (Lee et al., 2017). In this workshop, aimed at
linguists, participants will be familiarized with the basic concepts of count-based distributional
models of lexical meaning (largely following the structure of Turney and Pantel, 2010) and
will become acquainted with more recent implementations of word embeddings such as
Word2Vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), as well as the basics
of contextualized embeddings obtained from neural language models such as ELMo (Peters
et al., 2018) and GPT-2 (Radford et al., 2019). Specifically, we will look at and replicate
an intuitive and fun application of distributional semantics, namely Semantle. By understanding
and implementing what Semantle does and what makes it fun, we aim to gain a
solid grasp of the representations that power most NLP technologies in 2022.
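
To give a flavour of what participants will build, the sketch below scores guesses against a secret word by cosine similarity between pretrained word vectors. It is a minimal illustration rather than the workshop's actual code: it assumes the gensim library and its downloader for pretrained GloVe vectors, and the model name, secret word, and guesses are arbitrary example choices.

    # Minimal Semantle-style guess checker (illustrative sketch).
    # Assumes gensim is installed; the model below is a small pretrained
    # GloVe model fetched via gensim's downloader on first use.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    secret = "ocean"  # the hidden target word (example choice)

    def score(guess: str) -> float:
        # Cosine similarity between the guess and the secret word.
        return float(vectors.similarity(secret, guess))

    for guess in ["car", "water", "sea", "ocean"]:
        print(f"{guess:>6}: {score(guess):.3f}")

Guesses that are distributionally closer to the secret word receive higher scores, which is exactly the feedback loop that makes Semantle playable.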