
Generative Pre-trained Transformer 3

 

Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that generates text using pre-trained algorithms. It was created by OpenAI (a research company co-founded by Elon Musk) and has been described as one of the most important and useful advances in AI in years.

Last summer, the writer, speaker, and musician K Allado-McDowell initiated a conversation with GPT-3, which became the collection of poetry and prose Pharmako-AI. Taking this collection as her point of departure, Warburg PhD student Beatrice Bottomley reflects on what GPT-3 means for how we think about writing and meaning.

 

GPT-3 is just over nine months old now. Since the release of its beta version by the California-based company OpenAI in June 2020, the language model has been an object of fascination for technophiles and, to a certain extent, laypersons. GPT-3 is an autoregressive language model trained on a large text corpus from the internet. It uses deep learning to produce text in response to prompts. You can direct GPT-3 to perform a task by providing it with examples or through a simple instruction. If you open the Twitter account of Greg Brockman, the chairman of OpenAI, you can find examples of GPT-3 being used to write copy, generate code, translate Navajo and compose libretti.

Most articles about GPT-3 use words like “eerie” or “chilling” to describe the language model’s ability to produce text like a human. Some go further and endow GPT-3 with a more-than-human or god-like quality. During the first summer of the coronavirus pandemic, K Allado-McDowell initiated a conversation with GPT-3, which would become the collection of poetry and prose Pharmako-AI. Allado-McDowell found not only an interlocutor, but also a co-writer, in the language model. When writing of GPT-3, Allado-McDowell gives it divine attributes, comparing the language model to a language deity:

“The Greek god Hermes (counterpart to the Roman Mercury) was the god of translators and interpreters. A deity that rules communication is an incorporeal linguistic power. A modern conception of such might read: a force of language from outside of materiality. Automated writing systems like neural net language models relate to geometry, translation, abstract mathematics, interpretation and speech. It’s easy to imagine many applications of these technologies for trade, music, divination etc. So the correspondence is clear. Intuition suggests that we can think the relation between language models and language deities in a way that expands our understanding of both.”

What if we follow Allado-McDowell’s suggestion to consider the relationship between GPT-3 and the language deity Hermes? I must admit that I would hesitate before comparing GPT-3 to a deity. However, if I had to compare the language model to a god, it would be a Greek one; like the Greek gods, GPT-3 is not immune to human-like vagary and bias. Researchers working with OpenAI found that GPT-3 retains the biases of the data that it has been trained on, which can lead it to generate prejudiced content. In that same paper, Brown et al. (2020) also noted that “large pre-trained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world.” Both the gods and GPT-3 could be considered, to a certain extent, dependent on the human world, yet neither interacts with it to the same degree as humans.

Lead votive images of Hermes from the reservoir of the aqueduct at ‘Ain al-Djoudj near Baalbek (Heliopolis), Lebanon (100-400 CE), Warburg Iconographic Database.

Let us return to Hermes. As told by Kerenyi (1951) in The Gods of the Greeks, a baby Hermes, after rustling fifty cows, roasts them on a fire. The smell of the meat torments the little god, but he does not eat; as gods “to whom sacrifices are made, do not really consume the flesh of the victim”. Removed from sensual experience of a world that provides context for much human writing, GPT-3 can produce both surreal imagery and factual inaccuracies. In Pharmako-AI, GPT-3, whilst discussing the construction of a new science, which reflects on “the lessons that living things teach us about themselves”, underlines that “This isn’t a new idea, and I’m not the only one who thinks that way. Just a few weeks ago, a group of scientists at Oxford, including the legendary Nobel Prize winning chemist John Polanyi, published a paper that argued for a ‘Global Apollo Program’ that ‘would commit the world to launch a coordinated research effort to better understand the drivers of climate change…”. Non sequitur aside, a couple of Google searches reveal that the Global Apollo Programme was launched in 2015, not 2020, and, as far as I could find, John Polanyi was not involved.

Such inaccuracies do not only suggest that GPT-3 operates at a different degree of reality, but also relate to the question of how we produce and understand meaning in writing. From Aristotle’s De Interpretatione, the Greeks developed a tripartite theory of meaning, consisting of sounds, thoughts and things (phōnai, noēmata and pragmata). The Medieval Arabic tradition developed its own theory of meaning based on the relationship between vocal form (lafẓ) and mental content (maʿnā). Mental content acts as the intermediary between vocal form and things. In each act of language (whether spoken or written), the relationship between mental content and vocal form is expressed. Avicenna (d.1037) in Pointers and Reminders underlined that this relationship is dynamic. He claimed that vocal form indicated mental content through congruence, implication and concomitance and further suggested that the patterns of vocal form may affect the patterns of mental content. Naṣīr al-Dīn al-Ṭūsī (d.1274) brought together this idea with the Aristotelian tripartite division of existence to distinguish between existence in the mind, in entity, in writing and in speech.

When producing text, GPT-3 does not negotiate between linguistic form and mental content in the same way as humans. GPT-3 is an autoregressive language model, which offers predictions of future text based on its analysis of the corpus. Here the Hermes analogy unwinds. Unlike Hermes, who invented the lyre and “sandals such as no one else could devise” (Kerenyi, 1951), GPT-3 can only offer permutations based on a large, though inevitably limited and normative, corpus created by humans. Brown et al. (2020) note that “its [GPT-3’s] decisions are not easily interpretable.” Perhaps this is unsurprising, as GPT-3 negotiates between patterns in linguistic form, rather than between the linguistic, mental and material. Indeed, GPT-3’s reality is centred on the existence of things in writing rather than in the mind or entity, and thus it blends what might be referred to as fact and fiction.

Hermes as messenger in an advert for Interflora (1910-1935), Warburg Iconographic Database.

By seeking a co-writer in GPT-3, Allado-McDowell takes for granted that what the language model is doing is writing. However, taking into account the understandings of language and meaning developed in both the Greek and Islamic traditions, one might ask: does GPT-3 write, or merely produce text? What is the difference? Is what GPT-3 does an act of language?

To a certain extent, these questions are irrelevant. GPT-3 remains just a (complex) tool for creating text that is anchored in human datasets and instruction. It has not yet ushered in the paradigm shift whispered of by reviewers and examples of its use are often more novel than practical (though perhaps this isn’t a bad thing for many workers). However, were GPT-3, or similar language models, to become more present in our lives, I would want to have a clearer grasp of what it meant for writing. As Yuk Hui (2020) points out in his article Writing and Cosmotechnics, “to write is not simply to deliver communicative meaning but also to ponder about the relation between the human and the cosmos.” In acknowledging GPT-3 as an author, would we not only need to make room for different theories of meaning, but also different ways of thinking about how humans relate to the universe?

Beatrice Bottomley is a doctoral student at the Warburg Institute, University of London, supported by a studentship from the London Arts and Humanities Partnership (LAHP). Her research examines the relationship between language and existence in Ibn ʿArabi’s al-Futūḥāt al-Makkiyya, “The Meccan Openings”. Beatrice’s wider research interests include philosophies of language, translation studies and histories of technology. Beatrice also works as a translator from Arabic and French to English.

Beatrice was introduced to the work of K Allado-McDowell after hearing them speak last December at an event celebrating the launch of two new books, Aby Warburg: Bilderatlas Mnemosyne: The Original and The Atlas of Anomalous AI. Watch the event recording here.



 

MATH REFRESHER FOR DATA SCIENTISTS

Statistical Moments in Data Science interviews

Essential math for Data Scientists explained from scratch


Moments are a set of statistical parameters used to describe a distribution. They are simple to calculate, so they are often used as a first quantitative insight into the data. A good understanding of the data should always come before training any advanced ML model. It minimizes the time required to choose a methodology and interpret the results.

In physics, moments describe how mass is located or arranged around a point. In mathematics, moments play a similar role for a probability distribution: the function that describes how probable the different possible outcomes of an experiment are. To be able to compare different data sets, we can describe them using the first four statistical moments:
1. The expected value
2. Variance
3. Skewness
4. Kurtosis

Let’s go through the details together!

The article is organized into two parts:
I. Math Refresher
II. Questions from data science interviews related to the topic

I. Math Refresher

1. The expected value

The first moment, the expected value (also known as the expectation, mathematical expectation, mean, or average), is the sum of all the values the variable can take, each multiplied by the probability of that value occurring. For equally likely outcomes, it can be intuitively understood as the arithmetic mean:

E[X] = (x_1 + x_2 + … + x_n) / n

This is true when all outcomes have the same probability of occurrence (e.g. the throw of a classical die: all numbers from 1 to 6 have the same chance of being thrown). The more general equation, weighting each value by the probability of that event, is:

E[X] = p_1·x_1 + p_2·x_2 + … + p_n·x_n = Σ p_i·x_i

For rolling a single die, where each value has a probability of occurrence of 1/6, the expected value would be:

E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 3.5

Or, using the arithmetic mean directly:

E[X] = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
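The same computation in a few lines of Python, as a quick sanity check (a minimal sketch; the fair-die probability is hard-coded):

```python
# Expected value of a fair six-sided die: each outcome weighted by its probability.
outcomes = [1, 2, 3, 4, 5, 6]
probability = 1 / 6  # all outcomes equally likely

expected_value = sum(x * probability for x in outcomes)
print(expected_value)  # 3.5
```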

For equally probable events, the expected value is the same as the arithmetic mean. The mean is one of the most popular measures of central tendency, often called averages. The other common measures are:

  • median — the middle value
  • mode — the most likely value.

For example, taking the set of seven values: 2, 4, 4, 5, 8, 12, 14, we have:

  • Mean: (2 + 4 + 4 + 5 + 8 + 12 + 14) / 7 = 49 / 7 = 7
  • Median: this is “the middle” value, sitting exactly in the middle of a data set. For our example, this is 5, as it separates the greater and lesser halves of the data: three values are lower than 5 and three are higher. For a data set with an even number of values (e.g. adding 15 to our data set), we take the two values in the middle and calculate their mean: (5 + 8) / 2 = 6.5
  • Mode: the most frequent value in a set of data. For our example above, the mode is 4, since it appears twice.
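All three measures can be verified with Python’s built-in statistics module; a minimal sketch for the example set above:

```python
import statistics

data = [2, 4, 4, 5, 8, 12, 14]

print(statistics.mean(data))    # 7
print(statistics.median(data))  # 5
print(statistics.mode(data))    # 4

# With an even number of values, the median is the mean of the two middle ones.
print(statistics.median(data + [15]))  # (5 + 8) / 2 = 6.5
```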

2. Variance

The second central moment is variance. Variance describes how a set of values is spread around its expected value. For n equally likely values, the (population) variance is:

Var(X) = σ² = (1/n) · Σ (x_i − μ)²

Where μ is the average value; the variance is thus computed relative to the expected value.

For the exemplar data series above, the variance is:

Var(X) = [(2 − 7)² + (4 − 7)² + (4 − 7)² + (5 − 7)² + (8 − 7)² + (12 − 7)² + (14 − 7)²] / 7 = 122 / 7 ≈ 17.43

Where n is 7, since we have 7 elements in our data set, and μ is 7, as calculated above.

When the spread of values around the same mean is lower, the variance is also lower. For example, the set 5, 6, 6, 7, 8, 8, 9 also has a mean of 7, but its variance is only 12 / 7 ≈ 1.71.

Standard deviation

Standard deviation is the square root of the variance and is commonly used because its unit is the same as that of X:

σ = √Var(X)

Variance and standard deviation inform us how strongly the data are spread around the mean. The greater the variance and standard deviation, the wider the spread of values around the mean; the lower the variance, the closer the values cluster around the mean and the higher the peak of the distribution.
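In Python, the population formulas used above correspond to statistics.pvariance and statistics.pstdev. A minimal sketch for the example set; note that the sample versions divide by n − 1 instead of n, and library defaults differ (e.g. numpy.var divides by n, while pandas’ Series.var divides by n − 1):

```python
import statistics

data = [2, 4, 4, 5, 8, 12, 14]

# Population variance and standard deviation: divide by n, as in the formulas above.
print(statistics.pvariance(data))  # 122 / 7 ≈ 17.43
print(statistics.pstdev(data))     # ≈ 4.17

# The sample variance divides by n - 1 instead.
print(statistics.variance(data))   # 122 / 6 ≈ 20.33
```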

3. Skewness

Skewness, the third statistical moment, measures the asymmetry of data about its mean. A common (population) formula for calculating skewness is:

Skew(X) = E[(X − μ)³] / σ³ = (1/n) · Σ (x_i − μ)³ / σ³

We can distinguish three types of distribution with respect to its skewness:

  • symmetrical distribution: as in the examples above. Both tails are symmetrical and the skewness is equal to zero.
  • positive skew (right-skewed, right-tailed, skewed to the right): the right tail (with larger values) is longer. This informs us about ‘outliers’ with values higher than the mean.
  • negative skew (left-skewed, left-tailed, skewed to the left): the left tail (with smaller values) is longer. This informs us about ‘outliers’ with values lower than the mean.

In general, skewness will impact the relationship of mean, median, and mode in the following way:

  • for symmetrical distribution: mean = median = mode
  • for positively skewed distribution: mode < median < mean
  • for negatively skewed distribution: mean < median < mode

But this is not true for all possible distributions. For example, if one tail is long but the other is heavy, this rule may not hold. The best way to investigate your data is to calculate all three estimators and draw conclusions based on the results, rather than on general rules.
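In practice, skewness is rarely computed by hand; here is a minimal sketch using scipy.stats.skew on the example set (scipy’s default is the population formula; pass bias=False for the sample-adjusted version):

```python
from scipy.stats import skew

data = [2, 4, 4, 5, 8, 12, 14]

# Positive skew: the right tail (12, 14) is longer,
# and indeed mode (4) < median (5) < mean (7).
print(skew(data))  # ≈ 0.55
```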

4. Kurtosis

The fourth statistical moment is kurtosis. It focuses on the tails of the distribution and explains whether the distribution is flat or has a high peak. Kurtosis informs us whether our distribution is richer in extreme values than the normal distribution.

There is no strict consensus on the formula used to calculate kurtosis, and there are three main formulas used by different programs/packages. A good habit is to check which one your software uses before you draw conclusions about your data. The most common moment-based definition is:

Kurt(X) = E[(X − μ)⁴] / σ⁴

Formulas containing a correction term of minus 3 give the excess kurtosis; that is, the excess kurtosis is equal to the kurtosis minus 3.

In general, we can distinguish three types of distributions:

  • Mesokurtic: a kurtosis of 3, or an excess kurtosis of 0. This group includes the normal distribution and some specific binomial distributions.
  • Leptokurtic: a kurtosis greater than 3, or an excess kurtosis greater than 0. These are distributions with fatter tails and a narrower peak.
  • Platykurtic: a kurtosis smaller than 3, or a negative excess kurtosis. These are distributions with thinner tails than the normal distribution.
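As a sanity check on these definitions, here is a small sketch with scipy.stats.kurtosis, which returns excess kurtosis by default (fisher=True) and plain kurtosis with fisher=False, applied to samples from a normal (mesokurtic) and a Laplace (leptokurtic) distribution:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=100_000)    # mesokurtic reference
laplace_sample = rng.laplace(size=100_000)  # fatter tails

print(kurtosis(normal_sample))                # ≈ 0 (excess kurtosis)
print(kurtosis(normal_sample, fisher=False))  # ≈ 3 (plain kurtosis)
print(kurtosis(laplace_sample))               # ≈ 3 (the Laplace excess kurtosis is 3)
```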


We went through the first four statistical moments. It is now time to check ourselves against some interview questions.

II. Questions from Data Science interviews

1. What is the kurtosis of normal distribution?

This is a tricky question! As mentioned in the Math Refresher, there is no strict consensus on the formula used to calculate kurtosis, and three formulas are commonly encountered. The most significant thing, especially for large samples (where the choice of equation matters less), is to understand whether your formula involves a correction term of −3. If so, the formula calculates excess kurtosis. This means that the normal distribution may be described as having a kurtosis of 3 or an excess kurtosis of 0. But be careful, since excess kurtosis is also sometimes shortened to simply kurtosis.

Some languages allow you to choose the formula used in your calculations (e.g. R) or to define which definition of kurtosis you want to use (e.g. Python). Knowing what you calculate allows you to compare the results with the normal distribution and draw the right conclusions.

2. When would you consider using median instead of mean?

A sample mean is a well-understood and common estimator of an unknown population mean. However, it tends to be easily affected by outliers, especially when the sample size is small. So, if the data set is small, skewed, or contains outliers, it is worth checking the median as well.
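A tiny illustration of that sensitivity, using made-up salary figures (in thousands):

```python
import statistics

salaries = [30, 32, 35, 38, 40]     # hypothetical small sample
print(statistics.mean(salaries))    # 35
print(statistics.median(salaries))  # 35

salaries.append(500)                # one extreme outlier
print(statistics.mean(salaries))    # 112.5, pulled far upward
print(statistics.median(salaries))  # 36.5, barely moves
```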

3. You want to invest your money and have two distributions of returns available: with positive and with negative skew. Which one would you choose and why?

There is no good or bad answer here, as long as you can give a rationale for your choice. It depends on your risk appetite.

Personally, with the mean and variance held constant, I would choose the distribution whose long tail sits on the loss side, i.e. the negatively skewed one. In general, having a greater chance of getting a high return costs a higher probability of having a large loss. So, in the choice between:
1. an 85% chance to win at least $1000 and a 1% chance to lose $99000 or more, or
2. a 1% chance to win at least $99000 and an 85% chance to lose $1000 or more,
I would go for the first option: a smaller but more probable win over the hope for a big win in the lottery. But the choice depends on you!

4. In your opinion, how informative is the average salary in a given country?

I believe it should always be reported together with the median. This way, we can learn much more about the salary distribution in a society. For example, if there is a small group of people with very high salaries while the rest earn very little, it will be visible when comparing the median and the mean. From these two estimators, we can understand whether decent pay can be treated as the norm or rather as an outlier. Of course, salaries should also be compared with the cost of living in a given country to get a better picture of the quality of life.

Thanks for reading!

We went together through the first four statistical moments: the expected value, variance, skewness, and kurtosis. I hope it was an exciting journey for you.

Remember that the most efficient way to learn (math) skills is by practice. So don’t wait until you feel ‘ready’; just grab a pen and paper, or your favourite software, and try a few examples on your own. I keep my fingers crossed for you.


I will be happy to hear your thoughts and questions in the comments section below, reach me directly via my LinkedIn profile or at akujawska@yahoo.com. See you soon!

Agnieszka Kujawska, PhD