From EM to "EM"beddings

Expectation Maximization (EM) is quite an old tool in the Machine Learning domain. Although it is old, it took me quite some time to grasp the concept and the intuition behind it, given that most tutorials and articles out there explain it with heavy mathematical equations. But eventually I found out that the maths behind the intuition is pretty simple to understand; only the long equations might be a bit off-putting.

Let me explain it to someone with not much mathematical background but with some idea about machine learning, in clear and concise English:

Given some observations (or data, as you prefer), you have to build a model that can be used to explain the observations you already have and also generalize well to unseen observations whenever you encounter them. Basically this implies that the parameters of your model have to be "in sync" with your observations; in other words, you tune the parameters of your model in order to maximize the likelihood of the observations.

For observations which have all the information required to compute the model parameters, we can apply some optimization technique to compute them directly. But sometimes the observations might not have all the information required to compute the parameters. In such a scenario, where there is missing data in the observations, we apply the EM algorithm, which is an iterative procedure.
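To make the complete-data case concrete, here is a minimal sketch under a toy setup of my own (a single Gaussian, not from the original text): when nothing is missing, maximizing the likelihood has a closed-form answer and no iteration is needed.

```python
import numpy as np

# Complete data: every observation carries all the information we need,
# so maximum likelihood has a closed-form solution (no EM required).
rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, 1000)  # samples from a Gaussian we pretend not to know

mu_hat = x.mean()   # MLE of the mean is just the sample mean
sd_hat = x.std()    # MLE of the std deviation (the biased, ddof=0 form)
print(mu_hat, sd_hat)  # close to the true 5.0 and 2.0
```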

Note that if we knew the parameters of the model, then we could easily compute all the data (including the missing data) for the observations; and if we had all the data, we could compute the parameters of the model. This is a chicken-and-egg problem, and we can use these two facts to construct our algorithm:

Initialization: Assign arbitrary values to the model parameters.

"E" Step: Using the model parameters determined in the last iteration, compute the expected value of the missing data. "Expected" because there could be multiple possible ways of inferring the missing data, each with some probability.

"M" Step: Using the expected value of the missing data determined in the E step above, plus the observed data, compute the most likely model parameters using optimization techniques.

Repeat the E and the M steps until the model parameters converge (and they do, at least to a local optimum).
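To see the loop in action, here is a minimal sketch of EM on a toy problem of my own choosing: a mixture of two 1D Gaussians, where the missing data is which component generated each observation. Everything below is illustrative, with made-up data, and uses nothing beyond NumPy.

```python
import numpy as np

# Toy observations drawn from two Gaussians. Which component generated
# each point is the "missing data" that EM will infer.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# Initialization: assign arbitrary values to the model parameters.
mu = np.array([-1.0, 1.0])   # component means
sd = np.array([1.0, 1.0])    # component standard deviations
w = np.array([0.5, 0.5])     # mixing weights

def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

for _ in range(200):
    # "E" step: expected value of the missing data, i.e. the probability
    # that each observation came from each component, under the current
    # parameter estimates.
    resp = np.stack([w[k] * gaussian_pdf(x, mu[k], sd[k]) for k in range(2)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)

    # "M" step: re-estimate the parameters to maximize the likelihood,
    # treating the expected (soft) assignments as if they were observed.
    nk = resp.sum(axis=0)
    new_mu = (resp * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((resp * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk)
    w = nk / len(x)
    if np.allclose(mu, new_mu, atol=1e-8):  # parameters converged
        break
    mu = new_mu

print(mu, sd, w)  # the estimated means should be close to -2 and 3
```

Running this, the estimated means settle near the true values even though we never observed which Gaussian produced which point.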

As you can see, this technique is very intuitive, and many of us have knowingly or unknowingly applied it elsewhere too (although without any theoretical understanding of why it works).

I wanted to learn the Python programming language (The Model) by learning its different components and their syntax (The Parameters). So I pick up a Python cookbook that teaches through different examples (The Observations).

So I start reading this book and encounter an example that uses Classes and Objects, and the Map and List data structures. If I already knew what Classes, Objects, Maps and Lists were from my previous programming experience, or if the book also had an explanation of each of these components, then it would be a piece of cake to learn how a Map or a List is implemented or how a Class is defined in Python.

But generally cookbooks do not introduce basic data structures and algorithms; they are only meant for teaching through examples. So this becomes the missing data in our context. One straightforward way to fill in the missing data is to use a Data Structures and Algorithms book to learn about Maps and Lists, and an OOP book to learn about Classes and Objects. But using external resources to turn the incomplete data into complete data is not a use case for EM, because EM assumes that we can infer nothing about the missing data except from the observations themselves.

Thus, in order to make use of only the Observations, i.e. the examples in the book, I could look at several similar examples and try to figure out what a particular block of code is trying to do. That is to say, if I look at a sufficient number of examples that use Maps and Lists, together with their outputs, I can deduce what a Map or a List does in general (without going through a Data Structures book), even though I may not know how hash functions work. This becomes my E step in the algorithm. After I infer some understanding of what a Map or a List does, I try to fit that understanding into the context of the examples (the "M" Step).

To understand the complete Python programming language, I would similarly approach the other components and simultaneously deduce knowledge about the missing data. This is probably not a great use case for demonstrating the EM process in human learning, since people usually have easy access to external resources (like Google) that would make the observed data complete (no missing information). But the takeaway is that even in the absence of resources like Google or Stack Overflow (no internet connection), one could guess how a particular data structure works, even if the ultimate goal is to learn Python.

This idea of mental frameworks for understanding concepts naturally, without any external supervision (such as an internet connection), has been used quite effectively in recent Deep Learning research to infer the semantics of natural languages. Artificial Neural Networks (loosely modeled after the human brain) can be made to learn hidden relations from multiple observations. Think of what an Arts student would make out of a bunch of Python code snippets. He/she may not have any knowledge of programming, and maybe not even the slightest interest in learning it. But given many examples and the outputs of running those examples, he/she will eventually learn that a List can be used to store values in memory sequentially, or that a Map can be used to retrieve the value for a key, without understanding how hashing works, and so on.

This idea has been used to generate word embeddings, i.e. representations of individual words in a sentence as multi-dimensional vectors. Somebody (a robot, in particular) who does not know that the words "happy" and "glad" are semantically similar could look at the vectors for these two words and compute their cosine similarity to infer that they are similar. The meanings of "happy" and "glad" are hidden from the robot, but that does not prevent it from inferring that these words are semantically similar.
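As a toy illustration of that last point, here is a sketch with made-up 4-dimensional vectors; real embeddings (e.g. from word2vec) are learned from large text corpora and typically have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: values near 1.0 mean the
    # vectors point the same way (semantically similar words), values
    # near 0.0 mean they are unrelated.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings, hand-written purely for illustration.
happy = np.array([0.8, 0.1, 0.6, 0.2])
glad  = np.array([0.7, 0.2, 0.5, 0.3])
table = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(happy, glad))   # high (~0.98): similar meaning
print(cosine_similarity(happy, table))  # low (~0.26): unrelated meaning
```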
