From EM to "EM"beddings
Expectation Maximization is quite an old tool/concept in the Machine Learning domain. Although it is an old tool, it took me quite some time to grasp the concept and the intuition behind it, given that most tutorials and articles out there explain it with heavy mathematical equations. But eventually I found out that the maths behind the intuition is pretty simple to understand; only the long equations can be a bit off-putting.
Here is how I would explain it to someone without much mathematical background but with some idea about machine learning, in clear and concise English:
Given some observations (or data, as you prefer), you have to build a model that can be used to explain the observations you already have and also generalize well to unseen observations whenever you encounter them. Basically this implies that the parameters of your model have to be "in sync" with your observations, or in other words, you tune the parameters of your model in order to maximize the likelihood of the observations.
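As a small illustration of "tuning the parameters to maximize the likelihood of the observations", here is a sketch (in Python, with made-up data, not from the original post) that fits a single Gaussian to some fully observed data; in this simple case the likelihood-maximizing parameters have a closed form, the sample mean and variance:

```python
import numpy as np

# Made-up, fully observed data: nothing is missing here,
# so plain maximum likelihood works without any EM iterations.
observations = np.array([4.9, 5.1, 5.3, 4.7, 5.0, 5.2])

# For a single Gaussian, the likelihood-maximizing parameters
# are simply the sample mean and the sample variance.
mu_hat = observations.mean()
sigma2_hat = observations.var()

print(f"Estimated mean: {mu_hat:.3f}, estimated variance: {sigma2_hat:.3f}")
```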
For observations which have all the information required to compute the model parameters, we can apply some optimization technique to compute them. But sometimes the observations might not have all the information required to compute the parameters. In such a scenario, where there is missing data in the observations, we apply the EM algorithm, which is an iterative procedure.
Note that if we knew the parameters of the model, we could easily compute all the data (including the missing data) for the observations, and conversely, if we had all the data, we could compute the parameters of the model. This is a chicken-and-egg problem, and we can use these two facts to build our algorithm:
Initialization: Assign arbitrary values to the model parameters.
"E" Step: Using the model parameters determined in the last iteration, compute the expected value of the missing data. "Expected" because there could be multiple possible ways of inferring the missing data, each with some probability.
"M" Step: Using the expected value of the missing data determined in the E step above, plus the observed data, compute the most likely model parameters by optimization techniques.
Repeat the E and the M steps until the model parameters converge (and they do); a small sketch of this loop follows.
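To make the recipe concrete, here is a minimal sketch of EM for a classic case of "missing data": a mixture of two Gaussians where we never observe which component generated each point. The component assignments are the missing data; the means, variances, and mixing weights are the parameters. The code below is only an illustration under those assumptions, with made-up synthetic data, not something from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations drawn from two Gaussians; the component
# label of each point is the "missing data" we never get to see.
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

# Initialization: arbitrary starting values for the parameters.
mu = np.array([-1.0, 1.0])
sigma2 = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])  # mixing weights

def gaussian(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(100):
    # "E" step: expected value of the missing data, i.e. the
    # probability that each point belongs to each component.
    resp = np.stack([pi[k] * gaussian(data, mu[k], sigma2[k]) for k in range(2)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)

    # "M" step: re-estimate the parameters using those expectations
    # as soft counts (the most likely parameters given the filled-in data).
    nk = resp.sum(axis=0)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma2 = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(data)

print("Estimated means:", mu)      # should approach the true means 0 and 5
print("Estimated weights:", pi)    # should approach 0.5 and 0.5
```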
As you can see, this technique is very intuitive, and many of us have knowingly or unknowingly applied it elsewhere too (although without any theoretical understanding of why it works).
I wanted to learn the Python programming language (The Model) by learning its different components and their syntax (The Parameters). So I picked up a Python cookbook that teaches through different examples (The Observations).
So I start reading this book and encounter an example that uses Classes and Objects, and the Map and List data structures. If I already knew what Classes and Objects, Maps and Lists are from my previous programming experience, or if the book also had an explanation of each of these components, then it would be a piece of cake to learn how a Map or a List is implemented or a Class is defined in Python.
But generally cookbooks will not introduce basic data structures and algorithms; they are only meant for teaching through examples. So this becomes the missing data in our context. One straightforward way to fill in the missing data is to use a Data Structures and Algorithms book to learn about Maps and Lists, and an OOP book to learn about Classes and Objects. But using external resources to turn the incomplete data into complete data is not a use case for EM, since EM assumes that we can infer the missing data using nothing but the Observations themselves.
Thus, in order to make use of only the Observations, i.e. the examples in the book, I could look at several similar examples and try to figure out what a particular block of code is trying to do. That is to say, if you look at a sufficient number of examples that use Maps and Lists, together with their outputs, you can deduce what a Map or a List does in general (without going through your Data Structures book), even if you still will not know how hash functions work. This becomes my E step in the algorithm. After I infer some sort of understanding of what a Map or a List does, I try to fit that understanding back into the context of the examples (the "M" step).
To understand the complete Python programming language, I would similarly approach the other components and simultaneously deduce knowledge about the missing data. This is probably not a perfect use case for demonstrating the EM process in human learning, since people often have easy access to external resources (like Google) that would make the observed data complete (no missing information). But the takeaway is that even in the absence of resources like Google or Stack Overflow (no internet connection), one could guess how a particular data structure works, even if the ultimate goal is to learn Python.
This idea of mental frameworks that understand concepts naturally, without any external supervision (such as an internet connection), has been used quite effectively in recent Deep Learning research to infer the semantics of natural languages. Artificial Neural Networks (modeled after the human brain) can be made to learn hidden relations from multiple observations. Think of what an Arts student would make out of a bunch of Python code snippets. He/she may not have any knowledge of programming, and maybe not even the slightest interest in learning it. But given many examples and the outputs of running those examples, he/she will eventually learn that a List can be used to store values in memory sequentially, or that a Map can be used to retrieve the value for a key, without understanding how hashing works, and so on.

This idea has been used to generate word embeddings, i.e. representations of individual words in a sentence as multi-dimensional vectors. Somebody (a robot, in particular) who does not know that the words "happy" and "glad" are semantically similar could look at the vectors for these two words and compute their cosine similarity to infer that they are similar. The meaning of "happy" and "glad" is hidden from them, but that does not prevent them from inferring that these words are semantically similar.
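As a toy illustration of that last point, here is a sketch of comparing two word vectors with cosine similarity. The vectors below are made up purely for illustration; in practice they would come from an embedding model such as word2vec or GloVe trained on a large text corpus:

```python
import numpy as np

# Hypothetical embedding vectors; real ones typically have 100-300
# dimensions and are learned from large amounts of text.
vectors = {
    "happy": np.array([0.8, 0.1, 0.6, 0.2]),
    "glad":  np.array([0.7, 0.2, 0.5, 0.1]),
    "table": np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Semantically similar words end up with similar vectors, so their
# cosine similarity is close to 1; unrelated words score much lower.
print(cosine_similarity(vectors["happy"], vectors["glad"]))   # high
print(cosine_similarity(vectors["happy"], vectors["table"]))  # low
```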