I. Chasing ergodic crabs

Yes! Ergodicity is interesting and very useful. At least in statistics. Don’t ask me what physicists do with the thing though. Those guys are crazy. The problem with ergodicity is that it’s pretty complicated. Go ahead, take a look at the Wikipedia page, I’ll wait a minute.

…

…

Ok, good. The math geniuses are gone now. You know the type – they can look at some math they’ve never seen before and understand it in minutes. Right now they’re busy proving theorems about ergodic flows on n-dimensional locally non-Hausdorff manifolds. And twitching at that last sentence. You and me? We’re going to learn about ergodicity with a pretty crabby extended metaphor.

Picture it with me: you’re on vacation in a small but beautiful coastal town. As you’re walking along the beach, you notice a crab scuttling about in the surf. Fascinated by crustaceans, you maneuver for a closer look. As you carefully close the distance between you and your new ten-footed friend, he finally acknowledges your presence by pointing his beady little eyes right at you. Undeterred by the tiny eyes of a mere arthropod, you bend down for a closer look. Right at that moment, the crab disappears. His fresh scuttle prints are right there in the sand, but he’s gone. Nowhere to be found.

For a moment you reflect on the fallibility of your own senses and the inherent meaninglessness of life itself until a sharp clicking sound coming from somewhere behind you snaps you out of your philosophy-induced stupor. When you turn away from the surf, you see another crab playing in the sand. Once you move closer you realize that it’s not just any crab, it’s the crab that disappeared! It must have teleported there somehow. You take another step towards the crab and it promptly disappears again, then reappears to your right and clicks its claws at you derisively. Is the crab taunting you? This is no ordinary crab.

Determined to learn more about such an amazing creature, you extend your vacation in order to study its habits. Every morning when you walk out onto to beach, you find Stuart – you named him Stuart – hanging out on the beach. He zips around the beach all day, never spending more than a few minutes in one spot. After a few weeks of observation, you notice that in the morning you tend to find Stuart on the higher peaked sand dunes more often than in the slack between the dunes or in the surf. Stuart really seems to like larger quantities of sand! In fact, when you walk out in the morning the probability that you find him in any given region is precisely the volume of sand in that region as a percentage of the total volume of sand on the beach. But it’s more than that, of course. This crab teleports! In fact, Stuart doesn’t scuttle to move around at all. Sure, he’ll poke around in the sand and shift around a bit in his spot, but to move any significant distance, Stuart teleports. He’s like Nightcrawler. Except a crab.

Here’s the weird thing though. When Stuart teleports, the location he teleports to is completely determined by the spot where he teleported from. For example, when Stuart is on the peak of a particularly tall dune on the east side of the beach, he always teleports to a particular spot next to a large rock near the surf. While Stuart only teleports to a single spot from any given spot, there are often multiple spots that he teleports *from *to get *to* the same spot. The more sand piled up at a given spot, the more spots on the beach that teleport to that spot. In fact, Stuart’s teleportations are *sand preserving*. That is, if you pick a region – i.e. collection of spots – on the beach and measure the total volume of sand in that region, the collection of spots that Stuart teleports from that lands him in the region you picked has the same volume! So while Stuart starts at random places every morning, he always follows the same path of spots when he teleports, and when he teleports into a region, it has the same amount of sand as all of the sand on all of the spots that lead into the region.

After studying the behavior of this amazing specimen for some time, you notice another feature of Stuart’s teleportation scheme – *invariant regions*. There are some regions that, once Stuart is there, he never leaves. Specifically, if Stuart is sitting on a spot in an invariant region, he always teleports to another spot in that region; and if he’s sitting outside of an invariant region, he never teleports into the region. You discovered Stuart’s invariant regions one day when you managed to finally catch him and set him on a large nearby rock. Inevitably, he teleported away – but this time to the top of another rock. This was weird because you’d never seen him teleport to a rock before. So you observed for a little while longer, and he kept teleporting to the top of rocks and not to the sand. The next morning though, back to the sand. But Stuart’s invariant regions are special. They either contain no sand at all or all of the sand on the beach. For example, the region containing all of the sandy spots but none of the spots on rocks is an invariant region. So is any region containing all of the sandy spots and some rocky spots. Of course, any region containing only rocky spots is also an invariant region. So the probability of finding Stuart in any given invariant region in the morning was either 0 or 1, depending on the region. If it was a region without sand, you never found Stuart there in the morning. If it was a region containing all of the sand, you always found Stuart in the region since he always started off somewhere in the sand.

When you finally end your vacation and show your notes to an arthropod expert in the biology department at the university back home, she’s unimpressed – you’ve merely discovered an *ergodic crab*. An ergodic crab can get as close as he wants to any sandy spot on the beach from any other sandy spot on the beach so long long as it can teleport enough times. Apparently there’s another species of crab, a* nonergodic crab* which has a slightly different teleportation pattern. When you find a nonergodic crab in the morning, sometimes you find him in a given invariant region, then he never leaves that region for the rest of the day. For example, some days you’ll find a nonergodic crab sitting on wet sand and he’ll only ever teleport to wet sad that day. On other days, you’ll find that same crab sitting on dry sand and he’ll only ever teleport to dry sand that day. So in order to get a feel for where a nonergodic crab tends to hang out, you need to observe him on multiple days. But for an ergodic crab, one day is enough to see how well he likes the various spots on the beach.

There’s one more feature of sand preserving crabs that you learn from Ginny, your new biologist friend. She tells you that the crabs will click the same number of times each time they arrive in a particular spot. Sure enough, when Stuart teleported to the top of the tallest sand dune on the beach, he always clicked five times. When he teleported to a spot near a large rock just above the surf, Stuart always clicked two times. Between their teleportation paths and clicking patterns, sand preserving crabs are paragons of consistency.

II. Stuart as a mathematical object

The story above translates pretty well into the actual math of ergodicity. The beach is a probability (or measure) space and the distribution of sand on the beach is the distribution of probability (or measure) on the space. A sand preserving crab is a probability (or measure) preserving transformation. Strictly speaking, the crab’s teleportation pattern is the transformation – the transformation sends you to a new spot in the probability space completely determined by where you’re currently at in the space. The transformation is probability preserving in the same sense that Stuart is sand preserving – and an invariant subset of the probability space is just an invariant region on the beach. Note that a set is invariant *with respect to the transformation.* If you change the transformation, you potentially change which sets are invariant. A transformation is ergodic, then, if all of its invariant sets are either probability 1 or probability 0.

Now suppose you apply the transformation over and over again. This gives you a sequence of points in the probability space – the same idea as a sand preserving crab’s sequence of spots. Suppose we associate a number with each point in the space, like the number of times Stuart clicks when he arrives at that spot. Then the sequence of numbers is a sequence of random variables. They are random because the first spot on the beach or first spot in the probability space is always chosen randomly. It’s the transformation that is deterministic. A sequence of random variables constructed this way turns out to be a a *stationary* sequence. The properties of a stationary sequence don’t change as you move forward and backward in the sequence. Want to know the probability that the 100th variable in the sequence is greater than zero? It’s the same as the probability that the first variable is greater than zero. What’s the probability that a sand preserving crab clicks 3 times in his fifth spot? The same as the probability that he clicks 3 times in his first spot. This is to say that sand preserving crabs have stationary clicks. What’s more, all stationary random variables can be represented using a probability preserving transformation – all crabs with stationary clicks are sand preserving.

Stationary sequences of random variables are nice because learning about what happened in one part of the sequence tells you a lot about what happens in another part of the sequence, even if you don’t know much about the precise way in which they’re related. If a stationary sequence is nonergodic though, you’re still fundamentally limited in what you can learn about the properties of the sequence. Think about a nonergodic crab teleporting around the beach. If you only ever get to observe the crab on one day, you only ever see it on either wet sand or dry sand. If the crab’s click rate depends on whether it’s on dry or wet sand and you only ever see it on wet sand, how can you learn about it’s click rate on dry sand? You can’t – at least not unless you know the relationship between the crab’s behavior on wet sand and its behavior on dry sand.

Let’s view probability from a frequentist perspective for a moment – i.e. let’s allow ourselves to view a real world sequence of variables as random. We only ever get to see one realization of many sequences of random variables. We only get to see the inflation rate in 1972 once, for example. The next time we see an inflation rate, it’s a different year. So even if inflation follows a stationary process, if we believe it’s nonergodic we’re fundamentally limited in what we can learn about that process. When it comes to ergodic processes, though, a single realization of the sequence will tell you everything you want to know about about the process, as long as you watch long enough. If you watch an ergodic crab long enough, he’ll make his way all over the beach. For a nonergodic crab, you need to keep coming back day after day until you’ve had at least one day with him on wet sand and one day with him on dry sand. But in the real world, you’re often limited to only one day.

If the sequence – or crab – is nonstationary, you’re even worse off. If a nonergodic sequence is stationary, at least today’s value still tells you something about tomorrow’s value. If the sequence is also nonstationary, you need to have some additional information about the process in order for today’s value to give you any information about tomorrow’s value. And often we don’t.

From a Bayesian perspective, probability is in the mind, not reality, so none of this makes much sense. Probability is uncertainty about reality, not reality itself. But I don’t think there’s a Bayesian free lunch here. Another way to think about ergodicity is that in an ergodic sequence of random variables, the structure and amount of dependence is limited to a large degree. For example, if today’s inflation rate depends on the entire history of inflation rates, then 1) older inflation rates are much less important than more recent rates to predicting today’s rate and 2) how today’s inflation rate depends on the older inflation rates is limited to a small set of relatively simple possibilities. From a Bayesian perspective, a nonergodic sequence of random variables is a sequence where all we know about the relationship between today’s value and all previous values is that it’s either not simple in the sense of the previous sentence or older values are indeed very important for predicting the next value of the sequence. In short, variables in a nonergodic sequence have complicated relationships. But because of the nature of this complication, we it’s hard to learn about the relationship between the variables in the sequence – at least not without more information. We can model the relationship in one way or another, but without a reason to favor any given model, we can’t take the conclusions we draw using any given model seriously.

Whether you’re a Bayesian or a frequentist, ergodicity and stationarity are often used in pieces of a larger model that is nonergodic or nonstationary. Often this helps develop plausible models because it’s easier to argue that the various components of the model are stationary – it’s essentially a strategy for arguing for one form of nonergodicity or another. How successful this strategy is depends on the particular application, but it can often be a helpful if not perfect exercise.

The way I’ve been using the term “ergodic” so far isn’t always how it’s used though. One property of an ergodic sequence of random variables is that if you take the average of the sequence over time, this average will converge to the true average. We sometimes say that the variable has an ergodic mean when this happens. Often, the whole business of measure preserving transformations is swept under the rug and a random variable is defined as ergodic if it has an ergodic mean. From a frequentist perspective, this makes a ton of sense. Suppose each random variable in the sequence has the same true mean – the “space mean” since it’s the mean of the variable, averaged over the whole probability space. Since we can only ever observe one realization of the sequence, the hope is that the “time mean,” or the mean of a given realization of the sequence, will estimate the space mean. When the sequence has an ergodic mean, this is true. In practice frequentist statistics is all about ensuring that a sequence of estimates converge to the true value. In addition, there are some nonergodic and even nonstationary sequences of random variables which still have ergodic means, which means the frequentist can still estimate that mean. So you can understand why they often care more about ergodic means than ergodic transformations.

The bottom line is that ergodicity makes learning from the data much, much easier no matter your favorite school of probability. In a domain with nonergodic data, it’s that much harder to come up with a model that closely resembles the truth, which means any inferences that you draw are that much more questionable. And there are good reasons to think that lots of data is nonergodic, especially in the social sciences. So good luck economists.

III. Markov crab Monte Carlo

Another area of statistics where ergodicity is incredibly useful is in statistical computation. But in order to understand this, we have to take a quick detour through the practice of Bayesian statistics. We start, of course, with Bayes’ rule. There are a couple competing ways to think about Bayesian statistics, but we’ll think about a particularly common one mostly because it’s easier.

Suppose you have some model for how the data are generated, represented by the probability distribution p(Y_1, Y_2, …, Y_T|X), assuming we have T observations of the variable Y in our dataset. We’ll just use p(Y|X) for short, i.e. Y = (Y_1, Y_2, …, Y_T). Here, X is some parameter which we don’t know. Maybe it’s the mean of the Y’s, for example. Y is the data you’ve already observed. Then p(Y|X) asks you for a value of X, then gives you a probability distribution on this data that depends on the value you input. Since we don’t know X and we’re Bayesian, we represent our uncertainty about X in another probability distribution called the *prior*, p(X). This thing traces out a curve above zero by plugging in different values of X. The probability that X is in a given region is the area under that curve but within that region. The distribution p(Y|X) is similar, except the probability that Y is in a given region depends on the value of X you give to it – it’s a conditional distribution, so each value of X determines a different curve for Y.

What we want to learn about in this example is X – after seeing the data, we want to know what our uncertainty about X looks like. This is another conditional distribution, called the *posterior* distribution, p(X|Y). If you input a particular value for the data you’ve observed, p(X|Y) spits out a new probability distribution on X reflecting what you observed. Bayes’ rule tells us that p(X|Y) = p(Y|X) * p(X) / p(Y). In most situations, we can write down p(Y|X) and p(X), so multiplying them together is no problem. p(Y) is a little different – it represents our uncertainty about the data alone (i.e. without considering the relationship between Y and anything else), in the form of a probability distribution. This thing is maddeningly difficult to calculate in all but the most simple models. So we don’t try to compute it at all, at least not by sitting down with a pen and paper and trying to derive it analytically.

An alternative to actually computing p(Y) analytically is called Monte Carlo simulation. The idea here is that if we can’t analytically derive the posterior distribution, maybe we know enough to draw random samples from it using a computer with a random number generator. Given a large enough random sample from the posterior, there’s no difference from actually being able to analytically derive it – we can approximate the posterior probability that X is in any region and that approximation gets better as our random sample gets larger.

In practice, it’s pretty hard to effectively use Monte Carlo simulation in order to sample from a posterior distribution. The main problem is that it would take forever to get a large enough sample that we can say anything meaningfully accurate about the posterior distribution. But what if instead of directly sampling from the posterior distribution, we only sampled from it approximately? This is exactly what Markov chain Monte Carlo simulation does, though there other ways to construct an approximation.

A Markov chain is a special sequence of random variables where dependence is allowed, but only a specific kind of dependence. Specifically, the next value in the chain is allowed to depend on the current value in arbitrarily complicated ways, but oncee you know the current value none of the older values tell you anything about the next value. In order to predict what’s going to happen tomorrow, you need to know what happened today. But once you do, yesterday is irrelevant.

In a Markov chain Monte Carlo simulation, or MCMC, you construct a Markov chain for X on your computer. You give the chain an initial value, X_0. Then the chain randomly simulates a new value, X_1. Next you feed X_1 back into the chain and it spits out a new simulated value, X_2. And so on. Back on the beach, we can think about a Markov crab, named Andrey of course. Imagine you get to hear Andrey’s claws clicking every time he teleports, but you don’t get to see where he teleports. Maybe you’re too busy checking out your neighbor on the beach to actually look at the crab. But you do hear Andrey every time he clicks his claws. Since Andrey is a Markov crab, the probability that he clicks 3 times the next time he teleports depends *only *on how many times he clicked the last time he teleported and, once you know that fact, none of his previous clicks are relevant.

In order for MCMC to work though, we can’t have any old Markov ~~crab~~ chain. As we go farther into the sequence of X_t ‘s given by the Markov chain, we need the probability distribution of X_t to get closer and closer to the probability distribution of X, i.e. the posterior distribution we wanted to sample from in the first place. In other words, we need p(X_t) to get closer and closer to p(X) as t goes towards infinity. Then with a long enough sample from the Markov chain, we can again approximate any posterior probability, with a better approximation as the sample gets larger.

One of the conditions that will help guarantee that the approximation works is that the Markov chain is ergodic. Think about the beach again. Suppose Andrey is whizzing about the beach with his magical teleportation power. If Andrey is a nonergodic crab then once he starts his chain, he’ll only ever be on the type of sand he started on – dry or wet. If our posterior distribution of clicks depends in some way on the type of sand the crab is clicking from, then Andrey’s distribution of clicks may never approach the posterior. If Andrey’s an ergodic crab though, there’s hope. No matter where Andrey starts, he’ll explore the entire beach if we just wait long enough. So over time the sample distribution of Andrey’s clicks will become a better and better approximation of the posterior distribution we care about.

Back to our computer simulations, we need to make sure that the Markov chain for X never gets stuck in some region of X’s space. If the chain only ever spits out values of X that are above zero, we’re in trouble. If the chain is ergodic though, this can’t happen. Ergodicity alone isn’t enough to guarantee that a Markov chain will approximate our posterior distribution – it guarantees that the probability distribution of a Markov chain converges to *something*, but not necessarily that it converges to our posterior. This isn’t typically a major difficulty. Usually, the big problem in MCMC comes from trying to find a Markov chain which converges to the posterior distribution timely manner. This inches us pretty close to my research area, but it also takes us pretty far away from ergodicity. So I think I’ll just stop here before I coerce anymore crabs into another metaphor.