In probability and statistics, a Bernoulli process (named after Jacob Bernoulli) is a finite or infinite sequence of binary random variables, so it is a discretetime stochastic process that takes only two values, canonically 0 and 1. The component Bernoulli variables X_{i} are identically distributed and independent. Prosaically, a Bernoulli process is a repeated coin flipping, possibly with an unfair coin (but with consistent unfairness). Every variable X_{i} in the sequence is associated with a Bernoulli trial or experiment. They all have the same Bernoulli distribution. Much of what can be said about the Bernoulli process can also be generalized to more than two outcomes (such as the process for a sixsided dice); this generalization is known as the Bernoulli scheme.
The problem of determining the process, given only a limited sample of the Bernoulli trials, may be called the problem of checking whether a coin is fair.
YouTube Encyclopedic

1/5Views:57 9602 617175 13110 289360

✪ 13. Bernoulli Process

✪ L21.2 The Bernoulli Process

✪ Introduction to the Bernoulli Distribution

✪ Bernoulli Process Practice

✪ L41  Bernoulli Process
Transcription
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: So by now you have seen pretty much every possible trick there is in basic probability theory, about how to calculate distributions, and so on. You have the basic tools to do pretty much anything. So what's coming after this? Well, probability is useful for developing the science of inference, and this is a subject to which we're going to come back at the end of the semester. Another chapter, which is what we will be doing over the next few weeks, is to deal with phenomena that evolve in time. So socalled random processes or stochastic processes. So what is this about? So in the real world, you don't just throw two random variables and go home. Rather the world goes on. So you generate the random variable, then you get more random variables, and things evolve in time. And random processes are supposed to be models that capture the evolution of random phenomena over time. So that's what we will be doing. Now when we have evolution in time, mathematically speaking, you can use discrete time or continuous time. Of course, discrete time is easier. And that's where we're going to start from. And we're going to start from the easiest, simplest random process, which is the socalled Bernoulli process, which is nothing but just a sequence of coin flips. You keep flipping a coin and keep going forever. That's what the Bernoulli process is. So in some sense it's something that you have already seen. But we're going to introduce a few additional ideas here that will be useful and relevant as we go along and we move on to continuous time processes. So we're going to define the Bernoulli process, talk about some basic properties that the process has, and derive a few formulas, and exploit the special structure that it has to do a few quite interesting things. By the way, where does the word Bernoulli come from? Well the Bernoulli's were a family of mathematicians, Swiss mathematicians and scientists around the 1700s. There were so many of them that actually and some of them had the same first name historians even have difficulty of figuring out who exactly did what. But in any case, you can imagine that at the dinner table they were probably flipping coins and doing Bernoulli trials. So maybe that was their passtime. OK. So what is the Bernoulli process? The Bernoulli process is nothing but a sequence of independent Bernoulli trials that you can think of as coin flips. So you can think the result of each trial being heads or tails. It's a little more convenient maybe to talk about successes and failures instead of heads or tails. Or if you wish numerical values, to use a 1 for a success and 0 for a failure. So the model is that each one of these trials has the same probability of success, p. And the other assumption is that these trials are statistically independent of each other. So what could be some examples of Bernoulli trials? You buy a lottery ticket every week and you win or lose. Presumably, these are independent of each other. And if it's the same kind of lottery, the probability of winning should be the same during every week. Maybe you want to model the financial markets. And a crude model could be that on any given day the Dow Jones is going to go up or down with a certain probability. Well that probability must be somewhere around 0.5, or so. This is a crude model of financial markets. You say, probably there is more into them. Life is not that simple. But actually it's a pretty reasonable model. It takes quite a bit of work to come up with more sophisticated models that can do better predictions than just pure heads and tails. Now more interesting, perhaps to the examples we will be dealing with in this class a Bernoulli process is a good model for streams of arrivals of any kind to a facility. So it could be a bank, and you are sitting at the door of the bank. And at every second, you check whether a customer came in during that second or not. Or you can think about arrivals of jobs to a server. Or any other kind of requests to a service system. So requests, or jobs, arrive at random times. You split the time into time slots. And during each time slot something comes or something does not come. And for many applications, it's a reasonable assumption to make that arrivals on any given slot are independent of arrivals in any other time slot. So each time slot can be viewed as a trial, where either something comes or doesn't come. And different trials are independent of each other. Now there's two assumptions that we're making here. One is the independence assumption. The other is that this number, p, probability of success, is constant. Now if you think about the bank example, if you stand outside the bank at 9:30 in the morning, you'll see arrivals happening at a certain rate. If you stand outside the bank at 12:00 noon, probably arrivals are more frequent. Which means that the given time slot has a higher probability of seeing an arrival around noon time. This means that the assumption of a constant p is probably not correct in that setting, if you're talking about the whole day. So the probability of successes or arrivals in the morning is going to be smaller than what it would be at noon. But if you're talking about a time period, let's say 10:00 to 10:15, probably all slots have the same probability of seeing an arrival and it's a good approximation. So we're going to stick with the assumption that p is constant, doesn't change with time. Now that we have our model what do we do with it? Well, we start talking about the statistical properties that it has. And here there's two slightly different perspectives of thinking about what a random process is. The simplest version is to think about the random process as being just a sequence of random variables. We know what random variables are. We know what multiple random variables are. So it's just an experiment that has associated with it a bunch of random variables. So once you have random variables, what do you do instinctively? You talk about the distribution of these random variables. We already specified for the Bernoulli process that each Xi is a Bernoulli random variable, with probability of success equal to p. That specifies the distribution of the random variable X, or Xt, for general time t. Then you can calculate expected values and variances, and so on. So the expected value is, with probability p, you get a 1. And with probability 1  p, you get a 0. So the expected value is equal to p. And then we have seen before a formula for the variance of the Bernoulli random variable, which is p times 1p. So this way we basically now have all the statistical properties of the random variable Xt, and we have those properties for every t. Is this enough of a probabilistic description of a random process? Well, no. You need to know how the different random variables relate to each other. If you're talking about a general random process, you would like to know things. For example, the joint distribution of X2, with X5, and X7. For example, that might be something that you're interested in. And the way you specify it is by giving the joint PMF of these random variables. And you have to do that for every collection, or any subset, of the random variables you are interested in. So to have a complete description of a random processes, you need to specify for me all the possible joint distributions. And once you have all the possible joint distributions, then you can answer, in principle, any questions you might be interested in. How did we get around this issue for the Bernoulli process? I didn't give you the joint distributions explicitly. But I gave them to you implicitly. And this is because I told you that the different random variables are independent of each other. So at least for the Bernoulli process, where we make the independence assumption, we know that this is going to be the product of the PMFs. And since I have told you what the individual PMFs are, this means that you automatically know all the joint PMFs. And we can go to business based on that. All right. So this is one view of what a random process is, just a collection of random variables. There's another view that's a little more abstract, which is the following. The entire process is to be thought of as one long experiment. So we go back to the chapter one view of probabilistic models. So there must be a sample space involved. What is the sample space? If I do my infinite, long experiment of flipping an infinite number of coins, a typical outcome of the experiment would be a sequence of 0's and 1's. So this could be one possible outcome of the experiment, just an infinite sequence of 0's and 1's. My sample space is the set of all possible outcomes of this kind. Here's another possible outcome, and so on. And essentially we're dealing with a sample space, which is the space of all sequences of 0's and 1's. And we're making some sort of probabilistic assumption about what may happen in that experiment. So one particular sequence that we may be interested in is the sequence of obtaining all 1's. So this is the sequence that gives you 1's forever. Once you take the point of view that this is our sample space its the space of all infinite sequences you can start asking questions that have to do with infinite sequences. Such as the question, what's the probability of obtaining the infinite sequence that consists of all 1's? So what is this probability? Let's see how we could calculate it. So the probability of obtaining all 1's is certainly less than or equal to the probability of obtaining 1's, just in the first 10 tosses. OK. This is asking for more things to happen than this. If this event is true, then this is also true. Therefore the probability of this is smaller than the probability of that. This event is contained in that event. This implies this. So we have this inequality. Now what's the probability of obtaining 1's in 10 trials? This is just p to the 10th because the trials are independent. Now of course there's no reason why I chose 10 here. The same argument goes through if I use an arbitrary number, k. And this has to be true for all k. So this probability is less than p to the k, no matter what k I choose. Therefore, this must be less than or equal to the limit of this, as k goes to infinity. This is smaller than that for all k's. Let k go to infinity, take k arbitrarily large, this number is going to become arbitrarily small. It goes to 0. And that proves that the probability of an infinite sequence of 1's is equal to 0. So take limits of both sides. It's going to be less than or equal to the limit I shouldn't take a limit here. The probability is less than or equal to the limit of p to the k, as k goes to infinity, which is 0. So this proves in a formal way that the sequence of all 1's has 0 probability. If you have an infinite number of coin flips, what's the probability that all of the coin flips result in heads? The probability of this happening is equal to zero. So this particular sequence has 0 probability. Of course, I'm assuming here that p is less than 1, strictly less than 1. Now the interesting thing is that if you look at any other infinite sequence, and you try to calculate the probability of that infinite sequence, you would get a product of (1p) times 1, 1p times 1, 1p, times p times p, times 1p and so on. You keep multiplying numbers that are less than 1. Again, I'm making the assumption that p is between 0 and 1. So 1p is less than 1, p is less than 1. You keep multiplying numbers less than 1. If you multiply infinitely many such numbers, the infinite product becomes 0. So any individual sequence in this sample space actually has 0 probability. And that is a little bit counterintuitive perhaps. But the situation is more like the situation where we deal with continuous random variables. So if you could draw a continuous random variable, every possible outcome has 0 probability. And that's fine. But all of the outcomes collectively still have positive probability. So the situation here is very much similar. So the space of infinite sequences of 0's and 1's, that sample space is very much like a continuous space. If you want to push that analogy further, you could think of this as the expansion of a real number. Or the representation of a real number in binary. Take a real number, write it down in binary, you are going to get an infinite sequence of 0's and 1's. So you can think of each possible outcome here essentially as a real number. So the experiment of doing an infinite number of coin flips is sort of similar to the experiment of picking a real number at random. When you pick real numbers at random, any particular real number has 0 probability. So similarly here, any particular infinite sequence has 0 probability. So if we were to push that analogy further, there would be a few interesting things we could do. But we will not push it further. This is just to give you an indication that things can get pretty subtle and interesting once you start talking about random processes that involve forever, over the infinite time horizon. So things get interesting even in this context of the simple Bernoulli process. Just to give you a preview of what's coming further, today we're going to talk just about the Bernoulli process. And you should make sure before the next lecture I guess between the exam and the next lecture to understand everything we do today. Because next time we're going to do everything once more, but in continuous time. And in continuous time, things become more subtle and a little more difficult. But we are going to build on what we understand for the discrete time case. Now both the Bernoulli process and its continuous time analog has a property that we call memorylessness, whatever happened in the past does not affect the future. Later on in this class we're going to talk about more general random processes, socalled Markov chains, in which there are certain dependences across time. That is, what has happened in the past will have some bearing on what may happen in the future. So it's like having coin flips where the outcome of the next coin flip has some dependence on the previous coin flip. And that gives us a richer class of models. And once we get there, essentially we will have covered all possible models. So for random processes that are practically useful and which you can manipulate, Markov chains are a pretty general class of models. And almost any real world phenomenon that evolves in time can be approximately modeled using Markov chains. So even though this is a first class in probability, we will get pretty far in that direction. All right. So now let's start doing a few calculations and answer some questions about the Bernoulli process. So again, the best way to think in terms of models that correspond to the Bernoulli process is in terms of arrivals of jobs to a facility. And there's two types of questions that you can ask. In a given amount of time, how many jobs arrived? Or conversely, for a given number of jobs, how much time did it take for them to arrive? So we're going to deal with these two questions, starting with the first. For a given amount of time that is, for a given number of time periods how many arrivals have we had? How many of those Xi's happen to be 1's? We fix the number of time slots let's say n time slots and you measure the number of successes. Well this is a very familiar random variable. The number of successes in n independent coin flips or in n independent trials is a binomial random variable. So we know its distribution is given by the binomial PMF, and it's just this, for k going from 0 up to n. And we know everything by now about this random variable. We know its expected value is n times p. And we know the variance, which is n times p, times 1p. So there's nothing new here. That's the easy part. So now let's look at the opposite kind of question. Instead of fixing the time and asking how many arrivals, now let us fix the number of arrivals and ask how much time did it take. And let's start with the time until the first arrival. So the process starts. We got our slots. And we see, perhaps, a sequence of 0's and then at some point we get a 1. The number of trials it took until we get a 1, we're going to call it T1. And it's the time of the first arrival. OK. What is the probability distribution of T1? What kind of random variable is it? We've gone through this before. The event that the first arrival happens at time little t is the event that the first t1 trials were failures, and the trial number t happens to be a success. So for the first success to happen at time slot number 5, it means that the first 4 slots had failures and the 5th slot had a success. So the probability of this happening is the probability of having failures in the first t 1 trials, and having a success at trial number 1. And this is the formula for t equal 1,2, and so on. So we know what this distribution is. It's the socalled geometric distribution. Let me jump this through this for a minute. In the past, we did calculate the expected value of the geometric distribution, and it's 1/p. Which means that if p is small, you expect to take a long time until the first success. And then there's a formula also for the variance of T1, which we never formally derived in class, but it was in your textbook and it just happens to be this. All right. So nothing new until this point. Now, let's talk about this property, the memorylessness property. We kind of touched on this property when we discussed when we did the derivation in class of the expected value of T1. Now what is the memoryless property? It's essentially a consequence of independence. If I tell you the results of my coin flips up to a certain time, this, because of independence, doesn't give you any information about the coin flips after that time. So knowing that we had lots of 0's here does not change what I believe about the future coin flips, because the future coin flips are going to be just independent coin flips with a given probability, p, for obtaining tails. So this is a statement that I made about a specific time. That is, you do coin flips until 12 o'clock. And then at 12 o'clock, you start watching. No matter what happens before 12 o'clock, after 12:00, what you're going to see is just a sequence of independent Bernoulli trials with the same probability, p. Whatever happened in the past is irrelevant. Now instead of talking about the fixed time at which you start watching, let's think about a situation where your sister sits in the next room, flips the coins until she observes the first success, and then calls you inside. And you start watching after this time. What are you're going to see? Well, you're going to see a coin flip with probability p of success. You're going to see another trial that has probability p as a success, and these are all independent of each other. So what you're going to see starting at that time is going to be just a sequence of independent Bernoulli trials, as if the process was starting at this time. How long it took for the first success to occur doesn't have any bearing on what is going to happen afterwards. What happens afterwards is still a sequence of independent coin flips. And this story is actually even more general. So your sister watches the coin flips and at some point tells you, oh, something really interesting is happening here. I got this string of a hundred 1's in a row. Come and watch. Now when you go in there and you start watching, do you expect to see something unusual? There were unusual things that happened before you were called in. Does this means that you're going to see unusual things afterwards? No. Afterwards, what you're going to see is, again, just a sequence of independent coin flips. The fact that some strange things happened before doesn't have any bearing as to what is going to happen in the future. So if the roulettes in the casino are properly made, the fact that there were 3 reds in a row doesn't affect the odds of whether in the next roll it's going to be a red or a black. So whatever happens in the past no matter how unusual it is at the time when you're called in, what's going to happen in the future is going to be just independent Bernoulli trials, with the same probability, p. The only case where this story changes is if your sister has a little bit of foresight. So your sister can look ahead into the future and knows that the next 10 coin flips will be heads, and calls you before those 10 flips will happen. If she calls you in, then what are you going to see? You're not going to see independent Bernoulli trials, since she has psychic powers and she knows that the next ones would be 1's. She called you in and you will see a sequence of 1's. So it's no more independent Bernoulli trials. So what's the subtle difference here? The future is independent from the past, provided that the time that you are called and asked to start watching is determined by someone who doesn't have any foresight, who cannot see the future. If you are called in, just on the basis of what has happened so far, then you don't have any information about the future. And one special case is the picture here. You have your coin flips. Once you see a one that happens, once you see a success, you are called in. You are called in on the basis of what happened in the past, but without any foresight. OK. And this subtle distinction is what's going to make our next example interesting and subtle. So here's the question. You buy a lottery ticket every day, so we have a Bernoulli process that's running in time. And you're interested in the length of the first string of losing days. What does that mean? So suppose that a typical sequence of events could be this one. So what are we discussing here? We're looking at the first string of losing days, where losing days means 0's. So the string of losing days is this string here. Let's call the length of that string, L. We're interested in the random variable, which is the length of this interval. What kind of random variable is it? OK. Here's one possible way you might think about the problem. OK. Starting from this time, and looking until this time here, what are we looking at? We're looking at the time, starting from here, until the first success. So the past doesn't matter. Starting from here we have coin flips until the first success. The time until the first success in a Bernoulli process we just discussed that it's a geometric random variable. So your first conjecture would be that this random variable here, which is 1 longer than the one we are interested in, that perhaps is a geometric random variable. And if this were so, then you could say that the random variable, L, is a geometric, minus 1. Can that be the correct answer? A geometric random variable, what values does it take? It takes values 1, 2, 3, and so on. 1 minus a geometric would take values from 0, 1, 2, and so on. Can the random variable L be 0? No. The random variable L is the length of a string of losing days. So the shortest that L could be, would be just 1. If you get just one losing day and then you start winning, L would be equal to 1. So L cannot be 0 by definition, which means that L + 1 cannot be 1, by definition. But if L +1 were geometric, it could be equal to 1. Therefore this random variable, L + 1, is not a geometric. OK. Why is it not geometric? I started watching at this time. From this time until the first success, that should be a geometric random variable. Where's the catch? If I'm asked to start watching at this time, it's because my sister knows that the next one was a failure. This is the time where the string of failures starts. In order to know that they should start watching here, it's the same as if I'm told that the next one is a failure. So to be asked to start watching at this time requires that someone looked in the future. And in that case, it's no longer true that these will be independent Bernoulli trials. In fact, they're not. If you start watching here, you're certain that the next one is a failure. The next one is not an independent Bernoulli trial. That's why the argument that would claim that this L + 1 is geometric would be incorrect. So if this is not the correct answer, which is the correct answer? The correct answer goes as follows. Your sister is watching. Your sister sees the first failure, and then tells you, OK, the failures or losing days have started. Come in and watch. So you start to watching at this time. And you start watching until the first success comes. This will be a geometric random variable. So from here to here, this will be geometric. So things happen. You are asked to start watching. After you start watching, the future is just a sequence of independent Bernoulli trials. And the time until the first failure occurs, this is going to be a geometric random variable with parameter p. And then you notice that the interval of interest is exactly the same as the length of this interval. This starts one time step later, and ends one time step later. So conclusion is that L is actually geometric, with parameter p. OK, it looks like I'm missing one slide. Can I cheat a little from here? OK. So now that we dealt with the time until the first arrival, we can start talking about the time until the second arrival, and so on. How do we define these? After the first arrival happens, we're going to have a sequence of time slots with no arrivals, and then the next arrival is going to happen. So we call this time that elapses or number of time slots after the first arrival until the next one we call it T2. This is the second interarrival time, that is, time between arrivals. Once this arrival has happened, then we wait and see how many more it takes until the third arrival. And we call this time here, T3. We're interested in the time of the kth arrival, which is going to be just the sum of the first k interarrival times. So for example, let's say Y3 is the time that the third arrival comes. Y3 is just the sum of T1, plus T2, plus T3. So we're interested in this random variable, Y3, and it's the sum of interarrival times. To understand what kind of random variable it is, I guess we should understand what kind of random variables these are going to be. So what kind of random variable is T2? Your sister is doing her coin flips until a success is observed for the first time. Based on that information about what has happened so far, you are called into the room. And you start watching until a success is observed again. So after you start watching, what you have is just a sequence of independent Bernoulli trials. So each one of these has probability p of being a success. The time it's going to take until the first success, this number, T2, is going to be again just another geometric random variable. It's as if the process just started. After you are called into the room, you have no foresight, you don't have any information about the future, other than the fact that these are going to be independent Bernoulli trials. So T2 itself is going to be geometric with the same parameter p. And then you can continue the arguments and argue that T3 is also geometric with the same parameter p. Furthermore, whatever happened, how long it took until you were called in, it doesn't change the statistics about what's going to happen in the future. So whatever happens in the future is independent from the past. So T1, T2, and T3 are independent random variables. So conclusion is that the time until the third arrival is the sum of 3 independent geometric random variables, with the same parameter. And this is true more generally. The time until the kth arrival is going to be the sum of k independent random variables. So in general, Yk is going to be T1 plus Tk, where the Ti's are geometric, with the same parameter p, and independent. So now what's more natural than trying to find the distribution of the random variable Yk? How can we find it? So I fixed k for you. Let's say k is 100. I'm interested in how long it takes until 100 customers arrive. How can we find the distribution of Yk? Well one way of doing it is to use this lovely convolution formula. Take a geometric, convolve it with another geometric, you get something. Take that something that you got, convolve it with a geometric once more, do this 99 times, and this gives you the distribution of Yk. So that's definitely doable, and it's extremely tedious. Let's try to find the distribution of Yk using a shortcut. So the probability that Yk is equal to t. So we're trying to find the PMF of Yk. k has been fixed for us. And we want to calculate this probability for the various values of t, because this is going to give us the PMF of Yk. OK. What is this event? What does it take for the kth arrival to be at time t? For that to happen, we need two things. In the first t 1 slots, how many arrivals should we have gotten? k  1. And then in the last slot, we get one more arrival, and that's the kth one. So this is the probability that we have k  1 arrivals in the time interval from 1 up to t. And then, an arrival at time t. That's the only way that it can happen, that the kth arrival happens at time t. We need to have an arrival at time t. And before that time, we need to have exactly k  1 arrivals. Now this is an event that refers t1. In the previous time slots we had exactly k 1 arrivals. And then at the last time slot we get one more arrival. Now the interesting thing is that this event here has to do with what happened from time 1 up to time t 1. This event has to do with what happened at time t. Different time slots are independent of each other. So this event and that event are independent. So this means that we can multiply their probabilities. So take the probability of this. What is that? Well probability of having a certain number of arrivals in a certain number of time slots, these are just the binomial probabilities. So this is, out of t  1 slots, to get exactly k  1 arrivals, p to the k1, (1p) to the t1  (k1), this gives us tk. And then we multiply with this probability, the probability of an arrival, at time t is equal to p. And so this is the formula for the PMF of the number of the time it takes until the kth arrival happens. Does it agree with the formula in your handout? Or its not there? It's not there. OK. Yeah. OK. So that's the formula and it is true for what values of t? [INAUDIBLE]. It takes at least k time slots in order to get k arrivals, so this formula should be true for k larger than or equal to t. For t larger than or equal to k. All right. So this gives us the PMF of the random variable Yk. Of course, we may also be interested in the mean and variance of Yk. But this is a lot easier. Since Yk is the sum of independent random variables, the expected value of Yk is going to be just k times the expected value of your typical t. So the expected value of Yk is going to be just k times 1/p, which is the mean of the geometric. And similarly for the variance, it's going to be k times the variance of a geometric. So we have everything there is to know about the distribution of how long it takes until the first arrival comes. OK. Finally, let's do a few more things about the Bernoulli process. It's interesting to talk about several processes at the time. So in the situation here of splitting a Bernoulli process is where you have arrivals that come to a server. And that's a picture of which slots get arrivals. But actually maybe you have two servers. And whenever an arrival comes to the system, you flip a coin and with some probability, q, you send it to one server. And with probability 1q, you send it to another server. So there is a single arrival stream, but two possible servers. And whenever there's an arrival, you either send it here or you send it there. And each time you decide where you send it by flipping an independent coin that has its own bias q. The coin flips that decide where do you send it are assumed to be independent from the arrival process itself. So there's two coin flips that are happening. At each time slot, there's a coin flip that decides whether you have an arrival in this process here, and that coin flip is with parameter p. And if you have something that arrives, you flip another coin with probabilities q, and 1q, that decides whether you send it up there or you send it down there. So what kind of arrival process does this server see? At any given time slot, there's probability p that there's an arrival here. And there's a further probability q that this arrival gets sent up there. So the probability that this server sees an arrival at any given time is p times q. So this process here is going to be a Bernoulli process, but with a different parameter, p times q. And this one down here, with the same argument, is going to be Bernoulli with parameter p times (1q). So by taking a Bernoulli stream of arrivals and splitting it into two, you get two separate Bernoulli processes. This is going to be a Bernoulli process, that's going to be a Bernoulli process. Well actually, I'm running a little too fast. What does it take to verify that it's a Bernoulli process? At each time slot, it's a 0 or 1. And it's going to be a 1, you're going to see an arrival with probability p times q. What else do we need to verify, to be able to tell to say that it's a Bernoulli process? We need to make sure that whatever happens in this process, in different time slots, are statistically independent from each other. Is that property true? For example, what happens in this time slot whether you got an arrival or not, is it independent from what happened at that time slot? The answer is yes for the following reason. What happens in this time slot has to do with the coin flip associated with the original process at this time, and the coin flip that decides where to send things. What happens at that time slot has to do with the coin flip here, and the additional coin flip that decides where to send it if something came. Now all these coin flips are independent of each other. The coin flips that determine whether we have an arrival here is independent from the coin flips that determined whether we had an arrival there. And you can generalize this argument and conclude that, indeed, every time slot here is independent from any other time slot. And this does make it a Bernoulli process. And the reason is that, in the original process, every time slot is independent from every other time slot. And the additional assumption that the coin flips that we're using to decide where to send things, these are also independent of each other. So we're using here the basic property that functions of independent things remain independent. There's a converse picture of this. Instead of taking one stream and splitting it into two streams, you can do the opposite. You could start from two streams of arrivals. Let's say you have arrivals of men and you have arrivals of women, but you don't care about gender. And the only thing you record is whether, in a given time slot, you had an arrival or not. Notice that here we may have an arrival of a man and the arrival of a woman. We just record it with a 1, by saying there was an arrival. So in the merged process, we're not keeping track of how many arrivals we had total. We just record whether there was an arrival or not an arrival. So an arrival gets recorded here if, and only if, one or both of these streams had an arrival. So that we call a merging of two Bernoull of two processes, of two arrival processes. So let's make the assumption that this arrival process is independent from that arrival process. So what happens at the typical slot here? I'm going to see an arrival, unless none of these had an arrival. So the probability of an arrival in a typical time slot is going to be 1 minus the probability of no arrival. And the event of no arrival corresponds to the first process having no arrival, and the second process having no arrival. So there's no arrival in the merged process if, and only if, there's no arrival in the first process and no arrival in the second process. We're assuming that the two processes are independent and that's why we can multiply probabilities here. And then you can take this formula and it simplifies to p + q, minus p times q. So each time slot of the merged process has a certain probability of seeing an arrival. Is the merged process a Bernoulli process? Yes, it is after you verify the additional property that different slots are independent of each other. Why are they independent? What happens in this slot has to do with that slot, and that slot down here. These two slots so what happens here, has to do with what happens here and there. What happens in this slot has to do with whatever happened here and there. Now, whatever happens here and there is independent from whatever happens here and there. Therefore, what happens here is independent from what happens there. So the independence property is preserved. The different slots of this merged process are independent of each other. So the merged process is itself a Bernoulli process. So please digest these two pictures of merging and splitting, because we're going to revisit them in continuous time where things are little subtler than that. OK. Good luck on the exam and see you in a week.
Contents
Definition
A Bernoulli process is a finite or infinite sequence of independent random variables X_{1}, X_{2}, X_{3}, ..., such that
 For each i, the value of X_{i} is either 0 or 1;
 For all values of i, the probability p that X_{i} = 1 is the same.
In other words, a Bernoulli process is a sequence of independent identically distributed Bernoulli trials.
Independence of the trials implies that the process is memoryless. Given that the probability p is known, past outcomes provide no information about future outcomes. (If p is unknown, however, the past informs about the future indirectly, through inferences about p.)
If the process is infinite, then from any point the future trials constitute a Bernoulli process identical to the whole process, the freshstart property.
Interpretation
The two possible values of each X_{i} are often called "success" and "failure". Thus, when expressed as a number 0 or 1, the outcome may be called the number of successes on the ith "trial".
Two other common interpretations of the values are true or false and yes or no. Under any interpretation of the two values, the individual variables X_{i} may be called Bernoulli trials with parameter p.
In many applications time passes between trials, as the index i increases. In effect, the trials X_{1}, X_{2}, ... X_{i}, ... happen at "points in time" 1, 2, ..., i, .... That passage of time and the associated notions of "past" and "future" are not necessary, however. Most generally, any X_{i} and X_{j} in the process are simply two from a set of random variables indexed by {1, 2, ..., n} or by {1, 2, 3, ...}, the finite and infinite cases.
Several random variables and probability distributions beside the Bernoullis may be derived from the Bernoulli process:
 The number of successes in the first n trials, which has a binomial distribution B(n, p)
 The number of failures needed to get r successes, which has a negative binomial distribution NB(r, p)
 The number of failures needed to get one success, which has a geometric distribution NB(1, p), a special case of the negative binomial distribution
The negative binomial variables may be interpreted as random waiting times.
Formal definition
The Bernoulli process can be formalized in the language of probability spaces as a random sequence of independent realisations of a random variable that can take values of heads or tails. The state space for an individual value is denoted by
Specifically, one considers the countably infinite direct product of copies of . It is common to examine either the onesided set or the twosided set . There is a natural topology on this space, called the product topology. The sets in this topology are finite sequences of coin flips, that is, finitelength strings of H and T, with the rest of (infinitely long) sequence taken as "don't care". These sets of finite sequences are referred to as cylinder sets in the product topology. The set of all such strings form a sigma algebra, specifically, a Borel algebra. This algebra is then commonly written as where the elements of are the finitelength sequences of coin flips (the cylinder sets).
If the chances of flipping heads or tails are given by the probabilities , then one can define a natural measure on the product space, given by (or by for the twosided process). Given a cylinder set, that is, a specific sequence of coin flip results at times , the probability of observing this particular sequence is given by
where k is the number of times that H appears in the sequence, and n−k is the number of times that T appears in the sequence. There are several different kinds of notations for the above; a common one is to write
where each is a binaryvalued random variable with in Iverson bracket notation, meaning either if or if . This probability is commonly called the Bernoulli measure.^{[1]}
Note that the probability of any specific, infinitely long sequence of coin flips is exactly zero; this is because , for any . A probability equal to 1 implies that any given infinite sequence has measure zero. Nevertheless, one can still say that some classes of infinite sequences of coin flips are far more likely than others, this is given by the asymptotic equipartition property.
To conclude the formal definition, a Bernoulli process is then given by the probability triple , as defined above.
Law of large numbers, binomial distribution and central limit theorem
Let us assume the canonical process with represented by and represented by . The law of large numbers states that, on the average of the sequence, i.e., , will approach the expected value almost certainly, that is, the events which do not satisfy this limit have zero probability. The expectation value of flipping heads, assumed to be represented by 1, is given by . In fact, one has
for any given random variable out of the infinite sequence of Bernoulli trials that compose the Bernoulli process.
One is often interested in knowing how often one will observe H in a sequence of n coin flips. This is given by simply counting: Given n successive coin flips, that is, given the set of all possible strings of length n, the number N(k,n) of such strings that contain k occurrences of H is given by the binomial coefficient
If the probability of flipping heads is given by p, then the total probability of seeing a string of length n with k heads is
where . The probability measure thus defined is known as the Binomial distribution.
Of particular interest is the question of the value of for a sufficiently long sequences of coin flips, that is, for the limit . In this case, one may make use of Stirling's approximation to the factorial, and write
Inserting this into the expression for P(k,n), one obtains the Normal distribution; this is the content of the central limit theorem, and this is the simplest example thereof.
The combination of the law of large numbers, together with the central limit theorem, leads to an interesting and perhaps surprising result: the asymptotic equipartition property. Put informally, one notes that, yes, over many coin flips, one will observe H exactly p fraction of the time, and that this corresponds exactly with the peak of the Gaussian. The asymptotic equipartition property essentially states that this peak is infinitely sharp, with infinite falloff on either side. That is, given the set of all possible infinitely long strings of H and T occurring in the Bernoulli process, this set is partitioned into two: those strings that occur with probability 1, and those that occur with probability 0. This partitioning is known as the Kolmogorov 01 law.
The size of this set is interesting, also, and can be explicitly determined: the logarithm of it is exactly the entropy of the Bernoulli process. Once again, consider the set of all strings of length n. The size of this set is . Of these, only a certain subset are likely; the size of this set is for . By using Stirling's approximation, putting it into the expression for P(k,n), solving for the location and width of the peak, and finally taking one finds that
This value is the Bernoulli entropy of a Bernoulli process. Here, H stands for entropy; do not confuse it with the same symbol H standing for heads.
John von Neumann posed a curious question about the Bernoulli process: is it ever possible that a given process is isomorphic to another, in the sense of the isomorphism of dynamical systems? The question long defied analysis, but was finally and completely answered with the Ornstein isomorphism theorem. This breakthrough resulted in the understanding that the Bernoulli process is unique and universal; in a certain sense, it is the single most random process possible; nothing is 'more' random than the Bernoulli process (although one must be careful with this informal statement; certainly, systems that are mixing are, in a certain sense, 'stronger' than the Bernoulli process, which is merely ergodic but not mixing. However, such processes do not consist of independent random variables: indeed, many purely deterministic, nonrandom systems can be mixing).
Dynamical system
The Bernoulli process can also be understood to be a dynamical system, specifically, a measurepreserving dynamical system. This arises because there is a natural translation symmetry on the (twosided) product space given by the shift operator
The measure is translationinvariant; that is, given any cylinder set , one has
and thus the Bernoulli measure is a Haar measure.
The shift operator should be understood to be an operator acting on the sigma algebra , so that one has
In this guise, the shift operator is known as the transfer operator or the RuelleFrobeniusPerron operator. It is interesting to consider the eigenfunctions of this operator, and how they differ when restricted to different subspaces of . When restricted to the standard topology of the real numbers, the eigenfunctions are curiously the Bernoulli polynomials!^{[2]}^{[3]} This coincidence of naming was presumably not known to Bernoulli.
Bernoulli sequence
The term Bernoulli sequence is often used informally to refer to a realization of a Bernoulli process. However, the term has an entirely different formal definition as given below.
Suppose a Bernoulli process formally defined as a single random variable (see preceding section). For every infinite sequence x of coin flips, there is a sequence of integers
called the Bernoulli sequence^{[verification needed]} associated with the Bernoulli process. For example, if x represents a sequence of coin flips, then the associated Bernoulli sequence is the list of natural numbers or timepoints for which the coin toss outcome is heads.
So defined, a Bernoulli sequence is also a random subset of the index set, the natural numbers .
Almost all Bernoulli sequences are ergodic sequences.^{[verification needed]}
Randomness extraction
From any Bernoulli process one may derive a Bernoulli process with p = 1/2 by the von Neumann extractor, the earliest randomness extractor, which actually extracts uniform randomness.
Basic von Neumann extractor
Represent the observed process as a sequence of zeroes and ones, or bits, and group that input stream in nonoverlapping pairs of successive bits, such as (11)(00)(10)... . Then for each pair,
 if the bits are equal, discard;
 if the bits are not equal, output the first bit.
This table summarizes the computation.
input  output 

00  discard 
01  0 
10  1 
11  discard 
For example, an input stream of eight bits 10011011 would by grouped into pairs as (10)(01)(10)(11). Then, according to the table above, these pairs are translated into the output of the procedure: (1)(0)(1)() (=101).
In the output stream 0 and 1 are equally likely, as 10 and 01 are equally likely in the original, both having probability p(1−p) = (1−p)p. This extraction of uniform randomness does not require the input trials to be independent, only uncorrelated. More generally, it works for any exchangeable sequence of bits: all sequences that are finite rearrangements are equally likely.
The von Neumann extractor uses two input bits to produce either zero or one output bits, so the output is shorter than the input by a factor of at least 2. On average the computation discards proportion p^{2} + (1 − p)^{2} of the input pairs, which is near one when p is near zero or one, and is minimized at 1/4 when p = 1/2 for the original process (in which case the output stream is 1/4 the length of the input stream on average).
Von Neumann (classical) main operation pseudocode:
if (Bit1 ≠ Bit2) {
output(Bit1)
}
Iterated von Neumann extractor
This decrease in efficiency, or waste of randomness present in the input stream, can be mitigated by iterating the algorithm over the input data. This way the output can be made to be "arbitrarily close to the entropy bound".^{[4]}
The iterated version of the von Neumann algorithm, also known as Advanced MultiLevel Strategy (AMLS)^{[5]}, was introduced by Yuval Peres in 1992.^{[4]} It works recursively, recycling "wasted randomness" from two sources: the sequence of discard/nondiscard, and the values of discarded pairs (0 for 00, and 1 for 11). Intuitively, it relies on the fact that, given the sequence already generated, both of those sources are still exchangeable sequences of bits, and thus eligible for another round of extraction. While such generation of additional sequences can be iterated infinitely to extract all available entropy, an infinite amount of computational resources is required, therefore the number of iterations is typically fixed to a low value – this value either fixed in advance, or calculated at runtime.
More concretely, on an input sequence, the algorithm consumes the input bits in pairs, generating output together with two new sequences:
input  output  new sequence 1  new sequence 2 

00  none  0  0 
01  0  1  none 
10  1  1  none 
11  none  0  1 
(If the length of the input is odd, the last bit is completely discarded.) Then the algorithm is applied recursively to each of the two new sequences, until the input is empty.
Example: The input stream from above, 10011011, is processed this way:
step number  input  output  new sequence 1  new sequence 2 

0  (10)(01)(10)(11)  (1)(0)(1)()  (1)(1)(1)(0)  ()()()(1) 
1  (11)(10)  ()(1)  (0)(1)  (1)() 
1.1  (01)  (0)  (1)  () 
1.1.1  1  none  none  none 
1.2  1  none  none  none 
2  1  none  none  none 
The output is therefore (101)(1)(0)()()() (=10110), so that from the eight bits of input five bits of output were generated, as opposed to three bits through the basic algorithm above. The constant output of exactly 2 bits per round (compared with a variable 0 to 1 bits in classical VN) also allows for constanttime implementations which are resistant to timing attacks.
Von Neumann–Peres (iterated) main operation pseudocode:
if (Bit1 ≠ Bit2) {
output(1, Sequence1)
output(Bit1)
} else {
output(0, Sequence1)
output(Bit1, Sequence2)
}
Another tweak was presented in 2016, based on the observation that the Sequence2 channel doesn't provide much throughput, and a hardware implementation with a finite number of levels can benefit from discarding it earlier in exchange for processing more levels of Sequence1.^{[6]}
References
 ^ Klenke, Achim (2006). Probability Theory. SpringerVerlag. ISBN 9781848000476.
 ^ Pierre Gaspard, "radic onedimensional maps and the Euler summation formula", Journal of Physics A, 25 (letter) L483L485 (1992).
 ^ Dean J. Driebe, Fully Chaotic Maps and Broken Time Symmetry, (1999) Kluwer Academic Publishers, Dordrecht Netherlands ISBN 0792355644
 ^ ^{a} ^{b} Peres, Yuval (March 1992). "Iterating Von Neumann's Procedure for Extracting Random Bits" (PDF). The Annals of Statistics. 20 (1): 590–597. doi:10.1214/aos/1176348543. Retrieved 30 May 2013.
 ^ "Tossing a Biased Coin" (PDF). eecs.harvard.edu. Retrieved 20180728.
 ^ Rožić, Vladimir; Yang, Bohan; Dehaene, Wim; Verbauwhede, Ingrid (3–5 May 2016). Iterating Von Neumann's postprocessing under hardware constraints (PDF). 2016 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). Maclean, VA, USA. doi:10.1109/HST.2016.7495553\.
Further reading
 Carl W. Helstrom, Probability and Stochastic Processes for Engineers, (1984) Macmillan Publishing Company, New York ISBN 0023535601.