![]() |
Partially Observable Markov Decision Processes |
We will now show an example of value iteration proceeding on a problem for a horizon length of 3. This example will provide some of the useful insights, making the connection between the figures and the concepts that are needed to explain the general problem. For this problem, we assume the POMDP has two states, two actions and three observations.
We start with the first horizon. The value function here will represent the best we can hope to do (in terms of value) if we are limited to taking a single action. This is the simplest case; normally (for horizon length h) we need to trade off the immediate rewards and the future rewards. However, when you have a horizon of 1, there is no future and the value function becomes nothing but the immediate rewards.
Since we have two states and two actions, our POMDP model will include four separate immediate reward values: there is one value for each combination of action and state. These values are defined over the discrete state space of the POMDP, but it becomes easy to get the value of doing a particular action in a particular belief state. To do this we simply use the probabilities in the belief state to weight the value of each state.
As an example: let action a1 have a value of 0 in state s1 and 1 in state s2 and let action a2 have a value of 1.5 in state s1 and 0 in state s2. If our belief state is [ 0.75 0.25 ] then the value of doing action a1 in this belief state is 0.75 x 0 + 0.25 x 1 = 0.25. Similarly, action a2 has value 0.75 x 1.5 + 0.25 x 0 = 1.125. We can display these values over belief space with the figure below. This is, in fact, the horizon 1 value function. (Note that we have not violated the "no formula" promise: what preceded were not formulas, they were just calculations.)
|
With the horizon 1 value function we are now ready to construct the horizon 2 value function. This part of the tutorial is the most crucial for understanding POMDP solutions procedures. Once you understand how we will build the horizon 2 value function, you should have the necessary intuition behind POMDP value functions to understand the various algorithms.
Our goal in building this new value function is to find the best action (or highest value) we can achieve using only two actions (i.e., the horizon is 2) for every belief state. To show how to construct this new value function, we break the problem down into a series of steps.
We start with the problem: given a particular belief state, b what is the value of doing action a1, if after the action we received observation z1? In other words we want to find the best value possible for a single belief state when the immediate action and observation are fixed.
The value of a belief state for horizon 2 is simple the value of the immediate action plus the value of the next action. In general, we would like to find the best possible value which would include considering all possible sequences of two actions. However, since in our restricted problem our immediate action is fixed, the immediate value is fully determined. We can use the immediate rewards for action a1 to find the value of b just like we did in constructing the horizon 1 value function. Recall that the horizon 1 value function is nothing but the immediate reward function.
The only question is what is best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1. This isn't really much of a problem at all, since we know our initial belief state, the action and the resulting observation. This is all that is required to transform b into the unique resulting next belief state, which we will call b'. This new belief state will be the belief state we are in when we have one more action to perform; our horizon length is 2, but we just did one of the 2 possible actions. We know what the best values are for every belief state when there is a single action left to perform; this is exactly what our horizon 1 value function tells us.
The figure below shows this process. On the left is the immediate reward function and on the right is the horizon 1 value function. (Recall that for horizon length 1, the immediate rewards are the same as the value function.) The immediate rewards for action a2 are shown with a dashed line, since they are not of immediate interest when considering the fixed action a1.
|
Recall that what we are concerned with at this point is finding the value of the belief state b with the fixed action and observation. We have everything we need to calculate this value; we know what the immediate reward we will get is and we know the best value for the transformed belief state b'. Simply summing these two values gives us the value of belief state b given that we take action a1 and observe z1. As a side effect we also know what is the best next action to take.
As we compute the horizon 2 value function for a given initial belief state, action and observation, we would transform the belief state to anew point in belief space and use the horizon 1 value function to simply lookup the value of this transformed space. Suppose we ignore worry about factoring in the immediate rewards before transforming the belief state. Imagine we plotted this function: for every belief state, transform it (using a particular action and observation) and then lookup the horizon 1 value of the new belief. This would give you another value function over belief space, which would be the horizon 1 value function, but slightly transformed from the original. The transformation results from having factoring in the belief update calculation.
This imaginery algorithm cannot actually be implmented directly since there are uncountably infinite number of belief states we would need to do this for. However, it turns out that we can directly construct a function over the entire belief space from the horizon 1 value function that has the belief transformation built in. This gives us a function which directly tells us the value of each belief state after the action a1 is taken and observation z1 is seen without having to actually worry about transforming the belief state. The figure below show this transformation.
|
We previously decided to solve the simple problem of finding the value of a belief state, given a fixed action and observation. We have showed this and actually demonstrated how to find the value of all belief states given a fixed action and observation. Next we want to show how to compute the value of a belief state given only the action.
When we were looking at individual points, getting their immediate reward value, transforming them and getting the resulting belief states value, we where computing the conditional value. This is the value if we see observation z1. However, because the observations are probabilistic, we are not guaranteed to see z1. In this example, there are three possible observations and each one can lead to a separate resulting belief state. This figure below, shows the full situation when we fix our first action to be a1.
|
This might still be a bit cloudy, so let us do an example. Suppose that we compute the values of the resulting belief states for belief state b, action a1 and all three observations and find that the values for each resulting belief state are: z1:0.8, z2:0.7, z3:1.2. These are the values we were initially calculating when we were doing things one belief point at a time. Now we also can compute the probability of getting each of the three observations for the given belief state and action and find them to be: z1:0.6, z2:0.25, z3:0.15. Then the horizon 2 value of the belief state b when we fix the action at a1 is 0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835 plus the immediate reward of doing action a1 in b.
In fact, the transformed value function S(a1,z1) we showed before actually factors in the probabilities of the observation. So in reality, the S() function is not quite what we claimed; we claimed that it was the next belief state value of each belief state for the fixed action and given the observation. It reality, the S() function already has the probability of the observation built into it.
Let's look at the situation we currently have with the figure below.
|
So what is the horizon 2 value of a belief state, given a particular action a1? Well, it depends not only on the value of doing action a1 but also upon what action we do next (where the horizon length will be 1). However, what we do next will depend upon what observation we get. For a given belief state and observation, we can look at the S() function partition to decide what the best action next action to do is. This is best seen with a figure.
|
|
We derived this particular future strategy from the belief point b and it is the best future strategy for that belief point. However, just because we can compute the value of this future strategy for each belief point, doesn't mean it is the best strategy for all belief points. So which belief points is this the best future strategy? This is actually easy to see from the partition diagram.
|
|
|
If there was only the action a1 in our model, then the value function shown in the previous figure would be the horizon 2 value function. However, because there is another action, we must compare the value of the other action with the value of action a1 before we have the true horizon 2 value function, since we are interested in finding the best value for each belief state.
We can repeat the whole process we did for action a1 for the other action. This includes constructing the S() functions for the action a2 and all the observations. From these we can find the value function for that action. Below is the value function and partition for action a2.
|
|
|
This whole process took a long time to explain and is not nearly as complicated as it might seem. We will show how to construct the horizon 3 policy from the horizon 2 policy is a slightly accelerated manner. The steps are the same, but we can now eliminate a lot of the discussion and the intermediate steps which we used to aid the explanation.
First transform the horizon 2 value function for action a1 and all the observations.
|
|
|
When we were constructing the horizon 2 value function, these colors corresponded to the next action to take. The reason the colors corresponded to a single action was that with a horizon length of 2, there was only going to be a single action left to take after taking the first action. However, here and in general, each color represents a complete future strategy, not just one action. For instance, the magenta color corresponds to the line in the horizon 2 value function where we would do the action a2 and adopt its future strategy. The future strategy of the magenta line will depend on the observation we get after doing the a2 action.
Next we find the value function by adding the immediate rewards and the S() functions for each of the useful strategies. The partition that this value function will impose is easy to construct by simply looking at the partitions of the S() functions.
In the figure below, we show the S() partitions for action a1 below and the value function for action a1 with a horizon of 3. The partition this value function imposes is also shown.
|
We then construct the value function for the other action, put them together and see which line segments we can get rid of. Here are the S() function partitions, value function and the value functions partition for the action a2.
|
|
|
This concludes our example. The concepts and procedures can be applied over and over to any horizon length. This is the way we do value iteration on the CO-MDP derived from the POMDP.
| © 2003-2005, Anthony R. Cassandra |