- the trade-off between exploration and exploitation. The agent has to exploit what it already knows, but it also has to explore in order to make better action selections in the future.
- it explicitly considers the whole problem of a goal-seeking agent interacting with an uncertain environment. In contrast, other machine learning approaches consider subproblems without addressing how they might fit into a larger picture.
- A large trend is the growing contact between AI, control theory, and statistics.
Markov property: only the present matters
You don't need the history: your current state remembers everything you need to remember from the past.
Markov Decision Process
reward: what makes reinforcement learning different from supervised/unsupervised learning.
The solution to an MDP is called a policy.
- The optimal policy is the one that maximizes your long-term expected reward.
- A concrete plan of what to do vs. the best next thing I can do
- A plan tells you what sequence of actions you should take in a particular state
- A policy tells you what action to take in a particular state.
- the reinforcement learning approach is robust to the underlying stochasticity of the world.
- by having a small negative reward everywhere, the agent is encouraged to end the game quickly.
the utility of sequences
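The standard definition, assuming a discount factor 0 ≤ γ < 1:

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t), \qquad 0 \le \gamma < 1
```

Discounting keeps the infinite sum finite and makes rewards received sooner worth more than the same rewards received later.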
value iteration: choose arbitrary values as the starting point for U, then repeatedly apply the Bellman update until U converges
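A minimal sketch of that loop on a tiny hypothetical 2-state MDP (the states, transition probabilities, and rewards here are made up for illustration, not from the lecture):

```python
# P[s][a] = list of (next_state, probability); R[s] = immediate reward.
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.9), (0, 0.1)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 0.9), (1, 0.1)]},
}
R = {0: 0.0, 1: 1.0}
gamma = 0.9

# Start from arbitrary utilities and apply the Bellman update until
# the values stop changing.
U = {s: 0.0 for s in P}
for _ in range(1000):
    new_U = {
        s: R[s] + gamma * max(
            sum(p * U[s2] for s2, p in P[s][a]) for a in P[s]
        )
        for s in P
    }
    if max(abs(new_U[s] - U[s]) for s in P) < 1e-9:
        U = new_U
        break
    U = new_U

# Extract the greedy policy from the converged utilities.
policy = {
    s: max(P[s], key=lambda a: sum(p * U[s2] for s2, p in P[s][a]))
    for s in P
}
print(U, policy)
```

Because the Bellman update is a contraction, the starting values of U don't matter; only the number of iterations to converge changes.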
BURLAP documentation: http://burlap.cs.brown.edu/doc/
BURLAP .jar file: http://burlap.cs.brown.edu/burlap.jar
Reinforcement learning basics
evaluate a policy
- state transitions to immediate rewards
- truncate according to the horizon
- summarize each sequence
- summarize over sequences
- policy search
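The evaluation steps above can be sketched as a simple Monte Carlo rollout estimator; the environment, policy, and all numbers below are hypothetical stand-ins for illustration:

```python
import random

random.seed(0)

gamma = 0.9
horizon = 50  # truncate according to the horizon

def step(state, action):
    """Hypothetical stochastic environment: returns (next_state, reward)."""
    if action == "go":
        return (1, 1.0) if random.random() < 0.9 else (state, 0.0)
    return (state, 0.0)

def policy(state):
    return "go"

def evaluate(policy, start_state, n_episodes=2000):
    returns = []
    for _ in range(n_episodes):
        s, total, discount = start_state, 0.0, 1.0
        for _ in range(horizon):            # truncate by the horizon
            s, r = step(s, policy(s))       # transition -> immediate reward
            total += discount * r           # summarize the sequence
            discount *= gamma
        returns.append(total)
    return sum(returns) / len(returns)      # summarize over sequences

print(evaluate(policy, 0))
```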
more direct learning, less direct supervision
learning rate properties: the step sizes must satisfy sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity for convergence
TD(1), TD(0), TD(λ)
eligibility traces, k-step estimators
convergence, generalized MDP
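A sketch of tabular TD(λ) with accumulating eligibility traces on a hypothetical 5-state chain (all parameters are illustrative assumptions; setting lam = 0 recovers TD(0), and lam = 1 approaches the Monte Carlo estimate):

```python
import random

random.seed(1)

n_states = 5            # chain 0..4; state 4 is terminal, reward 1 on entry
gamma, alpha, lam = 0.9, 0.1, 0.8
V = [0.0] * n_states

for episode in range(2000):
    s = 0
    e = [0.0] * n_states                  # eligibility traces
    while s != n_states - 1:
        s2 = s + 1 if random.random() < 0.9 else s   # drift right
        r = 1.0 if s2 == n_states - 1 else 0.0
        v_next = 0.0 if s2 == n_states - 1 else V[s2]  # V(terminal) = 0
        delta = r + gamma * v_next - V[s]              # TD error
        e[s] += 1.0                       # accumulate trace for visited state
        for i in range(n_states):
            V[i] += alpha * delta * e[i]  # credit all eligible states
            e[i] *= gamma * lam           # decay all traces
        s = s2

print([round(v, 2) for v in V])
```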
Advanced Algorithmic Analysis
- Value iteration
- linear programming
- policy iteration
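A sketch of policy iteration, alternating evaluation of a fixed policy with greedy improvement, on a tiny hypothetical 2-state MDP (all states, transitions, and rewards are made up for illustration):

```python
# P[s][a] = list of (next_state, probability); R[s] = immediate reward.
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.9), (0, 0.1)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 0.9), (1, 0.1)]},
}
R = {0: 0.0, 1: 1.0}
gamma = 0.9

def q(s, a, U):
    """One-step lookahead value of taking action a in state s."""
    return R[s] + gamma * sum(p * U[s2] for s2, p in P[s][a])

policy = {s: "stay" for s in P}           # arbitrary initial policy
while True:
    # Policy evaluation: sweep the Bellman equation for the fixed policy.
    U = {s: 0.0 for s in P}
    for _ in range(500):
        U = {s: q(s, policy[s], U) for s in P}
    # Policy improvement: act greedily with respect to U.
    new_policy = {s: max(P[s], key=lambda a: q(s, a, U)) for s in P}
    if new_policy == policy:
        break                             # stable policy => optimal
    policy = new_policy

print(policy)
```

Unlike value iteration, which improves values everywhere at once, policy iteration fully evaluates each candidate policy before improving it, and typically converges in very few improvement rounds.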
(example of a dolphin jumping through a hoop) Not only in grad school, but in life: we don't get the big reward only after we've inadvertently done the desired behavior; we get hints along the way.
Partially Observable MDPs (POMDPs)
You don't really know the state you're in.
Why is RL hard?
- delayed reward
the agent is only told how it's doing, not necessarily what it should be doing.
- bootstrapping; the need for exploration
- the sheer number of states and actions
- mathematics of conflict (of interest)
- single agent -> multiple agents
- ways of thinking about what happens when you're not the only thing with intentions in the world: how do you incorporate the goals of other agents who might or might not have your best interests at heart, and how do you make that work?
What makes RL different from supervised learning is not just inputs and outputs: you have to see states and rewards, and you have to take actions. All of these things require interaction.
Capturing a big static dataset and presenting it to the learner isn't really the RL process.