Monday, October 10, 2016

Reinforcement Learning

Reinforcement learning

  • the trade-off between exploration and exploitation: the agent has to exploit what it already knows, but it also has to explore to make better action selections in the future.
  • it explicitly considers the whole problem of a goal-seeking agent interacting with an uncertain environment. In contrast, many other machine learning approaches consider subproblems without addressing how they might fit into a larger picture.
  • A broad trend is increasing contact between AI, control theory, and statistics.
Markov property: only the present matters
One never learns anything from history.
Your current state remembers everything you need to remember from the past.

Markov Decision Process

reward: what makes reinforcement learning different from supervised/unsupervised learning.
The solution is called a policy.
  • The optimal policy is the one that maximizes your long-term expected reward.
  • A concrete plan of what to do vs. what's the best next thing I can do
  • A plan tells you what sequence of actions to take from a particular state.
  • A policy tells you what action to take in a particular state.
  • The reinforcement learning approach is robust to the underlying stochasticity of the world.
  • Having a small negative reward everywhere encourages you to end the game.
infinite horizon
the utility of sequences

Bellman equation

U(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s') U(s')
value iteration: choose arbitrary values as the starting point for U
U_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s') U_t(s')
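As a concrete illustration, here is a minimal sketch of value iteration on a tiny hypothetical MDP (the states, actions, transition probabilities, and rewards are invented for the example):

```python
# Value iteration on a tiny made-up MDP.
# States: 0, 1.  Actions: 0 ("stay"), 1 ("go").
# T[s][a] is a list of (next_state, probability) pairs; R[s] is the reward in s.
GAMMA = 0.9
T = {
    0: {0: [(0, 1.0)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(1, 1.0)], 1: [(0, 0.8), (1, 0.2)]},
}
R = {0: 0.0, 1: 1.0}

U = {s: 0.0 for s in T}  # arbitrary starting values
for _ in range(1000):    # repeatedly apply the Bellman update
    U = {
        s: R[s] + GAMMA * max(
            sum(p * U[s2] for s2, p in T[s][a]) for a in T[s]
        )
        for s in T
    }

print(U)  # converges to the unique fixed point of the Bellman equation
```

Because the Bellman update is a contraction (with factor gamma), the starting values don't matter; repeated application converges to the same fixed point.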


BURLAP documentation:

Reinforcement learning basics

evaluating a policy:
  1. map state transitions to immediate rewards
  2. truncate according to the horizon
  3. summarize each sequence (e.g., a discounted sum of rewards)
  4. summarize over sequences (e.g., take the expectation)
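These steps can be sketched as a Monte Carlo rollout estimate; the two-state chain, transition probabilities, and policy below are made up for illustration:

```python
import random

# Monte Carlo policy evaluation following the four steps above,
# on a made-up two-state chain (all numbers are illustrative).
GAMMA, HORIZON, N_ROLLOUTS = 0.9, 50, 2000
R = {0: 0.0, 1: 1.0}

def step(state, action):
    """Hypothetical stochastic transition: action 1 usually switches state."""
    if action == 1 and random.random() < 0.8:
        return 1 - state
    return state

policy = {0: 1, 1: 0}  # go from state 0, stay in state 1

random.seed(0)
returns = []
for _ in range(N_ROLLOUTS):          # 4. summarize over sequences (average)
    s, g, discount = 0, 0.0, 1.0
    for _ in range(HORIZON):         # 2. truncate according to the horizon
        g += discount * R[s]         # 1. state transition -> immediate reward
        discount *= GAMMA            # 3. summarize the sequence (discounted sum)
        s = step(s, policy[s])
    returns.append(g)

estimate = sum(returns) / len(returns)
print(estimate)  # estimated utility of the policy from state 0
```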
RL context:
  1. model-based
  2. value-function-based
  3. policy search
Moving down this list, the learning becomes more direct but the supervision less direct.

Temporal difference

learning rate properties:
\sum_t \alpha_t = \infty
\sum_t \alpha_t^2 < \infty
TD(1), TD(0), TD(\lambda)
eligibility traces, k-step estimators
convergence, generalized MDPs
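A minimal sketch of tabular TD(0) under a fixed policy, on a made-up two-state chain; the per-state step size alpha = 1/n(s) is one choice that satisfies both learning-rate conditions (its sum diverges, its sum of squares converges):

```python
import random

# Tabular TD(0) on a made-up two-state chain with a fixed policy
# (transition probabilities and rewards are invented for illustration).
GAMMA = 0.9
R = {0: 0.0, 1: 1.0}

def step(state):
    """Hypothetical transition under the fixed policy: usually switch state."""
    return 1 - state if random.random() < 0.8 else state

random.seed(0)
V = {0: 0.0, 1: 0.0}
visits = {0: 0, 1: 0}
s = 0
for _ in range(50_000):
    s2 = step(s)
    visits[s] += 1
    alpha = 1.0 / visits[s]  # decaying learning rate, 1/n(s)
    # TD(0): move V(s) toward the one-step bootstrapped target.
    V[s] += alpha * (R[s] + GAMMA * V[s2] - V[s])
    s = s2

print(V)  # approximates the value of the fixed policy
```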

Advanced Algorithmic Analysis

  • Value iteration
  • linear programming
  • policy iteration
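Policy iteration alternates two steps: evaluate the current policy, then improve it greedily. A sketch on the same kind of tiny made-up MDP as the value-iteration example (states, actions, transitions, and rewards are invented):

```python
# Policy iteration on a tiny made-up MDP.
# T[s][a] is a list of (next_state, probability) pairs; R[s] is the reward.
GAMMA = 0.9
T = {
    0: {0: [(0, 1.0)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(1, 1.0)], 1: [(0, 0.8), (1, 0.2)]},
}
R = {0: 0.0, 1: 1.0}

def evaluate(policy, sweeps=500):
    """Policy evaluation: iterate the Bellman equation for a fixed policy."""
    U = {s: 0.0 for s in T}
    for _ in range(sweeps):
        U = {s: R[s] + GAMMA * sum(p * U[s2] for s2, p in T[s][policy[s]])
             for s in T}
    return U

def improve(U):
    """Policy improvement: act greedily with respect to U."""
    return {s: max(T[s], key=lambda a: sum(p * U[s2] for s2, p in T[s][a]))
            for s in T}

policy = {s: 0 for s in T}    # start from an arbitrary policy
while True:
    U = evaluate(policy)
    new_policy = improve(U)
    if new_policy == policy:  # no change: the policy is optimal
        break
    policy = new_policy

print(policy)  # the greedy policy at the fixed point
```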

Reward shaping

(example of a dolphin being trained to jump through a hoop) Not only in grad school, but in life: we don't just get the big reward once we've inadvertently produced the desired behavior. We get hints along the way.
potential function
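One standard way to formalize those hints is potential-based shaping (Ng, Harada & Russell, 1999): pick a potential function \Phi over states and add a shaping term to the reward,

R'(s,a,s') = R(s,a,s') + \gamma\Phi(s') - \Phi(s)

Shaping of this form provably leaves the optimal policy unchanged, so the hints can speed up learning without changing what counts as the right behavior.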

Partially Observable MDP

You don’t really know the state

Why RL hard?

  • delayed reward
the agent is only told how it's doing, not necessarily what it should be doing.
  • bootstrapping; the need for exploration
  • the number of states and actions

Game Theory

  • mathematics of conflict (of interest)
  • single agent -> multiple agents
  • ways of thinking about what happens when you're not the only thing with intentions in the world, and how to incorporate the goals of other agents who may or may not have your best interests at heart. How do you make that work?


What makes RL different from supervised learning is not just inputs and outputs: you see states, receive rewards, and take actions, and all of these require interaction.
Capturing a big static set of data and presenting it to the learner isn't really the RL process.