- the trade-off between exploration and exploitation. The agent has to exploit what it already knows, but it also has to explore in order to make better action selections in the future.
- it explicitly considers the whole problem of a goal-seeking agent interacting with an uncertain environment. In contrast, other machine learning approaches consider subproblems without addressing how they might fit into a larger picture.
- A large trend is the growing contact between AI, control theory, and statistics.
Markov property: only the present matters
You don't need the history: your current state remembers everything you need to remember from the past.
Markov Decision Process
reward: what makes reinforcement learning different from supervised/unsupervised learning.
The solution to an MDP is called a policy.
- The optimal policy is the one that maximizes your long-term expected reward.
- A concrete plan of what to do vs. the best next thing I can do
- A plan tells you what sequence of actions you should take in a particular state
- A policy tells you what action to take in a particular state.
- the reinforcement learning approach is robust to the underlying stochasticity of the world.
- by having a small negative reward everywhere, the agent is encouraged to end the game quickly.
the utility of sequences
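The standard definition, assuming a discount factor 0 ≤ γ < 1:

```latex
U(s_0, s_1, s_2, \dots) = \sum_{t=0}^{\infty} \gamma^{t} R(s_t), \qquad 0 \le \gamma < 1
```

Discounting keeps the infinite sum finite and makes rewards received sooner worth more than the same rewards received later.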
value iteration: choose arbitrary values as the starting point for U, then repeatedly apply the Bellman update until U converges
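A minimal sketch of that loop on a tiny hypothetical 2-state MDP (the states, transition probabilities, and rewards here are made up for illustration, not from the lecture):

```python
# P[s][a] = list of (next_state, probability); R[s] = immediate reward.
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.9), (0, 0.1)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 0.9), (1, 0.1)]},
}
R = {0: 0.0, 1: 1.0}
gamma = 0.9

# Start from arbitrary utilities and apply the Bellman update until
# the values stop changing.
U = {s: 0.0 for s in P}
for _ in range(1000):
    new_U = {
        s: R[s] + gamma * max(
            sum(p * U[s2] for s2, p in P[s][a]) for a in P[s]
        )
        for s in P
    }
    if max(abs(new_U[s] - U[s]) for s in P) < 1e-9:
        U = new_U
        break
    U = new_U

# Extract the greedy policy from the converged utilities.
policy = {
    s: max(P[s], key=lambda a: sum(p * U[s2] for s2, p in P[s][a]))
    for s in P
}
print(U, policy)
```

Because the Bellman update is a contraction, the starting values of U don't matter; only the number of iterations to converge changes.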
BURLAP documentation: http://burlap.cs.brown.edu/doc/
BURLAP .jar file: http://burlap.cs.brown.edu/burlap.jar
Reinforcement learning basics
evaluate a policy
- state transitions to immediate rewards
- truncate according to the horizon
- summarize each sequence
- summarize over sequences
- policy search
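The evaluation steps above can be sketched as a simple Monte Carlo rollout estimator; the environment, policy, and all numbers below are hypothetical stand-ins for illustration:

```python
import random

random.seed(0)

gamma = 0.9
horizon = 50  # truncate according to the horizon

def step(state, action):
    """Hypothetical stochastic environment: returns (next_state, reward)."""
    if action == "go":
        return (1, 1.0) if random.random() < 0.9 else (state, 0.0)
    return (state, 0.0)

def policy(state):
    return "go"

def evaluate(policy, start_state, n_episodes=2000):
    returns = []
    for _ in range(n_episodes):
        s, total, discount = start_state, 0.0, 1.0
        for _ in range(horizon):            # truncate by the horizon
            s, r = step(s, policy(s))       # transition -> immediate reward
            total += discount * r           # summarize the sequence
            discount *= gamma
        returns.append(total)
    return sum(returns) / len(returns)      # summarize over sequences

print(evaluate(policy, 0))
```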
more direct learning, less direct supervision
learning rate properties: the step sizes must satisfy sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity for convergence
TD(1), TD(0), TD(λ)
eligibility traces, k-step estimators
convergence, generalized MDP
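A sketch of tabular TD(λ) with accumulating eligibility traces on a hypothetical 5-state chain (all parameters are illustrative assumptions; setting lam = 0 recovers TD(0), and lam = 1 approaches the Monte Carlo estimate):

```python
import random

random.seed(1)

n_states = 5            # chain 0..4; state 4 is terminal, reward 1 on entry
gamma, alpha, lam = 0.9, 0.1, 0.8
V = [0.0] * n_states

for episode in range(2000):
    s = 0
    e = [0.0] * n_states                  # eligibility traces
    while s != n_states - 1:
        s2 = s + 1 if random.random() < 0.9 else s   # drift right
        r = 1.0 if s2 == n_states - 1 else 0.0
        v_next = 0.0 if s2 == n_states - 1 else V[s2]  # V(terminal) = 0
        delta = r + gamma * v_next - V[s]              # TD error
        e[s] += 1.0                       # accumulate trace for visited state
        for i in range(n_states):
            V[i] += alpha * delta * e[i]  # credit all eligible states
            e[i] *= gamma * lam           # decay all traces
        s = s2

print([round(v, 2) for v in V])
```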
Advanced Algorithmic Analysis
- Value iteration
- linear programming
- policy iteration
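A sketch of policy iteration, alternating evaluation of a fixed policy with greedy improvement, on a tiny hypothetical 2-state MDP (all states, transitions, and rewards are made up for illustration):

```python
# P[s][a] = list of (next_state, probability); R[s] = immediate reward.
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.9), (0, 0.1)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 0.9), (1, 0.1)]},
}
R = {0: 0.0, 1: 1.0}
gamma = 0.9

def q(s, a, U):
    """One-step lookahead value of taking action a in state s."""
    return R[s] + gamma * sum(p * U[s2] for s2, p in P[s][a])

policy = {s: "stay" for s in P}           # arbitrary initial policy
while True:
    # Policy evaluation: sweep the Bellman equation for the fixed policy.
    U = {s: 0.0 for s in P}
    for _ in range(500):
        U = {s: q(s, policy[s], U) for s in P}
    # Policy improvement: act greedily with respect to U.
    new_policy = {s: max(P[s], key=lambda a: q(s, a, U)) for s in P}
    if new_policy == policy:
        break                             # stable policy => optimal
    policy = new_policy

print(policy)
```

Unlike value iteration, which improves values everywhere at once, policy iteration fully evaluates each candidate policy before improving it, and typically converges in very few improvement rounds.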
(example of a dolphin jumping through a hoop) Not only in grad school, but in life: we don't get the big reward only after we've inadvertently done the desired behavior; we get hints along the way.
Partially Observable MDPs (POMDPs)
You don't really know the state you're in.
Why is RL hard?
- delayed reward
the agent is only told how it's doing, not necessarily what it should be doing.
- bootstrapping; the need for exploration
- the sheer number of states and actions
- mathematics of conflict (of interest)
- single agent -> multiple agents
- ways of thinking about what happens when you're not the only thing with intentions in the world: how do you incorporate the goals of other agents who might or might not have your best interests at heart, and how do you make that work?
What makes RL different from supervised learning is not just inputs and outputs: you have to see states and rewards, and you have to take actions. All of these things require interaction.
Capturing a big static dataset and presenting it to the learner isn't really the RL process.