Deep Learning PPT Courseware: Deep Reinforcement Learning.pptx
Introduction to Deep Reinforcement Learning
Yen-Chen Wu
2015/12/11

Outline
  - Reinforcement Learning
  - Markov Decision Process
  - How to Solve MDPs
      - DP
      - MC
      - TD
      - Q-learning (DQN)
  - Paper Review

Reinforcement Learning

Branches of Machine Learning

What makes it different?
  - There is no supervisor, only a reward signal
  - Feedback is delayed, not instantaneous
  - Time really matters (sequential, non-i.i.d. data)
  - The agent's actions affect the subsequent data it receives

Goal: Maximize Cumulative Reward
  - Actions may have long-term consequences
  - Reward may be delayed
  - It may be better to sacrifice immediate reward to gain more long-term reward

Agent & Environment
  - Defense
  - Attack
  - Jump

  - Full observability vs. partial observability
  - Learning and planning
  - Exploration and exploitation
  - Prediction and control

Markov Decision Process
  - Markov Processes
  - Markov Reward Processes
  - Markov Decision Processes

Markov Process

Markov Reward Processes

Markov Decision Process

Markov Decision Process (MDP)
  - S: finite set of states (observations)
  - A: finite set of actions
  - P: transition probability
  - R: immediate reward
  - γ: discount factor
  - Goal: choose a policy π that maximizes the expected return (the discounted sum of rewards)

How to Solve MDP
  - Dynamic Programming
  - Monte-Carlo
  - Temporal-Difference
  - Q-Learning

Model-based
  - Dynamic Programming
  - Evaluate policy
  - Update policy

Model Free
  - Unknown transition probability & reward
  - MC vs. TD

Model Free: Q-learning
  - Instead of a tabular representation
  - Optimal action-value function (Q-learning) = Bellman equation
  - Basic idea: iterative update (lack of generalization)
  - In practice: function approximator
  - Linear? Using DNN!

Deep Q-network (DQN)

Video
  https:/

Deep Q-Network
  - Computes Q-values for all actions
  - Input: 84x84x4
  - Convolves 32 filters of 8x8 with stride 4
  - Convolves 64 filters of 4x4 with stride 2
  - Convolves 64 filters of 3x3 with stride 1
  - Fully connected layer with 512 nodes
  - Output: one node for each action

Update DQN
  - Loss function
  - Gradient

Two Techniques
  - Experience Replay
      - Experience
      - Pooled memory
      - Data efficiency (bootstrap)
      - Avoid correlation between samples (variance between batches)
      - Off-policy is suitable for Q-learning
      - Randomly sampled mini-batch
      - Prioritized sweeping (active learning)
  - Separate Target Network
      - More stable than online learning

DEMO

Paper review

Paper list
  - Massively Parallel Methods for Deep Reinforcement Learning
  - Continuous control with deep reinforcement learning
  - Deep Reinforcement Learning with Double Q-learning
  - Policy Distillation
  - Dueling Network Architectures for Deep Reinforcement Learning
  - Multiagent Cooperation and Competition with Deep Reinforcement Learning

Massively Parallel Methods for Deep Reinforcement Learning
  Arun Nair
  arXiv:1507.04296

DDPG (Deep Deterministic Policy Gradient)

DDAC (Deep Deterministic Actor-Critic)

Continuous control with deep reinforcement learning
  Timothy P. Lillicrap
  arXiv:1509.02971
  https:/goo.gl/J4PIAz

Double Q-learning

Policy Distillation
  - Soft target

Dueling Network

Multiagent
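
The experience replay and separate target network items in the DQN part of the deck can be illustrated with a small sketch. This is not the implementation from the slides or from any DQN paper: it swaps the deep network for a plain Python dictionary (a tabular Q-function) so that it runs with the standard library alone, and the names `ReplayBuffer`, `td_target`, and `q_update`, as well as the default values of `gamma` and `alpha`, are assumptions chosen for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size pool of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity, seed=0):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; this is what breaks the
        # correlation between consecutive steps of one trajectory.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Q-learning target: r + gamma * max over a' of Q_target(s', a').

    At the end of an episode there is no next state, so the target is r.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q, q_target, batch, n_actions, alpha=0.1, gamma=0.99):
    """Move q toward targets computed from a separate, frozen table q_target.

    q and q_target map (state, action) pairs to values; missing entries are 0.
    In DQN proper, q and q_target would be two copies of a neural network
    and this loop would be a gradient step on a squared-error loss.
    """
    for state, action, reward, next_state, done in batch:
        next_q = [q_target.get((next_state, a), 0.0) for a in range(n_actions)]
        y = td_target(reward, next_q, done, gamma)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + alpha * (y - old)
```

In a training loop, one would push each observed transition into the buffer, call `q_update` on a sampled mini-batch, and copy `q` into `q_target` only every so many steps. That delayed copy is what keeps the targets from shifting after every update.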