Last active
February 26, 2020 13:37
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# DDDQN (Double Dueling Deep Q Learning with Prioritized Experience Replay) Doom🕹️\n", | |
| "In this notebook we'll implement an agent <b>that plays Doom by using a Dueling Double Deep Q learning architecture with Prioritized Experience Replay.</b> <br>\n", | |
| "\n", | |
| "Here is our agent playing Doom after 3 hours of training on **CPU**. Remember that the agent needs about 2 days of training on **GPU** to reach an optimal score; we'll train the most important architectures (PPO with transfer) from beginning to end:\n", | |
| "\n", | |
| "<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/projects/doomdeathmatc.gif\" alt=\"Doom Deathmatch\"/>\n", | |
| "\n", | |
| "But we can already see that our agent **understands that it needs to kill enemies before moving forward (if it moves forward without killing enemies, it will be killed before reaching the vest).**" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# This is a notebook from Deep Reinforcement Learning Course with Tensorflow\n", | |
| "<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png\" alt=\"Deep Reinforcement Course\" />\n", | |
| "\n", | |
| "<p> Deep Reinforcement Learning Course is a free series of blog posts and videos 🆕 about Deep Reinforcement Learning, where we'll learn the main algorithms, and how to implement them with Tensorflow.\n", | |
| "\n", | |
| "📜 The articles explain the concepts, from the big picture to the mathematical details behind them.\n", | |
| "\n", | |
| "📹 The videos explain how to create the agent with Tensorflow </p>\n", | |
| "\n", | |
| "## <a href=\"https://simoninithomas.github.io/Deep_reinforcement_learning_Course/\">Syllabus</a><br>\n", | |
| "### 📜 Part 1: Introduction to Reinforcement Learning [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419) \n", | |
| "\n", | |
| "### Part 2: Q-learning with FrozenLake \n", | |
| "#### 📜 [ARTICLE](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe) // [FROZENLAKE IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb)\n", | |
| "#### 📹 [Implementing a Q-learning agent that plays Taxi-v2 🚕](https://youtu.be/q2ZOEFAaaI0) \n", | |
| "\n", | |
| "### Part 3: Deep Q-learning with Doom\n", | |
| "#### 📜 [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) // [DOOM IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/DQN%20Doom/Deep%20Q%20learning%20with%20Doom.ipynb)\n", | |
| "#### 📹 [Create a DQN Agent that learns to play Atari Space Invaders 👾 ](https://youtu.be/gCJyVX98KJ4)\n", | |
| "\n", | |
| "### Part 3+: Improvements in Deep Q-Learning\n", | |
| "#### 📜 [ARTICLE (📅 JUNE)] \n", | |
| "#### 📹 [Create an Agent that learns to play Doom Deadly corridor (📅 06/30)] \n", | |
| "\n", | |
| "### Part 4: Policy Gradients with Doom \n", | |
| "#### 📜 [ARTICLE](https://medium.freecodecamp.org/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f) // [CARTPOLE IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb) // [DOOM IMPLEMENTATION](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Doom/Doom%20REINFORCE%20Monte%20Carlo%20Policy%20gradients.ipynb)\n", | |
| "#### 📹 [Create an Agent that learns to play Doom deathmatch](https://youtu.be/wLTQRuizVyE)\n", | |
| "\n", | |
| "### Part 5: Advantage Actor Critic (A2C) \n", | |
| "#### 📜 [ARTICLE (📅 June)] \n", | |
| "#### 📹 [Create an Agent that learns to play Outrun (📅 07/04)] \n", | |
| "\n", | |
| "### Part 6: Asynchronous Advantage Actor Critic (A3C) \n", | |
| "#### 📜 [ARTICLE (📅 July)] \n", | |
| "#### 📹 [Create an Agent that learns to play Michael Jackson's Moonwalker (📅 07/11)] \n", | |
| "\n", | |
| "### Part 7: Proximal Policy Gradients \n", | |
| "#### 📜 [ARTICLE (📅 July)]\n", | |
| "#### 📹 [Create an Agent that learns to walk with Mujoco (📅 07/18)]\n", | |
| "\n", | |
| "### Part 8: TBA \n", | |
| "\n", | |
| "## Any questions 👨💻\n", | |
| "<p> If you have any questions, feel free to ask me: </p>\n", | |
| "<p> 📧: <a href=\"mailto:hello@simoninithomas.com\">hello@simoninithomas.com</a> </p>\n", | |
| "<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>\n", | |
| "<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>\n", | |
| "<p> Twitter: <a href=\"https://twitter.com/ThomasSimonini\">@ThomasSimonini</a> </p>\n", | |
| "<p> Don't forget to <b> follow me on <a href=\"https://twitter.com/ThomasSimonini\">twitter</a>, <a href=\"https://github.com/simoninithomas/Deep_reinforcement_learning_Course\">github</a> and <a href=\"https://medium.com/@thomassimonini\">Medium</a> to be alerted of the new articles that I publish </b></p>\n", | |
| " \n", | |
| "## How to help 🙌\n", | |
| "3 ways:\n", | |
| "- **Clap our articles and like our videos a lot**: Clapping on Medium means that you really like our articles, and the more claps we have, the more our articles are shared. Liking our videos helps them become much more visible to the deep learning community.\n", | |
| "- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. \n", | |
| "- **Improve our notebooks**: if you find a bug or **a better implementation**, you can send a pull request.\n", | |
| "<br>\n", | |
| "\n", | |
| "## Important note 🤔\n", | |
| "<b> You can run it on your computer, but it's better to run it on GPU-based services. </b> Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170)\n", | |
| "https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning\n", | |
| "<br>\n", | |
| "⚠️ I don't have any business relations with them. I just loved their excellent customer service.\n", | |
| "\n", | |
| "If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skip the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 1: Import the libraries 📚" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import tensorflow as tf # Deep Learning library\n", | |
| "import numpy as np # Handle matrices\n", | |
| "from vizdoom import * # Doom Environment\n", | |
| "\n", | |
| "import random # Handling random number generation\n", | |
| "import time # Handling time calculation\n", | |
| "from skimage import transform # Help us to preprocess the frames\n", | |
| "\n", | |
| "from collections import deque # Ordered collection with ends\n", | |
| "import matplotlib.pyplot as plt # Display graphs\n", | |
| "\n", | |
| "import warnings # Ignore the warning messages that skimage normally prints during training\n", | |
| "warnings.filterwarnings('ignore') " | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 2: Create our environment 🎮\n", | |
| "- Now that we have imported the libraries/dependencies, we will create our environment.\n", | |
| "- The Doom environment takes:\n", | |
| "    - A `configuration file` that **handles all the options** (size of the frame, possible actions...)\n", | |
| "    - A `scenario file` that **generates the correct scenario** (in our case deadly corridor, **but you're invited to try other scenarios**).\n", | |
| "- Note: We have 7 possible actions: turn left, turn right, move left, move right, shoot (attack)... Each one is already a one-hot vector `[[0,0,0,0,0,0,1]...]`, so we don't need to do any further one-hot encoding (thanks to <a href=\"https://stackoverflow.com/users/2237916/silgon\">silgon</a> for figuring it out). \n", | |
| "\n", | |
| "### Our environment\n", | |
| "<img src=\"https://simoninithomas.github.io/Deep_reinforcement_learning_Course/assets/img/video%20projects/deadlycorridor.png\" style=\"max-width:500px;\" alt=\"Vizdoom deadly corridor\"/>\n", | |
| "\n", | |
| "The purpose of this scenario is to teach the agent to navigate towards his fundamental goal (the vest) and make sure he survives at the same time.\n", | |
| "\n", | |
| "- Map is a corridor with shooting monsters on both sides (6 monsters in total). \n", | |
| "- A green vest is placed at the opposite end of the corridor. \n", | |
| "- **Reward is proportional (negative or positive) to change of the distance between the player and the vest.** \n", | |
| "- If player ignores monsters on the sides and runs straight for the vest he will be killed somewhere along the way. \n", | |
| "- To ensure this behavior doom_skill = 5 (config) is needed.\n", | |
| "\n", | |
| "<br>\n", | |
| "REWARDS:\n", | |
| "\n", | |
| "- +dX for getting closer to the vest. -dX for getting further from the vest.\n", | |
| "- death penalty = 100" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "\"\"\"\n", | |
| "Here we create our environment\n", | |
| "\"\"\"\n", | |
| "def create_environment():\n", | |
| "    game = DoomGame()\n", | |
| "\n", | |
| "    # Load the correct configuration\n", | |
| "    game.load_config(\"deadly_corridor.cfg\")\n", | |
| "\n", | |
| "    # Load the correct scenario (in our case deadly_corridor scenario)\n", | |
| "    game.set_doom_scenario_path(\"deadly_corridor.wad\")\n", | |
| "\n", | |
| "    game.init()\n", | |
| "\n", | |
| "    # Here we create a one-hot encoded version of our actions (7 possible actions)\n", | |
| "    # possible_actions = [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0]...]\n", | |
| "    possible_actions = np.identity(7, dtype=int).tolist()\n", | |
| "\n", | |
| "    return game, possible_actions" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "game, possible_actions = create_environment()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 3: Define the preprocessing functions ⚙️\n", | |
| "### preprocess_frame\n", | |
| "Preprocessing is an important step, <b>because we want to reduce the complexity of our states to reduce the computation time needed for training.</b>\n", | |
| "<br><br>\n", | |
| "Our steps:\n", | |
| "- Grayscale each of our frames (because <b> color does not add important information </b>). But this is already done by the config file.\n", | |
| "- Crop the screen (in our case we remove the roof because it contains no information)\n", | |
| "- We normalize pixel values\n", | |
| "- Finally we resize the preprocessed frame" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "\"\"\"\n", | |
| "    preprocess_frame:\n", | |
| "    Take a frame.\n", | |
| "    Crop it.\n", | |
| "    Normalize it.\n", | |
| "    Resize it.\n", | |
| "        __________________\n", | |
| "        |                 |\n", | |
| "        |                 |\n", | |
| "        |                 |\n", | |
| "        |                 |\n", | |
| "        |_________________|\n", | |
| "\n", | |
| "        to\n", | |
| "        _____________\n", | |
| "        |            |\n", | |
| "        |            |\n", | |
| "        |            |\n", | |
| "        |____________|\n", | |
| "\n", | |
| "    return preprocessed_frame\n", | |
| "\n", | |
| "    \"\"\"\n", | |
| "def preprocess_frame(frame):\n", | |
| "    # Crop the screen (remove part that contains no information)\n", | |
| "    # [Up: Down, Left: right]\n", | |
| "    cropped_frame = frame[15:-5,20:-20]\n", | |
| "\n", | |
| "    # Normalize Pixel Values\n", | |
| "    normalized_frame = cropped_frame/255.0\n", | |
| "\n", | |
| "    # Resize the normalized frame (the original resized cropped_frame, leaving normalized_frame unused)\n", | |
| "    preprocessed_frame = transform.resize(normalized_frame, [100,120])\n", | |
| "\n", | |
| "    return preprocessed_frame # 100x120 frame" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### stack_frames\n", | |
| "👏 This part was made possible thanks to the help of <a href=\"https://github.com/Miffyli\">Anssi</a><br>\n", | |
| "\n", | |
| "As explained in this really <a href=\"https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/\"> good article </a> we stack frames.\n", | |
| "\n", | |
| "Stacking frames is really important because it helps us to **give a sense of motion to our Neural Network.**\n", | |
| "\n", | |
| "- First we preprocess frame\n", | |
| "- Then we append the frame to the deque that automatically **removes the oldest frame**\n", | |
| "- Finally we **build the stacked state**\n", | |
| "\n", | |
| "This is how stacking works:\n", | |
| "- For the first frame, we feed the same frame 4 times\n", | |
| "- At each timestep, **we add the new frame to deque and then we stack them to form a new stacked frame**\n", | |
| "- And so on\n", | |
| "<img src=\"https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/DQN/Space%20Invaders/assets/stack_frames.png\" alt=\"stack\">\n", | |
| "- If we're done, **we create a new stack with 4 new frames (because we are in a new episode)**." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "stack_size = 4 # We stack 4 frames\n", | |
| "\n", | |
| "# Initialize deque with zero-images, one array for each image\n", | |
| "# (np.int was removed from recent NumPy releases, so we use the builtin int)\n", | |
| "stacked_frames = deque([np.zeros((100,120), dtype=int) for i in range(stack_size)], maxlen=stack_size) \n", | |
| "\n", | |
| "def stack_frames(stacked_frames, state, is_new_episode):\n", | |
| "    # Preprocess frame\n", | |
| "    frame = preprocess_frame(state)\n", | |
| "\n", | |
| "    if is_new_episode:\n", | |
| "        # Clear our stacked_frames\n", | |
| "        stacked_frames = deque([np.zeros((100,120), dtype=int) for i in range(stack_size)], maxlen=stack_size)\n", | |
| "\n", | |
| "        # Because we're in a new episode, copy the same frame 4x\n", | |
| "        stacked_frames.append(frame)\n", | |
| "        stacked_frames.append(frame)\n", | |
| "        stacked_frames.append(frame)\n", | |
| "        stacked_frames.append(frame)\n", | |
| "\n", | |
| "        # Stack the frames\n", | |
| "        stacked_state = np.stack(stacked_frames, axis=2)\n", | |
| "\n", | |
| "    else:\n", | |
| "        # Append frame to deque, automatically removes the oldest frame\n", | |
| "        stacked_frames.append(frame)\n", | |
| "\n", | |
| "        # Build the stacked state (last dimension specifies different frames)\n", | |
| "        stacked_state = np.stack(stacked_frames, axis=2) \n", | |
| "\n", | |
| "    return stacked_state, stacked_frames" | |
| ] | |
| }, | |
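The stacking behaviour described above can be sketched without any image data. This is a minimal illustration, not the notebook's function: `push_frame` and the string labels standing in for frames are hypothetical, and only the standard library is assumed.

```python
from collections import deque

STACK_SIZE = 4  # same value as stack_size in the notebook


def push_frame(stack, frame, is_new_episode):
    """Return the updated stack; frames are stand-in labels here, not images."""
    if is_new_episode:
        # A new episode starts from a stack filled with copies of the first frame.
        stack = deque([frame] * STACK_SIZE, maxlen=STACK_SIZE)
    else:
        # maxlen makes the deque drop its oldest entry automatically.
        stack.append(frame)
    return stack


stack = push_frame(None, "f0", is_new_episode=True)
for name in ["f1", "f2"]:
    stack = push_frame(stack, name, is_new_episode=False)

print(list(stack))  # oldest frame first, newest frame last
```

After two steps the stack still holds two copies of the first frame, which is why the network only sees real motion once four distinct frames have been observed.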
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 4: Set up our hyperparameters ⚗️\n", | |
| "In this part we'll set up our different hyperparameters. But when you implement a Neural Network by yourself you will **not implement the hyperparameters all at once but progressively**.\n", | |
| "\n", | |
| "- First, you begin by defining the neural networks hyperparameters when you implement the model.\n", | |
| "- Then, you'll add the training hyperparameters when you implement the training algorithm." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "### MODEL HYPERPARAMETERS\n", | |
| "state_size = [100,120,4] # Our input is a stack of 4 frames hence 100x120x4 (height, width, channels) \n", | |
| "action_size = game.get_available_buttons_size() # 7 possible actions\n", | |
| "learning_rate = 0.00025 # Alpha (aka learning rate)\n", | |
| "\n", | |
| "### TRAINING HYPERPARAMETERS\n", | |
| "total_episodes = 5000 # Total episodes for training\n", | |
| "max_steps = 5000 # Max possible steps in an episode\n", | |
| "batch_size = 64 \n", | |
| "\n", | |
| "# FIXED Q TARGETS HYPERPARAMETERS \n", | |
| "max_tau = 10000 #Tau is the C step where we update our target network\n", | |
| "\n", | |
| "# EXPLORATION HYPERPARAMETERS for epsilon greedy strategy\n", | |
| "explore_start = 1.0 # exploration probability at start\n", | |
| "explore_stop = 0.01 # minimum exploration probability \n", | |
| "decay_rate = 0.00005 # exponential decay rate for exploration prob\n", | |
| "\n", | |
| "# Q LEARNING hyperparameters\n", | |
| "gamma = 0.95 # Discounting rate\n", | |
| "\n", | |
| "### MEMORY HYPERPARAMETERS\n", | |
| "## If you have GPU change to 1million\n", | |
| "pretrain_length = 100000 # Number of experiences stored in the Memory when initialized for the first time\n", | |
| "memory_size = 100000 # Number of experiences the Memory can keep\n", | |
| "\n", | |
| "### MODIFY THIS TO FALSE IF YOU JUST WANT TO SEE THE TRAINED AGENT\n", | |
| "training = False\n", | |
| "\n", | |
| "## TURN THIS TO TRUE IF YOU WANT TO RENDER THE ENVIRONMENT\n", | |
| "episode_render = False" | |
| ] | |
| }, | |
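To see what the exploration hyperparameters imply, here is a small sketch of an exponentially decaying epsilon. The formula and the `explore_probability` helper are assumptions made for illustration (the cell above only states that `decay_rate` is an exponential decay rate); the three constants are copied from that cell.

```python
import math

explore_start = 1.0    # exploration probability at start
explore_stop = 0.01    # minimum exploration probability
decay_rate = 0.00005   # exponential decay rate


def explore_probability(decay_step):
    """Assumed schedule: decay from explore_start towards explore_stop."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * decay_step)


for step in (0, 10_000, 100_000):
    print(step, round(explore_probability(step), 4))
```

Under this assumed schedule the agent still explores more than half of the time after 10,000 steps, so a small `decay_rate` keeps exploration high for a long stretch of training.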
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 5: Create our Dueling Double Deep Q-learning Neural Network model (aka DDDQN) 🧠\n", | |
| "<img src=\"https://cdn-images-1.medium.com/max/1500/1*FkHqwA2eSGixdS-3dvVoMA.png\" alt=\"Dueling Double Deep Q Learning Model\" />\n", | |
| "This is our Dueling Double Deep Q-learning model:\n", | |
| "- We take a stack of 4 frames as input\n", | |
| "- It passes through 3 convolutional layers\n", | |
| "- Then it is flattened\n", | |
| "- Then it is passed through 2 streams\n", | |
| "    - One that calculates V(s)\n", | |
| "    - The other that calculates A(s,a)\n", | |
| "- Finally an aggregating layer\n", | |
| "- It outputs a Q value for each action" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "class DDDQNNet:\n", | |
| "    def __init__(self, state_size, action_size, learning_rate, name):\n", | |
| "        self.state_size = state_size\n", | |
| "        self.action_size = action_size\n", | |
| "        self.learning_rate = learning_rate\n", | |
| "        self.name = name\n", | |
| "\n", | |
| "\n", | |
| "        # We use tf.variable_scope here to know which network we're using (DQN or target_net)\n", | |
| "        # it will be useful when we update our w- parameters (by copying the DQN parameters)\n", | |
| "        with tf.variable_scope(self.name):\n", | |
| "\n", | |
| "            # We create the placeholders\n", | |
| "            # *state_size unpacks the elements of state_size, so this is as if we wrote\n", | |
| "            # [None, 100, 120, 4]\n", | |
| "            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name=\"inputs\")\n", | |
| "\n", | |
| "            # Importance-sampling weights from Prioritized Experience Replay (one per sampled experience)\n", | |
| "            self.ISWeights_ = tf.placeholder(tf.float32, [None,1], name='IS_weights')\n", | |
| "\n", | |
| "            self.actions_ = tf.placeholder(tf.float32, [None, action_size], name=\"actions_\")\n", | |
| "\n", | |
| "            # Remember that target_Q is R(s,a) + gamma * max_a' Qhat(s', a')\n", | |
| "            self.target_Q = tf.placeholder(tf.float32, [None], name=\"target\")\n", | |
| " \n", | |
| "            \"\"\"\n", | |
| "            First convnet:\n", | |
| "            CNN\n", | |
| "            ELU\n", | |
| "            \"\"\"\n", | |
| "            # Input is 100x120x4\n", | |
| "            self.conv1 = tf.layers.conv2d(inputs = self.inputs_,\n", | |
| "                                          filters = 32,\n", | |
| "                                          kernel_size = [8,8],\n", | |
| "                                          strides = [4,4],\n", | |
| "                                          padding = \"VALID\",\n", | |
| "                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n", | |
| "                                          name = \"conv1\")\n", | |
| "\n", | |
| "            self.conv1_out = tf.nn.elu(self.conv1, name=\"conv1_out\")\n", | |
| "\n", | |
| "\n", | |
| "            \"\"\"\n", | |
| "            Second convnet:\n", | |
| "            CNN\n", | |
| "            ELU\n", | |
| "            \"\"\"\n", | |
| "            self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,\n", | |
| "                                          filters = 64,\n", | |
| "                                          kernel_size = [4,4],\n", | |
| "                                          strides = [2,2],\n", | |
| "                                          padding = \"VALID\",\n", | |
| "                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n", | |
| "                                          name = \"conv2\")\n", | |
| "\n", | |
| "            self.conv2_out = tf.nn.elu(self.conv2, name=\"conv2_out\")\n", | |
| "\n", | |
| "\n", | |
| "            \"\"\"\n", | |
| "            Third convnet:\n", | |
| "            CNN\n", | |
| "            ELU\n", | |
| "            \"\"\"\n", | |
| "            self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,\n", | |
| "                                          filters = 128,\n", | |
| "                                          kernel_size = [4,4],\n", | |
| "                                          strides = [2,2],\n", | |
| "                                          padding = \"VALID\",\n", | |
| "                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(),\n", | |
| "                                          name = \"conv3\")\n", | |
| "\n", | |
| "            self.conv3_out = tf.nn.elu(self.conv3, name=\"conv3_out\")\n", | |
| "\n", | |
| "\n", | |
| "            self.flatten = tf.layers.flatten(self.conv3_out)\n", | |
| " \n", | |
| " \n", | |
| "            ## Here we separate into two streams\n", | |
| "            # The one that calculates V(s)\n", | |
| "            self.value_fc = tf.layers.dense(inputs = self.flatten,\n", | |
| "                                            units = 512,\n", | |
| "                                            activation = tf.nn.elu,\n", | |
| "                                            kernel_initializer=tf.contrib.layers.xavier_initializer(),\n", | |
| "                                            name=\"value_fc\")\n", | |
| "\n", | |
| "            self.value = tf.layers.dense(inputs = self.value_fc,\n", | |
| "                                         units = 1,\n", | |
| "                                         activation = None,\n", | |
| "                                         kernel_initializer=tf.contrib.layers.xavier_initializer(),\n", | |
| "                                         name=\"value\")\n", | |
| "\n", | |
| "            # The one that calculates A(s,a)\n", | |
| "            self.advantage_fc = tf.layers.dense(inputs = self.flatten,\n", | |
| "                                                units = 512,\n", | |
| "                                                activation = tf.nn.elu,\n", | |
| "                                                kernel_initializer=tf.contrib.layers.xavier_initializer(),\n", | |
| "                                                name=\"advantage_fc\")\n", | |
| "\n", | |
| "            self.advantage = tf.layers.dense(inputs = self.advantage_fc,\n", | |
| "                                             units = self.action_size,\n", | |
| "                                             activation = None,\n", | |
| "                                             kernel_initializer=tf.contrib.layers.xavier_initializer(),\n", | |
| "                                             name=\"advantages\")\n", | |
| "\n", | |
| "            # Aggregating layer\n", | |
| "            # Q(s,a) = V(s) + (A(s,a) - 1/|A| * sum A(s,a'))\n", | |
| "            self.output = self.value + tf.subtract(self.advantage, tf.reduce_mean(self.advantage, axis=1, keepdims=True))\n", | |
| "\n", | |
| "            # Q is our predicted Q value.\n", | |
| "            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)\n", | |
| "\n", | |
| "            # The loss is modified because of PER \n", | |
| "            self.absolute_errors = tf.abs(self.target_Q - self.Q) # for updating Sumtree\n", | |
| "\n", | |
| "            # ISWeights_ has shape [batch, 1] while the squared error has shape [batch];\n", | |
| "            # squeeze the weights so the product stays per-sample instead of broadcasting to [batch, batch]\n", | |
| "            self.loss = tf.reduce_mean(tf.squeeze(self.ISWeights_, axis=1) * tf.squared_difference(self.target_Q, self.Q))\n", | |
| "\n", | |
| "            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)" | |
| ] | |
| }, | |
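The aggregating layer in the class above combines the two streams as Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a')). A framework-free sketch of that arithmetic for a single state follows; `dueling_q_values` is a hypothetical helper written for illustration and the numbers are made up.

```python
def dueling_q_values(value, advantages):
    """Combine a state value and per-action advantages into Q values."""
    mean_advantage = sum(advantages) / len(advantages)
    # Subtracting the mean keeps V(s) and A(s,a) identifiable:
    # the Q values then average out to exactly V(s).
    return [value + (a - mean_advantage) for a in advantages]


q = dueling_q_values(2.0, [1.0, 3.0, -1.0])
print(q)
```

The ranking of actions is the same as the ranking of their advantages, so the greedy action is unchanged; only the common offset is moved into V(s).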
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Reset the graph\n", | |
| "tf.reset_default_graph()\n", | |
| "\n", | |
| "# Instantiate the DQNetwork\n", | |
| "DQNetwork = DDDQNNet(state_size, action_size, learning_rate, name=\"DQNetwork\")\n", | |
| "\n", | |
| "# Instantiate the target network\n", | |
| "TargetNetwork = DDDQNNet(state_size, action_size, learning_rate, name=\"TargetNetwork\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 6: Prioritized Experience Replay 🔁\n", | |
| "Now that we have created our Neural Network, **we need to implement the Prioritized Experience Replay method.** <br>\n", | |
| "\n", | |
| "As explained in the article, **we can't use a simple array because sampling from it would not be efficient, so we use a binary tree data structure (in a binary tree, each node has no more than 2 children).** More precisely, we use a sumtree, which is a binary tree where each parent node is the sum of its children.\n", | |
| "\n", | |
| "If you don't know what a binary tree is, check out this awesome video: https://www.youtube.com/watch?v=oSWTXtMglKE\n", | |
| "\n", | |
| "\n", | |
| "This SumTree implementation is taken from Morvan Zhou's Chinese course on Reinforcement Learning.\n", | |
| "\n", | |
| "To summarize:\n", | |
| "- **Step 1**: We construct a SumTree, which is a binary sum tree whose leaves contain the priorities, together with a data array whose indices map to the leaf indices.\n", | |
| "    <img src=\"https://cdn-images-1.medium.com/max/1200/1*Go9DNr7YY-wMGdIQ7HQduQ.png\" alt=\"SumTree\"/>\n", | |
| "    <br><br>\n", | |
| "    - **def __init__**: initialize our SumTree data object with all nodes = 0 and data (data array) with all values = 0.\n", | |
| "    - **def add**: add our priority score in the sumtree leaf and the experience (S, A, R, S', Done) in data.\n", | |
| "    - **def update**: update the leaf priority score and propagate the change through the tree.\n", | |
| "    - **def get_leaf**: retrieve the priority score, index and experience associated with a leaf.\n", | |
| "    - **def total_priority**: get the root node value to calculate the total priority score of our replay buffer.\n", | |
| "<br><br>\n", | |
| "- **Step 2**: We create a Memory object that will contain our sumtree and data.\n", | |
| "    - **def __init__**: generates our sumtree and data by instantiating the SumTree object.\n", | |
| "    - **def store**: we store a new experience in our tree. Each new experience will **have priority = max_priority** (this priority is then corrected during training, when we calculate the TD error and hence the priority score).\n", | |
| "    - **def sample**:\n", | |
| "        - First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges.\n", | |
| "        - Then a value is uniformly sampled from each range.\n", | |
| "        - We search the sumtree and retrieve the experiences whose priority scores correspond to the sampled values.\n", | |
| "        - Then, we calculate IS weights for each minibatch element.\n", | |
| "    - **def update_batch**: update the priorities on the tree" | |
| ] | |
| }, | |
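The sampling steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's `Memory.sample`: the helper names are hypothetical, the weights follow the usual prioritized-replay form w_i = (N * P(i))^(-beta), and they are normalised by the largest weight in the batch for simplicity.

```python
import random


def stratified_values(total_priority, k, rng):
    """Split [0, total_priority] into k equal ranges and draw one value per range."""
    segment = total_priority / k
    return [rng.uniform(i * segment, (i + 1) * segment) for i in range(k)]


def is_weights(sampled_priorities, total_priority, n, beta):
    """Importance-sampling weights, scaled so the largest weight in the batch is 1."""
    probs = [p / total_priority for p in sampled_priorities]
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


values = stratified_values(10.0, 5, random.Random(0))
weights = is_weights([1.0, 4.0], total_priority=10.0, n=4, beta=0.4)
print(values)
print(weights)
```

Each drawn value would then be looked up in the sumtree to find the matching experience. Experiences with a higher priority are sampled more often, so they receive a smaller weight to compensate.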
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "class SumTree(object):\n", | |
| "    \"\"\"\n", | |
| "    This SumTree code is a modified version of Morvan Zhou's: \n", | |
| "    https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/5.2_Prioritized_Replay_DQN/RL_brain.py\n", | |
| "    \"\"\"\n", | |
| "    data_pointer = 0\n", | |
| "\n", | |
| "    \"\"\"\n", | |
| "    Here we initialize the tree with all nodes = 0, and initialize the data with all values = 0\n", | |
| "    \"\"\"\n", | |
| "    def __init__(self, capacity):\n", | |
| "        self.capacity = capacity # Number of leaf nodes (final nodes) that contain experiences\n", | |
| "\n", | |
| "        # Generate the tree with all node values = 0\n", | |
| "        # To understand this calculation (2 * capacity - 1) look at the schema above\n", | |
| "        # Remember we are in a binary tree (each node has max 2 children) so 2x size of leaf (capacity) - 1 (root node)\n", | |
| "        # Parent nodes = capacity - 1\n", | |
| "        # Leaf nodes = capacity\n", | |
| "        self.tree = np.zeros(2 * capacity - 1)\n", | |
| "\n", | |
| "        \"\"\" tree:\n", | |
| "            0\n", | |
| "           / \\\n", | |
| "          0   0\n", | |
| "         / \\ / \\\n", | |
| "        0  0 0  0  [Size: capacity] this row holds the priority scores (aka pi)\n", | |
| "        \"\"\"\n", | |
| "\n", | |
| "        # Contains the experiences (so the size of data is capacity)\n", | |
| "        self.data = np.zeros(capacity, dtype=object)\n", | |
| " \n", | |
| " \n", | |
| "    \"\"\"\n", | |
| "    Here we add our priority score in the sumtree leaf and add the experience in data\n", | |
| "    \"\"\"\n", | |
| "    def add(self, priority, data):\n", | |
| "        # Look at what index we want to put the experience\n", | |
| "        tree_index = self.data_pointer + self.capacity - 1\n", | |
| "\n", | |
| "        \"\"\" tree:\n", | |
| "            0\n", | |
| "           / \\\n", | |
| "          0   0\n", | |
| "         / \\ / \\\n", | |
| "tree_index  0 0  0  We fill the leaves from left to right\n", | |
| "        \"\"\"\n", | |
| "\n", | |
| "        # Update data frame\n", | |
| "        self.data[self.data_pointer] = data\n", | |
| "\n", | |
| "        # Update the leaf\n", | |
| "        self.update(tree_index, priority)\n", | |
| "\n", | |
| "        # Add 1 to data_pointer\n", | |
| "        self.data_pointer += 1\n", | |
| "\n", | |
| "        if self.data_pointer >= self.capacity: # If we're above the capacity, we go back to the first index (we overwrite)\n", | |
| "            self.data_pointer = 0\n", | |
| " \n", | |
| " \n", | |
| "    \"\"\"\n", | |
| "    Update the leaf priority score and propagate the change through the tree\n", | |
| "    \"\"\"\n", | |
| "    def update(self, tree_index, priority):\n", | |
| "        # Change = new priority score - former priority score\n", | |
| "        change = priority - self.tree[tree_index]\n", | |
| "        self.tree[tree_index] = priority\n", | |
| "\n", | |
| "        # then propagate the change through the tree\n", | |
| "        while tree_index != 0: # this method is faster than the recursive loop in the reference code\n", | |
| "\n", | |
| "            \"\"\"\n", | |
| "            Here we want to access the line above\n", | |
| "            THE NUMBERS IN THIS TREE ARE THE INDEXES NOT THE PRIORITY VALUES\n", | |
| "\n", | |
| "                0\n", | |
| "               / \\\n", | |
| "              1   2\n", | |
| "             / \\ / \\\n", | |
| "            3  4 5  [6] \n", | |
| "\n", | |
| "            If we are in the leaf at index 6, we updated the priority score\n", | |
| "            We then need to update the node at index 2\n", | |
| "            So tree_index = (tree_index - 1) // 2\n", | |
| "            tree_index = (6-1)//2\n", | |
| "            tree_index = 2 (because // rounds the result down)\n", | |
| "            \"\"\"\n", | |
| "            tree_index = (tree_index - 1) // 2\n", | |
| "            self.tree[tree_index] += change\n", | |
| " \n", | |
| " \n", | |
| "    \"\"\"\n", | |
| "    Here we get the leaf_index, priority value of that leaf and experience associated with that index\n", | |
| "    \"\"\"\n", | |
| "    def get_leaf(self, v):\n", | |
| "        \"\"\"\n", | |
| "        Tree structure and array storage:\n", | |
| "        Tree index:\n", | |
| "             0         -> storing priority sum\n", | |
| "            / \\\n", | |
| "          1     2\n", | |
| "         / \\   / \\\n", | |
| "        3   4 5   6    -> storing priority for experiences\n", | |
| "        Array type for storing:\n", | |
| "        [0,1,2,3,4,5,6]\n", | |
| "        \"\"\"\n", | |
| "        parent_index = 0\n", | |
| "\n", | |
| "        while True: # the while loop is faster than the method in the reference code\n", | |
| "            left_child_index = 2 * parent_index + 1\n", | |
| "            right_child_index = left_child_index + 1\n", | |
| "\n", | |
| "            # If we reach bottom, end the search\n", | |
| "            if left_child_index >= len(self.tree):\n", | |
| "                leaf_index = parent_index\n", | |
| "                break\n", | |
| "\n", | |
| "            else: # downward search, always search for a higher priority node\n", | |
| "\n", | |
| "                if v <= self.tree[left_child_index]:\n", | |
| "                    parent_index = left_child_index\n", | |
| "\n", | |
| "                else:\n", | |
| "                    v -= self.tree[left_child_index]\n", | |
| "                    parent_index = right_child_index\n", | |
| "\n", | |
| "        data_index = leaf_index - self.capacity + 1\n", | |
| "\n", | |
| "        return leaf_index, self.tree[leaf_index], self.data[data_index]\n", | |
| "\n", | |
| "    @property\n", | |
| "    def total_priority(self):\n", | |
| "        return self.tree[0] # Returns the root node" | |
| ] | |
| }, | |
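To make the index arithmetic concrete, here is a compact, list-based sketch of the same idea with four leaves. `TinySumTree` is a hypothetical stand-in written for illustration (plain lists instead of NumPy arrays); it mirrors the add / update / get_leaf logic above but is not the notebook's class.

```python
class TinySumTree:
    """Illustrative sum tree: leaves hold priorities, parents hold sums."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # capacity - 1 parents + capacity leaves
        self.data = [None] * capacity
        self.pointer = 0

    def add(self, priority, item):
        leaf = self.pointer + self.capacity - 1
        self.data[self.pointer] = item
        self.update(leaf, priority)
        self.pointer = (self.pointer + 1) % self.capacity  # overwrite when full

    def update(self, index, priority):
        change = priority - self.tree[index]
        self.tree[index] = priority
        while index != 0:  # walk up to the root, adding the change to each parent
            index = (index - 1) // 2
            self.tree[index] += change

    def get_leaf(self, v):
        parent = 0
        while True:
            left = 2 * parent + 1
            if left >= len(self.tree):  # no children: parent is a leaf
                break
            if v <= self.tree[left]:
                parent = left
            else:
                v -= self.tree[left]
                parent = left + 1
        return parent, self.tree[parent], self.data[parent - self.capacity + 1]


tree = TinySumTree(4)
for priority, item in [(1.0, "a"), (2.0, "b"), (3.0, "c"), (4.0, "d")]:
    tree.add(priority, item)

print(tree.tree[0])        # total priority stored at the root
print(tree.get_leaf(5.0))  # leaf index, priority and experience for the value 5.0
```

With priorities 1, 2, 3 and 4 the leaves cover the ranges [0, 1], (1, 3], (3, 6] and (6, 10], so a value drawn uniformly from [0, 10] lands on each experience with probability proportional to its priority.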
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Here we don't use deque anymore" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "class Memory(object):  # stored as ( s, a, r, s_ ) in SumTree\n", | |
| "    \"\"\"\n", | |
| "    This SumTree code is a modified version of the original code from:\n", | |
| "    https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py\n", | |
| "    \"\"\"\n", | |
| "    PER_e = 0.01  # Small constant added to every error so that no experience has 0 probability of being sampled\n", | |
| "    PER_a = 0.6  # Trade-off between sampling only high-priority experiences and sampling uniformly at random\n", | |
| "    PER_b = 0.4  # Importance-sampling exponent, increased from this initial value up to 1\n", | |
| "\n", | |
| "    PER_b_increment_per_sampling = 0.001\n", | |
| "\n", | |
| "    absolute_error_upper = 1.  # clipped abs error\n", | |
| "\n", | |
| "    def __init__(self, capacity):\n", | |
| "        # Making the tree\n", | |
| "        \"\"\"\n", | |
| "        Remember that our memory is composed of a sum tree that contains the priority scores at its leaves\n", | |
| "        and also a data array.\n", | |
| "        We don't use deque because it means that at each timestep our experiences change index by one.\n", | |
| "        We prefer to use a simple array and to overwrite when the memory is full.\n", | |
| "        \"\"\"\n", | |
| "        self.tree = SumTree(capacity)\n", | |
| " \n", | |
| "    \"\"\"\n", | |
| "    Store a new experience in our tree\n", | |
| "    Each new experience has a score of max_priority (it will be refined when we use this exp to train our DDQN)\n", | |
| "    \"\"\"\n", | |
| "    def store(self, experience):\n", | |
| "        # Find the max priority among the leaves\n", | |
| "        max_priority = np.max(self.tree.tree[-self.tree.capacity:])\n", | |
| "\n", | |
| "        # If the max priority = 0 we can't use priority = 0, since this exp would never have a chance to be selected\n", | |
| "        # So we fall back to the upper bound of the clipped error\n", | |
| "        if max_priority == 0:\n", | |
| "            max_priority = self.absolute_error_upper\n", | |
| "\n", | |
| "        self.tree.add(max_priority, experience)  # set the max p for new p\n", | |
| "\n", | |
| " \n", | |
| "    \"\"\"\n", | |
| "    - First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges.\n", | |
| "    - Then a value is uniformly sampled from each range.\n", | |
| "    - We search the sumtree and retrieve the experience whose priority range contains each sampled value.\n", | |
| "    - Then, we calculate IS weights for each minibatch element.\n", | |
| "    \"\"\"\n", | |
| "    def sample(self, n):\n", | |
| "        # Create a sample list that will contain the minibatch\n", | |
| "        memory_b = []\n", | |
| "\n", | |
| "        b_idx, b_ISWeights = np.empty((n,), dtype=np.int32), np.empty((n, 1), dtype=np.float32)\n", | |
| "\n", | |
| "        # Calculate the priority segment\n", | |
| "        # Here, as explained in the paper, we divide the range [0, ptotal] into n ranges\n", | |
| "        priority_segment = self.tree.total_priority / n\n", | |
| "\n", | |
| "        # Here we increase PER_b each time we sample a new minibatch\n", | |
| "        self.PER_b = np.min([1., self.PER_b + self.PER_b_increment_per_sampling])  # max = 1\n", | |
| "\n", | |
| "        # Calculating the max_weight\n", | |
| "        # Only leaves that hold an experience (priority > 0) are considered: while the memory\n", | |
| "        # is not full the empty leaves are 0, which would make max_weight infinite\n", | |
| "        leaf_priorities = self.tree.tree[-self.tree.capacity:]\n", | |
| "        p_min = np.min(leaf_priorities[leaf_priorities > 0]) / self.tree.total_priority\n", | |
| "        max_weight = (p_min * n) ** (-self.PER_b)\n", | |
| "\n", | |
| "        for i in range(n):\n", | |
| "            # A value is uniformly sampled from each range\n", | |
| "            a, b = priority_segment * i, priority_segment * (i + 1)\n", | |
| "            value = np.random.uniform(a, b)\n", | |
| "\n", | |
| "            # The experience that corresponds to each value is retrieved\n", | |
| "            index, priority, data = self.tree.get_leaf(value)\n", | |
| "\n", | |
| "            # P(j)\n", | |
| "            sampling_probabilities = priority / self.tree.total_priority\n", | |
| "\n", | |
| "            # IS = (1/N * 1/P(i))**b / max wi == (N*P(i))**-b / max wi\n", | |
| "            b_ISWeights[i, 0] = np.power(n * sampling_probabilities, -self.PER_b) / max_weight\n", | |
| "\n", | |
| "            b_idx[i] = index\n", | |
| "\n", | |
| "            experience = [data]\n", | |
| "\n", | |
| "            memory_b.append(experience)\n", | |
| "\n", | |
| "        return b_idx, memory_b, b_ISWeights\n", | |
| " \n", | |
| "    \"\"\"\n", | |
| "    Update the priorities on the tree\n", | |
| "    \"\"\"\n", | |
| "    def batch_update(self, tree_idx, abs_errors):\n", | |
| "        abs_errors += self.PER_e  # add a small constant so that no priority is 0\n", | |
| "        clipped_errors = np.minimum(abs_errors, self.absolute_error_upper)\n", | |
| "        ps = np.power(clipped_errors, self.PER_a)\n", | |
| "\n", | |
| "        for ti, p in zip(tree_idx, ps):\n", | |
| "            self.tree.update(ti, p)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Here we'll **deal with the empty memory problem**: we pre-populate our memory by taking random actions and storing the experience." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Instantiate memory\n", | |
| "memory = Memory(memory_size)\n", | |
| "\n", | |
| "# Start a new episode\n", | |
| "game.new_episode()\n", | |
| "\n", | |
| "for i in range(pretrain_length):\n", | |
| "    # If it's the first step\n", | |
| "    if i == 0:\n", | |
| "        # First we need a state\n", | |
| "        state = game.get_state().screen_buffer\n", | |
| "        state, stacked_frames = stack_frames(stacked_frames, state, True)\n", | |
| "\n", | |
| "    # Random action\n", | |
| "    action = random.choice(possible_actions)\n", | |
| "\n", | |
| "    # Get the reward\n", | |
| "    reward = game.make_action(action)\n", | |
| "\n", | |
| "    # Check whether the episode is finished\n", | |
| "    done = game.is_episode_finished()\n", | |
| "\n", | |
| "    # If we're dead\n", | |
| "    if done:\n", | |
| "        # The episode is finished, so there is no next state\n", | |
| "        next_state = np.zeros(state.shape)\n", | |
| "\n", | |
| "        # Add experience to memory\n", | |
| "        experience = state, action, reward, next_state, done\n", | |
| "        memory.store(experience)\n", | |
| "\n", | |
| "        # Start a new episode\n", | |
| "        game.new_episode()\n", | |
| "\n", | |
| "        # First we need a state\n", | |
| "        state = game.get_state().screen_buffer\n", | |
| "\n", | |
| "        # Stack the frames\n", | |
| "        state, stacked_frames = stack_frames(stacked_frames, state, True)\n", | |
| "\n", | |
| "    else:\n", | |
| "        # Get the next state\n", | |
| "        next_state = game.get_state().screen_buffer\n", | |
| "        next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n", | |
| "\n", | |
| "        # Add experience to memory\n", | |
| "        experience = state, action, reward, next_state, done\n", | |
| "        memory.store(experience)\n", | |
| "\n", | |
| "        # Our state is now the next_state\n", | |
| "        state = next_state" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 7: Set up Tensorboard 📊\n", | |
| "For more information about TensorBoard, please watch this <a href=\"https://www.youtube.com/embed/eBbEDRsCmv4\">excellent 30min tutorial</a> <br><br>\n", | |
| "To launch TensorBoard: `tensorboard --logdir=/tensorboard/dddqn/1`" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Setup TensorBoard Writer\n", | |
| "writer = tf.summary.FileWriter(\"/tensorboard/dddqn/1\")\n", | |
| "\n", | |
| "## Losses\n", | |
| "tf.summary.scalar(\"Loss\", DQNetwork.loss)\n", | |
| "\n", | |
| "write_op = tf.summary.merge_all()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 8: Train our Agent 🏃♂️\n", | |
| "\n", | |
| "Our algorithm:\n", | |
| "<br>\n", | |
| "* Initialize the weights for DQN\n", | |
| "* Initialize target value weights $w^- \\leftarrow w$\n", | |
| "* Init the environment\n", | |
| "* Initialize the decay step (used to reduce epsilon)\n", | |
| "<br><br>\n", | |
| "* **For** episode to max_episode **do**\n", | |
| "    * Make new episode\n", | |
| "    * Set step to 0\n", | |
| "    * Observe the first state $s_0$\n", | |
| "    <br><br>\n", | |
| "    * **While** step < max_steps **do**:\n", | |
| "        * Increase decay_step\n", | |
| "        * With probability $\\epsilon$ select a random action $a_t$, otherwise select $a_t = \\mathrm{argmax}_a Q(s_t,a)$\n", | |
| "        * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$\n", | |
| "        * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$\n", | |
| "\n", | |
| "        * Sample a prioritized mini-batch from $D$: $<s, a, r, s'>$\n", | |
| "        * Set target $\\hat{Q} = r$ if the episode ends at $s_{t+1}$, otherwise set $\\hat{Q} = r + \\gamma Q(s', \\mathrm{argmax}_{a'} Q(s', a', w), w^-)$\n", | |
| "        * Make a gradient descent step with loss $(\\hat{Q} - Q(s, a))^2$\n", | |
| "        * Every C steps, reset: $w^- \\leftarrow w$\n", | |
| "    * **endwhile**\n", | |
| "    <br><br>\n", | |
| "* **endfor**\n", | |
| "\n", | |
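| "The exploration schedule, target and loss in the steps above can be written out explicitly. This is a sketch in the notation of the list, not a definition taken from the network code: $\\lambda$ stands for `decay_rate`, $t$ for `decay_step`, $k$ for the mini-batch size, and weighting each squared error by the importance-sampling weight $w_i$ returned by `Memory.sample` is an assumption based on the `ISWeights_` placeholder fed during training.\n", | |
| "\n", | |
| "$$\\epsilon_t = \\epsilon_{stop} + (\\epsilon_{start} - \\epsilon_{stop}) \\, e^{-\\lambda t}$$\n", | |
| "\n", | |
| "$$a^* = \\mathrm{argmax}_{a'} Q(s', a', w)$$\n", | |
| "\n", | |
| "$$\\hat{Q} = \\begin{cases} r & \\text{if } s' \\text{ is terminal} \\\\ r + \\gamma Q(s', a^*, w^-) & \\text{otherwise} \\end{cases}$$\n", | |
| "\n", | |
| "$$L = \\frac{1}{k} \\sum_{i=1}^{k} w_i \\left(\\hat{Q}_i - Q(s_i, a_i, w)\\right)^2$$\n", | |
| "\n", | |
| "The online network (weights $w$) only chooses $a^*$; the target network (weights $w^-$) supplies the value of that action, which is what makes this a Double DQN update.\n", | |
| "\n", | |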
| " " | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "\"\"\"\n", | |
| "This function implements the step:\n", | |
| "With probability ϵ select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)\n", | |
| "\"\"\"\n", | |
| "def predict_action(explore_start, explore_stop, decay_rate, decay_step, state, actions):\n", | |
| "    ## EPSILON GREEDY STRATEGY\n", | |
| "    # Choose action a from state s using epsilon greedy.\n", | |
| "    ## First we draw a random number\n", | |
| "    exp_exp_tradeoff = np.random.rand()\n", | |
| "\n", | |
| "    # Here we'll use an improved version of our epsilon greedy strategy used in Q-learning notebook\n", | |
| "    explore_probability = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * decay_step)\n", | |
| "\n", | |
| "    if (explore_probability > exp_exp_tradeoff):\n", | |
| "        # Make a random action (exploration)\n", | |
| "        action = random.choice(actions)\n", | |
| "\n", | |
| "    else:\n", | |
| "        # Get action from Q-network (exploitation)\n", | |
| "        # Estimate the Q values for this state\n", | |
| "        Qs = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: state.reshape((1, *state.shape))})\n", | |
| "\n", | |
| "        # Take the biggest Q value (= the best action)\n", | |
| "        choice = np.argmax(Qs)\n", | |
| "        action = actions[int(choice)]\n", | |
| "\n", | |
| "    return action, explore_probability" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# This function helps us to copy one set of variables to another\n", | |
| "# In our case we use it when we want to copy the parameters of DQN to Target_network\n", | |
| "# Thanks to the very good implementation of Arthur Juliani https://github.com/awjuliani\n", | |
| "def update_target_graph():\n", | |
| "\n", | |
| "    # Get the parameters of our DQNNetwork\n", | |
| "    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, \"DQNetwork\")\n", | |
| "\n", | |
| "    # Get the parameters of our Target_network\n", | |
| "    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, \"TargetNetwork\")\n", | |
| "\n", | |
| "    op_holder = []\n", | |
| "\n", | |
| "    # Update our target_network parameters with DQNNetwork parameters\n", | |
| "    for from_var, to_var in zip(from_vars, to_vars):\n", | |
| "        op_holder.append(to_var.assign(from_var))\n", | |
| "    return op_holder" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Saver will help us to save our model\n", | |
| "saver = tf.train.Saver()\n", | |
| "\n", | |
| "if training:\n", | |
| "    with tf.Session() as sess:\n", | |
| "        # Initialize the variables\n", | |
| "        sess.run(tf.global_variables_initializer())\n", | |
| "\n", | |
| "        # Initialize the decay step (used to reduce epsilon)\n", | |
| "        decay_step = 0\n", | |
| "\n", | |
| "        # Set tau = 0\n", | |
| "        tau = 0\n", | |
| "\n", | |
| "        # Init the game\n", | |
| "        game.init()\n", | |
| "\n", | |
| "        # Update the parameters of our TargetNetwork with DQN_weights\n", | |
| "        update_target = update_target_graph()\n", | |
| "        sess.run(update_target)\n", | |
| "\n", | |
| "        for episode in range(total_episodes):\n", | |
| "            # Set step to 0\n", | |
| "            step = 0\n", | |
| "\n", | |
| "            # Initialize the rewards of the episode\n", | |
| "            episode_rewards = []\n", | |
| "\n", | |
| "            # Make a new episode and observe the first state\n", | |
| "            game.new_episode()\n", | |
| "\n", | |
| "            state = game.get_state().screen_buffer\n", | |
| "\n", | |
| "            # Remember that the stack_frames function also calls our preprocess function.\n", | |
| "            state, stacked_frames = stack_frames(stacked_frames, state, True)\n", | |
| "\n", | |
| "            while step < max_steps:\n", | |
| "                step += 1\n", | |
| "\n", | |
| "                # Increase the C step\n", | |
| "                tau += 1\n", | |
| "\n", | |
| "                # Increase decay_step\n", | |
| "                decay_step += 1\n", | |
| "\n", | |
| "                # With probability ϵ select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a)\n", | |
| "                action, explore_probability = predict_action(explore_start, explore_stop, decay_rate, decay_step, state, possible_actions)\n", | |
| "\n", | |
| "                # Do the action\n", | |
| "                reward = game.make_action(action)\n", | |
| "\n", | |
| "                # Check whether the episode is finished\n", | |
| "                done = game.is_episode_finished()\n", | |
| "\n", | |
| "                # Add the reward to total reward\n", | |
| "                episode_rewards.append(reward)\n", | |
| "\n", | |
| "                # If the game is finished\n", | |
| "                if done:\n", | |
| "                    # The episode ends so there is no next state\n", | |
| "                    # (np.int was removed from recent NumPy releases; the builtin int is equivalent)\n", | |
| "                    next_state = np.zeros((120,140), dtype=int)\n", | |
| "                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n", | |
| "\n", | |
| "                    # Set step = max_steps to end the episode\n", | |
| "                    step = max_steps\n", | |
| "\n", | |
| "                    # Get the total reward of the episode\n", | |
| "                    total_reward = np.sum(episode_rewards)\n", | |
| "\n", | |
| "                    print('Episode: {}'.format(episode),\n", | |
| "                          'Total reward: {}'.format(total_reward),\n", | |
| "                          'Training loss: {:.4f}'.format(loss),\n", | |
| "                          'Explore P: {:.4f}'.format(explore_probability))\n", | |
| "\n", | |
| "                    # Add experience to memory\n", | |
| "                    experience = state, action, reward, next_state, done\n", | |
| "                    memory.store(experience)\n", | |
| "\n", | |
| "                else:\n", | |
| "                    # Get the next state\n", | |
| "                    next_state = game.get_state().screen_buffer\n", | |
| "\n", | |
| "                    # Stack the frame of the next_state\n", | |
| "                    next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n", | |
| "\n", | |
| "                    # Add experience to memory\n", | |
| "                    experience = state, action, reward, next_state, done\n", | |
| "                    memory.store(experience)\n", | |
| "\n", | |
| "                    # s_{t+1} is now our current state\n", | |
| "                    state = next_state\n", | |
| "\n", | |
| "\n", | |
| "                ### LEARNING PART\n", | |
| "                # Sample a prioritized mini-batch from memory\n", | |
| "                tree_idx, batch, ISWeights_mb = memory.sample(batch_size)\n", | |
| "\n", | |
| "                states_mb = np.array([each[0][0] for each in batch], ndmin=3)\n", | |
| "                actions_mb = np.array([each[0][1] for each in batch])\n", | |
| "                rewards_mb = np.array([each[0][2] for each in batch])\n", | |
| "                next_states_mb = np.array([each[0][3] for each in batch], ndmin=3)\n", | |
| "                dones_mb = np.array([each[0][4] for each in batch])\n", | |
| "\n", | |
| "                target_Qs_batch = []\n", | |
| "\n", | |
| "\n", | |
| "                ### DOUBLE DQN Logic\n", | |
| "                # Use DQNNetwork to select the action to take at next_state (a') (action with the highest Q-value)\n", | |
| "                # Use TargetNetwork to calculate the Q_val of Q(s',a')\n", | |
| "\n", | |
| "                # Get Q values for next_state\n", | |
| "                q_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})\n", | |
| "\n", | |
| "                # Calculate Qtarget for all actions at that state\n", | |
| "                q_target_next_state = sess.run(TargetNetwork.output, feed_dict = {TargetNetwork.inputs_: next_states_mb})\n", | |
| "\n", | |
| "\n", | |
| "                # Set Q_target = r if the episode ends at s_{t+1}, otherwise set Q_target = r + gamma * Qtarget(s',a')\n", | |
| "                for i in range(0, len(batch)):\n", | |
| "                    terminal = dones_mb[i]\n", | |
| "\n", | |
| "                    # We got a'\n", | |
| "                    action = np.argmax(q_next_state[i])\n", | |
| "\n", | |
| "                    # If we are in a terminal state, the target is only the reward\n", | |
| "                    if terminal:\n", | |
| "                        target_Qs_batch.append(rewards_mb[i])\n", | |
| "\n", | |
| "                    else:\n", | |
| "                        # Take the Qtarget for action a'\n", | |
| "                        target = rewards_mb[i] + gamma * q_target_next_state[i][action]\n", | |
| "                        target_Qs_batch.append(target)\n", | |
| "\n", | |
| "\n", | |
| "                targets_mb = np.array(target_Qs_batch)\n", | |
| "\n", | |
| "\n", | |
| "                _, loss, absolute_errors = sess.run([DQNetwork.optimizer, DQNetwork.loss, DQNetwork.absolute_errors],\n", | |
| "                                                    feed_dict={DQNetwork.inputs_: states_mb,\n", | |
| "                                                               DQNetwork.target_Q: targets_mb,\n", | |
| "                                                               DQNetwork.actions_: actions_mb,\n", | |
| "                                                               DQNetwork.ISWeights_: ISWeights_mb})\n", | |
| "\n", | |
| "\n", | |
| "\n", | |
| "                # Update priority\n", | |
| "                memory.batch_update(tree_idx, absolute_errors)\n", | |
| "\n", | |
| "\n", | |
| "                # Write TF Summaries\n", | |
| "                summary = sess.run(write_op, feed_dict={DQNetwork.inputs_: states_mb,\n", | |
| "                                                        DQNetwork.target_Q: targets_mb,\n", | |
| "                                                        DQNetwork.actions_: actions_mb,\n", | |
| "                                                        DQNetwork.ISWeights_: ISWeights_mb})\n", | |
| "                writer.add_summary(summary, episode)\n", | |
| "                writer.flush()\n", | |
| "\n", | |
| "                if tau > max_tau:\n", | |
| "                    # Update the parameters of our TargetNetwork with DQN_weights\n", | |
| "                    # (reuse the assign ops built before the loop instead of adding new ops to the graph)\n", | |
| "                    sess.run(update_target)\n", | |
| "                    tau = 0\n", | |
| "                    print(\"Model updated\")\n", | |
| "\n", | |
| "            # Save model every 5 episodes\n", | |
| "            if episode % 5 == 0:\n", | |
| "                save_path = saver.save(sess, \"./models/model.ckpt\")\n", | |
| "                print(\"Model Saved\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "raw", | |
| "metadata": {}, | |
| "source": [ | |
| "Episode: 0 Total reward: -89.26519775390625 Training loss: 0.6859 Explore P: 0.9919\n", | |
| "Model Saved\n", | |
| "Episode: 1 Total reward: -115.13249206542969 Training loss: 1.0044 Explore P: 0.9872\n", | |
| "Episode: 2 Total reward: -88.6678466796875 Training loss: 30.3196 Explore P: 0.9760\n", | |
| "Episode: 3 Total reward: -79.75584411621094 Training loss: 0.4285 Explore P: 0.9687\n", | |
| "Episode: 4 Total reward: -112.888916015625 Training loss: 17.6260 Explore P: 0.9616\n", | |
| "Episode: 5 Total reward: -72.01809692382812 Training loss: 0.3325 Explore P: 0.9566\n", | |
| "Model Saved\n", | |
| "Episode: 6 Total reward: -91.27947998046875 Training loss: 11.2775 Explore P: 0.9489\n", | |
| "Episode: 7 Total reward: -96.70150756835938 Training loss: 0.8339 Explore P: 0.9412\n", | |
| "Episode: 8 Total reward: -89.98007202148438 Training loss: 10.4413 Explore P: 0.9368\n", | |
| "Episode: 9 Total reward: -78.67872619628906 Training loss: 13.1729 Explore P: 0.9292\n", | |
| "Episode: 10 Total reward: -114.67742919921875 Training loss: 30.3574 Explore P: 0.9192\n", | |
| "Model Saved\n", | |
| "Episode: 11 Total reward: -96.40922546386719 Training loss: 1.3338 Explore P: 0.9128\n", | |
| "Episode: 12 Total reward: -55.94306945800781 Training loss: 1.0281 Explore P: 0.9030\n", | |
| "Episode: 13 Total reward: -90.61083984375 Training loss: 17.3349 Explore P: 0.8908\n", | |
| "Episode: 14 Total reward: -110.54817199707031 Training loss: 11.5495 Explore P: 0.8845\n", | |
| "Episode: 15 Total reward: -86.7437744140625 Training loss: 8.5952 Explore P: 0.8773\n", | |
| "Model Saved\n", | |
| "Episode: 16 Total reward: -108.89964294433594 Training loss: 1.6884 Explore P: 0.8730\n", | |
| "Episode: 17 Total reward: -59.06085205078125 Training loss: 1.4325 Explore P: 0.8659\n", | |
| "Episode: 18 Total reward: -102.50300598144531 Training loss: 1.0175 Explore P: 0.8593\n", | |
| "Episode: 19 Total reward: -100.752685546875 Training loss: 7.5007 Explore P: 0.8556\n", | |
| "Episode: 20 Total reward: -111.81524658203125 Training loss: 5.0652 Explore P: 0.8487\n", | |
| "Model Saved\n", | |
| "Episode: 21 Total reward: -104.21478271484375 Training loss: 1.5522 Explore P: 0.8395\n", | |
| "Episode: 22 Total reward: -112.78564453125 Training loss: 5.5151 Explore P: 0.8359\n", | |
| "Episode: 23 Total reward: -82.05340576171875 Training loss: 1.2886 Explore P: 0.8292\n", | |
| "Episode: 24 Total reward: -111.15492248535156 Training loss: 0.5443 Explore P: 0.8225\n", | |
| "Episode: 25 Total reward: -73.76707458496094 Training loss: 1.5218 Explore P: 0.8156\n", | |
| "Model Saved\n", | |
| "Episode: 26 Total reward: -93.35487365722656 Training loss: 0.9257 Explore P: 0.8088\n", | |
| "Episode: 27 Total reward: -104.47758483886719 Training loss: 2.1761 Explore P: 0.8023\n", | |
| "Episode: 28 Total reward: -81.60890197753906 Training loss: 0.5991 Explore P: 0.7958\n", | |
| "Episode: 29 Total reward: -100.63589477539062 Training loss: 1.7395 Explore P: 0.7894\n", | |
| "Episode: 30 Total reward: -88.62884521484375 Training loss: 8.3556 Explore P: 0.7830\n", | |
| "Model Saved\n", | |
| "Episode: 31 Total reward: -105.23612976074219 Training loss: 0.8298 Explore P: 0.7767\n", | |
| "Episode: 32 Total reward: -111.5128173828125 Training loss: 0.6878 Explore P: 0.7711\n", | |
| "Episode: 33 Total reward: -107.63644409179688 Training loss: 0.7860 Explore P: 0.7651\n", | |
| "Episode: 34 Total reward: -99.78999328613281 Training loss: 1.9388 Explore P: 0.7567\n", | |
| "Episode: 35 Total reward: -107.68731689453125 Training loss: 3.5948 Explore P: 0.7481\n", | |
| "Model Saved\n", | |
| "Episode: 36 Total reward: -112.137451171875 Training loss: 0.4563 Explore P: 0.7421\n", | |
| "Episode: 37 Total reward: -50.57890319824219 Training loss: 0.5308 Explore P: 0.7361\n", | |
| "Episode: 38 Total reward: -73.00382995605469 Training loss: 1.9759 Explore P: 0.7302\n", | |
| "Episode: 39 Total reward: -80.82208251953125 Training loss: 0.2969 Explore P: 0.7243\n", | |
| "Episode: 40 Total reward: -97.41578674316406 Training loss: 16.1484 Explore P: 0.7185\n", | |
| "Model Saved\n", | |
| "Episode: 41 Total reward: -77.568115234375 Training loss: 0.2420 Explore P: 0.7128\n", | |
| "Episode: 42 Total reward: -103.93637084960938 Training loss: 0.1838 Explore P: 0.7026\n", | |
| "Episode: 43 Total reward: -81.61286926269531 Training loss: 0.3259 Explore P: 0.6948\n", | |
| "Episode: 44 Total reward: -91.02716064453125 Training loss: 0.3337 Explore P: 0.6859\n", | |
| "Episode: 45 Total reward: -98.70729064941406 Training loss: 2.1673 Explore P: 0.6804\n", | |
| "Model Saved\n", | |
| "Episode: 46 Total reward: -115.98574829101562 Training loss: 14.9863 Explore P: 0.6726\n", | |
| "Episode: 47 Total reward: -100.81024169921875 Training loss: 2.0342 Explore P: 0.6654\n", | |
| "Episode: 48 Total reward: -60.25152587890625 Training loss: 0.2753 Explore P: 0.6569\n", | |
| "Episode: 49 Total reward: -67.41098022460938 Training loss: 0.3018 Explore P: 0.6486\n", | |
| "Episode: 50 Total reward: -105.46267700195312 Training loss: 1.0995 Explore P: 0.6413\n", | |
| "Model Saved\n", | |
| "Episode: 51 Total reward: -73.07460021972656 Training loss: 0.1813 Explore P: 0.6362\n", | |
| "Episode: 52 Total reward: -96.30844116210938 Training loss: 0.2939 Explore P: 0.6310\n", | |
| "Episode: 53 Total reward: -94.21073913574219 Training loss: 0.4776 Explore P: 0.6284\n", | |
| "Episode: 54 Total reward: -65.328125 Training loss: 0.2104 Explore P: 0.6233\n", | |
| "Episode: 55 Total reward: -66.21479797363281 Training loss: 3.2012 Explore P: 0.6183\n", | |
| "Model Saved\n", | |
| "Episode: 56 Total reward: -94.83515930175781 Training loss: 0.5179 Explore P: 0.6136\n", | |
| "Episode: 57 Total reward: -92.63566589355469 Training loss: 7.6108 Explore P: 0.6068\n", | |
| "Episode: 58 Total reward: -114.22836303710938 Training loss: 0.1981 Explore P: 0.5979\n", | |
| "Episode: 59 Total reward: -109.301025390625 Training loss: 0.1633 Explore P: 0.5931\n", | |
| "Episode: 60 Total reward: -69.18382263183594 Training loss: 0.3027 Explore P: 0.5883\n", | |
| "Model Saved\n", | |
| "Episode: 61 Total reward: -96.5882568359375 Training loss: 0.2388 Explore P: 0.5856\n", | |
| "Episode: 62 Total reward: -115.95585632324219 Training loss: 0.2598 Explore P: 0.5815\n", | |
| "Episode: 63 Total reward: -91.42893981933594 Training loss: 3.1792 Explore P: 0.5768\n", | |
| "Episode: 64 Total reward: -78.47196960449219 Training loss: 0.1737 Explore P: 0.5722\n", | |
| "Episode: 65 Total reward: -33.51860046386719 Training loss: 16.5782 Explore P: 0.5676\n", | |
| "Model Saved\n", | |
| "Episode: 66 Total reward: -52.46026611328125 Training loss: 0.7277 Explore P: 0.5630\n", | |
| "Episode: 67 Total reward: -104.60054016113281 Training loss: 0.1622 Explore P: 0.5585\n", | |
| "Episode: 68 Total reward: -77.99497985839844 Training loss: 2.5138 Explore P: 0.5540\n", | |
| "Episode: 69 Total reward: -54.47041320800781 Training loss: 0.1590 Explore P: 0.5496\n", | |
| "Episode: 70 Total reward: -63.22991943359375 Training loss: 0.1965 Explore P: 0.5452\n", | |
| "Model Saved\n", | |
| "Episode: 71 Total reward: -87.78546142578125 Training loss: 0.3122 Explore P: 0.5375\n", | |
| "Episode: 72 Total reward: -96.14764404296875 Training loss: 0.1515 Explore P: 0.5351\n", | |
| "Episode: 73 Total reward: -69.32623291015625 Training loss: 2.8430 Explore P: 0.5308\n", | |
| "Episode: 74 Total reward: -13.840484619140625 Training loss: 0.2721 Explore P: 0.5266\n", | |
| "Episode: 75 Total reward: -89.6734619140625 Training loss: 0.1506 Explore P: 0.5213\n", | |
| "Model Saved\n", | |
| "Episode: 76 Total reward: -64.53419494628906 Training loss: 1.8367 Explore P: 0.5171\n", | |
| "Episode: 77 Total reward: -106.41300964355469 Training loss: 0.3183 Explore P: 0.5072\n", | |
| "Episode: 78 Total reward: -50.4837646484375 Training loss: 0.2255 Explore P: 0.5033\n", | |
| "Episode: 79 Total reward: -34.91241455078125 Training loss: 0.1923 Explore P: 0.4976\n", | |
| "Episode: 80 Total reward: -115.21119689941406 Training loss: 0.1336 Explore P: 0.4950\n", | |
| "Model Saved\n", | |
| "Episode: 81 Total reward: -73.21771240234375 Training loss: 0.1376 Explore P: 0.4911\n", | |
| "Episode: 82 Total reward: -62.74360656738281 Training loss: 0.6687 Explore P: 0.4871\n", | |
| "Episode: 83 Total reward: -15.30194091796875 Training loss: 0.1503 Explore P: 0.4778\n", | |
| "Episode: 84 Total reward: -74.79470825195312 Training loss: 0.1727 Explore P: 0.4740\n", | |
| "Episode: 85 Total reward: -54.167205810546875 Training loss: 0.1432 Explore P: 0.4702\n", | |
| "Model Saved\n", | |
| "Episode: 86 Total reward: -62.83433532714844 Training loss: 0.1632 Explore P: 0.4665\n", | |
| "Episode: 87 Total reward: -82.97991943359375 Training loss: 0.1923 Explore P: 0.4644\n", | |
| "Episode: 88 Total reward: -72.07733154296875 Training loss: 0.2274 Explore P: 0.4607\n", | |
| "Episode: 89 Total reward: -55.19401550292969 Training loss: 0.1261 Explore P: 0.4570\n", | |
| "Episode: 90 Total reward: -76.98689270019531 Training loss: 0.7601 Explore P: 0.4505\n", | |
| "Model Saved\n", | |
| "Episode: 91 Total reward: -65.32528686523438 Training loss: 0.3138 Explore P: 0.4469\n", | |
| "Episode: 92 Total reward: -50.588714599609375 Training loss: 0.2203 Explore P: 0.4435\n", | |
| "Episode: 93 Total reward: -70.39730834960938 Training loss: 1.2486 Explore P: 0.4415\n", | |
| "\n", | |
| "Episode: 94 Total reward: 70.74258422851562 Training loss: 0.4045 Explore P: 0.4366\n", | |
| "Episode: 95 Total reward: -11.190460205078125 Training loss: 0.2244 Explore P: 0.4331\n", | |
| "Model Saved\n", | |
| "Episode: 96 Total reward: -22.803070068359375 Training loss: 0.4332 Explore P: 0.4297\n", | |
| "Episode: 97 Total reward: -43.600616455078125 Training loss: 2.4079 Explore P: 0.4265\n", | |
| "Episode: 98 Total reward: -74.661376953125 Training loss: 0.3113 Explore P: 0.4246\n", | |
| "Episode: 99 Total reward: -32.23060607910156 Training loss: 0.1899 Explore P: 0.4212\n", | |
| "Episode: 100 Total reward: -66.32485961914062 Training loss: 0.1400 Explore P: 0.4167\n", | |
| "Model Saved\n", | |
| "Episode: 101 Total reward: -15.644882202148438 Training loss: 0.0826 Explore P: 0.4134\n", | |
| "Episode: 102 Total reward: 44.1182861328125 Training loss: 0.1348 Explore P: 0.4101\n", | |
| "Episode: 103 Total reward: -61.74578857421875 Training loss: 0.6734 Explore P: 0.4058\n", | |
| "Episode: 104 Total reward: -87.16415405273438 Training loss: 0.2358 Explore P: 0.4026\n", | |
| "Episode: 105 Total reward: -90.69143676757812 Training loss: 0.4390 Explore P: 0.3939\n", | |
| "Model Saved\n", | |
| "Episode: 106 Total reward: -56.23359680175781 Training loss: 0.1456 Explore P: 0.3908\n", | |
| "Episode: 107 Total reward: -41.05461120605469 Training loss: 0.9647 Explore P: 0.3877\n", | |
| "Episode: 108 Total reward: -1.7525482177734375 Training loss: 0.4109 Explore P: 0.3846\n", | |
| "Episode: 109 Total reward: -37.95100402832031 Training loss: 0.2784 Explore P: 0.3815\n", | |
| "Episode: 110 Total reward: -71.89024353027344 Training loss: 0.1012 Explore P: 0.3786\n", | |
| "Model Saved\n", | |
| "Episode: 111 Total reward: -72.90853881835938 Training loss: 1.4025 Explore P: 0.3756\n", | |
| "Model updated\n", | |
| "Episode: 112 Total reward: -56.199127197265625 Training loss: 7.5684 Explore P: 0.3727\n", | |
| "Episode: 113 Total reward: -77.53300476074219 Training loss: 3.6123 Explore P: 0.3698\n", | |
| "Episode: 114 Total reward: -50.253692626953125 Training loss: 6.0007 Explore P: 0.3668\n", | |
| "Episode: 115 Total reward: 18.208023071289062 Training loss: 6.2701 Explore P: 0.3639\n", | |
| "Model Saved\n", | |
| "Episode: 116 Total reward: -74.686767578125 Training loss: 7.9382 Explore P: 0.3610\n", | |
| "Episode: 117 Total reward: -76.70317077636719 Training loss: 3.9754 Explore P: 0.3593\n", | |
| "Episode: 118 Total reward: 18.843551635742188 Training loss: 1.0298 Explore P: 0.3554\n", | |
| "Episode: 119 Total reward: 1.3499298095703125 Training loss: 1.5573 Explore P: 0.3525\n", | |
| "Episode: 120 Total reward: -0.566131591796875 Training loss: 0.4084 Explore P: 0.3497\n", | |
| "Model Saved\n", | |
| "Episode: 121 Total reward: 20.053070068359375 Training loss: 0.6762 Explore P: 0.3470\n", | |
| "Episode: 122 Total reward: -79.74948120117188 Training loss: 0.5085 Explore P: 0.3443\n", | |
| "Episode: 123 Total reward: -68.07794189453125 Training loss: 0.6844 Explore P: 0.3416\n", | |
| "Episode: 124 Total reward: 20.166915893554688 Training loss: 0.2775 Explore P: 0.3389\n", | |
| "Episode: 125 Total reward: -87.4755859375 Training loss: 0.3127 Explore P: 0.3364\n", | |
| "Model Saved\n", | |
| "Episode: 126 Total reward: -17.0537109375 Training loss: 0.3796 Explore P: 0.3337\n", | |
| "Episode: 127 Total reward: 5.201812744140625 Training loss: 0.6150 Explore P: 0.3311\n", | |
| "Episode: 128 Total reward: -32.572784423828125 Training loss: 0.2595 Explore P: 0.3285\n", | |
| "Episode: 129 Total reward: -43.18853759765625 Training loss: 0.4992 Explore P: 0.3259\n", | |
| "Episode: 130 Total reward: -84.01849365234375 Training loss: 0.3338 Explore P: 0.3226\n", | |
| "Model Saved\n", | |
| "Episode: 131 Total reward: -99.23286437988281 Training loss: 1.2294 Explore P: 0.3200\n", | |
| "Episode: 132 Total reward: -27.938064575195312 Training loss: 0.9042 Explore P: 0.3175\n", | |
| "Episode: 133 Total reward: 2.96868896484375 Training loss: 0.3110 Explore P: 0.3151\n", | |
| "Episode: 134 Total reward: -49.97503662109375 Training loss: 0.4291 Explore P: 0.3119\n", | |
| "Episode: 135 Total reward: 8.848037719726562 Training loss: 0.9113 Explore P: 0.3095\n", | |
| "Model Saved\n", | |
| "Episode: 136 Total reward: -78.30146789550781 Training loss: 1.1113 Explore P: 0.3064\n", | |
| "Episode: 137 Total reward: -35.61848449707031 Training loss: 0.2758 Explore P: 0.3039\n", | |
| "Episode: 138 Total reward: -80.23164367675781 Training loss: 1.1325 Explore P: 0.3015\n", | |
| "Episode: 139 Total reward: -41.44696044921875 Training loss: 0.2293 Explore P: 0.2993\n", | |
| "Episode: 140 Total reward: -63.55998229980469 Training loss: 0.5988 Explore P: 0.2969\n", | |
| "Model Saved\n", | |
| "Episode: 141 Total reward: -74.58718872070312 Training loss: 0.3622 Explore P: 0.2956\n", | |
| "Episode: 142 Total reward: -44.1854248046875 Training loss: 0.8818 Explore P: 0.2933\n", | |
| "Episode: 143 Total reward: -43.17417907714844 Training loss: 0.6441 Explore P: 0.2918\n", | |
| "Episode: 144 Total reward: -35.05082702636719 Training loss: 0.1932 Explore P: 0.2885\n", | |
| "Episode: 145 Total reward: 2.6080322265625 Training loss: 0.2974 Explore P: 0.2857\n", | |
| "Model Saved\n", | |
| "Episode: 146 Total reward: -75.66334533691406 Training loss: 0.2797 Explore P: 0.2828\n", | |
| "Episode: 147 Total reward: -79.89767456054688 Training loss: 14.5457 Explore P: 0.2805\n", | |
| "Episode: 148 Total reward: -65.21456909179688 Training loss: 0.7638 Explore P: 0.2783\n", | |
| "Episode: 149 Total reward: 13.195510864257812 Training loss: 0.3936 Explore P: 0.2761\n", | |
| "Episode: 150 Total reward: 60.77146911621094 Training loss: 1.1485 Explore P: 0.2739\n", | |
| "Model Saved\n", | |
| "Episode: 151 Total reward: -67.01502990722656 Training loss: 1.1541 Explore P: 0.2710\n", | |
| "Episode: 152 Total reward: 7.119903564453125 Training loss: 0.4257 Explore P: 0.2689\n", | |
| "Episode: 153 Total reward: 13.754486083984375 Training loss: 0.4931 Explore P: 0.2639\n", | |
| "Episode: 154 Total reward: -67.7314453125 Training loss: 0.5301 Explore P: 0.2618\n", | |
| "Episode: 155 Total reward: -61.25654602050781 Training loss: 0.3877 Explore P: 0.2599\n", | |
| "Model Saved\n", | |
| "Episode: 156 Total reward: -1.2131805419921875 Training loss: 0.3397 Explore P: 0.2579\n", | |
| "Episode: 157 Total reward: -26.2254638671875 Training loss: 0.1870 Explore P: 0.2558\n", | |
| "Episode: 158 Total reward: 71.63455200195312 Training loss: 0.3283 Explore P: 0.2538\n", | |
| "Episode: 159 Total reward: -41.72747802734375 Training loss: 0.6035 Explore P: 0.2520\n", | |
| "Episode: 160 Total reward: -75.83839416503906 Training loss: 0.5253 Explore P: 0.2488\n", | |
| "Model Saved\n", | |
| "Episode: 161 Total reward: 3.0420074462890625 Training loss: 0.8875 Explore P: 0.2468\n", | |
| "Episode: 162 Total reward: -21.011383056640625 Training loss: 0.2739 Explore P: 0.2449\n", | |
| "Episode: 163 Total reward: -19.587127685546875 Training loss: 1.2479 Explore P: 0.2431\n", | |
| "Episode: 164 Total reward: -53.40458679199219 Training loss: 1.2350 Explore P: 0.2413\n", | |
| "Episode: 165 Total reward: -59.686767578125 Training loss: 0.4527 Explore P: 0.2395\n", | |
| "Model Saved\n", | |
| "Episode: 166 Total reward: -53.43865966796875 Training loss: 12.8202 Explore P: 0.2384\n", | |
| "Episode: 167 Total reward: 4.73968505859375 Training loss: 0.2532 Explore P: 0.2366\n", | |
| "Episode: 168 Total reward: -42.3804931640625 Training loss: 0.6826 Explore P: 0.2347\n", | |
| "Episode: 169 Total reward: -1.4572296142578125 Training loss: 0.5197 Explore P: 0.2329\n", | |
| "Episode: 170 Total reward: -39.27558898925781 Training loss: 11.5407 Explore P: 0.2311\n", | |
| "Model Saved\n", | |
| "Episode: 171 Total reward: 8.362579345703125 Training loss: 0.2713 Explore P: 0.2295\n", | |
| "Episode: 172 Total reward: 14.519943237304688 Training loss: 7.7963 Explore P: 0.2277\n", | |
| "Episode: 173 Total reward: -58.884429931640625 Training loss: 0.3072 Explore P: 0.2259\n", | |
| "Episode: 174 Total reward: -93.07179260253906 Training loss: 0.6735 Explore P: 0.2235\n", | |
| "Episode: 175 Total reward: -60.440277099609375 Training loss: 0.6426 Explore P: 0.2218\n", | |
| "Model Saved\n", | |
| "Episode: 176 Total reward: 24.163375854492188 Training loss: 0.5932 Explore P: 0.2201\n", | |
| "Episode: 177 Total reward: -74.15121459960938 Training loss: 0.1940 Explore P: 0.2191\n", | |
| "Episode: 178 Total reward: -47.54103088378906 Training loss: 0.9826 Explore P: 0.2174\n", | |
| "Episode: 179 Total reward: -88.96371459960938 Training loss: 0.6407 Explore P: 0.2157\n", | |
| "Episode: 180 Total reward: 86.02571105957031 Training loss: 0.3157 Explore P: 0.2134\n", | |
| "Model Saved\n", | |
| "Episode: 181 Total reward: -8.269500732421875 Training loss: 1.0492 Explore P: 0.2118\n", | |
| "Episode: 182 Total reward: 37.916839599609375 Training loss: 0.3531 Explore P: 0.2102\n", | |
| "Episode: 183 Total reward: 28.824462890625 Training loss: 0.3685 Explore P: 0.2086\n", | |
| "Episode: 184 Total reward: -103.504150390625 Training loss: 0.8678 Explore P: 0.2077\n", | |
| "Episode: 185 Total reward: -33.638336181640625 Training loss: 0.5436 Explore P: 0.2062\n", | |
| "Model Saved\n", | |
| "Episode: 186 Total reward: -46.80809020996094 Training loss: 0.8421 Explore P: 0.2046\n", | |
| "\n", | |
| "Episode: 187 Total reward: 4.5064849853515625 Training loss: 0.2865 Explore P: 0.2030\n", | |
| "Episode: 188 Total reward: -10.029891967773438 Training loss: 0.4644 Explore P: 0.2014\n", | |
| "Episode: 189 Total reward: -35.31138610839844 Training loss: 0.3323 Explore P: 0.1999\n", | |
| "Episode: 190 Total reward: 22.30352783203125 Training loss: 0.6971 Explore P: 0.1984\n", | |
| "Model Saved\n", | |
| "Episode: 191 Total reward: -54.252655029296875 Training loss: 0.7283 Explore P: 0.1968\n", | |
| "Episode: 192 Total reward: -94.67848205566406 Training loss: 1.4658 Explore P: 0.1959\n", | |
| "Episode: 193 Total reward: -38.33479309082031 Training loss: 0.2945 Explore P: 0.1944\n", | |
| "Episode: 194 Total reward: -96.05851745605469 Training loss: 0.2530 Explore P: 0.1929\n", | |
| "Episode: 195 Total reward: -16.951339721679688 Training loss: 0.8220 Explore P: 0.1914\n", | |
| "Model Saved\n", | |
| "Episode: 196 Total reward: -104.72447204589844 Training loss: 0.4501 Explore P: 0.1900\n", | |
| "Episode: 197 Total reward: -3.453094482421875 Training loss: 0.4974 Explore P: 0.1886\n", | |
| "Episode: 198 Total reward: -26.187362670898438 Training loss: 0.2195 Explore P: 0.1872\n", | |
| "Episode: 199 Total reward: -98.55648803710938 Training loss: 0.2501 Explore P: 0.1864\n", | |
| "Episode: 200 Total reward: -16.166595458984375 Training loss: 0.3163 Explore P: 0.1850\n", | |
| "Model Saved" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Step 9: Watch our Agent play 👀\n", | |
| "Now that we have trained our agent, we can test it." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "with tf.Session() as sess:\n", | |
| "\n", | |
| "    game = DoomGame()\n", | |
| "\n", | |
| "    # Load the correct configuration (TESTING)\n", | |
| "    game.load_config(\"deadly_corridor_testing.cfg\")\n", | |
| "\n", | |
| "    # Load the correct scenario (in our case the deadly_corridor scenario)\n", | |
| "    game.set_doom_scenario_path(\"deadly_corridor.wad\")\n", | |
| "\n", | |
| "    # Initialize the game once, after it has been fully configured\n", | |
| "    game.init()\n", | |
| "\n", | |
| "    # Load the trained model\n", | |
| "    saver.restore(sess, \"./models/model.ckpt\")\n", | |
| "\n", | |
| "    for i in range(10):\n", | |
| "\n", | |
| "        game.new_episode()\n", | |
| "        state = game.get_state().screen_buffer\n", | |
| "        state, stacked_frames = stack_frames(stacked_frames, state, True)\n", | |
| "\n", | |
| "        while not game.is_episode_finished():\n", | |
| "            ## EPSILON GREEDY STRATEGY\n", | |
| "            # Choose action a from state s using epsilon greedy.\n", | |
| "            # Draw a random number to decide between exploration and exploitation\n", | |
| "            exp_exp_tradeoff = np.random.rand()\n", | |
| "\n", | |
| "            # Keep a small, fixed exploration probability at test time\n", | |
| "            explore_probability = 0.01\n", | |
| "\n", | |
| "            if explore_probability > exp_exp_tradeoff:\n", | |
| "                # Take a random action (exploration)\n", | |
| "                action = random.choice(possible_actions)\n", | |
| "\n", | |
| "            else:\n", | |
| "                # Get the action from the Q-network (exploitation)\n", | |
| "                # Estimate the Q-values for this state\n", | |
| "                Qs = sess.run(DQNetwork.output, feed_dict={DQNetwork.inputs_: state.reshape((1, *state.shape))})\n", | |
| "\n", | |
| "                # Take the biggest Q-value (= the best action)\n", | |
| "                choice = np.argmax(Qs)\n", | |
| "                action = possible_actions[int(choice)]\n", | |
| "\n", | |
| "            game.make_action(action)\n", | |
| "            done = game.is_episode_finished()\n", | |
| "\n", | |
| "            if done:\n", | |
| "                break\n", | |
| "\n", | |
| "            else:\n", | |
| "                next_state = game.get_state().screen_buffer\n", | |
| "                next_state, stacked_frames = stack_frames(stacked_frames, next_state, False)\n", | |
| "                state = next_state\n", | |
| "\n", | |
| "        score = game.get_total_reward()\n", | |
| "        print(\"Score: \", score)\n", | |
| "\n", | |
| "    game.close()" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python [default]", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.4" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 2 | |
| } |
I think there is a typo: you normalize the frame, but then you do not use the normalized one.
@Parkertao Did you manage to solve the issue with ISWeights_mb (and, in turn, with the loss)?
Hi Thomas Simonini, thank you for your code. When I modified it and ran it, I ran into an issue: the value of the loss is always 0. I apologize for my poor English; I am really confused.
I found that ISWeights_mb is [[0],[0],...[0]], and I guess that is the cause of this issue.
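One possible way to end up with all-zero importance-sampling weights, offered as a guess rather than a diagnosis of this notebook: in prioritized experience replay the weights are commonly computed as (N * P(i)) ** -beta and then divided by the largest possible weight, which comes from the smallest sampling probability. If that minimum is taken over slots that are still empty (priority 0), the largest weight becomes infinite and every normalized weight collapses to 0. If the loss multiplies each squared TD error by its weight, the loss would then be 0 as well. The sketch below is a standalone illustration; `importance_weights` and its `ignore_empty` flag are hypothetical names, not part of the notebook.

```python
import numpy as np


def importance_weights(priorities, sampled_idx, beta=0.4, ignore_empty=False):
    """Normalized importance-sampling weights for prioritized replay (sketch).

    Hypothetical helper, not the notebook's code: w_i = (N * P(i)) ** -beta,
    divided by the largest weight, which comes from the smallest probability.
    """
    priorities = np.asarray(priorities, dtype=np.float64)
    n = len(priorities)
    probs = priorities / priorities.sum()

    # Empty slots carry priority 0; optionally skip them when taking the minimum.
    candidates = probs[probs > 0] if ignore_empty else probs
    p_min = candidates.min()

    with np.errstate(divide="ignore"):
        # If p_min is 0, this evaluates to inf ...
        max_weight = (n * p_min) ** (-beta)
        weights = (n * probs[sampled_idx]) ** (-beta)

    # ... and dividing finite weights by inf yields exactly 0.
    return weights / max_weight


# Buffer with two stored transitions and two still-empty slots (priority 0).
priorities = [1.0, 2.0, 0.0, 0.0]
zero_weights = importance_weights(priorities, [0, 1])
fixed_weights = importance_weights(priorities, [0, 1], ignore_empty=True)
print(zero_weights)
print(fixed_weights)
```

If this matches what you see, checking whether the minimum priority used for normalization is 0 while the buffer is only partly filled would be a reasonable first step.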