Darren Broderick (DBro)

Posted on Feb 23, 2023 • Edited on Mar 13

DeepRacer Virtual Racing

#deepracer #reinforcement #machinelearning #awsdeepracer

Latest Track File (For Log Analysis)
https://github.com/aws-deepracer-community/deepracer-simapp/tree/master/bundle/deepracer_simulation_environment/share/deepracer_simulation_environment/routes

Latest Robomaker Container (For Training)
https://hub.docker.com/r/awsdeepracercommunity/deepracer-robomaker/tags?page=1&ordering=last_updated

General Training Starting Steps

These are commands I run if starting from a reboot
source bin/activate.sh
sudo liquidctl set fan1 speed 30
(This is my own fan setting)
dr-increment-training -f
dr-update OR dr-update-env (I tend to favour -env)
dr-start-training OR dr-start-training -w
dr-start-viewer OR dr-update-viewer
http://127.0.0.1:8100 OR http://localhost:8100
dr-logs-robomaker (use dr-logs-robomaker -n2 for worker 2, etc.)
dr-logs-sagemaker
nvidia-smi (check temperatures)
htop to check threads and memory usage
(Try to maximise my worker count, but keep to <75%)
dr-start-evaluation -c and dr-stop-evaluation (to start and stop an evaluation)
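
The 75% ceiling on worker load mentioned above can be checked from a script rather than by watching htop. This is only a minimal sketch: it assumes the 1-minute load average divided by the core count is an acceptable stand-in for thread usage (a rough proxy at best), and cpu_load_percent is a hypothetical helper name, not a DRFC command.

```shell
# Hypothetical helper: express a load average as a percentage of available cores.
# $1 = load average (e.g. 6.0), $2 = number of cores (e.g. 8)
cpu_load_percent() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%d\n", int(load * 100 / cores) }'
}

# Example on a live machine (not run here):
#   pct=$(cpu_load_percent "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)")
#   [ "$pct" -lt 75 ] && echo "room for another worker"
```

If the number stays well under 75 during training, there is probably room to raise the worker count; htop remains the better view of per-thread usage.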

Virtual DRFC Upload

aws configure
dr-upload-model -b -f
Uploads best checkpoint to s3

Physical DRFC Upload

dr-upload-car-zip -f
Sagemaker must be running for this to work
Only uses last checkpoint, not best

Container Update Links

Check your version with command "docker images"
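
To see only the DeepRacer community images and their tags, the docker images output can be filtered. A small sketch, assuming all the images you care about live under the awsdeepracercommunity repository prefix used in the links in this section; deepracer_images is a hypothetical helper name.

```shell
# Hypothetical helper: keep only DeepRacer community images from a
# "repository:tag" list supplied on stdin, sorted for easy comparison.
deepracer_images() {
    grep '^awsdeepracercommunity/' | sort
}

# Example on a live machine (not run here):
#   docker images --format '{{.Repository}}:{{.Tag}}' | deepracer_images
```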

Sagemaker
https://hub.docker.com/r/awsdeepracercommunity/deepracer-sagemaker/tags?page=1&ordering=last_updated

For new Sagemaker images follow this guide:
https://github.com/aws-deepracer-community/deepracer-for-cloud/blob/master/docs/multi_gpu.md

Robomaker
https://hub.docker.com/r/awsdeepracercommunity/deepracer-robomaker/tags?page=1&ordering=last_updated

RL Coach
https://hub.docker.com/r/awsdeepracercommunity/deepracer-rlcoach/tags
Linux terminal startup script is called ".bashrc"

Open GL Robomaker

https://aws-deepracer-community.github.io/deepracer-for-cloud/opengl.html
example -> docker pull awsdeepracercommunity/deepracer-robomaker:4.0.12-gpu-gl
system.env settings:

DR_HOST_X=True; uses the local X server rather than starting one within the docker container.

DR_ROBOMAKER_IMAGE; choose the tag for an OpenGL-enabled image, e.g. cpu-gl-avx for an image where TensorFlow will use the CPU, or gpu-gl for an image where TensorFlow will also use the GPU.
Run echo $DISPLAY and note the value; it should be :0 but might be :1
Set the DR_DISPLAY value in system.env to match the echo output
dr-reload
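
Matching the display value by hand is easy to get wrong, so here is a minimal sketch of doing it from the shell. It assumes the key in system.env is spelled DR_DISPLAY and sits on its own KEY=value line; set_env_value is a hypothetical helper, not part of DRFC.

```shell
# Hypothetical helper: set KEY=VALUE in an env file, replacing the line if the
# key already exists and appending it otherwise.
# $1 = file, $2 = key, $3 = value (must not contain "|" or "&")
set_env_value() {
    if grep -q "^$2=" "$1"; then
        sed "s|^$2=.*|$2=$3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    else
        printf '%s=%s\n' "$2" "$3" >> "$1"
    fi
}

# Example (not run here):
#   set_env_value system.env DR_DISPLAY "$DISPLAY" && dr-reload
```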

source utils/setup-xorg.sh
source utils/start-xorg.sh
You should see the Xorg processes in nvidia-smi once you run the start-xorg.sh script
sudo pkill x11vnc
sudo pkill Xorg

New Sagemaker - M40 Tagging

run -> docker tag 2b4e84b8c10a awsdeepracercommunity/deepracer-sagemaker:gpu-m40

Log Analysis

run -> dr-start-loganalysis
Only change needed is for model_logs_root
e.g. 'minio/bucket/model-name/0'
Tracks
https://github.com/aws-deepracer-community/deepracer-simapp/tree/master/bundle/deepracer_simulation_environment/share/deepracer_simulation_environment/routes
Might have to upload the new track to tracks folder
Repo for all racer data
https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/leaderboards

Run Second DRFC Instance

Create 2 different run.env or use 2 folders
The DR_RUN_ID keeps things separate
Only 1 minio should be running
Use a unique model name
Run source bin/activate.sh run-1.env to activate a separate environment
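
A second env file can be derived from the first instead of being edited by hand. This is a sketch under assumptions: DR_RUN_ID comes from the notes above, but DR_LOCAL_S3_MODEL_PREFIX as the model-name key is my assumption, so check your own run.env before relying on it. clone_run_env is a hypothetical helper.

```shell
# Hypothetical helper: copy a run.env, changing the run id and the model name.
# $1 = source env file, $2 = new env file, $3 = run id, $4 = model name
# Assumes both keys already exist as KEY=value lines in the source file;
# DR_LOCAL_S3_MODEL_PREFIX is an assumed key name.
clone_run_env() {
    sed -e "s|^DR_RUN_ID=.*|DR_RUN_ID=$3|" \
        -e "s|^DR_LOCAL_S3_MODEL_PREFIX=.*|DR_LOCAL_S3_MODEL_PREFIX=$4|" \
        "$1" > "$2"
}

# Example (not run here):
#   clone_run_env run.env run-1.env 1 my-second-model
#   source bin/activate.sh run-1.env
```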

Steps for fresh DRFC

./bin/prepare.sh && sudo reboot
docker start
ARCH=gpu
Run LARS script -> source bin/lars_one.sh
docker swarm init (if it fails, find your IP with ifconfig -a or from the docker swarm init error message, then rerun with --advertise-addr; check the bottom for an example)
ifconfig -a
docker swarm init
docker swarm init --advertise-addr 000.000.0.000
sudo ./bin/init.sh -a gpu -c local
docker images
docker tag xxxxxxx awsdeepracercommunity/deepracer-sagemaker:gpu-m40
source bin/activate.sh
vim run.env
vim system.env
dr-update
aws configure --profile minio
aws configure
(use real AWS IAM details below to allow upload of models)
dr-reload
docker ps -a
Setup multiple GPU
cd custom-files
vim on the 3 files
dr-upload-custom-files

Different editor option to vim
gedit

Troubleshooting DRFC

General Tip
It's always worth checking whether you are missing anything new that has been added to the default files, since DRFC will then be expecting it.
In particular, compare the default system.env and template-run.env files with your own.

Troubleshooting Docker Start

Docker failed to start
docker ps -a
docker service ls
sudo service docker status
sudo service --status-all
sudo systemctl status docker.service

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker
sudo systemctl restart docker
sudo service docker restart
snap list
sudo su THEN apt-get install docker.io
Re-run Installing Docker (From Lars)
cat /etc/docker/daemon.json
apt-cache policy docker-ce
sudo tail /var/log/syslog
sudo cat /var/log/syslog | grep dockerd | tail

"For me it was a missing file"
sudo gedit /etc/docker/daemon.json
Make /etc/docker/daemon.json look like below:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
sudo systemctl stop docker then sudo systemctl start docker
test with -> docker images
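
Since the missing file was the root cause here, a quick check for it can save time on the next reinstall. A minimal sketch using grep rather than a JSON parser, so it only looks for the "default-runtime": "nvidia" pair and does not validate the rest of the file; has_nvidia_default is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the given daemon.json exists and sets
# nvidia as the default runtime. This is a text match, not a JSON validation.
has_nvidia_default() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1" 2>/dev/null
}

# Example (not run here):
#   has_nvidia_default /etc/docker/daemon.json || echo "daemon.json missing or not using the nvidia runtime"
```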

Troubleshooting Docker Swarm

Could not connect to the endpoint URL: "http://localhost:9000/bucket"
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
You might have to disable ipv6 to stop docker pulling from multiple addresses
Here's how to disable IPv6 on Linux if you're running a Red Hat-based system:
Open the terminal window.
Change to the root user.
Type these commands:
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1
sysctl -w net.ipv6.conf.tun0.disable_ipv6=1
To re-enable IPv6, type these commands:
sysctl -w net.ipv6.conf.all.disable_ipv6=0
sysctl -w net.ipv6.conf.default.disable_ipv6=0
sysctl -w net.ipv6.conf.tun0.disable_ipv6=0
sysctl -p
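
The disable and re-enable commands above differ only in the final value, so they can be generated from one place. A small sketch; ipv6_settings is a hypothetical helper, and the tun0 entry is copied from the steps above and only applies if your machine actually has a tun0 interface.

```shell
# Hypothetical helper: print the three sysctl settings from the steps above.
# $1 = 1 to disable IPv6, 0 to re-enable it
ipv6_settings() {
    for scope in all default tun0; do
        echo "net.ipv6.conf.$scope.disable_ipv6=$1"
    done
}

# Example, as root (not run here):
#   sysctl -w $(ipv6_settings 1)    # disable
#   sysctl -w $(ipv6_settings 0)    # re-enable
```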

run -> ./bin/init.sh
run -> docker pull minio/minio:RELEASE.2022-10-24T18-35-07Z
DR_MINIO_IMAGE in system.env, make sure it's set to:
RELEASE.2022-10-24T18-35-07Z

Other Fixes That Might Work for minio

run -> docker swarm init
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
run -> docker swarm leave
run -> docker swarm init
Error response from daemon: could not choose an IP address to advertise since this system has multiple addresses on interface
run -> docker network ls
sagemaker-local should appear in the network
IF NOT
There's a new fix script for this called "lars_swarm_fix.sh" in the bin folder.
run -> docker swarm leave --force
run -> source bin/lars_swarm_fix.sh

The script might need an address. If so, the error message will say: This node is not a swarm manager. Use "docker swarm init"
run -> docker swarm init (and grab the first addr, example below)
docker swarm init --advertise-addr 2a00:23c8::d6c3:4a71:9adb:87ad

Swarm initialized: current node (wv3eqpslrstc6hm7n65744z) is now a manager.
ifconfig -a
You don't need to run the join-token command that docker swarm init prints
dr-start-training
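
Picking the address out of the error message by eye works, but it can also be scripted. A minimal sketch that chooses the first IPv4 address from a list; preferring IPv4 is my own choice here (the example above used an IPv6 address), and first_ipv4 is a hypothetical helper name.

```shell
# Hypothetical helper: print the first IPv4 address found in a
# space-separated address list, or fail if there is none.
first_ipv4() {
    for addr in $1; do
        case "$addr" in
            *:*) ;;                              # skip IPv6 addresses
            *.*.*.*) echo "$addr"; return 0 ;;   # first dotted-quad wins
        esac
    done
    return 1
}

# Example (not run here):
#   docker swarm init --advertise-addr "$(first_ipv4 "$(hostname -I)")"
```

Note that the first address listed is not always the one you want, for example on machines where a Docker bridge or VPN address comes first, so it is worth echoing the result before using it.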

Swarm is a Docker concept. In theory you can connect multiple machines together and run DRFC across them, with Sagemaker on one PC and the Robomaker workers spread out. Once you have cloned DRFC, though, you can simply run bin/init.sh -a gpu -c local

Issue - Minio kept making new containers every 10 seconds

Fix: https://github.com/aws-deepracer-community/deepracer-for-cloud/pull/102/commits/a2db4df0a624ace87b89afcc7ff27f35fe9751fe
docker service rm s3_minio
source bin/activate.sh

Issue - Minio containers kept exiting within 7 seconds
docker ps -a
docker service rm s3_minio
docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 up
docker ps
ls -l data
ls -l
Issue was I ran the init script as root
Fix -> chown -R dbro:dbro .
docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 up
docker ps -a showed there were now 2 minio containers running
docker-compose -f $DR_DIR/docker/docker-compose-local.yml -p s3 down
docker stack rm s3
dr-reload
docker ps
dr-upload-custom-files
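
Because the cause was files created by root, it helps to list what is still owned by someone else before and after the chown. A minimal sketch using find; not_owned_by is a hypothetical helper name.

```shell
# Hypothetical helper: list every path under a directory that is NOT owned by
# the given user. $1 = directory, $2 = user name or numeric user id
not_owned_by() {
    find "$1" ! -user "$2" -print
}

# Example, run from the DRFC folder (not run here):
#   not_owned_by . "$(id -u)"    # any output means chown is still needed
```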

- - - - - General Notes - - - - -

The M40 runs the Sagemaker docker container
System RAM runs Robomaker
You can offload some of the Robomaker work to the GPU by using the OpenGL image, but generally Robomaker runs from system RAM
Basically, the model lives inside the GPU memory
Training checkpoints are in data/minio/bucket

I wouldn't push the worker count any higher once htop shows around 80% on all threads
- - - - - Additional Scripts - - -

Create script "lars_one.sh"

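if [[ "${ARCH}" == "gpu" ]]; then
    # Build the distribution string from os-release, e.g. ubuntu20.04
    distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
    # Add the NVIDIA Docker repository key and package list
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update && sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime
    # Set nvidia as the default runtime. Write to a temp file first so that
    # daemon.json is not truncated while jq is still reading it.
    jq 'del(."default-runtime") + {"default-runtime": "nvidia"}' /etc/docker/daemon.json > /tmp/daemon.json && sudo cp /tmp/daemon.json /etc/docker/daemon.json
fi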

Miscellaneous

Sensors

"FRONT_FACING_CAMERA"
"SECTOR_LIDAR"
"LIDAR"
"STEREO_CAMERAS"

Check temperature commands

nvidia-smi
nvidia-smi -l 60
watch -n900 nvidia-smi (re-runs automatically every 15 minutes)
sensors
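
For unattended runs, the temperature can be pulled out in a machine-readable form and compared against a limit. A minimal sketch: the 80 degree limit in the example is an arbitrary value I picked for illustration, and hot_gpus is a hypothetical helper name.

```shell
# Hypothetical helper: read "index, temperature" lines on stdin and print the
# index of every GPU at or above the limit. $1 = limit in degrees C
hot_gpus() {
    awk -F', *' -v limit="$1" '$2 + 0 >= limit + 0 { print $1 }'
}

# Example on a machine with NVIDIA drivers (not run here):
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader | hot_gpus 80
```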

Set fan speed commands

sudo liquidctl set fan1 speed 30
sudo liquidctl set fan1 speed 0

Check specs / stats commands

nvidia-smi -L
GeForce GTX 1650 -> nvidia-smi -a -i 0
M40 Specs -> nvidia-smi -a -i 1
lspci -k | grep -EA3 'VGA|3D|Display'
top (checks processors to help see worker limits)
free -m
htop
docker stats
docker run --rm --gpus all nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi

Useful Links

Full Guide - https://aws-deepracer-community.github.io/deepracer-for-cloud
Sudo - https://phpraxis.wordpress.com/2016/09/27/enable-sudo-without-password-in-ubuntudebian
Training on multiple GPU - https://github.com/aws-deepracer-community/deepracer-for-cloud/blob/master/docs/multi_gpu.md
nvidia monitor - https://stackoverflow.com/questions/8223811/a-top-like-utility-for-monitoring-cuda-activity-on-a-gpu
Tesla M40 24GB specs - https://www.microway.com/hpc-tech-tips/nvidia-tesla-m40-24gb-gpu-accelerator-maxwell-gm200-close
Complex shutdown - https://www.maketecheasier.com/schedule-ubuntu-shutdown
Sudo shutdown - https://sdet.ro/blog/shutdown-ubuntu-with-timer
Video trimmer - https://launchpad.net/~kdenlive/+archive/ubuntu/kdenlive-stable
Flatpak - https://flatpak.org/setup/Ubuntu

Installation commands

sudo snap install jupyter
sudo apt install git
sudo apt install nvidia-cuda-toolkit
sudo apt install curl
sudo apt install jq
sudo pip install liquidctl (to install fan controller globally)
sudo apt install net-tools
sudo apt install vim
sudo apt-get install htop
sudo apt install hddtemp
sudo apt install lm-sensors
pip install --user pipenv
sudo apt install pipenv
pipenv install jupyterlab

Installing Docker

sudo su (run from root)
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce docker-ce-cli containerd.io
sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime
sudo apt-get upgrade

Steps for Cuda upgrade
First removed existing:
sudo dpkg -P $(dpkg -l | grep nvidia-driver | awk '{print $2}')
sudo apt autoremove

then added new:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt update
sudo apt -y install cuda

Then I rebooted and ran nvidia-smi, which reported:

NVIDIA-SMI 510.47.03

Driver Version: 510.47.03

CUDA Version: 11.6
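
To confirm the upgrade from a script, the versions can be read out of the nvidia-smi banner. A minimal sketch that assumes the banner keeps the "Label: value" layout shown above; smi_field is a hypothetical helper name.

```shell
# Hypothetical helper: pull one version number out of the nvidia-smi banner
# supplied on stdin. $1 = label, e.g. "Driver Version" or "CUDA Version"
smi_field() {
    sed -n "s/.*$1: *\([0-9.]*\).*/\1/p" | head -n 1
}

# Example on a machine with NVIDIA drivers (not run here):
#   nvidia-smi | smi_field "CUDA Version"
```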

Thank you!
