NVIDIA DIGITS 3 on EC2

By | February 10, 2016


So you have heard a lot about Deep Learning and Convolutional Neural Networks, and you want to try them out quickly. Before you dive into the theory, you want to get your hands dirty, without writing a single line of code. You also want to monitor the progress of your training from your smartphone. All I can say is that I respect your laziness! Let’s get started.

If you would rather skip this step-by-step installation guide and use an Amazon Machine Image (AMI) that has DIGITS preinstalled, read my follow-up article titled “Deep Learning Example using NVIDIA DIGITS 3 on EC2”.

In this post we will learn how to set up a Deep Learning framework (NVIDIA DIGITS + Caffe/Torch) on an Amazon EC2 instance. This setup lets you schedule training tasks, monitor their progress, and visualize results through a web interface.

What is NVIDIA DIGITS?

DIGITS stands for Deep Learning GPU Training System. It is a web-based graphical user interface that lets you prepare data, set training parameters, choose from several popular neural net architectures (or use your own), and train a deep neural net. It is a perfect tool to get started with if you know very little about Deep Learning. Under the hood, DIGITS uses Caffe, the popular open source deep learning framework. Support for Torch, a deep learning framework backed by Facebook, is in beta, but you can try it out.

GPUs on EC2

One big obstacle to getting started with Deep Learning right away is access to a good GPU. Your laptop may not have an NVIDIA card, and even if it does, the card may not be very powerful. Training a deep neural net can take hours, so it makes little sense to tie up your primary computer with the task.

Without a GPU, deep learning is painfully slow. In fact, one of the contributions of the 2012 paper that firmly established Deep Learning as the undisputed king of image classification algorithms (the AlexNet paper by Krizhevsky, Sutskever and Hinton) was its clever use of two GPUs.

Fortunately, we live in amazing times: near-infinite compute power is at our fingertips. All you need to do is register for Amazon Web Services (AWS).

https://aws.amazon.com/

This gives you access to Amazon’s Elastic Compute Cloud (EC2) and its virtually unlimited compute resources (for a price, of course). The web interface lets you start a virtual server called an “instance”. We are interested in the two GPU-enabled instance types, which have the following specifications.

Model        GPUs  vCPUs  Memory (GiB)  SSD storage (GB)
g2.2xlarge   1     8      15            1 x 60
g2.8xlarge   4     32     60            2 x 120

In this tutorial we will use g2.2xlarge because it is less expensive ($0.6/hour) and sufficient for our purposes. The g2.8xlarge comes with 4 GPUs, and you can use them all in parallel if you are using DIGITS with Caffe.
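To put the hourly rate in perspective, here is a small shell sketch that estimates the on-demand cost of a session. It assumes the $0.6/hour g2.2xlarge price quoted above (prices vary by region and change over time) and ignores storage and data-transfer charges; `estimate_cost` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rough on-demand cost estimate for an EC2 session.
# Assumption: the $0.6/hour g2.2xlarge rate quoted in this post is the default;
# check current AWS pricing for your region before relying on the number.
estimate_cost() {
  local hours="$1"
  local rate="${2:-0.6}"   # USD per hour (hypothetical default taken from the text)
  # awk does the floating-point math that bash integer arithmetic cannot.
  awk -v h="$hours" -v r="$rate" 'BEGIN { printf "%.2f\n", h * r }'
}

# Example: 10 hours of training plus 2 hours of setup on a g2.2xlarge.
estimate_cost 12
```

Passing a second argument, as in `estimate_cost 5 1.5`, lets you plug in a different hourly rate.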

Install NVIDIA DIGITS using Amazon Web Services

I am going to assume that you have created an AWS account and are logged in. Follow the steps below to set up an EC2 GPU instance. If you are already familiar with the process, skip to the next section.

Set up EC2 GPU Instance

  1. Go to EC2 Management Console : On the AWS Management Console, click EC2. This brings you to the EC2 Management Console.

  2. Launch instance : On the EC2 Management Console, go to Instances and click the Launch Instance button.

  3. Choose Operating System : From the list of Operating Systems choose Ubuntu 14.04. Then click Next.

  4. Choose instance type : From the list of instance types choose g2.2xlarge. Then click on the Configure Instance Details button at the bottom of the page.

  5. Configure instance details : Make sure the number of instances is 1. Pick a subnet; it does not matter which one. If you later decide to attach a volume (storage space) to your instance, you will need to know the subnet, because the volume must be in the same Availability Zone as the instance. Click the Next button.

  6. Add storage : I recommend adding at least 50 GB. Click Next.
    Note : This storage is NOT permanent. You will lose all data when you terminate your EC2 instance. If you are doing serious work, you should attach a separate EBS volume to your instance.

  7. Tag Instance : Pick a name; any name is fine. Then click Next.

  8. Configure security group : Pick the “Create a new security group” option and give your security group a descriptive name. We want two ways to access the server. First, we want to log on to the machine via SSH (port 22). Second, we want to open port 80 for the DIGITS web server. Notice that both services are accessible from my IP address only. You may choose another custom IP range instead, but I do not recommend making them accessible from any IP address.

  9. Review & launch : Review your settings and click Launch.

  10. Download Key : You need a key pair to ssh into this machine. Create a new key if you don’t have one. Choose a descriptive name. The downloaded file will have a .pem extension.

  11. Verify instance : To verify that your instance is running, go to the EC2 Management Console and click “Instances”. Copy the instance’s public IP address to your clipboard.

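If you prefer the command line, the console steps above can be approximated with the AWS CLI. This is a sketch, not a tested recipe: it assumes the AWS CLI is installed and configured for your account, that you already created the key pair and security group, and that the AMI’s root device is /dev/sda1 (as on Ubuntu HVM images). The function names and all IDs are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: console steps 2-10 as a single AWS CLI call.
# All arguments are hypothetical placeholders you must replace.
launch_digits_instance() {
  local ami_id="$1" key_name="$2" sg_id="$3" subnet_id="$4"
  # One g2.2xlarge with a 50 GB root volume (step 6 recommends at least 50 GB).
  # Assumes the root device name is /dev/sda1, as on Ubuntu HVM AMIs.
  aws ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type g2.2xlarge \
    --count 1 \
    --key-name "$key_name" \
    --security-group-ids "$sg_id" \
    --subnet-id "$subnet_id" \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50}}]'
}

# Step 11 equivalent: print the public IP address of one instance.
instance_public_ip() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

# Usage (hypothetical values):
# launch_digits_instance ami-xxxxxxxx my-digits-key sg-xxxxxxxx subnet-xxxxxxxx
# instance_public_ip i-xxxxxxxx
```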

Install NVIDIA DIGITS on EC2 GPU Instance

We are now ready to install NVIDIA DIGITS on the GPU instance we created in the last step.

  1. SSH into EC2 Instance : Open a terminal (on OS X or Linux) or use an SSH client on Windows to log on to the machine. Type the following commands, using the full path to the .pem file you downloaded and the public IP address of your machine.
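    # Restrict permissions on your SSH key file (ssh ignores keys readable by others).
    chmod 600 your-pemfile.pem
    # SSH into the machine (-Y enables trusted X11 forwarding; DIGITS does not need it).
    ssh -Y -i your-pemfile.pem ubuntu@your-public-ip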
    

    If you do not change the permissions of your SSH key file, you may receive the following warning.

    WARNING: UNPROTECTED PRIVATE KEY FILE!
    Permissions 0644 for 'your-pemfile.pem' are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.
    bad permissions: ignore key: your-pemfile.pem
    Permission denied (publickey).
    
  2. Update and upgrade packages with apt-get : Assuming you were able to log in, run the following on the server.
    sudo apt-get update && sudo apt-get -y upgrade
    
  3. Install linux-image-extra : The base Linux kernel package that comes with the Ubuntu 14.04 image on Amazon leaves out some drivers to keep the image small. So we need to install the drivers that were left out of the base package.
    sudo apt-get install -y linux-image-extra-`uname -r`
    
  4. Install NVIDIA drivers
    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt-get update
    sudo apt-get install nvidia-352 nvidia-settings
    
  5. Get CUDA and NVIDIA’s machine learning repos
    CUDA_REPO_PKG=cuda-repo-ubuntu1404_7.5-18_amd64.deb && 
    wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/$CUDA_REPO_PKG && 
    sudo dpkg -i $CUDA_REPO_PKG
    
    ML_REPO_PKG=nvidia-machine-learning-repo_4.0-2_amd64.deb &&
    wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/x86_64/$ML_REPO_PKG &&
    sudo dpkg -i $ML_REPO_PKG
    

    The machine learning repo above provides the digits, caffe-nv, torch and libcudnn4 packages.

  6. Install DIGITS
    sudo apt-get update
    sudo apt-get install digits
    

    If everything went well, open your instance’s public IP address in a browser and you will see the DIGITS home page.
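Before opening the browser, you may want a quick sanity check on the instance itself. The sketch below assumes `nvidia-smi` was installed along with the driver (its `-L` option prints one line per visible GPU) and that DIGITS answers on port 80 of localhost; `NVIDIA_SMI`, `DIGITS_URL`, `count_gpus` and `check_digits_install` are hypothetical names. If no GPU is listed, training is likely to fail with errors such as “no CUDA-capable device is detected”.

```shell
#!/usr/bin/env bash
# Post-install sanity check, meant to run on the EC2 instance itself.
# NVIDIA_SMI and DIGITS_URL are hypothetical, overridable settings.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"
DIGITS_URL="${DIGITS_URL:-http://localhost/}"

# Print how many GPUs the driver exposes; "nvidia-smi -L" lists one
# "GPU <n>: ..." line per device. Prints 0 if the command is missing.
count_gpus() {
  "$NVIDIA_SMI" -L 2>/dev/null | grep -c '^GPU '
}

# Return non-zero if no GPU is visible or the DIGITS page does not load.
check_digits_install() {
  local n
  n=$(count_gpus)
  if [ "$n" -eq 0 ]; then
    echo "No GPU visible; check the driver install (a reboot may be needed)." >&2
    return 1
  fi
  echo "GPUs visible: $n"
  # -f: fail on HTTP errors, -s: no progress bar, -o: discard the page body.
  if ! curl -fs -o /dev/null "$DIGITS_URL"; then
    echo "DIGITS did not respond at $DIGITS_URL" >&2
    return 1
  fi
  echo "DIGITS is responding at $DIGITS_URL"
}

# Usage, on the instance: check_digits_install
```

On a g2.2xlarge you would expect the GPU count to be 1.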

Woohoo! We are all set up! By the way, if you relax your security group rules, you can view this page, and therefore monitor the progress of your training, from your smartphone!
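One way to relax the rules only slightly is to add your phone’s current public IP address to the security group instead of opening port 80 to everyone. Here is a sketch with the AWS CLI, assuming it is configured for your account; `allow_ip_on_port80` and the example IDs are hypothetical, and a phone on a mobile network may get a new IP address frequently.

```shell
#!/usr/bin/env bash
# Sketch: let one extra IPv4 address (e.g. your phone's current public IP)
# reach the DIGITS web server on port 80. Names and IDs are hypothetical.
allow_ip_on_port80() {
  local sg_id="$1" ip="$2"
  # Rough shape check only: four dot-separated groups of 1-3 digits
  # (it does not verify that each octet is <= 255).
  local re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
  if ! [[ "$ip" =~ $re ]]; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  # /32 opens the port to this single address only.
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port 80 --cidr "$ip/32"
}

# Usage (hypothetical values):
# allow_ip_on_port80 sg-xxxxxxxx 203.0.113.7
```

To undo the change later, `aws ec2 revoke-security-group-ingress` accepts the same `--group-id`, `--protocol`, `--port` and `--cidr` arguments.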

Getting Started with NVIDIA DIGITS

My follow up article titled “Deep Learning Example using NVIDIA DIGITS 3 on EC2” provides a detailed video tutorial on how to use DIGITS for Image Classification.

The GitHub page for DIGITS provides an example of creating a dataset and training a model.
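DIGITS can build an image classification dataset from a folder that has one subfolder per category, using the subfolder names as class labels. Before creating a dataset, a quick per-class image count can catch empty or misnamed folders. A minimal sketch; `summarize_dataset` is a hypothetical helper, and it only counts .jpg, .jpeg and .png files.

```shell
#!/usr/bin/env bash
# Summarize an image folder laid out as one subfolder per class.
# Prints "<class> <image count>" for each subfolder, in glob (name) order.
summarize_dataset() {
  local root="$1" dir n
  [ -d "$root" ] || { echo "no such directory: $root" >&2; return 1; }
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue   # the glob matched nothing
    # Count common image extensions, case-insensitively, in this folder only.
    n=$(find "$dir" -maxdepth 1 -type f \
          \( -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' \) | wc -l)
    # $((n)) strips the padding some wc implementations add.
    echo "$(basename "$dir") $((n))"
  done
}

# Usage (hypothetical path): summarize_dataset ~/data/17flowers
```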

NVIDIA DIGITS Configuration FAQ

  1. How can you configure DIGITS to run on a different port?
    You can configure DIGITS to run on a different port using the following command.

    sudo dpkg-reconfigure digits
    
  2. Where does DIGITS store datasets and trained models?
    DIGITS stores all data inside /usr/share/digits/digits/jobs.

    ls /usr/share/digits/digits/jobs
    

    There are two kinds of job directories: 1) Dataset jobs contain information about a dataset created using DIGITS. 2) Training jobs contain information about a model trained using DIGITS. You can tell that a job directory contains a dataset if it contains labels.txt, mean.binaryproto, train_db, train.txt, val_db, val.txt, etc. For example:

    # Here 20160208-182427-0f82 is a Dataset job
    $ ls -1 /usr/share/digits/digits/jobs/20160208-182427-0f82
    create_train_db.log
    create_val_db.log
    labels.txt
    mean.binaryproto
    mean.jpg
    status.pickle
    train_db
    train.txt
    val_db
    val.txt
    

    On the other hand, if it contains a trained model, you will see files named deploy.prototxt, solver.prototxt, train_val.prototxt, snapshot_iter_*.caffemodel, etc. For example:

    # Here 20160209-011941-7953 is a Training job
    $ ls -1 /usr/share/digits/digits/jobs/20160209-011941-7953
    caffe_output.log
    deploy.prototxt
    snapshot_iter_104.caffemodel
    .
    .
    snapshot_iter_960.caffemodel
    snapshot_iter_960.solverstate
    solver.prototxt
    status.pickle
    train_val.prototxt
    
  3. How do you start, stop, or restart the DIGITS server?
    DIGITS runs as an Upstart service on Ubuntu 14.04. To restart it, stop it and then start it again.

    # Stop the server
    sudo stop nvidia-digits-server
    # Start the server
    sudo start nvidia-digits-server
    
  4. How do you change the default jobs directory in NVIDIA DIGITS?

    As mentioned above, by default DIGITS stores all data inside /usr/share/digits/digits/jobs/. You may want a different location for your data; for example, you may want all DIGITS jobs stored on an attached volume. You can change it using the following commands.

    cd /usr/share/digits
    # set new config
    sudo python -m digits.config.edit -v
    # restart server
    sudo stop nvidia-digits-server
    sudo start nvidia-digits-server
    

    NOTE: The new jobs directory you choose should be writable by www-data.

    sudo chown -R www-data path_to_new_jobs_dir  
    
  5. How do you change other configuration settings in NVIDIA DIGITS?
    The following commands let you change all DIGITS settings, including the jobs directory, the GPUs to use, the log file location, the log level, the server name, and the locations of the Caffe and Torch installations.

    cd /usr/share/digits
    # set new config
    sudo python -m digits.config.edit -v
    # restart server
    sudo stop nvidia-digits-server
    sudo start nvidia-digits-server
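The file markers described in FAQ 2 are enough to tell dataset jobs from training jobs without opening the web interface. The sketch below assumes the default jobs directory (override it with the hypothetical `JOBS_DIR` variable if you moved it as described in FAQ 4). It treats a directory as a training job if it holds a `*.caffemodel` snapshot or `solver.prototxt`, and as a dataset job if it holds `labels.txt` or `train_db`; `exists_any`, `job_kind` and `list_jobs` are hypothetical names.

```shell
#!/usr/bin/env bash
# Classify DIGITS job directories by the files they contain.
# JOBS_DIR is a hypothetical override; the default is the path from FAQ 2.
JOBS_DIR="${JOBS_DIR:-/usr/share/digits/digits/jobs}"

# True if the first word of an (already expanded) glob names a real file.
# An unmatched glob stays literal, so -e fails on it.
exists_any() { [ -e "$1" ]; }

# Print "training", "dataset" or "unknown" for one job directory.
job_kind() {
  local job="$1"
  if exists_any "$job"/*.caffemodel || [ -f "$job/solver.prototxt" ]; then
    echo training
  elif [ -f "$job/labels.txt" ] || [ -e "$job/train_db" ]; then
    echo dataset
  else
    echo unknown
  fi
}

# Print "<job id> <kind>" for every job directory under JOBS_DIR.
list_jobs() {
  local job
  for job in "$JOBS_DIR"/*/; do
    [ -d "$job" ] || continue   # no job directories at all
    job=${job%/}
    echo "$(basename "$job") $(job_kind "$job")"
  done
}

# Usage: list_jobs
```

To check the www-data requirement from FAQ 4, `sudo -u www-data test -w path_to_new_jobs_dir && echo writable` is one quick test.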
    


About Satya Mallick

I am an entrepreneur with a love for Computer Vision and Machine Learning, with a dozen years of experience (and a Ph.D.) in the field. In 2007, right after finishing my Ph.D., I co-founded TAAZ Inc. with my advisor Dr. David Kriegman and Kevin Barnes. The scalability and robustness of our computer vision and machine learning algorithms have been put to a rigorous test by more than 100M users who have tried our products.

  • Lucas Porto

    I’m new to Caffe. How much does this kind of service cost?

    • If you use an Amazon GPU instance, it will cost you $0.6/hour. But you can also install DIGITS on your own Linux box.

      • Lucas Porto

        Great…. Congrats, I really like your website…

        I’m using Caffe on my PC, but I think I need more hardware, or I am not configuring the solver and the layers well enough to find a good solution to my problem…

        • Thanks Lucas. There are so many small things ( other than the hardware ) that can result in not so good results.

          • Lucas Porto

            Yes you are right, I got some interesting results, but I need to learn more 🙂

            Are you thinking of preparing a text like “how to use Caffe” or something like that?

          • I had not thought about it. Are Caffe’s online examples not useful ? If so, could you tell me what they are lacking ?

            Thanks
            Satya

  • Samir Zen Master Al-Stouhi

    Satya,
    I haven’t tried this yet but that is awesome. Did you happen to make an AMI that I can copy?

    • I do 🙂 . I will share it shortly.

    • Hi Samir,

      The AMI id is ami-d949afb9 . It is available in US West ( Oregon ) region. The name is bigvision-digits. Please let me know if it works for you. Make sure you allocate about 40 GB of space. I have also put a dataset called 17flowers in the data directory. This should help you get started.

      • Samir Zen Master Al-Stouhi

        Satya,
        Thank you very very much for doing this. I found the AMI and I was able to access it.
        It has no GUI so I assume that I have to run DIGITS from a local machine with the ip of the ec2 server to see the screenshot below?
        I need to get digits on my machine before I proceed.

        • I am assuming you created an instance using that AMI. You have to find the public IP address of that instance; in this post, search for “Verify instance” to see how. In your browser, simply go to that IP address. If the DIGITS web page does not show up, you will have to restart the DIGITS server. To do this, log on to your instance and run

          sudo stop nvidia-digits-server
          sudo start nvidia-digits-server

          • Samir Zen Master Al-Stouhi

            Satya,
            I don’t want to burden you with this and I appreciate your help. I have done that but I see nothing on my browsers.

          • Samir Zen Master Al-Stouhi

            Satya,
            Sorry to bother with this. I think I did all of this but when I put the IP address in my browser, it doesn’t work.

          • Samir Zen Master Al-Stouhi

            Satya,

            I was able to run the server.

            1) I had an error in my custom IP setting: Under your step 9: “Review & launch”, you have a 5000 port although in your step 8 you had the correct port of 80. I now have port 80 and now it works.

            2) I was able to train and test the flowers database, although the labels are not correct. It seems that the folder names are not correct (but the training worked fine) and the validation worked great.

            I want to thank you very very much for your help.

          • Thank you so much for the feedback. You are right, the new DIGITS 3 runs on port 80. I will also look into the flower dataset. I had put it together quickly without checking so that you would have something to try :).

          • Samir Zen Master Al-Stouhi

            Everything worked perfect so thank you again.

          • BTW, I have a new post based on the discussion we had here. I have fixed the flower classes and created a video that explains how to use the AMI for people who are not familiar with Amazon EC2.

            http://www.learnopencv.com/deep-learning-example-using-nvidia-digits-3-on-ec2/

  • Djebril

    Thanks for your post, Satya. DIGITS is indeed a great tool to get started with. TensorFlow also has a visualization board called TensorBoard, but the framework is a bit slower than the other deep learning frameworks (this should change in the next release).

    Therefore, I think DIGITS is the best choice for training image classification models so far.

    Another great thing would be to use the OpenCV DNN module (which supports GoogleNet!) afterwards to perform a forward pass with the trained model. Unfortunately, the DNN forward pass [1] does not produce the same results as the DIGITS standalone image classification test (“classify one image”, which is more accurate). Maybe you could write something about that?

    Cheers,

    [1] http://docs.opencv.org/trunk/d5/de7/tutorial_dnn_googlenet.html#gsc.tab=0

    • Thanks for the feedback, Djebril. I also suspect that DIGITS will soon support TensorFlow. Here is a recent quote from NVIDIA’s CEO Jen-Hsun Huang.

      “TensorFlow will democratize deep learning” Jen-Hsun says. “That’s a huge contribution to humanity.”

  • Chris Wang

    Hi Satya, I have a server running DIGITS. Do you know how to password-protect the page, or force some sort of login to access DIGITS? Thanks!

  • pogopaule

    Hi Satya! Thanks for your great tutorial. It helps a lot!

    Unfortunately if I run an image classification, I get the following error:
    Creating layer data
    Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected

    I set up the machine twice and did not see any errors while installing the packages. Do you have any clue what has gone wrong?

  • Jay Stevens

    Satya, I really appreciate the tutorial, but it seems as if something may have changed. Following all of your steps on the current default GPU instance yields an install of DIGITS with no GPU support (which was kind of the point, LOL). I’m trying to install the CUDA libraries, drivers, etc. manually using a different tutorial. Any help at all would be appreciated. I’m not the only one (see other comments in this thread).

    Thanks.

    • Hi Jay,

      I will look into it on Tuesday and let you know.

      Satya

      • Jay Stevens

        As a followup this may be related to a bug in the deployment package that NVIDIA just discovered after I and a few others were having problems.

        • Thanks, man. I have been scratching my head over this but have not found a solution. Do you have a link to the bug you mentioned?

      • Jay Stevens
        • Thanks. I am watching the conversation now and will try again after they confirm a good fix.

  • Did you see the new p2 instance type? 🙂