gianni.rosagallina.com - Build Caffe2 and Detectron with GPU support on Windows (Part 1 of 2)

In the last couple of weeks, I needed to test and use some custom models built with the Caffe2 framework and Detectron. Both are actively developed on Linux, but I needed to run them on Windows 10 with CUDA GPU support. It is possible to build Caffe2 for Windows, and a guide is provided, but Detectron is not officially supported on Windows, so using it is a bit more complicated and requires some changes to the source code and to the build scripts.

After many long, frustrating days of trial and error, failed builds, and collecting ideas and suggestions for fixing issues from GitHub discussions and blog posts (thanks to Mianzhi Wang for this guide), I came up with an updated, clean and reproducible way to build Caffe2 and Detectron on Windows, supporting CUDA 9 or CUDA 10.

This post, divided into 2 parts, is a step-by-step guide to how I did it, in the hope that it helps other people with the same need. Plan at least 1 day of work to prepare your build environment. There are surely better ways to handle some of the fixes and changes, but I didn't have enough time to dig deeper into them. If you find any improvements, please let me know.

Here, you can find Part 2, with all the steps required to build and run Detectron on Windows 10. You need to complete all the steps in this post before continuing with the next.

DISCLAIMER: this guide was written and tested in the last week of September 2018. I successfully tested it on 3 different Windows dev machines (2 with CUDA 9.2, 1 with CUDA 10). However, I can't guarantee that it works on yours, nor can I provide direct support if something does not work. Please check carefully the versions of packages, dependencies, git commits, etc. It is quite possible that newer releases (of any dependency, package, core or 3rd party source code, tools) will break the build.

Step 0: prerequisites

To successfully compile Caffe2 and Detectron on Windows 10 with CUDA GPU support, the following pre-requisites are mandatory:

Windows 10: according to the official document, Windows 10 or greater is required to run Caffe2. I used Windows 10 Pro/Enterprise April 2018 Update (with all patches, fixes, updates up to September 30th, 2018).
Visual Studio 2017 (v15.8.4 or v15.8.5), with Visual C++, appropriate Windows SDKs and Python 3.6 dev tools installed: compiling Caffe2 requires a compatible C++ compiler. CUDA 8.0 only supports up to Visual Studio 2015 with Update 3. If you use Visual Studio 2017, you will be able to build with CUDA 9.x or CUDA 10.0 only. I used VS2017 Pro/Enterprise, but Community version should work as well.
CUDA 9.x/10.x with cuDNN 7.x installed: you can download and install them from nVidia's Developer website. I've tested the build with CUDA 9.2.148 x64 + Patch1 + cuDNN 7.2.1.38 and CUDA 10.0.130 x64 + cuDNN 7.3.0.29. Be extremely careful to not mix versions, and follow the official guides to install them. Check CUDA environment variables and verify they point to the right version you want to use to build. Also, update your nVidia drivers to the latest available and compatible version (check CUDA release notes). I used nVidia GTX 1080 drivers v411.63.
Python 2.7.x and VC++ Compiler Package: currently only Python 2.7 is supported (according to some comments/issues on GitHub, Python 3 will be supported soon). I installed Python 2.7.15 and Visual C++ Compiler Package for Python 2.7. I put Python 2.7 in c:\Python27, without adding it to the PATH environment variable. For commands that launch Python from outside a virtual environment, use the full path (e.g. c:\python27\python.exe) if python is not available from the command line.
Intel Math Kernel Library: to optimize CPU operations, when GPU is not supported. I used binaries v2018.3 x64 and the corresponding Python package.
OpenCV 3.x: I downloaded pre-built Windows x64 binaries for OpenCV 3.4.3. I installed them in c:\opencv.
CMake 3.12.x: required to configure and generate Visual Studio solutions. I used CMake 3.12.2 x64.
Visual Studio Code, with Python extension to edit text files, scripts and Python code. I used VSCode v1.27.2.
git: use git to clone source code repositories and dependencies. I used git for Windows 2.19.0 and TortoiseGit 2.7.0.

The dev machines used to test this guide are all equipped with Intel Core i7 7th gen CPU, 256GB SSD, 16/32GB RAM, nVidia GTX 1070/1080.
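Before going further, it can save time to check that the main prerequisites are where the build expects them. The following is only a minimal sketch: the function name `missing_prerequisites` is made up for illustration, the two paths are the install locations mentioned above, and `CUDA_PATH` is the variable the CUDA Toolkit installer normally sets on Windows. Adjust everything to your own setup.

```python
import os

# Install locations used in this guide; change them if yours differ.
EXPECTED_PATHS = {
    "Python 2.7": r"c:\Python27\python.exe",
    "OpenCV": r"c:\opencv\build",
}


def missing_prerequisites(environ, path_exists, expected_paths=EXPECTED_PATHS):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # CUDA_PATH should point to the CUDA version you want to build with.
    cuda_path = environ.get("CUDA_PATH")
    if not cuda_path:
        problems.append("CUDA_PATH is not set")
    elif not path_exists(cuda_path):
        problems.append("CUDA_PATH points to a missing folder: " + cuda_path)
    for name, path in sorted(expected_paths.items()):
        if not path_exists(path):
            problems.append(name + " not found at " + path)
    return problems


for problem in missing_prerequisites(os.environ, os.path.exists):
    print("WARNING: " + problem)
```

The environment and the path check are passed in as arguments so the function can be tried against fake values without touching the real machine.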

Step 1: Python Virtual Environment for Caffe2

Open a x64 Native Tools Developer Command Prompt
Clone Caffe2 from the official GitHub repository

At the moment, Caffe2 and PyTorch are being merged, and the official repo is now PyTorch.
```
> mkdir c:\projects\pytorch
> cd c:\projects\pytorch
> git clone --recursive https://github.com/pytorch/pytorch.git .
```
If you prefer, you can also clone the repo using TortoiseGit. Remember to check the recursive option.

I built successfully using the following commits from master branch:
- 21ed7e51b606f9912b658ab63e016d7546f3b75c (2018-09-26 10:44:03)
- 91b6458e2d0dba935da2cc7c2cdc6d7907bc3f48 (2018-09-18 10:11:55)
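
Since newer commits may break the build, you may want to pin your checkout to one of the commits listed above. This is a small sketch of the git commands involved, not something taken from the official guide; the helper name `pin_commands` is hypothetical.

```python
def pin_commands(commit):
    """Return the git commands (as argument lists) that pin a checkout to `commit`."""
    return [
        ["git", "checkout", commit],
        # Submodules have to be brought in line with the pinned commit too.
        ["git", "submodule", "update", "--init", "--recursive"],
    ]


# Print the commands for the most recent commit I built successfully.
for cmd in pin_commands("21ed7e51b606f9912b658ab63e016d7546f3b75c"):
    print(" ".join(cmd))
```

You can type the printed commands in the repository folder, or pass each list to `subprocess.check_call(cmd, cwd=r"c:\projects\pytorch")`.
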
Update pip to the latest version (I used v18.0)
```
> python -m pip install -U pip
```
Install virtualenv
```
> python -m pip install virtualenv
```
Create a new Python 2.7 virtual environment
```
> python -m virtualenv caffe2env
```
Activate the virtual environment
```
> caffe2env\Scripts\activate
```
To exit from the virtual environment, you can use:
```
(caffe2env)> deactivate
```
Install the required packages (from the official guide).

If you want to use Intel-MKL optimized numpy, get it from here and install it before any other package. I installed version 1.15.2.
```
(caffe2env)> pip install PATH_TO\numpy-1.15.2+mkl-cp27-cp27m-win_amd64.whl
```
Otherwise you can use the standard package:
```
(caffe2env)> pip install numpy
```
Then install the other dependencies:
```
(caffe2env)> pip install future
(caffe2env)> pip install hypothesis
(caffe2env)> pip install protobuf
(caffe2env)> pip install six
```
The official guide does not mention one more required package:
```
(caffe2env)> pip install typing
```
I used Ninja to speed up the build, because it enables parallel builds of the CUDA components (not yet supported when using MSBuild alone), as explained here. Without Ninja, the build takes almost 4 hours. With Ninja, it takes less than 1 hour.
```
(caffe2env)> pip install ninja
```
Our environment is now ready to build.
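
A quick way to confirm that the packages above landed in the active virtual environment is to try importing them. This is just a sketch (the helper name `missing_modules` is my own); note that the protobuf package is imported as `google.protobuf`.

```python
import importlib

# Import names of the packages installed in this step.
REQUIRED_MODULES = ["numpy", "future", "hypothesis", "google.protobuf", "six", "typing"]


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("Missing modules: %s" % (missing_modules(REQUIRED_MODULES) or "none"))
```

Run it with the virtual environment activated, otherwise it checks the system-wide Python instead.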

Step 2: configure `build_windows.bat`

In order to build Caffe2 in Windows, with all the required options, you have to edit the provided build script. It can be found in c:\projects\pytorch\scripts\build_windows.bat

First, you have to set some environment variables. They can also be set directly from the command line, but I preferred to put them in the build script, so that I don't have to retype them each time a new command prompt is opened.

Before line 29, add:

set CMAKE_GENERATOR=Ninja
set OpenCV_DIR=C:\opencv\build
set USE_CUDA=ON
set GEN_TO_SOURCE=1

The previous variables configure Ninja as generator (instead of Visual Studio 2017 x64), specify where OpenCV libs can be found and enable the support for CUDA. The last variable is needed to avoid an error during the build, when ATEN build script generates the corresponding source files.
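
If you rebuild often, or on several machines, you may prefer to apply this edit with a script rather than by hand. The sketch below only shows the line insertion itself on a stand-in list of lines; `insert_before_line` is a hypothetical helper, and the line number is valid only for the commits I used, so check the script before applying anything like this to the real file.

```python
# The four settings described above.
ENV_LINES = [
    "set CMAKE_GENERATOR=Ninja",
    "set OpenCV_DIR=C:\\opencv\\build",
    "set USE_CUDA=ON",
    "set GEN_TO_SOURCE=1",
]


def insert_before_line(lines, line_number, new_lines):
    """Return a copy of `lines` with `new_lines` inserted before the 1-based `line_number`."""
    if not 1 <= line_number <= len(lines) + 1:
        raise ValueError("line_number out of range")
    index = line_number - 1
    return lines[:index] + list(new_lines) + lines[index:]


# Demo on a stand-in script; the real file is scripts\build_windows.bat.
demo = ["@echo off", "rem existing line 2", "rem existing line 3"]
for line in insert_before_line(demo, 2, ENV_LINES):
    print(line)
```

For the real script, you would read it with `open(path).read().splitlines()`, call `insert_before_line(lines, 29, ENV_LINES)` and write the result back.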

Then change some settings for cmake.

At line 71, set the CUDA architecture to support. In my case, I wanted to run CUDA 9/CUDA 10 on GTX 1070/1080 so, according to this post, I had to use architecture 6.1 (leaving 5.0 didn't work). Please change this setting according to your GPU/CUDA version.
```
-DTORCH_CUDA_ARCH_LIST=6.1
```
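
If you are not sure which value to use, the idea is simply to look up the compute capability of each GPU you want to support. The table below is a small illustration, not a complete list: only the GTX 1070/1080 entries come from my own builds, the others are values published by nVidia, and the helper `arch_list_flag` is hypothetical.

```python
# Compute capability of a few GPUs. Note that 7.5 (Turing) needs CUDA 10.
COMPUTE_CAPABILITY = {
    "GTX 980": "5.2",
    "GTX 1070": "6.1",
    "GTX 1080": "6.1",
    "GTX 1080 Ti": "6.1",
    "Titan V": "7.0",
    "RTX 2080": "7.5",
}


def arch_list_flag(gpu_names):
    """Build the -DTORCH_CUDA_ARCH_LIST value covering the given GPUs."""
    # A set removes duplicates when several GPUs share one architecture.
    archs = sorted({COMPUTE_CAPABILITY[name] for name in gpu_names})
    return "-DTORCH_CUDA_ARCH_LIST=" + ";".join(archs)


print(arch_list_flag(["GTX 1070", "GTX 1080"]))
```

When the result contains more than one architecture, quote the whole option in the batch file, because of the semicolon.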
At line 80, enable OpenCV:
```
-DUSE_OPENCV=ON
```
At line 82, enable Python bindings:
```
-DBUILD_PYTHON=ON
```
At line 86, because the generator is now Ninja, remove the following MSBuild option, which Ninja does not support:
```
-- /maxcpucount:%NUMBER_OF_PROCESSORS%
```

Step 3: tweak some stuff here and there

To build successfully, some changes to the source code are also needed (future releases may include those changes, or may require different ones). In addition, some files have to be copied around, to make them available to the compiler and linker during the build.

Create build\caffe2 folder in c:\projects\pytorch
Create libs folder in c:\projects\pytorch\caffe2env
Copy c:\python27\libs\python27.lib to the 2 previously created folders:
- c:\projects\pytorch\build\caffe2
- c:\projects\pytorch\caffe2env\libs
Remove all AT_CPP14_CONSTEXPR from c:\projects\pytorch\aten\src\ATen\core\ArrayRef.h, to fix a Visual C++ 2017 compiler issue.
Replace a couple of "or" with "||" in c:\projects\pytorch\caffe2\image\image_input_op.h, to fix a compiler issue.
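
The AT_CPP14_CONSTEXPR removal is easy to get wrong by hand, so here is a minimal sketch of how it could be scripted. It assumes the macro is only used, not defined, in ArrayRef.h; the helper name `strip_macro` is made up. I would not script the "or" replacement the same way, since the word can also appear in comments.

```python
import re


def strip_macro(source, macro="AT_CPP14_CONSTEXPR"):
    """Remove every whole-word occurrence of `macro`, plus the spaces or tabs after it."""
    return re.sub(r"\b" + re.escape(macro) + r"\b[ \t]*", "", source)


# Example on a single declaration, in the style of the ones found in the header.
sample = "  AT_CPP14_CONSTEXPR const T& front() const {"
print(strip_macro(sample))
```

To apply it, read the header into a string, pass it through `strip_macro` and write it back, keeping a copy of the original file.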

In the Windows build, additional and custom operators can't be loaded dynamically (a bug? See GitHub issues here, and here). A possible workaround is to move Detectron into the built-in Caffe2 operators.

Copy or move modules\detectron to caffe2\operators\detectron
Remove the "detectron" subfolder from modules\CMakeLists.txt (comment out the whole conditional check related to MSVC and BUILD_SHARED_LIBS).

Change caffe2\operators\detectron\CMakeLists.txt similarly to caffe2\operators\RNN\CMakeLists.txt:

file(GLOB Detectron_CPU_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/*.cc)
file(GLOB Detectron_GPU_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/*.cu)

set(Caffe2_CPU_SRCS ${Caffe2_CPU_SRCS} ${Detectron_CPU_SRCS} PARENT_SCOPE)
set(Caffe2_GPU_SRCS ${Caffe2_GPU_SRCS} ${Detectron_GPU_SRCS} PARENT_SCOPE)

In caffe2\CMakeLists.txt add "detectron" subfolder (line 91):
```
add_subdirectory(operators/detectron)  
```
Fix an issue in Eigen CUDA source code, c:\projects\pytorch\third_party\eigen\Eigen\src\Core\arch\CUDA\Half.h. See what to change here.

Step 4: build

Now, everything should be ready to build Caffe2 successfully.

Open a x64 Native Tools Developer Command Prompt

Activate the virtual environment

> cd c:\projects\pytorch
> caffe2env\Scripts\activate

Run the build script
```
> scripts\build_windows.bat
```

The build will start and it will take some time (about 40 to 45 minutes on my machines, using Ninja; from 4 to 5 hours without). There will be a lot of warnings, which you can ignore, but there should be no errors. When the build process is finished, you will have Caffe2 with CUDA GPU support for Windows 10 ready in the c:\projects\pytorch\build\caffe2 folder.

Before I could use it, I had to manually copy some missing DLLs for Intel MKL and OpenCV.

From C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.3.210\windows\redist\intel64_win\compiler copy
- libiomp5md.dll
- libiomp5md.pdb
- libiompstubs5md.dll
to c:\projects\pytorch\build\caffe2\python
From C:\opencv\build\x64\vc15\bin copy
- opencv_world343.dll
to c:\projects\pytorch\build\caffe2\python
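
If you rebuild more than once, copying these files by hand gets tedious. The following sketch does the same copy and reports what it could not find; the folders and file names are the ones listed above, while the helper `copy_existing` is my own and not part of any tool.

```python
import os
import shutil


def copy_existing(src_dir, names, dst_dir):
    """Copy each file in `names` from src_dir to dst_dir; return (copied, missing)."""
    copied, missing = [], []
    for name in names:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(dst_dir, name))
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing


TARGET = r"c:\projects\pytorch\build\caffe2\python"
SOURCES = [
    (r"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2018.3.210"
     r"\windows\redist\intel64_win\compiler",
     ["libiomp5md.dll", "libiomp5md.pdb", "libiompstubs5md.dll"]),
    (r"C:\opencv\build\x64\vc15\bin", ["opencv_world343.dll"]),
]

for src_dir, names in SOURCES:
    copied, missing = copy_existing(src_dir, names, TARGET)
    print("copied %s, missing %s" % (copied, missing))
```

The target folder must already exist, which is the case once the build has finished.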

Step 5: test

To verify that your build with CUDA support is working, you can test with the following command:

> python -c "from caffe2.python import workspace; print(workspace.NumCudaDevices())"

It should print a number greater than 0. If it doesn't, or if any other error occurs, your Caffe2 build has some issue (e.g. wrong build configuration, missing dependencies, missing DLLs, no supported CUDA device available, etc.). Search the Internet, StackOverflow and the GitHub repo issues for ideas and clues on how to solve it.
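
For a slightly stronger check, you can also run one operator on the GPU. This is a minimal sketch that assumes the standard Caffe2 Python API (`core.DeviceOption`, `workspace.FeedBlob`, `core.CreateOperator`, `workspace.RunOperatorOnce`, `workspace.FetchBlob`) is available in your build; the function name `check_caffe2_gpu` is my own.

```python
def check_caffe2_gpu():
    """Return (ok, message) telling whether Caffe2 can run an operator on GPU 0."""
    try:
        import numpy as np
        from caffe2.proto import caffe2_pb2
        from caffe2.python import core, workspace
    except ImportError as exc:
        return False, "import failed (missing DLL or wrong environment?): %s" % exc

    if workspace.NumCudaDevices() < 1:
        return False, "Caffe2 imported, but no CUDA device was found"

    # Feed a small tensor to GPU 0, apply Relu there and fetch the result.
    device = core.DeviceOption(caffe2_pb2.CUDA, 0)
    data = np.array([-1.0, 2.0], dtype=np.float32)
    workspace.FeedBlob("x", data, device_option=device)
    op = core.CreateOperator("Relu", ["x"], ["y"], device_option=device)
    workspace.RunOperatorOnce(op)
    result = workspace.FetchBlob("y")

    if not np.allclose(result, [0.0, 2.0]):
        return False, "Relu on GPU returned an unexpected result: %s" % result
    return True, "Relu ran on GPU 0"


ok, message = check_caffe2_gpu()
print("%s: %s" % ("OK" if ok else "FAILED", message))
```

Save it to a file and run it with the virtual environment activated, from a folder where the caffe2 package can be imported.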

That's all for now. It is quite a long guide, but it should contain everything you need to obtain a Caffe2 build with all the components required to build and run Detectron. In the next post, I'll show you how to do that on Windows. Stay tuned!