Journey to GAN

Initial Goals

The original intention was to get a Generative Adversarial Network (GAN) generating images on an Arch Linux system. Here are the hurdles encountered along the way to installing Keras, CUDA (Compute Unified Device Architecture), cuDNN (CUDA Deep Neural Network library), and TensorFlow on Arch Linux.

Installing CUDA

At the time of writing, TensorFlow did not officially support Arch Linux as an operating system. However, all the necessary packages could still be found through the official and community repositories.

Following the GPGPU article from the Arch Linux wiki, CUDA could be installed by running a single command:

sudo pacman -S nvidia cuda cudnn

This installed CUDA 11.1.1 and cuDNN 8.0.4. After a successful installation and reboot, the sample files could be tested with the installed nvcc compiler. These samples were located in /opt/cuda/samples. To verify the installation, run:

cp -r /opt/cuda/samples ~/samples
cd ~/samples/1_Utilities/deviceQuery
make
./deviceQuery

However, this resulted in an error:

code=999(cudaErrorUnknown) cudaGetDeviceCount(&device_count)

Debugging Installation Issues

While debugging why CUDA did not recognize the device, I mistakenly uninstalled the nvidia package, which led to several hours of trying to get an X server to start. The messages encountered included:

Reached target Graphical Interface
Failed to initialize the NVIDIA kernel module

Running a full system upgrade resolved both the X server crash and the CUDA device-not-found error:

sudo pacman -Syu
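
It is not clear from the steps above why a full upgrade fixed both problems. One common cause on Arch, assumed here rather than confirmed, is a partial upgrade: the nvidia package ships a kernel module built for the newest kernel in the repositories, so an older running kernel has no matching module to load. The following Python sketch checks for that condition. The helper name find_nvidia_modules and the default path /usr/lib/modules are illustrative assumptions.

```python
import platform
from pathlib import Path
from typing import List


def find_nvidia_modules(release: str, modules_root: str = "/usr/lib/modules") -> List[str]:
    """Return paths of nvidia.ko* files installed for the given kernel release."""
    kernel_dir = Path(modules_root) / release
    if not kernel_dir.is_dir():
        # No module tree at all for this kernel release.
        return []
    # Matches nvidia.ko as well as compressed variants such as nvidia.ko.xz.
    return sorted(str(path) for path in kernel_dir.rglob("nvidia.ko*"))


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    modules = find_nvidia_modules(release)
    if modules:
        print(f"NVIDIA module found for running kernel {release}:")
        for path in modules:
            print("  " + path)
    else:
        print(f"No NVIDIA module found for running kernel {release}")
```

If the script reports no module for the running kernel, upgrading the kernel and the nvidia package together and then rebooting is the usual remedy.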

Installing TensorFlow

With CUDA working, the next step was to install TensorFlow and Keras:

pip install tensorflow-gpu keras

However, TensorFlow was unable to locate the graphics card. This was fixed by installing the tensorflow-cuda package directly from the Arch repositories:

pip uninstall tensorflow-gpu
sudo pacman -S tensorflow-cuda

This resolved most library issues, except for libcusolver.so.10:

ImportError: libcusolver.so.10.0: cannot open shared object file: No such file or directory

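Setting LD_LIBRARY_PATH did not resolve this issue, which suggests that the library version TensorFlow wanted was not installed at all, rather than merely missing from the search path: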

export LD_LIBRARY_PATH=/opt/cuda/extras/CUPTI/lib64
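
To tell a search-path problem from a version mismatch, it helps to list which versions of a library are actually on disk. The following Python sketch does this for libcusolver. The helper find_shared_libs is hypothetical, and /opt/cuda/lib64 is an assumption about where the Arch cuda package places its libraries.

```python
from pathlib import Path
from typing import List


def find_shared_libs(stem: str, lib_dirs: List[str]) -> List[str]:
    """Return every file named <stem>.so* found directly in the given directories."""
    found: List[str] = []
    for directory in lib_dirs:
        path = Path(directory)
        if path.is_dir():
            found.extend(str(lib) for lib in sorted(path.glob(f"{stem}.so*")))
    return found


if __name__ == "__main__":
    # Assumed locations; adjust them to match the local CUDA installation.
    search_dirs = ["/opt/cuda/lib64", "/opt/cuda/extras/CUPTI/lib64"]
    libs = find_shared_libs("libcusolver", search_dirs)
    if libs:
        for lib in libs:
            print(lib)
    else:
        print("No libcusolver shared objects found in", search_dirs)
```

If only a different major version turns up (for example a libcusolver.so.11 but no libcusolver.so.10), no LD_LIBRARY_PATH setting will help, because the TensorFlow build and the installed CUDA toolkit simply do not match.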

Using Docker

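Docker offered a viable solution, as the official TensorFlow containers bundle CUDA and cuDNN versions that match the TensorFlow build, leaving only the NVIDIA driver to be provided by the host. The --gpus flag requires the NVIDIA Container Toolkit to be installed on the host. The following commands set up a fully functional development environment: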

docker run --gpus all --rm nvidia/cuda nvidia-smi
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
docker run --gpus all -it tensorflow/tensorflow:latest-gpu bash
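
Once inside the container, a short Python check can confirm that TensorFlow sees the GPU before any model code is written. This is a minimal sketch; visible_gpus is a hypothetical helper name, while tf.config.list_physical_devices is part of the TensorFlow 2 API.

```python
from typing import List, Optional


def visible_gpus() -> Optional[List[str]]:
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow could not be imported in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", ", ".join(gpus))
    else:
        print("TensorFlow imported, but no GPU is visible")
```

An empty list inside the container usually points at the host side (driver or container toolkit) rather than at TensorFlow itself.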

Building the GAN

After installing the required pip packages, the environment was ready for building a GAN model. The model was based on an article that walked through the approach in detail.

Conclusion

Despite the hurdles, an effective way to run a GAN using TensorFlow, Keras, and CUDA on Arch Linux was found. By using Docker containers, driver and library incompatibilities were almost completely avoided, allowing for a smooth development process.