
Thoughts and Software

Gregory Diamos

Welcome to my personal website.

I am a practicing Electrical Engineer focused on the development of new computer technologies. I have cross-disciplinary expertise in microprocessor architecture (superscalar and SIMD), compilers (dynamic), system software, task scheduling runtimes, database systems, circuit design, discrete and mathematical optimization, machine learning, and algorithms and data structures. I am currently employed by Baidu as a Senior Researcher in the Silicon Valley AI Lab.

This website covers my research interests, open source projects, and contact information. It is not affiliated with any company or organization.

My Research...


Tinkering With GPU Accelerated Deep Neural Networks

Check out the source code that I'm describing.

After being inspired by a talk by Yann Lecun on multiclass image classification, I thought that it would be interesting to see how difficult it is to apply techniques for training Deep Neural Networks to some real-world image classification problems. The ability to accurately perform vision and audio classification tasks seems like it has the potential for enormous impact on multiple industries, yet there are not many consumer-facing products that employ this technology. So I thought I would try to reproduce a number of published results to see how robust this technology really is.

Sparse Autoencoder Layer 1 Features

As a side benefit, it gives me an excuse to gain experience exploring the performance characteristics of Neural Networks running on GPUs, and to learn more about global optimization (a field that has always interested me).

I was able to quickly reproduce published results for face and digit recognition with fully connected networks, but realized that existing open source frameworks are either not very flexible or not very performant (Matlab/Python). If I really wanted to experiment with the concepts of sparsity, deep architectures, and local connectivity, I would need to write something from scratch in C++/CUDA.

Image and video classification systems are typically implemented in two stages: feature extraction and statistical classification. Feature extraction identifies patterns in the image that are useful for performing the given classification task. Low-level image features are typically edges or corners, and these have historically been computed using domain specific algorithms like SIFT. These features are then fed into a classification model, which predicts the classification of the image using a statistical model of the features. Common examples of classification models are linear or logistic regression, support vector machines, decision trees, random forests, and neural networks.
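To make the two-stage structure concrete, here is a minimal interface sketch; the type and function names are purely illustrative and not taken from any particular framework.

#include <vector>

// Hypothetical interfaces for the two stages described above.
struct FeatureExtractor
{
    virtual std::vector<float> extract(const std::vector<float>& image) const = 0;
    virtual ~FeatureExtractor() {}
};

struct Classifier
{
    virtual int predict(const std::vector<float>& features) const = 0;
    virtual ~Classifier() {}
};

// Classification is just the composition of the two stages.
inline int classify(const FeatureExtractor& extractor, const Classifier& classifier,
                    const std::vector<float>& image)
{
    return classifier.predict(extractor.extract(image));
}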

Sparse Autoencoder Layer 3 Features

One of the reasons why Deep Neural Networks are so interesting is that they can perform feature selection automatically without the use of domain specific algorithms. Some papers claim that you can more or less have a network "watch" a stream of images and it will automatically configure itself to identify useful features. I was quite skeptical of this claim until I tried it out and was able to reproduce it reliably. The first image shows visualizations of the first level features learned by a single layer neural network (with a sparse cost function) trained on unlabeled 8x8 tiles from a few thousand random black and white images. Each tile in the image represents the pattern that an individual neuron maximally responds to. Notice that they are edge detectors.
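The post doesn't spell out the sparse cost function, so as a rough sketch, here is the KL-divergence sparsity penalty that is commonly added to an autoencoder's reconstruction error; the function name and parameters are my own.

#include <cmath>
#include <vector>

// rhoHat[j] is the mean activation of hidden unit j over the training set,
// rho is the target sparsity (e.g. 0.05), and beta weights the penalty
// relative to the reconstruction error.
double sparsityPenalty(const std::vector<double>& rhoHat, double rho, double beta)
{
    double penalty = 0.0;
    for (double r : rhoHat)
    {
        penalty += rho * std::log(rho / r)
                 + (1.0 - rho) * std::log((1.0 - rho) / (1.0 - r));
    }
    return beta * penalty;
}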

The next image shows the features learned by a three layer network trained on color images of cats and dogs in natural (noisy) environments. If you zoom in, you can see that each tile is composed of a combination of edge detector tiles and represents a more complex pattern. Training these networks is extremely computationally intensive, but luckily they can be expressed entirely as block-sparse matrices and operated on with well-known linear algebra operations. So they can take advantage of modern parallel processors and are almost a perfect fit for GPUs. It is possible to train moderately sized networks in a few minutes on a laptop.
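As a sketch of why these networks map so well onto GPUs, here is the forward pass of a single dense (fully connected) layer written as one cuBLAS matrix multiply. The shapes, names, and column-major layout are assumptions for illustration; the block-sparse case follows the same pattern with a tiled or batched multiply, and error checking is omitted.

#include <cublas_v2.h>

// Computes activations = weights * inputs, where weights is (outputSize x inputSize)
// and inputs is (inputSize x batchSize), all column-major and already in device memory.
void forwardPass(cublasHandle_t handle,
                 const float* weights, const float* inputs, float* activations,
                 int outputSize, int inputSize, int batchSize)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                outputSize, batchSize, inputSize,
                &alpha,
                weights,     outputSize,
                inputs,      inputSize,
                &beta,
                activations, outputSize);
}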

Overall these results seem quite promising, and I'm excited to try out these networks on more challenging classification problems. In his talk, Yann Lecun showed an example of a deep convolutional network that was able to accurately identify objects in a live video stream. This is the next experiment that I will try to reproduce.

A Different Way to Write GPU Applications

Check out the source code that I'm describing.

When most people think of GPU Computing, they think of heterogeneous applications that run on a combination of a CPU and a GPU. This allows programmers to use single threaded algorithms on the CPU and parallel algorithms on the GPU, but they must accept the extra complexity associated with compiling for different types of processors with different memory spaces. In general I think that complexity is justified for performance critical applications, but most applications are not performance critical. Much of the time it is acceptable to write an application with good asymptotic algorithmic complexity whose performance scales well on future (faster) hardware. Then, if performance is an issue, specific pieces of the application can be tuned. For these applications, the explicit control over the CPU and GPU portions of code becomes a handicap because dealing with it is a requirement, not an option. For many applications, I think that it would be more desirable to just think about parallelism, synchronization, and data sharing, which are hard enough problems without heterogeneity.

So what would be ideal? For a C++ enthusiast like me, the best case scenario would be to allow me to write applications in vanilla C++11 with two simple extensions.

  • Parallelism (an API call to launch an SPMD kernel over a function pointer).
  • Synchronization (an API call for a barrier over SPMD threads in a group).

With these extensions, a C++ program would begin with a single thread entering main, memory would be managed with malloc/new and drawn from a single shared address space, and the use of parallelism (or not) would be up to the programmer. Heterogeneity would be abstracted away and the programmer would be given the perspective that he is running on a single processor with a large number of cores. If the two previously mentioned extensions were implemented in a library, there would be no need for a domain specific compiler - an off-the-shelf C++ compiler would work. More importantly, it would be possible to compile existing libraries as is, and call them from the parallel code.
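As a rough illustration of what those two extensions might look like, here is a CPU-only sketch backed by plain C++11 threads; the launchKernel and Group::barrier names are hypothetical, not an existing library.

#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class Group
{
public:
    explicit Group(unsigned threads) : _threads(threads), _waiting(0), _generation(0) {}

    // Barrier over all SPMD threads in the group.
    void barrier()
    {
        std::unique_lock<std::mutex> lock(_mutex);
        unsigned generation = _generation;
        if (++_waiting == _threads)
        {
            _waiting = 0;
            ++_generation;
            _condition.notify_all();
        }
        else
        {
            _condition.wait(lock, [&] { return generation != _generation; });
        }
    }

private:
    unsigned _threads;
    unsigned _waiting;
    unsigned _generation;
    std::mutex _mutex;
    std::condition_variable _condition;
};

// Launch 'threads' copies of 'kernel' SPMD-style; each copy gets its thread id
// and a reference to the group for barriers.
void launchKernel(unsigned threads, const std::function<void(unsigned, Group&)>& kernel)
{
    Group group(threads);
    std::vector<std::thread> workers;
    for (unsigned id = 0; id != threads; ++id)
    {
        workers.emplace_back([&, id] { kernel(id, group); });
    }
    for (auto& worker : workers) worker.join();
}

int main()
{
    // A single thread enters main and launches a parallel kernel.
    launchKernel(8, [](unsigned id, Group& group)
    {
        std::printf("hello from thread %u\n", id);
        group.barrier();
    });
    return 0;
}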

So why isn't it possible to do this today?

Well it is possible to do this today and run applications on a CPU. However, that isn't interesting because it doesn't allow me to accomplish my primary goal of implementing forward scalable parallel algorithms that run well on systems with hundreds or thousands of cores.

So instead, why isn't it possible to do this today on a GPU?

Well, it turns out that it is possible to do most of this today if you are willing to be creative in your use of existing tools, but you need to address the following real or perceived issues.

  • Compilers that target GPUs don't implement all of C++11.
  • GPUs don't provide basic functionality that programs expect and that is normally accessed through the standard library (file IO, system calls, etc.).
  • Performance of single threaded code running on the GPU will be so bad that it drags down the rest of the program.
  • GPUs need a CPU for some functionality.

So let me address these one at a time.

GPU Compilers Can't Compile C++11

Well most CPU compilers can't either (can any?). So the more important question is "can they compile enough of C++11 to be useful?". The most mature GPU compilers that handle C-like languages are NVCC, command-line wrappers around OpenCL JIT compilers, and a bunch of other options that are far less stable (I'm hoping that CLANG->LLVM->PTX will get some better support, but it can't even compile my "Hello World" code yet). So I'm still sticking with NVCC.

NVCC ends up working out very well. It supports many useful C++ features that were not available on GPUs before NVCC implemented them.

  • Basic C/C++ statements and control flow.
  • Function calls.
  • Separable compilation (plus a linker).
  • Correct handling of classes/polymorphism/new/delete.
  • Template metaprogramming.
  • Kernel launches from the GPU (added in Kepler).

It does have some notable missing features that need to be worked around.

  • All functions need to be tagged with __device__ (which is usable but highly cumbersome if you don't want to address heterogeneity; see the short example after this list).
  • No exceptions.
  • No global constructors/destructors in device code.
  • No variadic functions/templates other than printf.
  • No rvalue references.
  • Several other minor features (constexpr, lambdas, static_assert, etc).
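To illustrate the first point (this is a made-up example, not code from the project): any helper that device code calls has to carry the qualifier, so sharing ordinary C++ utilities between host and device means annotating every one of them.

// A plain utility function has to be tagged before a kernel can call it.
__host__ __device__ int clamp(int value, int low, int high)
{
    return value < low ? low : (value > high ? high : value);
}

__global__ void clampKernel(int* data, int size, int low, int high)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size)
    {
        data[index] = clamp(data[index], low, high);
    }
}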

All in all, other than the need to tag everything with __device__, NVCC has better C++ support than many CPU compilers (out of compilers I use regularly, only GCC, CLANG, and MSVS are clearly better).

GPUs Can't Implement The Standard Library

Even with decent compiler support in NVCC, most C++ applications lean heavily on the standard library to avoid re-inventing the wheel and to access system level functionality (most significantly File IO) in a platform independent way.

There isn't a great workaround for this one. I had to reimplement a lot of functionality in the Standard Library to get around this. Hopefully, it only has to be done once and will eventually ship with a GPU compiler.

File IO was particularly bad and involved doing remote accesses through the CPU. However, I was able to get this working, and it should be possible to push all of the complexity into a low level library that the standard library (FILE and iostream) calls into.

Performance of Single Threaded Code Will Be Too Slow

This is a valid concern, but it is something that I hope future systems will resolve automatically over time (e.g. by running single threads on fast cores and parallel kernels on GPU cores). Today, I can live with the latency needed for a single thread to enter main and launch a parallel kernel.

GPUs Need a CPU For Some Functionality

Surprisingly, this wasn't an issue at all. It was fairly easy to write a CPU program that loads a GPU binary, looks up main, and kicks off a single thread. I did this using Ocelot, but it would be possible to do the same thing using CUDA driver calls. It was so straightforward, in fact, that I was able to add a build rule to SCons for GPU applications that builds the device code, embeds it in a vanilla loader program, and builds the loader with a native compiler to run on the CPU. When the loader starts, it extracts the GPU binary and kicks off the main function, giving the illusion that the program never interacts with the CPU at all.
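For reference, a bare-bones sketch of that loader using the CUDA driver API instead of Ocelot might look like the following; the module path and the "main_kernel" entry point name are placeholders, and error checking is omitted.

#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice device;
    cuDeviceGet(&device, 0);

    CUcontext context;
    cuCtxCreate(&context, 0, device);

    CUmodule module;
    cuModuleLoad(&module, "application.cubin");          // the embedded GPU binary

    CUfunction entry;
    cuModuleGetFunction(&entry, module, "main_kernel");  // the device-side 'main'

    // Kick off a single thread; everything else happens on the GPU.
    cuLaunchKernel(entry, 1, 1, 1, 1, 1, 1, 0, 0, nullptr, nullptr);
    cuCtxSynchronize();

    cuModuleUnload(module);
    cuCtxDestroy(context);
    return 0;
}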

With an implementation of the Standard Library, there was no need to call back to the CPU to do anything.

Summing It Up

Although there are a few minor issues, it is largely possible to run native C++ applications completely on a GPU without any CPU interaction other than to kick off the first kernel.

Single-Source Longest Paths on Control Flow Graphs

I recently ran into an interesting problem that I feel like sharing. Given an arbitrary directed graph (like a control flow graph), find the set of longest acyclic paths from a single node to all other nodes.

This is not quite the same as the Hamiltonian path problem, since a subsection of a Hamiltonian path may not be the longest path to another node along the path. After not being able to find any obvious related work, here's what I came up with:


Use two data structures, an N-ary tree and a hash table that tracks the mapping from any node to its position in the tree. Start with the single node as the root of the tree, and visit it. For each successor of the node being visited, see if there is an entry for it in the hash table. If there is no entry, add it to the tree as a child of the current node, add an entry to the hash table, and visit it immediately (as in a DFS). If an entry already exists, check whether the length of the path to the current node plus one is greater than the length of the existing path to the successor (the length should be stored with the node in the tree). If it is greater, then we have found a potentially longer path, as long as it is not part of a cycle.

To determine whether it is, check whether the successor is already an ancestor of the current node in the tree (by walking up the tree). If it isn't part of a cycle, take the entire subtree rooted at the successor and move it to be a child of the current node, updating the distances of the subtree nodes and their hash table entries (the hash table updates can be elided if the entries are pointers).


At the end, the tree contains the longest path from the root to all other nodes. For anyone who is interested, there's an implementation of this algorithm that is used in the ThreadFrontierAnalysis in gpuocelot - thread-frontier-analysis
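For anyone who wants something more concrete than the prose, here is a compact sketch of the procedure described above. The graph is given as adjacency lists over integer node ids; the container choices and names are mine, and this is not the gpuocelot code.

#include <cstddef>
#include <unordered_map>
#include <vector>

struct TreeNode
{
    int                    id;
    std::size_t            length;   // length of the longest known path from the root
    TreeNode*              parent;
    std::vector<TreeNode*> children;
};

class LongestPaths
{
public:
    explicit LongestPaths(const std::vector<std::vector<int>>& successors)
        : _successors(successors) {}

    // Build the tree of longest acyclic paths from 'root'.
    // Ownership/cleanup of the allocated nodes is omitted for brevity.
    TreeNode* run(int root)
    {
        TreeNode* node = new TreeNode{root, 0, nullptr, {}};
        _position[root] = node;
        visit(node);
        return node;
    }

private:
    void visit(TreeNode* node)
    {
        for (int successor : _successors[node->id])
        {
            auto existing = _position.find(successor);
            if (existing == _position.end())
            {
                // First time we see this node: add it as a child and recurse (DFS).
                TreeNode* child = new TreeNode{successor, node->length + 1, node, {}};
                node->children.push_back(child);
                _position[successor] = child;
                visit(child);
            }
            else if (node->length + 1 > existing->second->length &&
                     !isAncestor(existing->second, node))
            {
                // Found a longer path that does not close a cycle: re-parent the
                // subtree rooted at the successor under the current node and
                // update the lengths in that subtree.
                TreeNode* moved = existing->second;
                detach(moved);
                moved->parent = node;
                node->children.push_back(moved);
                updateLengths(moved, node->length + 1);
            }
        }
    }

    static bool isAncestor(const TreeNode* candidate, const TreeNode* node)
    {
        for (const TreeNode* current = node; current != nullptr; current = current->parent)
        {
            if (current == candidate) return true;
        }
        return false;
    }

    static void detach(TreeNode* node)
    {
        auto& siblings = node->parent->children;
        for (auto i = siblings.begin(); i != siblings.end(); ++i)
        {
            if (*i == node) { siblings.erase(i); break; }
        }
    }

    static void updateLengths(TreeNode* node, std::size_t length)
    {
        node->length = length;
        for (TreeNode* child : node->children) updateLengths(child, length + 1);
    }

    const std::vector<std::vector<int>>& _successors;
    std::unordered_map<int, TreeNode*>   _position;
};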

Geek points for anyone who emails me an equivalent weakly-scalable parallel algorithm.

Simulated SAXPY clocks in at 1.3 GIPS:

As a side project, my wife and I have been toying with the idea of what it would take to build a complete software infrastructure on a GPU, including an architecture simulator, compiler, and runtime. Vanaheimr is a project that attempts just this, by implementing a simulator, compiler, and runtime in 100% CUDA, with NO host code at all other than to kick off the first kernel.

Rather than build everything from the ground up, our approach was to start with a few small case studies, the first of which involved simulating a SAXPY program. The SAXPY kernel is written in assembly and executed by a GPU simulator running in CUDA on a C2050 GPU. The first results look very promising, with a functional simulator achieving 1.3 giga-instructions per second, about 1000x faster than existing CPU or FPGA based simulators.
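For reference, this is what the simulated workload computes; the case study writes the kernel in Vanaheimr's own assembly language, but a plain CUDA version of SAXPY is just:

// y = a * x + y, one element per thread.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        y[i] = a * x[i] + y[i];
    }
}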

This is only a simple example and not a full simulator, but it shows that the general approach is promising. The next steps involve building pieces of a compiler that can compile existing applications into Vanaheimr's new assembly language and fleshing out the simulator to support the entire ISA.

Worst case complexity of computing DAG criticality:

I usually shy away from theory problems, but I also often find myself trying to design systems that are robust to worst-case behavior. This leads me to issues like the following:

Assume that we have a directed acyclic graph representing dependent tasks, and we want to schedule these tasks on an unlimited number of processors while satisfying the dependencies in the DAG. Also assume that we know the execution time of each task a priori. A first step is to examine the complete set of solutions (the complete set of possible schedules). I think that a step towards bounding the space can be made by answering the following question:

I have a DAG with N vertices and an arbitrary number of non-duplicate edges. What is the maximum number of unique paths in such a graph, where a unique path is defined as a path that is not partially contained within any other path?

Ocelot 1.3.967 and 2.0.969 Released

Ocelot 2.0.969 supports Fermi devices and introduces a series of new features listed below:

  • Full PTX 2.2 support, including function pointers and recursion.
  • Lazy just-in-time compilation for the x86 backend.
  • Ocelot server and remote device client.

Ocelot 1.3.967 is the last release that supports PTX 1.x devices (G80 and GT200 series).

Ocelot at GTC 2010

We recently gave an overview of Ocelot at GTC 2010. NVIDIA was kind enough to make a webcast of the talk available.

This presentation covers the high level vision of the project, what other researchers can use Ocelot for, and the currently active sub-projects. This concludes the current round of travel and presentations so I can devote the majority of my time to the project again.

Design and Implementation of Ocelot at PACT 2010

Our paper, Ocelot: A Dynamic Compiler for Bulk-Synchronous Applications in Heterogeneous Systems was accepted at The Nineteenth International Conference on Parallel Architectures and Compilation Techniques.

Also, I am trying to be more open with the development cycle for all of the work that I do. In that spirit, I am going to start including the full reviews for all of my papers, starting with this one. I may also go back and try to dig up old reviews, depending on whether or not I can still find them. [Reviews]

Ocelot Overview at NVIDIA [pdf]

I finally brought this server back up yesterday after a cross country journey from Atlanta to San Jose.

Also, I am giving a presentation on Ocelot to NVIDIA and NVIDIA Research this Friday.

Abstract: "Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multi-threaded platforms. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86 and other ISAs. The dynamic compiler is able to execute existing CUDA binaries without recompilation from source and supports switching between execution on an NVIDIA GPU and a many-core CPU at runtime. It has been validated against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks, the Virginia Rodinia benchmarks, the GPU-VSIPL signal and image processing library, the Thrust library, and several domain specific applications. This talk provides a high level overview of Ocelot, focusing on its use as a tool for performance analysis, workload characterization, CUDA debugging, architecture simulation, dynamic optimization, and transparent execution of CUDA applications on CPUs and GPUs."

Index of MSB Bug

Stefanus Du Toit offered a potential explanation as to why the previous code failed to produce correct results when compiled with -O3. He writes: "The most likely reason is the cast from &ff to an (unsigned long*). This breaks the strict aliasing rules in C/C++, which state that in general two pointers of different types never alias (with exceptions for char* and void*). GCC's -fno-strict-aliasing can be used to test this (it should work with that option enabled). There's no fully portable standards-compliant way to do this in C/C++, but all the major compilers accept the following, which is based on implementation-defined behavior related to unions:

template<typename D, typename S>
D bitwise_cast(S s) {
    union { D d; S s; } u;
    u.s = s;
    return u.d;
}

Then you would do:

return (bitwise_cast<unsigned long>(ff) >> 23) - 127;

While I'm staring at the code, you could avoid the branch with something like:

float ff = (v | 1) >> (v < 0x01000000 ? 0 : 16);

and a similar adjustment for the +16. On SIMD ISAs this should just result in some masking code. Haven't counted what that would do to the instruction count though. FWIW, I plan to propose a bitwise_cast like this in C++ post-C++11."

Modelling GPU-CPU Workloads and Systems at GPGPU-3

Our paper, Modeling GPU-CPU Workloads and Systems was accepted at the Third Workshop on General-Purpose Computation on Graphics Processing Units.

Ocelot 1.0.265-Alpha Released

Ocelot now supports native execution of CUDA applications on GPUs and CPUs without recompilation.

Faster Index of MSB

Steve Worley thinks that this is faster than the code I posted before. It relies on the exponent computation of an IEEE754 floating point number. Compiling this code produces 8 x86 instructions on the most likely path; my example used 22 instructions. However, Steve's code relies on two branches and two floating point conversion instructions, whereas my example included only shifts, comparisons, and conditional selects. The performance of these depends on branch predictor accuracy and the latency of the FP conversion instructions, although I would imagine that Steve's code is faster. Also, I couldn't get Steve's code to produce correct results with -O3 optimization enabled; I need to take a look at the machine code to figure out why. I may also get around to benchmarking these if I have a bit of free time.

inline ulong highestBitIndex(ulong v) {
    if (v < 0x01000000) {
        // Convert to float and read the IEEE754 exponent field.
        float ff = (float)(v | 1);
        return ((*((unsigned long *)&ff)) >> 23) - 127;
    } else {
        // Handle larger values by shifting down 16 bits first and
        // compensating in the result.
        float ff = (float)(v >> 16);
        return ((*((unsigned long *)&ff)) >> 23) - 127 + 16;
    }
}

Fast Index of MSB

I keep telling people that this is a fast way to find the position of the msb of an int.

unsigned int msb32(unsigned int x)
{
    int position = 0;

    // Binary search over the bit ranges: test the upper half at each step,
    // accumulate the offset, and shift the value down.
    int offset = ( x & 0xffff0000 ) ? 16 : 0; position  = offset; x >>= offset;
    offset     = ( x & 0x0000ff00 ) ?  8 : 0; position += offset; x >>= offset;
    offset     = ( x & 0x000000f0 ) ?  4 : 0; position += offset; x >>= offset;
    offset     = ( x & 0x0000000c ) ?  2 : 0; position += offset; x >>= offset;
    offset     = ( x & 0x00000002 ) ?  1 : 0; position += offset; x >>= offset;

    return position;
}


New Website Template

I recently switched to a new website design that is hopefully cleaner and more straightforward than the previous one. It may take a week or so before everything settles, so please excuse any broken links.