Introduction
Since I’m on holiday break from University, I’ve had time to begin investigating parallel processing. I’m going to try to share a little of what I’ve learned about the technology and how programming languages can leverage multi-core CPUs and GPUs. I’ll finish up by explaining a bit about task-decomposition, which is an important aspect of writing algorithms intended for parallel processing.
My investigation began one night as I was cooking dinner and watching a talk by Rich Hickey, the inventor of Clojure, titled “Are we there yet?”. At one point, while Mr Hickey was discussing parallel processes, the question “how does a GPU process pixels?” popped into my head. Surely that must be parallel, right? Since my “Intro to comp-sci” course last semester didn’t get into graphics processing, I only had the mental model of a (very basic) single-core CPU in my head. So I asked myself, “If a GPU might process things in parallel, could that be used by a high-level language to process data?”. Well, it turns out that it can!
(Note: If you’re unsure why parallel processing and concurrency are important, then check out this article from Herb Sutter.)
------------------------------
Overview:
- General Purpose GPUs
- Low-Level Language: OpenCL
- High-Level Aspirations
- Back to reality: Low-level Heterogeneous computing and C++
- Granularity
- Tiling
- Fusion
- Fork/Join Framework
- Best case for Task-Decomposition: Map
- There Be Dragons!
------------------------------
General Purpose GPUs
After investigating, I discovered that GPUs can have thousands of cores, which together can process immense amounts of data in parallel. This blew me away, since I’m so accustomed to the single-threaded nature of programming languages like C and JavaScript. So I obsessively sought more information, looking particularly for any functional-language integration with GPUs, because the paradigm seems perfectly suited for parallel processing (mapping a function over a collection).
This led me to the world of GPGPUs, or General Purpose Graphics Processing Units. Basically, manufacturers like Nvidia had always specialized GPUs to process pixel information with what are called shaders. This means they weren’t originally designed for handling general computations the way a CPU is. Eventually people like Ian Buck (the inventor of CUDA) came along and hacked a GPU’s pixel-processing capabilities into a scheme for doing general computing. After Buck proved that GPUs could do some extremely useful general-computing tasks, vendors began to modify their hardware to support general-purpose computing.
Low-Level Language: OpenCL
CUDA is probably the most popular platform for programming a GPGPU, but Nvidia isn’t the only vendor out there with hardware that supports massive parallelism. Not only that, but multi-core CPUs themselves can be treated as parallel processors, so there is a strong need for a universal way to create parallel programs across the unique assortment of available processors.
Khronos is a standards consortium that has worked on the open standard for parallel processing, named OpenCL, for Open Computing Language. They were previously responsible for the OpenGL standard (Open Graphics Library), which unified graphics processors behind a single standard for computing pixels and shaders. OpenCL is referred to as a platform for Heterogeneous computing, meaning that it gives expert programmers a means of applying computation across the entire range of computational devices (multi-core CPUs, GPGPUs, DSPs, etc.) connected within a computer. The standard is now in its second major version, but currently OpenCL 1.1 seems to be the sweet spot for mass support (you can see who supports what here). OpenCL is VERY low level, as it is basically a souped-up version of C. I personally see OpenCL as a way for compiler writers to extend the reach of their high-level language’s support for Heterogeneous computing; I don’t imagine OpenCL would be used by a typical programmer looking to optimize their code. I hope that in the near future language run-times will be able to target a GPGPU for workloads that are obviously parallelizable, without the programmer needing to handle all of the dirty work of implementing it. But how might that look in a high-level language?
High-level Aspirations
To see an example of parallelism in a high-level functional language, we can look at Clojure’s pmap, which splits the mapping of a function across a JVM thread pool. The function requires the programmer to manually handle task-decomposition (which I’ll explain later), but it’s still quite high-level. Here’s a link to a chapter from Daniel Higginbotham’s awesome Clojure book where he describes pmap in more depth than I will here.
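To get a feel for it, here’s a minimal sketch from a REPL; `expensive-step` is my own made-up stand-in for a costly per-element computation, not something from Higginbotham’s book:

```clojure
;; A minimal sketch of pmap in action. `expensive-step` simulates a costly
;; per-element computation; pmap only pays off when each call does real work.
(defn expensive-step [n]
  (Thread/sleep 10)   ; simulate ~10ms of work per element
  (* n n))

;; map runs the calls serially; pmap farms them out to a thread pool,
;; so wall-clock time drops on a multi-core machine.
(time (doall (map  expensive-step (range 100))))
(time (doall (pmap expensive-step (range 100))))
```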
Currently, pmap just targets the CPU, but it would be neat to have a version that could automatically target a GPGPU if one was available via the OpenCL run-time! There is the ClojureCL project, but its goal isn’t high-level integration with a map-like function per se. There are other language integrations with OpenCL, like PyOpenCL, but again they seem to just be a high-level means of launching an OpenCL compute kernel written in OpenCL-C (if you don’t know what that means, just imagine a function that you have to pass a string to, where the string is C code; yikes!).
Back to reality: Low-level Heterogeneous computing and C++
Before I get into it, I want to clarify the difference between concurrent and parallel with an analogy from Pablo Halpern, taken from a round-table talk at CppCon 2014 titled "Paying for Lunch: C++ in the ManyCore Age". Mr Halpern describes concurrency as being like a basketball game, where each team is made up of individuals who work together and inevitably all interact with a shared resource (the ball). Parallelism is more like track-and-field, where each athlete has their own lane and they don’t actually interact directly.
Parallel and C++
In the C++ community, there are a lot of disparate yet related efforts towards unifying the Heterogeneous computing environment under a complete standard. The round-table discussion mentioned above illustrates the challenges they face and how the standards side of the language community is working on cleaning it up. The participants include Pablo Halpern (quoted above) from Intel, who specializes in vectorization and SIMD parallelization; Michael Wong, who is the CEO of OpenMP and works on transactional memory and C++ compilers; Jared Hoberock, who works at Nvidia on the Thrust parallel library; and Artur Laksberg, who is working on a proposal for adding parallelized algorithms to the C++ Standard Library.
The talk is quite interesting if you’ve had a chance to familiarize yourself with the topic, and one of the key takeaways for me is that there really isn’t a clear way to integrate parallelization abstractions into a language like C++ without the programmer being concretely aware of the underlying hardware (Pablo Halpern explains this at the 49:13 mark in the video above). But is this the case for all languages?
Generalizability of Parallel Processing
At the end of the discussion, Michael Wong reiterates a point he was making about developer education regarding task-decomposition. He believes that this is “[developers’] next big challenge”, while the tools that the library authors are creating will take care of the implementation details. (One example of task-decomposition with regard to SIMD vectorization is explained well in Pablo Halpern’s 2016 talk at CppCon.) This relates back to my Clojure example of pmap, where an aspect of task-decomposition, namely granularity, needs to be handled by the programmer. To explain what granularity means, let’s take some time to talk about it and the other aspects of task-decomposition.
Task Decomposition
For a (brief) discussion of task-decomposition, we can watch Daniel Higginbotham’s talk “Parallel Programming, Fork Join, and Reducers”. The crux of the talk is to use the specialized reducers library in Clojure, which relies on the Fork/Join framework introduced in JDK 7. Because Clojure tries to abstract the gritty implementation details away, we’re better able to focus on what Michael Wong said earlier was our next big task; namely, task-decomposition. Higginbotham describes three aspects of decomposition: Granularity, Tiling, and Fusion, which I’ll discuss below. I’ll also mention a few caveats that Higginbotham doesn’t cover regarding some of the dangers of using reducers in parallel, and of parallel processing in general.
Granularity
This is the size of the sub-task. Granularity relates back to the pmap example above, where Higginbotham explained that using pmap directly is inefficient because, roughly speaking, it spreads the work out among the available cores and then applies the function to each element individually. There is an overhead cost for each function application (thread coordination, setup, etc.), so pmap is actually slower than map for direct mapping. BUT if you break up the collection and pass pmap a function that has an internal map over the sub-collections, then pmap can parallelize the sub-mappings across threads. PHEW, that’s a mouthful. Again, I encourage you to check out Mr Higginbotham’s explanation, because he has nice pictures and code examples.
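Here’s a rough sketch of that chunking idea, assuming an arbitrary chunk size of 1000 (Higginbotham’s chapter has the real benchmarks and pictures):

```clojure
;; A rough sketch of the chunking idea described above. The chunk size of
;; 1000 is arbitrary; the right grain depends on how expensive f really is.
(defn chunked-pmap [f coll chunk-size]
  (->> coll
       (partition-all chunk-size)   ; break the work into chunks...
       (pmap #(doall (map f %)))    ; ...map serially inside each chunk,
       (apply concat)))             ;    while the chunks run in parallel

;; e.g. (chunked-pmap #(* % %) (range 100000) 1000)
```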
Tiling
This method splits up the work into equal tiles that themselves are processed serially. The description reminds me of vectorization, but I’m not sure if it’s the same under the hood. Think of an image being divided into equal tiles.
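To make the idea concrete, here’s my own rough illustration of tiling in Clojure (this is not code from the talk, just my reading of the idea):

```clojure
;; Split the collection into one roughly equal tile per core, process each
;; tile serially, and let the tiles themselves run in parallel.
(defn tiled-pmap [f coll]
  (let [v     (vec coll)
        cores (.availableProcessors (Runtime/getRuntime))
        size  (max 1 (quot (+ (count v) cores -1) cores))] ; ceiling division
    (apply concat (pmap #(mapv f %) (partition-all size v)))))

;; e.g. (tiled-pmap #(* % %) (range 100000))
```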
Fusion
The idea behind fusion is to eliminate the creation of intermediate collections. For instance, if you wanted to apply three operations over a collection, do it in the body of a single function that maps over the collection once.
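In Clojure terms, a minimal sketch might look like this; transducers are one idiomatic way to get fusion, though the talk frames the idea more generally:

```clojure
;; Without fusion: three separate passes, each allocating an intermediate sequence.
(->> (range 1000000)
     (map inc)
     (map #(* % %))
     (filterv even?))

;; With fusion: compose the steps and walk the collection once.
(into [] (comp (map inc) (map #(* % %)) (filter even?)) (range 1000000))
```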
Fork/Join Framework
Java’s Fork/Join framework is a type of Executor which, Mr Higginbotham explains, will handle some aspects of task decomposition automatically, and one can make use of it via Clojure’s special reducers library.
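Here’s a small sketch of what that looks like; the collection size is arbitrary, and fold only really pays off for large vectors doing non-trivial work:

```clojure
;; A small sketch of the reducers library riding on Fork/Join. r/fold
;; decomposes the vector into subtasks for the Fork/Join pool, reduces
;; each subtask, then combines the partial results with +.
(require '[clojure.core.reducers :as r])

(def big-vec (vec (range 1000000)))

(r/fold + (r/map inc big-vec))   ; parallel, via Fork/Join
(reduce + (map inc big-vec))     ; serial equivalent
```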
Best case for Task-Decomposition: Map
So as we’ve seen, one of the simplest tasks that can be decomposed is the mapping of a function from one collection to another. This can be handled by the standard Clojure function pmap combined with a bit of thinking about task-decomposition. Mapping is the best use-case for parallel processing because each application of the function is an entirely independent operation, so the programmer doesn’t have to worry much about the parallel threads overlapping or racing.
What the programmer should worry about is parallelizing a reduction operation, because although Mr Higginbotham didn’t mention it, there are potential risks! Let’s take a look.
There be dragons!
In “Fork / Join, lambda & parallel(): parallel computing made (too?) easy”, a talk by José Paumard, Dr. Paumard warns us that:
- When reducing in parallel, one must test that the function being applied exhibits the mathematical property of associativity (there’s a small sketch of this after the list).
- Some algorithms (e.g. quicksort) are slower when parallelized.
- Some algorithms are incorrect when parallelized, and it’s not obvious upfront which ones are!
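Paumard’s talk is Java-focused, but here’s my own tiny Clojure sketch of that first dragon, using the reducers library from above:

```clojure
;; Subtraction is not associative, so folding it in parallel combines the
;; partial results in a different order than a serial reduce and (almost
;; certainly) produces a different answer.
(require '[clojure.core.reducers :as r])

(def nums (vec (range 10000)))

(defn sub
  ([] 0)            ; the 0-arity supplies fold's identity value
  ([a b] (- a b)))

(reduce sub nums)   ; serial result
(r/fold sub nums)   ; parallel result -- typically a different number!

;; With an associative function like +, the two agree:
(= (reduce + nums) (r/fold + nums))   ; => true
```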
Thus the programmer must be careful to test their parallelized code for correctness before it can be used. I hope to follow up this post with another one that demonstrates writing a test for the first dragon mentioned: associativity. Until then, let’s wrap this post up!
Conclusion
The inevitable march towards parallel processing is upon us, and there is a diverse effort towards integrating the option to parallelize into programming languages (I haven’t even mentioned concurrency in languages like Go, Erlang, or Haskell, but that’s another post entirely). I’ve personally just begun to look into this aspect of language design, so I’ve started at the lowest levels with the C/C++ community, looking at how they work (almost) directly with the hardware itself while trying to find suitable abstractions. Then, jumping all the way up to a really high-level language like Clojure, we can see that parallelism lends itself naturally to what is, in my opinion, one of the most useful features of a functional language: mapping.