Parallel Processing

This week I’ve added another piece to the puzzle that is my ray tracer. As you may have guessed from the lack of pretty pictures, my initial enthusiasm for the ray tracing part itself has waned a little, so I’m implementing other parts that interest me. In my ongoing effort to reduce drawing time, I wanted to make my ray tracer capable of running across a networked cluster of computers. I am thinking of eventually making the ray tracer output some kind of video or animation file, but that is a long way off yet. When that is done, I might attempt some optimisation of the ray tracing algorithm itself.

All the concurrency makes the design hard to reason about, but it is also fascinating. I think I’ve settled on a final design, but I have had that feeling before. I’ve broken the program into a number of parts, and it runs in two or more threads. The main thread allocates jobs and pastes the completed jobs to the screen. A job is something along the lines of “draw frame X, rectangle 0,0,300,300”; the frame number will be used later if I want animation. There is a network thread, which I’ll talk more about below. Finally, there is an optional set of draw threads, whose number should be roughly the number of cores available on the computer.

Each thread may allocate one or more workers, and a worker is what actually does the job: each draw thread has a single worker, while the network thread has a worker per peer. Each worker has a “CommandQueue”, a thread-safe container of jobs (and of other important inter-thread messages, such as communicating the size of the full image to each worker). For draw threads, the flow of control is simple: pop a command, execute it, continue. If there are no commands, the thread blocks, allowing others to run.
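To make the queue-and-worker design concrete, here is a minimal C++11 sketch of a blocking CommandQueue built on std::mutex and std::condition_variable, plus the pop-execute-continue loop a draw thread might run. The Command fields, the shutdown flag, draw_thread_main and draw_thread_count are hypothetical names chosen for illustration, not the ray tracer's actual code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical command: "draw frame `frame`, rectangle x,y,width,height".
struct Command {
    int frame = 0;
    int x = 0, y = 0, width = 0, height = 0;
    bool shutdown = false;  // tells the draw thread to exit its loop
};

// Thread-safe FIFO: push() never blocks, pop() blocks until a command exists.
class CommandQueue {
public:
    void push(Command cmd) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            commands_.push(std::move(cmd));
        }
        ready_.notify_one();  // wake one thread waiting in pop()
    }

    Command pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate makes the wait robust against spurious wake-ups.
        ready_.wait(lock, [this] { return !commands_.empty(); });
        Command cmd = std::move(commands_.front());
        commands_.pop();
        return cmd;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Command> commands_;
};

// Draw-thread loop: pop a command, execute it, continue. While the queue is
// empty, pop() sleeps on the condition variable. `execute` stands in for the
// real tracing work.
template <typename Execute>
void draw_thread_main(CommandQueue& queue, Execute execute) {
    for (;;) {
        Command cmd = queue.pop();
        if (cmd.shutdown) return;
        execute(cmd);
    }
}

// Roughly one draw thread per core; hardware_concurrency() returns 0 when
// the count is unknown, so fall back to a single thread.
unsigned draw_thread_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}
```

Because pop() waits on the condition variable instead of spinning, an idle draw thread consumes no CPU, and pushing a shutdown command gives the main thread a clean way to stop a draw thread before joining it.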

The network thread is causing me problems. I am reliably informed that the apparently “simpler” alternative, a thread per client, does not scale, and that a single thread servicing all the sockets is better. Currently, the network thread runs a loop like this:
1. Get commands and send them to clients.
2. Receive completed jobs from clients.
3. Poll the UDP port for automatic peer discovery.
4. Poll the TCP listen port for new connections.

My problem is that I would really like this thread to block when it has nothing to do, so as not to soak up the cycles the draw threads really need. I could make both the UDP and TCP poll actions into threads, which would block. However, I still have to handle the commands and jobs, which both depend on the same socket. What I would love is to handle the sending in one thread and the receiving in another. For the moment, though, a little sleep() does a world of good, at the expense of additional latency. I understand that there may be platform-specific ways of doing this (I’ve read a little about IOCP, which sounds similar), but I want to keep my application cross-platform (I have two Macs in the house which I can use as part of the test cluster once I figure out how to build for them). I might move back to a blocking thread-per-client model, because the number of clients will be small. With my current architecture, switching between the two models should be relatively easy.
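One portable way to let a single network thread block on both its sockets and its in-process command queue, instead of calling sleep(), is the classic self-pipe trick. The sketch below assumes the main thread calls wake() after pushing a command, and that the UDP discovery socket, the TCP listen socket and the client sockets are all passed to one select() call alongside the pipe's read end. Waker and wait_for_work are hypothetical names. pipe(), fcntl() and select() are POSIX, which covers Linux and the Macs; on Windows, Winsock's select() accepts only sockets, so the pipe would have to be replaced by a connected pair of loopback sockets. select() is also limited to descriptors below FD_SETSIZE, which should not matter for a handful of peers.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>
#include <vector>

// Self-pipe: writing a byte makes the read end readable, which wakes select().
class Waker {
public:
    Waker() {
        int fds[2];
        int rc = pipe(fds);
        assert(rc == 0);
        (void)rc;
        read_fd_ = fds[0];
        write_fd_ = fds[1];
        // Non-blocking so wake() never stalls and drain() never blocks.
        fcntl(read_fd_, F_SETFL, fcntl(read_fd_, F_GETFL) | O_NONBLOCK);
        fcntl(write_fd_, F_SETFL, fcntl(write_fd_, F_GETFL) | O_NONBLOCK);
    }
    ~Waker() { close(read_fd_); close(write_fd_); }
    Waker(const Waker&) = delete;
    Waker& operator=(const Waker&) = delete;

    // Main thread: call after pushing a command for the network thread.
    void wake() {
        char byte = 1;
        // If the pipe is full, a wake-up is already pending; ignore EAGAIN.
        ssize_t ignored = write(write_fd_, &byte, 1);
        (void)ignored;
    }

    // Network thread: discard pending wake-up bytes.
    void drain() {
        char buf[64];
        while (read(read_fd_, buf, sizeof buf) > 0) {}
    }

    int read_fd() const { return read_fd_; }

private:
    int read_fd_ = -1;
    int write_fd_ = -1;
};

// Block until a socket in `sockets` is readable, the waker fires, or
// `timeout_ms` passes. Returns the readable sockets; sets `woken` when the
// main thread has queued commands.
std::vector<int> wait_for_work(const std::vector<int>& sockets, Waker& waker,
                               int timeout_ms, bool& woken) {
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(waker.read_fd(), &readable);
    int max_fd = waker.read_fd();
    for (int fd : sockets) {
        assert(fd >= 0 && fd < FD_SETSIZE);
        FD_SET(fd, &readable);
        if (fd > max_fd) max_fd = fd;
    }
    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    std::vector<int> ready;
    woken = false;
    if (select(max_fd + 1, &readable, nullptr, nullptr, &tv) <= 0)
        return ready;  // timeout, or an error such as EINTR: caller just loops
    if (FD_ISSET(waker.read_fd(), &readable)) {
        woken = true;
        waker.drain();
    }
    for (int fd : sockets)
        if (FD_ISSET(fd, &readable)) ready.push_back(fd);
    return ready;
}
```

In the network loop, a true `woken` would mean "pop the worker queues and send the commands", and each ready socket would be read from (or accept()ed, for the listen socket). The finite timeout keeps room for periodic housekeeping without burning cycles between events.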

I also fixed a minor bug where the client would dump parts of the executable (namely the debug symbol section) onto the TCP stream: it walked off the end of a vector into no man’s land looking for a space character. At least there was no problem at the receiving end, otherwise I could have created a lovely remote exploit for myself. It is amazing how easy such errors are to write, even though I am fairly militant about letting C++ manage my memory where possible.
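This class of bug can be avoided by letting the standard library bound the search. The sketch assumes a hypothetical space-separated text message held in a std::vector<char>; next_token is an illustrative helper, not the post's actual code. std::find returns the end iterator when no space exists, so the search stops at the end of the vector instead of wandering into whatever memory follows it, as an unchecked `while (buf[i] != ' ') ++i;` loop can.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Extract the next space-terminated token starting at `pos`.
// Returns false, leaving `pos` and `token` untouched, if no space follows
// within the buffer (e.g. the rest of the message has not arrived yet).
bool next_token(const std::vector<char>& buf, std::size_t& pos, std::string& token) {
    if (pos > buf.size()) return false;
    auto begin = buf.begin() + static_cast<std::ptrdiff_t>(pos);
    auto space = std::find(begin, buf.end(), ' ');  // never searches past end()
    if (space == buf.end()) return false;
    token.assign(begin, space);
    pos = static_cast<std::size_t>(space - buf.begin()) + 1;  // skip the space
    return true;
}
```

A caller would keep appending received bytes to the buffer and retry next_token once more data arrives, rather than assuming the terminator is already present.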

The main TODO on the parallel processing side right now is implementing the job allocator. Currently the server generates a somewhat random sub-rectangle for the client to process. A proper version would take note of the next few frames that need rendering, so that if the cluster had 5 workers, 3 might be working on the current frame and 2 might be pre-rendering the next frame. Then I suppose I’ll have to go back to the ray tracing and try to get rid of all those speedups I have, now that I can run my ray tracer on lots of machines 🙂
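Here is a sketch of the look-ahead allocator, assuming fixed square tiles handed out in order rather than the current random sub-rectangles. Every tile of the oldest unfinished frame is handed out first, then tiles of the following frames, but never more than `lookahead` frames ahead of the oldest unfinished one. JobAllocator, next(), complete() and the tile layout are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// "Draw frame `frame`, rectangle x,y,width,height".
struct Job {
    int frame = 0;
    int x = 0, y = 0, width = 0, height = 0;
};

class JobAllocator {
public:
    JobAllocator(int image_w, int image_h, int tile, int frame_count, int lookahead)
        : w_(image_w), h_(image_h), tile_(tile), frames_(frame_count), lookahead_(lookahead) {
        assert(image_w > 0 && image_h > 0 && tile > 0 && frame_count > 0 && lookahead > 0);
        int tiles_x = (w_ + tile_ - 1) / tile_;  // round up for partial edge tiles
        int tiles_y = (h_ + tile_ - 1) / tile_;
        remaining_.assign(static_cast<std::size_t>(frames_), tiles_x * tiles_y);
    }

    // Fill `job` with the next tile; false if every tile inside the
    // look-ahead window has already been handed out.
    bool next(Job& job) {
        if (frame_ >= frames_ || frame_ >= oldest_ + lookahead_) return false;
        job.frame = frame_;
        job.x = x_;
        job.y = y_;
        job.width = std::min(tile_, w_ - x_);   // clip tiles at the image edge
        job.height = std::min(tile_, h_ - y_);
        // Advance left to right, top to bottom, then on to the next frame.
        x_ += tile_;
        if (x_ >= w_) { x_ = 0; y_ += tile_; }
        if (y_ >= h_) { y_ = 0; ++frame_; }
        return true;
    }

    // Record a finished tile; once the oldest frame is complete, the window
    // slides forward.
    void complete(const Job& job) {
        assert(job.frame >= oldest_ && job.frame < frames_ &&
               remaining_[static_cast<std::size_t>(job.frame)] > 0);
        --remaining_[static_cast<std::size_t>(job.frame)];
        while (oldest_ < frames_ && remaining_[static_cast<std::size_t>(oldest_)] == 0)
            ++oldest_;
    }

    int oldest_unfinished() const { return oldest_; }

private:
    int w_, h_, tile_, frames_, lookahead_;
    std::vector<int> remaining_;     // unfinished tiles per frame
    int frame_ = 0, x_ = 0, y_ = 0;  // next tile to hand out
    int oldest_ = 0;                 // oldest frame with unfinished tiles
};
```

With a 600x600 image and 300-pixel tiles, each frame has four tiles, so five idle workers would get four tiles of the current frame and one of the next; a 3/2 split like the one above arises when only three tiles of the current frame remain unassigned. A real version would also need to reassign tiles from peers that disconnect, which this sketch ignores.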

