Note: This is a work in progress, and I'm interested in soliciting feedback. These are the guidelines I use with my own team, but this is my first attempt at formalizing them.
When someone recommends omitting goto from a language, developers often point to the edge cases where goto makes sense and argue for keeping the construct. Surprisingly, even relatively new languages often contain a goto statement.
There are extremely few cases where a goto is optimal, so developers should pause and be prepared to justify the decision in a code review before using goto in production code. I believe the same can be said of other programming constructs: macros in C, template metaprogramming in C++, and threads.
I've spent enough time debugging threaded code, or trying to mentally 'prove' its correctness, that when I read threaded code my palms get sweaty and I start having flashbacks of dark, lonely nights working by the warm glow of the debugger. Threads are powerful, but they are also extremely difficult to prove correct and often produce extremely subtle bugs. I also believe many developers' intuition is that performing multiple tasks simultaneously, or creating multiple threads, will increase performance. Unfortunately, in most cases creating threads without good reason will decrease performance, as they simply increase the OS's context-switching overhead. Because of this, I have developed some guidelines to help decide when multiple threads are warranted. They cover three cases:
- Utilization of Multiple Cores/Processors
- UI Responsiveness
- Concurrent Blocking I/O
Utilization of Multiple Cores/Processors
If a developer wants to scale a CPU-bound application across multiple processors or cores, there is no alternative to threads or multiple processes. It is not common for the OS to let the programmer specify which CPUs their tasks will be scheduled on, and the CPUs will be shared with OS threads and other processes. That said, if your application is CPU-bound and parallelizable, it might be appropriate to construct a system that starts as many worker threads as there are CPUs. In my experience, though, contention from the heap, the OS, or the hardware still makes it unlikely that adding a second CPU will double the performance of an application.
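As a rough illustration of sizing a worker pool to the number of CPUs, here is a minimal Python sketch. The workload (`sum_of_squares`) and the helper names are hypothetical, chosen only for the example. One loud caveat: in standard CPython, the global interpreter lock keeps threads from running Python bytecode in parallel, so the default thread pool here shows the structure rather than a real speedup. Passing `ProcessPoolExecutor` as `executor_cls` is the usual way to get true parallelism for pure-Python work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    # Stand-in for a CPU-bound task (hypothetical workload).
    return sum(n * n for n in chunk)


def split(data, parts):
    # Divide data into `parts` nearly equal contiguous chunks.
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(data, executor_cls=ThreadPoolExecutor):
    # One worker per CPU reported by the OS; extra workers would only
    # add context-switching overhead for CPU-bound work.
    workers = os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, split(data, workers)))
```

Even with one worker per CPU, the partial results still have to be combined on a single thread, and the workers still share the allocator and memory bus, which is one reason a second CPU rarely doubles throughput.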
UI Responsiveness
I cut my teeth in the industry programming under Windows 3.1. For developers who have spent their careers working in modern multitasking environments, it may be hard to fathom that early versions of Windows and MacOS had no concept of threads or pre-emptive multitasking. Not only did applications run in a single execution context, the entire operating system did.
To give the impression of concurrency, the OS had a main loop which would pass control to applications by sending them messages. While an application handled a message it had exclusive control of the system, so to keep the system responsive, well-behaved applications would handle the message quickly (for instance, a request to redraw part of a window) and relinquish control back to the OS. I learned the implications of cooperative multitasking when developing a serial I/O library to communicate with a handheld device. The library used perfectly structured code, or so I thought, with newfangled OO techniques in C++. But when the update operation ran, not only did the application become unresponsive, the entire OS came to a halt.
The solution to this problem was to break a long-running task up into a series of small tasks that the OS could schedule over multiple events. To do so, the application developer needed to create a state machine to model the current step of the long-running task. Before the application relinquished control back to the system, the application state would be updated so that when the next event occurred the application could pick up where it left off. If the application wanted to do something like update a progress bar, then when handling a message to draw the status bar it would consult the state machine to determine the progress of the long-running task.
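The technique above can be sketched in a few lines of Python. This is only an illustration under assumed names: `ChunkedTask`, `State`, and `run_event_loop` are hypothetical, and the loop stands in for the OS message loop rather than reproducing any real Windows API.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    RUNNING = auto()
    DONE = auto()


class ChunkedTask:
    """A long-running job broken into small steps (hypothetical example)."""

    def __init__(self, items, chunk_size=100):
        self.items = items
        self.chunk_size = chunk_size
        self.position = 0      # where to resume on the next event
        self.total = 0         # result accumulated so far
        self.state = State.IDLE

    def step(self):
        # Handle one event: do a bounded amount of work, save state, return.
        if self.state is State.DONE:
            return
        self.state = State.RUNNING
        end = min(self.position + self.chunk_size, len(self.items))
        for value in self.items[self.position:end]:
            self.total += value
        self.position = end
        if self.position >= len(self.items):
            self.state = State.DONE

    def progress(self):
        # What a paint handler would read to draw a progress bar.
        if not self.items:
            return 1.0
        return self.position / len(self.items)


def run_event_loop(task):
    # Stand-in for the OS message loop: each iteration is one "message".
    history = []
    while task.state is not State.DONE:
        task.step()
        history.append(task.progress())
    return history
```

Each call to `step()` returns quickly, so a cooperative scheduler gets control back between chunks, at the cost of turning a simple loop into explicit bookkeeping.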
For many smaller programs this technique works reasonably well, and even on modern threaded OSes it is still used. The difference on a modern OS is that if an application spends a long time handling an event, the rest of the OS can continue to operate even if the application appears unresponsive. Most modern applications now create separate threads to handle long-running tasks while the main application thread processes UI events. The reason is that trying to break large tasks up into state machines can become untenable, and most developers tend to think of algorithms sequentially rather than as a series of state transitions.
Unfortunately, using multiple threads to update an application's UI and to execute long-running tasks has the drawback of requiring synchronization between the application's threads. While I believe UI responsiveness is a valid justification for using threads, it is also extremely difficult to get right. On Windows it often means sending messages through the application's message loop to synchronize the threads.
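A minimal sketch of that message-passing style, in Python rather than the Win32 API: a thread-safe `queue.Queue` stands in for the application's message loop, and the function names and message format are assumptions made for the example. The point is that the worker never touches UI state directly; it only posts messages that the main thread processes.

```python
import queue
import threading


def worker(total_steps, messages):
    # Runs on the background thread; never touches UI state directly.
    for step in range(1, total_steps + 1):
        # ... one slice of the long-running work would go here ...
        messages.put(("progress", step / total_steps))
    messages.put(("done", None))


def ui_loop(messages):
    # Runs on the main thread; the only place UI state is modified.
    ui_state = {"progress": 0.0, "updates": 0}
    while True:
        kind, payload = messages.get()   # blocks until a message arrives
        if kind == "done":
            return ui_state
        ui_state["progress"] = payload
        ui_state["updates"] += 1


def run(total_steps=5):
    messages = queue.Queue()             # thread-safe message channel
    thread = threading.Thread(target=worker, args=(total_steps, messages))
    thread.start()
    final_state = ui_loop(messages)
    thread.join()
    return final_state
```

Because only one thread ever writes `ui_state`, no lock is needed around it; the queue is the single synchronization point.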
It is worth pointing out that the structure of a thread mimics the state machine by implicitly maintaining state in the thread's call stack and instruction pointer. Also, instead of the application controlling task scheduling through the length of time it takes to handle one event, the OS handles task scheduling by implicitly switching control to another thread or task based on a clock interval.
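One way to see the "implicit state" idea without real threads is a Python generator, which keeps its local variables and its resume point between calls in much the same way a thread's stack and instruction pointer do. This is an analogy only: a generator yields cooperatively, whereas the OS pre-empts a thread on a timer. The names `long_task` and `drive` are hypothetical.

```python
def long_task(items, chunk_size):
    # Written sequentially; `total` and `start` live in the generator's
    # frame, and the resume point is remembered across each yield.
    total = 0
    for start in range(0, len(items), chunk_size):
        total += sum(items[start:start + chunk_size])
        done = min(start + chunk_size, len(items))
        yield done / len(items)      # hand control back to the scheduler
    return total


def drive(task):
    # Stand-in scheduler: resumes the task until it finishes.
    progress = []
    while True:
        try:
            progress.append(next(task))
        except StopIteration as finished:
            return progress, finished.value
```

The task reads as a plain loop, with no explicit position or state fields, yet it can still be suspended and resumed between chunks.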
Concurrent Blocking I/O
In modern networked environments, implementing concurrent blocking I/O is the most common use of threads. It is typical in network servers, and in clients that must maintain multiple server connections, such as a load test application. Internet-facing servers must handle requests concurrently from multiple clients at unknown data rates. Leaving aside asynchronous or non-blocking I/O, when the server attempts to read data from or write data to a client, the operating system blocks the calling thread and automatically switches to another thread until the I/O has completed. If the server is handling connections from many clients in a thread-per-client or thread-per-connection scenario, many of the threads will be in a wait state consuming few resources. When the read or write operation completes, the thread is reactivated to continue processing.
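A minimal sketch of the thread-per-connection pattern, using Python's standard `socket` and `threading` modules. The uppercase "echo" protocol, the fixed client count, and the helper names are assumptions made so the example is small and terminates; a real server would accept connections indefinitely and handle errors.

```python
import socket
import threading


def handle_client(conn):
    # One thread per connection: recv() blocks only this thread.
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:                  # client closed the connection
                break
            conn.sendall(data.upper())


def serve(listener, max_clients):
    # Accept a fixed number of clients so the sketch terminates.
    threads = []
    for _ in range(max_clients):
        conn, _addr = listener.accept()   # blocks until a client connects
        t = threading.Thread(target=handle_client, args=(conn,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


def recv_exact(sock, size):
    # recv() may return fewer bytes than requested, so loop until done.
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("peer closed early")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def demo(num_clients=3):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
    listener.listen(num_clients)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve, args=(listener, num_clients))
    server.start()

    # Open every connection first, so all are held concurrently.
    clients = [socket.create_connection(("127.0.0.1", port))
               for _ in range(num_clients)]
    replies = []
    for i, client in enumerate(clients):
        message = f"client {i}".encode()
        client.sendall(message)
        replies.append(recv_exact(client, len(message)))
    for client in clients:
        client.close()
    server.join()
    listener.close()
    return replies
```

While one handler thread waits in `recv()`, the others are free to run, which is exactly the property that makes this design simple to write and, with many idle clients, wasteful in threads.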
It is interesting to note that GNU Pth is a user-mode cooperative threading library which takes advantage of the fact that many applications use threads to implement concurrent blocking I/O. When the read and write functions are called, the library takes the opportunity to switch the context to the next runnable thread.
Disk I/O is also a common source of blocking I/O. Because the disk is a local resource, it has significantly different properties from slow network clients. Starting a large number of threads to access the disk concurrently will almost certainly degrade performance as the throughput limit of the disk system is reached and thrashing increases.
While it is beyond the scope of this discussion, most high-performance servers no longer use a thread-per-connection architecture. Dan Kegel laid the groundwork for modern *nix server design with his outline of the C10k problem. Many modern servers, including Apache, use a hybrid event-driven/threaded architecture.