TCP_CORK: More than you ever wanted to know
I previously mentioned the leakiness of Unix's file metaphor. The leak often becomes a gushing torrent when trying to bump up performance. TCP_CORK is yet another example.
Before I get into the details of TCP_CORK and the problem it addresses, I want to point out that this is a Linux-only option, although variants exist on other *nix flavors -- for instance TCP_NOPUSH on FreeBSD and Mac OS X (although from what I read the OS X implementation is buggy). This is one of the unfortunate aspects of modern Unix programming. While most of the APIs are identical between Unix-like OSes, if the functionality isn't specified by POSIX, the major *nix flavors can't seem to agree on an implementation.
What are "physical" socket writes?
The root of the abstraction leak derives from the semantics of the write() function when applied to TCP/IP. Historically (and any Unix experts in the crowd feel free to correct me here if this is not accurate) the write() function resulted in a physical, non-buffered, write to the device. With TCP/IP the device is a network packet, but the implementors were forced to define a physical write given Unix's file semantics, so a TCP/IP write() was defined as follows:
Any data that has been sent to the kernel with write() is placed into one or more packets and immediately sent onto the wire.
The resulting behavior is what application programmers expected. When they called write(), the data would be sent and made available to the host on the other side of the wire. But it didn't take long to realize that this resulted in some interesting performance problems, which were addressed by Nagle's algorithm.
Nagle's algorithm
In the early 1980s John Nagle found that the networks at Ford Aerospace were becoming congested with packets containing only a single character's worth of data. Basically, every time a user struck a key in a telnet-like console app, an entire packet was put onto the network. As Nagle pointed out, this resulted in about 4000% overhead (the header bytes sent vs. the actual application data). Nagle's solution was simple: wait for the peer to acknowledge the previously sent packet before sending any partial packets. This gives the OS time to coalesce multiple calls to write() from the application into larger packets before forwarding the data to the peer.
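The 4000% figure falls out of simple header arithmetic. Here is a minimal sketch, assuming a 20-byte IPv4 header and a 20-byte TCP header with no options (the constants and the helper name are my own, purely for illustration):

```c
#include <stddef.h>

/* Assumed header sizes: IPv4 and TCP, both without options. */
enum { IPV4_HEADER_BYTES = 20, TCP_HEADER_BYTES = 20 };

/* Hypothetical helper: header overhead as a percentage of the payload.
 * payload_bytes must be greater than zero. */
unsigned long header_overhead_percent(size_t payload_bytes)
{
    unsigned long headers = IPV4_HEADER_BYTES + TCP_HEADER_BYTES;
    return headers * 100UL / (unsigned long)payload_bytes;
}
```

With one byte of payload the 40 bytes of headers come to 4000%; with a full 1460-byte payload the same headers cost under 3%.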
Nagle's algorithm is transparent to application developers, and it effectively sticks a fat finger in the abstraction leak. Calls to write() still guarantee that the data will be sent to the peer without any further action from the application. Nagle also has the side benefit of providing additional rudimentary flow control.
Nagle not optimal for streams
While Nagle's algorithm is an excellent compromise for many applications, and it is the default behavior for most TCP/IP implementations including Linux's, it isn't without drawbacks. The Nagle algorithm is most effective if TCP/IP traffic is generated sporadically by user input, not by applications using stream-oriented protocols. It works great for Telnet, but it is less than optimal for HTTP. For example, if an application needs to send 1 1/2 packets of data to complete a message, the second, partial packet is delayed until an ACK is received for the first packet, thereby needlessly increasing latency when the application doesn't expect to send more data.
It also requires the peer to process more packets when network latency is low. This can affect the responsiveness of the peer, by causing it to needlessly consume resources.
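The 1 1/2 packet case can be made concrete with a simplified version of the send rule. This is only a sketch of the decision as described above, not the kernel's actual code; the function name and parameters are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified Nagle rule: a full segment may always go out; a partial
 * segment may go out only when no previously sent data is unacknowledged. */
bool nagle_may_send(size_t pending_bytes, size_t mss, size_t unacked_bytes)
{
    if (pending_bytes == 0)
        return false;              /* nothing to send */
    if (pending_bytes >= mss)
        return true;               /* a full segment is never held back */
    return unacked_bytes == 0;     /* a partial segment waits for the ACK */
}
```

Assuming a segment size of 1460 bytes, a 2190-byte message sends its first 1460 bytes at once. The remaining 730 bytes are held while those 1460 bytes are unacknowledged, and go out only once the ACK arrives.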
Unfortunately, as is often the case, the file abstraction must be violated to improve performance. The application must instruct the OS not to send any packets unless they are full, or the application signals the OS to send all pending data. This is the effect of TCP_CORK.
The application must tell the OS where the boundaries of the application layer messages are. For instance, multiple HTTP messages can be passed on one connection using HTTP pipelining. When a message is complete, the application should signal the OS to send any outstanding data. If the application fails to signal the OS at the end of a message, the final partial packet stays in the kernel and the peer is left waiting for the remainder of the message (current Linux kernels do transmit corked data after a 200 millisecond ceiling, but that is still a long stall).
In my HTTP implementation, I use the flush metaphor, which is common with streams but not usually associated with calls to write(), which are supposed to be physical. I set the TCP_CORK option when the socket is created, and then "flush" the socket at message boundaries.
Prefer the gather function writev()
If you need to write multiple buffers that are currently in memory you should prefer the gather function writev() before considering TCP_CORK with multiple calls to write(). This function allows multiple non-contiguous buffers to be written with one system call. The kernel can then coalesce the buffers efficiently into packet structures before writing them to the network. It also reduces the number of system calls required to send the data, and hence improves performance.
This should be combined with either the TCP_NODELAY or the TCP_CORK option. TCP_NODELAY disables the Nagle algorithm and ensures that the data will be written immediately. Using TCP_CORK with writev() will allow the kernel to buffer and align packets between multiple calls to write() or writev(), but you must remember to remove the cork option to write the data as described in the next section.
TCP_NODELAY is set on a socket as follows:
int state = 1; setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &state, sizeof(state));
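As a rough illustration of the gather write described above, here is a sketch that sends a header and a body from two separate buffers with one system call. The function name and the header/body split are hypothetical; it assumes a blocking descriptor on which the caller has already set TCP_NODELAY or TCP_CORK as appropriate:

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>   /* ssize_t */

/* Hypothetical helper: send a header and a body with a single system call.
 * Returns the number of bytes written, or -1 with errno set. The result
 * may be less than the total if the write was short. */
ssize_t send_header_and_body(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    /* iov_base is not const-qualified, so a cast is needed; writev()
     * only reads from these buffers. */
    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    return writev(fd, iov, 2);
}
```

The two buffers never have to be copied into one contiguous block in user space; the kernel gathers them.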
The drawback of writev() is that it is difficult to use with non-blocking I/O, where the function may return before all the data is written. A post-call operation must be performed to determine how much data was written, and to realign the buffers for subsequent calls. This is an area where auxiliary library functionality would help. Also, the behavior of writev() with non-blocking I/O isn't well documented.
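The realignment step might look something like the following. This is a sketch of one possible helper, not an existing library function; the name iov_advance and its return convention are my own:

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical helper: after writev() reports `written` bytes, advance
 * past the consumed data. Returns the index of the first entry that still
 * has data to send; that entry is adjusted in place. A return value equal
 * to iovcnt means everything was written. */
int iov_advance(struct iovec *iov, int iovcnt, size_t written)
{
    int i = 0;

    /* Skip buffers that were written completely. */
    while (i < iovcnt && written >= iov[i].iov_len) {
        written -= iov[i].iov_len;
        i++;
    }
    /* Trim the partially written buffer, if there is one. */
    if (i < iovcnt && written > 0) {
        iov[i].iov_base = (char *)iov[i].iov_base + written;
        iov[i].iov_len -= written;
    }
    return i;
}
```

After a short write, the caller would wait for the socket to become writable again and then retry with writev(fd, iov + i, iovcnt - i), where i is the returned index.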
A quick look at the TCP_CORK API
If you need the kernel to align and buffer packet data across several writes because the buffers are not all in memory at the same time (which rules out writev()), then TCP_CORK should be considered. TCP_CORK is set on a socket file descriptor using the setsockopt() function. When the TCP_CORK option is set, only full packets are sent, until the TCP_CORK option is removed. This is important. To ensure all waiting data is sent, the TCP_CORK option MUST be removed. Herein lies the beauty of the Nagle algorithm: it doesn't require any intervention from the application programmer. But once you set TCP_CORK, you have to be prepared to remove it when there is no more data to send. I can't stress this enough, as TCP_CORK can cause subtle bugs if the cork isn't pulled at the appropriate times.
To set TCP_CORK use the following:
int state = 1; setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
The cork can be removed, and any partial packet data sent, with:
int state = 0; setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
As I mentioned, I use the flush paradigm, which involves awkwardly removing and reapplying the TCP_CORK option. This can be done as follows:
int state = 0; setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state)); state = 1; setsockopt(fd, IPPROTO_TCP, TCP_CORK, &state, sizeof(state));
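The snippets above ignore the return value of setsockopt(). A slightly more defensive sketch wraps the calls in two small helpers; the names tcp_cork_set and tcp_cork_flush are hypothetical, and the code assumes Linux, where TCP_CORK is declared in <netinet/tcp.h>:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: set (on = 1) or clear (on = 0) TCP_CORK.
 * Returns 0 on success, -1 with errno set on failure. Linux only. */
int tcp_cork_set(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical "flush": pull the cork so pending partial packets are
 * sent, then put it back for the next message. */
int tcp_cork_flush(int fd)
{
    if (tcp_cork_set(fd, 0) == -1)
        return -1;
    return tcp_cork_set(fd, 1);
}
```

A server using the flush paradigm would call tcp_cork_set(fd, 1) once after creating the socket, and tcp_cork_flush(fd) at the end of each application-level message.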
Other solutions
User mode buffered streams are another solution to the problem. User mode buffering is implemented as follows: instead of calling write() directly, the application stores data in a write buffer. When the write buffer is full, all data is then sent with a call to write().
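A bare-bones version of such a buffer could look like this. It is a sketch only, assuming a blocking descriptor and an arbitrary 4096-byte buffer; the struct and function names are made up for illustration:

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define WBUF_SIZE 4096   /* assumed buffer size, chosen arbitrarily */

/* Hypothetical user-mode write buffer for a blocking descriptor. */
struct wbuf {
    int    fd;
    size_t used;
    char   data[WBUF_SIZE];
};

/* Write out everything currently buffered. Returns 0, or -1 with errno set. */
int wbuf_flush(struct wbuf *w)
{
    size_t off = 0;

    while (off < w->used) {
        ssize_t n = write(w->fd, w->data + off, w->used - off);
        if (n == -1) {
            if (errno == EINTR)
                continue;                    /* interrupted, retry */
            /* Keep the unsent tail at the front of the buffer. */
            memmove(w->data, w->data + off, w->used - off);
            w->used -= off;
            return -1;
        }
        off += (size_t)n;
    }
    w->used = 0;
    return 0;
}

/* Copy data into the buffer, calling write() only when the buffer fills. */
int wbuf_write(struct wbuf *w, const void *src, size_t len)
{
    const char *p = src;

    while (len > 0) {
        size_t room = WBUF_SIZE - w->used;
        size_t n = len < room ? len : room;

        memcpy(w->data + w->used, p, n);
        w->used += n;
        p += n;
        len -= n;
        if (w->used == WBUF_SIZE && wbuf_flush(w) == -1)
            return -1;
    }
    return 0;
}
```

The application calls wbuf_write() wherever it would have called write(), and wbuf_flush() at the end of each message. Note the memcpy() on every write: that copy is the price of buffering in user space.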
Even with buffered streams the application must be able to instruct the OS to forward all pending data when the stream has been flushed for optimal performance. The application does not know where packet boundaries reside, hence buffer flushes might not align on packet boundaries. TCP_CORK can pack data more effectively, because it has direct access to the TCP/IP layer.
Also, application buffering requires extra memory copies, which many high performance servers attempt to minimize. Memory bus contention and latency often limit a server's throughput.
If you do use an application buffering and streaming mechanism (as does Apache), I highly recommend applying the TCP_NODELAY socket option which disables Nagle's algorithm. All calls to write() will then result in immediate transfer of data.