Zero-Copy in Linux with sendfile() and splice()

4 minute read

After my recent excursion to Kernelspace, I’m back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor - like this:

// do_read and do_write are simple wrappers on the read() and 
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);

While this is very simple and straightforward, it is somewhat inefficient - we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren’t examining or altering the data in any way - buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this - zero-copy.

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Rewriting my example above with sendfile() gives us the following:

ssize_t do_sendfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes_sent;
    size_t total_bytes_sent = 0;
    while (total_bytes_sent < count) {
        if ((bytes_sent = sendfile(out_fd, in_fd, &offset,
                count - total_bytes_sent)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("sendfile");
            return -1;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Send 'len' bytes starting at 'offset' from 'file_fd' to 'socket_fd'
do_sendfile(socket_fd, file_fd, offset, len);

Now, as the man page states, there’s a limitation here: “Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd, must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket.”. So, we can only use sendfile() for reading data from our file and sending it to the socket.

It turns out that sendfile() significantly outperforms read()/write() - I was seeing about 8% higher throughput on a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() allows you to move data between kernel buffers (via a pipe) with no copy to userland. It’s a more primitive (and therefore flexible) system call than sendfile(), and requires a bit of wrapping to be useful - here’s my first attempt to write data from a socket to a file:

// Our pipe - a pair of file descriptors in an array - see [pipe()](http://linux.die.net/man/2/pipe)
static int pipefd[2];

//...

ssize_t do_recvfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes, bytes_sent, bytes_in_pipe;
    size_t total_bytes_sent = 0;

    // Splice the data from in_fd into the pipe
    while (total_bytes_sent < count) {
        if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
                count - total_bytes_sent, 
                SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice");
            return -1;
        }

        // Splice the data from the pipe into out_fd
        bytes_in_pipe = bytes_sent;
        while (bytes_in_pipe > 0) {
            if ((bytes = splice(pipefd[0], NULL, out_fd, &offset, bytes_in_pipe,
                    SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
                if (errno == EINTR || errno == EAGAIN) {
                    // Interrupted system call/try again
                    // Just skip to the top of the loop and try again
                    continue;
                }
                perror("splice");
                return -1;
            }
            bytes_in_pipe -= bytes;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Setup the pipe at initialization time
if ( pipe(pipefd) < 0 ) {
    perror("pipe");
    exit(1);
}

//...

// Send 'len' bytes from 'socket_fd' to 'offset' in 'file_fd'
do_recvfile(file_fd, socket_fd, offset, len);

This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 that makes the first splice() call hang when you ask for all of the data on the socket. The Samba guys worked around this by simply limiting the data read from the socket to 16k. Modifying our first splice call similarly fixes the issue:

    if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
            MIN(count - total_bytes_sent, 16384), 
            SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {

I haven’t benchmarked the ‘write’ speed yet, but, on reads, splice() performed just a little slower than sendfile(), which I attribute to the additional user/kernel context switching, but, again, significantly faster than read()/write().

As is often the case, I’m merely standing on the shoulders of giants here, collating hints and fragments, but I hope you find this post useful!

Twitter Facebook LinkedIn

Comments

A Cup of tee() and a splice() of Cake « Superpatterns

July 8, 2010 at 05:13 PM

[…] Superpatterns Pat Patterson on Identity Management, Federation and Single Malt Scotch « Zero-Copy in Linux with sendfile() and splice() […]

Semaphores on Linux – sem_init() vs sem_open() « Superpatterns

July 14, 2010 at 01:57 PM

[…] readers will know that I’m working on a Linux server daemon that, amongst other things, moves data back and forth between sockets and files without it appearing in user space, and even ‘tees’ that data to a second destination, again without a copy to a user […]

Moving On From Huawei « Superpatterns

October 15, 2010 at 08:08 AM

[…] low level stuff – Linux kernel drivers and server daemons, and I’ve learned lots about zero-copy technology and semaphores. All fascinating stuff, working with great people and visiting some cool […]

Mukesh

June 29, 2011 at 09:35 PM

Thanks for sharing your ideas. The description is really nice and clear.

I tried the code snippet shared by you in VSFTP server. But unfortunately instead of getting performance gain I am getting degradation in performance.

Note: In my kernel stack LRO feature is enabled for aggregation.

-Mukesh

Comments

Before changes

Leave a Comment

You May Also Enjoy