Superpatterns Pat Patterson on the Cloud, Identity and Single Malt Scotch

8Jul/100

A Cup of tee() and a splice() of Cake

Credit to dajobe for the photo - click the image for the original

Apologies for the terrible pun in the title - I just couldn't resist :-)

I was hard at work on my current project the other day, a user-mode Linux server daemon, when I realized that I would need to both copy incoming data to disk and forward it to another daemon via a socket. This caused me a moment's consternation, since I was using splice() to move incoming data from a socket to a file without needing an intermediate copy in a user-mode buffer, but then I remembered mention of tee(), a companion to splice().

Where splice() moves data directly from a socket (or file) to a pipe (or vice versa), tee() copies data from one pipe to another leaving the data intact in the source pipe. You can then use splice() again to move the data from tee()'s destination pipe to another file descriptor.

It was the work of a few minutes to code up a quick sample app to test this. Since it's short, and there seems to be a dearth of tee()/splice() examples, here it is in its entirety:

#define _GNU_SOURCE // needed for splice
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void test_splice(int in, int out, int out2, int number_of_bytes) {
    int rcvd = 0, sent = 0, teed = 0, remaining = number_of_bytes;
    int pipe1[2], pipe2[2];

    if (pipe(pipe1) < 0) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    if (pipe(pipe2) < 0) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    while (remaining > 0) {
        if ((rcvd = splice(in, NULL, pipe1[1], NULL, remaining,
                SPLICE_F_MORE | SPLICE_F_MOVE)) < 0) {
            perror("splice");
            exit(EXIT_FAILURE);
        }

        if (rcvd == 0) {
            printf("Reached end of input file\n");
            break;
        }

        printf("Wrote %d bytes to pipe1\n", rcvd);

        if ((teed = tee(pipe1[0], pipe2[1], rcvd, 0)) < 0) {
            perror("tee");
            exit(EXIT_FAILURE);
        }

        printf("Copied %d bytes from pipe1 to pipe2\n", teed);

        if ((sent = splice(pipe1[0], NULL, out, NULL, rcvd, SPLICE_F_MORE
                | SPLICE_F_MOVE)) < 0) {
            perror("splice");
            exit(EXIT_FAILURE);
        }

        printf("Read %d bytes from pipe1\n", sent);

        if ((sent = splice(pipe2[0], NULL, out2, NULL, teed, SPLICE_F_MORE
                | SPLICE_F_MOVE)) < 0) {
            perror("splice");
            exit(EXIT_FAILURE);
        }

        printf("Read %d bytes from pipe2\n", sent);

        remaining -= rcvd;
    }
}

int main(int argc, char *argv[]) {
    int infile, outfile1, outfile2, number_of_bytes = INT_MAX;

    if (argc < 4) {
        fprintf(stderr,
                "Usage: %s infile outfile1 outfile2 [number_of_bytes]\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    if ((infile = open(argv[1], O_RDONLY)) < 0) {
        fprintf(stderr, "Can't open %s for reading\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    if ((outfile1 = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
        fprintf(stderr, "Can't create/open %s for writing\n", argv[2]);
        exit(EXIT_FAILURE);
    }

    if ((outfile2 = open(argv[3], O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
        fprintf(stderr, "Can't create/open %s for writing\n", argv[3]);
        exit(EXIT_FAILURE);
    }

    if (argc > 4) {
        number_of_bytes = atoi(argv[4]);
    }

    test_splice(infile, outfile1, outfile2, number_of_bytes);

    return EXIT_SUCCESS;
}

The example doesn't need much explanation; I added the 'number_of_bytes' parameter so that you can copy a limited amount of data from an infinite source such as /dev/zero or /dev/urandom. Note that a 'real' implementation needs a bit more code, since it's not safe to assume that all the bytes get moved to the destination pipes in one hit, but that would obscure the example :-)

1Jun/1042

Zero-Copy in Linux with sendfile() and splice()

A Splice

After my recent excursion to Kernelspace, I'm back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor - like this:

// do_read and do_write are simple wrappers on the read() and 
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);

While this is very simple and straightforward, it is somewhat inefficient - we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren't examining or altering the data in any way - buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this - zero-copy.

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Rewriting my example above with sendfile() gives us the following:

ssize_t do_sendfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes_sent;
    size_t total_bytes_sent = 0;
    while (total_bytes_sent < count) {
        if ((bytes_sent = sendfile(out_fd, in_fd, &offset,
                count - total_bytes_sent)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("sendfile");
            return -1;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Send 'len' bytes starting at 'offset' from 'file_fd' to 'socket_fd'
do_sendfile(socket_fd, file_fd, offset, len);

Now, as the man page states, there's a limitation here: "Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd, must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket.". So, we can only use sendfile() for reading data from our file and sending it to the socket.

It turns out that sendfile() significantly outperforms read()/write() - I was seeing about 8% higher throughput on a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() allows you to move data between kernel buffers (via a pipe) with no copy to userland. It's a more primitive (and therefore flexible) system call than sendfile(), and requires a bit of wrapping to be useful - here's my first attempt to write data from a socket to a file:


// Our pipe - a pair of file descriptors in an array - see pipe()
static int pipefd[2];

//...

ssize_t do_recvfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes, bytes_sent, bytes_in_pipe;
    size_t total_bytes_sent = 0;

    // Splice the data from in_fd into the pipe
    while (total_bytes_sent < count) {
        if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
                count - total_bytes_sent, 
                SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice");
            return -1;
        }

        // Splice the data from the pipe into out_fd
        bytes_in_pipe = bytes_sent;
        while (bytes_in_pipe > 0) {
            if ((bytes = splice(pipefd[0], NULL, out_fd, &offset, bytes_in_pipe,
                    SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
                if (errno == EINTR || errno == EAGAIN) {
                    // Interrupted system call/try again
                    // Just skip to the top of the loop and try again
                    continue;
                }
                perror("splice");
                return -1;
            }
            bytes_in_pipe -= bytes;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Setup the pipe at initialization time
if ( pipe(pipefd) < 0 ) {
    perror("pipe");
    exit(1);
}

//...

// Send 'len' bytes from 'socket_fd' to 'offset' in 'file_fd'
do_recvfile(file_fd, socket_fd, offset, len);

This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 that makes the first splice() call hang when you ask for all of the data on the socket. The Samba guys worked around this by simply limiting the data read from the socket to 16k. Modifying our first splice call similarly fixes the issue:

    if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
            MIN(count - total_bytes_sent, 16384), 
            SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {

I haven't benchmarked the 'write' speed yet, but, on reads, splice() performed just a little slower than sendfile(), which I attribute to the additional user/kernel context switching, but, again, significantly faster than read()/write().

As is often the case, I'm merely standing on the shoulders of giants here, collating hints and fragments, but I hope you find this post useful!