Superpatterns: Pat Patterson on the Cloud, Identity and Single Malt Scotch

1 June 2010

Zero-Copy in Linux with sendfile() and splice()

[Image: a splice]

After my recent excursion to Kernelspace, I'm back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor - like this:

// do_read and do_write are simple wrappers on the read() and 
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);
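
For completeness, here's a minimal, untested sketch of what those wrappers might look like (error handling kept to a bare minimum):

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

// Keep calling read() until 'len' bytes have been read, or EOF/error
ssize_t do_read(int fd, void *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, (char *)buf + total, len - total);
        if (n == 0)
            break;          // EOF
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN)
                continue;   // Interrupted system call/try again
            return -1;
        }
        total += n;
    }
    return total;
}

// Keep calling write() until 'len' bytes have been written, or error
ssize_t do_write(int fd, const void *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = write(fd, (const char *)buf + total, len - total);
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN)
                continue;   // Interrupted system call/try again
            return -1;
        }
        total += n;
    }
    return total;
}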

While this is very simple and straightforward, it is somewhat inefficient - we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren't examining or altering the data in any way - buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this - zero-copy.

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Rewriting my example above with sendfile() gives us the following:

#include <sys/types.h>
#include <sys/sendfile.h>   // sendfile()
#include <errno.h>
#include <stdio.h>          // perror()

ssize_t do_sendfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes_sent;
    size_t total_bytes_sent = 0;
    while (total_bytes_sent < count) {
        if ((bytes_sent = sendfile(out_fd, in_fd, &offset,
                count - total_bytes_sent)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("sendfile");
            return -1;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Send 'len' bytes starting at 'offset' from 'file_fd' to 'socket_fd'
do_sendfile(socket_fd, file_fd, offset, len);

Now, as the man page states, there's a limitation here: "Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket." So we can only use sendfile() for reading data from our file and sending it to the socket.
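
Here's a quick, untested sketch of how that constraint might be checked at run time (can_use_sendfile() is just an illustrative name, not part of any API):

#include <sys/stat.h>

// sendfile() (on these kernels) needs a regular, mmap-able file as its
// source and a socket as its destination, so check the descriptors
// before trying it and fall back to read()/write() for anything else.
static int can_use_sendfile(int out_fd, int in_fd) {
    struct stat in_st, out_st;
    if (fstat(in_fd, &in_st) < 0 || fstat(out_fd, &out_st) < 0)
        return 0;
    return S_ISREG(in_st.st_mode) && S_ISSOCK(out_st.st_mode);
}

With a check like that, the file-to-socket path can call do_sendfile(), and everything else falls back to the buffered copy.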

It turns out that sendfile() significantly outperforms read()/write() - I was seeing about 8% higher throughput on a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() allows you to move data between kernel buffers (via a pipe) with no copy to userland. It's a more primitive (and therefore flexible) system call than sendfile(), and requires a bit of wrapping to be useful - here's my first attempt to write data from a socket to a file:


// splice() is Linux-specific - define _GNU_SOURCE before the includes to
// get its declaration and the SPLICE_F_* flags from <fcntl.h>
#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>    // perror()
#include <unistd.h>   // pipe()

// Our pipe - a pair of file descriptors in an array - see pipe()
static int pipefd[2];

//...

ssize_t do_recvfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes, bytes_sent, bytes_in_pipe;
    size_t total_bytes_sent = 0;

    // Splice the data from in_fd into the pipe
    while (total_bytes_sent < count) {
        if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
                count - total_bytes_sent, 
                SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice");
            return -1;
        }

        // Splice the data from the pipe into out_fd
        bytes_in_pipe = bytes_sent;
        while (bytes_in_pipe > 0) {
            if ((bytes = splice(pipefd[0], NULL, out_fd, &offset, bytes_in_pipe,
                    SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
                if (errno == EINTR || errno == EAGAIN) {
                    // Interrupted system call/try again
                    // Just skip to the top of the loop and try again
                    continue;
                }
                perror("splice");
                return -1;
            }
            bytes_in_pipe -= bytes;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Set up the pipe at initialization time
if ( pipe(pipefd) < 0 ) {
    perror("pipe");
    exit(1);
}

//...

// Send 'len' bytes from 'socket_fd' to 'offset' in 'file_fd'
do_recvfile(file_fd, socket_fd, offset, len);

This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 that makes the first splice() call hang when you ask for all of the data on the socket. The Samba guys worked around this by simply limiting the data read from the socket to 16k. Modifying our first splice() call in the same way fixes the issue (MIN() here comes from <sys/param.h>):

    if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
            MIN(count - total_bytes_sent, 16384), 
            SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {

I haven't benchmarked the 'write' speed yet, but on reads splice() performed just a little more slowly than sendfile(), which I attribute to the additional user/kernel context switching. It was still significantly faster than read()/write(), though.
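
If you want to run the same kind of informal comparison yourself, timing a fixed-size transfer is enough. Here's a rough, untested sketch (time_transfer() is just an illustrative name, and this isn't the exact harness I used):

#include <stdio.h>
#include <sys/types.h>
#include <time.h>

// Time a single transfer of 'len' bytes and report throughput. Swap
// do_sendfile() for do_recvfile() or a plain read()/write() loop to
// compare the different approaches.
void time_transfer(int socket_fd, int file_fd, off_t offset, size_t len) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (do_sendfile(socket_fd, file_fd, offset, len) < 0)
        return;
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%zu bytes in %.3f s = %.1f MB/s\n",
           len, seconds, len / seconds / (1024.0 * 1024.0));
}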

As is often the case, I'm merely standing on the shoulders of giants here, collating hints and fragments, but I hope you find this post useful!

Comments
  1. Thanks for sharing your ideas. The description is really nice and clear.

    I tried the code snippet you shared in the VSFTP server, but unfortunately, instead of getting a performance gain, I am getting a degradation in performance.

    Note: In my kernel stack LRO feature is enabled for aggregation.

    -Mukesh

  2. Hi Mukesh – that’s interesting. Not sure why you’d see a perf degradation like that – what sort of numbers are you seeing? I’m not sure it’s LRO – that’s been around for a while. Perhaps VSFTP is tuned for the traditional read/write pattern?

  3. Hello Pat,

    Thanks for your quick response.

    Without zero copy changes I am getting ~25000 K bytes/sec in put performance. After the changes I am getting ~19000 K bytes/sec :-(

    One thing I would like to highlight: to make it operational, I had to use the change below for reading the data into the pipe from the socket:

    MIN(count - total_bytes_sent, 16384)

    Whereas conventional read and write use a buffer size of around 64K bytes.

    -Mukesh

  4. My kernel version is 2.6.28
    File size used for transfer ~700 MB

  5. Hi Mukesh – the small buffer size (16k) is the culprit. As I mentioned in the blog entry, older versions of the kernel have a bug in the splice implementation that causes a hang when the pipe gets full (see http://www.kerneltrap.com/mailarchive/git-commits-head/2009/10/2/10408). It looks like the fix went into kernel 2.6.32 (see http://www.kernel.org/pub/linux/kernel/v2.6/testing/v2.6.32/ChangeLog-2.6.32-rc3).

  6. Pat,

    Thanks for the info. I looked into the details of the threads you shared and then ported the patch from one of them to let the splice call accept a buffer size greater than 16384. I was able to make it operational. But to my surprise the performance went down further, to ~17000 K bytes/sec :-(

    Do I need to include SPLICE_F_NONBLOCK flag in read from pipe too?

    Here are my read and write calls:

    // Splice the data from in_fd into the pipe
    while (total_bytes_sent < count) {
    if ((bytes_sent = splice(in_fd, NULL, pipefd1, NULL,
    count - total_bytes_sent,
    SPLICE_F_MOVE | SPLICE_F_NONBLOCK)) 0) {
    if ((bytes = splice(pipefd0, NULL, out_fd, &offset, bytes_in_pipe,
    SPLICE_F_MOVE)) <= 0) {
    if (errno == EINTR || errno == EAGAIN) {
    // Interrupted system call/try again
    // Just skip to the top of the loop and try again
    continue;
    }
    perror("splice Write");
    return -1;
    }
    bytes_in_pipe -= bytes;
    }

    Any clues on the issue?

    -Mukesh Kohli

  7. The code snippet didn't come out right in my previous post. Here is the updated one:

    int
    vsf_sysutil_recvfile(const int out_fd, const int in_fd,
    unsigned int offset, unsigned int count, int pipefd0, int pipefd1)
    {

    int bytes, bytes_sent, bytes_in_pipe;
    unsigned int total_bytes_sent = 0;

    while (total_bytes_sent < count) {
    if ((bytes_sent = splice(in_fd, NULL, pipefd1, NULL,
    count - total_bytes_sent,
    SPLICE_F_MOVE | SPLICE_F_NONBLOCK)) 0) {
    if ((bytes = splice(pipefd0, NULL, out_fd, &offset, bytes_in_pipe,
    SPLICE_F_MOVE)) <= 0) {
    if (errno == EINTR || errno == EAGAIN) {
    // Interrupted system call/try again
    // Just skip to the top of the loop and try again
    continue;
    }
    perror("splice Write");
    return -1;
    }
    bytes_in_pipe -= bytes;
    }
    total_bytes_sent += bytes_sent;
    }

    return total_bytes_sent;
    }

    -Mukesh Kohli

  8. I am not able to cut & paste the code properly here; kindly just focus on the two splice calls being made. The remaining code I have taken from your post.

    -Mukesh Kohli

  9. Wow – I don’t understand that at all – why would fixing that bug result in a performance degradation? How many “splice Write” messages do you see in the kernel log? You should be able to figure out what size of chunk is being transferred – it should be more than 16k! Of course, the other thing you could do is to comment out that perror call to see if it is adding significant overhead.

    I don’t think the read should have SPLICE_F_NONBLOCK, but you could try it and see. Unfortunately I don’t have a system set up to test any of this, so you’re pretty much on your own.

    BTW – a good tip for posting source is to use https://gist.github.com/ – paste your source there and just post the URL here.

  10. The perror call should not have an impact, as ideally it won't be called.

    Here are my changes :

    git://gist.github.com/1066971.git

    -Mukesh Kohli

  11. You’re right – that perror() won’t get called in the normal flow – the code indentation had me fooled.

    I honestly can’t see why this isn’t working faster for you, and, as I mentioned, I no longer have a test rig to try things out. I’m afraid you’re on your own here…

  12. I had a similar experience. Performance of splice() is much worse than buffered read/write if the splice size is small. After using the pipe size (64K) as the splice size, it is better than read/write. However, it is still far from the zero-copy performance I expected. LRO seems to be supported only by some 10G NICs because of patent issues. I only need OS-based zero-copy on basic Gb/100Mb hardware. (I am using it with an embedded Linux.)

  13. Pat,

    After profiling (for CPU cycles) before and after adding the above patch, here is what I found (note: I have a patch to support a buffer greater than 16K, hence I am using 128K):

    Before the changes:
    % app name symbol name
    22.759 vmlinux __copy_user_inatomic
    21.495 vmlinux __copy_user
    3.2354 ext2.ko .text
    2.8263 vmlinux do_ade
    2.752 fusivlib_lkm.ko mips_flush_inv_entire_dcache
    2.6776 vmlinux handle_adel_int
    1.8222 vmlinux __bzero
    ….

    After the patch:

    % app name symbol name
    37.4879 vmlinux __copy_user
    7.0531 vmlinux __bzero
    2.7375 ext2.ko .text
    2.5765 fusivlib_lkm.ko mips_flush_inv_entire_dcache
    2.2544 vmlinux do_ade
    2.2544 vmlinux handle_adel_int
    1.8357 vmlinux get_page_from_freelist
    1.8035 nf_conntrack.ko .text

    If you look at the numbers, __copy_user_inatomic usage has pretty much gone, but __copy_user usage has increased by almost the same proportion. Also, __bzero usage has increased considerably.

    Based on the above experience, it's hard to believe there's a benefit to using splice over read/write, or that it's actually achieving zero-copy.

    -Mukesh Kohli

  14. Hi Mukesh – I think it might just be kernel-version dependent. I definitely saw a measurable perf increase (about 8%) of splice with 2.6.31, and Samson (comment above yours) also sees a benefit with 64k buffer size, albeit smaller than expected. Thanks for posting your experience here – it’s all useful info.
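
    One more thought for anyone following along: on 2.6.35 and later kernels you can ask for the pipe's actual capacity with fcntl() instead of hard-coding 16k. An untested sketch (pipe_chunk_size() is just an illustrative name):

    #define _GNU_SOURCE
    #include <fcntl.h>

    // On 2.6.35+ the pipe capacity can be queried (and enlarged with
    // F_SETPIPE_SZ); using it as the per-splice chunk size avoids
    // hard-coding the 16k workaround on kernels that don't need it.
    static long pipe_chunk_size(int pipe_write_fd) {
        long size = fcntl(pipe_write_fd, F_GETPIPE_SZ);
        if (size <= 0)
            size = 16384;   // older kernel - stick with the safe 16k chunk
        return size;
    }

    The first splice() call can then use MIN(count - total_bytes_sent, chunk) with whatever that returns.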

  15. Great information :)

  16. Hi, please, does anyone have an idea for a program for sending a file between user space and kernel space (more precisely, a packet capture file)?? ........
    It's urgent... for the love of God :)

  17. Ahlem – no idea, sorry!

  18. I just can’t figure out how to read on server-side in a TCP scenario.

    What I’m trying to do is just like this…

    On client-side I have a file descriptor (my source), then I do a write on my socket descriptor: it works fine, at least I hope.

    On the server side, I'm not able to get any bytes from my client. Surely the main issue is related to the file size on the server side.

    Any suggestions ? This is driving me insane.

  19. Oops – ignore the comments on nbd – I just deleted them. I didn’t read your question properly!

    Have you checked the return code from the client-side write? Does it return the number of bytes you sent it?

    What does your server-side code look like? Can you post it in a gist?

  20. I'll study the nbd source, but I guess what's good for me is just like the source I've found here.

  21. On client-side I have this: git://gist.github.com/3164659.git

    On server-side: git://gist.github.com/3164653.git

    If I try to write on “standard output” from client-side, it works… I mean I can put the content of my “input file” on standard output.

  22. Hi Ron – you may be encountering the same issue that I did – if you read carefully to the end of the blog post, you’ll see that I saw the first splice() call hang when I asked for all of the data on the socket. The solution is to modify the first splice call to limit the amount of data you read. I forked and modified your server-side gist: https://gist.github.com/3165080

  23. This way, I get an output file on my server but…

    -r---wx--- 1 ron ron 77309411346 lug 23 21:51 mail_rc

    I really dunno what to say! 77.3 GB?

  24. Wow – what did the server side print out for total bytes sent? Is it a sparse file? (See http://en.wikipedia.org/wiki/Sparse_file) Is it really using 77.3GB of disk?

    I can think of two possible explanations – offset is somehow being corrupted, so it’s seeking to some crazy offset in the file while it’s writing, and so creating a sparse file, or the loop is somehow broken and it’s running round the loop too many times.

    What does the content of the file look like, compared to the input file?
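
    If you want to rule sparseness in or out, a quick (untested) check is to compare the apparent size with the blocks actually allocated on disk:

    #include <stdio.h>
    #include <sys/stat.h>

    // A large gap between apparent size and allocated bytes means the
    // file is sparse.
    int main(int argc, char **argv) {
        struct stat st;
        if (argc < 2 || stat(argv[1], &st) < 0) {
            perror("stat");
            return 1;
        }
        printf("apparent size: %lld bytes, allocated: %lld bytes\n",
               (long long)st.st_size, (long long)st.st_blocks * 512LL);
        return 0;
    }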

  25. Total bytes sent on the server is 18, the same as the input file on the client side.

    Anyway, that output file is just like this “application/octet-stream”

    Sorry Pat, I'm driving you insane too.

  26. Can I post all the source code in a gist?

  27. As far as I know, you can post as much as you like in a gist. If I were you, though, I’d add some more printf calls to see what is happening – how much it’s reading in each call to splice, what the offset is, that sort of thing. I don’t actually have a Linux box handy to run anything on right now :-/

  28. One thought, Ron – add O_TRUNC to the flags for open (and note that open() needs a mode argument when O_CREAT is in the flags) – i.e.

    destination_fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    It may be that you’ve tested a few times, and the file was corrupted when splice() wasn’t working properly, and you’re just overwriting the output file and leaving the junk in place.

  29. It's the same with O_TRUNC!

    In my opinion the main issue is on the server, I guess. When I go to create the output file, the file size (in bytes) is unknown. The client should send the "count" information before the server starts writing to the output file.

    I’m too tired at the moment and I need to sleep.

    BTW, here you’ll find the client: https://gist.github.com/3166618

    You can try to run it on your linux-box and see if you get such monster file.

    Sorry.

  30. Ron – I can’t think what’s happening. You could send the file size, but you shouldn’t need to – the server should just read data from the socket until the client closes it. Good luck debugging tomorrow!

  31. The client works fine, sending all the stuff from an input file to the server. I added some code so that the server prints the content it receives directly.

    With, for example, a 16-byte file on the client side, I get all the stuff on the server side and I can print its content.

    The problem is with splice and so on, I guess.

  32. Ron – yes – splice() is still a bit ‘bleeding edge’, I think. There might be bugs or idiosyncrasies in the kernel version you’re using.

  33. $ cat /proc/version

    Linux version 3.2.0-27-generic-pae (buildd@akateko) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #43-Ubuntu SMP Fri Jul 6 15:06:05 UTC 2012

  34. According to my experiments, the big issue is related to the splice syscall on the server side.

  35. Kernel 3.2? All bets are off – I was using 2.6 – I haven’t done anything with 3.x. Sorry!

  36. I really dunno what to do at the moment

  37. Fall back to a traditional approach of reading the socket into a buffer and writing that buffer to the file?

  38. What a temptation!!!

  39. Using the sendfile syscall on the client side and the read syscall on the server side… everything works fine.

  40. Thanks for writing this. Do you know how this compares to mmapping a file, and supplying the resulting buffer to read calls? It should in theory achieve something similar. The socket data goes straight to the shared buffer and we avoid the read/write pair calls.

  41. sendfile vs mmap – I believe the underlying mechanism is the same, but I’ve been away from this for some time, so I’m not absolutely sure.
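
    Roughly, though, I think the receive side of what you're describing would look something like this - completely untested, and it assumes the incoming length is known up front:

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    // Receive 'len' bytes from socket_fd straight into a mapping of file_fd.
    // read() copies the socket data into the page-cache pages backing the
    // file, so there's no separate write() pass.
    int recv_into_mmap(int socket_fd, int file_fd, size_t len) {
        if (ftruncate(file_fd, len) < 0)
            return -1;

        char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         file_fd, 0);
        if (map == MAP_FAILED)
            return -1;

        size_t received = 0;
        while (received < len) {
            ssize_t n = read(socket_fd, map + received, len - received);
            if (n <= 0)
                break;   // EOF or error - give up
            received += n;
        }

        munmap(map, len);
        return received == len ? 0 : -1;
    }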

