Zero-Copy in Linux with sendfile() and splice()


After my recent excursion to Kernelspace, I’m back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor - like this:

// do_read and do_write are simple wrappers on the read() and 
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);
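
The wrappers themselves aren’t shown here, but a minimal sketch of what do_read might look like follows (do_write is symmetrical, calling write() instead):

#include <errno.h>
#include <unistd.h>

// Keep calling read() until 'len' bytes have arrived, EOF is hit,
// or a real error occurs.
ssize_t do_read(int fd, char *buffer, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, buffer + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;   // interrupted - just retry
            return -1;
        }
        if (n == 0)
            break;          // EOF
        total += n;
    }
    return total;
}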

While this is very simple and straightforward, it is somewhat inefficient - we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren’t examining or altering the data in any way - buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this - zero-copy.

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Rewriting my example above with sendfile() gives us the following:

#include <sys/sendfile.h>
#include <errno.h>
#include <stdio.h>

ssize_t do_sendfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes_sent;
    size_t total_bytes_sent = 0;
    while (total_bytes_sent < count) {
        if ((bytes_sent = sendfile(out_fd, in_fd, &offset,
                count - total_bytes_sent)) < 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("sendfile");
            return -1;
        }
        if (bytes_sent == 0) {
            // EOF before 'count' bytes - errno isn't set in this case,
            // so treat it as end of data rather than spinning forever
            break;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Send 'len' bytes starting at 'offset' from 'file_fd' to 'socket_fd'
do_sendfile(socket_fd, file_fd, offset, len);

Now, as the man page states, there’s a limitation here: “Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket.” So, we can only use sendfile() for reading data from our file and sending it to the socket.
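
Just to illustrate the constraint - this snippet isn’t from my server code, merely a sketch - going the other way fails with EINVAL, since a socket can’t back mmap()-like reads:

#include <sys/sendfile.h>
#include <stdio.h>

// Socket -> file via sendfile() is rejected on these kernels,
// because in_fd must support mmap()-like operations.
ssize_t try_socket_to_file(int file_fd, int socket_fd, size_t len) {
    ssize_t n = sendfile(file_fd, socket_fd, NULL, len);
    if (n < 0) {
        perror("sendfile");  // expect EINVAL here
    }
    return n;
}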

It turns out that sendfile() significantly outperforms read()/write() - I was seeing about 8% higher throughput in a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() allows you to move data between kernel buffers (via a pipe) with no copy to userland. It’s a more primitive (and therefore more flexible) system call than sendfile(), and it requires a bit of wrapping to be useful. Here’s my first attempt at writing data from a socket to a file:


#define _GNU_SOURCE    // for splice() and the SPLICE_F_* flags
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Our pipe - a pair of file descriptors in an array - see pipe(2):
// http://linux.die.net/man/2/pipe
static int pipefd[2];

//...

ssize_t do_recvfile(int out_fd, int in_fd, loff_t offset, size_t count) {
    ssize_t bytes, bytes_sent, bytes_in_pipe;
    size_t total_bytes_sent = 0;

    while (total_bytes_sent < count) {
        // Splice the data from in_fd into the pipe
        if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
                count - total_bytes_sent,
                SPLICE_F_MORE | SPLICE_F_MOVE)) < 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice");
            return -1;
        }
        if (bytes_sent == 0) {
            // The socket closed before 'count' bytes arrived - errno
            // isn't set in this case, so stop rather than spin forever
            break;
        }

        // Splice the data from the pipe into out_fd
        bytes_in_pipe = bytes_sent;
        while (bytes_in_pipe > 0) {
            if ((bytes = splice(pipefd[0], NULL, out_fd, &offset, bytes_in_pipe,
                    SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
                if (errno == EINTR || errno == EAGAIN) {
                    // Interrupted system call/try again
                    // Just skip to the top of the loop and try again
                    continue;
                }
                perror("splice");
                return -1;
            }
            bytes_in_pipe -= bytes;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Set up the pipe at initialization time
if (pipe(pipefd) < 0) {
    perror("pipe");
    exit(1);
}

//...

// Send 'len' bytes from 'socket_fd' to 'offset' in 'file_fd'
do_recvfile(file_fd, socket_fd, offset, len);

This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 (fixed in 2.6.32) that makes the first splice() call hang when you ask for all of the data on the socket. The Samba developers worked around this by simply limiting the amount of data read from the socket to 16k. Modifying our first splice() call in the same way fixes the issue:

    if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
            MIN(count - total_bytes_sent, 16384), 
            SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
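
One note: MIN isn’t defined for you here. On glibc you can pull it in from <sys/param.h>, or simply define it yourself:

#include <sys/param.h>  // provides MIN() on glibc

// ...or, if you'd rather not depend on that header:
#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif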

I haven’t benchmarked the ‘write’ direction yet, but on reads splice() performed just a little slower than sendfile() (I attribute the difference to the additional user/kernel context switching), though still significantly faster than read()/write().

As is often the case, I’m merely standing on the shoulders of giants here, collating hints and fragments, but I hope you find this post useful!


Comments

Mukesh

Thanks for sharing your ideas. The description is really nice and clear.

I tried the code snippet you shared in the VSFTP server. But unfortunately, instead of a performance gain, I am seeing a performance degradation.

Note: in my kernel stack, the LRO feature is enabled for aggregation.

-Mukesh

Pat Patterson

Hi Mukesh - that’s interesting. Not sure why you’d see a perf degradation like that - what sort of numbers are you seeing? I’m not sure it’s LRO - that’s been around for a while. Perhaps VSFTP is tuned for the traditional read/write pattern?

Mukesh

Hello Pat,

Thanks for your quick response.

Without the zero-copy changes I am getting ~25000 Kbytes/sec in put performance. After the changes I am getting ~19000 Kbytes/sec :-(

One thing I would like to highlight: to make it operational, I had to use the change below for reading the data from the socket into the pipe:

MIN(count - total_bytes_sent, 16384)

Whereas conventional read and write use a buffer size of around 64K bytes.

-Mukesh

Pat Patterson

Hi Mukesh - the small buffer size (16k) is the culprit. As I mentioned in the blog entry, older versions of the kernel have a bug in the splice implementation that causes a hang when the pipe gets full (see http://www.kerneltrap.com/mailarchive/git-commits-head/2009/10/2/10408). It looks like the fix went into kernel 2.6.32 (see http://www.kernel.org/pub/linux/kernel/v2.6/testing/v2.6.32/ChangeLog-2.6.32-rc3).

Mukesh

Pat,

Thanks for the info. I looked into the details of the threads you shared and then ported the patch from one of them, so that the splice call accepts a buffer size greater than 16384. I was able to make it operational. But to my surprise, the performance went down further, to ~17000 Kbytes/sec :-(

Do I need to include the SPLICE_F_NONBLOCK flag in the read from the pipe too?

Here are my read and write calls:

// Splice the data from in_fd into the pipe
while (total_bytes_sent < count) {
    if ((bytes_sent = splice(in_fd, NULL, pipefd1, NULL,
            count - total_bytes_sent,
            SPLICE_F_MOVE | SPLICE_F_NONBLOCK)) <= 0) {
        if (errno == EINTR || errno == EAGAIN) {
            // Interrupted system call/try again
            // Just skip to the top of the loop and try again
            continue;
        }
        perror("splice Read");
        return -1;
    }

    // Splice the data from the pipe into out_fd
    bytes_in_pipe = bytes_sent;
    while (bytes_in_pipe > 0) {
        if ((bytes = splice(pipefd0, NULL, out_fd, &offset, bytes_in_pipe,
                SPLICE_F_MOVE)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice Write");
            return -1;
        }
        bytes_in_pipe -= bytes;
    }
    total_bytes_sent += bytes_sent;
}

Any clues on the issue?

-Mukesh Kohli

Mukesh

The code snippet didn’t come out right in my previous post. Here is the updated one:

int vsf_sysutil_recvfile(const int out_fd, const int in_fd,
        unsigned int offset, unsigned int count,
        int pipefd0, int pipefd1) {

    int bytes, bytes_sent, bytes_in_pipe;
    unsigned int total_bytes_sent = 0;

    while (total_bytes_sent < count) {
        if ((bytes_sent = splice(in_fd, NULL, pipefd1, NULL,
                count - total_bytes_sent,
                SPLICE_F_MOVE | SPLICE_F_NONBLOCK)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice Read");
            return -1;
        }

        bytes_in_pipe = bytes_sent;
        while (bytes_in_pipe > 0) {
            if ((bytes = splice(pipefd0, NULL, out_fd, &offset, bytes_in_pipe,
                    SPLICE_F_MOVE)) <= 0) {
                if (errno == EINTR || errno == EAGAIN) {
                    // Interrupted system call/try again
                    // Just skip to the top of the loop and try again
                    continue;
                }
                perror("splice Write");
                return -1;
            }
            bytes_in_pipe -= bytes;
        }
        total_bytes_sent += bytes_sent;
    }

    return total_bytes_sent;
}

-Mukesh Kohli

Mukesh

I am not able to cut and paste the code properly here, so kindly just focus on the two splice calls being made. The remaining code I have taken from your post.

-Mukesh Kohli

Pat Patterson

Wow - I don’t understand that at all - why would fixing that bug result in a performance degradation? How many “splice Write” messages do you see in your error output? You should be able to figure out what size of chunk is being transferred - it should be more than 16k! Of course, the other thing you could do is to comment out that perror call to see if it is adding significant overhead.

I don’t think the read should have SPLICE_F_NONBLOCK, but you could try it and see. Unfortunately I don’t have a system set up to test any of this, so you’re pretty much on your own.

BTW - a good tip for posting source is to use https://gist.github.com/ - paste your source there and just post the URL here.

Mukesh

The perror call should not have an impact, as ideally it won’t be called.

Here are my changes:

git://gist.github.com/1066971.git

-Mukesh Kohli

Pat Patterson

You’re right - that perror() won’t get called in the normal flow - the code indentation had me fooled.

I honestly can’t see why this isn’t working faster for you, and, as I mentioned, I no longer have a test rig to try things out. I’m afraid you’re on your own here…

Samson Chen

I had a similar experience. Performance of splice() is much worse than buffered read/write if the splice size is small. After matching the splice size to the pipe size (64K), it is better than read/write. However, it is still far from the zero-copy performance I expected. LRO seems to be supported only by some 10G NICs, because of patent issues. I only need OS-based zero-copy on basic Gb/100Mb hardware. (I am using it with an embedded Linux.)

Mukesh

Pat,

After profiling (for CPU cycles) before and after the addition of the above patch, here is what I found (note: I have a patch to support buffers greater than 16K, hence I am using 128K):

Before the changes:

%        app name         symbol name
22.759   vmlinux          __copy_user_inatomic
21.495   vmlinux          __copy_user
3.2354   ext2.ko          .text
2.8263   vmlinux          do_ade
2.752    fusivlib_lkm.ko  mips_flush_inv_entire_dcache
2.6776   vmlinux          handle_adel_int
1.8222   vmlinux          __bzero
...

After the patch:

%        app name         symbol name
37.4879  vmlinux          __copy_user
7.0531   vmlinux          __bzero
2.7375   ext2.ko          .text
2.5765   fusivlib_lkm.ko  mips_flush_inv_entire_dcache
2.2544   vmlinux          do_ade
2.2544   vmlinux          handle_adel_int
1.8357   vmlinux          get_page_from_freelist
1.8035   nf_conntrack.ko  .text
...

If you notice, __copy_user_inatomic usage has pretty much gone, but __copy_user() usage has increased by almost the same proportion. Also, __bzero() usage has increased considerably.

Based on the above experience, it’s hard to see the benefit of using splice over read/write, or that it is actually achieving zero-copy.

-Mukesh Kohli

Pat Patterson

Hi Mukesh - I think it might just be kernel-version dependent. I definitely saw a measurable perf increase (about 8%) of splice with 2.6.31, and Samson (comment above yours) also sees a benefit with 64k buffer size, albeit smaller than expected. Thanks for posting your experience here - it’s all useful info.
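
By the way, one more thing that may be worth trying on newer kernels (2.6.35 and later, if I remember right): you can grow the pipe itself with fcntl() and F_SETPIPE_SZ, so a single splice() can move more than the default 64K at a time. An untested sketch:

#define _GNU_SOURCE  // for F_SETPIPE_SZ
#include <fcntl.h>
#include <stdio.h>

// Ask the kernel for a bigger pipe buffer; returns the capacity
// actually granted (rounded up to a power-of-two number of pages),
// or -1 on failure.
int grow_pipe(int pipefd[2], int size) {
    int granted = fcntl(pipefd[1], F_SETPIPE_SZ, size);
    if (granted < 0) {
        perror("fcntl(F_SETPIPE_SZ)");
    }
    return granted;
}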

ahlem

Hi, please, does anyone have an idea for a program to send a file between user space and kernel space (more precisely, a packet capture file)? It’s urgent... for the love of god :)

Ron

I just can’t figure out how to read on the server side in a TCP scenario.

What I’m trying to do is just like this…

On the client side I have a file descriptor (my source); then I do a write on my socket descriptor. It works fine, or at least I hope so.

On the server side, I’m not able to get any bytes from my client. Surely the main issue is related to the file size on the server side.

Any suggestions? This is driving me insane.

Pat Patterson

Oops - ignore the comments on nbd - I just deleted them. I didn’t read your question properly!

Have you checked the return code from the client-side write? Does it return the number of bytes you sent it?

What does your server-side code look like? Can you post it in a gist?

Ron

On the client side I have this: git://gist.github.com/3164659.git

On the server side: git://gist.github.com/3164653.git

If I try to write to standard output from the client side, it works… I mean, I can put the content of my input file on standard output.

Pat Patterson

Hi Ron - you may be encountering the same issue that I did - if you read carefully to the end of the blog post, you’ll see that I saw the first splice() call hang when I asked for all of the data on the socket. The solution is to modify the first splice call to limit the amount of data you read. I forked and modified your server-side gist: https://gist.github.com/3165080

Ron

This way, I get an output file on my server but…

-r---wx--- 1 ron ron 77309411346 lug 23 21:51 mail_rc

I really dunno what to say! 77.3 GB?

Pat Patterson

Wow - what did the server side print out for total bytes sent? Is it a sparse file? (See http://en.wikipedia.org/wiki/Sparse_file) Is it really using 77.3GB of disk?

I can think of two possible explanations - offset is somehow being corrupted, so it’s seeking to some crazy offset in the file while it’s writing, and so creating a sparse file, or the loop is somehow broken and it’s running round the loop too many times.

What does the content of the file look like, compared to the input file?
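
One quick way to check for sparseness, by the way: compare the file’s apparent size with the space actually allocated - something like this sketch, where st_blocks counts 512-byte units:

#include <stdio.h>
#include <sys/stat.h>

// A sparse file's allocated space will be far smaller than its
// apparent size.
void check_sparse(const char *path) {
    struct stat st;
    if (stat(path, &st) == 0) {
        printf("apparent: %lld bytes, allocated: %lld bytes\n",
               (long long)st.st_size, (long long)st.st_blocks * 512LL);
    }
}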

Ron

Total bytes sent on the server is 18, the same as the input file on the client side.

Anyway, that output file is just “application/octet-stream”.

Sorry Pat, I’m driving you insane too.

Pat Patterson

As far as I know, you can post as much as you like in a gist. If I were you, though, I’d add some more printf calls to see what is happening - how much it’s reading in each call to splice, what the offset is, that sort of thing. I don’t actually have a Linux box handy to run anything on right now :-/

Pat Patterson

One thought, Ron - add O_TRUNC to the flags for open - i.e.

destination_fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC);

It may be that you’ve tested a few times, and the file was corrupted when splice() wasn’t working properly, and you’re just overwriting the output file and leaving the junk in place.

Ron

It’s the same with O_TRUNC!

In my opinion the main issue is on the server, I guess. When I create the output file, the file size (in bytes) is unknown. The client should send the “count” information before the server starts writing to the output file.

I’m too tired at the moment and I need to sleep.

BTW, here you’ll find the client: https://gist.github.com/3166618

You can try to run it on your Linux box and see if you get such a monster file.

Sorry.

Pat Patterson

Ron - I can’t think what’s happening. You could send the file size, but you shouldn’t need to - the server should just read data from the socket until the client closes it. Good luck debugging tomorrow!

Ron

The client works fine, sending the server all the content of the input file. I added some code so the server prints the content it receives directly.

With, for example, a 16-byte file on the client side, I get all the content on the server side and I can print it.

The problem is with splice and so on, I guess.

Ron

$ cat /proc/version

Linux version 3.2.0-27-generic-pae (buildd@akateko) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #43-Ubuntu SMP Fri Jul 6 15:06:05 UTC 2012

Rajiv

Thanks for writing this. Do you know how this compares to mmapping a file and supplying the resulting buffer to read() calls? It should in theory achieve something similar. The socket data goes straight to the shared buffer and we avoid the read/write pair of calls.
