Superpatterns Pat Patterson on the Cloud, Identity and Single Malt Scotch

14Jul/1016

Semaphores on Linux – sem_init() vs sem_open()

Credit to Denelson83 for the image - click for the original

Regular readers will know that I'm working on a Linux server daemon that, amongst other things, moves data back and forth between sockets and files without it appearing in user space, and even 'tees' that data to a second destination, again without a copy to a user space buffer. Now I have multiple instances of my server running, and they need to synchronize access to shared data structures.

The standard mechanism for this is the semaphore. I won't get into a deep discussion of semaphores here, the Wikipedia article linked in the preceding sentence gives a good description. Basically, if you want to ensure that no more than one thread (ok, 'n' threads in the general case) has access to some resource concurrently, you use a semaphore.

Looking for an example of semaphores on Linux, I found the aptly named Semaphores in Linux, by Vikram Shukla, on the O'Reilly Linux DevCenter. This is a very useful article, explaining the general semaphore concept and comparing the System V and POSIX semaphore implementations.

Guided by the article, in particular, the 'Related Process' example, which closely matched my use case, I wrote a quick test program using the POSIX sem_init() call to initialize a semaphore and sem_wait()/sem_post() to decrement/increment the semaphore respectively. Only one problem. It didn't work - my processes had concurrent access to the shared resource!

Going back to Vikram's example, and reading the sem_init() man page very carefully, the issue seems to be that the semaphore is created on the stack of the parent process. When the child is forked, it gets a copy of the semaphore, not a reference to the parent's semaphore. Adding a few sleep()'s and printf()'s to the example highlights the problem:

#include <semaphore.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
  int fd, i,count=0,nloop=10,zero=0,*ptr;
  sem_t mutex;

  //open a file and map it into memory

  fd = open("log.txt",O_RDWR|O_CREAT,S_IRWXU);
  write(fd,&zero,sizeof(int));
  ptr = mmap(NULL,sizeof(int),PROT_READ |PROT_WRITE,MAP_SHARED,fd,0);
  close(fd);

  /* create, initialize semaphore */
  if( sem_init(&mutex,1,1) < 0)
    {
      perror("semaphore initilization");
      exit(0);
    }
  if (fork() == 0) { /* child process*/
    for (i = 0; i < nloop; i++) {
      sem_wait(&mutex);
      printf("child entered crititical section: %d\n", (*ptr)++);
      sleep(2);
      printf("child leaving critical section\n");
      sem_post(&mutex);
      sleep(1);
    }
    exit(0);
  }
  /* back to parent process */
  for (i = 0; i < nloop; i++) {
    sem_wait(&mutex);
    printf("parent entered critical section: %d\n", (*ptr)++);
    sleep(2);
    printf("parent leaving critical section\n");
    sem_post(&mutex);
    sleep(1);
  }
  exit(0);
}

Running this shows that both the parent and the child are in the critical section at the same time:

child entered critical section: 0
parent entered critical section: 1
parent leaving critical section
child leaving critical section
parent entered critical section: 2
child entered critical section: 3
...

The explanation is in the sem_init() man page:

If pshared is non-zero, then the semaphore is shared between processes, and should be located in a region of shared memory (see shm_open(3), mmap(2), and shmget(2)). (Since a child created by fork(2) inherits its parent's memory mappings, it can also access the semaphore.) Any process that can access the shared memory region can operate on the semaphore using sem_post(3), sem_wait(3), etc.

The key here is that the semaphore must be in a region of shared memory, even if you're accessing it from related processes such as a parent and its child.

There are two ways of fixing the problem. The first is to use shm_open(), ftruncate() and mmap() to create a shared memory region and obtain a pointer to it:

  int shm;
  sem_t * mutex;

  ...

  if ((shm = shm_open("myshm", O_RDWR | O_CREAT, S_IRWXU))   0) {
    perror("shm_open");
    exit(1);
  }

  if ( ftruncate(shm, sizeof(sem_t)) < 0 ) {
    perror("ftruncate");
    exit(1);
  }

  if ((mutex = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
      MAP_SHARED, shm, 0)) == MAP_FAILED) {
    perror("mmap");
    exit(1);
  }

  if (sem_init(mutex, 1, 1) < 0) {
    perror("semaphore initialization");
    exit(1);
  }

  ...

The other, simpler, solution is to just use sem_open(), which Vikram describes in the next section of the article:

  if ((mutex = sem_open("mysemaphore", O_CREAT, 0644, 1)) == SEM_FAILED) {
    perror("semaphore initilization");
    exit(1);
  }

Either of these approaches gives the desired result:

child entered crit section: 0
child leaving crit section
parent entered crit section: 1
parent leaving crit section
child entered crit section: 2
child leaving crit section
parent entered crit section: 3
...

Postscript: this is a minor flaw in an otherwise excellent and very useful article. I address it here, rather than in a comment on the article, due to the amount of space required for a full explanation.

8Jul/100

A Cup of tee() and a splice() of Cake

Credit to dajobe for the photo - click the image for the original

Apologies for the terrible pun in the title - I just couldn't resist :-)

I was hard at work on my current project the other day, a user-mode Linux server daemon, when I realized that I would need to both copy incoming data to disk and forward it to another daemon via a socket. This caused me a moment's consternation, since I was using splice() to move incoming data from a socket to a file without needing an intermediate copy in a user-mode buffer, but then I remembered mention of tee(), a companion to splice().

Where splice() moves data directly from a socket (or file) to a pipe (or vice versa), tee() copies data from one pipe to another leaving the data intact in the source pipe. You can then use splice() again to move the data from tee()'s destination pipe to another file descriptor.

It was the work of a few minutes to code up a quick sample app to test this. Since it's short, and there seems to be a dearth of tee()/splice() examples, here it is in its entirety:

#define _GNU_SOURCE // needed for splice
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void test_splice(int in, int out, int out2, int number_of_bytes) {
    int rcvd = 0, sent = 0, teed = 0, remaining = number_of_bytes;
    int pipe1[2], pipe2[2];

    if (pipe(pipe1) < 0) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    if (pipe(pipe2) < 0) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    while (remaining > 0) {
        if ((rcvd = splice(in, NULL, pipe1[1], NULL, remaining,
                SPLICE_F_MORE | SPLICE_F_MOVE)) < 0) {
            perror("splice");
            exit(EXIT_FAILURE);
        }

        if (rcvd == 0) {
            printf("Reached end of input file\n");
            break;
        }

        printf("Wrote %d bytes to pipe1\n", rcvd);

        if ((teed = tee(pipe1[0], pipe2[1], rcvd, 0)) < 0) {
            perror("tee");
            exit(EXIT_FAILURE);
        }

        printf("Copied %d bytes from pipe1 to pipe2\n", teed);

        if ((sent = splice(pipe1[0], NULL, out, NULL, rcvd, SPLICE_F_MORE
                | SPLICE_F_MOVE)) < 0) {
            perror("splice");
            exit(EXIT_FAILURE);
        }

        printf("Read %d bytes from pipe1\n", sent);

        if ((sent = splice(pipe2[0], NULL, out2, NULL, teed, SPLICE_F_MORE
                | SPLICE_F_MOVE)) < 0) {
            perror("splice");
            exit(EXIT_FAILURE);
        }

        printf("Read %d bytes from pipe2\n", sent);

        remaining -= rcvd;
    }
}

int main(int argc, char *argv[]) {
    int infile, outfile1, outfile2, number_of_bytes = INT_MAX;

    if (argc < 4) {
        fprintf(stderr,
                "Usage: %s infile outfile1 outfile2 [number_of_bytes]\n",
                argv[0]);
        exit(EXIT_FAILURE);
    }

    if ((infile = open(argv[1], O_RDONLY)) < 0) {
        fprintf(stderr, "Can't open %s for reading\n", argv[1]);
        exit(EXIT_FAILURE);
    }

    if ((outfile1 = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
        fprintf(stderr, "Can't create/open %s for writing\n", argv[2]);
        exit(EXIT_FAILURE);
    }

    if ((outfile2 = open(argv[3], O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
        fprintf(stderr, "Can't create/open %s for writing\n", argv[3]);
        exit(EXIT_FAILURE);
    }

    if (argc > 4) {
        number_of_bytes = atoi(argv[4]);
    }

    test_splice(infile, outfile1, outfile2, number_of_bytes);

    return EXIT_SUCCESS;
}

The example doesn't need much explanation; I added the 'number_of_bytes' parameter so that you can copy a limited amount of data from an infinite source such as /dev/zero or /dev/urandom. Note that a 'real' implementation needs a bit more code, since it's not safe to assume that all the bytes get moved to the destination pipes in one hit, but that would obscure the example :-)

1Jun/1042

Zero-Copy in Linux with sendfile() and splice()

A Splice

After my recent excursion to Kernelspace, I'm back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor - like this:

// do_read and do_write are simple wrappers on the read() and 
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);

While this is very simple and straightforward, it is somewhat inefficient - we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren't examining or altering the data in any way - buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this - zero-copy.

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Rewriting my example above with sendfile() gives us the following:

ssize_t do_sendfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes_sent;
    size_t total_bytes_sent = 0;
    while (total_bytes_sent < count) {
        if ((bytes_sent = sendfile(out_fd, in_fd, &offset,
                count - total_bytes_sent)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("sendfile");
            return -1;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Send 'len' bytes starting at 'offset' from 'file_fd' to 'socket_fd'
do_sendfile(socket_fd, file_fd, offset, len);

Now, as the man page states, there's a limitation here: "Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd, must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket.". So, we can only use sendfile() for reading data from our file and sending it to the socket.

It turns out that sendfile() significantly outperforms read()/write() - I was seeing about 8% higher throughput on a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() allows you to move data between kernel buffers (via a pipe) with no copy to userland. It's a more primitive (and therefore flexible) system call than sendfile(), and requires a bit of wrapping to be useful - here's my first attempt to write data from a socket to a file:


// Our pipe - a pair of file descriptors in an array - see pipe()
static int pipefd[2];

//...

ssize_t do_recvfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes, bytes_sent, bytes_in_pipe;
    size_t total_bytes_sent = 0;

    // Splice the data from in_fd into the pipe
    while (total_bytes_sent < count) {
        if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
                count - total_bytes_sent, 
                SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice");
            return -1;
        }

        // Splice the data from the pipe into out_fd
        bytes_in_pipe = bytes_sent;
        while (bytes_in_pipe > 0) {
            if ((bytes = splice(pipefd[0], NULL, out_fd, &offset, bytes_in_pipe,
                    SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
                if (errno == EINTR || errno == EAGAIN) {
                    // Interrupted system call/try again
                    // Just skip to the top of the loop and try again
                    continue;
                }
                perror("splice");
                return -1;
            }
            bytes_in_pipe -= bytes;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Setup the pipe at initialization time
if ( pipe(pipefd) < 0 ) {
    perror("pipe");
    exit(1);
}

//...

// Send 'len' bytes from 'socket_fd' to 'offset' in 'file_fd'
do_recvfile(file_fd, socket_fd, offset, len);

This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 that makes the first splice() call hang when you ask for all of the data on the socket. The Samba guys worked around this by simply limiting the data read from the socket to 16k. Modifying our first splice call similarly fixes the issue:

    if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
            MIN(count - total_bytes_sent, 16384), 
            SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {

I haven't benchmarked the 'write' speed yet, but, on reads, splice() performed just a little slower than sendfile(), which I attribute to the additional user/kernel context switching, but, again, significantly faster than read()/write().

As is often the case, I'm merely standing on the shoulders of giants here, collating hints and fragments, but I hope you find this post useful!

4May/1067

A Simple Block Driver for Linux Kernel 2.6.31

Programming Amazon Web Services

Linux Device Drivers, 3rd Edition

My current work involves writing my first Linux block device driver. Going to the web to find a sample, I discovered Jonathan Corbet's Simple Block Driver article with its associated block driver example code. It's a nice succinct implementation of a ramdisk - pretty much the simplest working block device. There's only one problem, though, the article was written in 2003, when kernel 2.6.0 was the new kid on the block. Trying to build it on openSUSE 11.2 with kernel 2.6.31 just produced a slew of compile errors. A bit of research revealed that there were major changes to the kernel block device interface in 2.6.31, so I would have to port the example to get it working.

About a day and a half of poring through the kernel source and the excellent LDD3 (hardcopy) later, I had a running simple block driver for kernel 2.6.31. I've also tested it successfully on SUSE 11 SP1 Beta, which uses kernel 2.6.32. Here's the code, followed by instructions for getting it working.

sbd.c

/*
 * A sample, extra-simple block driver. Updated for kernel 2.6.31.
 *
 * (C) 2003 Eklektix, Inc.
 * (C) 2010 Pat Patterson <pat at superpat dot com>
 * Redistributable under the terms of the GNU GPL.
 */

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>

#include <linux/kernel.h> /* printk() */
#include <linux/fs.h>     /* everything... */
#include <linux/errno.h>  /* error codes */
#include <linux/types.h>  /* size_t */
#include <linux/vmalloc.h>
#include <linux/genhd.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>

MODULE_LICENSE("Dual BSD/GPL");
static char *Version = "1.4";

static int major_num = 0;
module_param(major_num, int, 0);
static int logical_block_size = 512;
module_param(logical_block_size, int, 0);
static int nsectors = 1024; /* How big the drive is */
module_param(nsectors, int, 0);

/*
 * We can tweak our hardware sector size, but the kernel talks to us
 * in terms of small sectors, always.
 */
#define KERNEL_SECTOR_SIZE 512

/*
 * Our request queue.
 */
static struct request_queue *Queue;

/*
 * The internal representation of our device.
 */
static struct sbd_device {
	unsigned long size;
	spinlock_t lock;
	u8 *data;
	struct gendisk *gd;
} Device;

/*
 * Handle an I/O request.
 */
static void sbd_transfer(struct sbd_device *dev, sector_t sector,
		unsigned long nsect, char *buffer, int write) {
	unsigned long offset = sector * logical_block_size;
	unsigned long nbytes = nsect * logical_block_size;

	if ((offset + nbytes) > dev->size) {
		printk (KERN_NOTICE "sbd: Beyond-end write (%ld %ld)\n", offset, nbytes);
		return;
	}
	if (write)
		memcpy(dev->data + offset, buffer, nbytes);
	else
		memcpy(buffer, dev->data + offset, nbytes);
}

static void sbd_request(struct request_queue *q) {
	struct request *req;

	req = blk_fetch_request(q);
	while (req != NULL) {
		// blk_fs_request() was removed in 2.6.36 - many thanks to
		// Christian Paro for the heads up and fix...
		//if (!blk_fs_request(req)) {
		if (req == NULL || (req->cmd_type != REQ_TYPE_FS)) {
			printk (KERN_NOTICE "Skip non-CMD request\n");
			__blk_end_request_all(req, -EIO);
			continue;
		}
		sbd_transfer(&Device, blk_rq_pos(req), blk_rq_cur_sectors(req),
				req->buffer, rq_data_dir(req));
		if ( ! __blk_end_request_cur(req, 0) ) {
			req = blk_fetch_request(q);
		}
	}
}

/*
 * The HDIO_GETGEO ioctl is handled in blkdev_ioctl(), which
 * calls this. We need to implement getgeo, since we can't
 * use tools such as fdisk to partition the drive otherwise.
 */
int sbd_getgeo(struct block_device * block_device, struct hd_geometry * geo) {
	long size;

	/* We have no real geometry, of course, so make something up. */
	size = Device.size * (logical_block_size / KERNEL_SECTOR_SIZE);
	geo->cylinders = (size & ~0x3f) >> 6;
	geo->heads = 4;
	geo->sectors = 16;
	geo->start = 0;
	return 0;
}

/*
 * The device operations structure.
 */
static struct block_device_operations sbd_ops = {
		.owner  = THIS_MODULE,
		.getgeo = sbd_getgeo
};

static int __init sbd_init(void) {
	/*
	 * Set up our internal device.
	 */
	Device.size = nsectors * logical_block_size;
	spin_lock_init(&Device.lock);
	Device.data = vmalloc(Device.size);
	if (Device.data == NULL)
		return -ENOMEM;
	/*
	 * Get a request queue.
	 */
	Queue = blk_init_queue(sbd_request, &Device.lock);
	if (Queue == NULL)
		goto out;
	blk_queue_logical_block_size(Queue, logical_block_size);
	/*
	 * Get registered.
	 */
	major_num = register_blkdev(major_num, "sbd");
	if (major_num < 0) {
		printk(KERN_WARNING "sbd: unable to get major number\n");
		goto out;
	}
	/*
	 * And the gendisk structure.
	 */
	Device.gd = alloc_disk(16);
	if (!Device.gd)
		goto out_unregister;
	Device.gd->major = major_num;
	Device.gd->first_minor = 0;
	Device.gd->fops = &sbd_ops;
	Device.gd->private_data = &Device;
	strcpy(Device.gd->disk_name, "sbd0");
	set_capacity(Device.gd, nsectors);
	Device.gd->queue = Queue;
	add_disk(Device.gd);

	return 0;

out_unregister:
	unregister_blkdev(major_num, "sbd");
out:
	vfree(Device.data);
	return -ENOMEM;
}

static void __exit sbd_exit(void)
{
	del_gendisk(Device.gd);
	put_disk(Device.gd);
	unregister_blkdev(major_num, "sbd");
	blk_cleanup_queue(Queue);
	vfree(Device.data);
}

module_init(sbd_init);
module_exit(sbd_exit);

Makefile

obj-m := sbd.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules

There are two main areas of change compared with Jonathan's original:

  • sbd_request() uses the blk_fetch_request(), blk_rq_pos(), blk_rq_cur_sectors() and __blk_end_request_cur() functions rather than elv_next_request(), req->sector, req->current_nr_sectors and end_request() respectively. The structure of the loop also changes so we handle each sector from the request individually. One outstanding task for me is to investigate whether req->buffer holds all of the data for the entire request, so I can handle it all in one shot, rather than sector-by-sector. My first attempt resulted in the (virtual) machine hanging when I installed the driver, so I clearly need to do some more work in this area!
  • The driver implements the getgeo operation (in sbd_getgeo), rather than ioctl, since blkdev_ioctl now handles HDIO_GETGEO by calling the driver's getgeo function. This is a nice simplification since it moves a copy_to_user call out of each driver and into the kernel.

Before building, ensure you have the kernel source, headers, gcc, make etc - if you've read this far, you likely have all this and/or know how to get it, so I won't spell it all out here. You'll also need to go to the kernel source directory and do the following to prepare your build environment, if you have not already done so:

cd /usr/src/`uname -r`
make oldconfig && make prepare

Now, back in the directory with the sbd source, you can build it:

make -C /lib/modules/`uname -r`/build M=`pwd` modules

You'll see a warning about 'Version' being defined, but not used, but don't worry about that :-). Now we can load the module, partition the ramdisk, make a filesystem, mount it, and create a file:

opensuse:/home/pat/sbd # insmod sbd.ko
opensuse:/home/pat/sbd # fdisk /dev/sbd0
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x5f93978c.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-16, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-16, default 16):
Using default value 16

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
opensuse:/home/pat/sbd # mkfs /dev/sbd0p1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
64 inodes, 504 blocks
25 blocks (4.96%) reserved for the super user
First data block=1
Maximum filesystem blocks=524288
1 block group
8192 blocks per group, 8192 fragments per group
64 inodes per group

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 24 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
opensuse:/home/pat/sbd # mount /dev/sbd0p1 /mnt
opensuse:/home/pat/sbd # echo Hi > /mnt/file1
opensuse:/home/pat/sbd # cat /mnt/file1
Hi
opensuse:/home/pat/sbd # ls -l /mnt
total 13
-rw-r--r-- 1 root root     3 2010-04-29 07:04 file1
drwx------ 2 root root 12288 2010-04-29 07:04 lost+found
opensuse:/home/pat/sbd # umount /mnt
opensuse:/home/pat/sbd # rmmod sbd

Hopefully this all works for you, and is as useful for you as it has been for me. Many thanks to Jonathan for the original version and the excellent LDD3. One final piece of housekeeping - although the comment at the top of sbd.c mentions only GPL, the MODULE_LICENSE macro specifies "Dual BSD/GPL". I am interpreting the original code as being under the dual GPL/BSD license and this version is similarly dual licensed.

UPDATE (Feb 5 2011) See the comment by Michele regarding changes to logical_block_size!