Superpatterns Pat Patterson on the Cloud, Identity and Single Malt Scotch

1Jun/1042

Zero-Copy in Linux with sendfile() and splice()

A Splice

After my recent excursion to Kernelspace, I'm back in Userland working on a server process that copies data back and forth between a file and a socket. The traditional way to do this is to copy data from the source file descriptor to a buffer, then from the buffer to the destination file descriptor - like this:

// do_read and do_write are simple wrappers on the read() and 
// write() functions that keep reading/writing the file descriptor
// until all the data is processed.
do_read(source_fd, buffer, len);
do_write(destination_fd, buffer, len);

While this is very simple and straightforward, it is somewhat inefficient - we are copying data from the kernel buffer for source_fd into a buffer located in user space, then immediately copying it from that buffer to the kernel buffers for destination_fd. We aren't examining or altering the data in any way - buffer is just a bit bucket we use to get data from a socket to a file or vice versa. While working on this code, a colleague clued me in to a better way of doing this - zero-copy.
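The do_read and do_write wrappers are not shown in the post. A minimal sketch of what they might look like follows, assuming blocking file descriptors; the return conventions (a byte count, or -1 on error) are my assumption rather than the original helpers' contract.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

// Read up to 'len' bytes from 'fd', retrying after short reads.
// Returns the number of bytes read (short only at end-of-file), or -1 on error.
ssize_t do_read(int fd, void *buffer, size_t len)
{
    char *p = buffer;
    size_t total = 0;

    while (total < len) {
        ssize_t n = read(fd, p + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;       // interrupted by a signal: retry
            return -1;
        }
        if (n == 0)
            break;              // end of file, or the peer closed the socket
        total += (size_t)n;
    }
    return (ssize_t)total;
}

// Write all 'len' bytes to 'fd', retrying after short writes.
// Returns 'len' on success, or -1 on error.
ssize_t do_write(int fd, const void *buffer, size_t len)
{
    const char *p = buffer;
    size_t total = 0;

    while (total < len) {
        ssize_t n = write(fd, p + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;       // interrupted by a signal: retry
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Every byte handled this way crosses the user/kernel boundary twice, once in read() and once in write(), which is exactly the cost the rest of this post tries to avoid.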

As its name implies, zero-copy allows us to operate on data without copying it, or, at least, by minimizing the amount of copying going on. Zero Copy I: User-Mode Perspective describes the technique, with some nice diagrams and a description of the sendfile() system call.

Rewriting my example above with sendfile() gives us the following:

ssize_t do_sendfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes_sent;
    size_t total_bytes_sent = 0;
    while (total_bytes_sent < count) {
        if ((bytes_sent = sendfile(out_fd, in_fd, &offset,
                count - total_bytes_sent)) <= 0) {
            if (bytes_sent == 0) {
                // End of file on in_fd: errno is not set in this case,
                // so stop here rather than test a stale errno value
                break;
            }
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("sendfile");
            return -1;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Send 'len' bytes starting at 'offset' from 'file_fd' to 'socket_fd'
do_sendfile(socket_fd, file_fd, offset, len);

Now, as the man page states, there's a limitation here: "Presently (Linux 2.6.9 [and, in fact, as of this writing in June 2010]): in_fd, must correspond to a file which supports mmap()-like operations (i.e., it cannot be a socket); and out_fd must refer to a socket." So, we can only use sendfile() for reading data from our file and sending it to the socket.

It turns out that sendfile() significantly outperforms read()/write() - I was seeing about 8% higher throughput on a fairly informal read test. Great stuff, but our write operations are still bouncing unnecessarily through userland. After some googling around, I came across splice(), which turns out to be the primitive underlying sendfile(). An lkml thread back in 2006 carries a detailed explanation of splice() from Linus himself, but the basic gist is that splice() allows you to move data between kernel buffers (via a pipe) with no copy to userland. It's a more primitive (and therefore flexible) system call than sendfile(), and requires a bit of wrapping to be useful - here's my first attempt to write data from a socket to a file:


// splice() is Linux-specific: #define _GNU_SOURCE before #include <fcntl.h>
// Our pipe - a pair of file descriptors in an array - see pipe()
static int pipefd[2];

//...

ssize_t do_recvfile(int out_fd, int in_fd, off_t offset, size_t count) {
    ssize_t bytes, bytes_sent, bytes_in_pipe;
    size_t total_bytes_sent = 0;

    // Splice the data from in_fd into the pipe
    while (total_bytes_sent < count) {
        if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
                count - total_bytes_sent,
                SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
            if (bytes_sent == 0) {
                // in_fd hit end of stream (e.g. the peer closed the socket);
                // errno is not set, so stop rather than test a stale value
                break;
            }
            if (errno == EINTR || errno == EAGAIN) {
                // Interrupted system call/try again
                // Just skip to the top of the loop and try again
                continue;
            }
            perror("splice");
            return -1;
        }

        // Splice the data from the pipe into out_fd
        bytes_in_pipe = bytes_sent;
        while (bytes_in_pipe > 0) {
            if ((bytes = splice(pipefd[0], NULL, out_fd, &offset, bytes_in_pipe,
                    SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {
                if (bytes < 0 && (errno == EINTR || errno == EAGAIN)) {
                    // Interrupted system call/try again
                    // Just skip to the top of the loop and try again
                    continue;
                }
                perror("splice");
                return -1;
            }
            bytes_in_pipe -= bytes;
        }
        total_bytes_sent += bytes_sent;
    }
    return total_bytes_sent;
}

//...

// Setup the pipe at initialization time
if ( pipe(pipefd) < 0 ) {
    perror("pipe");
    exit(1);
}

//...

// Send 'len' bytes from 'socket_fd' to 'offset' in 'file_fd'
do_recvfile(file_fd, socket_fd, offset, len);

This almost worked on my system, and it may work fine on yours, but there is a bug in kernel 2.6.31 that makes the first splice() call hang when you ask for all of the data on the socket. The Samba guys worked around this by simply limiting the data read from the socket to 16k. Modifying our first splice call similarly fixes the issue:

    if ((bytes_sent = splice(in_fd, NULL, pipefd[1], NULL,
            MIN(count - total_bytes_sent, 16384), 
            SPLICE_F_MORE | SPLICE_F_MOVE)) <= 0) {

I haven't benchmarked the 'write' speed yet. On reads, splice() performed just a little slower than sendfile(), which I attribute to the additional user/kernel context switching, but it was still significantly faster than read()/write().

As is often the case, I'm merely standing on the shoulders of giants here, collating hints and fragments, but I hope you find this post useful!

7May/100

Bookmarks for May 6th 2010

These are my links for May 6th 2010:

4May/1086

A Simple Block Driver for Linux Kernel 2.6.31

Linux Device Drivers, 3rd Edition

My current work involves writing my first Linux block device driver. Going to the web to find a sample, I discovered Jonathan Corbet's Simple Block Driver article with its associated block driver example code. It's a nice succinct implementation of a ramdisk - pretty much the simplest working block device. There's only one problem, though: the article was written in 2003, when kernel 2.6.0 was the new kid on the block. Trying to build it on openSUSE 11.2 with kernel 2.6.31 just produced a slew of compile errors. A bit of research revealed that there were major changes to the kernel block device interface in 2.6.31, so I would have to port the example to get it working.

About a day and a half of poring through the kernel source and the excellent LDD3 (hardcopy) later, I had a running simple block driver for kernel 2.6.31. I've also tested it successfully on SUSE 11 SP1 Beta, which uses kernel 2.6.32. Here's the code, followed by instructions for getting it working.

sbd.c

/*
 * A sample, extra-simple block driver. Updated for kernel 2.6.31.
 *
 * (C) 2003 Eklektix, Inc.
 * (C) 2010 Pat Patterson <pat at superpat dot com>
 * Redistributable under the terms of the GNU GPL.
 */

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>

#include <linux/kernel.h> /* printk() */
#include <linux/fs.h>     /* everything... */
#include <linux/errno.h>  /* error codes */
#include <linux/types.h>  /* size_t */
#include <linux/vmalloc.h>
#include <linux/genhd.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>

MODULE_LICENSE("Dual BSD/GPL");
static char *Version = "1.4";

static int major_num = 0;
module_param(major_num, int, 0);
static int logical_block_size = 512;
module_param(logical_block_size, int, 0);
static int nsectors = 1024; /* How big the drive is */
module_param(nsectors, int, 0);

/*
 * We can tweak our hardware sector size, but the kernel talks to us
 * in terms of small sectors, always.
 */
#define KERNEL_SECTOR_SIZE 512

/*
 * Our request queue.
 */
static struct request_queue *Queue;

/*
 * The internal representation of our device.
 */
static struct sbd_device {
	unsigned long size;
	spinlock_t lock;
	u8 *data;
	struct gendisk *gd;
} Device;

/*
 * Handle an I/O request.
 */
static void sbd_transfer(struct sbd_device *dev, sector_t sector,
		unsigned long nsect, char *buffer, int write) {
	unsigned long offset = sector * logical_block_size;
	unsigned long nbytes = nsect * logical_block_size;

	if ((offset + nbytes) > dev->size) {
		printk (KERN_NOTICE "sbd: Beyond-end write (%lu %lu)\n", offset, nbytes);
		return;
	}
	if (write)
		memcpy(dev->data + offset, buffer, nbytes);
	else
		memcpy(buffer, dev->data + offset, nbytes);
}

static void sbd_request(struct request_queue *q) {
	struct request *req;

	req = blk_fetch_request(q);
	while (req != NULL) {
		// blk_fs_request() was removed in 2.6.36 - many thanks to
		// Christian Paro for the heads up and fix...
		//if (!blk_fs_request(req)) {
		if (req->cmd_type != REQ_TYPE_FS) {
			printk (KERN_NOTICE "Skip non-CMD request\n");
			__blk_end_request_all(req, -EIO);
			/* This request is finished; fetch the next one before looping */
			req = blk_fetch_request(q);
			continue;
		}
		sbd_transfer(&Device, blk_rq_pos(req), blk_rq_cur_sectors(req),
				req->buffer, rq_data_dir(req));
		if ( ! __blk_end_request_cur(req, 0) ) {
			req = blk_fetch_request(q);
		}
	}
}

/*
 * The HDIO_GETGEO ioctl is handled in blkdev_ioctl(), which
 * calls this. We need to implement getgeo, since we can't
 * use tools such as fdisk to partition the drive otherwise.
 */
int sbd_getgeo(struct block_device * block_device, struct hd_geometry * geo) {
	long size;

	/* We have no real geometry, of course, so make something up. */
	size = Device.size * (logical_block_size / KERNEL_SECTOR_SIZE);
	geo->cylinders = (size & ~0x3f) >> 6;
	geo->heads = 4;
	geo->sectors = 16;
	geo->start = 0;
	return 0;
}

/*
 * The device operations structure.
 */
static struct block_device_operations sbd_ops = {
		.owner  = THIS_MODULE,
		.getgeo = sbd_getgeo
};

static int __init sbd_init(void) {
	/*
	 * Set up our internal device.
	 */
	Device.size = nsectors * logical_block_size;
	spin_lock_init(&Device.lock);
	Device.data = vmalloc(Device.size);
	if (Device.data == NULL)
		return -ENOMEM;
	/*
	 * Get a request queue.
	 */
	Queue = blk_init_queue(sbd_request, &Device.lock);
	if (Queue == NULL)
		goto out;
	blk_queue_logical_block_size(Queue, logical_block_size);
	/*
	 * Get registered.
	 */
	major_num = register_blkdev(major_num, "sbd");
	if (major_num < 0) {
		printk(KERN_WARNING "sbd: unable to get major number\n");
		goto out;
	}
	/*
	 * And the gendisk structure.
	 */
	Device.gd = alloc_disk(16);
	if (!Device.gd)
		goto out_unregister;
	Device.gd->major = major_num;
	Device.gd->first_minor = 0;
	Device.gd->fops = &sbd_ops;
	Device.gd->private_data = &Device;
	strcpy(Device.gd->disk_name, "sbd0");
	set_capacity(Device.gd, nsectors);
	Device.gd->queue = Queue;
	add_disk(Device.gd);

	return 0;

out_unregister:
	unregister_blkdev(major_num, "sbd");
out:
	vfree(Device.data);
	return -ENOMEM;
}

static void __exit sbd_exit(void)
{
	del_gendisk(Device.gd);
	put_disk(Device.gd);
	unregister_blkdev(major_num, "sbd");
	blk_cleanup_queue(Queue);
	vfree(Device.data);
}

module_init(sbd_init);
module_exit(sbd_exit);

Makefile

obj-m := sbd.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

There are two main areas of change compared with Jonathan's original:

  • sbd_request() uses the blk_fetch_request(), blk_rq_pos(), blk_rq_cur_sectors() and __blk_end_request_cur() functions rather than elv_next_request(), req->sector, req->current_nr_sectors and end_request() respectively. The structure of the loop also changes so we handle each sector from the request individually. One outstanding task for me is to investigate whether req->buffer holds all of the data for the entire request, so I can handle it all in one shot, rather than sector-by-sector. My first attempt resulted in the (virtual) machine hanging when I installed the driver, so I clearly need to do some more work in this area!
  • The driver implements the getgeo operation (in sbd_getgeo), rather than ioctl, since blkdev_ioctl now handles HDIO_GETGEO by calling the driver's getgeo function. This is a nice simplification since it moves a copy_to_user call out of each driver and into the kernel.
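Stripped of the block-layer plumbing, sbd_transfer() is a bounds-checked memcpy into a flat buffer. The userspace model below is only an illustration of that data path, not driver code: struct ramdisk, ramdisk_init and ramdisk_transfer are hypothetical names, calloc() stands in for vmalloc(), a fixed 512-byte sector is assumed, and, unlike the driver, the transfer function returns a status instead of logging.

```c
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512     // assumed fixed sector size for this model

struct ramdisk {            // hypothetical stand-in for struct sbd_device
    unsigned long size;     // capacity in bytes
    unsigned char *data;    // backing store (vmalloc() in the driver)
};

// Allocate a zero-filled backing store of 'nsectors' sectors.
// Returns 0 on success, -1 if the allocation fails.
int ramdisk_init(struct ramdisk *dev, unsigned long nsectors)
{
    dev->size = nsectors * SECTOR_SIZE;
    dev->data = calloc(nsectors, SECTOR_SIZE);
    return dev->data ? 0 : -1;
}

// Copy 'nsect' sectors starting at 'sector' between the device and 'buffer'.
// 'write' selects the direction, as in sbd_transfer().
// Returns 0 on success, -1 if the range runs past the end of the device.
int ramdisk_transfer(struct ramdisk *dev, unsigned long sector,
                     unsigned long nsect, char *buffer, int write)
{
    unsigned long offset = sector * SECTOR_SIZE;
    unsigned long nbytes = nsect * SECTOR_SIZE;

    if (offset + nbytes > dev->size)
        return -1;
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
    return 0;
}
```

In the driver, the request loop calls the equivalent of ramdisk_transfer() once per chunk, passing the position and length it gets from blk_rq_pos() and blk_rq_cur_sectors().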

Before building, ensure you have the kernel source, headers, gcc, make etc - if you've read this far, you likely have all this and/or know how to get it, so I won't spell it all out here. You'll also need to go to the kernel source directory and do the following to prepare your build environment, if you have not already done so:

cd /usr/src/`uname -r`
make oldconfig && make prepare

Now, back in the directory with the sbd source, you can build it:

make -C /lib/modules/`uname -r`/build M=`pwd` modules

You'll see a warning about 'Version' being defined, but not used, but don't worry about that :-). Now we can load the module, partition the ramdisk, make a filesystem, mount it, and create a file:

opensuse:/home/pat/sbd # insmod sbd.ko
opensuse:/home/pat/sbd # fdisk /dev/sbd0
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x5f93978c.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-16, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-16, default 16):
Using default value 16

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
opensuse:/home/pat/sbd # mkfs /dev/sbd0p1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
64 inodes, 504 blocks
25 blocks (4.96%) reserved for the super user
First data block=1
Maximum filesystem blocks=524288
1 block group
8192 blocks per group, 8192 fragments per group
64 inodes per group

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 24 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
opensuse:/home/pat/sbd # mount /dev/sbd0p1 /mnt
opensuse:/home/pat/sbd # echo Hi > /mnt/file1
opensuse:/home/pat/sbd # cat /mnt/file1
Hi
opensuse:/home/pat/sbd # ls -l /mnt
total 13
-rw-r--r-- 1 root root     3 2010-04-29 07:04 file1
drwx------ 2 root root 12288 2010-04-29 07:04 lost+found
opensuse:/home/pat/sbd # umount /mnt
opensuse:/home/pat/sbd # rmmod sbd

Hopefully this all works for you, and is as useful for you as it has been for me. Many thanks to Jonathan for the original version and the excellent LDD3. One final piece of housekeeping - although the comment at the top of sbd.c mentions only GPL, the MODULE_LICENSE macro specifies "Dual BSD/GPL". I am interpreting the original code as being under the dual GPL/BSD license and this version is similarly dual licensed.

UPDATE (Feb 5 2011) See the comment by Michele regarding changes to logical_block_size!

UPDATE (Apr 23 2015) See the comment by Sarge regarding changes for kernel 3.15-rc2 and later

21Apr/100

Bookmarks for April 20th 2010

These are my links for April 20th 2010:

12Apr/100

OpenSSO Brukergruppemøte

I had a note from the OpenSSO Meetup group the other day announcing an 'OpenSSO Brukergruppemøte' (OpenSSO user group meeting, according to Google Translate) in Oslo, Norway, on Thursday April 22, 2010. Norway has long been a hub of OpenSSO activity; it's great to see this continuing into OpenSSO's post-Sun existence. Go along and say "Hei!" to Jonathan and the rest of the ForgeRock guys from me!

14Mar/104

A Weekend in Xi’an

Who are you looking at?

I've been in Xi'an, northern China, for the past few days, visiting Huawei's site here. Since my trip ran across the weekend, I found myself with a couple of days to explore the area.

Following Geoff's lead, on Saturday morning, I headed out to bīngmǎ yǒng, better known in English as the Terracotta Warriors. I had the hotel, Days Xi'an, arrange a ride for me - the most expensive component of the trip at ¥380 (approx $60), but both car and driver were at my disposal for nearly six hours. After a two hour drive through the Xi'an traffic then a few miles of countryside, I arrived at the site to be greeted by an English-speaking guide named Jay, whose excellent service was an absolute bargain for ¥100 ($15). Admission was a very reasonable ¥90 ($13 or so).

Jay walked me round the initial display of a giant marionette warrior (pictured above), made for the 2008 Beijing Olympics, a pair of bronze chariots and other artifacts, then showed me to the 360 degree cinema for a 20 minute film introducing some of the historical background to the commissioning of the Terracotta Army by Qin Shi Huang, the first emperor of a unified China, and its accidental discovery in 1974 by a local farmer digging a well. Amazingly, although the location of the imperial tomb was well known, there had been no historical record of the army itself, so the find came as a complete surprise.

Pit number 1

After the film, it was time for the main event - 'Pit Number 1' - and what an incredible sight it was - rank upon rank of larger than life warriors, vintage 210 BC. Pit 1 alone contains an estimated 8000 infantrymen, each an individual with different faces, hair and physique. I spent some time walking around the perimeter, just taking it all in. At this point, what was most impressive was the sheer scale of the army - it was only when I saw a couple of the warriors up close in the adjoining display area that I realized the craftsmanship that went into each one.

'Lucky Warrior' shoe detail

I took a series of pictures of the 'Lucky Warrior' - a kneeling archer - the sole statue found intact, all the others having suffered from the collapse of the wooden roof of the tomb. You can see all of the photos in my Flickr set from the day, but here is possibly the most interesting picture - the sole of the Lucky Warrior's shoe - complete with three different tread patterns, for the heel, mid-section and front of the sole. When you see the craftsmanship that went into a single warrior, then realize that there are over 8,000 of them, it's easy to believe that it took 700,000 workers some 40 years to complete!

The tour was rounded off by a visit to the official museum store, where I had an order from Jim for an 'Old General'. I succumbed to temptation and came away with Jim's general, an infantryman for myself, and a jade bracelet for my wife, Karen. Ah well; it's only money, I suppose.

Saturday evening, I went out with Tom, one of the Xi'an engineers, and we discovered the Little Sheep Mongolian hot pot restaurant, where we had an excellent meal of thinly sliced lamb, cooked at the table in a spicy broth, washed down by a couple of bottles of Tsingtao.

Lantern festival decorations at the South Gate

Sunday started wet, so I left my 'real' camera at the hotel and set off with only my iPhone to take pictures. A mistake as it turned out, as the day dried up soon after lunch - oh well - the iPhone did pretty well, in the event. First order of the morning was to find a source of China Mobile topup cards for my prepay phone, then I relaxed for a couple of hours at the Starbucks next to the hotel with a Chai tea and free wifi - bliss! After lunch I met up with Asen, another Huawei engineer based in Xi'an, and we headed out for a walk around central Xi'an.

Xi'an has the most complete city wall in China, with eight and a half miles of fortifications forming a rectangle around the city center. Right now, the wall is decorated for Yuánxiāojié, or the Lantern Festival, and we walked about a mile and a half along the southern section, photographing the decorations. Coming down off the walls, we happened on a market stall selling chops (name stamps) and I had a 'monkey' (my birth year) chop carved with my 'Chinese name' - 潘德生. Heading north, we came to the Bell Tower, pretty much the center point of the city. ¥40 ($6) bought a ticket that also included admission to the nearby Drum Tower.

The interior of the Bell Tower houses an exhibition of ancient Chinese pottery showing an amazing level of artistry, while the exterior gives an excellent view of the city including the four gates in the city walls. The Drum Tower contains exhibitions of antique furniture and, not surprisingly, drums. Again, you can walk around the outside of the tower, this time gaining a view of the Muslim Hui quarter of Xi'an.

Street market stall

Leaving the Drum Tower, Asen and I entered the heart of the Muslim quarter, a bustling, colorful street market that seemed mainly focused on grilled beef and chicken kebabs, or chuànr. After a wander around, we chose a restaurant to sample some chuànr and pào mó, a tasty soup of cubed flatbread and beef, washed down with a little more Tsingtao.

I must admit, I didn't expect Xi'an to have so much to offer. I knew of the Terracotta Warriors, of course, but I was still surprised at the modest grandeur of central Xi'an. If I'm lucky enough to return, I plan to spend a couple of hours circumnavigating the city walls, this time with my 'proper' camera 🙂

12Mar/100

Bookmarks for March 11th 2010

These are my links for March 11th 2010:

27Feb/104

OpenSolaris 2009.06 as a domU guest on Xen 3.4/openSUSE 11.2

I recently trawled the web figuring out how to install a paravirtualized OpenSolaris 2009.06 on Xen. No one place had the whole story, so I'm blogging it here. I found a lot of the information spread across many other blog entries; some I figured out on my own. Thanks to all the giants on whose shoulders I am now standing.

The procedure:

  1. Download OpenSolaris 2009.06 ISO
  2. Mount the ISO somewhere
    pat-m6400:~ # mount -o loop,ro /vm/opensolaris/osol-0906-x86.iso /mnt
  3. Copy the kernel and rootfs somewhere convenient
    pat-m6400:~ # cp /mnt/platform/i86xpv/kernel/amd64/unix /vm/opensolaris
    pat-m6400:~ # cp /mnt/boot/amd64/x86.microroot /vm/opensolaris
  4. Create a disk image for your root filesystem
    pat-m6400:~ # dd if=/dev/zero of=/vm/opensolaris/root.img bs=1G count=10
    10+0 records in
    10+0 records out
    10737418240 bytes (11 GB) copied, 127.888 s, 84.0 MB/s
  5. Create a Xen config file (let's call it /vm/opensolaris/opensolaris-install.cfg) with the following content:
    name = "opensolaris"
    vcpus = 1
    memory = 1024
    kernel = "/vm/opensolaris/unix"
    ramdisk = "/vm/opensolaris/x86.microroot"
    extra = "/platform/i86xpv/kernel/amd64/unix -B console=ttya"
    disk = ['file:/vm/opensolaris/osol-0906-x86.iso,6:cdrom,r', 'file:/vm/opensolaris/root.img,0,w']
    vif = ['bridge=br0']
    on_shutdown = "destroy"
    on_reboot = "destroy"
    on_crash = "destroy"
  6. Now start your VM:
    pat-m6400:~ # xm create -c /vm/opensolaris/opensolaris-install.cfg
  7. You should see something like:
    Using config file "./opensolaris-install.cfg".
    Started domain opensolaris (id=21)
    
    v3.4.1_19718_04-2.1 chgset '19718'
    SunOS Release 5.11 Version snv_111b 64-bit
    Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
    Use is subject to license terms.
    Hostname: opensolaris
    Remounting root read/write
    Probing for device nodes ...
    Preparing live image for use
    Done mounting Live image
    USB keyboard
    1. Albanian                      23. Lithuanian
    2. Belarusian                    24. Latvian
    3. Belgian                       25. Macedonian
    4. Brazilian                     26. Malta_UK
    5. Bulgarian                     27. Malta_US
    6. Canadian-Bilingual            28. Norwegian
    7. Croatian                      29. Polish
    8. Czech                         30. Portuguese
    9. Danish                        31. Russian
    10. Dutch                         32. Serbia-And-Montenegro
    11. Finnish                       33. Slovenian
    12. French                        34. Slovakian
    13. French-Canadian               35. Spanish
    14. Hungarian                     36. Swedish
    15. German                        37. Swiss-French
    16. Greek                         38. Swiss-German
    17. Icelandic                     39. Traditional-Chinese
    18. Italian                       40. TurkishQ
    19. Japanese-type6                41. TurkishF
    20. Japanese                      42. UK-English
    21. Korean                        43. US-English
    22. Latin-American
    To select the keyboard layout, enter a number [default 43]:
  8. Press enter to select the default...
    1. Arabic
    2. Chinese - Simplified
    3. Chinese - Traditional
    4. Czech
    5. Dutch
    6. English
    7. French
    8. German
    9. Greek
    10. Hebrew
    11. Hungarian
    12. Indonesian
    13. Italian
    14. Japanese
    15. Korean
    16. Polish
    17. Portuguese - Brazil
    18. Russian
    19. Slovak
    20. Spanish
    21. Swedish
    To select desktop language, enter a number [default is 6]:
  9. Press enter again...
    User selected: English
    Configuring devices.
    Mounting cdroms
    Reading ZFS config: done.
    
    opensolaris console login:
  10. Now login with jack/jack
    opensolaris console login: jack
    Password:
    Sun Microsystems Inc.   SunOS 5.11      snv_111b        November 2008
    jack@opensolaris:~$
    
  11. And su with the password opensolaris
    jack@opensolaris:~$ su
    Password:
    Feb  5 20:29:29 opensolaris su: 'su root' succeeded for jack on /dev/console
  12. Now do ifconfig -a to discover your IP address. You might have to try a few times since it seems to take a minute or two to get an IP:
    jack@opensolaris:~# ifconfig -a
    lo0: flags=2001000849 mtu 8232 index 1
    inet 127.0.0.1 netmask ff000000
    xnf0: flags=1004843 mtu 1500 index 2
    inet 192.168.69.124 netmask ffffff00 broadcast 192.168.69.255
    ether 0:16:3e:79:d:ba
    lo0: flags=2002000849 mtu 8252 index 1
    inet6 ::1/128
    xnf0: flags=2000841 mtu 1500 index 2
    inet6 fe80::216:3eff:fe79:dba/10
    ether 0:16:3e:79:d:ba
  13. Now go to a dom0 shell and find the domain id:
    pat-m6400:~ # domid=`xm domid opensolaris`
    pat-m6400:~ # echo $domid
    21
    
  14. Use xenstore-read to find the vnc port and password:
    pat-m6400:~ # xenstore-read /local/domain/$domid/guest/vnc/port
    5900
    pat-m6400:~ # xenstore-read /local/domain/$domid/guest/vnc/passwd
    5PaJpX6n
    

    Supposedly you can also discover the IP address this way, but I've never seen it:

    pat-m6400:~ # xenstore-read /local/domain/$domid/ipaddr/0
    xenstore-read: couldn't read path /local/domain/21/ipaddr/0
  15. Now you can VNC to the OpenSolaris installer - use the port and password you just discovered. Note the double colon (::) to use port number rather than display number:

    pat-m6400:~ # vncviewer 192.168.69.124::5900
    Connected to RFB server, using protocol version 3.8
    Performing standard VNC authentication
    Password:
    Authentication successful
    [...]
  16. You should see the OpenSolaris installer - hurrah! Go through the install process, click 'restart' and the domain should shutdown.
  17. Once it is down (you can check with xm list), create another config file - opensolaris.cfg
    name = "opensolaris"
    vcpus = 1
    memory = 1024
    bootloader = "/usr/bin/pygrub"
    disk = ['file:/vm/opensolaris/root.img,0,w']
    vif = ['bridge=br0']
    on_shutdown = "destroy"
    on_reboot = "destroy"
    on_crash = "destroy"
  18. Now you can create the VM again using the new config
    xm create -c /vm/opensolaris/opensolaris.cfg
  19. If all is well, you should now be the proud owner of an OpenSolaris domU 🙂
  20. Now, log in as the user you specified in the install, su - to root and find the IP address.
    pat@opensolaris:~$ su -
    Password:
    root@opensolaris:~# ifconfig xnf0
    xnf0: flags=1004843 mtu 1500 index 2
    inet 192.168.69.128 netmask ffffff00 broadcast 192.168.69.255
    ether 0:16:3e:5d:6:60

That's the basic install done. You have a couple of options at this point depending on whether you want to be able to VNC in for the full OpenSolaris desktop experience, and whether you want a static IP address.

For the OpenSolaris desktop:

  1. Set X11-server to listen to the tcp port
    root@opensolaris:~# svccfg -s x11-server
    svc:/application/x11/x11-server> setprop options/tcp_listen = boolean: true
    svc:/application/x11/x11-server> quit
  2. I disabled idletimeout on the VNC server, so that I don't lose the desktop over my lunch break!
    root@opensolaris:~# svccfg -s xvnc-inetd
    svc:/application/x11/xvnc-inetd> setprop inetd_start/exec = astring: "/usr/X11/bin/Xvnc -inetd -query localhost -once securitytypes=none -IdleTimeout 0"
    svc:/application/x11/xvnc-inetd> quit
  3. Enable XDMCP for GDM
    root@opensolaris:~# printf '[xdmcp]\nEnable=true\n' >>/etc/X11/gdm/custom.conf
    root@opensolaris:~# svcadm restart gdm
  4. Make sure GDM runs on startup
    root@opensolaris:~# svcadm enable -s gdm
  5. Turn on xvnc-inetd services
    root@opensolaris:~# svcadm enable xvnc-inetd
  6. Now just connect from dom0:
    pat-m6400:~ # vncviewer 192.168.69.128
    

    And you should be in GNOME desktop wonderland 🙂

  7. If you want to continue to use DHCP, on subsequent boots, just run nmap on dom0 to find your IP address:
    pat-m6400:~ # nmap -sP 192.168.69.0/24
    Starting Nmap 5.00 ( http://nmap.org ) at 2010-02-05 23:15 PST
    Host 192.168.69.1 is up (0.00056s latency).
    [...]
    Host 192.168.69.128 is up (0.0017s latency).
    Nmap done: 256 IP addresses (9 hosts up) scanned in 2.48 seconds

As an alternative to getting a VNC session, you can do

ssh -X 192.168.69.128

(or whatever) and then (at the OpenSolaris prompt) you can do

pat@opensolaris:~$ some-gui-program &

to have the program run on the dom0 desktop. Cool 🙂

To configure OpenSolaris to use a static IP address:

root@opensolaris:~# svcadm disable network/physical:nwam
root@opensolaris:~# svcadm enable  network/physical:default
root@opensolaris:~# ifconfig xnf0 down
root@opensolaris:~# ifconfig xnf0 192.168.69.25 netmask 255.255.255.0
root@opensolaris:~# ifconfig xnf0 up
root@opensolaris:~# route add default 192.168.69.1
root@opensolaris:~# echo 192.168.69.25 netmask 255.255.255.0 > /etc/hostname.xnf0
root@opensolaris:~# echo 192.168.69.1 > /etc/defaultrouter

So there you have it - OpenSolaris 2009.06 happily running as a Xen domU. If you have any comments/corrections, please post them and I'll update this entry as appropriate.

22Feb/100

The ForgeRock OpenSSO Roadshow comes to North America!

My friends at ForgeRock are bringing their series of OpenSSO user group meetings to the USA and Canada in late March/early April 2010. If you're interested in where they're taking open source identity, you should definitely take this opportunity to participate in one of the meetings - choose from New York (3/29), Toronto (3/30), Chicago (3/31) or San Francisco (4/1). I'll likely take the drive up 280 to the SF event on April 1st - see you there!

18Nov/093

OpenSSO User Group Meetings in Northern Europe – Nov/Dec 2009

Although I'm no longer as active in the OpenSSO community as I once was, some things still catch my eye - for example, news of a series of user group meetings across Northern Europe in late November and early December. OpenSSO experts Allan Foster, Jonathan Scudder, Steve Ferris and Victor Ake (not a blogger amongst them!?!?) will be presenting on OpenSSO-related topics ranging from monitoring to the Fedlet, via entitlements and OAuth, in Helsinki, Stockholm, Copenhagen, Oslo, London and Brussels. Seems like SupportRock might be a name to watch in the world of OpenSSO...