vmdksync helps you escape from VMware

Posted: Sun, 13 May 2012 | permalink | 5 Comments

When I wrote lvmsync late last year, I didn’t realise I was being typecast. Before too long, I realised that the logic that I’d implemented for lvmsync would also help me with a separate migration project I’d been dreading – getting the day job off VMware.

Back in the early days of virtualisation, management made the decision to run VMware, for all the usual reasons (“commercially supported!”, “industry standard!”, and so on). Unsurprisingly (to me, anyway) it didn’t take too long for management to realise that it wasn’t the best choice for us. When you’ve got umpty-billion dollars to spend on hardware, software, and support, VMware might be the right option (although Amazon doesn’t seem to think so). Anchor’s company culture, on the other hand, is built around “smart staff, simple systems” over “dumb staff, smart vendors”, because no vendor is ever going to care about our customers as much as we do. So VMware was never going to work for us.

Unfortunately, as happens all too often, once VMware was in place, there was very little motivation to get rid of it and move those customers onto the chosen replacement (that we were deploying all new customers on). I happen to think this is a terrible attitude in general – one that makes life so much harder in the long term. I believe strongly in retrofitting old systems to keep them up-to-date with the current state of the art, and keeping technical debt under control. But, I wasn’t running the show back when we stopped putting new customers on VMware, so the few VMware servers we had stayed around far longer than they should have.

Recently, though, bad things started to happen. The VMware servers were starting to fall apart. The Windows machine we had to keep around to use the VMware management console started crapping out, and when the choice was between doing unspeakable things to Windows, and just ditching VMware… well, it wasn’t much of a choice. The only remaining question was how to do the migration off VMware with the least amount of downtime to our customers.

I was really quite surprised that nobody out in Internet land appeared to have come up with a simple, robust tool to do this. Sure, some vendors had all-singing, all-dancing toolkits that cost ridiculous amounts of money, required you to install their agent on the machine involved, and promised the earth, but it all smelt of snakeoil and bullshit.

In true hacker style, then, I decided to write something myself. The model I came up with mirrored lvmsync’s quite closely – because that one worked, and it turned out to be surprisingly easy to implement once I managed to reverse-engineer the file format (VMware has a PDF spec of a bunch of its file formats, but whoever wrote it was enough of an evil genius to make it utterly incomprehensible to anyone who doesn’t already know the file format, whilst making perfect sense to anyone who already does).

The result: vmdksync. It is nothing but 80-odd lines of Ruby whose sole purpose is to take a delta.vmdk file and write the changes stored in it onto a file or block device holding a copy of the flat.vmdk file. You can take that copy while the VM is still running (after you’ve made a snapshot, of course). It helped me provide a painless migration path away from VMware, and I’d be really pleased if it helped some other people do the same. Share and enjoy!


The Other Way...

Posted: Sun, 25 December 2011 | permalink | 6 Comments

Chris Siebenmann sez:

The profusion of network cables strung through doorways here demonstrates that two drops per sysadmin isn’t anywhere near enough.

What I actually suspect it demonstrates is that Chris’ company hasn’t learnt about the magic that is VLANs. All of the reasons he cites in the longer, explanatory blog post could be solved with VLANs. The only time you can’t get away with one gigabit drop per office and an 8 port VLAN-capable switch is when you need high capacity, and given how many companies struggle by with wifi, I’m going to guess that sustained gigabit-per-machine is not a common requirement.

So, for Christmas, buy your colleagues a bunch of gigabit VLAN-capable switches, and you can avoid both the nightmare of not having enough network ports, and the more hideous tragedy of having to crawl around the roofspace and recable an entire office.


Rethtool: How I Learned to Stop Worrying and Love the ioctl

Posted: Sat, 17 December 2011 | permalink | 2 Comments

Damn those unshaven yaks

I’m trying to write a Nagios plugin for work that will comprehensively monitor network interfaces and make sure they’re up, passing traffic, all those sorts of things. Of course, I’m doing it all in Ruby, because that’s how I roll.

So, I need to Know Things about the interface. Everyone does that with ethtool. Right? Sure, if your eyeballs are parsing it. But have you ever tried to machine parse it? To put it as eloquently as possible:

# ethtool eth0
Settings for eth0:
 Supported ports: [ TP MII ]
 Supported link modes:   10baseT/Half 10baseT/Full 
                         100baseT/Half 100baseT/Full 
                         1000baseT/Half 1000baseT/Full 
 Supports auto-negotiation: Yes
 Advertised link modes:  10baseT/Half 10baseT/Full 
                         100baseT/Half 100baseT/Full 
                         1000baseT/Half 1000baseT/Full 
 Advertised pause frame use: No
 Advertised auto-negotiation: Yes
 Link partner advertised link modes:  10baseT/Half 10baseT/Full 
                                      100baseT/Half 100baseT/Full 
                                      1000baseT/Half 1000baseT/Full 
 Link partner advertised pause frame use: No
 Link partner advertised auto-negotiation: Yes
 Speed: 1000Mb/s
 Duplex: Full
 Port: MII
 PHYAD: 0
 Transceiver: internal
 Auto-negotiation: on
 Supports Wake-on: pumbg
 Wake-on: g
 Current message level: 0x00000033 (51)
 Link detected: yes

Parse that, bitch!

Or… perhaps not.

At any rate, I decided that it would be most advantageous if I went straight to the source and twiddled the ioctl until it did my bidding.

And thus, about 5 hours later, was Rethtool born.

Once I worked out a less-than-entirely-crackful way of dealing with C structs in Ruby (after a bit of digging around, I went with the appallingly-undocumented-but-sufficiently-featureful CStruct), and after I finally worked out I was passing the wrong damned struct to ioctl(SIOCETHTOOL) (speaking of appallingly-undocumented: fuck you, ioctl, and all your twisty-passages children), it was smooth sailing.

So, if you’re one of the eight or so people on earth who will ever need to get at the grubby internals of your network interfaces using Ruby (and can’t do it via some sysfs magic), Rethtool is for you.
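For the simple cases, the “sysfs magic” can be as little as reading a few files. This is a rough sketch, not anything taken from Rethtool: it assumes the usual Linux attribute files under /sys/class/net/<interface>/ (operstate, carrier, speed, duplex), and the function name and the sysfs_root parameter are made up for illustration (the latter only exists so you can point it at a fake tree for testing).

```ruby
# Read a few link attributes for an interface straight from sysfs.
# link_info and sysfs_root are hypothetical names for this sketch.
def link_info(iface, sysfs_root = "/sys/class/net")
  dir = File.join(sysfs_root, iface)
  raise ArgumentError, "no such interface: #{iface}" unless File.directory?(dir)

  # Some attributes can't be read in some states (a downed link may
  # refuse to report its speed, for example), so treat any read error
  # as "unknown" rather than blowing up.
  read = lambda do |attr|
    begin
      File.read(File.join(dir, attr)).strip
    rescue SystemCallError
      nil
    end
  end

  speed = read.call("speed")
  {
    operstate:  read.call("operstate"),
    carrier:    read.call("carrier") == "1",
    speed_mbps: speed && speed.to_i,
    duplex:     read.call("duplex"),
  }
end
```

Calling link_info("eth0") on a Linux box should hand back a hash you can make monitoring decisions on. Anything sysfs doesn’t expose is where you end up back at the ioctl.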


Misleading error messages from blktrace

Posted: Sat, 12 November 2011 | permalink | No comments

If you ever get an error message from the blktrace tool that looks like this:

BLKTRACESETUP(2) /dev/dm-0 failed: 2/No such file or directory
Thread 3 failed open /sys/kernel/debug/block/(null)/trace3: 2/No such file or directory
Thread 2 failed open /sys/kernel/debug/block/(null)/trace2: 2/No such file or directory
Thread 0 failed open /sys/kernel/debug/block/(null)/trace0: 2/No such file or directory
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
FAILED to start thread on CPU 0: 1/Operation not permitted
FAILED to start thread on CPU 1: 1/Operation not permitted
FAILED to start thread on CPU 2: 1/Operation not permitted
FAILED to start thread on CPU 3: 1/Operation not permitted

Don’t be alarmed – your disk hasn’t suddenly disappeared out from underneath you. In fact, it means quite the opposite of what “No such file or directory” might imply: there is already a blktrace of that particular block device in progress, and you’ll need to kill that one off before you can start another one.

Thank $DEITY for the kernel source code – it was the only hope I had of diagnosing this particular nit before I went completely bananas and smashed my keyboard into small pieces.


rsync for LVM-managed block devices

Posted: Fri, 28 October 2011 | permalink | 12 Comments

If you’ve ever had to migrate a service to a new machine, you’ve probably found rsync to be a godsend. Its ability to pre-sync most data while the service is still running, then perform the much quicker “sync the new changes” action after the service has been taken down is fantastic.

For a long time, I’ve wanted a similar tool for block devices. I’ve managed ridiculous numbers of VMs in my time, almost all stored in LVM logical volumes, and migrating them between machines is a downtime hassle. You need to shut down the VM, do a massive dd | netcat, and then bring the machine back up. For a large disk, even over a fast local network, this can be quite an extended period of downtime.

The naive implementation of a tool that was capable of doing a block-device rsync would be to checksum the contents of the device, possibly in blocks, and transfer only the blocks that have changed. Unfortunately, as network speeds approach disk I/O speeds, this becomes a pointless operation. Scanning 200GB of data and checksumming it still takes a fair amount of time – in fact, it’s often nearly as quick to just send all the data as it is to checksum it and then send the differences.[1]
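To make the cost of the naive approach concrete, here is roughly what it looks like. This is only an illustrative sketch: changed_blocks and BLOCK_SIZE are names invented here, the block size is arbitrary, and it compares two local files rather than talking to a remote machine (a real tool would checksum on each side and swap digests over the network).

```ruby
require "digest"

BLOCK_SIZE = 4 * 1024 * 1024  # arbitrary chunk size for this sketch

# Return the indexes of the blocks whose checksums differ between two
# files (or block devices). Both sides get read in full, which is
# exactly the expense described above.
def changed_blocks(src_path, dst_path, block_size = BLOCK_SIZE)
  changed = []
  File.open(src_path, "rb") do |src|
    File.open(dst_path, "rb") do |dst|
      index = 0
      loop do
        a = src.read(block_size)
        b = dst.read(block_size)
        break if a.nil? && b.nil?  # both at end of file

        # A block present on only one side counts as changed.
        if a.nil? || b.nil? || Digest::SHA256.digest(a) != Digest::SHA256.digest(b)
          changed << index
        end
        index += 1
      end
    end
  end
  changed
end
```

Every byte on both ends goes through a read and a hash before a single changed block is sent, so the scan itself dominates the run time on a big volume.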

No, a different approach is needed for block devices. We need something that keeps track of the blocks on disk that have changed since our initial sync, so that we can just transfer those changed blocks.

As it turns out, keeping track of changed blocks is exactly what LVM snapshots do. They actually keep a copy of what was in the blocks before they changed, but we’re not interested in that so much. No, what we want is the list of changed blocks, which is stored in a hash table on disk.

All that was missing was a tool that read this hash table to get the list of blocks that had changed, then sent them over a network to another program that was listening for the changes and could write them into the right places on the destination.

That tool now exists, and is called lvmsync. It is a slightly crufty chunk of Ruby that, when given a local LV and a remote machine and block device, reads the snapshot metadata and transfers the changed blocks over an SSH connection it sets up.
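The “write them into the right places” half of the job is the easy bit, and can be sketched in a few lines. To be clear, this is not lvmsync’s code: copy_blocks is a made-up name, the list of changed block indexes is simply handed in (lvmsync gets it from the snapshot metadata), and both ends are local files rather than opposite ends of an SSH connection.

```ruby
# Copy only the listed blocks from src to dst, writing each one at the
# same offset it was read from. Everything else in dst is left alone.
# (Hypothetical helper for illustration; not part of lvmsync.)
def copy_blocks(src_path, dst_path, block_indexes, block_size)
  File.open(src_path, "rb") do |src|
    File.open(dst_path, "r+b") do |dst|  # r+b: update in place, no truncation
      block_indexes.each do |i|
        offset = i * block_size
        src.seek(offset)
        data = src.read(block_size)
        next if data.nil?  # index is past the end of the source

        dst.seek(offset)
        dst.write(data)
      end
    end
  end
end
```

The amount of data moved is proportional to what changed since the initial sync, not to the size of the volume, which is the whole point.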

Be warned: at present, it’s a pretty raw piece of code. It does nothing but the “send updated blocks over the network” part, so you have to deal with the snapshot creation, initial sync, and so on yourself. As time goes on, I’m hoping to polish it and turn it into something Very Awesome. “Patches Accepted”, as the saying goes.

  1. rsync avoids a full-disk checksum because it cheats and uses file metadata (the last-modified time, or mtime of a file) to choose which files can be ignored. No such metadata is available for block devices (in the general case).


UPSes in Datacentres

Posted: Tue, 23 August 2011 | permalink | 3 Comments

(This was going to be a comment on this blog post, but it’s a Turdpress site that wants JS and cookies to comment. Bugger that for a game of skittles.)

Rimuhosting’s recent extended outage due to power problems was apparently caused by a transfer switch failure at their colo provider. This has led people to wonder if putting UPSes in individual racks is a wise move. The theory is that in the event of a small outage, the UPS can keep things humming, and in an extended outage you can gracefully shut things down rather than having a hard thump.

I happen to think this theory is bunkum. Your UPS is a newly instituted single point of failure. I’d be willing to bet that the cost of purchasing, installing, and maintaining the UPSes, as well as the cost of the outages that inevitably result from their occasional failure, would be far greater than the cost of the occasional power outage you get in a well-managed facility.

Good facilities don’t have small outages. They don’t have squirrels in the roof cavities, and they don’t have people dropping spanners across busbars. The only outages they have are the big ones, when some piece of overengineered equipment turns out to be not so overengineered – the multi-hour (or multi-day) ones where your UPS isn’t going to stop you from going down. Your SLA credit and customer goodwill are already toast, so all you’re saving is the incremental cost of a little bit more downtime while you get fscks run.

If you want the best possible power reliability, get yourself into a really well engineered facility, and run dual-power on everything. Definitely run the numbers before you go down the UPS road; I’ll bet you find they’re not worth it.


Oh HP, you Bucket of Fail

Posted: Tue, 23 August 2011 | permalink | 9 Comments

I recently got given a new printer, a HP LaserJet “Professional”[1] P1102w. It’s fairly loudly touted on HP’s website that this printer has “Full” support under Linux.

And yet, it won’t work with my Linux-based print server. Why? Because it uses a proprietary driver plugin, and that plugin is only available for x86 and amd64, and my print server is ARM-based. Well done, HP. You’ve managed to revive the old “all the world’s a VAX” philosophy, on an OS that is more than capable of running on practically anything. You got that for free. Why do you insist on screwing with it?

As an added bonus, when I try to “Ask a Question” on the HPLIP website, to politely (ha!) inquire as to the possibility of an ARM binary, I get sent to Launchpad, which does nothing more than tell me that there is an “Invalid OpenID transaction”. That’s the entire content of the page. Useful.

Lies, damned lies, and a double helping of proprietary software fail. My day is complete.

  1. I use scarequotes around “Professional” because, as far as I can tell, this is just an entry-level personal laser printer. There is nothing particularly professional about it.


Unintended Consequences: Why Evidence Matters

Posted: Sun, 21 August 2011 | permalink | 1 Comment

If you were trying to get rid of hiring discrimination (on grounds irrelevant to the ability to do the job), you’d think a good way to do it would be to reduce the ability of the hiring manager to discriminate, by restricting their access to irrelevant (but possibly prejudicial) information. It’s certainly what I might come up with as an early idea in a brainstorming session.

I’m not alone: France had this same idea, and gave it a go, by passing a law requiring companies to anonymise resumes before they got to any decision makers.

So far, so average. But rather than just coming up with an idea and inflicting it on everyone by a blanket law, they did what should be done with all new ideas: they trialled it (with 50 large corporations, according to the report) before making it universal, to make sure that the theory matched reality. Then, after giving it a good shake, they examined the evidence, and found that the idea had some unintended consequences:

Applicants with foreign names, or who lived in under privileged areas were found to be less likely to be called in for an interview without the listing of their name and address. Researchers reasoned that this was because employers and recruiters made allowances for subpar presentation or limited French speaking if their performance could be explained by deprivation or foreign birth.

The icing on the cake is that, now the evidence is in, they’re planning on making it “optional” (I’m not sure how that’s different from killing it entirely, but I guess it comes to the same thing in the end).

So we’ve got the quinella of decision-making awesome:

Far too often, we get far too attached to our ideas, and don’t let them go when reality doesn’t fit our preconceptions. Kudos to the people involved in this idea for not letting their egos get in the way of good government. Let it be an object lesson for us all.


Stream of Consciousness

Posted: Fri, 19 August 2011 | permalink | No comments

This forum post on requiring formal letters of resignation made me smile:

HR does silly stuff like this all the time. Somebody’s following some policy that was created because somebody verbally resigned nine years ago and then wanted to come back and some executive said where’s their letter and HR said we don’t have one and the exec said that’s not good and we oughta not be doing stuff to help people leave unless they’re really leaving and HR said okay we’ll have a policy and the exec said that’s good.

And the exec’s not there anymore.

I’ll leave everyone to draw their own conclusions as to why I was reading that particular thread.


Using a Local Root Zone with djbdns

Posted: Sun, 7 August 2011 | permalink | No comments

In my continuing war on the effects of craptastic mobile Internet connectivity, I came across a suggestion to host a local copy of the root zone alongside your local DNS resolver. It’s an interesting idea, so I’ve decided to give it a go, despite the potential problems (I’m confident I can manage the risks).

I was surprised to find that nobody had a guide on setting this up using djbdns[1] so… I’ve written one.

If you’re thinking of doing this yourself, heed some words of caution: It is imperative that you keep your local cache up to date. If you set this up, and don’t maintain it, you will have a slow, gradual degradation of Internet service as the live root zone diverges from your local, out-of-date cache.

If you set this up locally, just for yourself, that’s one thing; all you’re doing is breaking your own machine. If you want to do this for the ISP you run, though, you’re doing your customers a grave disservice if you don’t automate the cache update, and set up some means of monitoring that your cache is kept up to date (an SOA check against the live roots, or at least a check to make sure that your data.cdb file is no more than a couple of days old).
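The “no more than a couple of days old” check is the cheaper of the two to knock up. Here is one way it might look, in the Nagios plugin style of returning an exit code and a one-line message. The function name, the MAX_AGE constant, and the exact two-day threshold are all my own choices for illustration.

```ruby
MAX_AGE = 2 * 24 * 60 * 60  # "a couple of days", in seconds

# Returns [exit_code, message], using the Nagios convention of
# 0 for OK and 2 for CRITICAL. "now" is a parameter purely so the
# check can be tested without waiting two days.
def check_cdb_age(path, max_age = MAX_AGE, now = Time.now)
  return [2, "CRITICAL: #{path} does not exist"] unless File.exist?(path)

  age = now - File.mtime(path)
  hours = (age / 3600).round
  if age > max_age
    [2, "CRITICAL: #{path} is #{hours} hours old"]
  else
    [0, "OK: #{path} is #{hours} hours old"]
  end
end
```

Wire it up by calling check_cdb_age with the path to your data.cdb, printing the message, and exiting with the code. It only tells you the refresh job is still running, not that the data matches the live roots, so the SOA comparison is still the better check if you can be bothered.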

The Design

For simplicity, I decided to run a dedicated tinydns instance that only serves the root zone. This makes it easy to periodically refresh the root zone that I serve with a script, which I run daily, without needing to integrate with the database of any other tinydns instances I’ve got running (I have a couple on my laptop for testing). I’ve set this up on an arbitrary loopback address (127.53.53.53), so it’s inaccessible from anywhere other than localhost, and so my local dnscache instance just forwards root zone requests to it.

Set up the infrastructure

You’ve now got a minimal tinydns suitable for serving a local cache of the root zone to anyone on your local machine who asks. But where’s the data?

Script the root zone processing

The following script should do the job nicely. Drop it somewhere useful and chmod a+x it. If you put your tinydns somewhere else, change the TINYDNS_DATA variable at the top.

Run it once by hand to “seed” your root cache, then add it to cron for a nightly update.

#!/bin/sh

set -e

TINYDNS_DATA="/etc/service/tinydns-root/root/data.cdb"

###########################################################################

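WORKDIR="$(mktemp -d)"
# Single quotes so $WORKDIR is expanded (and quoted) when the trap fires
trap 'rm -rf "$WORKDIR"' EXIT

cd "$WORKDIR"

# Fetch the current root zone and its detached signature
wget -q http://www.internic.net/domain/root.zone.gz
wget -q http://www.internic.net/domain/root.zone.gz.sig

# Refuse to go any further if the signature doesn't check out
if ! gpgv root.zone.gz.sig root.zone.gz >/dev/null 2>&1; then
        echo "Root zone signature validation failed -- this is probably really bad" >&2
        exit 1
fi

gzip -d root.zone.gz

# Drop the DNSSEC-related records, then convert the rest to tinydns-data format
grep -Ev '[[:space:]]IN[[:space:]]+(RRSIG|DNSKEY|DS|NSEC)[[:space:]]' root.zone \
     | /usr/local/bin/bind-to-tinydns . data btttmp

# Compile ./data into ./data.cdb
tinydns-data

# Copy next to the live file, then rename, so tinydns never sees a half-written cdb
cp data.cdb "${TINYDNS_DATA}.new"
mv "${TINYDNS_DATA}.new" "${TINYDNS_DATA}"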

Test

The simplest test, to make sure you’ve got everything running, is just to request something from the root zone:

dig @127.53.53.53 com IN NS

If you get something useful (compare against dig com IN NS for a sanity check) then everything’s probably working well.

Point dnscache to your local root server

echo 127.53.53.53 >/etc/service/dnscache/root/servers/@
svc -k /etc/service/dnscache

And you’re away.

  1. For all its oddities, it’s a very tidy piece of software, and takes up so few resources on a modern system that its presence is practically invisible – it uses less memory than init.