System Failure and Recovery Practice
by Jeff Dike
User-Mode Linux (UML) is a Linux virtual machine running on Linux that allows you to boot Linux on a "software" machine. These virtual machines can be easily created and destroyed, and allow you to do virtually anything that can be done with a physical system. Because of this, UML has turned out to have a wide variety of uses. In this article, I will talk about an application that has not received anywhere near the attention I think it deserves.
UML virtual machines are nearly identical to physical machines in their behavior, except that they are far more convenient to configure and boot. This makes them ideal for system administrator training and practice. In particular, they are very well-suited for creating admin disasters in order to practice recovering from them. I will be describing the creation of and recovery from three disasters, plus the creation (but not recovery) of a fourth.
To get started, you will need to download UML and install it. Go to http://user-mode-linux.sourceforge.net/dl-sf.html and grab and install either the UML RPM or deb, whichever is appropriate for your system. These will install UML itself, plus a number of utilities.
You will also need a filesystem image to boot UML on. These are available from the same page. I will be using the Debian root filesystem in the examples below. If you are too short of bandwidth to download that one, get the tomsrtbt filesystem instead.
Where have all the files gone?
To help you get used to using UML, I'll start off with a special introductory disaster which I'll make no attempt to recover from. Even if you are an experienced UML user, you'll probably want to follow along because we're going to do something that you've always wanted to do anyway.
We're going to do a rm -rf / just to see what happens.
So, start UML as follows:
% linux ubd0=cow,root_fs
This tells UML to boot from the root_fs file with the file cow as a copy-on-write (COW) layer above it. The file name cow is arbitrary, and the file is created automatically, so you can change the name as long as you are consistent about it. You'll see the utility of this a bit later. After you uncompress it, your root filesystem will likely have a name other than root_fs. You can either rename it to root_fs to follow the instructions below verbatim or replace root_fs everywhere with the actual name.
As it boots, take note of a line in the console output that looks like this:
mconsole initialized on /tmp/uml/d4oIw6/mconsole
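If you happen to have the boot messages saved in a file, the directory name can be pulled out mechanically instead of being copied by eye. The following is a minimal sketch and not part of the UML tools: the function name uml_id_from_log and the file name boot.log are made up for illustration, and the pattern assumes the line looks exactly like the one shown above.

```shell
# Print the mconsole directory name (for example d4oIw6) found in UML
# boot output. Reads the boot messages on stdin and prints nothing if
# no matching line is present.
# Assumption: the line has the form
#   mconsole initialized on <some path>/<name>/mconsole
uml_id_from_log() {
    sed -n 's|^mconsole initialized on .*/\([^/]*\)/mconsole$|\1|p'
}
```

With the line above stored in boot.log, running uml_id_from_log < boot.log would print d4oIw6.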
Now, when it comes up and gives you a login prompt, log in as root (password "root"), and do the following:
usermode:~# cd /
usermode:/# rm -rf /
Let it crank for a while until things break horribly. With the Debian filesystem from the UML site, I ultimately get this:
rm: cannot remove directory '//dev/pty': Directory not empty
rm: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following two directories have the same inode number:
//dev
//dev/pts
If you're the morbid type, you might poke around to see what, if anything, you can still do. You'll need the bash built-ins because your favorite utilities are likely to be gone.
When you've had enough of this trashed system, you'll need to shut it down cleanly. Since halt won't work, the best way is to use the uml_mconsole utility to halt it. On the host, run uml_mconsole, giving it the directory name that you took careful note of when it was booting, and tell it to halt UML:
% uml_mconsole d4oIw6
(d4oIw6) halt
OK
Now, you get to see why we used the COW file. The damage to the
filesystem is contained entirely within the COW file. The underlying
root_fs file is completely untouched. To see this, you can throw
out the COW file:
% rm cow
and boot UML just as you did before.
% linux ubd0=cow,root_fs
You'll see that it boots fine, and that the filesystem is intact. We'll be using this technique to create disasters without irreversibly damaging the real filesystem.
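Since every disaster in this article starts by throwing away the COW file, that step can be wrapped in a small helper. This is only a sketch under stated assumptions: reset_cow is a hypothetical name, and the check on the backing file is an extra safety measure of mine, not something UML requires.

```shell
# Discard a COW layer so the next boot starts from the pristine
# backing file.
# Usage: reset_cow COW_FILE BACKING_FILE
reset_cow() {
    cow=$1
    backing=$2
    # Refuse to continue if the backing file is missing: once the COW
    # file is gone there would be nothing left to boot from.
    if [ ! -f "$backing" ]; then
        echo "backing file '$backing' not found" >&2
        return 1
    fi
    # -f keeps rm quiet when the COW file does not exist yet.
    rm -f -- "$cow"
}
```

With the file names used here, reset_cow cow root_fs followed by linux ubd0=cow,root_fs gives you a fresh system to break.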
The case of the missing password file
Now, we'll create a relatively simple disaster and recover from it.
% rm cow
% linux ubd0=cow,root_fs
Now, remove the password file and try to halt the machine:
usermode:~# rm /etc/passwd
usermode:~# halt
You don't exist. Go away.
halt doesn't work any more, so we'll shut it down from the mconsole:
% uml_mconsole zJwanV
(zJwanV) sysrq u
OK
(zJwanV) halt
OK
sysrq u flushes the filesystems to disk and remounts them read-only. This will save us an fsck on the next boot. Boot it again, this time specifying only the cow file on the command line:
% linux ubd0=cow
Now, we see how well Linux works without a password file:
Debian GNU/Linux 2.2 usermode ttys/0
usermode login: root
Password:
Login incorrect
It boots fine, but it's (surprise!) impossible to log in. So, let's
shut this down from the
mconsole again and fix it:
% uml_mconsole b9cpus
(b9cpus) sysrq u
OK
(b9cpus) halt
OK
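The same two requests, sysrq u followed by halt, are needed every time a broken instance has to be stopped, so they can be scripted. The sketch below assumes that your uml_mconsole accepts a single request as extra command-line arguments instead of reading it at its prompt; check the version you installed before relying on that. The function name uml_clean_halt and the MCONSOLE override are hypothetical.

```shell
# Cleanly stop a UML instance: remount its filesystems read-only
# (sysrq u), then halt it.
# Usage: uml_clean_halt ID
# Assumption: uml_mconsole takes one request as trailing arguments.
# Set MCONSOLE=echo to print the commands instead of running them.
uml_clean_halt() {
    id=$1
    mconsole=${MCONSOLE:-uml_mconsole}
    # Only halt if the remount request was accepted.
    "$mconsole" "$id" sysrq u && "$mconsole" "$id" halt
}
```

A dry run such as MCONSOLE=echo uml_clean_halt b9cpus shows the two commands that would be issued without touching a running instance.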
We'll boot up only to single-user, and recreate enough of the password file so that root can log in:
% linux ubd0=cow single
Distributions differ on their interpretation of single. If you don't get a shell with single, then try emergency instead. On my Debian filesystem, both give me a shell.
/etc/passwd: No such file or directory
Give root password for maintenance
(or type Control-D for normal startup):
Anything you type here, including just hitting Return, seems to work.
sh-2.03# cat > /etc/passwd
sh: /etc/passwd: Read-only file system
Here's the first problem. We need to remount the root filesystem read-write before doing anything else:
sh-2.03# mount / -o remount
OK, back to our regularly scheduled disaster. I use cat here, but if you prefer vi, go ahead and use that.
sh-2.03# cat > /etc/passwd
root::0:0:root:/root:/bin/bash
^D
So far, so good. Let's do a sanity check to make sure the utilities think the password file is good:
sh-2.03# whoami
root
That's fine, so let's continue the boot by exiting the single-user shell (type exit or press Control-D).
And now let's see if root can log in:
Debian GNU/Linux 2.2 usermode ttys/0
usermode login: root
Last login: Tue Nov 13 18:28:32 2001 on ttys/0
Linux usermode 2.4.13-1um #2 Fri Oct 26 15:42:47 EDT 2001 i686 unknown
usermode:~#
Yes, root can log in again. If this had happened on a physical machine, your next job would be to chase down the most recent backup tape and restore /etc/passwd from it.
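The one-line password file typed in above works because it has the seven colon-separated fields that /etc/passwd entries use: name, password, UID, GID, comment, home directory, and shell. The second field is left empty, which is why the login above goes straight through without a Password: prompt. Before rebooting a real machine on a hand-built file, a quick field count is a cheap sanity check. This is a minimal sketch; passwd_fields_ok is a hypothetical name, and it checks only the number of fields, not their contents.

```shell
# Succeed if every line on stdin has exactly seven colon-separated
# fields (name:password:UID:GID:comment:home:shell); fail otherwise.
passwd_fields_ok() {
    awk -F: 'BEGIN { bad = 0 } NF != 7 { bad = 1 } END { exit bad }'
}
```

From the single-user shell, passwd_fields_ok < /etc/passwd && echo looks sane would confirm the layout of the file just created.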