public inbox for linux-kernel@vger.kernel.org
* re: Announce: dumpfs v0.01 - common RAS output API
From: Dan Kegel @ 2004-07-22 15:42 UTC
  To: Linux Kernel Mailing List, kaos

Keith Owens <kaos () sgi ! com> wrote:
> Announcing dumpfs - a common API for all the RAS code that wants to
> save data during a kernel failure and to extract that RAS data on the
> next boot.  The documentation file is appended to this mail.
> ...

I looked, but couldn't see any definition for RAS in your doc.
Could you add one?
The fs/Kconfig hunk might be a nice place to define it, since
naive users might see that text when configuring kernels.

http://www.kernelnewbies.org/glossary/#R does define it,
but it's so far down on
http://www.google.com/search?q=define%3Aras
that most people configuring a kernel might not be familiar with that sense.
- Dan

-- 
My technical stuff: http://kegel.com
My politics: see http://www.misleader.org for examples of why I'm for regime change

* Announce: dumpfs v0.01 - common RAS output API
From: Keith Owens @ 2004-07-22 16:19 UTC
  To: linux-kernel

Announcing dumpfs - a common API for all the RAS code that wants to
save data during a kernel failure and to extract that RAS data on the
next boot.  The documentation file is appended to this mail.

ftp://oss.sgi.com/projects/kdb/download/dumpfs - current version is
v0.01, patch against 2.6.8-rc2.

This is a work in progress; the code is not complete and is subject to
change without notice.

dumpfs-v0.01 handles mounting the dumpfs partitions, including reliable
sharing with swap partitions and clearing the dumpfs partitions.  I am
working on the code that reads and writes dumpfs data from kernel
space; it is incomplete and has not been tested yet.  After
dumpfs_kernel is working, dumpfs_user is trivial.  The code is a proof of
concept; some sections of the API (including polled I/O and data
compression) are not supported yet, and some of the code is ugly.

Why announce incomplete and untested code?  Mainly because RAS and
kernel dumping are being discussed at OLS this week.  Since I cannot be
at OLS, this is the next best thing.  Also the dumpfs API has
stabilized for the first cut, so it is time to get more discussion on
the API and to determine if it is worth continuing with the dumpfs
approach.  If dumpfs is discussed at OLS then I would appreciate any
feedback.

Questions for the other people who care about RAS (which rules out most
of the kernel developers) -

* Is using a common dump API the right thing to do?

  Obviously I think that this makes sense.  At the moment every bit of
  RAS code has its own dedicated I/O mechanism, not to mention its own
  user space tools to interface with the kernel, and to initialize,
  extract and clear its own data.

  dumpfs consolidates a lot of common code that is scattered over
  several RAS tools.  dumpfs removes the need for special RAS tools to
  extract dump data on reboot; standard user space commands will do the
  job instead.

* Is overloading mount the best approach?

  Making mount dumpfs share the partition with swap is ugly.  OTOH most
  of the existing code that dumpfs is intended to replace makes no
  attempt to verify its partition usage.  At least dumpfs tries to
  verify its partition data, ugly though the code is.

* Does the dumpfs API need to be extended or even replaced, either in
  kernel or in user space?

  One obvious extension is to make compression selective, so that some
  sections of the file can be compressed and others be in clear text.
  The lcrash header springs to mind.  Omitted for now since this
  version does not support compression yet.

* How do we get a clean API to do polling mode I/O to disk?

  One thing that is absolutely required for reliable RAS output is a
  polling mode method.  netdump is available for the network; we need
  the equivalent for disk I/O.  What is the best way to integrate
  polling mode I/O into the block device subsystem?

If the people who care about RAS think that a common RAS output API is
worthwhile then I will continue working on dumpfs.  Otherwise it will
be just another idea that did not get taken up, and each RAS tool will
continue to be developed and maintained in isolation.


==== 2.6.8-rc2/Documentation/filesystems/dumpfs.txt ====

dumpfs provides a common API for RAS components that need to dump kernel data
during a problem.  The dumped data is expected to be copied and cleared on the
next successful boot.

dumpfs consists of two layers, with completely different semantics.  These are
dumpfs (kernel only) and dumpfs_user (user space view of any saved dump data).

dumpfs uses one mount for each dump partition.  Each dumpfs partition can be
mounted with option share or noshare; the default is noshare.  The only
allowable user space operations on a dumpfs partition are mount and umount; user
space cannot directly access the dumpfs data.  Each dumpfs partition is mounted
with "mount -t dumpfs /dev/partition /mnt/dumpfs".  /mnt/dumpfs must be a
directory; it never contains anything useful but the mount semantics require a
directory here.

A shared dumpfs partition will normally coexist with a swap partition; the
dumpfs superblock is stored at an offset which leaves the swap signature alone.
A shared dump partition has no superblock on disk until the first dump file is
created.  Mounting a dumpfs partition with "-o clear" will completely zero the
dumpfs superblock, including the magic field.  This ensures that old dumpfs data
in a shared partition will not be used; its contents are unreliable because of
the data sharing.

When mounting a shared dumpfs partition, no check is made to see if the disk
contains a dumpfs superblock.  Mounting a dumpfs partition with -o share will
only share with a swap partition; it will not share with any other mounted
partition.

A non-shared dumpfs partition must have a superblock before being mounted.
mkfs.dumpfs and fsck.dumpfs (only used for non-shared partitions) are trivial.
Mounting dumpfs with "-o noshare,clear" will clear the metadata in the dumpfs
superblock, but preserve the magic field.

mkfs.dumpfs

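#!/bin/sh
# Zero the first 64 KiB of the partition given as $1.
dd if=/dev/zero of="$1" bs=64k count=1
# Write 'dum0' (plus echo's newline) at offset 64 KiB; conv=sync pads with NULs.
echo 'dum0' | dd of="$1" bs=64k seek=1 conv=sync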

fsck.dumpfs

#!/bin/sh
true
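
The fsck.dumpfs above accepts any partition.  As a sketch only, a slightly
stricter check could look for the magic where the mkfs.dumpfs script above
writes it ('dum0' at byte offset 65536).  The helper name has_dumpfs_magic is
hypothetical and is not part of dumpfs; the real superblock layout may contain
more than this sketch assumes.

```shell
#!/bin/sh
# Hypothetical helper, not part of dumpfs: succeed if "$1" carries the
# 'dum0' magic at byte offset 65536, where the mkfs.dumpfs script writes it.
has_dumpfs_magic() {
    # Read exactly 4 bytes starting at offset 64 KiB.
    magic=$(dd if="$1" bs=1 skip=65536 count=4 2>/dev/null)
    [ "$magic" = "dum0" ]
}
```

It could be used as "has_dumpfs_magic /dev/sda2 || echo no dumpfs magic" before
mounting a non-shared partition.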

Each dumpfs partition can be mounted with option poll or nopoll; the default is
poll.  Poll uses low level polled mode I/O direct to the partition, completely
bypassing the normal interrupt driven code.  This is done in an attempt to get
the data out to disk even when the kernel is so badly broken that interrupts are
not working.  Poll requires that the device driver for the dumpfs partition
supports polling mode I/O.  Nopoll uses the standard kernel I/O mechanisms, so
it is not guaranteed to work when the kernel is crashing.  Nopoll should only be
used when your device driver does not support polling mode I/O yet; you must
accept that dumpfs may hang waiting for the I/O to be serviced.

Another option when mounting a dumpfs partition is to specify the size of its
data buffer, in kibibytes.  This buffer is permanently allocated as long as the
dumpfs partition is mounted, but it is only used when writing RAS data via
dumpfs.  The buffer size will be rounded up to a multiple of the kernel page
size.  The default is buffer=128.
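
To see what a given buffer= value turns into after rounding, the arithmetic can
be sketched in user space.  This only illustrates the rounding rule stated
above; it is not how the kernel code does it.  The function name
round_buffer_kib is hypothetical, and the page size defaults to whatever
"getconf PAGESIZE" reports on the running system.

```shell
#!/bin/sh
# Round a buffer= value (in KiB) up to a whole number of pages.
# Usage: round_buffer_kib KIB [PAGE_SIZE_BYTES]
round_buffer_kib() {
    kib=$1
    # Assumption of this sketch: use the running system's page size
    # unless one is passed explicitly.
    page=${2:-$(getconf PAGESIZE)}
    bytes=$((kib * 1024))
    pages=$(( (bytes + page - 1) / page ))
    echo $(( pages * page / 1024 ))
}
```

For example, with 64 KiB pages "round_buffer_kib 100 65536" gives 128, while
the default buffer=128 is already page aligned for 4 KiB pages.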


The user space view of the RAS data held in the dumpfs partitions is created by
"mount -t dumpfs_user none /mnt/dump".  It logically merges and validates all
the dumpfs partitions that have been mounted and provides a user space view of
the files that have been written to dumpfs.  The only user space operations
supported on dumpfs_user are llseek, read, readdir, open (read only), close and
unlink.  That is just enough to copy the files out of dumpfs_user and remove
them.  User space cannot write to dumpfs_user.

The kernel can write to files held in dumpfs partitions, to save RAS data over a
reboot.  Note that when kernel RAS components write to dumpfs they do _not_ use
the normal VFS layer, because it may not be working during a failure.  Instead a
RAS component makes direct calls to the following dumpfs_kernel functions.

dumpfs_kernel_open("prefix", flags)

  Create and open for writing a file in dumpfs.  It returns a file descriptor
  within dumpfs.

  The dumpfs filename is constructed from "prefix-" followed by the value of
  xtime in the format CCYY-MM-DD-hh:mm:ss.n, where n starts at 0 and is
  incremented for each dumpfs file in the current boot.

  There is no requirement that a dumpfs_user mount point exist before the kernel
  can dump its data.  The first call to dumpfs_kernel_open will automatically
  create a kernel view that merges all the mounted dumpfs partitions.  The first
  call to dumpfs_kernel_open also writes the dumpfs superblocks to any shared
  partitions.

  Flags select compression, if any.

  dumpfs_kernel_open() is the simple interface.  It automatically stripes the
  data across all dumpfs partitions that are not currently being used.

  Most RAS code will open one dump file at a time, mainly because most users
  will only have one dumpfs partition.  The dumpfs code has a module parameter
  called dumpfs_max_open, with a default value of 1.
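
  As an illustration of the file naming rule for dumpfs_kernel_open(), the
  following user space sketch builds a name of the same shape.  It is not the
  kernel code.  The function name dumpfs_name and the prefix "oops" are
  hypothetical, and the sketch assumes GNU date and a UTC timestamp; the
  announcement does not say which timezone the kernel uses.

```shell
#!/bin/sh
# Build a name of the form prefix-CCYY-MM-DD-hh:mm:ss.n
# Usage: dumpfs_name PREFIX EPOCH_SECONDS N
# Assumptions of this sketch: GNU date (-d @SECONDS) and UTC.
dumpfs_name() {
    stamp=$(date -u -d "@$2" '+%Y-%m-%d-%H:%M:%S') || return 1
    printf '%s-%s.%s\n' "$1" "$stamp" "$3"
}
```

  For example, "dumpfs_name oops 1090513140 0" prints
  oops-2004-07-22-16:19:00.0, which is the kind of name that later shows up
  under dumpfs_user.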

dumpfs_kernel_bdev_list()
dumpfs_kernel_open_choose("prefix", flags, bdev_list)

  Some platforms may need to have multiple output streams open in parallel.  For
  example a system with large amounts of memory and multiple disks may wish to
  assign different sections of memory to each cpu and to write to separate
  partitions.

  dumpfs_kernel_bdev_list() returns the list of usable dumpfs partitions.  If
  all partitions are in use then the list is empty.

  dumpfs_kernel_open_choose() opens a file using only the selected bdev entries.

  Systems that use concurrent parallel dumps should set the module parameter
  dumpfs_max_open to a suitable value.

  Note: The following problems are inherently architecture and platform specific
  and are outside the scope of dumpfs.  That is not to say that we should not
  have an API for handling these problems on large systems, but it would be a
  separate API from dumpfs.

    Deciding which cpus to use for parallel dumping.
    Deciding which block devices each cpu should use.
    Getting the chosen cpus into the RAS code.
    Assigning the range of work to each cpu and each partition.
    Watching the dumping cpus for problems, recovering from those problems
      and reassigning the work to another cpu.
    Reconstructing the parallel dumps into a format for analysis.  dumpfs_user
      makes each dump file available to user space, but some code may be
      required to merge the separate files together.

dumpfs_kernel_close(fd)

  Sync the file's data to disk, close the file and update the dumpfs metadata.

dumpfs_kernel_write(fd, buffer, length)

  Write the buffer at the current dumpfs file location.  The data may or may not
  be written to disk immediately.  It returns the current location, including
  the data that was just written.

  For performance, the dumpfs data is striped over all the assigned partitions,
  in round robin.  The stripe unit is the minimum of the buffer= value across
  all the assigned partitions.
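
  The striping rule can be sketched with a little arithmetic.  The stripe unit
  follows the text above (minimum buffer= value).  The mapping of stripes to
  partitions is an assumption of this sketch: stripe k goes to partition
  k mod N and partitions are filled in order.  The on-disk layout that dumpfs
  really uses may differ.  Both function names are hypothetical.

```shell
#!/bin/sh
# Stripe unit: the smallest buffer= value (KiB) among the assigned partitions.
# Usage: stripe_unit BUFFER_KIB...
stripe_unit() {
    min=$1
    for b in "$@"; do
        if [ "$b" -lt "$min" ]; then
            min=$b
        fi
    done
    echo "$min"
}

# Map a file offset (KiB) to "partition_index offset_within_partition_KiB".
# Assumed layout: plain round robin, stripe k lands on partition k mod N.
# Usage: stripe_locate OFFSET_KIB UNIT_KIB NUM_PARTITIONS
stripe_locate() {
    stripe=$(( $1 / $2 ))
    echo "$(( stripe % $3 )) $(( (stripe / $3) * $2 + $1 % $2 ))"
}
```

  With three partitions mounted with buffer=128, buffer=64 and buffer=256 the
  stripe unit is 64, and under the assumed layout file offset 200 KiB lands on
  partition 0 at offset 72 KiB.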

dumpfs_kernel_read(fd, buffer, length)

  Read the buffer from the current dumpfs file location.  It returns the current
  location, including the data that was just read.

dumpfs_kernel_llseek(fd, position)

  Set the current dumpfs file location.  It returns the previous location.  Only
  absolute seeking is supported.

dumpfs_kernel_sync(fd)

  Sync the file's data to disk and update the dumpfs metadata.

dumpfs_kernel_dirty_shared()

  Returns true if any shared partitions have been dirtied, in which case the
  kernel must be rebooted after all the RAS components have completed their
  work.

dumpfs_kernel_all_polled()

  Returns true if all dumpfs partitions can support polling mode I/O.  Otherwise
  the RAS code that calls dumpfs should enable interrupts, if at all possible.


Sample /etc/fstab entries for dumpfs partitions.

  /dev/sda2  /mnt/dumpfs  dumpfs  defaults  0 0
  /dev/sdb2  /mnt/dumpfs  dumpfs  share     0 0
  /dev/sdc7  /mnt/dumpfs  dumpfs  nopoll    0 0

Sample code in /etc/rc.sysinit to save dump data from the previous boot.  If you
are sharing dumpfs with swap, these commands must be executed before mounting
swap.  Note that dumpfs does not require any special user space tools to poke
inside partitions to see if there is any useful data to save; everything is a
file.

  # mount all the dumpfs partitions
  mount -a -t dumpfs
  # merge all dumpfs into dumpfs_user on /mnt/dump
  mount -t dumpfs_user none /mnt/dump
  # copy the data out
  (cd /mnt/dump && for f in $(find . -type f); do echo "saving $f"; mv "$f" /var/log/dump; done)
  # drop dumpfs_user
  umount /mnt/dump
  # clear all the dumpfs metadata
  umount -a -t dumpfs
  mount -a -t dumpfs -o clear
  umount -a -t dumpfs

rc.sysinit will later mount the swap partitions, then mount all the other
partition types.  That will remount the dumpfs partitions, ready for the next
kernel crash.
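
The copy step in the sample above can be made a little more defensive.  The
sketch below copies each file, compares the copy with the original and only
then removes the original, so a failed copy leaves the dump in place.  The
function name save_dumps is hypothetical; the commands it uses are standard
user space tools, not dumpfs tools.

```shell
#!/bin/sh
# Copy every regular file under SRC into DST, then remove the original
# only after the copy compares equal.
# Usage: save_dumps SRC DST
save_dumps() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    find "$src" -type f | while IFS= read -r f; do
        name=$(basename "$f")
        if cp "$f" "$dst/$name" && cmp -s "$f" "$dst/$name"; then
            echo "saved $name"
            rm -f "$f"
        else
            echo "failed to save $name" >&2
        fi
    done
}
```

In the rc.sysinit sample this would be called as
"save_dumps /mnt/dump /var/log/dump" in place of the find loop.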



Thread overview: 11+ messages
2004-07-22 15:42 Announce: dumpfs v0.01 - common RAS output API Dan Kegel
2004-07-22 16:19 Keith Owens
2004-07-26  6:57 ` Andrew Morton
2004-07-28  1:53   ` Eric W. Biederman
2004-07-28 10:54     ` Suparna Bhattacharya
2004-07-28 16:03     ` Jesse Barnes
2004-07-28 18:00       ` Eric W. Biederman
2004-07-28 18:06         ` Jesse Barnes
2004-07-28 19:42           ` Martin J. Bligh
2004-07-28 19:44           ` Andrew Morton
2004-07-28 19:23       ` Martin J. Bligh
