From: Andrew Morton <akpm@linux-foundation.org>
To: Apollon Oikonomopoulos <ao-lkml@noc.grnet.gr>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Block device cache issue
Date: Tue, 7 Apr 2009 00:31:53 -0700
Message-ID: <20090407003153.41fb9c78.akpm@linux-foundation.org>
In-Reply-To: <20090402145205.GG30077@apollon.noc.grnet.gr>

On Thu, 2 Apr 2009 17:52:05 +0300 Apollon Oikonomopoulos <ao-lkml@noc.grnet.gr> wrote:
> Greetings to the list,
>
> At my company, we have come across something that we think is a design
> limitation in the way the Linux kernel handles block device caches. I
> will first describe the incident we encountered, before speculating on
> the actual cause.
>
> As part of our infrastructure, we are running some Linux servers used as
> Xen Dom0s, using SAN LUNs as the VMs' disk images, so these LUNs contain
> normal MBR partition tables. At some point we came across a VM that -
> due to a misconfiguration of GRUB - failed on a reboot. We used
> multipath-tools' kpartx to create a device-mapper device pointing to the
> first partition of the LUN, mounted the filesystem, changed
> boot/grub/menu.lst, unmounted it and proceeded to boot the VM once more.
> To our surprise, Xen's pygrub showed the boot menu exactly as it was
> before the changes we made. We double-checked that the changes we made
> were indeed there and tried to find out what was actually going on.
>
> As it turned out, the LUN device's read buffers had not been updated;
> losetup'ing the LUN device with the proper offset to the first partition
> and mounting it gave us exactly the image of the filesystem as it was
> _before_ our changes. We started digging into the kernel's buffer
> internals and came to the conclusion [1] that every block device has
> its own pagecache, attached to a hash of (major,minor), that is
> independent from the caches of its containing or contained devices.
>
> Now, in practice one rarely - if ever - accesses the same data from
> these two different paths (disk + partition), except in scenarios like
> this. However currently there seems to be an implicit assumption that
> these two paths should not be used in the same "uptime" cycle at all, at
> least not without dropping the caches. For the record, I managed to
> reproduce the whole issue by reading a single block through sda, dd'ing
> random data to it through sda1 and re-reading it through sda: its
> contents were intact (even hours later) and were up-to-date only when
> using O_DIRECT and finally when I dropped all caches (using
> /proc/sys/vm/drop_caches).
>
> And now we come to the question part: Can someone please verify that the
> above statements are correct, or am I missing something?

The above statements are correct ;)

Similarly, the pagecache for /etc/passwd is separate from the
pagecache for the device upon which /etc is mounted.
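
To make the stale-read behaviour concrete, here is a toy model in Python.
It is only a sketch, not kernel code: BackingStore and DeviceCache are
hypothetical names, the 4096-byte block size and the partition start at
block 4 are arbitrary assumptions, and writes are modelled as
write-through for simplicity. Each DeviceCache stands in for the
pagecache of one (major,minor).

```python
BLOCK = 4096  # assumed block size for the model


class BackingStore:
    """Stands in for the physical disk: a flat run of bytes."""

    def __init__(self, blocks):
        self.data = bytearray(blocks * BLOCK)


class DeviceCache:
    """Toy per-device cache; one instance per (major, minor)."""

    def __init__(self, store, start_block=0):
        self.store = store
        self.start = start_block * BLOCK  # where this device begins on disk
        self.pages = {}                   # block index -> cached bytes

    def read(self, block):
        # On a miss, fill the page from disk; on a hit, trust the cache.
        if block not in self.pages:
            off = self.start + block * BLOCK
            self.pages[block] = bytes(self.store.data[off:off + BLOCK])
        return self.pages[block]

    def write(self, block, payload):
        # Write-through: update this device's cache and the disk only.
        # No other DeviceCache is told about the change.
        assert len(payload) == BLOCK
        off = self.start + block * BLOCK
        self.pages[block] = bytes(payload)
        self.store.data[off:off + BLOCK] = payload

    def drop(self):
        # Analogous to dropping caches for this one device.
        self.pages.clear()


disk = BackingStore(blocks=16)
sda = DeviceCache(disk)                   # whole disk
sda1 = DeviceCache(disk, start_block=4)   # partition starting at block 4

before = sda.read(4)                      # caches disk block 4 via sda
sda1.write(0, b"\xaa" * BLOCK)            # same bytes, written via sda1
stale = sda.read(4)                       # still the old contents
sda.drop()
fresh = sda.read(4)                       # now re-read from disk
```

Here `stale` equals `before` even though the disk already holds the new
data, and only after `sda.drop()` does the read through sda see it. This
mirrors the sda/sda1 reproduction described above.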
> If they are,
> should it perhaps be the case that the partition's buffers somehow be
> linked with those of the containing device, or even be part of them? I
> don't even know if this is possible without significant overhead in the
> page cache (of which my understanding is very shallow), but keep in mind
> that this behaviour almost led to filesystem corruption (luckily we only
> changed a single file and hit a single inode).

It would incur overhead. We could perhaps fix it by having a single
cache for /dev/sda and then just making /dev/sda1 access that cache
with an offset. But it rarely if ever comes up - I guess the few
applications which do this sort of thing are taking suitable steps to
avoid it - fsync, ioctl(BLKFLSBUF), posix_fadvise(POSIX_FADV_DONTNEED),
O_DIRECT, etc.
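
As a rough illustration of those steps, an application might do
something like the following after writing through one device node and
before reading through another. This is a sketch under assumptions:
flush_and_invalidate is a hypothetical helper name, the BLKFLSBUF value
is the usual _IO(0x12, 97) from <linux/fs.h>, the ioctl normally needs
CAP_SYS_ADMIN, and the posix_fadvise branch exists mainly so the helper
can be tried on a regular file. POSIX_FADV_DONTNEED is advisory, so the
kernel is not obliged to drop every page.

```python
import fcntl
import os
import stat

BLKFLSBUF = 0x1261  # assumed value: _IO(0x12, 97) from <linux/fs.h>


def flush_and_invalidate(path):
    """Flush dirty data for path and ask the kernel to drop cached pages.

    Returns the name of the mechanism that was used.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Push any dirty pages for this file or device out first.
        os.fsync(fd)
        if stat.S_ISBLK(os.fstat(fd).st_mode):
            # Block device: flush and invalidate its buffer cache.
            fcntl.ioctl(fd, BLKFLSBUF)
            return "BLKFLSBUF"
        # Anything else: hint that the cached pages are no longer needed
        # (offset 0, length 0 means the whole file).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return "posix_fadvise"
    finally:
        os.close(fd)
```

In the scenario from this thread one would call it on the partition
that was written (for example /dev/sda1) and then on the containing
device (for example /dev/sda) before re-reading through the latter;
those device names are just the examples used above.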
Thread overview: 3+ messages
2009-04-02 14:52 Block device cache issue Apollon Oikonomopoulos
2009-04-07 7:31 ` Andrew Morton [this message]
2009-04-07 10:38 ` Avi Kivity