public inbox for linux-kernel@vger.kernel.org
From: Andrew Morton <akpm@linux-foundation.org>
To: Apollon Oikonomopoulos <ao-lkml@noc.grnet.gr>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Block device cache issue
Date: Tue, 7 Apr 2009 00:31:53 -0700
Message-ID: <20090407003153.41fb9c78.akpm@linux-foundation.org>
In-Reply-To: <20090402145205.GG30077@apollon.noc.grnet.gr>

On Thu, 2 Apr 2009 17:52:05 +0300 Apollon Oikonomopoulos <ao-lkml@noc.grnet.gr> wrote:

> Greetings to the list,
> 
> At my company, we have come across something that we think is a design 
> limitation in the way the Linux kernel handles block device caches.  I 
> will first describe the incident we encountered, before speculating on 
> the actual cause.
> 
> As part of our infrastructure, we are running some Linux servers used as 
> Xen Dom0s, using SAN LUNs as the VMs' disk images, so these LUNs contain 
> normal MBR partition tables. At some point we came across a VM that - 
> due to a misconfiguration of GRUB - failed on a reboot. We used 
> multipath-tools' kpartx to create a device-mapper device pointing to the 
> first partition of the LUN, mounted the filesystem, changed 
> boot/grub/menu.lst, unmounted it and proceeded to boot the VM once more.  
> To our surprise, Xen's pygrub showed the boot menu exactly as it was 
> before the changes we made. We double-checked that the changes we made 
> were indeed there and tried to find out what was actually going on.
> 
> As it turned out, the LUN device's read buffers had not been updated;  
> losetup'ing the LUN device with the proper offset to the first partition 
> and mounting it gave us exactly the image of the filesystem as it was 
> _before_ our changes. We started digging into the kernel's buffer 
> internals and came to the conclusion [1] that every block device has 
> its own pagecache, attached to a hash of (major,minor), that is 
> independent from the caches of its containing or contained devices.  
> 
> Now, in practice one rarely - if ever - accesses the same data from 
> these two different paths (disk + partition), except in scenarios like 
> this. However currently there seems to be an implicit assumption that 
> these two paths should not be used in the same "uptime" cycle at all, at 
> least not without dropping the caches.  For the record, I managed to 
> reproduce the whole issue by reading a single block through sda, dd'ing 
> random data to it through sda1 and re-reading it through sda: its 
> contents were intact (even hours later) and were up-to-date only when 
> using O_DIRECT and finally when I dropped all caches (using 
> /proc/sys/vm/drop_caches).
> 
> And now we come to the question part: Can someone please verify that the 
> above statements are correct, or am I missing something?

The above statements are correct ;)

Similarly, the pagecache for /etc/passwd is separate from the
pagecache for the device upon which /etc is mounted.
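
The behaviour confirmed above can be illustrated with a toy model.  This
is not kernel code: the Disk and DevNode classes, the 128-block disk and
the partition start at block 63 are all hypothetical, chosen only to show
two device nodes caching the same underlying block independently.

```python
class Disk:
    """The physical medium: a flat list of blocks."""
    def __init__(self, nblocks):
        self.blocks = [b"old"] * nblocks


class DevNode:
    """A device node (whole disk or partition) with its own private cache."""
    def __init__(self, disk, start=0):
        self.disk = disk
        self.start = start   # offset of this node within the disk, in blocks
        self.cache = {}      # block number -> cached contents

    def read(self, blk):
        # Serve from this node's cache; go to the medium only on a miss.
        if blk not in self.cache:
            self.cache[blk] = self.disk.blocks[self.start + blk]
        return self.cache[blk]

    def write(self, blk, data):
        # Write-through for simplicity: update this node's cache and the
        # medium, but never any other node's cache.
        self.cache[blk] = data
        self.disk.blocks[self.start + blk] = data

    def drop_cache(self):
        self.cache.clear()


disk = Disk(128)
sda = DevNode(disk)             # whole disk
sda1 = DevNode(disk, start=63)  # partition starting at block 63 (assumed)

before = sda.read(63)     # populates sda's cache
sda1.write(0, b"new")     # same physical block, written via the partition
stale = sda.read(63)      # still the old contents, served from sda's cache
sda.drop_cache()
fresh = sda.read(63)      # re-read from the medium
print(before, stale, fresh)
```

In this model the stale read disappears only once the whole-disk node's
cache is dropped, which mirrors the sda/sda1 reproduction described in
the quoted message.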

> If they are, 
> should it perhaps be the case that the partition's buffers somehow be 
> linked with those of the containing device, or even be part of them? I 
> don't even know if this is possible without significant overhead in the 
> page cache (of which my understanding is very shallow), but keep in mind 
> that this behaviour almost led to filesystem corruption (luckily we only 
> changed a single file and hit a single inode).

It would incur overhead.  We could perhaps fix it by having a single
cache for /dev/sda and then just making /dev/sda1 access that cache
with an offset.  But it rarely if ever comes up - I guess the few
applications which do this sort of thing are taking suitable steps to
avoid it - fsync, ioctl(BLKFLSBUF), posix_fadvise(POSIX_FADV_DONTNEED),
O_DIRECT, etc.
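
A minimal sketch of how an application might take those steps before
re-reading a device through a different path.  The function name
drop_cached_pages is hypothetical.  The ioctl is spelled BLKFLSBUF in
<linux/fs.h>, and its value is hard-coded here as _IO(0x12, 97) because
Python's fcntl module does not export it.  Issuing it normally requires
root, so the sketch only attempts it on real block devices.

```python
import fcntl
import os
import stat

BLKFLSBUF = 0x1261  # _IO(0x12, 97) from <linux/fs.h>; hard-coded assumption


def drop_cached_pages(path):
    """Flush dirty data for `path` and ask the kernel to drop its cached pages.

    Returns True if `path` is a block device (BLKFLSBUF was issued),
    False otherwise (only fsync + posix_fadvise were used).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Write out anything dirty first, so dropping pages loses nothing.
        os.fsync(fd)
        # Advise the kernel that cached pages for the whole range
        # (offset 0, length 0 = to end) are no longer needed.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        is_blk = stat.S_ISBLK(os.fstat(fd).st_mode)
        if is_blk:
            # Flush and invalidate the block device's buffer cache.
            fcntl.ioctl(fd, BLKFLSBUF)
        return is_blk
    finally:
        os.close(fd)
```

Usage would be along the lines of calling drop_cached_pages("/dev/sda")
after writing through /dev/sda1 and before reading the same blocks
through /dev/sda.  posix_fadvise is only advisory, so on block devices
the ioctl is the step that forces the invalidation.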
