* Block device cache issue
@ 2009-04-02 14:52 Apollon Oikonomopoulos
From: Apollon Oikonomopoulos @ 2009-04-02 14:52 UTC (permalink / raw)
To: linux-kernel
Greetings to the list,
At my company, we have come across something that we think is a design
limitation in the way the Linux kernel handles block device caches. I
will first describe the incident we encountered, before speculating on
the actual cause.
As part of our infrastructure, we are running some Linux servers used as
Xen Dom0s, using SAN LUNs as the VMs' disk images, so these LUNs contain
normal MBR partition tables. At some point we came across a VM that,
due to a misconfiguration of GRUB, failed to boot. We used
multipath-tools' kpartx to create a device-mapper device pointing to the
first partition of the LUN, mounted the filesystem, changed
boot/grub/menu.lst, unmounted it and proceeded to boot the VM once more.
To our surprise, Xen's pygrub showed the boot menu exactly as it was
before the changes we made. We double-checked that the changes we made
were indeed there and tried to find out what was actually going on.
As it turned out, the LUN device's read buffers had not been updated;
losetup'ing the LUN device with the proper offset to the first partition
and mounting it gave us exactly the image of the filesystem as it was
_before_ our changes. We started digging into the kernel's buffer
internals and came to the conclusion [1] that every block device has
its own pagecache, attached to a hash of (major,minor), that is
independent from the caches of its containing or contained devices.
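For reference, the byte offset we passed to losetup -o can be computed from the MBR itself. A minimal sketch of that arithmetic (not our actual tooling; the function name is my own):

```python
import struct

SECTOR_SIZE = 512        # MBR partition tables address the disk in 512-byte sectors
PART_TABLE_OFFSET = 446  # the partition table starts at byte 446 of the MBR
ENTRY_SIZE = 16          # four 16-byte primary partition entries

def partition_byte_offsets(mbr: bytes):
    """Return the starting byte offset of each primary partition
    (0 for empty slots), i.e. the value you would feed to losetup -o."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("missing MBR boot signature")
    offsets = []
    for i in range(4):
        base = PART_TABLE_OFFSET + i * ENTRY_SIZE
        # the starting LBA is a little-endian uint32 at byte 8 of the entry
        start_lba = struct.unpack_from("<I", mbr, base + 8)[0]
        offsets.append(start_lba * SECTOR_SIZE)
    return offsets
```

With a classic DOS layout whose first partition begins at sector 63, this yields 32256 bytes.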
Now, in practice one rarely, if ever, accesses the same data through
these two different paths (whole disk and partition), except in
scenarios like this. However, there currently seems to be an implicit
assumption that these two paths should not be used in the same "uptime"
cycle at all, at least not without dropping the caches. For the record,
I managed to reproduce the whole issue by reading a single block through
sda, dd'ing random data over it through sda1 and re-reading it through
sda: the cached contents remained unchanged (even hours later); the read
returned up-to-date data only with O_DIRECT, or after I dropped all
caches (via /proc/sys/vm/drop_caches).
And now we come to the question: Can someone please verify that the
above statements are correct, or am I missing something? If they are,
should the partition's buffers perhaps be linked with those of the
containing device, or even be part of them? I don't know whether this is
possible without significant overhead in the page cache (of which my
understanding is very shallow), but keep in mind that this behaviour
almost led to filesystem corruption (luckily we only changed a single
file and hit a single inode).
Thank you for your time. Cheers,
Apollon
PS: I am not subscribed to the list, so I would appreciate if you could
Cc any answers to my address.
[1] If I interpret the contents of fs/buffer.c and
include/linux/buffer_head.h correctly. Unfortunately, I'm not a kernel
hacker, so I apologise if I'm mistaken at this point.
--
-----------------------------------------------------------
Apollon Oikonomopoulos - GRNET Network Operations Centre
Greek Research & Technology Network - http://www.grnet.gr
-----------------------------------------------------------
* Re: Block device cache issue
From: Andrew Morton @ 2009-04-07 7:31 UTC (permalink / raw)
To: Apollon Oikonomopoulos; +Cc: linux-kernel
On Thu, 2 Apr 2009 17:52:05 +0300 Apollon Oikonomopoulos <ao-lkml@noc.grnet.gr> wrote:
> Greetings to the list,
>
> At my company, we have come across something that we think is a design
> limitation in the way the Linux kernel handles block device caches. I
> will first describe the incident we encountered, before speculating on
> the actual cause.
>
> As part of our infrastructure, we are running some Linux servers used as
> Xen Dom0s, using SAN LUNs as the VMs' disk images, so these LUNs contain
> normal MBR partition tables. At some point we came across a VM that,
> due to a misconfiguration of GRUB, failed to boot. We used
> multipath-tools' kpartx to create a device-mapper device pointing to the
> first partition of the LUN, mounted the filesystem, changed
> boot/grub/menu.lst, unmounted it and proceeded to boot the VM once more.
> To our surprise, Xen's pygrub showed the boot menu exactly as it was
> before the changes we made. We double-checked that the changes we made
> were indeed there and tried to find out what was actually going on.
>
> As it turned out, the LUN device's read buffers had not been updated;
> losetup'ing the LUN device with the proper offset to the first partition
> and mounting it gave us exactly the image of the filesystem as it was
> _before_ our changes. We started digging into the kernel's buffer
> internals and came to the conclusion [1] that every block device has
> its own pagecache, attached to a hash of (major,minor), that is
> independent from the caches of its containing or contained devices.
>
> Now, in practice one rarely, if ever, accesses the same data through
> these two different paths (whole disk and partition), except in
> scenarios like this. However, there currently seems to be an implicit
> assumption that these two paths should not be used in the same "uptime"
> cycle at all, at least not without dropping the caches. For the record,
> I managed to reproduce the whole issue by reading a single block through
> sda, dd'ing random data over it through sda1 and re-reading it through
> sda: the cached contents remained unchanged (even hours later); the read
> returned up-to-date data only with O_DIRECT, or after I dropped all
> caches (via /proc/sys/vm/drop_caches).
>
> And now we come to the question part: Can someone please verify that the
> above statements are correct, or am I missing something?
The above statements are correct ;)
Similarly, the pagecache for /etc/passwd is separate from the
pagecache for the device upon which /etc is mounted.
> If they are,
> should the partition's buffers perhaps be linked with those of the
> containing device, or even be part of them? I don't know whether this
> is possible without significant overhead in the page cache (of which my
> understanding is very shallow), but keep in mind that this behaviour
> almost led to filesystem corruption (luckily we only changed a single
> file and hit a single inode).
It would incur overhead. We could perhaps fix it by having a single
cache for /dev/sda and then just making /dev/sda1 access that cache
with an offset. But it rarely if ever comes up - I guess the few
applications which do this sort of thing are taking suitable steps to
avoid it - fsync, ioctl(BLKFLSBUF), posix_fadvise(FADV_DONTNEED),
O_DIRECT, etc.
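A sketch of those steps in Python, assuming Linux; the BLKFLSBUF value (0x1261) is taken from <linux/fs.h>, the function name is illustrative, and the ioctl is only attempted on actual block devices:

```python
import fcntl
import os
import stat

BLKFLSBUF = 0x1261  # ioctl from <linux/fs.h>: flush the block device's buffer cache

def flush_device_cache(path):
    """Write back dirty pages and drop cached pages for `path`, so that a
    later read through another alias of the device sees current data."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)  # push dirty pages down to the device
        # drop the now-clean pages from this inode's pagecache
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        if stat.S_ISBLK(os.fstat(fd).st_mode):
            fcntl.ioctl(fd, BLKFLSBUF)  # block devices only
    finally:
        os.close(fd)
```

Calling something like this on the whole-disk node after writing through the partition (or vice versa) is the kind of suitable step those applications take.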
* Re: Block device cache issue
From: Avi Kivity @ 2009-04-07 10:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: Apollon Oikonomopoulos, linux-kernel
Andrew Morton wrote:
>
>> should it perhaps be the case that the partition's buffers somehow be
>> linked with those of the containing device, or even be part of them? I
>> don't even know if this is possible without significant overhead in the
>> page cache (of which my understanding is very shallow), but keep in mind
>> that this behaviour almost led to filesystem corruption (luckily we only
>> changed a single file and hit a single inode).
>>
>
> It would incur overhead. We could perhaps fix it by having a single
> cache for /dev/sda and then just making /dev/sda1 access that cache
> with an offset.
The offset need not be PAGE_SIZE aligned, and with traditional DOS
partitioning it usually isn't: the first partition starts 63 sectors
into the disk.
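To illustrate the misalignment, assuming 512-byte sectors and 4 KiB pages (the function name is mine, for illustration only):

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096  # typical x86 page size

def partition_page_misalignment(start_sector):
    """Byte distance between a partition's start and the preceding page
    boundary of the whole disk (0 means the partition is page aligned)."""
    return (start_sector * SECTOR_SIZE) % PAGE_SIZE

# Classic DOS layout: first partition at sector 63 is 32256 bytes in,
# i.e. 3584 bytes past a page boundary, so pages cached through the
# partition device could never simply be shared with whole-disk pages.
```

By contrast, a modern layout starting at sector 2048 (1 MiB) is page aligned, which would make such cache sharing feasible.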
--
error compiling committee.c: too many arguments to function