public inbox for linux-kernel@vger.kernel.org
* Block device cache issue
@ 2009-04-02 14:52 Apollon Oikonomopoulos
  2009-04-07  7:31 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Apollon Oikonomopoulos @ 2009-04-02 14:52 UTC (permalink / raw)
  To: linux-kernel

Greetings to the list,

At my company, we have come across something that we think is a design 
limitation in the way the Linux kernel handles block device caches.  I 
will first describe the incident we encountered, before speculating on 
the actual cause.

As part of our infrastructure, we are running some Linux servers used as 
Xen Dom0s, using SAN LUNs as the VMs' disk images, so these LUNs contain 
normal MBR partition tables. At some point we came across a VM that,
due to a misconfiguration of GRUB, failed on a reboot. We used
multipath-tools' kpartx to create a device-mapper device pointing to the 
first partition of the LUN, mounted the filesystem, changed 
boot/grub/menu.lst, unmounted it and proceeded to boot the VM once more.  
To our surprise, Xen's pygrub showed the boot menu exactly as it was 
before the changes we made. We double-checked that the changes we made 
were indeed there and tried to find out what was actually going on.

As it turned out, the LUN device's read buffers had not been updated;
losetup'ing the LUN device with the proper offset to the first partition
and mounting it gave us exactly the image of the filesystem as it was
_before_ our changes. We started digging into the kernel's buffer
internals and came to the conclusion [1] that every block device has
its own pagecache, attached to a hash of (major,minor), that is
independent from the caches of its containing or contained devices.

Now, in practice one rarely - if ever - accesses the same data through
these two different paths (disk + partition), except in scenarios like
this. However, currently there seems to be an implicit assumption that
these two paths should not be used in the same "uptime" cycle at all, at
least not without dropping the caches. For the record, I managed to
reproduce the whole issue by reading a single block through sda, dd'ing
random data to it through sda1 and re-reading it through sda: the read
still returned the old contents (even hours later); the new data showed
up only when reading with O_DIRECT, and finally after I dropped all
caches (via /proc/sys/vm/drop_caches).
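
To illustrate what we think is going on, here is a toy model (plain
Python, not kernel code; all names are mine and purely illustrative) of
independent per-device caches sitting over one shared medium:

```python
# Toy model of the behaviour described above: each block-device node keeps
# its own private cache keyed by sector number, so a write through the
# partition device never invalidates data cached under the whole-disk
# device. None of these names correspond to kernel APIs.

class BlockDevice:
    def __init__(self, backing, offset=0):
        self.backing = backing   # shared "physical" medium (a list of sectors)
        self.offset = offset     # partition start, in sectors
        self.cache = {}          # this device's private cache: sector -> data

    def read(self, sector):
        if sector not in self.cache:                 # miss: go to the medium
            self.cache[sector] = self.backing[self.offset + sector]
        return self.cache[sector]                    # hit: possibly stale

    def write(self, sector, data):
        self.cache[sector] = data                    # update own cache...
        self.backing[self.offset + sector] = data    # ...and the medium

disk = [b"old"] * 100                  # the "physical" sectors
sda  = BlockDevice(disk)               # whole-disk node
sda1 = BlockDevice(disk, offset=63)    # partition node, same medium

sda.read(63)                 # populate sda's cache for sector 63
sda1.write(0, b"new")        # overwrite the same sector via the partition
print(sda.read(63))          # -> b'old': served from sda's stale cache
```

The medium now holds b"new", but reads through the whole-disk node keep
returning b"old" from its private cache - exactly what we saw on the LUN.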

And now we come to the question part: Can someone please verify that the
above statements are correct, or am I missing something? If they are,
should the partition's buffers perhaps somehow be linked with those of
the containing device, or even be part of them? I don't know whether
this is possible without significant overhead in the page cache (of
which my understanding is very shallow), but keep in mind that this
behaviour almost led to filesystem corruption (luckily we only changed a
single file and hit a single inode).

Thank you for your time. Cheers,
Apollon

PS: I am not subscribed to the list, so I would appreciate if you could 
    Cc any answers to my address.


[1] If I interpret the contents of fs/buffer.c and 
include/linux/buffer_head.h correctly. Unfortunately, I'm not a kernel 
hacker, so I apologise if I'm mistaken at this point.

-- 
-----------------------------------------------------------
 Apollon Oikonomopoulos - GRNET Network Operations Centre
 Greek Research & Technology Network - http://www.grnet.gr
----------------------------------------------------------- 


* Re: Block device cache issue
  2009-04-02 14:52 Block device cache issue Apollon Oikonomopoulos
@ 2009-04-07  7:31 ` Andrew Morton
  2009-04-07 10:38   ` Avi Kivity
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2009-04-07  7:31 UTC (permalink / raw)
  To: Apollon Oikonomopoulos; +Cc: linux-kernel

On Thu, 2 Apr 2009 17:52:05 +0300 Apollon Oikonomopoulos <ao-lkml@noc.grnet.gr> wrote:

> [...]
>
> And now we come to the question part: Can someone please verify that the 
> above statements are correct, or am I missing something?

The above statements are correct ;)

Similarly, the pagecache for /etc/passwd is separate from the
pagecache for the device upon which /etc is mounted.

> If they are, 
> should it perhaps be the case that the partition's buffers somehow be 
> linked with those of the containing device, or even be part of them? I 
> don't even know if this is possible without significant overhead in the 
> page cache (of which my understanding is very shallow), but keep in mind 
> that this behaviour almost led to filesystem corruption (luckily we only 
> changed a single file and hit a single inode).

It would incur overhead.  We could perhaps fix it by having a single
cache for /dev/sda and then just making /dev/sda1 access that cache
with an offset.  But it rarely if ever comes up - I guess the few
applications which do this sort of thing are taking suitable steps to
avoid it - fsync, ioctl(BLKFLSBUF), posix_fadvise(FADV_DONTNEED),
O_DIRECT, etc.
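
Something like this, say (untested sketch; the helper name is made up,
BLKFLSBUF's value is _IO(0x12, 97) on Linux):

```python
import fcntl
import os

BLKFLSBUF = 0x1261  # Linux _IO(0x12, 97): flush + invalidate a blockdev's buffers

def invalidate_cached_pages(fd):
    """Best-effort invalidation of possibly-stale pagecache for an open fd.

    Tries the block-device BLKFLSBUF ioctl first; if that fails (e.g. on
    a regular file), falls back to fsync + posix_fadvise(DONTNEED).
    """
    try:
        fcntl.ioctl(fd, BLKFLSBUF)   # flushes dirty buffers, drops clean ones
        return "ioctl"
    except OSError:
        os.fsync(fd)                 # push any dirty pages to the device first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # drop clean pages
        return "fadvise"
```

(O_DIRECT, the remaining item, is a property of how the fd was opened
rather than a call made afterwards.)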





* Re: Block device cache issue
  2009-04-07  7:31 ` Andrew Morton
@ 2009-04-07 10:38   ` Avi Kivity
  0 siblings, 0 replies; 3+ messages in thread
From: Avi Kivity @ 2009-04-07 10:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Apollon Oikonomopoulos, linux-kernel

Andrew Morton wrote:
>
>> [...]
>
> It would incur overhead.  We could perhaps fix it by having a single
> cache for /dev/sda and then just making /dev/sda1 access that cache
> with an offset.

The offset can be non-PAGE_SIZE-aligned (and usually isn't: a 63-sector
offset with normal partitioning).
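
Concretely, with the classic 63-sector start the byte offset falls in
the middle of a page:

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096

offset_bytes = 63 * SECTOR_SIZE   # classic MS-DOS first-partition start
print(offset_bytes)               # 32256
print(offset_bytes % PAGE_SIZE)   # 3584: not page-aligned, so one page of
                                  # /dev/sda1 straddles two pages of /dev/sda
```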

-- 
error compiling committee.c: too many arguments to function


