* ftruncate-mmap: pages are lost after writing to mmaped file.
@ 2009-03-18 19:44 Ying Han
2009-03-18 22:11 ` Andrew Morton
0 siblings, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-03-18 19:44 UTC (permalink / raw)
To: linux-kernel, linux-mm, Andrew Morton, guichaz, Alex Khesin,
Mike Waychison, Rohit Seth
We triggered the failure during some internal experiment with
ftruncate/mmap/write/read sequence. And we found that some pages are
"lost" after writing to the mmaped file. which in the following test
cases (count >= 0).
First we deployed the test cases into group of machines and see about
>20% failure rate on average. Then, I did couple of experiment to try
to reproduce it on a single machine. what i found is that:
1. add a fsync after write the file, i can not reproduce this issue.
2. add memory pressure(mmap/mlock) while run the test in infinite
loop, the failure is reproduced quickly. ( background flushing ? )
The "bad pages" count differs each time from one digit to 4,5 digit
for 128M ftruncated file. and what i also found that the bad page
number are contiguous for each segment which total bad pages container
several segments. ext "1-4, 9-20, 48-50" ( batch flushing ? )
(The failure is reproduced based on 2.6.29-rc8, also happened on
2.6.18 kernel. . Here is the simple test case to reproduce it with
memory pressure. )
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
long kMemSize = 128 << 20;
int kPageSize = 4096;
int main(int argc, char **argv) {
int status;
int count = 0;
int i;
char *fname = "/root/test.mmap";
char *mem;
unlink(fname);
int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
status = ftruncate(fd, kMemSize);
mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
// Fill the memory with 1s.
memset(mem, 1, kMemSize);
for (i = 0; i < kMemSize; i++) {
int byte_good = mem[i] != 0;
if (!byte_good && ((i % kPageSize) == 0)) {
//printf("%d ", i / kPageSize);
count++;
}
}
munmap(mem, kMemSize);
close(fd);
unlink(fname);
if (count > 0) {
printf("Running %d bad page\n", count);
return 1;
}
return 0;
}
--Ying
^ permalink raw reply [flat|nested] 60+ messages in thread* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 19:44 ftruncate-mmap: pages are lost after writing to mmaped file Ying Han @ 2009-03-18 22:11 ` Andrew Morton 2009-03-18 22:40 ` Linus Torvalds 0 siblings, 1 reply; 60+ messages in thread From: Andrew Morton @ 2009-03-18 22:11 UTC (permalink / raw) To: Ying Han Cc: linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra, Linus Torvalds On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote: > We triggered the failure during some internal experiment with > ftruncate/mmap/write/read sequence. And we found that some pages are > "lost" after writing to the mmaped file. which in the following test > cases (count >= 0). > > First we deployed the test cases into group of machines and see about > >20% failure rate on average. Then, I did couple of experiment to try > to reproduce it on a single machine. what i found is that: > 1. add a fsync after write the file, i can not reproduce this issue. > 2. add memory pressure(mmap/mlock) while run the test in infinite > loop, the failure is reproduced quickly. ( background flushing ? ) > > The "bad pages" count differs each time from one digit to 4,5 digit > for 128M ftruncated file. and what i also found that the bad page > number are contiguous for each segment which total bad pages container > several segments. ext "1-4, 9-20, 48-50" ( batch flushing ? ) > > (The failure is reproduced based on 2.6.29-rc8, also happened on > 2.6.18 kernel. . Here is the simple test case to reproduce it with > memory pressure. ) Thanks. This will be a regression - the testing I did back in the days when I actually wrote stuff would have picked this up. Perhaps it is a 2.6.17 thing. Which, IIRC, is when we made the changes to redirty pages on each write fault. Or maybe it was something else. Nick, Peter: I'm in .au at preset, not able to build and run kernels - is this something you'd have time to look into please? Given the amount of time for which this bug has existed, I guess it isn't a 2.6.29 blocker, but once we've found out the cause we should have a little post-mortem to work out how a bug of this nature has gone undetected for so long. > #include <sys/mman.h> > #include <sys/types.h> > #include <fcntl.h> > #include <unistd.h> > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > > long kMemSize = 128 << 20; > int kPageSize = 4096; > > int main(int argc, char **argv) { > int status; > int count = 0; > int i; > char *fname = "/root/test.mmap"; > char *mem; > > unlink(fname); > int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600); > status = ftruncate(fd, kMemSize); > > mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > // Fill the memory with 1s. > memset(mem, 1, kMemSize); > > for (i = 0; i < kMemSize; i++) { > int byte_good = mem[i] != 0; > > if (!byte_good && ((i % kPageSize) == 0)) { > //printf("%d ", i / kPageSize); > count++; > } > } > > munmap(mem, kMemSize); > close(fd); > unlink(fname); > > if (count > 0) { > printf("Running %d bad page\n", count); > return 1; > } > return 0; > } > > --Ying ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 22:11 ` Andrew Morton @ 2009-03-18 22:40 ` Linus Torvalds 2009-03-18 23:18 ` Ying Han ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Linus Torvalds @ 2009-03-18 22:40 UTC (permalink / raw) To: Andrew Morton Cc: Ying Han, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Wed, 18 Mar 2009, Andrew Morton wrote: > On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote: > > > > The "bad pages" count differs each time from one digit to 4,5 digit > > for 128M ftruncated file. and what i also found that the bad page > > number are contiguous for each segment which total bad pages container > > several segments. ext "1-4, 9-20, 48-50" ( batch flushing ? ) Yeah, probably the batched write-out. Can you say what filesystem, and what mount-flags you use? Iirc, last time we had MAP_SHARED lost writes it was at least partly triggered by the filesystem doing its own flushing independently of the VM (ie ext3 with "data=journal", I think), so that kind of thing does tend to matter. See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d. > > (The failure is reproduced based on 2.6.29-rc8, also happened on > > 2.6.18 kernel. . Here is the simple test case to reproduce it with > > memory pressure. ) > > Thanks. This will be a regression - the testing I did back in the days > when I actually wrote stuff would have picked this up. > > Perhaps it is a 2.6.17 thing. Which, IIRC, is when we made the changes to > redirty pages on each write fault. Or maybe it was something else. Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking about Peter's exact dirty page tracking. If I recall correctly, that became then 2.6.19, and then had the horrible mm dirty bit loss that triggered in librtorrent downloads, which got fixed sometime after 2.6.20 (and back-ported). So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not related to the exact dirty tracking. The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm: tracking shared dirty pages"), but maybe you had something else in mind? > Given the amount of time for which this bug has existed, I guess it isn't a > 2.6.29 blocker, but once we've found out the cause we should have a little > post-mortem to work out how a bug of this nature has gone undetected for so > long. I'm somewhat surprised, because this test-program looks like a very simple version of the exact one that I used to track down the 2.6.20 mmap corruption problems. And that one got pretty heavily tested back then, when people were looking at it (December 2006) and then when trying out my fix for it. Ying Han - since you're all set up for testing this and have reproduced it on multiple kernels, can you try it on a few more kernel versions? It would be interesting to both go further back in time (say 2.6.15-ish), _and_ check something like 2.6.21 which had the exact dirty accounting fix. Maybe it's not really an old bug - maybe we re-introduced a bug that was fixed for a while. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 22:40 ` Linus Torvalds @ 2009-03-18 23:18 ` Ying Han 2009-03-18 23:36 ` Linus Torvalds 2009-03-20 0:34 ` Ying Han 2009-03-25 23:15 ` Ying Han 2 siblings, 1 reply; 60+ messages in thread From: Ying Han @ 2009-03-18 23:18 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Wed, Mar 18, 2009 at 3:40 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 18 Mar 2009, Andrew Morton wrote: > >> On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote: >> > >> > The "bad pages" count differs each time from one digit to 4,5 digit >> > for 128M ftruncated file. and what i also found that the bad page >> > number are contiguous for each segment which total bad pages container >> > several segments. ext "1-4, 9-20, 48-50" ( batch flushing ? ) > > Yeah, probably the batched write-out. > > Can you say what filesystem, and what mount-flags you use? Iirc, last time > we had MAP_SHARED lost writes it was at least partly triggered by the > filesystem doing its own flushing independently of the VM (ie ext3 with > "data=journal", I think), so that kind of thing does tend to matter. /etc/fstab "/dev/hda1 / ext2 defaults 1 0" > > See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d. > >> > (The failure is reproduced based on 2.6.29-rc8, also happened on >> > 2.6.18 kernel. . Here is the simple test case to reproduce it with >> > memory pressure. ) >> >> Thanks. This will be a regression - the testing I did back in the days >> when I actually wrote stuff would have picked this up. >> >> Perhaps it is a 2.6.17 thing. Which, IIRC, is when we made the changes to >> redirty pages on each write fault. Or maybe it was something else. > > Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking > about Peter's exact dirty page tracking. If I recall correctly, that > became then 2.6.19, and then had the horrible mm dirty bit loss that > triggered in librtorrent downloads, which got fixed sometime after 2.6.20 > (and back-ported). > > So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not > related to the exact dirty tracking. > > The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm: > tracking shared dirty pages"), but maybe you had something else in mind? > >> Given the amount of time for which this bug has existed, I guess it isn't a >> 2.6.29 blocker, but once we've found out the cause we should have a little >> post-mortem to work out how a bug of this nature has gone undetected for so >> long. > > I'm somewhat surprised, because this test-program looks like a very simple > version of the exact one that I used to track down the 2.6.20 mmap > corruption problems. And that one got pretty heavily tested back then, > when people were looking at it (December 2006) and then when trying out my > fix for it. > > Ying Han - since you're all set up for testing this and have reproduced it > on multiple kernels, can you try it on a few more kernel versions? It > would be interesting to both go further back in time (say 2.6.15-ish), > _and_ check something like 2.6.21 which had the exact dirty accounting > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that > was fixed for a while. I will give a try. > > Linus > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 23:18 ` Ying Han @ 2009-03-18 23:36 ` Linus Torvalds 2009-03-18 23:54 ` Ying Han 0 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2009-03-18 23:36 UTC (permalink / raw) To: Ying Han Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Wed, 18 Mar 2009, Ying Han wrote: > > > > Can you say what filesystem, and what mount-flags you use? Iirc, last time > > we had MAP_SHARED lost writes it was at least partly triggered by the > > filesystem doing its own flushing independently of the VM (ie ext3 with > > "data=journal", I think), so that kind of thing does tend to matter. > > /etc/fstab > "/dev/hda1 / ext2 defaults 1 0" Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At least Fedora will ignore the flags in it. What does /proc/mounts say? That should be a more reliable indication of what the kernel actually does. That said, I assume the ext2 part is accurate. Maybe that's why people haven't seen it - I guess most testing was done on ext3. It certainly was for me. > > Ying Han - since you're all set up for testing this and have reproduced it > > on multiple kernels, can you try it on a few more kernel versions? It > > would be interesting to both go further back in time (say 2.6.15-ish), > > _and_ check something like 2.6.21 which had the exact dirty accounting > > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that > > was fixed for a while. > > I will give a try. Thanks, Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 23:36 ` Linus Torvalds @ 2009-03-18 23:54 ` Ying Han 2009-03-19 15:48 ` Nick Piggin 0 siblings, 1 reply; 60+ messages in thread From: Ying Han @ 2009-03-18 23:54 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 18 Mar 2009, Ying Han wrote: >> > >> > Can you say what filesystem, and what mount-flags you use? Iirc, last time >> > we had MAP_SHARED lost writes it was at least partly triggered by the >> > filesystem doing its own flushing independently of the VM (ie ext3 with >> > "data=journal", I think), so that kind of thing does tend to matter. >> >> /etc/fstab >> "/dev/hda1 / ext2 defaults 1 0" > > Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At > least Fedora will ignore the flags in it. > > What does /proc/mounts say? That should be a more reliable indication of > what the kernel actually does. "/dev/root / ext2 rw,errors=continue 0 0" > > That said, I assume the ext2 part is accurate. Maybe that's why people > haven't seen it - I guess most testing was done on ext3. It certainly was > for me. > >> > Ying Han - since you're all set up for testing this and have reproduced it >> > on multiple kernels, can you try it on a few more kernel versions? It >> > would be interesting to both go further back in time (say 2.6.15-ish), >> > _and_ check something like 2.6.21 which had the exact dirty accounting >> > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that >> > was fixed for a while. >> >> I will give a try. > > Thanks, > > Linus > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 23:54 ` Ying Han @ 2009-03-19 15:48 ` Nick Piggin 2009-03-19 16:16 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 60+ messages in thread From: Nick Piggin @ 2009-03-19 15:48 UTC (permalink / raw) To: Ying Han, Jan Kara Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thursday 19 March 2009 10:54:33 Ying Han wrote: > On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > On Wed, 18 Mar 2009, Ying Han wrote: > >> > Can you say what filesystem, and what mount-flags you use? Iirc, last > >> > time we had MAP_SHARED lost writes it was at least partly triggered by > >> > the filesystem doing its own flushing independently of the VM (ie ext3 > >> > with "data=journal", I think), so that kind of thing does tend to > >> > matter. > >> > >> /etc/fstab > >> "/dev/hda1 / ext2 defaults 1 0" > > > > Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At > > least Fedora will ignore the flags in it. > > > > What does /proc/mounts say? That should be a more reliable indication of > > what the kernel actually does. > > "/dev/root / ext2 rw,errors=continue 0 0" No luck with finding the problem yet. But I think we do have a race in __set_page_dirty_buffers(): The page may not have buffers between the mapping->private_lock critical section and the __set_page_dirty call there. So between them, another thread might do a create_empty_buffers which can see !PageDirty and thus it will create clean buffers. The page will get dirtied by the original thread, but if the buffers are clean it can be cleaned without writing out buffers. Holding mapping->private_lock over the __set_page_dirty should fix it, although I guess you'd want to release it before calling __mark_inode_dirty so as not to put inode_lock under there. I have a patch for this if it sounds reasonable. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 15:48 ` Nick Piggin @ 2009-03-19 16:16 ` Peter Zijlstra 2009-03-19 16:36 ` Nick Piggin 2009-03-19 16:20 ` Linus Torvalds 2009-03-19 16:46 ` Jan Kara 2 siblings, 1 reply; 60+ messages in thread From: Peter Zijlstra @ 2009-03-19 16:16 UTC (permalink / raw) To: Nick Piggin Cc: Ying Han, Jan Kara, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth On Fri, 2009-03-20 at 02:48 +1100, Nick Piggin wrote: > On Thursday 19 March 2009 10:54:33 Ying Han wrote: > > On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds > > > > <torvalds@linux-foundation.org> wrote: > > > On Wed, 18 Mar 2009, Ying Han wrote: > > >> > Can you say what filesystem, and what mount-flags you use? Iirc, last > > >> > time we had MAP_SHARED lost writes it was at least partly triggered by > > >> > the filesystem doing its own flushing independently of the VM (ie ext3 > > >> > with "data=journal", I think), so that kind of thing does tend to > > >> > matter. > > >> > > >> /etc/fstab > > >> "/dev/hda1 / ext2 defaults 1 0" > > > > > > Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At > > > least Fedora will ignore the flags in it. > > > > > > What does /proc/mounts say? That should be a more reliable indication of > > > what the kernel actually does. > > > > "/dev/root / ext2 rw,errors=continue 0 0" > > No luck with finding the problem yet. > > But I think we do have a race in __set_page_dirty_buffers(): > > The page may not have buffers between the mapping->private_lock > critical section and the __set_page_dirty call there. So between > them, another thread might do a create_empty_buffers which can > see !PageDirty and thus it will create clean buffers. The page > will get dirtied by the original thread, but if the buffers are > clean it can be cleaned without writing out buffers. > > Holding mapping->private_lock over the __set_page_dirty should > fix it, although I guess you'd want to release it before calling > __mark_inode_dirty so as not to put inode_lock under there. I > have a patch for this if it sounds reasonable. When I first did those dirty tracking patches someone (I think Andrew) commented no the fact that I did set_page_dirty() under one of these inner locks.. /me frobs around in archives for a bit.. - fs/buffers.c try_to_free_buffers(): remove clear_page_dirty() from under ->private_lock. This seems to be save, since ->private_lock is used to serialize access to the buffers, not the page itself. Hmm, that's a slightly different issue... But yeah, your scenario makes heaps of sense. Can't we do the TestSetPageDirty() before private_lock ? It's currently done before tree_lock as well. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 16:16 ` Peter Zijlstra @ 2009-03-19 16:36 ` Nick Piggin 0 siblings, 0 replies; 60+ messages in thread From: Nick Piggin @ 2009-03-19 16:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Ying Han, Jan Kara, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth On Friday 20 March 2009 03:16:01 Peter Zijlstra wrote: > On Fri, 2009-03-20 at 02:48 +1100, Nick Piggin wrote: > > On Thursday 19 March 2009 10:54:33 Ying Han wrote: > > > On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds > > > > > > <torvalds@linux-foundation.org> wrote: > > > > On Wed, 18 Mar 2009, Ying Han wrote: > > > >> > Can you say what filesystem, and what mount-flags you use? Iirc, > > > >> > last time we had MAP_SHARED lost writes it was at least partly > > > >> > triggered by the filesystem doing its own flushing independently > > > >> > of the VM (ie ext3 with "data=journal", I think), so that kind of > > > >> > thing does tend to matter. > > > >> > > > >> /etc/fstab > > > >> "/dev/hda1 / ext2 defaults 1 0" > > > > > > > > Sadly, /etc/fstab is not necessarily accurate for the root > > > > filesystem. At least Fedora will ignore the flags in it. > > > > > > > > What does /proc/mounts say? That should be a more reliable indication > > > > of what the kernel actually does. > > > > > > "/dev/root / ext2 rw,errors=continue 0 0" > > > > No luck with finding the problem yet. > > > > But I think we do have a race in __set_page_dirty_buffers(): > > > > The page may not have buffers between the mapping->private_lock > > critical section and the __set_page_dirty call there. So between > > them, another thread might do a create_empty_buffers which can > > see !PageDirty and thus it will create clean buffers. The page > > will get dirtied by the original thread, but if the buffers are > > clean it can be cleaned without writing out buffers. > > > > Holding mapping->private_lock over the __set_page_dirty should > > fix it, although I guess you'd want to release it before calling > > __mark_inode_dirty so as not to put inode_lock under there. I > > have a patch for this if it sounds reasonable. > > When I first did those dirty tracking patches someone (I think Andrew) > commented no the fact that I did set_page_dirty() under one of these > inner locks.. > > /me frobs around in archives for a bit.. > > - fs/buffers.c try_to_free_buffers(): remove clear_page_dirty() from under > ->private_lock. This seems to be save, since ->private_lock is used to > serialize access to the buffers, not the page itself. > > Hmm, that's a slightly different issue... > > But yeah, your scenario makes heaps of sense. > > Can't we do the TestSetPageDirty() before private_lock ? It's currently > done before tree_lock as well. I think there might be issues with having a clean page but dirty buffers if you do it that way... At any rate, if we can solve the race without swapping the order, I think that would be safer. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 15:48 ` Nick Piggin 2009-03-19 16:16 ` Peter Zijlstra @ 2009-03-19 16:20 ` Linus Torvalds 2009-03-19 16:34 ` Nick Piggin 2009-03-19 16:46 ` Jan Kara 2 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2009-03-19 16:20 UTC (permalink / raw) To: Nick Piggin Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Fri, 20 Mar 2009, Nick Piggin wrote: > > But I think we do have a race in __set_page_dirty_buffers(): > > The page may not have buffers between the mapping->private_lock > critical section and the __set_page_dirty call there. So between > them, another thread might do a create_empty_buffers which can > see !PageDirty and thus it will create clean buffers. Hmm. Creating clean buffers is locked by the page lock, nothing else. And not all page dirtiers hold the page lock (in fact, most try to avoid it - the rule is that you either have to hold the page lock _or_ hold a reference to the 'mapping', and the latter is what the mmap code does, I think). So yeah, the page lock isn't sufficient. > Holding mapping->private_lock over the __set_page_dirty should > fix it, although I guess you'd want to release it before calling > __mark_inode_dirty so as not to put inode_lock under there. I > have a patch for this if it sounds reasonable. That would seem to make sense. Maybe moving the "TestSetPageDirty()" from inside __set_page_dirty() to the caller? Something like the appended? This is TOTALLY untested. Of course. Linus --- fs/buffer.c | 23 +++++++++++------------ 1 files changed, 11 insertions(+), 12 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 9f69741..891e1c7 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -760,15 +760,9 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); * If warn is true, then emit a warning if the page is not uptodate and has * not been truncated. */ -static int __set_page_dirty(struct page *page, +static void __set_page_dirty(struct page *page, struct address_space *mapping, int warn) { - if (unlikely(!mapping)) - return !TestSetPageDirty(page); - - if (TestSetPageDirty(page)) - return 0; - spin_lock_irq(&mapping->tree_lock); if (page->mapping) { /* Race with truncate? */ WARN_ON_ONCE(warn && !PageUptodate(page)); @@ -785,8 +779,6 @@ static int __set_page_dirty(struct page *page, } spin_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - - return 1; } /* @@ -816,6 +808,7 @@ static int __set_page_dirty(struct page *page, */ int __set_page_dirty_buffers(struct page *page) { + int newly_dirty; struct address_space *mapping = page_mapping(page); if (unlikely(!mapping)) @@ -831,9 +824,12 @@ int __set_page_dirty_buffers(struct page *page) bh = bh->b_this_page; } while (bh != head); } + newly_dirty = !TestSetPageDirty(page); spin_unlock(&mapping->private_lock); - return __set_page_dirty(page, mapping, 1); + if (newly_dirty) + __set_page_dirty(page, mapping, 1); + return newly_dirty; } EXPORT_SYMBOL(__set_page_dirty_buffers); @@ -1262,8 +1258,11 @@ void mark_buffer_dirty(struct buffer_head *bh) return; } - if (!test_set_buffer_dirty(bh)) - __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0); + if (!test_set_buffer_dirty(bh)) { + struct page *page = bh->b_page; + if (!TestSetPageDirty(page)) + __set_page_dirty(page, page_mapping(page), 0); + } } /* ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 16:20 ` Linus Torvalds @ 2009-03-19 16:34 ` Nick Piggin 2009-03-19 16:51 ` Linus Torvalds 0 siblings, 1 reply; 60+ messages in thread From: Nick Piggin @ 2009-03-19 16:34 UTC (permalink / raw) To: Linus Torvalds Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Friday 20 March 2009 03:20:57 Linus Torvalds wrote: > On Fri, 20 Mar 2009, Nick Piggin wrote: > > But I think we do have a race in __set_page_dirty_buffers(): > > > > The page may not have buffers between the mapping->private_lock > > critical section and the __set_page_dirty call there. So between > > them, another thread might do a create_empty_buffers which can > > see !PageDirty and thus it will create clean buffers. > > Hmm. > > Creating clean buffers is locked by the page lock, nothing else. And not > all page dirtiers hold the page lock (in fact, most try to avoid it - the > rule is that you either have to hold the page lock _or_ hold a reference > to the 'mapping', and the latter is what the mmap code does, I think). > > So yeah, the page lock isn't sufficient. No. FWIW, I thought there might be a race due to page fault code not holding page lock over set_page_dirty (well there *are* some kinds of races, but they're another story). So I tried out my patches that move that lock over set_page_dirty for __do_fault and do_wp_page (so the lock is held over pte_mkwrite and set_page_dirty), but that still didn't solve the problem either. > > Holding mapping->private_lock over the __set_page_dirty should > > fix it, although I guess you'd want to release it before calling > > __mark_inode_dirty so as not to put inode_lock under there. I > > have a patch for this if it sounds reasonable. > > That would seem to make sense. Maybe moving the "TestSetPageDirty()" from > inside __set_page_dirty() to the caller? Something like the appended? > > This is TOTALLY untested. Of course. Yeah, probably no need to hold private_lock while tagging the radix tree (which is what my version did). So maybe this one is a little better. I did test mine, it worked, but it didn't solve the problem. Still, it does appear to solve a real race, which we should close. > > Linus > > --- > fs/buffer.c | 23 +++++++++++------------ > 1 files changed, 11 insertions(+), 12 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index 9f69741..891e1c7 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -760,15 +760,9 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); > * If warn is true, then emit a warning if the page is not uptodate and > has * not been truncated. > */ > -static int __set_page_dirty(struct page *page, > +static void __set_page_dirty(struct page *page, > struct address_space *mapping, int warn) > { > - if (unlikely(!mapping)) > - return !TestSetPageDirty(page); > - > - if (TestSetPageDirty(page)) > - return 0; > - > spin_lock_irq(&mapping->tree_lock); > if (page->mapping) { /* Race with truncate? */ > WARN_ON_ONCE(warn && !PageUptodate(page)); > @@ -785,8 +779,6 @@ static int __set_page_dirty(struct page *page, > } > spin_unlock_irq(&mapping->tree_lock); > __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > - > - return 1; > } > > /* > @@ -816,6 +808,7 @@ static int __set_page_dirty(struct page *page, > */ > int __set_page_dirty_buffers(struct page *page) > { > + int newly_dirty; > struct address_space *mapping = page_mapping(page); > > if (unlikely(!mapping)) > @@ -831,9 +824,12 @@ int __set_page_dirty_buffers(struct page *page) > bh = bh->b_this_page; > } while (bh != head); > } > + newly_dirty = !TestSetPageDirty(page); > spin_unlock(&mapping->private_lock); > > - return __set_page_dirty(page, mapping, 1); > + if (newly_dirty) > + __set_page_dirty(page, mapping, 1); > + return newly_dirty; > } > EXPORT_SYMBOL(__set_page_dirty_buffers); > > @@ -1262,8 +1258,11 @@ void mark_buffer_dirty(struct buffer_head *bh) > return; > } > > - if (!test_set_buffer_dirty(bh)) > - __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0); > + if (!test_set_buffer_dirty(bh)) { > + struct page *page = bh->b_page; > + if (!TestSetPageDirty(page)) > + __set_page_dirty(page, page_mapping(page), 0); > + } > } > > /* ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 16:34 ` Nick Piggin @ 2009-03-19 16:51 ` Linus Torvalds 2009-03-19 17:03 ` Jan Kara 2009-03-19 20:21 ` Linus Torvalds 0 siblings, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2009-03-19 16:51 UTC (permalink / raw) To: Nick Piggin Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Fri, 20 Mar 2009, Nick Piggin wrote: > > Yeah, probably no need to hold private_lock while tagging the radix > tree (which is what my version did). So maybe this one is a little > better. I did test mine, it worked, but it didn't solve the problem. Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? I've not even tested - I assumed that I would have to boot into less memory and downgrade my filesystem to ext2, which made me hope somebody else would pick it up first ;) > Still, it does appear to solve a real race, which we should close. A very small and unlikely race, but yeah - the patch isn't large, removes more lines than it adds, and the code seems to make _more_ sense with it in place than without. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 16:51 ` Linus Torvalds @ 2009-03-19 17:03 ` Jan Kara 2009-03-19 17:06 ` Jan Kara 2009-03-19 20:05 ` Linus Torvalds 2009-03-19 20:21 ` Linus Torvalds 1 sibling, 2 replies; 60+ messages in thread From: Jan Kara @ 2009-03-19 17:03 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu 19-03-09 09:51:59, Linus Torvalds wrote: > > > On Fri, 20 Mar 2009, Nick Piggin wrote: > > > > Yeah, probably no need to hold private_lock while tagging the radix > > tree (which is what my version did). So maybe this one is a little > > better. I did test mine, it worked, but it didn't solve the problem. > > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? > I've not even tested - I assumed that I would have to boot into less > memory and downgrade my filesystem to ext2, which made me hope somebody > else would pick it up first ;) In thread http://lkml.org/lkml/2009/3/4/179 I've reported similar problem - write lost. I'm able to reproduce under UML linux at will. ext3 takes with 1KB blocksize about 20 minutes to hit the corruption, ext2 with 1 KB blocksize about an hour, ext2 with 4KB blocksize several hours... I've reported that also ordinary write() got lost once but that might have been an error in me reading the fsx logs since I never saw it again... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 17:03 ` Jan Kara @ 2009-03-19 17:06 ` Jan Kara 2009-03-19 20:05 ` Linus Torvalds 1 sibling, 0 replies; 60+ messages in thread From: Jan Kara @ 2009-03-19 17:06 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu 19-03-09 18:03:40, Jan Kara wrote: > On Thu 19-03-09 09:51:59, Linus Torvalds wrote: > > > > > > On Fri, 20 Mar 2009, Nick Piggin wrote: > > > > > > Yeah, probably no need to hold private_lock while tagging the radix > > > tree (which is what my version did). So maybe this one is a little > > > better. I did test mine, it worked, but it didn't solve the problem. > > > > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? > > I've not even tested - I assumed that I would have to boot into less > > memory and downgrade my filesystem to ext2, which made me hope somebody > > else would pick it up first ;) > In thread http://lkml.org/lkml/2009/3/4/179 I've reported similar problem > - write lost. I'm able to reproduce under UML linux at will. ext3 takes > with 1KB blocksize about 20 minutes to hit the corruption, ext2 with 1 KB > blocksize about an hour, ext2 with 4KB blocksize several hours... And BTW HP is reporting to us a similarly looking problem on reiserfs on ia64 with SLE11 (2.6.27 based). But that takes several days to reproduce with their proprietary test suite. Honza ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 17:03 ` Jan Kara 2009-03-19 17:06 ` Jan Kara @ 2009-03-19 20:05 ` Linus Torvalds 1 sibling, 0 replies; 60+ messages in thread From: Linus Torvalds @ 2009-03-19 20:05 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Ying Han, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, 19 Mar 2009, Jan Kara wrote: > > > > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? > > I've not even tested - I assumed that I would have to boot into less > > memory and downgrade my filesystem to ext2, which made me hope somebody > > else would pick it up first ;) > > In thread http://lkml.org/lkml/2009/3/4/179 I've reported similar problem > - write lost. I'm able to reproduce under UML linux at will. ext3 takes > with 1KB blocksize about 20 minutes to hit the corruption, ext2 with 1 KB > blocksize about an hour, ext2 with 4KB blocksize several hours... Hmm. I can't seem to recreate it with Ying Han's testprog, at least. That's with the fs/buffer.c patch applied, but that shouldn't matter since Nick reports that his (roughly equivalent) patch didn't help. I'll continue to run it for a while, but it's been going for about an hour now, with vmstat reporting bo/bi at roughly 5-10MB/s pretty continuosly. Of course, that's with my SSD's and an insanely fast Nehalem box, so my timings are likely rather different from most other peoples. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 16:51 ` Linus Torvalds 2009-03-19 17:03 ` Jan Kara @ 2009-03-19 20:21 ` Linus Torvalds 2009-03-19 21:17 ` Ying Han 2009-03-19 22:16 ` Jan Kara 1 sibling, 2 replies; 60+ messages in thread From: Linus Torvalds @ 2009-03-19 20:21 UTC (permalink / raw) To: Nick Piggin Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, 19 Mar 2009, Linus Torvalds wrote: > > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? > I've not even tested - I assumed that I would have to boot into less > memory and downgrade my filesystem to ext2, which made me hope somebody > else would pick it up first ;) Oh, btw, can people who see this (Ying Han, Nick and apparently Jan) detail their configurations, please? In particular - SMP? (CONFIG_SMP and how many cores do you have if so?) - PREEMPT (NONE/VOLUNTARY or full preempt?) - RCU (CLASSIC/TREE/PREEMPT?) since those affect the kinds of races we can see a lot. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 20:21 ` Linus Torvalds @ 2009-03-19 21:17 ` Ying Han 2009-03-19 22:16 ` Jan Kara 1 sibling, 0 replies; 60+ messages in thread From: Ying Han @ 2009-03-19 21:17 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Jan Kara, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, Mar 19, 2009 at 1:21 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, 19 Mar 2009, Linus Torvalds wrote: >> >> Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? >> I've not even tested - I assumed that I would have to boot into less >> memory and downgrade my filesystem to ext2, which made me hope somebody >> else would pick it up first ;) > > Oh, btw, can people who see this (Ying Han, Nick and apparently Jan) > detail their configurations, please? In particular > > - SMP? (CONFIG_SMP and how many cores do you have if so?) "CONFIG_SMP=y" the testing machine i was using has 16 cores. > > - PREEMPT (NONE/VOLUNTARY or full preempt?) CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set CONFIG_PREEMPT_BKL=y > > - RCU (CLASSIC/TREE/PREEMPT?) Not in my .config file. > > since those affect the kinds of races we can see a lot. > > Linus > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 20:21 ` Linus Torvalds 2009-03-19 21:17 ` Ying Han @ 2009-03-19 22:16 ` Jan Kara 1 sibling, 0 replies; 60+ messages in thread From: Jan Kara @ 2009-03-19 22:16 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Ying Han, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu 19-03-09 13:21:28, Linus Torvalds wrote: > > > On Thu, 19 Mar 2009, Linus Torvalds wrote: > > > > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? > > I've not even tested - I assumed that I would have to boot into less > > memory and downgrade my filesystem to ext2, which made me hope somebody > > else would pick it up first ;) > > Oh, btw, can people who see this (Ying Han, Nick and apparently Jan) > detail their configurations, please? In particular I'm able to see it only under UML. I've tried to reproduce under normal machine but without luck. My system is single CPU Athlon64. > - SMP? (CONFIG_SMP and how many cores do you have if so?) > > - PREEMPT (NONE/VOLUNTARY or full preempt?) Above two don't exist in UML, but I assume UML behaves basically like full preempt SMP... > - RCU (CLASSIC/TREE/PREEMPT?) RCU is CLASSIC. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 15:48 ` Nick Piggin 2009-03-19 16:16 ` Peter Zijlstra 2009-03-19 16:20 ` Linus Torvalds @ 2009-03-19 16:46 ` Jan Kara 2009-03-24 7:44 ` Nick Piggin 2 siblings, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-03-19 16:46 UTC (permalink / raw) To: Nick Piggin Cc: Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra Hi, On Fri 20-03-09 02:48:21, Nick Piggin wrote: > On Thursday 19 March 2009 10:54:33 Ying Han wrote: > > On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds > > > > <torvalds@linux-foundation.org> wrote: > > > On Wed, 18 Mar 2009, Ying Han wrote: > > >> > Can you say what filesystem, and what mount-flags you use? Iirc, last > > >> > time we had MAP_SHARED lost writes it was at least partly triggered by > > >> > the filesystem doing its own flushing independently of the VM (ie ext3 > > >> > with "data=journal", I think), so that kind of thing does tend to > > >> > matter. > > >> > > >> /etc/fstab > > >> "/dev/hda1 / ext2 defaults 1 0" > > > > > > Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At > > > least Fedora will ignore the flags in it. > > > > > > What does /proc/mounts say? That should be a more reliable indication of > > > what the kernel actually does. > > > > "/dev/root / ext2 rw,errors=continue 0 0" > > No luck with finding the problem yet. I've been staring at the code whole yesterday and didn't find the problem either. > But I think we do have a race in __set_page_dirty_buffers(): > > The page may not have buffers between the mapping->private_lock > critical section and the __set_page_dirty call there. So between > them, another thread might do a create_empty_buffers which can > see !PageDirty and thus it will create clean buffers. The page > will get dirtied by the original thread, but if the buffers are > clean it can be cleaned without writing out buffers. > > Holding mapping->private_lock over the __set_page_dirty should > fix it, although I guess you'd want to release it before calling > __mark_inode_dirty so as not to put inode_lock under there. I > have a patch for this if it sounds reasonable. Yes, that seems to be a bug - the function actually looked suspitious to me yesterday but I somehow convinced myself that it's fine. Probably because fsx-linux is single-threaded. Anyway, I've tried the following hack: diff --git a/fs/buffer.c b/fs/buffer.c index 985f617..f764c8a 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -763,10 +763,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); static int __set_page_dirty(struct page *page, struct address_space *mapping, int warn) { + int ret; + if (unlikely(!mapping)) return !TestSetPageDirty(page); - if (TestSetPageDirty(page)) + ret = TestSetPageDirty(page); + if (warn) + spin_unlock(&mapping->private_lock); + if (ret) return 0; spin_lock_irq(&mapping->tree_lock); @@ -831,8 +836,6 @@ int __set_page_dirty_buffers(struct page *page) bh = bh->b_this_page; } while (bh != head); } - spin_unlock(&mapping->private_lock); - return __set_page_dirty(page, mapping, 1); } But it didn't help my data corruption under UML :(. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-19 16:46 ` Jan Kara @ 2009-03-24 7:44 ` Nick Piggin 2009-03-24 10:27 ` Nick Piggin ` (3 more replies) 0 siblings, 4 replies; 60+ messages in thread From: Nick Piggin @ 2009-03-24 7:44 UTC (permalink / raw) To: Jan Kara, Martin J. Bligh, linux-ext4 Cc: Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Friday 20 March 2009 03:46:39 Jan Kara wrote: > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > Holding mapping->private_lock over the __set_page_dirty should > > fix it, although I guess you'd want to release it before calling > > __mark_inode_dirty so as not to put inode_lock under there. I > > have a patch for this if it sounds reasonable. > > Yes, that seems to be a bug - the function actually looked suspitious to > me yesterday but I somehow convinced myself that it's fine. Probably > because fsx-linux is single-threaded. After a whole lot of chasing my own tail in the VM and buffer layers, I think it is a problem in ext2 (and I haven't been able to reproduce with ext3 yet, which might lend weight to that, although as we have seen, it is very timing dependent). That would be slightly unfortunate because we still have Jan's ext3 problem, and also another reported problem of corruption on ext3 (on brd driver). Anyway, when I have reproduced the problem with the test case, the "lost" writes are all reported to be holes. Unfortunately, that doesn't point straight to the filesystem, because ext2 allocates blocks in this case at writeout time, so if dirty bits are getting lost, then it would be normal to see holes. I then put in a whole lot of extra infrastructure to track metadata about each struct page (when it was last written out, when it last had the number of writable ptes reach 0, when the dirty bits were last cleared etc). And none of the normal asertions were triggering: eg. when any page is removed from pagecache (except truncates), it has always had all its buffers written out *after* all ptes were made readonly or unmapped. Lots of other tests and crap like that. So I tried what I should have done to start with and did an e2fsck after seeing corruption. Yes, it comes up with errors. Now that is unusual because that should be largely insulated from the vm: if a dirty bit gets lost, then the filesystem image should be quite happy and error-free with a hole or unwritten data there. I don't know ext? locking very well, except that it looks pretty overly complex and crufty. Usually, blocks are instantiated by write(2), under i_mutex, serialising the allocator somewhat. mmap-write blocks are instantiated at writeout time, unserialised. I moved truncate_mutex to cover the entire get_blocks function, and can no longer trigger the problem. Might be a timing issue though -- Ying, can you try this and see if you can still reproduce? I close my eyes and pick something out of a hat. a686cd89. Search for XXX. Nice. Whether or not this cased the problem, can someone please tell me why it got merged in that state? I'm leaving ext3 running for now. It looks like a nasty task to bisect ext2 down to that commit :( and I would be more interested in trying to reproduce Jan's ext3 problem, however, because I'm not too interested in diving into ext2 locking to work out exactly what is racing and how to fix it properly. I suspect it would be most productive to wire up some ioctls right into the block allocator/lookup and code up a userspace tester for it that could probably stress it a lot harder than kernel writeout can. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 7:44 ` Nick Piggin @ 2009-03-24 10:27 ` Nick Piggin 2009-03-24 10:32 ` Andrew Morton ` (2 subsequent siblings) 3 siblings, 0 replies; 60+ messages in thread From: Nick Piggin @ 2009-03-24 10:27 UTC (permalink / raw) To: Jan Kara Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tuesday 24 March 2009 18:44:21 Nick Piggin wrote: > I close my eyes and pick something out of a hat. a686cd89. Search for XXX. > Nice. Whether or not this cased the problem, can someone please tell me > why it got merged in that state? Actually I must be wrong about this if the problem was reproduced in 2.6.18. Question still stands, though. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 7:44 ` Nick Piggin 2009-03-24 10:27 ` Nick Piggin @ 2009-03-24 10:32 ` Andrew Morton 2009-03-24 15:35 ` Nick Piggin 2009-03-26 0:03 ` Ying Han 2009-03-24 12:39 ` Jan Kara 2009-03-27 20:35 ` Ying Han 3 siblings, 2 replies; 60+ messages in thread From: Andrew Morton @ 2009-03-24 10:32 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > Holding mapping->private_lock over the __set_page_dirty should > > > fix it, although I guess you'd want to release it before calling > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > have a patch for this if it sounds reasonable. > > > > Yes, that seems to be a bug - the function actually looked suspitious to > > me yesterday but I somehow convinced myself that it's fine. Probably > > because fsx-linux is single-threaded. > > > After a whole lot of chasing my own tail in the VM and buffer layers, > I think it is a problem in ext2 (and I haven't been able to reproduce > with ext3 yet, which might lend weight to that, although as we have > seen, it is very timing dependent). > > That would be slightly unfortunate because we still have Jan's ext3 > problem, and also another reported problem of corruption on ext3 (on > brd driver). > > Anyway, when I have reproduced the problem with the test case, the > "lost" writes are all reported to be holes. Unfortunately, that doesn't > point straight to the filesystem, because ext2 allocates blocks in this > case at writeout time, so if dirty bits are getting lost, then it would > be normal to see holes. > > I then put in a whole lot of extra infrastructure to track metadata about > each struct page (when it was last written out, when it last had the number > of writable ptes reach 0, when the dirty bits were last cleared etc). And > none of the normal asertions were triggering: eg. when any page is removed > from pagecache (except truncates), it has always had all its buffers > written out *after* all ptes were made readonly or unmapped. Lots of other > tests and crap like that. > > So I tried what I should have done to start with and did an e2fsck after > seeing corruption. Yes, it comes up with errors. Do you recall what the errors were? > Now that is unusual > because that should be largely insulated from the vm: if a dirty bit gets > lost, then the filesystem image should be quite happy and error-free with > a hole or unwritten data there. > > I don't know ext? locking very well, except that it looks pretty overly > complex and crufty. > > Usually, blocks are instantiated by write(2), under i_mutex, serialising > the allocator somewhat. mmap-write blocks are instantiated at writeout > time, unserialised. I moved truncate_mutex to cover the entire get_blocks > function, and can no longer trigger the problem. Might be a timing issue > though -- Ying, can you try this and see if you can still reproduce? > > I close my eyes and pick something out of a hat. a686cd89. Search for XXX. > Nice. Whether or not this cased the problem, can someone please tell me > why it got merged in that state? > > I'm leaving ext3 running for now. It looks like a nasty task to bisect > ext2 down to that commit :( and I would be more interested in trying to > reproduce Jan's ext3 problem, however, because I'm not too interested in > diving into ext2 locking to work out exactly what is racing and how to > fix it properly. I suspect it would be most productive to wire up some > ioctls right into the block allocator/lookup and code up a userspace > tester for it that could probably stress it a lot harder than kernel > writeout can. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 10:32 ` Andrew Morton @ 2009-03-24 15:35 ` Nick Piggin 2009-03-26 18:29 ` Jan Kara 2009-03-26 0:03 ` Ying Han 1 sibling, 1 reply; 60+ messages in thread From: Nick Piggin @ 2009-03-24 15:35 UTC (permalink / raw) To: Andrew Morton Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote: > On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > Holding mapping->private_lock over the __set_page_dirty should > > > > fix it, although I guess you'd want to release it before calling > > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > > have a patch for this if it sounds reasonable. > > > > > > Yes, that seems to be a bug - the function actually looked suspitious > > > to me yesterday but I somehow convinced myself that it's fine. Probably > > > because fsx-linux is single-threaded. > > > > After a whole lot of chasing my own tail in the VM and buffer layers, > > I think it is a problem in ext2 (and I haven't been able to reproduce > > with ext3 yet, which might lend weight to that, although as we have > > seen, it is very timing dependent). > > > > That would be slightly unfortunate because we still have Jan's ext3 > > problem, and also another reported problem of corruption on ext3 (on > > brd driver). > > > > Anyway, when I have reproduced the problem with the test case, the > > "lost" writes are all reported to be holes. Unfortunately, that doesn't > > point straight to the filesystem, because ext2 allocates blocks in this > > case at writeout time, so if dirty bits are getting lost, then it would > > be normal to see holes. > > > > I then put in a whole lot of extra infrastructure to track metadata about > > each struct page (when it was last written out, when it last had the > > number of writable ptes reach 0, when the dirty bits were last cleared > > etc). And none of the normal asertions were triggering: eg. when any page > > is removed from pagecache (except truncates), it has always had all its > > buffers written out *after* all ptes were made readonly or unmapped. Lots > > of other tests and crap like that. > > > > So I tried what I should have done to start with and did an e2fsck after > > seeing corruption. Yes, it comes up with errors. > > Do you recall what the errors were? OK, after running several tests in parallel and having 3 of them blow up, I unmounted the fs (so error-case files are still intact). # e2fsck -fn /dev/ram0 e2fsck 1.41.3 (12-Oct-2008) Pass 1: Checking inodes, blocks, and sizes Inode 16, i_blocks is 131594, should be 131566. Fix? no Inode 18, i_blocks is 131588, should be 131576. Fix? no Inode 21, i_blocks is 131594, should be 131552. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: -(628209--628220) -628231 -628233 -(638751--638755) -638765 -(646271--646295) -(646301--646304) -647609 -(651501--651505) -651509 -(651719--651726) -(651732--651733) -(665666--665670) Fix? no /dev/ram0: ********** WARNING: Filesystem still has errors ********** /dev/ram0: 21/229376 files (4.8% non-contiguous), 407105/3670016 blocks ino 16, 18, 21 of course are the files with errors. inode 18 is the simplest case with just one hole, so let's look at that: #hexdump file9 0000000 ffff ffff ffff ffff ffff ffff ffff ffff * 3c8c000 0000 0000 0000 0000 0000 0000 0000 0000 * 3c8d400 ffff ffff ffff ffff ffff ffff ffff ffff * 4000000 Let's take a look at our hole then: #./bmap file9 // bmap is modified to print hex offsets [... lots of stuff ...] 3c82000-3c82c00: 26fd0400-26fd1000 (1000) 3c83000-3c83c00: 26fd3400-26fd4000 (1000) 3c84000-3c84c00: 26fc9c00-26fca800 (1000) 3c85000-3c85c00: 26fcc400-26fcd000 (1000) 3c86000-3c86c00: 26fcf400-26fd0000 (1000) 3c87000-3c87c00: 26fd2400-26fd3000 (1000) 3c88000-3c88c00: 26fd5400-26fd6000 (1000) 3c89000-3c8bc00: 26fd7400-26fda000 (3000) 3c8c000-3c8c000: 0-0 (400) 3c8c400-3c8c400: 0-0 (400) 3c8c800-3c8c800: 0-0 (400) 3c8cc00-3c8cc00: 0-0 (400) 3c8d000-3c8d000: 0-0 (400) 3c8d400-3c8dc00: 26fcb800-26fcc000 (c00) 3c8e000-3c8ec00: 26fce400-26fcf000 (1000) 3c8f000-3c8fc00: 26fd1400-26fd2000 (1000) 3c90000-3c99c00: 27924400-2792e000 (a000) 3c9a000-3c9ac00: 2792f000-2792fc00 (1000) 3c9b000-3c9bc00: 27931000-27931c00 (1000) 3c9c000-3c9cc00: 27933000-27933c00 (1000) 3c9d000-3c9dc00: 27935000-27935c00 (1000) 3c9e000-3c9ec00: 27938000-27938c00 (1000) 3c9f000-3c9fc00: 2793a000-2793ac00 (1000) [... lots more stuff ...] 3.5G filesystem image bzip2s down to 500K if anybody wants it I can send it privately. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 15:35 ` Nick Piggin @ 2009-03-26 18:29 ` Jan Kara 0 siblings, 0 replies; 60+ messages in thread From: Jan Kara @ 2009-03-26 18:29 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Wed 25-03-09 02:35:01, Nick Piggin wrote: > On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote: > > On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> > wrote: > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > > Holding mapping->private_lock over the __set_page_dirty should > > > > > fix it, although I guess you'd want to release it before calling > > > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > > > have a patch for this if it sounds reasonable. > > > > > > > > Yes, that seems to be a bug - the function actually looked suspitious > > > > to me yesterday but I somehow convinced myself that it's fine. Probably > > > > because fsx-linux is single-threaded. > > > > > > After a whole lot of chasing my own tail in the VM and buffer layers, > > > I think it is a problem in ext2 (and I haven't been able to reproduce > > > with ext3 yet, which might lend weight to that, although as we have > > > seen, it is very timing dependent). > > > > > > That would be slightly unfortunate because we still have Jan's ext3 > > > problem, and also another reported problem of corruption on ext3 (on > > > brd driver). > > > > > > Anyway, when I have reproduced the problem with the test case, the > > > "lost" writes are all reported to be holes. Unfortunately, that doesn't > > > point straight to the filesystem, because ext2 allocates blocks in this > > > case at writeout time, so if dirty bits are getting lost, then it would > > > be normal to see holes. > > > > > > I then put in a whole lot of extra infrastructure to track metadata about > > > each struct page (when it was last written out, when it last had the > > > number of writable ptes reach 0, when the dirty bits were last cleared > > > etc). And none of the normal asertions were triggering: eg. when any page > > > is removed from pagecache (except truncates), it has always had all its > > > buffers written out *after* all ptes were made readonly or unmapped. Lots > > > of other tests and crap like that. > > > > > > So I tried what I should have done to start with and did an e2fsck after > > > seeing corruption. Yes, it comes up with errors. > > > > Do you recall what the errors were? > > OK, after running several tests in parallel and having 3 of them > blow up, I unmounted the fs (so error-case files are still intact). Nick, what tests do you use? Because on the first reading the ext2 code looks correct so I'll probably have to reproduce the corruption... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 10:32 ` Andrew Morton 2009-03-24 15:35 ` Nick Piggin @ 2009-03-26 0:03 ` Ying Han 1 sibling, 0 replies; 60+ messages in thread From: Ying Han @ 2009-03-26 0:03 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue, Mar 24, 2009 at 3:32 AM, Andrew Morton <akpm@linux-foundation.org> wrote: > On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> On Friday 20 March 2009 03:46:39 Jan Kara wrote: >> > On Fri 20-03-09 02:48:21, Nick Piggin wrote: >> >> > > Holding mapping->private_lock over the __set_page_dirty should >> > > fix it, although I guess you'd want to release it before calling >> > > __mark_inode_dirty so as not to put inode_lock under there. I >> > > have a patch for this if it sounds reasonable. >> > >> > Yes, that seems to be a bug - the function actually looked suspitious to >> > me yesterday but I somehow convinced myself that it's fine. Probably >> > because fsx-linux is single-threaded. >> >> >> After a whole lot of chasing my own tail in the VM and buffer layers, >> I think it is a problem in ext2 (and I haven't been able to reproduce >> with ext3 yet, which might lend weight to that, although as we have >> seen, it is very timing dependent). >> >> That would be slightly unfortunate because we still have Jan's ext3 >> problem, and also another reported problem of corruption on ext3 (on >> brd driver). >> >> Anyway, when I have reproduced the problem with the test case, the >> "lost" writes are all reported to be holes. Unfortunately, that doesn't >> point straight to the filesystem, because ext2 allocates blocks in this >> case at writeout time, so if dirty bits are getting lost, then it would >> be normal to see holes. >> >> I then put in a whole lot of extra infrastructure to track metadata about >> each struct page (when it was last written out, when it last had the number >> of writable ptes reach 0, when the dirty bits were last cleared etc). And >> none of the normal asertions were triggering: eg. when any page is removed >> from pagecache (except truncates), it has always had all its buffers >> written out *after* all ptes were made readonly or unmapped. Lots of other >> tests and crap like that. >> >> So I tried what I should have done to start with and did an e2fsck after >> seeing corruption. Yes, it comes up with errors. > > Do you recall what the errors were? I run e2fsck on the partition after the failure happened and here is what i saw, not sure if that is the same message Jan looked at: e2fsck 1.41.3 (12-Oct-2008) Warning! /dev/hda1 is mounted. /dev/hda1 contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: +74915 -195111 -224680 Fix? no Free blocks count wrong for group #6 (170, counted=169). Fix? no Free blocks count wrong (10120, counted=523). Fix? no Free inodes count wrong (95678, counted=95672). Fix? no /dev/hda1: ********** WARNING: Filesystem still has errors ********** /dev/hda1: 35938/131616 files (1.5% non-contiguous), 252936/263056 blocks --Ying > >> Now that is unusual >> because that should be largely insulated from the vm: if a dirty bit gets >> lost, then the filesystem image should be quite happy and error-free with >> a hole or unwritten data there. >> >> I don't know ext? locking very well, except that it looks pretty overly >> complex and crufty. >> >> Usually, blocks are instantiated by write(2), under i_mutex, serialising >> the allocator somewhat. mmap-write blocks are instantiated at writeout >> time, unserialised. I moved truncate_mutex to cover the entire get_blocks >> function, and can no longer trigger the problem. Might be a timing issue >> though -- Ying, can you try this and see if you can still reproduce? >> >> I close my eyes and pick something out of a hat. a686cd89. Search for XXX. >> Nice. Whether or not this cased the problem, can someone please tell me >> why it got merged in that state? >> >> I'm leaving ext3 running for now. It looks like a nasty task to bisect >> ext2 down to that commit :( and I would be more interested in trying to >> reproduce Jan's ext3 problem, however, because I'm not too interested in >> diving into ext2 locking to work out exactly what is racing and how to >> fix it properly. I suspect it would be most productive to wire up some >> ioctls right into the block allocator/lookup and code up a userspace >> tester for it that could probably stress it a lot harder than kernel >> writeout can. > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 7:44 ` Nick Piggin 2009-03-24 10:27 ` Nick Piggin 2009-03-24 10:32 ` Andrew Morton @ 2009-03-24 12:39 ` Jan Kara 2009-03-24 12:55 ` Jan Kara 2009-03-27 20:35 ` Ying Han 3 siblings, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-03-24 12:39 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra Hi, On Tue 24-03-09 18:44:21, Nick Piggin wrote: > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > Holding mapping->private_lock over the __set_page_dirty should > > > fix it, although I guess you'd want to release it before calling > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > have a patch for this if it sounds reasonable. > > > > Yes, that seems to be a bug - the function actually looked suspitious to > > me yesterday but I somehow convinced myself that it's fine. Probably > > because fsx-linux is single-threaded. > > > After a whole lot of chasing my own tail in the VM and buffer layers, > I think it is a problem in ext2 (and I haven't been able to reproduce > with ext3 yet, which might lend weight to that, although as we have > seen, it is very timing dependent). > > That would be slightly unfortunate because we still have Jan's ext3 > problem, and also another reported problem of corruption on ext3 (on > brd driver). > > Anyway, when I have reproduced the problem with the test case, the > "lost" writes are all reported to be holes. Unfortunately, that doesn't > point straight to the filesystem, because ext2 allocates blocks in this > case at writeout time, so if dirty bits are getting lost, then it would > be normal to see holes. > > I then put in a whole lot of extra infrastructure to track metadata about > each struct page (when it was last written out, when it last had the number > of writable ptes reach 0, when the dirty bits were last cleared etc). And > none of the normal asertions were triggering: eg. when any page is removed > from pagecache (except truncates), it has always had all its buffers > written out *after* all ptes were made readonly or unmapped. Lots of other > tests and crap like that. I see we're going the same way ;) > So I tried what I should have done to start with and did an e2fsck after > seeing corruption. Yes, it comes up with errors. Now that is unusual > because that should be largely insulated from the vm: if a dirty bit gets > lost, then the filesystem image should be quite happy and error-free with > a hole or unwritten data there. This is different for me. I see no corruption on the filesystem with ext3. Anyway, errors from fsck would be useful to see what kind of corruption you saw. > I don't know ext? locking very well, except that it looks pretty overly > complex and crufty. > > Usually, blocks are instantiated by write(2), under i_mutex, serialising > the allocator somewhat. mmap-write blocks are instantiated at writeout > time, unserialised. I moved truncate_mutex to cover the entire get_blocks > function, and can no longer trigger the problem. Might be a timing issue > though -- Ying, can you try this and see if you can still reproduce? > > I close my eyes and pick something out of a hat. a686cd89. Search for XXX. > Nice. Whether or not this cased the problem, can someone please tell me > why it got merged in that state? Maybe, I see it did some changes to ext2_get_blocks() which could be where the problem was introduced... > I'm leaving ext3 running for now. It looks like a nasty task to bisect > ext2 down to that commit :( and I would be more interested in trying to > reproduce Jan's ext3 problem, however, because I'm not too interested in > diving into ext2 locking to work out exactly what is racing and how to > fix it properly. I suspect it would be most productive to wire up some > ioctls right into the block allocator/lookup and code up a userspace > tester for it that could probably stress it a lot harder than kernel > writeout can. Yes, what I observed with ext3 so far is that data is properly copied and page marked dirty when the data is copied in. But then at some point dirty bit is cleared via block_write_full_page() but we don't get to submitting at least one buffer in that page. I'm now debugging which path we take so that this happens... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 12:39 ` Jan Kara @ 2009-03-24 12:55 ` Jan Kara 2009-03-24 13:26 ` Jan Kara 0 siblings, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-03-24 12:55 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue 24-03-09 13:39:36, Jan Kara wrote: > Hi, > > On Tue 24-03-09 18:44:21, Nick Piggin wrote: > > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > > > Holding mapping->private_lock over the __set_page_dirty should > > > > fix it, although I guess you'd want to release it before calling > > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > > have a patch for this if it sounds reasonable. > > > > > > Yes, that seems to be a bug - the function actually looked suspitious to > > > me yesterday but I somehow convinced myself that it's fine. Probably > > > because fsx-linux is single-threaded. > > > > > > After a whole lot of chasing my own tail in the VM and buffer layers, > > I think it is a problem in ext2 (and I haven't been able to reproduce > > with ext3 yet, which might lend weight to that, although as we have > > seen, it is very timing dependent). > > > > That would be slightly unfortunate because we still have Jan's ext3 > > problem, and also another reported problem of corruption on ext3 (on > > brd driver). > > > > Anyway, when I have reproduced the problem with the test case, the > > "lost" writes are all reported to be holes. Unfortunately, that doesn't > > point straight to the filesystem, because ext2 allocates blocks in this > > case at writeout time, so if dirty bits are getting lost, then it would > > be normal to see holes. > > > > I then put in a whole lot of extra infrastructure to track metadata about > > each struct page (when it was last written out, when it last had the number > > of writable ptes reach 0, when the dirty bits were last cleared etc). And > > none of the normal asertions were triggering: eg. when any page is removed > > from pagecache (except truncates), it has always had all its buffers > > written out *after* all ptes were made readonly or unmapped. Lots of other > > tests and crap like that. > I see we're going the same way ;) > > > So I tried what I should have done to start with and did an e2fsck after > > seeing corruption. Yes, it comes up with errors. Now that is unusual > > because that should be largely insulated from the vm: if a dirty bit gets > > lost, then the filesystem image should be quite happy and error-free with > > a hole or unwritten data there. > This is different for me. I see no corruption on the filesystem with > ext3. Anyway, errors from fsck would be useful to see what kind of > corruption you saw. > > > I don't know ext? locking very well, except that it looks pretty overly > > complex and crufty. > > > > Usually, blocks are instantiated by write(2), under i_mutex, serialising > > the allocator somewhat. mmap-write blocks are instantiated at writeout > > time, unserialised. I moved truncate_mutex to cover the entire get_blocks > > function, and can no longer trigger the problem. Might be a timing issue > > though -- Ying, can you try this and see if you can still reproduce? > > > > I close my eyes and pick something out of a hat. a686cd89. Search for XXX. > > Nice. Whether or not this cased the problem, can someone please tell me > > why it got merged in that state? > Maybe, I see it did some changes to ext2_get_blocks() which could be > where the problem was introduced... > > > I'm leaving ext3 running for now. It looks like a nasty task to bisect > > ext2 down to that commit :( and I would be more interested in trying to > > reproduce Jan's ext3 problem, however, because I'm not too interested in > > diving into ext2 locking to work out exactly what is racing and how to > > fix it properly. I suspect it would be most productive to wire up some > > ioctls right into the block allocator/lookup and code up a userspace > > tester for it that could probably stress it a lot harder than kernel > > writeout can. > Yes, what I observed with ext3 so far is that data is properly copied and > page marked dirty when the data is copied in. But then at some point dirty > bit is cleared via block_write_full_page() but we don't get to submitting > at least one buffer in that page. I'm now debugging which path we take so > that this happens... And one more interesting thing I don't yet fully understand - I see pages having PageError() set when they are removed from page cache (and they have been faulted in before). It's probably some interaction with pagecache readahead... Honza ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 12:55 ` Jan Kara @ 2009-03-24 13:26 ` Jan Kara 2009-03-24 14:01 ` Chris Mason 2009-03-24 14:30 ` Nick Piggin 0 siblings, 2 replies; 60+ messages in thread From: Jan Kara @ 2009-03-24 13:26 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue 24-03-09 13:55:10, Jan Kara wrote: > On Tue 24-03-09 13:39:36, Jan Kara wrote: > > Hi, > > > > On Tue 24-03-09 18:44:21, Nick Piggin wrote: > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > > > > > Holding mapping->private_lock over the __set_page_dirty should > > > > > fix it, although I guess you'd want to release it before calling > > > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > > > have a patch for this if it sounds reasonable. > > > > > > > > Yes, that seems to be a bug - the function actually looked suspitious to > > > > me yesterday but I somehow convinced myself that it's fine. Probably > > > > because fsx-linux is single-threaded. > > > > > > > > > After a whole lot of chasing my own tail in the VM and buffer layers, > > > I think it is a problem in ext2 (and I haven't been able to reproduce > > > with ext3 yet, which might lend weight to that, although as we have > > > seen, it is very timing dependent). > > > > > > That would be slightly unfortunate because we still have Jan's ext3 > > > problem, and also another reported problem of corruption on ext3 (on > > > brd driver). > > > > > > Anyway, when I have reproduced the problem with the test case, the > > > "lost" writes are all reported to be holes. Unfortunately, that doesn't > > > point straight to the filesystem, because ext2 allocates blocks in this > > > case at writeout time, so if dirty bits are getting lost, then it would > > > be normal to see holes. > > > > > > I then put in a whole lot of extra infrastructure to track metadata about > > > each struct page (when it was last written out, when it last had the number > > > of writable ptes reach 0, when the dirty bits were last cleared etc). And > > > none of the normal asertions were triggering: eg. when any page is removed > > > from pagecache (except truncates), it has always had all its buffers > > > written out *after* all ptes were made readonly or unmapped. Lots of other > > > tests and crap like that. > > I see we're going the same way ;) > > > > > So I tried what I should have done to start with and did an e2fsck after > > > seeing corruption. Yes, it comes up with errors. Now that is unusual > > > because that should be largely insulated from the vm: if a dirty bit gets > > > lost, then the filesystem image should be quite happy and error-free with > > > a hole or unwritten data there. > > This is different for me. I see no corruption on the filesystem with > > ext3. Anyway, errors from fsck would be useful to see what kind of > > corruption you saw. > > > > > I don't know ext? locking very well, except that it looks pretty overly > > > complex and crufty. > > > > > > Usually, blocks are instantiated by write(2), under i_mutex, serialising > > > the allocator somewhat. mmap-write blocks are instantiated at writeout > > > time, unserialised. I moved truncate_mutex to cover the entire get_blocks > > > function, and can no longer trigger the problem. Might be a timing issue > > > though -- Ying, can you try this and see if you can still reproduce? > > > > > > I close my eyes and pick something out of a hat. a686cd89. Search for XXX. > > > Nice. Whether or not this cased the problem, can someone please tell me > > > why it got merged in that state? > > Maybe, I see it did some changes to ext2_get_blocks() which could be > > where the problem was introduced... > > > > > I'm leaving ext3 running for now. It looks like a nasty task to bisect > > > ext2 down to that commit :( and I would be more interested in trying to > > > reproduce Jan's ext3 problem, however, because I'm not too interested in > > > diving into ext2 locking to work out exactly what is racing and how to > > > fix it properly. I suspect it would be most productive to wire up some > > > ioctls right into the block allocator/lookup and code up a userspace > > > tester for it that could probably stress it a lot harder than kernel > > > writeout can. > > Yes, what I observed with ext3 so far is that data is properly copied and > > page marked dirty when the data is copied in. But then at some point dirty > > bit is cleared via block_write_full_page() but we don't get to submitting > > at least one buffer in that page. I'm now debugging which path we take so > > that this happens... > And one more interesting thing I don't yet fully understand - I see pages > having PageError() set when they are removed from page cache (and they have > been faulted in before). It's probably some interaction with pagecache > readahead... Argh... So the problem seems to be that get_block() occasionally returns ENOSPC and we then discard the dirty data (hmm, we could give at least a warning for that). I'm not yet sure why getblock behaves like this because the filesystem seems to have enough space but anyway this seems to be some strange fs trouble as well. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 13:26 ` Jan Kara @ 2009-03-24 14:01 ` Chris Mason 2009-03-24 14:07 ` Jan Kara 2009-03-24 14:30 ` Nick Piggin 1 sibling, 1 reply; 60+ messages in thread From: Chris Mason @ 2009-03-24 14:01 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote: > On Tue 24-03-09 13:55:10, Jan Kara wrote: > > And one more interesting thing I don't yet fully understand - I see pages > > having PageError() set when they are removed from page cache (and they have > > been faulted in before). It's probably some interaction with pagecache > > readahead... > Argh... So the problem seems to be that get_block() occasionally returns > ENOSPC and we then discard the dirty data (hmm, we could give at least a > warning for that). I'm not yet sure why getblock behaves like this because > the filesystem seems to have enough space but anyway this seems to be some > strange fs trouble as well. > Ouch. Perhaps the free space is waiting on a journal commit? -chris ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 14:01 ` Chris Mason @ 2009-03-24 14:07 ` Jan Kara 2009-03-26 8:18 ` Aneesh Kumar K.V 0 siblings, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-03-24 14:07 UTC (permalink / raw) To: Chris Mason Cc: Jan Kara, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue 24-03-09 10:01:45, Chris Mason wrote: > On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote: > > On Tue 24-03-09 13:55:10, Jan Kara wrote: > > > > And one more interesting thing I don't yet fully understand - I see pages > > > having PageError() set when they are removed from page cache (and they have > > > been faulted in before). It's probably some interaction with pagecache > > > readahead... > > Argh... So the problem seems to be that get_block() occasionally returns > > ENOSPC and we then discard the dirty data (hmm, we could give at least a > > warning for that). I'm not yet sure why getblock behaves like this because > > the filesystem seems to have enough space but anyway this seems to be some > > strange fs trouble as well. > > > > Ouch. Perhaps the free space is waiting on a journal commit? Yes, exactly. I've already found there's lot of space hold by the committing transaction (it can easily hold a few hundred megs or a few gigs with larger journal and my UML images aren't that big...). And writepage() implementation in ext3 does not have a logic to retry. Also block_write_full_page() clears buffers dirty bits so it's not easy to retry even if we did it. I'm now looking into how to fix this... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 14:07 ` Jan Kara @ 2009-03-26 8:18 ` Aneesh Kumar K.V 0 siblings, 0 replies; 60+ messages in thread From: Aneesh Kumar K.V @ 2009-03-26 8:18 UTC (permalink / raw) To: Jan Kara Cc: Chris Mason, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue, Mar 24, 2009 at 03:07:21PM +0100, Jan Kara wrote: > On Tue 24-03-09 10:01:45, Chris Mason wrote: > > On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote: > > > On Tue 24-03-09 13:55:10, Jan Kara wrote: > > > > > > And one more interesting thing I don't yet fully understand - I see pages > > > > having PageError() set when they are removed from page cache (and they have > > > > been faulted in before). It's probably some interaction with pagecache > > > > readahead... > > > Argh... So the problem seems to be that get_block() occasionally returns > > > ENOSPC and we then discard the dirty data (hmm, we could give at least a > > > warning for that). I'm not yet sure why getblock behaves like this because > > > the filesystem seems to have enough space but anyway this seems to be some > > > strange fs trouble as well. > > > > > > > Ouch. Perhaps the free space is waiting on a journal commit? > Yes, exactly. I've already found there's lot of space hold by the > committing transaction (it can easily hold a few hundred megs or a few gigs > with larger journal and my UML images aren't that big...). And writepage() > implementation in ext3 does not have a logic to retry. Also > block_write_full_page() clears buffers dirty bits so it's not easy to retry > even if we did it. I'm now looking into how to fix this... We retry block allocation in ext3_write_begin. And for mmap we should be doing something similar to ext4_page_mkwrite so that we can be sure that during writepage we don't need to do block allocation. -aneesh ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 13:26 ` Jan Kara 2009-03-24 14:01 ` Chris Mason @ 2009-03-24 14:30 ` Nick Piggin 2009-03-24 14:47 ` Jan Kara 1 sibling, 1 reply; 60+ messages in thread From: Nick Piggin @ 2009-03-24 14:30 UTC (permalink / raw) To: Jan Kara Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Wednesday 25 March 2009 00:26:37 Jan Kara wrote: > On Tue 24-03-09 13:55:10, Jan Kara wrote: > > On Tue 24-03-09 13:39:36, Jan Kara wrote: > > > Hi, > > > > > > On Tue 24-03-09 18:44:21, Nick Piggin wrote: > > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > > > Holding mapping->private_lock over the __set_page_dirty should > > > > > > fix it, although I guess you'd want to release it before calling > > > > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > > > > have a patch for this if it sounds reasonable. > > > > > > > > > > Yes, that seems to be a bug - the function actually looked > > > > > suspitious to me yesterday but I somehow convinced myself that it's > > > > > fine. Probably because fsx-linux is single-threaded. > > > > > > > > After a whole lot of chasing my own tail in the VM and buffer layers, > > > > I think it is a problem in ext2 (and I haven't been able to reproduce > > > > with ext3 yet, which might lend weight to that, although as we have > > > > seen, it is very timing dependent). > > > > > > > > That would be slightly unfortunate because we still have Jan's ext3 > > > > problem, and also another reported problem of corruption on ext3 (on > > > > brd driver). > > > > > > > > Anyway, when I have reproduced the problem with the test case, the > > > > "lost" writes are all reported to be holes. Unfortunately, that > > > > doesn't point straight to the filesystem, because ext2 allocates > > > > blocks in this case at writeout time, so if dirty bits are getting > > > > lost, then it would be normal to see holes. > > > > > > > > I then put in a whole lot of extra infrastructure to track metadata > > > > about each struct page (when it was last written out, when it last > > > > had the number of writable ptes reach 0, when the dirty bits were > > > > last cleared etc). And none of the normal asertions were triggering: > > > > eg. when any page is removed from pagecache (except truncates), it > > > > has always had all its buffers written out *after* all ptes were made > > > > readonly or unmapped. Lots of other tests and crap like that. > > > > > > I see we're going the same way ;) > > > > > > > So I tried what I should have done to start with and did an e2fsck > > > > after seeing corruption. Yes, it comes up with errors. Now that is > > > > unusual because that should be largely insulated from the vm: if a > > > > dirty bit gets lost, then the filesystem image should be quite happy > > > > and error-free with a hole or unwritten data there. > > > > > > This is different for me. I see no corruption on the filesystem with > > > ext3. Anyway, errors from fsck would be useful to see what kind of > > > corruption you saw. > > > > > > > I don't know ext? locking very well, except that it looks pretty > > > > overly complex and crufty. > > > > > > > > Usually, blocks are instantiated by write(2), under i_mutex, > > > > serialising the allocator somewhat. mmap-write blocks are > > > > instantiated at writeout time, unserialised. I moved truncate_mutex > > > > to cover the entire get_blocks function, and can no longer trigger > > > > the problem. Might be a timing issue though -- Ying, can you try this > > > > and see if you can still reproduce? > > > > > > > > I close my eyes and pick something out of a hat. a686cd89. Search for > > > > XXX. Nice. Whether or not this cased the problem, can someone please > > > > tell me why it got merged in that state? > > > > > > Maybe, I see it did some changes to ext2_get_blocks() which could be > > > where the problem was introduced... > > > > > > > I'm leaving ext3 running for now. It looks like a nasty task to > > > > bisect ext2 down to that commit :( and I would be more interested in > > > > trying to reproduce Jan's ext3 problem, however, because I'm not too > > > > interested in diving into ext2 locking to work out exactly what is > > > > racing and how to fix it properly. I suspect it would be most > > > > productive to wire up some ioctls right into the block > > > > allocator/lookup and code up a userspace tester for it that could > > > > probably stress it a lot harder than kernel writeout can. > > > > > > Yes, what I observed with ext3 so far is that data is properly copied > > > and page marked dirty when the data is copied in. But then at some > > > point dirty bit is cleared via block_write_full_page() but we don't get > > > to submitting at least one buffer in that page. I'm now debugging which > > > path we take so that this happens... > > > > And one more interesting thing I don't yet fully understand - I see > > pages having PageError() set when they are removed from page cache (and > > they have been faulted in before). It's probably some interaction with > > pagecache readahead... > > Argh... So the problem seems to be that get_block() occasionally returns > ENOSPC and we then discard the dirty data (hmm, we could give at least a > warning for that). I'm not yet sure why getblock behaves like this because > the filesystem seems to have enough space but anyway this seems to be some > strange fs trouble as well. Ah good find. I don't think it is a very good idea for block_write_full_page recovery to do clear_buffer_dirty for !mapped buffers. I think that should rather be a redirty_page_for_writepage in the case that the buffer is dirty. Perhaps not the cleanest way to solve the problem if it is just due to transient shortage of space in ext3, but generic code shouldn't be allowed to throw away dirty data even if it can't be written back due to some software or hardware error. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 14:30 ` Nick Piggin @ 2009-03-24 14:47 ` Jan Kara 2009-03-24 14:56 ` Peter Zijlstra 2009-03-24 15:03 ` Nick Piggin 0 siblings, 2 replies; 60+ messages in thread From: Jan Kara @ 2009-03-24 14:47 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Wed 25-03-09 01:30:00, Nick Piggin wrote: > On Wednesday 25 March 2009 00:26:37 Jan Kara wrote: > > On Tue 24-03-09 13:55:10, Jan Kara wrote: > > > On Tue 24-03-09 13:39:36, Jan Kara wrote: > > > > Hi, > > > > > > > > On Tue 24-03-09 18:44:21, Nick Piggin wrote: > > > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote: > > > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote: > > > > > > > Holding mapping->private_lock over the __set_page_dirty should > > > > > > > fix it, although I guess you'd want to release it before calling > > > > > > > __mark_inode_dirty so as not to put inode_lock under there. I > > > > > > > have a patch for this if it sounds reasonable. > > > > > > > > > > > > Yes, that seems to be a bug - the function actually looked > > > > > > suspitious to me yesterday but I somehow convinced myself that it's > > > > > > fine. Probably because fsx-linux is single-threaded. > > > > > > > > > > After a whole lot of chasing my own tail in the VM and buffer layers, > > > > > I think it is a problem in ext2 (and I haven't been able to reproduce > > > > > with ext3 yet, which might lend weight to that, although as we have > > > > > seen, it is very timing dependent). > > > > > > > > > > That would be slightly unfortunate because we still have Jan's ext3 > > > > > problem, and also another reported problem of corruption on ext3 (on > > > > > brd driver). > > > > > > > > > > Anyway, when I have reproduced the problem with the test case, the > > > > > "lost" writes are all reported to be holes. Unfortunately, that > > > > > doesn't point straight to the filesystem, because ext2 allocates > > > > > blocks in this case at writeout time, so if dirty bits are getting > > > > > lost, then it would be normal to see holes. > > > > > > > > > > I then put in a whole lot of extra infrastructure to track metadata > > > > > about each struct page (when it was last written out, when it last > > > > > had the number of writable ptes reach 0, when the dirty bits were > > > > > last cleared etc). And none of the normal asertions were triggering: > > > > > eg. when any page is removed from pagecache (except truncates), it > > > > > has always had all its buffers written out *after* all ptes were made > > > > > readonly or unmapped. Lots of other tests and crap like that. > > > > > > > > I see we're going the same way ;) > > > > > > > > > So I tried what I should have done to start with and did an e2fsck > > > > > after seeing corruption. Yes, it comes up with errors. Now that is > > > > > unusual because that should be largely insulated from the vm: if a > > > > > dirty bit gets lost, then the filesystem image should be quite happy > > > > > and error-free with a hole or unwritten data there. > > > > > > > > This is different for me. I see no corruption on the filesystem with > > > > ext3. Anyway, errors from fsck would be useful to see what kind of > > > > corruption you saw. > > > > > > > > > I don't know ext? locking very well, except that it looks pretty > > > > > overly complex and crufty. > > > > > > > > > > Usually, blocks are instantiated by write(2), under i_mutex, > > > > > serialising the allocator somewhat. mmap-write blocks are > > > > > instantiated at writeout time, unserialised. I moved truncate_mutex > > > > > to cover the entire get_blocks function, and can no longer trigger > > > > > the problem. Might be a timing issue though -- Ying, can you try this > > > > > and see if you can still reproduce? > > > > > > > > > > I close my eyes and pick something out of a hat. a686cd89. Search for > > > > > XXX. Nice. Whether or not this cased the problem, can someone please > > > > > tell me why it got merged in that state? > > > > > > > > Maybe, I see it did some changes to ext2_get_blocks() which could be > > > > where the problem was introduced... > > > > > > > > > I'm leaving ext3 running for now. It looks like a nasty task to > > > > > bisect ext2 down to that commit :( and I would be more interested in > > > > > trying to reproduce Jan's ext3 problem, however, because I'm not too > > > > > interested in diving into ext2 locking to work out exactly what is > > > > > racing and how to fix it properly. I suspect it would be most > > > > > productive to wire up some ioctls right into the block > > > > > allocator/lookup and code up a userspace tester for it that could > > > > > probably stress it a lot harder than kernel writeout can. > > > > > > > > Yes, what I observed with ext3 so far is that data is properly copied > > > > and page marked dirty when the data is copied in. But then at some > > > > point dirty bit is cleared via block_write_full_page() but we don't get > > > > to submitting at least one buffer in that page. I'm now debugging which > > > > path we take so that this happens... > > > > > > And one more interesting thing I don't yet fully understand - I see > > > pages having PageError() set when they are removed from page cache (and > > > they have been faulted in before). It's probably some interaction with > > > pagecache readahead... > > > > Argh... So the problem seems to be that get_block() occasionally returns > > ENOSPC and we then discard the dirty data (hmm, we could give at least a > > warning for that). I'm not yet sure why getblock behaves like this because > > the filesystem seems to have enough space but anyway this seems to be some > > strange fs trouble as well. > > Ah good find. > > I don't think it is a very good idea for block_write_full_page recovery > to do clear_buffer_dirty for !mapped buffers. I think that should rather > be a redirty_page_for_writepage in the case that the buffer is dirty. > > Perhaps not the cleanest way to solve the problem if it is just due to > transient shortage of space in ext3, but generic code shouldn't be > allowed to throw away dirty data even if it can't be written back due > to some software or hardware error. Well, that would be one possibility. But then we'd be left with dirty pages we cannot ever release since they are constantly dirty (when the filesystem really becomes out of space). So what I rather want to do is something like below: diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index d351eab..77c526f 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh) } /* + * Decides whether it's worthwhile to wait for transaction commit and + * retry allocation. If it is, function waits 1 is returns, otherwise + * 0 is returned. In both cases we redirty page and it's buffers so that + * data is not lost. In case we've retried too many times, we also return + * 0 and don't redirty the page. Data gets discarded but we cannot hang + * writepage forever... + */ +static int ext3_writepage_retry_alloc(struct page *page, int *retries, + struct writeback_control *wbc) +{ + struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb; + int ret = 0; + + /* + * We don't want to slow down background writeback too much. On the + * other hand if most of the dirty data needs allocation, we better + * wait to make some progress + */ + if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim && + wbc->pages_skipped < wbc->nr_to_write / 2) + goto redirty; + /* + * Now wait if commit can free some space and we haven't retried + * too much + */ + if (!ext3_should_retry_alloc(sb, retries)) + return 0; + ret = 1; +redirty: + set_page_dirty(page); + return ret; +} + +/* * Note that we always start a transaction even if we're not journalling * data. This is to preserve ordering: any hole instantiation within * __block_write_full_page -> ext3_get_block() should be journalled @@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page *page, handle_t *handle = NULL; int ret = 0; int err; + int retries; if (ext3_journal_current_handle()) goto out_fail; +restart: handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); if (IS_ERR(handle)) { ret = PTR_ERR(handle); @@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page *page, ret = block_write_full_page(page, ext3_get_block, wbc); err = ext3_journal_stop(handle); - if (!ret) + if (!ret) { ret = err; + } else { + if (ret == -ENOSPC && + ext3_writepage_retry_alloc(page, &retries, wbc)) + goto restart; + } return ret; out_fail: And similarly for the other two writepage implementations in ext3... But it currently gives me: WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145() probably because of that set_page_dirty() in ext3_writepage_retry_alloc(). Or we could implement ext3_mkwrite() to allocate buffers already when we make page writeable. But it costs some performace (we have to write page full of zeros when allocating those buffers, where previously we didn't have to do anything) and it's not trivial to make it work if pagesize > blocksize (we should not allocate buffers outside of i_size so if i_size = 1024, we create just one block in ext3_mkwrite() but then we need to allocate more when we extend the file). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 14:47 ` Jan Kara @ 2009-03-24 14:56 ` Peter Zijlstra 2009-03-24 15:29 ` Jan Kara 2009-03-24 15:03 ` Nick Piggin 1 sibling, 1 reply; 60+ messages in thread From: Peter Zijlstra @ 2009-03-24 14:56 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote: > > Or we could implement ext3_mkwrite() to allocate buffers already when we > make page writeable. But it costs some performace (we have to write page > full of zeros when allocating those buffers, where previously we didn't > have to do anything) and it's not trivial to make it work if pagesize > > blocksize (we should not allocate buffers outside of i_size so if i_size > = 1024, we create just one block in ext3_mkwrite() but then we need to > allocate more when we extend the file). I think this is the best option, failing with SIGBUS when we fail to allocate blocks seems consistent with other filesystems as well. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 14:56 ` Peter Zijlstra @ 2009-03-24 15:29 ` Jan Kara 2009-03-24 20:14 ` OGAWA Hirofumi 2009-03-26 8:47 ` Aneesh Kumar K.V 0 siblings, 2 replies; 60+ messages in thread From: Jan Kara @ 2009-03-24 15:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth On Tue 24-03-09 15:56:03, Peter Zijlstra wrote: > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote: > > > > Or we could implement ext3_mkwrite() to allocate buffers already when we > > make page writeable. But it costs some performace (we have to write page > > full of zeros when allocating those buffers, where previously we didn't > > have to do anything) and it's not trivial to make it work if pagesize > > > blocksize (we should not allocate buffers outside of i_size so if i_size > > = 1024, we create just one block in ext3_mkwrite() but then we need to > > allocate more when we extend the file). > > I think this is the best option, failing with SIGBUS when we fail to > allocate blocks seems consistent with other filesystems as well. I agree this looks attractive at the first sight. But there are drawbacks as I wrote - the problem with blocksize < pagesize, slight performance decrease due to additional write, page faults doing allocation can take a *long* time and overall fragmentation is going to be higher (previously writepage wrote pages for us in the right order, now we are going to allocate in the first-accessed order). So I'm not sure we really want to go this way. Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve some space in mkwrite() but don't actually allocate it. That would be done in writepage(). This would solve all the problems I describe above. We could use PG_Checked flag to track that the page has a reservation and behave accordingly in writepage() / invalidatepage(). ext3 in data=journal mode already uses the flag but the use seems to be compatible with what I want to do now... So it may actually work. BTW: Note that there's a plenty of filesystems that don't implement mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with ENOSPC. So I'd not speak too much about consistency ;). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 15:29 ` Jan Kara @ 2009-03-24 20:14 ` OGAWA Hirofumi 2009-03-26 8:47 ` Aneesh Kumar K.V 1 sibling, 0 replies; 60+ messages in thread From: OGAWA Hirofumi @ 2009-03-24 20:14 UTC (permalink / raw) To: Jan Kara Cc: Peter Zijlstra, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth Jan Kara <jack@suse.cz> writes: > BTW: Note that there's a plenty of filesystems that don't implement > mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with > ENOSPC. So I'd not speak too much about consistency ;). FWIW, fatfs doesn't allow sparse file (mmap the non-allocated region), so I guess there is no problem. -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 15:29 ` Jan Kara 2009-03-24 20:14 ` OGAWA Hirofumi @ 2009-03-26 8:47 ` Aneesh Kumar K.V 2009-03-26 11:37 ` Jan Kara 2009-03-26 23:02 ` Linus Torvalds 1 sibling, 2 replies; 60+ messages in thread From: Aneesh Kumar K.V @ 2009-03-26 8:47 UTC (permalink / raw) To: Jan Kara Cc: Peter Zijlstra, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth On Tue, Mar 24, 2009 at 04:29:59PM +0100, Jan Kara wrote: > On Tue 24-03-09 15:56:03, Peter Zijlstra wrote: > > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote: > > > > > > Or we could implement ext3_mkwrite() to allocate buffers already when we > > > make page writeable. But it costs some performace (we have to write page > > > full of zeros when allocating those buffers, where previously we didn't > > > have to do anything) and it's not trivial to make it work if pagesize > > > > blocksize (we should not allocate buffers outside of i_size so if i_size > > > = 1024, we create just one block in ext3_mkwrite() but then we need to > > > allocate more when we extend the file). > > > > I think this is the best option, failing with SIGBUS when we fail to > > allocate blocks seems consistent with other filesystems as well. > I agree this looks attractive at the first sight. But there are drawbacks > as I wrote - the problem with blocksize < pagesize, slight performance > decrease due to additional write, It should not cause an additional write. Can you let me why it would result in additional write ? >page faults doing allocation can take a > *long* time That is true >and overall fragmentation is going to be higher (previously > writepage wrote pages for us in the right order, now we are going to > allocate in the first-accessed order). So I'm not sure we really want to > go this way. block allocator should be improved to fix that. For example ext4 mballoc also look at the logical file block number when doing block allocation. So if we does enough reservation it should handle the the first-accessed order and sequential order allocation properly. Another reason why I think we would need ext3_page_mkwrite is, if we really are out of space how do we handle it ? Currently the patch you posted does redirty_page_for_writepage, which would imply we can't reclaim the page and since get_block get ENOSPC we can't allocate blocks. > Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve > some space in mkwrite() but don't actually allocate it. That would be done > in writepage(). This would solve all the problems I describe above. We could > use PG_Checked flag to track that the page has a reservation and behave > accordingly in writepage() / invalidatepage(). ext3 in data=journal mode > already uses the flag but the use seems to be compatible with what I want > to do now... So it may actually work. > BTW: Note that there's a plenty of filesystems that don't implement > mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with > ENOSPC. So I'd not speak too much about consistency ;). > -aneesh ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-26 8:47 ` Aneesh Kumar K.V @ 2009-03-26 11:37 ` Jan Kara 2009-03-26 23:02 ` Linus Torvalds 1 sibling, 0 replies; 60+ messages in thread From: Jan Kara @ 2009-03-26 11:37 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Peter Zijlstra, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth On Thu 26-03-09 14:17:23, Aneesh Kumar K.V wrote: > On Tue, Mar 24, 2009 at 04:29:59PM +0100, Jan Kara wrote: > > On Tue 24-03-09 15:56:03, Peter Zijlstra wrote: > > > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote: > > > > > > > > Or we could implement ext3_mkwrite() to allocate buffers already when we > > > > make page writeable. But it costs some performace (we have to write page > > > > full of zeros when allocating those buffers, where previously we didn't > > > > have to do anything) and it's not trivial to make it work if pagesize > > > > > blocksize (we should not allocate buffers outside of i_size so if i_size > > > > = 1024, we create just one block in ext3_mkwrite() but then we need to > > > > allocate more when we extend the file). > > > > > > I think this is the best option, failing with SIGBUS when we fail to > > > allocate blocks seems consistent with other filesystems as well. > > I agree this looks attractive at the first sight. But there are drawbacks > > as I wrote - the problem with blocksize < pagesize, slight performance > > decrease due to additional write, > > It should not cause an additional write. Can you let me why it would > result in additional write ? Because if you have a new page, at the time mkwrite() or set_page_dirty() is called, it is just full of zeros. So we attach buffers full of zeros to the running transaction to stand to data=ordered mode requirements. Then these get written out on transaction commit (or they can already contain some data user has written via mmap) but we're going to write them again when writepage() is called on the page. Umm, but yes, thinking more about the details, we clear buffer dirty bits at commit time so if by that time user has copied in all the data, subsequent writepage will find out all the buffers are clean and will not send them to disk. So in this case overhead is going to be just journal_start() + journal_stop(). OTOH mm usually decides to write the page only after some time so if user writes to the page often then we really do one more write. But in this case one additional write is going to be probably lost in the number of total writes of the page. So yes, this is not such a big issue as I though originally. > >page faults doing allocation can take a > > *long* time > > That is true > > >and overall fragmentation is going to be higher (previously > > writepage wrote pages for us in the right order, now we are going to > > allocate in the first-accessed order). So I'm not sure we really want to > > go this way. > > > block allocator should be improved to fix that. For example ext4 > mballoc also look at the logical file block number when doing block > allocation. So if we does enough reservation it should handle the > the first-accessed order and sequential order allocation properly. Well, we could definitely improve ext3 allocator. But do we really want to backport mballoc to ext3? IMO It is easier to essentialy perform delayed allocation at the time of mkwrite() and the do the real allocation at the time of writepage(). So I'd rather vote for a mechanism I write about below. > Another reason why I think we would need ext3_page_mkwrite is, if we > really are out of space how do we handle it ? Currently the patch you > posted does redirty_page_for_writepage, which would imply we can't > reclaim the page and since get_block get ENOSPC we can't allocate > blocks. I definitely agree we should somehow solve this problem but the mechanism below seems to be an easier way to me. > > Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve > > some space in mkwrite() but don't actually allocate it. That would be done > > in writepage(). This would solve all the problems I describe above. We could > > use PG_Checked flag to track that the page has a reservation and behave > > accordingly in writepage() / invalidatepage(). ext3 in data=journal mode > > already uses the flag but the use seems to be compatible with what I want > > to do now... So it may actually work. > > BTW: Note that there's a plenty of filesystems that don't implement > > mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with > > ENOSPC. So I'd not speak too much about consistency ;). Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-26 8:47 ` Aneesh Kumar K.V 2009-03-26 11:37 ` Jan Kara @ 2009-03-26 23:02 ` Linus Torvalds 1 sibling, 0 replies; 60+ messages in thread From: Linus Torvalds @ 2009-03-26 23:02 UTC (permalink / raw) To: Aneesh Kumar K.V Cc: Jan Kara, Peter Zijlstra, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth On Thu, 26 Mar 2009, Aneesh Kumar K.V wrote: > > >page faults doing allocation can take a > > *long* time > > That is true Btw, this is actually a feature rather than a bug. We want to slow down the writer, which is why we also do dirty page balancing when marking a page dirty. Basically, if block allocation is a performance problem, then it should be a performance problem that is attributed to the process that _causes_ it, rather than to some random poor unrelated process that then later ends up writing the page out because it wants to use some memory. This is why tracking dirty pages is so important. Yes, it also avoids various nasty overcommit situations, but the whole "make it hurt for the person responsible, rather than a random innocent bystander" is the more important part of it. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 14:47 ` Jan Kara 2009-03-24 14:56 ` Peter Zijlstra @ 2009-03-24 15:03 ` Nick Piggin 2009-03-24 15:48 ` Jan Kara 1 sibling, 1 reply; 60+ messages in thread From: Nick Piggin @ 2009-03-24 15:03 UTC (permalink / raw) To: Jan Kara Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Wednesday 25 March 2009 01:47:09 Jan Kara wrote: > On Wed 25-03-09 01:30:00, Nick Piggin wrote: > > I don't think it is a very good idea for block_write_full_page recovery > > to do clear_buffer_dirty for !mapped buffers. I think that should rather > > be a redirty_page_for_writepage in the case that the buffer is dirty. > > > > Perhaps not the cleanest way to solve the problem if it is just due to > > transient shortage of space in ext3, but generic code shouldn't be > > allowed to throw away dirty data even if it can't be written back due > > to some software or hardware error. > > Well, that would be one possibility. But then we'd be left with dirty > pages we cannot ever release since they are constantly dirty (when the > filesystem really becomes out of space). So what I If the filesystem becomes out of space and we have over-committed these dirty mmapped blocks, then we most definitely want to keep them around. An error of the system losing a few pages (or if it happens an insanely large number of times, then slowly dying due to memory leak) is better than an app suddenly seeing the contents of the page change to nulls under it when the kernel decides to do some page reclaim. > rather want to do is something like below: > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c > index d351eab..77c526f 100644 > --- a/fs/ext3/inode.c > +++ b/fs/ext3/inode.c > @@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle, > struct buffer_head *bh) } > > /* > + * Decides whether it's worthwhile to wait for transaction commit and > + * retry allocation. If it is, function waits 1 is returns, otherwise > + * 0 is returned. In both cases we redirty page and it's buffers so that > + * data is not lost. In case we've retried too many times, we also return > + * 0 and don't redirty the page. Data gets discarded but we cannot hang > + * writepage forever... > + */ > +static int ext3_writepage_retry_alloc(struct page *page, int *retries, > + struct writeback_control *wbc) > +{ > + struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb; > + int ret = 0; > + > + /* > + * We don't want to slow down background writeback too much. On the > + * other hand if most of the dirty data needs allocation, we better > + * wait to make some progress > + */ > + if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim && > + wbc->pages_skipped < wbc->nr_to_write / 2) > + goto redirty; > + /* > + * Now wait if commit can free some space and we haven't retried > + * too much > + */ > + if (!ext3_should_retry_alloc(sb, retries)) > + return 0; > + ret = 1; > +redirty: > + set_page_dirty(page); > + return ret; > +} > + > +/* > * Note that we always start a transaction even if we're not journalling > * data. This is to preserve ordering: any hole instantiation within > * __block_write_full_page -> ext3_get_block() should be journalled > @@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page > *page, handle_t *handle = NULL; > int ret = 0; > int err; > + int retries; > > if (ext3_journal_current_handle()) > goto out_fail; > > +restart: > handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); > if (IS_ERR(handle)) { > ret = PTR_ERR(handle); > @@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page > *page, ret = block_write_full_page(page, ext3_get_block, wbc); > > err = ext3_journal_stop(handle); > - if (!ret) > + if (!ret) { > ret = err; > + } else { > + if (ret == -ENOSPC && > + ext3_writepage_retry_alloc(page, &retries, wbc)) > + goto restart; > + } > return ret; > > out_fail: > > And similarly for the other two writepage implementations in ext3... > But it currently gives me: > WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145() > probably because of that set_page_dirty() in ext3_writepage_retry_alloc(). And this is a valid warning because we don't know that all buffers are uptodate or which ones to set as dirty, I think. Unless it is impossible to have dirty && !uptodate pages come thought this path. But you shouldn't need to redirty the page at all if we change block_write_full_page in the way I suggested. Because then it won't have cleaned the buffer, and it will have done redirty_page_for_writepage. > Or we could implement ext3_mkwrite() to allocate buffers already when we > make page writeable. But it costs some performace (we have to write page > full of zeros when allocating those buffers, where previously we didn't > have to do anything) and it's not trivial to make it work if pagesize > > blocksize (we should not allocate buffers outside of i_size so if i_size > = 1024, we create just one block in ext3_mkwrite() but then we need to > allocate more when we extend the file). Well the core page_mkwrite function doesn't care about that case properly either (it just doesn't allocate buffers on extend). I agree it should be fixed, but it is a little hard (I need to fix it in fsblock and I think what is required is a setattr helper for that). ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 15:03 ` Nick Piggin @ 2009-03-24 15:48 ` Jan Kara 2009-03-24 17:35 ` Jan Kara 0 siblings, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-03-24 15:48 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Wed 25-03-09 02:03:54, Nick Piggin wrote: > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote: > > On Wed 25-03-09 01:30:00, Nick Piggin wrote: > > > > I don't think it is a very good idea for block_write_full_page recovery > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather > > > be a redirty_page_for_writepage in the case that the buffer is dirty. > > > > > > Perhaps not the cleanest way to solve the problem if it is just due to > > > transient shortage of space in ext3, but generic code shouldn't be > > > allowed to throw away dirty data even if it can't be written back due > > > to some software or hardware error. > > > > Well, that would be one possibility. But then we'd be left with dirty > > pages we cannot ever release since they are constantly dirty (when the > > filesystem really becomes out of space). So what I > > If the filesystem becomes out of space and we have over-committed these > dirty mmapped blocks, then we most definitely want to keep them around. > An error of the system losing a few pages (or if it happens an insanely > large number of times, then slowly dying due to memory leak) is better > than an app suddenly seeing the contents of the page change to nulls > under it when the kernel decides to do some page reclaim. Hmm, probably you're right. Definitely it would be much easier to track the problem down than it is now... Thinking a bit more... But couldn't a malicious user bring the machine easily to OOM this way? That would be unfortunate. > > rather want to do is something like below: > > > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c > > index d351eab..77c526f 100644 > > --- a/fs/ext3/inode.c > > +++ b/fs/ext3/inode.c > > @@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle, > > struct buffer_head *bh) } > > > > /* > > + * Decides whether it's worthwhile to wait for transaction commit and > > + * retry allocation. If it is, function waits 1 is returns, otherwise > > + * 0 is returned. In both cases we redirty page and it's buffers so that > > + * data is not lost. In case we've retried too many times, we also return > > + * 0 and don't redirty the page. Data gets discarded but we cannot hang > > + * writepage forever... > > + */ > > +static int ext3_writepage_retry_alloc(struct page *page, int *retries, > > + struct writeback_control *wbc) > > +{ > > + struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb; > > + int ret = 0; > > + > > + /* > > + * We don't want to slow down background writeback too much. On the > > + * other hand if most of the dirty data needs allocation, we better > > + * wait to make some progress > > + */ > > + if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim && > > + wbc->pages_skipped < wbc->nr_to_write / 2) > > + goto redirty; > > + /* > > + * Now wait if commit can free some space and we haven't retried > > + * too much > > + */ > > + if (!ext3_should_retry_alloc(sb, retries)) > > + return 0; > > + ret = 1; > > +redirty: > > + set_page_dirty(page); > > + return ret; > > +} > > + > > +/* > > * Note that we always start a transaction even if we're not journalling > > * data. This is to preserve ordering: any hole instantiation within > > * __block_write_full_page -> ext3_get_block() should be journalled > > @@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page > > *page, handle_t *handle = NULL; > > int ret = 0; > > int err; > > + int retries; > > > > if (ext3_journal_current_handle()) > > goto out_fail; > > > > +restart: > > handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode)); > > if (IS_ERR(handle)) { > > ret = PTR_ERR(handle); > > @@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page > > *page, ret = block_write_full_page(page, ext3_get_block, wbc); > > > > err = ext3_journal_stop(handle); > > - if (!ret) > > + if (!ret) { > > ret = err; > > + } else { > > + if (ret == -ENOSPC && > > + ext3_writepage_retry_alloc(page, &retries, wbc)) > > + goto restart; > > + } > > return ret; > > > > out_fail: > > > > And similarly for the other two writepage implementations in ext3... > > But it currently gives me: > > WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145() > > probably because of that set_page_dirty() in ext3_writepage_retry_alloc(). > > And this is a valid warning because we don't know that all buffers are > uptodate or which ones to set as dirty, I think. Unless it is impossible > to have dirty && !uptodate pages come thought this path. Ah, OK, thanks for explanation. > But you shouldn't need to redirty the page at all if we change > block_write_full_page in the way I suggested. Because then it won't > have cleaned the buffer, and it will have done redirty_page_for_writepage. > > > Or we could implement ext3_mkwrite() to allocate buffers already when we > > make page writeable. But it costs some performace (we have to write page > > full of zeros when allocating those buffers, where previously we didn't > > have to do anything) and it's not trivial to make it work if pagesize > > > blocksize (we should not allocate buffers outside of i_size so if i_size > > = 1024, we create just one block in ext3_mkwrite() but then we need to > > allocate more when we extend the file). > > Well the core page_mkwrite function doesn't care about that case > properly either (it just doesn't allocate buffers on extend). I agree it > should be fixed, but it is a little hard (I need to fix it in fsblock > and I think what is required is a setattr helper for that). Won't it be enough, if extending truncate called page_mkwrite() on the page which used to be the last one in the file? That would be enough for my use although it seems a bit hacky I agree... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 15:48 ` Jan Kara @ 2009-03-24 17:35 ` Jan Kara 2009-04-01 22:36 ` Ying Han 0 siblings, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-03-24 17:35 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue 24-03-09 16:48:14, Jan Kara wrote: > On Wed 25-03-09 02:03:54, Nick Piggin wrote: > > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote: > > > On Wed 25-03-09 01:30:00, Nick Piggin wrote: > > > > > > I don't think it is a very good idea for block_write_full_page recovery > > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather > > > > be a redirty_page_for_writepage in the case that the buffer is dirty. > > > > > > > > Perhaps not the cleanest way to solve the problem if it is just due to > > > > transient shortage of space in ext3, but generic code shouldn't be > > > > allowed to throw away dirty data even if it can't be written back due > > > > to some software or hardware error. > > > > > > Well, that would be one possibility. But then we'd be left with dirty > > > pages we cannot ever release since they are constantly dirty (when the > > > filesystem really becomes out of space). So what I > > > > If the filesystem becomes out of space and we have over-committed these > > dirty mmapped blocks, then we most definitely want to keep them around. > > An error of the system losing a few pages (or if it happens an insanely > > large number of times, then slowly dying due to memory leak) is better > > than an app suddenly seeing the contents of the page change to nulls > > under it when the kernel decides to do some page reclaim. > Hmm, probably you're right. Definitely it would be much easier to track > the problem down than it is now... Thinking a bit more... But couldn't a > malicious user bring the machine easily to OOM this way? That would be > unfortunate. OK, below is the patch which makes things work for me (i.e. no data lost). What do you think? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR >From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Tue, 24 Mar 2009 16:38:22 +0100 Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page() If getblock() fails in block_write_full_page(), we don't want to clear dirty bits on buffers. Actually, we even want to redirty the page. This way we just won't silently discard users data (written e.g. through mmap) in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this approach is that if the error is persistent we have this page pinned in memory forever and if there are lots of such pages, we can bring the machine OOM. Signed-off-by: Jan Kara <jack@suse.cz> --- fs/buffer.c | 10 +++------- 1 files changed, 3 insertions(+), 7 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 891e1c7..ae779a0 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1833,9 +1833,11 @@ recover: /* * ENOSPC, or some other error. We may already have added some * blocks to the file, so we need to write these out to avoid - * exposing stale data. + * exposing stale data. We redirty the page so that we don't + * loose data we are unable to write. * The page is currently locked and not marked for writeback */ + redirty_page_for_writepage(wbc, page); bh = head; /* Recovery: lock and submit the mapped buffers */ do { @@ -1843,12 +1845,6 @@ recover: !buffer_delay(bh)) { lock_buffer(bh); mark_buffer_async_write(bh); - } else { - /* - * The buffer may have been set dirty during - * attachment to a dirty page. - */ - clear_buffer_dirty(bh); } } while ((bh = bh->b_this_page) != head); SetPageError(page); -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 17:35 ` Jan Kara @ 2009-04-01 22:36 ` Ying Han 2009-04-02 10:11 ` Jan Kara 2009-04-02 11:24 ` Nick Piggin 0 siblings, 2 replies; 60+ messages in thread From: Ying Han @ 2009-04-01 22:36 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra Hi Jan: I feel that the problem you saw is kind of differnt than mine. As you mentioned that you saw the PageError() message, which i don't see it on my system. I tried you patch(based on 2.6.21) on my system and it runs ok for 2 days, Still, since i don't see the same error message as you saw, i am not convineced this is the root cause at least for our problem. I am still looking into it. So, are you seeing the PageError() every time the problem happened? --Ying On Tue, Mar 24, 2009 at 10:35 AM, Jan Kara <jack@suse.cz> wrote: > On Tue 24-03-09 16:48:14, Jan Kara wrote: >> On Wed 25-03-09 02:03:54, Nick Piggin wrote: >> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote: >> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote: >> > >> > > > I don't think it is a very good idea for block_write_full_page recovery >> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather >> > > > be a redirty_page_for_writepage in the case that the buffer is dirty. >> > > > >> > > > Perhaps not the cleanest way to solve the problem if it is just due to >> > > > transient shortage of space in ext3, but generic code shouldn't be >> > > > allowed to throw away dirty data even if it can't be written back due >> > > > to some software or hardware error. >> > > >> > > Well, that would be one possibility. But then we'd be left with dirty >> > > pages we cannot ever release since they are constantly dirty (when the >> > > filesystem really becomes out of space). So what I >> > >> > If the filesystem becomes out of space and we have over-committed these >> > dirty mmapped blocks, then we most definitely want to keep them around. >> > An error of the system losing a few pages (or if it happens an insanely >> > large number of times, then slowly dying due to memory leak) is better >> > than an app suddenly seeing the contents of the page change to nulls >> > under it when the kernel decides to do some page reclaim. >> Hmm, probably you're right. Definitely it would be much easier to track >> the problem down than it is now... Thinking a bit more... But couldn't a >> malicious user bring the machine easily to OOM this way? That would be >> unfortunate. > OK, below is the patch which makes things work for me (i.e. no data > lost). What do you think? > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > > From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001 > From: Jan Kara <jack@suse.cz> > Date: Tue, 24 Mar 2009 16:38:22 +0100 > Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page() > > If getblock() fails in block_write_full_page(), we don't want to clear > dirty bits on buffers. Actually, we even want to redirty the page. This > way we just won't silently discard users data (written e.g. through mmap) > in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this > approach is that if the error is persistent we have this page pinned in > memory forever and if there are lots of such pages, we can bring the > machine OOM. > > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/buffer.c | 10 +++------- > 1 files changed, 3 insertions(+), 7 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index 891e1c7..ae779a0 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -1833,9 +1833,11 @@ recover: > /* > * ENOSPC, or some other error. We may already have added some > * blocks to the file, so we need to write these out to avoid > - * exposing stale data. > + * exposing stale data. We redirty the page so that we don't > + * loose data we are unable to write. > * The page is currently locked and not marked for writeback > */ > + redirty_page_for_writepage(wbc, page); > bh = head; > /* Recovery: lock and submit the mapped buffers */ > do { > @@ -1843,12 +1845,6 @@ recover: > !buffer_delay(bh)) { > lock_buffer(bh); > mark_buffer_async_write(bh); > - } else { > - /* > - * The buffer may have been set dirty during > - * attachment to a dirty page. > - */ > - clear_buffer_dirty(bh); > } > } while ((bh = bh->b_this_page) != head); > SetPageError(page); > -- > 1.6.0.2 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-01 22:36 ` Ying Han @ 2009-04-02 10:11 ` Jan Kara 2009-04-02 11:24 ` Nick Piggin 1 sibling, 0 replies; 60+ messages in thread From: Jan Kara @ 2009-04-02 10:11 UTC (permalink / raw) To: Ying Han Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra Hi Ying, On Wed 01-04-09 15:36:13, Ying Han wrote: > I feel that the problem you saw is kind of differnt than mine. As > you mentioned that you saw the PageError() message, which i don't see > it on my system. I tried you patch(based on 2.6.21) on my system and > it runs ok for 2 days, Still, since i don't see the same error message > as you saw, i am not convineced this is the root cause at least for > our problem. I am still looking into it. > So, are you seeing the PageError() every time the problem happened? Yes, but I agree that your problem is probably different. BTW: How do you reproduce the problem? Honza > On Tue, Mar 24, 2009 at 10:35 AM, Jan Kara <jack@suse.cz> wrote: > > On Tue 24-03-09 16:48:14, Jan Kara wrote: > >> On Wed 25-03-09 02:03:54, Nick Piggin wrote: > >> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote: > >> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote: > >> > > >> > > > I don't think it is a very good idea for block_write_full_page recovery > >> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather > >> > > > be a redirty_page_for_writepage in the case that the buffer is dirty. > >> > > > > >> > > > Perhaps not the cleanest way to solve the problem if it is just due to > >> > > > transient shortage of space in ext3, but generic code shouldn't be > >> > > > allowed to throw away dirty data even if it can't be written back due > >> > > > to some software or hardware error. > >> > > > >> > > Well, that would be one possibility. But then we'd be left with dirty > >> > > pages we cannot ever release since they are constantly dirty (when the > >> > > filesystem really becomes out of space). So what I > >> > > >> > If the filesystem becomes out of space and we have over-committed these > >> > dirty mmapped blocks, then we most definitely want to keep them around. > >> > An error of the system losing a few pages (or if it happens an insanely > >> > large number of times, then slowly dying due to memory leak) is better > >> > than an app suddenly seeing the contents of the page change to nulls > >> > under it when the kernel decides to do some page reclaim. > >> Hmm, probably you're right. Definitely it would be much easier to track > >> the problem down than it is now... Thinking a bit more... But couldn't a > >> malicious user bring the machine easily to OOM this way? That would be > >> unfortunate. > > OK, below is the patch which makes things work for me (i.e. no data > > lost). What do you think? > > > > Honza > > -- > > Jan Kara <jack@suse.cz> > > SUSE Labs, CR > > > > From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001 > > From: Jan Kara <jack@suse.cz> > > Date: Tue, 24 Mar 2009 16:38:22 +0100 > > Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page() > > > > If getblock() fails in block_write_full_page(), we don't want to clear > > dirty bits on buffers. Actually, we even want to redirty the page. This > > way we just won't silently discard users data (written e.g. through mmap) > > in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this > > approach is that if the error is persistent we have this page pinned in > > memory forever and if there are lots of such pages, we can bring the > > machine OOM. > > > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > fs/buffer.c | 10 +++------- > > 1 files changed, 3 insertions(+), 7 deletions(-) > > > > diff --git a/fs/buffer.c b/fs/buffer.c > > index 891e1c7..ae779a0 100644 > > --- a/fs/buffer.c > > +++ b/fs/buffer.c > > @@ -1833,9 +1833,11 @@ recover: > > /* > > * ENOSPC, or some other error. We may already have added some > > * blocks to the file, so we need to write these out to avoid > > - * exposing stale data. > > + * exposing stale data. We redirty the page so that we don't > > + * loose data we are unable to write. > > * The page is currently locked and not marked for writeback > > */ > > + redirty_page_for_writepage(wbc, page); > > bh = head; > > /* Recovery: lock and submit the mapped buffers */ > > do { > > @@ -1843,12 +1845,6 @@ recover: > > !buffer_delay(bh)) { > > lock_buffer(bh); > > mark_buffer_async_write(bh); > > - } else { > > - /* > > - * The buffer may have been set dirty during > > - * attachment to a dirty page. > > - */ > > - clear_buffer_dirty(bh); > > } > > } while ((bh = bh->b_this_page) != head); > > SetPageError(page); > > -- > > 1.6.0.2 > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-01 22:36 ` Ying Han 2009-04-02 10:11 ` Jan Kara @ 2009-04-02 11:24 ` Nick Piggin 2009-04-02 11:34 ` Jan Kara 2009-04-03 0:13 ` Ying Han 1 sibling, 2 replies; 60+ messages in thread From: Nick Piggin @ 2009-04-02 11:24 UTC (permalink / raw) To: Ying Han Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thursday 02 April 2009 09:36:13 Ying Han wrote: > Hi Jan: > I feel that the problem you saw is kind of differnt than mine. As > you mentioned that you saw the PageError() message, which i don't see > it on my system. I tried you patch(based on 2.6.21) on my system and > it runs ok for 2 days, Still, since i don't see the same error message > as you saw, i am not convineced this is the root cause at least for > our problem. I am still looking into it. > So, are you seeing the PageError() every time the problem happened? So I asked if you could test with my workaround of taking truncate_mutex at the start of ext2_get_blocks, and report back. I never heard of any response after that. To reiterate: I was able to reproduce a problem with ext2 (I was testing on brd to get IO rates high enough to reproduce it quite frequently). I think I narrowed the problem down to block allocation or inode block tree corruption because I was unable to reproduce it with that hack in place. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 11:24 ` Nick Piggin @ 2009-04-02 11:34 ` Jan Kara 2009-04-02 15:51 ` Nick Piggin 2009-04-03 0:13 ` Ying Han 1 sibling, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-04-02 11:34 UTC (permalink / raw) To: Nick Piggin Cc: Ying Han, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu 02-04-09 22:24:29, Nick Piggin wrote: > On Thursday 02 April 2009 09:36:13 Ying Han wrote: > > Hi Jan: > > I feel that the problem you saw is kind of differnt than mine. As > > you mentioned that you saw the PageError() message, which i don't see > > it on my system. I tried you patch(based on 2.6.21) on my system and > > it runs ok for 2 days, Still, since i don't see the same error message > > as you saw, i am not convineced this is the root cause at least for > > our problem. I am still looking into it. > > So, are you seeing the PageError() every time the problem happened? > > So I asked if you could test with my workaround of taking truncate_mutex > at the start of ext2_get_blocks, and report back. I never heard of any > response after that. > > To reiterate: I was able to reproduce a problem with ext2 (I was testing > on brd to get IO rates high enough to reproduce it quite frequently). > I think I narrowed the problem down to block allocation or inode block > tree corruption because I was unable to reproduce it with that hack in > place. Nick, what load did you use for reproduction? I'll try to reproduce it here so that I can debug ext2... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 11:34 ` Jan Kara @ 2009-04-02 15:51 ` Nick Piggin 2009-04-02 17:44 ` Ying Han 0 siblings, 1 reply; 60+ messages in thread From: Nick Piggin @ 2009-04-02 15:51 UTC (permalink / raw) To: Jan Kara Cc: Ying Han, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thursday 02 April 2009 22:34:01 Jan Kara wrote: > On Thu 02-04-09 22:24:29, Nick Piggin wrote: > > On Thursday 02 April 2009 09:36:13 Ying Han wrote: > > > Hi Jan: > > > I feel that the problem you saw is kind of differnt than mine. As > > > you mentioned that you saw the PageError() message, which i don't see > > > it on my system. I tried you patch(based on 2.6.21) on my system and > > > it runs ok for 2 days, Still, since i don't see the same error message > > > as you saw, i am not convineced this is the root cause at least for > > > our problem. I am still looking into it. > > > So, are you seeing the PageError() every time the problem happened? > > > > So I asked if you could test with my workaround of taking truncate_mutex > > at the start of ext2_get_blocks, and report back. I never heard of any > > response after that. > > > > To reiterate: I was able to reproduce a problem with ext2 (I was testing > > on brd to get IO rates high enough to reproduce it quite frequently). > > I think I narrowed the problem down to block allocation or inode block > > tree corruption because I was unable to reproduce it with that hack in > > place. > Nick, what load did you use for reproduction? I'll try to reproduce it > here so that I can debug ext2... OK, I set up the filesystem like this: modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock mkfs.ext2 -b1024 /dev/ram0 #1K buffers Test is basically unmodified except I use 64MB files, and start 8 of them at once to (8 core system, so improve chances of hitting the bug). Although I do see it with only 1 running it takes longer to trigger. I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't know if that really helps speed up reproducing it. It is quite random to hit, but I was able to hit it IIRC in under a minute with that setup. ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 15:51 ` Nick Piggin @ 2009-04-02 17:44 ` Ying Han 2009-04-02 22:52 ` Ying Han 0 siblings, 1 reply; 60+ messages in thread From: Ying Han @ 2009-04-02 17:44 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Thursday 02 April 2009 22:34:01 Jan Kara wrote: >> On Thu 02-04-09 22:24:29, Nick Piggin wrote: >> > On Thursday 02 April 2009 09:36:13 Ying Han wrote: >> > > Hi Jan: >> > > I feel that the problem you saw is kind of differnt than mine. As >> > > you mentioned that you saw the PageError() message, which i don't see >> > > it on my system. I tried you patch(based on 2.6.21) on my system and >> > > it runs ok for 2 days, Still, since i don't see the same error message >> > > as you saw, i am not convineced this is the root cause at least for >> > > our problem. I am still looking into it. >> > > So, are you seeing the PageError() every time the problem happened? >> > >> > So I asked if you could test with my workaround of taking truncate_mutex >> > at the start of ext2_get_blocks, and report back. I never heard of any >> > response after that. >> > >> > To reiterate: I was able to reproduce a problem with ext2 (I was testing >> > on brd to get IO rates high enough to reproduce it quite frequently). >> > I think I narrowed the problem down to block allocation or inode block >> > tree corruption because I was unable to reproduce it with that hack in >> > place. >> Nick, what load did you use for reproduction? I'll try to reproduce it >> here so that I can debug ext2... > > OK, I set up the filesystem like this: > > modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers > dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock > mkfs.ext2 -b1024 /dev/ram0 #1K buffers > > Test is basically unmodified except I use 64MB files, and start 8 of them > at once to (8 core system, so improve chances of hitting the bug). Although I > do see it with only 1 running it takes longer to trigger. > > I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't > know if that really helps speed up reproducing it. It is quite random to hit, > but I was able to hit it IIRC in under a minute with that setup. > Here is how i reproduce it: Filesystem is ext2 with blocksize 4096 Fill up the ram with 95% anon memory and mlockall ( put enough memory pressure which will trigger page reclaim and background writeout) Run one thread of the test program and i will see "bad pages" within few minutes. --Ying ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 17:44 ` Ying Han @ 2009-04-02 22:52 ` Ying Han 2009-04-02 23:39 ` Jan Kara 0 siblings, 1 reply; 60+ messages in thread From: Ying Han @ 2009-04-02 22:52 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote: > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote: >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote: >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote: >>> > > Hi Jan: >>> > > I feel that the problem you saw is kind of differnt than mine. As >>> > > you mentioned that you saw the PageError() message, which i don't see >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and >>> > > it runs ok for 2 days, Still, since i don't see the same error message >>> > > as you saw, i am not convineced this is the root cause at least for >>> > > our problem. I am still looking into it. >>> > > So, are you seeing the PageError() every time the problem happened? >>> > >>> > So I asked if you could test with my workaround of taking truncate_mutex >>> > at the start of ext2_get_blocks, and report back. I never heard of any >>> > response after that. >>> > >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing >>> > on brd to get IO rates high enough to reproduce it quite frequently). >>> > I think I narrowed the problem down to block allocation or inode block >>> > tree corruption because I was unable to reproduce it with that hack in >>> > place. >>> Nick, what load did you use for reproduction? I'll try to reproduce it >>> here so that I can debug ext2... >> >> OK, I set up the filesystem like this: >> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers >> >> Test is basically unmodified except I use 64MB files, and start 8 of them >> at once to (8 core system, so improve chances of hitting the bug). Although I >> do see it with only 1 running it takes longer to trigger. >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't >> know if that really helps speed up reproducing it. It is quite random to hit, >> but I was able to hit it IIRC in under a minute with that setup. >> > > Here is how i reproduce it: > Filesystem is ext2 with blocksize 4096 > Fill up the ram with 95% anon memory and mlockall ( put enough memory > pressure which will trigger page reclaim and background writeout) > Run one thread of the test program > > and i will see "bad pages" within few minutes. And here is the "top" and stdout while it is getting "bad pages" top PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1 stdout: while true; do ./ftruncate_mmap; done Running 852 bad page Running 315 bad page Running 999 bad page Running 482 bad page Running 24 bad page --Ying > > --Ying > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 22:52 ` Ying Han @ 2009-04-02 23:39 ` Jan Kara 2009-04-03 0:25 ` Ying Han 2009-04-03 1:29 ` Ying Han 0 siblings, 2 replies; 60+ messages in thread From: Jan Kara @ 2009-04-02 23:39 UTC (permalink / raw) To: Ying Han Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra [-- Attachment #1: Type: text/plain, Size: 3540 bytes --] On Thu 02-04-09 15:52:19, Ying Han wrote: > On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote: > > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote: > >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote: > >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote: > >>> > > Hi Jan: > >>> > > I feel that the problem you saw is kind of differnt than mine. As > >>> > > you mentioned that you saw the PageError() message, which i don't see > >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and > >>> > > it runs ok for 2 days, Still, since i don't see the same error message > >>> > > as you saw, i am not convineced this is the root cause at least for > >>> > > our problem. I am still looking into it. > >>> > > So, are you seeing the PageError() every time the problem happened? > >>> > > >>> > So I asked if you could test with my workaround of taking truncate_mutex > >>> > at the start of ext2_get_blocks, and report back. I never heard of any > >>> > response after that. > >>> > > >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing > >>> > on brd to get IO rates high enough to reproduce it quite frequently). > >>> > I think I narrowed the problem down to block allocation or inode block > >>> > tree corruption because I was unable to reproduce it with that hack in > >>> > place. > >>> Nick, what load did you use for reproduction? I'll try to reproduce it > >>> here so that I can debug ext2... > >> > >> OK, I set up the filesystem like this: > >> > >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers > >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock > >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers > >> > >> Test is basically unmodified except I use 64MB files, and start 8 of them > >> at once to (8 core system, so improve chances of hitting the bug). Although I > >> do see it with only 1 running it takes longer to trigger. > >> > >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't > >> know if that really helps speed up reproducing it. It is quite random to hit, > >> but I was able to hit it IIRC in under a minute with that setup. > >> > > > > Here is how i reproduce it: > > Filesystem is ext2 with blocksize 4096 > > Fill up the ram with 95% anon memory and mlockall ( put enough memory > > pressure which will trigger page reclaim and background writeout) > > Run one thread of the test program > > > > and i will see "bad pages" within few minutes. > > And here is the "top" and stdout while it is getting "bad pages" > top > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem > 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap > 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0 > 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1 > > stdout: > > while true; do > ./ftruncate_mmap; > done > Running 852 bad page > Running 315 bad page > Running 999 bad page > Running 482 bad page > Running 24 bad page Thanks, for the help. I've debugged the problem to a bug in ext2_get_block(). I've already sent out a patch which should fix the issue (at least it fixes the problem for me). The fix is also attached if you want to try it. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR [-- Attachment #2: 0001-ext2-Fix-data-corruption-for-racing-writes.patch --] [-- Type: text/x-patch, Size: 4340 bytes --] >From f6e454b1f9fe90d023b30daf0ae7e081580c3bb2 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Fri, 3 Apr 2009 00:24:28 +0200 Subject: [PATCH] ext2: Fix data corruption for racing writes If two writers allocating blocks to file race with each other (e.g. because writepages races with ordinary write or two writepages race with each other), ext2_getblock() can be called on the same inode in parallel. Before we are going to allocate new blocks, we have to recheck the block chain we have obtained so far without holding truncate_mutex. Otherwise we could overwrite the indirect block pointer set by the other writer leading to data loss. The below test program by Ying is able to reproduce the data loss with ext2 on in BRD in a few minutes if the machine is under memory pressure: long kMemSize = 50 << 20; int kPageSize = 4096; int main(int argc, char **argv) { int status; int count = 0; int i; char *fname = "/mnt/test.mmap"; char *mem; unlink(fname); int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600); status = ftruncate(fd, kMemSize); mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); // Fill the memory with 1s. memset(mem, 1, kMemSize); sleep(2); for (i = 0; i < kMemSize; i++) { int byte_good = mem[i] != 0; if (!byte_good && ((i % kPageSize) == 0)) { //printf("%d ", i / kPageSize); count++; } } munmap(mem, kMemSize); close(fd); unlink(fname); if (count > 0) { printf("Running %d bad page\n", count); return 1; } return 0; } CC: Ying Han <yinghan@google.com> CC: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext2/inode.c | 44 +++++++++++++++++++++++++++++++++----------- 1 files changed, 33 insertions(+), 11 deletions(-) diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index b43b955..acf6788 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -590,9 +590,8 @@ static int ext2_get_blocks(struct inode *inode, if (depth == 0) return (err); -reread: - partial = ext2_get_branch(inode, depth, offsets, chain, &err); + partial = ext2_get_branch(inode, depth, offsets, chain, &err); /* Simplest case - block found, no allocation needed */ if (!partial) { first_block = le32_to_cpu(chain[depth - 1].key); @@ -602,15 +601,16 @@ reread: while (count < maxblocks && count <= blocks_to_boundary) { ext2_fsblk_t blk; - if (!verify_chain(chain, partial)) { + if (!verify_chain(chain, chain + depth - 1)) { /* * Indirect block might be removed by * truncate while we were reading it. * Handling of that case: forget what we've * got now, go to reread. */ + err = -EAGAIN; count = 0; - goto changed; + break; } blk = le32_to_cpu(*(chain[depth-1].p + count)); if (blk == first_block + count) @@ -618,7 +618,8 @@ reread: else break; } - goto got_it; + if (err != -EAGAIN) + goto got_it; } /* Next simple case - plain lookup or failed read of indirect block */ @@ -626,6 +627,33 @@ reread: goto cleanup; mutex_lock(&ei->truncate_mutex); + /* + * If the indirect block is missing while we are reading + * the chain(ext3_get_branch() returns -EAGAIN err), or + * if the chain has been changed after we grab the semaphore, + * (either because another process truncated this branch, or + * another get_block allocated this branch) re-grab the chain to see if + * the request block has been allocated or not. + * + * Since we already block the truncate/other get_block + * at this point, we will have the current copy of the chain when we + * splice the branch into the tree. + */ + if (err == -EAGAIN || !verify_chain(chain, partial)) { + while (partial > chain) { + brelse(partial->bh); + partial--; + } + partial = ext2_get_branch(inode, depth, offsets, chain, &err); + if (!partial) { + count++; + mutex_unlock(&ei->truncate_mutex); + if (err) + goto cleanup; + clear_buffer_new(bh_result); + goto got_it; + } + } /* * Okay, we need to do block allocation. Lazily initialize the block @@ -683,12 +711,6 @@ cleanup: partial--; } return err; -changed: - while (partial > chain) { - brelse(partial->bh); - partial--; - } - goto reread; } int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create) -- 1.6.0.2 ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 23:39 ` Jan Kara @ 2009-04-03 0:25 ` Ying Han 2009-04-03 1:29 ` Ying Han 1 sibling, 0 replies; 60+ messages in thread From: Ying Han @ 2009-04-03 0:25 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote: > On Thu 02-04-09 15:52:19, Ying Han wrote: >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote: >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote: >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote: >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote: >> >>> > > Hi Jan: >> >>> > > I feel that the problem you saw is kind of differnt than mine. As >> >>> > > you mentioned that you saw the PageError() message, which i don't see >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message >> >>> > > as you saw, i am not convineced this is the root cause at least for >> >>> > > our problem. I am still looking into it. >> >>> > > So, are you seeing the PageError() every time the problem happened? >> >>> > >> >>> > So I asked if you could test with my workaround of taking truncate_mutex >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any >> >>> > response after that. >> >>> > >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing >> >>> > on brd to get IO rates high enough to reproduce it quite frequently). >> >>> > I think I narrowed the problem down to block allocation or inode block >> >>> > tree corruption because I was unable to reproduce it with that hack in >> >>> > place. >> >>> Nick, what load did you use for reproduction? I'll try to reproduce it >> >>> here so that I can debug ext2... >> >> >> >> OK, I set up the filesystem like this: >> >> >> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers >> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock >> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers >> >> >> >> Test is basically unmodified except I use 64MB files, and start 8 of them >> >> at once to (8 core system, so improve chances of hitting the bug). Although I >> >> do see it with only 1 running it takes longer to trigger. >> >> >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't >> >> know if that really helps speed up reproducing it. It is quite random to hit, >> >> but I was able to hit it IIRC in under a minute with that setup. >> >> >> > >> > Here is how i reproduce it: >> > Filesystem is ext2 with blocksize 4096 >> > Fill up the ram with 95% anon memory and mlockall ( put enough memory >> > pressure which will trigger page reclaim and background writeout) >> > Run one thread of the test program >> > >> > and i will see "bad pages" within few minutes. >> >> And here is the "top" and stdout while it is getting "bad pages" >> top >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem >> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap >> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0 >> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1 >> >> stdout: >> >> while true; do >> ./ftruncate_mmap; >> done >> Running 852 bad page >> Running 315 bad page >> Running 999 bad page >> Running 482 bad page >> Running 24 bad page > Thanks, for the help. I've debugged the problem to a bug in > ext2_get_block(). I've already sent out a patch which should fix the issue > (at least it fixes the problem for me). > The fix is also attached if you want to try it. I just did a quick run after applied the patch, unforturnately, my problem is still there... --Ying > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 23:39 ` Jan Kara 2009-04-03 0:25 ` Ying Han @ 2009-04-03 1:29 ` Ying Han 2009-04-03 9:41 ` Jan Kara 1 sibling, 1 reply; 60+ messages in thread From: Ying Han @ 2009-04-03 1:29 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote: > On Thu 02-04-09 15:52:19, Ying Han wrote: >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote: >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote: >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote: >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote: >> >>> > > Hi Jan: >> >>> > > I feel that the problem you saw is kind of differnt than mine. As >> >>> > > you mentioned that you saw the PageError() message, which i don't see >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message >> >>> > > as you saw, i am not convineced this is the root cause at least for >> >>> > > our problem. I am still looking into it. >> >>> > > So, are you seeing the PageError() every time the problem happened? >> >>> > >> >>> > So I asked if you could test with my workaround of taking truncate_mutex >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any >> >>> > response after that. >> >>> > >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing >> >>> > on brd to get IO rates high enough to reproduce it quite frequently). >> >>> > I think I narrowed the problem down to block allocation or inode block >> >>> > tree corruption because I was unable to reproduce it with that hack in >> >>> > place. >> >>> Nick, what load did you use for reproduction? I'll try to reproduce it >> >>> here so that I can debug ext2... >> >> >> >> OK, I set up the filesystem like this: >> >> >> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers >> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock >> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers >> >> >> >> Test is basically unmodified except I use 64MB files, and start 8 of them >> >> at once to (8 core system, so improve chances of hitting the bug). Although I >> >> do see it with only 1 running it takes longer to trigger. >> >> >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't >> >> know if that really helps speed up reproducing it. It is quite random to hit, >> >> but I was able to hit it IIRC in under a minute with that setup. >> >> >> > >> > Here is how i reproduce it: >> > Filesystem is ext2 with blocksize 4096 >> > Fill up the ram with 95% anon memory and mlockall ( put enough memory >> > pressure which will trigger page reclaim and background writeout) >> > Run one thread of the test program >> > >> > and i will see "bad pages" within few minutes. >> >> And here is the "top" and stdout while it is getting "bad pages" >> top >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem >> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap >> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0 >> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1 >> >> stdout: >> >> while true; do >> ./ftruncate_mmap; >> done >> Running 852 bad page >> Running 315 bad page >> Running 999 bad page >> Running 482 bad page >> Running 24 bad page > Thanks, for the help. I've debugged the problem to a bug in > ext2_get_block(). I've already sent out a patch which should fix the issue > (at least it fixes the problem for me). > The fix is also attached if you want to try it. hmm, now i do see that get_block() returns ENOSPC by printk the err. So did you applied the patch which redirty_page_for_writepage as well as this one together? I will start the test with kernel applied both patches and leave it for running. > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-03 1:29 ` Ying Han @ 2009-04-03 9:41 ` Jan Kara 2009-04-03 21:34 ` Ying Han 0 siblings, 1 reply; 60+ messages in thread From: Jan Kara @ 2009-04-03 9:41 UTC (permalink / raw) To: Ying Han Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu 02-04-09 18:29:21, Ying Han wrote: > On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote: > > On Thu 02-04-09 15:52:19, Ying Han wrote: > >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote: > >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote: > >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote: > >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote: > >> >>> > > Hi Jan: > >> >>> > > I feel that the problem you saw is kind of differnt than mine. As > >> >>> > > you mentioned that you saw the PageError() message, which i don't see > >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and > >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message > >> >>> > > as you saw, i am not convineced this is the root cause at least for > >> >>> > > our problem. I am still looking into it. > >> >>> > > So, are you seeing the PageError() every time the problem happened? > >> >>> > > >> >>> > So I asked if you could test with my workaround of taking truncate_mutex > >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any > >> >>> > response after that. > >> >>> > > >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing > >> >>> > on brd to get IO rates high enough to reproduce it quite frequently). > >> >>> > I think I narrowed the problem down to block allocation or inode block > >> >>> > tree corruption because I was unable to reproduce it with that hack in > >> >>> > place. > >> >>> Nick, what load did you use for reproduction? I'll try to reproduce it > >> >>> here so that I can debug ext2... > >> >> > >> >> OK, I set up the filesystem like this: > >> >> > >> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers > >> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock > >> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers > >> >> > >> >> Test is basically unmodified except I use 64MB files, and start 8 of them > >> >> at once to (8 core system, so improve chances of hitting the bug). Although I > >> >> do see it with only 1 running it takes longer to trigger. > >> >> > >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't > >> >> know if that really helps speed up reproducing it. It is quite random to hit, > >> >> but I was able to hit it IIRC in under a minute with that setup. > >> >> > >> > > >> > Here is how i reproduce it: > >> > Filesystem is ext2 with blocksize 4096 > >> > Fill up the ram with 95% anon memory and mlockall ( put enough memory > >> > pressure which will trigger page reclaim and background writeout) > >> > Run one thread of the test program > >> > > >> > and i will see "bad pages" within few minutes. > >> > >> And here is the "top" and stdout while it is getting "bad pages" > >> top > >> > >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem > >> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap > >> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0 > >> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1 > >> > >> stdout: > >> > >> while true; do > >> ./ftruncate_mmap; > >> done > >> Running 852 bad page > >> Running 315 bad page > >> Running 999 bad page > >> Running 482 bad page > >> Running 24 bad page > > Thanks, for the help. I've debugged the problem to a bug in > > ext2_get_block(). I've already sent out a patch which should fix the issue > > (at least it fixes the problem for me). > > The fix is also attached if you want to try it. > > hmm, now i do see that get_block() returns ENOSPC by printk the err. > So did you applied the patch which redirty_page_for_writepage as well > as this one together? No, my patch contained only a fix in ext2_get_block(). When you see ENOSPC, that's a completely separate issue. You may apply that patch but with ext2 it would be enough to make the file fit the ram disk. I.e. first try with dd how big file fits there and then run your tester with at most as big file so that you don't hit ENOSPC... > I will start the test with kernel applied both patches and leave it for running. OK. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-03 9:41 ` Jan Kara @ 2009-04-03 21:34 ` Ying Han 0 siblings, 0 replies; 60+ messages in thread From: Ying Han @ 2009-04-03 21:34 UTC (permalink / raw) To: Jan Kara Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Fri, Apr 3, 2009 at 2:41 AM, Jan Kara <jack@suse.cz> wrote: > On Thu 02-04-09 18:29:21, Ying Han wrote: >> On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote: >> > On Thu 02-04-09 15:52:19, Ying Han wrote: >> >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote: >> >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote: >> >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote: >> >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote: >> >> >>> > > Hi Jan: >> >> >>> > > I feel that the problem you saw is kind of differnt than mine. As >> >> >>> > > you mentioned that you saw the PageError() message, which i don't see >> >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and >> >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message >> >> >>> > > as you saw, i am not convineced this is the root cause at least for >> >> >>> > > our problem. I am still looking into it. >> >> >>> > > So, are you seeing the PageError() every time the problem happened? >> >> >>> > >> >> >>> > So I asked if you could test with my workaround of taking truncate_mutex >> >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any >> >> >>> > response after that. >> >> >>> > >> >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing >> >> >>> > on brd to get IO rates high enough to reproduce it quite frequently). >> >> >>> > I think I narrowed the problem down to block allocation or inode block >> >> >>> > tree corruption because I was unable to reproduce it with that hack in >> >> >>> > place. >> >> >>> Nick, what load did you use for reproduction? I'll try to reproduce it >> >> >>> here so that I can debug ext2... >> >> >> >> >> >> OK, I set up the filesystem like this: >> >> >> >> >> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers >> >> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock >> >> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers >> >> >> >> >> >> Test is basically unmodified except I use 64MB files, and start 8 of them >> >> >> at once to (8 core system, so improve chances of hitting the bug). Although I >> >> >> do see it with only 1 running it takes longer to trigger. >> >> >> >> >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't >> >> >> know if that really helps speed up reproducing it. It is quite random to hit, >> >> >> but I was able to hit it IIRC in under a minute with that setup. >> >> >> >> >> > >> >> > Here is how i reproduce it: >> >> > Filesystem is ext2 with blocksize 4096 >> >> > Fill up the ram with 95% anon memory and mlockall ( put enough memory >> >> > pressure which will trigger page reclaim and background writeout) >> >> > Run one thread of the test program >> >> > >> >> > and i will see "bad pages" within few minutes. >> >> >> >> And here is the "top" and stdout while it is getting "bad pages" >> >> top >> >> >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> >> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem >> >> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap >> >> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0 >> >> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1 >> >> >> >> stdout: >> >> >> >> while true; do >> >> ./ftruncate_mmap; >> >> done >> >> Running 852 bad page >> >> Running 315 bad page >> >> Running 999 bad page >> >> Running 482 bad page >> >> Running 24 bad page >> > Thanks, for the help. I've debugged the problem to a bug in >> > ext2_get_block(). I've already sent out a patch which should fix the issue >> > (at least it fixes the problem for me). >> > The fix is also attached if you want to try it. >> >> hmm, now i do see that get_block() returns ENOSPC by printk the err. >> So did you applied the patch which redirty_page_for_writepage as well >> as this one together? > No, my patch contained only a fix in ext2_get_block(). When you see > ENOSPC, that's a completely separate issue. You may apply that patch but > with ext2 it would be enough to make the file fit the ram disk. I.e. first > try with dd how big file fits there and then run your tester with at most > as big file so that you don't hit ENOSPC... > >> I will start the test with kernel applied both patches and leave it for running. > OK. I applied the patch(based on 2.6.26) in the attachment and the test itself runs fine so far without reporting "bad pages", however, i seems get deadlock in the varlog, here is the message and i turned on lockdep. kernel: 1 lock held by kswapd1/264: kernel: #0: (&ei->truncate_mutex){--..}, at: [<ffffffff8031d529>] ext2_get_block+0x109/0x960 kernel: INFO: task ftruncate_mmap:2950 blocked for more than 120 seconds. kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: ftruncate_mma D ffff81047e733a80 0 2950 2858 kernel: ffff8101798516f8 0000000000000092 0000000000000000 0000000000000046 kernel: ffff81047e0a1260 ffff81047f070000 ffff81047e0a15c0 0000000100130c66 kernel: 00000000ffffffff ffffffff8025740d 0000000000000000 0000000000000000 kernel: Call Trace: kernel: [<ffffffff8025740d>] mark_held_locks+0x3d/0x80 kernel: [<ffffffff804d78bd>] mutex_lock_nested+0x14d/0x280 kernel: [<ffffffff804d7855>] mutex_lock_nested+0xe5/0x280 kernel: [<ffffffff8031d529>] ext2_get_block+0x109/0x960 kernel: [<ffffffff802ca2e3>] create_empty_buffers+0x43/0xb0 kernel: [<ffffffff802ca2e3>] create_empty_buffers+0x43/0xb0 kernel: [<ffffffff802ca217>] alloc_page_buffers+0x97/0x120 kernel: [<ffffffff802cbfb6>] __block_write_full_page+0x206/0x320 kernel: [<ffffffff802cbe70>] __block_write_full_page+0xc0/0x320 kernel: [<ffffffff8031d420>] ext2_get_block+0x0/0x960 kernel: [<ffffffff8027c74e>] shrink_page_list+0x4fe/0x650 kernel: [<ffffffff80257ee8>] __lock_acquire+0x3b8/0x1080 kernel: [<ffffffff8027be18>] isolate_lru_pages+0x88/0x230 kernel: [<ffffffff8027c9ea>] shrink_inactive_list+0x14a/0x3f0 kernel: [<ffffffff8027cd43>] shrink_zone+0xb3/0x130 kernel: [<ffffffff80249e90>] autoremove_wake_function+0x0/0x30 kernel: [<ffffffff8027d1a8>] try_to_free_pages+0x268/0x3e0 kernel: [<ffffffff8027bfc0>] isolate_pages_global+0x0/0x40 kernel: [<ffffffff802774f7>] __alloc_pages_internal+0x1d7/0x4a0 kernel: [<ffffffff80279b94>] __do_page_cache_readahead+0x124/0x270 kernel: [<ffffffff8027314f>] filemap_fault+0x18f/0x400 kernel: [<ffffffff80280925>] __do_fault+0x65/0x450 kernel: [<ffffffff80257ee8>] __lock_acquire+0x3b8/0x1080 kernel: [<ffffffff803475dd>] __down_read_trylock+0x1d/0x60 kernel: [<ffffffff8028389a>] handle_mm_fault+0x18a/0x7a0 kernel: [<ffffffff804dba1c>] do_page_fault+0x29c/0x930 kernel: [<ffffffff804d8b46>] trace_hardirqs_on_thunk+0x35/0x3a kernel: [<ffffffff804d94dd>] error_exit+0x0/0xa9 kernel: kernel: 2 locks held by ftruncate_mmap/2950: kernel: #0: (&mm->mmap_sem){----}, at: [<ffffffff804db9af>] do_page_fault+0x22f/0x930 kernel: #1: (&ei->truncate_mutex){--..}, at: [<ffffffff8031d529>] ext2_get_block+0x109/0x960 --Ying > > Honza > -- > Jan Kara <jack@suse.cz> > SUSE Labs, CR > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-04-02 11:24 ` Nick Piggin 2009-04-02 11:34 ` Jan Kara @ 2009-04-03 0:13 ` Ying Han 1 sibling, 0 replies; 60+ messages in thread From: Ying Han @ 2009-04-03 0:13 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Thu, Apr 2, 2009 at 4:24 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Thursday 02 April 2009 09:36:13 Ying Han wrote: >> Hi Jan: >> I feel that the problem you saw is kind of differnt than mine. As >> you mentioned that you saw the PageError() message, which i don't see >> it on my system. I tried you patch(based on 2.6.21) on my system and >> it runs ok for 2 days, Still, since i don't see the same error message >> as you saw, i am not convineced this is the root cause at least for >> our problem. I am still looking into it. >> So, are you seeing the PageError() every time the problem happened? > > So I asked if you could test with my workaround of taking truncate_mutex > at the start of ext2_get_blocks, and report back. I never heard of any > response after that. I applied the change and still get the same issue, unless i didn't do the right thing, here is the patch i applied, which put the truncate_mutex at the beginning of ext2_get_blocks. diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 384fc0d..94cf773 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -586,10 +586,13 @@ static int ext2_get_blocks(struct inode *inode, int count = 0; ext2_fsblk_t first_block = 0; + mutex_lock(&ei->truncate_mutex); depth = ext2_block_to_path(inode,iblock,offsets,&blocks_to_boundary); - if (depth == 0) + if (depth == 0) { + mutex_unlock(&ei->truncate_mutex); return (err); + } reread: partial = ext2_get_branch(inode, depth, offsets, chain, &err); @@ -625,7 +628,7 @@ reread: if (!create || err == -EIO) goto cleanup; - mutex_lock(&ei->truncate_mutex); /* * Okay, we need to do block allocation. Lazily initialize the block @@ -651,7 +654,7 @@ reread: offsets + (partial - chain), partial); if (err) { - mutex_unlock(&ei->truncate_mutex); goto cleanup; } @@ -662,13 +665,13 @@ reread: err = ext2_clear_xip_target (inode, le32_to_cpu(chain[depth-1].key)); if (err) { - mutex_unlock(&ei->truncate_mutex); goto cleanup; } } ext2_splice_branch(inode, iblock, partial, indirect_blks, count); - mutex_unlock(&ei->truncate_mutex); set_buffer_new(bh_result); got_it: map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key)); @@ -678,6 +681,7 @@ got_it: /* Clean up and exit */ partial = chain + depth - 1; /* the whole chain */ cleanup: + mutex_unlock(&ei->truncate_mutex); while (partial > chain) { brelse(partial->bh); partial--; --Ying > > To reiterate: I was able to reproduce a problem with ext2 (I was testing > on brd to get IO rates high enough to reproduce it quite frequently). > I think I narrowed the problem down to block allocation or inode block > tree corruption because I was unable to reproduce it with that hack in > place. > > ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-24 7:44 ` Nick Piggin ` (2 preceding siblings ...) 2009-03-24 12:39 ` Jan Kara @ 2009-03-27 20:35 ` Ying Han 3 siblings, 0 replies; 60+ messages in thread From: Ying Han @ 2009-03-27 20:35 UTC (permalink / raw) To: Nick Piggin Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra On Tue, Mar 24, 2009 at 12:44 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > On Friday 20 March 2009 03:46:39 Jan Kara wrote: >> On Fri 20-03-09 02:48:21, Nick Piggin wrote: > >> > Holding mapping->private_lock over the __set_page_dirty should >> > fix it, although I guess you'd want to release it before calling >> > __mark_inode_dirty so as not to put inode_lock under there. I >> > have a patch for this if it sounds reasonable. >> >> Yes, that seems to be a bug - the function actually looked suspitious to >> me yesterday but I somehow convinced myself that it's fine. Probably >> because fsx-linux is single-threaded. > > > After a whole lot of chasing my own tail in the VM and buffer layers, > I think it is a problem in ext2 (and I haven't been able to reproduce > with ext3 yet, which might lend weight to that, although as we have > seen, it is very timing dependent). > > That would be slightly unfortunate because we still have Jan's ext3 > problem, and also another reported problem of corruption on ext3 (on > brd driver). I believe i see the same issue on ext2 as well as ext4. > > Anyway, when I have reproduced the problem with the test case, the > "lost" writes are all reported to be holes. Unfortunately, that doesn't > point straight to the filesystem, because ext2 allocates blocks in this > case at writeout time, so if dirty bits are getting lost, then it would > be normal to see holes. > > I then put in a whole lot of extra infrastructure to track metadata about > each struct page (when it was last written out, when it last had the number > of writable ptes reach 0, when the dirty bits were last cleared etc). And > none of the normal asertions were triggering: eg. when any page is removed > from pagecache (except truncates), it has always had all its buffers > written out *after* all ptes were made readonly or unmapped. Lots of other > tests and crap like that. Do you think there might be a race in the page reclaim path? I did a hack which commeted out wakeup_pdflush in try_to_free_pages ( based on 2.6.21, just randomly picked on has the problem) It runs for couple of hours and the problem not happened yet. I am not sure if that is the problem or not, and i will leave it running. The reason i tried the hack since i reproduce the "bad pages" easily everytime i put more memory pressure on the system. diff --git a/mm/vmscan.c b/mm/vmscan.c index db023e2..b4b7e1f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1067,11 +1067,13 @@ unsigned long try_to_free_pages(struct zone **zones, g * that's undesirable in laptop mode, where we *want* lumpy * writeout. So in laptop mode, write out the whole world. */ +/* if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max / 2) { wakeup_pdflush(laptop_mode ? 0 : total_scanned); sc.may_writepage = 1; } +*/ /* Take a nap, wait for some writeback to complete */ if (sc.nr_scanned && priority < DEF_PRIORITY - 2) > > So I tried what I should have done to start with and did an e2fsck after > seeing corruption. Yes, it comes up with errors. Now that is unusual > because that should be largely insulated from the vm: if a dirty bit gets > lost, then the filesystem image should be quite happy and error-free with > a hole or unwritten data there. > > I don't know ext? locking very well, except that it looks pretty overly > complex and crufty. > > Usually, blocks are instantiated by write(2), under i_mutex, serialising > the allocator somewhat. mmap-write blocks are instantiated at writeout > time, unserialised. I moved truncate_mutex to cover the entire get_blocks > function, and can no longer trigger the problem. Might be a timing issue > though -- Ying, can you try this and see if you can still reproduce? > > I close my eyes and pick something out of a hat. a686cd89. Search for XXX. > Nice. Whether or not this cased the problem, can someone please tell me > why it got merged in that state? > > I'm leaving ext3 running for now. It looks like a nasty task to bisect > ext2 down to that commit :( and I would be more interested in trying to > reproduce Jan's ext3 problem, however, because I'm not too interested in > diving into ext2 locking to work out exactly what is racing and how to > fix it properly. I suspect it would be most productive to wire up some > ioctls right into the block allocator/lookup and code up a userspace > tester for it that could probably stress it a lot harder than kernel > writeout can. > > ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 22:40 ` Linus Torvalds 2009-03-18 23:18 ` Ying Han @ 2009-03-20 0:34 ` Ying Han 2009-03-20 0:49 ` Linus Torvalds 2009-03-25 23:15 ` Ying Han 2 siblings, 1 reply; 60+ messages in thread From: Ying Han @ 2009-03-20 0:34 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Wed, Mar 18, 2009 at 3:40 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 18 Mar 2009, Andrew Morton wrote: > >> On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote: >> > >> > The "bad pages" count differs each time from one digit to 4,5 digit >> > for 128M ftruncated file. and what i also found that the bad page >> > number are contiguous for each segment which total bad pages container >> > several segments. ext "1-4, 9-20, 48-50" ( batch flushing ? ) > > Yeah, probably the batched write-out. > > Can you say what filesystem, and what mount-flags you use? Iirc, last time > we had MAP_SHARED lost writes it was at least partly triggered by the > filesystem doing its own flushing independently of the VM (ie ext3 with > "data=journal", I think), so that kind of thing does tend to matter. > > See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d. > >> > (The failure is reproduced based on 2.6.29-rc8, also happened on >> > 2.6.18 kernel. . Here is the simple test case to reproduce it with >> > memory pressure. ) >> >> Thanks. This will be a regression - the testing I did back in the days >> when I actually wrote stuff would have picked this up. >> >> Perhaps it is a 2.6.17 thing. Which, IIRC, is when we made the changes to >> redirty pages on each write fault. Or maybe it was something else. > > Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking > about Peter's exact dirty page tracking. If I recall correctly, that > became then 2.6.19, and then had the horrible mm dirty bit loss that > triggered in librtorrent downloads, which got fixed sometime after 2.6.20 > (and back-ported). > > So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not > related to the exact dirty tracking. > > The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm: > tracking shared dirty pages"), but maybe you had something else in mind? > >> Given the amount of time for which this bug has existed, I guess it isn't a >> 2.6.29 blocker, but once we've found out the cause we should have a little >> post-mortem to work out how a bug of this nature has gone undetected for so >> long. > > I'm somewhat surprised, because this test-program looks like a very simple > version of the exact one that I used to track down the 2.6.20 mmap > corruption problems. And that one got pretty heavily tested back then, > when people were looking at it (December 2006) and then when trying out my > fix for it. > > Ying Han - since you're all set up for testing this and have reproduced it > on multiple kernels, can you try it on a few more kernel versions? It > would be interesting to both go further back in time (say 2.6.15-ish), > _and_ check something like 2.6.21 which had the exact dirty accounting > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that > was fixed for a while. I tried 2.6.24 for couple of hours and the problem not happening yet. While the same test on 2.6.25, the problem happen right away. > > Linus > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-20 0:34 ` Ying Han @ 2009-03-20 0:49 ` Linus Torvalds 2009-03-20 7:00 ` Ying Han 0 siblings, 1 reply; 60+ messages in thread From: Linus Torvalds @ 2009-03-20 0:49 UTC (permalink / raw) To: Ying Han Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Thu, 19 Mar 2009, Ying Han wrote: > > > > Ying Han - since you're all set up for testing this and have reproduced it > > on multiple kernels, can you try it on a few more kernel versions? It > > would be interesting to both go further back in time (say 2.6.15-ish), > > _and_ check something like 2.6.21 which had the exact dirty accounting > > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that > > was fixed for a while. > > I tried 2.6.24 for couple of hours and the problem not happening yet. While > the same test on 2.6.25, the problem happen right away. Ok, so 2.6.25 is known bad. Can you test 2.6.24 a lot more, because we should not decide that it's bug-free without a _lot_ of testing. But if it's a bug that has gone away and then re-appeared, it at least explains how 2.6.21 (which got a fair amount of mmap testing) didn't have lots of reports of mmap corruption. That said, I can think of nothing obvious in between 2.6.24 and .25 that would have re-introduced it. But if some heavy testing really does confirm that 2.6.24 doesn't have the problem, that is a good first step to trying to narrow down where things started going wrong. That said, it could _easily_ be some timing-related pattern. One of the things in between 2.6.24 and .25 is - 8bc3be2751b4f74ab90a446da1912fd8204d53f7: "writeback: speed up writeback of big dirty files" which is that exact kind of "change the timing patterns, but don't change anything fundamental" thing. Which is why I'd like you to continue testing 2.6.24 just to be _really_ sure that it really doesn't happen there. Linus ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-20 0:49 ` Linus Torvalds @ 2009-03-20 7:00 ` Ying Han 0 siblings, 0 replies; 60+ messages in thread From: Ying Han @ 2009-03-20 7:00 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Thu, Mar 19, 2009 at 5:49 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, 19 Mar 2009, Ying Han wrote: >> > >> > Ying Han - since you're all set up for testing this and have reproduced it >> > on multiple kernels, can you try it on a few more kernel versions? It >> > would be interesting to both go further back in time (say 2.6.15-ish), >> > _and_ check something like 2.6.21 which had the exact dirty accounting >> > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that >> > was fixed for a while. >> >> I tried 2.6.24 for couple of hours and the problem not happening yet. While >> the same test on 2.6.25, the problem happen right away. > > Ok, so 2.6.25 is known bad. Can you test 2.6.24 a lot more, because we > should not decide that it's bug-free without a _lot_ of testing. > > But if it's a bug that has gone away and then re-appeared, it at least > explains how 2.6.21 (which got a fair amount of mmap testing) didn't have > lots of reports of mmap corruption. > > That said, I can think of nothing obvious in between 2.6.24 and .25 that > would have re-introduced it. But if some heavy testing really does confirm > that 2.6.24 doesn't have the problem, that is a good first step to trying > to narrow down where things started going wrong. > > That said, it could _easily_ be some timing-related pattern. One of the > things in between 2.6.24 and .25 is > > - 8bc3be2751b4f74ab90a446da1912fd8204d53f7: "writeback: speed up > writeback of big dirty files" > > which is that exact kind of "change the timing patterns, but don't change > anything fundamental" thing. > > Which is why I'd like you to continue testing 2.6.24 just to be _really_ > sure that it really doesn't happen there. Unfortunately, 2.6.24 is not immune. After running several hours, i triggered the problem. > > Linus > ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: ftruncate-mmap: pages are lost after writing to mmaped file. 2009-03-18 22:40 ` Linus Torvalds 2009-03-18 23:18 ` Ying Han 2009-03-20 0:34 ` Ying Han @ 2009-03-25 23:15 ` Ying Han 2 siblings, 0 replies; 60+ messages in thread From: Ying Han @ 2009-03-25 23:15 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra On Wed, Mar 18, 2009 at 3:40 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Wed, 18 Mar 2009, Andrew Morton wrote: > >> On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote: >> > >> > The "bad pages" count differs each time from one digit to 4,5 digit >> > for 128M ftruncated file. and what i also found that the bad page >> > number are contiguous for each segment which total bad pages container >> > several segments. ext "1-4, 9-20, 48-50" ( batch flushing ? ) > > Yeah, probably the batched write-out. > > Can you say what filesystem, and what mount-flags you use? Iirc, last time > we had MAP_SHARED lost writes it was at least partly triggered by the > filesystem doing its own flushing independently of the VM (ie ext3 with > "data=journal", I think), so that kind of thing does tend to matter. > > See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d. > >> > (The failure is reproduced based on 2.6.29-rc8, also happened on >> > 2.6.18 kernel. . Here is the simple test case to reproduce it with >> > memory pressure. ) >> >> Thanks. This will be a regression - the testing I did back in the days >> when I actually wrote stuff would have picked this up. >> >> Perhaps it is a 2.6.17 thing. Which, IIRC, is when we made the changes to >> redirty pages on each write fault. Or maybe it was something else. > > Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking > about Peter's exact dirty page tracking. If I recall correctly, that > became then 2.6.19, and then had the horrible mm dirty bit loss that > triggered in librtorrent downloads, which got fixed sometime after 2.6.20 > (and back-ported). > > So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not > related to the exact dirty tracking. > > The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm: > tracking shared dirty pages"), but maybe you had something else in mind? > >> Given the amount of time for which this bug has existed, I guess it isn't a >> 2.6.29 blocker, but once we've found out the cause we should have a little >> post-mortem to work out how a bug of this nature has gone undetected for so >> long. > > I'm somewhat surprised, because this test-program looks like a very simple > version of the exact one that I used to track down the 2.6.20 mmap > corruption problems. And that one got pretty heavily tested back then, > when people were looking at it (December 2006) and then when trying out my > fix for it. > > Ying Han - since you're all set up for testing this and have reproduced it > on multiple kernels, can you try it on a few more kernel versions? It > would be interesting to both go further back in time (say 2.6.15-ish), > _and_ check something like 2.6.21 which had the exact dirty accounting > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that > was fixed for a while. Just answer your question, i got chance try 2.6.15 and 2.6.21 and they both report the "bad pages" failure. I am using the same system as well as the config > Linus > ^ permalink raw reply [flat|nested] 60+ messages in thread
end of thread, other threads:[~2009-04-03 21:35 UTC | newest] Thread overview: 60+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-03-18 19:44 ftruncate-mmap: pages are lost after writing to mmaped file Ying Han 2009-03-18 22:11 ` Andrew Morton 2009-03-18 22:40 ` Linus Torvalds 2009-03-18 23:18 ` Ying Han 2009-03-18 23:36 ` Linus Torvalds 2009-03-18 23:54 ` Ying Han 2009-03-19 15:48 ` Nick Piggin 2009-03-19 16:16 ` Peter Zijlstra 2009-03-19 16:36 ` Nick Piggin 2009-03-19 16:20 ` Linus Torvalds 2009-03-19 16:34 ` Nick Piggin 2009-03-19 16:51 ` Linus Torvalds 2009-03-19 17:03 ` Jan Kara 2009-03-19 17:06 ` Jan Kara 2009-03-19 20:05 ` Linus Torvalds 2009-03-19 20:21 ` Linus Torvalds 2009-03-19 21:17 ` Ying Han 2009-03-19 22:16 ` Jan Kara 2009-03-19 16:46 ` Jan Kara 2009-03-24 7:44 ` Nick Piggin 2009-03-24 10:27 ` Nick Piggin 2009-03-24 10:32 ` Andrew Morton 2009-03-24 15:35 ` Nick Piggin 2009-03-26 18:29 ` Jan Kara 2009-03-26 0:03 ` Ying Han 2009-03-24 12:39 ` Jan Kara 2009-03-24 12:55 ` Jan Kara 2009-03-24 13:26 ` Jan Kara 2009-03-24 14:01 ` Chris Mason 2009-03-24 14:07 ` Jan Kara 2009-03-26 8:18 ` Aneesh Kumar K.V 2009-03-24 14:30 ` Nick Piggin 2009-03-24 14:47 ` Jan Kara 2009-03-24 14:56 ` Peter Zijlstra 2009-03-24 15:29 ` Jan Kara 2009-03-24 20:14 ` OGAWA Hirofumi 2009-03-26 8:47 ` Aneesh Kumar K.V 2009-03-26 11:37 ` Jan Kara 2009-03-26 23:02 ` Linus Torvalds 2009-03-24 15:03 ` Nick Piggin 2009-03-24 15:48 ` Jan Kara 2009-03-24 17:35 ` Jan Kara 2009-04-01 22:36 ` Ying Han 2009-04-02 10:11 ` Jan Kara 2009-04-02 11:24 ` Nick Piggin 2009-04-02 11:34 ` Jan Kara 2009-04-02 15:51 ` Nick Piggin 2009-04-02 17:44 ` Ying Han 2009-04-02 22:52 ` Ying Han 2009-04-02 23:39 ` Jan Kara 2009-04-03 0:25 ` Ying Han 2009-04-03 1:29 ` Ying Han 2009-04-03 9:41 ` Jan Kara 2009-04-03 21:34 ` Ying Han 2009-04-03 0:13 ` Ying Han 2009-03-27 20:35 ` Ying Han 2009-03-20 0:34 ` Ying Han 2009-03-20 0:49 ` Linus Torvalds 2009-03-20 7:00 ` Ying Han 2009-03-25 23:15 ` Ying Han
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox