ftruncate-mmap: pages are lost after writing to mmaped file.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* ftruncate-mmap: pages are lost after writing to mmaped file.
@ 2009-03-18 19:44 Ying Han
  2009-03-18 22:11 ` Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-03-18 19:44 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Andrew Morton, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth

We triggered the failure during some internal experiment with
ftruncate/mmap/write/read sequence. And we found that some pages are
"lost" after writing to the mmaped file. which in the following test
cases (count >= 0).

First we deployed the test cases into group of machines and see about
>20% failure rate on average. Then, I did couple of experiment to try
to reproduce it on a single machine. what i found is that:
1. add a fsync after write the file, i can not reproduce this issue.
2. add memory pressure(mmap/mlock) while run the test in infinite
loop, the failure is reproduced quickly. ( background flushing ? )

The "bad pages" count differs each time from one digit to 4,5 digit
for 128M ftruncated file. and what i also found that the bad page
number are contiguous for each segment which total bad pages container
several segments. ext "1-4, 9-20, 48-50" (  batch flushing ? )

(The failure is reproduced based on 2.6.29-rc8, also happened on
2.6.18 kernel. . Here is the simple test case to reproduce it with
memory pressure. )

#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

long kMemSize  = 128 << 20;
int kPageSize = 4096;

int main(int argc, char **argv) {
	int status;
	int count = 0;
	int i;
	char *fname = "/root/test.mmap";
	char *mem;

	unlink(fname);
	int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
	status = ftruncate(fd, kMemSize);

	mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	// Fill the memory with 1s.
	memset(mem, 1, kMemSize);

	for (i = 0; i < kMemSize; i++) {
		int byte_good = mem[i] != 0;

		if (!byte_good && ((i % kPageSize) == 0)) {
			//printf("%d ", i / kPageSize);
			count++;
		}
	}

	munmap(mem, kMemSize);
	close(fd);
	unlink(fname);

	if (count > 0) {
		printf("Running %d bad page\n", count);
		return 1;
	}
	return 0;
}

--Ying

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 19:44 ftruncate-mmap: pages are lost after writing to mmaped file Ying Han
@ 2009-03-18 22:11 ` Andrew Morton
  2009-03-18 22:40   ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2009-03-18 22:11 UTC (permalink / raw)
  To: Ying Han
  Cc: linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison,
	Rohit Seth, Nick Piggin, Peter Zijlstra, Linus Torvalds

On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote:

> We triggered the failure during some internal experiment with
> ftruncate/mmap/write/read sequence. And we found that some pages are
> "lost" after writing to the mmaped file. which in the following test
> cases (count >= 0).
> 
> First we deployed the test cases into group of machines and see about
> >20% failure rate on average. Then, I did couple of experiment to try
> to reproduce it on a single machine. what i found is that:
> 1. add a fsync after write the file, i can not reproduce this issue.
> 2. add memory pressure(mmap/mlock) while run the test in infinite
> loop, the failure is reproduced quickly. ( background flushing ? )
> 
> The "bad pages" count differs each time from one digit to 4,5 digit
> for 128M ftruncated file. and what i also found that the bad page
> number are contiguous for each segment which total bad pages container
> several segments. ext "1-4, 9-20, 48-50" (  batch flushing ? )
> 
> (The failure is reproduced based on 2.6.29-rc8, also happened on
> 2.6.18 kernel. . Here is the simple test case to reproduce it with
> memory pressure. )

Thanks.  This will be a regression - the testing I did back in the days
when I actually wrote stuff would have picked this up.

Perhaps it is a 2.6.17 thing.  Which, IIRC, is when we made the changes to
redirty pages on each write fault.  Or maybe it was something else.

Nick, Peter: I'm in .au at preset, not able to build and run kernels - is
this something you'd have time to look into please?

Given the amount of time for which this bug has existed, I guess it isn't a
2.6.29 blocker, but once we've found out the cause we should have a little
post-mortem to work out how a bug of this nature has gone undetected for so
long.


> #include <sys/mman.h>
> #include <sys/types.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> 
> long kMemSize  = 128 << 20;
> int kPageSize = 4096;
> 
> int main(int argc, char **argv) {
> 	int status;
> 	int count = 0;
> 	int i;
> 	char *fname = "/root/test.mmap";
> 	char *mem;
> 
> 	unlink(fname);
> 	int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
> 	status = ftruncate(fd, kMemSize);
> 
> 	mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> 	// Fill the memory with 1s.
> 	memset(mem, 1, kMemSize);
> 
> 	for (i = 0; i < kMemSize; i++) {
> 		int byte_good = mem[i] != 0;
> 
> 		if (!byte_good && ((i % kPageSize) == 0)) {
> 			//printf("%d ", i / kPageSize);
> 			count++;
> 		}
> 	}
> 
> 	munmap(mem, kMemSize);
> 	close(fd);
> 	unlink(fname);
> 
> 	if (count > 0) {
> 		printf("Running %d bad page\n", count);
> 		return 1;
> 	}
> 	return 0;
> }
> 
> --Ying

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 22:11 ` Andrew Morton
@ 2009-03-18 22:40   ` Linus Torvalds
  2009-03-18 23:18     ` Ying Han
                       ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-03-18 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ying Han, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra

On Wed, 18 Mar 2009, Andrew Morton wrote:

> On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote:
> > 
> > The "bad pages" count differs each time from one digit to 4,5 digit
> > for 128M ftruncated file. and what i also found that the bad page
> > number are contiguous for each segment which total bad pages container
> > several segments. ext "1-4, 9-20, 48-50" (  batch flushing ? )

Yeah, probably the batched write-out.

Can you say what filesystem, and what mount-flags you use? Iirc, last time 
we had MAP_SHARED lost writes it was at least partly triggered by the 
filesystem doing its own flushing independently of the VM (ie ext3 with 
"data=journal", I think), so that kind of thing does tend to matter.

See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d.

> > (The failure is reproduced based on 2.6.29-rc8, also happened on
> > 2.6.18 kernel. . Here is the simple test case to reproduce it with
> > memory pressure. )
> 
> Thanks.  This will be a regression - the testing I did back in the days
> when I actually wrote stuff would have picked this up.
> 
> Perhaps it is a 2.6.17 thing.  Which, IIRC, is when we made the changes to
> redirty pages on each write fault.  Or maybe it was something else.

Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking 
about Peter's exact dirty page tracking. If I recall correctly, that 
became then 2.6.19, and then had the horrible mm dirty bit loss that 
triggered in librtorrent downloads, which got fixed sometime after 2.6.20 
(and back-ported).

So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not 
related to the exact dirty tracking.

The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm: 
tracking shared dirty pages"), but maybe you had something else in mind?

> Given the amount of time for which this bug has existed, I guess it isn't a
> 2.6.29 blocker, but once we've found out the cause we should have a little
> post-mortem to work out how a bug of this nature has gone undetected for so
> long.

I'm somewhat surprised, because this test-program looks like a very simple 
version of the exact one that I used to track down the 2.6.20 mmap 
corruption problems. And that one got pretty heavily tested back then, 
when people were looking at it (December 2006) and then when trying out my 
fix for it. 

Ying Han - since you're all set up for testing this and have reproduced it 
on multiple kernels, can you try it on a few more kernel versions? It 
would be interesting to both go further back in time (say 2.6.15-ish), 
_and_ check something like 2.6.21 which had the exact dirty accounting 
fix. Maybe it's not really an old bug - maybe we re-introduced a bug that 
was fixed for a while.

				Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 22:40   ` Linus Torvalds
@ 2009-03-18 23:18     ` Ying Han
  2009-03-18 23:36       ` Linus Torvalds
  2009-03-20  0:34     ` Ying Han
  2009-03-25 23:15     ` Ying Han
  2 siblings, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-03-18 23:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra

On Wed, Mar 18, 2009 at 3:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 18 Mar 2009, Andrew Morton wrote:
>
>> On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote:
>> >
>> > The "bad pages" count differs each time from one digit to 4,5 digit
>> > for 128M ftruncated file. and what i also found that the bad page
>> > number are contiguous for each segment which total bad pages container
>> > several segments. ext "1-4, 9-20, 48-50" (  batch flushing ? )
>
> Yeah, probably the batched write-out.
>
> Can you say what filesystem, and what mount-flags you use? Iirc, last time
> we had MAP_SHARED lost writes it was at least partly triggered by the
> filesystem doing its own flushing independently of the VM (ie ext3 with
> "data=journal", I think), so that kind of thing does tend to matter.

/etc/fstab
"/dev/hda1 / ext2 defaults 1 0"

>
> See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d.
>
>> > (The failure is reproduced based on 2.6.29-rc8, also happened on
>> > 2.6.18 kernel. . Here is the simple test case to reproduce it with
>> > memory pressure. )
>>
>> Thanks.  This will be a regression - the testing I did back in the days
>> when I actually wrote stuff would have picked this up.
>>
>> Perhaps it is a 2.6.17 thing.  Which, IIRC, is when we made the changes to
>> redirty pages on each write fault.  Or maybe it was something else.
>
> Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking
> about Peter's exact dirty page tracking. If I recall correctly, that
> became then 2.6.19, and then had the horrible mm dirty bit loss that
> triggered in librtorrent downloads, which got fixed sometime after 2.6.20
> (and back-ported).
>
> So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not
> related to the exact dirty tracking.
>
> The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm:
> tracking shared dirty pages"), but maybe you had something else in mind?
>
>> Given the amount of time for which this bug has existed, I guess it isn't a
>> 2.6.29 blocker, but once we've found out the cause we should have a little
>> post-mortem to work out how a bug of this nature has gone undetected for so
>> long.
>
> I'm somewhat surprised, because this test-program looks like a very simple
> version of the exact one that I used to track down the 2.6.20 mmap
> corruption problems. And that one got pretty heavily tested back then,
> when people were looking at it (December 2006) and then when trying out my
> fix for it.
>
> Ying Han - since you're all set up for testing this and have reproduced it
> on multiple kernels, can you try it on a few more kernel versions? It
> would be interesting to both go further back in time (say 2.6.15-ish),
> _and_ check something like 2.6.21 which had the exact dirty accounting
> fix. Maybe it's not really an old bug - maybe we re-introduced a bug that
> was fixed for a while.

I will give a try.
>
>                                Linus
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 23:18     ` Ying Han
@ 2009-03-18 23:36       ` Linus Torvalds
  2009-03-18 23:54         ` Ying Han
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-03-18 23:36 UTC (permalink / raw)
  To: Ying Han
  Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra



On Wed, 18 Mar 2009, Ying Han wrote:
> >
> > Can you say what filesystem, and what mount-flags you use? Iirc, last time
> > we had MAP_SHARED lost writes it was at least partly triggered by the
> > filesystem doing its own flushing independently of the VM (ie ext3 with
> > "data=journal", I think), so that kind of thing does tend to matter.
> 
> /etc/fstab
> "/dev/hda1 / ext2 defaults 1 0"

Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At 
least Fedora will ignore the flags in it.

What does /proc/mounts say? That should be a more reliable indication of 
what the kernel actually does. 

That said, I assume the ext2 part is accurate. Maybe that's why people 
haven't seen it - I guess most testing was done on ext3. It certainly was 
for me.

> > Ying Han - since you're all set up for testing this and have reproduced it
> > on multiple kernels, can you try it on a few more kernel versions? It
> > would be interesting to both go further back in time (say 2.6.15-ish),
> > _and_ check something like 2.6.21 which had the exact dirty accounting
> > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that
> > was fixed for a while.
> 
> I will give a try.

Thanks,

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 23:36       ` Linus Torvalds
@ 2009-03-18 23:54         ` Ying Han
  2009-03-19 15:48           ` Nick Piggin
  0 siblings, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-03-18 23:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra

On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 18 Mar 2009, Ying Han wrote:
>> >
>> > Can you say what filesystem, and what mount-flags you use? Iirc, last time
>> > we had MAP_SHARED lost writes it was at least partly triggered by the
>> > filesystem doing its own flushing independently of the VM (ie ext3 with
>> > "data=journal", I think), so that kind of thing does tend to matter.
>>
>> /etc/fstab
>> "/dev/hda1 / ext2 defaults 1 0"
>
> Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At
> least Fedora will ignore the flags in it.
>
> What does /proc/mounts say? That should be a more reliable indication of
> what the kernel actually does.

"/dev/root / ext2 rw,errors=continue 0 0"

>
> That said, I assume the ext2 part is accurate. Maybe that's why people
> haven't seen it - I guess most testing was done on ext3. It certainly was
> for me.
>
>> > Ying Han - since you're all set up for testing this and have reproduced it
>> > on multiple kernels, can you try it on a few more kernel versions? It
>> > would be interesting to both go further back in time (say 2.6.15-ish),
>> > _and_ check something like 2.6.21 which had the exact dirty accounting
>> > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that
>> > was fixed for a while.
>>
>> I will give a try.
>
> Thanks,
>
>                Linus
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 23:54         ` Ying Han
@ 2009-03-19 15:48           ` Nick Piggin
  2009-03-19 16:16             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Nick Piggin @ 2009-03-19 15:48 UTC (permalink / raw)
  To: Ying Han, Jan Kara
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz,
	Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Thursday 19 March 2009 10:54:33 Ying Han wrote:
> On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds
>
> <torvalds@linux-foundation.org> wrote:
> > On Wed, 18 Mar 2009, Ying Han wrote:
> >> > Can you say what filesystem, and what mount-flags you use? Iirc, last
> >> > time we had MAP_SHARED lost writes it was at least partly triggered by
> >> > the filesystem doing its own flushing independently of the VM (ie ext3
> >> > with "data=journal", I think), so that kind of thing does tend to
> >> > matter.
> >>
> >> /etc/fstab
> >> "/dev/hda1 / ext2 defaults 1 0"
> >
> > Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At
> > least Fedora will ignore the flags in it.
> >
> > What does /proc/mounts say? That should be a more reliable indication of
> > what the kernel actually does.
>
> "/dev/root / ext2 rw,errors=continue 0 0"

No luck with finding the problem yet.

But I think we do have a race in __set_page_dirty_buffers():

The page may not have buffers between the mapping->private_lock
critical section and the __set_page_dirty call there. So between
them, another thread might do a create_empty_buffers which can
see !PageDirty and thus it will create clean buffers. The page
will get dirtied by the original thread, but if the buffers are
clean it can be cleaned without writing out buffers.

Holding mapping->private_lock over the __set_page_dirty should
fix it, although I guess you'd want to release it before calling
__mark_inode_dirty so as not to put inode_lock under there. I
have a patch for this if it sounds reasonable.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 15:48           ` Nick Piggin
@ 2009-03-19 16:16             ` Peter Zijlstra
  2009-03-19 16:36               ` Nick Piggin
  2009-03-19 16:20             ` Linus Torvalds
  2009-03-19 16:46             ` Jan Kara
  2 siblings, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2009-03-19 16:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ying Han, Jan Kara, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth

On Fri, 2009-03-20 at 02:48 +1100, Nick Piggin wrote:
> On Thursday 19 March 2009 10:54:33 Ying Han wrote:
> > On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds
> >
> > <torvalds@linux-foundation.org> wrote:
> > > On Wed, 18 Mar 2009, Ying Han wrote:
> > >> > Can you say what filesystem, and what mount-flags you use? Iirc, last
> > >> > time we had MAP_SHARED lost writes it was at least partly triggered by
> > >> > the filesystem doing its own flushing independently of the VM (ie ext3
> > >> > with "data=journal", I think), so that kind of thing does tend to
> > >> > matter.
> > >>
> > >> /etc/fstab
> > >> "/dev/hda1 / ext2 defaults 1 0"
> > >
> > > Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At
> > > least Fedora will ignore the flags in it.
> > >
> > > What does /proc/mounts say? That should be a more reliable indication of
> > > what the kernel actually does.
> >
> > "/dev/root / ext2 rw,errors=continue 0 0"
> 
> No luck with finding the problem yet.
> 
> But I think we do have a race in __set_page_dirty_buffers():
> 
> The page may not have buffers between the mapping->private_lock
> critical section and the __set_page_dirty call there. So between
> them, another thread might do a create_empty_buffers which can
> see !PageDirty and thus it will create clean buffers. The page
> will get dirtied by the original thread, but if the buffers are
> clean it can be cleaned without writing out buffers.
> 
> Holding mapping->private_lock over the __set_page_dirty should
> fix it, although I guess you'd want to release it before calling
> __mark_inode_dirty so as not to put inode_lock under there. I
> have a patch for this if it sounds reasonable.

When I first did those dirty tracking patches someone (I think Andrew)
commented no the fact that I did set_page_dirty() under one of these
inner locks..

/me frobs around in archives for a bit..

 - fs/buffers.c try_to_free_buffers(): remove clear_page_dirty() from under
   ->private_lock. This seems to be save, since ->private_lock is used to
   serialize access to the buffers, not the page itself.

Hmm, that's a slightly different issue...

But yeah, your scenario makes heaps of sense.

Can't we do the TestSetPageDirty() before private_lock ? It's currently
done before tree_lock as well.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 16:16             ` Peter Zijlstra
@ 2009-03-19 16:36               ` Nick Piggin
  0 siblings, 0 replies; 60+ messages in thread
From: Nick Piggin @ 2009-03-19 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ying Han, Jan Kara, Linus Torvalds, Andrew Morton, linux-kernel,
	linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth

On Friday 20 March 2009 03:16:01 Peter Zijlstra wrote:
> On Fri, 2009-03-20 at 02:48 +1100, Nick Piggin wrote:
> > On Thursday 19 March 2009 10:54:33 Ying Han wrote:
> > > On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds
> > >
> > > <torvalds@linux-foundation.org> wrote:
> > > > On Wed, 18 Mar 2009, Ying Han wrote:
> > > >> > Can you say what filesystem, and what mount-flags you use? Iirc,
> > > >> > last time we had MAP_SHARED lost writes it was at least partly
> > > >> > triggered by the filesystem doing its own flushing independently
> > > >> > of the VM (ie ext3 with "data=journal", I think), so that kind of
> > > >> > thing does tend to matter.
> > > >>
> > > >> /etc/fstab
> > > >> "/dev/hda1 / ext2 defaults 1 0"
> > > >
> > > > Sadly, /etc/fstab is not necessarily accurate for the root
> > > > filesystem. At least Fedora will ignore the flags in it.
> > > >
> > > > What does /proc/mounts say? That should be a more reliable indication
> > > > of what the kernel actually does.
> > >
> > > "/dev/root / ext2 rw,errors=continue 0 0"
> >
> > No luck with finding the problem yet.
> >
> > But I think we do have a race in __set_page_dirty_buffers():
> >
> > The page may not have buffers between the mapping->private_lock
> > critical section and the __set_page_dirty call there. So between
> > them, another thread might do a create_empty_buffers which can
> > see !PageDirty and thus it will create clean buffers. The page
> > will get dirtied by the original thread, but if the buffers are
> > clean it can be cleaned without writing out buffers.
> >
> > Holding mapping->private_lock over the __set_page_dirty should
> > fix it, although I guess you'd want to release it before calling
> > __mark_inode_dirty so as not to put inode_lock under there. I
> > have a patch for this if it sounds reasonable.
>
> When I first did those dirty tracking patches someone (I think Andrew)
> commented no the fact that I did set_page_dirty() under one of these
> inner locks..
>
> /me frobs around in archives for a bit..
>
>  - fs/buffers.c try_to_free_buffers(): remove clear_page_dirty() from under
>    ->private_lock. This seems to be save, since ->private_lock is used to
>    serialize access to the buffers, not the page itself.
>
> Hmm, that's a slightly different issue...
>
> But yeah, your scenario makes heaps of sense.
>
> Can't we do the TestSetPageDirty() before private_lock ? It's currently
> done before tree_lock as well.

I think there might be issues with having a clean page but dirty buffers
if you do it that way... At any rate, if we can solve the race without
swapping the order, I think that would be safer.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 15:48           ` Nick Piggin
  2009-03-19 16:16             ` Peter Zijlstra
@ 2009-03-19 16:20             ` Linus Torvalds
  2009-03-19 16:34               ` Nick Piggin
  2009-03-19 16:46             ` Jan Kara
  2 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-03-19 16:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra



On Fri, 20 Mar 2009, Nick Piggin wrote:
> 
> But I think we do have a race in __set_page_dirty_buffers():
> 
> The page may not have buffers between the mapping->private_lock
> critical section and the __set_page_dirty call there. So between
> them, another thread might do a create_empty_buffers which can
> see !PageDirty and thus it will create clean buffers.

Hmm.

Creating clean buffers is locked by the page lock, nothing else.  And not 
all page dirtiers hold the page lock (in fact, most try to avoid it - the 
rule is that you either have to hold the page lock _or_ hold a reference 
to the 'mapping', and the latter is what the mmap code does, I think).

So yeah, the page lock isn't sufficient.

> Holding mapping->private_lock over the __set_page_dirty should
> fix it, although I guess you'd want to release it before calling
> __mark_inode_dirty so as not to put inode_lock under there. I
> have a patch for this if it sounds reasonable.

That would seem to make sense. Maybe moving the "TestSetPageDirty()" from 
inside __set_page_dirty() to the caller? Something like the appended?

This is TOTALLY untested. Of course.

			Linus

---
 fs/buffer.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9f69741..891e1c7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -760,15 +760,9 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static int __set_page_dirty(struct page *page,
+static void __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
-	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
-
-	if (TestSetPageDirty(page))
-		return 0;
-
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -785,8 +779,6 @@ static int __set_page_dirty(struct page *page,
 	}
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
-	return 1;
 }
 
 /*
@@ -816,6 +808,7 @@ static int __set_page_dirty(struct page *page,
  */
 int __set_page_dirty_buffers(struct page *page)
 {
+	int newly_dirty;
 	struct address_space *mapping = page_mapping(page);
 
 	if (unlikely(!mapping))
@@ -831,9 +824,12 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
+	newly_dirty = !TestSetPageDirty(page);
 	spin_unlock(&mapping->private_lock);
 
-	return __set_page_dirty(page, mapping, 1);
+	if (newly_dirty)
+		__set_page_dirty(page, mapping, 1);
+	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
 
@@ -1262,8 +1258,11 @@ void mark_buffer_dirty(struct buffer_head *bh)
 			return;
 	}
 
-	if (!test_set_buffer_dirty(bh))
-		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
+	if (!test_set_buffer_dirty(bh)) {
+		struct page *page = bh->b_page;
+		if (!TestSetPageDirty(page))
+			__set_page_dirty(page, page_mapping(page), 0);
+	}
 }
 
 /*

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 16:20             ` Linus Torvalds
@ 2009-03-19 16:34               ` Nick Piggin
  2009-03-19 16:51                 ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Nick Piggin @ 2009-03-19 16:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Friday 20 March 2009 03:20:57 Linus Torvalds wrote:
> On Fri, 20 Mar 2009, Nick Piggin wrote:
> > But I think we do have a race in __set_page_dirty_buffers():
> >
> > The page may not have buffers between the mapping->private_lock
> > critical section and the __set_page_dirty call there. So between
> > them, another thread might do a create_empty_buffers which can
> > see !PageDirty and thus it will create clean buffers.
>
> Hmm.
>
> Creating clean buffers is locked by the page lock, nothing else.  And not
> all page dirtiers hold the page lock (in fact, most try to avoid it - the
> rule is that you either have to hold the page lock _or_ hold a reference
> to the 'mapping', and the latter is what the mmap code does, I think).
>
> So yeah, the page lock isn't sufficient.

No. FWIW, I thought there might be a race due to page fault code
not holding page lock over set_page_dirty (well there *are* some kinds
of races, but they're another story). So I tried out my patches that
move that lock over set_page_dirty for __do_fault and do_wp_page
(so the lock is held over pte_mkwrite and set_page_dirty), but that
still didn't solve the problem either.


> > Holding mapping->private_lock over the __set_page_dirty should
> > fix it, although I guess you'd want to release it before calling
> > __mark_inode_dirty so as not to put inode_lock under there. I
> > have a patch for this if it sounds reasonable.
>
> That would seem to make sense. Maybe moving the "TestSetPageDirty()" from
> inside __set_page_dirty() to the caller? Something like the appended?
>
> This is TOTALLY untested. Of course.

Yeah, probably no need to hold private_lock while tagging the radix
tree (which is what my version did). So maybe this one is a little
better. I did test mine, it worked, but it didn't solve the problem.

Still, it does appear to solve a real race, which we should close.

>
> 			Linus
>
> ---
>  fs/buffer.c |   23 +++++++++++------------
>  1 files changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 9f69741..891e1c7 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -760,15 +760,9 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and
> has * not been truncated.
>   */
> -static int __set_page_dirty(struct page *page,
> +static void __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> -	if (unlikely(!mapping))
> -		return !TestSetPageDirty(page);
> -
> -	if (TestSetPageDirty(page))
> -		return 0;
> -
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
> @@ -785,8 +779,6 @@ static int __set_page_dirty(struct page *page,
>  	}
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
> -	return 1;
>  }
>
>  /*
> @@ -816,6 +808,7 @@ static int __set_page_dirty(struct page *page,
>   */
>  int __set_page_dirty_buffers(struct page *page)
>  {
> +	int newly_dirty;
>  	struct address_space *mapping = page_mapping(page);
>
>  	if (unlikely(!mapping))
> @@ -831,9 +824,12 @@ int __set_page_dirty_buffers(struct page *page)
>  			bh = bh->b_this_page;
>  		} while (bh != head);
>  	}
> +	newly_dirty = !TestSetPageDirty(page);
>  	spin_unlock(&mapping->private_lock);
>
> -	return __set_page_dirty(page, mapping, 1);
> +	if (newly_dirty)
> +		__set_page_dirty(page, mapping, 1);
> +	return newly_dirty;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_buffers);
>
> @@ -1262,8 +1258,11 @@ void mark_buffer_dirty(struct buffer_head *bh)
>  			return;
>  	}
>
> -	if (!test_set_buffer_dirty(bh))
> -		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
> +	if (!test_set_buffer_dirty(bh)) {
> +		struct page *page = bh->b_page;
> +		if (!TestSetPageDirty(page))
> +			__set_page_dirty(page, page_mapping(page), 0);
> +	}
>  }
>
>  /*



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 16:34               ` Nick Piggin
@ 2009-03-19 16:51                 ` Linus Torvalds
  2009-03-19 17:03                   ` Jan Kara
  2009-03-19 20:21                   ` Linus Torvalds
  0 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-03-19 16:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Fri, 20 Mar 2009, Nick Piggin wrote:
> 
> Yeah, probably no need to hold private_lock while tagging the radix
> tree (which is what my version did). So maybe this one is a little
> better. I did test mine, it worked, but it didn't solve the problem.

Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? 
I've not even tested - I assumed that I would have to boot into less 
memory and downgrade my filesystem to ext2, which made me hope somebody 
else would pick it up first ;)

> Still, it does appear to solve a real race, which we should close.

A very small and unlikely race, but yeah - the patch isn't large, removes 
more lines than it adds, and the code seems to make _more_ sense with it 
in place than without. 

			Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 16:51                 ` Linus Torvalds
@ 2009-03-19 17:03                   ` Jan Kara
  2009-03-19 17:06                     ` Jan Kara
  2009-03-19 20:05                     ` Linus Torvalds
  2009-03-19 20:21                   ` Linus Torvalds
  1 sibling, 2 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-19 17:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Ying Han, Jan Kara, Andrew Morton, linux-kernel,
	linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth,
	Peter Zijlstra

On Thu 19-03-09 09:51:59, Linus Torvalds wrote:
> 
> 
> On Fri, 20 Mar 2009, Nick Piggin wrote:
> > 
> > Yeah, probably no need to hold private_lock while tagging the radix
> > tree (which is what my version did). So maybe this one is a little
> > better. I did test mine, it worked, but it didn't solve the problem.
> 
> Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? 
> I've not even tested - I assumed that I would have to boot into less 
> memory and downgrade my filesystem to ext2, which made me hope somebody 
> else would pick it up first ;)
  In thread http://lkml.org/lkml/2009/3/4/179 I've reported similar problem
- write lost. I'm able to reproduce under UML linux at will. ext3 takes
with 1KB blocksize about 20 minutes to hit the corruption, ext2 with 1 KB
blocksize about an hour, ext2 with 4KB blocksize several hours...
  I've reported that also ordinary write() got lost once but that might
have been an error in me reading the fsx logs since I never saw it again...

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 17:03                   ` Jan Kara
@ 2009-03-19 17:06                     ` Jan Kara
  2009-03-19 20:05                     ` Linus Torvalds
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-19 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Ying Han, Jan Kara, Andrew Morton, linux-kernel,
	linux-mm, guichaz, Alex Khesin, Mike Waychison, Rohit Seth,
	Peter Zijlstra

On Thu 19-03-09 18:03:40, Jan Kara wrote:
> On Thu 19-03-09 09:51:59, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 20 Mar 2009, Nick Piggin wrote:
> > > 
> > > Yeah, probably no need to hold private_lock while tagging the radix
> > > tree (which is what my version did). So maybe this one is a little
> > > better. I did test mine, it worked, but it didn't solve the problem.
> > 
> > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? 
> > I've not even tested - I assumed that I would have to boot into less 
> > memory and downgrade my filesystem to ext2, which made me hope somebody 
> > else would pick it up first ;)
>   In thread http://lkml.org/lkml/2009/3/4/179 I've reported similar problem
> - write lost. I'm able to reproduce under UML linux at will. ext3 takes
> with 1KB blocksize about 20 minutes to hit the corruption, ext2 with 1 KB
> blocksize about an hour, ext2 with 4KB blocksize several hours...
  And BTW HP is reporting to us a similarly looking problem on reiserfs on
ia64 with SLE11 (2.6.27 based). But that takes several days to reproduce
with their proprietary test suite.

									Honza

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 17:03                   ` Jan Kara
  2009-03-19 17:06                     ` Jan Kara
@ 2009-03-19 20:05                     ` Linus Torvalds
  1 sibling, 0 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-03-19 20:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Ying Han, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra



On Thu, 19 Mar 2009, Jan Kara wrote:
> > 
> > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? 
> > I've not even tested - I assumed that I would have to boot into less 
> > memory and downgrade my filesystem to ext2, which made me hope somebody 
> > else would pick it up first ;)
>
>   In thread http://lkml.org/lkml/2009/3/4/179 I've reported similar problem
> - write lost. I'm able to reproduce under UML linux at will. ext3 takes
> with 1KB blocksize about 20 minutes to hit the corruption, ext2 with 1 KB
> blocksize about an hour, ext2 with 4KB blocksize several hours...

Hmm. I can't seem to recreate it with Ying Han's testprog, at least. 
That's with the fs/buffer.c patch applied, but that shouldn't matter since 
Nick reports that his (roughly equivalent) patch didn't help.

I'll continue to run it for a while, but it's been going for about an hour 
now, with vmstat reporting bo/bi at roughly 5-10MB/s pretty continuosly.

Of course, that's with my SSD's and an insanely fast Nehalem box, so my 
timings are likely rather different from most other peoples.

			Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 16:51                 ` Linus Torvalds
  2009-03-19 17:03                   ` Jan Kara
@ 2009-03-19 20:21                   ` Linus Torvalds
  2009-03-19 21:17                     ` Ying Han
  2009-03-19 22:16                     ` Jan Kara
  1 sibling, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-03-19 20:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ying Han, Jan Kara, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra



On Thu, 19 Mar 2009, Linus Torvalds wrote:
> 
> Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? 
> I've not even tested - I assumed that I would have to boot into less 
> memory and downgrade my filesystem to ext2, which made me hope somebody 
> else would pick it up first ;)

Oh, btw, can people who see this (Ying Han, Nick and apparently Jan) 
detail their configurations, please? In particular

 - SMP? (CONFIG_SMP and how many cores do you have if so?)

 - PREEMPT (NONE/VOLUNTARY or full preempt?)

 - RCU (CLASSIC/TREE/PREEMPT?)

since those affect the kinds of races we can see a lot.

			Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 20:21                   ` Linus Torvalds
@ 2009-03-19 21:17                     ` Ying Han
  2009-03-19 22:16                     ` Jan Kara
  1 sibling, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-03-19 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Jan Kara, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu, Mar 19, 2009 at 1:21 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Thu, 19 Mar 2009, Linus Torvalds wrote:
>>
>> Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well?
>> I've not even tested - I assumed that I would have to boot into less
>> memory and downgrade my filesystem to ext2, which made me hope somebody
>> else would pick it up first ;)
>
> Oh, btw, can people who see this (Ying Han, Nick and apparently Jan)
> detail their configurations, please? In particular
>
>  - SMP? (CONFIG_SMP and how many cores do you have if so?)
"CONFIG_SMP=y"
the testing machine i was using has 16 cores.

>
>  - PREEMPT (NONE/VOLUNTARY or full preempt?)
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y

>
>  - RCU (CLASSIC/TREE/PREEMPT?)
Not in my .config file.
>
> since those affect the kinds of races we can see a lot.
>
>                        Linus
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 20:21                   ` Linus Torvalds
  2009-03-19 21:17                     ` Ying Han
@ 2009-03-19 22:16                     ` Jan Kara
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-19 22:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Ying Han, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu 19-03-09 13:21:28, Linus Torvalds wrote:
> 
> 
> On Thu, 19 Mar 2009, Linus Torvalds wrote:
> > 
> > Ahh, so you re-created it? On ext2 only, or is it visible on ext3 as well? 
> > I've not even tested - I assumed that I would have to boot into less 
> > memory and downgrade my filesystem to ext2, which made me hope somebody 
> > else would pick it up first ;)
> 
> Oh, btw, can people who see this (Ying Han, Nick and apparently Jan) 
> detail their configurations, please? In particular
  I'm able to see it only under UML. I've tried to reproduce under normal
machine but without luck. My system is single CPU Athlon64.

>  - SMP? (CONFIG_SMP and how many cores do you have if so?)
> 
>  - PREEMPT (NONE/VOLUNTARY or full preempt?)
  Above two don't exist in UML, but I assume UML behaves basically like full
preempt SMP...

>  - RCU (CLASSIC/TREE/PREEMPT?)
  RCU is CLASSIC.

										Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 15:48           ` Nick Piggin
  2009-03-19 16:16             ` Peter Zijlstra
  2009-03-19 16:20             ` Linus Torvalds
@ 2009-03-19 16:46             ` Jan Kara
  2009-03-24  7:44               ` Nick Piggin
  2 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-03-19 16:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

  Hi,

On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> On Thursday 19 March 2009 10:54:33 Ying Han wrote:
> > On Wed, Mar 18, 2009 at 4:36 PM, Linus Torvalds
> >
> > <torvalds@linux-foundation.org> wrote:
> > > On Wed, 18 Mar 2009, Ying Han wrote:
> > >> > Can you say what filesystem, and what mount-flags you use? Iirc, last
> > >> > time we had MAP_SHARED lost writes it was at least partly triggered by
> > >> > the filesystem doing its own flushing independently of the VM (ie ext3
> > >> > with "data=journal", I think), so that kind of thing does tend to
> > >> > matter.
> > >>
> > >> /etc/fstab
> > >> "/dev/hda1 / ext2 defaults 1 0"
> > >
> > > Sadly, /etc/fstab is not necessarily accurate for the root filesystem. At
> > > least Fedora will ignore the flags in it.
> > >
> > > What does /proc/mounts say? That should be a more reliable indication of
> > > what the kernel actually does.
> >
> > "/dev/root / ext2 rw,errors=continue 0 0"
> 
> No luck with finding the problem yet.
  I've been staring at the code whole yesterday and didn't find the problem
either.

> But I think we do have a race in __set_page_dirty_buffers():
> 
> The page may not have buffers between the mapping->private_lock
> critical section and the __set_page_dirty call there. So between
> them, another thread might do a create_empty_buffers which can
> see !PageDirty and thus it will create clean buffers. The page
> will get dirtied by the original thread, but if the buffers are
> clean it can be cleaned without writing out buffers.
> 
> Holding mapping->private_lock over the __set_page_dirty should
> fix it, although I guess you'd want to release it before calling
> __mark_inode_dirty so as not to put inode_lock under there. I
> have a patch for this if it sounds reasonable.
  Yes, that seems to be a bug - the function actually looked suspitious to
me yesterday but I somehow convinced myself that it's fine. Probably
because fsx-linux is single-threaded.
  Anyway, I've tried the following hack:

diff --git a/fs/buffer.c b/fs/buffer.c
index 985f617..f764c8a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -763,10 +763,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 static int __set_page_dirty(struct page *page,
                struct address_space *mapping, int warn)
 {
+       int ret;
+
        if (unlikely(!mapping))
                return !TestSetPageDirty(page);
 
-       if (TestSetPageDirty(page))
+       ret = TestSetPageDirty(page);
+       if (warn)
+               spin_unlock(&mapping->private_lock);
+       if (ret)
                return 0;
 
        spin_lock_irq(&mapping->tree_lock);
@@ -831,8 +836,6 @@ int __set_page_dirty_buffers(struct page *page)
                        bh = bh->b_this_page;
                } while (bh != head);
        }
-       spin_unlock(&mapping->private_lock);
-
        return __set_page_dirty(page, mapping, 1);
 }

   But it didn't help my data corruption under UML :(.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-19 16:46             ` Jan Kara
@ 2009-03-24  7:44               ` Nick Piggin
  2009-03-24 10:27                 ` Nick Piggin
                                   ` (3 more replies)
  0 siblings, 4 replies; 60+ messages in thread
From: Nick Piggin @ 2009-03-24  7:44 UTC (permalink / raw)
  To: Jan Kara, Martin J. Bligh, linux-ext4
  Cc: Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> On Fri 20-03-09 02:48:21, Nick Piggin wrote:

> > Holding mapping->private_lock over the __set_page_dirty should
> > fix it, although I guess you'd want to release it before calling
> > __mark_inode_dirty so as not to put inode_lock under there. I
> > have a patch for this if it sounds reasonable.
>
>   Yes, that seems to be a bug - the function actually looked suspitious to
> me yesterday but I somehow convinced myself that it's fine. Probably
> because fsx-linux is single-threaded.

After a whole lot of chasing my own tail in the VM and buffer layers,
I think it is a problem in ext2 (and I haven't been able to reproduce
with ext3 yet, which might lend weight to that, although as we have
seen, it is very timing dependent).

That would be slightly unfortunate because we still have Jan's ext3
problem, and also another reported problem of corruption on ext3 (on
brd driver).

Anyway, when I have reproduced the problem with the test case, the
"lost" writes are all reported to be holes. Unfortunately, that doesn't
point straight to the filesystem, because ext2 allocates blocks in this
case at writeout time, so if dirty bits are getting lost, then it would
be normal to see holes.

I then put in a whole lot of extra infrastructure to track metadata about
each struct page (when it was last written out, when it last had the number
of writable ptes reach 0, when the dirty bits were last cleared etc). And
none of the normal asertions were triggering: eg. when any page is removed
from pagecache (except truncates), it has always had all its buffers
written out *after* all ptes were made readonly or unmapped. Lots of other
tests and crap like that.

So I tried what I should have done to start with and did an e2fsck after
seeing corruption. Yes, it comes up with errors. Now that is unusual
because that should be largely insulated from the vm: if a dirty bit gets
lost, then the filesystem image should be quite happy and error-free with
a hole or unwritten data there.

I don't know ext? locking very well, except that it looks pretty overly
complex and crufty.

Usually, blocks are instantiated by write(2), under i_mutex, serialising
the allocator somewhat. mmap-write blocks are instantiated at writeout
time, unserialised. I moved truncate_mutex to cover the entire get_blocks
function, and can no longer trigger the problem. Might be a timing issue
though -- Ying, can you try this and see if you can still reproduce?

I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
Nice. Whether or not this cased the problem, can someone please tell me
why it got merged in that state?

I'm leaving ext3 running for now. It looks like a nasty task to bisect
ext2 down to that commit :( and I would be more interested in trying to
reproduce Jan's ext3 problem, however, because I'm not too interested in
diving into ext2 locking to work out exactly what is racing and how to
fix it properly. I suspect it would be most productive to wire up some
ioctls right into the block allocator/lookup and code up a userspace
tester for it that could probably stress it a lot harder than kernel
writeout can.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24  7:44               ` Nick Piggin
@ 2009-03-24 10:27                 ` Nick Piggin
  2009-03-24 10:32                 ` Andrew Morton
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 60+ messages in thread
From: Nick Piggin @ 2009-03-24 10:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Tuesday 24 March 2009 18:44:21 Nick Piggin wrote:

> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this cased the problem, can someone please tell me
> why it got merged in that state?

Actually I must be wrong about this if the problem was reproduced in
2.6.18. Question still stands, though.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24  7:44               ` Nick Piggin
  2009-03-24 10:27                 ` Nick Piggin
@ 2009-03-24 10:32                 ` Andrew Morton
  2009-03-24 15:35                   ` Nick Piggin
  2009-03-26  0:03                   ` Ying Han
  2009-03-24 12:39                 ` Jan Kara
  2009-03-27 20:35                 ` Ying Han
  3 siblings, 2 replies; 60+ messages in thread
From: Andrew Morton @ 2009-03-24 10:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison,
	Rohit Seth, Peter Zijlstra

On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> 
> > > Holding mapping->private_lock over the __set_page_dirty should
> > > fix it, although I guess you'd want to release it before calling
> > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > have a patch for this if it sounds reasonable.
> >
> >   Yes, that seems to be a bug - the function actually looked suspitious to
> > me yesterday but I somehow convinced myself that it's fine. Probably
> > because fsx-linux is single-threaded.
> 
> 
> After a whole lot of chasing my own tail in the VM and buffer layers,
> I think it is a problem in ext2 (and I haven't been able to reproduce
> with ext3 yet, which might lend weight to that, although as we have
> seen, it is very timing dependent).
> 
> That would be slightly unfortunate because we still have Jan's ext3
> problem, and also another reported problem of corruption on ext3 (on
> brd driver).
> 
> Anyway, when I have reproduced the problem with the test case, the
> "lost" writes are all reported to be holes. Unfortunately, that doesn't
> point straight to the filesystem, because ext2 allocates blocks in this
> case at writeout time, so if dirty bits are getting lost, then it would
> be normal to see holes.
> 
> I then put in a whole lot of extra infrastructure to track metadata about
> each struct page (when it was last written out, when it last had the number
> of writable ptes reach 0, when the dirty bits were last cleared etc). And
> none of the normal asertions were triggering: eg. when any page is removed
> from pagecache (except truncates), it has always had all its buffers
> written out *after* all ptes were made readonly or unmapped. Lots of other
> tests and crap like that.
> 
> So I tried what I should have done to start with and did an e2fsck after
> seeing corruption. Yes, it comes up with errors.

Do you recall what the errors were?

> Now that is unusual
> because that should be largely insulated from the vm: if a dirty bit gets
> lost, then the filesystem image should be quite happy and error-free with
> a hole or unwritten data there.
> 
> I don't know ext? locking very well, except that it looks pretty overly
> complex and crufty.
> 
> Usually, blocks are instantiated by write(2), under i_mutex, serialising
> the allocator somewhat. mmap-write blocks are instantiated at writeout
> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> function, and can no longer trigger the problem. Might be a timing issue
> though -- Ying, can you try this and see if you can still reproduce?
> 
> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this cased the problem, can someone please tell me
> why it got merged in that state?
> 
> I'm leaving ext3 running for now. It looks like a nasty task to bisect
> ext2 down to that commit :( and I would be more interested in trying to
> reproduce Jan's ext3 problem, however, because I'm not too interested in
> diving into ext2 locking to work out exactly what is racing and how to
> fix it properly. I suspect it would be most productive to wire up some
> ioctls right into the block allocator/lookup and code up a userspace
> tester for it that could probably stress it a lot harder than kernel
> writeout can.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 10:32                 ` Andrew Morton
@ 2009-03-24 15:35                   ` Nick Piggin
  2009-03-26 18:29                     ` Jan Kara
  2009-03-26  0:03                   ` Ying Han
  1 sibling, 1 reply; 60+ messages in thread
From: Nick Piggin @ 2009-03-24 15:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	linux-kernel, linux-mm, guichaz, Alex Khesin, Mike Waychison,
	Rohit Seth, Peter Zijlstra

On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote:
> On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
wrote:
> > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > fix it, although I guess you'd want to release it before calling
> > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > have a patch for this if it sounds reasonable.
> > >
> > >   Yes, that seems to be a bug - the function actually looked suspitious
> > > to me yesterday but I somehow convinced myself that it's fine. Probably
> > > because fsx-linux is single-threaded.
> >
> > After a whole lot of chasing my own tail in the VM and buffer layers,
> > I think it is a problem in ext2 (and I haven't been able to reproduce
> > with ext3 yet, which might lend weight to that, although as we have
> > seen, it is very timing dependent).
> >
> > That would be slightly unfortunate because we still have Jan's ext3
> > problem, and also another reported problem of corruption on ext3 (on
> > brd driver).
> >
> > Anyway, when I have reproduced the problem with the test case, the
> > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > point straight to the filesystem, because ext2 allocates blocks in this
> > case at writeout time, so if dirty bits are getting lost, then it would
> > be normal to see holes.
> >
> > I then put in a whole lot of extra infrastructure to track metadata about
> > each struct page (when it was last written out, when it last had the
> > number of writable ptes reach 0, when the dirty bits were last cleared
> > etc). And none of the normal asertions were triggering: eg. when any page
> > is removed from pagecache (except truncates), it has always had all its
> > buffers written out *after* all ptes were made readonly or unmapped. Lots
> > of other tests and crap like that.
> >
> > So I tried what I should have done to start with and did an e2fsck after
> > seeing corruption. Yes, it comes up with errors.
>
> Do you recall what the errors were?

OK, after running several tests in parallel and having 3 of them
blow up, I unmounted the fs (so error-case files are still intact).

# e2fsck -fn /dev/ram0
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Inode 16, i_blocks is 131594, should be 131566.  Fix? no

Inode 18, i_blocks is 131588, should be 131576.  Fix? no

Inode 21, i_blocks is 131594, should be 131552.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(628209--628220) -628231 -628233 -(638751--638755) 
-638765 -(646271--646295) -(646301--646304) -647609 -(651501--651505) -651509 
-(651719--651726) -(651732--651733) -(665666--665670)
Fix? no


/dev/ram0: ********** WARNING: Filesystem still has errors **********

/dev/ram0: 21/229376 files (4.8% non-contiguous), 407105/3670016 blocks

ino 16, 18, 21 of course are the files with errors.


inode 18 is the simplest case with just one hole, so let's look at that:

#hexdump file9
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
3c8c000 0000 0000 0000 0000 0000 0000 0000 0000
*
3c8d400 ffff ffff ffff ffff ffff ffff ffff ffff
*
4000000


Let's take a look at our hole then:

#./bmap file9  // bmap is modified to print hex offsets
[... lots of stuff ...]
3c82000-3c82c00: 26fd0400-26fd1000 (1000)
3c83000-3c83c00: 26fd3400-26fd4000 (1000)
3c84000-3c84c00: 26fc9c00-26fca800 (1000)
3c85000-3c85c00: 26fcc400-26fcd000 (1000)
3c86000-3c86c00: 26fcf400-26fd0000 (1000)
3c87000-3c87c00: 26fd2400-26fd3000 (1000)
3c88000-3c88c00: 26fd5400-26fd6000 (1000)
3c89000-3c8bc00: 26fd7400-26fda000 (3000)
3c8c000-3c8c000: 0-0 (400)
3c8c400-3c8c400: 0-0 (400)
3c8c800-3c8c800: 0-0 (400)
3c8cc00-3c8cc00: 0-0 (400)
3c8d000-3c8d000: 0-0 (400)
3c8d400-3c8dc00: 26fcb800-26fcc000 (c00)
3c8e000-3c8ec00: 26fce400-26fcf000 (1000)
3c8f000-3c8fc00: 26fd1400-26fd2000 (1000)
3c90000-3c99c00: 27924400-2792e000 (a000)
3c9a000-3c9ac00: 2792f000-2792fc00 (1000)
3c9b000-3c9bc00: 27931000-27931c00 (1000)
3c9c000-3c9cc00: 27933000-27933c00 (1000)
3c9d000-3c9dc00: 27935000-27935c00 (1000)
3c9e000-3c9ec00: 27938000-27938c00 (1000)
3c9f000-3c9fc00: 2793a000-2793ac00 (1000)
[... lots more stuff ...]

3.5G filesystem image bzip2s down to 500K if anybody wants it I
can send it privately.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 15:35                   ` Nick Piggin
@ 2009-03-26 18:29                     ` Jan Kara
  0 siblings, 0 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-26 18:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Martin J. Bligh, linux-ext4, Ying Han,
	Linus Torvalds, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Wed 25-03-09 02:35:01, Nick Piggin wrote:
> On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote:
> > On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> 
> wrote:
> > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > fix it, although I guess you'd want to release it before calling
> > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > have a patch for this if it sounds reasonable.
> > > >
> > > >   Yes, that seems to be a bug - the function actually looked suspitious
> > > > to me yesterday but I somehow convinced myself that it's fine. Probably
> > > > because fsx-linux is single-threaded.
> > >
> > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > with ext3 yet, which might lend weight to that, although as we have
> > > seen, it is very timing dependent).
> > >
> > > That would be slightly unfortunate because we still have Jan's ext3
> > > problem, and also another reported problem of corruption on ext3 (on
> > > brd driver).
> > >
> > > Anyway, when I have reproduced the problem with the test case, the
> > > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > > point straight to the filesystem, because ext2 allocates blocks in this
> > > case at writeout time, so if dirty bits are getting lost, then it would
> > > be normal to see holes.
> > >
> > > I then put in a whole lot of extra infrastructure to track metadata about
> > > each struct page (when it was last written out, when it last had the
> > > number of writable ptes reach 0, when the dirty bits were last cleared
> > > etc). And none of the normal asertions were triggering: eg. when any page
> > > is removed from pagecache (except truncates), it has always had all its
> > > buffers written out *after* all ptes were made readonly or unmapped. Lots
> > > of other tests and crap like that.
> > >
> > > So I tried what I should have done to start with and did an e2fsck after
> > > seeing corruption. Yes, it comes up with errors.
> >
> > Do you recall what the errors were?
> 
> OK, after running several tests in parallel and having 3 of them
> blow up, I unmounted the fs (so error-case files are still intact).
  Nick, what tests do you use? Because on the first reading the ext2 code
looks correct so I'll probably have to reproduce the corruption...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 10:32                 ` Andrew Morton
  2009-03-24 15:35                   ` Nick Piggin
@ 2009-03-26  0:03                   ` Ying Han
  1 sibling, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-03-26  0:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Jan Kara, Martin J. Bligh, linux-ext4,
	Linus Torvalds, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue, Mar 24, 2009 at 3:32 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
>> > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
>>
>> > > Holding mapping->private_lock over the __set_page_dirty should
>> > > fix it, although I guess you'd want to release it before calling
>> > > __mark_inode_dirty so as not to put inode_lock under there. I
>> > > have a patch for this if it sounds reasonable.
>> >
>> >   Yes, that seems to be a bug - the function actually looked suspitious to
>> > me yesterday but I somehow convinced myself that it's fine. Probably
>> > because fsx-linux is single-threaded.
>>
>>
>> After a whole lot of chasing my own tail in the VM and buffer layers,
>> I think it is a problem in ext2 (and I haven't been able to reproduce
>> with ext3 yet, which might lend weight to that, although as we have
>> seen, it is very timing dependent).
>>
>> That would be slightly unfortunate because we still have Jan's ext3
>> problem, and also another reported problem of corruption on ext3 (on
>> brd driver).
>>
>> Anyway, when I have reproduced the problem with the test case, the
>> "lost" writes are all reported to be holes. Unfortunately, that doesn't
>> point straight to the filesystem, because ext2 allocates blocks in this
>> case at writeout time, so if dirty bits are getting lost, then it would
>> be normal to see holes.
>>
>> I then put in a whole lot of extra infrastructure to track metadata about
>> each struct page (when it was last written out, when it last had the number
>> of writable ptes reach 0, when the dirty bits were last cleared etc). And
>> none of the normal asertions were triggering: eg. when any page is removed
>> from pagecache (except truncates), it has always had all its buffers
>> written out *after* all ptes were made readonly or unmapped. Lots of other
>> tests and crap like that.
>>
>> So I tried what I should have done to start with and did an e2fsck after
>> seeing corruption. Yes, it comes up with errors.
>
> Do you recall what the errors were?

I run e2fsck on the partition after the failure happened and here is
what i saw, not sure if that is the same message Jan looked at:

e2fsck 1.41.3 (12-Oct-2008)
Warning!  /dev/hda1 is mounted.
/dev/hda1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +74915 -195111 -224680
Fix? no

Free blocks count wrong for group #6 (170, counted=169).
Fix? no

Free blocks count wrong (10120, counted=523).
Fix? no

Free inodes count wrong (95678, counted=95672).
Fix? no


/dev/hda1: ********** WARNING: Filesystem still has errors **********

/dev/hda1: 35938/131616 files (1.5% non-contiguous), 252936/263056 blocks

--Ying


>
>> Now that is unusual
>> because that should be largely insulated from the vm: if a dirty bit gets
>> lost, then the filesystem image should be quite happy and error-free with
>> a hole or unwritten data there.
>>
>> I don't know ext? locking very well, except that it looks pretty overly
>> complex and crufty.
>>
>> Usually, blocks are instantiated by write(2), under i_mutex, serialising
>> the allocator somewhat. mmap-write blocks are instantiated at writeout
>> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
>> function, and can no longer trigger the problem. Might be a timing issue
>> though -- Ying, can you try this and see if you can still reproduce?
>>
>> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
>> Nice. Whether or not this cased the problem, can someone please tell me
>> why it got merged in that state?
>>
>> I'm leaving ext3 running for now. It looks like a nasty task to bisect
>> ext2 down to that commit :( and I would be more interested in trying to
>> reproduce Jan's ext3 problem, however, because I'm not too interested in
>> diving into ext2 locking to work out exactly what is racing and how to
>> fix it properly. I suspect it would be most productive to wire up some
>> ioctls right into the block allocator/lookup and code up a userspace
>> tester for it that could probably stress it a lot harder than kernel
>> writeout can.
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24  7:44               ` Nick Piggin
  2009-03-24 10:27                 ` Nick Piggin
  2009-03-24 10:32                 ` Andrew Morton
@ 2009-03-24 12:39                 ` Jan Kara
  2009-03-24 12:55                   ` Jan Kara
  2009-03-27 20:35                 ` Ying Han
  3 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-03-24 12:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

  Hi,

On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> 
> > > Holding mapping->private_lock over the __set_page_dirty should
> > > fix it, although I guess you'd want to release it before calling
> > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > have a patch for this if it sounds reasonable.
> >
> >   Yes, that seems to be a bug - the function actually looked suspitious to
> > me yesterday but I somehow convinced myself that it's fine. Probably
> > because fsx-linux is single-threaded.
> 
> 
> After a whole lot of chasing my own tail in the VM and buffer layers,
> I think it is a problem in ext2 (and I haven't been able to reproduce
> with ext3 yet, which might lend weight to that, although as we have
> seen, it is very timing dependent).
> 
> That would be slightly unfortunate because we still have Jan's ext3
> problem, and also another reported problem of corruption on ext3 (on
> brd driver).
> 
> Anyway, when I have reproduced the problem with the test case, the
> "lost" writes are all reported to be holes. Unfortunately, that doesn't
> point straight to the filesystem, because ext2 allocates blocks in this
> case at writeout time, so if dirty bits are getting lost, then it would
> be normal to see holes.
> 
> I then put in a whole lot of extra infrastructure to track metadata about
> each struct page (when it was last written out, when it last had the number
> of writable ptes reach 0, when the dirty bits were last cleared etc). And
> none of the normal asertions were triggering: eg. when any page is removed
> from pagecache (except truncates), it has always had all its buffers
> written out *after* all ptes were made readonly or unmapped. Lots of other
> tests and crap like that.
  I see we're going the same way ;)

> So I tried what I should have done to start with and did an e2fsck after
> seeing corruption. Yes, it comes up with errors. Now that is unusual
> because that should be largely insulated from the vm: if a dirty bit gets
> lost, then the filesystem image should be quite happy and error-free with
> a hole or unwritten data there.
  This is different for me. I see no corruption on the filesystem with
ext3. Anyway, errors from fsck would be useful to see what kind of
corruption you saw.

> I don't know ext? locking very well, except that it looks pretty overly
> complex and crufty.
> 
> Usually, blocks are instantiated by write(2), under i_mutex, serialising
> the allocator somewhat. mmap-write blocks are instantiated at writeout
> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> function, and can no longer trigger the problem. Might be a timing issue
> though -- Ying, can you try this and see if you can still reproduce?
> 
> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this cased the problem, can someone please tell me
> why it got merged in that state?
  Maybe, I see it did some changes to ext2_get_blocks() which could be
where the problem was introduced...

> I'm leaving ext3 running for now. It looks like a nasty task to bisect
> ext2 down to that commit :( and I would be more interested in trying to
> reproduce Jan's ext3 problem, however, because I'm not too interested in
> diving into ext2 locking to work out exactly what is racing and how to
> fix it properly. I suspect it would be most productive to wire up some
> ioctls right into the block allocator/lookup and code up a userspace
> tester for it that could probably stress it a lot harder than kernel
> writeout can.
  Yes, what I observed with ext3 so far is that data is properly copied and
page marked dirty when the data is copied in. But then at some point dirty
bit is cleared via block_write_full_page() but we don't get to submitting
at least one buffer in that page. I'm now debugging which path we take so
that this happens...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 12:39                 ` Jan Kara
@ 2009-03-24 12:55                   ` Jan Kara
  2009-03-24 13:26                     ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-03-24 12:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue 24-03-09 13:39:36, Jan Kara wrote:
>   Hi,
> 
> On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > 
> > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > fix it, although I guess you'd want to release it before calling
> > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > have a patch for this if it sounds reasonable.
> > >
> > >   Yes, that seems to be a bug - the function actually looked suspitious to
> > > me yesterday but I somehow convinced myself that it's fine. Probably
> > > because fsx-linux is single-threaded.
> > 
> > 
> > After a whole lot of chasing my own tail in the VM and buffer layers,
> > I think it is a problem in ext2 (and I haven't been able to reproduce
> > with ext3 yet, which might lend weight to that, although as we have
> > seen, it is very timing dependent).
> > 
> > That would be slightly unfortunate because we still have Jan's ext3
> > problem, and also another reported problem of corruption on ext3 (on
> > brd driver).
> > 
> > Anyway, when I have reproduced the problem with the test case, the
> > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > point straight to the filesystem, because ext2 allocates blocks in this
> > case at writeout time, so if dirty bits are getting lost, then it would
> > be normal to see holes.
> > 
> > I then put in a whole lot of extra infrastructure to track metadata about
> > each struct page (when it was last written out, when it last had the number
> > of writable ptes reach 0, when the dirty bits were last cleared etc). And
> > none of the normal asertions were triggering: eg. when any page is removed
> > from pagecache (except truncates), it has always had all its buffers
> > written out *after* all ptes were made readonly or unmapped. Lots of other
> > tests and crap like that.
>   I see we're going the same way ;)
> 
> > So I tried what I should have done to start with and did an e2fsck after
> > seeing corruption. Yes, it comes up with errors. Now that is unusual
> > because that should be largely insulated from the vm: if a dirty bit gets
> > lost, then the filesystem image should be quite happy and error-free with
> > a hole or unwritten data there.
>   This is different for me. I see no corruption on the filesystem with
> ext3. Anyway, errors from fsck would be useful to see what kind of
> corruption you saw.
> 
> > I don't know ext? locking very well, except that it looks pretty overly
> > complex and crufty.
> > 
> > Usually, blocks are instantiated by write(2), under i_mutex, serialising
> > the allocator somewhat. mmap-write blocks are instantiated at writeout
> > time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> > function, and can no longer trigger the problem. Might be a timing issue
> > though -- Ying, can you try this and see if you can still reproduce?
> > 
> > I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> > Nice. Whether or not this cased the problem, can someone please tell me
> > why it got merged in that state?
>   Maybe, I see it did some changes to ext2_get_blocks() which could be
> where the problem was introduced...
> 
> > I'm leaving ext3 running for now. It looks like a nasty task to bisect
> > ext2 down to that commit :( and I would be more interested in trying to
> > reproduce Jan's ext3 problem, however, because I'm not too interested in
> > diving into ext2 locking to work out exactly what is racing and how to
> > fix it properly. I suspect it would be most productive to wire up some
> > ioctls right into the block allocator/lookup and code up a userspace
> > tester for it that could probably stress it a lot harder than kernel
> > writeout can.
>   Yes, what I observed with ext3 so far is that data is properly copied and
> page marked dirty when the data is copied in. But then at some point dirty
> bit is cleared via block_write_full_page() but we don't get to submitting
> at least one buffer in that page. I'm now debugging which path we take so
> that this happens...
  And one more interesting thing I don't yet fully understand - I see pages
having PageError() set when they are removed from page cache (and they have
been faulted in before). It's probably some interaction with pagecache
readahead...

 									Honza

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 12:55                   ` Jan Kara
@ 2009-03-24 13:26                     ` Jan Kara
  2009-03-24 14:01                       ` Chris Mason
  2009-03-24 14:30                       ` Nick Piggin
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-24 13:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue 24-03-09 13:55:10, Jan Kara wrote:
> On Tue 24-03-09 13:39:36, Jan Kara wrote:
> >   Hi,
> > 
> > On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > 
> > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > fix it, although I guess you'd want to release it before calling
> > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > have a patch for this if it sounds reasonable.
> > > >
> > > >   Yes, that seems to be a bug - the function actually looked suspitious to
> > > > me yesterday but I somehow convinced myself that it's fine. Probably
> > > > because fsx-linux is single-threaded.
> > > 
> > > 
> > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > with ext3 yet, which might lend weight to that, although as we have
> > > seen, it is very timing dependent).
> > > 
> > > That would be slightly unfortunate because we still have Jan's ext3
> > > problem, and also another reported problem of corruption on ext3 (on
> > > brd driver).
> > > 
> > > Anyway, when I have reproduced the problem with the test case, the
> > > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > > point straight to the filesystem, because ext2 allocates blocks in this
> > > case at writeout time, so if dirty bits are getting lost, then it would
> > > be normal to see holes.
> > > 
> > > I then put in a whole lot of extra infrastructure to track metadata about
> > > each struct page (when it was last written out, when it last had the number
> > > of writable ptes reach 0, when the dirty bits were last cleared etc). And
> > > none of the normal asertions were triggering: eg. when any page is removed
> > > from pagecache (except truncates), it has always had all its buffers
> > > written out *after* all ptes were made readonly or unmapped. Lots of other
> > > tests and crap like that.
> >   I see we're going the same way ;)
> > 
> > > So I tried what I should have done to start with and did an e2fsck after
> > > seeing corruption. Yes, it comes up with errors. Now that is unusual
> > > because that should be largely insulated from the vm: if a dirty bit gets
> > > lost, then the filesystem image should be quite happy and error-free with
> > > a hole or unwritten data there.
> >   This is different for me. I see no corruption on the filesystem with
> > ext3. Anyway, errors from fsck would be useful to see what kind of
> > corruption you saw.
> > 
> > > I don't know ext? locking very well, except that it looks pretty overly
> > > complex and crufty.
> > > 
> > > Usually, blocks are instantiated by write(2), under i_mutex, serialising
> > > the allocator somewhat. mmap-write blocks are instantiated at writeout
> > > time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> > > function, and can no longer trigger the problem. Might be a timing issue
> > > though -- Ying, can you try this and see if you can still reproduce?
> > > 
> > > I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> > > Nice. Whether or not this cased the problem, can someone please tell me
> > > why it got merged in that state?
> >   Maybe, I see it did some changes to ext2_get_blocks() which could be
> > where the problem was introduced...
> > 
> > > I'm leaving ext3 running for now. It looks like a nasty task to bisect
> > > ext2 down to that commit :( and I would be more interested in trying to
> > > reproduce Jan's ext3 problem, however, because I'm not too interested in
> > > diving into ext2 locking to work out exactly what is racing and how to
> > > fix it properly. I suspect it would be most productive to wire up some
> > > ioctls right into the block allocator/lookup and code up a userspace
> > > tester for it that could probably stress it a lot harder than kernel
> > > writeout can.
> >   Yes, what I observed with ext3 so far is that data is properly copied and
> > page marked dirty when the data is copied in. But then at some point dirty
> > bit is cleared via block_write_full_page() but we don't get to submitting
> > at least one buffer in that page. I'm now debugging which path we take so
> > that this happens...
>   And one more interesting thing I don't yet fully understand - I see pages
> having PageError() set when they are removed from page cache (and they have
> been faulted in before). It's probably some interaction with pagecache
> readahead...
  Argh... So the problem seems to be that get_block() occasionally returns
ENOSPC and we then discard the dirty data (hmm, we could give at least a
warning for that). I'm not yet sure why getblock behaves like this because
the filesystem seems to have enough space but anyway this seems to be some
strange fs trouble as well.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 13:26                     ` Jan Kara
@ 2009-03-24 14:01                       ` Chris Mason
  2009-03-24 14:07                         ` Jan Kara
  2009-03-24 14:30                       ` Nick Piggin
  1 sibling, 1 reply; 60+ messages in thread
From: Chris Mason @ 2009-03-24 14:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz,
	Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote:
> On Tue 24-03-09 13:55:10, Jan Kara wrote:

> >   And one more interesting thing I don't yet fully understand - I see pages
> > having PageError() set when they are removed from page cache (and they have
> > been faulted in before). It's probably some interaction with pagecache
> > readahead...
>   Argh... So the problem seems to be that get_block() occasionally returns
> ENOSPC and we then discard the dirty data (hmm, we could give at least a
> warning for that). I'm not yet sure why getblock behaves like this because
> the filesystem seems to have enough space but anyway this seems to be some
> strange fs trouble as well.
> 

Ouch.  Perhaps the free space is waiting on a journal commit?

-chris



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 14:01                       ` Chris Mason
@ 2009-03-24 14:07                         ` Jan Kara
  2009-03-26  8:18                           ` Aneesh Kumar K.V
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-03-24 14:07 UTC (permalink / raw)
  To: Chris Mason
  Cc: Jan Kara, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz,
	Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue 24-03-09 10:01:45, Chris Mason wrote:
> On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote:
> > On Tue 24-03-09 13:55:10, Jan Kara wrote:
> 
> > >   And one more interesting thing I don't yet fully understand - I see pages
> > > having PageError() set when they are removed from page cache (and they have
> > > been faulted in before). It's probably some interaction with pagecache
> > > readahead...
> >   Argh... So the problem seems to be that get_block() occasionally returns
> > ENOSPC and we then discard the dirty data (hmm, we could give at least a
> > warning for that). I'm not yet sure why getblock behaves like this because
> > the filesystem seems to have enough space but anyway this seems to be some
> > strange fs trouble as well.
> > 
> 
> Ouch.  Perhaps the free space is waiting on a journal commit?
  Yes, exactly. I've already found there's lot of space hold by the
committing transaction (it can easily hold a few hundred megs or a few gigs
with larger journal and my UML images aren't that big...). And writepage()
implementation in ext3 does not have a logic to retry. Also
block_write_full_page() clears buffers dirty bits so it's not easy to retry
even if we did it. I'm now looking into how to fix this...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 14:07                         ` Jan Kara
@ 2009-03-26  8:18                           ` Aneesh Kumar K.V
  0 siblings, 0 replies; 60+ messages in thread
From: Aneesh Kumar K.V @ 2009-03-26  8:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Chris Mason, Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz,
	Alex Khesin, Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue, Mar 24, 2009 at 03:07:21PM +0100, Jan Kara wrote:
> On Tue 24-03-09 10:01:45, Chris Mason wrote:
> > On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote:
> > > On Tue 24-03-09 13:55:10, Jan Kara wrote:
> > 
> > > >   And one more interesting thing I don't yet fully understand - I see pages
> > > > having PageError() set when they are removed from page cache (and they have
> > > > been faulted in before). It's probably some interaction with pagecache
> > > > readahead...
> > >   Argh... So the problem seems to be that get_block() occasionally returns
> > > ENOSPC and we then discard the dirty data (hmm, we could give at least a
> > > warning for that). I'm not yet sure why getblock behaves like this because
> > > the filesystem seems to have enough space but anyway this seems to be some
> > > strange fs trouble as well.
> > > 
> > 
> > Ouch.  Perhaps the free space is waiting on a journal commit?
>   Yes, exactly. I've already found there's lot of space hold by the
> committing transaction (it can easily hold a few hundred megs or a few gigs
> with larger journal and my UML images aren't that big...). And writepage()
> implementation in ext3 does not have a logic to retry. Also
> block_write_full_page() clears buffers dirty bits so it's not easy to retry
> even if we did it. I'm now looking into how to fix this...

We retry block allocation in ext3_write_begin. And for mmap we should be
doing something similar to ext4_page_mkwrite so that we can be sure
that during writepage we don't need to do block allocation.

-aneesh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 13:26                     ` Jan Kara
  2009-03-24 14:01                       ` Chris Mason
@ 2009-03-24 14:30                       ` Nick Piggin
  2009-03-24 14:47                         ` Jan Kara
  1 sibling, 1 reply; 60+ messages in thread
From: Nick Piggin @ 2009-03-24 14:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Wednesday 25 March 2009 00:26:37 Jan Kara wrote:
> On Tue 24-03-09 13:55:10, Jan Kara wrote:
> > On Tue 24-03-09 13:39:36, Jan Kara wrote:
> > >   Hi,
> > >
> > > On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > > fix it, although I guess you'd want to release it before calling
> > > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > > have a patch for this if it sounds reasonable.
> > > > >
> > > > >   Yes, that seems to be a bug - the function actually looked
> > > > > suspitious to me yesterday but I somehow convinced myself that it's
> > > > > fine. Probably because fsx-linux is single-threaded.
> > > >
> > > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > > with ext3 yet, which might lend weight to that, although as we have
> > > > seen, it is very timing dependent).
> > > >
> > > > That would be slightly unfortunate because we still have Jan's ext3
> > > > problem, and also another reported problem of corruption on ext3 (on
> > > > brd driver).
> > > >
> > > > Anyway, when I have reproduced the problem with the test case, the
> > > > "lost" writes are all reported to be holes. Unfortunately, that
> > > > doesn't point straight to the filesystem, because ext2 allocates
> > > > blocks in this case at writeout time, so if dirty bits are getting
> > > > lost, then it would be normal to see holes.
> > > >
> > > > I then put in a whole lot of extra infrastructure to track metadata
> > > > about each struct page (when it was last written out, when it last
> > > > had the number of writable ptes reach 0, when the dirty bits were
> > > > last cleared etc). And none of the normal asertions were triggering:
> > > > eg. when any page is removed from pagecache (except truncates), it
> > > > has always had all its buffers written out *after* all ptes were made
> > > > readonly or unmapped. Lots of other tests and crap like that.
> > >
> > >   I see we're going the same way ;)
> > >
> > > > So I tried what I should have done to start with and did an e2fsck
> > > > after seeing corruption. Yes, it comes up with errors. Now that is
> > > > unusual because that should be largely insulated from the vm: if a
> > > > dirty bit gets lost, then the filesystem image should be quite happy
> > > > and error-free with a hole or unwritten data there.
> > >
> > >   This is different for me. I see no corruption on the filesystem with
> > > ext3. Anyway, errors from fsck would be useful to see what kind of
> > > corruption you saw.
> > >
> > > > I don't know ext? locking very well, except that it looks pretty
> > > > overly complex and crufty.
> > > >
> > > > Usually, blocks are instantiated by write(2), under i_mutex,
> > > > serialising the allocator somewhat. mmap-write blocks are
> > > > instantiated at writeout time, unserialised. I moved truncate_mutex
> > > > to cover the entire get_blocks function, and can no longer trigger
> > > > the problem. Might be a timing issue though -- Ying, can you try this
> > > > and see if you can still reproduce?
> > > >
> > > > I close my eyes and pick something out of a hat. a686cd89. Search for
> > > > XXX. Nice. Whether or not this cased the problem, can someone please
> > > > tell me why it got merged in that state?
> > >
> > >   Maybe, I see it did some changes to ext2_get_blocks() which could be
> > > where the problem was introduced...
> > >
> > > > I'm leaving ext3 running for now. It looks like a nasty task to
> > > > bisect ext2 down to that commit :( and I would be more interested in
> > > > trying to reproduce Jan's ext3 problem, however, because I'm not too
> > > > interested in diving into ext2 locking to work out exactly what is
> > > > racing and how to fix it properly. I suspect it would be most
> > > > productive to wire up some ioctls right into the block
> > > > allocator/lookup and code up a userspace tester for it that could
> > > > probably stress it a lot harder than kernel writeout can.
> > >
> > >   Yes, what I observed with ext3 so far is that data is properly copied
> > > and page marked dirty when the data is copied in. But then at some
> > > point dirty bit is cleared via block_write_full_page() but we don't get
> > > to submitting at least one buffer in that page. I'm now debugging which
> > > path we take so that this happens...
> >
> >   And one more interesting thing I don't yet fully understand - I see
> > pages having PageError() set when they are removed from page cache (and
> > they have been faulted in before). It's probably some interaction with
> > pagecache readahead...
>
>   Argh... So the problem seems to be that get_block() occasionally returns
> ENOSPC and we then discard the dirty data (hmm, we could give at least a
> warning for that). I'm not yet sure why getblock behaves like this because
> the filesystem seems to have enough space but anyway this seems to be some
> strange fs trouble as well.

Ah good find.

I don't think it is a very good idea for block_write_full_page recovery
to do clear_buffer_dirty for !mapped buffers. I think that should rather
be a redirty_page_for_writepage in the case that the buffer is dirty.

Perhaps not the cleanest way to solve the problem if it is just due to
transient shortage of space in ext3, but generic code shouldn't be
allowed to throw away dirty data even if it can't be written back due
to some software or hardware error.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 14:30                       ` Nick Piggin
@ 2009-03-24 14:47                         ` Jan Kara
  2009-03-24 14:56                           ` Peter Zijlstra
  2009-03-24 15:03                           ` Nick Piggin
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-24 14:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> On Wednesday 25 March 2009 00:26:37 Jan Kara wrote:
> > On Tue 24-03-09 13:55:10, Jan Kara wrote:
> > > On Tue 24-03-09 13:39:36, Jan Kara wrote:
> > > >   Hi,
> > > >
> > > > On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > > > fix it, although I guess you'd want to release it before calling
> > > > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > > > have a patch for this if it sounds reasonable.
> > > > > >
> > > > > >   Yes, that seems to be a bug - the function actually looked
> > > > > > suspitious to me yesterday but I somehow convinced myself that it's
> > > > > > fine. Probably because fsx-linux is single-threaded.
> > > > >
> > > > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > > > with ext3 yet, which might lend weight to that, although as we have
> > > > > seen, it is very timing dependent).
> > > > >
> > > > > That would be slightly unfortunate because we still have Jan's ext3
> > > > > problem, and also another reported problem of corruption on ext3 (on
> > > > > brd driver).
> > > > >
> > > > > Anyway, when I have reproduced the problem with the test case, the
> > > > > "lost" writes are all reported to be holes. Unfortunately, that
> > > > > doesn't point straight to the filesystem, because ext2 allocates
> > > > > blocks in this case at writeout time, so if dirty bits are getting
> > > > > lost, then it would be normal to see holes.
> > > > >
> > > > > I then put in a whole lot of extra infrastructure to track metadata
> > > > > about each struct page (when it was last written out, when it last
> > > > > had the number of writable ptes reach 0, when the dirty bits were
> > > > > last cleared etc). And none of the normal asertions were triggering:
> > > > > eg. when any page is removed from pagecache (except truncates), it
> > > > > has always had all its buffers written out *after* all ptes were made
> > > > > readonly or unmapped. Lots of other tests and crap like that.
> > > >
> > > >   I see we're going the same way ;)
> > > >
> > > > > So I tried what I should have done to start with and did an e2fsck
> > > > > after seeing corruption. Yes, it comes up with errors. Now that is
> > > > > unusual because that should be largely insulated from the vm: if a
> > > > > dirty bit gets lost, then the filesystem image should be quite happy
> > > > > and error-free with a hole or unwritten data there.
> > > >
> > > >   This is different for me. I see no corruption on the filesystem with
> > > > ext3. Anyway, errors from fsck would be useful to see what kind of
> > > > corruption you saw.
> > > >
> > > > > I don't know ext? locking very well, except that it looks pretty
> > > > > overly complex and crufty.
> > > > >
> > > > > Usually, blocks are instantiated by write(2), under i_mutex,
> > > > > serialising the allocator somewhat. mmap-write blocks are
> > > > > instantiated at writeout time, unserialised. I moved truncate_mutex
> > > > > to cover the entire get_blocks function, and can no longer trigger
> > > > > the problem. Might be a timing issue though -- Ying, can you try this
> > > > > and see if you can still reproduce?
> > > > >
> > > > > I close my eyes and pick something out of a hat. a686cd89. Search for
> > > > > XXX. Nice. Whether or not this cased the problem, can someone please
> > > > > tell me why it got merged in that state?
> > > >
> > > >   Maybe, I see it did some changes to ext2_get_blocks() which could be
> > > > where the problem was introduced...
> > > >
> > > > > I'm leaving ext3 running for now. It looks like a nasty task to
> > > > > bisect ext2 down to that commit :( and I would be more interested in
> > > > > trying to reproduce Jan's ext3 problem, however, because I'm not too
> > > > > interested in diving into ext2 locking to work out exactly what is
> > > > > racing and how to fix it properly. I suspect it would be most
> > > > > productive to wire up some ioctls right into the block
> > > > > allocator/lookup and code up a userspace tester for it that could
> > > > > probably stress it a lot harder than kernel writeout can.
> > > >
> > > >   Yes, what I observed with ext3 so far is that data is properly copied
> > > > and page marked dirty when the data is copied in. But then at some
> > > > point dirty bit is cleared via block_write_full_page() but we don't get
> > > > to submitting at least one buffer in that page. I'm now debugging which
> > > > path we take so that this happens...
> > >
> > >   And one more interesting thing I don't yet fully understand - I see
> > > pages having PageError() set when they are removed from page cache (and
> > > they have been faulted in before). It's probably some interaction with
> > > pagecache readahead...
> >
> >   Argh... So the problem seems to be that get_block() occasionally returns
> > ENOSPC and we then discard the dirty data (hmm, we could give at least a
> > warning for that). I'm not yet sure why getblock behaves like this because
> > the filesystem seems to have enough space but anyway this seems to be some
> > strange fs trouble as well.
> 
> Ah good find.
> 
> I don't think it is a very good idea for block_write_full_page recovery
> to do clear_buffer_dirty for !mapped buffers. I think that should rather
> be a redirty_page_for_writepage in the case that the buffer is dirty.
> 
> Perhaps not the cleanest way to solve the problem if it is just due to
> transient shortage of space in ext3, but generic code shouldn't be
> allowed to throw away dirty data even if it can't be written back due
> to some software or hardware error.
  Well, that would be one possibility. But then we'd be left with dirty
pages we cannot ever release since they are constantly dirty (when the
filesystem really becomes out of space). So what I
rather want to do is something like below:

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index d351eab..77c526f 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
 }
 
 /*
+ * Decides whether it's worthwhile to wait for transaction commit and
+ * retry allocation. If it is, function waits 1 is returns, otherwise
+ * 0 is returned. In both cases we redirty page and it's buffers so that
+ * data is not lost. In case we've retried too many times, we also return
+ * 0 and don't redirty the page. Data gets discarded but we cannot hang
+ * writepage forever...
+ */
+static int ext3_writepage_retry_alloc(struct page *page, int *retries,
+				      struct writeback_control *wbc)
+{
+	struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb;
+	int ret = 0;
+
+	/*
+	 * We don't want to slow down background writeback too much. On the
+	 * other hand if most of the dirty data needs allocation, we better
+	 * wait to make some progress
+	 */
+	if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim &&
+	    wbc->pages_skipped < wbc->nr_to_write / 2)
+		goto redirty;
+	/*
+	 * Now wait if commit can free some space and we haven't retried
+	 * too much
+	 */
+	if (!ext3_should_retry_alloc(sb, retries))
+		return 0;
+	ret = 1;
+redirty:
+	set_page_dirty(page);
+	return ret;
+}
+
+/*
  * Note that we always start a transaction even if we're not journalling
  * data.  This is to preserve ordering: any hole instantiation within
  * __block_write_full_page -> ext3_get_block() should be journalled
@@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int retries;
 
 	if (ext3_journal_current_handle())
 		goto out_fail;
 
+restart:
 	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
@@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page *page,
 		ret = block_write_full_page(page, ext3_get_block, wbc);
 
 	err = ext3_journal_stop(handle);
-	if (!ret)
+	if (!ret) {
 		ret = err;
+	} else {
+		if (ret == -ENOSPC &&
+		    ext3_writepage_retry_alloc(page, &retries, wbc))
+			goto restart;
+	}
 	return ret;
 
 out_fail:

  And similarly for the other two writepage implementations in ext3...
But it currently gives me:
WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145()
probably because of that set_page_dirty() in ext3_writepage_retry_alloc().

Or we could implement ext3_mkwrite() to allocate buffers already when we
make page writeable. But it costs some performace (we have to write page
full of zeros when allocating those buffers, where previously we didn't
have to do anything) and it's not trivial to make it work if pagesize >
blocksize (we should not allocate buffers outside of i_size so if i_size
= 1024, we create just one block in ext3_mkwrite() but then we need to
allocate more when we extend the file).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 14:47                         ` Jan Kara
@ 2009-03-24 14:56                           ` Peter Zijlstra
  2009-03-24 15:29                             ` Jan Kara
  2009-03-24 15:03                           ` Nick Piggin
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Zijlstra @ 2009-03-24 14:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz,
	Alex Khesin, Mike Waychison, Rohit Seth

On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> 
> Or we could implement ext3_mkwrite() to allocate buffers already when we
> make page writeable. But it costs some performace (we have to write page
> full of zeros when allocating those buffers, where previously we didn't
> have to do anything) and it's not trivial to make it work if pagesize >
> blocksize (we should not allocate buffers outside of i_size so if i_size
> = 1024, we create just one block in ext3_mkwrite() but then we need to
> allocate more when we extend the file).

I think this is the best option, failing with SIGBUS when we fail to
allocate blocks seems consistent with other filesystems as well.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 14:56                           ` Peter Zijlstra
@ 2009-03-24 15:29                             ` Jan Kara
  2009-03-24 20:14                               ` OGAWA Hirofumi
  2009-03-26  8:47                               ` Aneesh Kumar K.V
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-24 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Ying Han,
	Linus Torvalds, Andrew Morton, linux-kernel, linux-mm, guichaz,
	Alex Khesin, Mike Waychison, Rohit Seth

On Tue 24-03-09 15:56:03, Peter Zijlstra wrote:
> On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> > 
> > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > make page writeable. But it costs some performace (we have to write page
> > full of zeros when allocating those buffers, where previously we didn't
> > have to do anything) and it's not trivial to make it work if pagesize >
> > blocksize (we should not allocate buffers outside of i_size so if i_size
> > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > allocate more when we extend the file).
> 
> I think this is the best option, failing with SIGBUS when we fail to
> allocate blocks seems consistent with other filesystems as well.
  I agree this looks attractive at the first sight. But there are drawbacks
as I wrote - the problem with blocksize < pagesize, slight performance
decrease due to additional write, page faults doing allocation can take a
*long* time and overall fragmentation is going to be higher (previously
writepage wrote pages for us in the right order, now we are going to
allocate in the first-accessed order). So I'm not sure we really want to
go this way.
  Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve
some space in mkwrite() but don't actually allocate it. That would be done
in writepage(). This would solve all the problems I describe above. We could
use PG_Checked flag to track that the page has a reservation and behave
accordingly in writepage() / invalidatepage(). ext3 in data=journal mode
already uses the flag but the use seems to be compatible with what I want
to do now... So it may actually work.
  BTW: Note that there's a plenty of filesystems that don't implement
mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
ENOSPC. So I'd not speak too much about consistency ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 15:29                             ` Jan Kara
@ 2009-03-24 20:14                               ` OGAWA Hirofumi
  2009-03-26  8:47                               ` Aneesh Kumar K.V
  1 sibling, 0 replies; 60+ messages in thread
From: OGAWA Hirofumi @ 2009-03-24 20:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Nick Piggin, Martin J. Bligh, linux-ext4,
	Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth

Jan Kara <jack@suse.cz> writes:

>   BTW: Note that there's a plenty of filesystems that don't implement
> mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
> ENOSPC. So I'd not speak too much about consistency ;).

FWIW, fatfs doesn't allow sparse file (mmap the non-allocated region),
so I guess there is no problem.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 15:29                             ` Jan Kara
  2009-03-24 20:14                               ` OGAWA Hirofumi
@ 2009-03-26  8:47                               ` Aneesh Kumar K.V
  2009-03-26 11:37                                 ` Jan Kara
  2009-03-26 23:02                                 ` Linus Torvalds
  1 sibling, 2 replies; 60+ messages in thread
From: Aneesh Kumar K.V @ 2009-03-26  8:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Nick Piggin, Martin J. Bligh, linux-ext4,
	Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth

On Tue, Mar 24, 2009 at 04:29:59PM +0100, Jan Kara wrote:
> On Tue 24-03-09 15:56:03, Peter Zijlstra wrote:
> > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> > > 
> > > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > > make page writeable. But it costs some performace (we have to write page
> > > full of zeros when allocating those buffers, where previously we didn't
> > > have to do anything) and it's not trivial to make it work if pagesize >
> > > blocksize (we should not allocate buffers outside of i_size so if i_size
> > > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > > allocate more when we extend the file).
> > 
> > I think this is the best option, failing with SIGBUS when we fail to
> > allocate blocks seems consistent with other filesystems as well.
>   I agree this looks attractive at the first sight. But there are drawbacks
> as I wrote - the problem with blocksize < pagesize, slight performance
> decrease due to additional write,

It should not cause an additional write. Can you let me why it would
result in additional write ?


>page faults doing allocation can take a
> *long* time 

That is true

>and overall fragmentation is going to be higher (previously
> writepage wrote pages for us in the right order, now we are going to
> allocate in the first-accessed order). So I'm not sure we really want to
> go this way.


block allocator should be improved to fix that. For example ext4
mballoc also look at the logical file block number when doing block
allocation. So if we does enough reservation it should handle the 
the first-accessed order and sequential order allocation properly.

Another reason why I think we would need ext3_page_mkwrite is, if we
really are out of space how do we handle it ? Currently the patch you
posted does redirty_page_for_writepage, which would imply we can't
reclaim the page and since get_block get ENOSPC we can't allocate
blocks.

>   Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve
> some space in mkwrite() but don't actually allocate it. That would be done
> in writepage(). This would solve all the problems I describe above. We could
> use PG_Checked flag to track that the page has a reservation and behave
> accordingly in writepage() / invalidatepage(). ext3 in data=journal mode
> already uses the flag but the use seems to be compatible with what I want
> to do now... So it may actually work.
>   BTW: Note that there's a plenty of filesystems that don't implement
> mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
> ENOSPC. So I'd not speak too much about consistency ;).
> 

-aneesh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-26  8:47                               ` Aneesh Kumar K.V
@ 2009-03-26 11:37                                 ` Jan Kara
  2009-03-26 23:02                                 ` Linus Torvalds
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Kara @ 2009-03-26 11:37 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Peter Zijlstra, Nick Piggin, Martin J. Bligh, linux-ext4,
	Ying Han, Linus Torvalds, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth

On Thu 26-03-09 14:17:23, Aneesh Kumar K.V wrote:
> On Tue, Mar 24, 2009 at 04:29:59PM +0100, Jan Kara wrote:
> > On Tue 24-03-09 15:56:03, Peter Zijlstra wrote:
> > > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> > > > 
> > > > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > > > make page writeable. But it costs some performace (we have to write page
> > > > full of zeros when allocating those buffers, where previously we didn't
> > > > have to do anything) and it's not trivial to make it work if pagesize >
> > > > blocksize (we should not allocate buffers outside of i_size so if i_size
> > > > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > > > allocate more when we extend the file).
> > > 
> > > I think this is the best option, failing with SIGBUS when we fail to
> > > allocate blocks seems consistent with other filesystems as well.
> >   I agree this looks attractive at the first sight. But there are drawbacks
> > as I wrote - the problem with blocksize < pagesize, slight performance
> > decrease due to additional write,
> 
> It should not cause an additional write. Can you let me why it would
> result in additional write ?
  Because if you have a new page, at the time mkwrite() or set_page_dirty()
is called, it is just full of zeros. So we attach buffers full of zeros to
the running transaction to stand to data=ordered mode requirements. Then
these get written out on transaction commit (or they can already contain some
data user has written via mmap) but we're going to write them again when
writepage() is called on the page.
  Umm, but yes, thinking more about the details, we clear buffer dirty bits
at commit time so if by that time user has copied in all the data,
subsequent writepage will find out all the buffers are clean and will not
send them to disk. So in this case overhead is going to be just
journal_start() + journal_stop(). OTOH mm usually decides to write the page
only after some time so if user writes to the page often then we really do
one more write. But in this case one additional write is going to be
probably lost in the number of total writes of the page. So yes, this is
not such a big issue as I though originally.

> >page faults doing allocation can take a
> > *long* time 
> 
> That is true
> 
> >and overall fragmentation is going to be higher (previously
> > writepage wrote pages for us in the right order, now we are going to
> > allocate in the first-accessed order). So I'm not sure we really want to
> > go this way.
> 
> 
> block allocator should be improved to fix that. For example ext4
> mballoc also look at the logical file block number when doing block
> allocation. So if we does enough reservation it should handle the 
> the first-accessed order and sequential order allocation properly.
  Well, we could definitely improve ext3 allocator. But do we really want
to backport mballoc to ext3? IMO It is easier to essentialy perform delayed
allocation at the time of mkwrite() and the do the real allocation at the
time of writepage(). So I'd rather vote for a mechanism I write about
below.

> Another reason why I think we would need ext3_page_mkwrite is, if we
> really are out of space how do we handle it ? Currently the patch you
> posted does redirty_page_for_writepage, which would imply we can't
> reclaim the page and since get_block get ENOSPC we can't allocate
> blocks.
  I definitely agree we should somehow solve this problem but the mechanism
below seems to be an easier way to me.

> >   Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve
> > some space in mkwrite() but don't actually allocate it. That would be done
> > in writepage(). This would solve all the problems I describe above. We could
> > use PG_Checked flag to track that the page has a reservation and behave
> > accordingly in writepage() / invalidatepage(). ext3 in data=journal mode
> > already uses the flag but the use seems to be compatible with what I want
> > to do now... So it may actually work.
> >   BTW: Note that there's a plenty of filesystems that don't implement
> > mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
> > ENOSPC. So I'd not speak too much about consistency ;).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-26  8:47                               ` Aneesh Kumar K.V
  2009-03-26 11:37                                 ` Jan Kara
@ 2009-03-26 23:02                                 ` Linus Torvalds
  1 sibling, 0 replies; 60+ messages in thread
From: Linus Torvalds @ 2009-03-26 23:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jan Kara, Peter Zijlstra, Nick Piggin, Martin J. Bligh,
	linux-ext4, Ying Han, Andrew Morton, linux-kernel, linux-mm,
	guichaz, Alex Khesin, Mike Waychison, Rohit Seth

On Thu, 26 Mar 2009, Aneesh Kumar K.V wrote:
> 
> >page faults doing allocation can take a
> > *long* time 
> 
> That is true

Btw, this is actually a feature rather than a bug.

We want to slow down the writer, which is why we also do dirty page 
balancing when marking a page dirty. 

Basically, if block allocation is a performance problem, then it should be 
a performance problem that is attributed to the process that _causes_ it, 
rather than to some random poor unrelated process that then later ends up 
writing the page out because it wants to use some memory.

This is why tracking dirty pages is so important. Yes, it also avoids 
various nasty overcommit situations, but the whole "make it hurt for the 
person responsible, rather than a random innocent bystander" is the more 
important part of it. 

			Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 14:47                         ` Jan Kara
  2009-03-24 14:56                           ` Peter Zijlstra
@ 2009-03-24 15:03                           ` Nick Piggin
  2009-03-24 15:48                             ` Jan Kara
  1 sibling, 1 reply; 60+ messages in thread
From: Nick Piggin @ 2009-03-24 15:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> On Wed 25-03-09 01:30:00, Nick Piggin wrote:

> > I don't think it is a very good idea for block_write_full_page recovery
> > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> > be a redirty_page_for_writepage in the case that the buffer is dirty.
> >
> > Perhaps not the cleanest way to solve the problem if it is just due to
> > transient shortage of space in ext3, but generic code shouldn't be
> > allowed to throw away dirty data even if it can't be written back due
> > to some software or hardware error.
>
>   Well, that would be one possibility. But then we'd be left with dirty
> pages we cannot ever release since they are constantly dirty (when the
> filesystem really becomes out of space). So what I

If the filesystem becomes out of space and we have over-committed these
dirty mmapped blocks, then we most definitely want to keep them around.
An error of the system losing a few pages (or if it happens an insanely
large number of times, then slowly dying due to memory leak) is better
than an app suddenly seeing the contents of the page change to nulls
under it when the kernel decides to do some page reclaim.


> rather want to do is something like below:
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index d351eab..77c526f 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle,
> struct buffer_head *bh) }
>
>  /*
> + * Decides whether it's worthwhile to wait for transaction commit and
> + * retry allocation. If it is, function waits 1 is returns, otherwise
> + * 0 is returned. In both cases we redirty page and it's buffers so that
> + * data is not lost. In case we've retried too many times, we also return
> + * 0 and don't redirty the page. Data gets discarded but we cannot hang
> + * writepage forever...
> + */
> +static int ext3_writepage_retry_alloc(struct page *page, int *retries,
> +				      struct writeback_control *wbc)
> +{
> +	struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb;
> +	int ret = 0;
> +
> +	/*
> +	 * We don't want to slow down background writeback too much. On the
> +	 * other hand if most of the dirty data needs allocation, we better
> +	 * wait to make some progress
> +	 */
> +	if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim &&
> +	    wbc->pages_skipped < wbc->nr_to_write / 2)
> +		goto redirty;
> +	/*
> +	 * Now wait if commit can free some space and we haven't retried
> +	 * too much
> +	 */
> +	if (!ext3_should_retry_alloc(sb, retries))
> +		return 0;
> +	ret = 1;
> +redirty:
> +	set_page_dirty(page);
> +	return ret;
> +}
> +
> +/*
>   * Note that we always start a transaction even if we're not journalling
>   * data.  This is to preserve ordering: any hole instantiation within
>   * __block_write_full_page -> ext3_get_block() should be journalled
> @@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page
> *page, handle_t *handle = NULL;
>  	int ret = 0;
>  	int err;
> +	int retries;
>
>  	if (ext3_journal_current_handle())
>  		goto out_fail;
>
> +restart:
>  	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
>  	if (IS_ERR(handle)) {
>  		ret = PTR_ERR(handle);
> @@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page
> *page, ret = block_write_full_page(page, ext3_get_block, wbc);
>
>  	err = ext3_journal_stop(handle);
> -	if (!ret)
> +	if (!ret) {
>  		ret = err;
> +	} else {
> +		if (ret == -ENOSPC &&
> +		    ext3_writepage_retry_alloc(page, &retries, wbc))
> +			goto restart;
> +	}
>  	return ret;
>
>  out_fail:
>
>   And similarly for the other two writepage implementations in ext3...
> But it currently gives me:
> WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145()
> probably because of that set_page_dirty() in ext3_writepage_retry_alloc().

And this is a valid warning because we don't know that all buffers are
uptodate or which ones to set as dirty, I think. Unless it is impossible
to have dirty && !uptodate pages come thought this path.

But you shouldn't need to redirty the page at all if we change
block_write_full_page in the way I suggested. Because then it won't
have cleaned the buffer, and it will have done redirty_page_for_writepage.


> Or we could implement ext3_mkwrite() to allocate buffers already when we
> make page writeable. But it costs some performace (we have to write page
> full of zeros when allocating those buffers, where previously we didn't
> have to do anything) and it's not trivial to make it work if pagesize >
> blocksize (we should not allocate buffers outside of i_size so if i_size
> = 1024, we create just one block in ext3_mkwrite() but then we need to
> allocate more when we extend the file).

Well the core page_mkwrite function doesn't care about that case
properly either (it just doesn't allocate buffers on extend). I agree it
should be fixed, but it is a little hard (I need to fix it in fsblock
and I think what is required is a setattr helper for that).


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 15:03                           ` Nick Piggin
@ 2009-03-24 15:48                             ` Jan Kara
  2009-03-24 17:35                               ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-03-24 15:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Wed 25-03-09 02:03:54, Nick Piggin wrote:
> On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> 
> > > I don't think it is a very good idea for block_write_full_page recovery
> > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> > > be a redirty_page_for_writepage in the case that the buffer is dirty.
> > >
> > > Perhaps not the cleanest way to solve the problem if it is just due to
> > > transient shortage of space in ext3, but generic code shouldn't be
> > > allowed to throw away dirty data even if it can't be written back due
> > > to some software or hardware error.
> >
> >   Well, that would be one possibility. But then we'd be left with dirty
> > pages we cannot ever release since they are constantly dirty (when the
> > filesystem really becomes out of space). So what I
> 
> If the filesystem becomes out of space and we have over-committed these
> dirty mmapped blocks, then we most definitely want to keep them around.
> An error of the system losing a few pages (or if it happens an insanely
> large number of times, then slowly dying due to memory leak) is better
> than an app suddenly seeing the contents of the page change to nulls
> under it when the kernel decides to do some page reclaim.
  Hmm, probably you're right. Definitely it would be much easier to track
the problem down than it is now... Thinking a bit more... But couldn't a
malicious user bring the machine easily to OOM this way? That would be
unfortunate.

> > rather want to do is something like below:
> >
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index d351eab..77c526f 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle,
> > struct buffer_head *bh) }
> >
> >  /*
> > + * Decides whether it's worthwhile to wait for transaction commit and
> > + * retry allocation. If it is, function waits 1 is returns, otherwise
> > + * 0 is returned. In both cases we redirty page and it's buffers so that
> > + * data is not lost. In case we've retried too many times, we also return
> > + * 0 and don't redirty the page. Data gets discarded but we cannot hang
> > + * writepage forever...
> > + */
> > +static int ext3_writepage_retry_alloc(struct page *page, int *retries,
> > +				      struct writeback_control *wbc)
> > +{
> > +	struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb;
> > +	int ret = 0;
> > +
> > +	/*
> > +	 * We don't want to slow down background writeback too much. On the
> > +	 * other hand if most of the dirty data needs allocation, we better
> > +	 * wait to make some progress
> > +	 */
> > +	if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim &&
> > +	    wbc->pages_skipped < wbc->nr_to_write / 2)
> > +		goto redirty;
> > +	/*
> > +	 * Now wait if commit can free some space and we haven't retried
> > +	 * too much
> > +	 */
> > +	if (!ext3_should_retry_alloc(sb, retries))
> > +		return 0;
> > +	ret = 1;
> > +redirty:
> > +	set_page_dirty(page);
> > +	return ret;
> > +}
> > +
> > +/*
> >   * Note that we always start a transaction even if we're not journalling
> >   * data.  This is to preserve ordering: any hole instantiation within
> >   * __block_write_full_page -> ext3_get_block() should be journalled
> > @@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page
> > *page, handle_t *handle = NULL;
> >  	int ret = 0;
> >  	int err;
> > +	int retries;
> >
> >  	if (ext3_journal_current_handle())
> >  		goto out_fail;
> >
> > +restart:
> >  	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> >  	if (IS_ERR(handle)) {
> >  		ret = PTR_ERR(handle);
> > @@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page
> > *page, ret = block_write_full_page(page, ext3_get_block, wbc);
> >
> >  	err = ext3_journal_stop(handle);
> > -	if (!ret)
> > +	if (!ret) {
> >  		ret = err;
> > +	} else {
> > +		if (ret == -ENOSPC &&
> > +		    ext3_writepage_retry_alloc(page, &retries, wbc))
> > +			goto restart;
> > +	}
> >  	return ret;
> >
> >  out_fail:
> >
> >   And similarly for the other two writepage implementations in ext3...
> > But it currently gives me:
> > WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145()
> > probably because of that set_page_dirty() in ext3_writepage_retry_alloc().
> 
> And this is a valid warning because we don't know that all buffers are
> uptodate or which ones to set as dirty, I think. Unless it is impossible
> to have dirty && !uptodate pages come thought this path.
  Ah, OK, thanks for explanation.

> But you shouldn't need to redirty the page at all if we change
> block_write_full_page in the way I suggested. Because then it won't
> have cleaned the buffer, and it will have done redirty_page_for_writepage.
>
> > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > make page writeable. But it costs some performace (we have to write page
> > full of zeros when allocating those buffers, where previously we didn't
> > have to do anything) and it's not trivial to make it work if pagesize >
> > blocksize (we should not allocate buffers outside of i_size so if i_size
> > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > allocate more when we extend the file).
> 
> Well the core page_mkwrite function doesn't care about that case
> properly either (it just doesn't allocate buffers on extend). I agree it
> should be fixed, but it is a little hard (I need to fix it in fsblock
> and I think what is required is a setattr helper for that).
  Won't it be enough, if extending truncate called page_mkwrite() on the
page which used to be the last one in the file? That would be enough for
my use although it seems a bit hacky I agree...
  

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 15:48                             ` Jan Kara
@ 2009-03-24 17:35                               ` Jan Kara
  2009-04-01 22:36                                 ` Ying Han
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-03-24 17:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Ying Han, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue 24-03-09 16:48:14, Jan Kara wrote:
> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> > 
> > > > I don't think it is a very good idea for block_write_full_page recovery
> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
> > > >
> > > > Perhaps not the cleanest way to solve the problem if it is just due to
> > > > transient shortage of space in ext3, but generic code shouldn't be
> > > > allowed to throw away dirty data even if it can't be written back due
> > > > to some software or hardware error.
> > >
> > >   Well, that would be one possibility. But then we'd be left with dirty
> > > pages we cannot ever release since they are constantly dirty (when the
> > > filesystem really becomes out of space). So what I
> > 
> > If the filesystem becomes out of space and we have over-committed these
> > dirty mmapped blocks, then we most definitely want to keep them around.
> > An error of the system losing a few pages (or if it happens an insanely
> > large number of times, then slowly dying due to memory leak) is better
> > than an app suddenly seeing the contents of the page change to nulls
> > under it when the kernel decides to do some page reclaim.
>   Hmm, probably you're right. Definitely it would be much easier to track
> the problem down than it is now... Thinking a bit more... But couldn't a
> malicious user bring the machine easily to OOM this way? That would be
> unfortunate.
  OK, below is the patch which makes things work for me (i.e. no data
lost). What do you think?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

>From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 24 Mar 2009 16:38:22 +0100
Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page()

If getblock() fails in block_write_full_page(), we don't want to clear
dirty bits on buffers. Actually, we even want to redirty the page. This
way we just won't silently discard users data (written e.g. through mmap)
in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this
approach is that if the error is persistent we have this page pinned in
memory forever and if there are lots of such pages, we can bring the
machine OOM.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c |   10 +++-------
 1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..ae779a0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1833,9 +1833,11 @@ recover:
 	/*
 	 * ENOSPC, or some other error.  We may already have added some
 	 * blocks to the file, so we need to write these out to avoid
-	 * exposing stale data.
+	 * exposing stale data. We redirty the page so that we don't
+	 * loose data we are unable to write.
 	 * The page is currently locked and not marked for writeback
 	 */
+	redirty_page_for_writepage(wbc, page);
 	bh = head;
 	/* Recovery: lock and submit the mapped buffers */
 	do {
@@ -1843,12 +1845,6 @@ recover:
 		    !buffer_delay(bh)) {
 			lock_buffer(bh);
 			mark_buffer_async_write(bh);
-		} else {
-			/*
-			 * The buffer may have been set dirty during
-			 * attachment to a dirty page.
-			 */
-			clear_buffer_dirty(bh);
 		}
 	} while ((bh = bh->b_this_page) != head);
 	SetPageError(page);
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24 17:35                               ` Jan Kara
@ 2009-04-01 22:36                                 ` Ying Han
  2009-04-02 10:11                                   ` Jan Kara
  2009-04-02 11:24                                   ` Nick Piggin
  0 siblings, 2 replies; 60+ messages in thread
From: Ying Han @ 2009-04-01 22:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

Hi Jan:
    I feel that the problem you saw is kind of differnt than mine. As
you mentioned that you saw the PageError() message, which i don't see
it on my system. I tried you patch(based on 2.6.21) on my system and
it runs ok for 2 days, Still, since i don't see the same error message
as you saw, i am not convineced this is the root cause at least for
our problem. I am still looking into it.
    So, are you seeing the PageError() every time the problem happened?

--Ying


On Tue, Mar 24, 2009 at 10:35 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 24-03-09 16:48:14, Jan Kara wrote:
>> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
>> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
>> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
>> >
>> > > > I don't think it is a very good idea for block_write_full_page recovery
>> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
>> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
>> > > >
>> > > > Perhaps not the cleanest way to solve the problem if it is just due to
>> > > > transient shortage of space in ext3, but generic code shouldn't be
>> > > > allowed to throw away dirty data even if it can't be written back due
>> > > > to some software or hardware error.
>> > >
>> > >   Well, that would be one possibility. But then we'd be left with dirty
>> > > pages we cannot ever release since they are constantly dirty (when the
>> > > filesystem really becomes out of space). So what I
>> >
>> > If the filesystem becomes out of space and we have over-committed these
>> > dirty mmapped blocks, then we most definitely want to keep them around.
>> > An error of the system losing a few pages (or if it happens an insanely
>> > large number of times, then slowly dying due to memory leak) is better
>> > than an app suddenly seeing the contents of the page change to nulls
>> > under it when the kernel decides to do some page reclaim.
>>   Hmm, probably you're right. Definitely it would be much easier to track
>> the problem down than it is now... Thinking a bit more... But couldn't a
>> malicious user bring the machine easily to OOM this way? That would be
>> unfortunate.
>  OK, below is the patch which makes things work for me (i.e. no data
> lost). What do you think?
>
>                                                                        Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>
> From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Tue, 24 Mar 2009 16:38:22 +0100
> Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page()
>
> If getblock() fails in block_write_full_page(), we don't want to clear
> dirty bits on buffers. Actually, we even want to redirty the page. This
> way we just won't silently discard users data (written e.g. through mmap)
> in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this
> approach is that if the error is persistent we have this page pinned in
> memory forever and if there are lots of such pages, we can bring the
> machine OOM.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/buffer.c |   10 +++-------
>  1 files changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..ae779a0 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1833,9 +1833,11 @@ recover:
>        /*
>         * ENOSPC, or some other error.  We may already have added some
>         * blocks to the file, so we need to write these out to avoid
> -        * exposing stale data.
> +        * exposing stale data. We redirty the page so that we don't
> +        * loose data we are unable to write.
>         * The page is currently locked and not marked for writeback
>         */
> +       redirty_page_for_writepage(wbc, page);
>        bh = head;
>        /* Recovery: lock and submit the mapped buffers */
>        do {
> @@ -1843,12 +1845,6 @@ recover:
>                    !buffer_delay(bh)) {
>                        lock_buffer(bh);
>                        mark_buffer_async_write(bh);
> -               } else {
> -                       /*
> -                        * The buffer may have been set dirty during
> -                        * attachment to a dirty page.
> -                        */
> -                       clear_buffer_dirty(bh);
>                }
>        } while ((bh = bh->b_this_page) != head);
>        SetPageError(page);
> --
> 1.6.0.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-01 22:36                                 ` Ying Han
@ 2009-04-02 10:11                                   ` Jan Kara
  2009-04-02 11:24                                   ` Nick Piggin
  1 sibling, 0 replies; 60+ messages in thread
From: Jan Kara @ 2009-04-02 10:11 UTC (permalink / raw)
  To: Ying Han
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

  Hi Ying,

On Wed 01-04-09 15:36:13, Ying Han wrote:
>     I feel that the problem you saw is kind of differnt than mine. As
> you mentioned that you saw the PageError() message, which i don't see
> it on my system. I tried you patch(based on 2.6.21) on my system and
> it runs ok for 2 days, Still, since i don't see the same error message
> as you saw, i am not convineced this is the root cause at least for
> our problem. I am still looking into it.
>     So, are you seeing the PageError() every time the problem happened?
  Yes, but I agree that your problem is probably different. BTW: How do you
reproduce the problem?

								Honza

> On Tue, Mar 24, 2009 at 10:35 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 24-03-09 16:48:14, Jan Kara wrote:
> >> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
> >> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> >> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> >> >
> >> > > > I don't think it is a very good idea for block_write_full_page recovery
> >> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> >> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
> >> > > >
> >> > > > Perhaps not the cleanest way to solve the problem if it is just due to
> >> > > > transient shortage of space in ext3, but generic code shouldn't be
> >> > > > allowed to throw away dirty data even if it can't be written back due
> >> > > > to some software or hardware error.
> >> > >
> >> > >   Well, that would be one possibility. But then we'd be left with dirty
> >> > > pages we cannot ever release since they are constantly dirty (when the
> >> > > filesystem really becomes out of space). So what I
> >> >
> >> > If the filesystem becomes out of space and we have over-committed these
> >> > dirty mmapped blocks, then we most definitely want to keep them around.
> >> > An error of the system losing a few pages (or if it happens an insanely
> >> > large number of times, then slowly dying due to memory leak) is better
> >> > than an app suddenly seeing the contents of the page change to nulls
> >> > under it when the kernel decides to do some page reclaim.
> >>   Hmm, probably you're right. Definitely it would be much easier to track
> >> the problem down than it is now... Thinking a bit more... But couldn't a
> >> malicious user bring the machine easily to OOM this way? That would be
> >> unfortunate.
> >  OK, below is the patch which makes things work for me (i.e. no data
> > lost). What do you think?
> >
> >                                                                        Honza
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> >
> > From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@suse.cz>
> > Date: Tue, 24 Mar 2009 16:38:22 +0100
> > Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page()
> >
> > If getblock() fails in block_write_full_page(), we don't want to clear
> > dirty bits on buffers. Actually, we even want to redirty the page. This
> > way we just won't silently discard users data (written e.g. through mmap)
> > in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this
> > approach is that if the error is persistent we have this page pinned in
> > memory forever and if there are lots of such pages, we can bring the
> > machine OOM.
> >
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/buffer.c |   10 +++-------
> >  1 files changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 891e1c7..ae779a0 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1833,9 +1833,11 @@ recover:
> >        /*
> >         * ENOSPC, or some other error.  We may already have added some
> >         * blocks to the file, so we need to write these out to avoid
> > -        * exposing stale data.
> > +        * exposing stale data. We redirty the page so that we don't
> > +        * loose data we are unable to write.
> >         * The page is currently locked and not marked for writeback
> >         */
> > +       redirty_page_for_writepage(wbc, page);
> >        bh = head;
> >        /* Recovery: lock and submit the mapped buffers */
> >        do {
> > @@ -1843,12 +1845,6 @@ recover:
> >                    !buffer_delay(bh)) {
> >                        lock_buffer(bh);
> >                        mark_buffer_async_write(bh);
> > -               } else {
> > -                       /*
> > -                        * The buffer may have been set dirty during
> > -                        * attachment to a dirty page.
> > -                        */
> > -                       clear_buffer_dirty(bh);
> >                }
> >        } while ((bh = bh->b_this_page) != head);
> >        SetPageError(page);
> > --
> > 1.6.0.2
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-01 22:36                                 ` Ying Han
  2009-04-02 10:11                                   ` Jan Kara
@ 2009-04-02 11:24                                   ` Nick Piggin
  2009-04-02 11:34                                     ` Jan Kara
  2009-04-03  0:13                                     ` Ying Han
  1 sibling, 2 replies; 60+ messages in thread
From: Nick Piggin @ 2009-04-02 11:24 UTC (permalink / raw)
  To: Ying Han
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> Hi Jan:
>     I feel that the problem you saw is kind of differnt than mine. As
> you mentioned that you saw the PageError() message, which i don't see
> it on my system. I tried you patch(based on 2.6.21) on my system and
> it runs ok for 2 days, Still, since i don't see the same error message
> as you saw, i am not convineced this is the root cause at least for
> our problem. I am still looking into it.
>     So, are you seeing the PageError() every time the problem happened?

So I asked if you could test with my workaround of taking truncate_mutex
at the start of ext2_get_blocks, and report back. I never heard of any
response after that.

To reiterate: I was able to reproduce a problem with ext2 (I was testing
on brd to get IO rates high enough to reproduce it quite frequently).
I think I narrowed the problem down to block allocation or inode block
tree corruption because I was unable to reproduce it with that hack in
place.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 11:24                                   ` Nick Piggin
@ 2009-04-02 11:34                                     ` Jan Kara
  2009-04-02 15:51                                       ` Nick Piggin
  2009-04-03  0:13                                     ` Ying Han
  1 sibling, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-04-02 11:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ying Han, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> > Hi Jan:
> >     I feel that the problem you saw is kind of differnt than mine. As
> > you mentioned that you saw the PageError() message, which i don't see
> > it on my system. I tried you patch(based on 2.6.21) on my system and
> > it runs ok for 2 days, Still, since i don't see the same error message
> > as you saw, i am not convineced this is the root cause at least for
> > our problem. I am still looking into it.
> >     So, are you seeing the PageError() every time the problem happened?
> 
> So I asked if you could test with my workaround of taking truncate_mutex
> at the start of ext2_get_blocks, and report back. I never heard of any
> response after that.
> 
> To reiterate: I was able to reproduce a problem with ext2 (I was testing
> on brd to get IO rates high enough to reproduce it quite frequently).
> I think I narrowed the problem down to block allocation or inode block
> tree corruption because I was unable to reproduce it with that hack in
> place.
  Nick, what load did you use for reproduction? I'll try to reproduce it
here so that I can debug ext2...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 11:34                                     ` Jan Kara
@ 2009-04-02 15:51                                       ` Nick Piggin
  2009-04-02 17:44                                         ` Ying Han
  0 siblings, 1 reply; 60+ messages in thread
From: Nick Piggin @ 2009-04-02 15:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ying Han, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> > > Hi Jan:
> > >     I feel that the problem you saw is kind of differnt than mine. As
> > > you mentioned that you saw the PageError() message, which i don't see
> > > it on my system. I tried you patch(based on 2.6.21) on my system and
> > > it runs ok for 2 days, Still, since i don't see the same error message
> > > as you saw, i am not convineced this is the root cause at least for
> > > our problem. I am still looking into it.
> > >     So, are you seeing the PageError() every time the problem happened?
> > 
> > So I asked if you could test with my workaround of taking truncate_mutex
> > at the start of ext2_get_blocks, and report back. I never heard of any
> > response after that.
> > 
> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
> > on brd to get IO rates high enough to reproduce it quite frequently).
> > I think I narrowed the problem down to block allocation or inode block
> > tree corruption because I was unable to reproduce it with that hack in
> > place.
>   Nick, what load did you use for reproduction? I'll try to reproduce it
> here so that I can debug ext2...

OK, I set up the filesystem like this:

modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
mkfs.ext2 -b1024 /dev/ram0           #1K buffers

Test is basically unmodified except I use 64MB files, and start 8 of them
at once to (8 core system, so improve chances of hitting the bug). Although I
do see it with only 1 running it takes longer to trigger.

I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
know if that really helps speed up reproducing it. It is quite random to hit,
but I was able to hit it IIRC in under a minute with that setup.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 15:51                                       ` Nick Piggin
@ 2009-04-02 17:44                                         ` Ying Han
  2009-04-02 22:52                                           ` Ying Han
  0 siblings, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-04-02 17:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> > > Hi Jan:
>> > >     I feel that the problem you saw is kind of differnt than mine. As
>> > > you mentioned that you saw the PageError() message, which i don't see
>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> > > as you saw, i am not convineced this is the root cause at least for
>> > > our problem. I am still looking into it.
>> > >     So, are you seeing the PageError() every time the problem happened?
>> >
>> > So I asked if you could test with my workaround of taking truncate_mutex
>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> > response after that.
>> >
>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> > I think I narrowed the problem down to block allocation or inode block
>> > tree corruption because I was unable to reproduce it with that hack in
>> > place.
>>   Nick, what load did you use for reproduction? I'll try to reproduce it
>> here so that I can debug ext2...
>
> OK, I set up the filesystem like this:
>
> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
>
> Test is basically unmodified except I use 64MB files, and start 8 of them
> at once to (8 core system, so improve chances of hitting the bug). Although I
> do see it with only 1 running it takes longer to trigger.
>
> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
> know if that really helps speed up reproducing it. It is quite random to hit,
> but I was able to hit it IIRC in under a minute with that setup.
>

Here is how i reproduce it:
Filesystem is ext2 with blocksize 4096
Fill up the ram with 95% anon memory and mlockall ( put enough memory
pressure which will trigger page reclaim and background writeout)
Run one thread of the test program

and i will see "bad pages" within few minutes.

--Ying

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 17:44                                         ` Ying Han
@ 2009-04-02 22:52                                           ` Ying Han
  2009-04-02 23:39                                             ` Jan Kara
  0 siblings, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-04-02 22:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote:
> On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>>> > > Hi Jan:
>>> > >     I feel that the problem you saw is kind of differnt than mine. As
>>> > > you mentioned that you saw the PageError() message, which i don't see
>>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>>> > > it runs ok for 2 days, Still, since i don't see the same error message
>>> > > as you saw, i am not convineced this is the root cause at least for
>>> > > our problem. I am still looking into it.
>>> > >     So, are you seeing the PageError() every time the problem happened?
>>> >
>>> > So I asked if you could test with my workaround of taking truncate_mutex
>>> > at the start of ext2_get_blocks, and report back. I never heard of any
>>> > response after that.
>>> >
>>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>>> > on brd to get IO rates high enough to reproduce it quite frequently).
>>> > I think I narrowed the problem down to block allocation or inode block
>>> > tree corruption because I was unable to reproduce it with that hack in
>>> > place.
>>>   Nick, what load did you use for reproduction? I'll try to reproduce it
>>> here so that I can debug ext2...
>>
>> OK, I set up the filesystem like this:
>>
>> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
>> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
>> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
>>
>> Test is basically unmodified except I use 64MB files, and start 8 of them
>> at once to (8 core system, so improve chances of hitting the bug). Although I
>> do see it with only 1 running it takes longer to trigger.
>>
>> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> know if that really helps speed up reproducing it. It is quite random to hit,
>> but I was able to hit it IIRC in under a minute with that setup.
>>
>
> Here is how i reproduce it:
> Filesystem is ext2 with blocksize 4096
> Fill up the ram with 95% anon memory and mlockall ( put enough memory
> pressure which will trigger page reclaim and background writeout)
> Run one thread of the test program
>
> and i will see "bad pages" within few minutes.

And here is the "top" and stdout while it is getting "bad pages"
top

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3487 root      20   0 52616  50m  284 R   95  0.3   3:58.85 usemem
 3810 root      20   0  129m  99m  99m D   41  0.6   0:01.87 ftruncate_mmap
  261 root      15  -5     0    0    0 D    4  0.0   0:31.08 kswapd0
  262 root      15  -5     0    0    0 D    3  0.0   0:10.26 kswapd1

stdout:

while true; do
    ./ftruncate_mmap;
done
Running 852 bad page
Running 315 bad page
Running 999 bad page
Running 482 bad page
Running 24 bad page

--Ying

>
> --Ying
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 22:52                                           ` Ying Han
@ 2009-04-02 23:39                                             ` Jan Kara
  2009-04-03  0:25                                               ` Ying Han
  2009-04-03  1:29                                               ` Ying Han
  0 siblings, 2 replies; 60+ messages in thread
From: Jan Kara @ 2009-04-02 23:39 UTC (permalink / raw)
  To: Ying Han
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 3540 bytes --]

On Thu 02-04-09 15:52:19, Ying Han wrote:
> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote:
> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> >>> > > Hi Jan:
> >>> > >     I feel that the problem you saw is kind of differnt than mine. As
> >>> > > you mentioned that you saw the PageError() message, which i don't see
> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
> >>> > > as you saw, i am not convineced this is the root cause at least for
> >>> > > our problem. I am still looking into it.
> >>> > >     So, are you seeing the PageError() every time the problem happened?
> >>> >
> >>> > So I asked if you could test with my workaround of taking truncate_mutex
> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
> >>> > response after that.
> >>> >
> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
> >>> > I think I narrowed the problem down to block allocation or inode block
> >>> > tree corruption because I was unable to reproduce it with that hack in
> >>> > place.
> >>>   Nick, what load did you use for reproduction? I'll try to reproduce it
> >>> here so that I can debug ext2...
> >>
> >> OK, I set up the filesystem like this:
> >>
> >> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
> >> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
> >> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
> >>
> >> Test is basically unmodified except I use 64MB files, and start 8 of them
> >> at once to (8 core system, so improve chances of hitting the bug). Although I
> >> do see it with only 1 running it takes longer to trigger.
> >>
> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
> >> know if that really helps speed up reproducing it. It is quite random to hit,
> >> but I was able to hit it IIRC in under a minute with that setup.
> >>
> >
> > Here is how i reproduce it:
> > Filesystem is ext2 with blocksize 4096
> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
> > pressure which will trigger page reclaim and background writeout)
> > Run one thread of the test program
> >
> > and i will see "bad pages" within few minutes.
> 
> And here is the "top" and stdout while it is getting "bad pages"
> top
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  3487 root      20   0 52616  50m  284 R   95  0.3   3:58.85 usemem
>  3810 root      20   0  129m  99m  99m D   41  0.6   0:01.87 ftruncate_mmap
>   261 root      15  -5     0    0    0 D    4  0.0   0:31.08 kswapd0
>   262 root      15  -5     0    0    0 D    3  0.0   0:10.26 kswapd1
> 
> stdout:
> 
> while true; do
>     ./ftruncate_mmap;
> done
> Running 852 bad page
> Running 315 bad page
> Running 999 bad page
> Running 482 bad page
> Running 24 bad page
  Thanks, for the help. I've debugged the problem to a bug in
ext2_get_block(). I've already sent out a patch which should fix the issue
(at least it fixes the problem for me).
  The fix is also attached if you want to try it.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-ext2-Fix-data-corruption-for-racing-writes.patch --]
[-- Type: text/x-patch, Size: 4340 bytes --]

>From f6e454b1f9fe90d023b30daf0ae7e081580c3bb2 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 3 Apr 2009 00:24:28 +0200
Subject: [PATCH] ext2: Fix data corruption for racing writes

If two writers allocating blocks to file race with each other (e.g. because
writepages races with ordinary write or two writepages race with each other),
ext2_getblock() can be called on the same inode in parallel.  Before we are
going to allocate new blocks, we have to recheck the block chain we have
obtained so far without holding truncate_mutex. Otherwise we could overwrite
the indirect block pointer set by the other writer leading to data loss.

The below test program by Ying is able to reproduce the data loss with ext2
on in BRD in a few minutes if the machine is under memory pressure:

long kMemSize  = 50 << 20;
int kPageSize = 4096;

int main(int argc, char **argv) {
	int status;
	int count = 0;
	int i;
	char *fname = "/mnt/test.mmap";
	char *mem;
	unlink(fname);
	int fd = open(fname, O_CREAT | O_EXCL | O_RDWR, 0600);
	status = ftruncate(fd, kMemSize);
	mem = mmap(0, kMemSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	// Fill the memory with 1s.
	memset(mem, 1, kMemSize);
	sleep(2);
	for (i = 0; i < kMemSize; i++) {
		int byte_good = mem[i] != 0;
		if (!byte_good && ((i % kPageSize) == 0)) {
			//printf("%d ", i / kPageSize);
			count++;
		}
	}
	munmap(mem, kMemSize);
	close(fd);
	unlink(fname);

	if (count > 0) {
		printf("Running %d bad page\n", count);
		return 1;
	}
	return 0;
}

CC: Ying Han <yinghan@google.com>
CC: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/inode.c |   44 +++++++++++++++++++++++++++++++++-----------
 1 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b43b955..acf6788 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -590,9 +590,8 @@ static int ext2_get_blocks(struct inode *inode,
 
 	if (depth == 0)
 		return (err);
-reread:
-	partial = ext2_get_branch(inode, depth, offsets, chain, &err);
 
+	partial = ext2_get_branch(inode, depth, offsets, chain, &err);
 	/* Simplest case - block found, no allocation needed */
 	if (!partial) {
 		first_block = le32_to_cpu(chain[depth - 1].key);
@@ -602,15 +601,16 @@ reread:
 		while (count < maxblocks && count <= blocks_to_boundary) {
 			ext2_fsblk_t blk;
 
-			if (!verify_chain(chain, partial)) {
+			if (!verify_chain(chain, chain + depth - 1)) {
 				/*
 				 * Indirect block might be removed by
 				 * truncate while we were reading it.
 				 * Handling of that case: forget what we've
 				 * got now, go to reread.
 				 */
+				err = -EAGAIN;
 				count = 0;
-				goto changed;
+				break;
 			}
 			blk = le32_to_cpu(*(chain[depth-1].p + count));
 			if (blk == first_block + count)
@@ -618,7 +618,8 @@ reread:
 			else
 				break;
 		}
-		goto got_it;
+		if (err != -EAGAIN)
+			goto got_it;
 	}
 
 	/* Next simple case - plain lookup or failed read of indirect block */
@@ -626,6 +627,33 @@ reread:
 		goto cleanup;
 
 	mutex_lock(&ei->truncate_mutex);
+	/*
+	 * If the indirect block is missing while we are reading
+	 * the chain(ext3_get_branch() returns -EAGAIN err), or
+	 * if the chain has been changed after we grab the semaphore,
+	 * (either because another process truncated this branch, or
+	 * another get_block allocated this branch) re-grab the chain to see if
+	 * the request block has been allocated or not.
+	 *
+	 * Since we already block the truncate/other get_block
+	 * at this point, we will have the current copy of the chain when we
+	 * splice the branch into the tree.
+	 */
+	if (err == -EAGAIN || !verify_chain(chain, partial)) {
+		while (partial > chain) {
+			brelse(partial->bh);
+			partial--;
+		}
+		partial = ext2_get_branch(inode, depth, offsets, chain, &err);
+		if (!partial) {
+			count++;
+			mutex_unlock(&ei->truncate_mutex);
+			if (err)
+				goto cleanup;
+			clear_buffer_new(bh_result);
+			goto got_it;
+		}
+	}
 
 	/*
 	 * Okay, we need to do block allocation.  Lazily initialize the block
@@ -683,12 +711,6 @@ cleanup:
 		partial--;
 	}
 	return err;
-changed:
-	while (partial > chain) {
-		brelse(partial->bh);
-		partial--;
-	}
-	goto reread;
 }
 
 int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create)
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 23:39                                             ` Jan Kara
@ 2009-04-03  0:25                                               ` Ying Han
  2009-04-03  1:29                                               ` Ying Han
  1 sibling, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-04-03  0:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote:
> On Thu 02-04-09 15:52:19, Ying Han wrote:
>> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote:
>> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> >>> > > Hi Jan:
>> >>> > >     I feel that the problem you saw is kind of differnt than mine. As
>> >>> > > you mentioned that you saw the PageError() message, which i don't see
>> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> >>> > > as you saw, i am not convineced this is the root cause at least for
>> >>> > > our problem. I am still looking into it.
>> >>> > >     So, are you seeing the PageError() every time the problem happened?
>> >>> >
>> >>> > So I asked if you could test with my workaround of taking truncate_mutex
>> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> >>> > response after that.
>> >>> >
>> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> >>> > I think I narrowed the problem down to block allocation or inode block
>> >>> > tree corruption because I was unable to reproduce it with that hack in
>> >>> > place.
>> >>>   Nick, what load did you use for reproduction? I'll try to reproduce it
>> >>> here so that I can debug ext2...
>> >>
>> >> OK, I set up the filesystem like this:
>> >>
>> >> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
>> >> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
>> >> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
>> >>
>> >> Test is basically unmodified except I use 64MB files, and start 8 of them
>> >> at once to (8 core system, so improve chances of hitting the bug). Although I
>> >> do see it with only 1 running it takes longer to trigger.
>> >>
>> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> >> know if that really helps speed up reproducing it. It is quite random to hit,
>> >> but I was able to hit it IIRC in under a minute with that setup.
>> >>
>> >
>> > Here is how i reproduce it:
>> > Filesystem is ext2 with blocksize 4096
>> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
>> > pressure which will trigger page reclaim and background writeout)
>> > Run one thread of the test program
>> >
>> > and i will see "bad pages" within few minutes.
>>
>> And here is the "top" and stdout while it is getting "bad pages"
>> top
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  3487 root      20   0 52616  50m  284 R   95  0.3   3:58.85 usemem
>>  3810 root      20   0  129m  99m  99m D   41  0.6   0:01.87 ftruncate_mmap
>>   261 root      15  -5     0    0    0 D    4  0.0   0:31.08 kswapd0
>>   262 root      15  -5     0    0    0 D    3  0.0   0:10.26 kswapd1
>>
>> stdout:
>>
>> while true; do
>>     ./ftruncate_mmap;
>> done
>> Running 852 bad page
>> Running 315 bad page
>> Running 999 bad page
>> Running 482 bad page
>> Running 24 bad page
>  Thanks, for the help. I've debugged the problem to a bug in
> ext2_get_block(). I've already sent out a patch which should fix the issue
> (at least it fixes the problem for me).
>  The fix is also attached if you want to try it.

I just did a quick run after applied the patch, unforturnately, my
problem is still there...

--Ying
>
>                                                                        Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 23:39                                             ` Jan Kara
  2009-04-03  0:25                                               ` Ying Han
@ 2009-04-03  1:29                                               ` Ying Han
  2009-04-03  9:41                                                 ` Jan Kara
  1 sibling, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-04-03  1:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote:
> On Thu 02-04-09 15:52:19, Ying Han wrote:
>> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote:
>> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> >>> > > Hi Jan:
>> >>> > >     I feel that the problem you saw is kind of differnt than mine. As
>> >>> > > you mentioned that you saw the PageError() message, which i don't see
>> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> >>> > > as you saw, i am not convineced this is the root cause at least for
>> >>> > > our problem. I am still looking into it.
>> >>> > >     So, are you seeing the PageError() every time the problem happened?
>> >>> >
>> >>> > So I asked if you could test with my workaround of taking truncate_mutex
>> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> >>> > response after that.
>> >>> >
>> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> >>> > I think I narrowed the problem down to block allocation or inode block
>> >>> > tree corruption because I was unable to reproduce it with that hack in
>> >>> > place.
>> >>>   Nick, what load did you use for reproduction? I'll try to reproduce it
>> >>> here so that I can debug ext2...
>> >>
>> >> OK, I set up the filesystem like this:
>> >>
>> >> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
>> >> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
>> >> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
>> >>
>> >> Test is basically unmodified except I use 64MB files, and start 8 of them
>> >> at once to (8 core system, so improve chances of hitting the bug). Although I
>> >> do see it with only 1 running it takes longer to trigger.
>> >>
>> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> >> know if that really helps speed up reproducing it. It is quite random to hit,
>> >> but I was able to hit it IIRC in under a minute with that setup.
>> >>
>> >
>> > Here is how i reproduce it:
>> > Filesystem is ext2 with blocksize 4096
>> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
>> > pressure which will trigger page reclaim and background writeout)
>> > Run one thread of the test program
>> >
>> > and i will see "bad pages" within few minutes.
>>
>> And here is the "top" and stdout while it is getting "bad pages"
>> top
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  3487 root      20   0 52616  50m  284 R   95  0.3   3:58.85 usemem
>>  3810 root      20   0  129m  99m  99m D   41  0.6   0:01.87 ftruncate_mmap
>>   261 root      15  -5     0    0    0 D    4  0.0   0:31.08 kswapd0
>>   262 root      15  -5     0    0    0 D    3  0.0   0:10.26 kswapd1
>>
>> stdout:
>>
>> while true; do
>>     ./ftruncate_mmap;
>> done
>> Running 852 bad page
>> Running 315 bad page
>> Running 999 bad page
>> Running 482 bad page
>> Running 24 bad page
>  Thanks, for the help. I've debugged the problem to a bug in
> ext2_get_block(). I've already sent out a patch which should fix the issue
> (at least it fixes the problem for me).
>  The fix is also attached if you want to try it.

hmm, now i do see that get_block() returns ENOSPC by printk the err.
So did you applied the patch which redirty_page_for_writepage as well
as this one together?

I will start the test with kernel applied both patches and leave it for running.

>
>                                                                        Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-03  1:29                                               ` Ying Han
@ 2009-04-03  9:41                                                 ` Jan Kara
  2009-04-03 21:34                                                   ` Ying Han
  0 siblings, 1 reply; 60+ messages in thread
From: Jan Kara @ 2009-04-03  9:41 UTC (permalink / raw)
  To: Ying Han
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu 02-04-09 18:29:21, Ying Han wrote:
> On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 02-04-09 15:52:19, Ying Han wrote:
> >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote:
> >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
> >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> >> >>> > > Hi Jan:
> >> >>> > >     I feel that the problem you saw is kind of differnt than mine. As
> >> >>> > > you mentioned that you saw the PageError() message, which i don't see
> >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
> >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
> >> >>> > > as you saw, i am not convineced this is the root cause at least for
> >> >>> > > our problem. I am still looking into it.
> >> >>> > >     So, are you seeing the PageError() every time the problem happened?
> >> >>> >
> >> >>> > So I asked if you could test with my workaround of taking truncate_mutex
> >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
> >> >>> > response after that.
> >> >>> >
> >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
> >> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
> >> >>> > I think I narrowed the problem down to block allocation or inode block
> >> >>> > tree corruption because I was unable to reproduce it with that hack in
> >> >>> > place.
> >> >>>   Nick, what load did you use for reproduction? I'll try to reproduce it
> >> >>> here so that I can debug ext2...
> >> >>
> >> >> OK, I set up the filesystem like this:
> >> >>
> >> >> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
> >> >> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
> >> >> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
> >> >>
> >> >> Test is basically unmodified except I use 64MB files, and start 8 of them
> >> >> at once to (8 core system, so improve chances of hitting the bug). Although I
> >> >> do see it with only 1 running it takes longer to trigger.
> >> >>
> >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
> >> >> know if that really helps speed up reproducing it. It is quite random to hit,
> >> >> but I was able to hit it IIRC in under a minute with that setup.
> >> >>
> >> >
> >> > Here is how i reproduce it:
> >> > Filesystem is ext2 with blocksize 4096
> >> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
> >> > pressure which will trigger page reclaim and background writeout)
> >> > Run one thread of the test program
> >> >
> >> > and i will see "bad pages" within few minutes.
> >>
> >> And here is the "top" and stdout while it is getting "bad pages"
> >> top
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>  3487 root      20   0 52616  50m  284 R   95  0.3   3:58.85 usemem
> >>  3810 root      20   0  129m  99m  99m D   41  0.6   0:01.87 ftruncate_mmap
> >>   261 root      15  -5     0    0    0 D    4  0.0   0:31.08 kswapd0
> >>   262 root      15  -5     0    0    0 D    3  0.0   0:10.26 kswapd1
> >>
> >> stdout:
> >>
> >> while true; do
> >>     ./ftruncate_mmap;
> >> done
> >> Running 852 bad page
> >> Running 315 bad page
> >> Running 999 bad page
> >> Running 482 bad page
> >> Running 24 bad page
> >  Thanks, for the help. I've debugged the problem to a bug in
> > ext2_get_block(). I've already sent out a patch which should fix the issue
> > (at least it fixes the problem for me).
> >  The fix is also attached if you want to try it.
> 
> hmm, now i do see that get_block() returns ENOSPC by printk the err.
> So did you applied the patch which redirty_page_for_writepage as well
> as this one together?
  No, my patch contained only a fix in ext2_get_block(). When you see
ENOSPC, that's a completely separate issue. You may apply that patch but
with ext2 it would be enough to make the file fit the ram disk. I.e. first
try with dd how big file fits there and then run your tester with at most
as big file so that you don't hit ENOSPC...

> I will start the test with kernel applied both patches and leave it for running.
  OK.

										Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-03  9:41                                                 ` Jan Kara
@ 2009-04-03 21:34                                                   ` Ying Han
  0 siblings, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-04-03 21:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Nick Piggin, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Fri, Apr 3, 2009 at 2:41 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 02-04-09 18:29:21, Ying Han wrote:
>> On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Thu 02-04-09 15:52:19, Ying Han wrote:
>> >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <yinghan@google.com> wrote:
>> >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> >> >>> > > Hi Jan:
>> >> >>> > >     I feel that the problem you saw is kind of differnt than mine. As
>> >> >>> > > you mentioned that you saw the PageError() message, which i don't see
>> >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> >> >>> > > as you saw, i am not convineced this is the root cause at least for
>> >> >>> > > our problem. I am still looking into it.
>> >> >>> > >     So, are you seeing the PageError() every time the problem happened?
>> >> >>> >
>> >> >>> > So I asked if you could test with my workaround of taking truncate_mutex
>> >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> >> >>> > response after that.
>> >> >>> >
>> >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> >> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> >> >>> > I think I narrowed the problem down to block allocation or inode block
>> >> >>> > tree corruption because I was unable to reproduce it with that hack in
>> >> >>> > place.
>> >> >>>   Nick, what load did you use for reproduction? I'll try to reproduce it
>> >> >>> here so that I can debug ext2...
>> >> >>
>> >> >> OK, I set up the filesystem like this:
>> >> >>
>> >> >> modprobe rd rd_size=$[3*1024*1024]   #almost fill memory so we reclaim buffers
>> >> >> dd if=/dev/zero of=/dev/ram0 bs=4k   #prefill brd so we don't get alloc deadlock
>> >> >> mkfs.ext2 -b1024 /dev/ram0           #1K buffers
>> >> >>
>> >> >> Test is basically unmodified except I use 64MB files, and start 8 of them
>> >> >> at once to (8 core system, so improve chances of hitting the bug). Although I
>> >> >> do see it with only 1 running it takes longer to trigger.
>> >> >>
>> >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> >> >> know if that really helps speed up reproducing it. It is quite random to hit,
>> >> >> but I was able to hit it IIRC in under a minute with that setup.
>> >> >>
>> >> >
>> >> > Here is how i reproduce it:
>> >> > Filesystem is ext2 with blocksize 4096
>> >> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
>> >> > pressure which will trigger page reclaim and background writeout)
>> >> > Run one thread of the test program
>> >> >
>> >> > and i will see "bad pages" within few minutes.
>> >>
>> >> And here is the "top" and stdout while it is getting "bad pages"
>> >> top
>> >>
>> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >>  3487 root      20   0 52616  50m  284 R   95  0.3   3:58.85 usemem
>> >>  3810 root      20   0  129m  99m  99m D   41  0.6   0:01.87 ftruncate_mmap
>> >>   261 root      15  -5     0    0    0 D    4  0.0   0:31.08 kswapd0
>> >>   262 root      15  -5     0    0    0 D    3  0.0   0:10.26 kswapd1
>> >>
>> >> stdout:
>> >>
>> >> while true; do
>> >>     ./ftruncate_mmap;
>> >> done
>> >> Running 852 bad page
>> >> Running 315 bad page
>> >> Running 999 bad page
>> >> Running 482 bad page
>> >> Running 24 bad page
>> >  Thanks, for the help. I've debugged the problem to a bug in
>> > ext2_get_block(). I've already sent out a patch which should fix the issue
>> > (at least it fixes the problem for me).
>> >  The fix is also attached if you want to try it.
>>
>> hmm, now i do see that get_block() returns ENOSPC by printk the err.
>> So did you applied the patch which redirty_page_for_writepage as well
>> as this one together?
>  No, my patch contained only a fix in ext2_get_block(). When you see
> ENOSPC, that's a completely separate issue. You may apply that patch but
> with ext2 it would be enough to make the file fit the ram disk. I.e. first
> try with dd how big file fits there and then run your tester with at most
> as big file so that you don't hit ENOSPC...
>
>> I will start the test with kernel applied both patches and leave it for running.
>  OK.

I applied the patch(based on 2.6.26) in the attachment and the test
itself runs fine so far without reporting "bad pages", however, i
seems get deadlock in the varlog, here is the message and i turned on
lockdep.

kernel: 1 lock held by kswapd1/264:
kernel: #0:  (&ei->truncate_mutex){--..}, at: [<ffffffff8031d529>]
ext2_get_block+0x109/0x960
kernel: INFO: task ftruncate_mmap:2950 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
kernel: ftruncate_mma D ffff81047e733a80     0  2950   2858
kernel: ffff8101798516f8 0000000000000092 0000000000000000 0000000000000046
kernel: ffff81047e0a1260 ffff81047f070000 ffff81047e0a15c0 0000000100130c66
kernel: 00000000ffffffff ffffffff8025740d 0000000000000000 0000000000000000
kernel: Call Trace:
kernel: [<ffffffff8025740d>] mark_held_locks+0x3d/0x80
kernel: [<ffffffff804d78bd>] mutex_lock_nested+0x14d/0x280
kernel: [<ffffffff804d7855>] mutex_lock_nested+0xe5/0x280
kernel: [<ffffffff8031d529>] ext2_get_block+0x109/0x960
kernel: [<ffffffff802ca2e3>] create_empty_buffers+0x43/0xb0
kernel: [<ffffffff802ca2e3>] create_empty_buffers+0x43/0xb0
kernel: [<ffffffff802ca217>] alloc_page_buffers+0x97/0x120
kernel: [<ffffffff802cbfb6>] __block_write_full_page+0x206/0x320
kernel: [<ffffffff802cbe70>] __block_write_full_page+0xc0/0x320
kernel: [<ffffffff8031d420>] ext2_get_block+0x0/0x960
kernel: [<ffffffff8027c74e>] shrink_page_list+0x4fe/0x650
kernel: [<ffffffff80257ee8>] __lock_acquire+0x3b8/0x1080
kernel: [<ffffffff8027be18>] isolate_lru_pages+0x88/0x230
kernel: [<ffffffff8027c9ea>] shrink_inactive_list+0x14a/0x3f0
kernel: [<ffffffff8027cd43>] shrink_zone+0xb3/0x130
kernel: [<ffffffff80249e90>] autoremove_wake_function+0x0/0x30
kernel: [<ffffffff8027d1a8>] try_to_free_pages+0x268/0x3e0
kernel: [<ffffffff8027bfc0>] isolate_pages_global+0x0/0x40
kernel: [<ffffffff802774f7>] __alloc_pages_internal+0x1d7/0x4a0
kernel: [<ffffffff80279b94>] __do_page_cache_readahead+0x124/0x270
kernel: [<ffffffff8027314f>] filemap_fault+0x18f/0x400
kernel: [<ffffffff80280925>] __do_fault+0x65/0x450
kernel: [<ffffffff80257ee8>] __lock_acquire+0x3b8/0x1080
kernel: [<ffffffff803475dd>] __down_read_trylock+0x1d/0x60
kernel: [<ffffffff8028389a>] handle_mm_fault+0x18a/0x7a0
kernel: [<ffffffff804dba1c>] do_page_fault+0x29c/0x930
kernel: [<ffffffff804d8b46>] trace_hardirqs_on_thunk+0x35/0x3a
kernel: [<ffffffff804d94dd>] error_exit+0x0/0xa9
kernel:
kernel: 2 locks held by ftruncate_mmap/2950:
kernel: #0:  (&mm->mmap_sem){----}, at: [<ffffffff804db9af>]
do_page_fault+0x22f/0x930
kernel: #1:  (&ei->truncate_mutex){--..}, at: [<ffffffff8031d529>]
ext2_get_block+0x109/0x960

--Ying

>
>                                                                                Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-04-02 11:24                                   ` Nick Piggin
  2009-04-02 11:34                                     ` Jan Kara
@ 2009-04-03  0:13                                     ` Ying Han
  1 sibling, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-04-03  0:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Thu, Apr 2, 2009 at 4:24 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> Hi Jan:
>>     I feel that the problem you saw is kind of differnt than mine. As
>> you mentioned that you saw the PageError() message, which i don't see
>> it on my system. I tried you patch(based on 2.6.21) on my system and
>> it runs ok for 2 days, Still, since i don't see the same error message
>> as you saw, i am not convineced this is the root cause at least for
>> our problem. I am still looking into it.
>>     So, are you seeing the PageError() every time the problem happened?
>
> So I asked if you could test with my workaround of taking truncate_mutex
> at the start of ext2_get_blocks, and report back. I never heard of any
> response after that.

I applied the change and still get the same issue, unless i didn't do
the right thing, here
is the patch i applied, which put the truncate_mutex at the beginning
of ext2_get_blocks.

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 384fc0d..94cf773 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -586,10 +586,13 @@ static int ext2_get_blocks(struct inode *inode,
 	int count = 0;
 	ext2_fsblk_t first_block = 0;

+	mutex_lock(&ei->truncate_mutex);
 	depth = ext2_block_to_path(inode,iblock,offsets,&blocks_to_boundary);

-	if (depth == 0)
+	if (depth == 0) {
+		mutex_unlock(&ei->truncate_mutex);
 		return (err);
+	}
 reread:
 	partial = ext2_get_branch(inode, depth, offsets, chain, &err);

@@ -625,7 +628,7 @@ reread:
 	if (!create || err == -EIO)
 		goto cleanup;

-	mutex_lock(&ei->truncate_mutex);

 	/*
 	 * Okay, we need to do block allocation.  Lazily initialize the block
@@ -651,7 +654,7 @@ reread:
 				offsets + (partial - chain), partial);

 	if (err) {
-		mutex_unlock(&ei->truncate_mutex);
 		goto cleanup;
 	}

@@ -662,13 +665,13 @@ reread:
 		err = ext2_clear_xip_target (inode,
 			le32_to_cpu(chain[depth-1].key));
 		if (err) {
-			mutex_unlock(&ei->truncate_mutex);
 			goto cleanup;
 		}
 	}

 	ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
-	mutex_unlock(&ei->truncate_mutex);
 	set_buffer_new(bh_result);
 got_it:
 	map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
@@ -678,6 +681,7 @@ got_it:
 	/* Clean up and exit */
 	partial = chain + depth - 1;	/* the whole chain */
 cleanup:
+	mutex_unlock(&ei->truncate_mutex);
 	while (partial > chain) {
 		brelse(partial->bh);
 		partial--;

--Ying

>
> To reiterate: I was able to reproduce a problem with ext2 (I was testing
> on brd to get IO rates high enough to reproduce it quite frequently).
> I think I narrowed the problem down to block allocation or inode block
> tree corruption because I was unable to reproduce it with that hack in
> place.
>
>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-24  7:44               ` Nick Piggin
                                   ` (2 preceding siblings ...)
  2009-03-24 12:39                 ` Jan Kara
@ 2009-03-27 20:35                 ` Ying Han
  3 siblings, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-03-27 20:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jan Kara, Martin J. Bligh, linux-ext4, Linus Torvalds,
	Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Peter Zijlstra

On Tue, Mar 24, 2009 at 12:44 AM, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
>> On Fri 20-03-09 02:48:21, Nick Piggin wrote:
>
>> > Holding mapping->private_lock over the __set_page_dirty should
>> > fix it, although I guess you'd want to release it before calling
>> > __mark_inode_dirty so as not to put inode_lock under there. I
>> > have a patch for this if it sounds reasonable.
>>
>>   Yes, that seems to be a bug - the function actually looked suspitious to
>> me yesterday but I somehow convinced myself that it's fine. Probably
>> because fsx-linux is single-threaded.
>
>
> After a whole lot of chasing my own tail in the VM and buffer layers,
> I think it is a problem in ext2 (and I haven't been able to reproduce
> with ext3 yet, which might lend weight to that, although as we have
> seen, it is very timing dependent).
>
> That would be slightly unfortunate because we still have Jan's ext3
> problem, and also another reported problem of corruption on ext3 (on
> brd driver).
I believe i see the same issue on ext2 as well as ext4.
>
> Anyway, when I have reproduced the problem with the test case, the
> "lost" writes are all reported to be holes. Unfortunately, that doesn't
> point straight to the filesystem, because ext2 allocates blocks in this
> case at writeout time, so if dirty bits are getting lost, then it would
> be normal to see holes.
>
> I then put in a whole lot of extra infrastructure to track metadata about
> each struct page (when it was last written out, when it last had the number
> of writable ptes reach 0, when the dirty bits were last cleared etc). And
> none of the normal asertions were triggering: eg. when any page is removed
> from pagecache (except truncates), it has always had all its buffers
> written out *after* all ptes were made readonly or unmapped. Lots of other
> tests and crap like that.

Do you think there might be a race in the page reclaim path? I did a
hack which commeted out
wakeup_pdflush in try_to_free_pages ( based on 2.6.21, just randomly
picked on has the problem)
It runs for couple of hours and the problem not happened yet. I am not
sure if that is the problem or not,
and i will leave it running.
The reason i tried the hack since i reproduce the "bad pages" easily
everytime i put more memory pressure
on the system.


diff --git a/mm/vmscan.c b/mm/vmscan.c
index db023e2..b4b7e1f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1067,11 +1067,13 @@ unsigned long try_to_free_pages(struct zone **zones, g
 		 * that's undesirable in laptop mode, where we *want* lumpy
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
+/*
 		if (total_scanned > sc.swap_cluster_max +
 					sc.swap_cluster_max / 2) {
 			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
 			sc.may_writepage = 1;
 		}
+*/

 		/* Take a nap, wait for some writeback to complete */
 		if (sc.nr_scanned && priority < DEF_PRIORITY - 2)

>
> So I tried what I should have done to start with and did an e2fsck after
> seeing corruption. Yes, it comes up with errors. Now that is unusual
> because that should be largely insulated from the vm: if a dirty bit gets
> lost, then the filesystem image should be quite happy and error-free with
> a hole or unwritten data there.
>
> I don't know ext? locking very well, except that it looks pretty overly
> complex and crufty.
>
> Usually, blocks are instantiated by write(2), under i_mutex, serialising
> the allocator somewhat. mmap-write blocks are instantiated at writeout
> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> function, and can no longer trigger the problem. Might be a timing issue
> though -- Ying, can you try this and see if you can still reproduce?
>
> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this cased the problem, can someone please tell me
> why it got merged in that state?
>
> I'm leaving ext3 running for now. It looks like a nasty task to bisect
> ext2 down to that commit :( and I would be more interested in trying to
> reproduce Jan's ext3 problem, however, because I'm not too interested in
> diving into ext2 locking to work out exactly what is racing and how to
> fix it properly. I suspect it would be most productive to wire up some
> ioctls right into the block allocator/lookup and code up a userspace
> tester for it that could probably stress it a lot harder than kernel
> writeout can.
>
>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 22:40   ` Linus Torvalds
  2009-03-18 23:18     ` Ying Han
@ 2009-03-20  0:34     ` Ying Han
  2009-03-20  0:49       ` Linus Torvalds
  2009-03-25 23:15     ` Ying Han
  2 siblings, 1 reply; 60+ messages in thread
From: Ying Han @ 2009-03-20  0:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra

On Wed, Mar 18, 2009 at 3:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 18 Mar 2009, Andrew Morton wrote:
>
>> On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote:
>> >
>> > The "bad pages" count differs each time from one digit to 4,5 digit
>> > for 128M ftruncated file. and what i also found that the bad page
>> > number are contiguous for each segment which total bad pages container
>> > several segments. ext "1-4, 9-20, 48-50" (  batch flushing ? )
>
> Yeah, probably the batched write-out.
>
> Can you say what filesystem, and what mount-flags you use? Iirc, last time
> we had MAP_SHARED lost writes it was at least partly triggered by the
> filesystem doing its own flushing independently of the VM (ie ext3 with
> "data=journal", I think), so that kind of thing does tend to matter.
>
> See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d.
>
>> > (The failure is reproduced based on 2.6.29-rc8, also happened on
>> > 2.6.18 kernel. . Here is the simple test case to reproduce it with
>> > memory pressure. )
>>
>> Thanks.  This will be a regression - the testing I did back in the days
>> when I actually wrote stuff would have picked this up.
>>
>> Perhaps it is a 2.6.17 thing.  Which, IIRC, is when we made the changes to
>> redirty pages on each write fault.  Or maybe it was something else.
>
> Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking
> about Peter's exact dirty page tracking. If I recall correctly, that
> became then 2.6.19, and then had the horrible mm dirty bit loss that
> triggered in librtorrent downloads, which got fixed sometime after 2.6.20
> (and back-ported).
>
> So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not
> related to the exact dirty tracking.
>
> The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm:
> tracking shared dirty pages"), but maybe you had something else in mind?
>
>> Given the amount of time for which this bug has existed, I guess it isn't a
>> 2.6.29 blocker, but once we've found out the cause we should have a little
>> post-mortem to work out how a bug of this nature has gone undetected for so
>> long.
>
> I'm somewhat surprised, because this test-program looks like a very simple
> version of the exact one that I used to track down the 2.6.20 mmap
> corruption problems. And that one got pretty heavily tested back then,
> when people were looking at it (December 2006) and then when trying out my
> fix for it.
>
> Ying Han - since you're all set up for testing this and have reproduced it
> on multiple kernels, can you try it on a few more kernel versions? It
> would be interesting to both go further back in time (say 2.6.15-ish),
> _and_ check something like 2.6.21 which had the exact dirty accounting
> fix. Maybe it's not really an old bug - maybe we re-introduced a bug that
> was fixed for a while.

I tried 2.6.24 for couple of hours and the problem not happening yet. While
the same test on 2.6.25, the problem happen right away.

>
>                                Linus
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-20  0:34     ` Ying Han
@ 2009-03-20  0:49       ` Linus Torvalds
  2009-03-20  7:00         ` Ying Han
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2009-03-20  0:49 UTC (permalink / raw)
  To: Ying Han
  Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra

On Thu, 19 Mar 2009, Ying Han wrote:
> >
> > Ying Han - since you're all set up for testing this and have reproduced it
> > on multiple kernels, can you try it on a few more kernel versions? It
> > would be interesting to both go further back in time (say 2.6.15-ish),
> > _and_ check something like 2.6.21 which had the exact dirty accounting
> > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that
> > was fixed for a while.
> 
> I tried 2.6.24 for couple of hours and the problem not happening yet. While
> the same test on 2.6.25, the problem happen right away.

Ok, so 2.6.25 is known bad. Can you test 2.6.24 a lot more, because we 
should not decide that it's bug-free without a _lot_ of testing.

But if it's a bug that has gone away and then re-appeared, it at least 
explains how 2.6.21 (which got a fair amount of mmap testing) didn't have 
lots of reports of mmap corruption.

That said, I can think of nothing obvious in between 2.6.24 and .25 that 
would have re-introduced it. But if some heavy testing really does confirm 
that 2.6.24 doesn't have the problem, that is a good first step to trying 
to narrow down where things started going wrong.

That said, it could _easily_ be some timing-related pattern. One of the 
things in between 2.6.24 and .25 is

 - 8bc3be2751b4f74ab90a446da1912fd8204d53f7: "writeback: speed up 
   writeback of big dirty files"

which is that exact kind of "change the timing patterns, but don't change 
anything fundamental" thing.

Which is why I'd like you to continue testing 2.6.24 just to be _really_ 
sure that it really doesn't happen there.

		Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-20  0:49       ` Linus Torvalds
@ 2009-03-20  7:00         ` Ying Han
  0 siblings, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-03-20  7:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra

On Thu, Mar 19, 2009 at 5:49 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Thu, 19 Mar 2009, Ying Han wrote:
>> >
>> > Ying Han - since you're all set up for testing this and have reproduced it
>> > on multiple kernels, can you try it on a few more kernel versions? It
>> > would be interesting to both go further back in time (say 2.6.15-ish),
>> > _and_ check something like 2.6.21 which had the exact dirty accounting
>> > fix. Maybe it's not really an old bug - maybe we re-introduced a bug that
>> > was fixed for a while.
>>
>> I tried 2.6.24 for couple of hours and the problem not happening yet. While
>> the same test on 2.6.25, the problem happen right away.
>
> Ok, so 2.6.25 is known bad. Can you test 2.6.24 a lot more, because we
> should not decide that it's bug-free without a _lot_ of testing.
>
> But if it's a bug that has gone away and then re-appeared, it at least
> explains how 2.6.21 (which got a fair amount of mmap testing) didn't have
> lots of reports of mmap corruption.
>
> That said, I can think of nothing obvious in between 2.6.24 and .25 that
> would have re-introduced it. But if some heavy testing really does confirm
> that 2.6.24 doesn't have the problem, that is a good first step to trying
> to narrow down where things started going wrong.
>
> That said, it could _easily_ be some timing-related pattern. One of the
> things in between 2.6.24 and .25 is
>
>  - 8bc3be2751b4f74ab90a446da1912fd8204d53f7: "writeback: speed up
>   writeback of big dirty files"
>
> which is that exact kind of "change the timing patterns, but don't change
> anything fundamental" thing.
>
> Which is why I'd like you to continue testing 2.6.24 just to be _really_
> sure that it really doesn't happen there.

Unfortunately, 2.6.24 is not immune. After running several hours, i triggered
the problem.
>
>                Linus
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: ftruncate-mmap: pages are lost after writing to mmaped file.
  2009-03-18 22:40   ` Linus Torvalds
  2009-03-18 23:18     ` Ying Han
  2009-03-20  0:34     ` Ying Han
@ 2009-03-25 23:15     ` Ying Han
  2 siblings, 0 replies; 60+ messages in thread
From: Ying Han @ 2009-03-25 23:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, linux-mm, guichaz, Alex Khesin,
	Mike Waychison, Rohit Seth, Nick Piggin, Peter Zijlstra

On Wed, Mar 18, 2009 at 3:40 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> On Wed, 18 Mar 2009, Andrew Morton wrote:
>
>> On Wed, 18 Mar 2009 12:44:08 -0700 Ying Han <yinghan@google.com> wrote:
>> >
>> > The "bad pages" count differs each time from one digit to 4,5 digit
>> > for 128M ftruncated file. and what i also found that the bad page
>> > number are contiguous for each segment which total bad pages container
>> > several segments. ext "1-4, 9-20, 48-50" (  batch flushing ? )
>
> Yeah, probably the batched write-out.
>
> Can you say what filesystem, and what mount-flags you use? Iirc, last time
> we had MAP_SHARED lost writes it was at least partly triggered by the
> filesystem doing its own flushing independently of the VM (ie ext3 with
> "data=journal", I think), so that kind of thing does tend to matter.
>
> See for example commit ecdfc9787fe527491baefc22dce8b2dbd5b2908d.
>
>> > (The failure is reproduced based on 2.6.29-rc8, also happened on
>> > 2.6.18 kernel. . Here is the simple test case to reproduce it with
>> > memory pressure. )
>>
>> Thanks.  This will be a regression - the testing I did back in the days
>> when I actually wrote stuff would have picked this up.
>>
>> Perhaps it is a 2.6.17 thing.  Which, IIRC, is when we made the changes to
>> redirty pages on each write fault.  Or maybe it was something else.
>
> Hmm. I _think_ that changes went in _after_ 2.6.18, if you're talking
> about Peter's exact dirty page tracking. If I recall correctly, that
> became then 2.6.19, and then had the horrible mm dirty bit loss that
> triggered in librtorrent downloads, which got fixed sometime after 2.6.20
> (and back-ported).
>
> So if 2.6.18 shows the same problem, then it's a _really_ old bug, and not
> related to the exact dirty tracking.
>
> The exact dirty accounting patch I'm talking about is d08b3851da41 ("mm:
> tracking shared dirty pages"), but maybe you had something else in mind?
>
>> Given the amount of time for which this bug has existed, I guess it isn't a
>> 2.6.29 blocker, but once we've found out the cause we should have a little
>> post-mortem to work out how a bug of this nature has gone undetected for so
>> long.
>
> I'm somewhat surprised, because this test-program looks like a very simple
> version of the exact one that I used to track down the 2.6.20 mmap
> corruption problems. And that one got pretty heavily tested back then,
> when people were looking at it (December 2006) and then when trying out my
> fix for it.
>
> Ying Han - since you're all set up for testing this and have reproduced it
> on multiple kernels, can you try it on a few more kernel versions? It
> would be interesting to both go further back in time (say 2.6.15-ish),
> _and_ check something like 2.6.21 which had the exact dirty accounting
> fix. Maybe it's not really an old bug - maybe we re-introduced a bug that
> was fixed for a while.

Just answer your question, i got chance try 2.6.15 and 2.6.21 and they
both report the "bad pages"
failure. I am using the same system as well as the config
>                                Linus
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2009-04-03 21:35 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-18 19:44 ftruncate-mmap: pages are lost after writing to mmaped file Ying Han
2009-03-18 22:11 ` Andrew Morton
2009-03-18 22:40   ` Linus Torvalds
2009-03-18 23:18     ` Ying Han
2009-03-18 23:36       ` Linus Torvalds
2009-03-18 23:54         ` Ying Han
2009-03-19 15:48           ` Nick Piggin
2009-03-19 16:16             ` Peter Zijlstra
2009-03-19 16:36               ` Nick Piggin
2009-03-19 16:20             ` Linus Torvalds
2009-03-19 16:34               ` Nick Piggin
2009-03-19 16:51                 ` Linus Torvalds
2009-03-19 17:03                   ` Jan Kara
2009-03-19 17:06                     ` Jan Kara
2009-03-19 20:05                     ` Linus Torvalds
2009-03-19 20:21                   ` Linus Torvalds
2009-03-19 21:17                     ` Ying Han
2009-03-19 22:16                     ` Jan Kara
2009-03-19 16:46             ` Jan Kara
2009-03-24  7:44               ` Nick Piggin
2009-03-24 10:27                 ` Nick Piggin
2009-03-24 10:32                 ` Andrew Morton
2009-03-24 15:35                   ` Nick Piggin
2009-03-26 18:29                     ` Jan Kara
2009-03-26  0:03                   ` Ying Han
2009-03-24 12:39                 ` Jan Kara
2009-03-24 12:55                   ` Jan Kara
2009-03-24 13:26                     ` Jan Kara
2009-03-24 14:01                       ` Chris Mason
2009-03-24 14:07                         ` Jan Kara
2009-03-26  8:18                           ` Aneesh Kumar K.V
2009-03-24 14:30                       ` Nick Piggin
2009-03-24 14:47                         ` Jan Kara
2009-03-24 14:56                           ` Peter Zijlstra
2009-03-24 15:29                             ` Jan Kara
2009-03-24 20:14                               ` OGAWA Hirofumi
2009-03-26  8:47                               ` Aneesh Kumar K.V
2009-03-26 11:37                                 ` Jan Kara
2009-03-26 23:02                                 ` Linus Torvalds
2009-03-24 15:03                           ` Nick Piggin
2009-03-24 15:48                             ` Jan Kara
2009-03-24 17:35                               ` Jan Kara
2009-04-01 22:36                                 ` Ying Han
2009-04-02 10:11                                   ` Jan Kara
2009-04-02 11:24                                   ` Nick Piggin
2009-04-02 11:34                                     ` Jan Kara
2009-04-02 15:51                                       ` Nick Piggin
2009-04-02 17:44                                         ` Ying Han
2009-04-02 22:52                                           ` Ying Han
2009-04-02 23:39                                             ` Jan Kara
2009-04-03  0:25                                               ` Ying Han
2009-04-03  1:29                                               ` Ying Han
2009-04-03  9:41                                                 ` Jan Kara
2009-04-03 21:34                                                   ` Ying Han
2009-04-03  0:13                                     ` Ying Han
2009-03-27 20:35                 ` Ying Han
2009-03-20  0:34     ` Ying Han
2009-03-20  0:49       ` Linus Torvalds
2009-03-20  7:00         ` Ying Han
2009-03-25 23:15     ` Ying Han

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox