[CFT] read+shared mmap write+read data corruption

All of lore.kernel.org
 help / color / mirror / Atom feed

* [CFT] read+shared mmap write+read data corruption
@ 2007-05-27 10:49 Russell King
  2007-05-27 14:16 ` James Bottomley
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Russell King @ 2007-05-27 10:49 UTC (permalink / raw)
  To: linux-arch; +Cc: Lennert Buytenhek, Andrew Morton

Lennert Buytenhek recently reported on the linux-arm-kernel list that
fsx-linux fails on his Xscale ARM platforms.  Originally, a problem was
noticed with DB4 databases being corrupted.

After a little bit of digging, I found what appears to be a hole in
the cache alias handling for the page cache; I'm not certain what the
solution is at present [*].  However, I don't think this is an ARM
specific issue.

So, if you have a cache which needs a flush_dcache_page() implementation,
can you please grab:

  http://www.wantstofly.org/~buytenh/cache-issue.c

Build this, and then run:

$ ./cache-issue | od -t x1 -Ax

If the output is a load of zero bytes instead of "0123456789abcdef" then
the read after a shared mmap() write returned stale data.

Note: this program is much more likely to show the issue than fsx-linux -
on my platforms, fsx-linux doesn't show any problems after several minutes
of running, whereas Lennert's program shows a problem immediately.

Also note that our msync() syscall is a no-op; my original test was with
a version of the above with msync(MS_INVALIDATE) in.

* - the hole is:

void do_generic_mapping_read(struct address_space *mapping,
                             struct file_ra_state *_ra,
                             struct file *filp,
                             loff_t *ppos,
                             read_descriptor_t *desc,
                             read_actor_t actor)
{
...
                /* If users can be writing to this page using arbitrary
                 * virtual addresses, take care about potential aliasing
                 * before reading the page on the kernel side.
                 */
                if (mapping_writably_mapped(mapping))
                        flush_dcache_page(page);

is false for both read calls, since at the time this test is done, there
are no shared writable mappings.  However, that's not to say that there
haven't _been_ shared writable mappings.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King
@ 2007-05-27 14:16 ` James Bottomley
  2007-05-27 22:25   ` Lennert Buytenhek
  2007-05-27 23:05   ` David Miller
  2007-05-27 22:24 ` Lennert Buytenhek
  2007-05-28 12:38 ` Lennert Buytenhek
  2 siblings, 2 replies; 25+ messages in thread
From: James Bottomley @ 2007-05-27 14:16 UTC (permalink / raw)
  To: Russell King; +Cc: linux-arch, Lennert Buytenhek, Andrew Morton

On Sun, 2007-05-27 at 11:49 +0100, Russell King wrote:
> Lennert Buytenhek recently reported on the linux-arm-kernel list that
> fsx-linux fails on his Xscale ARM platforms.  Originally, a problem was
> noticed with DB4 databases being corrupted.
> 
> After a little bit of digging, I found what appears to be a hole in
> the cache alias handling for the page cache; I'm not certain what the
> solution is at present [*].  However, I don't think this is an ARM
> specific issue.

Yes, I confirm seeing the same results on parisc (VIPT large caches).

> So, if you have a cache which needs a flush_dcache_page() implementation,
> can you please grab:
> 
>   http://www.wantstofly.org/~buytenh/cache-issue.c
> 
> Build this, and then run:
> 
> $ ./cache-issue | od -t x1 -Ax
> 
> If the output is a load of zero bytes instead of "0123456789abcdef" then
> the read after a shared mmap() write returned stale data.
> 
> Note: this program is much more likely to show the issue than fsx-linux -
> on my platforms, fsx-linux doesn't show any problems after several minutes
> of running, whereas Lennert's program shows a problem immediately.
> 
> Also note that our msync() syscall is a no-op; my original test was with
> a version of the above with msync(MS_INVALIDATE) in.

Regardless of how we implement it, the way I read POSIX, it does require
at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before accessing the data
in another way.

> * - the hole is:
> 
> void do_generic_mapping_read(struct address_space *mapping,
>                              struct file_ra_state *_ra,
>                              struct file *filp,
>                              loff_t *ppos,
>                              read_descriptor_t *desc,
>                              read_actor_t actor)
> {
> ...
>                 /* If users can be writing to this page using arbitrary
>                  * virtual addresses, take care about potential aliasing
>                  * before reading the page on the kernel side.
>                  */
>                 if (mapping_writably_mapped(mapping))
>                         flush_dcache_page(page);
> 
> is false for both read calls, since at the time this test is done, there
> are no shared writable mappings.  However, that's not to say that there
> haven't _been_ shared writable mappings.

The operation has to be done when we know the state of the object
(unless we want to persist the dirty state in a lazy fashion?  I think I
don't really want to go there) so that's either in msync or munmap.  I
think I favour the former.

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 14:16 ` James Bottomley
@ 2007-05-27 22:25   ` Lennert Buytenhek
  2007-05-27 23:06     ` David Miller
  2007-05-27 23:05   ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-27 22:25 UTC (permalink / raw)
  To: James Bottomley; +Cc: Russell King, linux-arch, Andrew Morton

On Sun, May 27, 2007 at 09:16:24AM -0500, James Bottomley wrote:

> Regardless of how we implement it, the way I read POSIX, it does
> require at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before
> accessing the data in another way.

Right.  At least the version of fsx-linux in akpm's ext3-tools
tarball doesn't do that.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 22:25   ` Lennert Buytenhek
@ 2007-05-27 23:06     ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2007-05-27 23:06 UTC (permalink / raw)
  To: buytenh; +Cc: James.Bottomley, rmk, linux-arch, akpm

From: Lennert Buytenhek <buytenh@wantstofly.org>
Date: Mon, 28 May 2007 00:25:55 +0200

> On Sun, May 27, 2007 at 09:16:24AM -0500, James Bottomley wrote:
> 
> > Regardless of how we implement it, the way I read POSIX, it does
> > require at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before
> > accessing the data in another way.
> 
> Right.  At least the version of fsx-linux in akpm's ext3-tools
> tarball doesn't do that.

I don't think we can seriously consider this requirement.

We can make the data properly coherent on D-cache aliasing cpus, so we
should.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 14:16 ` James Bottomley
  2007-05-27 22:25   ` Lennert Buytenhek
@ 2007-05-27 23:05   ` David Miller
  2007-05-28  0:31     ` James Bottomley
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2007-05-27 23:05 UTC (permalink / raw)
  To: James.Bottomley; +Cc: rmk, linux-arch, buytenh, akpm

From: James Bottomley <James.Bottomley@SteelEye.com>
Date: Sun, 27 May 2007 09:16:24 -0500

> On Sun, 2007-05-27 at 11:49 +0100, Russell King wrote:
> > So, if you have a cache which needs a flush_dcache_page() implementation,
> > can you please grab:
> > 
> >   http://www.wantstofly.org/~buytenh/cache-issue.c
> > 
> > Build this, and then run:
> > 
> > $ ./cache-issue | od -t x1 -Ax
> > 
> > If the output is a load of zero bytes instead of "0123456789abcdef" then
> > the read after a shared mmap() write returned stale data.
> > 
> > Note: this program is much more likely to show the issue than fsx-linux -
> > on my platforms, fsx-linux doesn't show any problems after several minutes
> > of running, whereas Lennert's program shows a problem immediately.
> > 
> > Also note that our msync() syscall is a no-op; my original test was with
> > a version of the above with msync(MS_INVALIDATE) in.
> 
> Regardless of how we implement it, the way I read POSIX, it does require
> at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before accessing the data
> in another way.

I used to have code in flush_cache_range() on sparc64 that would
have handled this case.

I think whatever in the world POSIX says, quality of implementation
says we should not be returning zeros in this test program, yet we
are. :-)

If we accept the msync() requirement (I don't think we should) then
this is the kind of obscure thing that causes application programmers
to scratch their heads for days if not weeks, and it will only anger
and frustrate them further when we tell them they have to stick an
msync() in there to see non-corrupt data.

Perhaps one way to catch this is to make the filemap read-page code
unconditionally call a new interface "check_dcache_page()" or similar,
instead of the guarded flush_dcache_page() call there now.

Then on unmap's of SHARED+WRITABLE mappings of files, we mark
or flush dirty pages as is appropriate for however the platform
handles D-cache aliasing.

On sparc64 for example, I'd set PG_arch_1 and record the cpu in
page->flags if PG_arch_1 wasn't already set, else I'd flush.

Then I'd have check_dcache_page() do something like:

	if (PG_arch_1 || shared_writable_mappings_exist)
		flush();

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 23:05   ` David Miller
@ 2007-05-28  0:31     ` James Bottomley
  2007-05-28 12:44       ` Lennert Buytenhek
  0 siblings, 1 reply; 25+ messages in thread
From: James Bottomley @ 2007-05-28  0:31 UTC (permalink / raw)
  To: David Miller; +Cc: rmk, linux-arch, buytenh, akpm

On Sun, 2007-05-27 at 16:05 -0700, David Miller wrote:
> From: James Bottomley <James.Bottomley@SteelEye.com>
> Date: Sun, 27 May 2007 09:16:24 -0500
> 
> > On Sun, 2007-05-27 at 11:49 +0100, Russell King wrote:
> > > So, if you have a cache which needs a flush_dcache_page() implementation,
> > > can you please grab:
> > > 
> > >   http://www.wantstofly.org/~buytenh/cache-issue.c
> > > 
> > > Build this, and then run:
> > > 
> > > $ ./cache-issue | od -t x1 -Ax
> > > 
> > > If the output is a load of zero bytes instead of "0123456789abcdef" then
> > > the read after a shared mmap() write returned stale data.
> > > 
> > > Note: this program is much more likely to show the issue than fsx-linux -
> > > on my platforms, fsx-linux doesn't show any problems after several minutes
> > > of running, whereas Lennert's program shows a problem immediately.
> > > 
> > > Also note that our msync() syscall is a no-op; my original test was with
> > > a version of the above with msync(MS_INVALIDATE) in.
> > 
> > Regardless of how we implement it, the way I read POSIX, it does require
> > at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before accessing the data
> > in another way.
> 
> I used to have code in flush_cache_range() on sparc64 that would
> have handled this case.
> 
> I think whatever in the world POSIX says, quality of implementation
> says we should not be returning zeros in this test program, yet we
> are. :-)

I wasn't advocating ... merely point out that POSIX gives us a get out
of jail free card.

> If we accept the msync() requirement (I don't think we should) then
> this is the kind of obscure thing that causes application programmers
> to scratch their heads for days if not weeks, and it will only anger
> and frustrate them further when we tell them they have to stick an
> msync() in there to see non-corrupt data.
> 
> Perhaps one way to catch this is to make the filemap read-page code
> unconditionally call a new interface "check_dcache_page()" or similar,
> instead of the guarded flush_dcache_page() call there now.
> 
> Then on unmap's of SHARED+WRITABLE mappings of files, we mark
> or flush dirty pages as is appropriate for however the platform
> handles D-cache aliasing.
> 
> On sparc64 for example, I'd set PG_arch_1 and record the cpu in
> page->flags if PG_arch_1 wasn't already set, else I'd flush.
> 
> Then I'd have check_dcache_page() do something like:
> 
> 	if (PG_arch_1 || shared_writable_mappings_exist)
> 		flush();

Actually, I'm not convinced we have a problem ... if I put an lseek back
to zero in the original program, it works on parisc as well.  On munmap
we do a flush_cache_range for the unmapped vmas ... we have to,
otherwise the data might be lost as the user mapping is zapped and we
wouldn't be able to guarantee coherency writing it to the backing store.

James





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28  0:31     ` James Bottomley
@ 2007-05-28 12:44       ` Lennert Buytenhek
  2007-05-29  5:35         ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-28 12:44 UTC (permalink / raw)
  To: James Bottomley; +Cc: David Miller, rmk, linux-arch, akpm

On Sun, May 27, 2007 at 07:31:27PM -0500, James Bottomley wrote:

> On munmap we do a flush_cache_range for the unmapped vmas ... we
> have to, otherwise the data might be lost as the user mapping is
> zapped and we wouldn't be able to guarantee coherency writing it
> to the backing store.

As far as I understand it, the munmap() does flush out the copy of
the data at the user virtual address, but the subsequent read() call
reads from an address in the kernel direct mapped window, for which
there is still data in the cache due to the earlier read() syscall,
and the mapping_writably_mapped() test fails so we don't end up
calling flush_dcache_page().

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 12:44       ` Lennert Buytenhek
@ 2007-05-29  5:35         ` David Miller
  2007-05-29  9:12           ` Russell King
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2007-05-29  5:35 UTC (permalink / raw)
  To: buytenh; +Cc: James.Bottomley, rmk, linux-arch, akpm

From: Lennert Buytenhek <buytenh@wantstofly.org>
Date: Mon, 28 May 2007 14:44:11 +0200

> On Sun, May 27, 2007 at 07:31:27PM -0500, James Bottomley wrote:
> 
> > On munmap we do a flush_cache_range for the unmapped vmas ... we
> > have to, otherwise the data might be lost as the user mapping is
> > zapped and we wouldn't be able to guarantee coherency writing it
> > to the backing store.
> 
> As far as I understand it, the munmap() does flush out the copy of
> the data at the user virtual address, but the subsequent read() call
> reads from an address in the kernel direct mapped window, for which
> there is still data in the cache due to the earlier read() syscall,
> and the mapping_writably_mapped() test fails so we don't end up
> calling flush_dcache_page().

That is what is happening for sure.

That mapping_writably_mapped() check depends upon munmap()
flush out the lines from the cache on the user side at
least enough to make them coherent on the kernel side.

As I said my flush_cache_range() on sparc64 used to do this,
but I removed it for whatever reason, perhaps I did not
consider this case back then.

I'm not advocating a full flush on flush_cache_range(), but rather to
set a page state bit, which will force a flush on the
"check_dcache_page()" call which we could replace this conditionalized
flush_dcache_page() call with.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-29  5:35         ` David Miller
@ 2007-05-29  9:12           ` Russell King
  2007-05-29 10:26             ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Russell King @ 2007-05-29  9:12 UTC (permalink / raw)
  To: David Miller; +Cc: buytenh, James.Bottomley, linux-arch, akpm

On Mon, May 28, 2007 at 10:35:50PM -0700, David Miller wrote:
> From: Lennert Buytenhek <buytenh@wantstofly.org>
> Date: Mon, 28 May 2007 14:44:11 +0200
> > As far as I understand it, the munmap() does flush out the copy of
> > the data at the user virtual address, but the subsequent read() call
> > reads from an address in the kernel direct mapped window, for which
> > there is still data in the cache due to the earlier read() syscall,
> > and the mapping_writably_mapped() test fails so we don't end up
> > calling flush_dcache_page().
> 
> That is what is happening for sure.
> 
> That mapping_writably_mapped() check depends upon munmap()
> flush out the lines from the cache on the user side at
> least enough to make them coherent on the kernel side.
> 
> As I said my flush_cache_range() on sparc64 used to do this,
> but I removed it for whatever reason, perhaps I did not
> consider this case back then.
> 
> I'm not advocating a full flush on flush_cache_range(), but rather to
> set a page state bit, which will force a flush on the
> "check_dcache_page()" call which we could replace this conditionalized
> flush_dcache_page() call with.

Could we have PG_arch_2 as bit 13 for this purpose, guaranteed to be
cleared on page cache page allocation?  IOW, same rules as far as the
non-arch code is concerned as PG_arch_1.

Such a bit could be set in tlb_remove_tlb_entry() rather than having to
walk the page tables twice (once in flush_cache_range and again in
zap_*_range) though I wonder if that's too late in zap_pte_range().
(Walking the page tables on flush_cache_range would be too disgusting;
I don't fancy coding page table walks in assembly for some subset of
ARM CPUs.)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-29  9:12           ` Russell King
@ 2007-05-29 10:26             ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2007-05-29 10:26 UTC (permalink / raw)
  To: rmk; +Cc: buytenh, James.Bottomley, linux-arch, akpm

From: Russell King <rmk@arm.linux.org.uk>
Date: Tue, 29 May 2007 10:12:52 +0100

> Could we have PG_arch_2 as bit 13 for this purpose, guaranteed to be
> cleared on page cache page allocation?  IOW, same rules as far as the
> non-arch code is concerned as PG_arch_1.

I don't think there will be much objection to adding one more arch bit
:-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King
  2007-05-27 14:16 ` James Bottomley
@ 2007-05-27 22:24 ` Lennert Buytenhek
  2007-05-28  0:00   ` James Bottomley
  2007-05-28 12:38 ` Lennert Buytenhek
  2 siblings, 1 reply; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-27 22:24 UTC (permalink / raw)
  To: linux-arch; +Cc: Andrew Morton, davej

On Sun, May 27, 2007 at 11:49:30AM +0100, Russell King wrote:

> So, if you have a cache which needs a flush_dcache_page()
> implementation, can you please grab:
> 
>   http://www.wantstofly.org/~buytenh/cache-issue.c

Davide Libenzi pointed out that there is an error in this program,
in that it doesn't lseek() back to the beginning of the file, so it
ends up reading bytes 16-31 instead.  *blush*

The program below accurately (AFAICT) reflects what fsx-linux does
just before it dies in this testcase.  It write(2)s to a file, reads
from the file via an mmap() region, read(2)s from the file, writes
to the file via an mmap() region, and finally, read(2)s from the file
again.  The last read(2) call returns the data written by the first
write(2) call, and doesn't see the data written by the mmap() write,
and so the last line of output is firstfirstfirst instead of the
expected secondsecondsec.

fsx-linux does call msync(), but it calls it with a third argument
of zero.  Changing the third msync() argument from 0 to MS_SYNC seems
to make the issue disappear, which makes me suspect that fsx-linux's
calling msync(pointer, size, 0); isn't guaranteed to be doing what it
expects it to be doing.

If it _is_ in fact a bug in fsx, it would seem that most of the 2615
fsx versions out there are affected.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

char *file = "footest";
char buf[4096];

int main()
{
	int fd;
	char *x;

	/* Clear test file.  */
	fd = open(file, O_CREAT | O_TRUNC | O_RDWR, 0644);
	write(fd, buf, sizeof(buf));
	close(fd);
	sleep(0);

	/* WRITE */
	fd = open(file, O_RDWR, 0644);
	write(fd, "firstfirstfirst\n", 16);
	close(fd);

	/* MAPREAD */
	fd = open(file, O_RDWR, 0644);
	x = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memcpy(buf, x, 16);
	munmap(x, 4096);
	close(fd);

	/* READ */
	fd = open(file, O_RDWR, 0644);
	read(fd, buf + 16, 16);
	close(fd);

	/* MAPWRITE */
	fd = open(file, O_RDWR, 0644);
	x = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memcpy(x, "secondsecondsec\n", 16);
	msync(x, 4096, 0);
	munmap(x, 4096);
	close(fd);

	/* READ */
	fd = open(file, O_RDWR, 0644);
	read(fd, buf + 32, 16);
	close(fd);

	/* dump output */
	write(1, buf, 3 * 16);

	return 0;
}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 22:24 ` Lennert Buytenhek
@ 2007-05-28  0:00   ` James Bottomley
  2007-05-28 10:05     ` Russell King
  2007-05-28 12:33     ` Lennert Buytenhek
  0 siblings, 2 replies; 25+ messages in thread
From: James Bottomley @ 2007-05-28  0:00 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: linux-arch, Andrew Morton, davej

On Mon, 2007-05-28 at 00:24 +0200, Lennert Buytenhek wrote:
> On Sun, May 27, 2007 at 11:49:30AM +0100, Russell King wrote:
> 
> > So, if you have a cache which needs a flush_dcache_page()
> > implementation, can you please grab:
> > 
> >   http://www.wantstofly.org/~buytenh/cache-issue.c
> 
> Davide Libenzi pointed out that there is an error in this program,
> in that it doesn't lseek() back to the beginning of the file, so it
> ends up reading bytes 16-31 instead.  *blush*

Ah ... that explains why making the flush_dcache_page() unconditional
didn't fix the problem ...

> The program below accurately (AFAICT) reflects what fsx-linux does
> just before it dies in this testcase.  It write(2)s to a file, reads
> from the file via an mmap() region, read(2)s from the file, writes
> to the file via an mmap() region, and finally, read(2)s from the file
> again.  The last read(2) call returns the data written by the first
> write(2) call, and doesn't see the data written by the mmap() write,
> and so the last line of output is firstfirstfirst instead of the
> expected secondsecondsec.
> 
> fsx-linux does call msync(), but it calls it with a third argument
> of zero.  Changing the third msync() argument from 0 to MS_SYNC seems
> to make the issue disappear, which makes me suspect that fsx-linux's
> calling msync(pointer, size, 0); isn't guaranteed to be doing what it
> expects it to be doing.
> 
> If it _is_ in fact a bug in fsx, it would seem that most of the 2615
> fsx versions out there are affected.

Ok, output on parisc is:

jejb@ioz:~$ ./a.out
firstfirstfirst
firstfirstfirst
secondsecondsec

Which is correct.  It remains correct even if I drop the msync().

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28  0:00   ` James Bottomley
@ 2007-05-28 10:05     ` Russell King
  2007-05-28 14:17       ` James Bottomley
  2007-05-29 15:42       ` Lennert Buytenhek
  2007-05-28 12:33     ` Lennert Buytenhek
  1 sibling, 2 replies; 25+ messages in thread
From: Russell King @ 2007-05-28 10:05 UTC (permalink / raw)
  To: James Bottomley; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej

On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote:
> Ok, output on parisc is:
> 
> jejb@ioz:~$ ./a.out
> firstfirstfirst
> firstfirstfirst
> secondsecondsec
> 
> Which is correct.  It remains correct even if I drop the msync().

With Lennert's new program, I get mostly:

firstfirstfirst
firstfirstfirst
firstfirstfirst

but occasionally:

firstfirstfirst
firstfirstfirst
secondsecondsec

However, if I open code the memcpy() in the MAPREAD to copy one word
at a time, then I reliably get the "secondsecondsec" line.  But if I
convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
getting the failure with the occasional success.  Utterly confused.

Unless someone's got a theory, I'm stumped.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 10:05     ` Russell King
@ 2007-05-28 14:17       ` James Bottomley
  2007-05-28 14:39         ` Lennert Buytenhek
  2007-05-28 15:04         ` Russell King
  2007-05-29 15:42       ` Lennert Buytenhek
  1 sibling, 2 replies; 25+ messages in thread
From: James Bottomley @ 2007-05-28 14:17 UTC (permalink / raw)
  To: Russell King; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej

On Mon, 2007-05-28 at 11:05 +0100, Russell King wrote:
> On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote:
> > Ok, output on parisc is:
> > 
> > jejb@ioz:~$ ./a.out
> > firstfirstfirst
> > firstfirstfirst
> > secondsecondsec
> > 
> > Which is correct.  It remains correct even if I drop the msync().
> 
> With Lennert's new program, I get mostly:
> 
> firstfirstfirst
> firstfirstfirst
> firstfirstfirst
> 
> but occasionally:
> 
> firstfirstfirst
> firstfirstfirst
> secondsecondsec
> 
> However, if I open code the memcpy() in the MAPREAD to copy one word
> at a time, then I reliably get the "secondsecondsec" line.  But if I
> convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
> getting the failure with the occasional success.  Utterly confused.
> 
> Unless someone's got a theory, I'm stumped.

I think you're not flushing correctly in munmap() ... but I'm not sure
the linux API actually requires this.

We *have* to do it this way on parisc.  In VIPT architectures, once
you've lost the mapping, there's no real way to flush the page (well,
except by setting up another congruent mapping pointing to the same
physical page) so we have to flush before it's torn down otherwise we'd
be at the mercy of cache eviction for coherency.  We do it in
tlb_start_vma().

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 14:17       ` James Bottomley
@ 2007-05-28 14:39         ` Lennert Buytenhek
  2007-05-29  3:06           ` James Bottomley
  2007-05-29  5:58           ` David Miller
  2007-05-28 15:04         ` Russell King
  1 sibling, 2 replies; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-28 14:39 UTC (permalink / raw)
  To: James Bottomley; +Cc: Russell King, linux-arch, Andrew Morton, davej

On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote:

> > > Ok, output on parisc is:
> > > 
> > > jejb@ioz:~$ ./a.out
> > > firstfirstfirst
> > > firstfirstfirst
> > > secondsecondsec
> > > 
> > > Which is correct.  It remains correct even if I drop the msync().
> > 
> > With Lennert's new program, I get mostly:
> > 
> > firstfirstfirst
> > firstfirstfirst
> > firstfirstfirst
> > 
> > but occasionally:
> > 
> > firstfirstfirst
> > firstfirstfirst
> > secondsecondsec
> > 
> > However, if I open code the memcpy() in the MAPREAD to copy one word
> > at a time, then I reliably get the "secondsecondsec" line.  But if I
> > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
> > getting the failure with the occasional success.  Utterly confused.
> > 
> > Unless someone's got a theory, I'm stumped.
> 
> I think you're not flushing correctly in munmap() ... but I'm not sure
> the linux API actually requires this.

Having to (conditionally) invalidate the kernel direct mapping for
every userland page we unmap would kind of suck..

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 14:39         ` Lennert Buytenhek
@ 2007-05-29  3:06           ` James Bottomley
  2007-05-29  3:15             ` Lennert Buytenhek
  2007-05-29  5:58           ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: James Bottomley @ 2007-05-29  3:06 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: Russell King, linux-arch, Andrew Morton, davej

On Mon, 2007-05-28 at 16:39 +0200, Lennert Buytenhek wrote:
> On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote:
> 
> > > > Ok, output on parisc is:
> > > > 
> > > > jejb@ioz:~$ ./a.out
> > > > firstfirstfirst
> > > > firstfirstfirst
> > > > secondsecondsec
> > > > 
> > > > Which is correct.  It remains correct even if I drop the msync().
> > > 
> > > With Lennert's new program, I get mostly:
> > > 
> > > firstfirstfirst
> > > firstfirstfirst
> > > firstfirstfirst
> > > 
> > > but occasionally:
> > > 
> > > firstfirstfirst
> > > firstfirstfirst
> > > secondsecondsec
> > > 
> > > However, if I open code the memcpy() in the MAPREAD to copy one word
> > > at a time, then I reliably get the "secondsecondsec" line.  But if I
> > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
> > > getting the failure with the occasional success.  Utterly confused.
> > > 
> > > Unless someone's got a theory, I'm stumped.
> > 
> > I think you're not flushing correctly in munmap() ... but I'm not sure
> > the linux API actually requires this.
> 
> Having to (conditionally) invalidate the kernel direct mapping for
> every userland page we unmap would kind of suck..

Lets just verify it is a stale kernel mapping first.  Try this patch: it
will cohere the kernel aliases but not the user ones, so if the problem
goes away its definitely a stale kernel alias rather than a dirty user
one.  Actually, I can't find a patch ... what I want is a
flush_kernel_dcache_page() before

		if (mapping_writably_mapped(mapping))
			flush_dcache_page(page);

But arm doesn't seem to implement the API ... you must have some
equivalent, though, if you can find it ... I think it's

 __cpuc_coherent_kern_range(lowmem_page_address(page), lowmem_page_address(page)+PAGE_SIZE);

But I'm not an arm expert.

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-29  3:06           ` James Bottomley
@ 2007-05-29  3:15             ` Lennert Buytenhek
  2007-05-29 14:32               ` James Bottomley
  0 siblings, 1 reply; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-29  3:15 UTC (permalink / raw)
  To: James Bottomley; +Cc: Russell King, linux-arch, Andrew Morton, davej

On Mon, May 28, 2007 at 10:06:58PM -0500, James Bottomley wrote:

> > > > > Ok, output on parisc is:
> > > > > 
> > > > > jejb@ioz:~$ ./a.out
> > > > > firstfirstfirst
> > > > > firstfirstfirst
> > > > > secondsecondsec
> > > > > 
> > > > > Which is correct.  It remains correct even if I drop the msync().
> > > > 
> > > > With Lennert's new program, I get mostly:
> > > > 
> > > > firstfirstfirst
> > > > firstfirstfirst
> > > > firstfirstfirst
> > > > 
> > > > but occasionally:
> > > > 
> > > > firstfirstfirst
> > > > firstfirstfirst
> > > > secondsecondsec
> > > > 
> > > > However, if I open code the memcpy() in the MAPREAD to copy one word
> > > > at a time, then I reliably get the "secondsecondsec" line.  But if I
> > > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
> > > > getting the failure with the occasional success.  Utterly confused.
> > > > 
> > > > Unless someone's got a theory, I'm stumped.
> > > 
> > > I think you're not flushing correctly in munmap() ... but I'm not sure
> > > the linux API actually requires this.
> > 
> > Having to (conditionally) invalidate the kernel direct mapping for
> > every userland page we unmap would kind of suck..
> 
> Lets just verify it is a stale kernel mapping first.  Try this patch:
> it will cohere the kernel aliases but not the user ones,

Hmm.  I don't understand what you're saying.

By the time we call the final read(2) (which is what returns the
stale data), there _are_ no user mappings of the page in question.

Flushing the kernel direct mapping by unconditionally calling
flush_dcache_page() in do_generic_mapping_read() makes the issue go
away and makes fsx-linux happy.

Flushing the kernel direct mapping by forcibly context switching
between munmap() and read() (VIVT cache, context switch does full
cache flush+invalidate) makes the issue go away, too.  

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-29  3:15             ` Lennert Buytenhek
@ 2007-05-29 14:32               ` James Bottomley
  2007-05-29 17:13                 ` Russell King
  0 siblings, 1 reply; 25+ messages in thread
From: James Bottomley @ 2007-05-29 14:32 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: Russell King, linux-arch, Andrew Morton, davej

On Tue, 2007-05-29 at 05:15 +0200, Lennert Buytenhek wrote:
> On Mon, May 28, 2007 at 10:06:58PM -0500, James Bottomley wrote:
> > > Having to (conditionally) invalidate the kernel direct mapping for
> > > every userland page we unmap would kind of suck..
> > 
> > Lets just verify it is a stale kernel mapping first.  Try this patch:
> > it will cohere the kernel aliases but not the user ones,
> 
> Hmm.  I don't understand what you're saying.

I agree its likely a stale kernel cache ... but the symptoms could be
dirty user cache as well ... I was just trying to verify beyond doubt
that it's the former.

> By the time we call the final read(2) (which is what returns the
> stale data), there _are_ no user mappings of the page in question.

> Flushing the kernel direct mapping by unconditionally calling
> flush_dcache_page() in do_generic_mapping_read() makes the issue go
> away and makes fsx-linux happy.
> 
> Flushing the kernel direct mapping by forcibly context switching
> between munmap() and read() (VIVT cache, context switch does full
> cache flush+invalidate) makes the issue go away, too.  

I'm not that familiar with the mechanics of a VIVT cache.  If you unmap,
and thus remove the page table entries, does that mean a dirty VIVT line
at those entries gets discarded if the processor can't find a mapping?
Or does the TLB pin entries for dirty cache lines until they're flushed?

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-29 14:32               ` James Bottomley
@ 2007-05-29 17:13                 ` Russell King
  0 siblings, 0 replies; 25+ messages in thread
From: Russell King @ 2007-05-29 17:13 UTC (permalink / raw)
  To: James Bottomley; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej

On Tue, May 29, 2007 at 09:32:16AM -0500, James Bottomley wrote:
> I agree its likely a stale kernel cache ... but the symptoms could be
> dirty user cache as well ... I was just trying to verify beyond doubt
> that it's the former.

If that were true, with a VIVT cache you'd have userspace randomly
SEGVing since stale data would leak through when things get remapped
(eg, objects which are only loaded for a short time) etc.

That isn't the case.

> > Flushing the kernel direct mapping by unconditionally calling
> > flush_dcache_page() in do_generic_mapping_read() makes the issue go
> > away and makes fsx-linux happy.
> > 
> > Flushing the kernel direct mapping by forcibly context switching
> > between munmap() and read() (VIVT cache, context switch does full
> > cache flush+invalidate) makes the issue go away, too.  
> 
> I'm not that familiar with the mechanics of a VIVT cache.  If you unmap,
> and thus remove the page table entries, does that mean a dirty VIVT line
> at those entries gets discarded if the processor can't find a mapping?
> Or does the TLB pin entries for dirty cache lines until they're flushed?

We _always_ write out data from the cache for the range we're going to
unmap with VIVT.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 14:39         ` Lennert Buytenhek
  2007-05-29  3:06           ` James Bottomley
@ 2007-05-29  5:58           ` David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: David Miller @ 2007-05-29  5:58 UTC (permalink / raw)
  To: buytenh; +Cc: James.Bottomley, rmk, linux-arch, akpm, davej

From: Lennert Buytenhek <buytenh@wantstofly.org>
Date: Mon, 28 May 2007 16:39:36 +0200

> On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote:
> 
> > > > Ok, output on parisc is:
> > > > 
> > > > jejb@ioz:~$ ./a.out
> > > > firstfirstfirst
> > > > firstfirstfirst
> > > > secondsecondsec
> > > > 
> > > > Which is correct.  It remains correct even if I drop the msync().
> > > 
> > > With Lennert's new program, I get mostly:
> > > 
> > > firstfirstfirst
> > > firstfirstfirst
> > > firstfirstfirst
> > > 
> > > but occasionally:
> > > 
> > > firstfirstfirst
> > > firstfirstfirst
> > > secondsecondsec
> > > 
> > > However, if I open code the memcpy() in the MAPREAD to copy one word
> > > at a time, then I reliably get the "secondsecondsec" line.  But if I
> > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
> > > getting the failure with the occasional success.  Utterly confused.
> > > 
> > > Unless someone's got a theory, I'm stumped.
> > 
> > I think you're not flushing correctly in munmap() ... but I'm not sure
> > the linux API actually requires this.
> 
> Having to (conditionally) invalidate the kernel direct mapping for
> every userland page we unmap would kind of suck..

You don't have to do it for all pages, you just have to do whatever
flush_dcache_page() would do, for dirty+shared+writable mmap()'s of
files.

unmap's of shared+writable regions of files are so rare, only test
programs and the odd database do it when they shut down.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 14:17       ` James Bottomley
  2007-05-28 14:39         ` Lennert Buytenhek
@ 2007-05-28 15:04         ` Russell King
  1 sibling, 0 replies; 25+ messages in thread
From: Russell King @ 2007-05-28 15:04 UTC (permalink / raw)
  To: James Bottomley; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej

On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote:
> On Mon, 2007-05-28 at 11:05 +0100, Russell King wrote:
> > On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote:
> > > Ok, output on parisc is:
> > > 
> > > jejb@ioz:~$ ./a.out
> > > firstfirstfirst
> > > firstfirstfirst
> > > secondsecondsec
> > > 
> > > Which is correct.  It remains correct even if I drop the msync().
> > 
> > With Lennert's new program, I get mostly:
> > 
> > firstfirstfirst
> > firstfirstfirst
> > firstfirstfirst
> > 
> > but occasionally:
> > 
> > firstfirstfirst
> > firstfirstfirst
> > secondsecondsec
> > 
> > However, if I open code the memcpy() in the MAPREAD to copy one word
> > at a time, then I reliably get the "secondsecondsec" line.  But if I
> > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
> > getting the failure with the occasional success.  Utterly confused.
> > 
> > Unless someone's got a theory, I'm stumped.
> 
> I think you're not flushing correctly in munmap() ... but I'm not sure
> the linux API actually requires this.

We do an unconditional cache flush over the user virtual addresses in
the mapping before we tear down the page table entries.

sys_munmap -> do_munmap -> unmap_region -> unmap_vmas -> unmap_page_range
	-> tlb_start_vma

Although the call to flush_cache_range() in tlb_start_vma() is conditional,
it is conditional on !tlb->fullmm, and the call to tlb_gather_mmu() in
unmap_region() sets this to zero.

Adding additional expensive cache flushing into tlb_start_vma() is going
to be, well, disgusting.  Performance?  What's that?  Oh, something we
used to have in 1995.

Having an additional bit which is set on page cache pages to indicate
that they need flushing would be far more preferable.  PG_arch_1 won't
work because we already use this for delaying flush_dcache_page().  We
need PG_arch_2.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 10:05     ` Russell King
  2007-05-28 14:17       ` James Bottomley
@ 2007-05-29 15:42       ` Lennert Buytenhek
  1 sibling, 0 replies; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-29 15:42 UTC (permalink / raw)
  To: James Bottomley, linux-arch, Andrew Morton, davej

On Mon, May 28, 2007 at 11:05:13AM +0100, Russell King wrote:

> With Lennert's new program, I get mostly:
> 
> firstfirstfirst
> firstfirstfirst
> firstfirstfirst
> 
> but occasionally:
> 
> firstfirstfirst
> firstfirstfirst
> secondsecondsec
> 
> However, if I open code the memcpy() in the MAPREAD to copy one word
> at a time, then I reliably get the "secondsecondsec" line.  But if I
> convert the memcpy() in MAPWRITE in the same way, I'm back to mostly
> getting the failure with the occasional success.  Utterly confused.

Maybe your caches are no-allocate-on-write and glibc's memcpy()
loads from the destination addresses to force line allocation?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28  0:00   ` James Bottomley
  2007-05-28 10:05     ` Russell King
@ 2007-05-28 12:33     ` Lennert Buytenhek
  2007-05-28 14:22       ` James Bottomley
  1 sibling, 1 reply; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-28 12:33 UTC (permalink / raw)
  To: James Bottomley; +Cc: linux-arch, Andrew Morton, davej

On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote:

> Ok, output on parisc is:
> 
> jejb@ioz:~$ ./a.out
> firstfirstfirst
> firstfirstfirst
> secondsecondsec
> 
> Which is correct.  It remains correct even if I drop the msync().

What else was running when you ran this program, and how many runs
did you try?  I get a mixture of firstfirstfirst and secondsecondsec
when the machine I run it on isn't idle.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-28 12:33     ` Lennert Buytenhek
@ 2007-05-28 14:22       ` James Bottomley
  0 siblings, 0 replies; 25+ messages in thread
From: James Bottomley @ 2007-05-28 14:22 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: linux-arch, Andrew Morton, davej

On Mon, 2007-05-28 at 14:33 +0200, Lennert Buytenhek wrote:
> On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote:
> 
> > Ok, output on parisc is:
> > 
> > jejb@ioz:~$ ./a.out
> > firstfirstfirst
> > firstfirstfirst
> > secondsecondsec
> > 
> > Which is correct.  It remains correct even if I drop the msync().
> 
> What else was running when you ran this program, and how many runs
> did you try?  I get a mixture of firstfirstfirst and secondsecondsec
> when the machine I run it on isn't idle.

Both loaded and unloaded ... although you have to be careful how you
load on pa to see coherency problems:  the fork/exec cycle does a full
cache flush.

James


James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [CFT] read+shared mmap write+read data corruption
  2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King
  2007-05-27 14:16 ` James Bottomley
  2007-05-27 22:24 ` Lennert Buytenhek
@ 2007-05-28 12:38 ` Lennert Buytenhek
  2 siblings, 0 replies; 25+ messages in thread
From: Lennert Buytenhek @ 2007-05-28 12:38 UTC (permalink / raw)
  To: linux-arch; +Cc: Andrew Morton

On Sun, May 27, 2007 at 11:49:30AM +0100, Russell King wrote:

> * - the hole is:
> 
> void do_generic_mapping_read(struct address_space *mapping,
>                              struct file_ra_state *_ra,
>                              struct file *filp,
>                              loff_t *ppos,
>                              read_descriptor_t *desc,
>                              read_actor_t actor)
> {
> ...
>                 /* If users can be writing to this page using arbitrary
>                  * virtual addresses, take care about potential aliasing
>                  * before reading the page on the kernel side.
>                  */
>                 if (mapping_writably_mapped(mapping))
>                         flush_dcache_page(page);
> 
> is false for both read calls, since at the time this test is done, there
> are no shared writable mappings.  However, that's not to say that there
> haven't _been_ shared writable mappings.

If I make the flush_dcache_page() call above unconditional, then
the program I posted to this list previously[*] starts reliably
printing secondsecondsec as the third line.

David's check_dcache_page() suggestion makes a lot of sense to me.

Whether this solves the db4 corruption issues I've been seeing is
another matter...


[*] http://www.wantstofly.org/~buytenh/cache-issue2.c

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-05-29 17:14 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King
2007-05-27 14:16 ` James Bottomley
2007-05-27 22:25   ` Lennert Buytenhek
2007-05-27 23:06     ` David Miller
2007-05-27 23:05   ` David Miller
2007-05-28  0:31     ` James Bottomley
2007-05-28 12:44       ` Lennert Buytenhek
2007-05-29  5:35         ` David Miller
2007-05-29  9:12           ` Russell King
2007-05-29 10:26             ` David Miller
2007-05-27 22:24 ` Lennert Buytenhek
2007-05-28  0:00   ` James Bottomley
2007-05-28 10:05     ` Russell King
2007-05-28 14:17       ` James Bottomley
2007-05-28 14:39         ` Lennert Buytenhek
2007-05-29  3:06           ` James Bottomley
2007-05-29  3:15             ` Lennert Buytenhek
2007-05-29 14:32               ` James Bottomley
2007-05-29 17:13                 ` Russell King
2007-05-29  5:58           ` David Miller
2007-05-28 15:04         ` Russell King
2007-05-29 15:42       ` Lennert Buytenhek
2007-05-28 12:33     ` Lennert Buytenhek
2007-05-28 14:22       ` James Bottomley
2007-05-28 12:38 ` Lennert Buytenhek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.