* [CFT] read+shared mmap write+read data corruption
@ 2007-05-27 10:49 Russell King
2007-05-27 14:16 ` James Bottomley
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Russell King @ 2007-05-27 10:49 UTC (permalink / raw)
To: linux-arch; +Cc: Lennert Buytenhek, Andrew Morton
Lennert Buytenhek recently reported on the linux-arm-kernel list that
fsx-linux fails on his Xscale ARM platforms. Originally, a problem was
noticed with DB4 databases being corrupted.
After a little bit of digging, I found what appears to be a hole in
the cache alias handling for the page cache; I'm not certain what the
solution is at present [*]. However, I don't think this is an ARM
specific issue.
So, if you have a cache which needs a flush_dcache_page() implementation,
can you please grab:
http://www.wantstofly.org/~buytenh/cache-issue.c
Build this, and then run:
$ ./cache-issue | od -t x1 -Ax
If the output is a load of zero bytes instead of "0123456789abcdef" then
the read after a shared mmap() write returned stale data.
Note: this program is much more likely to show the issue than fsx-linux -
on my platforms, fsx-linux doesn't show any problems after several minutes
of running, whereas Lennert's program shows a problem immediately.
Also note that our msync() syscall is a no-op; my original test was with
a version of the above with msync(MS_INVALIDATE) in.
* - the hole is:
void do_generic_mapping_read(struct address_space *mapping,
struct file_ra_state *_ra,
struct file *filp,
loff_t *ppos,
read_descriptor_t *desc,
read_actor_t actor)
{
...
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
is false for both read calls, since at the time this test is done, there
are no shared writable mappings. However, that's not to say that there
haven't _been_ shared writable mappings.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King @ 2007-05-27 14:16 ` James Bottomley 2007-05-27 22:25 ` Lennert Buytenhek 2007-05-27 23:05 ` David Miller 2007-05-27 22:24 ` Lennert Buytenhek 2007-05-28 12:38 ` Lennert Buytenhek 2 siblings, 2 replies; 25+ messages in thread From: James Bottomley @ 2007-05-27 14:16 UTC (permalink / raw) To: Russell King; +Cc: linux-arch, Lennert Buytenhek, Andrew Morton On Sun, 2007-05-27 at 11:49 +0100, Russell King wrote: > Lennert Buytenhek recently reported on the linux-arm-kernel list that > fsx-linux fails on his Xscale ARM platforms. Originally, a problem was > noticed with DB4 databases being corrupted. > > After a little bit of digging, I found what appears to be a hole in > the cache alias handling for the page cache; I'm not certain what the > solution is at present [*]. However, I don't think this is an ARM > specific issue. Yes, I confirm seeing the same results on parisc (VIPT large caches). > So, if you have a cache which needs a flush_dcache_page() implementation, > can you please grab: > > http://www.wantstofly.org/~buytenh/cache-issue.c > > Build this, and then run: > > $ ./cache-issue | od -t x1 -Ax > > If the output is a load of zero bytes instead of "0123456789abcdef" then > the read after a shared mmap() write returned stale data. > > Note: this program is much more likely to show the issue than fsx-linux - > on my platforms, fsx-linux doesn't show any problems after several minutes > of running, whereas Lennert's program shows a problem immediately. > > Also note that our msync() syscall is a no-op; my original test was with > a version of the above with msync(MS_INVALIDATE) in. Regardless of how we implement it, the way I read POSIX, it does require at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before accessing the data in another way. > * - the hole is: > > void do_generic_mapping_read(struct address_space *mapping, > struct file_ra_state *_ra, > struct file *filp, > loff_t *ppos, > read_descriptor_t *desc, > read_actor_t actor) > { > ... > /* If users can be writing to this page using arbitrary > * virtual addresses, take care about potential aliasing > * before reading the page on the kernel side. > */ > if (mapping_writably_mapped(mapping)) > flush_dcache_page(page); > > is false for both read calls, since at the time this test is done, there > are no shared writable mappings. However, that's not to say that there > haven't _been_ shared writable mappings. The operation has to be done when we know the state of the object (unless we want to persist the dirty state in a lazy fashion? I think I don't really want to go there) so that's either in msync or munmap. I think I favour the former. James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 14:16 ` James Bottomley @ 2007-05-27 22:25 ` Lennert Buytenhek 2007-05-27 23:06 ` David Miller 2007-05-27 23:05 ` David Miller 1 sibling, 1 reply; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-27 22:25 UTC (permalink / raw) To: James Bottomley; +Cc: Russell King, linux-arch, Andrew Morton On Sun, May 27, 2007 at 09:16:24AM -0500, James Bottomley wrote: > Regardless of how we implement it, the way I read POSIX, it does > require at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before > accessing the data in another way. Right. At least the version of fsx-linux in akpm's ext3-tools tarball doesn't do that. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 22:25 ` Lennert Buytenhek @ 2007-05-27 23:06 ` David Miller 0 siblings, 0 replies; 25+ messages in thread From: David Miller @ 2007-05-27 23:06 UTC (permalink / raw) To: buytenh; +Cc: James.Bottomley, rmk, linux-arch, akpm From: Lennert Buytenhek <buytenh@wantstofly.org> Date: Mon, 28 May 2007 00:25:55 +0200 > On Sun, May 27, 2007 at 09:16:24AM -0500, James Bottomley wrote: > > > Regardless of how we implement it, the way I read POSIX, it does > > require at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before > > accessing the data in another way. > > Right. At least the version of fsx-linux in akpm's ext3-tools > tarball doesn't do that. I don't think we can seriously consider this requirement. We can make the data properly coherent on D-cache aliasing cpus, so we should. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 14:16 ` James Bottomley 2007-05-27 22:25 ` Lennert Buytenhek @ 2007-05-27 23:05 ` David Miller 2007-05-28 0:31 ` James Bottomley 1 sibling, 1 reply; 25+ messages in thread From: David Miller @ 2007-05-27 23:05 UTC (permalink / raw) To: James.Bottomley; +Cc: rmk, linux-arch, buytenh, akpm From: James Bottomley <James.Bottomley@SteelEye.com> Date: Sun, 27 May 2007 09:16:24 -0500 > On Sun, 2007-05-27 at 11:49 +0100, Russell King wrote: > > So, if you have a cache which needs a flush_dcache_page() implementation, > > can you please grab: > > > > http://www.wantstofly.org/~buytenh/cache-issue.c > > > > Build this, and then run: > > > > $ ./cache-issue | od -t x1 -Ax > > > > If the output is a load of zero bytes instead of "0123456789abcdef" then > > the read after a shared mmap() write returned stale data. > > > > Note: this program is much more likely to show the issue than fsx-linux - > > on my platforms, fsx-linux doesn't show any problems after several minutes > > of running, whereas Lennert's program shows a problem immediately. > > > > Also note that our msync() syscall is a no-op; my original test was with > > a version of the above with msync(MS_INVALIDATE) in. > > Regardless of how we implement it, the way I read POSIX, it does require > at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before accessing the data > in another way. I used to have code in flush_cache_range() on sparc64 that would have handled this case. I think whatever in the world POSIX says, quality of implementation says we should not be returning zeros in this test program, yet we are. :-) If we accept the msync() requirement (I don't think we should) then this is the kind of obscure thing that causes application programmers to scratch their heads for days if not weeks, and it will only anger and frustrate them further when we tell them they have to stick an msync() in there to see non-corrupt data. Perhaps one way to catch this is to make the filemap read-page code unconditionally call a new interface "check_dcache_page()" or similar, instead of the guarded flush_dcache_page() call there now. Then on unmap's of SHARED+WRITABLE mappings of files, we mark or flush dirty pages as is appropriate for however the platform handles D-cache aliasing. On sparc64 for example, I'd set PG_arch_1 and record the cpu in page->flags if PG_arch_1 wasn't already set, else I'd flush. Then I'd have check_dcache_page() do something like: if (PG_arch_1 || shared_writable_mappings_exist) flush(); ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 23:05 ` David Miller @ 2007-05-28 0:31 ` James Bottomley 2007-05-28 12:44 ` Lennert Buytenhek 0 siblings, 1 reply; 25+ messages in thread From: James Bottomley @ 2007-05-28 0:31 UTC (permalink / raw) To: David Miller; +Cc: rmk, linux-arch, buytenh, akpm On Sun, 2007-05-27 at 16:05 -0700, David Miller wrote: > From: James Bottomley <James.Bottomley@SteelEye.com> > Date: Sun, 27 May 2007 09:16:24 -0500 > > > On Sun, 2007-05-27 at 11:49 +0100, Russell King wrote: > > > So, if you have a cache which needs a flush_dcache_page() implementation, > > > can you please grab: > > > > > > http://www.wantstofly.org/~buytenh/cache-issue.c > > > > > > Build this, and then run: > > > > > > $ ./cache-issue | od -t x1 -Ax > > > > > > If the output is a load of zero bytes instead of "0123456789abcdef" then > > > the read after a shared mmap() write returned stale data. > > > > > > Note: this program is much more likely to show the issue than fsx-linux - > > > on my platforms, fsx-linux doesn't show any problems after several minutes > > > of running, whereas Lennert's program shows a problem immediately. > > > > > > Also note that our msync() syscall is a no-op; my original test was with > > > a version of the above with msync(MS_INVALIDATE) in. > > > > Regardless of how we implement it, the way I read POSIX, it does require > > at least msync(x, 4096, MS_SYNC|MS_INVALIDATE) before accessing the data > > in another way. > > I used to have code in flush_cache_range() on sparc64 that would > have handled this case. > > I think whatever in the world POSIX says, quality of implementation > says we should not be returning zeros in this test program, yet we > are. :-) I wasn't advocating ... merely point out that POSIX gives us a get out of jail free card. > If we accept the msync() requirement (I don't think we should) then > this is the kind of obscure thing that causes application programmers > to scratch their heads for days if not weeks, and it will only anger > and frustrate them further when we tell them they have to stick an > msync() in there to see non-corrupt data. > > Perhaps one way to catch this is to make the filemap read-page code > unconditionally call a new interface "check_dcache_page()" or similar, > instead of the guarded flush_dcache_page() call there now. > > Then on unmap's of SHARED+WRITABLE mappings of files, we mark > or flush dirty pages as is appropriate for however the platform > handles D-cache aliasing. > > On sparc64 for example, I'd set PG_arch_1 and record the cpu in > page->flags if PG_arch_1 wasn't already set, else I'd flush. > > Then I'd have check_dcache_page() do something like: > > if (PG_arch_1 || shared_writable_mappings_exist) > flush(); Actually, I'm not convinced we have a problem ... if I put an lseek back to zero in the original program, it works on parisc as well. On munmap we do a flush_cache_range for the unmapped vmas ... we have to, otherwise the data might be lost as the user mapping is zapped and we wouldn't be able to guarantee coherency writing it to the backing store. James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 0:31 ` James Bottomley @ 2007-05-28 12:44 ` Lennert Buytenhek 2007-05-29 5:35 ` David Miller 0 siblings, 1 reply; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-28 12:44 UTC (permalink / raw) To: James Bottomley; +Cc: David Miller, rmk, linux-arch, akpm On Sun, May 27, 2007 at 07:31:27PM -0500, James Bottomley wrote: > On munmap we do a flush_cache_range for the unmapped vmas ... we > have to, otherwise the data might be lost as the user mapping is > zapped and we wouldn't be able to guarantee coherency writing it > to the backing store. As far as I understand it, the munmap() does flush out the copy of the data at the user virtual address, but the subsequent read() call reads from an address in the kernel direct mapped window, for which there is still data in the cache due to the earlier read() syscall, and the mapping_writably_mapped() test fails so we don't end up calling flush_dcache_page(). ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 12:44 ` Lennert Buytenhek @ 2007-05-29 5:35 ` David Miller 2007-05-29 9:12 ` Russell King 0 siblings, 1 reply; 25+ messages in thread From: David Miller @ 2007-05-29 5:35 UTC (permalink / raw) To: buytenh; +Cc: James.Bottomley, rmk, linux-arch, akpm From: Lennert Buytenhek <buytenh@wantstofly.org> Date: Mon, 28 May 2007 14:44:11 +0200 > On Sun, May 27, 2007 at 07:31:27PM -0500, James Bottomley wrote: > > > On munmap we do a flush_cache_range for the unmapped vmas ... we > > have to, otherwise the data might be lost as the user mapping is > > zapped and we wouldn't be able to guarantee coherency writing it > > to the backing store. > > As far as I understand it, the munmap() does flush out the copy of > the data at the user virtual address, but the subsequent read() call > reads from an address in the kernel direct mapped window, for which > there is still data in the cache due to the earlier read() syscall, > and the mapping_writably_mapped() test fails so we don't end up > calling flush_dcache_page(). That is what is happening for sure. That mapping_writably_mapped() check depends upon munmap() flush out the lines from the cache on the user side at least enough to make them coherent on the kernel side. As I said my flush_cache_range() on sparc64 used to do this, but I removed it for whatever reason, perhaps I did not consider this case back then. I'm not advocating a full flush on flush_cache_range(), but rather to set a page state bit, which will force a flush on the "check_dcache_page()" call which we could replace this conditionalized flush_dcache_page() call with. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-29 5:35 ` David Miller @ 2007-05-29 9:12 ` Russell King 2007-05-29 10:26 ` David Miller 0 siblings, 1 reply; 25+ messages in thread From: Russell King @ 2007-05-29 9:12 UTC (permalink / raw) To: David Miller; +Cc: buytenh, James.Bottomley, linux-arch, akpm On Mon, May 28, 2007 at 10:35:50PM -0700, David Miller wrote: > From: Lennert Buytenhek <buytenh@wantstofly.org> > Date: Mon, 28 May 2007 14:44:11 +0200 > > As far as I understand it, the munmap() does flush out the copy of > > the data at the user virtual address, but the subsequent read() call > > reads from an address in the kernel direct mapped window, for which > > there is still data in the cache due to the earlier read() syscall, > > and the mapping_writably_mapped() test fails so we don't end up > > calling flush_dcache_page(). > > That is what is happening for sure. > > That mapping_writably_mapped() check depends upon munmap() > flush out the lines from the cache on the user side at > least enough to make them coherent on the kernel side. > > As I said my flush_cache_range() on sparc64 used to do this, > but I removed it for whatever reason, perhaps I did not > consider this case back then. > > I'm not advocating a full flush on flush_cache_range(), but rather to > set a page state bit, which will force a flush on the > "check_dcache_page()" call which we could replace this conditionalized > flush_dcache_page() call with. Could we have PG_arch_2 as bit 13 for this purpose, guaranteed to be cleared on page cache page allocation? IOW, same rules as far as the non-arch code is concerned as PG_arch_1. Such a bit could be set in tlb_remove_tlb_entry() rather than having to walk the page tables twice (once in flush_cache_range and again in zap_*_range) though I wonder if that's too late in zap_pte_range(). (Walking the page tables on flush_cache_range would be too disgusting; I don't fancy coding page table walks in assembly for some subset of ARM CPUs.) -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-29 9:12 ` Russell King @ 2007-05-29 10:26 ` David Miller 0 siblings, 0 replies; 25+ messages in thread From: David Miller @ 2007-05-29 10:26 UTC (permalink / raw) To: rmk; +Cc: buytenh, James.Bottomley, linux-arch, akpm From: Russell King <rmk@arm.linux.org.uk> Date: Tue, 29 May 2007 10:12:52 +0100 > Could we have PG_arch_2 as bit 13 for this purpose, guaranteed to be > cleared on page cache page allocation? IOW, same rules as far as the > non-arch code is concerned as PG_arch_1. I don't think there will be much objection to adding one more arch bit :-) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King 2007-05-27 14:16 ` James Bottomley @ 2007-05-27 22:24 ` Lennert Buytenhek 2007-05-28 0:00 ` James Bottomley 2007-05-28 12:38 ` Lennert Buytenhek 2 siblings, 1 reply; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-27 22:24 UTC (permalink / raw) To: linux-arch; +Cc: Andrew Morton, davej On Sun, May 27, 2007 at 11:49:30AM +0100, Russell King wrote: > So, if you have a cache which needs a flush_dcache_page() > implementation, can you please grab: > > http://www.wantstofly.org/~buytenh/cache-issue.c Davide Libenzi pointed out that there is an error in this program, in that it doesn't lseek() back to the beginning of the file, so it ends up reading bytes 16-31 instead. *blush* The program below accurately (AFAICT) reflects what fsx-linux does just before it dies in this testcase. It write(2)s to a file, reads from the file via an mmap() region, read(2)s from the file, writes to the file via an mmap() region, and finally, read(2)s from the file again. The last read(2) call returns the data written by the first write(2) call, and doesn't see the data written by the mmap() write, and so the last line of output is firstfirstfirst instead of the expected secondsecondsec. fsx-linux does call msync(), but it calls it with a third argument of zero. Changing the third msync() argument from 0 to MS_SYNC seems to make the issue disappear, which makes me suspect that fsx-linux's calling msync(pointer, size, 0); isn't guaranteed to be doing what it expects it to be doing. If it _is_ in fact a bug in fsx, it would seem that most of the 2615 fsx versions out there are affected. #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <string.h> #include <sys/mman.h> #include <unistd.h> char *file = "footest"; char buf[4096]; int main() { int fd; char *x; /* Clear test file. */ fd = open(file, O_CREAT | O_TRUNC | O_RDWR, 0644); write(fd, buf, sizeof(buf)); close(fd); sleep(0); /* WRITE */ fd = open(file, O_RDWR, 0644); write(fd, "firstfirstfirst\n", 16); close(fd); /* MAPREAD */ fd = open(file, O_RDWR, 0644); x = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); memcpy(buf, x, 16); munmap(x, 4096); close(fd); /* READ */ fd = open(file, O_RDWR, 0644); read(fd, buf + 16, 16); close(fd); /* MAPWRITE */ fd = open(file, O_RDWR, 0644); x = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); memcpy(x, "secondsecondsec\n", 16); msync(x, 4096, 0); munmap(x, 4096); close(fd); /* READ */ fd = open(file, O_RDWR, 0644); read(fd, buf + 32, 16); close(fd); /* dump output */ write(1, buf, 3 * 16); return 0; } ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 22:24 ` Lennert Buytenhek @ 2007-05-28 0:00 ` James Bottomley 2007-05-28 10:05 ` Russell King 2007-05-28 12:33 ` Lennert Buytenhek 0 siblings, 2 replies; 25+ messages in thread From: James Bottomley @ 2007-05-28 0:00 UTC (permalink / raw) To: Lennert Buytenhek; +Cc: linux-arch, Andrew Morton, davej On Mon, 2007-05-28 at 00:24 +0200, Lennert Buytenhek wrote: > On Sun, May 27, 2007 at 11:49:30AM +0100, Russell King wrote: > > > So, if you have a cache which needs a flush_dcache_page() > > implementation, can you please grab: > > > > http://www.wantstofly.org/~buytenh/cache-issue.c > > Davide Libenzi pointed out that there is an error in this program, > in that it doesn't lseek() back to the beginning of the file, so it > ends up reading bytes 16-31 instead. *blush* Ah ... that explains why making the flush_dcache_page() unconditional didn't fix the problem ... > The program below accurately (AFAICT) reflects what fsx-linux does > just before it dies in this testcase. It write(2)s to a file, reads > from the file via an mmap() region, read(2)s from the file, writes > to the file via an mmap() region, and finally, read(2)s from the file > again. The last read(2) call returns the data written by the first > write(2) call, and doesn't see the data written by the mmap() write, > and so the last line of output is firstfirstfirst instead of the > expected secondsecondsec. > > fsx-linux does call msync(), but it calls it with a third argument > of zero. Changing the third msync() argument from 0 to MS_SYNC seems > to make the issue disappear, which makes me suspect that fsx-linux's > calling msync(pointer, size, 0); isn't guaranteed to be doing what it > expects it to be doing. > > If it _is_ in fact a bug in fsx, it would seem that most of the 2615 > fsx versions out there are affected. Ok, output on parisc is: jejb@ioz:~$ ./a.out firstfirstfirst firstfirstfirst secondsecondsec Which is correct. It remains correct even if I drop the msync(). James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 0:00 ` James Bottomley @ 2007-05-28 10:05 ` Russell King 2007-05-28 14:17 ` James Bottomley 2007-05-29 15:42 ` Lennert Buytenhek 2007-05-28 12:33 ` Lennert Buytenhek 1 sibling, 2 replies; 25+ messages in thread From: Russell King @ 2007-05-28 10:05 UTC (permalink / raw) To: James Bottomley; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote: > Ok, output on parisc is: > > jejb@ioz:~$ ./a.out > firstfirstfirst > firstfirstfirst > secondsecondsec > > Which is correct. It remains correct even if I drop the msync(). With Lennert's new program, I get mostly: firstfirstfirst firstfirstfirst firstfirstfirst but occasionally: firstfirstfirst firstfirstfirst secondsecondsec However, if I open code the memcpy() in the MAPREAD to copy one word at a time, then I reliably get the "secondsecondsec" line. But if I convert the memcpy() in MAPWRITE in the same way, I'm back to mostly getting the failure with the occasional success. Utterly confused. Unless someone's got a theory, I'm stumped. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 10:05 ` Russell King @ 2007-05-28 14:17 ` James Bottomley 2007-05-28 14:39 ` Lennert Buytenhek 2007-05-28 15:04 ` Russell King 2007-05-29 15:42 ` Lennert Buytenhek 1 sibling, 2 replies; 25+ messages in thread From: James Bottomley @ 2007-05-28 14:17 UTC (permalink / raw) To: Russell King; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej On Mon, 2007-05-28 at 11:05 +0100, Russell King wrote: > On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote: > > Ok, output on parisc is: > > > > jejb@ioz:~$ ./a.out > > firstfirstfirst > > firstfirstfirst > > secondsecondsec > > > > Which is correct. It remains correct even if I drop the msync(). > > With Lennert's new program, I get mostly: > > firstfirstfirst > firstfirstfirst > firstfirstfirst > > but occasionally: > > firstfirstfirst > firstfirstfirst > secondsecondsec > > However, if I open code the memcpy() in the MAPREAD to copy one word > at a time, then I reliably get the "secondsecondsec" line. But if I > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly > getting the failure with the occasional success. Utterly confused. > > Unless someone's got a theory, I'm stumped. I think you're not flushing correctly in munmap() ... but I'm not sure the linux API actually requires this. We *have* to do it this way on parisc. In VIPT architectures, once you've lost the mapping, there's no real way to flush the page (well, except by setting up another congruent mapping pointing to the same physical page) so we have to flush before it's torn down otherwise we'd be at the mercy of cache eviction for coherency. We do it in tlb_start_vma(). James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 14:17 ` James Bottomley @ 2007-05-28 14:39 ` Lennert Buytenhek 2007-05-29 3:06 ` James Bottomley 2007-05-29 5:58 ` David Miller 2007-05-28 15:04 ` Russell King 1 sibling, 2 replies; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-28 14:39 UTC (permalink / raw) To: James Bottomley; +Cc: Russell King, linux-arch, Andrew Morton, davej On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote: > > > Ok, output on parisc is: > > > > > > jejb@ioz:~$ ./a.out > > > firstfirstfirst > > > firstfirstfirst > > > secondsecondsec > > > > > > Which is correct. It remains correct even if I drop the msync(). > > > > With Lennert's new program, I get mostly: > > > > firstfirstfirst > > firstfirstfirst > > firstfirstfirst > > > > but occasionally: > > > > firstfirstfirst > > firstfirstfirst > > secondsecondsec > > > > However, if I open code the memcpy() in the MAPREAD to copy one word > > at a time, then I reliably get the "secondsecondsec" line. But if I > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly > > getting the failure with the occasional success. Utterly confused. > > > > Unless someone's got a theory, I'm stumped. > > I think you're not flushing correctly in munmap() ... but I'm not sure > the linux API actually requires this. Having to (conditionally) invalidate the kernel direct mapping for every userland page we unmap would kind of suck.. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 14:39 ` Lennert Buytenhek @ 2007-05-29 3:06 ` James Bottomley 2007-05-29 3:15 ` Lennert Buytenhek 2007-05-29 5:58 ` David Miller 1 sibling, 1 reply; 25+ messages in thread From: James Bottomley @ 2007-05-29 3:06 UTC (permalink / raw) To: Lennert Buytenhek; +Cc: Russell King, linux-arch, Andrew Morton, davej On Mon, 2007-05-28 at 16:39 +0200, Lennert Buytenhek wrote: > On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote: > > > > > Ok, output on parisc is: > > > > > > > > jejb@ioz:~$ ./a.out > > > > firstfirstfirst > > > > firstfirstfirst > > > > secondsecondsec > > > > > > > > Which is correct. It remains correct even if I drop the msync(). > > > > > > With Lennert's new program, I get mostly: > > > > > > firstfirstfirst > > > firstfirstfirst > > > firstfirstfirst > > > > > > but occasionally: > > > > > > firstfirstfirst > > > firstfirstfirst > > > secondsecondsec > > > > > > However, if I open code the memcpy() in the MAPREAD to copy one word > > > at a time, then I reliably get the "secondsecondsec" line. But if I > > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly > > > getting the failure with the occasional success. Utterly confused. > > > > > > Unless someone's got a theory, I'm stumped. > > > > I think you're not flushing correctly in munmap() ... but I'm not sure > > the linux API actually requires this. > > Having to (conditionally) invalidate the kernel direct mapping for > every userland page we unmap would kind of suck.. Lets just verify it is a stale kernel mapping first. Try this patch: it will cohere the kernel aliases but not the user ones, so if the problem goes away its definitely a stale kernel alias rather than a dirty user one. Actually, I can't find a patch ... what I want is a flush_kernel_dcache_page() before if (mapping_writably_mapped(mapping)) flush_dcache_page(page); But arm doesn't seem to implement the API ... you must have some equivalent, though, if you can find it ... I think it's __cpuc_coherent_kern_range(lowmem_page_address(page), lowmem_page_address(page)+PAGE_SIZE); But I'm not an arm expert. James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-29 3:06 ` James Bottomley @ 2007-05-29 3:15 ` Lennert Buytenhek 2007-05-29 14:32 ` James Bottomley 0 siblings, 1 reply; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-29 3:15 UTC (permalink / raw) To: James Bottomley; +Cc: Russell King, linux-arch, Andrew Morton, davej On Mon, May 28, 2007 at 10:06:58PM -0500, James Bottomley wrote: > > > > > Ok, output on parisc is: > > > > > > > > > > jejb@ioz:~$ ./a.out > > > > > firstfirstfirst > > > > > firstfirstfirst > > > > > secondsecondsec > > > > > > > > > > Which is correct. It remains correct even if I drop the msync(). > > > > > > > > With Lennert's new program, I get mostly: > > > > > > > > firstfirstfirst > > > > firstfirstfirst > > > > firstfirstfirst > > > > > > > > but occasionally: > > > > > > > > firstfirstfirst > > > > firstfirstfirst > > > > secondsecondsec > > > > > > > > However, if I open code the memcpy() in the MAPREAD to copy one word > > > > at a time, then I reliably get the "secondsecondsec" line. But if I > > > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly > > > > getting the failure with the occasional success. Utterly confused. > > > > > > > > Unless someone's got a theory, I'm stumped. > > > > > > I think you're not flushing correctly in munmap() ... but I'm not sure > > > the linux API actually requires this. > > > > Having to (conditionally) invalidate the kernel direct mapping for > > every userland page we unmap would kind of suck.. > > Lets just verify it is a stale kernel mapping first. Try this patch: > it will cohere the kernel aliases but not the user ones, Hmm. I don't understand what you're saying. By the time we call the final read(2) (which is what returns the stale data), there _are_ no user mappings of the page in question. Flushing the kernel direct mapping by unconditionally calling flush_dcache_page() in do_generic_mapping_read() makes the issue go away and makes fsx-linux happy. Flushing the kernel direct mapping by forcibly context switching between munmap() and read() (VIVT cache, context switch does full cache flush+invalidate) makes the issue go away, too. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-29 3:15 ` Lennert Buytenhek @ 2007-05-29 14:32 ` James Bottomley 2007-05-29 17:13 ` Russell King 0 siblings, 1 reply; 25+ messages in thread From: James Bottomley @ 2007-05-29 14:32 UTC (permalink / raw) To: Lennert Buytenhek; +Cc: Russell King, linux-arch, Andrew Morton, davej On Tue, 2007-05-29 at 05:15 +0200, Lennert Buytenhek wrote: > On Mon, May 28, 2007 at 10:06:58PM -0500, James Bottomley wrote: > > > Having to (conditionally) invalidate the kernel direct mapping for > > > every userland page we unmap would kind of suck.. > > > > Lets just verify it is a stale kernel mapping first. Try this patch: > > it will cohere the kernel aliases but not the user ones, > > Hmm. I don't understand what you're saying. I agree its likely a stale kernel cache ... but the symptoms could be dirty user cache as well ... I was just trying to verify beyond doubt that it's the former. > By the time we call the final read(2) (which is what returns the > stale data), there _are_ no user mappings of the page in question. > Flushing the kernel direct mapping by unconditionally calling > flush_dcache_page() in do_generic_mapping_read() makes the issue go > away and makes fsx-linux happy. > > Flushing the kernel direct mapping by forcibly context switching > between munmap() and read() (VIVT cache, context switch does full > cache flush+invalidate) makes the issue go away, too. I'm not that familiar with the mechanics of a VIVT cache. If you unmap, and thus remove the page table entries, does that mean a dirty VIVT line at those entries gets discarded if the processor can't find a mapping? Or does the TLB pin entries for dirty cache lines until they're flushed? James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-29 14:32 ` James Bottomley @ 2007-05-29 17:13 ` Russell King 0 siblings, 0 replies; 25+ messages in thread From: Russell King @ 2007-05-29 17:13 UTC (permalink / raw) To: James Bottomley; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej On Tue, May 29, 2007 at 09:32:16AM -0500, James Bottomley wrote: > I agree its likely a stale kernel cache ... but the symptoms could be > dirty user cache as well ... I was just trying to verify beyond doubt > that it's the former. If that were true, with a VIVT cache you'd have userspace randomly SEGVing since stale data would leak through when things get remapped (eg, objects which are only loaded for a short time) etc. That isn't the case. > > Flushing the kernel direct mapping by unconditionally calling > > flush_dcache_page() in do_generic_mapping_read() makes the issue go > > away and makes fsx-linux happy. > > > > Flushing the kernel direct mapping by forcibly context switching > > between munmap() and read() (VIVT cache, context switch does full > > cache flush+invalidate) makes the issue go away, too. > > I'm not that familiar with the mechanics of a VIVT cache. If you unmap, > and thus remove the page table entries, does that mean a dirty VIVT line > at those entries gets discarded if the processor can't find a mapping? > Or does the TLB pin entries for dirty cache lines until they're flushed? We _always_ write out data from the cache for the range we're going to unmap with VIVT. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 14:39 ` Lennert Buytenhek 2007-05-29 3:06 ` James Bottomley @ 2007-05-29 5:58 ` David Miller 1 sibling, 0 replies; 25+ messages in thread From: David Miller @ 2007-05-29 5:58 UTC (permalink / raw) To: buytenh; +Cc: James.Bottomley, rmk, linux-arch, akpm, davej From: Lennert Buytenhek <buytenh@wantstofly.org> Date: Mon, 28 May 2007 16:39:36 +0200 > On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote: > > > > > Ok, output on parisc is: > > > > > > > > jejb@ioz:~$ ./a.out > > > > firstfirstfirst > > > > firstfirstfirst > > > > secondsecondsec > > > > > > > > Which is correct. It remains correct even if I drop the msync(). > > > > > > With Lennert's new program, I get mostly: > > > > > > firstfirstfirst > > > firstfirstfirst > > > firstfirstfirst > > > > > > but occasionally: > > > > > > firstfirstfirst > > > firstfirstfirst > > > secondsecondsec > > > > > > However, if I open code the memcpy() in the MAPREAD to copy one word > > > at a time, then I reliably get the "secondsecondsec" line. But if I > > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly > > > getting the failure with the occasional success. Utterly confused. > > > > > > Unless someone's got a theory, I'm stumped. > > > > I think you're not flushing correctly in munmap() ... but I'm not sure > > the linux API actually requires this. > > Having to (conditionally) invalidate the kernel direct mapping for > every userland page we unmap would kind of suck.. You don't have to do it for all pages, you just have to do whatever flush_dcache_page() would do, for dirty+shared+writable mmap()'s of files. unmap's of shared+writable regions of files are so rare, only test programs and the odd database do it when they shut down. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 14:17 ` James Bottomley 2007-05-28 14:39 ` Lennert Buytenhek @ 2007-05-28 15:04 ` Russell King 1 sibling, 0 replies; 25+ messages in thread From: Russell King @ 2007-05-28 15:04 UTC (permalink / raw) To: James Bottomley; +Cc: Lennert Buytenhek, linux-arch, Andrew Morton, davej On Mon, May 28, 2007 at 09:17:55AM -0500, James Bottomley wrote: > On Mon, 2007-05-28 at 11:05 +0100, Russell King wrote: > > On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote: > > > Ok, output on parisc is: > > > > > > jejb@ioz:~$ ./a.out > > > firstfirstfirst > > > firstfirstfirst > > > secondsecondsec > > > > > > Which is correct. It remains correct even if I drop the msync(). > > > > With Lennert's new program, I get mostly: > > > > firstfirstfirst > > firstfirstfirst > > firstfirstfirst > > > > but occasionally: > > > > firstfirstfirst > > firstfirstfirst > > secondsecondsec > > > > However, if I open code the memcpy() in the MAPREAD to copy one word > > at a time, then I reliably get the "secondsecondsec" line. But if I > > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly > > getting the failure with the occasional success. Utterly confused. > > > > Unless someone's got a theory, I'm stumped. > > I think you're not flushing correctly in munmap() ... but I'm not sure > the linux API actually requires this. We do an unconditional cache flush over the user virtual addresses in the mapping before we tear down the page table entries. sys_munmap -> do_munmap -> unmap_region -> unmap_vmas -> unmap_page_range -> tlb_start_vma Although the call to flush_cache_range() in tlb_start_vma() is conditional, it is conditional on !tlb->fullmm, and the call to tlb_gather_mmu() in unmap_region() sets this to zero. Adding additional expensive cache flushing into tlb_start_vma() is going to be, well, disgusting. Performance? What's that? Oh, something we used to have in 1995. Having an additional bit which is set on page cache pages to indicate that they need flushing would be far more preferable. PG_arch_1 won't work because we already use this for delaying flush_dcache_page(). We need PG_arch_2. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 10:05 ` Russell King 2007-05-28 14:17 ` James Bottomley @ 2007-05-29 15:42 ` Lennert Buytenhek 1 sibling, 0 replies; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-29 15:42 UTC (permalink / raw) To: James Bottomley, linux-arch, Andrew Morton, davej On Mon, May 28, 2007 at 11:05:13AM +0100, Russell King wrote: > With Lennert's new program, I get mostly: > > firstfirstfirst > firstfirstfirst > firstfirstfirst > > but occasionally: > > firstfirstfirst > firstfirstfirst > secondsecondsec > > However, if I open code the memcpy() in the MAPREAD to copy one word > at a time, then I reliably get the "secondsecondsec" line. But if I > convert the memcpy() in MAPWRITE in the same way, I'm back to mostly > getting the failure with the occasional success. Utterly confused. Maybe your caches are no-allocate-on-write and glibc's memcpy() loads from the destination addresses to force line allocation? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 0:00 ` James Bottomley 2007-05-28 10:05 ` Russell King @ 2007-05-28 12:33 ` Lennert Buytenhek 2007-05-28 14:22 ` James Bottomley 1 sibling, 1 reply; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-28 12:33 UTC (permalink / raw) To: James Bottomley; +Cc: linux-arch, Andrew Morton, davej On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote: > Ok, output on parisc is: > > jejb@ioz:~$ ./a.out > firstfirstfirst > firstfirstfirst > secondsecondsec > > Which is correct. It remains correct even if I drop the msync(). What else was running when you ran this program, and how many runs did you try? I get a mixture of firstfirstfirst and secondsecondsec when the machine I run it on isn't idle. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-28 12:33 ` Lennert Buytenhek @ 2007-05-28 14:22 ` James Bottomley 0 siblings, 0 replies; 25+ messages in thread From: James Bottomley @ 2007-05-28 14:22 UTC (permalink / raw) To: Lennert Buytenhek; +Cc: linux-arch, Andrew Morton, davej On Mon, 2007-05-28 at 14:33 +0200, Lennert Buytenhek wrote: > On Sun, May 27, 2007 at 07:00:31PM -0500, James Bottomley wrote: > > > Ok, output on parisc is: > > > > jejb@ioz:~$ ./a.out > > firstfirstfirst > > firstfirstfirst > > secondsecondsec > > > > Which is correct. It remains correct even if I drop the msync(). > > What else was running when you ran this program, and how many runs > did you try? I get a mixture of firstfirstfirst and secondsecondsec > when the machine I run it on isn't idle. Both loaded and unloaded ... although you have to be careful how you load on pa to see coherency problems: the fork/exec cycle does a full cache flush. James James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [CFT] read+shared mmap write+read data corruption 2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King 2007-05-27 14:16 ` James Bottomley 2007-05-27 22:24 ` Lennert Buytenhek @ 2007-05-28 12:38 ` Lennert Buytenhek 2 siblings, 0 replies; 25+ messages in thread From: Lennert Buytenhek @ 2007-05-28 12:38 UTC (permalink / raw) To: linux-arch; +Cc: Andrew Morton On Sun, May 27, 2007 at 11:49:30AM +0100, Russell King wrote: > * - the hole is: > > void do_generic_mapping_read(struct address_space *mapping, > struct file_ra_state *_ra, > struct file *filp, > loff_t *ppos, > read_descriptor_t *desc, > read_actor_t actor) > { > ... > /* If users can be writing to this page using arbitrary > * virtual addresses, take care about potential aliasing > * before reading the page on the kernel side. > */ > if (mapping_writably_mapped(mapping)) > flush_dcache_page(page); > > is false for both read calls, since at the time this test is done, there > are no shared writable mappings. However, that's not to say that there > haven't _been_ shared writable mappings. If I make the flush_dcache_page() call above unconditional, then the program I posted to this list previously[*] starts reliably printing secondsecondsec as the third line. David's check_dcache_page() suggestion makes a lot of sense to me. Whether this solves the db4 corruption issues I've been seeing is another matter... [*] http://www.wantstofly.org/~buytenh/cache-issue2.c ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2007-05-29 17:14 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-05-27 10:49 [CFT] read+shared mmap write+read data corruption Russell King 2007-05-27 14:16 ` James Bottomley 2007-05-27 22:25 ` Lennert Buytenhek 2007-05-27 23:06 ` David Miller 2007-05-27 23:05 ` David Miller 2007-05-28 0:31 ` James Bottomley 2007-05-28 12:44 ` Lennert Buytenhek 2007-05-29 5:35 ` David Miller 2007-05-29 9:12 ` Russell King 2007-05-29 10:26 ` David Miller 2007-05-27 22:24 ` Lennert Buytenhek 2007-05-28 0:00 ` James Bottomley 2007-05-28 10:05 ` Russell King 2007-05-28 14:17 ` James Bottomley 2007-05-28 14:39 ` Lennert Buytenhek 2007-05-29 3:06 ` James Bottomley 2007-05-29 3:15 ` Lennert Buytenhek 2007-05-29 14:32 ` James Bottomley 2007-05-29 17:13 ` Russell King 2007-05-29 5:58 ` David Miller 2007-05-28 15:04 ` Russell King 2007-05-29 15:42 ` Lennert Buytenhek 2007-05-28 12:33 ` Lennert Buytenhek 2007-05-28 14:22 ` James Bottomley 2007-05-28 12:38 ` Lennert Buytenhek
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.