* Re: tmpfs returns incorrect data on concurrent pread() and truncate() [not found] <18e9fa0f-ec31-9107-459c-ae1694503f87@gmail.com> @ 2016-11-01 23:51 ` Hugh Dickins 2016-11-02 1:01 ` Dave Chinner 2016-11-03 23:45 ` Jakob Unterwurzacher 0 siblings, 2 replies; 6+ messages in thread From: Hugh Dickins @ 2016-11-01 23:51 UTC (permalink / raw) To: Jakob Unterwurzacher; +Cc: linux-kernel, linux-fsdevel On Wed, 26 Oct 2016, Jakob Unterwurzacher wrote: > tmpfs seems to be incorrectly returning 0-bytes when reading from > a file that is concurrently being truncated. That is an interesting observation, and you got me worried; but in fact, it is not a tmpfs problem: if we call it a problem at all, it's a VFS problem or a userspace problem. You chose a ratio of 3 preads to 1 ftruncate in your program below: let's call that the Unterwurzacher Ratio, 3 for tmpfs; YMMV, but for me 4 worked well to show the same issue on ramfs, and 15 on ext4. The Linux VFS does not serialize reads against writes or truncation very strictly: it's unusual to need that serialization, and most users prefer maximum speed to the additional locking, or intermediate buffering, that would be required to avoid the issue you've seen. > > This is causing crashes in gocryptfs, a cryptographic FUSE overlay, > when it reads a nonce from disk that should absolutely positively > never be all-zero. I don't think that there will be much appetite for changing the kernel's VFS to prevent that. I hope that gocryptfs can provide the serialization that it needs for itself, or otherwise handle those zeroes without crashing. (Though if something is free to truncate at an awkward moment, then I imagine it can also mess things up by writing wrong data.) > > I have written a reproducer in C that triggers this issue on > both boxes I tested, Linux 4.7.2 and 3.16.7, both amd64. > > It can be downloaded from here: > > https://gist.github.com/rfjakob/d01281c737db38075767f90bf03fc475 > > or, alternatively, I have attached it to this email at the bottom. > > The reproducer: > 1) Creates a 10MB file filled with 'x' at /dev/shm/x > 2) Spawns a thread that truncates the file 3 bytes at a time > 3) Spawns another thread that pread()s the file 1 byte at a time > starting from the top > 4) Prints "wrong data" whenever the pread() gets something that > is not 'x' or an empty result. Nice work, thank you, this helped me a lot. Hugh > > Example run: > > $ gcc -Wall -lpthread truncate_read.c && ./a.out > wrong data: 0 > wrong data: 0 > wrong data: 0 > wrong data: 0 > wrong data: 0 > wrong data: 0 > wrong data: 0 > wrong data: 0 > wrong data: 0 > wrong data: 0 > [...] > > > Best regards, > Jakob > > > --------------------------------------------------------------------- > truncate_read.c > ------------------------------8<--------------------------------------- > > > // Compile and run: > // gcc -Wall -lpthread truncate_read.c && ./a.out > > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <stdio.h> > #include <pthread.h> > > int fd; > int l = 10*1024*1024; > pthread_t tid[2]; > > void* read_thread(void *arg){ > int o,n; > char b; > for(o=l; o>0; o--) { > b = 'a'; > n = pread(fd, &b, 1, o); > if(n==0) { > continue; > } > if(b != 'x') { > printf("wrong data: %x\n", b); > } > } > return NULL; > } > > void* truncate_thread(void *arg){ > // Parent = Truncater > int o,n; > // "3" seems to be the sweet spot to trigger the most errors. > for(o=l; o>0; o-=3) { > n = ftruncate(fd, o); > if(n!=0) { > perror("ftruncate err"); > } > } > return NULL; > } > > int main(int argc, char *argv[]) { > fd = open("/dev/shm/x", O_RDWR | O_TRUNC | O_CREAT, 0666); > if(fd < 0) { > printf("open failed\n"); > exit(1); > } > char* x = malloc(l); > memset(x, 'x', l); > write(fd, x, l); > > pthread_create(&(tid[0]), NULL, &read_thread, NULL); > pthread_create(&(tid[1]), NULL, &truncate_thread, NULL); > > void *res; > pthread_join(tid[0], &res); > pthread_join(tid[1], &res); > > return 0; > } ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tmpfs returns incorrect data on concurrent pread() and truncate() 2016-11-01 23:51 ` tmpfs returns incorrect data on concurrent pread() and truncate() Hugh Dickins @ 2016-11-02 1:01 ` Dave Chinner 2016-11-02 1:38 ` Hugh Dickins 2016-11-03 23:45 ` Jakob Unterwurzacher 1 sibling, 1 reply; 6+ messages in thread From: Dave Chinner @ 2016-11-02 1:01 UTC (permalink / raw) To: Hugh Dickins; +Cc: Jakob Unterwurzacher, linux-kernel, linux-fsdevel On Tue, Nov 01, 2016 at 04:51:30PM -0700, Hugh Dickins wrote: > On Wed, 26 Oct 2016, Jakob Unterwurzacher wrote: > > > tmpfs seems to be incorrectly returning 0-bytes when reading from > > a file that is concurrently being truncated. > > That is an interesting observation, and you got me worried; > but in fact, it is not a tmpfs problem: if we call it a > problem at all, it's a VFS problem or a userspace problem. > > You chose a ratio of 3 preads to 1 ftruncate in your program below: > let's call that the Unterwurzacher Ratio, 3 for tmpfs; YMMV, but for > me 4 worked well to show the same issue on ramfs, and 15 on ext4. > > The Linux VFS does not serialize reads against writes or truncation > very strictly: Which is a fine, because... > it's unusual to need that serialization, and most .... many filesystems need more robust serialisation as hole punching (and other fallocate-based extent manipulations) have much stricter serialisation requirements than truncate and these .... > users prefer maximum speed to the additional locking, or intermediate > buffering, that would be required to avoid the issue you've seen. .... require additional locking to be done at the filesystem level to avoid race conditions. Throw in the fact that we already have to do this serialisation in the filesystem for direct IO as there are no page locks to serialise direct IO against truncate. And we need to lock out page faults from refaulting while we are doing things like punching holes (to avoid data coherency and corruption bugs), so we need more filesystem level locks to serialise mmap against fallocate(). And DAX has similar issues - there are no struct pages to serialise read or mmap access against truncate, so again we need filesystem level serialisation for this. Put simple: page locks are insufficient as a generic mechanism for serialising filesystem operations. The locking required for this is generally deeply filesystem implementation specific, so it's fine that the VFS doesn't attempt to provide anything stricter than it currently does.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tmpfs returns incorrect data on concurrent pread() and truncate() 2016-11-02 1:01 ` Dave Chinner @ 2016-11-02 1:38 ` Hugh Dickins 2016-11-02 4:01 ` Dave Chinner 2016-11-02 13:21 ` Theodore Ts'o 0 siblings, 2 replies; 6+ messages in thread From: Hugh Dickins @ 2016-11-02 1:38 UTC (permalink / raw) To: Dave Chinner Cc: Hugh Dickins, Jakob Unterwurzacher, linux-kernel, linux-fsdevel On Wed, 2 Nov 2016, Dave Chinner wrote: > On Tue, Nov 01, 2016 at 04:51:30PM -0700, Hugh Dickins wrote: > > On Wed, 26 Oct 2016, Jakob Unterwurzacher wrote: > > > > > tmpfs seems to be incorrectly returning 0-bytes when reading from > > > a file that is concurrently being truncated. > > > > That is an interesting observation, and you got me worried; > > but in fact, it is not a tmpfs problem: if we call it a > > problem at all, it's a VFS problem or a userspace problem. > > > > You chose a ratio of 3 preads to 1 ftruncate in your program below: > > let's call that the Unterwurzacher Ratio, 3 for tmpfs; YMMV, but for > > me 4 worked well to show the same issue on ramfs, and 15 on ext4. > > > > The Linux VFS does not serialize reads against writes or truncation > > very strictly: > > Which is a fine, because... > > > it's unusual to need that serialization, and most > > .... many filesystems need more robust serialisation as hole punching > (and other fallocate-based extent manipulations) have much stricter > serialisation requirements than truncate and these .... > > > users prefer maximum speed to the additional locking, or intermediate > > buffering, that would be required to avoid the issue you've seen. > > .... require additional locking to be done at the filesystem level > to avoid race conditions. > > Throw in the fact that we already have to do this serialisation in > the filesystem for direct IO as there are no page locks to serialise > direct IO against truncate. And we need to lock out page faults > from refaulting while we are doing things like punching holes (to > avoid data coherency and corruption bugs), so we need more > filesystem level locks to serialise mmap against fallocate(). > > And DAX has similar issues - there are no struct pages to serialise > read or mmap access against truncate, so again we need filesystem > level serialisation for this. > > Put simple: page locks are insufficient as a generic mechanism for > serialising filesystem operations. The locking required for this is > generally deeply filesystem implementation specific, so it's fine > that the VFS doesn't attempt to provide anything stricter than it > currently does.... I think you are saying that: xfs already provides the extra locking that avoids this issue; most other filesystems do not; but more can be expected to add that extra locking in the coming months? Hugh ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tmpfs returns incorrect data on concurrent pread() and truncate() 2016-11-02 1:38 ` Hugh Dickins @ 2016-11-02 4:01 ` Dave Chinner 2016-11-02 13:21 ` Theodore Ts'o 1 sibling, 0 replies; 6+ messages in thread From: Dave Chinner @ 2016-11-02 4:01 UTC (permalink / raw) To: Hugh Dickins; +Cc: Jakob Unterwurzacher, linux-kernel, linux-fsdevel On Tue, Nov 01, 2016 at 06:38:26PM -0700, Hugh Dickins wrote: > On Wed, 2 Nov 2016, Dave Chinner wrote: > > On Tue, Nov 01, 2016 at 04:51:30PM -0700, Hugh Dickins wrote: > > > On Wed, 26 Oct 2016, Jakob Unterwurzacher wrote: > > > > > > > tmpfs seems to be incorrectly returning 0-bytes when reading from > > > > a file that is concurrently being truncated. > > > > > > That is an interesting observation, and you got me worried; > > > but in fact, it is not a tmpfs problem: if we call it a > > > problem at all, it's a VFS problem or a userspace problem. > > > > > > You chose a ratio of 3 preads to 1 ftruncate in your program below: > > > let's call that the Unterwurzacher Ratio, 3 for tmpfs; YMMV, but for > > > me 4 worked well to show the same issue on ramfs, and 15 on ext4. > > > > > > The Linux VFS does not serialize reads against writes or truncation > > > very strictly: > > > > Which is a fine, because... > > > > > it's unusual to need that serialization, and most > > > > .... many filesystems need more robust serialisation as hole punching > > (and other fallocate-based extent manipulations) have much stricter > > serialisation requirements than truncate and these .... > > > > > users prefer maximum speed to the additional locking, or intermediate > > > buffering, that would be required to avoid the issue you've seen. > > > > .... require additional locking to be done at the filesystem level > > to avoid race conditions. > > > > Throw in the fact that we already have to do this serialisation in > > the filesystem for direct IO as there are no page locks to serialise > > direct IO against truncate. And we need to lock out page faults > > from refaulting while we are doing things like punching holes (to > > avoid data coherency and corruption bugs), so we need more > > filesystem level locks to serialise mmap against fallocate(). > > > > And DAX has similar issues - there are no struct pages to serialise > > read or mmap access against truncate, so again we need filesystem > > level serialisation for this. > > > > Put simple: page locks are insufficient as a generic mechanism for > > serialising filesystem operations. The locking required for this is > > generally deeply filesystem implementation specific, so it's fine > > that the VFS doesn't attempt to provide anything stricter than it > > currently does.... > > I think you are saying that: xfs already provides the extra locking > that avoids this issue; most other filesystems do not; but more can > be expected to add that extra locking in the coming months? Effectively, yes. ext4 already has the extra mmap lock for DAX, and it's likely to get saner internal IO locking as it is converted to use iomaps (which is also needed for DAX). As I've pointed out in a different thread recently, filesystems now have three IO paths - page cache, direct and DAX - and only the page cache IO uses struct pages. The other IO paths don't, and they still have to be serialised against truncate and other operations. IOWs, if a filesystem wants to support the new incoming storage technologies, it has to do this IO serialisation internally regardless of whatever other optimisations and go-fast paths the page cache has... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tmpfs returns incorrect data on concurrent pread() and truncate() 2016-11-02 1:38 ` Hugh Dickins 2016-11-02 4:01 ` Dave Chinner @ 2016-11-02 13:21 ` Theodore Ts'o 1 sibling, 0 replies; 6+ messages in thread From: Theodore Ts'o @ 2016-11-02 13:21 UTC (permalink / raw) To: Hugh Dickins Cc: Dave Chinner, Jakob Unterwurzacher, linux-kernel, linux-fsdevel On Tue, Nov 01, 2016 at 06:38:26PM -0700, Hugh Dickins wrote: > > Put simple: page locks are insufficient as a generic mechanism for > > serialising filesystem operations. The locking required for this is > > generally deeply filesystem implementation specific, so it's fine > > that the VFS doesn't attempt to provide anything stricter than it > > currently does.... > > I think you are saying that: xfs already provides the extra locking > that avoids this issue; most other filesystems do not; but more can > be expected to add that extra locking in the coming months? The test program was using buffered reads, and ext4 is using generic_file_read_iter(), so presumably adding file system specific locking would require that file systems which use generic_file_read_iter() would need to front-end it with the additional required file system specific locks. This is what XFS is currently doing, but a quick git grep seems to show that besides XFS, nfs, cifs, ceph and ocfs2 (i.e., XFS and remote file systems), everyone else is using generic_file_read_iter() is using it directly as read_iter method. So if we think we really care about getting this case right, and if we think the locking overhead in the uncontended case is sufficiently low that it's worth fixing (it'll probably mean the bonnie benchmark will take a hit, but that's a pretty crappy benchmark anyway :-), perhaps we should be thinking about some way of adding a generic locking mechanism so that most file systems will have a fighting chance of getting this right. Otherwise, perhaps the major file systems will add the appropriate locking to the buffered I/O case (much more attention has been paid to the direct I/O case, which is where more workloads have historically cared) but I suspect as a percentage of the file systems in the Linux kernel, they won't be getting this case right. Cheers, - Ted ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: tmpfs returns incorrect data on concurrent pread() and truncate() 2016-11-01 23:51 ` tmpfs returns incorrect data on concurrent pread() and truncate() Hugh Dickins 2016-11-02 1:01 ` Dave Chinner @ 2016-11-03 23:45 ` Jakob Unterwurzacher 1 sibling, 0 replies; 6+ messages in thread From: Jakob Unterwurzacher @ 2016-11-03 23:45 UTC (permalink / raw) To: Hugh Dickins; +Cc: linux-kernel, linux-fsdevel First of all, thanks for your replies! On 02.11.2016 00:51, Hugh Dickins wrote: > I don't think that there will be much appetite for changing the > kernel's VFS to prevent that. I hope that gocryptfs can provide > the serialization that it needs for itself, or otherwise handle > those zeroes without crashing. Yes, I dropped the nonce sanity check, the MAC will catch the corruption anyway. Just FYI, xfstests generic/075 discovered this. The test itself does not check the data it gets back, which is why it usually passes. Best regards, Jakob ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2016-11-03 23:45 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <18e9fa0f-ec31-9107-459c-ae1694503f87@gmail.com>
2016-11-01 23:51 ` tmpfs returns incorrect data on concurrent pread() and truncate() Hugh Dickins
2016-11-02 1:01 ` Dave Chinner
2016-11-02 1:38 ` Hugh Dickins
2016-11-02 4:01 ` Dave Chinner
2016-11-02 13:21 ` Theodore Ts'o
2016-11-03 23:45 ` Jakob Unterwurzacher
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).