* [PATCH 000 of 2] Fix some bugs with 'read' racing with 'truncate' @ 2007-06-07 1:46 NeilBrown 2007-06-07 1:46 ` [PATCH 001 of 2] Fix read/truncate race NeilBrown 2007-06-07 1:47 ` [PATCH 002 of 2] Make sure readv stops reading when it hits end-of-file NeilBrown 0 siblings, 2 replies; 9+ messages in thread From: NeilBrown @ 2007-06-07 1:46 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Jens Axboe, Neil Brown, Nick Piggin The following two patches fix a couple of bugs which trigger when read races with truncate. As there is no locking between read and truncate, we need to be careful about sequencing. In some cases were aren't careful enough. The first patch ensures that we check i_size *after* gaining a reference to an uptodate page, thus ensuring that we don't unknowingly returns NUL characters that are beyond the end of the file. The second ensures that we don't deliver partial reads to more than one sub-buffer in a readv call. These bugs have been around for a while and are quite subtle, so I don't think there is any rush for them to go into 2.6.22. Rather they should sit in -mm at least until .23 opens up. Thanks, NeilBrown [PATCH 001 of 2] Fix read/truncate race. [PATCH 002 of 2] Make sure readv stops reading when it hits end-of-file. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 001 of 2] Fix read/truncate race. 2007-06-07 1:46 [PATCH 000 of 2] Fix some bugs with 'read' racing with 'truncate' NeilBrown @ 2007-06-07 1:46 ` NeilBrown 2007-06-07 7:19 ` Jens Axboe 2007-06-08 1:18 ` Andrew Morton 2007-06-07 1:47 ` [PATCH 002 of 2] Make sure readv stops reading when it hits end-of-file NeilBrown 1 sibling, 2 replies; 9+ messages in thread From: NeilBrown @ 2007-06-07 1:46 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Jens Axboe, Neil Brown, Nick Piggin do_generic_mapping_read currently samples the i_size at the start and doesn't do so again unless it needs to call ->readpage to load a page. After ->readpage it has to re-sample i_size as a truncate may have caused that page to be filled with zeros, and the read() call should not see these. However there are other activities that might cause ->readpage to be called on a page between the time that do_generic_mapping_read samples i_size and when it finds that it has an uptodate page. These include at least read-ahead and possibly another thread performing a read. So do_generic_mapping_read must sample i_size *after* it has an uptodate page. Thus the current sampling at the start and after a read can be replaced with a sampling before the copy-out. The same change applied to __generic_file_splice_read. Note that this fixes any race with truncate_complete_page, but does not fix a possible race with truncate_partial_page. If a partial truncate happens after do_generic_mapping_read samples i_size and before the copy_out, the nuls that truncate_partial_page place in the page could be copied out incorrectly. I think the best fix for that is to *not* zero out parts of the page in truncate_partial_page, but rather to zero out the tail of a page when increasing i_size. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Acked-by: Nick Piggin <npiggin@suse.de> (for the do_generic_mapping_read part) ### Diffstat output ./fs/splice.c | 43 +++++++++++++++++----------------- ./mm/filemap.c | 72 ++++++++++++++++++++++----------------------------------- 2 files changed, 50 insertions(+), 65 deletions(-) diff .prev/fs/splice.c ./fs/splice.c --- .prev/fs/splice.c 2007-06-07 11:34:16.000000000 +1000 +++ ./fs/splice.c 2007-06-07 11:34:23.000000000 +1000 @@ -412,31 +412,32 @@ __generic_file_splice_read(struct file * break; } - /* - * i_size must be checked after ->readpage(). - */ - isize = i_size_read(mapping->host); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; - if (unlikely(!isize || index > end_index)) - break; + } + fill_it: + /* + * i_size must be checked after PageUptodate. + */ + isize = i_size_read(mapping->host); + end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + if (unlikely(!isize || index > end_index)) + break; + /* + * if this is the last page, see if we need to shrink + * the length and stop + */ + if (end_index == index) { + loff = PAGE_CACHE_SIZE - (isize & ~PAGE_CACHE_MASK); + if (total_len + loff > isize) + break; /* - * if this is the last page, see if we need to shrink - * the length and stop + * force quit after adding this page */ - if (end_index == index) { - loff = PAGE_CACHE_SIZE - (isize & ~PAGE_CACHE_MASK); - if (total_len + loff > isize) - break; - /* - * force quit after adding this page - */ - len = this_len; - this_len = min(this_len, loff); - loff = 0; - } + len = this_len; + this_len = min(this_len, loff); + loff = 0; } -fill_it: + partial[page_nr].offset = loff; partial[page_nr].len = this_len; len -= this_len; diff .prev/mm/filemap.c ./mm/filemap.c --- .prev/mm/filemap.c 2007-06-07 11:34:16.000000000 +1000 +++ ./mm/filemap.c 2007-06-07 11:34:24.000000000 +1000 @@ -871,13 +871,11 @@ void do_generic_mapping_read(struct addr { struct inode *inode = mapping->host; unsigned long index; - unsigned long end_index; unsigned long offset; unsigned long last_index; unsigned long next_index; unsigned long prev_index; unsigned int prev_offset; - loff_t isize; int error; struct file_ra_state ra = *_ra; @@ -888,27 +886,12 @@ void do_generic_mapping_read(struct addr last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; offset = *ppos & ~PAGE_CACHE_MASK; - isize = i_size_read(inode); - if (!isize) - goto out; - - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; for (;;) { struct page *page; + unsigned long end_index; + loff_t isize; unsigned long nr, ret; - /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_CACHE_SIZE; - if (index >= end_index) { - if (index > end_index) - goto out; - nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; - if (nr <= offset) { - goto out; - } - } - nr = nr - offset; - cond_resched(); find_page: page = find_get_page(mapping, index); @@ -928,6 +911,32 @@ find_page: if (!PageUptodate(page)) goto page_not_up_to_date; page_ok: + /* + * i_size must be checked after we know the page is Uptodate. + * + * Checking i_size after the check allows us to calculate + * the correct value for "nr", which means the zero-filled + * part of the page is not copied back to userspace (unless + * another truncate extends the file - this is desired though). + */ + + isize = i_size_read(inode); + end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + if (unlikely(!isize || index > end_index)) { + page_cache_release(page); + goto out; + } + + /* nr is the maximum number of bytes to copy from this page */ + nr = PAGE_CACHE_SIZE; + if (index == end_index) { + nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; + if (nr <= offset) { + page_cache_release(page); + goto out; + } + } + nr = nr - offset; /* If users can be writing to this page using arbitrary * virtual addresses, take care about potential aliasing @@ -1014,31 +1023,6 @@ readpage: unlock_page(page); } - /* - * i_size must be checked after we have done ->readpage. - * - * Checking i_size after the readpage allows us to calculate - * the correct value for "nr", which means the zero-filled - * part of the page is not copied back to userspace (unless - * another truncate extends the file - this is desired though). - */ - isize = i_size_read(inode); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; - if (unlikely(!isize || index > end_index)) { - page_cache_release(page); - goto out; - } - - /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_CACHE_SIZE; - if (index == end_index) { - nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; - if (nr <= offset) { - page_cache_release(page); - goto out; - } - } - nr = nr - offset; goto page_ok; readpage_error: ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 001 of 2] Fix read/truncate race. 2007-06-07 1:46 ` [PATCH 001 of 2] Fix read/truncate race NeilBrown @ 2007-06-07 7:19 ` Jens Axboe 2007-06-08 1:18 ` Andrew Morton 1 sibling, 0 replies; 9+ messages in thread From: Jens Axboe @ 2007-06-07 7:19 UTC (permalink / raw) To: NeilBrown; +Cc: Andrew Morton, linux-kernel, Nick Piggin On Thu, Jun 07 2007, NeilBrown wrote: > > do_generic_mapping_read currently samples the i_size at the start > and doesn't do so again unless it needs to call ->readpage to load > a page. After ->readpage it has to re-sample i_size as a truncate > may have caused that page to be filled with zeros, and the read() > call should not see these. > > However there are other activities that might cause ->readpage to be > called on a page between the time that do_generic_mapping_read > samples i_size and when it finds that it has an uptodate page. These > include at least read-ahead and possibly another thread performing a > read. > > So do_generic_mapping_read must sample i_size *after* it has an > uptodate page. Thus the current sampling at the start and after a read > can be replaced with a sampling before the copy-out. > > The same change applied to __generic_file_splice_read. > > Note that this fixes any race with truncate_complete_page, but does > not fix a possible race with truncate_partial_page. If a partial > truncate happens after do_generic_mapping_read samples i_size and > before the copy_out, the nuls that truncate_partial_page place in the > page could be copied out incorrectly. > > I think the best fix for that is to *not* zero out parts of the page > in truncate_partial_page, but rather to zero out the tail of a page > when increasing i_size. Thanks Neil, if you don't mind I'll steal the splice bits and add them to the splice branch, as this particular patch conflicts with other isize fixes that Hugh and I have been working on. -- Jens Axboe ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 001 of 2] Fix read/truncate race. 2007-06-07 1:46 ` [PATCH 001 of 2] Fix read/truncate race NeilBrown 2007-06-07 7:19 ` Jens Axboe @ 2007-06-08 1:18 ` Andrew Morton 2007-06-08 2:48 ` Neil Brown 1 sibling, 1 reply; 9+ messages in thread From: Andrew Morton @ 2007-06-08 1:18 UTC (permalink / raw) To: NeilBrown; +Cc: linux-kernel, Jens Axboe, Nick Piggin On Thu, 7 Jun 2007 11:46:53 +1000 NeilBrown <neilb@suse.de> wrote: > do_generic_mapping_read currently samples the i_size at the start > and doesn't do so again unless it needs to call ->readpage to load > a page. After ->readpage it has to re-sample i_size as a truncate > may have caused that page to be filled with zeros, and the read() > call should not see these. > > However there are other activities that might cause ->readpage to be > called on a page between the time that do_generic_mapping_read > samples i_size and when it finds that it has an uptodate page. These > include at least read-ahead and possibly another thread performing a > read. > > So do_generic_mapping_read must sample i_size *after* it has an > uptodate page. Thus the current sampling at the start and after a read > can be replaced with a sampling before the copy-out. > > The same change applied to __generic_file_splice_read. > > Note that this fixes any race with truncate_complete_page, but does > not fix a possible race with truncate_partial_page. If a partial > truncate happens after do_generic_mapping_read samples i_size and > before the copy_out, the nuls that truncate_partial_page place in the > page could be copied out incorrectly. > > I think the best fix for that is to *not* zero out parts of the page > in truncate_partial_page, but rather to zero out the tail of a page > when increasing i_size. Is this really worth fixing? The code still has races there - for example, another process can come after we've got the page, truncate it off the file then write new data. So this thread ends up copying out-of-date page contents out to userspace. The application is being silly, and the silly application gets wrong data: either out-of-date or zeroed. afacit the patch shrinks the timing windows significantly, but at a cost: moving a bunch of 64-bit arithmetic and seqlock operations into the inner loop. We could plug all this in a more predictable fashion by locking the page, but then we have another instance of the read-a-page-into-a-mmapped-copy-of-itself deadlock (I think - maybe it can't happen because of page uptodateness checks). ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 001 of 2] Fix read/truncate race. 2007-06-08 1:18 ` Andrew Morton @ 2007-06-08 2:48 ` Neil Brown 2007-06-08 6:10 ` Andrew Morton 0 siblings, 1 reply; 9+ messages in thread From: Neil Brown @ 2007-06-08 2:48 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Jens Axboe, Nick Piggin On Thursday June 7, akpm@linux-foundation.org wrote: > > Is this really worth fixing? Hard to say. One could argue that a read should never return data that was never in the file. One could also argue that you should always use locking... But people often look for lock-less solutions. > > The code still has races there - for example, another process can come > after we've got the page, truncate it off the file then write new data. So > this thread ends up copying out-of-date page contents out to userspace. I don't think "out of date" is a problem. As long as the data returned by the "read" matches what was in the file at some moment between when the syscall started and when the syscall completed, it is correct. > > The application is being silly, and the silly application gets wrong data: > either out-of-date or zeroed. "was in the file when syscall commenced" and "was never in the file" are two very different things. "silly" ?? Maybe it is "silly" to return data that was never in the file. > > afacit the patch shrinks the timing windows significantly, but at a cost: > moving a bunch of 64-bit arithmetic and seqlock operations into the inner > loop. For common cases where truncation is to a page boundary, the window for seeing data that was never in the file is shrunk to zero. Partial truncates can still cause a problem as the changelog comment suggests, and require a different solution. Most of the 64-bit arithmetic is already in the main loop, though not quite all. In most cases the seqlock will not loop - how expensive is that? The following patch will remove the extra seqlock except when we actually need it and remove the extra arithmetic - but I haven't tested it or reviewed it properly. I can do that if you think it is the right direction. > > We could plug all this in a more predictable fashion by locking the page, > but then we have another instance of the > read-a-page-into-a-mmapped-copy-of-itself deadlock (I think - maybe it > can't happen because of page uptodateness checks). But do we really want exclusion between multiple readers of the same page? Yes, I expect locking the page would remove all the races, but I suspect it is more expensive that the current alternative. NeilBrown Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./mm/filemap.c | 34 ++++++++++++++++++++++++++++------ 1 file changed, 28 insertions(+), 6 deletions(-) diff .prev/mm/filemap.c ./mm/filemap.c --- .prev/mm/filemap.c 2007-06-07 16:23:37.000000000 +1000 +++ ./mm/filemap.c 2007-06-08 12:43:36.000000000 +1000 @@ -876,6 +876,7 @@ void do_generic_mapping_read(struct addr unsigned long next_index; unsigned long prev_index; unsigned int prev_offset; + loff_t isize; int error; struct file_ra_state ra = *_ra; @@ -886,13 +887,22 @@ void do_generic_mapping_read(struct addr last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; offset = *ppos & ~PAGE_CACHE_MASK; + isize = i_size_read(inode); + if (!isize) + goto out; + + end_index = (isize - 1) >> PAGE_CACHE_SHIFT; for (;;) { struct page *page; unsigned long end_index; - loff_t isize; unsigned long nr, ret; cond_resched(); + + if (unlikely(index > end_index)) { + page_cache_release(page); + goto out; + } find_page: page = find_get_page(mapping, index); if (!page) { @@ -918,13 +928,25 @@ page_ok: * the correct value for "nr", which means the zero-filled * part of the page is not copied back to userspace (unless * another truncate extends the file - this is desired though). + * + * NOTE: This access of inode->i_size is not protected + * and if there is a concurrent update on a 32bit machine, + * it could return the wrong value. This could only be a problem + * if i_size has actually changed to a smaller value before the + * page became uptodate, and at this point it still has a smaller + * value, but due to a race while reading, it appears unchanged. + * The chances of this happening are so small and the consequence + * sufficiently minor, that the cost of the seqlock seems + * not to be justified. */ - isize = i_size_read(inode); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; - if (unlikely(!isize || index > end_index)) { - page_cache_release(page); - goto out; + if (unlikely(isize != inode->i_size)) { + isize = i_size_read(inode); + end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + if (unlikely(!isize || index > end_index)) { + page_cache_release(page); + goto out; + } } /* nr is the maximum number of bytes to copy from this page */ ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 001 of 2] Fix read/truncate race. 2007-06-08 2:48 ` Neil Brown @ 2007-06-08 6:10 ` Andrew Morton 2007-06-12 0:16 ` Neil Brown 0 siblings, 1 reply; 9+ messages in thread From: Andrew Morton @ 2007-06-08 6:10 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel, Jens Axboe, Nick Piggin On Fri, 8 Jun 2007 12:48:48 +1000 Neil Brown <neilb@suse.de> wrote: > The following patch will remove the extra seqlock except when we > actually need it and remove the extra arithmetic - but I haven't > tested it or reviewed it properly. I can do that if you think it is > the right direction. Yes, the optimisation is valid and looks useful. > ./mm/filemap.c | 34 ++++++++++++++++++++++++++++------ It didn't apply - your tree seems different from mine. > + * > + * NOTE: This access of inode->i_size is not protected > + * and if there is a concurrent update on a 32bit machine, > + * it could return the wrong value. This could only be a problem > + * if i_size has actually changed to a smaller value before the > + * page became uptodate, and at this point it still has a smaller > + * value, but due to a race while reading, it appears unchanged. > + * The chances of this happening are so small and the consequence > + * sufficiently minor, that the cost of the seqlock seems > + * not to be justified. please consider incorporating scripts/checkpatch.pl into your patch preparation toolchain. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 001 of 2] Fix read/truncate race. 2007-06-08 6:10 ` Andrew Morton @ 2007-06-12 0:16 ` Neil Brown 2007-06-12 0:35 ` Andrew Morton 0 siblings, 1 reply; 9+ messages in thread From: Neil Brown @ 2007-06-12 0:16 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Jens Axboe, Nick Piggin On Thursday June 7, akpm@linux-foundation.org wrote: > On Fri, 8 Jun 2007 12:48:48 +1000 Neil Brown <neilb@suse.de> wrote: > > > The following patch will remove the extra seqlock except when we > > actually need it and remove the extra arithmetic - but I haven't > > tested it or reviewed it properly. I can do that if you think it is > > the right direction. > > Yes, the optimisation is valid and looks useful. > > > ./mm/filemap.c | 34 ++++++++++++++++++++++++++++------ > > It didn't apply - your tree seems different from mine. Odd. I had no other changes to that file in my tree. I'll wait until the next -mm(?). It's just as well really, the patch was buggy - didn't even compile and (as I said) totally untested. I'll get you a tested patch after I can rebase. > > > + * > > + * NOTE: This access of inode->i_size is not protected > > + * and if there is a concurrent update on a 32bit machine, > > + * it could return the wrong value. This could only be a problem > > + * if i_size has actually changed to a smaller value before the > > + * page became uptodate, and at this point it still has a smaller > > + * value, but due to a race while reading, it appears unchanged. > > + * The chances of this happening are so small and the consequence > > + * sufficiently minor, that the cost of the seqlock seems > > + * not to be justified. > > please consider incorporating scripts/checkpatch.pl into your patch > preparation toolchain. Done... Any reason that it isn't executable (chmod +x)? NeilBrown ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 001 of 2] Fix read/truncate race. 2007-06-12 0:16 ` Neil Brown @ 2007-06-12 0:35 ` Andrew Morton 0 siblings, 0 replies; 9+ messages in thread From: Andrew Morton @ 2007-06-12 0:35 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel, Jens Axboe, Nick Piggin On Tue, 12 Jun 2007 10:16:22 +1000 Neil Brown <neilb@suse.de> wrote: > > please consider incorporating scripts/checkpatch.pl into your patch > > preparation toolchain. > > Done... Any reason that it isn't executable (chmod +x)? It is executable now (Linus did a chmod). However I think it was wrong to do this. Because lots of people will lose that X bit (say, people who download and use patch-2.6.22.gz). But _some_ people will have their X bit set, so they will go write scripts which assume X permissions, only to find that those scripts break on other people's systems. So to force the lowest-common-denominator, we should have left X unset. Oh well. ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 002 of 2] Make sure readv stops reading when it hits end-of-file. 2007-06-07 1:46 [PATCH 000 of 2] Fix some bugs with 'read' racing with 'truncate' NeilBrown 2007-06-07 1:46 ` [PATCH 001 of 2] Fix read/truncate race NeilBrown @ 2007-06-07 1:47 ` NeilBrown 1 sibling, 0 replies; 9+ messages in thread From: NeilBrown @ 2007-06-07 1:47 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, Neil Brown The do_loop_readv_writev implementation of readv breaks out of the loop as soon as a single read request didn't fill it's buffer: if (nr != len) break; The generic_file_aio_read version doesn't. So if it hits EOF before the end of the list of buffers, it will try again on the next buffer. If the file was extended in the mean time, this will produce a bad result. Signed-off-by: Neil Brown <neilb@suse.de> ### Diffstat output ./mm/filemap.c | 2 ++ 1 file changed, 2 insertions(+) diff .prev/mm/filemap.c ./mm/filemap.c --- .prev/mm/filemap.c 2007-06-07 11:34:24.000000000 +1000 +++ ./mm/filemap.c 2007-06-07 11:39:35.000000000 +1000 @@ -1206,6 +1206,8 @@ generic_file_aio_read(struct kiocb *iocb retval = retval ?: desc.error; break; } + if (desc.count > 0) + break; } } out: ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2007-06-12 0:35 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-06-07 1:46 [PATCH 000 of 2] Fix some bugs with 'read' racing with 'truncate' NeilBrown 2007-06-07 1:46 ` [PATCH 001 of 2] Fix read/truncate race NeilBrown 2007-06-07 7:19 ` Jens Axboe 2007-06-08 1:18 ` Andrew Morton 2007-06-08 2:48 ` Neil Brown 2007-06-08 6:10 ` Andrew Morton 2007-06-12 0:16 ` Neil Brown 2007-06-12 0:35 ` Andrew Morton 2007-06-07 1:47 ` [PATCH 002 of 2] Make sure readv stops reading when it hits end-of-file NeilBrown
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox