* [patch 0/9] writeback data integrity and other fixes (take 3)
@ 2008-10-28 14:47 npiggin
2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin
` (10 more replies)
0 siblings, 11 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason
OK, I'm happier with this patchset now. Note that I've taken your patch
and mangled it a bit at the end of the series.
This one survives and seems to run OK here, but I'm mainly doing dumb
stress testing with a handful of filesystems, and data-io error injection
testing. There are a lot of combinations of ways this function can operate
and interact obviously, so it would be helpful to get more review.
Chris, would you possibly have time to run your btrfs tests that are
sensitive to problems in this code? I could provide you a single patch
rollup against mainline if it helps.
--
^ permalink raw reply [flat|nested] 53+ messages in thread* [patch 1/9] mm: write_cache_pages cyclic fix 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-29 0:24 ` [patch 1.1/9] mm: write_cache_pages cyclic fix fix Nick Piggin 2008-10-28 14:47 ` [patch 2/9] mm: write_cache_pages early loop termination npiggin ` (9 subsequent siblings) 10 siblings, 1 reply; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-cyclic-fix.patch --] [-- Type: text/plain, Size: 2744 bytes --] In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback has circled through zero, thus we should not circle again. However it gets set to 1 after the first successful pagevec lookup. This leads to cases where not enough data gets written. Counterexample: file with first 10 pages dirty, writeback_index == 5, nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set to 1, after writing those out, we will not cycle back to get the first 5. Rework this logic, now we'll always cycle unless we started off from index 0. When cycling, only write out as far as 1 page before the start page from the first cycle (so we don't write parts of the file twice). Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -868,9 +868,10 @@ int write_cache_pages(struct address_spa int done = 0; struct pagevec pvec; int nr_pages; + pgoff_t writeback_index; pgoff_t index; pgoff_t end; /* Inclusive */ - int scanned = 0; + int cycled; int range_whole = 0; long nr_to_write = wbc->nr_to_write; @@ -881,14 +882,19 @@ int write_cache_pages(struct address_spa pagevec_init(&pvec, 0); if (wbc->range_cyclic) { - index = mapping->writeback_index; /* Start from prev offset */ + writeback_index = mapping->writeback_index; /* prev offset */ + index = writeback_index; + if (index == 0) + cycled = 1; + else + cycled = 0; end = -1; } else { index = wbc->range_start >> PAGE_CACHE_SHIFT; end = wbc->range_end >> PAGE_CACHE_SHIFT; if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) range_whole = 1; - scanned = 1; + cycled = 1; /* ignore range_cyclic tests */ } retry: while (!done && (index <= end) && @@ -897,7 +903,6 @@ retry: min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { unsigned i; - scanned = 1; for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; @@ -915,7 +920,11 @@ retry: continue; } - if (!wbc->range_cyclic && page->index > end) { + if (page->index > end) { + /* + * can't be range_cyclic (1st pass) because + * end == -1 in that case. + */ done = 1; unlock_page(page); continue; @@ -946,13 +955,15 @@ retry: pagevec_release(&pvec); cond_resched(); } - if (!scanned && !done) { + if (!cycled) { /* + * range_cyclic: * We hit the last page and there is more work to be done: wrap * back to the start of the file */ - scanned = 1; + cycled = 1; index = 0; + end = writeback_index - 1; goto retry; } if (!wbc->no_nrwrite_index_update) { -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 1.1/9] mm: write_cache_pages cyclic fix fix 2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin @ 2008-10-29 0:24 ` Nick Piggin 0 siblings, 0 replies; 53+ messages in thread From: Nick Piggin @ 2008-10-29 0:24 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason mm: write_cache_pages cyclic fix fix shut up gcc Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -868,7 +868,7 @@ int write_cache_pages(struct address_spa int done = 0; struct pagevec pvec; int nr_pages; - pgoff_t writeback_index; + pgoff_t uninitialized_var(writeback_index); pgoff_t index; pgoff_t end; /* Inclusive */ int cycled; ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 2/9] mm: write_cache_pages early loop termination 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin 2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-28 14:47 ` [patch 3/9] mm: write_cache_pages writepage error fix npiggin ` (8 subsequent siblings) 10 siblings, 0 replies; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-breakearly-fix.patch --] [-- Type: text/plain, Size: 1999 bytes --] We'd like to break out of the loop early in many situations, however the existing code has been setting mapping->writeback_index past the final page in the pagevec lookup for cyclic writeback. This is a problem if we don't process all pages up to the final page. Currently the code mostly keeps writeback_index reasonable and hacked around this by not breaking out of the loop or writing pages outside the range in these cases. Keep track of a real "done index" that enables us to terminate the loop in a much more flexible manner. Needed by the subsequent patch to preserve writepage errors, and then further patches to break out of the loop early for other reasons. However there are no functional changes with this patch alone. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -871,6 +871,7 @@ int write_cache_pages(struct address_spa pgoff_t writeback_index; pgoff_t index; pgoff_t end; /* Inclusive */ + pgoff_t done_index; int cycled; int range_whole = 0; long nr_to_write = wbc->nr_to_write; @@ -897,6 +898,7 @@ int write_cache_pages(struct address_spa cycled = 1; /* ignore range_cyclic tests */ } retry: + done_index = index; while (!done && (index <= end) && (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY, @@ -906,6 +908,8 @@ retry: for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; + done_index = page->index + 1; + /* * At this point we hold neither mapping->tree_lock nor * lock on the page itself: the page may be truncated or @@ -968,7 +972,7 @@ retry: } if (!wbc->no_nrwrite_index_update) { if (wbc->range_cyclic || (range_whole && nr_to_write > 0)) - mapping->writeback_index = index; + mapping->writeback_index = done_index; wbc->nr_to_write = nr_to_write; } -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 3/9] mm: write_cache_pages writepage error fix 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin 2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin 2008-10-28 14:47 ` [patch 2/9] mm: write_cache_pages early loop termination npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-28 14:47 ` [patch 4/9] mm: write_cache_pages integrity fix npiggin ` (7 subsequent siblings) 10 siblings, 0 replies; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-writepage-error-fix.patch --] [-- Type: text/plain, Size: 1571 bytes --] In write_cache_pages, if ret signals a real error, but we still have some pages left in the pagevec, done would be set to 1, but the remaining pages would continue to be processed and ret will be overwritten in the process. It could easily be overwritten with success, and thus success will be returned even if there is an error. Thus the caller is told all writes succeeded, wheras in reality some did not. Fix this by bailing immediately if there is an error, and retaining the first error code. This is a data interity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -944,12 +944,26 @@ retry: } ret = (*writepage)(page, wbc, data); + if (unlikely(ret)) { + if (ret == AOP_WRITEPAGE_ACTIVATE) { + unlock_page(page); + ret = 0; + } else { + /* + * done_index is set past this page, + * so media errors will not choke + * background writeout for the entire + * file. This has consequences for + * range_cyclic semantics (ie. it may + * not be suitable for data integrity + * writeout). + */ + done = 1; + break; + } + } - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { - unlock_page(page); - ret = 0; - } - if (ret || (--nr_to_write <= 0)) + if (--nr_to_write <= 0) done = 1; if (wbc->nonblocking && bdi_write_congested(bdi)) { wbc->encountered_congestion = 1; -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 4/9] mm: write_cache_pages integrity fix 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (2 preceding siblings ...) 2008-10-28 14:47 ` [patch 3/9] mm: write_cache_pages writepage error fix npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-28 14:47 ` [patch 5/9] mm: write_cache_pages cleanups npiggin ` (6 subsequent siblings) 10 siblings, 0 replies; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-integrity-fix.patch --] [-- Type: text/plain, Size: 2785 bytes --] In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so the function will return success after writing out nr_to_write pages, even if that was not sufficient to guarantee data integrity. The callers tend to set it to values that could break data interity semantics easily in practice. For example, nr_to_write can be set to mapping->nr_pages * 2, however if a file has a single, dirty page, then fsync is called, subsequent pages might be concurrently added and dirtied, then write_cache_pages might writeout two of these newly dirty pages, while not writing out the old page that should have been written out. Fix this by ignoring nr_to_write if it is a data integrity sync. This is a data interity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> --- The reason this has been done in the past is to avoid stalling sync operations behind page dirtiers. "If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow." What we to today is return success after an arbitrary amount of pages are written, whether or not we have provided the data-integrity semantics that the caller has asked for. Even this doesn't actually fix all stall cases completely: in the above situation, if the file has a huge number of pages in pagecache (but not dirty), then mapping->nrpages is going to be huge, even if pages are being dirtied. This change does indeed make the possibility of long stalls lager, and that's not a good thing, but lying about data integrity is even worse. We have to either perform the sync, or return -ELINUXISLAME so at least the caller knows what has happened. There are subsequent competing approaches in the works to solve the stall problems properly, without compromising data integrity. Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -963,8 +963,10 @@ retry: } } - if (--nr_to_write <= 0) - done = 1; + if (wbc->sync_mode == WB_SYNC_NONE) { + if (--wbc->nr_to_write <= 0) + done = 1; + } if (wbc->nonblocking && bdi_write_congested(bdi)) { wbc->encountered_congestion = 1; done = 1; Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -210,7 +210,7 @@ int __filemap_fdatawrite_range(struct ad int ret; struct writeback_control wbc = { .sync_mode = sync_mode, - .nr_to_write = mapping->nrpages * 2, + .nr_to_write = LONG_MAX, .range_start = start, .range_end = end, }; -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 5/9] mm: write_cache_pages cleanups 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (3 preceding siblings ...) 2008-10-28 14:47 ` [patch 4/9] mm: write_cache_pages integrity fix npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-28 14:47 ` [patch 6/9] mm: write_cache_pages optimise page cleaning npiggin ` (5 subsequent siblings) 10 siblings, 0 replies; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-cleanup.patch --] [-- Type: text/plain, Size: 2190 bytes --] Get rid of some complex expressions from flow control statements, add a comment, remove some duplicate code. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -899,11 +899,14 @@ int write_cache_pages(struct address_spa } retry: done_index = index; - while (!done && (index <= end) && - (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, - PAGECACHE_TAG_DIRTY, - min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { - unsigned i; + while (!done && (index <= end)) { + int i; + + nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, + PAGECACHE_TAG_DIRTY, + min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); + if (nr_pages == 0) + break; for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; @@ -919,7 +922,16 @@ retry: */ lock_page(page); + /* + * Page truncated or invalidated. We can freely skip it + * then, even for data integrity operations: the page + * has disappeared concurrently, so there could be no + * real expectation of this data interity operation + * even if there is now a new, dirty page at the same + * pagecache address. + */ if (unlikely(page->mapping != mapping)) { +continue_unlock: unlock_page(page); continue; } @@ -930,18 +942,15 @@ retry: * end == -1 in that case. */ done = 1; - unlock_page(page); - continue; + goto continue_unlock; } if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || - !clear_page_dirty_for_io(page)) { - unlock_page(page); - continue; - } + !clear_page_dirty_for_io(page)) + goto continue_unlock; ret = (*writepage)(page, wbc, data); if (unlikely(ret)) { @@ -964,7 +973,8 @@ retry: } if (wbc->sync_mode == WB_SYNC_NONE) { - if (--wbc->nr_to_write <= 0) + wbc->nr_to_write--; + if (wbc->nr_to_write <= 0) done = 1; } if (wbc->nonblocking && bdi_write_congested(bdi)) { -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 6/9] mm: write_cache_pages optimise page cleaning 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (4 preceding siblings ...) 2008-10-28 14:47 ` [patch 5/9] mm: write_cache_pages cleanups npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin ` (4 subsequent siblings) 10 siblings, 0 replies; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-writeback-clean-opt.patch --] [-- Type: text/plain, Size: 1314 bytes --] In write_cache_pages, if we get stuck behind another process that is cleaning pages, we will be forced to wait for them to finish, then perform our own writeout (if it was redirtied during the long wait), then wait for that. If a page under writeout is still clean, we can skip waiting for it (if we're part of a data integrity sync, we'll be waiting for all writeout pages afterwards, so we'll still be waiting for the other guy's write that's cleaned the page). Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -945,11 +945,20 @@ continue_unlock: goto continue_unlock; } - if (wbc->sync_mode != WB_SYNC_NONE) - wait_on_page_writeback(page); + if (!PageDirty(page)) { + /* someone wrote it for us */ + goto continue_unlock; + } + + if (PageWriteback(page)) { + if (wbc->sync_mode != WB_SYNC_NONE) + wait_on_page_writeback(page); + else + goto continue_unlock; + } - if (PageWriteback(page) || - !clear_page_dirty_for_io(page)) + BUG_ON(PageWriteback(page)); + if (!clear_page_dirty_for_io(page)) goto continue_unlock; ret = (*writepage)(page, wbc, data); -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 7/9] mm: write_cache_pages terminate quickly 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (5 preceding siblings ...) 2008-10-28 14:47 ` [patch 6/9] mm: write_cache_pages optimise page cleaning npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-30 23:07 ` Andrew Morton 2008-10-28 14:47 ` [patch 8/9] mm: write_cache_pages more " npiggin ` (3 subsequent siblings) 10 siblings, 1 reply; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-terminate-quickly.patch --] [-- Type: text/plain, Size: 1730 bytes --] Terminate the write_cache_pages loop upon encountering the first page past end, without locking the page. Pages cannot have their index change when we have a reference on them (truncate, eg truncate_inode_pages_range performs the same check without the page lock). Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -911,15 +911,24 @@ retry: for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; - done_index = page->index + 1; - /* - * At this point we hold neither mapping->tree_lock nor - * lock on the page itself: the page may be truncated or - * invalidated (changing page->mapping to NULL), or even - * swizzled back from swapper_space to tmpfs file - * mapping + * At this point, the page may be truncated or + * invalidated (changing page->mapping to NULL), or + * even swizzled back from swapper_space to tmpfs file + * mapping. However, page->index will not change + * because we have a reference on the page. */ + if (page->index > end) { + /* + * can't be range_cyclic (1st pass) because + * end == -1 in that case. + */ + done = 1; + break; + } + + done_index = page->index + 1; + lock_page(page); /* @@ -936,15 +945,6 @@ continue_unlock: continue; } - if (page->index > end) { - /* - * can't be range_cyclic (1st pass) because - * end == -1 in that case. - */ - done = 1; - goto continue_unlock; - } - if (!PageDirty(page)) { /* someone wrote it for us */ goto continue_unlock; -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 7/9] mm: write_cache_pages terminate quickly 2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin @ 2008-10-30 23:07 ` Andrew Morton 2008-10-31 7:29 ` Nick Piggin 0 siblings, 1 reply; 53+ messages in thread From: Andrew Morton @ 2008-10-30 23:07 UTC (permalink / raw) To: npiggin; +Cc: linux-fsdevel, david, chris.mason On Wed, 29 Oct 2008 01:47:22 +1100 npiggin@suse.de wrote: > Terminate the write_cache_pages loop upon encountering the first page past > end, without locking the page. Pages cannot have their index change when we > have a reference on them (truncate, eg truncate_inode_pages_range performs > the same check without the page lock). > Traditionally lock_page() is used to stabilise ->index and ->mapping. Here you introduce a new and very subtle sort-of-locking rule without actually really introducing it at all. OK, there's a little comment buried way down in this function. But there's a contradictory comment over truncate_inode_pages_range() ("When looking at..."). How do we make this new locking rule maintainable? How do we avoid breaking it in the future? How do we prevent accidental breakage from slipping past developers' and reviewers' attention? Given the additional maintenance burdens, is this change worth doing at all? > --- > Index: linux-2.6/mm/page-writeback.c > =================================================================== > --- linux-2.6.orig/mm/page-writeback.c > +++ linux-2.6/mm/page-writeback.c > @@ -911,15 +911,24 @@ retry: > for (i = 0; i < nr_pages; i++) { > struct page *page = pvec.pages[i]; > > - done_index = page->index + 1; > - > /* > - * At this point we hold neither mapping->tree_lock nor > - * lock on the page itself: the page may be truncated or > - * invalidated (changing page->mapping to NULL), or even > - * swizzled back from swapper_space to tmpfs file > - * mapping > + * At this point, the page may be truncated or > + * invalidated (changing page->mapping to NULL), or > + * even swizzled back from swapper_space to tmpfs file > + * mapping. However, page->index will not change > + * because we have a reference on the page. > */ > + if (page->index > end) { > + /* > + * can't be range_cyclic (1st pass) because > + * end == -1 in that case. > + */ > + done = 1; > + break; > + } > + > + done_index = page->index + 1; > + > lock_page(page); > > /* > @@ -936,15 +945,6 @@ continue_unlock: > continue; > } > > - if (page->index > end) { > - /* > - * can't be range_cyclic (1st pass) because > - * end == -1 in that case. > - */ > - done = 1; > - goto continue_unlock; > - } > - > if (!PageDirty(page)) { > /* someone wrote it for us */ > goto continue_unlock; > > -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 7/9] mm: write_cache_pages terminate quickly 2008-10-30 23:07 ` Andrew Morton @ 2008-10-31 7:29 ` Nick Piggin 0 siblings, 0 replies; 53+ messages in thread From: Nick Piggin @ 2008-10-31 7:29 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-fsdevel, david, chris.mason On Thu, Oct 30, 2008 at 04:07:46PM -0700, Andrew Morton wrote: > On Wed, 29 Oct 2008 01:47:22 +1100 > npiggin@suse.de wrote: > > > Terminate the write_cache_pages loop upon encountering the first page past > > end, without locking the page. Pages cannot have their index change when we > > have a reference on them (truncate, eg truncate_inode_pages_range performs > > the same check without the page lock). > > > > Traditionally lock_page() is used to stabilise ->index and ->mapping. Well, mapping. index of course is irrelevant without mapping, *except* for a "where did we get to" kind of thing. But it has been used in that way for a long time. > Here you introduce a new and very subtle sort-of-locking rule without > actually really introducing it at all. OK, there's a little comment > buried way down in this function. But there's a contradictory comment > over truncate_inode_pages_range() ("When looking at..."). That comment is actually wrong. Index won't change. If index could change randomly, then we could skip pages here if index skips forwards. pagevec pagecache tag lookup functions would be broken in general actually. > How do we make this new locking rule maintainable? How do we avoid > breaking it in the future? How do we prevent accidental breakage from > slipping past developers' and reviewers' attention? It's actually fairly fundamental. Even more fundamental than the above functions I quote. If we have any place that does: lock_page(page) if (!page->mapping) /* truncate got to it */ but does not check the index of the page (which most don't), then it could have moved from where we first got it from (which would not always be a bug, but often could be). read(2) syscall actually also doesn't lock the page by default. Having the page move somewhere else would be a disaster for it. I guess it's not explicitly documented AFAIKS, but I thought it is a hard rule. Is there anywhere useful we can write it that people will actually read? OTOH, there isn't a lot of places that could be doing this. Some wild filesystem might think they own the pagecache I guess. I know that when it came up in splice, I told Jens we can't move a page with references on it even if it is locked... > Given the additional maintenance burdens, is this change worth doing > at all? > > > > --- > > Index: linux-2.6/mm/page-writeback.c > > =================================================================== > > --- linux-2.6.orig/mm/page-writeback.c > > +++ linux-2.6/mm/page-writeback.c > > @@ -911,15 +911,24 @@ retry: > > for (i = 0; i < nr_pages; i++) { > > struct page *page = pvec.pages[i]; > > > > - done_index = page->index + 1; > > - > > /* > > - * At this point we hold neither mapping->tree_lock nor > > - * lock on the page itself: the page may be truncated or > > - * invalidated (changing page->mapping to NULL), or even > > - * swizzled back from swapper_space to tmpfs file > > - * mapping > > + * At this point, the page may be truncated or > > + * invalidated (changing page->mapping to NULL), or > > + * even swizzled back from swapper_space to tmpfs file > > + * mapping. However, page->index will not change > > + * because we have a reference on the page. > > */ > > + if (page->index > end) { > > + /* > > + * can't be range_cyclic (1st pass) because > > + * end == -1 in that case. > > + */ > > + done = 1; > > + break; > > + } > > + > > + done_index = page->index + 1; > > + > > lock_page(page); > > > > /* > > @@ -936,15 +945,6 @@ continue_unlock: > > continue; > > } > > > > - if (page->index > end) { > > - /* > > - * can't be range_cyclic (1st pass) because > > - * end == -1 in that case. > > - */ > > - done = 1; > > - goto continue_unlock; > > - } > > - > > if (!PageDirty(page)) { > > /* someone wrote it for us */ > > goto continue_unlock; > > > > -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 8/9] mm: write_cache_pages more terminate quickly 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (6 preceding siblings ...) 2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin ` (2 subsequent siblings) 10 siblings, 0 replies; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-wcp-terminate-quickly2.patch --] [-- Type: text/plain, Size: 1116 bytes --] From: Andrew Morton <akpm@linux-foundation.org> Now that we have the early-termination logic in place, it makes sense to bail out early in all other cases where done is set to 1. Signed-off-by: Nick Piggin <npiggin@suse.de> --- I had a hard time with this one replacing the done flag with gotos. It can be done one way or the other, but I didn't find anything completely pleasing. Importantly, the new logic is in place, and subsequent cleanup or refactoring patches would be welcome (I just wasn't able to come up with something clever enough myself) Index: linux-2.6/mm/page-writeback.c =================================================================== --- linux-2.6.orig/mm/page-writeback.c +++ linux-2.6/mm/page-writeback.c @@ -983,12 +983,15 @@ continue_unlock: if (wbc->sync_mode == WB_SYNC_NONE) { wbc->nr_to_write--; - if (wbc->nr_to_write <= 0) + if (wbc->nr_to_write <= 0) { done = 1; + break; + } } if (wbc->nonblocking && bdi_write_congested(bdi)) { wbc->encountered_congestion = 1; done = 1; + break; } } pagevec_release(&pvec); -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (7 preceding siblings ...) 2008-10-28 14:47 ` [patch 8/9] mm: write_cache_pages more " npiggin @ 2008-10-28 14:47 ` npiggin 2008-10-30 23:13 ` Andrew Morton 2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin 2008-10-28 23:14 ` Dave Chinner 10 siblings, 1 reply; 53+ messages in thread From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw) To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason [-- Attachment #1: mm-do-sync-mapping-range-fix.patch --] [-- Type: text/plain, Size: 645 bytes --] Chris Mason notices do_sync_mapping_range didn't actually ask for data integrity writeout. Unfortunately, it is advertised as being usable for data integrity operations. This is a data interity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> --- Index: linux-2.6/fs/sync.c =================================================================== --- linux-2.6.orig/fs/sync.c +++ linux-2.6/fs/sync.c @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address if (flags & SYNC_FILE_RANGE_WRITE) { ret = __filemap_fdatawrite_range(mapping, offset, endbyte, - WB_SYNC_NONE); + WB_SYNC_ALL); if (ret < 0) goto out; } -- ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin @ 2008-10-30 23:13 ` Andrew Morton 2008-10-31 9:16 ` Nick Piggin 0 siblings, 1 reply; 53+ messages in thread From: Andrew Morton @ 2008-10-30 23:13 UTC (permalink / raw) To: npiggin; +Cc: linux-fsdevel, david, chris.mason On Wed, 29 Oct 2008 01:47:24 +1100 npiggin@suse.de wrote: > Chris Mason notices do_sync_mapping_range didn't actually ask for data > integrity writeout. Unfortunately, it is advertised as being usable for > data integrity operations. > > This is a data interity bug. > > Signed-off-by: Nick Piggin <npiggin@suse.de> > --- > Index: linux-2.6/fs/sync.c > =================================================================== > --- linux-2.6.orig/fs/sync.c > +++ linux-2.6/fs/sync.c > @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address > > if (flags & SYNC_FILE_RANGE_WRITE) { > ret = __filemap_fdatawrite_range(mapping, offset, endbyte, > - WB_SYNC_NONE); > + WB_SYNC_ALL); > if (ret < 0) > goto out; > } > Really? Some thought did go into the code which you're "fixing". If the caller is using sync_file_range() for integrity then the caller has done a SYNC_FILE_RANGE_WAIT_BEFORE. So all we need to guarantee here is that __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all dirty pages in the range. Probably that gets broken lower down as part of various hacks^woptimisations have gone in, but which ones, and where? Perhaps _this_ (if it's there) is what should be fixed. And if we _do_ make the above change, we don't need to run the wait_on_page_writeback_range() if userspace asked for SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE, yes? IOW, I don't think enough thought (or at least description of that thought) has gone into this one. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-30 23:13 ` Andrew Morton @ 2008-10-31 9:16 ` Nick Piggin 2008-10-31 10:04 ` Andrew Morton 2008-10-31 14:10 ` Chris Mason 0 siblings, 2 replies; 53+ messages in thread From: Nick Piggin @ 2008-10-31 9:16 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-fsdevel, david, chris.mason On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote: > On Wed, 29 Oct 2008 01:47:24 +1100 > npiggin@suse.de wrote: > > > Chris Mason notices do_sync_mapping_range didn't actually ask for data > > integrity writeout. Unfortunately, it is advertised as being usable for > > data integrity operations. > > > > This is a data interity bug. > > > > Signed-off-by: Nick Piggin <npiggin@suse.de> > > --- > > Index: linux-2.6/fs/sync.c > > =================================================================== > > --- linux-2.6.orig/fs/sync.c > > +++ linux-2.6/fs/sync.c > > @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address > > > > if (flags & SYNC_FILE_RANGE_WRITE) { > > ret = __filemap_fdatawrite_range(mapping, offset, endbyte, > > - WB_SYNC_NONE); > > + WB_SYNC_ALL); > > if (ret < 0) > > goto out; > > } > > > > Really? > > Some thought did go into the code which you're "fixing". Yes, I even remember something of a flamewar involving me and you :) > If the caller > is using sync_file_range() for integrity then the caller has done a > SYNC_FILE_RANGE_WAIT_BEFORE. No disputes about whether the API works "by design". But I think the implementation has a bug. I'll explain: > So all we need to guarantee here is that > __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all > dirty pages in the range. Probably that gets broken lower down as part > of various hacks^woptimisations have gone in, but which ones, and > where? Perhaps _this_ (if it's there) is what should be fixed. WB_SYNC_NONE has never (until this was introduced) been used for data integrity AFAIKS. There is code littered throughout fs/ which assumes WB_SYNC_NONE ~= "efficiency / background writeback". At least definitely the buffer.c "trylock" will cause dirty pages to be skipped. There is also a fair amount of filesystem code I haven't looked at. The fs-writeback.c code might not affect this path, but it also definitely makes the same assumption about WB_SYNC_NONE, so it would be ugly to mandate WB_SYNC_NONE is for data integrity from mapping downard, but not from inode upward... I didn't check, but I suspect this has been broken since it got merged. > And if we _do_ make the above change, we don't need to run the > wait_on_page_writeback_range() if userspace asked for > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE, yes? Now you're asking the hard questions... I think we still have to wait, because SYNC_FILE_RANGE_WRITE itself doesn't necessarily wait for writeback. After the optimisation to skip waiting for clean and writeback pages in write_cache_pages, actually I think this change (to use WB_SYNC_ALL) should not hurt very much... > > > > IOW, I don't think enough thought (or at least description of that > thought) has gone into this one. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-31 9:16 ` Nick Piggin @ 2008-10-31 10:04 ` Andrew Morton 2008-10-31 10:53 ` Nick Piggin 2008-10-31 20:03 ` Jamie Lokier 2008-10-31 14:10 ` Chris Mason 1 sibling, 2 replies; 53+ messages in thread From: Andrew Morton @ 2008-10-31 10:04 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, david, chris.mason On Fri, 31 Oct 2008 10:16:16 +0100 Nick Piggin <npiggin@suse.de> wrote: > > > So all we need to guarantee here is that > > __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all > > dirty pages in the range. Probably that gets broken lower down as part > > of various hacks^woptimisations have gone in, but which ones, and > > where? Perhaps _this_ (if it's there) is what should be fixed. > > WB_SYNC_NONE has never (until this was introduced) been used for data > integrity AFAIKS. There is code littered throughout fs/ which assumes > WB_SYNC_NONE ~= "efficiency / background writeback". At least > definitely the buffer.c "trylock" will cause dirty pages to be > skipped. The change seriously wrecks sync_file_range(SYNC_FILE_RANGE_WRITE), the whole point of which is to as-nonblockingly-as-possible shove some data into the queues. This is useful. Perhaps we could use WB_SYNC_ALL if SYNC_FILE_RANGE_WAIT_BEFORE was specified, or WB_SYNC_NONE if it was not (need to think about SYNC_FILE_RANGE_WAIT_AFTER, too). That's a bit grubby, because userspace could do sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE); sync_file_range(SYNC_FILE_RANGE_WRITE); expecting it to have the same behaviour as sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE); How the hell did that stupid sync_mode thing get into writeback_control? :( We should get rid of WB_SYNC_FOO and migrate to better-defined writeback_control fields. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-31 10:04 ` Andrew Morton @ 2008-10-31 10:53 ` Nick Piggin 2008-10-31 20:03 ` Jamie Lokier 1 sibling, 0 replies; 53+ messages in thread From: Nick Piggin @ 2008-10-31 10:53 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-fsdevel, david, chris.mason On Fri, Oct 31, 2008 at 03:04:32AM -0700, Andrew Morton wrote: > On Fri, 31 Oct 2008 10:16:16 +0100 Nick Piggin <npiggin@suse.de> wrote: > > > > > > So all we need to guarantee here is that > > > __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all > > > dirty pages in the range. Probably that gets broken lower down as part > > > of various hacks^woptimisations have gone in, but which ones, and > > > where? Perhaps _this_ (if it's there) is what should be fixed. > > > > WB_SYNC_NONE has never (until this was introduced) been used for data > > integrity AFAIKS. There is code littered throughout fs/ which assumes > > WB_SYNC_NONE ~= "efficiency / background writeback". At least > > definitely the buffer.c "trylock" will cause dirty pages to be > > skipped. > > > The change seriously wrecks sync_file_range(SYNC_FILE_RANGE_WRITE), the > whole point of which is to as-nonblockingly-as-possible shove some data > into the queues. This is useful. Right, but what's possible to do nonblockingly according to the API isn't very much. Because you can't distinguish data integrity writes from hints (as you note below). IMO we should implement SYNC_FILE_RANGE_ASYNC to do the real "hint" thing. (and aside, but further along this topic, a SYNC_FILE_RANGE_SYNC to complement as well, and allow trivial migration for apps from fsync to an fsync_range equivalent... also a flag or two for metadata might be nice). > Perhaps we could use WB_SYNC_ALL if SYNC_FILE_RANGE_WAIT_BEFORE was > specified, or WB_SYNC_NONE if it was not (need to think about > SYNC_FILE_RANGE_WAIT_AFTER, too). > > That's a bit grubby, because userspace could do > > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE); > sync_file_range(SYNC_FILE_RANGE_WRITE); > > expecting it to have the same behaviour as > > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE); > > > > How the hell did that stupid sync_mode thing get into > writeback_control? :( We should get rid of WB_SYNC_FOO and migrate to > better-defined writeback_control fields. No arguments. It's crusty. But this is the best I think I can do for stable and 2.6.28 kernels. As for how it got there... :) ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-31 10:04 ` Andrew Morton 2008-10-31 10:53 ` Nick Piggin @ 2008-10-31 20:03 ` Jamie Lokier 1 sibling, 0 replies; 53+ messages in thread From: Jamie Lokier @ 2008-10-31 20:03 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, linux-fsdevel, david, chris.mason Andrew Morton wrote: > That's a bit grubby, because userspace could do > > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE); > sync_file_range(SYNC_FILE_RANGE_WRITE); > > expecting it to have the same behaviour as > > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE); Yes it could. The documentation for sync_file_range is unclear enough already. If the distinction is useful, a different SYNC_FILE_RANGE_ flag may be appropriate to indicate write-for-performance vs write-for-integrity, rather than such a peculiar subtlety. -- Jamie ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-31 9:16 ` Nick Piggin 2008-10-31 10:04 ` Andrew Morton @ 2008-10-31 14:10 ` Chris Mason 2008-10-31 14:30 ` steve 2008-11-01 8:04 ` Nick Piggin 1 sibling, 2 replies; 53+ messages in thread From: Chris Mason @ 2008-10-31 14:10 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, david On Fri, 2008-10-31 at 10:16 +0100, Nick Piggin wrote: > On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote: > > On Wed, 29 Oct 2008 01:47:24 +1100 > > npiggin@suse.de wrote: > > > > > Chris Mason notices do_sync_mapping_range didn't actually ask for > data > > > integrity writeout. Unfortunately, it is advertised as being > usable for > > > data integrity operations. > > > > > > This is a data interity bug. > > > [ use WB_SYNC_ALL instead of WB_SYNC_NONE ] > > If the caller > > is using sync_file_range() for integrity then the caller has done a > > SYNC_FILE_RANGE_WAIT_BEFORE. > > No disputes about whether the API works "by design". But I think the > implementation has a bug. I'll explain: I'll definitely agree the current usage is clumsy, and there is a bug in fs/buffer.c. A grep through the rest of the filesystems doesn't turn up many assumptions that WB_SYNC_NONE means it's ok to skip dirty pages. Greps for WB_SYNC_ALL and WB_SYNC_NONE in the fs code reveal: fs/buffer.c:__block_write_full_page() if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) { lock_buffer(bh); } else if (!trylock_buffer(bh)) { redirty_page_for_writepage(wbc, page); continue; } Easily fixed s/||/&&/, which is what XFS does. reiser3 has the same bug in fs/reiserfs/inode.c ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional writeback, both seem fixable with one liners. Everywhere we wait on page writeback while we're trying to build nice big bios hurts performance. I'd rather see us switch to something closer to the do_sync_mapping_range expectation of WB_SYNC_NONE than sprinkle WB_SYNC_ALLs everywhere. -chris ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-31 14:10 ` Chris Mason @ 2008-10-31 14:30 ` steve 2008-10-31 15:02 ` Chris Mason 2008-11-01 8:04 ` Nick Piggin 1 sibling, 1 reply; 53+ messages in thread From: steve @ 2008-10-31 14:30 UTC (permalink / raw) To: Chris Mason; +Cc: Nick Piggin, Andrew Morton, linux-fsdevel, david Hi, On Fri, Oct 31, 2008 at 10:10:19AM -0400, Chris Mason wrote: > UID: 518429 > > On Fri, 2008-10-31 at 10:16 +0100, Nick Piggin wrote: > > On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote: > > > On Wed, 29 Oct 2008 01:47:24 +1100 > > > npiggin@suse.de wrote: > > > > > > > Chris Mason notices do_sync_mapping_range didn't actually ask for > > data > > > > integrity writeout. Unfortunately, it is advertised as being > > usable for > > > > data integrity operations. > > > > > > > > This is a data interity bug. > > > > > > [ use WB_SYNC_ALL instead of WB_SYNC_NONE ] > > > > If the caller > > > is using sync_file_range() for integrity then the caller has done a > > > SYNC_FILE_RANGE_WAIT_BEFORE. > > > > No disputes about whether the API works "by design". But I think the > > implementation has a bug. I'll explain: > > I'll definitely agree the current usage is clumsy, and there is a bug in > fs/buffer.c. A grep through the rest of the filesystems doesn't turn up > many assumptions that WB_SYNC_NONE means it's ok to skip dirty pages. > > Greps for WB_SYNC_ALL and WB_SYNC_NONE in the fs code reveal: > > fs/buffer.c:__block_write_full_page() > if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) { > lock_buffer(bh); > } else if (!trylock_buffer(bh)) { > redirty_page_for_writepage(wbc, page); > continue; > } > > Easily fixed s/||/&&/, which is what XFS does. reiser3 has the same bug > in fs/reiserfs/inode.c > > ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional > writeback, both seem fixable with one liners. > I doubt that is the case for GFS2... Its a bit subtle, but in order to ensure that we have the correct lock ordering the code in writepage for jdata (the only place this occurs, and I assume is what you are referring to) relies in the fact that because we have a writepages set for jdata, writepage will only ever be called with WB_SYNC_NONE and can thus be skipped. The lock ordering in question is between the transaction (glock) and the page lock. The former has to be grabbed first and this is also the reason that this only affects jdata as writeback and ordered don't need transactions. This is also the same reason that we have our own version of write_cache_pages, its more or less identical to the VFS version except its got our transaction lock stuck in the middle of it. Ideally (and I speak entirely from a selfish, personal, GFS2 point of view here!) we would either have a version of writepage which didn't hold the page lock on entry or we'd just move over to writepages entirely. This issue came up at the filesystem workshop earlier in the year and I think it was generally agreed that it was a good idea. I did have a look into doing it at one stage, but the code scared me off a bit :-) Steve. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-31 14:30 ` steve @ 2008-10-31 15:02 ` Chris Mason 0 siblings, 0 replies; 53+ messages in thread From: Chris Mason @ 2008-10-31 15:02 UTC (permalink / raw) To: steve; +Cc: Nick Piggin, Andrew Morton, linux-fsdevel, david On Fri, 2008-10-31 at 14:30 +0000, steve@chygwyn.com wrote: > > ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional > > writeback, both seem fixable with one liners. > > > I doubt that is the case for GFS2... Its a bit subtle, but in order to > ensure that we have the correct lock ordering the code in writepage > for jdata (the only place this occurs, and I assume is what you are > referring to) relies in the fact that because we have a writepages > set for jdata, writepage will only ever be called with WB_SYNC_NONE > and can thus be skipped. Well, if you have a writepages call, writepage is only used by kswapd and is entirely optional (assuming you don't call it yourself ;) -chris ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix 2008-10-31 14:10 ` Chris Mason 2008-10-31 14:30 ` steve @ 2008-11-01 8:04 ` Nick Piggin 1 sibling, 0 replies; 53+ messages in thread From: Nick Piggin @ 2008-11-01 8:04 UTC (permalink / raw) To: Chris Mason; +Cc: Andrew Morton, linux-fsdevel, david On Fri, Oct 31, 2008 at 10:10:19AM -0400, Chris Mason wrote: > On Fri, 2008-10-31 at 10:16 +0100, Nick Piggin wrote: > > On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote: > > > On Wed, 29 Oct 2008 01:47:24 +1100 > > > npiggin@suse.de wrote: > > > > > > > Chris Mason notices do_sync_mapping_range didn't actually ask for > > data > > > > integrity writeout. Unfortunately, it is advertised as being > > usable for > > > > data integrity operations. > > > > > > > > This is a data interity bug. > > > > > > [ use WB_SYNC_ALL instead of WB_SYNC_NONE ] > > > > If the caller > > > is using sync_file_range() for integrity then the caller has done a > > > SYNC_FILE_RANGE_WAIT_BEFORE. > > > > No disputes about whether the API works "by design". But I think the > > implementation has a bug. I'll explain: > > I'll definitely agree the current usage is clumsy, and there is a bug in > fs/buffer.c. A grep through the rest of the filesystems doesn't turn up > many assumptions that WB_SYNC_NONE means it's ok to skip dirty pages. > > Greps for WB_SYNC_ALL and WB_SYNC_NONE in the fs code reveal: > > fs/buffer.c:__block_write_full_page() > if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) { > lock_buffer(bh); > } else if (!trylock_buffer(bh)) { > redirty_page_for_writepage(wbc, page); > continue; > } > > Easily fixed s/||/&&/, which is what XFS does. reiser3 has the same bug > in fs/reiserfs/inode.c > > ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional > writeback, both seem fixable with one liners. Well, we went over this before :) Semantics... Whether or not it can be considered a bug, WB_SYNC_NONE is very much considered not to be a data integrity sync in code and comments (eg. buffer.c fs-writeback.c and file systems) for a long time. So that's not something I can really change (especially not for stable releases). It's _probably_ a good idea not to fragment bios, and it's probably not going to introduce bugs if we convert everything over. But presently, all the bugs/misunderstandings/whatever mean that WB_SYNC_NONE is not usable for data integrity. Obviously this has been the case for long before sync_mapping_range was added. > Everywhere we wait on page writeback while we're trying to build nice > big bios hurts performance. I'd rather see us switch to something > closer to the do_sync_mapping_range expectation of WB_SYNC_NONE than > sprinkle WB_SYNC_ALLs everywhere. They already are I guess. At least, this is the only place I have seen that needed changing. After that series, I think we get a lot closer to being correct in a lot of the writeback paths, which is a good starting point to make improvements. I'll be looking at the "livelock" avoidance, but you or other fs developers probably in a better position to look at things like waiting for locked or writeback pages... ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (8 preceding siblings ...) 2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin @ 2008-10-28 15:39 ` Nick Piggin 2008-10-28 22:27 ` Dave Chinner 2008-10-28 23:14 ` Dave Chinner 10 siblings, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-28 15:39 UTC (permalink / raw) To: akpm, xfs; +Cc: linux-fsdevel, Dave Chinner, Chris Mason On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote: > OK, I'm happier with this patchset now. Note that I've taken your patch > and mangled it a bit at the end of the series. > > This one survives and seems to run OK here, but I'm mainly doing dumb > stress testing with a handful of filesystems, and data-io error injection > testing. There are a lot of combinations of ways this function can operate > and interact obviously, so it would be helpful to get more review. > > Chris, would you possibly have time to run your btrfs tests that are > sensitive to problems in this code? I could provide you a single patch > rollup against mainline if it helps. BTW. XFS seems to be doing something interesting with my simple sync test case with IO error injection. I map a file MAP_SHARED into a number of processes, which then each run a loop that dirties the memory then calls msync(MS_SYNC) on the range. ext2 mostly reports -EIO back to userspace when a failure is injected AFAIKS. ext3 (ordered) doesn't until a lot of errors have been injected, but eventually reports -EIO and shuts down the filesystem. reiserfs seems to report failure more consistently. I haven't seen any -EIO failures from XFS... maybe I'm just not doing the right thing, or there is a caveat I'm not aware of. All fault injections I noticed had a trace like this: FAULT_INJECTION: forcing a failure Call Trace: 9f9cd758: [<6019f1de>] random32+0xe/0x20 9f9cd768: [<601a31b9>] should_fail+0xd9/0x130 9f9cd798: [<6018d0c4>] generic_make_request+0x304/0x4e0 9f9cd7a8: [<60062301>] mempool_alloc+0x51/0x130 9f9cd858: [<6018e6bf>] submit_bio+0x4f/0xe0 9f9cd8a8: [<60165505>] xfs_submit_ioend_bio+0x25/0x40 9f9cd8c8: [<6016603c>] xfs_submit_ioend+0xbc/0xf0 9f9cd908: [<60166bf9>] xfs_page_state_convert+0x3d9/0x6a0 9f9cd928: [<6005d515>] delayacct_end+0x95/0xb0 9f9cda08: [<60166ffd>] xfs_vm_writepage+0x6d/0x110 9f9cda18: [<6006618b>] set_page_dirty+0x4b/0xd0 9f9cda58: [<60066115>] __writepage+0x15/0x40 9f9cda78: [<60066775>] write_cache_pages+0x255/0x470 9f9cda90: [<60066100>] __writepage+0x0/0x40 9f9cdb98: [<600669b0>] generic_writepages+0x20/0x30 9f9cdba8: [<60165ba3>] xfs_vm_writepages+0x53/0x70 9f9cdbd8: [<600669eb>] do_writepages+0x2b/0x40 9f9cdbf8: [<6006004c>] __filemap_fdatawrite_range+0x5c/0x70 9f9cdc58: [<6006026a>] filemap_fdatawrite+0x1a/0x20 9f9cdc68: [<600a7a05>] do_fsync+0x45/0xe0 9f9cdc98: [<6007794b>] sys_msync+0x14b/0x1d0 9f9cdcf8: [<60019a70>] handle_syscall+0x50/0x80 9f9cdd18: [<6002a10f>] userspace+0x44f/0x510 9f9cdfc8: [<60016792>] fork_handler+0x62/0x70 And the kernel would sometimes say this: Buffer I/O error on device ram0, logical block 279 lost page write due to I/O error on ram0 Buffer I/O error on device ram0, logical block 379 lost page write due to I/O error on ram0 Buffer I/O error on device ram0, logical block 389 lost page write due to I/O error on ram0 I think I also saw a slab bug when running dbench with fault injection on. Running latest Linus kernel. bash-3.1# dbench -t10 -c ../client.txt 8 dbench version 3.04 - Copyright Andrew Tridgell 1999-2004 Running for 10 seconds with load '../client.txt' and minimum warmup 2 secs 8 clients started FAULT_INJECTION: forcing a failure Call Trace: 9e7bb548: [<601623ae>] random32+0xe/0x20 9e7bb558: [<60166389>] should_fail+0xd9/0x130 9e7bb588: [<60150294>] generic_make_request+0x304/0x4e0 9e7bb598: [<60062301>] mempool_alloc+0x51/0x130 9e7bb648: [<6015188f>] submit_bio+0x4f/0xe0 9e7bb698: [<6012b440>] _xfs_buf_ioapply+0x180/0x2a0 9e7bb6a0: [<6002f600>] default_wake_function+0x0/0x10 9e7bb6f8: [<6012bae1>] xfs_buf_iorequest+0x31/0x90 9e7bb718: [<60112f75>] xlog_bdstrat_cb+0x45/0x50 9e7bb738: [<60114135>] xlog_sync+0x195/0x440 9e7bb778: [<60114491>] xlog_state_release_iclog+0xb1/0xc0 9e7bb7a8: [<60114ca9>] xlog_write+0x539/0x550 9e7bb858: [<60114e60>] xfs_log_write+0x40/0x60 9e7bb888: [<6011fbaa>] _xfs_trans_commit+0x19a/0x360 9e7bb8b8: [<600838e2>] poison_obj+0x42/0x60 9e7bb8d0: [<60082cb3>] dbg_redzone1+0x13/0x30 9e7bb8e8: [<60083999>] cache_alloc_debugcheck_after+0x99/0x1c0 9e7bb918: [<6008517b>] kmem_cache_alloc+0x8b/0x100 9e7bb958: [<60128084>] kmem_zone_alloc+0x74/0xe0 9e7bb998: [<60082ad9>] kmem_cache_size+0x9/0x10 9e7bb9a8: [<60128124>] kmem_zone_zalloc+0x34/0x50 9e7bb9e8: [<60121e8b>] xfs_dir_ialloc+0x13b/0x2e0 9e7bba58: [<601f534b>] __down_write+0xb/0x10 9e7bbaa8: [<60125b9e>] xfs_mkdir+0x37e/0x4b0 9e7bbb38: [<601f5589>] _spin_unlock+0x9/0x10 9e7bbb78: [<601301a4>] xfs_vn_mknod+0xf4/0x1a0 9e7bbbd8: [<6013025e>] xfs_vn_mkdir+0xe/0x10 9e7bbbe8: [<60091010>] vfs_mkdir+0x90/0xc0 9e7bbc18: [<600934d6>] sys_mkdirat+0x106/0x120 9e7bbc88: [<6008629b>] filp_close+0x4b/0x80 9e7bbce8: [<60093503>] sys_mkdir+0x13/0x20 9e7bbcf8: [<60019a70>] handle_syscall+0x50/0x80 9e7bbd18: [<6002a10f>] userspace+0x44f/0x510 9e7bbfc8: [<60016792>] fork_handler+0x62/0x70 I/O error in filesystem ("ram0") meta-data dev ram0 block 0x8002c ("xlog_i odone") error 5 buf count 32768 xfs_force_shutdown(ram0,0x2) called from line 1056 of file /home/npiggin/usr/src /linux-2.6/fs/xfs/xfs_log.c. Return address = 0x000000006011370d Filesystem "ram0": Log I/O Error Detected. Shutting down filesystem: ram0 Please umount the filesystem, and rectify the problem(s) xfs_force_shutdown(ram0,0x2) called from line 818 of file /home/npiggin/usr/src/ linux-2.6/fs/xfs/xfs_log.c. Return address = 0x0000000060114e7d slab error in verify_redzone_free(): cache `xfs_log_ticket': double free detecte d Call Trace: 9e7bb998: [<6008372f>] __slab_error+0x1f/0x30 9e7bb9a8: [<60083cae>] cache_free_debugcheck+0x1ee/0x240 9e7bb9b0: [<60112ef0>] xlog_ticket_put+0x10/0x20 9e7bb9e8: [<60083f70>] kmem_cache_free+0x50/0xc0 9e7bba18: [<60112ef0>] xlog_ticket_put+0x10/0x20 9e7bba28: [<60114dc9>] xfs_log_done+0x59/0xb0 9e7bba68: [<6011f5de>] xfs_trans_cancel+0x7e/0x140 9e7bbaa8: [<60125a1e>] xfs_mkdir+0x1fe/0x4b0 9e7bbb38: [<601f5589>] _spin_unlock+0x9/0x10 9e7bbb78: [<601301a4>] xfs_vn_mknod+0xf4/0x1a0 9e7bbbd8: [<6013025e>] xfs_vn_mkdir+0xe/0x10 9e7bbbe8: [<60091010>] vfs_mkdir+0x90/0xc0 9e7bbc18: [<600934d6>] sys_mkdirat+0x106/0x120 9e7bbc88: [<6008629b>] filp_close+0x4b/0x80 9e7bbce8: [<60093503>] sys_mkdir+0x13/0x20 9e7bbcf8: [<60019a70>] handle_syscall+0x50/0x80 9e7bbd18: [<6002a10f>] userspace+0x44f/0x510 9e7bbfc8: [<60016792>] fork_handler+0x62/0x70 000000009e0d4ec0: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b. (3) open ./clients/client1 failed for handle 16385 (No such file or directory) (4) ERROR: handle 16385 was not found Child failed with status 1 (kernel died soon afterwards) ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin @ 2008-10-28 22:27 ` Dave Chinner 2008-10-29 0:04 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Dave Chinner @ 2008-10-28 22:27 UTC (permalink / raw) To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote: > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote: > > OK, I'm happier with this patchset now. Note that I've taken your patch > > and mangled it a bit at the end of the series. > > > > This one survives and seems to run OK here, but I'm mainly doing dumb > > stress testing with a handful of filesystems, and data-io error injection > > testing. There are a lot of combinations of ways this function can operate > > and interact obviously, so it would be helpful to get more review. > > > > Chris, would you possibly have time to run your btrfs tests that are > > sensitive to problems in this code? I could provide you a single patch > > rollup against mainline if it helps. > > BTW. XFS seems to be doing something interesting with my simple sync > test case with IO error injection. I map a file MAP_SHARED into a number of > processes, which then each run a loop that dirties the memory then calls > msync(MS_SYNC) on the range. > > ext2 mostly reports -EIO back to userspace when a failure is injected AFAIKS. > ext3 (ordered) doesn't until a lot of errors have been injected, but eventually > reports -EIO and shuts down the filesystem. reiserfs seems to report failure > more consistently. > > I haven't seen any -EIO failures from XFS... maybe I'm just not doing the > right thing, or there is a caveat I'm not aware of. > > All fault injections I noticed had a trace like this: > FAULT_INJECTION: forcing a failure > Call Trace: > 9f9cd758: [<6019f1de>] random32+0xe/0x20 > 9f9cd768: [<601a31b9>] should_fail+0xd9/0x130 > 9f9cd798: [<6018d0c4>] generic_make_request+0x304/0x4e0 > 9f9cd7a8: [<60062301>] mempool_alloc+0x51/0x130 > 9f9cd858: [<6018e6bf>] submit_bio+0x4f/0xe0 > 9f9cd8a8: [<60165505>] xfs_submit_ioend_bio+0x25/0x40 > 9f9cd8c8: [<6016603c>] xfs_submit_ioend+0xbc/0xf0 > 9f9cd908: [<60166bf9>] xfs_page_state_convert+0x3d9/0x6a0 > 9f9cd928: [<6005d515>] delayacct_end+0x95/0xb0 > 9f9cda08: [<60166ffd>] xfs_vm_writepage+0x6d/0x110 > 9f9cda18: [<6006618b>] set_page_dirty+0x4b/0xd0 > 9f9cda58: [<60066115>] __writepage+0x15/0x40 > 9f9cda78: [<60066775>] write_cache_pages+0x255/0x470 > 9f9cda90: [<60066100>] __writepage+0x0/0x40 > 9f9cdb98: [<600669b0>] generic_writepages+0x20/0x30 > 9f9cdba8: [<60165ba3>] xfs_vm_writepages+0x53/0x70 > 9f9cdbd8: [<600669eb>] do_writepages+0x2b/0x40 > 9f9cdbf8: [<6006004c>] __filemap_fdatawrite_range+0x5c/0x70 > 9f9cdc58: [<6006026a>] filemap_fdatawrite+0x1a/0x20 > 9f9cdc68: [<600a7a05>] do_fsync+0x45/0xe0 > 9f9cdc98: [<6007794b>] sys_msync+0x14b/0x1d0 > 9f9cdcf8: [<60019a70>] handle_syscall+0x50/0x80 > 9f9cdd18: [<6002a10f>] userspace+0x44f/0x510 > 9f9cdfc8: [<60016792>] fork_handler+0x62/0x70 XFS reports bio errors through the I/O completion path, not the submission path. > And the kernel would sometimes say this: > Buffer I/O error on device ram0, logical block 279 > lost page write due to I/O error on ram0 > Buffer I/O error on device ram0, logical block 379 > lost page write due to I/O error on ram0 > Buffer I/O error on device ram0, logical block 389 > lost page write due to I/O error on ram0 Yes - that's coming from end_buffer_async_write() when an error is reported in bio completion. This does: 465 set_bit(AS_EIO, &page->mapping->flags); 466 set_buffer_write_io_error(bh); 467 clear_buffer_uptodate(bh); 468 SetPageError(page); Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in wait_on_page_writeback_range() which is appears to be checking the mapping flags for errors. I wonder why that error is not being propagated then? AFAICT both XFS and the fsync code are doing the right thing but somewhere the error has gone missing... > I think I also saw a slab bug when running dbench with fault injection on. > Running latest Linus kernel. > > bash-3.1# dbench -t10 -c ../client.txt 8 > dbench version 3.04 - Copyright Andrew Tridgell 1999-2004 > > Running for 10 seconds with load '../client.txt' and minimum warmup 2 secs > 8 clients started > FAULT_INJECTION: forcing a failure > Call Trace: > 9e7bb548: [<601623ae>] random32+0xe/0x20 > 9e7bb558: [<60166389>] should_fail+0xd9/0x130 > 9e7bb588: [<60150294>] generic_make_request+0x304/0x4e0 > 9e7bb598: [<60062301>] mempool_alloc+0x51/0x130 > 9e7bb648: [<6015188f>] submit_bio+0x4f/0xe0 > 9e7bb698: [<6012b440>] _xfs_buf_ioapply+0x180/0x2a0 > 9e7bb6a0: [<6002f600>] default_wake_function+0x0/0x10 > 9e7bb6f8: [<6012bae1>] xfs_buf_iorequest+0x31/0x90 > 9e7bb718: [<60112f75>] xlog_bdstrat_cb+0x45/0x50 > 9e7bb738: [<60114135>] xlog_sync+0x195/0x440 > 9e7bb778: [<60114491>] xlog_state_release_iclog+0xb1/0xc0 > 9e7bb7a8: [<60114ca9>] xlog_write+0x539/0x550 > 9e7bb858: [<60114e60>] xfs_log_write+0x40/0x60 > 9e7bb888: [<6011fbaa>] _xfs_trans_commit+0x19a/0x360 > 9e7bb8b8: [<600838e2>] poison_obj+0x42/0x60 > 9e7bb8d0: [<60082cb3>] dbg_redzone1+0x13/0x30 > 9e7bb8e8: [<60083999>] cache_alloc_debugcheck_after+0x99/0x1c0 > 9e7bb918: [<6008517b>] kmem_cache_alloc+0x8b/0x100 > 9e7bb958: [<60128084>] kmem_zone_alloc+0x74/0xe0 > 9e7bb998: [<60082ad9>] kmem_cache_size+0x9/0x10 > 9e7bb9a8: [<60128124>] kmem_zone_zalloc+0x34/0x50 > 9e7bb9e8: [<60121e8b>] xfs_dir_ialloc+0x13b/0x2e0 > 9e7bba58: [<601f534b>] __down_write+0xb/0x10 > 9e7bbaa8: [<60125b9e>] xfs_mkdir+0x37e/0x4b0 > 9e7bbb38: [<601f5589>] _spin_unlock+0x9/0x10 > 9e7bbb78: [<601301a4>] xfs_vn_mknod+0xf4/0x1a0 > 9e7bbbd8: [<6013025e>] xfs_vn_mkdir+0xe/0x10 > 9e7bbbe8: [<60091010>] vfs_mkdir+0x90/0xc0 > 9e7bbc18: [<600934d6>] sys_mkdirat+0x106/0x120 > 9e7bbc88: [<6008629b>] filp_close+0x4b/0x80 > 9e7bbce8: [<60093503>] sys_mkdir+0x13/0x20 > 9e7bbcf8: [<60019a70>] handle_syscall+0x50/0x80 > 9e7bbd18: [<6002a10f>] userspace+0x44f/0x510 > 9e7bbfc8: [<60016792>] fork_handler+0x62/0x70 > > I/O error in filesystem ("ram0") meta-data dev ram0 block 0x8002c ("xlog_i > odone") error 5 buf count 32768 > xfs_force_shutdown(ram0,0x2) called from line 1056 of file /home/npiggin/usr/src > /linux-2.6/fs/xfs/xfs_log.c. Return address = 0x000000006011370d > Filesystem "ram0": Log I/O Error Detected. Shutting down filesystem: ram0 > Please umount the filesystem, and rectify the problem(s) > xfs_force_shutdown(ram0,0x2) called from line 818 of file /home/npiggin/usr/src/ > linux-2.6/fs/xfs/xfs_log.c. Return address = 0x0000000060114e7d > slab error in verify_redzone_free(): cache `xfs_log_ticket': double free detecte > d > Call Trace: > 9e7bb998: [<6008372f>] __slab_error+0x1f/0x30 > 9e7bb9a8: [<60083cae>] cache_free_debugcheck+0x1ee/0x240 > 9e7bb9b0: [<60112ef0>] xlog_ticket_put+0x10/0x20 > 9e7bb9e8: [<60083f70>] kmem_cache_free+0x50/0xc0 > 9e7bba18: [<60112ef0>] xlog_ticket_put+0x10/0x20 > 9e7bba28: [<60114dc9>] xfs_log_done+0x59/0xb0 > 9e7bba68: [<6011f5de>] xfs_trans_cancel+0x7e/0x140 > 9e7bbaa8: [<60125a1e>] xfs_mkdir+0x1fe/0x4b0 > 9e7bbb38: [<601f5589>] _spin_unlock+0x9/0x10 > 9e7bbb78: [<601301a4>] xfs_vn_mknod+0xf4/0x1a0 > 9e7bbbd8: [<6013025e>] xfs_vn_mkdir+0xe/0x10 > 9e7bbbe8: [<60091010>] vfs_mkdir+0x90/0xc0 > 9e7bbc18: [<600934d6>] sys_mkdirat+0x106/0x120 > 9e7bbc88: [<6008629b>] filp_close+0x4b/0x80 > 9e7bbce8: [<60093503>] sys_mkdir+0x13/0x20 > 9e7bbcf8: [<60019a70>] handle_syscall+0x50/0x80 > 9e7bbd18: [<6002a10f>] userspace+0x44f/0x510 > 9e7bbfc8: [<60016792>] fork_handler+0x62/0x70 Now that is interesting. We've got a rolling transaction in progress, and the commit of the first part of the transaction has got the I/O error. That frees the transaction structure used during that commit, as well as the ticket. However, before we committed the initial transaction, we duplicated the transaction structure to allow the transaction to continue to track all the dirty objects in the first commit. That included duplicating the pointer to the ticket. Then the EIO is returned to mkdir code with the duplicated transaction, which is then cancelled, and that frees the transaction and the ticket it holds. However, we'd already freed the ticket. Ok, we're only seeing this problem now because I recently modified the ticket allocation to use a slab instead of a roll-your-own free list structure that wouldn't have been poisoned. Nice to know that this change did more than just remove code. ;) This might take a little while to fix - a lot of code needs auditing - but thanks for reporting the problem. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 22:27 ` Dave Chinner @ 2008-10-29 0:04 ` Nick Piggin 2008-10-29 0:16 ` Nick Piggin 2008-10-29 8:51 ` Dave Chinner 2 siblings, 0 replies; 53+ messages in thread From: Nick Piggin @ 2008-10-29 0:04 UTC (permalink / raw) To: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote: > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote: > > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote: > > I haven't seen any -EIO failures from XFS... maybe I'm just not doing the > > right thing, or there is a caveat I'm not aware of. > > > > All fault injections I noticed had a trace like this: [...] > > XFS reports bio errors through the I/O completion path, not the > submission path. Right, that's just to give you an indication of where it's failing... > > And the kernel would sometimes say this: > > Buffer I/O error on device ram0, logical block 279 > > lost page write due to I/O error on ram0 > > Buffer I/O error on device ram0, logical block 379 > > lost page write due to I/O error on ram0 > > Buffer I/O error on device ram0, logical block 389 > > lost page write due to I/O error on ram0 > > Yes - that's coming from end_buffer_async_write() when an error is > reported in bio completion. This does: > > 465 set_bit(AS_EIO, &page->mapping->flags); > 466 set_buffer_write_io_error(bh); > 467 clear_buffer_uptodate(bh); > 468 SetPageError(page); > > Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in > wait_on_page_writeback_range() which is appears to be checking the > mapping flags for errors. I wonder why that error is not being > propagated then? AFAICT both XFS and the fsync code are doing the > right thing but somewhere the error has gone missing... Yeah, I couldn't immediately see why nothing comes out. I'll do a bit more digging. > > I think I also saw a slab bug when running dbench with fault injection on. > > Running latest Linus kernel. [...] > > Now that is interesting. > > We've got a rolling transaction in progress, and the commit of the > first part of the transaction has got the I/O error. That frees the > transaction structure used during that commit, as well as the > ticket. > > However, before we committed the initial transaction, we duplicated > the transaction structure to allow the transaction to continue to > track all the dirty objects in the first commit. That included > duplicating the pointer to the ticket. > > Then the EIO is returned to mkdir code with the duplicated > transaction, which is then cancelled, and that frees the transaction > and the ticket it holds. However, we'd already freed the ticket. > > Ok, we're only seeing this problem now because I recently modified > the ticket allocation to use a slab instead of a roll-your-own free > list structure that wouldn't have been poisoned. Nice to know that > this change did more than just remove code. ;) > > This might take a little while to fix - a lot of code needs > auditing - but thanks for reporting the problem. No problem, hope it helps. Thanks, Nick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 22:27 ` Dave Chinner 2008-10-29 0:04 ` Nick Piggin @ 2008-10-29 0:16 ` Nick Piggin 2008-10-29 3:16 ` Dave Chinner 2008-10-29 8:51 ` Dave Chinner 2 siblings, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-29 0:16 UTC (permalink / raw) To: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote: > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote: > > > > I haven't seen any -EIO failures from XFS... maybe I'm just not doing the > > right thing, or there is a caveat I'm not aware of. > > > > All fault injections I noticed had a trace like this: > > FAULT_INJECTION: forcing a failure > > Call Trace: > > 9f9cd758: [<6019f1de>] random32+0xe/0x20 > > 9f9cd768: [<601a31b9>] should_fail+0xd9/0x130 > > 9f9cd798: [<6018d0c4>] generic_make_request+0x304/0x4e0 > > 9f9cd7a8: [<60062301>] mempool_alloc+0x51/0x130 > > 9f9cd858: [<6018e6bf>] submit_bio+0x4f/0xe0 > > 9f9cd8a8: [<60165505>] xfs_submit_ioend_bio+0x25/0x40 > > 9f9cd8c8: [<6016603c>] xfs_submit_ioend+0xbc/0xf0 > > 9f9cd908: [<60166bf9>] xfs_page_state_convert+0x3d9/0x6a0 > > 9f9cd928: [<6005d515>] delayacct_end+0x95/0xb0 > > 9f9cda08: [<60166ffd>] xfs_vm_writepage+0x6d/0x110 > > 9f9cda18: [<6006618b>] set_page_dirty+0x4b/0xd0 > > 9f9cda58: [<60066115>] __writepage+0x15/0x40 > > 9f9cda78: [<60066775>] write_cache_pages+0x255/0x470 > > 9f9cda90: [<60066100>] __writepage+0x0/0x40 > > 9f9cdb98: [<600669b0>] generic_writepages+0x20/0x30 > > 9f9cdba8: [<60165ba3>] xfs_vm_writepages+0x53/0x70 > > 9f9cdbd8: [<600669eb>] do_writepages+0x2b/0x40 > > 9f9cdbf8: [<6006004c>] __filemap_fdatawrite_range+0x5c/0x70 > > 9f9cdc58: [<6006026a>] filemap_fdatawrite+0x1a/0x20 > > 9f9cdc68: [<600a7a05>] do_fsync+0x45/0xe0 > > 9f9cdc98: [<6007794b>] sys_msync+0x14b/0x1d0 > > 9f9cdcf8: [<60019a70>] handle_syscall+0x50/0x80 > > 9f9cdd18: [<6002a10f>] userspace+0x44f/0x510 > > 9f9cdfc8: [<60016792>] fork_handler+0x62/0x70 > > XFS reports bio errors through the I/O completion path, not the > submission path. > > > And the kernel would sometimes say this: > > Buffer I/O error on device ram0, logical block 279 > > lost page write due to I/O error on ram0 > > Buffer I/O error on device ram0, logical block 379 > > lost page write due to I/O error on ram0 > > Buffer I/O error on device ram0, logical block 389 > > lost page write due to I/O error on ram0 > > Yes - that's coming from end_buffer_async_write() when an error is > reported in bio completion. This does: > > 465 set_bit(AS_EIO, &page->mapping->flags); > 466 set_buffer_write_io_error(bh); > 467 clear_buffer_uptodate(bh); > 468 SetPageError(page); > > Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in > wait_on_page_writeback_range() which is appears to be checking the > mapping flags for errors. I wonder why that error is not being > propagated then? AFAICT both XFS and the fsync code are doing the > right thing but somewhere the error has gone missing... This one-liner has it reporting EIO errors like a champion. I don't know if you'll actually need to put this into the linux API layer or not, but anyway the root cause of the problem AFAIKS is this. -- XFS: fix fsync errors not being propogated back to userspace. --- Index: linux-2.6/fs/xfs/xfs_vnodeops.c =================================================================== --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c +++ linux-2.6/fs/xfs/xfs_vnodeops.c @@ -715,7 +715,7 @@ xfs_fsync( /* capture size updates in I/O completion before writing the inode. */ error = filemap_fdatawait(VFS_I(ip)->i_mapping); if (error) - return XFS_ERROR(error); + return XFS_ERROR(-error); /* * We always need to make sure that the required inode state is safe on ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 0:16 ` Nick Piggin @ 2008-10-29 3:16 ` Dave Chinner 2008-10-29 3:26 ` Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 53+ messages in thread From: Dave Chinner @ 2008-10-29 3:16 UTC (permalink / raw) To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote: > On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote: > > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote: > > Yes - that's coming from end_buffer_async_write() when an error is > > reported in bio completion. This does: > > > > 465 set_bit(AS_EIO, &page->mapping->flags); > > 466 set_buffer_write_io_error(bh); > > 467 clear_buffer_uptodate(bh); > > 468 SetPageError(page); > > > > Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in > > wait_on_page_writeback_range() which is appears to be checking the > > mapping flags for errors. I wonder why that error is not being > > propagated then? AFAICT both XFS and the fsync code are doing the > > right thing but somewhere the error has gone missing... > > This one-liner has it reporting EIO errors like a champion. I > don't know if you'll actually need to put this into the > linux API layer or not, but anyway the root cause of the problem > AFAIKS is this. > -- > > XFS: fix fsync errors not being propogated back to userspace. > --- > Index: linux-2.6/fs/xfs/xfs_vnodeops.c > =================================================================== > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c > +++ linux-2.6/fs/xfs/xfs_vnodeops.c > @@ -715,7 +715,7 @@ xfs_fsync( > /* capture size updates in I/O completion before writing the inode. */ > error = filemap_fdatawait(VFS_I(ip)->i_mapping); > if (error) > - return XFS_ERROR(error); > + return XFS_ERROR(-error); <groan> Yeah, that'd do it. Good catch. I can't believe I recently fixed a bug that touched these lines of code without noticing the inversion. Sometimes I wonder if we should just conver the entire of XFS to return negative errors - mistakes in handling negative error numbers in the core XFS code happen all the time. FWIW, the core issue here is that we've got to do the filemap_fdatawait() call in the ->fsync method because ->fsync gets called before we've waited for the data I/O to complete. XFS updates inode state on I/O completion, so we *must* wait for data I/O to complete before logging the inode changes. I think btrfs has the same problem.... Thanks again, Nick. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 3:16 ` Dave Chinner @ 2008-10-29 3:26 ` Dave Chinner 2008-10-29 4:11 ` Nick Piggin 2008-10-29 9:13 ` Christoph Hellwig 2008-10-29 4:00 ` Nick Piggin 2008-10-29 9:12 ` Christoph Hellwig 2 siblings, 2 replies; 53+ messages in thread From: Dave Chinner @ 2008-10-29 3:26 UTC (permalink / raw) To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote: > On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote: > > XFS: fix fsync errors not being propogated back to userspace. > > --- > > Index: linux-2.6/fs/xfs/xfs_vnodeops.c > > =================================================================== > > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c > > +++ linux-2.6/fs/xfs/xfs_vnodeops.c > > @@ -715,7 +715,7 @@ xfs_fsync( > > /* capture size updates in I/O completion before writing the inode. */ > > error = filemap_fdatawait(VFS_I(ip)->i_mapping); > > if (error) > > - return XFS_ERROR(error); > > + return XFS_ERROR(-error); > > <groan> > > Yeah, that'd do it. Good catch. I can't believe I recently fixed a > bug that touched these lines of code without noticing the inversion. > Sometimes I wonder if we should just conver the entire of XFS to > return negative errors - mistakes in handling negative error numbers > in the core XFS code happen all the time. Ok, I was right - these problems happen all the time. The above call should really call xfs_flush_pages() to do the flush and wait. I note that xfs_flush_pages() returns negative errors, and all the callers expect positive errors. I bet the same occurs for xfs_flushinval_pages() and xfs_tosspages() which are the wrappers that core XFS code is supposed to be using for flushing and invalidating file ranges.... I'll write up a patch that covers all of these. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 3:26 ` Dave Chinner @ 2008-10-29 4:11 ` Nick Piggin 2008-10-29 4:57 ` Dave Chinner 2008-10-29 9:13 ` Christoph Hellwig 1 sibling, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-29 4:11 UTC (permalink / raw) To: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote: > On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote: > > On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote: > > > XFS: fix fsync errors not being propogated back to userspace. > > > --- > > > Index: linux-2.6/fs/xfs/xfs_vnodeops.c > > > =================================================================== > > > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c > > > +++ linux-2.6/fs/xfs/xfs_vnodeops.c > > > @@ -715,7 +715,7 @@ xfs_fsync( > > > /* capture size updates in I/O completion before writing the inode. */ > > > error = filemap_fdatawait(VFS_I(ip)->i_mapping); > > > if (error) > > > - return XFS_ERROR(error); > > > + return XFS_ERROR(-error); > > > > <groan> > > > > Yeah, that'd do it. Good catch. I can't believe I recently fixed a > > bug that touched these lines of code without noticing the inversion. > > Sometimes I wonder if we should just conver the entire of XFS to > > return negative errors - mistakes in handling negative error numbers > > in the core XFS code happen all the time. > > Ok, I was right - these problems happen all the time. The above call > should really call xfs_flush_pages() to do the flush and wait. I > note that xfs_flush_pages() returns negative errors, and all the > callers expect positive errors. I bet the same occurs for > xfs_flushinval_pages() and xfs_tosspages() which are the wrappers > that core XFS code is supposed to be using for flushing and > invalidating file ranges.... Just be careful -- in your xfs_flush_pages, I think after the first filemap_fdatawrite, the mapping may no longer be tagged with PAGECACHE_TAG_DIRTY, so you may not pick up those writeback ones you need to wait on. Might need a different variant, or we could just bite the bullet and push through the ->fsync conversion so you get full control of the writeout. BTW. the Linux pagecache APIs should support range operations quite nicely for these. Any reason not to use them (it looks like the wrappers can take ranges)? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 4:11 ` Nick Piggin @ 2008-10-29 4:57 ` Dave Chinner 2008-10-29 5:06 ` Nick Piggin 0 siblings, 1 reply; 53+ messages in thread From: Dave Chinner @ 2008-10-29 4:57 UTC (permalink / raw) To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 05:11:12AM +0100, Nick Piggin wrote: > On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote: > > On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote: > > > On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote: > > > > XFS: fix fsync errors not being propogated back to userspace. > > > > --- > > > > Index: linux-2.6/fs/xfs/xfs_vnodeops.c > > > > =================================================================== > > > > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c > > > > +++ linux-2.6/fs/xfs/xfs_vnodeops.c > > > > @@ -715,7 +715,7 @@ xfs_fsync( > > > > /* capture size updates in I/O completion before writing the inode. */ > > > > error = filemap_fdatawait(VFS_I(ip)->i_mapping); > > > > if (error) > > > > - return XFS_ERROR(error); > > > > + return XFS_ERROR(-error); > > > > > > <groan> > > > > > > Yeah, that'd do it. Good catch. I can't believe I recently fixed a > > > bug that touched these lines of code without noticing the inversion. > > > Sometimes I wonder if we should just conver the entire of XFS to > > > return negative errors - mistakes in handling negative error numbers > > > in the core XFS code happen all the time. > > > > Ok, I was right - these problems happen all the time. The above call > > should really call xfs_flush_pages() to do the flush and wait. I > > note that xfs_flush_pages() returns negative errors, and all the > > callers expect positive errors. I bet the same occurs for > > xfs_flushinval_pages() and xfs_tosspages() which are the wrappers > > that core XFS code is supposed to be using for flushing and > > invalidating file ranges.... > > Just be careful -- in your xfs_flush_pages, I think after the first > filemap_fdatawrite, the mapping may no longer be tagged with > PAGECACHE_TAG_DIRTY, so you may not pick up those writeback ones > you need to wait on. Yes, I realised this as soon as I looked at the code. I added xfs_wait_on_pages() to do this wait. ;) > Might need a different variant, or we could just bite the bullet and > push through the ->fsync conversion so you get full control of the > writeout. Not important right now, though.... > BTW. the Linux pagecache APIs should support range operations quite > nicely for these. Any reason not to use them (it looks like the > wrappers can take ranges)? Because I haven't got around to modifying these wrappers now that the range primitives are in place - XFS inherited the range operations from Irix and have sat unimplemented since being ported to Linux. The patch below should fix this entire class of error value screwup in XFS. I've just started running it through XFSQA, so it's not really tested yet. FWIW, your entire patch series made it through XFSQA without any new regressions, so it looks good from that POV. Cheers, Dave. -- Dave Chinner david@fromorbit.com XFS: fix error inversion problems with data flushing XFS gets the sign of the error wrong in several places when gathering the error from generic linux functions. These functions return negative error values, while the core XFS code returns positive error values. Hence when XFS inverts the error to be returned to the VFS, it can incorrectly invert a negative error and this error will be ignored by the syscall return. Fix all the problems related to calling filemap_* functions. Problem initially identified by Nick Piggin in xfs_fsync(). Signed-off-by: Dave Chinner <david@fromorbit.com> --- fs/xfs/linux-2.6/xfs_fs_subr.c | 23 ++++++++++++++++++++--- fs/xfs/linux-2.6/xfs_lrw.c | 2 +- fs/xfs/linux-2.6/xfs_super.c | 13 +++++++++---- fs/xfs/xfs_vnodeops.c | 2 +- fs/xfs/xfs_vnodeops.h | 1 + 5 files changed, 32 insertions(+), 9 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_fs_subr.c b/fs/xfs/linux-2.6/xfs_fs_subr.c index 36caa6d..5aeb777 100644 --- a/fs/xfs/linux-2.6/xfs_fs_subr.c +++ b/fs/xfs/linux-2.6/xfs_fs_subr.c @@ -24,6 +24,10 @@ int fs_noerr(void) { return 0; } int fs_nosys(void) { return ENOSYS; } void fs_noval(void) { return; } +/* + * note: all filemap functions return negative error codes. These + * need to be inverted before returning to the xfs core functions. + */ void xfs_tosspages( xfs_inode_t *ip, @@ -53,7 +57,7 @@ xfs_flushinval_pages( if (!ret) truncate_inode_pages(mapping, first); } - return ret; + return -ret; } int @@ -72,10 +76,23 @@ xfs_flush_pages( xfs_iflags_clear(ip, XFS_ITRUNCATED); ret = filemap_fdatawrite(mapping); if (flags & XFS_B_ASYNC) - return ret; + return -ret; ret2 = filemap_fdatawait(mapping); if (!ret) ret = ret2; } - return ret; + return -ret; +} + +int +xfs_wait_on_pages( + xfs_inode_t *ip, + xfs_off_t first, + xfs_off_t last) +{ + struct address_space *mapping = VFS_I(ip)->i_mapping; + + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) + return -filemap_fdatawait(mapping); + return 0; } diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c index 1957e53..4959c87 100644 --- a/fs/xfs/linux-2.6/xfs_lrw.c +++ b/fs/xfs/linux-2.6/xfs_lrw.c @@ -243,7 +243,7 @@ xfs_read( if (unlikely(ioflags & IO_ISDIRECT)) { if (inode->i_mapping->nrpages) - ret = xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK), + ret = -xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK), -1, FI_REMAPF_LOCKED); mutex_unlock(&inode->i_mutex); if (ret) { diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index 9dc977d..c5cfc1e 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -1012,21 +1012,26 @@ xfs_fs_write_inode( struct inode *inode, int sync) { + struct xfs_inode *ip = XFS_I(inode); int error = 0; int flags = 0; - xfs_itrace_entry(XFS_I(inode)); + xfs_itrace_entry(ip); if (sync) { - filemap_fdatawait(inode->i_mapping); + error = xfs_wait_on_pages(ip, 0, -1); + if (error) + goto out_error; flags |= FLUSH_SYNC; } - error = xfs_inode_flush(XFS_I(inode), flags); + error = xfs_inode_flush(ip, flags); + +out_error: /* * if we failed to write out the inode then mark * it dirty again so we'll try again later. */ if (error) - xfs_mark_inode_dirty_sync(XFS_I(inode)); + xfs_mark_inode_dirty_sync(ip); return -error; } diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c index 2e2fbd9..5809c42 100644 --- a/fs/xfs/xfs_vnodeops.c +++ b/fs/xfs/xfs_vnodeops.c @@ -713,7 +713,7 @@ xfs_fsync( return XFS_ERROR(EIO); /* capture size updates in I/O completion before writing the inode. */ - error = filemap_fdatawait(VFS_I(ip)->i_mapping); + error = xfs_wait_on_pages(ip, 0, -1); if (error) return XFS_ERROR(error); diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h index e932a96..f9cd376 100644 --- a/fs/xfs/xfs_vnodeops.h +++ b/fs/xfs/xfs_vnodeops.h @@ -78,5 +78,6 @@ int xfs_flushinval_pages(struct xfs_inode *ip, xfs_off_t first, xfs_off_t last, int fiopt); int xfs_flush_pages(struct xfs_inode *ip, xfs_off_t first, xfs_off_t last, uint64_t flags, int fiopt); +int xfs_wait_on_pages(struct xfs_inode *ip, xfs_off_t first, xfs_off_t last); #endif /* _XFS_VNODEOPS_H */ ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 4:57 ` Dave Chinner @ 2008-10-29 5:06 ` Nick Piggin 0 siblings, 0 replies; 53+ messages in thread From: Nick Piggin @ 2008-10-29 5:06 UTC (permalink / raw) To: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 03:57:07PM +1100, Dave Chinner wrote: > On Wed, Oct 29, 2008 at 05:11:12AM +0100, Nick Piggin wrote: > > Just be careful -- in your xfs_flush_pages, I think after the first > > filemap_fdatawrite, the mapping may no longer be tagged with > > PAGECACHE_TAG_DIRTY, so you may not pick up those writeback ones > > you need to wait on. > > Yes, I realised this as soon as I looked at the code. > I added xfs_wait_on_pages() to do this wait. ;) Oh good. > > Might need a different variant, or we could just bite the bullet and > > push through the ->fsync conversion so you get full control of the > > writeout. > > Not important right now, though.... OK. > > BTW. the Linux pagecache APIs should support range operations quite > > nicely for these. Any reason not to use them (it looks like the > > wrappers can take ranges)? > > Because I haven't got around to modifying these wrappers now that > the range primitives are in place - XFS inherited the range > operations from Irix and have sat unimplemented since being ported > to Linux. OK, fair enough. Just something easy to stick on a todo list somewhere ;) Probably doesn't make much difference now, but we might start seeing apps using range syncs. > The patch below should fix this entire class of error value screwup > in XFS. I've just started running it through XFSQA, so it's not > really tested yet. I'll sanity check it by running it through my basic fault injection tests here. > FWIW, your entire patch series made it through XFSQA without any new > regressions, so it looks good from that POV. Thanks, very good to know. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > XFS: fix error inversion problems with data flushing > > XFS gets the sign of the error wrong in several places when > gathering the error from generic linux functions. These functions > return negative error values, while the core XFS code returns > positive error values. Hence when XFS inverts the error to be > returned to the VFS, it can incorrectly invert a negative > error and this error will be ignored by the syscall return. > > Fix all the problems related to calling filemap_* functions. > > Problem initially identified by Nick Piggin in xfs_fsync(). > > Signed-off-by: Dave Chinner <david@fromorbit.com> > --- > fs/xfs/linux-2.6/xfs_fs_subr.c | 23 ++++++++++++++++++++--- > fs/xfs/linux-2.6/xfs_lrw.c | 2 +- > fs/xfs/linux-2.6/xfs_super.c | 13 +++++++++---- > fs/xfs/xfs_vnodeops.c | 2 +- > fs/xfs/xfs_vnodeops.h | 1 + > 5 files changed, 32 insertions(+), 9 deletions(-) > > diff --git a/fs/xfs/linux-2.6/xfs_fs_subr.c b/fs/xfs/linux-2.6/xfs_fs_subr.c > index 36caa6d..5aeb777 100644 > --- a/fs/xfs/linux-2.6/xfs_fs_subr.c > +++ b/fs/xfs/linux-2.6/xfs_fs_subr.c > @@ -24,6 +24,10 @@ int fs_noerr(void) { return 0; } > int fs_nosys(void) { return ENOSYS; } > void fs_noval(void) { return; } > > +/* > + * note: all filemap functions return negative error codes. These > + * need to be inverted before returning to the xfs core functions. > + */ > void > xfs_tosspages( > xfs_inode_t *ip, > @@ -53,7 +57,7 @@ xfs_flushinval_pages( > if (!ret) > truncate_inode_pages(mapping, first); > } > - return ret; > + return -ret; > } > > int > @@ -72,10 +76,23 @@ xfs_flush_pages( > xfs_iflags_clear(ip, XFS_ITRUNCATED); > ret = filemap_fdatawrite(mapping); > if (flags & XFS_B_ASYNC) > - return ret; > + return -ret; > ret2 = filemap_fdatawait(mapping); > if (!ret) > ret = ret2; > } > - return ret; > + return -ret; > +} > + > +int > +xfs_wait_on_pages( > + xfs_inode_t *ip, > + xfs_off_t first, > + xfs_off_t last) > +{ > + struct address_space *mapping = VFS_I(ip)->i_mapping; > + > + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) > + return -filemap_fdatawait(mapping); > + return 0; > } > diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c > index 1957e53..4959c87 100644 > --- a/fs/xfs/linux-2.6/xfs_lrw.c > +++ b/fs/xfs/linux-2.6/xfs_lrw.c > @@ -243,7 +243,7 @@ xfs_read( > > if (unlikely(ioflags & IO_ISDIRECT)) { > if (inode->i_mapping->nrpages) > - ret = xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK), > + ret = -xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK), > -1, FI_REMAPF_LOCKED); > mutex_unlock(&inode->i_mutex); > if (ret) { > diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c > index 9dc977d..c5cfc1e 100644 > --- a/fs/xfs/linux-2.6/xfs_super.c > +++ b/fs/xfs/linux-2.6/xfs_super.c > @@ -1012,21 +1012,26 @@ xfs_fs_write_inode( > struct inode *inode, > int sync) > { > + struct xfs_inode *ip = XFS_I(inode); > int error = 0; > int flags = 0; > > - xfs_itrace_entry(XFS_I(inode)); > + xfs_itrace_entry(ip); > if (sync) { > - filemap_fdatawait(inode->i_mapping); > + error = xfs_wait_on_pages(ip, 0, -1); > + if (error) > + goto out_error; > flags |= FLUSH_SYNC; > } > - error = xfs_inode_flush(XFS_I(inode), flags); > + error = xfs_inode_flush(ip, flags); > + > +out_error: > /* > * if we failed to write out the inode then mark > * it dirty again so we'll try again later. > */ > if (error) > - xfs_mark_inode_dirty_sync(XFS_I(inode)); > + xfs_mark_inode_dirty_sync(ip); > > return -error; > } > diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c > index 2e2fbd9..5809c42 100644 > --- a/fs/xfs/xfs_vnodeops.c > +++ b/fs/xfs/xfs_vnodeops.c > @@ -713,7 +713,7 @@ xfs_fsync( > return XFS_ERROR(EIO); > > /* capture size updates in I/O completion before writing the inode. */ > - error = filemap_fdatawait(VFS_I(ip)->i_mapping); > + error = xfs_wait_on_pages(ip, 0, -1); > if (error) > return XFS_ERROR(error); > > diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h > index e932a96..f9cd376 100644 > --- a/fs/xfs/xfs_vnodeops.h > +++ b/fs/xfs/xfs_vnodeops.h > @@ -78,5 +78,6 @@ int xfs_flushinval_pages(struct xfs_inode *ip, xfs_off_t first, > xfs_off_t last, int fiopt); > int xfs_flush_pages(struct xfs_inode *ip, xfs_off_t first, > xfs_off_t last, uint64_t flags, int fiopt); > +int xfs_wait_on_pages(struct xfs_inode *ip, xfs_off_t first, xfs_off_t last); > > #endif /* _XFS_VNODEOPS_H */ ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 3:26 ` Dave Chinner 2008-10-29 4:11 ` Nick Piggin @ 2008-10-29 9:13 ` Christoph Hellwig 2008-10-29 21:42 ` Dave Chinner 1 sibling, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2008-10-29 9:13 UTC (permalink / raw) To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote: > Ok, I was right - these problems happen all the time. The above call > should really call xfs_flush_pages() to do the flush and wait. I > note that xfs_flush_pages() returns negative errors, and all the > callers expect positive errors. I bet the same occurs for > xfs_flushinval_pages() and xfs_tosspages() which are the wrappers > that core XFS code is supposed to be using for flushing and > invalidating file ranges.... > > I'll write up a patch that covers all of these. Can you also merge xfs_fsync into xfs_file_fsync while you're at it? The split newer made any sense as xfs_fsync is as Linux-specific as it gets and shouldn't be in the pseudo OS-independent layer. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 9:13 ` Christoph Hellwig @ 2008-10-29 21:42 ` Dave Chinner 2008-10-29 21:45 ` Christoph Hellwig 0 siblings, 1 reply; 53+ messages in thread From: Dave Chinner @ 2008-10-29 21:42 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 05:13:26AM -0400, Christoph Hellwig wrote: > On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote: > > Ok, I was right - these problems happen all the time. The above call > > should really call xfs_flush_pages() to do the flush and wait. I > > note that xfs_flush_pages() returns negative errors, and all the > > callers expect positive errors. I bet the same occurs for > > xfs_flushinval_pages() and xfs_tosspages() which are the wrappers > > that core XFS code is supposed to be using for flushing and > > invalidating file ranges.... > > > > I'll write up a patch that covers all of these. > > Can you also merge xfs_fsync into xfs_file_fsync while you're at it? > The split newer made any sense as xfs_fsync is as Linux-specific as it > gets and shouldn't be in the pseudo OS-independent layer. I'll do that as a separate patch - it's not really part of a "fix error value inversion" bug fix.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 21:42 ` Dave Chinner @ 2008-10-29 21:45 ` Christoph Hellwig 2008-10-29 21:53 ` Dave Chinner 0 siblings, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2008-10-29 21:45 UTC (permalink / raw) To: Christoph Hellwig, Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason On Thu, Oct 30, 2008 at 08:42:02AM +1100, Dave Chinner wrote: > I'll do that as a separate patch - it's not really part of a "fix > error value inversion" bug fix.... Well it is kindof. Because we'd call Linux library functions directly from the method we skip to places that need sign conversion and could possibly go wrong. But if you want a separate patch that's fine with me too. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 21:45 ` Christoph Hellwig @ 2008-10-29 21:53 ` Dave Chinner 0 siblings, 0 replies; 53+ messages in thread From: Dave Chinner @ 2008-10-29 21:53 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 05:45:03PM -0400, Christoph Hellwig wrote: > On Thu, Oct 30, 2008 at 08:42:02AM +1100, Dave Chinner wrote: > > I'll do that as a separate patch - it's not really part of a "fix > > error value inversion" bug fix.... > > Well it is kindof. Because we'd call Linux library functions directly > from the method we skip to places that need sign conversion and could > possibly go wrong. But if you want a separate patch that's fine with > me too. Well, the patch I posted to fix the sign problems covered the xfs_flush_pages() and xfs_flushinval_pages() functions that we returning negative errors to core code, and none of the core code inverted that before returning it to functions that inverted it. That's a general problem that affected truncate, direct I/O, getbmap, etc, as well as fsync... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 3:16 ` Dave Chinner 2008-10-29 3:26 ` Dave Chinner @ 2008-10-29 4:00 ` Nick Piggin 2008-10-29 5:27 ` Dave Chinner 2008-10-29 9:12 ` Christoph Hellwig 2 siblings, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-29 4:00 UTC (permalink / raw) To: akpm, xfs, linux-fsdevel, Chris Mason; +Cc: linux-ext4 On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote: > On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote: > > On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote: > > > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote: > > > Yes - that's coming from end_buffer_async_write() when an error is > > > reported in bio completion. This does: > > > > > > 465 set_bit(AS_EIO, &page->mapping->flags); > > > 466 set_buffer_write_io_error(bh); > > > 467 clear_buffer_uptodate(bh); > > > 468 SetPageError(page); > > > > > > Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in > > > wait_on_page_writeback_range() which is appears to be checking the > > > mapping flags for errors. I wonder why that error is not being > > > propagated then? AFAICT both XFS and the fsync code are doing the > > > right thing but somewhere the error has gone missing... > > > > This one-liner has it reporting EIO errors like a champion. I > > don't know if you'll actually need to put this into the > > linux API layer or not, but anyway the root cause of the problem > > AFAIKS is this. > > -- > > > > XFS: fix fsync errors not being propogated back to userspace. > > --- > > Index: linux-2.6/fs/xfs/xfs_vnodeops.c > > =================================================================== > > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c > > +++ linux-2.6/fs/xfs/xfs_vnodeops.c > > @@ -715,7 +715,7 @@ xfs_fsync( > > /* capture size updates in I/O completion before writing the inode. */ > > error = filemap_fdatawait(VFS_I(ip)->i_mapping); > > if (error) > > - return XFS_ERROR(error); > > + return XFS_ERROR(-error); > > <groan> > > Yeah, that'd do it. Good catch. I can't believe I recently fixed a > bug that touched these lines of code without noticing the inversion. > Sometimes I wonder if we should just conver the entire of XFS to > return negative errors - mistakes in handling negative error numbers > in the core XFS code happen all the time. Heh, yeah it seems a bit tricky. BTW. feel free to add Signed-off-by: Nick Piggin <npiggin@suse.de> if you would like to merge it, or alternatively if you fix it another way then I can rerun the test in 2 minutes if it is not 100% obviously correct. BTW. XFS seems to really be a lot more consistent than ext3 (ordered) in reporting EIO with this change in place. I don't know why exactly, but in ext3 I can see hundreds or thousands of failures injected in this manner: FAULT_INJECTION: forcing a failure Call Trace: 9ebe17f8: [<6019f34e>] random32+0xe/0x20 9ebe1808: [<601a3329>] should_fail+0xd9/0x130 9ebe1838: [<6018d234>] generic_make_request+0x304/0x4e0 9ebe1848: [<600622e1>] mempool_alloc+0x51/0x130 9ebe18f8: [<6018e82f>] submit_bio+0x4f/0xe0 9ebe1948: [<600a8969>] submit_bh+0xd9/0x130 9ebe1978: [<600aab08>] __block_write_full_page+0x198/0x310 9ebe1988: [<600a8487>] alloc_buffer_head+0x37/0x50 9ebe19a0: [<600f6ec0>] ext3_get_block+0x0/0x110 9ebe19d8: [<600f6ec0>] ext3_get_block+0x0/0x110 9ebe19e8: [<600aad48>] block_write_full_page+0xc8/0x100 9ebe1a38: [<600f8951>] ext3_ordered_writepage+0xd1/0x180 9ebe1a48: [<6006616b>] set_page_dirty+0x4b/0xd0 9ebe1a88: [<600660f5>] __writepage+0x15/0x40 9ebe1aa8: [<60066755>] write_cache_pages+0x255/0x470 9ebe1ac0: [<600660e0>] __writepage+0x0/0x40 9ebe1bc8: [<60066990>] generic_writepages+0x20/0x30 9ebe1bd8: [<600669db>] do_writepages+0x3b/0x40 9ebe1bf8: [<6006006c>] __filemap_fdatawrite_range+0x5c/0x70 9ebe1c58: [<6006028a>] filemap_fdatawrite+0x1a/0x20 9ebe1c68: [<600a7a15>] do_fsync+0x45/0xe0 9ebe1c98: [<6007792b>] sys_msync+0x14b/0x1d0 9ebe1cf8: [<60019a70>] handle_syscall+0x50/0x80 9ebe1d18: [<6002a10f>] userspace+0x44f/0x510 9ebe1fc8: [<60016792>] fork_handler+0x62/0x70 Buffer I/O error on device ram0, logical block 8228 lost page write due to I/O error on ram0 But eventually the filesystem aborts when we get a journal writing error, and sometimes I see a single EIO to msync, sometimes not (cc ext3 list). Wheras XFS reports a lot more EIOs as it goes... If ext3 guys have any suggestions, I could test them out. > FWIW, the core issue here is that we've got to do the > filemap_fdatawait() call in the ->fsync method because ->fsync > gets called before we've waited for the data I/O to complete. > XFS updates inode state on I/O completion, so we *must* wait > for data I/O to complete before logging the inode changes. I > think btrfs has the same problem.... Interesting. Does that mean you can do without the final filemap_fdatawait? Can you do the first fdatawait without i_mutex held? There was some talk IIRC about moving all this logic into the filesystem to provide more flexibility here. If there is still enough interest, we should get that moving... > Thanks again, Nick. No problem. Thanks, Nick ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 4:00 ` Nick Piggin @ 2008-10-29 5:27 ` Dave Chinner 0 siblings, 0 replies; 53+ messages in thread From: Dave Chinner @ 2008-10-29 5:27 UTC (permalink / raw) To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason, linux-ext4 On Wed, Oct 29, 2008 at 05:00:14AM +0100, Nick Piggin wrote: > On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote: > > FWIW, the core issue here is that we've got to do the > > filemap_fdatawait() call in the ->fsync method because ->fsync > > gets called before we've waited for the data I/O to complete. > > XFS updates inode state on I/O completion, so we *must* wait > > for data I/O to complete before logging the inode changes. I > > think btrfs has the same problem.... > > Interesting. Does that mean you can do without the final filemap_fdatawait? We could, yes. > Can you do the first fdatawait without i_mutex held? I don't see why not - I/O completion is only touching the XFS inode. XFS has it's own inode locks for I/O and inode exclusion, and the mapping tree lock protects the tree walk that the fdatawait is doing... > There was some talk IIRC about moving all this logic into the filesystem > to provide more flexibility here. If there is still enough interest, > we should get that moving... It would simplify some of the code in XFS, that's for sure ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 3:16 ` Dave Chinner 2008-10-29 3:26 ` Dave Chinner 2008-10-29 4:00 ` Nick Piggin @ 2008-10-29 9:12 ` Christoph Hellwig 2008-10-29 9:21 ` Nick Piggin 2 siblings, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2008-10-29 9:12 UTC (permalink / raw) To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote: > Yeah, that'd do it. Good catch. I can't believe I recently fixed a > bug that touched these lines of code without noticing the inversion. > Sometimes I wonder if we should just conver the entire of XFS to > return negative errors - mistakes in handling negative error numbers > in the core XFS code happen all the time. > > FWIW, the core issue here is that we've got to do the > filemap_fdatawait() call in the ->fsync method because ->fsync > gets called before we've waited for the data I/O to complete. > XFS updates inode state on I/O completion, so we *must* wait > for data I/O to complete before logging the inode changes. I > think btrfs has the same problem.... Yes. I have patches to fix this by changing what ->fsync does and how it's called. I really need to get them out on the list. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 9:12 ` Christoph Hellwig @ 2008-10-29 9:21 ` Nick Piggin 2008-10-29 9:44 ` Christoph Hellwig 0 siblings, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-29 9:21 UTC (permalink / raw) To: Christoph Hellwig; +Cc: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 05:12:03AM -0400, Christoph Hellwig wrote: > On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote: > > Yeah, that'd do it. Good catch. I can't believe I recently fixed a > > bug that touched these lines of code without noticing the inversion. > > Sometimes I wonder if we should just conver the entire of XFS to > > return negative errors - mistakes in handling negative error numbers > > in the core XFS code happen all the time. > > > > FWIW, the core issue here is that we've got to do the > > filemap_fdatawait() call in the ->fsync method because ->fsync > > gets called before we've waited for the data I/O to complete. > > XFS updates inode state on I/O completion, so we *must* wait > > for data I/O to complete before logging the inode changes. I > > think btrfs has the same problem.... > > Yes. I have patches to fix this by changing what ->fsync does and > how it's called. I really need to get them out on the list. Please do. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 9:21 ` Nick Piggin @ 2008-10-29 9:44 ` Christoph Hellwig 2008-10-29 10:30 ` Nick Piggin 0 siblings, 1 reply; 53+ messages in thread From: Christoph Hellwig @ 2008-10-29 9:44 UTC (permalink / raw) To: Nick Piggin; +Cc: Christoph Hellwig, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote: > Please do. Well, there's one stumling block I haven't made progress on yet: I've changed the prototype of ->fsync to lose the dentry as we should always have a valid file struct. Except that nfsd doesn't on directories. So I either need to fake up a file there, or bail out and add a ->dir_sync export operation that needs just a dentry. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 9:44 ` Christoph Hellwig @ 2008-10-29 10:30 ` Nick Piggin 2008-10-29 12:22 ` Jamie Lokier 0 siblings, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-29 10:30 UTC (permalink / raw) To: Christoph Hellwig, linux-nfs; +Cc: akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote: > On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote: > > Please do. > > Well, there's one stumling block I haven't made progress on yet: > > I've changed the prototype of ->fsync to lose the dentry as we should > always have a valid file struct. Except that nfsd doesn't on > directories. So I either need to fake up a file there, or bail out > and add a ->dir_sync export operation that needs just a dentry. OK. I don't know much about hthat code, but I would think nfsd should look as close to the syscall layer as possible. I guess there must be something prohibitive (some protocol semantics?). Is there anything that particularly makes it a file operation as opposed to an inode operation? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 10:30 ` Nick Piggin @ 2008-10-29 12:22 ` Jamie Lokier [not found] ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org> 2008-10-29 21:43 ` Dave Chinner 0 siblings, 2 replies; 53+ messages in thread From: Jamie Lokier @ 2008-10-29 12:22 UTC (permalink / raw) To: Nick Piggin Cc: Christoph Hellwig, linux-nfs, akpm, xfs, linux-fsdevel, Chris Mason Nick Piggin wrote: > On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote: > > On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote: > > > Please do. > > > > Well, there's one stumling block I haven't made progress on yet: > > > > I've changed the prototype of ->fsync to lose the dentry as we should > > always have a valid file struct. Except that nfsd doesn't on > > directories. So I either need to fake up a file there, or bail out > > and add a ->dir_sync export operation that needs just a dentry. > > OK. I don't know much about hthat code, but I would think nfsd > should look as close to the syscall layer as possible. I guess > there must be something prohibitive (some protocol semantics?). > > Is there anything that particularly makes it a file operation > as opposed to an inode operation? In principle, is fsync() required to flush all dirty data written through any file descriptor ever, or just dirty data written through the file descriptor used for fsync()? -- Jamie ^ permalink raw reply [flat|nested] 53+ messages in thread
[parent not found: <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org>]
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) [not found] ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org> @ 2008-10-29 13:32 ` Ric Wheeler 2008-10-29 14:56 ` Chris Mason 0 siblings, 1 reply; 53+ messages in thread From: Ric Wheeler @ 2008-10-29 13:32 UTC (permalink / raw) To: Jamie Lokier Cc: Nick Piggin, Christoph Hellwig, linux-nfs-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, xfs-VZNHf3L845pBDgjK7y7TUQ, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Chris Mason Jamie Lokier wrote: > Nick Piggin wrote: > >> On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote: >> >>> On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote: >>> >>>> Please do. >>>> >>> Well, there's one stumling block I haven't made progress on yet: >>> >>> I've changed the prototype of ->fsync to lose the dentry as we should >>> always have a valid file struct. Except that nfsd doesn't on >>> directories. So I either need to fake up a file there, or bail out >>> and add a ->dir_sync export operation that needs just a dentry. >>> >> OK. I don't know much about hthat code, but I would think nfsd >> should look as close to the syscall layer as possible. I guess >> there must be something prohibitive (some protocol semantics?). >> >> Is there anything that particularly makes it a file operation >> as opposed to an inode operation? >> > > In principle, is fsync() required to flush all dirty data written > through any file descriptor ever, or just dirty data written through > the file descriptor used for fsync()? > > -- Jamie > -- > http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html Is a pointer to what seems to be the official posix spec for this - it is definitely per file descriptor, not per file system, etc... What can happen by side effect (depending on the implementation) is that you can actually force out all data for any file. I found that you can approach non-fsync speeds for an fsync per file workload by simply writing all of the files out, then going back and fsync'ing them one at a time (last file first makes a bit of a difference). With that technique, you do get the hard promise of full data integrity and high speed. This is useful when you want to do bulk writes (tar, etc) ric -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 13:32 ` Ric Wheeler @ 2008-10-29 14:56 ` Chris Mason [not found] ` <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org> 0 siblings, 1 reply; 53+ messages in thread From: Chris Mason @ 2008-10-29 14:56 UTC (permalink / raw) To: Ric Wheeler Cc: Jamie Lokier, Nick Piggin, Christoph Hellwig, linux-nfs, akpm, xfs, linux-fsdevel On Wed, 2008-10-29 at 09:32 -0400, Ric Wheeler wrote: > Jamie Lokier wrote: > >> Is there anything that particularly makes it a file operation > >> as opposed to an inode operation? > >> > > > > In principle, is fsync() required to flush all dirty data written > > through any file descriptor ever, or just dirty data written through > > the file descriptor used for fsync()? > > > > -- Jamie > > -- > > > http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html > > Is a pointer to what seems to be the official posix spec for this - it > is definitely per file descriptor, not per file system, etc... > Maybe I'm reading Jamie's question wrong, but I think he's saying: /* open exactly the same file twice */ fd = open("file"); fd2 = open("file"); write(fd, "stuff") write(fd2, "more stuff") fsync(fd); Does the fsync promise "more stuff" will be on disk? I think the answer should be yes. -chris ^ permalink raw reply [flat|nested] 53+ messages in thread
[parent not found: <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org>]
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) [not found] ` <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org> @ 2008-10-30 2:16 ` Nick Piggin [not found] ` <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org> 0 siblings, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-30 2:16 UTC (permalink / raw) To: Chris Mason Cc: Ric Wheeler, Jamie Lokier, Christoph Hellwig, linux-nfs-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, xfs-VZNHf3L845pBDgjK7y7TUQ, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On Wed, Oct 29, 2008 at 10:56:36AM -0400, Chris Mason wrote: > On Wed, 2008-10-29 at 09:32 -0400, Ric Wheeler wrote: > > Jamie Lokier wrote: > > > >> Is there anything that particularly makes it a file operation > > >> as opposed to an inode operation? > > >> > > > > > > In principle, is fsync() required to flush all dirty data written > > > through any file descriptor ever, or just dirty data written through > > > the file descriptor used for fsync()? > > > > > > -- Jamie > > > -- > > > > > http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html > > > > Is a pointer to what seems to be the official posix spec for this - it > > is definitely per file descriptor, not per file system, etc... > > > > Maybe I'm reading Jamie's question wrong, but I think he's saying: > > /* open exactly the same file twice */ > fd = open("file"); > fd2 = open("file"); > > write(fd, "stuff") > write(fd2, "more stuff") > fsync(fd); > > Does the fsync promise "more stuff" will be on disk? I think the answer > should be yes. I think so. And this is in the context of making ->fsync an inode operation and avoid the NFS NULL-file problem... I don't think there is any fd specific metadata that fsync has to deal with? Any other reasons it has to be a file operation? -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 53+ messages in thread
[parent not found: <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>]
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) [not found] ` <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org> @ 2008-10-30 12:51 ` jim owens 2008-10-30 13:41 ` Jim Rees 0 siblings, 1 reply; 53+ messages in thread From: jim owens @ 2008-10-30 12:51 UTC (permalink / raw) To: Nick Piggin Cc: Chris Mason, Ric Wheeler, Jamie Lokier, Christoph Hellwig, linux-nfs-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, xfs-VZNHf3L845pBDgjK7y7TUQ, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Nick Piggin wrote: > On Wed, Oct 29, 2008 at 10:56:36AM -0400, Chris Mason wrote: >> On Wed, 2008-10-29 at 09:32 -0400, Ric Wheeler wrote: >>> Jamie Lokier wrote: >>>>> Is there anything that particularly makes it a file operation >>>>> as opposed to an inode operation? >>>>> >>>> In principle, is fsync() required to flush all dirty data written >>>> through any file descriptor ever, or just dirty data written through >>>> the file descriptor used for fsync()? >>>> >>>> -- Jamie >>>> -- >>>> >>> http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html >>> >>> Is a pointer to what seems to be the official posix spec for this - it >>> is definitely per file descriptor, not per file system, etc... >>> >> Maybe I'm reading Jamie's question wrong, but I think he's saying: >> >> /* open exactly the same file twice */ >> fd = open("file"); >> fd2 = open("file"); >> >> write(fd, "stuff") >> write(fd2, "more stuff") >> fsync(fd); >> >> Does the fsync promise "more stuff" will be on disk? I think the answer >> should be yes. > > I think so. And this is in the context of making ->fsync an inode > operation and avoid the NFS NULL-file problem... I don't think there > is any fd specific metadata that fsync has to deal with? Any other > reasons it has to be a file operation? NO, or at least *not the posix definition*. It is normal in unix-like operating systems to always flush everything dirty on the inode no matter what stream it arrived on. Flushing everything is permitted but not the requirement so applications must not expect this is *promised* or they will not be portable. It is only guaranteed that "stuff" in this example will be on disk. AFAIK the fsync semantic comes from the days of dinosaurs, mainframes, and minicomputers... when a lot of operating systems had user-space libraries that buffered the I/O. On fsync(fd), the "fd2" data would still be in user-space. jim -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-30 12:51 ` jim owens @ 2008-10-30 13:41 ` Jim Rees 0 siblings, 0 replies; 53+ messages in thread From: Jim Rees @ 2008-10-30 13:41 UTC (permalink / raw) To: jim owens Cc: Nick Piggin, Chris Mason, Ric Wheeler, Jamie Lokier, Christoph Hellwig, linux-nfs, akpm, xfs, linux-fsdevel jim owens wrote: AFAIK the fsync semantic comes from the days of dinosaurs, mainframes, and minicomputers... when a lot of operating systems had user-space libraries that buffered the I/O. On fsync(fd), the "fd2" data would still be in user-space. User space buffering happens in stdio, which is above the system call level. It's been that way since fsync() was first introduced, and is still that way today. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 12:22 ` Jamie Lokier [not found] ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org> @ 2008-10-29 21:43 ` Dave Chinner 1 sibling, 0 replies; 53+ messages in thread From: Dave Chinner @ 2008-10-29 21:43 UTC (permalink / raw) To: Jamie Lokier Cc: Nick Piggin, Christoph Hellwig, linux-nfs, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 12:22:35PM +0000, Jamie Lokier wrote: > Nick Piggin wrote: > > On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote: > > > On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote: > > > > Please do. > > > > > > Well, there's one stumling block I haven't made progress on yet: > > > > > > I've changed the prototype of ->fsync to lose the dentry as we should > > > always have a valid file struct. Except that nfsd doesn't on > > > directories. So I either need to fake up a file there, or bail out > > > and add a ->dir_sync export operation that needs just a dentry. > > > > OK. I don't know much about hthat code, but I would think nfsd > > should look as close to the syscall layer as possible. I guess > > there must be something prohibitive (some protocol semantics?). > > > > Is there anything that particularly makes it a file operation > > as opposed to an inode operation? > > In principle, is fsync() required to flush all dirty data written > through any file descriptor ever, or just dirty data written through > the file descriptor used for fsync()? fsync() is required to flush the data that is dirty at the time of the call, as well as any associated metadata needed to reference that data. It doesn't matter who wrote the data in the first place.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 22:27 ` Dave Chinner 2008-10-29 0:04 ` Nick Piggin 2008-10-29 0:16 ` Nick Piggin @ 2008-10-29 8:51 ` Dave Chinner 2 siblings, 0 replies; 53+ messages in thread From: Dave Chinner @ 2008-10-29 8:51 UTC (permalink / raw) To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote: > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote: > > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote: > > I think I also saw a slab bug when running dbench with fault injection on. > > Running latest Linus kernel. ..... > > xfs_force_shutdown(ram0,0x2) called from line 1056 of file /home/npiggin/usr/src > > /linux-2.6/fs/xfs/xfs_log.c. Return address = 0x000000006011370d > > Filesystem "ram0": Log I/O Error Detected. Shutting down filesystem: ram0 > > Please umount the filesystem, and rectify the problem(s) > > xfs_force_shutdown(ram0,0x2) called from line 818 of file /home/npiggin/usr/src/ > > linux-2.6/fs/xfs/xfs_log.c. Return address = 0x0000000060114e7d > > slab error in verify_redzone_free(): cache `xfs_log_ticket': double free detecte > > d .... > Now that is interesting. > > We've got a rolling transaction in progress, and the commit of the > first part of the transaction has got the I/O error. That frees the > transaction structure used during that commit, as well as the > ticket. > > However, before we committed the initial transaction, we duplicated > the transaction structure to allow the transaction to continue to > track all the dirty objects in the first commit. That included > duplicating the pointer to the ticket. > > Then the EIO is returned to mkdir code with the duplicated > transaction, which is then cancelled, and that frees the transaction > and the ticket it holds. However, we'd already freed the ticket. > > Ok, we're only seeing this problem now because I recently modified > the ticket allocation to use a slab instead of a roll-your-own free > list structure that wouldn't have been poisoned. Nice to know that > this change did more than just remove code. ;) > > This might take a little while to fix - a lot of code needs > auditing - but thanks for reporting the problem. The patch below should fix this problem. It's about 10% of the way through XFSQA, which means there's nothing obviously bad about it, but it will need many more eyes on it before we decide if this is the best way to fix the problem.... Cheers, Dave. -- Dave Chinner david@fromorbit.com XFS: Fix double free of log tickets When an I/O error occurs during an intermediate commit on a rolling transaction, xfs_trans_commit() will free the transaction structure and the related ticket. However, the duplicate transaction that gets used as the transaction continues still contains a pointer to the ticket. Hence when the duplicate transaction is cancelled and freed, we free the ticket a second time. Add reference counting to the ticket so that we hold an extra reference to the ticket over the transaction commit. We drop the extra reference once we have checked that the transaction commit did not return an error, thus avoiding a double free on commit error. Credit to Nick Piggin for tripping over the problem. Signed-off-by: Dave Chinner <david@fromorbit.com> --- fs/xfs/xfs_bmap.c | 10 ++++++++-- fs/xfs/xfs_inode.c | 10 ++++++++-- fs/xfs/xfs_log.c | 39 +++++++++++++++++++++++++-------------- fs/xfs/xfs_log.h | 4 ++++ fs/xfs/xfs_log_priv.h | 1 + fs/xfs/xfs_trans.c | 9 ++++++++- fs/xfs/xfs_utils.c | 6 ++++++ fs/xfs/xfs_vnodeops.c | 6 ++++++ 8 files changed, 66 insertions(+), 19 deletions(-) diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c index db28905..c391221 100644 --- a/fs/xfs/xfs_bmap.c +++ b/fs/xfs/xfs_bmap.c @@ -4292,9 +4292,15 @@ xfs_bmap_finish( * We have a new transaction, so we should return committed=1, * even though we're returning an error. */ - if (error) { + if (error) return error; - } + + /* + * transaction commit worked ok so we can drop the extra ticket + * reference that we gained in xfs_trans_dup() + */ + xfs_log_ticket_put(ntp->t_ticket); + if ((error = xfs_trans_reserve(ntp, 0, logres, 0, XFS_TRANS_PERM_LOG_RES, logcount))) return error; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index cd52282..b977100 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1782,8 +1782,14 @@ xfs_itruncate_finish( xfs_trans_ijoin(ntp, ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL); xfs_trans_ihold(ntp, ip); - if (!error) - error = xfs_trans_reserve(ntp, 0, + if (error) + return error; + /* + * transaction commit worked ok so we can drop the extra ticket + * reference that we gained in xfs_trans_dup() + */ + xfs_log_ticket_put(ntp->t_ticket); + error = xfs_trans_reserve(ntp, 0, XFS_ITRUNCATE_LOG_RES(mp), 0, XFS_TRANS_PERM_LOG_RES, XFS_ITRUNCATE_LOG_COUNT); diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index cc1e789..9aecefd 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -100,12 +100,11 @@ STATIC void xlog_ungrant_log_space(xlog_t *log, /* local ticket functions */ -STATIC xlog_ticket_t *xlog_ticket_get(xlog_t *log, +STATIC xlog_ticket_t *xlog_ticket_alloc(xlog_t *log, int unit_bytes, int count, char clientid, uint flags); -STATIC void xlog_ticket_put(xlog_t *log, xlog_ticket_t *ticket); #if defined(DEBUG) STATIC void xlog_verify_dest_ptr(xlog_t *log, __psint_t ptr); @@ -360,7 +359,7 @@ xfs_log_done(xfs_mount_t *mp, */ xlog_trace_loggrant(log, ticket, "xfs_log_done: (non-permanent)"); xlog_ungrant_log_space(log, ticket); - xlog_ticket_put(log, ticket); + xfs_log_ticket_put(ticket); } else { xlog_trace_loggrant(log, ticket, "xfs_log_done: (permanent)"); xlog_regrant_reserve_log_space(log, ticket); @@ -514,7 +513,7 @@ xfs_log_reserve(xfs_mount_t *mp, retval = xlog_regrant_write_log_space(log, internal_ticket); } else { /* may sleep if need to allocate more tickets */ - internal_ticket = xlog_ticket_get(log, unit_bytes, cnt, + internal_ticket = xlog_ticket_alloc(log, unit_bytes, cnt, client, flags); if (!internal_ticket) return XFS_ERROR(ENOMEM); @@ -749,7 +748,7 @@ xfs_log_unmount_write(xfs_mount_t *mp) if (tic) { xlog_trace_loggrant(log, tic, "unmount rec"); xlog_ungrant_log_space(log, tic); - xlog_ticket_put(log, tic); + xfs_log_ticket_put(tic); } } else { /* @@ -3223,22 +3222,33 @@ xlog_state_want_sync(xlog_t *log, xlog_in_core_t *iclog) */ /* - * Free a used ticket. + * Free a used ticket when it's refcount falls to zero. */ -STATIC void -xlog_ticket_put(xlog_t *log, - xlog_ticket_t *ticket) +void +xfs_log_ticket_put( + xlog_ticket_t *ticket) { - sv_destroy(&ticket->t_wait); - kmem_zone_free(xfs_log_ticket_zone, ticket); -} /* xlog_ticket_put */ + ASSERT(atomic_read(&ticket->t_ref) > 0); + if (atomic_dec_and_test(&ticket->t_ref)) { + sv_destroy(&ticket->t_wait); + kmem_zone_free(xfs_log_ticket_zone, ticket); + } +} +xlog_ticket_t * +xfs_log_ticket_get( + xlog_ticket_t *ticket) +{ + ASSERT(atomic_read(&ticket->t_ref) > 0); + atomic_inc(&ticket->t_ref); + return ticket; +} /* * Allocate and initialise a new log ticket. */ STATIC xlog_ticket_t * -xlog_ticket_get(xlog_t *log, +xlog_ticket_alloc(xlog_t *log, int unit_bytes, int cnt, char client, @@ -3309,6 +3319,7 @@ xlog_ticket_get(xlog_t *log, unit_bytes += 2*BBSIZE; } + atomic_set(&tic->t_ref, 1); tic->t_unit_res = unit_bytes; tic->t_curr_res = unit_bytes; tic->t_cnt = cnt; @@ -3324,7 +3335,7 @@ xlog_ticket_get(xlog_t *log, xlog_tic_reset_res(tic); return tic; -} /* xlog_ticket_get */ +} /****************************************************************************** diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index d47b91f..8a3e84e 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -134,6 +134,7 @@ typedef struct xfs_log_callback { #ifdef __KERNEL__ /* Log manager interfaces */ struct xfs_mount; +struct xlog_ticket; xfs_lsn_t xfs_log_done(struct xfs_mount *mp, xfs_log_ticket_t ticket, void **iclog, @@ -177,6 +178,9 @@ int xfs_log_need_covered(struct xfs_mount *mp); void xlog_iodone(struct xfs_buf *); +struct xlog_ticket * xfs_log_ticket_get(struct xlog_ticket *ticket); +void xfs_log_ticket_put(struct xlog_ticket *ticket); + #endif diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index de7ef6c..b39a198 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -245,6 +245,7 @@ typedef struct xlog_ticket { struct xlog_ticket *t_next; /* :4|8 */ struct xlog_ticket *t_prev; /* :4|8 */ xlog_tid_t t_tid; /* transaction identifier : 4 */ + atomic_t t_ref; /* ticket reference count : 4 */ int t_curr_res; /* current reservation in bytes : 4 */ int t_unit_res; /* unit reservation in bytes : 4 */ char t_ocnt; /* original count : 1 */ diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index ad137ef..8570b82 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -290,7 +290,7 @@ xfs_trans_dup( ASSERT(tp->t_ticket != NULL); ntp->t_flags = XFS_TRANS_PERM_LOG_RES | (tp->t_flags & XFS_TRANS_RESERVE); - ntp->t_ticket = tp->t_ticket; + ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket); ntp->t_blk_res = tp->t_blk_res - tp->t_blk_res_used; tp->t_blk_res = tp->t_blk_res_used; ntp->t_rtx_res = tp->t_rtx_res - tp->t_rtx_res_used; @@ -1260,6 +1260,13 @@ xfs_trans_roll( trans = *tpp; /* + * transaction commit worked ok so we can drop the extra ticket + * reference that we gained in xfs_trans_dup() + */ + xfs_log_ticket_put(trans->t_ticket); + + + /* * Reserve space in the log for th next transaction. * This also pushes items in the "AIL", the list of logged items, * out to disk if they are taking up space at the tail of the log diff --git a/fs/xfs/xfs_utils.c b/fs/xfs/xfs_utils.c index 35d4d41..7711449 100644 --- a/fs/xfs/xfs_utils.c +++ b/fs/xfs/xfs_utils.c @@ -172,6 +172,12 @@ xfs_dir_ialloc( *ipp = NULL; return code; } + + /* + * transaction commit worked ok so we can drop the extra ticket + * reference that we gained in xfs_trans_dup() + */ + xfs_log_ticket_put(tp->t_ticket); code = xfs_trans_reserve(tp, 0, log_res, 0, XFS_TRANS_PERM_LOG_RES, log_count); /* diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c index 5809c42..f7eea70 100644 --- a/fs/xfs/xfs_vnodeops.c +++ b/fs/xfs/xfs_vnodeops.c @@ -1029,6 +1029,12 @@ xfs_inactive_symlink_rmt( goto error0; } /* + * transaction commit worked ok so we can drop the extra ticket + * reference that we gained in xfs_trans_dup() + */ + xfs_log_ticket_put(tp->t_ticket); + + /* * Remove the memory for extent descriptions (just bookkeeping). */ if (ip->i_df.if_bytes) ^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin ` (9 preceding siblings ...) 2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin @ 2008-10-28 23:14 ` Dave Chinner 2008-10-28 23:57 ` Nick Piggin 10 siblings, 1 reply; 53+ messages in thread From: Dave Chinner @ 2008-10-28 23:14 UTC (permalink / raw) To: npiggin; +Cc: akpm, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote: > OK, I'm happier with this patchset now. Note that I've taken your patch > and mangled it a bit at the end of the series. > > This one survives and seems to run OK here, but I'm mainly doing dumb > stress testing with a handful of filesystems, and data-io error injection > testing. There are a lot of combinations of ways this function can operate > and interact obviously, so it would be helpful to get more review. > > Chris, would you possibly have time to run your btrfs tests that are > sensitive to problems in this code? I could provide you a single patch > rollup against mainline if it helps. Nick, after applying the patchset: CC mm/page-writeback.o mm/page-writeback.c: In function write_cache_pages: mm/page-writeback.c:871: warning: wrteback_index may be used uninitialized in this function Looks harmless, but it probably should be cleaned up. Compiler is 'gcc version 4.3.1 (Debian 4.3.1-9)' Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 23:14 ` Dave Chinner @ 2008-10-28 23:57 ` Nick Piggin 2008-10-29 0:05 ` Andrew Morton 0 siblings, 1 reply; 53+ messages in thread From: Nick Piggin @ 2008-10-28 23:57 UTC (permalink / raw) To: akpm, linux-fsdevel, Chris Mason On Wed, Oct 29, 2008 at 10:14:35AM +1100, Dave Chinner wrote: > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote: > > OK, I'm happier with this patchset now. Note that I've taken your patch > > and mangled it a bit at the end of the series. > > > > This one survives and seems to run OK here, but I'm mainly doing dumb > > stress testing with a handful of filesystems, and data-io error injection > > testing. There are a lot of combinations of ways this function can operate > > and interact obviously, so it would be helpful to get more review. > > > > Chris, would you possibly have time to run your btrfs tests that are > > sensitive to problems in this code? I could provide you a single patch > > rollup against mainline if it helps. > > Nick, after applying the patchset: > > CC mm/page-writeback.o > mm/page-writeback.c: In function write_cache_pages: > mm/page-writeback.c:871: warning: wrteback_index may be used uninitialized in this function > > Looks harmless, but it probably should be cleaned up. > Compiler is 'gcc version 4.3.1 (Debian 4.3.1-9)' Yeah, it's annoying but I couldn't find a way to shut it up nicely (yes AFAIKS it is harmless). I could just put the old "= 0; /* shut up gcc*/" trick there, but that sucks too ;( ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-28 23:57 ` Nick Piggin @ 2008-10-29 0:05 ` Andrew Morton 2008-10-29 0:10 ` Nick Piggin 0 siblings, 1 reply; 53+ messages in thread From: Andrew Morton @ 2008-10-29 0:05 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, chris.mason On Wed, 29 Oct 2008 00:57:55 +0100 Nick Piggin <npiggin@suse.de> wrote: > On Wed, Oct 29, 2008 at 10:14:35AM +1100, Dave Chinner wrote: > > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote: > > > OK, I'm happier with this patchset now. Note that I've taken your patch > > > and mangled it a bit at the end of the series. > > > > > > This one survives and seems to run OK here, but I'm mainly doing dumb > > > stress testing with a handful of filesystems, and data-io error injection > > > testing. There are a lot of combinations of ways this function can operate > > > and interact obviously, so it would be helpful to get more review. > > > > > > Chris, would you possibly have time to run your btrfs tests that are > > > sensitive to problems in this code? I could provide you a single patch > > > rollup against mainline if it helps. > > > > Nick, after applying the patchset: > > > > CC mm/page-writeback.o > > mm/page-writeback.c: In function write_cache_pages: > > mm/page-writeback.c:871: warning: wrteback_index may be used uninitialized in this function > > > > Looks harmless, but it probably should be cleaned up. > > Compiler is 'gcc version 4.3.1 (Debian 4.3.1-9)' > > Yeah, it's annoying but I couldn't find a way to shut it up nicely > (yes AFAIKS it is harmless). > > I could just put the old "= 0; /* shut up gcc*/" trick there, but that > sucks too ;( > uninitialized_var()? ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch 0/9] writeback data integrity and other fixes (take 3) 2008-10-29 0:05 ` Andrew Morton @ 2008-10-29 0:10 ` Nick Piggin 0 siblings, 0 replies; 53+ messages in thread From: Nick Piggin @ 2008-10-29 0:10 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-fsdevel, chris.mason On Tue, Oct 28, 2008 at 05:05:31PM -0700, Andrew Morton wrote: > On Wed, 29 Oct 2008 00:57:55 +0100 > Nick Piggin <npiggin@suse.de> wrote: > > > > Yeah, it's annoying but I couldn't find a way to shut it up nicely > > (yes AFAIKS it is harmless). > > > > I could just put the old "= 0; /* shut up gcc*/" trick there, but that > > sucks too ;( > > > > uninitialized_var()? OK, that's a little bit better. Still somewhat subject to bitrot and hiding of real bugs, but I guess you can disable it easly. Will do. ^ permalink raw reply [flat|nested] 53+ messages in thread
end of thread, other threads:[~2008-11-01 8:04 UTC | newest]
Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin
2008-10-29 0:24 ` [patch 1.1/9] mm: write_cache_pages cyclic fix fix Nick Piggin
2008-10-28 14:47 ` [patch 2/9] mm: write_cache_pages early loop termination npiggin
2008-10-28 14:47 ` [patch 3/9] mm: write_cache_pages writepage error fix npiggin
2008-10-28 14:47 ` [patch 4/9] mm: write_cache_pages integrity fix npiggin
2008-10-28 14:47 ` [patch 5/9] mm: write_cache_pages cleanups npiggin
2008-10-28 14:47 ` [patch 6/9] mm: write_cache_pages optimise page cleaning npiggin
2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin
2008-10-30 23:07 ` Andrew Morton
2008-10-31 7:29 ` Nick Piggin
2008-10-28 14:47 ` [patch 8/9] mm: write_cache_pages more " npiggin
2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin
2008-10-30 23:13 ` Andrew Morton
2008-10-31 9:16 ` Nick Piggin
2008-10-31 10:04 ` Andrew Morton
2008-10-31 10:53 ` Nick Piggin
2008-10-31 20:03 ` Jamie Lokier
2008-10-31 14:10 ` Chris Mason
2008-10-31 14:30 ` steve
2008-10-31 15:02 ` Chris Mason
2008-11-01 8:04 ` Nick Piggin
2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin
2008-10-28 22:27 ` Dave Chinner
2008-10-29 0:04 ` Nick Piggin
2008-10-29 0:16 ` Nick Piggin
2008-10-29 3:16 ` Dave Chinner
2008-10-29 3:26 ` Dave Chinner
2008-10-29 4:11 ` Nick Piggin
2008-10-29 4:57 ` Dave Chinner
2008-10-29 5:06 ` Nick Piggin
2008-10-29 9:13 ` Christoph Hellwig
2008-10-29 21:42 ` Dave Chinner
2008-10-29 21:45 ` Christoph Hellwig
2008-10-29 21:53 ` Dave Chinner
2008-10-29 4:00 ` Nick Piggin
2008-10-29 5:27 ` Dave Chinner
2008-10-29 9:12 ` Christoph Hellwig
2008-10-29 9:21 ` Nick Piggin
2008-10-29 9:44 ` Christoph Hellwig
2008-10-29 10:30 ` Nick Piggin
2008-10-29 12:22 ` Jamie Lokier
[not found] ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org>
2008-10-29 13:32 ` Ric Wheeler
2008-10-29 14:56 ` Chris Mason
[not found] ` <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org>
2008-10-30 2:16 ` Nick Piggin
[not found] ` <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>
2008-10-30 12:51 ` jim owens
2008-10-30 13:41 ` Jim Rees
2008-10-29 21:43 ` Dave Chinner
2008-10-29 8:51 ` Dave Chinner
2008-10-28 23:14 ` Dave Chinner
2008-10-28 23:57 ` Nick Piggin
2008-10-29 0:05 ` Andrew Morton
2008-10-29 0:10 ` Nick Piggin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).