[patch 0/9] writeback data integrity and other fixes (take 3)

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch 0/9] writeback data integrity and other fixes (take 3)
@ 2008-10-28 14:47 npiggin
  2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin
                   ` (10 more replies)
  0 siblings, 11 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

OK, I'm happier with this patchset now. Note that I've taken your patch
and mangled it a bit at the end of the series.

This one survives and seems to run OK here, but I'm mainly doing dumb
stress testing with a handful of filesystems, and data-io error injection
testing. There are a lot of combinations of ways this function can operate
and interact obviously, so it would be helpful to get more review.

Chris, would you possibly have time to run your btrfs tests that are
sensitive to problems in this code? I could provide you a single patch
rollup against mainline if it helps.

-- 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 1/9] mm: write_cache_pages cyclic fix
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-29  0:24   ` [patch 1.1/9] mm: write_cache_pages cyclic fix fix Nick Piggin
  2008-10-28 14:47 ` [patch 2/9] mm: write_cache_pages early loop termination npiggin
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-cyclic-fix.patch --]
[-- Type: text/plain, Size: 2744 bytes --]

In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback
has circled through zero, thus we should not circle again. However it gets set
to 1 after the first successful pagevec lookup. This leads to cases where not
enough data gets written.

Counterexample: file with first 10 pages dirty, writeback_index == 5,
nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set
to 1, after writing those out, we will not cycle back to get the first 5.

Rework this logic, now we'll always cycle unless we started off from index 0.
When cycling, only write out as far as 1 page before the start page from the
first cycle (so we don't write parts of the file twice).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -868,9 +868,10 @@ int write_cache_pages(struct address_spa
 	int done = 0;
 	struct pagevec pvec;
 	int nr_pages;
+	pgoff_t writeback_index;
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
-	int scanned = 0;
+	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
 
@@ -881,14 +882,19 @@ int write_cache_pages(struct address_spa
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
+		writeback_index = mapping->writeback_index; /* prev offset */
+		index = writeback_index;
+		if (index == 0)
+			cycled = 1;
+		else
+			cycled = 0;
 		end = -1;
 	} else {
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
 		end = wbc->range_end >> PAGE_CACHE_SHIFT;
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
-		scanned = 1;
+		cycled = 1; /* ignore range_cyclic tests */
 	}
 retry:
 	while (!done && (index <= end) &&
@@ -897,7 +903,6 @@ retry:
 					      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		unsigned i;
 
-		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
@@ -915,7 +920,11 @@ retry:
 				continue;
 			}
 
-			if (!wbc->range_cyclic && page->index > end) {
+			if (page->index > end) {
+				/*
+				 * can't be range_cyclic (1st pass) because
+				 * end == -1 in that case.
+				 */
 				done = 1;
 				unlock_page(page);
 				continue;
@@ -946,13 +955,15 @@ retry:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-	if (!scanned && !done) {
+	if (!cycled) {
 		/*
+		 * range_cyclic:
 		 * We hit the last page and there is more work to be done: wrap
 		 * back to the start of the file
 		 */
-		scanned = 1;
+		cycled = 1;
 		index = 0;
+		end = writeback_index - 1;
 		goto retry;
 	}
 	if (!wbc->no_nrwrite_index_update) {

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 1.1/9] mm: write_cache_pages cyclic fix fix
  2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin
@ 2008-10-29  0:24   ` Nick Piggin
  0 siblings, 0 replies; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  0:24 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

mm: write_cache_pages cyclic fix fix

shut up gcc

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -868,7 +868,7 @@ int write_cache_pages(struct address_spa
 	int done = 0;
 	struct pagevec pvec;
 	int nr_pages;
-	pgoff_t writeback_index;
+	pgoff_t uninitialized_var(writeback_index);
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
 	int cycled;

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 2/9] mm: write_cache_pages early loop termination
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
  2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-28 14:47 ` [patch 3/9] mm: write_cache_pages writepage error fix npiggin
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-breakearly-fix.patch --]
[-- Type: text/plain, Size: 1999 bytes --]

We'd like to break out of the loop early in many situations, however the
existing code has been setting mapping->writeback_index past the final page
in the pagevec lookup for cyclic writeback. This is a problem if we don't
process all pages up to the final page.

Currently the code mostly keeps writeback_index reasonable and hacked around
this by not breaking out of the loop or writing pages outside the range in
these cases. Keep track of a real "done index" that enables us to terminate
the loop in a much more flexible manner.

Needed by the subsequent patch to preserve writepage errors, and then further
patches to break out of the loop early for other reasons. However there
are no functional changes with this patch alone.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -871,6 +871,7 @@ int write_cache_pages(struct address_spa
 	pgoff_t writeback_index;
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
+	pgoff_t done_index;
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
@@ -897,6 +898,7 @@ int write_cache_pages(struct address_spa
 		cycled = 1; /* ignore range_cyclic tests */
 	}
 retry:
+	done_index = index;
 	while (!done && (index <= end) &&
 	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
 					      PAGECACHE_TAG_DIRTY,
@@ -906,6 +908,8 @@ retry:
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
+			done_index = page->index + 1;
+
 			/*
 			 * At this point we hold neither mapping->tree_lock nor
 			 * lock on the page itself: the page may be truncated or
@@ -968,7 +972,7 @@ retry:
 	}
 	if (!wbc->no_nrwrite_index_update) {
 		if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
-			mapping->writeback_index = index;
+			mapping->writeback_index = done_index;
 		wbc->nr_to_write = nr_to_write;
 	}
 

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 3/9] mm: write_cache_pages writepage error fix
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
  2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin
  2008-10-28 14:47 ` [patch 2/9] mm: write_cache_pages early loop termination npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-28 14:47 ` [patch 4/9] mm: write_cache_pages integrity fix npiggin
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-writepage-error-fix.patch --]
[-- Type: text/plain, Size: 1571 bytes --]

In write_cache_pages, if ret signals a real error, but we still have some pages
left in the pagevec, done would be set to 1, but the remaining pages would
continue to be processed and ret will be overwritten in the process. It could
easily be overwritten with success, and thus success will be returned even if
there is an error. Thus the caller is told all writes succeeded, wheras in
reality some did not.

Fix this by bailing immediately if there is an error, and retaining the first
error code.

This is a data interity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -944,12 +944,26 @@ retry:
 			}
 
 			ret = (*writepage)(page, wbc, data);
+			if (unlikely(ret)) {
+				if (ret == AOP_WRITEPAGE_ACTIVATE) {
+					unlock_page(page);
+					ret = 0;
+				} else {
+					/*
+					 * done_index is set past this page,
+					 * so media errors will not choke
+					 * background writeout for the entire
+					 * file. This has consequences for
+					 * range_cyclic semantics (ie. it may
+					 * not be suitable for data integrity
+					 * writeout).
+					 */
+					done = 1;
+					break;
+				}
+ 			}
 
-			if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
-				unlock_page(page);
-				ret = 0;
-			}
-			if (ret || (--nr_to_write <= 0))
+			if (--nr_to_write <= 0)
 				done = 1;
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 4/9] mm: write_cache_pages integrity fix
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (2 preceding siblings ...)
  2008-10-28 14:47 ` [patch 3/9] mm: write_cache_pages writepage error fix npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-28 14:47 ` [patch 5/9] mm: write_cache_pages cleanups npiggin
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-integrity-fix.patch --]
[-- Type: text/plain, Size: 2785 bytes --]

In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so
the function will return success after writing out nr_to_write pages, even if
that was not sufficient to guarantee data integrity.

The callers tend to set it to values that could break data interity semantics
easily in practice. For example, nr_to_write can be set to mapping->nr_pages *
2, however if a file has a single, dirty page, then fsync is called, subsequent
pages might be concurrently added and dirtied, then write_cache_pages might
writeout two of these newly dirty pages, while not writing out the old page
that should have been written out.

Fix this by ignoring nr_to_write if it is a data integrity sync.

This is a data interity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
The reason this has been done in the past is to avoid stalling sync operations
behind page dirtiers.

 "If a file has one dirty page at offset 1000000000000000 then someone
  does an fsync() and someone else gets in first and starts madly writing
  pages at offset 0, we want to write that page at 1000000000000000. 
  Somehow."

What we to today is return success after an arbitrary amount of pages are
written, whether or not we have provided the data-integrity semantics that
the caller has asked for. Even this doesn't actually fix all stall cases
completely: in the above situation, if the file has a huge number of pages
in pagecache (but not dirty), then mapping->nrpages is going to be huge,
even if pages are being dirtied.

This change does indeed make the possibility of long stalls lager, and that's
not a good thing, but lying about data integrity is even worse. We have to
either perform the sync, or return -ELINUXISLAME so at least the caller knows
what has happened.

There are subsequent competing approaches in the works to solve the stall
problems properly, without compromising data integrity.

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -963,8 +963,10 @@ retry:
 				}
  			}

-			if (--nr_to_write <= 0)
-				done = 1;
+			if (wbc->sync_mode == WB_SYNC_NONE) {
+				if (--wbc->nr_to_write <= 0)
+					done = 1;
+			}
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -210,7 +210,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = LONG_MAX,
 		.range_start = start,
 		.range_end = end,
 	};

-- 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 5/9] mm: write_cache_pages cleanups
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (3 preceding siblings ...)
  2008-10-28 14:47 ` [patch 4/9] mm: write_cache_pages integrity fix npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-28 14:47 ` [patch 6/9] mm: write_cache_pages optimise page cleaning npiggin
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-cleanup.patch --]
[-- Type: text/plain, Size: 2190 bytes --]

Get rid of some complex expressions from flow control statements, add a
comment, remove some duplicate code.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -899,11 +899,14 @@ int write_cache_pages(struct address_spa
 	}
 retry:
 	done_index = index;
-	while (!done && (index <= end) &&
-	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-					      PAGECACHE_TAG_DIRTY,
-					      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
-		unsigned i;
+	while (!done && (index <= end)) {
+		int i;
+
+		nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+			      PAGECACHE_TAG_DIRTY,
+			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+		if (nr_pages == 0)
+			break;
 
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
@@ -919,7 +922,16 @@ retry:
 			 */
 			lock_page(page);
 
+			/*
+			 * Page truncated or invalidated. We can freely skip it
+			 * then, even for data integrity operations: the page
+			 * has disappeared concurrently, so there could be no
+			 * real expectation of this data interity operation
+			 * even if there is now a new, dirty page at the same
+			 * pagecache address.
+			 */
 			if (unlikely(page->mapping != mapping)) {
+continue_unlock:
 				unlock_page(page);
 				continue;
 			}
@@ -930,18 +942,15 @@ retry:
 				 * end == -1 in that case.
 				 */
 				done = 1;
-				unlock_page(page);
-				continue;
+				goto continue_unlock;
 			}
 
 			if (wbc->sync_mode != WB_SYNC_NONE)
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-			    !clear_page_dirty_for_io(page)) {
-				unlock_page(page);
-				continue;
-			}
+			    !clear_page_dirty_for_io(page))
+				goto continue_unlock;
 
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
@@ -964,7 +973,8 @@ retry:
  			}
 
 			if (wbc->sync_mode == WB_SYNC_NONE) {
-				if (--wbc->nr_to_write <= 0)
+				wbc->nr_to_write--;
+				if (wbc->nr_to_write <= 0)
 					done = 1;
 			}
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 6/9] mm: write_cache_pages optimise page cleaning
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (4 preceding siblings ...)
  2008-10-28 14:47 ` [patch 5/9] mm: write_cache_pages cleanups npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-writeback-clean-opt.patch --]
[-- Type: text/plain, Size: 1314 bytes --]

In write_cache_pages, if we get stuck behind another process that is cleaning
pages, we will be forced to wait for them to finish, then perform our own
writeout (if it was redirtied during the long wait), then wait for that.

If a page under writeout is still clean, we can skip waiting for it (if we're
part of a data integrity sync, we'll be waiting for all writeout pages
afterwards, so we'll still be waiting for the other guy's write that's cleaned
the page).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -945,11 +945,20 @@ continue_unlock:
 				goto continue_unlock;
 			}
 
-			if (wbc->sync_mode != WB_SYNC_NONE)
-				wait_on_page_writeback(page);
+			if (!PageDirty(page)) {
+				/* someone wrote it for us */
+				goto continue_unlock;
+			}
+
+			if (PageWriteback(page)) {
+				if (wbc->sync_mode != WB_SYNC_NONE)
+					wait_on_page_writeback(page);
+				else
+					goto continue_unlock;
+			}
 
-			if (PageWriteback(page) ||
-			    !clear_page_dirty_for_io(page))
+			BUG_ON(PageWriteback(page));
+			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
 			ret = (*writepage)(page, wbc, data);

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 7/9] mm: write_cache_pages terminate quickly
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (5 preceding siblings ...)
  2008-10-28 14:47 ` [patch 6/9] mm: write_cache_pages optimise page cleaning npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-30 23:07   ` Andrew Morton
  2008-10-28 14:47 ` [patch 8/9] mm: write_cache_pages more " npiggin
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-terminate-quickly.patch --]
[-- Type: text/plain, Size: 1730 bytes --]

Terminate the write_cache_pages loop upon encountering the first page past
end, without locking the page. Pages cannot have their index change when we
have a reference on them (truncate, eg truncate_inode_pages_range performs
the same check without the page lock).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -911,15 +911,24 @@ retry:
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
-			done_index = page->index + 1;
-
 			/*
-			 * At this point we hold neither mapping->tree_lock nor
-			 * lock on the page itself: the page may be truncated or
-			 * invalidated (changing page->mapping to NULL), or even
-			 * swizzled back from swapper_space to tmpfs file
-			 * mapping
+			 * At this point, the page may be truncated or
+			 * invalidated (changing page->mapping to NULL), or
+			 * even swizzled back from swapper_space to tmpfs file
+			 * mapping. However, page->index will not change
+			 * because we have a reference on the page.
 			 */
+			if (page->index > end) {
+				/*
+				 * can't be range_cyclic (1st pass) because
+				 * end == -1 in that case.
+				 */
+				done = 1;
+				break;
+			}
+
+			done_index = page->index + 1;
+
 			lock_page(page);
 
 			/*
@@ -936,15 +945,6 @@ continue_unlock:
 				continue;
 			}
 
-			if (page->index > end) {
-				/*
-				 * can't be range_cyclic (1st pass) because
-				 * end == -1 in that case.
-				 */
-				done = 1;
-				goto continue_unlock;
-			}
-
 			if (!PageDirty(page)) {
 				/* someone wrote it for us */
 				goto continue_unlock;

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 7/9] mm: write_cache_pages terminate quickly
  2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin
@ 2008-10-30 23:07   ` Andrew Morton
  2008-10-31  7:29     ` Nick Piggin
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2008-10-30 23:07 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, david, chris.mason

On Wed, 29 Oct 2008 01:47:22 +1100
npiggin@suse.de wrote:

> Terminate the write_cache_pages loop upon encountering the first page past
> end, without locking the page. Pages cannot have their index change when we
> have a reference on them (truncate, eg truncate_inode_pages_range performs
> the same check without the page lock).
> 

Traditionally lock_page() is used to stabilise ->index and ->mapping. 
Here you introduce a new and very subtle sort-of-locking rule without
actually really introducing it at all.  OK, there's a little comment
buried way down in this function.  But there's a contradictory comment
over truncate_inode_pages_range() ("When looking at...").

How do we make this new locking rule maintainable?  How do we avoid
breaking it in the future?  How do we prevent accidental breakage from
slipping past developers' and reviewers' attention?

Given the additional maintenance burdens, is this change worth doing
at all?


> ---
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c
> +++ linux-2.6/mm/page-writeback.c
> @@ -911,15 +911,24 @@ retry:
>  		for (i = 0; i < nr_pages; i++) {
>  			struct page *page = pvec.pages[i];
>  
> -			done_index = page->index + 1;
> -
>  			/*
> -			 * At this point we hold neither mapping->tree_lock nor
> -			 * lock on the page itself: the page may be truncated or
> -			 * invalidated (changing page->mapping to NULL), or even
> -			 * swizzled back from swapper_space to tmpfs file
> -			 * mapping
> +			 * At this point, the page may be truncated or
> +			 * invalidated (changing page->mapping to NULL), or
> +			 * even swizzled back from swapper_space to tmpfs file
> +			 * mapping. However, page->index will not change
> +			 * because we have a reference on the page.
>  			 */
> +			if (page->index > end) {
> +				/*
> +				 * can't be range_cyclic (1st pass) because
> +				 * end == -1 in that case.
> +				 */
> +				done = 1;
> +				break;
> +			}
> +
> +			done_index = page->index + 1;
> +
>  			lock_page(page);
>  
>  			/*
> @@ -936,15 +945,6 @@ continue_unlock:
>  				continue;
>  			}
>  
> -			if (page->index > end) {
> -				/*
> -				 * can't be range_cyclic (1st pass) because
> -				 * end == -1 in that case.
> -				 */
> -				done = 1;
> -				goto continue_unlock;
> -			}
> -
>  			if (!PageDirty(page)) {
>  				/* someone wrote it for us */
>  				goto continue_unlock;
> 
> -- 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 7/9] mm: write_cache_pages terminate quickly
  2008-10-30 23:07   ` Andrew Morton
@ 2008-10-31  7:29     ` Nick Piggin
  0 siblings, 0 replies; 53+ messages in thread
From: Nick Piggin @ 2008-10-31  7:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, david, chris.mason

On Thu, Oct 30, 2008 at 04:07:46PM -0700, Andrew Morton wrote:
> On Wed, 29 Oct 2008 01:47:22 +1100
> npiggin@suse.de wrote:
> 
> > Terminate the write_cache_pages loop upon encountering the first page past
> > end, without locking the page. Pages cannot have their index change when we
> > have a reference on them (truncate, eg truncate_inode_pages_range performs
> > the same check without the page lock).
> > 
> 
> Traditionally lock_page() is used to stabilise ->index and ->mapping. 

Well, mapping. index of course is irrelevant without mapping, *except*
for a "where did we get to" kind of thing. But it has been used in that
way for a long time.


> Here you introduce a new and very subtle sort-of-locking rule without
> actually really introducing it at all.  OK, there's a little comment
> buried way down in this function.  But there's a contradictory comment
> over truncate_inode_pages_range() ("When looking at...").

That comment is actually wrong. Index won't change. If index could change
randomly, then we could skip pages here if index skips forwards. pagevec
pagecache tag lookup functions would be broken in general actually.

 
> How do we make this new locking rule maintainable?  How do we avoid
> breaking it in the future?  How do we prevent accidental breakage from
> slipping past developers' and reviewers' attention?

It's actually fairly fundamental. Even more fundamental than the above
functions I quote.

If we have any place that does:
lock_page(page)
if (!page->mapping) /* truncate got to it */

but does not check the index of the page (which most don't), then it could
have moved from where we first got it from (which would not always be a
bug, but often could be).

read(2) syscall actually also doesn't lock the page by default. Having the
page move somewhere else would be a disaster for it.

I guess it's not explicitly documented AFAIKS, but I thought it is a
hard rule. Is there anywhere useful we can write it that people will
actually read?

OTOH, there isn't a lot of places that could be doing this. Some wild
filesystem might think they own the pagecache I guess. I know that when
it came up in splice, I told Jens we can't move a page with references
on it even if it is locked...

 
> Given the additional maintenance burdens, is this change worth doing
> at all?
> 
> 
> > ---
> > Index: linux-2.6/mm/page-writeback.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page-writeback.c
> > +++ linux-2.6/mm/page-writeback.c
> > @@ -911,15 +911,24 @@ retry:
> >  		for (i = 0; i < nr_pages; i++) {
> >  			struct page *page = pvec.pages[i];
> >  
> > -			done_index = page->index + 1;
> > -
> >  			/*
> > -			 * At this point we hold neither mapping->tree_lock nor
> > -			 * lock on the page itself: the page may be truncated or
> > -			 * invalidated (changing page->mapping to NULL), or even
> > -			 * swizzled back from swapper_space to tmpfs file
> > -			 * mapping
> > +			 * At this point, the page may be truncated or
> > +			 * invalidated (changing page->mapping to NULL), or
> > +			 * even swizzled back from swapper_space to tmpfs file
> > +			 * mapping. However, page->index will not change
> > +			 * because we have a reference on the page.
> >  			 */
> > +			if (page->index > end) {
> > +				/*
> > +				 * can't be range_cyclic (1st pass) because
> > +				 * end == -1 in that case.
> > +				 */
> > +				done = 1;
> > +				break;
> > +			}
> > +
> > +			done_index = page->index + 1;
> > +
> >  			lock_page(page);
> >  
> >  			/*
> > @@ -936,15 +945,6 @@ continue_unlock:
> >  				continue;
> >  			}
> >  
> > -			if (page->index > end) {
> > -				/*
> > -				 * can't be range_cyclic (1st pass) because
> > -				 * end == -1 in that case.
> > -				 */
> > -				done = 1;
> > -				goto continue_unlock;
> > -			}
> > -
> >  			if (!PageDirty(page)) {
> >  				/* someone wrote it for us */
> >  				goto continue_unlock;
> > 
> > -- 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 8/9] mm: write_cache_pages more terminate quickly
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (6 preceding siblings ...)
  2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-wcp-terminate-quickly2.patch --]
[-- Type: text/plain, Size: 1116 bytes --]

From: Andrew Morton <akpm@linux-foundation.org>

Now that we have the early-termination logic in place, it makes sense to bail
out early in all other cases where done is set to 1.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---

I had a hard time with this one replacing the done flag with gotos. It can
be done one way or the other, but I didn't find anything completely pleasing.
Importantly, the new logic is in place, and subsequent cleanup or refactoring
patches would be welcome (I just wasn't able to come up with something 
clever enough myself)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -983,12 +983,15 @@ continue_unlock:
 
 			if (wbc->sync_mode == WB_SYNC_NONE) {
 				wbc->nr_to_write--;
-				if (wbc->nr_to_write <= 0)
+				if (wbc->nr_to_write <= 0) {
 					done = 1;
+					break;
+				}
 			}
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
+				break;
 			}
 		}
 		pagevec_release(&pvec);

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (7 preceding siblings ...)
  2008-10-28 14:47 ` [patch 8/9] mm: write_cache_pages more " npiggin
@ 2008-10-28 14:47 ` npiggin
  2008-10-30 23:13   ` Andrew Morton
  2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin
  2008-10-28 23:14 ` Dave Chinner
  10 siblings, 1 reply; 53+ messages in thread
From: npiggin @ 2008-10-28 14:47 UTC (permalink / raw)
  To: akpm; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

[-- Attachment #1: mm-do-sync-mapping-range-fix.patch --]
[-- Type: text/plain, Size: 645 bytes --]

Chris Mason notices do_sync_mapping_range didn't actually ask for data
integrity writeout. Unfortunately, it is advertised as being usable for
data integrity operations.

This is a data interity bug.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c
+++ linux-2.6/fs/sync.c
@@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
 		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
-						WB_SYNC_NONE);
+						WB_SYNC_ALL);
 		if (ret < 0)
 			goto out;
 	}

-- 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin
@ 2008-10-30 23:13   ` Andrew Morton
  2008-10-31  9:16     ` Nick Piggin
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2008-10-30 23:13 UTC (permalink / raw)
  To: npiggin; +Cc: linux-fsdevel, david, chris.mason

On Wed, 29 Oct 2008 01:47:24 +1100
npiggin@suse.de wrote:

> Chris Mason notices do_sync_mapping_range didn't actually ask for data
> integrity writeout. Unfortunately, it is advertised as being usable for
> data integrity operations.
> 
> This is a data interity bug.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/fs/sync.c
> ===================================================================
> --- linux-2.6.orig/fs/sync.c
> +++ linux-2.6/fs/sync.c
> @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address
>  
>  	if (flags & SYNC_FILE_RANGE_WRITE) {
>  		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
> -						WB_SYNC_NONE);
> +						WB_SYNC_ALL);
>  		if (ret < 0)
>  			goto out;
>  	}
> 

Really?

Some thought did go into the code which you're "fixing".  If the caller
is using sync_file_range() for integrity then the caller has done a
SYNC_FILE_RANGE_WAIT_BEFORE.

So all we need to guarantee here is that
__filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all
dirty pages in the range.  Probably that gets broken lower down as part
of various hacks^woptimisations have gone in, but which ones, and
where?  Perhaps _this_ (if it's there) is what should be fixed.

And if we _do_ make the above change, we don't need to run the
wait_on_page_writeback_range() if userspace asked for
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE, yes?

IOW, I don't think enough thought (or at least description of that
thought) has gone into this one.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-30 23:13   ` Andrew Morton
@ 2008-10-31  9:16     ` Nick Piggin
  2008-10-31 10:04       ` Andrew Morton
  2008-10-31 14:10       ` Chris Mason
  0 siblings, 2 replies; 53+ messages in thread
From: Nick Piggin @ 2008-10-31  9:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, david, chris.mason

On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote:
> On Wed, 29 Oct 2008 01:47:24 +1100
> npiggin@suse.de wrote:
> 
> > Chris Mason notices do_sync_mapping_range didn't actually ask for data
> > integrity writeout. Unfortunately, it is advertised as being usable for
> > data integrity operations.
> > 
> > This is a data interity bug.
> > 
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > Index: linux-2.6/fs/sync.c
> > ===================================================================
> > --- linux-2.6.orig/fs/sync.c
> > +++ linux-2.6/fs/sync.c
> > @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address
> >  
> >  	if (flags & SYNC_FILE_RANGE_WRITE) {
> >  		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
> > -						WB_SYNC_NONE);
> > +						WB_SYNC_ALL);
> >  		if (ret < 0)
> >  			goto out;
> >  	}
> > 
> 
> Really?
> 
> Some thought did go into the code which you're "fixing".

Yes, I even remember something of a flamewar involving me and you :)

>  If the caller
> is using sync_file_range() for integrity then the caller has done a
> SYNC_FILE_RANGE_WAIT_BEFORE.

No disputes about whether the API works "by design". But I think the
implementation has a bug. I'll explain:

> So all we need to guarantee here is that
> __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all
> dirty pages in the range.  Probably that gets broken lower down as part
> of various hacks^woptimisations have gone in, but which ones, and
> where?  Perhaps _this_ (if it's there) is what should be fixed.

WB_SYNC_NONE has never (until this was introduced) been used for data
integrity AFAIKS. There is code littered throughout fs/ which assumes
WB_SYNC_NONE ~= "efficiency / background writeback". At least
definitely the buffer.c "trylock" will cause dirty pages to be
skipped. There is also a fair amount of filesystem code I haven't looked
at. The fs-writeback.c code might not affect this path, but it also
definitely makes the same assumption about WB_SYNC_NONE, so it would be
ugly to mandate WB_SYNC_NONE is for data integrity from mapping downard,
but not from inode upward...

I didn't check, but I suspect this has been broken since it got merged.

> And if we _do_ make the above change, we don't need to run the
> wait_on_page_writeback_range() if userspace asked for
> SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE, yes?

Now you're asking the hard questions... I think we still have to wait,
because SYNC_FILE_RANGE_WRITE itself doesn't necessarily wait for writeback.
After the optimisation to skip waiting for clean and writeback pages
in write_cache_pages, actually I think this change (to use WB_SYNC_ALL)
should not hurt very much...

> 
> 
> 
> IOW, I don't think enough thought (or at least description of that
> thought) has gone into this one.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-31  9:16     ` Nick Piggin
@ 2008-10-31 10:04       ` Andrew Morton
  2008-10-31 10:53         ` Nick Piggin
  2008-10-31 20:03         ` Jamie Lokier
  2008-10-31 14:10       ` Chris Mason
  1 sibling, 2 replies; 53+ messages in thread
From: Andrew Morton @ 2008-10-31 10:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, david, chris.mason

On Fri, 31 Oct 2008 10:16:16 +0100 Nick Piggin <npiggin@suse.de> wrote:

>  
> > So all we need to guarantee here is that
> > __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all
> > dirty pages in the range.  Probably that gets broken lower down as part
> > of various hacks^woptimisations have gone in, but which ones, and
> > where?  Perhaps _this_ (if it's there) is what should be fixed.
> 
> WB_SYNC_NONE has never (until this was introduced) been used for data
> integrity AFAIKS. There is code littered throughout fs/ which assumes
> WB_SYNC_NONE ~= "efficiency / background writeback". At least
> definitely the buffer.c "trylock" will cause dirty pages to be
> skipped.


The change seriously wrecks sync_file_range(SYNC_FILE_RANGE_WRITE), the
whole point of which is to as-nonblockingly-as-possible shove some data
into the queues.  This is useful.

Perhaps we could use WB_SYNC_ALL if SYNC_FILE_RANGE_WAIT_BEFORE was
specified, or WB_SYNC_NONE if it was not (need to think about
SYNC_FILE_RANGE_WAIT_AFTER, too).

That's a bit grubby, because userspace could do

	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE);
	sync_file_range(SYNC_FILE_RANGE_WRITE);

expecting it to have the same behaviour as

	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE);



How the hell did that stupid sync_mode thing get into
writeback_control?  :( We should get rid of WB_SYNC_FOO and migrate to
better-defined writeback_control fields.



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-31 10:04       ` Andrew Morton
@ 2008-10-31 10:53         ` Nick Piggin
  2008-10-31 20:03         ` Jamie Lokier
  1 sibling, 0 replies; 53+ messages in thread
From: Nick Piggin @ 2008-10-31 10:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, david, chris.mason

On Fri, Oct 31, 2008 at 03:04:32AM -0700, Andrew Morton wrote:
> On Fri, 31 Oct 2008 10:16:16 +0100 Nick Piggin <npiggin@suse.de> wrote:
> 
> >  
> > > So all we need to guarantee here is that
> > > __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all
> > > dirty pages in the range.  Probably that gets broken lower down as part
> > > of various hacks^woptimisations have gone in, but which ones, and
> > > where?  Perhaps _this_ (if it's there) is what should be fixed.
> > 
> > WB_SYNC_NONE has never (until this was introduced) been used for data
> > integrity AFAIKS. There is code littered throughout fs/ which assumes
> > WB_SYNC_NONE ~= "efficiency / background writeback". At least
> > definitely the buffer.c "trylock" will cause dirty pages to be
> > skipped.
> 
> 
> The change seriously wrecks sync_file_range(SYNC_FILE_RANGE_WRITE), the
> whole point of which is to as-nonblockingly-as-possible shove some data
> into the queues.  This is useful.

Right, but what's possible to do nonblockingly according to the API
isn't very much. Because you can't distinguish data integrity writes
from hints (as you note below).

IMO we should implement SYNC_FILE_RANGE_ASYNC to do the real "hint"
thing. (and aside, but further along this topic, a SYNC_FILE_RANGE_SYNC
to complement as well, and allow trivial migration for apps from fsync
to an fsync_range equivalent... also a flag or two for metadata might
be nice). 

 
> Perhaps we could use WB_SYNC_ALL if SYNC_FILE_RANGE_WAIT_BEFORE was
> specified, or WB_SYNC_NONE if it was not (need to think about
> SYNC_FILE_RANGE_WAIT_AFTER, too).
> 
> That's a bit grubby, because userspace could do
> 
> 	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE);
> 	sync_file_range(SYNC_FILE_RANGE_WRITE);
> 
> expecting it to have the same behaviour as
> 
> 	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE);
> 
> 
> 
> How the hell did that stupid sync_mode thing get into
> writeback_control?  :( We should get rid of WB_SYNC_FOO and migrate to
> better-defined writeback_control fields.

No arguments. It's crusty. But this is the best I think I can do for
stable and 2.6.28 kernels.

As for how it got there... :)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-31 10:04       ` Andrew Morton
  2008-10-31 10:53         ` Nick Piggin
@ 2008-10-31 20:03         ` Jamie Lokier
  1 sibling, 0 replies; 53+ messages in thread
From: Jamie Lokier @ 2008-10-31 20:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-fsdevel, david, chris.mason

Andrew Morton wrote:
> That's a bit grubby, because userspace could do
> 
> 	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE);
> 	sync_file_range(SYNC_FILE_RANGE_WRITE);
> 
> expecting it to have the same behaviour as
> 
> 	sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE);

Yes it could.  The documentation for sync_file_range is unclear enough
already.  If the distinction is useful, a different SYNC_FILE_RANGE_
flag may be appropriate to indicate write-for-performance vs
write-for-integrity, rather than such a peculiar subtlety.

-- Jamie

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-31  9:16     ` Nick Piggin
  2008-10-31 10:04       ` Andrew Morton
@ 2008-10-31 14:10       ` Chris Mason
  2008-10-31 14:30         ` steve
  2008-11-01  8:04         ` Nick Piggin
  1 sibling, 2 replies; 53+ messages in thread
From: Chris Mason @ 2008-10-31 14:10 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, david

On Fri, 2008-10-31 at 10:16 +0100, Nick Piggin wrote:
> On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote:
> > On Wed, 29 Oct 2008 01:47:24 +1100
> > npiggin@suse.de wrote:
> > 
> > > Chris Mason notices do_sync_mapping_range didn't actually ask for
> data
> > > integrity writeout. Unfortunately, it is advertised as being
> usable for
> > > data integrity operations.
> > > 
> > > This is a data interity bug.
> > > 

[ use WB_SYNC_ALL instead of WB_SYNC_NONE ]

> >  If the caller
> > is using sync_file_range() for integrity then the caller has done a
> > SYNC_FILE_RANGE_WAIT_BEFORE.
> 
> No disputes about whether the API works "by design". But I think the
> implementation has a bug. I'll explain:

I'll definitely agree the current usage is clumsy, and there is a bug in
fs/buffer.c.  A grep through the rest of the filesystems doesn't turn up
many assumptions that WB_SYNC_NONE means it's ok to skip dirty pages.

Greps for WB_SYNC_ALL and WB_SYNC_NONE in the fs code reveal:

fs/buffer.c:__block_write_full_page()
               if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
                        lock_buffer(bh);
                } else if (!trylock_buffer(bh)) {
                        redirty_page_for_writepage(wbc, page);
                        continue;
                }

Easily fixed s/||/&&/, which is what XFS does.  reiser3 has the same bug
in fs/reiserfs/inode.c

ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional
writeback, both seem fixable with one liners.

Everywhere we wait on page writeback while we're trying to build nice
big bios hurts performance.  I'd rather see us switch to something
closer to the do_sync_mapping_range expectation of WB_SYNC_NONE than
sprinkle WB_SYNC_ALLs everywhere.

-chris

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-31 14:10       ` Chris Mason
@ 2008-10-31 14:30         ` steve
  2008-10-31 15:02           ` Chris Mason
  2008-11-01  8:04         ` Nick Piggin
  1 sibling, 1 reply; 53+ messages in thread
From: steve @ 2008-10-31 14:30 UTC (permalink / raw)
  To: Chris Mason; +Cc: Nick Piggin, Andrew Morton, linux-fsdevel, david

Hi,

On Fri, Oct 31, 2008 at 10:10:19AM -0400, Chris Mason wrote:
> UID: 518429                                                  
> 
> On Fri, 2008-10-31 at 10:16 +0100, Nick Piggin wrote:
> > On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote:
> > > On Wed, 29 Oct 2008 01:47:24 +1100
> > > npiggin@suse.de wrote:
> > > 
> > > > Chris Mason notices do_sync_mapping_range didn't actually ask for
> > data
> > > > integrity writeout. Unfortunately, it is advertised as being
> > usable for
> > > > data integrity operations.
> > > > 
> > > > This is a data interity bug.
> > > > 
> 
> [ use WB_SYNC_ALL instead of WB_SYNC_NONE ]
> 
> > >  If the caller
> > > is using sync_file_range() for integrity then the caller has done a
> > > SYNC_FILE_RANGE_WAIT_BEFORE.
> > 
> > No disputes about whether the API works "by design". But I think the
> > implementation has a bug. I'll explain:
> 
> I'll definitely agree the current usage is clumsy, and there is a bug in
> fs/buffer.c.  A grep through the rest of the filesystems doesn't turn up
> many assumptions that WB_SYNC_NONE means it's ok to skip dirty pages.
> 
> Greps for WB_SYNC_ALL and WB_SYNC_NONE in the fs code reveal:
> 
> fs/buffer.c:__block_write_full_page()
>                if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
>                         lock_buffer(bh);
>                 } else if (!trylock_buffer(bh)) {
>                         redirty_page_for_writepage(wbc, page);
>                         continue;
>                 }
> 
> Easily fixed s/||/&&/, which is what XFS does.  reiser3 has the same bug
> in fs/reiserfs/inode.c
> 
> ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional
> writeback, both seem fixable with one liners.
>
I doubt that is the case for GFS2... Its a bit subtle, but in order to
ensure that we have the correct lock ordering the code in writepage
for jdata (the only place this occurs, and I assume is what you are
referring to) relies in the fact that because we have a writepages
set for jdata, writepage will only ever be called with WB_SYNC_NONE
and can thus be skipped.

The lock ordering in question is between the transaction (glock)
and the page lock. The former has to be grabbed first and this
is also the reason that this only affects jdata as writeback and
ordered don't need transactions.

This is also the same reason that we have our own version of
write_cache_pages, its more or less identical to the VFS version
except its got our transaction lock stuck in the middle of it.

Ideally (and I speak entirely from a selfish, personal, GFS2 point of
view here!) we would either have a version of writepage which didn't
hold the page lock on entry or we'd just move over to writepages
entirely. This issue came up at the filesystem workshop earlier in the year
and I think it was generally agreed that it was a good idea. I did
have a look into doing it at one stage, but the code scared me off a bit :-)

Steve.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-31 14:30         ` steve
@ 2008-10-31 15:02           ` Chris Mason
  0 siblings, 0 replies; 53+ messages in thread
From: Chris Mason @ 2008-10-31 15:02 UTC (permalink / raw)
  To: steve; +Cc: Nick Piggin, Andrew Morton, linux-fsdevel, david

On Fri, 2008-10-31 at 14:30 +0000, steve@chygwyn.com wrote:

> > ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional
> > writeback, both seem fixable with one liners.
> >
> I doubt that is the case for GFS2... Its a bit subtle, but in order to
> ensure that we have the correct lock ordering the code in writepage
> for jdata (the only place this occurs, and I assume is what you are
> referring to) relies in the fact that because we have a writepages
> set for jdata, writepage will only ever be called with WB_SYNC_NONE
> and can thus be skipped.

Well, if you have a writepages call, writepage is only used by kswapd
and is entirely optional (assuming you don't call it yourself ;)

-chris



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 9/9] mm: do_sync_mapping_range integrity fix
  2008-10-31 14:10       ` Chris Mason
  2008-10-31 14:30         ` steve
@ 2008-11-01  8:04         ` Nick Piggin
  1 sibling, 0 replies; 53+ messages in thread
From: Nick Piggin @ 2008-11-01  8:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: Andrew Morton, linux-fsdevel, david

On Fri, Oct 31, 2008 at 10:10:19AM -0400, Chris Mason wrote:
> On Fri, 2008-10-31 at 10:16 +0100, Nick Piggin wrote:
> > On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote:
> > > On Wed, 29 Oct 2008 01:47:24 +1100
> > > npiggin@suse.de wrote:
> > > 
> > > > Chris Mason notices do_sync_mapping_range didn't actually ask for
> > data
> > > > integrity writeout. Unfortunately, it is advertised as being
> > usable for
> > > > data integrity operations.
> > > > 
> > > > This is a data interity bug.
> > > > 
> 
> [ use WB_SYNC_ALL instead of WB_SYNC_NONE ]
> 
> > >  If the caller
> > > is using sync_file_range() for integrity then the caller has done a
> > > SYNC_FILE_RANGE_WAIT_BEFORE.
> > 
> > No disputes about whether the API works "by design". But I think the
> > implementation has a bug. I'll explain:
> 
> I'll definitely agree the current usage is clumsy, and there is a bug in
> fs/buffer.c.  A grep through the rest of the filesystems doesn't turn up
> many assumptions that WB_SYNC_NONE means it's ok to skip dirty pages.
> 
> Greps for WB_SYNC_ALL and WB_SYNC_NONE in the fs code reveal:
> 
> fs/buffer.c:__block_write_full_page()
>                if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
>                         lock_buffer(bh);
>                 } else if (!trylock_buffer(bh)) {
>                         redirty_page_for_writepage(wbc, page);
>                         continue;
>                 }
> 
> Easily fixed s/||/&&/, which is what XFS does.  reiser3 has the same bug
> in fs/reiserfs/inode.c
> 
> ntfs and gfs2 each have a check that assumes WB_SYNC_NONE means optional
> writeback, both seem fixable with one liners.

Well, we went over this before :) Semantics... Whether or not it can be
considered a bug, WB_SYNC_NONE is very much considered not to be a data
integrity sync in code and comments (eg. buffer.c fs-writeback.c and
file systems) for a long time.

So that's not something I can really change (especially not for stable
releases).

It's _probably_ a good idea not to fragment bios, and it's probably not
going to introduce bugs if we convert everything over. But presently,
all the bugs/misunderstandings/whatever mean that WB_SYNC_NONE is not
usable for data integrity. Obviously this has been the case for long
before sync_mapping_range was added.

> Everywhere we wait on page writeback while we're trying to build nice
> big bios hurts performance.  I'd rather see us switch to something
> closer to the do_sync_mapping_range expectation of WB_SYNC_NONE than
> sprinkle WB_SYNC_ALLs everywhere.

They already are I guess. At least, this is the only place I have seen
that needed changing. After that series, I think we get a lot closer to
being correct in a lot of the writeback paths, which is a good starting
point to make improvements. I'll be looking at the "livelock" avoidance,
but you or other fs developers probably in a better position to look at
things like waiting for locked or writeback pages...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (8 preceding siblings ...)
  2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin
@ 2008-10-28 15:39 ` Nick Piggin
  2008-10-28 22:27   ` Dave Chinner
  2008-10-28 23:14 ` Dave Chinner
  10 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-28 15:39 UTC (permalink / raw)
  To: akpm, xfs; +Cc: linux-fsdevel, Dave Chinner, Chris Mason

On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote:
> OK, I'm happier with this patchset now. Note that I've taken your patch
> and mangled it a bit at the end of the series.
> 
> This one survives and seems to run OK here, but I'm mainly doing dumb
> stress testing with a handful of filesystems, and data-io error injection
> testing. There are a lot of combinations of ways this function can operate
> and interact obviously, so it would be helpful to get more review.
> 
> Chris, would you possibly have time to run your btrfs tests that are
> sensitive to problems in this code? I could provide you a single patch
> rollup against mainline if it helps.

BTW. XFS seems to be doing something interesting with my simple sync
test case with IO error injection. I map a file MAP_SHARED into a number of
processes, which then each run a loop that dirties the memory then calls
msync(MS_SYNC) on the range.

ext2 mostly reports -EIO back to userspace when a failure is injected AFAIKS.
ext3 (ordered) doesn't until a lot of errors have been injected, but eventually
reports -EIO and shuts down the filesystem. reiserfs seems to report failure
more consistently.

I haven't seen any -EIO failures from XFS... maybe I'm just not doing the
right thing, or there is a caveat I'm not aware of.

All fault injections I noticed had a trace like this:
FAULT_INJECTION: forcing a failure
Call Trace:
9f9cd758:  [<6019f1de>] random32+0xe/0x20
9f9cd768:  [<601a31b9>] should_fail+0xd9/0x130
9f9cd798:  [<6018d0c4>] generic_make_request+0x304/0x4e0
9f9cd7a8:  [<60062301>] mempool_alloc+0x51/0x130
9f9cd858:  [<6018e6bf>] submit_bio+0x4f/0xe0
9f9cd8a8:  [<60165505>] xfs_submit_ioend_bio+0x25/0x40
9f9cd8c8:  [<6016603c>] xfs_submit_ioend+0xbc/0xf0
9f9cd908:  [<60166bf9>] xfs_page_state_convert+0x3d9/0x6a0
9f9cd928:  [<6005d515>] delayacct_end+0x95/0xb0
9f9cda08:  [<60166ffd>] xfs_vm_writepage+0x6d/0x110
9f9cda18:  [<6006618b>] set_page_dirty+0x4b/0xd0
9f9cda58:  [<60066115>] __writepage+0x15/0x40
9f9cda78:  [<60066775>] write_cache_pages+0x255/0x470
9f9cda90:  [<60066100>] __writepage+0x0/0x40
9f9cdb98:  [<600669b0>] generic_writepages+0x20/0x30
9f9cdba8:  [<60165ba3>] xfs_vm_writepages+0x53/0x70
9f9cdbd8:  [<600669eb>] do_writepages+0x2b/0x40
9f9cdbf8:  [<6006004c>] __filemap_fdatawrite_range+0x5c/0x70
9f9cdc58:  [<6006026a>] filemap_fdatawrite+0x1a/0x20
9f9cdc68:  [<600a7a05>] do_fsync+0x45/0xe0
9f9cdc98:  [<6007794b>] sys_msync+0x14b/0x1d0
9f9cdcf8:  [<60019a70>] handle_syscall+0x50/0x80
9f9cdd18:  [<6002a10f>] userspace+0x44f/0x510
9f9cdfc8:  [<60016792>] fork_handler+0x62/0x70

And the kernel would sometimes say this:
Buffer I/O error on device ram0, logical block 279
lost page write due to I/O error on ram0
Buffer I/O error on device ram0, logical block 379
lost page write due to I/O error on ram0
Buffer I/O error on device ram0, logical block 389
lost page write due to I/O error on ram0


I think I also saw a slab bug when running dbench with fault injection on.
Running latest Linus kernel.

bash-3.1# dbench -t10 -c ../client.txt 8
dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 10 seconds with load '../client.txt' and minimum warmup 2 secs
8 clients started
FAULT_INJECTION: forcing a failure
Call Trace:
9e7bb548:  [<601623ae>] random32+0xe/0x20
9e7bb558:  [<60166389>] should_fail+0xd9/0x130
9e7bb588:  [<60150294>] generic_make_request+0x304/0x4e0
9e7bb598:  [<60062301>] mempool_alloc+0x51/0x130
9e7bb648:  [<6015188f>] submit_bio+0x4f/0xe0
9e7bb698:  [<6012b440>] _xfs_buf_ioapply+0x180/0x2a0
9e7bb6a0:  [<6002f600>] default_wake_function+0x0/0x10
9e7bb6f8:  [<6012bae1>] xfs_buf_iorequest+0x31/0x90
9e7bb718:  [<60112f75>] xlog_bdstrat_cb+0x45/0x50
9e7bb738:  [<60114135>] xlog_sync+0x195/0x440
9e7bb778:  [<60114491>] xlog_state_release_iclog+0xb1/0xc0
9e7bb7a8:  [<60114ca9>] xlog_write+0x539/0x550
9e7bb858:  [<60114e60>] xfs_log_write+0x40/0x60
9e7bb888:  [<6011fbaa>] _xfs_trans_commit+0x19a/0x360
9e7bb8b8:  [<600838e2>] poison_obj+0x42/0x60
9e7bb8d0:  [<60082cb3>] dbg_redzone1+0x13/0x30
9e7bb8e8:  [<60083999>] cache_alloc_debugcheck_after+0x99/0x1c0
9e7bb918:  [<6008517b>] kmem_cache_alloc+0x8b/0x100
9e7bb958:  [<60128084>] kmem_zone_alloc+0x74/0xe0
9e7bb998:  [<60082ad9>] kmem_cache_size+0x9/0x10
9e7bb9a8:  [<60128124>] kmem_zone_zalloc+0x34/0x50
9e7bb9e8:  [<60121e8b>] xfs_dir_ialloc+0x13b/0x2e0
9e7bba58:  [<601f534b>] __down_write+0xb/0x10
9e7bbaa8:  [<60125b9e>] xfs_mkdir+0x37e/0x4b0
9e7bbb38:  [<601f5589>] _spin_unlock+0x9/0x10
9e7bbb78:  [<601301a4>] xfs_vn_mknod+0xf4/0x1a0
9e7bbbd8:  [<6013025e>] xfs_vn_mkdir+0xe/0x10
9e7bbbe8:  [<60091010>] vfs_mkdir+0x90/0xc0
9e7bbc18:  [<600934d6>] sys_mkdirat+0x106/0x120
9e7bbc88:  [<6008629b>] filp_close+0x4b/0x80
9e7bbce8:  [<60093503>] sys_mkdir+0x13/0x20
9e7bbcf8:  [<60019a70>] handle_syscall+0x50/0x80
9e7bbd18:  [<6002a10f>] userspace+0x44f/0x510
9e7bbfc8:  [<60016792>] fork_handler+0x62/0x70

I/O error in filesystem ("ram0") meta-data dev ram0 block 0x8002c       ("xlog_i
odone") error 5 buf count 32768
xfs_force_shutdown(ram0,0x2) called from line 1056 of file /home/npiggin/usr/src
/linux-2.6/fs/xfs/xfs_log.c.  Return address = 0x000000006011370d
Filesystem "ram0": Log I/O Error Detected.  Shutting down filesystem: ram0
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(ram0,0x2) called from line 818 of file /home/npiggin/usr/src/
linux-2.6/fs/xfs/xfs_log.c.  Return address = 0x0000000060114e7d
slab error in verify_redzone_free(): cache `xfs_log_ticket': double free detecte
d
Call Trace:
9e7bb998:  [<6008372f>] __slab_error+0x1f/0x30
9e7bb9a8:  [<60083cae>] cache_free_debugcheck+0x1ee/0x240
9e7bb9b0:  [<60112ef0>] xlog_ticket_put+0x10/0x20
9e7bb9e8:  [<60083f70>] kmem_cache_free+0x50/0xc0
9e7bba18:  [<60112ef0>] xlog_ticket_put+0x10/0x20
9e7bba28:  [<60114dc9>] xfs_log_done+0x59/0xb0
9e7bba68:  [<6011f5de>] xfs_trans_cancel+0x7e/0x140
9e7bbaa8:  [<60125a1e>] xfs_mkdir+0x1fe/0x4b0
9e7bbb38:  [<601f5589>] _spin_unlock+0x9/0x10
9e7bbb78:  [<601301a4>] xfs_vn_mknod+0xf4/0x1a0
9e7bbbd8:  [<6013025e>] xfs_vn_mkdir+0xe/0x10
9e7bbbe8:  [<60091010>] vfs_mkdir+0x90/0xc0
9e7bbc18:  [<600934d6>] sys_mkdirat+0x106/0x120
9e7bbc88:  [<6008629b>] filp_close+0x4b/0x80
9e7bbce8:  [<60093503>] sys_mkdir+0x13/0x20
9e7bbcf8:  [<60019a70>] handle_syscall+0x50/0x80
9e7bbd18:  [<6002a10f>] userspace+0x44f/0x510
9e7bbfc8:  [<60016792>] fork_handler+0x62/0x70

000000009e0d4ec0: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
(3) open ./clients/client1 failed for handle 16385 (No such file or directory)
(4) ERROR: handle 16385 was not found
Child failed with status 1

(kernel died soon afterwards)


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin
@ 2008-10-28 22:27   ` Dave Chinner
  2008-10-29  0:04     ` Nick Piggin
                       ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Dave Chinner @ 2008-10-28 22:27 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason

On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote:
> > OK, I'm happier with this patchset now. Note that I've taken your patch
> > and mangled it a bit at the end of the series.
> > 
> > This one survives and seems to run OK here, but I'm mainly doing dumb
> > stress testing with a handful of filesystems, and data-io error injection
> > testing. There are a lot of combinations of ways this function can operate
> > and interact obviously, so it would be helpful to get more review.
> > 
> > Chris, would you possibly have time to run your btrfs tests that are
> > sensitive to problems in this code? I could provide you a single patch
> > rollup against mainline if it helps.
> 
> BTW. XFS seems to be doing something interesting with my simple sync
> test case with IO error injection. I map a file MAP_SHARED into a number of
> processes, which then each run a loop that dirties the memory then calls
> msync(MS_SYNC) on the range.
> 
> ext2 mostly reports -EIO back to userspace when a failure is injected AFAIKS.
> ext3 (ordered) doesn't until a lot of errors have been injected, but eventually
> reports -EIO and shuts down the filesystem. reiserfs seems to report failure
> more consistently.
> 
> I haven't seen any -EIO failures from XFS... maybe I'm just not doing the
> right thing, or there is a caveat I'm not aware of.
> 
> All fault injections I noticed had a trace like this:
> FAULT_INJECTION: forcing a failure
> Call Trace:
> 9f9cd758:  [<6019f1de>] random32+0xe/0x20
> 9f9cd768:  [<601a31b9>] should_fail+0xd9/0x130
> 9f9cd798:  [<6018d0c4>] generic_make_request+0x304/0x4e0
> 9f9cd7a8:  [<60062301>] mempool_alloc+0x51/0x130
> 9f9cd858:  [<6018e6bf>] submit_bio+0x4f/0xe0
> 9f9cd8a8:  [<60165505>] xfs_submit_ioend_bio+0x25/0x40
> 9f9cd8c8:  [<6016603c>] xfs_submit_ioend+0xbc/0xf0
> 9f9cd908:  [<60166bf9>] xfs_page_state_convert+0x3d9/0x6a0
> 9f9cd928:  [<6005d515>] delayacct_end+0x95/0xb0
> 9f9cda08:  [<60166ffd>] xfs_vm_writepage+0x6d/0x110
> 9f9cda18:  [<6006618b>] set_page_dirty+0x4b/0xd0
> 9f9cda58:  [<60066115>] __writepage+0x15/0x40
> 9f9cda78:  [<60066775>] write_cache_pages+0x255/0x470
> 9f9cda90:  [<60066100>] __writepage+0x0/0x40
> 9f9cdb98:  [<600669b0>] generic_writepages+0x20/0x30
> 9f9cdba8:  [<60165ba3>] xfs_vm_writepages+0x53/0x70
> 9f9cdbd8:  [<600669eb>] do_writepages+0x2b/0x40
> 9f9cdbf8:  [<6006004c>] __filemap_fdatawrite_range+0x5c/0x70
> 9f9cdc58:  [<6006026a>] filemap_fdatawrite+0x1a/0x20
> 9f9cdc68:  [<600a7a05>] do_fsync+0x45/0xe0
> 9f9cdc98:  [<6007794b>] sys_msync+0x14b/0x1d0
> 9f9cdcf8:  [<60019a70>] handle_syscall+0x50/0x80
> 9f9cdd18:  [<6002a10f>] userspace+0x44f/0x510
> 9f9cdfc8:  [<60016792>] fork_handler+0x62/0x70

XFS reports bio errors through the I/O completion path, not the
submission path.

> And the kernel would sometimes say this:
> Buffer I/O error on device ram0, logical block 279
> lost page write due to I/O error on ram0
> Buffer I/O error on device ram0, logical block 379
> lost page write due to I/O error on ram0
> Buffer I/O error on device ram0, logical block 389
> lost page write due to I/O error on ram0

Yes - that's coming from end_buffer_async_write() when an error is
reported in bio completion. This does:

 465                 set_bit(AS_EIO, &page->mapping->flags);
 466                 set_buffer_write_io_error(bh);
 467                 clear_buffer_uptodate(bh);
 468                 SetPageError(page);

Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in
wait_on_page_writeback_range() which is appears to be checking the
mapping flags for errors. I wonder why that error is not being
propagated then? AFAICT both XFS and the fsync code are doing the
right thing but somewhere the error has gone missing...

> I think I also saw a slab bug when running dbench with fault injection on.
> Running latest Linus kernel.
> 
> bash-3.1# dbench -t10 -c ../client.txt 8
> dbench version 3.04 - Copyright Andrew Tridgell 1999-2004
> 
> Running for 10 seconds with load '../client.txt' and minimum warmup 2 secs
> 8 clients started
> FAULT_INJECTION: forcing a failure
> Call Trace:
> 9e7bb548:  [<601623ae>] random32+0xe/0x20
> 9e7bb558:  [<60166389>] should_fail+0xd9/0x130
> 9e7bb588:  [<60150294>] generic_make_request+0x304/0x4e0
> 9e7bb598:  [<60062301>] mempool_alloc+0x51/0x130
> 9e7bb648:  [<6015188f>] submit_bio+0x4f/0xe0
> 9e7bb698:  [<6012b440>] _xfs_buf_ioapply+0x180/0x2a0
> 9e7bb6a0:  [<6002f600>] default_wake_function+0x0/0x10
> 9e7bb6f8:  [<6012bae1>] xfs_buf_iorequest+0x31/0x90
> 9e7bb718:  [<60112f75>] xlog_bdstrat_cb+0x45/0x50
> 9e7bb738:  [<60114135>] xlog_sync+0x195/0x440
> 9e7bb778:  [<60114491>] xlog_state_release_iclog+0xb1/0xc0
> 9e7bb7a8:  [<60114ca9>] xlog_write+0x539/0x550
> 9e7bb858:  [<60114e60>] xfs_log_write+0x40/0x60
> 9e7bb888:  [<6011fbaa>] _xfs_trans_commit+0x19a/0x360
> 9e7bb8b8:  [<600838e2>] poison_obj+0x42/0x60
> 9e7bb8d0:  [<60082cb3>] dbg_redzone1+0x13/0x30
> 9e7bb8e8:  [<60083999>] cache_alloc_debugcheck_after+0x99/0x1c0
> 9e7bb918:  [<6008517b>] kmem_cache_alloc+0x8b/0x100
> 9e7bb958:  [<60128084>] kmem_zone_alloc+0x74/0xe0
> 9e7bb998:  [<60082ad9>] kmem_cache_size+0x9/0x10
> 9e7bb9a8:  [<60128124>] kmem_zone_zalloc+0x34/0x50
> 9e7bb9e8:  [<60121e8b>] xfs_dir_ialloc+0x13b/0x2e0
> 9e7bba58:  [<601f534b>] __down_write+0xb/0x10
> 9e7bbaa8:  [<60125b9e>] xfs_mkdir+0x37e/0x4b0
> 9e7bbb38:  [<601f5589>] _spin_unlock+0x9/0x10
> 9e7bbb78:  [<601301a4>] xfs_vn_mknod+0xf4/0x1a0
> 9e7bbbd8:  [<6013025e>] xfs_vn_mkdir+0xe/0x10
> 9e7bbbe8:  [<60091010>] vfs_mkdir+0x90/0xc0
> 9e7bbc18:  [<600934d6>] sys_mkdirat+0x106/0x120
> 9e7bbc88:  [<6008629b>] filp_close+0x4b/0x80
> 9e7bbce8:  [<60093503>] sys_mkdir+0x13/0x20
> 9e7bbcf8:  [<60019a70>] handle_syscall+0x50/0x80
> 9e7bbd18:  [<6002a10f>] userspace+0x44f/0x510
> 9e7bbfc8:  [<60016792>] fork_handler+0x62/0x70
> 
> I/O error in filesystem ("ram0") meta-data dev ram0 block 0x8002c       ("xlog_i
> odone") error 5 buf count 32768
> xfs_force_shutdown(ram0,0x2) called from line 1056 of file /home/npiggin/usr/src
> /linux-2.6/fs/xfs/xfs_log.c.  Return address = 0x000000006011370d
> Filesystem "ram0": Log I/O Error Detected.  Shutting down filesystem: ram0
> Please umount the filesystem, and rectify the problem(s)
> xfs_force_shutdown(ram0,0x2) called from line 818 of file /home/npiggin/usr/src/
> linux-2.6/fs/xfs/xfs_log.c.  Return address = 0x0000000060114e7d
> slab error in verify_redzone_free(): cache `xfs_log_ticket': double free detecte
> d
> Call Trace:
> 9e7bb998:  [<6008372f>] __slab_error+0x1f/0x30
> 9e7bb9a8:  [<60083cae>] cache_free_debugcheck+0x1ee/0x240
> 9e7bb9b0:  [<60112ef0>] xlog_ticket_put+0x10/0x20
> 9e7bb9e8:  [<60083f70>] kmem_cache_free+0x50/0xc0
> 9e7bba18:  [<60112ef0>] xlog_ticket_put+0x10/0x20
> 9e7bba28:  [<60114dc9>] xfs_log_done+0x59/0xb0
> 9e7bba68:  [<6011f5de>] xfs_trans_cancel+0x7e/0x140
> 9e7bbaa8:  [<60125a1e>] xfs_mkdir+0x1fe/0x4b0
> 9e7bbb38:  [<601f5589>] _spin_unlock+0x9/0x10
> 9e7bbb78:  [<601301a4>] xfs_vn_mknod+0xf4/0x1a0
> 9e7bbbd8:  [<6013025e>] xfs_vn_mkdir+0xe/0x10
> 9e7bbbe8:  [<60091010>] vfs_mkdir+0x90/0xc0
> 9e7bbc18:  [<600934d6>] sys_mkdirat+0x106/0x120
> 9e7bbc88:  [<6008629b>] filp_close+0x4b/0x80
> 9e7bbce8:  [<60093503>] sys_mkdir+0x13/0x20
> 9e7bbcf8:  [<60019a70>] handle_syscall+0x50/0x80
> 9e7bbd18:  [<6002a10f>] userspace+0x44f/0x510
> 9e7bbfc8:  [<60016792>] fork_handler+0x62/0x70

Now that is interesting.

We've got a rolling transaction in progress, and the commit of the
first part of the transaction has got the I/O error.  That frees the
transaction structure used during that commit, as well as the
ticket.

However, before we committed the initial transaction, we duplicated
the transaction structure to allow the transaction to continue to
track all the dirty objects in the first commit. That included
duplicating the pointer to the ticket.

Then the EIO is returned to mkdir code with the duplicated
transaction, which is then cancelled, and that frees the transaction
and the ticket it holds. However, we'd already freed the ticket.

Ok, we're only seeing this problem now because I recently modified
the ticket allocation to use a slab instead of a roll-your-own free
list structure that wouldn't have been poisoned. Nice to know that
this change did more than just remove code. ;)

This might take a little while to fix - a lot of code needs
auditing - but thanks for reporting the problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 22:27   ` Dave Chinner
@ 2008-10-29  0:04     ` Nick Piggin
  2008-10-29  0:16     ` Nick Piggin
  2008-10-29  8:51     ` Dave Chinner
  2 siblings, 0 replies; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  0:04 UTC (permalink / raw)
  To: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote:
> On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote:
> > I haven't seen any -EIO failures from XFS... maybe I'm just not doing the
> > right thing, or there is a caveat I'm not aware of.
> > 
> > All fault injections I noticed had a trace like this:
[...]
> 
> XFS reports bio errors through the I/O completion path, not the
> submission path.

Right, that's just to give you an indication of where it's failing...

 
> > And the kernel would sometimes say this:
> > Buffer I/O error on device ram0, logical block 279
> > lost page write due to I/O error on ram0
> > Buffer I/O error on device ram0, logical block 379
> > lost page write due to I/O error on ram0
> > Buffer I/O error on device ram0, logical block 389
> > lost page write due to I/O error on ram0
> 
> Yes - that's coming from end_buffer_async_write() when an error is
> reported in bio completion. This does:
> 
>  465                 set_bit(AS_EIO, &page->mapping->flags);
>  466                 set_buffer_write_io_error(bh);
>  467                 clear_buffer_uptodate(bh);
>  468                 SetPageError(page);
> 
> Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in
> wait_on_page_writeback_range() which is appears to be checking the
> mapping flags for errors. I wonder why that error is not being
> propagated then? AFAICT both XFS and the fsync code are doing the
> right thing but somewhere the error has gone missing...

Yeah, I couldn't immediately see why nothing comes out. I'll do a bit
more digging.

 
> > I think I also saw a slab bug when running dbench with fault injection on.
> > Running latest Linus kernel.
[...]
>
> Now that is interesting.
> 
> We've got a rolling transaction in progress, and the commit of the
> first part of the transaction has got the I/O error.  That frees the
> transaction structure used during that commit, as well as the
> ticket.
> 
> However, before we committed the initial transaction, we duplicated
> the transaction structure to allow the transaction to continue to
> track all the dirty objects in the first commit. That included
> duplicating the pointer to the ticket.
> 
> Then the EIO is returned to mkdir code with the duplicated
> transaction, which is then cancelled, and that frees the transaction
> and the ticket it holds. However, we'd already freed the ticket.
> 
> Ok, we're only seeing this problem now because I recently modified
> the ticket allocation to use a slab instead of a roll-your-own free
> list structure that wouldn't have been poisoned. Nice to know that
> this change did more than just remove code. ;)
> 
> This might take a little while to fix - a lot of code needs
> auditing - but thanks for reporting the problem.

No problem, hope it helps.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 22:27   ` Dave Chinner
  2008-10-29  0:04     ` Nick Piggin
@ 2008-10-29  0:16     ` Nick Piggin
  2008-10-29  3:16       ` Dave Chinner
  2008-10-29  8:51     ` Dave Chinner
  2 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  0:16 UTC (permalink / raw)
  To: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote:
> On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote:
> > 
> > I haven't seen any -EIO failures from XFS... maybe I'm just not doing the
> > right thing, or there is a caveat I'm not aware of.
> > 
> > All fault injections I noticed had a trace like this:
> > FAULT_INJECTION: forcing a failure
> > Call Trace:
> > 9f9cd758:  [<6019f1de>] random32+0xe/0x20
> > 9f9cd768:  [<601a31b9>] should_fail+0xd9/0x130
> > 9f9cd798:  [<6018d0c4>] generic_make_request+0x304/0x4e0
> > 9f9cd7a8:  [<60062301>] mempool_alloc+0x51/0x130
> > 9f9cd858:  [<6018e6bf>] submit_bio+0x4f/0xe0
> > 9f9cd8a8:  [<60165505>] xfs_submit_ioend_bio+0x25/0x40
> > 9f9cd8c8:  [<6016603c>] xfs_submit_ioend+0xbc/0xf0
> > 9f9cd908:  [<60166bf9>] xfs_page_state_convert+0x3d9/0x6a0
> > 9f9cd928:  [<6005d515>] delayacct_end+0x95/0xb0
> > 9f9cda08:  [<60166ffd>] xfs_vm_writepage+0x6d/0x110
> > 9f9cda18:  [<6006618b>] set_page_dirty+0x4b/0xd0
> > 9f9cda58:  [<60066115>] __writepage+0x15/0x40
> > 9f9cda78:  [<60066775>] write_cache_pages+0x255/0x470
> > 9f9cda90:  [<60066100>] __writepage+0x0/0x40
> > 9f9cdb98:  [<600669b0>] generic_writepages+0x20/0x30
> > 9f9cdba8:  [<60165ba3>] xfs_vm_writepages+0x53/0x70
> > 9f9cdbd8:  [<600669eb>] do_writepages+0x2b/0x40
> > 9f9cdbf8:  [<6006004c>] __filemap_fdatawrite_range+0x5c/0x70
> > 9f9cdc58:  [<6006026a>] filemap_fdatawrite+0x1a/0x20
> > 9f9cdc68:  [<600a7a05>] do_fsync+0x45/0xe0
> > 9f9cdc98:  [<6007794b>] sys_msync+0x14b/0x1d0
> > 9f9cdcf8:  [<60019a70>] handle_syscall+0x50/0x80
> > 9f9cdd18:  [<6002a10f>] userspace+0x44f/0x510
> > 9f9cdfc8:  [<60016792>] fork_handler+0x62/0x70
> 
> XFS reports bio errors through the I/O completion path, not the
> submission path.
> 
> > And the kernel would sometimes say this:
> > Buffer I/O error on device ram0, logical block 279
> > lost page write due to I/O error on ram0
> > Buffer I/O error on device ram0, logical block 379
> > lost page write due to I/O error on ram0
> > Buffer I/O error on device ram0, logical block 389
> > lost page write due to I/O error on ram0
> 
> Yes - that's coming from end_buffer_async_write() when an error is
> reported in bio completion. This does:
> 
>  465                 set_bit(AS_EIO, &page->mapping->flags);
>  466                 set_buffer_write_io_error(bh);
>  467                 clear_buffer_uptodate(bh);
>  468                 SetPageError(page);
> 
> Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in
> wait_on_page_writeback_range() which is appears to be checking the
> mapping flags for errors. I wonder why that error is not being
> propagated then? AFAICT both XFS and the fsync code are doing the
> right thing but somewhere the error has gone missing...

This one-liner has it reporting EIO errors like a champion. I
don't know if you'll actually need to put this into the
linux API layer or not, but anyway the root cause of the problem
AFAIKS is this.
--

XFS: fix fsync errors not being propogated back to userspace.
---
Index: linux-2.6/fs/xfs/xfs_vnodeops.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_vnodeops.c
+++ linux-2.6/fs/xfs/xfs_vnodeops.c
@@ -715,7 +715,7 @@ xfs_fsync(
 	/* capture size updates in I/O completion before writing the inode. */
 	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
 	if (error)
-		return XFS_ERROR(error);
+		return XFS_ERROR(-error);
 
 	/*
 	 * We always need to make sure that the required inode state is safe on

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  0:16     ` Nick Piggin
@ 2008-10-29  3:16       ` Dave Chinner
  2008-10-29  3:26         ` Dave Chinner
                           ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Dave Chinner @ 2008-10-29  3:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote:
> > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote:
> > Yes - that's coming from end_buffer_async_write() when an error is
> > reported in bio completion. This does:
> > 
> >  465                 set_bit(AS_EIO, &page->mapping->flags);
> >  466                 set_buffer_write_io_error(bh);
> >  467                 clear_buffer_uptodate(bh);
> >  468                 SetPageError(page);
> > 
> > Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in
> > wait_on_page_writeback_range() which is appears to be checking the
> > mapping flags for errors. I wonder why that error is not being
> > propagated then? AFAICT both XFS and the fsync code are doing the
> > right thing but somewhere the error has gone missing...
> 
> This one-liner has it reporting EIO errors like a champion. I
> don't know if you'll actually need to put this into the
> linux API layer or not, but anyway the root cause of the problem
> AFAIKS is this.
> --
> 
> XFS: fix fsync errors not being propogated back to userspace.
> ---
> Index: linux-2.6/fs/xfs/xfs_vnodeops.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c
> +++ linux-2.6/fs/xfs/xfs_vnodeops.c
> @@ -715,7 +715,7 @@ xfs_fsync(
>  	/* capture size updates in I/O completion before writing the inode. */
>  	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
>  	if (error)
> -		return XFS_ERROR(error);
> +		return XFS_ERROR(-error);

<groan>

Yeah, that'd do it. Good catch. I can't believe I recently fixed a
bug that touched these lines of code without noticing the inversion.
Sometimes I wonder if we should just conver the entire of XFS to
return negative errors - mistakes in handling negative error numbers
in the core XFS code happen all the time.

FWIW, the core issue here is that we've got to do the
filemap_fdatawait() call in the ->fsync method because ->fsync
gets called before we've waited for the data I/O to complete.
XFS updates inode state on I/O completion, so we *must* wait
for data I/O to complete before logging the inode changes. I
think btrfs has the same problem....

Thanks again, Nick.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  3:16       ` Dave Chinner
@ 2008-10-29  3:26         ` Dave Chinner
  2008-10-29  4:11           ` Nick Piggin
  2008-10-29  9:13           ` Christoph Hellwig
  2008-10-29  4:00         ` Nick Piggin
  2008-10-29  9:12         ` Christoph Hellwig
  2 siblings, 2 replies; 53+ messages in thread
From: Dave Chinner @ 2008-10-29  3:26 UTC (permalink / raw)
  To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote:
> > XFS: fix fsync errors not being propogated back to userspace.
> > ---
> > Index: linux-2.6/fs/xfs/xfs_vnodeops.c
> > ===================================================================
> > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c
> > +++ linux-2.6/fs/xfs/xfs_vnodeops.c
> > @@ -715,7 +715,7 @@ xfs_fsync(
> >  	/* capture size updates in I/O completion before writing the inode. */
> >  	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
> >  	if (error)
> > -		return XFS_ERROR(error);
> > +		return XFS_ERROR(-error);
> 
> <groan>
> 
> Yeah, that'd do it. Good catch. I can't believe I recently fixed a
> bug that touched these lines of code without noticing the inversion.
> Sometimes I wonder if we should just conver the entire of XFS to
> return negative errors - mistakes in handling negative error numbers
> in the core XFS code happen all the time.

Ok, I was right - these problems happen all the time. The above call
should really call xfs_flush_pages() to do the flush and wait. I
note that xfs_flush_pages() returns negative errors, and all the
callers expect positive errors. I bet the same occurs for
xfs_flushinval_pages() and xfs_tosspages() which are the wrappers
that core XFS code is supposed to be using for flushing and
invalidating file ranges....

I'll write up a patch that covers all of these.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  3:26         ` Dave Chinner
@ 2008-10-29  4:11           ` Nick Piggin
  2008-10-29  4:57             ` Dave Chinner
  2008-10-29  9:13           ` Christoph Hellwig
  1 sibling, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  4:11 UTC (permalink / raw)
  To: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote:
> On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> > On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote:
> > > XFS: fix fsync errors not being propogated back to userspace.
> > > ---
> > > Index: linux-2.6/fs/xfs/xfs_vnodeops.c
> > > ===================================================================
> > > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c
> > > +++ linux-2.6/fs/xfs/xfs_vnodeops.c
> > > @@ -715,7 +715,7 @@ xfs_fsync(
> > >  	/* capture size updates in I/O completion before writing the inode. */
> > >  	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
> > >  	if (error)
> > > -		return XFS_ERROR(error);
> > > +		return XFS_ERROR(-error);
> > 
> > <groan>
> > 
> > Yeah, that'd do it. Good catch. I can't believe I recently fixed a
> > bug that touched these lines of code without noticing the inversion.
> > Sometimes I wonder if we should just conver the entire of XFS to
> > return negative errors - mistakes in handling negative error numbers
> > in the core XFS code happen all the time.
> 
> Ok, I was right - these problems happen all the time. The above call
> should really call xfs_flush_pages() to do the flush and wait. I
> note that xfs_flush_pages() returns negative errors, and all the
> callers expect positive errors. I bet the same occurs for
> xfs_flushinval_pages() and xfs_tosspages() which are the wrappers
> that core XFS code is supposed to be using for flushing and
> invalidating file ranges....

Just be careful -- in your xfs_flush_pages, I think after the first
filemap_fdatawrite, the mapping may no longer be tagged with
PAGECACHE_TAG_DIRTY, so you may not pick up those writeback ones
you need to wait on.

Might need a different variant, or we could just bite the bullet and
push through the ->fsync conversion so you get full control of the
writeout.

BTW. the Linux pagecache APIs should support range operations quite
nicely for these. Any reason not to use them (it looks like the
wrappers can take ranges)?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  4:11           ` Nick Piggin
@ 2008-10-29  4:57             ` Dave Chinner
  2008-10-29  5:06               ` Nick Piggin
  0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2008-10-29  4:57 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 05:11:12AM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote:
> > On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> > > On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote:
> > > > XFS: fix fsync errors not being propogated back to userspace.
> > > > ---
> > > > Index: linux-2.6/fs/xfs/xfs_vnodeops.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c
> > > > +++ linux-2.6/fs/xfs/xfs_vnodeops.c
> > > > @@ -715,7 +715,7 @@ xfs_fsync(
> > > >  	/* capture size updates in I/O completion before writing the inode. */
> > > >  	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
> > > >  	if (error)
> > > > -		return XFS_ERROR(error);
> > > > +		return XFS_ERROR(-error);
> > > 
> > > <groan>
> > > 
> > > Yeah, that'd do it. Good catch. I can't believe I recently fixed a
> > > bug that touched these lines of code without noticing the inversion.
> > > Sometimes I wonder if we should just conver the entire of XFS to
> > > return negative errors - mistakes in handling negative error numbers
> > > in the core XFS code happen all the time.
> > 
> > Ok, I was right - these problems happen all the time. The above call
> > should really call xfs_flush_pages() to do the flush and wait. I
> > note that xfs_flush_pages() returns negative errors, and all the
> > callers expect positive errors. I bet the same occurs for
> > xfs_flushinval_pages() and xfs_tosspages() which are the wrappers
> > that core XFS code is supposed to be using for flushing and
> > invalidating file ranges....
> 
> Just be careful -- in your xfs_flush_pages, I think after the first
> filemap_fdatawrite, the mapping may no longer be tagged with
> PAGECACHE_TAG_DIRTY, so you may not pick up those writeback ones
> you need to wait on.

Yes, I realised this as soon as I looked at the code.
I added xfs_wait_on_pages() to do this wait. ;)

> Might need a different variant, or we could just bite the bullet and
> push through the ->fsync conversion so you get full control of the
> writeout.

Not important right now, though....

> BTW. the Linux pagecache APIs should support range operations quite
> nicely for these. Any reason not to use them (it looks like the
> wrappers can take ranges)?

Because I haven't got around to modifying these wrappers now that
the range primitives are in place - XFS inherited the range
operations from Irix and have sat unimplemented since being ported
to Linux.

The patch below should fix this entire class of error value screwup
in XFS. I've just started running it through XFSQA, so it's not
really tested yet.

FWIW, your entire patch series made it through XFSQA without any new
regressions, so it looks good from that POV.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

XFS: fix error inversion problems with data flushing

XFS gets the sign of the error wrong in several places when
gathering the error from generic linux functions. These functions
return negative error values, while the core XFS code returns
positive error values. Hence when XFS inverts the error to be
returned to the VFS, it can incorrectly invert a negative
error and this error will be ignored by the syscall return.

Fix all the problems related to calling filemap_* functions.

Problem initially identified by Nick Piggin in xfs_fsync().

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_fs_subr.c |   23 ++++++++++++++++++++---
 fs/xfs/linux-2.6/xfs_lrw.c     |    2 +-
 fs/xfs/linux-2.6/xfs_super.c   |   13 +++++++++----
 fs/xfs/xfs_vnodeops.c          |    2 +-
 fs/xfs/xfs_vnodeops.h          |    1 +
 5 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_fs_subr.c b/fs/xfs/linux-2.6/xfs_fs_subr.c
index 36caa6d..5aeb777 100644
--- a/fs/xfs/linux-2.6/xfs_fs_subr.c
+++ b/fs/xfs/linux-2.6/xfs_fs_subr.c
@@ -24,6 +24,10 @@ int  fs_noerr(void) { return 0; }
 int  fs_nosys(void) { return ENOSYS; }
 void fs_noval(void) { return; }
 
+/*
+ * note: all filemap functions return negative error codes. These
+ * need to be inverted before returning to the xfs core functions.
+ */
 void
 xfs_tosspages(
 	xfs_inode_t	*ip,
@@ -53,7 +57,7 @@ xfs_flushinval_pages(
 		if (!ret)
 			truncate_inode_pages(mapping, first);
 	}
-	return ret;
+	return -ret;
 }
 
 int
@@ -72,10 +76,23 @@ xfs_flush_pages(
 		xfs_iflags_clear(ip, XFS_ITRUNCATED);
 		ret = filemap_fdatawrite(mapping);
 		if (flags & XFS_B_ASYNC)
-			return ret;
+			return -ret;
 		ret2 = filemap_fdatawait(mapping);
 		if (!ret)
 			ret = ret2;
 	}
-	return ret;
+	return -ret;
+}
+
+int
+xfs_wait_on_pages(
+	xfs_inode_t	*ip,
+	xfs_off_t	first,
+	xfs_off_t	last)
+{
+	struct address_space *mapping = VFS_I(ip)->i_mapping;
+
+	if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
+		return -filemap_fdatawait(mapping);
+	return 0;
 }
diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index 1957e53..4959c87 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -243,7 +243,7 @@ xfs_read(
 
 	if (unlikely(ioflags & IO_ISDIRECT)) {
 		if (inode->i_mapping->nrpages)
-			ret = xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK),
+			ret = -xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK),
 						    -1, FI_REMAPF_LOCKED);
 		mutex_unlock(&inode->i_mutex);
 		if (ret) {
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 9dc977d..c5cfc1e 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1012,21 +1012,26 @@ xfs_fs_write_inode(
 	struct inode		*inode,
 	int			sync)
 {
+	struct xfs_inode	*ip = XFS_I(inode);
 	int			error = 0;
 	int			flags = 0;
 
-	xfs_itrace_entry(XFS_I(inode));
+	xfs_itrace_entry(ip);
 	if (sync) {
-		filemap_fdatawait(inode->i_mapping);
+		error = xfs_wait_on_pages(ip, 0, -1);
+		if (error)
+			goto out_error;
 		flags |= FLUSH_SYNC;
 	}
-	error = xfs_inode_flush(XFS_I(inode), flags);
+	error = xfs_inode_flush(ip, flags);
+
+out_error:
 	/*
 	 * if we failed to write out the inode then mark
 	 * it dirty again so we'll try again later.
 	 */
 	if (error)
-		xfs_mark_inode_dirty_sync(XFS_I(inode));
+		xfs_mark_inode_dirty_sync(ip);
 
 	return -error;
 }
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 2e2fbd9..5809c42 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -713,7 +713,7 @@ xfs_fsync(
 		return XFS_ERROR(EIO);
 
 	/* capture size updates in I/O completion before writing the inode. */
-	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
+	error = xfs_wait_on_pages(ip, 0, -1);
 	if (error)
 		return XFS_ERROR(error);
 
diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h
index e932a96..f9cd376 100644
--- a/fs/xfs/xfs_vnodeops.h
+++ b/fs/xfs/xfs_vnodeops.h
@@ -78,5 +78,6 @@ int xfs_flushinval_pages(struct xfs_inode *ip, xfs_off_t first,
 		xfs_off_t last, int fiopt);
 int xfs_flush_pages(struct xfs_inode *ip, xfs_off_t first,
 		xfs_off_t last, uint64_t flags, int fiopt);
+int xfs_wait_on_pages(struct xfs_inode *ip, xfs_off_t first, xfs_off_t last);
 
 #endif /* _XFS_VNODEOPS_H */

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  4:57             ` Dave Chinner
@ 2008-10-29  5:06               ` Nick Piggin
  0 siblings, 0 replies; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  5:06 UTC (permalink / raw)
  To: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 03:57:07PM +1100, Dave Chinner wrote:
> On Wed, Oct 29, 2008 at 05:11:12AM +0100, Nick Piggin wrote:
> > Just be careful -- in your xfs_flush_pages, I think after the first
> > filemap_fdatawrite, the mapping may no longer be tagged with
> > PAGECACHE_TAG_DIRTY, so you may not pick up those writeback ones
> > you need to wait on.
> 
> Yes, I realised this as soon as I looked at the code.
> I added xfs_wait_on_pages() to do this wait. ;)
 
Oh good.


> > Might need a different variant, or we could just bite the bullet and
> > push through the ->fsync conversion so you get full control of the
> > writeout.
> 
> Not important right now, though....

OK.

 
> > BTW. the Linux pagecache APIs should support range operations quite
> > nicely for these. Any reason not to use them (it looks like the
> > wrappers can take ranges)?
> 
> Because I haven't got around to modifying these wrappers now that
> the range primitives are in place - XFS inherited the range
> operations from Irix and have sat unimplemented since being ported
> to Linux.

OK, fair enough. Just something easy to stick on a todo list somewhere ;)
Probably doesn't make much difference now, but we might start seeing
apps using range syncs.
 

> The patch below should fix this entire class of error value screwup
> in XFS. I've just started running it through XFSQA, so it's not
> really tested yet.

I'll sanity check it by running it through my basic fault injection
tests here.


> FWIW, your entire patch series made it through XFSQA without any new
> regressions, so it looks good from that POV.

Thanks, very good to know.

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> XFS: fix error inversion problems with data flushing
> 
> XFS gets the sign of the error wrong in several places when
> gathering the error from generic linux functions. These functions
> return negative error values, while the core XFS code returns
> positive error values. Hence when XFS inverts the error to be
> returned to the VFS, it can incorrectly invert a negative
> error and this error will be ignored by the syscall return.
> 
> Fix all the problems related to calling filemap_* functions.
> 
> Problem initially identified by Nick Piggin in xfs_fsync().
> 
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---
>  fs/xfs/linux-2.6/xfs_fs_subr.c |   23 ++++++++++++++++++++---
>  fs/xfs/linux-2.6/xfs_lrw.c     |    2 +-
>  fs/xfs/linux-2.6/xfs_super.c   |   13 +++++++++----
>  fs/xfs/xfs_vnodeops.c          |    2 +-
>  fs/xfs/xfs_vnodeops.h          |    1 +
>  5 files changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_fs_subr.c b/fs/xfs/linux-2.6/xfs_fs_subr.c
> index 36caa6d..5aeb777 100644
> --- a/fs/xfs/linux-2.6/xfs_fs_subr.c
> +++ b/fs/xfs/linux-2.6/xfs_fs_subr.c
> @@ -24,6 +24,10 @@ int  fs_noerr(void) { return 0; }
>  int  fs_nosys(void) { return ENOSYS; }
>  void fs_noval(void) { return; }
>  
> +/*
> + * note: all filemap functions return negative error codes. These
> + * need to be inverted before returning to the xfs core functions.
> + */
>  void
>  xfs_tosspages(
>  	xfs_inode_t	*ip,
> @@ -53,7 +57,7 @@ xfs_flushinval_pages(
>  		if (!ret)
>  			truncate_inode_pages(mapping, first);
>  	}
> -	return ret;
> +	return -ret;
>  }
>  
>  int
> @@ -72,10 +76,23 @@ xfs_flush_pages(
>  		xfs_iflags_clear(ip, XFS_ITRUNCATED);
>  		ret = filemap_fdatawrite(mapping);
>  		if (flags & XFS_B_ASYNC)
> -			return ret;
> +			return -ret;
>  		ret2 = filemap_fdatawait(mapping);
>  		if (!ret)
>  			ret = ret2;
>  	}
> -	return ret;
> +	return -ret;
> +}
> +
> +int
> +xfs_wait_on_pages(
> +	xfs_inode_t	*ip,
> +	xfs_off_t	first,
> +	xfs_off_t	last)
> +{
> +	struct address_space *mapping = VFS_I(ip)->i_mapping;
> +
> +	if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> +		return -filemap_fdatawait(mapping);
> +	return 0;
>  }
> diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
> index 1957e53..4959c87 100644
> --- a/fs/xfs/linux-2.6/xfs_lrw.c
> +++ b/fs/xfs/linux-2.6/xfs_lrw.c
> @@ -243,7 +243,7 @@ xfs_read(
>  
>  	if (unlikely(ioflags & IO_ISDIRECT)) {
>  		if (inode->i_mapping->nrpages)
> -			ret = xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK),
> +			ret = -xfs_flushinval_pages(ip, (*offset & PAGE_CACHE_MASK),
>  						    -1, FI_REMAPF_LOCKED);
>  		mutex_unlock(&inode->i_mutex);
>  		if (ret) {
> diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
> index 9dc977d..c5cfc1e 100644
> --- a/fs/xfs/linux-2.6/xfs_super.c
> +++ b/fs/xfs/linux-2.6/xfs_super.c
> @@ -1012,21 +1012,26 @@ xfs_fs_write_inode(
>  	struct inode		*inode,
>  	int			sync)
>  {
> +	struct xfs_inode	*ip = XFS_I(inode);
>  	int			error = 0;
>  	int			flags = 0;
>  
> -	xfs_itrace_entry(XFS_I(inode));
> +	xfs_itrace_entry(ip);
>  	if (sync) {
> -		filemap_fdatawait(inode->i_mapping);
> +		error = xfs_wait_on_pages(ip, 0, -1);
> +		if (error)
> +			goto out_error;
>  		flags |= FLUSH_SYNC;
>  	}
> -	error = xfs_inode_flush(XFS_I(inode), flags);
> +	error = xfs_inode_flush(ip, flags);
> +
> +out_error:
>  	/*
>  	 * if we failed to write out the inode then mark
>  	 * it dirty again so we'll try again later.
>  	 */
>  	if (error)
> -		xfs_mark_inode_dirty_sync(XFS_I(inode));
> +		xfs_mark_inode_dirty_sync(ip);
>  
>  	return -error;
>  }
> diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
> index 2e2fbd9..5809c42 100644
> --- a/fs/xfs/xfs_vnodeops.c
> +++ b/fs/xfs/xfs_vnodeops.c
> @@ -713,7 +713,7 @@ xfs_fsync(
>  		return XFS_ERROR(EIO);
>  
>  	/* capture size updates in I/O completion before writing the inode. */
> -	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
> +	error = xfs_wait_on_pages(ip, 0, -1);
>  	if (error)
>  		return XFS_ERROR(error);
>  
> diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h
> index e932a96..f9cd376 100644
> --- a/fs/xfs/xfs_vnodeops.h
> +++ b/fs/xfs/xfs_vnodeops.h
> @@ -78,5 +78,6 @@ int xfs_flushinval_pages(struct xfs_inode *ip, xfs_off_t first,
>  		xfs_off_t last, int fiopt);
>  int xfs_flush_pages(struct xfs_inode *ip, xfs_off_t first,
>  		xfs_off_t last, uint64_t flags, int fiopt);
> +int xfs_wait_on_pages(struct xfs_inode *ip, xfs_off_t first, xfs_off_t last);
>  
>  #endif /* _XFS_VNODEOPS_H */

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  3:26         ` Dave Chinner
  2008-10-29  4:11           ` Nick Piggin
@ 2008-10-29  9:13           ` Christoph Hellwig
  2008-10-29 21:42             ` Dave Chinner
  1 sibling, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2008-10-29  9:13 UTC (permalink / raw)
  To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote:
> Ok, I was right - these problems happen all the time. The above call
> should really call xfs_flush_pages() to do the flush and wait. I
> note that xfs_flush_pages() returns negative errors, and all the
> callers expect positive errors. I bet the same occurs for
> xfs_flushinval_pages() and xfs_tosspages() which are the wrappers
> that core XFS code is supposed to be using for flushing and
> invalidating file ranges....
> 
> I'll write up a patch that covers all of these.

Can you also merge xfs_fsync into xfs_file_fsync while you're at it?
The split newer made any sense as xfs_fsync is as Linux-specific as it
gets and shouldn't be in the pseudo OS-independent layer.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  9:13           ` Christoph Hellwig
@ 2008-10-29 21:42             ` Dave Chinner
  2008-10-29 21:45               ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2008-10-29 21:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 05:13:26AM -0400, Christoph Hellwig wrote:
> On Wed, Oct 29, 2008 at 02:26:01PM +1100, Dave Chinner wrote:
> > Ok, I was right - these problems happen all the time. The above call
> > should really call xfs_flush_pages() to do the flush and wait. I
> > note that xfs_flush_pages() returns negative errors, and all the
> > callers expect positive errors. I bet the same occurs for
> > xfs_flushinval_pages() and xfs_tosspages() which are the wrappers
> > that core XFS code is supposed to be using for flushing and
> > invalidating file ranges....
> > 
> > I'll write up a patch that covers all of these.
> 
> Can you also merge xfs_fsync into xfs_file_fsync while you're at it?
> The split newer made any sense as xfs_fsync is as Linux-specific as it
> gets and shouldn't be in the pseudo OS-independent layer.

I'll do that as a separate patch - it's not really part of a "fix
error value inversion" bug fix....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29 21:42             ` Dave Chinner
@ 2008-10-29 21:45               ` Christoph Hellwig
  2008-10-29 21:53                 ` Dave Chinner
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2008-10-29 21:45 UTC (permalink / raw)
  To: Christoph Hellwig, Nick Piggin, akpm, xfs, linux-fsdevel,
	Chris Mason

On Thu, Oct 30, 2008 at 08:42:02AM +1100, Dave Chinner wrote:
> I'll do that as a separate patch - it's not really part of a "fix
> error value inversion" bug fix....

Well it is kindof.  Because we'd call Linux library functions directly
from the method we skip to places that need sign conversion and could
possibly go wrong.  But if you want a separate patch that's fine with
me too.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29 21:45               ` Christoph Hellwig
@ 2008-10-29 21:53                 ` Dave Chinner
  0 siblings, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2008-10-29 21:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 05:45:03PM -0400, Christoph Hellwig wrote:
> On Thu, Oct 30, 2008 at 08:42:02AM +1100, Dave Chinner wrote:
> > I'll do that as a separate patch - it's not really part of a "fix
> > error value inversion" bug fix....
> 
> Well it is kindof.  Because we'd call Linux library functions directly
> from the method we skip to places that need sign conversion and could
> possibly go wrong.  But if you want a separate patch that's fine with
> me too.

Well, the patch I posted to fix the sign problems covered the
xfs_flush_pages() and xfs_flushinval_pages() functions that we
returning negative errors to core code, and none of the core code
inverted that before returning it to functions that inverted it.
That's a general problem that affected truncate, direct I/O,
getbmap, etc, as well as fsync...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  3:16       ` Dave Chinner
  2008-10-29  3:26         ` Dave Chinner
@ 2008-10-29  4:00         ` Nick Piggin
  2008-10-29  5:27           ` Dave Chinner
  2008-10-29  9:12         ` Christoph Hellwig
  2 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  4:00 UTC (permalink / raw)
  To: akpm, xfs, linux-fsdevel, Chris Mason; +Cc: linux-ext4

On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote:
> > > Yes - that's coming from end_buffer_async_write() when an error is
> > > reported in bio completion. This does:
> > > 
> > >  465                 set_bit(AS_EIO, &page->mapping->flags);
> > >  466                 set_buffer_write_io_error(bh);
> > >  467                 clear_buffer_uptodate(bh);
> > >  468                 SetPageError(page);
> > > 
> > > Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in
> > > wait_on_page_writeback_range() which is appears to be checking the
> > > mapping flags for errors. I wonder why that error is not being
> > > propagated then? AFAICT both XFS and the fsync code are doing the
> > > right thing but somewhere the error has gone missing...
> > 
> > This one-liner has it reporting EIO errors like a champion. I
> > don't know if you'll actually need to put this into the
> > linux API layer or not, but anyway the root cause of the problem
> > AFAIKS is this.
> > --
> > 
> > XFS: fix fsync errors not being propogated back to userspace.
> > ---
> > Index: linux-2.6/fs/xfs/xfs_vnodeops.c
> > ===================================================================
> > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c
> > +++ linux-2.6/fs/xfs/xfs_vnodeops.c
> > @@ -715,7 +715,7 @@ xfs_fsync(
> >  	/* capture size updates in I/O completion before writing the inode. */
> >  	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
> >  	if (error)
> > -		return XFS_ERROR(error);
> > +		return XFS_ERROR(-error);
> 
> <groan>
> 
> Yeah, that'd do it. Good catch. I can't believe I recently fixed a
> bug that touched these lines of code without noticing the inversion.
> Sometimes I wonder if we should just conver the entire of XFS to
> return negative errors - mistakes in handling negative error numbers
> in the core XFS code happen all the time.

Heh, yeah it seems a bit tricky. BTW. feel free to add
Signed-off-by: Nick Piggin <npiggin@suse.de>
if you would like to merge it, or alternatively if you fix it another
way then I can rerun the test in 2 minutes if it is not 100% obviously
correct.

BTW. XFS seems to really be a lot more consistent than ext3 (ordered)
in reporting EIO with this change in place. I don't know why exactly,
but in ext3 I can see hundreds or thousands of failures injected in this
manner:

FAULT_INJECTION: forcing a failure
Call Trace:
9ebe17f8:  [<6019f34e>] random32+0xe/0x20
9ebe1808:  [<601a3329>] should_fail+0xd9/0x130
9ebe1838:  [<6018d234>] generic_make_request+0x304/0x4e0
9ebe1848:  [<600622e1>] mempool_alloc+0x51/0x130
9ebe18f8:  [<6018e82f>] submit_bio+0x4f/0xe0
9ebe1948:  [<600a8969>] submit_bh+0xd9/0x130
9ebe1978:  [<600aab08>] __block_write_full_page+0x198/0x310
9ebe1988:  [<600a8487>] alloc_buffer_head+0x37/0x50
9ebe19a0:  [<600f6ec0>] ext3_get_block+0x0/0x110
9ebe19d8:  [<600f6ec0>] ext3_get_block+0x0/0x110
9ebe19e8:  [<600aad48>] block_write_full_page+0xc8/0x100
9ebe1a38:  [<600f8951>] ext3_ordered_writepage+0xd1/0x180
9ebe1a48:  [<6006616b>] set_page_dirty+0x4b/0xd0
9ebe1a88:  [<600660f5>] __writepage+0x15/0x40
9ebe1aa8:  [<60066755>] write_cache_pages+0x255/0x470
9ebe1ac0:  [<600660e0>] __writepage+0x0/0x40
9ebe1bc8:  [<60066990>] generic_writepages+0x20/0x30
9ebe1bd8:  [<600669db>] do_writepages+0x3b/0x40
9ebe1bf8:  [<6006006c>] __filemap_fdatawrite_range+0x5c/0x70
9ebe1c58:  [<6006028a>] filemap_fdatawrite+0x1a/0x20
9ebe1c68:  [<600a7a15>] do_fsync+0x45/0xe0
9ebe1c98:  [<6007792b>] sys_msync+0x14b/0x1d0
9ebe1cf8:  [<60019a70>] handle_syscall+0x50/0x80
9ebe1d18:  [<6002a10f>] userspace+0x44f/0x510
9ebe1fc8:  [<60016792>] fork_handler+0x62/0x70

Buffer I/O error on device ram0, logical block 8228
lost page write due to I/O error on ram0

But eventually the filesystem aborts when we get a journal writing error,
and sometimes I see a single EIO to msync, sometimes not (cc ext3 list).

Wheras XFS reports a lot more EIOs as it goes...

If ext3 guys have any suggestions, I could test them out.

 
> FWIW, the core issue here is that we've got to do the
> filemap_fdatawait() call in the ->fsync method because ->fsync
> gets called before we've waited for the data I/O to complete.
> XFS updates inode state on I/O completion, so we *must* wait
> for data I/O to complete before logging the inode changes. I
> think btrfs has the same problem....

Interesting. Does that mean you can do without the final filemap_fdatawait?
Can you do the first fdatawait without i_mutex held?

There was some talk IIRC about moving all this logic into the filesystem
to provide more flexibility here. If there is still enough interest,
we should get that moving...


> Thanks again, Nick.

No problem.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  4:00         ` Nick Piggin
@ 2008-10-29  5:27           ` Dave Chinner
  0 siblings, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2008-10-29  5:27 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, xfs, linux-fsdevel, Chris Mason, linux-ext4

On Wed, Oct 29, 2008 at 05:00:14AM +0100, Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> > FWIW, the core issue here is that we've got to do the
> > filemap_fdatawait() call in the ->fsync method because ->fsync
> > gets called before we've waited for the data I/O to complete.
> > XFS updates inode state on I/O completion, so we *must* wait
> > for data I/O to complete before logging the inode changes. I
> > think btrfs has the same problem....
> 
> Interesting. Does that mean you can do without the final filemap_fdatawait?

We could, yes.

> Can you do the first fdatawait without i_mutex held?

I don't see why not - I/O completion is only touching the XFS inode.
XFS has it's own inode locks for I/O and inode exclusion, and the
mapping tree lock protects the tree walk that the fdatawait is
doing...

> There was some talk IIRC about moving all this logic into the filesystem
> to provide more flexibility here. If there is still enough interest,
> we should get that moving...

It would simplify some of the code in XFS, that's for sure ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  3:16       ` Dave Chinner
  2008-10-29  3:26         ` Dave Chinner
  2008-10-29  4:00         ` Nick Piggin
@ 2008-10-29  9:12         ` Christoph Hellwig
  2008-10-29  9:21           ` Nick Piggin
  2 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2008-10-29  9:12 UTC (permalink / raw)
  To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> Yeah, that'd do it. Good catch. I can't believe I recently fixed a
> bug that touched these lines of code without noticing the inversion.
> Sometimes I wonder if we should just conver the entire of XFS to
> return negative errors - mistakes in handling negative error numbers
> in the core XFS code happen all the time.
> 
> FWIW, the core issue here is that we've got to do the
> filemap_fdatawait() call in the ->fsync method because ->fsync
> gets called before we've waited for the data I/O to complete.
> XFS updates inode state on I/O completion, so we *must* wait
> for data I/O to complete before logging the inode changes. I
> think btrfs has the same problem....

Yes.  I have patches to fix this by changing what ->fsync does and
how it's called.  I really need to get them out on the list.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  9:12         ` Christoph Hellwig
@ 2008-10-29  9:21           ` Nick Piggin
  2008-10-29  9:44             ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  9:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 05:12:03AM -0400, Christoph Hellwig wrote:
> On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> > Yeah, that'd do it. Good catch. I can't believe I recently fixed a
> > bug that touched these lines of code without noticing the inversion.
> > Sometimes I wonder if we should just conver the entire of XFS to
> > return negative errors - mistakes in handling negative error numbers
> > in the core XFS code happen all the time.
> > 
> > FWIW, the core issue here is that we've got to do the
> > filemap_fdatawait() call in the ->fsync method because ->fsync
> > gets called before we've waited for the data I/O to complete.
> > XFS updates inode state on I/O completion, so we *must* wait
> > for data I/O to complete before logging the inode changes. I
> > think btrfs has the same problem....
> 
> Yes.  I have patches to fix this by changing what ->fsync does and
> how it's called.  I really need to get them out on the list.

Please do.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  9:21           ` Nick Piggin
@ 2008-10-29  9:44             ` Christoph Hellwig
  2008-10-29 10:30               ` Nick Piggin
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2008-10-29  9:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Christoph Hellwig, akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote:
> Please do.

Well, there's one stumling block I haven't made progress on yet:

I've changed the prototype of ->fsync to lose the dentry as we should
always have a valid file struct.  Except that nfsd doesn't on
directories.  So I either need to fake up a file there, or bail out
and add a ->dir_sync export operation that needs just a dentry.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  9:44             ` Christoph Hellwig
@ 2008-10-29 10:30               ` Nick Piggin
  2008-10-29 12:22                 ` Jamie Lokier
  0 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-29 10:30 UTC (permalink / raw)
  To: Christoph Hellwig, linux-nfs; +Cc: akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote:
> On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote:
> > Please do.
> 
> Well, there's one stumling block I haven't made progress on yet:
> 
> I've changed the prototype of ->fsync to lose the dentry as we should
> always have a valid file struct.  Except that nfsd doesn't on
> directories.  So I either need to fake up a file there, or bail out
> and add a ->dir_sync export operation that needs just a dentry.

OK. I don't know much about hthat code, but I would think nfsd
should look as close to the syscall layer as possible. I guess
there must be something prohibitive (some protocol semantics?).

Is there anything that particularly makes it a file operation
as opposed to an inode operation?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29 10:30               ` Nick Piggin
@ 2008-10-29 12:22                 ` Jamie Lokier
       [not found]                   ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org>
  2008-10-29 21:43                   ` Dave Chinner
  0 siblings, 2 replies; 53+ messages in thread
From: Jamie Lokier @ 2008-10-29 12:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, linux-nfs, akpm, xfs, linux-fsdevel,
	Chris Mason

Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote:
> > On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote:
> > > Please do.
> > 
> > Well, there's one stumling block I haven't made progress on yet:
> > 
> > I've changed the prototype of ->fsync to lose the dentry as we should
> > always have a valid file struct.  Except that nfsd doesn't on
> > directories.  So I either need to fake up a file there, or bail out
> > and add a ->dir_sync export operation that needs just a dentry.
> 
> OK. I don't know much about hthat code, but I would think nfsd
> should look as close to the syscall layer as possible. I guess
> there must be something prohibitive (some protocol semantics?).
> 
> Is there anything that particularly makes it a file operation
> as opposed to an inode operation?

In principle, is fsync() required to flush all dirty data written
through any file descriptor ever, or just dirty data written through
the file descriptor used for fsync()?

-- Jamie

^ permalink raw reply	[flat|nested] 53+ messages in thread

[parent not found: <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org>]

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
       [not found]                   ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2008-10-29 13:32                     ` Ric Wheeler
  2008-10-29 14:56                       ` Chris Mason
  0 siblings, 1 reply; 53+ messages in thread
From: Ric Wheeler @ 2008-10-29 13:32 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Nick Piggin, Christoph Hellwig, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, xfs-VZNHf3L845pBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Chris Mason

Jamie Lokier wrote:
> Nick Piggin wrote:
>   
>> On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote:
>>     
>>> On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote:
>>>       
>>>> Please do.
>>>>         
>>> Well, there's one stumling block I haven't made progress on yet:
>>>
>>> I've changed the prototype of ->fsync to lose the dentry as we should
>>> always have a valid file struct.  Except that nfsd doesn't on
>>> directories.  So I either need to fake up a file there, or bail out
>>> and add a ->dir_sync export operation that needs just a dentry.
>>>       
>> OK. I don't know much about hthat code, but I would think nfsd
>> should look as close to the syscall layer as possible. I guess
>> there must be something prohibitive (some protocol semantics?).
>>
>> Is there anything that particularly makes it a file operation
>> as opposed to an inode operation?
>>     
>
> In principle, is fsync() required to flush all dirty data written
> through any file descriptor ever, or just dirty data written through
> the file descriptor used for fsync()?
>
> -- Jamie
> --
>   
http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html

Is a pointer to what seems to be the official posix spec for this - it 
is definitely per file descriptor, not per file system, etc...

What can happen by side effect (depending on the implementation) is that 
you can actually force out all data for any file. I found that you can 
approach non-fsync speeds for an fsync per file workload by simply 
writing all of the files out, then going back and fsync'ing them one at 
a time (last file first makes a bit of a difference). With that 
technique, you do get the hard promise of full data integrity and high 
speed. This is useful when you want to do bulk writes (tar, etc)

ric


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29 13:32                     ` Ric Wheeler
@ 2008-10-29 14:56                       ` Chris Mason
       [not found]                         ` <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Chris Mason @ 2008-10-29 14:56 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jamie Lokier, Nick Piggin, Christoph Hellwig, linux-nfs, akpm,
	xfs, linux-fsdevel

On Wed, 2008-10-29 at 09:32 -0400, Ric Wheeler wrote:
> Jamie Lokier wrote:

> >> Is there anything that particularly makes it a file operation
> >> as opposed to an inode operation?
> >>     
> >
> > In principle, is fsync() required to flush all dirty data written
> > through any file descriptor ever, or just dirty data written through
> > the file descriptor used for fsync()?
> >
> > -- Jamie
> > --
> >   
> http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html
> 
> Is a pointer to what seems to be the official posix spec for this - it 
> is definitely per file descriptor, not per file system, etc...
> 

Maybe I'm reading Jamie's question wrong, but I think he's saying:

/* open exactly the same file twice */
fd = open("file");
fd2 = open("file");

write(fd, "stuff")
write(fd2, "more stuff")
fsync(fd);

Does the fsync promise "more stuff" will be on disk?  I think the answer
should be yes.

-chris



^ permalink raw reply	[flat|nested] 53+ messages in thread

[parent not found: <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org>]

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
       [not found]                         ` <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org>
@ 2008-10-30  2:16                           ` Nick Piggin
       [not found]                             ` <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>
  0 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-30  2:16 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ric Wheeler, Jamie Lokier, Christoph Hellwig,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, xfs-VZNHf3L845pBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Wed, Oct 29, 2008 at 10:56:36AM -0400, Chris Mason wrote:
> On Wed, 2008-10-29 at 09:32 -0400, Ric Wheeler wrote:
> > Jamie Lokier wrote:
> 
> > >> Is there anything that particularly makes it a file operation
> > >> as opposed to an inode operation?
> > >>     
> > >
> > > In principle, is fsync() required to flush all dirty data written
> > > through any file descriptor ever, or just dirty data written through
> > > the file descriptor used for fsync()?
> > >
> > > -- Jamie
> > > --
> > >   
> > http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html
> > 
> > Is a pointer to what seems to be the official posix spec for this - it 
> > is definitely per file descriptor, not per file system, etc...
> > 
> 
> Maybe I'm reading Jamie's question wrong, but I think he's saying:
> 
> /* open exactly the same file twice */
> fd = open("file");
> fd2 = open("file");
> 
> write(fd, "stuff")
> write(fd2, "more stuff")
> fsync(fd);
> 
> Does the fsync promise "more stuff" will be on disk?  I think the answer
> should be yes.

I think so. And this is in the context of making ->fsync an inode
operation and avoid the NFS NULL-file problem... I don't think there
is any fd specific metadata that fsync has to deal with? Any other
reasons it has to be a file operation?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

[parent not found: <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>]

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
       [not found]                             ` <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>
@ 2008-10-30 12:51                               ` jim owens
  2008-10-30 13:41                                 ` Jim Rees
  0 siblings, 1 reply; 53+ messages in thread
From: jim owens @ 2008-10-30 12:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Chris Mason, Ric Wheeler, Jamie Lokier, Christoph Hellwig,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, xfs-VZNHf3L845pBDgjK7y7TUQ,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Nick Piggin wrote:
> On Wed, Oct 29, 2008 at 10:56:36AM -0400, Chris Mason wrote:
>> On Wed, 2008-10-29 at 09:32 -0400, Ric Wheeler wrote:
>>> Jamie Lokier wrote:
>>>>> Is there anything that particularly makes it a file operation
>>>>> as opposed to an inode operation?
>>>>>     
>>>> In principle, is fsync() required to flush all dirty data written
>>>> through any file descriptor ever, or just dirty data written through
>>>> the file descriptor used for fsync()?
>>>>
>>>> -- Jamie
>>>> --
>>>>   
>>> http://www.opengroup.org/onlinepubs/009695399/functions/fsync.html
>>>
>>> Is a pointer to what seems to be the official posix spec for this - it 
>>> is definitely per file descriptor, not per file system, etc...
>>>
>> Maybe I'm reading Jamie's question wrong, but I think he's saying:
>>
>> /* open exactly the same file twice */
>> fd = open("file");
>> fd2 = open("file");
>>
>> write(fd, "stuff")
>> write(fd2, "more stuff")
>> fsync(fd);
>>
>> Does the fsync promise "more stuff" will be on disk?  I think the answer
>> should be yes.
> 
> I think so. And this is in the context of making ->fsync an inode
> operation and avoid the NFS NULL-file problem... I don't think there
> is any fd specific metadata that fsync has to deal with? Any other
> reasons it has to be a file operation?

NO, or at least *not the posix definition*.  It is normal
in unix-like operating systems to always flush everything
dirty on the inode no matter what stream it arrived on.

Flushing everything is permitted but not the requirement so
applications must not expect this is *promised* or they
will not be portable.  It is only guaranteed that "stuff"
in this example will be on disk.

AFAIK the fsync semantic comes from the days of dinosaurs,
mainframes, and minicomputers... when a lot of operating
systems had user-space libraries that buffered the I/O.
On fsync(fd), the "fd2" data would still be in user-space.

jim
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-30 12:51                               ` jim owens
@ 2008-10-30 13:41                                 ` Jim Rees
  0 siblings, 0 replies; 53+ messages in thread
From: Jim Rees @ 2008-10-30 13:41 UTC (permalink / raw)
  To: jim owens
  Cc: Nick Piggin, Chris Mason, Ric Wheeler, Jamie Lokier,
	Christoph Hellwig, linux-nfs, akpm, xfs, linux-fsdevel

jim owens wrote:

  AFAIK the fsync semantic comes from the days of dinosaurs,
  mainframes, and minicomputers... when a lot of operating
  systems had user-space libraries that buffered the I/O.
  On fsync(fd), the "fd2" data would still be in user-space.

User space buffering happens in stdio, which is above the system call
level.  It's been that way since fsync() was first introduced, and is still
that way today.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29 12:22                 ` Jamie Lokier
       [not found]                   ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org>
@ 2008-10-29 21:43                   ` Dave Chinner
  1 sibling, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2008-10-29 21:43 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Nick Piggin, Christoph Hellwig, linux-nfs, akpm, xfs,
	linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 12:22:35PM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 05:44:17AM -0400, Christoph Hellwig wrote:
> > > On Wed, Oct 29, 2008 at 10:21:43AM +0100, Nick Piggin wrote:
> > > > Please do.
> > > 
> > > Well, there's one stumling block I haven't made progress on yet:
> > > 
> > > I've changed the prototype of ->fsync to lose the dentry as we should
> > > always have a valid file struct.  Except that nfsd doesn't on
> > > directories.  So I either need to fake up a file there, or bail out
> > > and add a ->dir_sync export operation that needs just a dentry.
> > 
> > OK. I don't know much about hthat code, but I would think nfsd
> > should look as close to the syscall layer as possible. I guess
> > there must be something prohibitive (some protocol semantics?).
> > 
> > Is there anything that particularly makes it a file operation
> > as opposed to an inode operation?
> 
> In principle, is fsync() required to flush all dirty data written
> through any file descriptor ever, or just dirty data written through
> the file descriptor used for fsync()?

fsync() is required to flush the data that is dirty at the time of the
call, as well as any associated metadata needed to reference that data. It
doesn't matter who wrote the data in the first place....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 22:27   ` Dave Chinner
  2008-10-29  0:04     ` Nick Piggin
  2008-10-29  0:16     ` Nick Piggin
@ 2008-10-29  8:51     ` Dave Chinner
  2 siblings, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2008-10-29  8:51 UTC (permalink / raw)
  To: Nick Piggin, akpm, xfs, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote:
> On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote:
> > I think I also saw a slab bug when running dbench with fault injection on.
> > Running latest Linus kernel.
.....
> > xfs_force_shutdown(ram0,0x2) called from line 1056 of file /home/npiggin/usr/src
> > /linux-2.6/fs/xfs/xfs_log.c.  Return address = 0x000000006011370d
> > Filesystem "ram0": Log I/O Error Detected.  Shutting down filesystem: ram0
> > Please umount the filesystem, and rectify the problem(s)
> > xfs_force_shutdown(ram0,0x2) called from line 818 of file /home/npiggin/usr/src/
> > linux-2.6/fs/xfs/xfs_log.c.  Return address = 0x0000000060114e7d
> > slab error in verify_redzone_free(): cache `xfs_log_ticket': double free detecte
> > d
....
> Now that is interesting.
> 
> We've got a rolling transaction in progress, and the commit of the
> first part of the transaction has got the I/O error.  That frees the
> transaction structure used during that commit, as well as the
> ticket.
> 
> However, before we committed the initial transaction, we duplicated
> the transaction structure to allow the transaction to continue to
> track all the dirty objects in the first commit. That included
> duplicating the pointer to the ticket.
> 
> Then the EIO is returned to mkdir code with the duplicated
> transaction, which is then cancelled, and that frees the transaction
> and the ticket it holds. However, we'd already freed the ticket.
> 
> Ok, we're only seeing this problem now because I recently modified
> the ticket allocation to use a slab instead of a roll-your-own free
> list structure that wouldn't have been poisoned. Nice to know that
> this change did more than just remove code. ;)
> 
> This might take a little while to fix - a lot of code needs
> auditing - but thanks for reporting the problem.

The patch below should fix this problem. It's about 10% of the way
through XFSQA, which means there's nothing obviously bad about it,
but it will need many more eyes on it before we decide if this is
the best way to fix the problem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

XFS: Fix double free of log tickets

When an I/O error occurs during an intermediate commit on a rolling
transaction, xfs_trans_commit() will free the transaction structure
and the related ticket. However, the duplicate transaction that
gets used as the transaction continues still contains a pointer
to the ticket. Hence when the duplicate transaction is cancelled
and freed, we free the ticket a second time.

Add reference counting to the ticket so that we hold an extra
reference to the ticket over the transaction commit. We drop the
extra reference once we have checked that the transaction commit
did not return an error, thus avoiding a double free on commit
error.

Credit to Nick Piggin for tripping over the problem.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_bmap.c     |   10 ++++++++--
 fs/xfs/xfs_inode.c    |   10 ++++++++--
 fs/xfs/xfs_log.c      |   39 +++++++++++++++++++++++++--------------
 fs/xfs/xfs_log.h      |    4 ++++
 fs/xfs/xfs_log_priv.h |    1 +
 fs/xfs/xfs_trans.c    |    9 ++++++++-
 fs/xfs/xfs_utils.c    |    6 ++++++
 fs/xfs/xfs_vnodeops.c |    6 ++++++
 8 files changed, 66 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index db28905..c391221 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -4292,9 +4292,15 @@ xfs_bmap_finish(
 	 * We have a new transaction, so we should return committed=1,
 	 * even though we're returning an error.
 	 */
-	if (error) {
+	if (error)
 		return error;
-	}
+
+	/*
+	 * transaction commit worked ok so we can drop the extra ticket
+	 * reference that we gained in xfs_trans_dup()
+	 */
+	xfs_log_ticket_put(ntp->t_ticket);
+
 	if ((error = xfs_trans_reserve(ntp, 0, logres, 0, XFS_TRANS_PERM_LOG_RES,
 			logcount)))
 		return error;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index cd52282..b977100 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1782,8 +1782,14 @@ xfs_itruncate_finish(
 		xfs_trans_ijoin(ntp, ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
 		xfs_trans_ihold(ntp, ip);
 
-		if (!error)
-			error = xfs_trans_reserve(ntp, 0,
+		if (error)
+			return error;
+		/*
+		 * transaction commit worked ok so we can drop the extra ticket
+		 * reference that we gained in xfs_trans_dup()
+		 */
+		xfs_log_ticket_put(ntp->t_ticket);
+		error = xfs_trans_reserve(ntp, 0,
 					XFS_ITRUNCATE_LOG_RES(mp), 0,
 					XFS_TRANS_PERM_LOG_RES,
 					XFS_ITRUNCATE_LOG_COUNT);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index cc1e789..9aecefd 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -100,12 +100,11 @@ STATIC void xlog_ungrant_log_space(xlog_t	 *log,
 
 
 /* local ticket functions */
-STATIC xlog_ticket_t	*xlog_ticket_get(xlog_t *log,
+STATIC xlog_ticket_t	*xlog_ticket_alloc(xlog_t *log,
 					 int	unit_bytes,
 					 int	count,
 					 char	clientid,
 					 uint	flags);
-STATIC void		xlog_ticket_put(xlog_t *log, xlog_ticket_t *ticket);
 
 #if defined(DEBUG)
 STATIC void	xlog_verify_dest_ptr(xlog_t *log, __psint_t ptr);
@@ -360,7 +359,7 @@ xfs_log_done(xfs_mount_t	*mp,
 		 */
 		xlog_trace_loggrant(log, ticket, "xfs_log_done: (non-permanent)");
 		xlog_ungrant_log_space(log, ticket);
-		xlog_ticket_put(log, ticket);
+		xfs_log_ticket_put(ticket);
 	} else {
 		xlog_trace_loggrant(log, ticket, "xfs_log_done: (permanent)");
 		xlog_regrant_reserve_log_space(log, ticket);
@@ -514,7 +513,7 @@ xfs_log_reserve(xfs_mount_t	 *mp,
 		retval = xlog_regrant_write_log_space(log, internal_ticket);
 	} else {
 		/* may sleep if need to allocate more tickets */
-		internal_ticket = xlog_ticket_get(log, unit_bytes, cnt,
+		internal_ticket = xlog_ticket_alloc(log, unit_bytes, cnt,
 						  client, flags);
 		if (!internal_ticket)
 			return XFS_ERROR(ENOMEM);
@@ -749,7 +748,7 @@ xfs_log_unmount_write(xfs_mount_t *mp)
 		if (tic) {
 			xlog_trace_loggrant(log, tic, "unmount rec");
 			xlog_ungrant_log_space(log, tic);
-			xlog_ticket_put(log, tic);
+			xfs_log_ticket_put(tic);
 		}
 	} else {
 		/*
@@ -3223,22 +3222,33 @@ xlog_state_want_sync(xlog_t *log, xlog_in_core_t *iclog)
  */
 
 /*
- * Free a used ticket.
+ * Free a used ticket when it's refcount falls to zero.
  */
-STATIC void
-xlog_ticket_put(xlog_t		*log,
-		xlog_ticket_t	*ticket)
+void
+xfs_log_ticket_put(
+	xlog_ticket_t	*ticket)
 {
-	sv_destroy(&ticket->t_wait);
-	kmem_zone_free(xfs_log_ticket_zone, ticket);
-}	/* xlog_ticket_put */
+	ASSERT(atomic_read(&ticket->t_ref) > 0);
+	if (atomic_dec_and_test(&ticket->t_ref)) {
+		sv_destroy(&ticket->t_wait);
+		kmem_zone_free(xfs_log_ticket_zone, ticket);
+	}
+}
 
+xlog_ticket_t *
+xfs_log_ticket_get(
+	xlog_ticket_t	*ticket)
+{
+	ASSERT(atomic_read(&ticket->t_ref) > 0);
+	atomic_inc(&ticket->t_ref);
+	return ticket;
+}
 
 /*
  * Allocate and initialise a new log ticket.
  */
 STATIC xlog_ticket_t *
-xlog_ticket_get(xlog_t		*log,
+xlog_ticket_alloc(xlog_t		*log,
 		int		unit_bytes,
 		int		cnt,
 		char		client,
@@ -3309,6 +3319,7 @@ xlog_ticket_get(xlog_t		*log,
 		unit_bytes += 2*BBSIZE;
         }
 
+	atomic_set(&tic->t_ref, 1);
 	tic->t_unit_res		= unit_bytes;
 	tic->t_curr_res		= unit_bytes;
 	tic->t_cnt		= cnt;
@@ -3324,7 +3335,7 @@ xlog_ticket_get(xlog_t		*log,
 	xlog_tic_reset_res(tic);
 
 	return tic;
-}	/* xlog_ticket_get */
+}
 
 
 /******************************************************************************
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index d47b91f..8a3e84e 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -134,6 +134,7 @@ typedef struct xfs_log_callback {
 #ifdef __KERNEL__
 /* Log manager interfaces */
 struct xfs_mount;
+struct xlog_ticket;
 xfs_lsn_t xfs_log_done(struct xfs_mount *mp,
 		       xfs_log_ticket_t ticket,
 		       void		**iclog,
@@ -177,6 +178,9 @@ int	  xfs_log_need_covered(struct xfs_mount *mp);
 
 void	  xlog_iodone(struct xfs_buf *);
 
+struct xlog_ticket * xfs_log_ticket_get(struct xlog_ticket *ticket);
+void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
+
 #endif
 
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index de7ef6c..b39a198 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -245,6 +245,7 @@ typedef struct xlog_ticket {
 	struct xlog_ticket *t_next;	 /*			         :4|8 */
 	struct xlog_ticket *t_prev;	 /*				 :4|8 */
 	xlog_tid_t	   t_tid;	 /* transaction identifier	 : 4  */
+	atomic_t	   t_ref;	 /* ticket reference count       : 4  */
 	int		   t_curr_res;	 /* current reservation in bytes : 4  */
 	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
 	char		   t_ocnt;	 /* original count		 : 1  */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index ad137ef..8570b82 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -290,7 +290,7 @@ xfs_trans_dup(
 	ASSERT(tp->t_ticket != NULL);
 
 	ntp->t_flags = XFS_TRANS_PERM_LOG_RES | (tp->t_flags & XFS_TRANS_RESERVE);
-	ntp->t_ticket = tp->t_ticket;
+	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
 	ntp->t_blk_res = tp->t_blk_res - tp->t_blk_res_used;
 	tp->t_blk_res = tp->t_blk_res_used;
 	ntp->t_rtx_res = tp->t_rtx_res - tp->t_rtx_res_used;
@@ -1260,6 +1260,13 @@ xfs_trans_roll(
 	trans = *tpp;
 
 	/*
+	 * transaction commit worked ok so we can drop the extra ticket
+	 * reference that we gained in xfs_trans_dup()
+	 */
+	xfs_log_ticket_put(trans->t_ticket);
+
+
+	/*
 	 * Reserve space in the log for th next transaction.
 	 * This also pushes items in the "AIL", the list of logged items,
 	 * out to disk if they are taking up space at the tail of the log
diff --git a/fs/xfs/xfs_utils.c b/fs/xfs/xfs_utils.c
index 35d4d41..7711449 100644
--- a/fs/xfs/xfs_utils.c
+++ b/fs/xfs/xfs_utils.c
@@ -172,6 +172,12 @@ xfs_dir_ialloc(
 			*ipp = NULL;
 			return code;
 		}
+
+		/*
+		 * transaction commit worked ok so we can drop the extra ticket
+		 * reference that we gained in xfs_trans_dup()
+		 */
+		xfs_log_ticket_put(tp->t_ticket);
 		code = xfs_trans_reserve(tp, 0, log_res, 0,
 					 XFS_TRANS_PERM_LOG_RES, log_count);
 		/*
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 5809c42..f7eea70 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -1029,6 +1029,12 @@ xfs_inactive_symlink_rmt(
 		goto error0;
 	}
 	/*
+	 * transaction commit worked ok so we can drop the extra ticket
+	 * reference that we gained in xfs_trans_dup()
+	 */
+	xfs_log_ticket_put(tp->t_ticket);
+
+	/*
 	 * Remove the memory for extent descriptions (just bookkeeping).
 	 */
 	if (ip->i_df.if_bytes)

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
                   ` (9 preceding siblings ...)
  2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin
@ 2008-10-28 23:14 ` Dave Chinner
  2008-10-28 23:57   ` Nick Piggin
  10 siblings, 1 reply; 53+ messages in thread
From: Dave Chinner @ 2008-10-28 23:14 UTC (permalink / raw)
  To: npiggin; +Cc: akpm, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote:
> OK, I'm happier with this patchset now. Note that I've taken your patch
> and mangled it a bit at the end of the series.
> 
> This one survives and seems to run OK here, but I'm mainly doing dumb
> stress testing with a handful of filesystems, and data-io error injection
> testing. There are a lot of combinations of ways this function can operate
> and interact obviously, so it would be helpful to get more review.
> 
> Chris, would you possibly have time to run your btrfs tests that are
> sensitive to problems in this code? I could provide you a single patch
> rollup against mainline if it helps.

Nick, after applying the patchset:

  CC      mm/page-writeback.o
mm/page-writeback.c: In function write_cache_pages:
mm/page-writeback.c:871: warning: wrteback_index may be used uninitialized in this function

Looks harmless, but it probably should be cleaned up.
Compiler is 'gcc version 4.3.1 (Debian 4.3.1-9)'

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 23:14 ` Dave Chinner
@ 2008-10-28 23:57   ` Nick Piggin
  2008-10-29  0:05     ` Andrew Morton
  0 siblings, 1 reply; 53+ messages in thread
From: Nick Piggin @ 2008-10-28 23:57 UTC (permalink / raw)
  To: akpm, linux-fsdevel, Chris Mason

On Wed, Oct 29, 2008 at 10:14:35AM +1100, Dave Chinner wrote:
> On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote:
> > OK, I'm happier with this patchset now. Note that I've taken your patch
> > and mangled it a bit at the end of the series.
> > 
> > This one survives and seems to run OK here, but I'm mainly doing dumb
> > stress testing with a handful of filesystems, and data-io error injection
> > testing. There are a lot of combinations of ways this function can operate
> > and interact obviously, so it would be helpful to get more review.
> > 
> > Chris, would you possibly have time to run your btrfs tests that are
> > sensitive to problems in this code? I could provide you a single patch
> > rollup against mainline if it helps.
> 
> Nick, after applying the patchset:
> 
>   CC      mm/page-writeback.o
> mm/page-writeback.c: In function write_cache_pages:
> mm/page-writeback.c:871: warning: wrteback_index may be used uninitialized in this function
> 
> Looks harmless, but it probably should be cleaned up.
> Compiler is 'gcc version 4.3.1 (Debian 4.3.1-9)'

Yeah, it's annoying but I couldn't find a way to shut it up nicely
(yes AFAIKS it is harmless).

I could just put the old "= 0; /* shut up gcc*/" trick there, but that
sucks too ;(



^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-28 23:57   ` Nick Piggin
@ 2008-10-29  0:05     ` Andrew Morton
  2008-10-29  0:10       ` Nick Piggin
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Morton @ 2008-10-29  0:05 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, chris.mason

On Wed, 29 Oct 2008 00:57:55 +0100
Nick Piggin <npiggin@suse.de> wrote:

> On Wed, Oct 29, 2008 at 10:14:35AM +1100, Dave Chinner wrote:
> > On Wed, Oct 29, 2008 at 01:47:15AM +1100, npiggin@suse.de wrote:
> > > OK, I'm happier with this patchset now. Note that I've taken your patch
> > > and mangled it a bit at the end of the series.
> > > 
> > > This one survives and seems to run OK here, but I'm mainly doing dumb
> > > stress testing with a handful of filesystems, and data-io error injection
> > > testing. There are a lot of combinations of ways this function can operate
> > > and interact obviously, so it would be helpful to get more review.
> > > 
> > > Chris, would you possibly have time to run your btrfs tests that are
> > > sensitive to problems in this code? I could provide you a single patch
> > > rollup against mainline if it helps.
> > 
> > Nick, after applying the patchset:
> > 
> >   CC      mm/page-writeback.o
> > mm/page-writeback.c: In function write_cache_pages:
> > mm/page-writeback.c:871: warning: wrteback_index may be used uninitialized in this function
> > 
> > Looks harmless, but it probably should be cleaned up.
> > Compiler is 'gcc version 4.3.1 (Debian 4.3.1-9)'
> 
> Yeah, it's annoying but I couldn't find a way to shut it up nicely
> (yes AFAIKS it is harmless).
> 
> I could just put the old "= 0; /* shut up gcc*/" trick there, but that
> sucks too ;(
> 

uninitialized_var()?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch 0/9] writeback data integrity and other fixes (take 3)
  2008-10-29  0:05     ` Andrew Morton
@ 2008-10-29  0:10       ` Nick Piggin
  0 siblings, 0 replies; 53+ messages in thread
From: Nick Piggin @ 2008-10-29  0:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, chris.mason

On Tue, Oct 28, 2008 at 05:05:31PM -0700, Andrew Morton wrote:
> On Wed, 29 Oct 2008 00:57:55 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> > 
> > Yeah, it's annoying but I couldn't find a way to shut it up nicely
> > (yes AFAIKS it is harmless).
> > 
> > I could just put the old "= 0; /* shut up gcc*/" trick there, but that
> > sucks too ;(
> > 
> 
> uninitialized_var()?

OK, that's a little bit better. Still somewhat subject to bitrot and
hiding of real bugs, but I guess you can disable it easly. Will do.



^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2008-11-01  8:04 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-28 14:47 [patch 0/9] writeback data integrity and other fixes (take 3) npiggin
2008-10-28 14:47 ` [patch 1/9] mm: write_cache_pages cyclic fix npiggin
2008-10-29  0:24   ` [patch 1.1/9] mm: write_cache_pages cyclic fix fix Nick Piggin
2008-10-28 14:47 ` [patch 2/9] mm: write_cache_pages early loop termination npiggin
2008-10-28 14:47 ` [patch 3/9] mm: write_cache_pages writepage error fix npiggin
2008-10-28 14:47 ` [patch 4/9] mm: write_cache_pages integrity fix npiggin
2008-10-28 14:47 ` [patch 5/9] mm: write_cache_pages cleanups npiggin
2008-10-28 14:47 ` [patch 6/9] mm: write_cache_pages optimise page cleaning npiggin
2008-10-28 14:47 ` [patch 7/9] mm: write_cache_pages terminate quickly npiggin
2008-10-30 23:07   ` Andrew Morton
2008-10-31  7:29     ` Nick Piggin
2008-10-28 14:47 ` [patch 8/9] mm: write_cache_pages more " npiggin
2008-10-28 14:47 ` [patch 9/9] mm: do_sync_mapping_range integrity fix npiggin
2008-10-30 23:13   ` Andrew Morton
2008-10-31  9:16     ` Nick Piggin
2008-10-31 10:04       ` Andrew Morton
2008-10-31 10:53         ` Nick Piggin
2008-10-31 20:03         ` Jamie Lokier
2008-10-31 14:10       ` Chris Mason
2008-10-31 14:30         ` steve
2008-10-31 15:02           ` Chris Mason
2008-11-01  8:04         ` Nick Piggin
2008-10-28 15:39 ` [patch 0/9] writeback data integrity and other fixes (take 3) Nick Piggin
2008-10-28 22:27   ` Dave Chinner
2008-10-29  0:04     ` Nick Piggin
2008-10-29  0:16     ` Nick Piggin
2008-10-29  3:16       ` Dave Chinner
2008-10-29  3:26         ` Dave Chinner
2008-10-29  4:11           ` Nick Piggin
2008-10-29  4:57             ` Dave Chinner
2008-10-29  5:06               ` Nick Piggin
2008-10-29  9:13           ` Christoph Hellwig
2008-10-29 21:42             ` Dave Chinner
2008-10-29 21:45               ` Christoph Hellwig
2008-10-29 21:53                 ` Dave Chinner
2008-10-29  4:00         ` Nick Piggin
2008-10-29  5:27           ` Dave Chinner
2008-10-29  9:12         ` Christoph Hellwig
2008-10-29  9:21           ` Nick Piggin
2008-10-29  9:44             ` Christoph Hellwig
2008-10-29 10:30               ` Nick Piggin
2008-10-29 12:22                 ` Jamie Lokier
     [not found]                   ` <20081029122234.GE846-yetKDKU6eevNLxjTenLetw@public.gmane.org>
2008-10-29 13:32                     ` Ric Wheeler
2008-10-29 14:56                       ` Chris Mason
     [not found]                         ` <1225292196.6448.263.camel-cGoWVVl3WGUrkklhUoBCrlaTQe2KTcn/@public.gmane.org>
2008-10-30  2:16                           ` Nick Piggin
     [not found]                             ` <20081030021601.GF18041-B4tOwbsTzaBolqkO4TVVkw@public.gmane.org>
2008-10-30 12:51                               ` jim owens
2008-10-30 13:41                                 ` Jim Rees
2008-10-29 21:43                   ` Dave Chinner
2008-10-29  8:51     ` Dave Chinner
2008-10-28 23:14 ` Dave Chinner
2008-10-28 23:57   ` Nick Piggin
2008-10-29  0:05     ` Andrew Morton
2008-10-29  0:10       ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).