* [patch 0/7] writeback data integrity and other fixes
@ 2008-10-21 8:09 npiggin
2008-10-21 8:09 ` [patch 1/7] mm: write_cache_pages writepage error fix npiggin
` (6 more replies)
0 siblings, 7 replies; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
Hi,
Please apply these patches (and hopefully send at least the integrity
ones to 2.6.28).
Thanks,
Nick
--
^ permalink raw reply [flat|nested] 14+ messages in thread
* [patch 1/7] mm: write_cache_pages writepage error fix
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
@ 2008-10-21 8:09 ` npiggin
2008-10-21 21:55 ` Andrew Morton
2008-10-21 8:09 ` [patch 2/7] mm: write_cache_pages integrity fix npiggin
` (5 subsequent siblings)
6 siblings, 1 reply; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
[-- Attachment #1: mm-wcp-writepage-error-fix.patch --]
[-- Type: text/plain, Size: 1276 bytes --]
In write_cache_pages, if ret signals a real error, but we still have some pages
left in the pagevec, done would be set to 1, but the remaining pages would
continue to be processed and ret will be overwritten in the process. It could
easily be overwritten with success, and thus success will be returned even if
there is an error. Thus the caller is told all writes succeeded, whereas in
reality some did not.
Fix this by bailing immediately if there is an error, and retaining the first
error code.
This is a data integrity bug.
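To illustrate the control-flow difference, here is a toy user-space sketch (hypothetical code, not the kernel loop itself; TOY_ACTIVATE stands in for AOP_WRITEPAGE_ACTIVATE, and negative values stand in for errno-style errors):

```c
#include <assert.h>

#define TOY_ACTIVATE 0x80000    /* stands in for AOP_WRITEPAGE_ACTIVATE */

/* Buggy variant: keep processing the pagevec, so a later success
 * overwrites an earlier real error in ret. */
static int toy_flush_buggy(const int *results, int n)
{
	int ret = 0;
	int i;

	for (i = 0; i < n; i++) {
		ret = results[i];
		if (ret == TOY_ACTIVATE)
			ret = 0;        /* not an error: page stays dirty */
		/* bug: loop continues, ret is clobbered by later pages */
	}
	return ret;
}

/* Fixed variant: bail out immediately, retaining the first error code. */
static int toy_flush_fixed(const int *results, int n)
{
	int ret = 0;
	int i;

	for (i = 0; i < n; i++) {
		ret = results[i];
		if (ret == TOY_ACTIVATE)
			ret = 0;
		else if (ret)
			break;          /* keep the first real error */
	}
	return ret;
}
```

With results { 0, -5, 0 } the buggy loop returns 0 (the -5 is lost), while the fixed loop returns -5.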
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -931,12 +931,16 @@ retry:
}
ret = (*writepage)(page, wbc, data);
-
- if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
- unlock_page(page);
- ret = 0;
- }
- if (ret || (--nr_to_write <= 0))
+ if (unlikely(ret)) {
+ if (ret == AOP_WRITEPAGE_ACTIVATE) {
+ unlock_page(page);
+ ret = 0;
+ } else {
+ done = 1;
+ break;
+ }
+ }
+ if (--nr_to_write <= 0)
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
--
* [patch 2/7] mm: write_cache_pages integrity fix
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
2008-10-21 8:09 ` [patch 1/7] mm: write_cache_pages writepage error fix npiggin
@ 2008-10-21 8:09 ` npiggin
2008-10-21 8:09 ` [patch 3/7] mm: do_sync_mapping_range " npiggin
` (4 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
[-- Attachment #1: mm-wcp-integrity-fix.patch --]
[-- Type: text/plain, Size: 2796 bytes --]
In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so
the function will return success after writing out nr_to_write pages, even if
that was not sufficient to guarantee data integrity.
The callers tend to set it to values that could easily break data integrity
semantics in practice. For example, nr_to_write can be set to
mapping->nrpages * 2. However, if a file has a single dirty page and fsync is
then called, subsequent pages might be concurrently added and dirtied, and
write_cache_pages might write out two of these newly dirtied pages while not
writing out the old page that should have been written out.
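That scenario can be sketched with a toy model (hypothetical user-space code, not the kernel function; page counts and the budget are made up to match the example above):

```c
#include <assert.h>
#include <string.h>

#define TOY_NPAGES 16

/* Scan dirty pages in index order, writing each one. If ignore_budget
 * is 0, stop once nr_to_write pages have been written (the old
 * behaviour); if 1, write everything (integrity behaviour). Returns 1
 * if the page at index must_write got written. */
static int toy_sync(const int *dirty, long nr_to_write, int ignore_budget,
		    int must_write)
{
	int written[TOY_NPAGES];
	int i;

	memset(written, 0, sizeof(written));
	for (i = 0; i < TOY_NPAGES; i++) {
		if (!dirty[i])
			continue;
		written[i] = 1;
		if (!ignore_budget && --nr_to_write <= 0)
			break;  /* budget exhausted: stop early */
	}
	return written[must_write];
}
```

With one old dirty page at index 15, two concurrently dirtied pages at indices 0 and 1, and a budget of 2 (nrpages * 2 with nrpages == 1), the budgeted scan misses the old page; ignoring the budget writes it.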
Fix this by ignoring nr_to_write if it is a data integrity sync.
This is a data integrity bug.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
The reason this has been done in the past is to avoid stalling sync operations
behind page dirtiers.
"If a file has one dirty page at offset 1000000000000000 then someone
does an fsync() and someone else gets in first and starts madly writing
pages at offset 0, we want to write that page at 1000000000000000.
Somehow."
What we do today is return success after an arbitrary number of pages are
written, whether or not we have provided the data-integrity semantics that
the caller has asked for. Even this doesn't actually fix all stall cases
completely: in the above situation, if the file has a huge number of pages
in pagecache (but not dirty), then mapping->nrpages is going to be huge,
even if pages are being dirtied.
This change does indeed make the possibility of long stalls larger, and that's
not a good thing, but lying about data integrity is even worse. We have to
either perform the sync, or return -ELINUXISLAME so at least the caller knows
what has happened.
There are subsequent competing approaches in the works to solve the stall
problems properly, without compromising data integrity.
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -940,8 +940,10 @@ retry:
break;
}
}
- if (--nr_to_write <= 0)
- done = 1;
+ if (wbc->sync_mode == WB_SYNC_NONE) {
+ if (--wbc->nr_to_write <= 0)
+ done = 1;
+ }
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
done = 1;
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -210,7 +210,7 @@ int __filemap_fdatawrite_range(struct ad
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
- .nr_to_write = mapping->nrpages * 2,
+ .nr_to_write = LONG_MAX,
.range_start = start,
.range_end = end,
};
--
* [patch 3/7] mm: do_sync_mapping_range integrity fix
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
2008-10-21 8:09 ` [patch 1/7] mm: write_cache_pages writepage error fix npiggin
2008-10-21 8:09 ` [patch 2/7] mm: write_cache_pages integrity fix npiggin
@ 2008-10-21 8:09 ` npiggin
2008-10-21 8:09 ` [patch 4/7] mm: write_cache_pages cyclic fix npiggin
` (3 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
[-- Attachment #1: mm-do-sync-mapping-range-fix.patch --]
[-- Type: text/plain, Size: 645 bytes --]
Chris Mason noticed that do_sync_mapping_range didn't actually ask for data
integrity writeout. Unfortunately, it is advertised as being usable for
data integrity operations.
This is a data integrity bug.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c
+++ linux-2.6/fs/sync.c
@@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address
if (flags & SYNC_FILE_RANGE_WRITE) {
ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
- WB_SYNC_NONE);
+ WB_SYNC_ALL);
if (ret < 0)
goto out;
}
--
* [patch 4/7] mm: write_cache_pages cyclic fix
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
` (2 preceding siblings ...)
2008-10-21 8:09 ` [patch 3/7] mm: do_sync_mapping_range " npiggin
@ 2008-10-21 8:09 ` npiggin
2008-10-21 8:09 ` [patch 5/7] mm: write_cache_pages cleanups npiggin
` (2 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
[-- Attachment #1: mm-wcp-cyclic-fix.patch --]
[-- Type: text/plain, Size: 2695 bytes --]
In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback
has circled through zero, thus we should not circle again. However it gets set
to 1 after the first successful pagevec lookup. This leads to cases where not
enough data gets written.
Counterexample: file with first 10 pages dirty, writeback_index == 5,
nr_to_write == 10. Then the last 5 pages will be found, and scanned will be set
to 1; after writing those out, we will not cycle back to get the first 5.
Rework this logic so that we always cycle unless we started off from index 0.
When cycling, only write out as far as 1 page before the start page from the
first cycle (so we don't write parts of the file twice).
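The reworked cycling can be sketched as a toy model (hypothetical user-space code, not write_cache_pages itself; a 10-page file stands in for the counterexample above):

```c
#include <assert.h>

#define TOY_NPAGES 10

/* Start at writeback_index and scan to the end of the file; then,
 * unless we started at index 0, wrap once and scan
 * [0, writeback_index - 1]. Returns the number of dirty pages
 * written. */
static int toy_cyclic(const int *dirty, int writeback_index)
{
	int cycled = (writeback_index == 0);
	int index = writeback_index;
	int end = TOY_NPAGES - 1;   /* stands in for end = -1 (whole file) */
	int written = 0;

retry:
	for (; index <= end; index++)
		if (dirty[index])
			written++;
	if (!cycled) {
		cycled = 1;
		index = 0;
		end = writeback_index - 1;  /* don't rescan the first range */
		goto retry;
	}
	return written;
}
```

For the counterexample (first 10 pages dirty, writeback_index == 5) this writes all 10 pages: 5 on the first pass and the first 5 after wrapping.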
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -868,9 +868,10 @@ int write_cache_pages(struct address_spa
int done = 0;
struct pagevec pvec;
int nr_pages;
+ pgoff_t writeback_index;
pgoff_t index;
pgoff_t end; /* Inclusive */
- int scanned = 0;
+ int cycled;
int range_whole = 0;
long nr_to_write = wbc->nr_to_write;
@@ -881,14 +882,19 @@ int write_cache_pages(struct address_spa
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
- index = mapping->writeback_index; /* Start from prev offset */
+ writeback_index = mapping->writeback_index; /* prev offset */
+ index = writeback_index;
+ if (index == 0)
+ cycled = 1;
+ else
+ cycled = 0;
end = -1;
} else {
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
- scanned = 1;
+ cycled = 1; /* ignore range_cyclic tests */
}
retry:
while (!done && (index <= end) &&
@@ -897,7 +903,6 @@ retry:
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
- scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -915,7 +920,8 @@ retry:
continue;
}
- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
+ /* Can't be range_cyclic: end == -1 there */
done = 1;
unlock_page(page);
continue;
@@ -952,13 +958,15 @@ retry:
pagevec_release(&pvec);
cond_resched();
}
- if (!scanned && !done) {
+ if (!cycled) {
/*
+ * range_cyclic:
* We hit the last page and there is more work to be done: wrap
* back to the start of the file
*/
- scanned = 1;
+ cycled = 1;
index = 0;
+ end = writeback_index - 1;
goto retry;
}
if (!wbc->no_nrwrite_index_update) {
--
* [patch 5/7] mm: write_cache_pages cleanups
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
` (3 preceding siblings ...)
2008-10-21 8:09 ` [patch 4/7] mm: write_cache_pages cyclic fix npiggin
@ 2008-10-21 8:09 ` npiggin
2008-10-21 8:09 ` [patch 6/7] mm: write_cache_pages optimise page cleaning npiggin
2008-10-21 8:09 ` [patch 7/7] mm: write_cache_pages terminate quickly npiggin
6 siblings, 0 replies; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
[-- Attachment #1: mm-wcp-cleanup.patch --]
[-- Type: text/plain, Size: 2258 bytes --]
Get rid of some complex expressions from flow control statements, add a
comment, remove some duplicate code.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -897,11 +897,14 @@ int write_cache_pages(struct address_spa
cycled = 1; /* ignore range_cyclic tests */
}
retry:
- while (!done && (index <= end) &&
- (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
- PAGECACHE_TAG_DIRTY,
- min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
- unsigned i;
+ while (!done && (index <= end)) {
+ int i;
+
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+ if (nr_pages == 0)
+ break;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -915,7 +918,16 @@ retry:
*/
lock_page(page);
+ /*
+ * Page truncated or invalidated. We can freely skip it
+ * then, even for data integrity operations: the page
+ * has disappeared concurrently, so there could be no
+ * real expectation of this data integrity operation
+ * even if there is now a new, dirty page at the same
+ * pagecache address.
+ */
if (unlikely(page->mapping != mapping)) {
+continue_unlock:
unlock_page(page);
continue;
}
@@ -923,18 +935,15 @@ retry:
if (page->index > end) {
/* Can't be range_cyclic: end == -1 there */
done = 1;
- unlock_page(page);
- continue;
+ goto continue_unlock;
}
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !clear_page_dirty_for_io(page)) {
- unlock_page(page);
- continue;
- }
+ !clear_page_dirty_for_io(page))
+ goto continue_unlock;
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
@@ -947,7 +956,8 @@ retry:
}
}
if (wbc->sync_mode == WB_SYNC_NONE) {
- if (--wbc->nr_to_write <= 0)
+ wbc->nr_to_write--;
+ if (wbc->nr_to_write <= 0)
done = 1;
}
if (wbc->nonblocking && bdi_write_congested(bdi)) {
--
* [patch 6/7] mm: write_cache_pages optimise page cleaning
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
` (4 preceding siblings ...)
2008-10-21 8:09 ` [patch 5/7] mm: write_cache_pages cleanups npiggin
@ 2008-10-21 8:09 ` npiggin
2008-10-21 8:09 ` [patch 7/7] mm: write_cache_pages terminate quickly npiggin
6 siblings, 0 replies; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
[-- Attachment #1: mm-wcp-writeback-clean-opt.patch --]
[-- Type: text/plain, Size: 1314 bytes --]
In write_cache_pages, if we get stuck behind another process that is cleaning
pages, we will be forced to wait for them to finish, then perform our own
writeout (if it was redirtied during the long wait), then wait for that.
If a page under writeout is still clean, we can skip waiting for it (if we're
part of a data integrity sync, we'll be waiting for all writeout pages
afterwards, so we'll still be waiting for the other guy's write that's cleaned
the page).
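The per-page decision after this patch can be sketched as a toy predicate (hypothetical user-space code, not the kernel function; TOY_SYNC_NONE stands in for WB_SYNC_NONE):

```c
#include <assert.h>

enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

/* Decide whether to write a page. A page that is already clean is
 * skipped outright; a page under writeback is waited on only for
 * data-integrity syncs and skipped for TOY_SYNC_NONE. *waited reports
 * whether we would have blocked in wait_on_page_writeback(). Returns 1
 * if we should proceed to write the page. */
static int toy_should_write(int page_dirty, int page_writeback,
			    enum toy_sync_mode mode, int *waited)
{
	*waited = 0;
	if (!page_dirty)
		return 0;       /* someone wrote it for us */
	if (page_writeback) {
		if (mode != TOY_SYNC_NONE)
			*waited = 1;    /* integrity sync: wait it out */
		else
			return 0;       /* opportunistic writeback: skip */
	}
	return 1;
}
```

The key change is the clean-page case: previously an integrity sync would wait on writeback even for a page that was no longer dirty.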
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -938,11 +938,20 @@ continue_unlock:
goto continue_unlock;
}
- if (wbc->sync_mode != WB_SYNC_NONE)
- wait_on_page_writeback(page);
+ if (!PageDirty(page)) {
+ /* someone wrote it for us */
+ goto continue_unlock;
+ }
+
+ if (PageWriteback(page)) {
+ if (wbc->sync_mode != WB_SYNC_NONE)
+ wait_on_page_writeback(page);
+ else
+ goto continue_unlock;
+ }
- if (PageWriteback(page) ||
- !clear_page_dirty_for_io(page))
+ BUG_ON(PageWriteback(page));
+ if (!clear_page_dirty_for_io(page))
goto continue_unlock;
ret = (*writepage)(page, wbc, data);
--
* [patch 7/7] mm: write_cache_pages terminate quickly
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
` (5 preceding siblings ...)
2008-10-21 8:09 ` [patch 6/7] mm: write_cache_pages optimise page cleaning npiggin
@ 2008-10-21 8:09 ` npiggin
6 siblings, 0 replies; 14+ messages in thread
From: npiggin @ 2008-10-21 8:09 UTC (permalink / raw)
To: akpm; +Cc: linux-fsdevel
[-- Attachment #1: mm-wcp-terminate-quickly.patch --]
[-- Type: text/plain, Size: 1526 bytes --]
Terminate the write_cache_pages loop upon encountering the first page past
end, without locking the page. A page's index cannot change while we hold a
reference on it (truncate, e.g. truncate_inode_pages_range, performs the same
check without the page lock).
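The reordering can be sketched as a toy scan (hypothetical user-space code, not the kernel loop; `locked` counts how many pages would have been locked):

```c
#include <assert.h>

/* Scan a pagevec's page indices. Since a referenced page's index
 * cannot change, the out-of-range check needs no page lock, so we can
 * break out before locking rather than lock, test, and unlock.
 * Returns the number of pages that would have been locked. */
static int toy_scan(const int *indices, int n, int end, int *done)
{
	int locked = 0;
	int i;

	*done = 0;
	for (i = 0; i < n; i++) {
		if (indices[i] > end) { /* safe without the page lock */
			*done = 1;
			break;
		}
		locked++;               /* lock_page() would happen here */
	}
	return locked;
}
```

For a pagevec with indices { 3, 4, 9, 10 } and end == 5, only the two in-range pages are ever locked and the scan terminates at index 9.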
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -910,12 +910,18 @@ retry:
struct page *page = pvec.pages[i];
/*
- * At this point we hold neither mapping->tree_lock nor
- * lock on the page itself: the page may be truncated or
- * invalidated (changing page->mapping to NULL), or even
- * swizzled back from swapper_space to tmpfs file
- * mapping
+ * At this point, the page may be truncated or
+ * invalidated (changing page->mapping to NULL), or
+ * even swizzled back from swapper_space to tmpfs file
+ * mapping. However, page->index will not change
+ * because we have a reference on the page.
*/
+ if (page->index > end) {
+ /* Can't be range_cyclic: end == -1 there */
+ done = 1;
+ break;
+ }
+
lock_page(page);
/*
@@ -932,12 +938,6 @@ continue_unlock:
continue;
}
- if (page->index > end) {
- /* Can't be range_cyclic: end == -1 there */
- done = 1;
- goto continue_unlock;
- }
-
if (!PageDirty(page)) {
/* someone wrote it for us */
goto continue_unlock;
--
* Re: [patch 1/7] mm: write_cache_pages writepage error fix
2008-10-21 8:09 ` [patch 1/7] mm: write_cache_pages writepage error fix npiggin
@ 2008-10-21 21:55 ` Andrew Morton
2008-10-21 22:01 ` Matthew Wilcox
2008-10-22 9:36 ` Nick Piggin
0 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2008-10-21 21:55 UTC (permalink / raw)
To: npiggin; +Cc: linux-fsdevel
On Tue, 21 Oct 2008 19:09:48 +1100
npiggin@suse.de wrote:
> In write_cache_pages, if ret signals a real error, but we still have some pages
> left in the pagevec, done would be set to 1, but the remaining pages would
> continue to be processed and ret will be overwritten in the process. It could
> easily be overwritten with success, and thus success will be returned even if
> there is an error. Thus the caller is told all writes succeeded, whereas in
> reality some did not.
>
> Fix this by bailing immediately if there is an error, and retaining the first
> error code.
>
> This is a data integrity bug.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c
> +++ linux-2.6/mm/page-writeback.c
> @@ -931,12 +931,16 @@ retry:
> }
>
> ret = (*writepage)(page, wbc, data);
> -
> - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> - unlock_page(page);
> - ret = 0;
> - }
> - if (ret || (--nr_to_write <= 0))
> + if (unlikely(ret)) {
> + if (ret == AOP_WRITEPAGE_ACTIVATE) {
> + unlock_page(page);
> + ret = 0;
> + } else {
> + done = 1;
> + break;
> + }
> + }
> + if (--nr_to_write <= 0)
> done = 1;
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
>
> --
I don't think I like the implementation much.
In all cases in that function where we set done=1, we want to bale out
right now at this page, rather than processing the remaining pages in
the pagevec.
So it would be better to implement a bit of code which releases the
pagevec pages and then breaks out of the loop. Then this bug
automatically gets fixed.
Like this:
--- a/mm/page-writeback.c~mm-write_cache_pages-writepage-error-fix
+++ a/mm/page-writeback.c
@@ -865,7 +865,6 @@ int write_cache_pages(struct address_spa
{
struct backing_dev_info *bdi = mapping->backing_dev_info;
int ret = 0;
- int done = 0;
struct pagevec pvec;
int nr_pages;
pgoff_t index;
@@ -891,10 +890,10 @@ int write_cache_pages(struct address_spa
scanned = 1;
}
retry:
- while (!done && (index <= end) &&
+ while ((index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
- PAGECACHE_TAG_DIRTY,
- min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+ PAGECACHE_TAG_DIRTY,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
scanned = 1;
@@ -903,9 +902,9 @@ retry:
/*
* At this point we hold neither mapping->tree_lock nor
- * lock on the page itself: the page may be truncated or
- * invalidated (changing page->mapping to NULL), or even
- * swizzled back from swapper_space to tmpfs file
+ * lock on the page itself: the page may be truncated
+ * or invalidated (changing page->mapping to NULL), or
+ * even swizzled back from swapper_space to tmpfs file
* mapping
*/
lock_page(page);
@@ -916,8 +915,8 @@ retry:
}
if (!wbc->range_cyclic && page->index > end) {
- done = 1;
unlock_page(page);
+ goto bale;
continue;
}
@@ -937,16 +936,16 @@ retry:
ret = 0;
}
if (ret || (--nr_to_write <= 0))
- done = 1;
+ goto bale;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
- done = 1;
+ goto bale;
}
}
pagevec_release(&pvec);
cond_resched();
}
- if (!scanned && !done) {
+ if (!scanned) {
/*
* We hit the last page and there is more work to be done: wrap
* back to the start of the file
@@ -961,7 +960,11 @@ retry:
wbc->nr_to_write = nr_to_write;
}
+out:
return ret;
+bale:
+ pagevec_release(&pvec);
+ goto out;
}
EXPORT_SYMBOL(write_cache_pages);
_
Which yields this:
int write_cache_pages(struct address_space *mapping,
struct writeback_control *wbc, writepage_t writepage,
void *data)
{
struct backing_dev_info *bdi = mapping->backing_dev_info;
int ret = 0;
struct pagevec pvec;
int nr_pages;
pgoff_t index;
pgoff_t end; /* Inclusive */
int scanned = 0;
int range_whole = 0;
long nr_to_write = wbc->nr_to_write;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
return 0;
}
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
} else {
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
scanned = 1;
}
retry:
while ((index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_DIRTY,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
/*
* At this point we hold neither mapping->tree_lock nor
* lock on the page itself: the page may be truncated
* or invalidated (changing page->mapping to NULL), or
* even swizzled back from swapper_space to tmpfs file
* mapping
*/
lock_page(page);
if (unlikely(page->mapping != mapping)) {
unlock_page(page);
continue;
}
if (!wbc->range_cyclic && page->index > end) {
unlock_page(page);
goto bale;
continue;
}
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
ret = (*writepage)(page, wbc, data);
if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
unlock_page(page);
ret = 0;
}
if (ret || (--nr_to_write <= 0))
goto bale;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
goto bale;
}
}
pagevec_release(&pvec);
cond_resched();
}
if (!scanned) {
/*
* We hit the last page and there is more work to be done: wrap
* back to the start of the file
*/
scanned = 1;
index = 0;
goto retry;
}
if (!wbc->no_nrwrite_index_update) {
if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
mapping->writeback_index = index;
wbc->nr_to_write = nr_to_write;
}
out:
return ret;
bale:
pagevec_release(&pvec);
goto out;
}
* Re: [patch 1/7] mm: write_cache_pages writepage error fix
2008-10-21 21:55 ` Andrew Morton
@ 2008-10-21 22:01 ` Matthew Wilcox
2008-10-21 22:05 ` Andrew Morton
2008-10-22 9:36 ` Nick Piggin
1 sibling, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2008-10-21 22:01 UTC (permalink / raw)
To: Andrew Morton; +Cc: npiggin, linux-fsdevel
On Tue, Oct 21, 2008 at 02:55:18PM -0700, Andrew Morton wrote:
> @@ -916,8 +915,8 @@ retry:
^^^^^^
Hey, look everyone, our tools are completely shit.
> }
>
> if (!wbc->range_cyclic && page->index > end) {
> - done = 1;
> unlock_page(page);
> + goto bale;
I think you mean "bail". Unless you mean making hay, or the 1913
Webster definition:
1. Misery; calamity; misfortune; sorrow.
which would be kind of appropriate, but I think you mean bail out.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [patch 1/7] mm: write_cache_pages writepage error fix
2008-10-21 22:01 ` Matthew Wilcox
@ 2008-10-21 22:05 ` Andrew Morton
0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2008-10-21 22:05 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: npiggin, linux-fsdevel
On Tue, 21 Oct 2008 16:01:03 -0600
Matthew Wilcox <matthew@wil.cx> wrote:
> On Tue, Oct 21, 2008 at 02:55:18PM -0700, Andrew Morton wrote:
> > @@ -916,8 +915,8 @@ retry:
> ^^^^^^
> Hey, look everyone, our tools are completely shit.
yeah, that drives me nuts.
> > }
> >
> > if (!wbc->range_cyclic && page->index > end) {
> > - done = 1;
> > unlock_page(page);
> > + goto bale;
>
> I think you mean "bail". Unless you mean making hay, or the 1913
> Webster definition:
> 1. Misery; calamity; misfortune; sorrow.
> which would be kind of appropriate, but I think you mean bail out.
Then "bail" means "making hey".
* Re: [patch 1/7] mm: write_cache_pages writepage error fix
2008-10-21 21:55 ` Andrew Morton
2008-10-21 22:01 ` Matthew Wilcox
@ 2008-10-22 9:36 ` Nick Piggin
2008-10-22 16:37 ` Andrew Morton
1 sibling, 1 reply; 14+ messages in thread
From: Nick Piggin @ 2008-10-22 9:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel
On Tue, Oct 21, 2008 at 02:55:18PM -0700, Andrew Morton wrote:
> On Tue, 21 Oct 2008 19:09:48 +1100
> npiggin@suse.de wrote:
>
> > In write_cache_pages, if ret signals a real error, but we still have some pages
> > left in the pagevec, done would be set to 1, but the remaining pages would
> > continue to be processed and ret will be overwritten in the process. It could
> > easily be overwritten with success, and thus success will be returned even if
> > there is an error. Thus the caller is told all writes succeeded, whereas in
> > reality some did not.
> >
> > Fix this by bailing immediately if there is an error, and retaining the first
> > error code.
> >
> > This is a data integrity bug.
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > Index: linux-2.6/mm/page-writeback.c
> > ===================================================================
> > --- linux-2.6.orig/mm/page-writeback.c
> > +++ linux-2.6/mm/page-writeback.c
> > @@ -931,12 +931,16 @@ retry:
> > }
> >
> > ret = (*writepage)(page, wbc, data);
> > -
> > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> > - unlock_page(page);
> > - ret = 0;
> > - }
> > - if (ret || (--nr_to_write <= 0))
> > + if (unlikely(ret)) {
> > + if (ret == AOP_WRITEPAGE_ACTIVATE) {
> > + unlock_page(page);
> > + ret = 0;
> > + } else {
> > + done = 1;
> > + break;
> > + }
> > + }
> > + if (--nr_to_write <= 0)
> > done = 1;
> > if (wbc->nonblocking && bdi_write_congested(bdi)) {
> > wbc->encountered_congestion = 1;
> >
> > --
>
> I don't think I like the implementation much.
>
> In all cases in that function where we set done=1, we want to bale out
> right now at this page, rather than processing the remaining pages in
> the pagevec.
>
> So it would be better to implement a bit of code which releases the
> pagevec pages and then breaks out of the loop. Then this bug
> automatically gets fixed.
I had almost exactly that at some point, then I rethought it because it
was a bigger behaviour change than the bugfix-only.
I thought the !done thing must be just for batching and to process the
pages already looked up in the pagevec.
I'll need to backport at least some of these to stable -- any chance of
making the change you suggest toward the end of the patchset? Let me
know if you want a patch or will do one yourself.
>
> Like this:
>
> --- a/mm/page-writeback.c~mm-write_cache_pages-writepage-error-fix
> +++ a/mm/page-writeback.c
> @@ -865,7 +865,6 @@ int write_cache_pages(struct address_spa
> {
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> int ret = 0;
> - int done = 0;
> struct pagevec pvec;
> int nr_pages;
> pgoff_t index;
> @@ -891,10 +890,10 @@ int write_cache_pages(struct address_spa
> scanned = 1;
> }
> retry:
> - while (!done && (index <= end) &&
> + while ((index <= end) &&
> (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> - PAGECACHE_TAG_DIRTY,
> - min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
> + PAGECACHE_TAG_DIRTY,
> + min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
> unsigned i;
>
> scanned = 1;
> @@ -903,9 +902,9 @@ retry:
>
> /*
> * At this point we hold neither mapping->tree_lock nor
> - * lock on the page itself: the page may be truncated or
> - * invalidated (changing page->mapping to NULL), or even
> - * swizzled back from swapper_space to tmpfs file
> + * lock on the page itself: the page may be truncated
> + * or invalidated (changing page->mapping to NULL), or
> + * even swizzled back from swapper_space to tmpfs file
> * mapping
> */
> lock_page(page);
> @@ -916,8 +915,8 @@ retry:
> }
>
> if (!wbc->range_cyclic && page->index > end) {
> - done = 1;
> unlock_page(page);
> + goto bale;
> continue;
> }
>
> @@ -937,16 +936,16 @@ retry:
> ret = 0;
> }
> if (ret || (--nr_to_write <= 0))
> - done = 1;
> + goto bale;
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
> - done = 1;
> + goto bale;
> }
> }
> pagevec_release(&pvec);
> cond_resched();
> }
> - if (!scanned && !done) {
> + if (!scanned) {
> /*
> * We hit the last page and there is more work to be done: wrap
> * back to the start of the file
> @@ -961,7 +960,11 @@ retry:
> wbc->nr_to_write = nr_to_write;
> }
>
> +out:
> return ret;
> +bale:
> + pagevec_release(&pvec);
> + goto out;
> }
> EXPORT_SYMBOL(write_cache_pages);
>
> _
>
>
> Which yields this:
>
> int write_cache_pages(struct address_space *mapping,
> struct writeback_control *wbc, writepage_t writepage,
> void *data)
> {
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> int ret = 0;
> struct pagevec pvec;
> int nr_pages;
> pgoff_t index;
> pgoff_t end; /* Inclusive */
> int scanned = 0;
> int range_whole = 0;
> long nr_to_write = wbc->nr_to_write;
>
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
> return 0;
> }
>
> pagevec_init(&pvec, 0);
> if (wbc->range_cyclic) {
> index = mapping->writeback_index; /* Start from prev offset */
> end = -1;
> } else {
> index = wbc->range_start >> PAGE_CACHE_SHIFT;
> end = wbc->range_end >> PAGE_CACHE_SHIFT;
> if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
> range_whole = 1;
> scanned = 1;
> }
> retry:
> while ((index <= end) &&
> (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
> PAGECACHE_TAG_DIRTY,
> min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
> unsigned i;
>
> scanned = 1;
> for (i = 0; i < nr_pages; i++) {
> struct page *page = pvec.pages[i];
>
> /*
> * At this point we hold neither mapping->tree_lock nor
> * lock on the page itself: the page may be truncated
> * or invalidated (changing page->mapping to NULL), or
> * even swizzled back from swapper_space to tmpfs file
> * mapping
> */
> lock_page(page);
>
> if (unlikely(page->mapping != mapping)) {
> unlock_page(page);
> continue;
> }
>
> if (!wbc->range_cyclic && page->index > end) {
> unlock_page(page);
> goto bale;
> continue;
> }
>
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
>
> if (PageWriteback(page) ||
> !clear_page_dirty_for_io(page)) {
> unlock_page(page);
> continue;
> }
>
> ret = (*writepage)(page, wbc, data);
>
> if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
> unlock_page(page);
> ret = 0;
> }
> if (ret || (--nr_to_write <= 0))
> goto bale;
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
> goto bale;
> }
> }
> pagevec_release(&pvec);
> cond_resched();
> }
> if (!scanned) {
> /*
> * We hit the last page and there is more work to be done: wrap
> * back to the start of the file
> */
> scanned = 1;
> index = 0;
> goto retry;
> }
> if (!wbc->no_nrwrite_index_update) {
> if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
> mapping->writeback_index = index;
> wbc->nr_to_write = nr_to_write;
> }
>
> out:
> return ret;
> bale:
> pagevec_release(&pvec);
> goto out;
> }
* Re: [patch 1/7] mm: write_cache_pages writepage error fix
2008-10-22 9:36 ` Nick Piggin
@ 2008-10-22 16:37 ` Andrew Morton
2008-10-28 13:18 ` Nick Piggin
0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2008-10-22 16:37 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel
On Wed, 22 Oct 2008 11:36:32 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > > wbc->encountered_congestion = 1;
> > >
> > > --
> >
> > I don't think I like the implementation much.
> >
> > In all cases in that function where we set done=1, we want to bale out
> > right now at this page, rather than processing the remaining pages in
> > the pagevec.
> >
> > So it would be better to implement a bit of code which releases the
> > pagevec pages and then breaks out of the loop. Then this bug
> > automatically gets fixed.
>
> I had almost exactly that at some point, then I rethought it because it
> was a bigger behaviour change than the bugfix-only.
>
> I thought the !done thing must be just for batching and to process the
> pages already looked up in the pagevec.
>
> I'll need to backport at least some of these to stable -- any chance of
> making the change you suggest toward the end of the patchset?
That works.
> Let me
> know if you want a patch or will do one yourself.
I couldn't hope to match your level of testing ;)
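Andrew's suggestion — stop at the first writepage error, keep that first error code, and release the rest of the batch rather than processing it — can be sketched in plain C. Here write_one() and write_batch() are hypothetical stand-ins for ->writepage() and the pagevec loop, with ints standing in for pages; this is not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for ->writepage(); "page" 3 fails with -EIO. */
static int write_one(int page)
{
	return (page == 3) ? -5 /* -EIO */ : 0;
}

/*
 * Write a batch; return 0 or the FIRST error encountered.  On error
 * we break out immediately, so later successes cannot overwrite the
 * error code -- in the kernel, pagevec_release() would then run on
 * the unprocessed remainder.
 */
static int write_batch(const int *pages, size_t n, size_t *written)
{
	int ret = 0;

	*written = 0;
	for (size_t i = 0; i < n; i++) {
		ret = write_one(pages[i]);
		if (ret)
			break;	/* bail right now at this page */
		(*written)++;
	}
	return ret;
}
```

With this shape, the done=1-and-keep-going bug cannot arise: there is no path on which the remaining pages of the batch are pushed through writepage after an error.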
* Re: [patch 1/7] mm: write_cache_pages writepage error fix
2008-10-22 16:37 ` Andrew Morton
@ 2008-10-28 13:18 ` Nick Piggin
0 siblings, 0 replies; 14+ messages in thread
From: Nick Piggin @ 2008-10-28 13:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel
On Wed, Oct 22, 2008 at 09:37:15AM -0700, Andrew Morton wrote:
> On Wed, 22 Oct 2008 11:36:32 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > > > wbc->encountered_congestion = 1;
> > > >
> > > > --
> > >
> > > I don't think I like the implementation much.
> > >
> > > In all cases in that function where we set done=1, we want to bale out
> > > right now at this page, rather than processing the remaining pages in
> > > the pagevec.
> > >
> > > So it would be better to implement a bit of code which releases the
> > > pagevec pages and then breaks out of the loop. Then this bug
> > > automatically gets fixed.
> >
> > I had almost exactly that at some point, then I rethought it because it
> > was a bigger behaviour change than the bugfix-only.
> >
> > I thought the !done thing must be just for batching and to process the
> > pages already looked up in the pagevec.
> >
> > I'll need to backport at least some of these to stable -- any chance of
> > making the change you suggest toward the end of the patchset?
>
> That works.
>
> > Let me
> > know if you want a patch or will do one yourself.
>
> I couldn't hope to match your level of testing ;)
OK, looking at this closer, I think the aversion to breaking out of the
loop early is because it becomes more difficult to keep track of your
end index, for cyclic writeback. It's no longer the index that comes out
of the radix tree lookup.
And whether we go for the current page index or the previous one depends
on the reason we have to bail out.
So I don't know if it will fall out quite so nicely as you had it. But
I missed this too: I didn't handle it properly either in the early bailout
from writepage error case. So I'll have to do that.
The problem that would cause in normal writeout operations is that you'd
bail out when nr_to_write reaches 0, but you would most likely set the
writeback_index to skip a few dirty pages that were in the pagevec but
not yet written out. For error cases it's not so bad, as it will be an
abnormal condition, but AFAICS it is nowhere documented that range_cyclic
cannot be used for data integrity, so it is better to be consistent and
make it all work as expected.
Anyway, I just noticed this as I was about to merge your changes and
resend, so I'll need a little longer to fix and test...
end of thread [~2008-10-28 13:18 UTC | newest]
Thread overview: 14+ messages
2008-10-21 8:09 [patch 0/7] writeback data integrity and other fixes npiggin
2008-10-21 8:09 ` [patch 1/7] mm: write_cache_pages writepage error fix npiggin
2008-10-21 21:55 ` Andrew Morton
2008-10-21 22:01 ` Matthew Wilcox
2008-10-21 22:05 ` Andrew Morton
2008-10-22 9:36 ` Nick Piggin
2008-10-22 16:37 ` Andrew Morton
2008-10-28 13:18 ` Nick Piggin
2008-10-21 8:09 ` [patch 2/7] mm: write_cache_pages integrity fix npiggin
2008-10-21 8:09 ` [patch 3/7] mm: do_sync_mapping_range " npiggin
2008-10-21 8:09 ` [patch 4/7] mm: write_cache_pages cyclic fix npiggin
2008-10-21 8:09 ` [patch 5/7] mm: write_cache_pages cleanups npiggin
2008-10-21 8:09 ` [patch 6/7] mm: write_cache_pages optimise page cleaning npiggin
2008-10-21 8:09 ` [patch 7/7] mm: write_cache_pages terminate quickly npiggin