* [patch 1/8] mm: write_cache_pages cyclic fix
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
@ 2008-10-09 15:50 ` npiggin
2008-10-09 15:50 ` [patch 2/8] mm: write_cache_pages AOP_WRITEPAGE_ACTIVATE fix npiggin
` (6 subsequent siblings)
7 siblings, 0 replies; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-cyclic-fix.patch --]
[-- Type: text/plain, Size: 2737 bytes --]
In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback
has circled through zero, thus we should not circle again. However it gets set
to 1 after the first successful pagevec lookup. This leads to cases where not
enough data gets written.
Counterexample: file with first 10 pages dirty, writeback_index == 5,
nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set
to 1, after writing those out, we will not cycle back to get the first 5.
Rework this logic, now we'll always cycle unless we started off from index 0.
When cycling, only write out as far as 1 page before the start page from the
first cycle (so we don't write parts of the file twice).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -872,9 +872,10 @@ int write_cache_pages(struct address_spa
int done = 0;
struct pagevec pvec;
int nr_pages;
+ pgoff_t writeback_index;
pgoff_t index;
pgoff_t end; /* Inclusive */
- int scanned = 0;
+ int cycled;
int range_whole = 0;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
@@ -884,14 +885,19 @@ int write_cache_pages(struct address_spa
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
- index = mapping->writeback_index; /* Start from prev offset */
+ writeback_index = mapping->writeback_index; /* prev offset */
+ index = writeback_index;
+ if (index == 0)
+ cycled = 1;
+ else
+ cycled = 0;
end = -1;
} else {
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
- scanned = 1;
+ cycled = 1; /* ignore range_cyclic tests */
}
retry:
while (!done && (index <= end) &&
@@ -900,7 +906,6 @@ retry:
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
- scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -918,7 +923,8 @@ retry:
continue;
}
- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
+ /* Can't be range_cyclic: end == -1 there */
done = 1;
unlock_page(page);
continue;
@@ -949,13 +955,15 @@ retry:
pagevec_release(&pvec);
cond_resched();
}
- if (!scanned && !done) {
+ if (!cycled) {
/*
+ * range_cyclic:
* We hit the last page and there is more work to be done: wrap
* back to the start of the file
*/
- scanned = 1;
+ cycled = 1;
index = 0;
+ end = writeback_index - 1;
goto retry;
}
if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
--
^ permalink raw reply [flat|nested] 36+ messages in thread* [patch 2/8] mm: write_cache_pages AOP_WRITEPAGE_ACTIVATE fix
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
2008-10-09 15:50 ` [patch 1/8] mm: write_cache_pages cyclic fix npiggin
@ 2008-10-09 15:50 ` npiggin
2008-10-10 16:00 ` Miklos Szeredi
2008-10-09 15:50 ` [patch 3/8] mm: write_cache_pages writepage error fix npiggin
` (5 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-writepage-activate-fix.patch --]
[-- Type: text/plain, Size: 1012 bytes --]
In write_cache_pages, if AOP_WRITEPAGE_ACTIVATE is returned, the filesystem is
calling on us to drop the page lock and retry, however the existing code would
just skip that page regardless of whether or not it was a data interity
operation. Change this to always retry such a result.
This is a data interity bug.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -916,6 +916,7 @@ retry:
* swizzled back from swapper_space to tmpfs file
* mapping
*/
+again:
lock_page(page);
if (unlikely(page->mapping != mapping)) {
@@ -940,10 +941,11 @@ retry:
}
ret = (*writepage)(page, wbc, data);
-
if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
+ /* Must retry the write */
unlock_page(page);
ret = 0;
+ goto again;
}
if (ret || (--(wbc->nr_to_write) <= 0))
done = 1;
--
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 2/8] mm: write_cache_pages AOP_WRITEPAGE_ACTIVATE fix
2008-10-09 15:50 ` [patch 2/8] mm: write_cache_pages AOP_WRITEPAGE_ACTIVATE fix npiggin
@ 2008-10-10 16:00 ` Miklos Szeredi
2008-10-10 18:29 ` Hugh Dickins
0 siblings, 1 reply; 36+ messages in thread
From: Miklos Szeredi @ 2008-10-10 16:00 UTC (permalink / raw)
To: npiggin; +Cc: akpm, mpatocka, linux-mm, linux-fsdevel
On Fri, 10 Oct 2008, npiggin@suse.de wrote:
> In write_cache_pages, if AOP_WRITEPAGE_ACTIVATE is returned, the
> filesystem is calling on us to drop the page lock and retry,
Are you sure? It's not what fs.h says. I think this return value is
related to reclaim (and only used by shmfs), and retrying is not the
right thing in that case.
Miklos
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 2/8] mm: write_cache_pages AOP_WRITEPAGE_ACTIVATE fix
2008-10-10 16:00 ` Miklos Szeredi
@ 2008-10-10 18:29 ` Hugh Dickins
2008-10-11 4:05 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Hugh Dickins @ 2008-10-10 18:29 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: npiggin, akpm, mpatocka, linux-mm, linux-fsdevel
On Fri, 10 Oct 2008, Miklos Szeredi wrote:
> On Fri, 10 Oct 2008, npiggin@suse.de wrote:
> > In write_cache_pages, if AOP_WRITEPAGE_ACTIVATE is returned, the
> > filesystem is calling on us to drop the page lock and retry,
>
> Are you sure? It's not what fs.h says. I think this return value is
> related to reclaim (and only used by shmfs), and retrying is not the
> right thing in that case.
Only used by shmfs nowadays, yes; it means go away for now,
don't keep on spamming me with this, but try it again later on.
Though I didn't invent it, it's very much my fault that it
still exists: I've had a patch to remove it (setting PageActive
instead, ending that horrid "but in this case, return with the
page still locked") for about a year, but still hadn't got around
to verifying that it really does what's intended, before the more
interesting split-lru changes reached -mm, and I thought it polite
to hold off for now (though in fact there's almost no conflict).
I'll get there...
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 2/8] mm: write_cache_pages AOP_WRITEPAGE_ACTIVATE fix
2008-10-10 18:29 ` Hugh Dickins
@ 2008-10-11 4:05 ` Nick Piggin
0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2008-10-11 4:05 UTC (permalink / raw)
To: Hugh Dickins
Cc: Miklos Szeredi, npiggin, akpm, mpatocka, linux-mm, linux-fsdevel
On Saturday 11 October 2008 05:29, Hugh Dickins wrote:
> On Fri, 10 Oct 2008, Miklos Szeredi wrote:
> > On Fri, 10 Oct 2008, npiggin@suse.de wrote:
> > > In write_cache_pages, if AOP_WRITEPAGE_ACTIVATE is returned, the
> > > filesystem is calling on us to drop the page lock and retry,
> >
> > Are you sure? It's not what fs.h says. I think this return value is
> > related to reclaim (and only used by shmfs), and retrying is not the
> > right thing in that case.
Oh, you're absolutely right about that. Sorry, I confused it with
another AOP flag :( Thanks...
> Only used by shmfs nowadays, yes; it means go away for now,
> don't keep on spamming me with this, but try it again later on.
>
> Though I didn't invent it, it's very much my fault that it
> still exists: I've had a patch to remove it (setting PageActive
> instead, ending that horrid "but in this case, return with the
> page still locked") for about a year, but still hadn't got around
> to verifying that it really does what's intended, before the more
> interesting split-lru changes reached -mm, and I thought it polite
> to hold off for now (though in fact there's almost no conflict).
> I'll get there...
No big deal.
^ permalink raw reply [flat|nested] 36+ messages in thread
* [patch 3/8] mm: write_cache_pages writepage error fix
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
2008-10-09 15:50 ` [patch 1/8] mm: write_cache_pages cyclic fix npiggin
2008-10-09 15:50 ` [patch 2/8] mm: write_cache_pages AOP_WRITEPAGE_ACTIVATE fix npiggin
@ 2008-10-09 15:50 ` npiggin
2008-10-09 15:50 ` [patch 4/8] mm: write_cache_pages type overflow fix npiggin
` (4 subsequent siblings)
7 siblings, 0 replies; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-writepage-error-fix.patch --]
[-- Type: text/plain, Size: 1365 bytes --]
In write_cache_pages, if ret signals a real error, but we still have some pages
left in the pagevec, done would be set to 1, but the remaining pages would
continue to be processed and ret will be overwritten in the process. It could
easily be overwritten with success, and thus success will be returned even if
there is an error. Thus the caller is told all writes succeeded, wheras in
reality some did not.
Fix this by bailing immediately if there is an error, and retaining the first
error code.
This is a data interity bug.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -941,13 +941,17 @@ again:
}
ret = (*writepage)(page, wbc, data);
- if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
- /* Must retry the write */
- unlock_page(page);
- ret = 0;
- goto again;
+ if (unlikely(ret)) {
+ if (ret == AOP_WRITEPAGE_ACTIVATE) {
+ /* Must retry the write */
+ unlock_page(page);
+ ret = 0;
+ goto again;
+ }
+ done = 1;
+ break;
}
- if (ret || (--(wbc->nr_to_write) <= 0))
+ if (--(wbc->nr_to_write) <= 0)
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
--
^ permalink raw reply [flat|nested] 36+ messages in thread* [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
` (2 preceding siblings ...)
2008-10-09 15:50 ` [patch 3/8] mm: write_cache_pages writepage error fix npiggin
@ 2008-10-09 15:50 ` npiggin
2008-10-09 8:23 ` Christoph Hellwig
2008-10-09 15:50 ` [patch 5/8] mm: write_cache_pages integrity fix npiggin
` (3 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-type-overflow-fix.patch --]
[-- Type: text/plain, Size: 776 bytes --]
In the range_cont case, range_start is set to index << PAGE_CACHE_SHIFT, but
index is a pgoff_t and range_start is loff_t, so we can get truncation of the
value on 32-bit platforms. Fix this by adding the standard loff_t cast.
This is a data interity bug (depending on how range_cont is used).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -976,7 +976,7 @@ again:
mapping->writeback_index = index;
if (wbc->range_cont)
- wbc->range_start = index << PAGE_CACHE_SHIFT;
+ wbc->range_start = (loff_t)index << PAGE_CACHE_SHIFT;
return ret;
}
EXPORT_SYMBOL(write_cache_pages);
--
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-09 15:50 ` [patch 4/8] mm: write_cache_pages type overflow fix npiggin
@ 2008-10-09 8:23 ` Christoph Hellwig
2008-10-09 8:33 ` Nick Piggin
2008-10-10 13:10 ` Theodore Tso
0 siblings, 2 replies; 36+ messages in thread
From: Christoph Hellwig @ 2008-10-09 8:23 UTC (permalink / raw)
To: npiggin; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Fri, Oct 10, 2008 at 02:50:43AM +1100, npiggin@suse.de wrote:
> In the range_cont case, range_start is set to index << PAGE_CACHE_SHIFT, but
> index is a pgoff_t and range_start is loff_t, so we can get truncation of the
> value on 32-bit platforms. Fix this by adding the standard loff_t cast.
>
> This is a data interity bug (depending on how range_cont is used).
Aneesh has a patch to kill the range_cont flag, which is queued up for
2.6.28.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-09 8:23 ` Christoph Hellwig
@ 2008-10-09 8:33 ` Nick Piggin
2008-10-10 13:10 ` Theodore Tso
1 sibling, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2008-10-09 8:33 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, Oct 09, 2008 at 04:23:36AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 10, 2008 at 02:50:43AM +1100, npiggin@suse.de wrote:
> > In the range_cont case, range_start is set to index << PAGE_CACHE_SHIFT, but
> > index is a pgoff_t and range_start is loff_t, so we can get truncation of the
> > value on 32-bit platforms. Fix this by adding the standard loff_t cast.
> >
> > This is a data interity bug (depending on how range_cont is used).
>
> Aneesh has a patch to kill the range_cont flag, which is queued up for
> 2.6.28.
OK, great. I guess actually this patch out of all of them could go into
2.6.27 and previous stable kernels because it is obviously correct and
could not really cause a regression.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-09 8:23 ` Christoph Hellwig
2008-10-09 8:33 ` Nick Piggin
@ 2008-10-10 13:10 ` Theodore Tso
2008-10-10 13:13 ` Christoph Hellwig
1 sibling, 1 reply; 36+ messages in thread
From: Theodore Tso @ 2008-10-10 13:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: npiggin, Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, Oct 09, 2008 at 04:23:36AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 10, 2008 at 02:50:43AM +1100, npiggin@suse.de wrote:
> > In the range_cont case, range_start is set to index << PAGE_CACHE_SHIFT, but
> > index is a pgoff_t and range_start is loff_t, so we can get truncation of the
> > value on 32-bit platforms. Fix this by adding the standard loff_t cast.
> >
> > This is a data interity bug (depending on how range_cont is used).
>
> Aneesh has a patch to kill the range_cont flag, which is queued up for
> 2.6.28.
Which tree is this queued up in? It's not in ext4 or the mm tree...
- Ted
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 13:10 ` Theodore Tso
@ 2008-10-10 13:13 ` Christoph Hellwig
2008-10-10 13:37 ` Theodore Tso
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2008-10-10 13:13 UTC (permalink / raw)
To: Theodore Tso
Cc: Christoph Hellwig, npiggin, Andrew Morton, Mikulas Patocka,
linux-mm, linux-fsdevel
On Fri, Oct 10, 2008 at 09:10:30AM -0400, Theodore Tso wrote:
> > Aneesh has a patch to kill the range_cont flag, which is queued up for
> > 2.6.28.
>
> Which tree is this queued up in? It's not in ext4 or the mm tree...
Oh, it' not queued up yet? It's part of the patch that switches ext4
to it's own copy of write_cache_pages to fix the buffer write issues.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 13:13 ` Christoph Hellwig
@ 2008-10-10 13:37 ` Theodore Tso
2008-10-10 13:48 ` Steven Whitehouse
2008-10-10 13:56 ` Chris Mason
0 siblings, 2 replies; 36+ messages in thread
From: Theodore Tso @ 2008-10-10 13:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: npiggin, Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Fri, Oct 10, 2008 at 09:13:25AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 10, 2008 at 09:10:30AM -0400, Theodore Tso wrote:
> > > Aneesh has a patch to kill the range_cont flag, which is queued up for
> > > 2.6.28.
> >
> > Which tree is this queued up in? It's not in ext4 or the mm tree...
>
> Oh, it' not queued up yet? It's part of the patch that switches ext4
> to it's own copy of write_cache_pages to fix the buffer write issues.
>
I held off queing it up since the version Aneesh did created ext4's
own copy of write_cache_pages, and given that Nick has a bunch of
fixes and improvements for write_cache_pages, it confirmed my fears
that queueing a patch which copied ~100 lines of code into ext4 was
probably not the best way to go.
That being said, I would dearly love to see the 10x improvement in
streaming write speed for ext4 make the 2.6.28 merge window, since
people will start benchmarking ext4 much more seriously in the near
future. So, I was tempted to queue Aneesh's original version of the
fix --- but if we're going to get a version of the change which gets
the required changes into to the generic write_cache_pages(), I'm not
sure we have the time to coordinate ext4's and other filesystems'
needs into Nick's RFC patchset.
It wouldn't be hard to create a version of Aneesh's patch which makes
the change to the original write_cache_pages(), instead of creating
our own copy of ext4_write_cache_pages(), but it hasn't been done yet,
so it hasn't been queued. We can probably do *that* pretty quickly,
and send it out for review, but it would almost certainly conflict
with Nick's patchset --- and I had assumed that Nick might be
interested in pushing this during this merge window given his concern
about data integrity correctness issues.
So I think the main issue here is coordinating planned changes to core
code....
- Ted
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 13:37 ` Theodore Tso
@ 2008-10-10 13:48 ` Steven Whitehouse
2008-10-10 14:05 ` Theodore Tso
2008-10-10 13:56 ` Chris Mason
1 sibling, 1 reply; 36+ messages in thread
From: Steven Whitehouse @ 2008-10-10 13:48 UTC (permalink / raw)
To: Theodore Tso
Cc: Christoph Hellwig, npiggin, Andrew Morton, Mikulas Patocka,
linux-mm, linux-fsdevel
Hi,
On Fri, 2008-10-10 at 09:37 -0400, Theodore Tso wrote:
> On Fri, Oct 10, 2008 at 09:13:25AM -0400, Christoph Hellwig wrote:
> > On Fri, Oct 10, 2008 at 09:10:30AM -0400, Theodore Tso wrote:
> > > > Aneesh has a patch to kill the range_cont flag, which is queued up for
> > > > 2.6.28.
> > >
> > > Which tree is this queued up in? It's not in ext4 or the mm tree...
> >
> > Oh, it' not queued up yet? It's part of the patch that switches ext4
> > to it's own copy of write_cache_pages to fix the buffer write issues.
> >
>
> I held off queing it up since the version Aneesh did created ext4's
> own copy of write_cache_pages, and given that Nick has a bunch of
> fixes and improvements for write_cache_pages, it confirmed my fears
> that queueing a patch which copied ~100 lines of code into ext4 was
> probably not the best way to go.
>
I've not looked at ext4's copy of write_cache_pages, but there is also a
copy in GFS2. Its used only for journaled data, and it is pretty much a
direct copy of write_cache_pages except that its split into two so that
a transaction can be opened in the "middle".
Perhaps it would be possible to make some changes so that we can both
share the "core" version of write_cache_pages. My plan was to wait until
Nick's patches have made it to Linus, and then to look into what might
be done,
Steve.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 13:48 ` Steven Whitehouse
@ 2008-10-10 14:05 ` Theodore Tso
2008-10-10 14:08 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Theodore Tso @ 2008-10-10 14:05 UTC (permalink / raw)
To: Steven Whitehouse
Cc: Christoph Hellwig, npiggin, Andrew Morton, Mikulas Patocka,
linux-mm, linux-fsdevel, linux-ext4
On Fri, Oct 10, 2008 at 02:48:02PM +0100, Steven Whitehouse wrote:
> I've not looked at ext4's copy of write_cache_pages, but there is also a
> copy in GFS2. Its used only for journaled data, and it is pretty much a
> direct copy of write_cache_pages except that its split into two so that
> a transaction can be opened in the "middle".
>
> Perhaps it would be possible to make some changes so that we can both
> share the "core" version of write_cache_pages. My plan was to wait until
> Nick's patches have made it to Linus, and then to look into what might
> be done,
To be clear; ext4 doesn't have its own copy of write_cache_pages (at
least not yet); there is a patch that creates our own copy, mainly to
disable updates to writeback_index, range_start, and nr_to_write in
the wbc structure.
Christoph has suggested a patch which modifies write_cache_pages() so
that a filesystem could set a flag which disables those updates,
instead of just making a copy of write_cache_pages. (Maybe eventually
we would get rid of those updates unconditionally; it's not clear to
me though that this makes sense for all filesystems.)
So we have three choices as far as getting the 10x for (large)
streaming write performance into 2.6.28:
1) Aneesh's first patch, which called write_cache_pages()
and then undid the effects of the updates to the relevant wbc fields.
(http://lkml.org/lkml/2008/10/6/61)
2) Aneesh's second version of the patch, which copied
write_cache_pages() into ext4_write_cache_pages() and then removed the
updates; resulting in a large patch, but one that might be easier to
understand, although harder to maintain in the long term.
(http://lkml.org/lkml/2008/10/7/52)
3) A version which (optionally via a flag in the wbc structure)
instructs write_cache_pages() to not pursue those updates. This has
not been written yet.
For why we need to do this, see Aneesh's explanation here:
http://lkml.org/lkml/2008/10/7/78
If we don't think the Nick's patches are going to be stable enough for
merging in time for 2.6.28, it's possible we could pursue (1) or (2),
and if there's -mm concurrence, even (3). (1) might be the best if
the goal is to wait for Nick's patches to hit mainline first, and then
we can try to sort and merge our per-filesystems unique hacks (or
copies of write_cache_pages or whatever) back to the upstream version.
All of this being said, I'll confess that I have *not* had time to
look deeply at Nick's full patchset yet. Which has been another
reason why I haven't queued up any of Aneesh's patches in this area
yet; all of this hit just right before the merge window opened up, and
I've been insanely busy as of late.
- Ted
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 14:05 ` Theodore Tso
@ 2008-10-10 14:08 ` Christoph Hellwig
2008-10-10 15:54 ` Aneesh Kumar K.V
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2008-10-10 14:08 UTC (permalink / raw)
To: Theodore Tso
Cc: Steven Whitehouse, Christoph Hellwig, npiggin, Andrew Morton,
Mikulas Patocka, linux-mm, linux-fsdevel, linux-ext4
On Fri, Oct 10, 2008 at 10:05:35AM -0400, Theodore Tso wrote:
> 3) A version which (optionally via a flag in the wbc structure)
> instructs write_cache_pages() to not pursue those updates. This has
> not been written yet.
This one sounds best to me (although we'd have to actualy see it..)
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 14:08 ` Christoph Hellwig
@ 2008-10-10 15:54 ` Aneesh Kumar K.V
2008-10-10 15:59 ` Chris Mason
` (2 more replies)
0 siblings, 3 replies; 36+ messages in thread
From: Aneesh Kumar K.V @ 2008-10-10 15:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Theodore Tso, Steven Whitehouse, npiggin, Andrew Morton,
Mikulas Patocka, linux-mm, linux-fsdevel, linux-ext4
On Fri, Oct 10, 2008 at 10:08:29AM -0400, Christoph Hellwig wrote:
> On Fri, Oct 10, 2008 at 10:05:35AM -0400, Theodore Tso wrote:
> > 3) A version which (optionally via a flag in the wbc structure)
> > instructs write_cache_pages() to not pursue those updates. This has
> > not been written yet.
>
> This one sounds best to me (although we'd have to actualy see it..)
something like the below ?
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index bd91987..7599af2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -63,6 +63,8 @@ struct writeback_control {
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
+ /* flags which control the write_cache_pages behaviour */
+ int writeback_flags;
};
/*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 718efa6..c198ead 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -876,11 +876,18 @@ int write_cache_pages(struct address_space *mapping,
pgoff_t end; /* Inclusive */
int scanned = 0;
int range_whole = 0;
+ int flags = wbc->writeback_flags;
+ long *nr_to_write, count;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
return 0;
}
+ if (flags & WB_NO_NRWRITE_UPDATE) {
+ count = wbc->nr_to_write;
+ nr_to_write = &count;
+ } else
+ nr_to_write = &wbc->nr_to_write;
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
@@ -939,7 +946,7 @@ int write_cache_pages(struct address_space *mapping,
unlock_page(page);
ret = 0;
}
- if (ret || (--(wbc->nr_to_write) <= 0))
+ if (ret || (--(*nr_to_write) <= 0))
done = 1;
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
@@ -958,8 +965,11 @@ int write_cache_pages(struct address_space *mapping,
index = 0;
goto retry;
}
- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
+ if ((wbc->range_cyclic ||
+ (range_whole && wbc->nr_to_write > 0)) &&
+ (flags & ~WB_NO_INDEX_UPDATE)) {
mapping->writeback_index = index;
+ }
return ret;
}
^ permalink raw reply related [flat|nested] 36+ messages in thread* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 15:54 ` Aneesh Kumar K.V
@ 2008-10-10 15:59 ` Chris Mason
2008-10-10 16:10 ` Theodore Tso
2008-10-10 16:34 ` Christoph Hellwig
2 siblings, 0 replies; 36+ messages in thread
From: Chris Mason @ 2008-10-10 15:59 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Christoph Hellwig, Theodore Tso, Steven Whitehouse, npiggin,
Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel,
linux-ext4
On Fri, 2008-10-10 at 21:24 +0530, Aneesh Kumar K.V wrote:
> On Fri, Oct 10, 2008 at 10:08:29AM -0400, Christoph Hellwig wrote:
> > On Fri, Oct 10, 2008 at 10:05:35AM -0400, Theodore Tso wrote:
> > > 3) A version which (optionally via a flag in the wbc structure)
> > > instructs write_cache_pages() to not pursue those updates. This has
> > > not been written yet.
> >
> > This one sounds best to me (although we'd have to actualy see it..)
>
> something like the below ?
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index bd91987..7599af2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -63,6 +63,8 @@ struct writeback_control {
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> unsigned more_io:1; /* more io to be dispatched */
> + /* flags which control the write_cache_pages behaviour */
> + int writeback_flags;
> };
Doesn't seem in line with the bitflag method currently used in struct
writeback_control.
-chris
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 15:54 ` Aneesh Kumar K.V
2008-10-10 15:59 ` Chris Mason
@ 2008-10-10 16:10 ` Theodore Tso
2008-10-10 16:34 ` Christoph Hellwig
2 siblings, 0 replies; 36+ messages in thread
From: Theodore Tso @ 2008-10-10 16:10 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Christoph Hellwig, Steven Whitehouse, npiggin, Andrew Morton,
Mikulas Patocka, linux-mm, linux-fsdevel, linux-ext4
On Fri, Oct 10, 2008 at 09:24:47PM +0530, Aneesh Kumar K.V wrote:
> On Fri, Oct 10, 2008 at 10:08:29AM -0400, Christoph Hellwig wrote:
> > On Fri, Oct 10, 2008 at 10:05:35AM -0400, Theodore Tso wrote:
> > > 3) A version which (optionally via a flag in the wbc structure)
> > > instructs write_cache_pages() to not pursue those updates. This has
> > > not been written yet.
> >
> > This one sounds best to me (although we'd have to actualy see it..)
>
> something like the below ?
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index bd91987..7599af2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -63,6 +63,8 @@ struct writeback_control {
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> unsigned more_io:1; /* more io to be dispatched */
> + /* flags which control the write_cache_pages behaviour */
> + int writeback_flags;
> };
I don't see a definition for WB_NO_NRWRITE_UPDATE and
WB_NO_INDEX_UPDATE in your patch?
Given the structure seems to be using bitfields for all of the other
fields, why not do this instead?
unsigned no_nrwrite_update:1;
unsigned no_index_update:1;
Personally, I'm old school, and prefer using an int flag field and
using #define's for flags, but the rest of the structure is using
bitfields for flags, and it's probably better to be consistent...
- Ted
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 15:54 ` Aneesh Kumar K.V
2008-10-10 15:59 ` Chris Mason
2008-10-10 16:10 ` Theodore Tso
@ 2008-10-10 16:34 ` Christoph Hellwig
2 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2008-10-10 16:34 UTC (permalink / raw)
To: Aneesh Kumar K.V
Cc: Christoph Hellwig, Theodore Tso, Steven Whitehouse, npiggin,
Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel,
linux-ext4
On Fri, Oct 10, 2008 at 09:24:47PM +0530, Aneesh Kumar K.V wrote:
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index bd91987..7599af2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -63,6 +63,8 @@ struct writeback_control {
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> unsigned more_io:1; /* more io to be dispatched */
> + /* flags which control the write_cache_pages behaviour */
> + int writeback_flags;
As Ted already said please follow the bitfields style already used.
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -876,11 +876,18 @@ int write_cache_pages(struct address_space *mapping,
> pgoff_t end; /* Inclusive */
> int scanned = 0;
> int range_whole = 0;
> + int flags = wbc->writeback_flags;
> + long *nr_to_write, count;
>
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
> return 0;
> }
> + if (flags & WB_NO_NRWRITE_UPDATE) {
> + count = wbc->nr_to_write;
> + nr_to_write = &count;
> + } else
> + nr_to_write = &wbc->nr_to_write;
I think we'd be better off always using a local variable and updating
wbc->nr_to_write again before the exit for the !WB_NO_NRWRITE_UPDATE
case.
> - if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
> + if ((wbc->range_cyclic ||
> + (range_whole && wbc->nr_to_write > 0)) &&
> + (flags & ~WB_NO_INDEX_UPDATE)) {
> mapping->writeback_index = index;
The conditional looks rather odd, what about:
if (!wbc->no_index_update &&
(wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
Also I wonder what this is for. Do you want what Chris did in his
original patch in ext4 code, or is there another reason to not update
the writeback_index sometimes?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 4/8] mm: write_cache_pages type overflow fix
2008-10-10 13:37 ` Theodore Tso
2008-10-10 13:48 ` Steven Whitehouse
@ 2008-10-10 13:56 ` Chris Mason
1 sibling, 0 replies; 36+ messages in thread
From: Chris Mason @ 2008-10-10 13:56 UTC (permalink / raw)
To: Theodore Tso
Cc: Christoph Hellwig, npiggin, Andrew Morton, Mikulas Patocka,
linux-mm, linux-fsdevel
On Fri, 2008-10-10 at 09:37 -0400, Theodore Tso wrote:
> On Fri, Oct 10, 2008 at 09:13:25AM -0400, Christoph Hellwig wrote:
> > On Fri, Oct 10, 2008 at 09:10:30AM -0400, Theodore Tso wrote:
> > > > Aneesh has a patch to kill the range_cont flag, which is queued up for
> > > > 2.6.28.
> > >
> > > Which tree is this queued up in? It's not in ext4 or the mm tree...
> >
> > Oh, it' not queued up yet? It's part of the patch that switches ext4
> > to it's own copy of write_cache_pages to fix the buffer write issues.
> >
>
> I held off queing it up since the version Aneesh did created ext4's
> own copy of write_cache_pages, and given that Nick has a bunch of
> fixes and improvements for write_cache_pages, it confirmed my fears
> that queueing a patch which copied ~100 lines of code into ext4 was
> probably not the best way to go.
>
What I was hoping for when I suggested copying it in was a larger move
of logic from the ext4 writepages code into the ext4 write_cache_pages.
The idea was to see what ext4 really needed write_cache_pages to do, but
Aneesh knows much better than I do there.
But, given that the change was only a few lines, I think it makes sense
to fold it back into write_cache_pages.
-chris
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 36+ messages in thread
* [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
` (3 preceding siblings ...)
2008-10-09 15:50 ` [patch 4/8] mm: write_cache_pages type overflow fix npiggin
@ 2008-10-09 15:50 ` npiggin
2008-10-09 12:52 ` Chris Mason
2008-10-09 15:50 ` [patch 6/8] mm: write_cache_pages cleanups npiggin
` (2 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-integrity-fix.patch --]
[-- Type: text/plain, Size: 2811 bytes --]
In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so
the function will return success after writing out nr_to_write pages, even if
that was not sufficient to guarantee data integrity.
The callers tend to set it to values that could break data interity semantics
easily in practice. For example, nr_to_write can be set to mapping->nr_pages *
2, however if a file has a single, dirty page, then fsync is called, subsequent
pages might be concurrently added and dirtied, then write_cache_pages might
writeout two of these newly dirty pages, while not writing out the old page
that should have been written out.
Fix this by ignoring nr_to_write if it is a data integrity sync.
This is a data interity bug.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
The reason this has been done in the past is to avoid stalling sync operations
behind page dirtiers.
"If a file has one dirty page at offset 1000000000000000 then someone
does an fsync() and someone else gets in first and starts madly writing
pages at offset 0, we want to write that page at 1000000000000000.
Somehow."
What we to today is return success after an arbitrary amount of pages are
written, whether or not we have provided the data-integrity semantics that
the caller has asked for. Even this doesn't actually fix all stall cases
completely: in the above situation, if the file has a huge number of pages
in pagecache (but not dirty), then mapping->nrpages is going to be huge,
even if pages are being dirtied.
This change does indeed make the possibility of long stalls lager, and that's
not a good thing, but lying about data integrity is even worse. We have to
either perform the sync, or return -ELINUXISLAME so at least the caller knows
what has happened.
There are subsequent competing approaches in the works to solve the stall
problems properly, without compromising data integrity.
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -951,8 +951,10 @@ again:
done = 1;
break;
}
- if (--(wbc->nr_to_write) <= 0)
- done = 1;
+ if (wbc->sync_mode == WB_SYNC_NONE) {
+ if (--(wbc->nr_to_write) <= 0)
+ done = 1;
+ }
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
done = 1;
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -209,7 +209,7 @@ int __filemap_fdatawrite_range(struct ad
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
- .nr_to_write = mapping->nrpages * 2,
+ .nr_to_write = LONG_MAX,
.range_start = start,
.range_end = end,
};
--
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 15:50 ` [patch 5/8] mm: write_cache_pages integrity fix npiggin
@ 2008-10-09 12:52 ` Chris Mason
2008-10-09 13:27 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Chris Mason @ 2008-10-09 12:52 UTC (permalink / raw)
To: npiggin; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Fri, 2008-10-10 at 02:50 +1100, npiggin@suse.de wrote:
> plain text document attachment (mm-wcp-integrity-fix.patch)
> In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so
> the function will return success after writing out nr_to_write pages, even if
> that was not sufficient to guarantee data integrity.
>
> The callers tend to set it to values that could break data interity semantics
> easily in practice. For example, nr_to_write can be set to mapping->nr_pages *
> 2, however if a file has a single, dirty page, then fsync is called, subsequent
> pages might be concurrently added and dirtied, then write_cache_pages might
> writeout two of these newly dirty pages, while not writing out the old page
> that should have been written out.
>
> Fix this by ignoring nr_to_write if it is a data integrity sync.
>
Thanks for working on these.
We should have a wbc->integrity flag because WB_SYNC_NONE is somewhat
over used, and it is often used in data integrity syncs.
See fs/sync.c:do_sync_mapping_range()
There are many valid uses where we don't want to wait on pages that are
already writeback but we do want to write everything else.
-chris
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 12:52 ` Chris Mason
@ 2008-10-09 13:27 ` Nick Piggin
2008-10-09 13:35 ` Chris Mason
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2008-10-09 13:27 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, Oct 09, 2008 at 08:52:45AM -0400, Chris Mason wrote:
> On Fri, 2008-10-10 at 02:50 +1100, npiggin@suse.de wrote:
> > plain text document attachment (mm-wcp-integrity-fix.patch)
> > In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so
> > the function will return success after writing out nr_to_write pages, even if
> > that was not sufficient to guarantee data integrity.
> >
> > The callers tend to set it to values that could break data interity semantics
> > easily in practice. For example, nr_to_write can be set to mapping->nr_pages *
> > 2, however if a file has a single, dirty page, then fsync is called, subsequent
> > pages might be concurrently added and dirtied, then write_cache_pages might
> > writeout two of these newly dirty pages, while not writing out the old page
> > that should have been written out.
> >
> > Fix this by ignoring nr_to_write if it is a data integrity sync.
> >
>
> Thanks for working on these.
No problem. Actually I feel I would be negligent for knowingly shipping
a kernel with these bugs :( So I don't have much choice...
> We should have a wbc->integrity flag because WB_SYNC_NONE is somewhat
> over used, and it is often used in data integrity syncs.
>
> See fs/sync.c:do_sync_mapping_range()
Oh great, more data integrity bugs.
I've always disliked the sync_file_range API ;) it seems over complex and
introduces the concept of writeout to userspace that seems questionable to
me. I should add SYNC_FILE_RANGE_ASYNC and SYNC_FILE_RANGE_SYNC for people
who already know POSIX and just want to convert existing fsync or msync to
a file and range based API, and also get a proper async operation that
isn't bound to kick off writeback for every page but could hand off page
cleaning to another thread...
Anyway, quick fix says we have to change that WB_SYNC_NONE into WB_SYNC_ALL.
WB_SYNC_NONE is all over the kernel. So do_sync_mapping_range is
broken as-is.
> There are many valid uses where we don't want to wait on pages that are
> already writeback but we do want to write everything else.
Hmm. For this strange "synchronously start writeback but don't have to
wait on it" operation, maybe. But that's not a great async operation
because for big ranges or under any sort of load, it's going to block
on congested block devices anyway. Are there any other uses than this one?
However I don't have any problem with improving it if it is useful. But
for now...
---
Chris Mason notices do_sync_mapping_range didn't actually ask for data
integrity writeout. Oops.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c
+++ linux-2.6/fs/sync.c
@@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address
if (flags & SYNC_FILE_RANGE_WRITE) {
ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
- WB_SYNC_NONE);
+ WB_SYNC_ALL);
if (ret < 0)
goto out;
}
^ permalink raw reply [flat|nested] 36+ messages in thread* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 13:27 ` Nick Piggin
@ 2008-10-09 13:35 ` Chris Mason
2008-10-09 13:55 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Chris Mason @ 2008-10-09 13:35 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, 2008-10-09 at 15:27 +0200, Nick Piggin wrote:
> On Thu, Oct 09, 2008 at 08:52:45AM -0400, Chris Mason wrote:
> > On Fri, 2008-10-10 at 02:50 +1100, npiggin@suse.de wrote:
> > > plain text document attachment (mm-wcp-integrity-fix.patch)
> > > In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so
> > > the function will return success after writing out nr_to_write pages, even if
> > > that was not sufficient to guarantee data integrity.
> > >
> > > The callers tend to set it to values that could break data interity semantics
> > > easily in practice. For example, nr_to_write can be set to mapping->nr_pages *
> > > 2, however if a file has a single, dirty page, then fsync is called, subsequent
> > > pages might be concurrently added and dirtied, then write_cache_pages might
> > > writeout two of these newly dirty pages, while not writing out the old page
> > > that should have been written out.
> > >
> > > Fix this by ignoring nr_to_write if it is a data integrity sync.
> > >
> >
> > Thanks for working on these.
>
> No problem. Actually I feel I would be negligent for knowingly shipping
> a kernel with these bugs :( So I don't have much choice...
>
>
> > We should have a wbc->integrity flag because WB_SYNC_NONE is somewhat
> > over used, and it is often used in data integrity syncs.
> >
> > See fs/sync.c:do_sync_mapping_range()
>
> Oh great, more data integrity bugs.
>
> I've always disliked the sync_file_range API ;) it seems over complex and
> introduces the concept of writeout to userspace that seems questionable to
> me. I should add SYNC_FILE_RANGE_ASYNC and SYNC_FILE_RANGE_SYNC for people
> who already know POSIX and just want to convert existing fsync or msync to
> a file and range based API, and also get a proper async operation that
> isn't bound to kick off writeback for every page but could hand off page
> cleaning to another thread...
>
> Anyway, quick fix says we have to change that WB_SYNC_NONE into WB_SYNC_ALL.
> WB_SYNC_NONE is all over the kernel. So do_sync_mapping_range is
> broken as-is.
>
I don't think do_sync_mapping_range is broken as is. It simply splits
the operations into different parts. The caller can request that we
wait for pending IO first.
WB_SYNC_NONE none just means don't wait for IO in flight, and there are
valid uses for it that will slow down if you switch them all to
WB_SYNC_ALL.
The problem is that we have a few flags and callers that mean almost but
not quite the same thing. Some people confuse WB_SYNC_NONE with
wbc->nonblocking.
I'd leave WB_SYNC_NONE alone and set wbc->nr_to_write to the max int, or
just make a new flag that says write every dirty page in this range.
-chris
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 13:35 ` Chris Mason
@ 2008-10-09 13:55 ` Nick Piggin
2008-10-09 14:12 ` Chris Mason
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2008-10-09 13:55 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, Oct 09, 2008 at 09:35:58AM -0400, Chris Mason wrote:
> On Thu, 2008-10-09 at 15:27 +0200, Nick Piggin wrote:
>
> I don't think do_sync_mapping_range is broken as is. It simply splits
> the operations into different parts. The caller can request that we
> wait for pending IO first.
It is. Not because of it's whacky API, but because it uses WB_SYNC_NONE.
> WB_SYNC_NONE none just means don't wait for IO in flight, and there are
> valid uses for it that will slow down if you switch them all to
> WB_SYNC_ALL.
To write_cache_pages it means that, but further down the chain (eg.
block_write_full_page) it also means not to wait on other stuff.
It has broadly meant "don't worry about data integirty" for a long time
AFAIKS.
> The problem is that we have a few flags and callers that mean almost but
> not quite the same thing. Some people confuse WB_SYNC_NONE with
> wbc->nonblocking.
>
> I'd leave WB_SYNC_NONE alone and set wbc->nr_to_write to the max int, or
> just make a new flag that says write every dirty page in this range.
Well it wouldn't hurt to clarify things. I'd rather introduce a new
WB_SYNC_WRITEBACK or something for that guy. But anyway, as I said,
it's just a minimal fix.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 13:55 ` Nick Piggin
@ 2008-10-09 14:12 ` Chris Mason
2008-10-09 14:21 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Chris Mason @ 2008-10-09 14:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, 2008-10-09 at 15:55 +0200, Nick Piggin wrote:
> On Thu, Oct 09, 2008 at 09:35:58AM -0400, Chris Mason wrote:
> > On Thu, 2008-10-09 at 15:27 +0200, Nick Piggin wrote:
> >
> > I don't think do_sync_mapping_range is broken as is. It simply splits
> > the operations into different parts. The caller can request that we
> > wait for pending IO first.
>
> It is. Not because of it's whacky API, but because it uses WB_SYNC_NONE.
>
>
> > WB_SYNC_NONE none just means don't wait for IO in flight, and there are
> > valid uses for it that will slow down if you switch them all to
> > WB_SYNC_ALL.
>
> To write_cache_pages it means that, but further down the chain (eg.
> block_write_full_page) it also means not to wait on other stuff.
>
> It has broadly meant "don't worry about data integirty" for a long time
> AFAIKS.
Sadly it has broadly meant different things to different people ;)
You're right, block_write_full_page is broken.
-chris
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 14:12 ` Chris Mason
@ 2008-10-09 14:21 ` Nick Piggin
2008-10-09 14:39 ` Chris Mason
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2008-10-09 14:21 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, Oct 09, 2008 at 10:12:55AM -0400, Chris Mason wrote:
> On Thu, 2008-10-09 at 15:55 +0200, Nick Piggin wrote:
> > On Thu, Oct 09, 2008 at 09:35:58AM -0400, Chris Mason wrote:
> > > On Thu, 2008-10-09 at 15:27 +0200, Nick Piggin wrote:
> > >
> > > I don't think do_sync_mapping_range is broken as is. It simply splits
> > > the operations into different parts. The caller can request that we
> > > wait for pending IO first.
> >
> > It is. Not because of it's whacky API, but because it uses WB_SYNC_NONE.
> >
> >
> > > WB_SYNC_NONE none just means don't wait for IO in flight, and there are
> > > valid uses for it that will slow down if you switch them all to
> > > WB_SYNC_ALL.
> >
> > To write_cache_pages it means that, but further down the chain (eg.
> > block_write_full_page) it also means not to wait on other stuff.
> >
> > It has broadly meant "don't worry about data integirty" for a long time
> > AFAIKS.
>
> Sadly it has broadly meant different things to different people ;)
> You're right, block_write_full_page is broken.
Well, I really just think it is do_sync_mapping_range that is broken.
Because __sync_single_inode treats WB_SYNC_NONE as a general "nowait",
so does __writeback_single_inode. Weakest semantics define the API :)
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 14:21 ` Nick Piggin
@ 2008-10-09 14:39 ` Chris Mason
2008-10-09 14:50 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Chris Mason @ 2008-10-09 14:39 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, 2008-10-09 at 16:21 +0200, Nick Piggin wrote:
> On Thu, Oct 09, 2008 at 10:12:55AM -0400, Chris Mason wrote:
> > On Thu, 2008-10-09 at 15:55 +0200, Nick Piggin wrote:
> > > On Thu, Oct 09, 2008 at 09:35:58AM -0400, Chris Mason wrote:
> > > > On Thu, 2008-10-09 at 15:27 +0200, Nick Piggin wrote:
> > > >
> > > > I don't think do_sync_mapping_range is broken as is. It simply splits
> > > > the operations into different parts. The caller can request that we
> > > > wait for pending IO first.
> > >
> > > It is. Not because of it's whacky API, but because it uses WB_SYNC_NONE.
> > >
> > >
> > > > WB_SYNC_NONE none just means don't wait for IO in flight, and there are
> > > > valid uses for it that will slow down if you switch them all to
> > > > WB_SYNC_ALL.
> > >
> > > To write_cache_pages it means that, but further down the chain (eg.
> > > block_write_full_page) it also means not to wait on other stuff.
> > >
> > > It has broadly meant "don't worry about data integirty" for a long time
> > > AFAIKS.
> >
> > Sadly it has broadly meant different things to different people ;)
> > You're right, block_write_full_page is broken.
>
> Well, I really just think it is do_sync_mapping_range that is broken.
> Because __sync_single_inode treats WB_SYNC_NONE as a general "nowait",
> so does __writeback_single_inode. Weakest semantics define the API :)
Unfortunately these things are using the flag differently
__sync_single_inode and __writeback_single_inode do different things
with the flag than people that end up directly calling the writepages
methods.
At the write_cache_pages level, WB_SYNC_NONE should only change the
waiting for IO in flight.
-chris
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 14:39 ` Chris Mason
@ 2008-10-09 14:50 ` Nick Piggin
2008-10-09 15:16 ` Chris Mason
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2008-10-09 14:50 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, Oct 09, 2008 at 10:39:23AM -0400, Chris Mason wrote:
> On Thu, 2008-10-09 at 16:21 +0200, Nick Piggin wrote:
> > On Thu, Oct 09, 2008 at 10:12:55AM -0400, Chris Mason wrote:
> > > On Thu, 2008-10-09 at 15:55 +0200, Nick Piggin wrote:
> > > > On Thu, Oct 09, 2008 at 09:35:58AM -0400, Chris Mason wrote:
> > > > > On Thu, 2008-10-09 at 15:27 +0200, Nick Piggin wrote:
> > > > >
> > > > > I don't think do_sync_mapping_range is broken as is. It simply splits
> > > > > the operations into different parts. The caller can request that we
> > > > > wait for pending IO first.
> > > >
> > > > It is. Not because of it's whacky API, but because it uses WB_SYNC_NONE.
> > > >
> > > >
> > > > > WB_SYNC_NONE none just means don't wait for IO in flight, and there are
> > > > > valid uses for it that will slow down if you switch them all to
> > > > > WB_SYNC_ALL.
> > > >
> > > > To write_cache_pages it means that, but further down the chain (eg.
> > > > block_write_full_page) it also means not to wait on other stuff.
> > > >
> > > > It has broadly meant "don't worry about data integirty" for a long time
> > > > AFAIKS.
> > >
> > > Sadly it has broadly meant different things to different people ;)
> > > You're right, block_write_full_page is broken.
> >
> > Well, I really just think it is do_sync_mapping_range that is broken.
> > Because __sync_single_inode treats WB_SYNC_NONE as a general "nowait",
> > so does __writeback_single_inode. Weakest semantics define the API :)
>
> Unfortunately these things are using the flag differently
> __sync_single_inode and __writeback_single_inode do different things
> with the flag than people that end up directly calling the writepages
> methods.
Sure, but it's the "nowait" semantics that they want, right? And *they*
eventually call into writepages. So they want similar semantics from
writepages, presumably.
The comment in WB_SYNC_NONE definition kind of suggests it meant don't
wait for anything when it was written...
> At the write_cache_pages level, WB_SYNC_NONE should only change the
> waiting for IO in flight.
Aside from do_sync_mapping_range, what are other reasons to enforce
the same thing all up and down the writeout stack? If there are good
reasons, let's add WB_SYNC_WRITEBACK?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 14:50 ` Nick Piggin
@ 2008-10-09 15:16 ` Chris Mason
2008-10-10 2:40 ` Nick Piggin
0 siblings, 1 reply; 36+ messages in thread
From: Chris Mason @ 2008-10-09 15:16 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, 2008-10-09 at 16:50 +0200, Nick Piggin wrote:
> On Thu, Oct 09, 2008 at 10:39:23AM -0400, Chris Mason wrote:
> > On Thu, 2008-10-09 at 16:21 +0200, Nick Piggin wrote:
> > > On Thu, Oct 09, 2008 at 10:12:55AM -0400, Chris Mason wrote:
> > > > On Thu, 2008-10-09 at 15:55 +0200, Nick Piggin wrote:
> > > > > On Thu, Oct 09, 2008 at 09:35:58AM -0400, Chris Mason wrote:
> > > > > > On Thu, 2008-10-09 at 15:27 +0200, Nick Piggin wrote:
> > > > > >
> > > > > > I don't think do_sync_mapping_range is broken as is. It simply splits
> > > > > > the operations into different parts. The caller can request that we
> > > > > > wait for pending IO first.
> > > > >
> > > > > It is. Not because of it's whacky API, but because it uses WB_SYNC_NONE.
> > > > >
> > > > >
> > > > > > WB_SYNC_NONE none just means don't wait for IO in flight, and there are
> > > > > > valid uses for it that will slow down if you switch them all to
> > > > > > WB_SYNC_ALL.
> > > > >
> > > > > To write_cache_pages it means that, but further down the chain (eg.
> > > > > block_write_full_page) it also means not to wait on other stuff.
> > > > >
> > > > > It has broadly meant "don't worry about data integirty" for a long time
> > > > > AFAIKS.
> > > >
> > > > Sadly it has broadly meant different things to different people ;)
> > > > You're right, block_write_full_page is broken.
> > >
> > > Well, I really just think it is do_sync_mapping_range that is broken.
> > > Because __sync_single_inode treats WB_SYNC_NONE as a general "nowait",
> > > so does __writeback_single_inode. Weakest semantics define the API :)
> >
> > Unfortunately these things are using the flag differently
> > __sync_single_inode and __writeback_single_inode do different things
> > with the flag than people that end up directly calling the writepages
> > methods.
>
> Sure, but it's the "nowait" semantics that they want, right? And *they*
> eventually call into writepages. So they want similar semantics from
> writepages, presumably.
>
> The comment in WB_SYNC_NONE definition kind of suggests it meant don't
> wait for anything when it was written...
That seems to have turned into wbc->nonblocking. WB_SYNC_NONE doesn't
stop other blocking inside the FS (delalloc and other fun).
>
>
> > At the write_cache_pages level, WB_SYNC_NONE should only change the
> > waiting for IO in flight.
>
> Aside from do_sync_mapping_range, what are other reasons to enforce
> the same thing all up and down the writeout stack? If there are good
> reasons, let's add WB_SYNC_WRITEBACK?
Your change to skip writeback pages that aren't dirty makes WB_SYNC_ALL
almost the same as WB_SYNC_WRITEBACK. With that in place we're pretty
deep into grey areas where people may not want to go around rewriting
pages that were dirtied after their sync began.
At least that's what I think the idea behind do_sync_mapping_range using
WB_SYNC_NONE was.
-chris
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [patch 5/8] mm: write_cache_pages integrity fix
2008-10-09 15:16 ` Chris Mason
@ 2008-10-10 2:40 ` Nick Piggin
0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2008-10-10 2:40 UTC (permalink / raw)
To: Chris Mason; +Cc: Andrew Morton, Mikulas Patocka, linux-mm, linux-fsdevel
On Thu, Oct 09, 2008 at 11:16:34AM -0400, Chris Mason wrote:
> On Thu, 2008-10-09 at 16:50 +0200, Nick Piggin wrote:
> >
> > The comment in WB_SYNC_NONE definition kind of suggests it meant don't
> > wait for anything when it was written...
>
> That seems to have turned into wbc->nonblocking. WB_SYNC_NONE doesn't
> stop other blocking inside the FS (delalloc and other fun).
Right, and neither does nonblocking in all cases ;)
Anyway, I completely agree that it is unclear at best and could really
use a spring clean of the fields and their semantics.
> > > At the write_cache_pages level, WB_SYNC_NONE should only change the
> > > waiting for IO in flight.
> >
> > Aside from do_sync_mapping_range, what are other reasons to enforce
> > the same thing all up and down the writeout stack? If there are good
> > reasons, let's add WB_SYNC_WRITEBACK?
>
> Your change to skip writeback pages that aren't dirty makes WB_SYNC_ALL
> almost the same as WB_SYNC_WRITEBACK. With that in place we're pretty
> deep into grey areas where people may not want to go around rewriting
> pages that were dirtied after their sync began.
Yeah, they definitely do, though that's a slightly different problem. At
the moment, any dirty pages found in a data integrity operation *must* be
written out. Because we have no idea when they were dirtied. This is
how sync can get stuck behind write(2) for a long time (and this is why our
sync has traditionally bailed out after ->nrpages*2).
I have further patches to add a new tag to the radix-tree to mark all the
pages to sync up-front to solve this nicely. Mikulas has a different
approach to instead throttle the dirtiers. Whichever approach is favoured
should be the next step after this round of patches.
> At least that's what I think the idea behind do_sync_mapping_range using
> WB_SYNC_NONE was.
do_sync_mapping_range indeed can ignore dirty,writeback pages, because its
data integrity operation would wait for writeback, then write dirty, then
wait for writeback again. This is quite a corner-case, for its unusual
semantics though. You may just as well not wait for writeback to start
with, but wait for them in the writeout pass (and only if they are dirty):
that will likely be as fast or faster anyway.
^ permalink raw reply [flat|nested] 36+ messages in thread
* [patch 6/8] mm: write_cache_pages cleanups
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
` (4 preceding siblings ...)
2008-10-09 15:50 ` [patch 5/8] mm: write_cache_pages integrity fix npiggin
@ 2008-10-09 15:50 ` npiggin
2008-10-09 14:37 ` Artem Bityutskiy
2008-10-09 15:50 ` [patch 7/8] mm: write_cache_pages optimise page cleaning npiggin
2008-10-09 15:50 ` [patch 8/8] mm: write_cache_pages terminate quickly npiggin
7 siblings, 1 reply; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-cleanup.patch --]
[-- Type: text/plain, Size: 2264 bytes --]
Get rid of some complex expressions from flow control statements, add a
comment, remove some duplicate code.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -900,11 +900,14 @@ int write_cache_pages(struct address_spa
cycled = 1; /* ignore range_cyclic tests */
}
retry:
- while (!done && (index <= end) &&
- (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
- PAGECACHE_TAG_DIRTY,
- min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
- unsigned i;
+ while (!done && (index <= end)) {
+ int i;
+
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+ PAGECACHE_TAG_DIRTY,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
+ if (nr_pages == 0)
+ break;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -919,7 +922,16 @@ retry:
again:
lock_page(page);
+ /*
+ * Page truncated or invalidated. We can freely skip it
+ * then, even for data integrity operations: the page
+ * has disappeared concurrently, so there could be no
+ * real expectation of this data interity operation
+ * even if there is now a new, dirty page at the same
+ * pagecache address.
+ */
if (unlikely(page->mapping != mapping)) {
+continue_unlock:
unlock_page(page);
continue;
}
@@ -927,18 +939,15 @@ again:
if (page->index > end) {
/* Can't be range_cyclic: end == -1 there */
done = 1;
- unlock_page(page);
- continue;
+ goto continue_unlock;
}
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !clear_page_dirty_for_io(page)) {
- unlock_page(page);
- continue;
- }
+ !clear_page_dirty_for_io(page))
+ goto continue_unlock;
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
@@ -952,7 +961,8 @@ again:
break;
}
if (wbc->sync_mode == WB_SYNC_NONE) {
- if (--(wbc->nr_to_write) <= 0)
+ wbc->nr_to_write--;
+ if (wbc->nr_to_write <= 0)
done = 1;
}
if (wbc->nonblocking && bdi_write_congested(bdi)) {
--
^ permalink raw reply [flat|nested] 36+ messages in thread* [patch 7/8] mm: write_cache_pages optimise page cleaning
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
` (5 preceding siblings ...)
2008-10-09 15:50 ` [patch 6/8] mm: write_cache_pages cleanups npiggin
@ 2008-10-09 15:50 ` npiggin
2008-10-09 15:50 ` [patch 8/8] mm: write_cache_pages terminate quickly npiggin
7 siblings, 0 replies; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-writeback-clean-opt.patch --]
[-- Type: text/plain, Size: 1314 bytes --]
In write_cache_pages, if we get stuck behind another process that is cleaning
pages, we will be forced to wait for them to finish, then perform our own
writeout (if it was redirtied during the long wait), then wait for that.
If a page under writeout is still clean, we can skip waiting for it (if we're
part of a data integrity sync, we'll be waiting for all writeout pages
afterwards, so we'll still be waiting for the other guy's write that's cleaned
the page).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -942,11 +942,20 @@ continue_unlock:
goto continue_unlock;
}
- if (wbc->sync_mode != WB_SYNC_NONE)
- wait_on_page_writeback(page);
+ if (!PageDirty(page)) {
+ /* someone wrote it for us */
+ goto continue_unlock;
+ }
+
+ if (PageWriteback(page)) {
+ if (wbc->sync_mode != WB_SYNC_NONE)
+ wait_on_page_writeback(page);
+ else
+ goto continue_unlock;
+ }
- if (PageWriteback(page) ||
- !clear_page_dirty_for_io(page))
+ BUG_ON(PageWriteback(page));
+ if (!clear_page_dirty_for_io(page))
goto continue_unlock;
ret = (*writepage)(page, wbc, data);
--
^ permalink raw reply [flat|nested] 36+ messages in thread* [patch 8/8] mm: write_cache_pages terminate quickly
2008-10-09 15:50 [patch 0/8] write_cache_pages fixes npiggin
` (6 preceding siblings ...)
2008-10-09 15:50 ` [patch 7/8] mm: write_cache_pages optimise page cleaning npiggin
@ 2008-10-09 15:50 ` npiggin
7 siblings, 0 replies; 36+ messages in thread
From: npiggin @ 2008-10-09 15:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: Mikulas Patocka, linux-mm, linux-fsdevel
[-- Attachment #1: mm-wcp-terminate-quickly.patch --]
[-- Type: text/plain, Size: 1527 bytes --]
Terminate the write_cache_pages loop upon encountering the first page past
end, without locking the page. Pages cannot have their index change when we
have a reference on them (truncate, eg truncate_inode_pages_range performs
the same check without the page lock).
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -913,12 +913,18 @@ retry:
struct page *page = pvec.pages[i];
/*
- * At this point we hold neither mapping->tree_lock nor
- * lock on the page itself: the page may be truncated or
- * invalidated (changing page->mapping to NULL), or even
- * swizzled back from swapper_space to tmpfs file
- * mapping
+ * At this point, the page may be truncated or
+ * invalidated (changing page->mapping to NULL), or
+ * even swizzled back from swapper_space to tmpfs file
+ * mapping. However, page->index will not change
+ * because we have a reference on the page.
*/
+ if (page->index > end) {
+ /* Can't be range_cyclic: end == -1 there */
+ done = 1;
+ break;
+ }
+
again:
lock_page(page);
@@ -936,12 +942,6 @@ continue_unlock:
continue;
}
- if (page->index > end) {
- /* Can't be range_cyclic: end == -1 there */
- done = 1;
- goto continue_unlock;
- }
-
if (!PageDirty(page)) {
/* someone wrote it for us */
goto continue_unlock;
--
^ permalink raw reply [flat|nested] 36+ messages in thread