* [PATCH 1/3] ext4: tidy up a void argument in inode.c
2010-10-22 21:29 [PATCH 0/3] ext4: minor writeback changes Eric Sandeen
@ 2010-10-22 21:30 ` Eric Sandeen
2010-10-22 21:37 ` [PATCH 2/3] ext4: implement writeback livelock avoidance using page tagging Eric Sandeen
2010-10-22 21:45 ` [PATCH 3/3] ext4: update writeback_index based on last page scanned Eric Sandeen
2 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2010-10-22 21:30 UTC (permalink / raw)
To: ext4 development
This doesn't fix anything at all; it just removes a vestige
of prior use from __mpage_da_writepage().
__mpage_da_writepage() had a void * argument left over from
its previous life as a callback; make it reflect the actual type.
Fixing this up makes the function slightly easier to read, and
enables proper typechecking.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
Ted, I had a similar change for mpage_put_bnr_to_bhs,
which takes an mpd but only wants an inode, but since you
absorbed that function in your patch series, I didn't
submit that change here.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 50f3bba..854917e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2424,9 +2423,9 @@ static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh)
* The function finds extents of pages and scan them for all blocks.
*/
static int __mpage_da_writepage(struct page *page,
- struct writeback_control *wbc, void *data)
+ struct writeback_control *wbc,
+ struct mpage_da_data *mpd)
{
- struct mpage_da_data *mpd = data;
struct inode *inode = mpd->inode;
struct buffer_head *bh, *head;
sector_t logical;
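For what it's worth, the gain is mostly type safety. A minimal
standalone sketch (hypothetical ctx structs, nothing to do with ext4
code) of what a void * parameter costs once the function is no longer
used as a generic callback:

struct ctx_a { int a; };
struct ctx_b { int b; };

/* callback style: any pointer is silently accepted */
static int callback_style(int page, void *data)
{
	struct ctx_a *ctx = data;	/* compiles even if data is really a ctx_b */

	return ctx->a + page;
}

/* typed style: passing a struct ctx_b here is a compile error */
static int typed_style(int page, struct ctx_a *ctx)
{
	return ctx->a + page;
}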
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/3] ext4: implement writeback livelock avoidance using page tagging
2010-10-22 21:29 [PATCH 0/3] ext4: minor writeback changes Eric Sandeen
2010-10-22 21:30 ` [PATCH 1/3] ext4: tidy up a void argument in inode.c Eric Sandeen
@ 2010-10-22 21:37 ` Eric Sandeen
2010-10-22 21:45 ` [PATCH 3/3] ext4: update writeback_index based on last page scanned Eric Sandeen
2 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2010-10-22 21:37 UTC (permalink / raw)
To: ext4 development; +Cc: Jan Kara
This is analogous to Jan Kara's commit
f446daaea9d4a420d16c606f755f3689dcb2d0ce ("mm: implement writeback
livelock avoidance using page tagging"), but since we forked
write_cache_pages, we need to reimplement it there (and in
ext4_da_writepages, since range_cyclic handling was moved there).
If you start a large buffered IO to a file, and then issue
an fsync after it, you'll find that the fsync does not
complete until the other IO stops.
If you continue re-dirtying the file (say, by running dd
with conv=notrunc in a loop), then when fsync finally completes
(after all IO is done), it reports via tracing that
it has written many more pages than the file contains;
in other words it has synced and re-synced pages in
the file multiple times.
This then leads to problems with our writeback_index
update, since it advances the index by the number of pages
written, essentially setting writeback_index off the end of
the file...
With the following patch, we only sync as much as was
dirty at the time of the sync.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 854917e..163ffca 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2834,16 +2834,21 @@ static int write_cache_pages_da(struct address_space *mapping,
pgoff_t index;
pgoff_t end; /* Inclusive */
long nr_to_write = wbc->nr_to_write;
+ int tag;
pagevec_init(&pvec, 0);
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;
+ if (wbc->sync_mode == WB_SYNC_ALL)
+ tag = PAGECACHE_TAG_TOWRITE;
+ else
+ tag = PAGECACHE_TAG_DIRTY;
+
while (!done && (index <= end)) {
int i;
- nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
- PAGECACHE_TAG_DIRTY,
+ nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
if (nr_pages == 0)
break;
@@ -2948,6 +2953,7 @@ static int ext4_da_writepages(struct address_space *mapping,
long desired_nr_to_write, nr_to_writebump = 0;
loff_t range_start = wbc->range_start;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
+ pgoff_t end;
trace_ext4_da_writepages(inode, wbc);
@@ -2983,8 +2989,11 @@ static int ext4_da_writepages(struct address_space *mapping,
wbc->range_start = index << PAGE_CACHE_SHIFT;
wbc->range_end = LLONG_MAX;
wbc->range_cyclic = 0;
- } else
+ end = -1;
+ } else {
index = wbc->range_start >> PAGE_CACHE_SHIFT;
+ end = wbc->range_end >> PAGE_CACHE_SHIFT;
+ }
/*
* This works around two forms of stupidity. The first is in
@@ -3025,6 +3034,9 @@ static int ext4_da_writepages(struct address_space *mapping,
pages_skipped = wbc->pages_skipped;
retry:
+ if (wbc->sync_mode == WB_SYNC_ALL)
+ tag_pages_for_writeback(mapping, index, end);
+
while (!ret && wbc->nr_to_write > 0) {
/*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..3d132bf 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -143,6 +143,8 @@ typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
int generic_writepages(struct address_space *mapping,
struct writeback_control *wbc);
+void tag_pages_for_writeback(struct address_space *mapping,
+ pgoff_t start, pgoff_t end);
int write_cache_pages(struct address_space *mapping,
struct writeback_control *wbc, writepage_t writepage,
void *data);
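To make the mechanism concrete, here is a toy userspace model of the
tag-then-write idea (the toy_* names are hypothetical; this is not the
real radix-tree/pagevec code). A data-integrity sync first tags
whatever is dirty at that moment; the write loop then only visits
tagged pages, so pages redirtied during the sync are left for the next
pass instead of livelocking this one:

#define TOY_DIRTY	(1 << 0)
#define TOY_TOWRITE	(1 << 1)

struct toy_page {
	long	index;
	int	tags;
};

/* analogous to tag_pages_for_writeback(): mark what is dirty *now* */
static void toy_tag_for_writeback(struct toy_page *pages, int nr,
				  long start, long end)
{
	int i;

	for (i = 0; i < nr; i++)
		if (pages[i].index >= start && pages[i].index <= end &&
		    (pages[i].tags & TOY_DIRTY))
			pages[i].tags |= TOY_TOWRITE;
}

static void toy_sync_range(struct toy_page *pages, int nr,
			   long start, long end, int sync_all,
			   void (*writepage)(struct toy_page *))
{
	int tag = sync_all ? TOY_TOWRITE : TOY_DIRTY;
	int i;

	if (sync_all)
		toy_tag_for_writeback(pages, nr, start, end);

	for (i = 0; i < nr; i++) {
		if (pages[i].index < start || pages[i].index > end)
			continue;
		if (!(pages[i].tags & tag))
			continue;
		/* writing clears both tags; a later redirty sets only DIRTY */
		pages[i].tags &= ~(TOY_DIRTY | TOY_TOWRITE);
		writepage(&pages[i]);
	}
}

The patch above expresses the same idea through PAGECACHE_TAG_TOWRITE
and tag_pages_for_writeback().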
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/3] ext4: update writeback_index based on last page scanned
2010-10-22 21:29 [PATCH 0/3] ext4: minor writeback changes Eric Sandeen
2010-10-22 21:30 ` [PATCH 1/3] ext4: tidy up a void argument in inode.c Eric Sandeen
2010-10-22 21:37 ` [PATCH 2/3] ext4: implement writeback livelock avoidance using page tagging Eric Sandeen
@ 2010-10-22 21:45 ` Eric Sandeen
2010-10-25 21:35 ` Ted Ts'o
2 siblings, 1 reply; 9+ messages in thread
From: Eric Sandeen @ 2010-10-22 21:45 UTC (permalink / raw)
To: ext4 development
As pointed out in a prior patch, updating the mapping's
writeback_index based on pages written isn't quite right;
what the writeback index is really supposed to reflect is
the next page which should be scanned for writeback during
periodic flush.
As in write_cache_pages(), write_cache_pages_da() does
this scanning for us as we assemble the mpd for later
writeout. If we keep track of the next page after the
current scan, we can easily update writeback_index without
worrying about pages written vs. pages skipped, etc.
Without this, an fsync will reset writeback_index to
0 (its starting index) + however many pages it wrote, which
can mess up the progress of periodic flush.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 163ffca..767e8b5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2825,12 +2825,13 @@ static int ext4_da_writepages_trans_blocks(struct inode *inode)
*/
static int write_cache_pages_da(struct address_space *mapping,
struct writeback_control *wbc,
- struct mpage_da_data *mpd)
+ struct mpage_da_data *mpd,
+ pgoff_t *done_index)
{
int ret = 0;
int done = 0;
struct pagevec pvec;
- int nr_pages;
+ unsigned nr_pages;
pgoff_t index;
pgoff_t end; /* Inclusive */
long nr_to_write = wbc->nr_to_write;
@@ -2845,6 +2846,7 @@ static int write_cache_pages_da(struct address_space *mapping,
else
tag = PAGECACHE_TAG_DIRTY;
+ *done_index = index;
while (!done && (index <= end)) {
int i;
@@ -2868,6 +2870,8 @@ static int write_cache_pages_da(struct address_space *mapping,
break;
}
+ *done_index = page->index + 1;
+
lock_page(page);
/*
@@ -2953,6 +2957,7 @@ static int ext4_da_writepages(struct address_space *mapping,
long desired_nr_to_write, nr_to_writebump = 0;
loff_t range_start = wbc->range_start;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
+ pgoff_t done_index = 0;
pgoff_t end;
trace_ext4_da_writepages(inode, wbc);
@@ -3075,7 +3080,7 @@ retry:
mpd.io_done = 0;
mpd.pages_written = 0;
mpd.retval = 0;
- ret = write_cache_pages_da(mapping, wbc, &mpd);
+ ret = write_cache_pages_da(mapping, wbc, &mpd, &done_index);
/*
* If we have a contiguous extent of pages and we
* haven't done the I/O yet, map the blocks and submit
@@ -3131,14 +3136,13 @@ retry:
__func__, wbc->nr_to_write, ret);
/* Update index */
- index += pages_written;
wbc->range_cyclic = range_cyclic;
if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
/*
* set the writeback_index so that range_cyclic
* mode will write it back later
*/
- mapping->writeback_index = index;
+ mapping->writeback_index = done_index;
out_writepages:
wbc->nr_to_write -= nr_to_writebump;
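The bookkeeping change is small enough to sketch in isolation
(hypothetical toy_scan(), simplified well past the real code): record
the index just past the last page the scan looked at, and resume the
next range_cyclic pass from there rather than from the old starting
index plus pages written:

static long toy_scan(const int *dirty, long nr_pages, long start,
		     long *done_index)
{
	long i, written = 0;

	*done_index = start;
	for (i = start; i < nr_pages; i++) {
		*done_index = i + 1;	/* next page to scan next time */
		if (dirty[i])
			written++;	/* pretend this page got written */
	}
	return written;
}

/*
 * Caller then sets  writeback_index = done_index;
 * rather than       writeback_index = start + written;
 * the latter drifts whenever pages are skipped or rewritten.
 */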
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] ext4: update writeback_index based on last page scanned
2010-10-22 21:45 ` [PATCH 3/3] ext4: update writeback_index based on last page scanned Eric Sandeen
@ 2010-10-25 21:35 ` Ted Ts'o
2010-10-25 21:39 ` Eric Sandeen
0 siblings, 1 reply; 9+ messages in thread
From: Ted Ts'o @ 2010-10-25 21:35 UTC (permalink / raw)
To: Eric Sandeen; +Cc: ext4 development
On Fri, Oct 22, 2010 at 04:45:17PM -0500, Eric Sandeen wrote:
> As pointed out in a prior patch, updating the mapping's
> writeback_index based on pages written isn't quite right;
> what the writeback index is really supposed to reflect is
> the next page which should be scanned for writeback during
> periodic flush.
>
> As in write_cache_pages(), write_cache_pages_da() does
> this scanning for us as we assemble the mpd for later
> writeout. If we keep track of the next page after the
> current scan, we can easily update writeback_index without
> worrying about pages written vs. pages skipped, etc.
>
> Without this, an fsync will reset writeback_index to
> 0 (its starting index) + however many pages it wrote, which
> can mess up the progress of periodic flush.
>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Have you done any benchmarks with and without this patch series?
Say, compilebench on a used and mildly fragmented file system?
- Ted
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] ext4: update writeback_index based on last page scanned
2010-10-25 21:35 ` Ted Ts'o
@ 2010-10-25 21:39 ` Eric Sandeen
2010-10-26 14:14 ` Ted Ts'o
0 siblings, 1 reply; 9+ messages in thread
From: Eric Sandeen @ 2010-10-25 21:39 UTC (permalink / raw)
To: Ted Ts'o; +Cc: ext4 development
Ted Ts'o wrote:
> On Fri, Oct 22, 2010 at 04:45:17PM -0500, Eric Sandeen wrote:
>> As pointed out in a prior patch, updating the mapping's
>> writeback_index based on pages written isn't quite right;
>> what the writeback index is really supposed to reflect is
>> the next page which should be scanned for writeback during
>> periodic flush.
>>
>> As in write_cache_pages(), write_cache_pages_da() does
>> this scanning for us as we assemble the mpd for later
>> writeout. If we keep track of the next page after the
>> current scan, we can easily update writeback_index without
>> worrying about pages written vs. pages skipped, etc.
>>
>> Without this, an fsync will reset writeback_index to
>> 0 (its starting index) + however many pages it wrote, which
>> can mess up the progress of periodic flush.
>>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>
> Have you done any benchmarks with and without this patch series?
>
> Say, compilebench on a used and mildly fragmented file system?
>
> - Ted
Not compilebench specifically, but I did do some benchmarking
with out of cache buffered IO; to be honest I didn't see
striking performance differences, but I did see the writeback
behave better in terms of not wandering all over, even if it
might recover well.
I can try compilebench; do you have specific concerns?
-Eric
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] ext4: update writeback_index based on last page scanned
2010-10-25 21:39 ` Eric Sandeen
@ 2010-10-26 14:14 ` Ted Ts'o
2010-10-26 14:57 ` Eric Sandeen
0 siblings, 1 reply; 9+ messages in thread
From: Ted Ts'o @ 2010-10-26 14:14 UTC (permalink / raw)
To: Eric Sandeen; +Cc: ext4 development
On Mon, Oct 25, 2010 at 04:39:10PM -0500, Eric Sandeen wrote:
>
> Not compilebench specifically, but I did do some benchmarking
> with out of cache buffered IO; to be honest I didn't see
> striking performance differences, but I did see the writeback
> behave better in terms of not wandering all over, even if it
> might recover well.
>
> I can try compilebench; do you have specific concerns?
My specific concern is that what happens if __mpage_da_writepage()
accumulates 200 pages, but then we were only able to accumulate 50
pages, and we only write 50 pages.
In the long run what I really want to do is to not call
clear_page_dirty_for_io() in the renamed write_cache_pages_da(), but
rather be as greedy as possible about finding dirty/delayed allocate
pages, and then try to allocate pages for all of them.
We would then scan the pages for PAGECACHE_TAG_TOWRITE in
mpage_submit_data_io(), and then write out whatever number of pages we
need. At that point we will be a good citizen and writing back what
the writeback system asks of us --- but we'll be allocating as many
pages as possible so that the block allocations are sane. (At that
point we may find out that the core writeback is screwing us because
it's not asking us to write back enough; note that XFS decides on its
own how many pages to write back in the first call to xfs_writepage(),
and even if writeback is silly enough to think that XFS should write
4MB, then switch to another inode, write 4MB, then write to another
inode, etc., XFS ignores what writeback asks of it. But we'll cross
that road when we get to it....)
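To make that concrete, here is a rough toy model of the flow I have in
mind (hypothetical toy_* names, not a proposed patch): be greedy about
allocation up front, then honor nr_to_write only when submitting:

struct toy_da_page {
	int dirty;
	int allocated;
};

static long toy_greedy_writepages(struct toy_da_page *pages,
				  long nr_pages, long nr_to_write)
{
	long i, written = 0;

	/* 1) greedy pass: allocate blocks for every dirty page in the
	 *    range, without clearing dirty state, so the block
	 *    allocator sees one large request */
	for (i = 0; i < nr_pages; i++)
		if (pages[i].dirty)
			pages[i].allocated = 1;

	/* 2) write back only what writeback asked of us; the rest stay
	 *    dirty, but already mapped, for later passes */
	for (i = 0; i < nr_pages && written < nr_to_write; i++) {
		if (pages[i].dirty && pages[i].allocated) {
			pages[i].dirty = 0;
			written++;
		}
	}
	return written;
}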
So the bottom line is that I believe that what we were doing before is
wrong; and what we're doing is still wrong, even after your patches.
I just want to make sure that our performance doesn't go crashing
through the floor in order to avoid the livelock problem. (Which I
agree is a real problem, but we've lived with it for quite a while, and I
haven't seen any evidence of it showing up in production.)
- Ted
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] ext4: update writeback_index based on last page scanned
2010-10-26 14:14 ` Ted Ts'o
@ 2010-10-26 14:57 ` Eric Sandeen
2010-10-26 18:59 ` Ted Ts'o
0 siblings, 1 reply; 9+ messages in thread
From: Eric Sandeen @ 2010-10-26 14:57 UTC (permalink / raw)
To: Ted Ts'o; +Cc: ext4 development
Ted Ts'o wrote:
> On Mon, Oct 25, 2010 at 04:39:10PM -0500, Eric Sandeen wrote:
>> Not compilebench specifically, but I did do some benchmarking
>> with out of cache buffered IO; to be honest I didn't see
>> striking performance differences, but I did see the writeback
>> behave better in terms of not wandering all over, even if it
>> might recover well.
>>
>> I can try compilebench; do you have specific concerns?
>
> My specific concern is that what happens if __mpage_da_writepage()
> accumulates 200 pages, but then we were only able to accumulate 50
> pages, and we only write 50 pages.
Be patient with me, but how do we accumulate 200 pages but then only
accumulate 50 pages?
> In the long run what I really want to do is to not call
> clear_page_dirty_for_io() in the renamed write_cache_pages_da(), but
> rather be as greedy as possible about finding dirty/delayed allocate
> pages, and then try to allocate pages for all of them.
>
> We would then scan the pages for PAGECACHE_TAG_TOWRITE in
> mpage_submit_data_io(), and then write out whatever number of pages we
> need. At that point we will be a good citizen and writing back what
> the writeback system asks of us --- but we'll be allocating as many
> pages as possible so that the block allocations are sane. (At that
> point we may find out that the core writeback is screwing us because
> it's not asking us to write back enough; note that XFS decides on its
> own how many pages to write back in the first call to xfs_writepage(),
> and even if writeback is silly enough to think that XFS should write
> 4MB, then switch to another inode, write 4MB, then write to another
> inode, etc., XFS ignores what writeback asks of it. But we'll cross
> that road when we get to it....)
Since it works for xfs, we should probably try that direction, but my
early feeble attempts got bogged down in a lot of tangled code.
> So the bottom line is that I believe that what we were doing before is
> wrong; and what we're doing is still wrong, even after your patches.
> I just want to make sure that our performance doesn't go crashing
> through the floor in order to avoid the livelock problem. (Which I
> agree is a real problem, but we've lived with it for quite a while, and I
> haven't seen any evidence of it showing up in production.)
Well, I have a partner-filed bug for that one, so I'm motivated ;)
Why would fsync writing only TOWRITE-tagged pages cause performance
to crash through the floor?
Note patch 3 doesn't really require patch 2 or vice versa; they
are addressing two pretty orthogonal things.
-Eric
> - Ted
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3] ext4: update writeback_index based on last page scanned
2010-10-26 14:57 ` Eric Sandeen
@ 2010-10-26 18:59 ` Ted Ts'o
0 siblings, 0 replies; 9+ messages in thread
From: Ted Ts'o @ 2010-10-26 18:59 UTC (permalink / raw)
To: Eric Sandeen; +Cc: ext4 development
On Tue, Oct 26, 2010 at 09:57:14AM -0500, Eric Sandeen wrote:
> Ted Ts'o wrote:
> > On Mon, Oct 25, 2010 at 04:39:10PM -0500, Eric Sandeen wrote:
> >> Not compilebench specifically, but I did do some benchmarking
> >> with out of cache buffered IO; to be honest I didn't see
> >> striking performance differences, but I did see the writeback
> >> behave better in terms of not wandering all over, even if it
> >> might recover well.
> >>
> >> I can try compilebench; do you have specific concerns?
> >
> > My specific concern is that what happens if __mpage_da_writepage()
> > accumulates 200 pages, but then we were only able to accumulate 50
> > pages, and we only write 50 pages.
>
> Be patient with me, but how do we accumulate 200 pages but then only
> accumulate 50 pages?
Sorry, I typo'ed the word. What I meant was: we accumulate 200
contiguous dirty, delayed-allocation pages in the logical block
numberspace, but then we are only able to _allocate_ 50 pages'
worth of blocks which are contiguous in the physical block
numberspace, so we only end up writing 50 pages' worth of blocks.
But with your patch we end up skipping 200 pages, even though at the
end we only wrote 50 pages.
- Ted
^ permalink raw reply [flat|nested] 9+ messages in thread