From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: David Chinner <dgc@sgi.com>
Cc: Andrew Morton <akpm@osdl.org>,
linux-kernel@vger.kernel.org, Ken Chen <kenchen@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Michael Rubin <mrubin@google.com>
Subject: Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
Date: Fri, 5 Oct 2007 19:55:08 +0800
Message-ID: <391585312.05654@ustc.edu.cn>
Message-ID: <20071005115508.GA9998@mail.ustc.edu.cn>
In-Reply-To: <20071005074103.GM23367404@sgi.com>
On Fri, Oct 05, 2007 at 05:41:03PM +1000, David Chinner wrote:
> On Fri, Oct 05, 2007 at 11:36:52AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> > > On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> >     if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* all-written or blockade... */
> >         if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
> >             congestion_wait(WRITE, HZ/10);
> >         else /* all-written! */
> >             break;
> >     }
>
> From this, if we have more_io on one superblock and we skip pages on a
> different superblock, the combination of the two will cause us to stop
> writeback for a while. Is this the right thing to do?
No, the two cases will occur at the same time to a super_block.
See below.
> > We can also read the whole background_writeout() logic as
> >
> >     while (!done) {
> >         /* sync _all_ sync-able data */
> >         congestion_wait(100ms);
> >     }
>
> To me it reads as:
>
>     while (!done) {
>         /* sync all data or until one inode skips */
>         congestion_wait(up to 100ms);
>     }
>
> and it ignores that we might have more superblocks with dirty data
> on them that we haven't flushed because we skipped pages on
> an inode on a different block device.
AFAIK, generic_sync_sb_inodes() will simply skip the inode in trouble
and _continue_ to sync other inodes:
    if (wbc->pages_skipped != pages_skipped) {
        /*
         * writeback is not making progress due to locked
         * buffers. Skip this inode for now.
         */
        redirty_tail(inode);
    }
Note that there's no "break" here.
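
To make the skip-and-continue behaviour concrete, here is a toy model of it. It is only a sketch, not the kernel code: struct toy_inode, its fields and sync_inodes_sketch() are names made up for this illustration, and the "requeued" flag merely stands in for what redirty_tail() does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode on a superblock's dirty list. */
struct toy_inode {
	int  locked;       /* 1 if its buffers are locked, so pages get skipped */
	long dirty_pages;  /* pages still waiting to be written */
	int  requeued;     /* set when the sketch "redirties" the inode */
};

/*
 * Walk all inodes once. An inode that makes no progress is requeued
 * and the walk moves on to the next one; there is no early exit.
 * Returns the number of pages written; *pages_skipped accumulates
 * the pages that could not be written.
 */
long sync_inodes_sketch(struct toy_inode *inodes, size_t n, long *pages_skipped)
{
	long written = 0;
	size_t i;

	assert(inodes != NULL && pages_skipped != NULL);

	for (i = 0; i < n; i++) {
		long skipped_before = *pages_skipped;

		if (inodes[i].locked) {
			*pages_skipped += inodes[i].dirty_pages;
		} else {
			written += inodes[i].dirty_pages;
			inodes[i].dirty_pages = 0;
		}

		if (*pages_skipped != skipped_before) {
			/* No progress on this inode: requeue it, then carry on. */
			inodes[i].requeued = 1;
		}
	}
	return written;
}
```

In this model a locked inode in the middle of the list only gets requeued; the inodes after it are still written in the same pass.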
> > Sure, the queues should be filled as fast as possible.
> > How fast can we fill the queue? Let's measure it:
> >
> > //generated by the patch below
> >
> > [ 871.430700] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 54289 global 29911 0 0 wc _M tw -12 sk 0
> > [ 871.444718] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 53253 global 28857 0 0 wc _M tw -12 sk 0
> > [ 871.458764] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 52217 global 27834 0 0 wc _M tw -12 sk 0
> > [ 871.472797] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 51181 global 26780 0 0 wc _M tw -12 sk 0
> > [ 871.486825] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 50145 global 25757 0 0 wc _M tw -12 sk 0
> > [ 871.500857] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 49109 global 24734 0 0 wc _M tw -12 sk 0
> > [ 871.514864] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 48073 global 23680 0 0 wc _M tw -12 sk 0
> > [ 871.528889] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 47037 global 22657 0 0 wc _M tw -12 sk 0
> > [ 871.542894] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 46001 global 21603 0 0 wc _M tw -12 sk 0
> > [ 871.556927] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 44965 global 20580 0 0 wc _M tw -12 sk 0
> > [ 871.570961] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 43929 global 19557 0 0 wc _M tw -12 sk 0
> > [ 871.584992] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 42893 global 18503 0 0 wc _M tw -12 sk 0
> > [ 871.599005] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 41857 global 17480 0 0 wc _M tw -12 sk 0
> > [ 871.613027] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 40821 global 16426 0 0 wc _M tw -12 sk 0
> > [ 871.628626] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 39785 global 15403 961 0 wc _M tw -12 sk 0
> > [ 871.644439] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 38749 global 14380 1550 0 wc _M tw -12 sk 0
> > [ 871.660267] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 37713 global 13326 2573 0 wc _M tw -12 sk 0
> > [ 871.676236] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 36677 global 12303 3224 0 wc _M tw -12 sk 0
> > [ 871.692021] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 35641 global 11280 4154 0 wc _M tw -12 sk 0
> > [ 871.707824] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 34605 global 10226 4929 0 wc _M tw -12 sk 0
> > [ 871.723638] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 33569 global 9203 5735 0 wc _M tw -12 sk 0
> > [ 871.739708] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 32533 global 8149 6603 0 wc _M tw -12 sk 0
> > [ 871.756407] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 31497 global 7126 7409 0 wc _M tw -12 sk 0
> > [ 871.772165] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 30461 global 6103 8246 0 wc _M tw -12 sk 0
> > [ 871.788035] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 29425 global 5049 9052 0 wc _M tw -12 sk 0
> > [ 871.803896] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 28389 global 4026 9982 0 wc _M tw -12 sk 0
> > [ 871.820427] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 27353 global 2972 10757 0 wc _M tw -12 sk 0
> > [ 871.836728] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 26317 global 1949 11656 0 wc _M tw -12 sk 0
> > [ 871.853286] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 25281 global 895 12431 0 wc _M tw -12 sk 0
> > [ 871.868762] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 24245 global 58 13051 0 wc __ tw 168 sk 0
> >
> > It's an Intel Core 2 2.93GHz CPU and a SATA disk.
> > The trace shows that
> > - there's no congestion_wait() called in wb_kupdate()
> > - it takes wb_kupdate() ~15ms to sync every 4MB
>
> But it takes a modern SATA disk ~40-50ms to write 4MB (80-100MB/s).
> IOWs, what you've timed above is a burst workload, not a steady
> state behaviour. And it actually shows that the elevator queues
> are growing in contrast to your goal of preventing them from
> growing.
My goal really? ;-)
> In more detail, the first half of the trace indicates no pages under
> writeback, that tends to imply that all I/O is complete by the
> time wb_kupdate is woken - it's been sucked into the drive
> cache as fast as possible.
Right.
> About half way through we start to see windup of the number of
> pages under writeback of about 800-900 pages per printk. That's
> 1024 pages minus 1 or 2 512k I/Os. This implies that the disk cache
> is now full and the disk has reached saturation. I/O is now
> being queued in the elevator. The last trace has 13051 pages under
> writeback, which at 128 pages per I/O is ~100 queued 512k I/Os.
>
> The default queue depth with cfq is 128 requests, and IIRC it
> congests at 7/8s full, or 112 requests. IOWs, the file that you
> wrote was about 10MB short of what is needed to see congestion on
> your test rig.
Exactly.
wfg ~% cat /sys/block/sda/queue/nr_requests
128
wfg ~% cat /sys/block/sda/queue/max_sectors_kb
512
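
The queue arithmetic above can be checked with a few lines of C. This is a rough sketch under stated assumptions: the 4KB page size is assumed (it is not stated in this thread), the 7/8 congestion point is David's recollection rather than something verified against the block layer, and the helper names are invented for the example.

```c
#include <assert.h>

/* Pages carried by one request of max_kb kilobytes (page_kb is assumed). */
long pages_per_request(long max_kb, long page_kb)
{
	assert(max_kb > 0 && page_kb > 0);
	return max_kb / page_kb;
}

/* Requests needed to hold `pages` pages, rounding up. */
long queued_requests(long pages, long per_request)
{
	assert(pages >= 0 && per_request > 0);
	return (pages + per_request - 1) / per_request;
}

/* Congestion point taken as 7/8 of the queue depth, per the estimate above. */
long congestion_point(long nr_requests)
{
	assert(nr_requests > 0);
	return nr_requests * 7 / 8;
}
```

With max_sectors_kb=512 and 4KB pages that is 128 pages per request, so the 13051 pages under writeback in the last trace line need 102 requests, still below the estimated congestion point of 112 for nr_requests=128.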
More exactly, I was writing a huge file. It produces
balance_dirty_pages, background_writeout, and at last wb_kupdate. The
trace messages are collected after the copy completes, when
wb_kupdate() starts to sync the remaining (< background_thresh) data.
> So the trace shows we slept on neither congestion nor more_io
> and it points towards congestion being the thing that will typically
> block us on large file I/O. Before drawing any conclusions on
> whether wbc.more_io is needed or not, do you have any way of
> producing skipped pages when more_io is set?
No (and it's not that easy). (pages_skipped && more_io) events are rare indeed.
> > However, wb_kupdate() is syncing the data a bit slowly (4*1000/15=266MB/s),
> > could it be because of a lot of cond_resched()?
>
> You are using ext3? That would be my guess based simply on the write
> rate - ext3 has long been stuck at about that speed for buffered
> writes even on much faster block devices. If I'm right, try using
> XFS and see how much differently it behaves. I bet you hit
> congestion much sooner than you expect. ;)
Yes, I was running ext3. It seems that XFS is about the same speed:
[ 1427.278454] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 37974 global 16727 0 0 wc _M tw -4 sk 0
[ 1427.293653] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 36946 global 15704 0 0 wc _M tw -3 sk 0
[ 1427.308891] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 35919 global 14650 0 0 wc _M tw -13 sk 0
[ 1427.322462] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 34882 global 13937 0 0 wc _M tw 300 sk 0
[ 1427.338194] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 34158 global 12914 0 0 wc _M tw -9 sk 0
[ 1427.353473] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 33125 global 11860 0 0 wc _M tw -12 sk 0
[ 1427.362984] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 32089 global 11860 0 0 wc _M tw 1018 sk 0
That's 14ms per 4MB. Maybe it's a VFS issue.
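
For reference, the per-pass figure can be derived from the printk timestamps like this. It is a small sketch with invented helper names; it assumes each trace line marks one pass of roughly 4MB, and it uses the same 4*1000/ms convention as the 266MB/s estimate earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Mean gap in milliseconds between consecutive timestamps given in seconds. */
double mean_interval_ms(const double *ts, size_t n)
{
	assert(ts != NULL && n >= 2);
	return (ts[n - 1] - ts[0]) / (double)(n - 1) * 1000.0;
}

/* Rate in MB/s for chunk_mb megabytes every interval_ms milliseconds. */
double rate_mb_per_s(double chunk_mb, double interval_ms)
{
	assert(interval_ms > 0.0);
	return chunk_mb * 1000.0 / interval_ms;
}
```

Feeding it the seven XFS timestamps above gives a mean of about 14.1ms per pass, or roughly 284MB/s at 4MB per pass.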