From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: David Chinner <dgc@sgi.com>
Cc: Andrew Morton <akpm@osdl.org>,
linux-kernel@vger.kernel.org, Ken Chen <kenchen@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Michael Rubin <mrubin@google.com>
Subject: Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
Date: Fri, 5 Oct 2007 19:55:08 +0800
Message-ID: <391585312.05654@ustc.edu.cn>
Message-ID: <20071005115508.GA9998@mail.ustc.edu.cn>
In-Reply-To: <20071005074103.GM23367404@sgi.com>
On Fri, Oct 05, 2007 at 05:41:03PM +1000, David Chinner wrote:
> On Fri, Oct 05, 2007 at 11:36:52AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> > > On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> >         if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) { /* all-written or blockade... */
> >                 if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
> >                         congestion_wait(WRITE, HZ/10);
> >                 else /* all-written! */
> >                         break;
> >         }
>
> From this, if we have more_io on one superblock and we skip pages on a
> different superblock, the combination of the two will cause us to stop
> writeback for a while. Is this the right thing to do?
No, the two cases will occur at the same time on a single super_block.
See below.
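For reference, the check quoted above can be modelled in user space roughly as follows. This is only a sketch: struct wbc_model, enum pass_result and classify_pass() are made-up names for illustration, not kernel code, and the struct carries just the four fields discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the writeback_control fields discussed here. */
struct wbc_model {
    long nr_to_write;            /* quota left after one writeback pass */
    long pages_skipped;          /* pages skipped during the pass */
    bool encountered_congestion; /* a queue was congested */
    bool more_io;                /* more io is pending (the flag this patch adds) */
};

enum pass_result { PASS_CONTINUE, PASS_WAIT, PASS_DONE };

/* Decide what the loop does after one pass, mirroring the quoted if/else. */
static enum pass_result classify_pass(const struct wbc_model *wbc)
{
    assert(wbc != NULL);

    if (wbc->nr_to_write > 0 || wbc->pages_skipped > 0) {
        if (wbc->encountered_congestion || wbc->more_io)
            return PASS_WAIT;    /* blockade: the kernel calls congestion_wait() */
        return PASS_DONE;        /* all written: the kernel breaks out of the loop */
    }
    return PASS_CONTINUE;        /* quota used up, nothing skipped: go again */
}
```

In this model a pass that uses up its whole quota without skipping anything never waits, whatever the two flags say.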
> > We can also read the whole background_writeout() logic as
> >
> >         while (!done) {
> >                 /* sync _all_ sync-able data */
> >                 congestion_wait(100ms);
> >         }
>
> To me it reads as:
>
>         while (!done) {
>                 /* sync all data or until one inode skips */
>                 congestion_wait(up to 100ms);
>         }
>
> and it ignores that we might have more superblocks with dirty data
> on them that we haven't flushed because we skipped pages on
> an inode on a different block device.
AFAIK, generic_sync_sb_inodes() will simply skip the inode in trouble
and _continue_ to sync other inodes:
        if (wbc->pages_skipped != pages_skipped) {
                /*
                 * writeback is not making progress due to locked
                 * buffers.  Skip this inode for now.
                 */
                redirty_tail(inode);
        }
Note that there's no "break" here.
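To illustrate the point, here is a small user-space sketch of "skip the inode in trouble and carry on". It is an assumption-laden model, not the generic_sync_sb_inodes() implementation: struct inode_model, its fields and sync_inodes_model() are hypothetical, and setting the redirtied flag merely stands in for redirty_tail().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode on a superblock's dirty list. */
struct inode_model {
    int dirty_pages;  /* pages waiting to be written */
    int locked;       /* nonzero: buffers are locked, so its pages get skipped */
    int redirtied;    /* set when the inode is put back for a later pass */
};

/*
 * Walk every inode once. An inode that causes pages to be skipped is
 * marked for a later pass and the walk simply moves on to the next one.
 * Returns the number of pages written during this pass.
 */
static int sync_inodes_model(struct inode_model *inodes, size_t n,
                             long *pages_skipped)
{
    int written = 0;

    assert(inodes != NULL && pages_skipped != NULL);

    for (size_t i = 0; i < n; i++) {
        long skipped_before = *pages_skipped;

        if (inodes[i].locked) {
            *pages_skipped += inodes[i].dirty_pages;
        } else {
            written += inodes[i].dirty_pages;
            inodes[i].dirty_pages = 0;
        }

        if (*pages_skipped != skipped_before)
            inodes[i].redirtied = 1;  /* skip this inode for now; no break */
    }
    return written;
}
```

With three inodes where only the middle one is locked, the first and the last are still written in the same pass; only the middle one is left for later.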
> > Sure, the queues should be filled as fast as possible.
> > How fast can we fill the queue? Let's measure it:
> >
> > //generated by the patch below
> >
> > [ 871.430700] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 54289 global 29911 0 0 wc _M tw -12 sk 0
> > [ 871.444718] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 53253 global 28857 0 0 wc _M tw -12 sk 0
> > [ 871.458764] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 52217 global 27834 0 0 wc _M tw -12 sk 0
> > [ 871.472797] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 51181 global 26780 0 0 wc _M tw -12 sk 0
> > [ 871.486825] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 50145 global 25757 0 0 wc _M tw -12 sk 0
> > [ 871.500857] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 49109 global 24734 0 0 wc _M tw -12 sk 0
> > [ 871.514864] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 48073 global 23680 0 0 wc _M tw -12 sk 0
> > [ 871.528889] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 47037 global 22657 0 0 wc _M tw -12 sk 0
> > [ 871.542894] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 46001 global 21603 0 0 wc _M tw -12 sk 0
> > [ 871.556927] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 44965 global 20580 0 0 wc _M tw -12 sk 0
> > [ 871.570961] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 43929 global 19557 0 0 wc _M tw -12 sk 0
> > [ 871.584992] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 42893 global 18503 0 0 wc _M tw -12 sk 0
> > [ 871.599005] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 41857 global 17480 0 0 wc _M tw -12 sk 0
> > [ 871.613027] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 40821 global 16426 0 0 wc _M tw -12 sk 0
> > [ 871.628626] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 39785 global 15403 961 0 wc _M tw -12 sk 0
> > [ 871.644439] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 38749 global 14380 1550 0 wc _M tw -12 sk 0
> > [ 871.660267] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 37713 global 13326 2573 0 wc _M tw -12 sk 0
> > [ 871.676236] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 36677 global 12303 3224 0 wc _M tw -12 sk 0
> > [ 871.692021] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 35641 global 11280 4154 0 wc _M tw -12 sk 0
> > [ 871.707824] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 34605 global 10226 4929 0 wc _M tw -12 sk 0
> > [ 871.723638] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 33569 global 9203 5735 0 wc _M tw -12 sk 0
> > [ 871.739708] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 32533 global 8149 6603 0 wc _M tw -12 sk 0
> > [ 871.756407] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 31497 global 7126 7409 0 wc _M tw -12 sk 0
> > [ 871.772165] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 30461 global 6103 8246 0 wc _M tw -12 sk 0
> > [ 871.788035] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 29425 global 5049 9052 0 wc _M tw -12 sk 0
> > [ 871.803896] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 28389 global 4026 9982 0 wc _M tw -12 sk 0
> > [ 871.820427] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 27353 global 2972 10757 0 wc _M tw -12 sk 0
> > [ 871.836728] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 26317 global 1949 11656 0 wc _M tw -12 sk 0
> > [ 871.853286] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 25281 global 895 12431 0 wc _M tw -12 sk 0
> > [ 871.868762] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 24245 global 58 13051 0 wc __ tw 168 sk 0
> >
> > It's an Intel Core 2 2.93GHz CPU and a SATA disk.
> > The trace shows that
> > - there's no congestion_wait() called in wb_kupdate()
> > - it takes wb_kupdate() ~15ms to sync every 4MB
>
> But it takes a modern SATA disk ~40-50ms to write 4MB (80-100MB/s).
> IOWs, what you've timed above is a burst workload, not a steady
> state behaviour. And it actually shows that the elevator queues
> are growing in contrast to your goal of preventing them from
> growing.
My goal really? ;-)
> In more detail, the first half of the trace indicates no pages under
> writeback, that tends to imply that all I/O is complete by the
> time wb_kupdate is woken - it's been sucked into the drive
> cache as fast as possible.
Right.
> About halfway through we start to see windup of the number of
> pages under writeback of about 800-900 pages per printk. That's
> 1024 pages minus 1 or 2 512k I/Os. This implies that the disk cache
> is now full and the disk has reached saturation. I/O is now
> being queued in the elevator. The last trace has 13051 pages under
> writeback, which at 128 pages per I/O is ~100 queued 512k I/Os.
>
> The default queue depth with cfq is 128 requests, and IIRC it
> congests at 7/8 full, or 112 requests. IOWs, the file that you
> wrote was about 10MB short of what is needed to see congestion on
> your test rig.
Exactly.
wfg ~% cat /sys/block/sda/queue/nr_requests
128
wfg ~% cat /sys/block/sda/queue/max_sectors_kb
512
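The queue arithmetic can be checked with a few helpers. This is a sketch under stated assumptions: a 4KB page size, and the 7/8 congestion threshold exactly as recalled ("IIRC") in the quoted text rather than taken from the block layer source. The function names are made up for illustration.

```c
#include <assert.h>

/* Pages that fit into one maximum-sized request (page size is an assumption). */
static unsigned long pages_per_request(unsigned long max_sectors_kb,
                                       unsigned long page_kb)
{
    assert(page_kb > 0);
    return max_sectors_kb / page_kb;
}

/* Requests needed to hold the given number of pages, rounded up. */
static unsigned long queued_requests(unsigned long pages,
                                     unsigned long pages_per_io)
{
    assert(pages_per_io > 0);
    return (pages + pages_per_io - 1) / pages_per_io;
}

/* Congestion threshold, assumed here to be 7/8 of nr_requests. */
static unsigned long congestion_threshold(unsigned long nr_requests)
{
    return nr_requests * 7 / 8;
}
```

With max_sectors_kb = 512 and 4KB pages, one request holds 128 pages, so the 13051 pages under writeback in the last trace line need 102 requests, below the assumed threshold of 112 for nr_requests = 128.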
More exactly, I was writing a huge file. That triggers
balance_dirty_pages(), then background_writeout(), and finally
wb_kupdate(). The trace messages were collected after the copy completed,
when wb_kupdate() started to sync the remaining (< background_thresh) data.
> So the trace shows we slept on neither congestion nor more_io
> and it points towards congestion being the thing that will typically
> block us on large file I/O. Before drawing any conclusions on
> whether wbc.more_io is needed or not, do you have any way of
> producing skipped pages when more_io is set?
No (and it is not that easy). (pages_skipped && more_io) events are rare indeed.
> > However, wb_kupdate() is syncing the data a bit slowly (4*1000/15 = 266MB/s),
> > could it be because of a lot of cond_resched()?
>
> You are using ext3? That would be my guess based simply on the write
> rate - ext3 has long been stuck at about that speed for buffered
> writes even on much faster block devices. If I'm right, try using
> XFS and see how much differently it behaves. I bet you hit
> congestion much sooner than you expect. ;)
Yes, I was running ext3. It seems that XFS is about the same speed:
[ 1427.278454] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 37974 global 16727 0 0 wc _M tw -4 sk 0
[ 1427.293653] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 36946 global 15704 0 0 wc _M tw -3 sk 0
[ 1427.308891] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 35919 global 14650 0 0 wc _M tw -13 sk 0
[ 1427.322462] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 34882 global 13937 0 0 wc _M tw 300 sk 0
[ 1427.338194] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 34158 global 12914 0 0 wc _M tw -9 sk 0
[ 1427.353473] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 33125 global 11860 0 0 wc _M tw -12 sk 0
[ 1427.362984] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 32089 global 11860 0 0 wc _M tw 1018 sk 0
That's 14ms per 4MB. Maybe it's a VFS issue.
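The rates quoted in this mail follow from simple arithmetic, sketched below. The helper names are invented for illustration, and the inputs are just the figures from the thread (4MB per sync step, 14-15ms per step, 80-100MB/s for the disk).

```c
#include <assert.h>

/* Rate in MB/s when `mb` megabytes are handled in `ms` milliseconds. */
static double throughput_mb_per_s(double mb, double ms)
{
    assert(ms > 0.0);
    return mb * 1000.0 / ms;
}

/* Time in milliseconds to move `mb` megabytes at `mb_per_s` MB/s. */
static double write_time_ms(double mb, double mb_per_s)
{
    assert(mb_per_s > 0.0);
    return mb * 1000.0 / mb_per_s;
}
```

throughput_mb_per_s(4.0, 15.0) gives about 266MB/s and throughput_mb_per_s(4.0, 14.0) about 285MB/s, while write_time_ms(4.0, 80.0) and write_time_ms(4.0, 100.0) give 50ms and 40ms. In other words, pages are being submitted roughly three times faster than the disk rate mentioned above.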