From: Fengguang Wu <fengguang.wu@intel.com>
To: Wanpeng Li <liwp.linux@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Gavin Shan <shangw@linux.vnet.ibm.com>
Subject: Re: [PATCH V2] writeback: fix hung_task alarm when sync block
Date: Thu, 14 Jun 2012 21:26:26 +0800 [thread overview]
Message-ID: <20120614132626.GA14883@localhost> (raw)
In-Reply-To: <20120614013533.GB15051@kernel>
On Thu, Jun 14, 2012 at 09:35:34AM +0800, Wanpeng Li wrote:
> On Wed, Jun 13, 2012 at 10:48:40PM +0800, Fengguang Wu wrote:
> >Hi Jeff,
> >
> >On Wed, Jun 13, 2012 at 10:27:50AM -0400, Jeff Moyer wrote:
> >> Wanpeng Li <liwp.linux@gmail.com> writes:
> >>
> >> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >> > index f2d0109..df879ee 100644
> >> > --- a/fs/fs-writeback.c
> >> > +++ b/fs/fs-writeback.c
> >> > @@ -1311,7 +1311,11 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> >> >
> >> > WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >> > bdi_queue_work(sb->s_bdi, &work);
> >> > - wait_for_completion(&done);
> >> > + if (sysctl_hung_task_timeout_secs)
> >> > + while (!wait_for_completion_timeout(&done, HZ/2))
> >> > + ;
> >> > + else
> >> > + wait_for_completion(&done);
> >> > }
> >> > EXPORT_SYMBOL(writeback_inodes_sb_nr);
> >>
> >> Is it really expected that writeback_inodes_sb_nr will routinely queue
> >> up more than 2 seconds worth of I/O (Yes, I understand that it isn't the
> >> only entity issuing I/O)?
> >
> >Yes, in the case of syncing the whole superblock.
> >Basically sync() does its job in two steps:
> >
> >for all sb:
> > writeback_inodes_sb_nr() # WB_SYNC_NONE
> > sync_inodes_sb() # WB_SYNC_ALL
> >
> >> For devices that are really slow, it may make
> >> more sense to tune the system so that you don't have too much writeback
> >> I/O submitted at once. Dropping nr_requests for the given queue should
> >> fix this situation, I would think.
> >
> >The worried case is about sync() waiting
> >
> > (nr_dirty + nr_writeback) / write_bandwidth
> >
> >time, where it is nr_dirty that could grow rather large.
> >
> >For example, if dirty threshold is 1GB and write_bandwidth is 10MB/s,
> >the sync() will have to wait for 100 seconds. If there are heavy
> >dirtiers running during the sync, it will typically take several
> >hundreds of seconds (which looks not that good, but still much better
> >than being livelocked in some old kernels)..
> >
> >> This really feels like we're papering over the problem.
> >
> >That's true. The majority users probably don't want to cache 100s
> >worth of data in memory. It may be worthwhile to add a new per-bdi
> >limit whose unit is number-of-seconds (of dirty data).
> Hi Fengguang,
>
> Maybe we have already have a threshold "dirty_expire_interval" to ensure
> pages will not dirty more than 30 seconds. Why should add a similar
> variable ? I think per-bdi flusher will try its best to flush dirty pages
> when waken up, just because the backing storages is too slow. :-)
dirty_expire_interval is used to start background writeback and not
used for throttling the dirtier task. So the heavy dirtier can still
outrun the flusher thread and push dirty pages all the way to the
dirty limits.
So to actually keep dirty pages under control 30s, we need some time
based _limit_. However I'm not sure whether we are going to do this.
Because the bandwidth estimation may not be accurate, especially that
it's initialized to an arbitrary 100MB/s for all newly detected disks,
whether it be slow USB keys or fast enterprise SSD. So now you claim
to throttle dirty pages under 30s worth of data, however the user can
still be able to dirty 300s worth of data on a newly plugged USB key.
That makes the interface look silly and not working to user expectations.
Another worry is that it will break some existing workloads because
the new limit triggers throttling for them earlier than before.
Thanks,
Fengguang
next prev parent reply other threads:[~2012-06-14 13:26 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-06-13 4:42 [PATCH V2] writeback: fix hung_task alarm when sync block Wanpeng Li
2012-06-13 14:27 ` Jeff Moyer
2012-06-13 14:48 ` Fengguang Wu
2012-06-13 14:55 ` Fengguang Wu
2012-06-13 15:34 ` Jeff Moyer
2012-06-14 13:36 ` Fengguang Wu
2012-06-19 20:14 ` Jeff Moyer
2012-06-19 21:02 ` Dave Chinner
2012-06-19 21:09 ` Jeff Moyer
2012-06-19 21:56 ` Dave Chinner
2012-06-14 1:35 ` Wanpeng Li
2012-06-14 13:26 ` Fengguang Wu [this message]
2012-06-15 22:43 ` Dave Chinner
2012-06-14 10:52 ` Wanpeng Li
2012-06-15 22:38 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120614132626.GA14883@localhost \
--to=fengguang.wu@intel.com \
--cc=jmoyer@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=liwp.linux@gmail.com \
--cc=shangw@linux.vnet.ibm.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).