From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>, Greg Thelen <gthelen@google.com>,
"bsingharora@gmail.com" <bsingharora@gmail.com>,
Hugh Dickins <hughd@google.com>, Michal Hocko <mhocko@suse.cz>,
linux-mm@kvack.org, Mel Gorman <mgorman@suse.de>,
Ying Han <yinghan@google.com>,
"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Minchan Kim <minchan.kim@gmail.com>
Subject: Re: reclaim the LRU lists full of dirty/writeback pages
Date: Thu, 16 Feb 2012 12:00:19 +0800
Message-ID: <20120216040019.GB17597@localhost>
In-Reply-To: <20120214132950.GE1934@quack.suse.cz>

On Tue, Feb 14, 2012 at 02:29:50PM +0100, Jan Kara wrote:
> > > I wonder what happens if you run:
> > > mkdir /cgroup/x
> > > echo 100M > /cgroup/x/memory.limit_in_bytes
> > > echo $$ > /cgroup/x/tasks
> > >
> > > for (( i = 0; i < 2; i++ )); do
> > > mkdir /fs/d$i
> > > for (( j = 0; j < 5000; j++ )); do
> > > dd if=/dev/zero of=/fs/d$i/f$j bs=1k count=50
> > > done &
> > > done
> >
> > That's a very good case, thanks!
> >
> > > Because for small files the writearound logic won't help much...
> >
> > Right, it also means the native background work cannot be more I/O
> > efficient than the pageout works, except for the overheads of more
> > work items..
> Yes, that's true.
>
> > > Also the number of work items queued might become interesting.
> >
> > It turns out that the 1024 mempool reservations are not exhausted at
> > all (the below patch has a trace_printk on alloc failure and it didn't
> > trigger at all).
> >
> > Here are the representative iostat lines on XFS (full "iostat -kx 1 20" log attached):
> >
> > avg-cpu: %user %nice %system %iowait %steal %idle
> > 0.80 0.00 6.03 0.03 0.00 93.14
> >
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> > sda 0.00 205.00 0.00 163.00 0.00 16900.00 207.36 4.09 21.63 1.88 30.70
> >
> > The attached dirtied/written progress graph looks interesting.
> > Although the iostat disk utilization is low, the "dirtied" progress
> > line is pretty straight and there is no single congestion_wait event
> > in the trace log. Which makes me wonder if there are some unknown
> > blocking issues in the way.
> Interesting. I'd also expect we should block in the reclaim path. How
> fast can the dd threads progress when there is no cgroup involved?
I tried running the dd tasks in global context with

	echo $((100<<20)) > /proc/sys/vm/dirty_bytes

and got mostly the same results on XFS:

avg-cpu: %user %nice %system %iowait %steal %idle
0.85 0.00 8.88 0.00 0.00 90.26
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 50.00 0.00 23036.00 921.44 9.59 738.02 7.38 36.90
avg-cpu: %user %nice %system %iowait %steal %idle
0.95 0.00 8.95 0.00 0.00 90.11
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 854.00 0.00 99.00 0.00 19552.00 394.99 34.14 87.98 3.82 37.80

Interestingly, ext4 shows comparable throughput, but reports near 100%
disk utilization:

avg-cpu: %user %nice %system %iowait %steal %idle
0.76 0.00 9.02 0.00 0.00 90.23
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 317.00 0.00 20956.00 132.21 28.57 82.71 3.16 100.10
avg-cpu: %user %nice %system %iowait %steal %idle
0.82 0.00 8.95 0.00 0.00 90.23
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 402.00 0.00 24388.00 121.33 21.09 58.55 2.42 97.40
avg-cpu: %user %nice %system %iowait %steal %idle
0.82 0.00 8.99 0.00 0.00 90.19
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 409.00 0.00 21996.00 107.56 15.25 36.74 2.30 94.10

And btrfs shows:

avg-cpu: %user %nice %system %iowait %steal %idle
0.76 0.00 23.59 0.00 0.00 75.65
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 801.00 0.00 141.00 0.00 48984.00 694.81 41.08 291.36 6.11 86.20
avg-cpu: %user %nice %system %iowait %steal %idle
0.72 0.00 12.65 0.00 0.00 86.62
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 792.00 0.00 69.00 0.00 15288.00 443.13 22.74 69.35 4.09 28.20
avg-cpu: %user %nice %system %iowait %steal %idle
0.83 0.00 23.11 0.00 0.00 76.06
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 73.00 0.00 33280.00 911.78 22.09 548.58 8.10 59.10
> > > Another common case to test - run the 'slapadd' command in each cgroup
> > > to create a big LDAP database. That does pretty much random IO on a big
> > > mmapped DB file.
> >
> > I've not used this. Will it need some configuration and data feed?
> > fio looks more handy to me for emulating mmap random IO.
> Yes, fio can generate random mmap IO. It's just that this is a real life
> workload. So it is not completely random, it happens on several files and
> is also interleaved with other memory allocations from DB. I can send you
> the config files and data feed if you are interested.
I'm very interested, thank you!
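
In the meantime, something like this fio invocation is roughly what I had
in mind for emulating the mmap random IO part (the path, sizes and job
counts below are just placeholders, not an attempt to match slapadd's real
access pattern):

	fio --name=mmap-rand --directory=/fs/d0 --ioengine=mmap \
	    --rw=randwrite --bs=4k --size=1g --nrfiles=5 --numjobs=2

It of course misses the interleaved DB memory allocations you mention, so
the real config and data feed will still be valuable.
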
> > > > +/*
> > > > + * schedule writeback on a range of inode pages.
> > > > + */
> > > > +static struct wb_writeback_work *
> > > > +bdi_flush_inode_range(struct backing_dev_info *bdi,
> > > > + struct inode *inode,
> > > > + pgoff_t offset,
> > > > + pgoff_t len,
> > > > + bool wait)
> > > > +{
> > > > + struct wb_writeback_work *work;
> > > > +
> > > > + if (!igrab(inode))
> > > > + return ERR_PTR(-ENOENT);
> > > One technical note here: If the inode is deleted while it is queued, this
> > > reference will keep it living until flusher thread gets to it. Then when
> > > flusher thread puts its reference, the inode will get deleted in flusher
> > > thread context. I don't see an immediate problem in that but it might be
> > > surprising sometimes. Another problem I see is that if you try to
> > > unmount the filesystem while the work item is queued, you'll get EBUSY for
> > > no apparent reason (for userspace).
> >
> > Yeah, we need to make umount work.
> The positive thing is that if the inode is reaped while the work item is
> queued, we know all that needed to be done is done. So we don't really
> need to pin the inode.
But I do need to make sure the *inode pointer does not point to some
invalid memory at work exec time. Is this possible without raising
->i_count?
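
One direction I can think of (just a rough sketch, not tested; the struct
and function names below are made up, not from the actual patch): don't
store the inode pointer at all, record (sb, ino) in the work item and
re-resolve the inode with ilookup() at exec time. If the inode has been
reaped meanwhile, ilookup() returns NULL and the work can simply be
skipped, which matches your point that nothing remains to be done:

	#include <linux/fs.h>	/* struct inode, ilookup(), iput() */

	/*
	 * Hypothetical work item: identify the inode by (sb, ino)
	 * instead of pinning it with igrab() at queue time.
	 */
	struct pageout_work {
		struct super_block	*sb;
		unsigned long		ino;
		pgoff_t			offset;
		pgoff_t			len;
	};

	static void pageout_work_exec(struct pageout_work *work)
	{
		/* ilookup() takes its own reference if the inode is still alive */
		struct inode *inode = ilookup(work->sb, work->ino);

		if (!inode)
			return;	/* already reaped: nothing left to write back */

		/* ... writeback of work->offset .. work->offset + work->len ... */

		iput(inode);
	}

That still leaves the umount race on the sb itself, so the queued works
would have to be flushed (or the sb pinned) before the filesystem can go
away.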
> > And I find that the pageout works seem to have some problems with ext4.
> > For example, this can be easily triggered with 10 dd tasks running
> > inside the 100MB-limited memcg:
> So the journal thread is getting stuck while committing a transaction.
> Most likely it is waiting for some dd thread to stop a transaction so that
> the commit can proceed. The processes waiting in start_this_handle() are
> just a secondary effect of the first problem. It might be interesting to
> get stack traces of all blocked processes when the journal thread is
> stuck.
For completeness of the discussion, citing your conclusion on the private
data feed I sent you:

: We enter memcg reclaim from grab_cache_page_write_begin() and are
: waiting in congestion_wait(). Because grab_cache_page_write_begin() is
: called with a transaction started, this blocks the transaction from
: committing and subsequently blocks all other activity on the
: filesystem. This isn't actually new with your patches; your changes, or
: the fact that we are running in a memory-constrained cgroup, just make
: it more visible.
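
If I read that right, the blocking chain is roughly (hand-written from the
above, not a real stack trace):

	ext4_write_begin
	  ext4_journal_start              <- journal handle held from here on
	  grab_cache_page_write_begin
	    add_to_page_cache_lru
	      mem_cgroup_cache_charge     -> hits the memcg limit
	        try_to_free_mem_cgroup_pages
	          congestion_wait         <- sleeps while holding the handle

The jbd2 commit then waits for that handle to be stopped, and every other
task on the filesystem piles up in start_this_handle() behind the pending
commit.
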
Thanks,
Fengguang