Re: Kswapd in 3.2.0-rc5 is a CPU hog

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dave Chinner <david@fromorbit.com>
To: nowhere <nowhere@hakkenden.ath.cx>
Cc: Michal Hocko <mhocko@suse.cz>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Kswapd in 3.2.0-rc5 is a CPU hog
Date: Sat, 24 Dec 2011 07:45:03 +1100	[thread overview]
Message-ID: <20111223204503.GC12731@dastard> (raw)
In-Reply-To: <1324638242.562.15.camel@rybalov.eng.ttk.net>

On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> D? D?N?., 23/12/2011 D2 21:20 +1100, Dave Chinner D?D,N?DuN?:
> > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > D? D?N?., 22/12/2011 D2 09:55 +1100, Dave Chinner D?D,N?DuN?:
> > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > > > [Let's CC linux-mm]
> > > > > 
> > > > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > > > file system operations and network i/o (i.e. file server). And there is
> > > > > > a problem with kswapd.
> > > > > 
> > > > > What kind of filesystem do you use?
> > > > > 
> > > > > > 
> > > > > > I'm playing with dd:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 
> > > > > > I.e. I'm filling page cache.
> > > > > > 
> > > > > > So when the machine is just rebooted, kswapd during this operation is
> > > > > > almost idle, just 5-8 percent according to top.
> > > > > > 
> > > > > > After ~5 days of uptime (5 days,  2:10), the same operation demands ~70%
> > > > > > for kswapd:
> > > > > > 
> > > > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > > > >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > > > > > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > > > > > 
> > > > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > > > time.
> > > > > > 
> > > > > > Also read performance degrades over time. After reboot:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > > > 
> > > > > > After ~5 days uptime:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > > > 
> > > > > > Whereas raw disk sequential read performance stays the same:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > > > 
> > > > > > Also after dropping caches, situation somehow improves, but not to the
> > > > > > state of freshly restarted system:
> > > > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > > > >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > > > > > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > > > > > 
> > > > > > perf shows:
> > > > > > 
> > > > > >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> > > > > >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> > > > > >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> > > > > >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> > > > > >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> > > > > >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> > > > > >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> > > > > >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> > > > > >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> > > > > >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> > > > > >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> > > > > >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> > > > > >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> > > > > 
> > > > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > > > spin lock contention.
> > > > 
> > > > That's just scanning superblocks, not apparently doing anything
> > > > useful like shrinking dentries or inodes attached to each sb. i.e.
> > > > the shrinkers are being called an awful lot and basically have
> > > > nothing to do. I'd be suspecting a problem higher up in the stack to
> > > > do with how shrink_slab is operating or being called.
> > > > 
> > > > I'd suggest gathering event traces for mm_shrink_slab_start/
> > > > mm_shrink_slab_end to try to see how the shrinkers are being
> > > > driven...
> > > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > 
> > > I have recompiled kernel with tracers, and today the problem is visible
> > > again. So here is the trace for mm_shrink_slab_start (it is HUGE):
> > > 
> > >          kswapd0   421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
> > >          kswapd0   421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
> > 
> > And possibly useless in this form. I need to see the
> > mm_shrink_slab_start/mm_shrink_slab_end events interleaved so I can
> > see exactly how much work each shrinker call is doing, and the start
> > events are truncated so not all the info I need is present.
> > 
> > Perhaps you should install trace-cmd.
> > 
> > $ trace-cmd record -e mm_shrink_slab*
> > (wait 30s, then ^C)
> > $ trace-cmd report > shrink.trace
> > 
> > And then compress and attach the trace file or put up on the web
> > somewhere for me ot download if it's too large for email...
> > 
> > As it is, there's ~940k pages in the LRU, and shrink_slab is being
> > called after 32, 95, 8, 8, 32 and 32 pages on the LRU have been
> > scanned. That seems like the shrinkers are being called rather too
> > often.
> > 
> > The end traces indicate the shrinker caches aren't able to free
> > anything. So it looks like the vmscan code has got itself in a
> > situation where it is not scanning many pages between shrinker
> > callouts, and the shrinkers scan but can't make any progress. Looks
> > like a vmscan balancing problem right now, not anything to do with
> > the shrinker code. A better trace will confirm that.
> > 
> > FWIW, if you use trace-cmd, it might be worthwhile collecting all the
> > vmscan trace events too, as that might help the VM folk understand
> > the problem without needing to ask you for more info.
> 
> ./trace-cmd record -e vmscan/*
> 
> Here is the report of trace-cmd while dd'ing
> https://80.237.6.56/report-dd.xz

Ok, it's not a shrink_slab() problem - it's just being called ~100uS
by kswapd. The pattern is:

	- reclaim 94 (batches of 32,32,30) pages from iinactive list
	  of zone 1, node 0, prio 12
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	- 40us gap
	- reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	- 40us gap
	- isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
	- isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	40us gap

And it just repeats over and over again. After a while, nid=0,zone=1
drops out of the traces, so reclaim only comes in batches of 10-30
pages from zone 2 between each shrink_slab() call.

The trace starts at 111209.881s, with 944776 pages on the LRUs. It
finishes at 111216.1 with kswapd going to sleep on node 0 with
930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
2,000 pages/s) which is awfully slow....

vmscan gurus - time for you to step in now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Dave Chinner <david@fromorbit.com>
To: nowhere <nowhere@hakkenden.ath.cx>
Cc: Michal Hocko <mhocko@suse.cz>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: Kswapd in 3.2.0-rc5 is a CPU hog
Date: Sat, 24 Dec 2011 07:45:03 +1100	[thread overview]
Message-ID: <20111223204503.GC12731@dastard> (raw)
In-Reply-To: <1324638242.562.15.camel@rybalov.eng.ttk.net>

On Fri, Dec 23, 2011 at 03:04:02PM +0400, nowhere wrote:
> В Пт., 23/12/2011 в 21:20 +1100, Dave Chinner пишет:
> > On Fri, Dec 23, 2011 at 01:01:20PM +0400, nowhere wrote:
> > > В Чт., 22/12/2011 в 09:55 +1100, Dave Chinner пишет:
> > > > On Wed, Dec 21, 2011 at 10:52:49AM +0100, Michal Hocko wrote:
> > > > > [Let's CC linux-mm]
> > > > > 
> > > > > On Wed 21-12-11 07:10:36, Nikolay S. wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > I'm using 3.2-rc5 on a machine, which atm does almost nothing except
> > > > > > file system operations and network i/o (i.e. file server). And there is
> > > > > > a problem with kswapd.
> > > > > 
> > > > > What kind of filesystem do you use?
> > > > > 
> > > > > > 
> > > > > > I'm playing with dd:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 
> > > > > > I.e. I'm filling page cache.
> > > > > > 
> > > > > > So when the machine is just rebooted, kswapd during this operation is
> > > > > > almost idle, just 5-8 percent according to top.
> > > > > > 
> > > > > > After ~5 days of uptime (5 days,  2:10), the same operation demands ~70%
> > > > > > for kswapd:
> > > > > > 
> > > > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > > > >   420 root      R   70  0.0  22:09.60    0 kswapd0
> > > > > > 17717 nowhere   D   27  0.2   0:01.81  10m dd
> > > > > > 
> > > > > > In fact, kswapd cpu usage on this operation steadily increases over
> > > > > > time.
> > > > > > 
> > > > > > Also read performance degrades over time. After reboot:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 16.211 s, 528 MB/s
> > > > > > 
> > > > > > After ~5 days uptime:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 29.0507 s, 294 MB/s
> > > > > > 
> > > > > > Whereas raw disk sequential read performance stays the same:
> > > > > > dd if=/some/big/file of=/dev/null bs=8M iflag=direct
> > > > > > 1019+1 records in
> > > > > > 1019+1 records out
> > > > > > 8553494018 bytes (8.6 GB) copied, 14.7286 s, 581 MB/s
> > > > > > 
> > > > > > Also after dropping caches, situation somehow improves, but not to the
> > > > > > state of freshly restarted system:
> > > > > >   PID USER      S %CPU %MEM    TIME+  SWAP COMMAND
> > > > > >   420 root      S   39  0.0  23:31.17    0 kswapd0
> > > > > > 19829 nowhere   D   24  0.2   0:02.72 7764 dd
> > > > > > 
> > > > > > perf shows:
> > > > > > 
> > > > > >     31.24%  kswapd0  [kernel.kallsyms]  [k] _raw_spin_lock
> > > > > >     26.19%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
> > > > > >     16.28%  kswapd0  [kernel.kallsyms]  [k] prune_super
> > > > > >      6.55%  kswapd0  [kernel.kallsyms]  [k] grab_super_passive
> > > > > >      5.35%  kswapd0  [kernel.kallsyms]  [k] down_read_trylock
> > > > > >      4.03%  kswapd0  [kernel.kallsyms]  [k] up_read
> > > > > >      2.31%  kswapd0  [kernel.kallsyms]  [k] put_super
> > > > > >      1.81%  kswapd0  [kernel.kallsyms]  [k] drop_super
> > > > > >      0.99%  kswapd0  [kernel.kallsyms]  [k] __put_super
> > > > > >      0.25%  kswapd0  [kernel.kallsyms]  [k] __isolate_lru_page
> > > > > >      0.23%  kswapd0  [kernel.kallsyms]  [k] free_pcppages_bulk
> > > > > >      0.19%  kswapd0  [r8169]            [k] rtl8169_interrupt
> > > > > >      0.15%  kswapd0  [kernel.kallsyms]  [k] twa_interrupt
> > > > > 
> > > > > Quite a lot of time spent shrinking slab (dcache I guess) and a lot of
> > > > > spin lock contention.
> > > > 
> > > > That's just scanning superblocks, not apparently doing anything
> > > > useful like shrinking dentries or inodes attached to each sb. i.e.
> > > > the shrinkers are being called an awful lot and basically have
> > > > nothing to do. I'd be suspecting a problem higher up in the stack to
> > > > do with how shrink_slab is operating or being called.
> > > > 
> > > > I'd suggest gathering event traces for mm_shrink_slab_start/
> > > > mm_shrink_slab_end to try to see how the shrinkers are being
> > > > driven...
> > > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > 
> > > I have recompiled kernel with tracers, and today the problem is visible
> > > again. So here is the trace for mm_shrink_slab_start (it is HUGE):
> > > 
> > >          kswapd0   421 [000] 103976.627873: mm_shrink_slab_start: prune_super+0x0 0xffff88011b00d300: objects to shrink 12 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 1500 delt
> > >          kswapd0   421 [000] 103976.627882: mm_shrink_slab_start: prune_super+0x0 0xffff88011a20ab00: objects to shrink 267 gfp_flags GFP_KERNELGFP_NOTRACK pgs_scanned 32 lru_pgs 942483 cache items 5300 del
> > 
> > And possibly useless in this form. I need to see the
> > mm_shrink_slab_start/mm_shrink_slab_end events interleaved so I can
> > see exactly how much work each shrinker call is doing, and the start
> > events are truncated so not all the info I need is present.
> > 
> > Perhaps you should install trace-cmd.
> > 
> > $ trace-cmd record -e mm_shrink_slab*
> > (wait 30s, then ^C)
> > $ trace-cmd report > shrink.trace
> > 
> > And then compress and attach the trace file or put up on the web
> > somewhere for me ot download if it's too large for email...
> > 
> > As it is, there's ~940k pages in the LRU, and shrink_slab is being
> > called after 32, 95, 8, 8, 32 and 32 pages on the LRU have been
> > scanned. That seems like the shrinkers are being called rather too
> > often.
> > 
> > The end traces indicate the shrinker caches aren't able to free
> > anything. So it looks like the vmscan code has got itself in a
> > situation where it is not scanning many pages between shrinker
> > callouts, and the shrinkers scan but can't make any progress. Looks
> > like a vmscan balancing problem right now, not anything to do with
> > the shrinker code. A better trace will confirm that.
> > 
> > FWIW, if you use trace-cmd, it might be worthwhile collecting all the
> > vmscan trace events too, as that might help the VM folk understand
> > the problem without needing to ask you for more info.
> 
> ./trace-cmd record -e vmscan/*
> 
> Here is the report of trace-cmd while dd'ing
> https://80.237.6.56/report-dd.xz

Ok, it's not a shrink_slab() problem - it's just being called ~100uS
by kswapd. The pattern is:

	- reclaim 94 (batches of 32,32,30) pages from iinactive list
	  of zone 1, node 0, prio 12
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	- 40us gap
	- reclaim 10-30 pages from inactive list of zone 2, node 0, prio 12
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	- 40us gap
	- isolate 9 pages from LRU zone ?, node ?, none isolated, none freed
	- isolate 22 pages from LRU zone ?, node ?, none isolated, none freed
	- call shrink_slab
		- scan all caches
		- all shrinkers return 0 saying nothing to shrink
	40us gap

And it just repeats over and over again. After a while, nid=0,zone=1
drops out of the traces, so reclaim only comes in batches of 10-30
pages from zone 2 between each shrink_slab() call.

The trace starts at 111209.881s, with 944776 pages on the LRUs. It
finishes at 111216.1 with kswapd going to sleep on node 0 with
930067 pages on the LRU. So 7 seconds to free 15,000 pages (call it
2,000 pages/s) which is awfully slow....

vmscan gurus - time for you to step in now...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

next prev parent reply	other threads:[~2011-12-23 20:45 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-21  3:10 Kswapd in 3.2.0-rc5 is a CPU hog Nikolay S.
2011-12-21  9:52 ` Michal Hocko
2011-12-21  9:52   ` Michal Hocko
2011-12-21 10:15   ` nowhere
2011-12-21 10:15     ` nowhere
2011-12-21 10:24     ` Michal Hocko
2011-12-21 10:24       ` Michal Hocko
2011-12-21 10:24       ` Michal Hocko
2011-12-21 10:52       ` nowhere
2011-12-21 10:52         ` nowhere
2011-12-21 10:52         ` nowhere
2011-12-21 14:06       ` Alex Elder
2011-12-21 14:06         ` Alex Elder
2011-12-21 14:06         ` Alex Elder
2011-12-21 14:19         ` nowhere
2011-12-21 14:19           ` nowhere
2011-12-21 14:19           ` nowhere
2011-12-21 22:55   ` Dave Chinner
2011-12-21 22:55     ` Dave Chinner
2011-12-23  9:01     ` nowhere
2011-12-23  9:01       ` nowhere
2011-12-23 10:20       ` Dave Chinner
2011-12-23 10:20         ` Dave Chinner
2011-12-23 11:04         ` nowhere
2011-12-23 11:04           ` nowhere
2011-12-23 20:45           ` Dave Chinner [this message]
2011-12-23 20:45             ` Dave Chinner
2011-12-25  9:09             ` Hillf Danton
2011-12-25  9:09               ` Hillf Danton
2011-12-25 10:21               ` Nikolay S.
2011-12-25 10:21                 ` Nikolay S.
2011-12-26 12:35                 ` Hillf Danton
2011-12-26 12:35                   ` Hillf Danton
2011-12-27  0:20                   ` KAMEZAWA Hiroyuki
2011-12-27  0:20                     ` KAMEZAWA Hiroyuki
2011-12-27 13:33                     ` Hillf Danton
2011-12-27 13:33                       ` Hillf Danton
2011-12-28  0:06                       ` KAMEZAWA Hiroyuki
2011-12-28  0:06                         ` KAMEZAWA Hiroyuki
2011-12-27  2:15             ` KAMEZAWA Hiroyuki
2011-12-27  2:15               ` KAMEZAWA Hiroyuki
2011-12-27  2:50               ` Nikolay S.
2011-12-27  2:50                 ` Nikolay S.
2011-12-27  4:44                 ` KAMEZAWA Hiroyuki
2011-12-27  4:44                   ` KAMEZAWA Hiroyuki
2011-12-27  6:06                   ` nowhere
2011-12-27  6:06                     ` nowhere
2011-12-28 21:33                   ` Dave Chinner
2011-12-28 21:33                     ` Dave Chinner
2011-12-28 22:57                     ` KOSAKI Motohiro
2011-12-28 22:57                       ` KOSAKI Motohiro
2012-01-02  7:00                       ` Dave Chinner
2012-01-02  7:00                         ` Dave Chinner
2011-12-27  3:57               ` Minchan Kim
2011-12-27  3:57                 ` Minchan Kim
2011-12-27  4:56                 ` KAMEZAWA Hiroyuki
2011-12-27  4:56                   ` KAMEZAWA Hiroyuki
2012-01-10 22:33                   ` Andrew Morton
2012-01-10 22:33                     ` Andrew Morton
2012-01-11  3:25                     ` Nikolay S.
2012-01-11  3:25                       ` Nikolay S.
2012-01-11  4:42                       ` Andrew Morton
2012-01-11  4:42                         ` Andrew Morton
2012-01-11  0:33                   ` Dave Chinner
2012-01-11  0:33                     ` Dave Chinner
2012-01-11  1:17                 ` Rik van Riel
2012-01-11  1:17                   ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20111223204503.GC12731@dastard \
    --to=david@fromorbit.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=nowhere@hakkenden.ath.cx \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.