From: Larry Woodman <lwoodman@redhat.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>,
	"Frédéric Weisbecker" <fweisbec@gmail.com>,
	"Li Zefan" <lizf@cn.fujitsu.com>,
	"Pekka Enberg" <penberg@cs.helsinki.fi>,
	eduard.munteanu@linux360.ro, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, riel@redhat.com, rostedt@goodmis.org
Subject: Re: [Patch] mm tracepoints update - use case.
Date: Wed, 22 Apr 2009 15:22:31 -0400	[thread overview]
Message-ID: <1240428151.11613.46.camel@dhcp-100-19-198.bos.redhat.com> (raw)
In-Reply-To: <1240402037.4682.3.camel@dhcp47-138.lab.bos.redhat.com>

[-- Attachment #1: Type: text/plain, Size: 695 bytes --]

On Wed, 2009-04-22 at 08:07 -0400, Larry Woodman wrote:
> On Wed, 2009-04-22 at 11:57 +0200, Ingo Molnar wrote:
> > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > In past thread, Andrew pointed out bare page tracer isn't useful. 
> > 
> > (do you have a link to that mail?)
> > 
> > > Can you make good consumer?
> 
> I will work up some good examples of what these are useful for.  I use
> the mm tracepoint data in the debugfs trace buffer to locate customer
> performance problems associated with memory allocation, deallocation,
> paging and swapping frequently, especially on large systems.
> 
> Larry

Attached is an example of what the mm tracepoints can be used for:



[-- Attachment #2: usecase --]
[-- Type: text/plain, Size: 7275 bytes --]


At Red Hat I use these mm tracepoints in an older kernel version (2.6.18).
The following steps illustrate how the mm tracepoints were used to debug 
and ultimately fix a problem. 

1.) We had customer complaints about large NUMA systems burning up 100% of
a CPU in system mode when running memory-intensive applications that require
at least half, but not all, of the memory.

---------- top output -------------------------------------------------------
Tasks: 212 total,   2 running, 210 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16334996k total,  8979320k used,  7355676k free,     3280k buffers
Swap:  2031608k total,   129572k used,  1902036k free,   353220k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10723 root      20   0 16.0g 8.0g  376 R  100 51.4   0:17.78 mem
10724 root      20   0 12880 1224  872 R    1  0.0   0:00.06 top
 7822 root      20   0 10868  348  272 S    0  0.0   0:06.00 irqbalance
-----------------------------------------------------------------------------
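
(The "mem" process above is the customer's workload.  A hypothetical
reproducer along the same lines, not the customer's actual program, simply
allocates and keeps touching somewhat more than half of RAM:)

	/* mem.c - hypothetical reproducer: touch a bit more than half of RAM */
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		long pages = sysconf(_SC_PHYS_PAGES);
		long psize = sysconf(_SC_PAGE_SIZE);
		size_t len = (size_t)(pages / 2 + pages / 8) * psize;
		char *buf  = malloc(len);

		if (!buf)
			return 1;
		for (;;)		/* keep the pages hot */
			memset(buf, 0xa5, len);
		return 0;
	}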

2.) Using the mm tracepoints I could immediately see that __zone_reclaim() was
being called directly from the memory allocator, indicating that
zone_reclaim_mode is non-zero (1).  In addition I could see that the priority
had been decremented to zero and that 12342 pages had been reclaimed rather
than just enough to satisfy the page allocation request.

-----------------------------------------------------------------------------
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
<mem>-10723 [005]  6976.285610: mm_directreclaim_reclaimzone: reclaimed=12342, priority=0
-----------------------------------------------------------------------------
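
(The tracepoint definitions themselves are part of the proposed patch; for
readers unfamiliar with the mechanism, a minimal sketch of a TRACE_EVENT
definition that would produce output like the line above, with the field
names assumed from that output rather than copied from the patch, looks
roughly like this:)

	TRACE_EVENT(mm_directreclaim_reclaimzone,

		TP_PROTO(unsigned long reclaimed, int priority),

		TP_ARGS(reclaimed, priority),

		TP_STRUCT__entry(
			__field(unsigned long,	reclaimed)
			__field(int,		priority)
		),

		TP_fast_assign(
			__entry->reclaimed = reclaimed;
			__entry->priority  = priority;
		),

		TP_printk("reclaimed=%lu, priority=%d",
			__entry->reclaimed, __entry->priority)
	);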

3.) zone_reclaim_mode is set to 1 in build_zonelists() on NUMA systems with 
sufficient distance between the nodes:

                /*
                 * If another node is sufficiently far away then it is better
                 * to reclaim pages in a zone before going off node.
                 */
                if (distance > RECLAIM_DISTANCE)
                        zone_reclaim_mode = 1;
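
(For context: that check sits inside the node-ordering loop of the NUMA
build_zonelists(), where the distance comes from node_distance() and
RECLAIM_DISTANCE defaulted to 20 on most architectures of that era.  A rough,
abridged sketch, not the exact 2.6.18 source:)

	local_node = pgdat->node_id;
	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
		int distance = node_distance(local_node, node);

		/* reclaim locally before going to a sufficiently distant node */
		if (distance > RECLAIM_DISTANCE)
			zone_reclaim_mode = 1;

		/* ... append this node's zones to the zonelist ... */
	}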


4.) To verify that zone_reclaim_mode was involved, I disabled it with
"echo 0 > /proc/sys/vm/zone_reclaim_mode" and, sure enough, the problem went
away.

5.) Next, after a reboot (which set zone_reclaim_mode back to 1), the mm
tracepoints showed that several calls were made to shrink_zone() and that
each call had reclaimed many more pages than it should have:

-----------------------------------------------------------------------------
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <mem>-10723 [005]   282.776271: mm_pagereclaim_shrinkzone: reclaimed=12342
           <mem>-10723 [005]   282.781209: mm_pagereclaim_shrinkzone: reclaimed=3540
           <mem>-10723 [005]   282.801194: mm_pagereclaim_shrinkzone: reclaimed=7528
-----------------------------------------------------------------------------

6.) In between the shrink_zone() runs, shrink_active_list() and 
shrink_inactive_list() had run several times, each time fulfilling the memory
request from the pagecache.

-----------------------------------------------------------------------------
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <mem>-10723 [005]   282.755691: mm_pagereclaim_shrinkinactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755766: mm_pagereclaim_shrinkinactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755795: mm_pagereclaim_shrinkinactive: scanned=32, pagecache, priority=4
 ...
           <mem>-10723 [005]   282.755845: mm_pagereclaim_shrinkactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755882: mm_pagereclaim_shrinkactive: scanned=32, pagecache, priority=4
           <mem>-10723 [005]   282.755938: mm_pagereclaim_shrinkactive: scanned=32, pagecache, priority=4
-----------------------------------------------------------------------------

7.) This indicates that the direct memory reclaim code path, called directly
from the memory allocator when zone_reclaim_mode is non-zero, could reclaim
far more than SWAP_CLUSTER_MAX pages and consume significant CPU time doing
so:

-----------------------------------------------------------------------------
get_page_from_freelist(..)

                if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
                        unsigned long mark;
                        if (alloc_flags & ALLOC_WMARK_MIN)
                                mark = (*z)->pages_min;
                        else if (alloc_flags & ALLOC_WMARK_LOW)
                                mark = (*z)->pages_low;
                        else
                                mark = (*z)->pages_high;
                        if (!zone_watermark_ok(*z, order, mark,
                                    classzone_idx, alloc_flags))
                                if (!zone_reclaim_mode ||
                                    !zone_reclaim(*z, gfp_mask, order))
                                        continue;
                }

-----------------------------------------------------------------------------

8.) On further investigation I found that the 2.6.18 shrink_zone() was missing
an upstream patch that bails out as soon as SWAP_CLUSTER_MAX pages have been 
reclaimed.

-----------------------------------------------------------------------------
shrink_zone(...)

+               /*
+                * On large memory systems, scan >> priority can become
+                * really large. This is fine for the starting priority;
+                * we want to put equal scanning pressure on each zone.
+                * However, if the VM has a harder time of freeing pages,
+                * with multiple processes reclaiming pages, the total
+                * freeing target can get unreasonably large.
+                */
+               if (nr_reclaimed > swap_cluster_max &&
+                       priority < DEF_PRIORITY && !current_is_kswapd())
+                       break; 
-----------------------------------------------------------------------------

9.) Including this patch in shrink_zone() fixed the problem by terminating
reclaim once enough memory has been reclaimed to satisfy the __alloc_pages()
request on the local node.


This example is relatively simple and does not illustrate the use of every
one of the proposed mm tracepoints, but it shows how they can be used to
quickly drill down into performance and other problems without several
iterations of rebuilding the kernel to add debug code.

