linux-mm.kvack.org archive mirror
* [RFC PATCH 0/4] Add some trace events for the page allocator v2
@ 2009-07-29 21:05 Mel Gorman
  2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
  To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
  Cc: LKML, linux-mm, Mel Gorman

In this version, I switched the CC list to match the one Larry Woodman
mailed for his "mm tracepoints" patch, which I wasn't previously aware
of, and brought the naming scheme more in line with Larry's as his
naming scheme was very sensible.

This patchset only considers the page-allocator-related events instead of
the much more comprehensive approach Larry took. I included a
post-processing script because Andrew's main complaint with Larry's work,
as I saw it, was a lack of tools that could give a higher-level view of
what was going on. If this works out, the other mm tracepoints can be
dealt with in piecemeal chunks.

Changelog since V1
  o Fix minor formatting error for the __rmqueue event
  o Add event for __pagevec_free
  o Bring naming more in line with Larry Woodman's tracing patch
  o Add an example post-processing script for the trace events

The following four patches add some trace events for the page allocator
under the heading of kmem (pagealloc heading instead?).

	Patch 1 adds events for the plain allocation and freeing of pages
	Patch 2 gives information useful for analysing fragmentation avoidance
	Patch 3 tracks pages going to and from the buddy lists as an indirect
		indication of zone lock hotness
	Patch 4 adds a post-processing script that aggregates the events to
		give a higher-level view

The first patch could be used as an indicator of whether the workload was
heavily dependent on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for page allocation may be __get_free_pages() instead
of a useful callsite. Instead of passing down a return address as slab
debugging does, the user should enable the stacktrace and sym-addr trace
options to get a proper stack trace.
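
As a sketch only (the paths assume debugfs is mounted at the usual
/sys/kernel/debug location and that the tracepoints land in the kmem
event group; adjust to the running kernel), the events and per-event
stack traces could be enabled with something like:

```shell
# Sketch: assumes debugfs is mounted at /sys/kernel/debug and that
# the new tracepoints live under the kmem event group.
TRACING=/sys/kernel/debug/tracing

echo 1 > $TRACING/events/kmem/mm_page_alloc/enable
echo 1 > $TRACING/events/kmem/mm_page_free_direct/enable

# Record a full call stack and symbolic addresses with each event,
# rather than relying on call_site alone.
echo stacktrace > $TRACING/trace_options
echo sym-addr > $TRACING/trace_options

cat $TRACING/trace
```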

The second patch would mainly be useful for users of hugepages, and
particularly dynamic hugepage pool resizing, as it could be used to tune
min_free_kbytes to a level at which fragmentation is rarely a problem. My
main concern is that maybe I'm trying to jam too much into the TP_printk
that could be extrapolated after the fact by anyone familiar with the
implementation. I couldn't decide whether it was best to hold the
administrator's hand even if it costs more to figure out.
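
One hypothetical tuning workflow (the sysctl value below is purely
illustrative, not a recommendation) would be to watch the fallback event
while the workload runs and raise the watermark if fallbacks persist:

```shell
# Hypothetical workflow: enable the fallback event, run the workload,
# count fallbacks, and raise min_free_kbytes if they keep occurring.
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_page_alloc_extfrag/enable

# ... run the workload of interest ...

grep -c mm_page_alloc_extfrag /sys/kernel/debug/tracing/trace

# Example value only; pick one appropriate to the machine.
sysctl -w vm.min_free_kbytes=65536
```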

The third patch is trickier to draw conclusions from but high activity on
those events could explain why there were a large number of cache misses
on a page-allocator-intensive workload. The coalescing and splitting of
buddies involves a lot of writing of page metadata and cache line bounces
not to mention the acquisition of an interrupt-safe lock necessary to enter
this path.

The fourth patch parses the trace buffer to draw a higher-level picture of
what is going on broken down on a per-process basis.

All comments indicating whether this is generally useful and how it might
be improved are welcome.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>


* [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
  2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
  2009-07-30  0:55   ` Rik van Riel
  2009-07-29 21:05 ` [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
  To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
  Cc: LKML, linux-mm, Mel Gorman

This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs.  Using the events, it can be determined
which struct pages and pfns are being allocated and freed, and in many
cases what the call site was.

The page alloc tracepoints can be used as an indicator of whether the
workload was heavily dependent on the page allocator or not. You can make
a guess based on vmstat but you can't get a per-process breakdown.
Depending on the call path, the call_site for page allocation may be
__get_free_pages() instead of a useful callsite. Instead of passing down a
return address as slab debugging does, the user should enable the
stacktrace and sym-addr trace options to get a proper stack trace.

The pagevec free tracepoint has a different usecase. It can be used to
get an idea of how many pages are being dumped off the LRU and whether it
is kswapd doing the work or a process doing direct reclaim.
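
On trace text in the format these patches emit, a rough per-process
breakdown of the pagevec frees can be had with standard tools. The two
sample lines below are fabricated for illustration; real input would come
from the tracing 'trace' file:

```shell
# Rough per-process count of mm_pagevec_free events from trace text.
# The sample lines are fabricated, not real trace output.
trace_sample='        kswapd0-283   [000]   512.100000: mm_pagevec_free: call_site=c01004ea page=c15a2a80 pfn=1234 order=0 cold=1
           find-2840   [001]   512.200000: mm_pagevec_free: call_site=c01004eb page=c15a2b00 pfn=1235 order=0 cold=0'

per_proc=$(printf '%s\n' "$trace_sample" | \
	awk '/mm_pagevec_free/ { count[$1]++ } END { for (p in count) print p, count[p] }' | \
	sort)
echo "$per_proc"
```

On the fabricated sample this reports one pagevec free each for
find-2840 and kswapd0-283, hinting at who is doing the LRU work.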

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/kmem.h |   86 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    6 ++-
 2 files changed, 91 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 1493c54..57bf13c 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -225,6 +225,92 @@ TRACE_EVENT(kmem_cache_free,
 
 	TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
 );
+
+TRACE_EVENT(mm_page_free_direct,
+
+	TP_PROTO(unsigned long call_site, const void *page, unsigned int order),
+
+	TP_ARGS(call_site, page, order),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	call_site	)
+		__field(	const void *,	page		)
+		__field(	unsigned int,	order		)
+	),
+
+	TP_fast_assign(
+		__entry->call_site	= call_site;
+		__entry->page		= page;
+		__entry->order		= order;
+	),
+
+	TP_printk("call_site=%lx page=%p pfn=%lu order=%d",
+			__entry->call_site,
+			__entry->page,
+			page_to_pfn((struct page *)__entry->page),
+			__entry->order)
+);
+
+TRACE_EVENT(mm_pagevec_free,
+
+	TP_PROTO(unsigned long call_site, const void *page, int order, int cold),
+
+	TP_ARGS(call_site, page, order, cold),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	call_site	)
+		__field(	const void *,	page		)
+		__field(	int,		order		)
+		__field(	int,		cold		)
+	),
+
+	TP_fast_assign(
+		__entry->call_site	= call_site;
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->cold		= cold;
+	),
+
+	TP_printk("call_site=%lx page=%p pfn=%lu order=%d cold=%d",
+			__entry->call_site,
+			__entry->page,
+			page_to_pfn((struct page *)__entry->page),
+			__entry->order,
+			__entry->cold)
+);
+
+TRACE_EVENT(mm_page_alloc,
+
+	TP_PROTO(unsigned long call_site, const void *page, unsigned int order,
+			gfp_t gfp_flags, int migratetype),
+
+	TP_ARGS(call_site, page, order, gfp_flags, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	call_site	)
+		__field(	const void *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	gfp_t,		gfp_flags	)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->call_site	= call_site;
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->gfp_flags	= gfp_flags;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("call_site=%lx page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
+		__entry->call_site,
+		__entry->page,
+		page_to_pfn((struct page *)__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index caa9268..6cd8730 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1894,6 +1894,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
 
+	trace_mm_page_alloc(_RET_IP_, page, order, gfp_mask, migratetype);
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -1934,12 +1935,15 @@ void __pagevec_free(struct pagevec *pvec)
 {
 	int i = pagevec_count(pvec);
 
-	while (--i >= 0)
+	while (--i >= 0) {
+		trace_mm_pagevec_free(_RET_IP_, pvec->pages[i], 0, pvec->cold);
 		free_hot_cold_page(pvec->pages[i], pvec->cold);
+	}
 }
 
 void __free_pages(struct page *page, unsigned int order)
 {
+	trace_mm_page_free_direct(_RET_IP_, page, order);
 	if (put_page_testzero(page)) {
 		if (order == 0)
 			free_hot_page(page);
-- 
1.6.3.3


* [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
  2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
  2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
  2009-07-30  1:39   ` Rik van Riel
  2009-07-29 21:05 ` [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
  2009-07-29 21:05 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
  3 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
  To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
  Cc: LKML, linux-mm, Mel Gorman

Fragmentation avoidance depends on being able to use free pages from
lists of the appropriate migrate type. In the event this is not
possible, __rmqueue_fallback() selects a different list and in some
circumstances changes the migratetype of the pageblock. Simplistically,
the more times this event occurs, the more likely it is that fragmentation
will be a problem later, at least for hugepage allocation, but there are
other considerations such as the order of the page being split to satisfy
the allocation.

This patch adds a trace event for __rmqueue_fallback() that reports what
page is being used for the fallback, the orders of relevant pages, the
desired migratetype and the migratetype of the lists being used, whether
the pageblock changed type and whether this event is important with
respect to fragmentation avoidance or not. This information can be used
to help analyse fragmentation avoidance and help decide whether
min_free_kbytes should be increased or not.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/kmem.h |   44 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    6 +++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 57bf13c..0b4002e 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -311,6 +311,50 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_extfrag,
+
+	TP_PROTO(const void *page,
+			int alloc_order, int fallback_order,
+			int alloc_migratetype, int fallback_migratetype,
+			int fragmenting, int change_ownership),
+
+	TP_ARGS(page,
+		alloc_order, fallback_order,
+		alloc_migratetype, fallback_migratetype,
+		fragmenting, change_ownership),
+
+	TP_STRUCT__entry(
+		__field(	const void *,	page			)
+		__field(	int,		alloc_order		)
+		__field(	int,		fallback_order		)
+		__field(	int,		alloc_migratetype	)
+		__field(	int,		fallback_migratetype	)
+		__field(	int,		fragmenting		)
+		__field(	int,		change_ownership	)
+	),
+
+	TP_fast_assign(
+		__entry->page			= page;
+		__entry->alloc_order		= alloc_order;
+		__entry->fallback_order		= fallback_order;
+		__entry->alloc_migratetype	= alloc_migratetype;
+		__entry->fallback_migratetype	= fallback_migratetype;
+		__entry->fragmenting		= fragmenting;
+		__entry->change_ownership	= change_ownership;
+	),
+
+	TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
+		__entry->page,
+		page_to_pfn((struct page *)__entry->page),
+		__entry->alloc_order,
+		__entry->fallback_order,
+		pageblock_order,
+		__entry->alloc_migratetype,
+		__entry->fallback_migratetype,
+		__entry->fragmenting,
+		__entry->change_ownership)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6cd8730..8113403 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -839,6 +839,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 							start_migratetype);
 
 			expand(zone, page, order, current_order, area, migratetype);
+
+			trace_mm_page_alloc_extfrag(page, order, current_order,
+				start_migratetype, migratetype,
+				current_order < pageblock_order,
+				migratetype == start_migratetype);
+
 			return page;
 		}
 	}
-- 
1.6.3.3


* [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
  2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
  2009-07-29 21:05 ` [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
  2009-07-30 13:43   ` Rik van Riel
  2009-07-29 21:05 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
  3 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
  To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
  Cc: LKML, linux-mm, Mel Gorman

The page allocation trace event reports that a page was successfully allocated
but it does not specify where it came from. When analysing performance,
it can be important to distinguish between pages coming from the per-cpu
allocator and pages coming from the buddy lists as the latter requires the
zone lock to be taken and more data structures to be examined.

This patch adds a trace event for __rmqueue reporting when a page is being
allocated from the buddy lists. It distinguishes between calls made to
refill the per-cpu lists and high-order allocations.
Similarly, this patch adds an event to catch when the PCP lists are being
drained a little and pages are going back to the buddy lists.

This is trickier to draw conclusions from but high activity on those
events could explain why there were a large number of cache misses on a
page-allocator-intensive workload. The coalescing and splitting of buddies
involves a lot of writing of page metadata and cache line bounces not to
mention the acquisition of an interrupt-safe lock necessary to enter this
path.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/kmem.h |   54 +++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c             |    2 +
 2 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 0b4002e..3be3df3 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -311,6 +311,60 @@ TRACE_EVENT(mm_page_alloc,
 		show_gfp_flags(__entry->gfp_flags))
 );
 
+TRACE_EVENT(mm_page_alloc_zone_locked,
+
+	TP_PROTO(const void *page, unsigned int order,
+				int migratetype, int percpu_refill),
+
+	TP_ARGS(page, order, migratetype, percpu_refill),
+
+	TP_STRUCT__entry(
+		__field(	const void *,	page		)
+		__field(	unsigned int,	order		)
+		__field(	int,		migratetype	)
+		__field(	int,		percpu_refill	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+		__entry->percpu_refill	= percpu_refill;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%u migratetype=%d percpu_refill=%d",
+		__entry->page,
+		page_to_pfn((struct page *)__entry->page),
+		__entry->order,
+		__entry->migratetype,
+		__entry->percpu_refill)
+);
+
+TRACE_EVENT(mm_page_pcpu_drain,
+
+	TP_PROTO(const void *page, int order, int migratetype),
+
+	TP_ARGS(page, order, migratetype),
+
+	TP_STRUCT__entry(
+		__field(	const void *,	page		)
+		__field(	int,		order		)
+		__field(	int,		migratetype	)
+	),
+
+	TP_fast_assign(
+		__entry->page		= page;
+		__entry->order		= order;
+		__entry->migratetype	= migratetype;
+	),
+
+	TP_printk("page=%p pfn=%lu order=%d migratetype=%d",
+		__entry->page,
+		page_to_pfn((struct page *)__entry->page),
+		__entry->order,
+		__entry->migratetype)
+);
+
 TRACE_EVENT(mm_page_alloc_extfrag,
 
 	TP_PROTO(const void *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8113403..1bcef16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -535,6 +535,7 @@ static void free_pages_bulk(struct zone *zone, int count,
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
+		trace_mm_page_pcpu_drain(page, order, page_private(page));
 		__free_one_page(page, zone, order, page_private(page));
 	}
 	spin_unlock(&zone->lock);
@@ -878,6 +879,7 @@ retry_reserve:
 		}
 	}
 
+	trace_mm_page_alloc_zone_locked(page, order, migratetype, order == 0);
 	return page;
 }
 
-- 
1.6.3.3


* [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
                   ` (2 preceding siblings ...)
  2009-07-29 21:05 ` [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
  2009-07-30 13:45   ` Rik van Riel
  3 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
  To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
  Cc: LKML, linux-mm, Mel Gorman

This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like

find-2840
 o pages allocd            = 1877
 o pages allocd under lock = 1817
 o pages freed directly    = 9
 o pcpu refills            = 1078
 o migrate fallbacks       = 48
   - fragmentation causing = 48
     - severe              = 46
     - moderate            = 2
   - changed migratetype   = 7

The high number of fragmentation events was due to 32 dd processes
running at the same time under qemu, with limited memory and the standard
min_free_kbytes, so it's not a surprising outcome.

The postprocessor parses the text output of tracing. While there is a
binary format, the expectation is that the binary output can be readily
translated into text and post-processed offline. Obviously, if the text
format changes, the parser will break, but the regular expressions are
fairly rudimentary so the script should be readily adjustable.
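
To make the expected format concrete (the sample line below is
fabricated, not real trace output), the fields the script's event-line
regular expression pulls out can be reproduced with a sed equivalent:

```shell
# Fabricated sample line in the text trace format the script parses.
line='            find-2840  [000]   404.176321: mm_page_alloc: call_site=c01004ea page=c15a2a80 pfn=1234 order=0 migratetype=0 gfp_flags=GFP_KERNEL'

# A sed equivalent of the Perl regex, extracting the process-pid and
# tracepoint name fields.
fields=$(printf '%s\n' "$line" | \
	sed -n 's/^ *\([a-zA-Z0-9-]*\) *\(\[[0-9]*\]\) *\([0-9.]*\): *\([a-zA-Z_]*\):.*/\1 \4/p')
echo "$fields"
```

On the sample line this prints `find-2840 mm_page_alloc`; anything the
pattern fails to match is skipped, just as in the script.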

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../postprocess/trace-pagealloc-postprocess.pl     |  131 ++++++++++++++++++++
 1 files changed, 131 insertions(+), 0 deletions(-)
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..d4332c3
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,131 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Copyright (c) Mel Gorman 2009
+use Switch;
+use strict;
+
+my $traceevent;
+my %perprocess;
+
+while ($traceevent = <>) {
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+
+	#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+	if ($traceevent =~ /\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)/) {
+		$process_pid = $1;
+		$cpus = $2;
+		$timestamp = $3;
+		$tracepoint = $4;
+		$details = $5;
+
+	} else {
+		next;
+	}
+
+	switch ($tracepoint) {
+	case "mm_page_alloc" {
+		$perprocess{$process_pid}->{"mm_page_alloc"}++;
+	}
+	case "mm_page_free_direct" {
+		$perprocess{$process_pid}->{"mm_page_free_direct"}++;
+	}
+	case "mm_pagevec_free" {
+		$perprocess{$process_pid}->{"mm_pagevec_free"}++;
+	}
+	case "mm_page_pcpu_drain" {
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain"}++;
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"}++;
+	}
+	case "mm_page_alloc_zone_locked" {
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked"}++;
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"}++;
+	}
+	case "mm_page_alloc_extfrag" {
+		$perprocess{$process_pid}->{"mm_page_alloc_extfrag"}++;
+		my ($page, $pfn);
+		my ($alloc_order, $fallback_order, $pageblock_order);
+		my ($alloc_migratetype, $fallback_migratetype);
+		my ($fragmenting, $change_ownership);
+
+		$details =~ /page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*) fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*) fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])/;
+		$page = $1;
+		$pfn = $2;
+		$alloc_order = $3;
+		$fallback_order = $4;
+		$pageblock_order = $5;
+		$alloc_migratetype = $6;
+		$fallback_migratetype = $7;
+		$fragmenting = $8;
+		$change_ownership = $9;
+
+		if ($fragmenting) {
+			$perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting"}++;
+			if ($fallback_order <= 3) {
+				$perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-severe"}++;
+			} else {
+				$perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-moderate"}++;
+			}
+		}
+		if ($change_ownership) {
+			$perprocess{$process_pid}->{"mm_page_alloc_extfrag-changetype"}++;
+		}
+	}
+	else {
+		$perprocess{$process_pid}->{"unknown"}++;
+	}
+	}
+
+	# Catch a full pcpu drain event
+	if ($perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} &&
+			$tracepoint ne "mm_page_pcpu_drain") {
+
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"}++;
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} = 0;
+	}
+
+	# Catch a full pcpu refill event
+	if ($perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} &&
+			$tracepoint ne "mm_page_alloc_zone_locked") {
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"}++;
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} = 0;
+	}
+}
+
+# Dump per-process stats
+my $process_pid;
+foreach $process_pid (keys %perprocess) {
+	# Dump final aggregates
+	if ($perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"}) {
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"}++;
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} = 0;
+	}
+	if ($perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"}) {
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"}++;
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} = 0;
+	}
+
+	my %process = %{$perprocess{$process_pid}};
+	printf("$process_pid\n");
+	printf(" o pages allocd            = %d\n", $perprocess{$process_pid}->{"mm_page_alloc"});
+	printf(" o pages allocd under lock = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_zone_locked"});
+	printf(" o pages freed directly    = %d\n", $perprocess{$process_pid}->{"mm_page_free_direct"});
+	printf(" o pages freed via pagevec = %d\n", $perprocess{$process_pid}->{"mm_pagevec_free"});
+	printf(" o pcpu pages drained      = %d\n", $perprocess{$process_pid}->{"mm_page_pcpu_drain"});
+	printf(" o pcpu drains             = %d\n", $perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"});
+	printf(" o pcpu refills            = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"});
+	printf(" o migrate fallbacks       = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag"});
+	printf("   - fragmentation causing = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting"});
+	printf("     - severe              = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-severe"});
+	printf("     - moderate            = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-moderate"});
+	printf("   - changed migratetype   = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-changetype"});
+	printf(" o unknown events          = %d\n", $perprocess{$process_pid}->{"unknown"});
+	printf("\n");
+}
-- 
1.6.3.3


* Re: [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
  2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-07-30  0:55   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2009-07-30  0:55 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm

Mel Gorman wrote:
> This patch adds trace events for the allocation and freeing of pages,
> including the freeing of pagevecs.  Using the events, it can be determined
> which struct pages and pfns are being allocated and freed, and in many
> cases what the call site was.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
  2009-07-29 21:05 ` [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
@ 2009-07-30  1:39   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2009-07-30  1:39 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm

Mel Gorman wrote:
> Fragmentation avoidance depends on being able to use free pages from
> lists of the appropriate migrate type. In the event this is not
> possible, __rmqueue_fallback() selects a different list and in some
> circumstances changes the migratetype of the pageblock. Simplistically,
> the more times this event occurs, the more likely it is that fragmentation
> will be a problem later, at least for hugepage allocation, but there are
> other considerations such as the order of the page being split to satisfy
> the allocation.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
  2009-07-29 21:05 ` [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
@ 2009-07-30 13:43   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2009-07-30 13:43 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm

Mel Gorman wrote:
> The page allocation trace event reports that a page was successfully allocated
> but it does not specify where it came from. When analysing performance,
> it can be important to distinguish between pages coming from the per-cpu
> allocator and pages coming from the buddy lists as the latter requires the
> zone lock to be taken and more data structures to be examined.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-07-29 21:05 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
@ 2009-07-30 13:45   ` Rik van Riel
  0 siblings, 0 replies; 31+ messages in thread
From: Rik van Riel @ 2009-07-30 13:45 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm

Mel Gorman wrote:
> This patch adds a simple post-processing script for the page-allocator-related
> trace events. It can be used to give an indication of who the most
> allocator-intensive processes are and how often the zone lock was taken
> during the tracing period. Example output looks like
> 
> find-2840
>  o pages allocd            = 1877
>  o pages allocd under lock = 1817
>  o pages freed directly    = 9
>  o pcpu refills            = 1078
>  o migrate fallbacks       = 48
>    - fragmentation causing = 48
>      - severe              = 46
>      - moderate            = 2
>    - changed migratetype   = 7

I like it.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.


* [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 18:12 [PATCH 0/4] Add some trace events for the page allocator v3 Mel Gorman
@ 2009-08-04 18:12 ` Mel Gorman
  2009-08-04 18:22   ` Andrew Morton
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2009-08-04 18:12 UTC (permalink / raw)
  To: Larry Woodman, Andrew Morton
  Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman

This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like

find-2840
 o pages allocd            = 1877
 o pages allocd under lock = 1817
 o pages freed directly    = 9
 o pcpu refills            = 1078
 o migrate fallbacks       = 48
   - fragmentation causing = 48
     - severe              = 46
     - moderate            = 2
   - changed migratetype   = 7

The high number of fragmentation events was because 32 dd processes were
running at the same time under qemu, with limited memory and the standard
min_free_kbytes, so it's not a surprising outcome.

The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously, if the text format
changes, the parser will break, but the regular expressions are fairly
rudimentary and should be readily adjustable.
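As an aside (a sketch for illustration, not part of the patch): the event-line
format the script expects can be exercised with the equivalent regular
expression in any language. A minimal Python version follows; the sample line
is fabricated to match the ftrace text layout, not taken from a real trace:

```python
import re

# Same field layout the Perl parser assumes:
#   process-pid  [cpu]  timestamp:  tracepoint:  details
EVENT_RE = re.compile(
    r"\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)")

def parse_event(line):
    """Return (process_pid, cpu, timestamp, tracepoint, details) or None."""
    m = EVENT_RE.match(line)
    return m.groups() if m else None

# Fabricated example line in the ftrace text format
sample = "find-2840 [000] 123.456789: mm_page_alloc: page=f000 pfn=100 order=0"
fields = parse_event(sample)
```

Lines that do not match the pattern (e.g. trace header lines) simply return
None, mirroring the `next` in the Perl loop.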

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 .../postprocess/trace-pagealloc-postprocess.pl     |  131 ++++++++++++++++++++
 1 files changed, 131 insertions(+), 0 deletions(-)
 create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..d4332c3
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,131 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Copyright (c) Mel Gorman 2009
+use Switch;
+use strict;
+
+my $traceevent;
+my %perprocess;
+
+while ($traceevent = <>) {
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+
+	#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+	if ($traceevent =~ /\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)/) {
+		$process_pid = $1;
+		$cpus = $2;
+		$timestamp = $3;
+		$tracepoint = $4;
+		$details = $5;
+
+	} else {
+		next;
+	}
+
+	switch ($tracepoint) {
+	case "mm_page_alloc" {
+		$perprocess{$process_pid}->{"mm_page_alloc"}++;
+	}
+	case "mm_page_free_direct" {
+		$perprocess{$process_pid}->{"mm_page_free_direct"}++;
+	}
+	case "mm_pagevec_free" {
+		$perprocess{$process_pid}->{"mm_pagevec_free"}++;
+	}
+	case "mm_page_pcpu_drain" {
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain"}++;
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"}++;
+	}
+	case "mm_page_alloc_zone_locked" {
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked"}++;
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"}++;
+	}
+	case "mm_page_alloc_extfrag" {
+		$perprocess{$process_pid}->{"mm_page_alloc_extfrag"}++;
+		my ($page, $pfn);
+		my ($alloc_order, $fallback_order, $pageblock_order);
+		my ($alloc_migratetype, $fallback_migratetype);
+		my ($fragmenting, $change_ownership);
+
+		$details =~ /page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*) fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*) fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])/;
+		$page = $1;
+		$pfn = $2;
+		$alloc_order = $3;
+		$fallback_order = $4;
+		$pageblock_order = $5;
+		$alloc_migratetype = $6;
+		$fallback_migratetype = $7;
+		$fragmenting = $8;
+		$change_ownership = $9;
+
+		if ($fragmenting) {
+			$perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting"}++;
+			if ($fallback_order <= 3) {
+				$perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-severe"}++;
+			} else {
+				$perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-moderate"}++;
+			}
+		}
+		if ($change_ownership) {
+			$perprocess{$process_pid}->{"mm_page_alloc_extfrag-changetype"}++;
+		}
+	}
+	else {
+		$perprocess{$process_pid}->{"unknown"}++;
+	}
+	}
+
+	# Catch a full pcpu drain event
+	if ($perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} &&
+			$tracepoint ne "mm_page_pcpu_drain") {
+
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"}++;
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} = 0;
+	}
+
+	# Catch a full pcpu refill event
+	if ($perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} &&
+			$tracepoint ne "mm_page_alloc_zone_locked") {
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"}++;
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} = 0;
+	}
+}
+
+# Dump per-process stats
+my $process_pid;
+foreach $process_pid (keys %perprocess) {
+	# Dump final aggregates
+	if ($perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"}) {
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"}++;
+		$perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} = 0;
+	}
+	if ($perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"}) {
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"}++;
+		$perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} = 0;
+	}
+
+	my %process = %{$perprocess{$process_pid}};
+	printf("$process_pid\n");
+	printf(" o pages allocd            = %d\n", $perprocess{$process_pid}->{"mm_page_alloc"});
+	printf(" o pages allocd under lock = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_zone_locked"});
+	printf(" o pages freed directly    = %d\n", $perprocess{$process_pid}->{"mm_page_free_direct"});
+	printf(" o pages freed via pagevec = %d\n", $perprocess{$process_pid}->{"mm_pagevec_free"});
+	printf(" o pcpu pages drained      = %d\n", $perprocess{$process_pid}->{"mm_page_pcpu_drain"});
+	printf(" o pcpu drains             = %d\n", $perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"});
+	printf(" o pcpu refills            = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"});
+	printf(" o migrate fallbacks       = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag"});
+	printf("   - fragmentation causing = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting"});
+	printf("     - severe              = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-severe"});
+	printf("     - moderate            = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-moderate"});
+	printf("   - changed migratetype   = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-changetype"});
+	printf(" o unknown events          = %d\n", $perprocess{$process_pid}->{"unknown"});
+	printf("\n");
+}
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 18:12 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
@ 2009-08-04 18:22   ` Andrew Morton
  2009-08-04 18:27     ` Rik van Riel
                       ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Andrew Morton @ 2009-08-04 18:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm

On Tue,  4 Aug 2009 19:12:26 +0100 Mel Gorman <mel@csn.ul.ie> wrote:

> This patch adds a simple post-processing script for the page-allocator-related
> trace events. It can be used to give an indication of who the most
> allocator-intensive processes are and how often the zone lock was taken
> during the tracing period. Example output looks like
> 
> find-2840
>  o pages allocd            = 1877
>  o pages allocd under lock = 1817
>  o pages freed directly    = 9
>  o pcpu refills            = 1078
>  o migrate fallbacks       = 48
>    - fragmentation causing = 48
>      - severe              = 46
>      - moderate            = 2
>    - changed migratetype   = 7

The usual way of accumulating and presenting such measurements is via
/proc/vmstat.  How do we justify adding a completely new and different
way of doing something which we already do?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 18:22   ` Andrew Morton
@ 2009-08-04 18:27     ` Rik van Riel
  2009-08-04 19:13       ` Andrew Morton
  2009-08-04 19:57     ` Ingo Molnar
  2009-08-05  3:07     ` KOSAKI Motohiro
  2 siblings, 1 reply; 31+ messages in thread
From: Rik van Riel @ 2009-08-04 18:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML,
	linux-mm

Andrew Morton wrote:
> On Tue,  4 Aug 2009 19:12:26 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> 
>> This patch adds a simple post-processing script for the page-allocator-related
>> trace events. It can be used to give an indication of who the most
>> allocator-intensive processes are and how often the zone lock was taken
>> during the tracing period. Example output looks like
>>
>> find-2840
>>  o pages allocd            = 1877
>>  o pages allocd under lock = 1817
>>  o pages freed directly    = 9
>>  o pcpu refills            = 1078
>>  o migrate fallbacks       = 48
>>    - fragmentation causing = 48
>>      - severe              = 46
>>      - moderate            = 2
>>    - changed migratetype   = 7
> 
> The usual way of accumulating and presenting such measurements is via
> /proc/vmstat.  How do we justify adding a completely new and different
> way of doing something which we already do?

Mel's tracing is more akin to BSD process accounting,
where these statistics are kept on a per-process basis.

Nothing in /proc allows us to see statistics on a per
process basis on process exit.

-- 
All rights reversed.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 18:27     ` Rik van Riel
@ 2009-08-04 19:13       ` Andrew Morton
  2009-08-04 20:48         ` Mel Gorman
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2009-08-04 19:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: mel, lwoodman, mingo, peterz, linux-kernel, linux-mm

On Tue, 04 Aug 2009 14:27:16 -0400
Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> > On Tue,  4 Aug 2009 19:12:26 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> >> This patch adds a simple post-processing script for the page-allocator-related
> >> trace events. It can be used to give an indication of who the most
> >> allocator-intensive processes are and how often the zone lock was taken
> >> during the tracing period. Example output looks like
> >>
> >> find-2840
> >>  o pages allocd            = 1877
> >>  o pages allocd under lock = 1817
> >>  o pages freed directly    = 9
> >>  o pcpu refills            = 1078
> >>  o migrate fallbacks       = 48
> >>    - fragmentation causing = 48
> >>      - severe              = 46
> >>      - moderate            = 2
> >>    - changed migratetype   = 7
> > 
> > The usual way of accumulating and presenting such measurements is via
> > /proc/vmstat.  How do we justify adding a completely new and different
> > way of doing something which we already do?
> 
> Mel's tracing is more akin to BSD process accounting,
> where these statistics are kept on a per-process basis.

Is that useful?  Any time I've wanted to find out things like this, I
just don't run other stuff on the machine at the same time.

Maybe there are some scenarios where it's useful to filter out other
processes, but are those scenarios sufficiently important to warrant
creation of separate machinery like this?

> Nothing in /proc allows us to see statistics on a per
> process basis on process exit.

Can this script be used to monitor the process while it's still running?



Also, we have a counter for "moderate fragmentation causing migrate
fallbacks".  There must be hundreds of MM statistics which can be
accumulated once we get down to this level of detail.  Why choose these
nine?


Is there a plan to add the rest later on?


Or are these nine more a proof-of-concept demonstration-code thing?  If
so, is it expected that developers will do an ad-hoc copy-n-paste to
solve a particular short-term problem and will then toss the tracepoint
away?  I guess that could be useful, although you can do the same with
vmstat.
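(For illustration only: the vmstat approach amounts to polling /proc/vmstat
and diffing the counters between snapshots. A rough Python sketch follows,
with inline sample text standing in for reads of /proc/vmstat to keep it
self-contained; counter names such as pgalloc_normal and pgfree are standard
fields, though the exact set varies by kernel configuration.)

```python
def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[name] = int(value)
    return stats

def vmstat_delta(before, after):
    """Counter increments between two snapshots (system-wide, not per-process)."""
    return {k: after[k] - before[k] for k in after if k in before}

# In practice both snapshots would come from open("/proc/vmstat").read(),
# taken at either end of the interval of interest.
before = parse_vmstat("pgalloc_normal 1000\npgfree 900\n")
after = parse_vmstat("pgalloc_normal 1500\npgfree 1300\n")
delta = vmstat_delta(before, after)
```

Note the limitation under discussion: the counters are global, so the delta
cannot be attributed to any one process.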


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 18:22   ` Andrew Morton
  2009-08-04 18:27     ` Rik van Riel
@ 2009-08-04 19:57     ` Ingo Molnar
  2009-08-04 20:18       ` Andrew Morton
                         ` (2 more replies)
  2009-08-05  3:07     ` KOSAKI Motohiro
  2 siblings, 3 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-08-04 19:57 UTC (permalink / raw)
  To: Andrew Morton, Pekka Enberg, Peter Zijlstra,
	Frédéric Weisbecker, Steven Rostedt
  Cc: Mel Gorman, Larry Woodman, riel, Peter Zijlstra, LKML, linux-mm

* Andrew Morton <akpm@linux-foundation.org> wrote:

> > This patch adds a simple post-processing script for the 
> > page-allocator-related trace events. It can be used to give an 
> > indication of who the most allocator-intensive processes are and 
> > how often the zone lock was taken during the tracing period. 
> > Example output looks like
> > 
> > find-2840
> >  o pages allocd            = 1877
> >  o pages allocd under lock = 1817
> >  o pages freed directly    = 9
> >  o pcpu refills            = 1078
> >  o migrate fallbacks       = 48
> >    - fragmentation causing = 48
> >      - severe              = 46
> >      - moderate            = 2
> >    - changed migratetype   = 7
> 
> The usual way of accumulating and presenting such measurements is 
> via /proc/vmstat.  How do we justify adding a completely new and 
> different way of doing something which we already do?

/proc/vmstat has a couple of technical and usage disadvantages:

 - it is pretty coarse - all-of-system, nothing else 

 - expensive to read (have to read the full file with all fields)

 - has to be polled, has no notion for events

 - it does not offer sampling of workloads

 - it does not allow the separation of workloads: you cannot measure
   just a single workload, you cannot measure just a single process, 
   nor a single CPU.

Incidentally there's an upstream kernel instrumentation and 
statistics framework that solves all the above disadvantages of 
/proc/vmstat:

 - it is finegrained: per task or per workload or per cpu or full system

 - cheap to read - the counts can be accessed individually

 - is event based, can be poll()ed

 - offers sampling of workloads, of any subset of these values

 - it allows easy separation of workloads

All that is needed are the patches from Mel and Rik and it's 
plug-and-play.

Let me demonstrate these features in action (i've applied the 
patches for testing to -tip):

First, discovery/enumeration of available counters can be done via 
'perf list':

titan:~> perf list
  [...]
  kmem:kmalloc                             [Tracepoint event]
  kmem:kmem_cache_alloc                    [Tracepoint event]
  kmem:kmalloc_node                        [Tracepoint event]
  kmem:kmem_cache_alloc_node               [Tracepoint event]
  kmem:kfree                               [Tracepoint event]
  kmem:kmem_cache_free                     [Tracepoint event]
  kmem:mm_page_free_direct                 [Tracepoint event]
  kmem:mm_pagevec_free                     [Tracepoint event]
  kmem:mm_page_alloc                       [Tracepoint event]
  kmem:mm_page_alloc_zone_locked           [Tracepoint event]
  kmem:mm_page_pcpu_drain                  [Tracepoint event]
  kmem:mm_page_alloc_extfrag               [Tracepoint event]

Then any (or all) of the above event sources can be activated and 
measured. For example the page alloc/free properties of a 'hackbench 
run' are:

 titan:~> perf stat -e kmem:mm_page_pcpu_drain -e kmem:mm_page_alloc 
 -e kmem:mm_pagevec_free -e kmem:mm_page_free_direct ./hackbench 10
 Time: 0.575

 Performance counter stats for './hackbench 10':

          13857  kmem:mm_page_pcpu_drain 
          27576  kmem:mm_page_alloc      
           6025  kmem:mm_pagevec_free    
          20934  kmem:mm_page_free_direct

    0.613972165  seconds time elapsed

You can observe the statistical properties as well, by using the 
'repeat the workload N times' feature of perf stat:

 titan:~> perf stat --repeat 5 -e kmem:mm_page_pcpu_drain -e 
   kmem:mm_page_alloc -e kmem:mm_pagevec_free -e 
   kmem:mm_page_free_direct ./hackbench 10
 Time: 0.627
 Time: 0.644
 Time: 0.564
 Time: 0.559
 Time: 0.626

 Performance counter stats for './hackbench 10' (5 runs):

          12920  kmem:mm_page_pcpu_drain    ( +-   3.359% )
          25035  kmem:mm_page_alloc         ( +-   3.783% )
           6104  kmem:mm_pagevec_free       ( +-   0.934% )
          18376  kmem:mm_page_free_direct   ( +-   4.941% )

    0.643954516  seconds time elapsed   ( +-   2.363% )

Furthermore, these tracepoints can be used to sample the workload as 
well. For example the page allocations done by a 'git gc' can be 
captured the following way:

 titan:~/git> perf record -f -e kmem:mm_page_alloc -c 1 ./git gc
 Counting objects: 1148, done.
 Delta compression using up to 2 threads.
 Compressing objects: 100% (450/450), done.
 Writing objects: 100% (1148/1148), done.
 Total 1148 (delta 690), reused 1148 (delta 690)
 [ perf record: Captured and wrote 0.267 MB perf.data (~11679 samples) ]

To check which functions generated page allocations:

 titan:~/git> perf report
 # Samples: 10646
 #
 # Overhead          Command               Shared Object
 # ........  ...............  ..........................
 #
    23.57%       git-repack  /lib64/libc-2.5.so        
    21.81%              git  /lib64/libc-2.5.so        
    14.59%              git  ./git                     
    11.79%       git-repack  ./git                     
     7.12%              git  /lib64/ld-2.5.so          
     3.16%       git-repack  /lib64/libpthread-2.5.so  
     2.09%       git-repack  /bin/bash                 
     1.97%               rm  /lib64/libc-2.5.so        
     1.39%               mv  /lib64/ld-2.5.so          
     1.37%               mv  /lib64/libc-2.5.so        
     1.12%       git-repack  /lib64/ld-2.5.so          
     0.95%               rm  /lib64/ld-2.5.so          
     0.90%  git-update-serv  /lib64/libc-2.5.so        
     0.73%  git-update-serv  /lib64/ld-2.5.so          
     0.68%             perf  /lib64/libpthread-2.5.so  
     0.64%       git-repack  /usr/lib64/libz.so.1.2.3  

Or to see it on a more finegrained level:

titan:~/git> perf report --sort comm,dso,symbol
# Samples: 10646
#
# Overhead          Command               Shared Object  Symbol
# ........  ...............  ..........................  ......
#
     9.35%       git-repack  ./git                       [.] insert_obj_hash
     9.12%              git  ./git                       [.] insert_obj_hash
     7.31%              git  /lib64/libc-2.5.so          [.] memcpy
     6.34%       git-repack  /lib64/libc-2.5.so          [.] _int_malloc
     6.24%       git-repack  /lib64/libc-2.5.so          [.] memcpy
     5.82%       git-repack  /lib64/libc-2.5.so          [.] __GI___fork
     5.47%              git  /lib64/libc-2.5.so          [.] _int_malloc
     2.99%              git  /lib64/libc-2.5.so          [.] memset

Furthermore, call-graph sampling can be done too, of page 
allocations - to see precisely what kind of page allocations there 
are:

 titan:~/git> perf record -f -g -e kmem:mm_page_alloc -c 1 ./git gc
 Counting objects: 1148, done.
 Delta compression using up to 2 threads.
 Compressing objects: 100% (450/450), done.
 Writing objects: 100% (1148/1148), done.
 Total 1148 (delta 690), reused 1148 (delta 690)
 [ perf record: Captured and wrote 0.963 MB perf.data (~42069 samples) ]

 titan:~/git> perf report -g
 # Samples: 10686
 #
 # Overhead          Command               Shared Object
 # ........  ...............  ..........................
 #
    23.25%       git-repack  /lib64/libc-2.5.so        
                |          
                |--50.00%-- _int_free
                |          
                |--37.50%-- __GI___fork
                |          make_child
                |          
                |--12.50%-- ptmalloc_unlock_all2
                |          make_child
                |          
                 --6.25%-- __GI_strcpy
    21.61%              git  /lib64/libc-2.5.so        
                |          
                |--30.00%-- __GI_read
                |          |          
                |           --83.33%-- git_config_from_file
                |                     git_config
                |                     |          
   [...]

Or you can observe the whole system's page allocations for 10 
seconds:

titan:~/git> perf stat -a -e kmem:mm_page_pcpu_drain -e 
kmem:mm_page_alloc -e kmem:mm_pagevec_free -e 
kmem:mm_page_free_direct sleep 10

 Performance counter stats for 'sleep 10':

         171585  kmem:mm_page_pcpu_drain 
         322114  kmem:mm_page_alloc      
          73623  kmem:mm_pagevec_free    
         254115  kmem:mm_page_free_direct

   10.000591410  seconds time elapsed

Or observe how fluctuating the page allocations are, via statistical 
analysis done over ten 1-second intervals:

 titan:~/git> perf stat --repeat 10 -a -e kmem:mm_page_pcpu_drain -e 
   kmem:mm_page_alloc -e kmem:mm_pagevec_free -e 
   kmem:mm_page_free_direct sleep 1

 Performance counter stats for 'sleep 1' (10 runs):

          17254  kmem:mm_page_pcpu_drain    ( +-   3.709% )
          34394  kmem:mm_page_alloc         ( +-   4.617% )
           7509  kmem:mm_pagevec_free       ( +-   4.820% )
          25653  kmem:mm_page_free_direct   ( +-   3.672% )

    1.058135029  seconds time elapsed   ( +-   3.089% )

Or you can annotate the recorded 'git gc' run on a per symbol basis 
and check which instructions/source-code generated page allocations:

 titan:~/git> perf annotate __GI___fork
 ------------------------------------------------
  Percent |      Source code & Disassembly of libc-2.5.so
 ------------------------------------------------
          :
          :
          :      Disassembly of section .plt:
          :      Disassembly of section .text:
          :
          :      00000031a2e95560 <__fork>:
 [...]
     0.00 :        31a2e95602:   b8 38 00 00 00          mov    $0x38,%eax
     0.00 :        31a2e95607:   0f 05                   syscall 
    83.42 :        31a2e95609:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
     0.00 :        31a2e9560f:   0f 87 4d 01 00 00       ja     31a2e95762 <__fork+0x202>
     0.00 :        31a2e95615:   85 c0                   test   %eax,%eax

( this shows that 83.42% of __GI___fork's page allocations come from
  the 0x38 system call it performs. )

etc. etc. - a lot more is possible. I could list a dozen 
other different usecases straight away - none of which is 
possible via /proc/vmstat.

/proc/vmstat is not in the same league really, in terms of 
expressive power of system analysis and performance 
analysis.

All that the above results needed were those new tracepoints 
in include/trace/events/kmem.h.

	Ingo


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 19:57     ` Ingo Molnar
@ 2009-08-04 20:18       ` Andrew Morton
  2009-08-04 20:35         ` Ingo Molnar
  2009-08-05 15:07         ` Valdis.Kletnieks
  2009-08-05 14:53       ` Valdis.Kletnieks
  2009-08-06 15:50       ` Mel Gorman
  2 siblings, 2 replies; 31+ messages in thread
From: Andrew Morton @ 2009-08-04 20:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: penberg, a.p.zijlstra, fweisbec, rostedt, mel, lwoodman, riel,
	peterz, linux-kernel, linux-mm

On Tue, 4 Aug 2009 21:57:17 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> Let me demonstrate these features in action (i've applied the 
> patches for testing to -tip):

So?  The fact that certain things can be done doesn't mean that there's
a demand for them, nor that anyone will _use_ this stuff.

As usual, we're adding tracepoints because we feel we must add
tracepoints, not because anyone has a need for the data which they
gather.

There is some benefit in providing MM developers with some code which
they can copy-n-paste for their day-to-day activity.  But as I said,
they can do that with vmstat too.


If we can get rid of vmstat altogether (and meminfo) and replace all
that with common infrastructure then that would be a good cleanup.  But
if we end up leaving vmstat and meminfo in place and then adding
_another_ statistic gathering mechanism in parallel then we haven't
cleaned anything up at all - it just gets worse.


I don't really oppose the patches - they're small.  But they seem
rather useless too.

It would be nice to at least partially remove the vmstat/meminfo
infrastructure but I don't think we can do that?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 20:18       ` Andrew Morton
@ 2009-08-04 20:35         ` Ingo Molnar
  2009-08-04 20:53           ` Andrew Morton
  2009-08-05 13:04           ` Peter Zijlstra
  2009-08-05 15:07         ` Valdis.Kletnieks
  1 sibling, 2 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-08-04 20:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: penberg, a.p.zijlstra, fweisbec, rostedt, mel, lwoodman, riel,
	peterz, linux-kernel, linux-mm


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 4 Aug 2009 21:57:17 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Let me demonstrate these features in action (i've applied the 
> > patches for testing to -tip):
> 
> So?  The fact that certain things can be done doesn't mean that 
> there's a demand for them, nor that anyone will _use_ this stuff.

c'mon Andrew ...

Did you never want to see whether firefox is leaking [any sort of] 
memory, and if yes, on what callsites? Try something like on an 
already running firefox context:

  perf stat -e kmem:mm_page_alloc \
            -e kmem:mm_pagevec_free \
            -e kmem:mm_page_free_direct \
     -p $(pidof firefox-bin) sleep 10

... and "perf record" for the specific callsites.

this perf stuff is immensely flexible and a very unixish 
abstraction. The perf.data contains timestamped trace entries of 
page allocations and freeing done.

[...]
> It would be nice to at least partially remove the vmstat/meminfo 
> infrastructure but I don't think we can do that?

at least meminfo is an ABI for sure - vmstat too really.

But we can stop adding new fields into obsolete, inflexible and 
clearly deficient interfaces, and we can standardize new 
instrumentation to use modern instrumentation facilities - i.e. 
tracepoints and perfcounters.

I'm not saying to put a tracepoint on every second line of the 
kernel, but obviously Mel and Rik wanted this kind of info because 
they found it useful in practice.

	Ingo


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 19:13       ` Andrew Morton
@ 2009-08-04 20:48         ` Mel Gorman
  2009-08-05  7:41           ` Ingo Molnar
  2009-08-05 14:53           ` Larry Woodman
  0 siblings, 2 replies; 31+ messages in thread
From: Mel Gorman @ 2009-08-04 20:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, lwoodman, mingo, peterz, linux-kernel, linux-mm

On Tue, Aug 04, 2009 at 12:13:32PM -0700, Andrew Morton wrote:
> On Tue, 04 Aug 2009 14:27:16 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
> > Andrew Morton wrote:
> > > On Tue,  4 Aug 2009 19:12:26 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > >> This patch adds a simple post-processing script for the page-allocator-related
> > >> trace events. It can be used to give an indication of who the most
> > >> allocator-intensive processes are and how often the zone lock was taken
> > >> during the tracing period. Example output looks like
> > >>
> > >> find-2840
> > >>  o pages allocd            = 1877
> > >>  o pages allocd under lock = 1817
> > >>  o pages freed directly    = 9
> > >>  o pcpu refills            = 1078
> > >>  o migrate fallbacks       = 48
> > >>    - fragmentation causing = 48
> > >>      - severe              = 46
> > >>      - moderate            = 2
> > >>    - changed migratetype   = 7
> > > 
> > > The usual way of accumulating and presenting such measurements is via
> > > /proc/vmstat.  How do we justify adding a completely new and different
> > > way of doing something which we already do?
> > 
> > Mel's tracing is more akin to BSD process accounting,
> > where these statistics are kept on a per-process basis.
> 
> Is that useful?  Any time I've wanted to find out things like this, I
> just don't run other stuff on the machine at the same time.
> 

For some workloads, there will be multiple helper processes making it harder
to just not run other stuff on the machine at the same time. When looking at
just global statistics, it might be very easy to jump to the wrong conclusion
based on oprofile output or other aggregated figures.

> Maybe there are some scenarios where it's useful to filter out other
> processes, but are those scenarios sufficiently important to warrant
> creation of separate machinery like this?
> 

> > Nothing in /proc allows us to see statistics on a per
> > process basis on process exit.
> 
> Can this script be used to monitor the process while it's still running?
> 

Not in its current form. It was intended as an illustration of how the events
can be used to generate a high-level picture and more suited to off-line
rather than on-line analysis. For on-line analysis, the parser would need to
be a lot more efficient than regular expressions and string matching in perl.

But, let's say you had asked me to give a live report on page allocator
activity on a per-process basis, I could have slapped together a
systemtap script in 5 minutes that looked something like .....
*scribbles*

==== BEGIN SYSTEMTAP SCRIPT ====
global page_allocs

probe kernel.trace("mm_page_alloc") {
  page_allocs[execname()]++
}

function print_count() {
  printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
  foreach (proc in page_allocs-)
    printf("%-25d %s\n", page_allocs[proc], proc)
  printf ("\n")
  delete page_allocs
}

probe timer.s(5) {
        print_count()
}
==== END SYSTEMTAP SCRIPT ====

This would tell me every 5 seconds what the most active processes
were that were allocating pages. Obviously I could have used the
mm_page_alloc_zone_locked point if the question was related to the zone lock
and lock_stat was not available. If I had oprofile output telling me a lot
of time was spent in the page allocator, I could then use a script like this
to better pin down which process might be responsible.

Incidentally, I ran this on my laptop which is running a patched kernel. Sample
output looks like

#Pages Allocated          Process Name
3683                      Xorg
40                        awesome
34                        konqueror
4                         thinkfan
2                         hald-addon-stor
2                         kjournald
1                         akregator

#Pages Allocated          Process Name
7715                      Xorg
2545                      modprobe
2489                      kio_http
1593                      akregator
405                       kdeinit
246                       khelper
158                       gconfd-2
52                        kded
27                        awesome
20                        gnome-terminal
7                         pageattr-test
5                         swapper
3                         hald-addon-stor
3                         lnotes
3                         thinkfan
2                         kjournald
1                         notes2
1                         pdflush
1                         konqueror

Straight off looking at that, I wonder what Xorg was doing and where modprobe
came from :/. I don't think modprobe was from systemtap itself because the
script had been running too long by the point I cut & pasted the output.

> Also, we have a counter for "moderate fragmentation causing migrate
> fallbacks". 

Which counter is that? There are breakdowns, all right, of how many pageblocks
there are of each migratetype, but it's a bit trickier to catch when
fragmentation is really occurring and to what extent. Just measuring the
frequency at which it occurs may be enough to help tune min_free_kbytes,
for example.
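As a rough illustration of such a frequency measure, the sketch below
classifies a single migrate-fallback event into the severe/moderate buckets
shown in the example script output. The rules here are assumptions for
illustration only, not the patchset's exact logic: any cross-migratetype
fallback counts as fragmentation-causing, and it is "severe" when less than a
whole pageblock is taken (leaving a pageblock with mixed migratetypes).
PAGEBLOCK_ORDER is hard-coded although it is arch- and config-dependent in a
real kernel.

```python
PAGEBLOCK_ORDER = 9  # assumed value; arch/config dependent in a real kernel

def classify_fallback(alloc_migratetype, fallback_migratetype, fallback_order):
    """Classify one migrate-fallback event (illustrative rules only).

    Returns "clean" when no migratetype change happened, "severe" when a
    partial pageblock was taken (mixing migratetypes within the block),
    and "moderate" when at least a whole pageblock changed ownership.
    """
    if alloc_migratetype == fallback_migratetype:
        return "clean"
    return "severe" if fallback_order < PAGEBLOCK_ORDER else "moderate"

def summarise(events):
    """Tally a list of (alloc_mt, fallback_mt, order) fallback events."""
    summary = {"clean": 0, "moderate": 0, "severe": 0}
    for alloc_mt, fallback_mt, order in events:
        summary[classify_fallback(alloc_mt, fallback_mt, order)] += 1
    return summary
```

The ratio of severe to moderate events over a tracing period is the sort of
figure that might guide a min_free_kbytes adjustment.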

> There must be hundreds of MM statistics which can be
> accumulated once we get down to this level of detail.  Why choose these
> nine?
> 

Because the page allocator is where I'm currently looking, and these were
the points I needed in order to draw reasonable conclusions about what sort
of behaviour the page allocator was seeing.

> Is there a plan to add the rest later on?

Depending on how this goes, I will attempt to do a similar set of trace
points for tracking kswapd and direct reclaim with a view to identifying
when stalls occur due to reclaim, when lumpy reclaim is kicking in, how long
it takes and how often it succeeds/fails.

> 
> Or are these nine more a proof-of-concept demonstration-code thing?  If
> so, is it expected that developers will do an ad-hoc copy-n-paste to
> solve a particular short-term problem and will then toss the tracepoint
> away?  I guess that could be useful, although you can do the same with
> vmstat.
> 

Adding and deleting tracepoints, then rebuilding and rebooting the kernel,
is obviously workable for developers but not much use if
recompiling the kernel is not an option or you're trying to debug a
difficult-to-reproduce-but-is-happening-now type of problem.

Of the CC list, I believe Larry Woodman has the most experience with
these sorts of problems in the field, so I'm hoping he'll make some sort
of comment.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 20:35         ` Ingo Molnar
@ 2009-08-04 20:53           ` Andrew Morton
  2009-08-05  7:53             ` Ingo Molnar
  2009-08-05 13:04           ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2009-08-04 20:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: penberg, a.p.zijlstra, fweisbec, rostedt, mel, lwoodman, riel,
	peterz, linux-kernel, linux-mm

On Tue, 4 Aug 2009 22:35:26 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> Did you never want to see whether firefox is leaking [any sort of] 
> memory, and if yes, on what callsites? Try something like on an 
> already running firefox context:
> 
>   perf stat -e kmem:mm_page_alloc \
>             -e kmem:mm_pagevec_free \
>             -e kmem:mm_page_free_direct \
>      -p $(pidof firefox-bin) sleep 10
> 
> ... and "perf record" for the specific callsites.

OK, that would be useful.  What does the output look like?

In what way is it superior to existing ways of finding leaks?

> this perf stuff is immensely flexible and a very unixish 
> abstraction. The perf.data contains timestamped trace entries of 
> page allocations and freeing done.
> 
> [...]
> > It would be nice to at least partially remove the vmstat/meminfo 
> > infrastructure but I don't think we can do that?
> 
> at least meminfo is an ABI for sure - vmstat too really.
> 
> But we can stop adding new fields into obsolete, inflexible and 
> clearly deficient interfaces, and we can standardize new 
> instrumentation to use modern instrumentation facilities - i.e. 
> tracepoints and perfcounters.

That's bad.  Is there really no way in which we can consolidate _any_
of that infrastructure?  We just pile in new stuff alongside the old?

The worst part is needing two unrelated sets of userspace tools to
access basically-identical things.




* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 18:22   ` Andrew Morton
  2009-08-04 18:27     ` Rik van Riel
  2009-08-04 19:57     ` Ingo Molnar
@ 2009-08-05  3:07     ` KOSAKI Motohiro
  2 siblings, 0 replies; 31+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05  3:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Mel Gorman, Larry Woodman, riel, Ingo Molnar,
	Peter Zijlstra, LKML, linux-mm

Hi

> On Tue,  4 Aug 2009 19:12:26 +0100 Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch adds a simple post-processing script for the page-allocator-related
> > trace events. It can be used to give an indication of who the most
> > allocator-intensive processes are and how often the zone lock was taken
> > during the tracing period. Example output looks like
> > 
> > find-2840
> >  o pages allocd            = 1877
> >  o pages allocd under lock = 1817
> >  o pages freed directly    = 9
> >  o pcpu refills            = 1078
> >  o migrate fallbacks       = 48
> >    - fragmentation causing = 48
> >      - severe              = 46
> >      - moderate            = 2
> >    - changed migratetype   = 7
> 
> The usual way of accumulating and presenting such measurements is via
> /proc/vmstat.  How do we justify adding a completely new and different
> way of doing something which we already do?

I think this approach has the following merits.

 - It can collect per-process information.
   (Of course, the ftrace event filter can also filter on various other
   conditions)
 - It can integrate with perf counters easily.






* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 20:48         ` Mel Gorman
@ 2009-08-05  7:41           ` Ingo Molnar
  2009-08-05  9:07             ` Mel Gorman
  2009-08-05 14:53           ` Larry Woodman
  1 sibling, 1 reply; 31+ messages in thread
From: Ingo Molnar @ 2009-08-05  7:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, lwoodman, peterz, linux-kernel,
	linux-mm, Peter Zijlstra, Steven Rostedt,
	Frédéric Weisbecker


* Mel Gorman <mel@csn.ul.ie> wrote:

[...]
> > Is there a plan to add the rest later on?
> 
> Depending on how this goes, I will attempt to do a similar set of 
> trace points for tracking kswapd and direct reclaim with a view 
> to identifying when stalls occur due to reclaim, when lumpy 
> reclaim is kicking in, how long it takes and how often it 
> succeeds/fails.
> 
> > Or are these nine more a proof-of-concept demonstration-code 
> > thing?  If so, is it expected that developers will do an ad-hoc 
> > copy-n-paste to solve a particular short-term problem and will 
> > then toss the tracepoint away?  I guess that could be useful, 
> > although you can do the same with vmstat.
> 
> Adding and deleting tracepoints, then rebuilding and rebooting the 
> kernel, is obviously workable for developers but not much use if 
> recompiling the kernel is not an option or you're trying to 
> debug a difficult-to-reproduce-but-is-happening-now type of 
> problem.
> 
> Of the CC list, I believe Larry Woodman has the most experience 
> with these sorts of problems in the field, so I'm hoping he'll make 
> some sort of comment.

Yes. FYI, Larry's last set of patches (which Andrew essentially 
NAK-ed) can be found attached below.

My general impression is that these things are very clearly useful, 
but that it would also be nice to see a more structured plan about 
what we want to instrument in the MM and what not so that a general 
decision can be made instead of a creeping stream of ad-hoc 
tracepoints with no end in sight.

I.e. have a full-cycle set of tracepoints based on a high-level 
description - one (incomplete) sub-set i outlined here for example:

  http://lkml.org/lkml/2009/3/24/435

Adding a document about the page allocator, with perhaps a comment on 
precisely what we want to trace, would definitely be useful in 
addressing Andrew's scepticism i think.

I.e. we'd have your patch in the end, but also with some feel-good 
thoughts made about it on a higher level, so that we can be 
reasonably sure that we have a meaningful set of tracepoints.

	Ingo

----- Forwarded message from Larry Woodman <lwoodman@redhat.com> -----

Date: Tue, 21 Apr 2009 18:45:15 -0400
From: Larry Woodman <lwoodman@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, riel@redhat.com,
	mingo@elte.hu, rostedt@goodmis.org
Subject: [Patch] mm tracepoints update


I've cleaned up the mm tracepoints to track page allocation and
freeing, various types of pagefaults and unmaps, and critical page
reclamation routines.  This is useful for debugging memory allocation
issues and system performance problems under heavy memory loads.


----------------------------------------------------------------------


# tracer: mm
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
         pdflush-624   [004]   184.293169: wb_kupdate: mm_pdflush_kupdate count=3e48
         pdflush-624   [004]   184.293439: get_page_from_freelist: mm_page_allocation pfn=447c27 zone_free=1940910
        events/6-33    [006]   184.962879: free_hot_cold_page: mm_page_free pfn=44bba9
      irqbalance-8313  [001]   188.042951: unmap_vmas: mm_anon_userfree mm=ffff88044a7300c0 address=7f9a2eb70000 pfn=24c29a
             cat-9122  [005]   191.141173: filemap_fault: mm_filemap_fault primary fault: mm=ffff88024c9d8f40 address=3cea2dd000 pfn=44d68e
             cat-9122  [001]   191.143036: handle_mm_fault: mm_anon_fault mm=ffff88024c8beb40 address=7fffbde99f94 pfn=24ce22
-------------------------------------------------------------------------

Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>


The patch applies to ingo's latest tip tree:



From 7189889a6978d9fe46a803c94ae7a1d700bdf2ef Mon Sep 17 00:00:00 2001
From: lwoodman <lwoodman@dhcp-100-19-50.bos.redhat.com>
Date: Tue, 21 Apr 2009 14:34:35 -0400
Subject: [PATCH] Merge mm tracepoints into upstream tip tree.

---
 include/trace/events/mm.h |  510 +++++++++++++++++++++++++++++++++++++++++++++
 mm/filemap.c              |    4 +
 mm/memory.c               |   24 ++-
 mm/page-writeback.c       |    4 +
 mm/page_alloc.c           |    8 +-
 mm/rmap.c                 |    4 +
 mm/vmscan.c               |   17 ++-
 7 files changed, 564 insertions(+), 7 deletions(-)
 create mode 100644 include/trace/events/mm.h

diff --git a/include/trace/events/mm.h b/include/trace/events/mm.h
new file mode 100644
index 0000000..ca959f6
--- /dev/null
+++ b/include/trace/events/mm.h
@@ -0,0 +1,510 @@
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+#include <linux/mm.h>
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mm
+
+TRACE_EVENT(mm_anon_fault,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address, unsigned long pfn),
+
+	TP_ARGS(mm, address, pfn),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("mm=%lx address=%lx pfn=%lx",
+		(unsigned long)__entry->mm, __entry->address, __entry->pfn)
+);
+
+TRACE_EVENT(mm_anon_pgin,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address, unsigned long pfn),
+
+	TP_ARGS(mm, address, pfn),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("mm=%lx address=%lx pfn=%lx",
+		(unsigned long)__entry->mm, __entry->address, __entry->pfn)
+	);
+
+TRACE_EVENT(mm_anon_cow,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address, unsigned long pfn),
+
+	TP_ARGS(mm, address, pfn),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("mm=%lx address=%lx pfn=%lx",
+		(unsigned long)__entry->mm, __entry->address, __entry->pfn)
+	);
+
+TRACE_EVENT(mm_anon_userfree,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address, unsigned long pfn),
+
+	TP_ARGS(mm, address, pfn),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("mm=%lx address=%lx pfn=%lx",
+		(unsigned long)__entry->mm, __entry->address, __entry->pfn)
+	);
+
+TRACE_EVENT(mm_anon_unmap,
+
+	TP_PROTO(unsigned long pfn, int success),
+
+	TP_ARGS(pfn, success),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(int, success)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+		__entry->success = success;
+	),
+
+	TP_printk("%s: pfn=%lx",
+		__entry->success ? "succeeded" : "failed", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_filemap_fault,
+
+	TP_PROTO(struct mm_struct *mm, unsigned long address,
+			unsigned long pfn, int flag),
+	TP_ARGS(mm, address, pfn, flag),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(unsigned long, pfn)
+		__field(int, flag)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->pfn = pfn;
+		__entry->flag = flag;
+	),
+
+	TP_printk("%s: mm=%lx address=%lx pfn=%lx",
+		__entry->flag ? "pagein" : "primary fault",
+		(unsigned long)__entry->mm, __entry->address, __entry->pfn)
+	);
+
+TRACE_EVENT(mm_filemap_cow,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address, unsigned long pfn),
+
+	TP_ARGS(mm, address, pfn),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("mm=%lx address=%lx pfn=%lx",
+		(unsigned long)__entry->mm, __entry->address, __entry->pfn)
+	);
+
+TRACE_EVENT(mm_filemap_unmap,
+
+	TP_PROTO(unsigned long pfn, int success),
+
+	TP_ARGS(pfn, success),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(int, success)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+		__entry->success = success;
+	),
+
+	TP_printk("%s: pfn=%lx",
+		__entry->success ? "succeeded" : "failed", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_filemap_userunmap,
+
+	TP_PROTO(struct mm_struct *mm,
+			unsigned long address, unsigned long pfn),
+
+	TP_ARGS(mm, address, pfn),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__field(unsigned long, address)
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__entry->address = address;
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("mm=%lx address=%lx pfn=%lx",
+		(unsigned long)__entry->mm, __entry->address, __entry->pfn)
+	);
+
+TRACE_EVENT(mm_pagereclaim_pgout,
+
+	TP_PROTO(unsigned long pfn, int anon),
+
+	TP_ARGS(pfn, anon),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(int, anon)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+		__entry->anon = anon;
+	),
+
+	TP_printk("%s: pfn=%lx",
+		__entry->anon ? "anonymous" : "pagecache", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_pagereclaim_free,
+
+	TP_PROTO(unsigned long pfn, int anon),
+
+	TP_ARGS(pfn, anon),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(int, anon)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+		__entry->anon = anon;
+	),
+
+	TP_printk("%s: pfn=%lx",
+		__entry->anon ? "anonymous" : "pagecache", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_pdflush_bgwriteout,
+
+	TP_PROTO(unsigned long count),
+
+	TP_ARGS(count),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, count)
+	),
+
+	TP_fast_assign(
+		__entry->count = count;
+	),
+
+	TP_printk("count=%lx", __entry->count)
+	);
+
+TRACE_EVENT(mm_pdflush_kupdate,
+
+	TP_PROTO(unsigned long count),
+
+	TP_ARGS(count),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, count)
+	),
+
+	TP_fast_assign(
+		__entry->count = count;
+	),
+
+	TP_printk("count=%lx", __entry->count)
+	);
+
+TRACE_EVENT(mm_page_allocation,
+
+	TP_PROTO(unsigned long pfn, unsigned long free),
+
+	TP_ARGS(pfn, free),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(unsigned long, free)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+		__entry->free = free;
+	),
+
+	TP_printk("pfn=%lx zone_free=%ld", __entry->pfn, __entry->free)
+	);
+
+TRACE_EVENT(mm_kswapd_runs,
+
+	TP_PROTO(unsigned long reclaimed),
+
+	TP_ARGS(reclaimed),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->reclaimed = reclaimed;
+	),
+
+	TP_printk("reclaimed=%lx", __entry->reclaimed)
+	);
+
+TRACE_EVENT(mm_directreclaim_reclaimall,
+
+	TP_PROTO(unsigned long priority),
+
+	TP_ARGS(priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, priority)
+	),
+
+	TP_fast_assign(
+		__entry->priority = priority;
+	),
+
+	TP_printk("priority=%lx", __entry->priority)
+	);
+
+TRACE_EVENT(mm_directreclaim_reclaimzone,
+
+	TP_PROTO(unsigned long reclaimed, unsigned long priority),
+
+	TP_ARGS(reclaimed, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, reclaimed)
+		__field(unsigned long, priority)
+	),
+
+	TP_fast_assign(
+		__entry->reclaimed = reclaimed;
+		__entry->priority = priority;
+	),
+
+	TP_printk("reclaimed=%lx, priority=%lx",
+			__entry->reclaimed, __entry->priority)
+	);
+TRACE_EVENT(mm_pagereclaim_shrinkzone,
+
+	TP_PROTO(unsigned long reclaimed),
+
+	TP_ARGS(reclaimed),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, reclaimed)
+	),
+
+	TP_fast_assign(
+		__entry->reclaimed = reclaimed;
+	),
+
+	TP_printk("reclaimed=%lx", __entry->reclaimed)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkactive,
+
+	TP_PROTO(unsigned long scanned, int file, int priority),
+
+	TP_ARGS(scanned, file, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, scanned)
+		__field(int, file)
+		__field(int, priority)
+	),
+
+	TP_fast_assign(
+		__entry->scanned = scanned;
+		__entry->file = file;
+		__entry->priority = priority;
+	),
+
+	TP_printk("scanned=%lx, %s, priority=%d",
+		__entry->scanned, __entry->file ? "anonymous" : "pagecache",
+		__entry->priority)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkactive_a2a,
+
+	TP_PROTO(unsigned long pfn),
+
+	TP_ARGS(pfn),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("pfn=%lx", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkactive_a2i,
+
+	TP_PROTO(unsigned long pfn),
+
+	TP_ARGS(pfn),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("pfn=%lx", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkinactive,
+
+	TP_PROTO(unsigned long scanned, int file, int priority),
+
+	TP_ARGS(scanned, file, priority),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, scanned)
+		__field(int, file)
+		__field(int, priority)
+	),
+
+	TP_fast_assign(
+		__entry->scanned = scanned;
+		__entry->file = file;
+		__entry->priority = priority;
+	),
+
+	TP_printk("scanned=%lx, %s, priority=%d",
+		__entry->scanned, __entry->file ? "anonymous" : "pagecache",
+		__entry->priority)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkinactive_i2a,
+
+	TP_PROTO(unsigned long pfn),
+
+	TP_ARGS(pfn),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("pfn=%lx", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_pagereclaim_shrinkinactive_i2i,
+
+	TP_PROTO(unsigned long pfn),
+
+	TP_ARGS(pfn),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("pfn=%lx", __entry->pfn)
+	);
+
+TRACE_EVENT(mm_page_free,
+
+	TP_PROTO(unsigned long pfn),
+
+	TP_ARGS(pfn),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+	),
+
+	TP_printk("pfn=%lx", __entry->pfn)
+	);
+
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..4ff804c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,8 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 #include "internal.h"
 
 /*
@@ -1568,6 +1570,8 @@ retry_find:
 	 */
 	ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
 	vmf->page = page;
+	trace_mm_filemap_fault(vma->vm_mm, (unsigned long)vmf->virtual_address,
+			page_to_pfn(page), vmf->flags&FAULT_FLAG_NONLINEAR);
 	return ret | VM_FAULT_LOCKED;
 
 no_cached_page:
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..abd28d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -55,6 +55,7 @@
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
 #include <linux/elf.h>
+#include <linux/ftrace.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -64,6 +65,8 @@
 
 #include "internal.h"
 
+#include <trace/events/mm.h>
+
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 /* use the per-pgdat data instead for discontigmem - mbligh */
 unsigned long max_mapnr;
@@ -812,15 +815,19 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 						addr) != page->index)
 				set_pte_at(mm, addr, pte,
 					   pgoff_to_pte(page->index));
-			if (PageAnon(page))
+			if (PageAnon(page)) {
 				anon_rss--;
-			else {
+				trace_mm_anon_userfree(mm, addr,
+							page_to_pfn(page));
+			} else {
 				if (pte_dirty(ptent))
 					set_page_dirty(page);
 				if (pte_young(ptent) &&
 				    likely(!VM_SequentialReadHint(vma)))
 					mark_page_accessed(page);
 				file_rss--;
+				trace_mm_filemap_userunmap(mm, addr,
+							page_to_pfn(page));
 			}
 			page_remove_rmap(page);
 			if (unlikely(page_mapcount(page) < 0))
@@ -1896,7 +1903,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		spinlock_t *ptl, pte_t orig_pte)
 {
-	struct page *old_page, *new_page;
+	struct page *old_page, *new_page = NULL;
 	pte_t entry;
 	int reuse = 0, ret = 0;
 	int page_mkwrite = 0;
@@ -2039,9 +2046,14 @@ gotten:
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
+				trace_mm_filemap_cow(mm, address,
+					page_to_pfn(new_page));
 			}
-		} else
+		} else {
 			inc_mm_counter(mm, anon_rss);
+			trace_mm_anon_cow(mm, address,
+					page_to_pfn(new_page));
+		}
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2416,7 +2428,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		int write_access, pte_t orig_pte)
 {
 	spinlock_t *ptl;
-	struct page *page;
+	struct page *page = NULL;
 	swp_entry_t entry;
 	pte_t pte;
 	struct mem_cgroup *ptr = NULL;
@@ -2517,6 +2529,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 unlock:
 	pte_unmap_unlock(page_table, ptl);
 out:
+	trace_mm_anon_pgin(mm, address, page_to_pfn(page));
 	return ret;
 out_nomap:
 	mem_cgroup_cancel_charge_swapin(ptr);
@@ -2549,6 +2562,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto oom;
 	__SetPageUptodate(page);
 
+	trace_mm_anon_fault(mm, address, page_to_pfn(page));
 	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
 		goto oom_free_page;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..122cad4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,8 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -716,6 +718,7 @@ static void background_writeout(unsigned long _min_pages)
 				break;
 		}
 	}
+	trace_mm_pdflush_bgwriteout(_min_pages);
 }
 
 /*
@@ -776,6 +779,7 @@ static void wb_kupdate(unsigned long arg)
 	nr_to_write = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+	trace_mm_pdflush_kupdate(nr_to_write);
 	while (nr_to_write > 0) {
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3df888..5c175fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -47,6 +47,8 @@
 #include <linux/page-isolation.h>
 #include <linux/page_cgroup.h>
 #include <linux/debugobjects.h>
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1007,6 +1009,7 @@ static void free_hot_cold_page(struct page *page, int cold)
 	if (free_pages_check(page))
 		return;
 
+	trace_mm_page_free(page_to_pfn(page));
 	if (!PageHighMem(page)) {
 		debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
 		debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
@@ -1450,8 +1453,11 @@ zonelist_scan:
 		}
 
 		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
-		if (page)
+		if (page) {
+			trace_mm_page_allocation(page_to_pfn(page),
+					zone_page_state(zone, NR_FREE_PAGES));
 			break;
+		}
 this_zone_full:
 		if (NUMA_BUILD)
 			zlc_mark_zone_full(zonelist, z);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..ae8882b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -50,6 +50,8 @@
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
 #include <linux/migrate.h>
+#include <linux/ftrace.h>
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 
@@ -1034,6 +1036,7 @@ static int try_to_unmap_anon(struct page *page, int unlock, int migration)
 	else if (ret == SWAP_MLOCK)
 		ret = SWAP_AGAIN;	/* saw VM_LOCKED vma */
 
+	trace_mm_anon_unmap(page_to_pfn(page), ret == SWAP_SUCCESS);
 	return ret;
 }
 
@@ -1170,6 +1173,7 @@ out:
 		ret = SWAP_MLOCK;	/* actually mlocked the page */
 	else if (ret == SWAP_MLOCK)
 		ret = SWAP_AGAIN;	/* saw VM_LOCKED vma */
+	trace_mm_filemap_unmap(page_to_pfn(page), ret == SWAP_SUCCESS);
 	return ret;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99155b7..cc73c89 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,9 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/ftrace.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/mm.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -414,6 +417,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
+		trace_mm_pagereclaim_pgout(page_to_pfn(page), PageAnon(page));
 		return PAGE_SUCCESS;
 	}
 
@@ -765,6 +769,7 @@ free_it:
 			__pagevec_free(&freed_pvec);
 			pagevec_reinit(&freed_pvec);
 		}
+		trace_mm_pagereclaim_free(page_to_pfn(page), PageAnon(page));
 		continue;
 
 cull_mlocked:
@@ -781,10 +786,12 @@ activate_locked:
 		VM_BUG_ON(PageActive(page));
 		SetPageActive(page);
 		pgactivate++;
+		trace_mm_pagereclaim_shrinkinactive_i2a(page_to_pfn(page));
 keep_locked:
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);
+		trace_mm_pagereclaim_shrinkinactive_i2i(page_to_pfn(page));
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 	list_splice(&ret_pages, page_list);
@@ -1177,6 +1184,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 done:
 	local_irq_enable();
 	pagevec_release(&pvec);
+	trace_mm_pagereclaim_shrinkinactive(nr_reclaimed, file, priority);
 	return nr_reclaimed;
 }
 
@@ -1254,6 +1262,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 
 		if (unlikely(!page_evictable(page, NULL))) {
 			putback_lru_page(page);
+			trace_mm_pagereclaim_shrinkactive_a2a(page_to_pfn(page));
 			continue;
 		}
 
@@ -1263,6 +1272,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 			pgmoved++;
 
 		list_add(&page->lru, &l_inactive);
+		trace_mm_pagereclaim_shrinkactive_a2i(page_to_pfn(page));
 	}
 
 	/*
@@ -1311,6 +1321,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	if (buffer_heads_over_limit)
 		pagevec_strip(&pvec);
 	pagevec_release(&pvec);
+	trace_mm_pagereclaim_shrinkactive(pgscanned, file, priority);
 }
 
 static int inactive_anon_is_low_global(struct zone *zone)
@@ -1511,6 +1522,7 @@ static void shrink_zone(int priority, struct zone *zone,
 	}
 
 	sc->nr_reclaimed = nr_reclaimed;
+	trace_mm_pagereclaim_shrinkzone(nr_reclaimed);
 
 	/*
 	 * Even if we did not try to evict anon pages at all, we want to
@@ -1571,6 +1583,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 							priority);
 		}
 
+		trace_mm_directreclaim_reclaimall(priority);
 		shrink_zone(priority, zone, sc);
 	}
 }
@@ -1942,6 +1955,7 @@ out:
 		goto loop_again;
 	}
 
+	trace_mm_kswapd_runs(sc.nr_reclaimed);
 	return sc.nr_reclaimed;
 }
 
@@ -2294,7 +2308,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
+	int priority = ZONE_RECLAIM_PRIORITY;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2360,6 +2374,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
 	p->reclaim_state = NULL;
 	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+	trace_mm_directreclaim_reclaimzone(sc.nr_reclaimed, priority);
 	return sc.nr_reclaimed >= nr_pages;
 }
 
-- 
1.5.5.1


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <email@kvack.org>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 20:53           ` Andrew Morton
@ 2009-08-05  7:53             ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-08-05  7:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: penberg, a.p.zijlstra, fweisbec, rostedt, mel, lwoodman, riel,
	peterz, linux-kernel, linux-mm


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 4 Aug 2009 22:35:26 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Did you never want to see whether firefox is leaking [any sort of] 
> > memory, and if yes, on what callsites? Try something like on an 
> > already running firefox context:
> > 
> >   perf stat -e kmem:mm_page_alloc \
> >             -e kmem:mm_pagevec_free \
> >             -e kmem:mm_page_free_direct \
> >      -p $(pidof firefox-bin) sleep 10
> > 
> > ... and "perf record" for the specific callsites.
> 
> OK, that would be useful.  What does the output look like?

I suspect Mel's output is an even better example.

> In what way is it superior to existing ways of finding leaks?

It's barely useful in this form - i just demoed the capability. perf 
stat is not a 'leak finding' special-purpose tool, but a generic 
tool that i used for this purpose as well, on an ad-hoc basis.

Tools that can be used in unexpected but still useful ways tend to 
be the best ones.

The kind of information these tracepoints expose, combined with the 
sampling and analysis features of perfcounters is the most 
high-quality information one can get about the page allocator IMO.

This is my general point: instead of wasting time and effort 
extending derived information, why not expose the core information? 
When the tracepoints are off there is essentially no overhead. 
(which is an added benefit - all the /proc/vmstat bits are collected 
unconditionally and then have to be summed up from all cpus when 
read out.)

> > this perf stuff is immensely flexible and a very unixish 
> > abstraction. The perf.data contains timestamped trace entries of 
> > page allocations and freeing done.
> > 
> > [...]
> > > It would be nice to at least partially remove the vmstat/meminfo 
> > > infrastructure but I don't think we can do that?
> > 
> > at least meminfo is an ABI for sure - vmstat too really.
> > 
> > But we can stop adding new fields into obsolete, inflexible and 
> > clearly deficient interfaces, and we can standardize new 
> > instrumentation to use modern instrumentation facilities - i.e. 
> > tracepoints and perfcounters.
> 
> That's bad.  Is there really no way in which we can consolidate 
> _any_ of that infrastructure?  We just pile in new stuff alongside 
> the old?
> 
> The worst part is needing two unrelated sets of userspace tools to 
> access basically-identical things.

We certainly should expose the full set of information to the new 
facility, so that it's self-sufficient and does not have to go 
digging in /proc for odd bits here and there (in various ad-hoc 
formats).

Above i'm arguing that since the old bits are an ABI, they should be 
kept but not extended.

btw., this is why i was resisting ad-hoc hacks like kpageflags. 
Those special-purpose instrumentation ABIs are hard to get rid of, 
and they come nowhere close to the utility of the real thing.

	Ingo


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-05  7:41           ` Ingo Molnar
@ 2009-08-05  9:07             ` Mel Gorman
  2009-08-05  9:16               ` Ingo Molnar
  2009-08-05 10:27               ` Johannes Weiner
  0 siblings, 2 replies; 31+ messages in thread
From: Mel Gorman @ 2009-08-05  9:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Rik van Riel, lwoodman, peterz, linux-kernel,
	linux-mm, Peter Zijlstra, Steven Rostedt, Fr?d?ric Weisbecker

On Wed, Aug 05, 2009 at 09:41:03AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> [...]
> > > Is there a plan to add the rest later on?
> > 
> > Depending on how this goes, I will attempt to do a similar set of 
> > trace points for tracking kswapd and direct reclaim with the view 
> > to identifying when stalls occur due to reclaim, when lumpy 
> > reclaim is kicking in, how long it's taken and how often is 
> > succeeds/fails.
> > 
> > > Or are these nine more a proof-of-concept demonstration-code 
> > > thing?  If so, is it expected that developers will do an ad-hoc 
> > > copy-n-paste to solve a particular short-term problem and will 
> > > then toss the tracepoint away?  I guess that could be useful, 
> > > although you can do the same with vmstat.
> > 
> > Adding and deleting tracepoints, rebuilding and rebooting the 
> > kernel is obviously usable by developers but not a whole pile of 
> > use if recompiling the kernel is not an option or you're trying to 
> > debug a difficult-to-reproduce-but-is-happening-now type of 
> > problem.
> > 
> > Of the CC list, I believe Larry Woodman has the most experience 
> > with these sort of problems in the field so I'm hoping he'll make 
> > some sort of comment.
> 
> Yes. FYI, Larry's last set of patches (which Andrew essentially 
> NAK-ed) can be found attached below.
> 

I was made aware of that patch after V1 of this patchset and brought the
naming scheme more in line with Larry's. It's still up in the air what the
proper naming scheme should be. I went with mm_page* as the prefix which
I'm reasonably happy with but I've been hit on the nose with a rolled up
newspaper over naming before.

I also decided to just deal with the page allocator and not the MM as a whole
figuring that reviewing all MM tracepoints at the same time would be too much
to chew on and decide "are these the right tracepoints?". My expectation is
that there would need to be at least one set per heading:

o page allocator
  subsys: kmem
  prefix: mm_page*
  example use: estimate zone lock contention

o slab allocator (already done)
  subsys: kmem
  prefix: kmem_* (although this wasn't consistent, e.g. kmalloc vs kmem_kmalloc)
  example use: measure allocation times for slab, slub, slqb

o high-level reclaim, kswapd wakeups, direct reclaim, lumpy triggers
  subsys: vmscan
  prefix: mm_vmscan*
  example use: estimate memory pressure

o low-level reclaim, list rotations, pages scanned, types of pages moving etc.
  subsys: vmscan
  prefix: mm_vmscan*
  example use: debugging VM tunables such as swappiness, or why kswapd is so active

The following might also be useful for kernel developers but maybe less
useful in general so would be harder to justify.

o fault activity, anon, file, swap ins/outs 
o page cache activity
o readahead
o VM/FS, writeback, pdflush
o hugepage reservations, pool activity, faulting
o hotplug
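To make the intended usage of these families concrete: once such events exist,
they can be toggled at runtime through the ftrace debugfs interface. This is
only a sketch; it assumes a kernel carrying the patches in this thread and
debugfs mounted in the usual place, and uses the kmem event names quoted
elsewhere in the discussion:

```shell
# Sketch only: requires root, a kernel with these tracepoints, and debugfs.
mount -t debugfs none /sys/kernel/debug 2>/dev/null

# Enable the page-allocator events (kmem subsystem, mm_page* prefix);
# '>' replaces the current set, '>>' appends further events.
echo kmem:mm_page_alloc       >  /sys/kernel/debug/tracing/set_event
echo kmem:mm_page_free_direct >> /sys/kernel/debug/tracing/set_event
echo kmem:mm_pagevec_free     >> /sys/kernel/debug/tracing/set_event

# Stream events as they arrive
cat /sys/kernel/debug/tracing/trace_pipe
```

When the events are disabled again (echo an empty string into set_event), the
tracepoints cost essentially nothing, which is the point being argued against
the always-on vmstat counters.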

> My general impression is that these things are very clearly useful, 
> but that it would also be nice to see a more structured plan about 
> what we want to instrument in the MM and what not so that a general 
> decision can be made instead of a creeping stream of ad-hoc 
> tracepoints with no end in sight.
> 
> I.e. have a full cycle set of tracepoints based on a high level 
> description - one (incomplete) sub-set i outlined here for example:
> 
>   http://lkml.org/lkml/2009/3/24/435
> 
> Adding a document about the page allocator and perhaps comment on 
> precisely what we want to trace would definitely be useful in 
> addressing Andrew's scepticism i think.
> 
> I.e. we'd have your patch in the end, but also with some feel-good 
> thoughts made about it on a higher level, so that we can be 
> reasonably sure that we have a meaningful set of tracepoints.
> 

Ok, I think I could put together such a description for the page allocator
tracepoints using the leader and your mail as starting points. I reckon the
best place for the end result would be Documentation/vm/tracepoints.txt

<Larry's patch snipped>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-05  9:07             ` Mel Gorman
@ 2009-08-05  9:16               ` Ingo Molnar
  2009-08-05 10:27               ` Johannes Weiner
  1 sibling, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2009-08-05  9:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, lwoodman, peterz, linux-kernel,
	linux-mm, Peter Zijlstra, Steven Rostedt, Frédéric Weisbecker


* Mel Gorman <mel@csn.ul.ie> wrote:

> > I.e. we'd have your patch in the end, but also with some 
> > feel-good thoughts made about it on a higher level, so that we 
> > can be reasonably sure that we have a meaningful set of 
> > tracepoints.
> 
> Ok, I think I could put together such a description for the page 
> allocator tracepoints using the leader and your mail as starting 
> points. I reckon the best place for the end result would be 
> Documentation/vm/tracepoints.txt

The canonical place for that info is Documentation/trace/ - we 
already have a collection of similar bits there:

 events.txt  kmemtrace.txt  power.txt               tracepoints.txt
 ftrace.txt  mmiotrace.txt  ring-buffer-design.txt

	Ingo


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-05  9:07             ` Mel Gorman
  2009-08-05  9:16               ` Ingo Molnar
@ 2009-08-05 10:27               ` Johannes Weiner
  2009-08-06 15:48                 ` Mel Gorman
  1 sibling, 1 reply; 31+ messages in thread
From: Johannes Weiner @ 2009-08-05 10:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, lwoodman, peterz,
	linux-kernel, linux-mm, Peter Zijlstra, Steven Rostedt,
	Frédéric Weisbecker

On Wed, Aug 05, 2009 at 10:07:43AM +0100, Mel Gorman wrote:

> I also decided to just deal with the page allocator and not the MM as a whole
> figuring that reviewing all MM tracepoints at the same time would be too much
> to chew on and decide "are these the right tracepoints?". My expectation is
> that there would need to be at least one set per heading:
> 
> o page allocator
>   subsys: kmem
>   prefix: mm_page*
>   example use: estimate zone lock contention
> 
> o slab allocator (already done)
>   subsys: kmem
>   prefix: kmem_* (although this wasn't consistent, e.g. kmalloc vs kmem_kmalloc)
>   example use: measure allocation times for slab, slub, slqb
> 
> o high-level reclaim, kswapd wakeups, direct reclaim, lumpy triggers
>   subsys: vmscan
>   prefix: mm_vmscan*
>   example use: estimate memory pressure
> 
> o low-level reclaim, list rotations, pages scanned, types of pages moving etc.
>   subsys: vmscan
>   prefix: mm_vmscan*
>   example use: debugging VM tunables such as swappiness, or why kswapd is so active
> 
> The following might also be useful for kernel developers but maybe less
> useful in general so would be harder to justify.
> 
> o fault activity, anon, file, swap ins/outs 
> o page cache activity
> o readahead
> o VM/FS, writeback, pdflush
> o hugepage reservations, pool activity, faulting
> o hotplug

Maybe if more people would tell how they currently use tracepoints in
the MM we can find some common ground on what can be useful to more
than one person and why?

FWIW, I recently started using tracepoints at the following places for
looking at swap code behaviour:

	o swap slot alloc/free	[type, offset]
	o swap slot read/write	[type, offset]
	o swapcache add/delete	[type, offset]
	o swap fault/evict	[page->mapping, page->index, type, offset]

This gives detail beyond vmstat's possibilities at the cost of 8 lines
of trace_swap_foo() distributed over 5 files.

I have not aggregated the output so far, just looked at the raw data
and enjoyed reading how the swap slot allocator behaves in reality
(you can probably integrate the traces into snapshots of the whole
swap space layout), what load behaviour triggers insane swap IO
patterns, in what context is readahead reading the wrong pages etc.,
stuff you wouldn't see when starting out with statistical
aggregations.
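Just to sketch what a first statistical pass over such raw data could look
like: the lines and swap_slot_* event names below are fabricated for
illustration (modeled loosely on ftrace output format), and the awk one-liner
simply counts occurrences per event:

```shell
# Fabricated raw trace lines; real input would come from trace_pipe.
cat <<'EOF' > /tmp/swap_trace.txt
kswapd0-42    [000]   101.000000: swap_slot_alloc: type=0 offset=1024
kswapd0-42    [000]   101.000010: swap_slot_alloc: type=0 offset=1025
firefox-99    [001]   102.000000: swap_slot_free: type=0 offset=1024
EOF

# Count events by name (field 4, with the trailing colon stripped)
awk '{ sub(/:$/, "", $4); count[$4]++ }
     END { for (e in count) print e, count[e] }' /tmp/swap_trace.txt | sort
```

On the sample above this prints one line per event name with its count, which
is the kind of aggregation a post-processing script could grow out of.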

Now, these data are pretty specialized and probably only few people
will make use of them, but OTOH, the cost they impose on the traced
code is so minuscule that it would be a much greater pain to 1) know
about and find third party patches and 2) apply, possibly forward-port
third party patches.

	Hannes


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 20:35         ` Ingo Molnar
  2009-08-04 20:53           ` Andrew Morton
@ 2009-08-05 13:04           ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2009-08-05 13:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, penberg, fweisbec, rostedt, mel, lwoodman, riel,
	linux-kernel, linux-mm

On Tue, 2009-08-04 at 22:35 +0200, Ingo Molnar wrote:

> Did you never want to see whether firefox is leaking [any sort of] 
> memory, and if yes, on what callsites? Try something like on an 
> already running firefox context:
> 
>   perf stat -e kmem:mm_page_alloc \
>             -e kmem:mm_pagevec_free \
>             -e kmem:mm_page_free_direct \
>      -p $(pidof firefox-bin) sleep 10
> 
> .... and "perf record" for the specific callsites.

If these tracepoints were to use something like (not yet in mainline)

  TP_perf_assign(
 	__perf_data(obj);
  ),

Where obj was the thing being allocated/freed, you could, when using
PERF_SAMPLE_ADDR even match up alloc/frees, combined with
PERF_SAMPLE_CALLCHAIN you could then figure out where the unmatched
entries came from.

Might be useful, dunno.
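The matching described above can be mocked up in userspace even without the
TP_perf_assign support: pair each alloc with its free by address and whatever
remains unmatched is a leak candidate. A toy sketch over fabricated trace
lines (event names taken from this thread, addresses invented):

```shell
# Fabricated events; real records would carry PERF_SAMPLE_ADDR data.
cat <<'EOF' > /tmp/page_events.txt
mm_page_alloc: page=0xffffea0000abc000 order=0
mm_page_alloc: page=0xffffea0000def000 order=0
mm_page_free_direct: page=0xffffea0000abc000 order=0
EOF

# Track live allocations keyed by page address; frees remove entries,
# so anything left at the end was allocated but never freed.
awk '$1 == "mm_page_alloc:" { live[$2] = NR }
     $1 ~ /free/            { delete live[$2] }
     END { for (p in live) print "unmatched:", p }' /tmp/page_events.txt
```

With callchain samples attached to the surviving records, this is exactly the
"where did the unmatched entries come from" question.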



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 20:48         ` Mel Gorman
  2009-08-05  7:41           ` Ingo Molnar
@ 2009-08-05 14:53           ` Larry Woodman
  2009-08-06 15:54             ` Mel Gorman
  1 sibling, 1 reply; 31+ messages in thread
From: Larry Woodman @ 2009-08-05 14:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, mingo, peterz, linux-kernel,
	linux-mm

On Tue, 2009-08-04 at 21:48 +0100, Mel Gorman wrote:

> > 
> 
> Adding and deleting tracepoints, rebuilding and rebooting the kernel is
> obviously usable by developers but not a whole pile of use if
> recompiling the kernel is not an option or you're trying to debug a
> difficult-to-reproduce-but-is-happening-now type of problem.
> 
> Of the CC list, I believe Larry Woodman has the most experience with
> these sort of problems in the field so I'm hoping he'll make some sort
> of comment.
> 

I am all for adding tracepoints that eliminate the need to locate a
problem, add debug code, rebuild, reboot and retest until the real
problem is found.  

Personally I have not seen as many problems in the page allocator as I
have in the page reclaim code; that's why the majority of my tracepoints
were in vmscan.c.  However, I do ACK this patch set because it provides
the opportunity to zoom into the page allocator dynamically without
needing to iterate through the cumbersome debug process.

Larry
 




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 19:57     ` Ingo Molnar
  2009-08-04 20:18       ` Andrew Morton
@ 2009-08-05 14:53       ` Valdis.Kletnieks
  2009-08-06 15:50       ` Mel Gorman
  2 siblings, 0 replies; 31+ messages in thread
From: Valdis.Kletnieks @ 2009-08-05 14:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Pekka Enberg, Peter Zijlstra,
	Frédéric Weisbecker, Steven Rostedt, Mel Gorman,
	Larry Woodman, riel, Peter Zijlstra, LKML, linux-mm

[-- Attachment #1: Type: text/plain, Size: 354 bytes --]

On Tue, 04 Aug 2009 21:57:17 +0200, Ingo Molnar said:

> Let me demonstrate these features in action (i've applied the 
> patches for testing to -tip):
> 
> First, discovery/enumeration of available counters can be done via 
> 'perf list':

Woo hoo! A perf cheat sheet! perf's usability just went up 110%, at least
for me.

Thanks for the clear demo. ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 20:18       ` Andrew Morton
  2009-08-04 20:35         ` Ingo Molnar
@ 2009-08-05 15:07         ` Valdis.Kletnieks
  1 sibling, 0 replies; 31+ messages in thread
From: Valdis.Kletnieks @ 2009-08-05 15:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, penberg, a.p.zijlstra, fweisbec, rostedt, mel,
	lwoodman, riel, peterz, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

On Tue, 04 Aug 2009 13:18:18 PDT, Andrew Morton said:

> As usual, we're adding tracepoints because we feel we must add
> tracepoints, not because anyone has a need for the data which they
> gather.

One of the strong points of the Solaris 'dtrace' is that the kernel comes
pre-instrumented with zillions of tracepoints, including a lot that don't
seem to have very much application - just so they're already in place in
case you hit some weird issue and need the tracepoint for an ad-crock dtrace
script to debug something.  So when I'm trying to diagnose why my backup
server suddenly got sluggish 3 terabytes into a 5 terabyte backup, and it
looks like some weird fiberchannel issue, I can collect data without having
to reboot to install a tracepoint (which would lose the backup, and possibly
reset the issue or otherwise make it go into hiding).

Just sayin'. :)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-05 10:27               ` Johannes Weiner
@ 2009-08-06 15:48                 ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-08-06 15:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Ingo Molnar, Andrew Morton, Rik van Riel, lwoodman, peterz,
	linux-kernel, linux-mm, Peter Zijlstra, Steven Rostedt,
	Frédéric Weisbecker

On Wed, Aug 05, 2009 at 12:27:50PM +0200, Johannes Weiner wrote:
> On Wed, Aug 05, 2009 at 10:07:43AM +0100, Mel Gorman wrote:
> 
> > I also decided to just deal with the page allocator and not the MM as a whole
> > figuring that reviewing all MM tracepoints at the same time would be too much
> > to chew on and decide "are these the right tracepoints?". My expectation is
> > that there would need to be at least one set per heading:
> > 
> > o page allocator
> >   subsys: kmem
> >   prefix: mm_page*
> >   example use: estimate zone lock contention
> > 
> > o slab allocator (already done)
> >   subsys: kmem
> >   prefix: kmem_* (although this wasn't consistent, e.g. kmalloc vs kmem_kmalloc)
> >   example use: measure allocation times for slab, slub, slqb
> > 
> > o high-level reclaim, kswapd wakeups, direct reclaim, lumpy triggers
> >   subsys: vmscan
> >   prefix: mm_vmscan*
> >   example use: estimate memory pressure
> > 
> > o low-level reclaim, list rotations, pages scanned, types of pages moving etc.
> >   subsys: vmscan
> >   prefix: mm_vmscan*
> >   example use: debugging VM tunables such as swappiness, or why kswapd is so active
> > 
> > The following might also be useful for kernel developers but maybe less
> > useful in general so would be harder to justify.
> > 
> > o fault activity, anon, file, swap ins/outs 
> > o page cache activity
> > o readahead
> > o VM/FS, writeback, pdflush
> > o hugepage reservations, pool activity, faulting
> > o hotplug
> 
> Maybe if more people would tell how they currently use tracepoints in
> the MM we can find some common ground on what can be useful to more
> than one person and why?
> 

Not a bad plan at all. I've added a patch describing the kmem trace points
and some notes on how they might be used.

> FWIW, I recently started using tracepoints at the following places for
> looking at swap code behaviour:
> 
> 	o swap slot alloc/free	[type, offset]
> 	o swap slot read/write	[type, offset]
> 	o swapcache add/delete	[type, offset]
> 	o swap fault/evict	[page->mapping, page->index, type, offset]
> 
> This gives detail beyond vmstat's possibilities at the cost of 8 lines
> of trace_swap_foo() distributed over 5 files.
> 
> I have not aggregated the output so far, just looked at the raw data
> and enjoyed reading how the swap slot allocator behaves in reality
> (you can probably integrate the traces into snapshots of the whole
> swap space layout), 

Can seekwatcher also show the access pattern for swap? Whether it can or not,
you could use points like that to show what correlation, if any, there is
between location on swap and process ownership.

> what load behaviour triggers insane swap IO
> patterns, in what context is readahead reading the wrong pages etc.,
> stuff you wouldn't see when starting out with statistical
> aggregations.
> 
> Now, these data are pretty specialized and probably only few people
> will make use of them, but OTOH, the cost they impose on the traced
> code is so minuscule that it would be a much greater pain to 1) know
> about and find third party patches and 2) apply, possibly forward-port
> third party patches.

Somewhat agreed although without seeing the tracepoints and thinking
about how they might be used, I can't say much further.

I think the next round of patches might give a reasonable template on how
tracepoints can be proposed, reviewed and justified.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-04 19:57     ` Ingo Molnar
  2009-08-04 20:18       ` Andrew Morton
  2009-08-05 14:53       ` Valdis.Kletnieks
@ 2009-08-06 15:50       ` Mel Gorman
  2 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-08-06 15:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Pekka Enberg, Peter Zijlstra, Frédéric Weisbecker,
	Steven Rostedt, Larry Woodman, riel, Peter Zijlstra, LKML,
	linux-mm

On Tue, Aug 04, 2009 at 09:57:17PM +0200, Ingo Molnar wrote:

> <SNIP>
> 
> Let me demonstrate these features in action (i've applied the 
> patches for testing to -tip):
> 

Nice demo!

I blatantly stole the guts of your mail and made it part of a more general
document on using tracepoints for analysis. It will be included as a patch
with the next set.

> <SNIP>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
  2009-08-05 14:53           ` Larry Woodman
@ 2009-08-06 15:54             ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2009-08-06 15:54 UTC (permalink / raw)
  To: Larry Woodman
  Cc: Andrew Morton, Rik van Riel, mingo, peterz, linux-kernel,
	linux-mm

On Wed, Aug 05, 2009 at 10:53:50AM -0400, Larry Woodman wrote:
> On Tue, 2009-08-04 at 21:48 +0100, Mel Gorman wrote:
> 
> > > 
> > 
> > Adding and deleting tracepoints, rebuilding and rebooting the kernel is
> > obviously usable by developers but not a whole pile of use if
> > recompiling the kernel is not an option or you're trying to debug a
> > difficult-to-reproduce-but-is-happening-now type of problem.
> > 
> > Of the CC list, I believe Larry Woodman has the most experience with
> > these sort of problems in the field so I'm hoping he'll make some sort
> > of comment.
> > 
> 
> I am all for adding tracepoints that eliminate the need to locate a
> problem, add debug code, rebuild, reboot and retest until the real
> problem is found.  
> 
> Personally I have not seen as many problems in the page allocator as I
> have in the page reclaim code; that's why the majority of my tracepoints
> were in vmscan.c.

I'd be surprised if you had; problems in page reclaim would be a lot
more obvious, for a start. The page allocator happened to be where I wanted
tracepoints at the moment and I think the next patchset will act as a template
for how to introduce tracepoints which can be repeated for the reclaim points.

> However I do ACK this patch set because it provides
> the opportunity to zoom into the page allocator dynamically without
> needing to iterate through the cumbersome debug process.
> 

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2009-08-06 15:54 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
2009-07-30  0:55   ` Rik van Riel
2009-07-29 21:05 ` [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
2009-07-30  1:39   ` Rik van Riel
2009-07-29 21:05 ` [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
2009-07-30 13:43   ` Rik van Riel
2009-07-29 21:05 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
2009-07-30 13:45   ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2009-08-04 18:12 [PATCH 0/4] Add some trace events for the page allocator v3 Mel Gorman
2009-08-04 18:12 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
2009-08-04 18:22   ` Andrew Morton
2009-08-04 18:27     ` Rik van Riel
2009-08-04 19:13       ` Andrew Morton
2009-08-04 20:48         ` Mel Gorman
2009-08-05  7:41           ` Ingo Molnar
2009-08-05  9:07             ` Mel Gorman
2009-08-05  9:16               ` Ingo Molnar
2009-08-05 10:27               ` Johannes Weiner
2009-08-06 15:48                 ` Mel Gorman
2009-08-05 14:53           ` Larry Woodman
2009-08-06 15:54             ` Mel Gorman
2009-08-04 19:57     ` Ingo Molnar
2009-08-04 20:18       ` Andrew Morton
2009-08-04 20:35         ` Ingo Molnar
2009-08-04 20:53           ` Andrew Morton
2009-08-05  7:53             ` Ingo Molnar
2009-08-05 13:04           ` Peter Zijlstra
2009-08-05 15:07         ` Valdis.Kletnieks
2009-08-05 14:53       ` Valdis.Kletnieks
2009-08-06 15:50       ` Mel Gorman
2009-08-05  3:07     ` KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).