* [RFC PATCH 0/4] Add some trace events for the page allocator v2
@ 2009-07-29 21:05 Mel Gorman
2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
` (3 more replies)
0 siblings, 4 replies; 15+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
Cc: LKML, linux-mm, Mel Gorman
In this version, I switched the CC list to match who Larry Woodman mailed
for his "mm tracepoints" patch, which I wasn't previously aware of. I also
brought the naming scheme more in line with Larry's, as his was very
sensible.
This patchset only considers the page-allocator-related events instead of the
much more comprehensive approach Larry took. I included a post-processing
script because Andrew's main complaint with Larry's work, as I saw it, was
the lack of tools that could give a higher-level view of what was going
on. If this works out, the other mm tracepoints can be dealt with in
piecemeal chunks.
Changelog since V1
o Fix minor formatting error for the __rmqueue event
o Add event for __pagevec_free
o Bring naming more in line with Larry Woodman's tracing patch
o Add an example post-processing script for the trace events
The following four patches add some trace events for the page allocator
under the heading of kmem (pagealloc heading instead?).
Patch 1 adds events for plain old allocate and freeing of pages
Patch 2 gives information useful for analysing fragmentation avoidance
Patch 3 tracks pages going to and from the buddy lists as an indirect
indication of zone lock hotness
Patch 4 adds a post-processing script that aggregates the events to
give a higher-level view
The first one could be used as an indicator of whether the workload was
heavily dependent on the page allocator or not. You can make a guess based
on vmstat but you can't get a per-process breakdown. Depending on the call
path, the call_site for page allocation may be __get_free_pages() instead
of a useful callsite. Instead of passing down a return address similar to
slab debugging, the user should enable the stacktrace and sym-addr options
to get a proper stack trace.
The second patch would mainly be useful for users of hugepages, particularly
dynamic hugepage pool resizing, as it could be used to tune min_free_kbytes
to a level where fragmentation is rarely a problem. My main concern is that
maybe I'm trying to jam too much into the TP_printk that could be extrapolated
after the fact by someone familiar with the implementation. I couldn't decide
whether it was best to hold the administrator's hand even if it costs more
to work it out at trace time.
The third patch is trickier to draw conclusions from but high activity on
those events could explain why there were a large number of cache misses
on a page-allocator-intensive workload. The coalescing and splitting of
buddies involves a lot of writing of page metadata and cache line bounces
not to mention the acquisition of an interrupt-safe lock necessary to enter
this path.
The fourth patch parses the trace buffer to draw a higher-level picture of
what is going on broken down on a per-process basis.
All comments indicating whether this is generally useful and how it might
be improved are welcome.
* [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
2009-07-30 0:55 ` Rik van Riel
2009-07-29 21:05 ` [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
` (2 subsequent siblings)
3 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
Cc: LKML, linux-mm, Mel Gorman
This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs. Using the events, it can be determined
which struct pages and pfns are being allocated and freed and, in many
cases, what the call site was.
The page alloc tracepoints can be used as an indicator of whether the
workload was heavily dependent on the page allocator or not. You can make
a guess based on vmstat but you can't get a per-process breakdown. Depending
on the call path, the call_site for page allocation may be __get_free_pages()
instead of a useful callsite. Instead of passing down a return address similar
to slab debugging, the user should enable the stacktrace and sym-addr options
to get a proper stack trace.
The pagevec free tracepoint has a different usecase. It can be used to get
an idea of how many pages are being dumped off the LRU and whether it is
kswapd doing the work or a process doing direct reclaim.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/kmem.h | 86 +++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 6 ++-
2 files changed, 91 insertions(+), 1 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 1493c54..57bf13c 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -225,6 +225,92 @@ TRACE_EVENT(kmem_cache_free,
TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
);
+
+TRACE_EVENT(mm_page_free_direct,
+
+ TP_PROTO(unsigned long call_site, const void *page, unsigned int order),
+
+ TP_ARGS(call_site, page, order),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, call_site )
+ __field( const void *, page )
+ __field( unsigned int, order )
+ ),
+
+ TP_fast_assign(
+ __entry->call_site = call_site;
+ __entry->page = page;
+ __entry->order = order;
+ ),
+
+ TP_printk("call_site=%lx page=%p pfn=%lu order=%d",
+ __entry->call_site,
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order)
+);
+
+TRACE_EVENT(mm_pagevec_free,
+
+ TP_PROTO(unsigned long call_site, const void *page, int order, int cold),
+
+ TP_ARGS(call_site, page, order, cold),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, call_site )
+ __field( const void *, page )
+ __field( int, order )
+ __field( int, cold )
+ ),
+
+ TP_fast_assign(
+ __entry->call_site = call_site;
+ __entry->page = page;
+ __entry->order = order;
+ __entry->cold = cold;
+ ),
+
+ TP_printk("call_site=%lx page=%p pfn=%lu order=%d cold=%d",
+ __entry->call_site,
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order,
+ __entry->cold)
+);
+
+TRACE_EVENT(mm_page_alloc,
+
+ TP_PROTO(unsigned long call_site, const void *page, unsigned int order,
+ gfp_t gfp_flags, int migratetype),
+
+ TP_ARGS(call_site, page, order, gfp_flags, migratetype),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, call_site )
+ __field( const void *, page )
+ __field( unsigned int, order )
+ __field( gfp_t, gfp_flags )
+ __field( int, migratetype )
+ ),
+
+ TP_fast_assign(
+ __entry->call_site = call_site;
+ __entry->page = page;
+ __entry->order = order;
+ __entry->gfp_flags = gfp_flags;
+ __entry->migratetype = migratetype;
+ ),
+
+ TP_printk("call_site=%lx page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
+ __entry->call_site,
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order,
+ __entry->migratetype,
+ show_gfp_flags(__entry->gfp_flags))
+);
+
#endif /* _TRACE_KMEM_H */
/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index caa9268..6cd8730 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1894,6 +1894,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ trace_mm_page_alloc(_RET_IP_, page, order, gfp_mask, migratetype);
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -1934,12 +1935,15 @@ void __pagevec_free(struct pagevec *pvec)
{
int i = pagevec_count(pvec);
- while (--i >= 0)
+ while (--i >= 0) {
+ trace_mm_pagevec_free(_RET_IP_, pvec->pages[i], 0, pvec->cold);
free_hot_cold_page(pvec->pages[i], pvec->cold);
+ }
}
void __free_pages(struct page *page, unsigned int order)
{
+ trace_mm_page_free_direct(_RET_IP_, page, order);
if (put_page_testzero(page)) {
if (order == 0)
free_hot_page(page);
--
1.6.3.3
* [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
2009-07-30 1:39 ` Rik van Riel
2009-07-29 21:05 ` [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
2009-07-29 21:05 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
3 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
Cc: LKML, linux-mm, Mel Gorman
Fragmentation avoidance depends on being able to use free pages from
lists of the appropriate migrate type. In the event this is not
possible, __rmqueue_fallback() selects a different list and in some
circumstances changes the migratetype of the pageblock. Simplistically,
the more times this event occurs, the more likely it is that fragmentation
will be a problem later, for hugepage allocation at least, but there are
other considerations such as the order of the page being split to satisfy
the allocation.
This patch adds a trace event for __rmqueue_fallback() that reports what
page is being used for the fallback, the orders of relevant pages, the
desired migratetype and the migratetype of the lists being used, whether
the pageblock changed type and whether this event is important with
respect to fragmentation avoidance or not. This information can be used
to help analyse fragmentation avoidance and help decide whether
min_free_kbytes should be increased or not.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/kmem.h | 44 +++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 6 +++++
2 files changed, 50 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 57bf13c..0b4002e 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -311,6 +311,50 @@ TRACE_EVENT(mm_page_alloc,
show_gfp_flags(__entry->gfp_flags))
);
+TRACE_EVENT(mm_page_alloc_extfrag,
+
+ TP_PROTO(const void *page,
+ int alloc_order, int fallback_order,
+ int alloc_migratetype, int fallback_migratetype,
+ int fragmenting, int change_ownership),
+
+ TP_ARGS(page,
+ alloc_order, fallback_order,
+ alloc_migratetype, fallback_migratetype,
+ fragmenting, change_ownership),
+
+ TP_STRUCT__entry(
+ __field( const void *, page )
+ __field( int, alloc_order )
+ __field( int, fallback_order )
+ __field( int, alloc_migratetype )
+ __field( int, fallback_migratetype )
+ __field( int, fragmenting )
+ __field( int, change_ownership )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->alloc_order = alloc_order;
+ __entry->fallback_order = fallback_order;
+ __entry->alloc_migratetype = alloc_migratetype;
+ __entry->fallback_migratetype = fallback_migratetype;
+ __entry->fragmenting = fragmenting;
+ __entry->change_ownership = change_ownership;
+ ),
+
+ TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->alloc_order,
+ __entry->fallback_order,
+ pageblock_order,
+ __entry->alloc_migratetype,
+ __entry->fallback_migratetype,
+ __entry->fragmenting,
+ __entry->change_ownership)
+);
+
#endif /* _TRACE_KMEM_H */
/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6cd8730..8113403 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -839,6 +839,12 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
start_migratetype);
expand(zone, page, order, current_order, area, migratetype);
+
+ trace_mm_page_alloc_extfrag(page, order, current_order,
+ start_migratetype, migratetype,
+ current_order < pageblock_order,
+ migratetype == start_migratetype);
+
return page;
}
}
--
1.6.3.3
* [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
2009-07-29 21:05 ` [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
2009-07-30 13:43 ` Rik van Riel
2009-07-29 21:05 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
3 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
Cc: LKML, linux-mm, Mel Gorman
The page allocation trace event reports that a page was successfully allocated
but it does not specify where it came from. When analysing performance,
it can be important to distinguish between pages coming from the per-cpu
allocator and pages coming from the buddy lists as the latter requires the
zone lock to be taken and more data structures to be examined.
This patch adds a trace event for __rmqueue reporting when a page is being
allocated from the buddy lists. It distinguishes between being called
to refill the per-cpu lists and being called for a high-order allocation.
Similarly, this patch adds an event to catch when the PCP lists are being
drained a little and pages are going back to the buddy lists.
This is trickier to draw conclusions from but high activity on those
events could explain why there were a large number of cache misses on a
page-allocator-intensive workload. The coalescing and splitting of buddies
involves a lot of writing of page metadata and cache line bounces not to
mention the acquisition of an interrupt-safe lock necessary to enter this
path.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/kmem.h | 54 +++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 2 +
2 files changed, 56 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 0b4002e..3be3df3 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -311,6 +311,60 @@ TRACE_EVENT(mm_page_alloc,
show_gfp_flags(__entry->gfp_flags))
);
+TRACE_EVENT(mm_page_alloc_zone_locked,
+
+ TP_PROTO(const void *page, unsigned int order,
+ int migratetype, int percpu_refill),
+
+ TP_ARGS(page, order, migratetype, percpu_refill),
+
+ TP_STRUCT__entry(
+ __field( const void *, page )
+ __field( unsigned int, order )
+ __field( int, migratetype )
+ __field( int, percpu_refill )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ __entry->percpu_refill = percpu_refill;
+ ),
+
+ TP_printk("page=%p pfn=%lu order=%u migratetype=%d percpu_refill=%d",
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order,
+ __entry->migratetype,
+ __entry->percpu_refill)
+);
+
+TRACE_EVENT(mm_page_pcpu_drain,
+
+ TP_PROTO(const void *page, int order, int migratetype),
+
+ TP_ARGS(page, order, migratetype),
+
+ TP_STRUCT__entry(
+ __field( const void *, page )
+ __field( int, order )
+ __field( int, migratetype )
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ ),
+
+ TP_printk("page=%p pfn=%lu order=%d migratetype=%d",
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order,
+ __entry->migratetype)
+);
+
TRACE_EVENT(mm_page_alloc_extfrag,
TP_PROTO(const void *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8113403..1bcef16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -535,6 +535,7 @@ static void free_pages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_one_page list manipulates */
list_del(&page->lru);
+ trace_mm_page_pcpu_drain(page, order, page_private(page));
__free_one_page(page, zone, order, page_private(page));
}
spin_unlock(&zone->lock);
@@ -878,6 +879,7 @@ retry_reserve:
}
}
+ trace_mm_page_alloc_zone_locked(page, order, migratetype, order == 0);
return page;
}
--
1.6.3.3
* [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
2009-07-29 21:05 [RFC PATCH 0/4] Add some trace events for the page allocator v2 Mel Gorman
` (2 preceding siblings ...)
2009-07-29 21:05 ` [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
@ 2009-07-29 21:05 ` Mel Gorman
2009-07-30 13:45 ` Rik van Riel
3 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2009-07-29 21:05 UTC (permalink / raw)
To: Larry Woodman, riel, Ingo Molnar, Peter Zijlstra
Cc: LKML, linux-mm, Mel Gorman
This patch adds a simple post-processing script for the page-allocator-related
trace events. It can be used to give an indication of who the most
allocator-intensive processes are and how often the zone lock was taken
during the tracing period. Example output looks like
find-2840
o pages allocd = 1877
o pages allocd under lock = 1817
o pages freed directly = 9
o pcpu refills = 1078
o migrate fallbacks = 48
- fragmentation causing = 48
- severe = 46
- moderate = 2
- changed migratetype = 7
The high number of fragmentation events was because 32 dd processes were
running at the same time under qemu with limited memory and the standard
min_free_kbytes, so it's not a surprising outcome.
The postprocessor parses the text output of tracing. While there is a binary
format, the expectation is that the binary output can be readily translated
into text and post-processed offline. Obviously if the text format
changes, the parser will break but the regular expression parser is
fairly rudimentary so should be readily adjustable.
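For illustration, a trace line in the format the parser expects would look
something like the following (the process name, addresses, pfn and timestamp
are made-up values, shown only to document the expected layout):

	find-2840  [001]  76.946940: mm_page_alloc: call_site=ffffffff810a1b2c page=ffff880026368a28 pfn=156520 order=0 migratetype=0 gfp_flags=GFP_KERNEL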
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
.../postprocess/trace-pagealloc-postprocess.pl | 131 ++++++++++++++++++++
1 files changed, 131 insertions(+), 0 deletions(-)
create mode 100755 Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100755
index 0000000..d4332c3
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,131 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Copyright (c) Mel Gorman 2009
+use Switch;
+use strict;
+
+my $traceevent;
+my %perprocess;
+
+while ($traceevent = <>) {
+ my $process_pid;
+ my $cpus;
+ my $timestamp;
+ my $tracepoint;
+ my $details;
+
+ # (process_pid) (cpus ) ( time ) (tpoint ) (details)
+ if ($traceevent =~ /\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)/) {
+ $process_pid = $1;
+ $cpus = $2;
+ $timestamp = $3;
+ $tracepoint = $4;
+ $details = $5;
+
+ } else {
+ next;
+ }
+
+ switch ($tracepoint) {
+ case "mm_page_alloc" {
+ $perprocess{$process_pid}->{"mm_page_alloc"}++;
+ }
+ case "mm_page_free_direct" {
+ $perprocess{$process_pid}->{"mm_page_free_direct"}++;
+ }
+ case "mm_pagevec_free" {
+ $perprocess{$process_pid}->{"mm_pagevec_free"}++;
+ }
+ case "mm_page_pcpu_drain" {
+ $perprocess{$process_pid}->{"mm_page_pcpu_drain"}++;
+ $perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"}++;
+ }
+ case "mm_page_alloc_zone_locked" {
+ $perprocess{$process_pid}->{"mm_page_alloc_zone_locked"}++;
+ $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"}++;
+ }
+ case "mm_page_alloc_extfrag" {
+ $perprocess{$process_pid}->{"mm_page_alloc_extfrag"}++;
+ my ($page, $pfn);
+ my ($alloc_order, $fallback_order, $pageblock_order);
+ my ($alloc_migratetype, $fallback_migratetype);
+ my ($fragmenting, $change_ownership);
+
+ $details =~ /page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([0-9]*) fallback_order=([0-9]*) pageblock_order=([0-9]*) alloc_migratetype=([0-9]*) fallback_migratetype=([0-9]*) fragmenting=([0-9]) change_ownership=([0-9])/;
+ $page = $1;
+ $pfn = $2;
+ $alloc_order = $3;
+ $fallback_order = $4;
+ $pageblock_order = $5;
+ $alloc_migratetype = $6;
+ $fallback_migratetype = $7;
+ $fragmenting = $8;
+ $change_ownership = $9;
+
+ if ($fragmenting) {
+ $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting"}++;
+ if ($fallback_order <= 3) {
+ $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-severe"}++;
+ } else {
+ $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-moderate"}++;
+ }
+ }
+ if ($change_ownership) {
+ $perprocess{$process_pid}->{"mm_page_alloc_extfrag-changetype"}++;
+ }
+ }
+ else {
+ $perprocess{$process_pid}->{"unknown"}++;
+ }
+ }
+
+ # Catch a full pcpu drain event
+ if ($perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} &&
+ $tracepoint ne "mm_page_pcpu_drain") {
+
+ $perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"}++;
+ $perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} = 0;
+ }
+
+ # Catch a full pcpu refill event
+ if ($perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} &&
+ $tracepoint ne "mm_page_alloc_zone_locked") {
+ $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"}++;
+ $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} = 0;
+ }
+}
+
+# Dump per-process stats
+my $process_pid;
+foreach $process_pid (keys %perprocess) {
+ # Dump final aggregates
+ if ($perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"}) {
+ $perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"}++;
+ $perprocess{$process_pid}->{"mm_page_pcpu_drain-pagesdrained"} = 0;
+ }
+ if ($perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"}) {
+ $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"}++;
+ $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-pagesrefilled"} = 0;
+ }
+
+ my %process = $perprocess{$process_pid};
+ printf("$process_pid\n");
+ printf(" o pages allocd = %d\n", $perprocess{$process_pid}->{"mm_page_alloc"});
+ printf(" o pages allocd under lock = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_zone_locked"});
+ printf(" o pages freed directly = %d\n", $perprocess{$process_pid}->{"mm_page_free_direct"});
+ printf(" o pages freed via pagevec = %d\n", $perprocess{$process_pid}->{"mm_pagevec_free"});
+ printf(" o pcpu pages drained = %d\n", $perprocess{$process_pid}->{"mm_page_pcpu_drain"});
+ printf(" o pcpu drains = %d\n", $perprocess{$process_pid}->{"mm_page_pcpu_drain-drains"});
+ printf(" o pcpu refills = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_zone_locked-refills"});
+ printf(" o migrate fallbacks = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag"});
+ printf(" - fragmentation causing = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting"});
+ printf(" - severe = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-severe"});
+ printf(" - moderate = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-fragmenting-moderate"});
+ printf(" - changed migratetype = %d\n", $perprocess{$process_pid}->{"mm_page_alloc_extfrag-changetype"});
+ printf(" o unknown events = %d\n", $perprocess{$process_pid}->{"unknown"});
+ printf("\n");
+}
--
1.6.3.3
* Re: [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-07-29 21:05 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-07-30 0:55 ` Rik van Riel
0 siblings, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2009-07-30 0:55 UTC (permalink / raw)
To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm
Mel Gorman wrote:
> This patch adds trace events for the allocation and freeing of pages,
> including the freeing of pagevecs. Using the events, it can be determined
> which struct pages and pfns are being allocated and freed and, in many
> cases, what the call site was.
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
2009-07-29 21:05 ` [PATCH 2/4] tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes Mel Gorman
@ 2009-07-30 1:39 ` Rik van Riel
0 siblings, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2009-07-30 1:39 UTC (permalink / raw)
To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm
Mel Gorman wrote:
> Fragmentation avoidance depends on being able to use free pages from
> lists of the appropriate migrate type. In the event this is not
> possible, __rmqueue_fallback() selects a different list and in some
> circumstances changes the migratetype of the pageblock. Simplistically,
> the more times this event occurs, the more likely it is that fragmentation
> will be a problem later, for hugepage allocation at least, but there are
> other considerations such as the order of the page being split to satisfy
> the allocation.
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists
2009-07-29 21:05 ` [PATCH 3/4] tracing, page-allocator: Add trace event for page traffic related to the buddy lists Mel Gorman
@ 2009-07-30 13:43 ` Rik van Riel
0 siblings, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2009-07-30 13:43 UTC (permalink / raw)
To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm
Mel Gorman wrote:
> The page allocation trace event reports that a page was successfully allocated
> but it does not specify where it came from. When analysing performance,
> it can be important to distinguish between pages coming from the per-cpu
> allocator and pages coming from the buddy lists as the latter requires the
> zone lock to be taken and more data structures to be examined.
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* Re: [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
2009-07-29 21:05 ` [PATCH 4/4] tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events Mel Gorman
@ 2009-07-30 13:45 ` Rik van Riel
0 siblings, 0 replies; 15+ messages in thread
From: Rik van Riel @ 2009-07-30 13:45 UTC (permalink / raw)
To: Mel Gorman; +Cc: Larry Woodman, Ingo Molnar, Peter Zijlstra, LKML, linux-mm
Mel Gorman wrote:
> This patch adds a simple post-processing script for the page-allocator-related
> trace events. It can be used to give an indication of who the most
> allocator-intensive processes are and how often the zone lock was taken
> during the tracing period. Example output looks like
>
> find-2840
> o pages allocd = 1877
> o pages allocd under lock = 1817
> o pages freed directly = 9
> o pcpu refills = 1078
> o migrate fallbacks = 48
> - fragmentation causing = 48
> - severe = 46
> - moderate = 2
> - changed migratetype = 7
I like it.
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
* [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-04 18:12 [PATCH 0/4] Add some trace events for the page allocator v3 Mel Gorman
@ 2009-08-04 18:12 ` Mel Gorman
2009-08-05 9:13 ` KOSAKI Motohiro
0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2009-08-04 18:12 UTC (permalink / raw)
To: Larry Woodman, Andrew Morton
Cc: riel, Ingo Molnar, Peter Zijlstra, LKML, linux-mm, Mel Gorman
This patch adds trace events for the allocation and freeing of pages,
including the freeing of pagevecs. Using the events, it can be determined
which struct pages and pfns are being allocated and freed and, in many
cases, what the call site was.
The page alloc tracepoints can be used as an indicator of whether the
workload was heavily dependent on the page allocator or not. You can make
a guess based on vmstat but you can't get a per-process breakdown. Depending
on the call path, the call_site for page allocation may be __get_free_pages()
instead of a useful callsite. Instead of passing down a return address similar
to slab debugging, the user should enable the stacktrace and sym-addr options
to get a proper stack trace.
The pagevec free tracepoint has a different usecase. It can be used to get
an idea of how many pages are being dumped off the LRU and whether it is
kswapd doing the work or a process doing direct reclaim.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/trace/events/kmem.h | 86 +++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 6 ++-
2 files changed, 91 insertions(+), 1 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 1493c54..57bf13c 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -225,6 +225,92 @@ TRACE_EVENT(kmem_cache_free,
TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
);
+
+TRACE_EVENT(mm_page_free_direct,
+
+ TP_PROTO(unsigned long call_site, const void *page, unsigned int order),
+
+ TP_ARGS(call_site, page, order),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, call_site )
+ __field( const void *, page )
+ __field( unsigned int, order )
+ ),
+
+ TP_fast_assign(
+ __entry->call_site = call_site;
+ __entry->page = page;
+ __entry->order = order;
+ ),
+
+ TP_printk("call_site=%lx page=%p pfn=%lu order=%d",
+ __entry->call_site,
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order)
+);
+
+TRACE_EVENT(mm_pagevec_free,
+
+ TP_PROTO(unsigned long call_site, const void *page, int order, int cold),
+
+ TP_ARGS(call_site, page, order, cold),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, call_site )
+ __field( const void *, page )
+ __field( int, order )
+ __field( int, cold )
+ ),
+
+ TP_fast_assign(
+ __entry->call_site = call_site;
+ __entry->page = page;
+ __entry->order = order;
+ __entry->cold = cold;
+ ),
+
+ TP_printk("call_site=%lx page=%p pfn=%lu order=%d cold=%d",
+ __entry->call_site,
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order,
+ __entry->cold)
+);
+
+TRACE_EVENT(mm_page_alloc,
+
+ TP_PROTO(unsigned long call_site, const void *page, unsigned int order,
+ gfp_t gfp_flags, int migratetype),
+
+ TP_ARGS(call_site, page, order, gfp_flags, migratetype),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, call_site )
+ __field( const void *, page )
+ __field( unsigned int, order )
+ __field( gfp_t, gfp_flags )
+ __field( int, migratetype )
+ ),
+
+ TP_fast_assign(
+ __entry->call_site = call_site;
+ __entry->page = page;
+ __entry->order = order;
+ __entry->gfp_flags = gfp_flags;
+ __entry->migratetype = migratetype;
+ ),
+
+ TP_printk("call_site=%lx page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
+ __entry->call_site,
+ __entry->page,
+ page_to_pfn((struct page *)__entry->page),
+ __entry->order,
+ __entry->migratetype,
+ show_gfp_flags(__entry->gfp_flags))
+);
+
#endif /* _TRACE_KMEM_H */
/* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d052abb..843bdec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1905,6 +1905,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ trace_mm_page_alloc(_RET_IP_, page, order, gfp_mask, migratetype);
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -1945,13 +1946,16 @@ void __pagevec_free(struct pagevec *pvec)
{
int i = pagevec_count(pvec);
- while (--i >= 0)
+ while (--i >= 0) {
+ trace_mm_pagevec_free(_RET_IP_, pvec->pages[i], 0, pvec->cold);
free_hot_cold_page(pvec->pages[i], pvec->cold);
+ }
}
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
+ trace_mm_page_free_direct(_RET_IP_, page, order);
if (order == 0)
free_hot_page(page);
else
--
1.6.3.3
* Re: [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-04 18:12 ` [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing Mel Gorman
@ 2009-08-05 9:13 ` KOSAKI Motohiro
2009-08-05 9:40 ` Mel Gorman
0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2009-08-05 9:13 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Larry Woodman, Andrew Morton, riel, Ingo Molnar,
Peter Zijlstra, LKML, linux-mm
Hi
sorry for the delayed review.
> This patch adds trace events for the allocation and freeing of pages,
> including the freeing of pagevecs. Using the events, it can be determined
> which struct pages and pfns are being allocated and freed and, in many
> cases, what the call site was.
>
> The page alloc tracepoints can be used as an indicator of whether the
> workload was heavily dependent on the page allocator or not. You can make
> a guess based on vmstat but you can't get a per-process breakdown. Depending
> on the call path, the call_site for page allocation may be __get_free_pages()
> instead of a useful callsite. Instead of passing down a return address similar
> to slab debugging, the user should enable the stacktrace and sym-addr options
> to get a proper stack trace.
>
> The pagevec free tracepoint has a different usecase. It can be used to get
> an idea of how many pages are being dumped off the LRU and whether it is
> kswapd doing the work or a process doing direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> include/trace/events/kmem.h | 86 +++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 6 ++-
> 2 files changed, 91 insertions(+), 1 deletions(-)
>
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index 1493c54..57bf13c 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -225,6 +225,92 @@ TRACE_EVENT(kmem_cache_free,
>
> TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
> );
> +
> +TRACE_EVENT(mm_page_free_direct,
> +
> + TP_PROTO(unsigned long call_site, const void *page, unsigned int order),
> +
> + TP_ARGS(call_site, page, order),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, call_site )
> + __field( const void *, page )
Why void? Is there any benefit?
> + __field( unsigned int, order )
> + ),
> +
> + TP_fast_assign(
> + __entry->call_site = call_site;
> + __entry->page = page;
> + __entry->order = order;
> + ),
> +
> + TP_printk("call_site=%lx page=%p pfn=%lu order=%d",
> + __entry->call_site,
> + __entry->page,
> + page_to_pfn((struct page *)__entry->page),
> + __entry->order)
> +);
> +
> +TRACE_EVENT(mm_pagevec_free,
> +
> + TP_PROTO(unsigned long call_site, const void *page, int order, int cold),
> +
> + TP_ARGS(call_site, page, order, cold),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, call_site )
> + __field( const void *, page )
> + __field( int, order )
> + __field( int, cold )
> + ),
> +
> + TP_fast_assign(
> + __entry->call_site = call_site;
> + __entry->page = page;
> + __entry->order = order;
> + __entry->cold = cold;
> + ),
> +
> + TP_printk("call_site=%lx page=%p pfn=%lu order=%d cold=%d",
> + __entry->call_site,
> + __entry->page,
> + page_to_pfn((struct page *)__entry->page),
> + __entry->order,
> + __entry->cold)
> +);
> +
> +TRACE_EVENT(mm_page_alloc,
> +
> + TP_PROTO(unsigned long call_site, const void *page, unsigned int order,
> + gfp_t gfp_flags, int migratetype),
> +
> + TP_ARGS(call_site, page, order, gfp_flags, migratetype),
> +
> + TP_STRUCT__entry(
> + __field( unsigned long, call_site )
> + __field( const void *, page )
> + __field( unsigned int, order )
> + __field( gfp_t, gfp_flags )
> + __field( int, migratetype )
> + ),
> +
> + TP_fast_assign(
> + __entry->call_site = call_site;
> + __entry->page = page;
> + __entry->order = order;
> + __entry->gfp_flags = gfp_flags;
> + __entry->migratetype = migratetype;
> + ),
> +
> + TP_printk("call_site=%lx page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
> + __entry->call_site,
> + __entry->page,
> + page_to_pfn((struct page *)__entry->page),
> + __entry->order,
> + __entry->migratetype,
> + show_gfp_flags(__entry->gfp_flags))
> +);
> +
> #endif /* _TRACE_KMEM_H */
>
> /* This part must be outside protection */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d052abb..843bdec 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1905,6 +1905,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> zonelist, high_zoneidx, nodemask,
> preferred_zone, migratetype);
>
> + trace_mm_page_alloc(_RET_IP_, page, order, gfp_mask, migratetype);
> return page;
> }
In almost all cases, __alloc_pages_nodemask() is called from alloc_pages_current().
Can you add a call_site argument? (like slab_alloc)
> EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -1945,13 +1946,16 @@ void __pagevec_free(struct pagevec *pvec)
> {
> int i = pagevec_count(pvec);
>
> - while (--i >= 0)
> + while (--i >= 0) {
> + trace_mm_pagevec_free(_RET_IP_, pvec->pages[i], 0, pvec->cold);
> free_hot_cold_page(pvec->pages[i], pvec->cold);
> + }
> }
This _RET_IP_ assumes pagevec_free() is an inlined function. Then,
should pagevec_free() also be changed to always_inline?
Yeah, I agree this is a theoretical issue, but it would improve readability
and make the author's intention clear.
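For reference, pagevec_free() is a static inline wrapper in
include/linux/pagevec.h, roughly like the following sketch:

	/*
	 * Minimal sketch of the existing wrapper. Because it is inlined
	 * into its caller, _RET_IP_ inside __pagevec_free() reports the
	 * caller of pagevec_free(). If the compiler ever emitted it out
	 * of line, _RET_IP_ would point at pagevec_free() itself, hence
	 * the suggestion to mark it always_inline to document the
	 * assumption.
	 */
	static inline void pagevec_free(struct pagevec *pvec)
	{
		if (pagevec_count(pvec))
			__pagevec_free(pvec);
	}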
> void __free_pages(struct page *page, unsigned int order)
> {
> if (put_page_testzero(page)) {
> + trace_mm_page_free_direct(_RET_IP_, page, order);
> if (order == 0)
> free_hot_page(page);
> else
This patch covers the free_pages() and __pagevec_free() cases,
but it doesn't cover direct free_hot_page() calls.
(Fortunately, there are no free_cold_page() callers)
* Re: [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-05 9:13 ` KOSAKI Motohiro
@ 2009-08-05 9:40 ` Mel Gorman
2009-08-07 1:17 ` KOSAKI Motohiro
0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2009-08-05 9:40 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Larry Woodman, Andrew Morton, riel, Ingo Molnar, Peter Zijlstra,
LKML, linux-mm
On Wed, Aug 05, 2009 at 06:13:09PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> sorry for the delayed review.
>
> > This patch adds trace events for the allocation and freeing of pages,
> > including the freeing of pagevecs. Using the events, it can be determined
> > which struct pages and pfns are being allocated and freed and, in many
> > cases, what the call site was.
> >
> > The page alloc tracepoints can be used as an indicator of whether the
> > workload was heavily dependent on the page allocator or not. You can make
> > a guess based on vmstat but you can't get a per-process breakdown. Depending
> > on the call path, the call_site for page allocation may be __get_free_pages()
> > instead of a useful callsite. Instead of passing down a return address similar
> > to slab debugging, the user should enable the stacktrace and sym-addr options
> > to get a proper stack trace.
> >
> > The pagevec free tracepoint has a different usecase. It can be used to get
> > an idea of how many pages are being dumped off the LRU and whether it is
> > kswapd doing the work or a process doing direct reclaim.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > include/trace/events/kmem.h | 86 +++++++++++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 6 ++-
> > 2 files changed, 91 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> > index 1493c54..57bf13c 100644
> > --- a/include/trace/events/kmem.h
> > +++ b/include/trace/events/kmem.h
> > @@ -225,6 +225,92 @@ TRACE_EVENT(kmem_cache_free,
> >
> > TP_printk("call_site=%lx ptr=%p", __entry->call_site, __entry->ptr)
> > );
> > +
> > +TRACE_EVENT(mm_page_free_direct,
> > +
> > + TP_PROTO(unsigned long call_site, const void *page, unsigned int order),
> > +
> > + TP_ARGS(call_site, page, order),
> > +
> > + TP_STRUCT__entry(
> > + __field( unsigned long, call_site )
> > + __field( const void *, page )
>
> Why void? Is there any benefit?
>
No real benefit, I'll switch to struct page *. I thought at one point it was
failing to compile as struct page * was not in scope but that must have been
my imagination.
> > + __field( unsigned int, order )
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->call_site = call_site;
> > + __entry->page = page;
> > + __entry->order = order;
> > + ),
> > +
> > + TP_printk("call_site=%lx page=%p pfn=%lu order=%d",
> > + __entry->call_site,
> > + __entry->page,
> > + page_to_pfn((struct page *)__entry->page),
> > + __entry->order)
> > +);
> > +
> > +TRACE_EVENT(mm_pagevec_free,
> > +
> > + TP_PROTO(unsigned long call_site, const void *page, int order, int cold),
> > +
> > + TP_ARGS(call_site, page, order, cold),
> > +
> > + TP_STRUCT__entry(
> > + __field( unsigned long, call_site )
> > + __field( const void *, page )
> > + __field( int, order )
> > + __field( int, cold )
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->call_site = call_site;
> > + __entry->page = page;
> > + __entry->order = order;
> > + __entry->cold = cold;
> > + ),
> > +
> > + TP_printk("call_site=%lx page=%p pfn=%lu order=%d cold=%d",
> > + __entry->call_site,
> > + __entry->page,
> > + page_to_pfn((struct page *)__entry->page),
> > + __entry->order,
> > + __entry->cold)
> > +);
> > +
> > +TRACE_EVENT(mm_page_alloc,
> > +
> > + TP_PROTO(unsigned long call_site, const void *page, unsigned int order,
> > + gfp_t gfp_flags, int migratetype),
> > +
> > + TP_ARGS(call_site, page, order, gfp_flags, migratetype),
> > +
> > + TP_STRUCT__entry(
> > + __field( unsigned long, call_site )
> > + __field( const void *, page )
> > + __field( unsigned int, order )
> > + __field( gfp_t, gfp_flags )
> > + __field( int, migratetype )
> > + ),
> > +
> > + TP_fast_assign(
> > + __entry->call_site = call_site;
> > + __entry->page = page;
> > + __entry->order = order;
> > + __entry->gfp_flags = gfp_flags;
> > + __entry->migratetype = migratetype;
> > + ),
> > +
> > + TP_printk("call_site=%lx page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s",
> > + __entry->call_site,
> > + __entry->page,
> > + page_to_pfn((struct page *)__entry->page),
> > + __entry->order,
> > + __entry->migratetype,
> > + show_gfp_flags(__entry->gfp_flags))
> > +);
> > +
> > #endif /* _TRACE_KMEM_H */
> >
> > /* This part must be outside protection */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d052abb..843bdec 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1905,6 +1905,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > zonelist, high_zoneidx, nodemask,
> > preferred_zone, migratetype);
> >
> > + trace_mm_page_alloc(_RET_IP_, page, order, gfp_mask, migratetype);
> > return page;
> > }
>
> In almost all cases, __alloc_pages_nodemask() is called from alloc_pages_current().
> Can you add a call_site argument? (like slab_alloc)
>
In the NUMA case, this will be true but addressing it involves passing down
an additional argument in the non-tracing case which I wanted to avoid.
As the stacktrace option is available to ftrace, I think I'll drop call_site
altogether as anyone who really needs that information has options.
> > EXPORT_SYMBOL(__alloc_pages_nodemask);
> > @@ -1945,13 +1946,16 @@ void __pagevec_free(struct pagevec *pvec)
> > {
> > int i = pagevec_count(pvec);
> >
> > - while (--i >= 0)
> > + while (--i >= 0) {
> > + trace_mm_pagevec_free(_RET_IP_, pvec->pages[i], 0, pvec->cold);
> > free_hot_cold_page(pvec->pages[i], pvec->cold);
> > + }
> > }
>
> This _RET_IP_ assumes pagevec_free() is an inlined function. Then,
> should pagevec_free() also be changed to always_inline?
>
There is an assumption being made about the inlining all right.
> Yeah, I agree this is a theoretical issue, but it would improve readability
> and make the author's intention clear.
>
If call_site persists, I'll do this but the next version of the patchset
is likely to drop call_site.
> > void __free_pages(struct page *page, unsigned int order)
> > {
> > if (put_page_testzero(page)) {
> > + trace_mm_page_free_direct(_RET_IP_, page, order);
> > if (order == 0)
> > free_hot_page(page);
> > else
>
> This patch covers the free_pages() and __pagevec_free() cases,
> but it doesn't cover direct free_hot_page() calls.
>
> (Fortunately, there are no free_cold_page() callers)
>
Good spot. free_cold_page() is dead code but I'll duplicate the
trace_mm_page_free_direct event for now and look at cleaning out
free_cold_page(). Thanks
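A minimal sketch of what that duplication might look like, assuming the
free_hot_page() definition of the time; this is hypothetical, not taken
from the actual series:

	/*
	 * Hypothetical sketch only: one way to duplicate the event for
	 * direct free_hot_page() callers. _RET_IP_ works here because
	 * free_hot_page() is a real (non-inlined) function, so it
	 * reports the direct caller.
	 */
	void free_hot_page(struct page *page)
	{
		trace_mm_page_free_direct(_RET_IP_, page, 0);
		free_hot_cold_page(page, 0);
	}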
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-05 9:40 ` Mel Gorman
@ 2009-08-07 1:17 ` KOSAKI Motohiro
2009-08-07 17:31 ` Mel Gorman
0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2009-08-07 1:17 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Larry Woodman, Andrew Morton, riel, Ingo Molnar,
Peter Zijlstra, LKML, linux-mm
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index d052abb..843bdec 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1905,6 +1905,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > > zonelist, high_zoneidx, nodemask,
> > > preferred_zone, migratetype);
> > >
> > > + trace_mm_page_alloc(_RET_IP_, page, order, gfp_mask, migratetype);
> > > return page;
> > > }
> >
> > In almost all cases, __alloc_pages_nodemask() is called from alloc_pages_current().
> > Can you add a call_site argument? (like slab_alloc)
> >
>
> In the NUMA case, this will be true but addressing it involves passing down
> an additional argument in the non-tracing case which I wanted to avoid.
> As the stacktrace option is available to ftrace, I think I'll drop call_site
> altogether as anyone who really needs that information has options.
Instead, can we move this tracepoint to alloc_pages_current(), alloc_pages_node() et al?
In the page tracking case, call_site information is one of the most frequently used pieces.
If we need to combine multiple trace events, it becomes harder to use and reduces the usefulness a bit.
* Re: [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-07 1:17 ` KOSAKI Motohiro
@ 2009-08-07 17:31 ` Mel Gorman
2009-08-08 5:44 ` KOSAKI Motohiro
0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2009-08-07 17:31 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Larry Woodman, Andrew Morton, riel, Ingo Molnar, Peter Zijlstra,
LKML, linux-mm
On Fri, Aug 07, 2009 at 10:17:57AM +0900, KOSAKI Motohiro wrote:
>
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index d052abb..843bdec 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1905,6 +1905,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > > > zonelist, high_zoneidx, nodemask,
> > > > preferred_zone, migratetype);
> > > >
> > > > + trace_mm_page_alloc(_RET_IP_, page, order, gfp_mask, migratetype);
> > > > return page;
> > > > }
> > >
> > > In almost all cases, __alloc_pages_nodemask() is called from alloc_pages_current().
> > > Can you add a call_site argument? (like slab_alloc)
> > >
> >
> > In the NUMA case, this will be true but addressing it involves passing down
> > an additional argument in the non-tracing case which I wanted to avoid.
> > As the stacktrace option is available to ftrace, I think I'll drop call_site
> > altogether as anyone who really needs that information has options.
>
> Instead, can we move this tracepoint to alloc_pages_current(), alloc_pages_node() et al?
> In the page tracking case, call_site information is one of the most frequently used pieces.
> If we need to combine multiple trace events, it becomes harder to use and reduces the usefulness a bit.
>
Ok, let's think about that. The potential points that would need
annotation are
o alloc_pages_current
o alloc_page_vma
o alloc_pages_node
o alloc_pages_exact_node
The inlined functions that call those and should preserve the call_site
are
o alloc_pages
The slightly lower functions they call are as follows. These cannot
trigger a tracepoint event because it would look like a duplicate.
o __alloc_pages_nodemask
- called by __alloc_pages
o __alloc_pages
- called by alloc_page_interleave() but event logged
- called by alloc_pages_node but event logged
- called by alloc_pages_exact_node but event logged
The more problematic ones are
o __get_free_pages
o get_zeroed_page
o alloc_pages_exact
These are all real functions that call down to functions that would log
events already based on your suggestion - alloc_pages_current() in
particular.
Looking at it, it would appear the page allocator API would need a fair
amount of reshuffling to preserve call_site and not duplicate events or
else to pass call_site down through the API even in the non-tracing case.
Minimally, that makes it a standalone patch but it would also need a good
explanation as to why capturing the stack trace on the event is not enough
to track the page for things like catching memory leaks.
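To make the layering concrete, a simplified sketch of one of the inline
wrappers involved, based on the gfp.h of the time; this is not a patch,
just an illustration of why logging in both a wrapper and
__alloc_pages_nodemask() would emit two events for one allocation:

	/* Simplified sketch of the alloc_pages_node() wrapper */
	static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
	{
		/* Unknown node is current node */
		if (nid < 0)
			nid = numa_node_id();

		/* calls down to __alloc_pages_nodemask() internally */
		return __alloc_pages(gfp_mask, order,
					node_zonelist(nid, gfp_mask));
	}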
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 1/4] tracing, page-allocator: Add trace events for page allocation and page freeing
2009-08-07 17:31 ` Mel Gorman
@ 2009-08-08 5:44 ` KOSAKI Motohiro
0 siblings, 0 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2009-08-08 5:44 UTC (permalink / raw)
To: Mel Gorman
Cc: Larry Woodman, Andrew Morton, riel, Ingo Molnar, Peter Zijlstra,
LKML, linux-mm
>> > In the NUMA case, this will be true but addressing it involves passing down
>> > an additional argument in the non-tracing case which I wanted to avoid.
>> > As the stacktrace option is available to ftrace, I think I'll drop call_site
>> > altogether as anyone who really needs that information has options.
>>
>> Instead, can we move this tracepoint to alloc_pages_current(), alloc_pages_node() et al?
>> In the page tracking case, call_site information is one of the most frequently used pieces.
>> If we need to combine multiple trace events, it becomes harder to use and reduces the usefulness a bit.
>>
>
> Ok, let's think about that. The potential points that would need
> annotation are
>
> o alloc_pages_current
> o alloc_page_vma
> o alloc_pages_node
> o alloc_pages_exact_node
>
> The inlined functions that call those and should preserve the call_site
> are
>
> o alloc_pages
>
> The slightly lower functions they call are as follows. These cannot
> trigger a tracepoint event because it would look like a duplicate.
>
> o __alloc_pages_nodemask
> - called by __alloc_pages
> o __alloc_pages
> - called by alloc_page_interleave() but event logged
> - called by alloc_pages_node but event logged
> - called by alloc_pages_exact_node but event logged
>
> The more problematic ones are
>
> o __get_free_pages
> o get_zeroed_page
> o alloc_pages_exact
>
> These are all real functions that call down to functions that would log
> events already based on your suggestion - alloc_pages_current() in
> particular.
>
> Looking at it, it would appear the page allocator API would need a fair
> amount of reshuffling to preserve call_site and not duplicate events or
> else to pass call_site down through the API even in the non-tracing case.
> Minimally, that makes it a standalone patch but it would also need a good
> explanation as to why capturing the stack trace on the event is not enough
> to track the page for things like catching memory leaks.
I agree this needs some cleanup.
I think I can do that, and I agree with you.