[RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal


* [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
@ 2010-02-05  2:17 Keiichi KII
  2010-02-05  2:24 ` [RFC PATCH -tip 1/2 v3] tracepoints: add tracepoints for pagecache Keiichi KII
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Keiichi KII @ 2010-02-05  2:17 UTC (permalink / raw)
  To: linux-kernel, mingo
  Cc: lwoodman, linux-mm, Tom Zanussi, riel, rostedt, akpm, fweisbec,
	Munehiro Ikeda, Atsushi Tsuji, Keiichi KII

Hello,

This is v3 of a patchset to add some tracepoints for pagecache.

I would like to propose several tracepoints for tracing pagecache behavior,
along with scripts that use them.
With the tracepoints and the scripts together, we can analyze pagecache behavior
such as usage or hit ratio at fine granularity, e.g. per process or per file.
Example output of the scripts looks like:

[process list]
o yum-3215
                          cache find  cache hit  cache hit
        device      inode      count      count      ratio
  --------------------------------------------------------
         253:0         16      34434      34130     99.12%
         253:0        198       9692       9463     97.64%
         253:0        639        647        628     97.06%
         253:0        778         32         29     90.62%
         253:0       7305      50225      49005     97.57%
         253:0     144217         12         10     83.33%
         253:0     262775         16         13     81.25%
*snip*

-------------------------------------------------------------------------------

[file list]
        device              cached
     (maj:min)      inode    pages
  --------------------------------
         253:0         16     5752
         253:0        198     2233
         253:0        639       51
         253:0        778       86
         253:0       7305    12307
         253:0     144217       11
         253:0     262775       39
*snip*

[process list]
o yum-3215
        device              cached    added  removed      indirect
     (maj:min)      inode    pages    pages    pages removed pages
  ----------------------------------------------------------------
         253:0         16    34130     5752        0             0
         253:0        198     9463     2233        0             0
         253:0        639      628       51        0             0
         253:0        778       29       78        0             0
         253:0       7305    49005    12307        0             0
         253:0     144217       10       11        0             0
         253:0     262775       13       39        0             0
*snip*
  ----------------------------------------------------------------
  total:                    102346    26165        1             0

System-wide pagecache usage is already available from /proc/meminfo.
But there is no way to get finer-grained information, such as per-file or
per-process usage.
A process may share pagecache pages with other processes, add pages to the
pagecache, or remove pages from it.
If the pagecache miss ratio rises, it may lead to extra I/O and
degrade system performance.

So, by using the tracepoints we can get the following information:
 1. how many pagecache pages each process has per file
 2. how many pages are cached per file
 3. how many pagecache pages each process shares
 4. how often each process adds/removes pagecache pages
 5. how long a page stays in the pagecache
 6. pagecache hit ratio per file

In particular, monitoring pagecache usage per file and the pagecache hit
ratio would help us tune applications such as databases.
It would also help us tune kernel parameters such as "vm.dirty_*".
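
As a rough illustration of item 6, the hit ratio in the example output above is
the hit count divided by the find count. The C sketch below is not part of the
patchset: the function name cache_hit_ratio is hypothetical, and returning 0
when no lookups were recorded is an assumption made here.

```c
#include <assert.h>

/*
 * Hit ratio in percent.
 * find_count: number of find_get_page lookups recorded for a file.
 * hit_count:  number of those lookups that found a page.
 */
double cache_hit_ratio(unsigned long find_count, unsigned long hit_count)
{
	/* Every hit is also a find, so hits can never exceed finds. */
	assert(hit_count <= find_count);

	/* Assumption: report 0 for "no lookups" instead of dividing by zero. */
	if (find_count == 0)
		return 0.0;

	return 100.0 * (double)hit_count / (double)find_count;
}
```

For the first row of the example output (34434 finds, 34130 hits) this gives
about 99.12%.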

Changelog since v2
  o add new script to monitor pagecache hit ratio per process.
  o use DECLARE_EVENT_CLASS

Changelog since v1
  o Add a script based on "perf trace stream scripting support".

Any comments are welcome.
--
Keiichi Kii <k-keiichi@bx.jp.nec.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [RFC PATCH -tip 1/2 v3] tracepoints: add tracepoints for pagecache
  2010-02-05  2:17 [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Keiichi KII
@ 2010-02-05  2:24 ` Keiichi KII
  2010-02-05  2:25 ` [RFC PATCH -tip 2/2 v3] add scripts for pagecache analysis per process Keiichi KII
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Keiichi KII @ 2010-02-05  2:24 UTC (permalink / raw)
  To: Keiichi KII
  Cc: linux-kernel, mingo, lwoodman, linux-mm, Tom Zanussi, riel,
	rostedt, akpm, fweisbec, Munehiro Ikeda, Atsushi Tsuji

This patch adds several tracepoints to track pagecache behavior.
These tracepoints would help us monitor pagecache usage with high resolution.

Signed-off-by: Keiichi Kii <k-keiichi@bx.jp.nec.com>
Cc: Atsushi Tsuji <a-tsuji@bk.jp.nec.com> 
---
 include/trace/events/filemap.h |   83 +++++++++++++++++++++++++++++++++++++++++
 mm/filemap.c                   |    5 ++
 mm/truncate.c                  |    2 
 mm/vmscan.c                    |    3 +
 4 files changed, 93 insertions(+)

Index: linux-2.6-tip/include/trace/events/filemap.h
===================================================================
--- /dev/null
+++ linux-2.6-tip/include/trace/events/filemap.h
@@ -0,0 +1,75 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM filemap
+
+#if !defined(_TRACE_FILEMAP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FILEMAP_H
+
+#include <linux/fs.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(find_get_page,
+
+	TP_PROTO(struct address_space *mapping, pgoff_t offset,
+		struct page *page),
+
+	TP_ARGS(mapping, offset, page),
+
+	TP_STRUCT__entry(
+		__field(dev_t, s_dev)
+		__field(ino_t, i_ino)
+		__field(pgoff_t, offset)
+		__field(struct page *, page)
+		),
+
+	TP_fast_assign(
+		__entry->s_dev = mapping->host ? mapping->host->i_sb->s_dev : 0;
+		__entry->i_ino = mapping->host ? mapping->host->i_ino : 0;
+		__entry->offset = offset;
+		__entry->page = page;
+		),
+
+	TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu %s", MAJOR(__entry->s_dev),
+		MINOR(__entry->s_dev), __entry->i_ino, __entry->offset,
+		__entry->page == NULL ? "page_not_found" : "page_found")
+);
+
+DECLARE_EVENT_CLASS(page_cache_template,
+
+	TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+	TP_ARGS(mapping, offset),
+
+	TP_STRUCT__entry(
+		__field(dev_t, s_dev)
+		__field(ino_t, i_ino)
+		__field(pgoff_t, offset)
+		),
+
+	TP_fast_assign(
+		__entry->s_dev = mapping->host->i_sb->s_dev;
+		__entry->i_ino = mapping->host->i_ino;
+		__entry->offset = offset;
+		),
+
+	TP_printk("s_dev=%u:%u i_ino=%lu offset=%lu", MAJOR(__entry->s_dev),
+		MINOR(__entry->s_dev), __entry->i_ino, __entry->offset)
+);
+
+DEFINE_EVENT(page_cache_template, add_to_page_cache,
+
+	TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+	TP_ARGS(mapping, offset)
+);
+
+DEFINE_EVENT(page_cache_template, remove_from_page_cache,
+
+	TP_PROTO(struct address_space *mapping, pgoff_t offset),
+
+	TP_ARGS(mapping, offset)
+);
+
+#endif /* _TRACE_FILEMAP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
Index: linux-2.6-tip/mm/filemap.c
===================================================================
--- linux-2.6-tip.orig/mm/filemap.c
+++ linux-2.6-tip/mm/filemap.c
@@ -34,6 +34,8 @@
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
+#define CREATE_TRACE_POINTS
+#include <trace/events/filemap.h>
 #include "internal.h"
 
 /*
@@ -149,6 +151,7 @@ void remove_from_page_cache(struct page 
 	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	trace_remove_from_page_cache(mapping, page->index);
 	mem_cgroup_uncharge_cache_page(page);
 }
 
@@ -419,6 +422,7 @@ int add_to_page_cache_locked(struct page
 			if (PageSwapBacked(page))
 				__inc_zone_page_state(page, NR_SHMEM);
 			spin_unlock_irq(&mapping->tree_lock);
+			trace_add_to_page_cache(mapping, offset);
 		} else {
 			page->mapping = NULL;
 			spin_unlock_irq(&mapping->tree_lock);
@@ -642,6 +646,7 @@ repeat:
 	}
 	rcu_read_unlock();
 
+	trace_find_get_page(mapping, offset, page);
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
Index: linux-2.6-tip/mm/truncate.c
===================================================================
--- linux-2.6-tip.orig/mm/truncate.c
+++ linux-2.6-tip/mm/truncate.c
@@ -20,6 +20,7 @@
 				   do_invalidatepage */
 #include "internal.h"
 
+#include <trace/events/filemap.h>
 
 /**
  * do_invalidatepage - invalidate part or all of a page
@@ -388,6 +389,7 @@ invalidate_complete_page2(struct address
 	BUG_ON(page_has_private(page));
 	__remove_from_page_cache(page);
 	spin_unlock_irq(&mapping->tree_lock);
+	trace_remove_from_page_cache(mapping, page->index);
 	mem_cgroup_uncharge_cache_page(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
Index: linux-2.6-tip/mm/vmscan.c
===================================================================
--- linux-2.6-tip.orig/mm/vmscan.c
+++ linux-2.6-tip/mm/vmscan.c
@@ -48,6 +48,8 @@
 
 #include "internal.h"
 
+#include <trace/events/filemap.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -477,6 +479,7 @@ static int __remove_mapping(struct addre
 	} else {
 		__remove_from_page_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
+		trace_remove_from_page_cache(mapping, page->index);
 		mem_cgroup_uncharge_cache_page(page);
 	}
 
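
For reference, the message part of a find_get_page event, as produced by the
TP_printk format in this patch, can be parsed back in userspace. The C sketch
below is an illustration only. The struct and function names are hypothetical,
and it assumes the line has already been stripped of any task/CPU/timestamp
prefix that the tracing output puts in front of the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace view of one find_get_page event. */
struct find_get_page_event {
	unsigned int major;
	unsigned int minor;
	unsigned long ino;
	unsigned long offset;
	int found;		/* 1 for "page_found", 0 for "page_not_found" */
};

/*
 * Parse "s_dev=MAJ:MIN i_ino=N offset=N page_found|page_not_found".
 * Returns 0 on success and -1 if the line does not match; on failure
 * the contents of *ev are unspecified.
 */
int parse_find_get_page(const char *line, struct find_get_page_event *ev)
{
	char status[32];

	assert(line != NULL && ev != NULL);

	if (sscanf(line, "s_dev=%u:%u i_ino=%lu offset=%lu %31s",
		   &ev->major, &ev->minor, &ev->ino, &ev->offset,
		   status) != 5)
		return -1;

	if (strcmp(status, "page_found") == 0)
		ev->found = 1;
	else if (strcmp(status, "page_not_found") == 0)
		ev->found = 0;
	else
		return -1;

	return 0;
}
```

The add_to_page_cache and remove_from_page_cache events use the same format
without the trailing status word, so a variant with four conversions would
cover them.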




* [RFC PATCH -tip 2/2 v3] add scripts for pagecache analysis per process
  2010-02-05  2:17 [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Keiichi KII
  2010-02-05  2:24 ` [RFC PATCH -tip 1/2 v3] tracepoints: add tracepoints for pagecache Keiichi KII
@ 2010-02-05  2:25 ` Keiichi KII
  2010-02-05  7:28 ` [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Ingo Molnar
  2010-02-08 13:04 ` Balbir Singh
  3 siblings, 0 replies; 17+ messages in thread
From: Keiichi KII @ 2010-02-05  2:25 UTC (permalink / raw)
  To: linux-kernel, mingo
  Cc: Keiichi KII, lwoodman, linux-mm, Tom Zanussi, riel, rostedt, akpm,
	fweisbec, Munehiro Ikeda, Atsushi Tsuji

The scripts are built on the perf trace stream scripting support.
They depend on the pagecache tracepoints and report the following:

 - pagecache hit ratio per process
 - how many pagecache pages each process has per file
 - how many pages are cached per file
 - how many pagecache pages each process shares

To monitor the pagecache hit ratio per process, run "pagecache-hit-ratio-record"
or "perf trace record pagecache-hit-ratio" to record perf data for
"pagecache-hit-ratio.pl", then run "pagecache-hit-ratio-report" or
"perf trace report pagecache-hit-ratio" to display the result.

The output below is from a sample run.

[process list]
o yum-3215
                          cache find  cache hit  cache hit
        device      inode      count      count      ratio
  --------------------------------------------------------
         253:0         16      34434      34130     99.12%
         253:0        198       9692       9463     97.64%
         253:0        639        647        628     97.06%
         253:0        778         32         29     90.62%
         253:0       7305      50225      49005     97.57%
         253:0     144217         12         10     83.33%
         253:0     262775         16         13     81.25%
*snip*

To monitor pagecache usage per process, run "pagecache-usage-record" or
"perf trace record pagecache-usage" to record perf data for
"pagecache-usage.pl", then run "pagecache-usage-report" or "perf trace report
pagecache-usage" to display the result.

The output below is from a sample run.

[file list]
        device              cached
     (maj:min)      inode    pages
  --------------------------------
         253:0         16     5752
         253:0        198     2233
         253:0        639       51
         253:0        778       86
         253:0       7305    12307
         253:0     144217       11
         253:0     262775       39
*snip*

[process list]
o yum-3215
        device              cached    added  removed      indirect
     (maj:min)      inode    pages    pages    pages removed pages
  ----------------------------------------------------------------
         253:0         16    34130     5752        0             0
         253:0        198     9463     2233        0             0
         253:0        639      628       51        0             0
         253:0        778       29       78        0             0
         253:0       7305    49005    12307        0             0
         253:0     144217       10       11        0             0
         253:0     262775       13       39        0             0
*snip*
  ----------------------------------------------------------------
  total:                    102346    26165        1             0
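
The columns of the process list can be read as follows, mirroring what
pagecache-usage.pl does with each event for one (process, file) pair. This is a
C sketch for illustration only. The struct, the function names and the fixed
MAX_OFFSETS cap are assumptions made here; the Perl script uses hashes keyed by
page offset instead, and it also updates the per-file list, which is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OFFSETS 1024	/* hypothetical cap on tracked page offsets */

/* Counters for one (process, file) pair. */
struct usage_record {
	unsigned long added[MAX_OFFSETS];	/* adds by this process, per offset */
	unsigned long cached;			/* lookups that found a page */
	unsigned long removed;			/* removals of pages this process added */
	unsigned long indirect_removed;		/* removals of pages it did not add */
};

/* add_to_page_cache event: remember that this process added the page. */
void on_add(struct usage_record *r, size_t offset)
{
	assert(offset < MAX_OFFSETS);
	r->added[offset]++;
}

/* find_get_page event: only successful lookups count as "cached pages". */
void on_find(struct usage_record *r, bool page_found)
{
	if (page_found)
		r->cached++;
}

/* remove_from_page_cache event: direct if this process ever added the page. */
void on_remove(struct usage_record *r, size_t offset)
{
	assert(offset < MAX_OFFSETS);
	if (r->added[offset] > 0)
		r->removed++;
	else
		r->indirect_removed++;
}

/* "added pages" column: total number of add events over all offsets. */
unsigned long added_pages(const struct usage_record *r)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < MAX_OFFSETS; i++)
		sum += r->added[i];
	return sum;
}
```

Note that "added pages" counts add events, not distinct pages, so a page that is
added twice by the same process contributes 2.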

From the output, we can learn things such as:

- if "added pages" > "cached pages" in the process list:
    The process is adding and removing the same pagecache pages repeatedly.
  => Bad case for pagecache usage.

- if "added pages" <= "cached pages" in the process list:
    There are no unnecessary I/O operations.
  => Good case for pagecache usage.

- if "cached pages" in the file list >
         the sum of "cached pages" for that file across the process list:
    There are unnecessary pagecache pages in memory.
  => Bad case for pagecache usage.
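
The three rules above can be written down as simple comparisons. The C sketch
below is only an illustration of those rules, not code from the scripts. The
enum, the function names and the array-based interface are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum usage_verdict {
	USAGE_GOOD,	/* "added pages" <= "cached pages" */
	USAGE_CHURN	/* "added pages" >  "cached pages" */
};

/* Rules 1 and 2: compare the two process-list columns for one file. */
enum usage_verdict classify_process_usage(unsigned long cached_pages,
					  unsigned long added_pages)
{
	return added_pages > cached_pages ? USAGE_CHURN : USAGE_GOOD;
}

/*
 * Rule 3: true if the file list shows more cached pages for a file than
 * the per-process "cached pages" values for that file add up to.
 */
bool file_has_unreferenced_cache(unsigned long file_cached_pages,
				 const unsigned long *process_cached,
				 size_t nr_processes)
{
	unsigned long sum = 0;
	size_t i;

	assert(process_cached != NULL || nr_processes == 0);

	for (i = 0; i < nr_processes; i++)
		sum += process_cached[i];
	return file_cached_pages > sum;
}
```

With the sample output, inode 778 (29 cached, 78 added) falls under the first
rule, while inode 16 (34130 cached, 5752 added) falls under the second.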

Signed-off-by: Keiichi Kii <k-keiichi@bx.jp.nec.com>
Cc: Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
---
 tools/perf/scripts/perl/bin/pagecache-hit-ratio-record |    7 
 tools/perf/scripts/perl/bin/pagecache-hit-ratio-report |    6 
 tools/perf/scripts/perl/bin/pagecache-usage-record     |    7 
 tools/perf/scripts/perl/bin/pagecache-usage-report     |    6 
 tools/perf/scripts/perl/pagecache-hit-ratio.pl         |   75 +++++++++
 tools/perf/scripts/perl/pagecache-usage.pl             |  136 +++++++++++++++++
 6 files changed, 237 insertions(+)

Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-record
@@ -0,0 +1,7 @@
+#!/bin/bash
+perf record -c 1 -f -a -M -R -e filemap:add_to_page_cache -e filemap:find_get_page -e filemap:remove_from_page_cache
+
+
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-usage-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: pagecache usage per process
+perf trace -s ~/libexec/perf-core/scripts/perl/pagecache-usage.pl
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/pagecache-usage.pl
@@ -0,0 +1,136 @@
+#!/usr/bin/perl -w
+# (C) 2010, Keiichi Kii <k-keiichi@bx.jp.nec.com>
+# Licensed under the terms of the GNU GPL License version 2
+
+# Display pagecache usage per process
+
+use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
+use lib "./Perf-Trace-Util/lib";
+use Perf::Trace::Core;
+use Perf::Trace::Context;
+use Perf::Trace::Util;
+use List::Util qw/sum/;
+
+my %files;
+my %processes;
+my %records;
+
+sub trace_end
+{
+	print_pagecache_usage_per_file();
+	print "\n";
+	print_pagecache_usage_per_process();
+}
+
+sub filemap::remove_from_page_cache
+{
+	my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+	    $common_pid, $common_comm,
+	    $s_dev, $i_ino, $offset) = @_;
+	my ($f, $r) = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+	delete $$f{$offset};
+	if (defined $$r{added}{$offset}) {
+	    $$r{removed}++;
+	} else {
+	    $$r{indirect_removed}++;
+	}
+}
+
+sub filemap::add_to_page_cache
+{
+	my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+	    $common_pid, $common_comm,
+	    $s_dev, $i_ino, $offset) = @_;
+	my ($f, $r) = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+	$$f{$offset}++;
+	$$r{added}{$offset}++;
+}
+
+sub filemap::find_get_page
+{
+	my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+	    $common_pid, $common_comm,
+	    $s_dev, $i_ino, $offset, $page) = @_;
+	my ($f, $r) = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+	if ($page != 0) {
+	    $$f{$offset}++;
+	    $$r{cached}++;
+	}
+}
+
+sub get_record
+{
+	my ($p, $dev, $inode) = @_;
+
+	unless (defined($files{$dev}{$inode})) {
+	    $files{$dev}{$inode} = {};
+	}
+	my $f = $files{$dev}{$inode};
+	unless (defined($records{$p}{$f})) {
+	    $records{$p}{$f} =
+	      {inode => $inode, dev => $dev, added => {},
+	       cached => 0, removed => 0, indirect_removed => 0};
+	}
+	return $f, $records{$p}{$f};
+}
+
+sub minor
+{
+	my $dev = shift;
+	return $dev & ((1 << 20) - 1);
+}
+
+sub major
+{
+	my $dev = shift;
+	return $dev >> 20;
+}
+
+sub print_pagecache_usage_per_file
+{
+	print "[file list]\n";
+	printf("  %12s %10s %8s\n", "device", "", "cached");
+	printf("  %12s %10s %8s\n", "(maj:min)", "inode", "pages");
+	printf("  %s\n", '-' x 32);
+	while(my($dev, $file) = each(%files)) {
+	    foreach my $inode (sort { $a <=> $b } keys %$file) {
+		my $count = values %{$$file{$inode}};
+		next if $count == 0;
+		printf("  %12s %10d %8d\n",
+		       major($dev).":".minor($dev), $inode, $count);
+	    }
+	}
+}
+
+sub print_pagecache_usage_per_process
+{
+	print "[process list]\n";
+	while(my ($pid, $v) = each(%records)) {
+	    my ($sum_cached, $sum_added, $sum_removed, $sum_indirect_removed);
+	    print "o $pid\n";
+	    printf("  %12s %10s %8s %8s %8s %13s\n", "device", "",
+		   "cached", "added", "removed", "indirect");
+	    printf("  %12s %10s %8s %8s %8s %13s\n", "(maj:min)", "inode",
+		   "pages", "pages", "pages", "removed pages");
+	    printf("  %s\n", '-' x 64);
+	    foreach my $r (sort { $$a{inode} <=> $$b{inode} } values %$v) {
+		my $added_num = scalar(keys %{$$r{added}}) == 0 ?
+		    0 : List::Util::sum(values %{$$r{added}});
+		$sum_cached += $$r{cached};
+		$sum_added += $added_num;
+		$sum_removed += $$r{removed};
+		$sum_indirect_removed += $$r{indirect_removed};
+		printf("  %12s %10d %8d %8d %8d %13d\n",
+		       major($$r{dev}).":".minor($$r{dev}), $$r{inode},
+		       $$r{cached}, $added_num, $$r{removed},
+		       $$r{indirect_removed});
+	    }
+	    printf("  %s\n", '-' x 64);
+	    printf("  total: %5s %10s %8d %8d %8d %13d\n", "", "", $sum_cached,
+		   $sum_added, $sum_removed, $sum_indirect_removed);
+	    print "\n";
+	}
+}
Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-record
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-record
@@ -0,0 +1,7 @@
+#!/bin/bash
+perf record -c 1 -f -a -M -R -e filemap:find_get_page
+
+
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-report
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/bin/pagecache-hit-ratio-report
@@ -0,0 +1,6 @@
+#!/bin/bash
+# description: monitor pagecache hit ratio per process
+perf trace -s ~/libexec/perf-core/scripts/perl/pagecache-hit-ratio.pl
+
+
+
Index: linux-2.6-tip/tools/perf/scripts/perl/pagecache-hit-ratio.pl
===================================================================
--- /dev/null
+++ linux-2.6-tip/tools/perf/scripts/perl/pagecache-hit-ratio.pl
@@ -0,0 +1,75 @@
+#!/usr/bin/perl -w
+# (C) 2010, Keiichi Kii <k-keiichi@bx.jp.nec.com>
+# Licensed under the terms of the GNU GPL License version 2
+
+# Display pagecache hit ratio per process
+
+use lib "$ENV{'PERF_EXEC_PATH'}/scripts/perl/Perf-Trace-Util/lib";
+use lib "./Perf-Trace-Util/lib";
+use Perf::Trace::Core;
+use Perf::Trace::Context;
+use Perf::Trace::Util;
+
+my %records;
+
+sub trace_end
+{
+	print_pagecache_hit_ratio();
+}
+
+sub filemap::find_get_page
+{
+	my ($event_name, $context, $common_cpu, $common_secs, $common_nsecs,
+	    $common_pid, $common_comm,
+	    $s_dev, $i_ino, $offset, $page) = @_;
+	my $r = get_record($common_comm."-".$common_pid, $s_dev, $i_ino);
+
+	if ($page != 0) {
+	    $$r{hit}++;
+	} else {
+	    $$r{miss}++;
+	}
+}
+
+sub get_record
+{
+	my ($p, $dev, $inode) = @_;
+
+	unless (defined($records{$p}{$dev.":".$inode})) {
+	    $records{$p}{$dev.":".$inode} = {inode => $inode, dev => $dev,
+					    hit => 0, miss => 0};
+	}
+	return $records{$p}{$dev.":".$inode};
+}
+
+sub minor
+{
+	my $dev = shift;
+	return $dev & ((1 << 20) - 1);
+}
+
+sub major
+{
+	my $dev = shift;
+	return $dev >> 20;
+}
+
+sub print_pagecache_hit_ratio
+{
+	print "[process list]\n";
+	while(my ($pid, $v) = each(%records)) {
+	    print "o $pid\n";
+	    printf("  %12s %10s %10s %10s %10s\n", "", "",
+		   "cache find", "cache hit", "cache hit");
+	    printf("  %12s %10s %10s %10s %10s\n", "device", "inode",
+		   "count", "count", "ratio");
+	    printf("  %s\n", '-' x 56);
+	    foreach my $r (sort { $$a{inode} <=> $$b{inode} } values %$v) {
+		printf("  %12s %10d %10d %10d %9.2f%%\n",
+		       major($$r{dev}).":".minor($$r{dev}), $$r{inode},
+		       $$r{miss} + $$r{hit}, $$r{hit},
+		       $$r{hit} / ($$r{miss} + $$r{hit}) * 100);
+	    }
+	    print "\n";
+	}
+}
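
Both scripts decode the raw s_dev value with the same 20-bit split that the
kernel uses internally for dev_t (MINORBITS is 20 in include/linux/kdev_t.h).
The same decoding, written as a C sketch for illustration: the names dev_major,
dev_minor and format_dev are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

#define MINORBITS 20				/* kernel-internal dev_t layout */
#define MINORMASK ((1U << MINORBITS) - 1)

/* Upper bits of the kernel-internal dev_t hold the major number. */
unsigned int dev_major(unsigned int dev)
{
	return dev >> MINORBITS;
}

/* Lower MINORBITS bits hold the minor number. */
unsigned int dev_minor(unsigned int dev)
{
	return dev & MINORMASK;
}

/*
 * Format a device as "maj:min", as in the "device" column of the reports.
 * Returns what snprintf() returns: the length the full string needs.
 */
int format_dev(char *buf, size_t len, unsigned int dev)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%u:%u", dev_major(dev), dev_minor(dev));
}
```

A raw value of 253 << 20 therefore shows up as "253:0", as in the sample
output.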





* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-05  2:17 [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Keiichi KII
  2010-02-05  2:24 ` [RFC PATCH -tip 1/2 v3] tracepoints: add tracepoints for pagecache Keiichi KII
  2010-02-05  2:25 ` [RFC PATCH -tip 2/2 v3] add scripts for pagecache analysis per process Keiichi KII
@ 2010-02-05  7:28 ` Ingo Molnar
  2010-02-05 21:19   ` Keiichi KII
  2010-02-08 15:54   ` Wu Fengguang
  2010-02-08 13:04 ` Balbir Singh
  3 siblings, 2 replies; 17+ messages in thread
From: Ingo Molnar @ 2010-02-05  7:28 UTC (permalink / raw)
  To: Keiichi KII, Wu Fengguang, Andrew Morton, Frédéric Weisbecker,
	Steven Rostedt, Peter Zijlstra, Jason Baron, Hitoshi Mitake
  Cc: linux-kernel, lwoodman, linux-mm, Tom Zanussi, riel,
	Munehiro Ikeda, Atsushi Tsuji


* Keiichi KII <k-keiichi@bx.jp.nec.com> wrote:

> Hello,
> 
> This is v3 of a patchset to add some tracepoints for pagecache.
> 
> I would propose several tracepoints for tracing pagecache behavior and
> a script for these.
> By using both the tracepoints and the script, we can analyze pagecache behavior
> like usage or hit ratio with high resolution like per process or per file.
> Example output of the script looks like:
> 
> [process list]
> o yum-3215
>                           cache find  cache hit  cache hit
>         device      inode      count      count      ratio
>   --------------------------------------------------------
>          253:0         16      34434      34130     99.12%
>          253:0        198       9692       9463     97.64%
>          253:0        639        647        628     97.06%
>          253:0        778         32         29     90.62%
>          253:0       7305      50225      49005     97.57%
>          253:0     144217         12         10     83.33%
>          253:0     262775         16         13     81.25%
> *snip*
> 
> -------------------------------------------------------------------------------
> 
> [file list]
>         device              cached
>      (maj:min)      inode    pages
>   --------------------------------
>          253:0         16     5752
>          253:0        198     2233
>          253:0        639       51
>          253:0        778       86
>          253:0       7305    12307
>          253:0     144217       11
>          253:0     262775       39
> *snip*
> 
> [process list]
> o yum-3215
>         device              cached    added  removed      indirect
>      (maj:min)      inode    pages    pages    pages removed pages
>   ----------------------------------------------------------------
>          253:0         16    34130     5752        0             0
>          253:0        198     9463     2233        0             0
>          253:0        639      628       51        0             0
>          253:0        778       29       78        0             0
>          253:0       7305    49005    12307        0             0
>          253:0     144217       10       11        0             0
>          253:0     262775       13       39        0             0
> *snip*
>   ----------------------------------------------------------------
>   total:                    102346    26165        1             0
> 
> We can now know system-wide pagecache usage by /proc/meminfo.
> But we have no method to get higher resolution information like per file or
> per process usage than system-wide one.
> A process may share some pagecache or add a pagecache to the memory or
> remove a pagecache from the memory.
> If the pagecache miss ratio rises, it may lead to extra I/O and
> affect system performance.
> 
> So, by using the tracepoints we can get the following information.
>  1. how many pagecaches each process has per each file
>  2. how many pages are cached per each file
>  3. how many pagecaches each process shares
>  4. how often each process adds/removes pagecache
>  5. how long a pagecache stays in the memory
>  6. pagecache hit rate per file
> 
> Especially, the monitoring pagecache usage per each file and pagecache hit 
> ratio would help us tune some applications like database.
> And it will also help us tune the kernel parameters like "vm.dirty_*".
> 
> Changelog since v2
>   o add new script to monitor pagecache hit ratio per process.
>   o use DECLARE_EVENT_CLASS
> 
> Changelog since v1
>   o Add a script based on "perf trace stream scripting support".
> 
> Any comments are welcome.

Looks really nice IMO! It also demonstrates nicely the extensibility via 
Tom's perf trace scripting engine. (which will soon get a Python script 
engine as well, so Perl and C won't be the only possibilities to extend perf 
with.)

I've Cc:-ed a few parties who might be interested in this. Wu Fengguang has 
done MM instrumentation in this area before - there might be some common 
ground instead of scattered functionality in /proc, debugfs, perf and 
elsewhere?

Note that there's also these older experimental commits in tip:tracing/mm 
that introduce the notion of 'object collections' and adds the ability to 
trace them:

3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
1487a7a: tracing/mm: fix mapcount trace record field
dcac8cd: tracing/mm: add page frame snapshot trace

this concept, if refreshed a bit and extended to the page cache, would allow 
the recording/snapshotting of the MM state of all currently present pages in 
the page-cache - a possibly nice addition to the dynamic technique you apply 
in your patches.

there's similar "object collections" work underway for 'perf lock' btw., by 
Hitoshi Mitake and Frederic.

So there's lots of common ground and lots of interest.

Btw., instead of "perf trace record pagecache-usage", you might want to think 
about introducing a higher level tool as well: 'perf mm' or 'perf pagecache' 
- just like we have 'perf kmem' for SLAB instrumentation, 'perf sched' for 
scheduler instrumentation and 'perf lock' for locking instrumentation. [with 
'perf timer' having been posted too.]

'perf mm' could then still map to Perl scripts, it's just a convenience. It 
could then harbor other MM related instrumentation bits as well. Just an idea 
- this is a possibility, if you are trying to achieve higher organization.

Thanks,

	Ingo


* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-05  7:28 ` [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Ingo Molnar
@ 2010-02-05 21:19   ` Keiichi KII
  2010-02-08 15:54   ` Wu Fengguang
  1 sibling, 0 replies; 17+ messages in thread
From: Keiichi KII @ 2010-02-05 21:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Wu Fengguang, Andrew Morton, Frédéric Weisbecker,
	Steven Rostedt, Peter Zijlstra, Jason Baron, Hitoshi Mitake,
	linux-kernel, lwoodman, linux-mm, Tom Zanussi, riel,
	Munehiro Ikeda, Atsushi Tsuji

Hello,

(02/05/10 02:28), Ingo Molnar wrote:
> Looks really nice IMO! It also demonstrates nicely the extensibility via 
> Tom's perf trace scripting engine. (which will soon get a Python script 
> engine as well, so Perl and C won't be the only possibilities to extend perf 
> with.)
> 
> I've Cc:-ed a few parties who might be interested in this. Wu Fengguang has 
> done MM instrumentation in this area before - there might be some common 
> ground instead of scattered functionality in /proc, debugfs, perf and 
> elsewhere?
> 
> Note that there's also these older experimental commits in tip:tracing/mm 
> that introduce the notion of 'object collections' and adds the ability to 
> trace them:
> 
> 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> 1487a7a: tracing/mm: fix mapcount trace record field
> dcac8cd: tracing/mm: add page frame snapshot trace
> 
> this concept, if refreshed a bit and extended to the page cache, would allow 
> the recording/snapshotting of the MM state of all currently present pages in 
> the page-cache - a possibly nice addition to the dynamic technique you apply 
> in your patches.
> there's similar "object collections" work underway for 'perf lock' btw., by 
> Hitoshi Mitake and Frederic.
>
> So there's lots of common ground and lots of interest.
> 
> Btw., instead of "perf trace record pagecache-usage", you might want to think 
> about introducing a higher level tool as well: 'perf mm' or 'perf pagecache' 
> - just like we have 'perf kmem' for SLAB instrumentation, 'perf sched' for 
> scheduler instrumentation and 'perf lock' for locking instrumentation. [with 
> 'perf timer' having been posted too.]
> 
> 'perf mm' could then still map to Perl scripts, it's just a convenience. It 
> could then harbor other MM related instrumentation bits as well. Just an idea 
> - this is a possibility, if you are trying to achieve higher organization.

Thank you for the pointers to "perf lock" and tip:tracing/mm.
I think it would be very useful to merge the tracing/mm 'object collections'
work into "perf mm". So, as a next step, I will introduce a higher level
tool like "perf mm" for MM related instrumentation.
That existing work will help me implement it.

Tom's perf trace scripting engine is also very flexible.
I will try to implement "perf mm" on top of his scripting engine and, if I
can, have it harbor other MM related instrumentation like the above.

Thanks,
Keiichi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-05  2:17 [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Keiichi KII
                   ` (2 preceding siblings ...)
  2010-02-05  7:28 ` [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Ingo Molnar
@ 2010-02-08 13:04 ` Balbir Singh
  3 siblings, 0 replies; 17+ messages in thread
From: Balbir Singh @ 2010-02-08 13:04 UTC (permalink / raw)
  To: Keiichi KII
  Cc: linux-kernel, mingo, lwoodman, linux-mm, Tom Zanussi, riel,
	rostedt, akpm, fweisbec, Munehiro Ikeda, Atsushi Tsuji

* Keiichi KII <k-keiichi@bx.jp.nec.com> [2010-02-04 21:17:35]:

> Hello,
> 
> This is v3 of a patchset to add some tracepoints for pagecache.
> 
> I would propose several tracepoints for tracing pagecache behavior and
> a script for these.
> By using both the tracepoints and the script, we can analysis pagecache behavior
> like usage or hit ratio with high resolution like per process or per file. 
> Example output of the script looks like:
> 
> [process list]
> o yum-3215
>                           cache find  cache hit  cache hit
>         device      inode      count      count      ratio
>   --------------------------------------------------------
>          253:0         16      34434      34130     99.12%
>          253:0        198       9692       9463     97.64%
>          253:0        639        647        628     97.06%
>          253:0        778         32         29     90.62%
>          253:0       7305      50225      49005     97.57%
>          253:0     144217         12         10     83.33%
>          253:0     262775         16         13     81.25%
> *snip*

Very nice. We should be able to sum these to get a system-wide view.
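
As a rough user-space sketch of that summing (not part of the patchset; the
struct and function names here are made up for illustration), the combined
ratio would be total hits over total finds, not the average of the per-file
percentages:

```c
#include <assert.h>
#include <stddef.h>

/* One row of the per-file table printed by the script (hypothetical type). */
struct cache_row {
	unsigned long find_count;	/* "cache find count" column */
	unsigned long hit_count;	/* "cache hit count" column */
};

/*
 * Sum the find and hit counts over all rows and return the combined hit
 * ratio in percent. The totals are stored in *finds and *hits.
 * Returns 0.0 when there were no lookups at all.
 */
static double total_hit_ratio(const struct cache_row *rows, size_t n,
			      unsigned long *finds, unsigned long *hits)
{
	size_t i;

	assert(finds != NULL && hits != NULL);
	*finds = 0;
	*hits = 0;
	for (i = 0; i < n; i++) {
		*finds += rows[i].find_count;
		*hits += rows[i].hit_count;
	}
	return *finds ? 100.0 * (double)*hits / (double)*finds : 0.0;
}
```

Fed with only the seven rows shown above for yum-3215 (the rest is snipped),
this gives 93278 hits out of 95058 finds.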

> 
> -------------------------------------------------------------------------------
> 
> [file list]
>         device              cached
>      (maj:min)      inode    pages
>   --------------------------------
>          253:0         16     5752
>          253:0        198     2233
>          253:0        639       51
>          253:0        778       86
>          253:0       7305    12307
>          253:0     144217       11
>          253:0     262775       39
> *snip*
> 
> [process list]
> o yum-3215
>         device              cached    added  removed      indirect
>      (maj:min)      inode    pages    pages    pages removed pages
>   ----------------------------------------------------------------
>          253:0         16    34130     5752        0             0
>          253:0        198     9463     2233        0             0
>          253:0        639      628       51        0             0
>          253:0        778       29       78        0             0
>          253:0       7305    49005    12307        0             0
>          253:0     144217       10       11        0             0
>          253:0     262775       13       39        0             0
> *snip*
>   ----------------------------------------------------------------
>   total:                    102346    26165        1             0
                                                    ^^^
                                                Is this 1 stray?
> 
> We can now know system-wide pagecache usage by /proc/meminfo.
> But we have no method to get higher resolution information like per file or
> per process usage than system-wide one.

It would be really nice if we could distinguish the mapped from the
unmapped page cache.

> A process may share some pagecache or add a pagecache to the memory or
> remove a pagecache from the memory.
> If a pagecache miss hit ratio rises, maybe it leads to extra I/O and
> affects system performance.
> 
> So, by using the tracepoints we can get the following information.
>  1. how many pagecaches each process has per each file
>  2. how many pages are cached per each file
>  3. how many pagecaches each process shares
>  4. how often each process adds/removes pagecache
>  5. how long a pagecache stays in the memory
>  6. pagecache hit rate per file
> 
> Especially, the monitoring pagecache usage per each file and pagecache hit 
> ratio would help us tune some applications like database.
> And it will also help us tune the kernel parameters like "vm.dirty_*".
> 
> Changelog since v2
>   o add new script to monitor pagecache hit ratio per process.
>   o use DECLARE_EVENT_CLASS
> 
> Changelog since v1
>   o Add a script based on "perf trace stream scripting support".
> 
> Any comments are welcome.

-- 
	Three Cheers,
	Balbir


* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-05  7:28 ` [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Ingo Molnar
  2010-02-05 21:19   ` Keiichi KII
@ 2010-02-08 15:54   ` Wu Fengguang
  2010-02-09 16:21     ` Wu Fengguang
  2010-02-18  5:34     ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 17+ messages in thread
From: Wu Fengguang @ 2010-02-08 15:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Frost, Steven Rostedt, Peter Zijlstra, Frederic Weisbecker,
	Keiichi KII, Andrew Morton, Jason Baron, Hitoshi Mitake,
	linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

Hi Ingo,

> Note that there's also these older experimental commits in tip:tracing/mm 
> that introduce the notion of 'object collections' and adds the ability to 
> trace them:
> 
> 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> 1487a7a: tracing/mm: fix mapcount trace record field
> dcac8cd: tracing/mm: add page frame snapshot trace
> 
> this concept, if refreshed a bit and extended to the page cache, would allow 
> the recording/snapshotting of the MM state of all currently present pages in 
> the page-cache - a possibly nice addition to the dynamic technique you apply 
> in your patches.
> 
> there's similar "object collections" work underway for 'perf lock' btw., by 
> Hitoshi Mitake and Frederic.
> 
> So there's lots of common ground and lots of interest.

Here is a scratch patch to exercise the "object collections" idea :)

Interestingly, the pagecache walk is pretty fast, while copying out the trace
data takes more time:

        # time (echo / > walk-fs)
        (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total

        # time wc /debug/tracing/trace
        4570 45893 551282 /debug/tracing/trace
        wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total

        # time (cat /debug/tracing/trace > /dev/shm/t)
        (; cat /debug/tracing/trace > /dev/shm/t; )  0.04s user 0.49s system 95% cpu 0.548 total

        # time (dd if=/debug/tracing/trace of=/dev/shm/t bs=1M)
        0+138 records in
        0+138 records out
        551282 bytes (551 kB) copied, 0.380454 s, 1.4 MB/s
        (; dd if=/debug/tracing/trace of=/dev/shm/t bs=1M; )  0.09s user 0.48s system 96% cpu 0.600 total

The patch is based on tip/tracing/mm. 

Thanks,
Fengguang
---
tracing: pagecache object collections

This dumps
- all cached files of a mounted fs  (the inode-cache)
- all cached pages of a cached file (the page-cache)

Usage and Sample output:

# echo / > /debug/tracing/objects/mm/pages/walk-fs
# head /debug/tracing/trace

# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-3078  [000]   526.272587: dump_inode: ino=102223 size=169291 cached=172032 age=9 dirty=6 dev=0:15 file=<TODO>
             zsh-3078  [000]   526.274260: dump_pagecache_range: index=0 len=41 flags=10000000000002c count=1 mapcount=0
             zsh-3078  [000]   526.274340: dump_pagecache_range: index=41 len=1 flags=10000000000006c count=1 mapcount=0
             zsh-3078  [000]   526.274401: dump_inode: ino=8966 size=442 cached=4096 age=49 dirty=0 dev=0:15 file=<TODO>
             zsh-3078  [000]   526.274425: dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0
             zsh-3078  [000]   526.274440: dump_inode: ino=8964 size=4096 cached=0 age=49 dirty=0 dev=0:15 file=<TODO>

Here "age" is the time in seconds since the inode was created or, if it has
been dirtied since then, since it was last dirtied.
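
Each dump_pagecache_range line covers a run of consecutive pages that share
the same flags, count and mapcount. A simplified user-space sketch of that
coalescing is below. The struct and function names are illustrative only and
do not appear in the patch, and this version opens the first run at the first
page it sees:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the per-page state that gets compared. */
struct page_info {
	unsigned long index;	/* offset of the page within the file */
	unsigned long flags;
	unsigned int count;
	unsigned int mapcount;
};

/* One output record, analogous to a dump_pagecache_range line. */
struct page_range {
	unsigned long index;	/* index of the first page in the run */
	unsigned long len;	/* number of pages in the run */
	unsigned long flags;
	unsigned int count;
	unsigned int mapcount;
};

/* True if page p directly follows run r and has the same state. */
static int extends_range(const struct page_range *r, const struct page_info *p)
{
	return p->index == r->index + r->len &&
	       p->flags == r->flags &&
	       p->count == r->count &&
	       p->mapcount == r->mapcount;
}

/*
 * Coalesce pages (sorted by index) into runs. Writes at most max runs to
 * out and returns the number of runs written.
 */
static size_t coalesce_pages(const struct page_info *pages, size_t n,
			     struct page_range *out, size_t max)
{
	size_t i, nr = 0;

	assert(out != NULL || max == 0);
	for (i = 0; i < n; i++) {
		const struct page_info *p = &pages[i];

		/* Extend the current run if this page directly follows it. */
		if (nr > 0 && extends_range(&out[nr - 1], p)) {
			out[nr - 1].len++;
			continue;
		}
		if (nr == max)
			break;	/* output array is full */
		/* Otherwise start a new run of length 1 at this page. */
		out[nr].index = p->index;
		out[nr].len = 1;
		out[nr].flags = p->flags;
		out[nr].count = p->count;
		out[nr].mapcount = p->mapcount;
		nr++;
	}
	return nr;
}
```

With 41 identical pages at index 0..40 followed by a page at index 41 whose
flags differ, this yields the two runs (index=0 len=41) and (index=41 len=1),
matching the shape of the sample output above.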

TODO:

correctness
- show file path name
  XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
- reliably prevent ring buffer overflow,
  by replacing cond_resched() with some wait function
  (eg. wait until 2+ pages are free in ring buffer)
- use stable_page_flags() in recent kernel

output style
- use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
- clear ring buffer before dumping the objects?
- output format: key=value pairs ==> header + tabbed values?
- add filtering options if necessary

CC: Ingo Molnar <mingo@elte.hu>
CC: Chris Frost <frost@cs.ucla.edu>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/inode.c                |    2 
 include/trace/events/mm.h |   67 ++++++++++++++
 kernel/trace/trace_mm.c   |  165 ++++++++++++++++++++++++++++++++++++
 3 files changed, 233 insertions(+), 1 deletion(-)

--- linux-mm.orig/include/trace/events/mm.h	2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/include/trace/events/mm.h	2010-02-08 23:19:16.000000000 +0800
@@ -2,6 +2,7 @@
 #define _TRACE_MM_H
 
 #include <linux/tracepoint.h>
+#include <linux/pagemap.h>
 #include <linux/mm.h>
 
 #undef TRACE_SYSTEM
@@ -42,6 +43,72 @@ TRACE_EVENT(dump_pages,
 		  __entry->mapcount, __entry->index)
 );
 
+TRACE_EVENT(dump_pagecache_range,
+
+	TP_PROTO(struct page *page, unsigned long len),
+
+	TP_ARGS(page, len),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	index		)
+		__field(	unsigned long,	len		)
+		__field(	unsigned long,	flags		)
+		__field(	unsigned int,	count		)
+		__field(	unsigned int,	mapcount	)
+	),
+
+	TP_fast_assign(
+		__entry->index		= page->index;
+		__entry->len		= len;
+		__entry->flags		= page->flags;
+		__entry->count		= atomic_read(&page->_count);
+		__entry->mapcount	= page_mapcount(page);
+	),
+
+	TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
+		  __entry->index,
+		  __entry->len,
+		  __entry->flags,
+		  __entry->count,
+		  __entry->mapcount)
+);
+
+TRACE_EVENT(dump_inode,
+
+	TP_PROTO(struct inode *inode),
+
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	ino		)
+		__field(	loff_t,		size		)
+		__field(	unsigned long,	nrpages		)
+		__field(	unsigned long,	age		)
+		__field(	unsigned long,	state		)
+		__field(	dev_t,		dev		)
+	),
+
+	TP_fast_assign(
+		__entry->ino		= inode->i_ino;
+		__entry->size		= i_size_read(inode);
+		__entry->nrpages	= inode->i_mapping->nrpages;
+		__entry->age		= jiffies - inode->dirtied_when;
+		__entry->state		= inode->i_state;
+		__entry->dev		= inode->i_sb->s_dev;
+	),
+
+	TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
+		  "dev=%u:%u file=<TODO>",
+		  __entry->ino,
+		  __entry->size,
+		  __entry->nrpages << PAGE_CACHE_SHIFT,
+		  __entry->age / HZ,
+		  __entry->state & I_DIRTY,
+		  MAJOR(__entry->dev),
+		  MINOR(__entry->dev))
+);
+
+
 #endif /*  _TRACE_MM_H */
 
 /* This part must be outside protection */
--- linux-mm.orig/kernel/trace/trace_mm.c	2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/kernel/trace/trace_mm.c	2010-02-08 23:19:16.000000000 +0800
@@ -9,6 +9,9 @@
 #include <linux/bootmem.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
+#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/file.h>
 
 #include "trace_output.h"
 
@@ -95,6 +98,162 @@ static const struct file_operations trac
 	.write		= trace_mm_dump_range_write,
 };
 
+static unsigned long page_flags(struct page* page)
+{
+	return page->flags & ((1UL << NR_PAGEFLAGS) - 1);
+}
+
+static int pages_similiar(struct page* page0, struct page* page)
+{
+	if (page_count(page0) != page_count(page))
+		return 0;
+
+	if (page_mapcount(page0) != page_mapcount(page))
+		return 0;
+
+	if (page_flags(page0) != page_flags(page))
+		return 0;
+
+	return 1;
+}
+
+#define BATCH_LINES	100
+static void dump_pagecache(struct address_space *mapping)
+{
+	int i;
+	int lines = 0;
+	pgoff_t len = 0;
+	struct pagevec pvec;
+	struct page *page;
+	struct page *page0 = NULL;
+	unsigned long start = 0;
+
+	for (;;) {
+		pagevec_init(&pvec, 0);
+		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+		if (pvec.nr == 0) {
+			if (len)
+				trace_dump_pagecache_range(page0, len);
+			break;
+		}
+
+		if (!page0)
+			page0 = pvec.pages[0];
+
+		for (i = 0; i < pvec.nr; i++) {
+			page = pvec.pages[i];
+
+			if (page->index == start + len &&
+					pages_similiar(page0, page))
+				len++;
+			else {
+				trace_dump_pagecache_range(page0, len);
+				page0 = page;
+				start = page->index;
+				len = 1;
+				if (++lines > BATCH_LINES) {
+					lines = 0;
+					cond_resched();
+				}
+			}
+		}
+	}
+}
+
+static void dump_fs_pagecache(struct super_block *sb)
+{
+	struct inode *inode;
+	struct inode *prev_inode = NULL;
+
+	down_read(&sb->s_umount);
+	if (!sb->s_root)
+		goto out;
+	spin_lock(&inode_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
+		trace_dump_inode(inode);
+		if (inode->i_mapping->nrpages)
+			dump_pagecache(inode->i_mapping);
+		iput(prev_inode);
+		prev_inode = inode;
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+	spin_unlock(&inode_lock);
+	iput(prev_inode);
+out:
+	up_read(&sb->s_umount);
+}
+
+static ssize_t
+trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
+		      loff_t *ppos)
+{
+	struct file *file = NULL;
+	char *name;
+	int err = 0;
+
+	if (count > PATH_MAX + 1)
+		return -ENAMETOOLONG;
+
+	name = kmalloc(count+1, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	if (copy_from_user(name, ubuf, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* strip the newline added by `echo` */
+	if (count)
+		name[count-1] = '\0';
+
+	file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		file = NULL;
+		goto out;
+	}
+
+	if (tracing_update_buffers() < 0) {
+		err = -ENOMEM;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_inode", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (filp->f_path.dentry->d_inode->i_private) {
+		dump_fs_pagecache(file->f_path.dentry->d_sb);
+	} else {
+		dump_pagecache(file->f_mapping);
+	}
+
+out:
+	if (file)
+		fput(file);
+	kfree(name);
+
+	return err ? err : count;
+}
+
+static const struct file_operations trace_pagecache_fops = {
+	.open		= tracing_open_generic,
+	.read		= trace_mm_dump_range_read,
+	.write		= trace_pagecache_write,
+};
+
 /* move this into trace_objects.c when that file is created */
 static struct dentry *trace_objects_dir(void)
 {
@@ -167,6 +326,12 @@ static __init int trace_objects_mm_init(
 	trace_create_file("dump_range", 0600, d_pages, NULL,
 			  &trace_mm_fops);
 
+	trace_create_file("walk-file", 0600, d_pages, NULL,
+			  &trace_pagecache_fops);
+
+	trace_create_file("walk-fs", 0600, d_pages, (void *)1,
+			  &trace_pagecache_fops);
+
 	return 0;
 }
 fs_initcall(trace_objects_mm_init);
--- linux-mm.orig/fs/inode.c	2010-02-08 23:19:12.000000000 +0800
+++ linux-mm/fs/inode.c	2010-02-08 23:19:22.000000000 +0800
@@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
 	inode->i_rdev = 0;
-	inode->dirtied_when = 0;
+	inode->dirtied_when = jiffies;
 
 	if (security_inode_alloc(inode))
 		goto out_free_inode;


* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-08 15:54   ` Wu Fengguang
@ 2010-02-09 16:21     ` Wu Fengguang
  2010-02-13 13:29       ` Balbir Singh
  2010-02-16  3:22       ` KOSAKI Motohiro
  2010-02-18  5:34     ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 17+ messages in thread
From: Wu Fengguang @ 2010-02-09 16:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Chris Frost, Steven Rostedt, Peter Zijlstra, Frederic Weisbecker,
	Keiichi KII, Andrew Morton, Jason Baron, Hitoshi Mitake,
	linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

> Here is a scratch patch to exercise the "object collections" idea :)
> 
> Interestingly, the pagecache walk is pretty fast, while copying out the trace
> data takes more time:
> 
>         # time (echo / > walk-fs)
>         (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total
> 
>         # time wc /debug/tracing/trace
>         4570 45893 551282 /debug/tracing/trace
>         wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total

Ah, got it: most of the time goes into "printing" (formatting) the raw trace
data.

> TODO:
> 
> correctness
> - show file path name
>   XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?

OK, the file name is now shown, using d_path(). For the sake of speed I
chose not to escape a possible '\n' in file names; such files are simply
shown as "?".
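
In user-space terms the name handling boils down to roughly the following
sketch. The helper names are made up for illustration. fill_from_end() only
imitates the way the path ends up at the tail of the buffer, which is what
the length calculation in the patch (buffer end minus returned pointer)
relies on:

```c
#include <assert.h>
#include <string.h>

/*
 * Name to print for a file: a name containing '\n' is replaced by "?"
 * instead of being escaped.
 */
static const char *printable_name(const char *name)
{
	assert(name != NULL);
	return strchr(name, '\n') ? "?" : name;
}

/*
 * Copy path, including its terminating NUL, into the tail of buf and
 * return a pointer to its first character, or NULL if it does not fit.
 * The string then occupies exactly (buf + size - result) bytes.
 */
static char *fill_from_end(char *buf, size_t size, const char *path)
{
	size_t need;

	assert(buf != NULL && path != NULL);
	need = strlen(path) + 1;
	if (need > size)
		return NULL;
	return memcpy(buf + size - need, path, need);
}
```

For "/dev/null" in a 16-byte buffer the returned pointer is 10 bytes before
the end of the buffer: 9 characters plus the NUL.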

Thanks,
Fengguang
---
tracing: pagecache object collections

This dumps
- all cached files of a mounted fs  (the inode-cache)
- all cached pages of a cached file (the page-cache)

Usage and Sample output:

# echo /dev > /debug/tracing/objects/mm/pages/walk-fs
# tail /debug/tracing/trace
             zsh-2528  [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
             zsh-2528  [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
             zsh-2528  [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
             zsh-2528  [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
             zsh-2528  [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
             zsh-2528  [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
             zsh-2528  [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
             zsh-2528  [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
             zsh-2528  [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
             zsh-2528  [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev

Here "age" is the time in seconds since the inode was created or, if it has
been dirtied since then, since it was last dirtied.

TODO:

correctness
- reliably prevent ring buffer overflow,
  by replacing cond_resched() with some wait function
  (eg. wait until 2+ pages are free in ring buffer)
- use stable_page_flags() in recent kernel

output style
- use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
- clear ring buffer before dumping the objects?
- output format: key=value pairs ==> header + tabbed values?
- add filtering options if necessary

CC: Ingo Molnar <mingo@elte.hu>
CC: Chris Frost <frost@cs.ucla.edu>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/inode.c                |    2 
 include/trace/events/mm.h |   70 ++++++++++++
 kernel/trace/trace_mm.c   |  204 ++++++++++++++++++++++++++++++++++++
 3 files changed, 275 insertions(+), 1 deletion(-)

--- linux-mm.orig/include/trace/events/mm.h	2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/include/trace/events/mm.h	2010-02-09 23:39:03.000000000 +0800
@@ -2,6 +2,7 @@
 #define _TRACE_MM_H
 
 #include <linux/tracepoint.h>
+#include <linux/pagemap.h>
 #include <linux/mm.h>
 
 #undef TRACE_SYSTEM
@@ -42,6 +43,75 @@ TRACE_EVENT(dump_pages,
 		  __entry->mapcount, __entry->index)
 );
 
+TRACE_EVENT(dump_pagecache_range,
+
+	TP_PROTO(struct page *page, unsigned long len),
+
+	TP_ARGS(page, len),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	index		)
+		__field(	unsigned long,	len		)
+		__field(	unsigned long,	flags		)
+		__field(	unsigned int,	count		)
+		__field(	unsigned int,	mapcount	)
+	),
+
+	TP_fast_assign(
+		__entry->index		= page->index;
+		__entry->len		= len;
+		__entry->flags		= page->flags;
+		__entry->count		= atomic_read(&page->_count);
+		__entry->mapcount	= page_mapcount(page);
+	),
+
+	TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
+		  __entry->index,
+		  __entry->len,
+		  __entry->flags,
+		  __entry->count,
+		  __entry->mapcount)
+);
+
+TRACE_EVENT(dump_inode,
+
+	TP_PROTO(struct inode *inode, char *name, int len),
+
+	TP_ARGS(inode, name, len),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	ino		)
+		__field(	loff_t,		size		)
+		__field(	unsigned long,	nrpages		)
+		__field(	unsigned long,	age		)
+		__field(	unsigned long,	state		)
+		__field(	dev_t,		dev		)
+		__dynamic_array(char,		file,	len	)
+	),
+
+	TP_fast_assign(
+		__entry->ino		= inode->i_ino;
+		__entry->size		= i_size_read(inode);
+		__entry->nrpages	= inode->i_mapping->nrpages;
+		__entry->age		= jiffies - inode->dirtied_when;
+		__entry->state		= inode->i_state;
+		__entry->dev		= inode->i_sb->s_dev;
+		memcpy(__get_str(file), name, len);
+	),
+
+	TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
+		  "dev=%u:%u file=%s",
+		  __entry->ino,
+		  __entry->size,
+		  __entry->nrpages << PAGE_CACHE_SHIFT,
+		  __entry->age / HZ,
+		  __entry->state & I_DIRTY,
+		  MAJOR(__entry->dev),
+		  MINOR(__entry->dev),
+		  strchr(__get_str(file), '\n') ? "?" : __get_str(file))
+);
+
+
 #endif /*  _TRACE_MM_H */
 
 /* This part must be outside protection */
--- linux-mm.orig/kernel/trace/trace_mm.c	2010-02-08 23:19:09.000000000 +0800
+++ linux-mm/kernel/trace/trace_mm.c	2010-02-10 00:04:47.000000000 +0800
@@ -9,6 +9,9 @@
 #include <linux/bootmem.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
+#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/file.h>
 
 #include "trace_output.h"
 
@@ -95,6 +98,201 @@ static const struct file_operations trac
 	.write		= trace_mm_dump_range_write,
 };
 
+static unsigned long page_flags(struct page* page)
+{
+	return page->flags & ((1UL << NR_PAGEFLAGS) - 1);
+}
+
+static int pages_similiar(struct page* page0, struct page* page)
+{
+	if (page_count(page0) != page_count(page))
+		return 0;
+
+	if (page_mapcount(page0) != page_mapcount(page))
+		return 0;
+
+	if (page_flags(page0) != page_flags(page))
+		return 0;
+
+	return 1;
+}
+
+#define BATCH_LINES	100
+static void dump_pagecache(struct address_space *mapping)
+{
+	int i;
+	int lines = 0;
+	pgoff_t len = 0;
+	struct pagevec pvec;
+	struct page *page;
+	struct page *page0 = NULL;
+	unsigned long start = 0;
+
+	for (;;) {
+		pagevec_init(&pvec, 0);
+		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
+				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
+
+		if (pvec.nr == 0) {
+			if (len)
+				trace_dump_pagecache_range(page0, len);
+			break;
+		}
+
+		if (!page0)
+			page0 = pvec.pages[0];
+
+		for (i = 0; i < pvec.nr; i++) {
+			page = pvec.pages[i];
+
+			if (page->index == start + len &&
+					pages_similiar(page0, page))
+				len++;
+			else {
+				trace_dump_pagecache_range(page0, len);
+				page0 = page;
+				start = page->index;
+				len = 1;
+				if (++lines > BATCH_LINES) {
+					lines = 0;
+					cond_resched();
+				}
+			}
+		}
+	}
+}
+
+static void dump_inode(struct inode *inode,
+		       char *name_buf,
+		       struct vfsmount *mnt)
+{
+	struct path path = {
+		.mnt = mnt,
+		.dentry = d_find_alias(inode)
+	};
+	char *name;
+	int len;
+
+	if (!path.dentry) {
+		trace_dump_inode(inode, "?", 2);
+		return;
+	}
+
+	name = d_path(&path, name_buf, PAGE_SIZE);
+	if (IS_ERR(name)) {
+		name = "?";
+		len = 2;
+	} else
+		len = PAGE_SIZE + name_buf - name;
+
+	trace_dump_inode(inode, name, len);
+
+	if (path.dentry)
+		dput(path.dentry);
+}
+
+static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
+{
+	struct inode *inode;
+	struct inode *prev_inode = NULL;
+	char *name_buf;
+
+	name_buf = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!name_buf)
+		return;
+
+	down_read(&sb->s_umount);
+	if (!sb->s_root)
+		goto out;
+
+	spin_lock(&inode_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
+		dump_inode(inode, name_buf, mnt);
+		if (inode->i_mapping->nrpages)
+			dump_pagecache(inode->i_mapping);
+		iput(prev_inode);
+		prev_inode = inode;
+		cond_resched();
+		spin_lock(&inode_lock);
+	}
+	spin_unlock(&inode_lock);
+	iput(prev_inode);
+out:
+	up_read(&sb->s_umount);
+	free_page((unsigned long)name_buf);
+}
+
+static ssize_t
+trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
+		      loff_t *ppos)
+{
+	struct file *file = NULL;
+	char *name;
+	int err = 0;
+
+	if (count <= 1)
+		return -EINVAL;
+	if (count > PATH_MAX + 1)
+		return -ENAMETOOLONG;
+
+	name = kmalloc(count+1, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	if (copy_from_user(name, ubuf, count)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	/* strip the newline added by `echo` */
+	if (name[count-1] != '\n')
+		return -EINVAL;
+	name[count-1] = '\0';
+
+	file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		file = NULL;
+		goto out;
+	}
+
+	if (tracing_update_buffers() < 0) {
+		err = -ENOMEM;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+	if (trace_set_clr_event("mm", "dump_inode", 1)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (filp->f_path.dentry->d_inode->i_private) {
+		dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
+	} else {
+		dump_pagecache(file->f_mapping);
+	}
+
+out:
+	if (file)
+		fput(file);
+	kfree(name);
+
+	return err ? err : count;
+}
+
+static const struct file_operations trace_pagecache_fops = {
+	.open		= tracing_open_generic,
+	.read		= trace_mm_dump_range_read,
+	.write		= trace_pagecache_write,
+};
+
 /* move this into trace_objects.c when that file is created */
 static struct dentry *trace_objects_dir(void)
 {
@@ -167,6 +365,12 @@ static __init int trace_objects_mm_init(
 	trace_create_file("dump_range", 0600, d_pages, NULL,
 			  &trace_mm_fops);
 
+	trace_create_file("walk-file", 0600, d_pages, NULL,
+			  &trace_pagecache_fops);
+
+	trace_create_file("walk-fs", 0600, d_pages, (void *)1,
+			  &trace_pagecache_fops);
+
 	return 0;
 }
 fs_initcall(trace_objects_mm_init);
--- linux-mm.orig/fs/inode.c	2010-02-08 23:19:12.000000000 +0800
+++ linux-mm/fs/inode.c	2010-02-08 23:19:22.000000000 +0800
@@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
 	inode->i_rdev = 0;
-	inode->dirtied_when = 0;
+	inode->dirtied_when = jiffies;
 
 	if (security_inode_alloc(inode))
 		goto out_free_inode;


* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-09 16:21     ` Wu Fengguang
@ 2010-02-13 13:29       ` Balbir Singh
  2010-02-14 10:52         ` Balbir Singh
  2010-02-21  2:28         ` Wu Fengguang
  2010-02-16  3:22       ` KOSAKI Motohiro
  1 sibling, 2 replies; 17+ messages in thread
From: Balbir Singh @ 2010-02-13 13:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Chris Frost, Steven Rostedt, Peter Zijlstra,
	Frederic Weisbecker, Keiichi KII, Andrew Morton, Jason Baron,
	Hitoshi Mitake, linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

* Wu Fengguang <fengguang.wu@intel.com> [2010-02-10 00:21:01]:

> > Here is a scratch patch to exercise the "object collections" idea :)
> > 
> > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > data takes more time:
> > 
> >         # time (echo / > walk-fs)
> >         (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total
> > 
> >         # time wc /debug/tracing/trace
> >         4570 45893 551282 /debug/tracing/trace
> >         wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total
> 
> Ah got it: it takes much time to "print" the raw trace data.
> 
> > TODO:
> > 
> > correctness
> > - show file path name
> >   XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> 
> OK, finished with the file name with d_path(). I choose not to mangle
> the possible '\n' in file names, and simply show "?" for such files,
> for the sake of speed.
> 
> Thanks,
> Fengguang
> ---
> tracing: pagecache object collections
> 
> This dumps
> - all cached files of a mounted fs  (the inode-cache)
> - all cached pages of a cached file (the page-cache)
> 
> Usage and Sample output:
> 
> # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> # tail /debug/tracing/trace
>              zsh-2528  [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
>              zsh-2528  [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
>              zsh-2528  [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
>              zsh-2528  [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
>              zsh-2528  [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
>              zsh-2528  [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
>              zsh-2528  [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
>              zsh-2528  [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
>              zsh-2528  [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
>              zsh-2528  [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> 
> Here "age" is either age from inode create time, or from last dirty time.
>

It would be nice to see mapped/unmapped information as well.
 
> TODO:
> 
> correctness
> - reliably prevent ring buffer overflow,
>   by replacing cond_resched() with some wait function
>   (eg. wait until 2+ pages are free in ring buffer)
> - use stable_page_flags() in recent kernel
> 
> output style
> - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> - clear ring buffer before dumping the objects?
> - output format: key=value pairs ==> header + tabbed values?
> - add filtering options if necessary
> 
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Chris Frost <frost@cs.ucla.edu>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/inode.c                |    2 
>  include/trace/events/mm.h |   70 ++++++++++++
>  kernel/trace/trace_mm.c   |  204 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 275 insertions(+), 1 deletion(-)
> 
> --- linux-mm.orig/include/trace/events/mm.h	2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/include/trace/events/mm.h	2010-02-09 23:39:03.000000000 +0800
> @@ -2,6 +2,7 @@
>  #define _TRACE_MM_H
> 
>  #include <linux/tracepoint.h>
> +#include <linux/pagemap.h>
>  #include <linux/mm.h>
> 
>  #undef TRACE_SYSTEM
> @@ -42,6 +43,75 @@ TRACE_EVENT(dump_pages,
>  		  __entry->mapcount, __entry->index)
>  );
> 
> +TRACE_EVENT(dump_pagecache_range,
> +
> +	TP_PROTO(struct page *page, unsigned long len),
> +
> +	TP_ARGS(page, len),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	index		)
> +		__field(	unsigned long,	len		)
> +		__field(	unsigned long,	flags		)
> +		__field(	unsigned int,	count		)
> +		__field(	unsigned int,	mapcount	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index		= page->index;
> +		__entry->len		= len;
> +		__entry->flags		= page->flags;
> +		__entry->count		= atomic_read(&page->_count);
> +		__entry->mapcount	= page_mapcount(page);
> +	),
> +
> +	TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
> +		  __entry->index,
> +		  __entry->len,
> +		  __entry->flags,
> +		  __entry->count,
> +		  __entry->mapcount)
> +);
> +
> +TRACE_EVENT(dump_inode,
> +
> +	TP_PROTO(struct inode *inode, char *name, int len),
> +
> +	TP_ARGS(inode, name, len),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	ino		)
> +		__field(	loff_t,		size		)
> +		__field(	unsigned long,	nrpages		)
> +		__field(	unsigned long,	age		)
> +		__field(	unsigned long,	state		)
> +		__field(	dev_t,		dev		)
> +		__dynamic_array(char,		file,	len	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->ino		= inode->i_ino;
> +		__entry->size		= i_size_read(inode);
> +		__entry->nrpages	= inode->i_mapping->nrpages;
> +		__entry->age		= jiffies - inode->dirtied_when;
> +		__entry->state		= inode->i_state;
> +		__entry->dev		= inode->i_sb->s_dev;
> +		memcpy(__get_str(file), name, len);
> +	),
> +
> +	TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
> +		  "dev=%u:%u file=%s",
> +		  __entry->ino,
> +		  __entry->size,
> +		  __entry->nrpages << PAGE_CACHE_SHIFT,
> +		  __entry->age / HZ,
> +		  __entry->state & I_DIRTY,
> +		  MAJOR(__entry->dev),
> +		  MINOR(__entry->dev),
> +		  strchr(__get_str(file), '\n') ? "?" : __get_str(file))
> +);
> +
> +
>  #endif /*  _TRACE_MM_H */
> 
>  /* This part must be outside protection */
> --- linux-mm.orig/kernel/trace/trace_mm.c	2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/kernel/trace/trace_mm.c	2010-02-10 00:04:47.000000000 +0800
> @@ -9,6 +9,9 @@
>  #include <linux/bootmem.h>
>  #include <linux/debugfs.h>
>  #include <linux/uaccess.h>
> +#include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/file.h>
> 
>  #include "trace_output.h"
> 
> @@ -95,6 +98,201 @@ static const struct file_operations trac
>  	.write		= trace_mm_dump_range_write,
>  };
> 
> +static unsigned long page_flags(struct page* page)
> +{
> +	return page->flags & ((1 << NR_PAGEFLAGS) - 1);
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> +	if (page_count(page0) != page_count(page))
> +		return 0;
> +
> +	if (page_mapcount(page0) != page_mapcount(page))
> +		return 0;
> +
> +	if (page_flags(page0) != page_flags(page))
> +		return 0;
> +
> +	return 1;
> +}
> +

OK, so pages_similar() is used to identify a range of pages in the
cache?
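
If that reading is right, the loop is doing a run-length encoding of the
page cache: consecutive indexes whose count, mapcount and flags match the
first page of the run are folded into one dump_pagecache_range record.
A minimal userspace sketch of that idea follows; struct fake_page,
struct page_range and coalesce_ranges() are made-up names for illustration
and are not part of the patch.

```c
#include <stddef.h>

/* Userspace stand-in for the struct page fields compared by the patch. */
struct fake_page {
	unsigned long index;	/* offset in the file, in pages */
	unsigned long flags;
	int count;
	int mapcount;
};

struct page_range {
	unsigned long start;	/* index of the first page in the run */
	unsigned long len;	/* number of pages in the run */
};

static int pages_similar(const struct fake_page *a, const struct fake_page *b)
{
	return a->count == b->count &&
	       a->mapcount == b->mapcount &&
	       a->flags == b->flags;
}

/*
 * Coalesce pages[] (sorted by index) into runs of contiguous, similar pages.
 * Returns the number of runs written to out[], at most max_out.
 */
static size_t coalesce_ranges(const struct fake_page *pages, size_t nr,
			      struct page_range *out, size_t max_out)
{
	size_t i, n = 0;
	const struct fake_page *first = NULL;
	unsigned long len = 0;

	for (i = 0; i < nr; i++) {
		const struct fake_page *p = &pages[i];

		/* Extend the current run if p is adjacent and similar. */
		if (first && p->index == first->index + len &&
		    pages_similar(first, p)) {
			len++;
			continue;
		}
		/* Otherwise close the current run (if any) and start a new one. */
		if (first && n < max_out)
			out[n++] = (struct page_range){ first->index, len };
		first = p;
		len = 1;
	}
	/* Flush the last open run. */
	if (first && n < max_out)
		out[n++] = (struct page_range){ first->index, len };
	return n;
}
```

Every run is compared against its first page, so a slow drift in flags
across a long run would still split it at the first page that differs.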

> +#define BATCH_LINES	100
> +static void dump_pagecache(struct address_space *mapping)
> +{
> +	int i;
> +	int lines = 0;
> +	pgoff_t len = 0;
> +	struct pagevec pvec;
> +	struct page *page;
> +	struct page *page0 = NULL;
> +	unsigned long start = 0;
> +
> +	for (;;) {
> +		pagevec_init(&pvec, 0);
> +		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> +				(void **)pvec.pages, start + len, PAGEVEC_SIZE);

Is radix_tree_gang_lookup() synchronized somewhere? Don't we need to
call it under RCU or a lock (mapping)?
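
For reference, the usual pattern in the lockless page cache, as I
understand it, is to do the lookup inside rcu_read_lock() and pin each page
with a speculative reference before touching it, roughly what
find_get_pages() does. The pinning step can be modelled in userspace with
C11 atomics as below. struct pinned_page, get_unless_zero() and pin_batch()
are hypothetical names, and the RCU section is only marked with comments.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct page; only the refcount matters here. */
struct pinned_page {
	atomic_int refcount;	/* 0 means the page is being freed */
	unsigned long index;
};

/* Take a reference only if the object is still live (refcount > 0). */
static bool get_unless_zero(struct pinned_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_ref(struct pinned_page *p)
{
	atomic_fetch_sub(&p->refcount, 1);
}

/*
 * Pin up to max pages found in slots[]. In the kernel the slot walk would
 * sit inside rcu_read_lock()/rcu_read_unlock(); here that is only a comment.
 */
static size_t pin_batch(struct pinned_page **slots, size_t nr_slots,
			struct pinned_page **out, size_t max)
{
	size_t i, n = 0;

	/* rcu_read_lock() would go here */
	for (i = 0; i < nr_slots && n < max; i++) {
		/* Skip empty slots and pages already on their way out. */
		if (slots[i] && get_unless_zero(slots[i]))
			out[n++] = slots[i];
	}
	/* rcu_read_unlock() would go here */
	return n;
}
```

Each pinned page then needs a matching put_ref() once its fields have been
read, otherwise the walk itself would keep pages from being freed.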

> +
> +		if (pvec.nr == 0) {
> +			if (len)
> +				trace_dump_pagecache_range(page0, len);
> +			break;
> +		}
> +
> +		if (!page0)
> +			page0 = pvec.pages[0];
> +
> +		for (i = 0; i < pvec.nr; i++) {
> +			page = pvec.pages[i];
> +
> +			if (page->index == start + len &&
> +					pages_similiar(page0, page))
> +				len++;
> +			else {
> +				trace_dump_pagecache_range(page0, len);
> +				page0 = page;
> +				start = page->index;
> +				len = 1;
> +				if (++lines > BATCH_LINES) {
> +					lines = 0;
> +					cond_resched();
> +				}
> +			}
> +		}
> +	}
> +}
> +
> +static void dump_inode(struct inode *inode,
> +		       char *name_buf,
> +		       struct vfsmount *mnt)
> +{
> +	struct path path = {
> +		.mnt = mnt,
> +		.dentry = d_find_alias(inode)
> +	};
> +	char *name;
> +	int len;
> +
> +	if (!path.dentry) {
> +		trace_dump_inode(inode, "?", 2);
> +		return;
> +	}
> +
> +	name = d_path(&path, name_buf, PAGE_SIZE);
> +	if (IS_ERR(name)) {
> +		name = "?";
> +		len = 2;
> +	} else
> +		len = PAGE_SIZE + name_buf - name;
> +
> +	trace_dump_inode(inode, name, len);
> +
> +	if (path.dentry)
> +		dput(path.dentry);
> +}
> +
> +static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
> +{
> +	struct inode *inode;
> +	struct inode *prev_inode = NULL;
> +	char *name_buf;
> +
> +	name_buf = (char *)__get_free_page(GFP_TEMPORARY);
> +	if (!name_buf)
> +		return;
> +
> +	down_read(&sb->s_umount);
> +	if (!sb->s_root)
> +		goto out;
> +
> +	spin_lock(&inode_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> +			continue;
> +		__iget(inode);
> +		spin_unlock(&inode_lock);
> +		dump_inode(inode, name_buf, mnt);
> +		if (inode->i_mapping->nrpages)
> +			dump_pagecache(inode->i_mapping);
> +		iput(prev_inode);
> +		prev_inode = inode;
> +		cond_resched();
> +		spin_lock(&inode_lock);
> +	}
> +	spin_unlock(&inode_lock);
> +	iput(prev_inode);
> +out:
> +	up_read(&sb->s_umount);
> +	free_page((unsigned long)name_buf);
> +}
> +
> +static ssize_t
> +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> +		      loff_t *ppos)
> +{
> +	struct file *file = NULL;
> +	char *name;
> +	int err = 0;
> +

Can't we use the trace_parser here?

> +	if (count <= 1)
> +		return -EINVAL;
> +	if (count > PATH_MAX + 1)
> +		return -ENAMETOOLONG;
> +
> +	name = kmalloc(count+1, GFP_KERNEL);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(name, ubuf, count)) {
> +		err = -EFAULT;
> +		goto out;
> +	}
> +
> +	/* strip the newline added by `echo` */
> +	if (name[count-1] != '\n')
> +		return -EINVAL;

Doesn't sound correct, what happens if we use echo -n?
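
One way to accept both forms is to treat the trailing newline as optional
and strip it only when present. A minimal userspace sketch, with
malloc()/memcpy() standing in for kmalloc()/copy_from_user() and
parse_path_arg() being a made-up helper name:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Copy a written buffer into a NUL-terminated path, dropping one optional
 * trailing newline. Returns a malloc()ed string, or NULL on empty input or
 * allocation failure. The caller frees the result.
 */
static char *parse_path_arg(const char *buf, size_t count)
{
	char *name;

	/* `echo` appends a newline, `echo -n` does not; accept both. */
	if (count && buf[count - 1] == '\n')
		count--;
	if (count == 0)
		return NULL;

	name = malloc(count + 1);
	if (!name)
		return NULL;

	memcpy(name, buf, count);
	name[count] = '\0';
	return name;
}
```

The quoted check also returns directly rather than jumping to out:, so
name would not be freed on that path.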

> +	name[count-1] = '\0';
> +
> +	file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		file = NULL;
> +		goto out;
> +	}
> +
> +	if (tracing_update_buffers() < 0) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +	if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +	if (trace_set_clr_event("mm", "dump_inode", 1)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (filp->f_path.dentry->d_inode->i_private) {
> +		dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
> +	} else {
> +		dump_pagecache(file->f_mapping);
> +	}
> +
> +out:
> +	if (file)
> +		fput(file);
> +	kfree(name);
> +
> +	return err ? err : count;
> +}
> +
> +static const struct file_operations trace_pagecache_fops = {
> +	.open		= tracing_open_generic,
> +	.read		= trace_mm_dump_range_read,
> +	.write		= trace_pagecache_write,
> +};
> +
>  /* move this into trace_objects.c when that file is created */
>  static struct dentry *trace_objects_dir(void)
>  {
> @@ -167,6 +365,12 @@ static __init int trace_objects_mm_init(
>  	trace_create_file("dump_range", 0600, d_pages, NULL,
>  			  &trace_mm_fops);
> 
> +	trace_create_file("walk-file", 0600, d_pages, NULL,
> +			  &trace_pagecache_fops);
> +
> +	trace_create_file("walk-fs", 0600, d_pages, (void *)1,
> +			  &trace_pagecache_fops);
> +
>  	return 0;
>  }
>  fs_initcall(trace_objects_mm_init);
> --- linux-mm.orig/fs/inode.c	2010-02-08 23:19:12.000000000 +0800
> +++ linux-mm/fs/inode.c	2010-02-08 23:19:22.000000000 +0800
> @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
>  	inode->i_bdev = NULL;
>  	inode->i_cdev = NULL;
>  	inode->i_rdev = 0;
> -	inode->dirtied_when = 0;
> +	inode->dirtied_when = jiffies;
>

Hmmm... Is the inode really dirtied when it is initialized? I know the
change is for tracing, but the resulting code is confusing to read.
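
One possible alternative, sketched below in userspace, is to leave
dirtied_when alone and record the creation time separately, so the
tracepoint can fall back to it for inodes that were never dirtied.
created_when is a hypothetical field that does not exist in struct inode,
and treating dirtied_when == 0 as "never dirtied" is an assumption of the
sketch.

```c
/* Userspace stand-in for the inode fields involved. */
struct fake_inode {
	unsigned long created_when;	/* hypothetical: set once at init */
	unsigned long dirtied_when;	/* assumed 0 until first dirtied */
};

/* Age in ticks: since the last dirtying if any, otherwise since creation. */
static unsigned long inode_age(const struct fake_inode *inode,
			       unsigned long now)
{
	unsigned long since = inode->dirtied_when ? inode->dirtied_when
						  : inode->created_when;

	/* Unsigned subtraction stays correct across counter wraparound. */
	return now - since;
}
```

The cost is one more word per inode, which may or may not be acceptable
for a tracing-only feature.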
 

-- 
	Three Cheers,
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-13 13:29       ` Balbir Singh
@ 2010-02-14 10:52         ` Balbir Singh
  2010-02-21  2:28         ` Wu Fengguang
  1 sibling, 0 replies; 17+ messages in thread
From: Balbir Singh @ 2010-02-14 10:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Chris Frost, Steven Rostedt, Peter Zijlstra,
	Frederic Weisbecker, Keiichi KII, Andrew Morton, Jason Baron,
	Hitoshi Mitake, linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

* Balbir Singh <balbir@linux.vnet.ibm.com> [2010-02-13 18:59:52]:

> * Wu Fengguang <fengguang.wu@intel.com> [2010-02-10 00:21:01]:
> 
> > > Here is a scratch patch to exercise the "object collections" idea :)
> > > 
> > > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > > data takes more time:
> > > 
> > >         # time (echo / > walk-fs)
> > >         (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total
> > > 
> > >         # time wc /debug/tracing/trace
> > >         4570 45893 551282 /debug/tracing/trace
> > >         wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total
> > 
> > Ah got it: it takes much time to "print" the raw trace data.
> > 
> > > TODO:
> > > 
> > > correctness
> > > - show file path name
> > >   XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> > 
> > OK, finished with the file name with d_path(). I choose not to mangle
> > the possible '\n' in file names, and simply show "?" for such files,
> > for the sake of speed.
> > 
> > Thanks,
> > Fengguang
> > ---
> > tracing: pagecache object collections
> > 
> > This dumps
> > - all cached files of a mounted fs  (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> > 
> > Usage and Sample output:
> > 
> > # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> > # tail /debug/tracing/trace
> >              zsh-2528  [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> >              zsh-2528  [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> >              zsh-2528  [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> >              zsh-2528  [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> >              zsh-2528  [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> >              zsh-2528  [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> >              zsh-2528  [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> >              zsh-2528  [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> >              zsh-2528  [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> >              zsh-2528  [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> > 
> > Here "age" is either age from inode create time, or from last dirty time.
> >
> 
> It would be nice to see mapped/unmapped information as well.
>

OK, I see you got mapcount, thanks!
 
-- 
	Three Cheers,
	Balbir



* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-09 16:21     ` Wu Fengguang
  2010-02-13 13:29       ` Balbir Singh
@ 2010-02-16  3:22       ` KOSAKI Motohiro
  2010-02-17 22:38         ` Keiichi KII
  1 sibling, 1 reply; 17+ messages in thread
From: KOSAKI Motohiro @ 2010-02-16  3:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Ingo Molnar, Chris Frost, Steven Rostedt,
	Peter Zijlstra, Frederic Weisbecker, Keiichi KII, Andrew Morton,
	Jason Baron, Hitoshi Mitake, linux-kernel@vger.kernel.org,
	lwoodman@redhat.com, linux-mm@kvack.org, Tom Zanussi,
	riel@redhat.com, Munehiro Ikeda, Atsushi Tsuji

> > Here is a scratch patch to exercise the "object collections" idea :)
> > 
> > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > data takes more time:
> > 
> >         # time (echo / > walk-fs)
> >         (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total
> > 
> >         # time wc /debug/tracing/trace
> >         4570 45893 551282 /debug/tracing/trace
> >         wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total
> 
> Ah got it: it takes much time to "print" the raw trace data.
> 
> > TODO:
> > 
> > correctness
> > - show file path name
> >   XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> 
> OK, finished with the file name with d_path(). I choose not to mangle
> the possible '\n' in file names, and simply show "?" for such files,
> for the sake of speed.


This patch is nicer than KII-san's one. I plan to test it in
my local test environment for a while.

thanks.


> 
> Thanks,
> Fengguang
> ---
> tracing: pagecache object collections
> 
> This dumps
> - all cached files of a mounted fs  (the inode-cache)
> - all cached pages of a cached file (the page-cache)
> 
> Usage and Sample output:
> 
> # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> # tail /debug/tracing/trace
>              zsh-2528  [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
>              zsh-2528  [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
>              zsh-2528  [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
>              zsh-2528  [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
>              zsh-2528  [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
>              zsh-2528  [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
>              zsh-2528  [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
>              zsh-2528  [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
>              zsh-2528  [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
>              zsh-2528  [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> 
> Here "age" is either age from inode create time, or from last dirty time.
> 
> TODO:
> 
> correctness
> - reliably prevent ring buffer overflow,
>   by replacing cond_resched() with some wait function
>   (eg. wait until 2+ pages are free in ring buffer)
> - use stable_page_flags() in recent kernel
> 
> output style
> - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> - clear ring buffer before dumping the objects?
> - output format: key=value pairs ==> header + tabbed values?
> - add filtering options if necessary
> 
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Chris Frost <frost@cs.ucla.edu>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/inode.c                |    2 
>  include/trace/events/mm.h |   70 ++++++++++++
>  kernel/trace/trace_mm.c   |  204 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 275 insertions(+), 1 deletion(-)
> 
> --- linux-mm.orig/include/trace/events/mm.h	2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/include/trace/events/mm.h	2010-02-09 23:39:03.000000000 +0800
> @@ -2,6 +2,7 @@
>  #define _TRACE_MM_H
>  
>  #include <linux/tracepoint.h>
> +#include <linux/pagemap.h>
>  #include <linux/mm.h>
>  
>  #undef TRACE_SYSTEM
> @@ -42,6 +43,75 @@ TRACE_EVENT(dump_pages,
>  		  __entry->mapcount, __entry->index)
>  );
>  
> +TRACE_EVENT(dump_pagecache_range,
> +
> +	TP_PROTO(struct page *page, unsigned long len),
> +
> +	TP_ARGS(page, len),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	index		)
> +		__field(	unsigned long,	len		)
> +		__field(	unsigned long,	flags		)
> +		__field(	unsigned int,	count		)
> +		__field(	unsigned int,	mapcount	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index		= page->index;
> +		__entry->len		= len;
> +		__entry->flags		= page->flags;
> +		__entry->count		= atomic_read(&page->_count);
> +		__entry->mapcount	= page_mapcount(page);
> +	),
> +
> +	TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
> +		  __entry->index,
> +		  __entry->len,
> +		  __entry->flags,
> +		  __entry->count,
> +		  __entry->mapcount)
> +);
> +
> +TRACE_EVENT(dump_inode,
> +
> +	TP_PROTO(struct inode *inode, char *name, int len),
> +
> +	TP_ARGS(inode, name, len),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	ino		)
> +		__field(	loff_t,		size		)
> +		__field(	unsigned long,	nrpages		)
> +		__field(	unsigned long,	age		)
> +		__field(	unsigned long,	state		)
> +		__field(	dev_t,		dev		)
> +		__dynamic_array(char,		file,	len	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->ino		= inode->i_ino;
> +		__entry->size		= i_size_read(inode);
> +		__entry->nrpages	= inode->i_mapping->nrpages;
> +		__entry->age		= jiffies - inode->dirtied_when;
> +		__entry->state		= inode->i_state;
> +		__entry->dev		= inode->i_sb->s_dev;
> +		memcpy(__get_str(file), name, len);
> +	),
> +
> +	TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
> +		  "dev=%u:%u file=%s",
> +		  __entry->ino,
> +		  __entry->size,
> +		  __entry->nrpages << PAGE_CACHE_SHIFT,
> +		  __entry->age / HZ,
> +		  __entry->state & I_DIRTY,
> +		  MAJOR(__entry->dev),
> +		  MINOR(__entry->dev),
> +		  strchr(__get_str(file), '\n') ? "?" : __get_str(file))
> +);
> +
> +
>  #endif /*  _TRACE_MM_H */
>  
>  /* This part must be outside protection */
> --- linux-mm.orig/kernel/trace/trace_mm.c	2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/kernel/trace/trace_mm.c	2010-02-10 00:04:47.000000000 +0800
> @@ -9,6 +9,9 @@
>  #include <linux/bootmem.h>
>  #include <linux/debugfs.h>
>  #include <linux/uaccess.h>
> +#include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/file.h>
>  
>  #include "trace_output.h"
>  
> @@ -95,6 +98,201 @@ static const struct file_operations trac
>  	.write		= trace_mm_dump_range_write,
>  };
>  
> +static unsigned long page_flags(struct page* page)
> +{
> +	return page->flags & ((1 << NR_PAGEFLAGS) - 1);
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> +	if (page_count(page0) != page_count(page))
> +		return 0;
> +
> +	if (page_mapcount(page0) != page_mapcount(page))
> +		return 0;
> +
> +	if (page_flags(page0) != page_flags(page))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +#define BATCH_LINES	100
> +static void dump_pagecache(struct address_space *mapping)
> +{
> +	int i;
> +	int lines = 0;
> +	pgoff_t len = 0;
> +	struct pagevec pvec;
> +	struct page *page;
> +	struct page *page0 = NULL;
> +	unsigned long start = 0;
> +
> +	for (;;) {
> +		pagevec_init(&pvec, 0);
> +		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> +				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
> +
> +		if (pvec.nr == 0) {
> +			if (len)
> +				trace_dump_pagecache_range(page0, len);
> +			break;
> +		}
> +
> +		if (!page0)
> +			page0 = pvec.pages[0];
> +
> +		for (i = 0; i < pvec.nr; i++) {
> +			page = pvec.pages[i];
> +
> +			if (page->index == start + len &&
> +					pages_similiar(page0, page))
> +				len++;
> +			else {
> +				trace_dump_pagecache_range(page0, len);
> +				page0 = page;
> +				start = page->index;
> +				len = 1;
> +				if (++lines > BATCH_LINES) {
> +					lines = 0;
> +					cond_resched();
> +				}
> +			}
> +		}
> +	}
> +}
> +
> +static void dump_inode(struct inode *inode,
> +		       char *name_buf,
> +		       struct vfsmount *mnt)
> +{
> +	struct path path = {
> +		.mnt = mnt,
> +		.dentry = d_find_alias(inode)
> +	};
> +	char *name;
> +	int len;
> +
> +	if (!path.dentry) {
> +		trace_dump_inode(inode, "?", 2);
> +		return;
> +	}
> +
> +	name = d_path(&path, name_buf, PAGE_SIZE);
> +	if (IS_ERR(name)) {
> +		name = "?";
> +		len = 2;
> +	} else
> +		len = PAGE_SIZE + name_buf - name;
> +
> +	trace_dump_inode(inode, name, len);
> +
> +	if (path.dentry)
> +		dput(path.dentry);
> +}
> +
> +static void dump_fs_pagecache(struct super_block *sb, struct vfsmount *mnt)
> +{
> +	struct inode *inode;
> +	struct inode *prev_inode = NULL;
> +	char *name_buf;
> +
> +	name_buf = (char *)__get_free_page(GFP_TEMPORARY);
> +	if (!name_buf)
> +		return;
> +
> +	down_read(&sb->s_umount);
> +	if (!sb->s_root)
> +		goto out;
> +
> +	spin_lock(&inode_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> +			continue;
> +		__iget(inode);
> +		spin_unlock(&inode_lock);
> +		dump_inode(inode, name_buf, mnt);
> +		if (inode->i_mapping->nrpages)
> +			dump_pagecache(inode->i_mapping);
> +		iput(prev_inode);
> +		prev_inode = inode;
> +		cond_resched();
> +		spin_lock(&inode_lock);
> +	}
> +	spin_unlock(&inode_lock);
> +	iput(prev_inode);
> +out:
> +	up_read(&sb->s_umount);
> +	free_page((unsigned long)name_buf);
> +}
> +
> +static ssize_t
> +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> +		      loff_t *ppos)
> +{
> +	struct file *file = NULL;
> +	char *name;
> +	int err = 0;
> +
> +	if (count <= 1)
> +		return -EINVAL;
> +	if (count > PATH_MAX + 1)
> +		return -ENAMETOOLONG;
> +
> +	name = kmalloc(count+1, GFP_KERNEL);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(name, ubuf, count)) {
> +		err = -EFAULT;
> +		goto out;
> +	}
> +
> +	/* strip the newline added by `echo` */
> +	if (name[count-1] != '\n')
> +		return -EINVAL;
> +	name[count-1] = '\0';
> +
> +	file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		file = NULL;
> +		goto out;
> +	}
> +
> +	if (tracing_update_buffers() < 0) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +	if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +	if (trace_set_clr_event("mm", "dump_inode", 1)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (filp->f_path.dentry->d_inode->i_private) {
> +		dump_fs_pagecache(file->f_path.dentry->d_sb, file->f_path.mnt);
> +	} else {
> +		dump_pagecache(file->f_mapping);
> +	}
> +
> +out:
> +	if (file)
> +		fput(file);
> +	kfree(name);
> +
> +	return err ? err : count;
> +}
> +
> +static const struct file_operations trace_pagecache_fops = {
> +	.open		= tracing_open_generic,
> +	.read		= trace_mm_dump_range_read,
> +	.write		= trace_pagecache_write,
> +};
> +
>  /* move this into trace_objects.c when that file is created */
>  static struct dentry *trace_objects_dir(void)
>  {
> @@ -167,6 +365,12 @@ static __init int trace_objects_mm_init(
>  	trace_create_file("dump_range", 0600, d_pages, NULL,
>  			  &trace_mm_fops);
>  
> +	trace_create_file("walk-file", 0600, d_pages, NULL,
> +			  &trace_pagecache_fops);
> +
> +	trace_create_file("walk-fs", 0600, d_pages, (void *)1,
> +			  &trace_pagecache_fops);
> +
>  	return 0;
>  }
>  fs_initcall(trace_objects_mm_init);
> --- linux-mm.orig/fs/inode.c	2010-02-08 23:19:12.000000000 +0800
> +++ linux-mm/fs/inode.c	2010-02-08 23:19:22.000000000 +0800
> @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
>  	inode->i_bdev = NULL;
>  	inode->i_cdev = NULL;
>  	inode->i_rdev = 0;
> -	inode->dirtied_when = 0;
> +	inode->dirtied_when = jiffies;
>  
>  	if (security_inode_alloc(inode))
>  		goto out_free_inode;
> 





* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-16  3:22       ` KOSAKI Motohiro
@ 2010-02-17 22:38         ` Keiichi KII
  0 siblings, 0 replies; 17+ messages in thread
From: Keiichi KII @ 2010-02-17 22:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Wu Fengguang, Ingo Molnar, Chris Frost, Steven Rostedt,
	Peter Zijlstra, Frederic Weisbecker, Andrew Morton, Jason Baron,
	Hitoshi Mitake, linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

Hello,

(02/15/10 22:22), KOSAKI Motohiro wrote:
>>> Here is a scratch patch to exercise the "object collections" idea :)
>>>
>>> Interestingly, the pagecache walk is pretty fast, while copying out the trace
>>> data takes more time:
>>>
>>>         # time (echo / > walk-fs)
>>>         (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total
>>>
>>>         # time wc /debug/tracing/trace
>>>         4570 45893 551282 /debug/tracing/trace
>>>         wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total
>>
>> Ah got it: it takes much time to "print" the raw trace data.
>>
>>> TODO:
>>>
>>> correctness
>>> - show file path name
>>>   XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
>>
>> OK, finished with the file name with d_path(). I choose not to mangle
>> the possible '\n' in file names, and simply show "?" for such files,
>> for the sake of speed.
> 
> 
> This patch is nicer than KII-san's one. I plan to test it on
> my local test environment awhile.

I don't think my patch is completely replaced by Wu's patch.
Both patches focus on the pagecache and can work together towards
perf enhancements for mm, like "perf mm".

As he said, his patch can efficiently dump a snapshot of pagecache usage
for a file system or a file.
By taking several such snapshots with his patch, we would be able to monitor
just the growth and shrinkage of the pagecache.
My patch can monitor pagecache behavior such as the hit ratio and the
usage frequency (e.g. the following output).

For example, the output below shows an analysis of yum's pagecache behavior
using my patch.
Please focus on inodes 16 and 778 on device 253:0.
The system has 5752 cached pages for inode 16 and 86 cached pages for
inode 778.
We can get the same information using his patch as well.
But with my patch we can get more detailed information about the pagecache
in the system.
There is a big difference in usage frequency between inode 16 and inode 778:
yum accesses the same cached pages of inode 16 far more often than those
of inode 778.

This may also be useful for improving or tuning pagecache management such as
pdflush.

[process list]
o yum-3215
                          cache find  cache hit  cache hit
        device      inode      count      count      ratio
  --------------------------------------------------------
         253:0         16      34434      34130     99.12%
         253:0        198       9692       9463     97.64%
         253:0        639        647        628     97.06%
         253:0        778         32         29     90.62%
         253:0       7305      50225      49005     97.57%
         253:0     144217         12         10     83.33%
         253:0     262775         16         13     81.25%
*snip*

[file list]
        device              cached
     (maj:min)      inode    pages
  --------------------------------
         253:0         16     5752
         253:0        198     2233
         253:0        639       51
         253:0        778       86
         253:0       7305    12307
         253:0     144217       11
         253:0     262775       39
*snip*
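
The ratio column in the [process list] above is the hit count divided by
the find count. As a rough illustration of that arithmetic, and of the
usage-frequency comparison between inode 16 and inode 778, here is a small
sketch; the function names are made up for this example and are not taken
from the script.

```c
/* Cache hit ratio in percent; 0.0 when there were no lookups. */
static double hit_ratio_percent(unsigned long hits, unsigned long finds)
{
	if (finds == 0)
		return 0.0;
	return 100.0 * (double)hits / (double)finds;
}

/* Average number of lookups per cached page; 0.0 when nothing is cached. */
static double finds_per_cached_page(unsigned long finds,
				    unsigned long cached_pages)
{
	if (cached_pages == 0)
		return 0.0;
	return (double)finds / (double)cached_pages;
}
```

With the numbers above, inode 16 sees about 6 lookups per cached page
(34434 / 5752), while inode 778 sees well under one (32 / 86).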

[process list]
o yum-3215
        device              cached    added  removed      indirect
     (maj:min)      inode    pages    pages    pages removed pages
  ----------------------------------------------------------------
         253:0         16    34130     5752        0             0
         253:0        198     9463     2233        0             0
         253:0        639      628       51        0             0
         253:0        778       29       78        0             0
         253:0       7305    49005    12307        0             0
         253:0     144217       10       11        0             0
         253:0     262775       13       39        0             0
*snip*
  ----------------------------------------------------------------
  total:                    102346    26165        1             0

Any comments are welcome.

Thanks,
Keiichi


* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-08 15:54   ` Wu Fengguang
  2010-02-09 16:21     ` Wu Fengguang
@ 2010-02-18  5:34     ` KAMEZAWA Hiroyuki
  2010-02-18  9:58       ` Balbir Singh
  2010-02-21  3:09       ` Wu Fengguang
  1 sibling, 2 replies; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-18  5:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Ingo Molnar, Chris Frost, Steven Rostedt, Peter Zijlstra,
	Frederic Weisbecker, Keiichi KII, Andrew Morton, Jason Baron,
	Hitoshi Mitake, linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

On Mon, 8 Feb 2010 23:54:50 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Hi Ingo,
> 
> > Note that there's also these older experimental commits in tip:tracing/mm 
> > that introduce the notion of 'object collections' and adds the ability to 
> > trace them:
> > 
> > 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> > c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> > 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> > b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> > 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> > eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> > 1487a7a: tracing/mm: fix mapcount trace record field
> > dcac8cd: tracing/mm: add page frame snapshot trace
> > 
> > this concept, if refreshed a bit and extended to the page cache, would allow 
> > the recording/snapshotting of the MM state of all currently present pages in 
> > the page-cache - a possibly nice addition to the dynamic technique you apply 
> > in your patches.
> > 
> > there's similar "object collections" work underway for 'perf lock' btw., by 
> > Hitoshi Mitake and Frederic.
> > 
> > So there's lots of common ground and lots of interest.
> 
> Here is a scratch patch to exercise the "object collections" idea :)
> 
> Interestingly, the pagecache walk is pretty fast, while copying out the trace
> data takes more time:
> 
>         # time (echo / > walk-fs)
>         (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total
> 
>         # time wc /debug/tracing/trace
>         4570 45893 551282 /debug/tracing/trace
>         wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total
> 
>         # time (cat /debug/tracing/trace > /dev/shm/t)
>         (; cat /debug/tracing/trace > /dev/shm/t; )  0.04s user 0.49s system 95% cpu 0.548 total
> 
>         # time (dd if=/debug/tracing/trace of=/dev/shm/t bs=1M)
>         0+138 records in
>         0+138 records out
>         551282 bytes (551 kB) copied, 0.380454 s, 1.4 MB/s
>         (; dd if=/debug/tracing/trace of=/dev/shm/t bs=1M; )  0.09s user 0.48s system 96% cpu 0.600 total
> 
> The patch is based on tip/tracing/mm. 
> 
> Thanks,
> Fengguang
> ---
> tracing: pagecache object collections
> 
> This dumps
> - all cached files of a mounted fs  (the inode-cache)
> - all cached pages of a cached file (the page-cache)
> 
> Usage and Sample output:
> 
> # echo / > /debug/tracing/objects/mm/pages/walk-fs
> # head /debug/tracing/trace
> 
> # tracer: nop
> #
> #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> #              | |       |          |         |
>              zsh-3078  [000]   526.272587: dump_inode: ino=102223 size=169291 cached=172032 age=9 dirty=6 dev=0:15 file=<TODO>
>              zsh-3078  [000]   526.274260: dump_pagecache_range: index=0 len=41 flags=10000000000002c count=1 mapcount=0
>              zsh-3078  [000]   526.274340: dump_pagecache_range: index=41 len=1 flags=10000000000006c count=1 mapcount=0
>              zsh-3078  [000]   526.274401: dump_inode: ino=8966 size=442 cached=4096 age=49 dirty=0 dev=0:15 file=<TODO>
>              zsh-3078  [000]   526.274425: dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0
>              zsh-3078  [000]   526.274440: dump_inode: ino=8964 size=4096 cached=0 age=49 dirty=0 dev=0:15 file=<TODO>
> 
> Here "age" is either age from inode create time, or from last dirty time.
> 
> TODO:
> 
> correctness
> - show file path name
>   XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> - reliably prevent ring buffer overflow,
>   by replacing cond_resched() with some wait function
>   (eg. wait until 2+ pages are free in ring buffer)
> - use stable_page_flags() in recent kernel
> 
> output style
> - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> - clear ring buffer before dumping the objects?
> - output format: key=value pairs ==> header + tabbed values?
> - add filtering options if necessary
> 

Can we dump page's cgroup ? If so, I'm happy.
Maybe
==
  struct page_cgroup *pc = lookup_page_cgroup(page);
  struct mem_cgroup *mem = pc->mem_cgroup;
  short mem_cgroup_id = mem->css.css_id;

  And statistics can be counted per css_id.

And then, some output like

dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0 file=XXX memcg=group_A:x,group_B:y

Is it okay to add a new field after your work is finished?

If so, I'll think about some infrastructure to get the above, based on your patch.

Thanks,
-Kame





> CC: Ingo Molnar <mingo@elte.hu>
> CC: Chris Frost <frost@cs.ucla.edu>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/inode.c                |    2 
>  include/trace/events/mm.h |   67 ++++++++++++++
>  kernel/trace/trace_mm.c   |  165 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 233 insertions(+), 1 deletion(-)
> 
> --- linux-mm.orig/include/trace/events/mm.h	2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/include/trace/events/mm.h	2010-02-08 23:19:16.000000000 +0800
> @@ -2,6 +2,7 @@
>  #define _TRACE_MM_H
>  
>  #include <linux/tracepoint.h>
> +#include <linux/pagemap.h>
>  #include <linux/mm.h>
>  
>  #undef TRACE_SYSTEM
> @@ -42,6 +43,72 @@ TRACE_EVENT(dump_pages,
>  		  __entry->mapcount, __entry->index)
>  );
>  
> +TRACE_EVENT(dump_pagecache_range,
> +
> +	TP_PROTO(struct page *page, unsigned long len),
> +
> +	TP_ARGS(page, len),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	index		)
> +		__field(	unsigned long,	len		)
> +		__field(	unsigned long,	flags		)
> +		__field(	unsigned int,	count		)
> +		__field(	unsigned int,	mapcount	)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->index		= page->index;
> +		__entry->len		= len;
> +		__entry->flags		= page->flags;
> +		__entry->count		= atomic_read(&page->_count);
> +		__entry->mapcount	= page_mapcount(page);
> +	),
> +
> +	TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u",
> +		  __entry->index,
> +		  __entry->len,
> +		  __entry->flags,
> +		  __entry->count,
> +		  __entry->mapcount)
> +);
> +
> +TRACE_EVENT(dump_inode,
> +
> +	TP_PROTO(struct inode *inode),
> +
> +	TP_ARGS(inode),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned long,	ino		)
> +		__field(	loff_t,		size		)
> +		__field(	unsigned long,	nrpages		)
> +		__field(	unsigned long,	age		)
> +		__field(	unsigned long,	state		)
> +		__field(	dev_t,		dev		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->ino		= inode->i_ino;
> +		__entry->size		= i_size_read(inode);
> +		__entry->nrpages	= inode->i_mapping->nrpages;
> +		__entry->age		= jiffies - inode->dirtied_when;
> +		__entry->state		= inode->i_state;
> +		__entry->dev		= inode->i_sb->s_dev;
> +	),
> +
> +	TP_printk("ino=%lu size=%llu cached=%lu age=%lu dirty=%lu "
> +		  "dev=%u:%u file=<TODO>",
> +		  __entry->ino,
> +		  __entry->size,
> +		  __entry->nrpages << PAGE_CACHE_SHIFT,
> +		  __entry->age / HZ,
> +		  __entry->state & I_DIRTY,
> +		  MAJOR(__entry->dev),
> +		  MINOR(__entry->dev))
> +);
> +
> +
>  #endif /*  _TRACE_MM_H */
>  
>  /* This part must be outside protection */
> --- linux-mm.orig/kernel/trace/trace_mm.c	2010-02-08 23:19:09.000000000 +0800
> +++ linux-mm/kernel/trace/trace_mm.c	2010-02-08 23:19:16.000000000 +0800
> @@ -9,6 +9,9 @@
>  #include <linux/bootmem.h>
>  #include <linux/debugfs.h>
>  #include <linux/uaccess.h>
> +#include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/file.h>
>  
>  #include "trace_output.h"
>  
> @@ -95,6 +98,162 @@ static const struct file_operations trac
>  	.write		= trace_mm_dump_range_write,
>  };
>  
> +static unsigned long page_flags(struct page* page)
> +{
> +	return page->flags & ((1 << NR_PAGEFLAGS) - 1);
> +}
> +
> +static int pages_similiar(struct page* page0, struct page* page)
> +{
> +	if (page_count(page0) != page_count(page))
> +		return 0;
> +
> +	if (page_mapcount(page0) != page_mapcount(page))
> +		return 0;
> +
> +	if (page_flags(page0) != page_flags(page))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +#define BATCH_LINES	100
> +static void dump_pagecache(struct address_space *mapping)
> +{
> +	int i;
> +	int lines = 0;
> +	pgoff_t len = 0;
> +	struct pagevec pvec;
> +	struct page *page;
> +	struct page *page0 = NULL;
> +	unsigned long start = 0;
> +
> +	for (;;) {
> +		pagevec_init(&pvec, 0);
> +		pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> +				(void **)pvec.pages, start + len, PAGEVEC_SIZE);
> +
> +		if (pvec.nr == 0) {
> +			if (len)
> +				trace_dump_pagecache_range(page0, len);
> +			break;
> +		}
> +
> +		if (!page0)
> +			page0 = pvec.pages[0];
> +
> +		for (i = 0; i < pvec.nr; i++) {
> +			page = pvec.pages[i];
> +
> +			if (page->index == start + len &&
> +					pages_similiar(page0, page))
> +				len++;
> +			else {
> +				trace_dump_pagecache_range(page0, len);
> +				page0 = page;
> +				start = page->index;
> +				len = 1;
> +				if (++lines > BATCH_LINES) {
> +					lines = 0;
> +					cond_resched();
> +				}
> +			}
> +		}
> +	}
> +}
> +
> +static void dump_fs_pagecache(struct super_block *sb)
> +{
> +	struct inode *inode;
> +	struct inode *prev_inode = NULL;
> +
> +	down_read(&sb->s_umount);
> +	if (!sb->s_root)
> +		goto out;
> +	spin_lock(&inode_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
> +			continue;
> +		__iget(inode);
> +		spin_unlock(&inode_lock);
> +		trace_dump_inode(inode);
> +		if (inode->i_mapping->nrpages)
> +			dump_pagecache(inode->i_mapping);
> +		iput(prev_inode);
> +		prev_inode = inode;
> +		cond_resched();
> +		spin_lock(&inode_lock);
> +	}
> +	spin_unlock(&inode_lock);
> +	iput(prev_inode);
> +out:
> +	up_read(&sb->s_umount);
> +}
> +
> +static ssize_t
> +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> +		      loff_t *ppos)
> +{
> +	struct file *file = NULL;
> +	char *name;
> +	int err = 0;
> +
> +	if (count > PATH_MAX + 1)
> +		return -ENAMETOOLONG;
> +
> +	name = kmalloc(count+1, GFP_KERNEL);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(name, ubuf, count)) {
> +		err = -EFAULT;
> +		goto out;
> +	}
> +
> +	/* strip the newline added by `echo` */
> +	if (count)
> +		name[count-1] = '\0';
> +
> +	file = filp_open(name, O_RDONLY|O_LARGEFILE, 0);
> +	if (IS_ERR(file)) {
> +		err = PTR_ERR(file);
> +		file = NULL;
> +		goto out;
> +	}
> +
> +	if (tracing_update_buffers() < 0) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +	if (trace_set_clr_event("mm", "dump_pagecache_range", 1)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +	if (trace_set_clr_event("mm", "dump_inode", 1)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (filp->f_path.dentry->d_inode->i_private) {
> +		dump_fs_pagecache(file->f_path.dentry->d_sb);
> +	} else {
> +		dump_pagecache(file->f_mapping);
> +	}
> +
> +out:
> +	if (file)
> +		fput(file);
> +	kfree(name);
> +
> +	return err ? err : count;
> +}
> +
> +static const struct file_operations trace_pagecache_fops = {
> +	.open		= tracing_open_generic,
> +	.read		= trace_mm_dump_range_read,
> +	.write		= trace_pagecache_write,
> +};
> +
>  /* move this into trace_objects.c when that file is created */
>  static struct dentry *trace_objects_dir(void)
>  {
> @@ -167,6 +326,12 @@ static __init int trace_objects_mm_init(
>  	trace_create_file("dump_range", 0600, d_pages, NULL,
>  			  &trace_mm_fops);
>  
> +	trace_create_file("walk-file", 0600, d_pages, NULL,
> +			  &trace_pagecache_fops);
> +
> +	trace_create_file("walk-fs", 0600, d_pages, (void *)1,
> +			  &trace_pagecache_fops);
> +
>  	return 0;
>  }
>  fs_initcall(trace_objects_mm_init);
> --- linux-mm.orig/fs/inode.c	2010-02-08 23:19:12.000000000 +0800
> +++ linux-mm/fs/inode.c	2010-02-08 23:19:22.000000000 +0800
> @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
>  	inode->i_bdev = NULL;
>  	inode->i_cdev = NULL;
>  	inode->i_rdev = 0;
> -	inode->dirtied_when = 0;
> +	inode->dirtied_when = jiffies;
>  
>  	if (security_inode_alloc(inode))
>  		goto out_free_inode;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-18  5:34     ` KAMEZAWA Hiroyuki
@ 2010-02-18  9:58       ` Balbir Singh
  2010-02-23 14:04         ` Wu Fengguang
  2010-02-21  3:09       ` Wu Fengguang
  1 sibling, 1 reply; 17+ messages in thread
From: Balbir Singh @ 2010-02-18  9:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Wu Fengguang, Ingo Molnar, Chris Frost, Steven Rostedt,
	Peter Zijlstra, Frederic Weisbecker, Keiichi KII, Andrew Morton,
	Jason Baron, Hitoshi Mitake, linux-kernel@vger.kernel.org,
	lwoodman@redhat.com, linux-mm@kvack.org, Tom Zanussi,
	riel@redhat.com, Munehiro Ikeda, Atsushi Tsuji

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-02-18 14:34:29]:

> On Mon, 8 Feb 2010 23:54:50 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Hi Ingo,
> > 
> > > Note that there's also these older experimental commits in tip:tracing/mm 
> > > that introduce the notion of 'object collections' and adds the ability to 
> > > trace them:
> > > 
> > > 3383e37: tracing, page-allocator: Add a postprocessing script for page-allocator-related ftrace events
> > > c33b359: tracing, page-allocator: Add trace event for page traffic related to the buddy lists
> > > 0d524fb: tracing, mm: Add trace events for anti-fragmentation falling back to other migratetypes
> > > b9a2817: tracing, page-allocator: Add trace events for page allocation and page freeing
> > > 08b6cb8: perf_counter tools: Provide default bfd_demangle() function in case it's not around
> > > eb46710: tracing/mm: rename 'trigger' file to 'dump_range'
> > > 1487a7a: tracing/mm: fix mapcount trace record field
> > > dcac8cd: tracing/mm: add page frame snapshot trace
> > > 
> > > this concept, if refreshed a bit and extended to the page cache, would allow 
> > > the recording/snapshotting of the MM state of all currently present pages in 
> > > the page-cache - a possibly nice addition to the dynamic technique you apply 
> > > in your patches.
> > > 
> > > there's similar "object collections" work underway for 'perf lock' btw., by 
> > > Hitoshi Mitake and Frederic.
> > > 
> > > So there's lots of common ground and lots of interest.
> > 
> > Here is a scratch patch to exercise the "object collections" idea :)
> > 
> > Interestingly, the pagecache walk is pretty fast, while copying out the trace
> > data takes more time:
> > 
> >         # time (echo / > walk-fs)
> >         (; echo / > walk-fs; )  0.01s user 0.11s system 82% cpu 0.145 total
> > 
> >         # time wc /debug/tracing/trace
> >         4570 45893 551282 /debug/tracing/trace
> >         wc /debug/tracing/trace  0.75s user 0.55s system 88% cpu 1.470 total
> > 
> >         # time (cat /debug/tracing/trace > /dev/shm/t)
> >         (; cat /debug/tracing/trace > /dev/shm/t; )  0.04s user 0.49s system 95% cpu 0.548 total
> > 
> >         # time (dd if=/debug/tracing/trace of=/dev/shm/t bs=1M)
> >         0+138 records in
> >         0+138 records out
> >         551282 bytes (551 kB) copied, 0.380454 s, 1.4 MB/s
> >         (; dd if=/debug/tracing/trace of=/dev/shm/t bs=1M; )  0.09s user 0.48s system 96% cpu 0.600 total
> > 
> > The patch is based on tip/tracing/mm. 
> > 
> > Thanks,
> > Fengguang
> > ---
> > tracing: pagecache object collections
> > 
> > This dumps
> > - all cached files of a mounted fs  (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> > 
> > Usage and Sample output:
> > 
> > # echo / > /debug/tracing/objects/mm/pages/walk-fs
> > # head /debug/tracing/trace
> > 
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >              zsh-3078  [000]   526.272587: dump_inode: ino=102223 size=169291 cached=172032 age=9 dirty=6 dev=0:15 file=<TODO>
> >              zsh-3078  [000]   526.274260: dump_pagecache_range: index=0 len=41 flags=10000000000002c count=1 mapcount=0
> >              zsh-3078  [000]   526.274340: dump_pagecache_range: index=41 len=1 flags=10000000000006c count=1 mapcount=0
> >              zsh-3078  [000]   526.274401: dump_inode: ino=8966 size=442 cached=4096 age=49 dirty=0 dev=0:15 file=<TODO>
> >              zsh-3078  [000]   526.274425: dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0
> >              zsh-3078  [000]   526.274440: dump_inode: ino=8964 size=4096 cached=0 age=49 dirty=0 dev=0:15 file=<TODO>
> > 
> > Here "age" is either age from inode create time, or from last dirty time.
> > 
> > TODO:
> > 
> > correctness
> > - show file path name
> >   XXX: can trace_seq_path() be called directly inside TRACE_EVENT()?
> > - reliably prevent ring buffer overflow,
> >   by replacing cond_resched() with some wait function
> >   (eg. wait until 2+ pages are free in ring buffer)
> > - use stable_page_flags() in recent kernel
> > 
> > output style
> > - use plain tracing output format (no fancy TASK-PID/.../FUNCTION fields)
> > - clear ring buffer before dumping the objects?
> > - output format: key=value pairs ==> header + tabbed values?
> > - add filtering options if necessary
> > 
> 
> Can we dump page's cgroup ? If so, I'm happy.
> Maybe
> ==
>   struct page_cgroup *pc = lookup_page_cgroup(page);
>   struct mem_cgroup *mem = pc->mem_cgroup;
>   short mem_cgroup_id = mem->css.css_id;
> 
>   And statistics can be counted per css_id.
>

Good idea, all of this needs to happen with a check to see if memcg is
enabled/disabled at boot as well. pc can be NULL if
CONFIG_CGROUP_MEM_RES_CTLR is not enabled.
 
> And then, some output like
> 
> dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0 file=XXX memcg=group_A:x,group_B:y
> 
> Is it okay to add a new field after your work is finished?
> 
-- 
	Three Cheers,
	Balbir

* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-13 13:29       ` Balbir Singh
  2010-02-14 10:52         ` Balbir Singh
@ 2010-02-21  2:28         ` Wu Fengguang
  1 sibling, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2010-02-21  2:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Ingo Molnar, Chris Frost, Steven Rostedt, Peter Zijlstra,
	Frederic Weisbecker, Keiichi KII, Andrew Morton, Jason Baron,
	Hitoshi Mitake, linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

Hi Balbir,

> > tracing: pagecache object collections
> >
> > This dumps
> > - all cached files of a mounted fs  (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> >
> > Usage and Sample output:
> >
> > # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> > # tail /debug/tracing/trace
> >              zsh-2528  [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> >              zsh-2528  [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> >              zsh-2528  [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> >              zsh-2528  [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> >              zsh-2528  [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> >              zsh-2528  [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> >              zsh-2528  [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> >              zsh-2528  [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> >              zsh-2528  [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> >              zsh-2528  [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> >
> > Here "age" is either age from inode create time, or from last dirty time.
> >
> 
> It would be nice to see mapped/unmapped information as well.

As you noticed, we have mapcount for individual pages :)

> > +static int pages_similiar(struct page* page0, struct page* page)
> > +{
> > +     if (page_count(page0) != page_count(page))
> > +             return 0;
> > +
> > +     if (page_mapcount(page0) != page_mapcount(page))
> > +             return 0;
> > +
> > +     if (page_flags(page0) != page_flags(page))
> > +             return 0;
> > +
> > +     return 1;
> > +}
> > +
> 
> OK, so pages_similar() is used to identify a range of pages in the
> cache?

Right. Many files are accessed sequentially or clustered, so
pages_similar() can save lots of output lines :)
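
The coalescing idea can be sketched in plain userspace C. This is only an
illustration under assumptions: struct page_info, struct page_range and
coalesce_pages() are hypothetical stand-ins, not kernel APIs, and the input
is assumed to be sorted by index.

```c
#include <stddef.h>

/* Simplified stand-in for the per-page state compared in the patch. */
struct page_info {
	unsigned long index;	/* page offset within the file */
	unsigned long flags;
	unsigned int count;
	unsigned int mapcount;
};

/* One output record, comparable to one dump_pagecache_range line. */
struct page_range {
	struct page_info first;	/* state of the first page in the run */
	unsigned long len;	/* number of pages in the run */
};

/* Two pages are similar when count, mapcount and flags all match. */
int pages_similar(const struct page_info *a, const struct page_info *b)
{
	return a->count == b->count &&
	       a->mapcount == b->mapcount &&
	       a->flags == b->flags;
}

/*
 * Merge pages into runs of consecutive, similar pages.
 * Writes at most max ranges to out and returns the number written.
 */
size_t coalesce_pages(const struct page_info *pages, size_t n,
		      struct page_range *out, size_t max)
{
	size_t nr = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct page_info *p = &pages[i];

		if (nr > 0) {
			struct page_range *r = &out[nr - 1];

			/* Extend the current run if p directly follows it. */
			if (p->index == r->first.index + r->len &&
			    pages_similar(&r->first, p)) {
				r->len++;
				continue;
			}
		}
		if (nr == max)
			break;		/* no room for another range */
		out[nr].first = *p;
		out[nr].len = 1;
		nr++;
	}
	return nr;
}
```

With 41 similar pages at index 0..40 followed by one page with different
flags at index 41, this produces two ranges (index=0 len=41 and index=41
len=1), the same shape as the sample trace output earlier in the thread.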

> > +#define BATCH_LINES  100
> > +static void dump_pagecache(struct address_space *mapping)
> > +{
> > +     int i;
> > +     int lines = 0;
> > +     pgoff_t len = 0;
> > +     struct pagevec pvec;
> > +     struct page *page;
> > +     struct page *page0 = NULL;
> > +     unsigned long start = 0;
> > +
> > +     for (;;) {
> > +             pagevec_init(&pvec, 0);
> > +             pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> > +                             (void **)pvec.pages, start + len, PAGEVEC_SIZE);
> 
> Is radix_tree_gang_lookup synchronized somewhere? Don't we need to
> call it under RCU or a lock (mapping) ?

No. This function is inherently non-atomic, and it seems that most in-kernel
users do not bother to take rcu_read_lock(). So let's leave it as is?

> > +static ssize_t
> > +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> > +                   loff_t *ppos)
> > +{
> > +     struct file *file = NULL;
> > +     char *name;
> > +     int err = 0;
> > +
> 
> Can't we use the trace_parser here?

Seems not necessary? It's merely one file name, which could contain spaces.

> > +     if (count <= 1)
> > +             return -EINVAL;
> > +     if (count > PATH_MAX + 1)
> > +             return -ENAMETOOLONG;
> > +
> > +     name = kmalloc(count+1, GFP_KERNEL);
> > +     if (!name)
> > +             return -ENOMEM;
> > +
> > +     if (copy_from_user(name, ubuf, count)) {
> > +             err = -EFAULT;
> > +             goto out;
> > +     }
> > +
> > +     /* strip the newline added by `echo` */
> > +     if (name[count-1] != '\n')
> > +             return -EINVAL;
> 
> Doesn't sound correct, what happens if we use echo -n?

It's a bit sad. If we accept both "echo" and "echo -n" with some
smart logic to test for trailing '\n', then it will go wrong for a
'\n'-terminated file name.

Or shall we support only "echo -n"?  I can do with either one.
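
To make the ambiguity concrete, here is a userspace sketch of the "accept
both" variant. It is not the code in the patch: parse_path_arg() is a
hypothetical helper that drops at most one trailing '\n' before the name is
used.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy a written buffer into name (NUL-terminated), dropping at most one
 * trailing '\n'. Returns the resulting name length, or -1 if the name is
 * empty or does not fit into name.
 */
long parse_path_arg(const char *buf, size_t count, char *name, size_t size)
{
	if (count > 0 && buf[count - 1] == '\n')
		count--;		/* the newline appended by plain `echo` */
	if (count == 0 || count >= size)
		return -1;
	memcpy(name, buf, count);
	name[count] = '\0';
	return (long)count;
}
```

Both "/dev\n" (from echo) and "/dev" (from echo -n) parse to "/dev". The
cost is the case described above: a file whose name really ends in '\n',
written with echo -n, loses its last character and a different path gets
opened.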

> > --- linux-mm.orig/fs/inode.c  2010-02-08 23:19:12.000000000 +0800
> > +++ linux-mm/fs/inode.c       2010-02-08 23:19:22.000000000 +0800
> > @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> >       inode->i_bdev = NULL;
> >       inode->i_cdev = NULL;
> >       inode->i_rdev = 0;
> > -     inode->dirtied_when = 0;
> > +     inode->dirtied_when = jiffies;
> >
> 
> Hmmm... Is the inode really dirtied when initialized? I know the
> change is for tracing, but the code when read is confusing.

Huh. Not really dirtied (for that you need to check I_DIRTY), but
dirtied_when is only used in writeback code when I_DIRTY is set.

So I overload dirtied_when in the clean case to indicate the inode
load time. This is a useful trick for fastboot to collect cache
footprint shortly after boot, when most inodes are clean.

It does ask for a comment:

        /*
         * This records inode load time. It will be invalidated once inode is
         * dirtied, or jiffies wraps around. Despite the pitfalls it still
         * provides useful information for some use cases like fastboot.
         */
        inode->dirtied_when = jiffies;


Thanks,
Fengguang

* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-18  5:34     ` KAMEZAWA Hiroyuki
  2010-02-18  9:58       ` Balbir Singh
@ 2010-02-21  3:09       ` Wu Fengguang
  1 sibling, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2010-02-21  3:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ingo Molnar, Chris Frost, Steven Rostedt, Peter Zijlstra,
	Frederic Weisbecker, Keiichi KII, Andrew Morton, Jason Baron,
	Hitoshi Mitake, linux-kernel@vger.kernel.org, lwoodman@redhat.com,
	linux-mm@kvack.org, Tom Zanussi, riel@redhat.com, Munehiro Ikeda,
	Atsushi Tsuji

Kame,

On Thu, Feb 18, 2010 at 01:34:29PM +0800, KAMEZAWA Hiroyuki wrote:

> Can we dump page's cgroup ? If so, I'm happy.

Good idea. page_cgroup is an extension of mem_map anyway.

> Maybe
> ==
>   struct page_cgroup *pc = lookup_page_cgroup(page);
>   struct mem_cgroup *mem = pc->mem_cgroup;
>   short mem_cgroup_id = mem->css.css_id;
> 
>   And statistics can be counted per css_id.
> 
> And then, some output like
> 
> dump_pagecache_range: index=0 len=1 flags=10000000000002c count=1 mapcount=0 file=XXX memcg=group_A:x,group_B:y

Is it possible for a page to be owned by two cgroups?
For hierarchical cgroups, it would be easier to report only the bottom level cgroup.

> Is it okay to add a new field after your work is finished?

Sure.

> If so, I'll think about some infrastructure to get the above, based on your patch.

Then you may want to include this patch (with modification),
if you record the css id as raw tracing data.

Thanks,
Fengguang
---
memcg: show memory.id in cgroupfs

The hwpoison test suite needs to selectively inject hwpoison into some
targeted task pages, and must not kill important system processes
such as init.

The memory cgroup serves this purpose well. We can put the target
processes under the control of a memory cgroup, tell the hwpoison
injection code the id of that memory cgroup so that it will only
poison pages associated with it.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/memcontrol.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- linux-mm.orig/mm/memcontrol.c	2009-09-07 16:01:02.000000000 +0800
+++ linux-mm/mm/memcontrol.c	2009-09-11 18:20:55.000000000 +0800
@@ -2510,6 +2510,13 @@ mem_cgroup_get_recursive_idx_stat(struct
 	*val = d.val;
 }
 
+#ifdef CONFIG_HWPOISON_INJECT
+static u64 mem_cgroup_id_read(struct cgroup *cont, struct cftype *cft)
+{
+	return css_id(cgroup_subsys_state(cont, mem_cgroup_subsys_id));
+}
+#endif
+
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
@@ -2841,6 +2848,12 @@ static int mem_cgroup_swappiness_write(s
 
 
 static struct cftype mem_cgroup_files[] = {
+#ifdef CONFIG_HWPOISON_INJECT /* for now, only user is hwpoison testing */
+	{
+		.name = "id",
+		.read_u64 = mem_cgroup_id_read,
+	},
+#endif
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),

* Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
  2010-02-18  9:58       ` Balbir Singh
@ 2010-02-23 14:04         ` Wu Fengguang
  0 siblings, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2010-02-23 14:04 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Ingo Molnar, Chris Frost, Steven Rostedt,
	Peter Zijlstra, Frederic Weisbecker, Keiichi KII, Andrew Morton,
	Jason Baron, Hitoshi Mitake, linux-kernel@vger.kernel.org,
	lwoodman@redhat.com, linux-mm@kvack.org, Tom Zanussi,
	riel@redhat.com, Munehiro Ikeda, Atsushi Tsuji

On Thu, Feb 18, 2010 at 05:58:50PM +0800, Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-02-18 14:34:29]:
> > Can we dump page's cgroup ? If so, I'm happy.
> > Maybe
> > ==
> >   struct page_cgroup *pc = lookup_page_cgroup(page);
> >   struct mem_cgroup *mem = pc->mem_cgroup;
> >   short mem_cgroup_id = mem->css.css_id;
> > 
> >   And statistics can be counted per css_id.
> >
> 
> Good idea, all of this needs to happen with a check to see if memcg is
> enabled/disabled at boot as well. pc can be NULL if
> CONFIG_CGROUP_MEM_RES_CTLR is not enabled.

Not sure if this is what you have in mind, but I defined a function in
memcontrol.c for the trace code. Compile tested.

It'll be used like this:

        TP_fast_assign(
                        __entry->memcg          = page_memcg_id(page);
                      )

        TP_printk("index=%lu len=%lu flags=%lx count=%u mapcount=%u memcg=%d",
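
On the consumer side, per-memcg statistics could then be collected from the
trace text. A rough userspace sketch, assuming the memcg=<id> field is
appended exactly as in the format string above; account_range_line() and
MAX_MEMCG_ID are hypothetical names, and the id is parsed as unsigned
because css ids are unsigned short.

```c
#include <stdio.h>
#include <string.h>

#define MAX_MEMCG_ID 65536	/* assumption: ids fit in an unsigned short */

/*
 * Parse one dump_pagecache_range line and add its page count (len) to
 * pages_per_memcg[id]. Returns 0 on success, -1 if the line is not a
 * dump_pagecache_range record or lacks the memcg field.
 */
int account_range_line(const char *line, unsigned long *pages_per_memcg)
{
	static const char tag[] = "dump_pagecache_range: ";
	unsigned long index, len, flags;
	unsigned int count, mapcount, memcg;
	const char *p = strstr(line, tag);

	if (!p)
		return -1;
	p += sizeof(tag) - 1;	/* skip the TASK-PID/CPU/TIMESTAMP prefix */

	if (sscanf(p, "index=%lu len=%lu flags=%lx count=%u mapcount=%u memcg=%u",
		   &index, &len, &flags, &count, &mapcount, &memcg) != 6)
		return -1;
	if (memcg >= MAX_MEMCG_ID)
		return -1;

	pages_per_memcg[memcg] += len;
	return 0;
}
```

Feeding every line of the trace file through account_range_line() leaves
the number of cached pages per memcg id in the array; dump_inode lines are
rejected with -1 and can simply be skipped.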

Thanks,
Fengguang

---
memcg: introduce page_memcg_id()

This will be used to dump the memcg id associated with a pagecache page.

CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/memcontrol.h |    6 ++++++
 mm/memcontrol.c            |   16 ++++++++++++++++
 2 files changed, 22 insertions(+)

--- linux-mm.orig/include/linux/memcontrol.h	2010-02-23 21:49:39.000000000 +0800
+++ linux-mm/include/linux/memcontrol.h	2010-02-23 21:50:14.000000000 +0800
@@ -69,6 +69,7 @@ extern void mem_cgroup_out_of_memory(str
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
+extern unsigned short page_memcg_id(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 static inline
@@ -142,6 +143,11 @@ static inline int mem_cgroup_try_charge_
 	return 0;
 }
 
+static inline unsigned short page_memcg_id(struct page *page)
+{
+	return 0;
+}
+
 static inline void mem_cgroup_commit_charge_swapin(struct page *page,
 					  struct mem_cgroup *ptr)
 {
--- linux-mm.orig/mm/memcontrol.c	2010-02-23 21:48:23.000000000 +0800
+++ linux-mm/mm/memcontrol.c	2010-02-23 21:49:33.000000000 +0800
@@ -324,6 +324,22 @@ static struct mem_cgroup *try_get_mem_cg
 	return mem;
 }
 
+unsigned short page_memcg_id(struct page *page)
+{
+	struct mem_cgroup *mem;
+	struct cgroup_subsys_state *css;
+	unsigned short id = 0;
+
+	mem = try_get_mem_cgroup_from_page(page);
+	if (mem) {
+		css = mem_cgroup_css(mem);
+		id = css_id(css);
+		css_put(css);
+	}
+
+	return id;
+}
+
 /*
  * Call callback function against all cgroup under hierarchy tree.
  */

Thread overview: 17+ messages
2010-02-05  2:17 [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Keiichi KII
2010-02-05  2:24 ` [RFC PATCH -tip 1/2 v3] tracepoints: add tracepoints for pagecache Keiichi KII
2010-02-05  2:25 ` [RFC PATCH -tip 2/2 v3] add scripts for pagecache analysis per process Keiichi KII
2010-02-05  7:28 ` [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Ingo Molnar
2010-02-05 21:19   ` Keiichi KII
2010-02-08 15:54   ` Wu Fengguang
2010-02-09 16:21     ` Wu Fengguang
2010-02-13 13:29       ` Balbir Singh
2010-02-14 10:52         ` Balbir Singh
2010-02-21  2:28         ` Wu Fengguang
2010-02-16  3:22       ` KOSAKI Motohiro
2010-02-17 22:38         ` Keiichi KII
2010-02-18  5:34     ` KAMEZAWA Hiroyuki
2010-02-18  9:58       ` Balbir Singh
2010-02-23 14:04         ` Wu Fengguang
2010-02-21  3:09       ` Wu Fengguang
2010-02-08 13:04 ` Balbir Singh
