Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>, Chris Frost <frost@cs.ucla.edu>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Keiichi KII <k-keiichi@bx.jp.nec.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jason Baron <jbaron@redhat.com>,
	Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"lwoodman@redhat.com" <lwoodman@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Tom Zanussi <tzanussi@gmail.com>,
	"riel@redhat.com" <riel@redhat.com>,
	Munehiro Ikeda <m-ikeda@ds.jp.nec.com>,
	Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
Date: Sun, 21 Feb 2010 10:28:31 +0800	[thread overview]
Message-ID: <20100221022831.GB6448@localhost> (raw)
In-Reply-To: <20100213132952.GG11364@balbir.in.ibm.com>

Hi Balbir,

> > tracing: pagecache object collections
> >
> > This dumps
> > - all cached files of a mounted fs  (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> >
> > Usage and Sample output:
> >
> > # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> > # tail /debug/tracing/trace
> >              zsh-2528  [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> >              zsh-2528  [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> >              zsh-2528  [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> >              zsh-2528  [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> >              zsh-2528  [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> >              zsh-2528  [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> >              zsh-2528  [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> >              zsh-2528  [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> >              zsh-2528  [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> >              zsh-2528  [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> >
> > Here "age" is either age from inode create time, or from last dirty time.
> >
> 
> It would be nice to see mapped/unmapped information as well.

As you noticed, we have mapcount for individual pages :)

> > +static int pages_similiar(struct page* page0, struct page* page)
> > +{
> > +     if (page_count(page0) != page_count(page))
> > +             return 0;
> > +
> > +     if (page_mapcount(page0) != page_mapcount(page))
> > +             return 0;
> > +
> > +     if (page_flags(page0) != page_flags(page))
> > +             return 0;
> > +
> > +     return 1;
> > +}
> > +
> 
> OK, so pages_similar() is used to identify a range of pages in the
> cache?

Right. Many files are accessed sequentially or clustered, so
pages_similar() can save lots of output lines :)

> > +#define BATCH_LINES  100
> > +static void dump_pagecache(struct address_space *mapping)
> > +{
> > +     int i;
> > +     int lines = 0;
> > +     pgoff_t len = 0;
> > +     struct pagevec pvec;
> > +     struct page *page;
> > +     struct page *page0 = NULL;
> > +     unsigned long start = 0;
> > +
> > +     for (;;) {
> > +             pagevec_init(&pvec, 0);
> > +             pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> > +                             (void **)pvec.pages, start + len, PAGEVEC_SIZE);
> 
> Is radix_tree_gang_lookup synchronized somewhere? Don't we need to
> call it under RCU or a lock (mapping) ?

No. This function is inherently non-atomic, and it seems that most in-kernel
users do not bother to take rcu_read_lock(). So lets leave it as is?

> > +static ssize_t
> > +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> > +                   loff_t *ppos)
> > +{
> > +     struct file *file = NULL;
> > +     char *name;
> > +     int err = 0;
> > +
> 
> Can't we use the trace_parser here?

Seems not necessary? It's merely one file name, which could contain spaces.

> > +     if (count <= 1)
> > +             return -EINVAL;
> > +     if (count > PATH_MAX + 1)
> > +             return -ENAMETOOLONG;
> > +
> > +     name = kmalloc(count+1, GFP_KERNEL);
> > +     if (!name)
> > +             return -ENOMEM;
> > +
> > +     if (copy_from_user(name, ubuf, count)) {
> > +             err = -EFAULT;
> > +             goto out;
> > +     }
> > +
> > +     /* strip the newline added by `echo` */
> > +     if (name[count-1] != '\n')
> > +             return -EINVAL;
> 
> Doesn't sound correct, what happens if we use echo -n?

It's a bit sad. If we accept both "echo" and "echo -n" with some
smart logic to test for trailing '\n', then it will go wrong for a
'\n'-terminated file name.

Or shall we support only "echo -n"?  I can do with either one.

> > --- linux-mm.orig/fs/inode.c  2010-02-08 23:19:12.000000000 +0800
> > +++ linux-mm/fs/inode.c       2010-02-08 23:19:22.000000000 +0800
> > @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> >       inode->i_bdev = NULL;
> >       inode->i_cdev = NULL;
> >       inode->i_rdev = 0;
> > -     inode->dirtied_when = 0;
> > +     inode->dirtied_when = jiffies;
> >
> 
> Hmmm... Is the inode really dirtied when initialized? I know the
> change is for tracing, but the code when read is confusing.

Huh. Not really dirtied (for that you need to check I_DIRTY), but
dirtied_when is only used in writeback code when I_DIRTY is set.

So I overload dirtied_when in the clean case to indicate the inode
load time. This is a useful trick for fastboot to collect cache
footprint shortly after boot, when most inodes are clean.

It does ask for a comment:

        /*
         * This records inode load time. It will be invalidated once inode is
         * dirtied, or jiffies wraps around. Despite the pitfalls it still
         * provides useful information for some use cases like fastboot.
         */
        inode->dirtied_when = jiffies;


Thanks,
Fengguang

WARNING: multiple messages have this Message-ID (diff)

From: Wu Fengguang <fengguang.wu@intel.com>
To: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>, Chris Frost <frost@cs.ucla.edu>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Keiichi KII <k-keiichi@bx.jp.nec.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jason Baron <jbaron@redhat.com>,
	Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"lwoodman@redhat.com" <lwoodman@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Tom Zanussi <tzanussi@gmail.com>,
	"riel@redhat.com" <riel@redhat.com>,
	Munehiro Ikeda <m-ikeda@ds.jp.nec.com>,
	Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
Subject: Re: [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal
Date: Sun, 21 Feb 2010 10:28:31 +0800	[thread overview]
Message-ID: <20100221022831.GB6448@localhost> (raw)
In-Reply-To: <20100213132952.GG11364@balbir.in.ibm.com>

Hi Balbir,

> > tracing: pagecache object collections
> >
> > This dumps
> > - all cached files of a mounted fs  (the inode-cache)
> > - all cached pages of a cached file (the page-cache)
> >
> > Usage and Sample output:
> >
> > # echo /dev > /debug/tracing/objects/mm/pages/walk-fs
> > # tail /debug/tracing/trace
> >              zsh-2528  [000] 10429.172470: dump_inode: ino=889 size=0 cached=0 age=442 dirty=0 dev=0:18 file=/dev/console
> >              zsh-2528  [000] 10429.172472: dump_inode: ino=888 size=0 cached=0 age=442 dirty=7 dev=0:18 file=/dev/null
> >              zsh-2528  [000] 10429.172474: dump_inode: ino=887 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/shm
> >              zsh-2528  [000] 10429.172477: dump_inode: ino=886 size=40 cached=0 age=442 dirty=0 dev=0:18 file=/dev/pts
> >              zsh-2528  [000] 10429.172479: dump_inode: ino=885 size=11 cached=0 age=442 dirty=0 dev=0:18 file=/dev/core
> >              zsh-2528  [000] 10429.172481: dump_inode: ino=884 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stderr
> >              zsh-2528  [000] 10429.172483: dump_inode: ino=883 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdout
> >              zsh-2528  [000] 10429.172486: dump_inode: ino=882 size=15 cached=0 age=442 dirty=0 dev=0:18 file=/dev/stdin
> >              zsh-2528  [000] 10429.172488: dump_inode: ino=881 size=13 cached=0 age=442 dirty=0 dev=0:18 file=/dev/fd
> >              zsh-2528  [000] 10429.172491: dump_inode: ino=872 size=13360 cached=0 age=442 dirty=0 dev=0:18 file=/dev
> >
> > Here "age" is either age from inode create time, or from last dirty time.
> >
> 
> It would be nice to see mapped/unmapped information as well.

As you noticed, we have mapcount for individual pages :)

> > +static int pages_similiar(struct page* page0, struct page* page)
> > +{
> > +     if (page_count(page0) != page_count(page))
> > +             return 0;
> > +
> > +     if (page_mapcount(page0) != page_mapcount(page))
> > +             return 0;
> > +
> > +     if (page_flags(page0) != page_flags(page))
> > +             return 0;
> > +
> > +     return 1;
> > +}
> > +
> 
> OK, so pages_similar() is used to identify a range of pages in the
> cache?

Right. Many files are accessed sequentially or clustered, so
pages_similar() can save lots of output lines :)

> > +#define BATCH_LINES  100
> > +static void dump_pagecache(struct address_space *mapping)
> > +{
> > +     int i;
> > +     int lines = 0;
> > +     pgoff_t len = 0;
> > +     struct pagevec pvec;
> > +     struct page *page;
> > +     struct page *page0 = NULL;
> > +     unsigned long start = 0;
> > +
> > +     for (;;) {
> > +             pagevec_init(&pvec, 0);
> > +             pvec.nr = radix_tree_gang_lookup(&mapping->page_tree,
> > +                             (void **)pvec.pages, start + len, PAGEVEC_SIZE);
> 
> Is radix_tree_gang_lookup synchronized somewhere? Don't we need to
> call it under RCU or a lock (mapping) ?

No. This function is inherently non-atomic, and it seems that most in-kernel
users do not bother to take rcu_read_lock(). So lets leave it as is?

> > +static ssize_t
> > +trace_pagecache_write(struct file *filp, const char __user *ubuf, size_t count,
> > +                   loff_t *ppos)
> > +{
> > +     struct file *file = NULL;
> > +     char *name;
> > +     int err = 0;
> > +
> 
> Can't we use the trace_parser here?

Seems not necessary? It's merely one file name, which could contain spaces.

> > +     if (count <= 1)
> > +             return -EINVAL;
> > +     if (count > PATH_MAX + 1)
> > +             return -ENAMETOOLONG;
> > +
> > +     name = kmalloc(count+1, GFP_KERNEL);
> > +     if (!name)
> > +             return -ENOMEM;
> > +
> > +     if (copy_from_user(name, ubuf, count)) {
> > +             err = -EFAULT;
> > +             goto out;
> > +     }
> > +
> > +     /* strip the newline added by `echo` */
> > +     if (name[count-1] != '\n')
> > +             return -EINVAL;
> 
> Doesn't sound correct, what happens if we use echo -n?

It's a bit sad. If we accept both "echo" and "echo -n" with some
smart logic to test for trailing '\n', then it will go wrong for a
'\n'-terminated file name.

Or shall we support only "echo -n"?  I can do with either one.

> > --- linux-mm.orig/fs/inode.c  2010-02-08 23:19:12.000000000 +0800
> > +++ linux-mm/fs/inode.c       2010-02-08 23:19:22.000000000 +0800
> > @@ -149,7 +149,7 @@ struct inode *inode_init_always(struct s
> >       inode->i_bdev = NULL;
> >       inode->i_cdev = NULL;
> >       inode->i_rdev = 0;
> > -     inode->dirtied_when = 0;
> > +     inode->dirtied_when = jiffies;
> >
> 
> Hmmm... Is the inode really dirtied when initialized? I know the
> change is for tracing, but the code when read is confusing.

Huh. Not really dirtied (for that you need to check I_DIRTY), but
dirtied_when is only used in writeback code when I_DIRTY is set.

So I overload dirtied_when in the clean case to indicate the inode
load time. This is a useful trick for fastboot to collect cache
footprint shortly after boot, when most inodes are clean.

It does ask for a comment:

        /*
         * This records inode load time. It will be invalidated once inode is
         * dirtied, or jiffies wraps around. Despite the pitfalls it still
         * provides useful information for some use cases like fastboot.
         */
        inode->dirtied_when = jiffies;


Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2010-02-21  2:28 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-02-05  2:17 [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Keiichi KII
2010-02-05  2:17 ` Keiichi KII
2010-02-05  2:24 ` [RFC PATCH -tip 1/2 v3] tracepoints: add tracepoints for pagecache Keiichi KII
2010-02-05  2:24   ` Keiichi KII
2010-02-05  2:25 ` [RFC PATCH -tip 2/2 v3] add scripts for pagecache analysis per process Keiichi KII
2010-02-05  2:25   ` Keiichi KII
2010-02-05  7:28 ` [RFC PATCH -tip 0/2 v3] pagecache tracepoints proposal Ingo Molnar
2010-02-05  7:28   ` Ingo Molnar
2010-02-05 21:19   ` Keiichi KII
2010-02-05 21:19     ` Keiichi KII
2010-02-08 15:54   ` Wu Fengguang
2010-02-08 15:54     ` Wu Fengguang
2010-02-09 16:21     ` Wu Fengguang
2010-02-09 16:21       ` Wu Fengguang
2010-02-13 13:29       ` Balbir Singh
2010-02-13 13:29         ` Balbir Singh
2010-02-14 10:52         ` Balbir Singh
2010-02-14 10:52           ` Balbir Singh
2010-02-21  2:28         ` Wu Fengguang [this message]
2010-02-21  2:28           ` Wu Fengguang
2010-02-16  3:22       ` KOSAKI Motohiro
2010-02-16  3:22         ` KOSAKI Motohiro
2010-02-17 22:38         ` Keiichi KII
2010-02-17 22:38           ` Keiichi KII
2010-02-18  5:34     ` KAMEZAWA Hiroyuki
2010-02-18  5:34       ` KAMEZAWA Hiroyuki
2010-02-18  9:58       ` Balbir Singh
2010-02-18  9:58         ` Balbir Singh
2010-02-23 14:04         ` Wu Fengguang
2010-02-23 14:04           ` Wu Fengguang
2010-02-21  3:09       ` Wu Fengguang
2010-02-21  3:09         ` Wu Fengguang
2010-02-08 13:04 ` Balbir Singh
2010-02-08 13:04   ` Balbir Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100221022831.GB6448@localhost \
    --to=fengguang.wu@intel.com \
    --cc=a-tsuji@bk.jp.nec.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=balbir@linux.vnet.ibm.com \
    --cc=frost@cs.ucla.edu \
    --cc=fweisbec@gmail.com \
    --cc=jbaron@redhat.com \
    --cc=k-keiichi@bx.jp.nec.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lwoodman@redhat.com \
    --cc=m-ikeda@ds.jp.nec.com \
    --cc=mingo@elte.hu \
    --cc=mitake@dcl.info.waseda.ac.jp \
    --cc=riel@redhat.com \
    --cc=rostedt@goodmis.org \
    --cc=tzanussi@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.