From: Frederic Weisbecker <fweisbec@gmail.com>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>, Li Zefan <lizf@cn.fujitsu.com>,
Tom Zanussi <tzanussi@gmail.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Pekka Enberg <penberg@cs.helsinki.fi>,
Andi Kleen <andi@firstfloor.org>,
Steven Rostedt <rostedt@goodmis.org>,
Larry Woodman <lwoodman@redhat.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>,
Matt Mackall <mpm@selenic.com>,
Alexey Dobriyan <adobriyan@gmail.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)
Date: Tue, 12 May 2009 15:01:12 +0200
Message-ID: <20090512130110.GA6255@nowhere>
In-Reply-To: <20090428133108.GA23560@localhost>

On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> >
> >
> > * Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > > The above 'get object state' interface (which allows passive
> > > > sampling) - integrated into the tracing framework - would serve
> > > > that goal, agreed?
> > >
> > > > Agreed. That could in theory be a good complement to dynamic
> > > > tracing.
> > >
> > > Then what will be the canonical form for all the 'get object
> > > state' interfaces - "object.attr=value", or whatever? [...]
> >
> > Lemme outline what i'm thinking of.
> >
> > I'd call the feature "object collection tracing", which would live
> > in /debug/tracing, accessed via such files:
> >
> > /debug/tracing/objects/mm/pages/
> > /debug/tracing/objects/mm/pages/format
> > /debug/tracing/objects/mm/pages/filter
> > /debug/tracing/objects/mm/pages/trace_pipe
> > /debug/tracing/objects/mm/pages/stats
> > /debug/tracing/objects/mm/pages/events/
> >
> > here's the (proposed) semantics of those files:
> >
> > 1) /debug/tracing/objects/mm/pages/
> >
> > There's a subsystem / object basic directory structure to make it
> > easy and intuitive to find our way around there.
> >
> > 2) /debug/tracing/objects/mm/pages/format
> >
> > the format file:
> >
> > /debug/tracing/objects/mm/pages/format
> >
> > Would reuse the existing dynamic-tracepoint structured-logging
> > descriptor format and code (this is upstream already):
> >
> > [root@phoenix sched_signal_send]# pwd
> > /debug/tracing/events/sched/sched_signal_send
> >
> > [root@phoenix sched_signal_send]# cat format
> > name: sched_signal_send
> > ID: 24
> > format:
> > field:unsigned short common_type; offset:0; size:2;
> > field:unsigned char common_flags; offset:2; size:1;
> > field:unsigned char common_preempt_count; offset:3; size:1;
> > field:int common_pid; offset:4; size:4;
> > field:int common_tgid; offset:8; size:4;
> >
> > field:int sig; offset:12; size:4;
> > field:char comm[TASK_COMM_LEN]; offset:16; size:16;
> > field:pid_t pid; offset:32; size:4;
> >
> > print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
> >
> > These format descriptors enumerate fields, types and sizes, in a
> > structured way that user-space tools can parse easily. (The binary
> > records that come from the trace_pipe file follow this format
> > description.)
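
A minimal user-space sketch of how a tool might read such a descriptor, assuming the "field:<decl>; offset:<n>; size:<n>;" line layout shown in the quoted sched_signal_send example. The names parse_format, FIELD_RE and SAMPLE are hypothetical, and a real tool would read the text from the format file instead of a string.

```python
import re

# Matches one "field:" line of the descriptor, e.g.
#   field:char comm[TASK_COMM_LEN]; offset:16; size:16;
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
)


def parse_format(text):
    """Return a list of (type, name, offset, size) tuples, in file order."""
    fields = []
    for line in text.splitlines():
        m = FIELD_RE.search(line)
        if not m:
            continue  # "name:", "ID:", "format:", "print fmt:" and blank lines
        decl = m.group("decl").strip()
        ctype, _, name = decl.rpartition(" ")
        name = name.split("[", 1)[0]  # drop an array suffix like [TASK_COMM_LEN]
        fields.append((ctype, name, int(m.group("offset")), int(m.group("size"))))
    return fields


# Descriptor text copied from the sched_signal_send example above.
SAMPLE = """\
name: sched_signal_send
ID: 24
format:
field:unsigned short common_type; offset:0; size:2;
field:unsigned char common_flags; offset:2; size:1;
field:unsigned char common_preempt_count; offset:3; size:1;
field:int common_pid; offset:4; size:4;
field:int common_tgid; offset:8; size:4;

field:int sig; offset:12; size:4;
field:char comm[TASK_COMM_LEN]; offset:16; size:16;
field:pid_t pid; offset:32; size:4;

print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
"""

if __name__ == "__main__":
    for ctype, name, offset, size in parse_format(SAMPLE):
        print(f"{name:24s} {ctype:16s} offset={offset:2d} size={size}")
```

The sketch only extracts what the descriptor states (type, name, offset, size); it does not try to interpret the "print fmt" line.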
> >
> > 3) /debug/tracing/objects/mm/pages/filter
> >
> > This is the tracing filter that can be set based on the 'format'
> > descriptor. So with the above (signal-send tracepoint) you can
> > define such filter expressions:
> >
> > echo "(sig == 10 && comm == bash) || sig == 13" > filter
> >
> > To restrict the 'scope' of the object collection along pretty much
> > any key or combination of keys. (Or you can leave it as it is and
> > dump all objects and do keying in user-space.)
> >
> > [ Using in-kernel filtering is obviously faster than streaming it
> > out to user-space - but there might be details and types of
> > visualization you want to do in user-space - so we don't want to
> > restrict things here. ]
> >
> > For the mm object collection tracepoint i could imagine such filter
> > expressions:
> >
> > echo "type == shared && file == /sbin/init" > filter
> >
> > To dump all shared pages that are mapped to /sbin/init.
> >
> > 4) /debug/tracing/objects/mm/pages/trace_pipe
> >
> > The 'trace_pipe' file can be used to dump all objects in the
> > collection, which match the filter ('all objects' by default). The
> > record format is described in 'format'.
> >
> > trace_pipe would be a reuse of the existing trace_pipe code: it is a
> > modern, poll()-able, read()-able, splice()-able pipe abstraction.
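
A rough sketch of how a user-space reader could decode one record laid out as the 'format' file describes. The field table is copied by hand from the sched_signal_send descriptor quoted earlier; the struct codes, the little-endian byte order and the synthetic sample record are assumptions of this example, not something the proposal specifies. decode_record and make_sample_record are hypothetical names.

```python
import struct

# (name, struct code, offset, size), taken from the sched_signal_send
# descriptor. The struct codes and "<" (little-endian) are assumptions.
FIELDS = [
    ("common_type", "H", 0, 2),
    ("common_flags", "B", 2, 1),
    ("common_preempt_count", "B", 3, 1),
    ("common_pid", "i", 4, 4),
    ("common_tgid", "i", 8, 4),
    ("sig", "i", 12, 4),
    ("comm", "16s", 16, 16),
    ("pid", "i", 32, 4),
]


def decode_record(buf):
    """Decode one binary record into a dict keyed by field name."""
    rec = {}
    for name, code, offset, size in FIELDS:
        # Guard against a table entry whose code disagrees with its size.
        assert struct.calcsize("<" + code) == size
        (value,) = struct.unpack_from("<" + code, buf, offset)
        if isinstance(value, bytes):
            # Fixed-size char arrays are NUL-padded; keep the leading string.
            value = value.split(b"\0", 1)[0].decode("ascii", "replace")
        rec[name] = value
    return rec


def make_sample_record():
    """Build a synthetic 36-byte record (illustrative values only)."""
    buf = bytearray(36)
    struct.pack_into("<H", buf, 0, 24)          # common_type
    struct.pack_into("<i", buf, 12, 10)         # sig
    struct.pack_into("<16s", buf, 16, b"bash")  # comm, NUL-padded
    struct.pack_into("<i", buf, 32, 1234)       # pid
    return bytes(buf)


if __name__ == "__main__":
    rec = decode_record(make_sample_record())
    print("sig: %d task %s:%d" % (rec["sig"], rec["comm"], rec["pid"]))
```

A real reader would build the table from the 'format' file at run time rather than hardcoding it, so that it keeps working when the record layout changes.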
> >
> > 5) /debug/tracing/objects/mm/pages/stats
> >
> > The 'stats' file would be a reuse of the existing histogram code of
> > the tracing code. We already make use of it for the branch tracers
> > and for the workqueue tracer - it could be extended to be applicable
> > to object collections as well.
> >
> > The advantage there would be that there's no dumping at all - all
> > the integration is done straight in the kernel. ( The 'filter'
> > condition is honored - increasing flexibility. The filter file
> > could perhaps also act as a default histogram key. )
> >
> > 6) /debug/tracing/objects/mm/pages/events/
> >
> > The 'events' directory offers links back to existing dynamic
> > tracepoints that are under /debug/tracing/events/. This would serve
> > as an additional coherent force that keeps dynamic tracepoints
> > collected by subsystem and by object type as well. (Tools could make
> > use of this information as well - without being aware of actual
> > object semantics.)
> >
> >
> > There would be a number of other object collections we could
> > enumerate:
> >
> > tasks:
> >
> > /debug/tracing/objects/sched/tasks/
> >
> > active inodes known to the kernel:
> >
> > /debug/tracing/objects/fs/inodes/
> >
> > interrupts:
> >
> > /debug/tracing/objects/hw/irqs/
> >
> > etc.
> >
> > These would use the same 'object collection' framework. Once done we
> > can use it for many other things too.
> >
> > Note how organically integrated it all is with the tracing
> > framework. You could start from an 'object view' to get an overview
> > and then go towards a more dynamic view of specific object
> > attributes (or specific objects), as you drill down on a specific
> > problem you want to analyze.
> >
> > How does this all sound to you?
>
> Great! I see a lot of opportunity to adapt the not-yet-submitted
> /proc/filecache interface to the proposed framework.
>
> Its basic form is:
>
> # ino size cached cached% refcnt state age accessed process dev file
> [snip]
> 320 1 4 100 1 D- 50443 1085 udevd 00:11(tmpfs) /.udev/uevent_seqnum
> 460725 123 124 100 35 -- 50444 6795 touch 08:02(sda2) /lib/libpthread-2.9.so
> 460727 31 32 100 14 -- 50444 2007 touch 08:02(sda2) /lib/librt-2.9.so
> 458865 97 80 82 1 -- 50444 49 mount 08:02(sda2) /lib/libdevmapper.so.1.02.1
> 460090 15 16 100 1 -- 50444 48 mount 08:02(sda2) /lib/libuuid.so.1.2
> 458866 46 48 100 1 -- 50444 47 mount 08:02(sda2) /lib/libblkid.so.1.0
> 460732 43 44 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_nis-2.9.so
> 460739 87 88 100 73 -- 50444 3597 rcS 08:02(sda2) /lib/libnsl-2.9.so
> 460726 31 32 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_compat-2.9.so
> 458804 250 252 100 11 -- 50445 8175 rcS 08:02(sda2) /lib/libncurses.so.5.6
> 229540 780 752 96 3 -- 50445 7594 init 08:02(sda2) /bin/bash
> 460735 15 16 100 89 -- 50445 17581 init 08:02(sda2) /lib/libdl-2.9.so
> 460721 1344 1340 99 117 -- 50445 48732 init 08:02(sda2) /lib/libc-2.9.so
> 458801 107 104 97 24 -- 50445 3586 init 08:02(sda2) /lib/libselinux.so.1
> 671870 37 24 65 1 -- 50446 1 swapper 08:02(sda2) /sbin/init
> 175 1 24412 100 1 -- 50446 0 swapper 00:01(rootfs) /dev/root
>
> The patch basically does a traversal through one or more of the inode
> lists to produce the output:
>         inode_in_use
>         inode_unused
>         sb->s_dirty
>         sb->s_io
>         sb->s_more_io
>         sb->s_inodes
>
> The filtering feature is a necessity for this interface; without it,
> a full listing takes considerable time. It supports the following
> filters:
>         { LS_OPT_DIRTY, "dirty" },
>         { LS_OPT_CLEAN, "clean" },
>         { LS_OPT_INUSE, "inuse" },
>         { LS_OPT_EMPTY, "empty" },
>         { LS_OPT_ALL, "all" },
>         { LS_OPT_DEV, "dev=%s" },
> There are two possible challenges for the conversion:
>
> - One trick it does is to select different lists to traverse on
> different filter options. Will this be possible in the object
> tracing framework?

Yeah, I guess.

> - The file name lookup(last field) is the performance killer. Is it
> possible to skip the file name lookup when the filter failed on the
> leading fields?

Object collection relies on trace events, where a filter basically
discards a whole entry when it does not match. I'm not sure we can
easily ignore only one field.
But I guess we can do something about the performance...
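
One possible direction, sketched in user-space Python rather than kernel code: evaluate the cheap leading fields first and resolve the expensive field (the file name) only for entries that survive. This is an illustration of the idea only, not how the trace event filters work today. The names collect, lookup_name and is_dirty are hypothetical, the sample rows are taken from the /proc/filecache listing above, and reading "D" in the state column as "dirty" is an assumption.

```python
def collect(inodes, cheap_filter, lookup_name, name_filter=None):
    """Yield (inode, name) for matching inodes, resolving names lazily."""
    for inode in inodes:
        if not cheap_filter(inode):
            continue  # rejected on cheap fields: the name is never looked up
        name = lookup_name(inode)  # the expensive step
        if name_filter is not None and not name_filter(name):
            continue
        yield inode, name


# Three rows from the listing above (ino and state columns only).
INODES = [
    {"ino": 320, "state": "D-"},
    {"ino": 460725, "state": "--"},
    {"ino": 229540, "state": "--"},
]
NAMES = {
    320: "/.udev/uevent_seqnum",
    460725: "/lib/libpthread-2.9.so",
    229540: "/bin/bash",
}

lookups = []  # records which inodes paid for a name lookup


def lookup_name(inode):
    lookups.append(inode["ino"])
    return NAMES[inode["ino"]]


def is_dirty(inode):
    # Assumption: a "D" in the state column marks a dirty inode.
    return "D" in inode["state"]


if __name__ == "__main__":
    for inode, name in collect(INODES, is_dirty, lookup_name):
        print(inode["ino"], name)
    print("name lookups performed:", len(lookups))
```

With the "dirty" filter, only one of the three sample inodes reaches the name lookup; the other two are dropped on the state field alone.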

Could you send us the (signed-off) patch you made that implements this?
I could try to adapt it to object collection.

Thanks,
Frederic.

> Will the object tracing interface allow such flexibility?
> (Sorry I'm not yet familiar with the tracing framework.)
>
> > Can you see any conceptual holes in the scheme, any use-case that
> > /proc/kpageflags supports but the object collection approach does
> > not?
>
> kpageflags is simply a big (perhaps sparse) binary array.
> I'd still prefer to retain its current form - the kernel patches and
> user space tools are all ready made, and I see no benefits in
> converting to the tracing framework.
>
> > Would you be interested in seeing something like this, if we tried
> > to implement it in the tracing tree? The majority of the code
> > already exists, we just need interest from the MM side and we have
> > to hook it all up. (it is by no means trivial to do - but looks like
> > a very exciting feature.)
>
> Definitely! /proc/filecache has another 'page view':
>
> # head /proc/filecache
> # file /bin/bash
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 RAMU________ 4
> 3 8 RAMU________ 4
> 12 1 RAMU________ 4
> 14 5 RAMU________ 4
> 20 7 RAMU________ 4
> 27 2 RAMU________ 5
> 29 1 RAMU________ 4
>
> Which is also a good candidate. However I still need to investigate
> whether it offers a considerable advantage over the mincore() syscall.
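
To make the comparison concrete: mincore() reports residency as one entry per page, while the page view above lists runs. A small sketch of converting the runs into such a per-page vector, assuming "idx" is the index of the first page of a run and "len" is the run length in pages (the listing does not spell this out). RUNS holds only the seven rows shown by head, and runs_to_residency is a hypothetical helper.

```python
def runs_to_residency(runs, npages):
    """Expand (idx, len) runs into a per-page list of 0/1 residency flags."""
    resident = [0] * npages
    for idx, length in runs:
        for page in range(idx, min(idx + length, npages)):
            resident[page] = 1
    return resident


# (idx, len) pairs from the /bin/bash page view quoted above.
RUNS = [(0, 1), (3, 8), (12, 1), (14, 5), (20, 7), (27, 2), (29, 1)]

if __name__ == "__main__":
    vec = runs_to_residency(RUNS, 30)
    holes = [page for page, flag in enumerate(vec) if not flag]
    print("cached pages:", sum(vec), "of", len(vec))
    print("holes at page indexes:", holes)
```

The run form also carries the state and refcnt columns, which a plain residency vector cannot express.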
>
> Thanks and Regards,
> Fengguang