From: Ingo Molnar <mingo@elte.hu>
To: Ying Han <yinghan@google.com>
Cc: "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>,
"KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>,
"Minchan Kim" <minchan.kim@gmail.com>,
"Daisuke Nishimura" <nishimura@mxp.nes.nec.co.jp>,
"Balbir Singh" <balbir@linux.vnet.ibm.com>,
"Tejun Heo" <tj@kernel.org>, "Pavel Emelyanov" <xemul@openvz.org>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Li Zefan" <lizf@cn.fujitsu.com>, "Mel Gorman" <mel@csn.ul.ie>,
"Christoph Lameter" <cl@linux.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Rik van Riel" <riel@redhat.com>,
"Hugh Dickins" <hughd@google.com>,
"Michal Hocko" <mhocko@suse.cz>,
"Dave Hansen" <dave@linux.vnet.ibm.com>,
"Zhu Yanhai" <zhu.yanhai@gmail.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"Peter Zijlstra" <a.p.zijlstra@chello.nl>,
"Frédéric Weisbecker" <fweisbec@gmail.com>,
"Arnaldo Carvalho de Melo" <acme@redhat.com>,
"Tom Zanussi" <tzanussi@gmail.com>
Subject: Re: [PATCH] memcg: add pgfault latency histograms
Date: Sat, 28 May 2011 12:17:45 +0200 [thread overview]
Message-ID: <20110528101745.GA15692@elte.hu> (raw)
In-Reply-To: <BANLkTimSXrqPudRZ=af9N7k+Z=p5V+nxHQ@mail.gmail.com>
* Ying Han <yinghan@google.com> wrote:
> After study a bit on perf, it is not feasible in this casecase. The
> cpu & memory overhead of perf is overwhelming.... Each page fault
> will generate a record in the buffer and how many data we can
> record in the buffer, and how many data will be processed later..
> Most of the data that is recorded by the general perf framework is
> not needed here.
>
>
> On the other hand, the memory consumption is very little in this
> patch. We only need to keep a counter of each bucket and the
> recording can go on as long as the machine is up. As also measured,
> there is no overhead of the data collection :)
>
> So, the perf is not an option for this purpose.
It's not a fundamental limitation in perf though.
The way i always thought perf could be extended to support heavy-duty
profiling such as your patch does would be along the following lines:
Right now perf supports three output methods:
'full detail': per sample records, recorded in the ring-buffer
'filtered full detail': per sample records filtered, recorded in the ring-buffer
'full summary': the count of all samples (simple counter), no recording
What i think would make sense is to introduce a fourth variant, which
is a natural intermediate of the above output methods:
'partial summary': partially summarized samples, record in an
array in the ring-buffer - an extended
multi-dimensional 'count'.
A histogram like yours would be one (small) sub-case of this new
model.
Now, to keep things maximally flexible we really do not want to hard
code histogram summary functions: i.e. we do not want to hardcode
ourselves to 'latency histograms' or 'frequency histograms'.
To achieve that flexibility we could define the histogram function as
a simple extension to filters: filters that evaluate to an integer
value.
For example, if we defined the following tracepoint in
arch/x86/mm/fault.c:
TRACE_EVENT(mm_pagefault,
TP_PROTO(u64 time_start, u64 time_end, unsigned long address, int error_code, unsigned long ip),
TP_ARGS(time_start, time_end, address, error_code, ip),
TP_STRUCT__entry(
__field(u64, time_start)
__field(u64, time_end)
__field(unsigned long, address)
__field(unsigned long, error_code)
__field(unsigned long, ip)
),
TP_fast_assign(
__entry->time_start = time_start;
__entry->time_end = time_end;
__entry->address = address;
__entry->error_code = error_code;
__entry->ip = ip;
),
TP_printk("time_start=%uL time_end=%uL address=%lx, error code=%lx, ip=%lx",
__entry->time_start, __entry->time_end,
__entry->address, __entry->error_code, __entry->ip)
Then the following filter expressions could be used to calculate the
histogram index and value:
index: "(time_end - time_start)/1000"
iterator: "curr + 1"
The /1000 index filter expression means that there is one separate
bucket per microsecond of delay.
The "curr + 1" iterator filter expression would represent that for
every bucket an event means we add +1 to the current bucket value.
Today our filter expressions evaluate to a small subset of integer
numbers: 0 or 1 :-)
Extending them to integer calculations is possible and would be
desirable for other purposes as well, not just histograms. Adding
integer operators in addition to the logical and bitwise operators
the filter engine supports today would be useful as well. (See
kernel/trace/trace_events_filter.c for the current filter engine.)
This way we would have the equivalent functionality and performance
of your histogram patch - and it would also open up many, *many*
other nice possibilities as well:
- this could be used with any event, anywhere - could even be used
with hardware events. We could sample with an NMI every 100 usecs
and profile with relatively small profiling overhead.
- arbitrarily large histograms could be created: need a 10 GB
histogram on a really large system? No problem, create such
a big ring-buffer.
- many different types of summaries are possible as well:
- we could create a histogram over *which* code pagefaults, via
using the "ip" (faulting instruction) address and a
sufficiently large ring-buffer.
- histogram over the address space (which vmas are the hottest ones),
by changing the first filter to "address/1000000" to have per
megabyte buckets.
- weighted histograms: for example if the histogram iteration
function is "curr + (time_end-time_start)/1000" and the
histogram index is "address/1000000", then we get an
address-indexed histogram weighted by length of latency: the
higher latencies a given area of memory causes, the hotter the
bucket.
- the existing event filter code can be used to filter the incoming
events to begin with: for example an "error_code = 1" filter would
limit the histogram to write faults (page dirtying).
So instead of adding just one hardcoded histogram type, it would be
really nice to work on a more generic solution!
Thanks,
Ingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2011-05-28 10:18 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-05-26 21:07 [PATCH] memcg: add pgfault latency histograms Ying Han
2011-05-27 0:05 ` KAMEZAWA Hiroyuki
2011-05-27 0:23 ` Ying Han
2011-05-27 0:31 ` KAMEZAWA Hiroyuki
2011-05-27 1:40 ` Ying Han
2011-05-27 2:11 ` KAMEZAWA Hiroyuki
2011-05-27 4:45 ` Ying Han
2011-05-27 5:41 ` Ying Han
2011-05-27 8:33 ` KAMEZAWA Hiroyuki
2011-05-27 18:46 ` Ying Han
2011-05-28 10:17 ` Ingo Molnar [this message]
2011-05-31 16:51 ` Ying Han
2011-05-27 8:04 ` Balbir Singh
2011-05-27 16:27 ` Ying Han
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110528101745.GA15692@elte.hu \
--to=mingo@elte.hu \
--cc=a.p.zijlstra@chello.nl \
--cc=acme@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=balbir@linux.vnet.ibm.com \
--cc=cl@linux.com \
--cc=dave@linux.vnet.ibm.com \
--cc=fweisbec@gmail.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-mm@kvack.org \
--cc=lizf@cn.fujitsu.com \
--cc=mel@csn.ul.ie \
--cc=mhocko@suse.cz \
--cc=minchan.kim@gmail.com \
--cc=nishimura@mxp.nes.nec.co.jp \
--cc=riel@redhat.com \
--cc=tj@kernel.org \
--cc=tzanussi@gmail.com \
--cc=xemul@openvz.org \
--cc=yinghan@google.com \
--cc=zhu.yanhai@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).