From: Ingo Molnar <mingo@elte.hu>
To: Huang Ying <ying.huang@intel.com>,
Borislav Petkov <borislav.petkov@amd.com>,
Frédéric Weisbecker <fweisbec@gmail.com>,
Li Zefan <lizf@cn.fujitsu.com>,
Steven Rostedt <rostedt@goodmis.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>, Andi Kleen <ak@linux.intel.com>,
Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer
Date: Fri, 18 Sep 2009 13:09:53 +0200
Message-ID: <20090918110953.GA9930@elte.hu>
In-Reply-To: <1253269241.15717.525.camel@yhuang-dev.sh.intel.com>
* Huang Ying <ying.huang@intel.com> wrote:
> Current MCE log ring buffer has following bugs and issues:
>
> - On larger systems the 32 size buffer easily overflow, losing events.
>
> - We had some reports of events getting corrupted which were also
> blamed on the ring buffer.
>
> - There's a known livelock, now hit by more people, under high error
> rate.
>
> We fix these bugs and issues via making MCE log ring buffer as
> lock-less per-CPU ring buffer.
I like the direction of this (the current MCE ring-buffer code is a bad
local hack that should never have been merged upstream in that form) -
but i'd like to see a MUCH more ambitious (and much more useful!)
approach instead of using an explicit ring-buffer.
Please define MCE generic tracepoints using TRACE_EVENT() and use
perfcounters to access them.
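To make the idea concrete, here is a minimal user-space model of what a tracepoint call site boils down to. It is only a sketch: it is not the kernel's TRACE_EVENT() machinery, and the names mce_event, tp_mce_record, trace_mce_record() and count_probe() as well as the record fields are assumptions made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record layout; the field names are illustrative only. */
struct mce_event {
    uint32_t cpu;      /* CPU that observed the error */
    uint32_t bank;     /* machine-check bank number */
    uint64_t status;   /* raw bank status value */
    uint64_t addr;     /* faulting address, if any */
};

typedef void (*mce_probe_fn)(const struct mce_event *ev, void *data);

/* One tracepoint: a named hook that does nothing until a probe attaches. */
struct tracepoint {
    const char  *name;
    mce_probe_fn probe;
    void        *data;
};

static struct tracepoint tp_mce_record = { "mce_record", NULL, NULL };

static int tracepoint_attach(struct tracepoint *tp, mce_probe_fn fn, void *data)
{
    if (tp->probe)
        return -1;     /* this simplified model allows a single probe */
    tp->data = data;
    tp->probe = fn;
    return 0;
}

static void tracepoint_detach(struct tracepoint *tp)
{
    tp->probe = NULL;
    tp->data = NULL;
}

/* Call site: costs one branch when nothing is attached. */
static inline void trace_mce_record(uint32_t cpu, uint32_t bank,
                                    uint64_t status, uint64_t addr)
{
    if (tp_mce_record.probe) {
        struct mce_event ev = { cpu, bank, status, addr };
        tp_mce_record.probe(&ev, tp_mce_record.data);
    }
}

/* Example probe: only counts events, like a counting consumer would. */
static void count_probe(const struct mce_event *ev, void *data)
{
    (void)ev;
    ++*(unsigned long *)data;
}
```

A consumer would call tracepoint_attach(&tp_mce_record, count_probe, &counter) and from then on every trace_mce_record() call reaches it with a typed record. The real facility supports multiple consumers per tracepoint; the single-probe limit here is only to keep the model short.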
This approach solves all the problems you listed and it also adds a
large number of new features to MCE events:
- Multiple user-space agents can access MCE events. You can have an
mcelog daemon running but also a system-wide tracer capturing
important events in flight-recorder mode.
- Sampling support: the kernel and the user-space call-chain of MCE
events can be stored and analyzed as well. This way actual patterns
of bad behavior can be matched to precisely what kind of activity
happened in the kernel (and/or in the app) around that moment in
time.
- Coupling with other hardware and software events: the PMU can track a
number of other anomalies - monitoring software might choose to
monitor those plus the MCE events as well - in one coherent stream of
events.
- Discovery of MCE sources - tracepoints are enumerated and tools can
act upon the existence (or non-existence) of various channels of MCE
information.
- Filtering support: you just subscribe to and act upon the events you
are interested in. Then even on a per event source basis there's
in-kernel filter expressions available that can restrict the amount
of data that hits the event channel.
- Arbitrarily deep per-CPU buffering of events - you can buffer 32
entries or you can buffer as much as you want, as long as you have
the RAM.
- An NMI-safe ring-buffer implementation - mappable to user-space.
- Built-in support for timestamping of events, PID markers, CPU
markers, etc.
- A rich ABI accessible over a system call interface. Per cpu, per task
and per workload monitoring of MCE events can be done this way. The
ABI itself has a nice, meaningful structure.
- Extensible ABI: new fields can be added without breaking tooling.
New tracepoints can be added as the hardware side evolves. There
are various parsers that can be used.
- Lots of scheduling/buffering/batching modes of operation for MCE
events. poll() support. mmap() support. read() support. You name it.
- Rich tooling support: even without any MCE specific extensions added
the 'perf' tool today offers various views of MCE data: perf report,
perf stat, perf trace can all be used to view logged MCE events and
perhaps correlate them to certain user-space usage patterns. But it
can be used directly as well, for user-space agents and policy action
in mcelog, etc.
- Significant code reduction and cleanup in the MCE code: the whole
mcelog facility can be dropped in essence.
- (these are the top of the list - there are more advantages as well.)
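As an illustration of the buffering point in the list above, here is a small user-space sketch of a per-CPU ring whose depth is picked at setup time and which counts dropped events instead of spinning. It assumes exactly one producer and one consumer per ring and uses C11 atomics; it is not the perfcounters buffer and it is not NMI-safe as written. All names (mce_ring, mce_rec, ring_push(), ...) are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct mce_rec { uint64_t status; uint64_t addr; };

/* One ring per CPU: a single producer (that CPU) and a single consumer. */
struct mce_ring {
    struct mce_rec *slot;
    size_t          size;   /* number of slots, a power of two */
    atomic_size_t   head;   /* next slot to write (producer only) */
    atomic_size_t   tail;   /* next slot to read (consumer only) */
    atomic_ulong    lost;   /* events dropped because the ring was full */
};

/* The depth is chosen by whoever sets the ring up, not fixed at 32. */
static int ring_init(struct mce_ring *r, size_t depth)
{
    if (depth == 0 || (depth & (depth - 1)))
        return -1;                      /* require a power of two */
    r->slot = calloc(depth, sizeof(*r->slot));
    if (!r->slot)
        return -1;
    r->size = depth;
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
    atomic_init(&r->lost, 0);
    return 0;
}

/* Producer: never blocks or spins; a full ring only bumps the lost count. */
static int ring_push(struct mce_ring *r, struct mce_rec rec)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == r->size) {
        atomic_fetch_add_explicit(&r->lost, 1, memory_order_relaxed);
        return -1;
    }
    r->slot[head & (r->size - 1)] = rec;
    /* release: the record becomes visible before the new head does */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer: returns -1 when the ring is empty. */
static int ring_pop(struct mce_ring *r, struct mce_rec *out)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return -1;
    *out = r->slot[tail & (r->size - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}
```

Because the producer never waits for the reader, a slow or absent consumer costs lost events (which are counted) rather than a livelock under a high error rate.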
Such a design would basically propel the MCE code into the twenty-first
century. Once we have these facilities we can phase out /dev/mcelog for
good. It would turn Linux MCE events from a quirky hack that doesn't even
work after years of hacking into a modern, extensible event logging
facility that uses event sources and flexible transports to user-space.
It would actually be code that is not a problem child like today but one
that we can take pride in and which is fun to work on :-)
Now, an approach like this shouldn't just be a blind export of mce_log()
into a single artificial generic event [which is a pretty poor API to
begin with] - it should be the definition of meaningful
tracepoints/events that describe the hardware's structure.
I'd rather have a good enumeration of various sources of MCEs as
separate tracepoints than some badly jumbled mess of all MCE sources in
one inflexible ABI as /dev/mcelog does it today.
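A rough user-space sketch of what "separate, enumerated sources" could look like follows. The source list (MEMORY, CACHE, BUS, THERMAL) is invented purely for illustration and is not a proposal for the actual set of tracepoints; the subscriber structure and the status-mask filter are likewise assumptions, standing in for per-event subscription and in-kernel filter expressions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical event sources; a real list would follow the hardware. */
#define MCE_SOURCES(X) \
    X(MEMORY) \
    X(CACHE)  \
    X(BUS)    \
    X(THERMAL)

enum mce_source {
#define X(n) MCE_SRC_##n,
    MCE_SOURCES(X)
#undef X
    MCE_SRC_COUNT
};

/* Name table generated from the same list, so the two cannot drift apart. */
static const char *const mce_source_name[MCE_SRC_COUNT] = {
#define X(n) [MCE_SRC_##n] = #n,
    MCE_SOURCES(X)
#undef X
};

/* Discovery: a tool looks a source up by name and learns whether it exists. */
static int mce_source_lookup(const char *name)
{
    for (int i = 0; i < MCE_SRC_COUNT; i++)
        if (strcmp(mce_source_name[i], name) == 0)
            return i;
    return -1;
}

/* A subscriber picks sources with a bitmask and narrows them with a filter. */
struct subscriber {
    uint32_t      sources;      /* bit i set: wants enum value i */
    uint64_t      status_mask;  /* deliver only if (status & mask) != 0 */
    unsigned long delivered;
};

static void mce_emit(struct subscriber *subs, int nsubs,
                     enum mce_source src, uint64_t status)
{
    for (int i = 0; i < nsubs; i++) {
        if (!(subs[i].sources & (1u << src)))
            continue;           /* not subscribed to this source */
        if (!(status & subs[i].status_mask))
            continue;           /* rejected by the filter */
        subs[i].delivered++;
    }
}
```

With per-source events, a logging daemon can subscribe to everything while a monitoring agent only takes, say, thermal events with a particular status bit set, and neither has to parse a single catch-all record format.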
Note, if you need any perfcounter infrastructure extensions/help for
this then we'll be glad to provide that. I'm sure there's a few things
to enhance and a few things to fix - there always are with any
non-trivial new user :-) But heck would i take _those_ forward-looking
problems over any of the current MCE design mess, any day of the week.
Thanks,
Ingo
Thread overview: 23+ messages
2009-09-18 10:20 [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer Huang Ying
2009-09-18 11:09 ` Ingo Molnar [this message]
2009-09-21 5:37 ` Huang Ying
2009-09-22 13:39 ` [PATCH] x86: mce: New MCE logging design Ingo Molnar
2009-10-05 6:23 ` [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer Hidetoshi Seto
2009-10-05 6:33 ` [PATCH 01/10] x86, mce: remove tsc handling from mce_read Hidetoshi Seto
2009-10-05 6:34 ` [PATCH 02/10] x86, mce: mce_read can check args without mutex Hidetoshi Seto
2009-10-05 6:35 ` [PATCH 03/10] x86, mce: change writer timeout in mce_read Hidetoshi Seto
2009-10-05 6:36 ` [PATCH 04/10] x86, mce: use do-while in mce_log Hidetoshi Seto
2009-10-05 6:37 ` [PATCH 05/10] x86, mce: make mce_log buffer to per-CPU, prep Hidetoshi Seto
2009-10-05 6:38 ` [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU Hidetoshi Seto
2009-10-05 7:06 ` Andi Kleen
2009-10-05 7:50 ` Hidetoshi Seto
2009-10-09 1:45 ` Huang Ying
2009-10-09 5:34 ` Hidetoshi Seto
2009-10-05 6:40 ` [PATCH 07/10] x86, mce: remove for-loop in mce_log Hidetoshi Seto
2009-10-05 6:41 ` [PATCH 08/10] x86, mce: change barriers " Hidetoshi Seto
2009-10-05 6:42 ` [PATCH 09/10] x86, mce: make mce_log buffer to ring buffer Hidetoshi Seto
2009-10-05 6:44 ` [PATCH 10/10] x86, mce: move mce_log_init() into mce_cap_init() Hidetoshi Seto
2009-10-05 7:07 ` [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer Hidetoshi Seto
2009-10-05 8:51 ` Frédéric Weisbecker
2009-10-05 15:16 ` Andi Kleen
2009-10-06 5:46 ` Hidetoshi Seto