From: Ingo Molnar <mingo@elte.hu>
To: Huang Ying <ying.huang@intel.com>,
	Borislav Petkov <borislav.petkov@amd.com>,
	Frédéric Weisbecker <fweisbec@gmail.com>,
	Li Zefan <lizf@cn.fujitsu.com>,
	Steven Rostedt <rostedt@goodmis.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>, Andi Kleen <ak@linux.intel.com>,
	Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer
Date: Fri, 18 Sep 2009 13:09:53 +0200
Message-ID: <20090918110953.GA9930@elte.hu>
In-Reply-To: <1253269241.15717.525.camel@yhuang-dev.sh.intel.com>


* Huang Ying <ying.huang@intel.com> wrote:

> Current MCE log ring buffer has following bugs and issues:
> 
> - On larger systems the 32 size buffer easily overflow, losing events.
> 
> - We had some reports of events getting corrupted which were also
>   blamed on the ring buffer.
> 
> - There's a known livelock, now hit by more people, under high error
>   rate.
> 
> We fix these bugs and issues via making MCE log ring buffer as 
> lock-less per-CPU ring buffer.

I like the direction of this (the current MCE ring-buffer code is a bad 
local hack that should never have been merged upstream in that form) - 
but I'd like to see a MUCH more ambitious (and much more useful!) 
approach instead of using an explicit ring-buffer.

Please define MCE generic tracepoints using TRACE_EVENT() and use 
perfcounters to access them.

This approach solves all the problems you listed and it also adds a 
large number of new features to MCE events:

 - Multiple user-space agents can access MCE events. You can have an
   mcelog daemon running but also a system-wide tracer capturing
   important events in flight-recorder mode.

 - Sampling support: the kernel and the user-space call-chain of MCE
   events can be stored and analyzed as well. This way actual patterns 
   of bad behavior can be matched to precisely what kind of activity 
   happened in the kernel (and/or in the app) around that moment in 
   time.

 - Coupling with other hardware and software events: the PMU can track a 
   number of other anomalies - monitoring software might choose to 
   monitor those plus the MCE events as well - in one coherent stream of 
   events.

 - Discovery of MCE sources - tracepoints are enumerated and tools can 
   act upon the existence (or non-existence) of various channels of MCE 
   information.

 - Filtering support: you just subscribe to and act upon the events you 
   are interested in. Then even on a per event source basis there are 
   in-kernel filter expressions available that can restrict the amount
   of data that hits the event channel.

 - Arbitrarily deep per-CPU buffering of events - you can buffer 32 
   entries or you can buffer as much as you want, as long as you have 
   the RAM.

 - An NMI-safe ring-buffer implementation - mappable to user-space.

 - Built-in support for timestamping of events, PID markers, CPU 
   markers, etc.

 - A rich ABI accessible over system call interface. Per cpu, per task 
   and per workload monitoring of MCE events can be done this way. The 
   ABI itself has a nice, meaningful structure.

 - Extensible ABI: new fields can be added without breaking tooling.
   New tracepoints can be added as the hardware side evolves. There are 
   various parsers that can be used.

 - Lots of scheduling/buffering/batching modes of operation for MCE
   events. poll() support. mmap() support. read() support. You name it.

 - Rich tooling support: even without any MCE specific extensions added
   the 'perf' tool today offers various views of MCE data: perf report,
   perf stat, perf trace can all be used to view logged MCE events and
   perhaps correlate them to certain user-space usage patterns. But it
   can be used directly as well, for user-space agents and policy action
   in mcelog, etc.

 - Significant code reduction and cleanup in the MCE code: the whole 
   mcelog facility can be dropped in essence.

 - (these are the top of the list - there are more advantages as well.)
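
The buffering and filtering points above can be illustrated with a
minimal user-space sketch. This is not the perfcounters ring buffer and
not kernel code: it is single-threaded, has no memory barriers, and is
not NMI-safe. Every name in it (struct mce_event, struct cpu_buffer,
cpu_buffer_push(), only_bank4(), and the record fields) is hypothetical
and chosen only for illustration. It shows two ideas: the depth is
picked by the consumer rather than fixed at 32, and a filter runs before
an event ever reaches the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical event record; the field names are illustrative only. */
struct mce_event {
	uint64_t timestamp;
	uint32_t cpu;
	uint32_t bank;
	uint64_t status;
};

/* A filter returns true for events the consumer wants to keep. */
typedef bool (*event_filter)(const struct mce_event *ev);

/* One buffer per CPU; single-threaded model, no locking or barriers. */
struct cpu_buffer {
	struct mce_event *slots;
	size_t depth;        /* chosen by the consumer, not hard-coded */
	size_t head;         /* total number of events stored so far */
	size_t tail;         /* total number of events consumed so far */
	size_t lost;         /* events dropped because the buffer was full */
	event_filter filter; /* may be NULL: accept everything */
};

/* Allocate 'depth' slots. Returns 0 on success, -1 on allocation failure. */
int cpu_buffer_init(struct cpu_buffer *b, size_t depth, event_filter f)
{
	assert(depth > 0);
	b->slots = calloc(depth, sizeof(*b->slots));
	if (!b->slots)
		return -1;
	b->depth = depth;
	b->head = b->tail = b->lost = 0;
	b->filter = f;
	return 0;
}

void cpu_buffer_free(struct cpu_buffer *b)
{
	free(b->slots);
	b->slots = NULL;
}

/* Store one event. Returns true if it was stored, false if it was
 * filtered out or dropped; only full-buffer drops are counted as lost. */
bool cpu_buffer_push(struct cpu_buffer *b, const struct mce_event *ev)
{
	if (b->filter && !b->filter(ev))
		return false;
	if (b->head - b->tail == b->depth) {
		b->lost++;
		return false;
	}
	b->slots[b->head % b->depth] = *ev;
	b->head++;
	return true;
}

/* Consume the oldest event. Returns false if the buffer is empty. */
bool cpu_buffer_pop(struct cpu_buffer *b, struct mce_event *out)
{
	if (b->head == b->tail)
		return false;
	*out = b->slots[b->tail % b->depth];
	b->tail++;
	return true;
}

/* Example filter: keep only events reported by bank 4. */
bool only_bank4(const struct mce_event *ev)
{
	return ev->bank == 4;
}
```

A consumer would call cpu_buffer_init() once per CPU with whatever depth
it can afford, push events from the reporting path, and drain them with
cpu_buffer_pop(). The 'lost' counter makes overflow visible instead of
silent.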

Such a design would basically propel the MCE code into the twenty-first 
century. Once we have these facilities we can phase out /dev/mcelog for 
good. It would turn Linux MCE events from a quirky hack that doesn't even 
work after years of hacking into a modern, extensible event logging 
facility that uses event sources and flexible transports to user-space.

It would actually be code that is not a problem child like today but one 
that we can take pride in and which is fun to work on :-)

Now, an approach like this shouldn't just be a blind export of mce_log() 
into a single artificial generic event [which is a pretty poor API to 
begin with] - it should be the definition of meaningful 
tracepoints/events that describe the hardware's structure.

I'd rather have a good enumeration of various sources of MCEs as 
separate tracepoints than some badly jumbled mess of all MCE sources in 
one inflexible ABI as /dev/mcelog does it today.
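
As a rough user-space illustration of enumerated sources (not the
tracepoint infrastructure itself), the sketch below keeps a table of
named event sources that a tool can probe by name and subscribe to
individually. The source names "mce_memory", "mce_cache" and "mce_bus",
the enum, and both functions are invented for this example; the real
set of sources would have to follow the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical MCE sources, one per would-be tracepoint. */
enum mce_source {
	MCE_SRC_MEMORY,
	MCE_SRC_CACHE,
	MCE_SRC_BUS,
	MCE_SRC_COUNT
};

/* Names a tool could enumerate; invented for illustration. */
static const char *const mce_source_names[MCE_SRC_COUNT] = {
	[MCE_SRC_MEMORY] = "mce_memory",
	[MCE_SRC_CACHE]  = "mce_cache",
	[MCE_SRC_BUS]    = "mce_bus",
};

/* Return the source id for 'name', or -1 if no such source exists. */
int mce_source_lookup(const char *name)
{
	assert(name != NULL);
	for (int i = 0; i < MCE_SRC_COUNT; i++)
		if (strcmp(mce_source_names[i], name) == 0)
			return i;
	return -1;
}

/* Add 'name' to a subscription bitmask; unknown names leave it unchanged. */
unsigned int mce_subscribe(unsigned int mask, const char *name)
{
	int id = mce_source_lookup(name);

	return id < 0 ? mask : mask | (1u << id);
}
```

A tool can thus react to the absence of a source (lookup returns -1)
rather than parsing one catch-all record and guessing what it contains.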

Note, if you need any perfcounter infrastructure extensions/help for 
this then we'll be glad to provide that. I'm sure there are a few things 
to enhance and a few things to fix - there always are with any 
non-trivial new user :-) But heck, I would take _those_ forward-looking 
problems over any of the current MCE design mess, any day of the week.

Thanks,

	Ingo


Thread overview: 23+ messages
2009-09-18 10:20 [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer Huang Ying
2009-09-18 11:09 ` Ingo Molnar [this message]
2009-09-21  5:37   ` Huang Ying
2009-09-22 13:39     ` [PATCH] x86: mce: New MCE logging design Ingo Molnar
2009-10-05  6:23 ` [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer Hidetoshi Seto
2009-10-05  6:33   ` [PATCH 01/10] x86, mce: remove tsc handling from mce_read Hidetoshi Seto
2009-10-05  6:34   ` [PATCH 02/10] x86, mce: mce_read can check args without mutex Hidetoshi Seto
2009-10-05  6:35   ` [PATCH 03/10] x86, mce: change writer timeout in mce_read Hidetoshi Seto
2009-10-05  6:36   ` [PATCH 04/10] x86, mce: use do-while in mce_log Hidetoshi Seto
2009-10-05  6:37   ` [PATCH 05/10] x86, mce: make mce_log buffer to per-CPU, prep Hidetoshi Seto
2009-10-05  6:38   ` [PATCH 06/10] x86, mce: make mce_log buffer to per-CPU Hidetoshi Seto
2009-10-05  7:06     ` Andi Kleen
2009-10-05  7:50       ` Hidetoshi Seto
2009-10-09  1:45         ` Huang Ying
2009-10-09  5:34           ` Hidetoshi Seto
2009-10-05  6:40   ` [PATCH 07/10] x86, mce: remove for-loop in mce_log Hidetoshi Seto
2009-10-05  6:41   ` [PATCH 08/10] x86, mce: change barriers " Hidetoshi Seto
2009-10-05  6:42   ` [PATCH 09/10] x86, mce: make mce_log buffer to ring buffer Hidetoshi Seto
2009-10-05  6:44   ` [PATCH 10/10] x86, mce: move mce_log_init() into mce_cap_init() Hidetoshi Seto
2009-10-05  7:07   ` [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ring buffer Hidetoshi Seto
2009-10-05  8:51   ` Frédéric Weisbecker
2009-10-05 15:16     ` Andi Kleen
2009-10-06  5:46     ` Hidetoshi Seto
