From: Borislav Petkov <bp@alien8.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>,
	"ananth@in.ibm.com" <ananth@in.ibm.com>,
	"masbock@linux.vnet.ibm.com" <masbock@linux.vnet.ibm.com>,
	"lcm@linux.vnet.ibm.com" <lcm@linux.vnet.ibm.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	Robert Richter <rric@kernel.org>
Subject: Re: [PATCH v2 2/2] mce: acpi/apei: Add a boot option to disable ff mode for corrected errors
Date: Wed, 19 Jun 2013 23:07:06 +0200
Message-ID: <20130619210706.GP28300@pd.tnic>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F2DA8838B@ORSMSX106.amr.corp.intel.com>

On Wed, Jun 19, 2013 at 08:33:49PM +0000, Luck, Tony wrote:
> >> There is (or should be)
> >
> > Ha!
> 
> Oh ye of little faith - I'm sure the BIOS will get this right this time :-)
> 
> 
> > Ok, seriously: so the situation should still be fine, FF reported errors
> > get the CPER format while the rest, the "old" MCE format.
> >
> > cper.c is doing printk so I'm guessing it would need to get its own
> > tracepoint and carry that to userspace.
> 
> Yes - a tracepoint is the right answer here for all the new stuff.
> 
> > Concerning the RAS daemon, Robert and I are making good progress so once
> > we have the persistent events in perf, we can read that tracepoint in
> > userspace and do whatever we want with the error info.
> 
> Mauro has a rasdaemon in progress
>        git://git.fedorahosted.org/rasdaemon.git
> just picks up perf/events and logs to a sqlite database.

Actually it uses ftrace's facilities but it is a tracepoint in the end.

And I asked him nicely not to call it rasdaemon because I already have a
RAS daemon but hey, whatever. The more confusion, the better.

> Because Linux can do runtime things that the BIOS can't - like offline
> a 4K page. Idea here is that BIOS does whatever the OEM thinks is the
> right level of thresholding - not bothering the OS with petty details
> of random corrected errors that mean nothing. But if there is some
> repeated error (like a stuck bit) then the BIOS can provide a CPER
> to the OS telling it that it would be a good idea to stop using that
> page.

Ok, where are those semantics defined? What in a CPER record says "this
error should tell you that you need to offline the containing page and
I'm telling you this exactly once"? Error Severity 0, i.e. Recoverable?

> And this is where the semantics of a CPER change between the original
> WSM-EX implementation ... where Linux expects to see all the errors
> and do its own thresholding, only taking a page offline if it sees a
> lot of CPERs referring to the same page; and now - where the BIOS does the
> counting and tells Linux just once to take the page offline.

Ok, we're talking about the S in RAS now. Do we have error recovery
strategies specified anywhere? Are they per-platform or generic? Is this
CPER strategy above, for example, only valid for some platforms or for
all APEI-using hardware?

Questions over questions...
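
To make the two CPER semantics quoted above concrete, here is a rough
sketch of the difference. It is not kernel code: the class names, the
report() method and the threshold value are all hypothetical, and the
only detail taken from the thread is the 4K page size. In the first
model the OS counts reports per page and acts once a threshold is hit;
in the second the firmware has already done the counting, so a single
report is treated as a request to offline the page.

```python
from collections import Counter

PAGE_SHIFT = 12  # 4K pages, as mentioned in the thread


class OsThresholdPolicy:
    """Hypothetical model: the OS sees every corrected error and counts
    reports per page frame, offlining a page once a threshold is hit."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed value, chosen by the caller
        self.counts = Counter()
        self.offlined = set()

    def report(self, phys_addr):
        """Record one report; return True if this report offlines the page."""
        pfn = phys_addr >> PAGE_SHIFT
        if pfn in self.offlined:
            return False
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            self.offlined.add(pfn)
            return True
        return False


class FirmwareFirstPolicy:
    """Hypothetical model: firmware did the thresholding already, so the
    first report for a page is taken as the request to offline it."""

    def __init__(self):
        self.offlined = set()

    def report(self, phys_addr):
        """Return True if this report offlines the page."""
        pfn = phys_addr >> PAGE_SHIFT
        if pfn in self.offlined:
            return False
        self.offlined.add(pfn)
        return True
```

The sketch also shows why the question matters: the same record stream
leads to different actions depending on which model the OS assumes, and
nothing in the record itself says which one applies.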

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.