public inbox for linux-kernel@vger.kernel.org
From: "Raj, Ashok" <ashok.raj@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: x86-ml <x86@kernel.org>, "Luck, Tony" <tony.luck@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	ashok.raj@intel.com
Subject: Re: [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts
Date: Fri, 25 Sep 2015 09:29:39 -0700	[thread overview]
Message-ID: <20150925162939.GB15216@linux.intel.com> (raw)
In-Reply-To: <20150925082901.GA3568@pd.tnic>

On Fri, Sep 25, 2015 at 10:29:01AM +0200, Borislav Petkov wrote:
> > > 
> > 
> > The last patch of that series had 2 changes.
> > 
> > 1. Allow offline CPUs to participate in the rendezvous, so that in the odd
> > chance the offline CPUs have collected any errors we can still report them.
> > (We changed mce_start/mce_end to use cpu_present_mask instead of just the
> > online map.)
> 
> This is not necessarily wrong - it is just unusual.

Correct!
> 
> > Without this change today, if I were to inject a broadcast MCE
> > it ends up hanging, since the offline CPUs also increment mce_callin.
> > It will always end up larger than cpu_online_mask by the number of CPUs
> > logically offlined.
> 
> Yeah, I'd like to have a bit somewhere which says "don't report MCEs on this
> core." But we talked about this already.
> 
> > Consider, e.g., if 2 threads of the core are offline, and the MLC picks up
> 
> What is MLC?

Mid Level Cache. This is shared between the 2 threads in that core. As opposed
to the Last Level Cache (LLC) which is shared between all the threads in the
socket.

> 
> > an error. Other CPUs in the socket can't access it. The only way is to let those 
> > CPUs read and report their own banks, as they are core scoped. In upcoming CPUs
> > we have some banks that can be thread scoped as well.
> > 
> > It's understood the OS doesn't execute any code on those CPUs. But SMI can still
> > run on them, and could collect errors that can be logged.
> 
> Well, that is not our problem, is it?
> 
> I mean, SMM wants to stay undetected. When all of a sudden offlined
> cores start reporting MCEs, that's going to raise some brows.

You are right. I was simply trying to state how a CPU that is offline from
the OS perspective could still be collecting errors; only trying to highlight
what happens at the platform level.

> 
> Regardless, there are other reasons why offlined cores might report MCEs
> - the fact that logical cores share functional units and data flow goes
> through them might trip the reporting on those cores. Yadda yadda...

Yep!
> 
> > 2. If the CPU is offline, we copy the errors into the mce_log buffer, and then
> > copy those out from the rendezvous master during mce_reign().
> > 
> > If we were to replace this mce_log_add() with gen_pool_add(), then I would 
> > have to call mce_gen_pool_add() from the offline CPU. This will end up calling
> > RCU functions. 
> > 
> > We don't want to lose any errors reported by the offline CPU for purposes
> > of logging. It is rare, but we are still interested in capturing those errors
> > if they were to happen.
> > 
> > Does this help?
> 
> So first of all, we need to hold this down somewhere, maybe in
> Documentation/ to explain why we're running on offlined cores. This is
> certainly unusual code and people will ask WTF is going on there.

Good idea to document these weird cases. I can do it in either
cpu-hotplug.txt or in the x86/x86_64/machinecheck file, as appropriate.
Will add that in my next update.

> 
> Then, I really really don't like a static buffer which we will have
> to increase with each new bigger machine configuration. This is just
> clumsy.
> 
> It'd be probably much better to make that MCE buffer per CPU. We can
> say, we're allowed to log 2-3, hell, 5 errors in it and when we're done
> with the rendezvous, an online core goes and flushes out the error
> records to gen_pool.

That makes good sense. I will make these per-CPU and also look at
clearing them during mce_panic() to make sure we flush them in fatal cases.

Per-CPU buffers certainly scale better than a static number.

> 
> This scales much better than any artificial MCE_LOG_LEN size.
> 
> Oh, and we either overwrite old errors when we fill up the percpu buffer
> or we return. that's something we can discuss later. Or we come up with
> a bit smarter strategy of selecting which ones to overwrite.
> 
> Just artificially increasing a static buffer is not good design IMO.

Agreed. Thanks, I will get another rev rolling.

Cheers,
Ashok



Thread overview: 13+ messages
2015-09-24  5:48 [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Ashok Raj
2015-09-24  5:48 ` [Patch V1 2/3] x86, mce: Refactor parts of mce_log() to reuse when logging from offline CPUs Ashok Raj
2015-09-24  5:48 ` [Patch V1 3/3] x86, mce: Account for offline CPUs during MCE rendezvous Ashok Raj
2015-09-24 15:47 ` [Patch V1 1/3] x86, mce: MCE log size not enough for high core parts Borislav Petkov
2015-09-24 18:44   ` Luck, Tony
2015-09-24 18:52     ` Borislav Petkov
2015-09-24 19:00       ` Luck, Tony
2015-09-24 19:22         ` Borislav Petkov
2015-09-24 20:22           ` Raj, Ashok
2015-09-24 21:07             ` Borislav Petkov
2015-09-24 21:25               ` Raj, Ashok
2015-09-25  8:29                 ` Borislav Petkov
2015-09-25 16:29                   ` Raj, Ashok [this message]
