public inbox for linux-kernel@vger.kernel.org
From: Borislav Petkov <bp@alien8.de>
To: Havard Skinnemoen <hskinnemoen@google.com>
Cc: Tony Luck <tony.luck@gmail.com>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	Ewout van Bekkum <ewout@google.com>,
	linux-edac <linux-edac@vger.kernel.org>
Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
Date: Mon, 14 Jul 2014 16:57:01 +0200
Message-ID: <20140714145701.GC25115@pd.tnic>
In-Reply-To: <CAFQmdRbpVfNjvyOPHo7RZ+W=2FTA6fEw2hWBebKNBX8LMPYfeQ@mail.gmail.com>

On Fri, Jul 11, 2014 at 01:39:19PM -0700, Havard Skinnemoen wrote:
> Sorry, I was being unclear. I was actually arguing the opposite:
> Getting 15 CMCIs per second is fine and shouldn't cause any switch to
> polling mode, especially if the polling will happen at 100 times per
> second. But your proposal would switch to polling if we ever see 2
> CMCIs within a period, which seems way too trigger-happy, even if the
> period is short.
>
> I do agree there are already a lot of arbitrary numbers in the code.

Yes, trigger-happy is no good either.

The thing is, even if we came up with a correct number now, who's to say
that the same number would be correct on future uarches? I like the idea
of approximating the storm entry point on each system and I, like you,
worry about complexity. This needs to be done really conservatively and
without rushing...

Thankfully, this thread has some nice starting ideas. :)

> > Instead, the criteria should probably be something like: what is the
> > number of CMCIs per second which we can process while leaving system
> > operation relatively unaffected? Anything above that number constitutes
> > a CMCI storm.
> 
> That sounds good to me. But now you're talking about CMCIs per second,
> which seems to imply some form of counting right?

<thinking out loud>

Well, I was thinking of measuring the average duration of the CMCI
interrupt handler (which basically is machine_check_poll) and then maybe
allowing it to consume x% of each second. Any CMCI rate above that budget
switches to storm mode.

So we'll probably end up counting again but CMCI_STORM_THRESHOLD will be
determined dynamically by doing:

	CMCI_STORM_THRESHOLD = (1000ms / average duration of CMCI in ms) * x%

Then, making that x user-configurable would probably be fine too. It'll
basically allow users to say what percentage of time they'd want the
system to spend handling CMCIs before polling.

And, it'll have a sane, conservative default for the majority of people
who don't want to deal with this at all.

The usual concerns about exporting stuff to userspace apply, see below.

</thinking out loud>
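
To put rough numbers on the formula above, here is a minimal userspace
sketch in C. It is not from the mce code: cmci_storm_threshold(), the
nanosecond units and the percentage parameter are all assumptions made
for illustration, and how the average handler duration would actually be
measured is left open.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an existing kernel function): how many CMCIs
 * per second fit into budget_pct percent of CPU time, given the measured
 * average handler duration in nanoseconds.
 *
 * Returns 0 when no measurement is available yet; the caller would then
 * have to fall back to some static default (an assumption of this sketch).
 */
static uint32_t cmci_storm_threshold(uint64_t avg_handler_ns,
				     uint32_t budget_pct)
{
	const uint64_t ns_per_sec = 1000000000ULL;
	uint64_t budget_ns;

	assert(budget_pct <= 100);

	if (avg_handler_ns == 0)
		return 0;

	/* Time per second we are willing to spend in the handler. */
	budget_ns = ns_per_sec / 100 * budget_pct;

	/* Number of handler invocations that fit into that time. */
	return (uint32_t)(budget_ns / avg_handler_ns);
}
```

With an average handler duration of 1 ms and a 5% budget this gives 50
CMCIs per second, which matches (1000ms / 1ms) * 5%.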

> > Now, how we'll come up with an answer to that question is a whole
> > another story...
> 
> Right. If we can come up with an answer, that's great, but if we
> don't, I think we're better off exporting a nice knob and letting the
> user tune his system according to his needs.

Yeah, just remember that exporting all kinds of knobs means we're forced
to support them forever. So I'm very cautious about exposing anything to
userspace, as it becomes an API and we're stuck with it.

> Just to throw another number out, how about doing CMCI storm polling
> at a fixed interval of 100 ms? Since check_interval is an integer
> representing a number of seconds, it can never get lower than 10x this
> number, so we won't need to restrict it any further.

Yep, this is basically the approach where we pick a static default
number for all machines out there. It could be a temporary solution ...

> If we see more than X CMCIs in a second, we switch to polling. If less
> than Y out of 10 polls see an error, we switch back to CMCI.
> 
> Now, we still leave 3 magic numbers to be figured out...but I think
> their range is somewhat more limited.

Makes sense.

So X will always be < 10 (== 10 means we automatically switch to
polling).

The Y could contain a historic aspect by setting it to some value and
decrementing it by one if we haven't seen an error and incrementing it
if we saw an error during the last poll. It will saturate at Y errors
and when it reaches 0, it will switch back to CMCI.
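
As a rough illustration of that saturating counter, here is a small
userspace sketch in C. The struct, the function names and the value 5 for
Y are made up for this example and are not taken from the mce code; it
only models the bookkeeping, not the actual switch between CMCI and
polling.

```c
#include <assert.h>
#include <stdbool.h>

#define STORM_SCORE_MAX	5	/* hypothetical value for "Y" */

/* Hypothetical per-CPU storm bookkeeping. */
struct cmci_storm_state {
	bool polling;		/* true while in storm (polling) mode */
	unsigned int score;	/* saturating history of recent polls */
};

/* Called when the CMCI rate exceeds the storm threshold. */
static void storm_enter(struct cmci_storm_state *s)
{
	s->polling = true;
	s->score = STORM_SCORE_MAX;
}

/*
 * Called after each poll while in storm mode. An error bumps the score
 * (saturating at STORM_SCORE_MAX), a clean poll decrements it. Once the
 * score reaches 0 we leave polling mode.
 *
 * Returns true if we stay in polling mode, false if the caller should
 * switch back to CMCI.
 */
static bool storm_poll_result(struct cmci_storm_state *s, bool saw_error)
{
	assert(s->polling);

	if (saw_error) {
		if (s->score < STORM_SCORE_MAX)
			s->score++;
	} else if (s->score > 0) {
		s->score--;
	}

	if (s->score == 0)
		s->polling = false;

	return s->polling;
}
```

With Y = 5, it takes five clean polls in a row to get back to CMCI mode,
and every poll that sees an error pushes that point further out by one.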

Hrrm, sounds interesting :)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.