From: Chen Gong <gong.chen@linux.intel.com>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	"bp@amd64.org" <bp@amd64.org>, "x86@kernel.org" <x86@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] x86: auto poll/interrupt mode switch for CMC to stop CMC storm
Date: Thu, 24 May 2012 10:23:38 +0800
Message-ID: <4FBD9BAA.7070902@linux.intel.com>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F192F30C0@ORSMSX104.amr.corp.intel.com>

On 2012/5/24 4:53, Luck, Tony wrote:
>> If that's the case, then I really can't understand the 5 CMCIs per
>> second threshold for defining the storm and switching to poll mode.
>> I'd rather expect 5 of them in a row.
> We don't have a lot of science to back up the "5" number (and
> can change it to conform to any better numbers if someone has
> some real data).
>
> My general approximation for DRAM corrected error rates is
> "one per gigabyte per month, plus or minus two orders of
>  magnitude". So if I saw 1600 errors per month on a 16GB
> workstation, I'd think that was a high rate - but still
> plausible from natural causes (especially if the machine
> was some place 5000 feet above sea level with a lot less
> atmosphere to block neutrons). That only amounts to a couple
> of errors per hour. So five in a second is certainly a storm!
>
> Looking at this from another perspective ... how many
> CMCIs can we take per second before we start having a
> noticeable impact on system performance. RT answer may
> be quite a small number, generic throughput computing
> answer might be several hundred per second.
>
> The situation we are trying to avoid is a stuck bit on
> some very frequently accessed piece of memory generating
> a solid stream of CMCIs that makes the system unusable. In
> this case the question is for how long do we let the storm
> rage before we turn off CMCI to get some real work done.
>
> Once we are in polling mode, we do lose data on the location
> of some corrected errors. But I don't think that this is
> too serious. If there are few errors, we want to know about
> them all. If there are so many that we have difficulty
> counting them all - then sampling from a subset will
> give us reasonable data most of the time (the exception
> being the case where we have one error source that is
> 100,000 times as noisy as some other sources that we'd
> still like to keep tabs on ... we'll need a *lot* of samples
> to see the quieter error sources amongst the noise).
>
> So I think there are justifications for numbers in the
> 2..1000 range. We could punt it to the user by making
> it configurable/tunable ... but I think we already have
> too many tunables that end-users don't have enough information
> to really set in meaningful ways to meet their actual
> needs - so I'd prefer to see some "good enough" number
> that meets the needs, rather than yet another /sys/...
> file that people can tweak.
>
> -Tony

Thanks very much for the elaboration, Tony. You gave even more detail
than I had planned to give :-).
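
For reference, the rate estimate in Tony's quoted text can be checked
with a few lines of C. This is only a sketch of the arithmetic: the
helper names and the 30-day month are assumptions made for
illustration, and none of it comes from the patch.

```c
#include <assert.h>

/*
 * Rule of thumb from the quoted mail: one corrected error per gigabyte
 * per month, scaled by 10^orders_of_magnitude (may be negative).
 * Hypothetical helper, for illustration only.
 */
static double monthly_errors(double gigabytes, int orders_of_magnitude)
{
	double rate = gigabytes;	/* baseline: 1 error per GB per month */

	assert(gigabytes >= 0.0);
	for (int i = 0; i < orders_of_magnitude; i++)
		rate *= 10.0;
	for (int i = 0; i > orders_of_magnitude; i--)
		rate /= 10.0;
	return rate;
}

/* Convert a monthly count to an hourly rate, assuming a 30-day month. */
static double errors_per_hour(double errors_per_month)
{
	return errors_per_month / (30.0 * 24.0);
}

/* How many times faster a per-second storm rate is than an hourly rate. */
static double storm_ratio(double storm_per_second, double background_per_hour)
{
	assert(background_per_hour > 0.0);
	return storm_per_second * 3600.0 / background_per_hour;
}
```

With these helpers, monthly_errors(16, 2) gives 1600 errors per month,
errors_per_hour(1600) is a little over 2, and five CMCIs per second is
roughly 8100 times that hourly rate.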

Hi Thomas, yes, you can say 5 is just an arbitrary value, and I can't
give you much proof for it, though I did find some people to help test
on a real platform. I can only say it works on our internal test bench,
but I really hope someone can try this patch on their actual machines
and give me feedback. Then I can decide what value is proper or whether
we need a tunable switch. For now, as Tony said, there are already too
many switches for end users, so I don't want to add more.
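
To make the threshold under discussion concrete, here is a minimal
user-space sketch of "N CMCIs within one second means storm". It is not
the code from the patch: the names (cmc_state, cmc_note_interrupt,
cmc_poll_result), the fixed one-second window, the ">=" comparison and
the "one quiet poll switches back" rule are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STORM_THRESHOLD	5		/* assumed: CMCIs per window */
#define WINDOW_NS	1000000000ULL	/* assumed: one-second window */

enum cmc_mode { CMC_INTERRUPT, CMC_POLL };

/* Hypothetical per-CPU bookkeeping. */
struct cmc_state {
	uint64_t window_start_ns;	/* start of the current window */
	unsigned int count;		/* CMCIs seen in this window */
	enum cmc_mode mode;
};

/*
 * Account for one CMCI arriving at now_ns. Returns true if this
 * interrupt pushed the count to the threshold and caused the switch
 * from interrupt mode to poll mode.
 */
static bool cmc_note_interrupt(struct cmc_state *s, uint64_t now_ns)
{
	assert(s != NULL);
	if (s->mode == CMC_POLL)
		return false;

	/* Start a fresh window once the old one has expired. */
	if (now_ns - s->window_start_ns >= WINDOW_NS) {
		s->window_start_ns = now_ns;
		s->count = 0;
	}

	if (++s->count >= STORM_THRESHOLD) {
		s->mode = CMC_POLL;
		return true;
	}
	return false;
}

/*
 * Report the result of one poll. A poll that finds no new errors
 * switches back to interrupt mode (an assumed policy).
 */
static void cmc_poll_result(struct cmc_state *s, bool found_errors,
			    uint64_t now_ns)
{
	assert(s != NULL);
	if (s->mode == CMC_POLL && !found_errors) {
		s->mode = CMC_INTERRUPT;
		s->window_start_ns = now_ns;
		s->count = 0;
	}
}
```

In this sketch, five interrupts inside one window flip the state to
CMC_POLL, while the same five interrupts spread two seconds apart never
do. Changing STORM_THRESHOLD is the only thing needed to try other
values from the 2..1000 range Tony mentions.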

BTW, I will update the description in the next version.

Hi Boris, when I wrote this code I didn't consider whether it was
specific to Intel or AMD. I just noticed that it should be generic for
the x86 platform, and all the related code, which lives in mce.c, is
generic too, so I think it should be fine to place this code in mce.c.

