From: Borislav Petkov <bp@amd64.org>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Borislav Petkov <bp@amd64.org>,
Steven Rostedt <rostedt@goodmis.org>,
Mauro Carvalho Chehab <mchehab@redhat.com>,
Linux Edac Mailing List <linux-edac@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Aristeu Rozanski <arozansk@redhat.com>,
Doug Thompson <norsk5@yahoo.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events
Date: Fri, 1 Jun 2012 11:10:26 +0200 [thread overview]
Message-ID: <20120601091026.GC20959@aftab.osrc.amd.com> (raw)
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F192F6DE2@ORSMSX104.amr.corp.intel.com>
On Thu, May 31, 2012 at 08:52:21PM +0000, Luck, Tony wrote:
> > It could be very quiet (i.e., machine runs with no errors) and it could
> > have bursts where it reports a large number of errors back-to-back
> > depending on access patterns, DIMM health, temperature, sea level and at
> > least a bunch more factors.
>
> Yes - the normal case is a few errors from stray neutrons ... perhaps
> a few per month, maybe on a very big system a few per hour. When something
> breaks, especially if it affects a wide range of memory addresses, then
> you will see a storm of errors.
IOW, when the sh*t hits the fan :-)
> > So I can imagine buffers filling up suddenly and fast, and userspace
> > having hard time consuming them in a timely manner.
>
> But I'm wondering what agent is going to be reporting all these
> errors. Intel has CMCI - so you can get a storm of interrupts
> which would each generate a trace record ... but we are working
> on a patch to turn off CMCI if a storm is detected.
Yeah, about that. What are you guys doing about losing CECCs when
throttling is on, I'm assuming there's no way around it?
> AMD doesn't have CMCI, so errors just report from polling
It does, look at <arch/x86/kernel/cpu/mcheck/mce_amd.c> That's the error
thresholding. We were talking about having an APIC interrupt fire at
_every_ CECC but I don't know/haven't tested how the software would
behave in such cases where the hw spits out an overwhelming amount of
errors.
> - and we have a
> maximum poll rate which is quite low by trace standards (even
> when multiplied by NR_CPUS).
>
> Will EDAC drivers loop over some chipset registers blasting
> out huge numbers of trace records ... that seems just as bad
> for system throughput as a CMCI storm. And just as useless.
Why useless?
I don't know but we need to be as slim as possible on the reporting side
for future use cases like that.
Also, we probably want to proactively do something about such storms
like offline pages or disable some hardware components so that they
subside.
Switching to polling mode IMHO only cures the symptom but not the
underlying cause.
> General principle: If there are very few errors happening then it is
> important to log every single one of them.
Absolutely.
> If there are so many that we can't keep up, then we must sample at
> some level, and we might as well do that at generation point.
Yes, and then take action to recover and stop the storm.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
next prev parent reply other threads:[~2012-06-01 9:10 UTC|newest]
Thread overview: 124+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-18 16:31 [PATCH EDAC v26 00/66] EDAC patches for v3.5 Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 01/66] edac: Create a dimm struct and move the labels into it Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 02/66] edac: move dimm properties to struct dimm_info Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 03/66] edac: Don't initialize csrow's first_page & friends when not needed Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 04/66] edac: move nr_pages to dimm struct Mauro Carvalho Chehab
2012-05-18 16:31 ` Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 05/66] edac: rewrite edac_align_ptr() Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 06/66] edac.h: Add generic layers for describing a memory location Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 07/66] edac: Change internal representation to work with layers Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 08/66] amd64_edac: convert driver to use the new edac ABI Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 09/66] amd76x_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 10/66] cell_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 11/66] cpc925_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 12/66] e752x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 13/66] e7xxx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 14/66] i3000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 15/66] i3200_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 16/66] i5000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 17/66] i5100_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 18/66] i5400_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 19/66] i7300_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 20/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 21/66] i82443bxgx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 22/66] i82860_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 23/66] i82875p_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 24/66] i82975x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 25/66] mpc85xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 26/66] mv64x60_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 27/66] pasemi_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 28/66] ppc4xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 29/66] r82600_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 30/66] sb_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 31/66] tile_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 32/66] x38_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 33/66] edac: Remove the legacy EDAC ABI Mauro Carvalho Chehab
2012-05-18 17:51 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 34/66] edac: Initialize the dimm label with the known information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 35/66] edac: Cleanup the logs for i7core and sb edac drivers Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 36/66] i5400_edac: improve debug messages to better represent the filled memory Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 37/66] RAS: Add a tracepoint for reporting memory controller events Mauro Carvalho Chehab
2012-05-24 10:14 ` [PATCH] " Mauro Carvalho Chehab
2012-05-24 10:56 ` Borislav Petkov
2012-05-24 16:13 ` Mauro Carvalho Chehab
2012-05-24 16:17 ` Mauro Carvalho Chehab
2012-05-24 16:45 ` Borislav Petkov
2012-05-24 18:00 ` Mauro Carvalho Chehab
2012-05-29 11:58 ` Borislav Petkov
2012-05-29 14:02 ` Mauro Carvalho Chehab
2012-05-29 14:52 ` Borislav Petkov
2012-05-29 15:23 ` Mauro Carvalho Chehab
2012-05-30 23:24 ` Luck, Tony
2012-05-31 10:00 ` Borislav Petkov
2012-05-31 10:33 ` Mauro Carvalho Chehab
2012-05-31 12:17 ` Borislav Petkov
2012-05-31 13:56 ` Mauro Carvalho Chehab
2012-05-31 14:22 ` Borislav Petkov
2012-05-31 14:44 ` Mauro Carvalho Chehab
2012-05-31 14:54 ` Borislav Petkov
2012-05-31 15:01 ` Mauro Carvalho Chehab
2012-05-31 15:14 ` Borislav Petkov
2012-05-31 16:14 ` Mauro Carvalho Chehab
2012-05-31 17:13 ` Borislav Petkov
2012-05-31 18:04 ` Mauro Carvalho Chehab
2012-05-31 18:33 ` Aristeu Rozanski
2012-05-31 19:37 ` Borislav Petkov
2012-05-31 19:32 ` Steven Rostedt
2012-05-31 19:42 ` Borislav Petkov
2012-05-31 20:11 ` Steven Rostedt
2012-05-31 20:18 ` Borislav Petkov
2012-05-31 20:52 ` Luck, Tony
2012-06-01 9:10 ` Borislav Petkov [this message]
2012-06-01 9:40 ` Chen Gong
2012-06-01 12:15 ` Mauro Carvalho Chehab
2012-06-01 15:42 ` Luck, Tony
2012-06-01 16:00 ` Borislav Petkov
2012-06-01 18:21 ` Luck, Tony
2012-06-01 23:00 ` Borislav Petkov
2012-06-01 23:19 ` Luck, Tony
2012-06-01 23:28 ` Borislav Petkov
2012-05-31 16:51 ` Luck, Tony
2012-05-31 17:20 ` Borislav Petkov
2012-05-31 18:14 ` Luck, Tony
2012-05-31 19:26 ` Borislav Petkov
2012-05-31 18:24 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 38/66] i5000_edac: Fix the logic that retrieves memory information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 39/66] e752x_edac: provide more info about how DIMMS/ranks are mapped Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 40/66] edac: Rename the parent dev to pdev Mauro Carvalho Chehab
2012-05-18 16:32 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 41/66] edac: use Documentation-nano format for some data structs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 42/66] edac: rewrite the sysfs code to use struct device Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 43/66] mpc85xx_edac: convert sysfs logic " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 44/66] amd64_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 45/66] i7core_edac: convert it " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 46/66] edac: Get rid of the old kobj's from the edac mc code Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 47/66] edac: add a new per-dimm API and make the old per-virtual-rank API obsolete Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 48/66] edac: add a sysfs node to report the maximum location for the system Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 49/66] edac: Add debufs nodes to allow doing fake error inject Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 50/66] edac: Move grain/dtype/edac_type calculus to be out of channel loop Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 51/66] i82975x_edac: Test nr_pages earlier to save a few CPU cycles Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 52/66] i5100_edac: Fix a warning when compiled with 32 bits Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 53/66] i7300_edac: Get rid of some wrongly-solved rebase conflict Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 54/66] edac: Only expose csrows/channels on legacy API if they're populated Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 55/66] edac: change the mem allocation scheme to make Documentation/kobject.txt happy Mauro Carvalho Chehab
2012-05-18 16:32 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 56/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 57/66] edac: move documentation ABI to ABI/testing/sysfs-devices-edac Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 58/66] Edac: Add ABI Documentation for the new device nodes Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 59/66] i5000: Fix the fatal error handling Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 60/66] i7core: fix ranks information at the per-channel struct Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 61/66] edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 62/66] edac: Use more normal debugging macro style Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 63/66] edac: Convert debugfX to edac_dbg(X, Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 64/66] edac_mc: Cleanup per-dimm_info debug messages Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 65/66] edac: Increase version to 3.0.0 Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 66/66] edac_mc: check for allocation failure in edac_mc_alloc() Mauro Carvalho Chehab
2012-05-18 16:46 ` [PATCH EDAC v26 00/66] EDAC patches for v3.5 Borislav Petkov
2012-05-18 17:43 ` Mauro Carvalho Chehab
2012-05-18 17:53 ` Borislav Petkov
2012-05-28 15:46 ` Mauro Carvalho Chehab
2012-05-28 20:36 ` Borislav Petkov
2012-05-28 23:13 ` Mauro Carvalho Chehab
2012-05-29 2:40 ` Chen Gong
2012-05-29 11:45 ` Mauro Carvalho Chehab
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120601091026.GC20959@aftab.osrc.amd.com \
--to=bp@amd64.org \
--cc=arozansk@redhat.com \
--cc=fweisbec@gmail.com \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mchehab@redhat.com \
--cc=mingo@redhat.com \
--cc=norsk5@yahoo.com \
--cc=rostedt@goodmis.org \
--cc=tony.luck@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.