From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: Linux Edac Mailing List <linux-edac@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Aristeu Rozanski <arozansk@redhat.com>,
Doug Thompson <norsk5@yahoo.com>,
Steven Rostedt <rostedt@goodmis.org>,
Frederic Weisbecker <fweisbec@gmail.com>,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events
Date: Thu, 24 May 2012 13:13:17 -0300 [thread overview]
Message-ID: <4FBE5E1D.7070804@redhat.com> (raw)
In-Reply-To: <20120524105604.GC27063@aftab.osrc.amd.com>
Em 24-05-2012 07:56, Borislav Petkov escreveu:
> On Thu, May 24, 2012 at 07:14:20AM -0300, Mauro Carvalho Chehab wrote:
>> + * Default error mechanisms for Memory Controller errors (CE and UE)
>> + */
>> +TRACE_EVENT(mc_event,
>> +
>> + TP_PROTO(const unsigned int err_type,
>> + const unsigned int mc_index,
>> + const char *error_msg,
>> + const char *label,
>> + int layer0,
>> + int layer1,
>> + int layer2,
>> + unsigned long address,
>> + unsigned long grain,
>> + unsigned long syndrome,
>> + const char *driver_detail),
>
> What about the comments I had about layer{0,1,2} and grain?
>
> http://marc.info/?l=linux-edac&m=133769207124938&w=2
Sorry, I missed that email.
Em 22-05-2012 10:05, Borislav Petkov escreveu:
> On Tue, May 22, 2012 at 07:18:21AM -0300, Mauro Carvalho Chehab wrote:
>> Em 22-05-2012 06:28, Borislav Petkov escreveu:
>>> On Tue, May 22, 2012 at 12:04:48AM -0300, Mauro Carvalho Chehab wrote:
>>>> +TRACE_EVENT(mc_event,
>>>> +
>>>> + TP_PROTO(const unsigned int err_type,
>>>> + const unsigned int mc_index,
>>>> + const char *error_msg,
>>>> + const char *label,
>>>> + int layer0,
>>>> + int layer1,
>>>> + int layer2,
>>>
>>> Those are EDAC-internal layer representation, why are they exported to
>>> userspace? Userspace needs only the location and label AFAICT.
>>
>> Those are not the EDAC internal layer representation. They're the physical
>> location of the DIMM or rank.
>
> Ok, you've replaced the location char * with the layers.
Yes.
>
>>> If you export them to userspace, they need much more meaningful names -
>>> layer{0,1,2} mean nothing outside of the kernel.
>>
>> Ok. Do you have a better naming suggestion?
>>
>> What about layer0_pos, layer1_pos, layer2_pos?
>
> Actually, I'd like them to be called channel/rank/row or something. Having them
> numbered I don't know which layer is the top layer (channel/branch/slot)
> and the lowest (rank/csrow/...)
>
> Maybe top_layer, middle_layer, lowest_layer? Or something like that...
Works for me.
>
>>>
>>>> + unsigned long pfn,
>>>> + unsigned long offset,
>>>> + unsigned long grain,
>>>
>>> Why aren't those a single 'unsigned long address' since they all are
>>> computed from it?
>>
>> We can merge pfn and offset into "unsigned long address".
>
> Just have a single "unsigned long address" field and userspace can pick
> out the stuff it needs from it.
Ok.
>> With regards to the grain, it is an address mask, written with a "short" way.
>> So, grain 32, for example, means:
>> ffff:ffff:ffff:fffe0
>>
>> As the current EDAC API exports it as grain, IMO, it is better to keep it as-is,
>> but it won't be hard to do:
>> unsigned long mask = ((unsigned long) -1) && (1 - grain)
>>
>> What do you think?
>
> Why are we even exporting grain actually with each tracepoint
> invocation? This is the granularity of reported error in bytes, and it,
> as such, is statically assigned to a value in each driver. Userspace can
> certainly figure out that value in a different way.
The API doesn't export the grain, except via the tracepoint/printk.
> But the more important question is: does the grain help us when handling
> the error info in userspace?
>
> It tells us that at this physical address with "grain" granularity we
> had an error. So?
While a certain number of corrected errors that happened on different, sparsed,
addresses may not mean a damaged memory, the same number of corrected errors
happening at the same physical address/grain means that the DRAM chip that
contains such address is damaged, so the corresponding DIMM needs to be
replaced.
So, the address/grain can be used by userspace algorithms to increase the
probability that a DIMM is damaged.
Regards,
Mauro.
I'm folding the following patch with this one:
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 5b06c43..18dde46 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -49,9 +49,9 @@ TRACE_EVENT(mc_event,
__field( unsigned int, mc_index )
__string( msg, error_msg )
__string( label, label )
- __field( int, layer0 )
- __field( int, layer1 )
- __field( int, layer2 )
+ __field( int, top_layer )
+ __field( int, middle_layer )
+ __field( int, lower_layer )
__field( int, address )
__field( int, grain )
__field( int, syndrome )
@@ -63,9 +63,9 @@ TRACE_EVENT(mc_event,
__entry->mc_index = mc_index;
__assign_str(msg, error_msg);
__assign_str(label, label);
- __entry->layer0 = layer0;
- __entry->layer1 = layer1;
- __entry->layer2 = layer2;
+ __entry->top_layer = layer0;
+ __entry->middle_layer = layer1;
+ __entry->lower_layer = layer2;
__entry->address = address;
__entry->grain = grain;
__entry->syndrome = syndrome;
@@ -80,9 +80,9 @@ TRACE_EVENT(mc_event,
__get_str(msg),
__get_str(label),
__entry->mc_index,
- __entry->layer0,
- __entry->layer1,
- __entry->layer2,
+ __entry->top_layer,
+ __entry->middle_layer,
+ __entry->lower_layer,
__entry->address,
__entry->grain,
__entry->syndrome,
next prev parent reply other threads:[~2012-05-24 16:13 UTC|newest]
Thread overview: 118+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-18 16:31 [PATCH EDAC v26 00/66] EDAC patches for v3.5 Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 01/66] edac: Create a dimm struct and move the labels into it Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 03/66] edac: Don't initialize csrow's first_page & friends when not needed Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 04/66] edac: move nr_pages to dimm struct Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 05/66] edac: rewrite edac_align_ptr() Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 06/66] edac.h: Add generic layers for describing a memory location Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 08/66] amd64_edac: convert driver to use the new edac ABI Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 09/66] amd76x_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 10/66] cell_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 11/66] cpc925_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 12/66] e752x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 13/66] e7xxx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 14/66] i3000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 15/66] i3200_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 16/66] i5000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 17/66] i5100_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 18/66] i5400_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 19/66] i7300_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 20/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 21/66] i82443bxgx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 22/66] i82860_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 23/66] i82875p_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 24/66] i82975x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 25/66] mpc85xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 26/66] mv64x60_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 27/66] pasemi_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 28/66] ppc4xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 29/66] r82600_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 30/66] sb_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 31/66] tile_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 32/66] x38_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 33/66] edac: Remove the legacy EDAC ABI Mauro Carvalho Chehab
2012-05-18 17:51 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 34/66] edac: Initialize the dimm label with the known information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 35/66] edac: Cleanup the logs for i7core and sb edac drivers Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 36/66] i5400_edac: improve debug messages to better represent the filled memory Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 37/66] RAS: Add a tracepoint for reporting memory controller events Mauro Carvalho Chehab
2012-05-24 10:14 ` [PATCH] " Mauro Carvalho Chehab
2012-05-24 10:56 ` Borislav Petkov
2012-05-24 16:13 ` Mauro Carvalho Chehab [this message]
2012-05-24 16:17 ` Mauro Carvalho Chehab
2012-05-24 16:45 ` Borislav Petkov
2012-05-24 18:00 ` Mauro Carvalho Chehab
2012-05-29 11:58 ` Borislav Petkov
2012-05-29 14:02 ` Mauro Carvalho Chehab
2012-05-29 14:52 ` Borislav Petkov
2012-05-29 15:23 ` Mauro Carvalho Chehab
2012-05-30 23:24 ` Luck, Tony
2012-05-31 10:00 ` Borislav Petkov
2012-05-31 10:33 ` Mauro Carvalho Chehab
2012-05-31 12:17 ` Borislav Petkov
2012-05-31 13:56 ` Mauro Carvalho Chehab
2012-05-31 14:22 ` Borislav Petkov
2012-05-31 14:44 ` Mauro Carvalho Chehab
2012-05-31 14:54 ` Borislav Petkov
2012-05-31 15:01 ` Mauro Carvalho Chehab
2012-05-31 15:14 ` Borislav Petkov
2012-05-31 16:14 ` Mauro Carvalho Chehab
2012-05-31 17:13 ` Borislav Petkov
2012-05-31 18:04 ` Mauro Carvalho Chehab
2012-05-31 18:33 ` Aristeu Rozanski
2012-05-31 19:37 ` Borislav Petkov
2012-05-31 19:32 ` Steven Rostedt
2012-05-31 19:42 ` Borislav Petkov
2012-05-31 20:11 ` Steven Rostedt
2012-05-31 20:18 ` Borislav Petkov
2012-05-31 20:52 ` Luck, Tony
2012-06-01 9:10 ` Borislav Petkov
2012-06-01 9:40 ` Chen Gong
2012-06-01 12:15 ` Mauro Carvalho Chehab
2012-06-01 15:42 ` Luck, Tony
2012-06-01 16:00 ` Borislav Petkov
2012-06-01 18:21 ` Luck, Tony
2012-06-01 23:00 ` Borislav Petkov
2012-06-01 23:19 ` Luck, Tony
2012-06-01 23:28 ` Borislav Petkov
2012-05-31 16:51 ` Luck, Tony
2012-05-31 17:20 ` Borislav Petkov
2012-05-31 18:14 ` Luck, Tony
2012-05-31 19:26 ` Borislav Petkov
2012-05-31 18:24 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 38/66] i5000_edac: Fix the logic that retrieves memory information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 39/66] e752x_edac: provide more info about how DIMMS/ranks are mapped Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 40/66] edac: Rename the parent dev to pdev Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 41/66] edac: use Documentation-nano format for some data structs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 42/66] edac: rewrite the sysfs code to use struct device Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 43/66] mpc85xx_edac: convert sysfs logic " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 44/66] amd64_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 45/66] i7core_edac: convert it " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 46/66] edac: Get rid of the old kobj's from the edac mc code Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 47/66] edac: add a new per-dimm API and make the old per-virtual-rank API obsolete Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 48/66] edac: add a sysfs node to report the maximum location for the system Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 49/66] edac: Add debufs nodes to allow doing fake error inject Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 50/66] edac: Move grain/dtype/edac_type calculus to be out of channel loop Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 51/66] i82975x_edac: Test nr_pages earlier to save a few CPU cycles Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 52/66] i5100_edac: Fix a warning when compiled with 32 bits Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 53/66] i7300_edac: Get rid of some wrongly-solved rebase conflict Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 54/66] edac: Only expose csrows/channels on legacy API if they're populated Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 55/66] edac: change the mem allocation scheme to make Documentation/kobject.txt happy Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 56/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 57/66] edac: move documentation ABI to ABI/testing/sysfs-devices-edac Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 58/66] Edac: Add ABI Documentation for the new device nodes Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 59/66] i5000: Fix the fatal error handling Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 60/66] i7core: fix ranks information at the per-channel struct Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 61/66] edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 62/66] edac: Use more normal debugging macro style Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 63/66] edac: Convert debugfX to edac_dbg(X, Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 64/66] edac_mc: Cleanup per-dimm_info debug messages Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 65/66] edac: Increase version to 3.0.0 Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 66/66] edac_mc: check for allocation failure in edac_mc_alloc() Mauro Carvalho Chehab
2012-05-18 16:46 ` [PATCH EDAC v26 00/66] EDAC patches for v3.5 Borislav Petkov
2012-05-18 17:43 ` Mauro Carvalho Chehab
2012-05-18 17:53 ` Borislav Petkov
2012-05-28 15:46 ` Mauro Carvalho Chehab
2012-05-28 20:36 ` Borislav Petkov
2012-05-28 23:13 ` Mauro Carvalho Chehab
2012-05-29 2:40 ` Chen Gong
2012-05-29 11:45 ` Mauro Carvalho Chehab
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FBE5E1D.7070804@redhat.com \
--to=mchehab@redhat.com \
--cc=arozansk@redhat.com \
--cc=bp@amd64.org \
--cc=fweisbec@gmail.com \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=norsk5@yahoo.com \
--cc=rostedt@goodmis.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).