From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: Linux Edac Mailing List <linux-edac@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Aristeu Rozanski <arozansk@redhat.com>,
Doug Thompson <norsk5@yahoo.com>,
Steven Rostedt <rostedt@goodmis.org>,
Frederic Weisbecker <fweisbec@gmail.com>,
Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events
Date: Thu, 24 May 2012 13:13:17 -0300 [thread overview]
Message-ID: <4FBE5E1D.7070804@redhat.com> (raw)
In-Reply-To: <20120524105604.GC27063@aftab.osrc.amd.com>
Em 24-05-2012 07:56, Borislav Petkov escreveu:
> On Thu, May 24, 2012 at 07:14:20AM -0300, Mauro Carvalho Chehab wrote:
>> + * Default error mechanisms for Memory Controller errors (CE and UE)
>> + */
>> +TRACE_EVENT(mc_event,
>> +
>> + TP_PROTO(const unsigned int err_type,
>> + const unsigned int mc_index,
>> + const char *error_msg,
>> + const char *label,
>> + int layer0,
>> + int layer1,
>> + int layer2,
>> + unsigned long address,
>> + unsigned long grain,
>> + unsigned long syndrome,
>> + const char *driver_detail),
>
> What about the comments I had about layer{0,1,2} and grain?
>
> http://marc.info/?l=linux-edac&m=133769207124938&w=2
Sorry, I missed that email.
Em 22-05-2012 10:05, Borislav Petkov escreveu:
> On Tue, May 22, 2012 at 07:18:21AM -0300, Mauro Carvalho Chehab wrote:
>> Em 22-05-2012 06:28, Borislav Petkov escreveu:
>>> On Tue, May 22, 2012 at 12:04:48AM -0300, Mauro Carvalho Chehab wrote:
>>>> +TRACE_EVENT(mc_event,
>>>> +
>>>> + TP_PROTO(const unsigned int err_type,
>>>> + const unsigned int mc_index,
>>>> + const char *error_msg,
>>>> + const char *label,
>>>> + int layer0,
>>>> + int layer1,
>>>> + int layer2,
>>>
>>> Those are EDAC-internal layer representation, why are they exported to
>>> userspace? Userspace needs only the location and label AFAICT.
>>
>> Those are not the EDAC internal layer representation. They're the physical
>> location of the DIMM or rank.
>
> Ok, you've replaced the location char * with the layers.
Yes.
>
>>> If you export them to userspace, they need much more meaningful names -
>>> layer{0,1,2} mean nothing outside of the kernel.
>>
>> Ok. Do you have a better naming suggestion?
>>
>> What about layer0_pos, layer1_pos, layer2_pos?
>
> Actually, I'd like them to be called channel/rank/row or something. Having them
> numbered I don't know which layer is the top layer (channel/branch/slot)
> and the lowest (rank/csrow/...)
>
> Maybe top_layer, middle_layer, lowest_layer? Or something like that...
Works for me.
>
>>>
>>>> + unsigned long pfn,
>>>> + unsigned long offset,
>>>> + unsigned long grain,
>>>
>>> Why aren't those a single 'unsigned long address' since they all are
>>> computed from it?
>>
>> We can merge pfn and offset into "unsigned long address".
>
> Just have a single "unsigned long address" field and userspace can pick
> out the stuff it needs from it.
Ok.
>> With regards to the grain, it is an address mask, written with a "short" way.
>> So, grain 32, for example, means:
>> ffff:ffff:ffff:fffe0
>>
>> As the current EDAC API exports it as grain, IMO, it is better to keep it as-is,
>> but it won't be hard to do:
>> unsigned long mask = ((unsigned long) -1) && (1 - grain)
>>
>> What do you think?
>
> Why are we even exporting grain actually with each tracepoint
> invocation? This is the granularity of reported error in bytes, and it,
> as such, is statically assigned to a value in each driver. Userspace can
> certainly figure out that value in a different way.
The API doesn't export the grain, except via the tracepoint/printk.
> But the more important question is: does the grain help us when handling
> the error info in userspace?
>
> It tells us that at this physical address with "grain" granularity we
> had an error. So?
While a certain number of corrected errors that happened on different, sparsed,
addresses may not mean a damaged memory, the same number of corrected errors
happening at the same physical address/grain means that the DRAM chip that
contains such address is damaged, so the corresponding DIMM needs to be
replaced.
So, the address/grain can be used by userspace algorithms to increase the
probability that a DIMM is damaged.
Regards,
Mauro.
I'm folding the following patch with this one:
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 5b06c43..18dde46 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -49,9 +49,9 @@ TRACE_EVENT(mc_event,
__field( unsigned int, mc_index )
__string( msg, error_msg )
__string( label, label )
- __field( int, layer0 )
- __field( int, layer1 )
- __field( int, layer2 )
+ __field( int, top_layer )
+ __field( int, middle_layer )
+ __field( int, lower_layer )
__field( int, address )
__field( int, grain )
__field( int, syndrome )
@@ -63,9 +63,9 @@ TRACE_EVENT(mc_event,
__entry->mc_index = mc_index;
__assign_str(msg, error_msg);
__assign_str(label, label);
- __entry->layer0 = layer0;
- __entry->layer1 = layer1;
- __entry->layer2 = layer2;
+ __entry->top_layer = layer0;
+ __entry->middle_layer = layer1;
+ __entry->lower_layer = layer2;
__entry->address = address;
__entry->grain = grain;
__entry->syndrome = syndrome;
@@ -80,9 +80,9 @@ TRACE_EVENT(mc_event,
__get_str(msg),
__get_str(label),
__entry->mc_index,
- __entry->layer0,
- __entry->layer1,
- __entry->layer2,
+ __entry->top_layer,
+ __entry->middle_layer,
+ __entry->lower_layer,
__entry->address,
__entry->grain,
__entry->syndrome,
next prev parent reply other threads:[~2012-05-24 16:13 UTC|newest]
Thread overview: 124+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-18 16:31 [PATCH EDAC v26 00/66] EDAC patches for v3.5 Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 01/66] edac: Create a dimm struct and move the labels into it Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 02/66] edac: move dimm properties to struct dimm_info Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 03/66] edac: Don't initialize csrow's first_page & friends when not needed Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 04/66] edac: move nr_pages to dimm struct Mauro Carvalho Chehab
2012-05-18 16:31 ` Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 05/66] edac: rewrite edac_align_ptr() Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 06/66] edac.h: Add generic layers for describing a memory location Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 07/66] edac: Change internal representation to work with layers Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 08/66] amd64_edac: convert driver to use the new edac ABI Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 09/66] amd76x_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 10/66] cell_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 11/66] cpc925_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 12/66] e752x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 13/66] e7xxx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 14/66] i3000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 15/66] i3200_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 16/66] i5000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 17/66] i5100_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 18/66] i5400_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 19/66] i7300_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 20/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 21/66] i82443bxgx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 22/66] i82860_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 23/66] i82875p_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 24/66] i82975x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 25/66] mpc85xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 26/66] mv64x60_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 27/66] pasemi_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 28/66] ppc4xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 29/66] r82600_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 30/66] sb_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 31/66] tile_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 32/66] x38_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 33/66] edac: Remove the legacy EDAC ABI Mauro Carvalho Chehab
2012-05-18 17:51 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 34/66] edac: Initialize the dimm label with the known information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 35/66] edac: Cleanup the logs for i7core and sb edac drivers Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 36/66] i5400_edac: improve debug messages to better represent the filled memory Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 37/66] RAS: Add a tracepoint for reporting memory controller events Mauro Carvalho Chehab
2012-05-24 10:14 ` [PATCH] " Mauro Carvalho Chehab
2012-05-24 10:56 ` Borislav Petkov
2012-05-24 16:13 ` Mauro Carvalho Chehab [this message]
2012-05-24 16:17 ` Mauro Carvalho Chehab
2012-05-24 16:45 ` Borislav Petkov
2012-05-24 18:00 ` Mauro Carvalho Chehab
2012-05-29 11:58 ` Borislav Petkov
2012-05-29 14:02 ` Mauro Carvalho Chehab
2012-05-29 14:52 ` Borislav Petkov
2012-05-29 15:23 ` Mauro Carvalho Chehab
2012-05-30 23:24 ` Luck, Tony
2012-05-31 10:00 ` Borislav Petkov
2012-05-31 10:33 ` Mauro Carvalho Chehab
2012-05-31 12:17 ` Borislav Petkov
2012-05-31 13:56 ` Mauro Carvalho Chehab
2012-05-31 14:22 ` Borislav Petkov
2012-05-31 14:44 ` Mauro Carvalho Chehab
2012-05-31 14:54 ` Borislav Petkov
2012-05-31 15:01 ` Mauro Carvalho Chehab
2012-05-31 15:14 ` Borislav Petkov
2012-05-31 16:14 ` Mauro Carvalho Chehab
2012-05-31 17:13 ` Borislav Petkov
2012-05-31 18:04 ` Mauro Carvalho Chehab
2012-05-31 18:33 ` Aristeu Rozanski
2012-05-31 19:37 ` Borislav Petkov
2012-05-31 19:32 ` Steven Rostedt
2012-05-31 19:42 ` Borislav Petkov
2012-05-31 20:11 ` Steven Rostedt
2012-05-31 20:18 ` Borislav Petkov
2012-05-31 20:52 ` Luck, Tony
2012-06-01 9:10 ` Borislav Petkov
2012-06-01 9:40 ` Chen Gong
2012-06-01 12:15 ` Mauro Carvalho Chehab
2012-06-01 15:42 ` Luck, Tony
2012-06-01 16:00 ` Borislav Petkov
2012-06-01 18:21 ` Luck, Tony
2012-06-01 23:00 ` Borislav Petkov
2012-06-01 23:19 ` Luck, Tony
2012-06-01 23:28 ` Borislav Petkov
2012-05-31 16:51 ` Luck, Tony
2012-05-31 17:20 ` Borislav Petkov
2012-05-31 18:14 ` Luck, Tony
2012-05-31 19:26 ` Borislav Petkov
2012-05-31 18:24 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 38/66] i5000_edac: Fix the logic that retrieves memory information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 39/66] e752x_edac: provide more info about how DIMMS/ranks are mapped Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 40/66] edac: Rename the parent dev to pdev Mauro Carvalho Chehab
2012-05-18 16:32 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 41/66] edac: use Documentation-nano format for some data structs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 42/66] edac: rewrite the sysfs code to use struct device Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 43/66] mpc85xx_edac: convert sysfs logic " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 44/66] amd64_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 45/66] i7core_edac: convert it " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 46/66] edac: Get rid of the old kobj's from the edac mc code Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 47/66] edac: add a new per-dimm API and make the old per-virtual-rank API obsolete Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 48/66] edac: add a sysfs node to report the maximum location for the system Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 49/66] edac: Add debufs nodes to allow doing fake error inject Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 50/66] edac: Move grain/dtype/edac_type calculus to be out of channel loop Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 51/66] i82975x_edac: Test nr_pages earlier to save a few CPU cycles Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 52/66] i5100_edac: Fix a warning when compiled with 32 bits Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 53/66] i7300_edac: Get rid of some wrongly-solved rebase conflict Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 54/66] edac: Only expose csrows/channels on legacy API if they're populated Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 55/66] edac: change the mem allocation scheme to make Documentation/kobject.txt happy Mauro Carvalho Chehab
2012-05-18 16:32 ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 56/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 57/66] edac: move documentation ABI to ABI/testing/sysfs-devices-edac Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 58/66] Edac: Add ABI Documentation for the new device nodes Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 59/66] i5000: Fix the fatal error handling Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 60/66] i7core: fix ranks information at the per-channel struct Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 61/66] edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 62/66] edac: Use more normal debugging macro style Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 63/66] edac: Convert debugfX to edac_dbg(X, Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 64/66] edac_mc: Cleanup per-dimm_info debug messages Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 65/66] edac: Increase version to 3.0.0 Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 66/66] edac_mc: check for allocation failure in edac_mc_alloc() Mauro Carvalho Chehab
2012-05-18 16:46 ` [PATCH EDAC v26 00/66] EDAC patches for v3.5 Borislav Petkov
2012-05-18 17:43 ` Mauro Carvalho Chehab
2012-05-18 17:53 ` Borislav Petkov
2012-05-28 15:46 ` Mauro Carvalho Chehab
2012-05-28 20:36 ` Borislav Petkov
2012-05-28 23:13 ` Mauro Carvalho Chehab
2012-05-29 2:40 ` Chen Gong
2012-05-29 11:45 ` Mauro Carvalho Chehab
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FBE5E1D.7070804@redhat.com \
--to=mchehab@redhat.com \
--cc=arozansk@redhat.com \
--cc=bp@amd64.org \
--cc=fweisbec@gmail.com \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=norsk5@yahoo.com \
--cc=rostedt@goodmis.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.