linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mauro Carvalho Chehab <mchehab@redhat.com>
To: Borislav Petkov <bp@amd64.org>
Cc: Linux Edac Mailing List <linux-edac@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Aristeu Rozanski <arozansk@redhat.com>,
	Doug Thompson <norsk5@yahoo.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events
Date: Thu, 24 May 2012 13:13:17 -0300	[thread overview]
Message-ID: <4FBE5E1D.7070804@redhat.com> (raw)
In-Reply-To: <20120524105604.GC27063@aftab.osrc.amd.com>

Em 24-05-2012 07:56, Borislav Petkov escreveu:
> On Thu, May 24, 2012 at 07:14:20AM -0300, Mauro Carvalho Chehab wrote:
>> + * Default error mechanisms for Memory Controller errors (CE and UE)
>> + */
>> +TRACE_EVENT(mc_event,
>> +
>> +	TP_PROTO(const unsigned int err_type,
>> +		 const unsigned int mc_index,
>> +		 const char *error_msg,
>> +		 const char *label,
>> +		 int layer0,
>> +		 int layer1,
>> +		 int layer2,
>> +		 unsigned long address,
>> +		 unsigned long grain,
>> +		 unsigned long syndrome,
>> +		 const char *driver_detail),
> 
> What about the comments I had about layer{0,1,2} and grain?
> 
> http://marc.info/?l=linux-edac&m=133769207124938&w=2

Sorry, I missed that email.


Em 22-05-2012 10:05, Borislav Petkov escreveu:
> On Tue, May 22, 2012 at 07:18:21AM -0300, Mauro Carvalho Chehab wrote:
>> Em 22-05-2012 06:28, Borislav Petkov escreveu:
>>> On Tue, May 22, 2012 at 12:04:48AM -0300, Mauro Carvalho Chehab wrote:
>>>> +TRACE_EVENT(mc_event,
>>>> +
>>>> +	TP_PROTO(const unsigned int err_type,
>>>> +		 const unsigned int mc_index,
>>>> +		 const char *error_msg,
>>>> +		 const char *label,
>>>> +		 int layer0,
>>>> +		 int layer1,
>>>> +		 int layer2,
>>>
>>> Those are EDAC-internal layer representation, why are they exported to
>>> userspace? Userspace needs only the location and label AFAICT.
>>
>> Those are not the EDAC internal layer representation. They're the physical
>> location of the DIMM or rank.
> 
> Ok, you've replaced the location char * with the layers.

Yes. 
> 
>>> If you export them to userspace, they need much more meaningful names -
>>> layer{0,1,2} mean nothing outside of the kernel.
>>
>> Ok. Do you have a better naming suggestion?
>>
>> What about layer0_pos, layer1_pos, layer2_pos?
> 
> Actually, I'd like them to be called channel/rank/row or something. Having them
> numbered I don't know which layer is the top layer (channel/branch/slot)
> and the lowest (rank/csrow/...)
> 
> Maybe top_layer, middle_layer, lowest_layer? Or something like that...

Works for me.

> 
>>>
>>>> +		 unsigned long pfn,
>>>> +		 unsigned long offset,
>>>> +		 unsigned long grain,
>>>
>>> Why aren't those a single 'unsigned long address' since they all are
>>> computed from it?
>>
>> We can merge pfn and offset into "unsigned long address".
> 
> Just have a single "unsigned long address" field and userspace can pick
> out the stuff it needs from it.

Ok.

>> With regards to the grain, it is an address mask, written with a "short" way.
>> So, grain 32, for example, means:
>> 	ffff:ffff:ffff:fffe0
>>
>> As the current EDAC API exports it as grain, IMO, it is better to keep it as-is,
>> but it won't be hard to do:
>> 	unsigned long mask = ((unsigned long) -1) && (1 - grain)
>>
>> What do you think?
> 
> Why are we even exporting grain actually with each tracepoint
> invocation? This is the granularity of reported error in bytes, and it,
> as such, is statically assigned to a value in each driver. Userspace can
> certainly figure out that value in a different way.

The API doesn't export the grain, except via the tracepoint/printk.

> But the more important question is: does the grain help us when handling
> the error info in userspace?
> 
> It tells us that at this physical address with "grain" granularity we
> had an error. So?

While a certain number of corrected errors that happened on different, sparsed,
addresses may not mean a damaged memory, the same number of corrected errors
happening at the same physical address/grain means that the DRAM chip that
contains such address is damaged, so the corresponding DIMM needs to be 
replaced.

So, the address/grain can be used by userspace algorithms to increase the
probability that a DIMM is damaged.

Regards,
Mauro.

I'm folding the following patch with this one:

diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 5b06c43..18dde46 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -49,9 +49,9 @@ TRACE_EVENT(mc_event,
 		__field(	unsigned int,	mc_index		)
 		__string(	msg,		error_msg		)
 		__string(	label,		label			)
-		__field(	int,		layer0			)
-		__field(	int,		layer1			)
-		__field(	int,		layer2			)
+		__field(	int,		top_layer		)
+		__field(	int,		middle_layer		)
+		__field(	int,		lower_layer		)
 		__field(	int,		address			)
 		__field(	int,		grain			)
 		__field(	int,		syndrome		)
@@ -63,9 +63,9 @@ TRACE_EVENT(mc_event,
 		__entry->mc_index		= mc_index;
 		__assign_str(msg, error_msg);
 		__assign_str(label, label);
-		__entry->layer0			= layer0;
-		__entry->layer1			= layer1;
-		__entry->layer2			= layer2;
+		__entry->top_layer		= layer0;
+		__entry->middle_layer		= layer1;
+		__entry->lower_layer		= layer2;
 		__entry->address		= address;
 		__entry->grain			= grain;
 		__entry->syndrome		= syndrome;
@@ -80,9 +80,9 @@ TRACE_EVENT(mc_event,
 		  __get_str(msg),
 		  __get_str(label),
 		  __entry->mc_index,
-		  __entry->layer0,
-		  __entry->layer1,
-		  __entry->layer2,
+		  __entry->top_layer,
+		  __entry->middle_layer,
+		  __entry->lower_layer,
 		  __entry->address,
 		  __entry->grain,
 		  __entry->syndrome,


  reply	other threads:[~2012-05-24 16:13 UTC|newest]

Thread overview: 118+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-05-18 16:31 [PATCH EDAC v26 00/66] EDAC patches for v3.5 Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 01/66] edac: Create a dimm struct and move the labels into it Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 03/66] edac: Don't initialize csrow's first_page & friends when not needed Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 04/66] edac: move nr_pages to dimm struct Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 05/66] edac: rewrite edac_align_ptr() Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 06/66] edac.h: Add generic layers for describing a memory location Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 08/66] amd64_edac: convert driver to use the new edac ABI Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 09/66] amd76x_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 10/66] cell_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 11/66] cpc925_edac: " Mauro Carvalho Chehab
2012-05-18 16:31 ` [PATCH EDAC v26 12/66] e752x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 13/66] e7xxx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 14/66] i3000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 15/66] i3200_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 16/66] i5000_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 17/66] i5100_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 18/66] i5400_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 19/66] i7300_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 20/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 21/66] i82443bxgx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 22/66] i82860_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 23/66] i82875p_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 24/66] i82975x_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 25/66] mpc85xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 26/66] mv64x60_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 27/66] pasemi_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 28/66] ppc4xx_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 29/66] r82600_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 30/66] sb_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 31/66] tile_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 32/66] x38_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 33/66] edac: Remove the legacy EDAC ABI Mauro Carvalho Chehab
2012-05-18 17:51   ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 34/66] edac: Initialize the dimm label with the known information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 35/66] edac: Cleanup the logs for i7core and sb edac drivers Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 36/66] i5400_edac: improve debug messages to better represent the filled memory Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 37/66] RAS: Add a tracepoint for reporting memory controller events Mauro Carvalho Chehab
2012-05-24 10:14   ` [PATCH] " Mauro Carvalho Chehab
2012-05-24 10:56     ` Borislav Petkov
2012-05-24 16:13       ` Mauro Carvalho Chehab [this message]
2012-05-24 16:17         ` Mauro Carvalho Chehab
2012-05-24 16:45         ` Borislav Petkov
2012-05-24 18:00           ` Mauro Carvalho Chehab
2012-05-29 11:58             ` Borislav Petkov
2012-05-29 14:02               ` Mauro Carvalho Chehab
2012-05-29 14:52                 ` Borislav Petkov
2012-05-29 15:23                   ` Mauro Carvalho Chehab
2012-05-30 23:24                     ` Luck, Tony
2012-05-31 10:00                       ` Borislav Petkov
2012-05-31 10:33                         ` Mauro Carvalho Chehab
2012-05-31 12:17                           ` Borislav Petkov
2012-05-31 13:56                             ` Mauro Carvalho Chehab
2012-05-31 14:22                               ` Borislav Petkov
2012-05-31 14:44                                 ` Mauro Carvalho Chehab
2012-05-31 14:54                                   ` Borislav Petkov
2012-05-31 15:01                                     ` Mauro Carvalho Chehab
2012-05-31 15:14                                       ` Borislav Petkov
2012-05-31 16:14                                         ` Mauro Carvalho Chehab
2012-05-31 17:13                                           ` Borislav Petkov
2012-05-31 18:04                                             ` Mauro Carvalho Chehab
2012-05-31 18:33                                               ` Aristeu Rozanski
2012-05-31 19:37                                               ` Borislav Petkov
2012-05-31 19:32                                             ` Steven Rostedt
2012-05-31 19:42                                               ` Borislav Petkov
2012-05-31 20:11                                                 ` Steven Rostedt
2012-05-31 20:18                                                   ` Borislav Petkov
2012-05-31 20:52                                                     ` Luck, Tony
2012-06-01  9:10                                                       ` Borislav Petkov
2012-06-01  9:40                                                         ` Chen Gong
2012-06-01 12:15                                                         ` Mauro Carvalho Chehab
2012-06-01 15:42                                                         ` Luck, Tony
2012-06-01 16:00                                                           ` Borislav Petkov
2012-06-01 18:21                                                             ` Luck, Tony
2012-06-01 23:00                                                               ` Borislav Petkov
2012-06-01 23:19                                                                 ` Luck, Tony
2012-06-01 23:28                                                                   ` Borislav Petkov
2012-05-31 16:51                         ` Luck, Tony
2012-05-31 17:20                           ` Borislav Petkov
2012-05-31 18:14                             ` Luck, Tony
2012-05-31 19:26                               ` Borislav Petkov
2012-05-31 18:24                             ` Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 38/66] i5000_edac: Fix the logic that retrieves memory information Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 39/66] e752x_edac: provide more info about how DIMMS/ranks are mapped Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 40/66] edac: Rename the parent dev to pdev Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 41/66] edac: use Documentation-nano format for some data structs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 42/66] edac: rewrite the sysfs code to use struct device Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 43/66] mpc85xx_edac: convert sysfs logic " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 44/66] amd64_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 45/66] i7core_edac: convert it " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 46/66] edac: Get rid of the old kobj's from the edac mc code Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 47/66] edac: add a new per-dimm API and make the old per-virtual-rank API obsolete Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 48/66] edac: add a sysfs node to report the maximum location for the system Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 49/66] edac: Add debufs nodes to allow doing fake error inject Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 50/66] edac: Move grain/dtype/edac_type calculus to be out of channel loop Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 51/66] i82975x_edac: Test nr_pages earlier to save a few CPU cycles Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 52/66] i5100_edac: Fix a warning when compiled with 32 bits Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 53/66] i7300_edac: Get rid of some wrongly-solved rebase conflict Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 54/66] edac: Only expose csrows/channels on legacy API if they're populated Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 55/66] edac: change the mem allocation scheme to make Documentation/kobject.txt happy Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 56/66] i7core_edac: " Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 57/66] edac: move documentation ABI to ABI/testing/sysfs-devices-edac Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 58/66] Edac: Add ABI Documentation for the new device nodes Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 59/66] i5000: Fix the fatal error handling Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 60/66] i7core: fix ranks information at the per-channel struct Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 61/66] edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 62/66] edac: Use more normal debugging macro style Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 63/66] edac: Convert debugfX to edac_dbg(X, Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 64/66] edac_mc: Cleanup per-dimm_info debug messages Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 65/66] edac: Increase version to 3.0.0 Mauro Carvalho Chehab
2012-05-18 16:32 ` [PATCH EDAC v26 66/66] edac_mc: check for allocation failure in edac_mc_alloc() Mauro Carvalho Chehab
2012-05-18 16:46 ` [PATCH EDAC v26 00/66] EDAC patches for v3.5 Borislav Petkov
2012-05-18 17:43   ` Mauro Carvalho Chehab
2012-05-18 17:53     ` Borislav Petkov
2012-05-28 15:46       ` Mauro Carvalho Chehab
2012-05-28 20:36         ` Borislav Petkov
2012-05-28 23:13           ` Mauro Carvalho Chehab
2012-05-29  2:40             ` Chen Gong
2012-05-29 11:45               ` Mauro Carvalho Chehab

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FBE5E1D.7070804@redhat.com \
    --to=mchehab@redhat.com \
    --cc=arozansk@redhat.com \
    --cc=bp@amd64.org \
    --cc=fweisbec@gmail.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=norsk5@yahoo.com \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).