All of lore.kernel.org
 help / color / mirror / Atom feed
From: Xiaofei Tan <tanxiaofei@huawei.com>
To: John Garry <john.garry@huawei.com>,
	Borislav Petkov <bp@alien8.de>,
	"Robert Richter" <rrichter@marvell.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>,
	Tony Luck <tony.luck@intel.com>,
	James Morse <james.morse@arm.com>,
	Aristeu Rozanski <aris@redhat.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Toshi Kani <toshi.kani@hpe.com>,
	Shiju Jose <shiju.jose@huawei.com>
Subject: Re: [PATCH 11/11] EDAC/ghes: Create one memory controller per physical memory array
Date: Tue, 24 Mar 2020 19:32:16 +0800	[thread overview]
Message-ID: <5E79EFC0.3040108@huawei.com> (raw)
In-Reply-To: <924f4c0e-1f9d-e7de-17cd-466eb3a74d90@huawei.com>


On 2020/3/18 0:34, John Garry wrote:
> On 16/03/2020 09:51, Borislav Petkov wrote:
>> On Fri, Mar 06, 2020 at 04:13:18PM +0100, Robert Richter wrote:
>>> The ghes driver only creates one memory controller for the whole
>>> system. This does not reflect memory topology especially in multi-node
>>> systems. E.g. a Marvell ThunderX2 system shows:
>>>
>>>   /sys/devices/system/edac/mc/mc0/dimm0
>>>   /sys/devices/system/edac/mc/mc0/dimm1
>>>   /sys/devices/system/edac/mc/mc0/dimm2
>>>   /sys/devices/system/edac/mc/mc0/dimm3
>>>   /sys/devices/system/edac/mc/mc0/dimm4
>>>   /sys/devices/system/edac/mc/mc0/dimm5
>>>   /sys/devices/system/edac/mc/mc0/dimm6
>>>   /sys/devices/system/edac/mc/mc0/dimm7
>>>   /sys/devices/system/edac/mc/mc0/dimm8
>>>   /sys/devices/system/edac/mc/mc0/dimm9
>>>   /sys/devices/system/edac/mc/mc0/dimm10
>>>   /sys/devices/system/edac/mc/mc0/dimm11
>>>   /sys/devices/system/edac/mc/mc0/dimm12
>>>   /sys/devices/system/edac/mc/mc0/dimm13
>>>   /sys/devices/system/edac/mc/mc0/dimm14
>>>   /sys/devices/system/edac/mc/mc0/dimm15
>>>
>>> The DIMMs 9-15 are located on the 2nd node of the system. On
>>> comparable x86 systems there is one memory controller per node. The
>>> ghes driver should also group DIMMs depending on the topology and
>>> create one MC per node.
>>>
>>> There are several options to detect the topology. ARM64 systems
>>> retrieve the (NUMA) node information from the ACPI SRAT table (see
>>> acpi_table_parse_srat()). The node id is later stored in the physical
>>> address page. The pfn_to_nid() macro could be used for a DIMM after
>>> determining its physical address. The drawback of this approach is
>>> that there are too many subsystems involved it depends on. It could
>>> easily break and makes the implementation complex. E.g. pfn_to_nid()
>>> can only be reliable used on mapped address ranges which is not always
>>> granted, there are various firmware instances involved which could be
>>> broken, or results may vary depending on NUMA settings.
>>>
>>> Another approach that was suggested by James' is to use the DIMM's
>>> physical memory array handle to group DIMMs [1]. The advantage is to
>>> only use the information on memory devices from the SMBIOS table that
>>> contains a reference to the physical memory array it belongs too. This
>>> information is mandatory same as the use of DIMM handle references by
>>> GHES to provide the DIMM location of an error. There is only a single
>>> table to parse which eases implementation. This patch uses this
>>> approach for DIMM grouping.
>>>
>>> Modify the DMI decoder to also detect the physical memory array a DIMM
>>> is linked to and create one memory controller per array to group
>>> DIMMs. With the change DIMMs are grouped, e.g. a ThunderX2 system
>>> shows one MC per node now:
>>>
>>>   # grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label
>>>   /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
>>>   /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
>>>   /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
>>>   /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
>>>   /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
>>>   /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
>>>   /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
>>>   /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
>>>   /sys/devices/system/edac/mc/mc1/dimm0/dimm_label:N1 DIMM_I0
>>>   /sys/devices/system/edac/mc/mc1/dimm1/dimm_label:N1 DIMM_J0
>>>   /sys/devices/system/edac/mc/mc1/dimm2/dimm_label:N1 DIMM_K0
>>>   /sys/devices/system/edac/mc/mc1/dimm3/dimm_label:N1 DIMM_L0
>>>   /sys/devices/system/edac/mc/mc1/dimm4/dimm_label:N1 DIMM_M0
>>>   /sys/devices/system/edac/mc/mc1/dimm5/dimm_label:N1 DIMM_N0
>>>   /sys/devices/system/edac/mc/mc1/dimm6/dimm_label:N1 DIMM_O0
>>>   /sys/devices/system/edac/mc/mc1/dimm7/dimm_label:N1 DIMM_P0
>>>
>>> [1] https://lkml.kernel.org/r/f878201f-f8fd-0f2a-5072-ba60c64eefaf@arm.com
>>>
>>> Suggested-by: James Morse <james.morse@arm.com>
>>> Signed-off-by: Robert Richter <rrichter@marvell.com>
>>> ---
>>>   drivers/edac/ghes_edac.c | 137 ++++++++++++++++++++++++++++++---------
>>>   1 file changed, 107 insertions(+), 30 deletions(-)
>>
>> This is all fine and good but that change affects the one x86 platform
>> we support so the whole patchset should be tested there too. Adding
>> Toshi.
>>
>> As a matter of fact, the final version of this set should be tested on
>> all platforms which are using this thing. Adding John Garry too who
>> reported issues with this driver recently on his platform.
> 
> Adding other RAS-centric guys for H.

Hi John & Borislav & Robert
I have tested this patch set on our platform. Only one memory controller found when there is one DIMM on
each socket or node. Just like this:
estuary:/$ grep . /sys/devices/system/edac/mc/mc*/dimm*/dimm_label
/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:SOCKET 0 CHANNEL 0 DIMM 0 DIMM0
/sys/devices/system/edac/mc/mc0/dimm20/dimm_label:SOCKET 1 CHANNEL 2 DIMM 0 DIMM1

It is not the problem of the patch set. Because our BIOS only defined one "Physical Memory Array Handle" in DMI table.
Just like this:
estuary:/$ dmidecode -t memory | grep "Array Handle"
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000
        Array Handle: 0x0000

BTW, i also test other function of edac driver our platform used. They're all good. :)
> 
> Cheers,
> John
> 
>>
>> Thx.
>>
> 
> 
> .
> 

-- 
 thanks
tanxiaofei


  parent reply	other threads:[~2020-03-24 11:32 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-06 15:13 [PATCH 00/11] EDAC/ghes: Cleanup, rework and improvement of memory reporting Robert Richter
2020-03-06 15:13 ` [PATCH 01/11] EDAC/mc: Use int type for parameters of edac_mc_alloc() Robert Richter
2020-04-06 11:38   ` Matthias Brugger
2020-03-06 15:13 ` [PATCH 02/11] EDAC/ghes: Setup DIMM label from DMI and use it in error reports Robert Richter
2020-03-06 15:13 ` [PATCH 03/11] EDAC/ghes: Remove local variable rdr_mask in ghes_edac_dmidecode() Robert Richter
2020-04-06 11:51   ` Matthias Brugger
2020-03-06 15:13 ` [PATCH 04/11] EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_mci Robert Richter
2020-03-06 15:13 ` [PATCH 05/11] EDAC/ghes: Cleanup struct ghes_edac_dimm_fill, rename it to ghes_dimm_fill Robert Richter
2020-03-16  9:29   ` Borislav Petkov
2020-03-06 15:13 ` [PATCH 06/11] EDAC/ghes: Carve out MC device handling into separate functions Robert Richter
2020-03-16  9:31   ` Borislav Petkov
2020-03-16 12:12     ` Robert Richter
2020-03-06 15:13 ` [PATCH 07/11] EDAC/ghes: Have a separate code path for creating the fake MC Robert Richter
2020-03-06 15:13 ` [PATCH 08/11] EDAC/ghes: Carve out code into ghes_edac_register_{one,fake}() Robert Richter
2020-03-06 15:13 ` [PATCH 09/11] EDAC/ghes: Implement DIMM mapping table for SMBIOS handles Robert Richter
2020-03-16  9:40   ` Borislav Petkov
2020-03-06 15:13 ` [PATCH 10/11] EDAC/ghes: Create an own device for each mci Robert Richter
2020-03-16  9:45   ` Borislav Petkov
2020-03-06 15:13 ` [PATCH 11/11] EDAC/ghes: Create one memory controller per physical memory array Robert Richter
2020-03-16  9:51   ` Borislav Petkov
2020-03-17 16:34     ` John Garry
2020-03-17 22:14       ` Kani, Toshi
2020-03-17 22:50         ` Borislav Petkov
2020-03-17 22:53           ` Kani, Toshi
2020-03-18  0:10             ` Robert Richter
2020-03-24 11:32       ` Xiaofei Tan [this message]
2020-03-10 20:18 ` [PATCH 00/11] EDAC/ghes: Cleanup, rework and improvement of memory reporting Aristeu Rozanski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5E79EFC0.3040108@huawei.com \
    --to=tanxiaofei@huawei.com \
    --cc=aris@redhat.com \
    --cc=bp@alien8.de \
    --cc=james.morse@arm.com \
    --cc=john.garry@huawei.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=rrichter@marvell.com \
    --cc=shiju.jose@huawei.com \
    --cc=tony.luck@intel.com \
    --cc=toshi.kani@hpe.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.