From: "M K, Muralidhara" <muralimk@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
mchehab@kernel.org, Muralidhara M K <muralidhara.mk@amd.com>,
Yazen Ghannam <yazen.ghannam@amd.com>
Subject: Re: [PATCH v2 1/4] EDAC/mce_amd: Remove SMCA Extended Error code descriptions
Date: Thu, 26 Oct 2023 15:12:22 +0530
Message-ID: <b3b21eaa-226f-e78f-14e3-09e2e02e38d6@amd.com>
In-Reply-To: <20231025190818.GDZTlnomnaT8zxnbxX@fat_crate.local>
Hi Boris,
On 10/26/2023 12:38 AM, Borislav Petkov wrote:
> On Wed, Oct 25, 2023 at 05:14:52AM +0000, Muralidhara M K wrote:
>> The SMCA error decoding already exists in rasdaemon and future bank decoding
>> is supported from below patches merged in rasdaemon.
>> https://github.com/mchehab/rasdaemon/commit/1f74a59ee33b7448b00d7ba13d5ecd4918b9853c rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types
>> https://github.com/mchehab/rasdaemon/commit/2d15882a0cbfce0b905039bebc811ac8311cd739 rasdaemon: Handle reassigned bit definitions for UMC bank
>>
>
> I'm still missing here the exact steps a user needs to do in order to
> decode such an error.
>
> Please inject an error, catch the error message and show me how one is
> supposed to decode it with rasdaemon in case the daemon is not running
> while the error happens or the error is fatal and the machine doesn't
> even get to run userspace.
>
> If that is not possible with rasdaemon yet, then this patch should not
> remove the error descriptions but limit them only to the families for
> which they're valid.
>
> Bottom line is, I don't want to have the situation mcelog is in where
> decoding errors with it is a total disaster.
>
> IOW, I'd like error decoding on AMD to always work and be trivially easy
> to do.
>
I have injected an error; the dmesg log is below:
[ 3991.560180] mce: [Hardware Error]: Machine check events logged
[ 3991.560195] [Hardware Error]: Corrected error, no action required.
[ 3991.567119] [Hardware Error]: CPU:2 (19:90:0) MC25_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[ 3991.579205] [Hardware Error]: Error Addr: 0x0000000000000040
[ 3991.585546] [Hardware Error]: PPIN: 0xabcdef0000000000
[ 3991.591302] [Hardware Error]: IPID: 0x0000009600792f00, Syndrome: 0x000000000a000000
[ 3991.599977] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[ 3991.599985] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
In the above log, "Ext. Error Code: 0" shows that only the numeric
error code is printed, because this patch removes the error strings.
The user can refer to the PPR to check what the error code means, or
the rasdaemon tool can print the respective error string for a
particular error code.
Output from running rasdaemon:
rasdaemon: Listening to events for cpus 0 to 191
<...>-1420 [002] .... 0.000399 mce_record 2023-10-26
04:28:37 -0500 Unified Memory Controller (bank=25), status=
dc2040000000011b, Corrected error, no action required.,
mci=Error_overflow CECC, mca=DRAM On Die ECC error. Ext Err Code: 0
Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic',
memory_die_id=0, cpu_type= AMD Scalable MCA, cpu= 2, socketid= 0, misc=
d01a000201000000, addr= 40, synd= a000000, ipid= 9600792f00,
mcgstatus=0, mcgcap= 140, apicid= 4
From the logs, we can see "DRAM On Die ECC error", which corresponds to
Ext Err Code: 0. So the error strings are maintained in rasdaemon.
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>