From: Shuai Xue <xueshuai@linux.alibaba.com>
To: Will Deacon <will@kernel.org>, "Luck, Tony" <tony.luck@intel.com>
Cc: catalin.marinas@arm.com, James.Bottomley@HansenPartnership.com,
	deller@gmx.de, dave.hansen@linux.intel.com, luto@kernel.org,
	peterz@infradead.org, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-parisc@vger.kernel.org
Subject: Re: [PATCH] HWPOISON: add a pr_err message when forcibly send a sigbus
Date: Thu, 31 Aug 2023 11:29:48 +0800
Message-ID: <d1c8c0fa-815f-6804-e4e5-89a5259e4bb1@linux.alibaba.com>
In-Reply-To: <20230830221814.GB30121@willie-the-truck>



On 2023/8/31 06:18, Will Deacon wrote:
> On Mon, Aug 28, 2023 at 09:41:55AM +0800, Shuai Xue wrote:
>> On 2023/8/22 09:15, Shuai Xue wrote:
>>> On 2023/8/21 18:50, Will Deacon wrote:
>>>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>>>> index 3fe516b32577..38e2186882bd 100644
>>>>> --- a/arch/arm64/mm/fault.c
>>>>> +++ b/arch/arm64/mm/fault.c
>>>>> @@ -679,6 +679,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>>  	} else if (fault & (VM_FAULT_HWPOISON_LARGE | VM_FAULT_HWPOISON)) {
>>>>>  		unsigned int lsb;
>>>>>  
>>>>> +		pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
>>>>> +		       current->comm, current->pid, far);
>>>>>  		lsb = PAGE_SHIFT;
>>>>>  		if (fault & VM_FAULT_HWPOISON_LARGE)
>>>>>  			lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
>>>>
>>>> Hmm, I'm not convinced by this. We have 'show_unhandled_signals' already,
>>>> and there's plenty of code in memory-failure.c for handling poisoned pages
>>>> reported by e.g. GHES. I don't think dumping extra messages in dmesg from
>>>> the arch code really adds anything.
>>>
>>> I see the show_unhandled_signals() will dump the stack but it rely on
>>> /proc/sys/debug/exception-trace be set.
>>>
>>> The memory failure is the top issue in our production cloud and also other hyperscalers.
>>> We have received complaints from our operations engineers and end users that processes
>>> are being inexplicably killed :(. Could you please consider add a message?
> 
> I don't have any objection to logging this stuff somehow, I'm just not
> convinced that the console is the best place for that information in 2023.
> Is there really nothing better?
> 

Hi, Will,

I agree that the console might not be the best place, but it still plays an important role.
IMO the most direct way for an end user to find out what happened is to look at
dmesg. In addition, we have deployed a log store service that collects dmesg from the
whole cluster via /var/log/kern.
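
For illustration only (this is not part of the patch): a minimal userspace
sketch of how a log collector could pick such lines out of dmesg output,
assuming the message format of the pr_err() in the quoted diff. The helper
name parse_hwpoison_kill() is hypothetical, and the parsing assumes the task
name itself contains no ':'.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one kernel log line in the format of the
 * pr_err() from the quoted patch:
 *   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx"
 * Returns 1 and fills comm/pid/addr on a match, 0 otherwise.
 * Assumption: the task name does not contain ':'.
 */
static int parse_hwpoison_kill(const char *line, char *comm, size_t comm_len,
			       int *pid, unsigned long *addr)
{
	/* Skip any timestamp or log-level prefix in front of the message. */
	const char *p = strstr(line, "MCE: Killing ");
	char name[64];

	if (!p)
		return 0;

	/* %63[^:] reads the task name up to the ':' that precedes the PID. */
	if (sscanf(p,
		   "MCE: Killing %63[^:]:%d due to hardware memory corruption fault at %lx",
		   name, pid, addr) != 3)
		return 0;

	/* Copy with truncation so the caller's buffer is always terminated. */
	snprintf(comm, comm_len, "%s", name);
	return 1;
}
```

A collector could call this on each line it reads from the kernel log and
raise an alert with the task name, PID and faulting address whenever it
returns 1.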

Do you have a better alternative in mind?

+ @Tony for ERST
I found that since the /dev/mcelog driver was deprecated, neither x86 nor ARM64 platforms
support collecting MCE records of the previous boot from persistent storage via APEI ERST.
I propose adding a mechanism to do this for rasdaemon. Do you have any suggestions?

Thank you.
Best Regards,
Shuai


Thread overview: 13+ messages
2023-08-19 10:22 [PATCH] HWPOISON: add a pr_err message when forcibly send a sigbus Shuai Xue
2023-08-19 11:49 ` Helge Deller
2023-08-21 10:50 ` Will Deacon
2023-08-21 11:59   ` Helge Deller
2023-08-22  1:15   ` Shuai Xue
2023-08-28  1:41     ` Shuai Xue
2023-08-30 22:18       ` Will Deacon
2023-08-31  3:29         ` Shuai Xue [this message]
2023-08-31  9:06           ` Helge Deller
2023-09-04 10:40             ` Shuai Xue
2023-10-12  8:16               ` Shuai Xue
2023-08-31 16:56           ` Luck, Tony
2023-09-04 10:39             ` Shuai Xue
