Re: [PATCH] kexec/crash: no crash update when kexec in progress

linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
       [not found] <20240731152738.194893-1-sourabhjain@linux.ibm.com>
@ 2024-08-01  2:34 ` Michael Ellerman
  2024-08-01  6:51   ` Sourabh Jain
       [not found] ` <Zqs8veRya7v/pXEt@MiWiFi-R3L-srv>
  1 sibling, 1 reply; 14+ messages in thread
From: Michael Ellerman @ 2024-08-01  2:34 UTC (permalink / raw)
  To: Sourabh Jain, bhe
  Cc: x86, kexec, linux-kernel, Sourabh Jain, Sachin P Bappalige,
	linuxppc-dev, Hari Bathini

Sourabh Jain <sourabhjain@linux.ibm.com> writes:
> The following errors are observed when kexec is done with SMT=off on
> powerpc.
>
> [  358.458385] Removing IBM Power 842 compression device
> [  374.795734] kexec_core: Starting new kernel
> [  374.795748] kexec: Waking offline cpu 1.
> [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
> [  374.935833] kexec: Waking offline cpu 2.
> [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
> snip..
> [  375.515823] kexec: Waking offline cpu 6.
> [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
> [  375.695836] kexec: Waking offline cpu 7.

Are they actually errors though? Do they block the actual kexec from
happening? Or are they just warnings in dmesg?

Because the fix looks like it could be racy.

cheers

> During kexec, the offline CPUs are brought online, which triggers the
> crash hotplug handler `crash_handle_hotplug_event()` to update the kdump
> image. Given that the system is on the kexec path and the kexec lock is
> taken, the `crash_handle_hotplug_event()` function fails to take the
> same lock to update the kdump image, resulting in the above error
> messages.
>
> To fix this, let's return from `crash_handle_hotplug_event()` if kexec
> is in progress.
>
> The same applies to the `crash_check_hotplug_support()` function.
> Return 0 if kexec is in progress.
>
> Cc: Hari Bathini <hbathini@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: kexec@lists.infradead.org
> Cc: linuxppc-dev@ozlabs.org
> Cc: linux-kernel@vger.kernel.org
> Cc: x86@kernel.org
> Reported-by: Sachin P Bappalige <sachinpb@linux.vnet.ibm.com>
> Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
> ---
>  kernel/crash_core.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 63cf89393c6e..d37a16d5c3a1 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -502,6 +502,9 @@ int crash_check_hotplug_support(void)
>  {
>  	int rc = 0;
>  
> +	if (kexec_in_progress)
> +		return 0;
> +
>  	crash_hotplug_lock();
>  	/* Obtain lock while reading crash information */
>  	if (!kexec_trylock()) {
> @@ -537,6 +540,9 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
>  {
>  	struct kimage *image;
>  
> +	if (kexec_in_progress)
> +		return;
> +
>  	crash_hotplug_lock();
>  	/* Obtain lock while changing crash information */
>  	if (!kexec_trylock()) {
> -- 
> 2.45.2
>
>
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-08-01  2:34 ` [PATCH] kexec/crash: no crash update when kexec in progress Michael Ellerman
@ 2024-08-01  6:51   ` Sourabh Jain
  2024-08-19  4:15     ` Sourabh Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Sourabh Jain @ 2024-08-01  6:51 UTC (permalink / raw)
  To: Michael Ellerman, bhe
  Cc: x86, kexec, linux-kernel, Sachin P Bappalige, linuxppc-dev,
	Hari Bathini

Hello Michael,

On 01/08/24 08:04, Michael Ellerman wrote:
> Sourabh Jain <sourabhjain@linux.ibm.com> writes:
>> The following errors are observed when kexec is done with SMT=off on
>> powerpc.
>>
>> [  358.458385] Removing IBM Power 842 compression device
>> [  374.795734] kexec_core: Starting new kernel
>> [  374.795748] kexec: Waking offline cpu 1.
>> [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>> [  374.935833] kexec: Waking offline cpu 2.
>> [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>> snip..
>> [  375.515823] kexec: Waking offline cpu 6.
>> [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>> [  375.695836] kexec: Waking offline cpu 7.
> Are they actually errors though? Do they block the actual kexec from
> happening? Or are they just warnings in dmesg?

The kexec kernel boots fine.

This warning appears regardless of whether the kdump kernel is loaded.

However, when the kdump kernel is loaded, we will not be able to update 
the kdump image (FDT).
I think this should be fine given that kexec is in progress.

Please let me know your opinion.

> Because the fix looks like it could be racy.

It seems like it is racy, but given that kexec takes the lock first and then
brings the CPU up, which triggers the kdump image, which always fails to
update the kdump image because it could not take the same lock.

Note: the kexec lock is not released unless kexec boot fails.

Thanks,
Sourabh Jain

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
       [not found] ` <Zqs8veRya7v/pXEt@MiWiFi-R3L-srv>
@ 2024-08-01  8:06   ` Sourabh Jain
  0 siblings, 0 replies; 14+ messages in thread
From: Sourabh Jain @ 2024-08-01  8:06 UTC (permalink / raw)
  To: Baoquan He
  Cc: x86, kexec, linux-kernel, Sachin P Bappalige, linuxppc-dev,
	Hari Bathini

Hello Baoquan,

On 01/08/24 13:13, Baoquan He wrote:
> On 07/31/24 at 08:57pm, Sourabh Jain wrote:
>> The following errors are observed when kexec is done with SMT=off on
>> powerpc.
>>
>> [  358.458385] Removing IBM Power 842 compression device
>> [  374.795734] kexec_core: Starting new kernel
>> [  374.795748] kexec: Waking offline cpu 1.
>> [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>> [  374.935833] kexec: Waking offline cpu 2.
>> [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>> snip..
>> [  375.515823] kexec: Waking offline cpu 6.
>> [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
>> [  375.695836] kexec: Waking offline cpu 7.
>>
>> During kexec, the offline CPUs are brought online, which triggers the
> Is this a generic action or specific on ppc about the offline CPUs being
> brought line during kexec?

I think it is powerpc specific.
Patch that introduced this on powerpc: e8e5c2155b003 ("powerpc/kexec: 
Fix orphaned offline CPUs across kexec")

- Sourabh Jain


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-08-01  6:51   ` Sourabh Jain
@ 2024-08-19  4:15     ` Sourabh Jain
  2024-08-19  6:15       ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Sourabh Jain @ 2024-08-19  4:15 UTC (permalink / raw)
  To: Michael Ellerman, bhe
  Cc: Hari Bathini, kexec, linuxppc-dev, linux-kernel, x86,
	Sachin P Bappalige

Hello Michael and Boaquan

On 01/08/24 12:21, Sourabh Jain wrote:
> Hello Michael,
>
> On 01/08/24 08:04, Michael Ellerman wrote:
>> Sourabh Jain <sourabhjain@linux.ibm.com> writes:
>>> The following errors are observed when kexec is done with SMT=off on
>>> powerpc.
>>>
>>> [  358.458385] Removing IBM Power 842 compression device
>>> [  374.795734] kexec_core: Starting new kernel
>>> [  374.795748] kexec: Waking offline cpu 1.
>>> [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may be 
>>> inaccurate
>>> [  374.935833] kexec: Waking offline cpu 2.
>>> [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may be 
>>> inaccurate
>>> snip..
>>> [  375.515823] kexec: Waking offline cpu 6.
>>> [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may be 
>>> inaccurate
>>> [  375.695836] kexec: Waking offline cpu 7.
>> Are they actually errors though? Do they block the actual kexec from
>> happening? Or are they just warnings in dmesg?
>
> The kexec kernel boots fine.
>
> This warning appears regardless of whether the kdump kernel is loaded.
>
> However, when the kdump kernel is loaded, we will not be able to 
> update the kdump image (FDT).
> I think this should be fine given that kexec is in progress.
>
> Please let me know your opinion.
>
>> Because the fix looks like it could be racy.
>
> It seems like it is racy, but given that kexec takes the lock first 
> and then
> brings the CPU up, which triggers the kdump image, which always fails to
> update the kdump image because it could not take the same lock.
>
> Note: the kexec lock is not released unless kexec boot fails.

Any comments or suggestions on this fix?

Thanks,
Sourabh Jain



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-08-19  4:15     ` Sourabh Jain
@ 2024-08-19  6:15       ` Baoquan He
  2024-08-20  6:40         ` Sourabh Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2024-08-19  6:15 UTC (permalink / raw)
  To: Sourabh Jain
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

On 08/19/24 at 09:45am, Sourabh Jain wrote:
> Hello Michael and Boaquan
> 
> On 01/08/24 12:21, Sourabh Jain wrote:
> > Hello Michael,
> > 
> > On 01/08/24 08:04, Michael Ellerman wrote:
> > > Sourabh Jain <sourabhjain@linux.ibm.com> writes:
> > > > The following errors are observed when kexec is done with SMT=off on
> > > > powerpc.
> > > > 
> > > > [  358.458385] Removing IBM Power 842 compression device
> > > > [  374.795734] kexec_core: Starting new kernel
> > > > [  374.795748] kexec: Waking offline cpu 1.
> > > > [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may
> > > > be inaccurate
> > > > [  374.935833] kexec: Waking offline cpu 2.
> > > > [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may
> > > > be inaccurate
> > > > snip..
> > > > [  375.515823] kexec: Waking offline cpu 6.
> > > > [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may
> > > > be inaccurate
> > > > [  375.695836] kexec: Waking offline cpu 7.
> > > Are they actually errors though? Do they block the actual kexec from
> > > happening? Or are they just warnings in dmesg?
> > 
> > The kexec kernel boots fine.
> > 
> > This warning appears regardless of whether the kdump kernel is loaded.
> > 
> > However, when the kdump kernel is loaded, we will not be able to update
> > the kdump image (FDT).
> > I think this should be fine given that kexec is in progress.
> > 
> > Please let me know your opinion.
> > 
> > > Because the fix looks like it could be racy.
> > 
> > It seems like it is racy, but given that kexec takes the lock first and
> > then
> > brings the CPU up, which triggers the kdump image, which always fails to
> > update the kdump image because it could not take the same lock.
> > 
> > Note: the kexec lock is not released unless kexec boot fails.
> 
> Any comments or suggestions on this fix?

Is this a little better?

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 63cf89393c6e..0355ffb712f4 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
 
 	crash_hotplug_lock();
 	/* Obtain lock while reading crash information */
-	if (!kexec_trylock()) {
+	if (!kexec_trylock() && kexec_in_progress) {
 		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
 		crash_hotplug_unlock();
 		return 0;
@@ -539,7 +539,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
 
 	crash_hotplug_lock();
 	/* Obtain lock while changing crash information */
-	if (!kexec_trylock()) {
+	if (!kexec_trylock() && kexec_in_progress) {
 		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
 		crash_hotplug_unlock();
 		return;



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-08-19  6:15       ` Baoquan He
@ 2024-08-20  6:40         ` Sourabh Jain
  2024-08-30 11:17           ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Sourabh Jain @ 2024-08-20  6:40 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

Hello Baoquan,

On 19/08/24 11:45, Baoquan He wrote:
> On 08/19/24 at 09:45am, Sourabh Jain wrote:
>> Hello Michael and Boaquan
>>
>> On 01/08/24 12:21, Sourabh Jain wrote:
>>> Hello Michael,
>>>
>>> On 01/08/24 08:04, Michael Ellerman wrote:
>>>> Sourabh Jain <sourabhjain@linux.ibm.com> writes:
>>>>> The following errors are observed when kexec is done with SMT=off on
>>>>> powerpc.
>>>>>
>>>>> [  358.458385] Removing IBM Power 842 compression device
>>>>> [  374.795734] kexec_core: Starting new kernel
>>>>> [  374.795748] kexec: Waking offline cpu 1.
>>>>> [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may
>>>>> be inaccurate
>>>>> [  374.935833] kexec: Waking offline cpu 2.
>>>>> [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may
>>>>> be inaccurate
>>>>> snip..
>>>>> [  375.515823] kexec: Waking offline cpu 6.
>>>>> [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may
>>>>> be inaccurate
>>>>> [  375.695836] kexec: Waking offline cpu 7.
>>>> Are they actually errors though? Do they block the actual kexec from
>>>> happening? Or are they just warnings in dmesg?
>>> The kexec kernel boots fine.
>>>
>>> This warning appears regardless of whether the kdump kernel is loaded.
>>>
>>> However, when the kdump kernel is loaded, we will not be able to update
>>> the kdump image (FDT).
>>> I think this should be fine given that kexec is in progress.
>>>
>>> Please let me know your opinion.
>>>
>>>> Because the fix looks like it could be racy.
>>> It seems like it is racy, but given that kexec takes the lock first and
>>> then
>>> brings the CPU up, which triggers the kdump image, which always fails to
>>> update the kdump image because it could not take the same lock.
>>>
>>> Note: the kexec lock is not released unless kexec boot fails.
>> Any comments or suggestions on this fix?
> Is this a little better?
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 63cf89393c6e..0355ffb712f4 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
>   
>   	crash_hotplug_lock();
>   	/* Obtain lock while reading crash information */
> -	if (!kexec_trylock()) {
> +	if (!kexec_trylock() && kexec_in_progress) {
>   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>   		crash_hotplug_unlock();
>   		return 0;
> @@ -539,7 +539,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
>   
>   	crash_hotplug_lock();
>   	/* Obtain lock while changing crash information */
> -	if (!kexec_trylock()) {
> +	if (!kexec_trylock() && kexec_in_progress) {
>   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>   		crash_hotplug_unlock();
>   		return;

Ideally, when `kexec_in_progress` is True, there should be no way to 
acquire the kexec lock.
Therefore, calling `kexec_trylock()` before checking `kexec_in_progress` 
is not helpful.
The kernel will print the error message "kexec_trylock() failed, 
elfcorehdr may be inaccurate."
So, with the above changes, the original problem remains unsolved.

However, after closely inspecting the 
`kernel/kexec_core.c:kernel_kexec()` function, I discovered
an exceptional case where my patch needs an update. The issue arises 
when the system returns
from the `machine_kexec()` function, which indicates that kexec has failed.

In this scenario, the kexec lock is released, but `kexec_in_progress` 
remains True.

I am unsure why `kexec_in_progress` is NOT set to False when kexec 
fails. Was this by design,
or was it an oversight because returning from the `machine_kexec()` 
function is highly unlikely?

Here is my proposal to address the original problem along with the 
exceptional case I described
above.

Let's implement two patches:

1. A patch that sets `kexec_in_progress` to False if the system returns 
from `machine_kexec()` before
    unlocking the kexec lock in the `kernel_kexec()` function.

    ```
    diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
    index c0caa14880c3..b41277183455 100644
    --- a/kernel/kexec_core.c
    +++ b/kernel/kexec_core.c
    @@ -1069,6 +1069,7 @@ int kernel_kexec(void)
    #endif

     Unlock:
    +      kexec_in_progress = false;
            kexec_unlock();
            return error;
     ```

2. A patch to return early from the `crash_handle_hotplug_event()` 
function if `kexec_in_progress` is
    set to True. This is essentially my original patch.

Please share your comments on the new approach.

Thank you for review.

- Sourabh Jain


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-08-20  6:40         ` Sourabh Jain
@ 2024-08-30 11:17           ` Baoquan He
  2024-09-04  9:25             ` Sourabh Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2024-08-30 11:17 UTC (permalink / raw)
  To: Sourabh Jain
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

On 08/20/24 at 12:10pm, Sourabh Jain wrote:
> Hello Baoquan,
> 
> On 19/08/24 11:45, Baoquan He wrote:
> > On 08/19/24 at 09:45am, Sourabh Jain wrote:
> > > Hello Michael and Boaquan
> > > 
> > > On 01/08/24 12:21, Sourabh Jain wrote:
> > > > Hello Michael,
> > > > 
> > > > On 01/08/24 08:04, Michael Ellerman wrote:
> > > > > Sourabh Jain <sourabhjain@linux.ibm.com> writes:
> > > > > > The following errors are observed when kexec is done with SMT=off on
> > > > > > powerpc.
> > > > > > 
> > > > > > [  358.458385] Removing IBM Power 842 compression device
> > > > > > [  374.795734] kexec_core: Starting new kernel
> > > > > > [  374.795748] kexec: Waking offline cpu 1.
> > > > > > [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may
> > > > > > be inaccurate
> > > > > > [  374.935833] kexec: Waking offline cpu 2.
> > > > > > [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may
> > > > > > be inaccurate
> > > > > > snip..
> > > > > > [  375.515823] kexec: Waking offline cpu 6.
> > > > > > [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may
> > > > > > be inaccurate
> > > > > > [  375.695836] kexec: Waking offline cpu 7.
> > > > > Are they actually errors though? Do they block the actual kexec from
> > > > > happening? Or are they just warnings in dmesg?
> > > > The kexec kernel boots fine.
> > > > 
> > > > This warning appears regardless of whether the kdump kernel is loaded.
> > > > 
> > > > However, when the kdump kernel is loaded, we will not be able to update
> > > > the kdump image (FDT).
> > > > I think this should be fine given that kexec is in progress.
> > > > 
> > > > Please let me know your opinion.
> > > > 
> > > > > Because the fix looks like it could be racy.
> > > > It seems like it is racy, but given that kexec takes the lock first and
> > > > then
> > > > brings the CPU up, which triggers the kdump image, which always fails to
> > > > update the kdump image because it could not take the same lock.
> > > > 
> > > > Note: the kexec lock is not released unless kexec boot fails.
> > > Any comments or suggestions on this fix?
> > Is this a little better?
> > 
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 63cf89393c6e..0355ffb712f4 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
> >   	crash_hotplug_lock();
> >   	/* Obtain lock while reading crash information */
> > -	if (!kexec_trylock()) {
> > +	if (!kexec_trylock() && kexec_in_progress) {
> >   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> >   		crash_hotplug_unlock();
> >   		return 0;
> > @@ -539,7 +539,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
> >   	crash_hotplug_lock();
> >   	/* Obtain lock while changing crash information */
> > -	if (!kexec_trylock()) {
> > +	if (!kexec_trylock() && kexec_in_progress) {
> >   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> >   		crash_hotplug_unlock();
> >   		return;
> 
> Ideally, when `kexec_in_progress` is True, there should be no way to acquire
> the kexec lock.
> Therefore, calling `kexec_trylock()` before checking `kexec_in_progress` is
> not helpful.
> The kernel will print the error message "kexec_trylock() failed, elfcorehdr
> may be inaccurate."
> So, with the above changes, the original problem remains unsolved.
> 
> However, after closely inspecting the `kernel/kexec_core.c:kernel_kexec()`
> function, I discovered
> an exceptional case where my patch needs an update. The issue arises when
> the system returns
> from the `machine_kexec()` function, which indicates that kexec has failed.
> 
> In this scenario, the kexec lock is released, but `kexec_in_progress`
> remains True.
> 
> I am unsure why `kexec_in_progress` is NOT set to False when kexec fails.
> Was this by design,
> or was it an oversight because returning from the `machine_kexec()` function
> is highly unlikely?
> 
> Here is my proposal to address the original problem along with the
> exceptional case I described
> above.
> 
> Let's implement two patches:
> 
> 1. A patch that sets `kexec_in_progress` to False if the system returns from
> `machine_kexec()` before

I don't think we have chance to return from machine_kexec() after
triggering kexec/kdump jumping. The KEXEC_JUMP could return, but I'v
never heard people using it.

>    unlocking the kexec lock in the `kernel_kexec()` function.
> 
>    ```
>    diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>    index c0caa14880c3..b41277183455 100644
>    --- a/kernel/kexec_core.c
>    +++ b/kernel/kexec_core.c
>    @@ -1069,6 +1069,7 @@ int kernel_kexec(void)
>    #endif
> 
>     Unlock:
>    +      kexec_in_progress = false;
>            kexec_unlock();
>            return error;
>     ```
> 
> 2. A patch to return early from the `crash_handle_hotplug_event()` function
> if `kexec_in_progress` is
>    set to True. This is essentially my original patch.

There's a race gap between the kexec_in_progress checking and the
setting it to true which Michael has mentioned. That's why I think
maybe checking kexec_in_progress after failing to retriving
__kexec_lock is a little better, not very sure.

> 
> Please share your comments on the new approach.
> 
> Thank you for review.
> 
> - Sourabh Jain
> 



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-08-30 11:17           ` Baoquan He
@ 2024-09-04  9:25             ` Sourabh Jain
  2024-09-05  3:23               ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Sourabh Jain @ 2024-09-04  9:25 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

Hello Baoquan,

On 30/08/24 16:47, Baoquan He wrote:
> On 08/20/24 at 12:10pm, Sourabh Jain wrote:
>> Hello Baoquan,
>>
>> On 19/08/24 11:45, Baoquan He wrote:
>>> On 08/19/24 at 09:45am, Sourabh Jain wrote:
>>>> Hello Michael and Boaquan
>>>>
>>>> On 01/08/24 12:21, Sourabh Jain wrote:
>>>>> Hello Michael,
>>>>>
>>>>> On 01/08/24 08:04, Michael Ellerman wrote:
>>>>>> Sourabh Jain <sourabhjain@linux.ibm.com> writes:
>>>>>>> The following errors are observed when kexec is done with SMT=off on
>>>>>>> powerpc.
>>>>>>>
>>>>>>> [  358.458385] Removing IBM Power 842 compression device
>>>>>>> [  374.795734] kexec_core: Starting new kernel
>>>>>>> [  374.795748] kexec: Waking offline cpu 1.
>>>>>>> [  374.875695] crash hp: kexec_trylock() failed, elfcorehdr may
>>>>>>> be inaccurate
>>>>>>> [  374.935833] kexec: Waking offline cpu 2.
>>>>>>> [  375.015664] crash hp: kexec_trylock() failed, elfcorehdr may
>>>>>>> be inaccurate
>>>>>>> snip..
>>>>>>> [  375.515823] kexec: Waking offline cpu 6.
>>>>>>> [  375.635667] crash hp: kexec_trylock() failed, elfcorehdr may
>>>>>>> be inaccurate
>>>>>>> [  375.695836] kexec: Waking offline cpu 7.
>>>>>> Are they actually errors though? Do they block the actual kexec from
>>>>>> happening? Or are they just warnings in dmesg?
>>>>> The kexec kernel boots fine.
>>>>>
>>>>> This warning appears regardless of whether the kdump kernel is loaded.
>>>>>
>>>>> However, when the kdump kernel is loaded, we will not be able to update
>>>>> the kdump image (FDT).
>>>>> I think this should be fine given that kexec is in progress.
>>>>>
>>>>> Please let me know your opinion.
>>>>>
>>>>>> Because the fix looks like it could be racy.
>>>>> It seems like it is racy, but given that kexec takes the lock first and
>>>>> then
>>>>> brings the CPU up, which triggers the kdump image, which always fails to
>>>>> update the kdump image because it could not take the same lock.
>>>>>
>>>>> Note: the kexec lock is not released unless kexec boot fails.
>>>> Any comments or suggestions on this fix?
>>> Is this a little better?
>>>
>>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>>> index 63cf89393c6e..0355ffb712f4 100644
>>> --- a/kernel/crash_core.c
>>> +++ b/kernel/crash_core.c
>>> @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
>>>    	crash_hotplug_lock();
>>>    	/* Obtain lock while reading crash information */
>>> -	if (!kexec_trylock()) {
>>> +	if (!kexec_trylock() && kexec_in_progress) {
>>>    		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>    		crash_hotplug_unlock();
>>>    		return 0;
>>> @@ -539,7 +539,7 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
>>>    	crash_hotplug_lock();
>>>    	/* Obtain lock while changing crash information */
>>> -	if (!kexec_trylock()) {
>>> +	if (!kexec_trylock() && kexec_in_progress) {
>>>    		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>    		crash_hotplug_unlock();
>>>    		return;
>> Ideally, when `kexec_in_progress` is True, there should be no way to acquire
>> the kexec lock.
>> Therefore, calling `kexec_trylock()` before checking `kexec_in_progress` is
>> not helpful.
>> The kernel will print the error message "kexec_trylock() failed, elfcorehdr
>> may be inaccurate."
>> So, with the above changes, the original problem remains unsolved.
>>
>> However, after closely inspecting the `kernel/kexec_core.c:kernel_kexec()`
>> function, I discovered
>> an exceptional case where my patch needs an update. The issue arises when
>> the system returns
>> from the `machine_kexec()` function, which indicates that kexec has failed.
>>
>> In this scenario, the kexec lock is released, but `kexec_in_progress`
>> remains True.
>>
>> I am unsure why `kexec_in_progress` is NOT set to False when kexec fails.
>> Was this by design,
>> or was it an oversight because returning from the `machine_kexec()` function
>> is highly unlikely?
>>
>> Here is my proposal to address the original problem along with the
>> exceptional case I described
>> above.
>>
>> Let's implement two patches:
>>
>> 1. A patch that sets `kexec_in_progress` to False if the system returns from
>> `machine_kexec()` before
> I don't think we have chance to return from machine_kexec() after
> triggering kexec/kdump jumping. The KEXEC_JUMP could return, but I'v
> never heard people using it.

Agree, on most arch there is no return from machine_kexec function.
So lets drop the above idea of resetting kexec_in_progress.


>
>>     unlocking the kexec lock in the `kernel_kexec()` function.
>>
>>     ```
>>     diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>     index c0caa14880c3..b41277183455 100644
>>     --- a/kernel/kexec_core.c
>>     +++ b/kernel/kexec_core.c
>>     @@ -1069,6 +1069,7 @@ int kernel_kexec(void)
>>     #endif
>>
>>      Unlock:
>>     +      kexec_in_progress = false;
>>             kexec_unlock();
>>             return error;
>>      ```
>>
>> 2. A patch to return early from the `crash_handle_hotplug_event()` function
>> if `kexec_in_progress` is
>>     set to True. This is essentially my original patch.
> There's a race gap between the kexec_in_progress checking and the
> setting it to true which Michael has mentioned.

The window where kernel is holding kexec_lock to do kexec boot
but kexec_in_progress is yet not set to True.

If kernel needs to handle crash hotplug event, the function
crash_handle_hotplug_event()  will not get the kexec_lock and
error out by printing error message about not able to update
kdump image.

I think it should be fine. Given that lock is already taken for
kexec kernel boot.

Am I missing something major?

> That's why I think
> maybe checking kexec_in_progress after failing to retriving
> __kexec_lock is a little better, not very sure.

Try for kexec lock before kexec_in_progress check will not solve
the original problem this patch trying to solve.

You proposed the below changes earlier:

-	if (!kexec_trylock()) {
+	if (!kexec_trylock() && kexec_in_progress) {
  		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
  		crash_hotplug_unlock();


Once the kexec_in_progress is set to True there is no way one can get
kexec_lock. So kexec_trylock() before kexec_in_progress is not helpful
for the problem I am trying to solve.


Thanks,
Sourabh Jain





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-09-04  9:25             ` Sourabh Jain
@ 2024-09-05  3:23               ` Baoquan He
  2024-09-05  8:37                 ` Sourabh Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2024-09-05  3:23 UTC (permalink / raw)
  To: Sourabh Jain
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

On 09/04/24 at 02:55pm, Sourabh Jain wrote:
> Hello Baoquan,
> 
> On 30/08/24 16:47, Baoquan He wrote:
> > On 08/20/24 at 12:10pm, Sourabh Jain wrote:
> > > Hello Baoquan,
> > > 
......snip... 
> > > 2. A patch to return early from the `crash_handle_hotplug_event()` function
> > > if `kexec_in_progress` is
> > >     set to True. This is essentially my original patch.
> > There's a race gap between the kexec_in_progress checking and the
> > setting it to true which Michael has mentioned.
> 
> The window where kernel is holding kexec_lock to do kexec boot
> but kexec_in_progress is yet not set to True.
> 
> If kernel needs to handle crash hotplug event, the function
> crash_handle_hotplug_event()  will not get the kexec_lock and
> error out by printing error message about not able to update
> kdump image.

But you wanted to avoid the erroring out if it's being in
kernel_kexec().  Now you are seeing at least one the noising 
message, aren't you?

> 
> I think it should be fine. Given that lock is already taken for
> kexec kernel boot.
> 
> Am I missing something major?
> 
> > That's why I think
> > maybe checking kexec_in_progress after failing to retriving
> > __kexec_lock is a little better, not very sure.
> 
> Try for kexec lock before kexec_in_progress check will not solve
> the original problem this patch trying to solve.
> 
> You proposed the below changes earlier:
> 
> -	if (!kexec_trylock()) {
> +	if (!kexec_trylock() && kexec_in_progress) {
>  		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>  		crash_hotplug_unlock();

Ah, I meant as below, but wrote it mistakenly.

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 63cf89393c6e..e7c7aa761f46 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
 
 	crash_hotplug_lock();
 	/* Obtain lock while reading crash information */
-	if (!kexec_trylock()) {
+	if (!kexec_trylock() && !kexec_in_progress) {
 		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
 		crash_hotplug_unlock();
 		return 0;


> 
> 
> Once the kexec_in_progress is set to True there is no way one can get
> kexec_lock. So kexec_trylock() before kexec_in_progress is not helpful
> for the problem I am trying to solve.

With your patch, you could still get the error message if the race gap
exist. With above change, you won't get it. Please correct me if I am
wrong.



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-09-05  3:23               ` Baoquan He
@ 2024-09-05  8:37                 ` Sourabh Jain
  2024-09-08 10:30                   ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Sourabh Jain @ 2024-09-05  8:37 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

Hello Baoquan,

On 05/09/24 08:53, Baoquan He wrote:
> On 09/04/24 at 02:55pm, Sourabh Jain wrote:
>> Hello Baoquan,
>>
>> On 30/08/24 16:47, Baoquan He wrote:
>>> On 08/20/24 at 12:10pm, Sourabh Jain wrote:
>>>> Hello Baoquan,
>>>>
> ......snip...
>>>> 2. A patch to return early from the `crash_handle_hotplug_event()` function
>>>> if `kexec_in_progress` is
>>>>      set to True. This is essentially my original patch.
>>> There's a race gap between the kexec_in_progress checking and the
>>> setting it to true which Michael has mentioned.
>> The window where kernel is holding kexec_lock to do kexec boot
>> but kexec_in_progress is yet not set to True.
>>
>> If kernel needs to handle crash hotplug event, the function
>> crash_handle_hotplug_event()  will not get the kexec_lock and
>> error out by printing error message about not able to update
>> kdump image.
> But you wanted to avoid the erroring out if it's being in
> kernel_kexec().  Now you are seeing at least one the noising
> message, aren't you?

Yes, but it is very rare to encounter.

My comments on your updated code are inline below.

>
>> I think it should be fine. Given that lock is already taken for
>> kexec kernel boot.
>>
>> Am I missing something major?
>>
>>> That's why I think
>>> maybe checking kexec_in_progress after failing to retriving
>>> __kexec_lock is a little better, not very sure.
>> Try for kexec lock before kexec_in_progress check will not solve
>> the original problem this patch trying to solve.
>>
>> You proposed the below changes earlier:
>>
>> -	if (!kexec_trylock()) {
>> +	if (!kexec_trylock() && kexec_in_progress) {
>>   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>   		crash_hotplug_unlock();
> Ah, I meant as below, but wrote it mistakenly.
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 63cf89393c6e..e7c7aa761f46 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
>   
>   	crash_hotplug_lock();
>   	/* Obtain lock while reading crash information */
> -	if (!kexec_trylock()) {
> +	if (!kexec_trylock() && !kexec_in_progress) {
>   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>   		crash_hotplug_unlock();
>   		return 0;
>
>
>>
>> Once the kexec_in_progress is set to True there is no way one can get
>> kexec_lock. So kexec_trylock() before kexec_in_progress is not helpful
>> for the problem I am trying to solve.
> With your patch, you could still get the error message if the race gap
> exist. With above change, you won't get it. Please correct me if I am
> wrong.

The above code will print an error message during the race gap. Here's why:

Let’s say the kexec lock is acquired in the kernel_kexec() function,
but kexec_in_progress is not yet set to True. In this scenario, the code 
will print
an error message.

There is another issue I see with the above code:

Consider that the system is on the kexec kernel boot path, and 
kexec_in_progress
is set to True. If crash_hotplug_unlock() is called, the kernel will not 
only update
the kdump image without acquiring the kexec lock, but it will also 
release the
kexec lock in the out label. I believe this is incorrect.

Please share your thoughts.

Thanks,
Sourabh Jain




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-09-05  8:37                 ` Sourabh Jain
@ 2024-09-08 10:30                   ` Baoquan He
  2024-09-09  5:05                     ` Sourabh Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2024-09-08 10:30 UTC (permalink / raw)
  To: Sourabh Jain
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

On 09/05/24 at 02:07pm, Sourabh Jain wrote:
> Hello Baoquan,
> 
> On 05/09/24 08:53, Baoquan He wrote:
> > On 09/04/24 at 02:55pm, Sourabh Jain wrote:
> > > Hello Baoquan,
> > > 
> > > On 30/08/24 16:47, Baoquan He wrote:
> > > > On 08/20/24 at 12:10pm, Sourabh Jain wrote:
> > > > > Hello Baoquan,
> > > > > 
> > ......snip...
> > > > > 2. A patch to return early from the `crash_handle_hotplug_event()` function
> > > > > if `kexec_in_progress` is
> > > > >      set to True. This is essentially my original patch.
> > > > There's a race gap between the kexec_in_progress checking and the
> > > > setting it to true which Michael has mentioned.
> > > The window where kernel is holding kexec_lock to do kexec boot
> > > but kexec_in_progress is yet not set to True.
> > > 
> > > If kernel needs to handle crash hotplug event, the function
> > > crash_handle_hotplug_event()  will not get the kexec_lock and
> > > error out by printing error message about not able to update
> > > kdump image.
> > But you wanted to avoid the erroring out if it's being in
> > kernel_kexec().  Now you are seeing at least one the noising
> > message, aren't you?
> 
> Yes, but it is very rare to encounter.
> 
> My comments on your updated code are inline below.
> 
> > 
> > > I think it should be fine. Given that lock is already taken for
> > > kexec kernel boot.
> > > 
> > > Am I missing something major?
> > > 
> > > > That's why I think
> > > > maybe checking kexec_in_progress after failing to retriving
> > > > __kexec_lock is a little better, not very sure.
> > > Try for kexec lock before kexec_in_progress check will not solve
> > > the original problem this patch trying to solve.
> > > 
> > > You proposed the below changes earlier:
> > > 
> > > -	if (!kexec_trylock()) {
> > > +	if (!kexec_trylock() && kexec_in_progress) {
> > >   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> > >   		crash_hotplug_unlock();
> > Ah, I meant as below, but wrote it mistakenly.
> > 
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 63cf89393c6e..e7c7aa761f46 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
> >   	crash_hotplug_lock();
> >   	/* Obtain lock while reading crash information */
> > -	if (!kexec_trylock()) {
> > +	if (!kexec_trylock() && !kexec_in_progress) {
> >   		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> >   		crash_hotplug_unlock();
> >   		return 0;
> > 
> > 
> > > 
> > > Once the kexec_in_progress is set to True there is no way one can get
> > > kexec_lock. So kexec_trylock() before kexec_in_progress is not helpful
> > > for the problem I am trying to solve.
> > With your patch, you could still get the error message if the race gap
> > exist. With above change, you won't get it. Please correct me if I am
> > wrong.
> 
> The above code will print an error message during the race gap. Here's why:
> 
> Let’s say the kexec lock is acquired in the kernel_kexec() function,
> but kexec_in_progress is not yet set to True. In this scenario, the code
> will print
> an error message.
> 
> There is another issue I see with the above code:
> 
> Consider that the system is on the kexec kernel boot path, and
> kexec_in_progress
> is set to True. If crash_hotplug_unlock() is called, the kernel will not
> only update
> the kdump image without acquiring the kexec lock, but it will also release
> the
> kexec lock in the out label. I believe this is incorrect.
> 
> Please share your thoughts.

How about this?

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 63cf89393c6e..8ba7b1da0ded 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -505,7 +505,8 @@ int crash_check_hotplug_support(void)
 	crash_hotplug_lock();
 	/* Obtain lock while reading crash information */
 	if (!kexec_trylock()) {
-		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
+		if (!kexec_in_progress)
+			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
 		crash_hotplug_unlock();
 		return 0;
 	}
@@ -540,7 +541,8 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
 	crash_hotplug_lock();
 	/* Obtain lock while changing crash information */
 	if (!kexec_trylock()) {
-		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
+		if (!kexec_in_progress)
+			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
 		crash_hotplug_unlock();
 		return;
 	}



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-09-08 10:30                   ` Baoquan He
@ 2024-09-09  5:05                     ` Sourabh Jain
  2024-09-09  5:23                       ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Sourabh Jain @ 2024-09-09  5:05 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige



On 08/09/24 16:00, Baoquan He wrote:
> On 09/05/24 at 02:07pm, Sourabh Jain wrote:
>> Hello Baoquan,
>>
>> On 05/09/24 08:53, Baoquan He wrote:
>>> On 09/04/24 at 02:55pm, Sourabh Jain wrote:
>>>> Hello Baoquan,
>>>>
>>>> On 30/08/24 16:47, Baoquan He wrote:
>>>>> On 08/20/24 at 12:10pm, Sourabh Jain wrote:
>>>>>> Hello Baoquan,
>>>>>>
>>> ......snip...
>>>>>> 2. A patch to return early from the `crash_handle_hotplug_event()` function
>>>>>> if `kexec_in_progress` is
>>>>>>       set to True. This is essentially my original patch.
>>>>> There's a race gap between the kexec_in_progress checking and the
>>>>> setting it to true which Michael has mentioned.
>>>> The window where kernel is holding kexec_lock to do kexec boot
>>>> but kexec_in_progress is yet not set to True.
>>>>
>>>> If kernel needs to handle crash hotplug event, the function
>>>> crash_handle_hotplug_event()  will not get the kexec_lock and
>>>> error out by printing error message about not able to update
>>>> kdump image.
>>> But you wanted to avoid the erroring out if it's being in
>>> kernel_kexec().  Now you are seeing at least one the noising
>>> message, aren't you?
>> Yes, but it is very rare to encounter.
>>
>> My comments on your updated code are inline below.
>>
>>>> I think it should be fine. Given that lock is already taken for
>>>> kexec kernel boot.
>>>>
>>>> Am I missing something major?
>>>>
>>>>> That's why I think
>>>>> maybe checking kexec_in_progress after failing to retriving
>>>>> __kexec_lock is a little better, not very sure.
>>>> Try for kexec lock before kexec_in_progress check will not solve
>>>> the original problem this patch trying to solve.
>>>>
>>>> You proposed the below changes earlier:
>>>>
>>>> -	if (!kexec_trylock()) {
>>>> +	if (!kexec_trylock() && kexec_in_progress) {
>>>>    		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>>    		crash_hotplug_unlock();
>>> Ah, I meant as below, but wrote it mistakenly.
>>>
>>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>>> index 63cf89393c6e..e7c7aa761f46 100644
>>> --- a/kernel/crash_core.c
>>> +++ b/kernel/crash_core.c
>>> @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
>>>    	crash_hotplug_lock();
>>>    	/* Obtain lock while reading crash information */
>>> -	if (!kexec_trylock()) {
>>> +	if (!kexec_trylock() && !kexec_in_progress) {
>>>    		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>    		crash_hotplug_unlock();
>>>    		return 0;
>>>
>>>
>>>> Once the kexec_in_progress is set to True there is no way one can get
>>>> kexec_lock. So kexec_trylock() before kexec_in_progress is not helpful
>>>> for the problem I am trying to solve.
>>> With your patch, you could still get the error message if the race gap
>>> exist. With above change, you won't get it. Please correct me if I am
>>> wrong.
>> The above code will print an error message during the race gap. Here's why:
>>
>> Let’s say the kexec lock is acquired in the kernel_kexec() function,
>> but kexec_in_progress is not yet set to True. In this scenario, the code
>> will print
>> an error message.
>>
>> There is another issue I see with the above code:
>>
>> Consider that the system is on the kexec kernel boot path, and
>> kexec_in_progress
>> is set to True. If crash_hotplug_unlock() is called, the kernel will not
>> only update
>> the kdump image without acquiring the kexec lock, but it will also release
>> the
>> kexec lock in the out label. I believe this is incorrect.
>>
>> Please share your thoughts.
> How about this?
>
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 63cf89393c6e..8ba7b1da0ded 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -505,7 +505,8 @@ int crash_check_hotplug_support(void)
>   	crash_hotplug_lock();
>   	/* Obtain lock while reading crash information */
>   	if (!kexec_trylock()) {
> -		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> +		if (!kexec_in_progress)
> +			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>   		crash_hotplug_unlock();
>   		return 0;
>   	}
> @@ -540,7 +541,8 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
>   	crash_hotplug_lock();
>   	/* Obtain lock while changing crash information */
>   	if (!kexec_trylock()) {
> -		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> +		if (!kexec_in_progress)
> +			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>   		crash_hotplug_unlock();
>   		return;
>   	}

Yes putting pr_info under kexec in progress check would work.

I will rebase the patch on top on next-20240906 to avoid conflict with
https://lore.kernel.org/all/20240812041651.703156-1-sourabhjain@linux.ibm.com/T/#u
and send v2.

Thanks,
Sourabh Jain



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-09-09  5:05                     ` Sourabh Jain
@ 2024-09-09  5:23                       ` Baoquan He
  2024-09-09  5:31                         ` Sourabh Jain
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2024-09-09  5:23 UTC (permalink / raw)
  To: Sourabh Jain
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

On 09/09/24 at 10:35am, Sourabh Jain wrote:
> 
> 
> On 08/09/24 16:00, Baoquan He wrote:
> > On 09/05/24 at 02:07pm, Sourabh Jain wrote:
> > > Hello Baoquan,
> > > 
> > > On 05/09/24 08:53, Baoquan He wrote:
> > > > On 09/04/24 at 02:55pm, Sourabh Jain wrote:
> > > > > Hello Baoquan,
> > > > > 
> > > > > On 30/08/24 16:47, Baoquan He wrote:
> > > > > > On 08/20/24 at 12:10pm, Sourabh Jain wrote:
> > > > > > > Hello Baoquan,
> > > > > > > 
> > > > ......snip...
> > > > > > > 2. A patch to return early from the `crash_handle_hotplug_event()` function
> > > > > > > if `kexec_in_progress` is
> > > > > > >       set to True. This is essentially my original patch.
> > > > > > There's a race gap between the kexec_in_progress checking and the
> > > > > > setting it to true which Michael has mentioned.
> > > > > The window where kernel is holding kexec_lock to do kexec boot
> > > > > but kexec_in_progress is yet not set to True.
> > > > > 
> > > > > If kernel needs to handle crash hotplug event, the function
> > > > > crash_handle_hotplug_event()  will not get the kexec_lock and
> > > > > error out by printing error message about not able to update
> > > > > kdump image.
> > > > But you wanted to avoid the erroring out if it's being in
> > > > kernel_kexec().  Now you are seeing at least one the noising
> > > > message, aren't you?
> > > Yes, but it is very rare to encounter.
> > > 
> > > My comments on your updated code are inline below.
> > > 
> > > > > I think it should be fine. Given that lock is already taken for
> > > > > kexec kernel boot.
> > > > > 
> > > > > Am I missing something major?
> > > > > 
> > > > > > That's why I think
> > > > > > maybe checking kexec_in_progress after failing to retriving
> > > > > > __kexec_lock is a little better, not very sure.
> > > > > Try for kexec lock before kexec_in_progress check will not solve
> > > > > the original problem this patch trying to solve.
> > > > > 
> > > > > You proposed the below changes earlier:
> > > > > 
> > > > > -	if (!kexec_trylock()) {
> > > > > +	if (!kexec_trylock() && kexec_in_progress) {
> > > > >    		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> > > > >    		crash_hotplug_unlock();
> > > > Ah, I meant as below, but wrote it mistakenly.
> > > > 
> > > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > > > index 63cf89393c6e..e7c7aa761f46 100644
> > > > --- a/kernel/crash_core.c
> > > > +++ b/kernel/crash_core.c
> > > > @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
> > > >    	crash_hotplug_lock();
> > > >    	/* Obtain lock while reading crash information */
> > > > -	if (!kexec_trylock()) {
> > > > +	if (!kexec_trylock() && !kexec_in_progress) {
> > > >    		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> > > >    		crash_hotplug_unlock();
> > > >    		return 0;
> > > > 
> > > > 
> > > > > Once the kexec_in_progress is set to True there is no way one can get
> > > > > kexec_lock. So kexec_trylock() before kexec_in_progress is not helpful
> > > > > for the problem I am trying to solve.
> > > > With your patch, you could still get the error message if the race gap
> > > > exist. With above change, you won't get it. Please correct me if I am
> > > > wrong.
> > > The above code will print an error message during the race gap. Here's why:
> > > 
> > > Let’s say the kexec lock is acquired in the kernel_kexec() function,
> > > but kexec_in_progress is not yet set to True. In this scenario, the code
> > > will print
> > > an error message.
> > > 
> > > There is another issue I see with the above code:
> > > 
> > > Consider that the system is on the kexec kernel boot path, and
> > > kexec_in_progress
> > > is set to True. If crash_hotplug_unlock() is called, the kernel will not
> > > only update
> > > the kdump image without acquiring the kexec lock, but it will also release
> > > the
> > > kexec lock in the out label. I believe this is incorrect.
> > > 
> > > Please share your thoughts.
> > How about this?
> > 
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 63cf89393c6e..8ba7b1da0ded 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -505,7 +505,8 @@ int crash_check_hotplug_support(void)
> >   	crash_hotplug_lock();
> >   	/* Obtain lock while reading crash information */
> >   	if (!kexec_trylock()) {
> > -		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> > +		if (!kexec_in_progress)
> > +			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> >   		crash_hotplug_unlock();
> >   		return 0;
> >   	}
> > @@ -540,7 +541,8 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
> >   	crash_hotplug_lock();
> >   	/* Obtain lock while changing crash information */
> >   	if (!kexec_trylock()) {
> > -		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> > +		if (!kexec_in_progress)
> > +			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
> >   		crash_hotplug_unlock();
> >   		return;
> >   	}
> 
> Yes putting pr_info under kexec in progress check would work.
> 
> I will rebase the patch on top on next-20240906 to avoid conflict with
> https://lore.kernel.org/all/20240812041651.703156-1-sourabhjain@linux.ibm.com/T/#u
> and send v2.

Great. When you repost, can you please also add why ppc will hot add cpu
into patch log when crash triggered? Otherwise other people may be
confused when reading code here or trace back the code change.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] kexec/crash: no crash update when kexec in progress
  2024-09-09  5:23                       ` Baoquan He
@ 2024-09-09  5:31                         ` Sourabh Jain
  0 siblings, 0 replies; 14+ messages in thread
From: Sourabh Jain @ 2024-09-09  5:31 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michael Ellerman, Hari Bathini, kexec, linuxppc-dev, linux-kernel,
	x86, Sachin P Bappalige

Hello Baoquan,

On 09/09/24 10:53, Baoquan He wrote:
> On 09/09/24 at 10:35am, Sourabh Jain wrote:
>>
>> On 08/09/24 16:00, Baoquan He wrote:
>>> On 09/05/24 at 02:07pm, Sourabh Jain wrote:
>>>> Hello Baoquan,
>>>>
>>>> On 05/09/24 08:53, Baoquan He wrote:
>>>>> On 09/04/24 at 02:55pm, Sourabh Jain wrote:
>>>>>> Hello Baoquan,
>>>>>>
>>>>>> On 30/08/24 16:47, Baoquan He wrote:
>>>>>>> On 08/20/24 at 12:10pm, Sourabh Jain wrote:
>>>>>>>> Hello Baoquan,
>>>>>>>>
>>>>> ......snip...
>>>>>>>> 2. A patch to return early from the `crash_handle_hotplug_event()` function
>>>>>>>> if `kexec_in_progress` is
>>>>>>>>        set to True. This is essentially my original patch.
>>>>>>> There's a race gap between the kexec_in_progress checking and the
>>>>>>> setting it to true which Michael has mentioned.
>>>>>> The window where kernel is holding kexec_lock to do kexec boot
>>>>>> but kexec_in_progress is yet not set to True.
>>>>>>
>>>>>> If kernel needs to handle crash hotplug event, the function
>>>>>> crash_handle_hotplug_event()  will not get the kexec_lock and
>>>>>> error out by printing error message about not able to update
>>>>>> kdump image.
>>>>> But you wanted to avoid the erroring out if it's being in
>>>>> kernel_kexec().  Now you are seeing at least one the noising
>>>>> message, aren't you?
>>>> Yes, but it is very rare to encounter.
>>>>
>>>> My comments on your updated code are inline below.
>>>>
>>>>>> I think it should be fine. Given that lock is already taken for
>>>>>> kexec kernel boot.
>>>>>>
>>>>>> Am I missing something major?
>>>>>>
>>>>>>> That's why I think
>>>>>>> maybe checking kexec_in_progress after failing to retriving
>>>>>>> __kexec_lock is a little better, not very sure.
>>>>>> Try for kexec lock before kexec_in_progress check will not solve
>>>>>> the original problem this patch trying to solve.
>>>>>>
>>>>>> You proposed the below changes earlier:
>>>>>>
>>>>>> -	if (!kexec_trylock()) {
>>>>>> +	if (!kexec_trylock() && kexec_in_progress) {
>>>>>>     		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>>>>     		crash_hotplug_unlock();
>>>>> Ah, I meant as below, but wrote it mistakenly.
>>>>>
>>>>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>>>>> index 63cf89393c6e..e7c7aa761f46 100644
>>>>> --- a/kernel/crash_core.c
>>>>> +++ b/kernel/crash_core.c
>>>>> @@ -504,7 +504,7 @@ int crash_check_hotplug_support(void)
>>>>>     	crash_hotplug_lock();
>>>>>     	/* Obtain lock while reading crash information */
>>>>> -	if (!kexec_trylock()) {
>>>>> +	if (!kexec_trylock() && !kexec_in_progress) {
>>>>>     		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>>>     		crash_hotplug_unlock();
>>>>>     		return 0;
>>>>>
>>>>>
>>>>>> Once the kexec_in_progress is set to True there is no way one can get
>>>>>> kexec_lock. So kexec_trylock() before kexec_in_progress is not helpful
>>>>>> for the problem I am trying to solve.
>>>>> With your patch, you could still get the error message if the race gap
>>>>> exist. With above change, you won't get it. Please correct me if I am
>>>>> wrong.
>>>> The above code will print an error message during the race gap. Here's why:
>>>>
>>>> Let’s say the kexec lock is acquired in the kernel_kexec() function,
>>>> but kexec_in_progress is not yet set to True. In this scenario, the code
>>>> will print
>>>> an error message.
>>>>
>>>> There is another issue I see with the above code:
>>>>
>>>> Consider that the system is on the kexec kernel boot path, and
>>>> kexec_in_progress
>>>> is set to True. If crash_hotplug_unlock() is called, the kernel will not
>>>> only update
>>>> the kdump image without acquiring the kexec lock, but it will also release
>>>> the
>>>> kexec lock in the out label. I believe this is incorrect.
>>>>
>>>> Please share your thoughts.
>>> How about this?
>>>
>>> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
>>> index 63cf89393c6e..8ba7b1da0ded 100644
>>> --- a/kernel/crash_core.c
>>> +++ b/kernel/crash_core.c
>>> @@ -505,7 +505,8 @@ int crash_check_hotplug_support(void)
>>>    	crash_hotplug_lock();
>>>    	/* Obtain lock while reading crash information */
>>>    	if (!kexec_trylock()) {
>>> -		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>> +		if (!kexec_in_progress)
>>> +			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>    		crash_hotplug_unlock();
>>>    		return 0;
>>>    	}
>>> @@ -540,7 +541,8 @@ static void crash_handle_hotplug_event(unsigned int hp_action, unsigned int cpu,
>>>    	crash_hotplug_lock();
>>>    	/* Obtain lock while changing crash information */
>>>    	if (!kexec_trylock()) {
>>> -		pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>> +		if (!kexec_in_progress)
>>> +			pr_info("kexec_trylock() failed, elfcorehdr may be inaccurate\n");
>>>    		crash_hotplug_unlock();
>>>    		return;
>>>    	}
>> Yes putting pr_info under kexec in progress check would work.
>>
>> I will rebase the patch on top on next-20240906 to avoid conflict with
>> https://lore.kernel.org/all/20240812041651.703156-1-sourabhjain@linux.ibm.com/T/#u
>> and send v2.
> Great. When you repost, can you please also add why ppc will hot add cpu
> into patch log when crash triggered? Otherwise other people may be
> confused when reading code here or trace back the code change.

Sure, I will add it.

- Sourabh Jain


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-09-09  5:32 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240731152738.194893-1-sourabhjain@linux.ibm.com>
2024-08-01  2:34 ` [PATCH] kexec/crash: no crash update when kexec in progress Michael Ellerman
2024-08-01  6:51   ` Sourabh Jain
2024-08-19  4:15     ` Sourabh Jain
2024-08-19  6:15       ` Baoquan He
2024-08-20  6:40         ` Sourabh Jain
2024-08-30 11:17           ` Baoquan He
2024-09-04  9:25             ` Sourabh Jain
2024-09-05  3:23               ` Baoquan He
2024-09-05  8:37                 ` Sourabh Jain
2024-09-08 10:30                   ` Baoquan He
2024-09-09  5:05                     ` Sourabh Jain
2024-09-09  5:23                       ` Baoquan He
2024-09-09  5:31                         ` Sourabh Jain
     [not found] ` <Zqs8veRya7v/pXEt@MiWiFi-R3L-srv>
2024-08-01  8:06   ` Sourabh Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).