Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

linux-fsdevel.vger.kernel.org archive mirror

* Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu
       [not found]     ` <e733c14e-0bdd-41b2-82aa-90c0449aff25@redhat.com>
@ 2024-02-21 16:32       ` Linux regression tracking (Thorsten Leemhuis)
  2024-02-24  7:00         ` Thorsten Leemhuis
  0 siblings, 1 reply; 6+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-21 16:32 UTC
  To: Christian Brauner (Microsoft), Alexander Viro
  Cc: Matt Heon, Ed Santiago, Linux-fsdevel, Paul Holzinger,
	Linux regressions mailing list, LKML

[adding Al, Christian and a few lists to the list of recipients to
ensure all affected parties are aware of this new report about a bug for
which a fix is committed, but not yet mainlined]

Thread starts here:
https://lore.kernel.org/all/6a150ddd-3267-4f89-81bd-6807700c57c1@redhat.com/

On 21.02.24 16:56, Paul Holzinger wrote:
> Hi Thorsten,
> 
> On 21/02/2024 15:42, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 21.02.24 15:31, Paul Holzinger wrote:
>>> On 21/02/2024 15:20, Paul Holzinger wrote:
>>>> we are seeing problems with the 6.8-rc kernels[1] in our CI systems,
>>>> we see random process timeouts across our test suite. It appears that
>>>> sometimes a process is unable to exit, nothing happens even if we send
>>>> SIGKILL and instead the process consumes a lot of cpu.
>>> [...]
>> Thx for the report.
>>
>> Warning, this is not my area of expertise, so this might send you in the
>> totally wrong direction.
>>
>> I briefly checked lore for similar reports and noticed this one when I
>> searched for shrink_dcache_parent:
>>
>> https://lore.kernel.org/all/ZcKOGpTXnlmfplGR@gmail.com/
>
>> Do you think that might be related? A fix for this is pending in vfs.git.
>>
> yes that does seem very relevant. Running the sysrq command I get the
> same backtrace as the reporter there so I think it is fair to assume
> this is the same bug. Looking forward to getting the fix into mainline.

FWIW, "the fix" afaics is 7e4a205fe56b90 ("Revert "get rid of
DCACHE_GENOCIDE""), which has been sitting in the 'fixes' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for more than
a week now.

I assume Al or Christian will send this to Linus soon. Christian in fact
already mentioned that he plans to send another vfs fix to Linus, but
that one iirc was sitting in another repo (but I might be mistaken there!).

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

P.S.: let me update regzbot while at it:

#regzbot introduced 57851607326a2beef21e67f83f4f53a90df8445a.
#regzbot fix: Revert "get rid of DCACHE_GENOCIDE"


* Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu
  2024-02-21 16:32       ` [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu Linux regression tracking (Thorsten Leemhuis)
@ 2024-02-24  7:00         ` Thorsten Leemhuis
  2024-02-24 23:43           ` Linus Torvalds
  2024-02-25  1:17           ` Al Viro
  0 siblings, 2 replies; 6+ messages in thread
From: Thorsten Leemhuis @ 2024-02-24  7:00 UTC
  To: Christian Brauner (Microsoft), Alexander Viro, Linus Torvalds
  Cc: Matt Heon, Ed Santiago, Linux-fsdevel, Paul Holzinger,
	Linux regressions mailing list, LKML

On 21.02.24 17:32, Linux regression tracking (Thorsten Leemhuis) wrote:
> [adding Al, Christian and a few lists to the list of recipients to
> ensure all affected parties are aware of this new report about a bug for
> which a fix is committed, but not yet mainlined]
> 
> Thread starts here:
> https://lore.kernel.org/all/6a150ddd-3267-4f89-81bd-6807700c57c1@redhat.com/

[adding Linus now as well]

TWIMC, the quoted mail apparently did not get delivered to Al (I got a
"48 hours on the queue" warning from my hoster's MTA ~10 hours ago).

Ohh, and there is some suspicion that the problem Calvin[1] and Paul
(this thread, see quote below for the gist) encountered also causes
problems for bwrap (used by Flatpak)[2].
[1] https://lore.kernel.org/all/ZcKOGpTXnlmfplGR@gmail.com/
[2] https://github.com/containers/bubblewrap/issues/620

Christian, Linus, all that makes me wonder if it might be wise to pick
up the revert[3] Al queued directly in case Al does not submit a PR
today or tomorrow for -rc6.

[3]
https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git/commit/?h=fixes&id=7e4a205fe56b9092f0143dad6aa5fee081139b09

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

> On 21.02.24 16:56, Paul Holzinger wrote:
>> Hi Thorsten,
>>
>> On 21/02/2024 15:42, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> On 21.02.24 15:31, Paul Holzinger wrote:
>>>> On 21/02/2024 15:20, Paul Holzinger wrote:
>>>>> we are seeing problems with the 6.8-rc kernels[1] in our CI systems,
>>>>> we see random process timeouts across our test suite. It appears that
>>>>> sometimes a process is unable to exit, nothing happens even if we send
>>>>> SIGKILL and instead the process consumes a lot of cpu.
>>>> [...]
>>> Thx for the report.
>>>
>>> Warning, this is not my area of expertise, so this might send you in the
>>> totally wrong direction.
>>>
>>> I briefly checked lore for similar reports and noticed this one when I
>>> searched for shrink_dcache_parent:
>>>
>>> https://lore.kernel.org/all/ZcKOGpTXnlmfplGR@gmail.com/
>>
>>> Do you think that might be related? A fix for this is pending in vfs.git.
>>>
>> yes that does seem very relevant. Running the sysrq command I get the
>> same backtrace as the reporter there so I think it is fair to assume
>> this is the same bug. Looking forward to getting the fix into mainline.
> 
> FWIW, "the fix" afaics is 7e4a205fe56b90 ("Revert "get rid of
> DCACHE_GENOCIDE""), which has been sitting in the 'fixes' branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for more than
> a week now.
> 
> I assume Al or Christian will send this to Linus soon. Christian in fact
> already mentioned that he plans to send another vfs fix to Linus, but
> that one iirc was sitting in another repo (but I might be mistaken there!).
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> P.S.: let me update regzbot while at it:
> 
> #regzbot introduced 57851607326a2beef21e67f83f4f53a90df8445a.
> #regzbot fix: Revert "get rid of DCACHE_GENOCIDE"


* Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu
  2024-02-24  7:00         ` Thorsten Leemhuis
@ 2024-02-24 23:43           ` Linus Torvalds
  2024-02-25  1:22             ` Al Viro
  2024-02-25  1:17           ` Al Viro
  1 sibling, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2024-02-24 23:43 UTC
  To: Linux regressions mailing list, Al Viro
  Cc: Christian Brauner (Microsoft), Matt Heon, Ed Santiago,
	Linux-fsdevel, Paul Holzinger, LKML

On Fri, 23 Feb 2024 at 23:00, Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
>
> TWIMC, the quoted mail apparently did not get delivered to Al (I got a
> "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).

Al's email has been broken for almost two weeks now - the machine
went belly-up in a major way.

I bounced the email to his kernel.org email that seems to work, but I
also think Al ends up being busy trying to get through everything else
he missed, in addition to trying to get the machine working again...

             Linus


* Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu
  2024-02-24  7:00         ` Thorsten Leemhuis
  2024-02-24 23:43           ` Linus Torvalds
@ 2024-02-25  1:17           ` Al Viro
  1 sibling, 0 replies; 6+ messages in thread
From: Al Viro @ 2024-02-25  1:17 UTC
  To: Linux regressions mailing list
  Cc: Christian Brauner (Microsoft), Alexander Viro, Linus Torvalds,
	Matt Heon, Ed Santiago, Linux-fsdevel, Paul Holzinger, LKML

On Sat, Feb 24, 2024 at 08:00:27AM +0100, Thorsten Leemhuis wrote:
> On 21.02.24 17:32, Linux regression tracking (Thorsten Leemhuis) wrote:
> > [adding Al, Christian and a few lists to the list of recipients to
> > ensure all affected parties are aware of this new report about a bug for
> > which a fix is committed, but not yet mainlined]
> > 
> > Thread starts here:
> > https://lore.kernel.org/all/6a150ddd-3267-4f89-81bd-6807700c57c1@redhat.com/
> 
> [adding Linus now as well]
> 
> TWIMC, the quoted mail apparently did not get delivered to Al (I got a
> "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).
> 
> Ohh, and there is some suspicion that the problem Calvin[1] and Paul
> (this thread, see quote below for the gist) encountered also causes
> problems for bwrap (used by Flatpak)[2].
> [1] https://lore.kernel.org/all/ZcKOGpTXnlmfplGR@gmail.com/
> [2] https://github.com/containers/bubblewrap/issues/620
> 
> Christian, Linus, all that makes me wonder if it might be wise to pick
> up the revert[3] Al queued directly in case Al does not submit a PR
> today or tomorrow for -rc6.

See #fixes in my tree.


* Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu
  2024-02-24 23:43           ` Linus Torvalds
@ 2024-02-25  1:22             ` Al Viro
  2024-02-25  5:57               ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 6+ messages in thread
From: Al Viro @ 2024-02-25  1:22 UTC
  To: Linus Torvalds
  Cc: Linux regressions mailing list, Christian Brauner (Microsoft),
	Matt Heon, Ed Santiago, Linux-fsdevel, Paul Holzinger, LKML

On Sat, Feb 24, 2024 at 03:43:43PM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 23:00, Thorsten Leemhuis
> <regressions@leemhuis.info> wrote:
> >
> > TWIMC, the quoted mail apparently did not get delivered to Al (I got a
> > "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).
> 
> Al's email has been broken for almost two weeks now - the machine
> went belly-up in a major way.
> 
> I bounced the email to his kernel.org email that seems to work, but I
> also think Al ends up being busy trying to get through everything else
> he missed, in addition to trying to get the machine working again...

FWIW, I'm pretty sure that it's fixed by #fixes^ (7e4a205fe56b) in
my tree; I'll send a pull request, both for #fixes and #fixes.pathwalk.rcu


* Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu
  2024-02-25  1:22             ` Al Viro
@ 2024-02-25  5:57               ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 6+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2024-02-25  5:57 UTC
  To: Al Viro, Linus Torvalds
  Cc: Linux regressions mailing list, Christian Brauner (Microsoft),
	Matt Heon, Ed Santiago, Linux-fsdevel, Paul Holzinger, LKML

On 25.02.24 02:22, Al Viro wrote:
> On Sat, Feb 24, 2024 at 03:43:43PM -0800, Linus Torvalds wrote:
>> On Fri, 23 Feb 2024 at 23:00, Thorsten Leemhuis
>> <regressions@leemhuis.info> wrote:
>>>
>>> TWIMC, the quoted mail apparently did not get delivered to Al (I got a
>>> "48 hours on the queue" warning from my hoster's MTA ~10 hours ago).
>>
>> Al's email has been broken for almost two weeks now - the machine
>> went belly-up in a major way.
>>
>> I bounced the email to his kernel.org email that seems to work,

Thx!

>> but I
>> also think Al ends up being busy trying to get through everything else
>> he missed, in addition to trying to get the machine working again...
> 
> FWIW, I'm pretty sure that it's fixed by #fixes^ (7e4a205fe56b) in
> my tree; I'll send a pull request, both for #fixes and #fixes.pathwalk.rcu

Great, thank you, too!

Ciao, Thorsten





Thread overview: 6+ messages
     [not found] <6a150ddd-3267-4f89-81bd-6807700c57c1@redhat.com>
     [not found] ` <652928aa-0fb8-425e-87b0-d65176dd2cfa@redhat.com>
     [not found]   ` <9b92706b-14c2-4761-95fb-7dbbaede57f4@leemhuis.info>
     [not found]     ` <e733c14e-0bdd-41b2-82aa-90c0449aff25@redhat.com>
2024-02-21 16:32       ` [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu Linux regression tracking (Thorsten Leemhuis)
2024-02-24  7:00         ` Thorsten Leemhuis
2024-02-24 23:43           ` Linus Torvalds
2024-02-25  1:22             ` Al Viro
2024-02-25  5:57               ` Linux regression tracking (Thorsten Leemhuis)
2024-02-25  1:17           ` Al Viro
