* Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
@ 2020-06-26 11:35 Steve McIntyre
2020-06-26 13:41 ` Greg KH
0 siblings, 1 reply; 11+ messages in thread
From: Steve McIntyre @ 2020-06-26 11:35 UTC (permalink / raw)
To: stable; +Cc: 963493
Hi folks,
I'm the maintainer in Debian for strace. Trying to reproduce
https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've
found a repeatable hard lockup running the strace testsuite. Each time
it seems to have failed in a slightly different place in the testsuite
(suggesting it's not one particular syscall test that's triggering the
failure). I initially found this using Debian's current Buster kernel
(4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1
worked fine.
I've bisected to find the failure point along the linux-4.19.y stable
branch and what I've got to is the following commit:
e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
commit e58f543fc7c0926f31a49619c1a3648e49e8d233
Author: Jann Horn <jannh@google.com>
Date: Thu Sep 13 18:12:09 2018 +0200
apparmor: don't try to replace stale label in ptrace access check
[ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ]
As a comment above begin_current_label_crit_section() explains,
begin_current_label_crit_section() must run in sleepable context because
when label_is_stale() is true, aa_replace_current_label() runs, which uses
prepare_creds(), which can sleep.
Until now, the ptrace access check (which runs with a task lock held)
violated this rule.
Also add a might_sleep() assertion to begin_current_label_crit_section(),
because asserts are less likely to be ignored than comments.
Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
:040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security
Considering I'm running strace build tests to provoke this bug,
finding the failure in a commit talking about ptrace changes does look
very suspicious...!
Annoyingly, I can't reproduce this on my disparate other machines
here, suggesting it's maybe(?) timing related.
Hope this helps - happy to give more information, test things, etc.
--
Steve McIntyre, Cambridge, UK. steve@einval.com
"Managing a volunteer open source project is a lot like herding
kittens, except the kittens randomly appear and disappear because they
have day jobs." -- Matt Mackall
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 11:35 Repeatable hard lockup running strace testsuite on 4.19.98+ onwards Steve McIntyre @ 2020-06-26 13:41 ` Greg KH 2020-06-26 13:45 ` Steve McIntyre 2020-06-26 14:25 ` Jann Horn 0 siblings, 2 replies; 11+ messages in thread From: Greg KH @ 2020-06-26 13:41 UTC (permalink / raw) To: Steve McIntyre; +Cc: Jann Horn, John Johansen, Sasha Levin, stable, 963493 On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > Hi folks, > > I'm the maintainer in Debian for strace. Trying to reproduce > https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've > found a repeatable hard lockup running the strace testsuite. Each time > it seems to have failed in a slightly different place in the testsuite > (suggesting it's not one particular syscall test that's triggering the > failure). I initially found this using Debian's current Buster kernel > (4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1 > worked fine. > > I've bisected to find the failure point along the linux-4.19.y stable > branch and what I've got to is the following commit: > > e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit > commit e58f543fc7c0926f31a49619c1a3648e49e8d233 > Author: Jann Horn <jannh@google.com> > Date: Thu Sep 13 18:12:09 2018 +0200 > > apparmor: don't try to replace stale label in ptrace access check > > [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ] > > As a comment above begin_current_label_crit_section() explains, > begin_current_label_crit_section() must run in sleepable context because > when label_is_stale() is true, aa_replace_current_label() runs, which uses > prepare_creds(), which can sleep. > Until now, the ptrace access check (which runs with a task lock held) > violated this rule. > > Also add a might_sleep() assertion to begin_current_label_crit_section(), > because asserts are less likely to be ignored than comments. > > Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") > Signed-off-by: Jann Horn <jannh@google.com> > Signed-off-by: John Johansen <john.johansen@canonical.com> > Signed-off-by: Sasha Levin <sashal@kernel.org> > > :040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security > > Considering I'm running strace build tests to provoke this bug, > finding the failure in a commit talking about ptrace changes does look > very suspicious...! > > Annoyingly, I can't reproduce this on my disparate other machines > here, suggesting it's maybe(?) timing related. > > Hope this helps - happy to give more information, test things, etc. So if you just revert this one patch, all works well? I've added the authors of the patch to the cc: list... Also, does this problem happen on Linus's tree? thanks, greg k-h ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 13:41 ` Greg KH @ 2020-06-26 13:45 ` Steve McIntyre 2020-06-26 15:20 ` Steve McIntyre 2020-06-26 14:25 ` Jann Horn 1 sibling, 1 reply; 11+ messages in thread From: Steve McIntyre @ 2020-06-26 13:45 UTC (permalink / raw) To: Greg KH; +Cc: Jann Horn, John Johansen, Sasha Levin, stable, 963493 Hey Greg, On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote: >On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: >> >> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit >> commit e58f543fc7c0926f31a49619c1a3648e49e8d233 >> Author: Jann Horn <jannh@google.com> >> Date: Thu Sep 13 18:12:09 2018 +0200 ... >> Annoyingly, I can't reproduce this on my disparate other machines >> here, suggesting it's maybe(?) timing related. >> >> Hope this helps - happy to give more information, test things, etc. > >So if you just revert this one patch, all works well? Correct. >I've added the authors of the patch to the cc: list... > >Also, does this problem happen on Linus's tree? Just building to check that now... -- Steve McIntyre, Cambridge, UK. steve@einval.com < Aardvark> I dislike C++ to start with. C++11 just seems to be handing rope-creating factories for users to hang multiple instances of themselves. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 13:45 ` Steve McIntyre @ 2020-06-26 15:20 ` Steve McIntyre 0 siblings, 0 replies; 11+ messages in thread From: Steve McIntyre @ 2020-06-26 15:20 UTC (permalink / raw) To: Greg KH; +Cc: Jann Horn, John Johansen, Sasha Levin, stable, 963493 On Fri, Jun 26, 2020 at 02:45:18PM +0100, Steve McIntyre wrote: >Hey Greg, > >On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote: >>On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: >>> >>> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit >>> commit e58f543fc7c0926f31a49619c1a3648e49e8d233 >>> Author: Jann Horn <jannh@google.com> >>> Date: Thu Sep 13 18:12:09 2018 +0200 > >... > >>> Annoyingly, I can't reproduce this on my disparate other machines >>> here, suggesting it's maybe(?) timing related. >>> >>> Hope this helps - happy to give more information, test things, etc. >> >>So if you just revert this one patch, all works well? > >Correct. > >>I've added the authors of the patch to the cc: list... >> >>Also, does this problem happen on Linus's tree? > >Just building to check that now... Current head here (3e08a95294a4fb3702bb3d35ed08028433c37fe6) works fine. -- Steve McIntyre, Cambridge, UK. steve@einval.com The two hard things in computing: * naming things * cache invalidation * off-by-one errors -- Stig Sandbeck Mathisen ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 13:41 ` Greg KH 2020-06-26 13:45 ` Steve McIntyre @ 2020-06-26 14:25 ` Jann Horn 2020-06-26 16:50 ` Steve McIntyre 1 sibling, 1 reply; 11+ messages in thread From: Jann Horn @ 2020-06-26 14:25 UTC (permalink / raw) To: Greg KH; +Cc: Steve McIntyre, John Johansen, Sasha Levin, stable, 963493 On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote: > On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > > I'm the maintainer in Debian for strace. Trying to reproduce > > https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've > > found a repeatable hard lockup running the strace testsuite. Each time > > it seems to have failed in a slightly different place in the testsuite > > (suggesting it's not one particular syscall test that's triggering the > > failure). I initially found this using Debian's current Buster kernel > > (4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1 > > worked fine. > > > > I've bisected to find the failure point along the linux-4.19.y stable > > branch and what I've got to is the following commit: > > > > e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit > > commit e58f543fc7c0926f31a49619c1a3648e49e8d233 > > Author: Jann Horn <jannh@google.com> > > Date: Thu Sep 13 18:12:09 2018 +0200 > > > > apparmor: don't try to replace stale label in ptrace access check > > > > [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ] > > > > As a comment above begin_current_label_crit_section() explains, > > begin_current_label_crit_section() must run in sleepable context because > > when label_is_stale() is true, aa_replace_current_label() runs, which uses > > prepare_creds(), which can sleep. > > Until now, the ptrace access check (which runs with a task lock held) > > violated this rule. > > > > Also add a might_sleep() assertion to begin_current_label_crit_section(), > > because asserts are less likely to be ignored than comments. > > > > Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") > > Signed-off-by: Jann Horn <jannh@google.com> > > Signed-off-by: John Johansen <john.johansen@canonical.com> > > Signed-off-by: Sasha Levin <sashal@kernel.org> > > > > :040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security > > > > Considering I'm running strace build tests to provoke this bug, > > finding the failure in a commit talking about ptrace changes does look > > very suspicious...! > > > > Annoyingly, I can't reproduce this on my disparate other machines > > here, suggesting it's maybe(?) timing related. Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure prints a warning to dmesg? If so, can you share that warning? If you don't have any way to see console output, and you don't have a working serial console setup or such, you may want to try re-running those tests while the kernel is booted with netconsole enabled to log to a different machine over UDP (see https://www.kernel.org/doc/Documentation/networking/netconsole.txt). You may want to try setting the sysctl kernel.sysrq=1 , then when the system has locked up, press ALT+PRINT+L (to generate stack traces for all active CPUs from NMI context), and maybe also ALT+PRINT+T and ALT+PRINT+W (to collect more information about active tasks). (If you share stack traces from these things with us, it would be helpful if you could run them through scripts/decode_stacktrace.pl from the kernel tree first, to add line number information.) Trying to isolate the problem: __end_current_label_crit_section and end_current_label_crit_section are aliases of each other (via #define), so that line change can't have done anything. That leaves two possibilities AFAICS: - the might_sleep() call by itself is causing issues for one of the remaining users of begin_current_label_crit_section() (because it causes preemption to happen more often where it didn't happen on PREEMPT_VOLUNTARY before, or because it's trying to print a warning message with the runqueue lock held, or something like that) - the lack of "if (aa_replace_current_label(label) == 0) aa_put_label(label);" in __begin_current_label_crit_section() is somehow causing issues You could try to see whether just adding the might_sleep(), or just replacing the begin_current_label_crit_section() call with __begin_current_label_crit_section(), triggers the bug. If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if that isn't already set in your kernel config -, that might help track down the problem, unless it magically makes the problem stop triggering (which I guess would be conceivable if this indeed is a race). ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 14:25 ` Jann Horn @ 2020-06-26 16:50 ` Steve McIntyre 2020-06-26 17:29 ` John Johansen 2020-06-26 17:52 ` Steve McIntyre 0 siblings, 2 replies; 11+ messages in thread From: Steve McIntyre @ 2020-06-26 16:50 UTC (permalink / raw) To: Jann Horn; +Cc: Greg KH, John Johansen, Sasha Levin, stable, 963493 Hi Jann, On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: >On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote: >> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: ... >> > Considering I'm running strace build tests to provoke this bug, >> > finding the failure in a commit talking about ptrace changes does look >> > very suspicious...! >> > >> > Annoyingly, I can't reproduce this on my disparate other machines >> > here, suggesting it's maybe(?) timing related. > >Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure >prints a warning to dmesg? If so, can you share that warning? I mean the machine locks hard - X stops updating, the mouse/keyboard stop responding. No pings, etc. When I reboot, there's nothing in the logs. >If you don't have any way to see console output, and you don't have a >working serial console setup or such, you may want to try re-running >those tests while the kernel is booted with netconsole enabled to log >to a different machine over UDP (see >https://www.kernel.org/doc/Documentation/networking/netconsole.txt). ACK, will try that now for you. >You may want to try setting the sysctl kernel.sysrq=1 , then when the >system has locked up, press ALT+PRINT+L (to generate stack traces for >all active CPUs from NMI context), and maybe also ALT+PRINT+T and >ALT+PRINT+W (to collect more information about active tasks). Nod. >(If you share stack traces from these things with us, it would be >helpful if you could run them through scripts/decode_stacktrace.pl >from the kernel tree first, to add line number information.) ACK. >Trying to isolate the problem: > >__end_current_label_crit_section and end_current_label_crit_section >are aliases of each other (via #define), so that line change can't >have done anything. > >That leaves two possibilities AFAICS: > - the might_sleep() call by itself is causing issues for one of the >remaining users of begin_current_label_crit_section() (because it >causes preemption to happen more often where it didn't happen on >PREEMPT_VOLUNTARY before, or because it's trying to print a warning >message with the runqueue lock held, or something like that) > - the lack of "if (aa_replace_current_label(label) == 0) >aa_put_label(label);" in __begin_current_label_crit_section() is >somehow causing issues > >You could try to see whether just adding the might_sleep(), or just >replacing the begin_current_label_crit_section() call with >__begin_current_label_crit_section(), triggers the bug. > > >If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if >that isn't already set in your kernel config -, that might help track >down the problem, unless it magically makes the problem stop >triggering (which I guess would be conceivable if this indeed is a >race). OK, will try that second... -- Steve McIntyre, Cambridge, UK. steve@einval.com "I can't ever sleep on planes ... call it irrational if you like, but I'm afraid I'll miss my stop" -- Vivek Das Mohapatra ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 16:50 ` Steve McIntyre @ 2020-06-26 17:29 ` John Johansen 2020-06-26 19:01 ` Steve McIntyre 2020-06-26 17:52 ` Steve McIntyre 1 sibling, 1 reply; 11+ messages in thread From: John Johansen @ 2020-06-26 17:29 UTC (permalink / raw) To: Steve McIntyre, Jann Horn; +Cc: Greg KH, Sasha Levin, stable, 963493 On 6/26/20 9:50 AM, Steve McIntyre wrote: > Hi Jann, > > On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: >> On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote: >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > > ... > >>>> Considering I'm running strace build tests to provoke this bug, >>>> finding the failure in a commit talking about ptrace changes does look >>>> very suspicious...! >>>> >>>> Annoyingly, I can't reproduce this on my disparate other machines >>>> here, suggesting it's maybe(?) timing related. >> >> Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure >> prints a warning to dmesg? If so, can you share that warning? > > I mean the machine locks hard - X stops updating, the mouse/keyboard > stop responding. No pings, etc. When I reboot, there's nothing in the > logs. > >> If you don't have any way to see console output, and you don't have a >> working serial console setup or such, you may want to try re-running >> those tests while the kernel is booted with netconsole enabled to log >> to a different machine over UDP (see >> https://www.kernel.org/doc/Documentation/networking/netconsole.txt). > > ACK, will try that now for you. > >> You may want to try setting the sysctl kernel.sysrq=1 , then when the >> system has locked up, press ALT+PRINT+L (to generate stack traces for >> all active CPUs from NMI context), and maybe also ALT+PRINT+T and >> ALT+PRINT+W (to collect more information about active tasks). > > Nod. > >> (If you share stack traces from these things with us, it would be >> helpful if you could run them through scripts/decode_stacktrace.pl >>from the kernel tree first, to add line number information.) > > ACK. > >> Trying to isolate the problem: >> >> __end_current_label_crit_section and end_current_label_crit_section >> are aliases of each other (via #define), so that line change can't >> have done anything. >> >> That leaves two possibilities AFAICS: >> - the might_sleep() call by itself is causing issues for one of the >> remaining users of begin_current_label_crit_section() (because it >> causes preemption to happen more often where it didn't happen on >> PREEMPT_VOLUNTARY before, or because it's trying to print a warning >> message with the runqueue lock held, or something like that) >> - the lack of "if (aa_replace_current_label(label) == 0) >> aa_put_label(label);" in __begin_current_label_crit_section() is >> somehow causing issues >> >> You could try to see whether just adding the might_sleep(), or just >> replacing the begin_current_label_crit_section() call with >> __begin_current_label_crit_section(), triggers the bug. >> >> >> If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if >> that isn't already set in your kernel config -, that might help track >> down the problem, unless it magically makes the problem stop >> triggering (which I guess would be conceivable if this indeed is a >> race). > > OK, will try that second... > I have not been able to reproduce but So looking at linux-4.19.y it looks like 1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check was picked without ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check Both of them are marked as Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") so I would expect them to be picked together. ptraceme is potentially updating the task's cred while the access check is running. Try building after picking ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 17:29 ` John Johansen @ 2020-06-26 19:01 ` Steve McIntyre 2020-06-26 19:57 ` Sasha Levin 0 siblings, 1 reply; 11+ messages in thread From: Steve McIntyre @ 2020-06-26 19:01 UTC (permalink / raw) To: John Johansen; +Cc: Jann Horn, Greg KH, Sasha Levin, stable, 963493 On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote: >On 6/26/20 9:50 AM, Steve McIntyre wrote: >> >> OK, will try that second... >> > >I have not been able to reproduce but > >So looking at linux-4.19.y it looks like >1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check > >was picked without >ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check > >Both of them are marked as >Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") > >so I would expect them to be picked together. > >ptraceme is potentially updating the task's cred while the access check is >running. > >Try building after picking >ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check Bingo! With that one change the test suite runs to completion, no lockup. \o/ Thanks guys, I think we've found the cause here. -- Steve McIntyre, Cambridge, UK. steve@einval.com "The whole problem with the world is that fools and fanatics are always so certain of themselves, and wiser people so full of doubts." -- Bertrand Russell ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 19:01 ` Steve McIntyre @ 2020-06-26 19:57 ` Sasha Levin 0 siblings, 0 replies; 11+ messages in thread From: Sasha Levin @ 2020-06-26 19:57 UTC (permalink / raw) To: Steve McIntyre; +Cc: John Johansen, Jann Horn, Greg KH, stable, 963493 On Fri, Jun 26, 2020 at 08:01:18PM +0100, Steve McIntyre wrote: >On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote: >>On 6/26/20 9:50 AM, Steve McIntyre wrote: >>> >>> OK, will try that second... >>> >> >>I have not been able to reproduce but >> >>So looking at linux-4.19.y it looks like >>1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check >> >>was picked without >>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check >> >>Both of them are marked as >>Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") >> >>so I would expect them to be picked together. >> >>ptraceme is potentially updating the task's cred while the access check is >>running. >> >>Try building after picking >>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check > >Bingo! With that one change the test suite runs to completion, no lockup. I've queued ca3fde5214e1 for 4.19 and 4.14, thank you for your work! -- Thanks, Sasha ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 16:50 ` Steve McIntyre 2020-06-26 17:29 ` John Johansen @ 2020-06-26 17:52 ` Steve McIntyre 2020-06-26 18:27 ` Jann Horn 1 sibling, 1 reply; 11+ messages in thread From: Steve McIntyre @ 2020-06-26 17:52 UTC (permalink / raw) To: Jann Horn; +Cc: Greg KH, John Johansen, Sasha Levin, stable, 963493 [-- Attachment #1: Type: text/plain, Size: 1999 bytes --] Hi again, On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote: >On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: >>On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote: >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > >... > >>> > Considering I'm running strace build tests to provoke this bug, >>> > finding the failure in a commit talking about ptrace changes does look >>> > very suspicious...! >>> > >>> > Annoyingly, I can't reproduce this on my disparate other machines >>> > here, suggesting it's maybe(?) timing related. >> >>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure >>prints a warning to dmesg? If so, can you share that warning? > >I mean the machine locks hard - X stops updating, the mouse/keyboard >stop responding. No pings, etc. When I reboot, there's nothing in the >logs. > >>If you don't have any way to see console output, and you don't have a >>working serial console setup or such, you may want to try re-running >>those tests while the kernel is booted with netconsole enabled to log >>to a different machine over UDP (see >>https://www.kernel.org/doc/Documentation/networking/netconsole.txt). > >ACK, will try that now for you. > >>You may want to try setting the sysctl kernel.sysrq=1 , then when the >>system has locked up, press ALT+PRINT+L (to generate stack traces for >>all active CPUs from NMI context), and maybe also ALT+PRINT+T and >>ALT+PRINT+W (to collect more information about active tasks). > >Nod. > >>(If you share stack traces from these things with us, it would be >>helpful if you could run them through scripts/decode_stacktrace.pl >>from the kernel tree first, to add line number information.) > >ACK. Output passed through scripts/decode_stacktrace.sh attached. Just about to try John's suggestion next. -- Steve McIntyre, Cambridge, UK. steve@einval.com Dance like no one's watching. Encrypt like everyone is. - @torproject [-- Attachment #2: tack.log.decoded.gz --] [-- Type: application/gzip, Size: 32484 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards 2020-06-26 17:52 ` Steve McIntyre @ 2020-06-26 18:27 ` Jann Horn 0 siblings, 0 replies; 11+ messages in thread From: Jann Horn @ 2020-06-26 18:27 UTC (permalink / raw) To: Steve McIntyre; +Cc: Greg KH, John Johansen, Sasha Levin, stable, 963493 On Fri, Jun 26, 2020 at 7:52 PM Steve McIntyre <steve@einval.com> wrote: > On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote: > >On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: > >>On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote: > >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > > > >... > > > >>> > Considering I'm running strace build tests to provoke this bug, > >>> > finding the failure in a commit talking about ptrace changes does look > >>> > very suspicious...! > >>> > > >>> > Annoyingly, I can't reproduce this on my disparate other machines > >>> > here, suggesting it's maybe(?) timing related. > >> > >>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure > >>prints a warning to dmesg? If so, can you share that warning? > > > >I mean the machine locks hard - X stops updating, the mouse/keyboard > >stop responding. No pings, etc. When I reboot, there's nothing in the > >logs. > > > >>If you don't have any way to see console output, and you don't have a > >>working serial console setup or such, you may want to try re-running > >>those tests while the kernel is booted with netconsole enabled to log > >>to a different machine over UDP (see > >>https://www.kernel.org/doc/Documentation/networking/netconsole.txt). > > > >ACK, will try that now for you. > > > >>You may want to try setting the sysctl kernel.sysrq=1 , then when the > >>system has locked up, press ALT+PRINT+L (to generate stack traces for > >>all active CPUs from NMI context), and maybe also ALT+PRINT+T and > >>ALT+PRINT+W (to collect more information about active tasks). > > > >Nod. > > > >>(If you share stack traces from these things with us, it would be > >>helpful if you could run them through scripts/decode_stacktrace.pl > >>from the kernel tree first, to add line number information.) > > > >ACK. > > Output passed through scripts/decode_stacktrace.sh attached. > > Just about to try John's suggestion next. Okay, so this is some sort of deadlock... Looking at the NMI backtraces, all the CPUs are blocked on spinlocks: CPU 3 is blocked on current->sighand->siglock, in tty_open_proc_set_tty() CPU 1 is blocked on... I'm not sure which lock, somewhere in do_wait() CPU 2 is blocked on something, somewhere in ptrace_stop() CPU 0 is stuck on a lock in do_exit() So I think it's probably something like a classic deadlock, or a sleeping-in-atomic issue, or a lock-balancing issue (or memory corruption, that can cause all kinds of weird errors)? If it really is a classic deadlock, CONFIG_PROVE_LOCKING=y should be able to pinpoint the issue. If it is a sleeping-in-atomic issue, CONFIG_DEBUG_ATOMIC_SLEEP=y should help. If it is memory corruption, CONFIG_KASAN=y should discover it... but that might majorly mess up the timing, so if this really is a race, that might not work. Maybe flip all of those on, and if it doesn't reproduce anymore, turn off CONFIG_KASAN and try again? ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2020-06-26 19:58 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-06-26 11:35 Repeatable hard lockup running strace testsuite on 4.19.98+ onwards Steve McIntyre 2020-06-26 13:41 ` Greg KH 2020-06-26 13:45 ` Steve McIntyre 2020-06-26 15:20 ` Steve McIntyre 2020-06-26 14:25 ` Jann Horn 2020-06-26 16:50 ` Steve McIntyre 2020-06-26 17:29 ` John Johansen 2020-06-26 19:01 ` Steve McIntyre 2020-06-26 19:57 ` Sasha Levin 2020-06-26 17:52 ` Steve McIntyre 2020-06-26 18:27 ` Jann Horn
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.