Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process

linux-fsdevel.vger.kernel.org archive mirror

* Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
@ 2023-11-05  9:40 Donald Buczek
  2023-11-05 12:02 ` Dr. David Alan Gilbert
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Donald Buczek @ 2023-11-05  9:40 UTC (permalink / raw)
  To: Linux Kernel Mailing List, linux-fsdevel

Hello, experts,

We have a strange new problem on a backup server (high metadata I/O 24/7, xfs -> mdraid). The system worked for years, and with v5.15.86 for 8 months. Then we updated to 6.1.52, and after a few hours it froze: no more I/O activity to one of its filesystems, and processes trying to access it stayed blocked until we rebooted.

Of course, at first we blamed the kernel, as this happened after an upgrade. But after several experiments with different kernel versions, we returned to the v5.15.86 kernel we had used before and still experienced the problem. Then we suspected that a microcode update (for AMD EPYC 7261), which happened as a side effect of the first reboot, might be the culprit, and removed it. That didn't fix it either. As far as I can tell, all software is back to the state that worked before.

Now the strange part: what we usually do in a situation like this is run a script that collects various procfs and sysfs information which has proved useful in the past. We soon discovered that just running this script unblocks the system. I/O continues as if nothing had ever happened. Then we single-stepped the operations of the script to find out exactly which action gets the system to resume. It is this part:

     for task in /proc/*/task/*; do
         echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
         cmd cat $task/stack
     done

which can further be reduced to

     for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done

This is absolutely reproducible. The above line unblocks the system reliably.

Another remarkable thing: we modified the above code to go through the processes slowly, one by one, checking after each step whether I/O had resumed. Each time we tested that, it was one of the 64 nfsd processes (but not the very first one tried). While the system exports filesystems, we have absolutely no reason to assume that any client actually tries to access this NFS server. Additionally, when the full script is run, the stack traces show all nfsd tasks in their normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
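
A minimal sketch of such a one-by-one probe, not the script actually used: it assumes that "I/O resumed" can be detected by watching the completed read and write counters of one block device in /proc/diskstats. The function names, the `PROBE_DELAY` variable and any device name passed in (for example `md0`) are hypothetical.

```shell
# Hypothetical helper: print the completed-reads and completed-writes
# counters (fields 4 and 8) of one device from a diskstats-format file.
io_count() {
    local dev="$1" stats="${2:-/proc/diskstats}"
    awk -v d="$dev" '$3 == d { print $4, $8 }' "$stats"
}

# Read cmdline of each task in turn and stop at the first task after
# which the counters of the device have moved.
# $1: device name, $2: proc root (assumed default /proc), $3: diskstats file.
probe_tasks() {
    local dev="$1" proc="${2:-/proc}" stats="${3:-/proc/diskstats}"
    local before after task
    for task in "$proc"/[0-9]*/task/[0-9]*; do
        [ -r "$task/cmdline" ] || continue
        before=$(io_count "$dev" "$stats")
        cat "$task/cmdline" > /dev/null 2>&1
        # Give stalled I/O time to show up in the counters (assumed delay).
        sleep "${PROBE_DELAY:-5}"
        after=$(io_count "$dev" "$stats")
        if [ "$after" != "$before" ]; then
            echo "I/O resumed after $task ($(cat "$task/comm" 2>/dev/null))"
            return 0
        fi
    done
    return 1
}
```

Called as `probe_tasks md0` while the filesystem is frozen, this would name the first task whose cmdline read coincides with the counters moving. The delay has to be long enough for resumed I/O to become visible, otherwise a later task gets the blame.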

Does anybody have an idea how a `cat /proc/PID/cmdline` on a specific assumed-to-be-idle nfsd thread could have such a "healing" effect?

I'm well aware that, for example, a hardware problem might result in just about anything and that the question might not be answerable at all. If so, please excuse the noise.

Thanks
Donald
-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
  2023-11-05  9:40 Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process Donald Buczek
@ 2023-11-05 12:02 ` Dr. David Alan Gilbert
  2023-11-05 12:09   ` Donald Buczek
  2023-11-05 12:05 ` Bagas Sanjaya
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2023-11-05 12:02 UTC (permalink / raw)
  To: Donald Buczek; +Cc: Linux Kernel Mailing List, linux-fsdevel

* Donald Buczek (buczek@molgen.mpg.de) wrote:
> Hello, experts,
> 
> we have a strange new problem on a backup server (high metadata I/O 24/7, xfs -> mdraid). The system worked for years and with v5.15.86 for 8 month. Then we've updated to 6.1.52 and after a few hours it froze: No more I/O activity to one of its filesystems, processes trying to access it blocked until we reboot.
> 
> Of course, at first we blamed the kernel as this happened after an upgrade. But after several experiments with different kernel versions, we've returned to the v5.15.86 kernel we used before, but still experienced the problem. Then we suspected, that a microcode update (for AMD EPYC 7261), which happened as a side effect of the first reboot, might be the culprit and removed it. That didn't fix it either. For all I can say, all software is back to the state which worked before.

I'm not sure, but did you check /proc/cpuinfo after that revert to
confirm that the microcode version dropped back (or physically power cycle)?
I'm not sure whether a reboot reverts the microcode version.
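
A minimal sketch of that check, assuming the usual x86 `microcode` field in /proc/cpuinfo. The function name `microcode_revisions` is made up for illustration. It prints each distinct revision once, so a single line of output means all CPUs report the same microcode.

```shell
# Hypothetical helper: list the distinct microcode revisions found in a
# cpuinfo-format file (default: /proc/cpuinfo).
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}
```

Comparing the output before the revert, after the reboot and after a power cycle would show whether the revision really dropped back.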

> Now the strange part: What we usually do, when we have a situation like this, is that we run a script which takes several procfs and sysfs information which happened to be useful in the past. It was soon discovered, that just running this script unblocks the system. I/O continues as if nothing ever happened. Then we singled-stepped the operations of the script to find out, what action exactly gets the system to resume. It is this part:
> 
>     for task in /proc/*/task/*; do
>         echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
>         cmd cat $task/stack
>     done
> 
> which can further be reduced to
> 
>     for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done
> 
> This is absolutely reproducible. Above line unblocks the system reliably.
> 
> Another remarkable thing: We've modified above code to do the processes slowly one by one and checking after each step if I/O resumed. And each time we've tested that, it was one of the 64 nfsd processes (but not the very first one tried). While the systems exports filesystems, we have absolutely no reason to assume, that any client actually tries to access this nfs server. Additionally, when the full script is run, the stack traces show all nfsd tasks in their normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
> 
> Does anybody have an idea, how a `cat /proc/PID/cmdline` on a specific assumed-to-be-idle nfsd thread could have such an "healing" effect?

Not me, but have you tried something simpler, like a sysrq-d or sysrq-w
for locks and blocked tasks?

> I'm well aware, that, for example, a hardware problem might result in just anything and that the question might not be answerable at all. If so: please excuse the noise.

Seems a weird hardware problem to have that specific
a way to unblock it.

Dave

> Thanks
> Donald
> -- 
> Donald Buczek
> buczek@molgen.mpg.de
> Tel: +49 30 8413 1433
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


* Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
  2023-11-05  9:40 Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process Donald Buczek
  2023-11-05 12:02 ` Dr. David Alan Gilbert
@ 2023-11-05 12:05 ` Bagas Sanjaya
  2023-11-05 21:51 ` NeilBrown
  2023-11-06 13:58 ` Chuck Lever III
  3 siblings, 0 replies; 7+ messages in thread
From: Bagas Sanjaya @ 2023-11-05 12:05 UTC (permalink / raw)
  To: Donald Buczek, Linux Kernel Mailing List,
	Linux Filesystems Development, Linux NFS, Linux RAID, Linux XFS
  Cc: Chuck Lever, Jeff Layton, Neil Brown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Song Liu, Chandan Babu R, Darrick J. Wong


On Sun, Nov 05, 2023 at 10:40:02AM +0100, Donald Buczek wrote:
> Hello, experts,
> 
> we have a strange new problem on a backup server (high metadata I/O 24/7, xfs -> mdraid). The system worked for years and with v5.15.86 for 8 month. Then we've updated to 6.1.52 and after a few hours it froze: No more I/O activity to one of its filesystems, processes trying to access it blocked until we reboot.
> 
> Of course, at first we blamed the kernel as this happened after an upgrade. But after several experiments with different kernel versions, we've returned to the v5.15.86 kernel we used before, but still experienced the problem. Then we suspected, that a microcode update (for AMD EPYC 7261), which happened as a side effect of the first reboot, might be the culprit and removed it. That didn't fix it either. For all I can say, all software is back to the state which worked before.
> 

By what?

> Now the strange part: What we usually do, when we have a situation like this, is that we run a script which takes several procfs and sysfs information which happened to be useful in the past. It was soon discovered, that just running this script unblocks the system. I/O continues as if nothing ever happened. Then we singled-stepped the operations of the script to find out, what action exactly gets the system to resume. It is this part:
> 
>     for task in /proc/*/task/*; do
>         echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
>         cmd cat $task/stack
>     done
> 
> which can further be reduced to
> 
>     for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done
> 
> This is absolutely reproducible. Above line unblocks the system reliably.
> 
> Another remarkable thing: We've modified above code to do the processes slowly one by one and checking after each step if I/O resumed. And each time we've tested that, it was one of the 64 nfsd processes (but not the very first one tried). While the systems exports filesystems, we have absolutely no reason to assume, that any client actually tries to access this nfs server. Additionally, when the full script is run, the stack traces show all nfsd tasks in their normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
> 

What's so special with that one nfsd process?

> Does anybody have an idea, how a `cat /proc/PID/cmdline` on a specific assumed-to-be-idle nfsd thread could have such an "healing" effect?
> 
> I'm well aware, that, for example, a hardware problem might result in just anything and that the question might not be answerable at all. If so: please excuse the noise.
> 

Confused...

-- 
An old man doll... just what I always wanted! - Clara



* Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
  2023-11-05 12:02 ` Dr. David Alan Gilbert
@ 2023-11-05 12:09   ` Donald Buczek
  0 siblings, 0 replies; 7+ messages in thread
From: Donald Buczek @ 2023-11-05 12:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Linux Kernel Mailing List, linux-fsdevel

On 11/5/23 13:02, Dr. David Alan Gilbert wrote:
> * Donald Buczek (buczek@molgen.mpg.de) wrote:
>> Hello, experts,
>>
>> we have a strange new problem on a backup server (high metadata I/O 24/7, xfs -> mdraid). The system worked for years and with v5.15.86 for 8 month. Then we've updated to 6.1.52 and after a few hours it froze: No more I/O activity to one of its filesystems, processes trying to access it blocked until we reboot.
>>
>> Of course, at first we blamed the kernel as this happened after an upgrade. But after several experiments with different kernel versions, we've returned to the v5.15.86 kernel we used before, but still experienced the problem. Then we suspected, that a microcode update (for AMD EPYC 7261), which happened as a side effect of the first reboot, might be the culprit and removed it. That didn't fix it either. For all I can say, all software is back to the state which worked before.
> 
> I'm not sure; but did you check /proc/cpuinfo after that revert and
> check the microcode version dropped back (or physically pwoer cycle);
> I'm not sure if a reboot reverts the microcode version.

Yes, when not updated via init.d amd-ucode.img, /proc/cpuinfo and dmesg show the old microcode after reboot (through BIOS, not just kexec). Nonetheless, we've power-cycled once.

>> Now the strange part: What we usually do, when we have a situation like this, is that we run a script which takes several procfs and sysfs information which happened to be useful in the past. It was soon discovered, that just running this script unblocks the system. I/O continues as if nothing ever happened. Then we singled-stepped the operations of the script to find out, what action exactly gets the system to resume. It is this part:
>>
>>      for task in /proc/*/task/*; do
>>          echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
>>          cmd cat $task/stack
>>      done
>>
>> which can further be reduced to
>>
>>      for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done
>>
>> This is absolutely reproducible. Above line unblocks the system reliably.
>>
>> Another remarkable thing: We've modified above code to do the processes slowly one by one and checking after each step if I/O resumed. And each time we've tested that, it was one of the 64 nfsd processes (but not the very first one tried). While the systems exports filesystems, we have absolutely no reason to assume, that any client actually tries to access this nfs server. Additionally, when the full script is run, the stack traces show all nfsd tasks in their normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
>>
>> Does anybody have an idea, how a `cat /proc/PID/cmdline` on a specific assumed-to-be-idle nfsd thread could have such an "healing" effect?
> 
> Not me; but had you tried something simpler like a sysrq-d or sysrq-w
> for locks and blocked tasks.

No, will do.

Thanks!

   Donald

>> I'm well aware, that, for example, a hardware problem might result in just anything and that the question might not be answerable at all. If so: please excuse the noise.
> 
> Seems a weird hardware problem to have that specific
> a way to unblock it.
> 
> Dave
> 
>> Thanks
>> Donald
>> -- 
>> Donald Buczek
>> buczek@molgen.mpg.de
>> Tel: +49 30 8413 1433

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
  2023-11-05  9:40 Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process Donald Buczek
  2023-11-05 12:02 ` Dr. David Alan Gilbert
  2023-11-05 12:05 ` Bagas Sanjaya
@ 2023-11-05 21:51 ` NeilBrown
  2023-11-06 13:58 ` Chuck Lever III
  3 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2023-11-05 21:51 UTC (permalink / raw)
  To: Donald Buczek; +Cc: Linux Kernel Mailing List, linux-fsdevel

On Sun, 05 Nov 2023, Donald Buczek wrote:
....
> 
>      for task in /proc/*/task/*; do
>          echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
>          cmd cat $task/stack
>      done
> 
> which can further be reduced to
> 
>      for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done
> 
> This is absolutely reproducible. Above line unblocks the system reliably.
> 
> Another remarkable thing: We've modified above code to do the
> processes slowly one by one and checking after each step if I/O
> resumed.  And each time we've tested that, it was one of the 64 nfsd
> processes (but not the very first one tried).  While the systems
> exports filesystems, we have absolutely no reason to assume, that any
> client actually tries to access this nfs server.  Additionally, when
> the full script is run, the stack traces show all nfsd tasks in their
> normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
> 
> Does anybody have an idea, how a `cat /proc/PID/cmdline` on a specific
> assumed-to-be-idle nfsd thread could have such an "healing" effect?

/proc/PID/cmdline for an nfsd thread is empty.  So it probably isn't
accessing 'cmdline' specifically that unblocks, but any (or almost any)
proc file for the process might help.

You say that *after* accessing cmdline, the "stack" file shows a normal
stack trace.  It might be interesting to see if that same stack is
present *before* accessing cmdline.  But my guess is that nfsd is mostly
a distraction.
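
A rough sketch of that before/after comparison, assuming root access (reading /proc/PID/stack requires it). The name `compare_stack` is hypothetical. Note that reading `stack` first is itself a procfs access to the task, so if the guess above is right, the first read may already unblock things.

```shell
# Hypothetical helper: compare a task's kernel stack before and after
# reading its cmdline. $1 is a task directory such as /proc/PID/task/TID.
compare_stack() {
    local task="$1" before after
    before=$(cat "$task/stack") || return 1
    cat "$task/cmdline" > /dev/null || return 1
    after=$(cat "$task/stack") || return 1
    if [ "$before" = "$after" ]; then
        echo "stack unchanged"
    else
        echo "stack changed"
        printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
    fi
}
```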

It would help to see the full "echo t > /proc/sysrq-trigger" list of all
process stacks.  That should reveal where the blockage is.
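
A small sketch of a wrapper for that, assuming it is run as root. The function name `sysrq_dump` and its optional second argument are made up for illustration. Besides `t` (all tasks) it accepts `w` (only tasks in uninterruptible sleep) and `d` (held locks, which only gives output on kernels built with lock debugging).

```shell
# Hypothetical helper: write one sysrq key to the trigger file.
# $1: key (t = all tasks, w = blocked tasks, d = held locks), default w.
# $2: trigger file, default /proc/sysrq-trigger.
sysrq_dump() {
    local key="${1:-w}" trigger="${2:-/proc/sysrq-trigger}"
    case "$key" in
        t|w|d) ;;
        *) echo "unsupported sysrq key: $key" >&2; return 2 ;;
    esac
    echo "$key" > "$trigger"
}
```

After `sysrq_dump t` the stacks appear in the kernel log (`dmesg`). On a machine with many tasks the log buffer may be too small to hold all of them, in which case `w` is the more compact starting point.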

NeilBrown


> 
> I'm well aware, that, for example, a hardware problem might result in
> just anything and that the question might not be answerable at all.
> If so: please excuse the noise.
> 
> Thanks
> Donald
> -- 
> Donald Buczek
> buczek@molgen.mpg.de
> Tel: +49 30 8413 1433
> 
> 



* Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
  2023-11-05  9:40 Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process Donald Buczek
                   ` (2 preceding siblings ...)
  2023-11-05 21:51 ` NeilBrown
@ 2023-11-06 13:58 ` Chuck Lever III
  2023-11-28  8:53   ` Donald Buczek
  3 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever III @ 2023-11-06 13:58 UTC (permalink / raw)
  To: Donald Buczek; +Cc: Linux Kernel Mailing List, linux-fsdevel@vger.kernel.org



> On Nov 5, 2023, at 4:40 AM, Donald Buczek <buczek@molgen.mpg.de> wrote:
> 
> Hello, experts,
> 
> we have a strange new problem on a backup server (high metadata I/O 24/7, xfs -> mdraid). The system worked for years and with v5.15.86 for 8 month. Then we've updated to 6.1.52 and after a few hours it froze: No more I/O activity to one of its filesystems, processes trying to access it blocked until we reboot.
> 
> Of course, at first we blamed the kernel as this happened after an upgrade. But after several experiments with different kernel versions, we've returned to the v5.15.86 kernel we used before, but still experienced the problem. Then we suspected, that a microcode update (for AMD EPYC 7261), which happened as a side effect of the first reboot, might be the culprit and removed it. That didn't fix it either. For all I can say, all software is back to the state which worked before.
> 
> Now the strange part: What we usually do, when we have a situation like this, is that we run a script which takes several procfs and sysfs information which happened to be useful in the past. It was soon discovered, that just running this script unblocks the system. I/O continues as if nothing ever happened. Then we singled-stepped the operations of the script to find out, what action exactly gets the system to resume. It is this part:
> 
>    for task in /proc/*/task/*; do
>        echo  "# # $task: $(cat $task/comm) : $(cat $task/cmdline | xargs -0 echo)"
>        cmd cat $task/stack
>    done
> 
> which can further be reduced to
> 
>    for task in /proc/*/task/*; do echo $task $(cat $task/cmdline | xargs -0 echo); done
> 
> This is absolutely reproducible. Above line unblocks the system reliably.
> 
> Another remarkable thing: We've modified above code to do the processes slowly one by one and checking after each step if I/O resumed. And each time we've tested that, it was one of the 64 nfsd processes (but not the very first one tried). While the systems exports filesystems, we have absolutely no reason to assume, that any client actually tries to access this nfs server. Additionally, when the full script is run, the stack traces show all nfsd tasks in their normal idle state ( [<0>] svc_recv+0x7bd/0x8d0 [sunrpc] ).
> 
> Does anybody have an idea, how a `cat /proc/PID/cmdline` on a specific assumed-to-be-idle nfsd thread could have such an "healing" effect?
> 
> I'm well aware, that, for example, a hardware problem might result in just anything and that the question might not be answerable at all. If so: please excuse the noise.

I'm with Neil on this: I believe the nfsd thread happens to be in the
wrong place at the wrong time. When idle, an nfsd thread is nothing
more than a plain kthread waiting in the kernel's scheduler.

If you have an opportunity, try testing without starting up the NFSD
service. You might find that the symptoms move to another thread or
subsystem.


--
Chuck Lever




* Re: Heisenbug: I/O freeze can be resolved by cat $task/cmdline of unrelated process
  2023-11-06 13:58 ` Chuck Lever III
@ 2023-11-28  8:53   ` Donald Buczek
  0 siblings, 0 replies; 7+ messages in thread
From: Donald Buczek @ 2023-11-28  8:53 UTC (permalink / raw)
  To: Chuck Lever III, Linux Kernel Mailing List,
	linux-fsdevel@vger.kernel.org, bagasdotme, neilb,
	Dr. David Alan Gilbert

Just a quick follow-up to the problem I reported (the system froze and could be unblocked by reading /proc/PID/cmdline of an nfsd process):

While we had rebooted the system multiple times (through BIOS, not just kexec), the problem persisted. But after we power-cycled the system once, the problem was gone. I guess this points to a problem below ring 0 or in hardware, and Linux is not to blame.

Thanks to everybody who answered!

Best

  Donald


-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

