From: bugzilla-daemon@bugzilla.kernel.org
To: linux-xfs@vger.kernel.org
Subject: [Bug 214767] New: xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
Date: Wed, 20 Oct 2021 07:36:07 +0000
Message-ID: <bug-214767-201763@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=214767

            Bug ID: 214767
           Summary: xfs seems to hang due to race condition? maybe related
                    to (gratuitous) thaw.
           Product: File System
           Version: 2.5
    Kernel Version: 5.10.70
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: XFS
          Assignee: filesystem_xfs@kernel-bugs.kernel.org
          Reporter: ct@flyingcircus.io
        Regression: No

We have been seeing blocked tasks fairly frequently (around one per day across
about 500 VMs), always with tracebacks similar to these:

First we get:

[656898.010322] INFO: task kworker/u2:1:458736 blocked for more than 122 seconds.
[656898.011988]       Not tainted 5.10.70 #1-NixOS
[656898.012981] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[656898.014679] task:kworker/u2:1    state:D stack:    0 pid:458736 ppid:     2 flags:0x00004000
[656898.016530] Workqueue: writeback wb_workfn (flush-253:0)
[656898.017698] Call Trace:
[656898.018310]  __schedule+0x271/0x860
[656898.019208]  schedule+0x46/0xb0
[656898.019944]  rwsem_down_read_slowpath+0x169/0x490
[656898.021008]  ? mempool_alloc+0x62/0x170
[656898.022032]  xfs_map_blocks+0xb9/0x400 [xfs]
[656898.022998]  iomap_do_writepage+0x163/0x850
[656898.023587]  ? __mod_memcg_lruvec_state+0x21/0xe0
[656898.024044]  write_cache_pages+0x186/0x3d0
[656898.024463]  ? iomap_migrate_page+0xc0/0xc0
[656898.024882]  ? submit_bio_noacct+0x3a9/0x420
[656898.025303]  iomap_writepages+0x1c/0x40
[656898.025712]  xfs_vm_writepages+0x64/0x90 [xfs]
[656898.026146]  do_writepages+0x34/0xc0
[656898.026513]  __writeback_single_inode+0x39/0x2a0
[656898.026969]  writeback_sb_inodes+0x200/0x470
[656898.027388]  __writeback_inodes_wb+0x4c/0xe0
[656898.027805]  wb_writeback+0x1d8/0x290
[656898.028161]  wb_workfn+0x29b/0x4d0
[656898.028526]  ? __switch_to_asm+0x42/0x70
[656898.028915]  ? __switch_to+0x7b/0x3e0
[656898.029285]  process_one_work+0x1df/0x370
[656898.029696]  worker_thread+0x50/0x400
[656898.030054]  ? process_one_work+0x370/0x370
[656898.030475]  kthread+0x11b/0x140
[656898.030798]  ? __kthread_bind_mask+0x60/0x60
[656898.031214]  ret_from_fork+0x22/0x30
[656898.031592] INFO: task nix-daemon:459204 blocked for more than 122 seconds.
[656898.032259]       Not tainted 5.10.70 #1-NixOS
[656898.032695] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[656898.033451] task:nix-daemon      state:D stack:    0 pid:459204 ppid:  1067 flags:0x00000000
[656898.034261] Call Trace:
[656898.034538]  __schedule+0x271/0x860
[656898.034881]  schedule+0x46/0xb0
[656898.035226]  xfs_log_commit_cil+0x6a4/0x800 [xfs]
[656898.035709]  ? wake_up_q+0xa0/0xa0
[656898.036073]  __xfs_trans_commit+0x9d/0x310 [xfs]
[656898.036561]  xfs_setattr_nonsize+0x342/0x520 [xfs]
[656898.037027]  notify_change+0x348/0x4c0
[656898.037407]  ? chmod_common+0xa1/0x150
[656898.037779]  chmod_common+0xa1/0x150
[656898.038139]  do_fchmodat+0x5a/0xb0
[656898.038491]  __x64_sys_chmod+0x17/0x20
[656898.038860]  do_syscall_64+0x33/0x40
[656898.039221]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[656898.039722] RIP: 0033:0x7f3d9b3a3b77
[656898.040072] RSP: 002b:00007ffc388e2b18 EFLAGS: 00000206 ORIG_RAX: 000000000000005a
[656898.040797] RAX: ffffffffffffffda RBX: 00007ffc388e2e10 RCX: 00007f3d9b3a3b77
[656898.041542] RDX: 0000000000008049 RSI: 0000000000008124 RDI: 00000000016a2400
[656898.042222] RBP: 00007ffc388e32a0 R08: 00007ffc388e4628 R09: 0000000000000000
[656898.042913] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc388e4620
[656898.043604] R13: 0000000000000000 R14: 00007ffc388e2df0 R15: 00007ffc388e2e10


and then a while later we see this:

[657020.886633] INFO: task qemu-ga:750 blocked for more than 122 seconds.
[657020.888346]       Not tainted 5.10.70 #1-NixOS
[657020.889482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[657020.891072] task:qemu-ga         state:D stack:    0 pid:  750 ppid:     1 flags:0x00000080
[657020.891906] Call Trace:
[657020.892188]  __schedule+0x271/0x860
[657020.892547]  ? kvm_sched_clock_read+0xd/0x20
[657020.892983]  schedule+0x46/0xb0
[657020.893319]  rwsem_down_write_slowpath+0x218/0x480
[657020.893802]  thaw_super+0x12/0x20
[657020.894130]  __x64_sys_ioctl+0x62/0xb0
[657020.894491]  do_syscall_64+0x33/0x40
[657020.894860]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[657020.895396] RIP: 0033:0x7f69a55f2b17
[657020.895762] RSP: 002b:00007ffe5687b548 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[657020.896470] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f69a55f2b17
[657020.897144] RDX: 0000000000000000 RSI: 00000000c0045878 RDI: 0000000000000006
[657020.897815] RBP: 0000000000000000 R08: 00007f69a5477c08 R09: 0000000000000000
[657020.898479] R10: 0000000000000000 R11: 0000000000000246 R12: 000055ef74a15190
[657020.899155] R13: 00007ffe5687b5c0 R14: 00007ffe5687b560 R15: 0000000000000006
[657020.899850] INFO: task kworker/u2:1:458736 blocked for more than 245 seconds.
[657020.900513]       Not tainted 5.10.70 #1-NixOS
[657020.900942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[657020.901672] task:kworker/u2:1    state:D stack:    0 pid:458736 ppid:     2 flags:0x00004000
[657020.902469] Workqueue: writeback wb_workfn (flush-253:0)
[657020.902976] Call Trace:
[657020.903220]  __schedule+0x271/0x860
[657020.903556]  schedule+0x46/0xb0
[657020.903870]  rwsem_down_read_slowpath+0x169/0x490
[657020.904321]  ? mempool_alloc+0x62/0x170
[657020.904996]  xfs_map_blocks+0xb9/0x400 [xfs]
[657020.905412]  iomap_do_writepage+0x163/0x850
[657020.905823]  ? __mod_memcg_lruvec_state+0x21/0xe0
[657020.906269]  write_cache_pages+0x186/0x3d0
[657020.906664]  ? iomap_migrate_page+0xc0/0xc0
[657020.907079]  ? submit_bio_noacct+0x3a9/0x420
[657020.907487]  iomap_writepages+0x1c/0x40
[657020.907942]  xfs_vm_writepages+0x64/0x90 [xfs]
[657020.908376]  do_writepages+0x34/0xc0
[657020.908760]  __writeback_single_inode+0x39/0x2a0
[657020.909222]  writeback_sb_inodes+0x200/0x470
[657020.909635]  __writeback_inodes_wb+0x4c/0xe0
[657020.910058]  wb_writeback+0x1d8/0x290
[657020.910410]  wb_workfn+0x29b/0x4d0
[657020.910756]  ? __switch_to_asm+0x42/0x70
[657020.911135]  ? __switch_to+0x7b/0x3e0
[657020.911489]  process_one_work+0x1df/0x370
[657020.911912]  worker_thread+0x50/0x400
[657020.912264]  ? process_one_work+0x370/0x370
[657020.912667]  kthread+0x11b/0x140
[657020.913008]  ? __kthread_bind_mask+0x60/0x60
[657020.913413]  ret_from_fork+0x22/0x30
[657020.913791] INFO: task nix-daemon:459204 blocked for more than 245 seconds.
[657020.914440]       Not tainted 5.10.70 #1-NixOS
[657020.914879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[657020.915612] task:nix-daemon      state:D stack:    0 pid:459204 ppid:  1067 flags:0x00000000
[657020.916399] Call Trace:
[657020.916654]  __schedule+0x271/0x860
[657020.917006]  schedule+0x46/0xb0
[657020.917343]  xfs_log_commit_cil+0x6a4/0x800 [xfs]
[657020.917806]  ? wake_up_q+0xa0/0xa0
[657020.918162]  __xfs_trans_commit+0x9d/0x310 [xfs]
[657020.918633]  xfs_setattr_nonsize+0x342/0x520 [xfs]
[657020.919101]  notify_change+0x348/0x4c0
[657020.919461]  ? chmod_common+0xa1/0x150
[657020.919831]  chmod_common+0xa1/0x150
[657020.920180]  do_fchmodat+0x5a/0xb0
[657020.920509]  __x64_sys_chmod+0x17/0x20
[657020.920878]  do_syscall_64+0x33/0x40
[657020.921222]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[657020.921703] RIP: 0033:0x7f3d9b3a3b77
[657020.922055] RSP: 002b:00007ffc388e2b18 EFLAGS: 00000206 ORIG_RAX: 000000000000005a
[657020.922772] RAX: ffffffffffffffda RBX: 00007ffc388e2e10 RCX: 00007f3d9b3a3b77
[657020.923435] RDX: 0000000000008049 RSI: 0000000000008124 RDI: 00000000016a2400
[657020.924115] RBP: 00007ffc388e32a0 R08: 00007ffc388e4628 R09: 0000000000000000
[657020.924787] R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffc388e4620
[657020.925450] R13: 0000000000000000 R14: 00007ffc388e2df0 R15: 00007ffc388e2e10
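
As a cross-check on the thaw path: on x86-64 the second ioctl argument is
passed in RSI, and the qemu-ga trace shows RSI = 0xc0045878. The sketch below
decodes that value using the generic Linux ioctl bit layout (nr, type, size,
direction). It is illustrative only: the helper names decode_ioctl and
name_ioctl are made up for this example, and the two-entry lookup table assumes
the FIFREEZE/FITHAW definitions from include/uapi/linux/fs.h.

```python
# Decode a Linux ioctl request number using the generic _IOC layout
# (asm-generic/ioctl.h): bits 0-7 nr, 8-15 type, 16-29 size, 30-31 direction.
_IOC_DIRS = {0: "none", 1: "write", 2: "read", 3: "read|write"}

# Assumed from include/uapi/linux/fs.h:
#   FIFREEZE = _IOWR('X', 119, int), FITHAW = _IOWR('X', 120, int)
KNOWN_IOCTLS = {("X", 119): "FIFREEZE", ("X", 120): "FITHAW"}


def decode_ioctl(request: int) -> dict:
    """Split an ioctl request number into its four encoded fields."""
    return {
        "nr": request & 0xFF,
        "type": chr((request >> 8) & 0xFF),
        "size": (request >> 16) & 0x3FFF,
        "dir": _IOC_DIRS[(request >> 30) & 0x3],
    }


def name_ioctl(request: int) -> str:
    """Look up a symbolic name for the request; 'unknown' if not in the table."""
    fields = decode_ioctl(request)
    return KNOWN_IOCTLS.get((fields["type"], fields["nr"]), "unknown")


if __name__ == "__main__":
    request = 0xC0045878  # RSI from the qemu-ga trace
    print(decode_ioctl(request), name_ioctl(request))
```

Under these assumptions the value decodes to type 'X', nr 120, size 4, which is
FITHAW. That is consistent with thaw_super appearing in the qemu-ga call trace.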


IO is subsequently stalled on ALL disks (we have vda/vdb/vdc with two of them
running XFS and one running swap).
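
To compare such reports across many VMs, the hung-task lines can be reduced to
one entry per task. The following is a minimal sketch, not something we run in
production: the function name summarize_hung_tasks is hypothetical, and it
assumes the "INFO: task <comm>:<pid> blocked for more than <N> seconds" format
shown in the logs above (including lines that were wrapped before "seconds.").

```python
import re
from collections import defaultdict

# comm may itself contain ':' (e.g. "kworker/u2:1"), so match it greedily and
# let the regex backtrack to the last ":<pid>" before " blocked for more than".
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+)")


def summarize_hung_tasks(log_text: str) -> dict:
    """Return {(comm, pid): longest blocked time in seconds seen in the log}."""
    longest = defaultdict(int)
    for match in HUNG_RE.finditer(log_text):
        comm = match.group(1)
        pid = int(match.group(2))
        seconds = int(match.group(3))
        longest[(comm, pid)] = max(longest[(comm, pid)], seconds)
    return dict(longest)


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for (comm, pid), seconds in sorted(summarize_hung_tasks(sys.stdin.read()).items()):
        print(f"{comm} (pid {pid}): blocked for more than {seconds} s")
```

For the logs in this report, this would list the writeback kworker, nix-daemon
and qemu-ga, which makes it easy to see whether a thaw is involved each time.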

I am fairly sure that this is not primarily an issue in the hypervisor (QEMU
4.1) or the storage (Ceph Jewel), because I can warm-reset the virtual machine
using "system-reset" and it then boots properly.

This could be related to https://bugzilla.kernel.org/show_bug.cgi?id=207053,
but the symptoms there do not quite match.

