[Bug 219166] New: ext4 hang when setting echo noop

* [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
@ 2024-08-16  7:43 bugzilla-daemon
  2024-08-16 16:40 ` [Bug 219166] " bugzilla-daemon
                   ` (20 more replies)
  0 siblings, 21 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-16  7:43 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

            Bug ID: 219166
           Summary: ext4 hang when setting echo noop >
                    /sys/block/sda/queue/scheduler
           Product: File System
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: ext4
          Assignee: fs_ext4@kernel-bugs.osdl.org
          Reporter: rjones@redhat.com
        Regression: No

kernel-6.11.0-0.rc3.30.fc41.x86_64

This hang seems new in 6.11.0.

Downstream bug:
https://bugzilla.redhat.com/show_bug.cgi?id=2303267

Trying to change the queue/scheduler for a device containing an ext4 filesystem
very rarely causes other processes which are using the filesystem to hang.

This is running inside qemu.  Using 'crash' I was able to get a stack trace
from a couple of hanging processes (from different VM instances):

crash> set 230
    PID: 230
COMMAND: "modprobe"
   TASK: ffff98ce03ca3040  [THREAD_INFO: ffff98ce03ca3040]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 230      TASK: ffff98ce03ca3040  CPU: 0    COMMAND: "modprobe"
 #0 [ffffaf9940307840] __schedule at ffffffff9618f6d0
 #1 [ffffaf99403078f8] schedule at ffffffff96190a27
 #2 [ffffaf9940307908] __bio_queue_enter at ffffffff957e121c
 #3 [ffffaf9940307968] blk_mq_submit_bio at ffffffff957f358c
 #4 [ffffaf99403079f0] __submit_bio at ffffffff957e1e3c
 #5 [ffffaf9940307a58] submit_bio_noacct_nocheck at ffffffff957e2326
 #6 [ffffaf9940307ac0] ext4_mpage_readpages at ffffffff955ceafc
 #7 [ffffaf9940307be0] read_pages at ffffffff95381d1a
 #8 [ffffaf9940307c40] page_cache_ra_unbounded at ffffffff95381ff5
 #9 [ffffaf9940307ca8] filemap_fault at ffffffff953761b5
#10 [ffffaf9940307d48] __do_fault at ffffffff953d1895
#11 [ffffaf9940307d70] do_fault at ffffffff953d2425
#12 [ffffaf9940307da0] __handle_mm_fault at ffffffff953d8c6b
#13 [ffffaf9940307e88] handle_mm_fault at ffffffff953d95c2
#14 [ffffaf9940307ec8] do_user_addr_fault at ffffffff950b34ea
#15 [ffffaf9940307f28] exc_page_fault at ffffffff96186e4e
#16 [ffffaf9940307f50] asm_exc_page_fault at ffffffff962012a6
    RIP: 0000556b7a7468d8  RSP: 00007ffde2ffb560  RFLAGS: 00000206
    RAX: 00000000000bec82  RBX: 00007f5331a0dc82  RCX: 0000556b7a75b92a
    RDX: 00007ffde2ffd8d0  RSI: 00000000200bec82  RDI: 0000556ba8edf960
    RBP: 00007ffde2ffb7c0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200bec82
    R13: 0000556ba8edf960  R14: 00007ffde2ffd8d0  R15: 0000556b7a760708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

crash> set 234
    PID: 234
COMMAND: "modprobe"
   TASK: ffff9f5ec3a22f40  [THREAD_INFO: ffff9f5ec3a22f40]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 234      TASK: ffff9f5ec3a22f40  CPU: 0    COMMAND: "modprobe"
 #0 [ffffb21e002e7840] __schedule at ffffffffa718f6d0
 #1 [ffffb21e002e78f8] schedule at ffffffffa7190a27
 #2 [ffffb21e002e7908] __bio_queue_enter at ffffffffa67e121c
 #3 [ffffb21e002e7968] blk_mq_submit_bio at ffffffffa67f358c
 #4 [ffffb21e002e79f0] __submit_bio at ffffffffa67e1e3c
 #5 [ffffb21e002e7a58] submit_bio_noacct_nocheck at ffffffffa67e2326
 #6 [ffffb21e002e7ac0] ext4_mpage_readpages at ffffffffa65ceafc
 #7 [ffffb21e002e7be0] read_pages at ffffffffa6381d17
 #8 [ffffb21e002e7c40] page_cache_ra_unbounded at ffffffffa6381ff5
 #9 [ffffb21e002e7ca8] filemap_fault at ffffffffa63761b5
#10 [ffffb21e002e7d48] __do_fault at ffffffffa63d1892
#11 [ffffb21e002e7d70] do_fault at ffffffffa63d2425
#12 [ffffb21e002e7da0] __handle_mm_fault at ffffffffa63d8c6b
#13 [ffffb21e002e7e88] handle_mm_fault at ffffffffa63d95c2
#14 [ffffb21e002e7ec8] do_user_addr_fault at ffffffffa60b34ea
#15 [ffffb21e002e7f28] exc_page_fault at ffffffffa7186e4e
#16 [ffffb21e002e7f50] asm_exc_page_fault at ffffffffa72012a6
    RIP: 000055d16159f8d8  RSP: 00007ffdd4c1f340  RFLAGS: 00010206
    RAX: 00000000000bec82  RBX: 00007ff2fd00dc82  RCX: 000055d1615b492a
    RDX: 00007ffdd4c216b0  RSI: 00000000200bec82  RDI: 000055d185725960
    RBP: 00007ffdd4c1f5a0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200bec82
    R13: 000055d185725960  R14: 00007ffdd4c216b0  R15: 000055d1615b9708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

The hanging process was modprobe both times.  I don't know if that is
significant.

Note that "noop" is not actually a valid scheduler (for about 10 years)!  We
still observe this hang even when setting it to this impossible value.

# echo noop > /sys/block/sda/queue/scheduler
/init: line 108: echo: write error: Invalid argument
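
For context on the "Invalid argument" error: the scheduler file lists the names
the kernel accepts for that device, with the active one in square brackets (for
example "[none] mq-deadline kyber bfq").  The following is a minimal sketch, not
part of the original report, of checking a name against that list before
writing it.  The function names list_schedulers and is_valid_scheduler are made
up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not from the report): inspect a queue/scheduler file.

# Print the scheduler names offered by file $1, one per line,
# with the brackets around the active entry removed.
list_schedulers() {
    sed 's/[][]//g' "$1" | tr -s ' ' '\n' | sed '/^$/d'
}

# Succeed only if name $2 appears in scheduler file $1.
is_valid_scheduler() {
    list_schedulers "$1" | grep -qxF -- "$2"
}
```

Possible use: is_valid_scheduler /sys/block/sda/queue/scheduler noop || echo "noop is not offered"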

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
@ 2024-08-16 16:40 ` bugzilla-daemon
  2024-08-16 17:06 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-16 16:40 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

Theodore Tso (tytso@mit.edu) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@mit.edu

--- Comment #1 from Theodore Tso (tytso@mit.edu) ---
This is a bug in the block layer, not in ext4.  It's also a fairly
long-standing issue; I remember seeing something like it in our data center
kernels at $WORK at least five years ago.  At the time, we dealt with
this via the simple expedient of "Doctor, Doctor, it hurts when I do that.
Well, then don't do that!"

At the time it happened relatively rarely; if you're able to reliably reproduce
it, perhaps it would be useful to send a script with said reliable reproducer
to the linux-block list, and perhaps try to do a bisect to see when it became
trivially easy to trigger.


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
  2024-08-16 16:40 ` [Bug 219166] " bugzilla-daemon
@ 2024-08-16 17:06 ` bugzilla-daemon
  2024-08-16 20:36 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-16 17:06 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #2 from Richard W.M. Jones (rjones@redhat.com) ---
We did indeed work around it by not doing it:
https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305
so my interest in this bug is now limited.  If you want to close it then that's
fine.

It's unfortunately triggered very rarely.  The only test I have is:

$ while guestfish -a /dev/null run -vx >& /tmp/log ; do echo -n . ; done

but that hits the bug less than 1 in 200 iterations (although more often when
run in a qemu VM using software emulation, which is where we first hit this.)

I tried to write something that could trigger it more frequently by having a
loop in a guest writing to queue/scheduler, while another loop in the guest
would write to the disk, but I couldn't seem to hit the bug at all.


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
  2024-08-16 16:40 ` [Bug 219166] " bugzilla-daemon
  2024-08-16 17:06 ` bugzilla-daemon
@ 2024-08-16 20:36 ` bugzilla-daemon
  2024-08-16 20:51 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-16 20:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #3 from Theodore Tso (tytso@mit.edu) ---
So FWIW, what we saw in our data center kernel was switching from one valid
scheduler to a different valid scheduler while I/O was in flight.  And this was
with a kernel that didn't have any modules (or not any modules that would be
loaded under normal circumstances, so modprobe wouldn't have been in the
picture).  It triggered rarely as well, and I don't remember whether it was an
oops or a hang --- if I remember correctly, it was an oops.  So it might not be
the same thing, but our workaround was to quiesce the device before changing
the scheduler.  Since this was happening in the boot sequence, it was something
we could do relatively easily, and like you we then lost interest.  :-)

The question isn't whether or not I want to close it; the question is whether we
think it's worth asking the block layer developers to take a look
at it.  Right now it's mostly only ext4 developers who are paying attention to
this bug component, so someone would need to take it up with the block developers,
and the thing that would be most useful is a reliable reproducer.

I don't know what guestfish is doing, but if we were trying to create a
reproducer from what I was seeing a few years ago, it would be something like
running fsstress or fio to exercise the block layer, and then try switching the
I/O scheduler and see if we can make it go *boom* regularly.   Maybe with
something like that we could get a reproducer that doesn't require launching a
VM multiple times and only seeing the failure less than 0.5% of the time....
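
The reproducer idea described above (switch between valid schedulers while
fsstress or fio keeps the device busy) could be sketched roughly as follows.
This is an assumption about what such a script might look like, not something
that was run in this thread.  cycle_schedulers is a hypothetical name, and the
I/O load is expected to be started separately.

```shell
#!/bin/sh
# Hypothetical sketch: switch a queue between the schedulers it offers.
# $1 = path to a queue/scheduler file, $2 = number of passes over the list.
cycle_schedulers() (
    file=$1
    passes=$2
    # Read the list once, dropping the brackets around the active entry.
    names=$(sed 's/[][]//g' "$file")
    i=0
    while [ "$i" -lt "$passes" ]; do
        for name in $names; do
            # A failed write is reported but does not stop the loop.
            echo "$name" > "$file" || echo "failed to set $name" >&2
        done
        i=$((i + 1))
    done
)
```

Possible use, as root and with fio or fsstress already running against the same
device: cycle_schedulers /sys/block/sda/queue/scheduler 1000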


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (2 preceding siblings ...)
  2024-08-16 20:36 ` bugzilla-daemon
@ 2024-08-16 20:51 ` bugzilla-daemon
  2024-08-17 10:22 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-16 20:51 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #4 from Richard W.M. Jones (rjones@redhat.com) ---
I've got a job running now:

# while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

and fio doing some random writes on an XFS filesystem on sda1.

Baremetal, kernel 6.11.0-0.rc3.20240814git6b0f8db921ab.32.fc42.x86_64.

I'll leave that going overnight to see what happens.


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (3 preceding siblings ...)
  2024-08-16 20:51 ` bugzilla-daemon
@ 2024-08-17 10:22 ` bugzilla-daemon
  2024-08-17 12:36 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-17 10:22 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #5 from Richard W.M. Jones (rjones@redhat.com) ---
I'm afraid that didn't reproduce the issue.  I have an idea to run the same
test, but in a software emulated virtual machine.  The idea is that the much
slower performance will exaggerate any race conditions in the kernel.  However
this will take time.


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (4 preceding siblings ...)
  2024-08-17 10:22 ` bugzilla-daemon
@ 2024-08-17 12:36 ` bugzilla-daemon
  2024-08-17 12:38 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-17 12:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #6 from Richard W.M. Jones (rjones@redhat.com) ---
Yes I can reproduce this inside a software emulated VM with another 6.11.0
Fedora kernel.  I will bisect this later, but for now reproduction instructions
are given below.

(1) Install a Fedora 40 virtual machine.  I used the command below but other
ways are available:

virt-builder fedora-40 --size=10G --root-password=password:123456

(2) Run the VM in qemu with software emulation (TCG):

qemu-system-x86_64 -machine accel=tcg -cpu qemu64 -m 4096 \
    -drive file=/var/tmp/fedora-40.qcow2,format=qcow2,if=virtio

(3) Inside the VM, log in as root/123456, install fio, and update the kernel:

dnf install fedora-repos-rawhide
dnf install fio
dnf update kernel
reboot

(should upgrade to 6.11.0 and boot into that kernel).

(4) Inside the VM, in one terminal run:

while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

(5) Inside the VM, in another terminal run fio with the following config or
similar:

[global]
name=fio-rand-write
filename=/root/fio-rand-write
rw=randwrite
bs=4K
numjobs=4
time_based
runtime=1h
group_reporting

[file1]
size=1G
ioengine=libaio
iodepth=64

(6) After a while the fio process ETA will start counting up (since one or more
threads have got stuck and are not making progress).  Also logging in is
problematic and many common commands like 'dmesg' or 'ps' hang.  I could only
recover by rebooting.
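
Step (6) describes tasks that stop making progress.  While a shell is still
responsive, one way to look for them is to scan for tasks in uninterruptible
sleep (state D, the TASK_UNINTERRUPTIBLE state shown in the crash output
elsewhere in this bug).  This is a minimal sketch, assuming /proc/<pid>/status
can still be read on the affected system.  list_d_state is a hypothetical
helper, and its optional argument exists only so that it can be pointed at a
different proc root.

```shell
#!/bin/sh
# Hypothetical helper: print "PID NAME" for every task whose
# /proc/<pid>/status reports state D (uninterruptible sleep).
# $1 = proc root (defaults to /proc).
list_d_state() (
    root=${1:-/proc}
    for status in "$root"/[0-9]*/status; do
        # Tasks can exit while we iterate; skip unreadable entries.
        [ -r "$status" ] || continue
        awk '
            /^Name:/  { name = $2 }
            /^Pid:/   { pid = $2 }
            /^State:/ { state = $2 }
            END       { if (state == "D") print pid, name }
        ' "$status" 2>/dev/null || true
    done
)
```

Running list_d_state with no argument scans the live /proc.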


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (5 preceding siblings ...)
  2024-08-17 12:36 ` bugzilla-daemon
@ 2024-08-17 12:38 ` bugzilla-daemon
  2024-08-17 12:41 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-17 12:38 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #7 from Richard W.M. Jones (rjones@redhat.com) ---
> virt-builder fedora-40 --size=10G --root-password=password:123456

Should be:

virt-builder fedora-40 --size=10G --root-password=password:123456 \
    --format=qcow2


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (6 preceding siblings ...)
  2024-08-17 12:38 ` bugzilla-daemon
@ 2024-08-17 12:41 ` bugzilla-daemon
  2024-08-17 13:58 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-17 12:41 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #8 from Richard W.M. Jones (rjones@redhat.com) ---
> while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

Should be:

while true; do echo noop > /sys/block/vda/queue/scheduler 2>/dev/null ; done

as the guest is using virtio.  (Copied and pasted the instructions from my host
test, but I did use the correct command in the VM test.)
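
Because the device name depends on the bus (sda in the host test, vda under
virtio), a loop that hard-codes one name is easy to get wrong.  The sketch
below is an illustration rather than part of the original instructions: it
collects every matching scheduler file, using the same hd/sd/ubd/vd prefixes
that appear in the libguestfs init script log quoted in this bug.
find_scheduler_files is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print every queue/scheduler file for hd*, sd*,
# ubd* and vd* devices under a sysfs block directory.
# $1 = block directory (defaults to /sys/block).
find_scheduler_files() (
    root=${1:-/sys/block}
    for f in "$root"/hd*/queue/scheduler "$root"/sd*/queue/scheduler \
             "$root"/ubd*/queue/scheduler "$root"/vd*/queue/scheduler; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
)
```

Possible use: for f in $(find_scheduler_files); do echo noop > "$f" 2>/dev/null; done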


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (7 preceding siblings ...)
  2024-08-17 12:41 ` bugzilla-daemon
@ 2024-08-17 13:58 ` bugzilla-daemon
  2024-08-20 15:33 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-17 13:58 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #9 from Richard W.M. Jones (rjones@redhat.com) ---
Also this does *not* reproduce with a 6.8.5 kernel on the same VM, so I really
do think this is a regression.

This looks like a very tedious bisect but I'll have a go if I have time later.


* [Bug 219166] ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (8 preceding siblings ...)
  2024-08-17 13:58 ` bugzilla-daemon
@ 2024-08-20 15:33 ` bugzilla-daemon
  2024-08-20 15:33 ` [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' bugzilla-daemon
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-20 15:33 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #10 from Richard W.M. Jones (rjones@redhat.com) ---
Just to close this one out ...

I couldn't reproduce this issue when I compiled the kernel myself, even though
I was using the exact same .config as Fedora uses, the same tag, building it on
Fedora 40, and Fedora itself does not have any significant downstream patches. 
There were a few differences, for example I'm probably using a slightly
different version of gcc/binutils than the Fedora kernel builders.

So being unable to reproduce it in a self-compiled kernel, I cannot bisect it.

We have worked around the problem, so that's basically as far as I want to take
this bug.  Feel free to close it if you want.


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (9 preceding siblings ...)
  2024-08-20 15:33 ` bugzilla-daemon
@ 2024-08-20 15:33 ` bugzilla-daemon
  2024-09-05  1:04 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-08-20 15:33 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

Richard W.M. Jones (rjones@redhat.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|ext4 hang when setting echo |occasional block layer hang
                   |noop >                      |when setting 'echo noop >
                   |/sys/block/sda/queue/schedu |/sys/block/sda/queue/schedu
                   |ler                         |ler'


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (10 preceding siblings ...)
  2024-08-20 15:33 ` [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' bugzilla-daemon
@ 2024-09-05  1:04 ` bugzilla-daemon
  2024-09-05  7:25 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-05  1:04 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

Lei Ming (tom.leiming@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tom.leiming@gmail.com

--- Comment #11 from Lei Ming (tom.leiming@gmail.com) ---

Hi Richard,

Can you collect the blk-mq debugfs log for this virtio-blk device?

  (cd /sys/kernel/debug/block/vda && find . -type f -exec grep -aH . {} \;)

There is a known I/O hang risk in the MQ case:

https://lore.kernel.org/linux-block/20240903081653.65613-1-songmuchun@bytedance.com/T/#m0212e28431d08e00a3de132ec40b2b907bb07177

If you or anyone can reproduce the issue, please test the above patches.

Thanks,


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (11 preceding siblings ...)
  2024-09-05  1:04 ` bugzilla-daemon
@ 2024-09-05  7:25 ` bugzilla-daemon
  2024-09-05  9:32 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-05  7:25 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #12 from Richard W.M. Jones (rjones@redhat.com) ---
Is it possible to get this from the 'crash' utility (i.e. from kernel structs)?
The VM hangs fairly hard when this problem is hit.

I can try the kernel patches though, will see if I can do that today.


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (12 preceding siblings ...)
  2024-09-05  7:25 ` bugzilla-daemon
@ 2024-09-05  9:32 ` bugzilla-daemon
  2024-09-06 19:46 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-05  9:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #13 from Richard W.M. Jones (rjones@redhat.com) ---
To document for myself and others what I did to reproduce the bug and get the
kernel stack trace ...

(1) libguestfs from git with this patch reverted:
https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305

(2) Run guestfish in a loop until it hangs:

$ while LIBGUESTFS_BACKEND_SETTINGS=force_tcg ./run guestfish -a /dev/null run \
    -vx >& /tmp/log ; do echo -n . ; done


(3) Looking at /tmp/log, we can see it hung just after trying to set the noop
scheduler:

$ tail -5 /tmp/log
+ echo 300
+ for f in /sys/block/sd*/device/timeout
+ echo 300
+ for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
+ echo noop

(4) Check the log for the kernel version, install the corresponding kernel
debuginfo.

(5) Get virsh to produce a core dump of the VM:

$ virsh list 
 Id     Name                       State
--------------------------------------------
 1950   guestfs-lsdbxy71u4jg1w6x   running

$ virsh dump 1950 /var/tmp/core --memory-only

Domain '1950' dumped to /var/tmp/core

(6) Open in 'crash':

$ crash \
    /usr/lib/debug/lib/modules/6.11.0-0.rc5.20240830git20371ba12063.47.fc42.x86_64/vmlinux \
    /var/tmp/core

(7) List processes and find the one which hung:

crash> ps 
...
      230      73   0  ffffa01f83c58000  UN   0.3    11608     3340  modprobe

(8) Get stack trace from the hung process:

crash> set 230
    PID: 230
COMMAND: "modprobe"
   TASK: ffffa01f83c58000  [THREAD_INFO: ffffa01f83c58000]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 230      TASK: ffffa01f83c58000  CPU: 0    COMMAND: "modprobe"
 #0 [ffffc1db0030f840] __schedule at ffffffff921906d0
 #1 [ffffc1db0030f8f8] schedule at ffffffff92191a27
 #2 [ffffc1db0030f908] __bio_queue_enter at ffffffff917e17dc
 #3 [ffffc1db0030f968] blk_mq_submit_bio at ffffffff917f3b4c
 #4 [ffffc1db0030f9f0] __submit_bio at ffffffff917e23fc
 #5 [ffffc1db0030fa58] submit_bio_noacct_nocheck at ffffffff917e28e6
 #6 [ffffc1db0030fac0] ext4_mpage_readpages at ffffffff915cef7c
 #7 [ffffc1db0030fbe0] read_pages at ffffffff91381cda
 #8 [ffffc1db0030fc40] page_cache_ra_unbounded at ffffffff91381fb5
 #9 [ffffc1db0030fca8] filemap_fault at ffffffff91376175
#10 [ffffc1db0030fd48] __do_fault at ffffffff913d1755
#11 [ffffc1db0030fd70] do_fault at ffffffff913d22e5
#12 [ffffc1db0030fda0] __handle_mm_fault at ffffffff913d8b2b
#13 [ffffc1db0030fe88] handle_mm_fault at ffffffff913d9472
#14 [ffffc1db0030fec8] do_user_addr_fault at ffffffff910b34ea
#15 [ffffc1db0030ff28] exc_page_fault at ffffffff92187e4e
#16 [ffffc1db0030ff50] asm_exc_page_fault at ffffffff922012a6
    RIP: 000055bb085508d8  RSP: 00007ffc3e731900  RFLAGS: 00010206
    RAX: 00000000000becd6  RBX: 00007f39925d1cd6  RCX: 000055bb0856592a
    RDX: 00007ffc3e733c70  RSI: 00000000200becd6  RDI: 000055bb1a712970
    RBP: 00007ffc3e731b60   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200becd6
    R13: 000055bb1a712970  R14: 00007ffc3e733c70  R15: 000055bb0856a708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
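
When many tasks are dumped at once (for example with crash's 'foreach bt'), it
can help to pick out the ones blocked at the same frame as the trace above.
The following is a minimal sketch that filters saved backtrace text for tasks
with a given frame such as __bio_queue_enter.  tasks_in_frame is a hypothetical
helper, and it assumes each trace starts with a "PID: ... TASK: ..." header
line in the format shown here.

```shell
#!/bin/sh
# Hypothetical helper: read crash "bt" text on stdin and print
# "PID COMMAND" for each task whose trace contains frame function $1.
tasks_in_frame() {
    awk -v func_name="$1" '
        # Header line: PID: 230  TASK: ...  CPU: 0  COMMAND: "modprobe"
        $1 == "PID:" && $3 == "TASK:" {
            pid = $2
            cmd = $NF          # last word of the command name only
            gsub(/"/, "", cmd)
            seen = 0
            next
        }
        # Frame line: #2 [ffffc1db0030f908] __bio_queue_enter at ffff...
        $1 ~ /^#[0-9]+$/ && $3 == func_name && !seen {
            print pid, cmd
            seen = 1
        }
    '
}
```

Possible use, with the backtrace text saved to a file of your choosing:
tasks_in_frame __bio_queue_enter < /tmp/foreach-bt.txt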


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (13 preceding siblings ...)
  2024-09-05  9:32 ` bugzilla-daemon
@ 2024-09-06 19:46 ` bugzilla-daemon
  2024-09-06 20:28 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-06 19:46 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #14 from Richard W.M. Jones (rjones@redhat.com) ---
I think I have bisected this to:

commit af2814149883e2c1851866ea2afcd8eadc040f79
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:38 2024 +0200

    block: freeze the queue in queue_attr_store

    queue_attr_store updates attributes used to control generating I/O, and
    can cause malformed bios if changed with I/O in flight.  Freeze the queue
    in common code instead of adding it to almost every attribute.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20240617060532.127975-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (14 preceding siblings ...)
  2024-09-06 19:46 ` bugzilla-daemon
@ 2024-09-06 20:28 ` bugzilla-daemon
  2024-09-06 20:32 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-06 20:28 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #15 from Richard W.M. Jones (rjones@redhat.com) ---
Created attachment 306825
  --> https://bugzilla.kernel.org/attachment.cgi?id=306825&action=edit
'foreach bt' in crash utility


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (15 preceding siblings ...)
  2024-09-06 20:28 ` bugzilla-daemon
@ 2024-09-06 20:32 ` bugzilla-daemon
  2024-09-07  7:49 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-06 20:32 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #16 from Richard W.M. Jones (rjones@redhat.com) ---
Reverting commit af2814149883e2c1851866ea2afcd8eadc040f79 on top of current git
head (788220eee30d6) either fixes the problem or hides it well enough that my
reproducer no longer triggers it.

The revert is not clean, but only 2 hunks fail and they were trivial to fix up.


* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (16 preceding siblings ...)
  2024-09-06 20:32 ` bugzilla-daemon
@ 2024-09-07  7:49 ` bugzilla-daemon
  2024-09-07 11:09 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-07  7:49 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

--- Comment #17 from Richard W.M. Jones (rjones@redhat.com) ---
http://oirase.annexia.org/tmp/kbug-219166/

vmlinux and memory core dump captured when the bug happened.  You
can open this using the 'crash' utility.  The corresponding kernel is:

https://koji.fedoraproject.org/koji/buildinfo?buildID=2538598

* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (17 preceding siblings ...)
  2024-09-07  7:49 ` bugzilla-daemon
@ 2024-09-07 11:09 ` bugzilla-daemon
  2024-09-07 11:10 ` bugzilla-daemon
  2024-09-18  8:36 ` bugzilla-daemon
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-07 11:09 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

Richard W.M. Jones (rjones@redhat.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|ext4                        |Block Layer
            Product|File System                 |IO/Storage

--- Comment #18 from Richard W.M. Jones (rjones@redhat.com) ---
Upstream discussion:
https://lore.kernel.org/linux-block/20240907073522.GW1450@redhat.com/T/

* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (18 preceding siblings ...)
  2024-09-07 11:09 ` bugzilla-daemon
@ 2024-09-07 11:10 ` bugzilla-daemon
  2024-09-18  8:36 ` bugzilla-daemon
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-07 11:10 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

Richard W.M. Jones (rjones@redhat.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|                            |6.11.0

* [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
  2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
                   ` (19 preceding siblings ...)
  2024-09-07 11:10 ` bugzilla-daemon
@ 2024-09-18  8:36 ` bugzilla-daemon
  20 siblings, 0 replies; 22+ messages in thread
From: bugzilla-daemon @ 2024-09-18  8:36 UTC (permalink / raw)
  To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=219166

Richard W.M. Jones (rjones@redhat.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                URL|                            |https://git.kernel.org/pub/
                   |                            |scm/linux/kernel/git/torval
                   |                            |ds/linux.git/commit/?id=734
                   |                            |e1a8603128ac31526c477a39456
                   |                            |be5f4092b6
         Resolution|---                         |CODE_FIX

--- Comment #19 from Richard W.M. Jones (rjones@redhat.com) ---
Closing as fixed in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=734e1a8603128ac31526c477a39456be5f4092b6

end of thread, other threads:[~2024-09-18  8:36 UTC | newest]

Thread overview: 22+ messages
2024-08-16  7:43 [Bug 219166] New: ext4 hang when setting echo noop > /sys/block/sda/queue/scheduler bugzilla-daemon
2024-08-16 16:40 ` [Bug 219166] " bugzilla-daemon
2024-08-16 17:06 ` bugzilla-daemon
2024-08-16 20:36 ` bugzilla-daemon
2024-08-16 20:51 ` bugzilla-daemon
2024-08-17 10:22 ` bugzilla-daemon
2024-08-17 12:36 ` bugzilla-daemon
2024-08-17 12:38 ` bugzilla-daemon
2024-08-17 12:41 ` bugzilla-daemon
2024-08-17 13:58 ` bugzilla-daemon
2024-08-20 15:33 ` bugzilla-daemon
2024-08-20 15:33 ` [Bug 219166] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' bugzilla-daemon
2024-09-05  1:04 ` bugzilla-daemon
2024-09-05  7:25 ` bugzilla-daemon
2024-09-05  9:32 ` bugzilla-daemon
2024-09-06 19:46 ` bugzilla-daemon
2024-09-06 20:28 ` bugzilla-daemon
2024-09-06 20:32 ` bugzilla-daemon
2024-09-07  7:49 ` bugzilla-daemon
2024-09-07 11:09 ` bugzilla-daemon
2024-09-07 11:10 ` bugzilla-daemon
2024-09-18  8:36 ` bugzilla-daemon
