From: Paul Gortmaker <paul.gortmaker@windriver.com>
To: John Keeping <john@metanate.com>
Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>,
Steven Rostedt <rostedt@goodmis.org>,
Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH v3] sched/core: Always flush pending blk_plug
Date: Sat, 9 Jul 2022 12:19:56 -0400 [thread overview]
Message-ID: <20220709161956.GA20046@windriver.com> (raw)
In-Reply-To: <20220708162702.1758865-1-john@metanate.com>
[[PATCH v3] sched/core: Always flush pending blk_plug] On 08/07/2022 (Fri 17:27) John Keeping wrote:
> With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> normal priority tasks (SCHED_OTHER, nice level zero):
>
> INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> Not tainted 5.15.49-rt46 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
> Workqueue: writeback wb_workfn (flush-7:0)
This is great. I was on-and-off looking at an ext4 deadlock triggered
by LTP's "ftest01" - and was just potting my notes together this morning
to clarify for myself what I thought was going on before I forgot.
Which is good, because I then could recognize the very similar patterns
in what you saw to what I'd observed/deduced. Otherise I'd probably not
have had your backtraces grab my attention at all.
Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
LTP/ext4 deadlock issue. So...
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
As I'd already put together my ext4 debug info, I'll include it below
in case it helps future people searching for LTP and ftest01.
Thanks,
Paul.
-----------------
There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP
"ftest01"
Reproducer details are at the bottom. What seems to happen is as follows:
The kworker doing writeback wb_workfn (pid 213 below) stalls on the
one and only down_write in ext4_map_blocks() and never moves again.
/*
* New blocks allocate and/or writing to unwritten extent
* will possibly result in updating i_data, so we take
* the write lock of i_data_sem, and call get_block()
* with create == 1 flag.
*/
down_write(&EXT4_I(inode)->i_data_sem);
So who is blocking the kworker? There are two of the ftest01 processes
(pid 818 and pid 821) doing ext4 truncate which look like this:
ext4_truncate+0x46a/0x660
--> ext4_ext_truncate+0xd6/0xf0
--> ext4_ext_remove_space+0x59f/0x1830
--> jbd2_journal_get_write_access+0x51/0x80
--> do_get_write_access+0x2ad/0x550
--> complete_all+0x60/0x60
--> bit_wait_io+0x11/0x60
They too are stalled in D state waiting there forever - presumably
waiting on the kwowrker to make the progress they are waiting for.
Here is where I think lockdep shows us the problem:
5 locks held by ftest01/818:
[...]
#4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
5 locks held by ftest01/821:
[...]
#4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
Yes - that is the one the kworker was trying the down_write() on.
Also note that the bit_wait is a big telltale sign here - there is only
one in the jbd2's do_get_write_access() function:
/*
* There is one case we have to be very careful about. If the
* committing transaction is currently writing this buffer out to disk
* and has NOT made a copy-out, then we cannot modify the buffer
* contents at all right now. The essence of copy-out is that it is
* the extra copy, not the primary copy, which gets journaled. If the
* primary copy is already going to disk then we cannot do copy-out
* here.
*/
if (buffer_shadow(bh)) {
JBUFFER_TRACE(jh, "on shadow: sleep");
spin_unlock(&jh->b_state_lock);
wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
goto repeat;
}
Presumaby this is a corner case that preempt-rt makes easier to trigger?
The full trace with interspersed lockdep info follows.
Various detailed reproducer information follows that.
-----------------------
sysrq-w backtracce information interspersed with sysrq-d lockdep info
on 5.19-rc2-rt3 kernel:
task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-7:0)
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __this_cpu_preempt_check+0x13/0x20
? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
schedule+0x48/0xc0
__rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
rwbase_write_lock+0x98/0x6e0
down_write+0x46/0x50
? ext4_map_blocks+0x156/0x5c0
ext4_map_blocks+0x156/0x5c0
ext4_writepages+0xa1a/0x1290
? __this_cpu_preempt_check+0x13/0x20
? lockdep_hardirqs_on+0xcd/0x150
do_writepages+0xd2/0x1a0
__writeback_single_inode+0x64/0x850
writeback_sb_inodes+0x22d/0x5b0
__writeback_inodes_wb+0x67/0xe0
wb_writeback+0x269/0x5f0
? get_nr_inodes+0x49/0x70
wb_workfn+0x43d/0x780
? __this_cpu_preempt_check+0x13/0x20
process_one_work+0x286/0x710
worker_thread+0x3c/0x3d0
? process_one_work+0x710/0x710
kthread+0x13b/0x150
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
6 locks held by kworker/u129:34/213:
#0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
#1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
#2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
#3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
#4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0
task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x329/0xae0
schedule+0x48/0xc0
jbd2_journal_wait_updates+0x7c/0xf0
? schedstat_stop+0x10/0x10
jbd2_journal_commit_transaction+0x284/0x20e0
? __this_cpu_preempt_check+0x13/0x20
? find_held_lock+0x35/0xa0
? _raw_spin_unlock_irqrestore+0x86/0x8f
? _raw_spin_unlock_irqrestore+0x86/0x8f
? try_to_del_timer_sync+0x53/0x80
? trace_hardirqs_on+0x3b/0x120
? try_to_del_timer_sync+0x53/0x80
kjournald2+0xd5/0x2b0
? schedstat_stop+0x10/0x10
? commit_timeout+0x20/0x20
kthread+0x13b/0x150
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
</TASK>
1 lock held by ftest01/817:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? bit_wait+0x60/0x60
schedule+0x48/0xc0
io_schedule+0x16/0x40
bit_wait_io+0x11/0x60
__wait_on_bit_lock+0x5e/0xb0
out_of_line_wait_on_bit_lock+0x93/0xb0
? complete_all+0x60/0x60
__lock_buffer+0x3f/0x50
do_get_write_access+0x2ad/0x550
jbd2_journal_get_write_access+0x51/0x80
__ext4_journal_get_write_access+0xf9/0x1a0
ext4_ext_get_access.isra.32+0x34/0x50
ext4_ext_remove_space+0x59f/0x1830
? ext4_ext_truncate+0xa2/0xf0
? ext4_ext_truncate+0x5e/0xf0
ext4_ext_truncate+0xd6/0xf0
ext4_truncate+0x46a/0x660
ext4_setattr+0x6db/0xa40
notify_change+0x3e5/0x540
do_truncate+0x7b/0xc0
? do_truncate+0x7b/0xc0
do_sys_ftruncate+0x164/0x2f0
__x64_sys_ftruncate+0x1b/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
</TASK>
5 locks held by ftest01/818:
#0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
#1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
#2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
#3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
</TASK>
1 lock held by ftest01/819:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? __x64_sys_tee+0xd0/0xd0
? __x64_sys_tee+0xd0/0xd0
schedule+0x48/0xc0
wb_wait_for_completion+0x62/0x90
? schedstat_stop+0x10/0x10
sync_inodes_sb+0xdd/0x460
? __this_cpu_preempt_check+0x13/0x20
? __x64_sys_tee+0xd0/0xd0
sync_inodes_one_sb+0x15/0x20
iterate_supers+0x94/0x100
ksys_sync+0x42/0xa0
__do_sys_sync+0xe/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae2cdb
RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
</TASK>
1 lock held by ftest01/820:
#0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100
task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x329/0xae0
? bit_wait+0x60/0x60
schedule+0x48/0xc0
io_schedule+0x16/0x40
bit_wait_io+0x11/0x60
__wait_on_bit_lock+0x5e/0xb0
out_of_line_wait_on_bit_lock+0x93/0xb0
? complete_all+0x60/0x60
__lock_buffer+0x3f/0x50
do_get_write_access+0x2ad/0x550
jbd2_journal_get_write_access+0x51/0x80
__ext4_journal_get_write_access+0xf9/0x1a0
ext4_ext_get_access.isra.32+0x34/0x50
ext4_ext_remove_space+0x59f/0x1830
? ext4_ext_truncate+0xa2/0xf0
? ext4_ext_truncate+0x5e/0xf0
ext4_ext_truncate+0xd6/0xf0
ext4_truncate+0x46a/0x660
ext4_setattr+0x6db/0xa40
notify_change+0x3e5/0x540
do_truncate+0x7b/0xc0
? do_truncate+0x7b/0xc0
do_sys_ftruncate+0x164/0x2f0
__x64_sys_ftruncate+0x1b/0x20
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f3f9dae3f7b
RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
</TASK>
5 locks held by ftest01/821:
#0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
#1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
#2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
#3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
#4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660
--------------------------------------
Reproducer:
Get latest preempt-rt kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
make defconfig
manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
make oldconfig and take defaults for other options hidden behind EXPERT
can optionally enable LOCKDEP options so sysrq-d works.
Get LTP ; build + install
https://github.com/linux-test-project/ltp.git
Boot and then run reproducer:
dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
mkfs.ext4 -F /root/tmp_mount_file
mount /root/tmp_mount_file /mnt/
/opt/ltp/runltp -f fs -s ftest01 -d /mnt/
umount /mnt
loop device isn't strictly necessary but contains fs fallout when
inevitable reboot is required to get rid of D state processes.
Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
Test hangs forever on v5.15-rt and latest v5.19-rc-rt. Test threads
in D state - any other i/o directed at /mnt also hangs in D state.
next prev parent reply other threads:[~2022-07-09 16:20 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
2022-07-09 16:19 ` Paul Gortmaker [this message]
2022-07-10 1:01 ` Hillf Danton
2022-07-12 18:29 ` Paul Gortmaker
2022-07-12 22:38 ` Hillf Danton
2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping
2022-07-25 8:56 ` [PATCH v3] " Sebastian Andrzej Siewior
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220709161956.GA20046@windriver.com \
--to=paul.gortmaker@windriver.com \
--cc=bigeasy@linutronix.de \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=john@metanate.com \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rt-users@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=tglx@linutronix.de \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.