* [PATCH v3] sched/core: Always flush pending blk_plug
@ 2022-07-08 16:27 John Keeping
2022-07-09 16:19 ` Paul Gortmaker
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: John Keeping @ 2022-07-08 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-rt-users, John Keeping, Sebastian Andrzej Siewior,
Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):
INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.
The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.
It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").
As described in [1]:
The schedule process must distinguish between blocking on a regular
sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
and rwlock):
- rwsem and mutex must flush block requests (blk_schedule_flush_plug())
even if blocked on a lock. This can not deadlock because this also
happens for non-RT.
There should be a warning if the scheduling point is within a RCU read
section.
- spinlock and rwlock must not flush block requests. This will deadlock
if the callback attempts to acquire a lock which is already acquired.
Similarly to being preempted, there should be no warning if the
scheduling point is within a RCU read section.
and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: John Keeping <john@metanate.com>
---
v3:
- Add SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT) as suggested by
Peter
v2:
- Add Steven's R-b and update the commit message with his suggested
quote from [1]
include/linux/sched/rt.h | 8 --------
kernel/sched/core.c | 8 ++++++--
2 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index e5af028c08b49..994c25640e156 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
}
extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return tsk->pi_blocked_on != NULL;
-}
#else
static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
{
return NULL;
}
# define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
- return false;
-}
#endif
extern void normalize_rt_tasks(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d4660a1915b3..71d6385ece83f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6578,8 +6578,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
io_wq_worker_sleeping(tsk);
}
- if (tsk_is_pi_blocked(tsk))
- return;
+ /*
+ * spinlock and rwlock must not flush block requests. This will
+ * deadlock if the callback attempts to acquire a lock which is
+ * already acquired.
+ */
+ SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
/*
* If we are going to sleep and we have plugged IO queued,
--
2.37.0
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH v3] sched/core: Always flush pending blk_plug 2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping @ 2022-07-09 16:19 ` Paul Gortmaker 2022-07-10 1:01 ` Hillf Danton 2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping 2022-07-25 8:56 ` [PATCH v3] " Sebastian Andrzej Siewior 2 siblings, 1 reply; 7+ messages in thread From: Paul Gortmaker @ 2022-07-09 16:19 UTC (permalink / raw) To: John Keeping Cc: linux-kernel, linux-rt-users, Sebastian Andrzej Siewior, Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider [[PATCH v3] sched/core: Always flush pending blk_plug] On 08/07/2022 (Fri 17:27) John Keeping wrote: > With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two > normal priority tasks (SCHED_OTHER, nice level zero): > > INFO: task kworker/u8:0:8 blocked for more than 491 seconds. > Not tainted 5.15.49-rt46 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 > Workqueue: writeback wb_workfn (flush-7:0) This is great. I was on-and-off looking at an ext4 deadlock triggered by LTP's "ftest01" - and was just potting my notes together this morning to clarify for myself what I thought was going on before I forgot. Which is good, because I then could recognize the very similar patterns in what you saw to what I'd observed/deduced. Otherise I'd probably not have had your backtraces grab my attention at all. Long story short, I applied this to v5.19-rc3-rt5 and it resolves the LTP/ext4 deadlock issue. So... Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com> As I'd already put together my ext4 debug info, I'll include it below in case it helps future people searching for LTP and ftest01. Thanks, Paul. ----------------- There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP "ftest01" Reproducer details are at the bottom. What seems to happen is as follows: The kworker doing writeback wb_workfn (pid 213 below) stalls on the one and only down_write in ext4_map_blocks() and never moves again. /* * New blocks allocate and/or writing to unwritten extent * will possibly result in updating i_data, so we take * the write lock of i_data_sem, and call get_block() * with create == 1 flag. */ down_write(&EXT4_I(inode)->i_data_sem); So who is blocking the kworker? There are two of the ftest01 processes (pid 818 and pid 821) doing ext4 truncate which look like this: ext4_truncate+0x46a/0x660 --> ext4_ext_truncate+0xd6/0xf0 --> ext4_ext_remove_space+0x59f/0x1830 --> jbd2_journal_get_write_access+0x51/0x80 --> do_get_write_access+0x2ad/0x550 --> complete_all+0x60/0x60 --> bit_wait_io+0x11/0x60 They too are stalled in D state waiting there forever - presumably waiting on the kwowrker to make the progress they are waiting for. Here is where I think lockdep shows us the problem: 5 locks held by ftest01/818: [...] #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 5 locks held by ftest01/821: [...] #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 Yes - that is the one the kworker was trying the down_write() on. Also note that the bit_wait is a big telltale sign here - there is only one in the jbd2's do_get_write_access() function: /* * There is one case we have to be very careful about. If the * committing transaction is currently writing this buffer out to disk * and has NOT made a copy-out, then we cannot modify the buffer * contents at all right now. The essence of copy-out is that it is * the extra copy, not the primary copy, which gets journaled. If the * primary copy is already going to disk then we cannot do copy-out * here. */ if (buffer_shadow(bh)) { JBUFFER_TRACE(jh, "on shadow: sleep"); spin_unlock(&jh->b_state_lock); wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE); goto repeat; } Presumaby this is a corner case that preempt-rt makes easier to trigger? The full trace with interspersed lockdep info follows. Various detailed reproducer information follows that. ----------------------- sysrq-w backtracce information interspersed with sysrq-d lockdep info on 5.19-rc2-rt3 kernel: task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __schedule+0x329/0xae0 ? __this_cpu_preempt_check+0x13/0x20 ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40 schedule+0x48/0xc0 __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40 rwbase_write_lock+0x98/0x6e0 down_write+0x46/0x50 ? ext4_map_blocks+0x156/0x5c0 ext4_map_blocks+0x156/0x5c0 ext4_writepages+0xa1a/0x1290 ? __this_cpu_preempt_check+0x13/0x20 ? lockdep_hardirqs_on+0xcd/0x150 do_writepages+0xd2/0x1a0 __writeback_single_inode+0x64/0x850 writeback_sb_inodes+0x22d/0x5b0 __writeback_inodes_wb+0x67/0xe0 wb_writeback+0x269/0x5f0 ? get_nr_inodes+0x49/0x70 wb_workfn+0x43d/0x780 ? __this_cpu_preempt_check+0x13/0x20 process_one_work+0x286/0x710 worker_thread+0x3c/0x3d0 ? process_one_work+0x710/0x710 kthread+0x13b/0x150 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> 6 locks held by kworker/u129:34/213: #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710 #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710 #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50 #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0 #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0 #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0 task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000 Call Trace: <TASK> __schedule+0x329/0xae0 schedule+0x48/0xc0 jbd2_journal_wait_updates+0x7c/0xf0 ? schedstat_stop+0x10/0x10 jbd2_journal_commit_transaction+0x284/0x20e0 ? __this_cpu_preempt_check+0x13/0x20 ? find_held_lock+0x35/0xa0 ? _raw_spin_unlock_irqrestore+0x86/0x8f ? _raw_spin_unlock_irqrestore+0x86/0x8f ? try_to_del_timer_sync+0x53/0x80 ? trace_hardirqs_on+0x3b/0x120 ? try_to_del_timer_sync+0x53/0x80 kjournald2+0xd5/0x2b0 ? schedstat_stop+0x10/0x10 ? commit_timeout+0x20/0x20 kthread+0x13b/0x150 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002 Call Trace: <TASK> __schedule+0x329/0xae0 ? __x64_sys_tee+0xd0/0xd0 ? __x64_sys_tee+0xd0/0xd0 schedule+0x48/0xc0 wb_wait_for_completion+0x62/0x90 ? schedstat_stop+0x10/0x10 sync_inodes_sb+0xdd/0x460 ? __this_cpu_preempt_check+0x13/0x20 ? __x64_sys_tee+0xd0/0xd0 sync_inodes_one_sb+0x15/0x20 iterate_supers+0x94/0x100 ksys_sync+0x42/0xa0 __do_sys_sync+0xe/0x20 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f3f9dae2cdb RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340 RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003 </TASK> 1 lock held by ftest01/817: #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100 task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002 Call Trace: <TASK> __schedule+0x329/0xae0 ? bit_wait+0x60/0x60 schedule+0x48/0xc0 io_schedule+0x16/0x40 bit_wait_io+0x11/0x60 __wait_on_bit_lock+0x5e/0xb0 out_of_line_wait_on_bit_lock+0x93/0xb0 ? complete_all+0x60/0x60 __lock_buffer+0x3f/0x50 do_get_write_access+0x2ad/0x550 jbd2_journal_get_write_access+0x51/0x80 __ext4_journal_get_write_access+0xf9/0x1a0 ext4_ext_get_access.isra.32+0x34/0x50 ext4_ext_remove_space+0x59f/0x1830 ? ext4_ext_truncate+0xa2/0xf0 ? ext4_ext_truncate+0x5e/0xf0 ext4_ext_truncate+0xd6/0xf0 ext4_truncate+0x46a/0x660 ext4_setattr+0x6db/0xa40 notify_change+0x3e5/0x540 do_truncate+0x7b/0xc0 ? do_truncate+0x7b/0xc0 do_sys_ftruncate+0x164/0x2f0 __x64_sys_ftruncate+0x1b/0x20 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f3f9dae3f7b RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003 RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0 R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003 </TASK> 5 locks held by ftest01/818: #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20 #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0 #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40 #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0 #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002 Call Trace: <TASK> __schedule+0x329/0xae0 ? __x64_sys_tee+0xd0/0xd0 ? __x64_sys_tee+0xd0/0xd0 schedule+0x48/0xc0 wb_wait_for_completion+0x62/0x90 ? schedstat_stop+0x10/0x10 sync_inodes_sb+0xdd/0x460 ? __this_cpu_preempt_check+0x13/0x20 ? __x64_sys_tee+0xd0/0xd0 sync_inodes_one_sb+0x15/0x20 iterate_supers+0x94/0x100 ksys_sync+0x42/0xa0 __do_sys_sync+0xe/0x20 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f3f9dae2cdb RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340 RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003 </TASK> 1 lock held by ftest01/819: #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100 task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002 Call Trace: <TASK> __schedule+0x329/0xae0 ? __x64_sys_tee+0xd0/0xd0 ? __x64_sys_tee+0xd0/0xd0 schedule+0x48/0xc0 wb_wait_for_completion+0x62/0x90 ? schedstat_stop+0x10/0x10 sync_inodes_sb+0xdd/0x460 ? __this_cpu_preempt_check+0x13/0x20 ? __x64_sys_tee+0xd0/0xd0 sync_inodes_one_sb+0x15/0x20 iterate_supers+0x94/0x100 ksys_sync+0x42/0xa0 __do_sys_sync+0xe/0x20 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f3f9dae2cdb RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340 RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003 </TASK> 1 lock held by ftest01/820: #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100 task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002 Call Trace: <TASK> __schedule+0x329/0xae0 ? bit_wait+0x60/0x60 schedule+0x48/0xc0 io_schedule+0x16/0x40 bit_wait_io+0x11/0x60 __wait_on_bit_lock+0x5e/0xb0 out_of_line_wait_on_bit_lock+0x93/0xb0 ? complete_all+0x60/0x60 __lock_buffer+0x3f/0x50 do_get_write_access+0x2ad/0x550 jbd2_journal_get_write_access+0x51/0x80 __ext4_journal_get_write_access+0xf9/0x1a0 ext4_ext_get_access.isra.32+0x34/0x50 ext4_ext_remove_space+0x59f/0x1830 ? ext4_ext_truncate+0xa2/0xf0 ? ext4_ext_truncate+0x5e/0xf0 ext4_ext_truncate+0xd6/0xf0 ext4_truncate+0x46a/0x660 ext4_setattr+0x6db/0xa40 notify_change+0x3e5/0x540 do_truncate+0x7b/0xc0 ? do_truncate+0x7b/0xc0 do_sys_ftruncate+0x164/0x2f0 __x64_sys_ftruncate+0x1b/0x20 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f3f9dae3f7b RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003 RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0 R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003 </TASK> 5 locks held by ftest01/821: #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20 #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0 #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40 #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0 #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 -------------------------------------- Reproducer: Get latest preempt-rt kernel: git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch) make defconfig manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT make oldconfig and take defaults for other options hidden behind EXPERT can optionally enable LOCKDEP options so sysrq-d works. Get LTP ; build + install https://github.com/linux-test-project/ltp.git Boot and then run reproducer: dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512 mkfs.ext4 -F /root/tmp_mount_file mount /root/tmp_mount_file /mnt/ /opt/ltp/runltp -f fs -s ftest01 -d /mnt/ umount /mnt loop device isn't strictly necessary but contains fs fallout when inevitable reboot is required to get rid of D state processes. Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total. Test hangs forever on v5.15-rt and latest v5.19-rc-rt. Test threads in D state - any other i/o directed at /mnt also hangs in D state. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] sched/core: Always flush pending blk_plug 2022-07-09 16:19 ` Paul Gortmaker @ 2022-07-10 1:01 ` Hillf Danton 2022-07-12 18:29 ` Paul Gortmaker 0 siblings, 1 reply; 7+ messages in thread From: Hillf Danton @ 2022-07-10 1:01 UTC (permalink / raw) To: Paul Gortmaker Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra, Thomas Gleixner, Mel Gorman, Theodore Ts'o On Sat, 9 Jul 2022 12:19:56 -0400 Paul Gortmaker wrote: > On 08/07/2022 (Fri 17:27) John Keeping wrote: > > With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two > > normal priority tasks (SCHED_OTHER, nice level zero): > > > > INFO: task kworker/u8:0:8 blocked for more than 491 seconds. > > Not tainted 5.15.49-rt46 #1 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 > > Workqueue: writeback wb_workfn (flush-7:0) > > This is great. I was on-and-off looking at an ext4 deadlock triggered > by LTP's "ftest01" - and was just potting my notes together this morning > to clarify for myself what I thought was going on before I forgot. > > Which is good, because I then could recognize the very similar patterns > in what you saw to what I'd observed/deduced. Otherise I'd probably not > have had your backtraces grab my attention at all. > > Long story short, I applied this to v5.19-rc3-rt5 and it resolves the > LTP/ext4 deadlock issue. So... > > Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com> > > As I'd already put together my ext4 debug info, I'll include it below > in case it helps future people searching for LTP and ftest01. > > Thanks, > Paul. > > ----------------- > > > There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP > "ftest01" > > Reproducer details are at the bottom. What seems to happen is as follows: > > The kworker doing writeback wb_workfn (pid 213 below) stalls on the > one and only down_write in ext4_map_blocks() and never moves again. > > /* > * New blocks allocate and/or writing to unwritten extent > * will possibly result in updating i_data, so we take > * the write lock of i_data_sem, and call get_block() > * with create == 1 flag. > */ > down_write(&EXT4_I(inode)->i_data_sem); > > So who is blocking the kworker? There are two of the ftest01 processes > (pid 818 and pid 821) doing ext4 truncate which look like this: > > ext4_truncate+0x46a/0x660 > --> ext4_ext_truncate+0xd6/0xf0 > --> ext4_ext_remove_space+0x59f/0x1830 > --> jbd2_journal_get_write_access+0x51/0x80 > --> do_get_write_access+0x2ad/0x550 > --> complete_all+0x60/0x60 > --> bit_wait_io+0x11/0x60 > > They too are stalled in D state waiting there forever - presumably > waiting on the kwowrker to make the progress they are waiting for. > > Here is where I think lockdep shows us the problem: > > 5 locks held by ftest01/818: > [...] > #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 > > 5 locks held by ftest01/821: > [...] > #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 > > Yes - that is the one the kworker was trying the down_write() on. > > Also note that the bit_wait is a big telltale sign here - there is only > one in the jbd2's do_get_write_access() function: > > /* > * There is one case we have to be very careful about. If the > * committing transaction is currently writing this buffer out to disk > * and has NOT made a copy-out, then we cannot modify the buffer > * contents at all right now. The essence of copy-out is that it is > * the extra copy, not the primary copy, which gets journaled. If the > * primary copy is already going to disk then we cannot do copy-out > * here. > */ > if (buffer_shadow(bh)) { > JBUFFER_TRACE(jh, "on shadow: sleep"); > spin_unlock(&jh->b_state_lock); > wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE); > goto repeat; > } > > Presumaby this is a corner case that preempt-rt makes easier to trigger? > > The full trace with interspersed lockdep info follows. > > Various detailed reproducer information follows that. > > ----------------------- > > sysrq-w backtracce information interspersed with sysrq-d lockdep info > on 5.19-rc2-rt3 kernel: > > task:kworker/u129:34 state:D stack: 0 pid: 213 ppid: 2 flags:0x00004000 > Workqueue: writeback wb_workfn (flush-7:0) > Call Trace: > <TASK> > __schedule+0x329/0xae0 > ? __this_cpu_preempt_check+0x13/0x20 > ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40 > schedule+0x48/0xc0 > __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40 > rwbase_write_lock+0x98/0x6e0 > down_write+0x46/0x50 > ? ext4_map_blocks+0x156/0x5c0 > ext4_map_blocks+0x156/0x5c0 > ext4_writepages+0xa1a/0x1290 > ? __this_cpu_preempt_check+0x13/0x20 > ? lockdep_hardirqs_on+0xcd/0x150 > do_writepages+0xd2/0x1a0 > __writeback_single_inode+0x64/0x850 > writeback_sb_inodes+0x22d/0x5b0 > __writeback_inodes_wb+0x67/0xe0 > wb_writeback+0x269/0x5f0 > ? get_nr_inodes+0x49/0x70 > wb_workfn+0x43d/0x780 > ? __this_cpu_preempt_check+0x13/0x20 > process_one_work+0x286/0x710 > worker_thread+0x3c/0x3d0 > ? process_one_work+0x710/0x710 > kthread+0x13b/0x150 > ? kthread_complete_and_exit+0x20/0x20 > ret_from_fork+0x1f/0x30 > </TASK> > > 6 locks held by kworker/u129:34/213: > #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710 > #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710 > #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50 > #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0 > #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0 > #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0 > > > task:jbd2/loop0-8 state:D stack: 0 pid: 679 ppid: 2 flags:0x00004000 > Call Trace: > <TASK> > __schedule+0x329/0xae0 > schedule+0x48/0xc0 > jbd2_journal_wait_updates+0x7c/0xf0 > ? schedstat_stop+0x10/0x10 > jbd2_journal_commit_transaction+0x284/0x20e0 > ? __this_cpu_preempt_check+0x13/0x20 > ? find_held_lock+0x35/0xa0 > ? _raw_spin_unlock_irqrestore+0x86/0x8f > ? _raw_spin_unlock_irqrestore+0x86/0x8f > ? try_to_del_timer_sync+0x53/0x80 > ? trace_hardirqs_on+0x3b/0x120 > ? try_to_del_timer_sync+0x53/0x80 > kjournald2+0xd5/0x2b0 > ? schedstat_stop+0x10/0x10 > ? commit_timeout+0x20/0x20 > kthread+0x13b/0x150 > ? kthread_complete_and_exit+0x20/0x20 > ret_from_fork+0x1f/0x30 > </TASK> > > > task:ftest01 state:D stack: 0 pid: 817 ppid: 816 flags:0x00004002 > Call Trace: > <TASK> > __schedule+0x329/0xae0 > ? __x64_sys_tee+0xd0/0xd0 > ? __x64_sys_tee+0xd0/0xd0 > schedule+0x48/0xc0 > wb_wait_for_completion+0x62/0x90 > ? schedstat_stop+0x10/0x10 > sync_inodes_sb+0xdd/0x460 > ? __this_cpu_preempt_check+0x13/0x20 > ? __x64_sys_tee+0xd0/0xd0 > sync_inodes_one_sb+0x15/0x20 > iterate_supers+0x94/0x100 > ksys_sync+0x42/0xa0 > __do_sys_sync+0xe/0x20 > do_syscall_64+0x3f/0x90 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > RIP: 0033:0x7f3f9dae2cdb > RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 > RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb > RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340 > RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0 > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 > R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003 > </TASK> > > 1 lock held by ftest01/817: > #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100 > > > task:ftest01 state:D stack: 0 pid: 818 ppid: 816 flags:0x00004002 > Call Trace: > <TASK> > __schedule+0x329/0xae0 > ? bit_wait+0x60/0x60 > schedule+0x48/0xc0 > io_schedule+0x16/0x40 > bit_wait_io+0x11/0x60 > __wait_on_bit_lock+0x5e/0xb0 > out_of_line_wait_on_bit_lock+0x93/0xb0 > ? complete_all+0x60/0x60 > __lock_buffer+0x3f/0x50 pid: 818 was locking buffer > do_get_write_access+0x2ad/0x550 > jbd2_journal_get_write_access+0x51/0x80 > __ext4_journal_get_write_access+0xf9/0x1a0 > ext4_ext_get_access.isra.32+0x34/0x50 > ext4_ext_remove_space+0x59f/0x1830 > ? ext4_ext_truncate+0xa2/0xf0 > ? ext4_ext_truncate+0x5e/0xf0 > ext4_ext_truncate+0xd6/0xf0 > ext4_truncate+0x46a/0x660 > ext4_setattr+0x6db/0xa40 > notify_change+0x3e5/0x540 > do_truncate+0x7b/0xc0 > ? do_truncate+0x7b/0xc0 > do_sys_ftruncate+0x164/0x2f0 > __x64_sys_ftruncate+0x1b/0x20 > do_syscall_64+0x3f/0x90 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > RIP: 0033:0x7f3f9dae3f7b > RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d > RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b > RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003 > RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0 > R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000 > R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003 > </TASK> > > 5 locks held by ftest01/818: > #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20 > #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0 > #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40 > #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0 > #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 > > > task:ftest01 state:D stack: 0 pid: 819 ppid: 816 flags:0x00004002 > Call Trace: > <TASK> > __schedule+0x329/0xae0 > ? __x64_sys_tee+0xd0/0xd0 > ? __x64_sys_tee+0xd0/0xd0 > schedule+0x48/0xc0 > wb_wait_for_completion+0x62/0x90 > ? schedstat_stop+0x10/0x10 > sync_inodes_sb+0xdd/0x460 > ? __this_cpu_preempt_check+0x13/0x20 > ? __x64_sys_tee+0xd0/0xd0 > sync_inodes_one_sb+0x15/0x20 > iterate_supers+0x94/0x100 > ksys_sync+0x42/0xa0 > __do_sys_sync+0xe/0x20 > do_syscall_64+0x3f/0x90 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > RIP: 0033:0x7f3f9dae2cdb > RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 > RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb > RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340 > RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000 > R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000 > R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003 > </TASK> > > 1 lock held by ftest01/819: > #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100 > > > task:ftest01 state:D stack: 0 pid: 820 ppid: 816 flags:0x00004002 > Call Trace: > <TASK> > __schedule+0x329/0xae0 > ? __x64_sys_tee+0xd0/0xd0 > ? __x64_sys_tee+0xd0/0xd0 > schedule+0x48/0xc0 > wb_wait_for_completion+0x62/0x90 > ? schedstat_stop+0x10/0x10 > sync_inodes_sb+0xdd/0x460 > ? __this_cpu_preempt_check+0x13/0x20 > ? __x64_sys_tee+0xd0/0xd0 > sync_inodes_one_sb+0x15/0x20 > iterate_supers+0x94/0x100 > ksys_sync+0x42/0xa0 > __do_sys_sync+0xe/0x20 > do_syscall_64+0x3f/0x90 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > RIP: 0033:0x7f3f9dae2cdb > RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 > RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb > RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340 > RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000 > R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000 > R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003 > </TASK> > > 1 lock held by ftest01/820: > #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100 > > > task:ftest01 state:D stack: 0 pid: 821 ppid: 816 flags:0x00000002 > Call Trace: > <TASK> > __schedule+0x329/0xae0 > ? bit_wait+0x60/0x60 > schedule+0x48/0xc0 > io_schedule+0x16/0x40 > bit_wait_io+0x11/0x60 > __wait_on_bit_lock+0x5e/0xb0 > out_of_line_wait_on_bit_lock+0x93/0xb0 > ? complete_all+0x60/0x60 > __lock_buffer+0x3f/0x50 So was pid: 821. > do_get_write_access+0x2ad/0x550 > jbd2_journal_get_write_access+0x51/0x80 > __ext4_journal_get_write_access+0xf9/0x1a0 > ext4_ext_get_access.isra.32+0x34/0x50 > ext4_ext_remove_space+0x59f/0x1830 > ? ext4_ext_truncate+0xa2/0xf0 > ? ext4_ext_truncate+0x5e/0xf0 > ext4_ext_truncate+0xd6/0xf0 > ext4_truncate+0x46a/0x660 > ext4_setattr+0x6db/0xa40 > notify_change+0x3e5/0x540 > do_truncate+0x7b/0xc0 > ? do_truncate+0x7b/0xc0 > do_sys_ftruncate+0x164/0x2f0 > __x64_sys_ftruncate+0x1b/0x20 > do_syscall_64+0x3f/0x90 > entry_SYSCALL_64_after_hwframe+0x46/0xb0 > RIP: 0033:0x7f3f9dae3f7b > RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d > RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b > RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003 > RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0 > R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000 > R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003 > </TASK> > > 5 locks held by ftest01/821: > #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20 > #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0 > #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40 > #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0 > #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660 > Looks like a ABBA deadlock that should be fed to lockdep, given what was blocking flusher (pid 213). > -------------------------------------- > > Reproducer: > > Get latest preempt-rt kernel: > git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git > checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch) > make defconfig > manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT > make oldconfig and take defaults for other options hidden behind EXPERT > can optionally enable LOCKDEP options so sysrq-d works. > > Get LTP ; build + install > https://github.com/linux-test-project/ltp.git > > Boot and then run reproducer: > dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512 > mkfs.ext4 -F /root/tmp_mount_file > mount /root/tmp_mount_file /mnt/ > /opt/ltp/runltp -f fs -s ftest01 -d /mnt/ > umount /mnt > > loop device isn't strictly necessary but contains fs fallout when > inevitable reboot is required to get rid of D state processes. > > Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total. > Test hangs forever on v5.15-rt and latest v5.19-rc-rt. Test threads > in D state - any other i/o directed at /mnt also hangs in D state. > Make lock_buffer have mutex_lock() semantics in attempt to catch the deadlock above. Only for thoughts now. Hillf --- mainline/fs/buffer.c +++ b/fs/buffer.c @@ -66,12 +66,17 @@ EXPORT_SYMBOL(touch_buffer); void __lock_buffer(struct buffer_head *bh) { + /* lock_buffer has mutex_lock() semantics here */ + mutex_acquire(&bh->b_dep_map, 0, 0, _RET_IP_); + wait_on_bit_lock_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL(__lock_buffer); void unlock_buffer(struct buffer_head *bh) { + mutex_release(&bh->b_dep_map, _RET_IP_); + clear_bit_unlock(BH_Lock, &bh->b_state); smp_mb__after_atomic(); wake_up_bit(&bh->b_state, BH_Lock); @@ -3289,12 +3294,22 @@ static void recalc_bh_state(void) buffer_heads_over_limit = (tot > max_buffer_heads); } +static void buffer_head_init_lockdep(struct buffer_head *buf) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + static struct lock_class_key bh_key; + + lockdep_init_map_wait(&buf->b_dep_map, "buffer_head_lock", &bh_key, 0, LD_WAIT_SLEEP); +#endif +} + struct buffer_head *alloc_buffer_head(gfp_t gfp_flags) { struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags); if (ret) { INIT_LIST_HEAD(&ret->b_assoc_buffers); spin_lock_init(&ret->b_uptodate_lock); + buffer_head_init_lockdep(ret); preempt_disable(); __this_cpu_inc(bh_accounting.nr); recalc_bh_state(); --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -76,6 +76,9 @@ struct buffer_head { spinlock_t b_uptodate_lock; /* Used by the first bh in a page, to * serialise IO completion of other * buffers in the page */ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map b_dep_map; +#endif }; /* ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] sched/core: Always flush pending blk_plug 2022-07-10 1:01 ` Hillf Danton @ 2022-07-12 18:29 ` Paul Gortmaker 2022-07-12 22:38 ` Hillf Danton 0 siblings, 1 reply; 7+ messages in thread From: Paul Gortmaker @ 2022-07-12 18:29 UTC (permalink / raw) To: Hillf Danton Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra, Thomas Gleixner, Mel Gorman, Theodore Ts'o [Re: [PATCH v3] sched/core: Always flush pending blk_plug] On 10/07/2022 (Sun 09:01) Hillf Danton wrote: > Looks like a ABBA deadlock that should be fed to lockdep, given what was > blocking flusher (pid 213). [,,,] > Make lock_buffer have mutex_lock() semantics in attempt to catch the > deadlock above. > > Only for thoughts now. Thanks for the test patch - I'm keeping it as a reference for future use. So, preempt-rt issues an early lockdep splat at partition probe bootup, both with and without the original v3 patch from this thread. Of course then I figured I'd try the debug patch on vanilla v5.19-rc3 and we get pretty much the same lockdep complaint. sd 1:0:0:0: Attached scsi generic sg1 type 0 sd 1:0:0:0: [sdb] Preferred minimum I/O size 512 bytes scsi 2:0:0:0: CD-ROM HL-DT-ST DVD-ROM DH30N A101 PQ: 0 ANSI: 5 ===================================== WARNING: bad unlock balance detected! 5.19.0-rc3-dirty #2 Not tainted ------------------------------------- swapper/2/0 is trying to release lock (buffer_head_lock) at: [<ffffffffbc494a6b>] end_buffer_async_read+0x5b/0x180 but there are no more locks to release! other info that might help us debug this: 1 lock held by swapper/2/0: #0: ffff8bee27744080 (&ret->b_uptodate_lock){....}-{2:2}, at: end_buffer_async_read+0x47/0x180 stack backtrace: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.19.0-rc3-dirty #2 Hardware name: Dell Inc. Precision WorkStation T5500 /0D883F, BIOS A16 05/28/2013 Call Trace: <IRQ> dump_stack_lvl+0x40/0x5c lock_release+0x245/0x3f0 unlock_buffer+0x15/0x30 end_buffer_async_read+0x5b/0x180 end_bio_bh_io_sync+0x1e/0x40 blk_update_request+0x9a/0x470 scsi_end_request+0x27/0x190 scsi_io_completion+0x3c/0x580 blk_complete_reqs+0x39/0x50 __do_softirq+0x11d/0x344 irq_exit_rcu+0xa9/0xc0 common_interrupt+0xa5/0xc0 </IRQ> <TASK> asm_common_interrupt+0x27/0x40 RIP: 0010:cpuidle_enter_state+0x12d/0x3f0 Code: 49 43 0f 84 b7 02 00 00 31 ff e8 fe 1c 74 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 8e 02 00 00 31 ff e8 97 95 7a ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49 RSP: 0000:ffffa2890013fe90 EFLAGS: 00000206 RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002 RDX: 0000000080000002 RSI: ffffffffbd78d911 RDI: 00000000ffffffff RBP: ffff8bee2739a400 R08: 0000000000000000 R09: 00000000001faf40 R10: ffffa2890013fdc8 R11: 0000000000000000 R12: 00000000a7b149d6 R13: ffffffffbdbee3a0 R14: 0000000000000003 R15: 0000000000000001 cpuidle_enter+0x24/0x40 do_idle+0x1e3/0x230 cpu_startup_entry+0x14/0x20 start_secondary+0xe8/0xf0 secondary_startup_64_no_verify+0xe0/0xeb </TASK> sda: sda1 sda2 sda3 sd 0:0:0:0: [sda] Attached SCSI disk sdb: sdb1 sdb2 sd 1:0:0:0: [sdb] Attached SCSI disk Not quite sure what to make of that. Paul. -- ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v3] sched/core: Always flush pending blk_plug 2022-07-12 18:29 ` Paul Gortmaker @ 2022-07-12 22:38 ` Hillf Danton 0 siblings, 0 replies; 7+ messages in thread From: Hillf Danton @ 2022-07-12 22:38 UTC (permalink / raw) To: Paul Gortmaker Cc: John Keeping, linux-kernel, linux-mm, Peter Zijlstra, Thomas Gleixner, Mel Gorman, Theodore Ts'o On Tue, 12 Jul 2022 14:29:14 -0400 Paul Gortmaker wrote: > On 10/07/2022 (Sun 09:01) Hillf Danton wrote: > > > Looks like a ABBA deadlock that should be fed to lockdep, given what was > > blocking flusher (pid 213). > > [,,,] > > > Make lock_buffer have mutex_lock() semantics in attempt to catch the > > deadlock above. > > > > Only for thoughts now. > > Thanks for the test patch - I'm keeping it as a reference for future > use. > > So, preempt-rt issues an early lockdep splat at partition probe bootup, > both with and without the original v3 patch from this thread. > > Of course then I figured I'd try the debug patch on vanilla v5.19-rc3 > and we get pretty much the same lockdep complaint. > > > sd 1:0:0:0: Attached scsi generic sg1 type 0 > sd 1:0:0:0: [sdb] Preferred minimum I/O size 512 bytes > scsi 2:0:0:0: CD-ROM HL-DT-ST DVD-ROM DH30N A101 PQ: 0 ANSI: 5 > > ===================================== > WARNING: bad unlock balance detected! > 5.19.0-rc3-dirty #2 Not tainted > ------------------------------------- > swapper/2/0 is trying to release lock (buffer_head_lock) at: > [<ffffffffbc494a6b>] end_buffer_async_read+0x5b/0x180 > but there are no more locks to release! The lock releaser different from the lock acquirer was caught. > > other info that might help us debug this: > 1 lock held by swapper/2/0: > #0: ffff8bee27744080 (&ret->b_uptodate_lock){....}-{2:2}, at: end_buffer_async_read+0x47/0x180 > > stack backtrace: > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.19.0-rc3-dirty #2 > Hardware name: Dell Inc. Precision WorkStation T5500 /0D883F, BIOS A16 05/28/2013 > Call Trace: > <IRQ> > dump_stack_lvl+0x40/0x5c > lock_release+0x245/0x3f0 > unlock_buffer+0x15/0x30 > end_buffer_async_read+0x5b/0x180 > end_bio_bh_io_sync+0x1e/0x40 > blk_update_request+0x9a/0x470 > scsi_end_request+0x27/0x190 > scsi_io_completion+0x3c/0x580 > blk_complete_reqs+0x39/0x50 > __do_softirq+0x11d/0x344 > irq_exit_rcu+0xa9/0xc0 > common_interrupt+0xa5/0xc0 > </IRQ> > <TASK> > asm_common_interrupt+0x27/0x40 > RIP: 0010:cpuidle_enter_state+0x12d/0x3f0 > Code: 49 43 0f 84 b7 02 00 00 31 ff e8 fe 1c 74 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 8e 02 00 00 31 ff e8 97 95 7a ff fb 45 85 f6 <0f> 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49 > RSP: 0000:ffffa2890013fe90 EFLAGS: 00000206 > RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002 > RDX: 0000000080000002 RSI: ffffffffbd78d911 RDI: 00000000ffffffff > RBP: ffff8bee2739a400 R08: 0000000000000000 R09: 00000000001faf40 > R10: ffffa2890013fdc8 R11: 0000000000000000 R12: 00000000a7b149d6 > R13: ffffffffbdbee3a0 R14: 0000000000000003 R15: 0000000000000001 > cpuidle_enter+0x24/0x40 > do_idle+0x1e3/0x230 > cpu_startup_entry+0x14/0x20 > start_secondary+0xe8/0xf0 > secondary_startup_64_no_verify+0xe0/0xeb > </TASK> > sda: sda1 sda2 sda3 > sd 0:0:0:0: [sda] Attached SCSI disk > sdb: sdb1 sdb2 > sd 1:0:0:0: [sdb] Attached SCSI disk > > Not quite sure what to make of that. You provide the answer to the question why it has been quite a long while that lock_buffer has no annotation to help lockdep. And perhaps lock_page as well despite the relevant long running deadlock does not sit deep under the water. Thanks Hillf ^ permalink raw reply [flat|nested] 7+ messages in thread
* [tip: sched/core] sched/core: Always flush pending blk_plug 2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping 2022-07-09 16:19 ` Paul Gortmaker @ 2022-07-14 11:35 ` tip-bot2 for John Keeping 2022-07-25 8:56 ` [PATCH v3] " Sebastian Andrzej Siewior 2 siblings, 0 replies; 7+ messages in thread From: tip-bot2 for John Keeping @ 2022-07-14 11:35 UTC (permalink / raw) To: linux-tip-commits Cc: John Keeping, Peter Zijlstra (Intel), Steven Rostedt (Google), x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: 401e4963bf45c800e3e9ea0d3a0289d738005fd4 Gitweb: https://git.kernel.org/tip/401e4963bf45c800e3e9ea0d3a0289d738005fd4 Author: John Keeping <john@metanate.com> AuthorDate: Fri, 08 Jul 2022 17:27:02 +01:00 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Wed, 13 Jul 2022 11:29:17 +02:00 sched/core: Always flush pending blk_plug With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two normal priority tasks (SCHED_OTHER, nice level zero): INFO: task kworker/u8:0:8 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000 Workqueue: writeback wb_workfn (flush-7:0) [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174) [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>] +(rt_mutex_slowlock.constprop.0+0xac/0x174) [<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54) [<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec) [<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c) [<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8) [<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4) [<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4) [<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c) [<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8) [<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178) [<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38) Exception stack(0xc10e3fb0 to 0xc10e3ff8) 3fa0: 00000000 00000000 00000000 00000000 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 INFO: task tar:2083 blocked for more than 491 seconds. Not tainted 5.15.49-rt46 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000 [<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134) [<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24) [<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30) [<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8) [<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0) [<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144) [<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4) [<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250) [<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148) [<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98) [<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec) [<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c) Exception stack(0xc2e1bfa8 to 0xc2e1bff0) bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000 bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190 bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc Here the kworker is waiting on msdos_sb_info::s_lock which is held by tar which is in turn waiting for a buffer which is locked waiting to be flushed, but this operation is plugged in the kworker. The lock is a normal struct mutex, so tsk_is_pi_blocked() will always return false on !RT and thus the behaviour changes for RT. It seems that the intent here is to skip blk_flush_plug() in the case where a non-preemptible lock (such as a spinlock) has been converted to a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT schedule flag. But sched_submit_work() is only called from schedule() which is never called in this scenario, so the check can simply be deleted. Looking at the history of the -rt patchset, in fact this change was present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b ("sched/core: Rework the __schedule() preempt argument"). As described in [1]: The schedule process must distinguish between blocking on a regular sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock and rwlock): - rwsem and mutex must flush block requests (blk_schedule_flush_plug()) even if blocked on a lock. This can not deadlock because this also happens for non-RT. There should be a warning if the scheduling point is within a RCU read section. - spinlock and rwlock must not flush block requests. This will deadlock if the callback attempts to acquire a lock which is already acquired. Similarly to being preempted, there should be no warning if the scheduling point is within a RCU read section. and with the tsk_is_pi_blocked() in the scheduler path, we hit the first issue. [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches Signed-off-by: John Keeping <john@metanate.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com --- include/linux/sched/rt.h | 8 -------- kernel/sched/core.c | 8 ++++++-- 2 files changed, 6 insertions(+), 10 deletions(-) diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h index e5af028..994c256 100644 --- a/include/linux/sched/rt.h +++ b/include/linux/sched/rt.h @@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p) } extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task); extern void rt_mutex_adjust_pi(struct task_struct *p); -static inline bool tsk_is_pi_blocked(struct task_struct *tsk) -{ - return tsk->pi_blocked_on != NULL; -} #else static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task) { return NULL; } # define rt_mutex_adjust_pi(p) do { } while (0) -static inline bool tsk_is_pi_blocked(struct task_struct *tsk) -{ - return false; -} #endif extern void normalize_rt_tasks(void); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c703d17..a463dbc 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6470,8 +6470,12 @@ static inline void sched_submit_work(struct task_struct *tsk) io_wq_worker_sleeping(tsk); } - if (tsk_is_pi_blocked(tsk)) - return; + /* + * spinlock and rwlock must not flush block requests. This will + * deadlock if the callback attempts to acquire a lock which is + * already acquired. + */ + SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT); /* * If we are going to sleep and we have plugged IO queued, ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v3] sched/core: Always flush pending blk_plug 2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping 2022-07-09 16:19 ` Paul Gortmaker 2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping @ 2022-07-25 8:56 ` Sebastian Andrzej Siewior 2 siblings, 0 replies; 7+ messages in thread From: Sebastian Andrzej Siewior @ 2022-07-25 8:56 UTC (permalink / raw) To: John Keeping Cc: linux-kernel, linux-rt-users, Peter Zijlstra, Thomas Gleixner, Steven Rostedt, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider On 2022-07-08 17:27:02 [+0100], John Keeping wrote: > Looking at the history of the -rt patchset, in fact this change was > present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part > of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b > ("sched/core: Rework the __schedule() preempt argument"). Thanks for spotting that. The plan was to get rid of this with the introduction of schedule_rtlock() but it somehow remained resilient. Sebastian ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2022-07-25 8:56 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping 2022-07-09 16:19 ` Paul Gortmaker 2022-07-10 1:01 ` Hillf Danton 2022-07-12 18:29 ` Paul Gortmaker 2022-07-12 22:38 ` Hillf Danton 2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping 2022-07-25 8:56 ` [PATCH v3] " Sebastian Andrzej Siewior
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.