All of lore.kernel.org
 help / color / mirror / Atom feed
From: Paul Gortmaker <paul.gortmaker@windriver.com>
To: John Keeping <john@metanate.com>
Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH v3] sched/core: Always flush pending blk_plug
Date: Sat, 9 Jul 2022 12:19:56 -0400	[thread overview]
Message-ID: <20220709161956.GA20046@windriver.com> (raw)
In-Reply-To: <20220708162702.1758865-1-john@metanate.com>

[[PATCH v3] sched/core: Always flush pending blk_plug] On 08/07/2022 (Fri 17:27) John Keeping wrote:

> With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
> normal priority tasks (SCHED_OTHER, nice level zero):
> 
> 	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
> 	      Not tainted 5.15.49-rt46 #1
> 	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
> 	Workqueue: writeback wb_workfn (flush-7:0)

This is great.  I was on-and-off looking at an ext4 deadlock triggered
by LTP's "ftest01" - and was just potting my notes together this morning
to clarify for myself what I thought was going on before I forgot.

Which is good, because I then could recognize the very similar patterns
in what you saw to what I'd observed/deduced.  Otherise I'd probably not
have had your backtraces grab my attention at all.

Long story short, I applied this to v5.19-rc3-rt5 and it resolves the
LTP/ext4 deadlock issue.  So...

Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>

As I'd already put together my ext4 debug info, I'll include it below
in case it helps future people searching for LTP and ftest01.

Thanks,
Paul.

	-----------------


There is a reproducible deadlock in ext4 on preempt-rt triggered by LTP
"ftest01" 

Reproducer details are at the bottom.  What seems to happen is as follows:

The kworker doing writeback wb_workfn (pid 213 below) stalls on the 
one and only down_write in ext4_map_blocks() and never moves again.

      /*
       * New blocks allocate and/or writing to unwritten extent
       * will possibly result in updating i_data, so we take
       * the write lock of i_data_sem, and call get_block()
       * with create == 1 flag.
       */
      down_write(&EXT4_I(inode)->i_data_sem);

So who is blocking the kworker?  There are two of the ftest01 processes
(pid 818 and pid 821) doing ext4 truncate which look like this:

  ext4_truncate+0x46a/0x660
   --> ext4_ext_truncate+0xd6/0xf0
       --> ext4_ext_remove_space+0x59f/0x1830
           --> jbd2_journal_get_write_access+0x51/0x80
               --> do_get_write_access+0x2ad/0x550
                   --> complete_all+0x60/0x60
                       --> bit_wait_io+0x11/0x60

They too are stalled in D state waiting there forever - presumably
waiting on the kwowrker to make the progress they are waiting for.

Here is where I think lockdep shows us the problem:

  5 locks held by ftest01/818:
   [...]
   #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

  5 locks held by ftest01/821:
   [...]
   #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

Yes - that is the one the kworker was trying the down_write() on.

Also note that the bit_wait is a big telltale sign here - there is only
one in the jbd2's do_get_write_access() function:

      /*
       * There is one case we have to be very careful about.  If the
       * committing transaction is currently writing this buffer out to disk
       * and has NOT made a copy-out, then we cannot modify the buffer
       * contents at all right now.  The essence of copy-out is that it is
       * the extra copy, not the primary copy, which gets journaled.  If the
       * primary copy is already going to disk then we cannot do copy-out
       * here.
       */
      if (buffer_shadow(bh)) {
              JBUFFER_TRACE(jh, "on shadow: sleep");
              spin_unlock(&jh->b_state_lock);
              wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
              goto repeat;
      }

Presumaby this is a corner case that preempt-rt makes easier to trigger?

The full trace with interspersed lockdep info follows.

Various detailed reproducer information follows that.

	-----------------------

sysrq-w backtracce information interspersed with sysrq-d lockdep info
on 5.19-rc2-rt3 kernel:

 task:kworker/u129:34 state:D stack:    0 pid:  213 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-7:0)
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __this_cpu_preempt_check+0x13/0x20
  ? __rt_mutex_slowlock.isra.21.constprop.25+0xc0e/0xc40
  schedule+0x48/0xc0
  __rt_mutex_slowlock.isra.21.constprop.25+0x1b8/0xc40
  rwbase_write_lock+0x98/0x6e0
  down_write+0x46/0x50
  ? ext4_map_blocks+0x156/0x5c0
  ext4_map_blocks+0x156/0x5c0
  ext4_writepages+0xa1a/0x1290
  ? __this_cpu_preempt_check+0x13/0x20
  ? lockdep_hardirqs_on+0xcd/0x150
  do_writepages+0xd2/0x1a0
  __writeback_single_inode+0x64/0x850
  writeback_sb_inodes+0x22d/0x5b0
  __writeback_inodes_wb+0x67/0xe0
  wb_writeback+0x269/0x5f0
  ? get_nr_inodes+0x49/0x70
  wb_workfn+0x43d/0x780
  ? __this_cpu_preempt_check+0x13/0x20
  process_one_work+0x286/0x710
  worker_thread+0x3c/0x3d0
  ? process_one_work+0x710/0x710
  kthread+0x13b/0x150
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

  6 locks held by kworker/u129:34/213:
   #0: ffff957c25e9ef28 ((wq_completion)writeback){+.+.}-{0:0}, at: process_one_work+0x201/0x710
   #1: ffffb7e9c3797e68 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x201/0x710
   #2: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: trylock_super+0x1b/0x50
   #3: ffff957b8a9b2d18 (&sbi->s_writepages_rwsem){.+.+}-{0:0}, at: do_writepages+0xd2/0x1a0
   #4: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #5: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_map_blocks+0x156/0x5c0


 task:jbd2/loop0-8    state:D stack:    0 pid:  679 ppid:     2 flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  schedule+0x48/0xc0
  jbd2_journal_wait_updates+0x7c/0xf0
  ? schedstat_stop+0x10/0x10
  jbd2_journal_commit_transaction+0x284/0x20e0
  ? __this_cpu_preempt_check+0x13/0x20
  ? find_held_lock+0x35/0xa0
  ? _raw_spin_unlock_irqrestore+0x86/0x8f
  ? _raw_spin_unlock_irqrestore+0x86/0x8f
  ? try_to_del_timer_sync+0x53/0x80
  ? trace_hardirqs_on+0x3b/0x120
  ? try_to_del_timer_sync+0x53/0x80
  kjournald2+0xd5/0x2b0
  ? schedstat_stop+0x10/0x10
  ? commit_timeout+0x20/0x20
  kthread+0x13b/0x150
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>


 task:ftest01         state:D stack:    0 pid:  817 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 00000000000ff000 R08: 0000000000000000 R09: 00007f3f9dbce0a0
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000049 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/817:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  818 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? bit_wait+0x60/0x60
  schedule+0x48/0xc0
  io_schedule+0x16/0x40
  bit_wait_io+0x11/0x60
  __wait_on_bit_lock+0x5e/0xb0
  out_of_line_wait_on_bit_lock+0x93/0xb0
  ? complete_all+0x60/0x60
  __lock_buffer+0x3f/0x50
  do_get_write_access+0x2ad/0x550
  jbd2_journal_get_write_access+0x51/0x80
  __ext4_journal_get_write_access+0xf9/0x1a0
  ext4_ext_get_access.isra.32+0x34/0x50
  ext4_ext_remove_space+0x59f/0x1830
  ? ext4_ext_truncate+0xa2/0xf0
  ? ext4_ext_truncate+0x5e/0xf0
  ext4_ext_truncate+0xd6/0xf0
  ext4_truncate+0x46a/0x660
  ext4_setattr+0x6db/0xa40
  notify_change+0x3e5/0x540
  do_truncate+0x7b/0xc0
  ? do_truncate+0x7b/0xc0
  do_sys_ftruncate+0x164/0x2f0
  __x64_sys_ftruncate+0x1b/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae3f7b
 RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
 RAX: ffffffffffffffda RBX: 00000000000001a5 RCX: 00007f3f9dae3f7b
 RDX: 00000000000001a5 RSI: 00000000000d2800 RDI: 0000000000000003
 RBP: 00000000000001a5 R08: 00007f3f9dbce090 R09: 00007f3f9dbce0a0
 R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000043 R15: 0000000000000003
  </TASK>

  5 locks held by ftest01/818:
   #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
   #1: ffff957c2a679b00 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
   #2: ffff957c2a679cc0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
   #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #4: ffff957c2a679958 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660


 task:ftest01         state:D stack:    0 pid:  819 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 00000000000c2800 R08: 00000000fffff800 R09: 0000000000004000
 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000075 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/819:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  820 ppid:   816 flags:0x00004002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? __x64_sys_tee+0xd0/0xd0
  ? __x64_sys_tee+0xd0/0xd0
  schedule+0x48/0xc0
  wb_wait_for_completion+0x62/0x90
  ? schedstat_stop+0x10/0x10
  sync_inodes_sb+0xdd/0x460
  ? __this_cpu_preempt_check+0x13/0x20
  ? __x64_sys_tee+0xd0/0xd0
  sync_inodes_one_sb+0x15/0x20
  iterate_supers+0x94/0x100
  ksys_sync+0x42/0xa0
  __do_sys_sync+0xe/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae2cdb
 RSP: 002b:00007ffe06430318 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
 RAX: ffffffffffffffda RBX: 0000000000000800 RCX: 00007f3f9dae2cdb
 RDX: 0000000000000040 RSI: 0000561f57de52f0 RDI: 0000561f57de5340
 RBP: 000000000006a800 R08: 00000000fffff800 R09: 0000000000004000
 R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000069 R15: 0000000000000003
  </TASK>

  1 lock held by ftest01/820:
   #0: ffff957b9c9b70d0 (&type->s_umount_key#34){++++}-{3:3}, at: iterate_supers+0x7d/0x100


 task:ftest01         state:D stack:    0 pid:  821 ppid:   816 flags:0x00000002
 Call Trace:
  <TASK>
  __schedule+0x329/0xae0
  ? bit_wait+0x60/0x60
  schedule+0x48/0xc0
  io_schedule+0x16/0x40
  bit_wait_io+0x11/0x60
  __wait_on_bit_lock+0x5e/0xb0
  out_of_line_wait_on_bit_lock+0x93/0xb0
  ? complete_all+0x60/0x60
  __lock_buffer+0x3f/0x50
  do_get_write_access+0x2ad/0x550
  jbd2_journal_get_write_access+0x51/0x80
  __ext4_journal_get_write_access+0xf9/0x1a0
  ext4_ext_get_access.isra.32+0x34/0x50
  ext4_ext_remove_space+0x59f/0x1830
  ? ext4_ext_truncate+0xa2/0xf0
  ? ext4_ext_truncate+0x5e/0xf0
  ext4_ext_truncate+0xd6/0xf0
  ext4_truncate+0x46a/0x660
  ext4_setattr+0x6db/0xa40
  notify_change+0x3e5/0x540
  do_truncate+0x7b/0xc0
  ? do_truncate+0x7b/0xc0
  do_sys_ftruncate+0x164/0x2f0
  __x64_sys_ftruncate+0x1b/0x20
  do_syscall_64+0x3f/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f3f9dae3f7b
 RSP: 002b:00007ffe06430318 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
 RAX: ffffffffffffffda RBX: 000000000000016a RCX: 00007f3f9dae3f7b
 RDX: 000000000000016a RSI: 00000000000b5000 RDI: 0000000000000003
 RBP: 000000000000016a R08: 00007f3f9dbce080 R09: 00007f3f9dbce0a0
 R10: 0000000000000002 R11: 0000000000000202 R12: 0000000000000000
 R13: 0000000000000207 R14: 0000000000000040 R15: 0000000000000003
  </TASK>

  5 locks held by ftest01/821:
   #0: ffff957b9c9b74e8 (sb_writers#3){.+.+}-{0:0}, at: __x64_sys_ftruncate+0x1b/0x20
   #1: ffff957c2a67bd20 (&sb->s_type->i_mutex_key#10){+.+.}-{3:3}, at: do_truncate+0x6a/0xc0
   #2: ffff957c2a67bee0 (mapping.invalidate_lock){++++}-{3:3}, at: ext4_setattr+0x39a/0xa40
   #3: ffff957b9c9b5be8 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0x13e/0x5f0
   #4: ffff957c2a67bb78 (&ei->i_data_sem){++++}-{3:3}, at: ext4_truncate+0x443/0x660

		--------------------------------------

Reproducer:

Get latest preempt-rt kernel:
	git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git
	checkout v5.19-rc3-rt5 (or linux-5.19.y-rt branch)
	make defconfig
	manually enable CONFIG_EXPERT and then CONFIG_PREEMPT_RT
	make oldconfig and take defaults for other options hidden behind EXPERT
	can optionally enable LOCKDEP options so sysrq-d works.

Get LTP ; build + install
	https://github.com/linux-test-project/ltp.git

Boot and then run reproducer:
   dd if=/dev/zero of=/root/tmp_mount_file bs=1M count=512
   mkfs.ext4 -F /root/tmp_mount_file
   mount /root/tmp_mount_file /mnt/
   /opt/ltp/runltp -f fs -s ftest01 -d /mnt/
   umount /mnt

   loop device isn't strictly necessary but contains fs fallout when
   inevitable reboot is required to get rid of D state processes.

   Test completes on v5.10-rt, v5.12-rt - should take about 2-3m total.
   Test hangs forever on v5.15-rt and latest v5.19-rc-rt.  Test threads
   in D state - any other i/o directed at /mnt also hangs in D state.

  reply	other threads:[~2022-07-09 16:20 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-08 16:27 [PATCH v3] sched/core: Always flush pending blk_plug John Keeping
2022-07-09 16:19 ` Paul Gortmaker [this message]
2022-07-10  1:01   ` Hillf Danton
2022-07-12 18:29     ` Paul Gortmaker
2022-07-12 22:38       ` Hillf Danton
2022-07-14 11:35 ` [tip: sched/core] " tip-bot2 for John Keeping
2022-07-25  8:56 ` [PATCH v3] " Sebastian Andrzej Siewior

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220709161956.GA20046@windriver.com \
    --to=paul.gortmaker@windriver.com \
    --cc=bigeasy@linutronix.de \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=john@metanate.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.