From: David Sterba <dsterba@suse.cz>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] btrfs: fix deadlock between concurrent dio writes when low on free data space
Date: Fri, 29 Apr 2022 20:50:17 +0200
Message-ID: <20220429185017.GH18596@twin.jikos.cz>
In-Reply-To: <ca54a0f0251d2f77c51747de0e096f7f29542feb.1651154066.git.fdmanana@suse.com>
On Thu, Apr 28, 2022 at 02:59:46PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> When reserving data space for a direct IO write we can end up deadlocking
> if we have multiple tasks attempting a write to the same file range, there
> are multiple extents covered by that file range, we are low on available
> space for data and the writes don't expand the inode's i_size.
>
> The deadlock can happen like this:
>
> 1) We have a file with an i_size of 1M, at offset 0 it has an extent with
> a size of 128K and at offset 128K it has another extent also with a
> size of 128K;
>
> 2) Task A does a direct IO write against file range [0, 256K[, and because
> the write is within the i_size boundary, it takes the inode's lock (VFS
> level) in shared mode;
>
> 3) Task A locks the file range [0, 256K[ at btrfs_dio_iomap_begin(), and
> then gets the extent map for the extent covering the range [0, 128K[.
> At btrfs_get_blocks_direct_write(), it creates an ordered extent for
> that file range ([0, 128K[);
>
> 4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
> range [0, 256K[;
>
> 5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
> range [128K, 256K[, and locks the file range [128K, 256K[;
>
> 6) Task B starts a direct IO write against file range [0, 256K[ as well.
> It also locks the inode in shared mode, as it's within the i_size limit,
> and then tries to lock file range [0, 256K[. It is able to lock the
> subrange [0, 128K[ but then blocks waiting for the range [128K, 256K[,
> as it is currently locked by task A;
>
> 7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
> space. Because we are low on available free space, it triggers the
> async data reclaim task, and waits for it to reserve data space;
>
> 8) The async reclaim task decides to wait for all existing ordered extents
> to complete (through btrfs_wait_ordered_roots()).
> It finds the ordered extent previously created by task A for the file
> range [0, 128K[ and waits for it to complete;
>
> 9) The ordered extent for the file range [0, 128K[ can not complete
> because it blocks at btrfs_finish_ordered_io() when trying to lock the
> file range [0, 128K[.
>
> This results in a deadlock, because:
>
> - task B is holding the file range [0, 128K[ locked, waiting for the
> range [128K, 256K[ to be unlocked by task A;
>
> - task A is holding the file range [128K, 256K[ locked and it's waiting
> for the async data reclaim task to satisfy its space reservation
> request;
>
> - the async data reclaim task is waiting for ordered extent [0, 128K[
>   to complete, but the ordered extent can not complete because the
>   file range [0, 128K[ is currently locked by task B, which in turn is
>   waiting on task A to unlock file range [128K, 256K[, while task A
>   waits on the async data reclaim task.
>
> This results in a deadlock between 4 tasks: task A, task B, the async
> data reclaim task and the task doing ordered extent completion (a work
> queue task).
>
> This type of deadlock can sporadically be triggered by the test case
> generic/300 from fstests, and results in a stack trace like the following:
>
> [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds.
> [12084.034877] Not tainted 5.18.0-rc2-btrfs-next-115 #1
> [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [12084.036548] task:kworker/u16:7 state:D stack: 0 pid:123749 ppid: 2 flags:0x00004000
> [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
> [12084.036599] Call Trace:
> [12084.036601] <TASK>
> [12084.036606] __schedule+0x3cb/0xed0
> [12084.036616] schedule+0x4e/0xb0
> [12084.036620] btrfs_start_ordered_extent+0x109/0x1c0 [btrfs]
> [12084.036651] ? prepare_to_wait_exclusive+0xc0/0xc0
> [12084.036659] btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs]
> [12084.036688] btrfs_work_helper+0xf8/0x400 [btrfs]
> [12084.036719] ? lock_is_held_type+0xe8/0x140
> [12084.036727] process_one_work+0x252/0x5a0
> [12084.036736] ? process_one_work+0x5a0/0x5a0
> [12084.036738] worker_thread+0x52/0x3b0
> [12084.036743] ? process_one_work+0x5a0/0x5a0
> [12084.036745] kthread+0xf2/0x120
> [12084.036747] ? kthread_complete_and_exit+0x20/0x20
> [12084.036751] ret_from_fork+0x22/0x30
> [12084.036765] </TASK>
> [12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds.
> [12084.037702] Not tainted 5.18.0-rc2-btrfs-next-115 #1
> [12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [12084.039506] task:kworker/u16:11 state:D stack: 0 pid:153787 ppid: 2 flags:0x00004000
> [12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
> [12084.039551] Call Trace:
> [12084.039553] <TASK>
> [12084.039557] __schedule+0x3cb/0xed0
> [12084.039566] schedule+0x4e/0xb0
> [12084.039569] schedule_timeout+0xed/0x130
> [12084.039573] ? mark_held_locks+0x50/0x80
> [12084.039578] ? _raw_spin_unlock_irq+0x24/0x50
> [12084.039580] ? lockdep_hardirqs_on+0x7d/0x100
> [12084.039585] __wait_for_common+0xaf/0x1f0
> [12084.039587] ? usleep_range_state+0xb0/0xb0
> [12084.039596] btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs]
> [12084.039636] btrfs_wait_ordered_roots+0x175/0x240 [btrfs]
> [12084.039670] flush_space+0x25b/0x630 [btrfs]
> [12084.039712] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
> [12084.039747] process_one_work+0x252/0x5a0
> [12084.039756] ? process_one_work+0x5a0/0x5a0
> [12084.039758] worker_thread+0x52/0x3b0
> [12084.039762] ? process_one_work+0x5a0/0x5a0
> [12084.039765] kthread+0xf2/0x120
> [12084.039766] ? kthread_complete_and_exit+0x20/0x20
> [12084.039770] ret_from_fork+0x22/0x30
> [12084.039783] </TASK>
> [12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds.
> [12084.040709] Not tainted 5.18.0-rc2-btrfs-next-115 #1
> [12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [12084.042404] task:kworker/u16:17 state:D stack: 0 pid:217907 ppid: 2 flags:0x00004000
> [12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
> [12084.042461] Call Trace:
> [12084.042463] <TASK>
> [12084.042471] __schedule+0x3cb/0xed0
> [12084.042485] schedule+0x4e/0xb0
> [12084.042490] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> [12084.042539] ? prepare_to_wait_exclusive+0xc0/0xc0
> [12084.042551] lock_extent_bits+0x37/0x90 [btrfs]
> [12084.042601] btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs]
> [12084.042656] ? lock_is_held_type+0xe8/0x140
> [12084.042667] btrfs_work_helper+0xf8/0x400 [btrfs]
> [12084.042716] ? lock_is_held_type+0xe8/0x140
> [12084.042727] process_one_work+0x252/0x5a0
> [12084.042742] worker_thread+0x52/0x3b0
> [12084.042750] ? process_one_work+0x5a0/0x5a0
> [12084.042754] kthread+0xf2/0x120
> [12084.042757] ? kthread_complete_and_exit+0x20/0x20
> [12084.042763] ret_from_fork+0x22/0x30
> [12084.042783] </TASK>
> [12084.042798] INFO: task fio:234517 blocked for more than 241 seconds.
> [12084.043598] Not tainted 5.18.0-rc2-btrfs-next-115 #1
> [12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [12084.045244] task:fio state:D stack: 0 pid:234517 ppid:234515 flags:0x00004000
> [12084.045248] Call Trace:
> [12084.045250] <TASK>
> [12084.045254] __schedule+0x3cb/0xed0
> [12084.045263] schedule+0x4e/0xb0
> [12084.045266] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
> [12084.045298] ? prepare_to_wait_exclusive+0xc0/0xc0
> [12084.045306] lock_extent_bits+0x37/0x90 [btrfs]
> [12084.045336] btrfs_dio_iomap_begin+0x336/0xc60 [btrfs]
> [12084.045370] ? lock_is_held_type+0xe8/0x140
> [12084.045378] iomap_iter+0x184/0x4c0
> [12084.045383] __iomap_dio_rw+0x2c6/0x8a0
> [12084.045406] iomap_dio_rw+0xa/0x30
> [12084.045408] btrfs_do_write_iter+0x370/0x5e0 [btrfs]
> [12084.045440] aio_write+0xfa/0x2c0
> [12084.045448] ? __might_fault+0x2a/0x70
> [12084.045451] ? kvm_sched_clock_read+0x14/0x40
> [12084.045455] ? lock_release+0x153/0x4a0
> [12084.045463] io_submit_one+0x615/0x9f0
> [12084.045467] ? __might_fault+0x2a/0x70
> [12084.045469] ? kvm_sched_clock_read+0x14/0x40
> [12084.045478] __x64_sys_io_submit+0x83/0x160
> [12084.045483] ? syscall_enter_from_user_mode+0x1d/0x50
> [12084.045489] do_syscall_64+0x3b/0x90
> [12084.045517] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [12084.045521] RIP: 0033:0x7fa76511af79
> [12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
> [12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79
> [12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000
> [12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330
> [12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0
> [12084.045561] </TASK>
>
> Fix this issue by always reserving data space before locking a file range
> at btrfs_dio_iomap_begin(). If we can't reserve the space, we don't error
> out immediately - instead, after locking the file range, we check if we
> can do a NOCOW write; if we can, we proceed, since no new data extent
> needs to be allocated, and if we can't, we error out with -ENOSPC. This
> also means we may reserve space that turns out to be unnecessary because
> the write ends up being done in NOCOW mode - in that case we simply
> release the reservation once we notice the write was NOCOW, which is the
> same logic used in the path for buffered IO writes.
>
> Fixes: f0bfa76a11e93d ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
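As a rough sketch of the reordered flow the patch describes - reserve
first, lock second, and only fail when neither a reservation nor a NOCOW
write is possible - here is a minimal userspace mock. All helper names
below are invented for illustration and are not the real btrfs functions:

```c
/* Hypothetical userspace sketch of the fixed ordering in the direct IO
 * write path: reserve data space *before* locking the file range, and
 * only return -ENOSPC if the reservation failed and the write cannot be
 * done NOCOW.  Every helper here is a mock, not kernel code. */
#include <stdbool.h>
#include <errno.h>

/* Mocked environment: flip these to exercise the decision paths. */
static bool space_available;   /* can we reserve a data extent?   */
static bool range_is_nocow;    /* can the range be written NOCOW? */

static int  reserve_data_space(void) { return space_available ? 0 : -ENOSPC; }
static void release_data_space(void) { }
static void lock_file_range(void)    { }
static void unlock_file_range(void)  { }
static bool can_nocow_write(void)    { return range_is_nocow; }

/* Returns 0 on success, -ENOSPC if neither a reservation nor a NOCOW
 * write is possible. */
static int dio_write_begin(void)
{
	int ret = reserve_data_space();  /* before taking the range lock */
	bool reserved = (ret == 0);

	lock_file_range();
	if (can_nocow_write()) {
		/* NOCOW write: no new extent needed, so return any
		 * reservation we took speculatively - the same logic
		 * the buffered IO path uses. */
		if (reserved)
			release_data_space();
		ret = 0;
	} else if (!reserved) {
		/* COW write with no space reserved: now we must fail. */
		unlock_file_range();
		return -ENOSPC;
	}
	/* ... set up the write here; the real code unlocks later ... */
	unlock_file_range();
	return ret;
}
```

The key point of the sketch is that no space reservation (and hence no
flushing or waiting on the async reclaim task) happens while the file
range lock is held, which breaks the cycle described in the changelog.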
Added to misc-next, thanks.