[PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes

* [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes
@ 2020-12-14  9:56 fdmanana
  2020-12-14  9:56 ` [PATCH 1/2] btrfs: fix race between cloning and memory mapped writes leading to deadlock fdmanana
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: fdmanana @ 2020-12-14  9:56 UTC (permalink / raw)
  To: linux-btrfs; +Cc: josef, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

For a very long time there's been a race between clone/dedupe and memory
mapped writes as well as between fallocate and memory mapped writes. For
both cases the consequence of the race is that it can make us deadlock
when we are low on available metadata space, since clone/dedupe/fallocate
start a transaction while holding file ranges locked, and allocating the
metadata can result in the async reclaim task flushing the inodes being
used by clone/dedupe/fallocate, if a memory mapped write happened before
we locked the file ranges.

For the dedupe case, Josef's recent fix [1] ("btrfs: fix race between dedupe
and mmap") happens to fix this deadlock problem as well. The first patch
in this patchset fixes the issue for both clone and dedupe, as it's centered
on the shared extent locking function, and it is independent of Josef's fix
(works both with and without that fix).

[1] https://lore.kernel.org/linux-btrfs/afdc2109f83fff1a925d7a66a6a047d4400721d4.1607724668.git.josef@toxicpanda.com/

Filipe Manana (2):
  btrfs: fix race between cloning and memory mapped writes leading to
    deadlock
  btrfs: fix race between fallocate and memory mapped writes leading to
    deadlock

 fs/btrfs/extent_io.c    |  2 +-
 fs/btrfs/file.c         | 38 +++++---------------------------
 fs/btrfs/inode.c        |  4 ++--
 fs/btrfs/ordered-data.c | 48 ++++++++++++++++++++++++++++++++---------
 fs/btrfs/ordered-data.h |  6 +++---
 fs/btrfs/reflink.c      | 28 ++++++++++++++++++------
 6 files changed, 71 insertions(+), 55 deletions(-)

-- 
2.28.0

* [PATCH 1/2] btrfs: fix race between cloning and memory mapped writes leading to deadlock
  2020-12-14  9:56 [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes fdmanana
@ 2020-12-14  9:56 ` fdmanana
  2020-12-14  9:56 ` [PATCH 2/2] btrfs: fix race between fallocate " fdmanana
  2020-12-17 15:02 ` [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes David Sterba
  2 siblings, 0 replies; 6+ messages in thread
From: fdmanana @ 2020-12-14  9:56 UTC (permalink / raw)
  To: linux-btrfs; +Cc: josef, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.

Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if,
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.

Basically what happens is the following:

1) A clone operation locks inodes A and B, flushes delalloc for both
   inodes in the respective file ranges and waits for any ordered extents
   in those ranges to complete;

2) Before the clone task locks the file ranges, another task does a
   memory mapped write (which does not lock the inode) for one of the
   inodes of the clone operation. So now we have a dirty page in one of
   the ranges used by the clone operation;

3) The clone operation locks the file ranges for inodes A and B;

4) Later, when iterating over the file extents of inode A, the clone
   task attempts to start a transaction. There's not enough available
   free metadata space, so the async reclaim task is started (if not
   running already) and we wait for someone to wake us up on our
   reservation ticket;

5) The async reclaim task is not able to release space by any other
   means and decides to flush delalloc for the inode of the clone
   operation;

6) The workqueue job used to flush the inode blocks when starting
   delalloc for the inode, since the file range is currently locked by
   the clone task;

7) But the clone task is waiting on its reservation ticket and the async
   reclaim task is waiting on the flush job to complete, which can't
   progress since the clone task has the file range locked. So unless
   some other task is able to release space, for example an ordered
   extent for some other inode completes, we have a deadlock between all
   these tasks;

When this happens, stack traces like the following show up in dmesg/syslog:

 INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  schedule+0x45/0xe0
  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  ? __mod_memcg_state+0x8e/0x160
  __extent_writepage+0x2d4/0x400 [btrfs]
  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  ? lock_release+0x20e/0x4c0
  ? trace_hardirqs_on+0x1b/0xf0
  extent_writepages+0x43/0x90 [btrfs]
  ? lock_acquire+0x1a3/0x490
  do_writepages+0x43/0xe0
  ? __filemap_fdatawrite_range+0xa4/0x100
  __filemap_fdatawrite_range+0xc5/0x100
  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  btrfs_work_helper+0xf1/0x600 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x50/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? kvm_clock_read+0x14/0x30
  ? wait_for_completion+0x81/0x110
  schedule+0x45/0xe0
  schedule_timeout+0x30c/0x580
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? lock_acquire+0x1a3/0x490
  ? try_to_wake_up+0x7a/0xa20
  ? lock_release+0x20e/0x4c0
  ? lock_acquired+0x199/0x490
  ? wait_for_completion+0x81/0x110
  wait_for_completion+0xab/0x110
  start_delalloc_inodes+0x2af/0x390 [btrfs]
  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  flush_space+0x24f/0x660 [btrfs]
  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x20f/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
(...)
several other tasks blocked on inode locks held by the clone task below
(...)
 RIP: 0033:0x7f61efe73fff
 Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
 RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
 RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
 RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
 RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
 task: fdm-stress        state:D stack:    0 pid:2508234 ppid:2508153 flags:0x00004000
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  schedule+0x45/0xe0
  __reserve_bytes+0x4a4/0xb10 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  start_transaction+0x2d1/0x760 [btrfs]
  btrfs_replace_file_extents+0x120/0x930 [btrfs]
  ? lock_release+0x20e/0x4c0
  btrfs_clone+0x3e4/0x7e0 [btrfs]
  ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
  btrfs_clone_files+0xf6/0x150 [btrfs]
  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
  do_clone_file_range+0xd4/0x1f0
  vfs_clone_file_range+0x4d/0x230
  ? lock_release+0x20e/0x4c0
  ioctl_file_clone+0x8f/0xc0
  do_vfs_ioctl+0x342/0x750
  __x64_sys_ioctl+0x62/0xb0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

So fix this by checking for new delalloc extents in the ranges after
locking them, and if any exist, unlock the ranges, flush delalloc, wait
for ordered extents to complete and then retry.

This modifies the existing helper btrfs_lock_and_flush_ordered_range() to
be able to detect and flush new delalloc extents in the range we want to
lock. The new behaviour is conditional, since it is not needed in any of
its existing callers, and in fact it's not desired at all in the buffered
write path because that would cause us to do unnecessary IO when
attempting to write to an already dirty range. The clone operation now
calls this helper to lock file ranges.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c    |  2 +-
 fs/btrfs/file.c         |  2 +-
 fs/btrfs/inode.c        |  4 ++--
 fs/btrfs/ordered-data.c | 48 ++++++++++++++++++++++++++++++++---------
 fs/btrfs/ordered-data.h |  6 +++---
 fs/btrfs/reflink.c      | 28 ++++++++++++++++++------
 6 files changed, 67 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6e3b72e63e42..c5337c3c811e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3409,7 +3409,7 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
 	struct btrfs_inode *inode = BTRFS_I(pages[0]->mapping->host);
 	int index;
 
-	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
+	btrfs_lock_and_flush_ordered_range(inode, start, end, false, NULL);
 
 	for (index = 0; index < nr_pages; index++) {
 		btrfs_do_readpage(pages[index], em_cached, bio, bio_flags,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e41459b8de6..dd2d5d73804d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1510,7 +1510,7 @@ static int check_can_nocow(struct btrfs_inode *inode, loff_t pos,
 		}
 	} else {
 		btrfs_lock_and_flush_ordered_range(inode, lockstart,
-						   lockend, NULL);
+						   lockend, false, NULL);
 	}
 
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 070716650df8..20a580a0f9ed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4886,7 +4886,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 		return 0;
 
 	btrfs_lock_and_flush_ordered_range(inode, hole_start, block_end - 1,
-					   &cached_state);
+					   false, &cached_state);
 	cur_offset = hole_start;
 	while (1) {
 		em = btrfs_get_extent(inode, NULL, 0, cur_offset,
@@ -8069,7 +8069,7 @@ int btrfs_readpage(struct file *file, struct page *page)
 	struct bio *bio = NULL;
 	int ret;
 
-	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
+	btrfs_lock_and_flush_ordered_range(inode, start, end, false, NULL);
 
 	ret = btrfs_do_readpage(page, NULL, &bio, &bio_flags, 0, NULL);
 	if (bio)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 79d366a36223..7a762e62703a 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -858,31 +858,50 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
  * btrfs_flush_ordered_range - Lock the passed range and ensures all pending
  * ordered extents in it are run to completion.
  *
+ * Also, optionally, flush any dirty pages within the range. This is to prevent
+ * deadlocks in case our caller starts a transaction after it locked the range,
+ * and if right before it locked a range a memory mapped write came in. Such
+ * deadlock can happen when we are low on metadata space and if reserving the
+ * metadata space for the transaction requires the async reclaim worker to flush
+ * delalloc.
+ *
  * @inode:        Inode whose ordered tree is to be searched
  * @start:        Beginning of range to flush
  * @end:          Last byte of range to lock
+ * @flush_dirty:  Whether or not to flush any dirty pages in the range.
  * @cached_state: If passed, will return the extent state responsible for the
  * locked range. It's the caller's responsibility to free the cached state.
  *
  * This function always returns with the given range locked, ensuring after it's
- * called no order extent can be pending.
+ * called no ordered extent can be pending and, optionally, there are no dirty
+ * pages within the range as well.
+ *
+ * This function always returns 0 (success) when @flush_dirty is false, but can
+ * return an error (negative errno) when @flush_dirty is true and there was an
+ * error flushing delalloc.
  */
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
-					struct extent_state **cached_state)
+int btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
+				       u64 end, bool flush_dirty,
+				       struct extent_state **cached_state)
 {
+	const u64 len = end - start + 1;
 	struct btrfs_ordered_extent *ordered;
 	struct extent_state *cache = NULL;
 	struct extent_state **cachedp = &cache;
+	int ret = 0;
 
 	if (cached_state)
 		cachedp = cached_state;
 
-	while (1) {
+	while (!ret) {
+		bool has_delalloc = false;
+
 		lock_extent_bits(&inode->io_tree, start, end, cachedp);
-		ordered = btrfs_lookup_ordered_range(inode, start,
-						     end - start + 1);
-		if (!ordered) {
+		ordered = btrfs_lookup_ordered_range(inode, start, len);
+		if (flush_dirty)
+			has_delalloc = test_range_bit(&inode->io_tree, start, end,
+						      EXTENT_DELALLOC, 0, *cachedp);
+		if (!ordered && !has_delalloc) {
 			/*
 			 * If no external cached_state has been passed then
 			 * decrement the extra ref taken for cachedp since we
@@ -893,9 +912,18 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 			break;
 		}
 		unlock_extent_cached(&inode->io_tree, start, end, cachedp);
-		btrfs_start_ordered_extent(ordered, 1);
-		btrfs_put_ordered_extent(ordered);
+
+		if (flush_dirty)
+			ret = btrfs_wait_ordered_range(&inode->vfs_inode, start,
+						       len);
+		else if (ordered)
+			btrfs_start_ordered_extent(ordered, 1);
+
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
 	}
+
+	return ret;
 }
 
 int __init ordered_data_init(void)
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 0bfa82b58e23..fe2b04983388 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -187,9 +187,9 @@ u64 btrfs_wait_ordered_extents(struct btrfs_root *root, u64 nr,
 			       const u64 range_start, const u64 range_len);
 void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 			      const u64 range_start, const u64 range_len);
-void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
-					u64 end,
-					struct extent_state **cached_state);
+int btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
+				       u64 end, bool flush_dirty,
+				       struct extent_state **cached_state);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index b03e7891394e..823486511c1c 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -574,17 +574,29 @@ static void btrfs_double_extent_unlock(struct inode *inode1, u64 loff1,
 	unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
 }
 
-static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
-				     struct inode *inode2, u64 loff2, u64 len)
+static int btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
+				    struct inode *inode2, u64 loff2, u64 len)
 {
+	int ret;
+
 	if (inode1 < inode2) {
 		swap(inode1, inode2);
 		swap(loff1, loff2);
 	} else if (inode1 == inode2 && loff2 < loff1) {
 		swap(loff1, loff2);
 	}
-	lock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
-	lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
+
+	ret = btrfs_lock_and_flush_ordered_range(BTRFS_I(inode1), loff1,
+						 loff1 + len - 1, true, NULL);
+	if (ret)
+		return ret;
+
+	ret = btrfs_lock_and_flush_ordered_range(BTRFS_I(inode2), loff2,
+						 loff2 + len - 1, true, NULL);
+	if (ret)
+		unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
+
+	return ret;
 }
 
 static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
@@ -597,7 +609,9 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
 	 * Lock destination range to serialize with concurrent readpages() and
 	 * source range to serialize with relocation.
 	 */
-	btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
+	ret = btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
+	if (ret)
+		return ret;
 	ret = btrfs_clone(src, dst, loff, len, ALIGN(len, bs), dst_loff, 1);
 	btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
 
@@ -691,7 +705,9 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	 * Lock destination range to serialize with concurrent readpages() and
 	 * source range to serialize with relocation.
 	 */
-	btrfs_double_extent_lock(src, off, inode, destoff, len);
+	ret = btrfs_double_extent_lock(src, off, inode, destoff, len);
+	if (ret)
+		return ret;
 	ret = btrfs_clone(src, inode, off, olen, len, destoff, 0);
 	btrfs_double_extent_unlock(src, off, inode, destoff, len);
 
-- 
2.28.0


* [PATCH 2/2] btrfs: fix race between fallocate and memory mapped writes leading to deadlock
  2020-12-14  9:56 [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes fdmanana
  2020-12-14  9:56 ` [PATCH 1/2] btrfs: fix race between cloning and memory mapped writes leading to deadlock fdmanana
@ 2020-12-14  9:56 ` fdmanana
  2020-12-17 15:02 ` [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes David Sterba
  2 siblings, 0 replies; 6+ messages in thread
From: fdmanana @ 2020-12-14  9:56 UTC (permalink / raw)
  To: linux-btrfs; +Cc: josef, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.

This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because a
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:

1) A fallocate operation starts, locks the inode, flushes delalloc in the
   range and waits for ordered extents in the range to complete;

2) Before the fallocate task locks the file range, another task does a
   memory mapped write for a page in the fallocate target range. This is
   possible since memory mapped writes do not (and can not) lock the
   inode;

3) The fallocate task locks the file range. At this point there is one
   dirty page in the range (due to the memory mapped write);

4) When the fallocate task attempts to start a transaction, it blocks when
   attempting to reserve metadata space, since we are low on available
   metadata space. Before blocking (waiting on its reservation ticket), it
   starts the async reclaim task (if not running already);

5) The async reclaim task is not able to release space through any other
   means, so it decides to flush delalloc for inodes with dirty pages.
   It finds that the inode used in the fallocate operation has a dirty
   page and therefore queues a job (fs_info->flush_workers workqueue) to
   flush delalloc for that inode and waits on that job to complete;

6) The flush job blocks when attempting to lock the file range because
   it is currently locked by the fallocate task;

7) The fallocate task keeps waiting for its metadata reservation, waiting
   for a wakeup on its reservation ticket. The async reclaim task is
   waiting on the flush job, which in turn is waiting to lock the file
   range that is currently locked by the fallocate task. So unless some
   other task is able to release enough metadata space, for example an
   ordered extent for some other inode completes, we end up in a deadlock
   between all these tasks.

When this happens, stack traces like the following show up in dmesg/syslog:

 INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  schedule+0x45/0xe0
  lock_extent_bits+0x1e6/0x2d0 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_invalidatepage+0x32c/0x390 [btrfs]
  ? __mod_memcg_state+0x8e/0x160
  __extent_writepage+0x2d4/0x400 [btrfs]
  extent_write_cache_pages+0x2b2/0x500 [btrfs]
  ? lock_release+0x20e/0x4c0
  ? trace_hardirqs_on+0x1b/0xf0
  extent_writepages+0x43/0x90 [btrfs]
  ? lock_acquire+0x1a3/0x490
  do_writepages+0x43/0xe0
  ? __filemap_fdatawrite_range+0xa4/0x100
  __filemap_fdatawrite_range+0xc5/0x100
  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
  btrfs_work_helper+0xf1/0x600 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x50/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? kvm_clock_read+0x14/0x30
  ? wait_for_completion+0x81/0x110
  schedule+0x45/0xe0
  schedule_timeout+0x30c/0x580
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  ? lock_acquire+0x1a3/0x490
  ? try_to_wake_up+0x7a/0xa20
  ? lock_release+0x20e/0x4c0
  ? lock_acquired+0x199/0x490
  ? wait_for_completion+0x81/0x110
  wait_for_completion+0xab/0x110
  start_delalloc_inodes+0x2af/0x390 [btrfs]
  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
  flush_space+0x24f/0x660 [btrfs]
  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
  process_one_work+0x24e/0x5e0
  worker_thread+0x20f/0x3b0
  ? process_one_work+0x5e0/0x5e0
  kthread+0x153/0x170
  ? kthread_mod_delayed_work+0xc0/0xc0
  ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
 RIP: 0033:0x7f61efe73fff
 Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
 RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
 RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
 RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
 RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
 task:fdm-stress        state:D stack:    0 pid:2508243 ppid:2508153 flags:0x00000000
 Call Trace:
  __schedule+0x5d1/0xcf0
  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  schedule+0x45/0xe0
  __reserve_bytes+0x4a4/0xb10 [btrfs]
  ? finish_wait+0x90/0x90
  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
  start_transaction+0x2d1/0x760 [btrfs]
  btrfs_replace_file_extents+0x120/0x930 [btrfs]
  ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
  btrfs_fallocate+0xdfb/0x1260 [btrfs]
  ? filename_lookup+0xf1/0x180
  vfs_fallocate+0x14f/0x440
  ioctl_preallocate+0x92/0xc0
  do_vfs_ioctl+0x66b/0x750
  ? __do_sys_newfstat+0x53/0x60
  __x64_sys_ioctl+0x62/0xb0
  do_syscall_64+0x33/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

So fix this by making fallocate use btrfs_lock_and_flush_ordered_range(),
passing it a value of true for the flush delalloc parameter. This way we
are guaranteed that we end up without any dirty pages in the range after
we have locked it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index dd2d5d73804d..6528725d1a29 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3383,38 +3383,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	}
 
 	locked_end = alloc_end - 1;
-	while (1) {
-		struct btrfs_ordered_extent *ordered;
-
-		/* the extent lock is ordered inside the running
-		 * transaction
-		 */
-		lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start,
-				 locked_end, &cached_state);
-		ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode),
-							    locked_end);
-
-		if (ordered &&
-		    ordered->file_offset + ordered->num_bytes > alloc_start &&
-		    ordered->file_offset < alloc_end) {
-			btrfs_put_ordered_extent(ordered);
-			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-					     alloc_start, locked_end,
-					     &cached_state);
-			/*
-			 * we can't wait on the range with the transaction
-			 * running or with the extent lock held
-			 */
-			ret = btrfs_wait_ordered_range(inode, alloc_start,
-						       alloc_end - alloc_start);
-			if (ret)
-				goto out;
-		} else {
-			if (ordered)
-				btrfs_put_ordered_extent(ordered);
-			break;
-		}
-	}
+	ret = btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), alloc_start,
+						 locked_end, true, &cached_state);
+	if (ret)
+		goto out;
 
 	/* First, check if we exceed the qgroup limit */
 	INIT_LIST_HEAD(&reserve_list);
-- 
2.28.0


* Re: [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes
  2020-12-14  9:56 [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes fdmanana
  2020-12-14  9:56 ` [PATCH 1/2] btrfs: fix race between cloning and memory mapped writes leading to deadlock fdmanana
  2020-12-14  9:56 ` [PATCH 2/2] btrfs: fix race between fallocate " fdmanana
@ 2020-12-17 15:02 ` David Sterba
  2020-12-17 15:11   ` Filipe Manana
  2 siblings, 1 reply; 6+ messages in thread
From: David Sterba @ 2020-12-17 15:02 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, josef, Filipe Manana

On Mon, Dec 14, 2020 at 09:56:40AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> For a very long time there's been a race between clone/dedupe and memory
> mapped writes as well as between fallocate and memory mapped writes. For
> both cases the consequence of the race is that it can make us deadlock
> when we are low on available metadata space, since clone/dedupe/fallocate
> start a transaction while holding file ranges locked, and allocating the
> metadata can result in the async reclaim task flushing the inodes being
> used by clone/dedupe/fallocate, if a memory mapped write happened before
> we locked the file ranges.
> 
> For the dedupe case, Josef's recent fix [1] ("btrfs: fix race between dedupe
> and mmap") happens to fix this deadlock problem as well. The first patch
> in this patchset fixes the issue for both clone and dedupe, as it's centered
> on the shared extent locking function, and it is independent of Josef's fix
> (works both with and without that fix).

Thanks, I was wondering how all the patches are related.
> 
> [1] https://lore.kernel.org/linux-btrfs/afdc2109f83fff1a925d7a66a6a047d4400721d4.1607724668.git.josef@toxicpanda.com/
> 
> Filipe Manana (2):
>   btrfs: fix race between cloning and memory mapped writes leading to
>     deadlock
>   btrfs: fix race between fallocate and memory mapped writes leading to
>     deadlock

Added to misc-next, thanks.

* Re: [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes
  2020-12-17 15:02 ` [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes David Sterba
@ 2020-12-17 15:11   ` Filipe Manana
  2020-12-17 16:10     ` David Sterba
  0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2020-12-17 15:11 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs, Josef Bacik, Filipe Manana

On Thu, Dec 17, 2020 at 3:03 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Dec 14, 2020 at 09:56:40AM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > For a very long time there's been a race between clone/dedupe and memory
> > mapped writes as well as between fallocate and memory mapped writes. For
> > both cases the consequence of the race is that it can make us deadlock
> > when we are low on available metadata space, since clone/dedupe/fallocate
> > start a transaction while holding file ranges locked, and allocating the
> > metadata can result in the async reclaim task flushing the inodes being
> > used by clone/dedupe/fallocate, if a memory mapped write happened before
> > we locked the file ranges.
> >
> > For the dedupe case, Josef's recent fix [1] ("btrfs: fix race between dedupe
> > and mmap") happens to fix this deadlock problem as well. The first patch
> > in this patchset fixes the issue for both clone and dedupe, as it's centered
> > on the shared extent locking function, and it is independent of Josef's fix
> > (works both with and without that fix).
>
> Thanks, I was wondering how all the patches are related.
> >
> > [1] https://lore.kernel.org/linux-btrfs/afdc2109f83fff1a925d7a66a6a047d4400721d4.1607724668.git.josef@toxicpanda.com/
> >
> > Filipe Manana (2):
> >   btrfs: fix race between cloning and memory mapped writes leading to
> >     deadlock
> >   btrfs: fix race between fallocate and memory mapped writes leading to
> >     deadlock
>
> Added to misc-next, thanks.

Something I haven't mentioned until now, as I have been waiting for
vger to deliver me the mails for another patchset from Josef (I have
been seeing delays of 2 to 4 days), is that this patchset from Josef:

https://lore.kernel.org/linux-btrfs/cover.1607969636.git.josef@toxicpanda.com/

replaces this patchset and the following RFC patch:

https://lore.kernel.org/linux-btrfs/afdc2109f83fff1a925d7a66a6a047d4400721d4.1607724668.git.josef@toxicpanda.com/

We agreed on Slack that a more generic solution was better, especially
because the RFC patch above from Josef ended up being racy and didn't
fully fix the problem.
It doesn't help either that the cover letter for the above patchset
from Josef did not mention this, nor was it discussed in the thread
for the RFC patch.

So please drop this one and replace it with Josef's patchset. I have
just given my review on GitHub:
https://github.com/btrfs/linux/issues/163

Thanks.

* Re: [PATCH 0/2] btrfs: fix races between clone, fallocate and memory mapped writes
  2020-12-17 15:11   ` Filipe Manana
@ 2020-12-17 16:10     ` David Sterba
  0 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2020-12-17 16:10 UTC (permalink / raw)
  To: Filipe Manana; +Cc: dsterba, linux-btrfs, Josef Bacik, Filipe Manana

On Thu, Dec 17, 2020 at 03:11:46PM +0000, Filipe Manana wrote:
> On Thu, Dec 17, 2020 at 3:03 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Mon, Dec 14, 2020 at 09:56:40AM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > For a very long time there's been a race between clone/dedupe and memory
> > > mapped writes as well as between fallocate and memory mapped writes. For
> > > both cases the consequence of the race is that it can make us deadlock
> > > when we are low on available metadata space, since clone/dedupe/fallocate
> > > start a transaction while holding file ranges locked, and allocating the
> > > metadata can result in the async reclaim task flushing the inodes being
> > > used by clone/dedupe/fallocate, if a memory mapped write happened before
> > > we locked the file ranges.
> > >
> > > For the dedupe case, Josef's recent fix [1] ("btrfs: fix race between dedupe
> > > and mmap") happens to fix this deadlock problem as well. The first patch
> > > in this patchset fixes the issue for both clone and dedupe, as it's centered
> > > on the shared extent locking function, and it is independent of Josef's fix
> > > (works both with and without that fix).
> >
> > Thanks, I was wondering how all the patches are related.
> > >
> > > [1] https://lore.kernel.org/linux-btrfs/afdc2109f83fff1a925d7a66a6a047d4400721d4.1607724668.git.josef@toxicpanda.com/
> > >
> > > Filipe Manana (2):
> > >   btrfs: fix race between cloning and memory mapped writes leading to
> > >     deadlock
> > >   btrfs: fix race between fallocate and memory mapped writes leading to
> > >     deadlock
> >
> > Added to misc-next, thanks.
> 
> Something I haven't mentioned until now, as I have been waiting for
> vger to deliver me the mails for another patchset from Josef (I have
> been seeing delays of 2 to 4 days), is that this patchset from Josef:
> 
> https://lore.kernel.org/linux-btrfs/cover.1607969636.git.josef@toxicpanda.com/
> 
> replaces this patchset and the following RFC patch:
> 
> https://lore.kernel.org/linux-btrfs/afdc2109f83fff1a925d7a66a6a047d4400721d4.1607724668.git.josef@toxicpanda.com/
> 
> We agreed on Slack that a more generic solution was better, especially
> because the RFC patch above from Josef ended up being racy and didn't
> fully fix the problem.
> It doesn't help either that the cover letter for the above patchset
> from Josef did not mention this, nor was it discussed in the thread
> for the RFC patch.
> 
> So please drop this one and replace it with Josef's patchset. I have
> just given my review on GitHub:
> https://github.com/btrfs/linux/issues/163

I see, thanks. Patches removed from misc-next.
