[PATCH] Btrfs: fix crash while a ranged msync() is ongoing

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
@ 2014-08-29 12:37 Filipe Manana
  2014-08-29 14:51 ` Christoph Hellwig
  2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
  0 siblings, 2 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 12:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
"mm/msync.c: sync only the requested range in msync()"), our fsync
callback can be called with a range that covers only part of the
file and not the whole file anymore. Under certain circumstances
this leads to crashes that produce the following trace:

[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)

The necessary conditions that lead to such crash are:

* an incremental fsync (when the inode doesn't have the
  BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
  a file extent item ending at offset X;

* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
  to a file truncate operation that reduces the file to a size smaller
  than X;

* an msync call happens, with a range that doesn't cover the whole file
  and the end of this range, lets call it Y, is smaller than X;

* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
  calls btrfs_truncate_inode_items() to remove all items from the log
  tree that are associated with our file;

* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
  file extent item it removed is the one ending at offset X, where X > 0 and
  X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
  parameter set to X;

* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
  size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
  ordered operation with a range covering the offset X, calling a BUG_ON() if
  such ordered operation exists. This assumption is made because the disk_i_size
  is only increased after the corresponding file extent item is added to the
  btree (btrfs_finish_ordered_io);

* But because our msync/fsync covers only a limited range, such an ordered extent
  might exist, and our fsync callback (btrfs_sync_file) doesn't wait for such
  ordered extent to finish when calling btrfs_wait_ordered_range();

And then by the time btrfs_ordered_update_i_size() is called, via:

   btrfs_sync_file() ->
       btrfs_log_dentry_safe() ->
           btrfs_log_inode_parent() ->
               btrfs_log_inode() ->
                   btrfs_truncate_inode_items() ->
                       btrfs_ordered_update_i_size()

We hit the BUG_ON(), which could never happen when msync() used the whole file
range when calling fsync (i.e. before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8).

So just don't call btrfs_ordered_update_i_size() if we're removing inode items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size,
and never did before commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 because we
used to wait for all ordered extents (and therefore the end of any extent found
by btrfs_truncate_inode_items was always smaller than the in memory inode's
disk_i_size).

Issue found while running xfstests/generic/127 (happens very rarely for me).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
 			btrfs_abort_transaction(trans, root, ret);
 	}
 error:
-	if (last_size != (u64)-1)
+	if (last_size != (u64)-1 &&
+	    root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
 		btrfs_ordered_update_i_size(inode, last_size, NULL);
 	btrfs_free_path(path);
 	return err;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
@ 2014-08-29 14:51 ` Christoph Hellwig
  2014-08-29 15:53   ` Chris Mason
  2014-08-29 18:52   ` Filipe David Manana
  2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
  1 sibling, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 14:51 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
> "mm/msync.c: sync only the requested range in msync()"), our fsync
> callback can be called with a range that covers only part of the
> file and not the whole file anymore.

Not that there have been other users of range fsyncs around for a long
time.  The SCSI target code is one, the NFS server another, and last but
not least generic_write_sync() which is used by most filesystems
including btrfs to implement O_SYNC writes.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 15:53   ` Chris Mason
  2014-08-29 18:52   ` Filipe David Manana
  1 sibling, 0 replies; 7+ messages in thread
From: Chris Mason @ 2014-08-29 15:53 UTC (permalink / raw)
  To: Christoph Hellwig, Filipe Manana; +Cc: linux-btrfs



On 08/29/2014 10:51 AM, Christoph Hellwig wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
> 
> Not that there have been other users of range fsyncs around for a long
> time.  The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.

Yeah, this patch is going in, thanks Filipe.

-chris


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 14:51 ` Christoph Hellwig
  2014-08-29 15:53   ` Chris Mason
@ 2014-08-29 18:52   ` Filipe David Manana
  2014-08-29 18:55     ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason

On Fri, Aug 29, 2014 at 3:51 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
>
> Not that there have been other users of range fsyncs around for a long
> time.  The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.

Point taken Christoph, this isn't specific to msync nor impossible to
happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
I'll update the commit log to reflect that.

Thanks for looking.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 18:52   ` Filipe David Manana
@ 2014-08-29 18:55     ` Christoph Hellwig
  2014-08-29 18:57       ` Filipe David Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 18:55 UTC (permalink / raw)
  To: Filipe David Manana
  Cc: Christoph Hellwig, Filipe Manana, linux-btrfs@vger.kernel.org,
	Chris Mason

On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
> Point taken Christoph, this isn't specific to msync nor impossible to
> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
> I'll update the commit log to reflect that.
> 
> Thanks for looking.

Btw, if you have an easy enough reproducer it would be good to wire it
up for xfstests.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 18:55     ` Christoph Hellwig
@ 2014-08-29 18:57       ` Filipe David Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason

On Fri, Aug 29, 2014 at 7:55 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
>> Point taken Christoph, this isn't specific to msync nor impossible to
>> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
>> I'll update the commit log to reflect that.
>>
>> Thanks for looking.
>
> Btw, if you have an easy enough reproducer it would be good to wire it
> up for xfstests.

See the bottom of the commit message, where I mention that it happens
(not very often) when running xfstests/generic/127 - more specifically
the calls to fsx with memory mapped IO.

thanks



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2] Btrfs: fix crash while doing a ranged fsync
  2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
  2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 19:54 ` Filipe Manana
  1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 19:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:

[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)

The necessary conditions that lead to such crash are:

* an incremental fsync (when the inode doesn't have the
  BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
  a file extent item ending at offset X;

* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
  to a file truncate operation that reduces the file to a size smaller
  than X;

* a ranged fsync call happens (via an msync for example), with a range that
  doesn't cover the whole file and the end of this range, lets call it Y, is
  smaller than X;

* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
  calls btrfs_truncate_inode_items() to remove all items from the log
  tree that are associated with our file;

* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
  file extent item it removed is the one ending at offset X, where X > 0 and
  X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
  parameter set to X;

* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
  size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
  ordered operation with a range covering the offset X, calling a BUG_ON() if
  such ordered operation exists. This assumption is made because the disk_i_size
  is only increased after the corresponding file extent item is added to the
  btree (btrfs_finish_ordered_io);

* But because our fsync covers only a limited range, such an ordered extent might
  exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
  extent to finish when calling btrfs_wait_ordered_range();

And then by the time btrfs_ordered_update_i_size() is called, via:

   btrfs_sync_file() ->
       btrfs_log_dentry_safe() ->
           btrfs_log_inode_parent() ->
               btrfs_log_inode() ->
                   btrfs_truncate_inode_items() ->
                       btrfs_ordered_update_i_size()

We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().

So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.

Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Updated commit message, such that it reflects the fact that ranged fsyncs are
    not used only by msync.

 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
 			btrfs_abort_transaction(trans, root, ret);
 	}
 error:
-	if (last_size != (u64)-1)
+	if (last_size != (u64)-1 &&
+	    root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
 		btrfs_ordered_update_i_size(inode, last_size, NULL);
 	btrfs_free_path(path);
 	return err;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-08-29 18:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53   ` Chris Mason
2014-08-29 18:52   ` Filipe David Manana
2014-08-29 18:55     ` Christoph Hellwig
2014-08-29 18:57       ` Filipe David Manana
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.