linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
@ 2014-08-29 12:37 Filipe Manana
  2014-08-29 14:51 ` Christoph Hellwig
  2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
  0 siblings, 2 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 12:37 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
"mm/msync.c: sync only the requested range in msync()"), our fsync
callback can be called with a range that covers only part of the
file and not the whole file anymore. Under certain circumstances
this leads to crashes that produce the following trace:

[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)

The necessary conditions that lead to such crash are:

* an incremental fsync (when the inode doesn't have the
  BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
  a file extent item ending at offset X;

* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
  to a file truncate operation that reduces the file to a size smaller
  than X;

* an msync call happens, with a range that doesn't cover the whole file
  and the end of this range, lets call it Y, is smaller than X;

* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
  calls btrfs_truncate_inode_items() to remove all items from the log
  tree that are associated with our file;

* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
  file extent item it removed is the one ending at offset X, where X > 0 and
  X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
  parameter set to X;

* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
  size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
  ordered operation with a range covering the offset X, calling a BUG_ON() if
  such ordered operation exists. This assumption is made because the disk_i_size
  is only increased after the corresponding file extent item is added to the
  btree (btrfs_finish_ordered_io);

* But because our msync/fsync covers only a limited range, such an ordered extent
  might exist, and our fsync callback (btrfs_sync_file) doesn't wait for such
  ordered extent to finish when calling btrfs_wait_ordered_range();

And then by the time btrfs_ordered_update_i_size() is called, via:

   btrfs_sync_file() ->
       btrfs_log_dentry_safe() ->
           btrfs_log_inode_parent() ->
               btrfs_log_inode() ->
                   btrfs_truncate_inode_items() ->
                       btrfs_ordered_update_i_size()

We hit the BUG_ON(), which could never happen when msync() used the whole file
range when calling fsync (i.e. before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8).

So just don't call btrfs_ordered_update_i_size() if we're removing inode items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size,
and never did before commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 because we
used to wait for all ordered extents (and therefore the end of any extent found
by btrfs_truncate_inode_items was always smaller than the in memory inode's
disk_i_size).

Issue found while running xfstests/generic/127 (happens very rarely for me).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
 			btrfs_abort_transaction(trans, root, ret);
 	}
 error:
-	if (last_size != (u64)-1)
+	if (last_size != (u64)-1 &&
+	    root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
 		btrfs_ordered_update_i_size(inode, last_size, NULL);
 	btrfs_free_path(path);
 	return err;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
@ 2014-08-29 14:51 ` Christoph Hellwig
  2014-08-29 15:53   ` Chris Mason
  2014-08-29 18:52   ` Filipe David Manana
  2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
  1 sibling, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 14:51 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
> "mm/msync.c: sync only the requested range in msync()"), our fsync
> callback can be called with a range that covers only part of the
> file and not the whole file anymore.

Not that there have been other users of range fsyncs around for a long
time.  The SCSI target code is one, the NFS server another, and last but
not least generic_write_sync() which is used by most filesystems
including btrfs to implement O_SYNC writes.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 15:53   ` Chris Mason
  2014-08-29 18:52   ` Filipe David Manana
  1 sibling, 0 replies; 7+ messages in thread
From: Chris Mason @ 2014-08-29 15:53 UTC (permalink / raw)
  To: Christoph Hellwig, Filipe Manana; +Cc: linux-btrfs



On 08/29/2014 10:51 AM, Christoph Hellwig wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
> 
> Not that there have been other users of range fsyncs around for a long
> time.  The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.

Yeah, this patch is going in, thanks Filipe.

-chris


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 14:51 ` Christoph Hellwig
  2014-08-29 15:53   ` Chris Mason
@ 2014-08-29 18:52   ` Filipe David Manana
  2014-08-29 18:55     ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason

On Fri, Aug 29, 2014 at 3:51 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
>
> Not that there have been other users of range fsyncs around for a long
> time.  The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.

Point taken Christoph, this isn't specific to msync nor impossible to
happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
I'll update the commit log to reflect that.

Thanks for looking.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 18:52   ` Filipe David Manana
@ 2014-08-29 18:55     ` Christoph Hellwig
  2014-08-29 18:57       ` Filipe David Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 18:55 UTC (permalink / raw)
  To: Filipe David Manana
  Cc: Christoph Hellwig, Filipe Manana, linux-btrfs@vger.kernel.org,
	Chris Mason

On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
> Point taken Christoph, this isn't specific to msync nor impossible to
> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
> I'll update the commit log to reflect that.
> 
> Thanks for looking.

Btw, if you have an easy enough reproducer it would be good to wire it
up for xfstests.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
  2014-08-29 18:55     ` Christoph Hellwig
@ 2014-08-29 18:57       ` Filipe David Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason

On Fri, Aug 29, 2014 at 7:55 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
>> Point taken Christoph, this isn't specific to msync nor impossible to
>> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
>> I'll update the commit log to reflect that.
>>
>> Thanks for looking.
>
> Btw, if you have an easy enough reproducer it would be good to wire it
> up for xfstests.

See the bottom of the commit message, where I mention that it happens
(not very often) when running xfstests/generic/127 - more specifically
the calls to fsx with memory mapped IO.

thanks



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2] Btrfs: fix crash while doing a ranged fsync
  2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
  2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 19:54 ` Filipe Manana
  1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 19:54 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Filipe Manana

While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:

[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)

The necessary conditions that lead to such crash are:

* an incremental fsync (when the inode doesn't have the
  BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
  a file extent item ending at offset X;

* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
  to a file truncate operation that reduces the file to a size smaller
  than X;

* a ranged fsync call happens (via an msync for example), with a range that
  doesn't cover the whole file and the end of this range, lets call it Y, is
  smaller than X;

* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
  calls btrfs_truncate_inode_items() to remove all items from the log
  tree that are associated with our file;

* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
  file extent item it removed is the one ending at offset X, where X > 0 and
  X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
  parameter set to X;

* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
  size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
  ordered operation with a range covering the offset X, calling a BUG_ON() if
  such ordered operation exists. This assumption is made because the disk_i_size
  is only increased after the corresponding file extent item is added to the
  btree (btrfs_finish_ordered_io);

* But because our fsync covers only a limited range, such an ordered extent might
  exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
  extent to finish when calling btrfs_wait_ordered_range();

And then by the time btrfs_ordered_update_i_size() is called, via:

   btrfs_sync_file() ->
       btrfs_log_dentry_safe() ->
           btrfs_log_inode_parent() ->
               btrfs_log_inode() ->
                   btrfs_truncate_inode_items() ->
                       btrfs_ordered_update_i_size()

We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().

So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.

Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: Updated commit message, such that it reflects the fact that ranged fsyncs are
    not used only by msync.

 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
 			btrfs_abort_transaction(trans, root, ret);
 	}
 error:
-	if (last_size != (u64)-1)
+	if (last_size != (u64)-1 &&
+	    root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
 		btrfs_ordered_update_i_size(inode, last_size, NULL);
 	btrfs_free_path(path);
 	return err;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-08-29 18:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53   ` Chris Mason
2014-08-29 18:52   ` Filipe David Manana
2014-08-29 18:55     ` Christoph Hellwig
2014-08-29 18:57       ` Filipe David Manana
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).