* [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
@ 2014-08-29 12:37 Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
0 siblings, 2 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 12:37 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
"mm/msync.c: sync only the requested range in msync()"), our fsync
callback can be called with a range that covers only part of the
file and not the whole file anymore. Under certain circumstances
this leads to crashes that produce the following trace:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* an msync call happens, with a range that doesn't cover the whole file
and the end of this range, lets call it Y, is smaller than X;
* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our msync/fsync covers only a limited range, such an ordered extent
might exist, and our fsync callback (btrfs_sync_file) doesn't wait for such
ordered extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen when msync() used the whole file
range when calling fsync (i.e. before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8).
So just don't call btrfs_ordered_update_i_size() if we're removing inode items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size,
and never did before commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 because we
used to wait for all ordered extents (and therefore the end of any extent found
by btrfs_truncate_inode_items was always smaller than the in memory inode's
disk_i_size).
Issue found while running xfstests/generic/127 (happens very rarely for me).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
btrfs_abort_transaction(trans, root, ret);
}
error:
- if (last_size != (u64)-1)
+ if (last_size != (u64)-1 &&
+ root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
btrfs_ordered_update_i_size(inode, last_size, NULL);
btrfs_free_path(path);
return err;
--
1.9.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
@ 2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53 ` Chris Mason
2014-08-29 18:52 ` Filipe David Manana
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
1 sibling, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 14:51 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
> "mm/msync.c: sync only the requested range in msync()"), our fsync
> callback can be called with a range that covers only part of the
> file and not the whole file anymore.
Not that there have been other users of range fsyncs around for a long
time. The SCSI target code is one, the NFS server another, and last but
not least generic_write_sync() which is used by most filesystems
including btrfs to implement O_SYNC writes.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 15:53 ` Chris Mason
2014-08-29 18:52 ` Filipe David Manana
1 sibling, 0 replies; 7+ messages in thread
From: Chris Mason @ 2014-08-29 15:53 UTC (permalink / raw)
To: Christoph Hellwig, Filipe Manana; +Cc: linux-btrfs
On 08/29/2014 10:51 AM, Christoph Hellwig wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
>
> Not that there have been other users of range fsyncs around for a long
> time. The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.
Yeah, this patch is going in, thanks Filipe.
-chris
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53 ` Chris Mason
@ 2014-08-29 18:52 ` Filipe David Manana
2014-08-29 18:55 ` Christoph Hellwig
1 sibling, 1 reply; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:52 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason
On Fri, Aug 29, 2014 at 3:51 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
>
> Not that there have been other users of range fsyncs around for a long
> time. The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.
Point taken Christoph, this isn't specific to msync nor impossible to
happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
I'll update the commit log to reflect that.
Thanks for looking.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 18:52 ` Filipe David Manana
@ 2014-08-29 18:55 ` Christoph Hellwig
2014-08-29 18:57 ` Filipe David Manana
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 18:55 UTC (permalink / raw)
To: Filipe David Manana
Cc: Christoph Hellwig, Filipe Manana, linux-btrfs@vger.kernel.org,
Chris Mason
On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
> Point taken Christoph, this isn't specific to msync nor impossible to
> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
> I'll update the commit log to reflect that.
>
> Thanks for looking.
Btw, if you have an easy enough reproducer it would be good to wire it
up for xfstests.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 18:55 ` Christoph Hellwig
@ 2014-08-29 18:57 ` Filipe David Manana
0 siblings, 0 replies; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason
On Fri, Aug 29, 2014 at 7:55 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
>> Point taken Christoph, this isn't specific to msync nor impossible to
>> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
>> I'll update the commit log to reflect that.
>>
>> Thanks for looking.
>
> Btw, if you have an easy enough reproducer it would be good to wire it
> up for xfstests.
See the bottom of the commit message, where I mention that it happens
(not very often) when running xfstests/generic/127 - more specifically
the calls to fsx with memory mapped IO.
thanks
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2] Btrfs: fix crash while doing a ranged fsync
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 19:54 ` Filipe Manana
1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 19:54 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* a ranged fsync call happens (via an msync for example), with a range that
doesn't cover the whole file and the end of this range, lets call it Y, is
smaller than X;
* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our fsync covers only a limited range, such an ordered extent might
exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().
So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
V2: Updated commit message, such that it reflects the fact that ranged fsyncs are
not used only by msync.
fs/btrfs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
btrfs_abort_transaction(trans, root, ret);
}
error:
- if (last_size != (u64)-1)
+ if (last_size != (u64)-1 &&
+ root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
btrfs_ordered_update_i_size(inode, last_size, NULL);
btrfs_free_path(path);
return err;
--
1.9.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-08-29 18:57 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53 ` Chris Mason
2014-08-29 18:52 ` Filipe David Manana
2014-08-29 18:55 ` Christoph Hellwig
2014-08-29 18:57 ` Filipe David Manana
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.