* [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
@ 2014-08-29 12:37 Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
0 siblings, 2 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 12:37 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
"mm/msync.c: sync only the requested range in msync()"), our fsync
callback can be called with a range that covers only part of the
file and not the whole file anymore. Under certain circumstances
this leads to crashes that produce the following trace:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* an msync call happens, with a range that doesn't cover the whole file
and the end of this range, lets call it Y, is smaller than X;
* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our msync/fsync covers only a limited range, such an ordered extent
might exist, and our fsync callback (btrfs_sync_file) doesn't wait for such
ordered extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen when msync() used the whole file
range when calling fsync (i.e. before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8).
So just don't call btrfs_ordered_update_i_size() if we're removing inode items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size,
and never did before commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 because we
used to wait for all ordered extents (and therefore the end of any extent found
by btrfs_truncate_inode_items was always smaller than the in memory inode's
disk_i_size).
Issue found while running xfstests/generic/127 (happens very rarely for me).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
btrfs_abort_transaction(trans, root, ret);
}
error:
- if (last_size != (u64)-1)
+ if (last_size != (u64)-1 &&
+ root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
btrfs_ordered_update_i_size(inode, last_size, NULL);
btrfs_free_path(path);
return err;
--
1.9.1
^ permalink raw reply related [flat|nested] 7+ messages in thread* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
@ 2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53 ` Chris Mason
2014-08-29 18:52 ` Filipe David Manana
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
1 sibling, 2 replies; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 14:51 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
> "mm/msync.c: sync only the requested range in msync()"), our fsync
> callback can be called with a range that covers only part of the
> file and not the whole file anymore.
Not that there have been other users of range fsyncs around for a long
time. The SCSI target code is one, the NFS server another, and last but
not least generic_write_sync() which is used by most filesystems
including btrfs to implement O_SYNC writes.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 15:53 ` Chris Mason
2014-08-29 18:52 ` Filipe David Manana
1 sibling, 0 replies; 7+ messages in thread
From: Chris Mason @ 2014-08-29 15:53 UTC (permalink / raw)
To: Christoph Hellwig, Filipe Manana; +Cc: linux-btrfs
On 08/29/2014 10:51 AM, Christoph Hellwig wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
>
> Not that there have been other users of range fsyncs around for a long
> time. The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.
Yeah, this patch is going in, thanks Filipe.
-chris
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53 ` Chris Mason
@ 2014-08-29 18:52 ` Filipe David Manana
2014-08-29 18:55 ` Christoph Hellwig
1 sibling, 1 reply; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:52 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason
On Fri, Aug 29, 2014 at 3:51 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 01:37:48PM +0100, Filipe Manana wrote:
>> After the commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8 (titled
>> "mm/msync.c: sync only the requested range in msync()"), our fsync
>> callback can be called with a range that covers only part of the
>> file and not the whole file anymore.
>
> Not that there have been other users of range fsyncs around for a long
> time. The SCSI target code is one, the NFS server another, and last but
> not least generic_write_sync() which is used by most filesystems
> including btrfs to implement O_SYNC writes.
Point taken Christoph, this isn't specific to msync nor impossible to
happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
I'll update the commit log to reflect that.
Thanks for looking.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 18:52 ` Filipe David Manana
@ 2014-08-29 18:55 ` Christoph Hellwig
2014-08-29 18:57 ` Filipe David Manana
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2014-08-29 18:55 UTC (permalink / raw)
To: Filipe David Manana
Cc: Christoph Hellwig, Filipe Manana, linux-btrfs@vger.kernel.org,
Chris Mason
On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
> Point taken Christoph, this isn't specific to msync nor impossible to
> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
> I'll update the commit log to reflect that.
>
> Thanks for looking.
Btw, if you have an easy enough reproducer it would be good to wire it
up for xfstests.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] Btrfs: fix crash while a ranged msync() is ongoing
2014-08-29 18:55 ` Christoph Hellwig
@ 2014-08-29 18:57 ` Filipe David Manana
0 siblings, 0 replies; 7+ messages in thread
From: Filipe David Manana @ 2014-08-29 18:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Chris Mason
On Fri, Aug 29, 2014 at 7:55 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 29, 2014 at 07:52:33PM +0100, Filipe David Manana wrote:
>> Point taken Christoph, this isn't specific to msync nor impossible to
>> happen before 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8.
>> I'll update the commit log to reflect that.
>>
>> Thanks for looking.
>
> Btw, if you have an easy enough reproducer it would be good to wire it
> up for xfstests.
See the bottom of the commit message, where I mention that it happens
(not very often) when running xfstests/generic/127 - more specifically
the calls to fsx with memory mapped IO.
thanks
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2] Btrfs: fix crash while doing a ranged fsync
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
@ 2014-08-29 19:54 ` Filipe Manana
1 sibling, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2014-08-29 19:54 UTC (permalink / raw)
To: linux-btrfs; +Cc: Filipe Manana
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:
[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)
The necessary conditions that lead to such crash are:
* an incremental fsync (when the inode doesn't have the
BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
a file extent item ending at offset X;
* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
to a file truncate operation that reduces the file to a size smaller
than X;
* a ranged fsync call happens (via an msync for example), with a range that
doesn't cover the whole file and the end of this range, lets call it Y, is
smaller than X;
* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
calls btrfs_truncate_inode_items() to remove all items from the log
tree that are associated with our file;
* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
file extent item it removed is the one ending at offset X, where X > 0 and
X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
parameter set to X;
* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
ordered operation with a range covering the offset X, calling a BUG_ON() if
such ordered operation exists. This assumption is made because the disk_i_size
is only increased after the corresponding file extent item is added to the
btree (btrfs_finish_ordered_io);
* But because our fsync covers only a limited range, such an ordered extent might
exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
extent to finish when calling btrfs_wait_ordered_range();
And then by the time btrfs_ordered_update_i_size() is called, via:
btrfs_sync_file() ->
btrfs_log_dentry_safe() ->
btrfs_log_inode_parent() ->
btrfs_log_inode() ->
btrfs_truncate_inode_items() ->
btrfs_ordered_update_i_size()
We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().
So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
V2: Updated commit message, such that it reflects the fact that ranged fsyncs are
not used only by msync.
fs/btrfs/inode.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 16e8146..c5ef9eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4296,7 +4296,8 @@ out:
btrfs_abort_transaction(trans, root, ret);
}
error:
- if (last_size != (u64)-1)
+ if (last_size != (u64)-1 &&
+ root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
btrfs_ordered_update_i_size(inode, last_size, NULL);
btrfs_free_path(path);
return err;
--
1.9.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-08-29 18:57 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-08-29 12:37 [PATCH] Btrfs: fix crash while a ranged msync() is ongoing Filipe Manana
2014-08-29 14:51 ` Christoph Hellwig
2014-08-29 15:53 ` Chris Mason
2014-08-29 18:52 ` Filipe David Manana
2014-08-29 18:55 ` Christoph Hellwig
2014-08-29 18:57 ` Filipe David Manana
2014-08-29 19:54 ` [PATCH v2] Btrfs: fix crash while doing a ranged fsync Filipe Manana
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).