From: Filipe Manana <fdmanana@kernel.org>
To: Josef Bacik <jbacik@fb.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 1/2] Btrfs: fix race between fsync and direct IO writes for prealloc extents
Date: Wed, 11 May 2016 16:56:56 +0100 [thread overview]
Message-ID: <CAL3q7H6t-64EKDP6MeM3MmNOjBN24OEGPOE17y0pyMYkt9pknA@mail.gmail.com> (raw)
In-Reply-To: <c1a1a668-07c9-3402-993f-de7b341511e7@fb.com>
On Wed, May 11, 2016 at 4:41 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 05/09/2016 07:01 AM, fdmanana@kernel.org wrote:
>>
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> When we do a direct IO write against a preallocated extent (fallocate)
>> that does not go beyond the i_size of the inode, we do the write operation
>> without holding the inode's i_mutex (an optimization that landed in
>> commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows
>> for a very tiny time window where a race can happen with a concurrent
>> fsync using the fast code path, as the direct IO write path creates first
>> a new extent map (no longer flagged as a prealloc extent) and then it
>> creates the ordered extent, while the fast fsync path first collects
>> ordered extents and then it collects extent maps. This allows for the
>> possibility of the fast fsync path to collect the new extent map without
>> collecting the new ordered extent, and therefore logging an extent item
>> based on the extent map without waiting for the ordered extent to be
>> created and complete. This can result in a situation where after a log
>> replay we end up with an extent not marked anymore as prealloc but it was
>> only partially written (or not written at all), exposing random, stale or
>> garbage data corresponding to the unwritten pages and without any
>> checksums in the csum tree covering the extent's range.
>>
>> This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix
>> race between fsync and lockless direct IO writes").
>>
>> So fix this by creating first the ordered extent and then the extent
>> map, so that this way if the fast fsync patch collects the new extent
>> map it also collects the corresponding ordered extent.
>>
>
> So this seems to just be trading one problem for another bad behavior. Now
> instead we'll end up with an ordered extent but possibly not the extent map,
> which is fine, but seems ugly, and is a subtle and undocumented behavior
> that is going to bite us in the ass in the future.
>
> I know everybody loves lockless writes, but we need to protect for this
> particular case in an explicit way so I don't go change something in the
> future and forget about this dependency. What if we add a rwsem around
> adding the extent map and the ordered extent. In fact we pull this dance
> out into a common helper function so everybody who wants to do this just
> calls the helper function that is easy to audit and understand, then we just
> put a
>
> down_read(&BTRFS_I(inode)->lets_not_fuck_fsync));
> /* Do our extent map + ordered extent dance here */
> up_read(&BTRFS_I(inode)->lets_not_fuck_fsync));
>
> and then where we gather all of this stuff in fsync we do a
> down_write/up_write. This won't affect the buffered write case currently as
> we're still protected by the i_mutex, and then makes the direct io case much
> cleaner and safer. Then later on when we want to remove the i_mutex from
> the buffered write case we have one less gotcha to worry about. Thanks,
Agreed. I though about something similar initially but I didn't want
to increase the size of struct btrfs_inode.
Would you mind that if instead of changing this patch I do one on top
that introduces that helper function, and makes this and the other
place in the dio path that does the same logic use it too?
thanks
>
> Josef
next prev parent reply other threads:[~2016-05-11 15:56 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-05-09 14:01 [PATCH 1/2] Btrfs: fix race between fsync and direct IO writes for prealloc extents fdmanana
2016-05-11 15:41 ` Josef Bacik
2016-05-11 15:56 ` Filipe Manana [this message]
2016-05-11 16:04 ` Josef Bacik
2016-05-11 16:05 ` Chris Mason
2016-05-11 16:10 ` Josef Bacik
2016-05-11 16:12 ` Chris Mason
2016-05-11 16:14 ` Josef Bacik
2016-05-12 17:36 ` Josef Bacik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAL3q7H6t-64EKDP6MeM3MmNOjBN24OEGPOE17y0pyMYkt9pknA@mail.gmail.com \
--to=fdmanana@kernel.org \
--cc=jbacik@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).