Re: [PATCH 1/2] Btrfs: fix race between fsync and direct IO writes for prealloc extents

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Filipe Manana <fdmanana@kernel.org>
To: Josef Bacik <jbacik@fb.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH 1/2] Btrfs: fix race between fsync and direct IO writes for prealloc extents
Date: Wed, 11 May 2016 16:56:56 +0100	[thread overview]
Message-ID: <CAL3q7H6t-64EKDP6MeM3MmNOjBN24OEGPOE17y0pyMYkt9pknA@mail.gmail.com> (raw)
In-Reply-To: <c1a1a668-07c9-3402-993f-de7b341511e7@fb.com>

On Wed, May 11, 2016 at 4:41 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 05/09/2016 07:01 AM, fdmanana@kernel.org wrote:
>>
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> When we do a direct IO write against a preallocated extent (fallocate)
>> that does not go beyond the i_size of the inode, we do the write operation
>> without holding the inode's i_mutex (an optimization that landed in
>> commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows
>> for a very tiny time window where a race can happen with a concurrent
>> fsync using the fast code path, as the direct IO write path creates first
>> a new extent map (no longer flagged as a prealloc extent) and then it
>> creates the ordered extent, while the fast fsync path first collects
>> ordered extents and then it collects extent maps. This allows for the
>> possibility of the fast fsync path to collect the new extent map without
>> collecting the new ordered extent, and therefore logging an extent item
>> based on the extent map without waiting for the ordered extent to be
>> created and complete. This can result in a situation where after a log
>> replay we end up with an extent not marked anymore as prealloc but it was
>> only partially written (or not written at all), exposing random, stale or
>> garbage data corresponding to the unwritten pages and without any
>> checksums in the csum tree covering the extent's range.
>>
>> This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix
>> race between fsync and lockless direct IO writes").
>>
>> So fix this by creating first the ordered extent and then the extent
>> map, so that this way if the fast fsync patch collects the new extent
>> map it also collects the corresponding ordered extent.
>>
>
> So this seems to just be trading one problem for another bad behavior. Now
> instead we'll end up with an ordered extent but possibly not the extent map,
> which is fine, but seems ugly, and is a subtle and undocumented behavior
> that is going to bite us in the ass in the future.
>
> I know everybody loves lockless writes, but we need to protect for this
> particular case in an explicit way so I don't go change something in the
> future and forget about this dependency.  What if we add a rwsem around
> adding the extent map and the ordered extent.  In fact we pull this dance
> out into a common helper function so everybody who wants to do this just
> calls the helper function that is easy to audit and understand, then we just
> put a
>
> down_read(&BTRFS_I(inode)->lets_not_fuck_fsync));
> /* Do our extent map + ordered extent dance here */
> up_read(&BTRFS_I(inode)->lets_not_fuck_fsync));
>
> and then where we gather all of this stuff in fsync we do a
> down_write/up_write.  This won't affect the buffered write case currently as
> we're still protected by the i_mutex, and then makes the direct io case much
> cleaner and safer.  Then later on when we want to remove the i_mutex from
> the buffered write case we have one less gotcha to worry about.  Thanks,

Agreed. I though about something similar initially but I didn't want
to increase the size of struct btrfs_inode.
Would you mind that if instead of changing this patch I do one on top
that introduces that helper function, and makes this and the other
place in the dio path that does the same logic use it too?

thanks

>
> Josef

next prev parent reply	other threads:[~2016-05-11 15:56 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-09 14:01 [PATCH 1/2] Btrfs: fix race between fsync and direct IO writes for prealloc extents fdmanana
2016-05-11 15:41 ` Josef Bacik
2016-05-11 15:56   ` Filipe Manana [this message]
2016-05-11 16:04     ` Josef Bacik
2016-05-11 16:05     ` Chris Mason
2016-05-11 16:10       ` Josef Bacik
2016-05-11 16:12         ` Chris Mason
2016-05-11 16:14           ` Josef Bacik
2016-05-12 17:36 ` Josef Bacik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAL3q7H6t-64EKDP6MeM3MmNOjBN24OEGPOE17y0pyMYkt9pknA@mail.gmail.com \
    --to=fdmanana@kernel.org \
    --cc=jbacik@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).