linux-btrfs.vger.kernel.org archive mirror
From: Filipe David Manana <fdmanana@gmail.com>
To: Stefan Behrens <sbehrens@giantdisaster.de>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] Btrfs: fix race between removing a dev and writing sbs
Date: Fri, 9 Aug 2013 14:50:35 +0100
Message-ID: <CAL3q7H5nVVDdCKNA_u16cDaqTierpCpftCyrfQTLd7v8AbOmBg@mail.gmail.com>
In-Reply-To: <5204E978.9010403@giantdisaster.de>

On Fri, Aug 9, 2013 at 2:07 PM, Stefan Behrens
<sbehrens@giantdisaster.de> wrote:
> On Thu,  8 Aug 2013 21:00:52 +0100, Filipe David Borba Manana wrote:
>> Since all code paths that update the number of devices in the
>> super copy (fs_info->super_copy) first lock the device list
>> (fs_info->fs_devices->device_list_mutex), and write_all_supers()
>> also needs to lock the devices list mutex, make write_all_supers()
>> read the number of devices from the super copy after it locks
>> the device list mutex (and before unlocking it of course).
>>
>> The only code path that doesn't lock the device list mutex
>> before updating the number of devices in the super copy is
>> disk-io.c:next_root_backup(), called by open_ctree() during
>> mount time where concurrency issues can't happen.
>>
>> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
>> ---
>>  fs/btrfs/disk-io.c |    2 +-
>>  fs/btrfs/volumes.c |   11 ++++-------
>>  2 files changed, 5 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 254cdc8..c4b24c7 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>       int total_errors = 0;
>>       u64 flags;
>>
>> -     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>       do_barriers = !btrfs_test_opt(root, NOBARRIER);
>>       backup_super_roots(root->fs_info);
>>
>> @@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>
>>       mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>>       head = &root->fs_info->fs_devices->devices;
>> +     max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>
>>       if (do_barriers) {
>>               ret = barrier_all_devices(root->fs_info);
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 090f57c..eddf386 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>       if (ret)
>>               goto error_undo;
>>
>> -     /*
>> -      * TODO: the superblock still includes this device in its num_devices
>> -      * counter although write_all_supers() is not locked out. This
>> -      * could give a filesystem state which requires a degraded mount.
>> -      */
>>       ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
>
> The problem that I had seen when I added that comment is something
> different than what you are addressing.
>
> The call to btrfs_rm_dev_item() is the place where the device is removed
> in the filesystem device tree. The transaction is committed.

So it would only be fully correct if the call to btrfs_rm_dev_item()
(and the code that follows it) ran inside the critical section
delimited by the device list mutex, with the super_copy num_devices
also updated inside that section, as my patch does.

Other than a potentially much longer critical section, or a possible
deadlock (because btrfs_scrub_cancel() locks scrub_lock), is there any
reason not to do it?
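
As a rough userspace illustration of that idea (not kernel code): the
sketch below keeps a modelled device tree, the device list and the
super_copy count behind a single pthread mutex, so anyone who takes the
mutex sees all three in agreement. The struct, its fields and both
helper functions are hypothetical names made up for this example; only
the mutex name is borrowed from the patch.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the state touched by a device removal. */
struct fake_devices {
	pthread_mutex_t device_list_mutex; /* models fs_devices->device_list_mutex */
	int tree_items;         /* models dev items in the device tree */
	int list_len;           /* models the length of fs_devices->devices */
	int super_num_devices;  /* models num_devices in super_copy */
};

/*
 * Remove one device with the tree item removal, the list removal and
 * the super_copy update all inside one critical section.
 */
static void fake_rm_device_locked(struct fake_devices *d)
{
	pthread_mutex_lock(&d->device_list_mutex);
	d->tree_items--;          /* stands in for btrfs_rm_dev_item() */
	d->list_len--;            /* stands in for the list removal */
	d->super_num_devices--;   /* stands in for btrfs_set_super_num_devices() */
	pthread_mutex_unlock(&d->device_list_mutex);
}

/*
 * What a superblock writer would observe after taking the mutex:
 * true if the three counters agree with each other.
 */
static bool fake_counts_agree(struct fake_devices *d)
{
	bool agree;

	pthread_mutex_lock(&d->device_list_mutex);
	agree = d->tree_items == d->list_len &&
		d->list_len == d->super_num_devices;
	pthread_mutex_unlock(&d->device_list_mutex);
	return agree;
}
```

The sketch deliberately ignores the two costs mentioned above: the real
critical section would be much longer, and taking other locks inside it
could deadlock.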

>
> root->fs_info->super_copy is not updated and still includes the device
> that is not part of the device tree anymore.
>
> 19 lines later, the device_list_mutex is acquired. Until then, nobody
> prevents write_all_supers() from writing the superblock to disk. This means,
> until then, you can create a state on disk with an updated device tree
> and a num_devices value which is too high by one.
>
> If you now crash or the power drops, the on-disk state is not
> consistent. However, this is not a severe problem. btrfs_rm_device()
> relocates all chunks that are located on the removed device. On next
> mount, at first the device items are read which do not include the
> deleted device anymore, afterwards the chunks are checked, whether they
> reference a device that is not present. And this is not the case.
> Therefore this situation is not a severe problem and my comment was
> wrong that says "could require a degraded mount".
>
> But the field num_devices in the superblock will stay wrong for the
> lifetime of the filesystem, causing malfunction of the ioctl
> BTRFS_IOC_DEVICES_READY, and potentially causing trouble in the future
> when somebody adds code that relies on fs_devices->total_devices being
> correct.
>
> It's simply not correct like it is now. And your patch doesn't fix the
> issue that the TODO comment describes.

Thanks for the explanation, very helpful.

Indeed, it doesn't fix the issue you described. I was thinking more
about fixing the following issue:

1) write_all_supers() reads a device count of N from super_copy, so it
will not panic as long as it fails to write sbs for at most N - 1
devices;

2) It then tries to acquire device_list_mutex, but blocks because
btrfs_rm_device() got it first;

3) btrfs_rm_device() removes the device from the list, does the rest
of its work and then unlocks the device list mutex;

4) write_all_supers() acquires the mutex, iterates over all devices in
the list and gets N - 1 errors (it failed to write the sb to every
device);

5) Because N - 1 is less than N, it thinks all is OK, when it's not,
because there are actually only N - 1 devices now. Therefore the
BUG_ON() won't get executed.

This is more likely to happen with a small number of devices (e.g. 2 ->
1). I will re-add your comment, as this patch fixes something
different.
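
The five steps above can be replayed deterministically in userspace.
This is only a sketch under stated assumptions: the struct and function
names are hypothetical, the mutex is modelled by the order of the
statements rather than by a real lock, every superblock write is
assumed to fail, and the failure check is assumed to be
"total_errors > max_errors".

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state shared by both code paths. */
struct fake_fs {
	int super_num_devices;  /* models btrfs_super_num_devices(super_copy) */
	int list_len;           /* models the length of fs_devices->devices */
};

/* Models the work btrfs_rm_device() does under the device list mutex. */
static void fake_rm_device(struct fake_fs *fs)
{
	fs->list_len--;
	fs->super_num_devices--;
}

/*
 * Models write_all_supers() racing with one device removal while all
 * superblock writes fail. With read_before_lock set, max_errors is
 * computed before the "mutex" is taken (old ordering); otherwise it is
 * computed afterwards (patched ordering).
 * Returns true if the total failure is detected.
 */
static bool fake_write_all_supers(struct fake_fs *fs, bool read_before_lock)
{
	int max_errors = 0;
	int total_errors = 0;

	if (read_before_lock)
		max_errors = fs->super_num_devices - 1;  /* step 1 */

	/* steps 2 and 3: we wait for the mutex while the removal runs */
	fake_rm_device(fs);

	/* step 4: mutex acquired */
	if (!read_before_lock)
		max_errors = fs->super_num_devices - 1;

	for (int i = 0; i < fs->list_len; i++)
		total_errors++;  /* assumption: every write fails */

	/* step 5: the check that would lead to BUG_ON() */
	return total_errors > max_errors;
}
```

Starting from two devices, the old ordering computes max_errors = 1,
sees one error and returns false (the failure goes unnoticed), while
the patched ordering computes max_errors = 0 and returns true.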

>
>
>>       if (ret)
>>               goto error_undo;
>> @@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>       /*
>>        * the device list mutex makes sure that we don't change
>>        * the device list while someone else is writing out all
>> -      * the device supers.
>> +      * the device supers. Whoever is writing all supers, should
>> +      * lock the device list mutex before getting the number of
>> +      * devices in the super block (super_copy).
>>        */
>>
>>       cur_devices = device->fs_devices;
>> @@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>               device->fs_devices->open_devices--;
>>
>>       call_rcu(&device->rcu, free_device);
>> -     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>>       num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>       btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
>> +     mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>
>>       if (cur_devices->open_devices == 0) {
>>               struct btrfs_fs_devices *fs_devices;
>>
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

Thread overview: 5+ messages
2013-08-08 20:00 [PATCH] Btrfs: fix race between removing a dev and writing sbs Filipe David Borba Manana
2013-08-09 13:07 ` Stefan Behrens
2013-08-09 13:50   ` Filipe David Manana [this message]
2013-08-09 14:58     ` Stefan Behrens
2013-08-09 14:41 ` [PATCH v2] " Filipe David Borba Manana
