From: Stefan Behrens <sbehrens@giantdisaster.de>
To: fdmanana@gmail.com
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] Btrfs: fix race between removing a dev and writing sbs
Date: Fri, 09 Aug 2013 16:58:52 +0200 [thread overview]
Message-ID: <520503AC.4040400@giantdisaster.de> (raw)
In-Reply-To: <CAL3q7H5nVVDdCKNA_u16cDaqTierpCpftCyrfQTLd7v8AbOmBg@mail.gmail.com>
On Fri, 9 Aug 2013 14:50:35 +0100, Filipe David Manana wrote:
> On Fri, Aug 9, 2013 at 2:07 PM, Stefan Behrens
> <sbehrens@giantdisaster.de> wrote:
>> On Thu, 8 Aug 2013 21:00:52 +0100, Filipe David Borba Manana wrote:
>>> Since all code paths that update the number of devices in the
>>> super copy (fs_info->super_copy) first lock the device list
>>> (fs_info->fs_devices->device_list_mutex), and write_all_supers()
>>> also needs to lock the devices list mutex, make write_all_supers()
>>> read the number of devices from the super copy after it locks
>>> the device list mutex (and before unlocking it of course).
>>>
>>> The only code path that doesn't lock the device list mutex
>>> before updating the number of devices in the super copy is
>>> disk-io.c:next_root_backup(), called by open_ctree() during
>>> mount time where concurrency issues can't happen.
>>>
>>> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
>>> ---
>>> fs/btrfs/disk-io.c | 2 +-
>>> fs/btrfs/volumes.c | 11 ++++-------
>>> 2 files changed, 5 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 254cdc8..c4b24c7 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3313,7 +3313,6 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>> int total_errors = 0;
>>> u64 flags;
>>>
>>> - max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>> do_barriers = !btrfs_test_opt(root, NOBARRIER);
>>> backup_super_roots(root->fs_info);
>>>
>>> @@ -3322,6 +3321,7 @@ static int write_all_supers(struct btrfs_root *root, int max_mirrors)
>>>
>>> mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
>>> head = &root->fs_info->fs_devices->devices;
>>> + max_errors = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>>
>>> if (do_barriers) {
>>> ret = barrier_all_devices(root->fs_info);
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 090f57c..eddf386 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -1568,11 +1568,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>> if (ret)
>>> goto error_undo;
>>>
>>> - /*
>>> - * TODO: the superblock still includes this device in its num_devices
>>> - * counter although write_all_supers() is not locked out. This
>>> - * could give a filesystem state which requires a degraded mount.
>>> - */
>>> ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
>>
>> The problem that I had seen when I added that comment is something
>> different than what you are addressing.
>>
>> The call to btrfs_rm_dev_item() is the place where the device is removed
>> in the filesystem device tree. The transaction is commited.
>
> So, it would only be super correct if the call to btrfs_rm_dev_item()
> (and the following code) is run inside the critical section delimited
> by the device list mutex (and have the super_copy num devices updated
> inside that section too, like I did).
>
> Other than a potentially much longer critical section, or mutex
> deadlock (because btrfs_scrub_cancel locks scrub_lock), any reason to
> not do it?
Yes, committing a transaction while holding such a mutex usually causes
a deadlock.
>
>>
>> root->fs_info->super_copy is not updated and still includes the device
>> that is not part of the device tree anymore.
>>
>> 19 lines later, the device_list_mutex is acquired. Until then, nobody
>> prevents write_all_supers() to write the superblock to disk. This means,
>> until then, you can create a state on disk with an updated device tree
>> and a num_devices value which is too high by one.
>>
>> If you now crash or the power drops, the on-disk state is not
>> consistent. However, this is not a severe problem. btrfs_rm_device()
>> relocates all chunks that are located on the removed device. On next
>> mount, at first the device items are read which do not include the
>> deleted device anymore, afterwards the chunks are checked, whether they
>> reference a device that is not present. And this is not the case.
>> Therefore this situation is not a severe problem and my comment was
>> wrong that says "could require a degraded mount".
>>
>> But the field num_devices in the superblock will stay wrong for the
>> lifetime of the filesystem, causing malfunction of the ioctl
>> BTRFS_IOC_DEVICES_READY, and potentially causing trouble in the future
>> when somebody adds code that relies on fs_devices->total_devices being
>> correct.
>>
>> It's simply not correct like it is now. And your patch doesn't fix the
>> issue that the TODO comment describes.
>
> Thanks for the explanation, very helpful.
>
> Indeed, it doesn't fix the issue you described. I thought more about
> fixing the following issue:
>
> 1) Write super gets a number of N devices from super_copy, so it will
> not panic if it fails to write dbs for N - 1 devices;
>
> 2) Then tries to acquire device_list_mutex, but blocks because
> btrfs_rm_device() got it first
>
> 3) btrfs_rm_device() removes the device from the list, and does all
> those things it does and then unlocks the dev list mutex;
>
> 4) write_all_supers() acquires the mutex, iterates over all devices in
> the list and gets N - 1 errors (failed to write db to all devices)
>
> 5) Because N - 1 is less than N, it thinks all is ok, when it's not
> because there's actually only N - 1 devices now. Therefore the
> BUG_ON() won't get executed.
Right, this is also a problem.
>
> This is more likely to happen for a small number of devices only (2 ->
> 1 for e.g.). I will revert re-add your comment, as this fixes
> something different.
>
>>
>>
>>> if (ret)
>>> goto error_undo;
>>> @@ -1588,7 +1583,9 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>> /*
>>> * the device list mutex makes sure that we don't change
>>> * the device list while someone else is writing out all
>>> - * the device supers.
>>> + * the device supers. Whoever is writing all supers, should
>>> + * lock the device list mutex before getting the number of
>>> + * devices in the super block (super_copy).
>>> */
>>>
>>> cur_devices = device->fs_devices;
>>> @@ -1612,10 +1609,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>> device->fs_devices->open_devices--;
>>>
>>> call_rcu(&device->rcu, free_device);
>>> - mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>>
>>> num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
>>> btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
>>> + mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
>>>
>>> if (cur_devices->open_devices == 0) {
>>> struct btrfs_fs_devices *fs_devices;
next prev parent reply other threads:[~2013-08-09 14:58 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-08-08 20:00 [PATCH] Btrfs: fix race between removing a dev and writing sbs Filipe David Borba Manana
2013-08-09 13:07 ` Stefan Behrens
2013-08-09 13:50 ` Filipe David Manana
2013-08-09 14:58 ` Stefan Behrens [this message]
2013-08-09 14:41 ` [PATCH v2] " Filipe David Borba Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=520503AC.4040400@giantdisaster.de \
--to=sbehrens@giantdisaster.de \
--cc=fdmanana@gmail.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).