From: Josef Bacik <josef@toxicpanda.com>
To: Anand Jain <anand.jain@oracle.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: free device if we fail to open it
Date: Thu, 2 Dec 2021 11:02:17 -0500 [thread overview]
Message-ID: <YajuCbMN4H0Ap76V@localhost.localdomain> (raw)
In-Reply-To: <fdd4a2c9-00dd-430c-0537-d43734337845@oracle.com>
On Thu, Dec 02, 2021 at 03:09:38PM +0800, Anand Jain wrote:
> On 02/12/2021 05:18, Josef Bacik wrote:
> > We've been having transient failures of btrfs/197 because sometimes we
> > don't show a missing device.
>
> > This turned out to be because I use LVM for my devices, and sometimes we
> > race with udev and get the "/dev/dm-#" device registered as a device in
> > the fs_devices list instead of "/dev/mapper/vg0-lv#" device.
> > Thus when
> > the test adds a device to the existing mount it doesn't find the old
> > device in the original fs_devices list and remove it properly.
>
I think most of your confusion is because you don't know what btrfs/197 does, so
I'll explain that and then answer your questions below.
DEV=/dev/vg0/lv0
RAID_DEVS=/dev/vg0/lv1 /dev/vg0/lv2 /dev/vg0/vl3 /dev/vg0/lv4
# First we create a single fs and mount it
mkfs.btrfs -f $DEV
mount $DEV /mnt/test
# Now we create the RAID fs
mkfs.btrfs -f -d raid10 -m raid10 $RAID_DEVS
# Now we add one of the raid devs to the single mount above
btrfs device add /dev/vg0/lv2 /mnt/test
# /dev/vg0/lv2 is no longer part of the fs it was made on, it's part of the fs
# that's mounted at /mnt/test
# Mount degraded with the raid setup
mount -o degraded /dev/vg0/lv1 /mnt/scratch
# Now we shouldn't have found /dev/vg0/lv2, because it was wiped and is no
# longer part of the fs_devices for this thing, except it is because it wasn't
# removed, so when we do the following it doesn't show as missing
btrfs filesystem show /mnt/scratch
> The above para is confusing. It can go. IMHO. The device path shouldn't
> matter as we depend on the bdev to compare in the device add thread.
>
> 2637 bdev = blkdev_get_by_path(device_path, FMODE_WRITE |
> FMODE_EXCL,
> 2638 fs_info->bdev_holder);
> ::
> 2657 list_for_each_entry_rcu(device, &fs_devices->devices, dev_list)
> {
> 2658 if (device->bdev == bdev) {
> 2659 ret = -EEXIST;
> 2660 rcu_read_unlock();
> 2661 goto error;
> 2662 }
> 2663 }
>
This is on the init thread, this is just checking the fs_devices of /mnt/test,
not the fs_devices of the RAID setup that we created, so this doesn't error out
(nor should it) because we're adding it to our mounted fs.
>
> > This is fine in general, because when we open the devices we check the
> > UUID, and properly skip using the device that we added to the other file
> > system. However we do not remove it from our fs_devices,
>
> Hmm, context/thread is missing. Like, is it during device add or during
> mkfs/dev-scan?
>
> AFAIK btrfs_free_stale_devices() checks and handles the removing of stale
> devices in the fs_devices in both the contexts/threads device-add, mkfs
> (device-scan).
>
> For example:
>
> $ alias cnt_free_stale_devices="bpftrace -e 'kprobe:btrfs_free_stale_devices
> { @ = count(); }'"
>
> New FSID on 2 devices, we call free_stale_devices():
>
> $ cnt_free_stale_devices -c 'mkfs.btrfs -fq -draid1 -mraid1 /dev/vg/scratch0
> /dev/vg/scratch1'
> Attaching 1 probe...
>
> @: 2
>
> We do it only when there is a new fsid/device added to the fs_devices.
>
> For example:
>
> Clean up the fs_devices:
> $ cnt_free_stale_devices -c 'btrfs dev scan --forget'
> Attaching 1 probe...
>
> @: 1
>
> Now mounting devices are new to the fs_devices list, so call
> free_stale_devices().
>
> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
> /dev/vg/scratch1 /btrfs'
> Attaching 1 probe...
>
> @: 2
>
> $ umount /btrfs
>
> Below we didn't call free_stale_devices() because these two devices are
> already in the appropriate fs_devices.
>
> $ cnt_free_stale_devices -c 'mount -o device=/dev/vg/scratch0
> /dev/vg/scratch1 /btrfs'
> Attaching 1 probe...
>
> @: 0
>
> $
>
> To me, it looks to be working correctly.
>
Yes it does work correctly, most of the time. If you run it in a loop 500 times
it'll fail, because _sometimes_ udev goes in and does teh btrfs device scan and
changes the name of the device in the fs_devices for the RAID group. So the
btrfs_free_stale_devices() thing doesn't find the exising device, because it's
just looking at the device->name, which is different from the device we're
adding.
We have the fs_devices for the RAID fs, and instead of /dev/vg0/lv2, we have
/dev/dm-4 or whatever. So we do the addition of /dev/vg0/lv2, go to find it in
any other fs_devices, and don't find it because strcmp("/dev/vg0/lv2",
"/dev/dm0-4") != 0, and thus leave the device on the fs_devices for the RAID
file system.
> > so when we get
> > the device info we still see this old device and think that everything
> > is ok.
>
>
> > We have a check for -ENODATA coming back from reading the super off of
> > the device to catch the wipefs case, but we don't catch literally any
> > other error where the device is no longer valid and thus do not remove
> > the device.
> >
>
> > Fix this to not special case an empty device when reading the super
> > block, and simply remove any device we are unable to open.
> >
> > With this fix we properly print out missing devices in the test case,
> > and after 500 iterations I'm no longer able to reproduce the problem,
> > whereas I could usually reproduce within 200 iterations.
> >
>
> commit 7f551d969037 ("btrfs: free alien device after device add")
> fixed the case we weren't freeing the stale device in the device-add
> context.
>
> However, here I don't understand the thread/context we are missing to free
> the stale device here.
>
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/btrfs/disk-io.c | 2 +-
> > fs/btrfs/volumes.c | 2 +-
> > 2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 5c598e124c25..fa34b8807f8d 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3924,7 +3924,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
> > super = page_address(page);
> > if (btrfs_super_magic(super) != BTRFS_MAGIC) {
> > btrfs_release_disk_super(super);
> > - return ERR_PTR(-ENODATA);
> > + return ERR_PTR(-EINVAL);
> > }
>
> I think you are removing ENODATA because it is going away in the parent
> function. And, we don't reach this condition in the test case btrfs/197.
> Instead, we fail here at line 645 below and, we return -EINVAL;
I'm changing it to be consistent, instead of special casing this one specific
failure, just treat all failures like we need to remove the device, and thus we
can just make this be EINVAL as well.
>
> 645 if (memcmp(device->uuid, disk_super->dev_item.uuid,
> BTRFS_UUID_SIZE))
> 646 goto error_free_page;
> 647
>
> 687 error_free_page:
> 688 btrfs_release_disk_super(disk_super);
> 689 blkdev_put(bdev, flags);
> 690
> 691 return -EINVAL;
>
> function stack carries the return code to the parent open_fs_devices().
>
> open_fs_devices()
> btrfs_open_one_device()
> btrfs_get_bdev_and_sb()
> btrfs_read_dev_super()
> btrfs_read_dev_one_super()
>
>
>
> > if (btrfs_super_bytenr(super) != bytenr_orig) {
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index f38c230111be..890153dd2a2b 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -1223,7 +1223,7 @@ static int open_fs_devices(struct btrfs_fs_devices *fs_devices,
> > if (ret == 0 &&
> > (!latest_dev || device->generation > latest_dev->generation)) {
> > latest_dev = device;
> > - } else if (ret == -ENODATA) {
> > + } else if (ret) {
> > fs_devices->num_devices--;
> > list_del(&device->dev_list);
> > btrfs_free_device(device);
> >
>
> There are many other cases, for example including -EACCES (from
> blkdev_get_by_path()) (I haven't checked other error codes). For which,
> calling btrfs_free_device() is inappropriate.
Yeah I was a little fuzzy on this. I think *any* failure should mean that we
remove the device from the fs_devices tho right? So that we show we're missing
a device, since we can't actually access it? I'm actually asking, because I
think we can go either way, but to me I think any failure sure result in the
removal of the device so we can re-scan the correct one. Thanks,
Josef
next prev parent reply other threads:[~2021-12-02 16:02 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-01 21:18 [PATCH] btrfs: free device if we fail to open it Josef Bacik
2021-12-02 7:09 ` Anand Jain
2021-12-02 16:02 ` Josef Bacik [this message]
2021-12-03 8:31 ` Anand Jain
2021-12-03 15:08 ` Josef Bacik
2021-12-06 14:32 ` Anand Jain
2021-12-06 19:16 ` Josef Bacik
2021-12-06 22:23 ` Anand Jain
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YajuCbMN4H0Ap76V@localhost.localdomain \
--to=josef@toxicpanda.com \
--cc=anand.jain@oracle.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox