public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Anand Jain <anand.jain@oracle.com>
To: Qu Wenruo <wqu@suse.com>, linux-btrfs@vger.kernel.org
Cc: Anand Jain <anand.jain@oracle.com>
Subject: Re: [PATCH v2] btrfs: make dev-replace properly follow its read mode
Date: Mon, 20 Feb 2023 15:55:37 +0800	[thread overview]
Message-ID: <b0521a4e-15f9-e6d8-0239-97ad83c3124a@oracle.com> (raw)
In-Reply-To: <8b465081-65f7-4b97-a1bc-3c6b93d3b9c5@suse.com>

On 2/20/23 10:56, Qu Wenruo wrote:
> 
> 
> On 2023/2/19 18:33, Qu Wenruo wrote:
>> [BUG]
>> Although dev replace ioctl has a way to specify the mode on whether we
>> should read from the source device, it's not properly followed.
>>
>>   # mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
>>   # mount $dev1 $mnt
>>   # xfs_io -f -c "pwrite 0 32M" $mnt/file
>>   # sync
>>   # btrfs replace start -r -f 1 $dev3 $mnt
>>
>> And one extra trace is added to scrub_submit(), showing the detail about
>> the bio:
>>
>>             btrfs-1115669 [005] .....  5437.027093: 
>> scrub_submit.part.0: devid=1 logical=22036480 phy=22036480 len=16384
>>             btrfs-1115669 [005] .....  5437.027372: 
>> scrub_submit.part.0: devid=1 logical=30457856 phy=30457856 len=32768
>>             btrfs-1115669 [005] .....  5437.027440: 
>> scrub_submit.part.0: devid=1 logical=30507008 phy=30507008 len=49152
>>             btrfs-1115669 [005] .....  5437.027487: 
>> scrub_submit.part.0: devid=1 logical=30605312 phy=30605312 len=32768
>>             btrfs-1115669 [005] .....  5437.027556: 
>> scrub_submit.part.0: devid=1 logical=30703616 phy=30703616 len=65536
>>             btrfs-1115669 [005] .....  5437.028186: 
>> scrub_submit.part.0: devid=1 logical=298844160 phy=298844160 len=131072
>>             ...
>>             btrfs-1115669 [005] .....  5437.076243: 
>> scrub_submit.part.0: devid=1 logical=322961408 phy=322961408 len=131072
>>             btrfs-1115669 [005] .....  5437.076248: 
>> scrub_submit.part.0: devid=1 logical=323092480 phy=323092480 len=131072
>>
>> One can see that all the read are submitted to devid 1, even we have
>> specified "-r" option to avoid read from the source device.
>>
>> [CAUSE]
>> The dev-replace read mode is only set but not followed by scrub code
>> at all.
>>
>> In fact, only common read path is properly following the read mode,
>> but scrub itself has its own read path, thus not following the mode.
>>
>> [FIX]
>> Here we enhance scrub_find_good_copy() to also follow the read mode.
>>
>> The idea is pretty simple, in the first loop, we avoid the following
>> devices:
>>
>> - Missing devices
>>    This is the existing condition
>>
>> - The source device if the replace wants to avoid it.
>>
>> And if above loop found no candidate (e.g. replace a single device),
>> then we discard the 2nd condition, and try again.
>>
>> Since we're here, also enhance the function scrub_find_good_copy() by:
>>
>> - Remove the forward declaration
>>
>> - Makes it return int
>>    To indicates errors, e.g. no good mirror found.
>>
>> - Add extra error messages
>>
>> Now with the same trace, "btrfs replace start -r" works as expected:
>>
>>             btrfs-1121013 [000] .....  5991.905971: 
>> scrub_submit.part.0: devid=2 logical=22036480 phy=1064960 len=16384
>>             btrfs-1121013 [000] .....  5991.906276: 
>> scrub_submit.part.0: devid=2 logical=30457856 phy=9486336 len=32768
>>             btrfs-1121013 [000] .....  5991.906365: 
>> scrub_submit.part.0: devid=2 logical=30507008 phy=9535488 len=49152
>>             btrfs-1121013 [000] .....  5991.906423: 
>> scrub_submit.part.0: devid=2 logical=30605312 phy=9633792 len=32768
>>             btrfs-1121013 [000] .....  5991.906504: 
>> scrub_submit.part.0: devid=2 logical=30703616 phy=9732096 len=65536
>>             btrfs-1121013 [000] .....  5991.907314: 
>> scrub_submit.part.0: devid=2 logical=298844160 phy=277872640 len=131072
>>             btrfs-1121013 [000] .....  5991.907575: 
>> scrub_submit.part.0: devid=2 logical=298975232 phy=278003712 len=131072
>>             btrfs-1121013 [000] .....  5991.907822: 
>> scrub_submit.part.0: devid=2 logical=299106304 phy=278134784 len=131072
>>             ...
>>             btrfs-1121013 [000] .....  5991.947417: 
>> scrub_submit.part.0: devid=2 logical=318504960 phy=297533440 len=131072
>>             btrfs-1121013 [000] .....  5991.947664: 
>> scrub_submit.part.0: devid=2 logical=318636032 phy=297664512 len=131072
>>             btrfs-1121013 [000] .....  5991.947920: 
>> scrub_submit.part.0: devid=2 logical=318767104 phy=297795584 len=131072
>>
>> Reviewed-by: Anand Jain <anand.jain@oracle.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Rename "replace read policy" to "replace read mode" in comments
>>    This is avoid the confusion with the existing read policy.
>>    No behavior change.
>> ---
>>   fs/btrfs/scrub.c | 131 +++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 97 insertions(+), 34 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index ee3fe6c291fe..4c399a720bf1 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -423,11 +423,6 @@ static int scrub_sectors(struct scrub_ctx *sctx, 
>> u64 logical, u32 len,
>>   static void scrub_bio_end_io(struct bio *bio);
>>   static void scrub_bio_end_io_worker(struct work_struct *work);
>>   static void scrub_block_complete(struct scrub_block *sblock);
>> -static void scrub_find_good_copy(struct btrfs_fs_info *fs_info,
>> -                 u64 extent_logical, u32 extent_len,
>> -                 u64 *extent_physical,
>> -                 struct btrfs_device **extent_dev,
>> -                 int *extent_mirror_num);
>>   static int scrub_add_sector_to_wr_bio(struct scrub_ctx *sctx,
>>                         struct scrub_sector *sector);
>>   static void scrub_wr_submit(struct scrub_ctx *sctx);
>> @@ -2709,6 +2704,93 @@ static int scrub_find_csum(struct scrub_ctx 
>> *sctx, u64 logical, u8 *csum)
>>       return 1;
>>   }
>> +static bool should_use_device(struct btrfs_fs_info *fs_info,
>> +                  struct btrfs_device *dev,
>> +                  bool follow_replace_read_mode)
>> +{
>> +    struct btrfs_device *replace_srcdev = fs_info->dev_replace.srcdev;
>> +    struct btrfs_device *replace_tgtdev = fs_info->dev_replace.tgtdev;
>> +
>> +    if (!dev->bdev)
>> +        return false;
>> +
>> +    /*
>> +     * We're doing scrub/replace, if it's pure scrub, no tgtdev 
>> should be
>> +     * here.
>> +     * If it's replace, we're going to write data to tgtdev, thus the 
>> current
>> +     * data of the tgtdev is all garbage, thus we can not use it at all.
>> +     */
>> +    if (dev == replace_tgtdev)
>> +        return false;
>> +
>> +    /* No need to follow replace read policy, any existing device is 
>> fine. */
>> +    if (!follow_replace_read_mode)
>> +        return true;
>> +
>> +    /* Need to follow the policy. */
>> +    if (fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>> +        BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
>> +        return dev != replace_srcdev;
>> +    return true;
>> +}
>> +static int scrub_find_good_copy(struct btrfs_fs_info *fs_info,
>> +                u64 extent_logical, u32 extent_len,
>> +                u64 *extent_physical,
>> +                struct btrfs_device **extent_dev,
>> +                int *extent_mirror_num)
>> +{
>> +    u64 mapped_length;
>> +    struct btrfs_io_context *bioc = NULL;
>> +    int ret;
>> +    int i;
>> +
>> +    mapped_length = extent_len;
>> +    ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
>> +                  extent_logical, &mapped_length, &bioc, 0);
>> +    if (ret || !bioc || mapped_length < extent_len) {
>> +        btrfs_put_bioc(bioc);
>> +        btrfs_err_rl(fs_info, "btrfs_map_block() failed for logical 
>> %llu: %d",
>> +                extent_logical, ret);
>> +        return -EIO;
>> +    }
>> +
>> +    /*
>> +     * First loop to exclude all missing devices and the source
>> +     * device if needed.
>> +     * And we don't want to use target device as mirror either,
>> +     * as we're doing the replace, the target device range
>> +     * contains nothing.
>> +     */
>> +    for (i = 0; i < bioc->num_stripes - bioc->replace_nr_stripes; i++) {
>> +        struct btrfs_io_stripe *stripe = &bioc->stripes[i];
>> +
>> +        if (!should_use_device(fs_info, stripe->dev, true))
>> +            continue;
>> +        goto found;
>> +    }
>> +    /*
>> +     * We didn't find any alternative mirrors, we have to break our
>> +     * replace read mode, or we can not read at all.
>> +     */
>> +    for (i = 0; i < bioc->num_stripes - bioc->replace_nr_stripes; i++) {
>> +        struct btrfs_io_stripe *stripe = &bioc->stripes[i];
>> +
>> +        if (!should_use_device(fs_info, stripe->dev, false))
>> +            continue;
>> +        goto found;
>> +    }
>> +
>> +    btrfs_err_rl(fs_info, "failed to find any live mirror for logical 
>> %llu",
>> +            extent_logical);
>> +    return -EIO;
>> +
>> +found:
>> +    *extent_physical = bioc->stripes[i].physical;
>> +    *extent_mirror_num = i + 1;
>> +    *extent_dev = bioc->stripes[i].dev;
>> +    btrfs_put_bioc(bioc);
>> +    return 0;
>> +}
>>   /* scrub extent tries to collect up to 64 kB for each bio */
>>   static int scrub_extent(struct scrub_ctx *sctx, struct map_lookup *map,
>>               u64 logical, u32 len,
>> @@ -2746,7 +2828,8 @@ static int scrub_extent(struct scrub_ctx *sctx, 
>> struct map_lookup *map,
>>       }
>>       /*
>> -     * For dev-replace case, we can have @dev being a missing device.
>> +     * For dev-replace case, we can have @dev being a missing device, or
>> +     * we want to avoid read from the source device if possible.
>>        * Regular scrub will avoid its execution on missing device at all,
>>        * as that would trigger tons of read error.
>>        *
>> @@ -2754,9 +2837,14 @@ static int scrub_extent(struct scrub_ctx *sctx, 
>> struct map_lookup *map,
>>        * increase unnecessarily.
>>        * So here we change the read source to a good mirror.
>>        */
>> -    if (sctx->is_dev_replace && !dev->bdev)
>> -        scrub_find_good_copy(sctx->fs_info, logical, len, &src_physical,
>> -                     &src_dev, &src_mirror);
>> +    if (sctx->is_dev_replace &&
>> +        (!dev->bdev || 
>> sctx->fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>> +         BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)) {
> 
> The check condition is not safe for RAID56.
> 
> For RAID56, the scrub_find_good_copy() won't return a good candidate at 
> all.
> 
> Thus unfortunately for RAID56, we won't follow the avoid mode for now.
> The proper way for RAID56 avoid mode is to go rebuild path instead, 
> which is pretty different from the current code base.
> 
> I'll update the patch to exclude the RAID56 mode for now.
> 


Based on the comments found in the only parent function of
scrub_extent()-  scrub_simple_mirror(), this function
stack is not intended for RAID56. I don't understand what you mean here.

Thanks, Anand

> Thanks,
> Qu
> 
>> +        ret = scrub_find_good_copy(sctx->fs_info, logical, len,
>> +                       &src_physical, &src_dev, &src_mirror);
>> +        if (ret < 0)
>> +            return ret;
>> +    }
>>       while (len) {
>>           u32 l = min(len, blocksize);
>>           int have_csum = 0;
>> @@ -4544,28 +4632,3 @@ int btrfs_scrub_progress(struct btrfs_fs_info 
>> *fs_info, u64 devid,
>>       return dev ? (sctx ? 0 : -ENOTCONN) : -ENODEV;
>>   }
>> -
>> -static void scrub_find_good_copy(struct btrfs_fs_info *fs_info,
>> -                 u64 extent_logical, u32 extent_len,
>> -                 u64 *extent_physical,
>> -                 struct btrfs_device **extent_dev,
>> -                 int *extent_mirror_num)
>> -{
>> -    u64 mapped_length;
>> -    struct btrfs_io_context *bioc = NULL;
>> -    int ret;
>> -
>> -    mapped_length = extent_len;
>> -    ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, extent_logical,
>> -                  &mapped_length, &bioc, 0);
>> -    if (ret || !bioc || mapped_length < extent_len ||
>> -        !bioc->stripes[0].dev->bdev) {
>> -        btrfs_put_bioc(bioc);
>> -        return;
>> -    }
>> -
>> -    *extent_physical = bioc->stripes[0].physical;
>> -    *extent_mirror_num = bioc->mirror_num;
>> -    *extent_dev = bioc->stripes[0].dev;
>> -    btrfs_put_bioc(bioc);
>> -}


  reply	other threads:[~2023-02-20  7:56 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-19 10:33 [PATCH v2] btrfs: make dev-replace properly follow its read mode Qu Wenruo
2023-02-20  2:56 ` Qu Wenruo
2023-02-20  7:55   ` Anand Jain [this message]
2023-02-20 11:51     ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b0521a4e-15f9-e6d8-0239-97ad83c3124a@oracle.com \
    --to=anand.jain@oracle.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox