Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: David Sterba <dsterba@suse.cz>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v2] btrfs: fix dev-replace after the scrub rework
Date: Thu, 1 Jun 2023 15:11:45 +0200	[thread overview]
Message-ID: <20230601131145.GI32581@twin.jikos.cz> (raw)
In-Reply-To: <61e46bae045ec4e5173874dc81cb178e456644ab.1685616199.git.wqu@suse.com>

On Thu, Jun 01, 2023 at 06:51:34PM +0800, Qu Wenruo wrote:
> [BUG]
> After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
> to scrub_stripe infrastructure"), scrub no longer works for zoned device
> at all.
> 
> Even an empty zoned btrfs can not be replaced:
> 
>  # mkfs.btrfs -f /dev/nvme0n1
>  # mount /dev/nvme0n1 /mnt/btrfs
>  # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
>  Resetting device zones /dev/nvme1n1 (160 zones) ...
>  ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
> 
> And we can hit kernel crash related to that:
> 
>  BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
>  BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
>  nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
>  I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
>  BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
>  BUG: kernel NULL pointer dereference, address: 00000000000000a8
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>  RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
>  Call Trace:
>   <IRQ>
>   btrfs_lookup_ordered_extent+0x31/0x190
>   btrfs_record_physical_zoned+0x18/0x40
>   btrfs_simple_end_io+0xaf/0xc0
>   blk_update_request+0x153/0x4c0
>   blk_mq_end_request+0x15/0xd0
>   nvme_poll_cq+0x1d3/0x360
>   nvme_irq+0x39/0x80
>   __handle_irq_event_percpu+0x3b/0x190
>   handle_irq_event+0x2f/0x70
>   handle_edge_irq+0x7c/0x210
>   __common_interrupt+0x34/0xa0
>   common_interrupt+0x7d/0xa0
>   </IRQ>
>   <TASK>
>   asm_common_interrupt+0x22/0x40
> 
> [CAUSE]
> Dev-replace reuses scrub code to iterate all extents and write the
> existing content back to the new device.
> 
> And for zoned devices, we call fill_writer_pointer_gap() to make sure
> all the writes into the zoned device is sequential, even if there may be
> some gaps between the writes.
> 
> However we have several different bugs all related to zoned dev-replace:
> 
> - We are using ZONE_APPEND operation for metadata style write back
>   For zoned devices, btrfs has two ways to write data:
> 
>   * ZONE_APPEND for data
>     This allows higher queue depth, but will not be able to know where
>     the write would land.
>     Thus needs to grab the real on-disk physical location in it's endio.
> 
>   * WRITE for metadata
>     This requires single queue depth (new writes can only be submitted
>     after previous one finished), and all writes must be sequential.
> 
>   For scrub, we go single queue depth, but still goes with ZONE_APPEND,
>   which requires btrfs_bio::inode being populated.
>   This is the cause of that crash.
> 
> - No correct tracing of write_pointer
>   After a write finished, we should forward sctx->write_pointer, or
>   fill_writer_pointer_gap() would not work properly and cause more
>   than necessary zero out, and fill the whole zone prematurely.
> 
> - Incorrect physical bytenr passed to fill_writer_pointer_gap()
>   In scrub_write_sectors(), one call site passes logical address, which
>   is completely wrong.
> 
>   The other call site passes physical address of current sector, but
>   we should pass the physical address of the btrfs_bio we're submitting.
> 
>   This is the cause of the -EIO errors.
> 
> [FIX]
> - Do not use ZONE_APPEND for btrfs_submit_repair_write().
> 
> - Manually forward sctx->write_pointer after success writeback
> 
> - Use the physical address of the to-be-submitted btrfs_bio for
>   fill_writer_pointer_gap()
> 
> Now zoned device replace would work as expected.
> 
> Reported-by: Christoph Hellwig <hch@lst.de>
> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Concentrace the write bio submission into a dedicated helper
>   This would cleanup the code as the submission part is getting more
>   complex than before.

Added to misc-next, thanks. It took 4 -rcs to notice broken zoned
scrub/dev-replace.

  parent reply	other threads:[~2023-06-01 13:18 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-01 10:51 [PATCH v2] btrfs: fix dev-replace after the scrub rework Qu Wenruo
2023-06-01 10:54 ` Christoph Hellwig
2023-06-01 13:11 ` David Sterba [this message]
2023-06-01 15:01   ` Johannes Thumshirn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230601131145.GI32581@twin.jikos.cz \
    --to=dsterba@suse.cz \
    --cc=hch@lst.de \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox