new scrub code vs zoned file systems

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* new scrub code vs zoned file systems
@ 2023-05-31 12:52 Christoph Hellwig
  2023-05-31 13:10 ` Johannes Thumshirn
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-05-31 12:52 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Naohiro Aota, Johannes Thumshirn, linux-btrfs

Hi Qu,

btrfs/163 got really unhappy on zoned devices with the new scrub code,
as it tries to rewrite data in place using btrfs_submit_repair_write,
which is a regression compared to say Linux 6.3.

This log is when running on a SCRATCH_POOL using qemu emulated ZNS
drives:

[   53.359190] BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
[   53.360646] BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
[   53.691003] nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR 
[   53.694996] I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
[   53.694996] BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[   53.694996] BUG: kernel NULL pointer dereference, address: 00000000000000a8
[   53.694996] #PF: supervisor write access in kernel mode
[   53.694996] #PF: error_code(0x0002) - not-present page
[   53.694996] PGD 0 P4D 0 
[   53.694996] Oops: 0002 [#1] PREEMPT SMP NOPTI
[   53.694996] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.4.0-rc2+ #1190
[   53.694996] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   53.694996] RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
[   53.694996] Code: c2 fe c3 66 0f 1f 84 00 00 00 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 21 e7 c4 7d 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 10 03 00 00 48 89 d8 5b
[   53.694996] RSP: 0018:ffffc900000c4e00 EFLAGS: 00010046
[   53.694996] RAX: 0000000000000000 RBX: 0000000000000046 RCX: 0000000000000027
[   53.694996] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000000000a8
[   53.694996] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000003
[   53.694996] R10: ffffffff83076000 R11: ffffffff83076000 R12: ffff88810adecc00
[   53.694996] R13: 00000000000000a8 R14: 0000000000004000 R15: 0000000000004000
[   53.694996] FS:  0000000000000000(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
[   53.694996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   53.694996] CR2: 00000000000000a8 CR3: 0000000103544000 CR4: 0000000000750ee0
[   53.694996] PKRU: 55555554
[   53.694996] Call Trace:
[   53.694996]  <IRQ>
[   53.694996]  btrfs_lookup_ordered_extent+0x31/0x190
[   53.694996]  btrfs_record_physical_zoned+0x18/0x40
[   53.694996]  btrfs_simple_end_io+0xaf/0xc0
[   53.694996]  blk_update_request+0x153/0x4c0
[   53.694996]  blk_mq_end_request+0x15/0xd0
[   53.694996]  nvme_poll_cq+0x1d3/0x360
[   53.694996]  nvme_irq+0x39/0x80
[   53.694996]  __handle_irq_event_percpu+0x3b/0x190
[   53.694996]  handle_irq_event+0x2f/0x70
[   53.694996]  handle_edge_irq+0x7c/0x210
[   53.694996]  __common_interrupt+0x34/0xa0
[   53.694996]  common_interrupt+0x7d/0xa0
[   53.694996]  </IRQ>
[   53.694996]  <TASK>
[   53.694996]  asm_common_interrupt+0x22/0x40


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 12:52 new scrub code vs zoned file systems Christoph Hellwig
@ 2023-05-31 13:10 ` Johannes Thumshirn
  2023-05-31 13:20   ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2023-05-31 13:10 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo; +Cc: Naohiro Aota, linux-btrfs@vger.kernel.org

On 31.05.23 14:52, Christoph Hellwig wrote:
> Hi Qu,
> 
> btrfs/163 got really unhappy on zoned devices with the new scrub code,
> as it tries to rewrite data in place using btrfs_submit_repair_write,
> which is a regression compared to say Linux 6.3.
> 
> This log is when running on a SCRATCH_POOL using qemu emulated ZNS
> drives:
> 

I've just hit this as well and wanted to dig into it as I thought it's 
something I screwed up on my private branch.

So it looks like we're calling btrfs_lookup_ordered_extent() with a NULL
inode.

This actually makes sense as the current scrub code does not have an inode
in the bbio so:

btrfs_simple_end_io(bio)
`-> btrfs_record_physical_zoned(btrfs_bio(bio))
    `-> btrfs_lookup_ordered_extent(bbio->inode, ...)
        `-> tree = &inode->ordered_tree;
            spin_lock_irqsave(&tree->lock, flags); <--- BOOM

We don't really need the inode in the zoned code, but the ordered_extent.

I've just quickly skimmed over "add an ordered_extent pointer to struct 
btrfs_bio v2" but didn't find anything that adds it for scrub writes as well.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 13:10 ` Johannes Thumshirn
@ 2023-05-31 13:20   ` Christoph Hellwig
  2023-05-31 13:25     ` Johannes Thumshirn
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-05-31 13:20 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Qu Wenruo, Naohiro Aota,
	linux-btrfs@vger.kernel.org

On Wed, May 31, 2023 at 01:10:55PM +0000, Johannes Thumshirn wrote:
> So it looks like we're calling btrfs_lookup_ordered_extent() with a NULL
> inode.
> 
> This actually makes sense as the current scrub code does not have an inode
> in the bbio so:
> 
> btrfs_simple_end_io(bio)
> `-> btrfs_record_physical_zoned(btrfs_bio(bio))
>     `-> btrfs_lookup_ordered_extent(bbio->inode, ...)
>         `-> tree = &inode->ordered_tree;
>             spin_lock_irqsave(&tree->lock, flags); <--- BOOM
> 
> We don't really need the inode in the zoned code, but the ordered_extent.
> 
> I've just quickly skimmed over "add an ordered_extent pointer to struct 
> btrfs_bio v2" but didn't find anything that adds it for scrub writes as well.

That is correct, but as far as I can tell it is just the symptom.

The underlying issue is that the scrub code has no zone awareness at
all, and just tries to rewrite sectors in place.  The old code OTOH
tried to always migrate the entire BG (aka zone).

> 
> 
---end quoted text---

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 13:20   ` Christoph Hellwig
@ 2023-05-31 13:25     ` Johannes Thumshirn
  2023-05-31 13:30       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Thumshirn @ 2023-05-31 13:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, Naohiro Aota, linux-btrfs@vger.kernel.org

On 31.05.23 15:20, Christoph Hellwig wrote:
> On Wed, May 31, 2023 at 01:10:55PM +0000, Johannes Thumshirn wrote:
>> So it looks like we're calling btrfs_lookup_ordered_extent() with a NULL
>> inode.
>>
>> This actually makes sense as the current scrub code does not have an inode
>> in the bbio so:
>>
>> btrfs_simple_end_io(bio)
>> `-> btrfs_record_physical_zoned(btrfs_bio(bio))
>>     `-> btrfs_lookup_ordered_extent(bbio->inode, ...)
>>         `-> tree = &inode->ordered_tree;
>>             spin_lock_irqsave(&tree->lock, flags); <--- BOOM
>>
>> We don't really need the inode in the zoned code, but the ordered_extent.
>>
>> I've just quickly skimmed over "add an ordered_extent pointer to struct 
>> btrfs_bio v2" but didn't find anything that adds it for scrub writes as well.
> 
> That is correct, but as far as I can tell it is just the symptom.
> 
> The underlying issue is that the scrub code has no zone awareness at
> all, and just tries to rewrite sectors in place.  The old code OTOH
> tried to always migrate the entire BG (aka zone).
> 

Hmm at least flush_scrub_stripes() should not go into the simple write 
path at all:

	[...]
        /*                                                                      
         * Submit the repaired sectors.  For zoned case, we cannot do repair    
         * in-place, but queue the bg to be relocated.                          
         */                                                                     
        if (btrfs_is_zoned(fs_info)) {                                          
                for (int i = 0; i < nr_stripes; i++) {                          
                        stripe = &sctx->stripes[i];                             
                                                                                
                        if (!bitmap_empty(&stripe->error_bitmap, stripe->nr_sectors)) {
                                btrfs_repair_one_zone(fs_info,                  
                                                      sctx->stripes[0].bg->start);
                                break;                                          
                        }
                }
	} else {
                for (int i = 0; i < nr_stripes; i++) {
                        unsigned long repaired;                                 
                                                                                
                        stripe = &sctx->stripes[i];                             
                                                                                
                        bitmap_andnot(&repaired, &stripe->init_error_bitmap,    
                                      &stripe->error_bitmap, stripe->nr_sectors);
                        scrub_write_sectors(sctx, stripe, repaired, false);     
                }                                                               
        }                                                                   


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 13:25     ` Johannes Thumshirn
@ 2023-05-31 13:30       ` Christoph Hellwig
  2023-05-31 14:04         ` Johannes Thumshirn
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-05-31 13:30 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Qu Wenruo, Naohiro Aota,
	linux-btrfs@vger.kernel.org

On Wed, May 31, 2023 at 01:25:14PM +0000, Johannes Thumshirn wrote:
> Hmm at least flush_scrub_stripes() should not go into the simple write 
> path at all:

Except for the dev-replace case, which seems to trigger this
write.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 13:30       ` Christoph Hellwig
@ 2023-05-31 14:04         ` Johannes Thumshirn
  2023-05-31 14:17           ` Christoph Hellwig
  2023-05-31 22:25           ` Qu Wenruo
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Thumshirn @ 2023-05-31 14:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Qu Wenruo, Naohiro Aota, linux-btrfs@vger.kernel.org

On 31.05.23 15:31, Christoph Hellwig wrote:
> On Wed, May 31, 2023 at 01:25:14PM +0000, Johannes Thumshirn wrote:
>> Hmm at least flush_scrub_stripes() should not go into the simple write 
>> path at all:
> 
> Except for the dev-replace case, which seems to trigger this
> write.
> 

Heh and this has never actually worked IMHO.

I did a crude hack to bandaid scrub:
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index d7d8faf1978a..b20115bd0675 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1709,9 +1709,20 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx)
 
                        ASSERT(stripe->dev == fs_info->dev_replace.srcdev);
 
-                       bitmap_andnot(&good, &stripe->extent_sector_bitmap,
-                                     &stripe->error_bitmap, stripe->nr_sectors);
-                       scrub_write_sectors(sctx, stripe, good, true);
+                       if (btrfs_is_zoned(fs_info)) {
+                               if (!bitmap_empty(&stripe->extent_sector_bitmap,
+                                                 stripe->nr_sectors)) {
+                                       btrfs_repair_one_zone(fs_info,
+                                                             sctx->stripes[0].bg->start);
+                                       break;
+                               }
+                       } else {
+                               bitmap_andnot(&good,
+                                             &stripe->extent_sector_bitmap,
+                                             &stripe->error_bitmap,
+                                             stripe->nr_sectors);
+                               scrub_write_sectors(sctx, stripe, good, true);
+                       }
                }
        }
 


But then it doesn't work as well because:

static int relocating_repair_kthread(void *data)                                                                                                                                     
{                                                                                                                                                                                    
	[...]                                                                                 
        sb_start_write(fs_info->sb);                                                                                                                                                 
        if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {                                                                                                                    
                btrfs_info(fs_info,                                                                                                                                                  
                           "zoned: skip relocating block group %llu to repair: EBUSY",                                                                                               
                           target);                                                                                                                                                  
                sb_end_write(fs_info->sb);                                                                                                                                           
                return -EBUSY;                                                                                                                                                       
      
That will always fail, because in the case of dev-replace we already have
BTRFS_EXCLOP_DEV_REPLACE set.

I've just spotted btrfs_exclop_start_try_lock(), that could solve our problem 
here.

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 14:04         ` Johannes Thumshirn
@ 2023-05-31 14:17           ` Christoph Hellwig
  2023-06-01  2:09             ` Qu Wenruo
  2023-05-31 22:25           ` Qu Wenruo
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-05-31 14:17 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Christoph Hellwig, Qu Wenruo, Naohiro Aota,
	linux-btrfs@vger.kernel.org

On Wed, May 31, 2023 at 02:04:05PM +0000, Johannes Thumshirn wrote:
> 
> Heh and this has never actually worked IMHO.
> 
> I did a crude hack to bandaid scrub:

I think the better approach is to:

 a) branch out at a very high level to the zoned code in
    flush_scrub_stripes, or in fact even higher given that we
    don't really care about tracking stripes.  The write
    side of scrub has to work at a zone, not stripe level for
    zoned devices
 b) don't create a new relocation thread per zone, but run it from
    the scrub context.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 14:04         ` Johannes Thumshirn
  2023-05-31 14:17           ` Christoph Hellwig
@ 2023-05-31 22:25           ` Qu Wenruo
  2023-05-31 22:48             ` Qu Wenruo
  2023-06-01  4:53             ` Christoph Hellwig
  1 sibling, 2 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-05-31 22:25 UTC (permalink / raw)
  To: Johannes Thumshirn, Christoph Hellwig
  Cc: Naohiro Aota, linux-btrfs@vger.kernel.org



On 2023/5/31 22:04, Johannes Thumshirn wrote:
> On 31.05.23 15:31, Christoph Hellwig wrote:
>> On Wed, May 31, 2023 at 01:25:14PM +0000, Johannes Thumshirn wrote:
>>> Hmm at least flush_scrub_stripes() should not go into the simple write
>>> path at all:
>>
>> Except for the dev-replace case, which seems to trigger this
>> write.
>>
>
> Heh and this has never actually worked IMHO.
>
> I did a crude hack to bandaid scrub:
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index d7d8faf1978a..b20115bd0675 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1709,9 +1709,20 @@ static int flush_scrub_stripes(struct scrub_ctx *sctx)
>
>                          ASSERT(stripe->dev == fs_info->dev_replace.srcdev);
>
> -                       bitmap_andnot(&good, &stripe->extent_sector_bitmap,
> -                                     &stripe->error_bitmap, stripe->nr_sectors);
> -                       scrub_write_sectors(sctx, stripe, good, true);
> +                       if (btrfs_is_zoned(fs_info)) {
> +                               if (!bitmap_empty(&stripe->extent_sector_bitmap,
> +                                                 stripe->nr_sectors)) {
> +                                       btrfs_repair_one_zone(fs_info,
> +                                                             sctx->stripes[0].bg->start);
> +                                       break;

This doesn't look good, is this a hack to use repair to do the dev-replace?

> +                               }
> +                       } else {
> +                               bitmap_andnot(&good,
> +                                             &stripe->extent_sector_bitmap,
> +                                             &stripe->error_bitmap,
> +                                             stripe->nr_sectors);
> +                               scrub_write_sectors(sctx, stripe, good, true);
> +                       }
>                  }
>          }
>
>
>
> But then it doesn't work as well because:
>
> static int relocating_repair_kthread(void *data)
> {
> 	[...]
>          sb_start_write(fs_info->sb);
>          if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
>                  btrfs_info(fs_info,
>                             "zoned: skip relocating block group %llu to repair: EBUSY",
>                             target);
>                  sb_end_write(fs_info->sb);
>                  return -EBUSY;
>
> That will always fail, because in the case of dev-replace we already have
> BTRFS_EXCLOP_DEV_REPLACE set.
>
> I've just spotted btrfs_exclop_start_try_lock(), that could solve our problem
> here.

To me, the problem can be solved in a much simpler way, if it's
dev-replace for zoned device, let's write the whole stripe to the target
device, and wait for it.

For the btrfs_record_physical_zoned(), we can skip the OE things if
bbio::inode is NULL.

Would the following change solves the problem?

Thanks,
Qu

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index d7d8faf1978a..3fa480cd905e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1709,8 +1709,15 @@ static int flush_scrub_stripes(struct scrub_ctx
*sctx)

                         ASSERT(stripe->dev == fs_info->dev_replace.srcdev);

-                       bitmap_andnot(&good, &stripe->extent_sector_bitmap,
-                                     &stripe->error_bitmap,
stripe->nr_sectors);
+                       if (btrfs_is_zoned(fs_info))
+                               /*
+                                * For zoned case, we need to write the
whole
+                                * stripe back, no gaps allowed.
+                                */
+                               bitmap_set(&good, 0, stripe->nr_sectors);
+                       else
+                               bitmap_andnot(&good,
&stripe->extent_sector_bitmap,
+                                             &stripe->error_bitmap,
stripe->nr_sectors);
                         scrub_write_sectors(sctx, stripe, good, true);
                 }
         }
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 98d6b8cc3874..cced6aeff8d7 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1659,6 +1659,13 @@ void btrfs_record_physical_zoned(struct btrfs_bio
*bbio)
         const u64 physical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
         struct btrfs_ordered_extent *ordered;

+       /*
+        * For scrub case we have no inode, and doesn't need to bother
ordered
+        * extents.
+        */
+       if (!bbio->inode)
+               return;
+
         ordered = btrfs_lookup_ordered_extent(bbio->inode,
bbio->file_offset);
         if (WARN_ON(!ordered))
                 return;

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 22:25           ` Qu Wenruo
@ 2023-05-31 22:48             ` Qu Wenruo
  2023-06-01  4:53             ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-05-31 22:48 UTC (permalink / raw)
  To: Johannes Thumshirn, Christoph Hellwig
  Cc: Naohiro Aota, linux-btrfs@vger.kernel.org



On 2023/6/1 06:25, Qu Wenruo wrote:
> 
> 
> On 2023/5/31 22:04, Johannes Thumshirn wrote:
>> On 31.05.23 15:31, Christoph Hellwig wrote:
>>> On Wed, May 31, 2023 at 01:25:14PM +0000, Johannes Thumshirn wrote:
>>>> Hmm at least flush_scrub_stripes() should not go into the simple write
>>>> path at all:
>>>
>>> Except for the dev-replace case, which seems to trigger this
>>> write.
>>>
>>
>> Heh and this has never actually worked IMHO.
>>
>> I did a crude hack to bandaid scrub:
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index d7d8faf1978a..b20115bd0675 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -1709,9 +1709,20 @@ static int flush_scrub_stripes(struct scrub_ctx 
>> *sctx)
>>
>>                          ASSERT(stripe->dev == 
>> fs_info->dev_replace.srcdev);
>>
>> -                       bitmap_andnot(&good, 
>> &stripe->extent_sector_bitmap,
>> -                                     &stripe->error_bitmap, 
>> stripe->nr_sectors);
>> -                       scrub_write_sectors(sctx, stripe, good, true);
>> +                       if (btrfs_is_zoned(fs_info)) {
>> +                               if 
>> (!bitmap_empty(&stripe->extent_sector_bitmap,
>> +                                                 stripe->nr_sectors)) {
>> +                                       btrfs_repair_one_zone(fs_info,
>> +                                                             
>> sctx->stripes[0].bg->start);
>> +                                       break;
> 
> This doesn't look good, is this a hack to use repair to do the dev-replace?
> 
>> +                               }
>> +                       } else {
>> +                               bitmap_andnot(&good,
>> +                                             
>> &stripe->extent_sector_bitmap,
>> +                                             &stripe->error_bitmap,
>> +                                             stripe->nr_sectors);
>> +                               scrub_write_sectors(sctx, stripe, 
>> good, true);
>> +                       }
>>                  }
>>          }
>>
>>
>>
>> But then it doesn't work as well because:
>>
>> static int relocating_repair_kthread(void *data)
>> {
>>     [...]
>>          sb_start_write(fs_info->sb);
>>          if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) {
>>                  btrfs_info(fs_info,
>>                             "zoned: skip relocating block group %llu 
>> to repair: EBUSY",
>>                             target);
>>                  sb_end_write(fs_info->sb);
>>                  return -EBUSY;
>>
>> That will always fail, because in the case of dev-replace we already have
>> BTRFS_EXCLOP_DEV_REPLACE set.
>>
>> I've just spotted btrfs_exclop_start_try_lock(), that could solve our 
>> problem
>> here.
> 
> To me, the problem can be solved in a much simpler way, if it's
> dev-replace for zoned device, let's write the whole stripe to the target
> device, and wait for it.
> 
> For the btrfs_record_physical_zoned(), we can skip the OE things if
> bbio::inode is NULL.
> 
> Would the following change solves the problem?
> 
> Thanks,
> Qu
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index d7d8faf1978a..3fa480cd905e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1709,8 +1709,15 @@ static int flush_scrub_stripes(struct scrub_ctx
> *sctx)
> 
>                          ASSERT(stripe->dev == 
> fs_info->dev_replace.srcdev);
> 
> -                       bitmap_andnot(&good, &stripe->extent_sector_bitmap,
> -                                     &stripe->error_bitmap,
> stripe->nr_sectors);
> +                       if (btrfs_is_zoned(fs_info))
> +                               /*
> +                                * For zoned case, we need to write the
> whole
> +                                * stripe back, no gaps allowed.
> +                                */
> +                               bitmap_set(&good, 0, stripe->nr_sectors);

In fact this is not even needed.

The scrub_write_sectors() already have all the fill_writer_pointer_gap() 
calls for the block group.

Meaning we can pass the existing @good bitmap, even with gaps, and 
scrub_write_sectors() would handle it properly.
Only the NULL inode pointer check is needed.

Thus the initial problem of that crash is really not about the write gaps.

[   53.691003] nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 
blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
[   53.694996] I/O error, dev nvme3n1, sector 786432 op 
0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2

Any clue on why the target zone is full during replace?

Thanks,
Qu
> +                       else
> +                               bitmap_andnot(&good,
> &stripe->extent_sector_bitmap,
> +                                             &stripe->error_bitmap,
> stripe->nr_sectors);
>                          scrub_write_sectors(sctx, stripe, good, true);
>                  }
>          }
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 98d6b8cc3874..cced6aeff8d7 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1659,6 +1659,13 @@ void btrfs_record_physical_zoned(struct btrfs_bio
> *bbio)
>          const u64 physical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
>          struct btrfs_ordered_extent *ordered;
> 
> +       /*
> +        * For scrub case we have no inode, and doesn't need to bother
> ordered
> +        * extents.
> +        */
> +       if (!bbio->inode)
> +               return;
> +
>          ordered = btrfs_lookup_ordered_extent(bbio->inode,
> bbio->file_offset);
>          if (WARN_ON(!ordered))
>                  return;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 14:17           ` Christoph Hellwig
@ 2023-06-01  2:09             ` Qu Wenruo
  2023-06-01  4:40               ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2023-06-01  2:09 UTC (permalink / raw)
  To: Christoph Hellwig, Johannes Thumshirn
  Cc: Naohiro Aota, linux-btrfs@vger.kernel.org

On 2023/5/31 22:17, Christoph Hellwig wrote:
> On Wed, May 31, 2023 at 02:04:05PM +0000, Johannes Thumshirn wrote:
>>
>> Heh and this has never actually worked IMHO.
>>
>> I did a crude hack to bandaid scrub:
>
> I think the better approach is to:
>
>   a) branch out at a very high level to the zoned code in
>      flush_scrub_stripes, or in fact even higher given that we
>      don't really care about tracking stripes.  The write
>      side of scrub has to work at a zone, not stripe level for
>      zoned devices

So far the various wrapper around the write operations work as expected,
and hide the detailed well enough that most of us didn't even notice.

E.g. all the zoned code is already handled in scrub_write_sectors().

The crash itself is caused by the fact that end io part is relying on
the inode pointer, that itself is a simple fix.

But I'm more concerned about why we have a full zone before that crash.

>   b) don't create a new relocation thread per zone, but run it from
>      the scrub context.
>

That's a little too complex, the problem is that relocation is a
completely different beast, too different from the scrub code.

But I agree the repair part for zoned needs some rework, it's not
working from the day 1 of zoned support, but shouldn't need that a huge
change.

E.g. we just record that we need to relocate the bg, then after the
scrub of that bg is fully finished, queue a relocation for it.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  2:09             ` Qu Wenruo
@ 2023-06-01  4:40               ` Christoph Hellwig
  2023-06-01  5:00                 ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01  4:40 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Johannes Thumshirn, Naohiro Aota,
	linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 10:09:24AM +0800, Qu Wenruo wrote:
> So far the various wrapper around the write operations work as expected,
> and hide the detailed well enough that most of us didn't even notice.
>
> E.g. all the zoned code is already handled in scrub_write_sectors().
>
> The crash itself is caused by the fact that end io part is relying on
> the inode pointer, that itself is a simple fix.

But the reason why it is relying on the inode pointer is that it needs
to record the actual written LBA after I/O completion.  So it's not
just a case of just add a NULL check, it needs a way to adjust the
logical to physical mapping from the dummy added before the I/O.

> But I'm more concerned about why we have a full zone before that crash.

I think this is happening because we can't account for the zone filling
without the proper context.

>>   b) don't create a new relocation thread per zone, but run it from
>>      the scrub context.
>>
>
> That's a little too complex, the problem is that relocation is a
> completely different beast, too different from the scrub code.
>
> But I agree the repair part for zoned needs some rework, it's not
> working from the day 1 of zoned support, but shouldn't need that a huge
> change.
>
> E.g. we just record that we need to relocate the bg, then after the
> scrub of that bg is fully finished, queue a relocation for it.

Yes.  That's what the read repair already does, and also the scrub
code, although in a somewhat sub-optimal way.

>
> Thanks,
> Qu
---end quoted text---

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-05-31 22:25           ` Qu Wenruo
  2023-05-31 22:48             ` Qu Wenruo
@ 2023-06-01  4:53             ` Christoph Hellwig
  2023-06-01  5:04               ` Qu Wenruo
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01  4:53 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Johannes Thumshirn, Christoph Hellwig, Naohiro Aota,
	linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 06:25:59AM +0800, Qu Wenruo wrote:
> To me, the problem can be solved in a much simpler way, if it's
> dev-replace for zoned device, let's write the whole stripe to the target
> device, and wait for it.
>
> For the btrfs_record_physical_zoned(), we can skip the OE things if
> bbio::inode is NULL.
>
> Would the following change solves the problem?

It can't, as we need to record the actually written location.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  4:40               ` Christoph Hellwig
@ 2023-06-01  5:00                 ` Qu Wenruo
  2023-06-01  5:17                   ` Naohiro Aota
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2023-06-01  5:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs@vger.kernel.org



On 2023/6/1 12:40, Christoph Hellwig wrote:
> On Thu, Jun 01, 2023 at 10:09:24AM +0800, Qu Wenruo wrote:
>> So far the various wrapper around the write operations work as expected,
>> and hide the detailed well enough that most of us didn't even notice.
>>
>> E.g. all the zoned code is already handled in scrub_write_sectors().
>>
>> The crash itself is caused by the fact that end io part is relying on
>> the inode pointer, that itself is a simple fix.
>
> But the reason why it is relying on the inode pointer is that it needs
> to record the actual written LBA after I/O completion.  So it's not
> just a case of just add a NULL check, it needs a way to adjust the
> logical to physical mapping from the dummy added before the I/O.

That's all handled by scrub.

For scrub we're doing the writes just like metadata, with QD=1, aka,
always write and wait (and know where the write would land), and for the
gaps we would call fill_writer_pointer_gap() to fill them.

Thus we don't need to do any adjustment (unless you're talking about
RST, but I believe that's a different beast).

>
>> But I'm more concerned about why we have a full zone before that crash.
>
> I think this is happening because we can't account for the zone filling
> without the proper context.

I believe it's a different problem, maybe some de-sync between scrub
write_pointer and the real zoned pointer inside the device.

My current guess is, the target zone inside the target device is not
properly reset before dev-replace.

Thanks,
Qu

>
>>>    b) don't create a new relocation thread per zone, but run it from
>>>       the scrub context.
>>>
>>
>> That's a little too complex, the problem is that relocation is a
>> completely different beast, too different from the scrub code.
>>
>> But I agree the repair part for zoned needs some rework, it's not
>> working from the day 1 of zoned support, but shouldn't need that a huge
>> change.
>>
>> E.g. we just record that we need to relocate the bg, then after the
>> scrub of that bg is fully finished, queue a relocation for it.
>
> Yes.  That's what the read repair already does, and also the scrub
> code, although in a somewhat sub-optimal way.
>
>>
>> Thanks,
>> Qu
> ---end quoted text---

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  4:53             ` Christoph Hellwig
@ 2023-06-01  5:04               ` Qu Wenruo
  0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-06-01  5:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Johannes Thumshirn, Naohiro Aota, linux-btrfs@vger.kernel.org

On 2023/6/1 12:53, Christoph Hellwig wrote:
> On Thu, Jun 01, 2023 at 06:25:59AM +0800, Qu Wenruo wrote:
>> To me, the problem can be solved in a much simpler way, if it's
>> dev-replace for zoned device, let's write the whole stripe to the target
>> device, and wait for it.
>>
>> For the btrfs_record_physical_zoned(), we can skip the OE things if
>> bbio::inode is NULL.
>>
>> Would the following change solves the problem?
>
> It can't, as we need to record the actually written location.

We don't need, it's scrub, the target block group is marked RO, and we
won't duplicate writes for zoned devices (the FLAG_TO_COPY thing), thus
only scrub can write into the zone.

Furthermore, in the context of scrub, we have always do write and wait
(QD=1, just like metadata writes), and fill the gaps using
fill_writer_pointer_gap() when needed.

In fact, if there is something so basically wrong, we should have all
the replace tests before that b/169 failed.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  5:00                 ` Qu Wenruo
@ 2023-06-01  5:17                   ` Naohiro Aota
  2023-06-01  5:21                     ` Naohiro Aota
                                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Naohiro Aota @ 2023-06-01  5:17 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Christoph Hellwig, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 01:00:40PM +0800, Qu Wenruo wrote:
> 
> 
> On 2023/6/1 12:40, Christoph Hellwig wrote:
> > On Thu, Jun 01, 2023 at 10:09:24AM +0800, Qu Wenruo wrote:
> > > So far the various wrapper around the write operations work as expected,
> > > and hide the detailed well enough that most of us didn't even notice.
> > > 
> > > E.g. all the zoned code is already handled in scrub_write_sectors().
> > > 
> > > The crash itself is caused by the fact that end io part is relying on
> > > the inode pointer, that itself is a simple fix.
> > 
> > But the reason why it is relying on the inode pointer is that it needs
> > to record the actual written LBA after I/O completion.  So it's not
> > just a case of just add a NULL check, it needs a way to adjust the
> > logical to physical mapping from the dummy added before the I/O.
> 
> That's all handled by scrub.
> 
> For scrub we're doing the writes just like metadata, with QD=1, aka,
> always write and wait (and know where the write would land), and for the
> gaps we would call fill_writer_pointer_gap() to fill them.
> 
> Thus we don't need to do any adjustment (unless you're talking about
> RST, but I believe that's a different beast).

True. For the dev_replace, we need to place the moved data at the same
address on the destination device as the source device. Thus, we need to
use WRITE command to ensure that.

So, calling into the record_physical function looks strange to me. It
misses some condition to use ZONE_APPEND?

> > 
> > > But I'm more concerned about why we have a full zone before that crash.
> > 
> > I think this is happening because we can't account for the zone filling
> > without the proper context.
> 
> I believe it's a different problem, maybe some de-sync between scrub
> write_pointer and the real zoned pointer inside the device.
> 
> My current guess is, the target zone inside the target device is not
> properly reset before dev-replace.

This must be a different issue. Are we choosing that zone for zone finish
to free the active zone resource?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  5:17                   ` Naohiro Aota
@ 2023-06-01  5:21                     ` Naohiro Aota
  2023-06-01  7:21                       ` Qu Wenruo
  2023-06-01  5:22                     ` Christoph Hellwig
  2023-06-01  5:45                     ` Qu Wenruo
  2 siblings, 1 reply; 23+ messages in thread
From: Naohiro Aota @ 2023-06-01  5:21 UTC (permalink / raw)
  To: Christoph Hellwig, Qu Wenruo
  Cc: Johannes Thumshirn, linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 02:17:22PM +0900, Naohiro Aota wrote:
> On Thu, Jun 01, 2023 at 01:00:40PM +0800, Qu Wenruo wrote:
> > > > But I'm more concerned about why we have a full zone before that crash.
> > > 
> > > I think this is happening because we can't account for the zone filling
> > > without the proper context.
> > 
> > I believe it's a different problem, maybe some de-sync between scrub
> > write_pointer and the real zoned pointer inside the device.
> > 
> > My current guess is, the target zone inside the target device is not
> > properly reset before dev-replace.
> 
> This must be a different issue. Are we choosing that zone for zone finish
> to free the active zone resource?

BTW, you may want to use this patch to track zone finishing.

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index dbac203ea54a..5b4ab12368c9 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2015,6 +2015,7 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 		btrfs_dev_clear_active_zone(device, physical);
 	}
 
+	trace_btrfs_zone_finish_block_group(block_group);
 	if (!fully_written)
 		btrfs_dec_block_group_ro(block_group);
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8ea9cea9bfeb..594e4aca0a02 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2057,6 +2057,12 @@ DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
 	TP_ARGS(bg_cache)
 );
 
+DEFINE_EVENT(btrfs__block_group, btrfs_zone_finish_block_group,
+	TP_PROTO(const struct btrfs_block_group *bg_cache),
+
+	TP_ARGS(bg_cache)
+);
+
 TRACE_EVENT(btrfs_set_extent_bit,
 	TP_PROTO(const struct extent_io_tree *tree,
 		 u64 start, u64 len, unsigned set_bits),

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  5:17                   ` Naohiro Aota
  2023-06-01  5:21                     ` Naohiro Aota
@ 2023-06-01  5:22                     ` Christoph Hellwig
  2023-06-01  5:34                       ` Christoph Hellwig
  2023-06-01  5:45                     ` Qu Wenruo
  2 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01  5:22 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Qu Wenruo, Christoph Hellwig, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 05:17:23AM +0000, Naohiro Aota wrote:
> > Thus we don't need to do any adjustment (unless you're talking about
> > RST, but I believe that's a different beast).
> 
> True. For the dev_replace, we need to place the moved data at the same
> address on the destination device as the source device. Thus, we need to
> use WRITE command to ensure that.
> 
> So, calling into the record_physical function looks strange to me. It
> misses some condition to use ZONE_APPEND?

ch@brick:~/work/linux$ git-grep -C1 btrfs_record_physical_zoned fs/btrfs/bio.c
fs/btrfs/bio.c-         if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
fs/btrfs/bio.c:                 btrfs_record_physical_zoned(bbio);
fs/btrfs/bio.c-

nope.  We are doing a zone append write here.

That being said in latest misc-next btrfs_record_physical_zoned stops
lookin at bbio->inode, so the crash part is gone.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  5:22                     ` Christoph Hellwig
@ 2023-06-01  5:34                       ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01  5:34 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Qu Wenruo, Christoph Hellwig, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 07:22:23AM +0200, Christoph Hellwig wrote:
> ch@brick:~/work/linux$ git-grep -C1 btrfs_record_physical_zoned fs/btrfs/bio.c
> fs/btrfs/bio.c-         if (bio_op(bio) == REQ_OP_ZONE_APPEND && !bio->bi_status)
> fs/btrfs/bio.c:                 btrfs_record_physical_zoned(bbio);
> fs/btrfs/bio.c-
> 
> nope.  We are doing a zone append write here.
> 
> That being said in latest misc-next btrfs_record_physical_zoned stops
> lookin at bbio->inode, so the crash part is gone.

Although btrfs/167 then crashes a little later in
btrfs_record_physical_zoned.  From what I can tell because we get
a bio without the ->sums array.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  5:17                   ` Naohiro Aota
  2023-06-01  5:21                     ` Naohiro Aota
  2023-06-01  5:22                     ` Christoph Hellwig
@ 2023-06-01  5:45                     ` Qu Wenruo
  2023-06-01  5:47                       ` Christoph Hellwig
  2 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2023-06-01  5:45 UTC (permalink / raw)
  To: Naohiro Aota
  Cc: Christoph Hellwig, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org



On 2023/6/1 13:17, Naohiro Aota wrote:
> On Thu, Jun 01, 2023 at 01:00:40PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2023/6/1 12:40, Christoph Hellwig wrote:
>>> On Thu, Jun 01, 2023 at 10:09:24AM +0800, Qu Wenruo wrote:
>>>> So far the various wrapper around the write operations work as expected,
>>>> and hide the detailed well enough that most of us didn't even notice.
>>>>
>>>> E.g. all the zoned code is already handled in scrub_write_sectors().
>>>>
>>>> The crash itself is caused by the fact that end io part is relying on
>>>> the inode pointer, that itself is a simple fix.
>>>
>>> But the reason why it is relying on the inode pointer is that it needs
>>> to record the actual written LBA after I/O completion.  So it's not
>>> just a case of just add a NULL check, it needs a way to adjust the
>>> logical to physical mapping from the dummy added before the I/O.
>>
>> That's all handled by scrub.
>>
>> For scrub we're doing the writes just like metadata, with QD=1, aka,
>> always write and wait (and know where the write would land), and for the
>> gaps we would call fill_writer_pointer_gap() to fill them.
>>
>> Thus we don't need to do any adjustment (unless you're talking about
>> RST, but I believe that's a different beast).
>
> True. For the dev_replace, we need to place the moved data at the same
> address on the destination device as the source device. Thus, we need to
> use WRITE command to ensure that.

Oh, that looks like the cause.

In btrfs_submit_repair_write() we set the bi_opf to ZONE_APPEND instead,
which later would trigger btrfs_record_physical_zoned().

So this means, we should not change the WRITE into ZONE_APPEND for
btrfs_submit_repair_write() for dev-replace case at all.

I stupidly thought zoned device can not accept WRITE command at all but
only ZONE_APPEND.

Let me try it locally first.

Thanks,
Qu
>
> So, calling into the record_physical function looks strange to me. It
> misses some condition to use ZONE_APPEND?
>
>>>
>>>> But I'm more concerned about why we have a full zone before that crash.
>>>
>>> I think this is happening because we can't account for the zone filling
>>> without the proper context.
>>
>> I believe it's a different problem, maybe some de-sync between scrub
>> write_pointer and the real zoned pointer inside the device.
>>
>> My current guess is, the target zone inside the target device is not
>> properly reset before dev-replace.
>
> This must be a different issue. Are we choosing that zone for zone finish
> to free the active zone resource?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  5:45                     ` Qu Wenruo
@ 2023-06-01  5:47                       ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01  5:47 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Naohiro Aota, Christoph Hellwig, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 01:45:49PM +0800, Qu Wenruo wrote:
> Oh, that looks like the cause.
>
> In btrfs_submit_repair_write() we set the bi_opf to ZONE_APPEND instead,
> which later would trigger btrfs_record_physical_zoned().
>
> So this means, we should not change the WRITE into ZONE_APPEND for
> btrfs_submit_repair_write() for dev-replace case at all.
>
> I stupidly thought zoned device can not accept WRITE command at all but
> only ZONE_APPEND.
>
> Let me try it locally first.

I'm already testing with the patch.  It stop us from seeing the
call into btrfs_record_physical_zoned, but device replace on zoned
devices still doesn't work.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  5:21                     ` Naohiro Aota
@ 2023-06-01  7:21                       ` Qu Wenruo
  2023-06-01  7:27                         ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Qu Wenruo @ 2023-06-01  7:21 UTC (permalink / raw)
  To: Naohiro Aota, Christoph Hellwig
  Cc: Johannes Thumshirn, linux-btrfs@vger.kernel.org



On 2023/6/1 13:21, Naohiro Aota wrote:
> On Thu, Jun 01, 2023 at 02:17:22PM +0900, Naohiro Aota wrote:
>> On Thu, Jun 01, 2023 at 01:00:40PM +0800, Qu Wenruo wrote:
>>>>> But I'm more concerned about why we have a full zone before that crash.
>>>>
>>>> I think this is happening because we can't account for the zone filling
>>>> without the proper context.
>>>
>>> I believe it's a different problem, maybe some de-sync between scrub
>>> write_pointer and the real zoned pointer inside the device.
>>>
>>> My current guess is, the target zone inside the target device is not
>>> properly reset before dev-replace.
>>
>> This must be a different issue. Are we choosing that zone for zone finish
>> to free the active zone resource?
>
> BTW, you may want to use this patch to track zone finishing.

Is there any function that I can go to grab the current writer pointer?

The new trace events only output the flags and used bytes, or the used
bytes is the write pointer already?

Thanks,
Qu

>
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index dbac203ea54a..5b4ab12368c9 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -2015,6 +2015,7 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
>   		btrfs_dev_clear_active_zone(device, physical);
>   	}
>
> +	trace_btrfs_zone_finish_block_group(block_group);
>   	if (!fully_written)
>   		btrfs_dec_block_group_ro(block_group);
>
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 8ea9cea9bfeb..594e4aca0a02 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -2057,6 +2057,12 @@ DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group,
>   	TP_ARGS(bg_cache)
>   );
>
> +DEFINE_EVENT(btrfs__block_group, btrfs_zone_finish_block_group,
> +	TP_PROTO(const struct btrfs_block_group *bg_cache),
> +
> +	TP_ARGS(bg_cache)
> +);
> +
>   TRACE_EVENT(btrfs_set_extent_bit,
>   	TP_PROTO(const struct extent_io_tree *tree,
>   		 u64 start, u64 len, unsigned set_bits),

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  7:21                       ` Qu Wenruo
@ 2023-06-01  7:27                         ` Christoph Hellwig
  2023-06-01  8:46                           ` Qu Wenruo
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01  7:27 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Naohiro Aota, Christoph Hellwig, Johannes Thumshirn,
	linux-btrfs@vger.kernel.org

On Thu, Jun 01, 2023 at 03:21:22PM +0800, Qu Wenruo wrote:
> Is there any function that I can go to grab the current writer pointer?

You can get the information using blkdev_report_zones, or the
"blkzoned report" command from user space.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: new scrub code vs zoned file systems
  2023-06-01  7:27                         ` Christoph Hellwig
@ 2023-06-01  8:46                           ` Qu Wenruo
  0 siblings, 0 replies; 23+ messages in thread
From: Qu Wenruo @ 2023-06-01  8:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Naohiro Aota, Johannes Thumshirn, linux-btrfs@vger.kernel.org



On 2023/6/1 15:27, Christoph Hellwig wrote:
> On Thu, Jun 01, 2023 at 03:21:22PM +0800, Qu Wenruo wrote:
>> Is there any function that I can go to grab the current writer pointer?
>
> You can get the information using blkdev_report_zones, or the
> "blkzoned report" command from user space.
>
Several new bugs exposed.

- We need to forward the sctx->write_pointer after the write bio
   finished

- Wrong physical bytenr passed to fill_writer_pointer_gap()
   The 2nd one called in scrub_write_sectors() is passing logical
   address, which is definitely a big NONO.

- Missing gaps for certain bitmap layout
   If we have the following layout for write,

   0     4     8     12    16
   |     |/////|     |/////|

   We should:
   * Fill gap for [0, 4)
   * Submit write for [4, 8)
   * Fill gap for [8, 12)
   * Submit write for [12, 16)

   But we have a wrong check, we only file the gap request before
   submitting the write bio.

   This results:
   * Fill gap for [0, 12)
   * Submit write for [4, 8)
     Obvious the write would fail.
   * Fill gap for [8, 12)
     No op, because the write pointer is already at 12.
   * Submit write for [12, 16)
     Weirdly, this would not fail, as the write pointer is at 12.

In short, the fill_writer_pointer_gap() arguments are a total mess...

I will submit a proper fix soon.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-06-01  8:47 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-31 12:52 new scrub code vs zoned file systems Christoph Hellwig
2023-05-31 13:10 ` Johannes Thumshirn
2023-05-31 13:20   ` Christoph Hellwig
2023-05-31 13:25     ` Johannes Thumshirn
2023-05-31 13:30       ` Christoph Hellwig
2023-05-31 14:04         ` Johannes Thumshirn
2023-05-31 14:17           ` Christoph Hellwig
2023-06-01  2:09             ` Qu Wenruo
2023-06-01  4:40               ` Christoph Hellwig
2023-06-01  5:00                 ` Qu Wenruo
2023-06-01  5:17                   ` Naohiro Aota
2023-06-01  5:21                     ` Naohiro Aota
2023-06-01  7:21                       ` Qu Wenruo
2023-06-01  7:27                         ` Christoph Hellwig
2023-06-01  8:46                           ` Qu Wenruo
2023-06-01  5:22                     ` Christoph Hellwig
2023-06-01  5:34                       ` Christoph Hellwig
2023-06-01  5:45                     ` Qu Wenruo
2023-06-01  5:47                       ` Christoph Hellwig
2023-05-31 22:25           ` Qu Wenruo
2023-05-31 22:48             ` Qu Wenruo
2023-06-01  4:53             ` Christoph Hellwig
2023-06-01  5:04               ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox