All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Mike Snitzer <snitzer@kernel.org>
Cc: Patrick Plenefisch <simonpatp@gmail.com>,
	Goffredo Baroncelli <kreijack@inwind.it>,
	linux-kernel@vger.kernel.org, Alasdair Kergon <agk@redhat.com>,
	Mikulas Patocka <mpatocka@redhat.com>, Chris Mason <clm@fb.com>,
	Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	regressions@lists.linux.dev, dm-devel@lists.linux.dev,
	linux-btrfs@vger.kernel.org,
	Heinz Mauelshagen <heinzm@redhat.com>
Subject: Re: LVM-on-LVM: error while submitting device barriers
Date: Sun, 10 Mar 2024 23:47:32 +0800	[thread overview]
Message-ID: <Ze3WFGE0je3FivlL@fedora> (raw)
In-Reply-To: <Ze3RWqLvG18cQ4dz@redhat.com>

On Sun, Mar 10, 2024 at 11:27:22AM -0400, Mike Snitzer wrote:
> On Sun, Mar 10 2024 at  7:34P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > > > > On Thu, Feb 29 2024 at  5:05P -0500,
> > > > > Goffredo Baroncelli <kreijack@inwind.it> wrote:
> > > > >
> > > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote:
> > > > > > > >
> > > > > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > > > > > >
> > > > > > > >
> > > > > > > > what about lowerVG/works ?
> > > > > > > >
> > > > > > >
> > > > > > > That one is only on two disks, it doesn't span any gaps
> > > > > >
> > > > > > Sorry, but re-reading the original email I found something that I missed before:
> > > > > >
> > > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > > > >                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > > tolerance is 0 for writable mount
> > > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > > > > failure (errors while submitting device barriers.)
> > > > > >
> > > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs
> > > > > > considers that the disk is missing. The it cannot mount RW the device.
> > > > > >
> > > > > > I would investigate with the LVM developers, if it properly passes
> > > > > > the flush/barrier command through all the layers, when we have an
> > > > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > > > > > a flush command to be honored has to be honored by all the
> > > > > > devices involved.
> > > >
> > > > Hello Patrick & Goffredo,
> > > >
> > > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
> > > >
> > > > If you can reproduce this issue easily, please collect log by the
> > > > following bpftrace script, which may show where the flush failure is,
> > > > and maybe it can help to narrow down the issue in the whole stack.
> > > >
> > > >
> > > > #!/usr/bin/bpftrace
> > > >
> > > > #ifndef BPFTRACE_HAVE_BTF
> > > > #include <linux/blkdev.h>
> > > > #endif
> > > >
> > > > kprobe:submit_bio_noacct,
> > > > kprobe:submit_bio
> > > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > > {
> > > >         $bio = (struct bio *)arg0;
> > > >         @submit_stack[arg0] = kstack;
> > > >         @tracked[arg0] = 1;
> > > > }
> > > >
> > > > kprobe:bio_endio
> > > > /@tracked[arg0] != 0/
> > > > {
> > > >         $bio = (struct bio *)arg0;
> > > >
> > > >         if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > >                 return;
> > > >         }
> > > >
> > > >         if ($bio->bi_status != 0) {
> > > >                 printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > >                         $bio->bi_bdev->bd_disk->disk_name,
> > > >                         $bio->bi_status, @submit_stack[arg0], kstack);
> > > >         }
> > > >         delete(@submit_stack[arg0]);
> > > >         delete(@tracked[arg0]);
> > > > }
> > > >
> > > > END {
> > > >         clear(@submit_stack);
> > > >         clear(@tracked);
> > > > }
> > > >
> > > 
> > > Attaching 4 probes...
> > > dev dm-77 bio failed 10, submitter
> > >        submit_bio_noacct+5
> > >        __send_duplicate_bios+358
> > >        __send_empty_flush+179
> > >        dm_submit_bio+857
> > >        __submit_bio+132
> > >        submit_bio_noacct_nocheck+345
> > >        write_all_supers+1718
> > >        btrfs_commit_transaction+2342
> > >        transaction_kthread+345
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > > completion
> > >        bio_endio+5
> > >        dm_submit_bio+955
> > >        __submit_bio+132
> > >        submit_bio_noacct_nocheck+345
> > >        write_all_supers+1718
> > >        btrfs_commit_transaction+2342
> > >        transaction_kthread+345
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > > 
> > > dev dm-86 bio failed 10, submitter
> > >        submit_bio_noacct+5
> > >        write_all_supers+1718
> > >        btrfs_commit_transaction+2342
> > >        transaction_kthread+345
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > > completion
> > >        bio_endio+5
> > >        clone_endio+295
> > >        clone_endio+295
> > >        process_one_work+369
> > >        worker_thread+635
> > >        kthread+229
> > >        ret_from_fork+49
> > >        ret_from_fork_asm+27
> > > 
> > > 
> > > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
> > 
> > io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> > /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
> > 
> > The error should be from the following code only:
> > 
> > static void __map_bio(struct bio *clone)
> > 
> > 	...
> > 	if (r == DM_MAPIO_KILL)
> > 		dm_io_dec_pending(io, BLK_STS_IOERR);
> > 	else
> > 		dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> >     break;
> 
> I agree that the above bpf stack traces for dm-77 indicate that
> dm_submit_bio failed, which would end up in the above branch if the
> target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.
> 
> But such an early failure speaks to the flush bio never being
> submitted to the underlying storage. No?
> 
> dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:
> 
>         /*
>          * If we're reshaping to add disk(s)), ti->len and
>          * mddev->array_sectors will differ during the process
>          * (ti->len > mddev->array_sectors), so we have to requeue
>          * bios with addresses > mddev->array_sectors here or
>          * there will occur accesses past EOD of the component
>          * data images thus erroring the raid set.
>          */
>         if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
>                 return DM_MAPIO_REQUEUE;
> 
> But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems
> weird relative to a flush.

Yeah, I also found the above is weird, since DM_MAPIO_REQUEUE is
supposed to work together only with noflush_suspend, see
2e93ccc1933d ("[PATCH] dm: suspend: add noflush pushback"), looks
you already mentioned.

If that is the reason, maybe the following change can make a
difference:

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 5e41fbae3f6b..07af18baa8dd 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3331,7 +3331,7 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
 	 * there will occur accesses past EOD of the component
 	 * data images thus erroring the raid set.
 	 */
-	if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
+	if (unlikely(bio_has_data(bio) && bio_end_sector(bio) > mddev->array_sectors))
 		return DM_MAPIO_REQUEUE;
 
 	md_handle_request(mddev, bio);


> 
> > Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> > built? It is dm-raid1 target or over plain raid1 device which is
> > build over /dev/lowerVG?
> 
> In my earlier reply I asked Patrick for both:
> lsblk
> dmsetup table
> 
> Picking over the described IO stacks provided earlier (or Goffredo's
> interpretation of it, via ascii art) isn't really a great way to see
> the IO stacks that are in use/question.
> 
> > Mike, the logic in the following code doesn't change from v5.18-rc2 to
> > v5.19, but I still can't understand why STS_IOERR is set in
> > dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> > since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which
> > is supposed to not happen in Patrick's case.
> > 
> > dm_io_complete()
> > 	...
> > 	if (io->status == BLK_STS_DM_REQUEUE) {
> > 	        unsigned long flags;
> > 	        /*
> > 	         * Target requested pushing back the I/O.
> > 	         */
> > 	        spin_lock_irqsave(&md->deferred_lock, flags);
> > 	        if (__noflush_suspending(md) &&
> > 	            !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> > 	                /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> > 	                bio_list_add_head(&md->deferred, bio);
> > 	        } else {
> > 	                /*
> > 	                 * noflush suspend was interrupted or this is
> > 	                 * a write to a zoned target.
> > 	                 */
> > 	                io->status = BLK_STS_IOERR;
> > 	        }
> > 	        spin_unlock_irqrestore(&md->deferred_lock, flags);
> > 	}
> 
> Given the reason from dm-raid.c:raid_map returning DM_MAPIO_REQUEUE
> I think the DM device could be suspending without flush.
> 
> But regardless, given you logged BLK_STS_IOERR lets assume it isn't,
> the assumption that "noflush suspend was interrupted" seems like a
> stale comment -- especially given that target's like dm-raid are now
> using DM_MAPIO_REQUEUE without concern for the historic tight-coupling
> of noflush suspend (which was always the case for the biggest historic
> reason for this code: dm-multipath, see commit 2e93ccc1933d0 from
> 2006 -- predates my time with developing DM).
> 
> So all said, this code seems flawed for dm-raid (and possibly other
> targets that return DM_MAPIO_REQUEUE).  I'll look closer this week.

Agree, the change is added since 9dbd1aa3a81c ("dm raid: add reshaping
support to the target"), so loop Heinz in.


Thanks,
Ming


  reply	other threads:[~2024-03-10 15:47 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAOCpoWc_HQy4UJzTi9pqtJdO740Wx5Yd702O-mwXBE6RVBX1Eg@mail.gmail.com>
     [not found] ` <CAOCpoWf3TSQkUUo-qsj0LVEOm-kY0hXdmttLE82Ytc0hjpTSPw@mail.gmail.com>
2024-02-28 17:25   ` [REGRESSION] LVM-on-LVM: error while submitting device barriers Patrick Plenefisch
2024-02-28 19:19     ` Goffredo Baroncelli
2024-02-28 19:37       ` Patrick Plenefisch
2024-02-29 19:56         ` Goffredo Baroncelli
2024-02-29 20:22           ` Patrick Plenefisch
2024-02-29 22:05             ` Goffredo Baroncelli
2024-03-05 17:45               ` Mike Snitzer
2024-03-06 15:59                 ` Ming Lei
2024-03-09 20:39                   ` Patrick Plenefisch
2024-03-10 11:34                     ` Ming Lei
2024-03-10 15:27                       ` Mike Snitzer
2024-03-10 15:47                         ` Ming Lei [this message]
2024-03-10 18:11                         ` Patrick Plenefisch
2024-03-11 13:13                           ` Ming Lei
2024-03-12 22:54                             ` Patrick Plenefisch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Ze3WFGE0je3FivlL@fedora \
    --to=ming.lei@redhat.com \
    --cc=agk@redhat.com \
    --cc=clm@fb.com \
    --cc=dm-devel@lists.linux.dev \
    --cc=dsterba@suse.com \
    --cc=heinzm@redhat.com \
    --cc=josef@toxicpanda.com \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=regressions@lists.linux.dev \
    --cc=simonpatp@gmail.com \
    --cc=snitzer@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.