public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Ming Lei <ming.lei@redhat.com>
Cc: Patrick Plenefisch <simonpatp@gmail.com>,
	Goffredo Baroncelli <kreijack@inwind.it>,
	linux-kernel@vger.kernel.org, Alasdair Kergon <agk@redhat.com>,
	Mikulas Patocka <mpatocka@redhat.com>, Chris Mason <clm@fb.com>,
	Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>,
	regressions@lists.linux.dev, dm-devel@lists.linux.dev,
	linux-btrfs@vger.kernel.org
Subject: Re: LVM-on-LVM: error while submitting device barriers
Date: Sun, 10 Mar 2024 11:27:22 -0400	[thread overview]
Message-ID: <Ze3RWqLvG18cQ4dz@redhat.com> (raw)
In-Reply-To: <Ze2azGlb1WxVFv7Z@fedora>

On Sun, Mar 10 2024 at  7:34P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > > > On Thu, Feb 29 2024 at  5:05P -0500,
> > > > Goffredo Baroncelli <kreijack@inwind.it> wrote:
> > > >
> > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@inwind.it> wrote:
> > > > > > >
> > > > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > > > > >
> > > > > > >
> > > > > > > what about lowerVG/works ?
> > > > > > >
> > > > > >
> > > > > > That one is only on two disks, it doesn't span any gaps
> > > > >
> > > > > Sorry, but re-reading the original email I found something that I missed before:
> > > > >
> > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > > >                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > tolerance is 0 for writable mount
> > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > > > failure (errors while submitting device barriers.)
> > > > >
> > > > > Looking at the code, it seems that if a FLUSH commands fails, btrfs
> > > > > considers that the disk is missing. The it cannot mount RW the device.
> > > > >
> > > > > I would investigate with the LVM developers, if it properly passes
> > > > > the flush/barrier command through all the layers, when we have an
> > > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > > > > a flush command to be honored has to be honored by all the
> > > > > devices involved.
> > >
> > > Hello Patrick & Goffredo,
> > >
> > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
> > >
> > > If you can reproduce this issue easily, please collect log by the
> > > following bpftrace script, which may show where the flush failure is,
> > > and maybe it can help to narrow down the issue in the whole stack.
> > >
> > >
> > > #!/usr/bin/bpftrace
> > >
> > > #ifndef BPFTRACE_HAVE_BTF
> > > #include <linux/blkdev.h>
> > > #endif
> > >
> > > kprobe:submit_bio_noacct,
> > > kprobe:submit_bio
> > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > {
> > >         $bio = (struct bio *)arg0;
> > >         @submit_stack[arg0] = kstack;
> > >         @tracked[arg0] = 1;
> > > }
> > >
> > > kprobe:bio_endio
> > > /@tracked[arg0] != 0/
> > > {
> > >         $bio = (struct bio *)arg0;
> > >
> > >         if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > >                 return;
> > >         }
> > >
> > >         if ($bio->bi_status != 0) {
> > >                 printf("dev %s bio failed %d, submitter %s completion %s\n",
> > >                         $bio->bi_bdev->bd_disk->disk_name,
> > >                         $bio->bi_status, @submit_stack[arg0], kstack);
> > >         }
> > >         delete(@submit_stack[arg0]);
> > >         delete(@tracked[arg0]);
> > > }
> > >
> > > END {
> > >         clear(@submit_stack);
> > >         clear(@tracked);
> > > }
> > >
> > 
> > Attaching 4 probes...
> > dev dm-77 bio failed 10, submitter
> >        submit_bio_noacct+5
> >        __send_duplicate_bios+358
> >        __send_empty_flush+179
> >        dm_submit_bio+857
> >        __submit_bio+132
> >        submit_bio_noacct_nocheck+345
> >        write_all_supers+1718
> >        btrfs_commit_transaction+2342
> >        transaction_kthread+345
> >        kthread+229
> >        ret_from_fork+49
> >        ret_from_fork_asm+27
> > completion
> >        bio_endio+5
> >        dm_submit_bio+955
> >        __submit_bio+132
> >        submit_bio_noacct_nocheck+345
> >        write_all_supers+1718
> >        btrfs_commit_transaction+2342
> >        transaction_kthread+345
> >        kthread+229
> >        ret_from_fork+49
> >        ret_from_fork_asm+27
> > 
> > dev dm-86 bio failed 10, submitter
> >        submit_bio_noacct+5
> >        write_all_supers+1718
> >        btrfs_commit_transaction+2342
> >        transaction_kthread+345
> >        kthread+229
> >        ret_from_fork+49
> >        ret_from_fork_asm+27
> > completion
> >        bio_endio+5
> >        clone_endio+295
> >        clone_endio+295
> >        process_one_work+369
> >        worker_thread+635
> >        kthread+229
> >        ret_from_fork+49
> >        ret_from_fork_asm+27
> > 
> > 
> > For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
> 
> io_status is 10(BLK_STS_IOERR), which is produced in submission code path on
> /dev/dm-77(/dev/lowerVG/lvmPool) first, so looks it is one device mapper issue.
> 
> The error should be from the following code only:
> 
> static void __map_bio(struct bio *clone)
> 
> 	...
> 	if (r == DM_MAPIO_KILL)
> 		dm_io_dec_pending(io, BLK_STS_IOERR);
> 	else
> 		dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
>     break;

I agree that the above bpf stack traces for dm-77 indicate that
dm_submit_bio failed, which would end up in the above branch if the
target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE.

But such an early failure speaks to the flush bio never being
submitted to the underlying storage. No?

dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:

        /*
         * If we're reshaping to add disk(s)), ti->len and
         * mddev->array_sectors will differ during the process
         * (ti->len > mddev->array_sectors), so we have to requeue
         * bios with addresses > mddev->array_sectors here or
         * there will occur accesses past EOD of the component
         * data images thus erroring the raid set.
         */
        if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
                return DM_MAPIO_REQUEUE;

But a flush doesn't have an end_sector (it'd be 0 afaik).. so it seems
weird relative to a flush.

> Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> built? It is dm-raid1 target or over plain raid1 device which is
> build over /dev/lowerVG?

In my earlier reply I asked Patrick for both:
lsblk
dmsetup table

Picking over the described IO stacks provided earlier (or Goffredo's
interpretation of it, via ascii art) isn't really a great way to see
the IO stacks that are in use/question.

> Mike, the logic in the following code doesn't change from v5.18-rc2 to
> v5.19, but I still can't understand why STS_IOERR is set in
> dm_io_complete() in case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend() which
> is supposed to not happen in Patrick's case.
> 
> dm_io_complete()
> 	...
> 	if (io->status == BLK_STS_DM_REQUEUE) {
> 	        unsigned long flags;
> 	        /*
> 	         * Target requested pushing back the I/O.
> 	         */
> 	        spin_lock_irqsave(&md->deferred_lock, flags);
> 	        if (__noflush_suspending(md) &&
> 	            !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> 	                /* NOTE early return due to BLK_STS_DM_REQUEUE below */
> 	                bio_list_add_head(&md->deferred, bio);
> 	        } else {
> 	                /*
> 	                 * noflush suspend was interrupted or this is
> 	                 * a write to a zoned target.
> 	                 */
> 	                io->status = BLK_STS_IOERR;
> 	        }
> 	        spin_unlock_irqrestore(&md->deferred_lock, flags);
> 	}

Given the reason from dm-raid.c:raid_map returning DM_MAPIO_REQUEUE
I think the DM device could be suspending without flush.

But regardless, given you logged BLK_STS_IOERR lets assume it isn't,
the assumption that "noflush suspend was interrupted" seems like a
stale comment -- especially given that target's like dm-raid are now
using DM_MAPIO_REQUEUE without concern for the historic tight-coupling
of noflush suspend (which was always the case for the biggest historic
reason for this code: dm-multipath, see commit 2e93ccc1933d0 from
2006 -- predates my time with developing DM).

So all said, this code seems flawed for dm-raid (and possibly other
targets that return DM_MAPIO_REQUEUE).  I'll look closer this week.

Mike

  reply	other threads:[~2024-03-10 15:27 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CAOCpoWc_HQy4UJzTi9pqtJdO740Wx5Yd702O-mwXBE6RVBX1Eg@mail.gmail.com>
     [not found] ` <CAOCpoWf3TSQkUUo-qsj0LVEOm-kY0hXdmttLE82Ytc0hjpTSPw@mail.gmail.com>
2024-02-28 17:25   ` [REGRESSION] LVM-on-LVM: error while submitting device barriers Patrick Plenefisch
2024-02-28 19:19     ` Goffredo Baroncelli
2024-02-28 19:37       ` Patrick Plenefisch
2024-02-29 19:56         ` Goffredo Baroncelli
2024-02-29 20:22           ` Patrick Plenefisch
2024-02-29 22:05             ` Goffredo Baroncelli
2024-03-05 17:45               ` Mike Snitzer
2024-03-06 15:59                 ` Ming Lei
2024-03-09 20:39                   ` Patrick Plenefisch
2024-03-10 11:34                     ` Ming Lei
2024-03-10 15:27                       ` Mike Snitzer [this message]
2024-03-10 15:47                         ` Ming Lei
2024-03-10 18:11                         ` Patrick Plenefisch
2024-03-11 13:13                           ` Ming Lei
2024-03-12 22:54                             ` Patrick Plenefisch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Ze3RWqLvG18cQ4dz@redhat.com \
    --to=snitzer@kernel.org \
    --cc=agk@redhat.com \
    --cc=clm@fb.com \
    --cc=dm-devel@lists.linux.dev \
    --cc=dsterba@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=mpatocka@redhat.com \
    --cc=regressions@lists.linux.dev \
    --cc=simonpatp@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox