All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Martin Wilck <mwilck@suse.com>
Cc: Mike Snitzer <snitzer@redhat.com>,
	Mikulas Patocka <mpatocka@redhat.com>,
	Alasdair G Kergon <agk@redhat.com>,
	dm-devel@lists.linux.dev, Hannes Reinecke <hare@suse.de>,
	Vasilis Liaskovitis <vliaskovitis@suse.com>
Subject: Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()
Date: Wed, 20 Mar 2024 11:03:48 +0800	[thread overview]
Message-ID: <ZfpSFA9NlLKZyBDg@fedora> (raw)
In-Reply-To: <821487f4494d34dd1c9686f23306b20f60bacd8a.camel@suse.com>

On Tue, Mar 19, 2024 at 04:41:26PM +0100, Martin Wilck wrote:
> Hello Ming,
> 
> On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> > Hello Martin,
> > 
> > On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > > In a recent kernel dump analysis, we found that the kernel crashed
> > > because
> > > dm_rq_target_io tio->ti was pointing to invalid memory in
> > > dm_end_request(),
> > > in a situation where multipathd was doing map reloads because of a
> > > storage
> > > failover. The map of the respective mapped_device had been replaced
> > > by a
> > > different struct dm_table.
> > > 
> > > We obverved this with a 5.3.18 distro kernel, but the code in
> > > question
> > > hasn't change much since then. Basically, we were only missing
> > > b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
> > > suspend"),
> > > which doesn't guarantee that the race I'm thinking of (see below)
> > > can't
> > > happen.
> > > 
> > > When a map is resumed after a table reload, the live table is
> > > swapped, and
> > > the tio->ti member of any live request becomes stale. 
> > > __dm_resume() avoids
> > > this by quiescing the queue and calling dm_wait_for_completion(),
> > > which
> > > waits until blk_mq_queue_inflight() doesn't report any in-flight
> > > requests.
> > > 
> > > However, blk_mq_queue_inflight() counts only "started" requests.
> > > So, if a
> > > request is dispatched before the queue was quiesced, but
> > > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > > request
> > > because of memory ordering effects, __dm_suspend() may finish
> > > successfully,
> > 
> > Can you explain a bit about the exact memory order which causes
> > MQ_RQ_IN_FLIGHT
> > not observed?
> > 
> > blk-mq quiesce includes synchronize_rcu() which drains all in-flight
> > dispatch, so after blk_mq_quiesce_queue() returns,
> > if blk_mq_queue_inflight() returns 0, it does mean there isn't any
> > active inflight requests.
> > 
> > If there is bug in this pattern, I guess more drivers may have such
> > 'risk'.
> 
> What we know for sure is that there was a bad dm_target reference in
> (struct dm_rq_target_io *tio)->ti:
> 
> crash> struct -x dm_rq_target_io c00000245ca90128
> struct dm_rq_target_io {
>   md = 0xc0000031c66a4000,
>   ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,
> 
> crash> struct -x dm_target  0xc0080000020d0080
> struct dm_target struct: invalid kernel virtual address:
> c0080000020d0080  type: "gdb_readmem_callback"
> 
> The question is how this could have come to pass. It can only happen
> if tio->ti had been set before the map was reloaded. 
> My theory is that the IO had been dispatched before the queue had been
> quiesced, like this:
> 
> Task A                                 Task B
> (dispatching IO)                       (executing a DM_SUSPEND ioctl to
>                                        resume after DM_TABLE_LOAD)
>                                        do_resume()
>                                          dm_suspend()
>                                            __dm_suspend()
> dm_mq_queue_rq()                         
>    struct dm_target *ti =                
>        md->immutable_target;                  
>                                               dm_stop_queue()
>                                                  blk_mq_quiesce_queue()
>        /* 
>         * At this point, the queue is quiesced, but task A
>         * has alreadyentered dm_mq_queue_rq()
>         */

That shouldn't happen, blk_mq_quiesce_queue() drains all pending
dm_mq_queue_rq() and prevents new dm_mq_queue_rq() from being
called.

>                                               dm_wait_for_completion()
>                                                  blk_mq_queue_inflight()
>        /* 
>         * blk_mq_queue_inflight() doesn't see Task A's 
>         * request because it isn't started yet
>         */
>                                               set_bit(dmf_suspended_flag)
>    dm_start_request(md, rq);             dm_swap_table()
>                                             __bind()
>                                               md->immutable_target = ...
>                                          dm_target_destroy()
>         /* the previous md->immutable_target is freed */
>    init_tio(tio, rq, md);
>         /* the stale ti pointer is assigned to tio->ti */
>    tio->ti = ti;
> 
> dm_mq_queue_rq() contains no synchronization code if 
> md->immutable_target is set, so I think that this can happen, even
> though it looks unlikely. With b4459b11e840 (which was not applied in
> the customer kernel), there would be a
> set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before dm_stop_queue(),
> but IMO that the above would still be possible.
> 
> If this can't happen, I have no more ideas how the observed situation
> came to pass. The customer who sent us the core claims that
> he has seen this multiple times already (but we have only this single
> core dump).

Your kernel is v5.3, which is too old, so as Mike suggested, please
try b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
suspend") given blk-mq has too many changes here.

IMO, this issue should happen in upstream kernel.

Thanks,
Ming


  parent reply	other threads:[~2024-03-20  3:04 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-15 23:10 [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend() Martin Wilck
2024-03-19 13:04 ` Ming Lei
2024-03-19 15:41   ` Martin Wilck
2024-03-19 16:53     ` Mike Snitzer
2024-03-19 21:01       ` Martin Wilck
2024-03-20  3:03     ` Ming Lei [this message]
2024-03-20  9:51       ` Martin Wilck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZfpSFA9NlLKZyBDg@fedora \
    --to=ming.lei@redhat.com \
    --cc=agk@redhat.com \
    --cc=dm-devel@lists.linux.dev \
    --cc=hare@suse.de \
    --cc=mpatocka@redhat.com \
    --cc=mwilck@suse.com \
    --cc=snitzer@redhat.com \
    --cc=vliaskovitis@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.