From: Ming Lei <ming.lei@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Max Gurtovoy <maxg@mellanox.com>, Jens Axboe <axboe@kernel.dk>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Subject: Re: NVMe induced NULL deref in bt_iter()
Date: Mon, 3 Jul 2017 17:40:01 +0800 [thread overview]
Message-ID: <20170703093951.GA28651@ming.t460p> (raw)
In-Reply-To: <7138df5a-b1ce-7f46-281f-ae15172c61e5@grimberg.me>
On Sun, Jul 02, 2017 at 02:56:56PM +0300, Sagi Grimberg wrote:
>
>
> On 02/07/17 13:45, Max Gurtovoy wrote:
> >
> >
> > On 6/30/2017 8:26 PM, Jens Axboe wrote:
> > > Hi Max,
> >
> > Hi Jens,
> >
> > >
> > > I remembered you reporting this. I think this is a regression introduced
> > > with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
> > > is, but that's not indexable by the tag we find. So I think we need to
> > > guard those with a NULL check. The actual requests themselves are
> > > static, so we know the memory itself isn't going away. But if we race
> > > with completion, we could find a NULL there, validly.
> > >
> > > Since you could reproduce it, can you try the below?
> >
> > I still can repro the null deref with this patch applied.
> >
> > >
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index d0be72ccb091..b856b2827157 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap,
> > > unsigned int bitnr, void *data)
> > > bitnr += tags->nr_reserved_tags;
> > > rq = tags->rqs[bitnr];
> > >
> > > - if (rq->q == hctx->queue)
> > > + if (rq && rq->q == hctx->queue)
> > > iter_data->fn(hctx, rq, iter_data->data, reserved);
> > > return true;
> > > }
> > > @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap,
> > > unsigned int bitnr, void *data)
> > > if (!reserved)
> > > bitnr += tags->nr_reserved_tags;
> > > rq = tags->rqs[bitnr];
> > > -
> > > - iter_data->fn(rq, iter_data->data, reserved);
> > > + if (rq)
> > > + iter_data->fn(rq, iter_data->data, reserved);
> > > return true;
> > > }
> >
> > see the attached file for dmesg output.
> >
> > output of gdb:
> >
> > (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
> > 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs
> > (./include/linux/sbitmap.h:234).
> > 229
> > 230 for (i = 0; i < sb->map_nr; i++) {
> > 231 struct sbitmap_word *word = &sb->map[i];
> > 232 unsigned int off, nr;
> > 233
> > 234 if (!word->word)
> > 235 continue;
> > 236
> > 237 nr = 0;
> > 238 off = i << sb->shift;
> >
> >
> > when I change the "if (!word->word)" to "if (word && !word->word)"
> > I can get null deref at "nr = find_next_bit(&word->word, word->depth,
> > nr);". Seems like somehow word becomes NULL.
> >
> > Adding the linux-nvme guys too.
> > Sagi has mentioned that this can be null only if we remove the tagset
> > while I/O is trying to get a tag and when killing the target we get into
> > error recovery and periodic reconnects, which does _NOT_ include freeing
> > the tagset, so this is probably the admin tagset.
> >
> > Sagi,
> > you've mention a patch for centrelizing the treatment of the admin
> > tagset to the nvme core. I think I missed this patch, so can you please
> > send a pointer to it and I'll check if it helps ?
>
> Hmm,
>
> In the above flow we should not be freeing the tag_set, not on admin as
> well. The target keep removing namespaces and finally removes the
> subsystem which generates a error recovery flow. What we at least try
> to do is:
>
> 1. mark rdma queues as not live
> 2. stop all the sw queues (admin and io)
> 3. fail inflight I/Os
> 4. restart all sw queues (to fast fail until we recover)
>
> We shouldn't be freeing the tagsets (although we might update them
> when we recover and cpu map changed - which I don't think is happening).
>
> However, I do see a difference between bt_tags_for_each
> and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).
>
> Unrelated to this I think we should quiesce/unquiesce the admin_q
> instead of stop/start because it respects the submission path rcu [1].
>
> It might hide the issue, but given that we never free the tagset its
> seems like it's not in nvme-rdma (max, can you see if this makes the
> issue go away?)
>
> [1]:
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index e3996db22738..094873a4ee38 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
>
> if (ctrl->ctrl.queue_count > 1)
> nvme_stop_queues(&ctrl->ctrl);
> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> + blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>
> /* We must take care of fastfail/requeue all our inflight requests
> */
> if (ctrl->ctrl.queue_count > 1)
> @@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
> * queues are not a live anymore, so restart the queues to fail fast
> * new IO
> */
> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> + blk_mq_kick_requeue_list(ctrl->ctrl.admin_q);
> nvme_start_queues(&ctrl->ctrl);
>
> nvme_rdma_reconnect_or_remove(ctrl);
> @@ -1651,7 +1652,7 @@ static void nvme_rdma_shutdown_ctrl(struct
> nvme_rdma_ctrl *ctrl)
> if (test_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags))
> nvme_shutdown_ctrl(&ctrl->ctrl);
>
> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> + blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> nvme_cancel_request, &ctrl->ctrl);
> nvme_rdma_destroy_admin_queue(ctrl);
Yeah, the above change is correct, for any canceling requests in this
way we should use blk_mq_quiesce_queue().
Thanks,
Ming
WARNING: multiple messages have this Message-ID (diff)
From: ming.lei@redhat.com (Ming Lei)
Subject: NVMe induced NULL deref in bt_iter()
Date: Mon, 3 Jul 2017 17:40:01 +0800 [thread overview]
Message-ID: <20170703093951.GA28651@ming.t460p> (raw)
In-Reply-To: <7138df5a-b1ce-7f46-281f-ae15172c61e5@grimberg.me>
On Sun, Jul 02, 2017@02:56:56PM +0300, Sagi Grimberg wrote:
>
>
> On 02/07/17 13:45, Max Gurtovoy wrote:
> >
> >
> > On 6/30/2017 8:26 PM, Jens Axboe wrote:
> > > Hi Max,
> >
> > Hi Jens,
> >
> > >
> > > I remembered you reporting this. I think this is a regression introduced
> > > with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
> > > is, but that's not indexable by the tag we find. So I think we need to
> > > guard those with a NULL check. The actual requests themselves are
> > > static, so we know the memory itself isn't going away. But if we race
> > > with completion, we could find a NULL there, validly.
> > >
> > > Since you could reproduce it, can you try the below?
> >
> > I still can repro the null deref with this patch applied.
> >
> > >
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index d0be72ccb091..b856b2827157 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap,
> > > unsigned int bitnr, void *data)
> > > bitnr += tags->nr_reserved_tags;
> > > rq = tags->rqs[bitnr];
> > >
> > > - if (rq->q == hctx->queue)
> > > + if (rq && rq->q == hctx->queue)
> > > iter_data->fn(hctx, rq, iter_data->data, reserved);
> > > return true;
> > > }
> > > @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap,
> > > unsigned int bitnr, void *data)
> > > if (!reserved)
> > > bitnr += tags->nr_reserved_tags;
> > > rq = tags->rqs[bitnr];
> > > -
> > > - iter_data->fn(rq, iter_data->data, reserved);
> > > + if (rq)
> > > + iter_data->fn(rq, iter_data->data, reserved);
> > > return true;
> > > }
> >
> > see the attached file for dmesg output.
> >
> > output of gdb:
> >
> > (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
> > 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs
> > (./include/linux/sbitmap.h:234).
> > 229
> > 230 for (i = 0; i < sb->map_nr; i++) {
> > 231 struct sbitmap_word *word = &sb->map[i];
> > 232 unsigned int off, nr;
> > 233
> > 234 if (!word->word)
> > 235 continue;
> > 236
> > 237 nr = 0;
> > 238 off = i << sb->shift;
> >
> >
> > when I change the "if (!word->word)" to "if (word && !word->word)"
> > I can get null deref at "nr = find_next_bit(&word->word, word->depth,
> > nr);". Seems like somehow word becomes NULL.
> >
> > Adding the linux-nvme guys too.
> > Sagi has mentioned that this can be null only if we remove the tagset
> > while I/O is trying to get a tag and when killing the target we get into
> > error recovery and periodic reconnects, which does _NOT_ include freeing
> > the tagset, so this is probably the admin tagset.
> >
> > Sagi,
> > you've mention a patch for centrelizing the treatment of the admin
> > tagset to the nvme core. I think I missed this patch, so can you please
> > send a pointer to it and I'll check if it helps ?
>
> Hmm,
>
> In the above flow we should not be freeing the tag_set, not on admin as
> well. The target keep removing namespaces and finally removes the
> subsystem which generates a error recovery flow. What we at least try
> to do is:
>
> 1. mark rdma queues as not live
> 2. stop all the sw queues (admin and io)
> 3. fail inflight I/Os
> 4. restart all sw queues (to fast fail until we recover)
>
> We shouldn't be freeing the tagsets (although we might update them
> when we recover and cpu map changed - which I don't think is happening).
>
> However, I do see a difference between bt_tags_for_each
> and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).
>
> Unrelated to this I think we should quiesce/unquiesce the admin_q
> instead of stop/start because it respects the submission path rcu [1].
>
> It might hide the issue, but given that we never free the tagset its
> seems like it's not in nvme-rdma (max, can you see if this makes the
> issue go away?)
>
> [1]:
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index e3996db22738..094873a4ee38 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
>
> if (ctrl->ctrl.queue_count > 1)
> nvme_stop_queues(&ctrl->ctrl);
> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> + blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>
> /* We must take care of fastfail/requeue all our inflight requests
> */
> if (ctrl->ctrl.queue_count > 1)
> @@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
> * queues are not a live anymore, so restart the queues to fail fast
> * new IO
> */
> - blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> + blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> + blk_mq_kick_requeue_list(ctrl->ctrl.admin_q);
> nvme_start_queues(&ctrl->ctrl);
>
> nvme_rdma_reconnect_or_remove(ctrl);
> @@ -1651,7 +1652,7 @@ static void nvme_rdma_shutdown_ctrl(struct
> nvme_rdma_ctrl *ctrl)
> if (test_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags))
> nvme_shutdown_ctrl(&ctrl->ctrl);
>
> - blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> + blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
> nvme_cancel_request, &ctrl->ctrl);
> nvme_rdma_destroy_admin_queue(ctrl);
Yeah, the above change is correct, for any canceling requests in this
way we should use blk_mq_quiesce_queue().
Thanks,
Ming
next prev parent reply other threads:[~2017-07-03 9:40 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-06-30 17:26 NVMe induced NULL deref in bt_iter() Jens Axboe
2017-07-02 10:45 ` Max Gurtovoy
2017-07-02 10:45 ` Max Gurtovoy
2017-07-02 11:56 ` Sagi Grimberg
2017-07-02 11:56 ` Sagi Grimberg
2017-07-02 14:37 ` Max Gurtovoy
2017-07-02 14:37 ` Max Gurtovoy
2017-07-02 15:08 ` Sagi Grimberg
2017-07-02 15:08 ` Sagi Grimberg
2017-07-03 9:40 ` Ming Lei [this message]
2017-07-03 9:40 ` Ming Lei
2017-07-03 10:07 ` Sagi Grimberg
2017-07-03 10:07 ` Sagi Grimberg
2017-07-03 12:03 ` Ming Lei
2017-07-03 12:03 ` Ming Lei
2017-07-03 12:46 ` Max Gurtovoy
2017-07-03 12:46 ` Max Gurtovoy
2017-07-03 15:54 ` Ming Lei
2017-07-03 15:54 ` Ming Lei
2017-07-04 6:58 ` Sagi Grimberg
2017-07-04 6:58 ` Sagi Grimberg
2017-07-04 7:56 ` Sagi Grimberg
2017-07-04 7:56 ` Sagi Grimberg
2017-07-04 8:08 ` Ming Lei
2017-07-04 8:08 ` Ming Lei
2017-07-04 9:14 ` Sagi Grimberg
2017-07-04 9:14 ` Sagi Grimberg
2017-07-03 16:01 ` Jens Axboe
2017-07-03 16:01 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170703093951.GA28651@ming.t460p \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=linux-block@vger.kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=maxg@mellanox.com \
--cc=sagi@grimberg.me \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.