From: Ming Lei <ming.lei@redhat.com>
To: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
Daniel Wagner <dwagner@suse.de>,
Khazhismel Kumykov <khazhy@google.com>,
Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
Hannes Reinecke <hare@suse.de>,
Johannes Thumshirn <johannes.thumshirn@wdc.com>,
John Garry <john.garry@huawei.com>
Subject: Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
Date: Thu, 22 Apr 2021 15:23:58 +0800 [thread overview]
Message-ID: <YIEkjh2NaiLVCCK8@T590> (raw)
In-Reply-To: <67442b28-3b08-1704-fff3-6f0e6a23465e@acm.org>
On Wed, Apr 21, 2021 at 09:01:50PM -0700, Bart Van Assche wrote:
> On 4/21/21 7:25 PM, Ming Lei wrote:
> > On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> >> When multiple request queues share a tag set and when switching the I/O
> >> scheduler for one of the request queues associated with that tag set, the
> >> following race can happen:
> >> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
> >> a pointer to a scheduler request to the local variable 'rq'.
> >> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> >> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
> >>
> >> Fix this race as follows:
> >> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
> >> The performance impact of the assignments added to the hot path is minimal
> >> (about 1% according to one particular test).
> >> * Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
> >> semaphore. Which mechanism is used depends on whether or not it is allowed
> >> to sleep and also on whether or not the callback function may sleep.
> >> * Wait for all preexisting readers to finish before freeing scheduler
> >> requests.
> >>
> >> Another race is as follows:
> >> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> >> * blk_mq_queue_tag_busy_iter() iterates over the same tag set but for another
> >> request queue than the queue for which scheduler tags are being freed.
> >> * bt_iter() examines rq->q after *rq has been freed.
> >>
> >> Fix this race by protecting the rq->q read in bt_iter() with an RCU read
> >> lock and by calling synchronize_rcu() before freeing scheduler tags.
> >>
> >> Multiple users have reported use-after-free complaints similar to the
> >> following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/ ):
> >>
> >> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> >> Read of size 8 at addr ffff88803b335240 by task fio/21412
> >>
> >> CPU: 0 PID: 21412 Comm: fio Tainted: G W 4.20.0-rc6-dbg+ #3
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> >> Call Trace:
> >> dump_stack+0x86/0xca
> >> print_address_description+0x71/0x239
> >> kasan_report.cold.5+0x242/0x301
> >> __asan_load8+0x54/0x90
> >> bt_iter+0x86/0xf0
> >> blk_mq_queue_tag_busy_iter+0x373/0x5e0
> >> blk_mq_in_flight+0x96/0xb0
> >> part_in_flight+0x40/0x140
> >> part_round_stats+0x18e/0x370
> >> blk_account_io_start+0x3d7/0x670
> >> blk_mq_bio_to_request+0x19c/0x3a0
> >> blk_mq_make_request+0x7a9/0xcb0
> >> generic_make_request+0x41d/0x960
> >> submit_bio+0x9b/0x250
> >> do_blockdev_direct_IO+0x435c/0x4c70
> >> __blockdev_direct_IO+0x79/0x88
> >> ext4_direct_IO+0x46c/0xc00
> >> generic_file_direct_write+0x119/0x210
> >> __generic_file_write_iter+0x11c/0x280
> >> ext4_file_write_iter+0x1b8/0x6f0
> >> aio_write+0x204/0x310
> >> io_submit_one+0x9d3/0xe80
> >> __x64_sys_io_submit+0x115/0x340
> >> do_syscall_64+0x71/0x210
> >>
> >> Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
> >> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >> Cc: Christoph Hellwig <hch@lst.de>
> >> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> >> Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> >> Cc: Ming Lei <ming.lei@redhat.com>
> >> Cc: Hannes Reinecke <hare@suse.de>
> >> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> >> Cc: John Garry <john.garry@huawei.com>
> >> Cc: Khazhy Kumykov <khazhy@google.com>
> >> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> >> ---
> >> block/blk-core.c | 34 ++++++++++++++++++++++++++++++-
> >> block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
> >> block/blk-mq-tag.h | 4 +++-
> >> block/blk-mq.c | 21 +++++++++++++++----
> >> block/blk-mq.h | 1 +
> >> block/blk.h | 2 ++
> >> block/elevator.c | 1 +
> >> 7 files changed, 102 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 9bcdae93f6d4..ca7f833e25a8 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
> >> }
> >> EXPORT_SYMBOL(blk_dump_rq_flags);
> >>
> >> +/**
> >> + * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
> >> + * @set: Tag set to wait on.
> >> + *
> >> + * Waits for preexisting calls of blk_mq_all_tag_iter(),
> >> + * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
> >> + * accessing hctx->tags->rqs[]. New readers may start while this function is
> >> + * in progress or after this function has returned. Use this function to make
> >> + * sure that hctx->tags->rqs[] changes have become globally visible.
> >> + *
> >> + * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
> >> + * finish accessing requests associated with other request queues than 'q'.
> >> + */
> >> +void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
> >> +{
> >> + struct blk_mq_tags *tags;
> >> + int i;
> >> +
> >> + if (set->tags) {
> >> + for (i = 0; i < set->nr_hw_queues; i++) {
> >> + tags = set->tags[i];
> >> + if (!tags)
> >> + continue;
> >> + down_write(&tags->iter_rwsem);
> >> + up_write(&tags->iter_rwsem);
> >> + }
> >> + }
> >> + synchronize_rcu();
> >> +}
> >> +
> >> /**
> >> * blk_sync_queue - cancel any pending callbacks on a queue
> >> * @q: the queue
> >> @@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
> >> * it is safe to free requests now.
> >> */
> >> mutex_lock(&q->sysfs_lock);
> >> - if (q->elevator)
> >> + if (q->elevator) {
> >> + blk_mq_wait_for_tag_iter(q->tag_set);
> >> blk_mq_sched_free_requests(q);
> >> + }
> >> mutex_unlock(&q->sysfs_lock);
> >>
> >> percpu_ref_exit(&q->q_usage_counter);
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> >> index d8eaa38a1bd1..39d5c9190a6b 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -209,14 +209,24 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >>
> >> if (!reserved)
> >> bitnr += tags->nr_reserved_tags;
> >> - rq = tags->rqs[bitnr];
> >> -
> >> + rcu_read_lock();
> >> + /*
> >> + * The request 'rq' points at is protected by an RCU read lock until
> >> + * its queue pointer has been verified and by q_usage_count while the
> >> + * callback function is being invoked. See also the
> >> + * percpu_ref_tryget() and blk_queue_exit() calls in
> >> + * blk_mq_queue_tag_busy_iter().
> >> + */
> >> + rq = rcu_dereference(tags->rqs[bitnr]);
> >> /*
> >> * We can hit rq == NULL here, because the tagging functions
> >> * test and set the bit before assigning ->rqs[].
> >> */
> >> - if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> >> + if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
> >> + rcu_read_unlock();
> >> return iter_data->fn(hctx, rq, iter_data->data, reserved);
> >> + }
> >> + rcu_read_unlock();
> >> return true;
> >> }
> >>
> >> @@ -254,11 +264,17 @@ struct bt_tags_iter_data {
> >> unsigned int flags;
> >> };
> >>
> >> +/* Include reserved tags. */
> >> #define BT_TAG_ITER_RESERVED (1 << 0)
> >> +/* Only include started requests. */
> >> #define BT_TAG_ITER_STARTED (1 << 1)
> >> +/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
> >> #define BT_TAG_ITER_STATIC_RQS (1 << 2)
> >> +/* The callback function may sleep. */
> >> +#define BT_TAG_ITER_MAY_SLEEP (1 << 3)
> >>
> >> -static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >> +static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
> >> + void *data)
> >> {
> >> struct bt_tags_iter_data *iter_data = data;
> >> struct blk_mq_tags *tags = iter_data->tags;
> >> @@ -275,7 +291,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >> if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
> >> rq = tags->static_rqs[bitnr];
> >> else
> >> - rq = tags->rqs[bitnr];
> >> + rq = rcu_dereference_check(tags->rqs[bitnr],
> >> + lockdep_is_held(&tags->iter_rwsem));
> >> if (!rq)
> >> return true;
> >> if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
> >> @@ -284,6 +301,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >> return iter_data->fn(rq, iter_data->data, reserved);
> >> }
> >>
> >> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >> +{
> >> + struct bt_tags_iter_data *iter_data = data;
> >> + struct blk_mq_tags *tags = iter_data->tags;
> >> + bool res;
> >> +
> >> + if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
> >> + down_read(&tags->iter_rwsem);
> >> + res = __bt_tags_iter(bitmap, bitnr, data);
> >> + up_read(&tags->iter_rwsem);
> >> + } else {
> >> + rcu_read_lock();
> >> + res = __bt_tags_iter(bitmap, bitnr, data);
> >> + rcu_read_unlock();
> >> + }
> >> +
> >> + return res;
> >> +}
> >> +
> >> /**
> >> * bt_tags_for_each - iterate over the requests in a tag map
> >> * @tags: Tag map to iterate over.
> >> @@ -357,10 +393,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >> {
> >> int i;
> >>
> >> + might_sleep();
> >> +
> >> for (i = 0; i < tagset->nr_hw_queues; i++) {
> >> if (tagset->tags && tagset->tags[i])
> >> __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> >> - BT_TAG_ITER_STARTED);
> >> + BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
> >> }
> >> }
> >> EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> >> @@ -544,6 +582,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> >>
> >> tags->nr_tags = total_tags;
> >> tags->nr_reserved_tags = reserved_tags;
> >> + init_rwsem(&tags->iter_rwsem);
> >>
> >> if (blk_mq_is_sbitmap_shared(flags))
> >> return tags;
> >> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> >> index 0290c308ece9..d1d73d7cc7df 100644
> >> --- a/block/blk-mq-tag.h
> >> +++ b/block/blk-mq-tag.h
> >> @@ -17,9 +17,11 @@ struct blk_mq_tags {
> >> struct sbitmap_queue __bitmap_tags;
> >> struct sbitmap_queue __breserved_tags;
> >>
> >> - struct request **rqs;
> >> + struct request __rcu **rqs;
> >> struct request **static_rqs;
> >> struct list_head page_list;
> >> +
> >> + struct rw_semaphore iter_rwsem;
> >> };
> >>
> >> extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 79c01b1f885c..8b59f6b4ec8e 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -496,8 +496,10 @@ static void __blk_mq_free_request(struct request *rq)
> >> blk_crypto_free_request(rq);
> >> blk_pm_mark_last_busy(rq);
> >> rq->mq_hctx = NULL;
> >> - if (rq->tag != BLK_MQ_NO_TAG)
> >> + if (rq->tag != BLK_MQ_NO_TAG) {
> >> blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> >> + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> >
> > After the tag is released, it can be re-allocated from another context
> > immediately. And the above rcu_assign_pointer(NULL) could be run after
> > the re-allocation and new ->rqs[rq->tag] assignment in submission path,
> > is it one issue?
>
> How about swapping the rcu_assign_pointer() and blk_mq_put_tag() calls?
> That should be safe since the tag iteration functions check whether or
> not hctx->tags->rqs[] is NULL.
>
> There are already barriers in sbitmap_queue_clear() so I don't think
> that any new barriers would need to be added.
That looks fine, since the barrier pair is implied between allocating tag
and assigning ->tags[tag] too.
Thanks,
Ming
next prev parent reply other threads:[~2021-04-22 7:24 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-21 0:02 [PATCH v7 0/5] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
2021-04-21 0:02 ` [PATCH v7 1/5] blk-mq: Move the elevator_exit() definition Bart Van Assche
2021-04-21 0:02 ` [PATCH v7 2/5] blk-mq: Introduce atomic variants of blk_mq_(all_tag|tagset_busy)_iter Bart Van Assche
2021-04-21 0:02 ` [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests Bart Van Assche
2021-04-22 2:25 ` Ming Lei
2021-04-22 4:01 ` Bart Van Assche
2021-04-22 7:23 ` Ming Lei [this message]
2021-04-22 3:15 ` Ming Lei
2021-04-22 3:54 ` Bart Van Assche
2021-04-22 7:13 ` Ming Lei
2021-04-22 15:51 ` Bart Van Assche
2021-04-23 3:52 ` Ming Lei
2021-04-23 17:52 ` Bart Van Assche
2021-04-25 0:09 ` Ming Lei
2021-04-25 21:01 ` Bart Van Assche
2021-04-26 0:55 ` Ming Lei
2021-04-26 16:29 ` Bart Van Assche
2021-04-27 0:11 ` Ming Lei
2021-04-21 0:02 ` [PATCH v7 4/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list Bart Van Assche
2021-04-21 0:02 ` [PATCH v7 5/5] blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags Bart Van Assche
2021-04-21 14:40 ` [PATCH v7 0/5] blk-mq: Fix a race between iterating over requests and freeing requests Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YIEkjh2NaiLVCCK8@T590 \
--to=ming.lei@redhat.com \
--cc=axboe@kernel.dk \
--cc=bvanassche@acm.org \
--cc=dwagner@suse.de \
--cc=hare@suse.de \
--cc=hch@lst.de \
--cc=johannes.thumshirn@wdc.com \
--cc=john.garry@huawei.com \
--cc=khazhy@google.com \
--cc=linux-block@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=shinichiro.kawasaki@wdc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox