Linux block layer
 help / color / mirror / Atom feed
From: Yu Kuai <yukuai1@huaweicloud.com>
To: Nilay Shroff <nilay@linux.ibm.com>, linux-block@vger.kernel.org
Cc: yi.zhang@redhat.com, hch@lst.de, ming.lei@redhat.com,
	yukuai1@huaweicloud.com, axboe@kernel.dk,
	shinichiro.kawasaki@wdc.com, gjoyce@ibm.com,
	"yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCHv3] block: restore two stage elevator switch while running nr_hw_queue update
Date: Fri, 25 Jul 2025 09:13:38 +0800	[thread overview]
Message-ID: <c7b3ad81-dff8-078a-df02-65a0df7f62d1@huaweicloud.com> (raw)
In-Reply-To: <20250724102540.1366308-1-nilay@linux.ibm.com>

Hi,

在 2025/07/24 18:01, Nilay Shroff 写道:
> The kmemleak reports memory leaks related to elevator resources that
> were originally allocated in the ->init_hctx() method. The following
> leak traces are observed after running blktests block/040:
> 
> unreferenced object 0xffff8881b82f7400 (size 512):
>    comm "check", pid 68454, jiffies 4310588881
>    hex dump (first 32 bytes):
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    backtrace (crc 5bac8b34):
>      __kvmalloc_node_noprof+0x55d/0x7a0
>      sbitmap_init_node+0x15a/0x6a0
>      kyber_init_hctx+0x316/0xb90
>      blk_mq_init_sched+0x419/0x580
>      elevator_switch+0x18b/0x630
>      elv_update_nr_hw_queues+0x219/0x2c0
>      __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>      blk_mq_update_nr_hw_queues+0x3a/0x60
>      0xffffffffc09ceb80
>      0xffffffffc09d7e0b
>      configfs_write_iter+0x2b1/0x470
>      vfs_write+0x527/0xe70
>      ksys_write+0xff/0x200
>      do_syscall_64+0x98/0x3c0
>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff8881b82f6000 (size 512):
>    comm "check", pid 68454, jiffies 4310588881
>    hex dump (first 32 bytes):
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    backtrace (crc 5bac8b34):
>      __kvmalloc_node_noprof+0x55d/0x7a0
>      sbitmap_init_node+0x15a/0x6a0
>      kyber_init_hctx+0x316/0xb90
>      blk_mq_init_sched+0x419/0x580
>      elevator_switch+0x18b/0x630
>      elv_update_nr_hw_queues+0x219/0x2c0
>      __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>      blk_mq_update_nr_hw_queues+0x3a/0x60
>      0xffffffffc09ceb80
>      0xffffffffc09d7e0b
>      configfs_write_iter+0x2b1/0x470
>      vfs_write+0x527/0xe70
>      ksys_write+0xff/0x200
>      do_syscall_64+0x98/0x3c0
>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
> unreferenced object 0xffff8881b82f5800 (size 512):
>    comm "check", pid 68454, jiffies 4310588881
>    hex dump (first 32 bytes):
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    backtrace (crc 5bac8b34):
>      __kvmalloc_node_noprof+0x55d/0x7a0
>      sbitmap_init_node+0x15a/0x6a0
>      kyber_init_hctx+0x316/0xb90
>      blk_mq_init_sched+0x419/0x580
>      elevator_switch+0x18b/0x630
>      elv_update_nr_hw_queues+0x219/0x2c0
>      __blk_mq_update_nr_hw_queues+0x36a/0x6f0
>      blk_mq_update_nr_hw_queues+0x3a/0x60
>      0xffffffffc09ceb80
>      0xffffffffc09d7e0b
>      configfs_write_iter+0x2b1/0x470
>      vfs_write+0x527/0xe70
> 
>      ksys_write+0xff/0x200
>      do_syscall_64+0x98/0x3c0
>      entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> The issue arises while we run nr_hw_queue update,  Specifically, we first
> reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and
> then later invoke elevator_switch() (assuming q->elevator is not NULL).
> The elevator switch code would first exit old elevator (elevator_exit)
> and then switches to the new elevator. The elevator_exit loops through
> each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(),
> which releases resources allocated during ->init_hctx().
> 
> This memleak manifests when we reduce the num of h/w queues - for example,
> when the initial update sets the number of queues to X, and a later update
> reduces it to Y, where Y < X. In this case, we'd loose the access to old
> hctxs while we get to elevator exit code because __blk_mq_realloc_hw_ctxs
> would have already released the old hctxs. As we don't now have any
> reference left to the old hctxs, we don't have any way to free the
> scheduler resources (which are allocate in ->init_hctx()) and kmemleak
> complains about it.
> 
> This issue was caused due to the commit 596dce110b7d ("block: simplify
> elevator reattachment for updating nr_hw_queues"). That change unified
> the two-stage elevator teardown and reattachment into a single call that
> occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs.
> 
> This patch restores the previous two-stage elevator switch logic during
> nr_hw_queues updates. First, the elevator is switched to 'none', which
> ensures all scheduler resources are properly freed. Then, the hardware
> contexts (hctxs) are reallocated, and the software-to-hardware queue
> mappings are updated. Finally, the original elevator is reattached. This
> sequence prevents loss of references to old hctxs and avoids the scheduler
> resource leaks reported by kmemleak.
> 
> Reported-by : Yi Zhang <yi.zhang@redhat.com>
> Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
> Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> Changes from v2:
>      - Simplified error handling by combining WARN_ON_ONCE() and the
>        return value check of xa_insert() into a single statement.
>        This makes code more concise and easier to read without changing
>        the behavior. (Yu Kuai)
>      - Reduced the scope of ->elevator_lock in elv_update_nr_hw_queues().
>        Since elevator_type does not need protection, the lock now only
>        covers the critical section where elevator_switch() is called.
>        (Yu Kuai)
> 
> Changes from v1:
>      - Updated commit message with kmemleak trace generated using null-blk
>        (Yi Zhang)
>      - The elevator module could be removed while nr_hw_queue update is
>        running, so protect elevator switch using elevator_get() and
>        elevator_put() (Ming Lei)
>      - Invoke elv_update_nr_hw_queues() from blk_mq_elv_switch_back() and
>        that way avoid elevator code restructuring in a patch which fixes
>        a regression. (Ming Lei)
> 
> ---
>   block/blk-mq.c   | 84 ++++++++++++++++++++++++++++++++++++++++++------
>   block/blk.h      |  2 +-
>   block/elevator.c | 10 +++---
>   3 files changed, 81 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..dec1cd4f1f5b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -4966,6 +4966,60 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>   	return ret;
>   }
>   
> +/*
> + * Switch back to the elevator type stored in the xarray.
> + */
> +static void blk_mq_elv_switch_back(struct request_queue *q,
> +		struct xarray *elv_tbl)
> +{
> +	struct elevator_type *e = xa_load(elv_tbl, q->id);
> +
> +	/* The elv_update_nr_hw_queues unfreezes the queue. */
> +	elv_update_nr_hw_queues(q, e);
> +
> +	/* Drop the reference acquired in blk_mq_elv_switch_none. */
> +	if (e)
> +		elevator_put(e);
> +}
> +
> +/*
> + * Stores elevator type in xarray and set current elevator to none. It uses
> + * q->id as an index to store the elevator type into the xarray.
> + */
> +static int blk_mq_elv_switch_none(struct request_queue *q,
> +		struct xarray *elv_tbl)
> +{
> +	int ret = 0;
> +
> +	lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock);
> +
> +	/*
> +	 * Accessing q->elevator without holding q->elevator_lock is safe here
> +	 * because we're called from nr_hw_queue update which is protected by
> +	 * set->update_nr_hwq_lock in the writer context. So, scheduler update/
> +	 * switch code (which acquires the same lock in the reader context)
> +	 * can't run concurrently.
> +	 */
> +	if (q->elevator) {
> +Emptly line here. Otherweise LGTM

Reviewed-by: Yu Kuai <yukuai3@huawei.com>

> +		ret = xa_insert(elv_tbl, q->id, q->elevator->type, GFP_KERNEL);
> +		if (WARN_ON_ONCE(ret))
> +			return ret;
> +
> +		/*
> +		 * Before we switch elevator to 'none', take a reference to
> +		 * the elevator module so that while nr_hw_queue update is
> +		 * running, no one can remove elevator module. We'd put the
> +		 * reference to elevator module later when we switch back
> +		 * elevator.
> +		 */
> +		__elevator_get(q->elevator->type);
> +
> +		elevator_set_none(q);
> +	}
> +	return ret;
> +}
> +
>   static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   							int nr_hw_queues)
>   {
> @@ -4973,6 +5027,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   	int prev_nr_hw_queues = set->nr_hw_queues;
>   	unsigned int memflags;
>   	int i;
> +	struct xarray elv_tbl;
>   
>   	lockdep_assert_held(&set->tag_list_lock);
>   
> @@ -4984,6 +5039,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   		return;
>   
>   	memflags = memalloc_noio_save();
> +
> +	xa_init(&elv_tbl);
> +
>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>   		blk_mq_debugfs_unregister_hctxs(q);
>   		blk_mq_sysfs_unregister_hctxs(q);
> @@ -4992,11 +5050,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   	list_for_each_entry(q, &set->tag_list, tag_set_list)
>   		blk_mq_freeze_queue_nomemsave(q);
>   
> -	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
> -		list_for_each_entry(q, &set->tag_list, tag_set_list)
> -			blk_mq_unfreeze_queue_nomemrestore(q);
> -		goto reregister;
> -	}
> +	/*
> +	 * Switch IO scheduler to 'none', cleaning up the data associated
> +	 * with the previous scheduler. We will switch back once we are done
> +	 * updating the new sw to hw queue mappings.
> +	 */
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		if (blk_mq_elv_switch_none(q, &elv_tbl))
> +			goto switch_back;
> +
> +	if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0)
> +		goto switch_back;
>   
>   fallback:
>   	blk_mq_update_queue_map(set);
> @@ -5016,12 +5080,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   		}
>   		blk_mq_map_swqueue(q);
>   	}
> -
> -	/* elv_update_nr_hw_queues() unfreeze queue for us */
> +switch_back:
> +	/* The blk_mq_elv_switch_back unfreezes queue for us. */
>   	list_for_each_entry(q, &set->tag_list, tag_set_list)
> -		elv_update_nr_hw_queues(q);
> +		blk_mq_elv_switch_back(q, &elv_tbl);
>   
> -reregister:
>   	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>   		blk_mq_sysfs_register_hctxs(q);
>   		blk_mq_debugfs_register_hctxs(q);
> @@ -5029,6 +5092,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>   		blk_mq_remove_hw_queues_cpuhp(q);
>   		blk_mq_add_hw_queues_cpuhp(q);
>   	}
> +
> +	xa_destroy(&elv_tbl);
> +
>   	memalloc_noio_restore(memflags);
>   
>   	/* Free the excess tags when nr_hw_queues shrink. */
> diff --git a/block/blk.h b/block/blk.h
> index 37ec459fe656..fae7653a941f 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -321,7 +321,7 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
>   
>   bool blk_insert_flush(struct request *rq);
>   
> -void elv_update_nr_hw_queues(struct request_queue *q);
> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e);
>   void elevator_set_default(struct request_queue *q);
>   void elevator_set_none(struct request_queue *q);
>   
> diff --git a/block/elevator.c b/block/elevator.c
> index ab22542e6cf0..9d81a06db6ec 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -689,21 +689,21 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx)
>    * The I/O scheduler depends on the number of hardware queues, this forces a
>    * reattachment when nr_hw_queues changes.
>    */
> -void elv_update_nr_hw_queues(struct request_queue *q)
> +void elv_update_nr_hw_queues(struct request_queue *q, struct elevator_type *e)
>   {
>   	struct elv_change_ctx ctx = {};
>   	int ret = -ENODEV;
>   
>   	WARN_ON_ONCE(q->mq_freeze_depth == 0);
>   
> -	mutex_lock(&q->elevator_lock);
> -	if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) {
> -		ctx.name = q->elevator->type->elevator_name;
> +	if (e && !blk_queue_dying(q) && blk_queue_registered(q)) {
> +		ctx.name = e->elevator_name;
>   
> +		mutex_lock(&q->elevator_lock);
>   		/* force to reattach elevator after nr_hw_queue is updated */
>   		ret = elevator_switch(q, &ctx);
> +		mutex_unlock(&q->elevator_lock);
>   	}
> -	mutex_unlock(&q->elevator_lock);
>   	blk_mq_unfreeze_queue_nomemrestore(q);
>   	if (!ret)
>   		WARN_ON_ONCE(elevator_change_done(q, &ctx));
> 


  reply	other threads:[~2025-07-25  1:13 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-24 10:01 [PATCHv3] block: restore two stage elevator switch while running nr_hw_queue update Nilay Shroff
2025-07-25  1:13 ` Yu Kuai [this message]
2025-07-25 12:13 ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c7b3ad81-dff8-078a-df02-65a0df7f62d1@huaweicloud.com \
    --to=yukuai1@huaweicloud.com \
    --cc=axboe@kernel.dk \
    --cc=gjoyce@ibm.com \
    --cc=hch@lst.de \
    --cc=linux-block@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=nilay@linux.ibm.com \
    --cc=shinichiro.kawasaki@wdc.com \
    --cc=yi.zhang@redhat.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox