From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7D2D03FE7 for ; Thu, 17 Jul 2025 14:46:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1752763567; cv=none; b=auiKuGSqIUsj5YL1YHC6dHriOL44CXGQbHCanBMS3Az8YJ3NqmIG9hfkEL/sZL6g7LJXOVgiDT1X475YyFbWbRe/oMwxherJWFqzcqWv77kuVxsGRm8Kx5jZYnu2LhI74EdX2xQTGwMq+HLPHldI7g8ipD4PPvmY5mx6eLKXvQ4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1752763567; c=relaxed/simple; bh=MnuaxBHOH9Zs8xJBcTdBuXRcNFStWRKDXN3wd/PligE=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=rNR2bBDU7XvruFzm7aMmM1acgkPEyFOcmLBtSVeaUahGpaElOKpx9b/dSCtI24tmRFMDQN8uuHHsOHF/11pt9HG5ZceyuJM8xIE3ocqmtW3O9FkHX0hoPRTWsbZcKeJsL7SQ7HMFI3QaZLi+r8X52fSICn7hS9/5zV53YgZ+viY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=KX8KcDkm; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="KX8KcDkm" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1752763560; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0IL4wuHEwR/yhOtiGJ7JBvXrn5iczMhD54QzUBZ8f28=; b=KX8KcDkmPGH+NA9jf0TkVvFOCudf3xso5P9Mq5youLq11ZRxH3GJuLRzln8EIKZHspADmp BfZA5jEc20gUGdWBICbJDfjliU3foOi7yZ7tvI5xUxdG2HN+grbDj2Qi9pGuNyDwmsYc2a 4Qx+2t775I/XKP6DZY6FRO42z1uyp2E= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-504-oGD5970lNpCfvS57UyyWhw-1; Thu, 17 Jul 2025 10:45:54 -0400 X-MC-Unique: oGD5970lNpCfvS57UyyWhw-1 X-Mimecast-MFC-AGG-ID: oGD5970lNpCfvS57UyyWhw_1752763552 Received: from mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.93]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 95F6818B6513; Thu, 17 Jul 2025 14:45:14 +0000 (UTC) Received: from fedora (unknown [10.72.116.47]) by mx-prod-int-06.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 9FD6E18003FC; Thu, 17 Jul 2025 14:45:02 +0000 (UTC) Date: Thu, 17 Jul 2025 22:44:57 +0800 From: Ming Lei To: Nilay Shroff Cc: linux-block@vger.kernel.org, yi.zhang@redhat.com, hch@lst.de, yukuai1@huaweicloud.com, axboe@kernel.dk, shinichiro.kawasaki@wdc.com, yukuai3@huawei.com, gjoyce@ibm.com Subject: Re: [PATCH] block: restore two stage elevator switch while running nr_hw_queue update Message-ID: References: <20250717141155.473362-1-nilay@linux.ibm.com> Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20250717141155.473362-1-nilay@linux.ibm.com> X-Scanned-By: MIMEDefang 3.4.1 on 10.30.177.93 On Thu, Jul 17, 2025 at 07:41:20PM +0530, Nilay Shroff wrote: > The kmemleak reports memory leaks related to elevator resources that > were originally allocated in the ->init_hctx() method. The following > leak traces are observed after running blktests: > > unreferenced object 0xffff8882e7fb9000 (size 2048): > comm "check", pid 10460, jiffies 4324980514 > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace (crc c47e6a37): > __kvmalloc_node_noprof+0x55d/0x7a0 > sbitmap_init_node+0x15a/0x6a0 > kyber_init_hctx+0x316/0xb90 > blk_mq_init_sched+0x416/0x580 > elevator_switch+0x18b/0x630 > elv_update_nr_hw_queues+0x219/0x2c0 > __blk_mq_update_nr_hw_queues+0x36a/0x6f0 > blk_mq_update_nr_hw_queues+0x3a/0x60 > find_fallback+0x510/0x540 [nbd] > nbd_send_cmd+0x24b/0x1480 [nbd] > configfs_write_iter+0x2ae/0x470 > vfs_write+0x524/0xe70 > ksys_write+0xff/0x200 > do_syscall_64+0x98/0x3c0 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > unreferenced object 0xffff8882e7fbb000 (size 2048): > comm "check", pid 10460, jiffies 4324980514 > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace (crc c47e6a37): > __kvmalloc_node_noprof+0x55d/0x7a0 > sbitmap_init_node+0x15a/0x6a0 > kyber_init_hctx+0x316/0xb90 > blk_mq_init_sched+0x416/0x580 > elevator_switch+0x18b/0x630 > elv_update_nr_hw_queues+0x219/0x2c0 > __blk_mq_update_nr_hw_queues+0x36a/0x6f0 > blk_mq_update_nr_hw_queues+0x3a/0x60 > find_fallback+0x510/0x540 [nbd] > nbd_send_cmd+0x24b/0x1480 [nbd] > configfs_write_iter+0x2ae/0x470 > vfs_write+0x524/0xe70 > ksys_write+0xff/0x200 > do_syscall_64+0x98/0x3c0 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > unreferenced object 0xffff88819e855000 (size 2048): > comm "check", pid 10460, jiffies 4324980514 > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace (crc c47e6a37): > __kvmalloc_node_noprof+0x55d/0x7a0 > sbitmap_init_node+0x15a/0x6a0 > kyber_init_hctx+0x316/0xb90 > blk_mq_init_sched+0x416/0x580 > elevator_switch+0x18b/0x630 > elv_update_nr_hw_queues+0x219/0x2c0 > __blk_mq_update_nr_hw_queues+0x36a/0x6f0 > blk_mq_update_nr_hw_queues+0x3a/0x60 > find_fallback+0x510/0x540 [nbd] > nbd_send_cmd+0x24b/0x1480 [nbd] > configfs_write_iter+0x2ae/0x470 > vfs_write+0x524/0xe70 > ksys_write+0xff/0x200 > do_syscall_64+0x98/0x3c0 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > The issue arises while we run nr_hw_queue update, Specifically, we first > reallocate hardware contexts (hctx) via __blk_mq_realloc_hw_ctxs(), and > then later invoke elevator_switch() (assuming q->elevator is not NULL). > The elevator switch code would first exit old elevator (elevator_exit) > and then switches to the new elevator. The elevator_exit loops through > each hctx and invokes the elevator’s per-hctx exit method ->exit_hctx(), > which releases resources allocated during ->init_hctx(). > > This memleak manifests when we reduce the num of h/w queues - for example, > when the initial update sets the number of queues to X, and a later update > reduces it to Y, where Y < X. In this case, we'd loose the access to old > hctxs while we get to elevator exit code because __blk_mq_realloc_hw_ctxs > would have already released the old hctxs. As we don't now have any > reference left to the old hctxs, we don't have any way to free the > scheduler resources (which are allocate in ->init_hctx()) and kmemleak > complains about it. > > This issue was caused due to the commit 596dce110b7d ("block: simplify > elevator reattachment for updating nr_hw_queues"). That change unified > the two-stage elevator teardown and reattachment into a single call that > occurs after __blk_mq_realloc_hw_ctxs() has already freed the hctxs. > > This patch restores the previous two-stage elevator switch logic during > nr_hw_queues updates. First, the elevator is switched to 'none', which > ensures all scheduler resources are properly freed. Then, the hardware > contexts (hctxs) are reallocated, and the software-to-hardware queue > mappings are updated. Finally, the original elevator is reattached. This > sequence prevents loss of references to old hctxs and avoids the scheduler > resource leaks reported by kmemleak. > > Reported-by : Yi Zhang > Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues") > Closes: https://lore.kernel.org/all/CAHj4cs8oJFvz=daCvjHM5dYCNQH4UXwSySPPU4v-WHce_kZXZA@mail.gmail.com/ > Signed-off-by: Nilay Shroff > --- > block/blk-mq.c | 87 +++++++++++++++++++++++++++++++++++++++++++----- > block/blk.h | 4 ++- > block/elevator.c | 40 ++-------------------- > block/elevator.h | 11 ++++++ > 4 files changed, 94 insertions(+), 48 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 4806b867e37d..b26cbf2a86dd 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -4966,6 +4966,63 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr) > return ret; > } > > +/* > + * Switch back to the elevator name stored in the xarray. > + */ > +static void blk_mq_elv_switch_back(struct request_queue *q, > + struct xarray *elv_tbl) > +{ > + struct elv_change_ctx ctx = {}; > + int ret = -ENODEV; > + char *elevator_name = xa_load(elv_tbl, q->id); > + > + WARN_ON_ONCE(q->mq_freeze_depth == 0); > + > + mutex_lock(&q->elevator_lock); > + if (elevator_name && !blk_queue_dying(q) && blk_queue_registered(q)) { > + ctx.name = elevator_name; > + > + /* force to reattach elevator after nr_hw_queue is updated */ > + ret = elevator_switch(q, &ctx); > + } > + mutex_unlock(&q->elevator_lock); > + blk_mq_unfreeze_queue_nomemrestore(q); > + if (!ret) > + WARN_ON_ONCE(elevator_change_done(q, &ctx)); > +} > + > +/* > + * Stores elevator name in xarray and set current elevator to none. It uses > + * q->id as an index to store the elevator name into the xarray. > + */ > +static int blk_mq_elv_switch_none(struct request_queue *q, > + struct xarray *elv_tbl) > +{ > + int ret = 0; > + > + lockdep_assert_held_write(&q->tag_set->update_nr_hwq_lock); > + > + /* > + * Accessing q->elevator without holding q->elevator_lock is safe here > + * because we're called from nr_hw_queue update which is protected by > + * set->update_nr_hwq_lock in the writer context. So, scheduler update/ > + * switch code (which acquires the same lock in the reader context) > + * can't run concurrently. > + */ > + if (q->elevator) { > + char *elevator_name = (char *)q->elevator->type->elevator_name; > + > + ret = xa_insert(elv_tbl, q->id, elevator_name, GFP_KERNEL); This way isn't memory safe, because elevator module can be reloaded during updating nr_hw_queues. One solution is to build a string<->int mapping table, and store the (int) index of elevator type to xarray. > + if (ret) { > + WARN_ON_ONCE(1); > + goto out; > + } > + elevator_set_none(q); > + } > +out: > + return ret; > +} > + > static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > int nr_hw_queues) > { > @@ -4973,6 +5030,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > int prev_nr_hw_queues = set->nr_hw_queues; > unsigned int memflags; > int i; > + struct xarray elv_tbl; > > lockdep_assert_held(&set->tag_list_lock); > > @@ -4984,6 +5042,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > return; > > memflags = memalloc_noio_save(); > + > + xa_init(&elv_tbl); > + > list_for_each_entry(q, &set->tag_list, tag_set_list) { > blk_mq_debugfs_unregister_hctxs(q); > blk_mq_sysfs_unregister_hctxs(q); > @@ -4992,11 +5053,17 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > list_for_each_entry(q, &set->tag_list, tag_set_list) > blk_mq_freeze_queue_nomemsave(q); > > - if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) { > - list_for_each_entry(q, &set->tag_list, tag_set_list) > - blk_mq_unfreeze_queue_nomemrestore(q); > - goto reregister; > - } > + /* > + * Switch IO scheduler to 'none', cleaning up the data associated > + * with the previous scheduler. We will switch back once we are done > + * updating the new sw to hw queue mappings. > + */ > + list_for_each_entry(q, &set->tag_list, tag_set_list) > + if (blk_mq_elv_switch_none(q, &elv_tbl)) > + goto switch_back; > + > + if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) > + goto switch_back; > > fallback: > blk_mq_update_queue_map(set); > @@ -5016,12 +5083,11 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > } > blk_mq_map_swqueue(q); > } > - > - /* elv_update_nr_hw_queues() unfreeze queue for us */ > +switch_back: > + /* blk_mq_elv_switch_back() unfreezes queue for us */ > list_for_each_entry(q, &set->tag_list, tag_set_list) > - elv_update_nr_hw_queues(q); > + blk_mq_elv_switch_back(q, &elv_tbl); > > -reregister: > list_for_each_entry(q, &set->tag_list, tag_set_list) { > blk_mq_sysfs_register_hctxs(q); > blk_mq_debugfs_register_hctxs(q); > @@ -5029,6 +5095,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, > blk_mq_remove_hw_queues_cpuhp(q); > blk_mq_add_hw_queues_cpuhp(q); > } > + > + xa_destroy(&elv_tbl); > + > memalloc_noio_restore(memflags); > > /* Free the excess tags when nr_hw_queues shrink. */ > diff --git a/block/blk.h b/block/blk.h > index 37ec459fe656..35ec19a085fd 100644 > --- a/block/blk.h > +++ b/block/blk.h > @@ -12,6 +12,7 @@ > #include "blk-crypto-internal.h" > > struct elevator_type; > +struct elv_change_ctx; > > #define BLK_DEV_MAX_SECTORS (LLONG_MAX >> 9) > #define BLK_MIN_SEGMENT_SIZE 4096 > @@ -321,9 +322,10 @@ bool blk_bio_list_merge(struct request_queue *q, struct list_head *list, > > bool blk_insert_flush(struct request *rq); > > -void elv_update_nr_hw_queues(struct request_queue *q); > void elevator_set_default(struct request_queue *q); > void elevator_set_none(struct request_queue *q); > +int elevator_change_done(struct request_queue *q, struct elv_change_ctx *ctx); > +int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx); > > ssize_t part_size_show(struct device *dev, struct device_attribute *attr, > char *buf); > diff --git a/block/elevator.c b/block/elevator.c > index ab22542e6cf0..016fe9c08c8c 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -45,17 +45,6 @@ > #include "blk-wbt.h" > #include "blk-cgroup.h" > > -/* Holding context data for changing elevator */ > -struct elv_change_ctx { > - const char *name; > - bool no_uevent; > - > - /* for unregistering old elevator */ > - struct elevator_queue *old; > - /* for registering new elevator */ > - struct elevator_queue *new; > -}; > - > static DEFINE_SPINLOCK(elv_list_lock); > static LIST_HEAD(elv_list); > > @@ -570,7 +559,7 @@ EXPORT_SYMBOL_GPL(elv_unregister); > * If switching fails, we are most likely running out of memory and not able > * to restore the old io scheduler, so leaving the io scheduler being none. > */ > -static int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx) > +int elevator_switch(struct request_queue *q, struct elv_change_ctx *ctx) > { > struct elevator_type *new_e = NULL; > int ret = 0; > @@ -631,8 +620,7 @@ static void elv_exit_and_release(struct request_queue *q) > kobject_put(&e->kobj); > } > > -static int elevator_change_done(struct request_queue *q, > - struct elv_change_ctx *ctx) > +int elevator_change_done(struct request_queue *q, struct elv_change_ctx *ctx) > { > int ret = 0; > > @@ -685,30 +673,6 @@ static int elevator_change(struct request_queue *q, struct elv_change_ctx *ctx) > return ret; > } > > -/* > - * The I/O scheduler depends on the number of hardware queues, this forces a > - * reattachment when nr_hw_queues changes. > - */ > -void elv_update_nr_hw_queues(struct request_queue *q) > -{ > - struct elv_change_ctx ctx = {}; > - int ret = -ENODEV; > - > - WARN_ON_ONCE(q->mq_freeze_depth == 0); > - > - mutex_lock(&q->elevator_lock); > - if (q->elevator && !blk_queue_dying(q) && blk_queue_registered(q)) { > - ctx.name = q->elevator->type->elevator_name; > - > - /* force to reattach elevator after nr_hw_queue is updated */ > - ret = elevator_switch(q, &ctx); > - } > - mutex_unlock(&q->elevator_lock); > - blk_mq_unfreeze_queue_nomemrestore(q); > - if (!ret) > - WARN_ON_ONCE(elevator_change_done(q, &ctx)); > -} > - > /* > * Use the default elevator settings. If the chosen elevator initialization > * fails, fall back to the "none" elevator (no elevator). > diff --git a/block/elevator.h b/block/elevator.h > index a07ce773a38f..440b6e766848 100644 > --- a/block/elevator.h > +++ b/block/elevator.h > @@ -85,6 +85,17 @@ struct elevator_type > struct list_head list; > }; > > +/* Holding context data for changing elevator */ > +struct elv_change_ctx { > + const char *name; > + bool no_uevent; > + > + /* for unregistering old elevator */ > + struct elevator_queue *old; > + /* for registering new elevator */ > + struct elevator_queue *new; > +}; > + You may avoid the big chunk of code move for both `elv_change_ctx` and `elv_update_nr_hw_queues()`, not sure why you did that in a bug fix patch. Thanks, Ming