From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6EE4636B00 for ; Mon, 20 Nov 2023 20:01:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux-foundation.org header.i=@linux-foundation.org header.b="tBqvHymb" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5989AC433C8; Mon, 20 Nov 2023 20:01:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linux-foundation.org; s=korg; t=1700510461; bh=biW/bVGnRij9I2dtfarXJBwQG+bK+6YfrHHXjNnRcqk=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=tBqvHymbJhTUUeKJhYJ1zIxZgWcGaOQ3jNVvGf5SBCrH2TEz16QpYtyZ2qiwDEG8t 9eXkoqI92DPoAUCAnimFT+g/P3R2oH/KfI8gvXcAYbwe6ipA8r1ZYllhlmEyl1x6aF ecfZ4iSYbwr2eS9Tb4Kbt1ltQF+m3fCR9yFVN/Cs= Date: Mon, 20 Nov 2023 12:00:59 -0800 From: Andrew Morton To: Ming Lei Cc: Thomas Gleixner , linux-kernel@vger.kernel.org, Keith Busch , linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, Yi Zhang , Guangwu Zhang , Chengming Zhou , Jens Axboe Subject: Re: [PATCH V4 resend] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly Message-Id: <20231120120059.ef0614c2295b2102100cb56e@linux-foundation.org> In-Reply-To: <20231120083559.285174-1-ming.lei@redhat.com> References: <20231120083559.285174-1-ming.lei@redhat.com> X-Mailer: Sylpheed 3.8.0beta1 (GTK+ 2.24.33; x86_64-pc-linux-gnu) Precedence: bulk X-Mailing-List: linux-block@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit On Mon, 20 Nov 2023 16:35:59 +0800 Ming Lei wrote: > group_cpus_evenly() could be part of storage driver's error handler, > such as nvme driver, when may happen during CPU hotplug, in which > storage queue has to drain its pending IOs because all CPUs associated > with the queue are offline and the queue is becoming inactive. And > handling IO needs error handler to provide forward progress. > > Then dead lock is caused: > > 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's > handler is waiting for inflight IO > > 2) error handler is waiting for CPU hotplug lock > > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because > error handling can't provide forward progress. > > Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(), > in which two stage spreads are taken: 1) the 1st stage is over all present > CPUs; 2) the end stage is over all other CPUs. > > Turns out the two stage spread just needs consistent 'cpu_present_mask', and > remove the CPU hotplug lock by storing it into one local cache. This way > doesn't change correctness, because all CPUs are still covered. I'm not sure what is the intended merge path for this, but I can do lib/. Do you think that a -stable backport is needed? It sounds that way. If so, are we able to identify a suitable Fixes: target? That would predate f7b3ea8cf72f3 ("genirq/affinity: Move group_cpus_evenly() into lib/").