Date: Sat, 4 Jul 2026 11:01:18 -0600
From: Keith Busch
To: Ben Carey
Cc: Jens Axboe, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [BUG] RCU hang with io_uring nvme polling
References: <20260626150946.287781-1-benjamin.james.carey3@gmail.com>
 <85d1f999-7778-4c74-9d72-b8ac8500de31@kernel.dk>
 <1932a509-4e27-485e-8e09-1da67e0082c8@kernel.dk>
 <94614dd9-9351-4a64-83dc-4fc87e377e59@kernel.dk>
X-Mailing-List: io-uring@vger.kernel.org

On Fri, Jul 03, 2026 at 01:20:24PM -0400, Ben Carey wrote:
> While testing the patch we decided to trace the amount of time a workload
> spends in blk_hctx_poll. We found that, for a test case with 8 jobs running for
> 10 seconds, it spent ~71% of its runtime in that function alone. We ran this
> test with an Intel Optane mounted with NVMe over PCIe target but have observed
> similar behavior on a VM, measured by:
>
> perf record -F 99 -a -g -- \
> fio --bs=1K --direct=1 --iodepth=1 --runtime=10 --rw=randread --time_based \
> --ioengine=io_uring --hipri=1 --fixedbufs=0 --registerfiles=0 \
> --sqthread_poll=0 \
> --numjobs=8 --name=job0 --output-format=json --clocksource=clock_gettime \
> --filename=/dev/nvme0n1
>
> Again, this was tested with nvme.poll_queues=1, but similar behavior occurs
> with higher poll_queues, and also on a VM.
>
> This bug seems to pollute our experimental results, and thus stands as
> something needing to be fixed for us to continue our research. Do you all think
> there's a different solution than the timeout?

What exactly do you have in mind? Shouldn't you expect to spend most of your
CPU time in the polling loop?
As long as you keep the queues busy, there's something to poll, so
blk_hctx_poll is exactly where you want to see the software be in a perf
report. Seeing high poll CPU utilization means the software is efficient
compared to the hardware.

If we spend very little time in the polling loop, then either you have
incredibly quick hardware (and let's face it, Optane SSDs are EOL and a
generation behind on link speeds, so that's not gonna get there anymore), or
our software dispatch stack has an inefficiency somewhere.

If you have many pollers competing for a very lightly utilized queue, then I
think you have an application-level problem: the workload is mismatched to the
feature.

If you want to spend less time in the poll loop, then set the hybrid poll sleep
time. It should result in less polling time, but it'll push your average
latency higher.

The only case where the jiffies timeout may show a problem is when you stop
dispatching. That should only affect the time to close the ring when it lost
the polling race with a peer on the last IO it is looking for; it should not
affect individual IO latency.
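
To put rough numbers on the argument above, here is a minimal back-of-the-envelope sketch, not anything taken from the kernel. It assumes a simplified queue-depth-1 model in which each IO costs some fixed software time (submit plus completion handling) and the rest of the time is spent waiting on the device, all of it spinning in the poll loop. The function names and the latencies used are hypothetical, picked only for illustration.

```python
def poll_fraction(device_us: float, software_us: float) -> float:
    """Fraction of per-IO time spent spinning in the poll loop.

    Simplified model (an assumption, not kernel behavior): one IO in flight,
    `software_us` of submit/complete work, and `device_us` of device latency
    that is spent entirely busy-polling.
    """
    if device_us < 0 or software_us < 0 or device_us + software_us == 0:
        raise ValueError("latencies must be non-negative and not both zero")
    return device_us / (device_us + software_us)


def hybrid_poll(device_us: float, software_us: float, sleep_us: float):
    """Return (poll fraction, per-IO latency) when sleeping before polling.

    Same model, except the poller sleeps `sleep_us` before it starts to spin.
    Wakeup cost is ignored, so real latency would be somewhat worse.
    """
    spin_us = max(device_us - sleep_us, 0.0)            # time left to spin after the sleep
    latency_us = software_us + max(device_us, sleep_us)  # oversleeping delays completion
    return spin_us / latency_us, latency_us


# Hypothetical latencies in microseconds.
print(poll_fraction(8, 2))    # slower device relative to software: mostly polling
print(poll_fraction(2, 2))    # faster device: polling share drops
print(hybrid_poll(8, 2, 4))   # sleep shorter than device latency: less spinning
print(hybrid_poll(8, 2, 12))  # sleep overshoots: no spinning, higher latency
```

In this model the poll share only falls if the device gets faster, the software gets slower, or the poller sleeps part of the wait, which is the tradeoff described above.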