From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from fanzine2.igalia.com (fanzine2.igalia.com [213.97.179.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 46C0831C571 for ; Wed, 15 Oct 2025 11:05:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=213.97.179.56 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1760526325; cv=none; b=TPUhO2HvIknppw5YP+ytwTfEDVngRK/vGLSR/ztjb1I3rP/ymgu62zUVE0dexAPw+94i21Z0gVCoXpGeR/++xxZIhIuqrqxKKz+eAnwXVaORtnSxd8gMw3+kpVcnKDrqsAHzyDlVXWlvdd1zqUhMuD1TcE9VVoxhuwDtkDx84u0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1760526325; c=relaxed/simple; bh=E8X0bAwEL3TRMOp9bJugjpKF8gcUTiSQJlLEbw3nGn0=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=W644srM0/n1p0cq74QE1K8db3XWorEmwjUshQCLQBJfouAIJptAeo5G+CvgdApfQ+SZu4VMaloLEPlbJdYVp3cEA1chylOE/9M4v+XCrjc01JDOhVNZ5B52dfSMsQiNheEElw4U3L3W678jMAY0tJBQfFb9GcORe0ECU9DqJY50= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=igalia.com; spf=pass smtp.mailfrom=igalia.com; dkim=pass (2048-bit key) header.d=igalia.com header.i=@igalia.com header.b=jooJp/Q/; arc=none smtp.client-ip=213.97.179.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=igalia.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=igalia.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=igalia.com header.i=@igalia.com header.b="jooJp/Q/" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:MIME-Version:Message-ID: Date:References:In-Reply-To:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=ISy995E/NrxWDs1k40usNcKnHZHjV9kWpKZ1uJ5ohNs=; b=jooJp/Q/s3UBBMyxYea44aj1WL nJrvoSGzeuAqtY4DPkV+RtzmlxOcgD2pDBZ31cGg+98IiPXZ6PmP4xBB+8hkF0F01jjXhWPXRMqZE QY9SjJ74DkboviG1unOE0YzjVDoDOolAzpjVSxC6K25dy0vJnYlbRrki/oihxv5NB34q2M7ZjZ6k4 FcVZNAyg1MAIQu9o/DF1XkbbyFDQbEnzZ5QRtud3mD3CBOjWUKdzAJ4wpO73tjc1H3IdTSr2ejcYT uLQB0cDFHP8kyqsJxSZ6/YLaKQZCrVhT6hf1tHPO7qhd0U+1sD0uNzkLMGxLLx5m1AiXQJaXsruWf pp8xaXtA==; Received: from bl17-145-117.dsl.telepac.pt ([188.82.145.117] helo=localhost) by fanzine2.igalia.com with utf8esmtpsa (Cipher TLS1.3:ECDHE_SECP256R1__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim) id 1v8zJj-009urt-OX; Wed, 15 Oct 2025 13:05:15 +0200 From: Luis Henriques To: Bernd Schubert Cc: Miklos Szeredi , Joanne Koong , "linux-fsdevel@vger.kernel.org" , Gang He Subject: Re: [PATCH v3 6/6] fuse: {io-uring} Queue background requests on a different core In-Reply-To: <5e3b8848-2049-4321-82e7-2dec658d6936@ddn.com> (Bernd Schubert's message of "Wed, 15 Oct 2025 10:27:44 +0000") References: <20251013-reduced-nr-ring-queues_3-v3-0-6d87c8aa31ae@ddn.com> <20251013-reduced-nr-ring-queues_3-v3-6-6d87c8aa31ae@ddn.com> <87jz0w8pdp.fsf@igalia.com> <5e3b8848-2049-4321-82e7-2dec658d6936@ddn.com> Date: Wed, 15 Oct 2025 12:05:10 +0100 Message-ID: <87frbk8ly1.fsf@igalia.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On Wed, Oct 15 2025, Bernd Schubert wrote: > On 10/15/25 11:50, Luis Henriques wrote: >> On Mon, Oct 13 2025, Bernd Schubert wrote: >>=20 >>> Running background IO on a different core makes quite a difference. >>> >>> fio --directory=3D/tmp/dest --name=3Diops.\$jobnum --rw=3Drandread \ >>> --bs=3D4k --size=3D1G --numjobs=3D1 --iodepth=3D4 --time_based\ >>> --runtime=3D30s --group_reporting --ioengine=3Dio_uring\ >>> --direct=3D1 >>> >>> unpatched >>> READ: bw=3D272MiB/s (285MB/s), 272MiB/s-272MiB/s ... >>> patched >>> READ: bw=3D760MiB/s (797MB/s), 760MiB/s-760MiB/s ... >>> >>> With --iodepth=3D8 >>> >>> unpatched >>> READ: bw=3D466MiB/s (489MB/s), 466MiB/s-466MiB/s ... >>> patched >>> READ: bw=3D966MiB/s (1013MB/s), 966MiB/s-966MiB/s ... >>> 2nd run: >>> READ: bw=3D1014MiB/s (1064MB/s), 1014MiB/s-1014MiB/s ... >>> >>> Without io-uring (--iodepth=3D8) >>> READ: bw=3D729MiB/s (764MB/s), 729MiB/s-729MiB/s ... >>> >>> Without fuse (--iodepth=3D8) >>> READ: bw=3D2199MiB/s (2306MB/s), 2199MiB/s-2199MiB/s ... >>> >>> (Test were done with >>> /example/passthrough_hp -o allow_other --nopassthrough \ >>> [-o io_uring] /tmp/source /tmp/dest >>> ) >>> >>> Additional notes: >>> >>> With FURING_NEXT_QUEUE_RETRIES=3D0 (--iodepth=3D8) >>> READ: bw=3D903MiB/s (946MB/s), 903MiB/s-903MiB/s ... >>> >>> With just a random qid (--iodepth=3D8) >>> READ: bw=3D429MiB/s (450MB/s), 429MiB/s-429MiB/s ... >>> >>> With --iodepth=3D1 >>> unpatched >>> READ: bw=3D195MiB/s (204MB/s), 195MiB/s-195MiB/s ... >>> patched >>> READ: bw=3D232MiB/s (243MB/s), 232MiB/s-232MiB/s ... >>> >>> With --iodepth=3D1 --numjobs=3D2 >>> unpatched >>> READ: bw=3D966MiB/s (1013MB/s), 966MiB/s-966MiB/s ... >>> patched >>> READ: bw=3D1821MiB/s (1909MB/s), 1821MiB/s-1821MiB/s ... >>> >>> With --iodepth=3D1 --numjobs=3D8 >>> unpatched >>> READ: bw=3D1138MiB/s (1193MB/s), 1138MiB/s-1138MiB/s ... >>> patched >>> READ: bw=3D1650MiB/s (1730MB/s), 1650MiB/s-1650MiB/s ... >>> fuse without io-uring >>> READ: bw=3D1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ... >>> no-fuse >>> READ: bw=3D2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ... >>> >>> In summary, for async requests the core doing application IO is busy >>> sending requests and processing IOs should be done on a different core. >>> Spreading the load on random cores is also not desirable, as the core >>> might be frequency scaled down and/or in C1 sleep states. Not shown her= e, >>> but differnces are much smaller when the system uses performance govenor >>> instead of schedutil (ubuntu default). Obviously at the cost of higher >>> system power consumption for performance govenor - not desirable either. >>> >>> Results without io-uring (which uses fixed libfuse threads per queue) >>> heavily depend on the current number of active threads. Libfuse uses >>> default of max 10 threads, but actual nr max threads is a parameter. >>> Also, no-fuse-io-uring results heavily depend on, if there was already >>> running another workload before, as libfuse starts these threads >>> dynamically - i.e. the more threads are active, the worse the >>> performance. >>> >>> Signed-off-by: Bernd Schubert >>> --- >>> fs/fuse/dev_uring.c | 53 +++++++++++++++++++++++++++++++++++++++++++++= +------- >>> 1 file changed, 46 insertions(+), 7 deletions(-) >>> >>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c >>> index aca71ce5632efd1d80e3ac0ad4e81ac1536dbc47..f35dd98abfe6407849fec55= 847c6b3d186383803 100644 >>> --- a/fs/fuse/dev_uring.c >>> +++ b/fs/fuse/dev_uring.c >>> @@ -23,6 +23,7 @@ MODULE_PARM_DESC(enable_uring, >>> #define FURING_Q_LOCAL_THRESHOLD 2 >>> #define FURING_Q_NUMA_THRESHOLD (FURING_Q_LOCAL_THRESHOLD + 1) >>> #define FURING_Q_GLOBAL_THRESHOLD (FURING_Q_LOCAL_THRESHOLD * 2) >>> +#define FURING_NEXT_QUEUE_RETRIES 2 >>=20 >> Some bikeshedding: >>=20 >> Maybe that's just me, but I associate the names above (FURING_*) to 'fur' >> -- the action of making 'fur'. I'm not sure that verb even exists, but = my >> brain makes me dislike those names :-) >>=20 >> But I'm not a native speaker, and I don't have any other objection to >> those names rather than "I don't like fur!" so feel free to ignore me. := -) > > I had initially called it "FUSE_URING", but it seemed to be a bit long. > I can change back to that or maybe you have a better short name in your > mind (as usual the hardest part is to find good variable names)? Yeah, the only alternative would be to either drop 'FUSE' or 'URING' from the name: FUSE_Q_LOCAL_THRESHOLD or URING_Q_LOCAL_THRESHOLD. But as I said: this is plain bikeshedding. If no one else complains about these definitions, just leave them as-is. >>>=20=20 >>> bool fuse_uring_enabled(void) >>> { >>> @@ -1302,12 +1303,15 @@ static struct fuse_ring_queue *fuse_uring_best_= queue(const struct cpumask *mask, >>> /* >>> * Get the best queue for the current CPU >>> */ >>> -static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *= ring) >>> +static struct fuse_ring_queue *fuse_uring_get_queue(struct fuse_ring *= ring, >>> + bool background) >>> { >>> unsigned int qid; >>> struct fuse_ring_queue *local_queue, *best_numa, *best_global; >>> int local_node; >>> const struct cpumask *numa_mask, *global_mask; >>> + int retries =3D 0; >>> + int weight =3D -1; >>>=20=20 >>> qid =3D task_cpu(current); >>> if (WARN_ONCE(qid >=3D ring->max_nr_queues, >>> @@ -1315,16 +1319,50 @@ static struct fuse_ring_queue *fuse_uring_get_q= ueue(struct fuse_ring *ring) >>> ring->max_nr_queues)) >>> qid =3D 0; >>>=20=20 >>> - local_queue =3D READ_ONCE(ring->queues[qid]); >>> local_node =3D cpu_to_node(qid); >>> if (WARN_ON_ONCE(local_node > ring->nr_numa_nodes)) >>> local_node =3D 0; >>>=20=20 >>> - /* Fast path: if local queue exists and is not overloaded, use it */ >>> - if (local_queue && >>> - READ_ONCE(local_queue->nr_reqs) <=3D FURING_Q_LOCAL_THRESHOLD) >>> + local_queue =3D READ_ONCE(ring->queues[qid]); >>> + >>> +retry: >>> + /* >>> + * For background requests, try next CPU in same NUMA domain. >>> + * I.e. cpu-0 creates async requests, cpu-1 io processes. >>> + * Similar for foreground requests, when the local queue does not >>> + * exist - still better to always wake the same cpu id. >>> + */ >>> + if (background || !local_queue) { >>> + numa_mask =3D ring->numa_registered_q_mask[local_node]; >>> + >>> + if (weight =3D=3D -1) >>> + weight =3D cpumask_weight(numa_mask); >>> + >>> + if (weight =3D=3D 0) >>> + goto global; >>> + >>> + if (weight > 1) { >>> + int idx =3D (qid + 1) % weight; >>> + >>> + qid =3D cpumask_nth(idx, numa_mask); >>> + } else { >>> + qid =3D cpumask_first(numa_mask); >>> + } >>> + >>> + local_queue =3D READ_ONCE(ring->queues[qid]); >>> + if (WARN_ON_ONCE(!local_queue)) >>> + return NULL; >>> + } >>> + >>> + if (READ_ONCE(local_queue->nr_reqs) <=3D FURING_Q_NUMA_THRESHOLD) >>> return local_queue; >>>=20=20 >>> + if (retries < FURING_NEXT_QUEUE_RETRIES && weight > retries + 1) { >>> + retries++; >>> + local_queue =3D NULL; >>> + goto retry; >>> + } >>> + >>=20 >> I wonder if this retry loop is really useful. If I understand this >> correctly, we're doing a busy loop, hoping for a better queue to become >> available. But if the system is really busy doing IO this retry loop wi= ll >> most of the times fail and will fall back to the next option -- only once >> in a while we will get a better one. >>=20 >> Do you have evidence that this could be helpful? Or am I misunderstandi= ng >> the purpose of this retry loop? > > Yeah, I got better results with the retry, because using random queues > really doesn't work well - wakes up cores on random queues and wakes and > doesn't accumulate on the same queue - doesn't make use of the ring > advantage. So random should be avoided, if possible. Also, the more > queues the system has, the worse the results with the plain random > fallback. I can provide some numbers later on today. >From my side, there's no need to provide numbers -- I'm happy with your answer. I just thought that looping could be more expensive than just picking whatever queue was selected, because in the end that would likely be the end result anyway. But I haven't tested it and I understand that this design can definitely have impact in big systems. Cheers, --=20 Lu=C3=ADs