* [PATCH 0/2] virtio-fs: introduce multi-queue support
From: Peter-Jan Gootzen @ 2024-05-01 15:38 UTC (permalink / raw)
To: virtualization, vgoyal, stefanha, miklos
Cc: mst, dgilbert, yorayz, gurtovoy, Peter-Jan Gootzen

In this patch set we are adding multi-queue support in the virtio-fs
driver to enhance performance.
While multi-queue functionality has been in the virtio-fs specification
since its inception, the current implementation in the Linux kernel
creates multiple queues, but only enqueues requests on the first request
queue.

The first commit in this patch set limits the number of request queues
to the number of CPUs in the system, to reduce unnecessary resource
consumption on the device-side.
The second commit adds a multi-queue mapping to the virtio-fs device
driver that is created using MSI-X affinity mappings or, in case of no
existing mappings, by evenly grouping the CPUs and dividing the queues
over these groups.
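
Concretely, the mapping boils down to a per-CPU lookup table that the
driver consults on the request submission path. A condensed sketch of
that lookup is shown below; the helper name is illustrative only, and
the details follow from the second patch:

/* Condensed sketch of the per-CPU queue selection. The helper name is
 * made up for illustration; in the actual patch this logic sits inline
 * in virtio_fs_wake_pending_and_unlock().
 */
static unsigned int virtio_fs_queue_for_cpu(struct virtio_fs *fs)
{
        /* mq_map[] is filled once at probe time, either from the MSI-X
         * affinity masks or via group_cpus_evenly(); VQ_REQUEST skips
         * the hiprio queue that precedes the request queues.
         */
        return VQ_REQUEST + fs->mq_map[raw_smp_processor_id()];
}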
Future work in this direction could include:
- CPU hotplug support
- In case of no MSI-X, using the created multi-queue mapping to set
  interrupt affinities

See the commit message of the second patch for performance results.

Peter-Jan Gootzen (2):
  virtio-fs: limit number of request queues
  virtio-fs: add multi-queue support

 fs/fuse/virtio_fs.c | 73 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 65 insertions(+), 8 deletions(-)

--
2.34.1

* [PATCH 1/2] virtio-fs: limit number of request queues
From: Peter-Jan Gootzen @ 2024-05-01 15:38 UTC (permalink / raw)
To: virtualization, vgoyal, stefanha, miklos
Cc: mst, dgilbert, yorayz, gurtovoy, Peter-Jan Gootzen, Yoray Zack, Max Gurtovoy

Virtio-fs devices might allocate significant resources to virtio queues
such as CPU cores that busy poll on the queue. The device indicates how
many request queues it can support and the driver should initialize the
number of queues that they want to utilize.

In this patch we limit the number of initialized request queues to the
number of CPUs, to limit the resource consumption on the device-side
and to prepare for the upcoming multi-queue patch.

Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Suggested-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 fs/fuse/virtio_fs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 322af827a232..2745fa484dcb 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -751,6 +751,9 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
         if (fs->num_request_queues == 0)
                 return -EINVAL;
 
+        /* Truncate nr of request queues to nr_cpu_id */
+        fs->num_request_queues = min_t(unsigned int, fs->num_request_queues,
+                                       nr_cpu_ids);
         fs->nvqs = VQ_REQUEST + fs->num_request_queues;
         fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
         if (!fs->vqs)
--
2.34.1

* Re: [PATCH 1/2] virtio-fs: limit number of request queues
From: Stefan Hajnoczi @ 2024-05-06 19:10 UTC (permalink / raw)
To: Peter-Jan Gootzen
Cc: virtualization, vgoyal, miklos, mst, dgilbert, yorayz, gurtovoy, Yoray Zack, Max Gurtovoy

On Wed, May 01, 2024 at 05:38:16PM +0200, Peter-Jan Gootzen wrote:
> Virtio-fs devices might allocate significant resources to virtio queues
> such as CPU cores that busy poll on the queue. The device indicates how
> many request queues it can support and the driver should initialize the
> number of queues that they want to utilize.
>
> In this patch we limit the number of initialized request queues to the
> number of CPUs, to limit the resource consumption on the device-side
> and to prepare for the upcoming multi-queue patch.
>
> Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
> Signed-off-by: Yoray Zack <yorayz@nvidia.com>
> Suggested-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> ---
>  fs/fuse/virtio_fs.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

* Re: [PATCH 1/2] virtio-fs: limit number of request queues
From: Jingbo Xu @ 2024-05-28 8:40 UTC (permalink / raw)
To: Peter-Jan Gootzen, virtualization, vgoyal, stefanha, miklos
Cc: mst, dgilbert, yorayz, gurtovoy, Yoray Zack, Max Gurtovoy

Hi Peter-Jan,

Thanks for the amazing work! Glad to see it is upstreamed. We are also
planning to distribute virtiofs on DPU.

BTW I have some questions when reading the patch, appreciate it if you
could give a hint.

On 5/1/24 11:38 PM, Peter-Jan Gootzen wrote:
> Virtio-fs devices might allocate significant resources to virtio queues
> such as CPU cores that busy poll on the queue. The device indicates how
> many request queues it can support and the driver should initialize the
> number of queues that they want to utilize.

I'm okay with truncating num_request_queues to nr_cpu_ids if the
number of the available queues exceeds nr_cpu_ids, as generally a 1:1
mapping between CPUs and queues is enough and we can not make use of
more queues in this case.

I just don't understand the above commit message as to how this relates
to the resource footprint at the device side. When the number of the
queues actually used by the driver (e.g. nr_cpu_ids is 6) is less than
the amount (e.g. 8) that is advertised by the device, will the device
side reclaim the resources allocated for the remaining unused queues?
The driver has no way of notifying how many queues it actually utilizes.

>
> In this patch we limit the number of initialized request queues to the
> number of CPUs, to limit the resource consumption on the device-side
> and to prepare for the upcoming multi-queue patch.
>
> Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
> Signed-off-by: Yoray Zack <yorayz@nvidia.com>
> Suggested-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> ---
>  fs/fuse/virtio_fs.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 322af827a232..2745fa484dcb 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -751,6 +751,9 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
>          if (fs->num_request_queues == 0)
>                  return -EINVAL;
>
> +        /* Truncate nr of request queues to nr_cpu_id */
> +        fs->num_request_queues = min_t(unsigned int, fs->num_request_queues,
> +                                       nr_cpu_ids);
>          fs->nvqs = VQ_REQUEST + fs->num_request_queues;
>          fs->vqs = kcalloc(fs->nvqs, sizeof(fs->vqs[VQ_HIPRIO]), GFP_KERNEL);
>          if (!fs->vqs)

--
Thanks,
Jingbo

* Re: [PATCH 1/2] virtio-fs: limit number of request queues
From: Peter-Jan Gootzen @ 2024-05-28 9:05 UTC (permalink / raw)
To: vgoyal@redhat.com, stefanha@redhat.com, jefflexu@linux.alibaba.com, miklos@szeredi.hu, virtualization@lists.linux.dev
Cc: Max Gurtovoy, dgilbert@redhat.com, mst@redhat.com, Yoray Zack

Hi Jingbo,

On Tue, 2024-05-28 at 16:40 +0800, Jingbo Xu wrote:
> Hi Peter-Jan,
>
> Thanks for the amazing work! Glad to see it is upstreamed. We are also
> planning to distribute virtiofs on DPU.

Great to hear that virtio-fs is getting more attention with this
powerful approach :)

>
> BTW I have some questions when reading the patch, appreciate it if you
> could give a hint.
>
> On 5/1/24 11:38 PM, Peter-Jan Gootzen wrote:
> > Virtio-fs devices might allocate significant resources to virtio queues
> > such as CPU cores that busy poll on the queue. The device indicates how
> > many request queues it can support and the driver should initialize the
> > number of queues that they want to utilize.
>
> I'm okay with truncating num_request_queues to nr_cpu_ids if the
> number of the available queues exceeds nr_cpu_ids, as generally a 1:1
> mapping between CPUs and queues is enough and we can not make use of
> more queues in this case.
>
> I just don't understand the above commit message as to how this relates
> to the resource footprint at the device side. When the number of the
> queues actually used by the driver (e.g. nr_cpu_ids is 6) is less than
> the amount (e.g. 8) that is advertised by the device, will the device
> side reclaim the resources allocated for the remaining unused queues?
> The driver has no way of notifying how many queues it actually utilizes.

The device resources allocated to queues are device implementation
specific. So the cost of the driver creating more queues than it will
ultimately use depends on how the specific device handles these
unused queues.
A smart device could at some point decide to spend less polling
resources on a queue if it sees that it is unused (reclaiming the
resources, as you put it).

But in my opinion, the driver should not allocate more than it will use.
The new scheme tries to map every core to a queue, so we know that we
will not use more than nr_cpu_ids queues.

- Peter-Jan

* Re: [PATCH 1/2] virtio-fs: limit number of request queues
From: Jingbo Xu @ 2024-05-28 9:13 UTC (permalink / raw)
To: Peter-Jan Gootzen, vgoyal@redhat.com, stefanha@redhat.com, miklos@szeredi.hu, virtualization@lists.linux.dev
Cc: Max Gurtovoy, dgilbert@redhat.com, mst@redhat.com, Yoray Zack

On 5/28/24 5:05 PM, Peter-Jan Gootzen wrote:
> Hi Jingbo,
>
> On Tue, 2024-05-28 at 16:40 +0800, Jingbo Xu wrote:
>> Hi Peter-Jan,
>>
>> Thanks for the amazing work! Glad to see it is upstreamed. We are also
>> planning to distribute virtiofs on DPU.
> Great to hear that virtio-fs is getting more attention with this
> powerful approach :)
>
>> BTW I have some questions when reading the patch, appreciate it if you
>> could give a hint.
>>
>> On 5/1/24 11:38 PM, Peter-Jan Gootzen wrote:
>>> Virtio-fs devices might allocate significant resources to virtio queues
>>> such as CPU cores that busy poll on the queue. The device indicates how
>>> many request queues it can support and the driver should initialize the
>>> number of queues that they want to utilize.
>>
>> I'm okay with truncating num_request_queues to nr_cpu_ids if the
>> number of the available queues exceeds nr_cpu_ids, as generally a 1:1
>> mapping between CPUs and queues is enough and we can not make use of
>> more queues in this case.
>>
>> I just don't understand the above commit message as to how this relates
>> to the resource footprint at the device side. When the number of the
>> queues actually used by the driver (e.g. nr_cpu_ids is 6) is less than
>> the amount (e.g. 8) that is advertised by the device, will the device
>> side reclaim the resources allocated for the remaining unused queues?
>> The driver has no way of notifying how many queues it actually utilizes.
>
> The device resources allocated to queues are device implementation
> specific. So the cost of the driver creating more queues than it will
> ultimately use depends on how the specific device handles these
> unused queues.
> A smart device could at some point decide to spend less polling
> resources on a queue if it sees that it is unused (reclaiming the
> resources, as you put it).
>
> But in my opinion, the driver should not allocate more than it will use.
> The new scheme tries to map every core to a queue, so we know that we
> will not use more than nr_cpu_ids queues.

Okay got it. Many thanks for the reply.

--
Thanks,
Jingbo

* [PATCH 2/2] virtio-fs: add multi-queue support
From: Peter-Jan Gootzen @ 2024-05-01 15:38 UTC (permalink / raw)
To: virtualization, vgoyal, stefanha, miklos
Cc: mst, dgilbert, yorayz, gurtovoy, Peter-Jan Gootzen, Yoray Zack, Max Gurtovoy

This commit creates a multi-queue mapping at device bring-up. The driver
first attempts to use the existing MSI-X interrupt affinities (previously
disabled), and if not present, will distribute the request queues evenly
over the CPUs. If the latter fails as well, all CPUs are mapped to
request queue zero.

When a request is handed from FUSE to the virtio-fs device driver, the
driver will use the current CPU to index into the multi-queue mapping
and determine the optimal request queue to use.

We measured the performance of this patch with the fio benchmarking
tool; increasing the number of queues results in a significant speedup
for both read and write operations, demonstrating the effectiveness of
multi-queue support.

Host:
- Dell PowerEdge R760
- CPU: Intel(R) Xeon(R) Gold 6438M, 128 cores
- VM: KVM with 32 cores

Virtio-fs device:
- BlueField-3 DPU
- CPU: ARM Cortex-A78AE, 16 cores
- One thread per queue, each busy polling on one request queue
- Each queue is 1024 descriptors deep

Workload:
- fio, sequential read or write, ioengine=libaio, numjobs=32,
  4GiB file per job, iodepth=8, bs=256KiB, runtime=30s

Performance Results:
+===========================+==========+===========+
| Number of queues          | Fio read | Fio write |
+===========================+==========+===========+
| 1 request queue (GiB/s)   | 6.1      | 4.6       |
+---------------------------+----------+-----------+
| 8 request queues (GiB/s)  | 25.8     | 10.3      |
+---------------------------+----------+-----------+
| 16 request queues (GiB/s) | 30.9     | 19.5      |
+---------------------------+----------+-----------+
| 32 request queues (GiB/s) | 33.2     | 22.6      |
+---------------------------+----------+-----------+
| Speedup                   | 5.5x     | 5x        |
+---------------------------+----------+-----------+

Signed-off-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 fs/fuse/virtio_fs.c | 70 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 62 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 2745fa484dcb..ed50a5096e74 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -7,6 +7,8 @@
 #include <linux/fs.h>
 #include <linux/dax.h>
 #include <linux/pci.h>
+#include <linux/interrupt.h>
+#include <linux/group_cpus.h>
 #include <linux/pfn_t.h>
 #include <linux/memremap.h>
 #include <linux/module.h>
@@ -67,6 +69,8 @@ struct virtio_fs {
         unsigned int num_request_queues; /* number of request queues */
         struct dax_device *dax_dev;
 
+        unsigned int *mq_map; /* index = cpu id, value = request vq id */
+
         /* DAX memory window where file contents are mapped */
         void *window_kaddr;
         phys_addr_t window_phys_addr;
@@ -185,6 +189,7 @@ static void virtio_fs_ktype_release(struct kobject *kobj)
 {
         struct virtio_fs *vfs = container_of(kobj, struct virtio_fs, kobj);
 
+        kfree(vfs->mq_map);
         kfree(vfs->vqs);
         kfree(vfs);
 }
@@ -706,6 +711,44 @@ static void virtio_fs_requests_done_work(struct work_struct *work)
         }
 }
 
+static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *fs)
+{
+        const struct cpumask *mask, *masks;
+        unsigned int q, cpu;
+
+        /* First attempt to map using existing transport layer affinities
+         * e.g. PCIe MSI-X
+         */
+        if (!vdev->config->get_vq_affinity)
+                goto fallback;
+
+        for (q = 0; q < fs->num_request_queues; q++) {
+                mask = vdev->config->get_vq_affinity(vdev, VQ_REQUEST + q);
+                if (!mask)
+                        goto fallback;
+
+                for_each_cpu(cpu, mask)
+                        fs->mq_map[cpu] = q;
+        }
+
+        return;
+fallback:
+        /* Attempt to map evenly in groups over the CPUs */
+        masks = group_cpus_evenly(fs->num_request_queues);
+        /* If even this fails we default to all CPUs use queue zero */
+        if (!masks) {
+                for_each_possible_cpu(cpu)
+                        fs->mq_map[cpu] = 0;
+                return;
+        }
+
+        for (q = 0; q < fs->num_request_queues; q++) {
+                for_each_cpu(cpu, &masks[q])
+                        fs->mq_map[cpu] = q;
+        }
+        kfree(masks);
+}
+
 /* Virtqueue interrupt handler */
 static void virtio_fs_vq_done(struct virtqueue *vq)
 {
@@ -742,6 +785,11 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
 {
         struct virtqueue **vqs;
         vq_callback_t **callbacks;
+        /* Specify pre_vectors to ensure that the queues before the
+         * request queues (e.g. hiprio) don't claim any of the CPUs in
+         * the multi-queue mapping and interrupt affinities
+         */
+        struct irq_affinity desc = { .pre_vectors = VQ_REQUEST };
         const char **names;
         unsigned int i;
         int ret = 0;
@@ -763,7 +811,9 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
         callbacks = kmalloc_array(fs->nvqs, sizeof(callbacks[VQ_HIPRIO]),
                                         GFP_KERNEL);
         names = kmalloc_array(fs->nvqs, sizeof(names[VQ_HIPRIO]), GFP_KERNEL);
-        if (!vqs || !callbacks || !names) {
+        fs->mq_map = kcalloc_node(nr_cpu_ids, sizeof(*fs->mq_map), GFP_KERNEL,
+                                        dev_to_node(&vdev->dev));
+        if (!vqs || !callbacks || !names || !fs->mq_map) {
                 ret = -ENOMEM;
                 goto out;
         }
@@ -783,7 +833,7 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
                 names[i] = fs->vqs[i].name;
         }
 
-        ret = virtio_find_vqs(vdev, fs->nvqs, vqs, callbacks, names, NULL);
+        ret = virtio_find_vqs(vdev, fs->nvqs, vqs, callbacks, names, &desc);
         if (ret < 0)
                 goto out;
 
@@ -795,8 +845,10 @@ static int virtio_fs_setup_vqs(struct virtio_device *vdev,
         kfree(names);
         kfree(callbacks);
         kfree(vqs);
-        if (ret)
+        if (ret) {
                 kfree(fs->vqs);
+                kfree(fs->mq_map);
+        }
         return ret;
 }
 
@@ -942,7 +994,7 @@ static int virtio_fs_probe(struct virtio_device *vdev)
         if (ret < 0)
                 goto out;
 
-        /* TODO vq affinity */
+        virtio_fs_map_queues(vdev, fs);
 
         ret = virtio_fs_setup_dax(vdev, fs);
         if (ret < 0)
@@ -1291,7 +1343,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq)
 __releases(fiq->lock)
 {
-        unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */
+        unsigned int queue_id;
         struct virtio_fs *fs;
         struct fuse_req *req;
         struct virtio_fs_vq *fsvq;
@@ -1305,11 +1357,13 @@ __releases(fiq->lock)
         spin_unlock(&fiq->lock);
 
         fs = fiq->priv;
+        queue_id = VQ_REQUEST + fs->mq_map[raw_smp_processor_id()];
 
-        pr_debug("%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n",
-                 __func__, req->in.h.opcode, req->in.h.unique,
+        pr_debug("%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u queue_id %u\n",
+                 __func__, req->in.h.opcode, req->in.h.unique,
                  req->in.h.nodeid, req->in.h.len,
-                 fuse_len_args(req->args->out_numargs, req->args->out_args));
+                 fuse_len_args(req->args->out_numargs, req->args->out_args),
+                 queue_id);
 
         fsvq = &fs->vqs[queue_id];
         ret = virtio_fs_enqueue_req(fsvq, req, false);
--
2.34.1

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Jingbo Xu @ 2024-05-28 8:51 UTC (permalink / raw)
To: Peter-Jan Gootzen, virtualization, vgoyal, stefanha, miklos
Cc: mst, dgilbert, yorayz, gurtovoy, Yoray Zack, Max Gurtovoy

Hi Peter-Jan,

Thanks for the amazing work.

I'd just like to know if you have any plan of making fiq and fiq->lock
more scalable, e.g. make fiq a per-CPU software queue? I think the
global fiq and fiq->lock can be a performance bottleneck after we have
supported mq.

--
Thanks,
Jingbo

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Miklos Szeredi @ 2024-05-28 11:59 UTC (permalink / raw)
To: Jingbo Xu
Cc: Peter-Jan Gootzen, virtualization, vgoyal, stefanha, mst, dgilbert, yorayz, gurtovoy, Yoray Zack, Max Gurtovoy

On Tue, 28 May 2024 at 10:52, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Peter-Jan,
>
> Thanks for the amazing work.
>
> I'd just like to know if you have any plan of making fiq and fiq->lock
> more scalable, e.g. make fiq a per-CPU software queue?

Doing a per-CPU queue is not necessarily a good idea: async requests
could overload one queue while others are idle.

One idea is to allow request to go through a per-CPU fast path if the
respective listener is idle. Otherwise the request would enter the
default slow queue, where idle listeners would pick requests (like
they do now).

Might make sense to also optimize the way this picking is done if
there are several idle listeners according to how close the listener's
CPU is to the one that generated the request.

Thanks,
Miklos
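
To make the idea above concrete, a rough sketch of what such a split
submission path could look like follows. This is purely illustrative:
the per-CPU structures and helpers named here do not exist in the
current driver, and all locking around the fast path is elided.

/* Illustrative sketch only: per-CPU fast path with fallback to the
 * existing shared pending list. fuse_cpu_queue, cpu_queues,
 * listener_is_idle() and hand_to_listener() are hypothetical names.
 */
static void fuse_queue_request_sketch(struct fuse_iqueue *fiq,
                                      struct fuse_req *req)
{
        struct fuse_cpu_queue *cq = this_cpu_ptr(fiq->cpu_queues);

        if (listener_is_idle(cq)) {
                /* Fast path: hand the request straight to an idle
                 * listener pinned near the submitting CPU; the global
                 * fiq->lock is never touched. */
                hand_to_listener(cq, req);
                return;
        }

        /* Slow path: fall back to the shared queue, where any idle
         * listener may pick the request up, as happens today. */
        spin_lock(&fiq->lock);
        list_add_tail(&req->list, &fiq->pending);
        fiq->ops->wake_pending_and_unlock(fiq);
}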

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Peter-Jan Gootzen @ 2024-05-28 12:39 UTC (permalink / raw)
To: miklos@szeredi.hu, jefflexu@linux.alibaba.com
Cc: vgoyal@redhat.com, stefanha@redhat.com, mst@redhat.com, Yoray Zack, dgilbert@redhat.com, virtualization@lists.linux.dev, Max Gurtovoy

On Tue, 2024-05-28 at 13:59 +0200, Miklos Szeredi wrote:
> On Tue, 28 May 2024 at 10:52, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >
> > Hi Peter-Jan,
> >
> > Thanks for the amazing work.
> >
> > I'd just like to know if you have any plan of making fiq and fiq->lock
> > more scalable, e.g. make fiq a per-CPU software queue?

We have also identified the single fiq and fiq->lock as a logically
large bottleneck in our system's data path. But we don't have hard
proof for this just yet.
I also still need to investigate what the impact of having multiple
locks would be on the file system consistency in the FUSE driver.
File systems like their locks... Can anyone here comment on that?

> Doing a per-CPU queue is not necessarily a good idea: async requests
> could overload one queue while others are idle.
>
> One idea is to allow request to go through a per-CPU fast path if the
> respective listener is idle. Otherwise the request would enter the
> default slow queue, where idle listeners would pick requests (like
> they do now).
>
> Might make sense to also optimize the way this picking is done if
> there are several idle listeners according to how close the listener's
> CPU is to the one that generated the request.
>
> Thanks,
> Miklos

I am not that familiar yet with the FUSE driver. What are these
listeners you are referring to? If I follow a FUSE file system operation
like fuse_statfs, I see a single-threaded path towards the
wake_pending_and_unlock of virtio-fs via fuse_simple_request.
Or are these listeners something from the virtiofsd implementation?

I think the optimality of another "simple" multi-queue layer, but now
in the FUSE driver, would depend on the virtio-fs device implementation
(busy polling or thread pooling like virtiofsd) and how the file system
operates (asynchronously over the network or locally accessing the
disk).
Are you aware of other device drivers that perform smart tricks like
checking if the per-core queue is full and using another empty one?

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Miklos Szeredi @ 2024-05-28 13:22 UTC (permalink / raw)
To: Peter-Jan Gootzen
Cc: jefflexu@linux.alibaba.com, vgoyal@redhat.com, stefanha@redhat.com, mst@redhat.com, dgilbert@redhat.com, virtualization@lists.linux.dev, Yoray Zack

On Tue, 28 May 2024 at 14:39, Peter-Jan Gootzen <pgootzen@nvidia.com> wrote:

> I am not that familiar yet with the FUSE driver. What are these
> listeners you are referring to? If I follow a FUSE file system operation
> like fuse_statfs, I see a single-threaded path towards the
> wake_pending_and_unlock of virtio-fs via fuse_simple_request.
> Or are these listeners something from the virtiofsd implementation?

I was thinking of the regular /dev/fuse interface, not virtiofs. For
/dev/fuse there can be multiple threads that call read(2) on the fuse
fd, which I call "listeners" above.

The design is outdated for both /dev/fuse and for virtiofs. When a
request is created by the filesystem it shouldn't normally have to go
through that fc->fiq bottleneck...

I'll see if there's an easy fix for that.

Thanks,
Miklos
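
For readers unfamiliar with that interface: a "listener" here is simply
a daemon thread blocked in read(2) on the /dev/fuse file descriptor. A
minimal sketch of such a loop follows (request parsing and error
handling omitted; the buffer size is only a plausible placeholder):

#include <sys/types.h>
#include <unistd.h>

/* Minimal sketch of one fuse daemon "listener" thread: it blocks in
 * read(2) on the /dev/fuse fd until the kernel hands it a request,
 * handles it, and writes the reply back. Real daemons (libfuse,
 * virtiofsd) add request decoding, error handling and a thread pool.
 */
static void *fuse_listener_thread(void *arg)
{
        int fuse_fd = *(int *)arg;
        static char buf[1 << 20];       /* placeholder; must fit max_write plus headers */

        for (;;) {
                /* blocks until a request is pending in the kernel queue */
                ssize_t n = read(fuse_fd, buf, sizeof(buf));
                if (n <= 0)
                        break;
                /* decode the request header in buf, handle the opcode,
                 * then reply with write(fuse_fd, ...) */
        }
        return NULL;
}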

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Jingbo Xu @ 2024-05-28 13:35 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Peter-Jan Gootzen, virtualization, vgoyal, stefanha, mst, dgilbert, yorayz, gurtovoy, Yoray Zack, Max Gurtovoy

On 5/28/24 7:59 PM, Miklos Szeredi wrote:
> On Tue, 28 May 2024 at 10:52, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Peter-Jan,
>>
>> Thanks for the amazing work.
>>
>> I'd just like to know if you have any plan of making fiq and fiq->lock
>> more scalable, e.g. make fiq a per-CPU software queue?
>
> Doing a per-CPU queue is not necessarily a good idea: async requests
> could overload one queue while others are idle.

It is in FUSE scenarios (instead of virtiofs). There's no 1:1 mapping
between CPUs and daemon threads. All requests submitted from all CPUs
are enqueued in one global pending list, and all daemon threads fetch
requests to be processed from this global pending list.

>
> One idea is to allow request to go through a per-CPU fast path if the
> respective listener is idle. Otherwise the request would enter the
> default slow queue, where idle listeners would pick requests (like
> they do now).

I guess "listener" refers to one thread of the fuse daemon.

When coming to virtiofs scenarios, there's a 1:1 mapping between CPUs
and hardware queues. The requests will only be routed to the hardware
queue to which the submitting CPU maps, and thus there's no meaning in
still managing pending requests in one global pending list. In this
case, a per-CPU pending list theoretically can reduce the lock
contention on the global pending list.

I guess we have an internal RFC on per-CPU pending list for virtiofs.
Let me see if this really improves the performance from test.

--
Thanks,
Jingbo

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Peter-Jan Gootzen @ 2024-05-28 14:21 UTC (permalink / raw)
To: miklos@szeredi.hu, jefflexu@linux.alibaba.com
Cc: vgoyal@redhat.com, stefanha@redhat.com, mst@redhat.com, Yoray Zack, dgilbert@redhat.com, virtualization@lists.linux.dev, Max Gurtovoy

On Tue, 2024-05-28 at 21:35 +0800, Jingbo Xu wrote:
> On 5/28/24 7:59 PM, Miklos Szeredi wrote:
> > On Tue, 28 May 2024 at 10:52, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > >
> > > Hi Peter-Jan,
> > >
> > > Thanks for the amazing work.
> > >
> > > I'd just like to know if you have any plan of making fiq and fiq->lock
> > > more scalable, e.g. make fiq a per-CPU software queue?
> >
> > Doing a per-CPU queue is not necessarily a good idea: async requests
> > could overload one queue while others are idle.
>
> It is in FUSE scenarios (instead of virtiofs). There's no 1:1 mapping
> between CPUs and daemon threads. All requests submitted from all CPUs
> are enqueued in one global pending list, and all daemon threads fetch
> requests to be processed from this global pending list.

Virtiofsd also works in this thread pool manner, with a global pending
list and many daemon threads fetching from said list.

>
> > One idea is to allow request to go through a per-CPU fast path if the
> > respective listener is idle. Otherwise the request would enter the
> > default slow queue, where idle listeners would pick requests (like
> > they do now).
>
> I guess "listener" refers to one thread of the fuse daemon.
>
> When coming to virtiofs scenarios, there's a 1:1 mapping between CPUs
> and hardware queues. The requests will only be routed to the hardware
> queue to which the submitting CPU maps, and thus there's no meaning in
> still managing pending requests in one global pending list. In this
> case, a per-CPU pending list theoretically can reduce the lock
> contention on the global pending list.
>
> I guess we have an internal RFC on per-CPU pending list for virtiofs.
> Let me see if this really improves the performance from test.

I am a bit confused about the software layers we are talking about now.
We have:
- FUSE driver (so the kernel FUSE fs driver)
- virtio-fs driver
- virtiofsd as a software virtio-fs device
- Hardware virtio-fs device implementations (e.g. BlueField)
- libfuse (connecting /dev/fuse and a FS implementation)
- File systems that build upon: libfuse, virtiofsd or some hardware
  virtio-fs implementation

So you have a patch at hand that fixes this fiq and fiq->lock bottleneck
in the FUSE driver, correct? In the virtio-fs driver there is no pending
list, it just pops from the pending list of the FUSE driver.

- Peter-Jan

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Jingbo Xu @ 2024-05-29 1:52 UTC (permalink / raw)
To: Peter-Jan Gootzen, miklos@szeredi.hu
Cc: vgoyal@redhat.com, stefanha@redhat.com, mst@redhat.com, Yoray Zack, dgilbert@redhat.com, virtualization@lists.linux.dev, Max Gurtovoy

On 5/28/24 10:21 PM, Peter-Jan Gootzen wrote:
> On Tue, 2024-05-28 at 21:35 +0800, Jingbo Xu wrote:
>> On 5/28/24 7:59 PM, Miklos Szeredi wrote:
>>> On Tue, 28 May 2024 at 10:52, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> Hi Peter-Jan,
>>>>
>>>> Thanks for the amazing work.
>>>>
>>>> I'd just like to know if you have any plan of making fiq and fiq->lock
>>>> more scalable, e.g. make fiq a per-CPU software queue?
>>>
>>> Doing a per-CPU queue is not necessarily a good idea: async requests
>>> could overload one queue while others are idle.
>>
>> It is in FUSE scenarios (instead of virtiofs). There's no 1:1 mapping
>> between CPUs and daemon threads. All requests submitted from all CPUs
>> are enqueued in one global pending list, and all daemon threads fetch
>> requests to be processed from this global pending list.
> Virtiofsd also works in this thread pool manner, with a global pending
> list and many daemon threads fetching from said list.
>
>>
>>> One idea is to allow request to go through a per-CPU fast path if the
>>> respective listener is idle. Otherwise the request would enter the
>>> default slow queue, where idle listeners would pick requests (like
>>> they do now).
>>
>> I guess "listener" refers to one thread of the fuse daemon.
>>
>> When coming to virtiofs scenarios, there's a 1:1 mapping between CPUs
>> and hardware queues. The requests will only be routed to the hardware
>> queue to which the submitting CPU maps, and thus there's no meaning in
>> still managing pending requests in one global pending list. In this
>> case, a per-CPU pending list theoretically can reduce the lock
>> contention on the global pending list.
>>
>> I guess we have an internal RFC on per-CPU pending list for virtiofs.
>> Let me see if this really improves the performance from test.
> I am a bit confused about the software layers we are talking about now.
> We have:
> - FUSE driver (so the kernel FUSE fs driver)
> - virtio-fs driver
> - virtiofsd as a software virtio-fs device
> - Hardware virtio-fs device implementations (e.g. BlueField)
> - libfuse (connecting /dev/fuse and a FS implementation)
> - File systems that build upon: libfuse, virtiofsd or some hardware
>   virtio-fs implementation
>
> So you have a patch at hand that fixes this fiq and fiq->lock bottleneck
> in the FUSE driver, correct? In the virtio-fs driver there is no pending
> list, it just pops from the pending list of the FUSE driver.

I mean the fiq->pending list (struct fuse_iqueue) when I refer to the
software pending list.

__fuse_request_send
  spin_lock(&fiq->lock)    <-- lock contention if multiple CPUs submit IOs
  queue_request_and_unlock
    # enqueue request to fiq->pending list
    fiq->ops->wake_pending_and_unlock(), i.e.
    virtio_fs_wake_pending_and_unlock() for virtiofs
      # fetch one request from fiq->pending list
      spin_unlock(&fiq->lock)

Both the regular /dev/fuse interface and virtiofs use "struct
fuse_iqueue" as the pending list.

--
Thanks,
Jingbo

* Re: [PATCH 2/2] virtio-fs: add multi-queue support
From: Peter-Jan Gootzen @ 2024-05-29 9:27 UTC (permalink / raw)
To: miklos@szeredi.hu, jefflexu@linux.alibaba.com
Cc: vgoyal@redhat.com, stefanha@redhat.com, mst@redhat.com, Yoray Zack, dgilbert@redhat.com, virtualization@lists.linux.dev, Max Gurtovoy

On Wed, 2024-05-29 at 09:52 +0800, Jingbo Xu wrote:
> On 5/28/24 10:21 PM, Peter-Jan Gootzen wrote:
> > On Tue, 2024-05-28 at 21:35 +0800, Jingbo Xu wrote:
> > > On 5/28/24 7:59 PM, Miklos Szeredi wrote:
> > > > On Tue, 28 May 2024 at 10:52, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > > >
> > > > > Hi Peter-Jan,
> > > > >
> > > > > Thanks for the amazing work.
> > > > >
> > > > > I'd just like to know if you have any plan of making fiq and fiq->lock
> > > > > more scalable, e.g. make fiq a per-CPU software queue?
> > > >
> > > > Doing a per-CPU queue is not necessarily a good idea: async requests
> > > > could overload one queue while others are idle.
> > >
> > > It is in FUSE scenarios (instead of virtiofs). There's no 1:1 mapping
> > > between CPUs and daemon threads. All requests submitted from all CPUs
> > > are enqueued in one global pending list, and all daemon threads fetch
> > > requests to be processed from this global pending list.
> > Virtiofsd also works in this thread pool manner, with a global pending
> > list and many daemon threads fetching from said list.
> >
> > >
> > > > One idea is to allow request to go through a per-CPU fast path if the
> > > > respective listener is idle. Otherwise the request would enter the
> > > > default slow queue, where idle listeners would pick requests (like
> > > > they do now).
> > >
> > > I guess "listener" refers to one thread of the fuse daemon.
> > >
> > > When coming to virtiofs scenarios, there's a 1:1 mapping between CPUs
> > > and hardware queues. The requests will only be routed to the hardware
> > > queue to which the submitting CPU maps, and thus there's no meaning in
> > > still managing pending requests in one global pending list. In this
> > > case, a per-CPU pending list theoretically can reduce the lock
> > > contention on the global pending list.
> > >
> > > I guess we have an internal RFC on per-CPU pending list for virtiofs.
> > > Let me see if this really improves the performance from test.
> > I am a bit confused about the software layers we are talking about now.
> > We have:
> > - FUSE driver (so the kernel FUSE fs driver)
> > - virtio-fs driver
> > - virtiofsd as a software virtio-fs device
> > - Hardware virtio-fs device implementations (e.g. BlueField)
> > - libfuse (connecting /dev/fuse and a FS implementation)
> > - File systems that build upon: libfuse, virtiofsd or some hardware
> >   virtio-fs implementation
> >
> > So you have a patch at hand that fixes this fiq and fiq->lock bottleneck
> > in the FUSE driver, correct? In the virtio-fs driver there is no pending
> > list, it just pops from the pending list of the FUSE driver.
>
> I mean the fiq->pending list (struct fuse_iqueue) when I refer to the
> software pending list.
>
> __fuse_request_send
>   spin_lock(&fiq->lock)    <-- lock contention if multiple CPUs submit IOs
>   queue_request_and_unlock
>     # enqueue request to fiq->pending list
>     fiq->ops->wake_pending_and_unlock(), i.e.
>     virtio_fs_wake_pending_and_unlock() for virtiofs
>       # fetch one request from fiq->pending list
>       spin_unlock(&fiq->lock)
>
> Both the regular /dev/fuse interface and virtiofs use "struct
> fuse_iqueue" as the pending list.

Okay I see now. I am very eager to hear your result; I'll take a look
at the patch and see how it evaluates on our system :)

- Peter-Jan

* Re: [PATCH 0/2] virtio-fs: introduce multi-queue support
From: Stefan Hajnoczi @ 2024-05-06 19:52 UTC (permalink / raw)
To: Peter-Jan Gootzen
Cc: virtualization, vgoyal, miklos, mst, dgilbert, yorayz, gurtovoy

On Wed, May 01, 2024 at 05:38:15PM +0200, Peter-Jan Gootzen wrote:
> In this patch set we are adding multi-queue support in the virtio-fs
> driver to enhance performance.
> While multi-queue functionality has been in the virtio-fs specification
> since its inception, the current implementation in the Linux kernel
> creates multiple queues, but only enqueues requests on the first request
> queue.
>
> The first commit in this patch set limits the number of request queues
> to the number of CPUs in the system, to reduce unnecessary resource
> consumption on the device-side.
> The second commit adds a multi-queue mapping to the virtio-fs device
> driver that is created using MSI-X affinity mappings or, in case of no
> existing mappings, by evenly grouping the CPUs and dividing the queues
> over these groups.
>
> Future work in this direction could include:
> - CPU hotplug support

It looks like this series is safe to merge as-is. Hotplugged CPUs
default to the first request queue. It's safe because mq_map[] is sized
to nr_cpu_ids (max number of possible CPUs) and mq_map[] is initialized
to 0 (first request queue). Hotplug can be added later.

> - In case of no MSI-X, using the created multi-queue mapping to set
>   interrupt affinities
>
> See the commit message of the second patch for performance results.
>
> Peter-Jan Gootzen (2):
>   virtio-fs: limit number of request queues
>   virtio-fs: add multi-queue support
>
>  fs/fuse/virtio_fs.c | 73 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 65 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

* Re: [PATCH 0/2] virtio-fs: introduce multi-queue support
From: Michael S. Tsirkin @ 2024-05-10 11:05 UTC (permalink / raw)
To: Peter-Jan Gootzen
Cc: virtualization, vgoyal, stefanha, miklos, dgilbert, yorayz, gurtovoy

On Wed, May 01, 2024 at 05:38:15PM +0200, Peter-Jan Gootzen wrote:
> In this patch set we are adding multi-queue support in the virtio-fs
> driver to enhance performance.
> While multi-queue functionality has been in the virtio-fs specification
> since its inception, the current implementation in the Linux kernel
> creates multiple queues, but only enqueues requests on the first request
> queue.
>
> The first commit in this patch set limits the number of request queues
> to the number of CPUs in the system, to reduce unnecessary resource
> consumption on the device-side.
> The second commit adds a multi-queue mapping to the virtio-fs device
> driver that is created using MSI-X affinity mappings or, in case of no
> existing mappings, by evenly grouping the CPUs and dividing the queues
> over these groups.

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> Future work in this direction could include:
> - CPU hotplug support
> - In case of no MSI-X, using the created multi-queue mapping to set
>   interrupt affinities
>
> See the commit message of the second patch for performance results.
>
> Peter-Jan Gootzen (2):
>   virtio-fs: limit number of request queues
>   virtio-fs: add multi-queue support
>
>  fs/fuse/virtio_fs.c | 73 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 65 insertions(+), 8 deletions(-)
>
> --
> 2.34.1

* Re: [PATCH 0/2] virtio-fs: introduce multi-queue support
From: Miklos Szeredi @ 2024-05-10 11:41 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Peter-Jan Gootzen, virtualization, vgoyal, stefanha, dgilbert, yorayz, gurtovoy

On Fri, 10 May 2024 at 13:05, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, May 01, 2024 at 05:38:15PM +0200, Peter-Jan Gootzen wrote:
> > In this patch set we are adding multi-queue support in the virtio-fs
> > driver to enhance performance.
> > While multi-queue functionality has been in the virtio-fs specification
> > since its inception, the current implementation in the Linux kernel
> > creates multiple queues, but only enqueues requests on the first request
> > queue.
> >
> > The first commit in this patch set limits the number of request queues
> > to the number of CPUs in the system, to reduce unnecessary resource
> > consumption on the device-side.
> > The second commit adds a multi-queue mapping to the virtio-fs device
> > driver that is created using MSI-X affinity mappings or, in case of no
> > existing mappings, by evenly grouping the CPUs and dividing the queues
> > over these groups.
>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>

Applied, thanks.

Miklos