* [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
@ 2023-02-06 15:00 Ming Lei
2023-02-06 17:53 ` Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 34+ messages in thread
From: Ming Lei @ 2023-02-06 15:00 UTC (permalink / raw)
To: linux-block, lsf-pc
Cc: ming.lei, Liu Xiaodong, Jim Harris, Hans Holmberg,
Matias Bjørling, hch@lst.de, Stefan Hajnoczi, ZiyangZhang
Hello,
So far UBLK is only used for implementing virtual block devices from
userspace, such as loop, nbd, qcow2, ...[1].
It could be useful for UBLK to cover real storage hardware too:
- for fast prototyping or performance evaluation
- some network storage is attached to the host, such as iscsi and nvme-tcp;
  the current UBLK interface doesn't support such devices, since they need
  all LUNs/Namespaces to share host resources (such as tags)
- SPDK already supports userspace drivers for real hardware
So I propose to extend UBLK to support real hardware devices:
1) extend the UBLK ABI interface to support disks attached to the host,
such as SCSI LUNs/NVMe Namespaces
2) the following items are related to operating hardware from userspace,
so the userspace driver has to be trusted; root is required, and
unprivileged UBLK devices can't be supported
3) how to operate the hardware memory space (see the VFIO mapping sketch
after this list)
- unbind the kernel driver and rebind the device with uio/vfio
- map the PCI BAR into userspace[2], then userspace can drive the hardware
  via MMIO on the mapped address
4) DMA (see the SG table sketch after this list)
- DMA requires physical memory addresses; the UBLK driver already has the
  block request pages, so can we export the request SG list (each segment's
  physical address, offset and length) to userspace? If the max_segments
  limit is not too big (<=64), the buffer needed to hold the SG list can be
  kept small.
- a small amount of physical memory to be used for DMA descriptors can be
  pre-allocated from userspace; the kernel is asked to pin those pages and
  return their physical addresses to userspace for programming DMA
- this way is still zero copy
5) notification from hardware: interrupt or polling (see the hybrid polling
sketch after this list)
- SPDK applies userspace polling; this is doable, but it eats CPU, so it is
  only one of the choices
- io_uring command has proven to be very efficient; if an io_uring command
  is applied to uio/vfio for delivering interrupts (in a similar way to how
  UBLK forwards blk io commands from kernel to userspace), that should be
  efficient too, given batch processing is done after the io_uring command
  is completed
- or it could be made flexible by hybrid interrupt & polling, given a
  single-pthread-per-queue userspace implementation can retrieve all kinds
  of inflight IO info very cheaply, and maybe it is even possible to apply
  some ML model to learn & predict when IO will be completed
6) others?
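
To make 3) above more concrete, here is a minimal userspace sketch
(assuming the device is already bound to vfio-pci and the VFIO
container/group setup has been done; error handling omitted) of mapping
BAR0 so it can be driven via MMIO:

	/* minimal sketch, assumes vfio-pci binding + group/container setup */
	#include <linux/vfio.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <stddef.h>

	static void *map_bar0(int device_fd)
	{
		struct vfio_region_info reg = {
			.argsz = sizeof(reg),
			.index = VFIO_PCI_BAR0_REGION_INDEX,
		};

		/* query size/offset of BAR0 inside the device fd */
		ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);

		/* MMIO access is then plain loads/stores on the mapping */
		return mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
			    MAP_SHARED, device_fd, reg.offset);
	}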
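
For 4) above, the exported SG table could be laid out like the sketch
below; this is purely hypothetical (not an existing UBLK ABI), just to show
that with max_segments <= 64 the per-request table stays small:

	/* hypothetical layout, not an existing UBLK ABI: one entry per
	 * request segment, filled by the kernel, read by the userspace
	 * driver when programming the DMA engine */
	#include <linux/types.h>

	struct ublk_sg_entry {
		__u64 phys_addr;	/* physical address of the segment */
		__u32 offset;		/* offset inside the first page */
		__u32 len;		/* segment length in bytes */
	};

	/* 64 * 16 bytes = 1KB per request, well under one page */
	struct ublk_sg_table {
		__u16 nr_segs;
		__u16 pad[3];
		struct ublk_sg_entry segs[64];
	};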
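
For 5) above, the hybrid model could look roughly like the sketch below
(assuming liburing, an eventfd already wired up to the device interrupt via
VFIO_DEVICE_SET_IRQS, and a device-specific poll_cq() helper, which is
hypothetical here):

	/* rough sketch of the hybrid interrupt & polling loop */
	#include <liburing.h>
	#include <stdint.h>

	extern int poll_cq(void);	/* hypothetical: reaps device completions */

	void queue_loop(struct io_uring *ring, int irq_efd)
	{
		uint64_t cnt;
		struct io_uring_cqe *cqe;

		for (;;) {
			/* 1) cheap userspace polling while IO is inflight */
			if (poll_cq())
				continue;

			/* 2) idle: arm an eventfd read and sleep in the kernel
			 * until the device interrupt fires */
			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
			io_uring_prep_read(sqe, irq_efd, &cnt, sizeof(cnt), 0);
			io_uring_submit(ring);
			io_uring_wait_cqe(ring, &cqe);
			io_uring_cqe_seen(ring, cqe);
		}
	}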
[1] https://github.com/ming1/ubdsrv
[2] https://spdk.io/doc/userspace.html
Thanks,
Ming
^ permalink raw reply [flat|nested] 34+ messages in thread* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei @ 2023-02-06 17:53 ` Hannes Reinecke 2023-03-08 8:50 ` Hans Holmberg 2023-02-06 18:26 ` Bart Van Assche 2023-02-06 20:27 ` Stefan Hajnoczi 2 siblings, 1 reply; 34+ messages in thread From: Hannes Reinecke @ 2023-02-06 17:53 UTC (permalink / raw) To: Ming Lei, linux-block, lsf-pc Cc: Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, Stefan Hajnoczi, ZiyangZhang On 2/6/23 16:00, Ming Lei wrote: > Hello, > > So far UBLK is only used for implementing virtual block device from > userspace, such as loop, nbd, qcow2, ...[1]. > > It could be useful for UBLK to cover real storage hardware too: > > - for fast prototype or performance evaluation > > - some network storages are attached to host, such as iscsi and nvme-tcp, > the current UBLK interface doesn't support such devices, since it needs > all LUNs/Namespaces to share host resources(such as tag) > > - SPDK has supported user space driver for real hardware > > So propose to extend UBLK for supporting real hardware device: > > 1) extend UBLK ABI interface to support disks attached to host, such > as SCSI Luns/NVME Namespaces > > 2) the followings are related with operating hardware from userspace, > so userspace driver has to be trusted, and root is required, and > can't support unprivileged UBLK device > > 3) how to operating hardware memory space > - unbind kernel driver and rebind with uio/vfio > - map PCI BAR into userspace[2], then userspace can operate hardware > with mapped user address via MMIO > > 4) DMA > - DMA requires physical memory address, UBLK driver actually has > block request pages, so can we export request SG list(each segment > physical address, offset, len) into userspace? If the max_segments > limit is not too big(<=64), the needed buffer for holding SG list > can be small enough. > > - small amount of physical memory for using as DMA descriptor can be > pre-allocated from userspace, and ask kernel to pin pages, then still > return physical address to userspace for programming DMA > > - this way is still zero copy > > 5) notification from hardware: interrupt or polling > - SPDK applies userspace polling, this way is doable, but > eat CPU, so it is only one choice > > - io_uring command has been proved as very efficient, if io_uring > command is applied(similar way with UBLK for forwarding blk io > command from kernel to userspace) to uio/vfio for delivering interrupt, > which should be efficient too, given batching processes are done after > the io_uring command is completed > > - or it could be flexible by hybrid interrupt & polling, given > userspace single pthread/queue implementation can retrieve all > kinds of inflight IO info in very cheap way, and maybe it is likely > to apply some ML model to learn & predict when IO will be completed > > 6) others? > > Good idea. I'd love to have this discussion. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-06 17:53 ` Hannes Reinecke @ 2023-03-08 8:50 ` Hans Holmberg 2023-03-08 12:27 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Hans Holmberg @ 2023-03-08 8:50 UTC (permalink / raw) To: Hannes Reinecke Cc: Ming Lei, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, Stefan Hajnoczi, ZiyangZhang This is a great topic, so I'd like to be part of it as well. It would be great to figure out what latency overhead we could expect of ublk in the future, clarifying what use cases ublk could cater for. This will help a lot in making decisions on what to implement in-kernel vs user space. Cheers, Hans On Mon, Feb 6, 2023 at 6:54 PM Hannes Reinecke <hare@suse.de> wrote: > > On 2/6/23 16:00, Ming Lei wrote: > > Hello, > > > > So far UBLK is only used for implementing virtual block device from > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > It could be useful for UBLK to cover real storage hardware too: > > > > - for fast prototype or performance evaluation > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > the current UBLK interface doesn't support such devices, since it needs > > all LUNs/Namespaces to share host resources(such as tag) > > > > - SPDK has supported user space driver for real hardware > > > > So propose to extend UBLK for supporting real hardware device: > > > > 1) extend UBLK ABI interface to support disks attached to host, such > > as SCSI Luns/NVME Namespaces > > > > 2) the followings are related with operating hardware from userspace, > > so userspace driver has to be trusted, and root is required, and > > can't support unprivileged UBLK device > > > > 3) how to operating hardware memory space > > - unbind kernel driver and rebind with uio/vfio > > - map PCI BAR into userspace[2], then userspace can operate hardware > > with mapped user address via MMIO > > > > 4) DMA > > - DMA requires physical memory address, UBLK driver actually has > > block request pages, so can we export request SG list(each segment > > physical address, offset, len) into userspace? If the max_segments > > limit is not too big(<=64), the needed buffer for holding SG list > > can be small enough. > > > > - small amount of physical memory for using as DMA descriptor can be > > pre-allocated from userspace, and ask kernel to pin pages, then still > > return physical address to userspace for programming DMA > > > > - this way is still zero copy > > > > 5) notification from hardware: interrupt or polling > > - SPDK applies userspace polling, this way is doable, but > > eat CPU, so it is only one choice > > > > - io_uring command has been proved as very efficient, if io_uring > > command is applied(similar way with UBLK for forwarding blk io > > command from kernel to userspace) to uio/vfio for delivering interrupt, > > which should be efficient too, given batching processes are done after > > the io_uring command is completed > > > > - or it could be flexible by hybrid interrupt & polling, given > > userspace single pthread/queue implementation can retrieve all > > kinds of inflight IO info in very cheap way, and maybe it is likely > > to apply some ML model to learn & predict when IO will be completed > > > > 6) others? > > > > > Good idea. > I'd love to have this discussion. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke Kernel Storage Architect > hare@suse.de +49 911 74053 688 > SUSE Software Solutions GmbH, Maxfeldstr. 
5, 90409 Nürnberg > HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew > Myers, Andrew McDonald, Martje Boudien Moerman > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-03-08  8:50   ` Hans Holmberg
@ 2023-03-08 12:27     ` Ming Lei
  0 siblings, 0 replies; 34+ messages in thread
From: Ming Lei @ 2023-03-08 12:27 UTC (permalink / raw)
  To: Hans Holmberg
  Cc: Hannes Reinecke, linux-block, lsf-pc, Liu Xiaodong, Jim Harris,
	Hans Holmberg, Matias Bjørling, hch@lst.de, Stefan Hajnoczi,
	ZiyangZhang, ming.lei

On Wed, Mar 08, 2023 at 09:50:53AM +0100, Hans Holmberg wrote:
> This is a great topic, so I'd like to be part of it as well.
>
> It would be great to figure out what latency overhead we could expect
> of ublk in the future, clarifying what use cases ublk could cater for.
> This will help a lot in making decisions on what to implement
> in-kernel vs user space.

If the zero copy patchset[1] can be accepted, the main overhead should be
in io_uring command communication.

I just ran one quick test on my laptop between ublk/null (2 queues, depth
64, with zero copy) and null_blk (2 queues, depth 64), with a single-job
fio (128 qd, batch 16, libaio, 4k randread). IOPS on ublk is ~13% lower
than null_blk, so the difference isn't bad, given the IOPS have reached
the million level (1.29M vs. 1.46M). This basically shows the
communication overhead.

However, ublk userspace can handle io in a lockless way, and minimize
context switches & maximize io parallelism via coroutines; that is ublk's
advantage, and hard or impossible to do in kernel.

In the ublksrv[2] project we have implemented loop, nbd & qcow2 so far;
in my previous IOPS test results:

1) kernel loop(dio) vs. ublk/loop: the two are close
2) kernel nbd vs. ublk/nbd: ublk/nbd is a bit better than kernel nbd
3) qemu-nbd based qcow2 vs. ublk/qcow2: ublk/qcow2 is much better

All three just work; no further optimization has been run yet. Also ublk
may perform badly if io isn't handled in batches, such as
single-queue-depth io submission.

But ublk is still very young, and there can be lots of optimization in
future, such as:

1) applying polling to reduce communication overhead for both io commands
and io handling, which should improve latency for low-QD workloads

2) applying some kind of ML model for predicting IO completion, to improve
io polling while reducing cpu utilization

3) improving io_uring command to reduce communication overhead

IMO, ublk is one generic userspace block device approach, especially good
at:

1) handling complicated io logic, such as btree-based io mapping, since
userspace has more weapons for this stuff

2) virtual devices, such as all network based storage, or logical volume
management

3) quick prototype development

4) flexible storage simulation for test purposes


[1] https://lore.kernel.org/linux-block/ZAff9usDuyXxIPt9@ovpn-8-16.pek2.redhat.com/T/#t
[2] https://github.com/ming1/ubdsrv


Thanks,
Ming

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei
  2023-02-06 17:53 ` Hannes Reinecke
@ 2023-02-06 18:26 ` Bart Van Assche
  2023-02-08  1:38   ` Ming Lei
  2023-02-06 20:27 ` Stefan Hajnoczi
  2 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2023-02-06 18:26 UTC (permalink / raw)
  To: Ming Lei, linux-block, lsf-pc
  Cc: Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling,
	hch@lst.de, Stefan Hajnoczi, ZiyangZhang

On 2/6/23 07:00, Ming Lei wrote:
> 4) DMA
> - DMA requires physical memory address, UBLK driver actually has
>   block request pages, so can we export request SG list(each segment
>   physical address, offset, len) into userspace? If the max_segments
>   limit is not too big(<=64), the needed buffer for holding SG list
>   can be small enough.
>
> - small amount of physical memory for using as DMA descriptor can be
>   pre-allocated from userspace, and ask kernel to pin pages, then still
>   return physical address to userspace for programming DMA
>
> - this way is still zero copy

Would it be possible to use vfio in such a way that zero-copy
functionality is achieved? I'm concerned about the code duplication that
would result if a new interface similar to vfio is introduced.

In case it wouldn't be clear, I'm also interested in this topic.

Bart.

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-06 18:26 ` Bart Van Assche
@ 2023-02-08  1:38   ` Ming Lei
  2023-02-08 18:02     ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2023-02-08 1:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch@lst.de, Stefan Hajnoczi, ZiyangZhang, ming.lei

On Mon, Feb 06, 2023 at 10:26:55AM -0800, Bart Van Assche wrote:
> On 2/6/23 07:00, Ming Lei wrote:
> > 4) DMA
> > - DMA requires physical memory address, UBLK driver actually has
> >   block request pages, so can we export request SG list(each segment
> >   physical address, offset, len) into userspace? If the max_segments
> >   limit is not too big(<=64), the needed buffer for holding SG list
> >   can be small enough.
> >
> > - small amount of physical memory for using as DMA descriptor can be
> >   pre-allocated from userspace, and ask kernel to pin pages, then still
> >   return physical address to userspace for programming DMA
> >
> > - this way is still zero copy
>
> Would it be possible to use vfio in such a way that zero-copy
> functionality is achieved? I'm concerned about the code duplication that
> would result if a new interface similar to vfio is introduced.

Here I meant we can export physical address of request sg from
/dev/ublkb* to userspace, which can program the DMA controller
using exported physical address. With this way, the userspace driver
can submit IO without entering kernel, then with high performance.

This should be how SPDK/nvme-pci[1] is implemented, but SPDK allocates
hugepages for getting their physical addresses.

[1] https://spdk.io/doc/memory.html


Thanks,
Ming

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
  2023-02-08  1:38   ` Ming Lei
@ 2023-02-08 18:02     ` Bart Van Assche
  0 siblings, 0 replies; 34+ messages in thread
From: Bart Van Assche @ 2023-02-08 18:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg,
	Matias Bjørling, hch@lst.de, Stefan Hajnoczi, ZiyangZhang

On 2/7/23 17:38, Ming Lei wrote:
> Here I meant we can export physical address of request sg from
> /dev/ublkb* to userspace, which can program the DMA controller
> using exported physical address. With this way, the userspace driver
> can submit IO without entering kernel, then with high performance.

Hmm ... security experts might be very unhappy about allowing user space
software to program iova addresses, PASIDs etc. in DMA controllers
without having this data verified by the kernel.

Additionally, hardware designers every now and then propose new device
multiplexing mechanisms, e.g. scalable IOV which is an alternative for
SRIOV. Shouldn't we make the kernel deal with these mechanisms instead of
user space?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei 2023-02-06 17:53 ` Hannes Reinecke 2023-02-06 18:26 ` Bart Van Assche @ 2023-02-06 20:27 ` Stefan Hajnoczi 2023-02-08 2:12 ` Ming Lei 2 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-06 20:27 UTC (permalink / raw) To: Ming Lei Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 6655 bytes --] On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > Hello, > > So far UBLK is only used for implementing virtual block device from > userspace, such as loop, nbd, qcow2, ...[1]. I won't be at LSF/MM so here are my thoughts: > > It could be useful for UBLK to cover real storage hardware too: > > - for fast prototype or performance evaluation > > - some network storages are attached to host, such as iscsi and nvme-tcp, > the current UBLK interface doesn't support such devices, since it needs > all LUNs/Namespaces to share host resources(such as tag) Can you explain this in more detail? It seems like an iSCSI or NVMe-over-TCP initiator could be implemented as a ublk server today. What am I missing? > > - SPDK has supported user space driver for real hardware I think this could already be implemented today. There will be extra memory copies because SPDK won't have access to the application's memory pages. > > So propose to extend UBLK for supporting real hardware device: > > 1) extend UBLK ABI interface to support disks attached to host, such > as SCSI Luns/NVME Namespaces > > 2) the followings are related with operating hardware from userspace, > so userspace driver has to be trusted, and root is required, and > can't support unprivileged UBLK device Linux VFIO provides a safe userspace API for userspace device drivers. That means memory and interrupts are isolated. Neither userspace nor the hardware device can access memory or interrupts that the userspace process is not allowed to access. I think there are still limitations like all memory pages exposed to the device need to be pinned. So effectively you might still need privileges to get the mlock resource limits. But overall I think what you're saying about root and unprivileged ublk devices is not true. Hardware support should be developed with the goal of supporting unprivileged userspace ublk servers. Those unprivileged userspace ublk servers cannot claim any PCI device they want. The user/admin will need to give them permission to open a network card, SCSI HBA, etc. > > 3) how to operating hardware memory space > - unbind kernel driver and rebind with uio/vfio > - map PCI BAR into userspace[2], then userspace can operate hardware > with mapped user address via MMIO > > 4) DMA > - DMA requires physical memory address, UBLK driver actually has > block request pages, so can we export request SG list(each segment > physical address, offset, len) into userspace? If the max_segments > limit is not too big(<=64), the needed buffer for holding SG list > can be small enough. DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical address. The IOVA space is defined by the IOMMU page tables. Userspace controls the IOMMU page tables via Linux VFIO ioctls. For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the IOMMU mapping that makes a range of userspace virtual addresses available at a given IOVA. Mapping and unmapping operations are not free. 
Similar to mmap(2), the program will be slow if it does this frequently. I think it's effectively the same problem as ublk zero-copy. We want to give the ublk server access to just the I/O buffers that it currently needs, but doing so would be expensive :(. I think Linux has strategies for avoiding the expense like iommu.strict=0 and swiotlb. The drawback is that in our case userspace and/or the hardware device controller by userspace would still have access to the memory pages after I/O has completed. This reduces memory isolation :(. DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings. What I'm trying to get at is that either memory isolation is compromised or performance is reduced. It's hard to have good performance together with memory isolation. I think ublk should follow the VFIO philosophy of being a safe kernel/userspace interface. If userspace is malicious or buggy, the kernel's and other process' memory should not be corrupted. > > - small amount of physical memory for using as DMA descriptor can be > pre-allocated from userspace, and ask kernel to pin pages, then still > return physical address to userspace for programming DMA I think this is possible today. The ublk server owns the I/O buffers. It can mlock them and DMA map them via VFIO. ublk doesn't need to know anything about this. > - this way is still zero copy True zero-copy would be when an application does O_DIRECT I/O and the hardware device DMAs to/from the application's memory pages. ublk doesn't do that today and when combined with VFIO it doesn't get any easier. I don't think it's possible because you cannot allow userspace to control a hardware device and grant DMA access to pages that userspace isn't allowed to access. A malicious userspace will program the device to access those pages :). > > 5) notification from hardware: interrupt or polling > - SPDK applies userspace polling, this way is doable, but > eat CPU, so it is only one choice > > - io_uring command has been proved as very efficient, if io_uring > command is applied(similar way with UBLK for forwarding blk io > command from kernel to userspace) to uio/vfio for delivering interrupt, > which should be efficient too, given batching processes are done after > the io_uring command is completed I wonder how much difference there is between the new io_uring command for receiving VFIO irqs that you are suggesting compared to the existing io_uring approach IORING_OP_READ eventfd. > - or it could be flexible by hybrid interrupt & polling, given > userspace single pthread/queue implementation can retrieve all > kinds of inflight IO info in very cheap way, and maybe it is likely > to apply some ML model to learn & predict when IO will be completed Stefano Garzarella and I have discussed but not yet attempted to add a userspace memory polling command to io_uring. IORING_OP_POLL_MEMORY would be useful together with IORING_SETUP_IOPOLL. That way kernel polling can be combined with userspace polling on a single CPU. I'm not sure it's useful for ublk because you may not have any reason to use IORING_SETUP_IOPOLL. But applications that have an Linux NVMe block device open with IORING_SETUP_IOPOLL could use the new IORING_OP_POLL_MEMORY command to also watch for activity on a VIRTIO or VFIO PCI device or maybe just to get kicked by another userspace thread. > 6) others? 
> > > > [1] https://github.com/ming1/ubdsrv > [2] https://spdk.io/doc/userspace.html > > > Thanks, > Ming > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-06 20:27 ` Stefan Hajnoczi @ 2023-02-08 2:12 ` Ming Lei 2023-02-08 12:17 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-02-08 2:12 UTC (permalink / raw) To: Stefan Hajnoczi Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > Hello, > > > > So far UBLK is only used for implementing virtual block device from > > userspace, such as loop, nbd, qcow2, ...[1]. > > I won't be at LSF/MM so here are my thoughts: Thanks for the thoughts, :-) > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > - for fast prototype or performance evaluation > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > the current UBLK interface doesn't support such devices, since it needs > > all LUNs/Namespaces to share host resources(such as tag) > > Can you explain this in more detail? It seems like an iSCSI or > NVMe-over-TCP initiator could be implemented as a ublk server today. > What am I missing? The current ublk can't do that yet, because the interface doesn't support multiple ublk disks sharing single host, which is exactly the case of scsi and nvme. > > > > > - SPDK has supported user space driver for real hardware > > I think this could already be implemented today. There will be extra > memory copies because SPDK won't have access to the application's memory > pages. Here I proposed zero copy, and current SPDK nvme-pci implementation haven't such extra copy per my understanding. > > > > > So propose to extend UBLK for supporting real hardware device: > > > > 1) extend UBLK ABI interface to support disks attached to host, such > > as SCSI Luns/NVME Namespaces > > > > 2) the followings are related with operating hardware from userspace, > > so userspace driver has to be trusted, and root is required, and > > can't support unprivileged UBLK device > > Linux VFIO provides a safe userspace API for userspace device drivers. > That means memory and interrupts are isolated. Neither userspace nor the > hardware device can access memory or interrupts that the userspace > process is not allowed to access. > > I think there are still limitations like all memory pages exposed to the > device need to be pinned. So effectively you might still need privileges > to get the mlock resource limits. > > But overall I think what you're saying about root and unprivileged ublk > devices is not true. Hardware support should be developed with the goal > of supporting unprivileged userspace ublk servers. > > Those unprivileged userspace ublk servers cannot claim any PCI device > they want. The user/admin will need to give them permission to open a > network card, SCSI HBA, etc. It depends on implementation, please see https://spdk.io/doc/userspace.html ``` The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and then follows along with the NVMe Specification to initialize the device, create queue pairs, and ultimately send I/O. ``` The above way needs userspace to operating hardware by the mapped BAR, which can't be allowed for unprivileged user. 
> > > > > 3) how to operating hardware memory space > > - unbind kernel driver and rebind with uio/vfio > > - map PCI BAR into userspace[2], then userspace can operate hardware > > with mapped user address via MMIO > > > > 4) DMA > > - DMA requires physical memory address, UBLK driver actually has > > block request pages, so can we export request SG list(each segment > > physical address, offset, len) into userspace? If the max_segments > > limit is not too big(<=64), the needed buffer for holding SG list > > can be small enough. > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical > address. The IOVA space is defined by the IOMMU page tables. Userspace > controls the IOMMU page tables via Linux VFIO ioctls. > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the > IOMMU mapping that makes a range of userspace virtual addresses > available at a given IOVA. > > Mapping and unmapping operations are not free. Similar to mmap(2), the > program will be slow if it does this frequently. Yeah, but SPDK shouldn't use vfio DMA interface, see: https://spdk.io/doc/memory.html they just programs DMA directly with physical address of pinned hugepages. > > I think it's effectively the same problem as ublk zero-copy. We want to > give the ublk server access to just the I/O buffers that it currently > needs, but doing so would be expensive :(. > > I think Linux has strategies for avoiding the expense like > iommu.strict=0 and swiotlb. The drawback is that in our case userspace > and/or the hardware device controller by userspace would still have > access to the memory pages after I/O has completed. This reduces memory > isolation :(. > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings. Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping. > > What I'm trying to get at is that either memory isolation is compromised > or performance is reduced. It's hard to have good performance together > with memory isolation. > > I think ublk should follow the VFIO philosophy of being a safe > kernel/userspace interface. If userspace is malicious or buggy, the > kernel's and other process' memory should not be corrupted. It is tradeoff between performance and isolation, that is why I mention that directing programming hardware in userspace can be done by root only. > > > > > - small amount of physical memory for using as DMA descriptor can be > > pre-allocated from userspace, and ask kernel to pin pages, then still > > return physical address to userspace for programming DMA > > I think this is possible today. The ublk server owns the I/O buffers. It > can mlock them and DMA map them via VFIO. ublk doesn't need to know > anything about this. It depends on if such VFIO DMA mapping is required for each IO. If it is required, that won't help one high performance driver. > > > - this way is still zero copy > > True zero-copy would be when an application does O_DIRECT I/O and the > hardware device DMAs to/from the application's memory pages. ublk > doesn't do that today and when combined with VFIO it doesn't get any > easier. I don't think it's possible because you cannot allow userspace > to control a hardware device and grant DMA access to pages that > userspace isn't allowed to access. A malicious userspace will program > the device to access those pages :). 
But that should be what SPDK nvme/pci is doing per the above links, :-) > > > > > 5) notification from hardware: interrupt or polling > > - SPDK applies userspace polling, this way is doable, but > > eat CPU, so it is only one choice > > > > - io_uring command has been proved as very efficient, if io_uring > > command is applied(similar way with UBLK for forwarding blk io > > command from kernel to userspace) to uio/vfio for delivering interrupt, > > which should be efficient too, given batching processes are done after > > the io_uring command is completed > > I wonder how much difference there is between the new io_uring command > for receiving VFIO irqs that you are suggesting compared to the existing > io_uring approach IORING_OP_READ eventfd. eventfd needs extra read/write on the event fd, so more syscalls are required. > > > - or it could be flexible by hybrid interrupt & polling, given > > userspace single pthread/queue implementation can retrieve all > > kinds of inflight IO info in very cheap way, and maybe it is likely > > to apply some ML model to learn & predict when IO will be completed > > Stefano Garzarella and I have discussed but not yet attempted to add a > userspace memory polling command to io_uring. IORING_OP_POLL_MEMORY > would be useful together with IORING_SETUP_IOPOLL. That way kernel > polling can be combined with userspace polling on a single CPU. Here I meant the direct polling on mmio or DMA descriptor, so no need any syscall: https://spdk.io/doc/userspace.html ``` Polling an NVMe device is fast because only host memory needs to be read (no MMIO) to check a queue pair for a bit flip and technologies such as Intel's DDIO will ensure that the host memory being checked is present in the CPU cache after an update by the device. ``` With the above mentioned direct programming DMA & this kind of polling, handling IO won't require any syscall, but the userspace has to be trusted. > > I'm not sure it's useful for ublk because you may not have any reason to > use IORING_SETUP_IOPOLL. But applications that have an Linux NVMe block I think it is reasonable for ublk to poll target io, which isn't different with other polling cases, which should help network recv, IMO. So ublk is going to support io polling for target io only, but can't be done for io command. Thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-08 2:12 ` Ming Lei @ 2023-02-08 12:17 ` Stefan Hajnoczi 2023-02-13 3:47 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-08 12:17 UTC (permalink / raw) To: Ming Lei Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 11133 bytes --] On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > Hello, > > > > > > So far UBLK is only used for implementing virtual block device from > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > I won't be at LSF/MM so here are my thoughts: > > Thanks for the thoughts, :-) > > > > > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > - for fast prototype or performance evaluation > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > the current UBLK interface doesn't support such devices, since it needs > > > all LUNs/Namespaces to share host resources(such as tag) > > > > Can you explain this in more detail? It seems like an iSCSI or > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > What am I missing? > > The current ublk can't do that yet, because the interface doesn't > support multiple ublk disks sharing single host, which is exactly > the case of scsi and nvme. Can you give an example that shows exactly where a problem is hit? I took a quick look at the ublk source code and didn't spot a place where it prevents a single ublk server process from handling multiple devices. Regarding "host resources(such as tag)", can the ublk server deal with that in userspace? The Linux block layer doesn't have the concept of a "host", that would come in at the SCSI/NVMe level that's implemented in userspace. I don't understand yet... > > > > > > > > > - SPDK has supported user space driver for real hardware > > > > I think this could already be implemented today. There will be extra > > memory copies because SPDK won't have access to the application's memory > > pages. > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't > such extra copy per my understanding. > > > > > > > > > So propose to extend UBLK for supporting real hardware device: > > > > > > 1) extend UBLK ABI interface to support disks attached to host, such > > > as SCSI Luns/NVME Namespaces > > > > > > 2) the followings are related with operating hardware from userspace, > > > so userspace driver has to be trusted, and root is required, and > > > can't support unprivileged UBLK device > > > > Linux VFIO provides a safe userspace API for userspace device drivers. > > That means memory and interrupts are isolated. Neither userspace nor the > > hardware device can access memory or interrupts that the userspace > > process is not allowed to access. > > > > I think there are still limitations like all memory pages exposed to the > > device need to be pinned. So effectively you might still need privileges > > to get the mlock resource limits. > > > > But overall I think what you're saying about root and unprivileged ublk > > devices is not true. Hardware support should be developed with the goal > > of supporting unprivileged userspace ublk servers. > > > > Those unprivileged userspace ublk servers cannot claim any PCI device > > they want. 
The user/admin will need to give them permission to open a > > network card, SCSI HBA, etc. > > It depends on implementation, please see > > https://spdk.io/doc/userspace.html > > ``` > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and > then follows along with the NVMe Specification to initialize the device, > create queue pairs, and ultimately send I/O. > ``` > > The above way needs userspace to operating hardware by the mapped BAR, > which can't be allowed for unprivileged user. From https://spdk.io/doc/system_configuration.html: Running SPDK as non-privileged user One of the benefits of using the VFIO Linux kernel driver is the ability to perform DMA operations with peripheral devices as unprivileged user. The permissions to access particular devices still need to be granted by the system administrator, but only on a one-time basis. Note that this functionality is supported with DPDK starting from version 18.11. This is what I had described in my previous reply. > > > > > > > > > 3) how to operating hardware memory space > > > - unbind kernel driver and rebind with uio/vfio > > > - map PCI BAR into userspace[2], then userspace can operate hardware > > > with mapped user address via MMIO > > > > > > 4) DMA > > > - DMA requires physical memory address, UBLK driver actually has > > > block request pages, so can we export request SG list(each segment > > > physical address, offset, len) into userspace? If the max_segments > > > limit is not too big(<=64), the needed buffer for holding SG list > > > can be small enough. > > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical > > address. The IOVA space is defined by the IOMMU page tables. Userspace > > controls the IOMMU page tables via Linux VFIO ioctls. > > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the > > IOMMU mapping that makes a range of userspace virtual addresses > > available at a given IOVA. > > > > Mapping and unmapping operations are not free. Similar to mmap(2), the > > program will be slow if it does this frequently. > > Yeah, but SPDK shouldn't use vfio DMA interface, see: > > https://spdk.io/doc/memory.html > > they just programs DMA directly with physical address of pinned hugepages. From the page you linked: IOMMU Support ... This is a future-proof, hardware-accelerated solution for performing DMA operations into and out of a user space process and forms the long-term foundation for SPDK and DPDK's memory management strategy. We highly recommend that applications are deployed using vfio and the IOMMU enabled, which is fully supported today. Yes, SPDK supports running without IOMMU, but they recommend running with the IOMMU. > > > > > I think it's effectively the same problem as ublk zero-copy. We want to > > give the ublk server access to just the I/O buffers that it currently > > needs, but doing so would be expensive :(. > > > > I think Linux has strategies for avoiding the expense like > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace > > and/or the hardware device controller by userspace would still have > > access to the memory pages after I/O has completed. This reduces memory > > isolation :(. > > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings. > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping. When using VFIO (recommended by the docs), SPDK uses long-lived DMA mappings. 
Here are places in the SPDK/DPDK source code where VFIO DMA mapping is used: https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371 https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164 > > > > > What I'm trying to get at is that either memory isolation is compromised > > or performance is reduced. It's hard to have good performance together > > with memory isolation. > > > > I think ublk should follow the VFIO philosophy of being a safe > > kernel/userspace interface. If userspace is malicious or buggy, the > > kernel's and other process' memory should not be corrupted. > > It is tradeoff between performance and isolation, that is why I mention > that directing programming hardware in userspace can be done by root > only. Yes, there is a trade-off. Over the years the use of unsafe approaches has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As secure boot, integrity architecture, and stuff like that becomes more widely used, it's harder to include features that break memory isolation in software in mainstream distros. There can be an option to sacrifice memory isolation for performance and some users may be willing to accept the trade-off. I think it should be an option feature though. I did want to point out that the statement that "direct programming hardware in userspace can be done by root only" is false (see VFIO). > > > > > > > > - small amount of physical memory for using as DMA descriptor can be > > > pre-allocated from userspace, and ask kernel to pin pages, then still > > > return physical address to userspace for programming DMA > > > > I think this is possible today. The ublk server owns the I/O buffers. It > > can mlock them and DMA map them via VFIO. ublk doesn't need to know > > anything about this. > > It depends on if such VFIO DMA mapping is required for each IO. If it > is required, that won't help one high performance driver. It is not necessary to perform a DMA mapping for each IO. ublk's existing model is sufficient: 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup. 2. At runtime the ublk server provides these I/O buffers to the kernel, no further DMA mapping is required. Unfortunately there's still the kernel<->userspace copy that existing ublk applications have, but there's no new overhead related to VFIO. > > > > > - this way is still zero copy > > > > True zero-copy would be when an application does O_DIRECT I/O and the > > hardware device DMAs to/from the application's memory pages. ublk > > doesn't do that today and when combined with VFIO it doesn't get any > > easier. I don't think it's possible because you cannot allow userspace > > to control a hardware device and grant DMA access to pages that > > userspace isn't allowed to access. A malicious userspace will program > > the device to access those pages :). > > But that should be what SPDK nvme/pci is doing per the above links, :-) Sure, it's possible to break memory isolation. Breaking memory isolation isn't specific to ublk servers that access hardware. The same unsafe zero-copy approach would probably also work for regular ublk servers. This is basically bringing back /dev/kmem :). 
> > > > > > > > > 5) notification from hardware: interrupt or polling > > > - SPDK applies userspace polling, this way is doable, but > > > eat CPU, so it is only one choice > > > > > > - io_uring command has been proved as very efficient, if io_uring > > > command is applied(similar way with UBLK for forwarding blk io > > > command from kernel to userspace) to uio/vfio for delivering interrupt, > > > which should be efficient too, given batching processes are done after > > > the io_uring command is completed > > > > I wonder how much difference there is between the new io_uring command > > for receiving VFIO irqs that you are suggesting compared to the existing > > io_uring approach IORING_OP_READ eventfd. > > eventfd needs extra read/write on the event fd, so more syscalls are > required. No extra syscall is required because IORING_OP_READ is used to read the eventfd, but maybe you were referring to bypassing the file->f_op->read() code path? Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-08 12:17 ` Stefan Hajnoczi @ 2023-02-13 3:47 ` Ming Lei 2023-02-13 19:13 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-02-13 3:47 UTC (permalink / raw) To: Stefan Hajnoczi Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > Hello, > > > > > > > > So far UBLK is only used for implementing virtual block device from > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > I won't be at LSF/MM so here are my thoughts: > > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > - for fast prototype or performance evaluation > > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > the current UBLK interface doesn't support such devices, since it needs > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > Can you explain this in more detail? It seems like an iSCSI or > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > What am I missing? > > > > The current ublk can't do that yet, because the interface doesn't > > support multiple ublk disks sharing single host, which is exactly > > the case of scsi and nvme. > > Can you give an example that shows exactly where a problem is hit? > > I took a quick look at the ublk source code and didn't spot a place > where it prevents a single ublk server process from handling multiple > devices. > > Regarding "host resources(such as tag)", can the ublk server deal with > that in userspace? The Linux block layer doesn't have the concept of a > "host", that would come in at the SCSI/NVMe level that's implemented in > userspace. > > I don't understand yet... blk_mq_tag_set is embedded into driver host structure, and referred by queue via q->tag_set, both scsi and nvme allocates tag in host/queue wide, that said all LUNs/NSs share host/queue tags, current every ublk device is independent, and can't shard tags. > > > > > > > > > > > > > > - SPDK has supported user space driver for real hardware > > > > > > I think this could already be implemented today. There will be extra > > > memory copies because SPDK won't have access to the application's memory > > > pages. > > > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't > > such extra copy per my understanding. > > > > > > > > > > > > > So propose to extend UBLK for supporting real hardware device: > > > > > > > > 1) extend UBLK ABI interface to support disks attached to host, such > > > > as SCSI Luns/NVME Namespaces > > > > > > > > 2) the followings are related with operating hardware from userspace, > > > > so userspace driver has to be trusted, and root is required, and > > > > can't support unprivileged UBLK device > > > > > > Linux VFIO provides a safe userspace API for userspace device drivers. > > > That means memory and interrupts are isolated. Neither userspace nor the > > > hardware device can access memory or interrupts that the userspace > > > process is not allowed to access. 
> > > > > > I think there are still limitations like all memory pages exposed to the > > > device need to be pinned. So effectively you might still need privileges > > > to get the mlock resource limits. > > > > > > But overall I think what you're saying about root and unprivileged ublk > > > devices is not true. Hardware support should be developed with the goal > > > of supporting unprivileged userspace ublk servers. > > > > > > Those unprivileged userspace ublk servers cannot claim any PCI device > > > they want. The user/admin will need to give them permission to open a > > > network card, SCSI HBA, etc. > > > > It depends on implementation, please see > > > > https://spdk.io/doc/userspace.html > > > > ``` > > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and > > then follows along with the NVMe Specification to initialize the device, > > create queue pairs, and ultimately send I/O. > > ``` > > > > The above way needs userspace to operating hardware by the mapped BAR, > > which can't be allowed for unprivileged user. > > From https://spdk.io/doc/system_configuration.html: > > Running SPDK as non-privileged user > > One of the benefits of using the VFIO Linux kernel driver is the > ability to perform DMA operations with peripheral devices as > unprivileged user. The permissions to access particular devices still > need to be granted by the system administrator, but only on a one-time > basis. Note that this functionality is supported with DPDK starting > from version 18.11. > > This is what I had described in my previous reply. My reference on spdk were mostly from spdk/nvme doc. Just take quick look at spdk code, looks both vfio and direct programming hardware are supported: 1) lib/nvme/nvme_vfio_user.c const struct spdk_nvme_transport_ops vfio_ops { .qpair_submit_request = nvme_pcie_qpair_submit_request, 2) lib/nvme/nvme_pcie.c const struct spdk_nvme_transport_ops pcie_ops = { .qpair_submit_request = nvme_pcie_qpair_submit_request nvme_pcie_qpair_submit_tracker nvme_pcie_qpair_submit_tracker nvme_pcie_qpair_ring_sq_doorbell but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply write/read mmaped mmio. > > > > > > > > > > > > > > 3) how to operating hardware memory space > > > > - unbind kernel driver and rebind with uio/vfio > > > > - map PCI BAR into userspace[2], then userspace can operate hardware > > > > with mapped user address via MMIO > > > > > > > > 4) DMA > > > > - DMA requires physical memory address, UBLK driver actually has > > > > block request pages, so can we export request SG list(each segment > > > > physical address, offset, len) into userspace? If the max_segments > > > > limit is not too big(<=64), the needed buffer for holding SG list > > > > can be small enough. > > > > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical > > > address. The IOVA space is defined by the IOMMU page tables. Userspace > > > controls the IOMMU page tables via Linux VFIO ioctls. > > > > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the > > > IOMMU mapping that makes a range of userspace virtual addresses > > > available at a given IOVA. > > > > > > Mapping and unmapping operations are not free. Similar to mmap(2), the > > > program will be slow if it does this frequently. > > > > Yeah, but SPDK shouldn't use vfio DMA interface, see: > > > > https://spdk.io/doc/memory.html > > > > they just programs DMA directly with physical address of pinned hugepages. 
> > From the page you linked: > > IOMMU Support > > ... > > This is a future-proof, hardware-accelerated solution for performing > DMA operations into and out of a user space process and forms the > long-term foundation for SPDK and DPDK's memory management strategy. > We highly recommend that applications are deployed using vfio and the > IOMMU enabled, which is fully supported today. > > Yes, SPDK supports running without IOMMU, but they recommend running > with the IOMMU. > > > > > > > > > I think it's effectively the same problem as ublk zero-copy. We want to > > > give the ublk server access to just the I/O buffers that it currently > > > needs, but doing so would be expensive :(. > > > > > > I think Linux has strategies for avoiding the expense like > > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace > > > and/or the hardware device controller by userspace would still have > > > access to the memory pages after I/O has completed. This reduces memory > > > isolation :(. > > > > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings. > > > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping. > > When using VFIO (recommended by the docs), SPDK uses long-lived DMA > mappings. Here are places in the SPDK/DPDK source code where VFIO DMA > mapping is used: > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371 > https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164 I meant spdk nvme implementation. > > > > > > > > > What I'm trying to get at is that either memory isolation is compromised > > > or performance is reduced. It's hard to have good performance together > > > with memory isolation. > > > > > > I think ublk should follow the VFIO philosophy of being a safe > > > kernel/userspace interface. If userspace is malicious or buggy, the > > > kernel's and other process' memory should not be corrupted. > > > > It is tradeoff between performance and isolation, that is why I mention > > that directing programming hardware in userspace can be done by root > > only. > > Yes, there is a trade-off. Over the years the use of unsafe approaches > has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As > secure boot, integrity architecture, and stuff like that becomes more > widely used, it's harder to include features that break memory isolation > in software in mainstream distros. There can be an option to sacrifice > memory isolation for performance and some users may be willing to accept > the trade-off. I think it should be an option feature though. > > I did want to point out that the statement that "direct programming > hardware in userspace can be done by root only" is false (see VFIO). Unfortunately not see vfio is used when spdk/nvme is operating hardware mmio. > > > > > > > > > > > > - small amount of physical memory for using as DMA descriptor can be > > > > pre-allocated from userspace, and ask kernel to pin pages, then still > > > > return physical address to userspace for programming DMA > > > > > > I think this is possible today. The ublk server owns the I/O buffers. It > > > can mlock them and DMA map them via VFIO. ublk doesn't need to know > > > anything about this. > > > > It depends on if such VFIO DMA mapping is required for each IO. If it > > is required, that won't help one high performance driver. > > It is not necessary to perform a DMA mapping for each IO. ublk's > existing model is sufficient: > 1. 
ublk server allocates I/O buffers and VFIO DMA maps them on startup. > 2. At runtime the ublk server provides these I/O buffers to the kernel, > no further DMA mapping is required. > > Unfortunately there's still the kernel<->userspace copy that existing > ublk applications have, but there's no new overhead related to VFIO. We are working on ublk zero copy for avoiding the copy. > > > > > > > > - this way is still zero copy > > > > > > True zero-copy would be when an application does O_DIRECT I/O and the > > > hardware device DMAs to/from the application's memory pages. ublk > > > doesn't do that today and when combined with VFIO it doesn't get any > > > easier. I don't think it's possible because you cannot allow userspace > > > to control a hardware device and grant DMA access to pages that > > > userspace isn't allowed to access. A malicious userspace will program > > > the device to access those pages :). > > > > But that should be what SPDK nvme/pci is doing per the above links, :-) > > Sure, it's possible to break memory isolation. Breaking memory isolation > isn't specific to ublk servers that access hardware. The same unsafe > zero-copy approach would probably also work for regular ublk servers. > This is basically bringing back /dev/kmem :). > > > > > > > > > > > > > > 5) notification from hardware: interrupt or polling > > > > - SPDK applies userspace polling, this way is doable, but > > > > eat CPU, so it is only one choice > > > > > > > > - io_uring command has been proved as very efficient, if io_uring > > > > command is applied(similar way with UBLK for forwarding blk io > > > > command from kernel to userspace) to uio/vfio for delivering interrupt, > > > > which should be efficient too, given batching processes are done after > > > > the io_uring command is completed > > > > > > I wonder how much difference there is between the new io_uring command > > > for receiving VFIO irqs that you are suggesting compared to the existing > > > io_uring approach IORING_OP_READ eventfd. > > > > eventfd needs extra read/write on the event fd, so more syscalls are > > required. > > No extra syscall is required because IORING_OP_READ is used to read the > eventfd, but maybe you were referring to bypassing the > file->f_op->read() code path? OK, missed that, it is usually done in the following way: io_uring_prep_poll_add(sqe, evfd, POLLIN) sqe->flags |= IOSQE_IO_LINK; ... sqe = io_uring_get_sqe(&ring); io_uring_prep_readv(sqe, evfd, &vec, 1, 0); sqe->flags |= IOSQE_IO_LINK; When I get time, will compare the two and see which one performs better. thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-13 3:47 ` Ming Lei @ 2023-02-13 19:13 ` Stefan Hajnoczi 2023-02-15 0:51 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-13 19:13 UTC (permalink / raw) To: Ming Lei Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 15994 bytes --] On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > Hello, > > > > > > > > > > So far UBLK is only used for implementing virtual block device from > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > - for fast prototype or performance evaluation > > > > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > What am I missing? > > > > > > The current ublk can't do that yet, because the interface doesn't > > > support multiple ublk disks sharing single host, which is exactly > > > the case of scsi and nvme. > > > > Can you give an example that shows exactly where a problem is hit? > > > > I took a quick look at the ublk source code and didn't spot a place > > where it prevents a single ublk server process from handling multiple > > devices. > > > > Regarding "host resources(such as tag)", can the ublk server deal with > > that in userspace? The Linux block layer doesn't have the concept of a > > "host", that would come in at the SCSI/NVMe level that's implemented in > > userspace. > > > > I don't understand yet... > > blk_mq_tag_set is embedded into driver host structure, and referred by queue > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > that said all LUNs/NSs share host/queue tags, current every ublk > device is independent, and can't shard tags. Does this actually prevent ublk servers with multiple ublk devices or is it just sub-optimal? Also, is this specific to real storage hardware? I guess userspace NVMe-over-TCP or iSCSI initiators would be affected regardless of whether they simply use the Sockets API (software) or userspace device drivers (hardware). Sorry for all these questions, I think I'm a little confused because you said "doesn't support such devices" and I thought this discussion was about real storage hardware. Neither of these seem to apply to the tag_set issue. > > > > > > > > > > > > > > > > > > > > - SPDK has supported user space driver for real hardware > > > > > > > > I think this could already be implemented today. There will be extra > > > > memory copies because SPDK won't have access to the application's memory > > > > pages. > > > > > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't > > > such extra copy per my understanding. 
> > > > > > > > > > > > > > > > > So propose to extend UBLK for supporting real hardware device: > > > > > > > > > > 1) extend UBLK ABI interface to support disks attached to host, such > > > > > as SCSI Luns/NVME Namespaces > > > > > > > > > > 2) the followings are related with operating hardware from userspace, > > > > > so userspace driver has to be trusted, and root is required, and > > > > > can't support unprivileged UBLK device > > > > > > > > Linux VFIO provides a safe userspace API for userspace device drivers. > > > > That means memory and interrupts are isolated. Neither userspace nor the > > > > hardware device can access memory or interrupts that the userspace > > > > process is not allowed to access. > > > > > > > > I think there are still limitations like all memory pages exposed to the > > > > device need to be pinned. So effectively you might still need privileges > > > > to get the mlock resource limits. > > > > > > > > But overall I think what you're saying about root and unprivileged ublk > > > > devices is not true. Hardware support should be developed with the goal > > > > of supporting unprivileged userspace ublk servers. > > > > > > > > Those unprivileged userspace ublk servers cannot claim any PCI device > > > > they want. The user/admin will need to give them permission to open a > > > > network card, SCSI HBA, etc. > > > > > > It depends on implementation, please see > > > > > > https://spdk.io/doc/userspace.html > > > > > > ``` > > > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and > > > then follows along with the NVMe Specification to initialize the device, > > > create queue pairs, and ultimately send I/O. > > > ``` > > > > > > The above way needs userspace to operating hardware by the mapped BAR, > > > which can't be allowed for unprivileged user. > > > > From https://spdk.io/doc/system_configuration.html: > > > > Running SPDK as non-privileged user > > > > One of the benefits of using the VFIO Linux kernel driver is the > > ability to perform DMA operations with peripheral devices as > > unprivileged user. The permissions to access particular devices still > > need to be granted by the system administrator, but only on a one-time > > basis. Note that this functionality is supported with DPDK starting > > from version 18.11. > > > > This is what I had described in my previous reply. > > My reference on spdk were mostly from spdk/nvme doc. > Just take quick look at spdk code, looks both vfio and direct > programming hardware are supported: > > 1) lib/nvme/nvme_vfio_user.c > const struct spdk_nvme_transport_ops vfio_ops { > .qpair_submit_request = nvme_pcie_qpair_submit_request, Ignore this, it's the userspace vfio-user UNIX domain socket protocol support. It's not kernel VFIO and is unrelated to what we're discussing. More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/ > > > 2) lib/nvme/nvme_pcie.c > const struct spdk_nvme_transport_ops pcie_ops = { > .qpair_submit_request = nvme_pcie_qpair_submit_request > nvme_pcie_qpair_submit_tracker > nvme_pcie_qpair_submit_tracker > nvme_pcie_qpair_ring_sq_doorbell > > but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply > write/read mmaped mmio. I have only a small amount of SPDK code experienced, so this might be wrong, but I think the NVMe PCI driver code does not need to directly call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system abstractions and device driver APIs. 
DMA memory is mapped permanently so the device driver doesn't need to perform individual map/unmap operations in the data path. NVMe PCI request submission builds the NVMe command structures containing device addresses (i.e. IOVAs when IOMMU is enabled). This code probably supports both IOMMU (VFIO) and non-IOMMU operation. > > > > > > > > > > > > > > > > > > > > 3) how to operating hardware memory space > > > > > - unbind kernel driver and rebind with uio/vfio > > > > > - map PCI BAR into userspace[2], then userspace can operate hardware > > > > > with mapped user address via MMIO > > > > > > > > > > 4) DMA > > > > > - DMA requires physical memory address, UBLK driver actually has > > > > > block request pages, so can we export request SG list(each segment > > > > > physical address, offset, len) into userspace? If the max_segments > > > > > limit is not too big(<=64), the needed buffer for holding SG list > > > > > can be small enough. > > > > > > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical > > > > address. The IOVA space is defined by the IOMMU page tables. Userspace > > > > controls the IOMMU page tables via Linux VFIO ioctls. > > > > > > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the > > > > IOMMU mapping that makes a range of userspace virtual addresses > > > > available at a given IOVA. > > > > > > > > Mapping and unmapping operations are not free. Similar to mmap(2), the > > > > program will be slow if it does this frequently. > > > > > > Yeah, but SPDK shouldn't use vfio DMA interface, see: > > > > > > https://spdk.io/doc/memory.html > > > > > > they just programs DMA directly with physical address of pinned hugepages. > > > > From the page you linked: > > > > IOMMU Support > > > > ... > > > > This is a future-proof, hardware-accelerated solution for performing > > DMA operations into and out of a user space process and forms the > > long-term foundation for SPDK and DPDK's memory management strategy. > > We highly recommend that applications are deployed using vfio and the > > IOMMU enabled, which is fully supported today. > > > > Yes, SPDK supports running without IOMMU, but they recommend running > > with the IOMMU. > > > > > > > > > > > > > I think it's effectively the same problem as ublk zero-copy. We want to > > > > give the ublk server access to just the I/O buffers that it currently > > > > needs, but doing so would be expensive :(. > > > > > > > > I think Linux has strategies for avoiding the expense like > > > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace > > > > and/or the hardware device controller by userspace would still have > > > > access to the memory pages after I/O has completed. This reduces memory > > > > isolation :(. > > > > > > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings. > > > > > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping. > > > > When using VFIO (recommended by the docs), SPDK uses long-lived DMA > > mappings. Here are places in the SPDK/DPDK source code where VFIO DMA > > mapping is used: > > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371 > > https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164 > > I meant spdk nvme implementation. I did too. The NVMe PCI driver will use the PCI driver APIs and the EAL (operating system abstraction) will deal with IOMMU APIs (VFIO) transparently. 
> > > > > > > > > > > > > > What I'm trying to get at is that either memory isolation is compromised > > > > or performance is reduced. It's hard to have good performance together > > > > with memory isolation. > > > > > > > > I think ublk should follow the VFIO philosophy of being a safe > > > > kernel/userspace interface. If userspace is malicious or buggy, the > > > > kernel's and other process' memory should not be corrupted. > > > > > > It is tradeoff between performance and isolation, that is why I mention > > > that directing programming hardware in userspace can be done by root > > > only. > > > > Yes, there is a trade-off. Over the years the use of unsafe approaches > > has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As > > secure boot, integrity architecture, and stuff like that becomes more > > widely used, it's harder to include features that break memory isolation > > in software in mainstream distros. There can be an option to sacrifice > > memory isolation for performance and some users may be willing to accept > > the trade-off. I think it should be an option feature though. > > > > I did want to point out that the statement that "direct programming > > hardware in userspace can be done by root only" is false (see VFIO). > > Unfortunately not see vfio is used when spdk/nvme is operating hardware > mmio. I think my responses above answered this, but just to be clear: with VFIO PCI userspace mmaps the BARs and performs direct accesses to them (load/store instructions). No VFIO API wrappers are necessary for MMIO accesses, so the code you posted works fine with VFIO. > > > > > > > > > > > > > > > > > - small amount of physical memory for using as DMA descriptor can be > > > > > pre-allocated from userspace, and ask kernel to pin pages, then still > > > > > return physical address to userspace for programming DMA > > > > > > > > I think this is possible today. The ublk server owns the I/O buffers. It > > > > can mlock them and DMA map them via VFIO. ublk doesn't need to know > > > > anything about this. > > > > > > It depends on if such VFIO DMA mapping is required for each IO. If it > > > is required, that won't help one high performance driver. > > > > It is not necessary to perform a DMA mapping for each IO. ublk's > > existing model is sufficient: > > 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup. > > 2. At runtime the ublk server provides these I/O buffers to the kernel, > > no further DMA mapping is required. > > > > Unfortunately there's still the kernel<->userspace copy that existing > > ublk applications have, but there's no new overhead related to VFIO. > > We are working on ublk zero copy for avoiding the copy. I'm curious if it's possible to come up with a solution that doesn't break memory isolation. Userspace controls the IOMMU with Linux VFIO, so if kernel pages are exposed to the device, then userspace will also be able to access them (e.g. by submitting a request that gets the device to DMA those pages). > > > > > > > > > > > > - this way is still zero copy > > > > > > > > True zero-copy would be when an application does O_DIRECT I/O and the > > > > hardware device DMAs to/from the application's memory pages. ublk > > > > doesn't do that today and when combined with VFIO it doesn't get any > > > > easier. I don't think it's possible because you cannot allow userspace > > > > to control a hardware device and grant DMA access to pages that > > > > userspace isn't allowed to access. 
A malicious userspace will program > > > > the device to access those pages :). > > > > > > But that should be what SPDK nvme/pci is doing per the above links, :-) > > > > Sure, it's possible to break memory isolation. Breaking memory isolation > > isn't specific to ublk servers that access hardware. The same unsafe > > zero-copy approach would probably also work for regular ublk servers. > > This is basically bringing back /dev/kmem :). > > > > > > > > > > > > > > > > > > > 5) notification from hardware: interrupt or polling > > > > > - SPDK applies userspace polling, this way is doable, but > > > > > eat CPU, so it is only one choice > > > > > > > > > > - io_uring command has been proved as very efficient, if io_uring > > > > > command is applied(similar way with UBLK for forwarding blk io > > > > > command from kernel to userspace) to uio/vfio for delivering interrupt, > > > > > which should be efficient too, given batching processes are done after > > > > > the io_uring command is completed > > > > > > > > I wonder how much difference there is between the new io_uring command > > > > for receiving VFIO irqs that you are suggesting compared to the existing > > > > io_uring approach IORING_OP_READ eventfd. > > > > > > eventfd needs extra read/write on the event fd, so more syscalls are > > > required. > > > > No extra syscall is required because IORING_OP_READ is used to read the > > eventfd, but maybe you were referring to bypassing the > > file->f_op->read() code path? > > OK, missed that, it is usually done in the following way: > > io_uring_prep_poll_add(sqe, evfd, POLLIN) > sqe->flags |= IOSQE_IO_LINK; > ... > sqe = io_uring_get_sqe(&ring); > io_uring_prep_readv(sqe, evfd, &vec, 1, 0); > sqe->flags |= IOSQE_IO_LINK; > > When I get time, will compare the two and see which one performs better. That would be really interesting. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
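To make the "long-lived DMA mapping" point concrete, below is a minimal sketch (an assumption for illustration, not code from the thread or from SPDK) of how a ublk server could map its whole I/O buffer pool once at startup through VFIO. "container_fd" is assumed to be a VFIO container that already has the device's group attached and VFIO_SET_IOMMU done; the IOVA scheme and function name are arbitrary.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Map [buf, buf + len) at the chosen IOVA for the lifetime of the mapping.
 * The kernel pins the pages here, so this is a one-time setup cost and
 * nothing has to be mapped or unmapped in the per-I/O path.
 */
static int dma_map_buffer_pool(int container_fd, void *buf, size_t len,
			       uint64_t iova)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,	/* userspace VA of the buffer pool */
		.iova  = iova,			/* address the device gets programmed with */
		.size  = len,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

The matching VFIO_IOMMU_UNMAP_DMA only needs to happen at teardown, which is what keeps the per-I/O overhead at zero while still going through the IOMMU.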
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-13 19:13 ` Stefan Hajnoczi @ 2023-02-15 0:51 ` Ming Lei 2023-02-15 15:27 ` Stefan Hajnoczi 2023-02-16 9:44 ` Andreas Hindborg 0 siblings, 2 replies; 34+ messages in thread From: Ming Lei @ 2023-02-15 0:51 UTC (permalink / raw) To: Stefan Hajnoczi Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > Hello, > > > > > > > > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > > > > > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > What am I missing? > > > > > > > > The current ublk can't do that yet, because the interface doesn't > > > > support multiple ublk disks sharing single host, which is exactly > > > > the case of scsi and nvme. > > > > > > Can you give an example that shows exactly where a problem is hit? > > > > > > I took a quick look at the ublk source code and didn't spot a place > > > where it prevents a single ublk server process from handling multiple > > > devices. > > > > > > Regarding "host resources(such as tag)", can the ublk server deal with > > > that in userspace? The Linux block layer doesn't have the concept of a > > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > userspace. > > > > > > I don't understand yet... > > > > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > that said all LUNs/NSs share host/queue tags, current every ublk > > device is independent, and can't shard tags. > > Does this actually prevent ublk servers with multiple ublk devices or is > it just sub-optimal? It is former, ublk can't support multiple devices which share single host because duplicated tag can be seen in host side, then io is failed. > > Also, is this specific to real storage hardware? I guess userspace > NVMe-over-TCP or iSCSI initiators would be affected regardless of > whether they simply use the Sockets API (software) or userspace device > drivers (hardware). > > Sorry for all these questions, I think I'm a little confused because you > said "doesn't support such devices" and I thought this discussion was > about real storage hardware. Neither of these seem to apply to the > tag_set issue. 
The reality is that both scsi and nvme(either virt or real hardware) supports multi LUNs/NSs, so tag_set issue has to be solved, or multi-LUNs/NSs has to be supported. > > > > > > > > > > > > > > > > > > > > > > > > > > > - SPDK has supported user space driver for real hardware > > > > > > > > > > I think this could already be implemented today. There will be extra > > > > > memory copies because SPDK won't have access to the application's memory > > > > > pages. > > > > > > > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't > > > > such extra copy per my understanding. > > > > > > > > > > > > > > > > > > > > > So propose to extend UBLK for supporting real hardware device: > > > > > > > > > > > > 1) extend UBLK ABI interface to support disks attached to host, such > > > > > > as SCSI Luns/NVME Namespaces > > > > > > > > > > > > 2) the followings are related with operating hardware from userspace, > > > > > > so userspace driver has to be trusted, and root is required, and > > > > > > can't support unprivileged UBLK device > > > > > > > > > > Linux VFIO provides a safe userspace API for userspace device drivers. > > > > > That means memory and interrupts are isolated. Neither userspace nor the > > > > > hardware device can access memory or interrupts that the userspace > > > > > process is not allowed to access. > > > > > > > > > > I think there are still limitations like all memory pages exposed to the > > > > > device need to be pinned. So effectively you might still need privileges > > > > > to get the mlock resource limits. > > > > > > > > > > But overall I think what you're saying about root and unprivileged ublk > > > > > devices is not true. Hardware support should be developed with the goal > > > > > of supporting unprivileged userspace ublk servers. > > > > > > > > > > Those unprivileged userspace ublk servers cannot claim any PCI device > > > > > they want. The user/admin will need to give them permission to open a > > > > > network card, SCSI HBA, etc. > > > > > > > > It depends on implementation, please see > > > > > > > > https://spdk.io/doc/userspace.html > > > > > > > > ``` > > > > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and > > > > then follows along with the NVMe Specification to initialize the device, > > > > create queue pairs, and ultimately send I/O. > > > > ``` > > > > > > > > The above way needs userspace to operating hardware by the mapped BAR, > > > > which can't be allowed for unprivileged user. > > > > > > From https://spdk.io/doc/system_configuration.html: > > > > > > Running SPDK as non-privileged user > > > > > > One of the benefits of using the VFIO Linux kernel driver is the > > > ability to perform DMA operations with peripheral devices as > > > unprivileged user. The permissions to access particular devices still > > > need to be granted by the system administrator, but only on a one-time > > > basis. Note that this functionality is supported with DPDK starting > > > from version 18.11. > > > > > > This is what I had described in my previous reply. > > > > My reference on spdk were mostly from spdk/nvme doc. > > Just take quick look at spdk code, looks both vfio and direct > > programming hardware are supported: > > > > 1) lib/nvme/nvme_vfio_user.c > > const struct spdk_nvme_transport_ops vfio_ops { > > .qpair_submit_request = nvme_pcie_qpair_submit_request, > > Ignore this, it's the userspace vfio-user UNIX domain socket protocol > support. 
It's not kernel VFIO and is unrelated to what we're discussing. > More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/ Not sure, why does .qpair_submit_request point to nvme_pcie_qpair_submit_request? > > > > > > > 2) lib/nvme/nvme_pcie.c > > const struct spdk_nvme_transport_ops pcie_ops = { > > .qpair_submit_request = nvme_pcie_qpair_submit_request > > nvme_pcie_qpair_submit_tracker > > nvme_pcie_qpair_submit_tracker > > nvme_pcie_qpair_ring_sq_doorbell > > > > but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply > > write/read mmaped mmio. > > I have only a small amount of SPDK code experienced, so this might be Me too. > wrong, but I think the NVMe PCI driver code does not need to directly > call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system > abstractions and device driver APIs. > > DMA memory is mapped permanently so the device driver doesn't need to > perform individual map/unmap operations in the data path. NVMe PCI > request submission builds the NVMe command structures containing device > addresses (i.e. IOVAs when IOMMU is enabled). If IOMMU isn't used, it is physical address of memory. Then I guess you may understand why I said this way can't be done by un-privileged user, cause driver is writing memory physical address to device register directly. But other driver can follow this approach if the way is accepted. > > This code probably supports both IOMMU (VFIO) and non-IOMMU operation. > > > > > > > > > > > > > > > > > > > > > > > > > > > 3) how to operating hardware memory space > > > > > > - unbind kernel driver and rebind with uio/vfio > > > > > > - map PCI BAR into userspace[2], then userspace can operate hardware > > > > > > with mapped user address via MMIO > > > > > > > > > > > > 4) DMA > > > > > > - DMA requires physical memory address, UBLK driver actually has > > > > > > block request pages, so can we export request SG list(each segment > > > > > > physical address, offset, len) into userspace? If the max_segments > > > > > > limit is not too big(<=64), the needed buffer for holding SG list > > > > > > can be small enough. > > > > > > > > > > DMA with an IOMMU requires an I/O Virtual Address, not a CPU physical > > > > > address. The IOVA space is defined by the IOMMU page tables. Userspace > > > > > controls the IOMMU page tables via Linux VFIO ioctls. > > > > > > > > > > For example, <linux/vfio.h> struct vfio_iommu_type1_dma_map defines the > > > > > IOMMU mapping that makes a range of userspace virtual addresses > > > > > available at a given IOVA. > > > > > > > > > > Mapping and unmapping operations are not free. Similar to mmap(2), the > > > > > program will be slow if it does this frequently. > > > > > > > > Yeah, but SPDK shouldn't use vfio DMA interface, see: > > > > > > > > https://spdk.io/doc/memory.html > > > > > > > > they just programs DMA directly with physical address of pinned hugepages. > > > > > > From the page you linked: > > > > > > IOMMU Support > > > > > > ... > > > > > > This is a future-proof, hardware-accelerated solution for performing > > > DMA operations into and out of a user space process and forms the > > > long-term foundation for SPDK and DPDK's memory management strategy. > > > We highly recommend that applications are deployed using vfio and the > > > IOMMU enabled, which is fully supported today. > > > > > > Yes, SPDK supports running without IOMMU, but they recommend running > > > with the IOMMU. 
> > > > > > > > > > > > > > > > > I think it's effectively the same problem as ublk zero-copy. We want to > > > > > give the ublk server access to just the I/O buffers that it currently > > > > > needs, but doing so would be expensive :(. > > > > > > > > > > I think Linux has strategies for avoiding the expense like > > > > > iommu.strict=0 and swiotlb. The drawback is that in our case userspace > > > > > and/or the hardware device controller by userspace would still have > > > > > access to the memory pages after I/O has completed. This reduces memory > > > > > isolation :(. > > > > > > > > > > DPDK/SPDK and QEMU use long-lived Linux VFIO DMA mappings. > > > > > > > > Per the above SPDK links, the nvme-pci doesn't use vfio dma mapping. > > > > > > When using VFIO (recommended by the docs), SPDK uses long-lived DMA > > > mappings. Here are places in the SPDK/DPDK source code where VFIO DMA > > > mapping is used: > > > https://github.com/spdk/spdk/blob/master/lib/env_dpdk/memory.c#L1371 > > > https://github.com/spdk/dpdk/blob/e89c0845a60831864becc261cff48dd9321e7e79/lib/eal/linux/eal_vfio.c#L2164 > > > > I meant spdk nvme implementation. > > I did too. The NVMe PCI driver will use the PCI driver APIs and the EAL > (operating system abstraction) will deal with IOMMU APIs (VFIO) > transparently. > > > > > > > > > > > > > > > > > > > > What I'm trying to get at is that either memory isolation is compromised > > > > > or performance is reduced. It's hard to have good performance together > > > > > with memory isolation. > > > > > > > > > > I think ublk should follow the VFIO philosophy of being a safe > > > > > kernel/userspace interface. If userspace is malicious or buggy, the > > > > > kernel's and other process' memory should not be corrupted. > > > > > > > > It is tradeoff between performance and isolation, that is why I mention > > > > that directing programming hardware in userspace can be done by root > > > > only. > > > > > > Yes, there is a trade-off. Over the years the use of unsafe approaches > > > has been discouraged and replaced (/dev/kmem, uio -> VFIO, etc). As > > > secure boot, integrity architecture, and stuff like that becomes more > > > widely used, it's harder to include features that break memory isolation > > > in software in mainstream distros. There can be an option to sacrifice > > > memory isolation for performance and some users may be willing to accept > > > the trade-off. I think it should be an option feature though. > > > > > > I did want to point out that the statement that "direct programming > > > hardware in userspace can be done by root only" is false (see VFIO). > > > > Unfortunately not see vfio is used when spdk/nvme is operating hardware > > mmio. > > I think my responses above answered this, but just to be clear: with > VFIO PCI userspace mmaps the BARs and performs direct accesses to them > (load/store instructions). No VFIO API wrappers are necessary for MMIO > accesses, so the code you posted works fine with VFIO. > > > > > > > > > > > > > > > > > > > > > > > - small amount of physical memory for using as DMA descriptor can be > > > > > > pre-allocated from userspace, and ask kernel to pin pages, then still > > > > > > return physical address to userspace for programming DMA > > > > > > > > > > I think this is possible today. The ublk server owns the I/O buffers. It > > > > > can mlock them and DMA map them via VFIO. ublk doesn't need to know > > > > > anything about this. 
> > > > > > > > It depends on if such VFIO DMA mapping is required for each IO. If it > > > > is required, that won't help one high performance driver. > > > > > > It is not necessary to perform a DMA mapping for each IO. ublk's > > > existing model is sufficient: > > > 1. ublk server allocates I/O buffers and VFIO DMA maps them on startup. > > > 2. At runtime the ublk server provides these I/O buffers to the kernel, > > > no further DMA mapping is required. > > > > > > Unfortunately there's still the kernel<->userspace copy that existing > > > ublk applications have, but there's no new overhead related to VFIO. > > > > We are working on ublk zero copy for avoiding the copy. > > I'm curious if it's possible to come up with a solution that doesn't > break memory isolation. Userspace controls the IOMMU with Linux VFIO, so > if kernel pages are exposed to the device, then userspace will also be > able to access them (e.g. by submitting a request that gets the device > to DMA those pages). spdk nvme already exposes physical address of memory and uses the physical address to program hardware directly. And I think it can't be done by un-trusted user. But I agree with you that this way should be avoided as far as possible. > > > > > > > > > > > > > > > > > - this way is still zero copy > > > > > > > > > > True zero-copy would be when an application does O_DIRECT I/O and the > > > > > hardware device DMAs to/from the application's memory pages. ublk > > > > > doesn't do that today and when combined with VFIO it doesn't get any > > > > > easier. I don't think it's possible because you cannot allow userspace > > > > > to control a hardware device and grant DMA access to pages that > > > > > userspace isn't allowed to access. A malicious userspace will program > > > > > the device to access those pages :). > > > > > > > > But that should be what SPDK nvme/pci is doing per the above links, :-) > > > > > > Sure, it's possible to break memory isolation. Breaking memory isolation > > > isn't specific to ublk servers that access hardware. The same unsafe > > > zero-copy approach would probably also work for regular ublk servers. > > > This is basically bringing back /dev/kmem :). > > > > > > > > > > > > > > > > > > > > > > > > 5) notification from hardware: interrupt or polling > > > > > > - SPDK applies userspace polling, this way is doable, but > > > > > > eat CPU, so it is only one choice > > > > > > > > > > > > - io_uring command has been proved as very efficient, if io_uring > > > > > > command is applied(similar way with UBLK for forwarding blk io > > > > > > command from kernel to userspace) to uio/vfio for delivering interrupt, > > > > > > which should be efficient too, given batching processes are done after > > > > > > the io_uring command is completed > > > > > > > > > > I wonder how much difference there is between the new io_uring command > > > > > for receiving VFIO irqs that you are suggesting compared to the existing > > > > > io_uring approach IORING_OP_READ eventfd. > > > > > > > > eventfd needs extra read/write on the event fd, so more syscalls are > > > > required. > > > > > > No extra syscall is required because IORING_OP_READ is used to read the > > > eventfd, but maybe you were referring to bypassing the > > > file->f_op->read() code path? > > > > OK, missed that, it is usually done in the following way: > > > > io_uring_prep_poll_add(sqe, evfd, POLLIN) > > sqe->flags |= IOSQE_IO_LINK; > > ... 
> > sqe = io_uring_get_sqe(&ring); > > io_uring_prep_readv(sqe, evfd, &vec, 1, 0); > > sqe->flags |= IOSQE_IO_LINK; > > > > When I get time, will compare the two and see which one performs better. > > That would be really interesting. Anyway, interrupt notification doesn't look like a big deal. Thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
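For completeness, here is a minimal sketch of the "mmap the BAR and use plain loads/stores" path discussed above, going through kernel VFIO rather than SPDK. It is illustrative only, not code from the thread: "device_fd" is assumed to come from VFIO_GROUP_GET_DEVICE_FD, and the doorbell offset and function names are made up for the example.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

/* mmap BAR0 of a VFIO-bound PCI device; subsequent loads/stores through the
 * returned pointer are direct MMIO accesses, with no VFIO call per access. */
static volatile uint32_t *map_bar0(int device_fd, size_t *len)
{
	struct vfio_region_info region = {
		.argsz = sizeof(region),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};
	void *bar;

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &region))
		return NULL;

	bar = mmap(NULL, region.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   device_fd, region.offset);
	if (bar == MAP_FAILED)
		return NULL;

	*len = region.size;
	return bar;
}

/* e.g. ringing an NVMe-style submission queue doorbell (offset illustrative) */
static void ring_sq_doorbell(volatile uint32_t *bar0, uint32_t sq_tail)
{
	bar0[0x1000 / sizeof(uint32_t)] = sq_tail;
}

Whether the device can then DMA into anything it should not is decided by the IOMMU mappings, not by how the MMIO access is performed, which is the isolation point being argued back and forth above.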
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-15 0:51 ` Ming Lei @ 2023-02-15 15:27 ` Stefan Hajnoczi 2023-02-16 0:46 ` Ming Lei 2023-02-16 9:44 ` Andreas Hindborg 1 sibling, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-15 15:27 UTC (permalink / raw) To: Ming Lei Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 9192 bytes --] On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote: > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > Hello, > > > > > > > > > > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > > > > > > > > > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > What am I missing? > > > > > > > > > > The current ublk can't do that yet, because the interface doesn't > > > > > support multiple ublk disks sharing single host, which is exactly > > > > > the case of scsi and nvme. > > > > > > > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > I took a quick look at the ublk source code and didn't spot a place > > > > where it prevents a single ublk server process from handling multiple > > > > devices. > > > > > > > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > that in userspace? The Linux block layer doesn't have the concept of a > > > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > userspace. > > > > > > > > I don't understand yet... > > > > > > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > that said all LUNs/NSs share host/queue tags, current every ublk > > > device is independent, and can't shard tags. > > > > Does this actually prevent ublk servers with multiple ublk devices or is > > it just sub-optimal? > > It is former, ublk can't support multiple devices which share single host > because duplicated tag can be seen in host side, then io is failed. The kernel sees two independent block devices so there is no issue within the kernel. Userspace can do its own hw tag allocation if there are shared storage controller resources (e.g. NVMe CIDs) to avoid duplicating tags. Have I missed something? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - SPDK has supported user space driver for real hardware > > > > > > > > > > > > I think this could already be implemented today. There will be extra > > > > > > memory copies because SPDK won't have access to the application's memory > > > > > > pages. > > > > > > > > > > Here I proposed zero copy, and current SPDK nvme-pci implementation haven't > > > > > such extra copy per my understanding. > > > > > > > > > > > > > > > > > > > > > > > > > So propose to extend UBLK for supporting real hardware device: > > > > > > > > > > > > > > 1) extend UBLK ABI interface to support disks attached to host, such > > > > > > > as SCSI Luns/NVME Namespaces > > > > > > > > > > > > > > 2) the followings are related with operating hardware from userspace, > > > > > > > so userspace driver has to be trusted, and root is required, and > > > > > > > can't support unprivileged UBLK device > > > > > > > > > > > > Linux VFIO provides a safe userspace API for userspace device drivers. > > > > > > That means memory and interrupts are isolated. Neither userspace nor the > > > > > > hardware device can access memory or interrupts that the userspace > > > > > > process is not allowed to access. > > > > > > > > > > > > I think there are still limitations like all memory pages exposed to the > > > > > > device need to be pinned. So effectively you might still need privileges > > > > > > to get the mlock resource limits. > > > > > > > > > > > > But overall I think what you're saying about root and unprivileged ublk > > > > > > devices is not true. Hardware support should be developed with the goal > > > > > > of supporting unprivileged userspace ublk servers. > > > > > > > > > > > > Those unprivileged userspace ublk servers cannot claim any PCI device > > > > > > they want. The user/admin will need to give them permission to open a > > > > > > network card, SCSI HBA, etc. > > > > > > > > > > It depends on implementation, please see > > > > > > > > > > https://spdk.io/doc/userspace.html > > > > > > > > > > ``` > > > > > The SPDK NVMe Driver, for instance, maps the BAR for the NVMe device and > > > > > then follows along with the NVMe Specification to initialize the device, > > > > > create queue pairs, and ultimately send I/O. > > > > > ``` > > > > > > > > > > The above way needs userspace to operating hardware by the mapped BAR, > > > > > which can't be allowed for unprivileged user. > > > > > > > > From https://spdk.io/doc/system_configuration.html: > > > > > > > > Running SPDK as non-privileged user > > > > > > > > One of the benefits of using the VFIO Linux kernel driver is the > > > > ability to perform DMA operations with peripheral devices as > > > > unprivileged user. The permissions to access particular devices still > > > > need to be granted by the system administrator, but only on a one-time > > > > basis. Note that this functionality is supported with DPDK starting > > > > from version 18.11. > > > > > > > > This is what I had described in my previous reply. > > > > > > My reference on spdk were mostly from spdk/nvme doc. > > > Just take quick look at spdk code, looks both vfio and direct > > > programming hardware are supported: > > > > > > 1) lib/nvme/nvme_vfio_user.c > > > const struct spdk_nvme_transport_ops vfio_ops { > > > .qpair_submit_request = nvme_pcie_qpair_submit_request, > > > > Ignore this, it's the userspace vfio-user UNIX domain socket protocol > > support. It's not kernel VFIO and is unrelated to what we're discussing. 
> > More info on vfio-user: https://spdk.io/news/2021/05/04/vfio-user/ > > Not sure, why does .qpair_submit_request point to > nvme_pcie_qpair_submit_request? The lib/nvme/nvme_vfio_user.c code is for when SPDK connects to a vfio-user NVMe PCI device. The vfio-user protocol support is not handled by the regular DPDK/SPDK PCI driver APIs, so the lib/nvme/nvme_pcie.c doesn't work with vfio-user devices. However, a lot of the code can be shared with the regular NVMe PCI driver and that's why .qpair_submit_request points to nvme_pcie_qpair_submit_request instead of a special version for vfio-user. If the vfio-user protocol becomes more widely used for other devices besides NVMe PCI, then I guess the DPDK/SPDK developers will figure out a way to move the vfio-user code into the core PCI driver API so that a single lib/nvme/nvme_pcie.c file works with all PCI APIs (kernel VFIO, vfio-user, etc). The code was probably structured like this because it's hard to make those changes and they wanted to get vfio-user NVMe PCI working quickly. > > > > > > > > > > > > 2) lib/nvme/nvme_pcie.c > > > const struct spdk_nvme_transport_ops pcie_ops = { > > > .qpair_submit_request = nvme_pcie_qpair_submit_request > > > nvme_pcie_qpair_submit_tracker > > > nvme_pcie_qpair_submit_tracker > > > nvme_pcie_qpair_ring_sq_doorbell > > > > > > but vfio dma isn't used in nvme_pcie_qpair_submit_request, and simply > > > write/read mmaped mmio. > > > > I have only a small amount of SPDK code experienced, so this might be > > Me too. > > > wrong, but I think the NVMe PCI driver code does not need to directly > > call VFIO APIs. That is handled by DPDK/SPDK's EAL operating system > > abstractions and device driver APIs. > > > > DMA memory is mapped permanently so the device driver doesn't need to > > perform individual map/unmap operations in the data path. NVMe PCI > > request submission builds the NVMe command structures containing device > > addresses (i.e. IOVAs when IOMMU is enabled). > > If IOMMU isn't used, it is physical address of memory. > > Then I guess you may understand why I said this way can't be done by > un-privileged user, cause driver is writing memory physical address to > device register directly. > > But other driver can follow this approach if the way is accepted. Okay, I understand now that you were thinking of non-IOMMU use cases. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
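As a toy illustration of the userspace tag/CID allocation suggested above (not from the thread, and deliberately ignoring the SMP, robustness and cleanup problems raised in the reply that follows), a per-queue allocator for a shared NVMe/iSCSI queue could start out as small as this, assuming a queue depth of at most 64 and a single submitting thread per queue; all names are hypothetical.

#include <stdint.h>

/* one bit per in-flight command on the shared hardware queue */
struct cid_pool {
	uint64_t inflight;
};

/* returns a free command ID, or -1 if the shared queue is full */
static int cid_alloc(struct cid_pool *p)
{
	int cid;

	if (p->inflight == UINT64_MAX)
		return -1;
	cid = __builtin_ctzll(~p->inflight);	/* lowest clear bit */
	p->inflight |= 1ULL << cid;
	return cid;
}

static void cid_free(struct cid_pool *p, int cid)
{
	p->inflight &= ~(1ULL << cid);
}

The disagreement below is essentially about whether doing this well, shared across queues and threads and robust against a killed server, is worth re-doing what blk-mq and sbitmap already provide in the kernel.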
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-15 15:27 ` Stefan Hajnoczi @ 2023-02-16 0:46 ` Ming Lei 2023-02-16 15:28 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-02-16 0:46 UTC (permalink / raw) To: Stefan Hajnoczi Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang On Wed, Feb 15, 2023 at 10:27:07AM -0500, Stefan Hajnoczi wrote: > On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote: > > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > Hello, > > > > > > > > > > > > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > > > > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > What am I missing? > > > > > > > > > > > > The current ublk can't do that yet, because the interface doesn't > > > > > > support multiple ublk disks sharing single host, which is exactly > > > > > > the case of scsi and nvme. > > > > > > > > > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > > I took a quick look at the ublk source code and didn't spot a place > > > > > where it prevents a single ublk server process from handling multiple > > > > > devices. > > > > > > > > > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > userspace. > > > > > > > > > > I don't understand yet... > > > > > > > > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > that said all LUNs/NSs share host/queue tags, current every ublk > > > > device is independent, and can't shard tags. > > > > > > Does this actually prevent ublk servers with multiple ublk devices or is > > > it just sub-optimal? > > > > It is former, ublk can't support multiple devices which share single host > > because duplicated tag can be seen in host side, then io is failed. > > The kernel sees two independent block devices so there is no issue > within the kernel. This way either wastes memory, or performance is bad since we can't make a perfect queue depth for each ublk device. 
> > Userspace can do its own hw tag allocation if there are shared storage > controller resources (e.g. NVMe CIDs) to avoid duplicating tags. > > Have I missed something? Please look at lib/sbitmap.c and block/blk-mq-tag.c and see how many hard issues have been fixed/reported in the past, and how much optimization has gone into this area. In theory hw tag allocation can be done in userspace, but it is hard to do efficiently: 1) sharing data efficiently across CPUs is a proven hard problem, so don't reinvent the wheel in userspace; this work could take much more effort than extending the current ublk interface, and would be fruitless 2) allocating the tag twice slows down the io path a lot 3) it is even worse for userspace allocation, because the task can be killed without any cleanup, so tags can easily be leaked Thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-16 0:46 ` Ming Lei @ 2023-02-16 15:28 ` Stefan Hajnoczi 0 siblings, 0 replies; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-16 15:28 UTC (permalink / raw) To: Ming Lei Cc: linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 4310 bytes --] On Thu, Feb 16, 2023 at 08:46:56AM +0800, Ming Lei wrote: > On Wed, Feb 15, 2023 at 10:27:07AM -0500, Stefan Hajnoczi wrote: > > On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote: > > > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > > > > > > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > > > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > > > > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > > > > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > What am I missing? > > > > > > > > > > > > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > the case of scsi and nvme. > > > > > > > > > > > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > > > > I took a quick look at the ublk source code and didn't spot a place > > > > > > where it prevents a single ublk server process from handling multiple > > > > > > devices. > > > > > > > > > > > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > userspace. > > > > > > > > > > > > I don't understand yet... > > > > > > > > > > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > device is independent, and can't shard tags. > > > > > > > > Does this actually prevent ublk servers with multiple ublk devices or is > > > > it just sub-optimal? > > > > > > It is former, ublk can't support multiple devices which share single host > > > because duplicated tag can be seen in host side, then io is failed. 
> > The kernel sees two independent block devices so there is no issue > > within the kernel. > > This way either wastes memory, or performance is bad since we can't > make a perfect queue depth for each ublk device. > > > > > Userspace can do its own hw tag allocation if there are shared storage > > controller resources (e.g. NVMe CIDs) to avoid duplicating tags. > > > > Have I missed something? > > Please look at lib/sbitmap.c and block/blk-mq-tag.c and see how many > hard issues have been fixed/reported in the past, and how much > optimization has gone into this area. > > In theory hw tag allocation can be done in userspace, but it is hard > to do efficiently: > > 1) sharing data efficiently across CPUs is a proven hard problem, so > don't reinvent the wheel in userspace; this work could take much more > effort than extending the current ublk interface, and would be > fruitless > > 2) allocating the tag twice slows down the io path a lot > > 3) it is even worse for userspace allocation, because the task can be > killed without any cleanup, so tags can easily be leaked So then it is not "the former" after all? Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-15 0:51 ` Ming Lei 2023-02-15 15:27 ` Stefan Hajnoczi @ 2023-02-16 9:44 ` Andreas Hindborg 2023-02-16 10:45 ` Ming Lei 1 sibling, 1 reply; 34+ messages in thread From: Andreas Hindborg @ 2023-02-16 9:44 UTC (permalink / raw) To: Ming Lei Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, Andreas Hindborg Hi Ming, Ming Lei <ming.lei@redhat.com> writes: > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: >> > > > > > Hello, >> > > > > > >> > > > > > So far UBLK is only used for implementing virtual block device from >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. >> > > > > >> > > > > I won't be at LSF/MM so here are my thoughts: >> > > > >> > > > Thanks for the thoughts, :-) >> > > > >> > > > > >> > > > > > >> > > > > > It could be useful for UBLK to cover real storage hardware too: >> > > > > > >> > > > > > - for fast prototype or performance evaluation >> > > > > > >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, >> > > > > > the current UBLK interface doesn't support such devices, since it needs >> > > > > > all LUNs/Namespaces to share host resources(such as tag) >> > > > > >> > > > > Can you explain this in more detail? It seems like an iSCSI or >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. >> > > > > What am I missing? >> > > > >> > > > The current ublk can't do that yet, because the interface doesn't >> > > > support multiple ublk disks sharing single host, which is exactly >> > > > the case of scsi and nvme. >> > > >> > > Can you give an example that shows exactly where a problem is hit? >> > > >> > > I took a quick look at the ublk source code and didn't spot a place >> > > where it prevents a single ublk server process from handling multiple >> > > devices. >> > > >> > > Regarding "host resources(such as tag)", can the ublk server deal with >> > > that in userspace? The Linux block layer doesn't have the concept of a >> > > "host", that would come in at the SCSI/NVMe level that's implemented in >> > > userspace. >> > > >> > > I don't understand yet... >> > >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, >> > that said all LUNs/NSs share host/queue tags, current every ublk >> > device is independent, and can't shard tags. >> >> Does this actually prevent ublk servers with multiple ublk devices or is >> it just sub-optimal? > > It is former, ublk can't support multiple devices which share single host > because duplicated tag can be seen in host side, then io is failed. > I have trouble following this discussion. Why can we not handle multiple block devices in a single ublk user space process? From this conversation it seems that the limiting factor is allocation of the tag set of the virtual device in the kernel? But as far as I can tell, the tag sets are allocated per virtual block device in `ublk_ctrl_add_dev()`? 
It seems to me that a single ublk user space process should be able to connect to multiple storage devices (for instance nvme-of) and then create a ublk device for each namespace, all from a single ublk process. Could you elaborate on why this is not possible? Best regards, Andreas Hindborg ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-16 9:44 ` Andreas Hindborg @ 2023-02-16 10:45 ` Ming Lei 2023-02-16 11:21 ` Andreas Hindborg 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-02-16 10:45 UTC (permalink / raw) To: Andreas Hindborg Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, Andreas Hindborg, ming.lei On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > Hi Ming, > > Ming Lei <ming.lei@redhat.com> writes: > > > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > >> > > > > > Hello, > >> > > > > > > >> > > > > > So far UBLK is only used for implementing virtual block device from > >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > >> > > > > > >> > > > > I won't be at LSF/MM so here are my thoughts: > >> > > > > >> > > > Thanks for the thoughts, :-) > >> > > > > >> > > > > > >> > > > > > > >> > > > > > It could be useful for UBLK to cover real storage hardware too: > >> > > > > > > >> > > > > > - for fast prototype or performance evaluation > >> > > > > > > >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > >> > > > > > the current UBLK interface doesn't support such devices, since it needs > >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > >> > > > > > >> > > > > Can you explain this in more detail? It seems like an iSCSI or > >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > >> > > > > What am I missing? > >> > > > > >> > > > The current ublk can't do that yet, because the interface doesn't > >> > > > support multiple ublk disks sharing single host, which is exactly > >> > > > the case of scsi and nvme. > >> > > > >> > > Can you give an example that shows exactly where a problem is hit? > >> > > > >> > > I took a quick look at the ublk source code and didn't spot a place > >> > > where it prevents a single ublk server process from handling multiple > >> > > devices. > >> > > > >> > > Regarding "host resources(such as tag)", can the ublk server deal with > >> > > that in userspace? The Linux block layer doesn't have the concept of a > >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > >> > > userspace. > >> > > > >> > > I don't understand yet... > >> > > >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > >> > that said all LUNs/NSs share host/queue tags, current every ublk > >> > device is independent, and can't shard tags. > >> > >> Does this actually prevent ublk servers with multiple ublk devices or is > >> it just sub-optimal? > > > > It is former, ublk can't support multiple devices which share single host > > because duplicated tag can be seen in host side, then io is failed. > > > > I have trouble following this discussion. Why can we not handle multiple > block devices in a single ublk user space process? > > From this conversation it seems that the limiting factor is allocation > of the tag set of the virtual device in the kernel? 
But as far as I can > tell, the tag sets are allocated per virtual block device in > `ublk_ctrl_add_dev()`? > > It seems to me that a single ublk user space process shuld be able to > connect to multiple storage devices (for instance nvme-of) and then > create a ublk device for each namespace, all from a single ublk process. > > Could you elaborate on why this is not possible?

If the multiple storage devices are independent, the current ublk can handle them just fine.

But if these storage devices (such as LUNs in iSCSI, or namespaces in nvme-tcp) share a single host and use a host-wide tagset, the current interface can't work as expected, because the tags are shared among all of these devices. The current ublk interface needs to be extended to cover this case.

Thanks, Ming

^ permalink raw reply [flat|nested] 34+ messages in thread
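For readers less familiar with blk-mq, the sharing Ming describes looks roughly like the sketch below (simplified, illustrative kernel-style C, not actual scsi/nvme driver code): the host embeds one blk_mq_tag_set, and every LUN/namespace request queue allocates from it.

```c
#include <linux/blk-mq.h>

/*
 * Illustrative only: one tag set per host, shared by every LUN/NS queue,
 * so tags must be unique across all of the host's devices.
 */
struct example_host {
	struct blk_mq_tag_set tag_set;	/* host-wide tag space */
};

static struct request_queue *example_add_lun(struct example_host *host)
{
	/*
	 * Each LUN/namespace gets its own request_queue, but q->tag_set
	 * points back at the single host-wide set.
	 */
	return blk_mq_init_queue(&host->tag_set);
}
```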
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-16 10:45 ` Ming Lei @ 2023-02-16 11:21 ` Andreas Hindborg 2023-02-17 2:20 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Andreas Hindborg @ 2023-02-16 11:21 UTC (permalink / raw) To: Ming Lei Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang Ming Lei <ming.lei@redhat.com> writes: > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: >> >> Hi Ming, >> >> Ming Lei <ming.lei@redhat.com> writes: >> >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: >> >> > > > > > Hello, >> >> > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. >> >> > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: >> >> > > > >> >> > > > Thanks for the thoughts, :-) >> >> > > > >> >> > > > > >> >> > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: >> >> > > > > > >> >> > > > > > - for fast prototype or performance evaluation >> >> > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) >> >> > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. >> >> > > > > What am I missing? >> >> > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't >> >> > > > support multiple ublk disks sharing single host, which is exactly >> >> > > > the case of scsi and nvme. >> >> > > >> >> > > Can you give an example that shows exactly where a problem is hit? >> >> > > >> >> > > I took a quick look at the ublk source code and didn't spot a place >> >> > > where it prevents a single ublk server process from handling multiple >> >> > > devices. >> >> > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with >> >> > > that in userspace? The Linux block layer doesn't have the concept of a >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in >> >> > > userspace. >> >> > > >> >> > > I don't understand yet... >> >> > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, >> >> > that said all LUNs/NSs share host/queue tags, current every ublk >> >> > device is independent, and can't shard tags. >> >> >> >> Does this actually prevent ublk servers with multiple ublk devices or is >> >> it just sub-optimal? >> > >> > It is former, ublk can't support multiple devices which share single host >> > because duplicated tag can be seen in host side, then io is failed. >> > >> >> I have trouble following this discussion. Why can we not handle multiple >> block devices in a single ublk user space process? 
>> >> From this conversation it seems that the limiting factor is allocation >> of the tag set of the virtual device in the kernel? But as far as I can >> tell, the tag sets are allocated per virtual block device in >> `ublk_ctrl_add_dev()`? >> >> It seems to me that a single ublk user space process shuld be able to >> connect to multiple storage devices (for instance nvme-of) and then >> create a ublk device for each namespace, all from a single ublk process. >> >> Could you elaborate on why this is not possible? > > If the multiple storages devices are independent, the current ublk can > handle them just fine. > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > share single host, and use host-wide tagset, the current interface can't > work as expected, because tags is shared among all these devices. The > current ublk interface needs to be extended for covering this case. Thanks for clarifying, that is very helpful. Follow up question: What would the implications be if one tried to expose (through ublk) each nvme namespace of an nvme-of controller with an independent tag set? What are the benefits of sharing a tagset across all namespaces of a controller? Best regards, Andreas ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-16 11:21 ` Andreas Hindborg @ 2023-02-17 2:20 ` Ming Lei 2023-02-17 16:39 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-02-17 2:20 UTC (permalink / raw) To: Andreas Hindborg Cc: Stefan Hajnoczi, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > Ming Lei <ming.lei@redhat.com> writes: > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > >> > >> Hi Ming, > >> > >> Ming Lei <ming.lei@redhat.com> writes: > >> > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > >> >> > > > > > Hello, > >> >> > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > >> >> > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > >> >> > > > > >> >> > > > Thanks for the thoughts, :-) > >> >> > > > > >> >> > > > > > >> >> > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > >> >> > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > >> >> > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > >> >> > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > >> >> > > > > What am I missing? > >> >> > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > >> >> > > > support multiple ublk disks sharing single host, which is exactly > >> >> > > > the case of scsi and nvme. > >> >> > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > >> >> > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > >> >> > > where it prevents a single ublk server process from handling multiple > >> >> > > devices. > >> >> > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > >> >> > > userspace. > >> >> > > > >> >> > > I don't understand yet... > >> >> > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > >> >> > device is independent, and can't shard tags. > >> >> > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > >> >> it just sub-optimal? 
> >> > > >> > It is former, ublk can't support multiple devices which share single host > >> > because duplicated tag can be seen in host side, then io is failed. > >> > > >> > >> I have trouble following this discussion. Why can we not handle multiple > >> block devices in a single ublk user space process? > >> > >> From this conversation it seems that the limiting factor is allocation > >> of the tag set of the virtual device in the kernel? But as far as I can > >> tell, the tag sets are allocated per virtual block device in > >> `ublk_ctrl_add_dev()`? > >> > >> It seems to me that a single ublk user space process shuld be able to > >> connect to multiple storage devices (for instance nvme-of) and then > >> create a ublk device for each namespace, all from a single ublk process. > >> > >> Could you elaborate on why this is not possible? > > > > If the multiple storages devices are independent, the current ublk can > > handle them just fine. > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > share single host, and use host-wide tagset, the current interface can't > > work as expected, because tags is shared among all these devices. The > > current ublk interface needs to be extended for covering this case. > > Thanks for clarifying, that is very helpful. > > Follow up question: What would the implications be if one tried to > expose (through ublk) each nvme namespace of an nvme-of controller with > an independent tag set?

https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67

> What are the benefits of sharing a tagset across > all namespaces of a controller?

The userspace implementation can be simplified a lot since generic shared tag allocation isn't needed, while still getting good performance (shared tag allocation across CPUs is a hard problem).

The extension shouldn't be very hard; here are some raw ideas:

1) interface change

- add a new feature flag UBLK_F_SHARED_HOST: multiple ublk devices (ublkcXnY) are attached to the ublk host (ublkhX)

- dev_info.dev_id: in case of UBLK_F_SHARED_HOST, the top 16 bits store the host id (X) and the bottom 16 bits store the device id (Y)

- add two control commands, UBLK_CMD_ADD_HOST and UBLK_CMD_DEL_HOST, still sent to /dev/ublk-control

  The ADD_HOST command allocates one host (char) device with the specified or an allocated host id; the tag_set is allocated as a host resource. The host device (ublkhX) becomes the parent of all ublkcXn* devices.

  Before sending DEL_HOST, all devices attached to this host have to be stopped & removed first, otherwise DEL_HOST won't succeed.

- keep the other interfaces unchanged

  In case of UBLK_F_SHARED_HOST, userspace has to set a correct dev_info.dev_id.host_id, so the ublk driver can associate the device with the specified host.

2) implementation

- the host device (ublkhX) becomes the parent of all ublk char devices ublkcXn*

- apart from the tagset, any other per-host resource abstraction? Looks unnecessary, since everything else is available in userspace

- host-wide error handling: maybe all devices attached to this host need to be recovered, so it should be done in userspace

- a per-host admin queue looks unnecessary, given that host-related management/control tasks are done in userspace directly

- others?

Thanks, Ming

^ permalink raw reply [flat|nested] 34+ messages in thread
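As a concrete illustration of the dev_id layout proposed above, here is a minimal userspace sketch; the macro names are hypothetical and not from any posted patch.

```c
#include <linux/types.h>

/*
 * Hypothetical helpers for the proposed UBLK_F_SHARED_HOST dev_id layout:
 * top 16 bits = host id (X), bottom 16 bits = device id (Y).
 */
#define UBLK_SHARED_DEV_ID(host_id, dev_id) \
	((((__u32)(host_id) & 0xffff) << 16) | ((__u32)(dev_id) & 0xffff))
#define UBLK_SHARED_DEV_ID_HOST(id)	(((__u32)(id) >> 16) & 0xffff)
#define UBLK_SHARED_DEV_ID_DEV(id)	((__u32)(id) & 0xffff)

/* Example: /dev/ublkc3n7 (host 3, device 7) would carry dev_id 0x00030007. */
```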
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-17 2:20 ` Ming Lei @ 2023-02-17 16:39 ` Stefan Hajnoczi 2023-02-18 11:22 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-17 16:39 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 8424 bytes --] On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > >> > > >> Hi Ming, > > >> > > >> Ming Lei <ming.lei@redhat.com> writes: > > >> > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > >> >> > > > > > Hello, > > >> >> > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > >> >> > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > >> >> > > > > > >> >> > > > Thanks for the thoughts, :-) > > >> >> > > > > > >> >> > > > > > > >> >> > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > >> >> > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > >> >> > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > >> >> > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > >> >> > > > > What am I missing? > > >> >> > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > >> >> > > > the case of scsi and nvme. > > >> >> > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > >> >> > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > >> >> > > where it prevents a single ublk server process from handling multiple > > >> >> > > devices. > > >> >> > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > >> >> > > userspace. > > >> >> > > > > >> >> > > I don't understand yet... > > >> >> > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > >> >> > device is independent, and can't shard tags. 
> > >> >> > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > >> >> it just sub-optimal? > > >> > > > >> > It is former, ublk can't support multiple devices which share single host > > >> > because duplicated tag can be seen in host side, then io is failed. > > >> > > > >> > > >> I have trouble following this discussion. Why can we not handle multiple > > >> block devices in a single ublk user space process? > > >> > > >> From this conversation it seems that the limiting factor is allocation > > >> of the tag set of the virtual device in the kernel? But as far as I can > > >> tell, the tag sets are allocated per virtual block device in > > >> `ublk_ctrl_add_dev()`? > > >> > > >> It seems to me that a single ublk user space process shuld be able to > > >> connect to multiple storage devices (for instance nvme-of) and then > > >> create a ublk device for each namespace, all from a single ublk process. > > >> > > >> Could you elaborate on why this is not possible? > > > > > > If the multiple storages devices are independent, the current ublk can > > > handle them just fine. > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > share single host, and use host-wide tagset, the current interface can't > > > work as expected, because tags is shared among all these devices. The > > > current ublk interface needs to be extended for covering this case. > > > > Thanks for clarifying, that is very helpful. > > > > Follow up question: What would the implications be if one tried to > > expose (through ublk) each nvme namespace of an nvme-of controller with > > an independent tag set? > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > What are the benefits of sharing a tagset across > > all namespaces of a controller? > > The userspace implementation can be simplified a lot since generic > shared tag allocation isn't needed, meantime with good performance > (shared tags allocation in SMP is one hard problem) In NVMe, tags are per Submission Queue. AFAIK there's no such thing as shared tags across multiple SQs in NVMe. So userspace doesn't need an SMP tag allocator in the first place: - Each ublk server thread has a separate io_uring context. - Each ublk server thread has its own NVMe Submission Queue. - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace because there are no SMP concerns. The issue isn't tag allocation, it's the fact that the kernel block layer submits requests to userspace that don't fit into the NVMe Submission Queue because multiple devices that appear independent from the kernel perspective are sharing a single NVMe Submission Queue. Userspace needs a basic I/O scheduler to ensure fairness across devices. Round-robin for example. There are no SMP concerns here either. So I don't buy the argument that userspace would have to duplicate the tag allocation code from Linux because that solves a different problem that the ublk server doesn't have. If the kernel is aware of tag sharing, then userspace doesn't have to do (trivial) tag allocation or I/O scheduling. It can simply stuff ublk io commands into NVMe queues without thinking, which wastes fewer CPU cycles and is a little simpler. > The extension shouldn't be very hard, follows some raw ideas: It is definitely nice for the ublk server to tell the kernel about shared resources so the Linux block layer has the best information. I think it's a good idea to add support for that. 
I just disagree with some of the statements you've made about why and especially the claim that ublk doesn't support multiple device servers today. > > 1) interface change > > - add new feature flag of UBLK_F_SHARED_HOST, multiple ublk > devices(ublkcXnY) are attached to the ublk host(ublkhX) > > - dev_info.dev_id: in case of UBLK_F_SHARED_HOST, the top 16bit stores > host id(X), and the bottom 16bit stores device id(Y) > > - add two control commands: UBLK_CMD_ADD_HOST, UBLK_CMD_DEL_HOST > > Still sent to /dev/ublk-control > > ADD_HOST command will allocate one host device(char) with specified host > id or allocated host id, tag_set is allocated as host resource. The > host device(ublkhX) will become parent of all ublkcXn* > > Before sending DEL_HOST, all devices attached to this host have to > be stopped & removed first, otherwise DEL_HOST won't succeed. > > - keep other interfaces not changed > in case of UBLK_F_SHARED_HOST, userspace has to set correct > dev_info.dev_id.host_id, so ublk driver can associate device with > specified host > > 2) implementation > - host device(ublkhX) becomes parent of all ublk char devices of > ublkcXn* > > - except for tagset, other per-host resource abstraction? Looks not > necessary, anything is available in userspace > > - host-wide error handling, maybe all devices attached to this host > need to be recovered, so it should be done in userspace > > - per-host admin queue, looks not necessary, given host related > management/control tasks are done in userspace directly > > - others? > > > Thanks, > Ming > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
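The "basic I/O scheduler to ensure fairness across devices" mentioned in the message above can be as small as the following sketch, assuming one server thread drains the per-device queues feeding its single hardware submission queue round-robin. This is illustrative C with hypothetical types and a stubbed submit helper, not code from any existing ublk server.

```c
#include <stddef.h>

struct ublk_io;				/* one ublk request, definition elided */

struct dev_queue {
	struct ublk_io *pending[64];	/* FIFO of requests not yet submitted */
	unsigned int head, tail;	/* empty when head == tail */
};

/* Hypothetical hook: a real server would build an SQE/CDB and ring the
 * hardware doorbell here.  Stubbed out so the sketch is self-contained. */
static void submit_to_hw(struct ublk_io *io) { (void)io; }

/*
 * Round-robin across the devices sharing this thread's hardware queue,
 * submitting at most one request per device per pass, until the hardware
 * queue is full or nothing is pending.
 */
static void submit_round_robin(struct dev_queue *devs, unsigned int nr_devs,
			       unsigned int hw_free_slots)
{
	unsigned int idle = 0;

	for (unsigned int i = 0; hw_free_slots && idle < nr_devs;
	     i = (i + 1) % nr_devs) {
		struct dev_queue *d = &devs[i];

		if (d->head == d->tail) {	/* nothing pending */
			idle++;
			continue;
		}
		idle = 0;
		submit_to_hw(d->pending[d->head % 64]);
		d->head++;
		hw_free_slots--;
	}
}
```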
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-17 16:39 ` Stefan Hajnoczi @ 2023-02-18 11:22 ` Ming Lei 2023-02-18 18:38 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-02-18 11:22 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > >> > > > >> Hi Ming, > > > >> > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > >> > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > >> >> > > > > > Hello, > > > >> >> > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > >> >> > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > >> >> > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > >> >> > > > > > > >> >> > > > > > > > >> >> > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > >> >> > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > >> >> > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > >> >> > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > >> >> > > > > What am I missing? > > > >> >> > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > >> >> > > > the case of scsi and nvme. > > > >> >> > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > >> >> > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > >> >> > > devices. > > > >> >> > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > >> >> > > userspace. > > > >> >> > > > > > >> >> > > I don't understand yet... 
> > > >> >> > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > >> >> > device is independent, and can't shard tags. > > > >> >> > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > >> >> it just sub-optimal? > > > >> > > > > >> > It is former, ublk can't support multiple devices which share single host > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > >> > > > > >> > > > >> I have trouble following this discussion. Why can we not handle multiple > > > >> block devices in a single ublk user space process? > > > >> > > > >> From this conversation it seems that the limiting factor is allocation > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > >> tell, the tag sets are allocated per virtual block device in > > > >> `ublk_ctrl_add_dev()`? > > > >> > > > >> It seems to me that a single ublk user space process shuld be able to > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > >> create a ublk device for each namespace, all from a single ublk process. > > > >> > > > >> Could you elaborate on why this is not possible? > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > handle them just fine. > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > share single host, and use host-wide tagset, the current interface can't > > > > work as expected, because tags is shared among all these devices. The > > > > current ublk interface needs to be extended for covering this case. > > > > > > Thanks for clarifying, that is very helpful. > > > > > > Follow up question: What would the implications be if one tried to > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > an independent tag set? > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > What are the benefits of sharing a tagset across > > > all namespaces of a controller? > > > > The userspace implementation can be simplified a lot since generic > > shared tag allocation isn't needed, meantime with good performance > > (shared tags allocation in SMP is one hard problem) > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > shared tags across multiple SQs in NVMe. So userspace doesn't need an In reality the max supported nr_queues of nvme is often much less than nr_cpu_ids, for example, lots of nvme-pci devices just support at most 32 queues, I remembered that Azure nvme supports less(just 8 queues). That is because queue isn't free in both software and hardware, which implementation is often tradeoff between performance and cost. Not mention, most of scsi devices are SQ in which tag allocations from all CPUs are against single shared tagset. So there is still per-queue tag allocations from different CPUs which aims at same queue. What we discussed are supposed to be generic solution, not something just for ideal 1:1 mapping device, which isn't dominant in reality. > SMP tag allocator in the first place: > - Each ublk server thread has a separate io_uring context. > - Each ublk server thread has its own NVMe Submission Queue. 
> - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace > because there are no SMP concerns.

It isn't even trivial for the 1:1 mapping: when any ublk server crashes, the globally shared tags it holds are leaked, and the other ublk servers can't use the leaked tags any more. Not to mention there are lots of single-queue devices (1:M), or devices whose nr_queues is much less than nr_cpu_ids (N:M, N < M). It is pretty easy to see 1:M or N:M mappings for both nvme and scsi.

> > The issue isn't tag allocation, it's the fact that the kernel block > layer submits requests to userspace that don't fit into the NVMe > Submission Queue because multiple devices that appear independent from > the kernel perspective are sharing a single NVMe Submission Queue. > Userspace needs a basic I/O scheduler to ensure fairness across devices. > Round-robin for example.

We already have an io scheduler for /dev/ublkbN. Also, what I proposed is just to align the ublk device with the actual device definition, and so far tags are the only shared resource in the generic io code path.

> There are no SMP concerns here either.

No, see above.

> > So I don't buy the argument that userspace would have to duplicate the > tag allocation code from Linux because that solves a different problem > that the ublk server doesn't have. > > If the kernel is aware of tag sharing, then userspace doesn't have to do > (trivial) tag allocation or I/O scheduling. It can simply stuff ublk io

Again, it isn't trivial.

> commands into NVMe queues without thinking, which wastes fewer CPU > cycles and is a little simpler.

Tag allocation is pretty generic and is supposed to be done in the kernel, so userspace isn't supposed to duplicate the non-trivial implementation.

Thanks, Ming

^ permalink raw reply [flat|nested] 34+ messages in thread
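For the N:M case being argued here, a simplified stand-in for blk-mq's default CPU-to-queue mapping shows why several CPUs end up allocating from one queue's tag space when nr_hw_queues < nr_cpu_ids. The real blk_mq_map_queues() also considers CPU topology; this is only an approximation.

```c
/*
 * Rough approximation of blk-mq's default CPU-to-queue mapping when
 * nr_hw_queues < nr_cpu_ids: many CPUs necessarily share one queue,
 * and therefore one per-queue tag space.
 */
static unsigned int cpu_to_hw_queue(unsigned int cpu, unsigned int nr_hw_queues)
{
	return cpu % nr_hw_queues;
}

/* e.g. with 8 hardware queues and 64 CPUs, CPUs 0, 8, 16, ..., 56 all
 * allocate tags from hardware queue 0's tag space. */
```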
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-18 11:22 ` Ming Lei @ 2023-02-18 18:38 ` Stefan Hajnoczi 2023-02-22 23:17 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-18 18:38 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 8230 bytes --] On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > >> > > > > >> Hi Ming, > > > > >> > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > >> > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > >> >> > > > > > Hello, > > > > >> >> > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > >> >> > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > >> >> > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > >> >> > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > >> >> > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > >> >> > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > >> >> > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > >> >> > > > > What am I missing? > > > > >> >> > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > >> >> > > > the case of scsi and nvme. > > > > >> >> > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > >> >> > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > >> >> > > devices. > > > > >> >> > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > >> >> > > userspace. > > > > >> >> > > > > > > >> >> > > I don't understand yet... 
> > > > >> >> > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > >> >> > device is independent, and can't shard tags. > > > > >> >> > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > >> >> it just sub-optimal? > > > > >> > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > >> > > > > > >> > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > >> block devices in a single ublk user space process? > > > > >> > > > > >> From this conversation it seems that the limiting factor is allocation > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > >> tell, the tag sets are allocated per virtual block device in > > > > >> `ublk_ctrl_add_dev()`? > > > > >> > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > >> > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > handle them just fine. > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > work as expected, because tags is shared among all these devices. The > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > Follow up question: What would the implications be if one tried to > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > an independent tag set? > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > What are the benefits of sharing a tagset across > > > > all namespaces of a controller? > > > > > > The userspace implementation can be simplified a lot since generic > > > shared tag allocation isn't needed, meantime with good performance > > > (shared tags allocation in SMP is one hard problem) > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > In reality the max supported nr_queues of nvme is often much less than > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > That is because queue isn't free in both software and hardware, which > implementation is often tradeoff between performance and cost. I didn't say that the ublk server should have nr_cpu_ids threads. I thought the idea was the ublk server creates as many threads as it needs (e.g. max 8 if the Azure NVMe device only has 8 queues). Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > Not mention, most of scsi devices are SQ in which tag allocations from > all CPUs are against single shared tagset. 
> > So there is still per-queue tag allocations from different CPUs which aims > at same queue. > > What we discussed are supposed to be generic solution, not something just > for ideal 1:1 mapping device, which isn't dominant in reality. The same trivial tag allocation can be used for SCSI: instead of a private tag namespace (e.g. 0x0-0xffff), give each queue a private subset of the tag namespace (e.g. queue 0 has 0x0-0x7f, queue 1 has 0x80-0xff, etc). The issue is not whether the tag namespace is shared across queues, but the threading model of the ublk server. If the threading model requires queues to be shared, then it becomes more complex and slow. It's not clear to me why you think ublk servers should choose threading models that require queues to be shared? They don't have to. Unlike the kernel, they can choose the number of threads. > > > SMP tag allocator in the first place: > > - Each ublk server thread has a separate io_uring context. > > - Each ublk server thread has its own NVMe Submission Queue. > > - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace > > because there are no SMP concerns. > > It isn't even trivial for 1:1 mapping, when any ublk server crashes > global tag will be leaked, and other ublk servers can't use the > leaked tag any more. I'm not sure what you're describing here, a multi-process ublk server? Are you saying userspace must not do tag allocation itself because it won't be able to recover? Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-18 18:38 ` Stefan Hajnoczi @ 2023-02-22 23:17 ` Ming Lei 2023-02-23 20:18 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-02-22 23:17 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > >> > > > > > >> Hi Ming, > > > > > >> > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > >> > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > >> >> > > > > > Hello, > > > > > >> >> > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > >> >> > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > >> >> > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > >> >> > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > >> >> > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > >> >> > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > >> >> > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > >> >> > > > > What am I missing? > > > > > >> >> > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > >> >> > > > the case of scsi and nvme. > > > > > >> >> > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > >> >> > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > >> >> > > devices. > > > > > >> >> > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > >> >> > > that in userspace? 
The Linux block layer doesn't have the concept of a > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > >> >> > > userspace. > > > > > >> >> > > > > > > > >> >> > > I don't understand yet... > > > > > >> >> > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > >> >> > device is independent, and can't shard tags. > > > > > >> >> > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > >> >> it just sub-optimal? > > > > > >> > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > >> > > > > > > >> > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > >> block devices in a single ublk user space process? > > > > > >> > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > >> `ublk_ctrl_add_dev()`? > > > > > >> > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > >> > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > handle them just fine. > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > an independent tag set? > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > What are the benefits of sharing a tagset across > > > > > all namespaces of a controller? > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > shared tag allocation isn't needed, meantime with good performance > > > > (shared tags allocation in SMP is one hard problem) > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > In reality the max supported nr_queues of nvme is often much less than > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > That is because queue isn't free in both software and hardware, which > > implementation is often tradeoff between performance and cost. > > I didn't say that the ublk server should have nr_cpu_ids threads. 
I > thought the idea was the ublk server creates as many threads as it needs > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? No. In ublksrv project, each pthread maps to one unique hardware queue, so total number of pthread is equal to nr_hw_queues. > > > Not mention, most of scsi devices are SQ in which tag allocations from > > all CPUs are against single shared tagset. > > > > So there is still per-queue tag allocations from different CPUs which aims > > at same queue. > > > > What we discussed are supposed to be generic solution, not something just > > for ideal 1:1 mapping device, which isn't dominant in reality. > > The same trivial tag allocation can be used for SCSI: instead of a > private tag namespace (e.g. 0x0-0xffff), give each queue a private > subset of the tag namespace (e.g. queue 0 has 0x0-0x7f, queue 1 has > 0x80-0xff, etc). Sorry, I may not get your point. Each hw queue has its own tag space, for example, one scsi adaptor has 2 queues, queue depth is 128, then each hardware queue's tag space is 0 ~ 127. Also if there are two LUNs attached to this host, the two luns share the two queue's tag space, that means any IO issued to queue 0, no matter if it is from lun0 or lun1, the allocated tag has to unique in the set of 0~127. > > The issue is not whether the tag namespace is shared across queues, but > the threading model of the ublk server. If the threading model requires > queues to be shared, then it becomes more complex and slow. ublksrv's threading model is simple, each thread handles IOs from one unique hw queue, so total thread number is equal to nr_hw_queues. If nr_hw_queues(nr_pthreads) < nr_cpu_id, one queue(ublk pthread) has to handle IO requests from more than one CPUs, then contention on tag allocation from this queue(ublk pthread). > > It's not clear to me why you think ublk servers should choose threading > models that require queues to be shared? They don't have to. Unlike the > kernel, they can choose the number of threads. queue sharing or not simply depends on if nr_hw_queues is less than nr_cpu_id. That is one easy math problem, isn't it? > > > > > > SMP tag allocator in the first place: > > > - Each ublk server thread has a separate io_uring context. > > > - Each ublk server thread has its own NVMe Submission Queue. > > > - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace > > > because there are no SMP concerns. > > > > It isn't even trivial for 1:1 mapping, when any ublk server crashes > > global tag will be leaked, and other ublk servers can't use the > > leaked tag any more. > > I'm not sure what you're describing here, a multi-process ublk server? > Are you saying userspace must not do tag allocation itself because it > won't be able to recover? No matter if the ublk server is multi process or threads. If tag allocation is implemented in userspace, you have to take thread/process panic into account. Because if one process/pthread panics without releasing one tag, the tag won't be visible to other ublk server any more. That is because each queue's tag space is shared for all LUNs/NSs which are supposed to implemented as ublk server. 
Tag utilization highly affects performance, and recovery could take quite a while or may never happen; during that period the leaked tags aren't visible to the other LUNs/NSs (ublk servers). Not to mention that fixing the tag leak during recovery requires tracking each tag's user (ublk server info), which adds cost/complexity to the fast/parallel io path. Trivial to solve?

thanks, Ming

^ permalink raw reply [flat|nested] 34+ messages in thread
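For context, the queue-per-pthread model referred to above looks roughly like the loop below: a heavily simplified sketch using liburing, with the actual command handling and error handling omitted; this is not the real ublksrv code.

```c
#include <liburing.h>
#include <pthread.h>

#define QD 128

/* One of these per ublk queue; the owning pthread is its only user. */
struct queue_thread {
	int queue_idx;
	struct io_uring ring;
};

static void *queue_fn(void *arg)
{
	struct queue_thread *q = arg;
	struct io_uring_cqe *cqe;

	io_uring_queue_init(QD, &q->ring, 0);

	/*
	 * Issue the initial FETCH commands for this queue here (omitted),
	 * then loop: wait for ublk io commands and backend completions and
	 * handle them, all on this single thread.
	 */
	for (;;) {
		io_uring_submit_and_wait(&q->ring, 1);
		while (io_uring_peek_cqe(&q->ring, &cqe) == 0) {
			/* handle_event(q, cqe);  device-specific work */
			io_uring_cqe_seen(&q->ring, cqe);
		}
	}
	return NULL;
}
```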
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-22 23:17 ` Ming Lei @ 2023-02-23 20:18 ` Stefan Hajnoczi 2023-03-02 3:22 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-02-23 20:18 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 14209 bytes --] On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > >> > > > > > > >> Hi Ming, > > > > > > >> > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > >> > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > >> >> > > > > > Hello, > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > >> >> > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > >> >> > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > >> >> > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > >> >> > > > > What am I missing? > > > > > > >> >> > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > >> >> > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > >> >> > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > >> >> > > devices. 
> > > > > > >> >> > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > >> >> > > userspace. > > > > > > >> >> > > > > > > > > >> >> > > I don't understand yet... > > > > > > >> >> > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > >> >> > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > >> >> it just sub-optimal? > > > > > > >> > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > >> > > > > > > > >> > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > >> block devices in a single ublk user space process? > > > > > > >> > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > >> > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > >> > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > handle them just fine. > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > an independent tag set? > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > all namespaces of a controller? > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > shared tags across multiple SQs in NVMe. 
So userspace doesn't need an > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > That is because queue isn't free in both software and hardware, which > > > implementation is often tradeoff between performance and cost. > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > thought the idea was the ublk server creates as many threads as it needs > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > No. > > In ublksrv project, each pthread maps to one unique hardware queue, so total > number of pthread is equal to nr_hw_queues. Good, I think we agree on that part. Here is a summary of the ublk server model I've been describing: 1. Each pthread has a separate io_uring context. 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI command queue, etc). 3. Each pthread has a distinct subrange of the tag space if the tag space is shared across hardware submission queues. 4. Each pthread allocates tags from its subrange without coordinating with other threads. This is cheap and simple. 5. When the pthread runs out of tags it either suspends processing new ublk requests or enqueues them internally. When hardware completes requests, the pthread resumes requests that were waiting for tags. This way multiple ublk_devices can be handled by a single ublk server without the Linux block layer knowing the exact tag space sharing relationship between ublk_devices and hardware submission queues (NVMe SQ, SCSI command queue, etc). When ublk adds support for configuring tagsets, then 3, 4, and 5 can be eliminated. However, this is purely an optimization. Not that much userspace code will be eliminated and the performance gain is not huge. I believe this model works for the major storage protocols like NVMe and SCSI. I put forward this model to explain why I don't agree that ublk doesn't support ublk servers with multiple devices (e.g. I/O would be failed due to duplicated tags). I think we agree on 1 and 2. It's 3, 4, and 5 that I think you are either saying won't work or are very complex/hard? > > > > > Not mention, most of scsi devices are SQ in which tag allocations from > > > all CPUs are against single shared tagset. > > > > > > So there is still per-queue tag allocations from different CPUs which aims > > > at same queue. > > > > > > What we discussed are supposed to be generic solution, not something just > > > for ideal 1:1 mapping device, which isn't dominant in reality. > > > > The same trivial tag allocation can be used for SCSI: instead of a > > private tag namespace (e.g. 0x0-0xffff), give each queue a private > > subset of the tag namespace (e.g. queue 0 has 0x0-0x7f, queue 1 has > > 0x80-0xff, etc). > > Sorry, I may not get your point. > > Each hw queue has its own tag space, for example, one scsi adaptor has 2 > queues, queue depth is 128, then each hardware queue's tag space is > 0 ~ 127. > > Also if there are two LUNs attached to this host, the two luns > share the two queue's tag space, that means any IO issued to queue 0, > no matter if it is from lun0 or lun1, the allocated tag has to unique in > the set of 0~127. 
I'm trying to explain why tag allocation in userspace is simple and cheap thanks to the ublk server's ability to create only as many threads as hardware queues (e.g. NVMe SQs). Even in the case where all hardware (NVME/SCSI/etc) queues and LUNs share the same tag space (the worst case), ublk server threads can perform allocation from distinct subranges of the shared tag space. There are no SMP concerns because there is no overlap in the tag space between threads. > > > > The issue is not whether the tag namespace is shared across queues, but > > the threading model of the ublk server. If the threading model requires > > queues to be shared, then it becomes more complex and slow. > > ublksrv's threading model is simple, each thread handles IOs from one unique > hw queue, so total thread number is equal to nr_hw_queues. Here "hw queue" is a Linux block layer hw queue, not a hardware queue (i.e. NVMe SQ)? > > If nr_hw_queues(nr_pthreads) < nr_cpu_id, one queue(ublk pthread) has to > handle IO requests from more than one CPUs, then contention on tag allocation > from this queue(ublk pthread). Userspace doesn't need to worry about the fact that I/O requests were submitted by many CPUs. Each pthread processes one ublk_queue with a known queue depth. Each pthread has a range of userspace tags available and if there are no more tags available then it waits to complete in-flight I/O before accepting more requests or it can internally queue incoming requests. > > > > It's not clear to me why you think ublk servers should choose threading > > models that require queues to be shared? They don't have to. Unlike the > > kernel, they can choose the number of threads. > > queue sharing or not simply depends on if nr_hw_queues is less than > nr_cpu_id. That is one easy math problem, isn't it? We're talking about different things. I mean sharing a hardware queue (i.e. NVMe SQ) across multiple ublk server threads. You seem to define queue sharing as multiple CPUs submitting I/O via ublk? Thinking about your scenario: why does it matter if multiple CPUs submit I/O to a single ublk_queue? I don't see how it makes a difference whether 1 CPU or multiple CPUs enqueue requests on a single ublk_queue. Userspace will process that ublk_queue in the same way in either case. > > > > > > > > > SMP tag allocator in the first place: > > > > - Each ublk server thread has a separate io_uring context. > > > > - Each ublk server thread has its own NVMe Submission Queue. > > > > - Therefore it's trivial and cheap to allocate NVMe CIDs in userspace > > > > because there are no SMP concerns. > > > > > > It isn't even trivial for 1:1 mapping, when any ublk server crashes > > > global tag will be leaked, and other ublk servers can't use the > > > leaked tag any more. > > > > I'm not sure what you're describing here, a multi-process ublk server? > > Are you saying userspace must not do tag allocation itself because it > > won't be able to recover? > > No matter if the ublk server is multi process or threads. If tag > allocation is implemented in userspace, you have to take thread/process > panic into account. Because if one process/pthread panics without > releasing one tag, the tag won't be visible to other ublk server any > more. > > That is because each queue's tag space is shared for all LUNs/NSs which > are supposed to implemented as ublk server. 
> > Tag utilization highly affects performance, and recover could take a > bit long or even not recovered, during this period, the leaked tags > aren't visible for other LUNs/NSs(ublk server), not mention for fixing > tag leak in recover, you have to track each tag's user(ublk server info), > which adds cost/complexity to fast/parallel io path, trivial to solve? In the interest of time, let's defer the recovery discussion until after the core discussion is finished. I would need to research how ublk recovery works. I am happy to do that if you think recovery is the reason why userspace cannot allocate tags, but reaching a conclusion on the core discussion might be enough to make discussing recovery unnecessary. Thanks, Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
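For illustration, here is a minimal sketch in C of the per-pthread tag subrange scheme summarized in points 3-5 of the message above (distinct subranges, no cross-thread coordination, park requests when the subrange is exhausted). All names are hypothetical rather than ublksrv code, and the host-wide queue depth is assumed to divide evenly across the pthreads:

#include <stdbool.h>

#define MAX_SUBRANGE 1024		/* upper bound picked for this sketch */

/* One instance per pthread, so no locking or atomics are needed. */
struct tag_subrange {
	unsigned int base;		/* first tag of this pthread's subrange */
	unsigned int depth;		/* number of tags in the subrange */
	bool used[MAX_SUBRANGE];	/* used[i] => tag base+i is in flight */
};

static void tag_subrange_init(struct tag_subrange *t, unsigned int queue_id,
			      unsigned int nr_queues, unsigned int host_depth)
{
	t->depth = host_depth / nr_queues;
	t->base = queue_id * t->depth;
	for (unsigned int i = 0; i < t->depth; i++)
		t->used[i] = false;
}

/*
 * Returns a host-wide unique tag, or -1 if the subrange is exhausted; the
 * caller then parks the ublk request internally until a completion frees a
 * tag (point 5 in the message above).
 */
static int tag_alloc(struct tag_subrange *t)
{
	for (unsigned int i = 0; i < t->depth; i++) {
		if (!t->used[i]) {
			t->used[i] = true;
			return t->base + i;
		}
	}
	return -1;
}

static void tag_free(struct tag_subrange *t, unsigned int tag)
{
	t->used[tag - t->base] = false;
}

Because each struct is owned by a single pthread, allocation needs no atomics and tags stay unique host-wide by construction, which is the point being argued above.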
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-02-23 20:18 ` Stefan Hajnoczi @ 2023-03-02 3:22 ` Ming Lei 2023-03-02 15:09 ` Stefan Hajnoczi 2023-03-16 14:24 ` Stefan Hajnoczi 0 siblings, 2 replies; 34+ messages in thread From: Ming Lei @ 2023-03-02 3:22 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > >> > > > > > > > >> Hi Ming, > > > > > > > >> > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > >> > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > >> >> > > > > > Hello, > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > >> >> > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > >> >> > > > > What am I missing? > > > > > > > >> >> > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > > >> >> > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? 
> > > > > > > >> >> > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > >> >> > > devices. > > > > > > > >> >> > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > >> >> > > userspace. > > > > > > > >> >> > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > >> >> > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > >> >> > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > >> >> it just sub-optimal? > > > > > > > >> > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > >> > > > > > > > > >> > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > >> block devices in a single ublk user space process? > > > > > > > >> > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > >> > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > >> > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > an independent tag set? > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > all namespaces of a controller? 
> > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > That is because queue isn't free in both software and hardware, which > > > > implementation is often tradeoff between performance and cost. > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > thought the idea was the ublk server creates as many threads as it needs > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > No. > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > number of pthread is equal to nr_hw_queues. > > Good, I think we agree on that part. > > Here is a summary of the ublk server model I've been describing: > 1. Each pthread has a separate io_uring context. > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > command queue, etc). > 3. Each pthread has a distinct subrange of the tag space if the tag > space is shared across hardware submission queues. > 4. Each pthread allocates tags from its subrange without coordinating > with other threads. This is cheap and simple. That is also not doable. The tag space can be pretty small, such as, usb-storage queue depth is just 1, and usb card reader can support multi lun too. That is just one extreme example, but there can be more low queue depth scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but there could be some implementation with less. More importantly subrange could waste lots of tags for idle LUNs/NSs, and active LUNs/NSs will have to suffer from the small subrange tags. And available tags depth represents the max allowed in-flight block IOs, so performance is affected a lot by subrange. If you look at block layer tag allocation change history, we never take such way. Thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-02 3:22 ` Ming Lei @ 2023-03-02 15:09 ` Stefan Hajnoczi 2023-03-17 3:10 ` Ming Lei 2023-03-16 14:24 ` Stefan Hajnoczi 1 sibling, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-03-02 15:09 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 9972 bytes --] On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > >> > > > > > > > > >> Hi Ming, > > > > > > > > >> > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > >> > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > >> >> > > > > > Hello, > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > >> >> > > > > What am I missing? > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > >> >> > > > the case of scsi and nvme. 
> > > > > > > > >> >> > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > >> >> > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > >> >> > > devices. > > > > > > > > >> >> > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > >> >> > > userspace. > > > > > > > > >> >> > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > >> >> > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > >> >> > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > >> >> it just sub-optimal? > > > > > > > > >> > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > > >> block devices in a single ublk user space process? > > > > > > > > >> > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > >> > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > >> > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > an independent tag set? > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > all namespaces of a controller? 
> > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > That is because queue isn't free in both software and hardware, which > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > > thought the idea was the ublk server creates as many threads as it needs > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > > > No. > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > number of pthread is equal to nr_hw_queues. > > > > Good, I think we agree on that part. > > > > Here is a summary of the ublk server model I've been describing: > > 1. Each pthread has a separate io_uring context. > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > command queue, etc). > > 3. Each pthread has a distinct subrange of the tag space if the tag > > space is shared across hardware submission queues. > > 4. Each pthread allocates tags from its subrange without coordinating > > with other threads. This is cheap and simple. > > That is also not doable. > > The tag space can be pretty small, such as, usb-storage queue depth > is just 1, and usb card reader can support multi lun too. If the tag space is very limited, just create one pthread. > That is just one extreme example, but there can be more low queue depth > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > there could be some implementation with less. NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has its own independent tag space. That means NVMe devices with low queue depths work fine in the model I described. I don't know the exact SCSI/SATA scenario you mentioned, but if there are only 32 tags globally then just create one pthread. If you mean AHCI PCI devices, my understanding is that AHCI is multi-LUN but each port (LUN) has a single Command List (queue) has an independent tag space. Therefore each port has just one ublk_queue that is handled by one pthread. > More importantly subrange could waste lots of tags for idle LUNs/NSs, and > active LUNs/NSs will have to suffer from the small subrange tags. And available > tags depth represents the max allowed in-flight block IOs, so performance > is affected a lot by subrange. Tag subranges are pthread, not per-LUN/NS, so these concerns do not apply to the model I described. Are there any other reasons why you say this model is not doable? Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-02 15:09 ` Stefan Hajnoczi @ 2023-03-17 3:10 ` Ming Lei 2023-03-17 14:41 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-03-17 3:10 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote: > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > > >> > > > > > > > > > >> Hi Ming, > > > > > > > > > >> > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > >> > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > > >> >> > > > > > Hello, > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > > >> >> > > > > What am I missing? 
> > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > > >> >> > > devices. > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > > >> >> > > userspace. > > > > > > > > > >> >> > > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > > >> >> > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > > >> >> > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > > >> >> it just sub-optimal? > > > > > > > > > >> > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > > > >> block devices in a single ublk user space process? > > > > > > > > > >> > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > > >> > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > > >> > > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. 
> > > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > > an independent tag set? > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > > all namespaces of a controller? > > > > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > > That is because queue isn't free in both software and hardware, which > > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > > > thought the idea was the ublk server creates as many threads as it needs > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > > > > > No. > > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > > number of pthread is equal to nr_hw_queues. > > > > > > Good, I think we agree on that part. > > > > > > Here is a summary of the ublk server model I've been describing: > > > 1. Each pthread has a separate io_uring context. > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > > command queue, etc). > > > 3. Each pthread has a distinct subrange of the tag space if the tag > > > space is shared across hardware submission queues. > > > 4. Each pthread allocates tags from its subrange without coordinating > > > with other threads. This is cheap and simple. > > > > That is also not doable. > > > > The tag space can be pretty small, such as, usb-storage queue depth > > is just 1, and usb card reader can support multi lun too. > > If the tag space is very limited, just create one pthread. What I meant is that sub-range isn't doable. And pthread is aligned with queue, that is nothing to do with nr_tags. > > > That is just one extreme example, but there can be more low queue depth > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > > there could be some implementation with less. > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has > its own independent tag space. That means NVMe devices with low queue > depths work fine in the model I described. NVMe PCI isn't special, and it is covered by current ublk abstract, so one way or another, we should not support both sub-range or non-sub-range for avoiding unnecessary complexity. 
"Each pthread has its own independent tag space" may mean two things 1) each LUN/NS is implemented in standalone process space: - so every queue of each LUN has its own space, but all the queues with same ID share the whole queue tag space - that matches with current ublksrv - also easier to implement 2) all LUNs/NSs are implemented in single process space - so each pthread handles one queue for all NSs/LUNs Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk char device has to handle multiple LUNs/NSs(disks), which still need (big) ublk interface change. Also this way can't scale for single queue devices. Another thing is that io command buffer has to be shared among all LUNs/ NSs. So interface change has to cover shared io command buffer. With zero copy support, io buffer sharing needn't to be considered, that can be a bit easier. In short, the sharing of (tag, io command buffer, io buffer) needs to be considered for shared host ublk disks. Actually I prefer to 1), which matches with current design, and we can just add host concept into ublk, and implementation could be easier. BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2], and the performance improvement is pretty nice, so I think it is one reasonable requirement to support "shared host" ublk disks for covering multi-lun or multi-ns. [1] https://github.com/ming1/ubdsrv/issues/49 [2] https://github.com/longhorn/longhorn Thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-17 3:10 ` Ming Lei @ 2023-03-17 14:41 ` Stefan Hajnoczi 2023-03-18 0:30 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-03-17 14:41 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 14963 bytes --] On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote: > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote: > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > > > >> > > > > > > > > > > >> Hi Ming, > > > > > > > > > > >> > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > >> > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > > > >> >> > > > > > Hello, > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > > > >> >> > > > > What am I missing? 
> > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > > > >> >> > > devices. > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > > > >> >> > > userspace. > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > > > >> >> > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > > > >> >> > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > > > >> >> it just sub-optimal? > > > > > > > > > > >> > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > > > > >> block devices in a single ublk user space process? > > > > > > > > > > >> > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > > > >> > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > > > >> > > > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. 
> > > > > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > > > an independent tag set? > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > > > all namespaces of a controller? > > > > > > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > > > That is because queue isn't free in both software and hardware, which > > > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > > > > thought the idea was the ublk server creates as many threads as it needs > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > > > > > > > No. > > > > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > > > number of pthread is equal to nr_hw_queues. > > > > > > > > Good, I think we agree on that part. > > > > > > > > Here is a summary of the ublk server model I've been describing: > > > > 1. Each pthread has a separate io_uring context. > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > > > command queue, etc). > > > > 3. Each pthread has a distinct subrange of the tag space if the tag > > > > space is shared across hardware submission queues. > > > > 4. Each pthread allocates tags from its subrange without coordinating > > > > with other threads. This is cheap and simple. > > > > > > That is also not doable. > > > > > > The tag space can be pretty small, such as, usb-storage queue depth > > > is just 1, and usb card reader can support multi lun too. > > > > If the tag space is very limited, just create one pthread. > > What I meant is that sub-range isn't doable. > > And pthread is aligned with queue, that is nothing to do with nr_tags. > > > > > > That is just one extreme example, but there can be more low queue depth > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > > > there could be some implementation with less. > > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has > > its own independent tag space. That means NVMe devices with low queue > > depths work fine in the model I described. > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way > or another, we should not support both sub-range or non-sub-range for > avoiding unnecessary complexity. 
> > "Each pthread has its own independent tag space" may mean two things > > 1) each LUN/NS is implemented in standalone process space: > - so every queue of each LUN has its own space, but all the queues with > same ID share the whole queue tag space > - that matches with current ublksrv > - also easier to implement > > 2) all LUNs/NSs are implemented in single process space > - so each pthread handles one queue for all NSs/LUNs > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk > char device has to handle multiple LUNs/NSs(disks), which still need > (big) ublk interface change. Also this way can't scale for single queue > devices. The model I described is neither 1) or 2). It's similar to 2) but I'm not sure why you say the ublk interface needs to be changed. I'm afraid I haven't explained it well, sorry. I'll try to describe it again with an NVMe PCI adapter being handled by userspace. There is a single ublk server process with an NVMe PCI device opened using VFIO. There are N pthreads and each pthread has 1 io_uring context and 1 NVMe PCI SQ/CQ pair. The size of the SQ and CQ rings is QD. The NVMe PCI device has M Namespaces. The ublk server creates M ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD. The Linux block layer sees M block devices with N nr_hw_queues and QD queue_depth. The actual NVMe PCI device resources are less than what the Linux block layer sees because the each SQ/CQ pair is used for M ublk_devices. In other words, Linux thinks there can be M * N * QD requests in flight but in reality the NVMe PCI adapter only supports N * QD requests. Now I'll describe how userspace can take care of the mismatch between the Linux block layer and the NVMe PCI device without doing much work: Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for each of the M Namespaces. When userspace receives a request from ublk, it cannot simply copy the struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier (CID) field. There would be collisions between the tags used across the M ublk_queues that the pthread services. Userspace selects a free tag (e.g. from a bitmap with QD elements) and uses that as the NVMe Command Identifier. This is trivial because each pthread has its own bitmap and NVMe Command Identifiers are per-SQ. If there are no free tags then the request is placed in the pthread's per Namespace overflow list. Whenever an NVMe command completes, the overflow lists are scanned. One pending request is submitted to the NVMe PCI adapter in a round-robin fashion until the lists are empty or there are no more free tags. That's it. No ublk API changes are necessary. The userspace code is not slow or complex (just a bitmap and overflow list). The approach also works for SCSI or devices that only support 1 request in flight at a time, with small tweaks. Going back to the beginning of the discussion: I think it's possible to write a ublk server that handles multiple LUNs/NS today. > Another thing is that io command buffer has to be shared among all LUNs/ > NSs. So interface change has to cover shared io command buffer. I think the main advantage of extending the ublk API to share io command buffers between ublk_devices is to reduce userspace memory consumption? It eliminates the need to over-provision I/O buffers for write requests (or use the slower UBLK_IO_NEED_GET_DATA approach). > With zero copy support, io buffer sharing needn't to be considered, that > can be a bit easier. 
> > In short, the sharing of (tag, io command buffer, io buffer) needs to be > considered for shared host ublk disks. > > Actually I prefer to 1), which matches with current design, and we can > just add host concept into ublk, and implementation could be easier. > > BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2], > and the performance improvement is pretty nice, so I think it is one reasonable > requirement to support "shared host" ublk disks for covering multi-lun or multi-ns. > > [1] https://github.com/ming1/ubdsrv/issues/49 > [2] https://github.com/longhorn/longhorn Nice performance improvement! I agree with you that the ublk API should have a way to declare the resource constraints for multi-LUN/NS servers (i.e. share the tag_set). I guess the simplest way to do that is by passing a reference to an existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing else about the ublk API needs to change, at least for tags. Solving I/O buffer over-provisioning sounds similar to io_uring's provided buffer mechanism :). Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
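To make the bitmap-plus-overflow-list scheme described in the message above concrete, here is a minimal sketch in C. It is only an illustration of the idea: the struct and function names are invented, QD and M are arbitrary, the contexts are assumed zero-initialized, and building/submitting the actual NVMe SQE is reduced to a comment:

#include <stdbool.h>

#define QD 64	/* per-SQ depth, i.e. number of NVMe Command Identifiers */
#define M  4	/* namespaces (== ublk devices) served by this pthread */

struct pending_io {
	struct pending_io *next;
	int ublk_tag;			/* tag from struct ublksrv_io_cmd */
	/* op, lba, len, buffer, ... omitted */
};

/* One instance per pthread/SQ, so no locking is needed. */
struct sq_ctx {
	bool cid_used[QD];		/* per-SQ NVMe Command Identifier map */
	struct pending_io *overflow[M];	/* per-namespace overflow lists */
	int next_ns;			/* round-robin cursor for draining */
};

static int cid_alloc(struct sq_ctx *sq)
{
	for (int i = 0; i < QD; i++) {
		if (!sq->cid_used[i]) {
			sq->cid_used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Called for each ublk request received for namespace 'ns'. */
static void handle_ublk_req(struct sq_ctx *sq, int ns, struct pending_io *io)
{
	int cid = cid_alloc(sq);

	if (cid < 0) {
		/* out of CIDs: park the request on the per-NS overflow list */
		io->next = sq->overflow[ns];
		sq->overflow[ns] = io;
		return;
	}
	/* build the NVMe SQE with Command Identifier 'cid' and submit it */
}

/* Called when an NVMe completion arrives; 'cid' is now free again. */
static void handle_nvme_cqe(struct sq_ctx *sq, int cid)
{
	sq->cid_used[cid] = false;
	/* complete the matching ublk request (omitted), then resubmit parked
	 * requests round-robin while free CIDs remain */
	bool progress = true;
	while (progress) {
		progress = false;
		for (int i = 0; i < M; i++) {
			int ns = (sq->next_ns + i) % M;
			struct pending_io *io = sq->overflow[ns];

			if (!io)
				continue;
			int new_cid = cid_alloc(sq);
			if (new_cid < 0)
				return;
			sq->overflow[ns] = io->next;
			/* submit 'io' with Command Identifier 'new_cid' */
			sq->next_ns = (ns + 1) % M;
			progress = true;
		}
	}
}

A real server would keep the overflow lists in FIFO order and bound them by the ublk queue depth; the point of the sketch is only that all of this state is thread-local, matching the argument above.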
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-17 14:41 ` Stefan Hajnoczi @ 2023-03-18 0:30 ` Ming Lei 2023-03-20 12:34 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-03-18 0:30 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote: > On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote: > > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote: > > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > > > > >> > > > > > > > > > > > >> Hi Ming, > > > > > > > > > > > >> > > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > >> > > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > > > > >> >> > > > > > Hello, > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. 
> > > > > > > > > > > >> >> > > > > What am I missing? > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > > > > >> >> > > devices. > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > > > > >> >> > > userspace. > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > > > > >> >> > > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > > > > >> >> it just sub-optimal? > > > > > > > > > > > >> > > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > > > > > >> block devices in a single ublk user space process? > > > > > > > > > > > >> > > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > > > > >> > > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > > > > >> > > > > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > > > > work as expected, because tags is shared among all these devices. 
The > > > > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > > > > an independent tag set? > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > > > > all namespaces of a controller? > > > > > > > > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > > > > That is because queue isn't free in both software and hardware, which > > > > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > > > > > thought the idea was the ublk server creates as many threads as it needs > > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > > > > > > > > > No. > > > > > > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > > > > number of pthread is equal to nr_hw_queues. > > > > > > > > > > Good, I think we agree on that part. > > > > > > > > > > Here is a summary of the ublk server model I've been describing: > > > > > 1. Each pthread has a separate io_uring context. > > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > > > > command queue, etc). > > > > > 3. Each pthread has a distinct subrange of the tag space if the tag > > > > > space is shared across hardware submission queues. > > > > > 4. Each pthread allocates tags from its subrange without coordinating > > > > > with other threads. This is cheap and simple. > > > > > > > > That is also not doable. > > > > > > > > The tag space can be pretty small, such as, usb-storage queue depth > > > > is just 1, and usb card reader can support multi lun too. > > > > > > If the tag space is very limited, just create one pthread. > > > > What I meant is that sub-range isn't doable. > > > > And pthread is aligned with queue, that is nothing to do with nr_tags. > > > > > > > > > That is just one extreme example, but there can be more low queue depth > > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > > > > there could be some implementation with less. > > > > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has > > > its own independent tag space. 
That means NVMe devices with low queue > > > depths work fine in the model I described. > > > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way > > or another, we should not support both sub-range or non-sub-range for > > avoiding unnecessary complexity. > > > > "Each pthread has its own independent tag space" may mean two things > > > > 1) each LUN/NS is implemented in standalone process space: > > - so every queue of each LUN has its own space, but all the queues with > > same ID share the whole queue tag space > > - that matches with current ublksrv > > - also easier to implement > > > > 2) all LUNs/NSs are implemented in single process space > > - so each pthread handles one queue for all NSs/LUNs > > > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk > > char device has to handle multiple LUNs/NSs(disks), which still need > > (big) ublk interface change. Also this way can't scale for single queue > > devices. > > The model I described is neither 1) or 2). It's similar to 2) but I'm > not sure why you say the ublk interface needs to be changed. I'm afraid > I haven't explained it well, sorry. I'll try to describe it again with > an NVMe PCI adapter being handled by userspace. > > There is a single ublk server process with an NVMe PCI device opened > using VFIO. > > There are N pthreads and each pthread has 1 io_uring context and 1 NVMe > PCI SQ/CQ pair. The size of the SQ and CQ rings is QD. > > The NVMe PCI device has M Namespaces. The ublk server creates M > ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD. > > The Linux block layer sees M block devices with N nr_hw_queues and QD > queue_depth. The actual NVMe PCI device resources are less than what the > Linux block layer sees because the each SQ/CQ pair is used for M > ublk_devices. In other words, Linux thinks there can be M * N * QD > requests in flight but in reality the NVMe PCI adapter only supports N * > QD requests. Yeah, but it is really bad. Now QD is the host hard queue depth, which can be very big, and could be more than thousands. ublk driver doesn't understand this kind of sharing(tag, io command buffer, io buffers), M * M * QD requests are submitted to ublk server, and CPUs/memory are wasted a lot. Every device has to allocate command buffers for holding QD io commands, and command buffer is supposed to be per-host, instead of per-disk. Same with io buffer pre-allocation in userspace side. Userspace has to re-tag the requests for avoiding duplicated tag, and requests have to be throttled in ublk server side. If you implement tag allocation in userspace side, it is still one typical shared data issue in SMP, M pthreads contends on single tags from multiple CPUs. > > Now I'll describe how userspace can take care of the mismatch between > the Linux block layer and the NVMe PCI device without doing much work: > > Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for > each of the M Namespaces. > > When userspace receives a request from ublk, it cannot simply copy the > struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier > (CID) field. There would be collisions between the tags used across the > M ublk_queues that the pthread services. > > Userspace selects a free tag (e.g. from a bitmap with QD elements) and > uses that as the NVMe Command Identifier. This is trivial because each > pthread has its own bitmap and NVMe Command Identifiers are per-SQ. 
I believe I have explained, in reality, NVME SQ/CQ pair can be less( or much less) than nr_cpu_ids, so the per-queue-tags can be allocated & freed among CPUs of (nr_cpu_ids / nr_hw_queues). Not mention userspace is capable of overriding the pthread cpu affinity, so it isn't trivial & cheap, M pthreads could be run from more than (nr_cpu_ids / nr_hw_queues) CPUs and contend on the single hw queue tags. > > If there are no free tags then the request is placed in the pthread's > per Namespace overflow list. Whenever an NVMe command completes, the > overflow lists are scanned. One pending request is submitted to the NVMe > PCI adapter in a round-robin fashion until the lists are empty or there > are no more free tags. > > That's it. No ublk API changes are necessary. The userspace code is not > slow or complex (just a bitmap and overflow list). Fine, but I am not sure we need to support such mess & pool implementation. > > The approach also works for SCSI or devices that only support 1 request > in flight at a time, with small tweaks. > > Going back to the beginning of the discussion: I think it's possible to > write a ublk server that handles multiple LUNs/NS today. It is possible, but it is poor in both performance and resource utilization, meantime with complicated ublk server implementation. > > > Another thing is that io command buffer has to be shared among all LUNs/ > > NSs. So interface change has to cover shared io command buffer. > > I think the main advantage of extending the ublk API to share io command > buffers between ublk_devices is to reduce userspace memory consumption? > > It eliminates the need to over-provision I/O buffers for write requests > (or use the slower UBLK_IO_NEED_GET_DATA approach). Not only avoiding memory and cpu waste, but also simplifying ublk server. > > > With zero copy support, io buffer sharing needn't to be considered, that > > can be a bit easier. > > > > In short, the sharing of (tag, io command buffer, io buffer) needs to be > > considered for shared host ublk disks. > > > > Actually I prefer to 1), which matches with current design, and we can > > just add host concept into ublk, and implementation could be easier. > > > > BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2], > > and the performance improvement is pretty nice, so I think it is one reasonable > > requirement to support "shared host" ublk disks for covering multi-lun or multi-ns. > > > > [1] https://github.com/ming1/ubdsrv/issues/49 > > [2] https://github.com/longhorn/longhorn > > Nice performance improvement! > > I agree with you that the ublk API should have a way to declare the > resource contraints for multi-LUN/NS servers (i.e. share the tag_set). I > guess the simplest way to do that is by passing a reference to an > existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing > else about the ublk API needs to change, at least for tags. Basically (tags, io command buffer, io buffers) need to move into host/hw_queue wide from disk wide, so not so simple, but won't be too complicated. > > Solving I/O buffer over-provisioning sounds similar to io_uring's > provided buffer mechanism :). blk-mq has built-in host/hw_queue wide tag allocation, which can provide unique tag for ublk server from ublk driver side, so everything can be simplified a lot if we move (tag, io command buffer, io buffers) into host/hw_queue wide by telling ublk_driver that we are BLK_MQ_F_TAG_QUEUE_SHARED. 
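To make that concrete, here is a rough sketch of the pattern I mean (6.2-era blk-mq calls, error handling omitted; "struct ublk_host" and the two helpers are made-up names, while ublk_mq_ops and struct ublk_rq_data are the existing ublk driver symbols): one blk_mq_tag_set embedded in a per-host object, and every disk of the host allocated from it.

#include <linux/blk-mq.h>

/* Hypothetical per-host object for a "shared host" ublk extension. */
struct ublk_host {
	struct blk_mq_tag_set	tag_set;	/* one tag space for all disks */
	/* host-wide io command buffer (and io buffers) would hang off here */
};

static int ublk_host_init_tags(struct ublk_host *host,
			       unsigned int nr_hw_queues,
			       unsigned int queue_depth)
{
	host->tag_set.ops = &ublk_mq_ops;
	host->tag_set.nr_hw_queues = nr_hw_queues;
	host->tag_set.queue_depth = queue_depth;
	host->tag_set.numa_node = NUMA_NO_NODE;
	host->tag_set.cmd_size = sizeof(struct ublk_rq_data);
	host->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
	return blk_mq_alloc_tag_set(&host->tag_set);
}

/* Each LUN/NS-backed ublk disk is allocated from the same set; once a
 * second disk is added, blk-mq marks the set BLK_MQ_F_TAG_QUEUE_SHARED,
 * so a tag value is only ever in flight once per hw queue across all
 * disks of the host. */
static struct gendisk *ublk_host_add_disk(struct ublk_host *host, void *priv)
{
	return blk_mq_alloc_disk(&host->tag_set, priv);
}

That is what gives the ublk server host-wide unique tags for free, without any re-tagging in userspace.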
Not sure if io_uring's provided buffer is a good fit here, because we need to discard io buffers after the queue becomes idle. But it won't be a big deal if zero copy can be supported. Thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-18 0:30 ` Ming Lei @ 2023-03-20 12:34 ` Stefan Hajnoczi 2023-03-20 15:30 ` Ming Lei 0 siblings, 1 reply; 34+ messages in thread From: Stefan Hajnoczi @ 2023-03-20 12:34 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 20002 bytes --] On Sat, Mar 18, 2023 at 08:30:29AM +0800, Ming Lei wrote: > On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote: > > On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote: > > > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote: > > > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > > > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > >> > > > > > > > > > > > > >> Hi Ming, > > > > > > > > > > > > >> > > > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > >> > > > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > > > > > >> >> > > > > > Hello, > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. 
> > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > > > > > >> >> > > > > What am I missing? > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > > > > > >> >> > > devices. > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > > > > > >> >> > > userspace. > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > > > > > >> >> it just sub-optimal? > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > > > > > > >> block devices in a single ublk user space process? 
> > > > > > > > > > > > >> > > > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > > > > > >> > > > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > > > > > >> > > > > > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > > > > > an independent tag set? > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > > > > > all namespaces of a controller? > > > > > > > > > > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > > > > > That is because queue isn't free in both software and hardware, which > > > > > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > > > > > > thought the idea was the ublk server creates as many threads as it needs > > > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > > > > > > > > > > > No. > > > > > > > > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > > > > > number of pthread is equal to nr_hw_queues. 
> > > > > > > > > > > > Good, I think we agree on that part. > > > > > > > > > > > > Here is a summary of the ublk server model I've been describing: > > > > > > 1. Each pthread has a separate io_uring context. > > > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > > > > > command queue, etc). > > > > > > 3. Each pthread has a distinct subrange of the tag space if the tag > > > > > > space is shared across hardware submission queues. > > > > > > 4. Each pthread allocates tags from its subrange without coordinating > > > > > > with other threads. This is cheap and simple. > > > > > > > > > > That is also not doable. > > > > > > > > > > The tag space can be pretty small, such as, usb-storage queue depth > > > > > is just 1, and usb card reader can support multi lun too. > > > > > > > > If the tag space is very limited, just create one pthread. > > > > > > What I meant is that sub-range isn't doable. > > > > > > And pthread is aligned with queue, that is nothing to do with nr_tags. > > > > > > > > > > > > That is just one extreme example, but there can be more low queue depth > > > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > > > > > there could be some implementation with less. > > > > > > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has > > > > its own independent tag space. That means NVMe devices with low queue > > > > depths work fine in the model I described. > > > > > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way > > > or another, we should not support both sub-range or non-sub-range for > > > avoiding unnecessary complexity. > > > > > > "Each pthread has its own independent tag space" may mean two things > > > > > > 1) each LUN/NS is implemented in standalone process space: > > > - so every queue of each LUN has its own space, but all the queues with > > > same ID share the whole queue tag space > > > - that matches with current ublksrv > > > - also easier to implement > > > > > > 2) all LUNs/NSs are implemented in single process space > > > - so each pthread handles one queue for all NSs/LUNs > > > > > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk > > > char device has to handle multiple LUNs/NSs(disks), which still need > > > (big) ublk interface change. Also this way can't scale for single queue > > > devices. > > > > The model I described is neither 1) or 2). It's similar to 2) but I'm > > not sure why you say the ublk interface needs to be changed. I'm afraid > > I haven't explained it well, sorry. I'll try to describe it again with > > an NVMe PCI adapter being handled by userspace. > > > > There is a single ublk server process with an NVMe PCI device opened > > using VFIO. > > > > There are N pthreads and each pthread has 1 io_uring context and 1 NVMe > > PCI SQ/CQ pair. The size of the SQ and CQ rings is QD. > > > > The NVMe PCI device has M Namespaces. The ublk server creates M > > ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD. > > > > The Linux block layer sees M block devices with N nr_hw_queues and QD > > queue_depth. The actual NVMe PCI device resources are less than what the > > Linux block layer sees because the each SQ/CQ pair is used for M > > ublk_devices. In other words, Linux thinks there can be M * N * QD > > requests in flight but in reality the NVMe PCI adapter only supports N * > > QD requests. > > Yeah, but it is really bad. 
> > Now QD is the host hard queue depth, which can be very big, and could be > more than thousands. > > ublk driver doesn't understand this kind of sharing(tag, io command buffer, io > buffers), M * M * QD requests are submitted to ublk server, and CPUs/memory > are wasted a lot. > > Every device has to allocate command buffers for holding QD io commands, and > command buffer is supposed to be per-host, instead of per-disk. Same with io > buffer pre-allocation in userspace side. I agree with you in cases with lots of LUNs (large M), block layer and ublk driver per-request memory is allocated that cannot be used simultaneously. > Userspace has to re-tag the requests for avoiding duplicated tag, and > requests have to be throttled in ublk server side. If you implement tag allocation > in userspace side, it is still one typical shared data issue in SMP, M pthreads > contends on single tags from multiple CPUs. Here I still disagree. There is no SMP contention with NVMe because tags are per SQ. For SCSI the tag namespace is shared but each pthread can trivially work with a sub-range to avoid SMP contention. If the tag namespace is too small for sub-ranges, then there should be fewer pthreads. > > > > Now I'll describe how userspace can take care of the mismatch between > > the Linux block layer and the NVMe PCI device without doing much work: > > > > Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for > > each of the M Namespaces. > > > > When userspace receives a request from ublk, it cannot simply copy the > > struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier > > (CID) field. There would be collisions between the tags used across the > > M ublk_queues that the pthread services. > > > > Userspace selects a free tag (e.g. from a bitmap with QD elements) and > > uses that as the NVMe Command Identifier. This is trivial because each > > pthread has its own bitmap and NVMe Command Identifiers are per-SQ. > > I believe I have explained, in reality, NVME SQ/CQ pair can be less( > or much less) than nr_cpu_ids, so the per-queue-tags can be allocated & freed > among CPUs of (nr_cpu_ids / nr_hw_queues). > > Not mention userspace is capable of overriding the pthread cpu affinity, > so it isn't trivial & cheap, M pthreads could be run from > more than (nr_cpu_ids / nr_hw_queues) CPUs and contend on the single hw queue tags. I don't understand your nr_cpu_ids concerns. In the model I have described, the number of pthreads is min(nr_cpu_ids, max_sq_cq_pairs) and the SQ/CQ pairs are per pthread. There is no sharing of SQ/CQ pairs across pthreads. On a limited NVMe controller nr_cpu_ids=128 and max_sq_cq_pairs=8, so there are only 8 pthreads. Each pthread has its own io_uring context through which it handles M ublk_queues. Even if a pthread runs from more than 1 CPU, its SQ Command Identifiers (tags) are only used by that pthread and there is no SMP contention. Can you explain where you see SMP contention for NVMe SQ Command Identifiers? > > > > If there are no free tags then the request is placed in the pthread's > > per Namespace overflow list. Whenever an NVMe command completes, the > > overflow lists are scanned. One pending request is submitted to the NVMe > > PCI adapter in a round-robin fashion until the lists are empty or there > > are no more free tags. > > > > That's it. No ublk API changes are necessary. The userspace code is not > > slow or complex (just a bitmap and overflow list). 
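To show how little userspace code this needs, here is a minimal sketch of the per-pthread re-tagging I keep referring to (QD, the structure and the helper names are all illustrative; the overflow list and the actual SQE/CQE plumbing are left out):

#include <stdint.h>
#include <string.h>

#define QD 128	/* SQ/CQ ring size, assumed to be a multiple of 64 */

/* Per-pthread state: one NVMe SQ/CQ pair plus a private CID bitmap.
 * No other pthread ever touches it, so plain non-atomic bit ops are
 * enough. */
struct queue_thread {
	uint64_t free_cids[QD / 64];	/* bit set means the CID is free */
	uint16_t ublk_tag[QD];		/* CID -> original ublk tag */
	uint16_t ublk_dev[QD];		/* CID -> which of the M devices */
	/* ... io_uring ctx, SQ/CQ ring, M per-NS overflow lists ... */
};

static void cid_init(struct queue_thread *qt)
{
	memset(qt->free_cids, 0xff, sizeof(qt->free_cids));
}

/* Re-tag a ublk request: return a free NVMe Command Identifier, or -1
 * when the SQ is full and the request goes to the overflow list. */
static int cid_alloc(struct queue_thread *qt, uint16_t dev, uint16_t tag)
{
	for (unsigned int w = 0; w < QD / 64; w++) {
		if (qt->free_cids[w]) {
			unsigned int b = __builtin_ctzll(qt->free_cids[w]);
			int cid = w * 64 + b;

			qt->free_cids[w] &= ~(1ULL << b);
			qt->ublk_dev[cid] = dev;
			qt->ublk_tag[cid] = tag;
			return cid;
		}
	}
	return -1;
}

/* On an NVMe completion the caller reads ublk_dev[cid]/ublk_tag[cid] to
 * route the UBLK_IO_COMMIT_AND_FETCH_REQ, then recycles the CID. */
static void cid_free(struct queue_thread *qt, int cid)
{
	qt->free_cids[cid / 64] |= 1ULL << (cid % 64);
}

Since the bitmap and the lookup tables are owned by exactly one pthread, there are no locks or atomics involved, no matter which CPUs the scheduler runs that pthread on.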
> > Fine, but I am not sure we need to support such mess & pool implementation. > > > > > The approach also works for SCSI or devices that only support 1 request > > in flight at a time, with small tweaks. > > > > Going back to the beginning of the discussion: I think it's possible to > > write a ublk server that handles multiple LUNs/NS today. > > It is possible, but it is poor in both performance and resource > utilization, meantime with complicated ublk server implementation. Okay. I wanted to make sure I wasn't missing a reason why it's fundamentally impossible. Performance, resource utilization, or complexity is debatable and I think I understand your position. I think you're looking for a general solution that works well even with a high number of LUNs, where the model I proposed wastes resources. > > > > > > Another thing is that io command buffer has to be shared among all LUNs/ > > > NSs. So interface change has to cover shared io command buffer. > > > > I think the main advantage of extending the ublk API to share io command > > buffers between ublk_devices is to reduce userspace memory consumption? > > > > It eliminates the need to over-provision I/O buffers for write requests > > (or use the slower UBLK_IO_NEED_GET_DATA approach). > > Not only avoiding memory and cpu waste, but also simplifying ublk > server. > > > > > > With zero copy support, io buffer sharing needn't to be considered, that > > > can be a bit easier. > > > > > > In short, the sharing of (tag, io command buffer, io buffer) needs to be > > > considered for shared host ublk disks. > > > > > > Actually I prefer to 1), which matches with current design, and we can > > > just add host concept into ublk, and implementation could be easier. > > > > > > BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2], > > > and the performance improvement is pretty nice, so I think it is one reasonable > > > requirement to support "shared host" ublk disks for covering multi-lun or multi-ns. > > > > > > [1] https://github.com/ming1/ubdsrv/issues/49 > > > [2] https://github.com/longhorn/longhorn > > > > Nice performance improvement! > > > > I agree with you that the ublk API should have a way to declare the > > resource contraints for multi-LUN/NS servers (i.e. share the tag_set). I > > guess the simplest way to do that is by passing a reference to an > > existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing > > else about the ublk API needs to change, at least for tags. > > Basically (tags, io command buffer, io buffers) need to move into > host/hw_queue wide from disk wide, so not so simple, but won't > be too complicated. > > > > > Solving I/O buffer over-provisioning sounds similar to io_uring's > > provided buffer mechanism :). > > blk-mq has built-in host/hw_queue wide tag allocation, which can provide > unique tag for ublk server from ublk driver side, so everything can be > simplified a lot if we move (tag, io command buffer, io buffers) into > host/hw_queue wide by telling ublk_driver that we are > BLK_MQ_F_TAG_QUEUE_SHARED. > > Not sure if io_uring's provided buffer is good here, cause we need to > discard io buffers after queue become idle. But it won't be one big > deal if zero copy can be supported. If the per-request ublk resources are shared like tags as you described, then that's a nice solution that also solves I/O buffer over-provisioning. 
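If it helps, this is roughly what I have in mind, purely as an illustration: none of the names below exist in today's ublk UAPI, they are invented to show the "pass a reference to an existing device" idea.

#include <linux/types.h>

/* Invented for illustration only. */
#define UBLK_F_SHARED_HOST	(1ULL << 63)

struct ublksrv_ctrl_add_dev_shared {	/* invented name */
	/* the existing ublksrv_ctrl_dev_info fields (dev_id, nr_hw_queues,
	 * queue_depth, flags, ...) would stay unchanged */
	__u32	shared_host_dev_id;	/* dev_id of an already-created device
					 * whose tag_set (and, per your point,
					 * io command buffer) this one reuses;
					 * only valid with UBLK_F_SHARED_HOST */
	__u32	reserved;
};

On UBLK_CMD_ADD_DEV the driver would then attach the new disk to the referenced device's tag_set instead of allocating a fresh one.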
Stefan ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-20 12:34 ` Stefan Hajnoczi @ 2023-03-20 15:30 ` Ming Lei 2023-03-21 11:25 ` Stefan Hajnoczi 0 siblings, 1 reply; 34+ messages in thread From: Ming Lei @ 2023-03-20 15:30 UTC (permalink / raw) To: Stefan Hajnoczi Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang, ming.lei On Mon, Mar 20, 2023 at 08:34:17AM -0400, Stefan Hajnoczi wrote: > On Sat, Mar 18, 2023 at 08:30:29AM +0800, Ming Lei wrote: > > On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote: > > > On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote: > > > > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote: > > > > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > > > > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > > > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> Hi Ming, > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > > > > > > >> >> > > > > > Hello, > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. 
> > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > > > > > > >> >> > > > > What am I missing? > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > > > > > > >> >> > > devices. > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > > > > > > >> >> > > userspace. > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > > > > > > >> >> it just sub-optimal? > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> I have trouble following this discussion. 
Why can we not handle multiple > > > > > > > > > > > > > >> block devices in a single ublk user space process? > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > > > > > > an independent tag set? > > > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > > > > > > all namespaces of a controller? > > > > > > > > > > > > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > > > > > > That is because queue isn't free in both software and hardware, which > > > > > > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > > > > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > > > > > > > thought the idea was the ublk server creates as many threads as it needs > > > > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > > > > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? 
> > > > > > > > > > > > > > > > No. > > > > > > > > > > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > > > > > > number of pthread is equal to nr_hw_queues. > > > > > > > > > > > > > > Good, I think we agree on that part. > > > > > > > > > > > > > > Here is a summary of the ublk server model I've been describing: > > > > > > > 1. Each pthread has a separate io_uring context. > > > > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > > > > > > command queue, etc). > > > > > > > 3. Each pthread has a distinct subrange of the tag space if the tag > > > > > > > space is shared across hardware submission queues. > > > > > > > 4. Each pthread allocates tags from its subrange without coordinating > > > > > > > with other threads. This is cheap and simple. > > > > > > > > > > > > That is also not doable. > > > > > > > > > > > > The tag space can be pretty small, such as, usb-storage queue depth > > > > > > is just 1, and usb card reader can support multi lun too. > > > > > > > > > > If the tag space is very limited, just create one pthread. > > > > > > > > What I meant is that sub-range isn't doable. > > > > > > > > And pthread is aligned with queue, that is nothing to do with nr_tags. > > > > > > > > > > > > > > > That is just one extreme example, but there can be more low queue depth > > > > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > > > > > > there could be some implementation with less. > > > > > > > > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has > > > > > its own independent tag space. That means NVMe devices with low queue > > > > > depths work fine in the model I described. > > > > > > > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way > > > > or another, we should not support both sub-range or non-sub-range for > > > > avoiding unnecessary complexity. > > > > > > > > "Each pthread has its own independent tag space" may mean two things > > > > > > > > 1) each LUN/NS is implemented in standalone process space: > > > > - so every queue of each LUN has its own space, but all the queues with > > > > same ID share the whole queue tag space > > > > - that matches with current ublksrv > > > > - also easier to implement > > > > > > > > 2) all LUNs/NSs are implemented in single process space > > > > - so each pthread handles one queue for all NSs/LUNs > > > > > > > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk > > > > char device has to handle multiple LUNs/NSs(disks), which still need > > > > (big) ublk interface change. Also this way can't scale for single queue > > > > devices. > > > > > > The model I described is neither 1) or 2). It's similar to 2) but I'm > > > not sure why you say the ublk interface needs to be changed. I'm afraid > > > I haven't explained it well, sorry. I'll try to describe it again with > > > an NVMe PCI adapter being handled by userspace. > > > > > > There is a single ublk server process with an NVMe PCI device opened > > > using VFIO. > > > > > > There are N pthreads and each pthread has 1 io_uring context and 1 NVMe > > > PCI SQ/CQ pair. The size of the SQ and CQ rings is QD. > > > > > > The NVMe PCI device has M Namespaces. The ublk server creates M > > > ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD. > > > > > > The Linux block layer sees M block devices with N nr_hw_queues and QD > > > queue_depth. 
The actual NVMe PCI device resources are less than what the > > > Linux block layer sees because the each SQ/CQ pair is used for M > > > ublk_devices. In other words, Linux thinks there can be M * N * QD > > > requests in flight but in reality the NVMe PCI adapter only supports N * > > > QD requests. > > > > Yeah, but it is really bad. > > > > Now QD is the host hard queue depth, which can be very big, and could be > > more than thousands. > > > > ublk driver doesn't understand this kind of sharing(tag, io command buffer, io > > buffers), M * M * QD requests are submitted to ublk server, and CPUs/memory > > are wasted a lot. > > > > Every device has to allocate command buffers for holding QD io commands, and > > command buffer is supposed to be per-host, instead of per-disk. Same with io > > buffer pre-allocation in userspace side. > > I agree with you in cases with lots of LUNs (large M), block layer and > ublk driver per-request memory is allocated that cannot be used > simultaneously. > > > Userspace has to re-tag the requests for avoiding duplicated tag, and > > requests have to be throttled in ublk server side. If you implement tag allocation > > in userspace side, it is still one typical shared data issue in SMP, M pthreads > > contends on single tags from multiple CPUs. > > Here I still disagree. There is no SMP contention with NVMe because tags > are per SQ. For SCSI the tag namespace is shared but each pthread can > trivially work with a sub-range to avoid SMP contention. If the tag > namespace is too small for sub-ranges, then there should be fewer > pthreads. > > > > > > > Now I'll describe how userspace can take care of the mismatch between > > > the Linux block layer and the NVMe PCI device without doing much work: > > > > > > Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for > > > each of the M Namespaces. > > > > > > When userspace receives a request from ublk, it cannot simply copy the > > > struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier > > > (CID) field. There would be collisions between the tags used across the > > > M ublk_queues that the pthread services. > > > > > > Userspace selects a free tag (e.g. from a bitmap with QD elements) and > > > uses that as the NVMe Command Identifier. This is trivial because each > > > pthread has its own bitmap and NVMe Command Identifiers are per-SQ. > > > > I believe I have explained, in reality, NVME SQ/CQ pair can be less( > > or much less) than nr_cpu_ids, so the per-queue-tags can be allocated & freed > > among CPUs of (nr_cpu_ids / nr_hw_queues). > > > > Not mention userspace is capable of overriding the pthread cpu affinity, > > so it isn't trivial & cheap, M pthreads could be run from > > more than (nr_cpu_ids / nr_hw_queues) CPUs and contend on the single hw queue tags. > > I don't understand your nr_cpu_ids concerns. In the model I have > described, the number of pthreads is min(nr_cpu_ids, max_sq_cq_pairs) > and the SQ/CQ pairs are per pthread. There is no sharing of SQ/CQ pairs > across pthreads. > > On a limited NVMe controller nr_cpu_ids=128 and max_sq_cq_pairs=8, so > there are only 8 pthreads. Each pthread has its own io_uring context > through which it handles M ublk_queues. Even if a pthread runs from more > than 1 CPU, its SQ Command Identifiers (tags) are only used by that > pthread and there is no SMP contention. > > Can you explain where you see SMP contention for NVMe SQ Command > Identifiers? 
ublk server queue pthread is aligned with hw queue in ublk driver, and its affinity is retrieved from ublk blk-mq's hw queue's affinity. So if nr_hw_queues is 8, nr_cpu_ids is 128, there will be 16 cpus mapped to each hw queue. For example, hw queue 0's cpu affinity is cpu 0 ~ 15, and affinity of pthread for handling hw queue 0 is cpu 0 ~ 15 too. Now if we have M ublk devices, pthead 0(hw queue 0) of these M devices share same hw queue tags. M pthreads could be scheduled among cpu0~15, and tag is allocated from M pthreads among cpu0~15, contention? That is why I mentioned, if all devices are implemented in same process, and each pthread is handling host hardware queue for all M devices, the contention can be avoided. However, ublk server still needs lots of change. More importantly, it is one generic design, we need to cover both SQ and MQ. > > > > > > > If there are no free tags then the request is placed in the pthread's > > > per Namespace overflow list. Whenever an NVMe command completes, the > > > overflow lists are scanned. One pending request is submitted to the NVMe > > > PCI adapter in a round-robin fashion until the lists are empty or there > > > are no more free tags. > > > > > > That's it. No ublk API changes are necessary. The userspace code is not > > > slow or complex (just a bitmap and overflow list). > > > > Fine, but I am not sure we need to support such mess & pool implementation. > > > > > > > > The approach also works for SCSI or devices that only support 1 request > > > in flight at a time, with small tweaks. > > > > > > Going back to the beginning of the discussion: I think it's possible to > > > write a ublk server that handles multiple LUNs/NS today. > > > > It is possible, but it is poor in both performance and resource > > utilization, meantime with complicated ublk server implementation. > > Okay. I wanted to make sure I wasn't missing a reason why it's > fundamentally impossible. Performance, resource utilization, or > complexity is debatable and I think I understand your position. I think > you're looking for a general solution that works well even with a high > number of LUNs, where the model I proposed wastes resources. As I mentioned, it is one generic design for handling both SQ and MQ, and we won't take some hybrid approach of sub-range and mq. > > > > > > > > > > Another thing is that io command buffer has to be shared among all LUNs/ > > > > NSs. So interface change has to cover shared io command buffer. > > > > > > I think the main advantage of extending the ublk API to share io command > > > buffers between ublk_devices is to reduce userspace memory consumption? > > > > > > It eliminates the need to over-provision I/O buffers for write requests > > > (or use the slower UBLK_IO_NEED_GET_DATA approach). > > > > Not only avoiding memory and cpu waste, but also simplifying ublk > > server. > > > > > > > > > With zero copy support, io buffer sharing needn't to be considered, that > > > > can be a bit easier. > > > > > > > > In short, the sharing of (tag, io command buffer, io buffer) needs to be > > > > considered for shared host ublk disks. > > > > > > > > Actually I prefer to 1), which matches with current design, and we can > > > > just add host concept into ublk, and implementation could be easier. 
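To spell the contention out with a rough illustration (names and sizes invented): suppose the queue-0 pthreads of the M devices all allocate from the tag space of host hw queue 0 while being scheduled anywhere in cpu 0~15. The allocator then has to look like this:

#include <stdatomic.h>
#include <stdint.h>

#define HOST_QD 1024	/* host hw queue depth, assumed */

/* One bitmap for host hw queue 0, shared by the queue-0 pthreads of all
 * M ublk devices; bit set means the tag is free. */
static _Atomic uint64_t host_q0_tags[HOST_QD / 64];

static void host_tags_init(void)
{
	for (unsigned int w = 0; w < HOST_QD / 64; w++)
		atomic_store(&host_q0_tags[w], ~0ULL);
}

static int alloc_host_tag(void)
{
	for (unsigned int w = 0; w < HOST_QD / 64; w++) {
		uint64_t old = atomic_load(&host_q0_tags[w]);

		while (old) {
			unsigned int b = __builtin_ctzll(old);

			/* the CAS fails when another device's pthread
			 * races for the same word: that is the SMP
			 * contention in question */
			if (atomic_compare_exchange_weak(&host_q0_tags[w],
					&old, old & ~(1ULL << b)))
				return w * 64 + b;
		}
	}
	return -1;	/* no free tag: the ublk server has to throttle */
}

Every allocation is an atomic RMW on cachelines bounced between M pthreads, which is exactly what moving the tag allocation into ublk driver's host/hw_queue wide code avoids.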
> > > > > > > > BTW, ublk has been applied to implement iscsi alternative disk[1] for Longhorn[2], > > > > and the performance improvement is pretty nice, so I think it is one reasonable > > > > requirement to support "shared host" ublk disks for covering multi-lun or multi-ns. > > > > > > > > [1] https://github.com/ming1/ubdsrv/issues/49 > > > > [2] https://github.com/longhorn/longhorn > > > > > > Nice performance improvement! > > > > > > I agree with you that the ublk API should have a way to declare the > > > resource contraints for multi-LUN/NS servers (i.e. share the tag_set). I > > > guess the simplest way to do that is by passing a reference to an > > > existing device to UBLK_CMD_ADD_DEV so it can share the tag_set? Nothing > > > else about the ublk API needs to change, at least for tags. > > > > Basically (tags, io command buffer, io buffers) need to move into > > host/hw_queue wide from disk wide, so not so simple, but won't > > be too complicated. > > > > > > > > Solving I/O buffer over-provisioning sounds similar to io_uring's > > > provided buffer mechanism :). > > > > blk-mq has built-in host/hw_queue wide tag allocation, which can provide > > unique tag for ublk server from ublk driver side, so everything can be > > simplified a lot if we move (tag, io command buffer, io buffers) into > > host/hw_queue wide by telling ublk_driver that we are > > BLK_MQ_F_TAG_QUEUE_SHARED. > > > > Not sure if io_uring's provided buffer is good here, cause we need to > > discard io buffers after queue become idle. But it won't be one big > > deal if zero copy can be supported. > > If the per-request ublk resources are shared like tags as you described, > then that's a nice solution that also solves I/O buffer > over-provisioning. BTW, io_uring provided buffer can't work here, since we use per-queue/pthead io_uring in device level, but buffer actually belong to hardware queue of host. Thanks, Ming ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-20 15:30 ` Ming Lei @ 2023-03-21 11:25 ` Stefan Hajnoczi 0 siblings, 0 replies; 34+ messages in thread From: Stefan Hajnoczi @ 2023-03-21 11:25 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 18139 bytes --] On Mon, Mar 20, 2023 at 11:30:00PM +0800, Ming Lei wrote: > On Mon, Mar 20, 2023 at 08:34:17AM -0400, Stefan Hajnoczi wrote: > > On Sat, Mar 18, 2023 at 08:30:29AM +0800, Ming Lei wrote: > > > On Fri, Mar 17, 2023 at 10:41:28AM -0400, Stefan Hajnoczi wrote: > > > > On Fri, Mar 17, 2023 at 11:10:20AM +0800, Ming Lei wrote: > > > > > On Thu, Mar 02, 2023 at 10:09:25AM -0500, Stefan Hajnoczi wrote: > > > > > > On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > > > > > > > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > > > > > > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > > > > > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > > > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> Hi Ming, > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > > > > > > > >> >> > > > > > Hello, > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. 
> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > > > > > > > >> >> > > > > What am I missing? > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > > > > > > > >> >> > > > the case of scsi and nvme. > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > > > > > > > >> >> > > devices. > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > > > > > > > >> >> > > userspace. > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > > > > > > > >> >> it just sub-optimal? > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. 
> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > > > > > > > > >> block devices in a single ublk user space process? > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > > > > > > > an independent tag set? > > > > > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > > > > > > > all namespaces of a controller? > > > > > > > > > > > > > > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > > > > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > > > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > > > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > > > > > > > That is because queue isn't free in both software and hardware, which > > > > > > > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > > > > > > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. 
I > > > > > > > > > > thought the idea was the ublk server creates as many threads as it needs > > > > > > > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > > > > > > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > > > > > > > > > > > > > > > No. > > > > > > > > > > > > > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > > > > > > > number of pthread is equal to nr_hw_queues. > > > > > > > > > > > > > > > > Good, I think we agree on that part. > > > > > > > > > > > > > > > > Here is a summary of the ublk server model I've been describing: > > > > > > > > 1. Each pthread has a separate io_uring context. > > > > > > > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > > > > > > > command queue, etc). > > > > > > > > 3. Each pthread has a distinct subrange of the tag space if the tag > > > > > > > > space is shared across hardware submission queues. > > > > > > > > 4. Each pthread allocates tags from its subrange without coordinating > > > > > > > > with other threads. This is cheap and simple. > > > > > > > > > > > > > > That is also not doable. > > > > > > > > > > > > > > The tag space can be pretty small, such as, usb-storage queue depth > > > > > > > is just 1, and usb card reader can support multi lun too. > > > > > > > > > > > > If the tag space is very limited, just create one pthread. > > > > > > > > > > What I meant is that sub-range isn't doable. > > > > > > > > > > And pthread is aligned with queue, that is nothing to do with nr_tags. > > > > > > > > > > > > > > > > > > That is just one extreme example, but there can be more low queue depth > > > > > > > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > > > > > > > there could be some implementation with less. > > > > > > > > > > > > NVMe PCI has per-sq tags so subranges aren't needed. Each pthread has > > > > > > its own independent tag space. That means NVMe devices with low queue > > > > > > depths work fine in the model I described. > > > > > > > > > > NVMe PCI isn't special, and it is covered by current ublk abstract, so one way > > > > > or another, we should not support both sub-range or non-sub-range for > > > > > avoiding unnecessary complexity. > > > > > > > > > > "Each pthread has its own independent tag space" may mean two things > > > > > > > > > > 1) each LUN/NS is implemented in standalone process space: > > > > > - so every queue of each LUN has its own space, but all the queues with > > > > > same ID share the whole queue tag space > > > > > - that matches with current ublksrv > > > > > - also easier to implement > > > > > > > > > > 2) all LUNs/NSs are implemented in single process space > > > > > - so each pthread handles one queue for all NSs/LUNs > > > > > > > > > > Yeah, if you mean 2), the tag allocation is cheap, but the existed ublk > > > > > char device has to handle multiple LUNs/NSs(disks), which still need > > > > > (big) ublk interface change. Also this way can't scale for single queue > > > > > devices. > > > > > > > > The model I described is neither 1) or 2). It's similar to 2) but I'm > > > > not sure why you say the ublk interface needs to be changed. I'm afraid > > > > I haven't explained it well, sorry. I'll try to describe it again with > > > > an NVMe PCI adapter being handled by userspace. > > > > > > > > There is a single ublk server process with an NVMe PCI device opened > > > > using VFIO. 
> > > > > > > > There are N pthreads and each pthread has 1 io_uring context and 1 NVMe > > > > PCI SQ/CQ pair. The size of the SQ and CQ rings is QD. > > > > > > > > The NVMe PCI device has M Namespaces. The ublk server creates M > > > > ublk_devices. Each ublk_device has N ublk_queues with queue_depth QD. > > > > > > > > The Linux block layer sees M block devices with N nr_hw_queues and QD > > > > queue_depth. The actual NVMe PCI device resources are less than what the > > > > Linux block layer sees because the each SQ/CQ pair is used for M > > > > ublk_devices. In other words, Linux thinks there can be M * N * QD > > > > requests in flight but in reality the NVMe PCI adapter only supports N * > > > > QD requests. > > > > > > Yeah, but it is really bad. > > > > > > Now QD is the host hard queue depth, which can be very big, and could be > > > more than thousands. > > > > > > ublk driver doesn't understand this kind of sharing(tag, io command buffer, io > > > buffers), M * M * QD requests are submitted to ublk server, and CPUs/memory > > > are wasted a lot. > > > > > > Every device has to allocate command buffers for holding QD io commands, and > > > command buffer is supposed to be per-host, instead of per-disk. Same with io > > > buffer pre-allocation in userspace side. > > > > I agree with you in cases with lots of LUNs (large M), block layer and > > ublk driver per-request memory is allocated that cannot be used > > simultaneously. > > > > > Userspace has to re-tag the requests for avoiding duplicated tag, and > > > requests have to be throttled in ublk server side. If you implement tag allocation > > > in userspace side, it is still one typical shared data issue in SMP, M pthreads > > > contends on single tags from multiple CPUs. > > > > Here I still disagree. There is no SMP contention with NVMe because tags > > are per SQ. For SCSI the tag namespace is shared but each pthread can > > trivially work with a sub-range to avoid SMP contention. If the tag > > namespace is too small for sub-ranges, then there should be fewer > > pthreads. > > > > > > > > > > Now I'll describe how userspace can take care of the mismatch between > > > > the Linux block layer and the NVMe PCI device without doing much work: > > > > > > > > Each pthread sets up QD UBLK_IO_COMMIT_AND_FETCH_REQ io_uring_cmds for > > > > each of the M Namespaces. > > > > > > > > When userspace receives a request from ublk, it cannot simply copy the > > > > struct ublksrv_io_cmd->tag field into the NVMe SQE Command Identifier > > > > (CID) field. There would be collisions between the tags used across the > > > > M ublk_queues that the pthread services. > > > > > > > > Userspace selects a free tag (e.g. from a bitmap with QD elements) and > > > > uses that as the NVMe Command Identifier. This is trivial because each > > > > pthread has its own bitmap and NVMe Command Identifiers are per-SQ. > > > > > > I believe I have explained, in reality, NVME SQ/CQ pair can be less( > > > or much less) than nr_cpu_ids, so the per-queue-tags can be allocated & freed > > > among CPUs of (nr_cpu_ids / nr_hw_queues). > > > > > > Not mention userspace is capable of overriding the pthread cpu affinity, > > > so it isn't trivial & cheap, M pthreads could be run from > > > more than (nr_cpu_ids / nr_hw_queues) CPUs and contend on the single hw queue tags. > > > > I don't understand your nr_cpu_ids concerns. In the model I have > > described, the number of pthreads is min(nr_cpu_ids, max_sq_cq_pairs) > > and the SQ/CQ pairs are per pthread. 
There is no sharing of SQ/CQ pairs > > across pthreads. > > > > On a limited NVMe controller nr_cpu_ids=128 and max_sq_cq_pairs=8, so > > there are only 8 pthreads. Each pthread has its own io_uring context > > through which it handles M ublk_queues. Even if a pthread runs from more > > than 1 CPU, its SQ Command Identifiers (tags) are only used by that > > pthread and there is no SMP contention. > > > > Can you explain where you see SMP contention for NVMe SQ Command > > Identifiers? > > The ublk server queue pthread is aligned with a hw queue in the ublk driver, and its affinity is > retrieved from the ublk blk-mq hw queue's affinity. > > So if nr_hw_queues is 8 and nr_cpu_ids is 128, there will be 16 cpus mapped > to each hw queue. For example, hw queue 0's cpu affinity is cpu 0 ~ 15, > and the affinity of the pthread handling hw queue 0 is cpu 0 ~ 15 too. > > Now if we have M ublk devices, pthread 0 (hw queue 0) of these M devices > share the same hw queue tags. These M pthreads could be scheduled among cpu0~15, > and tags are allocated by these M pthreads among cpu0~15, so isn't that contention? > > That is why I mentioned that if all devices are implemented in the same process, and > each pthread handles one host hardware queue for all M devices, the contention > can be avoided. However, the ublk server still needs lots of changes. I see. In the model I described, each pthread services all M devices, so the contention is avoided. Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
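A minimal sketch of the per-SQ Command Identifier bitmap described in the message above. All names and the QD value are assumptions for illustration only, not code from ublksrv or any posted patch; since each queue pthread owns exactly one SQ/CQ pair, the pool needs no locking or atomics:

/*
 * Per-pthread NVMe Command Identifier (CID) pool, one instance per
 * queue pthread / SQ-CQ pair.  Only the owning pthread touches it,
 * so no locking is needed even when it services M ublk devices.
 */
#include <stdint.h>
#include <string.h>

#define QD 1024                        /* assumed per-SQ queue depth */

struct cid_pool {
	uint64_t bitmap[QD / 64];      /* 1 bit per in-flight CID */
};

static void cid_pool_init(struct cid_pool *p)
{
	memset(p->bitmap, 0, sizeof(p->bitmap));
}

/* Pick a free CID in [0, QD), or return -1 if the SQ is full. */
static int cid_alloc(struct cid_pool *p)
{
	for (unsigned int i = 0; i < QD / 64; i++) {
		if (p->bitmap[i] != ~0ULL) {
			int bit = __builtin_ctzll(~p->bitmap[i]);

			p->bitmap[i] |= 1ULL << bit;
			return (int)(i * 64 + bit);
		}
	}
	return -1;	/* caller throttles the ublk request until a CID frees up */
}

static void cid_free(struct cid_pool *p, unsigned int cid)
{
	p->bitmap[cid / 64] &= ~(1ULL << (cid % 64));
}

Because cid_alloc() only ever runs on the pthread that owns the SQ, the tag from struct ublksrv_io_cmd can be remapped to a CID without any cross-CPU coordination, which is the "no SMP contention" point made above.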
* Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware 2023-03-02 3:22 ` Ming Lei 2023-03-02 15:09 ` Stefan Hajnoczi @ 2023-03-16 14:24 ` Stefan Hajnoczi 1 sibling, 0 replies; 34+ messages in thread From: Stefan Hajnoczi @ 2023-03-16 14:24 UTC (permalink / raw) To: Ming Lei Cc: Andreas Hindborg, linux-block, lsf-pc, Liu Xiaodong, Jim Harris, Hans Holmberg, Matias Bjørling, hch@lst.de, ZiyangZhang [-- Attachment #1: Type: text/plain, Size: 9387 bytes --] On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote: > On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote: > > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote: > > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote: > > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote: > > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote: > > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote: > > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote: > > > > > > > > > > > > > > > > Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > > > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote: > > > > > > > > >> > > > > > > > > >> Hi Ming, > > > > > > > > >> > > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes: > > > > > > > > >> > > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote: > > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote: > > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote: > > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote: > > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote: > > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote: > > > > > > > > >> >> > > > > > Hello, > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block device from > > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1]. > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts: > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > Thanks for the thoughts, :-) > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too: > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp, > > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since it needs > > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources(such as tag) > > > > > > > > >> >> > > > > > > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or > > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today. > > > > > > > > >> >> > > > > What am I missing? > > > > > > > > >> >> > > > > > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't > > > > > > > > >> >> > > > support multiple ublk disks sharing single host, which is exactly > > > > > > > > >> >> > > > the case of scsi and nvme. 
> > > > > > > > >> >> > > > > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit? > > > > > > > > >> >> > > > > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place > > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple > > > > > > > > >> >> > > devices. > > > > > > > > >> >> > > > > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with > > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a > > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in > > > > > > > > >> >> > > userspace. > > > > > > > > >> >> > > > > > > > > > > >> >> > > I don't understand yet... > > > > > > > > >> >> > > > > > > > > > >> >> > blk_mq_tag_set is embedded into driver host structure, and referred by queue > > > > > > > > >> >> > via q->tag_set, both scsi and nvme allocates tag in host/queue wide, > > > > > > > > >> >> > that said all LUNs/NSs share host/queue tags, current every ublk > > > > > > > > >> >> > device is independent, and can't shard tags. > > > > > > > > >> >> > > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is > > > > > > > > >> >> it just sub-optimal? > > > > > > > > >> > > > > > > > > > >> > It is former, ublk can't support multiple devices which share single host > > > > > > > > >> > because duplicated tag can be seen in host side, then io is failed. > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple > > > > > > > > >> block devices in a single ublk user space process? > > > > > > > > >> > > > > > > > > >> From this conversation it seems that the limiting factor is allocation > > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can > > > > > > > > >> tell, the tag sets are allocated per virtual block device in > > > > > > > > >> `ublk_ctrl_add_dev()`? > > > > > > > > >> > > > > > > > > >> It seems to me that a single ublk user space process shuld be able to > > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then > > > > > > > > >> create a ublk device for each namespace, all from a single ublk process. > > > > > > > > >> > > > > > > > > >> Could you elaborate on why this is not possible? > > > > > > > > > > > > > > > > > > If the multiple storages devices are independent, the current ublk can > > > > > > > > > handle them just fine. > > > > > > > > > > > > > > > > > > But if these storage devices(such as luns in iscsi, or NSs in nvme-tcp) > > > > > > > > > share single host, and use host-wide tagset, the current interface can't > > > > > > > > > work as expected, because tags is shared among all these devices. The > > > > > > > > > current ublk interface needs to be extended for covering this case. > > > > > > > > > > > > > > > > Thanks for clarifying, that is very helpful. > > > > > > > > > > > > > > > > Follow up question: What would the implications be if one tried to > > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with > > > > > > > > an independent tag set? > > > > > > > > > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67 > > > > > > > > > > > > > > > What are the benefits of sharing a tagset across > > > > > > > > all namespaces of a controller? 
> > > > > > > > > > > > > > The userspace implementation can be simplified a lot since generic > > > > > > > shared tag allocation isn't needed, meantime with good performance > > > > > > > (shared tags allocation in SMP is one hard problem) > > > > > > > > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as > > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an > > > > > > > > > > In reality the max supported nr_queues of nvme is often much less than > > > > > nr_cpu_ids, for example, lots of nvme-pci devices just support at most > > > > > 32 queues, I remembered that Azure nvme supports less(just 8 queues). > > > > > That is because queue isn't free in both software and hardware, which > > > > > implementation is often tradeoff between performance and cost. > > > > > > > > I didn't say that the ublk server should have nr_cpu_ids threads. I > > > > thought the idea was the ublk server creates as many threads as it needs > > > > (e.g. max 8 if the Azure NVMe device only has 8 queues). > > > > > > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases? > > > > > > No. > > > > > > In ublksrv project, each pthread maps to one unique hardware queue, so total > > > number of pthread is equal to nr_hw_queues. > > > > Good, I think we agree on that part. > > > > Here is a summary of the ublk server model I've been describing: > > 1. Each pthread has a separate io_uring context. > > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI > > command queue, etc). > > 3. Each pthread has a distinct subrange of the tag space if the tag > > space is shared across hardware submission queues. > > 4. Each pthread allocates tags from its subrange without coordinating > > with other threads. This is cheap and simple. > > That is also not doable. > > The tag space can be pretty small, such as, usb-storage queue depth > is just 1, and usb card reader can support multi lun too. > > That is just one extreme example, but there can be more low queue depth > scsi devices(sata : 32, ...), typical nvme/pci queue depth is 1023, but > there could be some implementation with less. > > More importantly subrange could waste lots of tags for idle LUNs/NSs, and > active LUNs/NSs will have to suffer from the small subrange tags. And available > tags depth represents the max allowed in-flight block IOs, so performance > is affected a lot by subrange. > > If you look at block layer tag allocation change history, we never take > such way. Hi Ming, Any thoughts on my last reply? If my mental model is incorrect I'd like to learn why. Thanks, Stefan [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
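For comparison, the sub-range scheme debated in this message amounts to statically slicing one shared host-wide tag space across the queue pthreads. The following is a sketch under assumed example values (HOST_QD, NR_PTHREADS and the names are illustrative only, not taken from the kernel or ublksrv):

/*
 * Static partitioning of a shared host-wide tag space across queue
 * pthreads.  Each pthread allocates tags only from its own slice, so
 * no cross-CPU coordination is needed.
 */
#define HOST_QD     256        /* assumed host-wide tag space, e.g. SCSI can_queue */
#define NR_PTHREADS 8          /* assumed number of queue pthreads */

struct tag_range {
	unsigned int base;     /* first host tag owned by this pthread */
	unsigned int depth;    /* number of tags in the slice */
};

static struct tag_range tag_range_for(unsigned int pthread_idx)
{
	unsigned int per = HOST_QD / NR_PTHREADS;

	return (struct tag_range){
		.base  = pthread_idx * per,
		.depth = per,
	};
}

The trade-off Ming raises is visible directly in the arithmetic: with a SATA-class depth of 32 and 8 pthreads, each slice holds only 4 tags, which caps in-flight IO per pthread no matter how idle the other slices are, and a usb-storage depth of 1 cannot be partitioned at all.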
end of thread, other threads:[~2023-03-21 11:26 UTC | newest]

Thread overview: 34+ messages
2023-02-06 15:00 [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware Ming Lei
2023-02-06 17:53 ` Hannes Reinecke
2023-03-08 8:50 ` Hans Holmberg
2023-03-08 12:27 ` Ming Lei
2023-02-06 18:26 ` Bart Van Assche
2023-02-08 1:38 ` Ming Lei
2023-02-08 18:02 ` Bart Van Assche
2023-02-06 20:27 ` Stefan Hajnoczi
2023-02-08 2:12 ` Ming Lei
2023-02-08 12:17 ` Stefan Hajnoczi
2023-02-13 3:47 ` Ming Lei
2023-02-13 19:13 ` Stefan Hajnoczi
2023-02-15 0:51 ` Ming Lei
2023-02-15 15:27 ` Stefan Hajnoczi
2023-02-16 0:46 ` Ming Lei
2023-02-16 15:28 ` Stefan Hajnoczi
2023-02-16 9:44 ` Andreas Hindborg
2023-02-16 10:45 ` Ming Lei
2023-02-16 11:21 ` Andreas Hindborg
2023-02-17 2:20 ` Ming Lei
2023-02-17 16:39 ` Stefan Hajnoczi
2023-02-18 11:22 ` Ming Lei
2023-02-18 18:38 ` Stefan Hajnoczi
2023-02-22 23:17 ` Ming Lei
2023-02-23 20:18 ` Stefan Hajnoczi
2023-03-02 3:22 ` Ming Lei
2023-03-02 15:09 ` Stefan Hajnoczi
2023-03-17 3:10 ` Ming Lei
2023-03-17 14:41 ` Stefan Hajnoczi
2023-03-18 0:30 ` Ming Lei
2023-03-20 12:34 ` Stefan Hajnoczi
2023-03-20 15:30 ` Ming Lei
2023-03-21 11:25 ` Stefan Hajnoczi
2023-03-16 14:24 ` Stefan Hajnoczi