From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: "Andreas Hindborg" <nmi@metaspace.dk>,
	linux-block@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	"Liu Xiaodong" <xiaodong.liu@intel.com>,
	"Jim Harris" <james.r.harris@intel.com>,
	"Hans Holmberg" <Hans.Holmberg@wdc.com>,
	"Matias Bjørling" <Matias.Bjorling@wdc.com>,
	"hch@lst.de" <hch@lst.de>,
	ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Subject: Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
Date: Thu, 16 Mar 2023 10:24:46 -0400
Message-ID: <20230316142446.GC42060@fedora>
In-Reply-To: <ZAAWj8Bs8JujXsbX@T590>

On Thu, Mar 02, 2023 at 11:22:55AM +0800, Ming Lei wrote:
> On Thu, Feb 23, 2023 at 03:18:19PM -0500, Stefan Hajnoczi wrote:
> > On Thu, Feb 23, 2023 at 07:17:33AM +0800, Ming Lei wrote:
> > > On Sat, Feb 18, 2023 at 01:38:08PM -0500, Stefan Hajnoczi wrote:
> > > > On Sat, Feb 18, 2023 at 07:22:49PM +0800, Ming Lei wrote:
> > > > > On Fri, Feb 17, 2023 at 11:39:58AM -0500, Stefan Hajnoczi wrote:
> > > > > > On Fri, Feb 17, 2023 at 10:20:45AM +0800, Ming Lei wrote:
> > > > > > > On Thu, Feb 16, 2023 at 12:21:32PM +0100, Andreas Hindborg wrote:
> > > > > > > > 
> > > > > > > > Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > 
> > > > > > > > > On Thu, Feb 16, 2023 at 10:44:02AM +0100, Andreas Hindborg wrote:
> > > > > > > > >> 
> > > > > > > > >> Hi Ming,
> > > > > > > > >> 
> > > > > > > > >> Ming Lei <ming.lei@redhat.com> writes:
> > > > > > > > >> 
> > > > > > > > >> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > > > > > >> >> > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > > > >> >> > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > >> >> > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > >> >> > > > > > Hello,
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > So far UBLK is only used for implementing virtual block devices from
> > > > > > > > >> >> > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > Thanks for the thoughts, :-)
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > - for fast prototype or performance evaluation
> > > > > > > > >> >> > > > > > 
> > > > > > > > >> >> > > > > > - some network storage is attached to the host, such as iscsi and nvme-tcp;
> > > > > > > > >> >> > > > > > the current UBLK interface doesn't support such devices, since they need
> > > > > > > > >> >> > > > > > all LUNs/Namespaces to share host resources (such as tags)
> > > > > > > > >> >> > > > > 
> > > > > > > > >> >> > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > > >> >> > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > > >> >> > > > > What am I missing?
> > > > > > > > >> >> > > > 
> > > > > > > > >> >> > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > > > >> >> > > > support multiple ublk disks sharing a single host, which is exactly
> > > > > > > > >> >> > > > the case for scsi and nvme.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > Can you give an example that shows exactly where a problem is hit?
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > I took a quick look at the ublk source code and didn't spot a place
> > > > > > > > >> >> > > where it prevents a single ublk server process from handling multiple
> > > > > > > > >> >> > > devices.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > > > > >> >> > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > > > > >> >> > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > > > > >> >> > > userspace.
> > > > > > > > >> >> > > 
> > > > > > > > >> >> > > I don't understand yet...
> > > > > > > > >> >> > 
> > > > > > > > >> >> > blk_mq_tag_set is embedded in the driver's host structure and referenced by
> > > > > > > > >> >> > each queue via q->tag_set; both scsi and nvme allocate tags host-/queue-wide,
> > > > > > > > >> >> > that is, all LUNs/NSs share the host/queue tags. Currently every ublk
> > > > > > > > >> >> > device is independent and can't share tags.
> > > > > > > > >> >> 
> > > > > > > > >> >> Does this actually prevent ublk servers with multiple ublk devices or is
> > > > > > > > >> >> it just sub-optimal?
> > > > > > > > >> >
> > > > > > > > >> > It is the former: ublk can't support multiple devices that share a single
> > > > > > > > >> > host, because duplicate tags can be seen on the host side, and then the io fails.
> > > > > > > > >> >
> > > > > > > > >> 
> > > > > > > > >> I have trouble following this discussion. Why can we not handle multiple
> > > > > > > > >> block devices in a single ublk user space process?
> > > > > > > > >> 
> > > > > > > > >> From this conversation it seems that the limiting factor is allocation
> > > > > > > > >> of the tag set of the virtual device in the kernel? But as far as I can
> > > > > > > > >> tell, the tag sets are allocated per virtual block device in
> > > > > > > > >> `ublk_ctrl_add_dev()`?
> > > > > > > > >> 
> > > > > > > > >> It seems to me that a single ublk user space process should be able to
> > > > > > > > >> connect to multiple storage devices (for instance nvme-of) and then
> > > > > > > > >> create a ublk device for each namespace, all from a single ublk process.
> > > > > > > > >> 
> > > > > > > > >> Could you elaborate on why this is not possible?
> > > > > > > > >
> > > > > > > > > If the multiple storage devices are independent, the current ublk can
> > > > > > > > > handle them just fine.
> > > > > > > > >
> > > > > > > > > But if these storage devices (such as luns in iscsi, or NSs in nvme-tcp)
> > > > > > > > > share a single host and use a host-wide tagset, the current interface can't
> > > > > > > > > work as expected, because tags are shared among all these devices. The
> > > > > > > > > current ublk interface needs to be extended to cover this case.
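
(To make that host-wide sharing concrete, here is a rough kernel-side sketch of
several devices drawing driver tags from one blk_mq_tag_set; this is not actual
ublk or nvme code, and the field values are made up:)

#include <linux/blk-mq.h>

/*
 * One host-wide tag set, embedded in the driver's "host" structure the way
 * scsi and nvme do it.  Every LUN/NS request_queue created against this set
 * allocates driver tags from the same pool, so a tag value is unique across
 * all devices behind the host.
 */
struct demo_host {
	struct blk_mq_tag_set tag_set;
};

static int demo_host_init(struct demo_host *host, const struct blk_mq_ops *ops)
{
	host->tag_set.ops = ops;
	host->tag_set.nr_hw_queues = 8;		/* shared by all LUNs/NSs */
	host->tag_set.queue_depth = 128;	/* host-wide, not per device */
	host->tag_set.numa_node = NUMA_NO_NODE;
	host->tag_set.cmd_size = 0;

	return blk_mq_alloc_tag_set(&host->tag_set);
}

(Each per-device gendisk/request_queue is then created against host->tag_set;
the exact disk-allocation helper differs between kernel versions, so it is
left out here.)
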
> > > > > > > > 
> > > > > > > > Thanks for clarifying, that is very helpful.
> > > > > > > > 
> > > > > > > > Follow up question: What would the implications be if one tried to
> > > > > > > > expose (through ublk) each nvme namespace of an nvme-of controller with
> > > > > > > > an independent tag set?
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-block/877cwhrgul.fsf@metaspace.dk/T/#m57158db9f0108e529d8d62d1d56652c52e9e3e67
> > > > > > > 
> > > > > > > > What are the benefits of sharing a tagset across
> > > > > > > > all namespaces of a controller?
> > > > > > > 
> > > > > > > The userspace implementation can be simplified a lot since generic
> > > > > > > shared tag allocation isn't needed, while still getting good performance
> > > > > > > (shared tag allocation under SMP is a hard problem).
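
(As a side note, a naive shared allocator makes the SMP problem visible: a
single atomic bitmap that every submission thread allocates from is correct,
but all CPUs contend on the same cachelines, which is the contention the
kernel's sbitmap code exists to spread out.  A minimal userspace sketch,
purely illustrative:)

#include <stdatomic.h>
#include <stdint.h>

#define TAG_DEPTH 128
static _Atomic uint64_t tag_map[TAG_DEPTH / 64];	/* shared by all threads */

static int tag_alloc(void)
{
	for (int w = 0; w < TAG_DEPTH / 64; w++) {
		uint64_t old = atomic_load(&tag_map[w]);

		while (~old) {
			int bit = __builtin_ctzll(~old);	/* lowest free tag in word */

			if (atomic_compare_exchange_weak(&tag_map[w], &old,
							 old | (1ULL << bit)))
				return w * 64 + bit;
			/* CAS failed under contention: old was reloaded, retry */
		}
	}
	return -1;	/* no free tag, caller must wait and retry */
}

static void tag_free(int tag)
{
	atomic_fetch_and(&tag_map[tag / 64], ~(1ULL << (tag % 64)));
}
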
> > > > > > 
> > > > > > In NVMe, tags are per Submission Queue. AFAIK there's no such thing as
> > > > > > shared tags across multiple SQs in NVMe. So userspace doesn't need an
> > > > > 
> > > > > In reality the max supported nr_queues of nvme is often much less than
> > > > > nr_cpu_ids; for example, lots of nvme-pci devices support at most 32 queues,
> > > > > and I remember that Azure nvme supports even fewer (just 8 queues).
> > > > > That is because queues aren't free in either software or hardware, and the
> > > > > implementation is often a tradeoff between performance and cost.
> > > > 
> > > > I didn't say that the ublk server should have nr_cpu_ids threads. I
> > > > thought the idea was the ublk server creates as many threads as it needs
> > > > (e.g. max 8 if the Azure NVMe device only has 8 queues).
> > > > 
> > > > Do you expect ublk servers to have nr_cpu_ids threads in all/most cases?
> > > 
> > > No.
> > > 
> > > In the ublksrv project, each pthread maps to one unique hardware queue, so the
> > > total number of pthreads is equal to nr_hw_queues.
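
(A tiny liburing-style sketch of that threading model, with made-up names and
no error handling: one pthread per hardware queue, each with a private
io_uring context:)

#include <liburing.h>
#include <pthread.h>

struct queue_thread {
	int hw_queue_idx;	/* the one hardware queue this thread services */
	struct io_uring ring;	/* private ring, so no locking on SQ/CQ */
};

static void *queue_thread_fn(void *arg)
{
	struct queue_thread *qt = arg;

	io_uring_queue_init(256, &qt->ring, 0);

	/* ... fetch ublk commands and issue backend I/O on qt->ring ... */

	io_uring_queue_exit(&qt->ring);
	return NULL;
}
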
> > 
> > Good, I think we agree on that part.
> > 
> > Here is a summary of the ublk server model I've been describing:
> > 1. Each pthread has a separate io_uring context.
> > 2. Each pthread has its own hardware submission queue (NVMe SQ, SCSI
> >    command queue, etc).
> > 3. Each pthread has a distinct subrange of the tag space if the tag
> >    space is shared across hardware submission queues.
> > 4. Each pthread allocates tags from its subrange without coordinating
> >    with other threads. This is cheap and simple.
> 
> That is also not doable.
>
> The tag space can be pretty small: for example, the usb-storage queue depth
> is just 1, and a usb card reader can support multiple luns too.
>
> That is just one extreme example, but there are more low-queue-depth scsi
> devices (sata: 32, ...); the typical nvme/pci queue depth is 1023, but there
> could be implementations with less.
>
> More importantly, a subrange could waste lots of tags on idle LUNs/NSs, and
> active LUNs/NSs would have to suffer from the small subrange of tags. The
> available tag depth represents the max allowed in-flight block IOs, so
> performance is affected a lot by subranging.
>
> If you look at the block layer tag allocation change history, we have never
> taken such an approach.

Hi Ming,
Any thoughts on my last reply? If my mental model is incorrect I'd like
to learn why.
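
To make points 3 and 4 of that model concrete, here is a rough userspace
sketch of the per-thread subrange idea (made-up names, not ublksrv code, and
it assumes the host-wide depth splits evenly across queue threads with at
most 64 tags each):

#include <stdint.h>

struct tag_range {
	uint16_t base;		/* first host-wide tag owned by this thread */
	uint16_t depth;		/* number of tags in the subrange (<= 64 here) */
	uint64_t free_mask;	/* bit i set => tag base+i is free */
};

static void tag_range_init(struct tag_range *r, uint16_t host_depth,
			   uint16_t nr_threads, uint16_t thread_idx)
{
	r->depth = host_depth / nr_threads;
	r->base  = r->depth * thread_idx;
	r->free_mask = (r->depth == 64) ? ~0ULL : ((1ULL << r->depth) - 1);
}

static int tag_range_alloc(struct tag_range *r)
{
	if (!r->free_mask)
		return -1;			/* this thread's subrange is exhausted */

	int bit = __builtin_ctzll(r->free_mask);

	r->free_mask &= ~(1ULL << bit);
	return r->base + bit;			/* unique host-wide, no locking needed */
}

static void tag_range_free(struct tag_range *r, int tag)
{
	r->free_mask |= 1ULL << (tag - r->base);
}

With one such range per queue thread, allocation never needs coordination;
the cost is the static partitioning you describe above, where idle LUNs/NSs
can strand tags that active ones could have used.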

Thanks,
Stefan
