linux-block.vger.kernel.org archive mirror
From: Ming Lei <ming.lei@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: linux-block@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	"Liu Xiaodong" <xiaodong.liu@intel.com>,
	"Jim Harris" <james.r.harris@intel.com>,
	"Hans Holmberg" <Hans.Holmberg@wdc.com>,
	"Matias Bjørling" <Matias.Bjorling@wdc.com>,
	"hch@lst.de" <hch@lst.de>,
	ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Subject: Re: [LSF/MM/BPF BoF]: extend UBLK to cover real storage hardware
Date: Thu, 16 Feb 2023 08:46:56 +0800
Message-ID: <Y+19AM8zuU9+abQS@T590>
In-Reply-To: <Y+z5yzrOhq2nbV/A@fedora>

On Wed, Feb 15, 2023 at 10:27:07AM -0500, Stefan Hajnoczi wrote:
> On Wed, Feb 15, 2023 at 08:51:27AM +0800, Ming Lei wrote:
> > On Mon, Feb 13, 2023 at 02:13:59PM -0500, Stefan Hajnoczi wrote:
> > > On Mon, Feb 13, 2023 at 11:47:31AM +0800, Ming Lei wrote:
> > > > On Wed, Feb 08, 2023 at 07:17:10AM -0500, Stefan Hajnoczi wrote:
> > > > > On Wed, Feb 08, 2023 at 10:12:19AM +0800, Ming Lei wrote:
> > > > > > On Mon, Feb 06, 2023 at 03:27:09PM -0500, Stefan Hajnoczi wrote:
> > > > > > > On Mon, Feb 06, 2023 at 11:00:27PM +0800, Ming Lei wrote:
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > So far UBLK is only used for implementing virtual block device from
> > > > > > > > userspace, such as loop, nbd, qcow2, ...[1].
> > > > > > > 
> > > > > > > I won't be at LSF/MM so here are my thoughts:
> > > > > > 
> > > > > > Thanks for the thoughts, :-)
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > It could be useful for UBLK to cover real storage hardware too:
> > > > > > > > 
> > > > > > > > - for fast prototype or performance evaluation
> > > > > > > > 
> > > > > > > > - some network storages are attached to host, such as iscsi and nvme-tcp,
> > > > > > > > the current UBLK interface doesn't support such devices, since it needs
> > > > > > > > all LUNs/Namespaces to share host resources(such as tag)
> > > > > > > 
> > > > > > > Can you explain this in more detail? It seems like an iSCSI or
> > > > > > > NVMe-over-TCP initiator could be implemented as a ublk server today.
> > > > > > > What am I missing?
> > > > > > 
> > > > > > The current ublk can't do that yet, because the interface doesn't
> > > > > > support multiple ublk disks sharing single host, which is exactly
> > > > > > the case of scsi and nvme.
> > > > > 
> > > > > Can you give an example that shows exactly where a problem is hit?
> > > > > 
> > > > > I took a quick look at the ublk source code and didn't spot a place
> > > > > where it prevents a single ublk server process from handling multiple
> > > > > devices.
> > > > > 
> > > > > Regarding "host resources(such as tag)", can the ublk server deal with
> > > > > that in userspace? The Linux block layer doesn't have the concept of a
> > > > > "host", that would come in at the SCSI/NVMe level that's implemented in
> > > > > userspace.
> > > > > 
> > > > > I don't understand yet...
> > > > 
> > > > blk_mq_tag_set is embedded into driver host structure, and referred by queue
> > > > via q->tag_set, both scsi and nvme allocates tag in host/queue wide,
> > > > that said all LUNs/NSs share host/queue tags, currently every ublk
> > > > device is independent, and can't share tags.
> > > 
> > > Does this actually prevent ublk servers with multiple ublk devices or is
> > > it just sub-optimal?
> > 
> > It is former, ublk can't support multiple devices which share single host
> > because duplicated tag can be seen in host side, then io is failed.
> 
> The kernel sees two independent block devices so there is no issue
> within the kernel.

This approach either wastes memory or performs badly, since we can't
pick a perfect queue depth for each ublk device.

> 
> Userspace can do its own hw tag allocation if there are shared storage
> controller resources (e.g. NVMe CIDs) to avoid duplicating tags.
> 
> Have I missed something?

Please look at lib/sbitmap.c and block/blk-mq-tag.c, and see how many
hard issues have been reported and fixed in the past, and how much
optimization has been done in this area.

In theory hw tag allocation can be done in userspace, but it is hard to
do efficiently:

1) sharing data efficiently on SMP has proved to be a hard task, so
don't reinvent the wheel in userspace; this work could take much more
effort than extending the current ublk interface, and would be
fruitless

2) allocating tags twice slows down the io path a lot

3) userspace allocation is even worse, because the task can be killed
without any cleanup being done, so tags can leak easily
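
To make the points above concrete, here is a minimal sketch of what a
naive userspace host-wide tag allocator might look like. It is only an
illustration under stated assumptions: struct host_tags, host_tag_alloc(),
host_tag_free() and HOST_DEPTH are hypothetical names and are not part of
ublk or ublksrv, the depth is capped at 64 to fit one word, and
__builtin_ctzll assumes GCC or Clang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define HOST_DEPTH 64   /* hypothetical host-wide depth, one 64-bit word */

/* One bitmap shared by every ublk device behind the same host. */
struct host_tags {
    _Atomic uint64_t busy;  /* bit n set => tag n is in flight */
};

/* Returns a free tag in [0, HOST_DEPTH), or -1 if all tags are in use. */
static int host_tag_alloc(struct host_tags *t)
{
    uint64_t old = atomic_load(&t->busy);

    while (old != UINT64_MAX) {
        int tag = __builtin_ctzll(~old);    /* lowest clear bit */

        /* Every queue thread contends on this single cache line. */
        if (atomic_compare_exchange_weak(&t->busy, &old,
                                         old | (1ULL << tag)))
            return tag;
        /* CAS failed: 'old' now holds the current value, so retry. */
    }
    return -1;
}

/* Releases a tag; nothing calls this if the owning task is killed. */
static void host_tag_free(struct host_tags *t, int tag)
{
    assert(tag >= 0 && tag < HOST_DEPTH);
    uint64_t prev = atomic_fetch_and(&t->busy, ~(1ULL << tag));
    assert(prev & (1ULL << tag));   /* catch double free */
    (void)prev;
}
```

Even this toy version shows the three problems: all queues bounce one
cache line (the kind of thing sbitmap spreads across per-cpu hints and
multiple words), every io still needs its kernel blk-mq tag in addition
to this one, and a bit set by a task that gets killed stays set unless
some other mechanism reclaims it.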


Thanks, 
Ming

