From: Ming Lei <ming.lei@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-doc@vger.kernel.org, linux-block@vger.kernel.org,
Christoph Hellwig <hch@lst.de>, Jonathan Corbet <corbet@lwn.net>,
"Richard W . M . Jones" <rjones@redhat.com>,
ZiyangZhang <ZiyangZhang@linux.alibaba.com>,
Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>,
ming.lei@redhat.com
Subject: Re: [PATCH] Docs: ublk: add ublk document
Date: Thu, 1 Sep 2022 08:56:50 +0800 [thread overview]
Message-ID: <YxADUntc7em5mA34@T590> (raw)
In-Reply-To: <Yw4rcz23R3ofn6H6@fedora>
On Tue, Aug 30, 2022 at 11:23:31AM -0400, Stefan Hajnoczi wrote:
> On Sun, Aug 28, 2022 at 12:50:03PM +0800, Ming Lei wrote:
> > ublk document is missed when merging ublk driver, so add it now.
> >
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Richard W.M. Jones <rjones@redhat.com>
> > Cc: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
> > Cc: Stefan Hajnoczi <stefanha@redhat.com>
> > Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > Documentation/block/ublk.rst | 203 +++++++++++++++++++++++++++++++++++
> > 1 file changed, 203 insertions(+)
> > create mode 100644 Documentation/block/ublk.rst
> >
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > new file mode 100644
> > index 000000000000..9e8f7ba518a3
> > --- /dev/null
> > +++ b/Documentation/block/ublk.rst
> > @@ -0,0 +1,203 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==========================================
> > +Userspace block device driver(ublk driver)
>
> Usually there is a space before '('. There are more instances throughout
> the document.
>
> > +==========================================
> > +
> > +Overview
> > +========
> > +
> > +ublk is one generic framework for implementing block device logic from
> > +userspace. It is very helpful to move virtual block drivers into userspace,
> > +such as loop, nbd and similar block drivers. It can help to implement new
> > +virtual block device, such as ublk-qcow2, and there was several attempts
> > +of implementing qcow2 driver in kernel.
>
> This paragraph doesn't give specific reasons why userspace block devices
> are "very helpful".
>
> Userspace block devices are attractive because:
> - They can be written many programming languages.
> - They can use libraries that are not available in the kernel.
> - They can be debugged with tools familiar to application developers.
> - Crashes do not kernel panic the machine.
> - Bugs are likely to have a lower security impact than bugs in kernel
> code.
> - They can be installed and updated independently of the kernel.
>
> I think including this list makes it clear why userspace block devices
> are appropriate in certain cases.
Good point!
Another use case: - can simulate generic block device with different type
with different parameters easily
>
> > +
> > +ublk block device(``/dev/ublkb*``) is added by ublk driver. Any IO request
> > +submitted to ublk device will be forwarded to ublk's userspace part(
> > +ublksrv [1]), and after the IO is handled by ublksrv, the result is
> > +committed back to ublk driver, then ublk IO request can be completed. With
> > +this way, any specific IO handling logic is totally done inside ublksrv,
> > +and ublk driver doe _not_ handle any device specific IO logic, such as
>
> s/doe/does/
>
> > +loop's IO handling, NBD's IO communication, or qcow2's IO mapping, ...
> > +
> > +/dev/ublkbN is driven by blk-mq request based driver, each request is
> > +assigned by one queue wide unique tag. ublksrv assigns unique tag to each
> > +IO too, which is 1:1 mapped with IO of /dev/ublkb*.
>
> Is "/dev/ublkbN" the same as "/dev/ublkb*"? Please use one term consistently.
>
> > +
> > +Both the IO request forward and IO handling result committing are done via
> > +io_uring passthrough command, that is why ublk is also one io_uring based
>
> s/also one/a/
>
> > +block driver. It has been observed that io_uring passthrough command can get
> > +better IOPS than block IO. So ublk is one high performance implementation
> > +of userspace block device.
>
> I think this sentence means "Therefore ublk uses io_uring passthrough
> commands to implement userspace block devices in a performant way".
Yeah, because io_uring passthrough is proved to be very efficient.
>
> > Not only IO request communication is done by
> > +io_uring, but also the preferred IO handling in ublksrv is io_uring based
> > +approach too.
> > +
> > +ublk provides control interface to set/get ublk block device parameters, and
> > +the interface is extendable and kabi compatible, so basically any ublk request
> > +queue's parameter or ublk generic feature parameters can be set/get via this
> > +extendable interface. So ublk is generic userspace block device framework, such
> > +as, it is easy to setup one ublk device with specified block parameters from
>
> "such as" -> "and as such"
>
> > +userspace.
> > +
> > +How to use ublk
> > +===============
> > +
> > +After building ublksrv[1], ublk block device(``/dev/ublkb*``) can be added
>
> It might be worth separating two use cases:
>
> - Quickstart and people who just want to use existing ublksrv
> functionality.
>
> - Developers who are writing their own ublk server (e.g. rjones' nbdkit).
>
> ublksrv isn't needed, it's possible to code directly against the ublk
> driver UAPI. I think it's worth pointing this out so it's clear to
> people when and why you would use ublksrv and also when to code against
> the ublk driver UAPI.
>
> The documentation currently leaves it to the reader to figure out that
> ublksrv is just one possible userspace server implementation.
OK, we can add the words, but it is usually one common sense since
userspace is independent with kernel.
>
> > +and deleted by the utility, then existed block IO applications can talk with
> > +it.
> > +
> > +See usage details in README[2] of ublksrv, for example of ublk-loop:
> > +
> > +- add ublk device:
> > + ublk add -t loop -f ublk-loop.img
> > +
> > +- use it:
> > + mkfs.xfs /dev/ublkb0
> > + mount /dev/ublkb0 /mnt
> > + .... # all IOs are handled by io_uring!!!
> > + umount /mnt
> > +
> > +- get ublk dev info:
> > + ublk list
> > +
> > +- delete ublk device
> > + ublk del -a
> > + ublk del -n $ublk_dev_id
> > +
> > +Design
> > +======
> > +
> > +Control plane
> > +-------------
> > +
> > +ublk driver provides global misc device node(``/dev/ublk-control``) for
> > +managing and controlling ublk devices with help of several control commands:
> > +
> > +- UBLK_CMD_ADD_DEV
> > + Add one ublk char device(``/dev/ublkc*``) which is talked with ublksrv wrt.
>
> It may be clearer to say "the server" or "userspace" instead of
> "ublksrv", since that is the name of a specific server implementation
> and people may develop other servers.
For the sake of simplicity, I will use ublksrv or ublk server to represent any
userspace application of ublk.
>
> > + IO command communication. Basic device info is sent together with this
> > + command, see UAPI structure of ublksrv_ctrl_dev_info, such as nr_hw_queues,
> > + queue_depth, and max IO request buffer size, which info is negotiated with
> > + ublk driver and sent back to ublksrv. After this command is completed, the
> > + basic device info can't be changed any more.
> > +
> > +- UBLK_CMD_SET_PARAMS / UBLK_CMD_GET_PARAMS
> > + Set or get ublk device's parameters, which can be generic feature related,
> > + or request queue limit related, but can't be IO logic specific, cause ublk
> > + driver does not handle any IO logic. This command has to be sent before
> > + sending UBLK_CMD_START_DEV.
> > +
> > +- UBLK_CMD_START_DEV
> > + After ublksrv prepares userspace resource such as, creating per-queue
> > + pthread & io_ruing for handling ublk IO, this command is set for ublk
>
> s/io_ruing/io_uring/
>
> > + driver to allocate & expose /dev/ublkb*. Parameters set via
> > + UBLK_CMD_SET_PARAMS are applied for creating /dev/ublkb*.
> > +
> > +- UBLK_CMD_STOP_DEV
> > + Quiesce IO on /dev/ublkb* and delete the disk. After this command returns,
> > + ublksrv can release resource, such as destroy per-queue pthread & io_uring
> > + for handling io command.
> > +
> > +- UBLK_CMD_DEL_DEV
> > + Delete /dev/ublkc*. After this command returns, the allocated ublk device
> > + number can be reused.
> > +
> > +- UBLK_CMD_GET_QUEUE_AFFINITY
> > + After /dev/ublkc is added, ublk driver creates block layer tagset, so each
> > + queue's affinity info is available, ublksrv sends UBLK_CMD_GET_QUEUE_AFFINITY
> > + to retrieve queue affinity info, so ublksrv can setup the per-queue context
> > + efficiently, such as bind affine CPUs with IO pthread, and try to allocate
> > + buffers in IO thread context.
> > +
> > +- UBLK_CMD_GET_DEV_INFO
> > + For retrieve device info of ublksrv_ctrl_dev_info. And it is ublksrv's
> > + responsibility to save IO target specific info in userspace.
> > +
> > +Data plane
> > +----------
> > +
> > +ublksrv needs to create per-queue IO pthread & io_uring for handling IO
> > +command (io_uring passthrough command), and the per-queue IO pthread
> > +focuses on IO handling and shouldn't handle any control & management
> > +task.
> > +
> > +ublksrv's IO is assigned by one unique tag, which is 1:1 mapping with IO
> > +request of /dev/ublkb*.
> > +
> > +UAPI structure of ublksrv_io_desc is defined for describing each IO from
> > +ublk driver. One fixed mmaped area(array) on /dev/ublkc* is provided for
> > +exporting IO info to ublksrv, such as IO offset, length, OP/flags and
> > +buffer address. Each ublksrv_io_desc instance can be indexed via queue id
> > +and IO tag directly.
> > +
> > +Following IO commands are communicated via io_uring passthrough command,
> > +and each command is only for forwarding ublk IO and committing IO result
> > +with specified IO tag in the command data:
> > +
> > +- UBLK_IO_FETCH_REQ
> > + Sent from ublksrv IO pthread for fetching future coming IO request
> > + issued to /dev/ublkb*. This command is just sent once from ublksrv IO
> > + pthread for ublk driver to setup IO forward environment.
> > +
> > +- UBLK_IO_COMMIT_AND_FETCH_REQ
> > + After one IO request is issued to /dev/ublkb*, ublk driver stores this
> > + IO's ublksrv_io_desc to the specified mapped area, then the previous
> > + received IO command of this IO tag, either UBLK_IO_FETCH_REQ or
> > + UBLK_IO_COMMIT_AND_FETCH_REQ, is completed, so ulksrv gets the IO
> > + notification via io_uring.
> > +
> > + After ublksrv handles this IO, this IO's result is committed back to ublk
> > + driver by sending UBLK_IO_COMMIT_AND_FETCH_REQ back. Once ublkdrv received
> > + this command, it parses the IO result and complete the IO request to
> > + /dev/ublkb*. Meantime setup environment for fetching future IO request
> > + with this IO tag. So UBLK_IO_COMMIT_AND_FETCH_REQ is reused for both
> > + fetching request and committing back IO result.
> > +
> > +- UBLK_IO_NEED_GET_DATA
> > + ublksrv pre-allocates IO buffer for each IO at default, any new project
> > + should use this IO buffer to communicate with ublk driver. But existed
> > + project may not work or be changed to in this way, so add this command
> > + to provide chance for userspace to use its existed buffer for handling
> > + IO.
>
> I find it hard to understand this paragraph. It seems the
> UBLK_IO_NEED_GET_DATA command allows userspace to set up something
> related to IO buffers. What exactly does this command do?
Looks Ziyang has replied on this question.
>
> > +
> > +- data copy between ublkserv IO buffer and ublk block IO request
>
> s/ublkserv/ublksrv/
>
> > + ublk driver needs to copy ublk block IO request pages into ublksrv buffer
> > + (pages) first for WRITE before notifying ublksrv of the coming IO, so
> > + ublksrv can hanldle WRITE request.
>
> s/hanldle/handle/
>
> > +
> > + After ublksrv handles READ request and sends UBLK_IO_COMMIT_AND_FETCH_REQ
> > + to ublksrv, ublkdrv needs to copy read ublksrv buffer(pages) to the ublk
> > + IO request pages.
>
> In the READ case userspace doesn't know exactly when the ublk driver is
> done copying in the buffer. I guess this isn't a problem because the
> userspace buffer will be reused when the next request is fetched?
In short, one buffer is dedicated for one single tag/slot, either in driver
side or userspace side. In kernel side, the buffer is used/prepared for
accepting new io request. In userspace side, it is used for handling
the io request. And that is how UBLK_IO_COMMIT_AND_FETCH_REQ is
defined.
Yeah, it is pre-allocation of virtual memory, but pages are just allocated
when the buffer/IO is used. If the buffer isn't used for long enough
period(so far it is set as 20sec), ublksrv will discard these pages via
madvise(DONT_NEED), so swap out isn't involved usually.
Thanks,
Ming
next prev parent reply other threads:[~2022-09-01 0:57 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-28 4:50 [PATCH] Docs: ublk: add ublk document Ming Lei
2022-08-28 12:40 ` Bagas Sanjaya
2022-08-28 15:17 ` Randy Dunlap
2022-08-30 8:53 ` Ming Lei
2022-08-28 15:09 ` Richard W.M. Jones
2022-08-29 3:58 ` Gao Xiang
2022-08-30 9:05 ` Ming Lei
2022-08-29 0:18 ` kernel test robot
2022-08-30 12:14 ` Bagas Sanjaya
2022-08-30 14:13 ` Ming Lei
2022-08-30 15:23 ` Stefan Hajnoczi
2022-08-31 6:31 ` Ziyang Zhang
2022-08-31 20:00 ` Stefan Hajnoczi
2022-09-01 1:34 ` Ming Lei
2022-09-01 2:47 ` Ziyang Zhang
2022-09-01 0:56 ` Ming Lei [this message]
2022-09-01 1:04 ` Jens Axboe
2022-09-01 1:35 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YxADUntc7em5mA34@T590 \
--to=ming.lei@redhat.com \
--cc=ZiyangZhang@linux.alibaba.com \
--cc=axboe@kernel.dk \
--cc=corbet@lwn.net \
--cc=hch@lst.de \
--cc=linux-block@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=rjones@redhat.com \
--cc=stefanha@redhat.com \
--cc=xiaoguang.wang@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).