From: Jens Axboe <axboe@kernel.dk>
To: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Christoph Hellwig <hch@infradead.org>,
linux-block <linux-block@vger.kernel.org>,
Damien Le Moal <Damien.LeMoal@wdc.com>,
Keith Busch <kbusch@kernel.org>,
"linux-scsi @ vger . kernel . org" <linux-scsi@vger.kernel.org>,
"Martin K . Petersen" <martin.petersen@oracle.com>,
"linux-fsdevel @ vger . kernel . org"
<linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
Date: Tue, 12 May 2020 20:37:57 -0600
Message-ID: <abd4b3d4-6261-c3a6-9b4c-9bf009a9820d@kernel.dk>
In-Reply-To: <20200512085554.26366-1-johannes.thumshirn@wdc.com>
On 5/12/20 2:55 AM, Johannes Thumshirn wrote:
> The upcoming NVMe ZNS Specification will define a new type of write
> command for zoned block devices, zone append.
>
> When writing to a zoned block device using zone append, the start
> sector of the write points at the start LBA of the zone to write to.
> Upon completion the block device responds with the position at which
> the data has been placed in the zone. From a high-level perspective this
> can be seen as a file system's block allocator, where the user writes to
> a file and the file system takes care of the data placement on the device.
>
> In order to fully exploit the new zone append command in file-systems and
> other interfaces above the block layer, we choose to emulate zone append
> in SCSI and null_blk. This way we can have a single write path for both
> file-systems and other interfaces above the block-layer, like io_uring on
> zoned block devices, without having to care too much about the underlying
> characteristics of the device itself.
>
> The emulation works by providing a cache of each zone's write pointer, so
> a zone append issued to the disk can be translated to a write with a
> starting LBA of the write pointer. The zone start LBA carried by the
> request is used to derive the zone number for the lookup in the zone
> write pointer offset cache, and the cached offset is then added to the
> LBA to get the actual position to write the data. In SCSI we then turn
> the REQ_OP_ZONE_APPEND request into a WRITE(16) command. Upon successful
> completion of the WRITE(16), the cache is updated to the new write
> pointer location and the written sector is noted in the request. On
> error the cache entry is marked as invalid, and on the next write an
> update of the write pointer is scheduled before issuing the actual write.
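
A minimal sketch of the emulation flow described above, written as a toy Python model rather than the actual sd_zbc code. The class name, the WP_INVALID marker, and the report_wp callback (standing in for a write pointer refresh via a zone report) are hypothetical names chosen for illustration.

```python
WP_INVALID = -1  # hypothetical marker: cache entry must be refreshed before use


class ZoneAppendEmulator:
    """Toy model of a per-zone write pointer offset cache."""

    def __init__(self, zone_sectors, nr_zones):
        self.zone_sectors = zone_sectors
        # Only the offset from the zone start is cached; zones start empty here.
        self.wp_ofst = [0] * nr_zones

    def prepare_append(self, zone_start_lba, report_wp):
        """Translate a zone append aimed at zone_start_lba into a write LBA."""
        zno = zone_start_lba // self.zone_sectors
        if self.wp_ofst[zno] == WP_INVALID:
            # A previous error invalidated the entry: refresh it first.
            self.wp_ofst[zno] = report_wp(zno) - zone_start_lba
        return zone_start_lba + self.wp_ofst[zno]

    def complete(self, zone_start_lba, nr_sectors, ok):
        """Update the cache on completion; return the written sector or None."""
        zno = zone_start_lba // self.zone_sectors
        if not ok:
            self.wp_ofst[zno] = WP_INVALID
            return None
        written = zone_start_lba + self.wp_ofst[zno]
        self.wp_ofst[zno] += nr_sectors  # advance the cached write pointer
        return written
```

In this model a successful 8-sector append to an empty zone returns the zone start as the written sector, and the next append to the same zone is placed 8 sectors later.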
>
> In order to reduce memory consumption, the only cached item is the offset
> of the write pointer from the start of the zone; everything else can be
> calculated. On an example drive with 52156 zones, the additional memory
> consumption of the cache is thus 52156 * 4 = 208624 bytes, or 51 4 KiB
> pages. The performance impact is negligible for a spinning drive.
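
The arithmetic above can be checked with a short calculation. This is only a sketch: the 4-byte entry size is taken from the "* 4" in the text, and the page size is assumed to be 4096 bytes.

```python
import math

nr_zones = 52156    # example drive from the cover letter
entry_bytes = 4     # one cached write pointer offset per zone
page_bytes = 4096   # assumed page size ("4k")

cache_bytes = nr_zones * entry_bytes
cache_pages = math.ceil(cache_bytes / page_bytes)  # round up to whole pages

print(cache_bytes, cache_pages)
```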
>
> For null_blk the emulation is way simpler, as null_blk's zoned block
> device emulation support already caches the write pointer position, so we
> only need to report the position back to the upper layers. Additional
> caching is not needed here.
>
> Furthermore we have converted zonefs to use ZONE_APPEND for synchronous
> direct I/Os. Asynchronous I/O still uses the normal path via iomap.
>
> Performance testing with zonefs sync writes on a 14 TB SMR drive and null_blk
> shows good results. On the SMR drive we're not regressing (the performance
> improvement is within noise); on null_blk we could drastically improve specific
> workloads:
>
> * null_blk:
>
> Single Thread Multiple Zones
>                                  kIOPS   MiB/s   MB/s   % delta
> mq-deadline REQ_OP_WRITE          10.1     631    662
> mq-deadline REQ_OP_ZONE_APPEND    13.2     828    868    +31.12
> none        REQ_OP_ZONE_APPEND    15.6     978   1026    +54.98
>
> Multiple Threads Multiple Zones
>                                  kIOPS   MiB/s   MB/s   % delta
> mq-deadline REQ_OP_WRITE          10.2     640    671
> mq-deadline REQ_OP_ZONE_APPEND    10.4     650    681     +1.49
> none        REQ_OP_ZONE_APPEND    16.9    1058   1109    +65.28
>
> * 14 TB SMR drive
>
> Single Thread Multiple Zones
>                                   IOPS   MiB/s   MB/s   % delta
> mq-deadline REQ_OP_WRITE           797    49.9   52.3
> mq-deadline REQ_OP_ZONE_APPEND     806    50.4   52.9     +1.15
>
> Multiple Threads Multiple Zones
>                                   IOPS   MiB/s   MB/s   % delta
> mq-deadline REQ_OP_WRITE           745    46.6   48.9
> mq-deadline REQ_OP_ZONE_APPEND     768      48   50.3     +2.86
>
> The %-delta is against the baseline of REQ_OP_WRITE using mq-deadline as I/O
> scheduler.
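
As a quick sanity check of the tables, the deltas can be recomputed from the MB/s column. The cover letter does not say which column the percentages were derived from; using MB/s is an assumption here, made because it reproduces the posted figures. The helper name pct_delta is illustrative.

```python
def pct_delta(baseline_mbps, value_mbps):
    """Percentage change of value over baseline, rounded to two decimals."""
    return round((value_mbps / baseline_mbps - 1) * 100, 2)


# null_blk, single thread: baseline is mq-deadline with REQ_OP_WRITE
print(pct_delta(662, 868))    # mq-deadline REQ_OP_ZONE_APPEND
print(pct_delta(662, 1026))   # none REQ_OP_ZONE_APPEND

# 14 TB SMR drive, multiple threads
print(pct_delta(48.9, 50.3))  # mq-deadline REQ_OP_ZONE_APPEND
```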
>
> The series is based on Jens' for-5.8/block branch with HEAD:
> ae979182ebb3 ("bdi: fix up for "remove the name field in struct backing_dev_info"")
Applied for 5.8, thanks.
--
Jens Axboe
Thread overview: 18+ messages
2020-05-12 8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 02/10] block: rename __bio_add_pc_page to bio_add_hw_page Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 03/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 04/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 05/10] block: Modify revalidate zones Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 06/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
2020-05-12 8:55 ` [PATCH v11 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
2020-05-12 13:17 ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
2020-05-12 16:01 ` Martin K. Petersen
2020-05-12 16:04 ` Christoph Hellwig
2020-05-12 16:12 ` Martin K. Petersen
2020-05-12 16:18 ` Johannes Thumshirn
2020-05-12 16:24 ` Martin K. Petersen
2020-05-13 2:37 ` Jens Axboe [this message]