linux-fsdevel.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-block <linux-block@vger.kernel.org>,
	Damien Le Moal <Damien.LeMoal@wdc.com>,
	Keith Busch <kbusch@kernel.org>,
	"linux-scsi @ vger . kernel . org" <linux-scsi@vger.kernel.org>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	"linux-fsdevel @ vger . kernel . org"
	<linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices
Date: Tue, 12 May 2020 20:37:57 -0600
Message-ID: <abd4b3d4-6261-c3a6-9b4c-9bf009a9820d@kernel.dk>
In-Reply-To: <20200512085554.26366-1-johannes.thumshirn@wdc.com>

On 5/12/20 2:55 AM, Johannes Thumshirn wrote:
> The upcoming NVMe ZNS Specification will define a new type of write
> command for zoned block devices, zone append.
> 
> When writing to a zoned block device using zone append, the start
> sector of the write points at the start LBA of the zone to write to.
> Upon completion, the block device responds with the position at which
> the data has been placed in the zone. From a high-level perspective this
> resembles a file system's block allocator, where the user writes to a
> file and the file system takes care of the data placement on the device.
> 
> In order to fully exploit the new zone append command in file-systems and
> other interfaces above the block layer, we choose to emulate zone append
> in SCSI and null_blk. This way we can have a single write path for both
> file-systems and other interfaces above the block-layer, like io_uring on
> zoned block devices, without having to care too much about the underlying
> characteristics of the device itself.
> 
> The emulation works by providing a cache of each zone's write pointer, so
> a zone append issued to the disk can be translated to a write with a
> starting LBA of the write pointer. The zone start LBA carried by the
> request is used to derive the zone number for the lookup in the zone write
> pointer offset cache, and the cached offset is then added to that LBA to
> get the actual position to write the data. In SCSI we then turn the
> REQ_OP_ZONE_APPEND request into a WRITE(16) command. Upon successful
> completion of the WRITE(16), the cache is updated to the new write pointer
> location and the written sector is noted in the request. On error the
> cache entry is marked as invalid, and on the next write an update of the
> write pointer is scheduled before issuing the actual write.
> 
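The translation described in the paragraph above can be modelled in user-space C. This is only a sketch under stated assumptions, not the sd_zbc code from the series: `struct zoned_disk`, `WP_INVALID`, `zone_append_prepare()` and `zone_append_complete()` are hypothetical names, all zones are assumed to have the same size, and the rescheduled write pointer update after an error is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: the cached offset is stale and must be refreshed. */
#define WP_INVALID UINT32_MAX

/* Hypothetical model of a zoned disk with equally sized zones. */
struct zoned_disk {
	uint64_t zone_sectors;	/* zone size in sectors */
	uint32_t nr_zones;	/* number of zones on the disk */
	uint32_t *wp_ofst;	/* per-zone write pointer offset, in sectors */
};

/*
 * Translate a zone append into a regular write. zone_start is the start
 * LBA of the target zone, as carried by the append request. On success
 * *write_lba holds the LBA for the regular write. Returns false if the
 * LBA is not a zone start, the cache entry is invalid (the caller has to
 * refresh the write pointer first), or the data does not fit in the zone.
 */
static bool zone_append_prepare(const struct zoned_disk *d,
				uint64_t zone_start, uint32_t nr_sectors,
				uint64_t *write_lba)
{
	uint64_t zno = zone_start / d->zone_sectors;
	uint32_t ofst;

	if (zno >= d->nr_zones || zone_start % d->zone_sectors)
		return false;

	ofst = d->wp_ofst[zno];
	if (ofst == WP_INVALID)
		return false;
	if (nr_sectors > d->zone_sectors - ofst)
		return false;

	*write_lba = zone_start + ofst;
	return true;
}

/* Update the cache once the translated write has completed. */
static void zone_append_complete(struct zoned_disk *d, uint64_t zone_start,
				 uint32_t nr_sectors, bool success)
{
	uint64_t zno = zone_start / d->zone_sectors;

	assert(zno < d->nr_zones);
	if (success)
		d->wp_ofst[zno] += nr_sectors;	/* advance the write pointer */
	else
		d->wp_ofst[zno] = WP_INVALID;	/* force a refresh before the next write */
}
```

In this model, two 8-sector appends to the zone starting at LBA 1024 of an empty disk land at LBA 1024 and LBA 1032, and a failed completion makes the next prepare call fail until the offset is refreshed.
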
> In order to reduce memory consumption, the only cached item is the offset
> of the write pointer from the start of the zone; everything else can be
> calculated. On an example drive with 52156 zones, the additional memory
> consumption of the cache is thus 52156 * 4 = 208624 bytes, or 51 4 KiB
> pages. The performance impact is negligible for a spinning drive.
> 
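The memory estimate above can be checked with a few lines of C. The sketch assumes one 32-bit offset per zone, as the "* 4" in the calculation implies, and 4 KiB pages; `PAGE_BYTES`, `wp_cache_bytes()` and `wp_cache_pages()` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u	/* assumption: 4 KiB pages */

/* Bytes needed for one 32-bit write pointer offset per zone. */
static size_t wp_cache_bytes(uint32_t nr_zones)
{
	return (size_t)nr_zones * sizeof(uint32_t);
}

/* The same amount, rounded up to whole pages. */
static size_t wp_cache_pages(uint32_t nr_zones)
{
	return (wp_cache_bytes(nr_zones) + PAGE_BYTES - 1) / PAGE_BYTES;
}
```

For 52156 zones this gives 208624 bytes, which rounds up to 51 pages (208624 / 4096 is about 50.9).
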
> For null_blk the emulation is way simpler, as null_blk's zoned block
> device emulation support already caches the write pointer position, so we
> only need to report the position back to the upper layers. Additional
> caching is not needed here.
> 
> Furthermore, we have converted zonefs to use ZONE_APPEND for synchronous
> direct I/Os. Asynchronous I/O still uses the normal path via iomap.
> 
> Performance testing with zonefs sync writes on a 14 TB SMR drive and
> null_blk shows good results. On the SMR drive we're not regressing (the
> performance improvement is within noise); on null_blk we could drastically
> improve specific workloads:
> 
> * null_blk:
> 
> Single Thread Multiple Zones
> 				kIOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	10.1	631	662
> mq-deadline REQ_OP_ZONE_APPEND	13.2	828	868	+31.12
> none REQ_OP_ZONE_APPEND		15.6	978	1026	+54.98
> 
> 
> Multiple Threads Multiple Zones
> 				kIOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	10.2	640	671
> mq-deadline REQ_OP_ZONE_APPEND	10.4	650	681	+1.49
> none REQ_OP_ZONE_APPEND		16.9	1058	1109	+65.28
> 
> * 14 TB SMR drive
> 
> Single Thread Multiple Zones
> 				IOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	797	49.9	52.3
> mq-deadline REQ_OP_ZONE_APPEND	806	50.4	52.9	+1.15
> 
> Multiple Threads Multiple Zones
> 				IOPS	MiB/s	MB/s	% delta
> mq-deadline REQ_OP_WRITE	745	46.6	48.9
> mq-deadline REQ_OP_ZONE_APPEND	768	48	50.3	+2.86
> 
> The %-delta is against the baseline of REQ_OP_WRITE using mq-deadline as I/O
> scheduler.
> 
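The reported deltas can be reproduced from the MB/s column of the tables: for the single-threaded null_blk case, (868 - 662) / 662 is about +31.12 %. A minimal sketch of that calculation, with `pct_delta()` as a hypothetical helper name:

```c
/* Relative change of value against baseline, in percent. */
static double pct_delta(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

Calling `pct_delta(662.0, 868.0)` yields roughly 31.12, and `pct_delta(48.9, 50.3)` roughly 2.86, matching the first null_blk row and the last SMR row.
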
> The series is based on Jens' for-5.8/block branch with HEAD:
> ae979182ebb3 ("bdi: fix up for "remove the name field in struct backing_dev_info"")

Applied for 5.8, thanks.

-- 
Jens Axboe


Thread overview: 18+ messages
2020-05-12  8:55 [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 02/10] block: rename __bio_add_pc_page to bio_add_hw_page Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 03/10] block: Introduce REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 04/10] block: introduce blk_req_zone_write_trylock Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 05/10] block: Modify revalidate zones Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 06/10] scsi: sd_zbc: factor out sanity checks for zoned commands Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 07/10] scsi: sd_zbc: emulate ZONE_APPEND commands Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 08/10] null_blk: Support REQ_OP_ZONE_APPEND Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 09/10] block: export bio_release_pages and bio_iov_iter_get_pages Johannes Thumshirn
2020-05-12  8:55 ` [PATCH v11 10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO Johannes Thumshirn
2020-05-12 13:17 ` [PATCH v11 00/10] Introduce Zone Append for writing to zoned block devices Christoph Hellwig
2020-05-12 16:01   ` Martin K. Petersen
2020-05-12 16:04     ` Christoph Hellwig
2020-05-12 16:12       ` Martin K. Petersen
2020-05-12 16:18         ` Johannes Thumshirn
2020-05-12 16:24           ` Martin K. Petersen
2020-05-13  2:37 ` Jens Axboe [this message]
