From: Jaegeuk Kim <jaegeuk@kernel.org>
To: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
Bart Van Assche <bvanassche@acm.org>,
Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org
Subject: Re: [PATCH 3/3] block/mq-deadline: Disable I/O prioritization in certain cases
Date: Thu, 14 Dec 2023 18:03:11 -0800
Message-ID: <ZXuz36STuYajyccm@google.com>
In-Reply-To: <168ed2f4-cf58-4ee9-bfbb-449f06f7348d@kernel.org>
On 12/15, Damien Le Moal wrote:
> On 12/15/23 02:22, Jaegeuk Kim wrote:
> > On 12/14, Christoph Hellwig wrote:
> >> On Wed, Dec 13, 2023 at 08:41:32AM -0800, Jaegeuk Kim wrote:
> >>> I don't have any
> >>> concern to keep the same ioprio on writes, since handheld devices are mostly
> >>> sensitive to reads. So, if you have other use-cases using zoned writes which
> >>> require different ioprio on writes, I think you can suggest a knob to control
> >>> it by users.
> >>
> >> Get out of your little handheld world. In Linux we need a generally usable
> >> I/O stack, and any feature exposed by the kernel will be used quite
> >> differently than you imagine.
> >>
> >> Just like people will add reordering to the I/O stack that's not there
> >> right now (in addition to the ones your testing doesn't hit). That
> >> doesn't mean we should avoid them - you generally get better performance
> >> by not reordering without a good reason (like throttling), but especially
> >> in error handling paths or resource constrained environments they will
> >> happen all over. We've had this whole discussion with the I/O barriers
> >> that did not work for exactly the same reasons.
> >>
> >>>
> >>>>
> >>>>> it is essential to place the data per file to get better bandwidth. And for
> >>>>> NAND-based storage, filesystem is the right place to deal with the more efficient
> >>>>> garbage collection based on the known data locations.
> >>>>
> >>>> And that is a perfectly fine match for zone append.
> >>>
> >>> How does that work if the device gives random LBAs back for adjacent data
> >>> in a file? And how do we make the LBAs sequential again?
> >>
> >> Why would your device pick random LBAs? If you send a zone append to a
> >> zone, it will be written at the write pointer, which is absolutely not
> >> random. All I/O written in a single write is going to be sequential,
> >> so just like for all other devices doing large sequential writes is
> >> important. Multiple writes can get reordered, but if you heavily hit
> >> the same zone you'd get the same effect in the file system allocator
> >> too.
> >
> > How can you guarantee the device does not give any random LBAs? What'd
> > be the selling point of zone append to end users? Are you sure this can
> > give better write throughput forever? Have you considered how to
> > implement this on the device side, such as FTL mapping overhead and garbage
> > collection leading to tail latencies?
>
> Answers to all your questions, in order:
>
> 1) Asking this is to me similar to asking how can one guarantee that the device
> gives back the data that was written... You are asking for guarantees that the
> device is not buggy. By definition, zone append will return the written start LBA
> within the zone that the zone append command specified. And that start LBA will
> always be equal to the zone write pointer value when the device started
> executing the zone append command.
>
> 2) When there is an FS, the user cannot know if the FS is using zone append or
> not, so the user should not care at all. If by "user" you mean "the file
> system", then it is a design decision. We already pointed out that generally
> speaking, zone append will be easier to use because it does not have ordering
> constraints.
>
> 3) Yes, because the writes are always sequential, which is the least expensive
> pattern for the device internals as that only triggers minimal internal activity
> on the FTL, GC, wear leveling etc, at least for a decently designed device.
>
> 4) See above. If the device interface forces the device user to always write
> sequentially, as mandated by a zoned device, then FTL, GC and wear leveling are
> minimized. The actual device internal GC that may or may not happen depends
> completely on how the device maps zones to flash super blocks. If the mapping is
> 1:1, then GC will be nearly non-existent. If the mapping is not 1:1, then GC
> overhead may exist. The discussion should then be about the design choices of
> the device. The fact that the host chose zone append will not in any way make
> things worse for the device. Even with regular writes the host must write
> sequentially, same as what zone append achieves (potentially a lot more easily).
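
The semantics described in answers 1 and 4 above can be illustrated with a toy model. This is only a minimal sketch, not kernel or device code: the `Zone` class, its fields, and the `demo` helper are hypothetical names made up for illustration. It assumes a single zone with a write pointer and shows that zone append always lands at the write pointer and reports the start LBA back, while a regular write is only accepted when the host targets the write pointer itself.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Toy model of one sequential-write-required zone (illustrative only)."""
    start: int      # first LBA of the zone
    capacity: int   # number of writable blocks in the zone
    wp: int = 0     # write pointer, as an offset from `start`

    def append(self, nblocks: int) -> int:
        """Zone append: data lands at the write pointer; the start LBA is returned."""
        if self.wp + nblocks > self.capacity:
            raise ValueError("zone full")
        written_lba = self.start + self.wp  # write pointer at execution time
        self.wp += nblocks
        return written_lba

    def write(self, lba: int, nblocks: int) -> None:
        """Regular write: the host must target the write pointer exactly."""
        if lba != self.start + self.wp:
            raise ValueError("unaligned write")
        if self.wp + nblocks > self.capacity:
            raise ValueError("zone full")
        self.wp += nblocks


def demo():
    zone = Zone(start=1000, capacity=64)
    # Three appends; whatever order they execute in, each one is placed
    # at the current write pointer, so the media pattern stays sequential.
    lbas = [zone.append(n) for n in (8, 4, 16)]
    return lbas, zone.wp
```

In this model both paths advance the same write pointer in the same way. The only difference is who computes the target LBA: the device for `append`, the host for `write`.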
>
> > My takeaway on the two approaches would be:
> >                     zone_append       zone_write
> >                     -----------       ----------
> > LBA                 from FTL          from filesystem
> > FTL mapping         Page-map          Zone-map
>
> Not sure what you mean here. zone append always returns an LBA from within the
> zone specified by the LBA in the command CDB. So mapping is still per zone. zone
> append is *NOT* a random write command. Zone append automatically implements
> sequential writing within a zone for the user. In the case of regular writes,
> the user must fully control sequentiality. In both cases the write pattern *is*
> sequential.
Okay, it seems there's a first disconnect here, which fails to explain all the
gaps below. Do you think a device supporting zone_append keeps LBAs in line
with PBAs within a zone? E.g., LBA#n is guaranteed to map to PBA#n in a zone.
If the LBA order exactly matches the PBA order all the time, the mapping
granularity is zone. Otherwise, it should be page.
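
To put rough numbers on the zone versus page granularity distinction, here is a back-of-the-envelope sketch. All sizes are made up for illustration (256 GiB capacity, 64 MiB zones, 4 KiB pages, 4-byte table entries); none of them come from this thread or from a real device, and `mapping_table_bytes` is a hypothetical helper. It only shows the table-size arithmetic, not which granularity a zone_append device actually needs.

```python
def mapping_table_bytes(capacity_bytes: int, unit_bytes: int, entry_bytes: int = 4) -> int:
    """Size of a flat logical-to-physical table with one entry per mapping unit."""
    return (capacity_bytes // unit_bytes) * entry_bytes


KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

capacity = 256 * GIB                                  # assumed device capacity
zone_map = mapping_table_bytes(capacity, 64 * MIB)    # one entry per assumed 64 MiB zone
page_map = mapping_table_bytes(capacity, 4 * KIB)     # one entry per assumed 4 KiB page

print(zone_map // KIB, "KiB")   # prints: 16 KiB
print(page_map // MIB, "MiB")   # prints: 256 MiB
```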
>
> > SRAM/DRAM needs     Large             Small
>
> There are no differences in this area because the FTL is the same for both. No
> changes, nothing special for zone append.
>
> > FTL GC              Required          Not required
>
> Incorrect. See above. That depends on the device mapping of zones to flash
> superblocks. And GC requirements are the same for both because the write pattern
> is identical: it is sequential within each zone being written. The user still
> controls which zone it wants to write. Zone append is not a magic command that
> chooses a target zone automatically.
>
> > Tail latencies      Exist             Not exist
>
> Incorrect. They are the same and, because of the lack of ordering requirements
> with zone append, zone append can if anything give better latency.
>
> > GC efficiency       Worse             Better
>
> Nope. See above. Same.
>
> > Longevity           As-is             Longer
> > Discard cmd         Required          Not required
>
> There is no discard with zoned devices. Only zone reset. So both are "not
> required" here.
>
> > Block complexity    Small             Large
> > Failure cases       Less exist        Exist
> > Fsck                Don't know        F2FS-TOOLS support
> > Filesystem          BTRFS support(?)  F2FS support
>
> Yes, btrfs data path uses zone append.
>
> >
> > Given this, I took zone_write, especially for mobile devices, since we can
> > recover the unaligned writes in the corner cases by fsck. And the biggest benefit
> > would be getting rid of FTL mapping overhead, which improves random read IOPS
> > significantly due to the lack of SRAM in low-end storage. And a longer lifetime
> > by mitigating garbage collection overhead is more important in the mobile world.
>
> As mentioned, GC is not an issue, or rather, GC depends on how the device is
> designed, not on which type of write command the host uses. Writes are always
> sequential for both types!
>
>
> --
> Damien Le Moal
> Western Digital Research