linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jaegeuk Kim <jaegeuk@kernel.org>
To: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Bart Van Assche <bvanassche@acm.org>,
	Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org
Subject: Re: [PATCH 3/3] block/mq-deadline: Disable I/O prioritization in certain cases
Date: Thu, 14 Dec 2023 18:03:11 -0800	[thread overview]
Message-ID: <ZXuz36STuYajyccm@google.com> (raw)
In-Reply-To: <168ed2f4-cf58-4ee9-bfbb-449f06f7348d@kernel.org>

On 12/15, Damien Le Moal wrote:
> On 12/15/23 02:22, Jaegeuk Kim wrote:
> > On 12/14, Christoph Hellwig wrote:
> >> On Wed, Dec 13, 2023 at 08:41:32AM -0800, Jaegeuk Kim wrote:
> >>> I don't have any
> >>> concern to keep the same ioprio on writes, since handheld devices are mostly
> >>> sensitive to reads. So, if you have other use-cases using zoned writes which
> >>> require different ioprio on writes, I think you can suggest a knob to control
> >>> it by users.
> >>
> >> Get out of your little handheld world.  In Linux we need a generally usable
> >> I/O stack, and any feature exposed by the kernel and will be used quite
> >> differently than you imagine.
> >>
> >> Just like people will add reordering to the I/O stack that's not there
> >> right now (in addition to the ones your testing doesn't hit).  That
> >> doensn't mean we should avoid them - you genereally get better performance
> >> by not reordering without a good reason (like thotting), but especially
> >> in error handling paths or resource constrained environment they will
> >> hapen all over.  We've had this whole discussion with the I/O barriers
> >> that did not work for exactly the same reasons.
> >>
> >>>
> >>>>
> >>>>> it is essential to place the data per file to get better bandwidth. And for
> >>>>> NAND-based storage, filesystem is the right place to deal with the more efficient
> >>>>> garbage collecion based on the known data locations.
> >>>>
> >>>> And that works perfectly fine match for zone append.
> >>>
> >>> How that works, if the device gives random LBAs back to the adjacent data in
> >>> a file? And, how to make the LBAs into the sequential ones back?
> >>
> >> Why would your device pick random LBAs?  If you send a zone append to
> >> zone it will be written at the write pointer, which is absolutely not
> >> random.  All I/O written in a single write is going to be sequential,
> >> so just like for all other devices doing large sequential writes is
> >> important.  Multiple writes can get reordered, but if you havily hit
> >> the same zone you'd get the same effect in the file system allocator
> >> too.
> > 
> > How can you guarantee the device does not give any random LBAs? What'd> be the selling point of zone append to end users? Are you sure this can
> > give the better write trhought forever? Have you considered how to
> > implement this in device side such as FTL mapping overhead and garbage
> > collection leading to tail latencies?
> 
> Answers to all your questions, in order:
> 
> 1) Asking this is to me similar to asking how can one guarantee that the device
> gives back the data that was written... You are asking for guarantees that the
> device is not buggy. By definition, zone append will return the writen start LBA
> within the zone that the zone append command specified. And that start LBA will
> always be equal to the zone write pointer value when the device started
> executing the zone append command.
> 
> 2) When there is an FS, the user cannot know if the FS is using zone append or
> not, so the user should not care at all. If by "user" you mean "the file
> system", then it is a design decision. We already pointed out that generally
> speaking, zone append will be easier to use because it does not have ordering
> constraints.
> 
> 3) Yes, because the writes are always sequential, which is the least expensive
> pattern for the device internal as that only triggers minimal internal activity
> on the FTL, GC, weir leveling etc, at least for a decently designed device.
> 
> 4) See above. If the device interface forces the device user to always write
> sequentially, as mandated by a zoned device, then FTL, GC and weir leveling is
> minimized. The actual device internal GC that may or may not happen completely
> depend on how the device maps zones to flash super blocks. If the mapping is
> 1:1, then GC will be nearly non-existent. If the mapping is not 1:1, then GC
> overhead may exist. The discussion should then be about the design choices of
> the device. The fact that the host chose zone append will not in anyway make
> things worse for the device. Even with regular writes the host must write
> sequentially, same as what zone append achieves (potentially a lot more easily).
> 
> > My takeaway on the two approaches would be:
> >                   zone_append        zone_write
> > 		  -----------        ----------
> > LBA               from FTL           from filesystem
> > FTL mapping       Page-map           Zone-map
> 
> Not sure what you mean here. zone append always returns an LBA from within the
> zone specified by the LBA in the command CDB. So mapping is still per zone. zone
> append is *NOT* a random write command. Zone append automatically implements
> sequential writing within a zone for the user. In the case of regular writes,
> the user must fully control sentimentality. In both cases the write pattern *is*
> sequential.

Okay, it seems there's first disconnect here, which fails to explain all the
below gaps. Do you think the device supporting zone_append keeps LBAs inline
with PBAs within a zone? E.g., LBA#n guarantees to map to PBA#n in a zone.
If LBA order is exactly matching to the PBA order all the time, the mapping
granularity is zone. Otherwise, it should be page.

> 
> > SRAM/DRAM needs   Large              Small
> 
> There are no differences in this area because the FTL is the same for both. No
> changes, nothing special for zone append.
> 
> > FTL GC            Required           Not required
> 
> Incorrect. See above. That depends on the device mapping of zones to flash
> superblocks. And GC requirements are the same for both because the write pattern
> is identical: it is sequential within each zone being written. The user still
> controls which zone it wants to write. Zone append is not a magic command that
> chooses a target zone automatically.
> 
> > Tail latencies    Exist              Not exisit
> 
> Incorrect. They are the same and because of the lack of ordering requirement
> with zone append, if anything, zone append can give better latency.
> 
> > GC Efficience     Worse              Better
> 
> Nope. See above. Same.
> 
> > Longevity         As-is              Longer
> > Discard cmd       Required           Not required
> 
> There is no discard with zone devices. Only zone reset. So both are "not
> required" here.
> 
> > Block complexity  Small              Large
> > Failure cases     Less exist         Exist
> > Fsck              Don't know         F2FS-TOOLS support
> > Filesystem        BTRFS support(?)   F2FS support
> 
> Yes, btrfs data path uses zone append.
> 
> > 
> > Given this, I took zone_write, especially for mobile devices, since we can
> > recover the unaligned writes in the corner cases by fsck. And, most benefit
> > would be getting rid of FTL mapping overhead which improves random read IOPs
> > significantly due to the lack of SRAM in low-end storages. And, longer lifetime
> > by mitigating garbage collection overhead is more important in mobile world.
> 
> As mentioned, GC is not an issue, or rather, GC depends on how the device is
> designed, not on which type of write command the host uses. Writes are always
> sequential for both types !
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research

  reply	other threads:[~2023-12-15  2:03 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-05  5:32 [PATCH 0/3] Improve mq-deadline I/O priority support Bart Van Assche
2023-12-05  5:32 ` [PATCH 1/3] block/mq-deadline: Use dd_rq_ioclass() instead of open-coding it Bart Van Assche
2023-12-06  2:35   ` Damien Le Moal
2023-12-11 16:54   ` Christoph Hellwig
2023-12-05  5:32 ` [PATCH 2/3] block/mq-deadline: Introduce dd_bio_ioclass() Bart Van Assche
2023-12-06  2:35   ` Damien Le Moal
2023-12-11 16:55   ` Christoph Hellwig
2023-12-18 17:35     ` Bart Van Assche
2023-12-05  5:32 ` [PATCH 3/3] block/mq-deadline: Disable I/O prioritization in certain cases Bart Van Assche
2023-12-06  2:42   ` Damien Le Moal
2023-12-06  3:24     ` Bart Van Assche
2023-12-08  0:03     ` Bart Van Assche
2023-12-08  3:37       ` Damien Le Moal
2023-12-08 18:40         ` Bart Van Assche
2023-12-11  7:40           ` Damien Le Moal
2023-12-12 22:44             ` Bart Van Assche
2023-12-12 23:52               ` Damien Le Moal
2023-12-13  1:02                 ` Bart Van Assche
2023-12-13  5:29                   ` Damien Le Moal
2023-12-11 16:57   ` Christoph Hellwig
2023-12-11 17:20     ` Bart Van Assche
2023-12-12 15:40       ` Christoph Hellwig
2023-12-11 22:40     ` Damien Le Moal
2023-12-12 15:41       ` Christoph Hellwig
2023-12-12 17:15         ` Bart Van Assche
2023-12-12 17:18           ` Christoph Hellwig
2023-12-12 17:42             ` Bart Van Assche
2023-12-12 17:48               ` Christoph Hellwig
2023-12-12 18:09                 ` Bart Van Assche
2023-12-12 18:13                   ` Christoph Hellwig
2023-12-12 18:19                     ` Bart Van Assche
2023-12-12 18:26                       ` Christoph Hellwig
2023-12-12 19:03                         ` Jaegeuk Kim
2023-12-12 23:44                           ` Damien Le Moal
2023-12-13 16:49                             ` Jaegeuk Kim
2023-12-13 22:55                               ` Damien Le Moal
2023-12-13 15:56                           ` Christoph Hellwig
2023-12-13 16:41                             ` Jaegeuk Kim
2023-12-14  8:57                               ` Christoph Hellwig
2023-12-14 17:22                                 ` Jaegeuk Kim
2023-12-15  1:12                                   ` Damien Le Moal
2023-12-15  2:03                                     ` Jaegeuk Kim [this message]
2023-12-15  2:20                                       ` Keith Busch
2023-12-15  4:49                                         ` Christoph Hellwig
2023-12-14 19:32                                 ` Bart Van Assche
2023-12-14  0:08                     ` Bart Van Assche
2023-12-14  0:37                       ` Damien Le Moal
2023-12-14  8:51                         ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZXuz36STuYajyccm@google.com \
    --to=jaegeuk@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=dlemoal@kernel.org \
    --cc=hch@lst.de \
    --cc=linux-block@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).