Re: [PATCH 3/3] block/mq-deadline: Disable I/O prioritization in certain cases

From: Jaegeuk Kim <jaegeuk@kernel.org>
To: Damien Le Moal <dlemoal@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>,
	Bart Van Assche <bvanassche@acm.org>,
	Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org
Subject: Re: [PATCH 3/3] block/mq-deadline: Disable I/O prioritization in certain cases
Date: Thu, 14 Dec 2023 18:03:11 -0800
Message-ID: <ZXuz36STuYajyccm@google.com>
In-Reply-To: <168ed2f4-cf58-4ee9-bfbb-449f06f7348d@kernel.org>

On 12/15, Damien Le Moal wrote:
> On 12/15/23 02:22, Jaegeuk Kim wrote:
> > On 12/14, Christoph Hellwig wrote:
> >> On Wed, Dec 13, 2023 at 08:41:32AM -0800, Jaegeuk Kim wrote:
> >>> I don't have any
> >>> concern about keeping the same ioprio on writes, since handheld devices are
> >>> mostly sensitive to reads. So, if you have other use-cases using zoned writes
> >>> which require different ioprio on writes, I think you can suggest a knob that
> >>> lets users control it.
> >>
> >> Get out of your little handheld world.  In Linux we need a generally usable
> >> I/O stack, and any feature exposed by the kernel will be used quite
> >> differently than you imagine.
> >>
> >> Just like people will add reordering to the I/O stack that's not there
> >> right now (in addition to the ones your testing doesn't hit).  That
> >> doesn't mean we should avoid them - you generally get better performance
> >> by not reordering without a good reason (like throttling), but especially
> >> in error handling paths or resource constrained environments they will
> >> happen all over.  We've had this whole discussion with the I/O barriers
> >> that did not work for exactly the same reasons.
> >>
> >>>
> >>>>
> >>>>> it is essential to place the data per file to get better bandwidth. And for
> >>>>> NAND-based storage, filesystem is the right place to deal with the more efficient
> >>>>> garbage collection based on the known data locations.
> >>>>
> >>>> And that is a perfectly fine match for zone append.
> >>>
> >>> How does that work if the device gives random LBAs back for adjacent data in
> >>> a file? And how do we turn those LBAs back into sequential ones?
> >>
> >> Why would your device pick random LBAs?  If you send a zone append to
> >> a zone it will be written at the write pointer, which is absolutely not
> >> random.  All I/O written in a single write is going to be sequential,
> >> so just like for all other devices doing large sequential writes is
> >> important.  Multiple writes can get reordered, but if you heavily hit
> >> the same zone you'd get the same effect in the file system allocator
> >> too.
> > 
> > How can you guarantee the device does not give any random LBAs? What'd
> > be the selling point of zone append to end users? Are you sure this can
> > give better write throughput forever? Have you considered how to
> > implement this on the device side, given FTL mapping overhead and garbage
> > collection leading to tail latencies?
> 
> Answers to all your questions, in order:
> 
> 1) Asking this is to me similar to asking how can one guarantee that the device
> gives back the data that was written... You are asking for guarantees that the
> device is not buggy. By definition, zone append will return the written start LBA
> within the zone that the zone append command specified. And that start LBA will
> always be equal to the zone write pointer value when the device started
> executing the zone append command.
> 
> 2) When there is an FS, the user cannot know if the FS is using zone append or
> not, so the user should not care at all. If by "user" you mean "the file
> system", then it is a design decision. We already pointed out that generally
> speaking, zone append will be easier to use because it does not have ordering
> constraints.
> 
> 3) Yes, because the writes are always sequential, which is the least expensive
> pattern for the device internals as that only triggers minimal internal activity
> on the FTL, GC, wear leveling etc, at least for a decently designed device.
> 
> 4) See above. If the device interface forces the device user to always write
> sequentially, as mandated by a zoned device, then FTL, GC and wear leveling
> activity is minimized. Whether device internal GC happens at all depends
> entirely on how the device maps zones to flash super blocks. If the mapping is
> 1:1, then GC will be nearly non-existent. If the mapping is not 1:1, then GC
> overhead may exist. The discussion should then be about the design choices of
> the device. The fact that the host chose zone append will not in any way make
> things worse for the device. Even with regular writes the host must write
> sequentially, same as what zone append achieves (potentially a lot more easily).
> 
> > My takeaway on the two approaches would be:
> >                   zone_append        zone_write
> >                   -----------        ----------
> > LBA               from FTL           from filesystem
> > FTL mapping       Page-map           Zone-map
> 
> Not sure what you mean here. zone append always returns an LBA from within the
> zone specified by the LBA in the command CDB. So mapping is still per zone. zone
> append is *NOT* a random write command. Zone append automatically implements
> sequential writing within a zone for the user. In the case of regular writes,
> the user must fully control sequentiality. In both cases the write pattern *is*
> sequential.

Okay, it seems there's a first disconnect here, which is why all the gaps
below remain unexplained. Do you think a device supporting zone_append keeps
LBAs in line with PBAs within a zone? E.g., LBA#n is guaranteed to map to PBA#n
in a zone. If the LBA order exactly matches the PBA order all the time, the
mapping granularity is zone. Otherwise, it should be page.
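
To make the host-visible semantics quoted above concrete, here is a minimal
sketch, assuming a toy `Zone` class whose names are all hypothetical (this is
not kernel or device code). It models only what the host observes: zone append
returns the write pointer value at the time the command executes, while a
regular zone write must target the write pointer itself. It says nothing about
how a device maps LBAs to PBAs internally, which is the question asked here.

```python
class Zone:
    """Toy model of one sequential-write-required zone (host view only)."""

    def __init__(self, start_lba, size):
        self.start_lba = start_lba
        self.size = size
        self.write_pointer = start_lba  # next LBA to be written in this zone

    def _advance(self, nr_blocks):
        # Both write types consume space at the write pointer and move it on.
        if self.write_pointer + nr_blocks > self.start_lba + self.size:
            raise ValueError("write crosses the end of the zone")
        first_lba = self.write_pointer
        self.write_pointer += nr_blocks
        return first_lba

    def zone_append(self, nr_blocks):
        # The host names only the zone; the returned start LBA is the
        # write pointer value when the command is executed.
        return self._advance(nr_blocks)

    def zone_write(self, lba, nr_blocks):
        # The host names the LBA and must keep it equal to the write pointer.
        if lba != self.write_pointer:
            raise ValueError("unaligned write: lba != write pointer")
        return self._advance(nr_blocks)


zone = Zone(start_lba=1000, size=64)
appended = [zone.zone_append(8) for _ in range(3)]
print(appended)  # [1000, 1008, 1016]
```

In this model both paths leave the zone filled sequentially; the only
difference is which side computes the start LBA of each write.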

> 
> > SRAM/DRAM needs   Large              Small
> 
> There are no differences in this area because the FTL is the same for both. No
> changes, nothing special for zone append.
> 
> > FTL GC            Required           Not required
> 
> Incorrect. See above. That depends on the device mapping of zones to flash
> superblocks. And GC requirements are the same for both because the write pattern
> is identical: it is sequential within each zone being written. The user still
> controls which zone it wants to write. Zone append is not a magic command that
> chooses a target zone automatically.
> 
> > Tail latencies    Exist              Not exist
> 
> Incorrect. They are the same and because of the lack of ordering requirement
> with zone append, if anything, zone append can give better latency.
> 
> > GC Efficiency     Worse              Better
> 
> Nope. See above. Same.
> 
> > Longevity         As-is              Longer
> > Discard cmd       Required           Not required
> 
> There is no discard with zoned devices. Only zone reset. So both are "not
> required" here.
> 
> > Block complexity  Small              Large
> > Failure cases     Less exist         Exist
> > Fsck              Don't know         F2FS-TOOLS support
> > Filesystem        BTRFS support(?)   F2FS support
> 
> Yes, btrfs data path uses zone append.
> 
> > 
> > Given this, I took zone_write, especially for mobile devices, since we can
> > recover the unaligned writes in the corner cases by fsck. And the biggest
> > benefit would be getting rid of FTL mapping overhead, which improves random
> > read IOPS significantly given the lack of SRAM in low-end storage. And a longer
> > lifetime from reduced garbage collection overhead matters more in the mobile world.
> 
> As mentioned, GC is not an issue, or rather, GC depends on how the device is
> designed, not on which type of write command the host uses. Writes are always
> sequential for both types !
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research
