public inbox for linux-nvme@lists.infradead.org
From: Christoph Hellwig <hch@infradead.org>
To: Damien Le Moal <dlemoal@kernel.org>
Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
	linux-nvme@lists.infradead.org,
	Christoph Hellwig <hch@infradead.org>,
	Keith Busch <kbusch@kernel.org>,
	Johannes Thumshirn <johannes.thumshirn@wdc.com>
Subject: Re: [PATCH] nvme: zns: limit max_zone_append by max_segments
Date: Mon, 31 Jul 2023 06:51:26 -0700
Message-ID: <ZMe8XvFDDdAom97/@infradead.org>
In-Reply-To: <e0f88870-73a8-05a3-024d-52cefcdef71d@kernel.org>

On Mon, Jul 31, 2023 at 09:03:46PM +0900, Damien Le Moal wrote:
> I feel like a lot of the special casing for zone append bio add page can be
> removed from the block layer. This issue was found with zonefs tests on real
> zns devices because of the huge (and incorrect) zone append limit that zns
> has, combined with the recent zonefs iomap write change, which overlooked
> the fact that bio add page is done by iomap before the bio op is set to zone
> append. That resulted in the large BIO. This problem, however, does not
> happen with scsi or null_blk, which kind of proves that the regular bio add
> page path is fine for zone append as long as the issuer has the correct
> zone append limit. Thoughts?

A zone append limit larger than max_sectors is odd, and maybe the
block layer should assert that it never happens.  I think the root
cause is that many NVMe devices have a very large hardware equivalent
to max_sectors (the MDTS field), but Linux still uses a much lower
limit due to memory allocation issues (the PRPs used by NVMe are very
inefficient in terms of memory usage for larger transfers).  So we cap
max_sectors to the software limit, but not max_zone_append_sectors.
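
As a rough sketch of the kind of capping discussed here (not the
actual patch under review; the helper name is invented, and which
limits to clamp against is an assumption), the driver could clamp the
value it advertises the same way max_sectors is clamped:

static void cap_zone_append_limit(struct request_queue *q,
				  unsigned int hw_append_sectors)
{
	/* the software cap already applied to max_sectors */
	unsigned int cap = queue_max_sectors(q);

	/* also honour the segment count, assuming one page per segment */
	cap = min_t(unsigned int, cap,
		    queue_max_segments(q) << (PAGE_SHIFT - SECTOR_SHIFT));

	blk_queue_max_zone_append_sectors(q, min(hw_append_sectors, cap));
}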

Zone Append needs some amount of special casing in the block layer
because the splitting of Zone Append bios must happen in the file
system, as the file system needs a completion context per hardware
operation.  I think the best way to do that is to first build up a
maximally sized bio and then use the same bio_split_rw function that
the block layer would use to split it to the hardware limits, just in
the issuer.  This is what I did in btrfs, and it seems like zonefs
actually needs to do the same, but I missed that during review of the
recent direct I/O changes.
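
For reference, a minimal sketch of that issuer-side pattern, loosely
modelled on what btrfs does in fs/btrfs/bio.c (the helper name is
invented and error handling is elided):

static void submit_as_zone_append(struct bio *bio,
				  struct block_device *bdev,
				  struct bio_set *bs)
{
	struct request_queue *q = bdev_get_queue(bdev);
	unsigned int max_bytes =
		queue_max_zone_append_sectors(q) << SECTOR_SHIFT;
	unsigned int nr_segs;
	struct bio *split;

	/* peel off front pieces that fit the hardware limits */
	while ((split = bio_split_rw(bio, &q->limits, &nr_segs,
				     bs, max_bytes)) != bio) {
		if (IS_ERR(split))
			return;	/* error handling elided */
		split->bi_opf &= ~REQ_OP_WRITE;
		split->bi_opf |= REQ_OP_ZONE_APPEND;
		submit_bio(split);	/* own completion context */
	}
	bio->bi_opf &= ~REQ_OP_WRITE;
	bio->bi_opf |= REQ_OP_ZONE_APPEND;
	submit_bio(bio);	/* the remainder */
}

The key point is that the bio is split while it is still a plain
write; each piece only becomes REQ_OP_ZONE_APPEND once it is final,
because zone append bios themselves must never be split.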



Thread overview: 13+ messages
2023-07-31 11:46 [PATCH] nvme: zns: limit max_zone_append by max_segments Shin'ichiro Kawasaki
2023-07-31 12:03 ` Damien Le Moal
2023-07-31 13:51   ` Christoph Hellwig [this message]
2023-07-31 14:01     ` Damien Le Moal
2023-07-31 14:06       ` Christoph Hellwig
2023-07-31 14:12         ` Damien Le Moal
2023-07-31 14:26           ` Christoph Hellwig
2023-07-31 14:33             ` Damien Le Moal
2023-07-31 13:46 ` Christoph Hellwig
2023-07-31 14:02   ` Damien Le Moal
2023-07-31 14:05     ` Christoph Hellwig
2023-07-31 14:07       ` Damien Le Moal
2023-07-31 14:08         ` Christoph Hellwig
