public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC] block: Make MMC respect REQ_NOUNMAP?
@ 2025-05-07 16:24 Ahmad Fatoum
  2025-05-08  5:59 ` Christoph Hellwig
  0 siblings, 1 reply; 2+ messages in thread
From: Ahmad Fatoum @ 2025-05-07 16:24 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig, Jens Axboe, Ulf Hansson
  Cc: Jan Lübbe, kernel@pengutronix.de, linux-block, linux-mmc,
	linux-kernel@vger.kernel.org

Hi,

We would like to tell storage (mostly eMMC, but not exclusively though) to
discard a block range, so it reads back zeroes and is preferably unmapped
to give the storage device the most flexibility. For eMMCs, this is possible
right now with blkdiscard -z (BLKZEROOUT), but digging through the code with Jan,
I am starting to question whether eMMC is correctly implementing REQ_OP_WRITE_ZEROES
(granted, the expected semantics seem to be spelt out nowhere).

Read along for what MMC does, what semantics BLKZEROOUT seems to should have,
and what I think might need to be done to address this.


eMMC supports a number of different commands for erasing/discarding data[1].
Relevant to my question are two commands:

  DISCARD: Host no longer needs the data and doesn't care about value
           read from it. Card may remove or unmap portions or all of it.

  TRIM:    implies DISCARD and additionally guarantees to
           return all-zero or all-ones on read.

These are made available to the block layer as follows:

  REQ_OP_DISCARD      -> DISCARD
  REQ_OP_WRITE_ZEROES -> TRIM (if erased_byte == 0, otherwise
  if all-ones, blk_queue_max_write_zeroes_sectors() is not called)

Looking at it from the ioctl side:

blkdev_common_ioctl(..., BLKZEROOUT, ...)
    blk_ioctl_zeroout
        blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
            __blkdev_issue_write_zeroes(..., BLKDEV_ZERO_NOUNMAP)
	    submit_bio_wait(bio->bi_opf = REQ_NOUNMAP)

__REQ_NOUNMAP has comment saying 'do not free blocks when zeroing',
but as shown above, TRIM allows the card to unmap the indicated region.

REQ_NOUNMAP has no other documentation, but virtio inverts it and translates
it to VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, which is documented as follows[2]:

"For write zeroes commands, if the unmap is set, the device MAY deallocate
the specified range of sectors in the device backend storage, as if the
discard command had been sent.".

I.e. REQ_NOUNMAP -> device MAY NOT deallocate

This is at odds with the MMC implementation, which ignores REQ_NOUNMAP
completely it seems. I don't believe there is a MMC command for write zeroes
without discard short of actually writing zeroes, so it sounds like the
correct implementation for MMC would be:

	if (req->cmd_flags & REQ_NOUNMAP)
		// or w/e causes fallback to __blkdev_issue_zero_pages
		return -EOPNOTSUPP;
	// do TRIM as before

Of course, this will change user visible behavior: blkdiscard -z will start
taking much longer for most users. These users will have to migrate to
using fallocate instead:

blkdev_fallocate(mode = FALLOC_FL_ZERO_RANGE):
	blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)

blkdev_fallocate(mode = FALLOC_FL_PUNCH_HOLE)
	blkdev_issue_zeroout(..., BLKDEV_ZERO_NOFALLBACK)

So, it's not a drop-in replacement. I guess user code can punch hole
with fallback to BLKZEROOUT if it fails in order to get back the old
behavior.

I must admit I don't even know why one would write zeroes and care about
them remaining mapped on the storage device, but that seems to be
what's expected with BLKZEROOUT.


Thoughts? What did I miss?

Thanks,
Ahmad

[1]: For a short and incomplete summary, see:
     https://github.com/barebox/barebox/commit/91a11c7d50df91
[1]: https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html
     5.2.6.2 Device Requirements: Device Operation


-- 
Pengutronix e.K.                  |                             |
Steuerwalder Str. 21              | http://www.pengutronix.de/  |
31137 Hildesheim, Germany         | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686  | Fax:   +49-5121-206917-5555 |


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [RFC] block: Make MMC respect REQ_NOUNMAP?
  2025-05-07 16:24 [RFC] block: Make MMC respect REQ_NOUNMAP? Ahmad Fatoum
@ 2025-05-08  5:59 ` Christoph Hellwig
  0 siblings, 0 replies; 2+ messages in thread
From: Christoph Hellwig @ 2025-05-08  5:59 UTC (permalink / raw)
  To: Ahmad Fatoum
  Cc: Darrick J. Wong, Christoph Hellwig, Jens Axboe, Ulf Hansson,
	Jan Lübbe, kernel@pengutronix.de, linux-block, linux-mmc,
	linux-kernel@vger.kernel.org

On Wed, May 07, 2025 at 06:24:34PM +0200, Ahmad Fatoum wrote:
>     blk_ioctl_zeroout
>         blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
>             __blkdev_issue_write_zeroes(..., BLKDEV_ZERO_NOUNMAP)
> 	    submit_bio_wait(bio->bi_opf = REQ_NOUNMAP)
> 
> __REQ_NOUNMAP has comment saying 'do not free blocks when zeroing',
> but as shown above, TRIM allows the card to unmap the indicated region.

__REQ_NOUNMAP is a hint.  All the storage specs are a mess in this
area as they usually get the polarity wrong and/or just have very
vague semantics.  Also actually writing zeroes and not unmapping is
totally pointless on flash storage as it just increases PE cycles
for no good reason.


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2025-05-08  5:59 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-07 16:24 [RFC] block: Make MMC respect REQ_NOUNMAP? Ahmad Fatoum
2025-05-08  5:59 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox