* [RFC] block: Make MMC respect REQ_NOUNMAP?
@ 2025-05-07 16:24 Ahmad Fatoum
2025-05-08 5:59 ` Christoph Hellwig
0 siblings, 1 reply; 2+ messages in thread
From: Ahmad Fatoum @ 2025-05-07 16:24 UTC (permalink / raw)
To: Darrick J. Wong, Christoph Hellwig, Jens Axboe, Ulf Hansson
Cc: Jan Lübbe, kernel@pengutronix.de, linux-block, linux-mmc,
linux-kernel@vger.kernel.org
Hi,
We would like to tell storage (mostly eMMC, but not exclusively though) to
discard a block range, so it reads back zeroes and is preferably unmapped
to give the storage device the most flexibility. For eMMCs, this is possible
right now with blkdiscard -z (BLKZEROOUT), but digging through the code with Jan,
I am starting to question whether eMMC is correctly implementing REQ_OP_WRITE_ZEROES
(granted, the expected semantics seem to be spelt out nowhere).
Read along for what MMC does, what semantics BLKZEROOUT seems to should have,
and what I think might need to be done to address this.
eMMC supports a number of different commands for erasing/discarding data[1].
Relevant to my question are two commands:
DISCARD: Host no longer needs the data and doesn't care about value
read from it. Card may remove or unmap portions or all of it.
TRIM: implies DISCARD and additionally guarantees to
return all-zero or all-ones on read.
These are made available to the block layer as follows:
REQ_OP_DISCARD -> DISCARD
REQ_OP_WRITE_ZEROES -> TRIM (if erased_byte == 0, otherwise
if all-ones, blk_queue_max_write_zeroes_sectors() is not called)
Looking at it from the ioctl side:
blkdev_common_ioctl(..., BLKZEROOUT, ...)
blk_ioctl_zeroout
blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
__blkdev_issue_write_zeroes(..., BLKDEV_ZERO_NOUNMAP)
submit_bio_wait(bio->bi_opf = REQ_NOUNMAP)
__REQ_NOUNMAP has comment saying 'do not free blocks when zeroing',
but as shown above, TRIM allows the card to unmap the indicated region.
REQ_NOUNMAP has no other documentation, but virtio inverts it and translates
it to VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP, which is documented as follows[2]:
"For write zeroes commands, if the unmap is set, the device MAY deallocate
the specified range of sectors in the device backend storage, as if the
discard command had been sent.".
I.e. REQ_NOUNMAP -> device MAY NOT deallocate
This is at odds with the MMC implementation, which ignores REQ_NOUNMAP
completely it seems. I don't believe there is a MMC command for write zeroes
without discard short of actually writing zeroes, so it sounds like the
correct implementation for MMC would be:
if (req->cmd_flags & REQ_NOUNMAP)
// or w/e causes fallback to __blkdev_issue_zero_pages
return -EOPNOTSUPP;
// do TRIM as before
Of course, this will change user visible behavior: blkdiscard -z will start
taking much longer for most users. These users will have to migrate to
using fallocate instead:
blkdev_fallocate(mode = FALLOC_FL_ZERO_RANGE):
blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
blkdev_fallocate(mode = FALLOC_FL_PUNCH_HOLE)
blkdev_issue_zeroout(..., BLKDEV_ZERO_NOFALLBACK)
So, it's not a drop-in replacement. I guess user code can punch hole
with fallback to BLKZEROOUT if it fails in order to get back the old
behavior.
I must admit I don't even know why one would write zeroes and care about
them remaining mapped on the storage device, but that seems to be
what's expected with BLKZEROOUT.
Thoughts? What did I miss?
Thanks,
Ahmad
[1]: For a short and incomplete summary, see:
https://github.com/barebox/barebox/commit/91a11c7d50df91
[1]: https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html
5.2.6.2 Device Requirements: Device Operation
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [RFC] block: Make MMC respect REQ_NOUNMAP?
2025-05-07 16:24 [RFC] block: Make MMC respect REQ_NOUNMAP? Ahmad Fatoum
@ 2025-05-08 5:59 ` Christoph Hellwig
0 siblings, 0 replies; 2+ messages in thread
From: Christoph Hellwig @ 2025-05-08 5:59 UTC (permalink / raw)
To: Ahmad Fatoum
Cc: Darrick J. Wong, Christoph Hellwig, Jens Axboe, Ulf Hansson,
Jan Lübbe, kernel@pengutronix.de, linux-block, linux-mmc,
linux-kernel@vger.kernel.org
On Wed, May 07, 2025 at 06:24:34PM +0200, Ahmad Fatoum wrote:
> blk_ioctl_zeroout
> blkdev_issue_zeroout(..., BLKDEV_ZERO_NOUNMAP)
> __blkdev_issue_write_zeroes(..., BLKDEV_ZERO_NOUNMAP)
> submit_bio_wait(bio->bi_opf = REQ_NOUNMAP)
>
> __REQ_NOUNMAP has comment saying 'do not free blocks when zeroing',
> but as shown above, TRIM allows the card to unmap the indicated region.
__REQ_NOUNMAP is a hint. All the storage specs are a mess in this
area as they usually get the polarity wrong and/or just have very
vague semantics. Also actually writing zeroes and not unmapping is
totally pointless on flash storage as it just increases PE cycles
for no good reason.
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2025-05-08 5:59 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-07 16:24 [RFC] block: Make MMC respect REQ_NOUNMAP? Ahmad Fatoum
2025-05-08 5:59 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox