From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Ric Wheeler <ricwheeler@gmail.com>, Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Linux FS Devel <linux-fsdevel@vger.kernel.org>,
lczerner@redhat.com
Subject: Re: Testing devices for discard support properly
Date: Wed, 08 May 2019 12:16:24 -0400
Message-ID: <yq1ef58ly5j.fsf@oracle.com>
In-Reply-To: <20190507220449.GP1454@dread.disaster.area> (Dave Chinner's message of "Wed, 8 May 2019 08:04:50 +1000")
Hi Dave,
> My big question here is this:
>
> - is "discard" even relevant for future devices?
It's hard to make predictions. Especially about the future. But discard
is definitely relevant on a bunch of current drives across the entire
spectrum from junk to enterprise, depending on workload,
over-provisioning, media type, etc.
Plus, as Ric pointed out, thin provisioning is also relevant. Different
use case but exactly the same plumbing.
> IMO, trying to "optimise discard" is completely the wrong direction
> to take. We should be getting rid of "discard" and its interfaces
> and operations - deprecate the ioctls, fix all other kernel callers
> of blkdev_issue_discard() to call blkdev_fallocate()
blkdev_fallocate() is implemented using blkdev_issue_discard().
> and ensure that drive vendors understand that they need to make
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE work, and that
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE is deprecated (like
> discard) and will be going away.
Fast, cheap, easy. Pick any two.
The issue is that -- from the device perspective -- guaranteeing zeroes
requires substantially more effort than deallocating blocks. To the
point where several vendors have given up making it work altogether and
either report no discard support or silently ignore discard requests
causing you to waste queue slots for no good reason.
So while instant zeroing of a 100TB drive would be nice, I don't think
it's a realistic goal given the architectural limitations of many of
these devices. Conceptually, you'd think it would be as easy as
unlinking an inode. But in practice the devices keep much more (and
different) state around in their FTLs than a filesystem does in its
metadata.
Wrt. device command processing performance:
1. Our expectation is that REQ_DISCARD (FL_PUNCH_HOLE |
FL_NO_HIDE_STALE), which gets translated into ATA DSM TRIM, NVMe
DEALLOCATE, SCSI UNMAP, executes in O(1) regardless of the number of
blocks operated on.
Due to the ambiguity of ATA DSM TRIM and early SCSI, we ended up in
a situation where the industry applied additional semantics
(deterministic zeroing) to that particular operation. And that has
caused grief because devices often end up in the O(n-or-worse)
bucket when determinism is a requirement.
2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE), which
gets translated into NVMe WRITE ZEROES, SCSI WRITE SAME, is that the
command executes in O(n) but that it is faster -- or at least not
worse -- than doing a regular WRITE to the same block range.
3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
ZEROES w/ DEAC, SCSI WRITE SAME w/ UNMAP, is that the command will
execute in O(1) for any portion of the block range described by the
I/O that is aligned to and a multiple of the internal device
granularity. With an additional small O(n_head_LBs) + O(n_tail_LBs)
overhead for zeroing any LBs at the beginning and end of the block
range described by the I/O that do not comprise a full block wrt. the
internal device granularity.
Does that description make sense?
The problem is that most vendors implement (3) using (1), but can't
make it work well because (3) was -- and still is for ATA -- outside
the scope of what the protocols can express.
And I agree with you that if (3) was implemented correctly in all
devices, we wouldn't need (1) at all. At least not for devices with an
internal granularity << total capacity.
--
Martin K. Petersen Oracle Linux Engineering