From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
linux-fsdevel@vger.kernel.org, Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH RFC] block: use discard if possible in blkdev_issue_discard()
Date: Mon, 17 Feb 2014 11:44:27 -0500
Message-ID: <yq161odofis.fsf@sermon.lab.mkp.net>
In-Reply-To: <20140215012901.GA28307@thunk.org> (Theodore Ts'o's message of "Fri, 14 Feb 2014 20:29:01 -0500")
>>>>> "Ted" == Theodore Ts'o <tytso@mit.edu> writes:
>> The issue being that TRIM is a hint and there are no hard
>> guarantees. Even if a device reports DRAT/RZAT.
Ted> So is this the same as how some devices will turn into bricks if
Ted> you send trim commands too quickly --- i.e., they are buggy, crappy
Ted> devices?
Well, it's actually per spec. Even if a device reports successful
completion on a DSM TRIM command it is not required to actually do
anything because TRIM is a hint.
The DRAT/RZAT flags indicate what to expect if a device decides to
honor the request (or parts of it). Some devices will return zeroes
only for blocks that are aligned to their internal allocation units,
whereas the misaligned heads/tails of the TRIM request may contain old
data, zeroes, or garbage.
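To make the head/tail problem concrete, here is a minimal sketch that splits a discard request into a misaligned head, an aligned body of whole allocation units, and a misaligned tail. It assumes the host knows the device's internal allocation unit size, which is generally not the case, and the struct and function names are hypothetical. Under that assumption only the body could be expected to read back as zeroes; the head and tail would have to be zeroed explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of splitting a byte range [start, start+len). */
struct discard_split {
    uint64_t head_start, head_len;  /* misaligned leading fragment */
    uint64_t body_start, body_len;  /* whole, aligned allocation units */
    uint64_t tail_start, tail_len;  /* misaligned trailing fragment */
};

/* Split a discard request on `unit`-byte boundaries (unit must be > 0).
 * `unit` stands in for the device's internal allocation unit, which is an
 * assumption: the host usually cannot see it. */
struct discard_split split_discard(uint64_t start, uint64_t len, uint64_t unit)
{
    struct discard_split s = {0};
    uint64_t end = start + len;
    /* Round the start up and the end down to allocation-unit boundaries. */
    uint64_t body_start = (start + unit - 1) / unit * unit;
    uint64_t body_end = end / unit * unit;

    if (body_start >= body_end) {
        /* No whole unit fits inside the range: treat it all as head. */
        s.head_start = start;
        s.head_len = len;
        s.body_start = s.tail_start = end;
        return s;
    }
    s.head_start = start;
    s.head_len = body_start - start;
    s.body_start = body_start;
    s.body_len = body_end - body_start;
    s.tail_start = body_end;
    s.tail_len = end - body_end;
    return s;
}
```

For example, with a 4096-byte unit a request covering bytes 1000 to 11000 leaves a 3096-byte head and a 2808-byte tail that a device like the ones described above might not zero.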
Early SSDs would drop TRIMs under load. I think we've now moved to a
world where TRIMs are mostly dropped when the FTL (flash translation
layer) is in error recovery. But we have no insight into internal FTL
state.
Some RAID controller vendors explicitly whitelist drive models that do
the right thing in their firmware to overcome this. Others rely on WRITE
SAME to ensure that you don't get parity mismatches for RAID5/6.
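The parity-mismatch problem can be shown with a few lines of XOR arithmetic. This is a minimal sketch with hypothetical helper names, not any controller's actual firmware logic: parity is computed on the assumption that a trimmed member block reads back as zeroes, and if the drive later returns stale data for that block the stripe no longer checks out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute RAID 5 style XOR parity across `ndisks` data blocks of `blksz` bytes. */
void xor_parity(const uint8_t *const data[], size_t ndisks, size_t blksz,
                uint8_t *parity)
{
    for (size_t i = 0; i < blksz; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < ndisks; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/* Return 1 if the stored parity still matches what the data blocks read back
 * as, 0 on a mismatch (e.g. a "trimmed" block that did not really zero). */
int parity_consistent(const uint8_t *const data[], size_t ndisks, size_t blksz,
                      const uint8_t *parity)
{
    for (size_t i = 0; i < blksz; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < ndisks; d++)
            p ^= data[d][i];
        if (p != parity[i])
            return 0;
    }
    return 1;
}
```

Writing known zeroes (as WRITE SAME does) keeps the on-disk contents and the parity in agreement by construction, which is why it sidesteps the problem.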
Ted> Basically, who was practicing engineering malpractice? The SSD
Ted> vendors, or the T10/T13 spec authors?
I think it's important to emphasize that T10/T13 specs are mainly
written by device vendors. And they have a very strong objection to
complicating the device firmware, keeping internal state, etc. So the
outcome is very rarely in the operating system's favor. I completely
agree that these flags are broken by definition.
The only discard approach that provides a guaranteed result is WRITE
SAME with the UNMAP bit set (i.e. SCSI only). You can also use a discard
followed by a read of the block range to verify that you actually get
zeroes, and then manually patch up any pieces that didn't stick.
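The discard-then-verify approach might look roughly like this from Linux userspace. `BLKDISCARD` is the existing block-device ioctl, which takes a `{start, length}` pair in bytes; the helper names, the chunk size, and patching with plain `pwrite()` calls are illustrative assumptions. A real tool would also have to keep other writers away from the range while it runs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* pread/pwrite/fileno declarations under strict -std modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h> /* BLKDISCARD */

/* Ask the block device behind `fd` to discard [start, start+len) bytes.
 * Returns 0 on success, -1 with errno set otherwise. Success only means the
 * request completed; it does not prove the range now reads back as zeroes. */
int issue_discard(int fd, uint64_t start, uint64_t len)
{
    uint64_t range[2] = { start, len };
    return ioctl(fd, BLKDISCARD, range);
}

/* Read back [start, start+len) in `blk`-byte chunks and overwrite every chunk
 * that is not all zeroes with zeroes. Returns the number of chunks patched,
 * or -1 on bad arguments (len not a multiple of blk, blk > 4096) or I/O error. */
long verify_and_patch_zeroes(int fd, off_t start, off_t len, size_t blk)
{
    static const unsigned char zero[4096];
    unsigned char buf[4096];
    long patched = 0;

    if (blk == 0 || blk > sizeof(buf) || len % (off_t)blk != 0)
        return -1;
    for (off_t off = start; off < start + len; off += (off_t)blk) {
        if (pread(fd, buf, blk, off) != (ssize_t)blk)
            return -1;
        if (memcmp(buf, zero, blk) != 0) {
            /* This piece "didn't stick": zero it explicitly. */
            if (pwrite(fd, zero, blk, off) != (ssize_t)blk)
                return -1;
            patched++;
        }
    }
    return patched;
}
```

On a block device you would call `issue_discard()` and then `verify_and_patch_zeroes()` over the same range. On a regular file the ioctl fails, but the verify-and-patch step can still be exercised on its own.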
Ted> If this is a case that there is just a bunch of crap SSD's out
Ted> there, then maybe we should still do this, but just not enable it
Ted> by default, and force users to manually configure mount options or
Ted> fstrim if they think they have devices that are competently
Ted> implemented?
The good news is that most devices that report DRAT/RZAT are doing the
right thing due to server/RAID vendor pressure. But SSD vendors are
generally not willing to give such guarantees in the datasheets.
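For reference, a host sees these claims in the ATA IDENTIFY DEVICE data: word 169 bit 0 advertises TRIM, and word 69 bits 14 and 5 carry DRAT and RZAT. The sketch below classifies a device from those words, similar in spirit to the check in Linux libata (which additionally looks at the ATA major version, omitted here). The enum and function names are hypothetical, and, per the above, the result only reflects what the firmware reports, not a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* ATA IDENTIFY DEVICE word indices and bit masks. */
#define ID_ADDITIONAL_SUPP 69     /* word 69: additional supported features */
#define ID_DATA_SET_MGMT   169    /* word 169: DATA SET MANAGEMENT support */
#define ID_DRAT            (1u << 14) /* deterministic read after TRIM */
#define ID_RZAT            (1u << 5)  /* read zeroes after TRIM */
#define ID_TRIM            (1u << 0)  /* TRIM supported */

/* Hypothetical classification of what the device *claims* about TRIM. */
enum trim_kind { TRIM_NONE, TRIM_NONDETERMINISTIC, TRIM_DRAT, TRIM_DRAT_RZAT };

enum trim_kind classify_trim(const uint16_t id[256])
{
    uint16_t w69 = id[ID_ADDITIONAL_SUPP];

    if (!(id[ID_DATA_SET_MGMT] & ID_TRIM))
        return TRIM_NONE;
    /* Both bits must be set before zeroes-after-TRIM is claimed. */
    if ((w69 & (ID_DRAT | ID_RZAT)) == (ID_DRAT | ID_RZAT))
        return TRIM_DRAT_RZAT;
    if (w69 & ID_DRAT)
        return TRIM_DRAT;
    return TRIM_NONDETERMINISTIC;
}
```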
Many of these gray areas or slight enhancements to what's mandated by
the T10/T13 specs are negotiated as part of a typical drive procurement
process. The vendor will implement the additional features and
guarantees requested by Dell/HP/IBM/Oracle/etc. Sometimes the
enhancements will trickle into later versions of the generic SSD
firmware. Sometimes they won't.
It's really no different from hard drives. I'd choose a server vendor
branded version of a disk drive over the generic version any day. Both
because of binning and because of the additional data integrity and
error recovery features that are likely present in the firmware.
--
Martin K. Petersen Oracle Linux Engineering