* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
[not found] ` <20150105162830.GP15833@htj.dyndns.org>
@ 2015-01-07 0:05 ` Martin K. Petersen
2015-01-07 2:54 ` Tejun Heo
0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-07 0:05 UTC (permalink / raw)
To: Tejun Heo; +Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
>> The other use case is the filesystem one where it is common to zero
>> block ranges for bitmaps, etc. In many workloads there is a
>> significant win to trimming over writing out many blocks of zeroes.
Tejun> Isn't that kinda niche and specialized tho?
I don't think so. There are two reasons for zeroing block ranges:
1) To ensure they contain zeroes on subsequent reads
2) To preallocate them or anchor them down on thin provisioned devices
The filesystem folks have specifically asked to be able to make that
distinction. Hence the patch that changes blkdev_issue_zeroout().
You really don't want to write out gobs and gobs of zeroes and cause
unnecessary flash wear if all you care about is the blocks being in a
deterministic state.
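For illustration, here is a minimal sketch of the caller's side, assuming
the discard-aware blkdev_issue_zeroout() variant referred to above takes a
final flag saying a zeroing discard is acceptable (the helper name
example_zero_range() is made up for this example):

    #include <linux/blkdev.h>

    /*
     * Hedged sketch: zero a block range, letting the block layer use a
     * zeroing discard when the device guarantees deterministic zeroes
     * after TRIM, and fall back to writing zeroes otherwise.
     */
    static int example_zero_range(struct block_device *bdev, sector_t start,
                                  sector_t nr_sects)
    {
            /*
             * 'true': a discard is acceptable as long as the range reads
             * back as zeroes; otherwise the helper writes the zeroes out.
             */
            return blkdev_issue_zeroout(bdev, start, nr_sects, GFP_KERNEL,
                                        true);
    }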
--
Martin K. Petersen Oracle Linux Engineering
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-07 0:05 ` [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
@ 2015-01-07 2:54 ` Tejun Heo
2015-01-07 4:15 ` Dave Chinner
0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-01-07 2:54 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: James Bottomley, linux-ide, linux-fsdevel
Hello, Martin.
On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> Tejun> Isn't that kinda niche and specialized tho?
>
> I don't think so. There are two reasons for zeroing block ranges:
>
> 1) To ensure they contain zeroes on subsequent reads
>
> 2) To preallocate them or anchor them down on thin provisioned devices
>
> The filesystem folks have specifically asked to be able to make that
> distinction. Hence the patch that changes blkdev_issue_zeroout().
>
> You really don't want to write out gobs and gobs of zeroes and cause
> unnecessary flash wear if all you care about is the blocks being in a
> deterministic state.
I think I'm still missing something. Are there enough cases where
filesystems want to write out zeroes during operation? Earlier in the
thread, it was mentioned that this is currently mostly useful for
raids which need the blocks actually cleared for checksum consistency,
which basically means that raid metadata handling isn't (yet) capable
of just marking those (parts of) stripes as unused. If a filesystem
wants to read back zeros from data blocks, wouldn't it be just marking
the matching index as such? And if you take out the zeroing case,
trims are just trims and whether they return 0 afterwards or not is
irrelevant.
There sure can be use cases where zeroing fast and securely makes a
noticeable difference, but the cases put forth till this point seem
relatively weak. I mean, after all, requiring trim to zero the blocks
is essentially pushing down that amount of metadata management to the
device - the device would do exactly the same thing. Pushing it down
the layers can definitely be beneficial especially when there's no
agreed-upon metadata on the medium (so, mkfs time), but it seems kinda
superfluous during normal operation. What am I missing?
Thanks.
--
tejun
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-07 2:54 ` Tejun Heo
@ 2015-01-07 4:15 ` Dave Chinner
2015-01-07 15:26 ` Tejun Heo
2015-01-08 4:05 ` Phillip Susi
0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2015-01-07 4:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel
On Tue, Jan 06, 2015 at 09:54:55PM -0500, Tejun Heo wrote:
> Hello, Martin.
>
> On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> > Tejun> Isn't that kinda niche and specialized tho?
> >
> > I don't think so. There are two reasons for zeroing block ranges:
> >
> > 1) To ensure they contain zeroes on subsequent reads
> >
> > 2) To preallocate them or anchor them down on thin provisioned devices
> >
> > The filesystem folks have specifically asked to be able to make that
> > distinction. Hence the patch that changes blkdev_issue_zeroout().
> >
> > You really don't want to write out gobs and gobs of zeroes and cause
> > unnecessary flash wear if all you care about is the blocks being in a
> > deterministic state.
>
> I think I'm still missing something. Are there enough cases where
> filesystems want to write out zeroes during operation?
IMO, yes.
w.r.t. thinp devices, we need to be able to guarantee that
preallocated regions in the filesystem are actually backed by real
blocks in the thinp device so we don't get ENOSPC from the thinp
device. No filesystems do this yet because we don't have a mechanism
for telling the lower layers "preallocate these blocks to zero".
The biggest issue is that we currently have no easy way to say
"these blocks need to contain zeros, but we aren't actually using
them yet". i.e. the filesystem code assumes that they contain zeros
(e.g. in ext4 inode tables because mkfs used to zero them) if they
haven't been used, so when it reads them it detects that
initialisation is needed because the blocks are empty....
FWIW, some filesystems need these regions to actually contain
zeros because they can't track unwritten extents (e.g.
gfs2). Having sb_issue_zeroout() just do the right thing enables us
to efficiently zero the regions they are preallocating...
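As a rough sketch of that kind of preallocation path (not taken from any
existing filesystem; prealloc_map_blocks() is a hypothetical helper made up
for this example):

    /*
     * Hypothetical preallocation path for a filesystem that cannot mark
     * extents as unwritten, so preallocated blocks must actually read
     * back as zeroes.
     */
    static int example_preallocate(struct inode *inode, sector_t block,
                                   sector_t nr_blocks)
    {
            struct super_block *sb = inode->i_sb;
            int err;

            /* hypothetical: allocate on-disk blocks for the range */
            err = prealloc_map_blocks(inode, block, nr_blocks);
            if (err)
                    return err;

            /*
             * With a reliable zero-after-TRIM device this can become a
             * single discard; otherwise the block layer writes zeroes.
             */
            return sb_issue_zeroout(sb, block, nr_blocks, GFP_NOFS);
    }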
> Earlier in the
> thread, it was mentioned that this is currently mostly useful for
> raids which need the blocks actually cleared for checksum consistency,
> which basically means that raid metadata handling isn't (yet) capable
> of just marking those (parts of) stripes as unused. If a filesystem
> wants to read back zeros from data blocks, wouldn't it be just marking
> the matching index as such?
Not all filesystems can do this for user data (see gfs2 case above)
and no linux filesystem tracks whether free space contains zeros or
stale data. Hence if we want blocks to be zeroed on disk, we
currently have to write zeros to them and hence they get pinned in
devices as "used space" even though they may never get used again.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-07 4:15 ` Dave Chinner
@ 2015-01-07 15:26 ` Tejun Heo
2015-01-08 14:28 ` Martin K. Petersen
2015-01-08 14:29 ` Martin K. Petersen
2015-01-08 4:05 ` Phillip Susi
1 sibling, 2 replies; 16+ messages in thread
From: Tejun Heo @ 2015-01-07 15:26 UTC (permalink / raw)
To: Dave Chinner
Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel
Hello, Dave.
On Wed, Jan 07, 2015 at 03:15:37PM +1100, Dave Chinner wrote:
> w.r.t. thinp devices, we need to be able to guarantee that
> preallocated regions in the filesystem are actually backed by real
> blocks in the thinp device so we don't get ENOSPC from the thinp
> device. No filesystems do this yet because we don't have a mechanism
> for telling the lower layers "preallocate these blocks to zero".
>
> The biggest issue is that we currently have no easy way to say
> "these blocks need to contain zeros, but we aren't actually using
> them yet". i.e. the filesystem code assumes that they contain zeros
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....
>
> FWIW, some filesystems need these regions to actually contain
> zeros because they can't track unwritten extents (e.g.
> gfs2). Having sb_issue_zeroout() just do the right thing enables us
> to efficiently zero the regions they are preallocating...
>
> > Earlier in the
> > thread, it was mentioned that this is currently mostly useful for
> > raids which need the blocks actually cleared for checksum consistency,
> > which basically means that raid metadata handling isn't (yet) capable
> > of just marking those (parts of) stripes as unused. If a filesystem
> > wants to read back zeros from data blocks, wouldn't it be just marking
> > the matching index as such?
>
> Not all filesystems can do this for user data (see gfs2 case above)
> and no linux filesystem tracks whether free space contains zeros or
> stale data. Hence if we want blocks to be zeroed on disk, we
> currently have to write zeros to them and hence they get pinned in
> devices as "used space" even though they may never get used again.
Okay, I'll take it that this benefits generic enough use cases from
the filesystem POV. As long as that's the case, it prolly has a
reasonable chance of actually being widely and properly implemented by
hw vendors.
It's easy to complain about mainstream hardware but IMHO the market is
actually pretty efficient at shooting down extra cruft which doesn't
really matter and only exists to increase the number of feature
checkboxes. Hopefully, this has actual benefits and won't end up that
way.
Martin, do you have a newer version or shall I apply the original one?
Thanks.
--
tejun
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-07 4:15 ` Dave Chinner
2015-01-07 15:26 ` Tejun Heo
@ 2015-01-08 4:05 ` Phillip Susi
2015-01-08 4:58 ` Andreas Dilger
1 sibling, 1 reply; 16+ messages in thread
From: Phillip Susi @ 2015-01-08 4:05 UTC (permalink / raw)
To: Dave Chinner, Tejun Heo
Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel
On 01/06/2015 11:15 PM, Dave Chinner wrote:
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....
No, it knows that the inode table needs to be initialized because there is a
flag in the group descriptor that says this inode table is still
uninitialized. It never reads the blocks to see if they are full of
zeros. mke2fs sets the flag when it does not initialize the table
with zeros, either by direct writes ( which it doesn't do if
lazy_itable_init is true, which it defaults to these days ), or by
discarding the blocks when the device claims to support deterministic
discard that zeros.
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 4:05 ` Phillip Susi
@ 2015-01-08 4:58 ` Andreas Dilger
2015-01-08 14:09 ` Phillip Susi
0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2015-01-08 4:58 UTC (permalink / raw)
To: Phillip Susi
Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley,
linux-ide, linux-fsdevel
On Jan 7, 2015, at 9:05 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 01/06/2015 11:15 PM, Dave Chinner wrote:
>> (e.g. in ext4 inode tables because mkfs used to zero them) if they
>> haven't been used, so when it reads them it detects that
>> initialisation is needed because the blocks are empty....
>
> No, it knows that the inode table needs to be initialized because there is a
> flag in the group descriptor that says this inode table is still
> uninitialized. It never reads the blocks to see if they are full of
> zeros. mke2fs sets the flag when it does not initialize the table
> with zeros, either by direct writes ( which it doesn't do if
> lazy_itable_init is true, which it defaults to these days ), or by
> discarding the blocks when the device claims to support deterministic
> discard that zeros.
That is only partially correct. While it is true that mke2fs sets the
UNINIT flag at format time, the "lazy" part of that means there is a
kernel thread that still does the zeroing of the inode table blocks, but after
the filesystem is mounted, for any group that does not have the ZEROED
flag set. After that point, the "UNINIT" flag is an optimization to
avoid reading the bitmap and unused blocks from disk during allocation.
This is needed in case the group descriptor or inode bitmap is corrupted,
and e2fsck needs to scan the inode table for in-use inodes. We don't
want it to find old inodes from before the filesystem was formatted.
The ext4_init_inode_table() calls sb_issue_zeroout->blkdev_issue_zeroout(),
so if the underlying storage supported deterministic zeroing, this
could be handled very efficiently.
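A simplified sketch of that step (locking, journalling and the unused-inode
accounting are omitted; this is not the actual ext4 code, and it assumes
ext4's internal ext4.h plus <linux/blkdev.h>):

    static int example_init_itable(struct super_block *sb, ext4_group_t group,
                                   struct ext4_group_desc *gdp)
    {
            ext4_fsblk_t start = ext4_inode_table(sb, gdp);
            unsigned long nr = EXT4_SB(sb)->s_itb_per_group;
            int ret;

            /*
             * Becomes a zeroing discard if the device advertises
             * discard_zeroes_data; otherwise zeroes are written out.
             */
            ret = sb_issue_zeroout(sb, start, nr, GFP_NOFS);
            if (ret)
                    return ret;

            /* mark the group so the zeroing is not repeated */
            gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
            ext4_group_desc_csum_set(sb, group, gdp);
            return 0;
    }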
Cheers, Andreas
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 4:58 ` Andreas Dilger
@ 2015-01-08 14:09 ` Phillip Susi
2015-01-08 22:31 ` Andreas Dilger
0 siblings, 1 reply; 16+ messages in thread
From: Phillip Susi @ 2015-01-08 14:09 UTC (permalink / raw)
To: Andreas Dilger
Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley,
linux-ide, linux-fsdevel
On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>> No, it knows that the inode table needs to be initialized because there
>> is a flag in the group descriptor that says this inode table is
>> still uninitialized. It never reads the blocks to see if they are
>> full of zeros. mke2fs sets the flag when it does not initialize
>> the table with zeros, either by direct writes ( which it doesn't
>> do if lazy_itable_init is true, which it defaults to these days
>> ), or by discarding the blocks when the device claims to support
>> deterministic discard that zeros.
>
> That is only partially correct. While it is true that mke2fs sets
> the UNINIT flag at format time, the "lazy" part of that means there
> is a kernel thread that still does the zeroing of the inode table
> blocks, but after the filesystem is mounted, for any group that
> does not have the ZEROED flag set. After that point, the "UNINIT"
> flag is an optimization to avoid reading the bitmap and unused
> blocks from disk during allocation.
That is pretty much what I said, except that I was pointing out that
it does not *read* first to see if the disk is already zeroed, as that
would be a waste of time. It just writes out the zeros for block
groups that still have the uninit flag set.
> This is needed in case the group descriptor or inode bitmap is
> corrupted, and e2fsck needs to scan the inode table for in-use
> inodes. We don't want it to find old inodes from before the
> filesystem was formatted.
>
> The ext4_init_inode_table() calls
> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
> storage supported deterministic zeroing, this could be handled very
> efficiently.
Again, that's pretty much what I said.
* [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-07 15:26 ` Tejun Heo
@ 2015-01-08 14:28 ` Martin K. Petersen
2015-01-08 15:11 ` Tejun Heo
2015-01-08 15:58 ` Tim Small
2015-01-08 14:29 ` Martin K. Petersen
1 sibling, 2 replies; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-08 14:28 UTC (permalink / raw)
To: tj; +Cc: linux-ide, linux-fsdevel, Martin K. Petersen
As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.
In practice this renders the DRAT and RZAT flags completely useless and
because the results are unpredictable we decided to disable discard in
MD for 3.18 to avoid the risk of data corruption.
Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.
This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.
The whitelist is only meant as a starting point and is by no means
comprehensive:
- All intel SSD models except for 510
- Micron M5?0/M600
- Samsung SSDs
- Seagate SSDs
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
drivers/ata/libata-core.c | 18 ++++++++++++++----
drivers/ata/libata-scsi.c | 10 ++++++----
include/linux/libata.h | 1 +
3 files changed, 21 insertions(+), 8 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 5c84fb5c3372..7c03401dc5d8 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4233,10 +4233,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
{ "PIONEER DVD-RW DVR-216D", NULL, ATA_HORKAGE_NOSETXFER },
/* devices that don't properly handle queued TRIM commands */
- { "Micron_M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
- { "Crucial_CT???M500SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
- { "Micron_M550*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
- { "Crucial_CT*M550SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+ { "Micron_M[56]*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
+ ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "Crucial_CT*SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+
+ /*
+ * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+ * SSDs that provide reliable zero after TRIM.
+ */
+ { "INTEL*SSDSC2MH*", NULL, 0, }, /* Blacklist intel 510 */
+ { "INTEL*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "SSD*INTEL*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "Samsung*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "SAMSUNG*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "ST[1248][0248]0[FH]*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
/*
* Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 7659d6468303..280729325ebd 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2532,13 +2532,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
rbuf[15] = lowest_aligned;
if (ata_id_has_trim(args->id)) {
- rbuf[14] |= 0x80; /* TPE */
+ rbuf[14] |= 0x80; /* LBPME */
- if (ata_id_has_zero_after_trim(args->id))
- rbuf[14] |= 0x40; /* TPRZ */
+ if (ata_id_has_zero_after_trim(args->id) &&
+ dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+ ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+ rbuf[14] |= 0x40; /* LBPRZ */
+ }
}
}
-
return 0;
}
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2d182413b1db..f2b440e44fd7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -422,6 +422,7 @@ enum {
ATA_HORKAGE_NO_NCQ_TRIM = (1 << 19), /* don't use queued TRIM */
ATA_HORKAGE_NOLPM = (1 << 20), /* don't use LPM */
ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21), /* some WDs have broken LPM */
+ ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
/* DMA mask for user DMA control: User visible values; DO NOT
renumber */
--
1.9.3
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-07 15:26 ` Tejun Heo
2015-01-08 14:28 ` Martin K. Petersen
@ 2015-01-08 14:29 ` Martin K. Petersen
1 sibling, 0 replies; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-08 14:29 UTC (permalink / raw)
To: Tejun Heo
Cc: Dave Chinner, Martin K. Petersen, James Bottomley, linux-ide,
linux-fsdevel
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
Tejun> do you have a newer version or shall I apply the original one?
Just sent an updated patch that catches the latest Micron/Crucial drives
as well.
--
Martin K. Petersen Oracle Linux Engineering
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 14:28 ` Martin K. Petersen
@ 2015-01-08 15:11 ` Tejun Heo
2015-01-08 15:34 ` Martin K. Petersen
2015-01-08 15:58 ` Tim Small
1 sibling, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-01-08 15:11 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-ide, linux-fsdevel
Hello, Martin.
The patch looks okay to me. Just one nit.
On Thu, Jan 08, 2015 at 09:28:31AM -0500, Martin K. Petersen wrote:
> @@ -4233,10 +4233,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
> { "PIONEER DVD-RW DVR-216D", NULL, ATA_HORKAGE_NOSETXFER },
>
> /* devices that don't properly handle queued TRIM commands */
> - { "Micron_M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
> - { "Crucial_CT???M500SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
> - { "Micron_M550*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
> - { "Crucial_CT*M550SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
> + { "Micron_M[56]*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
> + ATA_HORKAGE_ZERO_AFTER_TRIM, },
> + { "Crucial_CT*SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
> +
> + /*
> + * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
> + * SSDs that provide reliable zero after TRIM.
> + */
Can you please be more detailed on the above explanation? It's not
like most people would be familiar with acronyms like DRAT/RZAT and
what the surrounding situation is like to require this sort of
whitelisting.
> + { "INTEL*SSDSC2MH*", NULL, 0, }, /* Blacklist intel 510 */
> + { "INTEL*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
I think the above two can use a bit more verbose explanation too.
> + { "SSD*INTEL*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
> + { "Samsung*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
> + { "SAMSUNG*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
> + { "ST[1248][0248]0[FH]*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
Thanks.
--
tejun
* [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 15:11 ` Tejun Heo
@ 2015-01-08 15:34 ` Martin K. Petersen
2015-01-08 15:36 ` Tejun Heo
0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-08 15:34 UTC (permalink / raw)
To: tj; +Cc: linux-ide, linux-fsdevel, Martin K. Petersen
As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.
In practice this renders the DRAT and RZAT flags completely useless and
because the results are unpredictable we decided to disable discard in
MD for 3.18 to avoid the risk of data corruption.
Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.
This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.
The whitelist is only meant as a starting point and is by no means
comprehensive:
- All intel SSD models except for 510
- Micron M5?0/M600
- Samsung SSDs
- Seagate SSDs
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
drivers/ata/libata-core.c | 31 +++++++++++++++++++++++++++----
drivers/ata/libata-scsi.c | 10 ++++++----
include/linux/libata.h | 1 +
3 files changed, 34 insertions(+), 8 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 5c84fb5c3372..23c2ae03a7ab 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4233,10 +4233,33 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
{ "PIONEER DVD-RW DVR-216D", NULL, ATA_HORKAGE_NOSETXFER },
/* devices that don't properly handle queued TRIM commands */
- { "Micron_M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
- { "Crucial_CT???M500SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
- { "Micron_M550*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
- { "Crucial_CT*M550SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+ { "Micron_M[56]*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
+ ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "Crucial_CT*SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+
+ /*
+ * As defined, the DRAT (Deterministic Read After Trim) and RZAT
+ * (Return Zero After Trim) flags in the ATA Command Set are
+ * unreliable in the sense that they only define what happens if
+ * the device successfully executed the DSM TRIM command. TRIM
+ * is only advisory, however, and the device is free to silently
+ * ignore all or parts of the request.
+ *
+ * Whitelist drives that are known to reliably return zeroes
+ * after TRIM.
+ */
+
+ /*
+ * The intel 510 drive has buggy DRAT/RZAT. Explicitly exclude
+ * that model before whitelisting all other intel SSDs.
+ */
+ { "INTEL*SSDSC2MH*", NULL, 0, },
+
+ { "INTEL*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "SSD*INTEL*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "Samsung*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "SAMSUNG*SSD*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
+ { "ST[1248][0248]0[FH]*", NULL, ATA_HORKAGE_ZERO_AFTER_TRIM, },
/*
* Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 7659d6468303..280729325ebd 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2532,13 +2532,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
rbuf[15] = lowest_aligned;
if (ata_id_has_trim(args->id)) {
- rbuf[14] |= 0x80; /* TPE */
+ rbuf[14] |= 0x80; /* LBPME */
- if (ata_id_has_zero_after_trim(args->id))
- rbuf[14] |= 0x40; /* TPRZ */
+ if (ata_id_has_zero_after_trim(args->id) &&
+ dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+ ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+ rbuf[14] |= 0x40; /* LBPRZ */
+ }
}
}
-
return 0;
}
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2d182413b1db..f2b440e44fd7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -422,6 +422,7 @@ enum {
ATA_HORKAGE_NO_NCQ_TRIM = (1 << 19), /* don't use queued TRIM */
ATA_HORKAGE_NOLPM = (1 << 20), /* don't use LPM */
ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21), /* some WDs have broken LPM */
+ ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
/* DMA mask for user DMA control: User visible values; DO NOT
renumber */
--
1.9.3
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 15:34 ` Martin K. Petersen
@ 2015-01-08 15:36 ` Tejun Heo
0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2015-01-08 15:36 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-ide, linux-fsdevel
On Thu, Jan 08, 2015 at 10:34:27AM -0500, Martin K. Petersen wrote:
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
>
> In practice this renders the DRAT and RZAT flags completely useless and
> because the results are unpredictable we decided to disable discard in
> MD for 3.18 to avoid the risk of data corruption.
>
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortunately those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programmatically. So we are compelled to disable
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
>
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
>
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
>
> - All intel SSD models except for 510
> - Micron M5?0/M600
> - Samsung SSDs
> - Seagate SSDs
>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Applied to libata/for-3.19-fixes.
Thanks.
--
tejun
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 14:28 ` Martin K. Petersen
2015-01-08 15:11 ` Tejun Heo
@ 2015-01-08 15:58 ` Tim Small
2015-01-09 20:52 ` Martin K. Petersen
1 sibling, 1 reply; 16+ messages in thread
From: Tim Small @ 2015-01-08 15:58 UTC (permalink / raw)
To: Martin K. Petersen, tj; +Cc: linux-ide, linux-fsdevel
On 08/01/15 14:28, Martin K. Petersen wrote:
> + { "Micron_M[56]*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
> + ATA_HORKAGE_ZERO_AFTER_TRIM, },
A minor quibble - but all the other HORKAGE flags are used to identify
things which are broken or non-functional about a device (and this is
implicit with the use of the word 'horkage' IMO).
However, this proposed flag is trying to imply behaviour which is an
"extra feature" (beyond the requirements of the spec). So the intention
is to whitelist using a mechanism which has otherwise only been used
to blacklist.
I know it's difficult coming up with something which isn't too
wordy/verbose, but IMO any of:
ATA_HORKAGE_TRIM_ALWAYS_ZEROS
ATA_TRIM_ALWAYS_ZEROS
ATA_RELIABLE_ZERO_AFTER_TRIM
would seem clearer to me, because as the patch currently stands,
"ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is broken
on this device".
Particularly with the Micron excerpt quoted above I initially parsed
this as "don't issue tagged TRIM commands to this device, and don't
assume it'll read zeros after TRIM either".
Tim.
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 14:09 ` Phillip Susi
@ 2015-01-08 22:31 ` Andreas Dilger
0 siblings, 0 replies; 16+ messages in thread
From: Andreas Dilger @ 2015-01-08 22:31 UTC (permalink / raw)
To: Phillip Susi
Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley,
linux-ide, linux-fsdevel
On Jan 8, 2015, at 7:09 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>>> No, it knows that the inode table needs to be initialized because there
>>> is a flag in the group descriptor that says this inode table is
>>> still uninitialized. It never reads the blocks to see if they are
>>> full of zeros. mke2fs sets the flag when it does not initialize
>>> the table with zeros, either by direct writes ( which it doesn't
>>> do if lazy_itable_init is true, which it defaults to these days
>>> ), or by discarding the blocks when the device claims to support
>>> deterministic discard that zeros.
>>
>> That is only partially correct. While it is true that mke2fs sets
>> the UNINIT flag at format time, the "lazy" part of that means there
>> is a kernel thread that still does the zeroing of the inode table
>> blocks, but after the filesystem is mounted, for any group that
>> does not have the ZEROED flag set. After that point, the "UNINIT"
>> flag is an optimization to avoid reading the bitmap and unused
>> blocks from disk during allocation.
>
> That is pretty much what I said, except that I was pointing out that
> it does not *read* first to see if the disk is already zeroed, as that
> would be a waste of time. It just writes out the zeros for block
> groups that still have the uninit flag set.
Sorry, I didn't get that from my reading, so I thought I'd clarify.
I'd actually proposed that the ext4_init_inode_table() thread start
by reading the itable blocks first, checking them for zeroes, and only
switching over to writing if it finds any non-zero data in the blocks.
I think that would be a net win in some cases, and only a tiny bit
of overhead (a single read) if it turns out to be wrong.
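Something along these lines (a rough sketch of the idea only, not a
submitted patch; the helper name is made up and it assumes
<linux/buffer_head.h> and ext4's internal types):

    /*
     * Read the inode-table blocks and report whether they are already
     * zero-filled; the lazy-init thread could then skip the writes when
     * this returns true, at the cost of the reads up to the first
     * non-zero block.
     */
    static bool example_itable_is_zeroed(struct super_block *sb,
                                         ext4_fsblk_t start, unsigned long nr)
    {
            unsigned long i;

            for (i = 0; i < nr; i++) {
                    struct buffer_head *bh = sb_bread(sb, start + i);
                    bool clean;

                    if (!bh)
                            return false;  /* read error: zero to be safe */
                    clean = memchr_inv(bh->b_data, 0, bh->b_size) == NULL;
                    brelse(bh);
                    if (!clean)
                            return false;  /* stale data: fall back to zeroing */
            }
            return true;
    }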
>> This is needed in case the group descriptor or inode bitmap is
>> corrupted, and e2fsck needs to scan the inode table for in-use
>> inodes. We don't want it to find old inodes from before the
>> filesystem was formatted.
>>
>> The ext4_init_inode_table() calls
>> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
>> storage supported deterministic zeroing, this could be handled very
>> efficiently.
>
> Again, that's pretty much what I said.
Cheers, Andreas
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-08 15:58 ` Tim Small
@ 2015-01-09 20:52 ` Martin K. Petersen
2015-01-09 21:39 ` Tejun Heo
0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-09 20:52 UTC (permalink / raw)
To: Tim Small; +Cc: Martin K. Petersen, tj, linux-ide, linux-fsdevel
>>>>> "Tim" == Tim Small <tim@seoss.co.uk> writes:
Tim> would seem clearer to me, because as the patch currently stands,
Tim> "ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is
Tim> broken on this device".
In SCSI we use often the term "quirk" regardless of whether we enable or
disable a feature. So "horkage" didn't really bother me much in this
context. But I'm happy to submit a name change patch if Tejun thinks
it's a valid concern.
--
Martin K. Petersen Oracle Linux Engineering
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
2015-01-09 20:52 ` Martin K. Petersen
@ 2015-01-09 21:39 ` Tejun Heo
0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2015-01-09 21:39 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: Tim Small, linux-ide, linux-fsdevel
On Fri, Jan 09, 2015 at 03:52:21PM -0500, Martin K. Petersen wrote:
> >>>>> "Tim" == Tim Small <tim@seoss.co.uk> writes:
>
> Tim> would seem clearer to me, because as the patch currently stands,
> Tim> "ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is
> Tim> broken on this device".
>
In SCSI we often use the term "quirk" regardless of whether we enable or
> disable a feature. So "horkage" didn't really bother me much in this
> context. But I'm happy to submit a name change patch if Tejun thinks
> it's a valid concern.
I think it's fine. It's not like its polarity is confusing - nobody
would read it as meaning the opposite.
Thanks.
--
tejun