* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
       [not found] ` <20150105162830.GP15833@htj.dyndns.org>
@ 2015-01-07  0:05 ` Martin K. Petersen
  2015-01-07  2:54   ` Tejun Heo

  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-07  0:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

>> The other use case is the filesystem one where it is common to zero
>> block ranges for bitmaps, etc. In many workloads there is a
>> significant win to trimming over writing out many blocks of zeroes.

Tejun> Isn't that kinda niche and specialized tho?

I don't think so. There are two reasons for zeroing block ranges:

 1) To ensure they contain zeroes on subsequent reads

 2) To preallocate them or anchor them down on thin provisioned devices

The filesystem folks have specifically asked to be able to make that
distinction. Hence the patch that changes blkdev_issue_zeroout().

You really don't want to write out gobs and gobs of zeroes and cause
unnecessary flash wear if all you care about is the blocks being in a
deterministic state.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  0:05 ` [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
@ 2015-01-07  2:54   ` Tejun Heo
  2015-01-07  4:15     ` Dave Chinner

  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-01-07  2:54 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: James Bottomley, linux-ide, linux-fsdevel

Hello, Martin.

On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> Tejun> Isn't that kinda niche and specialized tho?
>
> I don't think so. There are two reasons for zeroing block ranges:
>
> 1) To ensure they contain zeroes on subsequent reads
>
> 2) To preallocate them or anchor them down on thin provisioned devices
>
> The filesystem folks have specifically asked to be able to make that
> distinction. Hence the patch that changes blkdev_issue_zeroout().
>
> You really don't want to write out gobs and gobs of zeroes and cause
> unnecessary flash wear if all you care about is the blocks being in a
> deterministic state.

I think I'm still missing something. Are there enough cases where
filesystems want to write out zeroes during operation? Earlier in the
thread, it was mentioned that this is currently mostly useful for
raids which need the blocks actually cleared for checksum consistency,
which basically means that raid metadata handling isn't (yet) capable
of just marking those (parts of) stripes as unused. If a filesystem
wants to read back zeros from data blocks, wouldn't it just be marking
the matching index as such? And if you take out the zeroing case,
trims are just trims and whether they return 0 afterwards or not is
irrelevant. There sure can be use cases where zeroing fast and
securely makes a noticeable difference, but the cases put forth up to
this point seem relatively weak.

I mean, after all, requiring trim to zero the blocks is essentially
pushing down that amount of metadata management to the device - the
device would do exactly the same thing. Pushing it down the layers
can definitely be beneficial, especially when there's no agreed-upon
metadata on the medium (so, mkfs time), but it seems kinda superfluous
during normal operation.

What am I missing?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  2:54 ` Tejun Heo
@ 2015-01-07  4:15   ` Dave Chinner
  2015-01-07 15:26     ` Tejun Heo
  2015-01-08  4:05     ` Phillip Susi

  0 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2015-01-07  4:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

On Tue, Jan 06, 2015 at 09:54:55PM -0500, Tejun Heo wrote:
> Hello, Martin.
>
> On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> > Tejun> Isn't that kinda niche and specialized tho?
> >
> > I don't think so. There are two reasons for zeroing block ranges:
> >
> > 1) To ensure they contain zeroes on subsequent reads
> >
> > 2) To preallocate them or anchor them down on thin provisioned devices
> >
> > The filesystem folks have specifically asked to be able to make that
> > distinction. Hence the patch that changes blkdev_issue_zeroout().
> >
> > You really don't want to write out gobs and gobs of zeroes and cause
> > unnecessary flash wear if all you care about is the blocks being in a
> > deterministic state.
>
> I think I'm still missing something. Are there enough cases where
> filesystems want to write out zeroes during operation?

IMO, yes.

w.r.t. thinp devices, we need to be able to guarantee that
preallocated regions in the filesystem are actually backed by real
blocks in the thinp device so we don't get ENOSPC from the thinp
device. No filesystems do this yet because we don't have a mechanism
for telling the lower layers "preallocate these blocks to zero".

The biggest issue is that we currently have no easy way to say
"these blocks need to contain zeros, but we aren't actually using
them yet". i.e. the filesystem code assumes that they contain zeros
(e.g. in ext4 inode tables because mkfs used to zero them) if they
haven't been used, so when it reads them it detects that
initialisation is needed because the blocks are empty....

FWIW, some filesystems need these regions to actually contain
zeros because they can't track unwritten extents (e.g.
gfs2). Having sb_issue_zeroout() just do the right thing enables us
to efficiently zero the regions they are preallocating...

> Earlier in the
> thread, it was mentioned that this is currently mostly useful for
> raids which need the blocks actually cleared for checksum consistency,
> which basically means that raid metadata handling isn't (yet) capable
> of just marking those (parts of) stripes as unused. If a filesystem
> wants to read back zeros from data blocks, wouldn't it be just marking
> the matching index as such?

Not all filesystems can do this for user data (see gfs2 case above)
and no linux filesystem tracks whether free space contains zeros or
stale data. Hence if we want blocks to be zeroed on disk, we
currently have to write zeros to them and hence they get pinned in
devices as "used space" even though they may never get used again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  4:15 ` Dave Chinner
@ 2015-01-07 15:26   ` Tejun Heo
  2015-01-08 14:28     ` Martin K. Petersen
  2015-01-08 14:29     ` Martin K. Petersen

  0 siblings, 2 replies; 16+ messages in thread
From: Tejun Heo @ 2015-01-07 15:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

Hello, Dave.

On Wed, Jan 07, 2015 at 03:15:37PM +1100, Dave Chinner wrote:
> w.r.t. thinp devices, we need to be able to guarantee that
> preallocated regions in the filesystem are actually backed by real
> blocks in the thinp device so we don't get ENOSPC from the thinp
> device. No filesystems do this yet because we don't have a mechanism
> for telling the lower layers "preallocate these blocks to zero".
>
> The biggest issue is that we currently have no easy way to say
> "these blocks need to contain zeros, but we aren't actually using
> them yet". i.e. the filesystem code assumes that they contain zeros
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....
>
> FWIW, some filesystems need these regions to actually contain
> zeros because they can't track unwritten extents (e.g.
> gfs2). Having sb_issue_zeroout() just do the right thing enables us
> to efficiently zero the regions they are preallocating...
>
> Not all filesystems can do this for user data (see gfs2 case above)
> and no linux filesystem tracks whether free space contains zeros or
> stale data. Hence if we want blocks to be zeroed on disk, we
> currently have to write zeros to them and hence they get pinned in
> devices as "used space" even though they may never get used again.

Okay, I'll take it as that this benefits generic enough use cases
from filesystem POV. As long as that's the case, it prolly has a
reasonable chance of actually being widely and properly implemented
by hw vendors. It's easy to complain about mainstream hardware but
IMHO the market is actually pretty efficient at shooting down extra
cruft which doesn't really matter and only exists to increase the
number of feature checkboxes. Hopefully, this has actual benefits and
won't end up that way.

Martin, do you have a newer version or shall I apply the original one?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07 15:26 ` Tejun Heo
@ 2015-01-08 14:28   ` Martin K. Petersen
  2015-01-08 15:11     ` Tejun Heo
  2015-01-08 15:58     ` Tim Small

  0 siblings, 2 replies; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-08 14:28 UTC (permalink / raw)
  To: tj; +Cc: linux-ide, linux-fsdevel, Martin K. Petersen

As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless and
because the results are unpredictable we decided to disable discard in
MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5?0/M600
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/ata/libata-core.c | 18 ++++++++++++++----
 drivers/ata/libata-scsi.c | 10 ++++++----
 include/linux/libata.h    |  1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 5c84fb5c3372..7c03401dc5d8 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4233,10 +4233,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT*SSD*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+	 * SSDs that provide reliable zero after TRIM.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
+	{ "INTEL*SSD*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 7659d6468303..280729325ebd 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2532,13 +2532,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2d182413b1db..f2b440e44fd7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -422,6 +422,7 @@ enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	/* DMA mask for user DMA control: User visible values; DO NOT renumber */
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 14:28 ` Martin K. Petersen
@ 2015-01-08 15:11   ` Tejun Heo
  2015-01-08 15:34     ` Martin K. Petersen

  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-01-08 15:11 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-ide, linux-fsdevel

Hello, Martin.

The patch looks okay to me. Just one nit.

On Thu, Jan 08, 2015 at 09:28:31AM -0500, Martin K. Petersen wrote:
> @@ -4233,10 +4233,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
> 	{ "PIONEER DVD-RW DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
>
> 	/* devices that don't properly handle queued TRIM commands */
> -	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> -	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> +	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
> +						ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Crucial_CT*SSD*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
> +
> +	/*
> +	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
> +	 * SSDs that provide reliable zero after TRIM.
> +	 */

Can you please be more detailed on the above explanation? It's not
like most people would be familiar with acronyms like DRAT/RZAT and
what the surrounding situation is like to require this sort of
whitelisting.

> +	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
> +	{ "INTEL*SSD*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },

I think the above two can use a bit more verbose explanation too.

> +	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
> +	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 15:11 ` Tejun Heo
@ 2015-01-08 15:34   ` Martin K. Petersen
  2015-01-08 15:36     ` Tejun Heo

  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-08 15:34 UTC (permalink / raw)
  To: tj; +Cc: linux-ide, linux-fsdevel, Martin K. Petersen

As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless and
because the results are unpredictable we decided to disable discard in
MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5?0/M600
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/ata/libata-core.c | 31 +++++++++++++++++++++++++++----
 drivers/ata/libata-scsi.c | 10 ++++++----
 include/linux/libata.h    |  1 +
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 5c84fb5c3372..23c2ae03a7ab 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4233,10 +4233,33 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT*SSD*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * As defined, the DRAT (Deterministic Read After Trim) and RZAT
+	 * (Return Zero After Trim) flags in the ATA Command Set are
+	 * unreliable in the sense that they only define what happens if
+	 * the device successfully executed the DSM TRIM command. TRIM
+	 * is only advisory, however, and the device is free to silently
+	 * ignore all or parts of the request.
+	 *
+	 * Whitelist drives that are known to reliably return zeroes
+	 * after TRIM.
+	 */
+
+	/*
+	 * The intel 510 drive has buggy DRAT/RZAT. Explicitly exclude
+	 * that model before whitelisting all other intel SSDs.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, },
+
+	{ "INTEL*SSD*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 7659d6468303..280729325ebd 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2532,13 +2532,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2d182413b1db..f2b440e44fd7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -422,6 +422,7 @@ enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	/* DMA mask for user DMA control: User visible values; DO NOT renumber */
 
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 15:34 ` Martin K. Petersen
@ 2015-01-08 15:36   ` Tejun Heo

  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2015-01-08 15:36 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-ide, linux-fsdevel

On Thu, Jan 08, 2015 at 10:34:27AM -0500, Martin K. Petersen wrote:
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
>
> In practice this renders the DRAT and RZAT flags completely useless and
> because the results are unpredictable we decided to disable discard in
> MD for 3.18 to avoid the risk of data corruption.
>
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortunately those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programmatically. So we are compelled to disable
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
>
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
>
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
>
>    - All intel SSD models except for 510
>    - Micron M5?0/M600
>    - Samsung SSDs
>    - Seagate SSDs
>
> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Applied to libata/for-3.19-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 14:28 ` Martin K. Petersen
  2015-01-08 15:11   ` Tejun Heo
@ 2015-01-08 15:58   ` Tim Small
  2015-01-09 20:52     ` Martin K. Petersen

  1 sibling, 1 reply; 16+ messages in thread
From: Tim Small @ 2015-01-08 15:58 UTC (permalink / raw)
  To: Martin K. Petersen, tj; +Cc: linux-ide, linux-fsdevel

On 08/01/15 14:28, Martin K. Petersen wrote:
> +	{ "Micron_M[56]*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
> +						ATA_HORKAGE_ZERO_AFTER_TRIM, },

A minor quibble - but all the other HORKAGE flags are used to identify
things which are broken or non-functional about a device (and this is
implicit with the use of the word 'horkage' IMO).

However, this proposed flag is trying to imply behaviour which is an
"extra feature" (beyond the requirements of the spec). So the intention
is to whitelist using a mechanism which has otherwise only been used
to blacklist.

I know it's difficult coming up with something which isn't too
wordy/verbose, but IMO any of:

ATA_HORKAGE_TRIM_ALWAYS_ZEROS
ATA_TRIM_ALWAYS_ZEROS
ATA_RELIABLE_ZERO_AFTER_TRIM

would seem clearer to me, because as the patch currently stands,
"ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is broken
on this device".

Particularly with the Micron excerpt quoted above I initially parsed
this as "don't issue tagged TRIM commands to this device, and don't
assume it'll read zeros after TRIM either".

Tim.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 15:58 ` Tim Small
@ 2015-01-09 20:52   ` Martin K. Petersen
  2015-01-09 21:39     ` Tejun Heo

  0 siblings, 1 reply; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-09 20:52 UTC (permalink / raw)
  To: Tim Small; +Cc: Martin K. Petersen, tj, linux-ide, linux-fsdevel

>>>>> "Tim" == Tim Small <tim@seoss.co.uk> writes:

Tim> would seem clearer to me, because as the patch currently stands,
Tim> "ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is
Tim> broken on this device".

In SCSI we often use the term "quirk" regardless of whether we enable
or disable a feature. So "horkage" didn't really bother me much in this
context. But I'm happy to submit a name change patch if Tejun thinks
it's a valid concern.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-09 20:52 ` Martin K. Petersen
@ 2015-01-09 21:39   ` Tejun Heo

  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2015-01-09 21:39 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: Tim Small, linux-ide, linux-fsdevel

On Fri, Jan 09, 2015 at 03:52:21PM -0500, Martin K. Petersen wrote:
> >>>>> "Tim" == Tim Small <tim@seoss.co.uk> writes:
>
> Tim> would seem clearer to me, because as the patch currently stands,
> Tim> "ATA_HORKAGE_ZERO_AFTER_TRIM" implies to me "zero after trim is
> Tim> broken on this device".
>
> In SCSI we often use the term "quirk" regardless of whether we enable
> or disable a feature. So "horkage" didn't really bother me much in this
> context. But I'm happy to submit a name change patch if Tejun thinks
> it's a valid concern.

I think it's fine. It's not like its polarity is confusing - nobody
would read it as meaning the opposite.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07 15:26 ` Tejun Heo
  2015-01-08 14:28   ` Martin K. Petersen
@ 2015-01-08 14:29   ` Martin K. Petersen

  1 sibling, 0 replies; 16+ messages in thread
From: Martin K. Petersen @ 2015-01-08 14:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Dave Chinner, Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> do you have a newer version or shall I apply the original one?

Just sent an updated patch that catches the latest Micron/Crucial
drives as well.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-07  4:15 ` Dave Chinner
  2015-01-07 15:26   ` Tejun Heo
@ 2015-01-08  4:05   ` Phillip Susi
  2015-01-08  4:58     ` Andreas Dilger

  1 sibling, 1 reply; 16+ messages in thread
From: Phillip Susi @ 2015-01-08  4:05 UTC (permalink / raw)
  To: Dave Chinner, Tejun Heo
  Cc: Martin K. Petersen, James Bottomley, linux-ide, linux-fsdevel

On 01/06/2015 11:15 PM, Dave Chinner wrote:
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....

No, it knows that the inode table needs to be initialized because
there is a flag in the group descriptor that says this inode table is
still uninitialized. It never reads the blocks to see if they are full
of zeros. mke2fs sets the flag when it does not initialize the table
with zeros, either by direct writes (which it doesn't do if
lazy_itable_init is true, which it defaults to these days), or by
discarding the blocks when the device claims to support deterministic
discard that zeros.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08  4:05 ` Phillip Susi
@ 2015-01-08  4:58   ` Andreas Dilger
  2015-01-08 14:09     ` Phillip Susi

  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2015-01-08  4:58 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley, linux-ide, linux-fsdevel

On Jan 7, 2015, at 9:05 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 01/06/2015 11:15 PM, Dave Chinner wrote:
>> (e.g. in ext4 inode tables because mkfs used to zero them) if they
>> haven't been used, so when it reads them it detects that
>> initialisation is needed because the blocks are empty....
>
> No, it knows that the inode table needs to be initialized because
> there is a flag in the group descriptor that says this inode table is
> still uninitialized. It never reads the blocks to see if they are full
> of zeros. mke2fs sets the flag when it does not initialize the table
> with zeros, either by direct writes (which it doesn't do if
> lazy_itable_init is true, which it defaults to these days), or by
> discarding the blocks when the device claims to support deterministic
> discard that zeros.

That is only partially correct. While it is true that mke2fs sets the
UNINIT flag at format time, the "lazy" part of that means that a
kernel thread still does the zeroing of the inode table blocks, after
the filesystem is mounted, for any group that does not have the ZEROED
flag set. After that point, the "UNINIT" flag is an optimization to
avoid reading the bitmap and unused blocks from disk during allocation.

This is needed in case the group descriptor or inode bitmap is
corrupted, and e2fsck needs to scan the inode table for in-use inodes.
We don't want it to find old inodes from before the filesystem was
formatted.

The ext4_init_inode_table() calls
sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying storage
supported deterministic zeroing of the underlying storage, this could
be handled very efficiently.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM 2015-01-08 4:58 ` Andreas Dilger @ 2015-01-08 14:09 ` Phillip Susi 2015-01-08 22:31 ` Andreas Dilger 0 siblings, 1 reply; 16+ messages in thread From: Phillip Susi @ 2015-01-08 14:09 UTC (permalink / raw) To: Andreas Dilger Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley, linux-ide, linux-fsdevel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 1/7/2015 11:58 PM, Andreas Dilger wrote: >> No, it knows that the inode table needs initialized because there >> is a flag in the group descriptor that says this inode table is >> still uninitalized. It never reads the blocks to see if they are >> full of zeros. mke2fs sets the flag when it does not initialize >> the table with zeros, either by direct writes ( which it doesn't >> do if lazy_itable_init is true, which it defaults to these days >> ), or by discarding the blocks when the device claims to support >> deterministic discard that zeros. > > That is only partially correct. While it is true that mke2fs sets > the UNINIT flag at format time, the "lazy" part of that means there > is a kernel thread still does the zeroing of the inode table > blocks, but after the filesystem is mounted, for any group that > does not have the ZEROED flag set. After that point, the "UNINIT" > flag is an optimization to avoid reading the bitmap and unused > blocks from disk during allocation. That is pretty much what I said, except that I was pointing out that it does not *read* first to see if the disk is already zeroed, as that would be a waste of time. It just writes out the zeros for block groups that still have the uninit flag set. > This is needed in case the group descriptor or inode bitmap is > corrupted, and e2fsck needs to scan the inode table for in-use > inodes. We don't want it to find old inodes from before the > filesystem was formatted. 
>
> The ext4_init_inode_table() calls
> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
> storage supports deterministic zeroing, this could be handled very
> efficiently.

Again, that's pretty much what I said.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM
  2015-01-08 14:09               ` Phillip Susi
@ 2015-01-08 22:31                 ` Andreas Dilger
  0 siblings, 0 replies; 16+ messages in thread
From: Andreas Dilger @ 2015-01-08 22:31 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Dave Chinner, Tejun Heo, Martin Petersen, James Bottomley,
	linux-ide, linux-fsdevel

On Jan 8, 2015, at 7:09 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>>> No, it knows that the inode table needs to be initialized because
>>> there is a flag in the group descriptor that says this inode table
>>> is still uninitialized.  It never reads the blocks to see if they
>>> are full of zeros.  mke2fs sets the flag when it does not initialize
>>> the table with zeros, either by direct writes (which it doesn't do
>>> if lazy_itable_init is true, which it defaults to these days), or
>>> by discarding the blocks when the device claims to support
>>> deterministic discard that zeros.
>>
>> That is only partially correct.  While it is true that mke2fs sets
>> the UNINIT flag at format time, the "lazy" part of that means there
>> is a kernel thread that still does the zeroing of the inode table
>> blocks, but only after the filesystem is mounted, for any group that
>> does not have the ZEROED flag set.  After that point, the "UNINIT"
>> flag is an optimization to avoid reading the bitmap and unused
>> blocks from disk during allocation.
>
> That is pretty much what I said, except that I was pointing out that
> it does not *read* first to see if the disk is already zeroed, as that
> would be a waste of time.  It just writes out the zeros for block
> groups that still have the uninit flag set.

Sorry, I didn't get that from my reading, so I thought I'd clarify.

I'd actually proposed that the ext4_init_inode_table() thread start by
reading the itable blocks, checking them for zeroes, and only switching
over to writing if it finds any non-zero data in the blocks.
I think that would be a net win in some cases, and only a tiny bit of
overhead (a single read) if it turns out to be wrong.

>> This is needed in case the group descriptor or inode bitmap is
>> corrupted, and e2fsck needs to scan the inode table for in-use
>> inodes.  We don't want it to find old inodes from before the
>> filesystem was formatted.
>>
>> The ext4_init_inode_table() calls
>> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
>> storage supports deterministic zeroing, this could be handled very
>> efficiently.
>
> Again, that's pretty much what I said.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 16+ messages in thread
end of thread, other threads:[~2015-01-09 21:39 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
     [not found] <yq18uiojfq9.fsf@sermon.lab.mkp.net>
     [not found] ` <20141204170611.GB2995@htj.dyndns.org>
     [not found]   ` <yq14mtahmil.fsf@sermon.lab.mkp.net>
     [not found]     ` <20141205145148.GI4080@htj.dyndns.org>
     [not found]       ` <1418184578.2121.3.camel@HansenPartnership.com>
     [not found]         ` <20141210142927.GA6294@htj.dyndns.org>
     [not found]           ` <yq1oarbcj6f.fsf@sermon.lab.mkp.net>
     [not found]             ` <20150105162830.GP15833@htj.dyndns.org>
2015-01-07  0:05               ` [PATCH] libata: Whitelist SSDs that are known to properly return zeroes after TRIM Martin K. Petersen
2015-01-07  2:54                 ` Tejun Heo
2015-01-07  4:15                   ` Dave Chinner
2015-01-07 15:26                     ` Tejun Heo
2015-01-08 14:28                       ` Martin K. Petersen
2015-01-08 15:11                         ` Tejun Heo
2015-01-08 15:34                           ` Martin K. Petersen
2015-01-08 15:36                             ` Tejun Heo
2015-01-08 15:58                               ` Tim Small
2015-01-09 20:52                                 ` Martin K. Petersen
2015-01-09 21:39                                   ` Tejun Heo
2015-01-08 14:29                       ` Martin K. Petersen
2015-01-08  4:05                 ` Phillip Susi
2015-01-08  4:58                   ` Andreas Dilger
2015-01-08 14:09                     ` Phillip Susi
2015-01-08 22:31                       ` Andreas Dilger