Status of discard support in MD RAID

linux-raid.vger.kernel.org archive mirror

* Status of discard support in MD RAID
@ 2014-09-11 23:38 Brassow Jonathan
  2014-09-12  0:46 ` Chris Murphy
  2014-09-15  3:44 ` NeilBrown
  0 siblings, 2 replies; 10+ messages in thread
From: Brassow Jonathan @ 2014-09-11 23:38 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org Raid; +Cc: NeilBrown

Neil (or anyone else),

I know that trim/discard support was added back in 2012 (commit 9db90880).  However, I thought there were still issues regarding what happens when various sync operations occur.  I'd like to turn on discard support in dm-raid.c (a oneline patch) if things are in order.  I can enable any, all or none depending on your recommendation.  (I assume RAID1/10 is easier than the parity RAIDs.)

Thanks for any information,
 brassow


* Re: Status of discard support in MD RAID
  2014-09-11 23:38 Status of discard support in MD RAID Brassow Jonathan
@ 2014-09-12  0:46 ` Chris Murphy
  2014-09-12  9:03   ` David Brown
                     ` (2 more replies)
  2014-09-15  3:44 ` NeilBrown
  1 sibling, 3 replies; 10+ messages in thread
From: Chris Murphy @ 2014-09-12  0:46 UTC (permalink / raw)
  To: Brassow Jonathan; +Cc: linux-raid@vger.kernel.org Raid, NeilBrown

On Sep 11, 2014, at 5:38 PM, Brassow Jonathan <jbrassow@redhat.com> wrote:

> Neil (or anyone else),
> 
> I know that trim/discard support was added back in 2012 (commit 9db90880).  However, I thought there were still issues regarding what happens when various sync operations occur.  I'd like to turn on discard support in dm-raid.c (a oneline patch) if things are in order.  I can enable any, all or none depending on your recommendation.  (I assume RAID1/10 is easier than the parity RAIDs.)

If the controller and drives all support it then it should pass through, but there's the question of whether the SSD supports deterministic trim. If it doesn't, an "echo check > md/sync_action" will report mismatches in md/mismatch_cnt; and a repair will probably corrupt the volume. So you can still use trim with a drive that returns non-deterministic results with raid0/1/10, but you can't rely on the result of md/mismatch_cnt and you can't do repair type scrubs.

For raid5/6, it's a problem to use trim if the drive returns non-deterministic data for trimmed blocks. I'd think that in addition to DRAT (deterministic read after trim) being supported, it'd need to support DZAT (deterministic read zeros after trim).

smartctl --identify=wb /dev/diskX | grep -i trim
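
The scrub mentioned above is driven through the md sysfs files. A minimal sketch, assuming the array shows up under /sys/block/<name>/md; the helper names md_start_check and md_mismatch_count and the SYSFS_ROOT override (there only so the functions can be pointed at a scratch directory) are illustrative and not part of mdadm or any other tool:

```shell
#!/bin/sh
# Sketch: request an md "check" scrub and read back the mismatch counter.
# md_start_check / md_mismatch_count are hypothetical helper names.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.

md_start_check() {
    md_dir="${SYSFS_ROOT:-/sys}/block/$1/md"
    # Writing "check" to sync_action asks md for a read-only scrub.
    printf 'check\n' > "$md_dir/sync_action"
}

md_mismatch_count() {
    md_dir="${SYSFS_ROOT:-/sys}/block/$1/md"
    # mismatch_cnt is the counter md reports for the last check/repair.
    cat "$md_dir/mismatch_cnt"
}

# Usage (as root, with the array otherwise idle):
#   md_start_check md2
#   md_mismatch_count md2
```

With non-deterministic trim the counter says nothing about whether the mismatches sit in trimmed or in-use regions, which is the limitation described above.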

Chris Murphy


* Re: Status of discard support in MD RAID
  2014-09-12  0:46 ` Chris Murphy
@ 2014-09-12  9:03   ` David Brown
  2014-09-12  9:26     ` Brad Campbell
  2014-09-15  3:50     ` NeilBrown
  2014-09-12  9:39   ` Roman Mamedov
  2014-09-15  3:46   ` NeilBrown
  2 siblings, 2 replies; 10+ messages in thread
From: David Brown @ 2014-09-12  9:03 UTC (permalink / raw)
  To: Chris Murphy, Brassow Jonathan; +Cc: linux-raid@vger.kernel.org Raid, NeilBrown

On 12/09/14 02:46, Chris Murphy wrote:
> 
> On Sep 11, 2014, at 5:38 PM, Brassow Jonathan <jbrassow@redhat.com>
> wrote:
> 
>> Neil (or anyone else),
>> 
>> I know that trim/discard support was added back in 2012 (commit
>> 9db90880).  However, I thought there were still issues regarding
>> what happens when various sync operations occur.  I'd like to turn
>> on discard support in dm-raid.c (a oneline patch) if things are in
>> order.  I can enable any, all or none depending on your
>> recommendation.  (I assume RAID1/10 is easier than the parity
>> RAIDs.)
> 
> If all the controller and drive support it then it should pass
> through, but there's the problem whether the SSD supports
> deterministic trim. If it doesn't, an echo check > md/sync_action
> will report mismatches in md/mismatch_cnt; and a repair will probably
> corrupt the volume. So you can still use trim with a drive that
> returns non-deterministic results with raid0/1/10, but you can't rely
> on the result of md/mismatch_cnt and you can't do repair type
> scrubs.
> 
> For raid5/6, it's a problem to use trim if the drive returns
> non-deterministically for trimmed blocks. I'd think that in addition
> to DRAT being supported, it'd need to support DZAT.
> 
> smartctl --identify=wb /dev/diskX | grep -i trim
> 
> 
> Chris Murphy
> 

Would it be possible to change trim/discard commands into write zero
blocks for some SSDs?  A number of SSD controllers support transparent
compression, so writing large batches of zeros will result in very small
writes to the actual flash, and the SSD controller will be able to
recycle flash used by the overwritten logical blocks just as if they
were trimmed.  Obviously writing zeros will take longer in transfer than
trim commands, but the result on the disk would be similar and it would
be guaranteed deterministic.

David




* Re: Status of discard support in MD RAID
  2014-09-12  9:03   ` David Brown
@ 2014-09-12  9:26     ` Brad Campbell
  2014-09-15  3:50     ` NeilBrown
  1 sibling, 0 replies; 10+ messages in thread
From: Brad Campbell @ 2014-09-12  9:26 UTC (permalink / raw)
  To: David Brown, Chris Murphy, Brassow Jonathan
  Cc: linux-raid@vger.kernel.org Raid, NeilBrown

On 12/09/14 17:03, David Brown wrote:
> On 12/09/14 02:46, Chris Murphy wrote:
>>
>> On Sep 11, 2014, at 5:38 PM, Brassow Jonathan <jbrassow@redhat.com>
>> wrote:
>>
>>> Neil (or anyone else),
>>>
>>> I know that trim/discard support was added back in 2012 (commit
>>> 9db90880).  However, I thought there were still issues regarding
>>> what happens when various sync operations occur.  I'd like to turn
>>> on discard support in dm-raid.c (a oneline patch) if things are in
>>> order.  I can enable any, all or none depending on your
>>> recommendation.  (I assume RAID1/10 is easier than the parity
>>> RAIDs.)
>>
>> If all the controller and drive support it then it should pass
>> through, but there's the problem whether the SSD supports
>> deterministic trim. If it doesn't, an echo check > md/sync_action
>> will report mismatches in md/mismatch_cnt; and a repair will probably
>> corrupt the volume. So you can still use trim with a drive that
>> returns non-deterministic results with raid0/1/10, but you can't rely
>> on the result of md/mismatch_cnt and you can't do repair type
>> scrubs.
>>
>> For raid5/6, it's a problem to use trim if the drive returns
>> non-deterministically for trimmed blocks. I'd think that in addition
>> to DRAT being supported, it'd need to support DZAT.
>>
>> smartctl --identify=wb /dev/diskX | grep -i trim
>>
>>
>> Chris Murphy
>>
>
> Would it be possible to change trim/discard commands into write zero
> blocks for some SSDs?  A number of SSD controllers support transparent
> compression, so writing large batches of zeros will result in very small
> writes to the actual flash, and the SSD controller will be able to
> recycle flash used by the overwritten logical blocks just as if they
> were trimmed.  Obviously writing zeros will take longer in transfer than
> trim commands, but the result on the disk would be similar and it would
> be guaranteed deterministic.
>

I have 6 drives here: 3 Intel 330s and 3 Samsung 830s. The Intel 330s
compress transparently (Sandforce controllers). They also support
deterministic trim and return 0 on trimmed areas. The Samsungs don't
compress and don't support deterministic trim or 0 on trimmed areas.

They are in a 6 drive RAID10 and I just put up with the massive mismatch 
count. Contrary to what Chris wrote, there is no damage caused by a 
repair type of scrub. Depending on the direction it either copies 0's to 
the Samsungs or random garbage to the Intels. It simply ends up writing 
to all the trimmed areas, so I don't do it.

Frankly if I considered this an issue I'd just go and replace the 
Samsungs with something newer, but as it has no practical ramifications 
in the real world I'll use them until I either run out of space or I 
wear them out.

I certainly don't believe it is worthy of any form of special casing in
the block, RAID or filesystem code. Just tell people that if they want
to use SSDs in RAID and get properly working trim then all drives need
to support deterministic reads and return 0 after trim.

As it is, I get this once a month:

System Events
=-=-=-=-=-=-=
Sep  7 02:12:49 srv mdadm[6341]: RebuildFinished event detected on md device /dev/md2, component device  mismatches found: 7100032 (on raid level 10)

The machine is on a serious UPS and gets rebooted about twice a year, so 
I'm not afraid of unclean shutdowns.

Regards,
Brad


* Re: Status of discard support in MD RAID
  2014-09-12  0:46 ` Chris Murphy
  2014-09-12  9:03   ` David Brown
@ 2014-09-12  9:39   ` Roman Mamedov
  2014-09-13 20:19     ` Chris Murphy
  2014-09-15  3:56     ` NeilBrown
  2014-09-15  3:46   ` NeilBrown
  2 siblings, 2 replies; 10+ messages in thread
From: Roman Mamedov @ 2014-09-12  9:39 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Brassow Jonathan, linux-raid@vger.kernel.org Raid, NeilBrown


On Thu, 11 Sep 2014 18:46:04 -0600
Chris Murphy <lists@colorremedies.com> wrote:

> If it doesn't, an echo check > md/sync_action will report mismatches in
> md/mismatch_cnt; and a repair will probably corrupt the volume.

At least with RAID1/10, why would it?

> and you can't do repair type scrubs.

If the FS issues TRIM on a certain region, by definition it no longer cares
about what's stored there (as it is no longer in use by the FS). So even if
a repair ends up copying some data from one SSD to another, in effect changing
the contents of that region, this should not affect anything whatsoever from
the FS standpoint.

Technically perhaps that still counts as a "corruption", but not of anything
in the filesystem metadata or user data, just of unused regions. So not as
scary as it first sounds.

The only case where you'd run into problems with this, is if some apps expect
to read back zeroes on TRIM'ed regions, e.g. Qemu in the "detect-zeroes=unmap"
mode. But using that would be dangerous even on a single SSD with
non-deterministic TRIM, so mdraid changes nothing here.

-- 
With respect,
Roman


* Re: Status of discard support in MD RAID
  2014-09-12  9:39   ` Roman Mamedov
@ 2014-09-13 20:19     ` Chris Murphy
  2014-09-15  3:56     ` NeilBrown
  1 sibling, 0 replies; 10+ messages in thread
From: Chris Murphy @ 2014-09-13 20:19 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid@vger.kernel.org Raid

On Sep 12, 2014, at 3:39 AM, Roman Mamedov <rm@romanrm.net> wrote:

> On Thu, 11 Sep 2014 18:46:04 -0600
> Chris Murphy <lists@colorremedies.com> wrote:
> 
>> If it doesn't, an echo check > md/sync_action will report mismatches in
>> md/mismatch_cnt; and a repair will probably corrupt the volume.
> 
> At least with RAID1/10, why would it?

It's a good question.

On the one hand:
ftp://ftp.t10.org/t10/document.08/08-347r1.pdf

In particular slides 5, 8, 9.

And then on the other hand:
https://lkml.org/lkml/2010/11/19/193

It's an overstatement to have said "repair will probably corrupt" when everything is working normally. I can't know that. What happens in the case of a crash, power failure, or a drive that dies? If drive 1of2 fully dies, then the user has a more certain outcome, at least it's one non-deterministic drive 2of2 remaining to use as a source to rebuild with a new drive.

But the non-deterministic output from SSD trimmed blocks means the user can't depend on the raid mechanism to confirm whether the rebuild worked. There will always be mismatches on check, and we have no way of knowing if those mismatches occur only in trimmed areas that we don't care about, or in data/metadata areas that we do care about. What's the workaround? Mount each mirror separately in degraded mode, produce a file checksum list for each, and compare them? Ick.
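
The checksum-list comparison could look roughly like the sketch below. It assumes both trees are already reachable as ordinary directories (how each mirror half gets assembled degraded and mounted read-only is left out), and checksum_list and compare_trees are hypothetical helper names:

```shell
#!/bin/sh
# Sketch: build a per-file checksum list for a directory tree and
# compare two such trees. Helper names are illustrative only.

checksum_list() {
    # Print "hash  ./relative/path" for every regular file, sorted by
    # path so two lists can be compared line by line.
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

compare_trees() {
    # $1 and $2 are the two directories to compare.
    # Returns 0 when every file matches, non-zero otherwise.
    a=$(mktemp) && b=$(mktemp) || return 2
    checksum_list "$1" > "$a"
    checksum_list "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Usage: compare_trees /mnt/mirror-a /mnt/mirror-b
```

This only covers file contents; it says nothing about filesystem metadata, and it reads every file twice, which is part of why it is unattractive.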

ZFS and Btrfs wouldn't get tripped up, because their scrubs only operate on in-use blocks. So that's also a plausible workaround for non-deterministic trim. But I don't know how well tested delete followed by trim is on either of them. Like Ted says, the filesystem has to be certain the delete has committed to stable media before issuing trim or all bets are off.

> 
>> and you can't do repair type scrubs.
> 
> If the FS issues TRIM on a certain region, by definition it no longer cares
> about what's stored there (as it is no longer in use by the FS). So even if
> a repair ends up copying some data from one SSD to another, in effect changing
> the contents of that region, this should not affect anything whatsoever from
> the FS standpoint.

That's true, it should not, so long as everything else is working normally and correctly. But we still lose the ability to verify that the repair was correct.

Chris Murphy


* Re: Status of discard support in MD RAID
  2014-09-11 23:38 Status of discard support in MD RAID Brassow Jonathan
  2014-09-12  0:46 ` Chris Murphy
@ 2014-09-15  3:44 ` NeilBrown
  1 sibling, 0 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-15  3:44 UTC (permalink / raw)
  To: Brassow Jonathan; +Cc: linux-raid@vger.kernel.org Raid


On Thu, 11 Sep 2014 18:38:11 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

> Neil (or anyone else),
> 
> I know that trim/discard support was added back in 2012 (commit 9db90880).  However, I thought there were still issues regarding what happens when various sync operations occur.  I'd like to turn on discard support in dm-raid.c (a oneline patch) if things are in order.  I can enable any, all or none depending on your recommendation.  (I assume RAID1/10 is easier than the parity RAIDs.)

The worst that a sync operation can do is report mismatches and "un-trim"
some (or all) of some devices.
It certainly should never corrupt data.

My perception is that enabling discard support in the filesystem can be good
for some devices and bad for other devices but should always be "safe" even
when not "optimal".  I think the same is true for md/raid.

For raid5/6, we only honour discard if the underlying devices report
discarded regions as all-zeros.  That makes it safe enough, I believe.

So I'd suggest: turn it on!

NeilBrown


* Re: Status of discard support in MD RAID
  2014-09-12  0:46 ` Chris Murphy
  2014-09-12  9:03   ` David Brown
  2014-09-12  9:39   ` Roman Mamedov
@ 2014-09-15  3:46   ` NeilBrown
  2 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-15  3:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Brassow Jonathan, linux-raid@vger.kernel.org Raid


On Thu, 11 Sep 2014 18:46:04 -0600 Chris Murphy <lists@colorremedies.com>
wrote:

> 
> On Sep 11, 2014, at 5:38 PM, Brassow Jonathan <jbrassow@redhat.com> wrote:
> 
> > Neil (or anyone else),
> > 
> > I know that trim/discard support was added back in 2012 (commit 9db90880).  However, I thought there were still issues regarding what happens when various sync operations occur.  I'd like to turn on discard support in dm-raid.c (a oneline patch) if things are in order.  I can enable any, all or none depending on your recommendation.  (I assume RAID1/10 is easier than the parity RAIDs.)
> 
> If all the controller and drive support it then it should pass through, but there's the problem whether the SSD supports deterministic trim. If it doesn't, an echo check > md/sync_action will report mismatches in md/mismatch_cnt; and a repair will probably corrupt the volume. So you can still use trim with a drive that returns non-deterministic results with raid0/1/10, but you can't rely on the result of md/mismatch_cnt and you can't do repair type scrubs.
> 
> For raid5/6, it's a problem to use trim if the drive returns non-deterministically for trimmed blocks. I'd think that in addition to DRAT being supported, it'd need to support DZAT.

md raid5/6 will not use trim unless the underlying device reports
"discard_zeroes_data".  That is a Linux internal field name.  I don't know
exactly what it means in SCSI/SATA/whatever devices.

NeilBrown


> 
> smartctl --identify=wb /dev/diskX | grep -i trim
> 
> 
> Chris Murphy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: Status of discard support in MD RAID
  2014-09-12  9:03   ` David Brown
  2014-09-12  9:26     ` Brad Campbell
@ 2014-09-15  3:50     ` NeilBrown
  1 sibling, 0 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-15  3:50 UTC (permalink / raw)
  To: David Brown
  Cc: Chris Murphy, Brassow Jonathan, linux-raid@vger.kernel.org Raid


On Fri, 12 Sep 2014 11:03:19 +0200 David Brown <david.brown@hesbynett.no>
wrote:

> On 12/09/14 02:46, Chris Murphy wrote:
> > 
> > On Sep 11, 2014, at 5:38 PM, Brassow Jonathan <jbrassow@redhat.com>
> > wrote:
> > 
> >> Neil (or anyone else),
> >> 
> >> I know that trim/discard support was added back in 2012 (commit
> >> 9db90880).  However, I thought there were still issues regarding
> >> what happens when various sync operations occur.  I'd like to turn
> >> on discard support in dm-raid.c (a oneline patch) if things are in
> >> order.  I can enable any, all or none depending on your
> >> recommendation.  (I assume RAID1/10 is easier than the parity
> >> RAIDs.)
> > 
> > If all the controller and drive support it then it should pass
> > through, but there's the problem whether the SSD supports
> > deterministic trim. If it doesn't, an echo check > md/sync_action
> > will report mismatches in md/mismatch_cnt; and a repair will probably
> > corrupt the volume. So you can still use trim with a drive that
> > returns non-deterministic results with raid0/1/10, but you can't rely
> > on the result of md/mismatch_cnt and you can't do repair type
> > scrubs.
> > 
> > For raid5/6, it's a problem to use trim if the drive returns
> > non-deterministically for trimmed blocks. I'd think that in addition
> > to DRAT being supported, it'd need to support DZAT.
> > 
> > smartctl --identify=wb /dev/diskX | grep -i trim
> > 
> > 
> > Chris Murphy
> > 
> 
> Would it be possible to change trim/discard commands into write zero
> blocks for some SSDs? 

There is a BLKZEROOUT ioctl which writes zeros, using the 'WRITE SAME' SCSI
command if possible.

I suspect it would be quite easy to modify "fstrim" to use BLKZEROOUT
instead of BLKDISCARD.
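
Short of modifying fstrim, a byte range of a block device can be zeroed from the shell. This is a rough sketch, assuming a util-linux recent enough that blkdiscard has the --zeroout option; zero_range is a hypothetical wrapper, and it only prints the command unless DO_IT=1 is set, since zeroing the wrong range destroys data:

```shell
#!/bin/sh
# Sketch: zero a byte range of a block device with blkdiscard --zeroout.
# zero_range and the DO_IT switch are illustrative, not an existing tool.

zero_range() {
    dev=$1
    offset=$2
    length=$3
    # Build the command as the positional parameters so it can be
    # either printed or executed unchanged.
    set -- blkdiscard --zeroout --offset "$offset" --length "$length" "$dev"
    if [ "${DO_IT:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Usage (dry run): zero_range /dev/sdX 0 1048576
```

Unlike fstrim this knows nothing about which blocks the filesystem considers free, so it is only a stand-in for the idea, not a replacement for it.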

NeilBrown


>  A number of SSD controllers support transparent
> compression, so writing large batches of zeros will result in very small
> writes to the actual flash, and the SSD controller will be able to
> recycle flash used by the overwritten logical blocks just as if they
> were trimmed.  Obviously writing zeros will take longer in transfer than
> trim commands, but the result on the disk would be similar and it would
> be guaranteed deterministic.
> 
> David
> 
> 



* Re: Status of discard support in MD RAID
  2014-09-12  9:39   ` Roman Mamedov
  2014-09-13 20:19     ` Chris Murphy
@ 2014-09-15  3:56     ` NeilBrown
  1 sibling, 0 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-15  3:56 UTC (permalink / raw)
  To: Roman Mamedov
  Cc: Chris Murphy, Brassow Jonathan, linux-raid@vger.kernel.org Raid


On Fri, 12 Sep 2014 15:39:15 +0600 Roman Mamedov <rm@romanrm.net> wrote:

> On Thu, 11 Sep 2014 18:46:04 -0600
> Chris Murphy <lists@colorremedies.com> wrote:
> 
> > If it doesn't, an echo check > md/sync_action will report mismatches in
> > md/mismatch_cnt; and a repair will probably corrupt the volume.
> 
> At least with RAID1/10, why would it?
> 
> > and you can't do repair type scrubs.
> 
> If the FS issues TRIM on a certain region, by definition it no longer cares
> about what's stored there (as it is no longer in use by the FS). So even if
> a repair ends up copying some data from one SSD to another, in effect changing
> the contents of that region, this should not affect anything whatsoever from
> the FS standpoint.
> 
> Technically perhaps that still counts as a "corruption", but not of anything
> in the filesystem metadata or user data, just of unused regions. So not as
> scary as it first sounds.
> 
> The only case where you'd run into problems with this, is if some apps expect
> to read back zeroes on TRIM'ed regions, e.g. Qemu in the "detect-zeroes=unmap"
> mode. But using that would be dangerous even on a single SSD with
> non-deterministic TRIM, so mdraid changes nothing here.
> 

For any block device in Linux you can read the 'queue/discard_zeroes_data'
attribute to see if it is safe to expect zeros from a discarded region.
md sets that correctly.
For raid1/raid10 it is set if all member devices have it set.
For raid5/6, it is never set.  This is because we can only discard full
stripes so a non-full-stripe discard will not zero all of the data.
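
The raid1/raid10 rule above (set only if all member devices have it set) can be checked by hand. A minimal sketch, assuming the attribute lives at /sys/block/<dev>/queue/discard_zeroes_data as on kernels of this era (later kernels are, as far as I know, documented as always returning 0 there); all_discard_zeroes and the SYSFS_ROOT override are hypothetical:

```shell
#!/bin/sh
# Sketch: succeed only if every listed block device reports
# queue/discard_zeroes_data = 1.
# all_discard_zeroes is a hypothetical helper; SYSFS_ROOT is an assumed
# override for dry runs and defaults to /sys.

all_discard_zeroes() {
    root=${SYSFS_ROOT:-/sys}
    for dev in "$@"; do
        f="$root/block/$dev/queue/discard_zeroes_data"
        # A missing file or any value other than 1 counts as "not safe".
        [ -r "$f" ] && [ "$(cat "$f")" = 1 ] || return 1
    done
    return 0
}

# Usage: all_discard_zeroes sda sdb && echo "members read zeros after discard"
```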

NeilBrown

