* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] [not found] ` <497DE35C.6090308@redhat.com> @ 2009-01-26 17:34 ` Greg Freemyer 2009-01-26 17:46 ` Ric Wheeler ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Greg Freemyer @ 2009-01-26 17:34 UTC (permalink / raw) To: Ric Wheeler, linux-raid Cc: James Bottomley, Dongjun Shin, IDE/ATA development list Adding mdraid list: Top post as a recap for the mdraid list (repeated at the end of this email in case anyone wants to respond to any of it): == Start RECAP With proposed spec changes for both T10 and T13, a new "unmap" or "trim" command is being added, respectively. The Linux kernel is implementing this as a sector discard, which will be issued by various file systems as they delete files. Ext4 will be one of the first to support this (at least via out-of-kernel patches). SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf ATA - see T13/e08137r2 draft Per the proposed spec changes, the underlying SSD device can optionally modify the unmapped data. SCSI T10 at least restricts the way the modification happens, but data modification of unmapped data is still definitely allowed for both classes of SSD. Thus if a filesystem "discards" a sector, the contents of that sector can change, and parity values are no longer meaningful for the stripe. i.e. If the unmapped blocks don't exactly align with the RAID-5/6 striping, then the integrity of a stripe containing both mapped and unmapped data is lost. Thus it seems that either the filesystem will have to understand the RAID-5/6 striping / chunking setup and ensure it never issues a discard command unless an entire stripe is being discarded, or the RAID implementation must snoop the discard commands and take appropriate action. FYI: In T13 a feature bit will be provided to identify ATA SSDs that implement a "deterministic" feature, meaning that once you read a specific unmapped sector, its contents will not change until it is written. That does not change the fact that a discard command that does not perfectly match the RAID setup may destroy the integrity of a stripe. I believe all T10 (SCSI) devices will be deterministic by spec. End of RECAP On Mon, Jan 26, 2009 at 11:22 AM, Ric Wheeler <rwheeler@redhat.com> wrote: > Greg Freemyer wrote: >> >> On Fri, Jan 23, 2009 at 6:35 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >> >>> >>> Greg Freemyer wrote: >>> >>>> >>>> On Fri, Jan 23, 2009 at 5:24 PM, James Bottomley >>>> <James.Bottomley@hansenpartnership.com> wrote: >>>> >>>> >>>>> >>>>> On Fri, 2009-01-23 at 15:40 -0500, Ric Wheeler wrote: >>>>> >>>>> >>>>>> >>>>>> Greg Freemyer wrote: >>>>>> >>>>>> >>>>>>> >>>>>>> Just to make sure I understand, with the proposed trim updates to the >>>>>>> ATA spec (T13/e08137r2 draft), a SSD can have two kinds of data. >>>>>>> >>>>>>> Reliable and unreliable. Where unreliable can return zeros, ones, >>>>>>> old >>>>>>> data, random made up data, old data slightly adulterated, etc.. >>>>>>> >>>>>>> And there is no way for the kernel to distinguish if the particular >>>>>>> data it is getting from the SSD is of the reliable or unreliable >>>>>>> type? >>>>>>> >>>>>>> For the unreliable data, if the determistic bit is set in the >>>>>>> identify >>>>>>> block, then the kernel can be assured of reading the same unreliable >>>>>>> data repeatedly, but still it has no way of knowing the data it is >>>>>>> reading was ever even written to the SSD in the first place. 
>>>>>>> >>>>>>> That just seems unacceptable. >>>>>>> >>>>>>> Greg >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> Hi Greg, >>>>>> >>>>>> I sat in on a similar discussion in T10 . With luck, the T13 people >>>>>> have >>>>>> the same high level design: >>>>>> >>>>>> (1) following a write to sector X, any subsequent read of X will >>>>>> return >>>>>> that data >>>>>> (2) once you DISCARD/UNMAP sector X, the device can return any state >>>>>> (stale data, all 1's, all 0's) on the next read of that sector, but >>>>>> must >>>>>> continue to return that data on following reads until the sector is >>>>>> rewritten >>>>>> >>>>>> >>>>> >>>>> Actually, the latest draft: >>>>> >>>>> http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>>> >>>>> extends this behaviour: If the array has read capacity(16) TPRZ bit set >>>>> then the return for an unmapped block is always zero. If TPRZ isn't >>>>> set, it's undefined but consistent. I think TPRZ is there to address >>>>> security concerns. >>>>> >>>>> James >>>>> >>>>> >>>> >>>> To James, >>>> >>>> I took a look at the spec, but I'm not familiar with the SCSI spec to >>>> grok it immediately. >>>> >>>> Is the TPRZ bit meant to be a way for the manufacturer to report which >>>> of the two behaviors their device implements, or is it a externally >>>> configurable flag that tells the SSD which way to behave? >>>> >>>> Either way, is there reason to believe the ATA T13 spec will get >>>> similar functionality? >>>> >>>> To Ric, >>>> >>>> First, in general I think is is bizarre to have a device that is by >>>> spec able to return both reliable and non-reliable data, but the spec >>>> does not include a signaling method to differentiate between the two. >>>> >>>> === >>>> My very specific concern is that I work with evidence that will >>>> eventually be presented at court. >>>> >>>> We routinely work with both live files and recoved deleted files >>>> (Computer Forensic Analysis). Thus we would typically be reading the >>>> discarded sectors as well as in-use sectors. >>>> >>>> After reading the original proposal from 2007, I assumed that a read >>>> would provide me either data that had been written specifically to the >>>> sectors read, or that the SSD would return all nulls. That is very >>>> troubling to the ten thousand or so computer forensic examiners in the >>>> USA, but it true we just had to live with it. >>>> >>>> Now reading the Oct. 2008 revision I realized that discarded sectors >>>> are theoretically allowed to return absolutely anything the SSD feels >>>> like returning. Thus the SSD might return data that appears to be >>>> supporting one side of the trial or the other, but it may have been >>>> artificially created by the SSD. And I don't even have a flag that >>>> says "trust this data". >>>> >>>> The way things currently stand with my understanding of the proposed >>>> spec. I will not be able to tell the court anything about the >>>> reliability of any data copied from the SSD regardless of whether it >>>> is part of an active file or not. >>>> >>>> At its most basic level, I transport a typical file on a SSD by >>>> connecting it to computer A, writing data to it, disconnecting from A >>>> and connecting to computer B and then print it from there for court >>>> room use. >>>> >>>> When I read that file from the SSD how can I assure the court that >>>> data I read is even claimed to be reliable by the SSD? >>>> >>>> ie. 
The SSD has no way to say "I believe this data is what was >>>> written to me via computer A" so why should the court or anyone else >>>> trust the data it returns. >>>> >>>> IF the TPRZ bit becomes mandatory for both ATA and SCSI SSDs, then if >>>> it is set I can have confidence that any data read from the device was >>>> actually written to it. >>>> >>>> Lacking the TPRZ bit, ... >>>> >>>> Greg >>>> >>>> >>> >>> I think that the incorrect assumption here is that you as a user can read >>> data that is invalid. If you are using a file system, you will never be >>> able >>> to read those unmapped/freed blocks (the file system will not allow it). >>> >>> If you read the raw device as root, then you could seem random bits of >>> data >>> - maybe data recovery tools would make this an issue? >>> >>> ric >>> >> >> Ric, >> <snip> > This seems to be overstated. The file system layer knows what its valid data > is at any time and will send down unmap/trim commands only when it is sure > that the block is no longer in use. > > The only concern is one of efficiency/performance - the commands are > advisory, so the target can ignore them (i.e., not pre-erase them or > allocate them in T10 to another user). There will be no need for fsck to > look at unallocated blocks. > > The concern we do have is that RAID and checksums must be consistent. Once > read, the device must return the same contents after a trim/unmap so as not > to change the parity/hash/etc. ===> Copy of top post With proposed spec changes for both T10 and T13 a new "unmap" or "trim" command is proposed, respectively. SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf ATA - T13/e08137r2 draft Per the proposed spec changes, the underlying SSD device can optionally modify the unmapped data at its discretion. SCSI T10 at least restricts the way the modification happens, but data modification of unmapped data is still definitely allowed. Thus if a filesystem "discards" a sector, the contents of that sector can change, and parity values are no longer meaningful for the stripe. i.e. If the unmapped blocks don't exactly align with the RAID-5/6 striping, then the integrity of a stripe containing both mapped and unmapped data is lost. A feature bit will be provided to identify SSDs that implement a "stable value on read" feature, meaning that once you read a specific unmapped sector, its contents will not change until it is written. That does not change the fact that a discard command that does not perfectly match the RAID setup may destroy the integrity of a stripe. Thus it seems that either the filesystem will have to understand the RAID-5/6 striping / chunking setup and ensure it never issues a discard command unless an entire stripe is free, or the RAID implementation must snoop the discard commands and take appropriate action. ===> END Copy of top post This seems to introduce some huge layering violations for RAID-5/6 implementations using next-generation SSDs to make up the RAID volumes. I imagine writing reshaping software is hard enough without this going on. <snip> > One serious suggestion is that you take your concerns up with the T13 group directly - few people on this list sit in on those, I believe that it is an open forum. I will have to look into that. The whole idea of what is happening here seems fraught with problems to me. T13 is worse than T10 from what I see, but both seem highly problematic. 
Allowing data to change at the SATA / SAS interface layer and not implementing a signaling mechanism that allows the kernel (or any OS / software tool) to ask which sectors / blocks / erase units have undergone data changes is just bizarre to me. If the unmap command always caused the unmapped sectors to return some fixed value, at least that could be incorporated into a RAID implementation's logic. The current random nature of what the unmap command does is very unsettling to me. Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 18+ messages in thread
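To make the parity concern in the recap above concrete, here is a minimal C sketch. It is a toy model with made-up chunk size, disk count and contents, not md/raid5 code: it shows that once the device silently changes a discarded chunk, the parity computed at write time no longer reconstructs the surviving chunks of the stripe.

/* Toy model: a RAID-5 stripe whose parity was computed before one chunk
 * was discarded.  If the device later returns different contents for the
 * discarded chunk, the stored parity no longer matches.  Not md code. */
#include <stdio.h>
#include <string.h>

#define CHUNK 8            /* bytes per chunk in this toy example */
#define DATA_DISKS 3

static void xor_parity(unsigned char p[CHUNK],
                       unsigned char d[DATA_DISKS][CHUNK])
{
    memset(p, 0, CHUNK);
    for (int i = 0; i < DATA_DISKS; i++)
        for (int j = 0; j < CHUNK; j++)
            p[j] ^= d[i][j];
}

int main(void)
{
    unsigned char data[DATA_DISKS][CHUNK] = {
        "fileA..", "fileB..", "fileC.."
    };
    unsigned char parity[CHUNK], check[CHUNK];

    xor_parity(parity, data);      /* parity written at stripe-write time */

    /* The filesystem discards only the chunk on disk 2; per the draft
     * specs the device may now return anything for it (zeroes here). */
    memset(data[2], 0, CHUNK);

    xor_parity(check, data);       /* what a scrub would recompute */
    printf("parity still valid: %s\n",
           memcmp(parity, check, CHUNK) == 0 ? "yes" : "no");
    /* Prints "no": reconstructing disk 0 or 1 from the old parity would
     * now produce garbage, which is the stripe-integrity loss above. */
    return 0;
}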
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer @ 2009-01-26 17:46 ` Ric Wheeler 2009-01-26 17:47 ` James Bottomley 2009-01-26 17:51 ` Mark Lord 2 siblings, 0 replies; 18+ messages in thread From: Ric Wheeler @ 2009-01-26 17:46 UTC (permalink / raw) To: Greg Freemyer Cc: linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list Greg Freemyer wrote: > Adding mdraid list: > > Top post as a recap for mdraid list (redundantly at end of email if > anyone wants to respond to any of this).: > > == Start RECAP > With proposed spec changes for both T10 and T13 a new "unmap" or > "trim" command is proposed respectively. The linux kernel is > implementing this as a sector discard and will be called by various > file systems as they delete data files. Ext4 will be one of the first > to support this. (At least via out of kernel patches.) > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > ATA - see T13/e08137r2 draft > > Per the proposed spec changes, the underlying SSD device can > optionally modify the unmapped data. SCSI T10 at least restricts the > way the modification happens, but data modification of unmapped data > is still definitely allowed for both classes of SSD. > For either device class, this is not limited to SSD devices (just for clarity). On the SCSI side, this is actually driven mainly by large arrays (like EMC Symm, Clariion, IBM Shark, etc). > Thus if a filesystem "discards" a sector, the contents of the sector > can change and thus parity values are no longer meaningful for the > stripe. > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > stripping, then the integrity of a stripe containing both mapped and > unmapped data is lost. > What this means for RAID (md or dm raid) is that we will need to rebuild the parity after a discard of a stripe for the range of discarded blocks. For T10 devices at least, the devices are required to be consistent with regards to what they return after the unmap. > Thus it seems that either the filesystem will have to understand the > raid 5 / 6 stripping / chunking setup and ensure it never issues a > discard command unless an entire stripe is being discarded. Or that > the raid implementation must must snoop the discard commands and take > appropriate actions. > > FYI: > In T13 a feature bit will be provided to identify ATA SSDs that > implement a "deterministic" feature. Meaning that once you read a > specific unmapped sector, its contents will not change until written > but that does not change the fact that a discard command that does not > perfectly match the raid setup may destroy the integrity of a stripe. > > I believe all T10 (SCSI) devices with be deterministic by spec. > > End of RECAP > > On Mon, Jan 26, 2009 at 11:22 AM, Ric Wheeler <rwheeler@redhat.com> wrote: > >> Greg Freemyer wrote: >> >>> On Fri, Jan 23, 2009 at 6:35 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >>> >>> >>>> Greg Freemyer wrote: >>>> >>>> >>>>> On Fri, Jan 23, 2009 at 5:24 PM, James Bottomley >>>>> <James.Bottomley@hansenpartnership.com> wrote: >>>>> >>>>> >>>>> >>>>>> On Fri, 2009-01-23 at 15:40 -0500, Ric Wheeler wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Greg Freemyer wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Just to make sure I understand, with the proposed trim updates to the >>>>>>>> ATA spec (T13/e08137r2 draft), a SSD can have two kinds of data. 
>>>>>>>> >>>>>>>> Reliable and unreliable. Where unreliable can return zeros, ones, >>>>>>>> old >>>>>>>> data, random made up data, old data slightly adulterated, etc.. >>>>>>>> >>>>>>>> And there is no way for the kernel to distinguish if the particular >>>>>>>> data it is getting from the SSD is of the reliable or unreliable >>>>>>>> type? >>>>>>>> >>>>>>>> For the unreliable data, if the determistic bit is set in the >>>>>>>> identify >>>>>>>> block, then the kernel can be assured of reading the same unreliable >>>>>>>> data repeatedly, but still it has no way of knowing the data it is >>>>>>>> reading was ever even written to the SSD in the first place. >>>>>>>> >>>>>>>> That just seems unacceptable. >>>>>>>> >>>>>>>> Greg >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Hi Greg, >>>>>>> >>>>>>> I sat in on a similar discussion in T10 . With luck, the T13 people >>>>>>> have >>>>>>> the same high level design: >>>>>>> >>>>>>> (1) following a write to sector X, any subsequent read of X will >>>>>>> return >>>>>>> that data >>>>>>> (2) once you DISCARD/UNMAP sector X, the device can return any state >>>>>>> (stale data, all 1's, all 0's) on the next read of that sector, but >>>>>>> must >>>>>>> continue to return that data on following reads until the sector is >>>>>>> rewritten >>>>>>> >>>>>>> >>>>>>> >>>>>> Actually, the latest draft: >>>>>> >>>>>> http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>>>> >>>>>> extends this behaviour: If the array has read capacity(16) TPRZ bit set >>>>>> then the return for an unmapped block is always zero. If TPRZ isn't >>>>>> set, it's undefined but consistent. I think TPRZ is there to address >>>>>> security concerns. >>>>>> >>>>>> James >>>>>> >>>>>> >>>>>> >>>>> To James, >>>>> >>>>> I took a look at the spec, but I'm not familiar with the SCSI spec to >>>>> grok it immediately. >>>>> >>>>> Is the TPRZ bit meant to be a way for the manufacturer to report which >>>>> of the two behaviors their device implements, or is it a externally >>>>> configurable flag that tells the SSD which way to behave? >>>>> >>>>> Either way, is there reason to believe the ATA T13 spec will get >>>>> similar functionality? >>>>> >>>>> To Ric, >>>>> >>>>> First, in general I think is is bizarre to have a device that is by >>>>> spec able to return both reliable and non-reliable data, but the spec >>>>> does not include a signaling method to differentiate between the two. >>>>> >>>>> === >>>>> My very specific concern is that I work with evidence that will >>>>> eventually be presented at court. >>>>> >>>>> We routinely work with both live files and recoved deleted files >>>>> (Computer Forensic Analysis). Thus we would typically be reading the >>>>> discarded sectors as well as in-use sectors. >>>>> >>>>> After reading the original proposal from 2007, I assumed that a read >>>>> would provide me either data that had been written specifically to the >>>>> sectors read, or that the SSD would return all nulls. That is very >>>>> troubling to the ten thousand or so computer forensic examiners in the >>>>> USA, but it true we just had to live with it. >>>>> >>>>> Now reading the Oct. 2008 revision I realized that discarded sectors >>>>> are theoretically allowed to return absolutely anything the SSD feels >>>>> like returning. Thus the SSD might return data that appears to be >>>>> supporting one side of the trial or the other, but it may have been >>>>> artificially created by the SSD. And I don't even have a flag that >>>>> says "trust this data". 
>>>>> >>>>> The way things currently stand with my understanding of the proposed >>>>> spec. I will not be able to tell the court anything about the >>>>> reliability of any data copied from the SSD regardless of whether it >>>>> is part of an active file or not. >>>>> >>>>> At its most basic level, I transport a typical file on a SSD by >>>>> connecting it to computer A, writing data to it, disconnecting from A >>>>> and connecting to computer B and then print it from there for court >>>>> room use. >>>>> >>>>> When I read that file from the SSD how can I assure the court that >>>>> data I read is even claimed to be reliable by the SSD? >>>>> >>>>> ie. The SSD has no way to say "I believe this data is what was >>>>> written to me via computer A" so why should the court or anyone else >>>>> trust the data it returns. >>>>> >>>>> IF the TPRZ bit becomes mandatory for both ATA and SCSI SSDs, then if >>>>> it is set I can have confidence that any data read from the device was >>>>> actually written to it. >>>>> >>>>> Lacking the TPRZ bit, ... >>>>> >>>>> Greg >>>>> >>>>> >>>>> >>>> I think that the incorrect assumption here is that you as a user can read >>>> data that is invalid. If you are using a file system, you will never be >>>> able >>>> to read those unmapped/freed blocks (the file system will not allow it). >>>> >>>> If you read the raw device as root, then you could seem random bits of >>>> data >>>> - maybe data recovery tools would make this an issue? >>>> >>>> ric >>>> >>>> >>> Ric, >>> >>> > > <snip> > > >> This seems to be overstated. The file system layer knows what its valid data >> is at any time and will send down unmap/trim commands only when it is sure >> that the block is no longer in use. >> >> The only concern is one of efficiency/performance - the commands are >> advisory, so the target can ignore them (i.e., not pre-erase them or >> allocate them in T10 to another user). There will be no need for fsck to >> look at unallocated blocks. >> >> The concern we do have is that RAID and checksums must be consistent. Once >> read, the device must return the same contents after a trim/unmap so as not >> to change the parity/hash/etc. >> > > ===> Copy of top post > With proposed spec changes for both T10 and T13 a new "unmap" or > "trim" command is proposed respectively. > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > ATA - T13/e08137r2 draft > > Per the proposed spec changes, the underlying SSD device can > optionally modify the unmapped data at its discretion. SCSI T10 > atleast restricts the way the modification happens, but data > modification of unmapped data is still definitely allowed. > > Thus if a filesystem "discards" a sector, the contents of the sector > can change and thus parity values are no longer meaningful for the > stripe. > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > stripping, then the integrity of a stripe containing both mapped and > unmapped data is lost. > > A feature bit will be provided to identify SSDs that implement a > "stable value on read" feature. Meaning that once you read a specific > unmapped sector, its contents will not change until written but that > does not change the fact that a discard command that does not > perfectly match the raid setup may destroy the integrity of a stripe. > > Thus it seems that either the filesystem will have to understand the > raid 5 / 6 stripping / chunking setup and ensure it never issues a > discard command unless an entire stripe is free. 
Or that the raid > implementation must must snoop the discard commands and take > appropriate actions. > ===> END Copy of top post > > Seems to introduce some huge layering violations for Raid 5 / 6 > implementations using next generation SSDs to comprise the raid > volumes. > > I imagine writing reshaping software is hard enough without this going on. > > <snip> > > >> One serious suggestion is that you take your concerns up with the T13 group directly - few people on this list sit in on those, I believe that it is an open forum. >> > > I will have to look into that. The whole idea of what is happening > here seems fraught with problems to me. T13 is worse than T10 from > what I see, but both seem highly problematic. > > Allowing data to change from the SATA / SAS interface layer and not > implementing a signaling mechanism that allows the kernel (or any OS / > software tool) to ask which sectors / blocks / erase units have > undergone data changes is just bizarre to me. > > I the unmap command always caused the unmap sectors to return some > fixed value, at least that could be incorporated into a raid > implementations logic. > > The current random nature of what unmap command does is very unsettling to me. > > Greg > ^ permalink raw reply [flat|nested] 18+ messages in thread
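Ric's point that md or dm would have to rebuild parity for the range covered by a discard can be sketched in the same toy style. This is an in-memory model with made-up sizes and a pretend device, not md code, and it leans on the T10 assumption that reads of the unmapped range are stable once performed.

/* Sketch of "rebuild parity after a discard of part of a stripe". */
#include <string.h>

#define CHUNK_BYTES 16
#define DATA_DISKS 3

struct stripe {
    unsigned char data[DATA_DISKS][CHUNK_BYTES];
    unsigned char parity[CHUNK_BYTES];
};

/* Stand-in for the device acting on the discard: here it chooses to zero
 * the chunk, which the draft specs permit. */
static void discard_chunk(struct stripe *s, int disk)
{
    memset(s->data[disk], 0, CHUNK_BYTES);
}

/* Re-sync parity from whatever the member disks now return. */
static void rebuild_parity(struct stripe *s)
{
    memset(s->parity, 0, CHUNK_BYTES);
    for (int d = 0; d < DATA_DISKS; d++)
        for (int i = 0; i < CHUNK_BYTES; i++)
            s->parity[i] ^= s->data[d][i];
}

void discard_and_resync(struct stripe *s, int disk)
{
    discard_chunk(s, disk);
    rebuild_parity(s);   /* parity is consistent again for future rebuilds */
}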
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer 2009-01-26 17:46 ` Ric Wheeler @ 2009-01-26 17:47 ` James Bottomley 2009-01-27 5:16 ` Neil Brown 2009-01-26 17:51 ` Mark Lord 2 siblings, 1 reply; 18+ messages in thread From: James Bottomley @ 2009-01-26 17:47 UTC (permalink / raw) To: Greg Freemyer Cc: Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: > Adding mdraid list: > > Top post as a recap for mdraid list (redundantly at end of email if > anyone wants to respond to any of this).: > > == Start RECAP > With proposed spec changes for both T10 and T13 a new "unmap" or > "trim" command is proposed respectively. The linux kernel is > implementing this as a sector discard and will be called by various > file systems as they delete data files. Ext4 will be one of the first > to support this. (At least via out of kernel patches.) > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > ATA - see T13/e08137r2 draft > > Per the proposed spec changes, the underlying SSD device can > optionally modify the unmapped data. SCSI T10 at least restricts the > way the modification happens, but data modification of unmapped data > is still definitely allowed for both classes of SSD. > > Thus if a filesystem "discards" a sector, the contents of the sector > can change and thus parity values are no longer meaningful for the > stripe. This isn't correct. The implementation is via bio and request discard flags. linux raid as a bio->bio mapping entity can choose to drop or implement the discard flag (by default it will be dropped unless the raid layer is modified). > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > stripping, then the integrity of a stripe containing both mapped and > unmapped data is lost. > > Thus it seems that either the filesystem will have to understand the > raid 5 / 6 stripping / chunking setup and ensure it never issues a > discard command unless an entire stripe is being discarded. Or that > the raid implementation must must snoop the discard commands and take > appropriate actions. No. It only works if the discard is supported all the way through the stack to the controller and device ... any point in the stack can drop the discard. It's also theoretically possible that any layer could accumulate them as well (i.e. up to stripe size for raid). James ^ permalink raw reply [flat|nested] 18+ messages in thread
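James's description of a stacking layer simply dropping a discard it does not want to handle can be modelled as below. The struct io and lower_device types are hypothetical stand-ins, not the kernel's bio/request interface; the point is only that a discard the layer cannot honour is completed successfully without ever touching the media.

/* Simplified model of "any layer in the stack may drop a discard". */
#include <stdbool.h>
#include <stdint.h>

struct io {
    uint64_t sector;
    uint32_t nr_sectors;
    bool     discard;        /* models the discard flag on a bio/request */
};

/* What the layer below us advertises and provides. */
struct lower_device {
    bool supports_discard;
    void (*submit)(struct lower_device *dev, struct io *io);
    void (*complete)(struct io *io, int error);
};

/* A stacking driver (md/dm-like) mapping one io onto a lower device.
 * A discard it cannot or will not honour is completed with success and
 * never reaches the device - exactly the "drop" case. */
void stacked_submit(struct lower_device *lower, struct io *io)
{
    if (io->discard && !lower->supports_discard) {
        lower->complete(io, 0);  /* advisory hint dropped, no data touched */
        return;
    }
    /* Otherwise pass it through (a RAID layer would first remap the
     * sector range and, for parity RAID, restrict it to whole stripes). */
    lower->submit(lower, io);
}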
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:47 ` James Bottomley @ 2009-01-27 5:16 ` Neil Brown 2009-01-27 10:49 ` John Robinson ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Neil Brown @ 2009-01-27 5:16 UTC (permalink / raw) To: James Bottomley Cc: Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On Monday January 26, James.Bottomley@HansenPartnership.com wrote: > On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: > > Adding mdraid list: > > > > Top post as a recap for mdraid list (redundantly at end of email if > > anyone wants to respond to any of this).: > > > > == Start RECAP > > With proposed spec changes for both T10 and T13 a new "unmap" or > > "trim" command is proposed respectively. The linux kernel is > > implementing this as a sector discard and will be called by various > > file systems as they delete data files. Ext4 will be one of the first > > to support this. (At least via out of kernel patches.) > > > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > > ATA - see T13/e08137r2 draft > > > > Per the proposed spec changes, the underlying SSD device can > > optionally modify the unmapped data. SCSI T10 at least restricts the > > way the modification happens, but data modification of unmapped data > > is still definitely allowed for both classes of SSD. > > > > Thus if a filesystem "discards" a sector, the contents of the sector > > can change and thus parity values are no longer meaningful for the > > stripe. > > This isn't correct. The implementation is via bio and request discard > flags. linux raid as a bio->bio mapping entity can choose to drop or > implement the discard flag (by default it will be dropped unless the > raid layer is modified). That's good. I would be worried if they could slip through without md/raid noticing. > > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > > stripping, then the integrity of a stripe containing both mapped and > > unmapped data is lost. > > > > Thus it seems that either the filesystem will have to understand the > > raid 5 / 6 stripping / chunking setup and ensure it never issues a > > discard command unless an entire stripe is being discarded. Or that > > the raid implementation must must snoop the discard commands and take > > appropriate actions. > > No. It only works if the discard is supported all the way through the > stack to the controller and device ... any point in the stack can drop > the discard. It's also theoretically possible that any layer could > accumulate them as well (i.e. up to stripe size for raid). Accumulating them in the raid level would probably be awkward. It was my understanding that filesystems would (try to) send the largest possible 'discard' covering any surrounding blocks that had already been discarded. Then e.g. raid5 could just round down any discard request to an aligned number of complete stripes and just discard those. i.e. have all the accumulation done in the filesystem. To be able to safely discard stripes, raid5 would need to remember which stripes were discarded so that it could be sure to write out the whole stripe when updating any block on it, thus ensuring that parity will be correct again and will remain correct. Probably the only practical data structure for this would be a bitmap similar to the current write-intent bitmap. Is it really worth supporting this in raid5? 
Are the sorts of devices that will benefit from 'discard' requests likely to be used inside an md/raid5 array I wonder.... raid1 and raid10 are much easier to handle, so supporting 'discard' there certainly makes sense. NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
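A sketch of the rounding Neil describes, assuming the accumulation really is done in the filesystem: shrink an incoming discard to the aligned complete raid5 stripes it covers and drop it if none are covered. The geometry numbers in main() are illustrative only.

/* Round a discard request down to whole, aligned raid5 stripes. */
#include <stdint.h>
#include <stdio.h>

/* One stripe of user data spans chunk_sectors on each data disk. */
static uint64_t stripe_sectors(uint32_t chunk_sectors, int data_disks)
{
    return (uint64_t)chunk_sectors * data_disks;
}

/* Returns 1 and fills out_start/out_len with a stripe-aligned sub-range,
 * or 0 if the request covers no complete stripe and should be dropped. */
int round_discard_to_stripes(uint64_t start, uint64_t len,
                             uint32_t chunk_sectors, int data_disks,
                             uint64_t *out_start, uint64_t *out_len)
{
    uint64_t ss = stripe_sectors(chunk_sectors, data_disks);
    uint64_t first = (start + ss - 1) / ss * ss;   /* round start up */
    uint64_t end   = (start + len) / ss * ss;      /* round end down */

    if (end <= first)
        return 0;
    *out_start = first;
    *out_len = end - first;
    return 1;
}

int main(void)
{
    uint64_t s, l;
    /* 64KiB chunks (128 sectors), 4 data disks -> 512-sector stripes. */
    if (round_discard_to_stripes(1000, 2000, 128, 4, &s, &l))
        printf("discard sectors %llu..%llu\n",
               (unsigned long long)s, (unsigned long long)(s + l - 1));
    return 0;
}

Anything smaller than a full stripe is simply ignored here, which keeps parity untouched; the cost is that sub-stripe discards never reach the device unless the filesystem merges them first.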
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 5:16 ` Neil Brown @ 2009-01-27 10:49 ` John Robinson 2009-01-28 20:11 ` Bill Davidsen 2009-01-27 11:23 ` Ric Wheeler 2009-01-27 14:48 ` James Bottomley 2 siblings, 1 reply; 18+ messages in thread From: John Robinson @ 2009-01-27 10:49 UTC (permalink / raw) To: Neil Brown Cc: James Bottomley, Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On 27/01/2009 05:16, Neil Brown wrote: [...] > Probably the only practical data structure for this would be a bitmap > similar to the current write-intent bitmap. > > Is it really worth supporting this in raid5? Are the sorts of > devices that will benefit from 'discard' requests likely to be used > inside an md/raid5 array I wonder.... Assuming I've understood correctly, this usage map sounds to me like a useful thing to have for all RAIDs. When building the array in the first place, the initial sync is just writing a usage map saying it's all empty. Filesystem writes and discards update it appropriately. Then when we get failing sectors reported via e.g. SMART or a scrub operation we know whether they're on used or unused areas so whether it's worth attempting recovery. Cheers, John. ^ permalink raw reply [flat|nested] 18+ messages in thread
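The usage map John describes could look roughly like this: one bit per stripe, set on first write, cleared when a whole stripe is discarded, and consulted so the initial sync or a scrub only touches stripes that were ever used. The layout is hypothetical and ignores persistence; a real version would be stored on disk much like the write-intent bitmap.

/* Per-stripe usage map sketch (in-memory only, error handling omitted). */
#include <stdint.h>
#include <stdlib.h>

struct usage_map {
    uint64_t nr_stripes;
    unsigned char *bits;           /* 1 bit per stripe */
};

struct usage_map *usage_map_alloc(uint64_t nr_stripes)
{
    struct usage_map *m = malloc(sizeof(*m));
    m->nr_stripes = nr_stripes;
    m->bits = calloc((nr_stripes + 7) / 8, 1);   /* starts all-unused */
    return m;
}

void mark_used(struct usage_map *m, uint64_t s)   { m->bits[s / 8] |= 1u << (s % 8); }
void mark_unused(struct usage_map *m, uint64_t s) { m->bits[s / 8] &= ~(1u << (s % 8)); }
int  stripe_used(const struct usage_map *m, uint64_t s) { return (m->bits[s / 8] >> (s % 8)) & 1; }

/* Initial sync or scrub only has to touch stripes that were ever written. */
void resync(struct usage_map *m, void (*sync_stripe)(uint64_t))
{
    for (uint64_t s = 0; s < m->nr_stripes; s++)
        if (stripe_used(m, s))
            sync_stripe(s);
}

This is also what gives the "is this failing sector in a used area?" answer John mentions: a reported bad block maps to a stripe number, and the bit says whether recovery is worth attempting.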
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 10:49 ` John Robinson @ 2009-01-28 20:11 ` Bill Davidsen [not found] ` <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com> 0 siblings, 1 reply; 18+ messages in thread From: Bill Davidsen @ 2009-01-28 20:11 UTC (permalink / raw) To: John Robinson Cc: Neil Brown, James Bottomley, Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list John Robinson wrote: > On 27/01/2009 05:16, Neil Brown wrote: > [...] >> Probably the only practical data structure for this would be a bitmap >> similar to the current write-intent bitmap. >> >> Is it really worth supporting this in raid5? Are the sorts of >> devices that will benefit from 'discard' requests likely to be used >> inside an md/raid5 array I wonder.... > > Assuming I've understood correctly, this usage map sounds to me like a > useful thing to have for all RAIDs. When building the array in the > first place, the initial sync is just writing a usage map saying it's > all empty. Filesystem writes and discards update it appropriately. > Then when we get failing sectors reported via e.g. SMART or a scrub > operation we know whether they're on used or unused areas so whether > it's worth attempting recovery. It would seem that this could really speed initialization. A per-stripe "unused" bitmap could save a lot of time in init, but also in the check operation on partially used media. It's not just being nice to SDD, but being nice to power consumption, performance impact, rebuild time... other than the initial coding and testing required, I can't see any downside to this. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 18+ messages in thread
[parent not found: <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com>]
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] [not found] ` <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com> @ 2009-01-29 1:49 ` John Robinson 0 siblings, 0 replies; 18+ messages in thread From: John Robinson @ 2009-01-29 1:49 UTC (permalink / raw) To: Greg Freemyer Cc: Bill Davidsen, Neil Brown, James Bottomley, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On 28/01/2009 23:56, Greg Freemyer wrote: > Once discard calls get into linux file systems mdraid and/or device > mapper could implement linux's own thin provisioning implementation. > Even with traditional disks that don't support unmap. I gather that > is what the EMCs of the world will be doing in their platforms. > > http://en.wikipedia.org/wiki/Thin_provisioning Sounds more like a device mapper or LVM thing (than md/RAID) to me, but I'd definitely agree that this would be another great reason for block devices to implement map/unmap. And I wonder if there's room for another dm/md device type which just implements these usage maps over traditional devices which don't support unmap (much as I was wondering a few weeks back about a soft DIF implementation over e.g. SATA devices). Darn it, I might just have to dig out my school books on C; it's a while since I offered a kernel patch. Cheers, John. ^ permalink raw reply [flat|nested] 18+ messages in thread
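The thin-provisioning idea John refers to boils down to a block-remapping table like the toy below: virtual blocks only get backing blocks on first write and give them back on discard. This is a conceptual sketch only, with no free list, persistence or locking; Linux later gained the dm-thin target for the real thing.

/* Toy allocate-on-first-write mapping, the core of thin provisioning. */
#include <stdint.h>
#include <stdlib.h>

#define UNMAPPED UINT64_MAX

struct thin_dev {
    uint64_t nr_virtual;           /* advertised size, in blocks      */
    uint64_t next_free;            /* naive bump allocator            */
    uint64_t *map;                 /* virtual block -> physical block */
};

struct thin_dev *thin_create(uint64_t nr_virtual)
{
    struct thin_dev *t = malloc(sizeof(*t));
    t->nr_virtual = nr_virtual;
    t->next_free = 0;
    t->map = malloc(nr_virtual * sizeof(uint64_t));
    for (uint64_t i = 0; i < nr_virtual; i++)
        t->map[i] = UNMAPPED;
    return t;
}

/* Write path: allocate backing space lazily. */
uint64_t thin_map_for_write(struct thin_dev *t, uint64_t vblock)
{
    if (t->map[vblock] == UNMAPPED)
        t->map[vblock] = t->next_free++;   /* real code reuses freed blocks */
    return t->map[vblock];
}

/* Discard path: return the backing block to the pool. */
void thin_discard(struct thin_dev *t, uint64_t vblock)
{
    t->map[vblock] = UNMAPPED;             /* should go on a free list */
}

/* Read path: an unmapped virtual block can simply be reported as zeroes,
 * which is the deterministic behaviour the thread asks devices for. */
int thin_is_mapped(const struct thin_dev *t, uint64_t vblock)
{
    return t->map[vblock] != UNMAPPED;
}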
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 5:16 ` Neil Brown 2009-01-27 10:49 ` John Robinson @ 2009-01-27 11:23 ` Ric Wheeler 2009-01-28 20:28 ` Bill Davidsen 2009-01-27 14:48 ` James Bottomley 2 siblings, 1 reply; 18+ messages in thread From: Ric Wheeler @ 2009-01-27 11:23 UTC (permalink / raw) To: Neil Brown Cc: James Bottomley, Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list Neil Brown wrote: > On Monday January 26, James.Bottomley@HansenPartnership.com wrote: > >> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: >> >>> Adding mdraid list: >>> >>> Top post as a recap for mdraid list (redundantly at end of email if >>> anyone wants to respond to any of this).: >>> >>> == Start RECAP >>> With proposed spec changes for both T10 and T13 a new "unmap" or >>> "trim" command is proposed respectively. The linux kernel is >>> implementing this as a sector discard and will be called by various >>> file systems as they delete data files. Ext4 will be one of the first >>> to support this. (At least via out of kernel patches.) >>> >>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>> ATA - see T13/e08137r2 draft >>> >>> Per the proposed spec changes, the underlying SSD device can >>> optionally modify the unmapped data. SCSI T10 at least restricts the >>> way the modification happens, but data modification of unmapped data >>> is still definitely allowed for both classes of SSD. >>> >>> Thus if a filesystem "discards" a sector, the contents of the sector >>> can change and thus parity values are no longer meaningful for the >>> stripe. >>> >> This isn't correct. The implementation is via bio and request discard >> flags. linux raid as a bio->bio mapping entity can choose to drop or >> implement the discard flag (by default it will be dropped unless the >> raid layer is modified). >> > > That's good. I would be worried if they could slip through without > md/raid noticing. > > >>> ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 >>> stripping, then the integrity of a stripe containing both mapped and >>> unmapped data is lost. >>> >>> Thus it seems that either the filesystem will have to understand the >>> raid 5 / 6 stripping / chunking setup and ensure it never issues a >>> discard command unless an entire stripe is being discarded. Or that >>> the raid implementation must must snoop the discard commands and take >>> appropriate actions. >>> >> No. It only works if the discard is supported all the way through the >> stack to the controller and device ... any point in the stack can drop >> the discard. It's also theoretically possible that any layer could >> accumulate them as well (i.e. up to stripe size for raid). >> > > Accumulating them in the raid level would probably be awkward. > > It was my understanding that filesystems would (try to) send the > largest possible 'discard' covering any surrounding blocks that had > already been discarded. Then e.g. raid5 could just round down any > discard request to an aligned number of complete stripes and just > discard those. i.e. have all the accumulation done in the filesystem. > > To be able to safely discard stripes, raid5 would need to remember > which stripes were discarded so that it could be sure to write out the > whole stripe when updating any block on it, thus ensuring that parity > will be correct again and will remain correct. 
> > Probably the only practical data structure for this would be a bitmap > similar to the current write-intent bitmap. > > Is it really worth supporting this in raid5? Are the sorts of > devices that will benefit from 'discard' requests likely to be used > inside an md/raid5 array I wonder.... > > raid1 and raid10 are much easier to handle, so supporting 'discard' > there certainly makes sense. > > NeilBrown > -- > The benefit is also seen by SSD devices (T13) and high end arrays (T10). On the array end, they almost universally do RAID support internally. I suppose that people might make RAID5 devices out of SSD's locally, but it is probably not an immediate priority.... ric ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 11:23 ` Ric Wheeler @ 2009-01-28 20:28 ` Bill Davidsen 0 siblings, 0 replies; 18+ messages in thread From: Bill Davidsen @ 2009-01-28 20:28 UTC (permalink / raw) To: Ric Wheeler Cc: Neil Brown, James Bottomley, Greg Freemyer, linux-raid, Dongjun Shin, IDE/ATA development list Ric Wheeler wrote: > Neil Brown wrote: >> On Monday January 26, James.Bottomley@HansenPartnership.com wrote: >> >>> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: >>> >>>> Adding mdraid list: >>>> >>>> Top post as a recap for mdraid list (redundantly at end of email if >>>> anyone wants to respond to any of this).: >>>> >>>> == Start RECAP >>>> With proposed spec changes for both T10 and T13 a new "unmap" or >>>> "trim" command is proposed respectively. The linux kernel is >>>> implementing this as a sector discard and will be called by various >>>> file systems as they delete data files. Ext4 will be one of the first >>>> to support this. (At least via out of kernel patches.) >>>> >>>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>> ATA - see T13/e08137r2 draft >>>> >>>> Per the proposed spec changes, the underlying SSD device can >>>> optionally modify the unmapped data. SCSI T10 at least restricts the >>>> way the modification happens, but data modification of unmapped data >>>> is still definitely allowed for both classes of SSD. >>>> >>>> Thus if a filesystem "discards" a sector, the contents of the sector >>>> can change and thus parity values are no longer meaningful for the >>>> stripe. >>>> >>> This isn't correct. The implementation is via bio and request discard >>> flags. linux raid as a bio->bio mapping entity can choose to drop or >>> implement the discard flag (by default it will be dropped unless the >>> raid layer is modified). >>> >> >> That's good. I would be worried if they could slip through without >> md/raid noticing. >> >> >>>> ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 >>>> stripping, then the integrity of a stripe containing both mapped and >>>> unmapped data is lost. >>>> >>>> Thus it seems that either the filesystem will have to understand the >>>> raid 5 / 6 stripping / chunking setup and ensure it never issues a >>>> discard command unless an entire stripe is being discarded. Or that >>>> the raid implementation must must snoop the discard commands and take >>>> appropriate actions. >>>> >>> No. It only works if the discard is supported all the way through the >>> stack to the controller and device ... any point in the stack can drop >>> the discard. It's also theoretically possible that any layer could >>> accumulate them as well (i.e. up to stripe size for raid). >>> >> >> Accumulating them in the raid level would probably be awkward. >> >> It was my understanding that filesystems would (try to) send the >> largest possible 'discard' covering any surrounding blocks that had >> already been discarded. Then e.g. raid5 could just round down any >> discard request to an aligned number of complete stripes and just >> discard those. i.e. have all the accumulation done in the filesystem. >> >> To be able to safely discard stripes, raid5 would need to remember >> which stripes were discarded so that it could be sure to write out the >> whole stripe when updating any block on it, thus ensuring that parity >> will be correct again and will remain correct. 
>> >> Probably the only practical data structure for this would be a bitmap >> similar to the current write-intent bitmap. >> >> Is it really worth supporting this in raid5? Are the sorts of >> devices that will benefit from 'discard' requests likely to be used >> inside an md/raid5 array I wonder.... >> >> raid1 and raid10 are much easier to handle, so supporting 'discard' >> there certainly makes sense. >> >> NeilBrown >> -- >> > > The benefit is also seen by SSD devices (T13) and high end arrays > (T10). On the array end, they almost universally do RAID support > internally. > > I suppose that people might make RAID5 devices out of SSD's locally, > but it is probably not an immediate priority.... Depends on how you define "priority" here. It probably would not make much of a performance difference, it might make a significant lifetime difference in the devices. Not RAID5, RAID6. As seek times shrink things which were performance limited become practical, journaling file systems are not a problem just a solution, mounting with atime disabled isn't needed, etc. I was given some CF to PATA adapters to test, and as soon as I grab some 16GB CFs I intend to try a 32GB RAID6. I have a perfect application for it, and if it works well after I test I can put journal files on it. I just wish I had a file system which could put the journal, inodes, and directories all on the fast device and leaves the files (data) on something cheap. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 5:16 ` Neil Brown 2009-01-27 10:49 ` John Robinson 2009-01-27 11:23 ` Ric Wheeler @ 2009-01-27 14:48 ` James Bottomley 2009-01-27 14:54 ` Ric Wheeler 2 siblings, 1 reply; 18+ messages in thread From: James Bottomley @ 2009-01-27 14:48 UTC (permalink / raw) To: Neil Brown Cc: Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On Tue, 2009-01-27 at 16:16 +1100, Neil Brown wrote: > On Monday January 26, James.Bottomley@HansenPartnership.com wrote: > > On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: > > > Adding mdraid list: > > > > > > Top post as a recap for mdraid list (redundantly at end of email if > > > anyone wants to respond to any of this).: > > > > > > == Start RECAP > > > With proposed spec changes for both T10 and T13 a new "unmap" or > > > "trim" command is proposed respectively. The linux kernel is > > > implementing this as a sector discard and will be called by various > > > file systems as they delete data files. Ext4 will be one of the first > > > to support this. (At least via out of kernel patches.) > > > > > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > > > ATA - see T13/e08137r2 draft > > > > > > Per the proposed spec changes, the underlying SSD device can > > > optionally modify the unmapped data. SCSI T10 at least restricts the > > > way the modification happens, but data modification of unmapped data > > > is still definitely allowed for both classes of SSD. > > > > > > Thus if a filesystem "discards" a sector, the contents of the sector > > > can change and thus parity values are no longer meaningful for the > > > stripe. > > > > This isn't correct. The implementation is via bio and request discard > > flags. linux raid as a bio->bio mapping entity can choose to drop or > > implement the discard flag (by default it will be dropped unless the > > raid layer is modified). > > That's good. I would be worried if they could slip through without > md/raid noticing. > > > > > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > > > stripping, then the integrity of a stripe containing both mapped and > > > unmapped data is lost. > > > > > > Thus it seems that either the filesystem will have to understand the > > > raid 5 / 6 stripping / chunking setup and ensure it never issues a > > > discard command unless an entire stripe is being discarded. Or that > > > the raid implementation must must snoop the discard commands and take > > > appropriate actions. > > > > No. It only works if the discard is supported all the way through the > > stack to the controller and device ... any point in the stack can drop > > the discard. It's also theoretically possible that any layer could > > accumulate them as well (i.e. up to stripe size for raid). > > Accumulating them in the raid level would probably be awkward. > > It was my understanding that filesystems would (try to) send the > largest possible 'discard' covering any surrounding blocks that had > already been discarded. Then e.g. raid5 could just round down any > discard request to an aligned number of complete stripes and just > discard those. i.e. have all the accumulation done in the filesystem. The jury is still out on this one. Array manufacturers, who would probably like this as well because their internal granularity for thin provisioning is reputedly huge (in the megabytes). However, trim and discard are being driven by SSD which has no such need. 
> To be able to safely discard stripes, raid5 would need to remember > which stripes were discarded so that it could be sure to write out the > whole stripe when updating any block on it, thus ensuring that parity > will be correct again and will remain correct. right. This gives you a minimal discard size of the stripe width. > Probably the only practical data structure for this would be a bitmap > similar to the current write-intent bitmap. Hmm ... the feature you're talking about is called white space elimination by most in the industry. The layer above RAID (usually fs) knows this information exactly ... if there were a way to pass it on, there'd be no need to store it separately. > Is it really worth supporting this in raid5? Are the sorts of > devices that will benefit from 'discard' requests likely to be used > inside an md/raid5 array I wonder.... There's no hard data on how useful Trim will be in general. The idea is it allows SSDs to pre-erase (which can be a big deal) and for Thin Provisioning it allows just in time storage decisions. However, all thin provision devices are likely to do RAID internally ... > raid1 and raid10 are much easier to handle, so supporting 'discard' > there certainly makes sense. James ^ permalink raw reply [flat|nested] 18+ messages in thread
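The "minimal discard size of the stripe width" point suggests the other strategy James mentioned, accumulating sub-stripe discards in the RAID layer and only passing one down once every data chunk of a stripe has been discarded. A rough sketch with a per-stripe chunk bitmask follows; it assumes at most 32 data disks, and a later write into the stripe would have to clear its mask again.

/* Accumulate chunk-sized discards until a whole stripe is free. */
#include <stdint.h>

struct discard_accumulator {
    uint32_t chunks_per_stripe;    /* data chunks per stripe (<= 32 here)   */
    uint32_t *chunk_mask;          /* per-stripe bitmask of discarded chunks */
};

/* Record that one whole chunk of a stripe has been discarded by the layer
 * above.  Returns 1 when the stripe is now completely unused and a
 * stripe-wide discard may be sent to the member devices; 0 otherwise. */
int note_chunk_discard(struct discard_accumulator *a,
                       uint64_t stripe, uint32_t chunk)
{
    uint32_t full = (a->chunks_per_stripe == 32)
                        ? UINT32_MAX
                        : (1u << a->chunks_per_stripe) - 1u;

    a->chunk_mask[stripe] |= 1u << chunk;
    if (a->chunk_mask[stripe] != full)
        return 0;

    a->chunk_mask[stripe] = 0;     /* any later write must also clear this */
    return 1;
}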
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 14:48 ` James Bottomley @ 2009-01-27 14:54 ` Ric Wheeler 0 siblings, 0 replies; 18+ messages in thread From: Ric Wheeler @ 2009-01-27 14:54 UTC (permalink / raw) To: James Bottomley Cc: Neil Brown, Greg Freemyer, linux-raid, Dongjun Shin, IDE/ATA development list James Bottomley wrote: > On Tue, 2009-01-27 at 16:16 +1100, Neil Brown wrote: > >> On Monday January 26, James.Bottomley@HansenPartnership.com wrote: >> >>> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: >>> >>>> Adding mdraid list: >>>> >>>> Top post as a recap for mdraid list (redundantly at end of email if >>>> anyone wants to respond to any of this).: >>>> >>>> == Start RECAP >>>> With proposed spec changes for both T10 and T13 a new "unmap" or >>>> "trim" command is proposed respectively. The linux kernel is >>>> implementing this as a sector discard and will be called by various >>>> file systems as they delete data files. Ext4 will be one of the first >>>> to support this. (At least via out of kernel patches.) >>>> >>>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>> ATA - see T13/e08137r2 draft >>>> >>>> Per the proposed spec changes, the underlying SSD device can >>>> optionally modify the unmapped data. SCSI T10 at least restricts the >>>> way the modification happens, but data modification of unmapped data >>>> is still definitely allowed for both classes of SSD. >>>> >>>> Thus if a filesystem "discards" a sector, the contents of the sector >>>> can change and thus parity values are no longer meaningful for the >>>> stripe. >>>> >>> This isn't correct. The implementation is via bio and request discard >>> flags. linux raid as a bio->bio mapping entity can choose to drop or >>> implement the discard flag (by default it will be dropped unless the >>> raid layer is modified). >>> >> That's good. I would be worried if they could slip through without >> md/raid noticing. >> >> >>>> ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 >>>> stripping, then the integrity of a stripe containing both mapped and >>>> unmapped data is lost. >>>> >>>> Thus it seems that either the filesystem will have to understand the >>>> raid 5 / 6 stripping / chunking setup and ensure it never issues a >>>> discard command unless an entire stripe is being discarded. Or that >>>> the raid implementation must must snoop the discard commands and take >>>> appropriate actions. >>>> >>> No. It only works if the discard is supported all the way through the >>> stack to the controller and device ... any point in the stack can drop >>> the discard. It's also theoretically possible that any layer could >>> accumulate them as well (i.e. up to stripe size for raid). >>> >> Accumulating them in the raid level would probably be awkward. >> >> It was my understanding that filesystems would (try to) send the >> largest possible 'discard' covering any surrounding blocks that had >> already been discarded. Then e.g. raid5 could just round down any >> discard request to an aligned number of complete stripes and just >> discard those. i.e. have all the accumulation done in the filesystem. >> > > The jury is still out on this one. Array manufacturers, who would > probably like this as well because their internal granularity for thin > provisioning is reputedly huge (in the megabytes). However, trim and > discard are being driven by SSD which has no such need. 
> I have heard from some array vendors of sizes that range from 8k erase chunks (pretty easy for us) up to 768KB, but not up to megabytes.... ric > >> To be able to safely discard stripes, raid5 would need to remember >> which stripes were discarded so that it could be sure to write out the >> whole stripe when updating any block on it, thus ensuring that parity >> will be correct again and will remain correct. >> > > right. This gives you a minimal discard size of the stripe width. > > >> Probably the only practical data structure for this would be a bitmap >> similar to the current write-intent bitmap. >> > > Hmm ... the feature you're talking about is called white space > elimination by most in the industry. The layer above RAID (usually fs) > knows this information exactly ... if there were a way to pass it on, > there'd be no need to store it separately. > > >> Is it really worth supporting this in raid5? Are the sorts of >> devices that will benefit from 'discard' requests likely to be used >> inside an md/raid5 array I wonder.... >> > > There's no hard data on how useful Trim will be in general. The idea is > it allows SSDs to pre-erase (which can be a big deal) and for Thin > Provisioning it allows just in time storage decisions. However, all > thin provision devices are likely to do RAID internally ... > > >> raid1 and raid10 are much easier to handle, so supporting 'discard' >> there certainly makes sense. >> > > James > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer 2009-01-26 17:46 ` Ric Wheeler 2009-01-26 17:47 ` James Bottomley @ 2009-01-26 17:51 ` Mark Lord 2009-01-26 18:09 ` Greg Freemyer 2 siblings, 1 reply; 18+ messages in thread From: Mark Lord @ 2009-01-26 17:51 UTC (permalink / raw) To: Greg Freemyer Cc: Ric Wheeler, linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list Greg Freemyer wrote: > > Seems to introduce some huge layering violations for Raid 5 / 6 > implementations using next generation SSDs to comprise the raid > volumes. .. Possibly so. But having stripe layouts known to the fs layer is a *good* thing, and is pretty much already necessary for decent filesystem performance. It would be better even, if the filesystem would just automatically pick up that information, rather than relying upon mkfs parameters (maybe they already do now ?). > Allowing data to change from the SATA / SAS interface layer and not > implementing a signaling mechanism that allows the kernel (or any OS / > software tool) to ask which sectors / blocks / erase units have > undergone data changes is just bizarre to me. .. I think that's just blowing smoke. The only sectors/blocks/erase-units which even *might* undergo such changes, are already restricted to those exact units which the kernel itself specificies (by explicitly discarding them). If we care about knowing which ones later on (and we don't, generally), then we (kernel) can maintain a list, just like we do for other such entities. I don't see much of a downside here for any normal users of filesystems. This kind of feature is long overdue. Cheers ^ permalink raw reply [flat|nested] 18+ messages in thread
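Mark's point about the filesystem picking up the stripe layout automatically is already workable for md by reading sysfs; the sketch below derives the mke2fs stride and stripe-width values from it. The /dev/md0 name, the raid5 assumption (one parity disk) and the 4KiB filesystem block size are illustrative, and chunk_size is assumed to be reported in bytes.

/* Derive ext2/3/4 stride and stripe-width from md sysfs geometry. */
#include <stdio.h>

static long read_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long chunk_bytes = read_long("/sys/block/md0/md/chunk_size");
    long raid_disks  = read_long("/sys/block/md0/md/raid_disks");
    long fs_block = 4096;

    if (chunk_bytes <= 0 || raid_disks <= 1) {
        fprintf(stderr, "could not read md geometry\n");
        return 1;
    }
    long data_disks = raid_disks - 1;           /* raid5: one parity disk   */
    long stride = chunk_bytes / fs_block;       /* fs blocks per chunk      */
    long stripe_width = stride * data_disks;    /* fs blocks per full stripe */

    printf("mkfs.ext4 -b %ld -E stride=%ld,stripe-width=%ld /dev/md0\n",
           fs_block, stride, stripe_width);
    return 0;
}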
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:51 ` Mark Lord @ 2009-01-26 18:09 ` Greg Freemyer 2009-01-26 18:21 ` Mark Lord 0 siblings, 1 reply; 18+ messages in thread From: Greg Freemyer @ 2009-01-26 18:09 UTC (permalink / raw) To: Mark Lord Cc: Ric Wheeler, linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list On Mon, Jan 26, 2009 at 12:51 PM, Mark Lord <liml@rtr.ca> wrote: > Greg Freemyer wrote: >> >> Seems to introduce some huge layering violations for Raid 5 / 6 >> implementations using next generation SSDs to comprise the raid >> volumes. > > .. > > Possibly so. But having stripe layouts known to the fs layer > is a *good* thing, and is pretty much already necessary for decent > filesystem performance. It would be better even, if the filesystem > would just automatically pick up that information, rather than > relying upon mkfs parameters (maybe they already do now ?). > >> Allowing data to change from the SATA / SAS interface layer and not >> implementing a signaling mechanism that allows the kernel (or any OS / >> software tool) to ask which sectors / blocks / erase units have >> undergone data changes is just bizarre to me. > > .. > > I think that's just blowing smoke. The only sectors/blocks/erase-units > which > even *might* undergo such changes, are already restricted to those exact > units which the kernel itself specificies (by explicitly discarding them). > If we care about knowing which ones later on (and we don't, generally), > then we (kernel) can maintain a list, just like we do for other such > entities. > > I don't see much of a downside here for any normal users of filesystems. > This kind of feature is long overdue. > > Cheers > Just so I know and before I try to find the right way to comment to the T13 and T10 committees: What is the negative of adding a ATA/SCSI command to allow the storage device to be interrogated to see what sectors/blocks/erase-units are in a mapped vs. unmapped state? For T13 (ATA), it would actually be a tri-state flag I gather: Mapped - last data written available unmapped - no data values assigned unmapped - deterministic data values Surely the storage device has to be keeping this data internally. Why not expose it? For data recovery and computer forensics purposes I would actually prefer even finer grained info, but those 3 states seem useful for several layers of the filesystem stack to be able to more fully optimize what they are doing. Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Mark Lord @ 2009-01-26 18:21 UTC
To: Greg Freemyer
Cc: Ric Wheeler, linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list

Greg Freemyer wrote:
>
> Just so I know, and before I try to find the right way to comment to
> the T13 and T10 committees:
>
> What is the negative of adding an ATA/SCSI command to allow the storage
> device to be interrogated to see which sectors/blocks/erase-units are
> in a mapped vs. unmapped state?
>
> For T13 (ATA), it would actually be a tri-state flag, I gather:
>
>   mapped   - last data written is available
>   unmapped - no data values assigned
>   unmapped - deterministic data values
>
> Surely the storage device has to be keeping this data internally.  Why
> not expose it?
..
That's a good approach.  One problem on the drive end is that they may
not maintain central lists of these.  Rather, it might be stored as
local flags in each affected sector.

So any attempt to query "all" is probably not feasible, but it should
be simple enough to "query one" sector at a time for that info.  If one
wants them all, then just loop over the entire device doing the "query
one" for each sector.

Another drawback to "query all" is that the size of the data (the list)
returned is unknown in advance.  We generally don't have commands that
work like that.

Cheers
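A rough sketch of the "query one sector at a time" loop described above, again assuming a hypothetical per-sector interrogation command; query_sector_state() below is only a stub standing in for that nonexistent command.

/*
 * Sketch of the "query one sector at a time" approach.  A real tool
 * would issue the (hypothetical) interrogation command through the
 * device's pass-through interface; here a stub returns a fixed value.
 */
#include <stdint.h>
#include <stdio.h>

enum sector_map_state {
        SECTOR_MAPPED,
        SECTOR_UNMAPPED_UNDEFINED,
        SECTOR_UNMAPPED_DETERMINISTIC,
};

static enum sector_map_state query_sector_state(int fd, uint64_t lba)
{
        (void)fd;
        (void)lba;
        /* Placeholder result; the hypothetical command would go here. */
        return SECTOR_UNMAPPED_UNDEFINED;
}

/* Walk the whole device, one LBA at a time, reporting unmapped sectors. */
static void report_unmapped(int fd, uint64_t nr_sectors)
{
        for (uint64_t lba = 0; lba < nr_sectors; lba++) {
                enum sector_map_state s = query_sector_state(fd, lba);

                if (s != SECTOR_MAPPED)
                        printf("LBA %llu: %s\n", (unsigned long long)lba,
                               s == SECTOR_UNMAPPED_DETERMINISTIC ?
                               "unmapped (deterministic)" :
                               "unmapped (undefined)");
        }
}

The per-sector loop trades command overhead for a fixed, known response size, which is the drawback Mark raises about any "query all" variant.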
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Dongjun Shin @ 2009-01-29 14:07 UTC
To: Mark Lord, Greg Freemyer
Cc: Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

On Tue, Jan 27, 2009 at 3:21 AM, Mark Lord <liml@rtr.ca> wrote:
> Greg Freemyer wrote:
>>
>> Just so I know, and before I try to find the right way to comment to
>> the T13 and T10 committees:
>>
>> What is the negative of adding an ATA/SCSI command to allow the storage
>> device to be interrogated to see which sectors/blocks/erase-units are
>> in a mapped vs. unmapped state?
>>
>> For T13 (ATA), it would actually be a tri-state flag, I gather:
>>
>>   mapped   - last data written is available
>>   unmapped - no data values assigned
>>   unmapped - deterministic data values
>>
>> Surely the storage device has to be keeping this data internally.  Why
>> not expose it?
>
> ..
>
> That's a good approach.  One problem on the drive end is that they may
> not maintain central lists of these.  Rather, it might be stored as
> local flags in each affected sector.
>
> So any attempt to query "all" is probably not feasible, but it should
> be simple enough to "query one" sector at a time for that info.  If one
> wants them all, then just loop over the entire device doing the "query
> one" for each sector.
>
> Another drawback to "query all" is that the size of the data (the list)
> returned is unknown in advance.  We generally don't have commands that
> work like that.
>

I'm not sure the map/unmap flag will help the situation for the
forensic examiner.

What if the SSD always returns zeros for the unmapped area?  (That
would not violate the spec.)  Although the flag would tell you the
exact state of the sectors, zero-filled sectors offer no hint for
forensic analysis.

My personal opinion is that a map/unmap flag will result in unnecessary
overhead for the common use cases.

Please also consider that, although not directly related to trim,
features like Full-Disk-Encryption are also making things difficult
for you.

--
Dongjun
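The objection above can be restated in code: if unmapped sectors read back as zeros, an examiner can already recognize "nothing recoverable here" without any new flag. A trivial check, assuming 512-byte sector buffers:

/* Minimal check: does this sector buffer contain only zeros? */
#include <stddef.h>

static int sector_is_all_zero(const unsigned char *buf, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                if (buf[i] != 0)
                        return 0;
        return 1;
}

Of course this cannot distinguish a deliberately written run of zeros from a zero-returning unmapped sector, which is exactly the limitation for forensic analysis.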
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Mark Lord @ 2009-01-29 15:46 UTC
To: Dongjun Shin
Cc: Greg Freemyer, Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

Dongjun Shin wrote:
..
> Please also consider that, although not directly related to trim,
> features like Full-Disk-Encryption are also making things difficult
> for you.
..
Assuming the encryption doesn't have a state-mandated backdoor hidden
inside.
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Greg Freemyer @ 2009-01-29 16:27 UTC
To: Mark Lord
Cc: Dongjun Shin, Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

On Thu, Jan 29, 2009 at 10:46 AM, Mark Lord <liml@rtr.ca> wrote:
> Dongjun Shin wrote:
> ..
>> Please also consider that, although not directly related to trim,
>> features like Full-Disk-Encryption are also making things difficult
>> for you.
>
> ..
>
> Assuming the encryption doesn't have a state-mandated backdoor hidden
> inside.

The timing could not be worse.  Last year if we could not get the
password, we could just send the user to Gitmo!!!

(I am not affiliated with any gov't agency. ;-)
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Bill Davidsen @ 2009-01-30 15:43 UTC
To: Greg Freemyer
Cc: Mark Lord, Dongjun Shin, Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

Greg Freemyer wrote:
> On Thu, Jan 29, 2009 at 10:46 AM, Mark Lord <liml@rtr.ca> wrote:
>> Dongjun Shin wrote:
>> ..
>>> Please also consider that, although not directly related to trim,
>>> features like Full-Disk-Encryption are also making things difficult
>>> for you.
>>
>> ..
>>
>> Assuming the encryption doesn't have a state-mandated backdoor hidden
>> inside.
>
> The timing could not be worse.  Last year if we could not get the
> password, we could just send the user to Gitmo!!!
>
I believe that in England, or maybe the whole EU, you can be forced to
give up your passwords and be imprisoned until you do.  There was an
article about that and I don't remember the details, but it was enough
to make me leave my computer at home when I visit England again.

--
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..."  Otto von Bismark
Thread overview: 18+ messages
[not found] <87f94c370901221553p4d3a749fl4717deabba5419ec@mail.gmail.com>
[not found] ` <497A2B3C.3060603@redhat.com>
[not found] ` <1232749447.3250.146.camel@localhost.localdomain>
[not found] ` <87f94c370901231526jb41ea66ta1d6a23d7631d63c@mail.gmail.com>
[not found] ` <497A542C.1040900@redhat.com>
[not found] ` <7fce22690901260659u30ffd634m3fb7f75102141ee9@mail.gmail.com>
[not found] ` <497DE35C.6090308@redhat.com>
2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer
2009-01-26 17:46 ` Ric Wheeler
2009-01-26 17:47 ` James Bottomley
2009-01-27 5:16 ` Neil Brown
2009-01-27 10:49 ` John Robinson
2009-01-28 20:11 ` Bill Davidsen
[not found] ` <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com>
2009-01-29 1:49 ` John Robinson
2009-01-27 11:23 ` Ric Wheeler
2009-01-28 20:28 ` Bill Davidsen
2009-01-27 14:48 ` James Bottomley
2009-01-27 14:54 ` Ric Wheeler
2009-01-26 17:51 ` Mark Lord
2009-01-26 18:09 ` Greg Freemyer
2009-01-26 18:21 ` Mark Lord
2009-01-29 14:07 ` Dongjun Shin
2009-01-29 15:46 ` Mark Lord
2009-01-29 16:27 ` Greg Freemyer
2009-01-30 15:43 ` Bill Davidsen