* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] [not found] ` <497DE35C.6090308@redhat.com> @ 2009-01-26 17:34 ` Greg Freemyer 2009-01-26 17:46 ` Ric Wheeler ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Greg Freemyer @ 2009-01-26 17:34 UTC (permalink / raw) To: Ric Wheeler, linux-raid Cc: James Bottomley, Dongjun Shin, IDE/ATA development list Adding mdraid list: Top post as a recap for the mdraid list (repeated at the end of this email in case anyone wants to respond to any of it): == Start RECAP With proposed spec changes for both T10 and T13, a new "unmap" or "trim" command is being added, respectively. The Linux kernel is implementing this as a sector discard, which will be issued by various file systems as they delete files. Ext4 will be one of the first to support this (at least via out-of-kernel patches). SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf ATA - see T13/e08137r2 draft Per the proposed spec changes, the underlying SSD device can optionally modify the unmapped data. SCSI T10 at least restricts the way the modification happens, but data modification of unmapped data is still definitely allowed for both classes of SSD. Thus if a filesystem "discards" a sector, the contents of that sector can change, and parity values are no longer meaningful for the stripe. i.e. If the unmapped blocks don't exactly align with the RAID-5/6 striping, then the integrity of a stripe containing both mapped and unmapped data is lost. Thus it seems that either the filesystem will have to understand the RAID-5/6 striping / chunking setup and ensure it never issues a discard command unless an entire stripe is being discarded, or the RAID implementation must snoop the discard commands and take appropriate action. FYI: In T13 a feature bit will be provided to identify ATA SSDs that implement a "deterministic" feature, meaning that once you read a specific unmapped sector, its contents will not change until it is written. That does not change the fact that a discard command that does not perfectly match the RAID setup may destroy the integrity of a stripe. I believe all T10 (SCSI) devices will be deterministic by spec. End of RECAP On Mon, Jan 26, 2009 at 11:22 AM, Ric Wheeler <rwheeler@redhat.com> wrote: > Greg Freemyer wrote: >> >> On Fri, Jan 23, 2009 at 6:35 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >> >>> >>> Greg Freemyer wrote: >>> >>>> >>>> On Fri, Jan 23, 2009 at 5:24 PM, James Bottomley >>>> <James.Bottomley@hansenpartnership.com> wrote: >>>> >>>> >>>>> >>>>> On Fri, 2009-01-23 at 15:40 -0500, Ric Wheeler wrote: >>>>> >>>>> >>>>>> >>>>>> Greg Freemyer wrote: >>>>>> >>>>>> >>>>>>> >>>>>>> Just to make sure I understand, with the proposed trim updates to the >>>>>>> ATA spec (T13/e08137r2 draft), a SSD can have two kinds of data. >>>>>>> >>>>>>> Reliable and unreliable. Where unreliable can return zeros, ones, >>>>>>> old >>>>>>> data, random made up data, old data slightly adulterated, etc.. >>>>>>> >>>>>>> And there is no way for the kernel to distinguish if the particular >>>>>>> data it is getting from the SSD is of the reliable or unreliable >>>>>>> type? >>>>>>> >>>>>>> For the unreliable data, if the determistic bit is set in the >>>>>>> identify >>>>>>> block, then the kernel can be assured of reading the same unreliable >>>>>>> data repeatedly, but still it has no way of knowing the data it is >>>>>>> reading was ever even written to the SSD in the first place. 
>>>>>>> >>>>>>> That just seems unacceptable. >>>>>>> >>>>>>> Greg >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> Hi Greg, >>>>>> >>>>>> I sat in on a similar discussion in T10 . With luck, the T13 people >>>>>> have >>>>>> the same high level design: >>>>>> >>>>>> (1) following a write to sector X, any subsequent read of X will >>>>>> return >>>>>> that data >>>>>> (2) once you DISCARD/UNMAP sector X, the device can return any state >>>>>> (stale data, all 1's, all 0's) on the next read of that sector, but >>>>>> must >>>>>> continue to return that data on following reads until the sector is >>>>>> rewritten >>>>>> >>>>>> >>>>> >>>>> Actually, the latest draft: >>>>> >>>>> http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>>> >>>>> extends this behaviour: If the array has read capacity(16) TPRZ bit set >>>>> then the return for an unmapped block is always zero. If TPRZ isn't >>>>> set, it's undefined but consistent. I think TPRZ is there to address >>>>> security concerns. >>>>> >>>>> James >>>>> >>>>> >>>> >>>> To James, >>>> >>>> I took a look at the spec, but I'm not familiar with the SCSI spec to >>>> grok it immediately. >>>> >>>> Is the TPRZ bit meant to be a way for the manufacturer to report which >>>> of the two behaviors their device implements, or is it a externally >>>> configurable flag that tells the SSD which way to behave? >>>> >>>> Either way, is there reason to believe the ATA T13 spec will get >>>> similar functionality? >>>> >>>> To Ric, >>>> >>>> First, in general I think is is bizarre to have a device that is by >>>> spec able to return both reliable and non-reliable data, but the spec >>>> does not include a signaling method to differentiate between the two. >>>> >>>> === >>>> My very specific concern is that I work with evidence that will >>>> eventually be presented at court. >>>> >>>> We routinely work with both live files and recoved deleted files >>>> (Computer Forensic Analysis). Thus we would typically be reading the >>>> discarded sectors as well as in-use sectors. >>>> >>>> After reading the original proposal from 2007, I assumed that a read >>>> would provide me either data that had been written specifically to the >>>> sectors read, or that the SSD would return all nulls. That is very >>>> troubling to the ten thousand or so computer forensic examiners in the >>>> USA, but it true we just had to live with it. >>>> >>>> Now reading the Oct. 2008 revision I realized that discarded sectors >>>> are theoretically allowed to return absolutely anything the SSD feels >>>> like returning. Thus the SSD might return data that appears to be >>>> supporting one side of the trial or the other, but it may have been >>>> artificially created by the SSD. And I don't even have a flag that >>>> says "trust this data". >>>> >>>> The way things currently stand with my understanding of the proposed >>>> spec. I will not be able to tell the court anything about the >>>> reliability of any data copied from the SSD regardless of whether it >>>> is part of an active file or not. >>>> >>>> At its most basic level, I transport a typical file on a SSD by >>>> connecting it to computer A, writing data to it, disconnecting from A >>>> and connecting to computer B and then print it from there for court >>>> room use. >>>> >>>> When I read that file from the SSD how can I assure the court that >>>> data I read is even claimed to be reliable by the SSD? >>>> >>>> ie. 
The SSD has no way to say "I believe this data is what was >>>> written to me via computer A" so why should the court or anyone else >>>> trust the data it returns. >>>> >>>> IF the TPRZ bit becomes mandatory for both ATA and SCSI SSDs, then if >>>> it is set I can have confidence that any data read from the device was >>>> actually written to it. >>>> >>>> Lacking the TPRZ bit, ... >>>> >>>> Greg >>>> >>>> >>> >>> I think that the incorrect assumption here is that you as a user can read >>> data that is invalid. If you are using a file system, you will never be >>> able >>> to read those unmapped/freed blocks (the file system will not allow it). >>> >>> If you read the raw device as root, then you could seem random bits of >>> data >>> - maybe data recovery tools would make this an issue? >>> >>> ric >>> >> >> Ric, >> <snip> > This seems to be overstated. The file system layer knows what its valid data > is at any time and will send down unmap/trim commands only when it is sure > that the block is no longer in use. > > The only concern is one of efficiency/performance - the commands are > advisory, so the target can ignore them (i.e., not pre-erase them or > allocate them in T10 to another user). There will be no need for fsck to > look at unallocated blocks. > > The concern we do have is that RAID and checksums must be consistent. Once > read, the device must return the same contents after a trim/unmap so as not > to change the parity/hash/etc. ===> Copy of top post With proposed spec changes for both T10 and T13 a new "unmap" or "trim" command is proposed, respectively. SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf ATA - T13/e08137r2 draft Per the proposed spec changes, the underlying SSD device can optionally modify the unmapped data at its discretion. SCSI T10 at least restricts the way the modification happens, but data modification of unmapped data is still definitely allowed. Thus if a filesystem "discards" a sector, the contents of that sector can change, and parity values are no longer meaningful for the stripe. i.e. If the unmapped blocks don't exactly align with the RAID-5/6 striping, then the integrity of a stripe containing both mapped and unmapped data is lost. A feature bit will be provided to identify SSDs that implement a "stable value on read" feature, meaning that once you read a specific unmapped sector, its contents will not change until it is written. That does not change the fact that a discard command that does not perfectly match the RAID setup may destroy the integrity of a stripe. Thus it seems that either the filesystem will have to understand the RAID-5/6 striping / chunking setup and ensure it never issues a discard command unless an entire stripe is free, or the RAID implementation must snoop the discard commands and take appropriate action. ===> END Copy of top post This seems to introduce some huge layering violations for RAID-5/6 implementations using next-generation SSDs to make up the RAID volumes. I imagine writing reshaping software is hard enough without this going on. <snip> > One serious suggestion is that you take your concerns up with the T13 group directly - few people on this list sit in on those, I believe that it is an open forum. I will have to look into that. The whole idea of what is happening here seems fraught with problems to me. T13 is worse than T10 from what I see, but both seem highly problematic. 
Allowing data to change at the SATA / SAS interface layer and not implementing a signaling mechanism that allows the kernel (or any OS / software tool) to ask which sectors / blocks / erase units have undergone data changes is just bizarre to me. If the unmap command always caused the unmapped sectors to return some fixed value, at least that could be incorporated into a RAID implementation's logic. The current random nature of what the unmap command does is very unsettling to me. Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 18+ messages in thread
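To make the parity concern in the recap above concrete, here is a minimal C sketch. It is a toy model with made-up chunk size, disk count and contents, not md/raid5 code: it shows that once the device silently changes a discarded chunk, the parity computed at write time no longer reconstructs the surviving chunks of the stripe.

/* Toy model: a RAID-5 stripe whose parity was computed before one chunk
 * was discarded.  If the device later returns different contents for the
 * discarded chunk, the stored parity no longer matches.  Not md code. */
#include <stdio.h>
#include <string.h>

#define CHUNK 8            /* bytes per chunk in this toy example */
#define DATA_DISKS 3

static void xor_parity(unsigned char p[CHUNK],
                       unsigned char d[DATA_DISKS][CHUNK])
{
    memset(p, 0, CHUNK);
    for (int i = 0; i < DATA_DISKS; i++)
        for (int j = 0; j < CHUNK; j++)
            p[j] ^= d[i][j];
}

int main(void)
{
    unsigned char data[DATA_DISKS][CHUNK] = {
        "fileA..", "fileB..", "fileC.."
    };
    unsigned char parity[CHUNK], check[CHUNK];

    xor_parity(parity, data);      /* parity written at stripe-write time */

    /* The filesystem discards only the chunk on disk 2; per the draft
     * specs the device may now return anything for it (zeroes here). */
    memset(data[2], 0, CHUNK);

    xor_parity(check, data);       /* what a scrub would recompute */
    printf("parity still valid: %s\n",
           memcmp(parity, check, CHUNK) == 0 ? "yes" : "no");
    /* Prints "no": reconstructing disk 0 or 1 from the old parity would
     * now produce garbage, which is the stripe-integrity loss above. */
    return 0;
}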
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer @ 2009-01-26 17:46 ` Ric Wheeler 2009-01-26 17:47 ` James Bottomley 2009-01-26 17:51 ` Mark Lord 2 siblings, 0 replies; 18+ messages in thread From: Ric Wheeler @ 2009-01-26 17:46 UTC (permalink / raw) To: Greg Freemyer Cc: linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list Greg Freemyer wrote: > Adding mdraid list: > > Top post as a recap for mdraid list (redundantly at end of email if > anyone wants to respond to any of this).: > > == Start RECAP > With proposed spec changes for both T10 and T13 a new "unmap" or > "trim" command is proposed respectively. The linux kernel is > implementing this as a sector discard and will be called by various > file systems as they delete data files. Ext4 will be one of the first > to support this. (At least via out of kernel patches.) > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > ATA - see T13/e08137r2 draft > > Per the proposed spec changes, the underlying SSD device can > optionally modify the unmapped data. SCSI T10 at least restricts the > way the modification happens, but data modification of unmapped data > is still definitely allowed for both classes of SSD. > For either device class, this is not limited to SSD devices (just for clarity). On the SCSI side, this is actually driven mainly by large arrays (like EMC Symm, Clariion, IBM Shark, etc). > Thus if a filesystem "discards" a sector, the contents of the sector > can change and thus parity values are no longer meaningful for the > stripe. > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > stripping, then the integrity of a stripe containing both mapped and > unmapped data is lost. > What this means for RAID (md or dm raid) is that we will need to rebuild the parity after a discard of a stripe for the range of discarded blocks. For T10 devices at least, the devices are required to be consistent with regards to what they return after the unmap. > Thus it seems that either the filesystem will have to understand the > raid 5 / 6 stripping / chunking setup and ensure it never issues a > discard command unless an entire stripe is being discarded. Or that > the raid implementation must must snoop the discard commands and take > appropriate actions. > > FYI: > In T13 a feature bit will be provided to identify ATA SSDs that > implement a "deterministic" feature. Meaning that once you read a > specific unmapped sector, its contents will not change until written > but that does not change the fact that a discard command that does not > perfectly match the raid setup may destroy the integrity of a stripe. > > I believe all T10 (SCSI) devices with be deterministic by spec. > > End of RECAP > > On Mon, Jan 26, 2009 at 11:22 AM, Ric Wheeler <rwheeler@redhat.com> wrote: > >> Greg Freemyer wrote: >> >>> On Fri, Jan 23, 2009 at 6:35 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >>> >>> >>>> Greg Freemyer wrote: >>>> >>>> >>>>> On Fri, Jan 23, 2009 at 5:24 PM, James Bottomley >>>>> <James.Bottomley@hansenpartnership.com> wrote: >>>>> >>>>> >>>>> >>>>>> On Fri, 2009-01-23 at 15:40 -0500, Ric Wheeler wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Greg Freemyer wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Just to make sure I understand, with the proposed trim updates to the >>>>>>>> ATA spec (T13/e08137r2 draft), a SSD can have two kinds of data. 
>>>>>>>> >>>>>>>> Reliable and unreliable. Where unreliable can return zeros, ones, >>>>>>>> old >>>>>>>> data, random made up data, old data slightly adulterated, etc.. >>>>>>>> >>>>>>>> And there is no way for the kernel to distinguish if the particular >>>>>>>> data it is getting from the SSD is of the reliable or unreliable >>>>>>>> type? >>>>>>>> >>>>>>>> For the unreliable data, if the determistic bit is set in the >>>>>>>> identify >>>>>>>> block, then the kernel can be assured of reading the same unreliable >>>>>>>> data repeatedly, but still it has no way of knowing the data it is >>>>>>>> reading was ever even written to the SSD in the first place. >>>>>>>> >>>>>>>> That just seems unacceptable. >>>>>>>> >>>>>>>> Greg >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Hi Greg, >>>>>>> >>>>>>> I sat in on a similar discussion in T10 . With luck, the T13 people >>>>>>> have >>>>>>> the same high level design: >>>>>>> >>>>>>> (1) following a write to sector X, any subsequent read of X will >>>>>>> return >>>>>>> that data >>>>>>> (2) once you DISCARD/UNMAP sector X, the device can return any state >>>>>>> (stale data, all 1's, all 0's) on the next read of that sector, but >>>>>>> must >>>>>>> continue to return that data on following reads until the sector is >>>>>>> rewritten >>>>>>> >>>>>>> >>>>>>> >>>>>> Actually, the latest draft: >>>>>> >>>>>> http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>>>> >>>>>> extends this behaviour: If the array has read capacity(16) TPRZ bit set >>>>>> then the return for an unmapped block is always zero. If TPRZ isn't >>>>>> set, it's undefined but consistent. I think TPRZ is there to address >>>>>> security concerns. >>>>>> >>>>>> James >>>>>> >>>>>> >>>>>> >>>>> To James, >>>>> >>>>> I took a look at the spec, but I'm not familiar with the SCSI spec to >>>>> grok it immediately. >>>>> >>>>> Is the TPRZ bit meant to be a way for the manufacturer to report which >>>>> of the two behaviors their device implements, or is it a externally >>>>> configurable flag that tells the SSD which way to behave? >>>>> >>>>> Either way, is there reason to believe the ATA T13 spec will get >>>>> similar functionality? >>>>> >>>>> To Ric, >>>>> >>>>> First, in general I think is is bizarre to have a device that is by >>>>> spec able to return both reliable and non-reliable data, but the spec >>>>> does not include a signaling method to differentiate between the two. >>>>> >>>>> === >>>>> My very specific concern is that I work with evidence that will >>>>> eventually be presented at court. >>>>> >>>>> We routinely work with both live files and recoved deleted files >>>>> (Computer Forensic Analysis). Thus we would typically be reading the >>>>> discarded sectors as well as in-use sectors. >>>>> >>>>> After reading the original proposal from 2007, I assumed that a read >>>>> would provide me either data that had been written specifically to the >>>>> sectors read, or that the SSD would return all nulls. That is very >>>>> troubling to the ten thousand or so computer forensic examiners in the >>>>> USA, but it true we just had to live with it. >>>>> >>>>> Now reading the Oct. 2008 revision I realized that discarded sectors >>>>> are theoretically allowed to return absolutely anything the SSD feels >>>>> like returning. Thus the SSD might return data that appears to be >>>>> supporting one side of the trial or the other, but it may have been >>>>> artificially created by the SSD. And I don't even have a flag that >>>>> says "trust this data". 
>>>>> >>>>> The way things currently stand with my understanding of the proposed >>>>> spec. I will not be able to tell the court anything about the >>>>> reliability of any data copied from the SSD regardless of whether it >>>>> is part of an active file or not. >>>>> >>>>> At its most basic level, I transport a typical file on a SSD by >>>>> connecting it to computer A, writing data to it, disconnecting from A >>>>> and connecting to computer B and then print it from there for court >>>>> room use. >>>>> >>>>> When I read that file from the SSD how can I assure the court that >>>>> data I read is even claimed to be reliable by the SSD? >>>>> >>>>> ie. The SSD has no way to say "I believe this data is what was >>>>> written to me via computer A" so why should the court or anyone else >>>>> trust the data it returns. >>>>> >>>>> IF the TPRZ bit becomes mandatory for both ATA and SCSI SSDs, then if >>>>> it is set I can have confidence that any data read from the device was >>>>> actually written to it. >>>>> >>>>> Lacking the TPRZ bit, ... >>>>> >>>>> Greg >>>>> >>>>> >>>>> >>>> I think that the incorrect assumption here is that you as a user can read >>>> data that is invalid. If you are using a file system, you will never be >>>> able >>>> to read those unmapped/freed blocks (the file system will not allow it). >>>> >>>> If you read the raw device as root, then you could seem random bits of >>>> data >>>> - maybe data recovery tools would make this an issue? >>>> >>>> ric >>>> >>>> >>> Ric, >>> >>> > > <snip> > > >> This seems to be overstated. The file system layer knows what its valid data >> is at any time and will send down unmap/trim commands only when it is sure >> that the block is no longer in use. >> >> The only concern is one of efficiency/performance - the commands are >> advisory, so the target can ignore them (i.e., not pre-erase them or >> allocate them in T10 to another user). There will be no need for fsck to >> look at unallocated blocks. >> >> The concern we do have is that RAID and checksums must be consistent. Once >> read, the device must return the same contents after a trim/unmap so as not >> to change the parity/hash/etc. >> > > ===> Copy of top post > With proposed spec changes for both T10 and T13 a new "unmap" or > "trim" command is proposed respectively. > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > ATA - T13/e08137r2 draft > > Per the proposed spec changes, the underlying SSD device can > optionally modify the unmapped data at its discretion. SCSI T10 > atleast restricts the way the modification happens, but data > modification of unmapped data is still definitely allowed. > > Thus if a filesystem "discards" a sector, the contents of the sector > can change and thus parity values are no longer meaningful for the > stripe. > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > stripping, then the integrity of a stripe containing both mapped and > unmapped data is lost. > > A feature bit will be provided to identify SSDs that implement a > "stable value on read" feature. Meaning that once you read a specific > unmapped sector, its contents will not change until written but that > does not change the fact that a discard command that does not > perfectly match the raid setup may destroy the integrity of a stripe. > > Thus it seems that either the filesystem will have to understand the > raid 5 / 6 stripping / chunking setup and ensure it never issues a > discard command unless an entire stripe is free. 
Or that the raid > implementation must must snoop the discard commands and take > appropriate actions. > ===> END Copy of top post > > Seems to introduce some huge layering violations for Raid 5 / 6 > implementations using next generation SSDs to comprise the raid > volumes. > > I imagine writing reshaping software is hard enough without this going on. > > <snip> > > >> One serious suggestion is that you take your concerns up with the T13 group directly - few people on this list sit in on those, I believe that it is an open forum. >> > > I will have to look into that. The whole idea of what is happening > here seems fraught with problems to me. T13 is worse than T10 from > what I see, but both seem highly problematic. > > Allowing data to change from the SATA / SAS interface layer and not > implementing a signaling mechanism that allows the kernel (or any OS / > software tool) to ask which sectors / blocks / erase units have > undergone data changes is just bizarre to me. > > I the unmap command always caused the unmap sectors to return some > fixed value, at least that could be incorporated into a raid > implementations logic. > > The current random nature of what unmap command does is very unsettling to me. > > Greg > ^ permalink raw reply [flat|nested] 18+ messages in thread
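Ric's point that md or dm would have to rebuild parity for the range covered by a discard can be sketched in the same toy style. This is an in-memory model with made-up sizes and a pretend device, not md code, and it leans on the T10 assumption that reads of the unmapped range are stable once performed.

/* Sketch of "rebuild parity after a discard of part of a stripe". */
#include <string.h>

#define CHUNK_BYTES 16
#define DATA_DISKS 3

struct stripe {
    unsigned char data[DATA_DISKS][CHUNK_BYTES];
    unsigned char parity[CHUNK_BYTES];
};

/* Stand-in for the device acting on the discard: here it chooses to zero
 * the chunk, which the draft specs permit. */
static void discard_chunk(struct stripe *s, int disk)
{
    memset(s->data[disk], 0, CHUNK_BYTES);
}

/* Re-sync parity from whatever the member disks now return. */
static void rebuild_parity(struct stripe *s)
{
    memset(s->parity, 0, CHUNK_BYTES);
    for (int d = 0; d < DATA_DISKS; d++)
        for (int i = 0; i < CHUNK_BYTES; i++)
            s->parity[i] ^= s->data[d][i];
}

void discard_and_resync(struct stripe *s, int disk)
{
    discard_chunk(s, disk);
    rebuild_parity(s);   /* parity is consistent again for future rebuilds */
}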
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer 2009-01-26 17:46 ` Ric Wheeler @ 2009-01-26 17:47 ` James Bottomley 2009-01-27 5:16 ` Neil Brown 2009-01-26 17:51 ` Mark Lord 2 siblings, 1 reply; 18+ messages in thread From: James Bottomley @ 2009-01-26 17:47 UTC (permalink / raw) To: Greg Freemyer Cc: Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: > Adding mdraid list: > > Top post as a recap for mdraid list (redundantly at end of email if > anyone wants to respond to any of this).: > > == Start RECAP > With proposed spec changes for both T10 and T13 a new "unmap" or > "trim" command is proposed respectively. The linux kernel is > implementing this as a sector discard and will be called by various > file systems as they delete data files. Ext4 will be one of the first > to support this. (At least via out of kernel patches.) > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > ATA - see T13/e08137r2 draft > > Per the proposed spec changes, the underlying SSD device can > optionally modify the unmapped data. SCSI T10 at least restricts the > way the modification happens, but data modification of unmapped data > is still definitely allowed for both classes of SSD. > > Thus if a filesystem "discards" a sector, the contents of the sector > can change and thus parity values are no longer meaningful for the > stripe. This isn't correct. The implementation is via bio and request discard flags. linux raid as a bio->bio mapping entity can choose to drop or implement the discard flag (by default it will be dropped unless the raid layer is modified). > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > stripping, then the integrity of a stripe containing both mapped and > unmapped data is lost. > > Thus it seems that either the filesystem will have to understand the > raid 5 / 6 stripping / chunking setup and ensure it never issues a > discard command unless an entire stripe is being discarded. Or that > the raid implementation must must snoop the discard commands and take > appropriate actions. No. It only works if the discard is supported all the way through the stack to the controller and device ... any point in the stack can drop the discard. It's also theoretically possible that any layer could accumulate them as well (i.e. up to stripe size for raid). James ^ permalink raw reply [flat|nested] 18+ messages in thread
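James's description of a stacking layer simply dropping a discard it does not want to handle can be modelled as below. The struct io and lower_device types are hypothetical stand-ins, not the kernel's bio/request interface; the point is only that a discard the layer cannot honour is completed successfully without ever touching the media.

/* Simplified model of "any layer in the stack may drop a discard". */
#include <stdbool.h>
#include <stdint.h>

struct io {
    uint64_t sector;
    uint32_t nr_sectors;
    bool     discard;        /* models the discard flag on a bio/request */
};

/* What the layer below us advertises and provides. */
struct lower_device {
    bool supports_discard;
    void (*submit)(struct lower_device *dev, struct io *io);
    void (*complete)(struct io *io, int error);
};

/* A stacking driver (md/dm-like) mapping one io onto a lower device.
 * A discard it cannot or will not honour is completed with success and
 * never reaches the device - exactly the "drop" case. */
void stacked_submit(struct lower_device *lower, struct io *io)
{
    if (io->discard && !lower->supports_discard) {
        lower->complete(io, 0);  /* advisory hint dropped, no data touched */
        return;
    }
    /* Otherwise pass it through (a RAID layer would first remap the
     * sector range and, for parity RAID, restrict it to whole stripes). */
    lower->submit(lower, io);
}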
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:47 ` James Bottomley @ 2009-01-27 5:16 ` Neil Brown 2009-01-27 10:49 ` John Robinson ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Neil Brown @ 2009-01-27 5:16 UTC (permalink / raw) To: James Bottomley Cc: Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On Monday January 26, James.Bottomley@HansenPartnership.com wrote: > On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: > > Adding mdraid list: > > > > Top post as a recap for mdraid list (redundantly at end of email if > > anyone wants to respond to any of this).: > > > > == Start RECAP > > With proposed spec changes for both T10 and T13 a new "unmap" or > > "trim" command is proposed respectively. The linux kernel is > > implementing this as a sector discard and will be called by various > > file systems as they delete data files. Ext4 will be one of the first > > to support this. (At least via out of kernel patches.) > > > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > > ATA - see T13/e08137r2 draft > > > > Per the proposed spec changes, the underlying SSD device can > > optionally modify the unmapped data. SCSI T10 at least restricts the > > way the modification happens, but data modification of unmapped data > > is still definitely allowed for both classes of SSD. > > > > Thus if a filesystem "discards" a sector, the contents of the sector > > can change and thus parity values are no longer meaningful for the > > stripe. > > This isn't correct. The implementation is via bio and request discard > flags. linux raid as a bio->bio mapping entity can choose to drop or > implement the discard flag (by default it will be dropped unless the > raid layer is modified). That's good. I would be worried if they could slip through without md/raid noticing. > > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > > stripping, then the integrity of a stripe containing both mapped and > > unmapped data is lost. > > > > Thus it seems that either the filesystem will have to understand the > > raid 5 / 6 stripping / chunking setup and ensure it never issues a > > discard command unless an entire stripe is being discarded. Or that > > the raid implementation must must snoop the discard commands and take > > appropriate actions. > > No. It only works if the discard is supported all the way through the > stack to the controller and device ... any point in the stack can drop > the discard. It's also theoretically possible that any layer could > accumulate them as well (i.e. up to stripe size for raid). Accumulating them in the raid level would probably be awkward. It was my understanding that filesystems would (try to) send the largest possible 'discard' covering any surrounding blocks that had already been discarded. Then e.g. raid5 could just round down any discard request to an aligned number of complete stripes and just discard those. i.e. have all the accumulation done in the filesystem. To be able to safely discard stripes, raid5 would need to remember which stripes were discarded so that it could be sure to write out the whole stripe when updating any block on it, thus ensuring that parity will be correct again and will remain correct. Probably the only practical data structure for this would be a bitmap similar to the current write-intent bitmap. Is it really worth supporting this in raid5? 
Are the sorts of devices that will benefit from 'discard' requests likely to be used inside an md/raid5 array I wonder.... raid1 and raid10 are much easier to handle, so supporting 'discard' there certainly makes sense. NeilBrown ^ permalink raw reply [flat|nested] 18+ messages in thread
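A sketch of the rounding Neil describes, assuming the accumulation really is done in the filesystem: shrink an incoming discard to the aligned complete raid5 stripes it covers and drop it if none are covered. The geometry numbers in main() are illustrative only.

/* Round a discard request down to whole, aligned raid5 stripes. */
#include <stdint.h>
#include <stdio.h>

/* One stripe of user data spans chunk_sectors on each data disk. */
static uint64_t stripe_sectors(uint32_t chunk_sectors, int data_disks)
{
    return (uint64_t)chunk_sectors * data_disks;
}

/* Returns 1 and fills out_start/out_len with a stripe-aligned sub-range,
 * or 0 if the request covers no complete stripe and should be dropped. */
int round_discard_to_stripes(uint64_t start, uint64_t len,
                             uint32_t chunk_sectors, int data_disks,
                             uint64_t *out_start, uint64_t *out_len)
{
    uint64_t ss = stripe_sectors(chunk_sectors, data_disks);
    uint64_t first = (start + ss - 1) / ss * ss;   /* round start up */
    uint64_t end   = (start + len) / ss * ss;      /* round end down */

    if (end <= first)
        return 0;
    *out_start = first;
    *out_len = end - first;
    return 1;
}

int main(void)
{
    uint64_t s, l;
    /* 64KiB chunks (128 sectors), 4 data disks -> 512-sector stripes. */
    if (round_discard_to_stripes(1000, 2000, 128, 4, &s, &l))
        printf("discard sectors %llu..%llu\n",
               (unsigned long long)s, (unsigned long long)(s + l - 1));
    return 0;
}

Anything smaller than a full stripe is simply ignored here, which keeps parity untouched; the cost is that sub-stripe discards never reach the device unless the filesystem merges them first.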
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 5:16 ` Neil Brown @ 2009-01-27 10:49 ` John Robinson 2009-01-28 20:11 ` Bill Davidsen 2009-01-27 11:23 ` Ric Wheeler 2009-01-27 14:48 ` James Bottomley 2 siblings, 1 reply; 18+ messages in thread From: John Robinson @ 2009-01-27 10:49 UTC (permalink / raw) To: Neil Brown Cc: James Bottomley, Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On 27/01/2009 05:16, Neil Brown wrote: [...] > Probably the only practical data structure for this would be a bitmap > similar to the current write-intent bitmap. > > Is it really worth supporting this in raid5? Are the sorts of > devices that will benefit from 'discard' requests likely to be used > inside an md/raid5 array I wonder.... Assuming I've understood correctly, this usage map sounds to me like a useful thing to have for all RAIDs. When building the array in the first place, the initial sync is just writing a usage map saying it's all empty. Filesystem writes and discards update it appropriately. Then when we get failing sectors reported via e.g. SMART or a scrub operation we know whether they're on used or unused areas so whether it's worth attempting recovery. Cheers, John. ^ permalink raw reply [flat|nested] 18+ messages in thread
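The usage map John describes could look roughly like this: one bit per stripe, set on first write, cleared when a whole stripe is discarded, and consulted so the initial sync or a scrub only touches stripes that were ever used. The layout is hypothetical and ignores persistence; a real version would be stored on disk much like the write-intent bitmap.

/* Per-stripe usage map sketch (in-memory only, error handling omitted). */
#include <stdint.h>
#include <stdlib.h>

struct usage_map {
    uint64_t nr_stripes;
    unsigned char *bits;           /* 1 bit per stripe */
};

struct usage_map *usage_map_alloc(uint64_t nr_stripes)
{
    struct usage_map *m = malloc(sizeof(*m));
    m->nr_stripes = nr_stripes;
    m->bits = calloc((nr_stripes + 7) / 8, 1);   /* starts all-unused */
    return m;
}

void mark_used(struct usage_map *m, uint64_t s)   { m->bits[s / 8] |= 1u << (s % 8); }
void mark_unused(struct usage_map *m, uint64_t s) { m->bits[s / 8] &= ~(1u << (s % 8)); }
int  stripe_used(const struct usage_map *m, uint64_t s) { return (m->bits[s / 8] >> (s % 8)) & 1; }

/* Initial sync or scrub only has to touch stripes that were ever written. */
void resync(struct usage_map *m, void (*sync_stripe)(uint64_t))
{
    for (uint64_t s = 0; s < m->nr_stripes; s++)
        if (stripe_used(m, s))
            sync_stripe(s);
}

This is also what gives the "is this failing sector in a used area?" answer John mentions: a reported bad block maps to a stripe number, and the bit says whether recovery is worth attempting.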
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 10:49 ` John Robinson @ 2009-01-28 20:11 ` Bill Davidsen [not found] ` <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com> 0 siblings, 1 reply; 18+ messages in thread From: Bill Davidsen @ 2009-01-28 20:11 UTC (permalink / raw) To: John Robinson Cc: Neil Brown, James Bottomley, Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list John Robinson wrote: > On 27/01/2009 05:16, Neil Brown wrote: > [...] >> Probably the only practical data structure for this would be a bitmap >> similar to the current write-intent bitmap. >> >> Is it really worth supporting this in raid5? Are the sorts of >> devices that will benefit from 'discard' requests likely to be used >> inside an md/raid5 array I wonder.... > > Assuming I've understood correctly, this usage map sounds to me like a > useful thing to have for all RAIDs. When building the array in the > first place, the initial sync is just writing a usage map saying it's > all empty. Filesystem writes and discards update it appropriately. > Then when we get failing sectors reported via e.g. SMART or a scrub > operation we know whether they're on used or unused areas so whether > it's worth attempting recovery. It would seem that this could really speed initialization. A per-stripe "unused" bitmap could save a lot of time in init, but also in the check operation on partially used media. It's not just being nice to SDD, but being nice to power consumption, performance impact, rebuild time... other than the initial coding and testing required, I can't see any downside to this. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 18+ messages in thread
[parent not found: <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com>]
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] [not found] ` <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com> @ 2009-01-29 1:49 ` John Robinson 0 siblings, 0 replies; 18+ messages in thread From: John Robinson @ 2009-01-29 1:49 UTC (permalink / raw) To: Greg Freemyer Cc: Bill Davidsen, Neil Brown, James Bottomley, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On 28/01/2009 23:56, Greg Freemyer wrote: > Once discard calls get into linux file systems mdraid and/or device > mapper could implement linux's own thin provisioning implementation. > Even with traditional disks that don't support unmap. I gather that > is what the EMCs of the world will be doing in their platforms. > > http://en.wikipedia.org/wiki/Thin_provisioning Sounds more like a device mapper or LVM thing (than md/RAID) to me, but I'd definitely agree that this would be another great reason for block devices to implement map/unmap. And I wonder if there's room for another dm/md device type which just implements these usage maps over traditional devices which don't support unmap (much as I was wondering a few weeks back about a soft DIF implementation over e.g. SATA devices). Darn it, I might just have to dig out my school books on C; it's a while since I offered a kernel patch. Cheers, John. ^ permalink raw reply [flat|nested] 18+ messages in thread
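The thin-provisioning idea John refers to boils down to a block-remapping table like the toy below: virtual blocks only get backing blocks on first write and give them back on discard. This is a conceptual sketch only, with no free list, persistence or locking; Linux later gained the dm-thin target for the real thing.

/* Toy allocate-on-first-write mapping, the core of thin provisioning. */
#include <stdint.h>
#include <stdlib.h>

#define UNMAPPED UINT64_MAX

struct thin_dev {
    uint64_t nr_virtual;           /* advertised size, in blocks      */
    uint64_t next_free;            /* naive bump allocator            */
    uint64_t *map;                 /* virtual block -> physical block */
};

struct thin_dev *thin_create(uint64_t nr_virtual)
{
    struct thin_dev *t = malloc(sizeof(*t));
    t->nr_virtual = nr_virtual;
    t->next_free = 0;
    t->map = malloc(nr_virtual * sizeof(uint64_t));
    for (uint64_t i = 0; i < nr_virtual; i++)
        t->map[i] = UNMAPPED;
    return t;
}

/* Write path: allocate backing space lazily. */
uint64_t thin_map_for_write(struct thin_dev *t, uint64_t vblock)
{
    if (t->map[vblock] == UNMAPPED)
        t->map[vblock] = t->next_free++;   /* real code reuses freed blocks */
    return t->map[vblock];
}

/* Discard path: return the backing block to the pool. */
void thin_discard(struct thin_dev *t, uint64_t vblock)
{
    t->map[vblock] = UNMAPPED;             /* should go on a free list */
}

/* Read path: an unmapped virtual block can simply be reported as zeroes,
 * which is the deterministic behaviour the thread asks devices for. */
int thin_is_mapped(const struct thin_dev *t, uint64_t vblock)
{
    return t->map[vblock] != UNMAPPED;
}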
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 5:16 ` Neil Brown 2009-01-27 10:49 ` John Robinson @ 2009-01-27 11:23 ` Ric Wheeler 2009-01-28 20:28 ` Bill Davidsen 2009-01-27 14:48 ` James Bottomley 2 siblings, 1 reply; 18+ messages in thread From: Ric Wheeler @ 2009-01-27 11:23 UTC (permalink / raw) To: Neil Brown Cc: James Bottomley, Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list Neil Brown wrote: > On Monday January 26, James.Bottomley@HansenPartnership.com wrote: > >> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: >> >>> Adding mdraid list: >>> >>> Top post as a recap for mdraid list (redundantly at end of email if >>> anyone wants to respond to any of this).: >>> >>> == Start RECAP >>> With proposed spec changes for both T10 and T13 a new "unmap" or >>> "trim" command is proposed respectively. The linux kernel is >>> implementing this as a sector discard and will be called by various >>> file systems as they delete data files. Ext4 will be one of the first >>> to support this. (At least via out of kernel patches.) >>> >>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>> ATA - see T13/e08137r2 draft >>> >>> Per the proposed spec changes, the underlying SSD device can >>> optionally modify the unmapped data. SCSI T10 at least restricts the >>> way the modification happens, but data modification of unmapped data >>> is still definitely allowed for both classes of SSD. >>> >>> Thus if a filesystem "discards" a sector, the contents of the sector >>> can change and thus parity values are no longer meaningful for the >>> stripe. >>> >> This isn't correct. The implementation is via bio and request discard >> flags. linux raid as a bio->bio mapping entity can choose to drop or >> implement the discard flag (by default it will be dropped unless the >> raid layer is modified). >> > > That's good. I would be worried if they could slip through without > md/raid noticing. > > >>> ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 >>> stripping, then the integrity of a stripe containing both mapped and >>> unmapped data is lost. >>> >>> Thus it seems that either the filesystem will have to understand the >>> raid 5 / 6 stripping / chunking setup and ensure it never issues a >>> discard command unless an entire stripe is being discarded. Or that >>> the raid implementation must must snoop the discard commands and take >>> appropriate actions. >>> >> No. It only works if the discard is supported all the way through the >> stack to the controller and device ... any point in the stack can drop >> the discard. It's also theoretically possible that any layer could >> accumulate them as well (i.e. up to stripe size for raid). >> > > Accumulating them in the raid level would probably be awkward. > > It was my understanding that filesystems would (try to) send the > largest possible 'discard' covering any surrounding blocks that had > already been discarded. Then e.g. raid5 could just round down any > discard request to an aligned number of complete stripes and just > discard those. i.e. have all the accumulation done in the filesystem. > > To be able to safely discard stripes, raid5 would need to remember > which stripes were discarded so that it could be sure to write out the > whole stripe when updating any block on it, thus ensuring that parity > will be correct again and will remain correct. 
> > Probably the only practical data structure for this would be a bitmap > similar to the current write-intent bitmap. > > Is it really worth supporting this in raid5? Are the sorts of > devices that will benefit from 'discard' requests likely to be used > inside an md/raid5 array I wonder.... > > raid1 and raid10 are much easier to handle, so supporting 'discard' > there certainly makes sense. > > NeilBrown > -- > The benefit is also seen by SSD devices (T13) and high end arrays (T10). On the array end, they almost universally do RAID support internally. I suppose that people might make RAID5 devices out of SSD's locally, but it is probably not an immediate priority.... ric ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 11:23 ` Ric Wheeler @ 2009-01-28 20:28 ` Bill Davidsen 0 siblings, 0 replies; 18+ messages in thread From: Bill Davidsen @ 2009-01-28 20:28 UTC (permalink / raw) To: Ric Wheeler Cc: Neil Brown, James Bottomley, Greg Freemyer, linux-raid, Dongjun Shin, IDE/ATA development list Ric Wheeler wrote: > Neil Brown wrote: >> On Monday January 26, James.Bottomley@HansenPartnership.com wrote: >> >>> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: >>> >>>> Adding mdraid list: >>>> >>>> Top post as a recap for mdraid list (redundantly at end of email if >>>> anyone wants to respond to any of this).: >>>> >>>> == Start RECAP >>>> With proposed spec changes for both T10 and T13 a new "unmap" or >>>> "trim" command is proposed respectively. The linux kernel is >>>> implementing this as a sector discard and will be called by various >>>> file systems as they delete data files. Ext4 will be one of the first >>>> to support this. (At least via out of kernel patches.) >>>> >>>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>> ATA - see T13/e08137r2 draft >>>> >>>> Per the proposed spec changes, the underlying SSD device can >>>> optionally modify the unmapped data. SCSI T10 at least restricts the >>>> way the modification happens, but data modification of unmapped data >>>> is still definitely allowed for both classes of SSD. >>>> >>>> Thus if a filesystem "discards" a sector, the contents of the sector >>>> can change and thus parity values are no longer meaningful for the >>>> stripe. >>>> >>> This isn't correct. The implementation is via bio and request discard >>> flags. linux raid as a bio->bio mapping entity can choose to drop or >>> implement the discard flag (by default it will be dropped unless the >>> raid layer is modified). >>> >> >> That's good. I would be worried if they could slip through without >> md/raid noticing. >> >> >>>> ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 >>>> stripping, then the integrity of a stripe containing both mapped and >>>> unmapped data is lost. >>>> >>>> Thus it seems that either the filesystem will have to understand the >>>> raid 5 / 6 stripping / chunking setup and ensure it never issues a >>>> discard command unless an entire stripe is being discarded. Or that >>>> the raid implementation must must snoop the discard commands and take >>>> appropriate actions. >>>> >>> No. It only works if the discard is supported all the way through the >>> stack to the controller and device ... any point in the stack can drop >>> the discard. It's also theoretically possible that any layer could >>> accumulate them as well (i.e. up to stripe size for raid). >>> >> >> Accumulating them in the raid level would probably be awkward. >> >> It was my understanding that filesystems would (try to) send the >> largest possible 'discard' covering any surrounding blocks that had >> already been discarded. Then e.g. raid5 could just round down any >> discard request to an aligned number of complete stripes and just >> discard those. i.e. have all the accumulation done in the filesystem. >> >> To be able to safely discard stripes, raid5 would need to remember >> which stripes were discarded so that it could be sure to write out the >> whole stripe when updating any block on it, thus ensuring that parity >> will be correct again and will remain correct. 
>> >> Probably the only practical data structure for this would be a bitmap >> similar to the current write-intent bitmap. >> >> Is it really worth supporting this in raid5? Are the sorts of >> devices that will benefit from 'discard' requests likely to be used >> inside an md/raid5 array I wonder.... >> >> raid1 and raid10 are much easier to handle, so supporting 'discard' >> there certainly makes sense. >> >> NeilBrown >> -- >> > > The benefit is also seen by SSD devices (T13) and high end arrays > (T10). On the array end, they almost universally do RAID support > internally. > > I suppose that people might make RAID5 devices out of SSD's locally, > but it is probably not an immediate priority.... Depends on how you define "priority" here. It probably would not make much of a performance difference, it might make a significant lifetime difference in the devices. Not RAID5, RAID6. As seek times shrink things which were performance limited become practical, journaling file systems are not a problem just a solution, mounting with atime disabled isn't needed, etc. I was given some CF to PATA adapters to test, and as soon as I grab some 16GB CFs I intend to try a 32GB RAID6. I have a perfect application for it, and if it works well after I test I can put journal files on it. I just wish I had a file system which could put the journal, inodes, and directories all on the fast device and leaves the files (data) on something cheap. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 5:16 ` Neil Brown 2009-01-27 10:49 ` John Robinson 2009-01-27 11:23 ` Ric Wheeler @ 2009-01-27 14:48 ` James Bottomley 2009-01-27 14:54 ` Ric Wheeler 2 siblings, 1 reply; 18+ messages in thread From: James Bottomley @ 2009-01-27 14:48 UTC (permalink / raw) To: Neil Brown Cc: Greg Freemyer, Ric Wheeler, linux-raid, Dongjun Shin, IDE/ATA development list On Tue, 2009-01-27 at 16:16 +1100, Neil Brown wrote: > On Monday January 26, James.Bottomley@HansenPartnership.com wrote: > > On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: > > > Adding mdraid list: > > > > > > Top post as a recap for mdraid list (redundantly at end of email if > > > anyone wants to respond to any of this).: > > > > > > == Start RECAP > > > With proposed spec changes for both T10 and T13 a new "unmap" or > > > "trim" command is proposed respectively. The linux kernel is > > > implementing this as a sector discard and will be called by various > > > file systems as they delete data files. Ext4 will be one of the first > > > to support this. (At least via out of kernel patches.) > > > > > > SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf > > > ATA - see T13/e08137r2 draft > > > > > > Per the proposed spec changes, the underlying SSD device can > > > optionally modify the unmapped data. SCSI T10 at least restricts the > > > way the modification happens, but data modification of unmapped data > > > is still definitely allowed for both classes of SSD. > > > > > > Thus if a filesystem "discards" a sector, the contents of the sector > > > can change and thus parity values are no longer meaningful for the > > > stripe. > > > > This isn't correct. The implementation is via bio and request discard > > flags. linux raid as a bio->bio mapping entity can choose to drop or > > implement the discard flag (by default it will be dropped unless the > > raid layer is modified). > > That's good. I would be worried if they could slip through without > md/raid noticing. > > > > > > ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 > > > stripping, then the integrity of a stripe containing both mapped and > > > unmapped data is lost. > > > > > > Thus it seems that either the filesystem will have to understand the > > > raid 5 / 6 stripping / chunking setup and ensure it never issues a > > > discard command unless an entire stripe is being discarded. Or that > > > the raid implementation must must snoop the discard commands and take > > > appropriate actions. > > > > No. It only works if the discard is supported all the way through the > > stack to the controller and device ... any point in the stack can drop > > the discard. It's also theoretically possible that any layer could > > accumulate them as well (i.e. up to stripe size for raid). > > Accumulating them in the raid level would probably be awkward. > > It was my understanding that filesystems would (try to) send the > largest possible 'discard' covering any surrounding blocks that had > already been discarded. Then e.g. raid5 could just round down any > discard request to an aligned number of complete stripes and just > discard those. i.e. have all the accumulation done in the filesystem. The jury is still out on this one. Array manufacturers, who would probably like this as well because their internal granularity for thin provisioning is reputedly huge (in the megabytes). However, trim and discard are being driven by SSD which has no such need. 
> To be able to safely discard stripes, raid5 would need to remember > which stripes were discarded so that it could be sure to write out the > whole stripe when updating any block on it, thus ensuring that parity > will be correct again and will remain correct. right. This gives you a minimal discard size of the stripe width. > Probably the only practical data structure for this would be a bitmap > similar to the current write-intent bitmap. Hmm ... the feature you're talking about is called white space elimination by most in the industry. The layer above RAID (usually fs) knows this information exactly ... if there were a way to pass it on, there'd be no need to store it separately. > Is it really worth supporting this in raid5? Are the sorts of > devices that will benefit from 'discard' requests likely to be used > inside an md/raid5 array I wonder.... There's no hard data on how useful Trim will be in general. The idea is it allows SSDs to pre-erase (which can be a big deal) and for Thin Provisioning it allows just in time storage decisions. However, all thin provision devices are likely to do RAID internally ... > raid1 and raid10 are much easier to handle, so supporting 'discard' > there certainly makes sense. James ^ permalink raw reply [flat|nested] 18+ messages in thread
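The "minimal discard size of the stripe width" point suggests the other strategy James mentioned, accumulating sub-stripe discards in the RAID layer and only passing one down once every data chunk of a stripe has been discarded. A rough sketch with a per-stripe chunk bitmask follows; it assumes at most 32 data disks, and a later write into the stripe would have to clear its mask again.

/* Accumulate chunk-sized discards until a whole stripe is free. */
#include <stdint.h>

struct discard_accumulator {
    uint32_t chunks_per_stripe;    /* data chunks per stripe (<= 32 here)   */
    uint32_t *chunk_mask;          /* per-stripe bitmask of discarded chunks */
};

/* Record that one whole chunk of a stripe has been discarded by the layer
 * above.  Returns 1 when the stripe is now completely unused and a
 * stripe-wide discard may be sent to the member devices; 0 otherwise. */
int note_chunk_discard(struct discard_accumulator *a,
                       uint64_t stripe, uint32_t chunk)
{
    uint32_t full = (a->chunks_per_stripe == 32)
                        ? UINT32_MAX
                        : (1u << a->chunks_per_stripe) - 1u;

    a->chunk_mask[stripe] |= 1u << chunk;
    if (a->chunk_mask[stripe] != full)
        return 0;

    a->chunk_mask[stripe] = 0;     /* any later write must also clear this */
    return 1;
}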
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-27 14:48 ` James Bottomley @ 2009-01-27 14:54 ` Ric Wheeler 0 siblings, 0 replies; 18+ messages in thread From: Ric Wheeler @ 2009-01-27 14:54 UTC (permalink / raw) To: James Bottomley Cc: Neil Brown, Greg Freemyer, linux-raid, Dongjun Shin, IDE/ATA development list James Bottomley wrote: > On Tue, 2009-01-27 at 16:16 +1100, Neil Brown wrote: > >> On Monday January 26, James.Bottomley@HansenPartnership.com wrote: >> >>> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote: >>> >>>> Adding mdraid list: >>>> >>>> Top post as a recap for mdraid list (redundantly at end of email if >>>> anyone wants to respond to any of this).: >>>> >>>> == Start RECAP >>>> With proposed spec changes for both T10 and T13 a new "unmap" or >>>> "trim" command is proposed respectively. The linux kernel is >>>> implementing this as a sector discard and will be called by various >>>> file systems as they delete data files. Ext4 will be one of the first >>>> to support this. (At least via out of kernel patches.) >>>> >>>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf >>>> ATA - see T13/e08137r2 draft >>>> >>>> Per the proposed spec changes, the underlying SSD device can >>>> optionally modify the unmapped data. SCSI T10 at least restricts the >>>> way the modification happens, but data modification of unmapped data >>>> is still definitely allowed for both classes of SSD. >>>> >>>> Thus if a filesystem "discards" a sector, the contents of the sector >>>> can change and thus parity values are no longer meaningful for the >>>> stripe. >>>> >>> This isn't correct. The implementation is via bio and request discard >>> flags. linux raid as a bio->bio mapping entity can choose to drop or >>> implement the discard flag (by default it will be dropped unless the >>> raid layer is modified). >>> >> That's good. I would be worried if they could slip through without >> md/raid noticing. >> >> >>>> ie. If the unmap-ed blocks don't exactly correlate with the Raid-5 / 6 >>>> stripping, then the integrity of a stripe containing both mapped and >>>> unmapped data is lost. >>>> >>>> Thus it seems that either the filesystem will have to understand the >>>> raid 5 / 6 stripping / chunking setup and ensure it never issues a >>>> discard command unless an entire stripe is being discarded. Or that >>>> the raid implementation must must snoop the discard commands and take >>>> appropriate actions. >>>> >>> No. It only works if the discard is supported all the way through the >>> stack to the controller and device ... any point in the stack can drop >>> the discard. It's also theoretically possible that any layer could >>> accumulate them as well (i.e. up to stripe size for raid). >>> >> Accumulating them in the raid level would probably be awkward. >> >> It was my understanding that filesystems would (try to) send the >> largest possible 'discard' covering any surrounding blocks that had >> already been discarded. Then e.g. raid5 could just round down any >> discard request to an aligned number of complete stripes and just >> discard those. i.e. have all the accumulation done in the filesystem. >> > > The jury is still out on this one. Array manufacturers, who would > probably like this as well because their internal granularity for thin > provisioning is reputedly huge (in the megabytes). However, trim and > discard are being driven by SSD which has no such need. 
> I have heard from some array vendors of sizes that range from 8k erase chunks (pretty easy for us) up to 768KB, but not up to megabytes.... ric > >> To be able to safely discard stripes, raid5 would need to remember >> which stripes were discarded so that it could be sure to write out the >> whole stripe when updating any block on it, thus ensuring that parity >> will be correct again and will remain correct. >> > > right. This gives you a minimal discard size of the stripe width. > > >> Probably the only practical data structure for this would be a bitmap >> similar to the current write-intent bitmap. >> > > Hmm ... the feature you're talking about is called white space > elimination by most in the industry. The layer above RAID (usually fs) > knows this information exactly ... if there were a way to pass it on, > there'd be no need to store it separately. > > >> Is it really worth supporting this in raid5? Are the sorts of >> devices that will benefit from 'discard' requests likely to be used >> inside an md/raid5 array I wonder.... >> > > There's no hard data on how useful Trim will be in general. The idea is > it allows SSDs to pre-erase (which can be a big deal) and for Thin > Provisioning it allows just in time storage decisions. However, all > thin provision devices are likely to do RAID internally ... > > >> raid1 and raid10 are much easier to handle, so supporting 'discard' >> there certainly makes sense. >> > > James > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer 2009-01-26 17:46 ` Ric Wheeler 2009-01-26 17:47 ` James Bottomley @ 2009-01-26 17:51 ` Mark Lord 2009-01-26 18:09 ` Greg Freemyer 2 siblings, 1 reply; 18+ messages in thread From: Mark Lord @ 2009-01-26 17:51 UTC (permalink / raw) To: Greg Freemyer Cc: Ric Wheeler, linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list Greg Freemyer wrote: > > Seems to introduce some huge layering violations for Raid 5 / 6 > implementations using next generation SSDs to comprise the raid > volumes. .. Possibly so. But having stripe layouts known to the fs layer is a *good* thing, and is pretty much already necessary for decent filesystem performance. It would be better even, if the filesystem would just automatically pick up that information, rather than relying upon mkfs parameters (maybe they already do now ?). > Allowing data to change from the SATA / SAS interface layer and not > implementing a signaling mechanism that allows the kernel (or any OS / > software tool) to ask which sectors / blocks / erase units have > undergone data changes is just bizarre to me. .. I think that's just blowing smoke. The only sectors/blocks/erase-units which even *might* undergo such changes, are already restricted to those exact units which the kernel itself specificies (by explicitly discarding them). If we care about knowing which ones later on (and we don't, generally), then we (kernel) can maintain a list, just like we do for other such entities. I don't see much of a downside here for any normal users of filesystems. This kind of feature is long overdue. Cheers ^ permalink raw reply [flat|nested] 18+ messages in thread
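Mark's point about the filesystem picking up the stripe layout automatically is already workable for md by reading sysfs; the sketch below derives the mke2fs stride and stripe-width values from it. The /dev/md0 name, the raid5 assumption (one parity disk) and the 4KiB filesystem block size are illustrative, and chunk_size is assumed to be reported in bytes.

/* Derive ext2/3/4 stride and stripe-width from md sysfs geometry. */
#include <stdio.h>

static long read_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long chunk_bytes = read_long("/sys/block/md0/md/chunk_size");
    long raid_disks  = read_long("/sys/block/md0/md/raid_disks");
    long fs_block = 4096;

    if (chunk_bytes <= 0 || raid_disks <= 1) {
        fprintf(stderr, "could not read md geometry\n");
        return 1;
    }
    long data_disks = raid_disks - 1;           /* raid5: one parity disk   */
    long stride = chunk_bytes / fs_block;       /* fs blocks per chunk      */
    long stripe_width = stride * data_disks;    /* fs blocks per full stripe */

    printf("mkfs.ext4 -b %ld -E stride=%ld,stripe-width=%ld /dev/md0\n",
           fs_block, stride, stripe_width);
    return 0;
}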
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] 2009-01-26 17:51 ` Mark Lord @ 2009-01-26 18:09 ` Greg Freemyer 2009-01-26 18:21 ` Mark Lord 0 siblings, 1 reply; 18+ messages in thread From: Greg Freemyer @ 2009-01-26 18:09 UTC (permalink / raw) To: Mark Lord Cc: Ric Wheeler, linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list On Mon, Jan 26, 2009 at 12:51 PM, Mark Lord <liml@rtr.ca> wrote: > Greg Freemyer wrote: >> >> Seems to introduce some huge layering violations for Raid 5 / 6 >> implementations using next generation SSDs to comprise the raid >> volumes. > > .. > > Possibly so. But having stripe layouts known to the fs layer > is a *good* thing, and is pretty much already necessary for decent > filesystem performance. It would be better even, if the filesystem > would just automatically pick up that information, rather than > relying upon mkfs parameters (maybe they already do now ?). > >> Allowing data to change from the SATA / SAS interface layer and not >> implementing a signaling mechanism that allows the kernel (or any OS / >> software tool) to ask which sectors / blocks / erase units have >> undergone data changes is just bizarre to me. > > .. > > I think that's just blowing smoke. The only sectors/blocks/erase-units > which > even *might* undergo such changes, are already restricted to those exact > units which the kernel itself specificies (by explicitly discarding them). > If we care about knowing which ones later on (and we don't, generally), > then we (kernel) can maintain a list, just like we do for other such > entities. > > I don't see much of a downside here for any normal users of filesystems. > This kind of feature is long overdue. > > Cheers > Just so I know and before I try to find the right way to comment to the T13 and T10 committees: What is the negative of adding a ATA/SCSI command to allow the storage device to be interrogated to see what sectors/blocks/erase-units are in a mapped vs. unmapped state? For T13 (ATA), it would actually be a tri-state flag I gather: Mapped - last data written available unmapped - no data values assigned unmapped - deterministic data values Surely the storage device has to be keeping this data internally. Why not expose it? For data recovery and computer forensics purposes I would actually prefer even finer grained info, but those 3 states seem useful for several layers of the filesystem stack to be able to more fully optimize what they are doing. Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Mark Lord @ 2009-01-26 18:21 UTC
To: Greg Freemyer
Cc: Ric Wheeler, linux-raid, James Bottomley, Dongjun Shin, IDE/ATA development list

Greg Freemyer wrote:
>
> Just so I know, and before I try to find the right way to comment to
> the T13 and T10 committees:
>
> What is the negative of adding an ATA/SCSI command to allow the storage
> device to be interrogated to see which sectors/blocks/erase-units are
> in a mapped vs. unmapped state?
>
> For T13 (ATA), it would actually be a tri-state flag, I gather:
>
>   mapped   - last data written is available
>   unmapped - no data values assigned
>   unmapped - deterministic data values
>
> Surely the storage device has to be keeping this data internally.  Why
> not expose it?
..
That's a good approach.  One problem on the drive end is that they may
not maintain central lists of these.  Rather, it might be stored as
local flags in each affected sector.

So any attempt to query "all" is probably not feasible, but it should
be simple enough to "query one" sector at a time for that info.  If one
wants them all, then just loop over the entire device doing the "query
one" for each sector.

Another drawback to "query all" is that the size of the data (the list)
returned is unknown in advance.  We generally don't have commands that
work like that.

Cheers
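A rough sketch of the "query one sector at a time" loop described above, again assuming a hypothetical per-sector interrogation command; query_sector_state() below is only a stub standing in for that nonexistent command.

/*
 * Sketch of the "query one sector at a time" approach.  A real tool
 * would issue the (hypothetical) interrogation command through the
 * device's pass-through interface; here a stub returns a fixed value.
 */
#include <stdint.h>
#include <stdio.h>

enum sector_map_state {
        SECTOR_MAPPED,
        SECTOR_UNMAPPED_UNDEFINED,
        SECTOR_UNMAPPED_DETERMINISTIC,
};

static enum sector_map_state query_sector_state(int fd, uint64_t lba)
{
        (void)fd;
        (void)lba;
        /* Placeholder result; the hypothetical command would go here. */
        return SECTOR_UNMAPPED_UNDEFINED;
}

/* Walk the whole device, one LBA at a time, reporting unmapped sectors. */
static void report_unmapped(int fd, uint64_t nr_sectors)
{
        for (uint64_t lba = 0; lba < nr_sectors; lba++) {
                enum sector_map_state s = query_sector_state(fd, lba);

                if (s != SECTOR_MAPPED)
                        printf("LBA %llu: %s\n", (unsigned long long)lba,
                               s == SECTOR_UNMAPPED_DETERMINISTIC ?
                               "unmapped (deterministic)" :
                               "unmapped (undefined)");
        }
}

The per-sector loop trades command overhead for a fixed, known response size, which is the drawback Mark raises about any "query all" variant.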
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Dongjun Shin @ 2009-01-29 14:07 UTC
To: Mark Lord, Greg Freemyer
Cc: Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

On Tue, Jan 27, 2009 at 3:21 AM, Mark Lord <liml@rtr.ca> wrote:
> Greg Freemyer wrote:
>>
>> Just so I know, and before I try to find the right way to comment to
>> the T13 and T10 committees:
>>
>> What is the negative of adding an ATA/SCSI command to allow the storage
>> device to be interrogated to see which sectors/blocks/erase-units are
>> in a mapped vs. unmapped state?
>>
>> For T13 (ATA), it would actually be a tri-state flag, I gather:
>>
>>   mapped   - last data written is available
>>   unmapped - no data values assigned
>>   unmapped - deterministic data values
>>
>> Surely the storage device has to be keeping this data internally.  Why
>> not expose it?
>
> ..
>
> That's a good approach.  One problem on the drive end is that they may
> not maintain central lists of these.  Rather, it might be stored as
> local flags in each affected sector.
>
> So any attempt to query "all" is probably not feasible, but it should
> be simple enough to "query one" sector at a time for that info.  If one
> wants them all, then just loop over the entire device doing the "query
> one" for each sector.
>
> Another drawback to "query all" is that the size of the data (the list)
> returned is unknown in advance.  We generally don't have commands that
> work like that.
>

I'm not sure the map/unmap flag will help the situation for the
forensic examiner.

What if the SSD always returns zeros for the unmapped area?  (That
would not violate the spec.)  Although the flag would tell you the
exact state of the sectors, zero-filled sectors offer no hint for
forensic analysis.

My personal opinion is that a map/unmap flag will result in unnecessary
overhead for the common use cases.

Please also consider that, although not directly related to trim,
features like Full-Disk-Encryption are also making things difficult
for you.

--
Dongjun
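The objection above can be restated in code: if unmapped sectors read back as zeros, an examiner can already recognize "nothing recoverable here" without any new flag. A trivial check, assuming 512-byte sector buffers:

/* Minimal check: does this sector buffer contain only zeros? */
#include <stddef.h>

static int sector_is_all_zero(const unsigned char *buf, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                if (buf[i] != 0)
                        return 0;
        return 1;
}

Of course this cannot distinguish a deliberately written run of zeros from a zero-returning unmapped sector, which is exactly the limitation for forensic analysis.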
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Mark Lord @ 2009-01-29 15:46 UTC
To: Dongjun Shin
Cc: Greg Freemyer, Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

Dongjun Shin wrote:
..
> Please also consider that, although not directly related to trim,
> features like Full-Disk-Encryption are also making things difficult
> for you.
..
Assuming the encryption doesn't have a state-mandated backdoor hidden
inside.
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Greg Freemyer @ 2009-01-29 16:27 UTC
To: Mark Lord
Cc: Dongjun Shin, Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

On Thu, Jan 29, 2009 at 10:46 AM, Mark Lord <liml@rtr.ca> wrote:
> Dongjun Shin wrote:
> ..
>> Please also consider that, although not directly related to trim,
>> features like Full-Disk-Encryption are also making things difficult
>> for you.
>
> ..
>
> Assuming the encryption doesn't have a state-mandated backdoor hidden
> inside.

The timing could not be worse.  Last year if we could not get the
password, we could just send the user to Gitmo!!!

(I am not affiliated with any gov't agency. ;-)
* Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
From: Bill Davidsen @ 2009-01-30 15:43 UTC
To: Greg Freemyer
Cc: Mark Lord, Dongjun Shin, Ric Wheeler, linux-raid, James Bottomley, IDE/ATA development list

Greg Freemyer wrote:
> On Thu, Jan 29, 2009 at 10:46 AM, Mark Lord <liml@rtr.ca> wrote:
>> Dongjun Shin wrote:
>> ..
>>> Please also consider that, although not directly related to trim,
>>> features like Full-Disk-Encryption are also making things difficult
>>> for you.
>>
>> ..
>>
>> Assuming the encryption doesn't have a state-mandated backdoor hidden
>> inside.
>
> The timing could not be worse.  Last year if we could not get the
> password, we could just send the user to Gitmo!!!
>
I believe that in England, or maybe the whole EU, you can be forced to
give up your passwords and be imprisoned until you do.  There was an
article about that and I don't remember the details, but it was enough
to make me leave my computer at home when I visit England again.

--
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..."  Otto von Bismark
Thread overview: 18+ messages
[not found] <87f94c370901221553p4d3a749fl4717deabba5419ec@mail.gmail.com>
[not found] ` <497A2B3C.3060603@redhat.com>
[not found] ` <1232749447.3250.146.camel@localhost.localdomain>
[not found] ` <87f94c370901231526jb41ea66ta1d6a23d7631d63c@mail.gmail.com>
[not found] ` <497A542C.1040900@redhat.com>
[not found] ` <7fce22690901260659u30ffd634m3fb7f75102141ee9@mail.gmail.com>
[not found] ` <497DE35C.6090308@redhat.com>
2009-01-26 17:34 ` SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?] Greg Freemyer
2009-01-26 17:46 ` Ric Wheeler
2009-01-26 17:47 ` James Bottomley
2009-01-27 5:16 ` Neil Brown
2009-01-27 10:49 ` John Robinson
2009-01-28 20:11 ` Bill Davidsen
[not found] ` <7fce22690901281556h67fb353dp879f88e6c2a76eaf@mail.gmail.com>
2009-01-29 1:49 ` John Robinson
2009-01-27 11:23 ` Ric Wheeler
2009-01-28 20:28 ` Bill Davidsen
2009-01-27 14:48 ` James Bottomley
2009-01-27 14:54 ` Ric Wheeler
2009-01-26 17:51 ` Mark Lord
2009-01-26 18:09 ` Greg Freemyer
2009-01-26 18:21 ` Mark Lord
2009-01-29 14:07 ` Dongjun Shin
2009-01-29 15:46 ` Mark Lord
2009-01-29 16:27 ` Greg Freemyer
2009-01-30 15:43 ` Bill Davidsen