From: Ric Wheeler
Subject: Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
Date: Tue, 27 Jan 2009 09:54:30 -0500
To: James Bottomley
Cc: Neil Brown, Greg Freemyer, linux-raid, Dongjun Shin,
 IDE/ATA development list

James Bottomley wrote:
> On Tue, 2009-01-27 at 16:16 +1100, Neil Brown wrote:
>
>> On Monday January 26, James.Bottomley@HansenPartnership.com wrote:
>>
>>> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote:
>>>
>>>> Adding mdraid list:
>>>>
>>>> Top post as a recap for the mdraid list (repeated at the end of the
>>>> email if anyone wants to respond to any of this):
>>>>
>>>> == Start RECAP
>>>> With the proposed spec changes for T10 and T13, a new "unmap" or
>>>> "trim" command (respectively) is proposed. The linux kernel is
>>>> implementing this as a sector discard, which will be issued by
>>>> various file systems as they delete data files. Ext4 will be one of
>>>> the first to support this (at least via out-of-kernel patches).
>>>>
>>>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf
>>>> ATA - see T13/e08137r2 draft
>>>>
>>>> Per the proposed spec changes, the underlying SSD device can
>>>> optionally modify the unmapped data. SCSI T10 at least restricts the
>>>> way the modification happens, but modification of unmapped data is
>>>> still definitely allowed for both classes of SSD.
>>>>
>>>> Thus if a filesystem "discards" a sector, the contents of the sector
>>>> can change, and parity values are no longer meaningful for the
>>>> stripe.
>>>>
>>> This isn't correct. The implementation is via bio and request discard
>>> flags. Linux raid, as a bio->bio mapping entity, can choose to drop
>>> or implement the discard flag (by default it will be dropped unless
>>> the raid layer is modified).
>>>
>> That's good. I would be worried if they could slip through without
>> md/raid noticing.
>>
>>>> i.e. If the unmapped blocks don't exactly correlate with the
>>>> Raid-5/6 striping, then the integrity of a stripe containing both
>>>> mapped and unmapped data is lost.
>>>>
>>>> Thus it seems that either the filesystem will have to understand the
>>>> raid 5/6 striping/chunking setup and ensure it never issues a
>>>> discard command unless an entire stripe is being discarded, or the
>>>> raid implementation must snoop the discard commands and take
>>>> appropriate actions.
>>>>
>>> No. It only works if the discard is supported all the way through the
>>> stack to the controller and device ... any point in the stack can
>>> drop the discard. It's also theoretically possible that any layer
>>> could accumulate them as well (i.e. up to stripe size for raid).
>>>
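To illustrate the choice James describes above: as a bio->bio remapping
layer, md sees the discard flag on an incoming request and can either
send a corresponding discard down to its member devices or complete the
request as a successful no-op, which is always safe because a discard is
only a hint. The standalone sketch below is a toy model of that decision;
the type and function names are invented for the example and are not the
real block-layer or md API.

/* Toy model (not real kernel code) of the decision a bio->bio remapping
 * layer such as md faces when a discard request arrives: pass it down to
 * the member devices, or complete it as a successful no-op so nothing
 * below ever sees it.  Because a discard is only a hint, dropping it is
 * always safe. */
#include <stdbool.h>
#include <stdio.h>

struct toy_bio {
    unsigned long long sector;    /* starting sector              */
    unsigned int nsectors;        /* length in sectors            */
    bool discard;                 /* discard hint, not real data  */
};

/* Set only once the raid layer knows how to keep parity consistent
 * across a discard (and the member devices support it). */
static bool raid_handles_discard = false;

static void pass_down(const struct toy_bio *bio)
{
    printf("passing %s of %u sectors at %llu to member devices\n",
           bio->discard ? "discard" : "write", bio->nsectors, bio->sector);
}

static void complete_noop(const struct toy_bio *bio)
{
    printf("dropping discard of %u sectors at %llu (completed as a no-op)\n",
           bio->nsectors, bio->sector);
}

static void raid_map_bio(const struct toy_bio *bio)
{
    if (bio->discard && !raid_handles_discard) {
        complete_noop(bio);     /* the default: drop the hint       */
        return;
    }
    pass_down(bio);             /* normal I/O, or a handled discard */
}

int main(void)
{
    struct toy_bio d = { .sector = 2048, .nsectors = 1024, .discard = true };
    struct toy_bio w = { .sector = 2048, .nsectors = 8, .discard = false };

    raid_map_bio(&d);
    raid_map_bio(&w);
    return 0;
}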
>> Accumulating them in the raid level would probably be awkward.
>>
>> It was my understanding that filesystems would (try to) send the
>> largest possible 'discard' covering any surrounding blocks that had
>> already been discarded. Then e.g. raid5 could just round down any
>> discard request to an aligned number of complete stripes and just
>> discard those, i.e. have all the accumulation done in the filesystem.
>>
>
> The jury is still out on this one. Array manufacturers would probably
> like this as well, because their internal granularity for thin
> provisioning is reputedly huge (in the megabytes). However, trim and
> discard are being driven by SSDs, which have no such need.
>

I have heard from some array vendors of sizes that range from 8k erase
chunks (pretty easy for us) up to 768KB, but not up to megabytes....

ric

>
>> To be able to safely discard stripes, raid5 would need to remember
>> which stripes were discarded so that it could be sure to write out
>> the whole stripe when updating any block on it, thus ensuring that
>> parity will be correct again and will remain correct.
>>
>
> Right. This gives you a minimal discard size of the stripe width.
>
>
>> Probably the only practical data structure for this would be a bitmap
>> similar to the current write-intent bitmap.
>>
>
> Hmm ... the feature you're talking about is called white space
> elimination by most in the industry. The layer above RAID (usually the
> fs) knows this information exactly ... if there were a way to pass it
> on, there'd be no need to store it separately.
>
>
>> Is it really worth supporting this in raid5? Are the sorts of devices
>> that will benefit from 'discard' requests likely to be used inside an
>> md/raid5 array, I wonder....
>>
>
> There's no hard data on how useful Trim will be in general. The idea
> is that it allows SSDs to pre-erase (which can be a big deal) and, for
> Thin Provisioning, it allows just-in-time storage decisions. However,
> all thin provision devices are likely to do RAID internally ...
>
>
>> raid1 and raid10 are much easier to handle, so supporting 'discard'
>> there certainly makes sense.
>>
>
> James
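To make Neil's rounding suggestion above concrete, the standalone sketch
below shows how raid5 might shrink an incoming discard down to the whole
stripes it covers, simply not discarding the partial stripes at either
end. The chunk size and disk count are made-up example values, and the
helper is not actual md code.

/* Round a discard of [start, start+len) down to the largest range of
 * complete stripes it contains. */
#include <stdio.h>

typedef unsigned long long sector_t;

/* stripe_sectors is the data width of one stripe (chunk size in sectors
 * times the number of data disks).  Returns the number of whole stripes
 * covered; *out_start is the first sector of the first whole stripe. */
static sector_t whole_stripes(sector_t start, sector_t len,
                              sector_t stripe_sectors, sector_t *out_start)
{
    sector_t end = start + len;
    sector_t first = (start + stripe_sectors - 1) / stripe_sectors; /* round up   */
    sector_t last  = end / stripe_sectors;                          /* round down */

    *out_start = first * stripe_sectors;
    return last > first ? last - first : 0;
}

int main(void)
{
    /* e.g. 64KiB chunks (128 sectors) on a 4+1 raid5: 512-sector stripes */
    sector_t stripe = 128 * 4;
    sector_t start;
    sector_t n;

    /* A discard that starts and ends mid-stripe */
    n = whole_stripes(700, 2000, stripe, &start);
    printf("discard %llu whole stripes starting at sector %llu\n", n, start);

    /* A discard smaller than one stripe is dropped entirely */
    n = whole_stripes(700, 300, stripe, &start);
    printf("discard %llu whole stripes\n", n);
    return 0;
}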
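And a sketch of the bookkeeping Neil describes for safely discarding
stripes: a bitmap with one bit per stripe, set when the stripe is
discarded, so that the first write into such a stripe is promoted to a
full-stripe write that regenerates parity. The names below are invented;
this only shows the shape of the data structure, not the existing md
write-intent bitmap implementation.

/* One bit per stripe, set when the stripe has been discarded.  A later
 * write that lands in such a stripe must write the entire stripe (data
 * plus fresh parity) rather than doing a read-modify-write, because the
 * discarded blocks can no longer be trusted for parity computation. */
#include <stdio.h>
#include <stdlib.h>

struct discard_bitmap {
    unsigned long long nstripes;
    unsigned char *bits;            /* one bit per stripe */
};

static struct discard_bitmap *bitmap_alloc(unsigned long long nstripes)
{
    struct discard_bitmap *b = malloc(sizeof(*b));
    b->nstripes = nstripes;
    b->bits = calloc((nstripes + 7) / 8, 1);
    return b;
}

static void mark_discarded(struct discard_bitmap *b, unsigned long long stripe)
{
    b->bits[stripe / 8] |= 1u << (stripe % 8);
}

static void clear_discarded(struct discard_bitmap *b, unsigned long long stripe)
{
    b->bits[stripe / 8] &= ~(1u << (stripe % 8));
}

static int is_discarded(const struct discard_bitmap *b, unsigned long long stripe)
{
    return (b->bits[stripe / 8] >> (stripe % 8)) & 1;
}

/* Called for every write; stripe = sector / stripe_sectors. */
static void handle_write(struct discard_bitmap *b, unsigned long long stripe)
{
    if (is_discarded(b, stripe)) {
        /* Reconstruct-write: write every data block and new parity for
         * the whole stripe, after which the stripe is trustworthy again. */
        printf("stripe %llu was discarded: full-stripe write + new parity\n", stripe);
        clear_discarded(b, stripe);
    } else {
        printf("stripe %llu: normal read-modify-write of parity is safe\n", stripe);
    }
}

int main(void)
{
    struct discard_bitmap *b = bitmap_alloc(1024);

    mark_discarded(b, 7);     /* raid5 discarded stripe 7 on behalf of the fs */
    handle_write(b, 7);       /* first write after the discard                */
    handle_write(b, 7);       /* subsequent writes behave normally            */
    handle_write(b, 8);       /* untouched stripe                             */

    free(b->bits);
    free(b);
    return 0;
}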