From: Bill Davidsen <davidsen@tmr.com>
To: Ric Wheeler <rwheeler@redhat.com>
Cc: Neil Brown <neilb@suse.de>,
James Bottomley <James.Bottomley@HansenPartnership.com>,
Greg Freemyer <greg.freemyer@norcrossgroup.com>,
linux-raid <linux-raid@vger.kernel.org>,
Dongjun Shin <djshin90@gmail.com>,
IDE/ATA development list <linux-ide@vger.kernel.org>
Subject: Re: SSD data reliable vs. unreliable [Was: Re: Data Recovery from SSDs - Impact of trim?]
Date: Wed, 28 Jan 2009 15:28:22 -0500
Message-ID: <4980BFE6.1060704@tmr.com>
In-Reply-To: <497EEEC2.1040907@redhat.com>
Ric Wheeler wrote:
> Neil Brown wrote:
>> On Monday January 26, James.Bottomley@HansenPartnership.com wrote:
>>
>>> On Mon, 2009-01-26 at 12:34 -0500, Greg Freemyer wrote:
>>>
>>>> Adding mdraid list:
>>>>
>>>> Top-posting as a recap for the mdraid list (repeated at the end of
>>>> this email if anyone wants to respond to any of it):
>>>>
>>>> == Start RECAP
>>>> Proposed spec changes for T10 and T13 add a new "unmap" or "trim"
>>>> command, respectively. The Linux kernel is implementing this as a
>>>> sector discard, which will be called by various file systems as they
>>>> delete files. Ext4 will be one of the first to support it (at least
>>>> via out-of-tree patches).
>>>>
>>>> SCSI - see http://www.t10.org/cgi-bin/ac.pl?t=d&f=08-356r5.pdf
>>>> ATA - see T13/e08137r2 draft
>>>>
>>>> Per the proposed spec changes, the underlying SSD device can
>>>> optionally modify the unmapped data. SCSI T10 at least restricts the
>>>> way the modification happens, but data modification of unmapped data
>>>> is still definitely allowed for both classes of SSD.
>>>>
>>>> Thus if a filesystem "discards" a sector, the contents of the sector
>>>> can change and thus parity values are no longer meaningful for the
>>>> stripe.
>>>>
>>> This isn't correct. The implementation is via bio and request discard
>>> flags. linux raid as a bio->bio mapping entity can choose to drop or
>>> implement the discard flag (by default it will be dropped unless the
>>> raid layer is modified).
>>>
>>
>> That's good. I would be worried if they could slip through without
>> md/raid noticing.
>>
>>
>>>> i.e. If the unmapped blocks don't exactly align with the RAID-5/6
>>>> striping, then the integrity of a stripe containing both mapped and
>>>> unmapped data is lost.
>>>>
>>>> Thus it seems that either the filesystem will have to understand the
>>>> RAID-5/6 striping/chunking setup and ensure it never issues a
>>>> discard command unless an entire stripe is being discarded, or the
>>>> raid implementation must snoop the discard commands and take
>>>> appropriate actions.
>>>>
>>> No. It only works if the discard is supported all the way through the
>>> stack to the controller and device ... any point in the stack can drop
>>> the discard. It's also theoretically possible that any layer could
>>> accumulate them as well (i.e. up to stripe size for raid).
>>>
>>
>> Accumulating them in the raid level would probably be awkward.
>>
>> It was my understanding that filesystems would (try to) send the
>> largest possible 'discard' covering any surrounding blocks that had
>> already been discarded. Then e.g. raid5 could just round down any
>> discard request to an aligned number of complete stripes and just
>> discard those. i.e. have all the accumulation done in the filesystem.
>>
>> To be able to safely discard stripes, raid5 would need to remember
>> which stripes were discarded so that it could be sure to write out the
>> whole stripe when updating any block on it, thus ensuring that parity
>> will be correct again and will remain correct.
>>
>> Probably the only practical data structure for this would be a bitmap
>> similar to the current write-intent bitmap.
>>
>> Is it really worth supporting this in raid5? Are the sorts of
>> devices that will benefit from 'discard' requests likely to be used
>> inside an md/raid5 array I wonder....
>>
>> raid1 and raid10 are much easier to handle, so supporting 'discard'
>> there certainly makes sense.
>>
>> NeilBrown
>> --
>>
>
> The benefit is also seen by SSD devices (T13) and high end arrays
> (T10). On the array end, they almost universally do RAID support
> internally.
>
> I suppose that people might make RAID5 devices out of SSD's locally,
> but it is probably not an immediate priority....
Depends on how you define "priority" here. It probably would not make
much of a performance difference, but it might make a significant
lifetime difference in the devices.

Not RAID5, RAID6. As seek times shrink, things which were
performance-limited become practical: journaling file systems are no
longer a problem, just a solution; mounting with atime disabled isn't
needed; etc. I was given some CF-to-PATA adapters to test, and as soon
as I grab some 16GB CFs I intend to try a 32GB RAID6. I have a perfect
application for it, and if it works well after I test it I can put
journal files on it. I just wish I had a file system which could put
the journal, inodes, and directories all on the fast device and leave
the files (data) on something cheap.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." - Otto von Bismarck