* Re: Is TRIM/DISCARD going to be a performance problem?
[not found] ` <20090511120936.GB6277@mit.edu>
@ 2009-05-11 13:10 ` Greg Freemyer
2009-05-11 13:39 ` Matthew Wilcox
2009-05-11 14:27 ` Theodore Tso
0 siblings, 2 replies; 11+ messages in thread
From: Greg Freemyer @ 2009-05-11 13:10 UTC (permalink / raw)
To: Theodore Tso
Cc: Jörn Engel, Matthew Wilcox, Jens Axboe, Ric Wheeler,
linux-fsdevel, linux-ext4, Linux RAID
On Mon, May 11, 2009 at 8:09 AM, Theodore Tso wrote:
> All of the web browsing I've done confirms that the ATA folks expect
> trim to work on 512-sector granularity.
Ted,
That implies that the SSD folks are not treating erase blocks as a
contiguous group of sectors. For some reason, I thought there was
only one mapping per erase block and within the erase block the
sectors were contiguous.
If I'm right, then the ATA spec may allow you to send sub-erase block
trim commands down, but the spec does not prevent the (blackbox)
hardware from clipping the size of the trim to be on erase block
boundaries and ignoring the sub-erase block portions on each end. Or
ignoring the whole command if your trim command does not span a whole
erase block.
Also the mdraid people plan to clip at the stripe width boundary for
raid 5, 6, etc. Their expectation is that discards will be coalesced
into bigger blocks before it gets to the mdraid layer.
I still think reshaping a raid 5 online will be next to impossible
when some of the stripes may contain indeterminate data.
More realistic is to figure out a way to make it deterministic at
least for the short term (by writing data to all the trimmed blocks?),
then reshaping, then having a tool to scan the filesystem and re-issue
all the trim commands.
Obviously, if the ATA spec had a signaling mechanism that
differentiated between deterministic data and non-deterministic data,
then the extra code described above could be simplified greatly.
Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 13:10 ` Is TRIM/DISCARD going to be a performance problem? Greg Freemyer
@ 2009-05-11 13:39 ` Matthew Wilcox
2009-05-11 14:27 ` Theodore Tso
1 sibling, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2009-05-11 13:39 UTC (permalink / raw)
To: Greg Freemyer
Cc: Theodore Tso, Jörn Engel, Matthew Wilcox, Jens Axboe, Ric Wheeler,
linux-fsdevel, linux-ext4, Linux RAID
On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
> That implies that the SSD folks are not treating erase blocks as a
> contiguous group of sectors. For some reason, I thought there was
> only one mapping per erase block and within the erase block the
> sectors were contiguous.
I believe there is a mapping per LBA, not per erase block. Of course,
different technologies will have different limitations here, but it
would be foolish to assume anything about SSDs at this point.
(For those who haven't heard my disclaimer before, the Intel SSD team
don't tell me anything fun about how the drives work internally).
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 13:10 ` Is TRIM/DISCARD going to be a performance problem? Greg Freemyer
2009-05-11 13:39 ` Matthew Wilcox
@ 2009-05-11 14:27 ` Theodore Tso
2009-05-11 14:29 ` Ric Wheeler
1 sibling, 1 reply; 11+ messages in thread
From: Theodore Tso @ 2009-05-11 14:27 UTC (permalink / raw)
To: Greg Freemyer
Cc: Jörn Engel, Matthew Wilcox, Jens Axboe, Ric Wheeler,
linux-fsdevel, linux-ext4, Linux RAID
On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
>
> That implies that the SSD folks are not treating erase blocks as a
> contiguous group of sectors.
Correct.
> For some reason, I thought there was
> only one mapping per erase block and within the erase block the
> sectors were contiguous.
No, if you try to treat erase blocks as a contiguous group of
sectors, you'll have terrible write amplification problems (leading to
premature death of the SSD) and terrible small random write
performance. Flash devices optimized for digital cameras might have
done that, but for SSD's, this will result in catastrophically bad
performance, and very limited lifespan. As I said, I expect these
SSD's to be weeded out of the market very shortly.
For any sane implementation of an SSD, the mapping will be on a per
LBA basis, not on a per-erase block basis.
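A rough way to put numbers on the write amplification point is the sketch below. It is a deliberately simplified worst-case model, not a description of any real drive: the 128 KiB erase block size is an assumed figure, the function name is made up for illustration, and garbage collection and wear leveling are ignored entirely.

```python
def write_amplification(host_write_bytes: int, mapping_unit_bytes: int) -> float:
    """Worst-case ratio of flash bytes rewritten to host bytes written.

    Simplified model: every mapping unit touched by the host write has to
    be rewritten in full, and nothing else is ever moved.
    """
    # Round the host write up to a whole number of mapping units.
    units_touched = -(-host_write_bytes // mapping_unit_bytes)
    return units_touched * mapping_unit_bytes / host_write_bytes


ERASE_BLOCK_BYTES = 128 * 1024  # assumed erase block size, for illustration only
LBA_BYTES = 512                 # one 512-byte sector

# A single 4 KiB random write under each mapping scheme.
per_erase_block = write_amplification(4096, ERASE_BLOCK_BYTES)
per_lba = write_amplification(4096, LBA_BYTES)
print(per_erase_block, per_lba)
```

Under these assumptions the per-erase-block mapping rewrites 32 times as much flash as the host asked for, while the per-LBA mapping rewrites only what was written.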
> More realistic is to figure out a way to make it deterministic at
> least for the short term (by writing data to all the trimmed blocks?),
> then reshaping, then having a tool to scan the filesystem and re-issue
> all the trim commands.
Writing data to all of the trimmed blocks? Um, no. That would be a
disaster, since it accelerates the wear and tear of the SSD. The whole
*point* of the TRIM command is to avoid needing to do that.
The whole worry about determinism is highly overrated. If the
filesystem doesn't need a block, then it doesn't need it. What you
read after you send a TRIM command, whether it is the old data because
the device applied some kind of rounding, or random data, or all
zeros, won't matter to the filesystem. Why should the filesystem
care? I know I certainly don't....
- Ted
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 14:27 ` Theodore Tso
@ 2009-05-11 14:29 ` Ric Wheeler
2009-05-11 14:50 ` Theodore Tso
0 siblings, 1 reply; 11+ messages in thread
From: Ric Wheeler @ 2009-05-11 14:29 UTC (permalink / raw)
To: Theodore Tso
Cc: Greg Freemyer, Jörn Engel, Matthew Wilcox, Jens Axboe,
linux-fsdevel, linux-ext4, Linux RAID
On 05/11/2009 10:27 AM, Theodore Tso wrote:
> On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
>> That implies that the SSD folks are not treating erase blocks as a
>> contiguous group of sectors.
>
> Correct.
>
>> For some reason, I thought there was
>> only one mapping per erase block and within the erase block the
>> sectors were contiguous.
>
> No, if you try to treat erase blocks as a contiguous group of
> sectors, you'll have terrible write amplification problems (leading to
> premature death of the SSD) and terrible small random write
> performance. Flash devices optimized for digital cameras might have
> done that, but for SSD's, this will result in catastrophically bad
> performance, and very limited lifespan. As I said, I expect these
> SSD's to be weeded out of the market very shortly.
>
> For any sane implementation of an SSD, the mapping will be on a per
> LBA basis, not on a per-erase block basis.
>
>> More realistic is to figure out a way to make it deterministic at
>> least for the short term (by writing data to all the trimmed blocks?),
>> then reshaping, then having a tool to scan the filesystem and re-issue
>> all the trim commands.
>
> Writing data to all of the trimmed blocks? Um, no. That would be a
> disaster, since it accelerates the wear and tear of the SSD. The whole
> *point* of the TRIM command is to avoid needing to do that.
>
> The whole worry about determinism is highly overrated. If the
> filesystem doesn't need a block, then it doesn't need it. What you
> read after you send a TRIM command, whether it is the old data because
> the device applied some kind of rounding, or random data, or all
> zeros, won't matter to the filesystem. Why should the filesystem
> care? I know I certainly don't....
>
> - Ted
The key is not at the FS layer - this is an issue for people who RAID these
beasts together and want to actually check that the bits are what they should be
(say doing a checksum validity check for a stripe).
ric
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 14:29 ` Ric Wheeler
@ 2009-05-11 14:50 ` Theodore Tso
2009-05-11 14:58 ` Ric Wheeler
2009-05-11 15:00 ` Matthew Wilcox
0 siblings, 2 replies; 11+ messages in thread
From: Theodore Tso @ 2009-05-11 14:50 UTC (permalink / raw)
To: Ric Wheeler
Cc: Greg Freemyer, Jörn Engel, Matthew Wilcox, Jens Axboe,
linux-fsdevel, linux-ext4, Linux RAID
On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>
> The key is not at the FS layer - this is an issue for people who RAID
> these beasts together and want to actually check that the bits are what
> they should be (say doing a checksum validity check for a stripe).
>
Good point, yes I can see why they need that. In that case, the
storage device can't just silently truncate a TRIM request; it would
have to expose to the OS its alignment requirements. The risk, though,
is that the more they try to push this complexity into the OS, the higher
the risk that the OS will simply decide not to take advantage of the
functionality. Of course, there is the question why anyone would want
to build a software-raid device on top of a thin-provisioned hardware
storage unit. :-)
- Ted
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 14:50 ` Theodore Tso
@ 2009-05-11 14:58 ` Ric Wheeler
2009-05-11 15:00 ` Matthew Wilcox
1 sibling, 0 replies; 11+ messages in thread
From: Ric Wheeler @ 2009-05-11 14:58 UTC (permalink / raw)
To: Theodore Tso
Cc: Greg Freemyer, Jörn Engel, Matthew Wilcox, Jens Axboe,
linux-fsdevel, linux-ext4, Linux RAID
On 05/11/2009 10:50 AM, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>> The key is not at the FS layer - this is an issue for people who RAID
>> these beasts together and want to actually check that the bits are what
>> they should be (say doing a checksum validity check for a stripe).
>>
>
> Good point, yes I can see why they need that. In that case, the
> storage device can't just silently truncate a TRIM request; it would
> have to expose to the OS its alignment requirements. The risk, though,
> is that the more they try to push this complexity into the OS, the higher
> the risk that the OS will simply decide not to take advantage of the
> functionality. Of course, there is the question why anyone would want
> to build a software-raid device on top of a thin-provisioned hardware
> storage unit. :-)
>
> - Ted
Probably not as uncommon as you would think, though not, as you suggest, to RAID
thin-provisioned LUNs together (those are usually built as RAID devices inside an array).
Think more of the array providing a thinly provisioned LUN made up of T13
TRIM-enabled SSD devices internally. RAID makes sense here (data protection
is still needed to avoid a single point of failure) and the relative expense of
the SSD devices makes "thin provisioning" really attractive to external users :-)
ric
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 14:50 ` Theodore Tso
2009-05-11 14:58 ` Ric Wheeler
@ 2009-05-11 15:00 ` Matthew Wilcox
2009-05-11 18:47 ` Greg Freemyer
1 sibling, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2009-05-11 15:00 UTC (permalink / raw)
To: Theodore Tso
Cc: Ric Wheeler, Greg Freemyer, Jörn Engel, Matthew Wilcox,
Jens Axboe, linux-fsdevel, linux-ext4, Linux RAID
On Mon, May 11, 2009 at 10:50:59AM -0400, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
> > The key is not at the FS layer - this is an issue for people who RAID
> > these beasts together and want to actually check that the bits are what
> > they should be (say doing a checksum validity check for a stripe).
>
> Good point, yes I can see why they need that. In that case, the
> storage device can't just silently truncate a TRIM request; it would
> have to expose to the OS its alignment requirements. The risk, though,
> is that the more they try to push this complexity into the OS, the higher
> the risk that the OS will simply decide not to take advantage of the
> functionality. Of course, there is the question why anyone would want
> to build a software-raid device on top of a thin-provisioned hardware
> storage unit. :-)
It's not a problem for people who use Thin Provisioning, it's a problem
for people who want to run RAID-5 on top of SSDs. If you have a sector
whose reads are indeterminate, your parity for that stripe will always
be wrong.
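The parity problem can be shown with a toy two-disk-plus-parity stripe. This is only an illustrative sketch: the chunk contents are invented, and the all-zero result of the post-TRIM read is just one possibility, since an indeterminate read could return anything.

```python
from functools import reduce


def xor_parity(chunks):
    """XOR equally sized byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


d1 = bytes([0x11, 0x22, 0x33, 0x44])
d2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
p = xor_parity([d1, d2])

# While the stripe is consistent, a lost d1 can be rebuilt from p and d2.
assert xor_parity([p, d2]) == d1

# Suppose d1 is trimmed and the device now returns something else on read.
# All zeros here is an arbitrary stand-in for "indeterminate".
d1_after_trim = bytes(4)

# The stored parity no longer matches what the data disks return ...
parity_still_valid = xor_parity([d1_after_trim, d2]) == p

# ... so if the disk holding d2 dies, reconstruction yields garbage.
rebuilt_d2 = xor_parity([p, d1_after_trim])
print(parity_still_valid, rebuilt_d2 == d2)
```

Both checks come out false in this example: a parity scrub of the stripe fails, and live data in d2 is no longer protected even though d2 itself was never trimmed.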
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 15:00 ` Matthew Wilcox
@ 2009-05-11 18:47 ` Greg Freemyer
2009-05-11 19:22 ` Andreas Dilger
2009-05-11 23:38 ` Neil Brown
0 siblings, 2 replies; 11+ messages in thread
From: Greg Freemyer @ 2009-05-11 18:47 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Theodore Tso, Ric Wheeler, Jörn Engel, Matthew Wilcox, Jens Axboe,
linux-fsdevel, linux-ext4, Linux RAID
On Mon, May 11, 2009 at 11:00 AM, Matthew Wilcox <matthew@wil.cx> wrote:
> On Mon, May 11, 2009 at 10:50:59AM -0400, Theodore Tso wrote:
>> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>> > The key is not at the FS layer - this is an issue for people who RAID
>> > these beasts together and want to actually check that the bits are what
>> > they should be (say doing a checksum validity check for a stripe).
>>
>> Good point, yes I can see why they need that. In that case, the
>> storage device can't just silently truncate a TRIM request; it would
>> have to expose to the OS its alignment requirements. The risk, though,
>> is that the more they try to push this complexity into the OS, the higher
>> the risk that the OS will simply decide not to take advantage of the
>> functionality. Of course, there is the question why anyone would want
>> to build a software-raid device on top of a thin-provisioned hardware
>> storage unit. :-)
>
> It's not a problem for people who use Thin Provisioning, it's a problem
> for people who want to run RAID-5 on top of SSDs. If you have a sector
> whose reads are indeterminate, your parity for that stripe will always
> be wrong.
Thus my understanding is that an entire stripe will either be discarded
or not by the mdraid layer.
And if a discard comes along from above that is smaller than a stripe,
then it will be tossed by the mdraid layer.
And if it is not aligned to the stripe geometry, then the start/end of
the discard area will be adjusted to be stripe aligned.
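The clipping rules described above can be sketched in a few lines. This is not mdraid code; the function name, the sector units and the 1024-sector stripe width in the usage lines are all assumptions chosen for illustration.

```python
def clip_discard_to_stripes(start, length, stripe_sectors):
    """Shrink a discard request to the whole stripes it fully covers.

    start and length are in sectors. Returns (start, length) of the
    stripe-aligned part, or None when no whole stripe is covered and
    the discard would simply be dropped.
    """
    end = start + length
    # Round the start up and the end down to stripe boundaries.
    aligned_start = -(-start // stripe_sectors) * stripe_sectors
    aligned_end = (end // stripe_sectors) * stripe_sectors
    if aligned_end <= aligned_start:
        return None
    return aligned_start, aligned_end - aligned_start


STRIPE = 1024  # assumed stripe width in sectors

print(clip_discard_to_stripes(100, 3000, STRIPE))   # unaligned, spans two whole stripes
print(clip_discard_to_stripes(100, 500, STRIPE))    # smaller than a stripe
print(clip_discard_to_stripes(2048, 1024, STRIPE))  # already aligned
```

The first call is trimmed at both ends to the two whole stripes it covers, the second is dropped entirely, and the third passes through unchanged.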
And since the mdraid layer is not currently planning to track what has
been discarded over time, when a re-shape comes along, it will
effectively un-trim everything and rewrite 100% of the FS.
The same thing will happen if a drive is cloned via dd as happens
pretty routinely.
Overall, I think Linux will need a mechanism to scan a filesystem and
re-issue all the trim commands in order to get the hardware back in
sync after a major maintenance activity. That mechanism could either be
admin-invoked or an always-on maintenance task.
Personally, I think the best option is a background task (kernel I
assume) to scan the filesystem and issue discards for all the data on
a slow but steady basis. If it takes a week to make its way around
the disk/volume, then it takes a week. Who really cares.
Once you assume you have that background task in place, I'm not sure
how important it is to even have the filesystem manage this in
realtime with the file deletes.
Greg
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 18:47 ` Greg Freemyer
@ 2009-05-11 19:22 ` Andreas Dilger
2009-05-11 23:38 ` Neil Brown
1 sibling, 0 replies; 11+ messages in thread
From: Andreas Dilger @ 2009-05-11 19:22 UTC (permalink / raw)
To: Greg Freemyer
Cc: Matthew Wilcox, Theodore Tso, Ric Wheeler, Jörn Engel,
Matthew Wilcox, Jens Axboe, linux-fsdevel, linux-ext4, Linux RAID
On May 11, 2009 14:47 -0400, Greg Freemyer wrote:
> Overall, I think Linux will need a mechanism to scan a filesystem and
> re-issue all the trim commands in order to get the hardware back in
> sync after a major maintenance activity. That mechanism could either be
> admin-invoked or an always-on maintenance task.
>
> Personally, I think the best option is a background task (kernel I
> assume) to scan the filesystem and issue discards for all the data on
> a slow but steady basis. If it takes a week to make its way around
> the disk/volume, then it takes a week. Who really cares.
I'd suggested that we can also modify e2fsck to (optionally) send the
definitive list of blocks to be trimmed at that time. It shouldn't
necessarily be done every time e2fsck is run, because that would
kill any chance of data recovery, but should be optional.
Other filesystem checking tools (say btrfs online check) can periodically
do the same - lock an idle group from new allocations, scan the allocation
bitmap for all unused blocks, send a trim command for any regions >=
erase block size, unlock group. It might make more sense to do this
than send thousands of trim operations while the filesystem is busy.
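The bitmap scan itself is simple; a minimal sketch follows. It assumes the allocation bitmap is available as a sequence of booleans (True meaning in use) and that the minimum extent size is expressed in filesystem blocks. The function name is hypothetical, and the locking of the group and the actual trim submission are left out.

```python
def free_extents(bitmap, min_blocks):
    """Yield (first_block, count) for each run of free blocks that is at
    least min_blocks long. bitmap[i] is True when block i is in use."""
    run_start = None
    for block, in_use in enumerate(bitmap):
        if not in_use:
            if run_start is None:
                run_start = block  # a new free run begins here
        elif run_start is not None:
            # The free run ended at the previous block.
            if block - run_start >= min_blocks:
                yield run_start, block - run_start
            run_start = None
    # Handle a free run that extends to the end of the group.
    if run_start is not None and len(bitmap) - run_start >= min_blocks:
        yield run_start, len(bitmap) - run_start


# Toy group of 14 blocks; 1 = in use, 0 = free.
group = [bool(b) for b in (1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0)]

# With an assumed erase block of 4 filesystem blocks, only the long runs qualify.
for first, count in free_extents(group, min_blocks=4):
    print(f"would trim blocks {first}..{first + count - 1}")
```

The two-block free run in the middle is skipped because it is smaller than the assumed erase block, which keeps the number of trim commands down.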
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 18:47 ` Greg Freemyer
2009-05-11 19:22 ` Andreas Dilger
@ 2009-05-11 23:38 ` Neil Brown
2009-05-12 13:28 ` Greg Freemyer
1 sibling, 1 reply; 11+ messages in thread
From: Neil Brown @ 2009-05-11 23:38 UTC (permalink / raw)
To: Greg Freemyer
Cc: Matthew Wilcox, Theodore Tso, Ric Wheeler, Jörn Engel,
Matthew Wilcox, Jens Axboe, linux-fsdevel, linux-ext4, Linux RAID
On Monday May 11, greg.freemyer@gmail.com wrote:
>
> And since the mdraid layer is not currently planning to track what has
> been discarded over time, when a re-shape comes along, it will
> effectively un-trim everything and rewrite 100% of the FS.
You might not call them "plans" exactly, but I have had thoughts
about tracking which parts of a raid5 had 'live' data and which were
trimmed. I think that is the only way I could support TRIM, unless
devices guarantee that all trimmed blocks read as zeros, and that seems
unlikely.
You are right that the granularity would have to be at least
one stripe.
And a re-shape would be interesting, wouldn't it! We could probably
avoid instantiating every trimmed block, but in general quite a few
would get instantiated. I hadn't thought about that...
NeilBrown
* Re: Is TRIM/DISCARD going to be a performance problem?
2009-05-11 23:38 ` Neil Brown
@ 2009-05-12 13:28 ` Greg Freemyer
0 siblings, 0 replies; 11+ messages in thread
From: Greg Freemyer @ 2009-05-12 13:28 UTC (permalink / raw)
To: Neil Brown
Cc: Matthew Wilcox, Theodore Tso, Ric Wheeler, Jörn Engel,
Matthew Wilcox, Jens Axboe, linux-fsdevel, linux-ext4, Linux RAID
On Mon, May 11, 2009 at 7:38 PM, Neil Brown <neilb@suse.de> wrote:
> On Monday May 11, greg.freemyer@gmail.com wrote:
>>
>> And since the mdraid layer is not currently planning to track what has
>> been discarded over time, when a re-shape comes along, it will
>> effectively un-trim everything and rewrite 100% of the FS.
>
> You might not call them "plans" exactly, but I have had thoughts
> about tracking which parts of a raid5 had 'live' data and which were
> trimmed. I think that is the only way I could support TRIM, unless
> devices guarantee that all trimmed blocks read as zeros, and that seems
> unlikely.
Neil,
Re: raid 5, etc. No FS info/discussion
The latest T13 proposed spec I saw explicitly allows reads from
trimmed sectors to return non-determinate data in some devices. There
is a per-device flag you can read to see if a device does that or not.
I think mdraid needs to simply assume all trimmed sectors return
non-determinate data. Either that, or simply check that per device
flag and refuse to accept a drive that supports returning
non-determinate data.
Regardless, ignoring reshape, why do you need to track it?
... thinking
Oh yes, you will have to track it at least at the stripe level.
If p = d1 ^ d2 is not guaranteed to be true due to a stripe discard,
and p, d1, and d2 are all potentially non-determinate, all is good at
first, because who cares that d1 = p ^ d2 is not true for your
discarded stripe? d1 is effectively just random data anyway.
But as soon as either d1 or d2 is written to, you will need to force
the entire stripe back into a determinate state or else you will have
unprotected data sitting on that stripe. You can only do that if you
know the entire stripe was previously indeterminate, thus you have no
option but to track the state of the stripes if mdraid is going to
support discards with devices that advertise themselves as returning
indeterminate data.
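A toy model of that bookkeeping might look like the sketch below. It is not a proposal for the md code; the class, its methods and the choice to zero-fill the untouched chunk are all assumptions made for illustration.

```python
def xor_bytes(a, b):
    """XOR two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class Stripe:
    """One stripe with two data chunks (d1, d2) and parity p = d1 ^ d2."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.data = [bytes(chunk_size), bytes(chunk_size)]
        self.parity = bytes(chunk_size)
        self.discarded = False  # the per-stripe state the array has to track

    def discard(self):
        # From here on, reads of d1, d2 and p may return anything.
        self.discarded = True

    def write(self, index, chunk):
        if self.discarded:
            # The old contents cannot be trusted, so a read-modify-write of
            # parity is impossible. Force the whole stripe back into a known
            # state first (zero-fill is an arbitrary choice here).
            self.data = [bytes(self.chunk_size), bytes(self.chunk_size)]
            self.discarded = False
        self.data[index] = chunk
        self.parity = xor_bytes(self.data[0], self.data[1])

    def rebuild(self, index):
        """Reconstruct one data chunk from parity and the other chunk."""
        return xor_bytes(self.parity, self.data[1 - index])


stripe = Stripe(chunk_size=4)
stripe.discard()
stripe.write(0, b"\x01\x02\x03\x04")  # first write after the discard
print(stripe.discarded, stripe.rebuild(0))
```

Without the discarded flag, the first write after a discard would compute parity against whatever the devices happened to return, and the freshly written chunk would be unprotected.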
So Neil, it looks like you need to move from thoughts about tracking
discards to planning to track discards.
FYI: I don't know if it is just for show, or if people really plan to do
it, but I have seen several people build up very high performance raid
arrays from SSDs already. It seems that about 8 SSDs max out the
current group of SATA controllers, PCI Express, etc.
Since SSDs with trim support should be even faster, I suspect these
ultra-high performance setups will want to use them.
Greg