* Re: Is TRIM/DISCARD going to be a performance problem?
From: Greg Freemyer @ 2009-05-11 13:10 UTC
To: Theodore Tso
Cc: Jörn Engel, Matthew Wilcox, Jens Axboe, Ric Wheeler,
    linux-fsdevel, linux-ext4, Linux RAID

On Mon, May 11, 2009 at 8:09 AM, Theodore Tso wrote:
> All of the web browsing I've done confirms that the ATA folks expect
> trim to work on 512-byte sector granularity.

Ted,

That implies that the SSD folks are not treating erase blocks as a
contiguous group of sectors.

For some reason, I thought there was only one mapping per erase block
and within the erase block the sectors were contiguous.

If I'm right, then the ATA spec may allow you to send sub-erase-block
trim commands down, but the spec does not prevent the (black-box)
hardware from clipping the size of the trim to be on erase block
boundaries and ignoring the sub-erase-block portions on each end, or
from ignoring the whole command if your trim command does not span a
whole erase block.

Also, the mdraid people plan to clip at the stripe width boundary for
RAID 5, 6, etc. Their expectation is that discards will be coalesced
into bigger blocks before they get to the mdraid layer.

I still think reshaping a RAID 5 online will be next to impossible when
some of the stripes may contain indeterminate data. More realistic is
to figure out a way to make it deterministic at least for the short
term (by writing data to all the trimmed blocks?), then reshaping, then
having a tool to scan the filesystem and re-issue all the trim
commands.

Obviously, if the ATA spec had a signaling mechanism that
differentiated between deterministic data and non-deterministic data,
then the extra work above could be simplified greatly.

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
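
The clipping behaviour described in this message (whether done by a
drive at erase-block boundaries or by mdraid at stripe boundaries) can
be sketched in a few lines of Python. This is only an illustration: the
function name `clip_to_aligned`, the use of sector units, and the
256-sector granule in the usage note are assumptions, not anything
taken from the ATA spec or the md code.

```python
def clip_to_aligned(start, length, granule):
    """Shrink the range [start, start + length) to whole granules.

    All values are in sectors. Returns (aligned_start, aligned_length),
    or None when the range does not cover even one whole granule, which
    models a layer that silently drops the request.
    """
    if granule <= 0:
        raise ValueError("granule must be positive")
    end = start + length
    # Round the start up and the end down to granule boundaries.
    aligned_start = -(-start // granule) * granule
    aligned_end = (end // granule) * granule
    if aligned_end <= aligned_start:
        return None
    return aligned_start, aligned_end - aligned_start
```

With a hypothetical granule of 256 sectors, a discard of sectors
100..1099 would shrink to 256..1023, and a 100-sector discard starting
at sector 10 would be dropped entirely.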
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Matthew Wilcox @ 2009-05-11 13:39 UTC
To: Greg Freemyer
Cc: Theodore Tso, Jörn Engel, Matthew Wilcox, Jens Axboe, Ric Wheeler,
    linux-fsdevel, linux-ext4, Linux RAID

On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
> That implies that the SSD folks are not treating erase blocks as a
> contiguous group of sectors.  For some reason, I thought there was
> only one mapping per erase block and within the erase block the
> sectors were contiguous.

I believe there is a mapping per LBA, not per erase block. Of course,
different technologies will have different limitations here, but it
would be foolish to assume anything about SSDs at this point.

(For those who haven't heard my disclaimer before, the Intel SSD team
don't tell me anything fun about how the drives work internally).

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Theodore Tso @ 2009-05-11 14:27 UTC
To: Greg Freemyer
Cc: Jörn Engel, Matthew Wilcox, Jens Axboe, Ric Wheeler,
    linux-fsdevel, linux-ext4, Linux RAID

On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
> That implies that the SSD folks are not treating erase blocks as a
> contiguous group of sectors.

Correct.

> For some reason, I thought there was
> only one mapping per erase block and within the erase block the
> sectors were contiguous.

No, if you try to treat erase blocks as a contiguous group of
sectors, you'll have terrible write amplification problems (leading to
premature death of the SSD) and terrible small random write
performance. Flash devices optimized for digital cameras might have
done that, but for SSDs, this will result in catastrophically bad
performance and a very limited lifespan. As I said, I expect these
SSDs to be weeded out of the market very shortly.

For any sane implementation of an SSD, the mapping will be on a
per-LBA basis, not on a per-erase-block basis.

> More realistic is to figure out a way to make it deterministic at
> least for the short term (by writing data to all the trimmed blocks?),
> then reshaping, then having a tool to scan the filesystem and re-issue
> all the trim commands.

Writing data to all of the trimmed blocks? Um, no. That would be a
disaster, since it accelerates the wear and tear of the SSD. The whole
*point* of the TRIM command is to avoid needing to do that.

The whole worry about determinism is highly overrated. If the
filesystem doesn't need a block, then it doesn't need it. What you
read after you send a TRIM command, whether it is the old data because
the device applied some kind of rounding, or random data, or all
zeros, won't matter to the filesystem. Why should the filesystem
care? I know I certainly don't....

						- Ted
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Ric Wheeler @ 2009-05-11 14:29 UTC
To: Theodore Tso
Cc: Greg Freemyer, Jörn Engel, Matthew Wilcox, Jens Axboe,
    linux-fsdevel, linux-ext4, Linux RAID

On 05/11/2009 10:27 AM, Theodore Tso wrote:
> On Mon, May 11, 2009 at 09:10:15AM -0400, Greg Freemyer wrote:
>> That implies that the SSD folks are not treating erase blocks as a
>> contiguous group of sectors.
>
> Correct.
>
>> For some reason, I thought there was
>> only one mapping per erase block and within the erase block the
>> sectors were contiguous.
>
> No, if you try to treat erase blocks as a contiguous group of
> sectors, you'll have terrible write amplification problems (leading to
> premature death of the SSD) and terrible small random write
> performance.  Flash devices optimized for digital cameras might have
> done that, but for SSDs, this will result in catastrophically bad
> performance and a very limited lifespan.  As I said, I expect these
> SSDs to be weeded out of the market very shortly.
>
> For any sane implementation of an SSD, the mapping will be on a
> per-LBA basis, not on a per-erase-block basis.
>
>> More realistic is to figure out a way to make it deterministic at
>> least for the short term (by writing data to all the trimmed blocks?),
>> then reshaping, then having a tool to scan the filesystem and re-issue
>> all the trim commands.
>
> Writing data to all of the trimmed blocks?  Um, no.  That would be a
> disaster, since it accelerates the wear and tear of the SSD.  The whole
> *point* of the TRIM command is to avoid needing to do that.
>
> The whole worry about determinism is highly overrated.  If the
> filesystem doesn't need a block, then it doesn't need it.  What you
> read after you send a TRIM command, whether it is the old data because
> the device applied some kind of rounding, or random data, or all
> zeros, won't matter to the filesystem.  Why should the filesystem
> care?  I know I certainly don't....
>
> 						- Ted

The key is not at the FS layer - this is an issue for people who RAID
these beasts together and want to actually check that the bits are what
they should be (say doing a checksum validity check for a stripe).

ric
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Theodore Tso @ 2009-05-11 14:50 UTC
To: Ric Wheeler
Cc: Greg Freemyer, Jörn Engel, Matthew Wilcox, Jens Axboe,
    linux-fsdevel, linux-ext4, Linux RAID

On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>
> The key is not at the FS layer - this is an issue for people who RAID
> these beasts together and want to actually check that the bits are what
> they should be (say doing a checksum validity check for a stripe).
>

Good point, yes I can see why they need that. In that case, the
storage device can't just silently truncate a TRIM request; it would
have to expose to the OS its alignment requirements. The risk though
is that the more they try to push this complexity into the OS, the
higher the risk that the OS will simply decide not to take advantage
of the functionality. Of course, there is the question why anyone
would want to build a software-RAID device on top of a
thin-provisioned hardware storage unit. :-)

						- Ted
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Ric Wheeler @ 2009-05-11 14:58 UTC
To: Theodore Tso
Cc: Greg Freemyer, Jörn Engel, Matthew Wilcox, Jens Axboe,
    linux-fsdevel, linux-ext4, Linux RAID

On 05/11/2009 10:50 AM, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>> The key is not at the FS layer - this is an issue for people who RAID
>> these beasts together and want to actually check that the bits are what
>> they should be (say doing a checksum validity check for a stripe).
>>
>
> Good point, yes I can see why they need that.  In that case, the
> storage device can't just silently truncate a TRIM request; it would
> have to expose to the OS its alignment requirements.  The risk though
> is that the more they try to push this complexity into the OS, the
> higher the risk that the OS will simply decide not to take advantage
> of the functionality.  Of course, there is the question why anyone
> would want to build a software-RAID device on top of a
> thin-provisioned hardware storage unit.  :-)
>
> 						- Ted

Probably not as uncommon as you would think, though not, as you
suggest, to RAID thin-provisioned LUNs (those are usually built as RAID
devices inside an array).

Think more of the array providing a thinly provisioned LUN made up of
T13 TRIM-enabled SSD devices internally. RAID makes sense here (data
protection is still needed to avoid a single point of failure), and the
relative expense of the SSD devices makes "thin provisioning" really
attractive to external users :-)

ric
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Matthew Wilcox @ 2009-05-11 15:00 UTC
To: Theodore Tso
Cc: Ric Wheeler, Greg Freemyer, Jörn Engel, Matthew Wilcox, Jens Axboe,
    linux-fsdevel, linux-ext4, Linux RAID

On Mon, May 11, 2009 at 10:50:59AM -0400, Theodore Tso wrote:
> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
> > The key is not at the FS layer - this is an issue for people who RAID
> > these beasts together and want to actually check that the bits are what
> > they should be (say doing a checksum validity check for a stripe).
>
> Good point, yes I can see why they need that.  In that case, the
> storage device can't just silently truncate a TRIM request; it would
> have to expose to the OS its alignment requirements.  The risk though
> is that the more they try to push this complexity into the OS, the
> higher the risk that the OS will simply decide not to take advantage
> of the functionality.  Of course, there is the question why anyone
> would want to build a software-RAID device on top of a
> thin-provisioned hardware storage unit.  :-)

It's not a problem for people who use Thin Provisioning, it's a problem
for people who want to run RAID-5 on top of SSDs. If you have a sector
whose reads are indeterminate, your parity for that stripe will always
be wrong.

--
Matthew Wilcox				Intel Open Source Technology Centre
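
The parity problem raised in this message can be shown with a toy
two-data-disk stripe. The sketch below assumes plain XOR parity, as in
RAID-5; the helper names and the 8-byte blocks are made up for
illustration, and the "after TRIM" contents are just a stand-in for
whatever a drive might return.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (RAID-5 style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def stripe_is_consistent(data_blocks, parity):
    """True if the stored parity matches the data currently read back."""
    return xor_blocks(data_blocks) == parity


d1 = bytes([0x11] * 8)
d2 = bytes([0x22] * 8)
parity = xor_blocks([d1, d2])

# Before any TRIM the stripe checks out, and d1 can be rebuilt from
# the parity block and d2.
assert stripe_is_consistent([d1, d2], parity)
assert xor_blocks([parity, d2]) == d1

# Stand-in for what a drive might return for d1 after a TRIM. The only
# point is that it need not be the old contents.
d1_after_trim = bytes([0xFF] * 8)
print(stripe_is_consistent([d1_after_trim, d2], parity))
```

Once any member of the stripe can read back as something other than
what the parity was computed from, the consistency check fails and a
rebuild of the other member from parity would produce garbage.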
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Greg Freemyer @ 2009-05-11 18:47 UTC
To: Matthew Wilcox
Cc: Theodore Tso, Ric Wheeler, Jörn Engel, Matthew Wilcox, Jens Axboe,
    linux-fsdevel, linux-ext4, Linux RAID

On Mon, May 11, 2009 at 11:00 AM, Matthew Wilcox <matthew@wil.cx> wrote:
> On Mon, May 11, 2009 at 10:50:59AM -0400, Theodore Tso wrote:
>> On Mon, May 11, 2009 at 10:29:51AM -0400, Ric Wheeler wrote:
>> > The key is not at the FS layer - this is an issue for people who RAID
>> > these beasts together and want to actually check that the bits are what
>> > they should be (say doing a checksum validity check for a stripe).
>>
>> Good point, yes I can see why they need that.  In that case, the
>> storage device can't just silently truncate a TRIM request; it would
>> have to expose to the OS its alignment requirements.  The risk though
>> is that the more they try to push this complexity into the OS, the
>> higher the risk that the OS will simply decide not to take advantage
>> of the functionality.  Of course, there is the question why anyone
>> would want to build a software-RAID device on top of a
>> thin-provisioned hardware storage unit.  :-)
>
> It's not a problem for people who use Thin Provisioning, it's a problem
> for people who want to run RAID-5 on top of SSDs.  If you have a sector
> whose reads are indeterminate, your parity for that stripe will always
> be wrong.

Thus my understanding that an entire stripe will either be discarded or
not by the mdraid layer.

And if a discard comes along from above that is smaller than a stripe,
then it will be tossed by the mdraid layer.

And if it is not aligned to the stripe geometry, then the start/end of
the discard area will be adjusted to be stripe aligned.

And since the mdraid layer is not currently planning to track what has
been discarded over time, when a re-shape comes along, it will
effectively un-trim everything and rewrite 100% of the FS.

The same thing will happen if a drive is cloned via dd, as happens
pretty routinely.

Overall, I think Linux will need a mechanism to scan a filesystem and
re-issue all the trim commands in order to get the hardware back in
sync after a major maintenance activity. That mechanism could be either
admin-invoked or an always-on maintenance task.

Personally, I think the best option is a background task (kernel I
assume) to scan the filesystem and issue discards for all the unused
space on a slow but steady basis. If it takes a week to make its way
around the disk/volume, then it takes a week. Who really cares.

Once you assume you have that background task in place, I'm not sure
how important it is to even have the filesystem manage this in realtime
with the file deletes.

Greg
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Andreas Dilger @ 2009-05-11 19:22 UTC
To: Greg Freemyer
Cc: Matthew Wilcox, Theodore Tso, Ric Wheeler, Jörn Engel,
    Matthew Wilcox, Jens Axboe, linux-fsdevel, linux-ext4, Linux RAID

On May 11, 2009 14:47 -0400, Greg Freemyer wrote:
> Overall, I think Linux will need a mechanism to scan a filesystem and
> re-issue all the trim commands in order to get the hardware back in
> sync after a major maintenance activity.  That mechanism could be
> either admin-invoked or an always-on maintenance task.
>
> Personally, I think the best option is a background task (kernel I
> assume) to scan the filesystem and issue discards for all the unused
> space on a slow but steady basis.  If it takes a week to make its way
> around the disk/volume, then it takes a week.  Who really cares.

I'd suggested that we can also modify e2fsck to (optionally) send the
definitive list of blocks to be trimmed at that time. It shouldn't
necessarily be done every time e2fsck is run, because that would kill
any chance of data recovery, but should be optional.

Other filesystem checking tools (say btrfs online check) can
periodically do the same: lock an idle group from new allocations,
scan the allocation bitmap for all unused blocks, send a trim command
for any regions >= erase block size, unlock group.

It might make more sense to do this than send thousands of trim
operations while the filesystem is busy.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
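
The bitmap scan step suggested in this message could look roughly like
the sketch below. It assumes the group's allocation bitmap is available
as a simple sequence of flags (true meaning "in use") and that the
minimum worthwhile region is given in blocks; `free_extents` is a
hypothetical name, and none of this is e2fsck or btrfs code. Locking
the group and actually issuing the discards are left out.

```python
def free_extents(bitmap, min_len):
    """Yield (start, length) for each run of free blocks >= min_len.

    bitmap is a sequence where a truthy entry means the block is in use.
    """
    run_start = None
    for i, used in enumerate(bitmap):
        if not used:
            if run_start is None:
                run_start = i          # a free run begins here
        elif run_start is not None:
            if i - run_start >= min_len:
                yield run_start, i - run_start
            run_start = None           # the run ended at an in-use block
    # Handle a free run that extends to the end of the bitmap.
    if run_start is not None and len(bitmap) - run_start >= min_len:
        yield run_start, len(bitmap) - run_start
```

A caller would hold the group locked while iterating and send one trim
per yielded extent, so short free runs scattered between live blocks
never generate a command at all.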
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Neil Brown @ 2009-05-11 23:38 UTC
To: Greg Freemyer
Cc: Matthew Wilcox, Theodore Tso, Ric Wheeler, Jörn Engel,
    Matthew Wilcox, Jens Axboe, linux-fsdevel, linux-ext4, Linux RAID

On Monday May 11, greg.freemyer@gmail.com wrote:
>
> And since the mdraid layer is not currently planning to track what has
> been discarded over time, when a re-shape comes along, it will
> effectively un-trim everything and rewrite 100% of the FS.

You might not call them "plans" exactly, but I have had thoughts
about tracking which parts of a raid5 had 'live' data and which were
trimmed. I think that is the only way I could support TRIM, unless
devices guarantee that all trimmed blocks read as zeros, and that seems
unlikely.

You are right that the granularity would have to be at least one
stripe.

And a re-shape would be interesting, wouldn't it! We could probably
avoid instantiating every trimmed block, but in general quite a few
would get instantiated. I hadn't thought about that...

NeilBrown
* Re: Is TRIM/DISCARD going to be a performance problem?
From: Greg Freemyer @ 2009-05-12 13:28 UTC
To: Neil Brown
Cc: Matthew Wilcox, Theodore Tso, Ric Wheeler, Jörn Engel,
    Matthew Wilcox, Jens Axboe, linux-fsdevel, linux-ext4, Linux RAID

On Mon, May 11, 2009 at 7:38 PM, Neil Brown <neilb@suse.de> wrote:
> On Monday May 11, greg.freemyer@gmail.com wrote:
>>
>> And since the mdraid layer is not currently planning to track what has
>> been discarded over time, when a re-shape comes along, it will
>> effectively un-trim everything and rewrite 100% of the FS.
>
> You might not call them "plans" exactly, but I have had thoughts
> about tracking which parts of a raid5 had 'live' data and which were
> trimmed.  I think that is the only way I could support TRIM, unless
> devices guarantee that all trimmed blocks read as zeros, and that seems
> unlikely.

Neil,

Re: raid 5, etc. No FS info/discussion.

The latest T13 proposed spec I saw explicitly allows reads from trimmed
sectors to return non-determinate data in some devices. There is a
per-device flag you can read to see if a device does that or not.

I think mdraid needs to simply assume all trimmed sectors return
non-determinate data. Either that, or simply check that per-device flag
and refuse to accept a drive that supports returning non-determinate
data.

Regardless, ignoring reshape, why do you need to track it?

... thinking

Oh yes, you will have to track it at least at the stripe level.

If p = d1 ^ d2 is not guaranteed to be true due to a stripe discard,
and p, d1, d2 are all potentially non-determinate, all is good at first,
because who cares that d1 = p ^ d2 is not true for your discarded
stripe. d1 is effectively just random data anyway.

But as soon as either d1 or d2 is written to, you will need to force
the entire stripe back into a determinate state or else you will have
unprotected data sitting on that stripe. You can only do that if you
know the entire stripe was previously indeterminate, thus you have no
option but to track the state of the stripes if mdraid is going to
support discards with devices that advertise themselves as returning
indeterminate data.

So Neil, it looks like you need to move from thoughts about tracking
discards to planning to track discards.

FYI: I don't know if it is just for show, or if people really plan to
do it, but I have seen several people build up very high performance
RAID arrays from SSDs already. It seems that about 8 SSDs maxes out the
current group of SATA controllers, PCI Express, etc.

Since SSDs with trim support should be even faster, I suspect these
ultra-high-performance setups will want to use them.

Greg
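
The per-stripe tracking argued for in this message can be modelled in a
few lines. The sketch below is a toy in-memory model, not md code: the
class name `TrackedStripes`, the zero-fill policy used to make a stripe
determinate again, and the fixed garbage patterns standing in for
indeterminate reads are all assumptions made for illustration.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (p = d1 ^ d2 ^ ...)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


class TrackedStripes:
    """Toy array that remembers which stripes have been discarded."""

    def __init__(self, n_stripes, n_data, block_size):
        self.n_data = n_data
        self.block_size = block_size
        zero = bytes(block_size)
        self.data = [[zero] * n_data for _ in range(n_stripes)]
        self.parity = [zero] * n_stripes
        self.trimmed = [False] * n_stripes

    def discard(self, stripe):
        # Model indeterminate contents: data and parity no longer agree.
        self.data[stripe] = [b"\xa5" * self.block_size] * self.n_data
        self.parity[stripe] = b"\x5a" * self.block_size
        self.trimmed[stripe] = True

    def write(self, stripe, index, block):
        if self.trimmed[stripe]:
            # First write after a discard: force the whole stripe back
            # into a known state before storing the new block.
            self.data[stripe] = [bytes(self.block_size)] * self.n_data
            self.trimmed[stripe] = False
        self.data[stripe][index] = block
        self.parity[stripe] = xor_blocks(self.data[stripe])

    def is_protected(self, stripe):
        # A trimmed stripe holds no live data, so nothing needs parity.
        return self.trimmed[stripe] or (
            xor_blocks(self.data[stripe]) == self.parity[stripe]
        )
```

Without the `trimmed` flag, the write path would have no way to know
that the other blocks in the stripe cannot be trusted, which is the
case where the newly written data would sit unprotected.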
Thread overview: 11+ messages
     [not found] <E1M2ts6-0005jr-Cg@closure.thunk.org>
     [not found] ` <20090510165259.GA31850@logfs.org>
     [not found]   ` <20090511083754.GA29082@mit.edu>
     [not found]     ` <20090511100624.GB6585@logfs.org>
     [not found]       ` <20090511112729.GD29082@mit.edu>
     [not found]         ` <20090511120936.GB6277@mit.edu>
2009-05-11 13:10           ` Is TRIM/DISCARD going to be a performance problem? Greg Freemyer
2009-05-11 13:39             ` Matthew Wilcox
2009-05-11 14:27             ` Theodore Tso
2009-05-11 14:29               ` Ric Wheeler
2009-05-11 14:50                 ` Theodore Tso
2009-05-11 14:58                   ` Ric Wheeler
2009-05-11 15:00                   ` Matthew Wilcox
2009-05-11 18:47                     ` Greg Freemyer
2009-05-11 19:22                       ` Andreas Dilger
2009-05-11 23:38                       ` Neil Brown
2009-05-12 13:28                         ` Greg Freemyer