* LVM striping RAID volumes
From: Douglas Siebert @ 2012-01-25 4:50 UTC
To: linux-raid
I recently bought two Intel 320 80GB SSDs and plan to use them on a
Fedora 16 system for boot/root, home and other frequently used data. I
planned to mirror them together using md raid/Linux software RAID for
reliability, and use LVM to partition. I did some investigation on
ensuring data alignment for best performance, etc., along with
verification that my criteria for setting them up could be met:
1) must be mirrored, since SSDs are similar to hard drives in their
reliability (i.e. lack thereof)
2) must support passing TRIM commands through the RAID layer (e.g.
ext4->LVM->RAID->SSD) to avoid write amplification that reduces SSD
lifetime and performance
3) ideally should maximize performance by splitting reads between both
SSDs
Unfortunately, my investigation determined that while both md raid and
dmraid/fakeraid can of course accomplish #1, only dmraid does #2 and
only md raid does #3! At first it looked like I had to give up #3, but
I think I have a way around this dilemma, at the cost of a bit of
additional complexity which I'm totally comfortable with.
My plan is to use dmraid/ICH10R to create two equal sized RAID-1
volumes, with the primary mirror of the first volume as sda and the
primary mirror of the second volume as sdb, then create all my LVM
volumes by striping extents from the two dmraid devices.
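
As a rough sketch of the layout described above (not a tested configuration): with a two-way striped logical volume, consecutive stripes alternate between the two RAID-1 volumes, so if each volume reads from a different primary SSD, sequential reads touch both drives. The 64 KiB stripe size and the names `raid1_a` and `raid1_b` are assumptions for illustration, and the primary-mirror behaviour is the premise stated in this message, not something the code verifies.

```python
STRIPE_SIZE = 64 * 1024  # bytes; assumed value, chosen only for illustration

# Hypothetical names: each dmraid RAID-1 volume paired with the SSD that is
# assumed to serve its reads (its "primary mirror").
VOLUMES = [("raid1_a", "sda"), ("raid1_b", "sdb")]


def locate(logical_offset):
    """Map a byte offset in a striped LV to (volume, primary SSD, offset in volume)."""
    stripe_index, within = divmod(logical_offset, STRIPE_SIZE)
    volume, primary_ssd = VOLUMES[stripe_index % len(VOLUMES)]
    # Each volume stores every len(VOLUMES)-th stripe, packed contiguously.
    volume_offset = (stripe_index // len(VOLUMES)) * STRIPE_SIZE + within
    return volume, primary_ssd, volume_offset
```

Calling `locate` for offsets 0, 64 KiB, 128 KiB and so on alternates between `sda` and `sdb`, which is the read-splitting effect the plan relies on.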
Am I correct that this will meet my criteria? Does anyone see another
method that avoids dmraid? If md raid handled TRIM it would be my
preference since I've used it before and have no need to dual boot with
Windows, but I guess it's not ready yet. Any gotchas to my solution I
need to worry about which I may be overlooking? Thanks for any tips or
suggestions!
PS - it's too bad you can't set the 'strip size' parameter for RAID 1 in
dmraid and have it swap its idea of "primary mirror" on each stripe.
Not sure if the Intel metadata format would allow storing that value for
a RAID 1 volume, but if it does this would be a way to avoid having all
reads directed at one drive. While there may be some good reasons to
avoid indiscriminately splitting reads between hard drives, with SSDs
the algorithm doesn't need much intelligence to see a major performance
boost...
--
Douglas Siebert
douglas-siebert@uiowa.edu
* Re: LVM striping RAID volumes
From: David Brown @ 2012-01-25 10:04 UTC
To: linux-raid

On 25/01/12 05:50, Douglas Siebert wrote:
> 1) must be mirrored since SSDs are similar to hard drives in their
> reliability (i.e. lack of)

SSD's are typically better than HD's, but certainly not failproof. I get
the impression that there is a bigger spread of reliabilities and
failures with SSD's than with HD's.

> 2) must support passing TRIM commands through the RAID layer (e.g.
> ext4->LVM->RAID->SSD) to avoid write amplification that reduces SSD
> lifetime and performance

That's not really necessary with modern SSD's - TRIM is overrated.
Garbage collection on current generations is so much better than on
earlier models that you generally don't have to worry about TRIM.

Dropping TRIM makes your life /much/ easier with SSD's, especially when
you want raid. According to some benchmarks I've seen, it also makes
the disk measurably faster.

> 3) ideally should maximize performance by splitting reads between both
> SSDs

Sounds like Linux raid10 to me.

> Unfortunately, my investigation determined that while both md raid and
> dmraid/fakeraid can of course accomplish #1, only dmraid does #2 and
> only md raid does #3! At first it looked like I had to give up #3, but
> I think I have a way around this dilemma, at the cost of a bit of
> additional complexity which I'm totally comfortable with.

Drop #2, and you are done.
* Re: LVM striping RAID volumes
From: Peter Grandi @ 2012-01-25 13:55 UTC
To: Linux RAID

>> 2) must support passing TRIM commands through the RAID layer
>> (e.g. ext4->LVM->RAID->SSD) to avoid write amplification that
>> reduces SSD lifetime and performance

> That's not really necessary with modern SSD's - TRIM is
> overrated. Garbage collection on current generations is so
> much better than on earlier models that you generally don't
> have to worry about TRIM.

Unfortunately it matters, and not necessarily just for write
amplification: the "cleaner" (aka garbage collector) is really helped
by TRIM.

The really big deal is that the FTL in the flash SSD cannot figure out
which flash-pages are unused, and cannot use a simple heuristic like
"it is all zeroes", because filesystem code does not zero unused
logical sectors when they are released but writes them only much later
when they are allocated. TRIM is just a way to ''write'' a logical
sector as unused without zero-filling it (or other implicit marks).

> Dropping TRIM makes your life /much/ easier with SSD's,
> especially when you want raid. According to some benchmarks
> I've seen, it also makes the disk measurably faster.

While something like TRIM is really important, TRIM has a bad
reputation; that is due to SATA TRIM being specified badly, as it is
specified to be synchronous (or cache-flushing or queue-flushing).

Anyhow, apart from write amplification, the really big deal is maximum
write latency (and relatedly read latency!). Consider this scary
comparison:

http://www.storagereview.com/images/samsung_830_raid_256gb_write_latency.png

as discussed in one of my many recent flash SSD blog entries:

http://www.sabi.co.uk/blog/12-one.html#120115

Since erasing a flash-block can take a long time, it is very important
for minimizing the highest write latency that the FTL have available a
pool of pre-erased flash-blocks, so they can be written (OR'ed) to
directly ("overprovisioning" in most flash SSDs is done to allow this
too).

The problem is that the "cleaner" (aka garbage collector) can only pack
"used" flash-pages together, thus creating empty flash-blocks, if it
knows which logical sectors and thus flash-pages are "unused".

Since the TRIM command is synchronous it is often a bad idea to use it
on every logical sector deallocation in filesystem code, but it or
FITRIM should be used at least periodically (for example during 'fsck')
to tell the FTL which logical sectors are unused so it can rebuild the
pool of empty flash-blocks; doing it periodically works around the
synchronous nature of SATA TRIM.

Also TRIM and FITRIM are useful for any case of virtualization, not
just for flash SSD layers, for example for "sparse" (aka thin
provisioning) VM disk images.

It would be nice if MD passed on TRIM or at least FITRIM. I have just
done a search and there is a discussion of some issues with that here:

http://lkml.indiana.edu/hypermail/linux/kernel/1011.2/02184.html

«the only really complex part is sending something like that
into MDraid, because that one set of ranges might explode into
thousands of ranges and then have to be coalesced back down to
a more manageable number of ranges.

ie. with a simple raid 0, each range will need to be broken
into a bunch of stride sized ranges, then the contiguous
strides on each spindle coalesced back into larger ranges.

But if MDraid can handle discards now with one range, it
should not be that hard to teach it handle a group of ranges.»

This perplexes me because the logic should be identical to that of
writing: TRIM is in effect a variant of WRITE.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
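
The range handling described in the quoted lkml message can be sketched roughly as follows. This is only an illustration of the arithmetic for a simple RAID 0 layout, not md's actual code; the function name, the chunk size and the disk count are assumptions. One discard range on the array is cut at chunk boundaries, each piece is mapped to a member disk, and pieces that end up contiguous on the same disk are merged again.

```python
def split_discard(start, length, chunk, ndisks):
    """Split a discard of [start, start + length) on a RAID-0 array into
    per-disk (offset, length) ranges, merging pieces that are contiguous
    on the same member disk. All values are in the same unit (e.g. sectors)."""
    per_disk = {disk: [] for disk in range(ndisks)}
    pos, end = start, start + length
    while pos < end:
        chunk_index, within = divmod(pos, chunk)
        disk = chunk_index % ndisks
        disk_offset = (chunk_index // ndisks) * chunk + within
        piece = min(chunk - within, end - pos)  # stop at chunk or range end
        ranges = per_disk[disk]
        if ranges and ranges[-1][0] + ranges[-1][1] == disk_offset:
            # Contiguous with the previous piece on this disk: coalesce.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + piece)
        else:
            ranges.append((disk_offset, piece))
        pos += piece
    return per_disk
```

In this simplified model a single contiguous range on the array always collapses back to one range per disk, which is consistent with the remark above that the logic resembles that of a write.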
* Re: LVM striping RAID volumes
From: David Brown @ 2012-01-25 17:56 UTC
Cc: Linux RAID

On 25/01/12 14:55, Peter Grandi wrote:
> While something like TRIM is really important, there is a bad
> reputation of TRIM, but it is due to SATA TRIM being specified
> badly, as it is specified to be synchronous (or cache-flushing
> or queue flushing).

I've read about this in a few places - there are several failing points
in SATA TRIM that make it difficult to implement and much less useful
than it could be.

One problem is that TRIM is synchronous, as you say. That means if it
is used during deletes, it makes them much slower - potentially very
much slower. Secondly, there is no consistency as to what is read back
from a trimmed sector. Had it always been read as zero, it would be
much better suited to raid.

> Since erasing a flash-block can take a long time, it is very
> important for minimizing the highest write latency that the FTL
> have available a pool of pre-erased flash-blocks, so they can be
> written (OR'ed) to directly ("overprovisioning" in most flash
> SSDs is done to allow this too).

Overprovisioning is the key here. When the SSD has more flash space
than is visible to the OS, then that space is always guaranteed free -
though not necessarily in contiguous erase blocks. The more such free
space there is, the higher the chances of there being free full blocks
when they are needed, and the more flexibility the SSD firmware has in
combining partly-written blocks to free up full erase blocks.

So if you have sufficient free space due to overprovisioning, you quite
simply do not need TRIM, as TRIM is just an expensive way of increasing
this free space.

How much overprovisioning you want depends on how much you want to
reduce the risk of unexpected latencies, and how much extra space you
are willing to pay for. More expensive (or rather, higher quality)
SSD's have more overprovisioning. You can also make your own
overprovisioning by simply not allocating all the disk when partitioning
it (or using a smaller "size" when using the whole disk in an mdadm
raid). Since there is an area that is never written to, it is
effectively extra overprovisioned space.
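
The arithmetic behind "making your own overprovisioning" can be sketched as below. The sizes are made-up example figures, the function name is hypothetical, and the factory spare area of any real drive is not known here, so it is left as a parameter defaulting to zero. The sketch also assumes the unallocated area really has never been written.

```python
GB = 10**9  # drive vendors quote decimal gigabytes


def spare_fraction(visible_bytes, allocated_bytes, factory_spare_bytes=0):
    """Fraction of the flash that never holds user data, counting both the
    hidden factory spare area and any visible space left unallocated."""
    if allocated_bytes > visible_bytes:
        raise ValueError("cannot allocate more than the visible capacity")
    total = visible_bytes + factory_spare_bytes
    return (total - allocated_bytes) / total


# Example: partition only 72 GB of an 80 GB drive (illustrative numbers).
extra = spare_fraction(80 * GB, 72 * GB)
```

With these example numbers, one tenth of the visible capacity is left for the firmware to use in addition to whatever the drive reserves internally.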
* Re: LVM striping RAID volumes
From: Douglas Siebert @ 2012-01-26 3:42 UTC
To: David Brown; +Cc: linux-raid

On Wed, 2012-01-25 at 18:56 +0100, David Brown wrote:
> One problem is that TRIM is synchronous, as you say. That means if it
> is used during deletes, it makes them much slower - potentially very
> much slower. Secondly, there is no consistency as to what is read back
> from a trimmed sector. Had it always been read as zero, it would suit
> much better for raid.

As far as Linux software RAID goes, end users would currently only care
about TRIM when using RAID with a pair of SSDs. So in that case,
require enabling the write intent bitmaps when enabling TRIM support. I
believe this would eliminate the concern about what gets read back from
a trimmed sector. I realize benchmarks show bitmaps to slow things down
a lot, but I'm assuming that's because of the slow seeks involved in
writing them to hard drives. With SSDs no such concern would exist.

Your point about TRIM potentially slowing things down due to the
synchronous nature of the ATA 3.0 spec is well taken, but you don't have
to mount your filesystems with -o discard. You can just run fstrim out
of cron daily. That's exactly what I'm planning to do, and I think most
people using TRIM are doing so until SSDs support the ATA 3.1 spec's
asynchronous TRIM.

> So if you have sufficient free space due to overprovisioning, you quite
> simply do not need TRIM, as TRIM is just an expensive way of increasing
> this free space.
>
> How much overprovisioning you want depends on how much you want to
> reduce the risk of unexpected latencies, and how much extra space you
> are willing to pay for. More expensive (or rather, higher quality)
> SSD's have more overprovisioning. You can also make your own
> overprovisioning by simply not allocating all the disk when partitioning
> it (or using a smaller "size" when using the whole disk in an mdadm
> raid). Since there is an area that is never written to, it is
> effectively extra overprovisioned space.

It sounds like you are saying TRIM is unnecessary because you can just
allocate less space than you have on the device. That may be true, but
I can equally say that overprovisioning is unnecessary because you can
just use TRIM! Overprovisioning should only be required where it
wouldn't happen naturally, such as using an SSD for raw volumes on a DB.

Overprovisioning happens as a matter of course when used for a
filesystem, since most filesystems maintain at least 5% free space, and
sometimes more, to avoid fragmentation problems. Unfortunately even if
your filesystem always has 5% free space, after a while due to that
fragmentation it is likely that all blocks have been written to at least
once. That's what TRIM fixes. Overprovisioning beyond that is silly
and wasteful, when a perfectly good fix exists. Your argument is rather
like saying that Linux shouldn't worry about being efficient in its
operation, because you can always buy more CPU and memory than you need.

One additional point. TRIM is not just for SSDs. SCSI/FC supports two
commands similar in meaning to TRIM (and to each other, don't get me
started...) that have usefulness way beyond SSDs. EMC for example
supports them in their high end VMAX arrays on both thin provisioned AND
traditional "thick" LUNs. Why on thick LUNs? Because knowing that a
block is no longer in use is very useful for stuff like copies,
snapshots and especially when sending data between arrays over WAN
links. For exactly the same reasons, information about blocks no longer
in use could be quite useful to the Linux device mapper layer. It would
be a shame if Linux mdadm raid became marginalized in the future due to
lack of support for TRIM/discard semantics.

--
Douglas Siebert
douglas-siebert@uiowa.edu
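
The point about free filesystem space not helping the SSD without TRIM can be illustrated with a toy model. Everything here is an assumption for illustration: the function name, the block counts, and the allocator, which deletes the oldest block and reuses the longest-free block; real filesystems and FTLs are far more complex. The model counts how many logical blocks the drive must treat as holding data.

```python
def blocks_ssd_must_preserve(total_blocks, free_blocks, cycles, use_trim):
    """Return how many logical blocks the SSD believes contain live data
    after `cycles` rounds of 'delete one block, write one new block'."""
    in_use = list(range(total_blocks - free_blocks))        # oldest first
    free = list(range(total_blocks - free_blocks, total_blocks))
    ssd_live = set(in_use)  # blocks written and never trimmed
    for _ in range(cycles):
        victim = in_use.pop(0)      # filesystem deletes its oldest block
        free.append(victim)
        if use_trim:
            ssd_live.discard(victim)  # TRIM tells the drive it is unused
        new_block = free.pop(0)     # allocate the longest-free block
        in_use.append(new_block)
        ssd_live.add(new_block)     # writing makes it live for the drive
    return len(ssd_live)
```

With 100 blocks and 5 kept free, the drive ends up preserving all 100 blocks without TRIM but only 95 with it, even though the filesystem's own usage is the same in both cases.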
* Re: LVM striping RAID volumes [not found] ` <20120126034746.0D87428A6E@zebra.redhouse.homelinux.net> @ 2012-01-26 12:04 ` David Brown 0 siblings, 0 replies; 7+ messages in thread From: David Brown @ 2012-01-26 12:04 UTC (permalink / raw) To: douglas-siebert; +Cc: linux-raid On 26/01/2012 04:42, Douglas Siebert wrote: > On Wed, 2012-01-25 at 18:56 +0100, David Brown wrote: >> On 25/01/12 14:55, Peter Grandi wrote: >>>>> 2) must support passing TRIM commands through the RAID layer >>>>> (e.g. ext4->LVM->RAID->SSD) to avoid write amplification that >>>>> reduces SSD lifetime and performance >>> >>>> That's not really necessary with modern SSD's - TRIM is >>>> overrated. Garbage collection on current generations is so >>>> much better than on earlier models that you generally don't >>>> have to worry about TRIM. >>> >>> Unfortunately not necessarily just for write amplification, and >>> the "cleaner" (aka garbage collector) is really helped by TRIM. >>> >>> The really big deal is that the FTL in the flash SSD cannot >>> figure out which flash-pages are unused, and cannot use a simple >>> heuristic like "it is all zeroes" because filesystem code do not >>> zero unused logical sectors when they are released but writes >>> them only much later when they are allocated. TRIM is just a a >>> way to ''write'' a logical sector as unused without zero-filling >>> it (or other implicit marks). >>> >>>> Dropping TRIM makes your life /much/ easier with SSD's, >>>> especially when you want raid. According to some benchmarks >>>> I've seen, it also makes the disk measurably faster. >>> >>> While something like TRIM is really important, there is a bad >>> reputation of TRIM, but it is due to SATA TRIM being specified >>> badly, as it is specified to be synchronous (or cache-flushing >>> or queue flushing). >>> >> >> I've read about this in a few places - there are several failing points >> in SATA TRIM that make it difficult to implement and much less useful >> than it could be. 
>> >> One problem is that TRIM is synchronous, as you say. That means if it >> is used during deletes, it makes them much slower - potentially very >> much slower. Secondly, there is no consistency as to what is read back >> from a trimmed sector. Had it always been read as zero, it would suit >> much better for raid. > > > As far as Linux software RAID goes, end users would currently only care > about TRIM when using using RAID with a pair of SSDs. So in that case, > require enabling the write intent bitmaps when enabling TRIM support. I > believe this would eliminate the concern about what gets read back from > a trimmed sector. I realize benchmarks show bitmaps to slow things down > a lot, but I'm assuming that's because writing them to hard drives is > the cause due to their slow seeks. With SSDs no such concern would > exist. > It is not the seek time that makes TRIM slow, it is the synchronous and non-queued nature of it - the flow of data onto and out of the SSD is blocked until the TRIM is issued and completed. > Your point about TRIM potentially slowing things down due to the > synchronous nature of the ATA 3.0 spec is well taken, but you don't have > to mount your filesystems with -o discard. You can just run fstrim out > of cron daily. That's exactly what I'm planning to do, and I think most > people using TRIM are doing so until SSDs support the ATA 3.1 spec's > asynchronous TRIM. > Currently, fstrim is the recommended way to do trimming on Linux. I believe it only works for some filesystems (ext4 and xfs?). The trim commands don't pass through the md layer - Neil Brown has explained on this list already that it is difficult to do efficiently, and is low priority for development. The key problem is that because read-backs of trimmed blocks are not specified or consistent, you have to trim a whole stripe at a time. That means you have to track and record the trims until you have got a whole stripe, then apply it. 
I see a number of ways to improve the situation: 1. Hope that the ATA 3.1 specs make the asynchronous trim always "zero" the block. Then at the md layer could implement it as a "write a block of zeros" as far as parity and stripe consistency are concerned. 2. Only track the last few trim commands at the md layer, and only in memory - don't try to record them in the metadata. Combine the incoming trim commands if they are adjacent. If a full stripe has been trimmed, then pass that on to the devices - if not, just forget about the partial trims. This would not help anyone using "-o discard" mounts, but would fit perfectly with fstrim, and be far easier to implement in the md layer. Because reading trimmed blocks gives unspecified data, the trimmed stripes would not necessarily be consistent - so this would have to wait until md implements tracking of synchronised and non-synchronised blocks. 3. Translate trims into pure "write zero block" commands, and even pass them out to the SSD as "write zero block". Many modern SSD's compress the data, so that a "write zero block" will actually use very little flash space, and will free up used space. Being a simple write, it should be easy to keep everything consistent. 4. Publish some benchmarks showing how little TRIM affects real-world performance (using a single SSD without md raid), comparing different SSD's and different overprovisioning. There is no point in putting serious effort into solving this "problem" until it is clearly established that it /is/ a problem. Conversely, if it can be clearly shown that that it is not a problem, then people can stop worrying about it. > >> >>> >>> Anyhow, apart from write amplification, the really big deal is >>> maximum write latency (and relatedly read latency!). 
Consider >>> this scary comparison: >>> >>> http://www.storagereview.com/images/samsung_830_raid_256gb_write_latency.png >>> >>> as discussed in one of my many recent flash SSD blog entries: >>> >>> http://www.sabi.co.uk/blog/12-one.html#120115 >>> >>> Since erasing a flash-block can take a long time, it is very >>> important for minimizing the highest write latency that the FTL >>> have available a pool of pre-erased flash-blocks, so they can be >>> written (OR'ed) to directly ("overprovisioning" in most flash >>> SSDs is done to allow this too). >>> >> >> Overprovisioning is the key here. When the SSD has more flash space >> than is visible to the OS, then that space is always guaranteed free - >> though not necessarily in contiguous erase blocks. The more such free >> space there is, the higher the chances of their being free full blocks >> when they are needed, and the more flexibility the SSD firmware has in >> combining partly-written blocks to free up full erase blocks. >> >> So if you have sufficient free space due to overprovisioning, you quite >> simply do not need TRIM, as TRIM is just an expensive way of increasing >> this free space. >> >> How much overprovisioning you want depends on how much you want to >> reduce the risk of unexpected latencies, and how much extra space you >> are willing to pay for. More expensive (or rather, higher quality) >> SSD's have more overprovisioning. You can also make your own >> overprovisioning by simply not allocating all the disk when partitioning >> it (or using a smaller "size" when using the whole disk in an mdadm >> raid). Since there is an area that is never written to, it is >> effectively extra overprovisioned space. > > > It sounds like you are saying TRIM is unnecessary because you can just > allocate less space than you have on the device. That may be true, but > I can equally say that overprovisioning is unnecessary because you can > just use TRIM! 
> Overprovisioning should only be required where it
> wouldn't happen naturally, such as using an SSD for raw volumes on a DB.
>
> Overprovisioning happens as a matter of course when used for a
> filesystem, since most filesystems maintain at least 5% free space, and
> sometimes more, to avoid fragmentation problems. Unfortunately even if
> your filesystem always has 5% free space, after a while due to that
> fragmentation it is likely that all blocks have been written to at least
> once. That's what TRIM fixes. Overprovisioning beyond that is silly
> and wasteful, when a perfectly good fix exists. Your argument is rather
> like saying that Linux shouldn't worry about being efficient in its
> operation, because you can always buy more CPU and memory than you need.
>

There is no point in a filesystem maintaining 5% free space, especially
on an SSD - fragmentation is a non-issue on SSD's (and often overrated
as a problem on HD's). So rather than having 5% left on the filesystem,
you have 5% left on the disk. From the user viewpoint, you have lost
nothing (or at least, nothing that you hadn't already lost!).

TRIM can only be of benefit when there are files being deleted from the
filesystem - if you are relying on it, then your performance will
plummet as you approach 95% full (using the same 5% example figure -
actual values will vary by SSD, by usage patterns, and by disk size).

So you have to ask yourself - do you want a filesystem that is painfully
slow at 95% full, or do you want a filesystem that is 5% smaller but
full speed all the time?

> One additional point. TRIM is not just for SSDs. SCSI/FC supports two
> commands similar in meaning to TRIM (and to each other, don't get me
> started...) that have usefulness way beyond SSDs. EMC for example
> supports them in their high end VMAX arrays on both thin provisioned AND
> traditional "thick" LUNs. Why on thick LUNs?
> Because knowing that a
> block is no longer in use is very useful for stuff like copies,
> snapshots and especially when sending data between arrays over WAN
> links. For exactly the same reasons, information about blocks no longer
> in use could be quite useful to the Linux device mapper layer. It would
> be a shame if Linux mdadm raid became marginalized in the future due to
> lack of support for TRIM/discard semantics.
>

My knowledge of SCSI is limited, but I think this is a case where SCSI
does the right thing while SATA is a poor copy (NCQ is the other example
of a similar situation). My understanding is that SCSI's equivalent of
TRIM is asynchronous, queueable, and properly specified. But I don't
know whether md's lack of support here is an issue for such systems.

mvh.,

David
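Item 2 in the list above (remember only the last few trims in memory,
merge adjacent ones, forward fully trimmed stripes, forget the rest) can
be sketched roughly as follows. This is only an illustration in Python,
not md code: the names TrimTracker, stripe_sectors and max_ranges are
hypothetical, ranges are half-open sector intervals, and the tracking of
synchronised and non-synchronised blocks that the item depends on is left
out entirely.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end), in sectors


class TrimTracker:
    """Hypothetical in-memory tracker for recent trims (illustration only)."""

    def __init__(self, stripe_sectors: int, max_ranges: int = 8) -> None:
        self.stripe_sectors = stripe_sectors
        self.max_ranges = max_ranges
        self.pending: List[Range] = []  # oldest first, never persisted

    def trim(self, start: int, length: int) -> List[Range]:
        """Record a trim; return the full-stripe ranges to pass down."""
        lo, hi = start, start + length
        others: List[Range] = []
        for a, b in self.pending:
            if a <= hi and lo <= b:
                # Overlapping or adjacent: absorb into the new range.
                lo, hi = min(lo, a), max(hi, b)
            else:
                others.append((a, b))

        s = self.stripe_sectors
        first = -(-lo // s) * s  # lo rounded up to a stripe boundary
        last = (hi // s) * s     # hi rounded down to a stripe boundary

        forward: List[Range] = []
        if first < last:
            forward.append((first, last))
            leftovers = [(lo, first), (last, hi)]
        else:
            leftovers = [(lo, hi)]

        # Partial pieces are remembered as the most recent entries;
        # anything older than the last few is simply forgotten.
        others.extend(r for r in leftovers if r[0] < r[1])
        self.pending = others[-self.max_ranges:]
        return forward
```

With an 8-sector stripe, two adjacent 4-sector trims at sectors 0 and 4
produce nothing on the first call and the full stripe (0, 8) on the
second, which is the fstrim-friendly behaviour described in the item.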
* Re: LVM striping RAID volumes
2012-01-25 13:55 ` Peter Grandi
2012-01-25 17:56 ` David Brown
@ 2012-01-25 20:59 ` Andrei Warkentin
1 sibling, 0 replies; 7+ messages in thread
From: Andrei Warkentin @ 2012-01-25 20:59 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID
Hi Peter,

----- Original Message -----
> From: "Peter Grandi" <pg@lxra2.to.sabi.co.UK>
> To: "Linux RAID" <linux-raid@vger.kernel.org>
> Sent: Wednesday, January 25, 2012 8:55:37 AM
> Subject: Re: LVM striping RAID volumes
>
> It would be nice if MD passed on TRIM or at least FITRIM, and I
> have just done a search and there is a discussion of some issues
> with that here:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1011.2/02184.html
>
> «the only really complex part is sending something like that
> into MDraid, because that one set of ranges might explode into
> thousands of ranges and then have to be coalesced back down to
> a more manageable number of ranges.
>
> ie. with a simple raid 0, each range will need to be broken
> into a bunch of stride sized ranges, then the contiguous
> strides on each spindle coalesced back into larger ranges.
>
> But if MDraid can handle discards now with one range, it
> should not be that hard to teach it handle a group of ranges.»
>
> This perplexes me because the logic should be identical to that
> of writing: TRIM is in effect a variant of WRITE.
> --

That comment perplexes me as well. Who cares how many TRIMs go down
into the slave devices? My understanding is that it was the drive's job
to figure out TRIM-coalescing on its own - the block queue limits at the
very least specify a "max discard" size, not a minimum one. (btw FITRIM
is an fs ioctl afaik, it gets translated into REQ_DISCARDs).

Anyway, I was just looking at supporting REQ_DISCARD within RAID1. I
don't see it being any different than handling other BIO flags like
REQ_FUA etc. The only bit to watch out for is not sending DISCARDs to a
device which doesn't support them.
I was looking at discard as part of some other investigation which I
will write about at a later point, but if there was specific interest,
I could refactor it out and post it for review.

A
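The RAID 0 case from the quoted lkml text (break a range into
chunk-sized pieces, then coalesce the contiguous pieces on each member)
can be sketched like this. It is a rough Python illustration rather than
md code: the function name split_discard_raid0 and its parameters are
hypothetical, and it assumes a plain RAID 0 layout with equal-sized
members and a fixed chunk size, with all values in sectors.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_discard_raid0(start: int, length: int, chunk: int,
                        ndisks: int) -> Dict[int, List[Tuple[int, int]]]:
    """Map a logical discard to per-member (offset, length) ranges.

    Assumes a simple RAID 0 with equal-sized members (illustration only).
    """
    per_disk: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    pos, end = start, start + length
    while pos < end:
        chunk_no, within = divmod(pos, chunk)
        n = min(chunk - within, end - pos)  # stay inside this chunk
        disk = chunk_no % ndisks
        dev_off = (chunk_no // ndisks) * chunk + within
        ranges = per_disk[disk]
        if ranges and ranges[-1][0] + ranges[-1][1] == dev_off:
            # Contiguous with the previous piece on this member: coalesce.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + n)
        else:
            ranges.append((dev_off, n))
        pos += n
    return dict(per_disk)
```

Under these assumptions a single contiguous logical range ends up as at
most one range per member, since consecutive chunks on a member are
adjacent on that member. For example, 16 sectors from sector 0 with a
4-sector chunk on two members become one 8-sector range on each.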
end of thread

Thread overview: 7+ messages
2012-01-25 4:50 LVM striping RAID volumes Douglas Siebert
2012-01-25 10:04 ` David Brown
2012-01-25 13:55 ` Peter Grandi
2012-01-25 17:56 ` David Brown
2012-01-26 3:42 ` Douglas Siebert
[not found] ` <20120126034746.0D87428A6E@zebra.redhouse.homelinux.net>
2012-01-26 12:04 ` David Brown
2012-01-25 20:59 ` Andrei Warkentin