* Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
From: Doug Dumitru
Date: 2015-12-23 19:20 UTC
To: linux-raid

The issue:

The background thread in RAID-5 can wake up in the middle of a process
populating stripe cache entries with a long write. If the long write
contains a complete stripe, the background thread "should" be able to
process the request without doing any reads.

Sometimes the background thread is too quick at starting a write and
schedules an RMW (Read Modify Write) even though the needed blocks will
soon be available.

Seeing this happen:

You can see this happen by creating an MD set with a small chunk size
and then doing DIRECT_IO writes that are exactly aligned on a stripe.
For example, with 4 disks and 64K chunks, write 192K blocks aligned on
192K boundaries. You can do this from C or with 'dd' or 'fio'.

If you have this running, you can then run iostat and you should see
absolutely no read activity on the disks.

The probability of this happening goes up when there are more disks.
It may also go up the faster the disks are. My use case is 24 SSDs.

The problem with this:

There are really three issues.

1) The code does not need to work this way. It is not "broken" but
   just seems wrong.
2) There is a performance penalty here.
3) There is a Flash wear penalty here.

It is 3) that most interests me.

The fix:

Create a waitq or semaphore based lock so that, if a write includes a
complete stripe, the background thread will wait for the write to
completely populate the stripe cache.

I would do this with a small array of locks. When a write includes a
complete stripe, it takes a lock (stripe_number % sizeof_lock_array).
This lock is released as soon as the write finishes populating the
stripe cache. The background thread checks this lock before it starts
a write. If the lock is held, it waits until the stripe cache is
completely populated, which should eliminate the RMW.

If no writes are full stripes, then the lock never gets taken, so most
code runs without any real overhead.

Implementing this:

I am happy to implement this. I have quite a bit of experience with
lock structures like this. I can also test on x86 and x86_64, but will
need help with other arch's.

Then again, if this is too much of an "edge case", I will just keep my
patches in-house.

--
Doug Dumitru
WildFire Storage
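A minimal sketch of the lock array described above. All of the names
here (NR_STRIPE_LOCKS, stripe_locks, stripe_lock, stripe_locks_init)
are made up for illustration and do not exist in md/raid5; this is only
one way the idea could look, not a working patch.

    #include <linux/semaphore.h>
    #include <linux/types.h>

    #define NR_STRIPE_LOCKS 64	/* power of two so the modulo is a mask */

    static struct semaphore stripe_locks[NR_STRIPE_LOCKS];

    static void stripe_locks_init(void)
    {
    	int i;

    	for (i = 0; i < NR_STRIPE_LOCKS; i++)
    		sema_init(&stripe_locks[i], 1);
    }

    /* Map a stripe number onto one of the shared locks. */
    static struct semaphore *stripe_lock(sector_t stripe_nr)
    {
    	return &stripe_locks[stripe_nr & (NR_STRIPE_LOCKS - 1)];
    }

    /*
     * Writer with a full-stripe write:
     *	down(stripe_lock(nr));
     *	... populate the stripe cache from the bio ...
     *	up(stripe_lock(nr));
     *
     * Background thread, before building the write for stripe nr:
     *	down(stripe_lock(nr));
     *	up(stripe_lock(nr));
     * i.e. take and immediately drop the lock, so it blocks only while
     * a full-stripe write is still copying data in.
     */

With 64 entries the array is on the order of 1.5K of semaphores, which
matches the memory estimate given later in this thread.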
* RE: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
From: Robert Kierski
Date: 2015-12-23 20:28 UTC
To: doug@easyco.com, linux-raid

Hey Doug,

I have a comment about the last line of your message -- 'Then again, if
this is too much of an "edge case", I will just keep my patches
in-house.'

As the use of SSDs in RAID configurations becomes more common, and SSDs
become faster, the case you're talking about will be a major issue
preventing customers from getting all that they expect from the
equipment they purchased. I would encourage you to continue pursuing
better performance.

Bob Kierski
Senior Storage Performance Engineer
Cray Inc.
380 Jackson Street, Suite 210
St. Paul, MN 55101
Tele: 651-967-9590
Fax: 651-605-9001
Cell: 651-890-7461
* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
From: Doug Dumitru
Date: 2015-12-23 20:37 UTC
To: Robert Kierski
Cc: linux-raid

Bob,

In thinking about this, I suspect this is somewhat less of an edge case
than just the full stripe test. Even partial stripes probably fire
early, increasing the amount of IO.

My issue with SSD wear is that my particular application stack
generates writes that are "perfect", and even very subtle deviations
from this can scramble some SSDs' FTLs. Workloads that are more typical
will already push the SSDs into heavier wear amplification, so this
issue might be in the noise.

On Wed, Dec 23, 2015 at 12:28 PM, Robert Kierski <rkierski@cray.com> wrote:
> Hey Doug,
>
> I have a comment about the last line of your message -- 'Then again,
> if this is too much of an "edge case", I will just keep my patches
> in-house.'
>
> As the use of SSDs in RAID configurations becomes more common, and
> SSDs become faster, the case you're talking about will be a major
> issue preventing customers from getting all that they expect from the
> equipment they purchased. I would encourage you to continue pursuing
> better performance.
>
> Bob Kierski
> Senior Storage Performance Engineer
> Cray Inc.
> 380 Jackson Street, Suite 210
> St. Paul, MN 55101
> Tele: 651-967-9590
> Fax: 651-605-9001
> Cell: 651-890-7461

--
Doug Dumitru
WildFire Storage
* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
From: NeilBrown
Date: 2015-12-25 7:58 UTC
To: doug, linux-raid

On Thu, Dec 24 2015, Doug Dumitru wrote:

> The issue:
>
> The background thread in RAID-5 can wake up in the middle of a process
> populating stripe cache entries with a long write. If the long write
> contains a complete stripe, the background thread "should" be able to
> process the request without doing any reads.
>
> Sometimes the background thread is too quick at starting a write and
> schedules an RMW (Read Modify Write) even though the needed blocks
> will soon be available.
>
> Seeing this happen:
>
> You can see this happen by creating an MD set with a small chunk size
> and then doing DIRECT_IO writes that are exactly aligned on a stripe.
> For example, with 4 disks and 64K chunks, write 192K blocks aligned
> on 192K boundaries. You can do this from C or with 'dd' or 'fio'.
>
> If you have this running, you can then run iostat and you should see
> absolutely no read activity on the disks.
>
> The probability of this happening goes up when there are more disks.
> It may also go up the faster the disks are. My use case is 24 SSDs.
>
> The problem with this:
>
> There are really three issues.
>
> 1) The code does not need to work this way. It is not "broken" but
>    just seems wrong.
> 2) There is a performance penalty here.
> 3) There is a Flash wear penalty here.
>
> It is 3) that most interests me.
>
> The fix:
>
> Create a waitq or semaphore based lock so that, if a write includes a
> complete stripe, the background thread will wait for the write to
> completely populate the stripe cache.
>
> I would do this with a small array of locks. When a write includes a
> complete stripe, it takes a lock (stripe_number % sizeof_lock_array).
> This lock is released as soon as the write finishes populating the
> stripe cache. The background thread checks this lock before it starts
> a write. If the lock is held, it waits until the stripe cache is
> completely populated, which should eliminate the RMW.
>
> If no writes are full stripes, then the lock never gets taken, so most
> code runs without any real overhead.
>
> Implementing this:
>
> I am happy to implement this. I have quite a bit of experience with
> lock structures like this. I can also test on x86 and x86_64, but
> will need help with other arch's.
>
> Then again, if this is too much of an "edge case", I will just keep my
> patches in-house.

Hi,
this is certainly something that needs fixing. I can't really say if
your approach would work or not without seeing it and testing it on a
variety of work loads.

Certainly if you do implement something, please post it for others to
test and review. If it makes measurable improvements without causing
significant regressions, it will likely be included upstream.

Thanks,
NeilBrown
* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
From: Doug Dumitru
Date: 2015-12-30 19:28 UTC
To: NeilBrown
Cc: linux-raid

A new lock for RAID-4/5/6 to minimize read/modify/write operations.

The Problem: The background thread for raid can wake up asynchronously,
and will sometimes start processing a write before the writing thread
has finished updating the stripe cache blocks. If the calling thread's
write was a long write (longer than chunk size), the background thread
will configure a raid write operation that is sub-optimal, resulting in
extra IO operations, slower performance, and higher wear on Flash
storage.

The Easy Fix: When the calling thread has a long write, it "locks" the
stripe number with a semaphore. When the background thread wakes up and
starts working on a stripe, it takes the same lock and then immediately
releases it. This way the background thread will wait for the write to
fully populate the stripe cache before it starts to build a write
request.

The Lock Structure: ... is just a small, power-of-two-sized array of
semaphores. The lock is picked by "stripe_number %
number_of_semaphores". There will be collisions, but they should be
short, and a lot of IO is happening for each operation, so a semaphore
is cheap enough.

Memory Usage: With 64 semaphores, this adds about 1.5K to the raid
control block.

A More Comprehensive Fix: The Easy Fix assumes that updates are
contained within a single BIO request. A better fix would be to lock
linear operations that span BIOs by looking back on the queue. I don't
have a lot of experience playing with queues, but this is probably
workable with only a little more complexity, provided you don't try to
do it across cores and queues. Again, I don't play much in queues, so I
may be totally missing the structures here.

The Really High Performance Fix: If the application is well enough
behaved to write complete, perfect stripes contained in a single BIO
request, then the whole stripe cache logic can be bypassed. This lets
you submit the member disk IO operations directly from the calling
thread. I have this running in a patch in the field and it works well,
but the use case is very limited, and something probably breaks with
more "normal" IO patterns. I have hit 11 GB/sec with RAID-5 and 8
GB/sec with RAID-6 this way with 24 SSDs.

Tweak-ability: All of these changes can be exposed in /sys to allow
sysadmins to tune their systems, possibly enabling or disabling
features. This is most useful for early code that might have broken use
cases. Then again, too many knobs sometimes just increases confusion.

Asking for Feedback: I am happy to write "all of the above", submit it,
and work with the group to get it tested. If this interests you, please
comment on how far you think I should go. Also, if there are any notes
on "submission style" -- how and where to post patches, which kernel
version to patch/develop against, documentation style, sign-off
requirements, etc. -- please point me at them.

Thanks in advance,

Doug Dumitru
WildFire Storage
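As a sketch of just the detection step for the "Really High Performance
Fix" above: a full-stripe, single-BIO write can be recognized from its
alignment and length alone. The helper below is illustrative only;
bio_is_full_stripe() is not an existing md function, and the caller is
assumed to pass chunk_sectors and the number of data disks (raid_disks
minus max_degraded for RAID-5/6) from its own configuration.

    #include <linux/bio.h>
    #include <linux/kernel.h>

    /*
     * True if @bio starts on a stripe boundary and covers exactly one
     * full stripe of data, i.e. it could in principle be turned into
     * member-disk writes with no reads at all.
     */
    static bool bio_is_full_stripe(struct bio *bio,
    			       unsigned int chunk_sectors,
    			       int data_disks)
    {
    	unsigned int stripe_sectors = chunk_sectors * data_disks;
    	sector_t start = bio->bi_iter.bi_sector;

    	if (bio_sectors(bio) != stripe_sectors)
    		return false;

    	/* sector_div() returns the remainder of start / stripe_sectors. */
    	return sector_div(start, stripe_sectors) == 0;
    }

Anything that fails this test would still have to fall back to the
normal stripe-cache path.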
* Re: Create Lock to Eliminate RMW in RAID/456 when writing perfect stripes
From: NeilBrown
Date: 2016-01-11 7:44 UTC
To: doug
Cc: linux-raid

On Thu, Dec 31 2015, Doug Dumitru wrote:

> A new lock for RAID-4/5/6 to minimize read/modify/write operations.
>
> The Problem: The background thread for raid can wake up
> asynchronously, and will sometimes start processing a write before the
> writing thread has finished updating the stripe cache blocks. If the
> calling thread's write was a long write (longer than chunk size), the
> background thread will configure a raid write operation that is
> sub-optimal, resulting in extra IO operations, slower performance, and
> higher wear on Flash storage.
>
> The Easy Fix: When the calling thread has a long write, it "locks" the
> stripe number with a semaphore. When the background thread wakes up
> and starts working on a stripe, it takes the same lock and then
> immediately releases it. This way the background thread will wait for
> the write to fully populate the stripe cache before it starts to build
> a write request.

The code does something a lot like this already.

When the filesystem starts a write, it calls blk_start_plug(). When it
finishes, it calls blk_finish_plug(). md/raid5 detects this plugging,
and when it gets a bio and attaches it to a "stripe_head", the
stripe_head is queued on a delayed-list which is not processed until
blk_finish_plug() is called.

So we should already not start processing a request until we have the
whole request. But this sometimes doesn't work. Understanding why it
doesn't work and what actually happens would be an important first step
to fixing the problem.

The plugging doesn't guarantee that a request will be delayed - doing
that can too easily lead to deadlocks. Rather, it just discourages
early processing. If a memory shortage appears and RAID5 could free up
some memory by processing a request earlier, it is better to do that
than to wait indefinitely and possibly deadlock. It is possible that
this safety-valve code is triggering too early in some cases.

> The Really High Performance Fix: If the application is well enough
> behaved to write complete, perfect stripes contained in a single BIO
> request, then the whole stripe cache logic can be bypassed. This lets
> you submit the member disk IO operations directly from the calling
> thread. I have this running in a patch in the field and it works
> well, but the use case is very limited, and something probably breaks
> with more "normal" IO patterns. I have hit 11 GB/sec with RAID-5 and
> 8 GB/sec with RAID-6 this way with 24 SSDs.

It would certainly be interesting to find a general solution which
allowed full stripes to be submitted without a context switch. It
should be possible. There is already code which avoids copying from
the filesystem buffer into the stripe cache. Detecting complete
stripes and submitting them immediately should be possible. Combining
several stripe_heads into a whole stripe should be possible in many
cases using the new stripe-batching.

> Tweak-ability: All of these changes can be exposed in /sys to allow
> sysadmins to tune their systems, possibly enabling or disabling
> features. This is most useful for early code that might have broken
> use cases. Then again, too many knobs sometimes just increases
> confusion.
>
> Asking for Feedback: I am happy to write "all of the above", submit
> it, and work with the group to get it tested. If this interests you,
> please comment on how far you think I should go. Also, if there are
> any notes on "submission style" -- how and where to post patches,
> which kernel version to patch/develop against, documentation style,
> sign-off requirements, etc. -- please point me at them.

Patches should be sent to linux-raid@vger.kernel.org.
Documentation/SubmittingPatches should describe all the required style.

I think a really important first step is to make sure you understand
how the current code is supposed to work, and why it fails sometimes.
The above short notes should get you started in following what should
happen.

I really don't think locks as you describe them would be part of a
solution. Flag bits in the stripe_heads, different queues of
stripe_heads, and different queuing disciplines might be.

Thanks,
NeilBrown
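The plugging NeilBrown describes is driven from the submitting side
roughly as sketched below. This is a caller-side illustration only, not
raid5 internals: submit_full_stripe() is a made-up helper, and
generic_make_request() is the bio submission entry point of the kernels
this thread dates from (it has since been replaced), so treat the exact
call as era-specific.

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /*
     * Submit all bios belonging to one full-stripe write inside a
     * single plug.  While the plug is held, md/raid5 parks the affected
     * stripe_heads on its delayed list, so it normally sees the whole
     * stripe before choosing between reconstruct-write and
     * read-modify-write.
     */
    static void submit_full_stripe(struct bio **bios, int nr)
    {
    	struct blk_plug plug;
    	int i;

    	blk_start_plug(&plug);
    	for (i = 0; i < nr; i++)
    		generic_make_request(bios[i]);
    	blk_finish_plug(&plug);
    }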