* Raid/5 optimization for linear writes
From: Doug Dumitru @ 2010-12-29  3:38 UTC
To: linux-raid

Hello all,

I have been using an in-house mod to the raid5.c driver to optimize for
linear writes.  The optimization is probably too specific for general
kernel inclusion, but I wanted to throw out what I have been doing in
case anyone is interested.

The application involves a kernel module that can produce precisely
aligned, long, linear writes.  In the case of raid-5, the obvious plan
is to issue writes that are complete raid stripes of
'optimal_io_length'.

Unfortunately, optimal_io_length is often larger than the advertised
max io_buf size value, and sometimes larger than the system max io_buf
size value.  Thus just pumping up the max value inside of raid5 is
dubious.  Even though dubious, punching up
mddev->queue->limits.max_hw_sectors does seem to work, does not break
anything obvious, and helps performance a little.

In looking at long linear writes with the stock raid5 driver, I am
seeing a small number of reads issued to individual devices.  The test
application calling the raid layer has > 100MB of locked kernel buffer
slamming the raid5 driver, so exactly why raid5 needs to back-fill with
reads is not clear to me.  Looking at the raid5 code, there does not
appear to be a real "scheduler" for deciding when to back-fill the
stripe cache; instead it just relies on thread round trips.  In my
case, I am testing on server-class systems with 8 or 16 3GHz threads,
so availability of CPU cycles for the raid5 code is very high.

My patch ended up special-casing a single inbound bio that contains a
write for a single full raid stripe.  For an 8-drive raid-5 with 64K
chunks, this is 7 * 64K, or a 448KB IO.  With 4K pages this is a
bi_io_vec array of 112 pages.  Big for kernel memory generally, but
easily handled by server systems.  With more drives, you can be talking
well over 1MB in a single bio call.

The patch takes this special-case write and makes sure the array is
raid-5 with layout 2, is not degraded, and is not migrating.  If all of
these are true, the code allocates a new bi_io_vec and pages for the
parity stripe, new bios for each drive, computes parity "in thread",
and then issues simultaneous IOs to all of the devices.  A single bio
completion function catches any errors and completes the IO.

My testing is all done using SSDs.  I have tests for 8 drives and for
32 partitions on the 8 drives.  The drives themselves do about
100MB/sec per drive.  With the stock code I tend to get 550 MB/sec with
8 drives and 375 MB/sec with 32 partitions on 8 drives.  With the
patch, both configurations yield about 670 MB/sec, which is within 5%
of theoretical bandwidth.

My "fix" for linear writes is probably way too "myopic" for general
kernel use, but it does show that, properly fed, really big raid/456
arrays should be able to crank linear bandwidth far beyond the current
code base.

What is really needed is some general technique to give the raid driver
a "hint" that an IO stream is linear writes so that it will not try to
back-fill too eagerly.  Exactly how such a hint can make it back up the
bio stack is the real trick.

I am happy to discuss this on-list or privately.

--
Doug Dumitru
EasyCo LLC

ps: I am also working on patches to propagate "discard" requests
through the raid stack, but don't have any operational code yet.
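To make the special case concrete, below is a small userspace sketch of
just the "compute parity in thread" step for this geometry (8 drives,
64K chunks).  It is not the in-house patch and uses no kernel
interfaces; it only shows the XOR across the 7 data chunks and the
448KB / 112-page sizing mentioned above.

/*
 * Userspace sketch of full-stripe parity for an 8-drive raid-5 with
 * 64 KiB chunks: 7 data chunks = 448 KiB per full-stripe write, which
 * is 112 4 KiB pages in the inbound bi_io_vec.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NDISKS       8
#define NDATA        (NDISKS - 1)            /* 7 data chunks per stripe  */
#define CHUNK_BYTES  (64 * 1024)             /* 64 KiB chunk size         */
#define STRIPE_BYTES (NDATA * CHUNK_BYTES)   /* 448 KiB full-stripe write */

/* XOR all data chunks into the parity buffer, "in thread". */
static void compute_parity(const uint8_t *stripe, uint8_t *parity)
{
        memset(parity, 0, CHUNK_BYTES);
        for (int d = 0; d < NDATA; d++) {
                const uint8_t *chunk = stripe + (size_t)d * CHUNK_BYTES;
                for (size_t i = 0; i < CHUNK_BYTES; i++)
                        parity[i] ^= chunk[i];
        }
}

int main(void)
{
        uint8_t *stripe = malloc(STRIPE_BYTES);
        uint8_t *parity = malloc(CHUNK_BYTES);

        if (!stripe || !parity)
                return 1;

        memset(stripe, 0xA5, STRIPE_BYTES);  /* stand-in for real write data */
        compute_parity(stripe, parity);

        printf("full stripe: %d KB payload, %d x 4K pages, parity %d KB\n",
               STRIPE_BYTES / 1024, STRIPE_BYTES / 4096, CHUNK_BYTES / 1024);
        /* prints: full stripe: 448 KB payload, 112 x 4K pages, parity 64 KB */

        free(stripe);
        free(parity);
        return 0;
}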
* Re: Raid/5 optimization for linear writes
From: Roberto Spadim @ 2010-12-30 14:36 UTC
To: doug; +Cc: linux-raid

Could we have a separate write algorithm and read algorithm for each
raid type?  We don't need to change the default md algorithm, just add
an option to select an algorithm.  That would be good, since new
developers could "plug in" new read/write algorithms.

Thanks

2010/12/29 Doug Dumitru <doug@easyco.com>:
> Hello all,
>
> I have been using an in-house mod to the raid5.c driver to optimize
> for linear writes.  The optimization is probably too specific for
> general kernel inclusion, but I wanted to throw out what I have been
> doing in case anyone is interested.
> [...]

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
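A rough sketch of the kind of pluggable per-raid-type read/write policy
being suggested here (hypothetical types and names -- nothing like this
exists in the current md code, and how a policy would be selected,
e.g. via a sysfs attribute, is left open):

/* Userspace sketch of a selectable stripe-write policy table. */
#include <stdio.h>

struct stripe_ctx {                 /* minimal stand-in for struct stripe_head */
        int disks;                  /* total devices in the array              */
        int blocks_present;         /* data blocks already in the stripe cache */
};

struct md_io_policy {
        const char *name;
        /* return 1 to back-fill missing blocks with reads now,
         * 0 to wait for more writes to arrive */
        int (*schedule_write)(const struct stripe_ctx *sh);
};

/* Default-style policy: back-fill as soon as the stripe is incomplete. */
static int eager_backfill(const struct stripe_ctx *sh)
{
        return sh->blocks_present < sh->disks - 1;
}

/* Linear-write policy: never back-fill; wait for the rest of the stripe. */
static int lazy_backfill(const struct stripe_ctx *sh)
{
        (void)sh;
        return 0;
}

static const struct md_io_policy policies[] = {
        { "eager", eager_backfill },
        { "lazy",  lazy_backfill  },
};

int main(void)
{
        struct stripe_ctx sh = { .disks = 8, .blocks_present = 3 };

        for (unsigned i = 0; i < sizeof(policies) / sizeof(policies[0]); i++)
                printf("%s: back-fill now = %d\n",
                       policies[i].name, policies[i].schedule_write(&sh));
        return 0;
}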
* Re: Raid/5 optimization for linear writes
From: Doug Dumitru @ 2010-12-30 18:47 UTC
To: Roberto Spadim; +Cc: linux-raid

What I have been working on does not change the raid algorithm.  The
issue is scheduling.  When raid/456 gets a write, it needs to write not
only the new blocks but also the associated parity blocks.  In order to
calculate the parity blocks, it needs data from the other blocks in the
same stripe set.  The question is: a) should the raid code issue read
requests for the needed blocks, or b) should it wait for more write
requests in the hope that those requests will contain the needed
blocks?  Both of these approaches are wrong some of the time.  To make
things worse, with some drives, guessing wrong just a fraction of a
percent of the time can hurt performance dramatically.

In my case, if the raid code can get an entire stripe in a single write
request, then it can bypass most of the raid logic and just "compute
and go".  Unfortunately, such big requests break a lot of conventions
about how big requests can be, especially for large drive-count arrays.

Doug Dumitru
EasyCo LLC

On Thu, Dec 30, 2010 at 6:36 AM, Roberto Spadim <roberto@spadim.com.br> wrote:
> Could we have a separate write algorithm and read algorithm for each
> raid type?  We don't need to change the default md algorithm, just add
> an option to select an algorithm.  That would be good, since new
> developers could "plug in" new read/write algorithms.
> [...]
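As a rough illustration of why guessing wrong is costly, here is a toy
model (a hypothetical helper, not the md code) of the per-stripe device
IOs for the three cases: read-modify-write, back-filling with reads
(reconstruct-write), and the full-stripe "compute and go" path, for an
8-drive raid-5:

/* Toy model of per-stripe IO counts for a raid-5 write covering k of
 * the n-1 data chunks. */
#include <stdio.h>

struct plan { int reads, writes; };

static struct plan plan_stripe_write(int n, int k)
{
        struct plan p;

        if (k == n - 1) {               /* full stripe: "compute and go"   */
                p.reads  = 0;
                p.writes = n;           /* n-1 data chunks plus parity     */
        } else if (2 * (k + 1) < n) {   /* read-modify-write is cheaper    */
                p.reads  = k + 1;       /* old data chunks plus old parity */
                p.writes = k + 1;       /* new data chunks plus new parity */
        } else {                        /* reconstruct-write (back-fill)   */
                p.reads  = (n - 1) - k; /* read the untouched data chunks  */
                p.writes = k + 1;       /* new data chunks plus new parity */
        }
        return p;
}

int main(void)
{
        for (int k = 1; k <= 7; k++) {
                struct plan p = plan_stripe_write(8, k);

                printf("8 disks, %d data chunks written: %d reads, %d writes\n",
                       k, p.reads, p.writes);
        }
        return 0;
}

Only the full-stripe case needs zero reads, which is what the
single-bio special case above exploits.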