* Accelerating Linux software raid
From: Dan Williams @ 2005-09-06 18:24 UTC
To: linux-raid; +Cc: dave.jiang
Hello,
I am writing to the list to gauge interest in a modification of the
md driver that allows it to take advantage of raid acceleration
hardware. I/O processors like the Intel IOP333
(http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
engine for raid5 and raid6 calculations, but currently the md driver
does not fully utilize these resources.
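
For concreteness, the raid5 parity such an engine computes is just the
byte-wise XOR of the data blocks in a stripe, so any single missing
block can be rebuilt by xoring the survivors with parity. A minimal C
sketch of the operation (an illustration only, not the md
implementation) looks like this:

  #include <stddef.h>
  #include <stdint.h>

  /* P = D0 ^ D1 ^ ... ^ Dn-1, computed a byte at a time */
  static void xor_parity(uint8_t *parity, uint8_t *const src[],
                         size_t nsrc, size_t len)
  {
          for (size_t i = 0; i < len; i++) {
                  uint8_t p = src[0][i];
                  for (size_t s = 1; s < nsrc; s++)
                          p ^= src[s][i];
                  parity[i] = p;
          }
  }

The kernel's software xor routines and an offload engine both perform
this same operation; the difference is where the bytes get crunched and
how much it costs to start each operation.
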
Dave Jiang wrote a driver that re-routed calls to xor_block() to use
the hardware xor engine. However, from my understanding, he found that
performance did not improve, due to the fact that md deals in
PAGE_SIZE (4K) blocks. At 4K the overhead of setting up the engine
destroys any performance advantage over a software xor. The goal of
the modification would be to enable md to understand the capacity of
the platform's xor resources and allow it to issue optimal block
sizes.
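
As a toy model of why block size matters (the 10us setup cost and
2 GB/s engine bandwidth below are assumptions chosen for illustration,
not IOP333 measurements), an engine with a fixed per-operation setup
cost only approaches its raw bandwidth once blocks are large enough to
amortize that cost:

  #include <stdio.h>

  int main(void)
  {
          const double setup_s   = 10e-6; /* assumed setup + completion cost */
          const double engine_bw = 2e9;   /* assumed engine bandwidth, bytes/s */
          const double sizes[]   = { 4096, 64 * 1024, 1024 * 1024 };

          for (int i = 0; i < 3; i++) {
                  double t = setup_s + sizes[i] / engine_bw;
                  printf("%8.0f-byte xor: effective %6.1f MB/s\n",
                         sizes[i], sizes[i] / t / 1e6);
          }
          return 0;
  }

With these made-up numbers a 4K operation reaches only a few hundred
MB/s, while a 1MB operation runs near the engine's full rate, which is
the effect described above.
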
The first question is whether a solution along these lines would be
valued by the community. The effort is non-trivial.
Assuming a positive response I will solicit implementation ideas and
acceptance criteria from the list.
Thank you for your consideration,
Dan

* Re: Accelerating Linux software raid
From: Molle Bestefich @ 2005-09-06 21:52 UTC
To: linux-raid

Dan Williams wrote:
> The first question is whether a solution along these lines would be
> valued by the community. The effort is non-trivial.

I don't represent the community, but I think the idea is great.

When will it be finished and where can I buy the hardware? :-)

And if you don't mind terribly, could you also add hardware
acceleration support to loop-aes now that you're at it? :-)

* Re: Accelerating Linux software raid
From: Mark Hahn @ 2005-09-10 4:51 UTC
To: Dan Williams; +Cc: linux-raid

> I am writing to the list to gauge interest in a modification of the
> md driver that allows it to take advantage of raid acceleration

not that much, I think.

> hardware. I/O processors like the Intel IOP333
> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
> engine for raid5 and raid6 calculations, but currently the md driver
> does not fully utilize these resources.

the worst insult in the linux world is "solution in search of a
problem".  that applies here: are you sure that there is a problem?
yes, offload can be a lovely thing, but it often falls behind the main
driver of the industry - host cpu performance.  unless you're
specifically targeting a high-IO device with very little CPU power, I
think you'll find a lot of skepticism about IO coprocessors.

I have a server that can do the raid5 checksum at 8 GB/s, and have no
reason to ever want more than ~100 MB/s on that machine.  do I care
about "wasting" 1/80th of the machine?  not really, even though it's a
supercomputing cluster node.  for fileservers, I mind even less
wasting CPU using the host for parity, since the cycles aren't going
to be used for anything else...

> Dave Jiang wrote a driver that re-routed calls to xor_block() to use
> the hardware xor engine. However, from my understanding, he found
> that performance did not improve, due to the fact that md deals in
> PAGE_SIZE (4K) blocks.

well, it calls xor_block with STRIPE_SIZE, which is indeed PAGE_SIZE.

> destroys any performance advantage over a software xor. The goal of
> the modification would be to enable md to understand the capacity of
> the platform's xor resources and allow it to issue optimal block
> sizes.

this argument assumes that the HW xor is actually faster than the
host, though.  is that true?  otherwise, HW xor starts out behind due
to the setup overhead, and falls further behind for larger stripe
sizes!

> The first question is whether a solution along these lines would be
> valued by the community. The effort is non-trivial.

if you're interested in speeding up MD, then looking at how to make
STRIPE_SIZE bigger might be worthwhile (independent of HW xor).

regards, mark hahn.

* Re: Accelerating Linux software raid
From: Ric Wheeler @ 2005-09-10 12:58 UTC
To: Mark Hahn; +Cc: Dan Williams, linux-raid

Mark Hahn wrote:
>> hardware. I/O processors like the Intel IOP333
>> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
>> engine for raid5 and raid6 calculations, but currently the md driver
>> does not fully utilize these resources.
>
> the worst insult in the linux world is "solution in search of a
> problem".  that applies here: are you sure that there is a problem?
> yes, offload can be a lovely thing, but it often falls behind the
> main driver of the industry - host cpu performance.  unless you're
> specifically targeting a high-IO device with very little CPU power,
> I think you'll find a lot of skepticism about IO coprocessors.
>
> I have a server that can do the raid5 checksum at 8 GB/s, and have no
> reason to ever want more than ~100 MB/s on that machine.  do I care
> about "wasting" 1/80th of the machine?  not really, even though it's
> a supercomputing cluster node.  for fileservers, I mind even less
> wasting CPU using the host for parity, since the cycles aren't going
> to be used for anything else...

I think that the above holds for server applications, but there are
lots of places where you will start to see a need for serious IO
capabilities in low power, multi-core designs.  Think of your Tivo
starting to store family photos - you don't want to bolt a server
class box under your TV in order to get some reasonable data
protection ;-)

In the Centera group where I work, we have a linux based box that is
used for archival storage.  Customers understand why the cost of a box
is related to the number of disks, but the strength of the CPU, memory
subsystem, etc. are all more or less thought of as overhead (not to
mention that nasty software stuff that I work on ;-)).  In this kind
of environment as well, finding an elegant way to take advantage of
the capabilities of the new system-on-a-chip parts is a win.

Also keep in mind that the xor done for simple RAID is not the whole
story - think of compression offload, encryption, etc., which might
also be able to leverage a well thought out solution.

ric

* Re: Accelerating Linux software raid
From: Mark Hahn @ 2005-09-10 15:35 UTC
To: Ric Wheeler; +Cc: Dan Williams, linux-raid

> I think that the above holds for server applications, but there are
> lots of places where you will start to see a need for serious IO
> capabilities in low power, multi-core designs.  Think of your Tivo
> starting to store family photos - you don't want to bolt a server
> class box under your TV in order to get some reasonable data
> protection ;-)

I understand your point, but are the numbers right?  it seems to me
that the main factor in appliance design is power dissipation, and I'm
guessing a budget of say 20W for the CPU.  these days, that's a pretty
fast processor, of the mobile-athlon-64 range - probably 3 GB/s xor
performance.  I'd guess it amounts to perhaps 5-10% cpu overhead if
the appliance were, for some reason, writing at 100 MB/s.  of course,
it is NOT writing at that rate (remember, reading doesn't require
xors, and appliances probably do more reads than writes...)

> In the Centera group where I work, we have a linux based box that is
> used for archival storage.  Customers understand why the cost of a
> box is related to the number of disks, but the strength of the CPU,
> memory subsystem, etc. are all more or less thought of as overhead
> (not to mention that nasty software stuff that I work on ;-)).

again, no offense meant, but I hear you saying "we under-designed the
centera host processor, and over-priced it, so that people are trying
to stretch their budget by piling on too many disks".  I'm actually a
little surprised, since I figured the Centera design would be a sane,
modern, building-block-based one, where you could cheaply scale the
number of host processors, not just disks (like an old-fashioned,
not-mourned SAN).  I see a lot of people using a high-performance
network like IB as an internal backplane-like way to tie together a
cluster-in-a-box (and I expect they'll sprint from IB to 10G real soon
now).

but then again, you did say this was an archive box.  so what is the
bandwidth of data coming in?  that's the number that sizes your host
cpu.  being able to do xor at 12 GB/s is kind of pointless if the
server has just one or two 2 Gb net links...

> In this kind of environment as well, finding an elegant way to take
> advantage of the capabilities of the new system-on-a-chip parts is a
> win.

I understand completely - aesthetic elegance does tend to inspire most
solutions-in-search-of-problems.  it's one of those urges that we all
must simply learn to stifle every day.

> Also keep in mind that the xor done for simple RAID is not the whole
> story - think of compression offload, encryption, etc., which might
> also be able to leverage a well thought out solution.

this is an excellent point, and one that argues *against* HW
coprocessing.  consider the NIC market: TOE never happened because
adding tcp/ssl to a separate card just moves the complexity and bugs
from an easy-to-patch place into a harder-to-patch place.  I'd much
rather upgrade from a uni server to a dual and run the tcp/ssl in
software than spend the same amount of money on a $2000 nic that runs
its own OS.  my tcp stack bugs get fixed in a few hours if I email
netdev, but who knows how long bugs would linger in the firmware stack
of a TOE card?

same thing here, except more so.  making storage appliances smarter is
great, but why put that smarts in some kind of opaque, inaccessible
and hard-to-use coprocessor?  good, thoughtful design leads towards a
loosely-coupled cluster of off-the-shelf components...

regards, mark hahn.
(I run a large supercomputing center, and spend a lot of effort
specifying and using big compute and storage hardware...)

* Re: Accelerating Linux software raid
From: Dan Williams @ 2005-09-10 19:13 UTC
To: Mark Hahn; +Cc: Ric Wheeler, linux-raid

> this is an excellent point, and one that argues *against* HW
> coprocessing.  consider the NIC market: TOE never happened because
> adding tcp/ssl to a separate card just moves the complexity and bugs
> from an easy-to-patch place into a harder-to-patch place.  I'd much
> rather upgrade from a uni server to a dual and run the tcp/ssl in
> software than spend the same amount of money on a $2000 nic that
> runs its own OS.  my tcp stack bugs get fixed in a few hours if I
> email netdev, but who knows how long bugs would linger in the
> firmware stack of a TOE card?
>
> same thing here, except more so.  making storage appliances smarter
> is great, but why put that smarts in some kind of opaque,
> inaccessible and hard-to-use coprocessor?  good, thoughtful design
> leads towards a loosely-coupled cluster of off-the-shelf
> components...

The question here is not whether a modern server can outperform a
coprocessor at a given task.  Of course it can.  The issue is how to
scale embedded Linux I/O performance for system-on-a-chip storage
silicon designs.  An embedded design breaks some of the assumptions of
the current driver: dedicated raid5/6 offload logic is available, and,
in general, system resources are biased towards the I/O subsystem.

I disagree that it is a solution looking for a problem.  The problem
is that the MD driver performs suboptimally on these platforms.

I'm learning MD by reading the source and stepping through it with a
debugger.  If anyone knows of other documentation or talks given about
MD, please point me to it.

Thanks,
Dan

* Re: Accelerating Linux software raid
From: Ric Wheeler @ 2005-09-11 2:06 UTC
To: Mark Hahn; +Cc: Dan Williams, linux-raid

Mark Hahn wrote:
> I understand your point, but are the numbers right?  it seems to me
> that the main factor in appliance design is power dissipation, and
> I'm guessing a budget of say 20W for the CPU.  these days, that's a
> pretty fast processor, of the mobile-athlon-64 range - probably
> 3 GB/s xor performance.  I'd guess it amounts to perhaps 5-10% cpu
> overhead if the appliance were, for some reason, writing at 100 MB/s.
> of course, it is NOT writing at that rate (remember, reading doesn't
> require xors, and appliances probably do more reads than writes...)

I think that one thing your response shows is a small misunderstanding
of what this class of part is.  It is not a TOE in the classic sense,
rather a generally useful (non-standard) execution unit that can do
some restricted set of operations well but is not intended to be used
as a full second (or third or fourth) CPU.  If you get the code and
design right, this will be a very simple driver calling functions that
offload specific computations to these specialized execution units.

If you look at public numbers for power for modern Intel architecture
CPUs, say Tom's Hardware at:

http://www.tomshardware.com/cpu/20050525/pentium4-02.html

you will see that the 20W budget you allocate is much closer to the
power budget for these embedded parts than to that of any modern
desktop or server CPU.  Mobile parts draw much less power than server
CPUs and come somewhat closer to your number.

> again, no offense meant, but I hear you saying "we under-designed the
> centera host processor, and over-priced it, so that people are trying
> to stretch their budget by piling on too many disks".  I'm actually a
> little surprised, since I figured the Centera design would be a sane,
> modern, building-block-based one, where you could cheaply scale the
> number of host processors, not just disks (like an old-fashioned,
> not-mourned SAN).  I see a lot of people using a high-performance
> network like IB as an internal backplane-like way to tie together a
> cluster-in-a-box (and I expect they'll sprint from IB to 10G real
> soon now).

These operations are not done only during ingest; they can be used to
check the integrity of the already stored data, regenerate data, etc.
I don't want to hawk Centera here, but we are definitely a scalable
design using building blocks ;-)

What I tried to get across is the opposite of your summary, i.e. a
customer who buys storage devices prefers to pay for storage capacity
(media) instead of the infrastructure used to provide storage, and
they expect engineers to do the hard work to give them that storage at
the best possible price.  We definitely use commodity hardware, we
just try to get as much out of it as possible.

> but then again, you did say this was an archive box.  so what is the
> bandwidth of data coming in?  that's the number that sizes your host
> cpu.  being able to do xor at 12 GB/s is kind of pointless if the
> server has just one or two 2 Gb net links...

Storage arrays like Centera are not block devices; we do a lot more
high level functions (real file systems, scrubbing, indexing, etc.).
All of these functions require CPU, disk, etc., so anything that we
can save can be used to provide added functionality.

> this is an excellent point, and one that argues *against* HW
> coprocessing.  consider the NIC market: TOE never happened because
> adding tcp/ssl to a separate card just moves the complexity and bugs
> from an easy-to-patch place into a harder-to-patch place.  I'd much
> rather upgrade from a uni server to a dual and run the tcp/ssl in
> software than spend the same amount of money on a $2000 nic that
> runs its own OS.  my tcp stack bugs get fixed in a few hours if I
> email netdev, but who knows how long bugs would linger in the
> firmware stack of a TOE card?

Again, I think you misunderstand the part and the intention of the
project.  Not everyone (much to our sorrow) wants a huge storage
system - some people might be able to do with very small, quiet
appliances for their archives.

> same thing here, except more so.  making storage appliances smarter
> is great, but why put that smarts in some kind of opaque,
> inaccessible and hard-to-use coprocessor?  good, thoughtful design
> leads towards a loosely-coupled cluster of off-the-shelf
> components...
>
> regards, mark hahn.
> (I run a large supercomputing center, and spend a lot of effort
> specifying and using big compute and storage hardware...)

I am an ex-Thinking Machines OS developer who spent time working on
the Paragon OS at OSF, and I have a fair appreciation for large
customers with deep wallets.  If everyone wanted to buy large
installations built with high powered hardware, my life would be much
easier ;-)

regards,
ric

* Re: Accelerating Linux software raid
From: Konstantin Olchanski @ 2005-09-11 2:35 UTC
To: Ric Wheeler; +Cc: Mark Hahn, Dan Williams, linux-raid

On Sat, Sep 10, 2005 at 10:06:21PM -0400, Ric Wheeler wrote:
> If you look at public numbers for power for modern Intel architecture
> CPUs, say Tom's Hardware at:
...
> you will see that the 20W budget you allocate...

I am now confused.  Is somebody trying to save power by adding an i/o
coprocessor (with its own power overhead for memory, i/o, etc)?

To me it is simple:

1) If you have an infinite power budget (big box), you might as well
   let the main cpus do the raid stuff.  If you are short on power
   (embedded), you cannot afford to power an extra processor (+memory
   and stuff).

2) If you have rich customers (big box), let them pay for a bigger
   main cpu to do the raid; if you want to be cheap (embedded,
   appliance), you cannot afford to plop an extra cpu (+support chips)
   on your custom pcb.

--
Konstantin Olchanski
Data Acquisition Systems: The Bytes Must Flow!
Email: olchansk-at-triumf-dot-ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada

* Re: Accelerating Linux software raid
From: Ric Wheeler @ 2005-09-11 12:00 UTC
To: Konstantin Olchanski; +Cc: Mark Hahn, Dan Williams, linux-raid

Konstantin Olchanski wrote:
> I am now confused.  Is somebody trying to save power by adding an i/o
> coprocessor (with its own power overhead for memory, i/o, etc)?
>
> To me it is simple:
>
> 1) If you have an infinite power budget (big box), you might as well
>    let the main cpus do the raid stuff.  If you are short on power
>    (embedded), you cannot afford to power an extra processor (+memory
>    and stuff).
>
> 2) If you have rich customers (big box), let them pay for a bigger
>    main cpu to do the raid; if you want to be cheap (embedded,
>    appliance), you cannot afford to plop an extra cpu (+support
>    chips) on your custom pcb.

The actual facts don't support this view, since the gap in power
consumption is huge.  Most of these system-on-a-chip designs provide
the main CPU/northbridge/southbridge and extra execution units for a
small fraction of one standard CPU - say under 20 watts for all of the
above, versus up to (sometimes over) 100 watts for a standard CPU
(without its system chip sets).

* Re: Accelerating Linux software raid
From: Mark Hahn @ 2005-09-11 20:19 UTC
To: linux-raid

>> 1) If you have an infinite power budget (big box), you might as well
>>    let the main cpus do the raid stuff.  If you are short on power
>>    (embedded), you cannot afford to power an extra processor
>>    (+memory and stuff).
>>
>> 2) If you have rich customers (big box), let them pay for a bigger
>>    main cpu to do the raid; if you want to be cheap (embedded,
>>    appliance), you cannot afford to plop an extra cpu (+support
>>    chips) on your custom pcb.
>
> The actual facts don't support this view, since the gap in power
> consumption is huge.  Most of these system-on-a-chip designs provide
> the main CPU/northbridge/southbridge and extra execution units for a
> small fraction of one standard CPU - say under 20 watts for all of
> the above, versus up to (sometimes over) 100 watts for a standard
> CPU (without its system chip sets).

we appear to be talking about different things.  the original
suggestion was for appliances like tivo, which clearly have a limited
power budget, but certainly > 20W.  I responded by suggesting that
current mainstream mobile CPUs (like mobile athlon64's) have PLENTY of
power to run MD - in fact, more than an appliance could possibly have
any need for.  and they dissipate 20-30W.

then the topic somehow mutated into SoC designs, such as Intel's,
which are actually intended to *be* the raid card, and have an ARM
core peaking at 600 MHz, and probably are challenged to sustain even
100 MB/s (in spite of bragging about a multi-GB/s internal bus).

in other words, there's a third category: OEM customers who want to
build a smart raid card that consists of a SoC running linux actually
implementing the raid.  the main technical impediment is that Intel's
solution has XOR and DMA engines to make up for the wimpy CPU, but
those engines are barely profitable with 4k block sizes.  since MD
does XORs in 4k chunks, some hacking would be necessary to expand this
size.  I expect this would have some modest benefit for systems other
than the Intel SoC.

(but I question the sanity of the Intel approach in the first place,
since I believe the trend in storage is away from this kind of
integration.  obviously, it also doesn't make much sense for linux
hosts, but such a card would probably find more of a market in the
windows world.)

regards, mark hahn.

* Re: Accelerating Linux software raid
From: Colonel Hell @ 2005-09-10 8:35 UTC
To: dan.j.williams; +Cc: linux-raid

Hi, Dan:

> the md driver that allows it to take advantage of raid acceleration
> hardware. I/O processors like the Intel IOP333
> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
> engine for raid5 and raid6 calculations, but currently the md driver
> does not fully utilize these resources.

does that mean offloading RAID to the IOP board, or simply using the
XOR engine in the IOP?  if you have some benchmarks which show that
under certain workloads md is not IO-limited (due to excessive XOR
computation), and you concluded from them that offloading XOR is a
good idea, then it must be :) ...  I was going through the IOP manual
myself and saw that they have a raid6 accelerator too :)

Let me know.
-Johri.

* Re: Accelerating Linux software raid
From: Neil Brown @ 2005-09-11 23:14 UTC
To: dan.j.williams; +Cc: linux-raid, dave.jiang

On Tuesday September 6, dan.j.williams@gmail.com wrote:
> Hello,
>
> I am writing to the list to gauge interest in a modification of the
> md driver that allows it to take advantage of raid acceleration
> hardware. I/O processors like the Intel IOP333
> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
> engine for raid5 and raid6 calculations, but currently the md driver
> does not fully utilize these resources.
>
> Dave Jiang wrote a driver that re-routed calls to xor_block() to use
> the hardware xor engine. However, from my understanding, he found
> that performance did not improve, due to the fact that md deals in
> PAGE_SIZE (4K) blocks. At 4K the overhead of setting up the engine
> destroys any performance advantage over a software xor. The goal of
> the modification would be to enable md to understand the capacity of
> the platform's xor resources and allow it to issue optimal block
> sizes.
>
> The first question is whether a solution along these lines would be
> valued by the community. The effort is non-trivial.

If the effort is non-trivial, then I suggest you only do it if it has
real value to *you*.  If it does, community involvement is more likely
to provide value *to* you (such as guidance, bug-fixes, long-term
maintenance) than to get value *from* you, though hopefully it would
be a win-win situation.

I'm not surprised that simply replacing xor_block with calls into the
hardware engine didn't help much.  xor_block is currently called under
a spinlock, so the main processor will probably be completely idle
while the AA is doing the XOR calculation, so there isn't much room
for improvement.

If I were to try to implement this, here is how I would do it:

1/ Get the xor calc out from under the spinlock.

   This will require a fairly deep understanding of the
   handle_stripe() function.  The 'stripe_head' works somewhat like a
   state machine.  handle_stripe assesses the current state and
   advances it one step.  Currently, if it determines that it is time
   to write some data, it will
     - copy data out of file-system buffers into its own cache
     - perform the xor calculations in the cache, locking all blocks
       that then become dirty
     - schedule a write on all those locked blocks.
   The stripe won't be ready to be handled again until all the writes
   complete.

   This should be changed so that we don't copy+xor, but instead just
   lock the blocks and flag them as needing xor.  Then, after sh->lock
   is dropped, you will send the copy+xor request to the AA, or do it
   in-line.  Once the copy+xor is completed, the stripe needs to get
   flagged for handling again.  handle_stripe will then need to notice
   that parity has been calculated, so writing can commence.

2/ Then I would try to find the best internal API to provide for the
   AA (Application Accelerator, for those who haven't read the spec
   yet).

   My guess is that it should work much like the crypto API.  I'm not
   up-to-date with that, so I don't know if the async-crypto-API is
   complete and merged yet (the async-crypto-API is for sending data
   to separate processors for crypto manipulation and being alerted
   asynchronously when they complete).  If it is, definitely look into
   using it.  If it isn't, certainly look into it and maybe even help
   its development to make sure it can handle multiple-input xor
   operations.

Step 1 is probably quite useful anyway and is unlikely to slow current
performance - it just re-arranges the existing work.  Once that is
done, plugging in async xor should be fairly easy whether you use the
crypto-api or not.

I don't think it is practical to use larger block sizes for the xor
operations, and I doubt it is needed.  The DMA engine in the AA has a
very nice chaining arrangement where new operations can be added to
the end of the chain at any time, and I doubt the effort of loading a
new chain descriptor would be a substantial fraction of the time it
takes to xor a 4k block.  As long as you keep everything async
(i.e. keep the main processor busy while the copy+xor is happening)
you should notice some speed-up ... or at least a drop in cpu
activity.

One last note: if you do decide to give '1/' a try, remember to keep
patches small and well defined.  handle_stripe currently does xor in
four places: two in compute_parity (one prior to write, one for
resync) and two in compute_block (one for degraded read, one for
recovery).  Don't try to change these all at once.  One, or at most
two, at a time makes the patches much easier to review.

Good luck,
NeilBrown
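
A rough sketch of the restructuring described in step 1/ above; the
names time_to_write, lock_blocks_for_write, STRIPE_XOR_PENDING,
async_copy_xor and raid5_xor_done are hypothetical placeholders for
illustration, not the existing raid5.c code:

  /* Inside handle_stripe(), under sh->lock: only record that parity
   * needs computing, instead of doing the copy+xor on the spot. */
  if (time_to_write(sh)) {
          lock_blocks_for_write(sh);                 /* hypothetical */
          set_bit(STRIPE_XOR_PENDING, &sh->state);   /* hypothetical flag */
  }

  /* After sh->lock has been dropped: hand the copy+xor to the offload
   * engine, or run it in-line as a software fallback. */
  if (test_bit(STRIPE_XOR_PENDING, &sh->state))
          async_copy_xor(sh, raid5_xor_done);        /* hypothetical */

  /* Completion callback: parity is now up to date, so flag the stripe
   * for handling again; handle_stripe() can then schedule the writes. */
  static void raid5_xor_done(struct stripe_head *sh)
  {
          clear_bit(STRIPE_XOR_PENDING, &sh->state);
          set_bit(STRIPE_HANDLE, &sh->state);
          release_stripe(sh);
  }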