From: Bernd Schubert
Subject: Re: raid resync speed
Date: Thu, 27 Mar 2014 17:08:38 +0100
To: stan@hardwarefreak.com, Jeff Allison, linux-raid@vger.kernel.org

Sorry for the late reply, I'm busy with work...

On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
> On 3/20/2014 10:35 AM, Bernd Schubert wrote:
>> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>>> Yes.  The article gives 16384 and 32768 as examples for
>>> stripe_cache_size.  Such high values tend to reduce throughput instead
>>> of increasing it.  Never use a value above 2048 with rust, and 1024 is
>>> usually optimal for 7.2K drives.  Only go 4096 or higher with SSDs.  In
>>> addition, high values eat huge amounts of memory.  The formula is:
>
>> Why should the stripe-cache size differ between SSDs and rotating disks?
>
> I won't discuss "should" as that makes this a subjective discussion.
> I'll discuss this objectively, discuss what md does, not what it
> "should" do or could do.
>
> I'll answer your question with a question: Why does the total stripe
> cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
> and 16 drives, to maintain the same per drive throughput?
>
> The answer to both this question and your question is the same answer.
> As the total write bandwidth of the array increases, so must the total
> stripe cache buffer space.  stripe_cache_size of 1024 is usually optimal
> for SATA drives with measured 100MB/s throughput, and 4096 is usually
> optimal for SSDs with 400MB/s measured write throughput.  The bandwidth
> numbers include parity block writes.

Did you also consider that you simply need more stripe-heads
(struct stripe_head) to get complete stripes with more drives?

>
> array(s)          bandwidth MB/s  stripe_cache_size  cache MB
>
> 12x 100MB/s Rust       1200             1024             48
> 16x 100MB/s Rust       1600             1024             64
> 32x 100MB/s Rust       3200             1024            128
>
>  3x 400MB/s SSD        1200             4096             48
>  4x 400MB/s SSD        1600             4096             64
>  8x 400MB/s SSD        3200             4096            128
>
> As is clearly demonstrated, there is a direct relationship between cache
> size and total write bandwidth.  The number of drives and drive type is
> irrelevant.  It's the aggregate write bandwidth that matters.

What is the meaning of "cache MB"? It does not seem to come from this
calculation in raid5.c (see the worked example further down):

> 	memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
> 		max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
...
> 	printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
> 		mdname(mddev), memory);

> Whether this "should" be this way is something for developers to debate.
> I'm simply demonstrating how it "is" currently.

Well, somehow I only see two different stripe-cache size values in your
numbers. The given bandwidth also seems to be a theoretical value, based
on num-drives * performance-per-drive, and the redundancy drives are
missing from that calculation. And the value of "cache MB" is also
unclear. So I'm sorry, but I don't see any "simply demonstrating" here.
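To make that concrete, here is a small user-space sketch of the same
arithmetic. The struct sizes are only assumptions (sizeof(struct
stripe_head) and sizeof(struct bio) depend on kernel version and config),
so take the output as a rough estimate of what md would report, not an
exact number:

/*
 * Rough user-space sketch of the allocation formula quoted above.
 * The struct sizes are assumptions and vary between kernels, so the
 * output is only an estimate of what md would print.
 */
#include <stdio.h>

int main(void)
{
	long max_nr_stripes = 1024;	/* stripe_cache_size */
	long max_disks      = 12;	/* raid disks */
	long page_size      = 4096;	/* PAGE_SIZE on x86 */
	long sz_stripe_head = 280;	/* assumed sizeof(struct stripe_head) */
	long sz_bio         = 200;	/* assumed sizeof(struct bio) */

	long kb = max_nr_stripes *
		  (sz_stripe_head + max_disks * (sz_bio + page_size)) / 1024;

	printf("allocated %ldkB (~%ld MB) for %ld stripes, %ld disks\n",
	       kb, kb / 1024, max_nr_stripes, max_disks);
	return 0;
}

For 12 disks at stripe_cache_size=1024 that lands around 50 MB: 48 MiB of
page payload (1024 stripes * 12 disks * 4 KiB) plus the per-stripe
bookkeeping, so it is in the same ballpark as the "cache MB" column but
not identical.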
>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory limitation -
>> the md thread went to 100%, searching for stripe-heads.
>
> This may be true at the limits, but going from 512 to 1024 to 2048 to
> 4096 with a 3 disk rust array isn't going to peak the CPU.  And
> somewhere with this setup, usually between 1024 and 2048, throughput
> will begin to tail off, even with plenty of CPU and memory B/W remaining.

Sorry, that does not match my experience, so it would be interesting to
see real measured values. But then I definitely never tested raid6 with
3 drives, as that only provides a single data drive.

>> So I really wonder how you got the impression that the stripe cache size
>> should have different values for different kinds of drives.
>
> Because higher aggregate throughputs require higher stripe_cache_size
> values, and some drive types (SSDs) have significantly higher throughput
> than others (rust), usually [3|4] to 1 for discrete SSDs, much greater
> for PCIe SSDs.

As I said, it would be interesting to see real numbers and profiling
data.


Cheers,
Bernd
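PS: since we both keep asking for real measured numbers, below is a rough
sketch of how such a sweep could be done. The sysfs path and the test
target are assumptions for an array named md0, writing to the raw device
is destructive, and a buffered sequential write is only a crude proxy - a
real test would rather use fio and also capture the md thread's CPU usage
- but it illustrates the idea:

/*
 * Crude sweep over stripe_cache_size: for each value, write it to sysfs
 * and time a large buffered sequential write.  MD_SYSFS and TEST_PATH
 * are assumptions for an array named md0 - adjust them, and note that
 * writing to the raw device destroys whatever is on it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define MD_SYSFS  "/sys/block/md0/md/stripe_cache_size"
#define TEST_PATH "/dev/md0"		/* or a scratch file on the array */
#define TOTAL_MB  2048			/* data written per measurement */

static void set_stripe_cache(int value)
{
	FILE *f = fopen(MD_SYSFS, "w");
	if (!f) { perror(MD_SYSFS); exit(1); }
	fprintf(f, "%d\n", value);
	fclose(f);
}

static double write_mb_per_sec(void)
{
	static char buf[1 << 20];	/* 1 MiB, zero-filled */
	struct timespec t0, t1;
	int fd = open(TEST_PATH, O_WRONLY);

	if (fd < 0) { perror(TEST_PATH); exit(1); }
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < TOTAL_MB; i++)
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			exit(1);
		}
	fsync(fd);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	return TOTAL_MB / secs;
}

int main(void)
{
	int sizes[] = { 256, 512, 1024, 2048, 4096, 8192 };

	for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		set_stripe_cache(sizes[i]);
		printf("stripe_cache_size=%-5d  %6.1f MB/s\n",
		       sizes[i], write_mb_per_sec());
	}
	return 0;
}

Run once per array/drive type, that would let the table above be filled
with measured instead of theoretical numbers.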