* raid resync speed
From: Jeff Allison @ 2014-03-20  1:12 UTC
To: linux-raid

The gist of my question is what kind of resync speed should I expect?

I have an HP N54L MicroServer running CentOS 6.5.

In this box I have a 3x2TB disk RAID 5 array, which I am in the process
of extending to a 4x2TB RAID 5 array.

I've added the new disk --> mdadm --add /dev/md0 /dev/sdb

And grown the array --> mdadm --grow /dev/md0 --raid-devices=4

Now the problem: the resync speed is very slow; it refuses to rise above
5MB, and in general it sits at 4M.

From looking at Glances it would appear that writing to the new disk is
the bottleneck; /dev/sdb is the new disk.

    Disk I/O    In/s    Out/s
    md0         0       0
    sda1        0       0
    sda2        0       1K
    sdb1        3.92M   0
    sdc1        24.2M   54.7M
    sdd1        11.2M   54.7M
    sde1        16.3M   54.7M

I partitioned the disk with --> parted -a optimal /dev/sdb

    [root@nas ~]# parted -a optimal /dev/sdb
    GNU Parted 2.1
    Using /dev/sdb
    Welcome to GNU Parted! Type 'help' to view a list of commands.
    (parted) p
    Model: ATA ST2000DM001-1E61 (scsi)
    Disk /dev/sdb: 2000GB
    Sector size (logical/physical): 512B/4096B
    Partition Table: msdos

    Number  Start   End     Size    Type     File system  Flags
     1      1049kB  2000GB  2000GB  primary  ntfs         raid

There is no NTFS filesystem on the disk; I've still not worked out how to
remove that flag.

I've followed the article here -->
http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html
to attempt to speed it up, but no joy.

Any ideas what I've done wrong?

parted output:

    [root@nas ~]# parted -l
    Model: ATA ST31000528AS (scsi)
    Disk /dev/sda: 1000GB
    Sector size (logical/physical): 512B/512B
    Partition Table: msdos

    Number  Start   End     Size    Type     File system  Flags
     1      1049kB  525MB   524MB   primary  ext4         boot
     2      525MB   1000GB  1000GB  primary               lvm

    Model: ATA ST2000DM001-1E61 (scsi)
    Disk /dev/sdb: 2000GB
    Sector size (logical/physical): 512B/4096B
    Partition Table: msdos

    Number  Start   End     Size    Type     File system  Flags
     1      1049kB  2000GB  2000GB  primary  ntfs         raid

    Model: ATA ST2000DM001-9YN1 (scsi)
    Disk /dev/sdc: 2000GB
    Sector size (logical/physical): 512B/4096B
    Partition Table: msdos

    Number  Start   End     Size    Type     File system  Flags
     1      1049kB  2000GB  2000GB  primary               raid

    Model: ATA WDC WD25EZRS-00J (scsi)
    Disk /dev/sdd: 2500GB
    Sector size (logical/physical): 512B/4096B
    Partition Table: msdos

    Number  Start   End     Size    Type     File system  Flags
     1      1049kB  2000GB  2000GB  primary  ntfs         raid

    Model: ATA ST2000DL001-9VT1 (scsi)
    Disk /dev/sde: 2000GB
    Sector size (logical/physical): 512B/4096B
    Partition Table: msdos

    Number  Start   End     Size    Type     File system  Flags
     1      1049kB  2000GB  2000GB  primary               raid
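For readers hitting the same symptom, these are the standard md tunables
involved, shown as a minimal sketch; /dev/md0 matches the array in this
thread, and the values are examples only, not recommendations:

    # Watch reshape progress and the current resync speed
    cat /proc/mdstat

    # System-wide resync speed limits, in KB/s (defaults: 1000 / 200000)
    cat /proc/sys/dev/raid/speed_limit_min
    cat /proc/sys/dev/raid/speed_limit_max

    # Raise the floor so resync is not throttled in favor of normal I/O
    echo 50000 > /proc/sys/dev/raid/speed_limit_min

    # Per-array stripe cache (RAID5/6 only), in entries of PAGE_SIZE
    # per member drive
    cat /sys/block/md0/md/stripe_cache_size

As the thread below shows, raising these blindly, and in particular
setting a very large stripe_cache_size, can hurt rather than help.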
* Re: raid resync speed
From: Stan Hoeppner @ 2014-03-20 14:35 UTC
To: Jeff Allison, linux-raid

On 3/19/2014 8:12 PM, Jeff Allison wrote:
...
> In this box I have a 3x2TB disk RAID 5 array, which I am in the
> process of extending to a 4x2TB RAID 5 array.
>
> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
>
> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
>
> Now the problem: the resync speed is very slow; it refuses to rise
> above 5MB, and in general it sits at 4M.
...
> Model: ATA ST2000DM001-1E61 (scsi)
> Disk /dev/sdb: 2000GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: msdos

Seagate Advanced Format 512e disk drive.

> Number  Start   End     Size    Type     File system  Flags
>  1      1049kB  2000GB  2000GB  primary  ntfs         raid

The offset is 262 physical sectors, a strange value, but it won't incur
RMW (read-modify-write) internally. So no performance hit here.
...
> I've followed the article here -->
> http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html
> to attempt to speed it up, but no joy.
>
> Any ideas what I've done wrong?

Yes. The article gives 16384 and 32768 as examples for stripe_cache_size.
Such high values tend to reduce throughput instead of increasing it.
Never use a value above 2048 with rust; 1024 is usually optimal for 7.2K
drives. Only go 4096 or higher with SSDs. In addition, high values eat
huge amounts of memory. The formula is:

    stripe_cache_size * 4096 bytes * drive_count = RAM usage

    (32768 * 4096) * 4 = 512MB of RAM consumed by the stripe cache
    (16384 * 4096) * 4 = 256MB of RAM consumed by the stripe cache
     (2048 * 4096) * 4 =  32MB of RAM consumed by the stripe cache
     (1024 * 4096) * 4 =  16MB of RAM consumed by the stripe cache

Cheers,

Stan
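Stan's formula is easy to check against a live array. A minimal sketch,
assuming the array is /dev/md0 with four member drives (both are
illustrative here):

    scs=$(cat /sys/block/md0/md/stripe_cache_size)
    drives=4   # member count of the array discussed in this thread
    echo "stripe cache RAM: $(( scs * 4096 * drives / 1024 / 1024 )) MB"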
* Re: raid resync speed
From: Bernd Schubert @ 2014-03-20 15:35 UTC
To: stan, Jeff Allison, linux-raid

> Yes. The article gives 16384 and 32768 as examples for
> stripe_cache_size. Such high values tend to reduce throughput instead
> of increasing it. Never use a value above 2048 with rust; 1024 is
> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
> addition, high values eat huge amounts of memory. The formula is:

Why should the stripe-cache size differ between SSDs and rotating disks?
Did you ever try to figure out yourself why it got slower with higher
values? I profiled that in the past and it was a CPU/memory limitation -
the md thread went to 100%, searching for stripe-heads.

So I really wonder how you got the impression that the stripe cache size
should have different values for different kinds of drives.

Cheers,
Bernd
* Re: raid resync speed
From: Bernd Schubert @ 2014-03-20 15:36 UTC
To: stan, Jeff Allison, linux-raid

On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>> Yes. The article gives 16384 and 32768 as examples for
>> stripe_cache_size. Such high values tend to reduce throughput instead
>> of increasing it. Never use a value above 2048 with rust; 1024 is
>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>> addition, high values eat huge amounts of memory. The formula is:
>
> Why should the stripe-cache size differ between SSDs and rotating disks?
> Did you ever try to figure out yourself why it got slower with higher
> values? I profiled that in the past and it was a CPU/memory limitation -
> the md thread went to 100%, searching for stripe-heads.

Sorry, I forgot to write 'CPU usage', so it went to 100% CPU usage.

> So I really wonder how you got the impression that the stripe cache
> size should have different values for different kinds of drives.
>
> Cheers,
> Bernd
* Re: raid resync speed
From: Eivind Sarto @ 2014-03-20 16:19 UTC
To: Bernd Schubert
Cc: stan, Jeff Allison, linux-raid

On Mar 20, 2014, at 8:36 AM, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
> On 03/20/2014 04:35 PM, Bernd Schubert wrote:
>> Why should the stripe-cache size differ between SSDs and rotating
>> disks? Did you ever try to figure out yourself why it got slower with
>> higher values? I profiled that in the past and it was a CPU/memory
>> limitation - the md thread went to 100%, searching for stripe-heads.
>
> Sorry, I forgot to write 'CPU usage', so it went to 100% CPU usage.
>
>> So I really wonder how you got the impression that the stripe cache
>> size should have different values for different kinds of drives.

The hash chains for the stripe cache become long if you increase the
stripe cache. There are only 256 hash buckets. With 32K stripe cache
entries, the average length of a hash chain will be 128, and that will
increase contention for the lock protecting the chain.

-eivind
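Eivind's arithmetic is easy to reproduce. A throwaway sketch using the
256 hash buckets from his message (the sweep values are illustrative):

    # average hash chain length = stripe_cache_size / 256 buckets
    for scs in 1024 4096 16384 32768; do
        echo "stripe_cache_size=$scs -> average chain length $(( scs / 256 ))"
    done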
* Re: raid resync speed
From: Bernd Schubert @ 2014-03-20 16:22 UTC
To: Eivind Sarto
Cc: stan, Jeff Allison, linux-raid

On 03/20/2014 05:19 PM, Eivind Sarto wrote:
> The hash chains for the stripe cache become long if you increase the
> stripe cache. There are only 256 hash buckets. With 32K stripe cache
> entries, the average length of a hash chain will be 128, and that will
> increase contention for the lock protecting the chain.

Yes, but this is an implementation detail. How would that make a
difference between SSDs and rotating disks? (Which was my point here.)
* Re: raid resync speed
From: Stan Hoeppner @ 2014-03-20 18:44 UTC
To: Bernd Schubert, Jeff Allison, linux-raid

On 3/20/2014 10:35 AM, Bernd Schubert wrote:
> On 3/20/2014 9:35 AM, Stan Hoeppner wrote:
>> Yes. The article gives 16384 and 32768 as examples for
>> stripe_cache_size. Such high values tend to reduce throughput instead
>> of increasing it. Never use a value above 2048 with rust; 1024 is
>> usually optimal for 7.2K drives. Only go 4096 or higher with SSDs. In
>> addition, high values eat huge amounts of memory. The formula is:
>
> Why should the stripe-cache size differ between SSDs and rotating disks?

I won't discuss "should", as that makes this a subjective discussion.
I'll discuss this objectively: what md does, not what it "should" do or
could do.

I'll answer your question with a question: why does the total stripe
cache memory differ, doubling between 4 drives and 8 drives, or 8 drives
and 16 drives, to maintain the same per-drive throughput?

The answer to both this question and your question is the same. As the
total write bandwidth of the array increases, so must the total stripe
cache buffer space. A stripe_cache_size of 1024 is usually optimal for
SATA drives with measured 100MB/s throughput, and 4096 is usually optimal
for SSDs with 400MB/s measured write throughput. The bandwidth numbers
include parity block writes.

    array(s)           bandwidth MB/s   stripe_cache_size   cache MB
    12x 100MB/s rust        1200              1024              48
    16x 100MB/s rust        1600              1024              64
    32x 100MB/s rust        3200              1024             128
     3x 400MB/s SSD         1200              4096              48
     4x 400MB/s SSD         1600              4096              64
     8x 400MB/s SSD         3200              4096             128

As is clearly demonstrated, there is a direct relationship between cache
size and total write bandwidth. The number of drives and the drive type
are irrelevant; it's the aggregate write bandwidth that matters.

Whether this "should" be this way is something for developers to debate.
I'm simply demonstrating how it "is" currently.

> Did you ever try to figure out yourself why it got slower with higher
> values? I profiled that in the past and it was a CPU/memory limitation -
> the md thread went to 100%, searching for stripe-heads.

This may be true at the limits, but going from 512 to 1024 to 2048 to
4096 with a 3-disk rust array isn't going to saturate the CPU. And
somewhere with this setup, usually between 1024 and 2048, throughput will
begin to tail off, even with plenty of CPU and memory bandwidth remaining.

> So I really wonder how you got the impression that the stripe cache
> size should have different values for different kinds of drives.

Because higher aggregate throughputs require higher stripe_cache_size
values, and some drive types (SSDs) have significantly higher throughput
than others (rust), usually 3 or 4 to 1 for discrete SSDs, and much
greater for PCIe SSDs.

Cheers,

Stan
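The "cache MB" column above can be reproduced from the formula Stan gave
earlier in the thread (stripe_cache_size * 4096 bytes * drive_count); a
quick illustrative loop over the table's drive-count/cache-size pairs:

    for cfg in "12 1024" "16 1024" "32 1024" "3 4096" "4 4096" "8 4096"; do
        set -- $cfg
        echo "$1 drives, stripe_cache_size=$2: $(( $2 * 4096 * $1 / 1024 / 1024 )) MB"
    done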
* Re: raid resync speed
From: Bernd Schubert @ 2014-03-27 16:08 UTC
To: stan, Jeff Allison, linux-raid

Sorry for the late reply, I'm busy with work...

On 03/20/2014 07:44 PM, Stan Hoeppner wrote:
> I'll answer your question with a question: why does the total stripe
> cache memory differ, doubling between 4 drives and 8 drives, or 8
> drives and 16 drives, to maintain the same per-drive throughput?
>
> The answer to both this question and your question is the same. As the
> total write bandwidth of the array increases, so must the total stripe
> cache buffer space. A stripe_cache_size of 1024 is usually optimal for
> SATA drives with measured 100MB/s throughput, and 4096 is usually
> optimal for SSDs with 400MB/s measured write throughput. The bandwidth
> numbers include parity block writes.

Did you also consider that you simply need more stripe heads (struct
stripe_head) to get complete stripes with more drives?

> As is clearly demonstrated, there is a direct relationship between
> cache size and total write bandwidth. The number of drives and the
> drive type are irrelevant; it's the aggregate write bandwidth that
> matters.

What is the meaning of "cache MB"? It does not seem to come from this
calculation:

    memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
             max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
    ...
    printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
           mdname(mddev), memory);

> Whether this "should" be this way is something for developers to
> debate. I'm simply demonstrating how it "is" currently.

Well, somehow I only see two different stripe-cache size values in your
numbers. Then the given bandwidth seems to be a theoretical value, based
on num-drives * performance-per-drive. Redundancy drives are also missing
from that calculation. And the value of "cache MB" is also unclear. So
I'm sorry, but I don't see any "simply demonstrating" here.

>> Did you ever try to figure out yourself why it got slower with higher
>> values? I profiled that in the past and it was a CPU/memory
>> limitation - the md thread went to 100%, searching for stripe-heads.
>
> This may be true at the limits, but going from 512 to 1024 to 2048 to
> 4096 with a 3-disk rust array isn't going to saturate the CPU. And
> somewhere with this setup, usually between 1024 and 2048, throughput
> will begin to tail off, even with plenty of CPU and memory bandwidth
> remaining.

Sorry, not in my experience.
So it would be interesting to see real measured values. But then, I
definitely never tested RAID6 with 3 drives, as this only provides a
single data drive.

>> So I really wonder how you got the impression that the stripe cache
>> size should have different values for different kinds of drives.
>
> Because higher aggregate throughputs require higher stripe_cache_size
> values, and some drive types (SSDs) have significantly higher
> throughput than others (rust), usually 3 or 4 to 1 for discrete SSDs,
> and much greater for PCIe SSDs.

As I said, it would be interesting to see real numbers and profiling
data.

Cheers,
Bernd
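One way to gather the kind of profiling data Bernd is asking for,
sketched under the assumption that the array is md0 and its RAID5 write
thread is named md0_raid5 (thread names vary with array name and level):

    # Does the md write thread saturate a core during the test?
    top -b -n 1 -H | grep md0_raid5

    # Where does that thread spend its CPU time? (requires perf, and
    # assumes pgrep returns the single matching kernel thread)
    perf top -t "$(pgrep md0_raid5)"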
* Re: raid resync speed
From: Stan Hoeppner @ 2014-03-28 8:03 UTC
To: Bernd Schubert, Jeff Allison, linux-raid

On 3/27/2014 11:08 AM, Bernd Schubert wrote:
> Sorry for the late reply, I'm busy with work...
>
> Did you also consider that you simply need more stripe heads (struct
> stripe_head) to get complete stripes with more drives?

That has nothing to do with what we're discussing. You get complete
stripes with the default value, which is IIRC 256, though md.txt still
says 128 as of 3.13.6, and says that it only applies to RAID5. Maybe
md.txt should be updated.

    stripe_cache_size  (currently raid5 only)
        number of entries in the stripe cache.  This is writable, but
        there are upper and lower limits (32768, 16).  Default is 128.

> What is the meaning of "cache MB"? It does not seem to come from this
> calculation:
>
>     memory = conf->max_nr_stripes * (sizeof(struct stripe_head) +
>              max_disks * ((sizeof(struct bio) + PAGE_SIZE))) / 1024;
>     ...
>     printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
>            mdname(mddev), memory);

No, it is not derived from the source code, but from the formula I stated
previously in this thread:

    stripe_cache_size * 4096 bytes * drive_count = RAM usage

> Well, somehow I only see two different stripe-cache size values in
> your numbers.

Only two are required to demonstrate the md RAID5/6 behavior in question.

> Then the given bandwidth seems to be a theoretical value, based on
> num-drives * performance-per-drive.
The values in the table are not theoretical; they are derived from test
data, and are very close to what one will see with such a real-world
configuration.

> Redundancy drives are also missing from that calculation.

No, they are included. Read the sentence directly preceding the table.

> And the value of "cache MB" is also unclear.

It is unambiguous.

> So I'm sorry, but I don't see any "simply demonstrating" here.
...
>> This may be true at the limits, but going from 512 to 1024 to 2048 to
>> 4096 with a 3-disk rust array isn't going to saturate the CPU. And
>> somewhere with this setup, usually between 1024 and 2048, throughput
>> will begin to tail off, even with plenty of CPU and memory bandwidth
>> remaining.
>
> Sorry, not in my experience.

This is the behavior everyone sees, because this is how md behaves. If
your experience is different, then you should demonstrate it.

> So it would be interesting to see real measured values. But then, I
> definitely never tested RAID6 with 3 drives, as this only provides a
> single data drive.

The point above is that an md write thread won't saturate the processor,
regardless of the size of the stripe cache, with a small-count rust
array. I simply chose a very low number to make the point clear. I didn't
state a RAID level here; whether it's RAID5 or 6 is irrelevant to the
point.

> As I said, it would be interesting to see real numbers and profiling
> data.

Here are numbers for an md RAID5 SSD array, 64KB chunk:
    5 x Intel 520 MLC 480GB SATA3
    Intel Xeon E3-1230V2 quad core, 1MB L2, 8MB L3, 3.3GHz (3.7GHz turbo)
    2x DDR3 = 21 GB/s memory bandwidth
    Debian 6, kernel 3.2

Parallel FIO throughput: 16 threads, 256KB block size, O_DIRECT, libaio,
queue depth 16, 8 GB/thread, 128 GB total written.

stripe_cache_size = 256
  READ:  io=131072MB, aggrb=2496MB/s, minb=2556MB/s, maxb=2556MB/s, mint=52508msec, maxt=52508msec
  WRITE: io=131072MB, aggrb=928148KB/s, minb=950424KB/s, maxb=950424KB/s, mint=144608msec, maxt=144608msec

stripe_cache_size = 512
  READ:  io=131072MB, aggrb=2497MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52484msec, maxt=52484msec
  WRITE: io=131072MB, aggrb=978170KB/s, minb=978MB/s, maxb=978MB/s, mint=137213msec, maxt=137213msec

stripe_cache_size = 2048
  READ:  io=131072MB, aggrb=2502MB/s, minb=2562MB/s, maxb=2562MB/s, mint=52382msec, maxt=52382msec
  WRITE: io=131072MB, aggrb=996MB/s, minb=1020MB/s, maxb=1020MB/s, mint=131631msec, maxt=131631msec

stripe_cache_size = 4096
  READ:  io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52465msec, maxt=52465msec
  WRITE: io=131072MB, aggrb=1596MB/s, minb=1634MB/s, maxb=1634MB/s, mint=82126msec, maxt=82126msec

stripe_cache_size = 8192
  READ:  io=131072MB, aggrb=2498MB/s, minb=2558MB/s, maxb=2558MB/s, mint=52469msec, maxt=52469msec
  WRITE: io=131072MB, aggrb=1514MB/s, minb=1550MB/s, maxb=1550MB/s, mint=86565msec, maxt=86565msec

stripe_cache_size = 16384
  READ:  io=131072MB, aggrb=2482MB/s, minb=2542MB/s, maxb=2542MB/s, mint=52807msec, maxt=52807msec
  WRITE: io=131072MB, aggrb=1377MB/s, minb=1410MB/s, maxb=1410MB/s, mint=95191msec, maxt=95191msec

stripe_cache_size = 32768
  READ:  io=131072MB, aggrb=2498MB/s, minb=2557MB/s, maxb=2557MB/s, mint=52481msec, maxt=52481msec
  WRITE: io=131072MB, aggrb=1139MB/s, minb=1166MB/s, maxb=1166MB/s, mint=115102msec, maxt=115102msec

The effect I described is clearly demonstrated here: increasing
stripe_cache_size beyond the optimal value causes write throughput to
decrease. With this SSD array a value of 4096 achieves peak sequential
application write throughput of 1.6 GB/s. Throughput with parity is 2
GB/s, or 400 MB/s per drive. Note what I said previously, above, when I
described the table figures: "...4096 is usually optimal for SSDs with
400MB/s measured write throughput."

Thus, those figures are not "theoretical" as you claimed, but are based
on actual testing. The same is true for rust, though I haven't performed
such testing on rust myself. Others on this list have submitted rust
numbers, but not with testing quite as thorough as the above. I invite
you to perform FIO testing on your rust array and submit your results.
They should confirm what I stated in the table above.

On 3/20/2014 10:35 AM, Bernd Schubert wrote:
> Why should the stripe-cache size differ between SSDs and rotating
> disks? Did you ever try to figure out yourself why it got slower with
> higher values? I profiled that in the past and it was a CPU/memory
> limitation - the md thread went to 100%, searching for stripe-heads.

The results above do not seem to corroborate your claim. The decrease in
throughput from 1.63 GB/s to 1.16 GB/s, when increasing stripe_cache_size
from 4096 to 32768, is a slope, not a cliff. If CPU/DRAM starvation were
the problem, I would expect a cliff and not a slope.

As I stated previously, I am simply characterizing the behavior of
stripe_cache_size values and their real-world impact on throughput and
memory consumption. I have not speculated to this point as to the cause
of the observed behavior. I have not profiled execution.
I don't know the code. I am not a kernel hacker. I am not a programmer.
What I have observed, in reports on this list and in testing, is that
there is a direct correlation between optimal stripe_cache_size and
device write throughput.

Cheers,

Stan
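A sweep like the one above could be approximated as follows. This is only
a sketch: the scratch directory and device name are assumed, and the fio
flags mirror the stated parameters (16 jobs, 256KB blocks, O_DIRECT,
libaio, queue depth 16, 8 GB per job):

    # Assumes /dev/md0 and a scratch filesystem mounted at /mnt/scratch;
    # each pass writes 128 GB, so run it only where that is acceptable.
    for scs in 256 512 2048 4096 8192 16384 32768; do
        echo "$scs" > /sys/block/md0/md/stripe_cache_size
        fio --directory=/mnt/scratch --name="scs_$scs" --rw=write \
            --bs=256k --direct=1 --ioengine=libaio --iodepth=16 \
            --numjobs=16 --size=8g --group_reporting
    done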
* Re: raid resync speed
From: Bernd Schubert @ 2014-03-20 17:46 UTC
To: Jeff Allison, linux-raid

On 03/20/2014 02:12 AM, Jeff Allison wrote:
> The gist of my question is what kind of resync speed should I expect?
>
> I have an HP N54L MicroServer running CentOS 6.5.
>
> In this box I have a 3x2TB disk RAID 5 array, which I am in the
> process of extending to a 4x2TB RAID 5 array.
>
> I've added the new disk --> mdadm --add /dev/md0 /dev/sdb
>
> And grown the array --> mdadm --grow /dev/md0 --raid-devices=4
>
> Now the problem: the resync speed is very slow; it refuses to rise
> above 5MB, and in general it sits at 4M.

Per second?

> From looking at Glances it would appear that writing to the new disk
> is the bottleneck; /dev/sdb is the new disk.
>
> Disk I/O    In/s    Out/s
> md0         0       0
> sda1        0       0
> sda2        0       1K
> sdb1        3.92M   0
> sdc1        24.2M   54.7M
> sdd1        11.2M   54.7M
> sde1        16.3M   54.7M

Could you please send the output of 'iostat -xm 1'? Also, do you see
anything in 'top' that takes 100% CPU?

Thanks,
Bernd
* Re: raid resync speed
From: Jeff Allison @ 2014-03-21 0:44 UTC
To: Bernd Schubert, linux-raid

I don't think it's the RAID code. I've dropped the disk out of the array
and I still cannot get any more than 4MB/sec out of it...

    [jeff@nas ~]$ dd if=/dev/zero of=/mnt/sdj/bonnie/test.tmp bs=4k count=2000000 && sync && \
                  dd if=/dev/zero of=/mnt/sdd/bonnie/test.tmp bs=4k count=2000000 && sync
    2000000+0 records in
    2000000+0 records out
    8192000000 bytes (8.2 GB) copied, 231.778 s, 35.3 MB/s  <-- WD Green RMA I got back yesterday
    2000000+0 records in
    2000000+0 records out
    8192000000 bytes (8.2 GB) copied, 1818.18 s, 4.5 MB/s   <-- The dud one

Perhaps it's time to RMA the RMA.

On 21 March 2014 04:46, Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> wrote:
> Could you please send the output of 'iostat -xm 1'? Also, do you see
> anything in 'top' that takes 100% CPU?
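With one member drive this slow, checking the drive itself is the natural
next step; a minimal sketch, assuming the suspect disk is /dev/sdb:

    # SMART health verdict and the attributes that usually betray a dud
    smartctl -H /dev/sdb
    smartctl -A /dev/sdb | grep -iE 'realloc|pending|crc'

    # Raw sequential read speed, bypassing md and the filesystem
    hdparm -t /dev/sdb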
Thread overview: 11+ messages

2014-03-20  1:12 raid resync speed (Jeff Allison)
2014-03-20 14:35 ` Stan Hoeppner
2014-03-20 15:35   ` Bernd Schubert
2014-03-20 15:36     ` Bernd Schubert
2014-03-20 16:19       ` Eivind Sarto
2014-03-20 16:22         ` Bernd Schubert
2014-03-20 18:44   ` Stan Hoeppner
2014-03-27 16:08     ` Bernd Schubert
2014-03-28  8:03       ` Stan Hoeppner
2014-03-20 17:46 ` Bernd Schubert
2014-03-21  0:44   ` Jeff Allison