Re: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint

From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: Stan Hoeppner <stan@hardwarefreak.com>
Cc: "Arkadiusz Miśkiewicz" <arekm@maven.pl>,
	linux-raid@vger.kernel.org, "xfs@oss.sgi.com" <xfs@oss.sgi.com>
Subject: Re: md-RAID5/6 stripe_cache_size default value vs performance vs memory footprint
Date: Sat, 21 Dec 2013 13:20:14 +0100	[thread overview]
Message-ID: <20131221122014.GA3909@lazy.lzy> (raw)
In-Reply-To: <52B57912.5080000@hardwarefreak.com>

On Sat, Dec 21, 2013 at 05:18:42AM -0600, Stan Hoeppner wrote:
> I renamed the subject as your question doesn't really apply to XFS, or
> the OP, but to md-RAID.
> 
> On 12/20/2013 4:43 PM, Arkadiusz Miśkiewicz wrote:
> 
> > I wonder why kernel is giving defaults that everyone repeatly recommends to 
> > change/increase? Has anyone tried to bugreport that for stripe_cache_size 
> > case?
> 
> The answer is balancing default md-RAID5/6 write performance against
> kernel RAM consumption, with more weight given to the latter.  The formula:
> 
> ((4096*stripe_cache_size)*num_drives)= RAM consumed for stripe cache
> 
> High stripe_cache_size values will cause the kernel to eat non trivial
> amounts of RAM for the stripe cache buffer.  This table demonstrates the
> effect today for typical RAID5/6 disk counts.
> 
> stripe_cache_size	drives	RAM consumed
> 256			 4	  4 MB
> 			 8	  8 MB
> 			16	 16 MB
> 512			 4	  8 MB
> 			 8	 16 MB
> 			16	 32 MB
> 1024			 4	 16 MB
> 			 8	 32 MB
> 			16	 64 MB
> 2048			 4	 32 MB
> 			 8	 64 MB
> 			16	128 MB
> 4096			 4	 64 MB
> 			 8	128 MB
> 			16	256 MB
> 
> The powers that be, Linus in particular, are not fond of default
> settings that create a lot of kernel memory structures.  The default
> md-RAID5/6 stripe_cache-size yields 1MB consumed per member device.
> 
> With SSDs becoming mainstream, and becoming ever faster, at some point
> the md-RAID5/6 architecture will have to be redesigned because of the
> memory footprint required for performance.  Currently the required size
> of the stripe cache appears directly proportional to the aggregate write
> throughput of the RAID devices.  Thus the optimal value will vary
> greatly from one system to another depending on the throughput of the
> drives.
> 
> For example, I assisted a user with 5x Intel SSDs back in January and
> his system required 4096, or 80MB of RAM for stripe cache, to reach
> maximum write throughput of the devices.  This yielded 600MB/s or 60%
> greater throughput than 2048, or 40MB RAM for cache.  In his case 60MB
> more RAM than the default was well worth the increase as the machine was
> an iSCSI target server with 8GB RAM.
> 
> In the previous case with 5x rust RAID6 the 2048 value seemed optimal
> (though not yet verified), requiring 40MB less RAM than the 5x Intel
> SSDs.  For a 3 modern rust RAID5 the default of 256, or 3MB, is close to
> optimal but maybe a little low.  Consider that 256 has been the default
> for a very long time, and was selected back when average drive
> throughput was much much lower, as in 50MB/s or less, SSDs hadn't yet
> been invented, and system memories were much smaller.
> 
> Due to the massive difference in throughput between rust and SSD, any
> meaningful change in the default really requires new code to sniff out
> what type of devices constitute the array, if that's possible, and it
> probably isn't, and set a lowish default accordingly.  Again, SSDs
> didn't exist when md-RAID was coded, nor when this default was set, and
> this throws a big monkey wrench into these spokes.

Hi Stan,

nice analytical report, as usual...

My dumb suggestion would be to simply use udev to
setup the drives.
Everything, stripe_cache, read_ahead, stcerr, etc.
can be configured, I suppose, by udev rules.

bye,

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html