From: Keld Jørn Simonsen
Subject: Re: high throughput storage server?
Date: Tue, 22 Mar 2011 12:01:29 +0100
Message-ID: <20110322110129.GB9329@www2.open-std.org>
In-Reply-To: <4D887348.3030902@hardwarefreak.com>
To: Stan Hoeppner
Cc: Keld Jørn Simonsen, Mdadm, Roberto Spadim, NeilBrown, Christoph Hellwig, Drew
List-Id: linux-raid.ids

On Tue, Mar 22, 2011 at 05:00:40AM -0500, Stan Hoeppner wrote:
> Keld Jørn Simonsen put forth on 3/21/2011 5:13 PM:
> > On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> >> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
> >>
> >>> Anyway, with 384 spindles and only 50 users, each user will have on
> >>> average 7 spindles for himself. I think much of the time this would mean
> >>> no random IO, as most users are doing large sequential reading.
> >>> Thus on average you can expect quite close to striping speed if you
> >>> are running RAID capable of striping.
> >>
> >> This is not how large scale shared RAID storage works under a
> >> multi-stream workload. I thought I explained this in sufficient detail.
> >> Maybe not.
> >
> > Given that the whole array system is only lightly loaded, this is how I
> > expect it to function. Maybe you can explain why it would not be so, if
> > you think otherwise.
>
> Using the term "lightly loaded" to describe any system sustaining
> concurrent 10GB/s block IO and NFS throughput doesn't seem to be an
> accurate statement. I think you're confusing theoretical maximum
> hardware performance with real world IO performance. The former is
> always significantly higher than the latter. With this in mind, as with
> any well designed system, I specified this system to have some headroom,
> as I previously stated. Everything we've discussed so far WRT this
> system has been strictly parallel reads.

The disks themselves should be capable of doing about 60 GB/s, so 10 GB/s
is only about 15% use of the disks. And most of the IO is concurrent
sequential reading of big files.

> Now, if 10 cluster nodes are added with an application that performs
> streaming writes, occurring concurrently with the 50 streaming reads,
> we've just significantly increased the amount of head seeking on our
> disks. The combined IO workload is now a mixed heavy random read/write
> workload. This is the most difficult type of workload for any RAID
> subsystem. It would bring most parity RAID arrays to their knees. This
> is one of the reasons why RAID10 is the only suitable RAID level for
> this type of system.

Yes, I agree. And that is why I also suggest you use a mirrored RAID in
the form of Linux MD RAID10,f2, for better striping and disk access
performance than traditional RAID1+0 (see the sketch below).

Anyway, the system was not specified to have an additional 10 heavy
writing processes.
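For reference, a minimal sketch of how one such mirrored array could be
created as MD RAID10 with the far layout. The device names, the 24-spindle
count and the chunk size are only an illustration, not the spec of this
system:

  # hypothetical 24-spindle MD RAID10 array using the "far 2" layout,
  # which streams reads from all 24 drives like a RAID0 while still
  # keeping two copies of every block
  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=256 \
        --raid-devices=24 /dev/sd[b-y]

  # check the layout and watch the initial sync
  mdadm --detail /dev/md0
  cat /proc/mdstat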
> >> In summary, concatenating many relatively low stripe spindle count
> >> arrays, and using XFS allocation groups to achieve parallel scalability,
> >> gives us the performance we want without the problems associated with
> >> other configurations.
>
> > it is probably not the concurrency of XFS that makes the parallelism of
> > the IO.
>
> It most certainly is the parallelism of XFS. There are some caveats to
> the amount of XFS IO parallelism that are workload dependent. But
> generally, with multiple processes/threads reading/writing multiple
> files in multiple directories, the device parallelism is very high. For
> example:
>
> If you have 50 NFS clients all reading the same large 20GB file
> concurrently, IO parallelism will be limited to the 12 stripe spindles
> on the single underlying RAID array upon which the AG holding this file
> resides. If no other files in the AG are being accessed at the time,
> you'll get something like 1.8GB/s throughput for this 20GB file. Since
> the bulk, if not all, of this file will get cached during the read, all
> 50 NFS clients will likely be served from cache at their line rate of
> 200MB/s, or 10GB/s aggregate. There's that magic 10GB/s number again.
> ;) As you can see I put some serious thought into this system
> specification.
>
> If you have all 50 NFS clients accessing 50 different files in 50
> different directories you have no cache benefit. But we will have files
> residing in all allocation groups on all 16 arrays. Since XFS evenly
> distributes new directories across AGs when the directories are created,
> we can probably assume we'll have parallel IO across all 16 arrays with
> this workload. Since each array can stream reads at 1.8GB/s, that's
> potential parallel throughput of 28GB/s, saturating our PCIe bus
> bandwidth of 16GB/s.

Hmm, yes, RAID1+0 can probably only stream read at 1.8 GB/s. Linux MD
RAID10,f2 can stream read at around 3.6 GB/s on an array of 24 spindles
at 15,000 rpm, given that each spindle is capable of stream reading at
about 150 MB/s.

> Now change this to 50 clients each doing 10,000 4KB file reads in a
> directory along with 10,000 4KB file writes. The throughput of each 12
> disk array may now drop by over a factor of approximately 128, as each
> disk can only sustain about 300 head seeks/second, dropping its
> throughput to 300 * 4096 bytes = 1.17MB/s. Kernel readahead may help
> some, but it'll still suck.
>
> It is the occasional workload such as that above that dictates
> overbuilding the disk subsystem. Imagine adding a high IOPS NFS client
> workload to this server after it went into production to "only" serve
> large streaming reads. The random workload above would drop the
> performance of this 384 disk array with 15k spindles from a peak
> streaming rate of 28.4GB/s to 18MB/s--yes, that's megabytes.

Yes, random reading can diminish performance a lot. If the mix of random
and sequential reading still has a good sequential part, then I think the
system should still perform well. I think we lack measurements for things
like that, for example the incremental sequential reading speed on a
non-saturated file system. I am not sure how to define such measures,
though.
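To spell out the back-of-envelope numbers from your example above
(assuming, as you do, roughly 300 seeks/s per 15k drive and 4 KiB moved
per seek):

  # random 4 KiB IO at ~300 seeks/s per disk
  echo $(( 300 * 4096 ))       # 1228800 B/s, about the 1.17MB/s per disk above
  echo $(( 300 * 4096 * 12 ))  # about 14 MB/s for a 12-disk array
  echo $(( 1800 / 14 ))        # vs 1.8 GB/s streaming: the ~128x drop you mention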
> With one workload the disks can saturate the PCIe bus by almost a factor
> of two. With an opposite workload the disks can only transfer one
> 14,000th of the PCIe bandwidth. This is why Fortune 500 companies and
> others with extremely high random IO workloads such as databases, and
> plenty of cash, have farms with multiple thousands of disks attached to
> database and other servers.

Or use SSD.

> > It is more likely the IO system, and that would also work for
> > other file system types, like ext4.
>
> No. Upper kernel layers don't provide this parallelism. This is
> strictly an XFS feature, although JFS had something similar (and JFS is
> now all but dead), though not as performant. BTRFS might have something
> similar but I've read nothing about BTRFS internals. Because XFS has
> simply been the king of scalable filesystems for 15 years, and added
> great new capability along the way, all of the other filesystem
> developers have started to steal ideas from XFS. IIRC Ted Ts'o stole
> some things from XFS for use in EXT4, but allocation groups wasn't one
> of them.
>
> > I do not see anything in the XFS allocation
> > blocks with any knowledge of the underlying disk structure.
>
> The primary structure that allows for XFS parallelism is
>   xfs_agnumber_t sb_agcount
> Making the filesystem with
>   mkfs.xfs -d agcount=16
> creates 16 allocation groups of 1.752TB each in our case, 1 per 12
> spindle array. XFS will read/write to all 16 AGs in parallel, under the
> right circumstances, with 1 or multiple IO streams to/from each 12
> spindle array. XFS is the only Linux filesystem with this type of
> scalability, again, unless BTRFS has something similar.
>
> > What the file system does is only to administer the scheduling of the
> > IO, in combination with the rest of the kernel.
>
> Given that XFS has 64xxx lines of code, BTRFS has 46xxx, and EXT4 has
> 29xxx, I think there's a bit more to it than that, Keld. ;) Note that
> XFS has over twice the code size of EXT4. That's not bloat but
> features, one of them being allocation groups. If your simplistic view
> of this was correct we'd have only one Linux filesystem. Filesystem
> code does much much more than you realize.

Oh well, of course the file system does a lot of things. And I have done
a number of designs and patches for a number of file systems over the
years. But I was talking about the overall picture. The CPU power should
not be the bottleneck; the bottleneck is the IO. So we use the kernel
code to administer the IO in the best possible way.

I am also using XFS for many file systems, but I am also using EXT3, and
I think I get about the same results for the systems I run, which also
mostly do sequential reading of many big files concurrently (an FTP
server).

> > Anyway, thanks for the energy and expertise that you are supplying to
> > this thread.
>
> High performance systems are one of my passions. I'm glad to
> participate and share. Speaking of sharing, after further reading on
> how the parallelism of AGs is done and some other related things, I'm
> changing my recommendation to using only 16 allocation groups of 1.752TB
> with this system, one AG per array, instead of 64 AGs of 438GB. Using
> 64 AGs could potentially hinder parallelism in some cases.

Thank you again for your insights.

keld
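PS: for the archives, a rough sketch of the concatenated layout as I read
it: 16 small RAID10 arrays joined into one device and one XFS allocation
group per array. Whether the concatenation is done with md linear, LVM or
the RAID controller is up to you; the device names and the mount point
here are only placeholders:

  # 16 existing 12-spindle RAID10 arrays, here called /dev/md1 .. /dev/md16,
  # concatenated (not striped) into a single block device
  mdadm --create /dev/md100 --level=linear --raid-devices=16 /dev/md{1..16}

  # one allocation group per underlying array; inode64 lets XFS place
  # inodes and data in all 16 AGs instead of only the lower ones
  mkfs.xfs -d agcount=16 /dev/md100
  mount -o inode64 /dev/md100 /srv/storage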