From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
Date: Fri, 19 Feb 2016 06:57:41 -0600
Message-ID: <56C71145.8060306@redhat.com>
References: <5661F3A9.8070703@redhat.com>
 <20151208044640.GL1983@devil.localdomain>
 <20160216033538.GB2005@devil.localdomain>
 <20160219052637.GF2005@devil.localdomain>
To: Blair Bethwaite, Dave Chinner
Cc: David Casier, Ric Wheeler, Sage Weil, Ceph Development,
 Brian Foster, Eric Sandeen, Benoît LORIOT

There's a long-standing bugzilla entry for this:

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

See Kefu and Sam's comments about scrubbing. That's basically the only
blocker AFAIK.

Mark

On 02/19/2016 05:28 AM, Blair Bethwaite wrote:
> Interesting observations, Dave. Given XFS is Ceph's current production
> standard, it makes me wonder why the default filestore configs split
> leaf directories at only 320 objects. We've seen first hand that it
> doesn't take long before this starts hurting performance in a big way.
>
> Cheers,
>
> On 19 February 2016 at 16:26, Dave Chinner wrote:
>> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>>> "With this model, filestore rearranges the tree very
>>> frequently: +40 I/Os every 32 object link/unlinks."
>>> It is the consequence of the parameters:
>>>
>>>   filestore_merge_threshold = 2
>>>   filestore_split_multiple = 1
>>>
>>> not of the ext4 customization.
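(Aside: both figures in this thread fall straight out of filestore's
documented split rule - a leaf directory is split once it holds more
than 16 * filestore_split_multiple * abs(filestore_merge_threshold)
objects. A minimal sketch of that arithmetic in Python:

    def split_threshold(merge_threshold, split_multiple):
        # filestore splits a leaf directory beyond this many objects
        return 16 * split_multiple * abs(merge_threshold)

    print(split_threshold(merge_threshold=2, split_multiple=1))
    # David's settings -> 32
    print(split_threshold(merge_threshold=10, split_multiple=2))
    # stock defaults   -> 320

David's settings split every leaf at 32 objects, hence the constant
rebalancing; the stock defaults only raise that to the 320 Blair
mentions above.)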
>>
>> It's a function of the directory structure you are using to work
>> around the scalability deficiencies of the ext4 directory structure.
>> i.e. the root cause is that you are working around an ext4 problem.
>>
>>> The large number of objects in FileStore requires indirect access
>>> and more IOPS for every directory.
>>>
>>> If the root of the inode B+tree is a simple block, we have the same
>>> problem with XFS.
>>
>> Only if you use the same 32-entries-per-directory constraint. Get
>> rid of that constraint, start thinking about storing tens of
>> thousands of files per directory instead. i.e. let the directory
>> structure handle IO optimisation as the number of entries grows, not
>> impose artificial limits that prevent them from working efficiently.
>>
>> Put simply, XFS is more efficient in terms of the average physical
>> IO per random inode lookup with shallow, wide directory structures
>> than it will be with a narrow, deep setup that is optimised to work
>> around the shortcomings of ext3/ext4.
>>
>> When you use deep directory structures to index millions of files,
>> you have to assume that any random lookup will require directory
>> inode IO. When you use wide, shallow directories you can almost
>> guarantee that the directory inodes will remain cached in memory
>> because they are so frequently traversed. Hence we never need to do
>> IO for directory inodes in a wide, shallow config, and so that IO
>> can be ignored.
>>
>> So let's assume, for ease of maths, we have 40 byte dirent
>> structures (~24 byte file names). That means a single 4k directory
>> block can index approximately 60-70 entries. More than this, and XFS
>> switches to a more scalable multi-block ("leaf", then "node") format.
>>
>> When XFS moves to a multi-block structure, the first block of the
>> directory is converted to a name-hash btree that allows finding any
>> directory entry in one further IO. The hash index is made up of 8
>> byte entries, so for a 4k block it can index 500 entries in a single
>> IO. IOWs, a random, cold cache lookup across 500 directory entries
>> can be done in 2 IOs.
>>
>> Now let's add a second level to that hash btree - we have 500 hash
>> index leaf blocks that can be reached in 2 IOs, so now we can reach
>> 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
>> entries.
>>
>> It should be noted that the length of the directory entries doesn't
>> affect this lookup scalability, because the index is based on 4 byte
>> name hashes. Hence it has the same scalability characteristics
>> regardless of the name lengths; it is only affected by changes in
>> directory block size.
>>
>> If we consider your current "1 IO per directory" config using a 32
>> entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs, and
>> with 4 IOs it's 1 million entries. This is assuming we can fit 32
>> entries in the inode core, which we should be able to do for the
>> nodes of the tree, but the leaves with the file entries are probably
>> going to have full object names and so are likely to be in block
>> format. I've ignored this and assumed the leaf directories pointing
>> to the objects are also inline.
>>
>> IOWs, by the time we get to needing 4 IOs to reach the file store
>> leaf directories (i.e. > ~30,000 files in the object store), a
>> single XFS directory is going to have the same or better IO
>> efficiency than your fixed configuration.
>>
>> And we can make XFS even better - with an 8k directory block size, 2
>> IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
>> reach a billion entries.
>>
>> So, in summary, the number of entries that can be indexed in a
>> given number of IOs:
>>
>>   IO count          1     2     3      4
>>   32 entry wide    32    1k   32k     1m
>>   4k dir block     70   500   25k   2.5m
>>   8k dir block    150    1k    1m  1000m
>>
>> And the number of directories required for a given number of
>> files if we limit XFS directories to 3 internal IOs:
>>
>>   file count       1k   10k   100k    1m    10m   100m
>>   32 entry wide    32   320   3200   32k   320k   3.2m
>>   4k dir block      1     1      5    50    500     5k
>>   8k dir block      1     1      1     1     11    101
>>
>> So, as you can see, once you make the directory structure shallow
>> and wide, you can reach many more entries in the same number of IOs,
>> and there is a much lower inode/dentry cache footprint when you do
>> so. IOWs, on XFS you design the hierarchy to provide the necessary
>> lookup/modification concurrency, as IO scalability as file counts
>> rise is already efficiently handled by the filesystem's directory
>> structure.
>>
>> Doing this means the file store does not need to rebalance every 32
>> create/unlink operations. Nor do you need to be concerned about
>> maintaining a working set of directory inodes in cache under memory
>> pressure - the directory entries become the hottest items in the
>> cache and so will never get reclaimed.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com
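To make Dave's suggestion concrete: a wide, shallow object store hashes
each object name into one of a small, fixed set of top-level
directories and lets XFS index the tens of thousands of entries within
each one. A minimal sketch in Python (the root path, fanout, and object
name below are hypothetical, purely for illustration):

    import hashlib
    import os

    OBJECT_ROOT = "/var/lib/objects"  # hypothetical store root
    FANOUT = 128                      # small, fixed set of wide dirs

    def object_path(name):
        # One shallow hash level. With this few directories, their
        # inodes stay hot in cache, so a cold object lookup costs only
        # the 2-4 IOs of a single XFS directory btree walk (per the
        # table above).
        bucket = int(hashlib.sha1(name.encode()).hexdigest(), 16) % FANOUT
        return os.path.join(OBJECT_ROOT, "%03d" % bucket, name)

    print(object_path("rbd_data.1234.0000000000000abc"))
    # -> /var/lib/objects/NNN/rbd_data.1234..., NNN being the bucket

The 8k rows in Dave's tables additionally assume the filesystem was
made with a larger directory block size, e.g. something like
"mkfs.xfs -n size=8192 <device>".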