From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
Date: Fri, 19 Feb 2016 06:57:41 -0600
Message-ID: <56C71145.8060306@redhat.com>
References: <5661F3A9.8070703@redhat.com>
 <20151208044640.GL1983@devil.localdomain>
 <20160216033538.GB2005@devil.localdomain>
 <20160219052637.GF2005@devil.localdomain>
To: Blair Bethwaite, Dave Chinner
Cc: David Casier, Ric Wheeler, Sage Weil, Ceph Development,
 Brian Foster, Eric Sandeen, Benoît LORIOT

There's a long-standing bugzilla entry for this:

https://bugzilla.redhat.com/show_bug.cgi?id=1219974

See Kefu and Sam's comments about scrubbing. That's basically the only
blocker AFAIK.

Mark

On 02/19/2016 05:28 AM, Blair Bethwaite wrote:
> Interesting observations, Dave. Given XFS is Ceph's current production
> standard, it makes me wonder why the default filestore configs split
> leaf directories at only 320 objects. We've seen first hand that it
> doesn't take long before this starts hurting performance in a big way.
>
> Cheers,
>
> On 19 February 2016 at 16:26, Dave Chinner wrote:
>> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>>> "With this model, filestore rearranges the tree very
>>> frequently: +40 I/Os every 32 object link/unlinks."
>>> It is the consequence of the parameters:
>>>
>>>   filestore_merge_threshold = 2
>>>   filestore_split_multiple = 1
>>>
>>> not of the ext4 customization.
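(Aside: both figures in this thread fall straight out of filestore's
documented split rule - a leaf directory is split once it holds more
than 16 * filestore_split_multiple * abs(filestore_merge_threshold)
objects. A minimal sketch of that arithmetic in Python:

    def split_threshold(merge_threshold, split_multiple):
        # filestore splits a leaf directory beyond this many objects
        return 16 * split_multiple * abs(merge_threshold)

    print(split_threshold(merge_threshold=2, split_multiple=1))
    # David's settings -> 32
    print(split_threshold(merge_threshold=10, split_multiple=2))
    # stock defaults   -> 320

David's settings split every leaf at 32 objects, hence the constant
rebalancing; the stock defaults only raise that to the 320 Blair
mentions above.)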
>>
>> It's a function of the directory structure you are using to work
>> around the scalability deficiencies of the ext4 directory structure.
>> i.e. the root cause is that you are working around an ext4 problem.
>>
>>> The large number of objects in FileStore requires indirect access
>>> and more IOPS for every directory.
>>>
>>> If the root of the inode B+tree is a simple block, we have the same
>>> problem with XFS.
>>
>> Only if you use the same 32-entries-per-directory constraint. Get
>> rid of that constraint, start thinking about storing tens of
>> thousands of files per directory instead. i.e. let the directory
>> structure handle IO optimisation as the number of entries grows, not
>> impose artificial limits that prevent them from working efficiently.
>>
>> Put simply, XFS is more efficient in terms of the average physical
>> IO per random inode lookup with shallow, wide directory structures
>> than it will be with a narrow, deep setup that is optimised to work
>> around the shortcomings of ext3/ext4.
>>
>> When you use deep directory structures to index millions of files,
>> you have to assume that any random lookup will require directory
>> inode IO. When you use wide, shallow directories you can almost
>> guarantee that the directory inodes will remain cached in memory
>> because they are so frequently traversed. Hence we never need to do
>> IO for directory inodes in a wide, shallow config, and so that IO
>> can be ignored.
>>
>> So let's assume, for ease of maths, we have 40 byte dirent
>> structures (~24 byte file names). That means a single 4k directory
>> block can index approximately 60-70 entries. More than this, and XFS
>> switches to a more scalable multi-block ("leaf", then "node") format.
>>
>> When XFS moves to a multi-block structure, the first block of the
>> directory is converted to a name-hash btree that allows finding any
>> directory entry in one further IO. The hash index is made up of 8
>> byte entries, so for a 4k block it can index 500 entries in a single
>> IO. IOWs, a random, cold cache lookup across 500 directory entries
>> can be done in 2 IOs.
>>
>> Now let's add a second level to that hash btree - we have 500 hash
>> index leaf blocks that can be reached in 2 IOs, so now we can reach
>> 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
>> entries.
>>
>> It should be noted that the length of the directory entries doesn't
>> affect this lookup scalability, because the index is based on 4 byte
>> name hashes. Hence it has the same scalability characteristics
>> regardless of the name lengths; it is only affected by changes in
>> directory block size.
>>
>> If we consider your current "1 IO per directory" config using a 32
>> entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs, and
>> with 4 IOs it's 1 million entries. This is assuming we can fit 32
>> entries in the inode core, which we should be able to do for the
>> nodes of the tree, but the leaves with the file entries are probably
>> going to have full object names and so are likely to be in block
>> format. I've ignored this and assumed the leaf directories pointing
>> to the objects are also inline.
>>
>> IOWs, by the time we get to needing 4 IOs to reach the file store
>> leaf directories (i.e. > ~30,000 files in the object store), a
>> single XFS directory is going to have the same or better IO
>> efficiency than your fixed configuration.
>>
>> And we can make XFS even better - with an 8k directory block size, 2
>> IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
>> reach a billion entries.
>>
>> So, in summary, the number of entries that can be indexed in a
>> given number of IOs:
>>
>>   IO count          1     2     3      4
>>   32 entry wide    32    1k   32k     1m
>>   4k dir block     70   500   25k   2.5m
>>   8k dir block    150    1k    1m  1000m
>>
>> And the number of directories required for a given number of
>> files if we limit XFS directories to 3 internal IOs:
>>
>>   file count       1k   10k   100k    1m    10m   100m
>>   32 entry wide    32   320   3200   32k   320k   3.2m
>>   4k dir block      1     1      5    50    500     5k
>>   8k dir block      1     1      1     1     11    101
>>
>> So, as you can see, once you make the directory structure shallow
>> and wide, you can reach many more entries in the same number of IOs,
>> and there is a much lower inode/dentry cache footprint when you do
>> so. IOWs, on XFS you design the hierarchy to provide the necessary
>> lookup/modification concurrency, as IO scalability as file counts
>> rise is already efficiently handled by the filesystem's directory
>> structure.
>>
>> Doing this means the file store does not need to rebalance every 32
>> create/unlink operations. Nor do you need to be concerned about
>> maintaining a working set of directory inodes in cache under memory
>> pressure - the directory entries become the hottest items in the
>> cache and so will never get reclaimed.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com
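To make Dave's suggestion concrete: a wide, shallow object store hashes
each object name into one of a small, fixed set of top-level
directories and lets XFS index the tens of thousands of entries within
each one. A minimal sketch in Python (the root path, fanout, and object
name below are hypothetical, purely for illustration):

    import hashlib
    import os

    OBJECT_ROOT = "/var/lib/objects"  # hypothetical store root
    FANOUT = 128                      # small, fixed set of wide dirs

    def object_path(name):
        # One shallow hash level. With this few directories, their
        # inodes stay hot in cache, so a cold object lookup costs only
        # the 2-4 IOs of a single XFS directory btree walk (per the
        # table above).
        bucket = int(hashlib.sha1(name.encode()).hexdigest(), 16) % FANOUT
        return os.path.join(OBJECT_ROOT, "%03d" % bucket, name)

    print(object_path("rbd_data.1234.0000000000000abc"))
    # -> /var/lib/objects/NNN/rbd_data.1234..., NNN being the bucket

The 8k rows in Dave's tables additionally assume the filesystem was
made with a larger directory block size, e.g. something like
"mkfs.xfs -n size=8192 <device>".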