From: Mark Nelson <mnelson@redhat.com>
To: Blair Bethwaite <blair.bethwaite@gmail.com>,
Dave Chinner <dchinner@redhat.com>
Cc: "David Casier" <david.casier@aevoo.fr>,
"Ric Wheeler" <rwheeler@redhat.com>,
"Sage Weil" <sage@newdream.net>,
"Ceph Development" <ceph-devel@vger.kernel.org>,
"Brian Foster" <bfoster@redhat.com>,
"Eric Sandeen" <esandeen@redhat.com>,
"Benoît LORIOT" <benoit.loriot@aevoo.fr>
Subject: Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
Date: Fri, 19 Feb 2016 06:57:41 -0600
Message-ID: <56C71145.8060306@redhat.com>
In-Reply-To: <CA+z5Dsz_2z42tRtmSf3pSxSNModxmW60C6POUJDOMP9ZnwAgZg@mail.gmail.com>

There's a long-standing bugzilla entry for this:
https://bugzilla.redhat.com/show_bug.cgi?id=1219974
See Kefu and Sam's comments about scrubbing. That's basically the only
blocker AFAIK.
Mark
On 02/19/2016 05:28 AM, Blair Bethwaite wrote:
> Interesting observations, Dave. Given XFS is Ceph's current production
> standard, it makes me wonder why the default filestore configs split
> leaf directories at only 320 objects. We've seen first hand that it
> doesn't take long before this starts hurting performance in a big way.
>
> Cheers,
>
> On 19 February 2016 at 16:26, Dave Chinner <dchinner@redhat.com> wrote:
>> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>>> "With this model, filestore rearrange the tree very
>>> frequently : + 40 I/O every 32 objects link/unlink."
>>> It is the consequence of parameters :
>>> filestore_merge_threshold = 2
>>> filestore_split_multiple = 1
>>>
>>> Not of ext4 customization.
>>
>> It's a function of the directory structure you are using to work
>> around the scalability deficiencies of the ext4 directory structure.
>> i.e. the root cause is that you are working around an ext4 problem.
>>
>>> The large number of objects in FileStore requires indirect access and
>>> more IOPS for every directory.
>>>
>>> If the root of the inode B+tree is a single block, we have the same
>>> problem with XFS.
>>
>> Only if you use the same 32-entries per directory constraint. Get
>> rid of that constraint, start thinking about storing tens of
>> thousands of files per directory instead. i.e. let the directory
>> structure handle IO optimisation as the number of entries grows, not
>> impose artificial limits that prevent it from working efficiently.
>>
>> Put simply, XFS is more efficient in terms of the average physical
>> IO per random inode lookup with shallow, wide directory structures
>> than it will be with a narrow, deep setup that is optimised to work
>> around the shortcomings of ext3/ext4.
>>
>> When you use deep directory structures to index millions of files,
>> you have to assume that any random lookup will require directory
>> inode IO. When you use wide, shallow directories you can almost
>> guarantee that the directory inodes will remain cached in memory
>> because they are so frequently traversed. Hence we never need to do
>> IO for directory inodes in a wide, shallow config, and so that IO
>> can be ignored.
>>
>> So let's assume, for ease of maths, we have 40 byte dirent
>> structures (~24 byte file names). That means a single 4k directory
>> block can index approximately 60-70 entries. More than this, and XFS
>> switches to a more scalable multi-block ("leaf", then "node") format.
>>
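
A quick back-of-the-envelope sketch of that per-block arithmetic may
help. The fixed 40-byte dirent and the overhead fraction below are my
assumptions: real XFS dirents are variable-length, and block-format
directories also spend space on headers and a tail of hash/offset
pairs, which is why the practical figure is nearer 60-70 than a raw
4096/40:

  # Rough dirents-per-block estimate. The 35% overhead is an assumed
  # stand-in for block headers, the hash/offset tail and free space;
  # real XFS dirents are also variable-length.
  def dirents_per_block(block_size=4096, dirent_size=40, overhead=0.35):
      return int(block_size * (1 - overhead)) // dirent_size

  print(dirents_per_block())      # -> 66, within the 60-70 range
  print(dirents_per_block(8192))  # -> 133, near the 150 used below
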
>> When XFS moves to a multi-block structure, the first block of the
>> directory is converted to a name hash btree that allows finding any
>> directory entry in one further IO. The hash index is made up of 8
>> byte entries, so for a 4k block it can index 500 entries in a single
>> IO. IOWs, a random, cold cache lookup across 500 directory entries
>> can be done in 2 IOs.
>>
>> Now let's add a second level to that hash btree - we have 500 hash
>> index leaf blocks that can be reached in 2 IOs, so now we can reach
>> 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
>> entries.
>>
>> It should be noted that the length of the directory entries doesn't
>> affect this lookup scalability because the index is based on 4 byte
>> name hashes. Hence it has the same scalability characteristics
>> regardless of the name lengths; it is only affected by changes in
>> directory block size.
>>
>> If we consider your current "1 IO per directory" config using a 32
>> entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs and with
>> 4 IOs it's 1 million entries. This is assuming we can fit 32 entries
>> in the inode core, which we should be able to do for the nodes of
>> the tree, but the leaves with the file entries are probably going to
>> have full object names and so are likely to be in block format. I've
>> ignored this and assume the leaf directories pointing to the objects
>> are also inline.
>>
>> IOWs, by the time we get to needing 4 IOs to reach the file store
>> leaf directories (i.e. > ~30,000 files in the object store), a
>> single XFS directory is going to have the same or better IO efficiency
>> than your fixed configuration.
>>
>> And we can make XFS even better - with an 8k directory block size, 2
>> IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
>> reach a billion entries.
>>
>> So, in summary, the number of entries that can be indexed in a
>> given number of IOs:
>>
>> IO count          1      2      3       4
>> 32 entry wide    32     1k    32k      1m
>> 4k dir block     70    500    25k    2.5m
>> 8k dir block    150     1k     1m   1000m
>>
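Those reachability figures are easy to sanity-check. The model below
is an assumption on my part: uniform fan-out per hash-btree level, one
IO per level plus a final IO for the block holding the dirent, and a
root held in the inode core for the 32-entry scheme (so it costs no
IO). The 32-entry and 8k rows fall out exactly; the 4k row above uses
Dave's more conservative per-block counts:

  # Entries reachable in a given number of IOs, assuming uniform
  # fan-out: each IO walks one level of the hash btree, and the last
  # IO reads the directory block that actually holds the entry.
  def reachable(fanout, ios, inline_root=False):
      levels = ios if inline_root else ios - 1
      return fanout ** levels

  # 32-entry-wide tree, root held in the inode core:
  print([reachable(32, n, inline_root=True) for n in (1, 2, 3, 4)])
  # -> [32, 1024, 32768, 1048576]            i.e. 32, 1k, 32k, 1m

  # 8k directory blocks, ~1000 hash entries per block (1 IO is the
  # plain block-format case, so start at 2 IOs):
  print([reachable(1000, n) for n in (2, 3, 4)])
  # -> [1000, 1000000, 1000000000]           i.e. 1k, 1m, 1000m
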
>> And the number of directories required for a given number of
>> files if we limit XFS directories to 3 internal IOs:
>>
>> file count       1k    10k   100k     1m    10m   100m
>> 32 entry wide    32    320   3200    32k   320k   3.2m
>> 4k dir block      1      1      5     50    500     5k
>> 8k dir block      1      1      1      1     11    101
>>
>> So, as you can see, once you make the directory structure shallow
>> and wide, you can reach many more entries in the same number of IOs
>> and there is a much lower inode/dentry cache footprint when you do so.
>> IOWs, on XFS you design the hierarchy to provide the necessary
>> lookup/modification concurrency; IO scalability as file counts
>> rise is already efficiently handled by the filesystem's directory
>> structure.
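
Dave's second table reproduces the same way. The per-directory
capacities below are my assumptions, chosen to match the table's
rounding (roughly the 3-IO figures above); the mail's table rounds
the 32-entry results up and leaves a little headroom in the 8k cases:

  import math

  # Directories needed for a given file count when each XFS directory
  # is capped at its ~3-IO capacity, vs. 32-entry leaf directories.
  def dirs_needed(files, per_dir):
      return max(1, math.ceil(files / per_dir))

  for files in (10**3, 10**4, 10**5, 10**6, 10**7, 10**8):
      print(files,
            dirs_needed(files, 32),      # 32-entry leaf directories
            dirs_needed(files, 20000),   # 4k blocks, ~3-IO capacity
            dirs_needed(files, 10**6))   # 8k blocks, ~3-IO capacity
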
>>
>> Doing this means the file store does not need to rebalance every 32
>> create/unlink operations. Nor do you need to be concerned about
>> maintaining a working set of directory inodes in cache under memory
>> pressure - the directory entries become the hottest items in the
>> cache and so will never get reclaimed.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com