From: Mark Nelson <mnelson@redhat.com>
To: Blair Bethwaite <blair.bethwaite@gmail.com>,
Dave Chinner <dchinner@redhat.com>
Cc: "David Casier" <david.casier@aevoo.fr>,
"Ric Wheeler" <rwheeler@redhat.com>,
"Sage Weil" <sage@newdream.net>,
"Ceph Development" <ceph-devel@vger.kernel.org>,
"Brian Foster" <bfoster@redhat.com>,
"Eric Sandeen" <esandeen@redhat.com>,
"Benoît LORIOT" <benoit.loriot@aevoo.fr>
Subject: Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
Date: Fri, 19 Feb 2016 06:57:41 -0600
Message-ID: <56C71145.8060306@redhat.com>
In-Reply-To: <CA+z5Dsz_2z42tRtmSf3pSxSNModxmW60C6POUJDOMP9ZnwAgZg@mail.gmail.com>

There's a long-standing bugzilla entry for this:
https://bugzilla.redhat.com/show_bug.cgi?id=1219974
See Kefu and Sam's comments about scrubbing. That's basically the only
blocker AFAIK.
Mark
On 02/19/2016 05:28 AM, Blair Bethwaite wrote:
> Interesting observations, Dave. Given XFS is Ceph's current production
> standard, it makes me wonder why the default filestore configs split
> leaf directories at only 320 objects. We've seen first hand that it
> doesn't take long before this starts hurting performance in a big way.
>
> Cheers,
>
> On 19 February 2016 at 16:26, Dave Chinner <dchinner@redhat.com> wrote:
>> On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
>>> "With this model, filestore rearrange the tree very
>>> frequently : + 40 I/O every 32 objects link/unlink."
>>> It is the consequence of parameters :
>>> filestore_merge_threshold = 2
>>> filestore_split_multiple = 1
>>>
>>> Not of ext4 customization.
>>
>> It's a function of the directory structure you are using to work
>> around the scalability deficiencies of the ext4 directory structure.
>> i.e. the root cause is that you are working around an ext4 problem.
>>
>>> The large number of objects in FileStore requires indirect access and
>>> more IOPS for every directory.
>>>
>>> If the root of the inode B+tree is a single block, we have the same
>>> problem with XFS.
>>
>> Only if you use the same 32-entries per directory constraint. Get
>> rid of that constraint, start thinking about storing tens of
>> thousands of files per directory instead. i.e. let the directory
>> structure handle IO optimisation as the number of entries grows, not
>> impose artificial limits that prevent it from working efficiently.
>>
>> Put simply, XFS is more efficient in terms of the average physical
>> IO per random inode lookup with shallow, wide directory structures
>> than it will be with a narrow, deep setup that is optimised to work
>> around the shortcomings of ext3/ext4.
>>
>> When you use deep directory structures to index millions of files,
>> you have to assume that any random lookup will require directory
>> inode IO. When you use wide, shallow directories you can almost
>> guarantee that the directory inodes will remain cached in memory
>> because they are so frequently traversed. Hence we never need to do
>> IO for directory inodes in a wide, shallow config, and so that IO
>> can be ignored.
>>
>> So let's assume, for ease of maths, we have 40 byte dirent
>> structures (~24 byte file names). That means a single 4k directory
>> block can index approximately 60-70 entries. More than this, and XFS
>> switches to a more scalable multi-block ("leaf", then "node") format.
>>
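
A quick back-of-the-envelope sketch of that per-block arithmetic may
help. The fixed 40-byte dirent and the overhead fraction below are my
assumptions: real XFS dirents are variable-length, and block-format
directories also spend space on headers and a tail of hash/offset
pairs, which is why the practical figure is nearer 60-70 than a raw
4096/40:

  # Rough dirents-per-block estimate. The 35% overhead is an assumed
  # stand-in for block headers, the hash/offset tail and free space;
  # real XFS dirents are also variable-length.
  def dirents_per_block(block_size=4096, dirent_size=40, overhead=0.35):
      return int(block_size * (1 - overhead)) // dirent_size

  print(dirents_per_block())      # -> 66, within the 60-70 range
  print(dirents_per_block(8192))  # -> 133, near the 150 used below
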
>> When XFS moves to a multi-block structure, the first block of the
>> directory is converted to a name hash btree that allows finding any
>> directory entry in one further IO. The hash index is made up of 8
>> byte entries, so for a 4k block it can index 500 entries in a single
>> IO. IOWs, a random, cold cache lookup across 500 directory entries
>> can be done in 2 IOs.
>>
>> Now let's add a second level to that hash btree - we have 500 hash
>> index leaf blocks that can be reached in 2 IOs, so now we can reach
>> 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5 million
>> entries.
>>
>> It should be noted that the length of the directory entries doesn't
>> affect this lookup scalability because the index is based on 4 byte
>> name hashes. Hence it has the same scalability characteristics
>> regardless of the name lengths; it is only affected by changes in
>> directory block size.
>>
>> If we consider your current "1 IO per directory" config using a 32
>> entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs and with
>> 4 IOs it's 1 million entries. This is assuming we can fit 32 entries
>> in the inode core, which we should be able to do for the nodes of
>> the tree, but the leaves with the file entries are probably going to
>> have full object names and so are likely to be in block format. I've
>> ignored this and assume the leaf directories pointing to the objects
>> are also inline.
>>
>> IOWs, by the time we get to needing 4 IOs to reach the file store
>> leaf directories (i.e. > ~30,000 files in the object store), a
>> single XFS directory is going to have the same or better IO efficiency
>> than your fixed configuration.
>>
>> And we can make XFS even better - with an 8k directory block size, 2
>> IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
>> reach a billion entries.
>>
>> So, in summary, the number of entries that can be indexed in a
>> given number of IOs:
>>
>> IO count          1      2      3       4
>> 32 entry wide    32     1k    32k      1m
>> 4k dir block     70    500    25k    2.5m
>> 8k dir block    150     1k     1m   1000m
>>
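Those reachability figures are easy to sanity-check. The model below
is an assumption on my part: uniform fan-out per hash-btree level, one
IO per level plus a final IO for the block holding the dirent, and a
root held in the inode core for the 32-entry scheme (so it costs no
IO). The 32-entry and 8k rows fall out exactly; the 4k row above uses
Dave's more conservative per-block counts:

  # Entries reachable in a given number of IOs, assuming uniform
  # fan-out: each IO walks one level of the hash btree, and the last
  # IO reads the directory block that actually holds the entry.
  def reachable(fanout, ios, inline_root=False):
      levels = ios if inline_root else ios - 1
      return fanout ** levels

  # 32-entry-wide tree, root held in the inode core:
  print([reachable(32, n, inline_root=True) for n in (1, 2, 3, 4)])
  # -> [32, 1024, 32768, 1048576]            i.e. 32, 1k, 32k, 1m

  # 8k directory blocks, ~1000 hash entries per block (1 IO is the
  # plain block-format case, so start at 2 IOs):
  print([reachable(1000, n) for n in (2, 3, 4)])
  # -> [1000, 1000000, 1000000000]           i.e. 1k, 1m, 1000m
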
>> And the number of directories required for a given number of
>> files if we limit XFS directories to 3 internal IOs:
>>
>> file count       1k    10k   100k     1m    10m   100m
>> 32 entry wide    32    320   3200    32k   320k   3.2m
>> 4k dir block      1      1      5     50    500     5k
>> 8k dir block      1      1      1      1     11    101
>>
>> So, as you can see, once you make the directory structure shallow
>> and wide, you can reach many more entries in the same number of IOs
>> and there is a much lower inode/dentry cache footprint when you do so.
>> IOWs, on XFS you design the hierarchy to provide the necessary
>> lookup/modification concurrency; IO scalability as file counts
>> rise is already efficiently handled by the filesystem's directory
>> structure.
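
Dave's second table reproduces the same way. The per-directory
capacities below are my assumptions, chosen to match the table's
rounding (roughly the 3-IO figures above); the mail's table rounds
the 32-entry results up and leaves a little headroom in the 8k cases:

  import math

  # Directories needed for a given file count when each XFS directory
  # is capped at its ~3-IO capacity, vs. 32-entry leaf directories.
  def dirs_needed(files, per_dir):
      return max(1, math.ceil(files / per_dir))

  for files in (10**3, 10**4, 10**5, 10**6, 10**7, 10**8):
      print(files,
            dirs_needed(files, 32),      # 32-entry leaf directories
            dirs_needed(files, 20000),   # 4k blocks, ~3-IO capacity
            dirs_needed(files, 10**6))   # 8k blocks, ~3-IO capacity
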
>>
>> Doing this means the file store does not need to rebalance every 32
>> create/unlink operations. Nor do you need to be concerned about
>> maintaining a working set of directory inodes in cache under memory
>> pressure - the directory entries become the hottest items in the
>> cache and so will never get reclaimed.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com