CEPH filesystem development
From: Eric Sandeen <sandeen@redhat.com>
To: David Casier <david.casier@aevoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>,
	Ric Wheeler <rwheeler@redhat.com>, Sage Weil <sage@newdream.net>,
	Ceph Development <ceph-devel@vger.kernel.org>,
	Brian Foster <bfoster@redhat.com>
Subject: Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
Date: Mon, 22 Feb 2016 10:16:43 -0600
Message-ID: <56CB346B.50200@redhat.com>
In-Reply-To: <CA+gn+z=7N+uaNXQ3DU65ZFiNhW-VPjwf+ppaMLxswgoep5EJ3g@mail.gmail.com>

On 2/22/16 10:12 AM, David Casier wrote:
> I carried out the tests very quickly and have not had time to
> concentrate fully on XFS.
> maxpct=0.2 => 0.2% of 4 TB = 8 GB,
> because my existing SSD partitions are small.
> 
> If I'm not mistaken, and per what Dave says:
> by default, data can be written to up to 2^32 inodes of 256 bytes
> each (= 1 TiB of inode space).
> With maxpct, you set the maximum space used by inodes, as a
> percentage of the disk.

Yes, that's reasonable; I just wanted to be sure.  I hadn't seen
it stated that your SSD was that small.

Thanks,
-Eric

> 2016-02-22 16:56 GMT+01:00 Eric Sandeen <sandeen@redhat.com>:
>> On 2/21/16 4:56 AM, David Casier wrote:
>>> I made a simple test with XFS
>>>
>>> dm-sdf6-sdg1 :
>>> ------------------------------------------
>>> || sdf6 : SSD part || sdg1 : HDD (4TB) ||
>>> ------------------------------------------
>>
>> If this is in response to my concern about not working on small
>> filesystems, the above is sufficiently large that inode32
>> won't be ignored.
>>
>>> [root@aotest ~]# mkfs.xfs -f -i maxpct=0.2 /dev/mapper/dm-sdf6-sdg1
>>
>> Hm, why set maxpct?  This does affect how the inode32 allocator
>> works, but I'm wondering if that's why you set it.  How did you arrive
>> at 0.2%?  Just want to be sure you understand what you're tuning.
>>
>> Thanks,
>> -Eric
>>
>>> [root@aotest ~]# mount -o inode32 /dev/mapper/dm-sdf6-sdg1 /mnt
>>>
>>> 8 directories, each with 16, 32, ..., 128 sub-directories and 16, 32,
>>> ..., 128 files (82 bytes each)
>>> 1 xattr per dir and 3 xattrs per file (user.cephosd...)
>>>
>>> 3,800,000 files and directories in total
>>> 16 GiB was written to the SSD
>>>
>>> ------------------------------------------
>>> ||             find | wc -l            ||
>>> ------------------------------------------
>>> || Objects per dir ||  % IOPS on SSD  ||
>>> ------------------------------------------
>>> ||        16       ||        99       ||
>>> ||        32       ||       100       ||
>>> ||        48       ||        93       ||
>>> ||        64       ||        88       ||
>>> ||        80       ||        88       ||
>>> ||        96       ||        86       ||
>>> ||       112       ||        87       ||
>>> ||       128       ||        88       ||
>>> ------------------------------------------
>>>
>>> ------------------------------------------
>>> ||    find -exec getfattr '{}' \;      ||
>>> ------------------------------------------
>>> || Objects per dir ||  % IOPS on SSD  ||
>>> ------------------------------------------
>>> ||        16       ||        96       ||
>>> ||        32       ||        97       ||
>>> ||        48       ||        96       ||
>>> ||        64       ||        95       ||
>>> ||        80       ||        94       ||
>>> ||        96       ||        93       ||
>>> ||       112       ||        94       ||
>>> ||       128       ||        95       ||
>>> ------------------------------------------
>>>
>>> It is true that filestore is not designed for big data, and the
>>> inode/xattr cache has to carry the load
>>>
>>> I hope to see Bluestore in production quickly :)
>>>
>>> 2016-02-19 18:06 GMT+01:00 Eric Sandeen <esandeen@redhat.com>:
>>>>
>>>>
>>>> On 2/15/16 9:35 PM, Dave Chinner wrote:
>>>>> On Mon, Feb 15, 2016 at 04:18:28PM +0100, David Casier wrote:
>>>>>> Hi Dave,
>>>>>> 1 TB is very large for an SSD.
>>>>>
>>>>> It fills from the bottom, so you don't need 1TB to make it work
>>>>> in a similar manner to the ext4 hack being described.
>>>>
>>>> I'm not sure it will work for smaller filesystems, though - we essentially
>>>> ignore the inode32 mount option for sufficiently small filesystems.
>>>>
>>>> i.e. if inode numbers > 32 bits can't exist, we don't change the allocator,
>>>> at least not until the filesystem (possibly) gets grown later.
>>>>
>>>> So for inode32 to impact behavior, it needs to be on a filesystem
>>>> of sufficient size (at least 1 or 2T, depending on block size, inode
>>>> size, etc). Otherwise it will have no effect today.
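Eric's "1 or 2T" threshold can be illustrated with simple arithmetic (a simplification, not the exact kernel logic: an XFS inode number roughly encodes the inode's byte position divided by the inode size, so the cutoff scales with inode size):

```shell
# Filesystem size at which inode numbers can first exceed 32 bits:
echo $(( (2**32) * 256 / 1024**4 ))   # 256-byte inodes -> 1 (TiB)
echo $(( (2**32) * 512 / 1024**4 ))   # 512-byte inodes -> 2 (TiB)
```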
>>>>
>>>> Dave, I wonder if we need another mount option to essentially mean
>>>> "invoke the inode32 allocator regardless of filesystem size?"
>>>>
>>>> -Eric
>>>>
>>>>>> Exemple with only 10GiB :
>>>>>> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/
>>>>>
>>>>> It's a nice toy, but it's not something that is going to scale
>>>>> reliably for production.  That caveat at the end:
>>>>>
>>>>>       "With this model, filestore rearrange the tree very
>>>>>       frequently : + 40 I/O every 32 objects link/unlink."
>>>>>
>>>>> Indicates how bad the IO patterns will be when modifying the
>>>>> directory structure, and says to me that it's not a useful
>>>>> optimisation at all when you might be creating several thousand
>>>>> files/s on a filesystem. That will end up IO bound, SSD or not.
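Dave's objection can be put in rough numbers (an illustration; the 4000 creates/s rate is a hypothetical figure, only the 40-I/Os-per-32-objects ratio comes from the quoted blog post):

```shell
# ~40 extra I/Os per 32 link/unlink operations = 1.25 extra I/Os per object.
# At a hypothetical 4000 object creates per second:
echo $(( 4000 * 40 / 32 ))   # 5000 extra IOPS just to rearrange the tree
```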
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dave.
>>>>>
>>>
>>>
>>>
>>
> 
> 
> 



Thread overview: 25+ messages
     [not found] <9D046674-EA8B-4CB5-B049-3CF665D4ED64@aevoo.fr>
2015-11-24 20:42 ` Fwd: [newstore (again)] how disable double write WAL Sage Weil
     [not found]   ` <CA+gn+znHyioZhOvuidN1pvMgRMOMvjbjcues_+uayYVadetz=A@mail.gmail.com>
2015-12-01 20:34     ` Fwd: " David Casier
2015-12-01 22:02       ` Sage Weil
2015-12-04 20:12         ` Ric Wheeler
2015-12-04 20:20           ` Eric Sandeen
2015-12-08  4:46           ` Dave Chinner
2016-02-15 15:18             ` David Casier
2016-02-15 16:21               ` Eric Sandeen
2016-02-16  3:35               ` Dave Chinner
2016-02-16  8:14                 ` David Casier
2016-02-16  8:39                   ` David Casier
2016-02-19  5:26                     ` Dave Chinner
2016-02-19 11:28                       ` Blair Bethwaite
2016-02-19 12:57                         ` Mark Nelson
2016-02-22 12:01                       ` Sage Weil
2016-02-22 17:09                         ` David Casier
2016-02-22 17:16                           ` Sage Weil
2016-02-18 17:54                 ` David Casier
2016-02-19 17:06                 ` Eric Sandeen
2016-02-21 10:56                   ` David Casier
2016-02-22 15:56                     ` Eric Sandeen
2016-02-22 16:12                       ` David Casier
2016-02-22 16:16                         ` Eric Sandeen [this message]
2016-02-22 17:17                           ` Howard Chu
2016-02-23  5:20                           ` Dave Chinner
