CEPH filesystem development
From: Eric Sandeen <esandeen@redhat.com>
To: David Casier <david.casier@aevoo.fr>, Dave Chinner <dchinner@redhat.com>
Cc: Ric Wheeler <rwheeler@redhat.com>, Sage Weil <sage@newdream.net>,
	Ceph Development <ceph-devel@vger.kernel.org>,
	Brian Foster <bfoster@redhat.com>
Subject: Re: Fwd: Fwd: [newstore (again)] how disable double write WAL
Date: Mon, 15 Feb 2016 10:21:05 -0600
Message-ID: <56C1FAF1.3030805@redhat.com>
In-Reply-To: <CA+gn+znGzF+J=qAk+511qdfPJV4xYB+4F5k8KMLWh0+JtryLeA@mail.gmail.com>

On 2/15/16 9:18 AM, David Casier wrote:
> Hi Dave,
> 1TB is very large for an SSD.
> Example with only 10GiB:
> https://www.aevoo.fr/2016/02/14/ceph-ext4-optimisation-for-filestore/

It wouldn't be too hard to modify the inode32 restriction to a lower
threshold, I think, if it would really be useful.

On the other hand, 10GiB seems awfully small.  What are realistic
sizes for this use case?

-Eric
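
One rough way to put numbers on the sizing question, as a sketch only:
with the mkfs.ext4 options quoted below (-I 128 and -i $((1024*512))),
the inode tables take roughly fs_size / bytes_per_inode * inode_size
bytes. The helper name inode_table_bytes is made up for illustration,
and the estimate ignores bitmaps, group descriptors, and any rounding
mke2fs applies per block group, so it is an approximation rather than
an exact figure.

```shell
#!/bin/sh
# Hypothetical helper: approximate ext4 inode-table size in bytes.
# Assumes one inode per bytes_per_inode of filesystem space and ignores
# bitmaps, group descriptors and per-group rounding.
inode_table_bytes() {
  _fs_bytes=$1         # filesystem size in bytes
  _bytes_per_inode=$2  # value passed to mkfs.ext4 -i
  _inode_size=$3       # value passed to mkfs.ext4 -I
  echo $(( _fs_bytes / _bytes_per_inode * _inode_size ))
}

# 4 TiB filesystem with -i 524288 and -I 128
inode_table_bytes $((4 * 1024 * 1024 * 1024 * 1024)) $((1024 * 512)) 128
```

For a 4 TiB disk this prints 1073741824, i.e. about 1 GiB of inode
tables, which suggests why a 10GiB flash partition can be enough for
the ext4 layout even though it is far below the 1TB inode32 region.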

 
> 2015-12-08 5:46 GMT+01:00 Dave Chinner <dchinner@redhat.com>:
>> On Fri, Dec 04, 2015 at 03:12:25PM -0500, Ric Wheeler wrote:
>>> On 12/01/2015 05:02 PM, Sage Weil wrote:
>>>> Hi David,
>>>>
>>>> On Tue, 1 Dec 2015, David Casier wrote:
>>>>> Hi Sage,
>>>>> With a standard disk (4 to 6 TB), and a small flash drive, it's easy
>>>>> to create an ext4 FS with metadata on flash
>>>>>
>>>>> Example with sdg1 on flash and sdb on hdd:
>>>>>
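>>>>> size_of() {
>>>>>   # Device size in 512-byte sectors (--getsize is deprecated).
>>>>>   blockdev --getsz "$1"
>>>>> }
>>>>>
>>>>> mkdmsetup() {
>>>>>   _ssd=/dev/$1
>>>>>   _hdd=/dev/$2
>>>>>   _size_of_ssd=$(size_of "$_ssd")
>>>>>   # Linear table: SSD sectors first, then the HDD appended after them.
>>>>>   echo "0 $_size_of_ssd linear $_ssd 0
>>>>>   $_size_of_ssd $(size_of "$_hdd") linear $_hdd 0" | dmsetup create "dm-${1}-${2}"
>>>>> }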
>>
>> So this is just a linear concatenation that relies on ext4 putting
>> all its metadata at the front of the filesystem?
>>
>>>>>
>>>>> mkdmsetup sdg1 sdb
>>>>>
>>>>> mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
>>>>>   -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 \
>>>>>   -i $((1024*512)) /dev/mapper/dm-sdg1-sdb
>>>>>
>>>>> With that, all meta_blocks are on the SSD
>>
>> IIRC, it's the "packed_meta_blocks=1" that does this.
>>
>> This is something that is pretty trivial to do with XFS, too,
>> by use of the inode32 allocation mechanism. That reserves the
>> first TB of space for inodes and other metadata allocations,
>> so if you span the first TB with SSDs, you get almost all the
>> metadata on the SSDs, and all the data in the higher AGs. With the
>> undocumented log location mkfs option, you can also put the log at
>> the start of AG 0, which means that it would sit on the SSD, too,
>> without needing an external log device.
>>
>> SGI even had a mount option hack to limit this allocator behaviour
>> to a block limit lower than 1TB so they could limit the metadata AG
>> regions to, say, the first 200GB.
>>
>>>> This is coincidentally what I've been working on today.  So far I've just
>>>> added the ability to put the rocksdb WAL on a second device, but it's
>>>> super easy to push rocksdb data there as well (and have it spill over onto
>>>> the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
>>>> third device (e.g., expensive NVMe or NVRAM).
>>
>> I have old bits and pieces from 7-8 years ago that would allow some
>> application control of allocation policy to allow things like this
>> to be done, but I left SGI before it was anything more than just a
>> proof of concept....
>>
>>>> See this ticket for the ceph-disk tooling that's needed:
>>>>
>>>>     http://tracker.ceph.com/issues/13942
>>>>
>>>> I expect this will be more flexible and perform better than the ext4
>>>> metadata option, but we'll need to test on your hardware to confirm!
>>>>
>>>> sage
>>>
>>> I think that XFS "realtime" subvolumes are the thing that does this:
>>> the second volume contains only the data (no metadata).
>>>
>>> I seem to recall that it was historically popular with video
>>> appliances, etc., but it is not commonly used.
>>
>> Because it's a single threaded allocator. It's not suited to highly
>> concurrent applications, just applications that require large
>> extents allocated in a deterministic manner.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> dchinner@redhat.com
> 
> 
> 
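
The linear concatenation discussed in the quoted thread corresponds to
a two-line device-mapper table, one line per segment, in the form
"start length linear device offset" with all sizes in 512-byte
sectors. The following is a minimal sketch that only prints such a
table so it can be inspected before being piped to dmsetup create. The
function name linear_concat_table and the sector counts are
hypothetical; on a real system the counts would come from
blockdev --getsz.

```shell
#!/bin/sh
# Hypothetical helper: print a dm-linear table that places the SSD at the
# start of the mapped device and appends the HDD directly after it.
# Arguments: ssd_device ssd_sectors hdd_device hdd_sectors
linear_concat_table() {
  _ssd=$1
  _ssd_sectors=$2
  _hdd=$3
  _hdd_sectors=$4
  # Segment 1: sectors [0, ssd_sectors) map to the SSD from offset 0.
  printf '0 %s linear %s 0\n' "$_ssd_sectors" "$_ssd"
  # Segment 2: starts where the SSD ends and maps to the HDD from offset 0.
  printf '%s %s linear %s 0\n' "$_ssd_sectors" "$_hdd_sectors" "$_hdd"
}

# Illustrative sizes only: a 10 GiB flash partition (20971520 sectors)
# followed by a disk of 7814037168 sectors.
linear_concat_table /dev/sdg1 20971520 /dev/sdb 7814037168
```

Because the second segment starts exactly at the SSD's sector count,
any filesystem that packs its metadata at the front of the address
space, whether via ext4's packed_meta_blocks=1 or the XFS inode32
region, ends up with that metadata on flash as long as the flash
segment is large enough to hold it.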
