From: Hendrik Siedelmann <hendrik.siedelmann@googlemail.com>
To: Hugo Mills <hugo@carfax.org.uk>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs raid allocator
Date: Tue, 06 May 2014 14:16:18 +0200 [thread overview]
Message-ID: <5368D292.4010408@googlemail.com> (raw)
In-Reply-To: <20140506114640.GY24298@carfax.org.uk>
On 06.05.2014 13:46, Hugo Mills wrote:
> On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote:
>> On 06.05.2014 13:19, Hugo Mills wrote:
>>> On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
>>>> On 06.05.2014 12:59, Hugo Mills wrote:
>>>>> On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
>>>>>> Hello all!
>>>>>>
>>>>>> I would like to use btrfs (or anything else, actually) to maximize RAID-0
>>>>>> performance. Basically I have a relatively constant stream of data that
>>>>>> simply has to be written out to disk. So my question is: how does the block
>>>>>> allocator decide which device to write to? Can this decision be dynamic,
>>>>>> and could it incorporate timing/throughput information? I'm willing to write
>>>>>> code, I just have no clue how this works right now. I read somewhere
>>>>>> that the decision is based on free space - is this still true?
>>>>>
>>>>> For (current) RAID-0 allocation, the block group allocator will use
>>>>> as many chunks as there are devices with free space (down to a minimum
>>>>> of 2). Data is then striped across those chunks in 64 KiB stripes.
>>>>> Thus, the first block group will be N GiB of usable space, striped
>>>>> across N devices.
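(Just to check that I follow the striping: here is a tiny toy program - not
btrfs code, all names invented for illustration - that maps a logical offset
inside such a block group to a device index and per-device offset, assuming
64 KiB stripes rotated round-robin over N devices.)

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define STRIPE_LEN (64 * 1024ULL)   /* 64 KiB stripes, as described above */

/* Map a logical offset within a RAID-0 block group to (device, offset). */
static void map_offset(uint64_t logical, unsigned num_devices)
{
    uint64_t stripe_nr  = logical / STRIPE_LEN;        /* which stripe */
    unsigned device_idx = stripe_nr % num_devices;     /* round-robin  */
    uint64_t dev_offset = (stripe_nr / num_devices) * STRIPE_LEN
                          + logical % STRIPE_LEN;

    printf("logical %" PRIu64 " -> device %u, offset %" PRIu64 "\n",
           logical, device_idx, dev_offset);
}

int main(void)
{
    /* three devices with free space -> data rotates over all three */
    map_offset(0, 3);
    map_offset(64 * 1024, 3);
    map_offset(200 * 1024, 3);
    return 0;
}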
>>>>
>>>> So do I understand correctly that (assuming we have enough space) data
>>>> will be spread equally between the disks, independent of write speeds? So one
>>>> slow device would slow down the whole RAID?
>>>
>>> Yes. Exactly the same as it would be with DM RAID-0 on the same
>>> configuration. There's not a lot we can do about that at this point.
>>
>> So the striping is fixed, but which disks contribute a chunk is dynamic? Then
>> for large workloads slower disks could 'skip a chunk', since chunk allocation
>> is dynamic - correct?
>
> You'd have to rewrite the chunk allocator to do this, _and_ provide
> different RAID levels for different subvolumes. The chunk/block group
> allocator right now uses only one rule for allocating data, and one
> for allocating metadata. Now, both of these are planned, and _might_
> between them possibly cover the use-case you're talking about, but I'm
> not certain it's necessarily a sensible thing to do in this case.
But what does the allocator currently do when one disk runs out of
space? I thought such a disk simply stops receiving new chunks, while we
can still write data to the remaining ones. So the mechanism for skipping
a device is already there - it just needs to be invoked when a drive is
too busy instead of too full.
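Roughly what I have in mind, as userspace pseudo-C (not the real chunk
allocator - the struct, the helper and the 'busyness' metric are all
invented for illustration): keep the existing "skip if too full" test and
add a "prefer the least busy device" ranking on top of it.

#include <stdio.h>

struct candidate {
    const char        *name;
    unsigned long long free_bytes;      /* what the allocator considers today */
    unsigned long long inflight_bytes;  /* hypothetical per-device busyness   */
};

/* Pick the least busy device that still has room for another chunk. */
static const struct candidate *
pick_device(const struct candidate *devs, int n, unsigned long long chunk_size)
{
    const struct candidate *best = NULL;

    for (int i = 0; i < n; i++) {
        if (devs[i].free_bytes < chunk_size)
            continue;                            /* too full: already skipped today */
        if (!best || devs[i].inflight_bytes < best->inflight_bytes)
            best = &devs[i];                     /* too busy: the proposed new skip */
    }
    return best;
}

int main(void)
{
    struct candidate devs[] = {
        { "sda", 500ULL << 30,  8ULL << 20 },
        { "sdb", 400ULL << 30, 64ULL << 20 },    /* slow disk with a deep queue */
        { "sdc", 450ULL << 30,  4ULL << 20 },
    };
    const struct candidate *c = pick_device(devs, 3, 1ULL << 30);

    printf("next chunk would go to %s\n", c ? c->name : "(no device)");
    return 0;
}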
> My question is, if you actually care about the performance of this
> system, why are you buying some slow devices to drag the performance
> of your fast devices down? It seems like a recipe for disaster...
Even the speed of a single HDD varies depending on where on the platter I
write the data, so there is not much choice :-D.
I'm aware that this could be a case of overengineering. My first thought
was actually to write a simple FUSE module that handles only the data and
puts the metadata on a regular filesystem. But then I thought it would be
nice to have this in btrfs - and not just for RAID-0.
>>>>> There's a second level of allocation (which I haven't looked at at
>>>>> all), which is how the FS decides where to put data within the
>>>>> allocated block groups. I think it will almost certainly be beneficial
>>>>> in your case to use prealloc extents, which will turn your continuous
>>>>> write into large contiguous sections of striping.
>>>>
>>>> Why does prealloc change anything? For me latency does not matter, only
>>>> continuous throughput!
>>>
>>> It makes the extent allocation algorithm much simpler, because it
>>> can then allocate in larger chunks and do more linear writes
>>
>> Is this still true if I do very large writes? Or do those get broken down by
>> the kernel somewhere?
>
> I guess it'll depend on the approach you use to do these "very
> large" writes, and on the exact definition of "very large". This is
> not an area I know a huge amount about.
>
> Hugo.
>
Never mind, I'll just try it out!
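For the prealloc side I'll probably start with something along these lines
(a minimal sketch assuming Linux fallocate(2); the file name, sizes and the
thin error handling are just placeholders):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const off_t prealloc_size = 1LL << 30;       /* reserve 1 GiB up front */
    int fd = open("stream.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* preallocate one large extent so the extent allocator has less to do */
    if (fallocate(fd, 0, 0, prealloc_size) < 0) { perror("fallocate"); return 1; }

    /* then stream the data in with plain sequential writes */
    char buf[64 * 1024];                          /* one 64 KiB stripe at a time */
    memset(buf, 0xab, sizeof(buf));
    for (off_t done = 0; done < prealloc_size; done += sizeof(buf)) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write");
            return 1;
        }
    }
    close(fd);
    return 0;
}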
Hendrik