From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Help with space
Date: Sat, 03 May 2014 16:52:24 -0400 [thread overview]
Message-ID: <53655708.8080205@gmail.com> (raw)
In-Reply-To: <4C789231-1BA0-4415-B557-E87A32CCEBFC@colorremedies.com>
On 05/03/2014 03:09 PM, Chris Murphy wrote:
>
> On May 3, 2014, at 10:31 AM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
>
>> On 05/02/2014 03:21 PM, Chris Murphy wrote:
>>>
>>> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>>>
>>>> Something tells me btrfs replace (not device replace, simply
>>>> replace) should be moved to btrfs device replace…
>>>
>>> The syntax for "btrfs device" is different though; replace is like
>>> balance: btrfs balance start and btrfs replace start. And you can
>>> also get a status on it. We don't (yet) have options to stop,
>>> start, resume, which could maybe come in handy for long rebuilds
>>> and a reboot is required (?) although maybe that just gets handled
>>> automatically: set it to pause, then unmount, then reboot, then
>>> mount and resume.
>>>
>>>> Well, I'd say two copies if it's only two devices in the raid1...
>>>> would be true raid1. But if it's say four devices in the raid1,
>>>> as is certainly possible with btrfs raid1, that if it's not
>>>> mirrored 4-way across all devices, it's not true raid1, but
>>>> rather some sort of hybrid raid, raid10 (or raid01) if the
>>>> devices are so arranged, raid1+linear if arranged that way, or
>>>> some form that doesn't nicely fall into a well defined raid level
>>>> categorization.
>>>
>>> Well, md raid1 is always n-way. So if you use -n 3 and specify
>>> three devices, you'll get 3-way mirroring (3 mirrors). But I don't
>>> know any hardware raid that works this way. They all seem to be
>>> raid 1 is strictly two devices. At 4 devices it's raid10, and only
>>> in pairs.
>>>
>>> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
>>> something like raid1 (2 copies) + linear/concat. But that
>>> allocation is round robin. I don't read code but based on how a 3
>>> disk raid1 volume grows VDI files as it's filled it looks like 1GB
>>> chunks are copied like this
>> Actually, MD RAID10 can be configured to work almost the same with an
>> odd number of disks, except it uses (much) smaller chunks, and it does
>> more intelligent striping of reads.
>
> The efficiency of storage depends on the file system placed on top. Btrfs will allocate space exclusively for metadata, and it's possible much of that space either won't or can't be used. So ext4 or XFS on md probably is more efficient in that regard; but then Btrfs also has compression options so this clouds the efficiency analysis.
>
> For striping of reads, there is a note in man 4 md about the layout with respect to raid10: "The 'far' arrangement can give sequential read performance equal to that of a RAID0 array, but at the cost of reduced write performance." The default layout for raid10 is near 2. I think the read performance is a wash with defaults, while md reads are better and writes worse with the far layout.
>
> I'm not sure how Btrfs performs reads with multiple devices.
While I haven't tested MD RAID10 specifically, I do know that when used
as a backend for mirrored striping on LVM, it gets, by default, better
read performance than BTRFS (although the difference is usually not
significant for most use cases).
As far as how BTRFS performs reads with multiple devices, it uses the
following algorithm (at least, this is my understanding of it; I may be
wrong):
1. Create a 0-indexed list of the devices that the block is stored on.
2. Take the PID of the process that issued the read() call modulo the
number of devices that the requested block is stored on, and dispatch
the read to the device at that index in the aforementioned list.
3. If checksum verification fails, try the other devices from the list
in sequential order.
While this algorithm performs relatively well for many use cases, and
adds very little overhead to the read path, it is still sub-optimal in
almost all cases, and produces bad results in a few, such as copying
very large files, or any other case where only a single process/thread
is reading a very large amount of data.
As far as improving it goes, one option would be dispatching the read
to the least recently accessed device. Such a strategy would not add
much overhead to the read path (a few 64-bit compares), and would allow
reads to be striped across devices much more efficiently. To do much
better than that would require tracking where the last access to each
device was, and dispatching to whichever one was closest.