From: Stefan Behrens <sbehrens@giantdisaster.de>
To: bo.li.liu@oracle.com
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 0/8] Btrfs: introduce a tree for UUID to subvol ID mapping
Date: Wed, 15 May 2013 10:52:13 +0200
Message-ID: <51934CBD.4080006@giantdisaster.de>
In-Reply-To: <5192705A.6030708@giantdisaster.de>

On Tue, 14 May 2013 19:11:54 +0200, Stefan Behrens wrote:
> On Tue, 14 May 2013 18:55:23 +0800, Liu Bo wrote:
>> On Tue, May 14, 2013 at 11:36:52AM +0200, Stefan Behrens wrote:
>>> Mapping UUIDs to subvolume IDs is an operation with a high effort
>>> today. Today, the algorithm even has quadratic effort (based on the
>>> number of existing subvolumes), which means, that it takes minutes
>>> to send/receive a single subvolume if 10,000 subvolumes exist. But
>>> even linear effort would be too much since it is a waste. And these
>>> data structures to allow mapping UUIDs to subvolume IDs are created
>>> every time a btrfs send/receive instance is started.
>>>
>>> So the issue to address is that Btrfs send / receive does not work
>>> as it is today when a high number of subvolumes exist.
>>>
>>> It is much more efficient to maintain a searchable persistent data
>>> structure in the filesystem, one that is updated whenever a
>>> subvolume/snapshot is created and deleted, and when the received
>>> subvolume UUID is set by the btrfs-receive tool.
>>>
>>> Therefore kernel code is added that is able to maintain data
>>> structures in the filesystem that allow to quickly search for a
>>> given UUID and to retrieve the subvol ID.
>>>
>>> Now follows the lengthy justification, why a new tree was added
>>> instead of using the existing root tree:
>>>
>>> The first approach was to not create another tree that holds UUID
>>> items. Instead, the items should just go into the top root tree.
>>> Unfortunately this confused the algorithm to assign the objectid
>>> of subvolumes and snapshots. The reason is that
>>> btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for
>>> the first created subvol or snapshot after mounting a filesystem,
>>> and this function simply searches for the largest used objectid in
>>> the root tree keys to pick the next objectid to assign. Of course,
>>> the UUID keys have always been the ones with the highest offset
>>> value, and the next assigned subvol ID was wastefully huge.
>>>
>>> To use any other existing tree did not look proper. To apply a
>>> workaround such as setting the objectid to zero in the UUID item
>>> key and to implement collision handling would either add
>>> limitations (in case of a btrfs_extend_item() approach to handle
>>> the collisions) or a lot of complexity and source code (in case a
>>> key would be looked up that is free of collisions). Adding new code
>>> that introduces limitations is not good, and adding code that is
>>> complex and lengthy for no good reason is also not good. That's the
>>> justification why a completely new tree was introduced.
>>
>> I'd appreciate it if some performance numbers appeared here, since it's a speedup.
> 
> That's a good idea. The numbers are below in the table and there's also a link to a chart.
> 
> I stopped the measurement with the old version after 10000 subvolumes because it already took almost 13 minutes to send a single, empty subvolume. All the time is spent building a database, and this is done each time the btrfs send or receive tool is started to send or receive a single subvolume.
> 
> The table shows the time it takes to send a single, empty subvolume depending on the number of subvolumes that exist in the filesystem.
> 
> # of subvols  | without    | with
> in filesystem | UUID tree  | UUID tree
> --------------+------------+----------
>             2 |  0m00.004s | 0m00.003s
>          1000 |  0m07.010s | 0m00.004s
>          2000 |  0m28.210s | 0m00.004s
>          3000 |  1m04.872s | 0m00.004s
>          4000 |  1m56.059s | 0m00.004s
>          5000 |  3m00.489s | 0m00.004s
>          6000 |  4m27.376s | 0m00.004s
>          7000 |  6m08.938s | 0m00.004s
>          8000 |  7m54.020s | 0m00.004s
>          9000 | 10m05.108s | 0m00.004s
>         10000 | 12m47.406s | 0m00.004s
> 
> Or as a chart:
> http://btrfs.giantdisaster.de/Btrfs-send-recv-perf.pdf

The table goes on like this for larger numbers of subvolumes (the time value is always the time to transfer just _one_ of the subvolumes):

# of subvols  | without    | with
in filesystem | UUID tree  | UUID tree
--------------+------------+----------
            2 |  0m00.004s | 0m00.003s
         1000 |  0m07.010s | 0m00.004s
         2000 |  0m28.210s | 0m00.004s
         3000 |  1m04.872s | 0m00.004s
         4000 |  1m56.059s | 0m00.004s
         5000 |  3m00.489s | 0m00.004s
         6000 |  4m27.376s | 0m00.004s
         7000 |  6m08.938s | 0m00.004s
         8000 |  7m54.020s | 0m00.004s
         9000 | 10m05.108s | 0m00.004s
        10000 | 12m47.406s | 0m00.004s
        11000 | 15m05.800s | 0m00.004s
        12000 | 18m00.170s | 0m00.004s
        13000 | 21m39.438s | 0m00.004s
        14000 | 24m54.681s | 0m00.004s
        15000 | 28m09.096s | 0m00.004s
        16000 | 33m08.856s | 0m00.004s
        17000 | 37m10.562s | 0m00.004s
        18000 | 41m44.727s | 0m00.004s
        19000 | 46m14.335s | 0m00.004s
        20000 | 51m55.100s | 0m00.004s
        21000 | 56m54.346s | 0m00.004s
        22000 | 62m53.466s | 0m00.004s
        23000 | 66m57.328s | 0m00.004s
        24000 | 73m59.687s | 0m00.004s
        25000 | 81m24.476s | 0m00.004s
        26000 | 87m11.478s | 0m00.004s
        27000 | 92m59.225s | 0m00.004s

For 100,000 existing subvolumes, the extrapolated time to start btrfs send/receive for a single subvolume is 22 hours 25 minutes 19 seconds.
For 1,000,000 existing subvolumes, it would take 102 days.
For 10 million existing subvolumes, it would take 30 years. And 30 years is unacceptable.
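
As a rough cross-check of the extrapolation above, here is a minimal sketch that fits a pure quadratic t = c * n^2 to the "without UUID tree" column and scales it up. The model and the helper names (fit_quadratic, estimate_seconds) are assumptions made for illustration. The fit behind the figures quoted above is not stated, so this sketch is not expected to reproduce them exactly.

```python
# Timings copied from the "without UUID tree" column of the table above:
# (number of subvolumes, minutes, seconds). The 2-subvolume row is left out
# because it only measures fixed startup cost.
ROWS = [
    (1000, 0, 7.010), (2000, 0, 28.210), (3000, 1, 4.872),
    (4000, 1, 56.059), (5000, 3, 0.489), (6000, 4, 27.376),
    (7000, 6, 8.938), (8000, 7, 54.020), (9000, 10, 5.108),
    (10000, 12, 47.406), (11000, 15, 5.800), (12000, 18, 0.170),
    (13000, 21, 39.438), (14000, 24, 54.681), (15000, 28, 9.096),
    (16000, 33, 8.856), (17000, 37, 10.562), (18000, 41, 44.727),
    (19000, 46, 14.335), (20000, 51, 55.100), (21000, 56, 54.346),
    (22000, 62, 53.466), (23000, 66, 57.328), (24000, 73, 59.687),
    (25000, 81, 24.476), (26000, 87, 11.478), (27000, 92, 59.225),
]


def fit_quadratic(rows):
    """Least-squares fit of t = c * n**2 (assumed model, no linear term)."""
    num = sum((m * 60 + s) * n ** 2 for n, m, s in rows)
    den = sum(n ** 4 for n, _, _ in rows)
    return num / den


def estimate_seconds(c, n):
    """Predicted send startup time in seconds for n existing subvolumes."""
    return c * n ** 2


c = fit_quadratic(ROWS)
for n in (100_000, 1_000_000, 10_000_000):
    hours = estimate_seconds(c, n) / 3600
    print(f"{n:>10,} subvolumes: about {hours:,.0f} hours")
```

With the rows above, this simple model lands at a little over 21 hours for 100,000 subvolumes, the same order of magnitude as the figure quoted above. The per-row ratio t / n^2 creeps upward with n, so a pure quadratic probably underestimates the larger cases somewhat.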


> The Hardware:
> Intel(R) Xeon(R) CPU X3450 @ 2.67GHz
> 8 GB RAM
> 6 high performance SSDs
> 
> The script:
> #!/bin/bash
> # Times "btrfs send" of one empty snapshot as the subvolume count grows.
> # WARNING: mkfs.btrfs below wipes the listed devices.
> set -e
> MOUNT=/mnt2
> umount "$MOUNT" || true
> mkfs.btrfs -f -m raid0 -d raid0 -n 32768 /dev/sdc /dev/sdj /dev/sds /dev/sdt /dev/sdu /dev/sdv
> mount /dev/sdc "$MOUNT"
> # Baseline: one subvolume plus one read-only snapshot.
> btrfs subv create "$MOUNT/0"
> btrfs subv snapshot -r "$MOUNT/0" "$MOUNT/0ro" > /dev/null
> echo '2 subvols'
> time btrfs send "$MOUNT/0ro" > /dev/null
> umount "$MOUNT"
> mkfs.btrfs -f -m raid0 -d raid0 -n 32768 /dev/sdc /dev/sdj /dev/sds /dev/sdt /dev/sdu /dev/sdv
> mount /dev/sdc "$MOUNT"
> # Each outer iteration adds 1 + 2*500 subvolumes (about 1000), then times one send.
> for i in $(seq 1000); do
>         btrfs subv create "$MOUNT/$i"
>         for j in $(seq 500); do
>                 btrfs subv create "$MOUNT/$i/$j" > /dev/null
>                 btrfs subv snapshot -r "$MOUNT/$i/$j" "$MOUNT/$i/${j}ro" > /dev/null
>         done
>         echo "$i" 'k subvols'
>         time btrfs send "$MOUNT/$i/1ro" > /dev/null
> done


