* Experiences with metadata balance/convert
@ 2017-04-21 10:26 Hans van Kranenburg
  2017-04-21 10:31 ` Hans van Kranenburg
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-21 10:26 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

This is a followup to my previous post "About free space fragmentation,
metadata write amplification and (no)ssd", exploring how well (or badly)
btrfs handles filesystems that are larger than your average desktop
computer.

One of the things I'm looking at doing is converting the metadata of a
large filesystem from DUP to single, because:
  1. in this particular situation the disk storage is considered
reliable enough to handle bitrot and failed disks itself.
  2. it would reduce metadata writes (the favourite thing this
filesystem wants to do all the time) by 50% right away

So, I used the clone functionality of the underlying iSCSI target to get
a writable throw-away version of the filesystem to experiment with (great!).

== The starting point ==

Data, single: total=39.46TiB, used=35.63TiB
System, DUP: total=40.00MiB, used=6.23MiB
Metadata, DUP: total=454.50GiB, used=441.46GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

~90000 subvolumes, related to each other in groups of about 32, with no
shared data extents between those groups.

That's around 900 512MiB metadata DUP block groups, all >90% used.

I wrote a simple script to count the metadata tree blocks; here's the
result, sorted by tree and level (0 = leaf, >0 are nodes):

-# ./show_metadata_tree_sizes.py /srv/backup/
ROOT_TREE          68.11MiB 0(  4346) 1(    12) 2(     1)
EXTENT_TREE        14.63GiB 0(955530) 1(  3572) 2(    16) 3(     1)
CHUNK_TREE          6.17MiB 0(   394) 1(     1)
DEV_TREE            3.50MiB 0(   223) 1(     1)
FS_TREE           382.63GiB 0(24806886) 1(250284) 2( 18258) 3(   331)
CSUM_TREE          41.98GiB 0(2741020) 1(  9930) 2(    45) 3(     1)
QUOTA_TREE            0.00B
UUID_TREE           3.28MiB 0(   209) 1(     1)
FREE_SPACE_TREE    79.31MiB 0(  5063) 1(    12) 2(     1)
DATA_RELOC_TREE    16.00KiB 0(     1)

FS_TREE counts tree 5 and all other subvolumes together.
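
(As a sanity check: these sizes are just the tree block counts multiplied
by the 16KiB nodesize, e.g. for FS_TREE:)

  # tree size = number of tree blocks (all levels) x nodesize (16KiB here)
  nodesize = 16 * 1024
  fs_tree_blocks = 24806886 + 250284 + 18258 + 331  # levels 0..3 above
  print(fs_tree_blocks * nodesize / 2.0**30)        # ~382.63 (GiB)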

Kernel: Linux 4.9.18 (Debian)
Progs: 4.9.1 (Debian)

== Test 1: Just trying it ==

So, let's just do a
  btrfs balance start -f -mconvert=single /srv/backup

The result:
 * One by one the metadata block groups are emptied, from highest vaddr
to lowest vaddr.
 * For each 512MiB that is removed, a new 1024MiB block group is
forcibly added.
 * For each 512MiB, it takes on average 3 hours to empty it, during
which the filesystem is writing metadata at 100MiB/s to disk. That
means, to move 512MiB to another place, it needs to write a bit more
than *1TiB* to disk (3*3600*100MiB). And, it seems to be touching almost
all of the 900 metadata block groups on every committed transaction.
 * Instead of moving the metadata into the new single-type block groups,
it seems to prefer to keep messing around in the DUP block groups all
the time, as long as there's any free space to be found in them.

I let it run for a day, and then stopped it. So, naively extrapolating:
running at full speed and doing nothing else, this would take 112.5
days, while writing a petabyte of metadata to disk.
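
(The back-of-the-envelope math behind those numbers:)

  # ~900 metadata block groups, ~3 hours each, at ~100MiB/s of writes:
  seconds = 900 * 3 * 3600
  print(seconds / 86400.0)                # 112.5 days
  print(seconds * 100 * 2**20 / 2.0**50)  # ~0.9 PiB written to disk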

Hmm...

== Test 2: Does reducing metadata size help? ==

Another thing I tried was to see what the effect of removing a lot of
subvolumes would be. I simply ran the backup expiries for everything that
would expire in the next two weeks (which includes at least all daily
backup snapshots, which are kept for 14 days by default).

After that:

Data, single: total=38.62TiB, used=30.15TiB
System, DUP: total=40.00MiB, used=6.16MiB
Metadata, DUP: total=454.00GiB, used=391.78GiB
GlobalReserve, single: total=512.00MiB, used=265.62MiB

About 54000 subvolumes left now.

Hmmzzz... FS trees reduced from ~380 to ~340 GiB... not spectacular.

ROOT_TREE          48.05MiB 0(  3064) 1(    10) 2(     1)
EXTENT_TREE        14.41GiB 0(940473) 1(  3559) 2(    16) 3(     1)
CHUNK_TREE          6.16MiB 0(   393) 1(     1)
DEV_TREE            3.50MiB 0(   223) 1(     1)
FS_TREE           339.80GiB 0(22072422) 1(183505) 2( 12821) 3(   272)
CSUM_TREE          37.33GiB 0(2437006) 1(  9519) 2(    44) 3(     1)
QUOTA_TREE            0.00B
UUID_TREE           3.25MiB 0(   207) 1(     1)
FREE_SPACE_TREE   119.44MiB 0(  7619) 1(    24) 2(     1)
DATA_RELOC_TREE    16.00KiB 0(     1)

Now trying it again:
  btrfs balance start -f -mconvert=single /srv/backup

The result:
 * For each 512MiB, it takes on average 1 hour, writing 100MiB/s to disk
 * Almost all metadata is rewritten into existing DUP chunks (!!), since
there's more room in them now after reducing the total amount of metadata.
 * The little bit of metadata that does get written to the new single
chunks (which have twice as much space in total, because every 512MiB DUP
block group is traded for a new 1024MiB single one...) shows a somewhat
interesting pattern:

vaddr            length   flags          used           used_pct
[.. many more above here ..]
v 87196153413632 l 512MiB f METADATA|DUP used 379338752 pct 71
v 87784027062272 l 512MiB f METADATA|DUP used 351125504 pct 65
v 87784563933184 l 512MiB f METADATA|DUP used 365297664 pct 68
v 87901064921088 l 512MiB f METADATA|DUP used 403718144 pct 75
v 87901601792000 l 512MiB f METADATA|DUP used 373047296 pct 69
v 87969784397824 l 512MiB f METADATA|DUP used 376979456 pct 70
v 87971395010560 l 512MiB f METADATA|DUP used 398917632 pct 74
v 87971931881472 l 512MiB f METADATA|DUP used 391757824 pct 73
v 88126013833216 l 512MiB f METADATA|DUP used 426967040 pct 80
v 88172721602560 l 512MiB f METADATA|DUP used 418840576 pct 78
v 88186143375360 l 512MiB f METADATA|DUP used 422821888 pct 79
v 88187753988096 l 512MiB f METADATA|DUP used 395575296 pct 74
v 88190438342656 l 512MiB f METADATA|DUP used 388841472 pct 72
v 88545310015488 l 512MiB f METADATA|DUP used 347045888 pct 65
v 88545846886400 l 512MiB f METADATA|DUP used 318111744 pct 59
v 88546383757312 l 512MiB f METADATA|DUP used 101662720 pct 19
v 89532615622656 l 1GiB f METADATA used 150994944 pct 14
v 89533689364480 l 1GiB f METADATA used 150716416 pct 14
v 89534763106304 l 1GiB f METADATA used 144375808 pct 13
v 89535836848128 l 1GiB f METADATA used 140738560 pct 13
v 89536910589952 l 1GiB f METADATA used 144637952 pct 13
v 89537984331776 l 1GiB f METADATA used 153124864 pct 14
v 89539058073600 l 1GiB f METADATA used 127434752 pct 12
v 89540131815424 l 1GiB f METADATA used 113655808 pct 11
v 89541205557248 l 1GiB f METADATA used 99450880 pct 9
v 89542279299072 l 1GiB f METADATA used 90652672 pct 8
v 89543353040896 l 1GiB f METADATA used 78725120 pct 7
v 89544426782720 l 1GiB f METADATA used 74186752 pct 7
v 89545500524544 l 1GiB f METADATA used 65175552 pct 6
v 89546574266368 l 1GiB f METADATA used 47136768 pct 4
v 89547648008192 l 1GiB f METADATA used 30965760 pct 3
v 89548721750016 l 1GiB f METADATA used 15187968 pct 1

So, it's 3 times faster per block group, but it keeps redoing the same
work over and over again, leading to a 900 + 899 + 898 + 897 + ...
pattern in the total amount of work, it seems.

Still doesn't sound encouraging.
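
A listing like the one above takes only a few lines of python-btrfs;
roughly the following sketch (lengths printed as raw bytes here, and the
chunks()/block_group() helpers and BLOCK_GROUP_* constants are
assumptions based on the library's example scripts, so exact names may
differ):

  import btrfs

  # Rough sketch; FileSystem.chunks(), FileSystem.block_group() and the
  # BLOCK_GROUP_* constants are assumed to behave like in the library's
  # bundled example scripts -- exact names may differ per version.
  fs = btrfs.FileSystem('/srv/backup')
  for chunk in fs.chunks():
      if not chunk.type & btrfs.BLOCK_GROUP_METADATA:
          continue
      bg = fs.block_group(chunk.vaddr, chunk.length)
      flags = 'METADATA|DUP' if bg.flags & btrfs.BLOCK_GROUP_DUP else 'METADATA'
      print("v {} l {} f {} used {} pct {}".format(
          chunk.vaddr, chunk.length, flags, bg.used,
          int(round(bg.used * 100.0 / chunk.length))))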

== Intermezzo: endless btrfs_merge_delayed_refs ==

Test 3: What happens when starting at the lowest vaddr?

This test is mainly to just try things and find out what happens.

I tried to feed the metadata block group with the lowest vaddr to balance:
  btrfs balance start -f -mconvert=single,soft,vrange=29360128..29360129
/srv/backup

When doing so, the filesystem immediately ends up using 100% kernel cpu
and does not read from or write to disk anymore.

After letting it run for two hours, there's no change. These two kernel
threads are just burning 100% cpu, showing the following stack traces
(which do not change over time) in /proc/<pid>/stack:

kworker/u20:3

[<ffffffffc00caa74>] btrfs_insert_empty_items+0x94/0xc0 [btrfs]
[<ffffffff815fc689>] error_exit+0x9/0x20
[<ffffffffc01426fe>] btrfs_merge_delayed_refs+0xee/0x570 [btrfs]
[<ffffffffc00d5ded>] __btrfs_run_delayed_refs+0xad/0x13a0 [btrfs]
[<ffffffff810abdb1>] update_curr+0xe1/0x160
[<ffffffff811e02dc>] kmem_cache_alloc+0xbc/0x520
[<ffffffff810aabc4>] account_entity_dequeue+0xa4/0xc0
[<ffffffffc00da07d>] btrfs_run_delayed_refs+0x9d/0x2b0 [btrfs]
[<ffffffffc00da319>] delayed_ref_async_start+0x89/0xa0 [btrfs]
[<ffffffffc0124fff>] btrfs_scrubparity_helper+0xcf/0x2d0 [btrfs]
[<ffffffff81090384>] process_one_work+0x184/0x410
[<ffffffff8109065d>] worker_thread+0x4d/0x480
[<ffffffff81090610>] process_one_work+0x410/0x410
[<ffffffff81090610>] process_one_work+0x410/0x410
[<ffffffff8107bb0a>] do_group_exit+0x3a/0xa0
[<ffffffff810965ce>] kthread+0xce/0xf0
[<ffffffff81024701>] __switch_to+0x2c1/0x6c0
[<ffffffff81096500>] kthread_park+0x60/0x60
[<ffffffff815fb2f5>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

btrfs-transacti

[<ffffffff815fc689>] error_exit+0x9/0x20
[<ffffffffc01426fe>] btrfs_merge_delayed_refs+0xee/0x570 [btrfs]
[<ffffffffc01426a5>] btrfs_merge_delayed_refs+0x95/0x570 [btrfs]
[<ffffffffc00d5ded>] __btrfs_run_delayed_refs+0xad/0x13a0 [btrfs]
[<ffffffff8109d94d>] finish_task_switch+0x7d/0x1f0
[<ffffffffc00da07d>] btrfs_run_delayed_refs+0x9d/0x2b0 [btrfs]
[<ffffffff8107bb0a>] do_group_exit+0x3a/0xa0
[<ffffffffc00f0b10>] btrfs_commit_transaction+0x40/0xa10 [btrfs]
[<ffffffffc00f1576>] start_transaction+0x96/0x480 [btrfs]
[<ffffffffc00eb9ac>] transaction_kthread+0x1dc/0x200 [btrfs]
[<ffffffffc00eb7d0>] btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[<ffffffff810965ce>] kthread+0xce/0xf0
[<ffffffff81024701>] __switch_to+0x2c1/0x6c0
[<ffffffff81096500>] kthread_park+0x60/0x60
[<ffffffff815fb2f5>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

== Considering the options ==

Well, this all doesn't look good, that's for sure.

Especially the tendency, when converting, to empty DUP block groups into
other DUP block groups (which then need to be emptied again later)
instead of into single ones is a bit sad.

== Thinking out of the box ==

Technically, converting from DUP to single could also mean:
* Flipping one bit in the block group type flags to 0 for each block
group item
* Flipping one bit in the chunk type flags and removing 1 stripe struct
for each metadata chunk item
* Removing the
* Anything else?

How feasible would it be to write btrfs-progs style conversion to do this?
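
To make that a bit more concrete: a chunk item is a fixed 48-byte header
followed by one 32-byte stripe struct per copy, and DUP is a single bit
in the type field (the same bit sits in the flags of the matching block
group item). A rough sketch of the per-item change, just to show the
layout -- not working conversion code:

  import struct

  BTRFS_BLOCK_GROUP_DUP = 1 << 5

  # struct btrfs_chunk: length, owner, stripe_len, type (u64 each),
  # io_align, io_width, sector_size (u32 each), num_stripes, sub_stripes
  # (u16 each) = 48 bytes, followed by num_stripes * struct btrfs_stripe
  # (devid u64, offset u64, dev_uuid[16]) = 32 bytes each.
  CHUNK_HDR = struct.Struct('<4Q3I2H')
  STRIPE = struct.Struct('<2Q16s')

  def dup_chunk_item_to_single(item):
      """Clear the DUP bit and keep only stripe 0 of a raw DUP chunk item."""
      (length, owner, stripe_len, type_, io_align, io_width,
       sector_size, num_stripes, sub_stripes) = CHUNK_HDR.unpack_from(item)
      hdr = CHUNK_HDR.pack(length, owner, stripe_len,
                           type_ & ~BTRFS_BLOCK_GROUP_DUP,
                           io_align, io_width, sector_size, 1, sub_stripes)
      # itemsize shrinks from 48 + 2*32 = 112 to 48 + 32 = 80 bytes
      return hdr + item[CHUNK_HDR.size:CHUNK_HDR.size + STRIPE.size]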

-- 
Hans van Kranenburg


* Re: Experiences with metadata balance/convert
  2017-04-21 10:26 Experiences with metadata balance/convert Hans van Kranenburg
@ 2017-04-21 10:31 ` Hans van Kranenburg
  2017-04-21 11:13   ` Hans van Kranenburg
  2017-04-22  9:17 ` Duncan
  2017-04-22 16:45 ` Chris Murphy
  2 siblings, 1 reply; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-21 10:31 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

Doh,

On 04/21/2017 12:26 PM, Hans van Kranenburg wrote:
> [...]
> 
> == Thinking out of the box ==
> 
> Technically, converting from DUP to single could also mean:
> * Flipping one bit in the block group type flags to 0 for each block
> group item
> * Flipping one bit in the chunk type flags and removing 1 stripe struct
> for each metadata chunk item
> * Removing the

Removing the dev extent objects for all removed stripes from the dev tree.

> * Anything else?
> 
> How feasible would it be to write btrfs-progs style conversion to do this?


-- 
Hans van Kranenburg


* Re: Experiences with metadata balance/convert
  2017-04-21 10:31 ` Hans van Kranenburg
@ 2017-04-21 11:13   ` Hans van Kranenburg
  2017-04-21 11:27     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-21 11:13 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

On 04/21/2017 12:31 PM, Hans van Kranenburg wrote:
> Doh,
> 
> On 04/21/2017 12:26 PM, Hans van Kranenburg wrote:
>> [...]
>>
>> == Thinking out of the box ==
>>
>> Technically, converting from DUP to single could also mean:
>> * Flipping one bit in the block group type flags to 0 for each block
>> group item
>> * Flipping one bit in the chunk type flags and removing 1 stripe struct
>> for each metadata chunk item
>> * Removing the
> 
> Removing the dev extent objects for all removed stripes from the dev tree.
> 
>> * Anything else?
>>
>> How feasible would it be to write btrfs-progs style conversion to do this?

From the feedback on IRC already, to clear things up:

I'm *not* proposing/asking for this kind of functionality to be
officially available or added to mainline btrfs-progs and be supported
for every user to use. I understand that's a whole different kind of
discussion.

I only mean... would there be any great show-stopper for this idea which
would mean I couldn't technically do it myself in a 'one-off' style.

-- 
Hans van Kranenburg


* Re: Experiences with metadata balance/convert
  2017-04-21 11:13   ` Hans van Kranenburg
@ 2017-04-21 11:27     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-21 11:27 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs@vger.kernel.org

On 2017-04-21 07:13, Hans van Kranenburg wrote:
> On 04/21/2017 12:31 PM, Hans van Kranenburg wrote:
>> Doh,
>>
>> On 04/21/2017 12:26 PM, Hans van Kranenburg wrote:
>>> [...]
>>>
>>> == Thinking out of the box ==
>>>
>>> Technically, converting from DUP to single could also mean:
>>> * Flipping one bit in the block group type flags to 0 for each block
>>> group item
>>> * Flipping one bit in the chunk type flags and removing 1 stripe struct
>>> for each metadata chunk item
>>> * Removing the
>>
>> Removing the dev extent objects for all removed stripes from the dev tree.
>>
>>> * Anything else?
>>>
>>> How feasible would it be to write btrfs-progs style conversion to do this?
>
> From the feedback on IRC already, to clear things up:
>
> I'm *not* proposing/asking for this kind of functionality to be
> officially available or added to mainline btrfs-progs and be supported
> for every user to use. I understand that's a whole different kind of
> discussion.
>
> I only mean... would there be any great show-stopper for this idea which
> would mean I couldn't technically do it myself in a 'one-off' style.
>
Not that I know of, but it would be kind of nice to have an upstream 
version too (especially if it could handle raid1 to single conversion).

To be entirely honest though, the current profile conversion stuff is 
somewhat braindead.  It appears to assume that there is only one 
partially filled chunk (IOW, that you just ran a full balance), and
therefore does some seriously extraneous work.  This in turn scales
multiplicatively based on how much data needs to be converted.

With a proper design, raid1 or dup to single should just drop the extra
copy, single to dup or raid1 should just add an extra copy, and
raid1 to dup or dup to raid1 should just move one of the copies, but
doing that would make things seriously more complicated.


* Re: Experiences with metadata balance/convert
  2017-04-21 10:26 Experiences with metadata balance/convert Hans van Kranenburg
  2017-04-21 10:31 ` Hans van Kranenburg
@ 2017-04-22  9:17 ` Duncan
  2017-04-22 21:18   ` Hans van Kranenburg
  2017-04-22 16:45 ` Chris Murphy
  2 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2017-04-22  9:17 UTC (permalink / raw)
  To: linux-btrfs

Hans van Kranenburg posted on Fri, 21 Apr 2017 12:26:18 +0200 as
excerpted:

> So, I used the clone functionality of the underlying iSCSI target to get
> a writable throw-away version of the filesystem to experiment with
> (great!).

Please, I'm rather sure you know all this and have set up your system
accordingly, but any time there's mention of any sort of device or
filesystem cloning tool, we need to try to make it explicit for others
that may happen upon the post, either just reading the list or via a
google search or the like...

Don't let btrfs see both the original and clone at the same time, say 
with a btrfs device scan (which udev runs automatically when block 
devices with a btrfs on them show up, so don't insert any btrfs-hosting 
devices or otherwise let udev see a new one, until either the clone or 
original are removed from the system).

Because btrfs, being potentially multi-device unlike most filesystems, 
tracks the parts of a filesystem via UUID, universally unique identifier, 
which it therefore depends on being just that, unique.

The trouble is that the clone, /being/ a clone, has the same UUID as the 
original, and if btrfs sees both the clone and the original of a write-
mounted filesystem, "Bad Things"TM can happen!

If people take care not to insert new btrfs devices while making a clone, 
and detach the clone from the system hosting the original as soon as 
possible after the clone is completed, those "Bad Things" shouldn't 
happen and they should be OK in that regard.

Of course as I said I'm rather sure you know of this and used a system 
not hosting the filesystem you were cloning to access the iscsi, but 
things will be a bit more complicated for people doing, for instance, 
same-system LVM snapshots or DD clones, and every time I see references 
to cloning devices and filesystems, even if they know good and well about 
the problems and steer well clear, I get worried that somebody new to 
btrfs or who simply isn't yet aware of that problem, is going to see it 
being discussed and take that as an invitation to do a same-system btrfs 
clone and get themselves in a *HEAP* of trouble!

So any time you mention btrfs clone, just devote a one-liner to something 
like "Yes, I'm aware of the UUID problem and this clone wasn't exposed on 
the same system."... and hopefully it'll get connected in people's minds 
and if necessary they'll either research further or ask before they start 
going LVM-clone wild or something!

(Of course the above is an explanation in far more detail than that 
single-liner, because it's the full topic of the post and I might as 
well, compared to the simple parenthetical I'm asking for when clone 
discussions come up.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Experiences with metadata balance/convert
  2017-04-21 10:26 Experiences with metadata balance/convert Hans van Kranenburg
  2017-04-21 10:31 ` Hans van Kranenburg
  2017-04-22  9:17 ` Duncan
@ 2017-04-22 16:45 ` Chris Murphy
  2017-04-22 16:55   ` Chris Murphy
  2017-04-22 20:21   ` Hans van Kranenburg
  2 siblings, 2 replies; 12+ messages in thread
From: Chris Murphy @ 2017-04-22 16:45 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: linux-btrfs@vger.kernel.org

On Fri, Apr 21, 2017 at 4:26 AM, Hans van Kranenburg
<hans.van.kranenburg@mendix.com> wrote:

>
> == Thinking out of the box ==
>
> Technically, converting from DUP to single could also mean:
> * Flipping one bit in the block group type flags to 0 for each block
> group item
> * Flipping one bit in the chunk type flags and removing 1 stripe struct
> for each metadata chunk item
> * Removing the
> * Anything else?

This is in the realm of efficient file system pruning as a means of
fixing it. And the existing code is not pruning. It's clearly doing a
lot of complex balance computations first, second, and third, before
it even gets to the convert to single chunk task. Such a prune would
need to write out new chunk and dev trees, and then whatever nodes end
up pointing to those, maybe it's just the super blocks.

> How feasible would it be to write btrfs-progs style conversion to do this?

I can pretty much say it's not just a bit flip change because at the
very least you've got new CRCs to write for any changed node.

But looking at the chunk tree with btrfs-debug-tree -t 3 between
single and dup file systems:


    item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15945 itemsize 80
        length 1073741824 owner 2 stripe_len 65536 type METADATA
        io_align 65536 io_width 65536 sector_size 4096
        num_stripes 1 sub_stripes 1
            stripe 0 devid 1 offset 20971520
            dev_uuid 6cd61505-9d47-4521-b980-95e9f20de920

    item 69 key (FIRST_CHUNK_TREE CHUNK_ITEM 298730913792) itemoff 10569 itemsize 112
        length 536870912 owner 2 stripe_len 65536 type METADATA|DUP
        io_align 65536 io_width 65536 sector_size 4096
        num_stripes 2 sub_stripes 1
            stripe 0 devid 1 offset 32250003456
            dev_uuid 1ee7f7aa-701d-42b7-b37f-b3356c277e7d
            stripe 1 devid 1 offset 32786874368
            dev_uuid 1ee7f7aa-701d-42b7-b37f-b3356c277e7d

Whatever on-disk item makes this DUP needs to be removed/changed, and
then it's an open question whether it's sufficient to leave the stripe
1 metadata alone and expect that it'll just be ignored, or if it has
to be zero'd, or if the itemsize has to change to literally end it
after the stripe 0 dev_uuid; ie. in the above example if item 70 needs
to be moved up two lines (of course on disk the encoding of this
information is just a dozen or so bytes not lines).
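
(For reference, the 80 vs 112 itemsize above follows directly from the
on-disk layout: 48 bytes of fixed chunk header plus 32 bytes per stripe,
so a consistent single-stripe item would be 80 bytes:)

  # chunk itemsize = 48-byte fixed part + 32 bytes per stripe
  def chunk_itemsize(num_stripes):
      return 48 + 32 * num_stripes
  print(chunk_itemsize(1), chunk_itemsize(2))  # 80 112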

And then for the dev tree, you can see the above item 69 is pointing
to these two items, and they point to item 69. So I'd expect their
nodes need rewritten.

    item 32 key (1 DEV_EXTENT 32250003456) itemoff 14707 itemsize 48
        dev extent chunk_tree 3
        chunk_objectid 256 chunk_offset 298730913792 length 536870912
        chunk_tree_uuid 2928d93e-c031-464a-b475-e200cf61abac
    item 33 key (1 DEV_EXTENT 32786874368) itemoff 14659 itemsize 48
        dev extent chunk_tree 3
        chunk_objectid 256 chunk_offset 298730913792 length 536870912
        chunk_tree_uuid 2928d93e-c031-464a-b475-e200cf61abac


Anyway, after moving all of this stuff around, you still have to
compute a node CRC. So the whole 16KiB node has to be rewritten.

The proper way to do this in Btrfs terms would be to COW all of the
changed chunk tree nodes elsewhere, all the unneeded items are
removed. New CRCs. And then once that succeeds and is committed to
stable media, new supers written to point to the new chunk and dev
trees which in turn now only point to one of the already written
copies of metadata chunks, without writing out new chunks. Also, if
I'm not mistaken, the chunk tree actually lives in a system chunk. So
there's this neat thing where you want the metadata chunk profile to
be single, described by a tree that itself could be single or dup. The
user space tools today consider "metadata" to include metadata and
system chunks, so converting one converts the other. But in ancient
times the user space code made a distinction, and it's probably still
lurking in today's kernel code.

Anyway, yeah it'd be a ton faster. On your file system this is about
10MiB of writes to write out new dev and chunk trees that prune out the
unneeded extra copy. Just stop referencing the extra copy.

Basically right now it's doing a balance first, then convert. There's
no efficiency option to just convert via a prune only.

-- 
Chris Murphy


* Re: Experiences with metadata balance/convert
  2017-04-22 16:45 ` Chris Murphy
@ 2017-04-22 16:55   ` Chris Murphy
  2017-04-22 20:22     ` Hans van Kranenburg
  2017-04-22 20:21   ` Hans van Kranenburg
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2017-04-22 16:55 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hans van Kranenburg, linux-btrfs@vger.kernel.org

On Sat, Apr 22, 2017 at 10:45 AM, Chris Murph
> The proper way to do this in Btrfs terms would be to COW all of the
> changed chunk tree nodes elsewhere, all the unneeded items are
> removed. New CRCs. And then once that succeeds and is committed to
> stable media, new supers written to point to the new chunk and dev
> trees which in turn now only point to one of the already written
> copies of metadata chunks, without writing out new chunks.

Also probably needs free space cache or tree updated. But the main
point is that nothing is overwritten in Btrfs. It'd always be COW so
it's fail safe.

-- 
Chris Murphy


* Re: Experiences with metadata balance/convert
  2017-04-22 16:45 ` Chris Murphy
  2017-04-22 16:55   ` Chris Murphy
@ 2017-04-22 20:21   ` Hans van Kranenburg
  2017-04-23  9:45     ` Hans van Kranenburg
  1 sibling, 1 reply; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-22 20:21 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs@vger.kernel.org

On 04/22/2017 06:45 PM, Chris Murphy wrote:
> On Fri, Apr 21, 2017 at 4:26 AM, Hans van Kranenburg
> <hans.van.kranenburg@mendix.com> wrote:
> 
>>
>> == Thinking out of the box ==
>>
>> Technically, converting from DUP to single could also mean:
>> * Flipping one bit in the block group type flags to 0 for each block
>> group item
>> * Flipping one bit in the chunk type flags and removing 1 stripe struct
>> for each metadata chunk item
>> * Removing the
>> * Anything else?
> 
> [...]
> Such a prune would
> need to write out new chunk and dev trees, and then whatever nodes end
> up pointing to those, maybe it's just the super blocks.

Or just use the existing offline tree plumbing code to remove some items
and insert some replacement ones.

>> How feasible would it be to write btrfs-progs style conversion to do this?
> 
> I can pretty much say it's not just a bit flip change because at the
> very least you've got new CRCs to write for any changed node.

The tree plumbing will take care of that; I was not planning to hexedit it.

> [...]

I was actually thinking about writing a C extension for python-btrfs
that gives me access to the building blocks that are present inside the
btrfs-progs code, so that the 'heavy lifting' can be done there, and so
that I can just easily script opening a filesystem and then doing tree
changes, inserting and removing some tree items, even interactively. ;-)
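
Purely hypothetical, none of these bindings exist (yet), but the kind of
session I have in mind would look something like:

  # Entirely made-up bindings, just to illustrate the idea of scripting
  # offline tree changes on top of btrfs-progs' plumbing code.
  import btrfs.progs as progs

  fs = progs.open_ctree('/dev/sdx', writable=True)
  trans = fs.start_transaction()
  for key, item in fs.chunk_tree.items():
      pass  # e.g. rewrite DUP metadata chunk items as single here
  trans.commit()
  fs.close()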

-- 
Hans van Kranenburg


* Re: Experiences with metadata balance/convert
  2017-04-22 16:55   ` Chris Murphy
@ 2017-04-22 20:22     ` Hans van Kranenburg
  2017-04-22 20:33       ` Hans van Kranenburg
  0 siblings, 1 reply; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-22 20:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs@vger.kernel.org

On 04/22/2017 06:55 PM, Chris Murphy wrote:
> On Sat, Apr 22, 2017 at 10:45 AM, Chris Murph
>> The proper way to do this in Btrfs terms would be to COW all of the
>> changed chunk tree nodes elsewhere, all the unneeded items are
>> removed. New CRCs. And then once that succeeds and is committed to
>> stable media, new supers written to point to the new chunk and dev
>> trees which in turn now only point to one of the already written
>> copies of metadata chunks, without writing out new chunks.
> 
> Also probably needs free space cache or tree updated. But the main
> point is that nothing is overwritten in Btrfs. It'd always be COW so
> it's fail safe.

Free space tree only deals with the virtual address space, so it won't
see any difference.

-- 
Hans van Kranenburg


* Re: Experiences with metadata balance/convert
  2017-04-22 20:22     ` Hans van Kranenburg
@ 2017-04-22 20:33       ` Hans van Kranenburg
  0 siblings, 0 replies; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-22 20:33 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs@vger.kernel.org

On 04/22/2017 10:22 PM, Hans van Kranenburg wrote:
> On 04/22/2017 06:55 PM, Chris Murphy wrote:
>> On Sat, Apr 22, 2017 at 10:45 AM, Chris Murph
>>> The proper way to do this in Btrfs terms would be to COW all of the
>>> changed chunk tree nodes elsewhere, all the unneeded items are
>>> removed. New CRCs. And then once that succeeds and is committed to
>>> stable media, new supers written to point to the new chunk and dev
>>> trees which in turn now only point to one of the already written
>>> copies of metadata chunks, without writing out new chunks.
>>
>> Also probably needs free space cache or tree updated. But the main
>> point is that nothing is overwritten in Btrfs. It'd always be COW so
>> it's fail safe.
> 
> Free space tree only deals with the virtual address space, so it won't
> see any difference.

Inb4: while that's true, the offline code cannot currently keep the free
space tree (which I'm using) in sync, so the only option for that is
still to invalidate the whole thing and re-create it on mount again
afterwards. :-|

-- 
Hans van Kranenburg


* Re: Experiences with metadata balance/convert
  2017-04-22  9:17 ` Duncan
@ 2017-04-22 21:18   ` Hans van Kranenburg
  0 siblings, 0 replies; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-22 21:18 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 04/22/2017 11:17 AM, Duncan wrote:
> Hans van Kranenburg posted on Fri, 21 Apr 2017 12:26:18 +0200 as
> excerpted:
> 
>> So, I used the clone functionality of the underlying iSCSI target to get
>> a writable throw-away version of the filesystem to experiment with
>> (great!).
> 
> Please, I'm rather sure you know all this and have set up your system
> accordingly, but any time there's mention of any sort of device or
> filesystem cloning tool, we need to try to make it explicit for others
> that may happen upon the post, either just reading the list or via a
> google search or the like...

Ack. [1]

[1]
https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices

> [..... .. .. . ... .. . . .. . ..]
> (Of course the above is an explanation in far more detail than that 
> single-liner, because it's the full topic of the post and I might as 
> well, compared to the simple parenthetical I'm asking for when clone 
> discussions come up.)

Thanks for stressing the fact.

And yes, the cloned iSCSI LUNs are only visible to a completely separate
server configured in another initiator group of the target, seen by
hardware that is not aware of what's happening in the production cluster.

-- 
Hans van Kranenburg


* Re: Experiences with metadata balance/convert
  2017-04-22 20:21   ` Hans van Kranenburg
@ 2017-04-23  9:45     ` Hans van Kranenburg
  0 siblings, 0 replies; 12+ messages in thread
From: Hans van Kranenburg @ 2017-04-23  9:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs@vger.kernel.org

On 22/04/2017 22:21, Hans van Kranenburg wrote:
> On 04/22/2017 06:45 PM, Chris Murphy wrote:
>> On Fri, Apr 21, 2017 at 4:26 AM, Hans van Kranenburg
>> <hans.van.kranenburg@mendix.com> wrote:
>>
>>>
>>> == Thinking out of the box ==
>>>
>>> Technically, converting from DUP to single could also mean:
>>> * Flipping one bit in the block group type flags to 0 for each block
>>> group item
>>> * Flipping one bit in the chunk type flags and removing 1 stripe struct
>>> for each metadata chunk item
>>> * Removing the
>>> * Anything else?
>>
>> [...]
>> Such a prune would
>> need to write out new chunk and dev trees, and then whatever nodes end
>> up pointing to those, maybe it's just the super blocks.
>
> Or just use the existing offline tree plumbing code to remove some items
> and insert some replacement ones.

Oh wait, that's not true; it's not that simple of course. To be able to
fully COW, insert items, etc., it needs to activate enough of the
filesystem to have exactly the things I would want to change active in
memory.

To be continued...

Hans

