* Superblock update: Is there really any benefits of updating synchronously? @ 2018-01-23  7:03 waxhead
  2018-01-23  9:03 ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: waxhead @ 2018-01-23 7:03 UTC (permalink / raw)
To: Btrfs BTRFS

Note: This has been mentioned before, but since I see some issues
related to superblocks I think it would be good to bring up the question
again.

According to the information found in the wiki:
https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock

The superblocks are updated synchronously on HDDs and one after the
other on SSDs. Superblocks are also (to my knowledge) not protected by
copy-on-write and are read-modify-write.

On a storage device larger than 256GB there will be three superblocks.

BTRFS will always prefer the superblock with the highest generation
number, provided that the checksum is good.

On the list there seem to have been a few incidents where the
superblocks have gone toast, and I am pondering what (if any) benefit
there is in updating the superblocks synchronously.

The superblock is checkpointed every 30 seconds by default, and if
someone pulls the plug (power outage) on an HDD then a synchronous
write, depending on (the quality of) your hardware, may perhaps ruin all
the superblock copies in one go. E.g. copies A, B and C will all be
updated at 30s.

On SSDs, since one superblock is updated after the other, using the
default 30 second checkpoint would mean copy A=30s, copy B=1m, copy
C=1m30s.

Why is the SSD method not used on hard drives also?! If two superblocks
are toast you would at maximum lose 1m30s by default, and if this is
considered a problem then you can always adjust the commit time
downwards. If it is set to 15 seconds you would still only lose 30
seconds of "action time" and would, in my opinion, be far better off
from a reliability point of view than having to update multiple
superblocks at the same time.
I can't see why on earth updating all superblocks at the same time would
have any benefits. So this all boils down to the questions three (ere
the other side will see..... :P )

1. What are the benefits of updating all superblocks at the same time?
(Just imagine if your memory is bad - you could risk updating all
superblocks simultaneously with kebab'ed data.)

2. What would the negative consequences be of using the SSD scheme also
for hard disks? Especially if the commit time is set to 15s instead of
30s.

3. In a RAID1 / 10 / 5 / 6 like setup, would a set of corrupt
superblocks on a single drive be recoverable from the other disks, or do
the superblocks need to be intact on the (possibly) damaged drive? (If
the superblocks are needed, then why would SSD mode not be better,
especially if the drive is partly working?)

^ permalink raw reply	[flat|nested] 8+ messages in thread
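[Editor's note: the three superblock mirror locations referred to above are fixed in the btrfs on-disk format at 64KiB, 64MiB and 256GiB from the start of each device, which is why only devices larger than 256GiB carry all three copies. A small illustrative sketch (not btrfs code) of which copies fit on a device of a given size:]

```python
# Fixed btrfs superblock mirror offsets (per the on-disk format docs):
# primary at 64KiB, first mirror at 64MiB, second mirror at 256GiB.
SUPERBLOCK_OFFSETS = [64 * 1024, 64 * 1024 ** 2, 256 * 1024 ** 3]
SUPERBLOCK_SIZE = 4096  # the superblock structure occupies 4KiB

def superblock_copies(device_size: int) -> list:
    """Return the byte offsets of the superblock copies that fit on a
    device of the given size; a >256GiB device therefore has three."""
    return [off for off in SUPERBLOCK_OFFSETS
            if off + SUPERBLOCK_SIZE <= device_size]

print(len(superblock_copies(500 * 1024 ** 3)))  # 500GiB device -> 3 copies
print(len(superblock_copies(100 * 1024 ** 3)))  # 100GiB device -> 2 copies
```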
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23  7:03 Superblock update: Is there really any benefits of updating synchronously? waxhead
@ 2018-01-23  9:03 ` Nikolay Borisov
  2018-01-23 14:20   ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2018-01-23 9:03 UTC (permalink / raw)
To: waxhead, Btrfs BTRFS

On 23.01.2018 09:03, waxhead wrote:
> Note: This has been mentioned before, but since I see some issues
> related to superblocks I think it would be good to bring up the
> question again.
>
> According to the information found in the wiki:
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock
>
> The superblocks are updated synchronously on HDDs and one after the
> other on SSDs.

There is currently no distinction in the code whether we are writing to
an SSD or an HDD. Also, what do you mean by synchronously? If you
inspect the code in write_all_supers you will see that for every device
we issue writes for every available copy of the superblock and then wait
for all of them to finish via 'wait_dev_supers'. In that regard sb
writeout is asynchronous.

> Superblocks are also (to my knowledge) not protected by copy-on-write
> and are read-modify-write.
>
> On a storage device larger than 256GB there will be three superblocks.
>
> BTRFS will always prefer the superblock with the highest generation
> number, provided that the checksum is good.

Wrong. On mount btrfs will only ever read the first superblock at 64k.
If that one is corrupted it will refuse to mount; it's then expected
that the user will initiate a recovery procedure with btrfs-progs, which
reads all supers and replaces them with the "newest" one (as decided by
the generation number).

> On the list there seem to have been a few incidents where the
> superblocks have gone toast, and I am pondering what (if any) benefit
> there is in updating the superblocks synchronously.
>
> The superblock is checkpointed every 30 seconds by default, and if
> someone pulls the plug (power outage) on an HDD then a synchronous
> write, depending on (the quality of) your hardware, may perhaps ruin
> all the superblock copies in one go. E.g. copies A, B and C will all
> be updated at 30s.
>
> On SSDs, since one superblock is updated after the other, using the
> default 30 second checkpoint would mean copy A=30s, copy B=1m, copy
> C=1m30s.

As explained previously, there is no notion of "SSD vs HDD" modes.

> Why is the SSD method not used on hard drives also?! If two
> superblocks are toast you would at maximum lose 1m30s by default, and
> if this is considered a problem then you can always adjust the commit
> time downwards. If it is set to 15 seconds you would still only lose
> 30 seconds of "action time" and would, in my opinion, be far better
> off from a reliability point of view than having to update multiple
> superblocks at the same time. I can't see why on earth updating all
> superblocks at the same time would have any benefits.
>
> So this all boils down to the questions three (ere the other side will
> see..... :P )
>
> 1. What are the benefits of updating all superblocks at the same time?
> (Just imagine if your memory is bad - you could risk updating all
> superblocks simultaneously with kebab'ed data.)
>
> 2. What would the negative consequences be of using the SSD scheme
> also for hard disks? Especially if the commit time is set to 15s
> instead of 30s.
>
> 3. In a RAID1 / 10 / 5 / 6 like setup, would a set of corrupt
> superblocks on a single drive be recoverable from the other disks, or
> do the superblocks need to be intact on the (possibly) damaged drive?

According to the code in super-recover.c from btrfs-progs you needn't
have the sb intact on the broken disk, since the tool first makes a list
of all devices constituting this filesystem, then makes a list of all
valid superblocks on those disks and finally chooses the one with the
highest generation number to replace the rest.

> (If the superblocks are needed, then why would SSD mode not be better,
> especially if the drive is partly working?)

^ permalink raw reply	[flat|nested] 8+ messages in thread
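[Editor's note: the selection logic described above (collect all readable superblock copies, keep those whose checksum verifies, pick the highest generation) can be sketched as follows. This is an illustrative Python model, not the actual super-recover.c code; the real tool checksums the superblock body with crc32c, for which plain crc32 stands in here.]

```python
import zlib
from dataclasses import dataclass

@dataclass
class Super:
    generation: int
    payload: bytes   # stand-in for the superblock body
    csum: int        # checksum stored alongside it

def make_super(generation: int, payload: bytes) -> Super:
    return Super(generation, payload, zlib.crc32(payload))

def pick_best(candidates: list) -> Super:
    """Mimic the super-recover approach: drop copies whose checksum does
    not verify, then choose the valid copy with the highest generation."""
    valid = [s for s in candidates if zlib.crc32(s.payload) == s.csum]
    if not valid:
        raise RuntimeError("no valid superblock found on any device")
    return max(valid, key=lambda s: s.generation)

a = make_super(100, b"gen-100")
b = make_super(101, b"gen-101")
b.payload = b"bitrot!"               # corrupt the newest copy
print(pick_best([a, b]).generation)  # falls back to generation 100
```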
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23  9:03 ` Nikolay Borisov
@ 2018-01-23 14:20   ` Hans van Kranenburg
  2018-01-23 14:48     ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-01-23 14:20 UTC (permalink / raw)
To: Nikolay Borisov, waxhead, Btrfs BTRFS

On 01/23/2018 10:03 AM, Nikolay Borisov wrote:
> On 23.01.2018 09:03, waxhead wrote:
>> Note: This has been mentioned before, but since I see some issues
>> related to superblocks I think it would be good to bring up the
>> question again.
>>
>> [...]
>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock
>>
>> The superblocks are updated synchronously on HDDs and one after the
>> other on SSDs.
>
> There is currently no distinction in the code whether we are writing
> to an SSD or an HDD.

So what does that line in the wiki mean, and why is it there? "btrfs
normally updates all superblocks, but in SSD mode it will update only
one at a time."

> Also, what do you mean by synchronously? If you inspect the code in
> write_all_supers you will see that for every device we issue writes
> for every available copy of the superblock and then wait for all of
> them to finish via 'wait_dev_supers'. In that regard sb writeout is
> asynchronous.
>
>> Superblocks are also (to my knowledge) not protected by copy-on-write
>> and are read-modify-write.
>>
>> On a storage device larger than 256GB there will be three
>> superblocks.
>>
>> BTRFS will always prefer the superblock with the highest generation
>> number, provided that the checksum is good.
>
> Wrong. On mount btrfs will only ever read the first superblock at 64k.
> If that one is corrupted it will refuse to mount; it's then expected
> that the user will initiate a recovery procedure with btrfs-progs,
> which reads all supers and replaces them with the "newest" one (as
> decided by the generation number).

So again, the line "The superblock with the highest generation is used
when reading." in the wiki needs to go away then?

>> On the list there seem to have been a few incidents where the
>> superblocks have gone toast, and I am pondering what (if any) benefit
>> there is in updating the superblocks synchronously.
>>
>> [...]
>
> As explained previously, there is no notion of "SSD vs HDD" modes.

We also had a discussion about the "backup roots" that are stored beside
the superblock, and that they are "better than nothing" to help maybe
recover something from a broken fs, but never ever guarantee you will
get a working filesystem back.

The same holds for superblocks from a previous generation. As soon as
the transaction for generation X successfully hits the disk, all space
that was occupied in generation X-1 but no longer in X is available to
be overwritten immediately.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23 14:20   ` Hans van Kranenburg
@ 2018-01-23 14:48     ` Nikolay Borisov
  2018-01-23 19:51       ` waxhead
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2018-01-23 14:48 UTC (permalink / raw)
To: Hans van Kranenburg, waxhead, Btrfs BTRFS

On 23.01.2018 16:20, Hans van Kranenburg wrote:
> On 01/23/2018 10:03 AM, Nikolay Borisov wrote:
>> On 23.01.2018 09:03, waxhead wrote:
>>> [...]
>>> The superblocks are updated synchronously on HDDs and one after the
>>> other on SSDs.
>>
>> There is currently no distinction in the code whether we are writing
>> to an SSD or an HDD.
>
> So what does that line in the wiki mean, and why is it there? "btrfs
> normally updates all superblocks, but in SSD mode it will update only
> one at a time."

It means the wiki is outdated.

>> Also, what do you mean by synchronously? If you inspect the code in
>> write_all_supers you will see that for every device we issue writes
>> for every available copy of the superblock and then wait for all of
>> them to finish via 'wait_dev_supers'. In that regard sb writeout is
>> asynchronous.
>>
>>> [...]
>>>
>>> BTRFS will always prefer the superblock with the highest generation
>>> number, provided that the checksum is good.
>>
>> Wrong. On mount btrfs will only ever read the first superblock at
>> 64k. If that one is corrupted it will refuse to mount; it's then
>> expected that the user will initiate a recovery procedure with
>> btrfs-progs, which reads all supers and replaces them with the
>> "newest" one (as decided by the generation number).
>
> So again, the line "The superblock with the highest generation is used
> when reading." in the wiki needs to go away then?

Yep; for background information you can read the discussion here:
https://www.spinics.net/lists/linux-btrfs/msg71878.html

> We also had a discussion about the "backup roots" that are stored
> beside the superblock, and that they are "better than nothing" to help
> maybe recover something from a broken fs, but never ever guarantee you
> will get a working filesystem back.
>
> The same holds for superblocks from a previous generation. As soon as
> the transaction for generation X successfully hits the disk, all space
> that was occupied in generation X-1 but no longer in X is available to
> be overwritten immediately.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23 14:48     ` Nikolay Borisov
@ 2018-01-23 19:51       ` waxhead
  2018-01-24  0:04         ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: waxhead @ 2018-01-23 19:51 UTC (permalink / raw)
To: Nikolay Borisov, Hans van Kranenburg, Btrfs BTRFS

Nikolay Borisov wrote:
> On 23.01.2018 16:20, Hans van Kranenburg wrote:
>> On 01/23/2018 10:03 AM, Nikolay Borisov wrote:
>>> On 23.01.2018 09:03, waxhead wrote:
>>>> [...]
>>>> The superblocks are updated synchronously on HDDs and one after the
>>>> other on SSDs.
>>>
>>> There is currently no distinction in the code whether we are writing
>>> to an SSD or an HDD.
>>
>> So what does that line in the wiki mean, and why is it there? "btrfs
>> normally updates all superblocks, but in SSD mode it will update only
>> one at a time."
>
> It means the wiki is outdated.

Ok, and now the wiki is updated. Great :)

>>> Also, what do you mean by synchronously? If you inspect the code in
>>> write_all_supers you will see that for every device we issue writes
>>> for every available copy of the superblock and then wait for all of
>>> them to finish via 'wait_dev_supers'. In that regard sb writeout is
>>> asynchronous.

I meant basically what you have explained. You write the same memory to
all superblocks "step by step" but in one operation.

>>>> BTRFS will always prefer the superblock with the highest generation
>>>> number, provided that the checksum is good.
>>>
>>> Wrong. On mount btrfs will only ever read the first superblock at
>>> 64k. [...]
>>
>> So again, the line "The superblock with the highest generation is
>> used when reading." in the wiki needs to go away then?
>
> Yep; for background information you can read the discussion here:
> https://www.spinics.net/lists/linux-btrfs/msg71878.html

And the wiki is also updated... Great!

>>> As explained previously, there is no notion of "SSD vs HDD" modes.

Ok, thanks for clearing things up. But the main thing here is that all
superblocks are updated at the same time, both on SSDs and on HDDs. I
think the question is still valid: what is there to gain by updating all
of them every 30s instead of updating them one by one?! Would that not
be safer, perhaps an itty-bitty bit quicker, and perhaps better in terms
of recovery?!

>> We also had a discussion about the "backup roots" that are stored
>> beside the superblock, and that they are "better than nothing" to
>> help maybe recover something from a broken fs, but never ever
>> guarantee you will get a working filesystem back.
>>
>> The same holds for superblocks from a previous generation. As soon as
>> the transaction for generation X successfully hits the disk, all
>> space that was occupied in generation X-1 but no longer in X is
>> available to be overwritten immediately.

Ok, so this means that superblocks with an older generation are utterly
useless and will lead to corruption (effectively making my argument
above useless, as that would in fact assist corruption then).

Does this means that if disk space was allocated in X-1 and is freed in
X it will unallocated if you roll back to X-1 e.g. writing to
unallocated storage.

I was under the impression that a superblock was like a "snapshot" of
the entire filesystem and that rollbacks via pre-gen superblocks were
possible. Am I mistaking?

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23 19:51       ` waxhead
@ 2018-01-24  0:04         ` Hans van Kranenburg
  2018-01-24 18:54           ` waxhead
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-01-24 0:04 UTC (permalink / raw)
To: waxhead, Nikolay Borisov, Btrfs BTRFS

On 01/23/2018 08:51 PM, waxhead wrote:
> Nikolay Borisov wrote:
>> On 23.01.2018 16:20, Hans van Kranenburg wrote:

[...]

>>> We also had a discussion about the "backup roots" that are stored
>>> beside the superblock, and that they are "better than nothing" to
>>> help maybe recover something from a broken fs, but never ever
>>> guarantee you will get a working filesystem back.
>>>
>>> The same holds for superblocks from a previous generation. As soon
>>> as the transaction for generation X successfully hits the disk, all
>>> space that was occupied in generation X-1 but no longer in X is
>>> available to be overwritten immediately.
>
> Ok, so this means that superblocks with an older generation are
> utterly useless and will lead to corruption (effectively making my
> argument above useless, as that would in fact assist corruption then).

Mostly, yes.

> Does this means that if disk space was allocated in X-1 and is freed
> in X it will unallocated if you roll back to X-1 e.g. writing to
> unallocated storage.

Can you reword that? I can't follow that sentence.

> I was under the impression that a superblock was like a "snapshot" of
> the entire filesystem and that rollbacks via pre-gen superblocks were
> possible. Am I mistaking?

Yes. The first fundamental thing in Btrfs is COW, which makes sure that
everything referenced from transaction X, from the superblock all the
way down to metadata trees and actual data space, is never overwritten
by changes done in transaction X+1.

For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the
way this is done is actually quite simple. If a block is cowed, the old
location is added to a 'pinned extents' list (in memory), which is used
as a blacklist for choosing space to put new writes in. After a
transaction is completed on disk, that list with pinned extents is
emptied and all that space is available for immediate reuse. This way we
make sure that if the transaction that is ongoing is aborted, the
previous one (the latest one that is completely on disk) is always still
there. If the computer crashes and the in-memory list is lost, no big
deal, we just continue from the latest completed transaction again after
a reboot. (ignoring extra log things for simplicity)

So, the only situation in which you can fully use an X-1 superblock is
when none of that previously pinned space has actually been overwritten
yet afterwards. And if any of the space was overwritten already, you can
go play around with using an older superblock and your filesystem mounts
and everything might look fine, until you hit that distant corner and
BOOM!

---- >8 ---- Extra!! Moar!! ---- >8 ----

But, doing so does not give you snapshot functionality yet! It's more
like a poor man's snapshot that can only prevent messing up the current
version.

Snapshot functionality is implemented only for filesystem trees
(subvolumes), by adding reference counting (which does end up on disk)
to the metadata blocks, and then COWing trees as a whole.

If you make a snapshot of a filesystem tree, the snapshot gets a whole
new tree ID! It's not a previous version of the same subvolume you're
looking at, it's a clone! This is a big difference. The extent tree is
always tree 2. The chunk tree is always tree 3. But your subvolume
snapshot gets a new tree number.

Technically, it would maybe be possible to add reference counting and
snapshots to all of the metadata trees, but it would probably mean that
the whole filesystem would get stuck rewriting itself all day instead of
doing any useful work. The current extent tree already has such an
amount of rumination problems that the added work of keeping track of
reference counts would make it completely unusable.

In the wiki, it's here:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging

Actually, I just paraphrased the first two of those six paragraphs...
The subvolume trees actually having a previous version of themselves
again (whaaaa!) is another thing... ;]

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 8+ messages in thread
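[Editor's note: the 'pinned extents' mechanism described above can be modelled with a toy allocator: space freed by COW during the current transaction is blacklisted for new writes, and only becomes reusable once the transaction commits. A minimal in-memory sketch, illustrative only, not btrfs code:]

```python
class ToyAllocator:
    """Toy model of COW allocation with a pinned-extents blacklist:
    a block cowed away during the current transaction may not be
    reused until the transaction is fully committed to disk."""

    def __init__(self, size: int):
        self.free = set(range(size))   # allocatable block numbers
        self.pinned = set()            # freed this transaction, blacklisted

    def cow(self, old_block: int) -> int:
        # Write the new copy somewhere that is neither in use nor pinned.
        new_block = min(self.free)
        self.free.remove(new_block)
        self.pinned.add(old_block)     # old copy must survive until commit
        return new_block

    def commit(self):
        # Transaction is safe on disk: pinned space becomes reusable.
        self.free |= self.pinned
        self.pinned.clear()

alloc = ToyAllocator(4)
alloc.free -= {0}          # block 0 holds the current tree root
new = alloc.cow(0)         # root is cowed to a new location
print(0 in alloc.free)     # False: old root still protected
alloc.commit()
print(0 in alloc.free)     # True: old location reusable next transaction
```

If the machine crashes before commit(), the in-memory pinned set is simply lost and the previous transaction's blocks are all still intact on disk, which is exactly why the latest completed transaction always survives.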
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-24  0:04         ` Hans van Kranenburg
@ 2018-01-24 18:54           ` waxhead
  2018-01-24 21:00             ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: waxhead @ 2018-01-24 18:54 UTC (permalink / raw)
To: Hans van Kranenburg, Nikolay Borisov, Btrfs BTRFS

Hans van Kranenburg wrote:
> On 01/23/2018 08:51 PM, waxhead wrote:
>> Nikolay Borisov wrote:
>>> On 23.01.2018 16:20, Hans van Kranenburg wrote:
>
> [...]
>
>>>>> The same holds for superblocks from a previous generation. As soon
>>>>> as the transaction for generation X successfully hits the disk,
>>>>> all space that was occupied in generation X-1 but no longer in X
>>>>> is available to be overwritten immediately.
>>
>> Ok, so this means that superblocks with an older generation are
>> utterly useless and will lead to corruption (effectively making my
>> argument above useless, as that would in fact assist corruption
>> then).
>
> Mostly, yes.
>
>> Does this means that if disk space was allocated in X-1 and is freed
>> in X it will unallocated if you roll back to X-1 e.g. writing to
>> unallocated storage.
>
> Can you reword that? I can't follow that sentence.

Sure, why not. I'll give it a go:

Does this mean that if...

* Superblock generation N-1 has range 1234-2345 allocated and used,

and...

* Superblock generation N-0 (the current) has range 1234-2345 free
because someone deleted a file or something,

then...

there is no point in rolling back to generation N-1, because that refers
to what is now essentially free "memory" which may or may not have been
written over by generation N-0? And therefore N-1, which still thinks
range 1234-2345 is allocated, may point to the wrong data.

I hope that was easier to follow - if not, don't hold back on the
expletives! :)

>> I was under the impression that a superblock was like a "snapshot" of
>> the entire filesystem and that rollbacks via pre-gen superblocks were
>> possible. Am I mistaking?
>
> Yes. The first fundamental thing in Btrfs is COW, which makes sure
> that everything referenced from transaction X, from the superblock all
> the way down to metadata trees and actual data space, is never
> overwritten by changes done in transaction X+1.

Perhaps a tad off topic, but assuming the (hopefully) better explanation
above clears things up a bit: what happens if a block is freed?! in X+1
--- which must mean that it can be overwritten in transaction X+1 (which
I assume means a new superblock generation). After all, without freeing
and overwriting data there is no way to re-use space.

> For metadata trees that are NOT filesystem trees a.k.a. subvolumes,
> the way this is done is actually quite simple. If a block is cowed,
> the old location is added to a 'pinned extents' list (in memory),
> which is used as a blacklist for choosing space to put new writes in.
> [...]
>
> So, the only situation in which you can fully use an X-1 superblock is
> when none of that previously pinned space has actually been
> overwritten yet afterwards.
>
> And if any of the space was overwritten already, you can go play
> around with using an older superblock and your filesystem mounts and
> everything might look fine, until you hit that distant corner and
> BOOM!

Got it. This takes care of my questions above, but I'll leave them in
just for completeness' sake. Thanks for the good explanation.

> ---- >8 ---- Extra!! Moar!! ---- >8 ----
>
> [...]
>
> Actually, I just paraphrased the first two of those six paragraphs...
> The subvolume trees actually having a previous version of themselves
> again (whaaaa!) is another thing... ;]

hehe, again thanks for giving a good explanation. Clears things up a bit
indeed!

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously? 2018-01-24 18:54 ` waxhead @ 2018-01-24 21:00 ` Hans van Kranenburg 0 siblings, 0 replies; 8+ messages in thread From: Hans van Kranenburg @ 2018-01-24 21:00 UTC (permalink / raw) To: waxhead@dirtcellar.net, Nikolay Borisov, Btrfs BTRFS On 01/24/2018 07:54 PM, waxhead wrote: > Hans van Kranenburg wrote: >> On 01/23/2018 08:51 PM, waxhead wrote: >>> Nikolay Borisov wrote: >>>> On 23.01.2018 16:20, Hans van Kranenburg wrote: >> >> [...] >> >>>>> >>>>> We also had a discussion about the "backup roots" that are stored >>>>> besides the superblock, and that they are "better than nothing" to help >>>>> maybe recover something from a borken fs, but never ever guarantee you >>>>> will get a working filesystem back. >>>>> >>>>> The same holds for superblocks from a previous generation. As soon as >>>>> the transaction for generation X succesfully hits the disk, all space >>>>> that was occupied in generation X-1 but no longer in X is available to >>>>> be overwritten immediately. >>>>> >>> Ok so this means that superblocks with a older generation is utterly >>> useless and will lead to corruption (effectively making my argument >>> above useless as that would in fact assist corruption then). >> >> Mostly, yes. >> >>> Does this means that if disk space was allocated in X-1 and is freed in >>> X it will unallocated if you roll back to X-1 e.g. writing to >>> unallocated storage. >> >> Can you reword that? I can't follow that sentence. > Sure why not. I'll give it a go: > > Does this mean that if... > * Superblock generation N-1 have range 1234-2345 allocated and used. > > and.... > > * Superblock generation N-0 (the current) have range 1234-2345 free > because someone deleted a file or something Ok, so I assume that with current you mean the one on disk now. > Then.... 
> > It is no point in rolling back to generation N-1 because that refers to > what is no essentially free "memory" which may or may have not been > written over by generation N-0. If space that was used in N-1 turned into free space during N-0, then N-0 will never have reused that space already, since if writing out N-0 had crashed halfway, so the superblock as seen when mounting is still N-1, then you need to be able to fully use N-1. It can be used immediately by N+1 however after the N-0 superblock is safe on disk. > And therefore N-1 which still thinks > range 1234-2345 is allocated may point to the wrong data. So, at least for disk space used by metadata blocks: 1234-2345 - N-1 - in use 1234-2345 - N-0 - not in use, but can't be overwritten yet 1234-2345 - N+1 - can start writing whatever it wants in that disk location any time > I hope that was easier to follow - if not don't hold back on the > explicitives! :) > >> >>> I was under the impression that a superblock was like a "snapshot" of >>> the entire filesystem and that rollbacks via pre-gen superblocks was >>> possible. Am I mistaking? >> >> Yes. The first fundamental thing in Btrfs is COW which makes sure that >> everything referenced from transaction X, from the superblock all the >> way down to metadata trees and actual data space is never overwritten by >> changes done in transaction X+1. >> > Perhaps a tad off topic, but assuming the (hopefully) better explanation > above clear things up a bit. What happens if a block is freed?! in X+1 > --- which must mean that it can be overwritten in transaction X+1 (which > I assume means a new superblock generation). After all without freeing > and overwriting data there is no way to re-use space. Freed in X you mean? Or not? But you write "freed?! in X+1". For actual data disk space, it's the same pattern as above (so space freed up during a transaction can only be reused in the next one), but implemented a bit differently. 
For metadata trees which do not have reference counting (e.g. the extent tree), there's the pinned extent (metadata block disk locations) list I mentioned already.

For data, we have the filesystem (subvolume) trees which reference all files and the data extents they use data from, and via the links to the extent tree they keep all locations where actual data is on disk marked as occupied.

Now comes the different part. Because the filesystem trees already implement the extra reference counting functionality, this is used to prevent freed up data space from being overwritten in the same transaction.

How does this work? Well, that's the rest of the wiki section I linked below. :-D So you're asking exactly the right next question here, I guess.

When making changes to a subvolume tree (normal file create, write content, rename, delete, etc.), btrfs is secretly just cloning the tree into a new subvolume with the same subvolume ID. Wait, what? Whoa!

So if you're changing subvolume 1234, there's an item (1234 ROOT_ITEM N-0) on disk in tree 1, and in memory it starts working on (1234 ROOT_ITEM N+1). As an end user, you never see this happening when you look at btrfs sub list etc, it's hidden from you.

"When the transaction commits, a new root pointer is inserted in the root tree for each new subvolume root."

[...]

"At this time the root tree has two pointers for each subvolume changed during the transaction. One item points to the new tree and one points to the tree that existed at the start of the last transaction."

After the new transaction commits OK, the cleaner removes the old subvolume from the previous transaction, which is technically the same code that is used for a regular subvol delete initiated by a user.

So only when the old version of the same tree is removed will the extent tree mappings for data disk space that was freed in the previous transaction be adjusted, and it ends up as free data space that can be overwritten.
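The two-pointers-then-cleanup dance can be sketched like this. Again a hypothetical toy model, not kernel code: the root tree is just a dict keyed on (subvolume ID, generation), and the "cleaner" drops every entry except the newest one for that subvolume ID.

```python
# Toy model of the root tree during a transaction: while subvolume 1234
# is being changed, the root tree briefly holds two items for it, one
# per generation. The cleaner then removes the older one, which is what
# finally releases the data space freed in the previous transaction.

root_tree = {(1234, 9): "old tree blocks"}   # (subvol_id, generation)

def commit_transaction(tree, subvol_id, new_gen):
    # A new root pointer is inserted for the changed subvolume.
    tree[(subvol_id, new_gen)] = "new tree blocks"

def cleaner(tree, subvol_id):
    # Drop all but the highest generation for this subvolume ID;
    # conceptually the same path as a user-initiated subvol delete.
    gens = [g for (sid, g) in tree if sid == subvol_id]
    newest = max(gens)
    for g in gens:
        if g != newest:
            del tree[(subvol_id, g)]

commit_transaction(root_tree, 1234, 10)
assert sorted(root_tree) == [(1234, 9), (1234, 10)]  # two pointers, briefly
cleaner(root_tree, 1234)
assert list(root_tree) == [(1234, 10)]               # only the clone remains
```

Note that both entries share the same subvolume ID 1234; that's why tools normally hide the older one from you.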
(Well, if an extent in its entirety is not referenced by any file in any subvol, that is. Partially unreferenced extents keep hanging around as unreachable data, but that's again another story.)

When doing things like subvolume list between a transaction commit and the cleanup being finished, the btrfs sub list code will only show the one with the highest generation (transaction) number if it encounters multiple ones, and filters out the others so as not to confuse you. If you script some tree searches, e.g. with a few lines of python-btrfs, then you could spot them.

So next time you remove a really big file and don't see a difference in df output... You know you will only see it after the current transaction is finished and the cleanup at the beginning of the new one is done.

And the whole "Copy on Write Logging" section in the wiki should make sense now. \o/

>> For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the
>> way this is done is actually quite simple. If a block is cowed, the old
>> location is added to a 'pinned extents' list (in memory), which is used
>> as a blacklist for choosing space to put new writes in. After a
>> transaction is completed on disk, that list of pinned extents is
>> emptied and all that space is available for immediate reuse. This way we
>> make sure that if the transaction that is ongoing is aborted, the
>> previous one (the latest one that is completely on disk) is always still
>> there. If the computer crashes and the in-memory list is lost, no big
>> deal, we just continue from the latest completed transaction again after
>> a reboot. (ignoring extra log things for simplicity)
>>
>> So, the only situation in which you can fully use an X-1 superblock is
>> when none of that previously pinned space has actually been overwritten
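The 'pinned extents' blacklist described in the quoted explanation can be sketched as follows. A hypothetical toy model, not kernel code, with made-up names: cowing a block pins its old location so the allocator skips it for the rest of the transaction, and committing empties the pin list.

```python
# Toy model of the in-memory 'pinned extents' blacklist: the old
# location of a cowed metadata block must not be handed out again
# within the same transaction, so the previous generation stays intact.

free_space = {101, 102, 103}   # location 100 holds a live metadata block
pinned = set()                 # in-memory only; lost on crash, harmlessly

def cow_block(old_location):
    """Rewrite a metadata block elsewhere; pin its old home."""
    pinned.add(old_location)
    new_location = min(free_space - pinned)  # never choose pinned space
    free_space.discard(new_location)
    return new_location

new = cow_block(100)
assert new == 101                 # old block at 100 survives this transaction

# Transaction completed on disk: empty the list, space is reusable.
free_space.update(pinned)
pinned.clear()
assert 100 in free_space
```

If the machine crashes before the commit, the `pinned` set evaporates with the rest of memory, and on the next mount allocation simply restarts from the last completed transaction.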
>>
>> And if any of the space was overwritten already, you can go play around
>> with using an older superblock, and your filesystem mounts and
>> everything might look fine, until you hit that distant corner and BOOM!
>
> Got it, this takes care of my questions above, but I'll leave them in
> just for completeness' sake. Thanks for the good explanation.

>> ---- >8 ---- Extra!! Moar!! ---- >8 ----
>>
>> But, doing so does not give you snapshot functionality yet! It's more
>> like a poor man's snapshot that can only prevent messing up the
>> current version.
>>
>> Snapshot functionality is implemented only for filesystem trees
>> (subvolumes) by adding reference counting (which does end up on disk)
>> to the metadata blocks, and then COW trees as a whole.

Correction here:
* unfortunate wording: "COW trees as a whole"
* because: we're not copying an entire tree, but only cowing individual changed metadata blocks
* better: metadata blocks can be shared between trees with a different tree ID (subvolume ID).

>> If you make a snapshot of a filesystem tree, the snapshot gets a whole
>> new tree ID! It's not a previous version of the same subvolume you're
>> looking at, it's a clone!
>>
>> This is a big difference. The extent tree is always tree 2. The chunk
>> tree is always tree 3. But your subvolume snapshot gets a new tree
>> number.
>>
>> Technically, it would maybe be possible to add reference counting and
>> snapshots to all of the metadata trees, but it would probably mean that
>> the whole filesystem would get stuck rewriting itself all day instead
>> of doing any useful work. The current extent tree already has such an
>> amount of rumination problems that the added work of keeping track of
>> reference counts would make it completely unusable.
>>
>> In the wiki, it's here:
>> https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging
>>
>> Actually, I just paraphrased the first two of those six paragraphs...
>> The subvolume trees actually having a previous version of themselves
>> again (whaaaa!) is another thing... ;]
>
> hehe, again thanks for giving a good explanation. Clears things up a bit
> indeed!

Fun stuff.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2018-01-24 21:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-23  7:03 Superblock update: Is there really any benefits of updating synchronously? waxhead
2018-01-23  9:03 ` Nikolay Borisov
2018-01-23 14:20   ` Hans van Kranenburg
2018-01-23 14:48     ` Nikolay Borisov
2018-01-23 19:51       ` waxhead
2018-01-24  0:04         ` Hans van Kranenburg
2018-01-24 18:54           ` waxhead
2018-01-24 21:00             ` Hans van Kranenburg