* Superblock update: Is there really any benefits of updating synchronously? @ 2018-01-23  7:03 waxhead
  2018-01-23  9:03 ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: waxhead @ 2018-01-23 7:03 UTC (permalink / raw)
To: Btrfs BTRFS

Note: This has been mentioned before, but since I see some issues
related to superblocks I think it would be good to bring up the question
again.

According to the information found in the wiki:
https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock

The superblocks are updated synchronously on HDDs and one after the
other on SSDs. Superblocks are also (to my knowledge) not protected by
copy-on-write and are read-modify-write.

On a storage device larger than 256GB there will be three superblocks.

BTRFS will always prefer the superblock with the highest generation
number, provided that the checksum is good.

On the list there seem to have been a few incidents where the
superblocks have gone toast, and I am pondering what (if any) benefit
there is in updating the superblocks synchronously.

The superblock is checkpointed every 30 seconds by default, and if
someone pulls the plug (power outage) on an HDD then a synchronous
write, depending on (the quality of) your hardware, may perhaps ruin all
the superblock copies in one go. E.g. copies A, B and C will all be
updated at 30s.

On SSDs, since one superblock is updated after the other, using the
default 30 second checkpoint would mean copy A=30s, copy B=1m, copy
C=1m30s.

Why is the SSD method not used on hard drives also?! If two superblocks
are toast you would at maximum lose 1m30s by default, and if this is
considered a problem then you can always adjust the commit time
downwards. If it is set to 15 seconds you would still only lose 30
seconds of "action time" and would, in my opinion, be far better off
from a reliability point of view than having to update multiple
superblocks at the same time.
I can't see why on earth updating all superblocks at the same time would
have any benefits. So this all boils down to the questions three (ere
the other side will see..... :P )

1. What are the benefits of updating all superblocks at the same time?
(Just imagine if your memory is bad - you could risk updating all
superblocks simultaneously with kebab'ed data.)

2. What would the negative consequences be of using the SSD scheme also
for hard disks? Especially if the commit time is set to 15s instead of
30s.

3. In a RAID1 / 10 / 5 / 6 like setup, would a set of corrupt
superblocks on a single drive be recoverable from the other disks, or do
the superblocks need to be intact on the (possibly) damaged drive? (If
the superblocks are needed, then why would SSD mode not be better,
especially if the drive is partly working?)

^ permalink raw reply	[flat|nested] 8+ messages in thread
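[Editor's note: the three superblock mirror locations referred to above are fixed in the btrfs on-disk format at 64KiB, 64MiB and 256GiB from the start of each device, which is why only devices larger than 256GiB carry all three copies. A small illustrative sketch (not btrfs code) of which copies fit on a device of a given size:]

```python
# Fixed btrfs superblock mirror offsets (per the on-disk format docs):
# primary at 64KiB, first mirror at 64MiB, second mirror at 256GiB.
SUPERBLOCK_OFFSETS = [64 * 1024, 64 * 1024 ** 2, 256 * 1024 ** 3]
SUPERBLOCK_SIZE = 4096  # the superblock structure occupies 4KiB

def superblock_copies(device_size: int) -> list:
    """Return the byte offsets of the superblock copies that fit on a
    device of the given size; a >256GiB device therefore has three."""
    return [off for off in SUPERBLOCK_OFFSETS
            if off + SUPERBLOCK_SIZE <= device_size]

print(len(superblock_copies(500 * 1024 ** 3)))  # 500GiB device -> 3 copies
print(len(superblock_copies(100 * 1024 ** 3)))  # 100GiB device -> 2 copies
```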
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23  7:03 Superblock update: Is there really any benefits of updating synchronously? waxhead
@ 2018-01-23  9:03 ` Nikolay Borisov
  2018-01-23 14:20   ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2018-01-23 9:03 UTC (permalink / raw)
To: waxhead, Btrfs BTRFS

On 23.01.2018 09:03, waxhead wrote:
> Note: This has been mentioned before, but since I see some issues
> related to superblocks I think it would be good to bring up the
> question again.
>
> According to the information found in the wiki:
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock
>
> The superblocks are updated synchronously on HDDs and one after the
> other on SSDs.

There is currently no distinction in the code whether we are writing to
an SSD or an HDD. Also, what do you mean by synchronously? If you
inspect the code in write_all_supers you will see that for every device
we issue writes for every available copy of the superblock and then wait
for all of them to finish via 'wait_dev_supers'. In that regard sb
writeout is asynchronous.

> Superblocks are also (to my knowledge) not protected by copy-on-write
> and are read-modify-write.
>
> On a storage device larger than 256GB there will be three superblocks.
>
> BTRFS will always prefer the superblock with the highest generation
> number, provided that the checksum is good.

Wrong. On mount btrfs will only ever read the first superblock at 64k.
If that one is corrupted it will refuse to mount; it's then expected
that the user will initiate a recovery procedure with btrfs-progs, which
reads all supers and replaces them with the "newest" one (as decided by
the generation number).

> On the list there seem to have been a few incidents where the
> superblocks have gone toast, and I am pondering what (if any) benefit
> there is in updating the superblocks synchronously.
>
> The superblock is checkpointed every 30 seconds by default, and if
> someone pulls the plug (power outage) on an HDD then a synchronous
> write, depending on (the quality of) your hardware, may perhaps ruin
> all the superblock copies in one go. E.g. copies A, B and C will all
> be updated at 30s.
>
> On SSDs, since one superblock is updated after the other, using the
> default 30 second checkpoint would mean copy A=30s, copy B=1m, copy
> C=1m30s.

As explained previously, there is no notion of "SSD vs HDD" modes.

> Why is the SSD method not used on hard drives also?! If two
> superblocks are toast you would at maximum lose 1m30s by default, and
> if this is considered a problem then you can always adjust the commit
> time downwards. If it is set to 15 seconds you would still only lose
> 30 seconds of "action time" and would, in my opinion, be far better
> off from a reliability point of view than having to update multiple
> superblocks at the same time. I can't see why on earth updating all
> superblocks at the same time would have any benefits.
>
> So this all boils down to the questions three (ere the other side will
> see..... :P )
>
> 1. What are the benefits of updating all superblocks at the same time?
> (Just imagine if your memory is bad - you could risk updating all
> superblocks simultaneously with kebab'ed data.)
>
> 2. What would the negative consequences be of using the SSD scheme
> also for hard disks? Especially if the commit time is set to 15s
> instead of 30s.
>
> 3. In a RAID1 / 10 / 5 / 6 like setup, would a set of corrupt
> superblocks on a single drive be recoverable from the other disks, or
> do the superblocks need to be intact on the (possibly) damaged drive?

According to the code in super-recover.c from btrfs-progs you needn't
have the sb intact on the broken disk, since the tool first makes a list
of all devices constituting this filesystem, then makes a list of all
valid superblocks on those disks and finally chooses the one with the
highest generation number to replace the rest.

> (If the superblocks are needed, then why would SSD mode not be better,
> especially if the drive is partly working?)

^ permalink raw reply	[flat|nested] 8+ messages in thread
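[Editor's note: the selection logic described above (collect all readable superblock copies, keep those whose checksum verifies, pick the highest generation) can be sketched as follows. This is an illustrative Python model, not the actual super-recover.c code; the real tool checksums the superblock body with crc32c, for which plain crc32 stands in here.]

```python
import zlib
from dataclasses import dataclass

@dataclass
class Super:
    generation: int
    payload: bytes   # stand-in for the superblock body
    csum: int        # checksum stored alongside it

def make_super(generation: int, payload: bytes) -> Super:
    return Super(generation, payload, zlib.crc32(payload))

def pick_best(candidates: list) -> Super:
    """Mimic the super-recover approach: drop copies whose checksum does
    not verify, then choose the valid copy with the highest generation."""
    valid = [s for s in candidates if zlib.crc32(s.payload) == s.csum]
    if not valid:
        raise RuntimeError("no valid superblock found on any device")
    return max(valid, key=lambda s: s.generation)

a = make_super(100, b"gen-100")
b = make_super(101, b"gen-101")
b.payload = b"bitrot!"               # corrupt the newest copy
print(pick_best([a, b]).generation)  # falls back to generation 100
```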
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23  9:03 ` Nikolay Borisov
@ 2018-01-23 14:20   ` Hans van Kranenburg
  2018-01-23 14:48     ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-01-23 14:20 UTC (permalink / raw)
To: Nikolay Borisov, waxhead, Btrfs BTRFS

On 01/23/2018 10:03 AM, Nikolay Borisov wrote:
> On 23.01.2018 09:03, waxhead wrote:
>> Note: This has been mentioned before, but since I see some issues
>> related to superblocks I think it would be good to bring up the
>> question again.
>>
>> [...]
>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format#Superblock
>>
>> The superblocks are updated synchronously on HDDs and one after the
>> other on SSDs.
>
> There is currently no distinction in the code whether we are writing
> to an SSD or an HDD.

So what does that line in the wiki mean, and why is it there? "btrfs
normally updates all superblocks, but in SSD mode it will update only
one at a time."

> Also, what do you mean by synchronously? If you inspect the code in
> write_all_supers you will see that for every device we issue writes
> for every available copy of the superblock and then wait for all of
> them to finish via 'wait_dev_supers'. In that regard sb writeout is
> asynchronous.
>
>> Superblocks are also (to my knowledge) not protected by copy-on-write
>> and are read-modify-write.
>>
>> On a storage device larger than 256GB there will be three
>> superblocks.
>>
>> BTRFS will always prefer the superblock with the highest generation
>> number, provided that the checksum is good.
>
> Wrong. On mount btrfs will only ever read the first superblock at 64k.
> If that one is corrupted it will refuse to mount; it's then expected
> that the user will initiate a recovery procedure with btrfs-progs,
> which reads all supers and replaces them with the "newest" one (as
> decided by the generation number).

So again, the line "The superblock with the highest generation is used
when reading." in the wiki needs to go away then?

>> On the list there seem to have been a few incidents where the
>> superblocks have gone toast, and I am pondering what (if any) benefit
>> there is in updating the superblocks synchronously.
>>
>> [...]
>
> As explained previously, there is no notion of "SSD vs HDD" modes.

We also had a discussion about the "backup roots" that are stored beside
the superblock, and that they are "better than nothing" to help maybe
recover something from a broken fs, but never ever guarantee you will
get a working filesystem back.

The same holds for superblocks from a previous generation. As soon as
the transaction for generation X successfully hits the disk, all space
that was occupied in generation X-1 but no longer in X is available to
be overwritten immediately.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23 14:20   ` Hans van Kranenburg
@ 2018-01-23 14:48     ` Nikolay Borisov
  2018-01-23 19:51       ` waxhead
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2018-01-23 14:48 UTC (permalink / raw)
To: Hans van Kranenburg, waxhead, Btrfs BTRFS

On 23.01.2018 16:20, Hans van Kranenburg wrote:
> On 01/23/2018 10:03 AM, Nikolay Borisov wrote:
>> On 23.01.2018 09:03, waxhead wrote:
>>> [...]
>>> The superblocks are updated synchronously on HDDs and one after the
>>> other on SSDs.
>>
>> There is currently no distinction in the code whether we are writing
>> to an SSD or an HDD.
>
> So what does that line in the wiki mean, and why is it there? "btrfs
> normally updates all superblocks, but in SSD mode it will update only
> one at a time."

It means the wiki is outdated.

>> Also, what do you mean by synchronously? If you inspect the code in
>> write_all_supers you will see that for every device we issue writes
>> for every available copy of the superblock and then wait for all of
>> them to finish via 'wait_dev_supers'. In that regard sb writeout is
>> asynchronous.
>>
>>> [...]
>>>
>>> BTRFS will always prefer the superblock with the highest generation
>>> number, provided that the checksum is good.
>>
>> Wrong. On mount btrfs will only ever read the first superblock at
>> 64k. If that one is corrupted it will refuse to mount; it's then
>> expected that the user will initiate a recovery procedure with
>> btrfs-progs, which reads all supers and replaces them with the
>> "newest" one (as decided by the generation number).
>
> So again, the line "The superblock with the highest generation is used
> when reading." in the wiki needs to go away then?

Yep; for background information you can read the discussion here:
https://www.spinics.net/lists/linux-btrfs/msg71878.html

> We also had a discussion about the "backup roots" that are stored
> beside the superblock, and that they are "better than nothing" to help
> maybe recover something from a broken fs, but never ever guarantee you
> will get a working filesystem back.
>
> The same holds for superblocks from a previous generation. As soon as
> the transaction for generation X successfully hits the disk, all space
> that was occupied in generation X-1 but no longer in X is available to
> be overwritten immediately.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23 14:48     ` Nikolay Borisov
@ 2018-01-23 19:51       ` waxhead
  2018-01-24  0:04         ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: waxhead @ 2018-01-23 19:51 UTC (permalink / raw)
To: Nikolay Borisov, Hans van Kranenburg, Btrfs BTRFS

Nikolay Borisov wrote:
> On 23.01.2018 16:20, Hans van Kranenburg wrote:
>> On 01/23/2018 10:03 AM, Nikolay Borisov wrote:
>>> On 23.01.2018 09:03, waxhead wrote:
>>>> [...]
>>>> The superblocks are updated synchronously on HDDs and one after the
>>>> other on SSDs.
>>>
>>> There is currently no distinction in the code whether we are writing
>>> to an SSD or an HDD.
>>
>> So what does that line in the wiki mean, and why is it there? "btrfs
>> normally updates all superblocks, but in SSD mode it will update only
>> one at a time."
>
> It means the wiki is outdated.

Ok, and now the wiki is updated. Great :)

>>> Also, what do you mean by synchronously? If you inspect the code in
>>> write_all_supers you will see that for every device we issue writes
>>> for every available copy of the superblock and then wait for all of
>>> them to finish via 'wait_dev_supers'. In that regard sb writeout is
>>> asynchronous.

I meant basically what you have explained. You write the same memory to
all superblocks "step by step" but in one operation.

>>>> BTRFS will always prefer the superblock with the highest generation
>>>> number, provided that the checksum is good.
>>>
>>> Wrong. On mount btrfs will only ever read the first superblock at
>>> 64k. [...]
>>
>> So again, the line "The superblock with the highest generation is
>> used when reading." in the wiki needs to go away then?
>
> Yep; for background information you can read the discussion here:
> https://www.spinics.net/lists/linux-btrfs/msg71878.html

And the wiki is also updated... Great!

>>> As explained previously, there is no notion of "SSD vs HDD" modes.

Ok, thanks for clearing things up. But the main thing here is that all
superblocks are updated at the same time, both on SSDs and on HDDs. I
think the question is still valid: what is there to gain by updating all
of them every 30s instead of updating them one by one?! Would that not
be safer, perhaps an itty-bitty bit quicker, and perhaps better in terms
of recovery?!

>> We also had a discussion about the "backup roots" that are stored
>> beside the superblock, and that they are "better than nothing" to
>> help maybe recover something from a broken fs, but never ever
>> guarantee you will get a working filesystem back.
>>
>> The same holds for superblocks from a previous generation. As soon as
>> the transaction for generation X successfully hits the disk, all
>> space that was occupied in generation X-1 but no longer in X is
>> available to be overwritten immediately.

Ok, so this means that superblocks with an older generation are utterly
useless and will lead to corruption (effectively making my argument
above useless, as that would in fact assist corruption then).

Does this means that if disk space was allocated in X-1 and is freed in
X it will unallocated if you roll back to X-1 e.g. writing to
unallocated storage.

I was under the impression that a superblock was like a "snapshot" of
the entire filesystem and that rollbacks via pre-gen superblocks were
possible. Am I mistaking?

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-23 19:51       ` waxhead
@ 2018-01-24  0:04         ` Hans van Kranenburg
  2018-01-24 18:54           ` waxhead
  0 siblings, 1 reply; 8+ messages in thread
From: Hans van Kranenburg @ 2018-01-24 0:04 UTC (permalink / raw)
To: waxhead, Nikolay Borisov, Btrfs BTRFS

On 01/23/2018 08:51 PM, waxhead wrote:
> Nikolay Borisov wrote:
>> On 23.01.2018 16:20, Hans van Kranenburg wrote:

[...]

>>> We also had a discussion about the "backup roots" that are stored
>>> beside the superblock, and that they are "better than nothing" to
>>> help maybe recover something from a broken fs, but never ever
>>> guarantee you will get a working filesystem back.
>>>
>>> The same holds for superblocks from a previous generation. As soon
>>> as the transaction for generation X successfully hits the disk, all
>>> space that was occupied in generation X-1 but no longer in X is
>>> available to be overwritten immediately.
>
> Ok, so this means that superblocks with an older generation are
> utterly useless and will lead to corruption (effectively making my
> argument above useless, as that would in fact assist corruption then).

Mostly, yes.

> Does this means that if disk space was allocated in X-1 and is freed
> in X it will unallocated if you roll back to X-1 e.g. writing to
> unallocated storage.

Can you reword that? I can't follow that sentence.

> I was under the impression that a superblock was like a "snapshot" of
> the entire filesystem and that rollbacks via pre-gen superblocks were
> possible. Am I mistaking?

Yes. The first fundamental thing in Btrfs is COW, which makes sure that
everything referenced from transaction X, from the superblock all the
way down to metadata trees and actual data space, is never overwritten
by changes done in transaction X+1.

For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the
way this is done is actually quite simple. If a block is cowed, the old
location is added to a 'pinned extents' list (in memory), which is used
as a blacklist for choosing space to put new writes in. After a
transaction is completed on disk, that list with pinned extents is
emptied and all that space is available for immediate reuse. This way we
make sure that if the transaction that is ongoing is aborted, the
previous one (the latest one that is completely on disk) is always still
there. If the computer crashes and the in-memory list is lost, no big
deal, we just continue from the latest completed transaction again after
a reboot. (ignoring extra log things for simplicity)

So, the only situation in which you can fully use an X-1 superblock is
when none of that previously pinned space has actually been overwritten
yet afterwards. And if any of the space was overwritten already, you can
go play around with using an older superblock and your filesystem mounts
and everything might look fine, until you hit that distant corner and
BOOM!

---- >8 ---- Extra!! Moar!! ---- >8 ----

But, doing so does not give you snapshot functionality yet! It's more
like a poor man's snapshot that can only prevent messing up the current
version.

Snapshot functionality is implemented only for filesystem trees
(subvolumes), by adding reference counting (which does end up on disk)
to the metadata blocks, and then COWing trees as a whole.

If you make a snapshot of a filesystem tree, the snapshot gets a whole
new tree ID! It's not a previous version of the same subvolume you're
looking at, it's a clone! This is a big difference. The extent tree is
always tree 2. The chunk tree is always tree 3. But your subvolume
snapshot gets a new tree number.

Technically, it would maybe be possible to add reference counting and
snapshots to all of the metadata trees, but it would probably mean that
the whole filesystem would get stuck rewriting itself all day instead of
doing any useful work. The current extent tree already has such an
amount of rumination problems that the added work of keeping track of
reference counts would make it completely unusable.

In the wiki, it's here:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging

Actually, I just paraphrased the first two of those six paragraphs...
The subvolume trees actually having a previous version of themselves
again (whaaaa!) is another thing... ;]

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 8+ messages in thread
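[Editor's note: the 'pinned extents' mechanism described above can be modelled with a toy allocator: space freed by COW during the current transaction is blacklisted for new writes, and only becomes reusable once the transaction commits. A minimal in-memory sketch, illustrative only, not btrfs code:]

```python
class ToyAllocator:
    """Toy model of COW allocation with a pinned-extents blacklist:
    a block cowed away during the current transaction may not be
    reused until the transaction is fully committed to disk."""

    def __init__(self, size: int):
        self.free = set(range(size))   # allocatable block numbers
        self.pinned = set()            # freed this transaction, blacklisted

    def cow(self, old_block: int) -> int:
        # Write the new copy somewhere that is neither in use nor pinned.
        new_block = min(self.free)
        self.free.remove(new_block)
        self.pinned.add(old_block)     # old copy must survive until commit
        return new_block

    def commit(self):
        # Transaction is safe on disk: pinned space becomes reusable.
        self.free |= self.pinned
        self.pinned.clear()

alloc = ToyAllocator(4)
alloc.free -= {0}          # block 0 holds the current tree root
new = alloc.cow(0)         # root is cowed to a new location
print(0 in alloc.free)     # False: old root still protected
alloc.commit()
print(0 in alloc.free)     # True: old location reusable next transaction
```

If the machine crashes before commit(), the in-memory pinned set is simply lost and the previous transaction's blocks are all still intact on disk, which is exactly why the latest completed transaction always survives.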
* Re: Superblock update: Is there really any benefits of updating synchronously?
  2018-01-24  0:04         ` Hans van Kranenburg
@ 2018-01-24 18:54           ` waxhead
  2018-01-24 21:00             ` Hans van Kranenburg
  0 siblings, 1 reply; 8+ messages in thread
From: waxhead @ 2018-01-24 18:54 UTC (permalink / raw)
To: Hans van Kranenburg, Nikolay Borisov, Btrfs BTRFS

Hans van Kranenburg wrote:
> On 01/23/2018 08:51 PM, waxhead wrote:
>> Nikolay Borisov wrote:
>>> On 23.01.2018 16:20, Hans van Kranenburg wrote:
>
> [...]
>
>>>>> The same holds for superblocks from a previous generation. As soon
>>>>> as the transaction for generation X successfully hits the disk,
>>>>> all space that was occupied in generation X-1 but no longer in X
>>>>> is available to be overwritten immediately.
>>
>> Ok, so this means that superblocks with an older generation are
>> utterly useless and will lead to corruption (effectively making my
>> argument above useless, as that would in fact assist corruption
>> then).
>
> Mostly, yes.
>
>> Does this means that if disk space was allocated in X-1 and is freed
>> in X it will unallocated if you roll back to X-1 e.g. writing to
>> unallocated storage.
>
> Can you reword that? I can't follow that sentence.

Sure, why not. I'll give it a go:

Does this mean that if...

* Superblock generation N-1 has range 1234-2345 allocated and used,

and...

* Superblock generation N-0 (the current) has range 1234-2345 free
because someone deleted a file or something,

then...

there is no point in rolling back to generation N-1, because that refers
to what is now essentially free "memory" which may or may not have been
written over by generation N-0? And therefore N-1, which still thinks
range 1234-2345 is allocated, may point to the wrong data.

I hope that was easier to follow - if not, don't hold back on the
expletives! :)

>> I was under the impression that a superblock was like a "snapshot" of
>> the entire filesystem and that rollbacks via pre-gen superblocks were
>> possible. Am I mistaking?
>
> Yes. The first fundamental thing in Btrfs is COW, which makes sure
> that everything referenced from transaction X, from the superblock all
> the way down to metadata trees and actual data space, is never
> overwritten by changes done in transaction X+1.

Perhaps a tad off topic, but assuming the (hopefully) better explanation
above clears things up a bit: what happens if a block is freed?! in X+1
--- which must mean that it can be overwritten in transaction X+1 (which
I assume means a new superblock generation). After all, without freeing
and overwriting data there is no way to re-use space.

> For metadata trees that are NOT filesystem trees a.k.a. subvolumes,
> the way this is done is actually quite simple. If a block is cowed,
> the old location is added to a 'pinned extents' list (in memory),
> which is used as a blacklist for choosing space to put new writes in.
> [...]
>
> So, the only situation in which you can fully use an X-1 superblock is
> when none of that previously pinned space has actually been
> overwritten yet afterwards.
>
> And if any of the space was overwritten already, you can go play
> around with using an older superblock and your filesystem mounts and
> everything might look fine, until you hit that distant corner and
> BOOM!

Got it. This takes care of my questions above, but I'll leave them in
just for completeness' sake. Thanks for the good explanation.

> ---- >8 ---- Extra!! Moar!! ---- >8 ----
>
> [...]
>
> Actually, I just paraphrased the first two of those six paragraphs...
> The subvolume trees actually having a previous version of themselves
> again (whaaaa!) is another thing... ;]

hehe, again thanks for giving a good explanation. Clears things up a bit
indeed!

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Superblock update: Is there really any benefits of updating synchronously? 2018-01-24 18:54 ` waxhead @ 2018-01-24 21:00 ` Hans van Kranenburg 0 siblings, 0 replies; 8+ messages in thread From: Hans van Kranenburg @ 2018-01-24 21:00 UTC (permalink / raw) To: waxhead@dirtcellar.net, Nikolay Borisov, Btrfs BTRFS On 01/24/2018 07:54 PM, waxhead wrote: > Hans van Kranenburg wrote: >> On 01/23/2018 08:51 PM, waxhead wrote: >>> Nikolay Borisov wrote: >>>> On 23.01.2018 16:20, Hans van Kranenburg wrote: >> >> [...] >> >>>>> >>>>> We also had a discussion about the "backup roots" that are stored >>>>> besides the superblock, and that they are "better than nothing" to help >>>>> maybe recover something from a borken fs, but never ever guarantee you >>>>> will get a working filesystem back. >>>>> >>>>> The same holds for superblocks from a previous generation. As soon as >>>>> the transaction for generation X succesfully hits the disk, all space >>>>> that was occupied in generation X-1 but no longer in X is available to >>>>> be overwritten immediately. >>>>> >>> Ok so this means that superblocks with a older generation is utterly >>> useless and will lead to corruption (effectively making my argument >>> above useless as that would in fact assist corruption then). >> >> Mostly, yes. >> >>> Does this means that if disk space was allocated in X-1 and is freed in >>> X it will unallocated if you roll back to X-1 e.g. writing to >>> unallocated storage. >> >> Can you reword that? I can't follow that sentence. > Sure why not. I'll give it a go: > > Does this mean that if... > * Superblock generation N-1 have range 1234-2345 allocated and used. > > and.... > > * Superblock generation N-0 (the current) have range 1234-2345 free > because someone deleted a file or something Ok, so I assume that with current you mean the one on disk now. > Then.... 
> > It is no point in rolling back to generation N-1 because that refers to > what is no essentially free "memory" which may or may have not been > written over by generation N-0. If space that was used in N-1 turned into free space during N-0, then N-0 will never have reused that space already, since if writing out N-0 had crashed halfway, so the superblock as seen when mounting is still N-1, then you need to be able to fully use N-1. It can be used immediately by N+1 however after the N-0 superblock is safe on disk. > And therefore N-1 which still thinks > range 1234-2345 is allocated may point to the wrong data. So, at least for disk space used by metadata blocks: 1234-2345 - N-1 - in use 1234-2345 - N-0 - not in use, but can't be overwritten yet 1234-2345 - N+1 - can start writing whatever it wants in that disk location any time > I hope that was easier to follow - if not don't hold back on the > explicitives! :) > >> >>> I was under the impression that a superblock was like a "snapshot" of >>> the entire filesystem and that rollbacks via pre-gen superblocks was >>> possible. Am I mistaking? >> >> Yes. The first fundamental thing in Btrfs is COW which makes sure that >> everything referenced from transaction X, from the superblock all the >> way down to metadata trees and actual data space is never overwritten by >> changes done in transaction X+1. >> > Perhaps a tad off topic, but assuming the (hopefully) better explanation > above clear things up a bit. What happens if a block is freed?! in X+1 > --- which must mean that it can be overwritten in transaction X+1 (which > I assume means a new superblock generation). After all without freeing > and overwriting data there is no way to re-use space. Freed in X you mean? Or not? But you write "freed?! in X+1". For actual data disk space, it's the same pattern as above (so space freed up during a transaction can only be reused in the next one), but implemented a bit differently. 
For metadata trees which do not have reference counting (e.g. the extent tree), there's the pinned extent (metadata block disk locations) list I mentioned already.

For data, we have the filesystem (subvolume) trees which reference all files and the data extents they use data from, and via the links to the extent tree they keep all locations where actual data is on disk marked as occupied.

Now comes the different part. Because the filesystem trees already implement the extra reference counting functionality, this is used to prevent freed up data space from being overwritten in the same transaction.

How does this work? Well, that's the rest of the wiki section I linked below. :-D So you're asking exactly the right next question here, I guess.

When making changes to a subvolume tree (normal file create, write content, rename, delete, etc.), btrfs is secretly just cloning the tree into a new subvolume with the same subvolume ID. Wait, what? Whoa!

So if you're changing subvolume 1234, there's an item (1234 ROOT_ITEM N-0) on disk in tree 1, and in memory it starts working on (1234 ROOT_ITEM N+1). As an end user, you never see this happening when you look at btrfs sub list etc, it's hidden from you.

"When the transaction commits, a new root pointer is inserted in the root tree for each new subvolume root."

[...]

"At this time the root tree has two pointers for each subvolume changed during the transaction. One item points to the new tree and one points to the tree that existed at the start of the last transaction."

After the new transaction commits OK, the cleaner removes the old subvolume from the previous transaction, which is technically the same code that is used for a regular subvol delete initiated by a user.

So only when the old version of the same tree is removed will the extent tree mappings for data disk space that was freed in the previous transaction be adjusted, and it ends up as free data space that can be overwritten.
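The two-pointers-then-cleanup dance can be sketched like this. Again a hypothetical toy model, not kernel code: the root tree is just a dict keyed on (subvolume ID, generation), and the "cleaner" drops every entry except the newest one for that subvolume ID.

```python
# Toy model of the root tree during a transaction: while subvolume 1234
# is being changed, the root tree briefly holds two items for it, one
# per generation. The cleaner then removes the older one, which is what
# finally releases the data space freed in the previous transaction.

root_tree = {(1234, 9): "old tree blocks"}   # (subvol_id, generation)

def commit_transaction(tree, subvol_id, new_gen):
    # A new root pointer is inserted for the changed subvolume.
    tree[(subvol_id, new_gen)] = "new tree blocks"

def cleaner(tree, subvol_id):
    # Drop all but the highest generation for this subvolume ID;
    # conceptually the same path as a user-initiated subvol delete.
    gens = [g for (sid, g) in tree if sid == subvol_id]
    newest = max(gens)
    for g in gens:
        if g != newest:
            del tree[(subvol_id, g)]

commit_transaction(root_tree, 1234, 10)
assert sorted(root_tree) == [(1234, 9), (1234, 10)]  # two pointers, briefly
cleaner(root_tree, 1234)
assert list(root_tree) == [(1234, 10)]               # only the clone remains
```

Note that both entries share the same subvolume ID 1234; that's why tools normally hide the older one from you.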
(Well, if an extent in its entirety is not referenced by any file in any subvol, that is. Partially unreferenced extents keep hanging around as unreachable data, but that's again another story.)

When doing things like subvolume list between a transaction commit and the cleanup being finished, the btrfs sub list code will only show the one with the highest generation (transaction) number if it encounters multiple ones, and filters out the others so as not to confuse you. If you script some tree searches, e.g. with a few lines of python-btrfs, then you could spot them.

So next time you remove a really big file and don't see a difference in df output... You know you will only see it after the current transaction is finished and the cleanup at the beginning of the new one is done.

And the whole "Copy on Write Logging" section in the wiki should make sense now. \o/

>> For metadata trees that are NOT filesystem trees a.k.a. subvolumes, the
>> way this is done is actually quite simple. If a block is cowed, the old
>> location is added to a 'pinned extents' list (in memory), which is used
>> as a blacklist for choosing space to put new writes in. After a
>> transaction is completed on disk, that list of pinned extents is
>> emptied and all that space is available for immediate reuse. This way we
>> make sure that if the transaction that is ongoing is aborted, the
>> previous one (the latest one that is completely on disk) is always still
>> there. If the computer crashes and the in-memory list is lost, no big
>> deal, we just continue from the latest completed transaction again after
>> a reboot. (ignoring extra log things for simplicity)
>>
>> So, the only situation in which you can fully use an X-1 superblock is
>> when none of that previously pinned space has actually been overwritten
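The 'pinned extents' blacklist described in the quoted explanation can be sketched as follows. A hypothetical toy model, not kernel code, with made-up names: cowing a block pins its old location so the allocator skips it for the rest of the transaction, and committing empties the pin list.

```python
# Toy model of the in-memory 'pinned extents' blacklist: the old
# location of a cowed metadata block must not be handed out again
# within the same transaction, so the previous generation stays intact.

free_space = {101, 102, 103}   # location 100 holds a live metadata block
pinned = set()                 # in-memory only; lost on crash, harmlessly

def cow_block(old_location):
    """Rewrite a metadata block elsewhere; pin its old home."""
    pinned.add(old_location)
    new_location = min(free_space - pinned)  # never choose pinned space
    free_space.discard(new_location)
    return new_location

new = cow_block(100)
assert new == 101                 # old block at 100 survives this transaction

# Transaction completed on disk: empty the list, space is reusable.
free_space.update(pinned)
pinned.clear()
assert 100 in free_space
```

If the machine crashes before the commit, the `pinned` set evaporates with the rest of memory, and on the next mount allocation simply restarts from the last completed transaction.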
>>
>> And if any of the space was overwritten already, you can go play around
>> with using an older superblock, and your filesystem mounts and
>> everything might look fine, until you hit that distant corner and BOOM!
>
> Got it, this takes care of my questions above, but I'll leave them in
> just for completeness' sake. Thanks for the good explanation.

>> ---- >8 ---- Extra!! Moar!! ---- >8 ----
>>
>> But, doing so does not give you snapshot functionality yet! It's more
>> like a poor man's snapshot that can only prevent messing up the
>> current version.
>>
>> Snapshot functionality is implemented only for filesystem trees
>> (subvolumes) by adding reference counting (which does end up on disk)
>> to the metadata blocks, and then COW trees as a whole.

Correction here:
* unfortunate wording: "COW trees as a whole"
* because: we're not copying an entire tree, but only cowing individual changed metadata blocks
* better: metadata blocks can be shared between trees with a different tree ID (subvolume ID).

>> If you make a snapshot of a filesystem tree, the snapshot gets a whole
>> new tree ID! It's not a previous version of the same subvolume you're
>> looking at, it's a clone!
>>
>> This is a big difference. The extent tree is always tree 2. The chunk
>> tree is always tree 3. But your subvolume snapshot gets a new tree
>> number.
>>
>> Technically, it would maybe be possible to add reference counting and
>> snapshots to all of the metadata trees, but it would probably mean that
>> the whole filesystem would get stuck rewriting itself all day instead
>> of doing any useful work. The current extent tree already has such an
>> amount of rumination problems that the added work of keeping track of
>> reference counts would make it completely unusable.
>>
>> In the wiki, it's here:
>> https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Copy_on_Write_Logging
>>
>> Actually, I just paraphrased the first two of those six paragraphs...
>> The subvolume trees actually having a previous version of themselves
>> again (whaaaa!) is another thing... ;]
>
> hehe, again thanks for giving a good explanation. Clears things up a bit
> indeed!

Fun stuff.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2018-01-24 21:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-23  7:03 Superblock update: Is there really any benefits of updating synchronously? waxhead
2018-01-23  9:03 ` Nikolay Borisov
2018-01-23 14:20   ` Hans van Kranenburg
2018-01-23 14:48     ` Nikolay Borisov
2018-01-23 19:51       ` waxhead
2018-01-24  0:04         ` Hans van Kranenburg
2018-01-24 18:54           ` waxhead
2018-01-24 21:00             ` Hans van Kranenburg