* BTRFS messes up snapshot LV with origin @ 2014-11-16 21:35 MegaBrutal 2014-11-17 1:42 ` Duncan 0 siblings, 1 reply; 64+ messages in thread From: MegaBrutal @ 2014-11-16 21:35 UTC (permalink / raw) To: linux-btrfs Hello guys, I think you'll like this... https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429 MegaBrutal ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-16 21:35 BTRFS messes up snapshot LV with origin MegaBrutal @ 2014-11-17 1:42 ` Duncan 2014-11-17 6:59 ` Brendan Hide 0 siblings, 1 reply; 64+ messages in thread From: Duncan @ 2014-11-17 1:42 UTC (permalink / raw) To: linux-btrfs MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted: > Hello guys, > > I think you'll like this... > https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429 UUID is an initialism for "Universally Unique IDentifier".[1] If the UUID isn't unique, by definition, then, it can't be a UUID, and that's a bug in whatever is making the non-unique would-be UUID that isn't unique and thus cannot be a universally unique ID. In this case that would appear to be LVM. Meanwhile, if two or more devices are btrfs and have the same UUID, btrfs considers them part of the same filesystem, since btrfs /can/ be a multi- device filesystem. That's not a bug; that's the way btrfs IDs multiple devices as part of the same filesystem, because a UUID, by definition, can be relied upon to be unique, or it's no longer a UUID. Additionally, the UUID is actually written into the metadata of the filesystem in such a way that it's /not/ a simple task to change the UUID. Put simply, it's "ingrained" into the filesystem so deeply it cannot be changed, at least not without rewriting pretty much all the metadata. (FWIW, a btrfs balance does just that, rewrite the data, metadata, or both. However, I don't believe a balance plugin to change the UUID is yet available. You're simply not supposed to change the UUID once the filesystem is created.) So if LVM snapshots duplicate a UUID, as I believe they do, then there's your bug, because they're breaking the definition of Universally *UNIQUE* ID. 
That being the case, using them with btrfs is pretty essentially broken, because btrfs depends on UUIDs to be what they say on the label, actually "unique", and UUIDs are deeply enough ingrained into the very fabric of btrfs that it's simply not possible to change that on the btrfs side. Meanwhile, since btrfs *DOES* depend on UUIDs being unique, if there's multiple btrfs that accidentally have the same UUID, btrfs will not distinguish between them and will very possibly be writing into both of them. If I found myself in that situation, I'd very carefully copy all the data I wanted to save off the filesystem and do a new mkfs as soon as possible, because I would not consider the filesystem as it was at all stable, and I'd count myself very lucky if I got everything off the filesystem without damage. In actuality, since the second device was a snapshot of the first, if you catch it reasonably quickly you likely won't have too many issues. However, a btrfs in that condition is in an undefined state, and the longer it exists in that state, the more likely things are to go wrong, possibly VERY VERY wrong. So if you don't already have backups for anything you consider valuable on that thing, get it off there as soon as you possibly can, and consider yourself very lucky if nothing's damaged as a result. --- [1] http://en.wiktionary.org/wiki/UUID -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-17 1:42 ` Duncan @ 2014-11-17 6:59 ` Brendan Hide 2014-11-17 7:35 ` Daniel Dressler ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: Brendan Hide @ 2014-11-17 6:59 UTC (permalink / raw) To: linux-btrfs; +Cc: bug-grub cc'd bug-grub@gnu.org for FYI On 2014/11/17 03:42, Duncan wrote: > MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted: > >> Hello guys, >> >> I think you'll like this... >> https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429 > UUID is an initialism for "Universally Unique IDentifier".[1] > > If the UUID isn't unique, by definition, then, it can't be a UUID, and > that's a bug in whatever is making the non-unique would-be UUID that > isn't unique and thus cannot be a universally unique ID. In this case > that would appear to be LVM. Perhaps the right question to ask is "Where should this bug be fixed?". TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is likely seen as being "out of scope". The "correct" fix probably lies in the ecosystem design, which requires co-operation from btrfs. Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making its snapshot, is doing its job "exactly as expected". Additionally, there are other ways to get to a similar state without LVM: ddrescue backup, SAN snapshot, old "missing" disk re-introduced, etc. That leaves two places where this can be fixed: grub and btrfs Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code. Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying "LVM snapshots are incompatible with btrfs" is the right way to go either. 
That leaves two aspects of this issue which I view as two separate bugs: a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all. b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID. I feel a) is a btrfs bug. I feel b) is a bug that is more about "ecosystem design" than grub being silly. I imagine a couple of aspects that could help fix a): - Utilise a "unique drive identifier" in the btrfs metadata (surely this exists already?). This way, any two filesystems will always have different drive identifiers *except* in cases like a ddrescue'd copy or a block-level snapshot. This will provide a sensible mechanism for "defined behaviour", preventing corruption - even if that "defined behaviour" is to simply give out lots of "PEBKAC" errors and panic. - Utilise a "drive list" to ensure that two unrelated filesystems with the same UUID cannot get "mixed up". Yes, the user/admin would likely be the culprit here (perhaps a VM rollout process that always gives out the same UUID in all its filesystems). Again, does btrfs not already have something like this built-in that we're simply not utilising fully? I'm not exactly sure of the "correct" way to fix b) except that I imagine it would be trivial to fix once a) is fixed. -- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-17 6:59 ` Brendan Hide @ 2014-11-17 7:35 ` Daniel Dressler 2014-11-17 9:00 ` Brendan Hide 2014-11-17 19:04 ` Goffredo Baroncelli 2014-11-18 6:21 ` Chris Murphy 2 siblings, 1 reply; 64+ messages in thread From: Daniel Dressler @ 2014-11-17 7:35 UTC (permalink / raw) To: Brendan Hide; +Cc: open list:BTRFS FILE SYSTEM, bug-grub If a UUID is not unique enough how will adding a second UUID or "unique drive identifier" help? A UUID only serves any purpose when it is unique. Thus duplicate UUIDs are themselves a failure state. The solution should be to make it harder to get into this failure state. Not to make all programs resilient against running under this failure state. It isn't a btrfs bug that it requires Universal Unique IDs to be universally unique. Daniel ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-17 7:35 ` Daniel Dressler @ 2014-11-17 9:00 ` Brendan Hide 0 siblings, 0 replies; 64+ messages in thread From: Brendan Hide @ 2014-11-17 9:00 UTC (permalink / raw) Cc: linux-btrfs@vger.kernel.org, bug-grub

On 2014/11/17 09:35, Daniel Dressler top-posted:
> If a UUID is not unique enough how will adding a second UUID or
> "unique drive identifier" help?

A UUID is *supposed* to be unique by design. Isolated, the design is adequate. But the bigger picture clearly shows the design is naive. And broken. A second per-disk id (note I said "unique" - but I never said universal as in "UUID") would allow for better-defined behaviour where, presently, we're simply saying "current behaviour is undefined and you're likely to get corruption".

On the other hand, I asked already if we have IDs of some sort (how else do we know which disk a chunk is stored on?), thus I don't think we need to add anything to the format.

A simple scenario similar to the one the OP introduced:

Disk sda -> says it is UUID Z with diskid 0
Disk sdb -> says it is UUID Z with diskid 0

If we're ignoring the fact that there are two disks with the same UUID and diskid and it causes corruption, then the kernel is doing something "stupid but fixable". We have some choices:
- give a clear warning and ignore one of the disks (could just pick the first one - or be a little smarter and pick one based on some heuristic - for example extent generation number)
- give a clear error and panic

Normal multi-disk scenario:

Disk sda -> UUID Z with diskid 1
Disk sdb -> UUID Z with diskid 2

These two disks are in the same filesystem and are supposed to work together - no issues.

My second suggestion covers another scenario as well:

Disk sda -> UUID Z with diskid 1; root block indicates that only diskid 1 is recorded as being part of the filesystem
Disk sdb -> UUID Z with diskid 3; root block indicates that only diskid 3 is recorded as being part of the filesystem

Again, based on the existing featureset, it seems reasonable that this information should already be recorded in the fs metadata. If the behaviour is "undefined" and causing corruption, again the kernel is currently doing something "stupid but fixable". Again, we have similar choices:
- give a clear warning and ignore bad disk(s)
- give a clear error and panic

--
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

^ permalink raw reply [flat|nested] 64+ messages in thread
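The first scenario in the message above (two disks claiming the same UUID and the same diskid) and its two proposed policies can be sketched roughly as follows. This is only an illustration, not btrfs code: `ScannedDevice`, `resolve_duplicates`, and the `policy` argument are hypothetical names, and the generation number stands in for the "some heuristic" mentioned in the message.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScannedDevice:
    """Hypothetical summary of what one scanned disk reports about itself."""
    path: str        # e.g. "/dev/sda"
    fsid: str        # filesystem UUID
    devid: int       # per-disk id within the filesystem
    generation: int  # higher means more recently written


def resolve_duplicates(devices, policy="warn"):
    """Return one device per (fsid, devid) pair plus a list of warnings.

    policy="warn":  keep the device with the highest generation, report the rest.
    policy="error": raise as soon as a pair is claimed by more than one device.
    """
    by_key = defaultdict(list)
    for dev in devices:
        by_key[(dev.fsid, dev.devid)].append(dev)

    chosen, warnings = [], []
    for (fsid, devid), group in by_key.items():
        if len(group) > 1:
            paths = ", ".join(d.path for d in group)
            msg = f"fsid {fsid} devid {devid} claimed by: {paths}"
            if policy == "error":
                raise ValueError(msg)
            warnings.append(msg)
        # max() returns the first maximal element, so ties keep scan order.
        chosen.append(max(group, key=lambda d: d.generation))
    return chosen, warnings


scan = [
    ScannedDevice("/dev/sda", "Z", 0, generation=120),
    ScannedDevice("/dev/sdb", "Z", 0, generation=97),  # older copy of sda
]
chosen, warnings = resolve_duplicates(scan)
print([d.path for d in chosen])
print(warnings)
```

With `policy="error"` the same scan raises instead of picking a winner, which corresponds to the "give a clear error and panic" option. The normal multi-disk case (diskid 1 and diskid 2) passes through with no warnings because the pairs differ.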
* Re: BTRFS messes up snapshot LV with origin 2014-11-17 6:59 ` Brendan Hide 2014-11-17 7:35 ` Daniel Dressler @ 2014-11-17 19:04 ` Goffredo Baroncelli [not found] ` <CAE8gLh=VubBbZdeKTAuWRjOxPF7C+ouUeeVvmGfT2ckYWGhQVA@mail.gmail.com> 2014-11-21 4:24 ` Zygo Blaxell 2014-11-18 6:21 ` Chris Murphy 2 siblings, 2 replies; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-17 19:04 UTC (permalink / raw) To: Brendan Hide, linux-btrfs; +Cc: bug-grub On 2014-11-17 07:59, Brendan Hide wrote: > > That leaves two aspects of this issue which I view as two separate bugs: > a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all. > b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID. > > I feel a) is a btrfs bug. > I feel b) is a bug that is more about "ecosystem design" than grub being silly. Regarding a) IIRC, btrfs collects filesystem information by UUID; if two filesystems have the same UUID (as in the LVM-snapshot case), the last filesystem discovered overwrites the first one. Filesystem discovery is done in user space, so it should be simple to skip a filesystem that sits on an LVM snapshot. Regarding b) I am a bit confused: if I understood correctly, the root filesystem was picked from an LVM snapshot, so grub-probe *correctly* reported that the root device is the snapshot. The problem was in filesystem discovery during boot: the *real* device was scanned first, then the LVM snapshot; the latter overwrote the former, so the system booted from the LVM snapshot. My conclusion is that we should improve the btrfs scan so that: - in the udev rules, a partition that is an LVM snapshot is by default not scanned by "btrfs dev scan" - "btrfs dev scan" skips LVM snapshots during partition discovery. 
BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 64+ messages in thread
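The "last filesystem discovered overwrites the first one" behaviour described above, and the proposed fix of not offering snapshots to the scan, can be modelled in a few lines. This is a toy model only, not how btrfs-progs or the udev rules are written. The device paths are made up, and the check in `is_lvm_snapshot` rests on an assumption: that the first character of the attribute string printed by `lvs -o lv_attr` is `s` (or `S` while merging) for a classic snapshot LV. Thin snapshots are not flagged that way and would need a different test.

```python
def register_last_wins(scanned):
    """One slot per filesystem UUID; a later scan silently replaces an earlier one."""
    registry = {}
    for path, fs_uuid in scanned:
        registry[fs_uuid] = path
    return registry


def is_lvm_snapshot(lv_attr):
    """Assumption: lv_attr comes from `lvs -o lv_attr`; 's'/'S' marks a snapshot LV."""
    return lv_attr[:1] in ("s", "S")


def register_skipping_snapshots(scanned_with_attr):
    """Same registry, but snapshot LVs are filtered out before registration."""
    return register_last_wins(
        (path, fs_uuid)
        for path, fs_uuid, lv_attr in scanned_with_attr
        if not is_lvm_snapshot(lv_attr)
    )


# Hypothetical scan order: the origin first, then its snapshot.
scanned = [
    ("/dev/vg/root", "Z", "owi-aos---"),
    ("/dev/vg/root_snap", "Z", "swi-a-s---"),
]

naive = register_last_wins((path, fs_uuid) for path, fs_uuid, _ in scanned)
filtered = register_skipping_snapshots(scanned)
print(naive)
print(filtered)
```

In the naive registry the snapshot ends up as the device for UUID Z purely because it was scanned last; with the filter the origin stays registered.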
[parent not found: <CAE8gLh=VubBbZdeKTAuWRjOxPF7C+ouUeeVvmGfT2ckYWGhQVA@mail.gmail.com>]
* Fwd: BTRFS messes up snapshot LV with origin [not found] ` <CAE8gLh=VubBbZdeKTAuWRjOxPF7C+ouUeeVvmGfT2ckYWGhQVA@mail.gmail.com> @ 2014-11-17 19:45 ` MegaBrutal 2014-11-17 20:32 ` Goffredo Baroncelli 2014-11-18 6:16 ` Chris Murphy 0 siblings, 2 replies; 64+ messages in thread From: MegaBrutal @ 2014-11-17 19:45 UTC (permalink / raw) To: kreijack, Brendan Hide, linux-btrfs 2014-11-17 20:04 GMT+01:00 Goffredo Baroncelli <kreijack@inwind.it>: > > Regarding b) > I am bit confused: if I understood correctly, the root filesystem was > picked from a LVM-snapshot, so grub-probe *correctly* reported that > the root device is the snapshot. This is not what happens. The system isn't even rebooted when the mix-up happens. You boot from the original device, create an LVM-snapshot*, and mount starts to report the snapshot as the root device, while in fact it isn't. I know my initial descriptions of the bug were misleading, as I myself didn't know what the heck was going on. From this point on, please take these comments as reference: https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429/comments/2 https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429/comments/4 * I know I shouldn't make an LVM-snapshot of a mounted file system, but this is not the point. P.S.: E-mail sent twice, as the lists didn't accept it in HTML. Plus I'm not on the GRUB list, and can't post there. 
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: Fwd: BTRFS messes up snapshot LV with origin 2014-11-17 19:45 ` Fwd: " MegaBrutal @ 2014-11-17 20:32 ` Goffredo Baroncelli 2014-11-18 6:16 ` Chris Murphy 1 sibling, 0 replies; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-17 20:32 UTC (permalink / raw) To: MegaBrutal, Brendan Hide, linux-btrfs On 2014-11-17 20:45, MegaBrutal wrote: > * I know I shouldn't make an LVM-snapshot of a mounted file system, > but this is not the point. This should be supported for filesystems that support freezing. See http://stackoverflow.com/questions/1940093/lvm-snapshot-of-mounted-filesystem -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-17 19:45 ` Fwd: " MegaBrutal 2014-11-17 20:32 ` Goffredo Baroncelli @ 2014-11-18 6:16 ` Chris Murphy 2014-11-18 15:42 ` Phillip Susi 1 sibling, 1 reply; 64+ messages in thread From: Chris Murphy @ 2014-11-18 6:16 UTC (permalink / raw) Cc: Btrfs BTRFS On Nov 17, 2014, at 12:45 PM, MegaBrutal <megabrutal@gmail.com> wrote: > 2014-11-17 20:04 GMT+01:00 Goffredo Baroncelli <kreijack@inwind.it>: >> >> Regarding b) >> I am bit confused: if I understood correctly, the root filesystem was >> picked from a LVM-snapshot, so grub-probe *correctly* reported that >> the root device is the snapshot. > > > This is not what happens. The system doesn't even get a reboot when > the mix-up happens. > > You boot from the original device, create an LVM-snapshot*, and mount > starts to report the snapshot as the root device, while in fact it > isn’t. If fstab specifies rootfs as UUID, and there are two volumes with the same UUID, it’s now ambiguous which one at boot time is the intended rootfs. It’s no different than the days of /dev/sdXY where X would change designations between boots = ambiguity and why we went to UUID. So we kinda need a way to distinguish derivative volumes. Maybe XFS and ext4 could easily change the volume UUID, but my vague recollection is this is difficult on Btrfs? So that led me to the idea of a way to create an on-the-fly (but consistent) “virtual volume UUID” maybe based on a hash of both the LVM LV and fs volume UUID. Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
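The "virtual volume UUID" idea at the end of the message above (a consistent identifier derived by hashing the LV and the filesystem UUID together) might look something like this. It is a sketch of the idea only: nothing in LVM or btrfs provides this, `virtual_volume_uuid` is a hypothetical helper, and the UUID values are invented. It uses a name-based UUID (version 5, SHA-1), with the filesystem UUID as the namespace and the LV UUID as the name.

```python
import uuid


def virtual_volume_uuid(fs_uuid: str, lv_uuid: str) -> uuid.UUID:
    """Derive a stable identifier from the filesystem UUID and the LV UUID.

    uuid5 hashes (namespace, name), so the same pair always gives the same
    result, while a snapshot LV (different LV UUID) gives a different one.
    """
    return uuid.uuid5(uuid.UUID(fs_uuid), lv_uuid)


# Invented example values. LVM's own UUID strings are not in RFC 4122 form,
# so the LV UUID is used only as the hashed name, never parsed as a UUID.
FS_UUID = "0b5a4c1e-6d2f-4e8a-9c3b-7f1d2e3a4b5c"
origin = virtual_volume_uuid(FS_UUID, "origin-lv-uuid")
snapshot = virtual_volume_uuid(FS_UUID, "snapshot-lv-uuid")

print(origin)
print(snapshot)
```

The origin and its snapshot share the filesystem UUID but get different derived identifiers, and recomputing either one after a reboot gives the same value as long as the LV UUID does not change.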
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 6:16 ` Chris Murphy @ 2014-11-18 15:42 ` Phillip Susi 2014-11-18 19:17 ` Chris Murphy ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: Phillip Susi @ 2014-11-18 15:42 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS On 11/18/2014 1:16 AM, Chris Murphy wrote: > If fstab specifies rootfs as UUID, and there are two volumes with > the same UUID, it’s now ambiguous which one at boot time is the > intended rootfs. It’s no different than the days of /dev/sdXY where > X would change designations between boots = ambiguity and why we > went to UUID. He already said he has NOT rebooted, so there is no way that the snapshot has actually been mounted, even if it were UUID confusion. > So we kinda need a way to distinguish derivative volumes. Maybe > XFS and ext4 could easily change the volume UUID, but my vague > recollection is this is difficult on Btrfs? So that led me to the > idea of a way to create an on-the-fly (but consistent) “virtual > volume UUID” maybe based on a hash of both the LVM LV and fs > volume UUID. When using LVM, you should be referring to the volume by the LVM name rather than UUID. LVM names are stable, and don't have the duplicate uuid problem. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 15:42 ` Phillip Susi @ 2014-11-18 19:17 ` Chris Murphy 2014-11-18 20:17 ` Phillip Susi 2014-11-18 20:41 ` MegaBrutal 2014-11-19 1:29 ` Robert White 2 siblings, 1 reply; 64+ messages in thread From: Chris Murphy @ 2014-11-18 19:17 UTC (permalink / raw) To: Phillip Susi; +Cc: Btrfs BTRFS On Nov 18, 2014, at 8:42 AM, Phillip Susi <psusi@ubuntu.com> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 11/18/2014 1:16 AM, Chris Murphy wrote: >> If fstab specifies rootfs as UUID, and there are two volumes with >> the same UUID, it’s now ambiguous which one at boot time is the >> intended rootfs. It’s no different than the days of /dev/sdXY where >> X would change designations between boots = ambiguity and why we >> went to UUID. > > He already said he has NOT rebooted, so there is no way that the > snapshot has actually been mounted, even if it were UUID confusion. > >> So we kinda need a way to distinguish derivative volumes. Maybe >> XFS and ext4 could easily change the volume UUID, but my vague >> recollection is this is difficult on Btrfs? So that led me to the >> idea of a way to create an on-the-fly (but consistent) “virtual >> volume UUID” maybe based on a hash of both the LVM LV and fs >> volume UUID. > > When using LVM, you should be referring to the volume by the LVM name > rather than UUID. LVM names are stable, and don't have the duplicate > uuid problem. What if you have a Btrfs raid1 volume using two LV’s and then snapshot both LV’s? Of course I’d specify one of the devices by VG-LV name. But Btrfs finds additional devices itself, it doesn’t support explicitly naming additional member devices. And in this example, there are two identical candidates, so it’s ambiguous to Btrfs which one to use. And further it’s unknown to the user which one Btrfs chose because neither mount, nor /proc/mounts right now shows anything other than the first device that’s mounted. 
So it’s using one of those two VG-LV’s automatically but not informing us which one. I think there’s some metadata that can be set on each LV whether it’s automatically activated (at e.g. boot time) so I think the thing to do would be to make sure the snapshot LV’s are not activated, therefore their UUID’s shouldn’t be visible to Btrfs and it won’t automatically discover and use the wrong LV. But I haven’t tested this. Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 19:17 ` Chris Murphy @ 2014-11-18 20:17 ` Phillip Susi 2014-11-19 2:54 ` Chris Murphy 0 siblings, 1 reply; 64+ messages in thread From: Phillip Susi @ 2014-11-18 20:17 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS On 11/18/2014 2:17 PM, Chris Murphy wrote: > What if you have a Btrfs raid1 volume using two LV’s and then > snapshot both LV’s? That's even more silly than a single lvm snapshot under btrfs. Just don't do it. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 20:17 ` Phillip Susi @ 2014-11-19 2:54 ` Chris Murphy 2014-11-19 15:20 ` Phillip Susi 0 siblings, 1 reply; 64+ messages in thread From: Chris Murphy @ 2014-11-19 2:54 UTC (permalink / raw) To: Phillip Susi; +Cc: Btrfs BTRFS On Nov 18, 2014, at 1:17 PM, Phillip Susi <psusi@ubuntu.com> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 11/18/2014 2:17 PM, Chris Murphy wrote: >> What if you have a Btrfs raid1 volume using two LV’s and then >> snapshot both LV’s? > > That's even more silly than a single lvm snapshot under btrfs. Just > don't do it. Why is it silly? Btrfs on a thin volume has practical use case aside from just being thinly provisioned, its snapshots are block device based, not merely that of an fs tree. Looks like lvm.conf does have a way to affect LV autoactivation, and there may be another way to achieve this also. Right after the snapshot(s) they’d need to have their autoactivation disabled to avoid UUID confusion. Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-19 2:54 ` Chris Murphy @ 2014-11-19 15:20 ` Phillip Susi 2014-11-19 18:35 ` Chris Murphy 2014-11-21 4:28 ` Zygo Blaxell 0 siblings, 2 replies; 64+ messages in thread From: Phillip Susi @ 2014-11-19 15:20 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS On 11/18/2014 9:54 PM, Chris Murphy wrote: > Why is it silly? Btrfs on a thin volume has practical use case > aside from just being thinly provisioned, its snapshots are block > device based, not merely that of an fs tree. Umm... because one of the big selling points of btrfs is that it is in a much better position to make snapshots, being aware of the fs tree, rather than doing it in the block layer. So it is kind of silly in the first place to be using lvm snapshots under btrfs, but it is doubly silly to use lvm for snapshots, and btrfs for the mirroring rather than lvm. Pick one layer and use it for both functions. Even if that is lvm, then it should also be handling the mirroring. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-19 15:20 ` Phillip Susi @ 2014-11-19 18:35 ` Chris Murphy 2014-11-19 19:23 ` Phillip Susi 2014-11-21 4:28 ` Zygo Blaxell 1 sibling, 1 reply; 64+ messages in thread From: Chris Murphy @ 2014-11-19 18:35 UTC (permalink / raw) To: Btrfs BTRFS On Wed, Nov 19, 2014 at 8:20 AM, Phillip Susi <psusi@ubuntu.com> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 11/18/2014 9:54 PM, Chris Murphy wrote: >> Why is it silly? Btrfs on a thin volume has practical use case >> aside from just being thinly provisioned, its snapshots are block >> device based, not merely that of an fs tree. > > Umm... because one of the big selling points of btrfs is that it is in > a much better position to make snapshots being aware of the fs tree > rather than doing it in the block layer. This is why we have fsfreeze before taking block level snapshots. And I point out that consistent snapshots with Btrfs have posed challenges too, there's a recent fstest "snapshoting after file write + truncate" for this reason. A block layer snapshot will snapshot the entire file system, not just one tree. We don't have a way in Btrfs to snapshot the entire volume. Considering how things still aren't exactly stable yet, in particular with many snapshots, it's not unreasonable to want to freeze then snapshot the entire volume before doing some possibly risky testing or usage where even a Btrfs snapshot doesn't protect your entire volume should things go wrong. > > > So it is kind of silly in the first place to be using lvm snapshots > under btrfs, but it is is doubly silly to use lvm for snapshots, and > btrfs for the mirroring rather than lvm. Pick one layer and use it > for both functions. Even if that is lvm, then it should also be > handling the mirroring. Thin volumes are more efficient. And the user creating them doesn't have to mess around with locating physical devices or possibly partitioning them. 
Plus in enterprise environments with lots of storage and many different kinds of use cases, even knowledgeable users aren't always granted full access to the physical storage anyway. They get a VG to play with, or now they can have a thin pool and only consume on storage what is actually used, and not what they've reserved. You can mkfs a 4TB virtual size volume, while it only uses 1MB of physical extents on storage. And all of that is orthogonal to using XFS or Btrfs which again comes down to use case. And whether I'd have LVM mirror or Btrfs mirror is again a question of use case, maybe I'm OK with LVM mirroring and I just get the rare corrupt file warning and that's OK. In another use case, corruption isn't OK, I need higher availability of known good data therefore I need Btrfs doing the mirroring. So I find your argument thus far uncompelling. Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-19 18:35 ` Chris Murphy @ 2014-11-19 19:23 ` Phillip Susi 0 siblings, 0 replies; 64+ messages in thread From: Phillip Susi @ 2014-11-19 19:23 UTC (permalink / raw) To: Chris Murphy, Btrfs BTRFS On 11/19/2014 1:33 PM, Chris Murphy wrote: > Thin volumes are more efficient. And the user creating them doesn't > have to mess around with locating physical devices or possibly > partitioning them. Plus in enterprise environments with lots of > storage and many different kinds of use cases, even knowledgeable > users aren't always granted full access to the physical storage > anyway. They get a VG to play with, or now they can have a thin > pool and only consume on storage what is actually used, and not > what they've reserved. You can mkfs a 4TB virtual size volume, > while it only uses 1MB of physical extents on storage. And all of > that is orthogonal to using XFS or Btrfs which again comes down to > use case. And whether I'd have LVM mirror or Btrfs mirror is again > a question of use case, maybe I'm OK with LVM mirroring and I just > get the rare corrupt file warning and that's OK. In another use > case, corruption isn't OK, I need higher availability of known > good data therefore I need Btrfs doing the mirroring. Correct me if I'm wrong, but this kind of setup is basically where you have a provider running an lvm thin pool volume on their hardware, and exposing it to the customer's vm as a virtual disk. In that case, the provider can do their snapshots and it won't cause this problem since the snapshots aren't visible to the vm. Also in these cases the provider is normally already providing data protection by having the vg on a raid6 or raid60 or something, so having the client vm mirror the data in btrfs is a bit redundant. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-19 15:20 ` Phillip Susi 2014-11-19 18:35 ` Chris Murphy @ 2014-11-21 4:28 ` Zygo Blaxell 2014-11-21 6:22 ` Duncan 2014-11-22 17:34 ` Goffredo Baroncelli 1 sibling, 2 replies; 64+ messages in thread From: Zygo Blaxell @ 2014-11-21 4:28 UTC (permalink / raw) To: Phillip Susi; +Cc: Chris Murphy, Btrfs BTRFS

On Wed, Nov 19, 2014 at 10:20:17AM -0500, Phillip Susi wrote:
> On 11/18/2014 9:54 PM, Chris Murphy wrote:
> > Why is it silly? Btrfs on a thin volume has practical use case
> > aside from just being thinly provisioned, its snapshots are block
> > device based, not merely that of an fs tree.
>
> Umm... because one of the big selling points of btrfs is that it is in
> a much better position to make snapshots being aware of the fs tree
> rather than doing it in the block layer.

One of the big selling points of LVM is that it is in a much better position to make snapshots so you can run btrfsck on the shattered remains of your broken btrfs filesystem.

The UUID-driven behavior of btrfs is _really extremely annoying_. No other filesystem forces me to jump through the hoops btrfs does to get routine admin tasks done.

e.g. if an ext4 filesystem explodes, I can:

	1. make a LVM snapshot of the broken filesystem

	2. run e2fsck on the snapshot

	3. mount and repair the snapshot, e.g. rsync any missing files from backups, salvage anything that survived

	4. LVM merge the snapshot to its origin volume

	5. umount the origin volume and mount the merged volume (or just reboot)

...and I can do all of this on a running system, in-place, with only a few minutes of downtime in the must-reboot case.

None of the above works with btrfs at all. Multi-device btrfs fails at 2, and mounting the filesystem fails at 3. The closest I've gotten to this workflow is to set up a kvm instance that can see only the LVM snapshots and run the btrfsck or rsync there--and hope that the system doesn't crash and reboot during that time, or the filesystem will be more or less destroyed by the random combination of origin and snapshot LVs.

I've also learned the hard way to always make an LVM snapshot before running btrfsck, just in case you discover a new btrfsck bug with your filesystem. That at least works for single-device btrfs filesystems.

> So it is kind of silly in the first place to be using lvm snapshots
> under btrfs, but it is doubly silly to use lvm for snapshots, and
> btrfs for the mirroring rather than lvm. Pick one layer and use it
> for both functions. Even if that is lvm, then it should also be
> handling the mirroring.

^ permalink raw reply [flat|nested] 64+ messages in thread
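The five-step ext4 workflow described in the message above can be sketched as a dry-run shell function that only prints the commands it would run. This is an illustration, not anything posted in the thread: the volume group `vg0`, the LV `data`, the snapshot name `data_repair`, the `/mnt/repair` mount point and the `/backup` path are all hypothetical, and the last step assumes the origin has an fstab entry.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for the snapshot/repair/merge workflow.
# vg, lv, size, /mnt/repair and /backup are placeholders, not from the thread.
plan_repair() {
    vg=$1; lv=$2; size=$3
    snap="${lv}_repair"
    # 1. snapshot the broken filesystem's LV
    echo "lvcreate -s -n $snap -L $size $vg/$lv"
    # 2. check the snapshot, leaving the origin untouched
    echo "e2fsck -f /dev/$vg/$snap"
    # 3. mount the snapshot and restore missing files from backup
    echo "mount /dev/$vg/$snap /mnt/repair"
    echo "rsync -a /backup/$lv/ /mnt/repair/"
    echo "umount /mnt/repair"
    # 4. merge the repaired snapshot back into its origin
    echo "lvconvert --merge $vg/$snap"
    # 5. remount the origin (assumes an fstab entry), or reboot
    echo "umount /dev/$vg/$lv && mount /dev/$vg/$lv"
}

plan_repair vg0 data 5G
```

Printing instead of executing keeps the sketch safe to run anywhere; each printed line would have to be reviewed and run as root on a real system.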
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 4:28 ` Zygo Blaxell @ 2014-11-21 6:22 ` Duncan 2014-11-21 11:35 ` Robert White ` (2 more replies) 2014-11-22 17:34 ` Goffredo Baroncelli 1 sibling, 3 replies; 64+ messages in thread From: Duncan @ 2014-11-21 6:22 UTC (permalink / raw) To: linux-btrfs Zygo Blaxell posted on Thu, 20 Nov 2014 23:28:14 -0500 as excerpted: > On Wed, Nov 19, 2014 at 10:20:17AM -0500, Phillip Susi wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> On 11/18/2014 9:54 PM, Chris Murphy wrote: >> > Why is it silly? Btrfs on a thin volume has practical use case aside >> > from just being thinly provisioned, its snapshots are block device >> > based, not merely that of an fs tree. >> >> Umm... because one of the big selling points of btrfs is that it is in >> a much better position to make snapshots being aware of the fs tree >> rather than doing it in the block layer. > > One of the big selling points of LVM is that it is in a much better > position to make snapshots so you can run btrfsck on the shattered > remains of your broken btrfs filesystem. > > The UUID-driven behavior of btrfs is _really extremely annoying_. > No other filesystem forces me to jump through the hoops btrfs does to > get routine admin tasks done. > > e.g. if an ext4 filesystem explodes, I can: > > 1. make a LVM snapshot of the broken filesystem > > 2. run e2fsck on the snapshot > > 3. mount and repair the snapshot, e.g. rsync any missing files from > backups, salvage anything that survived > > 4. LVM merge the snapshot to its origin volume > > 5. umount the origin volume and mount the merged volume (or just > reboot) > > ...and I can do all of this on a running system, in-place, with only a > few minutes of downtime in the must-reboot case. > > None of the above works with btrfs at all. Multi-device btrfs fails at > 2, > and mounting the filesystem fails at 3. 
The closest I've gotten to this > workflow is to set up a kvm instance that can see only the LVM > snapshots, (only) and run the btrfsck or rsync there--and hope that the > system doesn't crash and reboot during that time, or the filesystem will > be more or less destroyed by the random combination of origin and > snapshot LVs. > > I've also learned the hard way to always make an LVM snapshot before > running btrfsck, just in case you discover a new btrfsck bug with your > filesystem. That at least works for single-device btrfs filesystems. When I have such a filesystem level problem, I simply dd from the backing device to some other location, generally to a file that's on a different filesystem (preferably non-btrfs, I use reiserfs as I've found it very resilient, here), in which case btrfs device scan won't see the UUID on the copy as it scans block devices, not inside non-device files. After all, an LVM block-level snapshot takes the same space as a file containing the same raw data, and if there's room for the data in an LVM snapshot, given a different layout, there's room for exactly the same amount of data as a file on a different filesystem, piped thru some compressor if necessary due to tight datasize constraints. But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs =:^), because they're no longer unique, requiring them to be unique just as the label says cannot be considered a bug. It's simply stricter enforcement of the rules, which are, after all, plainly stated in the descriptive name. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
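The dd-to-a-file approach described above, including the optional compressor, might look roughly like the following. To keep the sketch runnable without root, a small regular file stands in for the backing device; the file names are hypothetical, and on a real system `SRC` would be the device node of the unmounted filesystem.

```shell
#!/bin/sh
# Sketch: copy a "device" (here a stand-in regular file) into a compressed
# image file on another filesystem, then verify the copy.
set -e
workdir=$(mktemp -d)
SRC="$workdir/fake-device.img"   # hypothetical stand-in for the backing device
IMG="$workdir/backup.img.gz"     # image file; btrfs device scan ignores it

# Create 4 MiB of sample data to act as the device contents.
dd if=/dev/urandom of="$SRC" bs=1M count=4 2>/dev/null

# Raw copy piped through a compressor.
dd if="$SRC" bs=1M 2>/dev/null | gzip -c > "$IMG"

# Verify that decompressing the image reproduces the source byte for byte.
if gzip -dc "$IMG" | cmp -s - "$SRC"; then
    echo "image matches source"
fi
```

Because the copy lives inside an ordinary file rather than on a block device, a device scan would not pick up its UUID, which is the property the message relies on.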
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 6:22 ` Duncan @ 2014-11-21 11:35 ` Robert White 2014-11-21 11:54 ` Duncan 0 siblings, 1 reply; 64+ messages in thread From: Robert White @ 2014-11-21 11:35 UTC (permalink / raw) To: Duncan, linux-btrfs On 11/20/2014 10:22 PM, Duncan wrote: > But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs > =:^), because they're no longer unique, requiring them to be unique just > as the label says cannot be considered a bug. It's simply stricter > enforcement of the rules, which are, after all, plainly stated in the > descriptive name. You take "U"s away, not add them UID = unique ID GUID = globally unique ID UUID = universally unique ID And other file systems have the same issues. XFS, for example, uses UUIDs in the same way. It just has a command to re-brand the filesystem's UUID which you apply to the LVM snapshot immediately after taking the snapshot. (problem long-since established and understood since 2009 or so.) I don't know if this approach would work for Btrfs with subvolumes. Example Citation :: http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/ XFS also has the nouuid mount option. btrfs has the device= mount option. But any system with unique ids will have this identical issue when block-snapshot support is added underneath. -- Rob. ^ permalink raw reply [flat|nested] 64+ messages in thread
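The options mentioned in the message above can be sketched as a dry-run that prints the relevant commands. The device and mount point names are hypothetical, and whether a given option fits depends on the setup; nothing here has been tested against the scenario in this thread.

```shell
#!/bin/sh
# Dry-run sketch: print ways to cope with a duplicate filesystem UUID on a
# block-level snapshot. Device and mount point names are placeholders.
show_uuid_options() {
    snap=$1; mnt=$2
    # XFS: mount the snapshot while skipping the duplicate-UUID check
    echo "mount -o nouuid $snap $mnt"
    # XFS: or give the unmounted snapshot a freshly generated UUID
    echo "xfs_admin -U generate $snap"
}

show_btrfs_device_option() {
    dev1=$1; dev2=$2; mnt=$3
    # btrfs: name the member devices explicitly at mount time
    echo "mount -o device=$dev1,device=$dev2 $dev1 $mnt"
}

show_uuid_options /dev/vg0/data_snap /mnt/snap
show_btrfs_device_option /dev/vg0/lv00 /dev/vg0/lv01 /mnt/btrfs
```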
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 11:35 ` Robert White @ 2014-11-21 11:54 ` Duncan 0 siblings, 0 replies; 64+ messages in thread From: Duncan @ 2014-11-21 11:54 UTC (permalink / raw) To: linux-btrfs Robert White posted on Fri, 21 Nov 2014 03:35:05 -0800 as excerpted: > On 11/20/2014 10:22 PM, Duncan wrote: >> But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs >> =:^), because they're no longer unique, requiring them to be unique >> just as the label says cannot be considered a bug. It's simply >> stricter enforcement of the rules, which are, after all, plainly stated >> in the descriptive name. > > You take "U"s away, not add them > > UID = unique ID GUID = globally unique ID UUID = universally unique ID I was making a joke, as I happened to notice un-UUID =3 U-s just as I was writing that. Universally unique ID = UUID, un-UUID (not universally unique ID) = UUUID = U^3ID. =:^) Of course formally it'd be NUID (not/non- unique) or some such, but un- UUID served my purpose well enough, including the joke once I noticed it, so... -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 6:22 ` Duncan 2014-11-21 11:35 ` Robert White @ 2014-11-21 17:56 ` Zygo Blaxell 2014-11-21 23:09 ` Duncan 2014-11-21 18:23 ` Chris Murphy 2 siblings, 1 reply; 64+ messages in thread From: Zygo Blaxell @ 2014-11-21 17:56 UTC (permalink / raw) To: Duncan; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2035 bytes --] On Fri, Nov 21, 2014 at 06:22:57AM +0000, Duncan wrote: > After all, an LVM block-level snapshot takes the same space as a file > containing the same raw data, and if there's room for the data in an LVM > snapshot, given a different layout, there's room for exactly the same > amount of data as a file on a different filesystem, piped thru some > compressor if necessary due to tight datasize constraints. That isn't true at all. A repairing fsck can take less than 1% of the overall volume size, and a full conversion from another filesystem type can take less than 10%. Usually I can find enough space by blowing away the swap LV for a few hours. I do NOT usually have 13TB of slack space lying around in a 26TB disk array, nor do I have enough bandwidth to move those 13TB to another machine without great inconvenience. > But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs > =:^), because they're no longer unique, requiring them to be unique just > as the label says cannot be considered a bug. It's simply stricter > enforcement of the rules, which are, after all, plainly stated in the > descriptive name. It's not a bug as long as I can completely control which devices are searched for UUIDs, and the system behaves sanely when multiple UUIDs are found through automatic discovery; otherwise, it's not only a bug, it's a DoS attack security vulnerability. Consider what happens if someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a fake filesystem with those UUIDs, puts the fake filesystem on a USB stick, and plugs it back into the victim machine... 
^ permalink raw reply [flat|nested] 64+ messages in thread
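One way to at least notice the situation discussed above is to look for filesystem UUIDs that show up on more than one block device before mounting anything. The following is a minimal sketch, not an existing tool: the function name `find_dup_uuids` and the sample UUIDs are made up, and the input format assumes something like the output of `lsblk -rno PATH,UUID` on a recent util-linux.

```shell
#!/bin/sh
# Sketch: report filesystem UUIDs that appear on more than one device.
# Input: lines of "<device> <uuid>"; lines without a UUID are ignored.
find_dup_uuids() {
    awk 'NF >= 2 { count[$2]++; devs[$2] = devs[$2] " " $1 }
         END { for (u in count) if (count[u] > 1) print u ":" devs[u] }'
}

# Sample data (made-up UUIDs): an origin LV and its snapshot share one UUID.
find_dup_uuids <<'EOF'
/dev/vg0/data 11111111-2222-3333-4444-555555555555
/dev/vg0/data_snap 11111111-2222-3333-4444-555555555555
/dev/vg0/swap 99999999-8888-7777-6666-555555555555
EOF
```

On a live system the same function could be fed real data, for example `lsblk -rno PATH,UUID | find_dup_uuids`, assuming the installed lsblk supports the PATH column.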
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 17:56 ` Zygo Blaxell @ 2014-11-21 23:09 ` Duncan 0 siblings, 0 replies; 64+ messages in thread From: Duncan @ 2014-11-21 23:09 UTC (permalink / raw) To: linux-btrfs Zygo Blaxell posted on Fri, 21 Nov 2014 12:56:23 -0500 as excerpted: > It's not a bug as long as I can completely control which devices are > searched for UUIDs, and the system behaves sanely when multiple UUIDs > are found through automatic discovery; otherwise, it's not only a bug, > it's a DoS attack security vulnerability. Consider what happens if > someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a > fake filesystem with those UUIDs, puts the fake filesystem on a USB > stick, and plugs it back into the victim machine... With the current state of USB vulnerability (firmware reprogrammed as an input device, etc, the vuln has been all over the tech news for some months now), anyone with USB access to the machine is simply another case of anyone with physical access to the machine, they're normally assumed to be able to be able to at minimum take down the machine, the ultimate DoS, in any case, and often to have effective root, tho that can be mitigated to some extent with encryption, etc. It's generally assumed that if you have physical access, as required to plug in that USB, game over, the machine is effectively p40wn3d. At the /very/ least, with physical access it's vulnerable to the sledgehammer DoS, and there's little to be done about that but prevent physical access by all means necessary (armed guards, nuclear silo hosting, etc) in the first place. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 6:22 ` Duncan 2014-11-21 11:35 ` Robert White 2014-11-21 17:56 ` Zygo Blaxell @ 2014-11-21 18:23 ` Chris Murphy 2014-11-21 22:49 ` Duncan 2 siblings, 1 reply; 64+ messages in thread From: Chris Murphy @ 2014-11-21 18:23 UTC (permalink / raw) To: Duncan; +Cc: Btrfs BTRFS On Thu, Nov 20, 2014 at 11:22 PM, Duncan <1i5t5.duncan@cox.net> wrote: > > When I have such a filesystem level problem, I simply dd from the backing > device to some other location, generally to a file that's on a different > filesystem (preferrably non-btrfs, I use reiserfs as I've found it very > resilient, here), in which case btrfs device scan won't see the UUID on > the copy as it scans block devices, not inside non-device files. That's hours of dd and you have to find space to do it. > After all, an LVM block-level snapshot takes the same space as a file > containing the same raw data, and if there's room for the data in an LVM > snapshot, given a different layout, there's room for exactly the same > amount of data as a file on a different filesystem, piped thru some > compressor if necessary due to tight datasize constraints. That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance. They're more like a qcow2 snapshot than a conventional LVM snapshot; a big difference being if you delete the snapshot, or you delete a bunch of files in a thin volume and follow it with fstrim, the unused extents are returned to the thin pool. There has been a fragmentation problem with thin volumes; I don't know if that's solved yet. And I don't know if it exacerbates things with Btrfs fragmentation. -- Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
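The thin-provisioning points above (a large virtual size on a small pool, snapshots that need no reserved space, fstrim returning extents) can be sketched as a dry-run that prints the corresponding LVM commands. The names `vg0`, `pool0` and `thinvol`, the sizes and the mount point are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print the LVM commands for a thin pool, a thin volume that
# is larger than the pool, and a thin snapshot. All names are placeholders.
plan_thin() {
    vg=$1; pool=$2; lv=$3
    # A 10G pool backs a volume with a 4T virtual size.
    echo "lvcreate --type thin-pool -L 10G -n $pool $vg"
    echo "lvcreate --type thin -V 4T -n $lv --thinpool $pool $vg"
    # A thin snapshot takes no size argument; it shares blocks with its origin.
    echo "lvcreate -s -n ${lv}_snap $vg/$lv"
    # After deleting files, fstrim hands unused extents back to the pool.
    echo "fstrim -v /mnt/$lv"
}

plan_thin vg0 pool0 thinvol
```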
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 18:23 ` Chris Murphy @ 2014-11-21 22:49 ` Duncan 2014-11-21 23:41 ` Duncan 0 siblings, 1 reply; 64+ messages in thread From: Duncan @ 2014-11-21 22:49 UTC (permalink / raw) To: linux-btrfs Chris Murphy posted on Fri, 21 Nov 2014 11:23:45 -0700 as excerpted: > On Thu, Nov 20, 2014 at 11:22 PM, Duncan <1i5t5.duncan@cox.net> wrote: > > >> When I have such a filesystem level problem, I simply dd from the >> backing device to some other location, generally to a file that's on a >> different filesystem (preferrably non-btrfs, I use reiserfs as I've >> found it very resilient, here), in which case btrfs device scan won't >> see the UUID on the copy as it scans block devices, not inside >> non-device files. > > That's hours of dd and you have to find space to do it. I did it recently here. There's a method to my sub-100-GiB partition madness! =:^) The partitions in question were on SSD, and were small enough I could simply DD them to files on my media filesystem, which was after all designed to be able to take full ISO images, etc. Additionally, due to size and reasonably consistent linear intra-file access patterns, the media filesystem's still on much cheaper spinning rust, while most of the system's on much faster to random-access but far more expensive SSD, so in this case one side was SSD, the other spinning rust. Tho granted, if you're doing single-partition/filesystem multi-TiB filesystems, it does get to be a problem. As there would have been if the filesystem in question was the media filesystem, altho that one's not yet btrfs for a reason. But still, if there's room enough for an LVM snapshot in the first place, with a different layout, there'd be room for the same data as a file. That's pretty basic. 
>> After all, an LVM block-level snapshot takes the same space as a file >> containing the same raw data, and if there's room for the data in an >> LVM snapshot, given a different layout, there's room for exactly the >> same amount of data as a file on a different filesystem, piped thru >> some compressor if necessary due to tight datasize constraints. > > That's not true for thin volume snapshots. They take up next to no space > upon creation, they don't need space reserved in advance. Thus the mention of compression if necessary. Thin-volume snapshots are effectively compression by another name, and a raw dd from them should compress pretty much equally well, depending on compression method chosen, of course. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 22:49 ` Duncan @ 2014-11-21 23:41 ` Duncan 2014-11-21 23:51 ` Duncan 0 siblings, 1 reply; 64+ messages in thread From: Duncan @ 2014-11-21 23:41 UTC (permalink / raw) To: linux-btrfs Duncan posted on Fri, 21 Nov 2014 22:49:06 +0000 as excerpted: > Chris Murphy posted... >> That's not true for thin volume snapshots. They take up next to no >> space upon creation, they don't need space reserved in advance. > > Thus the mention of compression if necessary. Thin-volume snapshots are > effectively compression by another name, and a raw dd from them should > compress pretty much equally well, depending on compression method > chosen, of course. =:^) Oops, I mis-parsed "thin". Good point and thanks, Chris. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 23:41 ` Duncan @ 2014-11-21 23:51 ` Duncan 0 siblings, 0 replies; 64+ messages in thread From: Duncan @ 2014-11-21 23:51 UTC (permalink / raw) To: linux-btrfs Duncan posted on Fri, 21 Nov 2014 23:41:49 +0000 as excerpted: > Duncan posted on Fri, 21 Nov 2014 22:49:06 +0000 as excerpted: > >> Chris Murphy posted... > >>> That's not true for thin volume snapshots. They take up next to no >>> space upon creation, they don't need space reserved in advance. >> >> Thus the mention of compression if necessary. Thin-volume snapshots >> are effectively compression by another name, and a raw dd from them >> should compress pretty much equally well, depending on compression >> method chosen, of course. =:^) > > Oops, I mis-parsed "thin". Good point and thanks, Chris. ... And Zygo, who pointed out my error as well. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-21 4:28 ` Zygo Blaxell 2014-11-21 6:22 ` Duncan @ 2014-11-22 17:34 ` Goffredo Baroncelli 2014-11-23 0:19 ` Zygo Blaxell 1 sibling, 1 reply; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-22 17:34 UTC (permalink / raw) To: Zygo Blaxell, Phillip Susi; +Cc: Chris Murphy, Btrfs BTRFS On 11/21/2014 05:28 AM, Zygo Blaxell wrote: > e.g. if an ext4 filesystem explodes, I can: > > 1. make a LVM snapshot of the broken filesystem > > 2. run e2fsck on the snapshot > > 3. mount and repair the snapshot, e.g. rsync any missing files > from backups, salvage anything that survived > > 4. LVM merge the snapshot to its origin volume > > 5. umount the origin volume and mount the merged volume > (or just reboot) > > ...and I can do all of this on a running system, in-place, with only a > few minutes of downtime in the must-reboot case. > > None of the above works with btrfs at all. Multi-device btrfs fails > at 2, You can't compare ext4 with btrfs if you are talking about a multi-device filesystem: ext4 doesn't have this capability. Try to make an md-raid over snapshotted logical volume(s); I never tried that, but I suppose that there will be the same problems... > and mounting the filesystem fails at 3. Are you sure?
ghigo@venice:/tmp$ # create a btrfs filesystem in a logical volume
ghigo@venice:/tmp$ sudo truncate -s +10G disk.img
ghigo@venice:/tmp$ sudo losetup -f disk.img
ghigo@venice:/tmp$ sudo pvcreate /dev/loop0
ghigo@venice:/tmp$ sudo vgcreate vgtest /dev/loop0
ghigo@venice:/tmp$ sudo lvcreate -n lvone -L 3G vgtest
ghigo@venice:/tmp$ sudo mkfs.btrfs /dev/vgtest/lvone
ghigo@venice:/tmp$ mkdir t

ghigo@venice:/tmp$ # create a file inside a btrfs fs
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-orig bs=1M count=1
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # make a lvm snapshot and add a 2nd file
ghigo@venice:/tmp$ sudo lvcreate -s -n lvone_snap -L 3G vgtest/lvone
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/disk-snap bs=1M count=1
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # mount the first lv, and check the file
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone t/
ghigo@venice:/tmp$ ls -l t
total 1024
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
ghigo@venice:/tmp$ sudo umount t

ghigo@venice:/tmp$ # mount the snapshot lv, and check the files
ghigo@venice:/tmp$ sudo mount /dev/vgtest/lvone_snap t/
ghigo@venice:/tmp$ ls -l t
total 2048
-rw-r--r-- 1 root root 1048576 Nov 22 18:11 disk-orig
-rw-r--r-- 1 root root 1048576 Nov 22 18:12 disk-snap

On the basis of the example above, in case you want to mount a "single-disk", BTRFS seems to me to work properly. You only have to pay attention not to mount the two filesystems at the same time.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-22 17:34 ` Goffredo Baroncelli @ 2014-11-23 0:19 ` Zygo Blaxell 2014-11-25 16:34 ` Goffredo Baroncelli 0 siblings, 1 reply; 64+ messages in thread From: Zygo Blaxell @ 2014-11-23 0:19 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Phillip Susi, Chris Murphy, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 2159 bytes --] On Sat, Nov 22, 2014 at 06:34:38PM +0100, Goffredo Baroncelli wrote: > On 11/21/2014 05:28 AM, Zygo Blaxell wrote: > > e.g. if an ext4 filesystem explodes, I can: > > > > 1. make a LVM snapshot of the broken filesystem > > > > 2. run e2fsck on the snapshot > > > > 3. mount and repair the snapshot, e.g. rsync any missing files > > from backups, salvage anything that survived > > > > 4. LVM merge the snapshot to its origin volume > > > > 5. umount the origin volume and mount the merged volume > > (or just reboot) > > > > ...and I can do all of this on a running system, in-place, with only a > > few minutes of downtime in the must-reboot case. > > > > None of the above works with btrfs at all. Multi-device btrfs fails > > at 2, > > You can't compare ext4 with btrfs, if you are talking about a multi-device > filesystem: ext4 haven't this capability. btrfs fails this comparison as a single-device filesystem. > Try to make a md-raid over a snapshotted logical volume(s); I never tried > that, but I suppose that there will be the same problems... md-raid works as long as you specify the devices, and because it's always the lowest layer it can ignore LVs (snapshot or otherwise). It's also not a particularly common use case, while making an LV snapshot of a filesystem is a typical use case. > > and mounting the filesystem fails at 3. > Are you sure ? Yes, I'm sure. I've had to replace filesystems destroyed this way. >[working instance snipped] > On the basis of the example above, in case you want to mount a > "single-disk", BTRFS seems me to work properly. 
You have to pay > attention only to not mount the two filesystem at the same time. The problem is btrfs stops searching when it sees one disk with each UUID, so the set of disks (snapshot vs origin) that you get is *random*. For a pair of origin + snapshots, there's a 50% chance it works, 50% chance it eats your data. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-23 0:19 ` Zygo Blaxell @ 2014-11-25 16:34 ` Goffredo Baroncelli 2014-11-25 20:29 ` Zygo Blaxell 0 siblings, 1 reply; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-25 16:34 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs

On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
[...]
> md-raid works as long as you specify the devices, and because it's always
> the lowest layer it can ignore LVs (snapshot or otherwise). It's also
> not a particularly common use case, while making an LV snapshot of a
> filesystem is a typical use case.

I fully agree; but you still consider a *multi-device* btrfs over lvm... This is like a dm over lvm... which doesn't make sense at all (as you already wrote)

>>> and mounting the filesystem fails at 3.
>> Are you sure ?
>
> Yes, I'm sure. I've had to replace filesystems destroyed this way.
>
>> [working instance snipped]
>
>> On the basis of the example above, in case you want to mount a
>> "single-disk", BTRFS seems me to work properly. You have to pay
>> attention only to not mount the two filesystem at the same time.
>
> The problem is btrfs stops searching when it sees one disk with each UUID,

BTRFS doesn't search for anything. It is udev which pushes the information to the kernel module. The btrfs module groups this information by UUID. When a new disk is inserted, it overwrites the information of the old one.

> so the set of disks (snapshot vs origin) that you get is *random*.
> For a pair of origin + snapshots, there's a 50% chance it works, 50%
> chance it eats your data.

Sorry but I have to disagree: the code is quite clear (see fs/btrfs/volumes.c, near line 512):

[...]

	} else if (!device->name || strcmp(device->name->str, path)) {
		/*
		 * When FS is already mounted.
		 * 1. If you are here and if the device->name is NULL that
		 *    means this device was missing at time of FS mount.
		 * 2. If you are here and if the device->name is different
		 *    from 'path' that means either
		 *      a. The same device disappeared and reappeared with
		 *         different name. or
		 *      b. The missing-disk-which-was-replaced, has
		 *         reappeared now.
		 *
		 * We must allow 1 and 2a above. But 2b would be a spurious
		 * and unintentional.

[...]

This is case 2a; in this case btrfs stores the new name and mounts it.

Anyway I made a small test: I created one btrfs filesystem and made an lvm snapshot of it. Then I created two different files, one in the snapshot and one in the original. I ran a program which randomly mounts one or the other and checks that the correct file is present; after more than 130 tests I never saw your "50% chance it works": it always works.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-25 16:34 ` Goffredo Baroncelli @ 2014-11-25 20:29 ` Zygo Blaxell 2014-11-25 21:59 ` Goffredo Baroncelli 0 siblings, 1 reply; 64+ messages in thread From: Zygo Blaxell @ 2014-11-25 20:29 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4835 bytes --]

On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
> On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
> [...]
> > md-raid works as long as you specify the devices, and because it's always
> > the lowest layer it can ignore LVs (snapshot or otherwise). It's also
> > not a particularly common use case, while making an LV snapshot of a
> > filesystem is a typical use case.
>
> I fully agree; but you still consider a *multi-device* btrfs over lvm...
> This is like a dm over lvm... which doesn't make sense at all (as you
> already wrote)

It makes sense for btrfs because btrfs can productively use LVs on
different PVs (e.g. btrfs-raid1 on two LVs, one on each PV). LVM is
the bottom layer because not everything in the world is btrfs--things
like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs
(e.g. before running btrfsck) have to live on the same physical drives
as the btrfs filesystems.

> >>> and mounting the filesystem fails at 3.
> >> Are you sure ?
> >
> > Yes, I'm sure. I've had to replace filesystems destroyed this way.
> >
> >> [working instance snipped]
> >
> >> On the basis of the example above, in case you want to mount a
> >> "single-disk", BTRFS seems me to work properly. You have to pay
> >> attention only to not mount the two filesystem at the same time.
> >
> > The problem is btrfs stops searching when it sees one disk with each UUID,
>
> BTRFS doesn't search anything. It is udev which "push" the information
> on the kernel module. The btrfs module groups these information by UUID.
> When a new disk is inserted, overwrite the information of the old one.

Same result: when presented with multiple devices with the same UUID,
one is chosen arbitrarily instead of rejecting all of them.

> > so the set of disks (snapshot vs origin) that you get is *random*.
> > For a pair of origin + snapshots, there's a 50% chance it works, 50%
> > chance it eats your data.
>
> Sorry but I have to disagree: the code is quite clear
> (see fs/btrfs/volumes.c, near line 512):
>
> [...]
>
> 	} else if (!device->name || strcmp(device->name->str, path)) {
> 		/*
> 		 * When FS is already mounted.
> 		 * 1. If you are here and if the device->name is NULL that
> 		 *    means this device was missing at time of FS mount.
> 		 * 2. If you are here and if the device->name is different
> 		 *    from 'path' that means either
> 		 *      a. The same device disappeared and reappeared with
> 		 *         different name. or
> 		 *      b. The missing-disk-which-was-replaced, has
> 		 *         reappeared now.

If the FS is already mounted then there is no issue. It's when you're
trying to mount the FS that the fun occurs.

> 		 *
> 		 * We must allow 1 and 2a above. But 2b would be a spurious
> 		 * and unintentional.
>
> [...]
>
> The case is the 2a; in this case btrfs store the new name and mount it.
>
> Anyway I made a small test: I created 1 btrfs filesystem, and
> made a lvm-snapshot. Then create two different file in the snapshot and in
> the original one. I run a program which mounts randomly the first or
> the latter, checks if the correct file is present; after more than 130 tests I
> never saw your "50% chance it works": it always works.

One btrfs filesystem on two LVs with a snapshot of each LV also present.
So you'd have:

	lv00 - btrfs device 1
	lv01 - btrfs device 2
	lv00snap - snapshot of lv00
	lv01snap - snapshot of lv01

If you mount by device UUID then you get one of these results at random:

	lv00 + lv01 - OK
	lv00snap + lv01snap - also OK
	lv00 + lv01snap - failure
	lv00snap + lv01 - failure

2 failures, 2 successes = 50% failure rate.

If you mount by the name of one of the devices then you only get the
two rows of the above table that match the device you named, but you
still get one success row and one failure row.

Which result you get seems to depend on the order in which LVM enumerates
the LVs, so if you are doing a mount/umount loop then you won't see any
problems as btrfs will consistently make the same choice of LVs over and
over again. Rebooting or creating other LVs in between mounts will
definitely cause problems.

> BR
> G.Baroncelli
> >
> >> BR
> >> G.Baroncelli
> >>
> >>
> >> --
> >> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply [flat|nested] 64+ messages in thread
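The four combinations in the message above can be enumerated mechanically. The following is only a sketch of that counting argument in Python: the LV names are the ones used in the message, while the assumption that each member copy is picked independently (and the helper name `is_consistent`) is an illustrative model, not a description of what the kernel actually does.

```python
from itertools import product

# Each btrfs member device is visible twice: the origin LV and its snapshot.
# LV names come from the example in the thread; the model is an assumption.
candidates = {
    1: ["lv00", "lv00snap"],  # copies of btrfs device 1
    2: ["lv01", "lv01snap"],  # copies of btrfs device 2
}


def is_consistent(choice):
    """True if every chosen member is an origin, or every one is a snapshot."""
    kinds = {name.endswith("snap") for name in choice}
    return len(kinds) == 1


# One pick per member device, all combinations.
outcomes = [(choice, is_consistent(choice)) for choice in product(*candidates.values())]

for choice, ok in outcomes:
    print(" + ".join(choice), "-", "OK" if ok else "failure")

failure_rate = sum(not ok for _, ok in outcomes) / len(outcomes)
print(f"failure rate with independent picks: {failure_rate:.0%}")
```

With a single-device filesystem the same enumeration has no mixed rows at all, which is consistent with the single-LV test always succeeding.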
* Re: BTRFS messes up snapshot LV with origin 2014-11-25 20:29 ` Zygo Blaxell @ 2014-11-25 21:59 ` Goffredo Baroncelli 2014-11-25 22:21 ` Zygo Blaxell 2014-11-26 3:22 ` Duncan 0 siblings, 2 replies; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-25 21:59 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs

On 11/25/2014 09:29 PM, Zygo Blaxell wrote:
> On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
>> On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
>> [...]
>>> md-raid works as long as you specify the devices, and because it's always
>>> the lowest layer it can ignore LVs (snapshot or otherwise). It's also
>>> not a particularly common use case, while making an LV snapshot of a
>>> filesystem is a typical use case.
>>
>> I fully agree; but you still consider a *multi-device* btrfs over lvm...
>> This is like a dm over lvm... which doesn't make sense at all (as you
>> already wrote)
>
> It makes sense for btrfs because btrfs can productively use LVs on
> different PVs (e.g. btrfs-raid1 on two LVs, one on each PV). LVM is
> the bottom layer because not everything in the world is btrfs--things
> like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs
> (e.g. before running btrfsck) have to live on the same physical drives
> as the btrfs filesystems.

Let me summarize:

1) btrfs-single-disk on lvm works fine
2) btrfs-w/multiple-disk on lvm works fine
3) btrfs-single-disk on lvm works fine even with snapshot

4) btrfs-w/multiple-disk doesn't work with lvm AND snapshot

However I still don't understand why you want btrfs-w/multiple disk over LVM?

>
>>>>> and mounting the filesystem fails at 3.
>>>> Are you sure ?
>>>
>>> Yes, I'm sure. I've had to replace filesystems destroyed this way.

In a previous email you wrote:
>> Multi-device btrfs fails at 2,

So I assumed that point 3 onwards was related to a "single-disk" btrfs.

[...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-25 21:59 ` Goffredo Baroncelli @ 2014-11-25 22:21 ` Zygo Blaxell 2014-11-25 22:47 ` Chris Murphy ` (2 more replies) 2014-11-26 3:22 ` Duncan 1 sibling, 3 replies; 64+ messages in thread From: Zygo Blaxell @ 2014-11-25 22:21 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2765 bytes --]

On Tue, Nov 25, 2014 at 10:59:53PM +0100, Goffredo Baroncelli wrote:
> On 11/25/2014 09:29 PM, Zygo Blaxell wrote:
> > On Tue, Nov 25, 2014 at 05:34:15PM +0100, Goffredo Baroncelli wrote:
> >> On 11/23/2014 01:19 AM, Zygo Blaxell wrote:
> >> [...]
> >>> md-raid works as long as you specify the devices, and because it's always
> >>> the lowest layer it can ignore LVs (snapshot or otherwise). It's also
> >>> not a particularly common use case, while making an LV snapshot of a
> >>> filesystem is a typical use case.
> >>
> >> I fully agree; but you still consider a *multi-device* btrfs over lvm...
> >> This is like a dm over lvm... which doesn't make sense at all (as you
> >> already wrote)
> >
> > It makes sense for btrfs because btrfs can productively use LVs on
> > different PVs (e.g. btrfs-raid1 on two LVs, one on each PV). LVM is
> > the bottom layer because not everything in the world is btrfs--things
> > like ephemeral /tmp, boot, swap, and temporary backup copies of the btrfs
> > (e.g. before running btrfsck) have to live on the same physical drives
> > as the btrfs filesystems.
>
> Let me to summarize
>
> 1) btrfs-single-disk on lvm works fine
> 2) btrfs-w/multiple-disk on lvm works fine
> 3) btrfs-single-disk on lvm works fine even with snapshot
>
> 4) btrfs-w/multiple-disk doesn't work with lvm AND snapshot
>
> However I still doesn't understood why you want btrfs-w/multiple disk over LVM ?

I want to split a few disks into partitions, but I want to create,
move, and resize the partitions from time to time. Only LVM can do
that without taking the machine down, reducing RAID integrity levels,
hotplugging drives, or leaving installed drives idle most of the time.

I want btrfs-raid1 because of its ability to replace corrupted or lost
data from one disk using the other. If I run a single-volume btrfs
on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage
stack), I can detect lost data, but not replace it automatically from
the other mirror.

Since I want both things at the same time, I have btrfs w/multiple disks
on LVM.

The LVM snapshots are for providing an 'undo' capability when I experiment
with some btrfs or btrfsck feature that destroys the filesystem.

> >>>>> and mounting the filesystem fails at 3.
> >>>> Are you sure ?
> >>>
> >>> Yes, I'm sure. I've had to replace filesystems destroyed this way.
>
> In a previous email you wrote:
> >> Multi-device btrfs fails at 2,
> So I assumed that the point 3 onwards were related to a "single-disk" btrfs.
>
>
>
> [...]
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-25 22:21 ` Zygo Blaxell @ 2014-11-25 22:47 ` Chris Murphy [not found] ` <CAJCQCtQUM=viSoPtcJMcyKquYb1DLmEsqBi=p++uXPy63+r3Ow@mail.gmail.com> 2014-11-26 17:19 ` Goffredo Baroncelli 2 siblings, 0 replies; 64+ messages in thread From: Chris Murphy @ 2014-11-25 22:47 UTC (permalink / raw) To: linux-btrfs What happens when all btrfs LVs are unmounted, and you lvchange -an the LVs (the pair) you do not want mounted; and then btrfs dev scan; and then mount one of the devices? It should only find the matching LV because the others are deactivated. I know this isn't ideal, but it's better than corruption. Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin [not found] ` <20141126021134.GR17380@hungrycats.org> @ 2014-11-26 4:48 ` Chris Murphy 0 siblings, 0 replies; 64+ messages in thread From: Chris Murphy @ 2014-11-26 4:48 UTC (permalink / raw) To: Btrfs BTRFS

On Tue, Nov 25, 2014 at 7:11 PM, Zygo Blaxell <zblaxell@furryterror.org> wrote:
> On Tue, Nov 25, 2014 at 03:46:32PM -0700, Chris Murphy wrote:
>> What happens when all btrfs LVs are unmounted, and you lvchange -an
>> the LVs (the pair) you do not want mounted; and then btrfs dev scan;
>> and then mount one of the devices? It should only find the matching LV
>> because the others are deactivated. I know this isn't ideal, but it's
>> better than corruption.
>
> This is one of two possible ways to assemble the btrfs correctly.
> The other is to explicitly name all of the devices when mounting.

OK I didn't realize it was possible to explicitly name all of them, the
last time I'd tried this (about 9 epochs ago) mount didn't understand
being passed two devices before the mount point.

>
> The challenge for the poor end-user (or inexperienced sysadmin) is to
> defeat all the defaults in system installers, initramfs-tools, lvm2,
> udev, etc. to prevent btrfs from destroying a filesystem accidentally.

I agree if it finds two identical volumes it should fail to mount with
some coherent error.

--
Chris Murphy

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-25 22:21 ` Zygo Blaxell 2014-11-25 22:47 ` Chris Murphy [not found] ` <CAJCQCtQUM=viSoPtcJMcyKquYb1DLmEsqBi=p++uXPy63+r3Ow@mail.gmail.com> @ 2014-11-26 17:19 ` Goffredo Baroncelli 2014-11-27 4:15 ` Zygo Blaxell 2 siblings, 1 reply; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-26 17:19 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs

On 11/25/2014 11:21 PM, Zygo Blaxell wrote:
>> > However I still doesn't understood why you want btrfs-w/multiple disk over LVM ?
> I want to split a few disks into partitions, but I want to create,
> move, and resize the partitions from time to time. Only LVM can do
> that without taking the machine down, reducing RAID integrity levels,
> hotplugging drives, or leaving installed drives idle most of the time.
>
> I want btrfs-raid1 because of its ability to replace corrupted or lost
> data from one disk using the other. If I run a single-volume btrfs
> on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage
> stack), I can detect lost data, but not replace it automatically from
> the other mirror.

OK, now I understand.

Anyway, as a workaround, take into account that you can pass the
devices explicitly, as:

  mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt

(supposing that the filesystem is on /dev/sda.../dev/sdd)

I am working on a mount.btrfs helper. The aim of this helper is to manage
the assembly of multiple devices; the main points will be:
- wait until all the devices have appeared
- allow (if required) mounting in degraded mode after a timeout
- at this point it could/should also skip the lvm-snapshotted devices (but
  first I have to know how to recognize these)

I hope to post the patches next week.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply [flat|nested] 64+ messages in thread
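For scripts that need to apply this workaround to an arbitrary device list, the command line can be built programmatically. This is a minimal sketch in Python, not part of the proposed mount.btrfs helper: the function name `btrfs_mount_argv` is hypothetical, and the only thing taken from the message is the `device=` option syntax, with the last device used as the mount source.

```python
import shlex


def btrfs_mount_argv(devices, mountpoint):
    """Build a mount command that names every member device explicitly.

    Hypothetical helper: the last device becomes the mount source, the
    others are passed through repeated device= options, as in the thread.
    Paths containing commas are not handled.
    """
    if not devices:
        raise ValueError("at least one device is required")
    *others, primary = devices
    argv = ["mount"]
    if others:
        argv += ["-o", ",".join(f"device={d}" for d in others)]
    argv += [primary, mountpoint]
    return argv


argv = btrfs_mount_argv(["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"], "/mnt")
print(shlex.join(argv))
```

The sketch only prints the command; whether to run it (for example through `subprocess.run(argv, check=True)`) is left to the caller, since mounting the wrong set of devices is exactly the risk discussed in this thread.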
* Re: BTRFS messes up snapshot LV with origin 2014-11-26 17:19 ` Goffredo Baroncelli @ 2014-11-27 4:15 ` Zygo Blaxell 2014-11-28 17:05 ` Goffredo Baroncelli 0 siblings, 1 reply; 64+ messages in thread From: Zygo Blaxell @ 2014-11-27 4:15 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]

On Wed, Nov 26, 2014 at 06:19:05PM +0100, Goffredo Baroncelli wrote:
> On 11/25/2014 11:21 PM, Zygo Blaxell wrote:
> >> > However I still doesn't understood why you want btrfs-w/multiple disk over LVM ?
> > I want to split a few disks into partitions, but I want to create,
> > move, and resize the partitions from time to time. Only LVM can do
> > that without taking the machine down, reducing RAID integrity levels,
> > hotplugging drives, or leaving installed drives idle most of the time.
> >
> > I want btrfs-raid1 because of its ability to replace corrupted or lost
> > data from one disk using the other. If I run a single-volume btrfs
> > on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage
> > stack), I can detect lost data, but not replace it automatically from
> > the other mirror.
> OK, now I have understood.
>
> Anyway as workaround, take in account that you can pass explicitly the
> devices as:
>
> mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt
>
> (supposing that the filesystem is on /dev/sda.../dev/sdd)
>
> I am working to a mount.btrfs helper. The aim of this helper is to manage
> the assembling of multiple devices; the main points will be:
> - wait until all the devices appeared

...and make sure there are no duplicate UUIDs.

> - allow (if required) to mount in degraded mode after a timeout

This is a terrible idea with current btrfs, at least for read-write
degraded mounting (fallback to read-only degraded would be OK).
Mounting a filesystem read-write and degraded is something you only want
to do immediately before you replace all the missing disks and bring the
filesystem up to a non-degraded state and after you've ensured that the
missing disks can never, ever come back; otherwise, btrfs eats your data
in a slightly different way than we have discussed so far...

> - at this point it could/should also skip the lvm-snapshotted devices (but before
> I have to know how recognize these)

You don't have to recognize them as snapshots (and it's probably better
not to treat snapshots specially anyway--how do you know whether the
snapshot or the origin LVs are wanted for mounting?). You just have to
detect duplicate UUIDs at the btrfs subdevice level, and if any are found,
stop immediately (or get a hint from the admin).

This is a weakness of the current udev and asynchronous device hotplug
concept: there is no notion of bus enumeration in progress, so we can be
trying to assemble multi-device storage before we have all the devices
visible. Assembly of aggregate storage (whatever it is--btrfs, md,
lvm2...) has to wait until all known storage buses are fully enumerated
in order to detect if there are duplicates.

> I hope to issue the patches in the next week
>
> BR
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-27 4:15 ` Zygo Blaxell @ 2014-11-28 17:05 ` Goffredo Baroncelli 2014-11-29 1:25 ` Robert White 2014-11-29 4:59 ` Zygo Blaxell 0 siblings, 2 replies; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-28 17:05 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs

On 11/27/2014 05:15 AM, Zygo Blaxell wrote:
> On Wed, Nov 26, 2014 at 06:19:05PM +0100, Goffredo Baroncelli wrote:
>> On 11/25/2014 11:21 PM, Zygo Blaxell wrote:
>>>>> However I still doesn't understood why you want btrfs-w/multiple disk over LVM ?
>>> I want to split a few disks into partitions, but I want to create,
>>> move, and resize the partitions from time to time. Only LVM can do
>>> that without taking the machine down, reducing RAID integrity levels,
>>> hotplugging drives, or leaving installed drives idle most of the time.
>>>
>>> I want btrfs-raid1 because of its ability to replace corrupted or lost
>>> data from one disk using the other. If I run a single-volume btrfs
>>> on LVM-RAID1 (or dm-RAID1, or RAID1 at any other layer of the storage
>>> stack), I can detect lost data, but not replace it automatically from
>>> the other mirror.
>> OK, now I have understood.
>>
>> Anyway as workaround, take in account that you can pass explicitly the
>> devices as:
>>
>> mount -o device=/dev/sda,device=/dev/sdb,device=/dev/sdc /dev/sdd /mnt
>>
>> (supposing that the filesystem is on /dev/sda.../dev/sdd)
>>
>> I am working to a mount.btrfs helper. The aim of this helper is to manage
>> the assembling of multiple devices; the main points will be:
>> - wait until all the devices appeared
>
> ...and make sure there are no duplicate UUIDs.

Yes, in the end I implemented the "snapshot" detection this way: if two
autodetected devices have the same DISK_UUID (reported as UUID_SUB by
blkid), the mount process stops. I also check the num_devices field of
the superblock.

>
>> - allow (if required) to mount in degraded mode after a timeout
>
> This is a terrible idea with current btrfs, at least for read-write
> degraded mounting (fallback to read-only degraded would be OK).
> Mounting a filesystem read-write and degraded is something you only want
> to do immediately before you replace all the missing disks and bring the
> filesystem up to a non-degraded space and after you've ensured that the
> missing disks can never, ever come back; otherwise, btrfs eats your data
> in a slightly different way than we have discussed so far...

I don't care. If the user passes "degraded" in the mount options, he gets
it. Anyway, I hope that this (wrong) btrfs behavior will be fixed.

>
>> - at this point it could/should also skip the lvm-snapshotted devices (but before
>> I have to know how recognize these)
>
> You don't have to recognize them as snapshots (and it's probably better
> not to treat snapshots specially anyway--how do you know whether the
> snapshot or the origin LVs are wanted for mounting?). You just have to
> detect duplicate UUIDs at the btrfs subdevice level, and if any are found,
> stop immediately (or get a hint from the admin).

For the disk autodetection, I am still convinced that it is a "sane" default
to skip the lvm-snapshot

>
> This is a weakness of the current udev and asynchronous device hotplug
> concept: there is no notion of bus enumeration in progress, so we can be
> trying to assemble multi-device storage before we have all the devices
> visible. Assembly of aggregate storage (whatever it is--btrfs, md,
> lvm2...) has to wait until all known storage buses are fully enumerated
> in order to detect if there are duplicates.

It is more complex than that. Some devices may appear after the "1st" bus
enumeration.

>
>> I hope to issue the patches in the next week
>>
>> BR
>> G.Baroncelli
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-28 17:05 ` Goffredo Baroncelli @ 2014-11-29 1:25 ` Robert White 2014-11-29 7:35 ` Goffredo Baroncelli 2014-11-29 7:37 ` MegaBrutal 2014-11-29 4:59 ` Zygo Blaxell 1 sibling, 2 replies; 64+ messages in thread From: Robert White @ 2014-11-29 1:25 UTC (permalink / raw) To: kreijack, Zygo Blaxell; +Cc: linux-btrfs On 11/28/2014 09:05 AM, Goffredo Baroncelli wrote: > For the disk autodetection, I still convinced that it is a "sane" default > to skip the lvm-snapshot No... please don't... Maybe offer an option to select between snapshots or no-snapshots but in much the same way there is no _functional_ difference between a subvolume and a snapshot in btrfs, there is no "degenerate" status to an LVM snapshot. It would be way more useful if the helper dumped a message via stderr or syslog that said something like "UUID=xxxxxxxx ambiguous, must select between /dev/AA and /dev/BB using device= to mount filesystem." ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 1:25 ` Robert White @ 2014-11-29 7:35 ` Goffredo Baroncelli 2014-11-29 8:02 ` Robert White 2014-11-29 7:37 ` MegaBrutal 1 sibling, 1 reply; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-29 7:35 UTC (permalink / raw) To: Robert White, Zygo Blaxell; +Cc: linux-btrfs

On 11/29/2014 02:25 AM, Robert White wrote:
> On 11/28/2014 09:05 AM, Goffredo Baroncelli wrote:
>> For the disk autodetection, I still convinced that it is a "sane"
>> default to skip the lvm-snapshot
>
> No... please don't...
>
> Maybe offer an option to select between snapshots or no-snapshots but
> in much the same way there is no _functional_ difference between a
> subvolume and a snapshot in btrfs, there is no "degenerate" status to
> an LVM snapshot.

I agree with you; but I have to find a "default" so that during boot
a system can start even if snapshots are present.

Note also that there may be cases where multiple snapshots are present:
how should these be grouped? Maybe by generation number?

Anyway, for the moment my helper simply refuses to mount if there is a
dev_uuid conflict.

>
> It would be way more useful if the helper dumped a message via stderr
> or syslog that said something like "UUID=xxxxxxxx ambiguous,

This is what is printed when the helper finds a duplicate uuid:

ghigo@emulato:~$ sudo lvdisplay | grep "LV Path"
  LV Path /dev/test/lv01
  LV Path /dev/test/lv02
  LV Path /dev/test/lv02_snap
  LV Path /dev/test/lv01_snap
ghigo@emulato:~$ sudo mount /dev/test/lv01 /mnt/btrfs1/
ERROR: disk '/dev/mapper/test-lv01' and '/dev/mapper/test-lv01_snap' have the same disk uuid
ERROR: disk '/dev/mapper/test-lv02_snap' and '/dev/mapper/test-lv02' have the same disk uuid

> must
> select between /dev/AA and /dev/BB using device= to mount
> filesystem."

But anyway I can force the disk to mount:

ghigo@emulato:~$ sudo mount /dev/test/lv01_snap -o device=/dev/test/lv02_snap /mnt/btrfs1/

>
>
>

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply [flat|nested] 64+ messages in thread
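The refusal rule discussed in the preceding messages (stop when two visible devices claim to be the same btrfs member) can be sketched in a few lines. This is an illustration in Python under stated assumptions, not the code of the mount.btrfs helper: the tuple layout, the `group_members` function, the `AmbiguousDeviceError` class, and the sample identifiers `fs-A`, `dev-1`, `dev-2` are all hypothetical, and the scan results are hard-coded rather than read from blkid.

```python
from collections import defaultdict


class AmbiguousDeviceError(Exception):
    """Raised when two device paths claim the same filesystem member."""


def group_members(scanned):
    """Group scanned devices by filesystem, refusing duplicate members.

    scanned: iterable of (path, fsid, dev_uuid) tuples. In a real helper
    these would come from a device scan; here they are plain data.
    Returns {fsid: [paths]} or raises AmbiguousDeviceError.
    """
    by_member = defaultdict(list)
    for path, fsid, dev_uuid in scanned:
        by_member[(fsid, dev_uuid)].append(path)

    conflicts = [sorted(paths) for paths in by_member.values() if len(paths) > 1]
    if conflicts:
        detail = "; ".join(
            " and ".join(paths) + " have the same disk uuid" for paths in conflicts
        )
        raise AmbiguousDeviceError(detail)

    filesystems = defaultdict(list)
    for (fsid, _dev_uuid), paths in by_member.items():
        filesystems[fsid].append(paths[0])
    return dict(filesystems)


# Hypothetical scan result mirroring the lv01/lv02 example from the thread.
scan = [
    ("/dev/test/lv01", "fs-A", "dev-1"),
    ("/dev/test/lv02", "fs-A", "dev-2"),
    ("/dev/test/lv01_snap", "fs-A", "dev-1"),
    ("/dev/test/lv02_snap", "fs-A", "dev-2"),
]

try:
    group_members(scan)
except AmbiguousDeviceError as err:
    print("ERROR:", err)

# With only the origins visible there is nothing ambiguous.
print(group_members(scan[:2]))
```

The sketch deliberately makes no guess between origin and snapshot; picking one set is left to an explicit device list, in line with the objections raised in the thread.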
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 7:35 ` Goffredo Baroncelli @ 2014-11-29 8:02 ` Robert White 0 siblings, 0 replies; 64+ messages in thread From: Robert White @ 2014-11-29 8:02 UTC (permalink / raw) To: kreijack, Zygo Blaxell; +Cc: linux-btrfs

On 11/28/2014 11:35 PM, Goffredo Baroncelli wrote:
> I agree with you; but I have to find a "default" so during the boot
> a system can start even if snapshots are present.

No, you really _don't_ need to find such a default. Better a system that
doesn't boot than one that boots based on a guess.

I've been spending a lot of time thinking about booting while writing
underdog (http://underdog.sourceforge.net) and while booting is fragile,
an even partially incorrect boot is a system and _security_ nightmare.

If you start making preferential guesses then an intruder could trick the
system into booting from a thumb-drive or other alternate media by
coercing a UUID collision in a way that the system picks the new media.

Conflicts should _never_ be guessed at during boot. Ever.

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 1:25 ` Robert White 2014-11-29 7:35 ` Goffredo Baroncelli @ 2014-11-29 7:37 ` MegaBrutal 1 sibling, 0 replies; 64+ messages in thread From: MegaBrutal @ 2014-11-29 7:37 UTC (permalink / raw) To: linux-btrfs

2014-11-29 2:25 GMT+01:00 Robert White <rwhite@pobox.com>:
>
> On 11/28/2014 09:05 AM, Goffredo Baroncelli wrote:
>>
>> For the disk autodetection, I still convinced that it is a "sane" default
>> to skip the lvm-snapshot
>
>
> No... please don't...
>
> Maybe offer an option to select between snapshots or no-snapshots but in much the same way there is no _functional_ difference between a subvolume and a snapshot in btrfs, there is no "degenerate" status to an LVM snapshot.
>
> It would be way more useful if the helper dumped a message via stderr or syslog that said something like "UUID=xxxxxxxx ambiguous, must select between /dev/AA and /dev/BB using device= to mount filesystem."
>

I agree with this. Sometimes people will want to do exactly that: mount
the snapshot devices and not the origins. Listing devices in the device=
mount option sounds perfectly sane.

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-28 17:05 ` Goffredo Baroncelli 2014-11-29 1:25 ` Robert White @ 2014-11-29 4:59 ` Zygo Blaxell 2014-11-29 7:55 ` Robert White 1 sibling, 1 reply; 64+ messages in thread From: Zygo Blaxell @ 2014-11-29 4:59 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1147 bytes --]

On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote:
> On 11/27/2014 05:15 AM, Zygo Blaxell wrote:
> > This is a weakness of the current udev and asynchronous device hotplug
> > concept: there is no notion of bus enumeration in progress, so we can be
> > trying to assemble multi-device storage before we have all the devices
> > visible. Assembly of aggregate storage (whatever it is--btrfs, md,
> > lvm2...) has to wait until all known storage buses are fully enumerated
> > in order to detect if there are duplicates.
>
> It is more complex than that. Some devices may appear after the "1st" bus
> enumeration.

That case is well handled already--a new enumeration will start with the
second (and all later) hotplug events.

The problem arises when we try to assemble disk arrays before the
known end of the "1st" (or any) enumeration. There is no way for an
enumerating agent to tell other agents "this is definitely not the
complete list of devices yet, other devices may be inserted imminently"
and defer all the multi-device assembly until the address space of the
enumerating bus is fully covered.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 4:59 ` Zygo Blaxell @ 2014-11-29 7:55 ` Robert White 2014-12-01 15:25 ` Zygo Blaxell 0 siblings, 1 reply; 64+ messages in thread From: Robert White @ 2014-11-29 7:55 UTC (permalink / raw) To: Zygo Blaxell, Goffredo Baroncelli; +Cc: linux-btrfs On 11/28/2014 08:59 PM, Zygo Blaxell wrote: > On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote: >> On 11/27/2014 05:15 AM, Zygo Blaxell wrote: >>> This is a weakness of the current udev and asynchronous device hotplug >>> concept: there is no notion of bus enumeration in progress, so we can be >>> trying to assemble multi-device storage before we have all the devices >>> visible. Assembly of aggregate storage (whatever it is--btrfs, md, >>> lvm2...) has to wait until all known storage buses are fully enumerated >>> in order to detect if there are duplicates. >> >> It is more complex than that. Some devices may appear after the "1st" bus >> enumeration. > > That case is well handled already--a new enumeration will start with the > second (and all later) hotplug events. > > The problem arises when we try to assemble disk arrays before the > known end of the "1st" (or any) enumeration. There is no way for an > enumerating agent to tell other agents "this is definitely not the > complete list of devices yet, other devices may be inserted imminently" > and defer all the multi-device assembly until the address space of the > enumering bus is fully covered. > MDADM has an "attached" but not "started" state for arrays that handles this condition during incremental assembly. (see "mdadm --incremental /dev/whatever"), To slightly misuse the vocabulary, as each partition is encountered and submitted to the system it's checked for a superblock. If one is found then it has the identity of an array encoded on it and if that array doesn't exist it is allocated, otherwise the device is added to the existent array. 
The array is only started if all the devices are accounted for unless an option is added to allow earlier starts, and even then "enough" of the devices must be present to make sense (e.g. only one device missing from a RAID5, or a correct pair of devices for a RAID10 etc.) So we'd need a "partially assembled but not started" state and some ioctls to do things like force-start or force-disown a filesystem that cannot be "finished" automatically. That sort of thing is very easy to do with devices because devices don't have to be opened and can reject an open attempt, or at least the read/writes after an open and such. Unfortunately a filesystem can really only exist as a mounted thing, and can really only be controlled by remounting thereafter. The most efficient way to do this would be to have a alternate file system operations structure that was filled mostly with dummy operations that would return ENOENT and friends. Then the remount that finally fulfilled the file system's requirements would then switch out that struct for the fully functional one. That remount would need an "adddev=" and some other such options (much like AUFS adds layers). It;s all doable. But it stretches to near breaking the "mount" paradigm. You would need an operation that looked like "mount -t btrfs -o do_we_need_this /dev/whatever /this/datum/means/nothing" to match and attach a device "wherever it goes" or you might end up needing to do the Cartesian product of trial attachments of each new device to all active fileystems to match it up, which is an ugly external scripting requirement. As far as waiting for the address space to be fully covered. Meh. If a ready-or-not, or ready-enough, status is established in the file system it would be undesirable for it to know anything about any other subsystem. 
We don't care if enumeration is "done" we only care if we have a rational set of storage, and whether that rational set is "enough" to be fully ready, enough to be only read-ready, or just plain not enough. In theory, the idempotent mount command could be mount -t btrfs some-uuid-instead-of-device /mount/point mount -t btrfs some-other-uuid-here /other/mount/point to create the zero-devices involved entity, followed by mount -t btrfs -o trydev /dev/something /this/bit/is/ignored repeated for all possible somethings. /mount/point and /other/mount/point would be returning ENOENT for their contents until they were ready-enough. In practice this is very impure compared to how mdadm has the /dev/md- namespace in which to build its devices before any actual mount is possible. ^ permalink raw reply [flat|nested] 64+ messages in thread
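The "attached but not started" behaviour described in the message above can be sketched as a toy model. This is only an illustration of the idea, not mdadm's or btrfs's actual code; the names PendingArray, submit, expected and min_to_run are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class PendingArray:
    """Toy model of an array that is 'attached' but not yet 'started'."""
    array_id: str
    expected: int      # member count recorded in each member's superblock
    min_to_run: int    # fewest members that still make sense (e.g. N-1 for RAID5)
    members: set = field(default_factory=set)
    started: bool = False

    def add(self, device: str) -> None:
        self.members.add(device)
        # Auto-start only once every expected member has shown up.
        if len(self.members) == self.expected:
            self.started = True

    def force_start(self) -> bool:
        """Start degraded if 'enough' members are present; report the result."""
        if len(self.members) >= self.min_to_run:
            self.started = True
        return self.started


def submit(arrays: dict, device: str, array_id: str,
           expected: int, min_to_run: int) -> PendingArray:
    """Handle one hotplug event: allocate the array on first sight, else join it."""
    arr = arrays.setdefault(array_id, PendingArray(array_id, expected, min_to_run))
    arr.add(device)
    return arr
```

With a three-member array that tolerates one missing device, the first submitted device leaves the array attached but not started, and a forced start is refused; after the second device a forced start succeeds; the third device would start it automatically.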
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 7:55 ` Robert White @ 2014-12-01 15:25 ` Zygo Blaxell 0 siblings, 0 replies; 64+ messages in thread From: Zygo Blaxell @ 2014-12-01 15:25 UTC (permalink / raw) To: Robert White; +Cc: Goffredo Baroncelli, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2269 bytes --] On Fri, Nov 28, 2014 at 11:55:07PM -0800, Robert White wrote: > On 11/28/2014 08:59 PM, Zygo Blaxell wrote: > >On Fri, Nov 28, 2014 at 06:05:48PM +0100, Goffredo Baroncelli wrote: > >>On 11/27/2014 05:15 AM, Zygo Blaxell wrote: > >>>This is a weakness of the current udev and asynchronous device hotplug > >>>concept: there is no notion of bus enumeration in progress, so we can be > >>>trying to assemble multi-device storage before we have all the devices > >>>visible. Assembly of aggregate storage (whatever it is--btrfs, md, > >>>lvm2...) has to wait until all known storage buses are fully enumerated > >>>in order to detect if there are duplicates. > >> > >>It is more complex than that. Some devices may appear after the "1st" bus > >>enumeration. > > > >That case is well handled already--a new enumeration will start with the > >second (and all later) hotplug events. > > > >The problem arises when we try to assemble disk arrays before the > >known end of the "1st" (or any) enumeration. There is no way for an > >enumerating agent to tell other agents "this is definitely not the > >complete list of devices yet, other devices may be inserted imminently" > >and defer all the multi-device assembly until the address space of the > >enumering bus is fully covered. > > > MDADM has an "attached" but not "started" state for arrays that > handles this condition during incremental assembly. (see "mdadm > --incremental /dev/whatever"), > [...very complicated mdadm-architecture-invades-the-filesystem-layer > thing snipped...] I don't see why it can't all be done in user-space more or less the same way LVM does. 
Scan all the partitions known to be available, build a table of devices with UUIDs matching the target filesystem, check for sufficiency, check for uniqueness, and if the configuration passes all the sanity checks (or we have hints from the user that resolve ambiguity), submit the entire list of devices to the kernel as a BTRFS filesystem. If there are UUID duplicates or missing devices, submit nothing to the kernel at all. initramfs-less multi-disk configurations can calculate all that in advance and generate a rootflags parameter for the kernel command line. It's not necessary to resolve every possible situation in the kernel. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
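The user-space procedure proposed above (scan, build a table, check sufficiency, check uniqueness, then submit everything or nothing) might look roughly like the following. It is a sketch under stated assumptions: Member and plan_assembly are hypothetical names, the scan results are passed in as plain records rather than read from real superblocks, and the "hints from the user" path is left out.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Member(NamedTuple):
    path: str         # block device node found by the scan
    fsid: str         # filesystem UUID read from the superblock
    devid: int        # per-filesystem device number
    num_devices: int  # device count the superblock claims


def plan_assembly(scanned: list, target_fsid: str) -> Optional[list]:
    """Return the device list to hand to the kernel, or None to submit nothing."""
    by_devid = defaultdict(list)
    for m in scanned:
        if m.fsid == target_fsid:
            by_devid[m.devid].append(m)
    if not by_devid:
        return None
    # Uniqueness: two block devices claiming the same (fsid, devid) are
    # ambiguous, e.g. an LVM snapshot visible next to its origin.
    if any(len(claimants) > 1 for claimants in by_devid.values()):
        return None
    members = [claimants[0] for claimants in by_devid.values()]
    # Sufficiency: every device the superblock expects must be present.
    if len(members) != members[0].num_devices:
        return None
    return sorted(m.path for m in members)
```

A single-device filesystem with a visible snapshot yields None (nothing is submitted), while the same filesystem without the snapshot yields its one device path.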
* Re: BTRFS messes up snapshot LV with origin 2014-11-25 21:59 ` Goffredo Baroncelli 2014-11-25 22:21 ` Zygo Blaxell @ 2014-11-26 3:22 ` Duncan 2014-11-26 5:11 ` Chris Murphy 2014-11-26 22:08 ` Robert White 1 sibling, 2 replies; 64+ messages in thread From: Duncan @ 2014-11-26 3:22 UTC (permalink / raw) To: linux-btrfs Goffredo Baroncelli posted on Tue, 25 Nov 2014 22:59:53 +0100 as excerpted: > However I still doesn't understood why you want btrfs-w/multiple disk > over LVM ? While I'm not an LVM person here, and he already replied with essentially the same point, I think it's worth repeating... Btrfs' checksummed error detection and automatic rewrite from a different copy isn't a small thing, and simply isn't available at all with most would-be alternatives (zfs being the only similar thing I know of for Linux, and of course it has its own issues both technical and social/ legal/license). That alone is worth running multi-device btrfs to get. That makes btrfs a near-mandatory part of the picture, whatever it's on. And for people wanting LVM's volume management (including partitioning without many of the limitations), the direct result is multi-device btrfs on lvm. >From my perspective, however, btrfs is simply incompatible with lvm snapshots, because the basic assumptions are incompatible. Btrfs assumes UUIDs will be exactly what they say on the label, /unique/, while lvm's snapshot feature directly breaks that uniqueness by copying the (former) UUID, thus making the former UUID no longer unique and thus no longer truly UUID. Thus, part of the lvm /feature/ of snapshots is in direct contradiction to a basic assumption of btrfs, that UUIDs are exactly that, unique, making that feature directly incompatible with btrfs on a very basic level. So people can have their btrfs on lvm, but if they do, they have to forego LVM snapshots because btrfs isn't compatible with their usage. 
To me it's as simple as that, and people can choose either btrfs or lvm snapshots, but not both, it's one XOR the other. So for me it's simply choose the one you will have the most difficulty doing without and forgo the other one. Not a problem, just make your choice and move on. OTOH, there's that common signature about the reasonable man folding to the circumstance while the unreasonable man insisting on folding the circumstance to his wishes instead, so progress depends on the unreasonable man... But that's exactly what I see here, an unreasonable man insisting that entirely logical circumstance bend to his will. Which, given someone to actually code it up, it might well do. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-26 3:22 ` Duncan @ 2014-11-26 5:11 ` Chris Murphy 2014-11-26 22:08 ` Robert White 1 sibling, 0 replies; 64+ messages in thread From: Chris Murphy @ 2014-11-26 5:11 UTC (permalink / raw) To: Btrfs BTRFS On Tue, Nov 25, 2014 at 8:22 PM, Duncan <1i5t5.duncan@cox.net> wrote: > From my perspective, however, btrfs is simply incompatible with lvm > snapshots, because the basic assumptions are incompatible. Btrfs assumes > UUIDs will be exactly what they say on the label, /unique/, while lvm's > snapshot feature directly breaks that uniqueness by copying the (former) > UUID, thus making the former UUID no longer unique and thus no longer > truly UUID. The seed device has a mechanism to change volume UUID without rewriting a bunch of stuff in the original, the gotcha is that it requires adding a device. man fsfreeze says "fsfreeze is unncessary for device-mapper devices. The device-mapper (and LVM) automatically freezes filesystem on the device when a snapshot creation is requested." So if it's possible to communicate snapshotting/freezing to the fs at snapshot time, then maybe btrfs could 'btrfstune -S 1' the volume in the snapshot. That way that snapshot actually contains a btrfs seed device, which is read only. At least the snapshot copy isn't going to get obliterated in an accident; even though most people would probably want the origin LV to be protected while considering the snapshot disposable. -- Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-26 3:22 ` Duncan 2014-11-26 5:11 ` Chris Murphy @ 2014-11-26 22:08 ` Robert White 2014-11-27 9:08 ` Duncan 1 sibling, 1 reply; 64+ messages in thread From: Robert White @ 2014-11-26 22:08 UTC (permalink / raw) To: Duncan, linux-btrfs On 11/25/2014 07:22 PM, Duncan wrote: > From my perspective, however, btrfs is simply incompatible with lvm > snapshots, because the basic assumptions are incompatible. Btrfs assumes > UUIDs will be exactly what they say on the label, /unique/, while lvm's > snapshot feature directly breaks that uniqueness by copying the (former) > UUID, thus making the former UUID no longer unique and thus no longer > truly UUID. Thus, part of the lvm /feature/ of snapshots is in direct > contradiction to a basic assumption of btrfs, that UUIDs are exactly > that, unique, making that feature directly incompatible with btrfs on a > very basic level. A finer point here. LVM doesn't "copy" the UUID. An LVM snapshot is a copy-on-write entity so it _exposes_ the single sector(s) of the superblock(s) in both views of the underlying storage. This is universal to the idea of a snapshot. Just as a "btrfs subvol snap /old /new" exposes all the "unique" elements of "/old" under the name "/new" (in preparation for the user to implement subsequent divergence); "lvcreate --snapshot Old New" causes every block-N of Old to be identically available as block-N of New (in preparation for the user to implement subsequent divergence). In point of fact the LVM snapshot operation is a zero-copy operation at its heart. After the snapshot is established, when a block is modified in Old, its original content is saved in New. When blocks are written in New, they are written in place and the reference to the block content in Old is overwritten. This is the reason that fsfreeze is unnecessary for things above LVM snapshots as the instant-in-time divergence is _instant_. 
It's not that LVM goes out and does an fsfreeze equivalent action, it's that the switch to write-divergence is essentially atomic. A bunch of metadata is set up and then all-at-once one write behavior is switched with another by re-mapping the device access routines. So while you may have a point about btrfs being unprepared for LVM, neither party is particularly "at fault" in any way. The "damn you photocopier for making photocopies so identically" nature of your problem with LVM seems to be leading you to misplaced conclusions. If you need to harmonize these sorts of things, you need to be able to re-write blocks in question with disambiguating information (like new UUIDs) or restrict your accesses in some other manner. If you are waiting for someone to "code it up" perhaps you should do so. But it will _never_ be automatic because the use cases that don't match your expectations may need the founding assumptions to be as they are today. In other words, your belief that your position is "entirely logical" may be a little off, particularly if you think LVM is "Copying" things when it does a snapshot. As previously stated XFS solved this problem by providing a tool that would change the UUID of a file system. This tool could then be pointed at either (or both) the original and/or snapshot volumes as needed. I don't see a "re-make the btrfs" option for changing UUIDs and LVM doesn't care _at_ _all_ about what is actually in its volumes (okay, lvresize has some fsck nonsense, but that's just messy). It might even be "wrong" to try to harmonize those features, like trying to put a manual clutch into a car with an automatic transmission... it may just not fit. Given that BTRFS wants to play in the same level of abstraction as LVM, it's kind of a given that they'll butt heads over things like conflicting definitions of what it means to take a snapshot. ^ permalink raw reply [flat|nested] 64+ messages in thread
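The zero-copy, copy-on-write behaviour described in the message above can be illustrated with a toy block-device model. This is a sketch of the semantics only, not of device-mapper's implementation (which works on chunks and an on-disk exception store); the Origin and Snapshot classes are hypothetical.

```python
class Origin:
    """Toy origin volume: a list of blocks plus the snapshots sharing them."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def read(self, n):
        return self.blocks[n]

    def write(self, n, data):
        # Before overwriting, save the old content into every snapshot that
        # still shares this block, so the snapshot's view does not change.
        for snap in self.snapshots:
            snap.exceptions.setdefault(n, self.blocks[n])
        self.blocks[n] = data


class Snapshot:
    """Toy snapshot: starts with no private blocks at all (zero-copy)."""

    def __init__(self, origin):
        self.origin = origin
        self.exceptions = {}  # block number -> content that diverged
        origin.snapshots.append(self)

    def read(self, n):
        # Shared blocks are read straight from the origin.
        return self.exceptions.get(n, self.origin.blocks[n])

    def write(self, n, data):
        # Writes to the snapshot never touch the origin.
        self.exceptions[n] = data
```

Right after creation, block 0 (standing in for the superblock and its UUID) reads identically through both views without anything having been copied, which is exactly why a scanner sees the same filesystem UUID twice. The two views only differ once one side writes.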
* Re: BTRFS messes up snapshot LV with origin 2014-11-26 22:08 ` Robert White @ 2014-11-27 9:08 ` Duncan 2014-11-28 7:10 ` Chris Murphy 0 siblings, 1 reply; 64+ messages in thread From: Duncan @ 2014-11-27 9:08 UTC (permalink / raw) To: linux-btrfs Robert White posted on Wed, 26 Nov 2014 14:08:14 -0800 as excerpted: > On 11/25/2014 07:22 PM, Duncan wrote: >>>From my perspective, however, btrfs is simply incompatible with lvm >> snapshots, because the basic assumptions are incompatible. Btrfs >> assumes UUIDs will be exactly what they say on the label, /unique/, >> while lvm's snapshot feature directly breaks that uniqueness by copying >> the (former) UUID, thus making the former UUID no longer unique and >> thus no longer truly UUID. Thus, part of the lvm /feature/ of >> snapshots is in direct contradiction to a basic assumption of btrfs, >> that UUIDs are exactly that, unique, making that feature directly >> incompatible with btrfs on a very basic level. > > A finer point here. LVM doesn't "copy" the UUID. AN LVM snapshot is a > copy-on-write entity so it _exposes_ the single sector(s) of the > superblock(s) in both views of the underlying storage. I /hate/ it when this happens, which is why my posts often end up so long. People keep saying shorten them, but when I try, invariably I end up shortcutting something like this and get called on it! =:^( So, umm... kinda late now, but read that "copy" as if it had a footnote attached, saying "Yes, I know it's not actual copy, it's two views of the same thing using COW, but my point is, from the btrfs perspective it's a copy, the "universally UNIQUE ID" no longer looks "unique" and thus no longer can be properly called a UUID at all." Which kinda makes most of the rest of what you said, which I agree with in general were it the case that I actually thought of it as a literal copy, unnecessary... 
Tho I can't fault you for catching and pointing out my shortcut as an error, because you're absolutely correct in that case, and I'd almost certainly be doing the same thing were the situation reversed. > So while you may have a point about btrfs being unprepared for LVM, > neither party is particularly "at fault" in any way. > > The "damn you photocopier for making photocopies so identically" nature > of your problem with LVM seems to be leading you to misplaced > conclusions. Well, to the extent that I tried to take an unwarranted logical shortcut and didn't properly describe it... But... I'd still say LVM is "at fault" to the extent that anyone is, as it /knows/ it's dealing with UUIDs because after all that's part of what's /on/ what it's snapshotting, and it doesn't make any effort to deal with the situation, despite the at least theoretical (and now in fact) confusion that may occur when former UUIDs are no longer unique and thus no longer UUIDs. However, the point remains, they are pretty much incompatible, in that one assumes "unique" means that a second one won't pop up elsewhere and depends on exactly that, while the functionality of the other is exactly that, to make another view of the same thing, including the otherwise unique ID, pop up elsewhere, with COW semantics. > If you are waiting for someone to "code it up" perhaps you should do so. I'm not sure if that was the singular or plural "you", but in any case, it won't be /me/, because I'm not a coder, simply another sysadmin willing to guinea-pig this fascinating new filesystem toy. =:^) > As previously stated XFS solved this problem by providing a tool that > would change the UUID of a file system. This tool cold then be pointed > at either (or both) the original and/or snapshot volumes as needed. I think that'll eventually happen. 
Actually, I see it's on the wiki project ideas page, now (see 1.2.25 and 1.2.26, online/offline UUID changes, respectively): https://btrfs.wiki.kernel.org/index.php/Project_ideas There's even POC code. =:^) Wiki page history says Kdave added that on 06 Oct. 2014, so the entry is reasonably new, and the POC's encouraging, but will it go anywhere from there? > Given that BTRFS want's to play in the same level of abstraction as LVM, > its kind of a given that they'll butt heads over things like conflicting > definitions of what it means to take a snapshot. Agreed. Actually, given btrfs is already doing much of it, it'd be interesting if it eventually got the ability to specify where subvolumes went and limit them in size (ideally more directly than the existing btrfs quotas related functionality does, etc, thus avoiding having to rely on LVM for that and eliminating the need for it in scenarios where that's desired. Couple that with the better snapshot handling that is already in the works, and would there /still/ be a need for LVM under btrfs then; for what if so, and could it too be integrated into btrfs? -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-27 9:08 ` Duncan @ 2014-11-28 7:10 ` Chris Murphy 2014-11-29 7:29 ` Duncan 0 siblings, 1 reply; 64+ messages in thread From: Chris Murphy @ 2014-11-28 7:10 UTC (permalink / raw) Cc: Btrfs BTRFS On Thu, Nov 27, 2014 at 2:08 AM, Duncan <1i5t5.duncan@cox.net> wrote: > So, umm... kinda late now, but read that "copy" as if it had a footnote > attached, saying "Yes, I know it's not actual copy, it's two views of the > same thing using COW, but my point is, from the btrfs perspective it's a > copy, the "universally UNIQUE ID" no longer looks "unique" and thus no > longer can be properly called a UUID at all." The copy is sort of a misnomer anyway because up until the computer age the copy was a derivative, a facsimile, like a photocopy. But a copy of a digital file is actually another original. Therein lies the problem with the LVM snapshot in this context, we don't want another original. We want a copy, as in we want something we know has been derived from something else, and therefore can be discriminated. And that's the same problem with subvolume UUIDs being "reused" when creating new Btrfs volumes, which have new volume UUIDs, from a Btrfs seed device. There are now multiple originals of those subvolumes, there's no distinguishing them by their UUID alone. > But... I'd still say LVM is "at fault" to the extent that anyone is, as > it /knows/ it's dealing with UUIDs because after all that's part of > what's /on/ what it's snapshotting, and it doesn't make any effort to > deal with the situation, despite the at least theoretical (and now in > fact) confusion that may occur when former UUIDs are no longer unique and > thus no longer UUIDs. Well RFC 4122 I don't think would say it's not a UUID, the uniqueness is only guaranteed at the time of UUID creation. And duplication isn't creation so it's not going to say these things are no longer UUIDs, they're just UUIDs that have been recycled. 
That RFC doesn't specify workflow, but if it did, I think it'd basically say "oh crap, why'd you go and do that?" After all a major point of UUIDs is that they are effectively unlimited in quantity, therefore a.) we don't need central registry to avoid (unintended) collisions because they're so uncommon, b.) we're encouraged to not be attached to specific UUIDs when in doubt just create another one. A very good example of WTF reusage of a UUID that irks me to no end is GNU parted devs decided to recycle the Microsoft Windows Basic Data partition type GUID for Linux partitions. It's like watching someone get run over by a zamboni with 50 feet of advance notice... -- Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-28 7:10 ` Chris Murphy @ 2014-11-29 7:29 ` Duncan 2014-11-29 8:20 ` Robert White 0 siblings, 1 reply; 64+ messages in thread From: Duncan @ 2014-11-29 7:29 UTC (permalink / raw) To: linux-btrfs Chris Murphy posted on Fri, 28 Nov 2014 00:10:40 -0700 as excerpted: > On Thu, Nov 27, 2014 at 2:08 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> So, umm... kinda late now, but read that "copy" as if it had a footnote >> attached, saying "Yes, I know it's not actual copy, it's two views of >> the same thing using COW, but my point is, from the btrfs perspective >> it's a copy, the "universally UNIQUE ID" no longer looks "unique" and >> thus no longer can be properly called a UUID at all." > > The copy is sort of a misnomer anyway because up until the computer age > the copy was a derivative, a facsimile, like a photocopy. But a copy of > a digital file is actually another original. Therein lies the problem > with the LVM snapshot in this context, we don't want another original. > We want a copy, as in we want something we know has been derived from > something else, and therefore can be discriminated. Very good point. I had all the pieces but hadn't put them together yet, so thanks. =:^) > Well RFC 4122 I don't think would say it's not a UUID, the uniqueness is > only guaranteed at the time of UUID creation. And duplication isn't > creation so it's not going to say these things are no longer UUIDs, > they're just UUIDs that have been recycled. That RFC doesn't specify > workflow, but if it did, I think it'd basically say "oh crap, why'd you > go and do that?" After all a major point of UUIDs is that they are > effectively unlimited in quantity, therefore a.) we don't need central > registry to avoid (unintended) collisions because they're so uncommon, > b.) we're encouraged to not be attached to specific UUIDs when in doubt > just create another one. Another good point. 
One common and less RFC/technical way of putting it, that I had thought about a few times but hadn't actually posted yet IIRC, is the old "If it hurts when you bang your head against the wall, quit banging!" =:^) IOW, LVM could change the UUIDs in its "copies", COWing that bit in order to do so. While that wouldn't change the same UUIDs embedded in for instance btrfs internals it would provide a mechanism to keep initial scans from confusing things, and filesystems or other UUID applications that duplicated the number for their own internals would then need to provide tools that rewrote them to match the LVM-changed master location UUID. Those that failed to do so would fail to function unless/until the master location version was changed back, but the tools likely would eventually be provided, as I expect they will be here, but the difference would be at least it'd keep mixups like this from happening. > A very good example of WTF reusage of a UUID that irks me to no end is > GNU parted devs decided to recycle the Microsoft Windows Basic Data > partition type GUID for Linux partitions. It's like watching someone get > run over by a zamboni with 50 feet of advance notice... At least I don't have to worry about that one, since I no longer agree to "WE REFUSE TO TELL YOU SPECIFICALLY WHAT THIS SOFTWARE DOES AS WE DON'T SUPPLY THE SOURCES, BUT YOU ARE STILL REQUIRED TO ACCEPT ALL RESPONSIBILITY FOR IT, REGARDLESS OF WHAT IT DOES AND REGARDLESS OF WHETHER WE'VE BEEN WARNED" style EULAs, which is basically all of them, which means I have no legal way to run that software, so I don't. Note that the GPL among others has similar liability disclaimer wording (and to be fair it'd be hard not to, since the sources are there and the original author can hardly be held responsible for later modifications to them), but because it actually gives you the sources too, it allows you to fairly make your own decision about the responsibility you're about to take on. 
Since I can't/won't run pretty much anything proprietary, there's little chance of it being taken as anything but Linux, here. (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID. (0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 7:29 ` Duncan @ 2014-11-29 8:20 ` Robert White 2014-11-29 9:41 ` Duncan ` (2 more replies) 0 siblings, 3 replies; 64+ messages in thread From: Robert White @ 2014-11-29 8:20 UTC (permalink / raw) To: Duncan, linux-btrfs On 11/28/2014 11:29 PM, Duncan wrote: > Since I can't/won't run pretty much anything proprietary, there's little > chance of it being taken as anything but Linux, here. (Tho I actually > use (c)gdisk for partitioning here and it appears to use a different GUID. > (0700 in its short form which AFAIK is gdisk specific, for MS basic data, > while it uses 8300 for general Linux filesystems. I could look up the > long form GUIDs, but meh...) Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do with UUIDs. They are type codes. They aren't "short form" of anything else at all. In fact 0700 is the _long_ _form_ of the original code of "7", but in big-endian order now that it went from one byte to two. Microsoft started using pre-assigned UUIDs as "classes", e.g. type codes they could cram into their various registry files. If you actually read the registry you'll find a lot of places where "rational word" is defined as {some_uuid_here} and then elsewhere {some_uuid_here} has a bunch of data items attached to it. So gpartd didn't "reuse" microsoft UUIDs. In some/many of the older formats there was a code for "operating system data" (which I think is what 7 was originally). Others came by and said "since we're going to put in a type code for "linux swap" (82) then let's put in a code for linux data as well (83)", and all this before the whole byte expansion to turn these things from bytes into two-byte words. Once everybody else picked their own type codes for their data partitions, everybody just started calling "7" microsoft data. And linux doesn't care at all since it's noise since every partition just ends up as /dev/[sh]d? anyway. All this stuff has historical reasons. 
GNU/Linux attempts to be an egalitarian actor so it adapts to whatever you do. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 8:20 ` Robert White @ 2014-11-29 9:41 ` Duncan 2014-11-29 16:33 ` Robert White 2014-11-29 16:50 ` Robert White 2014-11-29 21:15 ` Chris Murphy 2 siblings, 1 reply; 64+ messages in thread From: Duncan @ 2014-11-29 9:41 UTC (permalink / raw) To: linux-btrfs Robert White posted on Sat, 29 Nov 2014 00:20:11 -0800 as excerpted: > On 11/28/2014 11:29 PM, Duncan wrote: >> (Tho I actually use (c)gdisk for partitioning here and it appears to >> use a different GUID. (0700 in its short form which AFAIK is gdisk >> specific, for MS basic data, while it uses 8300 for general Linux >> filesystems. I could look up the long form GUIDs, but meh...) > > Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do > with UUIDs. They are type codes. They aren't "short form" of anything > else at all. In fact 0700 is the _long_ _form_ of the original code of > "7", but in big-endian order now that it went from one byte to two. You obviously know where the short forms originated (MBR type codes), but you haven't the foggiest what you're talking about in relation to gdisk, where they're used as 4-hex-char entry shortcuts for the similar GPT/EFI GUIDs. Now that's what I expected with the mention of a different partition editor, thus my mention that they were shortcuts for GUIDs, apparently gdisk specific, but in gdisk they certainly ARE shortcuts to the various GUIDs and you certainly do *NOT* know what you're talking about saying they are not even related. >From the gdisk (8) manpage entry for the l/list action: l Display a summary of partition types. GPT uses a GUID to identify partition types for particular OSes and purposes. For ease of data entry, gdisk compresses these into two-byte (four-digit hexadecimal) values that are related to their equivalent MBR codes. Specifically, the MBR code is multiplied by hexadecimal 0x0100. For instance, the code for Linux swap space in MBR is 0x82, and it's 0x8200 in gdisk. 
A one-to-one correspondence is impossible, though. Most notably, the codes for all varieties of FAT and NTFS partition correspond to a single GPT code (entered as 0x0700 in sgdisk). Some OSes use a single MBR code but employ many more codes in GPT. For these, gdisk adds code numbers sequentially, such as 0xa500 for a FreeBSD disklabel, 0xa501 for FreeBSD boot, 0xa502 for FreeBSD swap, and so on. Note that these two-byte codes are unique to gdisk. See also the gdisk home page: http://www.rodsbooks.com/gdisk/ In particular, see the gdisk walkthru here: http://www.rodsbooks.com/gdisk/walkthrough.html ... and the gdisk manpage I quoted above here: http://www.rodsbooks.com/gdisk/gdisk.html So as I said, gdisk uses a 4-hexit short code based on the legacy MBR type-code as an easy entry and display form referencing the longer and much less human readable GUIDs, just like I said, and such usage is gdisk specific, just like I said I thought it was. And you might have known the legacy MBR type-codes from which they were derived, but obviously you had no idea what I was talking about here, and despite my saying it was gdisk specific you decided to simply claim I didn't know what I was talking about without actually checking the situation, despite my telling you exactly what app I was referring to and that I thought those references were app-specific, giving you plenty of chance to actually look it up yourself if you decided to, or simply not argue that point if you weren't interested in checking out the app- specific stuff. =:^( > Microsoft started using pre-assigned UUIDs as "classes", e.g. type codes > they could cram into their various registry files. If you actually read > the registry you'll find a lot of places where "rational word" is > defined as {some_uuid_here} and then eslwere {some_uuid_here} has a > bunch of data items attached to it. 
FWIW I know about the MS registry stuff from actually doing MS-registry and API related programming (hobbyist/VB level but using the regular API not just the VB exposed stuff) back before the turn of the century. I've not touched it in nearing a decade and a half now and my knowledge is consequently dated 9x vintage, but it obviously had the registry and I used to be /quite/ familiar with it, including of course the UUIDs. > So gpartd didn;t "reuse" microsoft UUIDs. > > In some/many of the older formats there was a code for "operating system > data" (which I think is what 7 was originally). Others came by and said > "since we're going to put in a type code for "linux swap" (82) then lets > put in a code for linux data as well (83), and all this before the whole > byte expansion to turn these things from bytes into two-byte words. > > Once everybody else picked their own type codes for their data > partitions, everybody just started calling "7" microsoft data. And linux > doesn't care at all since it's noise since every partition just ends up > as /dev/[sh]d? anyway. > > All this stuff has historical reasons. GNU/Linux attempts to be an > egalitarian actor so it adapts to whatever you do. This part I have no disagreement with... -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
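The relationship the quoted gdisk manpage describes (MBR code multiplied by 0x0100 gives the gdisk entry code, which in turn selects a partition type GUID) can be shown with a small lookup. This is a sketch, not gdisk's source: the function names are hypothetical, the table is deliberately partial, and the GUID values are taken from the published GPT partition type list rather than from this thread.

```python
# Partial table: gdisk four-hex-digit entry code -> GPT partition type GUID.
GDISK_TYPE_GUIDS = {
    0x0700: "EBD0A0A2-B9E5-4433-87C0-68B6B72699C7",  # Microsoft basic data
    0x8200: "0657FD6D-A4AB-43C4-84E5-0933C84B4F4F",  # Linux swap
    0x8300: "0FC63DAF-8483-4772-8E79-3D69D8477DE4",  # Linux filesystem data
    0xEF00: "C12A7328-F81F-11D2-BA4B-00A0C93EC93B",  # EFI system partition
}


def mbr_to_gdisk(mbr_code: int) -> int:
    """Apply the manpage rule: the MBR type code multiplied by 0x0100."""
    return mbr_code * 0x0100


def type_guid(gdisk_code: int):
    """Look up the type GUID a gdisk code stands for (None if not in the table)."""
    return GDISK_TYPE_GUIDS.get(gdisk_code)
```

For example, the MBR swap code 0x82 maps to the gdisk code 0x8200, and the codes 0x0700 and 0x8300 select two different type GUIDs, which is the distinction being argued over in this subthread.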
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 9:41 ` Duncan @ 2014-11-29 16:33 ` Robert White 0 siblings, 0 replies; 64+ messages in thread From: Robert White @ 2014-11-29 16:33 UTC (permalink / raw) To: Duncan, linux-btrfs On 11/29/2014 01:41 AM, Duncan wrote: > Robert White posted on Sat, 29 Nov 2014 00:20:11 -0800 as excerpted: > l Display a summary of partition types. GPT uses a GUID to > identify partition types for particular OSes and purposes. For > ease of data entry, gdisk compresses these into two-byte > (four-digit hexadecimal) values that are related to their > equivalent MBR codes. Specifically, the MBR code is multiplied > by hexadecimal 0x0100. That EFI uses GUIDs is one thing. That the standard allows these to be selected based on type codes originally derived from ms-dos partition type codes ("compressed" is the wrong word) is something else. If they were "compressed" then it would be a relationship that could represent any GUID at all. It's marginally hashed, in that there is a table lookup, but it's not properly hashed, as the "hash function" is undefined for virtually all possible input values. The other partition GUID is actually more interesting. > So as I said, gdisk uses a 4-hexit short code based on the legacy MBR > type-code as an easy entry and display form referencing the longer and > much less human readable GUIDs, just like I said, and such usage is gdisk > specific, just like I said I thought it was. Which is not what you said. None of the above was mentioned in the email to which I responded. What you actually said :: [QUOTE] Since I can't/won't run pretty much anything proprietary, there's little chance of it being taken as anything but Linux, here. (Tho I actually use (c)gdisk for partitioning here and it appears to use a different GUID. (0700 in its short form which AFAIK is gdisk specific, for MS basic data, while it uses 8300 for general Linux filesystems. I could look up the long form GUIDs, but meh...) 
[/QUOTE] None of which is "gdisk specific", and all of which is based on EFI and the GUID partition table. What I mistakenly attributed to you and was key to my initial response was your extension of Chris Murphy: >>> Chris Murphy posted on Fri, 28 Nov 2014 00:10:40 -0700 as excerpted: >>>> A very good example of WTF reusage of a UUID that irks me to no end is >>>> GNU parted devs decided to recycle the Microsoft Windows Basic Data >>>> partition type GUID for Linux partitions. It's like watching someone get >>>> run over by a zamboni with 50 feet of advance notice... [So my bad there on the quoting...] The irking there being dumb because the universally used "type GUID" has nothing to do with the second GUID that universally identifies the partition regardless of type. But here is the thing... for all the screed about open and closed source... (and I am an open source guy myself) The actual EFI standard dictates these partition numbers and whatnot so if you used the microsoft tools you'd get the same results. http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs AND microsoft was one of several principal players in the EFI and its GUID partition subparts. So his being "irked to no end" and your agreement and "that's why I used gdisk" response are both completely misplaced, and potentially misleading to others. I just went a little off the rails while trying to explain. /D'oh. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 8:20 ` Robert White 2014-11-29 9:41 ` Duncan @ 2014-11-29 16:50 ` Robert White 2014-11-30 6:46 ` Duncan 2014-11-29 21:15 ` Chris Murphy 2 siblings, 1 reply; 64+ messages in thread From: Robert White @ 2014-11-29 16:50 UTC (permalink / raw) To: linux-btrfs To those reading along who don't already know. My explanation below is factually inadequate or wrong in various places... The "type codes" as presented in the various EFI/GUID disk partitioning tools as 0700, 8200, 8300, EF02, and so on are never written to disk as such. They are short-hand values (chosen to be deliberately similar to the MS-DOS partitioning type codes of 07, 82, 83, etc) to select standardized GUIDs for the partition type field. So there is the two-digit code from the ms-dos partitioning scheme, then there are the four-digit codes that let you select which type GUID will be written in an EFI partition scheme. The question of "reuse" is still improper as the type codes were assigned by the EFI standard for specific use as type codes. The EFI tool used (gdisk, or windows disk partitioning tool, etc) is immaterial as the result codes are selected by standard. I could have, and should have, been _way_ more clear, and/or less wrong. 8-) http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs On 11/29/2014 12:20 AM, Robert White wrote: > On 11/28/2014 11:29 PM, Duncan wrote: >> Since I can't/won't run pretty much anything proprietary, there's little >> chance of it being taken as anything but Linux, here. (Tho I actually >> use (c)gdisk for partitioning here and it appears to use a different >> GUID. >> (0700 in its short form which AFAIK is gdisk specific, for MS basic data, >> while it uses 8300 for general Linux filesystems. I could look up the >> long form GUIDs, but meh...) > > Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do > with UUIDs. They are type codes. 
They aren't "short form" of anything > else at all. In fact 0700 is the _long_ _form_ of the original code of > "7", but in big-endian order now that it went from one byte to two. > > Microsoft started using pre-assigned UUIDs as "classes", e.g. type codes > they could cram into their various registry files. If you actually read > the registry you'll find a lot of places where "rational word" is > defined as {some_uuid_here} and then elsewhere {some_uuid_here} has a > bunch of data items attached to it. > > So gpartd didn't "reuse" microsoft UUIDs. > > In some/many of the older formats there was a code for "operating system > data" (which I think is what 7 was originally). Others came by and said > "since we're going to put in a type code for "linux swap" (82) then let's > put in a code for linux data as well (83), and all this before the whole > byte expansion to turn these things from bytes into two-byte words. > > Once everybody else picked their own type codes for their data > partitions, everybody just started calling "7" microsoft data. And linux > doesn't care at all since it's noise since every partition just ends up > as /dev/[sh]d? anyway. > > All this stuff has historical reasons. GNU/Linux attempts to be an > egalitarian actor so it adapts to whatever you do. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 64+ messages in thread
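For readers following the type-code discussion: the relationship described in the correction above (MBR code times 0x0100 gives the four-digit short code, which is then looked up in a table to get the type GUID that is actually written to disk) can be sketched roughly as below. This is an illustration only, not gdisk's code. The function names and the five-entry table are assumptions made for the example; the real list is much longer.

```python
import uuid

# A few well-known GPT partition type GUIDs, keyed by the four-digit
# short code shown by gdisk. The table is deliberately tiny.
SHORT_CODE_TO_TYPE_GUID = {
    0x0700: uuid.UUID("EBD0A0A2-B9E5-4433-87C0-68B6B72699C7"),  # Microsoft basic data
    0x8200: uuid.UUID("0657FD6D-A4AB-43C4-84E5-0933C84B4F4F"),  # Linux swap
    0x8300: uuid.UUID("0FC63DAF-8483-4772-8E79-3D69D8477DE4"),  # Linux filesystem
    0xEF00: uuid.UUID("C12A7328-F81F-11D2-BA4B-00A0C93EC93B"),  # EFI System
    0xEF02: uuid.UUID("21686148-6449-6E6F-744E-656564454649"),  # BIOS boot
}


def mbr_to_short_code(mbr_code: int) -> int:
    """Derive the four-digit short code from a one-byte MBR type code."""
    if not 0 <= mbr_code <= 0xFF:
        raise ValueError("MBR type codes are a single byte")
    return mbr_code * 0x0100


def type_guid_for_mbr_code(mbr_code: int):
    """Look up the type GUID; returns None when the table has no entry."""
    return SHORT_CODE_TO_TYPE_GUID.get(mbr_to_short_code(mbr_code))


if __name__ == "__main__":
    for code in (0x07, 0x82, 0x83):
        print(f"{code:02X} -> {mbr_to_short_code(code):04X} -> "
              f"{type_guid_for_mbr_code(code)}")
```

The lookup returning None for most inputs is the point made in the thread: the short code is a table index, not a compressed or hashed form of the GUID.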
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 16:50 ` Robert White @ 2014-11-30 6:46 ` Duncan 0 siblings, 0 replies; 64+ messages in thread From: Duncan @ 2014-11-30 6:46 UTC (permalink / raw) To: linux-btrfs Robert White posted on Sat, 29 Nov 2014 08:50:57 -0800 as excerpted: > To those reading along who don't already know. My explanation below is > factually inadequate or wrong in various places... > > The "type codes" as presented in the various EFI/GUID disk partitioning > tools as 0700, 8200, 8300, EF02, and so on are never written to disk as > such. They are short-hand values (chosen to be deliberately similar to > the MS-DOS partitioning type codes of 07, 82, 83, etc) to select > standardized GUIDs for the partition type field. > I could have, and should have, been _way_ more clear, and/or less wrong. > 8-) > > http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs Thanks. While I guess we all end up eating humble pie occasionally, you handled it with rather more grace than I often do, and by taking such a hard line myself I didn't make it as easy as I might have. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-29 8:20 ` Robert White 2014-11-29 9:41 ` Duncan 2014-11-29 16:50 ` Robert White @ 2014-11-29 21:15 ` Chris Murphy 2 siblings, 0 replies; 64+ messages in thread From: Chris Murphy @ 2014-11-29 21:15 UTC (permalink / raw) To: Robert White; +Cc: Duncan, Btrfs BTRFS On Sat, Nov 29, 2014 at 1:20 AM, Robert White <rwhite@pobox.com> wrote: > On 11/28/2014 11:29 PM, Duncan wrote: >> >> Since I can't/won't run pretty much anything proprietary, there's little >> chance of it being taken as anything but Linux, here. (Tho I actually >> use (c)gdisk for partitioning here and it appears to use a different GUID. >> (0700 in its short form which AFAIK is gdisk specific, for MS basic data, >> while it uses 8300 for general Linux filesystems. I could look up the >> long form GUIDs, but meh...) > > > Partition type codes (e.g. 0700, 8300, EF00, etc) have _nothing_ to do with > UUIDs. They are type codes. They aren't "short form" of anything else at > all. In fact 0700 is the _long_ _form_ of the original code of "7", but in > big-endian order now that it went from one byte to two. No, that's not correct. These four digit type codes are a user-facing friendly type code; the actual on-disk "partitiontype GUID" is a UUID in that at the time of creation that UUID followed RFC 4122 so it was unique: no one else was using the UUID. That UUID in the context of a partitiontype GUID is intended to describe the purpose of that partition: what OS, what file system, where it should mount or be used for, etc. This is elaborately detailed in the GPT (GUID partition table) portion of the UEFI specification. A 128-bit type code is rather difficult for humans to remember and interact with, hence gdisk and recently fdisk now use a four digit type code as a front end for the partitiontypeGUID. The selection of four digits was to account for the fact there are many many many more type codes now possible, essentially unlimited. 
This is a case where UUIDs are reused effectively. > Microsoft started using pre-assigned UUIDs as "classes", e.g. type codes > they could cram into their various registry files. If you actually read the > registry you'll find a lot of places where "rational word" is defined as > {some_uuid_here} and then elsewhere {some_uuid_here} has a bunch of data items > attached to it. > > So gpartd didn't "reuse" microsoft UUIDs. GNU parted absolutely re-used partitiontypeGUID EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 for Linux, by default. This you know as gdisk (and friends) type code 0700. It's the same thing as using type code 07 on an MBR partitioned disk instead of 83. It's ridiculous that this happened considering we had distinction on MBR with limited type code availability, and on GPT with unlimited type codes the decision was to use an already existing type code, EBD0A0A2-B9E5-4433-87C0-68B6B72699C7. http://www.rodsbooks.com/linux-fs-code/ The Linux partitiontype GUID is now 0FC63DAF-8483-4772-8E79-3D69D8477DE4. And actually some others have been created also for encryption, RAID, LVM, swap, and a pile of GUIDs from the 'discoverable partitions spec' hosted at freedesktop.org for autodiscovery by systemd. Only very recent versions of parted support code 0FC63DAF-8483-4772-8E79-3D69D8477DE4. > All this stuff has historical reasons. GNU/Linux attempts to be an > egalitarian actor so it adapts to whatever you do. With respect to this particular reuse of a Windows type code, it did a total face plant on adaptation. The very decision to reuse that GUID was a huge, weird mistake that we'll live with for years to come. Data loss will result from it. And then it was made worse, upon recognition that the conflict was probably not a good idea, to undermine patching GNU parted in a timely manner. The patch to fix the problem, from the gdisk author, sat around for two years before parted upstream merged it. There really isn't good diplomatic language to use for this. 
Some people flat out dropped the ball, and just didn't give a crap. -- Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
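To make the two GUIDs in a GPT partition entry concrete (the type GUID that says what the partition is for, and the second GUID that identifies the individual partition), here is a minimal sketch that pulls both out of a raw 128-byte entry. It assumes the entry layout from the UEFI specification: type GUID in bytes 0-15, unique partition GUID in bytes 16-31, both in the mixed-endian form that Python's uuid module exposes as bytes_le. The names parse_gpt_entry and KNOWN_TYPES are hypothetical.

```python
import uuid

# The two type GUIDs at the centre of the parted discussion.
KNOWN_TYPES = {
    uuid.UUID("EBD0A0A2-B9E5-4433-87C0-68B6B72699C7"): "Microsoft basic data",
    uuid.UUID("0FC63DAF-8483-4772-8E79-3D69D8477DE4"): "Linux filesystem",
}


def parse_gpt_entry(entry: bytes):
    """Return (type_guid, unique_guid, type_name) for one GPT entry.

    Only the first 32 bytes are examined; LBAs, attributes and the
    partition name that follow are ignored in this sketch.
    """
    if len(entry) < 32:
        raise ValueError("a GPT partition entry holds at least two GUIDs")
    type_guid = uuid.UUID(bytes_le=bytes(entry[0:16]))
    unique_guid = uuid.UUID(bytes_le=bytes(entry[16:32]))
    return type_guid, unique_guid, KNOWN_TYPES.get(type_guid, "unknown")


if __name__ == "__main__":
    # Build a fake entry: Linux filesystem type plus a random unique GUID.
    linux = uuid.UUID("0FC63DAF-8483-4772-8E79-3D69D8477DE4")
    fake = linux.bytes_le + uuid.uuid4().bytes_le + bytes(96)
    print(parse_gpt_entry(fake))
```

Two partitions of the same kind share the type GUID by design, while the unique GUID is the one expected to differ between partitions.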
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 15:42 ` Phillip Susi 2014-11-18 19:17 ` Chris Murphy @ 2014-11-18 20:41 ` MegaBrutal 2014-11-19 1:29 ` Robert White 2 siblings, 0 replies; 64+ messages in thread From: MegaBrutal @ 2014-11-18 20:41 UTC (permalink / raw) To: linux-btrfs 2014-11-18 16:42 GMT+01:00 Phillip Susi <psusi@ubuntu.com>: > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 11/18/2014 1:16 AM, Chris Murphy wrote: > > If fstab specifies rootfs as UUID, and there are two volumes with > > the same UUID, it’s now ambiguous which one at boot time is the > > intended rootfs. It’s no different than the days of /dev/sdXY where > > X would change designations between boots = ambiguity and why we > > went to UUID. > > He already said he has NOT rebooted, so there is no way that the > snapshot has actually been mounted, even if it were UUID confusion. > That's right. Anyway, I've built a system to reproduce the bug. You can download the image and run it with KVM or other virtualization technology. Instructions are straightforward – if you start the VM, you'll know what to do, and you'll see what I was talking about. http://undead.megabrutal.com/kvm-reproduce-1391429.img.xz Download size: 113 MB; Unpacked image size: 2 GB. > > So we kinda need a way to distinguish derivative volumes. Maybe > > XFS and ext4 could easily change the volume UUID, but my vague > > recollection is this is difficult on Btrfs? So that led me to the > > idea of a way to create an on-the-fly (but consistent) “virtual > > volume UUID” maybe based on a hash of both the LVM LV and fs > > volume UUID. > > When using LVM, you should be referring to the volume by the LVM name > rather than UUID. LVM names are stable, and don't have the duplicate > uuid problem. > I use LVM names to identify volumes. I initially suspected a UUID confusion, because I thought grub-probe looks for the volume by UUID. But now I think the problem has nothing to do with UUIDs. 
Probably I should have looked deeper into the problem before I hypothesized. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 15:42 ` Phillip Susi 2014-11-18 19:17 ` Chris Murphy 2014-11-18 20:41 ` MegaBrutal @ 2014-11-19 1:29 ` Robert White 2014-11-19 3:37 ` Duncan 2 siblings, 1 reply; 64+ messages in thread From: Robert White @ 2014-11-19 1:29 UTC (permalink / raw) To: Phillip Susi, Chris Murphy; +Cc: Btrfs BTRFS On 11/18/2014 07:42 AM, Phillip Susi wrote: > On 11/18/2014 1:16 AM, Chris Murphy wrote: >> (stuff about UUIDs and LVM snapshots). > (suggestion to use LVM paths instead). This is also an XFS+LVM+LVM_Snapshot problem going back to at least 2009. It's inherent to the block-device-level snapshot phenomenon. q.v. http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/ et al In XFS you attack the snapshot with a command to regenerate the UUID as soon as you take the snapshot. I don't think there is a "regenerate all my UUIDs" command for BTRFS. There are other places this can bone you, like old-format mdadm mirrors, where the metadata was only at the end of the partition so you could accidentally see two copies of your RAID1 file system if you hadn't built/started the array. There is no really good way to prevent this other than "being really careful" or "not doing that at all". Sorry. Cost of doing business. Cheers... Rob. ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-19 1:29 ` Robert White @ 2014-11-19 3:37 ` Duncan 0 siblings, 0 replies; 64+ messages in thread From: Duncan @ 2014-11-19 3:37 UTC (permalink / raw) To: linux-btrfs Robert White posted on Tue, 18 Nov 2014 17:29:12 -0800 as excerpted: > On 11/18/2014 07:42 AM, Phillip Susi wrote: > >> On 11/18/2014 1:16 AM, Chris Murphy wrote: >>> (stuff about UUIDs and LVM snapshots). > > (suggestion to use LVM paths instead). > > This is also an XFS+LVM+LVM_Snapshot problem going back to at least > 2009. It's inherent to the block-device-level snapshot phenomonia. > > q.v. http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/ et al > > In XFS you attack the snapshot with a command to regenerate the UUID as > soon as you take the snapshot. I don't think there is a "regenerate all > my UUIDs" command for BTRFS. Which was part of my point in my reply. Btrfs embeds the UUID in the metadata deeply enough that it's no simple task to simply change it to something else and be done. It's quite a complicated operation for any (future, none current) tool that attempts it, with the most likely candidate being an option to btrfs balance or the like, but even then, we're looking at a timescale of hours for spinning rust. So while it's possible in theory, in practice such a regenerate-all UUIDs command for btrfs isn't available yet, and given the time involved in rewriting all those metadata UUIDs to something else, during which the filesystem's in a critically unstable state, and the limited use-case with other alternatives, such a tool isn't all /that/ practical in any case. Making an entirely new btrfs and doing a btrfs send/receive for the duplicate, or using btrfs snapshots, is a more practical way to go. (Tho watch out for the implications of btrfs snapshots on nocow files!) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." 
Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
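As a small illustration of where the filesystem UUID discussed above lives, the sketch below reads the fsid out of a btrfs primary superblock. It assumes the commonly documented on-disk layout (primary superblock at byte offset 64 KiB, fsid at offset 0x20 within it, magic "_BHRfS_M" at offset 0x40); the function names are hypothetical, and this is not how btrfs-progs does it. The same fsid is, as far as I understand the format, also repeated in the header of every metadata block, which is the reason changing it amounts to rewriting all the metadata.

```python
import uuid

BTRFS_SUPERBLOCK_OFFSET = 0x10000  # primary superblock, 64 KiB into the device
BTRFS_MAGIC = b"_BHRfS_M"
FSID_OFFSET = 0x20                 # 16-byte filesystem UUID
MAGIC_OFFSET = 0x40                # 8-byte magic string


def fsid_from_superblock(superblock: bytes) -> uuid.UUID:
    """Extract the filesystem UUID from raw superblock bytes."""
    magic = bytes(superblock[MAGIC_OFFSET:MAGIC_OFFSET + len(BTRFS_MAGIC)])
    if magic != BTRFS_MAGIC:
        raise ValueError("not a btrfs superblock (magic mismatch)")
    return uuid.UUID(bytes=bytes(superblock[FSID_OFFSET:FSID_OFFSET + 16]))


def fsid_from_device(path: str) -> uuid.UUID:
    """Read the primary superblock from a device or image file (read-only)."""
    with open(path, "rb") as dev:
        dev.seek(BTRFS_SUPERBLOCK_OFFSET)
        return fsid_from_superblock(dev.read(4096))
```

Run against an origin LV and its LVM snapshot, both calls would be expected to return the same value, since the snapshot is a block-level copy of the same superblock.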
* Re: BTRFS messes up snapshot LV with origin 2014-11-17 19:04 ` Goffredo Baroncelli [not found] ` <CAE8gLh=VubBbZdeKTAuWRjOxPF7C+ouUeeVvmGfT2ckYWGhQVA@mail.gmail.com> @ 2014-11-21 4:24 ` Zygo Blaxell 1 sibling, 0 replies; 64+ messages in thread From: Zygo Blaxell @ 2014-11-21 4:24 UTC (permalink / raw) To: Goffredo Baroncelli; +Cc: Brendan Hide, linux-btrfs, bug-grub [-- Attachment #1: Type: text/plain, Size: 2507 bytes --] On Mon, Nov 17, 2014 at 08:04:05PM +0100, Goffredo Baroncelli wrote: > On 2014-11-17 07:59, Brendan Hide wrote: > > > > That leaves two aspects of this issue which I view as two separate bugs: > > a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all. > > b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID. > > > > I feel a) is a btrfs bug. > > I feel b) is a bug that is more about "ecosystem design" than grub being silly. > > Regarding a) > IIRC, btrfs collects the filesystem information by UUID; if two > filesystems have the same UUID (like the LVM-snapshot case), the > last filesystem discovered overwrite the first one. > > The filesystem discovering is done in user-space; so it should be simple > to skip a filesystem on a LVM-snapshot. > > Regarding b) > I am bit confused: if I understood correctly, the root filesystem was > picked from a LVM-snapshot, so grub-probe *correctly* reported that > the root device is the snapshot. > The problem was that during the boot filesystem discovering: first > scanned the *real* device, then the LVM-snapshot; the latter > overwrote the former so the system booted from the LVM-snapshot. IMHO if the device UUID search finds multiple devices with the same device UUID, it should ignore _all_ of them as the identification problem is unsolvable without further user input. This is what the 'device=' mount option is for. 
> My conclusion is that we should improve the btrfs scan so: > - in udev rules, a partition that is a LVM snapshot by default > should be not scanned by "btrfs dev scan" > - "btrfs dev scan", during the partition discovery should skip the > lvm-snapshot. That would mean I can't do this: 1. lvm snapshot of ext4 filesystem 2. btrfs-convert the snapshot 3. mount the snapshot, make sure it's OK 4. merge LVM snapshot to overwrite original ext4 filesystem which would be a shame since that's the only way I ever convert ext3/4 filesystems to btrfs (btrfs-convert is a little buggy still). > BR > G.Baroncelli > > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 64+ messages in thread
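The "ignore all of them" policy suggested above for duplicate device UUIDs can be sketched in a few lines. This is only an illustration of the idea, not btrfs device-scan code: the function name is made up, and the input is assumed to be a plain list of (device UUID, device path) pairs that some scanner has already collected.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def resolve_unique_devices(
    seen: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Split scanned devices into usable and ambiguous sets.

    A device UUID that shows up on exactly one path is usable. A UUID
    that shows up on several paths is refused entirely, so the caller
    has to name a device explicitly (for example with the device=
    mount option mentioned in the thread).
    """
    paths_by_uuid: Dict[str, List[str]] = defaultdict(list)
    for dev_uuid, path in seen:
        paths_by_uuid[dev_uuid].append(path)

    usable = {u: p[0] for u, p in paths_by_uuid.items() if len(p) == 1}
    ambiguous = {u: p for u, p in paths_by_uuid.items() if len(p) > 1}
    return usable, ambiguous


if __name__ == "__main__":
    scan = [
        ("aaaa", "/dev/vg/root"),       # origin LV (hypothetical names)
        ("aaaa", "/dev/vg/root-snap"),  # its LVM snapshot, same UUID
        ("bbbb", "/dev/sdb1"),
    ]
    print(resolve_unique_devices(scan))
```

Unlike "last one scanned wins", this never picks the snapshot silently, and it keeps the btrfs-convert-on-a-snapshot workflow possible as long as the device is named explicitly.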
* Re: BTRFS messes up snapshot LV with origin 2014-11-17 6:59 ` Brendan Hide 2014-11-17 7:35 ` Daniel Dressler 2014-11-17 19:04 ` Goffredo Baroncelli @ 2014-11-18 6:21 ` Chris Murphy 2014-11-18 12:13 ` Duncan 2014-11-18 20:01 ` Goffredo Baroncelli 2 siblings, 2 replies; 64+ messages in thread From: Chris Murphy @ 2014-11-18 6:21 UTC (permalink / raw) Cc: Btrfs BTRFS, bug-grub On Nov 16, 2014, at 11:59 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > cc'd bug-grub@gnu.org for FYI > > On 2014/11/17 03:42, Duncan wrote: >> MegaBrutal posted on Sun, 16 Nov 2014 22:35:26 +0100 as excerpted: >> >>> Hello guys, >>> >>> I think you'll like this... >>> https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1391429 >> UUID is an initialism for "Universally Unique IDentifier".[1] >> >> If the UUID isn't unique, by definition, then, it can't be a UUID, and >> that's a bug in whatever is making the non-unique would-be UUID that >> isn't unique and thus cannot be a universally unique ID. In this case >> that would appear to be LVM. > Perhaps the right question to ask is "Where should this bug be fixed?”. > > TL;DR: This needs more thought and input from btrfs devs. To LVM, the bug is likely seen as being "out of scope". The "correct" fix probably lies in the ecosystem design, which requires co-operation from btrfs. I think the libblkid folks should be brought into this discussion, see what their take on this. LVM conventional snapshots causing this problem is rare / self-limiting as they’re short lived. LVM thinp snapshots mean there can be dozens, and they can sanely endure for the life of the thin pool. Effectively we have derivative volumes. At snapshot time, should a.) the fs volume UUID be changed; b.) each fs adds an additional/secondary volume UUID at snapshot time; c.) each fs adds a derivative/version indicator, i.e. 0 at mkfs time and maybe epoch time stamped at snapshot time; d.) 
not use fs UUID for identifying volumes uniqueness, instead use a virtual volume UUID which is externally determined based on whether the fs is on an LV snapshot. > Making a snapshot in LVM is a fundamental thing - and I feel LVM, in making its snapshot, is doing its job "exactly as expected". > > Additionally, there are other ways to get to a similar state without LVM: ddrescue backup, SAN snapshot, old "missing" disk re-introduced, etc. Sure and likewise self limiting problem. LVM thinp snapshots actually do make this confusion of multiple instances of the same volume UUID much much more likely. > > That leaves two places where this can be fixed: grub and btrfs The GRUB os-prober and grub-mkconfig paradigm I think needs to come to an end. The grub.cfg is not supposed to be externally modified, the design is that os-prober + grub-mkconfig obliterate it and generate a whole new one from scratch anytime the system boot state changes, i.e. anytime a new kernel is added. GRUB isn’t good at OS discovery now, I think it should just be abandoned. It can have its grub.cfg generated to do whatever complex things are needed, but the individual boot menu entries should exist as drop-in scripts managed by whatever is changing the OS boot state. This is the fundamental part of the two bootloaderspecs: http://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/ http://www.freedesktop.org/wiki/MatthewGarrett/BootLoaderSpec/ And it’s a fundamental part of OSTree which supports multiple bootable trees on any filesystem, and currently uses a variation on bootloaderspec drop-in scripts to inform GRUB how to boot such a system: https://wiki.gnome.org/action/show/Projects/OSTree?action=show&redirect=OSTree > > Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. 
So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code. > > Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying "LVM snapshots are incompatible with btrfs" is the right way to go either. > > That leaves two aspects of this issue which I view as two separate bugs: > a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all. > b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID. > > I feel a) is a btrfs bug. > I feel b) is a bug that is more about "ecosystem design" than grub being silly. I think we’re well past the expiration date on grub.cfg, a line should be drawn in the sand to deprecate routine use of os-prober + grub-mkconfig, and move to drop-in scripts by whatever the distro presumes will be responsible for managing what “tree” will be booted or will be offered as a boot option, all GRUB needs to learn is how to use that drop in script file format. Ergo just because I’ve snapshot my root does not mean grub-mkconfig should be creating boot entries for it. But whatever usespace tool I’m using to do those snapshots (ostree, snapper, whatever the GNOME folks might come up with) should be the thing that creates the boot entry script; or as simple as this 2-4 line script should be, even hand done by a user, unlike the current grub.cfg file format. Further I’d like to get more traction from the syslinux/extlinux folks to support the same drop-in boot file format. There’s no good reason for us to not support a single file format for boot menu entries. Chris Murphy ^ permalink raw reply [flat|nested] 64+ messages in thread
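For a sense of how small the drop-in boot entries described above are, here is a sketch that renders one in the key/value style used by the freedesktop.org Boot Loader Specification (title, version, linux, initrd, options). The function name, the kernel paths and the LVM root name are made-up example values, and where the file gets written is left to whatever tool manages the snapshots.

```python
def render_boot_entry(title: str, version: str, linux: str,
                      initrd: str, options: str) -> str:
    """Return the text of one drop-in boot menu entry."""
    lines = [
        f"title {title}",
        f"version {version}",
        f"linux {linux}",
        f"initrd {initrd}",
        f"options {options}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical entry for a root filesystem addressed by LVM name
    # rather than by filesystem UUID.
    print(render_boot_entry(
        title="Example Linux (root snapshot 2014-11-16)",
        version="3.16.0-24-generic",
        linux="/vmlinuz-3.16.0-24-generic",
        initrd="/initrd.img-3.16.0-24-generic",
        options="root=/dev/mapper/vg-root--snap ro",
    ), end="")
```

A snapshot tool would write one such file per bootable tree and delete it when the snapshot goes away, with no regeneration of grub.cfg.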
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 6:21 ` Chris Murphy @ 2014-11-18 12:13 ` Duncan 2014-11-18 20:01 ` Goffredo Baroncelli 1 sibling, 0 replies; 64+ messages in thread From: Duncan @ 2014-11-18 12:13 UTC (permalink / raw) To: linux-btrfs; +Cc: bug-grub Chris Murphy posted on Mon, 17 Nov 2014 23:21:57 -0700 as excerpted: > I think we’re well past the expiration date on grub.cfg, a line should > be drawn in the sand to deprecate routine use of os-prober + > grub-mkconfig, > and move to drop-in scripts by whatever the distro presumes will be > responsible for managing what “tree” will be booted or will be offered > as a boot option, all GRUB needs to learn is how to use that drop in > script file format. > > Ergo just because I’ve snapshot my root does not mean grub-mkconfig > should be creating boot entries for it. But whatever usespace tool I’m > using to do those snapshots (ostree, snapper, whatever the GNOME folks > might come up with) should be the thing that creates the boot entry > script; or as simple as this 2-4 line script should be, even hand done > by a user, unlike the current grub.cfg file format. FWIW, I hand-edit my grub.cfg here, grub-probe was taking /forever/ on my system back when I upgraded to grub2, and the "direct drive" configuration of direct grub.cfg editing was /far/ more flexible, or at least /far/ easier to learn how to do what I wanted to do than to figure out how to do it thru the translation layer, in any case. The configuration is advanced enough it has individual choices to set standard init and init=/bin/bash, current/fallback/stable kernels, current/backup/second-backup roots, etc, plus a choice to interactively type in additional kernel commandline options, loading those choices into grub variables as I change them, then another choice to boot using the loaded variables to select the kernel and setup the kernel commandline. 
The initial grub.cfg has the default boot option, plus others that load either a troubleshooting menu or the backups choices menu, from separate included config files, as necessary. Just /thinking/ about trying to do that via the cumbersome translation layer gives me a headache, and since I had to learn the grub scripting layer language to set it up anyway, I might as well just write and troubleshoot it in that directly rather than trying to figure out how to get the translation layer to write it, and then have to troubleshoot BOTH the translation layer and the lower level script. Then I deleted grub-probe and grub-mkconfig so they couldn't be run accidentally with unconfigured/default translation-level options to undo all my hard work, and set a mask on them so updating the package wouldn't reinstall them. So deprecate/kill os-prober and grub-mkconfig if you want, but grub.cfg needs to stay working! -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin 2014-11-18 6:21 ` Chris Murphy 2014-11-18 12:13 ` Duncan @ 2014-11-18 20:01 ` Goffredo Baroncelli 1 sibling, 0 replies; 64+ messages in thread From: Goffredo Baroncelli @ 2014-11-18 20:01 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS, bug-grub On 2014-11-18 07:21, Chris Murphy wrote: > Ergo just because I’ve snapshot my root does not mean grub-mkconfig > should be creating boot entries for it. I find this a useful feature: a snapshot of / is made in order to roll back some changes, so why not let grub start (the kernel) from it? Anyway, I find grub-mkconfig quite useful for a "standard" user. For more advanced use cases, editing grub.cfg by hand may be possible. BR G.Baroncelli -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: BTRFS messes up snapshot LV with origin @ 2014-11-17 8:00 MegaBrutal 0 siblings, 0 replies; 64+ messages in thread From: MegaBrutal @ 2014-11-17 8:00 UTC (permalink / raw) To: linux-btrfs 2014-11-17 7:59 GMT+01:00 Brendan Hide <brendan@swiftspirit.co.za>: > > Grub is already a little smart here - it avoids snapshots. But in this case it is relying on the UUID and only finding it in the snapshot. So possibly this is a bug in grub affecting the bug reporter specifically - but perhaps the bug is in btrfs where grub is relying on btrfs code. Yesterday, when I reproduced the phenomenon on a VM, I've found something rather interesting thing: even /proc/mounts reports incorrectly, that the snapshot is being mounted instead of the root FS. Note, there were no reboot. Just create an LVM snapshot and then check /proc/mounts. I couldn't reproduce the same with non-root file systems. It seems this only appears when the device in question is mounted as root FS. > Yes, I'd rather use btrfs' snapshot mechanism - but this is often a choice that is left to the user/admin/distro. I don't think saying "LVM snapshots are incompatible with btrfs" is the right way to go either. Before I did a release upgrade, just to be safe, I made both (LVM and btrfs snapshot). > > That leaves two aspects of this issue which I view as two separate bugs: > a) Btrfs cannot gracefully handle separate filesystems that have the same UUID. At all. > b) Grub appears to pick the wrong filesystem when presented with two filesystems with the same UUID. > > I feel a) is a btrfs bug. > I feel b) is a bug that is more about "ecosystem design" than grub being silly. > > I imagine a couple of aspects that could help fix a): > - Utilise a "unique drive identifier" in the btrfs metadata (surely this exists already?). This way, any two filesystems will always have different drive identifiers *except* in cases like a ddrescue'd copy or a block-level snapshot. 
This will provide a sensible mechanism for "defined behaviour", preventing corruption - even if that "defined behaviour" is to simply give out lots of "PEBKAC" errors and panic. > - Utilise a "drive list" to ensure that two unrelated filesystems with the same UUID cannot get "mixed up". Yes, the user/admin would likely be the culprit here (perhaps a VM rollout process that always gives out the same UUID in all its filesystems). Again, does btrfs not already have something like this built-in that we're simply not utilising fully? > > I'm not exactly sure of the "correct" way to fix b) except that I imagine it would be trivial to fix once a) is fixed. Note that everything that is written into the file system's metadata gets duplicated with an LVM snapshot. So a "unique drive identifier" wouldn't solve the problem, as it would also get replicated, and BTRFS would still see two identical devices. But devices on Linux have major and minor numbers those uniquely identify devices while they are attached. The original and the snapshot device have different major/minor numbers, and it would be quite enough to differentiate the devices while they are being opened/mounted. By the way, I actually made an entire release upgrade with the snapshot being there and being reported incorrectly. This would have caused enough corruption in the file system that I would have surely noticed it. But I didn't perceive any data corruption. BTRFS didn't actually write to the snapshot device. It seems the device is only mixed up in /proc/mounts, so probably the problem is not so severe as we think, and wouldn't require fundamental changes to BTRFS to fix it. ^ permalink raw reply [flat|nested] 64+ messages in thread
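The point above about major/minor numbers can be checked from userspace. The sketch below compares the device numbers of two block-device nodes, for example an origin LV and its snapshot; the helper names and the example paths are hypothetical. It only inspects the device nodes themselves, and says nothing about what the kernel reports in /proc/mounts.

```python
import os
import stat
from typing import Tuple


def split_device_number(rdev: int) -> Tuple[int, int]:
    """Split a raw device number into (major, minor)."""
    return os.major(rdev), os.minor(rdev)


def block_device_number(path: str) -> Tuple[int, int]:
    """Return (major, minor) of a block-device node, following symlinks."""
    st = os.stat(path)
    if not stat.S_ISBLK(st.st_mode):
        raise ValueError(f"{path} is not a block device")
    return split_device_number(st.st_rdev)


def is_same_block_device(path_a: str, path_b: str) -> bool:
    """True when both paths resolve to the same attached block device."""
    return block_device_number(path_a) == block_device_number(path_b)


if __name__ == "__main__":
    # Example paths only; substitute the real origin and snapshot LVs.
    for dev in ("/dev/vg/root", "/dev/vg/root-snap"):
        try:
            print(dev, block_device_number(dev))
        except (OSError, ValueError) as exc:
            print(dev, "skipped:", exc)
```

Two LVs carrying identical filesystem metadata still get different (major, minor) pairs while attached, which is the distinction being proposed.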
Thread overview: 64+ messages
2014-11-16 21:35 BTRFS messes up snapshot LV with origin MegaBrutal
2014-11-17 1:42 ` Duncan
2014-11-17 6:59 ` Brendan Hide
2014-11-17 7:35 ` Daniel Dressler
2014-11-17 9:00 ` Brendan Hide
2014-11-17 19:04 ` Goffredo Baroncelli
[not found] ` <CAE8gLh=VubBbZdeKTAuWRjOxPF7C+ouUeeVvmGfT2ckYWGhQVA@mail.gmail.com>
2014-11-17 19:45 ` Fwd: " MegaBrutal
2014-11-17 20:32 ` Goffredo Baroncelli
2014-11-18 6:16 ` Chris Murphy
2014-11-18 15:42 ` Phillip Susi
2014-11-18 19:17 ` Chris Murphy
2014-11-18 20:17 ` Phillip Susi
2014-11-19 2:54 ` Chris Murphy
2014-11-19 15:20 ` Phillip Susi
2014-11-19 18:35 ` Chris Murphy
2014-11-19 19:23 ` Phillip Susi
2014-11-21 4:28 ` Zygo Blaxell
2014-11-21 6:22 ` Duncan
2014-11-21 11:35 ` Robert White
2014-11-21 11:54 ` Duncan
2014-11-21 17:56 ` Zygo Blaxell
2014-11-21 23:09 ` Duncan
2014-11-21 18:23 ` Chris Murphy
2014-11-21 22:49 ` Duncan
2014-11-21 23:41 ` Duncan
2014-11-21 23:51 ` Duncan
2014-11-22 17:34 ` Goffredo Baroncelli
2014-11-23 0:19 ` Zygo Blaxell
2014-11-25 16:34 ` Goffredo Baroncelli
2014-11-25 20:29 ` Zygo Blaxell
2014-11-25 21:59 ` Goffredo Baroncelli
2014-11-25 22:21 ` Zygo Blaxell
2014-11-25 22:47 ` Chris Murphy
[not found] ` <CAJCQCtQUM=viSoPtcJMcyKquYb1DLmEsqBi=p++uXPy63+r3Ow@mail.gmail.com>
[not found] ` <20141126021134.GR17380@hungrycats.org>
2014-11-26 4:48 ` Chris Murphy
2014-11-26 17:19 ` Goffredo Baroncelli
2014-11-27 4:15 ` Zygo Blaxell
2014-11-28 17:05 ` Goffredo Baroncelli
2014-11-29 1:25 ` Robert White
2014-11-29 7:35 ` Goffredo Baroncelli
2014-11-29 8:02 ` Robert White
2014-11-29 7:37 ` MegaBrutal
2014-11-29 4:59 ` Zygo Blaxell
2014-11-29 7:55 ` Robert White
2014-12-01 15:25 ` Zygo Blaxell
2014-11-26 3:22 ` Duncan
2014-11-26 5:11 ` Chris Murphy
2014-11-26 22:08 ` Robert White
2014-11-27 9:08 ` Duncan
2014-11-28 7:10 ` Chris Murphy
2014-11-29 7:29 ` Duncan
2014-11-29 8:20 ` Robert White
2014-11-29 9:41 ` Duncan
2014-11-29 16:33 ` Robert White
2014-11-29 16:50 ` Robert White
2014-11-30 6:46 ` Duncan
2014-11-29 21:15 ` Chris Murphy
2014-11-18 20:41 ` MegaBrutal
2014-11-19 1:29 ` Robert White
2014-11-19 3:37 ` Duncan
2014-11-21 4:24 ` Zygo Blaxell
2014-11-18 6:21 ` Chris Murphy
2014-11-18 12:13 ` Duncan
2014-11-18 20:01 ` Goffredo Baroncelli
-- strict thread matches above, loose matches on Subject: below --
2014-11-17 8:00 MegaBrutal