* good documentation on btrfs internals and on disk layout
From: sri @ 2016-03-30 13:58 UTC
To: linux-btrfs

Hi,

I could find only very limited documentation on the on-disk layout of btrfs
and on how all of its trees relate to each other. Apart from the wiki, which
covers only high-level details, I wasn't able to find anything more in-depth
on btrfs.

For filesystems such as ZFS, ext3/4 and XFS there are documents that explain
the on-disk layout. Could anybody please provide pointers to the equivalent
for btrfs, for a better understanding of its on-disk layout and of how the
trees interact when multiple disks are configured?

Thank you in advance
* Re: good documentation on btrfs internals and on disk layout
From: Liu Bo @ 2016-03-30 17:28 UTC
To: sri; +Cc: linux-btrfs

On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
> Hi,
>
> I could find only very limited documentation on the on-disk layout of btrfs
> and on how all of its trees relate to each other. Apart from the wiki, which
> covers only high-level details, I wasn't able to find anything more in-depth
> on btrfs.
>
> For filesystems such as ZFS, ext3/4 and XFS there are documents that explain
> the on-disk layout. Could anybody please provide pointers to the equivalent
> for btrfs, for a better understanding of its on-disk layout and of how the
> trees interact when multiple disks are configured?

There is a paper[1] about the btrfs filesystem which covers all the details.

[1]: BTRFS: The Linux B-Tree Filesystem

Thanks,

-liubo
* Re: good documentation on btrfs internals and on disk layout
From: Dave Stevens @ 2016-03-30 18:11 UTC
To: bo.li.liu; +Cc: sri, linux-btrfs

Quoting Liu Bo <bo.li.liu@oracle.com>:

> There is a paper[1] about the btrfs filesystem which covers all the details.
>
> [1]: BTRFS: The Linux B-Tree Filesystem

and this is where it is:

http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

D

--
"As long as politics is the shadow cast on society by big business, the
attenuation of the shadow will not change the substance." -- John Dewey
* Re: good documentation on btrfs internals and on disk layout
From: Hugo Mills @ 2016-03-30 18:43 UTC
To: sri; +Cc: linux-btrfs

On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
> Could anybody please provide pointers to the equivalent for btrfs, for a
> better understanding of its on-disk layout and of how the trees interact
> when multiple disks are configured?

What are you intending to do? You'll need different things depending on
whether you are, for example, using the BTRFS_TREE_SEARCH ioctl online to
gather high-level information, or working your way through the datapaths
from the superblock right down to individual bytes of a file for offline
access.

If you're using BTRFS_TREE_SEARCH, for example, you won't need to know
anything about the superblocks or the way that the trees are implemented.
In fact, it's a good idea if you can avoid getting into those details at
all.

The high-level view of how the data model fits together is at [1].
Individual structures referenced in there are best examined in ctree.h for
the details, although there's a little more detailed description at [2].

There's some documentation on the basic APIs used for reading the btrees
at [3].

If you really _have_ to access the trees yourself, the tree structure is
at [4], but see my comment above about that.

The way that the FS-tree metadata is put together to make up POSIX
directory structures is at [5].

After all that, you're down to looking at the data structures in ctree.h,
and grepping through the source code to see how they're used (which is how
[1] was written in the first place).

Hugo.

[1] https://btrfs.wiki.kernel.org/index.php/Data_Structures
[2] https://btrfs.wiki.kernel.org/index.php/On-disk_Format
[3] https://btrfs.wiki.kernel.org/index.php/Code_documentation
[4] https://btrfs.wiki.kernel.org/index.php/Btrfs_design
[5] https://btrfs.wiki.kernel.org/index.php/Trees

--
Hugo Mills             | "There's more than one way to do it" is not a
hugo@... carfax.org.uk | commandment. It is a dire warning.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
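As a minimal sketch of the BTRFS_TREE_SEARCH route mentioned above: the
ioctl is exposed as BTRFS_IOC_TREE_SEARCH in <linux/btrfs.h>, takes a tree
id plus a key range, and returns packed (header, item data) pairs in a
buffer of roughly 4 KiB. The example below walks the items of tree 1 (the
root tree) on a mounted filesystem; the mount-point path is only a
placeholder, pagination and most error handling are omitted, and it needs
CAP_SYS_ADMIN. Treat it as a sketch to check against your local
<linux/btrfs.h>, not a reference implementation.

/*
 * Sketch: list items in the btrfs root tree (tree id 1) via
 * BTRFS_IOC_TREE_SEARCH.  Needs CAP_SYS_ADMIN.  No pagination: if the
 * tree has more items than fit in the result buffer, a real tool would
 * re-issue the search starting just past the last returned key.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>          /* BTRFS_IOC_TREE_SEARCH, search structs */

int main(int argc, char **argv)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_header *sh;
	char *p;
	unsigned int i;
	int fd;

	fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY);  /* any path on the fs */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	args.key.tree_id = 1;                 /* root tree */
	args.key.max_objectid = (__u64)-1;    /* min_* fields left at 0 */
	args.key.max_offset = (__u64)-1;
	args.key.max_transid = (__u64)-1;
	args.key.max_type = 255;              /* item types are 8-bit on disk */
	args.key.nr_items = 4096;             /* "as many as fit" */

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0) {
		perror("BTRFS_IOC_TREE_SEARCH");
		close(fd);
		return 1;
	}

	/* Results are packed into args.buf as (header, item data) pairs;
	 * nr_items is rewritten to the number of items actually returned. */
	p = args.buf;
	for (i = 0; i < args.key.nr_items; i++) {
		sh = (struct btrfs_ioctl_search_header *)p;
		printf("objectid %llu type %u offset %llu len %u\n",
		       (unsigned long long)sh->objectid, sh->type,
		       (unsigned long long)sh->offset, sh->len);
		p += sizeof(*sh) + sh->len;
	}

	close(fd);
	return 0;
}

Run against any path on a mounted filesystem, this prints one line per
root-tree item, which is usually enough to correlate the wiki's
data-structures page with what is actually on a live filesystem.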
* Re: good documentation on btrfs internals and on disk layout
From: Yauhen Kharuzhy @ 2016-04-05 17:53 UTC
To: linux-btrfs

Hello,

I am trying to understand the btrfs logic for mounting a multi-device
filesystem when the device generations differ. All my questions relate to
the case of RAID5/6 for system, metadata, and data.

The kernel can currently mount an FS with differing device generations
(for example, if a drive was physically removed before the last unmount
and returned afterwards), but scrub will then report uncorrectable errors
(although a second run shows no errors). Does any documentation exist
about the algorithm for handling multiple devices in such a case? Is the
case of differing device generations allowed in general, and what are the
worst cases here?

What should happen if a device is removed and returned some time later
while the filesystem is online? Should some kind of device reopening be
possible, or is the only way to guarantee FS consistency to mark such a
device as missing and replace it?

If any sources of such information exist (other than the btrfs code),
please point me to them.
* Re: good documentation on btrfs internals and on disk layout
From: Austin S. Hemmelgarn @ 2016-04-05 18:15 UTC
To: Yauhen Kharuzhy, linux-btrfs

On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
> Hello,
>
> I am trying to understand the btrfs logic for mounting a multi-device
> filesystem when the device generations differ. All my questions relate to
> the case of RAID5/6 for system, metadata, and data.
>
> The kernel can currently mount an FS with differing device generations
> (for example, if a drive was physically removed before the last unmount
> and returned afterwards), but scrub will then report uncorrectable errors
> (although a second run shows no errors). Does any documentation exist
> about the algorithm for handling multiple devices in such a case? Is the
> case of differing device generations allowed in general, and what are the
> worst cases here?

In general, it isn't allowed, but we don't explicitly disallow it either.
The worst case here is that the devices both get written to separately,
and you end up with data not matching for correlated generation IDs. The
second scrub in this case shows no errors because the first one corrects
them (even though they are reported as uncorrectable, which is a bug as
far as I can tell), and from what I can tell from reading the code, it
does this by just picking the highest generation ID and dropping the data
from the lower generation.

> What should happen if a device is removed and returned some time later
> while the filesystem is online? Should some kind of device reopening be
> possible, or is the only way to guarantee FS consistency to mark such a
> device as missing and replace it?

In this case, the device being removed (or some component between the
device and the processor failing, or the device itself erroneously
reporting failure) will force the FS read-only. If the device reappears
while the FS is still online, it may just start working again (this is
_really_ rare, requires that the device appear with the same device node
as it had previously, and usually only happens when the device disappears
for only a very short period of time), or it may not work until the FS
gets remounted (this is usually the case), or the system may crash
(thankfully this almost never happens, and it's usually not because of
BTRFS when it does). Regardless of what happens, you may still have to
run a scrub to make sure everything is consistent.
* Re: good documentation on btrfs internals and on disk layout
From: Yauhen Kharuzhy @ 2016-04-05 18:36 UTC
To: Austin S. Hemmelgarn; +Cc: linux-btrfs

2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> In general, it isn't allowed, but we don't explicitly disallow it either.
> The worst case here is that the devices both get written to separately,
> and you end up with data not matching for correlated generation IDs. The
> second scrub in this case shows no errors because the first one corrects
> them (even though they are reported as uncorrectable, which is a bug as
> far as I can tell), and from what I can tell from reading the code, it
> does this by just picking the highest generation ID and dropping the data
> from the lower generation.

Hmm... Sounds reasonable, but how do we detect whether the filesystem
should be checked by scrub after mounting? The one way I can see is to
check the kernel logs for any btrfs errors after mounting, and that is not
a good approach for any kind of automatic management.

> In this case, the device being removed (or some component between the
> device and the processor failing, or the device itself erroneously
> reporting failure) will force the FS read-only. If the device reappears
> while the FS is still online, it may just start working again, or it may
> not work until the FS gets remounted (this is usually the case), or the
> system may crash. Regardless of what happens, you may still have to run a
> scrub to make sure everything is consistent.

So, one right way, if we see a device reconnected as a new block device,
is to reject it and not include it in the device list again, am I right?
The existing code tries to 'reconnect' it under the new device name, but
this works completely wrongly for a mounted FS (because the btrfs device
is only renamed, no real device reopening is performed), and I intend to
propose a patch based on Anand's 'global spare' patch series to handle
this properly.
* Re: good documentation on btrfs internals and on disk layout
From: Austin S. Hemmelgarn @ 2016-04-05 18:56 UTC
To: Yauhen Kharuzhy; +Cc: linux-btrfs

On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
> Hmm... Sounds reasonable, but how do we detect whether the filesystem
> should be checked by scrub after mounting? The one way I can see is to
> check the kernel logs for any btrfs errors after mounting, and that is not
> a good approach for any kind of automatic management.

There really isn't any way that I know of. Personally, I just scrub all my
filesystems shortly after mount, but I also have pretty small filesystems
(the biggest are 64G) on relatively fast storage. In theory, it might be
possible to parse the filesystems before mounting to check the device
generation numbers, but that may be just as expensive as just scrubbing
the filesystem (and you really should be scrubbing somewhat regularly
anyway).

> So, one right way, if we see a device reconnected as a new block device,
> is to reject it and not include it in the device list again, am I right?
> The existing code tries to 'reconnect' it under the new device name, but
> this works completely wrongly for a mounted FS (because the btrfs device
> is only renamed, no real device reopening is performed), and I intend to
> propose a patch based on Anand's 'global spare' patch series to handle
> this properly.

In an ideal situation, you have nothing using the FS and can unmount, run
a device scan, and then remount. In most cases this won't work, and being
able to re-add the device via a hot-spare type setup (or even just use
device replace on it, which I've done before myself when dealing with
filesystems on USB devices, and it works well) would be useful. Ideally,
we should have the option to auto-detect such a situation and handle it,
but that _really_ needs to be optional (there are just too many things
that could go wrong).
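As a rough sketch of the "parse the filesystems before mounting" idea
above: the device generation can be read straight out of each member's
primary superblock. The offsets used below (superblock copy at byte 65536,
magic "_BHRfS_M" at 0x40 within it, generation as a little-endian u64 at
0x48) follow the wiki's on-disk format page and should be double-checked
against ctree.h; this only illustrates the comparison, not how the kernel
itself handles a generation mismatch.

/*
 * Sketch: print and compare the superblock generation of each btrfs
 * member device given on the command line.  Field offsets assume the
 * documented on-disk layout (verify against ctree.h); only the primary
 * superblock copy at byte 65536 is read.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <endian.h>               /* le64toh (glibc/musl) */

#define BTRFS_SUPER_OFFSET 65536
#define BTRFS_MAGIC "_BHRfS_M"

static int read_generation(const char *dev, uint64_t *gen)
{
	unsigned char sb[4096];
	uint64_t le;
	int fd = open(dev, O_RDONLY);

	if (fd < 0)
		return -1;
	if (pread(fd, sb, sizeof(sb), BTRFS_SUPER_OFFSET) != sizeof(sb)) {
		close(fd);
		return -1;
	}
	close(fd);
	if (memcmp(sb + 0x40, BTRFS_MAGIC, 8) != 0)   /* magic */
		return -1;
	memcpy(&le, sb + 0x48, 8);                    /* generation */
	*gen = le64toh(le);                           /* stored little-endian */
	return 0;
}

int main(int argc, char **argv)
{
	uint64_t gen, first = 0;
	int i, have_first = 0, mismatch = 0;

	for (i = 1; i < argc; i++) {
		if (read_generation(argv[i], &gen) < 0) {
			fprintf(stderr, "%s: no btrfs superblock found\n", argv[i]);
			continue;
		}
		printf("%s: generation %llu\n", argv[i], (unsigned long long)gen);
		if (!have_first) {
			first = gen;
			have_first = 1;
		} else if (gen != first) {
			mismatch = 1;
		}
	}
	if (mismatch)
		printf("generation mismatch -- a scrub after mounting is advisable\n");
	return mismatch;
}

Run as root against all members of the array before mounting; a non-zero
exit status (or the mismatch line) is a cheap hint that a scrub should be
scheduled once the filesystem is mounted.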
* Re: good documentation on btrfs internals and on disk layout
From: Yauhen Kharuzhy @ 2016-04-05 19:26 UTC
To: Austin S. Hemmelgarn; +Cc: linux-btrfs

2016-04-05 11:56 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> There really isn't any way that I know of. Personally, I just scrub all my
> filesystems shortly after mount, but I also have pretty small filesystems
> (the biggest are 64G) on relatively fast storage. In theory, it might be
> possible to parse the filesystems before mounting to check the device
> generation numbers, but that may be just as expensive as just scrubbing
> the filesystem (and you really should be scrubbing somewhat regularly
> anyway).

Yes, size matters: we have a 96TB array, and scrubbing, rebalancing,
replacing etc. can be expensive operations :)

In fact, I have already implemented some kind of filesystem status
reporting in the kernel and btrfs-progs, but I haven't yet had time to
prepare it for sending as patches for comment. It will be done soon,
I hope.