* good documentation on btrfs internals and on disk layout
From: sri @ 2016-03-30 13:58 UTC
To: linux-btrfs
Hi,

I could find only very limited documentation on the on-disk layout of
btrfs and on how all the trees relate to each other. Apart from the
wiki, which gives only top-level details, I wasn't able to find
anything more detailed on btrfs.

For filesystems such as ZFS, ext3/4 and XFS there are documents that
explain the on-disk layout.

Could anybody please provide pointers that would help in understanding
the btrfs on-disk layout and how the trees interact when multiple
disks are configured for btrfs?

Thank you in advance.
* Re: good documentation on btrfs internals and on disk layout
From: Liu Bo @ 2016-03-30 17:28 UTC
To: sri; +Cc: linux-btrfs
On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
> Hi,
>
> I could find only very limited documentation on the on-disk layout of
> btrfs and on how all the trees relate to each other. Apart from the
> wiki, which gives only top-level details, I wasn't able to find
> anything more detailed on btrfs.
>
> For filesystems such as ZFS, ext3/4 and XFS there are documents that
> explain the on-disk layout.
>
> Could anybody please provide pointers that would help in understanding
> the btrfs on-disk layout and how the trees interact when multiple
> disks are configured for btrfs?
There is a paper [1] about the btrfs filesystem which covers the design in detail.
[1]: BTRFS: The Linux B-Tree Filesystem
Thanks,
-liubo
* Re: good documentation on btrfs internals and on disk layout
From: Dave Stevens @ 2016-03-30 18:11 UTC
To: bo.li.liu; +Cc: sri, linux-btrfs
Quoting Liu Bo <bo.li.liu@oracle.com>:
> On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
>> Hi,
>>
>> I could find only very limited documentation on the on-disk layout of
>> btrfs and on how all the trees relate to each other. Apart from the
>> wiki, which gives only top-level details, I wasn't able to find
>> anything more detailed on btrfs.
>>
>> For filesystems such as ZFS, ext3/4 and XFS there are documents that
>> explain the on-disk layout.
>>
>> Could anybody please provide pointers that would help in understanding
>> the btrfs on-disk layout and how the trees interact when multiple
>> disks are configured for btrfs?
>
> There is a paper [1] about the btrfs filesystem which covers the design in detail.
>
> [1]: BTRFS: The Linux B-Tree Filesystem
and this is where it is:
http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf
D
>
> Thanks,
>
> -liubo
--
"As long as politics is the shadow cast on society by big business,
the attenuation of the shadow will not change the substance."
-- John Dewey
* Re: good documentation on btrfs internals and on disk layout
From: Hugo Mills @ 2016-03-30 18:43 UTC
To: sri; +Cc: linux-btrfs
On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
> I could find only very limited documentation on the on-disk layout of
> btrfs and on how all the trees relate to each other. Apart from the
> wiki, which gives only top-level details, I wasn't able to find
> anything more detailed on btrfs.
>
> For filesystems such as ZFS, ext3/4 and XFS there are documents that
> explain the on-disk layout.
>
> Could anybody please provide pointers that would help in understanding
> the btrfs on-disk layout and how the trees interact when multiple
> disks are configured for btrfs?
What are you intending to do? You'll need different things
depending on whether you are, for example, using the BTRFS_TREE_SEARCH
ioctl online to gather high-level information, or working your way
through the datapaths from the superblock right down to individual
bytes of a file for offline access.

If you're using BTRFS_TREE_SEARCH, for example, you won't need to
know anything about the superblocks or the way that trees are
implemented. In fact, it's a good idea if you can avoid getting into
those details at all.
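
As a rough illustration, here is an untested sketch of such a call. It
assumes the uapi headers <linux/btrfs.h> and <linux/btrfs_tree.h> (on
older headers you may have to define the two constants yourself: the
root tree id is 1 and the ROOT_ITEM key type is 132), and the ioctl
needs CAP_SYS_ADMIN, so run it as root against any path on a mounted
btrfs filesystem:

/* list-roots.c: enumerate ROOT_ITEMs in the tree of tree roots via
 * the BTRFS_TREE_SEARCH ioctl.  Untested sketch for illustration. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>        /* BTRFS_IOC_TREE_SEARCH, search args  */
#include <linux/btrfs_tree.h>   /* BTRFS_ROOT_TREE_OBJECTID, item keys */

int main(int argc, char **argv)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_header *hdr;
	unsigned long off = 0;
	unsigned int i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path-on-btrfs>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	args.key.tree_id      = BTRFS_ROOT_TREE_OBJECTID; /* tree of tree roots */
	args.key.max_objectid = (__u64)-1;
	args.key.min_type     = BTRFS_ROOT_ITEM_KEY;      /* ROOT_ITEMs only    */
	args.key.max_type     = BTRFS_ROOT_ITEM_KEY;
	args.key.max_offset   = (__u64)-1;
	args.key.max_transid  = (__u64)-1;
	args.key.nr_items     = 64;          /* at most this many items back */

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0) {
		perror("BTRFS_IOC_TREE_SEARCH");
		return 1;
	}

	/* The kernel sets nr_items to the number of results; each result
	 * is a search header followed by the raw item payload (here, a
	 * struct btrfs_root_item). */
	for (i = 0; i < args.key.nr_items; i++) {
		hdr = (struct btrfs_ioctl_search_header *)(args.buf + off);
		printf("tree %llu: item length %u\n",
		       (unsigned long long)hdr->objectid, hdr->len);
		off += sizeof(*hdr) + hdr->len;
	}
	close(fd);
	return 0;
}

(btrfs-progs drives the same ioctl, e.g. for "btrfs subvolume list",
so its source is a good reference for the looping and error handling a
real tool needs.)
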
The high-level view of how the data model fits together is at
[1]. Individual structures referenced in there are best examined in
ctree.h for the details, although there's a little more detailed
description at [2]. There's some documentation on the basic APIs used
for reading the btrees at [3]. If you really _have_ to access trees
yourself, the tree structure is at [4], but see my comment above about
that. The way that the FS-tree metadata is put together to make up
POSIX directory structures is at [5].

After all that, you're down to looking at the data structures in
ctree.h, and grepping through the source code to see how they're used
(which is how [1] was written in the first place).
Hugo.
[1] https://btrfs.wiki.kernel.org/index.php/Data_Structures
[2] https://btrfs.wiki.kernel.org/index.php/On-disk_Format
[3] https://btrfs.wiki.kernel.org/index.php/Code_documentation
[4] https://btrfs.wiki.kernel.org/index.php/Btrfs_design
[5] https://btrfs.wiki.kernel.org/index.php/Trees
--
Hugo Mills | "There's more than one way to do it" is not a
hugo@... carfax.org.uk | commandment. It is a dire warning.
http://carfax.org.uk/ |
PGP: E2AB1DE4 |
* Re: good documentation on btrfs internals and on disk layout
From: Yauhen Kharuzhy @ 2016-04-05 17:53 UTC
To: linux-btrfs
Hello,

I am trying to understand the btrfs logic for mounting a multi-device
filesystem when the device generations differ. All my questions relate
to the case of RAID5/6 for system, metadata, and data.

The kernel can currently mount an FS with different device generations
(for example, if a drive was physically removed before the last
unmount and returned afterwards), but scrub will report uncorrectable
errors after this (a second run doesn't show any errors). Does any
documentation exist about the algorithm for handling multiple devices
in such a case? Is the case of different device generations allowed in
general, and what are the worst cases here?

What should happen if a device is removed and returned some time later
while the filesystem is online? Should some kind of device reopening
be possible, or is the only way to guarantee FS consistency to mark
such a device as missing and replace it?

If any sources of such information exist (other than the btrfs code),
please point me to them.
* Re: good documentation on btrfs internals and on disk layout
From: Austin S. Hemmelgarn @ 2016-04-05 18:15 UTC
To: Yauhen Kharuzhy, linux-btrfs
On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
> Hello,
>
> I am trying to understand the btrfs logic for mounting a multi-device
> filesystem when the device generations differ. All my questions relate
> to the case of RAID5/6 for system, metadata, and data.
>
> The kernel can currently mount an FS with different device generations
> (for example, if a drive was physically removed before the last
> unmount and returned afterwards), but scrub will report uncorrectable
> errors after this (a second run doesn't show any errors). Does any
> documentation exist about the algorithm for handling multiple devices
> in such a case? Is the case of different device generations allowed in
> general, and what are the worst cases here?
In general, it isn't allowed, but we don't explicitly disallow it
either. The worst case here is that the devices both get written to
separately, and you end up with data not matching for correlated
generation IDs. The second scrub in this case shows no errors because
the first one corrects them (even though they are reported as
uncorrectable, which is a bug as far as I can tell), and from what I
can tell from reading the code, it does this by just picking the
highest generation ID and dropping the data from the lower generation.
>
> What should happen if a device is removed and returned some time later
> while the filesystem is online? Should some kind of device reopening
> be possible, or is the only way to guarantee FS consistency to mark
> such a device as missing and replace it?
In this case, the device being removed (or some component between the
device and the processor failing, or the device itself erroneously
reporting failure) will force the FS read-only. If the device reappears
while the FS is still online, it may just start working again (this is
_really_ rare, and requires that the device appear with the same device
node as it had previously, and this usually only happens when the device
disappears for only a very short period of time), or it may not work
until the FS gets remounted (this is usually the case), or the system
may crash (thankfully this almost never happens, and it's usually not
because of BTRFS when it does). Regardless of what happens, you may
still have to run a scrub to make sure everything is consistent.
* Re: good documentation on btrfs internals and on disk layout
From: Yauhen Kharuzhy @ 2016-04-05 18:36 UTC
To: Austin S. Hemmelgarn; +Cc: linux-btrfs
2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>
>> Hello,
>>
>> I am trying to understand the btrfs logic for mounting a multi-device
>> filesystem when the device generations differ. All my questions relate
>> to the case of RAID5/6 for system, metadata, and data.
>>
>> The kernel can currently mount an FS with different device generations
>> (for example, if a drive was physically removed before the last
>> unmount and returned afterwards), but scrub will report uncorrectable
>> errors after this (a second run doesn't show any errors). Does any
>> documentation exist about the algorithm for handling multiple devices
>> in such a case? Is the case of different device generations allowed in
>> general, and what are the worst cases here?
>
> In general, it isn't allowed, but we don't explicitly disallow it
> either. The worst case here is that the devices both get written to
> separately, and you end up with data not matching for correlated
> generation IDs. The second scrub in this case shows no errors because
> the first one corrects them (even though they are reported as
> uncorrectable, which is a bug as far as I can tell), and from what I
> can tell from reading the code, it does this by just picking the
> highest generation ID and dropping the data from the lower generation.
Hmm... Sounds reasonable, but how can we detect whether a filesystem
should be checked by scrub after mounting? As I understand it, the
only way is to check the kernel logs after mount for any btrfs errors,
and that is not a good approach for any kind of automatic management.
>> What should happen if a device is removed and returned some time later
>> while the filesystem is online? Should some kind of device reopening
>> be possible, or is the only way to guarantee FS consistency to mark
>> such a device as missing and replace it?
>
> In this case, the device being removed (or some component between the device
> and the processor failing, or the device itself erroneously reporting
> failure) will force the FS read-only. If the device reappears while the FS
> is still online, it may just start working again (this is _really_ rare, and
> requires that the device appear with the same device node as it had
> previously, and this usually only happens when the device disappears for
> only a very short period of time), or it may not work until the FS gets
> remounted (this is usually the case), or the system may crash (thankfully
> this almost never happens, and it's usually not because of BTRFS when it
> does). Regardless of what happens, you may still have to run a scrub to
> make sure everything is consistent.
So, if we see the device reconnected as a new block device, the right
way is to reject it and not include it in the device list again, am I
right? The existing code tries to 'reconnect' it under the new device
name, but this works completely wrong for a mounted FS (the btrfs
device is only renamed; no real device reopening is performed), and I
intend to propose a patch based on Anand's 'global spare' patch series
to handle this properly.
* Re: good documentation on btrfs internals and on disk layout
From: Austin S. Hemmelgarn @ 2016-04-05 18:56 UTC
To: Yauhen Kharuzhy; +Cc: linux-btrfs
On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
> 2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>>
>>> Hello,
>>>
>>> I am trying to understand the btrfs logic for mounting a multi-device
>>> filesystem when the device generations differ. All my questions relate
>>> to the case of RAID5/6 for system, metadata, and data.
>>>
>>> The kernel can currently mount an FS with different device generations
>>> (for example, if a drive was physically removed before the last
>>> unmount and returned afterwards), but scrub will report uncorrectable
>>> errors after this (a second run doesn't show any errors). Does any
>>> documentation exist about the algorithm for handling multiple devices
>>> in such a case? Is the case of different device generations allowed in
>>> general, and what are the worst cases here?
>>
>> In general, it isn't allowed, but we don't explicitly disallow it
>> either. The worst case here is that the devices both get written to
>> separately, and you end up with data not matching for correlated
>> generation IDs. The second scrub in this case shows no errors because
>> the first one corrects them (even though they are reported as
>> uncorrectable, which is a bug as far as I can tell), and from what I
>> can tell from reading the code, it does this by just picking the
>> highest generation ID and dropping the data from the lower generation.
>
> Hmm... Sounds reasonable, but how can we detect whether a filesystem
> should be checked by scrub after mounting? As I understand it, the
> only way is to check the kernel logs after mount for any btrfs errors,
> and that is not a good approach for any kind of automatic management.
There really isn't any way that I know of. Personally, I just scrub all
my filesystems shortly after mount, but I also have pretty small
filesystems (the biggest are 64G) on relatively fast storage. In
theory, it might be possible to parse the filesystems before mounting to
check the device generation numbers, but that may be just as expensive
as just scrubbing the filesystem (and you really should be scrubbing
somewhat regularly anyway).
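
A quick-and-dirty, untested sketch of that check, which only reads the
primary superblock of each device (field offsets taken from the
on-disk format wiki: the superblock sits at 64KiB, with csum[32],
fsid[16], bytenr, flags, magic and generation following in that
order), might look like this. Comparing the printed generations across
the member devices before mounting would at least flag an obviously
stale device:

/* sb-gen.c: print the generation field from each device's primary
 * btrfs superblock.  Untested sketch; little-endian host assumed,
 * no backup superblocks, no checksum verification. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BTRFS_SUPER_OFFSET 0x10000               /* primary superblock, 64KiB */
#define BTRFS_MAGIC        0x4D5F53665248425FULL /* "_BHRfS_M", little-endian */

int main(int argc, char **argv)
{
	int i;

	for (i = 1; i < argc; i++) {
		unsigned char sb[4096];
		uint64_t magic, generation;
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0 ||
		    pread(fd, sb, sizeof(sb), BTRFS_SUPER_OFFSET) != sizeof(sb)) {
			perror(argv[i]);
			if (fd >= 0)
				close(fd);
			continue;
		}
		/* layout: csum[32] fsid[16] bytenr[8] flags[8] magic[8] generation[8] */
		memcpy(&magic, sb + 64, 8);
		memcpy(&generation, sb + 72, 8);
		if (magic != BTRFS_MAGIC)
			fprintf(stderr, "%s: no btrfs superblock found\n", argv[i]);
		else
			printf("%s: generation %llu\n", argv[i],
			       (unsigned long long)generation);
		close(fd);
	}
	return 0;
}

Run as root against the member devices, e.g. ./sb-gen /dev/sdb
/dev/sdc; btrfs-show-super from btrfs-progs prints the same field (and
much more) if you'd rather not write code.
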
>
>>> What should happen if a device is removed and returned some time later
>>> while the filesystem is online? Should some kind of device reopening
>>> be possible, or is the only way to guarantee FS consistency to mark
>>> such a device as missing and replace it?
>>
>> In this case, the device being removed (or some component between the device
>> and the processor failing, or the device itself erroneously reporting
>> failure) will force the FS read-only. If the device reappears while the FS
>> is still online, it may just start working again (this is _really_ rare, and
>> requires that the device appear with the same device node as it had
>> previously, and this usually only happens when the device disappears for
>> only a very short period of time), or it may not work until the FS gets
>> remounted (this is usually the case), or the system may crash (thankfully
>> this almost never happens, and it's usually not because of BTRFS when it
>> does). Regardless of what happens, you may still have to run a scrub to
>> make sure everything is consistent.
>
> So, if we see the device reconnected as a new block device, the right
> way is to reject it and not include it in the device list again, am I
> right? The existing code tries to 'reconnect' it under the new device
> name, but this works completely wrong for a mounted FS (the btrfs
> device is only renamed; no real device reopening is performed), and I
> intend to propose a patch based on Anand's 'global spare' patch series
> to handle this properly.
In an ideal situation, you have nothing using the FS and can unmount,
run a device scan, and then remount. In most cases this won't work, and
being able to re-add the device via a hot-spare type setup (or even just
use device replace on it, which I've done before myself when dealing
with filesystems on USB devices, and it works well) would be useful.
Ideally, we should have the option to auto-detect such a situation and
handle it, but that _really_ needs to be optional (there are just too
many things that could go wrong).
* Re: good documentation on btrfs internals and on disk layout
From: Yauhen Kharuzhy @ 2016-04-05 19:26 UTC
To: Austin S. Hemmelgarn; +Cc: linux-btrfs
2016-04-05 11:56 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
>>
>> 2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>
>>> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>> In general, it isn't allowed, but we don't explicitly disallow it
>>> either. The worst case here is that the devices both get written to
>>> separately, and you end up with data not matching for correlated
>>> generation IDs. The second scrub in this case shows no errors because
>>> the first one corrects them (even though they are reported as
>>> uncorrectable, which is a bug as far as I can tell), and from what I
>>> can tell from reading the code, it does this by just picking the
>>> highest generation ID and dropping the data from the lower generation.
>>
>>
>> Hmm... Sounds reasonable, but how can we detect whether a filesystem
>> should be checked by scrub after mounting? As I understand it, the
>> only way is to check the kernel logs after mount for any btrfs errors,
>> and that is not a good approach for any kind of automatic management.
>
> There really isn't any way that I know of. Personally, I just scrub all my
> filesystems shortly after mount, but I also have pretty small filesystems
> (the biggest are 64G) on relatively fast storage. In theory, it might be
> possible to parse the filesystems before mounting to check the device
> generation numbers, but that may be just as expensive as just scrubbing the
> filesystem (and you really should be scrubbing somewhat regularly anyway).
Yes, size matters: we have a 96TB array, and scrubbing, rebalancing,
replacing, etc. can be expensive operations :)

In fact, I have already implemented some kind of filesystem status
reporting in the kernel & btrfs-progs, but I still haven't had time to
prepare it for sending as patches for comments. It will be done soon,
I hope.