linux-btrfs.vger.kernel.org archive mirror
* good documentation on btrfs internals and on disk layout
@ 2016-03-30 13:58 sri
  2016-03-30 17:28 ` Liu Bo
  2016-03-30 18:43 ` Hugo Mills
  0 siblings, 2 replies; 9+ messages in thread
From: sri @ 2016-03-30 13:58 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I could find only very limited documentation on the on-disk layout of btrfs 
and on how all its trees relate to each other. Apart from the wiki, which 
gives only high-level details, I couldn't find anything more in-depth on btrfs.

For filesystems such as ZFS, ext3/4 and XFS there are documents that explain 
the on-disk layout.

Could anybody please provide pointers that would give a better understanding 
of the btrfs on-disk layout and of how the trees interact when multiple disks 
are configured for btrfs?

Thank you in advance


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-03-30 13:58 good documentation on btrfs internals and on disk layout sri
@ 2016-03-30 17:28 ` Liu Bo
  2016-03-30 18:11   ` Dave Stevens
  2016-03-30 18:43 ` Hugo Mills
  1 sibling, 1 reply; 9+ messages in thread
From: Liu Bo @ 2016-03-30 17:28 UTC (permalink / raw)
  To: sri; +Cc: linux-btrfs

On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
> Hi,
> 
> I could find only very limited documentation on the on-disk layout of btrfs 
> and on how all its trees relate to each other. Apart from the wiki, which 
> gives only high-level details, I couldn't find anything more in-depth on btrfs.
> 
> For filesystems such as ZFS, ext3/4 and XFS there are documents that explain 
> the on-disk layout.
> 
> Could anybody please provide pointers that would give a better understanding 
> of the btrfs on-disk layout and of how the trees interact when multiple disks 
> are configured for btrfs?

There is a paper [1] about the btrfs filesystem which covers these details.

[1]: BTRFS: The Linux B-Tree Filesystem

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-03-30 17:28 ` Liu Bo
@ 2016-03-30 18:11   ` Dave Stevens
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Stevens @ 2016-03-30 18:11 UTC (permalink / raw)
  To: bo.li.liu; +Cc: sri, linux-btrfs

Quoting Liu Bo <bo.li.liu@oracle.com>:

> On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
>> Hi,
>>
>> I could find only very limited documentation on the on-disk layout of btrfs
>> and on how all its trees relate to each other. Apart from the wiki, which
>> gives only high-level details, I couldn't find anything more in-depth on btrfs.
>>
>> For filesystems such as ZFS, ext3/4 and XFS there are documents that explain
>> the on-disk layout.
>>
>> Could anybody please provide pointers that would give a better understanding
>> of the btrfs on-disk layout and of how the trees interact when multiple disks
>> are configured for btrfs?
>
> There is a paper [1] about the btrfs filesystem which covers these details.
>
> [1]: BTRFS: The Linux B-Tree Filesystem

and this is where it is:

http://domino.watson.ibm.com/library/CyberDig.nsf/papers/6E1C5B6A1B6EDD9885257A38006B6130/$File/rj10501.pdf

D

>
> Thanks,
>
> -liubo



-- 
"As long as politics is the shadow cast on society by big business,
the attenuation of the shadow will not change the substance."

-- John Dewey






^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-03-30 13:58 good documentation on btrfs internals and on disk layout sri
  2016-03-30 17:28 ` Liu Bo
@ 2016-03-30 18:43 ` Hugo Mills
  2016-04-05 17:53   ` Yauhen Kharuzhy
  1 sibling, 1 reply; 9+ messages in thread
From: Hugo Mills @ 2016-03-30 18:43 UTC (permalink / raw)
  To: sri; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2297 bytes --]

On Wed, Mar 30, 2016 at 01:58:03PM +0000, sri wrote:
> I could find only very limited documentation on the on-disk layout of btrfs 
> and on how all its trees relate to each other. Apart from the wiki, which 
> gives only high-level details, I couldn't find anything more in-depth on btrfs.
> 
> For filesystems such as ZFS, ext3/4 and XFS there are documents that explain 
> the on-disk layout.
> 
> Could anybody please provide pointers that would give a better understanding 
> of the btrfs on-disk layout and of how the trees interact when multiple disks 
> are configured for btrfs?

   What are you intending to do? You'll need different things
depending on whether you are, for example, using the BTRFS_TREE_SEARCH
ioctl online to gather high-level information, or working your way
through the datapaths from the superblock right down to individual
bytes of a file for offline access.

   If you're using BTRFS_TREE_SEARCH, for example, you won't need to
know anything about the superblocks or the way that trees are
implemented. In fact, it's a good idea if you can avoid getting into
those details at all.
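
   (As a rough illustration, not something taken from this thread: a
minimal, hypothetical C sketch of driving BTRFS_IOC_TREE_SEARCH from
userspace might look like the following. It assumes <linux/btrfs.h>
provides the search ioctl and its structures, hard-codes tree_id 1 for
the root tree rather than relying on a named constant, and will
typically need CAP_SYS_ADMIN to run.)

/*
 * Hypothetical example: list a few items from the root tree via the
 * BTRFS_IOC_TREE_SEARCH ioctl.  The "match everything" ranges are
 * spelled out literally so the sketch stays self-contained; check
 * linux/btrfs.h for the authoritative structure definitions.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_header *sh;
	unsigned long off = 0;
	__u32 i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path on a btrfs filesystem>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);		/* any file or dir on the fs */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	args.key.tree_id = 1;			/* root tree */
	args.key.max_objectid = (__u64)-1;	/* no bounds: min_* fields are 0 */
	args.key.max_offset = (__u64)-1;
	args.key.max_transid = (__u64)-1;
	args.key.max_type = (__u32)-1;
	args.key.nr_items = 16;			/* return at most 16 items */

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0) {
		perror("BTRFS_IOC_TREE_SEARCH");	/* usually needs root */
		close(fd);
		return 1;
	}

	/* the buffer holds nr_items of (search header + raw item data) */
	for (i = 0; i < args.key.nr_items; i++) {
		sh = (struct btrfs_ioctl_search_header *)(args.buf + off);
		printf("objectid=%llu type=%u offset=%llu len=%u\n",
		       (unsigned long long)sh->objectid, sh->type,
		       (unsigned long long)sh->offset, sh->len);
		off += sizeof(*sh) + sh->len;
	}

	close(fd);
	return 0;
}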

   The high-level view of how the data model fits together is at
[1]. Individual structures referenced in there are best examined in
ctree.h for the details, although there's a little more detailed
description at [2]. There's some documentation on the basic APIs used
for reading the btrees at [3]. If you really _have_ to access trees
yourself, the tree structure is at [4], but see my comment above about
that. The way that the FS-tree metadata is put together to make up
POSIX directory structures is at [5].

   After all that, you're down to looking at the data structures in
ctree.h, and grepping through the source code to see how they're used
(which is how [1] was written in the first place).

   Hugo.

[1] https://btrfs.wiki.kernel.org/index.php/Data_Structures
[2] https://btrfs.wiki.kernel.org/index.php/On-disk_Format
[3] https://btrfs.wiki.kernel.org/index.php/Code_documentation
[4] https://btrfs.wiki.kernel.org/index.php/Btrfs_design
[5] https://btrfs.wiki.kernel.org/index.php/Trees

-- 
Hugo Mills             | "There's more than one way to do it" is not a
hugo@... carfax.org.uk | commandment. It is a dire warning.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-03-30 18:43 ` Hugo Mills
@ 2016-04-05 17:53   ` Yauhen Kharuzhy
  2016-04-05 18:15     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 9+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-05 17:53 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I am trying to understand the btrfs logic for mounting a multi-device 
filesystem when the device generations differ. All my questions concern 
the case where system, metadata, and data are RAID5/6.

The kernel can currently mount an FS whose devices have different 
generations (for example, if a drive was physically removed before the 
last unmount and returned afterwards), but scrub will then report 
uncorrectable errors (although a second run shows none). Does any 
documentation exist on how multiple devices are handled in such a case? 
Is a mismatch in device generations allowed at all, and what are the 
worst cases here?

What should happen if a device is removed and returns some time later 
while the filesystem is online? Should some kind of device reopening be 
possible, or is the only way to guarantee FS consistency to mark such a 
device as missing and replace it?

If any sources of such information exist (other than the btrfs code), 
please point me to them.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-04-05 17:53   ` Yauhen Kharuzhy
@ 2016-04-05 18:15     ` Austin S. Hemmelgarn
  2016-04-05 18:36       ` Yauhen Kharuzhy
  0 siblings, 1 reply; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-05 18:15 UTC (permalink / raw)
  To: Yauhen Kharuzhy, linux-btrfs

On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
> Hello,
>
> I am trying to understand the btrfs logic for mounting a multi-device
> filesystem when the device generations differ. All my questions concern
> the case where system, metadata, and data are RAID5/6.
>
> The kernel can currently mount an FS whose devices have different
> generations (for example, if a drive was physically removed before the
> last unmount and returned afterwards), but scrub will then report
> uncorrectable errors (although a second run shows none). Does any
> documentation exist on how multiple devices are handled in such a case?
> Is a mismatch in device generations allowed at all, and what are the
> worst cases here?
In general, it isn't allowed, but we don't explicitly disallow it 
either.  The worst case here is that the devices both get written to 
separately, and you end up with data not matching for correlated 
generation IDs.  The second scrub in this case shows no errors because 
the first one corrects them (even though they are reported as 
uncorrectable, which is a bug as far as I can tell), and from what I can 
tell from reading the code, it does this by just picking the highest 
generation ID and dropping the data from the lower generation.
>
> What should happen if a device is removed and returns some time later
> while the filesystem is online? Should some kind of device reopening be
> possible, or is the only way to guarantee FS consistency to mark such a
> device as missing and replace it?
In this case, the device being removed (or some component between the 
device and the processor failing, or the device itself erroneously 
reporting failure) will force the FS read-only.  If the device reappears 
while the FS is still online, it may just start working again (this is 
_really_ rare, and requires that the device appear with the same device 
node as it had previously, and this usually only happens when the device 
disappears for only a very short period of time), or it may not work 
until the FS gets remounted (this is usually the case), or the system 
may crash (thankfully this almost never happens, and it's usually not 
because of BTRFS when it does).  Regardless of what happens, you may 
still have to run a scrub to make sure everything is consistent.
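
(For reference, the scrub mentioned here is the standard btrfs-progs 
one; a typical foreground invocation with per-device statistics would be

  # btrfs scrub start -Bd /mountpoint

where -B keeps scrub in the foreground until it finishes and -d prints 
statistics for each device.)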

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-04-05 18:15     ` Austin S. Hemmelgarn
@ 2016-04-05 18:36       ` Yauhen Kharuzhy
  2016-04-05 18:56         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 9+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-05 18:36 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>
>> Hello,
>>
>> I am trying to understand the btrfs logic for mounting a multi-device
>> filesystem when the device generations differ. All my questions concern
>> the case where system, metadata, and data are RAID5/6.
>>
>> The kernel can currently mount an FS whose devices have different
>> generations (for example, if a drive was physically removed before the
>> last unmount and returned afterwards), but scrub will then report
>> uncorrectable errors (although a second run shows none). Does any
>> documentation exist on how multiple devices are handled in such a case?
>> Is a mismatch in device generations allowed at all, and what are the
>> worst cases here?
>
> In general, it isn't allowed, but we don't explicitly disallow it either.
> The worst case here is that the devices both get written to separately, and
> you end up with data not matching for correlated generation IDs.  The
> second scrub in this case shows no errors because the first one corrects
> them (even though they are reported as uncorrectable, which is a bug as far
> as I can tell), and from what I can tell from reading the code, it does this
> by just picking the highest generation ID and dropping the data from the
> lower generation.

Hmm... Sounds reasonable, but how can one detect whether the filesystem 
should be checked by scrub after mounting? As I understand it, the only 
way is to check the kernel logs for btrfs errors after the mount, and 
that is not a good approach for any kind of automatic management.

>> What should happen if a device is removed and returns some time later
>> while the filesystem is online? Should some kind of device reopening be
>> possible, or is the only way to guarantee FS consistency to mark such a
>> device as missing and replace it?
>
> In this case, the device being removed (or some component between the device
> and the processor failing, or the device itself erroneously reporting
> failure) will force the FS read-only.  If the device reappears while the FS
> is still online, it may just start working again (this is _really_ rare, and
> requires that the device appear with the same device node as it had
> previously, and this usually only happens when the device disappears for
> only a very short period of time), or it may not work until the FS gets
> remounted (this is usually the case), or the system may crash (thankfully
> this almost never happens, and it's usually not because of BTRFS when it
> does).  Regardless of what happens, you may still have to run a scrub to
> make sure everything is consistent.

So, if we see the device reconnected as a new block device, the right 
way is to reject it and not include it in the device list again, am I 
right? The existing code tries to 'reconnect' it under the new device 
name, but this works completely wrong for a mounted FS (the btrfs device 
is only renamed; no real device reopening is performed), and I intend to 
propose a patch based on Anand's 'global spare' patch series to handle 
this properly.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-04-05 18:36       ` Yauhen Kharuzhy
@ 2016-04-05 18:56         ` Austin S. Hemmelgarn
  2016-04-05 19:26           ` Yauhen Kharuzhy
  0 siblings, 1 reply; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-05 18:56 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs

On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
> 2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:
>>>
>>> Hello,
>>>
>>> I am trying to understand the btrfs logic for mounting a multi-device
>>> filesystem when the device generations differ. All my questions concern
>>> the case where system, metadata, and data are RAID5/6.
>>>
>>> The kernel can currently mount an FS whose devices have different
>>> generations (for example, if a drive was physically removed before the
>>> last unmount and returned afterwards), but scrub will then report
>>> uncorrectable errors (although a second run shows none). Does any
>>> documentation exist on how multiple devices are handled in such a case?
>>> Is a mismatch in device generations allowed at all, and what are the
>>> worst cases here?
>>
>> In general, it isn't allowed, but we don't explicitly disallow it either.
>> The worst case here is that the devices both get written to separately, and
>> you end up with data not matching for correlated generation IDs.  The
>> second scrub in this case shows no errors because the first one corrects
>> them (even though they are reported as uncorrectable, which is a bug as far
>> as I can tell), and from what I can tell from reading the code, it does this
>> by just picking the highest generation ID and dropping the data from the
>> lower generation.
>
> Hmm... Sounds reasonable, but how can one detect whether the filesystem
> should be checked by scrub after mounting? As I understand it, the only
> way is to check the kernel logs for btrfs errors after the mount, and
> that is not a good approach for any kind of automatic management.
There really isn't any way that I know of.  Personally, I just scrub all 
my filesystems shortly after mount, but I also have pretty small 
filesystems (the biggest are 64G) on relatively fast storage.  In 
theory, it might be possible to parse the filesystems before mounting to 
check the device generation numbers, but that may be just as expensive 
as just scrubbing the filesystem (and you really should be scrubbing 
somewhat regularly anyway).
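
(To illustrate the "check the device generation numbers before 
mounting" idea, here is a small, hypothetical C sketch, not something 
from this thread: it reads the primary superblock of each device given 
on the command line and prints its generation counter.  It assumes the 
primary superblock sits at 64KiB and that, within it, the magic 
"_BHRfS_M" is at byte 0x40 and the little-endian 64-bit generation at 
byte 0x48; verify those offsets against ctree.h before relying on them. 
btrfs-show-super, or btrfs inspect-internal dump-super in newer 
btrfs-progs, prints the same field without writing any code.)

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <endian.h>

#define SUPER_OFFSET	0x10000ULL	/* primary superblock at 64KiB */

int main(int argc, char **argv)
{
	unsigned char sb[4096];
	uint64_t gen;
	int i;

	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0) {
			perror(argv[i]);
			continue;
		}
		if (pread(fd, sb, sizeof(sb), SUPER_OFFSET) != (ssize_t)sizeof(sb)) {
			perror("pread");
			close(fd);
			continue;
		}
		close(fd);

		if (memcmp(sb + 0x40, "_BHRfS_M", 8) != 0) {
			fprintf(stderr, "%s: no btrfs superblock magic found\n", argv[i]);
			continue;
		}

		memcpy(&gen, sb + 0x48, sizeof(gen));	/* on-disk fields are little-endian */
		printf("%s: generation %llu\n", argv[i], (unsigned long long)le64toh(gen));
	}
	return 0;
}

If the generations printed for the member devices differ, that would be 
a hint that a scrub is in order before trusting the data.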
>
>>> What should happen if a device is removed and returns some time later
>>> while the filesystem is online? Should some kind of device reopening be
>>> possible, or is the only way to guarantee FS consistency to mark such a
>>> device as missing and replace it?
>>
>> In this case, the device being removed (or some component between the device
>> and the processor failing, or the device itself erroneously reporting
>> failure) will force the FS read-only.  If the device reappears while the FS
>> is still online, it may just start working again (this is _really_ rare, and
>> requires that the device appear with the same device node as it had
>> previously, and this usually only happens when the device disappears for
>> only a very short period of time), or it may not work until the FS gets
>> remounted (this is usually the case), or the system may crash (thankfully
>> this almost never happens, and it's usually not because of BTRFS when it
>> does).  Regardless of what happens, you may still have to run a scrub to
>> make sure everything is consistent.
>
> So, if we see the device reconnected as a new block device, the right
> way is to reject it and not include it in the device list again, am I
> right? The existing code tries to 'reconnect' it under the new device
> name, but this works completely wrong for a mounted FS (the btrfs device
> is only renamed; no real device reopening is performed), and I intend to
> propose a patch based on Anand's 'global spare' patch series to handle
> this properly.
In an ideal situation, you have nothing using the FS and can unmount, 
run a device scan, and then remount.  In most cases this won't work, and 
being able to re-add the device via a hot-spare type setup (or even just 
use device replace on it, which I've done before myself when dealing 
with filesystems on USB devices, and it works well) would be useful. 
Ideally, we should have the option to auto-detect such a situation and 
handle it, but that _really_ needs to be optional (there are just too 
many things that could go wrong).
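
(For concreteness, the manual sequence described above maps onto 
standard btrfs-progs commands roughly as follows; device names and the 
devid are examples only:

  # umount /mnt
  # btrfs device scan
  # mount /dev/sdb /mnt

and, for re-adding a device that came back under a different name, a 
replace keyed on the old devid, which "btrfs filesystem show" reports:

  # btrfs replace start 3 /dev/sdd /mnt

btrfs replace start accepts either the source device path or its devid; 
for a device that is missing or has reappeared under a new name, the 
devid form is the one that works.)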


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: good documentation on btrfs internals and on disk layout
  2016-04-05 18:56         ` Austin S. Hemmelgarn
@ 2016-04-05 19:26           ` Yauhen Kharuzhy
  0 siblings, 0 replies; 9+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-05 19:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

2016-04-05 11:56 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
> On 2016-04-05 14:36, Yauhen Kharuzhy wrote:
>>
>> 2016-04-05 11:15 GMT-07:00 Austin S. Hemmelgarn <ahferroin7@gmail.com>:
>>>
>>> On 2016-04-05 13:53, Yauhen Kharuzhy wrote:

>>> In general, it isn't allowed, but we don't explicitly disallow it either.
>>> The worst case here is that the devices both get written to separately,
>>> and you end up with data not matching for correlated generation IDs.  The
>>> second scrub in this case shows no errors because the first one corrects
>>> them (even though they are reported as uncorrectable, which is a bug as
>>> far as I can tell), and from what I can tell from reading the code, it
>>> does this by just picking the highest generation ID and dropping the data
>>> from the lower generation.
>>
>>
>> Hmm... Sounds reasonable, but how can one detect whether the filesystem
>> should be checked by scrub after mounting? As I understand it, the only
>> way is to check the kernel logs for btrfs errors after the mount, and
>> that is not a good approach for any kind of automatic management.
>
> There really isn't any way that I know of.  Personally, I just scrub all my
> filesystems shortly after mount, but I also have pretty small filesystems
> (the biggest are 64G) on relatively fast storage.  In theory, it might be
> possible to parse the filesystems before mounting to check the device
> generation numbers, but that may be just as expensive as just scrubbing the
> filesystem (and you really should be scrubbing somewhat regularly anyway).

Yes, size matters: we have a 96TB array, and scrubbing, rebalancing, 
replacing and so on can be expensive operations :)
In fact, I have already implemented some filesystem status reporting in 
the kernel and btrfs-progs, but I have not yet had time to prepare it 
for sending as patches for comments. That will be done soon, I hope.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-04-05 19:27 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2016-03-30 13:58 good documentation on btrfs internals and on disk layout sri
2016-03-30 17:28 ` Liu Bo
2016-03-30 18:11   ` Dave Stevens
2016-03-30 18:43 ` Hugo Mills
2016-04-05 17:53   ` Yauhen Kharuzhy
2016-04-05 18:15     ` Austin S. Hemmelgarn
2016-04-05 18:36       ` Yauhen Kharuzhy
2016-04-05 18:56         ` Austin S. Hemmelgarn
2016-04-05 19:26           ` Yauhen Kharuzhy
