* BTRFS raid6 unmountable after a couple of days of usage.
@ 2015-07-14 11:49 Austin S Hemmelgarn
2015-07-14 13:25 ` Austin S Hemmelgarn
2015-07-16 11:49 ` Austin S Hemmelgarn
0 siblings, 2 replies; 12+ messages in thread
From: Austin S Hemmelgarn @ 2015-07-14 11:49 UTC (permalink / raw)
To: linux-btrfs
[-- Attachment #1.1: Type: text/plain, Size: 1930 bytes --]
So, after experiencing this same issue multiple times (on almost a dozen different kernel versions since 4.0) and ruling out the possibility of it being caused by my hardware (or at least, the RAM, SATA controller and disk drives themselves), I've decided to report it here.
The general symptom is that raid6 profile filesystems that I have are working fine for multiple weeks, until I either reboot or otherwise try to remount them, at which point the system refuses to mount them.
I'm currently using btrfs-progs v4.1 with kernel 4.1.2, although I've been seeing this with versions of both since 4.0.
Output of 'btrfs fi show' for the most recent fs that I had this issue with:
Label: 'altroot' uuid: 86eef6b9-febe-4350-a316-4cb00c40bbc5
Total devices 4 FS bytes used 9.70GiB
devid 1 size 24.00GiB used 6.03GiB path /dev/mapper/vg-altroot.0
devid 2 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.1
devid 3 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.2
devid 4 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.3
btrfs-progs v4.1
Each of the individual LVs that are in the FS is just a flat chunk of space on a separate disk from the others.
The FS itself passes btrfs check just fine (no reported errors, exit value of 0), but the kernel refuses to mount it with the message 'open_ctree failed'.
I've run btrfs chunk recover and attached the output from that.
Here's a link to an image from 'btrfs image -c9 -w': https://www.dropbox.com/s/pl7gs305ej65u9q/altroot.btrfs.img?dl=0
(That link will expire in 30 days, let me know if you need access to it beyond that).
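For reference, the diagnostics above were gathered with roughly the following sequence (a sketch only; the device and file names are examples, and the chunk recover output came from the 'btrfs rescue chunk-recover' subcommand):
btrfs check /dev/mapper/vg-altroot.0            # exits 0, nothing reported
mount /dev/mapper/vg-altroot.0 /mnt/altroot     # fails, 'open_ctree failed' in the kernel log
btrfs rescue chunk-recover /dev/mapper/vg-altroot.0 > chunk-recover-output.txt
btrfs-image -c9 -w /dev/mapper/vg-altroot.0 altroot.btrfs.img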
The filesystems in question all see relatively light but consistent usage as targets for receiving daily incremental snapshots for on-system backups (and because I know someone will mention it, yes, I do have other backups of the data, these are just my online backups).
[-- Attachment #1.2: chunk-recover-output.txt --]
[-- Type: text/plain, Size: 13773 bytes --]
All Devices:
Device: id = 4, name = /dev/mapper/vg-altroot.3
Device: id = 3, name = /dev/mapper/vg-altroot.2
Device: id = 2, name = /dev/mapper/vg-altroot.1
Device: id = 1, name = /dev/vg/altroot.0
DEVICE SCAN RESULT:
Filesystem Information:
sectorsize: 4096
leafsize: 16384
tree root generation: 26
chunk root generation: 11
All Devices:
Device: id = 4, name = /dev/mapper/vg-altroot.3
Device: id = 3, name = /dev/mapper/vg-altroot.2
Device: id = 2, name = /dev/mapper/vg-altroot.1
Device: id = 1, name = /dev/vg/altroot.0
All Block Groups:
Block Group: start = 0, len = 4194304, flag = 2
Block Group: start = 4194304, len = 8388608, flag = 4
Block Group: start = 12582912, len = 8388608, flag = 1
Block Group: start = 20971520, len = 16777216, flag = 102
Block Group: start = 37748736, len = 2147483648, flag = 104
Block Group: start = 2185232384, len = 2147483648, flag = 101
Block Group: start = 4332716032, len = 2147483648, flag = 101
Block Group: start = 6480199680, len = 2147483648, flag = 101
Block Group: start = 8627683328, len = 2147483648, flag = 101
Block Group: start = 10775166976, len = 2147483648, flag = 101
All Chunks:
Chunk: start = 0, len = 4194304, type = 2, num_stripes = 1
Stripes list:
[ 0] Stripe: devid = 1, offset = 0
Chunk: start = 4194304, len = 8388608, type = 4, num_stripes = 1
Stripes list:
[ 0] Stripe: devid = 1, offset = 4194304
Chunk: start = 12582912, len = 8388608, type = 1, num_stripes = 1
Stripes list:
[ 0] Stripe: devid = 1, offset = 12582912
Chunk: start = 20971520, len = 16777216, type = 102, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 4, offset = 1048576
[ 1] Stripe: devid = 3, offset = 1048576
[ 2] Stripe: devid = 2, offset = 1048576
[ 3] Stripe: devid = 1, offset = 20971520
Chunk: start = 37748736, len = 2147483648, type = 104, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 4, offset = 9437184
[ 1] Stripe: devid = 3, offset = 9437184
[ 2] Stripe: devid = 2, offset = 9437184
[ 3] Stripe: devid = 1, offset = 29360128
Chunk: start = 2185232384, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 4, offset = 1083179008
[ 1] Stripe: devid = 3, offset = 1083179008
[ 2] Stripe: devid = 2, offset = 1083179008
[ 3] Stripe: devid = 1, offset = 1103101952
Chunk: start = 4332716032, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 2156920832
[ 1] Stripe: devid = 3, offset = 2156920832
[ 2] Stripe: devid = 4, offset = 2156920832
[ 3] Stripe: devid = 1, offset = 2176843776
Chunk: start = 6480199680, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 3230662656
[ 1] Stripe: devid = 3, offset = 3230662656
[ 2] Stripe: devid = 4, offset = 3230662656
[ 3] Stripe: devid = 1, offset = 3250585600
Chunk: start = 8627683328, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 4304404480
[ 1] Stripe: devid = 3, offset = 4304404480
[ 2] Stripe: devid = 4, offset = 4304404480
[ 3] Stripe: devid = 1, offset = 4324327424
Chunk: start = 10775166976, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 5378146304
[ 1] Stripe: devid = 3, offset = 5378146304
[ 2] Stripe: devid = 4, offset = 5378146304
[ 3] Stripe: devid = 1, offset = 5398069248
All Device Extents:
Device extent: devid = 1, start = 0, len = 4194304, chunk offset = 0
Device extent: devid = 1, start = 4194304, len = 8388608, chunk offset = 4194304
Device extent: devid = 1, start = 12582912, len = 8388608, chunk offset = 12582912
Device extent: devid = 1, start = 20971520, len = 8388608, chunk offset = 20971520
Device extent: devid = 1, start = 29360128, len = 1073741824, chunk offset = 37748736
Device extent: devid = 1, start = 1103101952, len = 1073741824, chunk offset = 2185232384
Device extent: devid = 1, start = 2176843776, len = 1073741824, chunk offset = 4332716032
Device extent: devid = 1, start = 3250585600, len = 1073741824, chunk offset = 6480199680
Device extent: devid = 1, start = 4324327424, len = 1073741824, chunk offset = 8627683328
Device extent: devid = 1, start = 5398069248, len = 1073741824, chunk offset = 10775166976
Device extent: devid = 2, start = 1048576, len = 8388608, chunk offset = 20971520
Device extent: devid = 2, start = 9437184, len = 1073741824, chunk offset = 37748736
Device extent: devid = 2, start = 1083179008, len = 1073741824, chunk offset = 2185232384
Device extent: devid = 2, start = 2156920832, len = 1073741824, chunk offset = 4332716032
Device extent: devid = 2, start = 3230662656, len = 1073741824, chunk offset = 6480199680
Device extent: devid = 2, start = 4304404480, len = 1073741824, chunk offset = 8627683328
Device extent: devid = 2, start = 5378146304, len = 1073741824, chunk offset = 10775166976
Device extent: devid = 3, start = 1048576, len = 8388608, chunk offset = 20971520
Device extent: devid = 3, start = 9437184, len = 1073741824, chunk offset = 37748736
Device extent: devid = 3, start = 1083179008, len = 1073741824, chunk offset = 2185232384
Device extent: devid = 3, start = 2156920832, len = 1073741824, chunk offset = 4332716032
Device extent: devid = 3, start = 3230662656, len = 1073741824, chunk offset = 6480199680
Device extent: devid = 3, start = 4304404480, len = 1073741824, chunk offset = 8627683328
Device extent: devid = 3, start = 5378146304, len = 1073741824, chunk offset = 10775166976
Device extent: devid = 4, start = 1048576, len = 8388608, chunk offset = 20971520
Device extent: devid = 4, start = 9437184, len = 1073741824, chunk offset = 37748736
Device extent: devid = 4, start = 1083179008, len = 1073741824, chunk offset = 2185232384
Device extent: devid = 4, start = 2156920832, len = 1073741824, chunk offset = 4332716032
Device extent: devid = 4, start = 3230662656, len = 1073741824, chunk offset = 6480199680
Device extent: devid = 4, start = 4304404480, len = 1073741824, chunk offset = 8627683328
Device extent: devid = 4, start = 5378146304, len = 1073741824, chunk offset = 10775166976
CHECK RESULT:
Recoverable Chunks:
Chunk: start = 0, len = 4194304, type = 2, num_stripes = 1
Stripes list:
[ 0] Stripe: devid = 1, offset = 0
Block Group: start = 0, len = 4194304, flag = 2
Device extent list:
[ 0]Device extent: devid = 1, start = 0, len = 4194304, chunk offset = 0
Chunk: start = 4194304, len = 8388608, type = 4, num_stripes = 1
Stripes list:
[ 0] Stripe: devid = 1, offset = 4194304
Block Group: start = 4194304, len = 8388608, flag = 4
Device extent list:
[ 0]Device extent: devid = 1, start = 4194304, len = 8388608, chunk offset = 4194304
Chunk: start = 12582912, len = 8388608, type = 1, num_stripes = 1
Stripes list:
[ 0] Stripe: devid = 1, offset = 12582912
Block Group: start = 12582912, len = 8388608, flag = 1
Device extent list:
[ 0]Device extent: devid = 1, start = 12582912, len = 8388608, chunk offset = 12582912
Chunk: start = 20971520, len = 16777216, type = 102, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 4, offset = 1048576
[ 1] Stripe: devid = 3, offset = 1048576
[ 2] Stripe: devid = 2, offset = 1048576
[ 3] Stripe: devid = 1, offset = 20971520
Block Group: start = 20971520, len = 16777216, flag = 102
Device extent list:
[ 0]Device extent: devid = 1, start = 20971520, len = 8388608, chunk offset = 20971520
[ 1]Device extent: devid = 2, start = 1048576, len = 8388608, chunk offset = 20971520
[ 2]Device extent: devid = 3, start = 1048576, len = 8388608, chunk offset = 20971520
[ 3]Device extent: devid = 4, start = 1048576, len = 8388608, chunk offset = 20971520
Chunk: start = 37748736, len = 2147483648, type = 104, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 4, offset = 9437184
[ 1] Stripe: devid = 3, offset = 9437184
[ 2] Stripe: devid = 2, offset = 9437184
[ 3] Stripe: devid = 1, offset = 29360128
Block Group: start = 37748736, len = 2147483648, flag = 104
Device extent list:
[ 0]Device extent: devid = 1, start = 29360128, len = 1073741824, chunk offset = 37748736
[ 1]Device extent: devid = 2, start = 9437184, len = 1073741824, chunk offset = 37748736
[ 2]Device extent: devid = 3, start = 9437184, len = 1073741824, chunk offset = 37748736
[ 3]Device extent: devid = 4, start = 9437184, len = 1073741824, chunk offset = 37748736
Chunk: start = 2185232384, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 4, offset = 1083179008
[ 1] Stripe: devid = 3, offset = 1083179008
[ 2] Stripe: devid = 2, offset = 1083179008
[ 3] Stripe: devid = 1, offset = 1103101952
Block Group: start = 2185232384, len = 2147483648, flag = 101
Device extent list:
[ 0]Device extent: devid = 1, start = 1103101952, len = 1073741824, chunk offset = 2185232384
[ 1]Device extent: devid = 2, start = 1083179008, len = 1073741824, chunk offset = 2185232384
[ 2]Device extent: devid = 3, start = 1083179008, len = 1073741824, chunk offset = 2185232384
[ 3]Device extent: devid = 4, start = 1083179008, len = 1073741824, chunk offset = 2185232384
Chunk: start = 4332716032, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 2156920832
[ 1] Stripe: devid = 3, offset = 2156920832
[ 2] Stripe: devid = 4, offset = 2156920832
[ 3] Stripe: devid = 1, offset = 2176843776
Block Group: start = 4332716032, len = 2147483648, flag = 101
Device extent list:
[ 0]Device extent: devid = 1, start = 2176843776, len = 1073741824, chunk offset = 4332716032
[ 1]Device extent: devid = 4, start = 2156920832, len = 1073741824, chunk offset = 4332716032
[ 2]Device extent: devid = 3, start = 2156920832, len = 1073741824, chunk offset = 4332716032
[ 3]Device extent: devid = 2, start = 2156920832, len = 1073741824, chunk offset = 4332716032
Chunk: start = 6480199680, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 3230662656
[ 1] Stripe: devid = 3, offset = 3230662656
[ 2] Stripe: devid = 4, offset = 3230662656
[ 3] Stripe: devid = 1, offset = 3250585600
Block Group: start = 6480199680, len = 2147483648, flag = 101
Device extent list:
[ 0]Device extent: devid = 1, start = 3250585600, len = 1073741824, chunk offset = 6480199680
[ 1]Device extent: devid = 4, start = 3230662656, len = 1073741824, chunk offset = 6480199680
[ 2]Device extent: devid = 3, start = 3230662656, len = 1073741824, chunk offset = 6480199680
[ 3]Device extent: devid = 2, start = 3230662656, len = 1073741824, chunk offset = 6480199680
Chunk: start = 8627683328, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 4304404480
[ 1] Stripe: devid = 3, offset = 4304404480
[ 2] Stripe: devid = 4, offset = 4304404480
[ 3] Stripe: devid = 1, offset = 4324327424
Block Group: start = 8627683328, len = 2147483648, flag = 101
Device extent list:
[ 0]Device extent: devid = 1, start = 4324327424, len = 1073741824, chunk offset = 8627683328
[ 1]Device extent: devid = 4, start = 4304404480, len = 1073741824, chunk offset = 8627683328
[ 2]Device extent: devid = 3, start = 4304404480, len = 1073741824, chunk offset = 8627683328
[ 3]Device extent: devid = 2, start = 4304404480, len = 1073741824, chunk offset = 8627683328
Chunk: start = 10775166976, len = 2147483648, type = 101, num_stripes = 4
Stripes list:
[ 0] Stripe: devid = 2, offset = 5378146304
[ 1] Stripe: devid = 3, offset = 5378146304
[ 2] Stripe: devid = 4, offset = 5378146304
[ 3] Stripe: devid = 1, offset = 5398069248
Block Group: start = 10775166976, len = 2147483648, flag = 101
Device extent list:
[ 0]Device extent: devid = 1, start = 5398069248, len = 1073741824, chunk offset = 10775166976
[ 1]Device extent: devid = 4, start = 5378146304, len = 1073741824, chunk offset = 10775166976
[ 2]Device extent: devid = 3, start = 5378146304, len = 1073741824, chunk offset = 10775166976
[ 3]Device extent: devid = 2, start = 5378146304, len = 1073741824, chunk offset = 10775166976
Unrecoverable Chunks:
Total Chunks: 10
Recoverable: 10
Unrecoverable: 0
Orphan Block Groups:
Orphan Device Extents:
Check chunks successfully with no orphans
Recover the chunk tree successfully.
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-14 11:49 BTRFS raid6 unmountable after a couple of days of usage Austin S Hemmelgarn
@ 2015-07-14 13:25 ` Austin S Hemmelgarn
2015-07-14 23:20 ` Chris Murphy
2015-07-16 11:49 ` Austin S Hemmelgarn
1 sibling, 1 reply; 12+ messages in thread
From: Austin S Hemmelgarn @ 2015-07-14 13:25 UTC (permalink / raw)
To: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 2297 bytes --]
On 2015-07-14 07:49, Austin S Hemmelgarn wrote:
> So, after experiencing this same issue multiple times (on almost a dozen different kernel versions since 4.0) and ruling out the possibility of it being caused by my hardware (or at least, the RAM, SATA controller and disk drives themselves), I've decided to report it here.
>
> The general symptom is that raid6 profile filesystems that I have are working fine for multiple weeks, until I either reboot or otherwise try to remount them, at which point the system refuses to mount them.
>
> I'm currently using btrfs-progs v4.1 with kernel 4.1.2, although I've been seeing this with versions of both since 4.0.
>
> Output of 'btrfs fi show' for the most recent fs that I had this issue with:
> Label: 'altroot' uuid: 86eef6b9-febe-4350-a316-4cb00c40bbc5
> Total devices 4 FS bytes used 9.70GiB
> devid 1 size 24.00GiB used 6.03GiB path /dev/mapper/vg-altroot.0
> devid 2 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.1
> devid 3 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.2
> devid 4 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.3
>
> btrfs-progs v4.1
>
> Each of the individual LVS that are in the FS is just a flat chunk of space on a separate disk from the others.
>
> The FS itself passes btrfs check just fine (no reported errors, exit value of 0), but the kernel refuses to mount it with the message 'open_ctree failed'.
>
> I've run btrfs chunk recover and attached the output from that.
>
> Here's a link to an image from 'btrfs image -c9 -w': https://www.dropbox.com/s/pl7gs305ej65u9q/altroot.btrfs.img?dl=0
> (That link will expire in 30 days, let me know if you need access to it beyond that).
>
> The filesystems in question all see relatively light but consistent usage as targets for receiving daily incremental snapshots for on-system backups (and because I know someone will mention it, yes, I do have other backups of the data, these are just my online backups).
>
Further update: I just tried mounting the filesystem imaged above again,
this time passing device= options for each device in the FS, and it now
seems to be working fine. I've tried this with the other filesystems,
however, and they still won't mount.
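For clarity, the mount that worked is along these lines (a sketch; the mountpoint is just an example path):
# List every member device explicitly with device= options:
mount -o device=/dev/mapper/vg-altroot.0,device=/dev/mapper/vg-altroot.1,device=/dev/mapper/vg-altroot.2,device=/dev/mapper/vg-altroot.3 \
    /dev/mapper/vg-altroot.0 /mnt/altroot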
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-14 13:25 ` Austin S Hemmelgarn
@ 2015-07-14 23:20 ` Chris Murphy
2015-07-15 11:07 ` Austin S Hemmelgarn
0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2015-07-14 23:20 UTC (permalink / raw)
To: Btrfs BTRFS
On Tue, Jul 14, 2015 at 7:25 AM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2015-07-14 07:49, Austin S Hemmelgarn wrote:
>>
>> So, after experiencing this same issue multiple times (on almost a dozen
>> different kernel versions since 4.0) and ruling out the possibility of it
>> being caused by my hardware (or at least, the RAM, SATA controller and disk
>> drives themselves), I've decided to report it here.
>>
>> The general symptom is that raid6 profile filesystems that I have are
>> working fine for multiple weeks, until I either reboot or otherwise try to
>> remount them, at which point the system refuses to mount them.
>>
>> I'm currently using btrfs-progs v4.1 with kernel 4.1.2, although I've been
>> seeing this with versions of both since 4.0.
>>
>> Output of 'btrfs fi show' for the most recent fs that I had this issue
>> with:
>> Label: 'altroot' uuid: 86eef6b9-febe-4350-a316-4cb00c40bbc5
>> Total devices 4 FS bytes used 9.70GiB
>> devid 1 size 24.00GiB used 6.03GiB path
>> /dev/mapper/vg-altroot.0
>> devid 2 size 24.00GiB used 6.01GiB path
>> /dev/mapper/vg-altroot.1
>> devid 3 size 24.00GiB used 6.01GiB path
>> /dev/mapper/vg-altroot.2
>> devid 4 size 24.00GiB used 6.01GiB path
>> /dev/mapper/vg-altroot.3
>>
>> btrfs-progs v4.1
>>
>> Each of the individual LVS that are in the FS is just a flat chunk of
>> space on a separate disk from the others.
>>
>> The FS itself passes btrfs check just fine (no reported errors, exit value
>> of 0), but the kernel refuses to mount it with the message 'open_ctree
>> failed'.
>>
>> I've run btrfs chunk recover and attached the output from that.
>>
>> Here's a link to an image from 'btrfs image -c9 -w':
>> https://www.dropbox.com/s/pl7gs305ej65u9q/altroot.btrfs.img?dl=0
>> (That link will expire in 30 days, let me know if you need access to it
>> beyond that).
>>
>> The filesystems in question all see relatively light but consistent usage
>> as targets for receiving daily incremental snapshots for on-system backups
>> (and because I know someone will mention it, yes, I do have other backups of
>> the data, these are just my online backups).
>>
> Further updates, I just tried mounting the filesystem from the image above
> again, this time passing device= options for each device in the FS, and it
> seems to be working fine now. I've tried this with the other filesystems
> however, and they still won't mount.
>
And is it the same message with the usual suspects: recovery,
ro,recovery? How about degraded, even though it's not degraded? And
what about 'btrfs rescue zero-log'?
Of course it's weird that btrfs check doesn't complain but mount
does. I don't understand that, so it's good you've got an image. If
either recovery or zero-log fixes the problem, my understanding is that
this suggests the hardware did something Btrfs didn't expect.
What about 'btrfs check --check-data-csum', which should act similarly
to a read-only scrub (different output, though)? Hmm, nah. The thing is
that the mount is failing on some aspect of metadata, not data.
So the fact that check (on metadata) passes but mount fails is a bug
somewhere...
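Concretely, the attempts I have in mind look something like this (a sketch; the device and mountpoint paths are just examples):
# Mount attempts with the usual recovery options:
mount -o recovery /dev/mapper/vg-altroot.0 /mnt/altroot
mount -o ro,recovery /dev/mapper/vg-altroot.0 /mnt/altroot
mount -o degraded /dev/mapper/vg-altroot.0 /mnt/altroot
# Reset the log tree (only matters if there is a dirty log to replay):
btrfs rescue zero-log /dev/mapper/vg-altroot.0
# Roughly a read-only scrub, done through the check tool:
btrfs check --check-data-csum /dev/mapper/vg-altroot.0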
--
Chris Murphy
* Re: BTRFS raid6 unmountable after a couple of days of usage.
[not found] <55A5B13C.6060009@spotprint.com.au>
@ 2015-07-15 6:53 ` Ryan Bourne
0 siblings, 0 replies; 12+ messages in thread
From: Ryan Bourne @ 2015-07-15 6:53 UTC (permalink / raw)
To: linux-btrfs
On 14/07/15 11:25 PM, Austin S Hemmelgarn wrote:
> On 2015-07-14 07:49, Austin S Hemmelgarn wrote:
>> So, after experiencing this same issue multiple times (on almost a
>> dozen different kernel versions since 4.0) and ruling out the
>> possibility of it being caused by my hardware (or at least, the RAM,
>> SATA controller and disk drives themselves), I've decided to report it
>> here.
>>
>> The general symptom is that raid6 profile filesystems that I have are
>> working fine for multiple weeks, until I either reboot or otherwise
>> try to remount them, at which point the system refuses to mount them.
>>
> Further updates, I just tried mounting the filesystem from the image
> above again, this time passing device= options for each device in the
> FS, and it seems to be working fine now. I've tried this with the other
> filesystems however, and they still won't mount.
>
I have experienced a similar problem on a raid1 with kernels from 3.17
onward following a kernel panic.
I have found that passing the other device as the main device to mount
will often work.
E.g.
# mount -o device=/dev/sdb,device=/dev/sdc /dev/sdb /mountpoint
open_ctree failed
# mount -o device=/dev/sdb,device=/dev/sdc /dev/sdc /mountpoint
mounts correctly.
If I then do an immediate umount and try again I get the same thing, but
after some time using the filesystem, I can umount and either device
works for the mount again.
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-14 23:20 ` Chris Murphy
@ 2015-07-15 11:07 ` Austin S Hemmelgarn
2015-07-15 15:45 ` Chris Murphy
0 siblings, 1 reply; 12+ messages in thread
From: Austin S Hemmelgarn @ 2015-07-15 11:07 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS
[-- Attachment #1: Type: text/plain, Size: 3994 bytes --]
On 2015-07-14 19:20, Chris Murphy wrote:
> On Tue, Jul 14, 2015 at 7:25 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2015-07-14 07:49, Austin S Hemmelgarn wrote:
>>>
>>> So, after experiencing this same issue multiple times (on almost a dozen
>>> different kernel versions since 4.0) and ruling out the possibility of it
>>> being caused by my hardware (or at least, the RAM, SATA controller and disk
>>> drives themselves), I've decided to report it here.
>>>
>>> The general symptom is that raid6 profile filesystems that I have are
>>> working fine for multiple weeks, until I either reboot or otherwise try to
>>> remount them, at which point the system refuses to mount them.
>>>
>>> I'm currently using btrfs-progs v4.1 with kernel 4.1.2, although I've been
>>> seeing this with versions of both since 4.0.
>>>
>>> Output of 'btrfs fi show' for the most recent fs that I had this issue
>>> with:
>>> Label: 'altroot' uuid: 86eef6b9-febe-4350-a316-4cb00c40bbc5
>>> Total devices 4 FS bytes used 9.70GiB
>>> devid 1 size 24.00GiB used 6.03GiB path
>>> /dev/mapper/vg-altroot.0
>>> devid 2 size 24.00GiB used 6.01GiB path
>>> /dev/mapper/vg-altroot.1
>>> devid 3 size 24.00GiB used 6.01GiB path
>>> /dev/mapper/vg-altroot.2
>>> devid 4 size 24.00GiB used 6.01GiB path
>>> /dev/mapper/vg-altroot.3
>>>
>>> btrfs-progs v4.1
>>>
>>> Each of the individual LVS that are in the FS is just a flat chunk of
>>> space on a separate disk from the others.
>>>
>>> The FS itself passes btrfs check just fine (no reported errors, exit value
>>> of 0), but the kernel refuses to mount it with the message 'open_ctree
>>> failed'.
>>>
>>> I've run btrfs chunk recover and attached the output from that.
>>>
>>> Here's a link to an image from 'btrfs image -c9 -w':
>>> https://www.dropbox.com/s/pl7gs305ej65u9q/altroot.btrfs.img?dl=0
>>> (That link will expire in 30 days, let me know if you need access to it
>>> beyond that).
>>>
>>> The filesystems in question all see relatively light but consistent usage
>>> as targets for receiving daily incremental snapshots for on-system backups
>>> (and because I know someone will mention it, yes, I do have other backups of
>>> the data, these are just my online backups).
>>>
>> Further updates, I just tried mounting the filesystem from the image above
>> again, this time passing device= options for each device in the FS, and it
>> seems to be working fine now. I've tried this with the other filesystems
>> however, and they still won't mount.
>>
>
> And it's the same message with the usual suspects: recovery,
> ro,recovery ? How about degraded even though it's not degraded? And
> what about 'btrfs rescue zero-log' ?
Yeah, same result for both, and zero-log didn't help (although that
doesn't really surprise me, as the filesystem was cleanly unmounted).
>
> Of course it's weird that btrfs check doesn't complain, but mount
> does. I don't understand that, so it's good you've got an image. If
> either recovery or zero-log fix the problem, my understanding is this
> suggests hardware did something Btrfs didn't expect.
I've run into cases like this in the past, although not recently (the
last time I remember it happening was back around 3.14, I think).
Interestingly, running check --repair in those cases did fix things,
even though it didn't complain about any issues either.
I've managed to get the other filesystems I was having issues with
mounted again with the device= options and clear_cache after running
btrfs dev scan a couple of times. It seems to me (at least from what
I'm seeing) that there is some metadata that isn't synchronized properly
between the disks. I've heard mention from multiple sources of similar
issues happening occasionally with raid1 back around kernel 3.16-3.17,
and passing a different device to mount helping with that.
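For the record, the sequence that eventually got them mounted looks roughly like this (a sketch; the mountpoint is an example, and the repeated scan is there because a single one wasn't always enough):
# Re-register all btrfs devices with the kernel:
btrfs device scan
btrfs device scan
# Then mount with the space cache cleared and every member device listed:
mount -o clear_cache,device=/dev/mapper/vg-altroot.0,device=/dev/mapper/vg-altroot.1,device=/dev/mapper/vg-altroot.2,device=/dev/mapper/vg-altroot.3 \
    /dev/mapper/vg-altroot.0 /mnt/altroot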
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-15 11:07 ` Austin S Hemmelgarn
@ 2015-07-15 15:45 ` Chris Murphy
2015-07-15 16:15 ` Hugo Mills
0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2015-07-15 15:45 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS
On Wed, Jul 15, 2015 at 5:07 AM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:
> I've managed to get the other filesystems I was having issues with mounted
> again with the device= options and clear_cache after running btrfs dev scan
> a couple of times. It seems to me (at least from what I'm seeing) that
> there is some metadata that isn't synchronized properly between the disks.
OK, see if this logic follows without mistakes:
The fs metadata is raid6, and therefore is broken up across all
drives. Since you successfully captured an image of the file system
with btrfs-image, clearly the user space tool is finding a minimum of
n-2 drives. If it didn't complain of missing drives, it found n drives.
And yet the kernel is not finding n drives. And even with degraded it
still won't mount, therefore it's finding fewer than n-2 drives.
By "drives" I mean either the physical device, or more likely whatever
minimal metadata is necessary for "assembling" all devices into a
volume. I don't know what that nugget of information is that's on each
physical device, separate from the superblocks (which I think are
distributed at logical addresses and therefore not on every physical
drive), or whether we have any tools to extract just that and debug it.
--
Chris Murphy
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-15 15:45 ` Chris Murphy
@ 2015-07-15 16:15 ` Hugo Mills
2015-07-15 21:29 ` Chris Murphy
0 siblings, 1 reply; 12+ messages in thread
From: Hugo Mills @ 2015-07-15 16:15 UTC (permalink / raw)
To: Chris Murphy; +Cc: Austin S Hemmelgarn, Btrfs BTRFS
[-- Attachment #1: Type: text/plain, Size: 2767 bytes --]
On Wed, Jul 15, 2015 at 09:45:17AM -0600, Chris Murphy wrote:
> On Wed, Jul 15, 2015 at 5:07 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> > I've managed to get the other filesystems I was having issues with mounted
> > again with the device= options and clear_cache after running btrfs dev scan
> > a couple of times. It seems to me (at least from what I'm seeing) that
> > there is some metadata that isn't synchronized properly between the disks.
>
> OK see if this logic follows without mistakes:
>
> The fs metadata is raid6, and therefore is broken up across all
> drives. Since you successfully captured an image of the file system
> with btrfs-image, clearly user space tool is finding a minimum of n-2
> drives. If it didn't complain of missing drives, it found n drives.
>
> And yet the kernel is not finding n drives. And even with degraded it
> still won't mount, therefore it's not finding n-2 drives.
>
> By "drives" I mean either the physical device, or more likely whatever
> minimal metadata is necessary for "assembling" all devices into a
> volume. I don't know what that nugget of information is that's on each
> physical device, separate from the superblocks (which I think is
> distributed at logical addresses and therefore not on every physical
> drive), and if we have any tools to extract just that and debug it.
There is at least one superblock on every device, usually two, and
often three. Each superblock contains the virtual address of the roots
of the root tree, the chunk tree and the log tree. Those are useless
without having the chunk tree, so there's also some information about
the chunk tree appended to the end of each superblock to bootstrap the
virtual address space lookup.
The information at the end of the superblock seems to be a list of
packed (key, struct btrfs_chunk) pairs for the System chunks. The
struct btrfs_chunk contains info about the chunk as a whole, and each
stripe making it up. The stripe information is a devid, an offset
(presumably a physical address on the device), and a UUID.
So, from btrfs dev scan the kernel has all the devid to (major,
minor) mappings for devices. From one device, it reads a superblock,
gets the list of (devid, offset) for the System chunks at the end of
that superblock, and can then identify the location of the System
chunks to read the full chunk tree. Once it's got the chunk tree, it
can do virtual->physical lookups, and the root tree and log tree
locations make sense.
I don't know whether btrfs-image works any differently from that,
or if so, how it differs.
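(As a rough illustration of the "usually two, often three" point: the superblock copies live at fixed byte offsets of 64KiB, 64MiB and 256GiB, with the magic string 64 bytes into each copy, so a quick sketch to see which copies actually exist on each device could look like the following. The device paths are the ones from earlier in the thread; everything else is illustrative only.)
# Probe each fixed superblock offset for the btrfs magic string.
# On a 24GiB device only the first two copies will be present.
for dev in /dev/mapper/vg-altroot.{0,1,2,3}; do
    echo "== $dev =="
    for off in $((64*1024)) $((64*1024*1024)) $((256*1024*1024*1024)); do
        dd if="$dev" bs=1 skip=$((off + 64)) count=8 2>/dev/null |
            grep -q '_BHRfS_M' && echo "  superblock copy at byte offset $off"
    done
done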
Hugo.
--
Hugo Mills | Radio is superior to television: the pictures are
hugo@... carfax.org.uk | better
http://carfax.org.uk/ |
PGP: E2AB1DE4 |
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-15 16:15 ` Hugo Mills
@ 2015-07-15 21:29 ` Chris Murphy
2015-07-16 11:41 ` Austin S Hemmelgarn
0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2015-07-15 21:29 UTC (permalink / raw)
To: Btrfs BTRFS
On Wed, Jul 15, 2015 at 10:15 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> There is at least one superblock on every device, usually two, and
> often three. Each superblock contains the virtual address of the roots
> of the root tree, the chunk tree and the log tree. Those are useless
> without having the chunk tree, so there's also some information about
> the chunk tree appended to the end of each superblock to bootstrap the
> virtual address space lookup.
So maybe Austin can use btrfs-show-super -a on every device and see if
there's anything different on some of the devices that shouldn't be
different? There must be something the kernel is tripping over that
the user space tools aren't for some reason.
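Something like this, as a sketch (the output file names are arbitrary, and differences in the per-device dev_item fields, device UUIDs and checksums are expected; anything else differing would be interesting):
# Dump every superblock copy from each device, then diff against the
# first device's dump.
for dev in /dev/mapper/vg-altroot.{0,1,2,3}; do
    btrfs-show-super -a "$dev" > /tmp/sb.$(basename "$dev").txt
done
for n in 1 2 3; do
    diff -u /tmp/sb.vg-altroot.0.txt /tmp/sb.vg-altroot.$n.txt
done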
--
Chris Murphy
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-15 21:29 ` Chris Murphy
@ 2015-07-16 11:41 ` Austin S Hemmelgarn
2015-08-25 18:12 ` Austin S Hemmelgarn
0 siblings, 1 reply; 12+ messages in thread
From: Austin S Hemmelgarn @ 2015-07-16 11:41 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS
[-- Attachment #1: Type: text/plain, Size: 1318 bytes --]
On 2015-07-15 17:29, Chris Murphy wrote:
> On Wed, Jul 15, 2015 at 10:15 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>
>> There is at least one superblock on every device, usually two, and
>> often three. Each superblock contains the virtual address of the roots
>> of the root tree, the chunk tree and the log tree. Those are useless
>> without having the chunk tree, so there's also some information about
>> the chunk tree appended to the end of each superblock to bootstrap the
>> virtual address space lookup.
>
> So maybe Austin can use btrfs-show-super -a on every device and see if
> there's anything different on some of the devices, that shouldn't be
> different? There must be something the kernel is tripping over that
> the use space tools aren't for some reason.
>
>
>
>
I actually did do so when this happened most recently (I just didn't
think to mention it in the most recent e-mail), and nothing appeared to
be different either between devices or within a given device (IIRC,
there are 2 superblocks per device in each of the filesystems in question).
I'm going to try to reproduce this in a VM for inspection, as all the
filesystems I had this issue with now seem to be working fine (with the
exception of some errors in the data blocks of one that got caught by
scrub).
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-14 11:49 BTRFS raid6 unmountable after a couple of days of usage Austin S Hemmelgarn
2015-07-14 13:25 ` Austin S Hemmelgarn
@ 2015-07-16 11:49 ` Austin S Hemmelgarn
2015-08-25 18:09 ` Austin S Hemmelgarn
1 sibling, 1 reply; 12+ messages in thread
From: Austin S Hemmelgarn @ 2015-07-16 11:49 UTC (permalink / raw)
To: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 2280 bytes --]
On 2015-07-14 07:49, Austin S Hemmelgarn wrote:
> So, after experiencing this same issue multiple times (on almost a dozen different kernel versions since 4.0) and ruling out the possibility of it being caused by my hardware (or at least, the RAM, SATA controller and disk drives themselves), I've decided to report it here.
>
> The general symptom is that raid6 profile filesystems that I have are working fine for multiple weeks, until I either reboot or otherwise try to remount them, at which point the system refuses to mount them.
>
> I'm currently using btrfs-progs v4.1 with kernel 4.1.2, although I've been seeing this with versions of both since 4.0.
>
> Output of 'btrfs fi show' for the most recent fs that I had this issue with:
> Label: 'altroot' uuid: 86eef6b9-febe-4350-a316-4cb00c40bbc5
> Total devices 4 FS bytes used 9.70GiB
> devid 1 size 24.00GiB used 6.03GiB path /dev/mapper/vg-altroot.0
> devid 2 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.1
> devid 3 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.2
> devid 4 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.3
>
> btrfs-progs v4.1
>
> Each of the individual LVS that are in the FS is just a flat chunk of space on a separate disk from the others.
>
> The FS itself passes btrfs check just fine (no reported errors, exit value of 0), but the kernel refuses to mount it with the message 'open_ctree failed'.
>
> I've run btrfs chunk recover and attached the output from that.
>
> Here's a link to an image from 'btrfs image -c9 -w': https://www.dropbox.com/s/pl7gs305ej65u9q/altroot.btrfs.img?dl=0
> (That link will expire in 30 days, let me know if you need access to it beyond that).
>
> The filesystems in question all see relatively light but consistent usage as targets for receiving daily incremental snapshots for on-system backups (and because I know someone will mention it, yes, I do have other backups of the data, these are just my online backups).
>
A secondary but possibly related issue: I'm seeing similar problems with
all data/metadata profiles when using BTRFS on top of a dm-thinp volume
with zeroing mode turned off (that is, discard doesn't clear data from
the areas that were discarded).
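For context, the kind of setup being described is roughly the following (a sketch only; the VG/LV names and sizes are made up, and "zeroing mode" here refers to the thin pool's block-zeroing feature, disabled with lvcreate -Z n):
# Thin pool with block zeroing disabled, and btrfs on a thin volume from it.
lvcreate --type thin-pool -Z n -L 40G -n thinpool vg
lvcreate -V 24G --thinpool vg/thinpool -n thinvol
mkfs.btrfs /dev/vg/thinvol
mount -o discard /dev/vg/thinvol /mnt/thin-test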
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-16 11:49 ` Austin S Hemmelgarn
@ 2015-08-25 18:09 ` Austin S Hemmelgarn
0 siblings, 0 replies; 12+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-25 18:09 UTC (permalink / raw)
To: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 2715 bytes --]
On 2015-07-16 07:49, Austin S Hemmelgarn wrote:
> On 2015-07-14 07:49, Austin S Hemmelgarn wrote:
>> So, after experiencing this same issue multiple times (on almost a
>> dozen different kernel versions since 4.0) and ruling out the
>> possibility of it being caused by my hardware (or at least, the RAM,
>> SATA controller and disk drives themselves), I've decided to report it
>> here.
>>
>> The general symptom is that raid6 profile filesystems that I have are
>> working fine for multiple weeks, until I either reboot or otherwise
>> try to remount them, at which point the system refuses to mount them.
>>
>> I'm currently using btrfs-progs v4.1 with kernel 4.1.2, although I've
>> been seeing this with versions of both since 4.0.
>>
>> Output of 'btrfs fi show' for the most recent fs that I had this issue
>> with:
>> Label: 'altroot' uuid: 86eef6b9-febe-4350-a316-4cb00c40bbc5
>> Total devices 4 FS bytes used 9.70GiB
>> devid 1 size 24.00GiB used 6.03GiB path /dev/mapper/vg-altroot.0
>> devid 2 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.1
>> devid 3 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.2
>> devid 4 size 24.00GiB used 6.01GiB path /dev/mapper/vg-altroot.3
>>
>> btrfs-progs v4.1
>>
>> Each of the individual LVS that are in the FS is just a flat chunk of
>> space on a separate disk from the others.
>>
>> The FS itself passes btrfs check just fine (no reported errors, exit
>> value of 0), but the kernel refuses to mount it with the message
>> 'open_ctree failed'.
>>
>> I've run btrfs chunk recover and attached the output from that.
>>
>> Here's a link to an image from 'btrfs image -c9 -w':
>> https://www.dropbox.com/s/pl7gs305ej65u9q/altroot.btrfs.img?dl=0
>> (That link will expire in 30 days, let me know if you need access to
>> it beyond that).
>>
>> The filesystems in question all see relatively light but consistent
>> usage as targets for receiving daily incremental snapshots for
>> on-system backups (and because I know someone will mention it, yes, I
>> do have other backups of the data, these are just my online backups).
>>
> Secondary but possibly related issue, I'm seeing similar issues with all
> data/metadata profiles when using BTRFS on top of a dm-thinp volume with
> zeroing-mode turned off (that is, discard doesn't clear data from the
> areas that were discarded).
>
Following up further on this specific issue, I've tracked this down to
dm-thinp not clearing the discard_zeroes_data flag on the devices when
you turn off zeroing mode. I'm going to do some more digging regarding
that and will probably send a patch to lkml to fix it.
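For anyone who wants to check the same thing, the flag is visible in sysfs; a rough sketch (the LV name is an example, and discard_zeroes_data is the queue attribute as it exists in kernels of this era):
# /dev/mapper names are symlinks to the underlying /dev/dm-N node.
dev=$(readlink -f /dev/mapper/vg-thinvol)
cat /sys/block/$(basename "$dev")/queue/discard_zeroes_data
# Per the diagnosis above, this still reads 1 after pool zeroing is
# turned off, when it arguably should read 0.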
* Re: BTRFS raid6 unmountable after a couple of days of usage.
2015-07-16 11:41 ` Austin S Hemmelgarn
@ 2015-08-25 18:12 ` Austin S Hemmelgarn
0 siblings, 0 replies; 12+ messages in thread
From: Austin S Hemmelgarn @ 2015-08-25 18:12 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS
[-- Attachment #1: Type: text/plain, Size: 1759 bytes --]
On 2015-07-16 07:41, Austin S Hemmelgarn wrote:
> On 2015-07-15 17:29, Chris Murphy wrote:
>> On Wed, Jul 15, 2015 at 10:15 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>
>>> There is at least one superblock on every device, usually two, and
>>> often three. Each superblock contains the virtual address of the roots
>>> of the root tree, the chunk tree and the log tree. Those are useless
>>> without having the chunk tree, so there's also some information about
>>> the chunk tree appended to the end of each superblock to bootstrap the
>>> virtual address space lookup.
>>
>> So maybe Austin can use btrfs-show-super -a on every device and see if
>> there's anything different on some of the devices, that shouldn't be
>> different? There must be something the kernel is tripping over that
>> the use space tools aren't for some reason.
>>
>>
>>
>>
> I actually did do so when this happened most recently (I just didn't
> think to mention it in the most recent e-mail), and nothing appeared to
> be different either between devices or within a given device (IIRC,
> there's 2 sb per device in each of the filesystems in question).
>
> I'm going to try and reproduce this in a VM for inspection as all the
> filesystems I had this issue with now seem to be working fine (with the
> exception of some errors in the data blocks of one that got caught by
> scrub).
>
After a long time of trying to reproduce this in a virtual machine
with no success, and increasing issues from the motherboard in the
system in question culminating in it dying completely, I'm pretty sure
now that this was in fact a hardware problem and not a bug in BTRFS.
I've been unable to reproduce it at all after replacing the motherboard.