linux-btrfs.vger.kernel.org archive mirror
* How to fix/remove "csum failed ino" error
From: Anatol Pomozov @ 2013-11-07 14:42 UTC
  To: linux-btrfs

Hi

I use Arch Linux, kernel 3.11.6.

Recently I had a disk crash and a number of my files got corrupted. To
avoid this situation in the future I added more disks and am trying to
convert the data to raid1:

# btrfs balance start -dconvert=raid1 -mconvert=raid1 /

Unfortunately it fails with an I/O error. In dmesg I see:

[ 5374.216320] BTRFS info (device sda3): csum failed ino 362 off 4993024 csum 1283121890 private 3720296651
[ 5374.219656] BTRFS info (device sda3): csum failed ino 362 off 5242880 csum 857237386 private 2562492866
[ 5374.222628] BTRFS info (device sda3): csum failed ino 362 off 5767168 csum 645194099 private 3149624654
[ 5374.223068] BTRFS info (device sda3): csum failed ino 362 off 4993024 csum 1283121890 private 3720296651

It looks like some files are corrupted. I would like to either
fix/regenerate those files (e.g. reinstall from packages) or remove
them (as they are corrupted anyway).

But I need to know which files these are. The "ino 362" mentioned in
the message does not exist on the file system:

# find / -mount -inum 362
finds nothing.

So I assume this ino is some internal identifier. I checked function
btrfs_ino() from btrfs_inode.h and the output value can be either
BTRFS_I(inode)->location.objectid
or
inode->i_ino

I believe 362 is BTRFS_I(inode)->location.objectid

So my question is: how do I find the file that corresponds to this id?
How do I remove this object and finally complete the raid1 conversion?
Also, is it possible to improve the error message so users can find the
failing objects (e.g. include the real inode number)?
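
For anyone who wants to pull the numbers out of "csum failed" messages like the ones quoted above, here is a minimal Python sketch. It assumes the 3.11-era message format shown in this mail; the function name parse_csum_failures and the sample text are only illustrative, and other kernel versions may word the message differently.

```python
import re

# Field names follow the words used in the kernel message itself.
CSUM_RE = re.compile(
    r"csum failed ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>\d+) private (?P<private>\d+)"
)


def parse_csum_failures(log_text):
    """Return one dict of integers per 'csum failed' message in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in CSUM_RE.finditer(flat)
    ]


SAMPLE = """\
[ 5374.216320] BTRFS info (device sda3): csum failed ino 362 off
4993024 csum 1283121890 private 3720296651
[ 5374.219656] BTRFS info (device sda3): csum failed ino 362 off
5242880 csum 857237386 private 2562492866
"""

failures = parse_csum_failures(SAMPLE)
for failure in failures:
    # Show the object id, the offset, and whether the offset is 4 KiB aligned.
    print(failure["ino"], failure["off"], failure["off"] % 4096 == 0)
```

Grouping the results by "ino" gives the list of object ids to look up; the messages themselves do not say which tree each id belongs to.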

* Re: How to fix/remove "csum failed ino" error
From: Frank Holton @ 2013-11-07 16:41 UTC
  To: Anatol Pomozov; +Cc: linux-btrfs

Hey Anatol,

I just checked and on my filesystem inode number 362 corresponds to
part of the free space cache. You can check this yourself by running
(as root)

btrfs-debug-tree /dev/sdb | grep "(362 " -A 3 -B 1

where /dev/sdb is one of the devices from your filesystem.

It printed the following for me, note the location key (362
INODE_ITEM) under the FREE_SPACE key. Yours might be different but if
you see FREE_SPACE that points to the free space cache.

        item 100 key (362 INODE_ITEM 0) itemoff 21857 itemsize 160
                inode generation 2004 transid 2004 size 262144 block group 0 mode 100600 links 1
        item 101 key (362 EXTENT_DATA 0) itemoff 21804 itemsize 53
                extent data disk byte 41903296512 nr 262144
                extent data offset 0 nr 262144 ram 262144
                extent compression 0
--
        item 148 key (FREE_SPACE UNTYPED 113845993472) itemoff 23807 itemsize 41
                location key (362 INODE_ITEM 0)
                cache generation 2004 entries 2 bitmaps 0
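
The check described above (look for a location key that sits under a FREE_SPACE item) can also be sketched in Python, working on saved btrfs-debug-tree output. This is only a sketch: it assumes the text layout shown in this mail, and the function name free_space_cache_inodes is made up for illustration.

```python
import re

ITEM_RE = re.compile(r"^\s*item \d+ key \(")
FREE_SPACE_RE = re.compile(r"key \(FREE_SPACE UNTYPED (\d+)\)")
LOCATION_RE = re.compile(r"location key \((\d+) INODE_ITEM 0\)")


def free_space_cache_inodes(dump_text):
    """Map object id -> FREE_SPACE key offset for free space cache inodes."""
    result = {}
    pending_offset = None  # key offset of the FREE_SPACE item being read
    for line in dump_text.splitlines():
        if ITEM_RE.match(line):
            # A new item starts: remember it only if it is a FREE_SPACE item.
            match = FREE_SPACE_RE.search(line)
            pending_offset = int(match.group(1)) if match else None
            continue
        match = LOCATION_RE.search(line)
        if match and pending_offset is not None:
            result[int(match.group(1))] = pending_offset
            pending_offset = None
    return result


SAMPLE = """\
        item 100 key (362 INODE_ITEM 0) itemoff 21857 itemsize 160
                inode generation 2004 transid 2004 size 262144 block group 0 mode 100600 links 1
        item 148 key (FREE_SPACE UNTYPED 113845993472) itemoff 23807 itemsize 41
                location key (362 INODE_ITEM 0)
                cache generation 2004 entries 2 bitmaps 0
"""

cache_inodes = free_space_cache_inodes(SAMPLE)
print(362 in cache_inodes)
```

If the "ino" from the csum message shows up in this mapping, the failing object is a free space cache file rather than a user-visible file.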

I would try to mount the filesystem with the clear_cache option, which
should clear the current space cache and force a rebuild.

Hope that helps,
Frank




* Re: How to fix/remove "csum failed ino" error
From: Anatol Pomozov @ 2013-11-08  4:07 UTC
  To: Frank Holton; +Cc: linux-btrfs

Hi, Frank

Thanks for your answer.

On Thu, Nov 7, 2013 at 8:41 AM, Frank Holton <fholton@gmail.com> wrote:
> Hey Anatol,
>
> I just checked and on my filesystem inode number 362 corresponds to
> part of the free space cache. You can check this yourself by running
> (as root)
>
> btrfs-debug-tree /dev/sdb | grep "(362 " -A 3 -B 1
>
> where /dev/sdb is one of the devices from your filesystem.
>
> It printed the following for me, note the location key (362
> INODE_ITEM) under the FREE_SPACE key. Yours might be different but if
> you see FREE_SPACE that points to the free space cache.
>
> item 100 key (362 INODE_ITEM 0) itemoff 21857 itemsize 160
>                 inode generation 2004 transid 2004 size 262144 block
> group 0 mode 100600 links 1
>         item 101 key (362 EXTENT_DATA 0) itemoff 21804 itemsize 53
>                 extent data disk byte 41903296512 nr 262144
>                 extent data offset 0 nr 262144 ram 262144
>                 extent compression 0
> --
>         item 148 key (FREE_SPACE UNTYPED 113845993472) itemoff 23807 itemsize 41
>                 location key (362 INODE_ITEM 0)
>                 cache generation 2004 entries 2 bitmaps 0


Indeed, my case is similar to yours:

# btrfs-debug-tree /dev/sda3 | grep "(309 " -A 3 -B 1

item 1 key (309 INODE_ITEM 0) itemoff 3675 itemsize 160
    inode generation 190480 transid 190647 size 0 block group 0 mode 100600 links 1
item 51 key (FREE_SPACE UNTYPED 56937676800) itemoff 1863 itemsize 41
    location key (309 INODE_ITEM 0)

So I mounted my filesystem with the 'clear_cache' option:

# mount -o clear_cache /dev/sda3 mydata/

mount says:
/dev/sdc1 on /root/mydata type btrfs (rw,relatime,space_cache,clear_cache)

dmesg also mentions the cache:

[  634.991845] device fsid 25e6a6fa-fe1f-4be5-a638-eeac948f8c21 devid 9 transid 190479 /dev/sda3
[  634.993431] btrfs: force clearing of disk cache
[  634.993435] btrfs: disk space caching is enabled
[  635.046803] btrfs: bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 58481, gen 0


Then I started the raid1 rebalance, but the error is still present:

[ 1571.787664] BTRFS info (device sda3): csum failed ino 309 off 4993024 csum 1283121890 private 3720296651
[ 1571.791027] BTRFS info (device sda3): csum failed ino 309 off 5242880 csum 857237386 private 2562492866
[ 1571.793998] BTRFS info (device sda3): csum failed ino 309 off 5767168 csum 645194099 private 3149624654
[ 1571.794389] BTRFS info (device sda3): csum failed ino 309 off 4993024 csum 1283121890 private 3720296651


So my problem still exists. How do I fix the block with the wrong csum?

* Re: How to fix/remove "csum failed ino" error
From: Anatol Pomozov @ 2013-11-08  4:55 UTC
  To: Frank Holton; +Cc: linux-btrfs

Hi

I ran btrfsck hoping that it would fix the filesystem so that 'balance'
would no longer crash. But btrfsck itself crashed :(

# btrfsck --repair /dev/sda3
enabling repair mode
Checking filesystem on /dev/sda3
UUID: 25e6a6fa-fe1f-4be5-a638-eeac948f8c21
checking extents
checking fs roots
root 5 inode 522858 errors 1000
root 5 inode 1437358 errors 1000
root 5 inode 1437359 errors 1000
root 5 inode 1437360 errors 1000
root 5 inode 1437361 errors 1000
root 5 inode 1437362 errors 1000
root 5 inode 1437363 errors 1000
root 5 inode 1437368 errors 1000
root 5 inode 1437369 errors 1000
root 5 inode 1437370 errors 1000
root 5 inode 1437371 errors 1000
root 5 inode 1437372 errors 1000
root 5 inode 1437373 errors 1000
root 5 inode 1437374 errors 1000
root 5 inode 1437375 errors 1000
root 5 inode 1437376 errors 1000
root 5 inode 1437377 errors 1000
root 5 inode 1437378 errors 1000
root 5 inode 1437379 errors 1000
root 5 inode 1437380 errors 1000
root 5 inode 1437381 errors 1000
root 5 inode 1437382 errors 1000
root 5 inode 1437383 errors 1000
root 5 inode 1437384 errors 1000
root 5 inode 1437385 errors 1000
root 5 inode 1437386 errors 1000
root 5 inode 1437387 errors 1000
root 5 inode 1437388 errors 1000
root 5 inode 1437389 errors 1000
root 5 inode 1437390 errors 1000
root 5 inode 1437391 errors 1000
root 5 inode 1437392 errors 1000
root 5 inode 1437393 errors 1000
root 5 inode 1437394 errors 1000
root 5 inode 1437395 errors 1000
root 5 inode 1437396 errors 1000
root 5 inode 1437397 errors 1000
root 5 inode 1437398 errors 1000
root 5 inode 1437399 errors 1000
root 5 inode 1437400 errors 1000
root 5 inode 5073119 errors 400
Unable to find block group for 0
btrfsck: extent-tree.c:284: find_search_start: Assertion `!(1)' failed.
[1]    583 abort (core dumped)  btrfsck --repair /dev/sda3


* Re: How to fix/remove "csum failed ino" error
From: Frank Holton @ 2013-11-08  5:27 UTC
  To: Anatol Pomozov; +Cc: linux-btrfs

Hi Anatol,

That certainly does not look good; it is definitely more than just a
bad space cache. At this point I would strongly suggest that, before
you try anything else on the file system, you make sure you have a
backup of everything on it. After you have backed up everything, a
scrub may be able to fix some of the corruption or at least tell you
which files are corrupted (the names are printed to the kernel log).
It's also possible that this will lock up the kernel again, so be
prepared for that. Since you are not on raid1 yet, scrub cannot fix
those files; it only reports the ones with checksum errors, so you
would need to delete them manually.

As root:
btrfs scrub start /mount_point

Another option you can try is to mount with the recovery option.

But before you go any further ensure you have good backups separate
from your BTRFS file system.

Hope some of that helps.




* Re: How to fix/remove "csum failed ino" error
From: Chris Murphy @ 2013-11-08  5:56 UTC
  To: Btrfs BTRFS

What's the kernel and btrfs progs version?

I wish the dmesg errors were more explicit about the nature of checksum errors: do the two metadata checksums mismatch each other (one of them matches with data), or the metadata checksums match each other but mismatch with data?

Hopefully I'm mistaken, but it looks in this case that the data is actually corrupt, not the metadata. In which case repair would only be possible for raid1 or raid10 data profiles.

Why the corruption occurred depends on the kernel and btrfs-progs versions and on what you were doing prior to the corruption, so dmesg output from before the corruption would be needed, as well as dmesg output from an attempt to mount with the recovery option. It might be worth trying:

dmesg
[note the last time entry]
dmesg -n7
mount -o recovery <dev> <mp>
dmesg

Then report the results since the previously noted last time entry.
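
The "report results since the noted time entry" step can be done by hand, or with a small filter like the Python sketch below. It assumes dmesg output with the usual "[ seconds.micros]" prefix; the function name entries_since is made up for this example, and the sample lines are taken from earlier in the thread.

```python
import re

TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def entries_since(dmesg_text, last_seen):
    """Return the dmesg lines whose timestamp is greater than last_seen.

    Lines without a timestamp are treated as continuations of the
    previous line, which is how wrapped messages look in a mail.
    """
    kept = []
    keep = False
    for line in dmesg_text.splitlines():
        match = TIMESTAMP_RE.match(line)
        if match:
            keep = float(match.group(1)) > last_seen
        if keep:
            kept.append(line)
    return kept


SAMPLE = """\
[  634.991845] device fsid 25e6a6fa-fe1f-4be5-a638-eeac948f8c21 devid 9 transid 190479 /dev/sda3
[  634.993431] btrfs: force clearing of disk cache
[  634.993435] btrfs: disk space caching is enabled
[  635.046803] btrfs: bdev /dev/sda3 errs: wr 0, rd 0, flush 0,
corrupt 58481, gen 0
"""

new_lines = entries_since(SAMPLE, 634.993431)
print("\n".join(new_lines))
```

Here 634.993431 plays the role of the last time entry noted before the mount attempt.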


Chris Murphy

* Re: How to fix/remove "csum failed ino" error
From: Hugo Mills @ 2013-11-08  8:13 UTC
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Thu, Nov 07, 2013 at 10:56:10PM -0700, Chris Murphy wrote:
> What's the kernel and btrfs progs version?
> 
> I wish the dmesg errors were more explicit about the nature of checksum errors: do the two metadata checksums mismatch each other (one of them matches with data), or the metadata checksums match each other but mismatch with data?

   If there's two copies, and one fails the checksum and the other
passes, then there will be a note following the failure that it
read the other checksum and succeeded, and repaired the problem.


-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- For months now, we have been making triumphant retreats ---     
               before a demoralised enemy who is advancing               
                           in utter disorder.                            

* Re: How to fix/remove "csum failed ino" error
From: Chris Murphy @ 2013-11-08 17:22 UTC
  To: Btrfs BTRFS


On Nov 8, 2013, at 1:13 AM, Hugo Mills <hugo@carfax.org.uk> wrote:

> On Thu, Nov 07, 2013 at 10:56:10PM -0700, Chris Murphy wrote:
>> What's the kernel and btrfs progs version?
>> 
>> I wish the dmesg errors were more explicit about the nature of checksum errors: do the two metadata checksums mismatch each other (one of them matches with data), or the metadata checksums match each other but mismatch with data?
> 
>   If there's two copies, and one fails the checksum and the other
> passes, then there will be a note following the failure that it
> read the other checksum and succeeded, and repaired the problem.

OK, I guess that's fairly explicit. A more explicit message that data is corrupt in the current case may be the wrong thing to do, because all we know is that the data checksum doesn't match what's stored in metadata. We don't actually know the significance of the data corruption.

A bigger problem that was brought up a while ago is whether we still have a way to read data that fails its checksum, so that we can attempt application-level reconstruction. I'm also not seeing, in the original report, any kernel messages that show which files (by path) have been affected, which also makes the error messages harder to understand.

Chris Murphy

* Re: How to fix/remove "csum failed ino" error
From: Anatol Pomozov @ 2013-11-09  2:50 UTC
  To: Frank Holton; +Cc: linux-btrfs

Hi, Frank, thanks for your help again.

Continuing my filesystem recovery saga.

btrfsck $DEVICE

fails and says some files are corrupted. That is because of my recent
disk crash. I found all these files and indeed, reading them produces
an error. I removed those files and ran btrfsck again. Now it fails
with

Unable to find block group for 0
btrfsck: extent-tree.c:284: find_search_start: Assertion `!(1)' failed.

Hmm... Google suggested running

btrfsck --init-extent-tree $DEVICE

and surprisingly the problem went away and btrfsck finished
successfully. It looks promising. (BTW, having man pages for btrfsck
would be really helpful.)

Now it is time to scrub the filesystem. It still shows a bunch of problems:

scrub status for 25e6a6fa-fe1f-4be5-a638-eeac948f8c21
scrub started at Fri Nov  8 18:45:07 2013 and finished after 16564 seconds
total bytes scrubbed: 3.62TB with 90145 errors
error details: csum=90145
corrected errors: 0, uncorrectable errors: 90145, unverified errors: 0
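
To get a feel for how much data those counters represent, here is a rough Python sketch. It assumes that every csum error covers one 4096-byte block (the kernel messages in this mail report "length 4096", but that is an assumption, not something the scrub output states); the function name summarize_scrub is made up for illustration.

```python
import re

BLOCK_SIZE = 4096  # assumption: one csum error per 4 KiB block


def summarize_scrub(status_text):
    """Extract the error counters from scrub status text like the above."""
    counters = {}
    for name in ("corrected", "uncorrectable", "unverified"):
        match = re.search(name + r" errors: (\d+)", status_text)
        if match:
            counters[name] = int(match.group(1))
    return counters


SAMPLE = """\
total bytes scrubbed: 3.62TB with 90145 errors
error details: csum=90145
corrected errors: 0, uncorrectable errors: 90145, unverified errors: 0
"""

counters = summarize_scrub(SAMPLE)
estimated_bytes = counters["uncorrectable"] * BLOCK_SIZE
print(counters, estimated_bytes, round(estimated_bytes / 2**20, 1))
```

Under that assumption the 90145 uncorrectable errors correspond to roughly 352 MiB of affected blocks, a small fraction of the 3.62TB scrubbed.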


Here is part of dmesg:

[ 7162.786759] btrfs: checksum error at logical 7915755577344 on dev /dev/sdd, sector 125636176, root 5, inode 5224916, offset 44695552, length 4096, links 1 (path: var/log/journal/b4d8ffd8ac454d02849f8c8925432368/system@dcae59172d794892a7ca0cdc2d381fa3-000000000018ee6d-0004eaac687ad6bb.journal)
[ 7162.786766] btrfs: checksum error at logical 7915759673344 on dev /dev/sdd, sector 125644176, root 5, inode 5224916, offset 48791552, length 4096, links 1 (path: var/log/journal/b4d8ffd8ac454d02849f8c8925432368/system@dcae59172d794892a7ca0cdc2d381fa3-000000000018ee6d-0004eaac687ad6bb.journal)
[ 7162.786767] btrfs_dev_stat_print_on_error: 9 callbacks suppressed
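
Since scrub prints the affected path in each of these messages, the list of damaged files can be collected from a saved kernel log. The Python sketch below assumes the message format shown in this mail; the function name damaged_paths is made up for illustration. It collapses whitespace to undo mail wrapping, so it will misreport paths that contain spaces or a closing parenthesis.

```python
import re
from collections import Counter

SCRUB_ERR_RE = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>\S+), "
    r"sector (?P<sector>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links (?P<links>\d+) "
    r"\(path: (?P<path>.*?)\)"
)


def damaged_paths(log_text):
    """Count scrub checksum errors per file path found in log_text."""
    # Undo mail-client line wrapping before matching.
    flat = " ".join(log_text.split())
    return Counter(m.group("path") for m in SCRUB_ERR_RE.finditer(flat))


SAMPLE = """\
[ 7162.786759] btrfs: checksum error at logical 7915755577344 on dev
/dev/sdd, sector 125636176, root 5, inode 5224916, offset 44695552,
length 4096, links 1 (path:
var/log/journal/b4d8ffd8ac454d02849f8c8925432368/system@dcae59172d794892a7ca0cdc2d381fa3-000000000018ee6d-0004eaac687ad6bb.journal)
[ 7162.786766] btrfs: checksum error at logical 7915759673344 on dev
/dev/sdd, sector 125644176, root 5, inode 5224916, offset 48791552,
length 4096, links 1 (path:
var/log/journal/b4d8ffd8ac454d02849f8c8925432368/system@dcae59172d794892a7ca0cdc2d381fa3-000000000018ee6d-0004eaac687ad6bb.journal)
"""

paths = damaged_paths(SAMPLE)
for path, count in paths.items():
    print(count, path)
```

The printed paths are relative to the root of the subvolume named in the message (root 5 here), so they need the mount point prepended before deleting anything.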


System journal issues? Hmm... Let's remove the journal files. There are
more errors in dmesg, though:

[ 7677.380386] BTRFS info (device sda3): csum failed ino 5224853 off 46080000 csum 1253352426 private 0
[ 7677.384271] BTRFS info (device sda3): csum failed ino 5224853 off 48812032 csum 2784483749 private 0
[ 7677.387325] BTRFS info (device sda3): csum failed ino 5224853 off 47443968 csum 1928606421 private 0
[ 7677.518090] BTRFS info (device sda3): csum failed ino 5224853 off 51552256 csum 2439491854 private 0

D'oh, the same csum issue that I had before. And here are more cryptic errors:


[19374.896393] btrfs: bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
[19374.896395] btrfs: unable to fixup (regular) error at logical 7325276450816 on dev /dev/sdc1
[19374.902842] btrfs: unable to fixup (regular) error at logical 7325276971008 on dev /dev/sdc1
[19374.903125] btrfs: bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
[19374.903126] btrfs: unable to fixup (regular) error at logical 7325276454912 on dev /dev/sdc1
[19374.909514] btrfs: bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
[19374.911597] btrfs: unable to fixup (regular) error at logical 7325276459008 on dev /dev/sdc1
[19374.911763] btrfs: bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
[19374.911765] btrfs: unable to fixup (regular) error at logical 7325276975104 on dev /dev/sdc1
[19379.864446] scrub_handle_errored_block: 24910 callbacks suppressed


I do not even understand what these messages mean.


But maybe rebalance can be run now? Let's try it:

btrfs balance start -dconvert=raid1 -mconvert=raid1 $MOUNT


Ouch, got a kernel oops:

[25185.855910] ------------[ cut here ]------------
[25185.858066] kernel BUG at fs/btrfs/relocation.c:1055!
[25185.860200] invalid opcode: 0000 [#1] PREEMPT SMP
[25185.862330] Modules linked in: x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm crc32_pclmul
ghash_clmulni_intel iTCO_wdt iTCO_vendor_support ppdev cryptd psmouse
i2c_i801 microcode pcspkr snd_hda_codec_hdmi serio_raw
snd_hda_codec_realtek snd_hda_intel lpc_ich snd_hda_codec snd_hwdep
snd_pcm parport_pc parport snd_page_alloc snd_timer snd evdev mperf
soundcore mei_me mei shpchp processor nfs lockd sunrpc fscache ext4
crc16 mbcache jbd2 dm_snapshot dm_mod squashfs loop isofs btrfs
raid6_pq libcrc32c zlib_deflate xor hid_generic usbhid hid sd_mod
usb_storage ahci libahci libata scsi_mod crc32c_intel atl1c xhci_hcd
i915 intel_agp intel_gtt i2c_algo_bit drm_kms_helper ehci_pci ehci_hcd
drm usbcore usb_common i2c_core video button
[25185.871704] CPU: 1 PID: 902 Comm: btrfs Tainted: G        W
3.11.2-1-ARCH #1
[25185.874058] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./H61M/U3S3, BIOS P2.20 07/30/2012
[25185.876433] task: ffff880113840000 ti: ffff880078582000 task.ti:
ffff880078582000
[25185.878807] RIP: 0010:[<ffffffffa037a6fa>]  [<ffffffffa037a6fa>]
build_backref_tree+0x111a/0x11c0 [btrfs]
[25185.881210] RSP: 0018:ffff8800785839d0  EFLAGS: 00010246
[25185.883588] RAX: 0000000000000000 RBX: ffff88009c633800 RCX: ffff8800be682d50
[25185.885971] RDX: ffff880078583a40 RSI: ffff88009c633820 RDI: ffff8800be682d40
[25185.888361] RBP: ffff880078583ab0 R08: ffff8800a53a7400 R09: ffff8800a53a7280
[25185.890733] R10: ffff88011ac01900 R11: ffff880078583fd8 R12: 0000000000000000
[25185.893114] R13: ffff8800c4a6a480 R14: ffff8800a53a7e80 R15: ffff8800be682d50
[25185.895488] FS:  00007f5585c96780(0000) GS:ffff88011f300000(0000)
knlGS:0000000000000000
[25185.897870] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25185.900240] CR2: 00007f1aecd8b040 CR3: 0000000009ee1000 CR4: 00000000000407e0
[25185.902609] Stack:
[25185.904955]  ffff8800a53a7280 ffff8800be7889a0 ffff8800a53a7400
ffff8800c4a6a480
[25185.907332]  ffff8800a53a7400 ffff8800a3cc4000 ffff8800c4a6a990
ffff8800a53a7440
[25185.909692]  ffff88009c633920 ffff8800a53a7280 ffff88009c633924
ffff88009c633820
[25185.912037] Call Trace:
[25185.914363]  [<ffffffffa037bbe8>] relocate_tree_blocks+0x1d8/0x630 [btrfs]
[25185.916706]  [<ffffffffa037c468>] ? add_data_references+0x248/0x280 [btrfs]
[25185.919031]  [<ffffffffa037d070>] relocate_block_group+0x280/0x690 [btrfs]
[25185.921348]  [<ffffffffa037d61f>]
btrfs_relocate_block_group+0x19f/0x2e0 [btrfs]
[25185.923664]  [<ffffffffa0355a78>]
btrfs_relocate_chunk.isra.32+0x68/0x780 [btrfs]
[25185.925962]  [<ffffffffa030fa76>] ? btrfs_search_slot+0x436/0x940 [btrfs]
[25185.928254]  [<ffffffffa034b549>] ? release_extent_buffer+0xa9/0xd0 [btrfs]
[25185.930539]  [<ffffffffa0350cdf>] ? free_extent_buffer+0x4f/0xa0 [btrfs]
[25185.932814]  [<ffffffffa035939f>] btrfs_balance+0x8ef/0xe90 [btrfs]
[25185.935097]  [<ffffffffa03605a3>] btrfs_ioctl_balance+0x163/0x510 [btrfs]
[25185.937365]  [<ffffffffa03642b4>] btrfs_ioctl+0xdb4/0x1e00 [btrfs]
[25185.939635]  [<ffffffff814e50bc>] ? __do_page_fault+0x2ec/0x5c0
[25185.941891]  [<ffffffff81160a0a>] ? __vma_link_rb+0x6a/0x90
[25185.944156]  [<ffffffff81160ae7>] ? vma_link+0xb7/0xc0
[25185.946402]  [<ffffffff811b1055>] do_vfs_ioctl+0x2e5/0x4d0
[25185.948631]  [<ffffffff811b12c1>] SyS_ioctl+0x81/0xa0
[25185.950846]  [<ffffffff814e539e>] ? do_page_fault+0xe/0x10
[25185.953045]  [<ffffffff814e931d>] system_call_fastpath+0x1a/0x1f
[25185.955232] Code: 4c 89 ef e8 59 05 f9 ff 48 8b bd 50 ff ff ff e8 4d 05 f9 ff 48 83 bd 30 ff ff ff 00 0f 85 14 fd ff ff 31 c0 e9 be ef ff ff 0f 0b <0f> 0b 48 8b 85 30 ff ff ff 49 8d 7e 20 48 8b 70 18 48 89 c2 e8
[25185.957655] RIP  [<ffffffffa037a6fa>] build_backref_tree+0x111a/0x11c0 [btrfs]
[25185.959995]  RSP <ffff8800785839d0>
[25186.196680] ---[ end trace 9ab9ad3c7961486e ]---




So my attempt to recover my filesystem and convert it to raid1 failed
again. I feel that the easiest way forward is for me to reinstall the
system completely and copy the important files to a newly created btrfs
raid.

Why is it so difficult to recover a broken filesystem? I bet this is
something that a lot of users will run into in the future. Do btrfs
developers test the error paths at all? I think you can repro my
situation by:

1) Create a 'single' multi-device filesystem and then crash one of the
disks. This should be easy to do with loop block devices.
2) Run some lengthy operation (e.g. rebalance) and then reboot the computer.
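
The two steps above could be sketched roughly as follows. This is only an
illustration, not a tested reproducer: the image paths, sizes, loop device
numbers (loop8, loop9) and the mount point are assumptions, and the "crash"
is imitated by overwriting part of one backing device with random data.
Whether the overwritten range actually hits allocated data depends on where
btrfs placed it. The commands are only printed unless DRY_RUN=0 is set,
since running them for real needs root.

```shell
#!/bin/sh
# Sketch of the repro steps. All paths, sizes and device names below are
# assumptions for illustration; adjust them before running for real.
# Commands are printed only, unless DRY_RUN=0 is set (needs root then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro() {
    img_a=/tmp/btrfs-a.img
    img_b=/tmp/btrfs-b.img
    mnt=/mnt/btrfs-test

    # 1) Two-device filesystem with the 'single' data profile on loop devices.
    run truncate -s 2G "$img_a" "$img_b"
    run losetup /dev/loop8 "$img_a"
    run losetup /dev/loop9 "$img_b"
    run mkfs.btrfs -d single /dev/loop8 /dev/loop9
    run mkdir -p "$mnt"
    run btrfs device scan
    run mount /dev/loop8 "$mnt"
    run cp -a /usr/share/doc "$mnt"/
    run umount "$mnt"

    # Imitate a failing disk: overwrite 64 MiB in the middle of one device.
    run dd if=/dev/urandom of=/dev/loop9 bs=1M seek=512 count=64

    # 2) Start a lengthy operation; reboot while it runs to finish the repro.
    run btrfs device scan
    run mount /dev/loop8 "$mnt"
    run btrfs balance start -dconvert=raid1 -mconvert=raid1 "$mnt"
}

repro
```

In the default dry-run mode this just lists the commands, which makes it
easy to review them before pointing anything at real devices.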

Speaking of long operations: why is there no way to stop one? I think
'btrfs' should handle 'Ctrl+C' and interrupt the operation correctly
instead of sitting in 'D' state for 10 hours. Should I file a ticket?


* Re: How to fix/remove "csum failed ino" error
  2013-11-09  2:50         ` Anatol Pomozov
@ 2013-11-16 12:06           ` Anatol Pomozov
  2013-11-16 12:23             ` Hugo Mills
  0 siblings, 1 reply; 13+ messages in thread
From: Anatol Pomozov @ 2013-11-16 12:06 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Frank Holton

Hi

Follow-up for the issue. I am stuck with this "invalid csum for free
space extent" error. Could anyone explain what it means? If this is
not data and just free space, why do we care about its checksum?
And if we really do care, then btrfs should have a way to fix this
error. I can "fix" a file checksum error by removing the file, but how
do I "fix" a free space extent checksum error?

I decided to run --init-csum-tree to see if it fixes the issue with
the free space csum. I expected that it would recalculate the csums for
the data. Instead I found that it cleared the csum tree completely and
made my fs unusable: any read returns a csum error. The data is still
on disk - I can read it with filefrag+btrfs-map-logical+dd - it is just
the csum information that got dropped. So I want to echo the request
from Robin
http://www.spinics.net/lists/linux-btrfs/msg25271.html
Information about --init-csum-tree should be documented. What is the
use-case for this feature? In fact I expected that it would recalculate
the csum tree based on the data, not clear the tree.
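
For anyone who needs to pull data out the same way, here is a rough sketch
of the arithmetic involved. It assumes that the "physical" offsets printed
by filefrag -v on btrfs are btrfs logical addresses counted in filesystem
blocks (4096 bytes is assumed here), and that btrfs-map-logical accepts
-l (logical address), -b (byte count) and -o (output file) as its usage
text describes. The block numbers are made up for illustration, and the
final command is only printed, not executed.

```shell
#!/bin/sh
# Convert a block number reported by 'filefrag -v' into a byte offset.
# $1 = block number, $2 = block size in bytes (defaults to an assumed 4096).
blocks_to_bytes() {
    echo $(( $1 * ${2:-4096} ))
}

# Made-up example extent: starts at block 1000 and is 8 blocks long.
logical=$(blocks_to_bytes 1000)
length=$(blocks_to_bytes 8)

# Print the command that would copy the extent out of the filesystem.
# /dev/sda3 is the device named in the dmesg output earlier in the thread;
# the output path is an arbitrary choice.
echo "btrfs-map-logical -l $logical -b $length -o /tmp/extent.bin /dev/sda3"
```

Without -o the tool only reports where the extent lives on the underlying
device, which is the offset one would then hand to dd.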

So I ended up reinstalling my system and copying data from backups.

Despite all my complaints I still think btrfs is a great filesystem. I
really enjoy its multi-device support, which allows me to use all the
different HDDs that I have. Thanks for your work, everyone!


* Re: How to fix/remove "csum failed ino" error
  2013-11-16 12:06           ` Anatol Pomozov
@ 2013-11-16 12:23             ` Hugo Mills
  2013-11-16 20:54               ` Duncan
  0 siblings, 1 reply; 13+ messages in thread
From: Hugo Mills @ 2013-11-16 12:23 UTC (permalink / raw)
  To: Anatol Pomozov; +Cc: linux-btrfs, Frank Holton

[-- Attachment #1: Type: text/plain, Size: 1601 bytes --]

On Sat, Nov 16, 2013 at 04:06:10AM -0800, Anatol Pomozov wrote:
> Hi
> 
> Follow-up for the issue. I am stuck with this "invalid csum for free
> space extent" error. Could anyone explain what it means? If this is
> not data and just free space, why do we care about its checksum?
> And if we really do care, then btrfs should have a way to fix this
> error. I can "fix" a file checksum error by removing the file, but how
> do I "fix" a free space extent checksum error?

   Probably drop the free space cache and rebuild it.

> I decided to run --init-csum-tree to see if it fixes the issue with
> the free space csum. I expected that it would recalculate the csums for
> the data. Instead I found that it cleared the csum tree completely and
> made my fs unusable: any read returns a csum error. The data is still
> on disk - I can read it with filefrag+btrfs-map-logical+dd - it is just
> the csum information that got dropped. So I want to echo the request
> from Robin
> http://www.spinics.net/lists/linux-btrfs/msg25271.html
> Information about --init-csum-tree should be documented. What is the
> use-case for this feature? In fact I expected that it would recalculate
> the csum tree based on the data, not clear the tree.

   There's a patch in the pipeline that makes --init-csum-tree rebuild
the csums as well.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I write in C because using pointer arithmetic lets people ---    
               know that you're virile. -- Matthew Garrett               

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: How to fix/remove "csum failed ino" error
  2013-11-16 12:23             ` Hugo Mills
@ 2013-11-16 20:54               ` Duncan
  2013-11-16 22:17                 ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Duncan @ 2013-11-16 20:54 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Sat, 16 Nov 2013 12:23:42 +0000 as excerpted:

> On Sat, Nov 16, 2013 at 04:06:10AM -0800, Anatol Pomozov wrote:

>> Follow-up for the issue. I am stuck with this "invalid csum for free
>> space extent" error. Could anyone explain what it means? If this is not
>> data and just free space, why do we care about its checksum? And if we
>> really do care, then btrfs should have a way to fix this error. I can
>> "fix" a file checksum error by removing the file, but how do I "fix" a
>> free space extent checksum error?
> 
> Probably drop the free space cache and rebuild it.

If you check the thread history (well, at least based on my 
interpretation thereof, perhaps I'm wrong), he tried that.  Apparently 
there's a problem in the free-space cache area that appears to go away 
when he drops cache, but the moment he lets it rebuild, the problem 
reappears.

Almost as if there's a physical defect on the hardware itself, but the 
usual automatic hardware sector relocate functionality isn't triggering 
for some reason, so every time he tries to use that physical location, 
which happens to be in the middle of where btrfs tries to put its space-
cache, the csum errors trigger.

I guess btrfs doesn't (yet?) work with badblocks output as ext* (and 
reiserfs, which I'm personally more familiar with) does?  I don't see 
anything listed in the btrfs or mkfs.btrfs manpages for it, at least.

Is such support planned?  One would expect it given that btrfs appears to 
be the assumed successor to ext* series' default Linux filesystem title.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: How to fix/remove "csum failed ino" error
  2013-11-16 20:54               ` Duncan
@ 2013-11-16 22:17                 ` Chris Murphy
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2013-11-16 22:17 UTC (permalink / raw)
  To: Btrfs BTRFS


On Nov 16, 2013, at 1:54 PM, Duncan <1i5t5.duncan@cox.net> wrote:

> Hugo Mills posted on Sat, 16 Nov 2013 12:23:42 +0000 as excerpted:
> 
>> On Sat, Nov 16, 2013 at 04:06:10AM -0800, Anatol Pomozov wrote:
> 
>>> Follow-up for the issue. I am stuck with this "invalid csum for free
>>> space extent" error. Could anyone explain what it means? If this is not
>>> data and just free space, why do we care about its checksum? And if we
>>> really do care, then btrfs should have a way to fix this error. I can
>>> "fix" a file checksum error by removing the file, but how do I "fix" a
>>> free space extent checksum error?
>> 
>> Probably drop the free space cache and rebuild it.
> 
> If you check the thread history (well, at least based on my 
> interpretation thereof, perhaps I'm wrong), he tried that.  Apparently 
> there's a problem in the free-space cache area that appears to go away 
> when he drops cache, but the moment he lets it rebuild, the problem 
> reappears.
> 
> Almost as if there's a physical defect on the hardware itself, but the 
> usual automatic hardware sector relocate functionality isn't triggering 
> for some reason, so every time he tries to use that physical location, 
> which happens to be in the middle of where btrfs tries to put its space-
> cache, the csum errors trigger.

Bad sectors should show up very clearly in dmesg, so maybe we need a full dmesg from the OP. Any case of a write failure is grounds for tossing that device as completely and immutably broken.
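
As a rough way to look for that, the sketch below filters kernel log text
for wordings that commonly accompany failing sectors. The pattern list is
an assumption based on typical libata/SCSI/block-layer messages, not an
exhaustive or official set, and the function name is made up.

```shell
#!/bin/sh
# Read kernel log text on stdin and keep only lines that look like
# block-device errors. The patterns are a best guess at common wordings.
disk_errors() {
    grep -iE 'I/O error|medium error|unrecovered read error|failed command|exception Emask'
}

# Typical use (reading the kernel log may need root on some systems):
#   dmesg | disk_errors
# SMART data from the drive itself is worth checking alongside, e.g.:
#   smartctl -a /dev/sda
```

An empty result does not prove the disk is healthy; it only means none of
these particular messages were logged since boot.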


> I guess btrfs doesn't (yet?) work with badblocks output as ext* (and 
> reiserfs, which I'm personally more familiar with) does?  I don't see 
> anything listed in the btrfs or mkfs.btrfs manpages for it, at least.

I wouldn't expect a modern file system to track bad sectors. This is the firmware's domain these days. All available data indicates that when a drive starts to develop many bad sectors, and especially when write failures occur (which means all reserve sectors are used up), the drive is self-destructing, with loose material damaging either the head or peripheral areas of the drive (surface coating detachment). I think it's a waste of resources to support such a brick; the drive should just be replaced.


Chris Murphy



end of thread, other threads:[~2013-11-16 22:17 UTC | newest]

Thread overview: 13+ messages
2013-11-07 14:42 How to fix/remove "csum failed ino" error Anatol Pomozov
2013-11-07 16:41 ` Frank Holton
2013-11-08  4:07   ` Anatol Pomozov
2013-11-08  4:55     ` Anatol Pomozov
2013-11-08  5:27       ` Frank Holton
2013-11-08  5:56         ` Chris Murphy
2013-11-08  8:13           ` Hugo Mills
2013-11-08 17:22             ` Chris Murphy
2013-11-09  2:50         ` Anatol Pomozov
2013-11-16 12:06           ` Anatol Pomozov
2013-11-16 12:23             ` Hugo Mills
2013-11-16 20:54               ` Duncan
2013-11-16 22:17                 ` Chris Murphy
