linux-btrfs.vger.kernel.org archive mirror
* "csum failed" that was not detected by scrub
@ 2014-05-02  9:42 Jaap Pieroen
  2014-05-02 10:20 ` Duncan
  2014-05-02 11:13 ` Shilong Wang
From: Jaap Pieroen @ 2014-05-02  9:42 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

I completed a full scrub:
root@nasbak:/home/jpieroen# btrfs scrub status /home/
scrub status for 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
scrub started at Wed Apr 30 08:30:19 2014 and finished after 144131 seconds
total bytes scrubbed: 4.76TiB with 0 errors

Then tried to remove a device:
root@nasbak:/home/jpieroen# btrfs device delete /dev/sdb /home

This triggered a BUG_ON, with the following error in dmesg: csum failed
ino 258 off 1395560448 csum 2284440321 expected csum 319628859

How can there still be csum failures directly after a scrub?
If I rerun the scrub it still finds no errors. I know this because I've
hit the same issue three times in a row: each time I ran a scrub, and
each time I was still unable to remove the device.

Kind Regards,
Jaap

--------------------------------------------------------------
Details:

root@nasbak:/home/jpieroen#   uname -a
Linux nasbak 3.14.1-031401-generic #201404141220 SMP Mon Apr 14
16:21:48 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

root@nasbak:/home/jpieroen#   btrfs --version
Btrfs v3.14.1

root@nasbak:/home/jpieroen#   btrfs fi df /home
Data, RAID5: total=4.57TiB, used=4.55TiB
System, RAID1: total=32.00MiB, used=352.00KiB
Metadata, RAID1: total=7.00GiB, used=5.59GiB

root@nasbak:/home/jpieroen# btrfs fi show
Label: 'btrfs_storage'  uuid: 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
Total devices 6 FS bytes used 4.56TiB
devid    1 size 1.82TiB used 1.31TiB path /dev/sde
devid    2 size 1.82TiB used 1.31TiB path /dev/sdf
devid    3 size 1.82TiB used 1.31TiB path /dev/sdg
devid    4 size 931.51GiB used 25.00GiB path /dev/sdb
devid    6 size 2.73TiB used 994.03GiB path /dev/sdh
devid    7 size 2.73TiB used 994.03GiB path /dev/sdi

Btrfs v3.14.1

jpieroen@nasbak:~$ dmesg
[227248.656438] BTRFS info (device sdi): relocating block group
9735225016320 flags 129
[227261.713860] BTRFS info (device sdi): found 9 extents
[227264.531019] BTRFS info (device sdi): found 9 extents
[227265.011826] BTRFS info (device sdi): relocating block group
76265029632 flags 129
[227274.052249] BTRFS info (device sdi): csum failed ino 258 off
1395560448 csum 2284440321 expected csum 319628859
[227274.052354] BTRFS info (device sdi): csum failed ino 258 off
1395564544 csum 3646299263 expected csum 319628859
[227274.052402] BTRFS info (device sdi): csum failed ino 258 off
1395568640 csum 281259278 expected csum 319628859
[227274.052449] BTRFS info (device sdi): csum failed ino 258 off
1395572736 csum 2594807184 expected csum 319628859
[227274.052492] BTRFS info (device sdi): csum failed ino 258 off
1395576832 csum 4288971971 expected csum 319628859
[227274.052537] BTRFS info (device sdi): csum failed ino 258 off
1395580928 csum 752615894 expected csum 319628859
[227274.052581] BTRFS info (device sdi): csum failed ino 258 off
1395585024 csum 3828951500 expected csum 319628859
[227274.061279] ------------[ cut here ]------------
[227274.061354] kernel BUG at /home/apw/COD/linux/fs/btrfs/extent_io.c:2116!
[227274.061445] invalid opcode: 0000 [#1] SMP
[227274.061509] Modules linked in: cuse deflate
[227274.061573] BTRFS info (device sdi): csum failed ino 258 off
1395560448 csum 2284440321 expected csum 319628859
[227274.061707]  ctr twofish_generic twofish_x86_64_3way
twofish_x86_64 twofish_common camellia_generic camellia_x86_64
serpent_sse2_x86_64 xts serpent_generic lrw gf128mul glue_helper
blowfish_generic blowfish_x86_64 blowfish_common cast5_generic
cast_common ablk_helper cryptd des_generic cmac xcbc rmd160
crypto_null af_key xfrm_algo nfsd auth_rpcgss nfs_acl nfs lockd sunrpc
fscache dm_crypt ip6t_REJECT ppdev xt_hl ip6t_rt nf_conntrack_ipv6
nf_defrag_ipv6 ipt_REJECT xt_comment xt_LOG kvm xt_recent microcode
xt_multiport xt_limit xt_tcpudp psmouse serio_raw xt_addrtype k10temp
edac_core ipt_MASQUERADE edac_mce_amd iptable_nat nf_nat_ipv4
sp5100_tco nf_conntrack_ipv4 nf_defrag_ipv4 ftdi_sio i2c_piix4
usbserial xt_conntrack ip6table_filter ip6_tables joydev
nf_conntrack_netbios_ns nf_conntrack_broadcast snd_hda_codec_via
nf_nat_ftp snd_hda_codec_hdmi nf_nat snd_hda_codec_generic
nf_conntrack_ftp nf_conntrack snd_hda_intel iptable_filter
ir_lirc_codec(OF) lirc_dev(OF) ip_tables snd_hda_codec
ir_mce_kbd_decoder(OF) x_tables snd_hwdep ir_sony_decoder(OF)
rc_tbs_nec(OF) ir_jvc_decoder(OF) snd_pcm ir_rc6_decoder(OF)
ir_rc5_decoder(OF) saa716x_tbs_dvb(OF) tbs6982fe(POF) tbs6680fe(POF)
ir_nec_decoder(OF) tbs6923fe(POF) tbs6985se(POF) tbs6928se(POF)
tbs6982se(POF) tbs6991fe(POF) tbs6618fe(POF) saa716x_core(OF)
tbs6922fe(POF) tbs6928fe(POF) tbs6991se(POF) stv090x(OF) dvb_core(OF)
rc_core(OF) snd_timer snd soundcore asus_atk0110 parport_pc shpchp
mac_hid lp parport btrfs xor raid6_pq pata_acpi hid_generic usbhid hid
usb_storage radeon pata_atiixp r8169 mii i2c_algo_bit sata_sil24 ttm
drm_kms_helper drm ahci libahci wmi
[227274.064118] CPU: 1 PID: 15543 Comm: btrfs-endio-4 Tainted: PF
    O 3.14.1-031401-generic #201404141220
[227274.064246] Hardware name: System manufacturer System Product
Name/M4A78LT-M, BIOS 0802    08/24/2010
[227274.064368] task: ffff88030a0e31e0 ti: ffff8800a15b8000 task.ti:
ffff8800a15b8000
[227274.064467] RIP: 0010:[<ffffffffa0304c33>]  [<ffffffffa0304c33>]
clean_io_failure+0x1a3/0x1b0 [btrfs]
[227274.064623] RSP: 0018:ffff8800a15b9cd8  EFLAGS: 00010246
[227274.064694] RAX: 0000000000000000 RBX: ffff88010b2869b8 RCX:
0000000000000000
[227274.064789] RDX: ffff8802cad30f00 RSI: 00000000720071fe RDI:
ffff88010b286884
[227274.064883] RBP: ffff8800a15b9d28 R08: 0000000000000000 R09:
0000000000000000
[227274.064977] R10: 0000000000000200 R11: 0000000000000000 R12:
ffffea000102b080
[227274.065071] R13: ffff880004366c00 R14: ffff88010b286800 R15:
00000000532ef000
[227274.065166] FS:  00007f16670b0740(0000) GS:ffff88031fc40000(0000)
knlGS:0000000000000000
[227274.065271] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[227274.065348] CR2: 00007f9c5c3b0000 CR3: 00000002dd8a8000 CR4:
00000000000007e0
[227274.065443] Stack:
[227274.065471]  00000000000532ef ffff88030a14c000 0000000000000000
ffff880004366c00
[227274.065584]  ffff88030a95f780 ffffea000102b080 ffff8801026cc4b0
ffff88010b2869b8
[227274.065697]  00000000532ef000 0000000000000000 ffff8800a15b9db8
ffffffffa0304f1b
[227274.065809] Call Trace:
[227274.065872]  [<ffffffffa0304f1b>]
end_bio_extent_readpage+0x2db/0x3d0 [btrfs]
[227274.065971]  [<ffffffff8120a013>] bio_endio+0x53/0xa0
[227274.066042]  [<ffffffff8120a072>] bio_endio_nodec+0x12/0x20
[227274.066137]  [<ffffffffa02dde81>] end_workqueue_fn+0x41/0x50 [btrfs]
[227274.066243]  [<ffffffffa03157d0>] worker_loop+0xa0/0x330 [btrfs]
[227274.066345]  [<ffffffffa0315730>] ?
check_pending_worker_creates.isra.1+0xe0/0xe0 [btrfs]
[227274.066455]  [<ffffffff8108ffa9>] kthread+0xc9/0xe0
[227274.066522]  [<ffffffff8108fee0>] ? flush_kthread_worker+0xb0/0xb0
[227274.066606]  [<ffffffff817721bc>] ret_from_fork+0x7c/0xb0
[227274.066680]  [<ffffffff8108fee0>] ? flush_kthread_worker+0xb0/0xb0
[227274.066761] Code: 00 00 83 f8 01 0f 8e 49 ff ff ff 49 8b 4d 18 49
8b 55 10 4d 89 e0 45 8b 4d 2c 48 8b 7d b8 4c 89 fe e8 72 fc ff ff e9
29 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55
48 89
[227274.067266] RIP  [<ffffffffa0304c33>] clean_io_failure+0x1a3/0x1b0 [btrfs]
[227274.067380]  RSP <ffff8800a15b9cd8>

* Re: "csum failed" that was not detected by scrub
  2014-05-02  9:42 "csum failed" that was not detected by scrub Jaap Pieroen
@ 2014-05-02 10:20 ` Duncan
  2014-05-02 17:48   ` Jaap Pieroen
  2014-05-03 13:57   ` "csum failed" that was not detected by scrub Marc MERLIN
  2014-05-02 11:13 ` Shilong Wang
From: Duncan @ 2014-05-02 10:20 UTC (permalink / raw)
  To: linux-btrfs

Jaap Pieroen posted on Fri, 02 May 2014 11:42:35 +0200 as excerpted:

> I completed a full scrub:
> root@nasbak:/home/jpieroen# btrfs scrub status /home/
> scrub status for 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
> scrub started at Wed Apr 30 08:30:19 2014
> and finished after 144131 seconds
> total bytes scrubbed: 4.76TiB with 0 errors
> 
> Then tried to remove a device:
> root@nasbak:/home/jpieroen# btrfs device delete /dev/sdb /home
> 
> This triggered a BUG_ON, with the following error in dmesg: csum failed
> ino 258 off 1395560448 csum 2284440321 expected csum 319628859
> 
> How can there still be csum failures directly after a scrub?

Simple enough, really...

> root@nasbak:/home/jpieroen#   btrfs fi df /home
> Data, RAID5: total=4.57TiB, used=4.55TiB
> System, RAID1: total=32.00MiB, used=352.00KiB
> Metadata, RAID1: total=7.00GiB, used=5.59GiB

To those that know the details, this tells the story.

Btrfs raid5/6 modes are not yet code-complete, and scrub is one of the 
incomplete bits.  btrfs scrub doesn't know how to deal with raid5/6 
properly just yet.

While the operational bits of raid5/6 support are there (parity is 
calculated and written), scrub and recovery from a lost device are not 
yet code-complete.  Thus it's effectively a slower, lower-capacity raid0 
without scrub support at this point: in reliability terms, lose one 
device and you've lost them all.  The upside is that when the code is 
complete, you'll get an automatic "free" upgrade to full raid5 or raid6, 
because the operational bits have been working since they were 
introduced; only the recovery and scrub bits were incomplete.

That's the big picture anyway.  Marc Merlin recently did quite a bit of 
raid5/6 testing and there's a page on the wiki now with what he found.  
Additionally, I saw a patch adding scrub support for raid5/6 modes on 
the list recently, but while it may be in integration, I believe it's 
too new to have reached a release yet.

Wiki, for memory or bookmark: https://btrfs.wiki.kernel.org

Direct user documentation link for bookmark:

https://btrfs.wiki.kernel.org/index.php/Main_Page#Guides_and_usage_information

The raid5/6 page (which I didn't otherwise see conveniently linked; I dug 
it out of the recent-changes list since I knew it was there from on-list 
discussion):

https://btrfs.wiki.kernel.org/index.php/RAID56


@ Marc or Hugo or someone with a wiki account:  Can this be more visibly 
linked from the user-docs contents, added to the user docs category list, 
and probably linked from at least the multiple devices and (for now) the 
gotchas pages?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: "csum failed" that was not detected by scrub
  2014-05-02  9:42 "csum failed" that was not detected by scrub Jaap Pieroen
  2014-05-02 10:20 ` Duncan
@ 2014-05-02 11:13 ` Shilong Wang
  2014-05-02 17:55   ` Jaap Pieroen
From: Shilong Wang @ 2014-05-02 11:13 UTC (permalink / raw)
  To: Jaap Pieroen; +Cc: linux-btrfs

Hello,


2014-05-02 17:42 GMT+08:00 Jaap Pieroen <jaap@pieroen.nl>:
> Hi all,
>
> I completed a full scrub:
> root@nasbak:/home/jpieroen# btrfs scrub status /home/
> scrub status for 7ca5f38e-308f-43ab-b3ea-31b3bcd11a0d
> scrub started at Wed Apr 30 08:30:19 2014 and finished after 144131 seconds
> total bytes scrubbed: 4.76TiB with 0 errors
>
> Then tried to remove a device:
> root@nasbak:/home/jpieroen# btrfs device delete /dev/sdb /home
>
> This triggered a BUG_ON, with the following error in dmesg: csum failed
> ino 258 off 1395560448 csum 2284440321 expected csum 319628859
>
> How can there still be csum failures directly after a scrub?
> If I rerun the scrub it still finds no errors. I know this because I've
> hit the same issue three times in a row: each time I ran a scrub, and
> each time I was still unable to remove the device.

There is a known RAID5/6 bug; I sent a patch to address this problem.
Could you please double-check whether your kernel source includes the
following commit:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3b080b2564287be91605bfd1d5ee985696e61d3c

With that patch, scrub on RAID5/6 should detect checksum mismatches, but
it cannot fix the errors yet.
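
If you have a clone of Linus' tree handy, one way to check (a sketch;
assumes the commit and the tag matching your running kernel have both
been fetched):

git tag --contains 3b080b2564287be91605bfd1d5ee985696e61d3c
git merge-base --is-ancestor 3b080b2564287be91605bfd1d5ee985696e61d3c \
    v3.14.1 && echo "commit present" || echo "commit missing"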

Thanks,
Wang

* Re: "csum failed" that was not detected by scrub
  2014-05-02 10:20 ` Duncan
@ 2014-05-02 17:48   ` Jaap Pieroen
  2014-05-03  3:10     ` btrfs raid56 Was: "csum failed" that was not detected by scrub Duncan
  2014-05-03 13:31     ` Frank Holton
  2014-05-03 13:57   ` "csum failed" that was not detected by scrub Marc MERLIN
From: Jaap Pieroen @ 2014-05-02 17:48 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan <at> cox.net> writes:

> 
> To those that know the details, this tells the story.
> 
> Btrfs raid5/6 modes are not yet code-complete, and scrub is one of the 
> incomplete bits.  btrfs scrub doesn't know how to deal with raid5/6 
> properly just yet.
> [snip]

So raid5 is much more useless than I assumed. I read Marc's blog and
figured that btrfs was ready enough.

I'm really in trouble now. I tried to get rid of raid5 by running a
balance to convert to raid1. But of course this triggered the same issue.
And now I have a dead system, because the first thing btrfs does after
mounting is resume the balance, which crashes the system and sends me
into a vicious loop.

- How can I stop btrfs from continuing balancing?
- How can I salvage this situation and convert to raid1?

Unfortunately I have few spare drives left; not enough to hold
4.7TiB of data.. :(

* Re: "csum failed" that was not detected by scrub
  2014-05-02 11:13 ` Shilong Wang
@ 2014-05-02 17:55   ` Jaap Pieroen
From: Jaap Pieroen @ 2014-05-02 17:55 UTC (permalink / raw)
  To: linux-btrfs

Shilong Wang <wangshilong1991 <at> gmail.com> writes:

> 
> Hello,
> 
> There is a known RAID5/6 bug; I sent a patch to address this problem.
> Could you please double-check whether your kernel source includes the
> following commit:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3b080b2564287be91605bfd1d5ee985696e61d3c
>
> With that patch, scrub on RAID5/6 should detect checksum mismatches,
> but it cannot fix the errors yet.
> 
> Thanks,
> Wang

Your patch seems to be in 3.15-rc1:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-rc1-trusty/CHANGES

I tried rc3, but that made my system crash on boot.. I'm having bad luck.


* btrfs raid56 Was: "csum failed" that was not detected by scrub
  2014-05-02 17:48   ` Jaap Pieroen
@ 2014-05-03  3:10     ` Duncan
  2014-05-03  7:53       ` btrfs raid56 Was: Jaap Pieroen
  2014-05-03 13:31     ` Frank Holton
From: Duncan @ 2014-05-03  3:10 UTC (permalink / raw)
  To: linux-btrfs

Jaap Pieroen posted on Fri, 02 May 2014 17:48:13 +0000 as excerpted:

> Duncan <1i5t5.duncan <at> cox.net> writes:
> 
> 
>> To those that know the details, this tells the story.
>> 
>> Btrfs raid5/6 modes are not yet code-complete, and scrub is one of the
>> incomplete bits.  btrfs scrub doesn't know how to deal with raid5/6
>> properly just yet.

>> The raid5/6 page (which I didn't otherwise see conveniently linked; I
>> dug it out of the recent-changes list since I knew it was there from
>> on-list discussion):
>> 
>> https://btrfs.wiki.kernel.org/index.php/RAID56

> So raid5 is much more useless than I assumed. I read Marc's blog and
> figured that btrfs was ready enough.
> 
> I'm really in trouble now. I tried to get rid of raid5 by running a
> balance to convert to raid1. But of course this triggered the same issue.
> And now I have a dead system, because the first thing btrfs does after
> mounting is resume the balance, which crashes the system and sends me
> into a vicious loop.
> 
> - How can I stop btrfs from continuing balancing?

That one's easy.  See the Documentation/filesystems/btrfs.txt file in the 
kernel tree or the wiki for btrfs mount options, one of which is 
"skip_balance", to address this very sort of problem! =:^)

Alternatively, mounting it read-only should prevent further changes 
including the balance, at least allowing you to get the data off the 
filesystem.
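
Something like this should do it (a sketch; substitute your own device 
and mountpoint, I'm just reusing the ones from your btrfs fi show output 
above):

mount -o skip_balance /dev/sde /home   # mount without resuming the balance
btrfs balance cancel /home             # then cancel the paused balance for good

Or read-only, to copy the data off without any further writes:

mount -o ro /dev/sde /home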

> - How can I salvage this situation and convert to raid1?
> 
> Unfortunately I have few spare drives left; not enough to hold
> 4.7TiB of data.. :(

[OK, this goes a bit philosophical, but it's something to think about...]

If you've done your research and followed the advice of the warnings from 
mkfs.btrfs and on the wiki, this is not a problem, since you know that 
btrfs is still under heavy development and that, as a result, it's even 
more critical to have current, tested backups for anything you value 
anyway.  Simply use those backups.

Which, by definition, means that if you don't have such backups, you 
didn't consider the data all that valuable after all, your actions perhaps 
giving the lie to your claims.  And there's no excuse for not doing the 
research either, since if you really care about your data, you research a 
filesystem you're not familiar with before trusting your data to it.  So 
again, if you didn't know btrfs was experimental and thus didn't have 
those backups, by definition your actions say you didn't really care 
about the data you put on it, no matter what your words might say.

OTOH, there *IS* such a thing as not realizing the value of something 
until you're in the process of losing it... that I do understand.  But of 
course try telling that to, for instance, someone who has just lost a 
loved one they never actually /told/ how much they mattered...  Sometimes 
it's simply too late.  Tho if it's going to happen, at least here I'd much 
rather it happen to some data than to one of my own loved ones...


Anyway, at least for now you should still be able to recover most of the 
data using skip_balance or read-only mounting.  My guess is that if push 
comes to shove you can either prioritize that data and give up a TiB or 
two if it comes to that, or scrimp here and there, putting a few gigs on 
the odd blank DVD you may have lying around or downgrading a few meals to 
ramen noodles to afford the $100 or so, shipped, that pricewatch says a 
new 3TB drive costs these days.  I've been there, and have found that if 
I think I need it badly enough, that $100 has a way of appearing, even 
if, like I said, I'm noodling it for a few meals to do it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: btrfs raid56 Was: "csum failed" that was not detected by scrub
  2014-05-03  3:10     ` btrfs raid56 Was: "csum failed" that was not detected by scrub Duncan
@ 2014-05-03  7:53       ` Jaap Pieroen
From: Jaap Pieroen @ 2014-05-03  7:53 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan <at> cox.net> writes:

> > - How can I salvage this situation and convert to raid1?
> > 
> > Unfortunately I have few spare drives left; not enough to hold
> > 4.7TiB of data.. :(
> 
> [OK, this goes a bit philosophical, but it's something to think about...]
> 
> ... 
>
> Anyway, at least for now you should still be able to recover most of the 
> data using skip_balance or read-only mounting.  My guess is that if push 
> comes to shove you can either prioritize that data and give up a TiB or 
> two if it comes to that, or scrimp here and there, putting a few gigs on 
> the odd blank DVD you may have lying around or downgrading a few meals to 
> ramen noodles to afford the $100 or so, shipped, that pricewatch says a 
> new 3TB drive costs these days.  I've been there, and have found that if 
> I think I need it badly enough, that $100 has a way of appearing, even 
> if, like I said, I'm noodling it for a few meals to do it.
> 

Thanks for the philosophical response: both for telling me I can't simply
convert, and for reminding me that this was an outcome I was prepared
to face. :) Because you are right: when push comes to shove, it's
data I'm prepared to lose.

I'm going to hedge my bets and convince the Mrs to let me invest in
some new hardware.


* Re: "csum failed" that was not detected by scrub
  2014-05-02 17:48   ` Jaap Pieroen
  2014-05-03  3:10     ` btrfs raid56 Was: "csum failed" that was not detected by scrub Duncan
@ 2014-05-03 13:31     ` Frank Holton
From: Frank Holton @ 2014-05-03 13:31 UTC (permalink / raw)
  To: Jaap Pieroen; +Cc: linux-btrfs

Hi Jaap,

This patch http://www.spinics.net/lists/linux-btrfs/msg33025.html made
it into 3.15-rc2, so if you're willing to build your own RC kernel you
may have better luck with scrub in 3.15. The patch only scrubs the data
blocks in RAID5/6, so hopefully your parity blocks are intact. I'm not
sure whether it will help, but it may be worth a try.
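
In case it helps, the rough steps for a self-built RC kernel look
something like this on Ubuntu (a sketch; assumes the usual kernel build
dependencies are already installed):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git checkout v3.15-rc2
cp /boot/config-$(uname -r) .config   # start from the running kernel's config
make olddefconfig                     # accept defaults for new options
make -j$(nproc) deb-pkg               # build installable .deb packages
sudo dpkg -i ../linux-image-*.deb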

On Fri, May 2, 2014 at 1:48 PM, Jaap Pieroen <jaap@pieroen.nl> wrote:
> [snip]
>
> - How can I stop btrfs from continuing balancing?
> - How can I salvage this situation and convert to raid1?
>
> [snip]

* Re: "csum failed" that was not detected by scrub
  2014-05-02 10:20 ` Duncan
  2014-05-02 17:48   ` Jaap Pieroen
@ 2014-05-03 13:57   ` Marc MERLIN
From: Marc MERLIN @ 2014-05-03 13:57 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Fri, May 02, 2014 at 10:20:03AM +0000, Duncan wrote:
> The raid5/6 page (which I didn't otherwise see conveniently linked; I dug 

It's linked off
https://btrfs.wiki.kernel.org/index.php/FAQ#Can_I_use_RAID.5B56.5D_on_my_Btrfs_filesystem.3F

> it out of the recent-changes list since I knew it was there from on-list 
> discussion):
> 
> https://btrfs.wiki.kernel.org/index.php/RAID56
> 
> 
> @ Marc or Hugo or someone with a wiki account:  Can this be more visibly 

An "@ Marc" relies on a lot for me to actually see it, never mind at the
bottom of a message when my inbox is over 900 messages and I'm boarding a
plane in a few hours ;)

More seriously, please Cc me (and I'd say generally others) if you're
trying to get someone's attention. I typically also put a one-liner at
the top to tell the Cc'ed person to look for the bit with their name.

> linked from the user-docs contents, added to the user docs category list, 
> and probably linked from at least the multiple devices and (for now) the 
> gotchas pages?

I added it here:
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
Note that it's the first result on Google for raid56. Also, searching for
"raid5 btrfs" brings you to
https://btrfs.wiki.kernel.org/index.php/FAQ#Case_study:_btrfs-raid_5.2F6_versus_MD-RAID_5.2F6
which also links to the raid56 page.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
