linux-btrfs.vger.kernel.org archive mirror
* RAID-1 - handling disk failures?
@ 2014-03-27 20:52 Tomasz Chmielewski
  2014-03-28  6:22 ` Duncan
  0 siblings, 1 reply; 3+ messages in thread
From: Tomasz Chmielewski @ 2014-03-27 20:52 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

Is btrfs supposed to handle disk failures in RAID-1 mode?

It doesn't seem to be the case for me, with 3.14.0-rc8.

Right now, the system doesn't see the faulty drive anymore (i.e. hdparm -i /dev/sdd is unable to give any info).

Accessing most files on the btrfs filesystem just "freezes" the process accessing the data (it sits waiting for IO).

The other drive in RAID-1, /dev/sdc, is healthy.

# grep -i btrfs syslog
Mar 27 09:57:59 bkp010 kernel: [157256.352840] BTRFS: bdev /dev/sdd1 errs: wr 31, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.353334] BTRFS: bdev /dev/sdd1 errs: wr 32, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.353816] BTRFS: bdev /dev/sdd1 errs: wr 33, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.354338] BTRFS: bdev /dev/sdd1 errs: wr 34, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.354826] BTRFS: bdev /dev/sdd1 errs: wr 35, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.355314] BTRFS: bdev /dev/sdd1 errs: wr 36, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.355810] BTRFS: bdev /dev/sdd1 errs: wr 37, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.356302] BTRFS: bdev /dev/sdd1 errs: wr 38, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.356790] BTRFS: bdev /dev/sdd1 errs: wr 39, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:57:59 bkp010 kernel: [157256.357275] BTRFS: bdev /dev/sdd1 errs: wr 40, rd 1, flush 0, corrupt 0, gen 0
Mar 27 09:58:02 bkp010 kernel: [157259.298965] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:02 bkp010 kernel: [157259.299309] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:02 bkp010 kernel: [157259.299637] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:04 bkp010 kernel: [157261.358796] btrfs_dev_stat_print_on_error: 9038 callbacks suppressed
Mar 27 09:58:04 bkp010 kernel: [157261.358844] BTRFS: bdev /dev/sdd1 errs: wr 9007, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.359215] BTRFS: bdev /dev/sdd1 errs: wr 9008, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.359585] BTRFS: bdev /dev/sdd1 errs: wr 9009, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.359954] BTRFS: bdev /dev/sdd1 errs: wr 9010, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.360323] BTRFS: bdev /dev/sdd1 errs: wr 9011, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.360693] BTRFS: bdev /dev/sdd1 errs: wr 9012, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.361063] BTRFS: bdev /dev/sdd1 errs: wr 9013, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.361433] BTRFS: bdev /dev/sdd1 errs: wr 9014, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.361802] BTRFS: bdev /dev/sdd1 errs: wr 9015, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:04 bkp010 kernel: [157261.362172] BTRFS: bdev /dev/sdd1 errs: wr 9016, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.046550] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:09 bkp010 kernel: [157266.046931] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:09 bkp010 kernel: [157266.047307] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:09 bkp010 kernel: [157266.427724] btrfs_dev_stat_print_on_error: 13860 callbacks suppressed
Mar 27 09:58:09 bkp010 kernel: [157266.427788] BTRFS: bdev /dev/sdd1 errs: wr 22877, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.428288] BTRFS: bdev /dev/sdd1 errs: wr 22878, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.431504] BTRFS: bdev /dev/sdd1 errs: wr 22879, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.432047] BTRFS: bdev /dev/sdd1 errs: wr 22880, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.499055] BTRFS: bdev /dev/sdd1 errs: wr 22881, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.499453] BTRFS: bdev /dev/sdd1 errs: wr 22882, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.499847] BTRFS: bdev /dev/sdd1 errs: wr 22883, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.500238] BTRFS: bdev /dev/sdd1 errs: wr 22884, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.500625] BTRFS: bdev /dev/sdd1 errs: wr 22885, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:09 bkp010 kernel: [157266.501692] BTRFS: bdev /dev/sdd1 errs: wr 22886, rd 73, flush 0, corrupt 0, gen 0
Mar 27 09:58:10 bkp010 kernel: [157267.726185] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:10 bkp010 kernel: [157267.726472] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:10 bkp010 kernel: [157267.726758] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:15 bkp010 kernel: [157272.407794] btrfs_dev_stat_print_on_error: 2918 callbacks suppressed
Mar 27 09:58:15 bkp010 kernel: [157272.407856] BTRFS: bdev /dev/sdd1 errs: wr 25804, rd 74, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.428206] BTRFS: bdev /dev/sdd1 errs: wr 25805, rd 74, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.428778] BTRFS: bdev /dev/sdd1 errs: wr 25805, rd 75, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.436211] BTRFS: bdev /dev/sdd1 errs: wr 25806, rd 75, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.436779] BTRFS: bdev /dev/sdd1 errs: wr 25806, rd 76, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.460585] BTRFS: bdev /dev/sdd1 errs: wr 25807, rd 76, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.462577] BTRFS: bdev /dev/sdd1 errs: wr 25807, rd 77, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.463304] BTRFS: bdev /dev/sdd1 errs: wr 25808, rd 77, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.463854] BTRFS: bdev /dev/sdd1 errs: wr 25808, rd 78, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.467755] BTRFS: bdev /dev/sdd1 errs: wr 25809, rd 78, flush 0, corrupt 0, gen 0
Mar 27 09:58:15 bkp010 kernel: [157272.610100] BTRFS info (device sdd1): csum failed ino 998 extent 1290755563520 csum 2566472073 wanted 3738272199 mirror 0
Mar 27 09:58:15 bkp010 kernel: [157272.610621] BTRFS info (device sdd1): csum failed ino 998 extent 1290755563520 csum 2566472073 wanted 3738272199 mirror 0
Mar 27 09:58:15 bkp010 kernel: [157272.611149] BTRFS info (device sdd1): csum failed ino 998 extent 1290755563520 csum 2566472073 wanted 3738272199 mirror 1
Mar 27 09:58:15 bkp010 kernel: [157272.622317] IP: [<ffffffffa02dfc75>] repair_io_failure+0xba/0x19e [btrfs]
Mar 27 09:58:15 bkp010 kernel: [157272.622464] Modules linked in: cpufreq_ondemand cpufreq_conservative cpufreq_powersave cpufreq_stats bridge stp llc ipv6 btrfs xor raid6_pq zlib_deflate loop tpm_infineon tpm_tis tpm parport_pc parport battery button video lpc_ich mfd_core pcspkr acpi_cpufreq i2c_i801 i2c_core ehci_pci ehci_hcd ext4 crc16 jbd2 mbcache raid1 sg sd_mod ahci libahci libata scsi_mod r8169 mii
Mar 27 09:58:15 bkp010 kernel: [157272.622724] CPU: 0 PID: 22767 Comm: btrfs-endio-3 Not tainted 3.14.0-rc8 #1
Mar 27 09:58:15 bkp010 kernel: [157272.622949] RIP: 0010:[<ffffffffa02dfc75>]  [<ffffffffa02dfc75>] repair_io_failure+0xba/0x19e [btrfs]
Mar 27 09:58:15 bkp010 kernel: [157272.625214]  [<ffffffffa02dffb9>] end_bio_extent_readpage+0x260/0x7c4 [btrfs]
Mar 27 09:58:15 bkp010 kernel: [157272.625401]  [<ffffffffa02c1194>] end_workqueue_fn+0x33/0x38 [btrfs]
Mar 27 09:58:15 bkp010 kernel: [157272.625454]  [<ffffffffa02ed649>] worker_loop+0x15e/0x495 [btrfs]
Mar 27 09:58:15 bkp010 kernel: [157272.625506]  [<ffffffffa02ed4eb>] ? btrfs_queue_worker+0x269/0x269 [btrfs]
Mar 27 09:58:15 bkp010 kernel: [157272.625918] RIP  [<ffffffffa02dfc75>] repair_io_failure+0xba/0x19e [btrfs]
Mar 27 09:58:42 bkp010 kernel: [157299.287358] btrfs_dev_stat_print_on_error: 23 callbacks suppressed
Mar 27 09:58:42 bkp010 kernel: [157299.287436] BTRFS: bdev /dev/sdd1 errs: wr 25820, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.288571] BTRFS: bdev /dev/sdd1 errs: wr 25821, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.289711] BTRFS: bdev /dev/sdd1 errs: wr 25822, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.290859] BTRFS: bdev /dev/sdd1 errs: wr 25823, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.292029] BTRFS: bdev /dev/sdd1 errs: wr 25824, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.293186] BTRFS: bdev /dev/sdd1 errs: wr 25825, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.294320] BTRFS: bdev /dev/sdd1 errs: wr 25826, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.295474] BTRFS: bdev /dev/sdd1 errs: wr 25827, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.296627] BTRFS: bdev /dev/sdd1 errs: wr 25828, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:42 bkp010 kernel: [157299.297762] BTRFS: bdev /dev/sdd1 errs: wr 25829, rd 91, flush 0, corrupt 0, gen 0
Mar 27 09:58:50 bkp010 kernel: [157307.005902] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:50 bkp010 kernel: [157307.006917] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:58:50 bkp010 kernel: [157307.007931] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:59:15 bkp010 kernel: [157332.169244] btrfs_dev_stat_print_on_error: 9421 callbacks suppressed
Mar 27 09:59:15 bkp010 kernel: [157332.169341] BTRFS: bdev /dev/sdd1 errs: wr 40014, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.170688] BTRFS: bdev /dev/sdd1 errs: wr 40015, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.172027] BTRFS: bdev /dev/sdd1 errs: wr 40016, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.173365] BTRFS: bdev /dev/sdd1 errs: wr 40017, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.174701] BTRFS: bdev /dev/sdd1 errs: wr 40018, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.176051] BTRFS: bdev /dev/sdd1 errs: wr 40019, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.177392] BTRFS: bdev /dev/sdd1 errs: wr 40020, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.178731] BTRFS: bdev /dev/sdd1 errs: wr 40021, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.180068] BTRFS: bdev /dev/sdd1 errs: wr 40022, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:15 bkp010 kernel: [157332.181405] BTRFS: bdev /dev/sdd1 errs: wr 40023, rd 148, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.204302] btrfs_dev_stat_print_on_error: 866 callbacks suppressed
Mar 27 09:59:20 bkp010 kernel: [157337.204382] BTRFS: bdev /dev/sdd1 errs: wr 40884, rd 154, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.205685] BTRFS: bdev /dev/sdd1 errs: wr 40884, rd 155, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.215507] BTRFS: bdev /dev/sdd1 errs: wr 40885, rd 155, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.216931] BTRFS: bdev /dev/sdd1 errs: wr 40885, rd 156, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.218269] BTRFS: bdev /dev/sdd1 errs: wr 40885, rd 157, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.232047] BTRFS: bdev /dev/sdd1 errs: wr 40886, rd 157, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.233788] BTRFS: bdev /dev/sdd1 errs: wr 40886, rd 158, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.271991] BTRFS: bdev /dev/sdd1 errs: wr 40887, rd 158, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.273203] BTRFS: bdev /dev/sdd1 errs: wr 40887, rd 159, flush 0, corrupt 0, gen 0
Mar 27 09:59:20 bkp010 kernel: [157337.274338] BTRFS: bdev /dev/sdd1 errs: wr 40887, rd 160, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.210657] btrfs_dev_stat_print_on_error: 1658 callbacks suppressed
Mar 27 09:59:25 bkp010 kernel: [157342.210735] BTRFS: bdev /dev/sdd1 errs: wr 42362, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.211871] BTRFS: bdev /dev/sdd1 errs: wr 42363, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.213005] BTRFS: bdev /dev/sdd1 errs: wr 42364, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.214140] BTRFS: bdev /dev/sdd1 errs: wr 42365, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.215274] BTRFS: bdev /dev/sdd1 errs: wr 42366, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.217075] BTRFS: bdev /dev/sdd1 errs: wr 42367, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.218343] BTRFS: bdev /dev/sdd1 errs: wr 42368, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.219608] BTRFS: bdev /dev/sdd1 errs: wr 42369, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.220898] BTRFS: bdev /dev/sdd1 errs: wr 42370, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:25 bkp010 kernel: [157342.222163] BTRFS: bdev /dev/sdd1 errs: wr 42371, rd 344, flush 0, corrupt 0, gen 0
Mar 27 09:59:34 bkp010 kernel: [157350.875937] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:59:34 bkp010 kernel: [157350.876953] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 09:59:34 bkp010 kernel: [157350.877968] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 10:00:00 bkp010 kernel: [157377.482402] btrfs_dev_stat_print_on_error: 8601 callbacks suppressed
Mar 27 10:00:00 bkp010 kernel: [157377.482498] BTRFS: bdev /dev/sdd1 errs: wr 67397, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.483837] BTRFS: bdev /dev/sdd1 errs: wr 67398, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.485169] BTRFS: bdev /dev/sdd1 errs: wr 67399, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.486498] BTRFS: bdev /dev/sdd1 errs: wr 67400, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.487827] BTRFS: bdev /dev/sdd1 errs: wr 67401, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.489156] BTRFS: bdev /dev/sdd1 errs: wr 67402, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.490439] BTRFS: bdev /dev/sdd1 errs: wr 67403, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.491791] BTRFS: bdev /dev/sdd1 errs: wr 67404, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.493125] BTRFS: bdev /dev/sdd1 errs: wr 67405, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:00 bkp010 kernel: [157377.494454] BTRFS: bdev /dev/sdd1 errs: wr 67406, rd 344, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.486976] btrfs_dev_stat_print_on_error: 1218 callbacks suppressed
Mar 27 10:00:05 bkp010 kernel: [157382.487055] BTRFS: bdev /dev/sdd1 errs: wr 68586, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.488203] BTRFS: bdev /dev/sdd1 errs: wr 68587, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.489342] BTRFS: bdev /dev/sdd1 errs: wr 68588, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.491385] BTRFS: bdev /dev/sdd1 errs: wr 68589, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.492730] BTRFS: bdev /dev/sdd1 errs: wr 68590, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.494065] BTRFS: bdev /dev/sdd1 errs: wr 68591, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.495399] BTRFS: bdev /dev/sdd1 errs: wr 68592, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.496741] BTRFS: bdev /dev/sdd1 errs: wr 68593, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.498631] BTRFS: bdev /dev/sdd1 errs: wr 68594, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:05 bkp010 kernel: [157382.499872] BTRFS: bdev /dev/sdd1 errs: wr 68595, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.654641] btrfs_dev_stat_print_on_error: 12557 callbacks suppressed
Mar 27 10:00:15 bkp010 kernel: [157392.654721] BTRFS: bdev /dev/sdd1 errs: wr 92038, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.655880] BTRFS: bdev /dev/sdd1 errs: wr 92039, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.657017] BTRFS: bdev /dev/sdd1 errs: wr 92040, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.659263] BTRFS: bdev /dev/sdd1 errs: wr 92041, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.660420] BTRFS: bdev /dev/sdd1 errs: wr 92042, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.661592] BTRFS: bdev /dev/sdd1 errs: wr 92043, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.662727] BTRFS: bdev /dev/sdd1 errs: wr 92044, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.663862] BTRFS: bdev /dev/sdd1 errs: wr 92045, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.663982] BTRFS: bdev /dev/sdd1 errs: wr 92046, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.665117] BTRFS: bdev /dev/sdd1 errs: wr 92047, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:15 bkp010 kernel: [157392.741518] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 10:00:15 bkp010 kernel: [157392.742716] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 10:00:15 bkp010 kernel: [157392.743909] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 10:00:45 bkp010 kernel: [157422.426129] btrfs_dev_stat_print_on_error: 11 callbacks suppressed
Mar 27 10:00:45 bkp010 kernel: [157422.426228] BTRFS: bdev /dev/sdd1 errs: wr 92059, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.427619] BTRFS: bdev /dev/sdd1 errs: wr 92060, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.429007] BTRFS: bdev /dev/sdd1 errs: wr 92061, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.430341] BTRFS: bdev /dev/sdd1 errs: wr 92062, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.431673] BTRFS: bdev /dev/sdd1 errs: wr 92063, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.433022] BTRFS: bdev /dev/sdd1 errs: wr 92064, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.434355] BTRFS: bdev /dev/sdd1 errs: wr 92065, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.435685] BTRFS: bdev /dev/sdd1 errs: wr 92066, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.437015] BTRFS: bdev /dev/sdd1 errs: wr 92067, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:45 bkp010 kernel: [157422.438346] BTRFS: bdev /dev/sdd1 errs: wr 92068, rd 383, flush 0, corrupt 0, gen 0
Mar 27 10:00:55 bkp010 kernel: [157432.064784] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 10:00:55 bkp010 kernel: [157432.065799] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 10:00:55 bkp010 kernel: [157432.066816] BTRFS: lost page write due to I/O error on /dev/sdd1
Mar 27 20:41:23 bkp010 kernel: [195837.280400] btrfs_dev_stat_print_on_error: 14081 callbacks suppressed
Mar 27 20:41:23 bkp010 kernel: [195837.280448] BTRFS: bdev /dev/sdd1 errs: wr 113630, rd 439, flush 0, corrupt 0, gen 0
Mar 27 20:41:23 bkp010 kernel: [195837.280536] BTRFS: bdev /dev/sdd1 errs: wr 113630, rd 440, flush 0, corrupt 0, gen 0
Mar 27 20:41:23 bkp010 kernel: [195837.280624] BTRFS: bdev /dev/sdd1 errs: wr 113630, rd 441, flush 0, corrupt 0, gen 0
Mar 27 20:41:23 bkp010 kernel: [195837.280996] BTRFS: bdev /dev/sdd1 errs: wr 113630, rd 442, flush 0, corrupt 0, gen 0
Mar 27 20:41:23 bkp010 kernel: [195837.281367] BTRFS: bdev /dev/sdd1 errs: wr 113630, rd 443, flush 0, corrupt 0, gen 0
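
The per-device error counters in the excerpt above can be pulled out
with a short script. This is only a minimal sketch, assuming syslog
lines in exactly the format shown; the function name last_counters is
made up for illustration.

```python
import re

# Matches the per-device counter lines exactly as they appear in the log above.
DEV_STAT = re.compile(
    r"BTRFS: bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def last_counters(lines):
    """Return {device: {counter: value}} using the last line seen per device."""
    latest = {}
    for line in lines:
        m = DEV_STAT.search(line)
        if m:
            fields = m.groupdict()
            dev = fields.pop("dev")
            latest[dev] = {name: int(value) for name, value in fields.items()}
    return latest

# First and last counter lines copied from the excerpt above.
sample = [
    "Mar 27 09:57:59 bkp010 kernel: [157256.352840] BTRFS: bdev /dev/sdd1 errs: wr 31, rd 1, flush 0, corrupt 0, gen 0",
    "Mar 27 20:41:23 bkp010 kernel: [195837.281367] BTRFS: bdev /dev/sdd1 errs: wr 113630, rd 443, flush 0, corrupt 0, gen 0",
]
print(last_counters(sample))
```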



-- 
Tomasz Chmielewski
http://wpkg.org


* Re: RAID-1 - handling disk failures?
  2014-03-27 20:52 Tomasz Chmielewski
@ 2014-03-28  6:22 ` Duncan
  0 siblings, 0 replies; 3+ messages in thread
From: Duncan @ 2014-03-28  6:22 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Chmielewski posted on Thu, 27 Mar 2014 21:52:15 +0100 as excerpted:

> Is btrfs supposed to handle disk failures in RAID-1 mode?
> 
> It doesn't seem to be the case for me, with 3.14.0-rc8.
> 
> Right now, the system doesn't see the faulty drive anymore (i.e. hdparm
> -i /dev/sdd is unable to give any info).
> 
> Accesses to most files on btrfs filesystem just "freeze" (waiting for
> IO) the process which is accessing the data.
> 
> The other drive in RAID-1, /dev/sdc, is healthy.

Well, btrfs raid1 mode handles (single) drive loss, but rather 
differently from how you might be used to raid1 working, if you've 
worked with it on mdraid or the like.

1) (Not directly related to your problem, but it likely differs from 
other raid1 you've worked with...) Unlike normal raid1, btrfs' so-called 
raid1 mode is actually two-way-(only-)mirrored.  No matter how many 
devices there are in the filesystem, btrfs will only do two-way-mirroring 
of each chunk.  Thus, btrfs raid1 mode only tolerates loss of a single 
device without data loss, since once you lose two, both copies of some 
chunks will be gone and not recoverable, regardless of how many devices 
were in the raid1.
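
To make point 1 concrete, here is a toy model, not btrfs code. It
assumes each chunk is placed on exactly two devices and that device
pairs are picked round-robin; how the real allocator chooses devices is
not covered in this thread. The names allocate_chunks and lost_chunks
are hypothetical.

```python
from itertools import combinations, cycle

def allocate_chunks(devices, n_chunks):
    """Place each chunk on exactly two distinct devices (two-way mirroring).

    Assumption: pairs are used round-robin, purely for illustration.
    """
    pairs = cycle(combinations(devices, 2))
    return [set(next(pairs)) for _ in range(n_chunks)]

def lost_chunks(chunks, failed):
    """Indices of chunks whose two copies both sit on failed devices."""
    failed = set(failed)
    return [i for i, copies in enumerate(chunks) if copies <= failed]

devices = ["sda", "sdb", "sdc", "sdd"]
chunks = allocate_chunks(devices, 12)
print(lost_chunks(chunks, ["sda"]))         # one device gone
print(lost_chunks(chunks, ["sda", "sdb"]))  # two devices gone
```

With four devices in this model, losing any one device leaves every
chunk with a surviving copy, while losing two loses the chunks that
happened to be mirrored on exactly that pair.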

2) In btrfs, once you drop below the natural minimum number of devices to 
sustain that raid type, btrfs goes read-only as writes can no longer be 
done in the configured raid mode, which naturally blocks anything 
attempting to write to the filesystem.  I suspect that's what's happening 
to you.

With raid0 or raid1, the natural minimum operational number of devices is 
two.  With raid5, it's three.  With raid6 and raid10, it's four.  
(However, do note that raid5/6 support isn't complete yet.  Don't 
actually rely on it working as raid5/6 if something goes wrong, just yet.)
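
The rule in point 2 amounts to a simple lookup. The sketch below uses
the device counts exactly as given in this message; they are the
figures stated here and may not match every btrfs version. The name
below_minimum is made up.

```python
# Minimum device counts exactly as stated in this message; they are the
# poster's figures and may not match every btrfs version.
MIN_DEVICES = {"raid0": 2, "raid1": 2, "raid5": 3, "raid6": 4, "raid10": 4}

def below_minimum(profile, devices_present):
    """True if fewer devices remain than the profile's stated minimum."""
    return devices_present < MIN_DEVICES[profile]

print(below_minimum("raid1", 1))  # two-device raid1 after losing one device
print(below_minimum("raid1", 2))  # three-device raid1 after losing one device
```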

In your raid1 case, once you drop to a single device, writes can no 
longer be done to two mirrors, so the filesystem is forced read-only.  
Naturally that's going to hang any thread trying to do a write in 
"D" (disk-sleep) state.  Once those hung writing threads plug the IO 
queue, reads will stall behind the writes, and anything trying to read 
from that filesystem will ultimately deadlock and hang as well.
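
Processes stuck in "D" state can be listed from /proc. This is a rough
sketch assuming a Linux /proc layout, where the state letter follows
the parenthesised command name in /proc/<pid>/stat; the function names
are hypothetical.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state

def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked

if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, comm in blocked_processes():
            print(pid, comm)
```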

OTOH, if you have more than the minimum number of devices, say you have 
three devices for raid1 mode, drop one device and writes can still be 
done in btrfs' normal two-way-mirrored raid1 write mode to the two 
remaining devices.  I'm not actually sure if it goes read-only when a 
device drops in this case or not, but if it does, you should be able to 
set it back to read/write mode and get on with things if you need to.

Basically, what that means is that once you drop below two devices in 
raid1 mode, btrfs will drop to read-only.  If it's your rootfs or the 
like, you're pretty well hosed and will be forced to reboot pretty 
quickly, although if you catch it quickly enough you can probably umount 
other filesystems, etc., that aren't on the dropped devices.  If it's 
just some auxiliary filesystem, you'll probably lose any processes 
working with it, but otherwise you should hopefully continue to stay in 
operation.

Mounting the still-degraded filesystem in degraded mode (with the 
degraded mount option) after a shutdown or other full filesystem 
unmount will result in the same forced read-only situation.  However, 
since the filesystem was never writable in the first place, nothing 
should have been able to open files on it in write mode, so it should 
be workable enough to at least do a btrfs device add, bringing it back 
to the minimum two devices again, after which you should be able to 
remount it writable.  With it again mounted writable, you should be 
able to do a btrfs device delete missing to remove the bad device, 
followed by a rebalance to create a new second mirror of all chunks 
where one mirror was on the missing device.
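
The sequence just described, written out as commands. This is an
untested sketch that only prints the commands and never runs them.
/dev/sdc1 stands in for the surviving device (the thread names /dev/sdc
as the healthy drive), while /dev/sde1 and /mnt/backup are made-up
names, as is the helper recovery_commands.

```python
import shlex

def recovery_commands(surviving_dev, new_dev, mountpoint):
    """Return the command sequence described above, in order, as argv lists."""
    return [
        ["mount", "-o", "degraded", surviving_dev, mountpoint],
        ["btrfs", "device", "add", new_dev, mountpoint],
        ["mount", "-o", "remount,rw", mountpoint],
        ["btrfs", "device", "delete", "missing", mountpoint],
        ["btrfs", "balance", "start", mountpoint],
    ]

# Hypothetical device and mount point names; nothing is executed here.
for argv in recovery_commands("/dev/sdc1", "/dev/sde1", "/mnt/backup"):
    print(shlex.join(argv))
```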

Basically, all this means that in order to keep a btrfs raid1 fully 
usable without rebooting in the event of a dropped device, you'll need 
to build it out to three devices, so you can drop one and still have 
enough devices left to continue writing to a full pair of devices in 
two-way mirroring.

Depending upon your use-case, the drop to read-only and potentially 
forced-reboot may or may not be acceptable, as long as the data's still 
there and accessible, to copy elsewhere or whatever, after the reboot.  
If it's not acceptable, then as mentioned, do plan on making it three 
devices in normal mode, so the filesystem can continue writing in so-
called raid1 mode to the two remaining devices if one drops.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: RAID-1 - handling disk failures?
@ 2014-03-28 16:42 Tomasz Chmielewski
  0 siblings, 0 replies; 3+ messages in thread
From: Tomasz Chmielewski @ 2014-03-28 16:42 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org

> 2) In btrfs, once you drop below the natural minimum number of devices
> to sustain that raid type, btrfs goes read-only as writes can no
> longer be done in the configured raid mode, which naturally blocks
> anything attempting to write to the filesystem.  I suspect that's
> what's happening to you.

No, it never went into read-only mode.
If it did, I would see:

# touch testfile
touch: cannot touch `testfile': Read-only file system

and not waiting for IO.
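
The read-only flag can also be checked without writing anything. This
is a minimal sketch, assuming a Unix system where os.statvfs is
available; the helper name is made up, and on a filesystem whose IO is
stuck the call itself may block too.

```python
import os

def is_readonly_mount(path):
    """True if the filesystem containing path is mounted read-only."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)

# Check the filesystem holding the current directory.
print(is_readonly_mount("."))
```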

Anyway, the RAID-1 filesystem now looks hosed after a drive failed in
it, and the btrfs filesystem hung when adding a new device.

Getting these kernel oopses now when trying to write anything there:

[  553.040075] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[  553.040264] IP: [<ffffffff8111f33b>] bio_get_nr_vecs+0x0/0x38
[  553.040378] PGD 0 
[  553.040484] Oops: 0000 [#1] SMP 
[  553.040622] Modules linked in: cpufreq_ondemand cpufreq_conservative cpufreq_powersave cpufreq_stats bridge stp llc ipv6 btrfs xor raid6_pq zlib_deflate loop i2c_i801 parport_pc i2c_core parport tpm_infineon tpm_tis video ehci_pci pcspkr ehci_hcd lpc_ich mfd_core acpi_cpufreq button battery tpm ext4 crc16 jbd2 mbcache raid1 sg sd_mod ahci libahci libata scsi_mod r8169 mii
[  553.042270] CPU: 1 PID: 4951 Comm: btrfs-delalloc- Not tainted 3.14.0-rc8 #1
[  553.042351] Hardware name: System manufacturer System Product Name/P8H77-M PRO, BIOS 1101 02/04/2013
[  553.042474] task: ffff8807f3f98000 ti: ffff8807ebc42000 task.ti: ffff8807ebc42000
[  553.042594] RIP: 0010:[<ffffffff8111f33b>]  [<ffffffff8111f33b>] bio_get_nr_vecs+0x0/0x38
[  553.042749] RSP: 0018:ffff8807ebc43af0  EFLAGS: 00010246
[  553.042828] RAX: 0000000000000100 RBX: 0000000000001000 RCX: 0000000214919ca0
[  553.042909] RDX: ffffea001f4ccc00 RSI: ffff8807ff148430 RDI: 0000000000000000
[  553.042990] RBP: ffff8807ebc43b48 R08: 0000000000001000 R09: 0000000000000000
[  553.043071] R10: 0000000000000000 R11: 0000000000014a98 R12: ffff8807ebc43c78
[  553.043151] R13: 0000000000000000 R14: 0000000214919ca0 R15: ffff8807ff148430
[  553.043233] FS:  0000000000000000(0000) GS:ffff88081fa40000(0000) knlGS:0000000000000000
[  553.043354] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  553.043433] CR2: 0000000000000098 CR3: 000000000160b000 CR4: 00000000001407e0
[  553.043513] Stack:
[  553.043587]  ffffffffa02e3b08 00000010ebc43b28 0000000000000000 ffffea001f4ccc00
[  553.043835]  0000041100000000 ffff8807ebc43b28 ffffea001f4ccc00 0000000000000000
[  553.044082]  0000000000000001 ffff8807ff148430 ffff8807ff1485a8 ffff8807ebc43c58
[  553.044330] Call Trace:
[  553.044419]  [<ffffffffa02e3b08>] ? submit_extent_page.isra.38+0x10c/0x17e [btrfs]
[  553.044551]  [<ffffffffa02e535d>] __extent_writepage+0x542/0x5d2 [btrfs]
[  553.044643]  [<ffffffffa02e389a>] ? end_extent_writepage+0x5c/0x5c [btrfs]
[  553.044734]  [<ffffffffa02e58c6>] extent_write_locked_range+0xbf/0x124 [btrfs]
[  553.044865]  [<ffffffffa02cec56>] ? btrfs_fiemap+0x4c/0x4c [btrfs]
[  553.044954]  [<ffffffffa02d2349>] submit_compressed_extents+0x133/0x424 [btrfs]
[  553.045084]  [<ffffffffa02d26bd>] async_cow_submit+0x83/0x88 [btrfs]
[  553.045174]  [<ffffffffa02f0fcc>] run_ordered_completions+0x68/0xc5 [btrfs]
[  553.045264]  [<ffffffffa02f1659>] worker_loop+0x16e/0x495 [btrfs]
[  553.045353]  [<ffffffffa02f14eb>] ? btrfs_queue_worker+0x269/0x269 [btrfs]
[  553.045435]  [<ffffffff81050c92>] kthread+0xcd/0xd5
[  553.045516]  [<ffffffff81050bc5>] ? kthread_freezable_should_stop+0x43/0x43
[  553.045598]  [<ffffffff8139a03c>] ret_from_fork+0x7c/0xb0
[  553.045678]  [<ffffffff81050bc5>] ? kthread_freezable_should_stop+0x43/0x43
[  553.045758] Code: c4 b8 f1 ff 48 83 c8 ff 41 59 5b 5d c3 90 90 90 55 48 89 e5 53 48 89 f3 51 f6 46 10 08 75 05 e8 e6 62 07 00 8b 43 38 5a 5b 5d c3 <48> 8b 87 98 00 00 00 55 b9 00 01 00 00 48 89 e5 48 8b 90 80 02 
[  553.048083] RIP  [<ffffffff8111f33b>] bio_get_nr_vecs+0x0/0x38
[  553.048196]  RSP <ffff8807ebc43af0>
[  553.048272] CR2: 0000000000000098
[  553.048349] ---[ end trace 36d74486b120a453 ]---
[  581.331680] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[  581.331867] IP: [<ffffffff8111f33b>] bio_get_nr_vecs+0x0/0x38
[  581.331981] PGD 0 
[  581.332087] Oops: 0000 [#2] SMP 
[  581.332227] Modules linked in: cpufreq_ondemand cpufreq_conservative cpufreq_powersave cpufreq_stats bridge stp llc ipv6 btrfs xor raid6_pq zlib_deflate loop i2c_i801 parport_pc i2c_core parport tpm_infineon tpm_tis video ehci_pci pcspkr ehci_hcd lpc_ich mfd_core acpi_cpufreq button battery tpm ext4 crc16 jbd2 mbcache raid1 sg sd_mod ahci libahci libata scsi_mod r8169 mii
[  581.333870] CPU: 3 PID: 5025 Comm: btrfs-transacti Tainted: G      D      3.14.0-rc8 #1
[  581.333989] Hardware name: System manufacturer System Product Name/P8H77-M PRO, BIOS 1101 02/04/2013
[  581.334109] task: ffff8807f3e30000 ti: ffff8807e770a000 task.ti: ffff8807e770a000
[  581.334226] RIP: 0010:[<ffffffff8111f33b>]  [<ffffffff8111f33b>] bio_get_nr_vecs+0x0/0x38
[  581.334377] RSP: 0018:ffff8807e770b7d0  EFLAGS: 00010246
[  581.334454] RAX: 0000000000000100 RBX: 0000000000001000 RCX: 00000001a049e238
[  581.334534] RDX: ffffea001f24a400 RSI: ffff8807e9888040 RDI: 0000000000000000
[  581.334614] RBP: ffff8807e770b828 R08: 0000000000001000 R09: 0000000000000000
[  581.334694] R10: 0000000000000000 R11: ffff8807cfed9690 R12: ffff8807e770ba08
[  581.334774] R13: 0000000000000000 R14: 00000001a049e238 R15: ffff8807e9888040
[  581.334854] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
[  581.334974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  581.335053] CR2: 0000000000000098 CR3: 000000000160b000 CR4: 00000000001407e0
[  581.335133] Stack:
[  581.335206]  ffffffffa02e3b08 ffff8807e770b828 0000000000000000 ffffea001f24a400
[  581.335449]  0000002000000000 ffff8807e770b8c0 0000034093c47000 ffffea001f24a400
[  581.335693]  0000000000001000 0000000000000000 0000000000000000 ffff8807e770b938
[  581.335935] Call Trace:
[  581.336021]  [<ffffffffa02e3b08>] ? submit_extent_page.isra.38+0x10c/0x17e [btrfs]
[  581.336147]  [<ffffffffa02e4a97>] __do_readpage+0x49f/0x540 [btrfs]
[  581.336251]  [<ffffffffa02e3d59>] ? repair_io_failure+0x19e/0x19e [btrfs]
[  581.336335]  [<ffffffffa02c42f0>] ? verify_parent_transid+0x146/0x146 [btrfs]
[  581.336420]  [<ffffffffa02e09dd>] ? btrfs_lookup_ordered_extent+0x5d/0xb4 [btrfs]
[  581.336544]  [<ffffffffa02e4bed>] __extent_read_full_page+0xb5/0xc4 [btrfs]
[  581.336628]  [<ffffffffa02c42f0>] ? verify_parent_transid+0x146/0x146 [btrfs]
[  581.336712]  [<ffffffffa02e6ce7>] read_extent_buffer_pages+0x1ff/0x219 [btrfs]
[  581.336831]  [<ffffffff811ac285>] ? radix_tree_insert+0xf3/0x1bf
[  581.336914]  [<ffffffffa02c42f0>] ? verify_parent_transid+0x146/0x146 [btrfs]
[  581.336997]  [<ffffffffa02c5865>] btree_read_extent_buffer_pages.constprop.123+0x61/0xf9 [btrfs]
[  581.337121]  [<ffffffffa02c5dea>] read_tree_block+0x2c/0x45 [btrfs]
[  581.337204]  [<ffffffffa02ae1bf>] read_block_for_search.isra.40+0x2b4/0x2fb [btrfs]
[  581.337326]  [<ffffffffa02a9745>] ? unlock_up+0xdd/0x120 [btrfs]
[  581.338518]  [<ffffffffa02b02e2>] btrfs_search_slot+0x5ee/0x7dd [btrfs]
[  581.338600]  [<ffffffffa02b17d9>] btrfs_insert_empty_items+0x58/0xa4 [btrfs]
[  581.338683]  [<ffffffffa02bc11c>] __btrfs_run_delayed_refs+0x6c6/0xc36 [btrfs]
[  581.338806]  [<ffffffffa02be2b3>] btrfs_run_delayed_refs+0x7e/0x212 [btrfs]
[  581.338890]  [<ffffffffa02cbb3d>] btrfs_commit_transaction+0x375/0x7ff [btrfs]
[  581.339013]  [<ffffffffa02c9ea8>] transaction_kthread+0xef/0x1c3 [btrfs]
[  581.339107]  [<ffffffffa02c9db9>] ? open_ctree+0x1b5c/0x1b5c [btrfs]
[  581.339195]  [<ffffffff81050c92>] kthread+0xcd/0xd5
[  581.339272]  [<ffffffff81050bc5>] ? kthread_freezable_should_stop+0x43/0x43
[  581.339351]  [<ffffffff8139a03c>] ret_from_fork+0x7c/0xb0
[  581.339429]  [<ffffffff81050bc5>] ? kthread_freezable_should_stop+0x43/0x43
[  581.339507] Code: c4 b8 f1 ff 48 83 c8 ff 41 59 5b 5d c3 90 90 90 55 48 89 e5 53 48 89 f3 51 f6 46 10 08 75 05 e8 e6 62 07 00 8b 43 38 5a 5b 5d c3 <48> 8b 87 98 00 00 00 55 b9 00 01 00 00 48 89 e5 48 8b 90 80 02
[  581.341762] RIP  [<ffffffff8111f33b>] bio_get_nr_vecs+0x0/0x38
[  581.341870]  RSP <ffff8807e770b7d0>
[  581.341944] CR2: 0000000000000098
[  581.342019] ---[ end trace 36d74486b120a454 ]---


-- 
Tomasz Chmielewski
http://wpkg.org

