* RAID1: system stability
@ 2015-05-26 11:23 Timofey Titovets
2015-05-26 19:31 ` Timofey Titovets
2015-05-26 19:49 ` Chris Murphy
0 siblings, 2 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-05-26 11:23 UTC (permalink / raw)
To: linux-btrfs
Hi list,
I'm a regular on this list and I like btrfs very much. I want to use it on a production server, replacing the hardware RAID there.

Test case: a server with N SCSI discs,
2 SAS disks used for a RAID1 root fs.
If I just pull one disk physically, everything is okay at first: the kernel shows me write errors and the system keeps working for some time. But after the first sync call, for example:
# sync
# dd if=/dev/zero of=/zero

the kernel crashes and the system freezes.
Yes, after a reboot I can mount with the degraded and recovery options, re-add the failed disk, and btrfs will rebuild the array.
But is a kernel crash and reboot expected in this case, or can I avoid it? How?
# mount -o remount,degraded -> kernel crash
Inserting the failed disk again -> kernel crash

Maybe I'm missing something? I just want to avoid downtime and/or a reboot =.=
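The failure and recovery sequence described above can be sketched as follows. This is a hedged illustration, not a verified procedure: the device names (/dev/sdb1 for the surviving mirror, /dev/sdc1 for the replacement) and the mount point /mnt are hypothetical, and re-adding plus rebalancing is one way to rebuild; the poster simply re-inserted the original disk.

```shell
# Trigger: with one disk of the RAID1 root fs pulled, the first flush crashes.
sync
dd if=/dev/zero of=/zero bs=1M

# Recovery after the forced reboot (hypothetical device names):
mount -o degraded,recovery /dev/sdb1 /mnt   # mount the surviving RAID1 member
btrfs device add /dev/sdc1 /mnt             # add a second device back
btrfs balance start /mnt                    # re-mirror the data onto it
```

Note that these commands require root and a real two-device btrfs RAID1; they are destructive to the named devices.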
^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: RAID1: system stability
2015-05-26 11:23 RAID1: system stability Timofey Titovets
@ 2015-05-26 19:31 ` Timofey Titovets
2015-05-26 19:49 ` Chris Murphy
1 sibling, 0 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-05-26 19:31 UTC (permalink / raw)
To: linux-btrfs

Oh, I forgot to mention: I tested this on 3.19+ kernels.
I can get the trace from the screen if it is interesting for the developers.

2015-05-26 14:23 GMT+03:00 Timofey Titovets <nefelim4ag@gmail.com>:
> Hi list,
> I'm a regular on this list and I like btrfs very much. I want to use it on a production server, replacing the hardware RAID there.
>
> Test case: a server with N SCSI discs,
> 2 SAS disks used for a RAID1 root fs.
> If I just pull one disk physically, everything is okay at first: the kernel shows me write errors and the system keeps working for some time. But after the first sync call, for example:
> # sync
> # dd if=/dev/zero of=/zero
>
> the kernel crashes and the system freezes.
> Yes, after a reboot I can mount with the degraded and recovery options, re-add the failed disk, and btrfs will rebuild the array.
> But is a kernel crash and reboot expected in this case, or can I avoid it? How?
> # mount -o remount,degraded -> kernel crash
> Inserting the failed disk again -> kernel crash
>
> Maybe I'm missing something? I just want to avoid downtime and/or a reboot =.=
--
Have a nice day,
Timofey.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-05-26 11:23 RAID1: system stability Timofey Titovets
2015-05-26 19:31 ` Timofey Titovets
@ 2015-05-26 19:49 ` Chris Murphy
2015-05-26 19:51 ` Timofey Titovets
1 sibling, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2015-05-26 19:49 UTC (permalink / raw)
To: Timofey Titovets; +Cc: linux-btrfs

Without a complete dmesg it's hard to say what's going on. The call
trace alone probably doesn't show the instigating factor, so you may need
to use remote ssh with journalctl -f, or use netconsole to
continuously get kernel messages prior to the implosion.

Chris Murphy

^ permalink raw reply [flat|nested] 13+ messages in thread
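Chris's suggestion of streaming kernel messages off-box before the crash can be sketched like this. All hostnames, IP addresses, interface names, and MAC addresses below are hypothetical placeholders; adjust them to the actual network.

```shell
# Option 1: from a second machine, follow the test box's kernel log over ssh,
# so the messages survive the local crash.
ssh root@srv-lab-ceph-node-01 'journalctl -kf' | tee crash-capture.log

# Option 2: on the test box, load netconsole to push kernel messages over UDP
# to a receiver (source 192.168.1.5/eth0, receiver 192.168.1.10 - placeholders).
modprobe netconsole netconsole=6666@192.168.1.5/eth0,514@192.168.1.10/00:11:22:33:44:55

# On the receiver, listen for the incoming kernel messages:
nc -l -u 514 | tee crash-capture.log
```

netconsole is useful precisely because it keeps transmitting up to the moment of a hard kernel oops, when local disk logging has already stopped.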
* Re: RAID1: system stability
2015-05-26 19:49 ` Chris Murphy
@ 2015-05-26 19:51 ` Timofey Titovets
2015-06-22 11:35 ` Timofey Titovets
0 siblings, 1 reply; 13+ messages in thread
From: Timofey Titovets @ 2015-05-26 19:51 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs

Oh, thanks for the advice, I'll capture and attach it.
So, as I understand it, behaviour like this is not expected. Good to know.

2015-05-26 22:49 GMT+03:00 Chris Murphy <lists@colorremedies.com>:
> Without a complete dmesg it's hard to say what's going on. The call
> trace alone probably doesn't show the instigating factor, so you may need
> to use remote ssh with journalctl -f, or use netconsole to
> continuously get kernel messages prior to the implosion.
>
> Chris Murphy

--
Have a nice day,
Timofey.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-05-26 19:51 ` Timofey Titovets
@ 2015-06-22 11:35 ` Timofey Titovets
2015-06-22 11:45 ` Timofey Titovets
2015-06-22 16:03 ` Chris Murphy
0 siblings, 2 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-06-22 11:35 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs

Okay, logs. I released disk /dev/sde1 and got:

Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5,sas_addr 0x5000cca00d0514bd
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0xffff880449541400)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffff87fa7ff53430
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: PGD 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Oops: 0000 [#1] SMP
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc amdkfd amd_iommu_v2 radeon ttm drm_kms_helper ipmi_ssif coretemp gpio_ich drm kvm_intel serio_raw i5000_edac kvm ipmi_si lpc_ich edac_core ioatdma joydev i2c_algo_bit 8250_fintek mac_hid ipmi_msghandler i5k_amb dca shpchp bonding autofs4 btrfs ses enclosure raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx mptsas mptscsih xor hid_generic raid6_pq raid1 usbhid e1000e mptbase raid0 psmouse ptp hid multipath scsi_transport_sas pps_core linear
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CPU: 1 PID: 2411 Comm: kworker/u16:16 Not tainted 3.19.0-21-generic #21-Ubuntu
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: task: ffff8803ef8ae220 ti: ffff8803efbe4000 task.ti: ffff8803efbe4000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc04709d9>] [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff8803efbe79d8 EFLAGS: 00010287
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RAX: 0000000000010000 RBX: ffff88009a80dd00 RCX: 0000000000001533
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RDX: ffff87fa7ff53428 RSI: ffff88009a80dd70 RDI: 000000009a869e00
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RBP: ffff8803efbe7ab8 R08: 000000000000c000 R09: ffff88009a80dd00
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: R10: 000000000000c000 R11: 0000000000000002 R12: 000000009a869e00
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: R13: ffff880403566420 R14: ffff880448e20000 R15: 0000000000000001
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fc40000(0000) knlGS:0000000000000000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CR2: ffff87fa7ff53430 CR3: 000000034f878000 CR4: 00000000000407e0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Stack:
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: 0000000000000000 0000000000001000 ffff8803efbe7a28 000000000000c000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: 0000000000000001 0000000000010000 0000000015340000 0000000000000000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: 0000000000001534 0000000000001533 ffff880448e20dd0 0000000000000000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0420afa>] ? btrfs_free_path+0x2a/0x40 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0476b5d>] btrfs_map_bio+0x7d/0x530 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0493982>] btrfs_submit_compressed_read+0x332/0x4d0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc044df51>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8137cc6e>] ? bio_add_page+0x5e/0x70
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc046be79>] ? btrfs_create_repair_bio+0xe9/0x110 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc046c38a>] end_bio_extent_readpage+0x4ea/0x5e0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc046bea0>] ? btrfs_create_repair_bio+0x110/0x110 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8137f1eb>] bio_endio+0x6b/0xa0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff811d5bce>] ? kmem_cache_free+0x1be/0x200
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8137f232>] bio_endio_nodec+0x12/0x20
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0440f3f>] end_workqueue_fn+0x3f/0x50 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc047b4e2>] normal_work_helper+0xc2/0x2b0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc047b7a2>] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8108fc98>] process_one_work+0x158/0x430
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff810907db>] worker_thread+0x5b/0x530
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff81090780>] ? rescuer_thread+0x3a0/0x3a0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff81095879>] kthread+0xc9/0xe0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff817cae18>] ret_from_fork+0x58/0x90
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Code: 8d 68 ff ff ff 7e 3f 0f 1f 00 49 63 c4 4d 89 d0 41 83 c4 01 48 8d 04 40 48 83 c6 18 48 8d 14 c5 20 00 00 00 49 63 45 10 4c 01 ea <4c> 03 42 08 48 0f af c1 4c 01 c0 48 89 46 e8 48 8b 02 48 89 46
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RSP <ffff8803efbe79d8>
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CR2: ffff87fa7ff53430
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: ---[ end trace f8af5955ebefcf19 ]---
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffff81095f80>] kthread_data+0x10/0x20

--
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-06-22 11:35 ` Timofey Titovets
@ 2015-06-22 11:45 ` Timofey Titovets
2015-06-22 16:03 ` Chris Murphy
1 sibling, 0 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-06-22 11:45 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs

And again, if I try:
echo 1 > /sys/block/sdf/device/delete

Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: ------------[ cut here ]------------
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: kernel BUG at /build/buildd/linux-3.19.0/fs/btrfs/extent_io.c:2056!
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: invalid opcode: 0000 [#1] SMP
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc ipmi_ssif amdkfd amd_iommu_v2 gpio_ich radeon ttm drm_kms_helper lpc_ich coretemp drm kvm_intel kvm i5000_edac i2c_algo_bit edac_core i5k_amb shpchp ipmi_si serio_raw 8250_fintek ioatdma dca joydev mac_hid ipmi_msghandler bonding autofs4 btrfs ses enclosure raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq hid_generic raid1 e1000e raid0 usbhid mptsas mptscsih multipath psmouse hid mptbase ptp scsi_transport_sas pps_core linear
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CPU: 0 PID: 1150 Comm: kworker/u16:12 Not tainted 3.19.0-21-generic #21-Ubuntu
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: task: ffff88044c603110 ti: ffff88044b4b8000 task.ti: ffff88044b4b8000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc043fa80>] [<ffffffffc043fa80>] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff88044b4bbba8 EFLAGS: 00010202
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RDX: 0000000000000000 RSI: ffff880449841b08 RDI: ffff880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RBP: ffff88044b4bbc08 R08: 0000000000109000 R09: ffff880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R10: 0000000000009000 R11: 0000000000000002 R12: ffff8803fa878068
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R13: ffff880448f5d000 R14: ffff88044cde8d28 R15: 0000000524f09000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fc00000(0000) knlGS:0000000000000000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CR2: 00007fdcef9cafb8 CR3: 0000000001c13000 CR4: 00000000000407f0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Stack:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: ffff880448f5d100 0000000000001000 000000004b4bbbd8 ffffea000fb66d40
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 0000000000007000 ffff880449841a80 ffff88044b4bbc08 ffff880439a44b58
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 0000000000001000 ffff880448f5d000 ffff88044cde8d28 ffff88044cde8bf0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc043fd7c>] clean_io_failure+0x19c/0x1b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc04401b0>] end_bio_extent_readpage+0x310/0x5e0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff811d5795>] ? __slab_free+0xa5/0x320
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8101e74a>] ? native_sched_clock+0x2a/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8137f1eb>] bio_endio+0x6b/0xa0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff811d5bce>] ? kmem_cache_free+0x1be/0x200
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8137f232>] bio_endio_nodec+0x12/0x20
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc0414f3f>] end_workqueue_fn+0x3f/0x50 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc044f4e2>] normal_work_helper+0xc2/0x2b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc044f7a2>] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8108fc98>] process_one_work+0x158/0x430
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff810907db>] worker_thread+0x5b/0x530
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff81090780>] ? rescuer_thread+0x3a0/0x3a0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff81095879>] kthread+0xc9/0xe0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff817cae18>] ret_from_fork+0x58/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Code: f4 fe ff ff 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 4c 89 e7 e8 e0 e4 f3 c0 41 b9 fb ff ff ff e9 d2 fe ff ff 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e7 e8 c0 e4 f3 c0 31 f6 4c 89 ef
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc043fa80>] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP <ffff88044b4bbba8>

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-06-22 11:35 ` Timofey Titovets
2015-06-22 11:45 ` Timofey Titovets
@ 2015-06-22 16:03 ` Chris Murphy
2015-06-22 16:36 ` Timofey Titovets
1 sibling, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2015-06-22 16:03 UTC (permalink / raw)
To: Timofey Titovets; +Cc: Chris Murphy, linux-btrfs

On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote:
> Okay, logs. I released disk /dev/sde1 and got:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

So what's up with this? Does this only happen after you try to (software)
remove /dev/sde1, or is it happening before that as well? Because this
looks like some kind of hardware problem, with the drive reporting
an error for a particular sector on read, as if it's a bad sector.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

Again, the same sector as before. This is not a Btrfs error message; it's
coming from the block layer.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read

I'm not a dev, so take it with a grain of salt, but because this
references a logical block, this is the layer between Btrfs and the
physical device. Btrfs works on logical blocks, and those have to be
translated to a device and a physical sector. Maybe what's happening is
that there's confusion somewhere about this device not actually being
unavailable, so Btrfs or something else is trying to read this logical
block again, which causes a read attempt to happen instead of a flat-out
"this device doesn't exist" type of error. So I don't know whether this
is a problem strictly in Btrfs's missing-device error handling, or whether
there's something else that isn't really working correctly.

You could test by physically removing the device; if you have hot-plug
support (be certain all the hardware components support it), you can
see whether you get different results. Or you could try to reproduce the
software delete of the device with mdraid or lvm raid with XFS and no
Btrfs at all, and see whether you get different results.

It's known that the btrfs multiple-device failure use case is weak
right now. Data isn't lost, but the error handling, notification, all of
that is almost non-existent compared to mdadm.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5,sas_addr 0x5000cca00d0514bd
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0xffff880449541400)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd

OK, it looks like not until here does it actually get deleted (?),
and then that results in piles of write errors to this device by btrfs:

> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0

So this makes sense in that it tries to write but can't, because the
device is now missing. So it's a case of Btrfs not doing very well at
handling a suddenly missing device, I think.

> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffff87fa7ff53430
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: PGD 0
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Oops: 0000 [#1] SMP

And then the oops. Not good. So yes, it's definitely a Btrfs bug that it
oopses instead of gracefully handling the failure. The question is
whether (and what) other mitigating circumstances contribute to this bad
handling; there may be other bugs that instigate this.

I've tested this in a ridiculously rudimentary way (with USB drives),
just by yanking them during usage, and I don't get an oops. But I do get
piles of read and/or write errors, and it seems Btrfs never really
becomes aware of the fact that there's a missing device until there's a
remount or even a reboot. I haven't quantified what amount of data is
lost, but the file system itself still works degraded in this case with
the remaining drive (actually both drives work fine, but once they're
each written to separately with the degraded mount option, they can't be
rejoined; if you try it, serious fs corruption results.)

--
Chris Murphy

^ permalink raw reply [flat|nested] 13+ messages in thread
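The per-device error counters quoted above (wr/rd/flush/corrupt/gen) are the same ones btrfs exposes to userspace, so the state of each mirror can be checked without digging through dmesg. A minimal sketch, assuming the filesystem is mounted at a hypothetical /mnt:

```shell
# Print btrfs's persistent per-device I/O error counters (write, read,
# flush, corruption, and generation errors) for a mounted filesystem.
btrfs device stats /mnt

# A degraded RAID1 can be mounted read-write from the surviving member
# (device name hypothetical). Heed Chris's warning above: never mount
# both halves separately with -o degraded, or they diverge and cannot
# be safely rejoined afterwards.
mount -o degraded /dev/sdb1 /mnt
```

A non-zero and growing `wr` counter on one device, as in the logs above, is the userspace-visible symptom of the lost page writes.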
* Re: RAID1: system stability 2015-06-22 16:03 ` Chris Murphy @ 2015-06-22 16:36 ` Timofey Titovets 2015-06-22 16:52 ` Chris Murphy 0 siblings, 1 reply; 13+ messages in thread From: Timofey Titovets @ 2015-06-22 16:36 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs 2015-06-22 19:03 GMT+03:00 Chris Murphy <lists@colorremedies.com>: > On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote: >> Okay, logs, i did release disk /dev/sde1 and get: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >> 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >> error, dev sde, sector 287140096 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: >> LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED >> Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >> 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >> error, dev sde, sector 287140096 > > So what's up with this? This only happens after you try to (software) > remove /dev/sde1? Or is it happening also before that? Because this > looks like some kind of hardware problem when the drive is reporting > an error for a particular sector on read, as if it's a bad sector. Nope, i've physically remove device and as you see it's produce errors on block layer -.- and this disks have 100% 'health' Because it's hot-plug device, kernel see what device now missing and remove all kernel objects reletad to them. 
> >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096 > > Again same sector as before. This is not a Btrfs error message, it's > coming from the block layer. > > >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read > > I'm not a dev so take it with a grain of salt but because this > references a logical block, this is the layer in between Btrfs and the > physical device. Btrfs works on logical blocks and those have to be > translated to device and physical sector. Maybe what's happening is > there's confusion somewhere about this device not actually being > unavailable so Btrfs or something else is trying to read this logical > block again, which causes a read attempt to happen instead of a flat > out "this device doesn't exist" type of error. So I don't know if this > is a problem strictly in Btrfs missing device error handling, or if > there's something else that's not really working correctly. > > You could test by physically removing the device, if you have hot plug > support (be certain all the hardware components support it), you can > see if you get different results. 
Or you could try to reproduce the > software delete of the device with mdraid or lvm raid with XFS and no > Btrfs at all, and see if you get different results. > > It's known that the btrfs multiple device failure use case is weak > right now. Data isn't lost, but the error handling, notification, all > that is almost non-existent compared to mdadm. So sad -.- I tested this case with an md raid1, and the system continued working without problems when I released one of the two md devices. > >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy >> 5,sas_addr 0x5000cca00d0514bd >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0xffff880449541400) >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd) >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, 
phy 5, sas_addr >> 0x5000cca00d0514bd > > OK it looks like not until here does it actually get deleted (?) and > then that results in piles of write errors to this device by btrfs: > > >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1 >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1 >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0 > > So this makes sense in that it tries to write but can't because the > device is now missing. So it's a case of Btrfs not doing very well > handling suddenly missing device, I think. 
> > > >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0 >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffff87fa7ff53430 >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs] >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: PGD 0 >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Oops: 0000 [#1] SMP > > > And then oops. Not good. So yeah it's definitely a Btrfs bug that it > oopses instead of gracefully handling the failure. The question is > whether (and what) other mitigating circumstances contribute to this > bad handling, there may be other bugs that instigate this. I've tested > this in a ridiculously rudimentary way (with USB drives) just by > hanging them during usage, and I don't get an oops. But I do get piles > of read and or write errors and it seems Btrfs never really becomes > aware of the fact there's a missing device until there's a remount or > even a reboot. I haven't quantified what amount of data is lost, but > the file system itself still works degraded in this case with the > remaining drive (actually both drives work fine, but once they're each > written to separately with degraded mount option, they can't be > rejoined together; if you try it, serious fs corruption results.) > > -- > Chris Murphy You're right about USB devices: they don't produce an oops. Maybe it's because the kernel uses different modules for SAS/SATA disks and USB sticks. -- Have a nice day, Timofey. ^ permalink raw reply [flat|nested] 13+ messages in thread
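As an aside for readers reproducing this: the wr/rd/flush/corrupt/gen counters quoted above are also exposed by `btrfs device stats`, which makes them easy to watch from a script. A minimal sketch (the mount point /mnt and the helper name `report_bad_devices` are illustrative assumptions, not from the thread):

```shell
# report_bad_devices reads "btrfs device stats" output on stdin and prints
# each device that has at least one non-zero error counter. Stats lines look
# like: [/dev/sde1].write_io_errs   13
report_bad_devices() {
    awk '$2 > 0 { split($1, a, "]"); seen[substr(a[1], 2)] = 1 }
         END { for (d in seen) print d }'
}

# Real use would be: btrfs device stats /mnt | report_bad_devices
# Demonstration with captured sample output:
printf '[/dev/sde1].write_io_errs 13\n[/dev/sdd1].write_io_errs 0\n' | report_bad_devices
```

Note that the counters are cumulative and persist until explicitly reset with `btrfs device stats -z`, so a non-zero value may reflect an old incident rather than a live failure.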
* Re: RAID1: system stability 2015-06-22 16:36 ` Timofey Titovets @ 2015-06-22 16:52 ` Chris Murphy 2015-07-22 11:00 ` Russell Coker 0 siblings, 1 reply; 13+ messages in thread From: Chris Murphy @ 2015-06-22 16:52 UTC (permalink / raw) To: Timofey Titovets; +Cc: Chris Murphy, linux-btrfs On Mon, Jun 22, 2015 at 10:36 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote: > 2015-06-22 19:03 GMT+03:00 Chris Murphy <lists@colorremedies.com>: >> On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote: >>> Okay, logs, i did release disk /dev/sde1 and get: >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >>> 00 00 00 08 00 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >>> error, dev sde, sector 287140096 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: >>> LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >>> SubCode(0x0011) cb_idx mptscsih_io_done >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED >>> Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >>> 00 00 00 08 00 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >>> error, dev sde, sector 287140096 >> >> So what's up with this? This only happens after you try to (software) >> remove /dev/sde1? Or is it happening also before that? Because this >> looks like some kind of hardware problem when the drive is reporting >> an error for a particular sector on read, as if it's a bad sector. > > Nope, i've physically remove device and as you see it's produce errors > on block layer -.- > and this disks have 100% 'health' > > Because it's hot-plug device, kernel see what device now missing and > remove all kernel objects reletad to them. 
OK I actually don't know what the intended block layer behavior is when unplugging a device, if it is supposed to vanish, or change state somehow so that thing that depend on it can know it's "missing" or what. So the question here is, is this working as intended? If the layer Btrfs depends on isn't working as intended, then Btrfs is probably going to do wild and crazy things. And I don't know that the part of the block layer Btrfs depends on for this is the same (or different) as what the md driver depends on. > >> >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >>> SubCode(0x0011) cb_idx mptscsih_io_done >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >>> SubCode(0x0011) cb_idx mptscsih_io_done >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096 >> >> Again same sector as before. This is not a Btrfs error message, it's >> coming from the block layer. >> >> >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >> >> I'm not a dev so take it with a grain of salt but because this >> references a logical block, this is the layer in between Btrfs and the >> physical device. Btrfs works on logical blocks and those have to be >> translated to device and physical sector. 
Maybe what's happening is >> there's confusion somewhere about this device not actually being >> unavailable so Btrfs or something else is trying to read this logical >> block again, which causes a read attempt to happen instead of a flat >> out "this device doesn't exist" type of error. So I don't know if this >> is a problem strictly in Btrfs missing device error handling, or if >> there's something else that's not really working correctly. >> >> You could test by physically removing the device, if you have hot plug >> support (be certain all the hardware components support it), you can >> see if you get different results. Or you could try to reproduce the >> software delete of the device with mdraid or lvm raid with XFS and no >> Btrfs at all, and see if you get different results. >> >> It's known that the btrfs multiple device failure use case is weak >> right now. Data isn't lost, but the error handling, notification, all >> that is almost non-existent compared to mdadm. > > So sad -.- > i've test this test case with md raid1 and system continue work > without problem when i release one of two md device OK well then it's either a Btrfs bug or something it directly depends on that md does not. > You right about usb devices, it's not produce oops. > May be its because kernel use different modules for SAS/SATA disks and > usb sticks. They appear as sd devices on my system, so they're using libata and as such they ultimately still depend on the SCSI block layer. But there may be a very different kind of missing device error handling for USB that somehow makes its way up to libata differently than SAS/SATA hotplug. I'd say the oops is definitely a Btrfs bug. 
But it might also be worthwhile to post the kernel messages to the linux-scsi@ list, listing the hardware details (logic board, SAS/SATA card, drives) and of course the full kernel messages along with steps to reproduce, and see whether the fact that the device doesn't actually drop out, as it does with USB devices, is intended behavior. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability 2015-06-22 16:52 ` Chris Murphy @ 2015-07-22 11:00 ` Russell Coker 2015-08-05 17:32 ` Austin S Hemmelgarn 0 siblings, 1 reply; 13+ messages in thread From: Russell Coker @ 2015-07-22 11:00 UTC (permalink / raw) To: Chris Murphy, linux-btrfs On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote: > OK I actually don't know what the intended block layer behavior is > when unplugging a device, if it is supposed to vanish, or change state > somehow so that thing that depend on it can know it's "missing" or > what. So the question here is, is this working as intended? If the > layer Btrfs depends on isn't working as intended, then Btrfs is > probably going to do wild and crazy things. And I don't know that the > part of the block layer Btrfs depends on for this is the same (or > different) as what the md driver depends on. I disagree with that statement. BTRFS should be expected to not do wild and crazy things regardless of what happens with block devices. A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning any manner of corrupted data and should not lose data or panic the kernel. A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by mounting read-only or failing all operations on the filesystem. It should not affect any other filesystem or have any significant impact on the system unless it's the root filesystem. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability 2015-07-22 11:00 ` Russell Coker @ 2015-08-05 17:32 ` Austin S Hemmelgarn 2015-08-05 19:00 ` Martin Steigerwald 0 siblings, 1 reply; 13+ messages in thread From: Austin S Hemmelgarn @ 2015-08-05 17:32 UTC (permalink / raw) To: Russell Coker, Chris Murphy, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2428 bytes --] On 2015-07-22 07:00, Russell Coker wrote: > On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote: >> OK I actually don't know what the intended block layer behavior is >> when unplugging a device, if it is supposed to vanish, or change state >> somehow so that thing that depend on it can know it's "missing" or >> what. So the question here is, is this working as intended? If the >> layer Btrfs depends on isn't working as intended, then Btrfs is >> probably going to do wild and crazy things. And I don't know that the >> part of the block layer Btrfs depends on for this is the same (or >> different) as what the md driver depends on. > > I disagree with that statement. BTRFS should be expected to not do wild and > crazy things regardless of what happens with block devices. I would generally agree with this, although we really shouldn't be doing things like trying to handle hardware failures without user intervention. If a block device disappears from under us, we should throw a warning and if it's the last device in the FS, kill anything that is trying to read or write to that FS. At the very least, we should try to avoid hanging or panicking the system if all of the devices in an FS disappear out from under us. > > A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning > any manner of corrupted data and should not lose data or panic the kernel. It's debatable however whether the array should go read-only when degraded. MD/DM RAID (at least, AFAIK) and most hardware RAID controllers I've seen will still accept writes to degraded arrays, although there are arguments for forcing it read-only as well. 
Personally, I think that should be controlled by a mount option, so the sysadmin can decide, as it really is a policy decision. > > A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by > mounting read-only or failing all operations on the filesystem. It should not > affect any other filesystem or have any significant impact on the system unless > it's the root filesystem. Or some other critical filesystem (there are still people who put /usr and/or /var on separate filesystems). Ideally, I'd love to see some some kind of warning from the kernel if a filesystem gets mounted that has the metadata/system profile set to raid0 (and possibly have some of the tools spit out such a warning also). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
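To make the degraded-mount policy discussion concrete, here is a sketch of the recovery flow the thread assumes after one raid1 member dies. The device names, devid, and mount point are hypothetical, and the commands are guarded by a dry-run flag since they are destructive on real hardware:

```shell
#!/bin/sh
# Degraded-recovery sketch for a two-device btrfs raid1 (hypothetical names).
# Keep DRY_RUN=1 to only print the commands; set DRY_RUN=0 on a real array.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

run mount -o degraded /dev/sdd1 /mnt      # mount using the surviving member
run btrfs replace start 2 /dev/sdg1 /mnt  # rebuild missing devid 2 onto the new disk
# Tool versions without replace support for a missing source would instead use:
# run btrfs device add /dev/sdg1 /mnt
# run btrfs device delete missing /mnt
```

Chris's earlier warning still applies: never mount each former mirror degraded and writable on its own, or the two halves can no longer be rejoined.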
* Re: RAID1: system stability 2015-08-05 17:32 ` Austin S Hemmelgarn @ 2015-08-05 19:00 ` Martin Steigerwald 0 siblings, 0 replies; 13+ messages in thread From: Martin Steigerwald @ 2015-08-05 19:00 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Russell Coker, Chris Murphy, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2491 bytes --] On Wednesday, 5 August 2015, 13:32:41, Austin S Hemmelgarn wrote: > On 2015-07-22 07:00, Russell Coker wrote: > > On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote: > >> OK I actually don't know what the intended block layer behavior is > >> when unplugging a device, if it is supposed to vanish, or change state > >> somehow so that thing that depend on it can know it's "missing" or > >> what. So the question here is, is this working as intended? If the > >> layer Btrfs depends on isn't working as intended, then Btrfs is > >> probably going to do wild and crazy things. And I don't know that the > >> part of the block layer Btrfs depends on for this is the same (or > >> different) as what the md driver depends on. > > > > I disagree with that statement. BTRFS should be expected to not do wild > > and crazy things regardless of what happens with block devices. > > I would generally agree with this, although we really shouldn't be doing > things like trying to handle hardware failures without user > intervention. If a block device disappears from under us, we should > throw a warning and if it's the last device in the FS, kill anything > that is trying to read or write to that FS. At the very least, we > should try to avoid hanging or panicking the system if all of the > devices in an FS disappear out from under us. The best solution I have ever seen for removable media is with AmigaOS. You remove a disk (or nowadays a USB stick) while it is being written to, and AmigaDOS/AmigaOS pops up a dialog window saying "You MUST insert volume $VOLUMENAME again". And if you do, it just continues writing. 
I bet this may be difficult to do on Linux for all devices, as unwritten changes pile up in memory until the dirty limits are reached, unless one says "Okay, disk gone, we block all processes writing to it immediately or quite soon", but for removable media I never saw anything else with that amount of sanity. There was a GSoC project for NetBSD once to implement this, but I don't know whether it's implemented there now. For AmigaOS and floppy disks with the filesystems of that time there was just one catch: if you didn't insert the disk again, it was often broken beyond repair. For a journaling or CoW filesystem it would just be like any other sudden stop to writes. On Linux with eSATA I saw I could also replug a disk if I hadn't yet hit the timeouts in the block layer. After that, the disk is gone. Ciao, -- Martin [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability @ 2015-06-17 9:20 Timofey Titovets 0 siblings, 0 replies; 13+ messages in thread From: Timofey Titovets @ 2015-06-17 9:20 UTC (permalink / raw) To: linux-btrfs Update: I tried removing the disk the 'right' way: # echo 1 > /sys/block/sdf/device/delete Everything was okay: the system didn't crash immediately on a 'sync' call and could keep working for some time without problems. But after a certain operation, which I can reproduce with: # apt-get update the test system (from which I deleted one of the raid1 btrfs devices) gets a kernel crash; I got the following dmesg: ---- Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc gpio_ich coretemp kvm_intel lpc_ich ipmi_ssif kvm amdkfd amd_iommu_v2 serio_raw radeon ttm i5000_edac drm_kms_helper drm edac_core i2c_algo_bit i5k_amb ioatdma dca shpchp 8250_fintek joydev mac_hid ipmi_si ipmi_msghandler bonding autofs4 btrfs xor raid6_pq ses enclosure hid_generic psmouse usbhid hid mptsas mptscsih e1000e mptbase scsi_transport_sas ptp pps_core Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CPU: 3 PID: 99 Comm: kworker/u16:4 Not tainted 4.0.4-040004-generic #201505171336 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: task: ffff88009ab31400 ti: ffff88009ab40000 task.ti: ffff88009ab40000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc0477d50>] [<ffffffffc0477d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff88009ab43bb8 EFLAGS: 00010206 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RAX: 0000000000000000 RBX: ffff88009b1d3f30 RCX: ffff88009b53f9c0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RDX: ffff88044902f400 RSI: 0000000000000000 RDI: ffff88009b53f9c0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RBP: ffff88009ab43c18 
R08: 0000000000000000 R09: 0000000000000000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R10: ffff880448c1b090 R11: 0000000000000000 R12: 0000000039070000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R13: ffff880439599e68 R14: 0000000000001000 R15: ffff88009a860000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fcc0000(0000) knlGS:0000000000000000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CR2: 00007f640a27e675 CR3: 0000000098b4b000 CR4: 00000000000407e0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Stack: Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: 0000000000000000 000000009a860de0 ffffea0002644380 00000003d2ee8000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: 0000000000008000 ffff88009b53f9c0 ffff88009ab43c18 ffff88009b1d3f30 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: ffff88044c44a3c0 ffff88009b0c1190 0000000000000000 ffff88009a860000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Call Trace: Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0477f30>] clean_io_failure+0x1a0/0x1b0 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0478218>] end_bio_extent_readpage+0x2d8/0x3d0 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff8137b2c3>] bio_endio+0x53/0xa0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff8137b322>] bio_endio_nodec+0x12/0x20 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc044efb8>] end_workqueue_fn+0x48/0x60 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0488b2e>] normal_work_helper+0x7e/0x1b0 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0488d32>] btrfs_endio_helper+0x12/0x20 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81092204>] process_one_work+0x144/0x490 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81092c6e>] worker_thread+0x11e/0x450 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81092b50>] ? 
create_worker+0x1f0/0x1f0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81098999>] kthread+0xc9/0xe0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? flush_kthread_worker+0x90/0x90 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff817f08d8>] ret_from_fork+0x58/0x90 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? flush_kthread_worker+0x90/0x90 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Code: 44 00 00 4c 89 ef e8 b0 34 f0 c0 31 f6 4c 89 e7 e8 06 05 01 00 ba fb ff ff ff e9 c7 fe ff ff ba fb ff ff ff e9 bd fe ff ff 0f 0b <0f> 0b 49 8b 4c 24 30 48 8b b3 58 fe ff ff 48 83 c1 10 48 85 f6 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc0477d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP <ffff88009ab43bb8> Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: ---[ end trace 0361c6fdca5f7ee2 ]--- --- Another test case: I deleted the device with: echo 1 > /sys/block/sdf/device/delete and then reinserted it (removed it and inserted it into the server again). The server found it as a new device, sdg, and everything seemed okay, but the kernel crashed with the following stack trace: --- Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: kernel BUG at /home/kernel/COD/linux/fs/btrfs/extent_io.c:2057! 
Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: invalid opcode: 0000 [#1] SMP Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc gpio_ich coretemp kvm_intel amdkfd amd_iommu_v2 ipmi_ssif kvm radeon lpc_ich serio_raw ttm i5000_edac edac_core drm_kms_helper drm i5k_amb ioatdma i2c_algo_bit joydev 8250_fintek ipmi_si dca ipmi_msghandler mac_hid shpchp bonding autofs4 btrfs xor raid6_pq ses enclosure hid_generic psmouse mptsas usbhid mptscsih hid mptbase scsi_transport_sas e1000e ptp pps_core Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: CPU: 2 PID: 72 Comm: kworker/u16:2 Not tainted 4.0.4-040004-generic #201505171336 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: task: ffff88044d215a00 ti: ffff880449b1c000 task.ti: ffff880449b1c000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc02a9d50>] [<ffffffffc02a9d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff880449b1fbb8 EFLAGS: 00010206 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RAX: 0000000000000000 RBX: ffff88044c3ac308 RCX: ffff88044c5ef3c0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RDX: ffff880449117400 RSI: 0000000000000000 RDI: ffff88044c5ef3c0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RBP: ffff880449b1fc18 R08: 0000000000000000 R09: 0000000000000000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: R10: ffff880448ce0090 R11: 0000000000000000 R12: 000000003999a000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: R13: ffff88043999a568 R14: 0000000000001000 R15: ffff880449510000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fc80000(0000) knlGS:0000000000000000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 
000000008005003b Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: CR2: 00007fbfbe12cf00 CR3: 0000000449b4e000 CR4: 00000000000407e0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Stack: Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: 0000000000000000 0000000049510de0 ffffea0010f40540 00000003f7ed4000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: 000000000000c000 ffff88044c5ef3c0 ffff880449b1fc18 ffff88044c3ac308 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: ffff88044b1acc80 ffff880448dcbfa0 0000000000000000 ffff880449510000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Call Trace: Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02a9f30>] clean_io_failure+0x1a0/0x1b0 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02aa218>] end_bio_extent_readpage+0x2d8/0x3d0 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff8137b2c3>] bio_endio+0x53/0xa0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff8137b322>] bio_endio_nodec+0x12/0x20 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc0280fb8>] end_workqueue_fn+0x48/0x60 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02bab2e>] normal_work_helper+0x7e/0x1b0 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02bad32>] btrfs_endio_helper+0x12/0x20 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81092204>] process_one_work+0x144/0x490 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81092c6e>] worker_thread+0x11e/0x450 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81092b50>] ? create_worker+0x1f0/0x1f0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81098999>] kthread+0xc9/0xe0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? flush_kthread_worker+0x90/0x90 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff817f08d8>] ret_from_fork+0x58/0x90 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? 
flush_kthread_worker+0x90/0x90 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Code: 44 00 00 4c 89 ef e8 b0 14 0d c1 31 f6 4c 89 e7 e8 06 05 01 00 ba fb ff ff ff e9 c7 fe ff ff ba fb ff ff ff e9 bd fe ff ff 0f 0b <0f> 0b 49 8b 4c 24 30 48 8b b3 58 fe ff ff 48 83 c1 10 48 85 f6 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc02a9d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RSP <ffff880449b1fbb8> Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: ---[ end trace 90ec36112ab1f744 ]--- P.S. I'm just thinking about the case where I have two disk slots in a server and want to replace one disk that has failed (overheated, simply 'burned out', or whatever) without server downtime. -- Have a nice day, Timofey. ^ permalink raw reply [flat|nested] 13+ messages in thread
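For reference, the removal/reinsertion procedure used in this test can be scripted roughly as follows. The disk name sdf and SCSI host number host0 are assumptions that differ per machine, so the destructive sysfs writes are guarded by a dry-run flag:

```shell
#!/bin/sh
# Sketch of the hot-removal test above. DESTRUCTIVE with DRY_RUN=0: it drops
# a disk out of the SCSI layer. Device and host names are placeholders.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else sh -c "$*"; fi; }

run 'echo 1 > /sys/block/sdf/device/delete'   # software-remove the disk
run 'sync'                                    # force I/O against the now-missing mirror
# After physically reseating the disk, rescan the controller; the kernel may
# bring it back under a new name (sdg in the report above):
run 'echo "- - -" > /sys/class/scsi_host/host0/scan'
```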
end of thread, other threads:[~2015-08-05 19:00 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-05-26 11:23 RAID1: system stability Timofey Titovets 2015-05-26 19:31 ` Timofey Titovets 2015-05-26 19:49 ` Chris Murphy 2015-05-26 19:51 ` Timofey Titovets 2015-06-22 11:35 ` Timofey Titovets 2015-06-22 11:45 ` Timofey Titovets 2015-06-22 16:03 ` Chris Murphy 2015-06-22 16:36 ` Timofey Titovets 2015-06-22 16:52 ` Chris Murphy 2015-07-22 11:00 ` Russell Coker 2015-08-05 17:32 ` Austin S Hemmelgarn 2015-08-05 19:00 ` Martin Steigerwald -- strict thread matches above, loose matches on Subject: below -- 2015-06-17 9:20 Timofey Titovets