* RAID1: system stability
@ 2015-05-26 11:23 Timofey Titovets
2015-05-26 19:31 ` Timofey Titovets
2015-05-26 19:49 ` Chris Murphy
0 siblings, 2 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-05-26 11:23 UTC (permalink / raw)
To: linux-btrfs
Hi list,
I'm a regular on this list and I like btrfs very much. I want to use it on a production server, replacing the hardware RAID there.

Test case: a server with N SCSI discs,
2 SAS disks used for a RAID1 root fs.
If I just pull one disk physically, everything is okay at first: the kernel shows me write errors and the system keeps working for some time. But after the first sync call, for example:
# sync
# dd if=/dev/zero of=/zero

the kernel crashes and the system freezes.
Yes, after a reboot I can mount with the degraded and recovery options, re-add the failed disk, and btrfs will rebuild the array.
But is a kernel crash and reboot expected in this case, or can I avoid it? How?
# mount -o remount,degraded -> kernel crash
Inserting the failed disk again -> kernel crash

Maybe I'm missing something? I just want to avoid downtime and/or a reboot =.=
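The failure and recovery sequence described above can be sketched as follows. This is a hedged illustration, not a verified procedure: the device names (/dev/sdb1 for the surviving mirror, /dev/sdc1 for the replacement) and the mount point /mnt are hypothetical, and re-adding plus rebalancing is one way to rebuild; the poster simply re-inserted the original disk.

```shell
# Trigger: with one disk of the RAID1 root fs pulled, the first flush crashes.
sync
dd if=/dev/zero of=/zero bs=1M

# Recovery after the forced reboot (hypothetical device names):
mount -o degraded,recovery /dev/sdb1 /mnt   # mount the surviving RAID1 member
btrfs device add /dev/sdc1 /mnt             # add a second device back
btrfs balance start /mnt                    # re-mirror the data onto it
```

Note that these commands require root and a real two-device btrfs RAID1; they are destructive to the named devices.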
^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: RAID1: system stability
2015-05-26 11:23 RAID1: system stability Timofey Titovets
@ 2015-05-26 19:31 ` Timofey Titovets
2015-05-26 19:49 ` Chris Murphy
1 sibling, 0 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-05-26 19:31 UTC (permalink / raw)
To: linux-btrfs

Oh, I forgot to mention: I tested this on 3.19+ kernels.
I can get the trace from the screen if it is interesting for the developers.

2015-05-26 14:23 GMT+03:00 Timofey Titovets <nefelim4ag@gmail.com>:
> Hi list,
> I'm a regular on this list and I like btrfs very much. I want to use it on a production server, replacing the hardware RAID there.
>
> Test case: a server with N SCSI discs,
> 2 SAS disks used for a RAID1 root fs.
> If I just pull one disk physically, everything is okay at first: the kernel shows me write errors and the system keeps working for some time. But after the first sync call, for example:
> # sync
> # dd if=/dev/zero of=/zero
>
> the kernel crashes and the system freezes.
> Yes, after a reboot I can mount with the degraded and recovery options, re-add the failed disk, and btrfs will rebuild the array.
> But is a kernel crash and reboot expected in this case, or can I avoid it? How?
> # mount -o remount,degraded -> kernel crash
> Inserting the failed disk again -> kernel crash
>
> Maybe I'm missing something? I just want to avoid downtime and/or a reboot =.=
--
Have a nice day,
Timofey.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-05-26 11:23 RAID1: system stability Timofey Titovets
2015-05-26 19:31 ` Timofey Titovets
@ 2015-05-26 19:49 ` Chris Murphy
2015-05-26 19:51 ` Timofey Titovets
1 sibling, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2015-05-26 19:49 UTC (permalink / raw)
To: Timofey Titovets; +Cc: linux-btrfs

Without a complete dmesg it's hard to say what's going on. The call
trace alone probably doesn't show the instigating factor, so you may need
to use remote ssh with journalctl -f, or use netconsole to
continuously get kernel messages prior to the implosion.

Chris Murphy

^ permalink raw reply [flat|nested] 13+ messages in thread
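Chris's suggestion of streaming kernel messages off-box before the crash can be sketched like this. All hostnames, IP addresses, interface names, and MAC addresses below are hypothetical placeholders; adjust them to the actual network.

```shell
# Option 1: from a second machine, follow the test box's kernel log over ssh,
# so the messages survive the local crash.
ssh root@srv-lab-ceph-node-01 'journalctl -kf' | tee crash-capture.log

# Option 2: on the test box, load netconsole to push kernel messages over UDP
# to a receiver (source 192.168.1.5/eth0, receiver 192.168.1.10 - placeholders).
modprobe netconsole netconsole=6666@192.168.1.5/eth0,514@192.168.1.10/00:11:22:33:44:55

# On the receiver, listen for the incoming kernel messages:
nc -l -u 514 | tee crash-capture.log
```

netconsole is useful precisely because it keeps transmitting up to the moment of a hard kernel oops, when local disk logging has already stopped.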
* Re: RAID1: system stability
2015-05-26 19:49 ` Chris Murphy
@ 2015-05-26 19:51 ` Timofey Titovets
2015-06-22 11:35 ` Timofey Titovets
0 siblings, 1 reply; 13+ messages in thread
From: Timofey Titovets @ 2015-05-26 19:51 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs

Oh, thanks for the advice, I'll capture and attach it.
So, as I understand it, behaviour like this is not expected. Good to know.

2015-05-26 22:49 GMT+03:00 Chris Murphy <lists@colorremedies.com>:
> Without a complete dmesg it's hard to say what's going on. The call
> trace alone probably doesn't show the instigating factor, so you may need
> to use remote ssh with journalctl -f, or use netconsole to
> continuously get kernel messages prior to the implosion.
>
> Chris Murphy

--
Have a nice day,
Timofey.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-05-26 19:51 ` Timofey Titovets
@ 2015-06-22 11:35 ` Timofey Titovets
2015-06-22 11:45 ` Timofey Titovets
2015-06-22 16:03 ` Chris Murphy
0 siblings, 2 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-06-22 11:35 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs

Okay, logs. I released disk /dev/sde1 and got:

Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5,sas_addr 0x5000cca00d0514bd
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0xffff880449541400)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffff87fa7ff53430
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: PGD 0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Oops: 0000 [#1] SMP
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc amdkfd amd_iommu_v2 radeon ttm drm_kms_helper ipmi_ssif coretemp gpio_ich drm kvm_intel serio_raw i5000_edac kvm ipmi_si lpc_ich edac_core ioatdma joydev i2c_algo_bit 8250_fintek mac_hid ipmi_msghandler i5k_amb dca shpchp bonding autofs4 btrfs ses enclosure raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx mptsas mptscsih xor hid_generic raid6_pq raid1 usbhid e1000e mptbase raid0 psmouse ptp hid multipath scsi_transport_sas pps_core linear
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CPU: 1 PID: 2411 Comm: kworker/u16:16 Not tainted 3.19.0-21-generic #21-Ubuntu
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: task: ffff8803ef8ae220 ti: ffff8803efbe4000 task.ti: ffff8803efbe4000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc04709d9>] [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff8803efbe79d8 EFLAGS: 00010287
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RAX: 0000000000010000 RBX: ffff88009a80dd00 RCX: 0000000000001533
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RDX: ffff87fa7ff53428 RSI: ffff88009a80dd70 RDI: 000000009a869e00
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RBP: ffff8803efbe7ab8 R08: 000000000000c000 R09: ffff88009a80dd00
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: R10: 000000000000c000 R11: 0000000000000002 R12: 000000009a869e00
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: R13: ffff880403566420 R14: ffff880448e20000 R15: 0000000000000001
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fc40000(0000) knlGS:0000000000000000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CR2: ffff87fa7ff53430 CR3: 000000034f878000 CR4: 00000000000407e0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Stack:
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: 0000000000000000 0000000000001000 ffff8803efbe7a28 000000000000c000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: 0000000000000001 0000000000010000 0000000015340000 0000000000000000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: 0000000000001534 0000000000001533 ffff880448e20dd0 0000000000000000
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0420afa>] ? btrfs_free_path+0x2a/0x40 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0476b5d>] btrfs_map_bio+0x7d/0x530 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0493982>] btrfs_submit_compressed_read+0x332/0x4d0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc044df51>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8137cc6e>] ? bio_add_page+0x5e/0x70
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc046be79>] ? btrfs_create_repair_bio+0xe9/0x110 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc046c38a>] end_bio_extent_readpage+0x4ea/0x5e0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc046bea0>] ? btrfs_create_repair_bio+0x110/0x110 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8137f1eb>] bio_endio+0x6b/0xa0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff811d5bce>] ? kmem_cache_free+0x1be/0x200
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8137f232>] bio_endio_nodec+0x12/0x20
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc0440f3f>] end_workqueue_fn+0x3f/0x50 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc047b4e2>] normal_work_helper+0xc2/0x2b0 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffffc047b7a2>] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff8108fc98>] process_one_work+0x158/0x430
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff810907db>] worker_thread+0x5b/0x530
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff81090780>] ? rescuer_thread+0x3a0/0x3a0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff81095879>] kthread+0xc9/0xe0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff817cae18>] ret_from_fork+0x58/0x90
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Code: 8d 68 ff ff ff 7e 3f 0f 1f 00 49 63 c4 4d 89 d0 41 83 c4 01 48 8d 04 40 48 83 c6 18 48 8d 14 c5 20 00 00 00 49 63 45 10 4c 01 ea <4c> 03 42 08 48 0f af c1 4c 01 c0 48 89 46 e8 48 8b 02 48 89 46
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: RSP <ffff8803efbe79d8>
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: CR2: ffff87fa7ff53430
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: ---[ end trace f8af5955ebefcf19 ]---
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffffffffffffffd8
Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffff81095f80>] kthread_data+0x10/0x20

--
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-06-22 11:35 ` Timofey Titovets
@ 2015-06-22 11:45 ` Timofey Titovets
2015-06-22 16:03 ` Chris Murphy
1 sibling, 0 replies; 13+ messages in thread
From: Timofey Titovets @ 2015-06-22 11:45 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs

And again, if I try:
echo 1 > /sys/block/sdf/device/delete

Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: ------------[ cut here ]------------
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: kernel BUG at /build/buildd/linux-3.19.0/fs/btrfs/extent_io.c:2056!
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: invalid opcode: 0000 [#1] SMP
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc ipmi_ssif amdkfd amd_iommu_v2 gpio_ich radeon ttm drm_kms_helper lpc_ich coretemp drm kvm_intel kvm i5000_edac i2c_algo_bit edac_core i5k_amb shpchp ipmi_si serio_raw 8250_fintek ioatdma dca joydev mac_hid ipmi_msghandler bonding autofs4 btrfs ses enclosure raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq hid_generic raid1 e1000e raid0 usbhid mptsas mptscsih multipath psmouse hid mptbase ptp scsi_transport_sas pps_core linear
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CPU: 0 PID: 1150 Comm: kworker/u16:12 Not tainted 3.19.0-21-generic #21-Ubuntu
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: task: ffff88044c603110 ti: ffff88044b4b8000 task.ti: ffff88044b4b8000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc043fa80>] [<ffffffffc043fa80>] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff88044b4bbba8 EFLAGS: 00010202
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RDX: 0000000000000000 RSI: ffff880449841b08 RDI: ffff880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RBP: ffff88044b4bbc08 R08: 0000000000109000 R09: ffff880449841a80
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R10: 0000000000009000 R11: 0000000000000002 R12: ffff8803fa878068
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: R13: ffff880448f5d000 R14: ffff88044cde8d28 R15: 0000000524f09000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fc00000(0000) knlGS:0000000000000000
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: CR2: 00007fdcef9cafb8 CR3: 0000000001c13000 CR4: 00000000000407f0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Stack:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: ffff880448f5d100 0000000000001000 000000004b4bbbd8 ffffea000fb66d40
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 0000000000007000 ffff880449841a80 ffff88044b4bbc08 ffff880439a44b58
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: 0000000000001000 ffff880448f5d000 ffff88044cde8d28 ffff88044cde8bf0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc043fd7c>] clean_io_failure+0x19c/0x1b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc04401b0>] end_bio_extent_readpage+0x310/0x5e0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff811d5795>] ? __slab_free+0xa5/0x320
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8101e74a>] ? native_sched_clock+0x2a/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8137f1eb>] bio_endio+0x6b/0xa0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff811d5bce>] ? kmem_cache_free+0x1be/0x200
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8137f232>] bio_endio_nodec+0x12/0x20
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc0414f3f>] end_workqueue_fn+0x3f/0x50 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc044f4e2>] normal_work_helper+0xc2/0x2b0 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffffc044f7a2>] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff8108fc98>] process_one_work+0x158/0x430
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff810907db>] worker_thread+0x5b/0x530
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff81090780>] ? rescuer_thread+0x3a0/0x3a0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff81095879>] kthread+0xc9/0xe0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff817cae18>] ret_from_fork+0x58/0x90
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: [<ffffffff810957b0>] ? kthread_create_on_node+0x1c0/0x1c0
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: Code: f4 fe ff ff 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 4c 89 e7 e8 e0 e4 f3 c0 41 b9 fb ff ff ff e9 d2 fe ff ff 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e7 e8 c0 e4 f3 c0 31 f6 4c 89 ef
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc043fa80>] repair_io_failure+0x1a0/0x220 [btrfs]
Jun 22 14:44:16 srv-lab-ceph-node-01 kernel: RSP <ffff88044b4bbba8>

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability
2015-06-22 11:35 ` Timofey Titovets
2015-06-22 11:45 ` Timofey Titovets
@ 2015-06-22 16:03 ` Chris Murphy
2015-06-22 16:36 ` Timofey Titovets
1 sibling, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2015-06-22 16:03 UTC (permalink / raw)
To: Timofey Titovets; +Cc: Chris Murphy, linux-btrfs

On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote:
> Okay, logs. I released disk /dev/sde1 and got:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

So what's up with this? Does this only happen after you try to (software)
remove /dev/sde1, or is it happening before that as well? Because this
looks like some kind of hardware problem, with the drive reporting
an error for a particular sector on read, as if it's a bad sector.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096

Again, the same sector as before. This is not a Btrfs error message; it's
coming from the block layer.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read

I'm not a dev, so take it with a grain of salt, but because this
references a logical block, this is the layer between Btrfs and the
physical device. Btrfs works on logical blocks, and those have to be
translated to a device and a physical sector. Maybe what's happening is
that there's confusion somewhere about this device not actually being
unavailable, so Btrfs or something else is trying to read this logical
block again, which causes a read attempt to happen instead of a flat-out
"this device doesn't exist" type of error. So I don't know whether this
is a problem strictly in Btrfs's missing-device error handling, or whether
there's something else that isn't really working correctly.

You could test by physically removing the device; if you have hot-plug
support (be certain all the hardware components support it), you can
see whether you get different results. Or you could try to reproduce the
software delete of the device with mdraid or lvm raid with XFS and no
Btrfs at all, and see whether you get different results.

It's known that the btrfs multiple-device failure use case is weak
right now. Data isn't lost, but the error handling, notification, all of
that is almost non-existent compared to mdadm.

> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB:
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096
> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, SubCode(0x0011) cb_idx mptscsih_io_done
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy 5,sas_addr 0x5000cca00d0514bd
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0xffff880449541400)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd)
> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, phy 5, sas_addr 0x5000cca00d0514bd

OK, it looks like not until here does it actually get deleted (?),
and then that results in piles of write errors to this device by btrfs:

> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1
> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0

So this makes sense in that it tries to write but can't, because the
device is now missing. So it's a case of Btrfs not doing very well at
handling a suddenly missing device, I think.

> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffff87fa7ff53430
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs]
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: PGD 0
> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Oops: 0000 [#1] SMP

And then the oops. Not good. So yes, it's definitely a Btrfs bug that it
oopses instead of gracefully handling the failure. The question is
whether (and what) other mitigating circumstances contribute to this bad
handling; there may be other bugs that instigate this.

I've tested this in a ridiculously rudimentary way (with USB drives),
just by yanking them during usage, and I don't get an oops. But I do get
piles of read and/or write errors, and it seems Btrfs never really
becomes aware of the fact that there's a missing device until there's a
remount or even a reboot. I haven't quantified what amount of data is
lost, but the file system itself still works degraded in this case with
the remaining drive (actually both drives work fine, but once they're
each written to separately with the degraded mount option, they can't be
rejoined; if you try it, serious fs corruption results.)

--
Chris Murphy

^ permalink raw reply [flat|nested] 13+ messages in thread
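The per-device error counters quoted above (wr/rd/flush/corrupt/gen) are the same ones btrfs exposes to userspace, so the state of each mirror can be checked without digging through dmesg. A minimal sketch, assuming the filesystem is mounted at a hypothetical /mnt:

```shell
# Print btrfs's persistent per-device I/O error counters (write, read,
# flush, corruption, and generation errors) for a mounted filesystem.
btrfs device stats /mnt

# A degraded RAID1 can be mounted read-write from the surviving member
# (device name hypothetical). Heed Chris's warning above: never mount
# both halves separately with -o degraded, or they diverge and cannot
# be safely rejoined afterwards.
mount -o degraded /dev/sdb1 /mnt
```

A non-zero and growing `wr` counter on one device, as in the logs above, is the userspace-visible symptom of the lost page writes.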
* Re: RAID1: system stability 2015-06-22 16:03 ` Chris Murphy @ 2015-06-22 16:36 ` Timofey Titovets 2015-06-22 16:52 ` Chris Murphy 0 siblings, 1 reply; 13+ messages in thread From: Timofey Titovets @ 2015-06-22 16:36 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs 2015-06-22 19:03 GMT+03:00 Chris Murphy <lists@colorremedies.com>: > On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote: >> Okay, logs, i did release disk /dev/sde1 and get: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >> 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >> error, dev sde, sector 287140096 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: >> LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED >> Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >> 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >> error, dev sde, sector 287140096 > > So what's up with this? This only happens after you try to (software) > remove /dev/sde1? Or is it happening also before that? Because this > looks like some kind of hardware problem when the drive is reporting > an error for a particular sector on read, as if it's a bad sector. Nope, i've physically remove device and as you see it's produce errors on block layer -.- and this disks have 100% 'health' Because it's hot-plug device, kernel see what device now missing and remove all kernel objects reletad to them. 
> >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096 > > Again same sector as before. This is not a Btrfs error message, it's > coming from the block layer. > > >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read > > I'm not a dev so take it with a grain of salt but because this > references a logical block, this is the layer in between Btrfs and the > physical device. Btrfs works on logical blocks and those have to be > translated to device and physical sector. Maybe what's happening is > there's confusion somewhere about this device not actually being > unavailable so Btrfs or something else is trying to read this logical > block again, which causes a read attempt to happen instead of a flat > out "this device doesn't exist" type of error. So I don't know if this > is a problem strictly in Btrfs missing device error handling, or if > there's something else that's not really working correctly. > > You could test by physically removing the device, if you have hot plug > support (be certain all the hardware components support it), you can > see if you get different results. 
Or you could try to reproduce the > software delete of the device with mdraid or lvm raid with XFS and no > Btrfs at all, and see if you get different results. > > It's known that the btrfs multiple device failure use case is weak > right now. Data isn't lost, but the error handling, notification, all > that is almost non-existent compared to mdadm. So sad -.- I tested this case with an md raid1, and the system continued working without problems when I released one of the two md devices. > >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096 >> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >> SubCode(0x0011) cb_idx mptscsih_io_done >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: end_device-0:0:6: mptsas: ioc0: removing ssp device: fw_channel 0, fw_id 16, phy >> 5,sas_addr 0x5000cca00d0514bd >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: phy-0:0:9: mptsas: ioc0: delete phy 5, phy-obj (0xffff880449541400) >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: port-0:0:6: mptsas: ioc0: delete port 6, sas_addr (0x5000cca00d0514bd) >> Jun 22 14:28:41 srv-lab-ceph-node-01 kernel: scsi target0:0:5: mptsas: ioc0: delete device: fw_channel 0, fw_id 16, 
phy 5, sas_addr >> 0x5000cca00d0514bd > > OK it looks like not until here does it actually get deleted (?) and > then that results in piles of write errors to this device by btrfs: > > >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1 >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: lost page write due to I/O error on /dev/sde1 >> Jun 22 14:28:44 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0 >> Jun 22 14:29:13 srv-lab-ceph-node-01 kernel: BTRFS: bdev /dev/sde1 errs: wr 13, rd 0, flush 0, corrupt 0, gen 0 > > So this makes sense in that it tries to write but can't because the > device is now missing. So it's a case of Btrfs not doing very well > handling suddenly missing device, I think. 
> > > >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BTRFS info (device md127): csum failed ino 1039 extent 390332416 csum 2059524288 wanted 343582415 mirror 0 >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: BUG: unable to handle kernel paging request at ffff87fa7ff53430 >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: IP: [<ffffffffc04709d9>] __btrfs_map_block+0x2d9/0x1180 [btrfs] >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: PGD 0 >> Jun 22 14:29:22 srv-lab-ceph-node-01 kernel: Oops: 0000 [#1] SMP > > > And then oops. Not good. So yeah it's definitely a Btrfs bug that it > oopses instead of gracefully handling the failure. The question is > whether (and what) other mitigating circumstances contribute to this > bad handling, there may be other bugs that instigate this. I've tested > this in a ridiculously rudimentary way (with USB drives) just by > hanging them during usage, and I don't get an oops. But I do get piles > of read and or write errors and it seems Btrfs never really becomes > aware of the fact there's a missing device until there's a remount or > even a reboot. I haven't quantified what amount of data is lost, but > the file system itself still works degraded in this case with the > remaining drive (actually both drives work fine, but once they're each > written to separately with degraded mount option, they can't be > rejoined together; if you try it, serious fs corruption results.) > > -- > Chris Murphy You're right about USB devices: they don't produce an oops. Maybe it's because the kernel uses different modules for SAS/SATA disks and USB sticks. -- Have a nice day, Timofey. ^ permalink raw reply [flat|nested] 13+ messages in thread
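As an aside for readers reproducing this: the wr/rd/flush/corrupt/gen counters quoted above are also exposed by `btrfs device stats`, which makes them easy to watch from a script. A minimal sketch (the mount point /mnt and the helper name `report_bad_devices` are illustrative assumptions, not from the thread):

```shell
# report_bad_devices reads "btrfs device stats" output on stdin and prints
# each device that has at least one non-zero error counter. Stats lines look
# like: [/dev/sde1].write_io_errs   13
report_bad_devices() {
    awk '$2 > 0 { split($1, a, "]"); seen[substr(a[1], 2)] = 1 }
         END { for (d in seen) print d }'
}

# Real use would be: btrfs device stats /mnt | report_bad_devices
# Demonstration with captured sample output:
printf '[/dev/sde1].write_io_errs 13\n[/dev/sdd1].write_io_errs 0\n' | report_bad_devices
```

Note that the counters are cumulative and persist until explicitly reset with `btrfs device stats -z`, so a non-zero value may reflect an old incident rather than a live failure.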
* Re: RAID1: system stability 2015-06-22 16:36 ` Timofey Titovets @ 2015-06-22 16:52 ` Chris Murphy 2015-07-22 11:00 ` Russell Coker 0 siblings, 1 reply; 13+ messages in thread From: Chris Murphy @ 2015-06-22 16:52 UTC (permalink / raw) To: Timofey Titovets; +Cc: Chris Murphy, linux-btrfs On Mon, Jun 22, 2015 at 10:36 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote: > 2015-06-22 19:03 GMT+03:00 Chris Murphy <lists@colorremedies.com>: >> On Mon, Jun 22, 2015 at 5:35 AM, Timofey Titovets <nefelim4ag@gmail.com> wrote: >>> Okay, logs, i did release disk /dev/sde1 and get: >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >>> 00 00 00 08 00 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >>> error, dev sde, sector 287140096 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: >>> LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >>> SubCode(0x0011) cb_idx mptscsih_io_done >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED >>> Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 >>> 00 00 00 08 00 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O >>> error, dev sde, sector 287140096 >> >> So what's up with this? This only happens after you try to (software) >> remove /dev/sde1? Or is it happening also before that? Because this >> looks like some kind of hardware problem when the drive is reporting >> an error for a particular sector on read, as if it's a bad sector. > > Nope, i've physically remove device and as you see it's produce errors > on block layer -.- > and this disks have 100% 'health' > > Because it's hot-plug device, kernel see what device now missing and > remove all kernel objects reletad to them. 
OK I actually don't know what the intended block layer behavior is when unplugging a device, if it is supposed to vanish, or change state somehow so that thing that depend on it can know it's "missing" or what. So the question here is, is this working as intended? If the layer Btrfs depends on isn't working as intended, then Btrfs is probably going to do wild and crazy things. And I don't know that the part of the block layer Btrfs depends on for this is the same (or different) as what the md driver depends on. > >> >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >>> SubCode(0x0011) cb_idx mptscsih_io_done >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: mptbase: ioc0: LogInfo(0x31010011): Originator={PL}, Code={Open Failure}, >>> SubCode(0x0011) cb_idx mptscsih_io_done >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: sd 0:0:5:0: [sde] CDB: >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Read(10): 28 00 11 1d 69 00 00 00 08 00 >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: blk_update_request: I/O error, dev sde, sector 287140096 >> >> Again same sector as before. This is not a Btrfs error message, it's >> coming from the block layer. >> >> >>> Jun 22 14:28:40 srv-lab-ceph-node-01 kernel: Buffer I/O error on dev sde1, logical block 35892256, async page read >> >> I'm not a dev so take it with a grain of salt but because this >> references a logical block, this is the layer in between Btrfs and the >> physical device. Btrfs works on logical blocks and those have to be >> translated to device and physical sector. 
Maybe what's happening is >> there's confusion somewhere about this device not actually being >> unavailable so Btrfs or something else is trying to read this logical >> block again, which causes a read attempt to happen instead of a flat >> out "this device doesn't exist" type of error. So I don't know if this >> is a problem strictly in Btrfs missing device error handling, or if >> there's something else that's not really working correctly. >> >> You could test by physically removing the device, if you have hot plug >> support (be certain all the hardware components support it), you can >> see if you get different results. Or you could try to reproduce the >> software delete of the device with mdraid or lvm raid with XFS and no >> Btrfs at all, and see if you get different results. >> >> It's known that the btrfs multiple device failure use case is weak >> right now. Data isn't lost, but the error handling, notification, all >> that is almost non-existent compared to mdadm. > > So sad -.- > i've test this test case with md raid1 and system continue work > without problem when i release one of two md device OK well then it's either a Btrfs bug or something it directly depends on that md does not. > You right about usb devices, it's not produce oops. > May be its because kernel use different modules for SAS/SATA disks and > usb sticks. They appear as sd devices on my system, so they're using libata and as such they ultimately still depend on the SCSI block layer. But there may be a very different kind of missing device error handling for USB that somehow makes its way up to libata differently than SAS/SATA hotplug. I'd say the oops is definitely a Btrfs bug. 
But it might also be worthwhile to post the kernel messages to the linux-scsi@ list, listing the hardware details (logic board, SAS/SATA card, drives) and of course the full kernel messages along with steps to reproduce, and see whether the fact that the device doesn't actually drop out, as it does with USB devices, is intended behavior. -- Chris Murphy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability 2015-06-22 16:52 ` Chris Murphy @ 2015-07-22 11:00 ` Russell Coker 2015-08-05 17:32 ` Austin S Hemmelgarn 0 siblings, 1 reply; 13+ messages in thread From: Russell Coker @ 2015-07-22 11:00 UTC (permalink / raw) To: Chris Murphy, linux-btrfs On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote: > OK I actually don't know what the intended block layer behavior is > when unplugging a device, if it is supposed to vanish, or change state > somehow so that thing that depend on it can know it's "missing" or > what. So the question here is, is this working as intended? If the > layer Btrfs depends on isn't working as intended, then Btrfs is > probably going to do wild and crazy things. And I don't know that the > part of the block layer Btrfs depends on for this is the same (or > different) as what the md driver depends on. I disagree with that statement. BTRFS should be expected to not do wild and crazy things regardless of what happens with block devices. A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning any manner of corrupted data and should not lose data or panic the kernel. A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by mounting read-only or failing all operations on the filesystem. It should not affect any other filesystem or have any significant impact on the system unless it's the root filesystem. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability 2015-07-22 11:00 ` Russell Coker @ 2015-08-05 17:32 ` Austin S Hemmelgarn 2015-08-05 19:00 ` Martin Steigerwald 0 siblings, 1 reply; 13+ messages in thread From: Austin S Hemmelgarn @ 2015-08-05 17:32 UTC (permalink / raw) To: Russell Coker, Chris Murphy, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2428 bytes --] On 2015-07-22 07:00, Russell Coker wrote: > On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote: >> OK I actually don't know what the intended block layer behavior is >> when unplugging a device, if it is supposed to vanish, or change state >> somehow so that thing that depend on it can know it's "missing" or >> what. So the question here is, is this working as intended? If the >> layer Btrfs depends on isn't working as intended, then Btrfs is >> probably going to do wild and crazy things. And I don't know that the >> part of the block layer Btrfs depends on for this is the same (or >> different) as what the md driver depends on. > > I disagree with that statement. BTRFS should be expected to not do wild and > crazy things regardless of what happens with block devices. I would generally agree with this, although we really shouldn't be doing things like trying to handle hardware failures without user intervention. If a block device disappears from under us, we should throw a warning and if it's the last device in the FS, kill anything that is trying to read or write to that FS. At the very least, we should try to avoid hanging or panicking the system if all of the devices in an FS disappear out from under us. > > A BTRFS RAID-1/5/6 array should cope with a single disk failing or returning > any manner of corrupted data and should not lose data or panic the kernel. It's debatable however whether the array should go read-only when degraded. MD/DM RAID (at least, AFAIK) and most hardware RAID controllers I've seen will still accept writes to degraded arrays, although there are arguments for forcing it read-only as well. 
Personally, I think that should be controlled by a mount option, so the sysadmin can decide, as it really is a policy decision. > > A BTRFS RAID-0 or single disk setup should cope with a disk giving errors by > mounting read-only or failing all operations on the filesystem. It should not > affect any other filesystem or have any significant impact on the system unless > it's the root filesystem. Or some other critical filesystem (there are still people who put /usr and/or /var on separate filesystems). Ideally, I'd love to see some some kind of warning from the kernel if a filesystem gets mounted that has the metadata/system profile set to raid0 (and possibly have some of the tools spit out such a warning also). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
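To make the degraded-mount policy discussion concrete, here is a sketch of the recovery flow the thread assumes after one raid1 member dies. The device names, devid, and mount point are hypothetical, and the commands are guarded by a dry-run flag since they are destructive on real hardware:

```shell
#!/bin/sh
# Degraded-recovery sketch for a two-device btrfs raid1 (hypothetical names).
# Keep DRY_RUN=1 to only print the commands; set DRY_RUN=0 on a real array.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

run mount -o degraded /dev/sdd1 /mnt      # mount using the surviving member
run btrfs replace start 2 /dev/sdg1 /mnt  # rebuild missing devid 2 onto the new disk
# Tool versions without replace support for a missing source would instead use:
# run btrfs device add /dev/sdg1 /mnt
# run btrfs device delete missing /mnt
```

Chris's earlier warning still applies: never mount each former mirror degraded and writable on its own, or the two halves can no longer be rejoined.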
* Re: RAID1: system stability 2015-08-05 17:32 ` Austin S Hemmelgarn @ 2015-08-05 19:00 ` Martin Steigerwald 0 siblings, 0 replies; 13+ messages in thread From: Martin Steigerwald @ 2015-08-05 19:00 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Russell Coker, Chris Murphy, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2491 bytes --] On Wednesday, 5 August 2015, 13:32:41, Austin S Hemmelgarn wrote: > On 2015-07-22 07:00, Russell Coker wrote: > > On Tue, 23 Jun 2015 02:52:43 AM Chris Murphy wrote: > >> OK I actually don't know what the intended block layer behavior is > >> when unplugging a device, if it is supposed to vanish, or change state > >> somehow so that thing that depend on it can know it's "missing" or > >> what. So the question here is, is this working as intended? If the > >> layer Btrfs depends on isn't working as intended, then Btrfs is > >> probably going to do wild and crazy things. And I don't know that the > >> part of the block layer Btrfs depends on for this is the same (or > >> different) as what the md driver depends on. > > > > I disagree with that statement. BTRFS should be expected to not do wild > > and crazy things regardless of what happens with block devices. > > I would generally agree with this, although we really shouldn't be doing > things like trying to handle hardware failures without user > intervention. If a block device disappears from under us, we should > throw a warning and if it's the last device in the FS, kill anything > that is trying to read or write to that FS. At the very least, we > should try to avoid hanging or panicking the system if all of the > devices in an FS disappear out from under us. The best solution I have ever seen for removable media is with AmigaOS. You remove a disk (or nowadays a USB stick) while it is being written to, and AmigaDOS/AmigaOS pops up a dialog window saying "You MUST insert volume $VOLUMENAME again". And if you do, it just continues writing. 
I bet this may be difficult to do on Linux for all devices, as unwritten changes pile up in memory until the dirty limits are reached, unless one says "Okay, disk gone, we block all processes writing to it immediately or quite soon", but for removable media I never saw anything else with that amount of sanity. There was a GSoC project for NetBSD once to implement this, but I don't know whether it's implemented there now. For AmigaOS and floppy disks with the filesystems of that time there was just one catch: if you didn't insert the disk again, it was often broken beyond repair. For a journaling or CoW filesystem it would just be like any other sudden stop to writes. On Linux with eSATA I saw I could also replug a disk if I hadn't yet hit the timeouts in the block layer. After that, the disk is gone. Ciao, -- Martin [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: RAID1: system stability @ 2015-06-17 9:20 Timofey Titovets 0 siblings, 0 replies; 13+ messages in thread From: Timofey Titovets @ 2015-06-17 9:20 UTC (permalink / raw) To: linux-btrfs Update: I tried removing the disk the 'right' way: # echo 1 > /sys/block/sdf/device/delete Everything was okay: the system didn't crash immediately on a 'sync' call and could keep working for some time without problems. But after a certain operation, which I can reproduce with: # apt-get update the test system (from which I deleted one of the raid1 btrfs devices) gets a kernel crash; I got the following dmesg: ---- Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc gpio_ich coretemp kvm_intel lpc_ich ipmi_ssif kvm amdkfd amd_iommu_v2 serio_raw radeon ttm i5000_edac drm_kms_helper drm edac_core i2c_algo_bit i5k_amb ioatdma dca shpchp 8250_fintek joydev mac_hid ipmi_si ipmi_msghandler bonding autofs4 btrfs xor raid6_pq ses enclosure hid_generic psmouse usbhid hid mptsas mptscsih e1000e mptbase scsi_transport_sas ptp pps_core Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CPU: 3 PID: 99 Comm: kworker/u16:4 Not tainted 4.0.4-040004-generic #201505171336 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: task: ffff88009ab31400 ti: ffff88009ab40000 task.ti: ffff88009ab40000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc0477d50>] [<ffffffffc0477d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff88009ab43bb8 EFLAGS: 00010206 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RAX: 0000000000000000 RBX: ffff88009b1d3f30 RCX: ffff88009b53f9c0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RDX: ffff88044902f400 RSI: 0000000000000000 RDI: ffff88009b53f9c0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RBP: ffff88009ab43c18 
R08: 0000000000000000 R09: 0000000000000000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R10: ffff880448c1b090 R11: 0000000000000000 R12: 0000000039070000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R13: ffff880439599e68 R14: 0000000000001000 R15: ffff88009a860000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fcc0000(0000) knlGS:0000000000000000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CR2: 00007f640a27e675 CR3: 0000000098b4b000 CR4: 00000000000407e0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Stack: Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: 0000000000000000 000000009a860de0 ffffea0002644380 00000003d2ee8000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: 0000000000008000 ffff88009b53f9c0 ffff88009ab43c18 ffff88009b1d3f30 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: ffff88044c44a3c0 ffff88009b0c1190 0000000000000000 ffff88009a860000 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Call Trace: Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0477f30>] clean_io_failure+0x1a0/0x1b0 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0478218>] end_bio_extent_readpage+0x2d8/0x3d0 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff8137b2c3>] bio_endio+0x53/0xa0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff8137b322>] bio_endio_nodec+0x12/0x20 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc044efb8>] end_workqueue_fn+0x48/0x60 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0488b2e>] normal_work_helper+0x7e/0x1b0 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffffc0488d32>] btrfs_endio_helper+0x12/0x20 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81092204>] process_one_work+0x144/0x490 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81092c6e>] worker_thread+0x11e/0x450 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81092b50>] ? 
create_worker+0x1f0/0x1f0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff81098999>] kthread+0xc9/0xe0 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? flush_kthread_worker+0x90/0x90 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff817f08d8>] ret_from_fork+0x58/0x90 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? flush_kthread_worker+0x90/0x90 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Code: 44 00 00 4c 89 ef e8 b0 34 f0 c0 31 f6 4c 89 e7 e8 06 05 01 00 ba fb ff ff ff e9 c7 fe ff ff ba fb ff ff ff e9 bd fe ff ff 0f 0b <0f> 0b 49 8b 4c 24 30 48 8b b3 58 fe ff ff 48 83 c1 10 48 85 f6 Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc0477d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP <ffff88009ab43bb8> Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: ---[ end trace 0361c6fdca5f7ee2 ]--- --- Another test case: I deleted the device with: echo 1 > /sys/block/sdf/device/delete and then reinserted it (removed it and inserted it into the server again). The server found it as a new device, sdg, and everything seemed okay, but the kernel crashed with the following stack trace: --- Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: kernel BUG at /home/kernel/COD/linux/fs/btrfs/extent_io.c:2057! 
Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: invalid opcode: 0000 [#1] SMP Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc gpio_ich coretemp kvm_intel amdkfd amd_iommu_v2 ipmi_ssif kvm radeon lpc_ich serio_raw ttm i5000_edac edac_core drm_kms_helper drm i5k_amb ioatdma i2c_algo_bit joydev 8250_fintek ipmi_si dca ipmi_msghandler mac_hid shpchp bonding autofs4 btrfs xor raid6_pq ses enclosure hid_generic psmouse mptsas usbhid mptscsih hid mptbase scsi_transport_sas e1000e ptp pps_core Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: CPU: 2 PID: 72 Comm: kworker/u16:2 Not tainted 4.0.4-040004-generic #201505171336 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: task: ffff88044d215a00 ti: ffff880449b1c000 task.ti: ffff880449b1c000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RIP: 0010:[<ffffffffc02a9d50>] [<ffffffffc02a9d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RSP: 0018:ffff880449b1fbb8 EFLAGS: 00010206 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RAX: 0000000000000000 RBX: ffff88044c3ac308 RCX: ffff88044c5ef3c0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RDX: ffff880449117400 RSI: 0000000000000000 RDI: ffff88044c5ef3c0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RBP: ffff880449b1fc18 R08: 0000000000000000 R09: 0000000000000000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: R10: ffff880448ce0090 R11: 0000000000000000 R12: 000000003999a000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: R13: ffff88043999a568 R14: 0000000000001000 R15: ffff880449510000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: FS: 0000000000000000(0000) GS:ffff88045fc80000(0000) knlGS:0000000000000000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 
000000008005003b Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: CR2: 00007fbfbe12cf00 CR3: 0000000449b4e000 CR4: 00000000000407e0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Stack: Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: 0000000000000000 0000000049510de0 ffffea0010f40540 00000003f7ed4000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: 000000000000c000 ffff88044c5ef3c0 ffff880449b1fc18 ffff88044c3ac308 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: ffff88044b1acc80 ffff880448dcbfa0 0000000000000000 ffff880449510000 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Call Trace: Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02a9f30>] clean_io_failure+0x1a0/0x1b0 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02aa218>] end_bio_extent_readpage+0x2d8/0x3d0 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff8137b2c3>] bio_endio+0x53/0xa0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff8137b322>] bio_endio_nodec+0x12/0x20 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc0280fb8>] end_workqueue_fn+0x48/0x60 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02bab2e>] normal_work_helper+0x7e/0x1b0 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffffc02bad32>] btrfs_endio_helper+0x12/0x20 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81092204>] process_one_work+0x144/0x490 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81092c6e>] worker_thread+0x11e/0x450 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81092b50>] ? create_worker+0x1f0/0x1f0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff81098999>] kthread+0xc9/0xe0 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? flush_kthread_worker+0x90/0x90 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff817f08d8>] ret_from_fork+0x58/0x90 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: [<ffffffff810988d0>] ? 
flush_kthread_worker+0x90/0x90 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: Code: 44 00 00 4c 89 ef e8 b0 14 0d c1 31 f6 4c 89 e7 e8 06 05 01 00 ba fb ff ff ff e9 c7 fe ff ff ba fb ff ff ff e9 bd fe ff ff 0f 0b <0f> 0b 49 8b 4c 24 30 48 8b b3 58 fe ff ff 48 83 c1 10 48 85 f6 Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RIP [<ffffffffc02a9d50>] repair_io_failure+0x1c0/0x200 [btrfs] Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: RSP <ffff880449b1fbb8> Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: ---[ end trace 90ec36112ab1f744 ]--- P.S. I'm just thinking about the case where I have two disk slots in a server and want to replace one disk that has failed (overheated, simply 'burned out', or whatever) without server downtime. -- Have a nice day, Timofey. ^ permalink raw reply [flat|nested] 13+ messages in thread
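For reference, the removal/reinsertion procedure used in this test can be scripted roughly as follows. The disk name sdf and SCSI host number host0 are assumptions that differ per machine, so the destructive sysfs writes are guarded by a dry-run flag:

```shell
#!/bin/sh
# Sketch of the hot-removal test above. DESTRUCTIVE with DRY_RUN=0: it drops
# a disk out of the SCSI layer. Device and host names are placeholders.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else sh -c "$*"; fi; }

run 'echo 1 > /sys/block/sdf/device/delete'   # software-remove the disk
run 'sync'                                    # force I/O against the now-missing mirror
# After physically reseating the disk, rescan the controller; the kernel may
# bring it back under a new name (sdg in the report above):
run 'echo "- - -" > /sys/class/scsi_host/host0/scan'
```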
end of thread, other threads:[~2015-08-05 19:00 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-05-26 11:23 RAID1: system stability Timofey Titovets 2015-05-26 19:31 ` Timofey Titovets 2015-05-26 19:49 ` Chris Murphy 2015-05-26 19:51 ` Timofey Titovets 2015-06-22 11:35 ` Timofey Titovets 2015-06-22 11:45 ` Timofey Titovets 2015-06-22 16:03 ` Chris Murphy 2015-06-22 16:36 ` Timofey Titovets 2015-06-22 16:52 ` Chris Murphy 2015-07-22 11:00 ` Russell Coker 2015-08-05 17:32 ` Austin S Hemmelgarn 2015-08-05 19:00 ` Martin Steigerwald -- strict thread matches above, loose matches on Subject: below -- 2015-06-17 9:20 Timofey Titovets