* Global hotspare functionality @ 2016-03-18 19:39 Yauhen Kharuzhy 2016-03-19 1:17 ` Yauhen Kharuzhy 2016-03-29 14:41 ` Anand Jain 0 siblings, 2 replies; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-03-18 19:39 UTC (permalink / raw) To: linux-btrfs; +Cc: Anand Jain Hi all, I try to get Anand's patchset for global hotspare functionality working. Now it's working for me but I have met number of issues while applying and patches testing. I took latest versions of patchset and its dependencies (latest at two weeks ago): 1) Anand's hotspare patchset: http://thread.gmane.org/gmane.comp.file-systems.btrfs/49985 2) Device delete by id series: http://thread.gmane.org/gmane.comp.file-systems.btrfs/53208 3) Two Anand's patches about sysfs attributes (hotspare series seems to be depended on it): http://thread.gmane.org/gmane.comp.file-systems.btrfs/48943 My kernel is 4.4.5 stable version (I had tried integration-4.6 branch of btrfs-next first and had same troubles as for 4.4.5). So, good result: hotspare functionality works! Bad result: it works for me after some patching only :) General notice: we are definitely need FS-specific hotspares, because common case is to have few RAID with different drives size (system root and data RAIDs, for instance). I have published my git tree with working set of patches here: https://bitbucket.org/jekhor/linux-btrfs/branch/4.4.5%2Bhotspare-without_degradable_check And corresponding btrfs-progs tree: https://bitbucket.org/jekhor/btrfs-progs/commits/branch/devel-hotspare This trees contain some RAID state monitoring related changes, just ignore them (I am going to start another discussion about of RAID status monitoring soon). Issue 1. First, kernel oopsed at FS mounting after unmounting. Unfortunately, I don't have saved logs for this. I found that fsid_kobj was corrupted (has NULL ktype field) before invocation of btrfs_sysfs_add_fsid(). I cannot found the source of corruption – no 'kobject release' events before, state_initialized field remains true, ktype just is cleaned (btrfs_ktype.release() wasn't called before this too). My printk-based trace looks like this but exactly place of value changing was not permanent, so this is can be some kind of race condition: Mar 11 01:07:31 grack12 kernel: [ 33.694074] btrfs_commit_transaction:2133: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 Mar 11 01:07:31 grack12 kernel: [ 33.697967] btrfs_commit_transaction:2142: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 Mar 11 01:07:31 grack12 kernel: [ 33.697972] write_all_supers:3672: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 Mar 11 01:07:31 grack12 kernel: [ 33.697973] write_all_supers:3677: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 Mar 11 01:07:31 grack12 kernel: [ 33.697974] write_all_supers:3679: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 Mar 11 01:07:31 grack12 kernel: [ 33.702881] write_all_supers:3690: fsid_kobj=ffff88001f020cd8, ktype= (null) Mar 11 01:07:31 grack12 kernel: [ 33.702884] write_all_supers:3699: fsid_kobj=ffff88001f020cd8, ktype= (null) Mar 11 01:07:31 grack12 kernel: [ 33.702885] write_all_supers:3701: fsid_kobj=ffff88001f020cd8, ktype= (null) Bisecting pointed me to simple commit 'b0f398c btrfs: optimize btrfs_check_degradable() for calls outside of barrier' but I have no idea how it may cause or trigger this issue... So, after spending some time for debugging, I decided to remove second patchset entirely except of 'btrfs: create a helper function to read the disk super' commit and problem had gone out. Issue 2. At start of autoreplacig drive by hotspare, kernel craches in transaction handling code (inside of btrfs_commit_transaction() called by autoreplace initiating routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master (oops text is attached also). Bdev is closed after replacing by btrfs_dev_replace_finishing(), so this is safe but doesn't seem to be right way. Issue 3. btrfs_auto_replace_start() doesn't check and doesn't set the fs_info->mutually_exclusive_operation_running flag as ioctl handler for DEV_REPLACE_START does, this cause race conditions in some cases, see https://bitbucket.org/jekhor/linux-btrfs/commits/834bebb96a2f6b5ef5856836839e5ce7830ec745?at=master Issue 4. Autoreplacement code doesn't start replacing at mounting in degraded mode, even if hotspare exists. We need this feature, so I added check for missing drives also, not only for failed, to checking if replacement needed. See https://bitbucket.org/jekhor/linux-btrfs/commits/4c9ddb58d979ae5a232aeaa1fbe3d26373210768?at=master and https://bitbucket.org/jekhor/linux-btrfs/commits/be5e2524c10f2b4047da80f9f85b54c6006d4273?at=master -- Yauhen Kharuzhy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-18 19:39 Global hotspare functionality Yauhen Kharuzhy @ 2016-03-19 1:17 ` Yauhen Kharuzhy 2016-03-29 14:43 ` Anand Jain 2016-03-29 14:41 ` Anand Jain 1 sibling, 1 reply; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-03-19 1:17 UTC (permalink / raw) To: linux-btrfs; +Cc: Anand Jain Issue 5: Race between close_ctree() and casualty_kthread(): close_ctree(): if (fs_info->casualty_kthread) kthread_stop(fs_info->casualty_kthread); casualty_kthread(): out: fs_info->casualty_kthread = NULL; At SMP system, kthread_stop() argument can be changed to NULL after check in if (). Some kind of synchronization is needed. Oops: [48878.674314] BUG: unable to handle kernel paging request at 0000000000001347 [48878.686252] IP: [<ffffffff810a73a4>] kthread_stop+0x54/0x280 [48878.687530] PGD 5f2c6067 PUD 62c63067 PMD 0 [48878.688780] Oops: 0002 [#1] SMP [48878.689701] Modules linked in: cpuid cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg ansi_cprng iTCO_wdt iTCO_vendor_support snd_pcm snd_timer snd soundcore aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper lpc_ich cryptd mfd_core psmouse serio_raw 8250_fintek parport_pc evdev parport video pcspkr ac battery acpi_cpufreq tpm_tis tpm processor i2c_piix4 rng_core button btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci libahci libata pcnet32 crc32c_intel scsi_mod mii [48879.282797] CPU: 1 PID: 24369 Comm: umount Not tainted 4.4.5-scst31x+ #16 [48879.283976] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [48879.285532] task: ffff880036438440 ti: ffff88006242c000 task.ti: ffff88006242c000 [48879.288022] RIP: 0010:[<ffffffff810a73a4>] [<ffffffff810a73a4>] kthread_stop+0x54/0x280 [48879.289681] RSP: 0018:ffff88006242fd78 EFLAGS: 00010206 [48879.290678] RAX: 0000000000000001 RBX: 00000000000013af RCX: 0000000000000003 [48879.291889] RDX: 0000000000000000 RSI: 0000000000001347 RDI: 0000000000000246 [48879.293096] RBP: ffff88006242fd90 R08: ffff88006242fd38 R09: 0000000000000000 [48879.449616] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88002642e840 [48879.494970] R13: ffff880036438440 R14: 0000000000000000 R15: ffff880048b7f000 [48879.496219] FS: 00007f95236c27e0(0000) GS:ffff880066b00000(0000) knlGS:0000000000000000 [48879.497767] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [48879.498806] CR2: 0000000000001347 CR3: 000000005ed74000 CR4: 00000000000406e0 [48879.500027] Stack: [48879.500603] ffff880027f4a2d0 ffff880027f48000 ffff880036438440 ffff88006242fdf0 [48879.502516] ffffffffa016713d ffff88006242fe00 ffffffff81239ff5 ffff880039cf2000 [48879.505658] ffff880039cf2a40 ffff880062430000 ffff880039cf2000 ffffffffa01edae0 [48879.507584] Call Trace: [48879.508239] [<ffffffffa016713d>] close_ctree+0x13d/0x390 [btrfs] [48879.509328] [<ffffffff81239ff5>] ? evict_inodes+0x165/0x180 [48879.510360] [<ffffffffa0135f99>] btrfs_put_super+0x19/0x20 [btrfs] [48879.511467] [<ffffffff8121d0ff>] generic_shutdown_super+0x6f/0xf0 [48879.512570] [<ffffffff8121d3f2>] kill_anon_super+0x12/0x20 [48879.594205] [<ffffffffa0136a88>] btrfs_kill_super+0x18/0x120 [btrfs] [48879.714563] [<ffffffff8121d66e>] deactivate_locked_super+0x3e/0x70 [48879.715691] [<ffffffff8121dabc>] deactivate_super+0x5c/0x60 [48879.716720] [<ffffffff8123e07f>] cleanup_mnt+0x3f/0x90 [48879.717695] [<ffffffff8123e112>] __cleanup_mnt+0x12/0x20 [48879.718695] [<ffffffff810a5423>] task_work_run+0x73/0xa0 [48879.719692] [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0 [48879.720773] [<ffffffff81003e0c>] syscall_return_slowpath+0xcc/0xe0 [48879.721880] [<ffffffff81634022>] int_ret_from_sys_call+0x25/0x9f [48879.722962] Code: 85 c0 0f 85 90 01 00 00 65 ff 0d 48 4b f6 7e f0 41 ff 44 24 10 49 8b 9c 24 38 04 00 00 48 85 db 74 26 48 89 de 48 83 ee 68 74 1d <f0> 80 4b 98 02 4c 89 e7 e8 ff fe ff ff 4c 89 e7 e8 a7 ca 00 00 [48879.891283] RIP [<ffffffff810a73a4>] kthread_stop+0x54/0x280 [48880.019550] RSP <ffff88006242fd78> [48880.091281] CR2: 0000000000001347 [48880.133312] ---[ end trace 8f0228d92754702e ]--- ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-19 1:17 ` Yauhen Kharuzhy @ 2016-03-29 14:43 ` Anand Jain 0 siblings, 0 replies; 18+ messages in thread From: Anand Jain @ 2016-03-29 14:43 UTC (permalink / raw) To: Yauhen Kharuzhy, linux-btrfs On 03/19/2016 09:17 AM, Yauhen Kharuzhy wrote: > Issue 5: > > Race between close_ctree() and casualty_kthread(): This is fixed in V2, thanks for the report. - Anand > close_ctree(): > if (fs_info->casualty_kthread) > kthread_stop(fs_info->casualty_kthread); > > casualty_kthread(): > out: > fs_info->casualty_kthread = NULL; > > > At SMP system, kthread_stop() argument can be changed to NULL after > check in if (). Some kind of synchronization is needed. > > Oops: > > [48878.674314] BUG: unable to handle kernel paging request at 0000000000001347 > [48878.686252] IP: [<ffffffff810a73a4>] kthread_stop+0x54/0x280 > [48878.687530] PGD 5f2c6067 PUD 62c63067 PMD 0 > [48878.688780] Oops: 0002 [#1] SMP > [48878.689701] Modules linked in: cpuid cpufreq_powersave > cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd > auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc > ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul > sha256_ssse3 sha256_generic hmac drbg ansi_cprng iTCO_wdt > iTCO_vendor_support snd_pcm snd_timer snd soundcore aesni_intel > aes_x86_64 lrw gf128mul glue_helper ablk_helper lpc_ich cryptd > mfd_core psmouse serio_raw 8250_fintek parport_pc evdev parport video > pcspkr ac battery acpi_cpufreq tpm_tis tpm processor i2c_piix4 > rng_core button btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci > libahci libata pcnet32 crc32c_intel scsi_mod mii > [48879.282797] CPU: 1 PID: 24369 Comm: umount Not tainted 4.4.5-scst31x+ #16 > [48879.283976] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS > VirtualBox 12/01/2006 > [48879.285532] task: ffff880036438440 ti: ffff88006242c000 task.ti: > ffff88006242c000 > [48879.288022] RIP: 0010:[<ffffffff810a73a4>] [<ffffffff810a73a4>] > kthread_stop+0x54/0x280 > [48879.289681] RSP: 0018:ffff88006242fd78 EFLAGS: 00010206 > [48879.290678] RAX: 0000000000000001 RBX: 00000000000013af RCX: 0000000000000003 > [48879.291889] RDX: 0000000000000000 RSI: 0000000000001347 RDI: 0000000000000246 > [48879.293096] RBP: ffff88006242fd90 R08: ffff88006242fd38 R09: 0000000000000000 > [48879.449616] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88002642e840 > [48879.494970] R13: ffff880036438440 R14: 0000000000000000 R15: ffff880048b7f000 > [48879.496219] FS: 00007f95236c27e0(0000) GS:ffff880066b00000(0000) > knlGS:0000000000000000 > [48879.497767] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [48879.498806] CR2: 0000000000001347 CR3: 000000005ed74000 CR4: 00000000000406e0 > [48879.500027] Stack: > [48879.500603] ffff880027f4a2d0 ffff880027f48000 ffff880036438440 > ffff88006242fdf0 > [48879.502516] ffffffffa016713d ffff88006242fe00 ffffffff81239ff5 > ffff880039cf2000 > [48879.505658] ffff880039cf2a40 ffff880062430000 ffff880039cf2000 > ffffffffa01edae0 > [48879.507584] Call Trace: > [48879.508239] [<ffffffffa016713d>] close_ctree+0x13d/0x390 [btrfs] > [48879.509328] [<ffffffff81239ff5>] ? evict_inodes+0x165/0x180 > [48879.510360] [<ffffffffa0135f99>] btrfs_put_super+0x19/0x20 [btrfs] > [48879.511467] [<ffffffff8121d0ff>] generic_shutdown_super+0x6f/0xf0 > [48879.512570] [<ffffffff8121d3f2>] kill_anon_super+0x12/0x20 > [48879.594205] [<ffffffffa0136a88>] btrfs_kill_super+0x18/0x120 [btrfs] > [48879.714563] [<ffffffff8121d66e>] deactivate_locked_super+0x3e/0x70 > [48879.715691] [<ffffffff8121dabc>] deactivate_super+0x5c/0x60 > [48879.716720] [<ffffffff8123e07f>] cleanup_mnt+0x3f/0x90 > [48879.717695] [<ffffffff8123e112>] __cleanup_mnt+0x12/0x20 > [48879.718695] [<ffffffff810a5423>] task_work_run+0x73/0xa0 > [48879.719692] [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0 > [48879.720773] [<ffffffff81003e0c>] syscall_return_slowpath+0xcc/0xe0 > [48879.721880] [<ffffffff81634022>] int_ret_from_sys_call+0x25/0x9f > [48879.722962] Code: 85 c0 0f 85 90 01 00 00 65 ff 0d 48 4b f6 7e f0 > 41 ff 44 24 10 49 8b 9c 24 38 04 00 00 48 85 db 74 26 48 89 de 48 83 > ee 68 74 1d <f0> 80 4b 98 02 4c 89 e7 e8 ff fe ff ff 4c 89 e7 e8 a7 ca > 00 00 > [48879.891283] RIP [<ffffffff810a73a4>] kthread_stop+0x54/0x280 > [48880.019550] RSP <ffff88006242fd78> > [48880.091281] CR2: 0000000000001347 > [48880.133312] ---[ end trace 8f0228d92754702e ]--- > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-18 19:39 Global hotspare functionality Yauhen Kharuzhy 2016-03-19 1:17 ` Yauhen Kharuzhy @ 2016-03-29 14:41 ` Anand Jain 2016-03-29 19:24 ` Yauhen Kharuzhy ` (2 more replies) 1 sibling, 3 replies; 18+ messages in thread From: Anand Jain @ 2016-03-29 14:41 UTC (permalink / raw) To: Yauhen Kharuzhy, linux-btrfs Hi Yauhen, Thanks ! more below.. On 03/19/2016 03:39 AM, Yauhen Kharuzhy wrote: > Hi all, > > I try to get Anand's patchset for global hotspare functionality working. > Now it's working for me but I have met number of issues while applying > and patches testing. > > I took latest versions of patchset and its dependencies (latest at two > weeks ago): > > 1) Anand's hotspare patchset: > http://thread.gmane.org/gmane.comp.file-systems.btrfs/49985 > 2) Device delete by id series: > http://thread.gmane.org/gmane.comp.file-systems.btrfs/53208 V2 is sent out based on cleanups-4.6 branch. Plus preparatory patch found in the ML. fc72066b27f3 btrfs: refactor btrfs_dev_replace_start for reuse 0b2322126a95 btrfs: keep sysfs target add in the last 3ecbc05149e0 btrfs: use fs_info directly > 3) Two Anand's patches about sysfs attributes (hotspare series seems to be > depended on it): > http://thread.gmane.org/gmane.comp.file-systems.btrfs/48943 The sysfs patches we need it only to see the device state or some enterprise scripts may need it. But auto replace hot spare as such don't depend on that. > My kernel is 4.4.5 stable version (I had tried integration-4.6 branch > of btrfs-next first and had same troubles as for 4.4.5). Wiki needs an update, pls don't use btrfs-next. > So, good result: hotspare functionality works! Thanks for testing. > Bad result: it works for me after some patching only :) Thanks for working on it. Let me review. > General notice: we are definitely need FS-specific hotspares, because > common case is to have few RAID with different drives size (system root and > data RAIDs, for instance). Yep. > I have published my git tree with working set of patches here: > https://bitbucket.org/jekhor/linux-btrfs/branch/4.4.5%2Bhotspare-without_degradable_check > And corresponding btrfs-progs tree: > https://bitbucket.org/jekhor/btrfs-progs/commits/branch/devel-hotspare Hm, generally posting the independent patches to the ML will help. > This trees contain some RAID state monitoring related changes, just ignore > them (I am going to start another discussion about of RAID status monitoring > soon). > > > Issue 1. > > First, kernel oopsed at FS mounting after unmounting. Unfortunately, I > don't have saved logs for this. I found that fsid_kobj was corrupted (has > NULL ktype field) before invocation of btrfs_sysfs_add_fsid(). I cannot > found the source of corruption – no 'kobject release' events before, > state_initialized field remains true, ktype just is cleaned > (btrfs_ktype.release() wasn't called before this too). > > My printk-based trace looks like this but exactly place of value changing > was not permanent, so this is can be some kind of race condition: > > Mar 11 01:07:31 grack12 kernel: [ 33.694074] btrfs_commit_transaction:2133: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 > Mar 11 01:07:31 grack12 kernel: [ 33.697967] btrfs_commit_transaction:2142: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 > Mar 11 01:07:31 grack12 kernel: [ 33.697972] write_all_supers:3672: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 > Mar 11 01:07:31 grack12 kernel: [ 33.697973] write_all_supers:3677: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 > Mar 11 01:07:31 grack12 kernel: [ 33.697974] write_all_supers:3679: fsid_kobj=ffff88001f020cd8, ktype=ffffffffa0219840 > Mar 11 01:07:31 grack12 kernel: [ 33.702881] write_all_supers:3690: fsid_kobj=ffff88001f020cd8, ktype= (null) > Mar 11 01:07:31 grack12 kernel: [ 33.702884] write_all_supers:3699: fsid_kobj=ffff88001f020cd8, ktype= (null) > Mar 11 01:07:31 grack12 kernel: [ 33.702885] write_all_supers:3701: fsid_kobj=ffff88001f020cd8, ktype= (null) > > Bisecting pointed me to simple commit 'b0f398c btrfs: optimize > btrfs_check_degradable() for calls outside of barrier' but I have no idea how > it may cause or trigger this issue... dev is stale here, my bad, that was a crap patch. Also we don't need this patch as part of hot spare / auto replace code. I have removed it. > So, after spending some time for debugging, I decided to remove second > patchset entirely except of 'btrfs: create a helper function to read the disk > super' commit and problem had gone out. > > > Issue 2. > At start of autoreplacig drive by hotspare, kernel craches in transaction > handling code (inside of btrfs_commit_transaction() called by autoreplace initiating > routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see > https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master > (oops text is attached also). Bdev is closed after replacing by > btrfs_dev_replace_finishing(), so this is safe but doesn't seem > to be right way. I have sent out V2. I don't see that issue with this, could you pls try ? > Issue 3. > btrfs_auto_replace_start() doesn't check and doesn't set the > fs_info->mutually_exclusive_operation_running flag as ioctl handler for > DEV_REPLACE_START does, this cause race conditions in some cases, see > https://bitbucket.org/jekhor/linux-btrfs/commits/834bebb96a2f6b5ef5856836839e5ce7830ec745?at=master There were some fixes to the main btrfs_auto_replace_start() before, (not the v2). So to avoid such a disconnect, I have sent out a patch set which shall not v2 the function, instead it re-factors the original btrfs: refactor btrfs_dev_replace_start for reuse With this the hot spare V2 will apply nicely, and I have found it to be stable. > Issue 4. > Autoreplacement code doesn't start replacing at mounting in degraded mode, > even if hotspare exists. We need this feature, so I added check for missing > drives also, not only for failed, to checking if replacement needed. No. No. No please don't do that, it would lead to trouble in handing slow devices. I purposely didn't do it. Also kindly note that, in volume manage / storage context things should continue to work in degraded mode automatically, and it shouldn't wait for user's opinion. If it don't do that, then there is no point in having a volume manager. But as of now btrfs has already made degraded as non default choice. There is something else new which is needed and it can be a separate RFC, not part of this patch set. Please try. V2 sent out. Thanks, Anand > See > https://bitbucket.org/jekhor/linux-btrfs/commits/4c9ddb58d979ae5a232aeaa1fbe3d26373210768?at=master > and > https://bitbucket.org/jekhor/linux-btrfs/commits/be5e2524c10f2b4047da80f9f85b54c6006d4273?at=master ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 14:41 ` Anand Jain @ 2016-03-29 19:24 ` Yauhen Kharuzhy 2016-03-29 19:59 ` Austin S. Hemmelgarn 2016-03-29 19:40 ` Yauhen Kharuzhy 2016-03-29 19:47 ` Yauhen Kharuzhy 2 siblings, 1 reply; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-03-29 19:24 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: > > >Issue 2. > >At start of autoreplacig drive by hotspare, kernel craches in transaction > >handling code (inside of btrfs_commit_transaction() called by autoreplace initiating > >routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see > >https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master > >(oops text is attached also). Bdev is closed after replacing by > >btrfs_dev_replace_finishing(), so this is safe but doesn't seem > >to be right way. > > I have sent out V2. I don't see that issue with this, > could you pls try ? Sure. In progresss now. > >Issue 4. > >Autoreplacement code doesn't start replacing at mounting in degraded mode, > >even if hotspare exists. We need this feature, so I added check for missing > >drives also, not only for failed, to checking if replacement needed. > > No. No. No please don't do that, it would lead to trouble in handing > slow devices. I purposely didn't do it. Hmm. Can you explain please? Sometimes admins may want to have autoreplacement working automatically if drive was failed and removed before unmounting and remounting again. The simplest way to achieve this — add spare and always mount FS with 'degraded' option (we need to use this option in any case if we have root fs on RAID, for instance, to avoiding non-bootable state). So, if the autoreplacement code will check for missing drives also, this will working without user intervention. To allow user to decide if he wants autoreplacement, we can add mount option like '(no)hotspare' (I have done this already for our project and will send patch after rebasing onto your new series). Yes, there are side effects exists if you want to make some experiments with missing drives in FS, but you can disable autoreplacement for such case. If you know about any pitfalls in such scenarios, please point me to them, I am newbie in FS-related kernel things. > Also kindly note that, in volume manage / storage context things > should continue to work in degraded mode automatically, and it > shouldn't wait for user's opinion. If it don't do that, then > there is no point in having a volume manager. But as of now btrfs has > already made degraded as non default choice. There is something else > new which is needed and it can be a separate RFC, not part of this > patch set. > > > Please try. V2 sent out. OK, I am going to testing now. -- Yauhen Kharuzhy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 19:24 ` Yauhen Kharuzhy @ 2016-03-29 19:59 ` Austin S. Hemmelgarn 2016-03-29 20:26 ` Chris Murphy 0 siblings, 1 reply; 18+ messages in thread From: Austin S. Hemmelgarn @ 2016-03-29 19:59 UTC (permalink / raw) To: Yauhen Kharuzhy, Anand Jain; +Cc: linux-btrfs On 2016-03-29 15:24, Yauhen Kharuzhy wrote: > On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: >> >> No. No. No please don't do that, it would lead to trouble in handing >> slow devices. I purposely didn't do it. > > Hmm. Can you explain please? Sometimes admins may want to have > autoreplacement working automatically if drive was failed and removed > before unmounting and remounting again. The simplest way to achieve this — > add spare and always mount FS with 'degraded' option (we need to use > this option in any case if we have root fs on RAID, for instance, to > avoiding non-bootable state). So, if the autoreplacement code will check for > missing drives also, this will working without user intervention. To > allow user to decide if he wants autoreplacement, we can add mount > option like '(no)hotspare' (I have done this already for our project and > will send patch after rebasing onto your new series). Yes, there are > side effects exists if you want to make some experiments with missing > drives in FS, but you can disable autoreplacement for such case. > > If you know about any pitfalls in such scenarios, please point me to > them, I am newbie in FS-related kernel things. If a disk is particularly slow to start up for some reason (maybe it's going bad, maybe it's just got a slow interconnect (think SD cards), maybe it's just really cold so the bearings seizing up), then this would potentially force it out of the array when it shouldn't be. That said, having things set to always allow degraded mounts is _extremely dangerous_. If the user does not know anything failed, they also can't know they need to get anything fixed. While notification could be used, it also introduces a period of time where the user is at risk of data loss without them having explicitly agreed to this risk (by manually telling it to mount degraded). I could possibly understand doing this for something that needs to be guaranteed to come on line when powered on, but **only** if it notifies responsible parties that there was a problem **and** it is explicitly documented, and even then I'd be wary of doing this unless there was something in place to handle the possibility of false positives (yes, they do happen), and to make certain that the failed hardware got replaced as soon as possible. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 19:59 ` Austin S. Hemmelgarn @ 2016-03-29 20:26 ` Chris Murphy 2016-03-30 11:26 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 18+ messages in thread From: Chris Murphy @ 2016-03-29 20:26 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Yauhen Kharuzhy, Anand Jain, Btrfs BTRFS On Tue, Mar 29, 2016 at 1:59 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-03-29 15:24, Yauhen Kharuzhy wrote: >> >> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: >>> >>> >>> No. No. No please don't do that, it would lead to trouble in handing >>> slow devices. I purposely didn't do it. >> >> >> Hmm. Can you explain please? Sometimes admins may want to have >> autoreplacement working automatically if drive was failed and removed >> before unmounting and remounting again. The simplest way to achieve this — >> add spare and always mount FS with 'degraded' option (we need to use >> this option in any case if we have root fs on RAID, for instance, to >> avoiding non-bootable state). So, if the autoreplacement code will check >> for >> missing drives also, this will working without user intervention. To >> allow user to decide if he wants autoreplacement, we can add mount >> option like '(no)hotspare' (I have done this already for our project and >> will send patch after rebasing onto your new series). Yes, there are >> side effects exists if you want to make some experiments with missing >> drives in FS, but you can disable autoreplacement for such case. >> >> If you know about any pitfalls in such scenarios, please point me to >> them, I am newbie in FS-related kernel things. > > If a disk is particularly slow to start up for some reason (maybe it's going > bad, maybe it's just got a slow interconnect (think SD cards), maybe it's > just really cold so the bearings seizing up), then this would potentially > force it out of the array when it shouldn't be. > > That said, having things set to always allow degraded mounts is _extremely > dangerous_. If the user does not know anything failed, they also can't know > they need to get anything fixed. While notification could be used, it also > introduces a period of time where the user is at risk of data loss without > them having explicitly agreed to this risk (by manually telling it to mount > degraded). I agree, certainly replace should not be automatic by default. And I'm unconvinced this belongs in kernel code anyway because it's a matter of policy. Policy stuff goes in user space, where capability to achieve the policy goes in the kernel. A reasonable exception is bad device ejection (e.g. mdadm faulty). Considering spinning devices take a long time to rebuild already and this probably won't change, a policy I'd like to see upon a drive going bad (totally vanishing, or producing many read or write errors): 1. Bad device is ejected, volume is degraded. 2. Consider chunks with one remaining stripe (one copy) as degraded. 3. Degraded chunks are read only, so COW changes to non-degraded chunks. 4. Degraded metadata chunks are replicated elsewhere, happens right away. 5. Implied by 4, degraded data chunks aren't immediately replicated but any change are, via COW. 6. Option, by policy, to immediately start replicating degraded data chunks - either with existing storage or hot spare, which is also a policy choice. In particular, I'd like to see the single stripe metadata chunks replicated soon so in case there's another device failure the entire volume doesn't implode. Yes there's some data loss, still better than 100% data loss. > I could possibly understand doing this for something that needs to be > guaranteed to come on line when powered on, but **only** if it notifies > responsible parties that there was a problem **and** it is explicitly > documented, and even then I'd be wary of doing this unless there was > something in place to handle the possibility of false positives (yes, they > do happen), and to make certain that the failed hardware got replaced as > soon as possible. Exactly. And I think it's safer to be more aggressive with (fairly) immediate metadata replication to remaining devices, than it is with data. I'm considering this behavior for both single volume setups, as well as multiple bricks in a cluster. And admittedly it's probably cheaper/easier to just get n-way copies of metadata than the above scheme I've written. -- Chris Murphy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 20:26 ` Chris Murphy @ 2016-03-30 11:26 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 18+ messages in thread From: Austin S. Hemmelgarn @ 2016-03-30 11:26 UTC (permalink / raw) To: Chris Murphy; +Cc: Yauhen Kharuzhy, Anand Jain, Btrfs BTRFS On 2016-03-29 16:26, Chris Murphy wrote: > On Tue, Mar 29, 2016 at 1:59 PM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2016-03-29 15:24, Yauhen Kharuzhy wrote: >>> >>> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: >>>> >>>> >>>> No. No. No please don't do that, it would lead to trouble in handing >>>> slow devices. I purposely didn't do it. >>> >>> >>> Hmm. Can you explain please? Sometimes admins may want to have >>> autoreplacement working automatically if drive was failed and removed >>> before unmounting and remounting again. The simplest way to achieve this — >>> add spare and always mount FS with 'degraded' option (we need to use >>> this option in any case if we have root fs on RAID, for instance, to >>> avoiding non-bootable state). So, if the autoreplacement code will check >>> for >>> missing drives also, this will working without user intervention. To >>> allow user to decide if he wants autoreplacement, we can add mount >>> option like '(no)hotspare' (I have done this already for our project and >>> will send patch after rebasing onto your new series). Yes, there are >>> side effects exists if you want to make some experiments with missing >>> drives in FS, but you can disable autoreplacement for such case. >>> >>> If you know about any pitfalls in such scenarios, please point me to >>> them, I am newbie in FS-related kernel things. >> >> If a disk is particularly slow to start up for some reason (maybe it's going >> bad, maybe it's just got a slow interconnect (think SD cards), maybe it's >> just really cold so the bearings seizing up), then this would potentially >> force it out of the array when it shouldn't be. >> >> That said, having things set to always allow degraded mounts is _extremely >> dangerous_. If the user does not know anything failed, they also can't know >> they need to get anything fixed. While notification could be used, it also >> introduces a period of time where the user is at risk of data loss without >> them having explicitly agreed to this risk (by manually telling it to mount >> degraded). > > I agree, certainly replace should not be automatic by default. And I'm > unconvinced this belongs in kernel code anyway because it's a matter > of policy. Policy stuff goes in user space, where capability to > achieve the policy goes in the kernel. > > A reasonable exception is bad device ejection (e.g. mdadm faulty). > > Considering spinning devices take a long time to rebuild already and > this probably won't change, a policy I'd like to see upon a drive > going bad (totally vanishing, or producing many read or write errors): > 1. Bad device is ejected, volume is degraded. > 2. Consider chunks with one remaining stripe (one copy) as degraded. > 3. Degraded chunks are read only, so COW changes to non-degraded chunks. > 4. Degraded metadata chunks are replicated elsewhere, happens right away. > 5. Implied by 4, degraded data chunks aren't immediately replicated > but any change are, via COW. > 6. Option, by policy, to immediately start replicating degraded data > chunks - either with existing storage or hot spare, which is also a > policy choice. > > In particular, I'd like to see the single stripe metadata chunks > replicated soon so in case there's another device failure the entire > volume doesn't implode. Yes there's some data loss, still better than > 100% data loss. I've actually considered multiple times writing a daemon in Python to do this. In general, I agree that it's mostly policy, and thus should be in userspace. At the very least though, we really should have something in the kernel that we can watch from userspace (be it with select, epoll, inotify, fanotify, or something else) to tell us when a state change happens or the filesystem, as right now the only way I can see to do this is to poll the mount options. > >> I could possibly understand doing this for something that needs to be >> guaranteed to come on line when powered on, but **only** if it notifies >> responsible parties that there was a problem **and** it is explicitly >> documented, and even then I'd be wary of doing this unless there was >> something in place to handle the possibility of false positives (yes, they >> do happen), and to make certain that the failed hardware got replaced as >> soon as possible. > > Exactly. And I think it's safer to be more aggressive with (fairly) > immediate metadata replication to remaining devices, than it is with > data. > > I'm considering this behavior for both single volume setups, as well > as multiple bricks in a cluster. And admittedly it's probably > cheaper/easier to just get n-way copies of metadata than the above > scheme I've written. > And even then, you would still have people with big arrays who would want the metadata re-striped immediately on a device failure. I will, however, be extremely happy when n-way replication hits, as I then will not need to stack BTRFS raid1 on top of LVM RAID1 to get higher order replication levels. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 14:41 ` Anand Jain 2016-03-29 19:24 ` Yauhen Kharuzhy @ 2016-03-29 19:40 ` Yauhen Kharuzhy 2016-03-30 22:17 ` Yauhen Kharuzhy 2016-03-29 19:47 ` Yauhen Kharuzhy 2 siblings, 1 reply; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-03-29 19:40 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs Hi. I am testing hotspare v2 on kernel v4.4.5 (I will try latest Chris' tree later) now with lockdep debugging enabled. At starting of replacement, lockdep warning is displayed, because kstrdup() is called with GFP_NOFS inside of rcu_read_lock/unlock() block (GFP_NOFS can sleep). [ 1463.470875] BTRFS critical (device sdc): Fatal error on device /dev/sdf [ 1463.499091] BTRFS warning (device sdc): device /dev/sdf failed, chunks degraded [ 1463.501394] BTRFS info (device sdc): num_devices 4 rw_devices 3 degraded-option: unset [ 1463.506221] [ 1463.508572] =============================== [ 1463.512989] [ INFO: suspicious RCU usage. ] [ 1463.514194] 4.4.5-scst31x+ #20 Not tainted [ 1463.515191] ------------------------------- [ 1463.516162] include/linux/rcupdate.h:560 Illegal context switch in RCU read-side critical section! [ 1463.519943] [ 1463.519943] other info that might help us debug this: [ 1463.519943] [ 1463.531515] [ 1463.531515] rcu_scheduler_active = 1, debug_locks = 0 [ 1463.541056] 3 locks held by btrfs-casualty/4702: [ 1463.542151] #0: (&fs_info->casualty_mutex){+.+...}, at: [<ffffffffa018325a>] casualty_kthread+0x6a/0x340 [btrfs] [ 1463.563978] #1: (uuid_mutex){+.+.+.}, at: [<ffffffffa01bfc65>] btrfs_get_spare_device+0x25/0x1c0 [btrfs] [ 1463.587380] #2: (rcu_read_lock){......}, at: [<ffffffffa01bfcac>] btrfs_get_spare_device+0x6c/0x1c0 [btrfs] [ 1463.606124] [ 1463.606124] stack backtrace: [ 1463.612596] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20 [ 1463.630796] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 1463.649423] 0000000000000000 ffff88005e65fc60 ffffffff813529e3 ffff88005e658580 [ 1463.685195] 0000000000000001 ffff88005e65fc90 ffffffff810d6407 0000000000000000 [ 1463.719570] ffffffff81a270d7 0000000000000b2d 0000000002400040 ffff88005e65fcb8 [ 1463.735510] Call Trace: [ 1463.741172] [<ffffffff813529e3>] dump_stack+0x85/0xc2 [ 1463.751346] [<ffffffff810d6407>] lockdep_rcu_suspicious+0xd7/0x110 [ 1463.776423] [<ffffffff810add27>] ___might_sleep+0xa7/0x230 [ 1463.792409] [<ffffffff810adef9>] __might_sleep+0x49/0x80 [ 1463.798818] [<ffffffffa01bfd0a>] ? btrfs_get_spare_device+0xca/0x1c0 [btrfs] [ 1463.814369] [<ffffffff81200ec7>] __kmalloc_track_caller+0x227/0x5e0 [ 1463.828477] [<ffffffff811c0f81>] kstrdup+0x31/0x60 [ 1463.836761] [<ffffffffa01bfd0a>] btrfs_get_spare_device+0xca/0x1c0 [btrfs] [ 1463.851841] [<ffffffffa01bfcac>] ? btrfs_get_spare_device+0x6c/0x1c0 [btrfs] [ 1463.867022] [<ffffffffa02038a3>] btrfs_auto_replace_start+0x23/0xd0 [btrfs] [ 1463.874201] [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs] [ 1463.891812] [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs] [ 1463.908948] [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs] [ 1463.920370] [<ffffffff810a70df>] kthread+0xef/0x110 [ 1463.936421] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 [ 1463.949868] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 1463.963096] [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70 [ 1463.973053] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 1463.976256] BUG: sleeping function called from invalid context at mm/slab.c:2861 [ 1463.979936] in_atomic(): 1, irqs_disabled(): 0, pid: 4702, name: btrfs-casualty [ 1463.981868] INFO: lockdep is turned off. [ 1463.987513] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20 [ 1463.999678] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 1464.002555] 0000000000000000 ffff88005e65fc90 ffffffff813529e3 ffff88005e658580 [ 1464.022926] ffffffff81a270d7 ffff88005e65fcb8 ffffffff810addf9 ffffffff81a270d7 [ 1464.055231] 0000000000000b2d 0000000000000000 ffff88005e65fce0 ffffffff810adef9 [ 1464.066771] Call Trace: [ 1464.069301] [<ffffffff813529e3>] dump_stack+0x85/0xc2 [ 1464.074030] [<ffffffff810addf9>] ___might_sleep+0x179/0x230 [ 1464.076525] [<ffffffff810adef9>] __might_sleep+0x49/0x80 [ 1464.078372] [<ffffffffa01bfd0a>] ? btrfs_get_spare_device+0xca/0x1c0 [btrfs] [ 1464.080315] [<ffffffff81200ec7>] __kmalloc_track_caller+0x227/0x5e0 [ 1464.083712] [<ffffffff811c0f81>] kstrdup+0x31/0x60 [ 1464.099224] [<ffffffffa01bfd0a>] btrfs_get_spare_device+0xca/0x1c0 [btrfs] [ 1464.102570] [<ffffffffa01bfcac>] ? btrfs_get_spare_device+0x6c/0x1c0 [btrfs] [ 1464.106557] [<ffffffffa02038a3>] btrfs_auto_replace_start+0x23/0xd0 [btrfs] [ 1464.113009] [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs] [ 1464.115574] [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs] [ 1464.131337] [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs] [ 1464.139141] [<ffffffff810a70df>] kthread+0xef/0x110 [ 1464.151499] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 [ 1464.175156] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 1464.194352] [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70 [ 1464.207239] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 1464.232552] BTRFS info (device sdc): dev_replace from <missing disk> (devid 4) to /dev/sdg started -- Yauhen Kharuzhy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 19:40 ` Yauhen Kharuzhy @ 2016-03-30 22:17 ` Yauhen Kharuzhy 2016-04-02 1:17 ` Anand Jain 0 siblings, 1 reply; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-03-30 22:17 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs On Tue, Mar 29, 2016 at 10:40:40PM +0300, Yauhen Kharuzhy wrote: > Hi. > > I am testing hotspare v2 on kernel v4.4.5 (I will try latest Chris' tree later) > now with lockdep debugging enabled. At starting of replacement, lockdep warning is displayed, > because kstrdup() is called with GFP_NOFS inside of rcu_read_lock/unlock() > block (GFP_NOFS can sleep). Similar thing in the btrfs_auto_replace_start(): rcu_str_deref() without rcu_read_lock(): int btrfs_auto_replace_start(struct btrfs_root *root, struct btrfs_device *src_device) { int ret; char *tgt_path; if (btrfs_get_spare_device(&tgt_path)) { btrfs_err(root->fs_info, "No spare device found/configured in the kernel"); return -EINVAL; } ret = btrfs_dev_replace_start(root, tgt_path, src_device->devid, rcu_str_deref(src_device->name), BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID); if (ret) btrfs_put_spare_device(tgt_path); kfree(tgt_path); return 0; } [ 156.168133] =============================== [ 156.168963] [ INFO: suspicious RCU usage. ] [ 156.169822] 4.4.5-scst31x+ #20 Not tainted [ 156.170656] ------------------------------- [ 156.171488] fs/btrfs/dev-replace.c:990 suspicious rcu_dereference_check() usage! [ 156.172920] [ 156.172920] other info that might help us debug this: [ 156.172920] [ 156.174825] [ 156.174825] rcu_scheduler_active = 1, debug_locks = 0 [ 156.176152] 1 lock held by btrfs-casualty/4807: [ 156.181917] #0: (&fs_info->casualty_mutex){+.+...}, at: [<ffffffffa0165364>] casualty_kthread+0x64/0x390 [btrfs] [ 156.193511] [ 156.193511] stack backtrace: [ 156.194680] CPU: 0 PID: 4807 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20 [ 156.201650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 156.219100] 0000000000000000 ffff88005d79fda0 ffffffff813529e3 ffff88005e19c600 [ 156.221216] 0000000000000001 ffff88005d79fdd0 ffffffff810d6407 0000000000000000 [ 156.224287] 0000000000000000 ffff88005f4a0c00 ffff88005da36000 ffff88005d79fe08 [ 156.226375] Call Trace: [ 156.227078] [<ffffffff813529e3>] dump_stack+0x85/0xc2 [ 156.228152] [<ffffffff810d6407>] lockdep_rcu_suspicious+0xd7/0x110 [ 156.229418] [<ffffffffa01e6236>] btrfs_auto_replace_start+0xa6/0xd0 [btrfs] [ 156.230714] [<ffffffffa01655c4>] casualty_kthread+0x2c4/0x390 [btrfs] [ 156.231915] [<ffffffffa016549c>] ? casualty_kthread+0x19c/0x390 [btrfs] [ 156.233105] [<ffffffffa0165300>] ? btrfs_check_devices+0x200/0x200 [btrfs] [ 156.234339] [<ffffffff810a70df>] kthread+0xef/0x110 [ 156.235309] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 [ 156.236940] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 156.239489] [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70 [ 156.240533] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 -- Yauhen Kharuzhy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-30 22:17 ` Yauhen Kharuzhy @ 2016-04-02 1:17 ` Anand Jain 0 siblings, 0 replies; 18+ messages in thread From: Anand Jain @ 2016-04-02 1:17 UTC (permalink / raw) To: Yauhen Kharuzhy; +Cc: linux-btrfs On 03/31/2016 06:17 AM, Yauhen Kharuzhy wrote: > On Tue, Mar 29, 2016 at 10:40:40PM +0300, Yauhen Kharuzhy wrote: >> Hi. >> >> I am testing hotspare v2 on kernel v4.4.5 (I will try latest Chris' tree later) >> now with lockdep debugging enabled. At starting of replacement, lockdep warning is displayed, >> because kstrdup() is called with GFP_NOFS inside of rcu_read_lock/unlock() >> block (GFP_NOFS can sleep). > > Similar thing in the btrfs_auto_replace_start(): rcu_str_deref() without > rcu_read_lock(): > > int btrfs_auto_replace_start(struct btrfs_root *root, > struct btrfs_device *src_device) > { > int ret; > char *tgt_path; > > if (btrfs_get_spare_device(&tgt_path)) { > btrfs_err(root->fs_info, > "No spare device found/configured in the kernel"); > return -EINVAL; > } > > ret = btrfs_dev_replace_start(root, tgt_path, > src_device->devid, > rcu_str_deref(src_device->name), This is fixed in V3. Thanks, Anand > BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID); > if (ret) > btrfs_put_spare_device(tgt_path); > > kfree(tgt_path); > > return 0; > } > > [ 156.168133] =============================== > [ 156.168963] [ INFO: suspicious RCU usage. ] > [ 156.169822] 4.4.5-scst31x+ #20 Not tainted > [ 156.170656] ------------------------------- > [ 156.171488] fs/btrfs/dev-replace.c:990 suspicious rcu_dereference_check() usage! > [ 156.172920] > [ 156.172920] other info that might help us debug this: > [ 156.172920] > [ 156.174825] > [ 156.174825] rcu_scheduler_active = 1, debug_locks = 0 > [ 156.176152] 1 lock held by btrfs-casualty/4807: > [ 156.181917] #0: (&fs_info->casualty_mutex){+.+...}, at: [<ffffffffa0165364>] casualty_kthread+0x64/0x390 [btrfs] > [ 156.193511] > [ 156.193511] stack backtrace: > [ 156.194680] CPU: 0 PID: 4807 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20 > [ 156.201650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 > [ 156.219100] 0000000000000000 ffff88005d79fda0 ffffffff813529e3 ffff88005e19c600 > [ 156.221216] 0000000000000001 ffff88005d79fdd0 ffffffff810d6407 0000000000000000 > [ 156.224287] 0000000000000000 ffff88005f4a0c00 ffff88005da36000 ffff88005d79fe08 > [ 156.226375] Call Trace: > [ 156.227078] [<ffffffff813529e3>] dump_stack+0x85/0xc2 > [ 156.228152] [<ffffffff810d6407>] lockdep_rcu_suspicious+0xd7/0x110 > [ 156.229418] [<ffffffffa01e6236>] btrfs_auto_replace_start+0xa6/0xd0 [btrfs] > [ 156.230714] [<ffffffffa01655c4>] casualty_kthread+0x2c4/0x390 [btrfs] > [ 156.231915] [<ffffffffa016549c>] ? casualty_kthread+0x19c/0x390 [btrfs] > [ 156.233105] [<ffffffffa0165300>] ? btrfs_check_devices+0x200/0x200 [btrfs] > [ 156.234339] [<ffffffff810a70df>] kthread+0xef/0x110 > [ 156.235309] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 > [ 156.236940] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 > [ 156.239489] [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70 > [ 156.240533] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 14:41 ` Anand Jain 2016-03-29 19:24 ` Yauhen Kharuzhy 2016-03-29 19:40 ` Yauhen Kharuzhy @ 2016-03-29 19:47 ` Yauhen Kharuzhy 2016-03-29 23:18 ` Yauhen Kharuzhy 2016-04-02 1:15 ` Anand Jain 2 siblings, 2 replies; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-03-29 19:47 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: > > Hi Yauhen, > > > > >Issue 2. > >At start of autoreplacig drive by hotspare, kernel craches in transaction > >handling code (inside of btrfs_commit_transaction() called by autoreplace initiating > >routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see > >https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master > >(oops text is attached also). Bdev is closed after replacing by > >btrfs_dev_replace_finishing(), so this is safe but doesn't seem > >to be right way. > > I have sent out V2. I don't see that issue with this, > could you pls try ? Yes, it reproduced on v4.4.5 kernel. I will try with current 'for-linus-4.6' Chris' tree soon. To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev can be freed by kernel after releasing of all references to it. [ 1464.232552] BTRFS info (device sdc): dev_replace from <missing disk> (devid 4) to /dev/sdg started [ 1464.255824] BUG: unable to handle kernel NULL pointer dereference at 0000000000000548 [ 1464.291760] IP: [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910 [ 1464.309746] PGD 5c668067 PUD 5b841067 PMD 0 [ 1464.326143] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 1464.340474] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg iTCO_wdt ansi_cprng iTCO_vendor_support snd_pcm snd_timer aesni_intel snd soundcore psmouse aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev serio_raw pcspkr battery acpi_cpufreq 8250_fintek parport_pc video lpc_ich parport mfd_core tpm_tis tpm ac rng_core processor button i2c_piix4 btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci libahci libata pcnet32 crc32c_intel scsi_mod mii [ 1464.483244] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20 [ 1464.511300] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 1464.518035] task: ffff88005e658580 ti: ffff88005e65c000 task.ti: ffff88005e65c000 [ 1464.543072] RIP: 0010:[<ffffffff8131d58d>] [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910 [ 1464.579027] RSP: 0018:ffff88005e65f498 EFLAGS: 00010283 [ 1464.604774] RAX: 0000000000000000 RBX: ffff88005b919f28 RCX: 0000000000030b00 [ 1464.629544] RDX: 0000000000000080 RSI: 0000000000000781 RDI: ffff88004ecd5ac0 [ 1464.652763] RBP: ffff88005e65f500 R08: ffff88005b130ff0 R09: 0000000000010000 [ 1464.674939] R10: ffff88005e674f28 R11: 0000000000000000 R12: 0000000000000080 [ 1464.691478] R13: 0000000000000004 R14: ffff88004e48de00 R15: 0000000000000010 [ 1464.714115] FS: 0000000000000000(0000) GS:ffff880066600000(0000) knlGS:0000000000000000 [ 1464.737302] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1464.766380] CR2: 0000000000000548 CR3: 000000005723f000 CR4: 00000000000406f0 [ 1464.804808] Stack: [ 1464.814950] ffffffff813184ae 0000000000000246 0000000000000082 0000000000000000 [ 1464.847217] 0000000000000000 0000000000000092 0000000000000000 ffff88005e65f540 [ 1464.879147] ffff88005b919f28 00000000ffffffff 0000000000000004 ffff88004e48de00 [ 1464.907440] Call Trace: [ 1464.919293] [<ffffffff813184ae>] ? bvec_alloc+0x5e/0x100 [ 1464.939019] [<ffffffff813213a4>] generic_make_request+0x24/0x290 [ 1464.961775] [<ffffffff81321677>] submit_bio+0x67/0x140 [ 1464.971842] [<ffffffffa02051b9>] finish_rmw+0x409/0x570 [btrfs] [ 1464.983700] [<ffffffffa02053c5>] full_stripe_write+0xa5/0xb0 [btrfs] [ 1464.996554] [<ffffffffa0206d05>] raid56_parity_write+0xf5/0x180 [btrfs] [ 1465.012560] [<ffffffffa01bab95>] btrfs_map_bio+0x105/0x300 [btrfs] [ 1465.046907] [<ffffffffa018f8b3>] ? btrfs_get_extent+0x83/0xb20 [btrfs] [ 1465.052462] [<ffffffffa018d175>] btrfs_submit_bio_hook+0xe5/0x1b0 [btrfs] [ 1465.069342] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 [ 1465.091031] [<ffffffffa01a917d>] submit_one_bio+0x6d/0xa0 [btrfs] [ 1465.111233] [<ffffffffa01ae06e>] submit_extent_page+0xee/0x230 [btrfs] [ 1465.126076] [<ffffffffa01ae7f4>] __extent_writepage_io+0x444/0x490 [btrfs] [ 1465.132550] [<ffffffffa01ade10>] ? end_extent_writepage+0x80/0x80 [btrfs] [ 1465.145490] [<ffffffffa01aeaa5>] __extent_writepage+0x265/0x3e0 [btrfs] [ 1465.168445] [<ffffffffa01aef1b>] extent_write_cache_pages.isra.32.constprop.49+0x2fb/0x3d0 [btrfs] [ 1465.204094] [<ffffffffa01b009d>] extent_writepages+0x4d/0x70 [btrfs] [ 1465.229627] [<ffffffffa018f830>] ? btrfs_real_readdir+0x5c0/0x5c0 [btrfs] [ 1465.250927] [<ffffffffa018d5e8>] btrfs_writepages+0x28/0x30 [btrfs] [ 1465.274099] [<ffffffff811afcb1>] do_writepages+0x21/0x30 [ 1465.298275] [<ffffffff811a10ea>] __filemap_fdatawrite_range+0xaa/0xf0 [ 1465.324278] [<ffffffff811a1203>] filemap_fdatawrite_range+0x13/0x20 [ 1465.341055] [<ffffffffa01a1490>] btrfs_fdatawrite_range+0x20/0x50 [btrfs] [ 1465.378952] [<ffffffffa01d377a>] __btrfs_write_out_cache.isra.27+0x3ea/0x430 [btrfs] [ 1465.405760] [<ffffffffa01d4a7f>] btrfs_write_out_cache+0x8f/0x110 [btrfs] [ 1465.428091] [<ffffffffa0176128>] btrfs_write_dirty_block_groups+0x228/0x290 [btrfs] [ 1465.458865] [<ffffffffa0208d6a>] commit_cowonly_roots+0x1f8/0x283 [btrfs] [ 1465.480450] [<ffffffffa018b0d7>] btrfs_commit_transaction+0x577/0xb60 [btrfs] [ 1465.512410] [<ffffffffa0202cf3>] btrfs_dev_replace_start+0x2e3/0x520 [btrfs] [ 1465.535358] [<ffffffffa0202b6e>] ? btrfs_dev_replace_start+0x15e/0x520 [btrfs] [ 1465.548049] [<ffffffffa02038d8>] btrfs_auto_replace_start+0x58/0xd0 [btrfs] [ 1465.551787] [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs] [ 1465.561195] [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs] [ 1465.573308] [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs] [ 1465.610840] [<ffffffff810a70df>] kthread+0xef/0x110 [ 1465.629472] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 [ 1465.648678] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 1465.660686] [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70 [ 1465.667065] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 [ 1465.676861] Code: 67 28 48 c7 c7 6b f8 a3 81 e8 40 09 d9 ff e8 3b 43 31 00 41 c1 ec 09 48 8b 7b 08 45 85 e4 0f 85 13 01 00 00 48 8b 87 f0 00 00 00 <4c> 8b b8 48 05 00 00 4d 85 ff 0f 84 d5 01 00 00 4c 8b af e0 00 [ 1465.750135] RIP [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910 [ 1465.776005] RSP <ffff88005e65f498> [ 1465.790848] CR2: 0000000000000548 [ 1465.797370] ---[ end trace 45545495cd54e799 ]--- -- Yauhen Kharuzhy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 19:47 ` Yauhen Kharuzhy @ 2016-03-29 23:18 ` Yauhen Kharuzhy 2016-04-02 1:15 ` Anand Jain 1 sibling, 0 replies; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-03-29 23:18 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs Reproduced with mason's for-linus-4.6 branch also. 2016-03-29 12:47 GMT-07:00 Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>: > On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: >> >> Hi Yauhen, >> > >> > >> >Issue 2. >> >At start of autoreplacig drive by hotspare, kernel craches in transaction >> >handling code (inside of btrfs_commit_transaction() called by autoreplace initiating >> >routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see >> >https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master >> >(oops text is attached also). Bdev is closed after replacing by >> >btrfs_dev_replace_finishing(), so this is safe but doesn't seem >> >to be right way. >> >> I have sent out V2. I don't see that issue with this, >> could you pls try ? > > Yes, it reproduced on v4.4.5 kernel. I will try with current > 'for-linus-4.6' Chris' tree soon. > > To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev > can be freed by kernel after releasing of all references to it. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-03-29 19:47 ` Yauhen Kharuzhy 2016-03-29 23:18 ` Yauhen Kharuzhy @ 2016-04-02 1:15 ` Anand Jain 2016-04-02 1:33 ` Yauhen Kharuzhy 2016-04-04 19:32 ` Yauhen Kharuzhy 1 sibling, 2 replies; 18+ messages in thread From: Anand Jain @ 2016-04-02 1:15 UTC (permalink / raw) To: Yauhen Kharuzhy; +Cc: linux-btrfs On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote: > On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: >> >> Hi Yauhen, >> > >>> >>> Issue 2. >>> At start of autoreplacig drive by hotspare, kernel craches in transaction >>> handling code (inside of btrfs_commit_transaction() called by autoreplace initiating >>> routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see >>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master >>> (oops text is attached also). Bdev is closed after replacing by >>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem >>> to be right way. >> >> I have sent out V2. I don't see that issue with this, >> could you pls try ? > > Yes, it reproduced on v4.4.5 kernel. I will try with current > 'for-linus-4.6' Chris' tree soon. > > To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev > can be freed by kernel after releasing of all references to it. So far the raid group profile would adapt to lower suitable group profile when device is missing/failed. This appears to be not happening with RAID56 OR there are stale IO which wasn't flushed out. Anyway to have this fixed I am moving the patch btrfs: introduce device dynamic state transition to offline or failed to the top in v3 for any potential changes. But firstly we need a reliable test case, or a very carefully crafted test case which can create this situation Here below is the dm-error that I am using for testing, which apparently doesn't report this issue. Could you please try on V3. ? (pls note the device names are hard coded in the test script sorry about that) This would eventually be fstests script. ---- # cat util run() { local ret echo -- ${*} -- echo ${*} | bash ret=$? if [ $ret -ne 0 ]; then echo echo "###### FAILED: RET $ret #####" echo exit fi echo #echo "OK?"; read } runnt() { local ret echo -- ${*} -- echo ${*} | bash ret=$? echo #echo "OK?"; read } wipeall() { runnt "wipefs -a /dev/sd[c-h] > /dev/null" } create_err_dev_raid1() { dm_backing_dev="/dev/sdd" blk_dev_size=`blockdev --getsz $dm_backing_dev` dmerror_dev="/dev/mapper/dm-sdd" dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0" dmerror_table="0 $blk_dev_size error $dm_backing_dev 0" echo -e dm_backing_dev'\t'= $dm_backing_dev echo -e blk_dev_size'\t'= $blk_dev_size echo -e dmerror_dev'\t'= $dmerror_dev echo -e dmlinear_table'\t'= $dmlinear_table echo -e dmerror_table'\t'= $dmerror_table echo runnt "dmsetup remove dm-sdd > /dev/null 2>&1" run "dmsetup create dm-sdd --table '${dmlinear_table}'" run "mkfs.btrfs -f -draid1 -mraid1 /dev/sdc $dmerror_dev > /dev/null 2>&1" run mount /dev/sdc /btrfs run "fillfs /btrfs 1000 > /dev/null 2>&1" run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show # run sleep 32 run dmsetup suspend dm-sdd run "dmsetup load dm-sdd --table '$dmerror_table'" run dmsetup resume dm-sdd run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show } create_err_dev_raid56() { dm_backing_dev="/dev/sdd" blk_dev_size=`blockdev --getsz $dm_backing_dev` dmerror_dev="/dev/mapper/dm-sdd" dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0" dmerror_table="0 $blk_dev_size error $dm_backing_dev 0" echo -e dm_backing_dev'\t'= $dm_backing_dev echo -e blk_dev_size'\t'= $blk_dev_size echo -e dmerror_dev'\t'= $dmerror_dev echo -e dmlinear_table'\t'= $dmlinear_table echo -e dmerror_table'\t'= $dmerror_table echo runnt "dmsetup remove dm-sdd > /dev/null 2>&1" run "dmsetup create dm-sdd --table '${dmlinear_table}'" run "mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdf $dmerror_dev > /dev/null 2>&1" run mount /dev/sdc /btrfs run "fillfs /btrfs 1000 > /dev/null 2>&1" run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show # run sleep 32 run dmsetup suspend dm-sdd run "dmsetup load dm-sdd --table '$dmerror_table'" run dmsetup resume dm-sdd run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show } # cat auto-replace-test56 source $(dirname $0)/util wipeall run btrfs spare add /dev/sde #run cat /proc/fs/btrfs/devlist create_err_dev_raid56 ------ Thanks, Anand > [ 1464.232552] BTRFS info (device sdc): dev_replace from <missing disk> (devid 4) to /dev/sdg started > [ 1464.255824] BUG: unable to handle kernel NULL pointer dereference at 0000000000000548 > [ 1464.291760] IP: [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910 > [ 1464.309746] PGD 5c668067 PUD 5b841067 PMD 0 > [ 1464.326143] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC > [ 1464.340474] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic hmac drbg iTCO_wdt ansi_cprng iTCO_vendor_support snd_pcm snd_timer aesni_intel snd soundcore psmouse aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd evdev serio_raw pcspkr battery acpi_cpufreq 8250_fintek parport_pc video lpc_ich parport mfd_core tpm_tis tpm ac rng_core processor button i2c_piix4 btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod ahci libahci libata pcnet32 crc32c_intel scsi_mod mii > [ 1464.483244] CPU: 0 PID: 4702 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20 > [ 1464.511300] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 > [ 1464.518035] task: ffff88005e658580 ti: ffff88005e65c000 task.ti: ffff88005e65c000 > [ 1464.543072] RIP: 0010:[<ffffffff8131d58d>] [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910 > [ 1464.579027] RSP: 0018:ffff88005e65f498 EFLAGS: 00010283 > [ 1464.604774] RAX: 0000000000000000 RBX: ffff88005b919f28 RCX: 0000000000030b00 > [ 1464.629544] RDX: 0000000000000080 RSI: 0000000000000781 RDI: ffff88004ecd5ac0 > [ 1464.652763] RBP: ffff88005e65f500 R08: ffff88005b130ff0 R09: 0000000000010000 > [ 1464.674939] R10: ffff88005e674f28 R11: 0000000000000000 R12: 0000000000000080 > [ 1464.691478] R13: 0000000000000004 R14: ffff88004e48de00 R15: 0000000000000010 > [ 1464.714115] FS: 0000000000000000(0000) GS:ffff880066600000(0000) knlGS:0000000000000000 > [ 1464.737302] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 1464.766380] CR2: 0000000000000548 CR3: 000000005723f000 CR4: 00000000000406f0 > [ 1464.804808] Stack: > [ 1464.814950] ffffffff813184ae 0000000000000246 0000000000000082 0000000000000000 > [ 1464.847217] 0000000000000000 0000000000000092 0000000000000000 ffff88005e65f540 > [ 1464.879147] ffff88005b919f28 00000000ffffffff 0000000000000004 ffff88004e48de00 > [ 1464.907440] Call Trace: > [ 1464.919293] [<ffffffff813184ae>] ? bvec_alloc+0x5e/0x100 > [ 1464.939019] [<ffffffff813213a4>] generic_make_request+0x24/0x290 > [ 1464.961775] [<ffffffff81321677>] submit_bio+0x67/0x140 > [ 1464.971842] [<ffffffffa02051b9>] finish_rmw+0x409/0x570 [btrfs] > [ 1464.983700] [<ffffffffa02053c5>] full_stripe_write+0xa5/0xb0 [btrfs] > [ 1464.996554] [<ffffffffa0206d05>] raid56_parity_write+0xf5/0x180 [btrfs] > [ 1465.012560] [<ffffffffa01bab95>] btrfs_map_bio+0x105/0x300 [btrfs] > [ 1465.046907] [<ffffffffa018f8b3>] ? btrfs_get_extent+0x83/0xb20 [btrfs] > [ 1465.052462] [<ffffffffa018d175>] btrfs_submit_bio_hook+0xe5/0x1b0 [btrfs] > [ 1465.069342] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 > [ 1465.091031] [<ffffffffa01a917d>] submit_one_bio+0x6d/0xa0 [btrfs] > [ 1465.111233] [<ffffffffa01ae06e>] submit_extent_page+0xee/0x230 [btrfs] > [ 1465.126076] [<ffffffffa01ae7f4>] __extent_writepage_io+0x444/0x490 [btrfs] > [ 1465.132550] [<ffffffffa01ade10>] ? end_extent_writepage+0x80/0x80 [btrfs] > [ 1465.145490] [<ffffffffa01aeaa5>] __extent_writepage+0x265/0x3e0 [btrfs] > [ 1465.168445] [<ffffffffa01aef1b>] extent_write_cache_pages.isra.32.constprop.49+0x2fb/0x3d0 [btrfs] > [ 1465.204094] [<ffffffffa01b009d>] extent_writepages+0x4d/0x70 [btrfs] > [ 1465.229627] [<ffffffffa018f830>] ? btrfs_real_readdir+0x5c0/0x5c0 [btrfs] > [ 1465.250927] [<ffffffffa018d5e8>] btrfs_writepages+0x28/0x30 [btrfs] > [ 1465.274099] [<ffffffff811afcb1>] do_writepages+0x21/0x30 > [ 1465.298275] [<ffffffff811a10ea>] __filemap_fdatawrite_range+0xaa/0xf0 > [ 1465.324278] [<ffffffff811a1203>] filemap_fdatawrite_range+0x13/0x20 > [ 1465.341055] [<ffffffffa01a1490>] btrfs_fdatawrite_range+0x20/0x50 [btrfs] > [ 1465.378952] [<ffffffffa01d377a>] __btrfs_write_out_cache.isra.27+0x3ea/0x430 [btrfs] > [ 1465.405760] [<ffffffffa01d4a7f>] btrfs_write_out_cache+0x8f/0x110 [btrfs] > [ 1465.428091] [<ffffffffa0176128>] btrfs_write_dirty_block_groups+0x228/0x290 [btrfs] > [ 1465.458865] [<ffffffffa0208d6a>] commit_cowonly_roots+0x1f8/0x283 [btrfs] > [ 1465.480450] [<ffffffffa018b0d7>] btrfs_commit_transaction+0x577/0xb60 [btrfs] > [ 1465.512410] [<ffffffffa0202cf3>] btrfs_dev_replace_start+0x2e3/0x520 [btrfs] > [ 1465.535358] [<ffffffffa0202b6e>] ? btrfs_dev_replace_start+0x15e/0x520 [btrfs] > [ 1465.548049] [<ffffffffa02038d8>] btrfs_auto_replace_start+0x58/0xd0 [btrfs] > [ 1465.551787] [<ffffffffa01834ad>] casualty_kthread+0x2bd/0x340 [btrfs] > [ 1465.561195] [<ffffffffa01833d1>] ? casualty_kthread+0x1e1/0x340 [btrfs] > [ 1465.573308] [<ffffffffa01831f0>] ? btrfs_check_devices+0x1f0/0x1f0 [btrfs] > [ 1465.610840] [<ffffffff810a70df>] kthread+0xef/0x110 > [ 1465.629472] [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 > [ 1465.648678] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 > [ 1465.660686] [<ffffffff81637c2f>] ret_from_fork+0x3f/0x70 > [ 1465.667065] [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200 > [ 1465.676861] Code: 67 28 48 c7 c7 6b f8 a3 81 e8 40 09 d9 ff e8 3b 43 31 00 41 c1 ec 09 48 8b 7b 08 45 85 e4 0f 85 13 01 00 00 48 8b 87 f0 00 00 00 <4c> 8b b8 48 05 00 00 4d 85 ff 0f 84 d5 01 00 00 4c 8b af e0 00 > [ 1465.750135] RIP [<ffffffff8131d58d>] generic_make_request_checks+0x4d/0x910 > [ 1465.776005] RSP <ffff88005e65f498> > [ 1465.790848] CR2: 0000000000000548 > [ 1465.797370] ---[ end trace 45545495cd54e799 ]--- > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-04-02 1:15 ` Anand Jain @ 2016-04-02 1:33 ` Yauhen Kharuzhy 2016-04-02 1:38 ` Anand Jain 2016-04-04 19:32 ` Yauhen Kharuzhy 1 sibling, 1 reply; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-04-02 1:33 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote: > > > On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote: > >On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: > >> > >>Hi Yauhen, > >> > > > >>> > >>>Issue 2. > >>>At start of autoreplacig drive by hotspare, kernel craches in transaction > >>>handling code (inside of btrfs_commit_transaction() called by autoreplace initiating > >>>routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see > >>>https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master > >>>(oops text is attached also). Bdev is closed after replacing by > >>>btrfs_dev_replace_finishing(), so this is safe but doesn't seem > >>>to be right way. > >> > >> I have sent out V2. I don't see that issue with this, > >> could you pls try ? > > > >Yes, it reproduced on v4.4.5 kernel. I will try with current > >'for-linus-4.6' Chris' tree soon. > > > >To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev > >can be freed by kernel after releasing of all references to it. > > So far the raid group profile would adapt to lower suitable > group profile when device is missing/failed. This appears to > be not happening with RAID56 OR there are stale IO which wasn't > flushed out. Anyway to have this fixed I am moving the patch > btrfs: introduce device dynamic state transition to offline or failed > to the top in v3 for any potential changes. > But firstly we need a reliable test case, or a very carefully > crafted test case which can create this situation > > Here below is the dm-error that I am using for testing, which > apparently doesn't report this issue. Could you please try on V3. ? > (pls note the device names are hard coded in the test script > sorry about that) This would eventually be fstests script. Sure. But I don't see any V3 patches in the list. Are you still preparing to send them or I missed something? -- Yauhen Kharuzhy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-04-02 1:33 ` Yauhen Kharuzhy @ 2016-04-02 1:38 ` Anand Jain 0 siblings, 0 replies; 18+ messages in thread From: Anand Jain @ 2016-04-02 1:38 UTC (permalink / raw) To: Yauhen Kharuzhy; +Cc: linux-btrfs On 04/02/2016 09:33 AM, Yauhen Kharuzhy wrote: > On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote: >> >> >> On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote: >>> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: >>>> >>>> Hi Yauhen, >>>> >>> >>>>> >>>>> Issue 2. >>>>> At start of autoreplacig drive by hotspare, kernel craches in transaction >>>>> handling code (inside of btrfs_commit_transaction() called by autoreplace initiating >>>>> routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see >>>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master >>>>> (oops text is attached also). Bdev is closed after replacing by >>>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem >>>>> to be right way. >>>> >>>> I have sent out V2. I don't see that issue with this, >>>> could you pls try ? >>> >>> Yes, it reproduced on v4.4.5 kernel. I will try with current >>> 'for-linus-4.6' Chris' tree soon. >>> >>> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev >>> can be freed by kernel after releasing of all references to it. >> >> So far the raid group profile would adapt to lower suitable >> group profile when device is missing/failed. This appears to >> be not happening with RAID56 OR there are stale IO which wasn't >> flushed out. Anyway to have this fixed I am moving the patch >> btrfs: introduce device dynamic state transition to offline or failed >> to the top in v3 for any potential changes. >> But firstly we need a reliable test case, or a very carefully >> crafted test case which can create this situation >> >> Here below is the dm-error that I am using for testing, which >> apparently doesn't report this issue. Could you please try on V3. ? >> (pls note the device names are hard coded in the test script >> sorry about that) This would eventually be fstests script. > > Sure. But I don't see any V3 patches in the list. Are you still > preparing to send them or I missed something? Its out now. There was a little distraction when I was about to send it. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-04-02 1:15 ` Anand Jain 2016-04-02 1:33 ` Yauhen Kharuzhy @ 2016-04-04 19:32 ` Yauhen Kharuzhy 2016-04-12 14:16 ` Anand Jain 1 sibling, 1 reply; 18+ messages in thread From: Yauhen Kharuzhy @ 2016-04-04 19:32 UTC (permalink / raw) To: Anand Jain; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1950 bytes --] 2016-04-01 18:15 GMT-07:00 Anand Jain <anand.jain@oracle.com>: >>>> Issue 2. >>>> At start of autoreplacig drive by hotspare, kernel craches in >>>> transaction >>>> handling code (inside of btrfs_commit_transaction() called by >>>> autoreplace initiating >>>> routines). I 'fixed' this by removing of closing of bdev in >>>> btrfs_close_one_device_dont_free(), see >>>> >>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master >>>> (oops text is attached also). Bdev is closed after replacing by >>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem >>>> to be right way. >>> >>> >>> I have sent out V2. I don't see that issue with this, >>> could you pls try ? >> >> >> Yes, it reproduced on v4.4.5 kernel. I will try with current >> 'for-linus-4.6' Chris' tree soon. >> >> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev >> can be freed by kernel after releasing of all references to it. > > > So far the raid group profile would adapt to lower suitable > group profile when device is missing/failed. This appears to > be not happening with RAID56 OR there are stale IO which wasn't > flushed out. Anyway to have this fixed I am moving the patch > btrfs: introduce device dynamic state transition to offline or failed > to the top in v3 for any potential changes. > But firstly we need a reliable test case, or a very carefully > crafted test case which can create this situation > > Here below is the dm-error that I am using for testing, which > apparently doesn't report this issue. Could you please try on V3. ? > (pls note the device names are hard coded in the test script > sorry about that) This would eventually be fstests script. Hi, I have reproduced this oops with attached script. I don't use any dm layer, but just detach drive at scsi layer as xfstests do (device management functions were copy-pasted from it). [-- Attachment #2: test-autoreplace2-mainline.sh --] [-- Type: application/x-sh, Size: 2565 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Global hotspare functionality 2016-04-04 19:32 ` Yauhen Kharuzhy @ 2016-04-12 14:16 ` Anand Jain 0 siblings, 0 replies; 18+ messages in thread From: Anand Jain @ 2016-04-12 14:16 UTC (permalink / raw) To: Yauhen Kharuzhy; +Cc: linux-btrfs On 04/05/2016 03:32 AM, Yauhen Kharuzhy wrote: > 2016-04-01 18:15 GMT-07:00 Anand Jain <anand.jain@oracle.com>: >>>>> Issue 2. >>>>> At start of autoreplacig drive by hotspare, kernel craches in >>>>> transaction >>>>> handling code (inside of btrfs_commit_transaction() called by >>>>> autoreplace initiating >>>>> routines). I 'fixed' this by removing of closing of bdev in >>>>> btrfs_close_one_device_dont_free(), see >>>>> >>>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master >>>>> (oops text is attached also). Bdev is closed after replacing by >>>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem >>>>> to be right way. >>>> >>>> >>>> I have sent out V2. I don't see that issue with this, >>>> could you pls try ? >>> >>> >>> Yes, it reproduced on v4.4.5 kernel. I will try with current >>> 'for-linus-4.6' Chris' tree soon. >>> >>> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev >>> can be freed by kernel after releasing of all references to it. >> >> >> So far the raid group profile would adapt to lower suitable >> group profile when device is missing/failed. This appears to >> be not happening with RAID56 OR there are stale IO which wasn't >> flushed out. Anyway to have this fixed I am moving the patch >> btrfs: introduce device dynamic state transition to offline or failed >> to the top in v3 for any potential changes. >> But firstly we need a reliable test case, or a very carefully >> crafted test case which can create this situation >> >> Here below is the dm-error that I am using for testing, which >> apparently doesn't report this issue. Could you please try on V3. ? >> (pls note the device names are hard coded in the test script >> sorry about that) This would eventually be fstests script. > > Hi, > > I have reproduced this oops with attached script. I don't use any dm > layer, but just detach drive at scsi layer as xfstests do (device > management functions were copy-pasted from it). Nice. I was able reproduce this (also found lock dep issue when running this, since it was in the original code a separate patch was sent ou). The issue was due to that bdev wasn't null, to fix this the btrfs_device_enforce_state() is changed quite a bit. V4 is out. Thanks, Anand ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2016-04-12 14:16 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-03-18 19:39 Global hotspare functionality Yauhen Kharuzhy 2016-03-19 1:17 ` Yauhen Kharuzhy 2016-03-29 14:43 ` Anand Jain 2016-03-29 14:41 ` Anand Jain 2016-03-29 19:24 ` Yauhen Kharuzhy 2016-03-29 19:59 ` Austin S. Hemmelgarn 2016-03-29 20:26 ` Chris Murphy 2016-03-30 11:26 ` Austin S. Hemmelgarn 2016-03-29 19:40 ` Yauhen Kharuzhy 2016-03-30 22:17 ` Yauhen Kharuzhy 2016-04-02 1:17 ` Anand Jain 2016-03-29 19:47 ` Yauhen Kharuzhy 2016-03-29 23:18 ` Yauhen Kharuzhy 2016-04-02 1:15 ` Anand Jain 2016-04-02 1:33 ` Yauhen Kharuzhy 2016-04-02 1:38 ` Anand Jain 2016-04-04 19:32 ` Yauhen Kharuzhy 2016-04-12 14:16 ` Anand Jain
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).