linux-btrfs.vger.kernel.org archive mirror
From: Anand Jain <anand.jain@oracle.com>
To: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
Cc: linux-btrfs@vger.kernel.org, dsterba@suse.cz
Subject: Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
Date: Thu, 14 Apr 2016 16:45:11 +0800
Message-ID: <570F5897.6090701@oracle.com>
In-Reply-To: <20160413212143.GA28202@jeknote.loshitsa1.net>




Thanks for the report! More below.


On 04/14/2016 05:21 AM, Yauhen Kharuzhy wrote:
> On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
>> Thanks for various comments, tests and feedback.
>
> Hmm... I broke it :)
>
> I get a kernel oops after a few cycles of drive removal, insertion and replacement.
>
> My steps to reproduce:
> 1) create RAID (I used RAID6)
> 2) remove drive (I tested the /sys interface for this and VBox storage
> management – reproduced with both methods). Write & sync fs to detect
> failure.
> 3) insert drive again
> 4) wipe it
> 5) replace missing device (reproduced with user-initiated replace and
> autoreplace)
> 6) repeat steps 2-3
>
> At reboot, the kernel oopses (see below). Sometimes more than one repeat of
> steps 2-5 is needed (I am still working to localize this).
>
> Commands from my last session:
>
> root@grack12:~# btrfs fi show
> Label: 'test'  uuid: 833fef31-5536-411c-8f58-53b527569fa5
>          Total devices 4 FS bytes used 768.00KiB
>          devid    1 size 8.00GiB used 1.41GiB path /dev/sdc
>          devid    2 size 8.00GiB used 1.41GiB path /dev/sdd
>          devid    3 size 8.00GiB used 1.41GiB path /dev/sde
>          devid    5 size 8.00GiB used 1.12GiB path /dev/sdg
>
> Global spare
>
> root@test:~# ls -l /sys/block/sdg
> lrwxrwxrwx 1 root root 0 Apr  8 20:03 /sys/block/sdg -> ../devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:0/6:0:0:0/block/sdg
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete
> root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
> FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# wipefs -a /dev/sdg
> 8 bytes were erased at offset 0x10040 (btrfs)
> they were: 5f 42 48 52 66 53 5f 4d
> root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete

  You may use the simpler devmgt tool: https://github.com/asj/devmgt
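
  Alternatively, the two sysfs writes used in the quoted session can be
  wrapped in small shell helpers. This is only a minimal sketch: the
  function names scsi_detach and scsi_rescan and the SYSFS_ROOT variable
  are hypothetical (they are not part of devmgt or of any existing
  tool), and the scan string "0 0 0" is simply the one used in the
  session above.

```shell
#!/bin/sh
# Sketch only: wraps the sysfs writes shown in the quoted session.
# SYSFS_ROOT is an assumption of this sketch; it defaults to /sys and
# exists only so the helpers can be pointed at a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Detach a SCSI device given its H:C:T:L address, e.g. 6:0:0:0.
scsi_detach() {
    echo 1 > "$SYSFS_ROOT/class/scsi_device/$1/device/delete"
}

# Ask a SCSI host (e.g. host6) to rescan channel 0, target 0, LUN 0.
scsi_rescan() {
    echo "0 0 0" > "$SYSFS_ROOT/class/scsi_host/$1/scan"
}
```

  With these, one removal and re-insertion from the session would read
  "scsi_detach 6:0:0:0" followed later by "scsi_rescan host6".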

> root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
> FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# wipefs -a /dev/sdg
> 8 bytes were erased at offset 0x10040 (btrfs)
> they were: 5f 42 48 52 66 53 5f 4d
> root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete
> root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
> FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# wipefs -a /dev/sdg
> 8 bytes were erased at offset 0x10040 (btrfs)
> they were: 5f 42 48 52 66 53 5f 4d
> root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
> root@test:~# reboot
>
> Oops itself:
>
> [  349.559019] BTRFS info (device sdd): dev_replace from <missing disk> (devid 5) to /dev/sdg started
> [  349.647966] BTRFS info (device sdd): dev_replace from <missing disk> (devid 5) to /dev/sdg finished
> [  373.701691] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> [  373.731698] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic snd_pcm snd_timer iTCO_wdt hmac drbg iTCO_vendor_support ansi_cprng snd soundcore aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse evdev serio_raw pcspkr acpi_cpufreq 8250_fintek lpc_ich video ac battery parport_pc tpm_tis tpm mfd_core parport button processor rng_core i2c_piix4 btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod sr_mod cdrom ata_generic ahci libahci ata_piix libata crc32c_intel scsi_mod pcnet32 mii
> [  373.933548] CPU: 0 PID: 3955 Comm: umount Not tainted 4.4.5-scst31x-debug+ #33
> [  373.941730] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [  373.945337] task: ffff88005b2fe080 ti: ffff880056cbc000 task.ti: ffff880056cbc000
> [  373.951991] RIP: 0010:[<ffffffff811a1069>]  [<ffffffff811a1069>] __filemap_fdatawrite_range+0x29/0xf0


  You are failing the replace target, presumably while the replace is
  still running. Note, however, that this patch set does not fail the
  replace target on errors (as of now I have no idea how to do that
  without leading to a messy situation), so that case follows the
  original code path, exactly as without this patch set.

  Next, without this patch set we originally never close a device on
  errors. So when you delete the device at the block layer and
  re-attach it (scan), you most probably get a new device path for
  the block device, which rather defeats the idea of testing an
  intermittently disappearing device. So I doubt whether the test
  case is reliable, whether the above panic is btrfs related, and
  whether it is related to this patch set.
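
  One way to check whether the re-attached disk really came back as
  the same block device node is to record its sysfs path and
  major:minor number before the delete and compare them after the
  scan. The following is a minimal sketch under that assumption; the
  helper names dev_identity and same_device_node and the SYSFS_ROOT
  variable are hypothetical, and I have not run it against your setup.

```shell
#!/bin/sh
# Sketch: record a block device's identity so it can be compared
# before and after a delete/rescan cycle.
# SYSFS_ROOT is an assumption of this sketch; it defaults to /sys and
# exists only so the helpers can be pointed at a scratch directory.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print "<resolved sysfs path> <major:minor>" for a device name such
# as sdg, or "absent" if there is currently no such block device.
dev_identity() {
    link="$SYSFS_ROOT/block/$1"
    if [ ! -e "$link" ]; then
        echo "absent"
        return 0
    fi
    printf '%s %s\n' "$(readlink -f "$link")" "$(cat "$link/dev")"
}

# Succeed only if both snapshots describe the same, present device node.
same_device_node() {
    [ "$1" != "absent" ] && [ "$1" = "$2" ]
}
```

  Usage would be along the lines of: before=$(dev_identity sdg), then
  the delete and scan, then after=$(dev_identity sdg), and finally
  same_device_node "$before" "$after" to see whether the node changed.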



HTH.

Thanks, Anand


> [  373.954135] RSP: 0018:ffff880056cbfd50  EFLAGS: 00010286
> [  373.972201] RAX: 0000000000000000 RBX: ffff880056cbfd50 RCX: 0000000000000000
> [  374.003989] RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff880056cbfdb0
> [  374.044001] RBP: ffff880056cbfdc8 R08: 0000000000000000 R09: 0000000000000002
> [  374.099584] R10: ffffffff81d1b880 R11: ffffffff81d1b840 R12: 00441f0f0000441f
> [  374.113566] R13: ffff88005b2fe080 R14: 0000000000000000 R15: ffff88005b2fe080
> [  374.157600] FS:  00007f9281eea7e0(0000) GS:ffff880066600000(0000) knlGS:0000000000000000
> [  374.164870] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  374.184379] CR2: 0000000001277048 CR3: 0000000060324000 CR4: 00000000000406f0
> [  374.190320] Stack:
> [  374.201539]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  374.245946]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  374.286073]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  374.313913] Call Trace:
> [  374.329665]  [<ffffffff811a11ec>] filemap_flush+0x1c/0x20
> [  374.357488]  [<ffffffff812619d6>] __sync_blockdev+0x26/0x30
> [  374.389452]  [<ffffffff8125814e>] sync_filesystem+0x4e/0xa0
> [  374.425568]  [<ffffffff81220867>] generic_shutdown_super+0x27/0xf0
> [  374.457622]  [<ffffffff81220ba2>] kill_anon_super+0x12/0x20
> [  374.489572]  [<ffffffffa018ea88>] btrfs_kill_super+0x18/0x120 [btrfs]
> [  374.529552]  [<ffffffff81220e1e>] deactivate_locked_super+0x3e/0x70
> [  374.561196]  [<ffffffff8122126c>] deactivate_super+0x5c/0x60
> [  374.602015]  [<ffffffff8124182f>] cleanup_mnt+0x3f/0x90
> [  374.632955]  [<ffffffff812418c2>] __cleanup_mnt+0x12/0x20
> [  374.652292]  [<ffffffff810a5533>] task_work_run+0x73/0xa0
> [  374.662654]  [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
> [  374.672858]  [<ffffffff81003e0c>] syscall_return_slowpath+0xcc/0xe0
> [  374.702077]  [<ffffffff816379e2>] int_ret_from_sys_call+0x25/0x9f
> [  374.721467] Code: ff 90 0f 1f 44 00 00 55 31 c0 41 89 c8 b9 0c 00 00 00 48 89 e5 41 55 41 54 53 48 8d 5d 88 49 89 fc 48 89 df 48 83 ec 60 f3 48 ab <49> 8b 3c 24 48 b8 ff ff ff ff ff ff ff 7f 48 89 55 a0 48 89 45
> [  374.853615] RIP  [<ffffffff811a1069>] __filemap_fdatawrite_range+0x29/0xf0
> [  374.909672]  RSP <ffff880056cbfd50>
> [  374.937941] ---[ end trace 2bbc2fd699f402ff ]---
>
>

