From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from aserp1040.oracle.com ([141.146.126.69]:21398 "EHLO
	aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752002AbcDNJ6E (ORCPT );
	Thu, 14 Apr 2016 05:58:04 -0400
Subject: Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
To: Yauhen Kharuzhy
References: <1460470563-752-1-git-send-email-anand.jain@oracle.com>
	<20160413212143.GA28202@jeknote.loshitsa1.net>
	<570F5897.6090701@oracle.com>
	<20160414092246.GB17024@jeknote.loshitsa1.net>
Cc: linux-btrfs@vger.kernel.org, dsterba@suse.cz
From: Anand Jain
Message-ID: <570F6994.9030605@oracle.com>
Date: Thu, 14 Apr 2016 17:57:40 +0800
MIME-Version: 1.0
In-Reply-To: <20160414092246.GB17024@jeknote.loshitsa1.net>
Content-Type: text/plain; charset=utf-8; format=flowed
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

On 04/14/2016 05:22 PM, Yauhen Kharuzhy wrote:
> On Thu, Apr 14, 2016 at 04:45:11PM +0800, Anand Jain wrote:
>>
>> Thanks for the report ! more below..
>>
>> You may use simpler devmgt tool, https://github.com/asj/devmgt
>
> Thanks, will try.
>
>>
>> You are failing the replace-target, presumably when the replace is
>> still running, however note that this patch-set does not fail the
>> replace-target for errors (as of now I have no idea how to do that
>> without leading to a messy situation), and so it would follow the
>> original code as without this patch.
>> Next, originally with-out this patch-set we won't close any device
>> for errors. So when you delete the device at the block-layer and
>> re-attach (scan) most probably you are having a newer device path
>> to the block device. (which kind of defeats the idea of testing
>> an intermittently disappearing device), so I doubt, if the test
>> case is reliable, and above panic is btrfs related and if its
>> this patch-set related.
>
> No, It is fixed by my latest patch (about of s_bdev field in
> superblock).
> Actual sequence which leads to oops is:
> 1) FS is mounted, s_bdev is NULL
> 2) failed device is closed, s_bdev untouched
> 3) missing device is replaced, s_bdev is set to non-NULL – bdev of
> the replaced device
> 4) at second device closing, s_bdev is "changed" to first device from
> the device list but it is... some device because closed dev still
> didn't delete from the list!
> 5) after device closing, s_bdev points to invalid bdev.
> 6) umount -> sync_filesystem() -> sync_blockdev(s_bdev) -> OOPS.
>

This is wrong. It should be the other way around: s_bdev should
continue to be NULL. If s_bdev stays NULL, the sync thread will
fail-safe. The diff sent in the other thread will fix this.

Thanks, Anand
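For illustration only, the six-step sequence above and the "s_bdev stays NULL" alternative can be modelled in a few lines of user-space C. This is a rough sketch, not btrfs code: model_bdev, model_sb, model_sync() and run_scenario() are hypothetical names invented here, and the device list is reduced to a three-entry array. It assumes only what the thread states: the sync path is harmless when s_bdev is NULL, and it oopses when s_bdev points at a device that has already been closed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; NOT the real btrfs structures. */
struct model_bdev {
	bool open;			/* false once the device has been closed */
};

struct model_sb {
	struct model_bdev *s_bdev;	/* NULL right after mount in this scenario */
};

enum sync_result { SYNC_SKIPPED, SYNC_OK, SYNC_INVALID_BDEV };

/* Models the sync at umount: a NULL s_bdev means there is nothing to sync. */
static enum sync_result model_sync(const struct model_sb *sb)
{
	assert(sb != NULL);
	if (sb->s_bdev == NULL)
		return SYNC_SKIPPED;
	if (!sb->s_bdev->open)
		return SYNC_INVALID_BDEV;	/* where the real kernel would oops */
	return SYNC_OK;
}

/*
 * Replays steps 1-6. With keep_s_bdev_null set, the replace step leaves
 * s_bdev alone, which is the behaviour argued for in the reply.
 */
static enum sync_result run_scenario(bool keep_s_bdev_null)
{
	/* devs[2] plays the role of the replace target. */
	struct model_bdev devs[3] = { { true }, { true }, { true } };
	struct model_sb sb = { NULL };		/* 1) mounted, s_bdev is NULL */

	devs[0].open = false;			/* 2) failed device closed, still listed */

	if (!keep_s_bdev_null)
		sb.s_bdev = &devs[2];		/* 3) replace sets s_bdev to non-NULL */

	devs[1].open = false;			/* 4) second device is closed ... */
	if (sb.s_bdev != NULL)
		sb.s_bdev = &devs[0];		/*    ... s_bdev moves to first list entry */

	return model_sync(&sb);			/* 5) + 6) umount syncs through s_bdev */
}
```

In this model, run_scenario(false) ends in SYNC_INVALID_BDEV because s_bdev is left pointing at the closed first entry, while run_scenario(true) ends in SYNC_SKIPPED because s_bdev is never set.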