[RFC] Btrfs device and pool management (wip)

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] Btrfs device and pool management (wip)
@ 2015-11-30  7:59 Anand Jain
  2015-11-30 12:43 ` Qu Wenruo
  2015-11-30 14:51 ` Austin S Hemmelgarn
  0 siblings, 2 replies; 15+ messages in thread
From: Anand Jain @ 2015-11-30  7:59 UTC (permalink / raw)
  To: linux-btrfs

(fixed alignment)


------------
  Data center systems are generally aligned with the RAS (Reliability,
Availability and Serviceability) attributes. When it comes to Storage,
RAS applies even more because its matter of trust. In this context, one
of the primary area that a typical volume manager should be well tested
is, how well RAS attributes are maintained in the context of device
failure, and its further reporting.

  But, identifying a failed device is not a straight forward code. If
you look at some statistics performed on failed and returned disks,
most of the disks ends up being classified as NTF (No Trouble Found).
That is, host failed-and-replaced a disk even before it has actually
failed. This is not good for a cost effective setup who would want to
stretch the life of an intermittently failing device to its maximum
tenure and would want to replace only when it has confirmed dead.

  Also on the other hand, some of the data center admins would like to
mitigate the risk (of low performance at peak of their business
productions) of a potential failure, and prefer to pro-actively replace
the disk at their low business/workload hours, or they may choose to
replace a device even for read errors (mainly due to performance
reasons).

  In short a large variant of real MTF (Mean Time to Failure) for the
devices across the industries/users.


Consideration:

  - Have user-tunable to support different context of usages, which
should be applied on top of a set of disk IO errors, and its out come
will be to know if the disk can be failed.

  - Distinguish real disk failure (failed state) VS IO errors due to
intermittent transport errors (offline state). (I am not sure how to do
that yet, basically in some means, block layer could help?, RFC ?).

  - A sysfs offline interface, so as to udev update the kernel, when
disk is pulled out.

  - Because even to fail a device it depends on the user requirements,
btrfs IO completion threads instead of directly reacting on an IO
error, it will continue to just report the IO error into device error
statistics, and a spooler up on errors will apply user/system
criticalness as provided by the user on the top, which will decide if
the device has to be marked as failed OR if it can continue to be in
online.

  - A FS load pattern (mostly outside of btrfs-kernel or with in btrfs-
kernel) may pick the right time to replace the failed device, or to run
other FS maintenance activities (balance, scrub) automatically.

  - Sysfs will help user land scripts which may want to bring device to
offline or failed.



Device State flow:

   A device in the btrfs kernel can be in any one of following state:

   Online
      A normal healthy device

   Missing
      Device wasn't found that the time of mount OR device scan.

   Offline (disappeared)
      Device was present at some point in time after the FS was mounted,
      however offlined by user or block layer or hot unplug or device
      experienced transport error. Basically due to any error other than
      media error.
      The device in offline state are not candidate for the replace.
      Since still there is a hope that device may be restored to online
      at some point in time, by user or transport-layer error recovery.
      For device pulled out, there will be udev script which will call
      offline through sysfs. In the long run, we would also need to know
      the block layer to distinguish from the transient write errors
      like writes failing  due to transport error, vs write errors which
      are confirmed as target-device/device-media failure.

   Failed
      Device has confirmed a write/flush failure for at least a block.
      (In general the disk/storage FW will try to relocate the bad block
      on write, it happens automatically and transparent even to the
      block layer. Further there might have been few retry from the block
      layer. And here btrfs assumes that such an attempt has also
      failed). Or it might set device as failed for extensive read
      errors if the user tuned profile demands it.


A btrfs pool can be in one of the state:

Online:
   All the chunks are as configured.

Degraded:
   One or more logical-chunks does not meet the redundancy level that
   user requested / configured.

Failed:
   One or more logical-chunk is incomplete. FS will be in a RO mode Or
   panic -dump as configured.


Flow diagram (also include pool states BTRFS_POOL_STATE_xx along with
device state BTRFS_DEVICE_STATE_xx):


                     [1]
                     BTRFS_DEVICE_STATE_ONLINE,
                      BTRFS_POOL_STATE_ONLINE
                                 |
                                 |
                                 V
                           new IO error
                                 |
                                 |
                                 V
                    check with block layer to know
                   if confirmed media/target:- failed
                 or fix-able transport issue:- offline.
                       and apply user config.
             can be ignored ?  --------------yes->[1]
                                 |
                                 |no
         _______offline__________/\______failed________
         |                                             |
         |                                             |
         V                                             V
(eg: transport issue [*], disk is good)     (eg: write media error)
         |                                             |
         |                                             |
         V                                             V
BTRFS_DEVICE_STATE_OFFLINE                BTRFS_DEVICE_STATE_FAILED
         |                                             |
         |                                             |
         |______________________  _____________________|
                                \/
                                 |
                         Missing chunk ? --NO--> goto [1]
                                 |
                                 |
                          Tolerable? -NO-> FS ERROR. RO.
                                      BTRFS_POOL_STATE_FAILED->remount?
                                 |
                                 |yes
                                 V
                       BTRFS_POOL_STATE_DEGRADED --> rebalance -> [1]
                                 |
         ______offline___________|____failed_________
         |                                           |
         |                                      check priority
         |                                           |
         |                                           |
         |                                      hot spare ?
         |                                    replace --> goto [1]
         |                                           |
         |                                           | no
         |                                           |
         |                                       spare-add
(user/sys notify issue is fixed,         (manual-replace/dev-delete)
   trigger scrub/balance)                            |
         |______________________  ___________________|
                                \/
                                 |
                                 V
                                [1]


Code status:
  Part-1: Provided device transitions from online to failed/offline,
          hot spare and auto replace.
          [PATCH 00/15] btrfs: Hot spare and Auto replace

  Next,
   . Add sysfs part on top of
     [PATCH] btrfs: Introduce device pool sysfs attributes
   . POOL_STATE flow and reporting
   . Device transactions from Offline to Online
   . Btrfs-progs mainly to show device and pool states
   . Apply user tolerance level to the IO errors
----------


Thanks, Anand

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30  7:59 [RFC] Btrfs device and pool management (wip) Anand Jain
@ 2015-11-30 12:43 ` Qu Wenruo
  2015-12-01 18:01   ` Goffredo Baroncelli
  2015-11-30 14:51 ` Austin S Hemmelgarn
  1 sibling, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2015-11-30 12:43 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs, Qu Wenruo



On 11/30/2015 03:59 PM, Anand Jain wrote:
> (fixed alignment)
>
>
> ------------
>   Data center systems are generally aligned with the RAS (Reliability,
> Availability and Serviceability) attributes. When it comes to Storage,
> RAS applies even more because its matter of trust. In this context, one
> of the primary area that a typical volume manager should be well tested
> is, how well RAS attributes are maintained in the context of device
> failure, and its further reporting.
>
>   But, identifying a failed device is not a straight forward code. If
> you look at some statistics performed on failed and returned disks,
> most of the disks ends up being classified as NTF (No Trouble Found).
> That is, host failed-and-replaced a disk even before it has actually
> failed. This is not good for a cost effective setup who would want to
> stretch the life of an intermittently failing device to its maximum
> tenure and would want to replace only when it has confirmed dead.
>
>   Also on the other hand, some of the data center admins would like to
> mitigate the risk (of low performance at peak of their business
> productions) of a potential failure, and prefer to pro-actively replace
> the disk at their low business/workload hours, or they may choose to
> replace a device even for read errors (mainly due to performance
> reasons).
>
>   In short a large variant of real MTF (Mean Time to Failure) for the
> devices across the industries/users.
>
>
> Consideration:
>
>   - Have user-tunable to support different context of usages, which
> should be applied on top of a set of disk IO errors, and its out come
> will be to know if the disk can be failed.

I'm overall OK with your *current* hot-spare implement.
It's quite small and straightforward.
Just hope some more more easy-to-implement features, like hot-remove 
instead of replace. (for degradable case, it would case less IO).
And more test-cases.

And per-filesystem hot-spare device. Global one has its limitation, like 
no priority or choose less proper device.
(use a TB device to replace a GB device, eating up the pool quite easily)
It should be not hard to do, maybe add fsid into hot-spare device 
superblock and modify kernel/user-progs a little.



But if your ultimate goal of *in-kernel* hot-spare is to do such 
complicated *in-kernel police*, I would say *NO* right now before things 
get messed up.
(Yeah, maybe another "discussion" just like feature auto-align)

Kernel should provide *mechanisim*, not *policy*.
(Pretty sure most of us should hear it in one form or another).

In this case, btrfs supports for *replace* is a mechanism. (not 
automatically replace)
But *when* to replace a bad device, is *policy*.


But if you just want to get to that goal, *not restricted to in-kernel 
implement*, it would be much easier to do.

1) Implement a API(maybe sysfs as you suggested) to allow user-space 
programs get informed when a btrfs device get sick(including missing or 
number of IO errors hit a threshold)

2) Write a user-space program listening with that API

3) Trigger a action when device get failed.
    Maybe replace, maybe remove, or just do nothing, fully *tunable* and
    much *easier* to implement.

If use above method, kernel part should be as easy as the following:
1) A new API for user-progs to listen

2) (Optional) Tuning interface for that API
    E.g, threshold of IO error before informing user space

3) Kernel fallback behavior for such error
    Even no need to trigger replace from kernel, but just put the
    filesystem into degraded will be good enough.

3) A user daemon, maybe in btrfs-progs or another project.
    Easy to debug, easy to implement, and you will be the
    maintainer/leader/author of the new project!!

Now all the policy is moved to user-space, kernel is kept small and clean.

>
>   - Distinguish real disk failure (failed state) VS IO errors due to
> intermittent transport errors (offline state). (I am not sure how to do
> that yet, basically in some means, block layer could help?, RFC ?).
>
>   - A sysfs offline interface, so as to udev update the kernel, when
> disk is pulled out.

Yes, that's very *demanding* feature, and if implement correctly, you 
don't ever need to do in-kernel hot-replace.(only need to fallback to 
auto-degrade)
That's the core feature which may keep kernel small and clean.

I am eager to see it before in-kernel hot-spare.

>
>   - Because even to fail a device it depends on the user requirements,
> btrfs IO completion threads instead of directly reacting on an IO
> error, it will continue to just report the IO error into device error
> statistics, and a spooler up on errors will apply user/system
> criticalness as provided by the user on the top, which will decide if
> the device has to be marked as failed OR if it can continue to be in
> online.
>
>   - A FS load pattern (mostly outside of btrfs-kernel or with in btrfs-
> kernel) may pick the right time to replace the failed device, or to run
> other FS maintenance activities (balance, scrub) automatically.

Strongly recommend to move out of kernel.
Although I didn't have a good suggestion about user-space IO performance 
detection.
Maybe periodic sar command?
I hope someone can give suggestion on it.

>
>   - Sysfs will help user land scripts which may want to bring device to
> offline or failed.

Of course.

Thanks,
Qu
>
>
>
> Device State flow:
>
>    A device in the btrfs kernel can be in any one of following state:
>
>    Online
>       A normal healthy device
>
>    Missing
>       Device wasn't found that the time of mount OR device scan.
>
>    Offline (disappeared)
>       Device was present at some point in time after the FS was mounted,
>       however offlined by user or block layer or hot unplug or device
>       experienced transport error. Basically due to any error other than
>       media error.
>       The device in offline state are not candidate for the replace.
>       Since still there is a hope that device may be restored to online
>       at some point in time, by user or transport-layer error recovery.
>       For device pulled out, there will be udev script which will call
>       offline through sysfs. In the long run, we would also need to know
>       the block layer to distinguish from the transient write errors
>       like writes failing  due to transport error, vs write errors which
>       are confirmed as target-device/device-media failure.
>
>    Failed
>       Device has confirmed a write/flush failure for at least a block.
>       (In general the disk/storage FW will try to relocate the bad block
>       on write, it happens automatically and transparent even to the
>       block layer. Further there might have been few retry from the block
>       layer. And here btrfs assumes that such an attempt has also
>       failed). Or it might set device as failed for extensive read
>       errors if the user tuned profile demands it.
>
>
> A btrfs pool can be in one of the state:
>
> Online:
>    All the chunks are as configured.
>
> Degraded:
>    One or more logical-chunks does not meet the redundancy level that
>    user requested / configured.
>
> Failed:
>    One or more logical-chunk is incomplete. FS will be in a RO mode Or
>    panic -dump as configured.
>
>
> Flow diagram (also include pool states BTRFS_POOL_STATE_xx along with
> device state BTRFS_DEVICE_STATE_xx):
>
>
>                      [1]
>                      BTRFS_DEVICE_STATE_ONLINE,
>                       BTRFS_POOL_STATE_ONLINE
>                                  |
>                                  |
>                                  V
>                            new IO error
>                                  |
>                                  |
>                                  V
>                     check with block layer to know
>                    if confirmed media/target:- failed
>                  or fix-able transport issue:- offline.
>                        and apply user config.
>              can be ignored ?  --------------yes->[1]
>                                  |
>                                  |no
>          _______offline__________/\______failed________
>          |                                             |
>          |                                             |
>          V                                             V
> (eg: transport issue [*], disk is good)     (eg: write media error)
>          |                                             |
>          |                                             |
>          V                                             V
> BTRFS_DEVICE_STATE_OFFLINE                BTRFS_DEVICE_STATE_FAILED
>          |                                             |
>          |                                             |
>          |______________________  _____________________|
>                                 \/
>                                  |
>                          Missing chunk ? --NO--> goto [1]
>                                  |
>                                  |
>                           Tolerable? -NO-> FS ERROR. RO.
>                                       BTRFS_POOL_STATE_FAILED->remount?
>                                  |
>                                  |yes
>                                  V
>                        BTRFS_POOL_STATE_DEGRADED --> rebalance -> [1]
>                                  |
>          ______offline___________|____failed_________
>          |                                           |
>          |                                      check priority
>          |                                           |
>          |                                           |
>          |                                      hot spare ?
>          |                                    replace --> goto [1]
>          |                                           |
>          |                                           | no
>          |                                           |
>          |                                       spare-add
> (user/sys notify issue is fixed,         (manual-replace/dev-delete)
>    trigger scrub/balance)                            |
>          |______________________  ___________________|
>                                 \/
>                                  |
>                                  V
>                                 [1]
>
>
> Code status:
>   Part-1: Provided device transitions from online to failed/offline,
>           hot spare and auto replace.
>           [PATCH 00/15] btrfs: Hot spare and Auto replace
>
>   Next,
>    . Add sysfs part on top of
>      [PATCH] btrfs: Introduce device pool sysfs attributes
>    . POOL_STATE flow and reporting
>    . Device transactions from Offline to Online
>    . Btrfs-progs mainly to show device and pool states
>    . Apply user tolerance level to the IO errors
> ----------
>
>
> Thanks, Anand
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30 12:43 ` Qu Wenruo
@ 2015-12-01 18:01   ` Goffredo Baroncelli
  2015-12-01 23:43     ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Goffredo Baroncelli @ 2015-12-01 18:01 UTC (permalink / raw)
  To: Qu Wenruo, Anand Jain, linux-btrfs, Qu Wenruo

On 2015-11-30 13:43, Qu Wenruo wrote:
> 
> 
> On 11/30/2015 03:59 PM, Anand Jain wrote:
>> (fixed alignment)
>>
>>
[...]
> 
> I'm overall OK with your *current* hot-spare implement.
> It's quite small and straightforward.
> Just hope some more more easy-to-implement features, like hot-remove instead of replace. (for degradable case, it would case less IO).
> And more test-cases.
> 
> And per-filesystem hot-spare device. Global one has its limitation, like no priority or choose less proper device.
> (use a TB device to replace a GB device, eating up the pool quite easily)
> It should be not hard to do, maybe add fsid into hot-spare device superblock and modify kernel/user-progs a little.
> 
> 
> 
> But if your ultimate goal of *in-kernel* hot-spare is to do such complicated *in-kernel police*, I would say *NO* right now before things get messed up.
> (Yeah, maybe another "discussion" just like feature auto-align)
> 
> Kernel should provide *mechanisim*, not *policy*.
> (Pretty sure most of us should hear it in one form or another).
> 
> In this case, btrfs supports for *replace* is a mechanism. (not automatically replace)
> But *when* to replace a bad device, is *policy*.
> 
> 
> But if you just want to get to that goal, *not restricted to in-kernel implement*, it would be much easier to do.

+1

> 
> 1) Implement a API(maybe sysfs as you suggested) to allow user-space programs get informed when a btrfs device get sick(including missing or number of IO errors hit a threshold)

This API should be device related and not specific to btrfs: what if the error happens in one partition not used by btrfs, but the disk has another partition used by btrfs ? 

> 
> 2) Write a user-space program listening with that API
> 
> 3) Trigger a action when device get failed.
>    Maybe replace, maybe remove, or just do nothing, fully *tunable* and
>    much *easier* to implement.
> 
> If use above method, kernel part should be as easy as the following:
> 1) A new API for user-progs to listen
> 
> 2) (Optional) Tuning interface for that API
>    E.g, threshold of IO error before informing user space
> 
> 3) Kernel fallback behavior for such error
>    Even no need to trigger replace from kernel, but just put the
>    filesystem into degraded will be good enough.
> 
> 3) A user daemon, maybe in btrfs-progs or another project.
>    Easy to debug, easy to implement, and you will be the
>    maintainer/leader/author of the new project!!
> 
> Now all the policy is moved to user-space, kernel is kept small and clean.

This is the most important thing: we should work to stabilize the current kernel implementation before adding further functionality. BTRFS is 8 year old, but it still needs some work to stabilize. I don't think that we should put further code in kernel space if we could add it in user space.



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-12-01 18:01   ` Goffredo Baroncelli
@ 2015-12-01 23:43     ` Qu Wenruo
  2015-12-02 19:07       ` Goffredo Baroncelli
  0 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2015-12-01 23:43 UTC (permalink / raw)
  To: kreijack, Anand Jain, linux-btrfs, Qu Wenruo



On 12/02/2015 02:01 AM, Goffredo Baroncelli wrote:
> On 2015-11-30 13:43, Qu Wenruo wrote:
>>
>>
>> On 11/30/2015 03:59 PM, Anand Jain wrote:
>>> (fixed alignment)
>>>
>>>
> [...]
>>
>> I'm overall OK with your *current* hot-spare implement.
>> It's quite small and straightforward.
>> Just hope some more more easy-to-implement features, like hot-remove instead of replace. (for degradable case, it would case less IO).
>> And more test-cases.
>>
>> And per-filesystem hot-spare device. Global one has its limitation, like no priority or choose less proper device.
>> (use a TB device to replace a GB device, eating up the pool quite easily)
>> It should be not hard to do, maybe add fsid into hot-spare device superblock and modify kernel/user-progs a little.
>>
>>
>>
>> But if your ultimate goal of *in-kernel* hot-spare is to do such complicated *in-kernel police*, I would say *NO* right now before things get messed up.
>> (Yeah, maybe another "discussion" just like feature auto-align)
>>
>> Kernel should provide *mechanisim*, not *policy*.
>> (Pretty sure most of us should hear it in one form or another).
>>
>> In this case, btrfs supports for *replace* is a mechanism. (not automatically replace)
>> But *when* to replace a bad device, is *policy*.
>>
>>
>> But if you just want to get to that goal, *not restricted to in-kernel implement*, it would be much easier to do.
>
> +1
>
>>
>> 1) Implement a API(maybe sysfs as you suggested) to allow user-space programs get informed when a btrfs device get sick(including missing or number of IO errors hit a threshold)
>
> This API should be device related and not specific to btrfs: what if the error happens in one partition not used by btrfs, but the disk has another partition used by btrfs ?

My idea is btrfs only reports its own result, like how many crc 
read/write errors.

And block layer provides its own listen interface, reporting errors like 
ATA error.

These two interface has their own goal.
Btrfs one can detect bit error but block layer one can detect more like 
power loss or offline.

As it's quite hard to detect all type of low level errors at btrfs level.

>
>>
>> 2) Write a user-space program listening with that API
>>
>> 3) Trigger a action when device get failed.
>>     Maybe replace, maybe remove, or just do nothing, fully *tunable* and
>>     much *easier* to implement.
>>
>> If use above method, kernel part should be as easy as the following:
>> 1) A new API for user-progs to listen
>>
>> 2) (Optional) Tuning interface for that API
>>     E.g, threshold of IO error before informing user space
>>
>> 3) Kernel fallback behavior for such error
>>     Even no need to trigger replace from kernel, but just put the
>>     filesystem into degraded will be good enough.
>>
>> 3) A user daemon, maybe in btrfs-progs or another project.
>>     Easy to debug, easy to implement, and you will be the
>>     maintainer/leader/author of the new project!!
>>
>> Now all the policy is moved to user-space, kernel is kept small and clean.
>
> This is the most important thing: we should work to stabilize the current kernel implementation before adding further functionality. BTRFS is 8 year old, but it still needs some work to stabilize. I don't think that we should put further code in kernel space if we could add it in user space.
>

Yeah, completely right.

Thanks,
Qu

>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-12-01 23:43     ` Qu Wenruo
@ 2015-12-02 19:07       ` Goffredo Baroncelli
  2015-12-02 23:36         ` Qu Wenruo
  0 siblings, 1 reply; 15+ messages in thread
From: Goffredo Baroncelli @ 2015-12-02 19:07 UTC (permalink / raw)
  To: Qu Wenruo, Anand Jain, linux-btrfs, Qu Wenruo

On 2015-12-02 00:43, Qu Wenruo wrote:
[...]
> 
> And block layer provides its own listen interface, reporting errors
> like ATA error.
Could you point me to this kind of interface 



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-12-02 19:07       ` Goffredo Baroncelli
@ 2015-12-02 23:36         ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2015-12-02 23:36 UTC (permalink / raw)
  To: kreijack, Anand Jain, linux-btrfs, Qu Wenruo



On 12/03/2015 03:07 AM, Goffredo Baroncelli wrote:
> On 2015-12-02 00:43, Qu Wenruo wrote:
> [...]
>>
>> And block layer provides its own listen interface, reporting errors
>> like ATA error.
> Could you point me to this kind of interface
>
>
>
Not yet, and that's the problem...

Thanks,
Qu

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30  7:59 [RFC] Btrfs device and pool management (wip) Anand Jain
  2015-11-30 12:43 ` Qu Wenruo
@ 2015-11-30 14:51 ` Austin S Hemmelgarn
  2015-11-30 20:17   ` Chris Murphy
  2015-12-01  0:43   ` Qu Wenruo
  1 sibling, 2 replies; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-30 14:51 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 10349 bytes --]

On 2015-11-30 02:59, Anand Jain wrote:
>   Data center systems are generally aligned with the RAS (Reliability,
> Availability and Serviceability) attributes. When it comes to Storage,
> RAS applies even more because its matter of trust. In this context, one
> of the primary area that a typical volume manager should be well tested
> is, how well RAS attributes are maintained in the context of device
> failure, and its further reporting.
>
>   But, identifying a failed device is not a straight forward code. If
> you look at some statistics performed on failed and returned disks,
> most of the disks ends up being classified as NTF (No Trouble Found).
> That is, host failed-and-replaced a disk even before it has actually
> failed. This is not good for a cost effective setup who would want to
> stretch the life of an intermittently failing device to its maximum
> tenure and would want to replace only when it has confirmed dead.
>
>   Also on the other hand, some of the data center admins would like to
> mitigate the risk (of low performance at peak of their business
> productions) of a potential failure, and prefer to pro-actively replace
> the disk at their low business/workload hours, or they may choose to
> replace a device even for read errors (mainly due to performance
> reasons).
>
>   In short a large variant of real MTF (Mean Time to Failure) for the
> devices across the industries/users.
>
>
> Consideration:
>
>   - Have user-tunable to support different context of usages, which
> should be applied on top of a set of disk IO errors, and its out come
> will be to know if the disk can be failed.
General thoughts on this:
1. If there's a write error, we fail unconditionally right now.  It 
would be nice to have a configurable number of retries before failing.
2. Similar for read errors, possibly with the ability to ignore them 
below some threshold.
3. Kernel initiated link resets should probably be treated differently 
from regular read/write errors, they can indicate other potential 
problems (usually an issue in either the disk electronics or the storage 
controller).
4. Almost all of this is policy, and really should be configurable from 
userspace and have sane defaults (probably just keeping current behavior).
5. Properly differentiating between a media error, a transport error, or 
some other error (such as in the disk electronics or storage controller) 
is not reliably possible with the current state of the block layer and 
the ATA spec (it might be possible with SCSI, but I don't know enough 
about SCSI to be certain).
>
>   - Distinguish real disk failure (failed state) VS IO errors due to
> intermittent transport errors (offline state). (I am not sure how to do
> that yet, basically in some means, block layer could help?, RFC ?).
This gets really tricky.  Ideally, this is really something that needs 
to be done at least partly in userspace, unless we want to teach the 
kernel about SMART attributes and how to query the disk's own idea of 
how healthy it is.  We should also take into consideration the 
possibility of the storage controller failing.
>
>   - A sysfs offline interface, so as to udev update the kernel, when
> disk is pulled out.
This needs proper support in the block layer.  As of now, it assumes 
that if something has an open reference to a block device, that device 
will not be removed.  This simplifies things there, but has undesirable 
implications for stuff like BTRFS or iSCSI/ATAoE/NBD.
>
>   - Because even to fail a device it depends on the user requirements,
> btrfs IO completion threads instead of directly reacting on an IO
> error, it will continue to just report the IO error into device error
> statistics, and a spooler up on errors will apply user/system
> criticalness as provided by the user on the top, which will decide if
> the device has to be marked as failed OR if it can continue to be in
> online.
This is debatably a policy decision, and while it would be wonderful to 
have stuff in the kernel to help userspace with this, it probably 
belongs in userspace.
>
>   - A FS load pattern (mostly outside of btrfs-kernel or with in btrfs-
> kernel) may pick the right time to replace the failed device, or to run
> other FS maintenance activities (balance, scrub) automatically.
This is entirely a policy decision, and as such does not belong in the 
kernel.
>
>   - Sysfs will help user land scripts which may want to bring device to
> offline or failed.
>
>
>
> Device State flow:
>
>    A device in the btrfs kernel can be in any one of following state:
>
>    Online
>       A normal healthy device
>
>    Missing
>       Device wasn't found that the time of mount OR device scan.
>
>    Offline (disappeared)
>       Device was present at some point in time after the FS was mounted,
>       however offlined by user or block layer or hot unplug or device
>       experienced transport error. Basically due to any error other than
>       media error.
>       The device in offline state are not candidate for the replace.
>       Since still there is a hope that device may be restored to online
>       at some point in time, by user or transport-layer error recovery.
>       For device pulled out, there will be udev script which will call
>       offline through sysfs. In the long run, we would also need to know
>       the block layer to distinguish from the transient write errors
>       like writes failing  due to transport error, vs write errors which
>       are confirmed as target-device/device-media failure.
It may be useful to have the ability to transition a device from offline 
to failed after some configurable amount of time.
>
>    Failed
>       Device has confirmed a write/flush failure for at least a block.
>       (In general the disk/storage FW will try to relocate the bad block
>       on write, it happens automatically and transparent even to the
>       block layer. Further there might have been few retry from the block
>       layer. And here btrfs assumes that such an attempt has also
>       failed). Or it might set device as failed for extensive read
>       errors if the user tuned profile demands it.
>
>
> A btrfs pool can be in one of the state:
>
> Online:
>    All the chunks are as configured.
>
> Degraded:
>    One or more logical-chunks does not meet the redundancy level that
>    user requested / configured.
>
> Failed:
>    One or more logical-chunk is incomplete. FS will be in a RO mode Or
>    panic -dump as configured.
>
>
> Flow diagram (also include pool states BTRFS_POOL_STATE_xx along with
> device state BTRFS_DEVICE_STATE_xx):
>
>
>                      [1]
>                      BTRFS_DEVICE_STATE_ONLINE,
>                       BTRFS_POOL_STATE_ONLINE
>                                  |
>                                  |
>                                  V
>                            new IO error
>                                  |
>                                  |
>                                  V
>                     check with block layer to know
>                    if confirmed media/target:- failed
>                  or fix-able transport issue:- offline.
>                        and apply user config.
>              can be ignored ?  --------------yes->[1]
>                                  |
>                                  |no
>          _______offline__________/\______failed________
>          |                                             |
>          |                                             |
>          V                                             V
> (eg: transport issue [*], disk is good)     (eg: write media error)
>          |                                             |
>          |                                             |
>          V                                             V
> BTRFS_DEVICE_STATE_OFFLINE                BTRFS_DEVICE_STATE_FAILED
>          |                                             |
>          |                                             |
>          |______________________  _____________________|
>                                 \/
>                                  |
>                          Missing chunk ? --NO--> goto [1]
>                                  |
>                                  |
>                           Tolerable? -NO-> FS ERROR. RO.
>                                       BTRFS_POOL_STATE_FAILED->remount?
>                                  |
>                                  |yes
>                                  V
>                        BTRFS_POOL_STATE_DEGRADED --> rebalance -> [1]
>                                  |
>          ______offline___________|____failed_________
>          |                                           |
>          |                                      check priority
>          |                                           |
>          |                                           |
>          |                                      hot spare ?
>          |                                    replace --> goto [1]
>          |                                           |
>          |                                           | no
>          |                                           |
>          |                                       spare-add
> (user/sys notify issue is fixed,         (manual-replace/dev-delete)
>    trigger scrub/balance)                            |
>          |______________________  ___________________|
>                                 \/
>                                  |
>                                  V
>                                 [1]
>
>
> Code status:
>   Part-1: Provided device transitions from online to failed/offline,
>           hot spare and auto replace.
>           [PATCH 00/15] btrfs: Hot spare and Auto replace
>
>   Next,
>    . Add sysfs part on top of
>      [PATCH] btrfs: Introduce device pool sysfs attributes
>    . POOL_STATE flow and reporting
>    . Device transactions from Offline to Online
>    . Btrfs-progs mainly to show device and pool states
>    . Apply user tolerance level to the IO errors



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30 14:51 ` Austin S Hemmelgarn
@ 2015-11-30 20:17   ` Chris Murphy
  2015-11-30 20:37     ` Austin S Hemmelgarn
  2015-12-09  4:39     ` Christoph Anton Mitterer
  2015-12-01  0:43   ` Qu Wenruo
  1 sibling, 2 replies; 15+ messages in thread
From: Chris Murphy @ 2015-11-30 20:17 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Anand Jain, Btrfs BTRFS

On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:

> General thoughts on this:
> 1. If there's a write error, we fail unconditionally right now.  It would be
> nice to have a configurable number of retries before failing.

I'm unconvinced. I pretty much immediately do not trust a block device
that fails even a single write, and I'd expect the file system to
quickly get confused if it can't rely on flushing pending writes to
that device. Unless Btrfs gets into the business of tracking bad
sectors (failed writes), the block device is a gonor upon a single
write failure, although it could still be reliable for reads.

Possibly reasonable, is the user indicting a preference for what
happens after the max number of write failures is exceeded:

- Volume goes degraded: Faulty block device is ignored entirely,
degraded writes permitted.
- Volumes goes ro: Faulty block device is still used for reads,
degraded writes not permitted.

As far as I know, md and lvm only do the former. And md/mdadm did
recently get the ability to support bad block maps so it can continue
using drives lacking reserve sectors (typically that's the reason for
write failures on conventional rotational drives).

> 2. Similar for read errors, possibly with the ability to ignore them below
> some threshold.

Agreed. Maybe it would be an error rate (set by ratio)?

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30 20:17   ` Chris Murphy
@ 2015-11-30 20:37     ` Austin S Hemmelgarn
  2015-11-30 21:09       ` Chris Murphy
  2015-12-09  4:39     ` Christoph Anton Mitterer
  1 sibling, 1 reply; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-30 20:37 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Anand Jain, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 3145 bytes --]

On 2015-11-30 15:17, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> General thoughts on this:
>> 1. If there's a write error, we fail unconditionally right now.  It would be
>> nice to have a configurable number of retries before failing.
>
> I'm unconvinced. I pretty much immediately do not trust a block device
> that fails even a single write, and I'd expect the file system to
> quickly get confused if it can't rely on flushing pending writes to
> that device. Unless Btrfs gets into the business of tracking bad
> sectors (failed writes), the block device is a gonor upon a single
> write failure, although it could still be reliable for reads.
I've had multiple cases of disks that got one write error then were fine 
for more than a year before any further issues.  My thought is add an 
option to retry that single write after some short delay (1-2s maybe), 
and if it still fails, then mark the disk as failed.  This will provide 
an option for people like me who don't want to need to immediately 
replace a disk when it hits a write error.  (Possibly add some counter 
in and if we get another write error within a given period of time, we 
just kick the disk instead of retrying).  Transient errors do happen, 
and in some cases more often than people would expect.  We should 
reasonably account for this.

This discussion actually brings to mind the rather annoying behavior of 
some of the proprietary NAS systems we have where I work.  They check 
SMART attributes on a regular basis, and if anything the disk firmware 
marks as pre-failure changes at all, it kicks the disk from the RAID 
array.  It only kicks on a change though, so you can just disconnect and 
reconnect the disk itself, and it accepts it as a new disk as long as 
the attribute didn't cross the threshold the disk firmware lists.  (I 
discovered this rather short-sighted behavior by accident, but I've used 
the old disks in other systems just fine for months with no issue 
whatsoever).
>
> Possibly reasonable, is the user indicting a preference for what
> happens after the max number of write failures is exceeded:
>
> - Volume goes degraded: Faulty block device is ignored entirely,
> degraded writes permitted.
> - Volumes goes ro: Faulty block device is still used for reads,
> degraded writes not permitted.
>
> As far as I know, md and lvm only do the former. And md/mdadm did
> recently get the ability to support bad block maps so it can continue
> using drives lacking reserve sectors (typically that's the reason for
> write failures on conventional rotational drives).
>
>
>
>> 2. Similar for read errors, possibly with the ability to ignore them below
>> some threshold.
>
> Agreed. Maybe it would be an error rate (set by ratio)?
>
I was thinking of either:
a. A running count, using the current error counting mechanisms, with 
some max number allowed before the device gets kicked.
b. A count that decays over time, this would need two tunables (how long 
an error is considered, and how many are allowed).



[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3019 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30 20:37     ` Austin S Hemmelgarn
@ 2015-11-30 21:09       ` Chris Murphy
  2015-12-01 10:05         ` Brendan Hide
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-11-30 21:09 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Anand Jain, Btrfs BTRFS

On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:

> I've had multiple cases of disks that got one write error then were fine for
> more than a year before any further issues.  My thought is add an option to
> retry that single write after some short delay (1-2s maybe), and if it still
> fails, then mark the disk as failed.

Seems reasonable.

>> Agreed. Maybe it would be an error rate (set by ratio)?
>>
> I was thinking of either:
> a. A running count, using the current error counting mechanisms, with some
> max number allowed before the device gets kicked.
> b. A count that decays over time, this would need two tunables (how long an
> error is considered, and how many are allowed).


OK.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30 21:09       ` Chris Murphy
@ 2015-12-01 10:05         ` Brendan Hide
  2015-12-01 13:11           ` Brendan Hide
  0 siblings, 1 reply; 15+ messages in thread
From: Brendan Hide @ 2015-12-01 10:05 UTC (permalink / raw)
  To: Chris Murphy, Austin S Hemmelgarn; +Cc: Anand Jain, Btrfs BTRFS

On 11/30/2015 11:09 PM, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> I've had multiple cases of disks that got one write error then were fine for
>> more than a year before any further issues.  My thought is add an option to
>> retry that single write after some short delay (1-2s maybe), and if it still
>> fails, then mark the disk as failed.
> Seems reasonable.
I think I added this to the Project Ideas page on the wiki a *very* long 
time ago
https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation

"After a device is marked as unreliable, maintain the device within the 
FS in order to confirm the issue persists. The device will still 
contribute toward fs performance but will not be treated as if 
contributing towards replication/reliability. If the device shows that 
the given errors were a once-off issue then the device can be marked as 
reliable once again. This will mitigate further unnecessary rebalance. 
See http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - 
"[Drive Resurrection]" as an example of where this is a significant 
feature for storage vendors."
>
>>> Agreed. Maybe it would be an error rate (set by ratio)?
>>>
>> I was thinking of either:
>> a. A running count, using the current error counting mechanisms, with some
>> max number allowed before the device gets kicked.
>> b. A count that decays over time, this would need two tunables (how long an
>> error is considered, and how many are allowed).
>
> OK.
>
>
>


-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-12-01 10:05         ` Brendan Hide
@ 2015-12-01 13:11           ` Brendan Hide
  0 siblings, 0 replies; 15+ messages in thread
From: Brendan Hide @ 2015-12-01 13:11 UTC (permalink / raw)
  To: Chris Murphy, Austin S Hemmelgarn; +Cc: Anand Jain, Btrfs BTRFS

On 12/1/2015 12:05 PM, Brendan Hide wrote:
> On 11/30/2015 11:09 PM, Chris Murphy wrote:
>> On Mon, Nov 30, 2015 at 1:37 PM, Austin S Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>>> I've had multiple cases of disks that got one write error then were 
>>> fine for
>>> more than a year before any further issues.  My thought is add an 
>>> option to
>>> retry that single write after some short delay (1-2s maybe), and if 
>>> it still
>>> fails, then mark the disk as failed.
>> Seems reasonable.
> I think I added this to the Project Ideas page on the wiki a *very* 
> long time ago
> https://btrfs.wiki.kernel.org/index.php/Project_ideas#False_alarm_on_bad_disk_-_rebuild_mitigation 
>
>
> "After a device is marked as unreliable, maintain the device within 
> the FS in order to confirm the issue persists. The device will still 
> contribute toward fs performance but will not be treated as if 
> contributing towards replication/reliability. If the device shows that 
> the given errors were a once-off issue then the device can be marked 
> as reliable once again. This will mitigate further unnecessary 
> rebalance. See 
> http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/ - "[Drive 
> Resurrection]" as an example of where this is a significant feature 
> for storage vendors."
Related, a separate section on that same page mentions a Jeff Mahoney. 
Perhaps he should be consulted or his work should be looked into:
Take device with heavy IO errors offline or mark as "unreliable"
"Devices should be taken offline after they reach a given threshold of 
IO errors. Jeff Mahoney works on handling EIO errors (among others), 
this project can build on top of it."

>>
>>>> Agreed. Maybe it would be an error rate (set by ratio)?
>>>>
>>> I was thinking of either:
>>> a. A running count, using the current error counting mechanisms, 
>>> with some
>>> max number allowed before the device gets kicked.
>>> b. A count that decays over time, this would need two tunables (how 
>>> long an
>>> error is considered, and how many are allowed).
>>
>> OK.
>>
>>
>>
>
>


-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30 20:17   ` Chris Murphy
  2015-11-30 20:37     ` Austin S Hemmelgarn
@ 2015-12-09  4:39     ` Christoph Anton Mitterer
  1 sibling, 0 replies; 15+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-09  4:39 UTC (permalink / raw)
  To: Chris Murphy, Austin S Hemmelgarn; +Cc: Anand Jain, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 3888 bytes --]

On Mon, 2015-11-30 at 13:17 -0700, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> 
> > General thoughts on this:
> > 1. If there's a write error, we fail unconditionally right now.  It
> > would be
> > nice to have a configurable number of retries before failing.
> 
> I'm unconvinced. I pretty much immediately do not trust a block
> device
> that fails even a single write, and I'd expect the file system to
> quickly get confused if it can't rely on flushing pending writes to
> that device.
From my large-amounts-of-storage-admin PoV,... I'd say it would be nice
to have more knobs to control when exactly a device is considered no
longer perfectly fine, which can include several different stages like:
- perhaps unreliable
  e.g. maybe the device shows SMART problems or there were correctable 
  read and/or write errors under a certain threshold (either in total,
  or per time period)
  Then I could imagine that one can control whether the device is put 
  - continued to be normally used until certain error thresholds are
    exceeded.
  - placed in a mode where data is still written to, but only when
    there's a duplicate on at least on other good device,... so the
    device would be used as read pool
    maybe optionally, data already on the device is auto-replicated to
    good devices
  - offline (perhaps only to be automatically reused in case of
    emergency (as a hot spare) when the fs knows that otherwise it's

even more likely that data would be lost soon
- failed
  the threshold
from above has been reached, the fs suspects the
  device to completely
fail soon
  Possible knobs would include how aggressively data is tried
to move
  of the device.
  How often should retries be made? In case the
other devices are
  under high IO load how much percentage should be
used to get the
  still working data of the bad device (i.e. up to 100%,
meaning 
  "rather stop any other IO, just to move the data to good
devices 
  ASAP)? 
- dead
  accesses don't work anymore at all an the fs shouldn't even waste 
  time trying to read/recover data from it.

It would also make sense to allow tuning what conditions need be met to
e.g. consider a drive unreliable (e.g. which SMART errors?) and to
allow an admin to manually place a drive in a certain state (e.g. SMART
would be still good, no IO errors so far, but the drive is 5 year old
and I better want to consider it unreliable).

That's - to some extent - what we at our LHC Tier-2 do at higher levels
(partly simply by human management, partly via the storage management
system we use (dCache), partly by RAID and other tools and scripting).

In any case, though,... any of these knobs should IMHO default to the
most conservative settings.
In other words: If a device shows the slightest hint of being
unstable/unreliable/failed... it should be considered bad, no new data
should go on it (if necessary, because not enough other devices are
left, the fs should get ro).
The only thing I wouldn't have a opinion is: should the fs go ro and do
nothing, waiting for a human to decide what's next, or should it go ro
and (if possible) try to move data off the bad device (per default).

Generally, a filesystem should be safe per default (which is why I see
the issue in the other thread with the corruption/security leaks in
case of UUID collisions quite a showstopper).
From the admin side, I don't want to be required to make it safe,.. my
interaction should rather only be needed to tune things.

Of course I'm aware that btrfs brings several techniques which make it
unavoidable that more maintenance is put into the filesystem, but, per
default, this should be minimised as far as possible.

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] Btrfs device and pool management (wip)
  2015-11-30 14:51 ` Austin S Hemmelgarn
  2015-11-30 20:17   ` Chris Murphy
@ 2015-12-01  0:43   ` Qu Wenruo
  1 sibling, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2015-12-01  0:43 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Anand Jain, linux-btrfs



Austin S Hemmelgarn wrote on 2015/11/30 09:51 -0500:
> On 2015-11-30 02:59, Anand Jain wrote:
>>   Data center systems are generally aligned with the RAS (Reliability,
>> Availability and Serviceability) attributes. When it comes to Storage,
>> RAS applies even more because its matter of trust. In this context, one
>> of the primary area that a typical volume manager should be well tested
>> is, how well RAS attributes are maintained in the context of device
>> failure, and its further reporting.
>>
>>   But, identifying a failed device is not a straight forward code. If
>> you look at some statistics performed on failed and returned disks,
>> most of the disks ends up being classified as NTF (No Trouble Found).
>> That is, host failed-and-replaced a disk even before it has actually
>> failed. This is not good for a cost effective setup who would want to
>> stretch the life of an intermittently failing device to its maximum
>> tenure and would want to replace only when it has confirmed dead.
>>
>>   Also on the other hand, some of the data center admins would like to
>> mitigate the risk (of low performance at peak of their business
>> productions) of a potential failure, and prefer to pro-actively replace
>> the disk at their low business/workload hours, or they may choose to
>> replace a device even for read errors (mainly due to performance
>> reasons).
>>
>>   In short a large variant of real MTF (Mean Time to Failure) for the
>> devices across the industries/users.
>>
>>
>> Consideration:
>>
>>   - Have user-tunable to support different context of usages, which
>> should be applied on top of a set of disk IO errors, and its out come
>> will be to know if the disk can be failed.
> General thoughts on this:
> 1. If there's a write error, we fail unconditionally right now.  It
> would be nice to have a configurable number of retries before failing.
> 2. Similar for read errors, possibly with the ability to ignore them
> below some threshold.

Already stated by Chris Murphy, it's better to let user tune this behavior.
Based on pure counter or counter during a time.

Btrfs has already have counter based one as "btrfs device status", but 
seems not working all the time.

> 3. Kernel initiated link resets should probably be treated differently
> from regular read/write errors, they can indicate other potential
> problems (usually an issue in either the disk electronics or the storage
> controller).

This one seems a little hard to distinguish for btrfs.
As it can only get result from bio layer.
But if we have above threshold interface, it would be configurable to 
workaround it.

> 4. Almost all of this is policy, and really should be configurable from
> userspace and have sane defaults (probably just keeping current behavior).

Can't agree any more!!

> 5. Properly differentiating between a media error, a transport error, or
> some other error (such as in the disk electronics or storage controller)
> is not reliably possible with the current state of the block layer and
> the ATA spec (it might be possible with SCSI, but I don't know enough
> about SCSI to be certain).

The same multi-layer problem, this involve bio layer and even driver layer.
Btrfs doesn't has such good judgment, as it only knows whether a bio 
operation is done correctly.

At least for btrfs, it may only be able to do read/write error count 
threshold.

But it may be possible for user-space daemon to handle them, e.g a btrfs 
daemon watching not only the "btrfs dev status" data, but also lower 
level info(such as driver log or something like that).

And it may be better than completely relying on btrfs.

>>
>>   - Distinguish real disk failure (failed state) VS IO errors due to
>> intermittent transport errors (offline state). (I am not sure how to do
>> that yet, basically in some means, block layer could help?, RFC ?).
> This gets really tricky.  Ideally, this is really something that needs
> to be done at least partly in userspace, unless we want to teach the
> kernel about SMART attributes and how to query the disk's own idea of
> how healthy it is.  We should also take into consideration the
> possibility of the storage controller failing.
>>
>>   - A sysfs offline interface, so as to udev update the kernel, when
>> disk is pulled out.
> This needs proper support in the block layer.  As of now, it assumes
> that if something has an open reference to a block device, that device
> will not be removed.  This simplifies things there, but has undesirable
> implications for stuff like BTRFS or iSCSI/ATAoE/NBD.

What about user daemon listen on the device offline interface provided 
by block layer?

Btrfs may not be able to detect such thing, but if user-space detect it 
and trigger a replace/remove, I think it won't be a big problem.

Thanks,
Qu

>>
>>   - Because even to fail a device it depends on the user requirements,
>> btrfs IO completion threads instead of directly reacting on an IO
>> error, it will continue to just report the IO error into device error
>> statistics, and a spooler up on errors will apply user/system
>> criticalness as provided by the user on the top, which will decide if
>> the device has to be marked as failed OR if it can continue to be in
>> online.
> This is debatably a policy decision, and while it would be wonderful to
> have stuff in the kernel to help userspace with this, it probably
> belongs in userspace.
>>
>>   - A FS load pattern (mostly outside of btrfs-kernel or with in btrfs-
>> kernel) may pick the right time to replace the failed device, or to run
>> other FS maintenance activities (balance, scrub) automatically.
> This is entirely a policy decision, and as such does not belong in the
> kernel.
>>
>>   - Sysfs will help user land scripts which may want to bring device to
>> offline or failed.
>>
>>
>>
>> Device State flow:
>>
>>    A device in the btrfs kernel can be in any one of following state:
>>
>>    Online
>>       A normal healthy device
>>
>>    Missing
>>       Device wasn't found that the time of mount OR device scan.
>>
>>    Offline (disappeared)
>>       Device was present at some point in time after the FS was mounted,
>>       however offlined by user or block layer or hot unplug or device
>>       experienced transport error. Basically due to any error other than
>>       media error.
>>       The device in offline state are not candidate for the replace.
>>       Since still there is a hope that device may be restored to online
>>       at some point in time, by user or transport-layer error recovery.
>>       For device pulled out, there will be udev script which will call
>>       offline through sysfs. In the long run, we would also need to know
>>       the block layer to distinguish from the transient write errors
>>       like writes failing  due to transport error, vs write errors which
>>       are confirmed as target-device/device-media failure.
> It may be useful to have the ability to transition a device from offline
> to failed after some configurable amount of time.
>>
>>    Failed
>>       Device has confirmed a write/flush failure for at least a block.
>>       (In general the disk/storage FW will try to relocate the bad block
>>       on write, it happens automatically and transparent even to the
>>       block layer. Further there might have been few retry from the block
>>       layer. And here btrfs assumes that such an attempt has also
>>       failed). Or it might set device as failed for extensive read
>>       errors if the user tuned profile demands it.
>>
>>
>> A btrfs pool can be in one of the state:
>>
>> Online:
>>    All the chunks are as configured.
>>
>> Degraded:
>>    One or more logical-chunks does not meet the redundancy level that
>>    user requested / configured.
>>
>> Failed:
>>    One or more logical-chunk is incomplete. FS will be in a RO mode Or
>>    panic -dump as configured.
>>
>>
>> Flow diagram (also include pool states BTRFS_POOL_STATE_xx along with
>> device state BTRFS_DEVICE_STATE_xx):
>>
>>
>>                      [1]
>>                      BTRFS_DEVICE_STATE_ONLINE,
>>                       BTRFS_POOL_STATE_ONLINE
>>                                  |
>>                                  |
>>                                  V
>>                            new IO error
>>                                  |
>>                                  |
>>                                  V
>>                     check with block layer to know
>>                    if confirmed media/target:- failed
>>                  or fix-able transport issue:- offline.
>>                        and apply user config.
>>              can be ignored ?  --------------yes->[1]
>>                                  |
>>                                  |no
>>          _______offline__________/\______failed________
>>          |                                             |
>>          |                                             |
>>          V                                             V
>> (eg: transport issue [*], disk is good)     (eg: write media error)
>>          |                                             |
>>          |                                             |
>>          V                                             V
>> BTRFS_DEVICE_STATE_OFFLINE                BTRFS_DEVICE_STATE_FAILED
>>          |                                             |
>>          |                                             |
>>          |______________________  _____________________|
>>                                 \/
>>                                  |
>>                          Missing chunk ? --NO--> goto [1]
>>                                  |
>>                                  |
>>                           Tolerable? -NO-> FS ERROR. RO.
>>                                       BTRFS_POOL_STATE_FAILED->remount?
>>                                  |
>>                                  |yes
>>                                  V
>>                        BTRFS_POOL_STATE_DEGRADED --> rebalance -> [1]
>>                                  |
>>          ______offline___________|____failed_________
>>          |                                           |
>>          |                                      check priority
>>          |                                           |
>>          |                                           |
>>          |                                      hot spare ?
>>          |                                    replace --> goto [1]
>>          |                                           |
>>          |                                           | no
>>          |                                           |
>>          |                                       spare-add
>> (user/sys notify issue is fixed,         (manual-replace/dev-delete)
>>    trigger scrub/balance)                            |
>>          |______________________  ___________________|
>>                                 \/
>>                                  |
>>                                  V
>>                                 [1]
>>
>>
>> Code status:
>>   Part-1: Provided device transitions from online to failed/offline,
>>           hot spare and auto replace.
>>           [PATCH 00/15] btrfs: Hot spare and Auto replace
>>
>>   Next,
>>    . Add sysfs part on top of
>>      [PATCH] btrfs: Introduce device pool sysfs attributes
>>    . POOL_STATE flow and reporting
>>    . Device transactions from Offline to Online
>>    . Btrfs-progs mainly to show device and pool states
>>    . Apply user tolerance level to the IO errors
>
>



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC] Btrfs device and pool management (wip)
@ 2015-11-30  7:54 Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2015-11-30  7:54 UTC (permalink / raw)
  To: linux-btrfs



------------
  Data center systems are generally aligned with the RAS (Reliability,
Availability and Serviceability) attributes. When it comes to Storage,
RAS applies even more because its matter of trust. In this context, one
of the primary area that a typical volume manager should be well tested
is, how well RAS attributes are maintained in the context of device
failure, and its further reporting.

  But, identifying a failed device is not a straight forward code. If
you look at some statistics performed on failed and returned disks,
most of the disks ends up being classified as NTF (No Trouble Found).
That is, host failed-and-replaced a disk even before it has actually
failed. This is not good for a cost effective setup who would want to
stretch the life of an intermittently failing device to its maximum
tenure and would want to replace only when it has confirmed dead.

  Also on the other hand, some of the data center admins would like to
mitigate the risk (of low performance at peak of their business
productions) of a potential failure, and prefer to pro-actively replace
the disk at their low business/workload hours, or they may choose to
replace a device even for read errors (mainly due to performance
reasons).

  In short a large variant of real MTF (Mean Time to Failure) for the
devices across the industries/users.


Consideration:

  - Have user-tunable to support different context of usages, which 
should be applied on top of a set of disk IO errors, and its out come 
will be to know if the disk can be failed.

  - Distinguish real disk failure (failed state) VS IO errors due to 
intermittent transport errors (offline state). (I am not sure how to do 
that yet, basically in some means, block layer could help?, RFC ?).

  - A sysfs offline interface, so as to udev update the kernel, when 
disk is pulled out.

  - Because even to fail a device it depends on the user requirements, 
btrfs IO completion threads instead of directly reacting on an IO error, 
it will continue to just report the IO error into device error 
statistics, and a spooler up on errors will apply user/system 
criticalness as provided by the user on the top, which will decide if 
the device has to be marked as failed OR if it can continue to be in online.

  - A FS load pattern (mostly outside of btrfs-kernel or with in 
btrfs-kernel) may pick the right time to replace the failed device, or 
to run other FS maintenance activities (balance, scrub) automatically.

  - Sysfs will help user land scripts which may want to bring device to 
offline or failed.



Device State flow:

   A device in the btrfs kernel can be in any one of following state:

   Online
      A normal healthy device

   Missing
      Device wasn't found that the time of mount OR device scan.

   Offline (disappeared)
      Device was present at some point in time after the FS was mounted,
      however offlined by user or block layer or hot unplug or device
      experienced transport error. Basically due to any error other than
      media error.
      The device in offline state are not candidate for the replace.
      Since still there is a hope that device may be restored to online
      at some point in time, by user or transport-layer error recovery.
      For device pulled out, there will be udev script which will call
      offline through sysfs. In the long run, we would also need to know
      the block layer to distinguish from the transient write errors
      like writes failing  due to transport error, vs write errors which
      are confirmed as target-device/device-media failure.

   Failed
      Device has confirmed a write/flush failure for at least a block.
      (In general the disk/storage FW will try to relocate the bad block
      on write, it happens automatically and transparent even to the
      block layer. Further there might have been few retry from the block
      layer. And here btrfs assumes that such an attempt has also
      failed). Or it might set device as failed for extensive read
      errors if the user tuned profile demands it.


A btrfs pool can be in one of the state:

Online:
   All the chunks are as configured.

Degraded:
   One or more logical-chunks does not meet the redundancy level that
   user requested / configured.

Failed:
   One or more logical-chunk is incomplete. FS will be in a RO mode Or
   panic -dump as configured.


Flow diagram (also include pool states BTRFS_POOL_STATE_xx along with 
device state BTRFS_DEVICE_STATE_xx):


                     [1]
                     BTRFS_DEVICE_STATE_ONLINE,
                      BTRFS_POOL_STATE_ONLINE
                                 |
                                 |
                                 V
                           new IO error
                                 |
                                 |
                                 V
                    check with block layer to know
                   if confirmed media/target:- failed
                 or fix-able transport issue:- offline.
                       and apply user config.
			can be ignored ?  --------------yes->[1]
                                 |
                                 |no
         _______offline__________/\______failed________
         |                                             |
         |                                             |
         V                                             V
(eg: transport issue [*], disk is good)     (eg: write media error)
         |                                             |
         |                                             |
         V                                             V
BTRFS_DEVICE_STATE_OFFLINE                BTRFS_DEVICE_STATE_FAILED
         |                                             |
         |                                             |
         |______________________  _____________________|
                                \/
                                 |
                         Missing chunk ? --NO--> goto [1]
                                 |
                                 |
                          Tolerable? -NO-> FS ERROR. RO.
                                      BTRFS_POOL_STATE_FAILED->remount?
                                 |
                                 |yes
                                 V
                       BTRFS_POOL_STATE_DEGRADED --> rebalance -> [1]
                                 |
         ______offline___________|____failed_________
         |                                           |
         |                                      check priority
         |                                           |
         |                                           |
         |                                      hot spare ?
         |                                    replace --> goto [1]
         |                                           |
         |                                           | no
         |                                           |
         |                                       spare-add
(user/sys notify issue is fixed,         (manual-replace/dev-delete)
   trigger scrub/balance)                            |
         |______________________  ___________________|
                                \/
                                 |
                                 V
                                [1]


Code status:
  Part-1: Provided device transitions from online to failed/offline,
          hot spare and auto replace.
          [PATCH 00/15] btrfs: Hot spare and Auto replace

  Next,
   . Add sysfs part on top of
     [PATCH] btrfs: Introduce device pool sysfs attributes
   . POOL_STATE flow and reporting
   . Device transactions from Offline to Online
   . Btrfs-progs mainly to show device and pool states
   . Apply user tolerance level to the IO errors
----------


Thanks, Anand


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-12-09  4:39 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-30  7:59 [RFC] Btrfs device and pool management (wip) Anand Jain
2015-11-30 12:43 ` Qu Wenruo
2015-12-01 18:01   ` Goffredo Baroncelli
2015-12-01 23:43     ` Qu Wenruo
2015-12-02 19:07       ` Goffredo Baroncelli
2015-12-02 23:36         ` Qu Wenruo
2015-11-30 14:51 ` Austin S Hemmelgarn
2015-11-30 20:17   ` Chris Murphy
2015-11-30 20:37     ` Austin S Hemmelgarn
2015-11-30 21:09       ` Chris Murphy
2015-12-01 10:05         ` Brendan Hide
2015-12-01 13:11           ` Brendan Hide
2015-12-09  4:39     ` Christoph Anton Mitterer
2015-12-01  0:43   ` Qu Wenruo
  -- strict thread matches above, loose matches on Subject: below --
2015-11-30  7:54 Anand Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).