From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ig0-f179.google.com ([209.85.213.179]:35754 "EHLO mail-ig0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754044AbbK3OvV (ORCPT ); Mon, 30 Nov 2015 09:51:21 -0500
Received: by igl9 with SMTP id 9so65726091igl.0 for ; Mon, 30 Nov 2015 06:51:20 -0800 (PST)
Subject: Re: [RFC] Btrfs device and pool management (wip)
To: Anand Jain , linux-btrfs@vger.kernel.org
References: <565C01F1.5030108@oracle.com>
From: Austin S Hemmelgarn
Message-ID: <565C625C.7060503@gmail.com>
Date: Mon, 30 Nov 2015 09:51:08 -0500
MIME-Version: 1.0
In-Reply-To: <565C01F1.5030108@oracle.com>
Content-Type: multipart/signed; protocol="application/pkcs7-signature"; micalg=sha-512; boundary="------------ms030008040103070709090006"
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

This is a cryptographically signed message in MIME format.

--------------ms030008040103070709090006
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable

On 2015-11-30 02:59, Anand Jain wrote:
> Data center systems are generally aligned with the RAS (Reliability,
> Availability and Serviceability) attributes. When it comes to Storage,
> RAS applies even more because its matter of trust. In this context, one
> of the primary area that a typical volume manager should be well tested
> is, how well RAS attributes are maintained in the context of device
> failure, and its further reporting.
>
> But, identifying a failed device is not a straight forward code. If
> you look at some statistics performed on failed and returned disks,
> most of the disks ends up being classified as NTF (No Trouble Found).
> That is, host failed-and-replaced a disk even before it has actually
> failed. This is not good for a cost effective setup who would want to
> stretch the life of an intermittently failing device to its maximum
> tenure and would want to replace only when it has confirmed dead.
>
> Also on the other hand, some of the data center admins would like to
> mitigate the risk (of low performance at peak of their business
> productions) of a potential failure, and prefer to pro-actively replace
> the disk at their low business/workload hours, or they may choose to
> replace a device even for read errors (mainly due to performance
> reasons).
>
> In short a large variant of real MTF (Mean Time to Failure) for the
> devices across the industries/users.
>
>
> Consideration:
>
>   - Have user-tunable to support different context of usages, which
> should be applied on top of a set of disk IO errors, and its out come
> will be to know if the disk can be failed.
General thoughts on this:
1. If there's a write error, we fail unconditionally right now.  It
would be nice to have a configurable number of retries before failing.
2. Similarly for read errors, possibly with the ability to ignore them
below some threshold.
3. Kernel-initiated link resets should probably be treated differently
from regular read/write errors, since they can indicate other potential
problems (usually an issue in either the disk electronics or the
storage controller).
4. Almost all of this is policy, and really should be configurable from
userspace and have sane defaults (probably just keeping current
behavior).
5. Properly differentiating between a media error, a transport error,
or some other error (such as in the disk electronics or storage
controller) is not reliably possible with the current state of the
block layer and the ATA spec (it might be possible with SCSI, but I
don't know enough about SCSI to be certain).
>
>   - Distinguish real disk failure (failed state) VS IO errors due to
> intermittent transport errors (offline state). (I am not sure how to do
> that yet, basically in some means, block layer could help?, RFC ?).
This gets really tricky.
Ideally, this is really something that needs to be done at least
partly in userspace, unless we want to teach the kernel about SMART
attributes and how to query the disk's own idea of how healthy it is.
We should also take into consideration the possibility of the storage
controller failing.
>
>   - A sysfs offline interface, so as to udev update the kernel, when
> disk is pulled out.
This needs proper support in the block layer.  As of now, the block
layer assumes that if something holds an open reference to a block
device, that device will not be removed.  This simplifies things
there, but has undesirable implications for stuff like BTRFS or
iSCSI/ATAoE/NBD.
>
>   - Because even to fail a device it depends on the user requirements,
> btrfs IO completion threads instead of directly reacting on an IO
> error, it will continue to just report the IO error into device error
> statistics, and a spooler up on errors will apply user/system
> criticalness as provided by the user on the top, which will decide if
> the device has to be marked as failed OR if it can continue to be in
> online.
This is arguably a policy decision, and while it would be wonderful to
have stuff in the kernel to help userspace with this, it probably
belongs in userspace.
>
>   - A FS load pattern (mostly outside of btrfs-kernel or with in
> btrfs-kernel) may pick the right time to replace the failed device,
> or to run other FS maintenance activities (balance, scrub)
> automatically.
This is entirely a policy decision, and as such does not belong in the
kernel.
>
>   - Sysfs will help user land scripts which may want to bring device to
> offline or failed.
>
>
>
> Device State flow:
>
>    A device in the btrfs kernel can be in any one of following state:
>
>    Online
>       A normal healthy device
>
>    Missing
>       Device wasn't found that the time of mount OR device scan.
>
>    Offline (disappeared)
>       Device was present at some point in time after the FS was mounted,
>       however offlined by user or block layer or hot unplug or device
>       experienced transport error. Basically due to any error other than
>       media error.
>       The device in offline state are not candidate for the replace.
>       Since still there is a hope that device may be restored to online
>       at some point in time, by user or transport-layer error recovery.
>       For device pulled out, there will be udev script which will call
>       offline through sysfs. In the long run, we would also need to know
>       the block layer to distinguish from the transient write errors
>       like writes failing due to transport error, vs write errors which
>       are confirmed as target-device/device-media failure.
It may be useful to have the ability to transition a device from offline
to failed after some configurable amount of time.
>
>    Failed
>       Device has confirmed a write/flush failure for at least a block.
>       (In general the disk/storage FW will try to relocate the bad block
>       on write, it happens automatically and transparent even to the
>       block layer. Further there might have been few retry from the block
>       layer. And here btrfs assumes that such an attempt has also
>       failed). Or it might set device as failed for extensive read
>       errors if the user tuned profile demands it.
>
>
> A btrfs pool can be in one of the state:
>
> Online:
>    All the chunks are as configured.
>
> Degraded:
>    One or more logical-chunks does not meet the redundancy level that
>    user requested / configured.
>
> Failed:
>    One or more logical-chunk is incomplete. FS will be in a RO mode Or
>    panic -dump as configured.
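The device states quoted above, together with the offline-to-failed
timeout suggested earlier, can be sketched in userspace roughly as
follows.  This is only an illustration under stated assumptions: the
state names mirror the ones in the RFC, but `DeviceTracker`,
`offline_timeout`, and the method names are hypothetical and do not
correspond to any existing btrfs or btrfs-progs interface.

```python
import enum
import time


class DeviceState(enum.Enum):
    """Device states as named in the RFC."""
    ONLINE = "online"
    MISSING = "missing"
    OFFLINE = "offline"
    FAILED = "failed"


class DeviceTracker:
    """Tracks the state of one device; hypothetical, not a btrfs API."""

    def __init__(self, offline_timeout, clock=time.monotonic):
        self.offline_timeout = offline_timeout  # seconds before offline -> failed
        self.clock = clock                      # injectable so it can be tested
        self.state = DeviceState.ONLINE
        self.offline_since = None

    def transport_error(self):
        """Transport trouble or hot unplug: the device may still come back."""
        if self.state is DeviceState.ONLINE:
            self.state = DeviceState.OFFLINE
            self.offline_since = self.clock()

    def media_error(self):
        """Confirmed write/flush failure: mark the device failed."""
        self.state = DeviceState.FAILED
        self.offline_since = None

    def restored(self):
        """User or transport-layer recovery brought an offline device back."""
        if self.state is DeviceState.OFFLINE:
            self.state = DeviceState.ONLINE
            self.offline_since = None

    def tick(self):
        """Promote offline to failed once the configured timeout has elapsed."""
        if (self.state is DeviceState.OFFLINE
                and self.clock() - self.offline_since >= self.offline_timeout):
            self.state = DeviceState.FAILED
            self.offline_since = None
        return self.state


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
dev = DeviceTracker(offline_timeout=300, clock=lambda: now[0])
dev.transport_error()
now[0] = 299.0
print(dev.tick().name)  # OFFLINE
now[0] = 300.0
print(dev.tick().name)  # FAILED
```

In this sketch a failed device never returns to online by itself, which
matches the idea that only offline devices are expected to recover and
failed ones are candidates for replacement.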
>
>
> Flow diagram (also include pool states BTRFS_POOL_STATE_xx along with
> device state BTRFS_DEVICE_STATE_xx):
>
>
>                      [1]
>                       BTRFS_DEVICE_STATE_ONLINE,
>                       BTRFS_POOL_STATE_ONLINE
>                                  |
>                                  |
>                                  V
>                           new IO error
>                                  |
>                                  |
>                                  V
>                     check with block layer to know
>                    if confirmed media/target:- failed
>                  or fix-able transport issue:- offline.
>                        and apply user config.
>              can be ignored ?  --------------yes->[1]
>                                  |
>                                  |no
>          _______offline__________/\______failed________
>          |                                             |
>          |                                             |
>          V                                             V
> (eg: transport issue [*], disk is good)     (eg: write media error)
>          |                                             |
>          |                                             |
>          V                                             V
> BTRFS_DEVICE_STATE_OFFLINE                BTRFS_DEVICE_STATE_FAILED
>          |                                             |
>          |                                             |
>          |______________________  _____________________|
>                                 \/
>                                  |
>                          Missing chunk ? --NO--> goto [1]
>                                  |
>                                  |
>                           Tolerable? -NO-> FS ERROR. RO.
>                                       BTRFS_POOL_STATE_FAILED->remount?
>                                  |
>                                  |yes
>                                  V
>                        BTRFS_POOL_STATE_DEGRADED --> rebalance -> [1]
>                                  |
>          ______offline___________|____failed_________
>          |                                           |
>          |                                      check priority
>          |                                           |
>          |                                           |
>          |                                      hot spare ?
>          |                                    replace --> goto [1]
>          |                                           |
>          |                                           | no
>          |                                           |
>          |                                       spare-add
> (user/sys notify issue is fixed,         (manual-replace/dev-delete)
>   trigger scrub/balance)                             |
>          |______________________  ___________________|
>                                 \/
>                                  |
>                                  V
>                                 [1]
>
>
> Code status:
>   Part-1: Provided device transitions from online to failed/offline,
>           hot spare and auto replace.
>           [PATCH 00/15] btrfs: Hot spare and Auto replace
>
>   Next,
>    . Add sysfs part on top of
>      [PATCH] btrfs: Introduce device pool sysfs attributes
>    . POOL_STATE flow and reporting
>    . Device transactions from Offline to Online
>    . Btrfs-progs mainly to show device and pool states
>    . Apply user tolerance level to the IO errors

--------------ms030008040103070709090006--
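On the last item, applying a user tolerance level to IO errors: the
retry and threshold ideas from the general thoughts earlier in this
reply could look roughly like the userspace sketch below.  All names
here (`ErrorPolicy`, `ErrorStats`, `write_retries`,
`read_error_threshold`, `should_fail`) are hypothetical.  The write
default mimics the current behavior of failing on the first write
error; the read default of never failing on read errors is an
assumption made for the sketch, not a statement about what the kernel
does today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorPolicy:
    """User-tunable tolerance; hypothetical knobs, not an existing interface."""
    write_retries: int = 0                      # 0 = fail on the first write error
    read_error_threshold: Optional[int] = None  # None = never fail on reads (assumption)


@dataclass
class ErrorStats:
    """Per-device error counters accumulated by whatever reports IO errors."""
    write_errors: int = 0
    read_errors: int = 0


def should_fail(policy: ErrorPolicy, stats: ErrorStats) -> bool:
    """Return True when the accumulated errors exceed the user's tolerance."""
    # Each retry absorbs one write error; one more than that fails the device.
    if stats.write_errors > policy.write_retries:
        return True
    # Read errors below the threshold are ignored.
    if (policy.read_error_threshold is not None
            and stats.read_errors >= policy.read_error_threshold):
        return True
    return False


# The same counters judged under two different policies.
stats = ErrorStats(write_errors=1, read_errors=5)
strict = ErrorPolicy()
lenient = ErrorPolicy(write_retries=3, read_error_threshold=100)
print(should_fail(strict, stats))   # True
print(should_fail(lenient, stats))  # False
```

Keeping the decision in one small function like this is what makes it
easy to leave in userspace: the kernel only has to export the counters,
and the policy can change without touching the filesystem code.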