* Recommended why to use btrfs for production?
@ 2016-06-03  9:49 Martin
  2016-06-03  9:53 ` Marc Haber
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Martin @ 2016-06-03  9:49 UTC (permalink / raw)
  To: linux-btrfs
Hello,
We would like to use urBackup to make laptop backups, and they mention
btrfs as an option.
https://www.urbackup.org/administration_manual.html#x1-8400010.6
So if we go with btrfs and we need 100TB usable space in raid6, and to
have it replicated each night to another btrfs server for "backup" of
the backup, how should we then install btrfs?
E.g. Should we use the latest Fedora, CentOS, Ubuntu, Ubuntu LTS, or
should we compile the kernel ourselves?
And a bonus question: How stable is raid6 and detecting and replacing
failed drives?
-RC
^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03  9:49 Recommended why to use btrfs for production? Martin
@ 2016-06-03  9:53 ` Marc Haber
  2016-06-03  9:57   ` Martin
  2016-06-03 10:01 ` Hans van Kranenburg
  2016-06-03 12:55 ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 28+ messages in thread
From: Marc Haber @ 2016-06-03  9:53 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Jun 03, 2016 at 11:49:09AM +0200, Martin wrote:
> We would like to use urBackup to make laptop backups, and they mention
> btrfs as an option.
>
> https://www.urbackup.org/administration_manual.html#x1-8400010.6
>
> So if we go with btrfs and we need 100TB usable space in raid6, and to
> have it replicated each night to another btrfs server for "backup" of
> the backup, how should we then install btrfs?

Do you plan to use Snapshots? How many of them?

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03  9:53 ` Marc Haber
@ 2016-06-03  9:57   ` Martin
  0 siblings, 0 replies; 28+ messages in thread
From: Martin @ 2016-06-03  9:57 UTC (permalink / raw)
  To: Marc Haber; +Cc: linux-btrfs

> Do you plan to use Snapshots? How many of them?

Yes, minimum 7 for each day of the week. Nice to have would be 4 extra
for each week of the month and then 12 for each month of the year.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03  9:49 Recommended why to use btrfs for production? Martin
  2016-06-03  9:53 ` Marc Haber
@ 2016-06-03 10:01 ` Hans van Kranenburg
  2016-06-03 10:15   ` Martin
  2016-06-03 12:55 ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 28+ messages in thread
From: Hans van Kranenburg @ 2016-06-03 10:01 UTC (permalink / raw)
  To: Martin, linux-btrfs

Hi Martin,

On 06/03/2016 11:49 AM, Martin wrote:
>
> We would like to use urBackup to make laptop backups, and they mention
> btrfs as an option.
>
> [...]
>
> And a bonus question: How stable is raid6 and detecting and replacing
> failed drives?

Before trying RAID5/6 in production, be sure to read posts like these:

http://www.spinics.net/lists/linux-btrfs/msg55642.html

o/
Hans van Kranenburg
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03 10:01 ` Hans van Kranenburg
@ 2016-06-03 10:15   ` Martin
  0 siblings, 0 replies; 28+ messages in thread
From: Martin @ 2016-06-03 10:15 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: linux-btrfs

> Before trying RAID5/6 in production, be sure to read posts like these:
>
> http://www.spinics.net/lists/linux-btrfs/msg55642.html

Very interesting post and very recent even.

If I decide to try raid6 and of course everything is replicated each
day (for a bit of a safety net), and disks begin to fail, how much
help will I likely get from this list to recover?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03  9:49 Recommended why to use btrfs for production? Martin
  2016-06-03  9:53 ` Marc Haber
  2016-06-03 10:01 ` Hans van Kranenburg
@ 2016-06-03 12:55 ` Austin S. Hemmelgarn
  2016-06-03 13:31   ` Martin
  2016-06-03 14:05   ` Chris Murphy
  2 siblings, 2 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-03 12:55 UTC (permalink / raw)
  To: Martin, linux-btrfs

On 2016-06-03 05:49, Martin wrote:
> Hello,
>
> We would like to use urBackup to make laptop backups, and they mention
> btrfs as an option.
>
> https://www.urbackup.org/administration_manual.html#x1-8400010.6
>
> So if we go with btrfs and we need 100TB usable space in raid6, and to
> have it replicated each night to another btrfs server for "backup" of
> the backup, how should we then install btrfs?
>
> E.g. Should we use the latest Fedora, CentOS, Ubuntu, Ubuntu LTS, or
> should we compile the kernel ourselves?
In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well
as most enterprise distros, they all tend to back-port patches instead
of using newer kernels, which means it's functionally impossible to
provide good support for them here (because we can't know for sure what
exactly they've back-ported). I'd suggest building your own kernel if
possible, with Arch Linux being a close second (they follow upstream
very closely), followed by Fedora and non-LTS Ubuntu.
>
> And a bonus question: How stable is raid6 and detecting and replacing
> failed drives?
Do not use BTRFS raid6 mode in production, it has at least 2 known
serious bugs that may cause complete loss of the array due to a disk
failure. Both of these issues have as of yet unknown trigger
conditions, although they do seem to occur more frequently with larger
arrays.

That said, there are other options. If you have enough disks, you can
run BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you
with the benefits of both.

Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1,
which actually gets relatively decent performance and can provide even
better guarantees than RAID6 would (depending on how you set it up, you
can lose a lot more disks safely). If you go this way, I'd suggest
setting up disks in pairs at the lower level, and then just let BTRFS
handle spanning the data across disks (BTRFS raid1 mode keeps exactly
two copies of each block). While this is not quite as efficient as just
doing LVM based RAID6 with a traditional FS on top, it's also a lot
easier to handle reshaping the array on-line because of the device
management in BTRFS itself.
^ permalink raw reply [flat|nested] 28+ messages in thread
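To put rough numbers on the space cost of the layouts discussed above, here is a minimal sketch. It only does the raw-capacity arithmetic: the 8 TB disk size and the function names are assumptions for illustration and do not come from the thread, and filesystem overhead is ignored.

```python
import math


def disks_for_copies(usable_tb, disk_tb, copies):
    """Smallest disk count that holds `usable_tb` when every block is stored `copies` times."""
    return math.ceil(usable_tb * copies / disk_tb)


def disks_for_raid6(usable_tb, disk_tb):
    """RAID6 needs the data disks plus two disks' worth of parity."""
    return math.ceil(usable_tb / disk_tb) + 2


USABLE_TB = 100  # target from the original question
DISK_TB = 8      # hypothetical disk size, not from the thread

# BTRFS raid1 (2 copies) on top of MD/LVM RAID1 pairs (2 copies each): 4 copies in total.
layered_raid1 = disks_for_copies(USABLE_TB, DISK_TB, copies=4)
# BTRFS raid1 directly on the disks: 2 copies.
plain_raid1 = disks_for_copies(USABLE_TB, DISK_TB, copies=2)
# One RAID6 array with a traditional filesystem on top.
single_raid6 = disks_for_raid6(USABLE_TB, DISK_TB)

print(layered_raid1, plain_raid1, single_raid6)
```

The gap between the first and last figure is the efficiency trade-off mentioned above; a real 100TB RAID6 would more likely be split into several smaller arrays, which this sketch does not model.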
* Re: Recommended why to use btrfs for production?
  2016-06-03 12:55 ` Austin S. Hemmelgarn
@ 2016-06-03 13:31   ` Martin
  2016-06-03 13:47     ` Julian Taylor
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:05   ` Chris Murphy
  1 sibling, 2 replies; 28+ messages in thread
From: Martin @ 2016-06-03 13:31 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs

> In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
> most enterprise distros, they all tend to back-port patches instead of using
> newer kernels, which means it's functionally impossible to provide good
> support for them here (because we can't know for sure what exactly they've
> back-ported). I'd suggest building your own kernel if possible, with Arch
> Linux being a close second (they follow upstream very closely), followed by
> Fedora and non-LTS Ubuntu.

Then I would build my own, if that is the preferred option.

> Do not use BTRFS raid6 mode in production, it has at least 2 known serious
> bugs that may cause complete loss of the array due to a disk failure. Both
> of these issues have as of yet unknown trigger conditions, although they do
> seem to occur more frequently with larger arrays.

Ok. No raid6.

> That said, there are other options. If you have enough disks, you can run
> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
> benefits of both.
>
> Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, which
> actually gets relatively decent performance and can provide even better
> guarantees than RAID6 would (depending on how you set it up, you can lose a
> lot more disks safely). If you go this way, I'd suggest setting up disks in
> pairs at the lower level, and then just let BTRFS handle spanning the data
> across disks (BTRFS raid1 mode keeps exactly two copies of each block).
> While this is not quite as efficient as just doing LVM based RAID6 with a
> traditional FS on top, it's also a lot easier to handle reshaping the array
> on-line because of the device management in BTRFS itself.

Right now I only have 10TB of backup data, but this will grow when
urbackup is rolled out. So maybe I could get away with plain btrfs
raid10 for the first year, and then re-balance to raid6 when the two
bugs have been found...

Is the failed disk handling in btrfs raid10 considered stable?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03 13:31   ` Martin
@ 2016-06-03 13:47     ` Julian Taylor
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 28+ messages in thread
From: Julian Taylor @ 2016-06-03 13:47 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs

On 06/03/2016 03:31 PM, Martin wrote:
>> In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
>> most enterprise distros, they all tend to back-port patches instead of using
>> newer kernels, which means it's functionally impossible to provide good
>> support for them here (because we can't know for sure what exactly they've
>> back-ported). I'd suggest building your own kernel if possible, with Arch
>> Linux being a close second (they follow upstream very closely), followed by
>> Fedora and non-LTS Ubuntu.
>
> Then I would build my own, if that is the preferred option.
>

Ubuntu also provides newer kernels for their LTS via the Hardware
Enablement Stack:

https://wiki.ubuntu.com/Kernel/LTSEnablementStack

So if you can live with a time lag of about 6 months and shorter
support for the non-LTS versions of those kernels, that is a good
option. As you can see, 16.04 currently provides 4.4 and the next
update will likely be 4.8.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03 13:31   ` Martin
  2016-06-03 13:47     ` Julian Taylor
@ 2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:39       ` Martin
                         ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-03 14:21 UTC (permalink / raw)
  To: Martin; +Cc: linux-btrfs

On 2016-06-03 09:31, Martin wrote:
>> In general, avoid Ubuntu LTS versions when dealing with BTRFS, as well as
>> most enterprise distros, they all tend to back-port patches instead of using
>> newer kernels, which means it's functionally impossible to provide good
>> support for them here (because we can't know for sure what exactly they've
>> back-ported). I'd suggest building your own kernel if possible, with Arch
>> Linux being a close second (they follow upstream very closely), followed by
>> Fedora and non-LTS Ubuntu.
>
> Then I would build my own, if that is the preferred option.
If you do go this route, make sure to keep an eye on the mailing list,
as this is usually where any bugs get reported. New bugs have
thankfully been decreasing in number each release, but they do still
happen, and it's important to know what to avoid and what to look out
for when dealing with something under such active development.
>
>> Do not use BTRFS raid6 mode in production, it has at least 2 known serious
>> bugs that may cause complete loss of the array due to a disk failure. Both
>> of these issues have as of yet unknown trigger conditions, although they do
>> seem to occur more frequently with larger arrays.
>
> Ok. No raid6.
>
>> That said, there are other options. If you have enough disks, you can run
>> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
>> benefits of both.
>>
>> Alternatively, you could use BTRFS raid1 on top of LVM or MD RAID1, which
>> actually gets relatively decent performance and can provide even better
>> guarantees than RAID6 would (depending on how you set it up, you can lose a
>> lot more disks safely). If you go this way, I'd suggest setting up disks in
>> pairs at the lower level, and then just let BTRFS handle spanning the data
>> across disks (BTRFS raid1 mode keeps exactly two copies of each block).
>> While this is not quite as efficient as just doing LVM based RAID6 with a
>> traditional FS on top, it's also a lot easier to handle reshaping the array
>> on-line because of the device management in BTRFS itself.
>
> Right now I only have 10TB of backup data, but this will grow when
> urbackup is rolled out. So maybe I could get away with plain btrfs
> raid10 for the first year, and then re-balance to raid6 when the two
> bugs have been found...
>
> Is the failed disk handling in btrfs raid10 considered stable?
>
I would say it is, but I also don't have quite as much experience with
it as with BTRFS raid1 mode. The one thing I do know for certain about
it is that even if it theoretically could recover from two failed disks
(ie, if they're from different positions in the striping of each
mirror), there is no code to actually do so, so make sure you replace
any failed disks as soon as possible (or at least balance the array so
that you don't have a missing device anymore).

Most of my systems where I would run raid10 mode are set up as BTRFS
raid1 on top of two LVM based RAID0 volumes, as this gets measurably
better performance than BTRFS raid10 mode at the moment (I see roughly
a 10-20% difference on my home server system), and provides the same
data safety guarantees as well. It's worth noting for such a setup that
the current default block size in BTRFS is 16k except on very small
filesystems, so you may want a larger stripe size than you would on a
traditional filesystem.

As far as BTRFS raid10 mode in general, there are a few things that are
important to remember about it:
1. It stores exactly two copies of everything, any extra disks just add
to the stripe length on each copy.
2. Because each stripe has the same number of disks as its mirrored
partner, the total number of disks in any chunk allocation will always
be even, which means that if you're using an odd number of disks, there
will always be one left out of every chunk. This has limited impact on
actual performance usually, but can cause confusing results if you have
differently sized disks.
3. BTRFS (whether using raid10, raid0, or even raid5/6) will always try
to use as many devices as possible for a stripe. As a result of this,
the moment you add a new disk, the total length of all new stripes will
adjust to fit the new configuration. If you want maximal performance
when adding new disks, make sure to balance the rest of the filesystem
afterwards, otherwise any existing stripes will just stay the same
size.
^ permalink raw reply [flat|nested] 28+ messages in thread
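The three points above can be sketched as a small calculation of how many devices one raid10 chunk uses. This is only an illustration of the description in the message, not the kernel's allocator; the function names are hypothetical, and the four-device minimum is an assumption about BTRFS raid10 at the time of the thread.

```python
def raid10_chunk_devices(total_devices):
    """Devices used by one raid10 chunk: the largest even count available."""
    if total_devices < 4:
        # Assumed minimum for raid10; not stated in the thread.
        raise ValueError("raid10 is assumed to need at least 4 devices")
    return total_devices - (total_devices % 2)


def stripe_width_per_copy(total_devices):
    """Each of the two copies is striped across half of the chunk's devices."""
    return raid10_chunk_devices(total_devices) // 2


for n in (4, 5, 6, 7):
    used = raid10_chunk_devices(n)
    # With an odd device count, one device sits out of every chunk (point 2).
    print(n, used, stripe_width_per_copy(n), n - used)
```

Going from 4 to 6 devices widens each copy's stripe from 2 to 3, but per point 3 only chunks allocated after the change get the new width until the filesystem is balanced.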
* Re: Recommended why to use btrfs for production?
  2016-06-03 14:21     ` Austin S. Hemmelgarn
@ 2016-06-03 14:39       ` Martin
  2016-06-03 19:09       ` Christoph Anton Mitterer
  2016-06-09  6:16       ` Duncan
  2 siblings, 0 replies; 28+ messages in thread
From: Martin @ 2016-06-03 14:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

> I would say it is, but I also don't have quite as much experience with it as
> with BTRFS raid1 mode. The one thing I do know for certain about it is that
> even if it theoretically could recover from two failed disks (ie, if they're
> from different positions in the striping of each mirror), there is no code
> to actually do so, so make sure you replace any failed disks as soon as
> possible (or at least balance the array so that you don't have a missing
> device anymore).

Ok, so that really speaks for raid1...

> Most of my systems where I would run raid10 mode are set up as BTRFS raid1
> on top of two LVM based RAID0 volumes, as this gets measurably better
> performance than BTRFS raid10 mode at the moment (I see roughly a 10-20%
> difference on my home server system), and provides the same data safety
> guarantees as well. It's worth noting for such a setup that the current
> default block size in BTRFS is 16k except on very small filesystems, so you
> may want a larger stripe size than you would on a traditional filesystem.
>
> As far as BTRFS raid10 mode in general, there are a few things that are
> important to remember about it:
> 1. It stores exactly two copies of everything, any extra disks just add to
> the stripe length on each copy.
> 2. Because each stripe has the same number of disks as its mirrored
> partner, the total number of disks in any chunk allocation will always be
> even, which means that if you're using an odd number of disks, there will
> always be one left out of every chunk. This has limited impact on actual
> performance usually, but can cause confusing results if you have differently
> sized disks.
> 3. BTRFS (whether using raid10, raid0, or even raid5/6) will always try to
> use as many devices as possible for a stripe. As a result of this, the
> moment you add a new disk, the total length of all new stripes will adjust
> to fit the new configuration. If you want maximal performance when adding
> new disks, make sure to balance the rest of the filesystem afterwards,
> otherwise any existing stripes will just stay the same size.

Those are very good things to know!
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:39       ` Martin
@ 2016-06-03 19:09       ` Christoph Anton Mitterer
  2016-06-09  6:16       ` Duncan
  2 siblings, 0 replies; 28+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-03 19:09 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 166 bytes --]

Hey.

Does anyone know whether the write hole issues have been fixed already?
https://btrfs.wiki.kernel.org/index.php/RAID56 still mentions it.

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03 14:21     ` Austin S. Hemmelgarn
  2016-06-03 14:39       ` Martin
  2016-06-03 19:09       ` Christoph Anton Mitterer
@ 2016-06-09  6:16       ` Duncan
  2016-06-09 11:38         ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2016-06-09  6:16 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
excerpted:

> As far as BTRFS raid10 mode in general, there are a few things that are
> important to remember about it:
> 1. It stores exactly two copies of everything, any extra disks just add
> to the stripe length on each copy.

I'll add one more, potentially very important, related to this one:

Btrfs raid mode (any of them) works in relation to individual chunks,
*NOT* individual devices.

What that means for btrfs raid10 in combination with the above exactly
two copies rule, is that it works rather differently than a standard
raid10, which can tolerate loss of two devices as long as they're from
the same mirror set, as the other mirror set will then still be whole.
Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
a second device close to assures loss of data, because the very likely
true assumption is that both mirror sets will be affected for some
chunks, but not others.

By using a layered approach, btrfs raid1 on top (for its error correction
from the other copy feature) of a pair of mdraid0s, you force one of the
btrfs raid1 copies to each of the mdraid0s, thus making allocation more
deterministic than btrfs raid10, and can thus again tolerate loss of two
devices, as long as they're from the same underlying mdraid0.

(Traditionally, raid1 on top of raid0 is called raid01, and is
discouraged compared to raid10, raid0 on top of raid1, because device
failure and replacement with the latter triggers a much more localized
rebuild than the former, across the pair of devices in the raid1 when
it's closest to the physical devices, across the whole array, one raid0
to the other, when the raid1 is on top. However, btrfs raid1's data
integrity and error repair from the good mirror feature is generally
considered to be useful enough to be worth the rebuild-inefficiency of
the raid01 design.)

So in regard to failure tolerance, btrfs raid10 is far closer to
traditional raid5, loss of a single device is tolerated, loss of a second
before a repair is complete generally means data loss -- there's not the
chance of it being on the same mirror set to save you that traditional
raid10 has.

Similarly, btrfs raid10 doesn't have the cleanly separate pair of mirrors
on raid0 arrays that traditional raid10 does, thus doesn't have the fault
tolerance of losing say the connection or power to one entire device
bank, as long as it's all one mirror set, that traditional raid10 has.

And again, doing the layered thing with btrfs raid1 on top and mdraid0
(or whatever else) underneath gets that back for you, if you set it up
that way, of course.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
^ permalink raw reply [flat|nested] 28+ messages in thread
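As a rough illustration of the layered case described above (btrfs raid1 on top of two mdraid0 halves), the sketch below enumerates every possible two-device failure and counts those confined to a single half, which are the ones that leave the other raid1 copy whole. The function name and the device labels are hypothetical, and the model ignores everything except which half a failed device belongs to.

```python
from itertools import combinations


def survivable_pairs_layered(devices_per_raid0):
    """Return (survivable, total) two-device failures for raid1 over two raid0 halves."""
    half_a = [("a", i) for i in range(devices_per_raid0)]
    half_b = [("b", i) for i in range(devices_per_raid0)]
    pairs = list(combinations(half_a + half_b, 2))
    # A double failure is survivable only when both devices sit in the same half.
    survivable = sum(1 for first, second in pairs if first[0] == second[0])
    return survivable, len(pairs)


for width in (2, 3, 4):
    ok, total = survivable_pairs_layered(width)
    print(width * 2, ok, total)
```

By contrast, the description above treats any second device loss on btrfs raid10 as likely data loss, so the corresponding survivable count there would be taken as zero.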
* Re: Recommended why to use btrfs for production?
  2016-06-09  6:16       ` Duncan
@ 2016-06-09 11:38         ` Austin S. Hemmelgarn
  2016-06-09 17:39           ` Chris Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Austin S. Hemmelgarn @ 2016-06-09 11:38 UTC (permalink / raw)
  To: linux-btrfs

On 2016-06-09 02:16, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
> excerpted:
>
>> As far as BTRFS raid10 mode in general, there are a few things that are
>> important to remember about it:
>> 1. It stores exactly two copies of everything, any extra disks just add
>> to the stripe length on each copy.
>
> I'll add one more, potentially very important, related to this one:
>
> Btrfs raid mode (any of them) works in relation to individual chunks,
> *NOT* individual devices.
>
> What that means for btrfs raid10 in combination with the above exactly
> two copies rule, is that it works rather differently than a standard
> raid10, which can tolerate loss of two devices as long as they're from
> the same mirror set, as the other mirror set will then still be whole.
> Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
> a second device close to assures loss of data, because the very likely
> true assumption is that both mirror sets will be affected for some
> chunks, but not others.
Actually, that's not _quite_ the case. Assuming that you have an even
number of devices, BTRFS raid10 will currently always span all the
available devices with two striped copies of the data (if there's an
odd number, it spans one less than the total, and rotates which one
gets left out of each chunk). This means that as long as all the
devices are the same size and you have stripes that are the full width
of the array (you can end up with shorter ones if you have run in
degraded mode or expanded the array), your probability of data loss
per-chunk goes down as you add more devices (because the probability of
a two device failure affecting both copies of a stripe in a given chunk
decreases), but goes up as you add more chunks (because you then have
to apply that probability for each individual chunk). Once you've lost
one disk, the probability that losing another will compromise a
specific chunk is:
1/(N - 1)
Where N is the total number of devices.
The probability that it will compromise _any_ chunk is:
(1/(N - 1))/C
Where C is the total number of chunks
BTRFS raid1 mode actually has the exact same probabilities, but they
apply even if you have an odd number of disks.
>
> By using a layered approach, btrfs raid1 on top (for its error correction
> from the other copy feature) of a pair of mdraid0s, you force one of the
> btrfs raid1 copies to each of the mdraid0s, thus making allocation more
> deterministic than btrfs raid10, and can thus again tolerate loss of two
> devices, as long as they're from the same underlying mdraid0.
>
> (Traditionally, raid1 on top of raid0 is called raid01, and is
> discouraged compared to raid10, raid0 on top of raid1, because device
> failure and replacement with the latter triggers a much more localized
> rebuild than the former, across the pair of devices in the raid1 when
> it's closest to the physical devices, across the whole array, one raid0
> to the other, when the raid1 is on top. However, btrfs raid1's data
> integrity and error repair from the good mirror feature is generally
> considered to be useful enough to be worth the rebuild-inefficiency of
> the raid01 design.)
>
> So in regard to failure tolerance, btrfs raid10 is far closer to
> traditional raid5, loss of a single device is tolerated, loss of a second
> before a repair is complete generally means data loss -- there's not the
> chance of it being on the same mirror set to save you that traditional
> raid10 has.
>
> Similarly, btrfs raid10 doesn't have the cleanly separate pair of mirrors
> on raid0 arrays that traditional raid10 does, thus doesn't have the fault
> tolerance of losing say the connection or power to one entire device
> bank, as long as it's all one mirror set, that traditional raid10 has.
>
> And again, doing the layered thing with btrfs raid1 on top and mdraid0
> (or whatever else) underneath gets that back for you, if you set it up
> that way, of course.
And will get you better performance than just BTRFS most of the time too.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-09 11:38         ` Austin S. Hemmelgarn
@ 2016-06-09 17:39           ` Chris Murphy
  2016-06-09 19:57             ` Duncan
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2016-06-09 17:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

On Thu, Jun 9, 2016 at 5:38 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-06-09 02:16, Duncan wrote:
>>
>> Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
>> excerpted:
>>
>>> As far as BTRFS raid10 mode in general, there are a few things that are
>>> important to remember about it:
>>> 1. It stores exactly two copies of everything, any extra disks just add
>>> to the stripe length on each copy.
>>
>>
>> I'll add one more, potentially very important, related to this one:
>>
>> Btrfs raid mode (any of them) works in relation to individual chunks,
>> *NOT* individual devices.
>>
>> What that means for btrfs raid10 in combination with the above exactly
>> two copies rule, is that it works rather differently than a standard
>> raid10, which can tolerate loss of two devices as long as they're from
>> the same mirror set, as the other mirror set will then still be whole.
>> Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
>> a second device close to assures loss of data, because the very likely
>> true assumption is that both mirror sets will be affected for some
>> chunks, but not others.
>
> Actually, that's not _quite_ the case. Assuming that you have an even
> number of devices, BTRFS raid10 will currently always span all the available
> devices with two striped copies of the data (if there's an odd number, it
> spans one less than the total, and rotates which one gets left out of each
> chunk). This means that as long as all the devices are the same size and
> you have stripes that are the full width of the array (you can end up
> with shorter ones if you have run in degraded mode or expanded the array),
> your probability of data loss per-chunk goes down as you add more devices
> (because the probability of a two device failure affecting both copies of a
> stripe in a given chunk decreases), but goes up as you add more chunks
> (because you then have to apply that probability for each individual chunk).
> Once you've lost one disk, the probability that losing another will
> compromise a specific chunk is:
> 1/(N - 1)
> Where N is the total number of devices.
> The probability that it will compromise _any_ chunk is:
> (1/(N - 1))/C
> Where C is the total number of chunks
> BTRFS raid1 mode actually has the exact same probabilities, but they apply
> even if you have an odd number of disks.

Yeah but somewhere there's a chunk that's likely affected by two
losses, with a probability much higher than for conventional raid10
where such a loss is very binary: if the loss is a mirrored pair, the
whole array and filesystem implodes; if the loss does not affect an
entire mirrored pair, the whole array survives.

The thing with Btrfs raid 10 is you can't really tell in advance to
what degree you have loss. It's not a binary condition, it has a gray
area where a lot of data can still be retrieved, but the instant you
hit missing data it's a loss, and if you hit missing metadata then the
fs will either go read only or crash, it just can't continue. So that
"walking on egg shells" behavior in a 2+ drive loss is really
different from a conventional raid10 where it's either gonna
completely work or completely fail.

-- 
Chris Murphy
^ permalink raw reply [flat|nested] 28+ messages in thread
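A quick numeric check of the two expressions quoted above may help. The sketch below is illustrative only: it takes the per-chunk figure 1/(N - 1) as given and, as an assumption not made anywhere in the thread, treats chunks as independent, so the chance that at least one of C chunks is hit becomes 1 - (1 - 1/(N - 1))^C. That form grows with C, in line with the statement that risk goes up as chunks are added, whereas dividing by C would make it shrink. The function names are hypothetical.

```python
def p_chunk_lost(n_devices):
    """Per-chunk loss probability after a second device failure, as quoted: 1/(N - 1)."""
    if n_devices < 2:
        raise ValueError("need at least two devices")
    return 1.0 / (n_devices - 1)


def p_any_chunk_lost(n_devices, n_chunks):
    """Probability that at least one chunk is lost, ASSUMING chunks are independent."""
    p = p_chunk_lost(n_devices)
    return 1.0 - (1.0 - p) ** n_chunks


for chunks in (1, 10, 100):
    print(chunks, round(p_any_chunk_lost(8, chunks), 4))
```

Under this assumption even a modest chunk count pushes the figure close to 1, which is consistent with the point made in the reply that some chunk somewhere is likely to be affected by a two-device loss.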
* Re: Recommended why to use btrfs for production?
  2016-06-09 17:39           ` Chris Murphy
@ 2016-06-09 19:57             ` Duncan
  0 siblings, 0 replies; 28+ messages in thread
From: Duncan @ 2016-06-09 19:57 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 09 Jun 2016 11:39:23 -0600 as excerpted:

> Yeah but somewhere there's a chunk that's likely affected by two losses,
> with a probability much higher than for conventional raid10 where such a
> loss is very binary: if the loss is a mirrored pair, the whole array and
> filesystem implodes; if the loss does not affect an entire mirrored
> pair, the whole array survives.
>
> The thing with Btrfs raid 10 is you can't really tell in advance to what
> degree you have loss. It's not a binary condition, it has a gray area
> where a lot of data can still be retrieved, but the instant you hit
> missing data it's a loss, and if you hit missing metadata then the fs
> will either go read only or crash, it just can't continue. So that
> "walking on egg shells" behavior in a 2+ drive loss is really different
> from a conventional raid10 where it's either gonna completely work or
> completely fail.

Yes, thanks, CMurphy. That's exactly what I was trying to explain. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production?
  2016-06-03 12:55 ` Austin S. Hemmelgarn
  2016-06-03 13:31   ` Martin
@ 2016-06-03 14:05   ` Chris Murphy
  2016-06-03 14:11     ` Martin
  2016-06-05 10:45     ` Mladen Milinkovic
  1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2016-06-03 14:05 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Martin, Btrfs BTRFS

On Fri, Jun 3, 2016 at 6:55 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
>
> That said, there are other options. If you have enough disks, you can run
> BTRFS raid1 on top of LVM or MD RAID5 or RAID6, which provides you with the
> benefits of both.

There is a trade off. Either mdadm or lvm raid5, raid6, are more
mature and stable, but it's more maintenance. You have a btrfs scrub
as well as the md scrub. Btrfs on md/lvm raid56 will detect mismatches
but won't be able to fix them because from its perspective there's no
redundancy, except possibly metadata. So the repair has to happen on
the mdadm/lvm side.

Make certain the kernel command timer value is greater than the drive
error recovery timeout. The former is found in sysfs, per block
device, the latter can be read and set with smartctl. Wrong
configuration is common (it's actually the default) when using
consumer drives, and inevitably leads to problems, even the loss of
the entire array. It really is a terrible default.

-- 
Chris Murphy
^ permalink raw reply [flat|nested] 28+ messages in thread
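The timeout comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel command timer is read from `/sys/block/<device>/device/timeout` (in seconds), and the drive's SCT ERC value is supplied by hand in tenths of a second, the unit `smartctl -l scterc` reports. The function names and the example values are hypothetical, not taken from the thread.

```python
from pathlib import Path


def read_command_timer(device, sysfs_root="/sys/block"):
    """Return the kernel command timer for a block device such as 'sda', in seconds."""
    path = Path(sysfs_root, device, "device", "timeout")
    return int(path.read_text().strip())


def timer_is_safe(command_timer_s, erc_deciseconds):
    """True when the kernel waits longer than the drive's error recovery limit.

    erc_deciseconds is the SCT ERC value in tenths of a second. A value of
    0 or None is taken to mean ERC is disabled or unsupported, so the drive's
    recovery time is unbounded and the setup cannot be confirmed safe.
    """
    if not erc_deciseconds:
        return False
    return command_timer_s > erc_deciseconds / 10.0


# Hypothetical values: a 30 second command timer and a 7.0 second ERC limit.
print(timer_is_safe(30, 70))
# ERC disabled: the drive may keep retrying past the kernel's timer.
print(timer_is_safe(30, 0))
```

On a real system, `read_command_timer("sda")` would supply the first argument; the check is deliberately conservative and reports an unknown ERC setting as unsafe.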
* Re: Recommended why to use btrfs for production? 2016-06-03 14:05 ` Chris Murphy @ 2016-06-03 14:11 ` Martin 2016-06-03 15:33 ` Austin S. Hemmelgarn 2016-06-04 1:34 ` Chris Murphy 2016-06-05 10:45 ` Mladen Milinkovic 1 sibling, 2 replies; 28+ messages in thread From: Martin @ 2016-06-03 14:11 UTC (permalink / raw) To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS > Make certain the kernel command timer value is greater than the driver > error recovery timeout. The former is found in sysfs, per block > device, the latter can be get and set with smartctl. Wrong > configuration is common (it's actually the default) when using > consumer drives, and inevitably leads to problems, even the loss of > the entire array. It really is a terrible default. Are nearline SAS drives considered consumer drives? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production? 2016-06-03 14:11 ` Martin @ 2016-06-03 15:33 ` Austin S. Hemmelgarn 2016-06-04 0:48 ` Nicholas D Steeves 2016-06-04 1:34 ` Chris Murphy 1 sibling, 1 reply; 28+ messages in thread From: Austin S. Hemmelgarn @ 2016-06-03 15:33 UTC (permalink / raw) To: Martin, Chris Murphy; +Cc: Btrfs BTRFS On 2016-06-03 10:11, Martin wrote: >> Make certain the kernel command timer value is greater than the driver >> error recovery timeout. The former is found in sysfs, per block >> device, the latter can be get and set with smartctl. Wrong >> configuration is common (it's actually the default) when using >> consumer drives, and inevitably leads to problems, even the loss of >> the entire array. It really is a terrible default. > > Are nearline SAS drives considered consumer drives? > If it's a SAS drive, then no, especially when you start talking about things marketed as 'nearline'. Additionally, SCT ERC is entirely a SATA thing, I forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm pretty sure that the kernel handles things differently there. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production? 2016-06-03 15:33 ` Austin S. Hemmelgarn @ 2016-06-04 0:48 ` Nicholas D Steeves 2016-06-04 1:48 ` Chris Murphy 0 siblings, 1 reply; 28+ messages in thread From: Nicholas D Steeves @ 2016-06-04 0:48 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Martin, Chris Murphy, Btrfs BTRFS On 3 June 2016 at 11:33, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-06-03 10:11, Martin wrote: >>> >>> Make certain the kernel command timer value is greater than the driver >>> error recovery timeout. The former is found in sysfs, per block >>> device, the latter can be get and set with smartctl. Wrong >>> configuration is common (it's actually the default) when using >>> consumer drives, and inevitably leads to problems, even the loss of >>> the entire array. It really is a terrible default. >> >> >> Are nearline SAS drives considered consumer drives? >> > If it's a SAS drive, then no, especially when you start talking about things > marketed as 'nearline'. Additionally, SCT ERC is entirely a SATA thing, I > forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm > pretty sure that the kernel handles things differently there. For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of 7sec, is the default kernel command timeout of 30sec appropriate, or should it be reduced? For SATA drives that do not support SCT ERC, is it true that 120sec is a sane value? I forget where I got this value of 120sec; it might have been this list, it might have been an mdadm bug report. Also, in terms of tuning, I've been unable to find whether the ideal kernel timeout value changes depending on RAID type...is that a factor in selecting a sane kernel timeout value? Kind regards, Nicholas ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production? 2016-06-04 0:48 ` Nicholas D Steeves @ 2016-06-04 1:48 ` Chris Murphy 2016-06-06 13:29 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 28+ messages in thread From: Chris Murphy @ 2016-06-04 1:48 UTC (permalink / raw) To: Nicholas D Steeves Cc: Austin S. Hemmelgarn, Martin, Chris Murphy, Btrfs BTRFS On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote: > On 3 June 2016 at 11:33, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >> On 2016-06-03 10:11, Martin wrote: >>>> >>>> Make certain the kernel command timer value is greater than the driver >>>> error recovery timeout. The former is found in sysfs, per block >>>> device, the latter can be get and set with smartctl. Wrong >>>> configuration is common (it's actually the default) when using >>>> consumer drives, and inevitably leads to problems, even the loss of >>>> the entire array. It really is a terrible default. >>> >>> >>> Are nearline SAS drives considered consumer drives? >>> >> If it's a SAS drive, then no, especially when you start talking about things >> marketed as 'nearline'. Additionally, SCT ERC is entirely a SATA thing, I >> forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm >> pretty sure that the kernel handles things differently there. > > For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of > 7sec, is the default kernel command timeout of 30sec appropriate, or > should it be reduced? It's fine. But it depends on your use case, if it can tolerate a rare > 7 second < 30 second hang, and you're prepared to start investigating the cause then I'd leave it alone. If the use case prefers resetting the drive when it stops responding, then you'd go with something shorter. I'm fairly certain SAS's command queue doesn't get obliterated with such a link reset, just the hung command; where SATA drives all information in the queue is lost. 
So resets on SATA are a much bigger penalty if I have the correct understanding. > For SATA drives that do not support SCT ERC, is > it true that 120sec is a sane value? I forget where I got this value > of 120sec; It's a good question. It's not well documented, is not defined in the SATA spec, so it's probably make/model specific. The linux-raid@ list probably has the most information on this just because their users get nailed by this problem often. And the recommendation does seem to vary around 120 to 180. That is of course a maximum. The drive could give up much sooner. But what you don't want is for the drive to be in recovery for a bad sector, and the command timer does a link reset, losing all of what the drive was doing: all of which is replaceable except really one thing which is what sector was having the problem. And right now there's no report from the drive for slow sectors. It only reports failed reads, and it's that failed read error that includes the sector, so that the raid mechanism can figure out what data is missing, reconstruct from mirror or parity, and then fix the bad sector by writing to it. > it might have been this list, it might have been an mdadm > bug report. Also, in terms of tuning, I've been unable to find > whether the ideal kernel timeout value changes depending on RAID > type...is that a factor in selecting a sane kernel timeout value? No. It's strictly a value to make certain you get read errors from the drive rather than link resets. And that's why I think it's a bad default, because it totally thwarts attempts by manufacturers to recover marginal sectors, even in the single disk case. -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
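The rule discussed above (kernel command timer longer than the drive's error recovery limit) can be sketched as a small shell check. This is only an illustration: the function name `timer_is_safe` is made up, and it assumes the SCT ERC value is given in tenths of a second (as `smartctl -l scterc` reports it) while the kernel command timer is in seconds.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: succeeds when the kernel command
# timer (seconds) is strictly longer than the drive's SCT ERC limit
# (deciseconds), so the drive can report a read error before a link reset.
timer_is_safe() {
    erc_deciseconds="$1"
    timer_seconds="$2"
    [ $(( timer_seconds * 10 )) -gt "$erc_deciseconds" ]
}

# Example calls (values typed in by hand, nothing is read from a device):
#   timer_is_safe 70 30     succeeds: a 7.0 s ERC fits inside a 30 s timer
#   timer_is_safe 1200 30   fails: a 120 s recovery would be cut off by a reset
```

On a real system the two inputs would come from `smartctl -l scterc /dev/sdX` and `cat /sys/block/sdX/device/timeout`.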
* Re: Recommended why to use btrfs for production? 2016-06-04 1:48 ` Chris Murphy @ 2016-06-06 13:29 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 28+ messages in thread From: Austin S. Hemmelgarn @ 2016-06-06 13:29 UTC (permalink / raw) To: Chris Murphy, Nicholas D Steeves; +Cc: Martin, Btrfs BTRFS On 2016-06-03 21:48, Chris Murphy wrote: > On Fri, Jun 3, 2016 at 6:48 PM, Nicholas D Steeves <nsteeves@gmail.com> wrote: >> On 3 June 2016 at 11:33, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: >>> On 2016-06-03 10:11, Martin wrote: >>>>> >>>>> Make certain the kernel command timer value is greater than the driver >>>>> error recovery timeout. The former is found in sysfs, per block >>>>> device, the latter can be get and set with smartctl. Wrong >>>>> configuration is common (it's actually the default) when using >>>>> consumer drives, and inevitably leads to problems, even the loss of >>>>> the entire array. It really is a terrible default. >>>> >>>> >>>> Are nearline SAS drives considered consumer drives? >>>> >>> If it's a SAS drive, then no, especially when you start talking about things >>> marketed as 'nearline'. Additionally, SCT ERC is entirely a SATA thing, I >>> forget what the equivalent in SCSI (and by extension SAS) terms is, but I'm >>> pretty sure that the kernel handles things differently there. >> >> For the purposes of BTRFS RAID1: For drives that ship with SCT ERC of >> 7sec, is the default kernel command timeout of 30sec appropriate, or >> should it be reduced? > > It's fine. But it depends on your use case, if it can tolerate a rare >> 7 second < 30 second hang, and you're prepared to start > investigating the cause then I'd leave it alone. If the use case > prefers resetting the drive when it stops responding, then you'd go > with something shorter. > > I'm fairly certain SAS's command queue doesn't get obliterated with > such a link reset, just the hung command; where SATA drives all > information in the queue is lost. 
So resets on SATA are a much bigger > penalty if I have the correct understanding. There's also more involved otherwise with an ATA link reset because AHCI controllers aren't MP safe, so there's a global lock that has to be held while talking to them. Because of this, a link reset on an ATA drive (be it SATA or PATA) will cause performance degradation for all other devices on that controller as well until the reset is complete. > > >> For SATA drives that do not support SCT ERC, is >> it true that 120sec is a sane value? I forget where I got this value >> of 120sec; > > It's a good question. It's not well documented, is not defined in the > SATA spec, so it's probably make/model specific. The linux-raid@ list > probably has the most information on this just because their users get > nailed by this problem often. And the recommendation does seem to vary > around 120 to 180. That is of course a maximum. The drive could give > up much sooner. But what you don't want is for the drive to be in > recovery for a bad sector, and the command timer does a link reset, > losing all of what the drive was doing: all of which is replaceable > except really one thing which is what sector was having the problem. > And right now there's no report of the drive for slow sectors. It only > reports failed reads, and it's that failed read error that includes > the sector, so that the raid mechanism can figure out what data is > missing, reconstruct from mirror or parity, and then fix the bad > sector by writing to it. FWIW, I usually go with 150 on the Seagate 'Desktop' drives I use. I've seen some cheap Hitachi and Toshiba disks that need it as high as 300 though to work right. > >> it might have been this list, it might have been an mdadm >> bug report. Also, in terms of tuning, I've been unable to find >> whether the ideal kernel timeout value changes depending on RAID >> type...is that a factor in selecting a sane kernel timeout value? > > No. 
It's strictly a value to make certain you get read errors from the > drive rather than link resets. You have to factor in how the controller handles things too. Some of them will retry just like a desktop drive, and you need to account for that. > > And that's why I think it's a bad default, because it totally thwarts > attempts by manufacturers to recover marginal sectors, even in the > single disk case. That's debatable, by attempting to recover the bad sector, they're slowing down the whole system. The likelihood of recovering a bad sector functionally falls off linearly the longer you try, and not having the ability to choose when to report an error is the bigger issue here. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production? 2016-06-03 14:11 ` Martin 2016-06-03 15:33 ` Austin S. Hemmelgarn @ 2016-06-04 1:34 ` Chris Murphy 1 sibling, 0 replies; 28+ messages in thread From: Chris Murphy @ 2016-06-04 1:34 UTC (permalink / raw) To: Martin; +Cc: Chris Murphy, Austin S. Hemmelgarn, Btrfs BTRFS On Fri, Jun 3, 2016 at 8:11 AM, Martin <rc6encrypted@gmail.com> wrote: >> Make certain the kernel command timer value is greater than the driver >> error recovery timeout. The former is found in sysfs, per block >> device, the latter can be get and set with smartctl. Wrong >> configuration is common (it's actually the default) when using >> consumer drives, and inevitably leads to problems, even the loss of >> the entire array. It really is a terrible default. > > Are nearline SAS drives considered consumer drives? No, they should have a configurable SCT ERC setting using smartctl. Many, possibly most, consumer drives now do not support it, so often the only workable way to use them in any kind of multiple device scenario other than linear/concat or raid0 is to significantly increase the SCSI command timer, upwards of 2 or 3 minutes. So if your use case cannot tolerate such delays, then the drives must be disqualified. -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production? 2016-06-03 14:05 ` Chris Murphy 2016-06-03 14:11 ` Martin @ 2016-06-05 10:45 ` Mladen Milinkovic 2016-06-05 16:33 ` James Johnston 2016-06-06 1:47 ` Chris Murphy 1 sibling, 2 replies; 28+ messages in thread From: Mladen Milinkovic @ 2016-06-05 10:45 UTC (permalink / raw) To: Chris Murphy, Austin S. Hemmelgarn; +Cc: Martin, Btrfs BTRFS On 06/03/2016 04:05 PM, Chris Murphy wrote: > Make certain the kernel command timer value is greater than the driver > error recovery timeout. The former is found in sysfs, per block > device, the latter can be get and set with smartctl. Wrong > configuration is common (it's actually the default) when using > consumer drives, and inevitably leads to problems, even the loss of > the entire array. It really is a terrible default. Since it's first time i've heard of this I did some googling. Here's some nice article about these timeouts: http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/comment-page-1/ And some udev rules that should apply this automatically: http://comments.gmane.org/gmane.linux.raid/48193 Cheers -- Mladen Milinkovic GPG: EF9D9B26 ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: Recommended why to use btrfs for production? 2016-06-05 10:45 ` Mladen Milinkovic @ 2016-06-05 16:33 ` James Johnston 2016-06-05 18:20 ` Andrei Borzenkov 2016-06-06 1:47 ` Chris Murphy 1 sibling, 1 reply; 28+ messages in thread From: James Johnston @ 2016-06-05 16:33 UTC (permalink / raw) To: 'Mladen Milinkovic', 'Chris Murphy', 'Austin S. Hemmelgarn' Cc: 'Martin', 'Btrfs BTRFS' On 06/05/2016 10:46 AM, Mladen Milinkovic wrote: > On 06/03/2016 04:05 PM, Chris Murphy wrote: > > Make certain the kernel command timer value is greater than the driver > > error recovery timeout. The former is found in sysfs, per block > > device, the latter can be get and set with smartctl. Wrong > > configuration is common (it's actually the default) when using > > consumer drives, and inevitably leads to problems, even the loss of > > the entire array. It really is a terrible default. > > Since it's first time i've heard of this I did some googling. > > Here's some nice article about these timeouts: > http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive- > timeouts/comment-page-1/ > > And some udev rules that should apply this automatically: > http://comments.gmane.org/gmane.linux.raid/48193 I think the first link there is a good one. On my system: /sys/block/sdX/device/timeout defaults to 30 seconds - long enough for a drive with short TLER setting but too short for a consumer drive. 

There is a Red Hat link on setting up a udev rule for it here:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html

I thought it looked a little funny, so I combined the above with one of the
VMware udev rules pre-installed on my Ubuntu system and came up with this:

# Update timeout from 180 to one of your choosing:
ACTION=="add|change", SUBSYSTEMS=="scsi", ATTRS{type}=="0|7|14", \
RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

Now my attached drives automatically get this timeout without any scripting
or manual setting of the timeout.

James
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production? 2016-06-05 16:33 ` James Johnston @ 2016-06-05 18:20 ` Andrei Borzenkov 0 siblings, 0 replies; 28+ messages in thread From: Andrei Borzenkov @ 2016-06-05 18:20 UTC (permalink / raw) To: James Johnston, 'Mladen Milinkovic', 'Chris Murphy', 'Austin S. Hemmelgarn' Cc: 'Martin', 'Btrfs BTRFS' 05.06.2016 19:33, James Johnston wrote: > On 06/05/2016 10:46 AM, Mladen Milinkovic wrote: >> On 06/03/2016 04:05 PM, Chris Murphy wrote: >>> Make certain the kernel command timer value is greater than the driver >>> error recovery timeout. The former is found in sysfs, per block >>> device, the latter can be get and set with smartctl. Wrong >>> configuration is common (it's actually the default) when using >>> consumer drives, and inevitably leads to problems, even the loss of >>> the entire array. It really is a terrible default. >> >> Since it's first time i've heard of this I did some googling. >> >> Here's some nice article about these timeouts: >> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive- >> timeouts/comment-page-1/ >> >> And some udev rules that should apply this automatically: >> http://comments.gmane.org/gmane.linux.raid/48193 > > I think the first link there is a good one. On my system: > > /sys/block/sdX/device/timeout > > defaults to 30 seconds - long enough for a drive with short TLER setting > but too short for a consumer drive. 
> > There is a Red Hat link on setting up a udev rule for it here: > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Online_Storage_Reconfiguration_Guide/task_controlling-scsi-command-timer-onlining-devices.html > > I thought it looked a little funny, so I combined the above with one of the > VMware udev rules pre-installed on my Ubuntu system and came up with this: > > # Update timeout from 180 to one of your choosing: > ACTION=="add|change", SUBSYSTEMS=="scsi", ATTRS{type}=="0|7|14", \ > RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'" > Last line is actually ATTR{device/timeout}="100" to avoid spawning extra process for every device. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Recommended why to use btrfs for production? 2016-06-05 10:45 ` Mladen Milinkovic 2016-06-05 16:33 ` James Johnston @ 2016-06-06 1:47 ` Chris Murphy 2016-06-06 2:40 ` James Johnston 1 sibling, 1 reply; 28+ messages in thread From: Chris Murphy @ 2016-06-06 1:47 UTC (permalink / raw) To: Mladen Milinkovic; +Cc: Chris Murphy, Austin S. Hemmelgarn, Martin, Btrfs BTRFS On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic <maxrd2@smoothware.net> wrote: > On 06/03/2016 04:05 PM, Chris Murphy wrote: >> Make certain the kernel command timer value is greater than the driver >> error recovery timeout. The former is found in sysfs, per block >> device, the latter can be get and set with smartctl. Wrong >> configuration is common (it's actually the default) when using >> consumer drives, and inevitably leads to problems, even the loss of >> the entire array. It really is a terrible default. > > Since it's first time i've heard of this I did some googling. > > Here's some nice article about these timeouts: > http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive-timeouts/comment-page-1/ > > And some udev rules that should apply this automatically: > http://comments.gmane.org/gmane.linux.raid/48193 Yes it's a constant problem that pops up on the linux-raid list. Sometimes the list is quiet on this issue but it really seems like it's once a week. From last week... http://www.spinics.net/lists/raid/msg52447.html And you wouldn't know it because the subject is "raid 5 crashed" so you wouldn't think, oh bad sectors are accumulating because they're not getting fixed up and they're not getting fixed up because the kernel command timer is resetting the link preventing the drive from reporting a read error and the associated sector LBA. It starts with that, and then you get a single disk failure, and now when doing a rebuild, you hit the bad sector on an otherwise good drive and in effect that's like a 2nd drive failure and now the raid5 implodes. 
It's fixable, sometimes, but really tedious. -- Chris Murphy ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: Recommended why to use btrfs for production? 2016-06-06 1:47 ` Chris Murphy @ 2016-06-06 2:40 ` James Johnston 2016-06-06 13:36 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 28+ messages in thread From: James Johnston @ 2016-06-06 2:40 UTC (permalink / raw) To: 'Chris Murphy', 'Mladen Milinkovic' Cc: 'Austin S. Hemmelgarn', 'Martin', 'Btrfs BTRFS' On 06/06/2016 at 01:47, Chris Murphy wrote: > On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic <maxrd2@smoothware.net> wrote: > > On 06/03/2016 04:05 PM, Chris Murphy wrote: > >> Make certain the kernel command timer value is greater than the driver > >> error recovery timeout. The former is found in sysfs, per block > >> device, the latter can be get and set with smartctl. Wrong > >> configuration is common (it's actually the default) when using > >> consumer drives, and inevitably leads to problems, even the loss of > >> the entire array. It really is a terrible default. > > > > Since it's first time i've heard of this I did some googling. > > > > Here's some nice article about these timeouts: > > http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive- > timeouts/comment-page-1/ > > > > And some udev rules that should apply this automatically: > > http://comments.gmane.org/gmane.linux.raid/48193 > > Yes it's a constant problem that pops up on the linux-raid list. > Sometimes the list is quiet on this issue but it really seems like > it's once a week. From last week... > > http://www.spinics.net/lists/raid/msg52447.html It seems like it would be useful if the distributions or the kernel could automatically set the kernel timeout to an appropriate value. If the TLER can indeed be queried via smartctl, then it would be easy to automatically read it, and then calculate a suitable timeout. A RAID-oriented drive would end up leaving the current 30 seconds, while if it can't successfully query for TLER or the drive just doesn't support it, then assume a consumer drive and set timeout for 180 seconds. 
That way, zero user configuration would be needed in the common case. Or is it not that simple? James ^ permalink raw reply [flat|nested] 28+ messages in thread
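A rough sketch of the "read TLER, then pick a timeout" idea from the message above. The function name `timeout_from_scterc_report` is invented for illustration, and the parsing assumes that `smartctl -l scterc /dev/sdX` prints a line of the form `Read:     70 (7.0 seconds)` when SCT ERC is enabled; if your smartctl version prints something different, the pattern needs adjusting. The 30 and 180 second values are the ones proposed in the message.

```shell
#!/bin/sh
# Hypothetical helper: reads a "smartctl -l scterc" report on stdin and prints
# the kernel command timer (in seconds) that the proposal above would choose.
# ASSUMPTION: an enabled SCT ERC read limit shows up as "Read: <number> (...)".
timeout_from_scterc_report() {
    awk '
        /^ *Read:/ { if ($2 ~ /^[0-9]+$/ && $2 > 0) enabled = 1 }
        END        { print (enabled ? 30 : 180) }
    '
}

# Intended use (not executed here):
#   smartctl -l scterc /dev/sda | timeout_from_scterc_report
```

A drive that reports the limit as disabled, or does not support the command at all, falls through to the 180 second value.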
* Re: Recommended why to use btrfs for production? 2016-06-06 2:40 ` James Johnston @ 2016-06-06 13:36 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 28+ messages in thread From: Austin S. Hemmelgarn @ 2016-06-06 13:36 UTC (permalink / raw) To: James Johnston, 'Chris Murphy', 'Mladen Milinkovic' Cc: 'Martin', 'Btrfs BTRFS' On 2016-06-05 22:40, James Johnston wrote: > On 06/06/2016 at 01:47, Chris Murphy wrote: >> On Sun, Jun 5, 2016 at 4:45 AM, Mladen Milinkovic <maxrd2@smoothware.net> wrote: >>> On 06/03/2016 04:05 PM, Chris Murphy wrote: >>>> Make certain the kernel command timer value is greater than the driver >>>> error recovery timeout. The former is found in sysfs, per block >>>> device, the latter can be get and set with smartctl. Wrong >>>> configuration is common (it's actually the default) when using >>>> consumer drives, and inevitably leads to problems, even the loss of >>>> the entire array. It really is a terrible default. >>> >>> Since it's first time i've heard of this I did some googling. >>> >>> Here's some nice article about these timeouts: >>> http://strugglers.net/~andy/blog/2015/11/09/linux-software-raid-and-drive- >> timeouts/comment-page-1/ >>> >>> And some udev rules that should apply this automatically: >>> http://comments.gmane.org/gmane.linux.raid/48193 >> >> Yes it's a constant problem that pops up on the linux-raid list. >> Sometimes the list is quiet on this issue but it really seems like >> it's once a week. From last week... >> >> http://www.spinics.net/lists/raid/msg52447.html > > It seems like it would be useful if the distributions or the kernel could > automatically set the kernel timeout to an appropriate value. If the TLER can > indeed be queried via smartctl, then it would be easy to automatically read it, > and then calculate a suitable timeout. 
A RAID-oriented drive would end up leaving > the current 30 seconds, while if it can't successfully query for TLER or the drive > just doesn't support it, then assume a consumer drive and set timeout for 180 > seconds. > > That way, zero user configuration would be needed in the common case. Or is it > not that simple? Strictly speaking, it's policy, and therefore shouldn't be in the kernel. It's not hard to write a script to handle this though, both hdparm and smartctl can set the SCT ERC value, and will report an error if it fails, so you can try and set the value as you want (I personally would go with 10 seconds instead of 7), and if that fails, bump the kernel command timeout. ^ permalink raw reply [flat|nested] 28+ messages in thread
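The script described above might look roughly like the following sketch. It is not anyone's actual script from this thread: `set_drive_policy`, `SMARTCTL`, `SYSFS_ROOT`, and the 180 second fallback are illustrative choices, and it relies on the statement above that smartctl reports an error (non-zero exit status) when SCT ERC cannot be set. `smartctl -l scterc,100,100` asks for 10.0 second read and write limits, since the values are in tenths of a second.

```shell
#!/bin/sh
# Sketch only. SMARTCTL and SYSFS_ROOT are overridable so the logic can be
# tried without touching real hardware; both names are assumptions.
SMARTCTL="${SMARTCTL:-smartctl}"
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
ERC_DECISECONDS=100     # 10 seconds, the value preferred in the message above
FALLBACK_TIMEOUT=180    # seconds; assumed value for drives without SCT ERC

# Try to set SCT ERC on one drive; if that fails, raise the kernel command
# timer for that drive instead. Takes a bare device name such as "sda".
set_drive_policy() {
    dev="$1"
    if "$SMARTCTL" -l "scterc,$ERC_DECISECONDS,$ERC_DECISECONDS" "/dev/$dev" >/dev/null 2>&1; then
        echo "$dev: SCT ERC set, kernel command timer left alone"
    else
        echo "$FALLBACK_TIMEOUT" > "$SYSFS_ROOT/block/$dev/device/timeout"
        echo "$dev: SCT ERC not settable, kernel command timer set to $FALLBACK_TIMEOUT"
    fi
}

# Intended use as root (not executed here):
#   for d in sda sdb sdc; do set_drive_policy "$d"; done
```

Neither setting is guaranteed to survive a reboot or a drive power cycle, so a script like this would normally be run at boot or from a udev rule such as the one posted earlier in the thread.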
end of thread, other threads:[~2016-06-09 19:58 UTC | newest]

Thread overview: 28+ messages
2016-06-03  9:49 Recommended why to use btrfs for production? Martin
2016-06-03  9:53 ` Marc Haber
2016-06-03  9:57 ` Martin
2016-06-03 10:01 ` Hans van Kranenburg
2016-06-03 10:15 ` Martin
2016-06-03 12:55 ` Austin S. Hemmelgarn
2016-06-03 13:31 ` Martin
2016-06-03 13:47 ` Julian Taylor
2016-06-03 14:21 ` Austin S. Hemmelgarn
2016-06-03 14:39 ` Martin
2016-06-03 19:09 ` Christoph Anton Mitterer
2016-06-09  6:16 ` Duncan
2016-06-09 11:38 ` Austin S. Hemmelgarn
2016-06-09 17:39 ` Chris Murphy
2016-06-09 19:57 ` Duncan
2016-06-03 14:05 ` Chris Murphy
2016-06-03 14:11 ` Martin
2016-06-03 15:33 ` Austin S. Hemmelgarn
2016-06-04  0:48 ` Nicholas D Steeves
2016-06-04  1:48 ` Chris Murphy
2016-06-06 13:29 ` Austin S. Hemmelgarn
2016-06-04  1:34 ` Chris Murphy
2016-06-05 10:45 ` Mladen Milinkovic
2016-06-05 16:33 ` James Johnston
2016-06-05 18:20 ` Andrei Borzenkov
2016-06-06  1:47 ` Chris Murphy
2016-06-06  2:40 ` James Johnston
2016-06-06 13:36 ` Austin S. Hemmelgarn