Linux RAID subsystem development

* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-21 13:24 UTC
  To: David Brown, Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <58D126EB.7060707@hesbynett.no>



On 21.03.2017 at 14:13, David Brown wrote:
> On 21/03/17 12:03, Reindl Harald wrote:
>>
>> On 21.03.2017 at 11:54, Adam Goryachev wrote:
> <snip>
>>
>>> In addition, you claim that a drive larger than 2TB is almost certainly
>>> going to suffer from a URE during recovery, yet this is exactly the
>>> situation you will be in when trying to recover a RAID10 with member
>>> devices 2TB or larger. A single URE on the surviving portion of the
>>> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
>>> 3 URE's on the three remaining members of the RAID6 will not cause more
>>> than a hiccup (as long as no more than one URE on the same stripe, which
>>> I would argue is ... exceptionally unlikely).
>>
>> given that when your disks have the same age errors on another disk
>> become more likely when one failed and the heavy disk IO due recovery of
>> a RAID6 with takes *many hours* where you have heavy IO on *all disks*
>> compared with a way faster restore of RAID1/10 guess in which case a URE
>> is more likely
>>
>> additionally why should the whole array fail just because a single block
>> get lost? the is no parity which needs to be calculated, you just lost a
>> single block somewhere - RAID1/10 are way easier in their implementation
>
> If you have RAID1, and you have a URE, then the data can be recovered
> from the other half of that RAID1 pair.  If you have had a disk failure
> (manual for replacement, or a real failure), and you get a URE on the
> other half of that pair, then you lose data.
>
> With RAID6, you need an additional failure (either another full disk
> failure or an URE in the /same/ stripe) to lose data.  RAID6 has higher
> redundancy than two-way RAID1 - of this there is /no/ doubt

yes, but with RAID5/RAID6 *all disks* are involved in the rebuild, while with
a 10 disk RAID10 only one disk needs to be read and its data written to
the new one - all other disks are not involved in the resync at all

for most arrays the disks have a similar age and usage pattern, so when
the first one fails it becomes likely that it won't take long for
another one to fail, and so load and recovery time matter


* Re: proactive disk replacement
From: David Brown @ 2017-03-21 13:13 UTC
  To: Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <583576ca-a76c-3901-c196-6083791533ee@thelounge.net>

On 21/03/17 12:03, Reindl Harald wrote:
> 
> 
> On 21.03.2017 at 11:54, Adam Goryachev wrote:
<snip>
> 
>> In addition, you claim that a drive larger than 2TB is almost certainly
>> going to suffer from a URE during recovery, yet this is exactly the
>> situation you will be in when trying to recover a RAID10 with member
>> devices 2TB or larger. A single URE on the surviving portion of the
>> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
>> 3 URE's on the three remaining members of the RAID6 will not cause more
>> than a hiccup (as long as no more than one URE on the same stripe, which
>> I would argue is ... exceptionally unlikely).
> 
> given that when your disks have the same age errors on another disk
> become more likely when one failed and the heavy disk IO due recovery of
> a RAID6 with takes *many hours* where you have heavy IO on *all disks*
> compared with a way faster restore of RAID1/10 guess in which case a URE
> is more likely
> 
> additionally why should the whole array fail just because a single block
> get lost? the is no parity which needs to be calculated, you just lost a
> single block somewhere - RAID1/10 are way easier in their implementation

If you have RAID1, and you have a URE, then the data can be recovered
from the other half of that RAID1 pair.  If you have had a disk failure
(manual for replacement, or a real failure), and you get a URE on the
other half of that pair, then you lose data.

With RAID6, you need an additional failure (either another full disk
failure or an URE in the /same/ stripe) to lose data.  RAID6 has higher
redundancy than two-way RAID1 - of this there is /no/ doubt.

> 
>> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
>> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
>> chance of surviving a 2 drive failure.
> 
> yeah and you *need that* when it takes many hours ot a few days until
> your 8 TB RAID6 is resynced while the whole time *all disks* are under
> heavy stress
> 
>> Sure, there are other things to consider (performance, cost, etc) but on
>> a reliability point, RAID6 seems to be the far better option
> 
> *no* - it takes twice as long to recalculate from parity and stresses
> the remaining disks twice as hard as RAID5 and so you pretty soon end
> with lost both of the disk you can lose without the array goes down
> while you still have many hours remaining recovery time

For RAID5 and RAID6, you read the same data - the full data stripe.  For
RAID5, you calculate and write a single parity block, while for RAID6
you calculate and write an additional parity block.  The disk reads are
the same in both cases, but you write out twice as many parity blocks.  You
do not stress the disks noticeably harder with RAID6 than with RAID5.
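
To illustrate the point about identical reads, here is a minimal Python
sketch. It assumes the commonly described RAID6 scheme in which P is the XOR
of the data blocks and Q is a weighted sum in GF(2^8) (generator 2,
polynomial 0x11d). The names gf_mul2 and compute_parity are invented for this
example, and md's real code is heavily optimized; the sketch only shows that
both parities come from one pass over the same data blocks.

```python
def gf_mul2(x: int) -> int:
    """Multiply one byte by 2 in GF(2^8), reducing by polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def compute_parity(data_blocks: list[bytes]) -> tuple[bytes, bytes]:
    """Return (P, Q) for one stripe; every data block is read exactly once."""
    size = len(data_blocks[0])
    p = bytearray(size)
    q = bytearray(size)
    # Horner's rule from the highest-numbered block down:
    # Q = D0 + 2*(D1 + 2*(D2 + ...)), with + meaning XOR.
    for block in reversed(data_blocks):
        for i in range(size):
            p[i] ^= block[i]                   # the only parity RAID5 needs
            q[i] = gf_mul2(q[i]) ^ block[i]    # the extra parity for RAID6
    return bytes(p), bytes(q)


stripe = [b"\x01\x10", b"\x02\x20", b"\x03\x30"]  # three made-up data chunks
p, q = compute_parity(stripe)
print("P =", p.hex(), "Q =", q.hex())
```

The inner loop touches each data byte once for both parities, so the extra
cost of Q is CPU work and one more block written, not additional reads.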

> 
> here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

This is an article heavily based on a Sun engineer trying to promote his
own alternative using scaremongering.

It is, however, correct in suggesting that RAID6 is more reliable than
RAID5.  And triple-parity raid (or additional layered RAID) is more
reliable than RAID6.  Nowhere does it suggest that RAID1 is more
reliable than RAID6.

It all boils down to the redundancy level.  Two-drive RAID1 pairs have a
single drive redundancy.  RAID5 has a single drive redundancy.  RAID6
has two drive redundancy - thus it is more reliable and will tolerate
more failures before losing data.  If this is not enough, and you don't
have triple parity RAID (it is not yet implemented in md - one day,
perhaps), you can use more mirrors on RAID1 or use layers such as a
RAID5 array built on RAID1 pairs.


* Re: proactive disk replacement
From: David Brown @ 2017-03-21 13:02 UTC
  To: Reindl Harald, Jeff Allison, Adam Goryachev; +Cc: linux-raid
In-Reply-To: <f0916e66-8ea7-3363-3600-1d2cd68e85af@thelounge.net>

On 21/03/17 10:54, Reindl Harald wrote:
> 
> 
> On 21.03.2017 at 03:33, Jeff Allison wrote:
>> I don't have a spare SATA slot I do however have a spare USB carrier,
>> is that fast enough to be used temporarily?
> 
> USB3 yes, USB2 is no fun because the speed of the array depends on
> the slowest disk in the array

When you are turning your RAID5 into RAID6, you can use a non-standard
layout with the external drive being the second parity.  That way you
don't need to re-write the data on the existing drives, and the access
to the external drive will all be writes of the Q parity - the system
will not read from that drive unless it has to recover from a two drive
failure.  This will reduce stress on all the disks, and make the limited
USB2 bandwidth less of an issue.

If you have to use two USB carriers for the whole process, try to make
sure they are connected to separate root hubs so that they don't share
the bandwidth.  This is not always just a matter of using two USB ports
- sometimes two adjacent USB ports on a PC share an internal hub.

> 
> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
> the same problems - due rebuild you have a lot of random-IO load on all
> remaining disks which leads in bad performance and make it more likely
> that before the rebuild is finished another disk fails, RAID6 produces
> even more random IO because of the double parity and if you have a
> Unrecoverable-Read-Error on RAID5 you are dead, RAID6 is not much better
> here and the probability of a URE becomes more likely with larger disks

Rebuilds are done using streamed linear access - the only random access
is the mix of rebuild transfers with normal usage of the array.  This
applies to RAID5 and RAID6 as well as RAID1 or RAID10.

With RAID5 or two-disk RAID1, if you get an URE on a read then you can
recover the data without loss.  This is the case for normal
(non-degraded) use, or if you are using "replace" to duplicate an
existing disk before replacement.  If you have failed a drive (manually,
or due to a serious disk failure), then any single URE means lost data
in that stripe.

With RAID6 (or three-disk RAID1), you can tolerate /two/ URE's on the
same stripe.  If you have failed a disk for replacement, you can
tolerate one URE.

Note that in non-degraded RAID5 (or degraded RAID6), two UREs only cause
data loss if they are on the same stripe.  The chances of getting a URE
somewhere on the disk are roughly proportional to the size of the disk -
but the chance of getting a URE on the same stripe as another URE on
another disk is basically independent of the disk size, and it is
extraordinarily small.
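
A rough numerical sketch of that argument, in Python. The URE rate of 1e-14
per bit and the 512 KiB chunk size are assumptions picked for illustration
(a typical spec-sheet figure and a common chunk size), not measurements, and
p_at_least_one_ure is a made-up helper that treats every bit read as an
independent trial.

```python
import math

URE_PER_BIT = 1e-14        # assumed spec-sheet rate, not a measurement
CHUNK_BYTES = 512 * 1024   # assumed chunk size


def p_at_least_one_ure(bytes_read: float, ure_per_bit: float = URE_PER_BIT) -> float:
    """Probability of one or more UREs while reading bytes_read bytes."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# The chance of a URE somewhere on the disk grows with the disk size.
for size_tb in (2, 4, 8):
    whole_disk = p_at_least_one_ure(size_tb * 1e12)
    print(f"{size_tb} TB disk: P(URE somewhere on a full read) = {whole_disk:.3f}")

# Given a URE on one disk, the chance that another disk has a URE in the
# matching chunk of the same stripe depends only on the chunk size.
same_stripe = p_at_least_one_ure(CHUNK_BYTES)
print(f"P(URE in one given {CHUNK_BYTES // 1024} KiB chunk) = {same_stripe:.2e}")
```

Under these assumptions the whole-disk figure rises with capacity, while the
per-chunk figure stays the same tiny number for any disk size.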

> 
> RAID10: less to zero performance impact due rebuild and no random-IO
> caused by the rebuild, it's just "read a disk from start to end and
> write the data on another disk linear" while the only head moves on your
> disks is the normal workload on the array

RAID1 (and RAID10) rebuilds are a little more efficient than RAID5 or
RAID6 rebuilds - but not hugely so.  Depending on factors such as IO
structures, cpu speed and loading, number of disks in the array,
concurrent access to other data, etc., they can be something like 25% to
50% faster.  They do not involve noticeably more or less linear access
than a RAID5/RAID6 rebuild, but they avoid heavy access to disks other
than those in the RAID1 pair being rebuilt.

> 
> with disks 2 TB or larger you can make the conclusion "do not use
> RAID5/6 anymore and when you do be prepared that you won't survive a
> rebuild caused by a failed disk"

No, you cannot.  Your conclusion here is based on several totally
incorrect assumptions:

1. You think that RAID5/RAID6 recovery is more stressful, because the
parity is "all over the place".  This is wrong.

2. You think that random IO has higher chance of getting an URE than
linear IO.  This is wrong.

3. You think that getting a URE on one disk, then getting a URE on a
second disk, counts as a double failure that will break single-parity
redundancy (RAID5, RAID1, RAID6 in degraded mode).  This is wrong - it
is only a problem if the two UREs are in the same stripe, which is quite
literally a one in a million chance.

There are certainly good reasons to prefer RAID10 systems to RAID5/RAID6
- for some types of loads, it can be significantly faster, and even
though the rebuild time is not as much faster as you think, it is still
faster.  Linux supports a range of different RAID types for good reason
- it is not a "one size fits all" problem.  But you should learn the
differences and make your choices and recommendations based on facts,
rather than articles written by people trying to sell their own "solutions".

Best regards,

David

> 
>> On 21 March 2017 at 01:59, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>>
>>>
>>> On 20/3/17 23:47, Jeff Allison wrote:
>>>>
>>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>>
>>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>>>> disks
>>>> are getting a bit long in the tooth so before I get into problems I’ve
>>>> bought 4 new disks to replace them.
>>>>
>>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>>> suggestions.
>>>>
>>>> My current plan is just to replace the 2tb drives with the new 3tb
>>>> drives
>>>> and move on, I’d like to do it on line with out having to trash the
>>>> array
>>>> and start again, so does anyone have a game plan for doing that.
>>>
>>> Yes, do not fail a disk and then replace it, use the newer replace
>>> method
>>> (it keeps redundancy in the array).
>>> Even better would be to add a disk, and convert to RAID6, then add a
>>> second
>>> disk (using replace), and so on, then remove the last disk, grow the
>>> array
>>> to fill the 3TB, and then reduce the number of disks in the raid.
>>> This way, you end up with RAID6...
>>>>
>>>> Or is a 9tb raid 5 array the wrong thing to be doing and should I be
>>>> doing
>>>> something else 6tb raid 10 or something I’m open to suggestions.
>>>
>>> I'd feel safer with RAID6, but it depends on your requirements.
>>> RAID10 is
>>> also a nice option, but, it depends...


* Re: proactive disk replacement
From: Andreas Klauer @ 2017-03-21 12:41 UTC
  To: Reindl Harald; +Cc: Adam Goryachev, Jeff Allison, linux-raid
In-Reply-To: <8b108a89-9d17-63f0-de1c-80b17be4411a@thelounge.net>

On Tue, Mar 21, 2017 at 01:03:22PM +0100, Reindl Harald wrote:
> the IO of a RAID5/6 rebuild is hardly linear beause the informations 
> (data + parity) are spread all over the disks

It's not "randomly" spread all over. The blocks are always where they belong.

https://en.wikipedia.org/wiki/Standard_RAID_levels#/media/File:RAID_6.svg

It's AAAA, BBBB, CCCC, DDDD. Not DBCA, BADC, ADBC, ...

There is no random I/O involved here, at worst it will decide to not read 
a parity block because it's not needed but that does not cause huge/random
jumps for the HDD read heads.
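
The fixed placement can be sketched in Python. This assumes a rotating
layout in the style of md's default left-symmetric RAID6 (P moves back one
disk per stripe, Q follows P, data follows Q); stripe_roles is a
hypothetical helper for illustration, not an mdadm or kernel API.

```python
def stripe_roles(stripe: int, n_disks: int) -> list[str]:
    """Return what each disk holds in the given stripe: D0..Dk, P or Q."""
    p = n_disks - 1 - (stripe % n_disks)   # P rotates backwards per stripe
    q = (p + 1) % n_disks                  # Q sits on the disk after P
    roles = [""] * n_disks
    roles[p] = "P"
    roles[q] = "Q"
    for d in range(n_disks - 2):           # data chunks follow Q, wrapping
        roles[(p + 2 + d) % n_disks] = f"D{d}"
    return roles


# Chunk number s on every disk belongs to stripe s, whatever its role.
for s in range(4):
    print(f"stripe {s}:", " ".join(f"{r:>2}" for r in stripe_roles(s, 4)))
```

Each disk contributes exactly one chunk to each stripe, at the same chunk
offset, so a rebuild walks every surviving disk from front to back no matter
which chunks happen to hold parity.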

> while in case of RAID1/10 it is really linear

Actually RAID 10 has the most interesting layout choices... 
to this day mdadm is unable to grow/convert some of these.

In a RAID 10 rebuild the HDD might have to jump from end to start.

Of course if you consider metadata updates (progress has to be 
recorded somewhere?) then ALL rebuilds regardless of RAID level 
are random I/O in a way.

But such is the fate of a HDD, it's their bread and butter. 
Any server that does anything other than "idle" does random I/O 24/7.

If there was no other I/O (because the RAID is live during rebuild) 
and no metadata updates (or external metadata) you could totally do 
RAID0/1/5/6 rebuilds with tape drives. That's how random it is.
RAID10 might need a rewind in between.

Regards
Andreas Klauer


* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-21 12:10 UTC
  To: Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <40485bef-feba-a0ae-5e90-3fc51795c29d@websitemanagers.com.au>


On 21.03.2017 at 12:56, Adam Goryachev wrote:
> Sorry, but I'm just seeing scaremongering and things that don't compute.
> Possibly I'm just not seeing it, but I don't see your advice being given
> by a majority of "experts" either on this list or elsewhere. I'll try to
> refrain from responding beyond this one, and return to lurking and
> hopefully learning more.
>
> Also, please note that the quoting / attribution seems to be wrong
> (inverted).

only in your mail client

> On 21/3/17 22:03, Reindl Harald wrote:
>> On 21.03.2017 at 11:54, Adam Goryachev wrote:
>> but the point is that with RAID5/6 the recovery itself is *heavy
>> random IO* and that gets *combined* with the random IO of the normal
>> workload and that means *heavy load on the disks*
> random IO is the same as random IO, regardless of the "cause" of making
> the IO random

no - it's a matter of *how much* random IO you have - when the rebuild
process needs to seek for parity and remaining data blocks and hence
produces heavy head movement all the time, this is added to the
IO of the normal workload

in case of a RAID1/10 rebuild the rebuild process itself is just a linear
read and the only head movement on the disks comes from the normal workload
on the array

> In most systems, you won't be running anywhere near the IO limits, so
> allowing your recovery some portion of IO is not an issue

IO limits don't matter here when we talk about IOPS and the drive heads move
around heavily all the time because parity and data blocks for the
restore are spread all over the disk *and* the requested workload data
is also somewhere else

in case of a RAID1/10 rebuild you have linear IO all the time, interrupted
from time to time by the workload on the array - that's a completely
different stress level for a disk compared with seeking for hours or days
for the parity and data needed to restore the failed disk


* Re: on assembly and recovery of a hardware RAID
From: Alfred Matthews @ 2017-03-21 12:09 UTC
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <87a88fcq21.fsf@notabene.neil.brown.name>

Thanks, I'll take a look. Could use a project, insane as that sounds.

On 20 March 2017 at 22:38, NeilBrown <neilb@suse.com> wrote:
> On Mon, Mar 20 2017, Alfred Matthews wrote:
>
>>>> *** Checking Backup Volume Header:
>>>> Unexpected Volume signature '  ' expected 'H+'
>>>
>>> Here the backup volume header, which is 2 blocks (blocks are 8K) from
>>> the end of the device, looks wrong.
>>> This probably means the chunk size is wrong.
>>> I would suggest trying different chunksizes, starting at 4K and
>>> doubling, until this message goes away.
>>> That still might not be the correct chunk size, so I would continue up
>>> to several megabytes and find all the chunksizes that seem to work.
>>> Then look at what else hpfsck says on those.
>>
>> I'm not actually able to generate happy output in hpfsck using any of
>> the following multiples of 4K
>>
>> 4
>> [...]
>> 8192
>> 16384
>> 32768
>> 65536
>> 131072
>> 262144
>> 524288
>> 1048576
>> 2097152
>>
>> Any chance it's not really an HFS system at all?
>
> Not likely.  hpfsck finds a perfectly valid superblock (or "Volume
> Header") at the start of the device.  It just cannot find the end one.
>
> The blocksize is:
>      blocksize       : 2000
>
> which is in HEX, so 8K.
> The total_blocks is:
>      total_blocks    : 732482664
>
> which are 8K blocks, so 5859861312K or 5.4TB (using 1024*1024*1024).
> which matches the fact that each partition is 2.73TB.
>
> The problem seems to be that we are not combining the two partitions
> together in the correct way to create the original 5.4TB partition.
>
> All we know is that the backup volume header should look
> much like the main header, and particularly should have 'H+' in the
> signature, which is the first 2 bytes.
> i.e. the first two bytes of the volume headers should be
> 0x482B
>
> The second (8K) block of the disk must look like this, and
> the second last should as well.
> If you can search through both devices for all 8K blocks which
> start with 0x482B, that might give us a hint what to look for.
> I would write a C program to do this.  It might take a while to run, but
> you can test on the first device, as you know block 2 matches.
>
>
> Hmmm... I've got a new theory.  The code is broken.
> fscheck_read_wrapper() in libhfsp/src/fscheck.c should set
> vol->maxblocks.
> It is set to a dummy value of '3' before this is called.
> In the "signature == HFS_VOLHEAD_SIG" it sets it properly,
> but in the "signature == HFSP_VOLHEAD_SIG" case (which applies to you)
> it doesn't.
> So it tries to read the backup from block "3-2", or block 1.  And there
> is nothing there.
>
> How is your C coding?  You could
>   apt-get source hfsplus
> and hack the code and try to build it yourself....
>
>
> NeilBrown
>
>


* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-21 12:03 UTC
  To: Andreas Klauer; +Cc: Adam Goryachev, Jeff Allison, linux-raid
In-Reply-To: <20170321113447.GA18665@metamorpher.de>



On 21.03.2017 at 12:34, Andreas Klauer wrote:
> On Tue, Mar 21, 2017 at 12:03:51PM +0100, Reindl Harald wrote:
>> but the point is that with RAID5/6 the recovery itself is *heavy random
>> IO* and that gets *combined* with the random IO of the normal workload
>> and that means *heavy load on the disks*
>
> Where do you get that random I/O idea from? Rebuild is linear.
> Or what do you mean by random I/O in this context? (RAID rebuilds)
> What kind of random things do you think the RAID is doing?

the IO of a RAID5/6 rebuild is hardly linear because the information
(data + parity) is spread all over the disks, while in case of RAID1/10
it is really linear

> If you see read errors during rebuild, the most common cause is
> that the rebuild also happens to be the first read test since forever.
> (Happens to be the case for people who don't do any disk monitoring.)
>
>> here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
>
> This is just wrong


* Re: proactive disk replacement
From: Adam Goryachev @ 2017-03-21 11:56 UTC
  To: Reindl Harald, Jeff Allison; +Cc: linux-raid
In-Reply-To: <583576ca-a76c-3901-c196-6083791533ee@thelounge.net>

Sorry, but I'm just seeing scaremongering and things that don't compute. 
Possibly I'm just not seeing it, but I don't see your advice being given
by a majority of "experts" either on this list or elsewhere. I'll try to 
refrain from responding beyond this one, and return to lurking and 
hopefully learning more.

Also, please note that the quoting / attribution seems to be wrong 
(inverted).

On 21/3/17 22:03, Reindl Harald wrote:
>
> On 21.03.2017 at 11:54, Adam Goryachev wrote:
>> On 21/3/17 20:54, Reindl Harald wrote:
>>> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
>>> the same problems - due rebuild you have a lot of random-IO load on
>>> all remaining disks which leads in bad performance and make it more
>>> likely that before the rebuild is finished another disk fails, RAID6
>>> produces even more random IO because of the double parity and if you
>>> have a Unrecoverable-Read-Error on RAID5 you are dead, RAID6 is not
>>> much better here and the probability of a URE becomes more likely with
>>> larger disks
>>>
>>> RAID10: less to zero performance impact due rebuild and no random-IO
>>> caused by the rebuild, it's just "read a disk from start to end and
>>> write the data on another disk linear" while the only head moves on
>>> your disks is the normal workload on the array
>>>
>>> with disks 2 TB or larger you can make the conclusion "do not use
>>> RAID5/6 anymore and when you do be prepared that you won't survive a
>>> rebuild caused by a failed disk"
>>>
>> I can't say I'm an expert in this, but in actual fact, I disagree with
>> both your arguments against RAID6...
>> You say recovery on a RAID10 is a simple linear read from one drive (the
>> surviving member of the RAID1 portion) and a linear write on the other
>> (the replaced drive). You also declare that there is no random IO with
>> normal work load + recovery. I think you have forgotten that the "normal
>> workload" is probably random IO, but certainly once combined with the
>> recovery IO then it will be random IO.
>
> but the point is that with RAID5/6 the recovery itself is *heavy 
> random IO* and that gets *combined* with the random IO of the normal
> workload and that means *heavy load on the disks*
Random IO is random IO, regardless of what causes it.
In most systems, you won't be running anywhere near the IO limits, so
allowing your recovery some portion of IO is not an issue.
>
>> In addition, you claim that a drive larger than 2TB is almost certainly
>> going to suffer from a URE during recovery, yet this is exactly the
>> situation you will be in when trying to recover a RAID10 with member
>> devices 2TB or larger. A single URE on the surviving portion of the
>> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
>> 3 URE's on the three remaining members of the RAID6 will not cause more
>> than a hiccup (as long as no more than one URE on the same stripe, which
>> I would argue is ... exceptionally unlikely).
>
> given that when your disks have the same age errors on another disk 
> become more likely when one failed and the heavy disk IO due recovery 
> of a RAID6 with takes *many hours* where you have heavy IO on *all 
> disks* compared with a way faster restore of RAID1/10 guess in which 
> case a URE is more likely
>
UREs depend on the amount of data read, but the chance isn't cumulative:
every block read starts again with the same chance. If the chance of
winning a lottery is 1 in 100, that doesn't mean you will win at least
once if you buy 100 tickets. So reading 200,000,000 blocks doesn't ensure
you will see a URE either (equally, you just might be lucky and win the
lottery more than once, and get more than one URE).
In any case, if you only have a single source of data, then you are more
likely to lose it (this is one of the reasons for RAID and backups). So
RAID6, which stores your data in more than one location (during a drive
failure event), is better.
BTW, just because you say that you will suffer a URE under heavy load
doesn't make it true. The load factor doesn't change the frequency of
UREs (even though it sounds possible).
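
The lottery arithmetic can be checked with a few lines of Python. This is a
sketch assuming independent trials with a fixed 1-in-100 chance each (the
helper names are invented here); the same binomial reasoning applies to
block reads with a much smaller per-read chance.

```python
def p_no_hit(p_single: float, trials: int) -> float:
    """Chance that none of `trials` independent attempts succeeds."""
    return (1.0 - p_single) ** trials


def p_exactly_one(p_single: float, trials: int) -> float:
    """Chance that exactly one of `trials` independent attempts succeeds."""
    return trials * p_single * (1.0 - p_single) ** (trials - 1)


p_single, tickets = 1 / 100, 100
none = p_no_hit(p_single, tickets)
one = p_exactly_one(p_single, tickets)
several = 1.0 - none - one            # two or more wins

print(f"no win at all:     {none:.3f}")
print(f"exactly one win:   {one:.3f}")
print(f"more than one win: {several:.3f}")
```

With these numbers, roughly 37% of buyers win nothing, about 37% win exactly
once, and the remaining 26% or so win more than once.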
> additionally why should the whole array fail just because a single 
> block get lost? the is no parity which needs to be calculated, you 
> just lost a single block somewhere - RAID1/10 are way easier in their 
> implementation
Equally, in the worst case, multiple UREs on the same stripe of a RAID6
lose only a single stripe (OK, a stripe is bigger than a block, but
that is still much less likely to occur anyway).
>
>> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
>> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
>> chance of surviving a 2 drive failure.
>
> yeah and you *need that* when it takes many hours ot a few days until 
> your 8 TB RAID6 is resynced while the whole time *all disks* are under 
> heavy stress
Why are all disks under heavy stress? Again, you don't operate (under
normal conditions) at a heavy stress level: you need room to grow, and
peak load is going to be higher, but only for short durations. Normal
activity might be 50% of maximum, and degraded performance together with
recovery might push that to 80%, but (decent) disks are not going
to have a problem doing simple read/write activity. That is what they
are designed for, right?
>
>> Sure, there are other things to consider (performance, cost, etc) but on
>> a reliability point, RAID6 seems to be the far better option
>
> *no* - it takes twice as long to recalculate from parity and stresses 
> the remaining disks twice as hard as RAID5 and so you pretty soon end 
> with lost both of the disk you can lose without the array goes down 
> while you still have many hours remaining recovery time
>
> here you go: 
> http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/
That was written in 2010, 2019 is only 2 years away, (unless you meant 
2029 and it was a typo) and I don't see evidence of that being true nor 
becoming true in such a short time. We don't see many (any?) people 
trying to recover their RAID6 arrays with double URE failures.

You say it takes twice as long to recalculate from parity for RAID6
compared to RAID5, but with today's CPU performance this is still faster
than the drive speed (unless you have NVMe or some SSDs, but then I assume
the whole URE issue is different there anyway). Also, why do you think
it stresses the disks twice as hard as RAID5? To recover a RAID5 you
need a full read of all surviving drives, that's 100% read. To recover a
RAID6 you need a full read of all remaining drives minus one, so that is
less than 100% read. So why are you "stressing the remaining disks twice
as hard"? Also, why does a URE equal losing a disk? All you do is read
that block from another member in the array, and fix the URE at the same
time.

If anything, you might suggest triple mirror RAID (what is that called? 
RAID110?)
If I was to believe you, then that is the only sensible option, with 
triple mirror, when you lose any one drive, then you may recover by 
simply reading from the surviving members, and you are no worse off 
under any scenario. Even losing any two drives and you are still 
protected, potentially you can lose up to 4 drives without data loss 
(assuming a minimum of 6 drives). However, cost is a factor here.

Finally, other than RAID110 (really, what is this called?) do you have 
any other sensible suggestions? RAID10 just doesn't seem to be it, and 
zfs doesn't seem to be mainstream enough either, same with btrfs and 
other FS's which can do various checksum/redundant data storage.

PS, In case you are wondering, I am still running 8 drive RAID5 in real 
life workloads, and don't have any problems with data loss (albeit, I do 
use DRBD to replicate the data between two systems with RAID5 each, so 
you can call that RAID51 perhaps, but the point remains, I've never 
(yet) lost an entire RAID5 array due to multiple drive failure or URE's).



* Re: proactive disk replacement
From: Gandalf Corvotempesta @ 2017-03-21 11:55 UTC
  To: Adam Goryachev; +Cc: Reindl Harald, Jeff Allison, linux-raid
In-Reply-To: <02316742-3887-b811-3c77-aad29cda4077@websitemanagers.com.au>

2017-03-21 11:54 GMT+01:00 Adam Goryachev <mailinglists@websitemanagers.com.au>:
> I can't say I'm an expert in this, but in actual fact, I disagree with both
> your arguments against RAID6...
> You say recovery on a RAID10 is a simple linear read from one drive (the
> surviving member of the RAID1 portion) and a linear write on the other (the
> replaced drive). You also declare that there is no random IO with normal
> work load + recovery. I think you have forgotten that the "normal workload"
> is probably random IO, but certainly once combined with the recovery IO then
> it will be random IO.
>
> In addition, you claim that a drive larger than 2TB is almost certainly
> going to suffer from a URE during recovery, yet this is exactly the
> situation you will be in when trying to recover a RAID10 with member devices
> 2TB or larger. A single URE on the surviving portion of the RAID1 will cause
> you to lose the entire RAID10 array. On the other hand, 3 URE's on the three
> remaining members of the RAID6 will not cause more than a hiccup (as long as
> no more than one URE on the same stripe, which I would argue is ...
> exceptionally unlikely).
>
> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
> chance of surviving a 2 drive failure.
>
> Sure, there are other things to consider (performance, cost, etc) but on a
> reliability point, RAID6 seems to be the far better option.

Totally agree


* Re: proactive disk replacement
From: Andreas Klauer @ 2017-03-21 11:34 UTC
  To: Reindl Harald; +Cc: Adam Goryachev, Jeff Allison, linux-raid
In-Reply-To: <583576ca-a76c-3901-c196-6083791533ee@thelounge.net>

On Tue, Mar 21, 2017 at 12:03:51PM +0100, Reindl Harald wrote:
> but the point is that with RAID5/6 the recovery itself is *heavy random 
> IO* and that gets *combined* with the random IO of the normal workload
> and that means *heavy load on the disks*

Where do you get that random I/O idea from? Rebuild is linear.
Or what do you mean by random I/O in this context? (RAID rebuilds)
What kind of random things do you think the RAID is doing?

If you see read errors during rebuild, the most common cause is 
that the rebuild also happens to be the first read test since forever. 
(Happens to be the case for people who don't do any disk monitoring.)

> here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/

This is just wrong.

Regards
Andreas Klauer


* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-21 11:03 UTC (permalink / raw)
  To: Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <02316742-3887-b811-3c77-aad29cda4077@websitemanagers.com.au>



On 21.03.2017 at 11:54, Adam Goryachev wrote:
> On 21/3/17 20:54, Reindl Harald wrote:
>> On 21.03.2017 at 03:33, Jeff Allison wrote:
>>> I don't have a spare SATA slot I do however have a spare USB carrier,
>>> is that fast enough to be used temporarily?
>>
>> USB3 yes; USB2 is no fun, because the speed of the array depends
>> on the slowest disk in the set
>>
>> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
>> the same problems - during a rebuild you have a lot of random-IO load
>> on all remaining disks, which leads to bad performance and makes it
>> more likely that another disk fails before the rebuild is finished.
>> RAID6 produces even more random IO because of the double parity, and
>> if you hit an Unrecoverable Read Error on RAID5 you are dead; RAID6 is
>> not much better here, and a URE becomes more likely with larger disks
>>
>> RAID10: little to no performance impact during a rebuild and no
>> random IO caused by the rebuild; it's just "read a disk from start to
>> end and write the data linearly to another disk", while the only head
>> movement on your disks comes from the normal workload on the array
>>
>> with disks of 2 TB or larger you can draw the conclusion "do not use
>> RAID5/6 anymore, and if you do, be prepared that you won't survive a
>> rebuild caused by a failed disk"
>>
>>
> I can't say I'm an expert in this, but in actual fact, I disagree with
> both your arguments against RAID6...
> You say recovery on a RAID10 is a simple linear read from one drive (the
> surviving member of the RAID1 portion) and a linear write on the other
> (the replaced drive). You also declare that there is no random IO with
> normal work load + recovery. I think you have forgotten that the "normal
> workload" is probably random IO, but certainly once combined with the
> recovery IO then it will be random IO.

but the point is that with RAID5/6 the recovery itself is *heavy random
IO*, and that gets *combined* with the random IO of the normal workload,
and that means *heavy load on the disks*

> In addition, you claim that a drive larger than 2TB is almost certainly
> going to suffer from a URE during recovery, yet this is exactly the
> situation you will be in when trying to recover a RAID10 with member
> devices 2TB or larger. A single URE on the surviving portion of the
> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
> 3 URE's on the three remaining members of the RAID6 will not cause more
> than a hiccup (as long as no more than one URE on the same stripe, which
> I would argue is ... exceptionally unlikely).

given that your disks are the same age, errors on another disk become
more likely once one has failed, and the recovery of a RAID6 takes
*many hours* during which you have heavy IO on *all disks*, compared
with a much faster restore of RAID1/10 - guess in which case a URE
is more likely

additionally, why should the whole array fail just because a single block
gets lost? there is no parity which needs to be calculated, you just lost
a single block somewhere - RAID1/10 are way simpler in their implementation

> In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2
> drive failure without data loss, yet with 4 disk RAID10 you have a 50%
> chance of surviving a 2 drive failure.

yeah, and you *need that* when it takes many hours or a few days until
your 8 TB RAID6 is resynced, while the whole time *all disks* are under
heavy stress

> Sure, there are other things to consider (performance, cost, etc) but on
> a reliability point, RAID6 seems to be the far better option

*no* - it takes twice as long to recalculate from parity and stresses
the remaining disks twice as hard as RAID5, so pretty soon you end up
having lost both of the disks you can afford to lose before the array
goes down, while you still have many hours of recovery time remaining

here you go: http://www.zdnet.com/article/why-raid-6-stops-working-in-2019/


* Re: proactive disk replacement
From: Adam Goryachev @ 2017-03-21 10:54 UTC (permalink / raw)
  To: Reindl Harald, Jeff Allison; +Cc: linux-raid
In-Reply-To: <f0916e66-8ea7-3363-3600-1d2cd68e85af@thelounge.net>



On 21/3/17 20:54, Reindl Harald wrote:
>
>
> On 21.03.2017 at 03:33, Jeff Allison wrote:
>> I don't have a spare SATA slot I do however have a spare USB carrier,
>> is that fast enough to be used temporarily?
>
> USB3 yes; USB2 is no fun, because the speed of the array depends
> on the slowest disk in the set
>
> and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
> the same problems - during a rebuild you have a lot of random-IO load
> on all remaining disks, which leads to bad performance and makes it
> more likely that another disk fails before the rebuild is finished.
> RAID6 produces even more random IO because of the double parity, and
> if you hit an Unrecoverable Read Error on RAID5 you are dead; RAID6 is
> not much better here, and a URE becomes more likely with larger disks
>
> RAID10: little to no performance impact during a rebuild and no
> random IO caused by the rebuild; it's just "read a disk from start to
> end and write the data linearly to another disk", while the only head
> movement on your disks comes from the normal workload on the array
>
> with disks of 2 TB or larger you can draw the conclusion "do not use
> RAID5/6 anymore, and if you do, be prepared that you won't survive a
> rebuild caused by a failed disk"
>
I can't say I'm an expert in this, but in actual fact, I disagree with 
both your arguments against RAID6...
You say recovery on a RAID10 is a simple linear read from one drive (the 
surviving member of the RAID1 portion) and a linear write on the other 
(the replaced drive). You also declare that there is no random IO with 
normal work load + recovery. I think you have forgotten that the "normal 
workload" is probably random IO, but certainly once combined with the 
recovery IO then it will be random IO.

In addition, you claim that a drive larger than 2TB is almost certainly 
going to suffer from a URE during recovery, yet this is exactly the 
situation you will be in when trying to recover a RAID10 with member 
devices 2TB or larger. A single URE on the surviving portion of the 
RAID1 will cause you to lose the entire RAID10 array. On the other hand, 
3 URE's on the three remaining members of the RAID6 will not cause more 
than a hiccup (as long as no more than one URE on the same stripe, which 
I would argue is ... exceptionally unlikely).
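
As a rough sanity check on the URE argument, the chance of hitting at least one URE while reading a whole member can be estimated from the drive's quoted error rate. The sketch below assumes a rate of one unrecoverable error per 1e14 bits read (a figure often seen on consumer drive datasheets, not something stated in this thread) and treats errors as independent, which real drives will not strictly obey. The function name is made up for illustration.

```python
import math

def p_at_least_one_ure(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one URE when reading `bytes_read` bytes.

    Assumes independent bit errors at a constant rate (an idealisation).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12
for size_tb in (2, 3, 8):
    print(f"{size_tb} TB read: {p_at_least_one_ure(size_tb * TB):.1%}")
```

Under those assumptions a full read of a 2 TB member comes out at roughly 15%, which is a real risk but well short of "almost certain". The same read has to succeed on the surviving mirror half in a degraded RAID10, whereas a degraded RAID6 can still correct a URE from the remaining parity.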

In addition, with a 4 disk RAID6 you have a 100% chance of surviving a 2 
drive failure without data loss, yet with 4 disk RAID10 you have a 50% 
chance of surviving a 2 drive failure.
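
The two-drive-failure comparison can be checked by simply enumerating every pair of failed disks. This is a minimal sketch that assumes the 4-disk RAID10 is laid out as two independent mirror pairs (disks 0+1 and 2+3); the function names are hypothetical, and other md RAID10 layouts may place copies differently and give a different count.

```python
from itertools import combinations

def survives_raid6(failed):
    # RAID6 tolerates the loss of any two members.
    return len(failed) <= 2

def survives_raid10(failed, pairs=((0, 1), (2, 3))):
    # Assumed layout: two mirror pairs; data is lost only when both
    # halves of the same pair have failed.
    return not any(set(pair) <= set(failed) for pair in pairs)

def survival_fraction(survives, n_disks=4, n_failed=2):
    # Fraction of all possible failure combinations the array survives.
    cases = list(combinations(range(n_disks), n_failed))
    return sum(survives(c) for c in cases) / len(cases)

print("RAID6 :", survival_fraction(survives_raid6))
print("RAID10:", survival_fraction(survives_raid10))
```

For this assumed layout, RAID10 survives 4 of the 6 possible pairs rather than an even split, while RAID6 survives all 6, so the conclusion stands either way.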

Sure, there are other things to consider (performance, cost, etc) but on 
a reliability point, RAID6 seems to be the far better option.

Regards,
Adam
>> On 21 March 2017 at 01:59, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>>
>>>
>>> On 20/3/17 23:47, Jeff Allison wrote:
>>>>
>>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>>
>>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks
>>>> are getting a bit long in the tooth so before I get into problems I’ve
>>>> bought 4 new disks to replace them.
>>>>
>>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>>> suggestions.
>>>>
>>>> My current plan is just to replace the 2tb drives with the new 3tb drives
>>>> and move on, I’d like to do it on line with out having to trash the array
>>>> and start again, so does anyone have a game plan for doing that.
>>>
>>> Yes, do not fail a disk and then replace it, use the newer replace method
>>> (it keeps redundancy in the array).
>>> Even better would be to add a disk, and convert to RAID6, then add a second
>>> disk (using replace), and so on, then remove the last disk, grow the array
>>> to fill the 3TB, and then reduce the number of disks in the raid.
>>> This way, you end up with RAID6...
>>>>
>>>> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing
>>>> something else 6tb raid 10 or something I’m open to suggestions.
>>>
>>> I'd feel safer with RAID6, but it depends on your requirements. RAID10 is
>>> also a nice option, but, it depends...

* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-21  9:54 UTC (permalink / raw)
  To: Jeff Allison, Adam Goryachev; +Cc: linux-raid
In-Reply-To: <CAPrpM6wtQe=h1AE-PbFr0-DyZ_wRN7gvibjfn86W0mQz77xnLg@mail.gmail.com>



On 21.03.2017 at 03:33, Jeff Allison wrote:
> I don't have a spare SATA slot I do however have a spare USB carrier,
> is that fast enough to be used temporarily?

USB3 yes; USB2 is no fun, because the speed of the array depends on
the slowest disk in the set
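
To put rough numbers on the USB question, here is a back-of-the-envelope sketch of how long one full pass over a 2 TB member takes at a given sustained rate. The throughput figures (about 30 MB/s for USB2, about 150 MB/s for a spinning disk behind USB3) are assumptions for illustration, not measurements from this thread.

```python
def copy_hours(size_bytes, mb_per_s):
    """Hours needed to stream size_bytes at a sustained rate in MB/s."""
    return size_bytes / (mb_per_s * 10**6) / 3600

MEMBER = 2 * 10**12  # one 2 TB member

# Assumed sustained rates in MB/s; real carriers and disks will differ.
for name, rate in (("USB2", 30), ("USB3", 150)):
    print(f"{name}: ~{copy_hours(MEMBER, rate):.1f} h per member")
```

Since every member has to be cycled through the temporary slot in turn, the per-member time is multiplied by the number of disks being replaced.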

and about RAID5/RAID6 versus RAID10: both RAID5 and RAID6 suffer from
the same problems - during a rebuild you have a lot of random-IO load on
all remaining disks, which leads to bad performance and makes it more
likely that another disk fails before the rebuild is finished. RAID6
produces even more random IO because of the double parity, and if you
hit an Unrecoverable Read Error on RAID5 you are dead; RAID6 is not much
better here, and a URE becomes more likely with larger disks

RAID10: little to no performance impact during a rebuild and no random IO
caused by the rebuild; it's just "read a disk from start to end and
write the data linearly to another disk", while the only head movement
on your disks comes from the normal workload on the array

with disks of 2 TB or larger you can draw the conclusion "do not use
RAID5/6 anymore, and if you do, be prepared that you won't survive a
rebuild caused by a failed disk"

> On 21 March 2017 at 01:59, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>>
>>
>> On 20/3/17 23:47, Jeff Allison wrote:
>>>
>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>
>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks
>>> are getting a bit long in the tooth so before I get into problems I’ve
>>> bought 4 new disks to replace them.
>>>
>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>> suggestions.
>>>
>>> My current plan is just to replace the 2tb drives with the new 3tb drives
>>> and move on, I’d like to do it on line with out having to trash the array
>>> and start again, so does anyone have a game plan for doing that.
>>
>> Yes, do not fail a disk and then replace it, use the newer replace method
>> (it keeps redundancy in the array).
>> Even better would be to add a disk, and convert to RAID6, then add a second
>> disk (using replace), and so on, then remove the last disk, grow the array
>> to fill the 3TB, and then reduce the number of disks in the raid.
>> This way, you end up with RAID6...
>>>
>>> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing
>>> something else 6tb raid 10 or something I’m open to suggestions.
>>
>> I'd feel safer with RAID6, but it depends on your requirements. RAID10 is
>> also a nice option, but, it depends...


* Re: [PATCH v6 0/4] Broadcom SBA RAID support
From: Vinod Koul @ 2017-03-21  9:18 UTC (permalink / raw)
  To: Anup Patel
  Cc: Jassi Brar, Rob Herring, Mark Rutland, Herbert Xu,
	David S . Miller, Dan Williams, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, BCM Kernel Feedback, dmaengine, Device Tree,
	Linux ARM Kernel, Linux Kernel, linux-crypto, linux-raid
In-Reply-To: <CAALAos-k_DBYAz49278wBr_GoceOKYVH_UuzTLn1h7QCiUd8pg@mail.gmail.com>

On Tue, Mar 21, 2017 at 02:17:21PM +0530, Anup Patel wrote:
> On Tue, Mar 21, 2017 at 2:00 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> > On Mon, Mar 06, 2017 at 03:13:24PM +0530, Anup Patel wrote:
> >> The Broadcom SBA RAID is a stream-based device which provides
> >> RAID5/6 offload.
> >>
> >> It requires a SoC specific ring manager (such as Broadcom FlexRM
> >> ring manager) to provide ring-based programming interface. Due to
> >> this, the Broadcom SBA RAID driver (mailbox client) implements
> >> DMA device having one DMA channel using a set of mailbox channels
> >> provided by Broadcom SoC specific ring manager driver (mailbox
> >> controller).
> >>
> >> The Broadcom SBA RAID hardware requires PQ disk position instead
> >> of PQ disk coefficient. To address this, we have added raid_gflog
> >> table which will help driver to convert PQ disk coefficient to PQ
> >> disk position.
> >>
> >> This patchset is based on Linux-4.11-rc1 and depends on patchset
> >> "[PATCH v5 0/2] Broadcom FlexRM ring manager support"
> >
> > Okay I applied and was about to push when I noticed this :(
> >
> > So what is the status of this..?
> 
> PATCH2 is Acked but PATCH1 is under review. Currently, it's at
> v6 of that patchset.
> 
> The only dependency on that patchset is the changes in
> brcm-message.h which are required by this BCM-SBA-RAID
> driver.
> 
> @Jassi,
> Can you please have a look at PATCH v6?

And I would need an immutable branch/tag once merged. I am going to keep
this series pending till then.

-- 
~Vinod


* Re: [PATCH v6 0/4] Broadcom SBA RAID support
From: Anup Patel @ 2017-03-21  8:47 UTC (permalink / raw)
  To: Vinod Koul, Jassi Brar
  Cc: Mark Rutland, Device Tree, Herbert Xu, Scott Branden, Jon Mason,
	Ray Jui, Linux Kernel, linux-raid, Rob Herring,
	BCM Kernel Feedback, linux-crypto, Rob Rice, dmaengine,
	Dan Williams, David S . Miller, Linux ARM Kernel
In-Reply-To: <20170321083052.GY2843@localhost>

On Tue, Mar 21, 2017 at 2:00 PM, Vinod Koul <vinod.koul@intel.com> wrote:
> On Mon, Mar 06, 2017 at 03:13:24PM +0530, Anup Patel wrote:
>> The Broadcom SBA RAID is a stream-based device which provides
>> RAID5/6 offload.
>>
>> It requires a SoC specific ring manager (such as Broadcom FlexRM
>> ring manager) to provide ring-based programming interface. Due to
>> this, the Broadcom SBA RAID driver (mailbox client) implements
>> DMA device having one DMA channel using a set of mailbox channels
>> provided by Broadcom SoC specific ring manager driver (mailbox
>> controller).
>>
>> The Broadcom SBA RAID hardware requires PQ disk position instead
>> of PQ disk coefficient. To address this, we have added raid_gflog
>> table which will help driver to convert PQ disk coefficient to PQ
>> disk position.
>>
>> This patchset is based on Linux-4.11-rc1 and depends on patchset
>> "[PATCH v5 0/2] Broadcom FlexRM ring manager support"
>
> Okay I applied and was about to push when I noticed this :(
>
> So what is the status of this..?

PATCH2 is Acked but PATCH1 is under review. Currently, it's at
v6 of that patchset.

The only dependency on that patchset is the changes in
brcm-message.h which are required by this BCM-SBA-RAID
driver.

@Jassi,
Can you please have a look at PATCH v6?

Regards,
Anup


* Re: [PATCH v6 0/4] Broadcom SBA RAID support
From: Vinod Koul @ 2017-03-21  8:30 UTC (permalink / raw)
  To: Anup Patel
  Cc: Rob Herring, Mark Rutland, Herbert Xu, David S . Miller,
	Jassi Brar, Dan Williams, Ray Jui, Scott Branden, Jon Mason,
	Rob Rice, bcm-kernel-feedback-list, dmaengine, devicetree,
	linux-arm-kernel, linux-kernel, linux-crypto, linux-raid
In-Reply-To: <1488793408-25592-1-git-send-email-anup.patel@broadcom.com>

On Mon, Mar 06, 2017 at 03:13:24PM +0530, Anup Patel wrote:
> The Broadcom SBA RAID is a stream-based device which provides
> RAID5/6 offload.
> 
> It requires a SoC specific ring manager (such as Broadcom FlexRM
> ring manager) to provide ring-based programming interface. Due to
> this, the Broadcom SBA RAID driver (mailbox client) implements
> DMA device having one DMA channel using a set of mailbox channels
> provided by Broadcom SoC specific ring manager driver (mailbox
> controller).
> 
> The Broadcom SBA RAID hardware requires PQ disk position instead
> of PQ disk coefficient. To address this, we have added raid_gflog
> table which will help driver to convert PQ disk coefficient to PQ
> disk position.
> 
> This patchset is based on Linux-4.11-rc1 and depends on patchset
> "[PATCH v5 0/2] Broadcom FlexRM ring manager support"

Okay I applied and was about to push when I noticed this :(

So what is the status of this..?


-- 
~Vinod


* Re: [PATCH 24/29] drivers: convert iblock_req.pending from atomic_t to refcount_t
From: Nicholas A. Bellinger @ 2017-03-21  7:18 UTC (permalink / raw)
  To: Elena Reshetova
  Cc: gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux1394-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-bcache-u79uwXL29TY76Z2rM5mHXA,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	devel-tBiZLqfeLfOHmIFyCCdPziST3g8Odh+X,
	linux-pci-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA,
	fcoe-devel-s9riP+hp16TNLxjTenLetw,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	devel-gWbeCf7V1WCQmaza687I9mD2FQJk+8+b,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-serial-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Hans Liljestrand, Kees Cook, David Windsor
In-Reply-To: <1488810076-3754-25-git-send-email-elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Hi Elena,

On Mon, 2017-03-06 at 16:21 +0200, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
> 
> Signed-off-by: Elena Reshetova <elena.reshetova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Hans Liljestrand <ishkamiel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> Signed-off-by: David Windsor <dwindsor-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/target/target_core_iblock.c | 12 ++++++------
>  drivers/target/target_core_iblock.h |  3 ++-
>  2 files changed, 8 insertions(+), 7 deletions(-)

After reading up on this thread, it looks like various subsystem
maintainers are now picking these atomic_t -> refcount_t conversions..

That said, applied to target-pending/for-next and will plan to include
for v4.12-rc1 merge window.

Thanks!


* Re: on assembly and recovery of a hardware RAID
From: NeilBrown @ 2017-03-21  2:38 UTC (permalink / raw)
  To: Alfred Matthews, linux-raid
In-Reply-To: <CAAZLhTcNsjzLdEZyzG+kEuDhVye8pnuf6VhWzP=G0NL_aFX-5w@mail.gmail.com>

On Mon, Mar 20 2017, Alfred Matthews wrote:

>>> *** Checking Backup Volume Header:
>>> Unexpected Volume signature '  ' expected 'H+'
>>
>> Here the backup volume header, which is 2 blocks (blocks are 8K) from
>> the end of the device, looks wrong.
>> This probably means the chunk size is wrong.
>> I would suggest trying different chunksizes, starting at 4K and
>> doubling, until this message goes away.
>> That still might not be the correct chunk size, so I would continue up
>> to several megabytes and find all the chunksizes that seem to work.
>> Then look at what else hpfsck says on those.
>
> I'm not actually able to generate happy output in hpfsck using any of
> the following multiples of 4K
>
> 4
> [...]
> 8192
> 16384
> 32768
> 65536
> 131072
> 262144
> 524288
> 1048576
> 2097152
>
> Any chance it's not really an HFS system at all?

Not likely.  hpfsck finds a perfectly valid superblock (or "Volume
Header") at the start of the device.  It just cannot find the end one.

The blocksize is:
     blocksize       : 2000

which is in HEX, so 8K.
The total_blocks is:
     total_blocks    : 732482664

which are 8K blocks, so 5859861312K or 5.4TB (using 1024*1024*1024),
which matches the fact that each partition is 2.73TB.
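
The arithmetic above is easy to reproduce. A small sketch, using only the two values quoted from the hpfsck output:

```python
blocksize = int("2000", 16)   # hpfsck prints the block size in hex
total_blocks = 732482664      # total_blocks as reported by hpfsck

size_bytes = blocksize * total_blocks
print(blocksize)                       # block size in bytes
print(size_bytes // 1024)              # volume size in K
print(round(size_bytes / 1024**4, 2))  # volume size in units of 1024^4 bytes
```

The last figure comes out just under 5.46, i.e. twice 2.73, which is why the two partitions together look like the whole volume.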

The problem seems to be that we are not combining the two partitions
together in the correct way to create the original 5.4TB partition.

All we know is that the backup volume header should look
much like the main header, and particularly should have 'H+' in the
signature, which is the first 2 bytes.
i.e. the first two bytes of the volume headers should be
0x482B

The second (8K) block of the disk must look like this, and
the second last should as well.
If you can search through both devices for all 8K blocks which
start with 0x482B, that might give us a hint what to look for.
I would write a C program to do this.  It might take a while to run, but
you can test on the first device, as you know block 2 matches.
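
The search described above could be sketched as follows. This is Python rather than the C suggested, purely for brevity; the function name is hypothetical, and it assumes the signature sits at the very start of each block it inspects. If the header turns out not to be aligned to 8K, a smaller step such as 512 can be passed as `block`.

```python
BLOCK = 8192   # 8K blocks, as reported by hpfsck
SIG = b"H+"    # the two signature bytes, 0x48 0x2B

def find_signature_blocks(path, block=BLOCK, sig=SIG):
    """Yield the index of every block whose first bytes equal `sig`."""
    with open(path, "rb") as dev:
        index = 0
        while True:
            data = dev.read(block)
            if not data:
                break  # end of device or file
            if data.startswith(sig):
                yield index
            index += 1
```

Usage would be along the lines of `for i in find_signature_blocks("/dev/sdX1"): print(i, i * BLOCK)`, where the device path is a placeholder. It only reads, so it is safe to run against the real partitions, though a full pass over 2.73TB will take hours.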

Hmmm... I've got a new theory.  The code is broken.
fscheck_read_wrapper() in libhfsp/src/fscheck.c should set
vol->maxblocks.
It is set to a dummy value of '3' before this is called.
In the "signature == HFS_VOLHEAD_SIG" case it sets it properly,
but in the "signature == HFSP_VOLHEAD_SIG" case (which applies to you)
it doesn't.
So it tries to read the backup from block "3-2", or block 1.  And there
is nothing there.

How is your C coding?  You could
  apt-get source hfsplus
and hack the code and try to build it yourself....

NeilBrown


* Re: proactive disk replacement
From: Jeff Allison @ 2017-03-21  2:33 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid
In-Reply-To: <11c21a22-4bbf-7b16-5e64-8932be768c68@websitemanagers.com.au>

I don't have a spare SATA slot I do however have a spare USB carrier,
is that fast enough to be used temporarily?

On 21 March 2017 at 01:59, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
>
>
> On 20/3/17 23:47, Jeff Allison wrote:
>>
>> Hi all I’ve had a poke around but am yet to find something definitive.
>>
>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this disks
>> are getting a bit long in the tooth so before I get into problems I’ve
>> bought 4 new disks to replace them.
>>
>> I have a backup so if it all goes west I’m covered. So I’m looking for
>> suggestions.
>>
>> My current plan is just to replace the 2tb drives with the new 3tb drives
>> and move on, I’d like to do it on line with out having to trash the array
>> and start again, so does anyone have a game plan for doing that.
>
> Yes, do not fail a disk and then replace it, use the newer replace method
> (it keeps redundancy in the array).
> Even better would be to add a disk, and convert to RAID6, then add a second
> disk (using replace), and so on, then remove the last disk, grow the array
> to fill the 3TB, and then reduce the number of disks in the raid.
> This way, you end up with RAID6...
>>
>> Or is a 9tb raid 5 array the wrong thing to be doing and should I be doing
>> something else 6tb raid 10 or something I’m open to suggestions.
>
> I'd feel safer with RAID6, but it depends on your requirements. RAID10 is
> also a nice option, but, it depends...
>
> Regards,
> Adam
>


* Re: on assembly and recovery of a hardware RAID
From: Alfred Matthews @ 2017-03-20 21:42 UTC (permalink / raw)
  To: NeilBrown, linux-raid
In-Reply-To: <87lgs0cy0e.fsf@notabene.neil.brown.name>

>> *** Checking Backup Volume Header:
>> Unexpected Volume signature '  ' expected 'H+'
>
> Here the backup volume header, which is 2 blocks (blocks are 8K) from
> the end of the device, looks wrong.
> This probably means the chunk size is wrong.
> I would suggest trying different chunksizes, starting at 4K and
> doubling, until this message goes away.
> That still might not be the correct chunk size, so I would continue up
> to several megabytes and find all the chunksizes that seem to work.
> Then look at what else hpfsck says on those.

I'm not actually able to generate happy output in hpfsck using any of
the following multiples of 4K

4
[...]
8192
16384
32768
65536
131072
262144
524288
1048576
2097152

Any chance it's not really an HFS system at all?


* Re: Read data from disk that was part of RAID1 array
From: Wols Lists @ 2017-03-20 19:33 UTC (permalink / raw)
  To: Peter Sangas, linux-raid
In-Reply-To: <006a01d2a1a0$ac148170$043d8450$@wnsdev.com>

On 20/03/17 17:37, Peter Sangas wrote:
>> From: Wols Lists [mailto:antlists@youngman.org.uk]
>> NEVER NEVER NEVER use --create !!!
>>
>>
>> Use something like --assemble --force, which will set up a working array
>> if it can.
> 
> OK, I tried this command but received an error:
> 
> mdadm --assemble --force  /dev/md10 /dev/sdc   
> 
> "Cannot assemble mbr metadata in /dev/sdc, no superblock"
> 
> What command do you suggest...?
> 
That makes it sound like something had trashed the superblock, or maybe
it was sdc1, or something. Anyways, it worked for you, so hopefully the
question is academic.
> 
> 
>> If that had been an old array, with a different offset or superblock or
>> the like...
> 
> by "old" do you mean an array created using a different superblock format
> other than 1.2?
> 
Yes. The superblock format is v1. Whether it's v1.0, v1.1 or v1.2
depends on where the superblock is found. So if, for example, the array
had been created with a v1.0 superblock, the data would probably have
started at offset 2048, with the superblock at the end of the disk. v1.2
puts the superblock near the start, maybe offset 4096? So you would have
smashed some of your data, and also told the array to look in the wrong
place for the start of the data.
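
To make the "where the superblock is found" point concrete, here is a small sketch of the expected superblock position for each 1.x sub-version, in 512-byte sectors. It follows the mdadm man page description (1.0 at the end, 1.1 at the start, 1.2 at 4K from the start); the exact rounding used for 1.0 is my understanding of what md does and should be treated as an assumption. The function name is made up for illustration.

```python
def v1_superblock_sector(minor, device_sectors):
    """Sector at which a version-1.x md superblock is expected."""
    if minor == 0:
        # Near the end: at least 8K from the end, rounded down to 4K.
        return (device_sectors - 16) & ~7
    if minor == 1:
        return 0   # very start of the device
    if minor == 2:
        return 8   # 4K from the start
    raise ValueError(f"unknown 1.x minor version: {minor}")
```

This only says where the superblock lives. Where the data area starts is recorded in the superblock itself and has varied over the years, so the value reported by `mdadm --examine` on a surviving member is the thing to trust, not a guess.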

That's why --create is so dangerous - pick the wrong version and you can
damage the data, but even if you pick the right version, all the default
offsets and things have changed over the years (that's assuming they
haven't also been modified by general array management), so you can
easily lose where the data area starts.

Cheers,
Wol



* RE: Read data from disk that was part of RAID1 array
From: Peter Sangas @ 2017-03-20 17:37 UTC (permalink / raw)
  To: 'Wols Lists', linux-raid
In-Reply-To: <58CE5AF8.5070907@youngman.org.uk>

> From: Wols Lists [mailto:antlists@youngman.org.uk]
> NEVER NEVER NEVER use --create !!!
> 
> 
> Use something like --assemble --force, which will set up a working array
> if it can.

OK, I tried this command but received an error:

mdadm --assemble --force  /dev/md10 /dev/sdc   

"Cannot assemble mbr metadata in /dev/sdc, no superblock"

What command do you suggest...?



> If that had been an old array, with a different offset or superblock or
> the like...

by "old" do you mean an array created using a different superblock format
other than 1.2?

Thanks,
Pete



* Re: stripe_cache_size, some info
From: Gandalf Corvotempesta @ 2017-03-20 16:24 UTC (permalink / raw)
  To: Wols Lists; +Cc: linux-raid
In-Reply-To: <58D001BF.1030804@youngman.org.uk>

2017-03-20 17:22 GMT+01:00 Wols Lists <antlists@youngman.org.uk>:
> Get a cheap PCI(e) SATA card! You should be able to get something for
> around GBP20, and if it's only temporary who cares if drives and cables
> are left all over the place so long as the data is safe while you're
> updating. :-)

This assumes:

1) that I'm using SATA on my servers
2) that I can power down to add a new card
3) that I have enough HDD slots available

but

1) I usually use SAS
2) I can't power down the server when replacing a disk
3) our older servers have all slots full


* Re: stripe_cache_size, some info
From: Wols Lists @ 2017-03-20 16:22 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: linux-raid
In-Reply-To: <CAJH6TXhOokv1sOQL7KLe==rJQZ6J8D0gscQKJ5MPXSm5ZLXg9w@mail.gmail.com>

On 20/03/17 16:13, Gandalf Corvotempesta wrote:
> 2017-03-20 16:59 GMT+01:00 Wols Lists <antlists@youngman.org.uk>:
>> Burst speed, or sustained speed? Big difference ...
> 
> Both :)
> 
>> And I would avoid that entirely if I can - put the new disk in, do a
>> --replace, and then remove the old one. Doing a hotswap like that will
>> increase the stress on the array, and increased stress means another
>> disk is more likely to fail.
> 
> On newer server, i'll tend to avoid using all slots because of this.
> With at least 1 slot available, cool things could be done, like
> replace disks without compromize redundancy and so on.
> 
> But on older server, i don't have enough slot available and the only
> way to replace a disk is.... directly replace a disk :)
> 
Get a cheap PCI(e) SATA card! You should be able to get something for
around GBP20, and if it's only temporary who cares if drives and cables
are left all over the place so long as the data is safe while you're
updating. :-)

Cheers,
Wol


* Re: proactive disk replacement
From: Wols Lists @ 2017-03-20 16:19 UTC (permalink / raw)
  To: Adam Goryachev, Reindl Harald, Jeff Allison, linux-raid
In-Reply-To: <3df5e6da-6085-58fb-2811-cb4be843e676@websitemanagers.com.au>

On 20/03/17 15:23, Adam Goryachev wrote:
> 
> 
> On 21/3/17 02:04, Reindl Harald wrote:
>>
>>
>> On 20.03.2017 at 15:59, Adam Goryachev wrote:
>>> On 20/3/17 23:47, Jeff Allison wrote:
>>>> Hi all I’ve had a poke around but am yet to find something definitive.
>>>>
>>>> I have a raid 5 array of 4 disks amounting to approx 5.5tb. Now this
>>>> disks are getting a bit long in the tooth so before I get into
>>>> problems I’ve bought 4 new disks to replace them.
>>>>
>>>> I have a backup so if it all goes west I’m covered. So I’m looking for
>>>> suggestions.
>>>>
>>>> My current plan is just to replace the 2tb drives with the new 3tb
>>>> drives and move on, I’d like to do it on line with out having to trash
>>>> the array and start again, so does anyone have a game plan for doing
>>>> that.
>>> Yes, do not fail a disk and then replace it, use the newer replace
>>> method (it keeps redundancy in the array)
>>
>> how should it keep redundancy when you have to remove a disk anyways
>> except you have enough slots to at least temporary add a additional one?
> Yes, assuming you can (at least temporarily) add an additional disk,
> then you will not lose redundancy by using the replace instead of
> fail/add method.
> 
Take a look at the raid wiki. Especially this page ...

https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive

Okay, it's my work (unless people have come in since and edited it) but
I make a point of asking "the people who should know" to check my work
if I'm at all unsure. So this will have been looked over for mistakes by
various people on the list who either write the code or provide advice
and support.

And yes, as you can see from that page, I'd say add a new disk then
--replace it into the array. And upgrading the array to raid6 is a good
idea. But for Adam's way I think you need two extra temporary drive slots.
What I think you can do instead is this: on the new drives, make the
underlying partition the full 3TB. You can then replace all four drives.
So long as 2*3TB >= 3*2TB (don't laugh - it might not be!!!) you should
be able to reduce the number of drives to three then add the fourth back
to give raid6.
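
The 2*3TB >= 3*2TB check is worth doing with real sector counts rather than marketing sizes. A minimal sketch: the helper names are hypothetical, the sector counts are illustrative values of the kind whole 2TB and 3TB drives report (not Jeff's actual disks), and md metadata and partition overhead are ignored.

```python
def data_members(level, n_disks):
    """Number of members' worth of usable data for a given md level."""
    parity = {"raid5": 1, "raid6": 2}[level]
    return n_disks - parity

def array_sectors(level, n_disks, member_sectors):
    # Usable array size in sectors, ignoring metadata overhead.
    return data_members(level, n_disks) * member_sectors

# Illustrative member sizes in 512-byte sectors (assumed, not measured)
OLD_2TB = 3_907_029_168
NEW_3TB = 5_860_533_168

current = array_sectors("raid5", 4, OLD_2TB)  # existing 4 x 2TB raid5
target = array_sectors("raid6", 4, NEW_3TB)   # planned 4 x 3TB raid6
print("current:", current, "target:", target, "fits:", target >= current)
```

With these particular numbers the raid6 comes out a few megabytes smaller than the existing raid5, so the warning really is not a joke. Plug in the actual sizes (for example from `blockdev --getsz`) before committing to the plan; if it does not fit, the filesystem and array would have to be shrunk slightly first.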

The other thing is, if you've got the space for Adam's method, you could
always temporarily create a 4TB drive by combining 2*2TB in a raid0 -
probably best striped rather than linear.

Cheers,
Wol

