Linux RAID subsystem development


* Re: proactive disk replacement
From: Reindl Harald @ 2017-03-21 13:24 UTC (permalink / raw)
  To: David Brown, Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <58D126EB.7060707@hesbynett.no>



Am 21.03.2017 um 14:13 schrieb David Brown:
> On 21/03/17 12:03, Reindl Harald wrote:
>>
>> Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
> <snip>
>>
>>> In addition, you claim that a drive larger than 2TB is almost certainly
>>> going to suffer from a URE during recovery, yet this is exactly the
>>> situation you will be in when trying to recover a RAID10 with member
>>> devices 2TB or larger. A single URE on the surviving portion of the
>>> RAID1 will cause you to lose the entire RAID10 array. On the other hand,
>>> 3 URE's on the three remaining members of the RAID6 will not cause more
>>> than a hiccup (as long as no more than one URE on the same stripe, which
>>> I would argue is ... exceptionally unlikely).
>>
>> given that your disks have the same age, errors on another disk
>> become more likely once one has failed, and the recovery of a RAID6
>> takes *many hours* with heavy IO on *all disks*, compared with the
>> much faster restore of a RAID1/10 - guess in which case a URE is more
>> likely
>>
>> additionally, why should the whole array fail just because a single block
>> gets lost? there is no parity which needs to be calculated, you just lost a
>> single block somewhere - RAID1/10 are way easier in their implementation
>
> If you have RAID1, and you have an URE, then the data can be recovered
> from the other half of that RAID1 pair.  If you have had a disk failure
> (manual for replacement, or a real failure), and you get an URE on the
> other half of that pair, then you lose data.
>
> With RAID6, you need an additional failure (either another full disk
> failure or an URE in the /same/ stripe) to lose data.  RAID6 has higher
> redundancy than two-way RAID1 - of this there is /no/ doubt

yes, but with RAID5/RAID6 *all disks* are involved in the rebuild, with 
a 10 disk RAID10 only one disk needs to be read and the data written to 
the new one - all other disks are not involved in the resync at all

for most arrays the disks have a similar age and usage pattern, so when
the first one fails it becomes likely that another one will follow
before long, and so load and recovery time matter


* Re: proactive disk replacement
From: Gandalf Corvotempesta @ 2017-03-21 13:26 UTC (permalink / raw)
  To: David Brown; +Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid
In-Reply-To: <58D1244E.3040204@hesbynett.no>

2017-03-21 14:02 GMT+01:00 David Brown <david.brown@hesbynett.no>:
> Note that to cause failure in non-degraded RAID5 (or degraded RAID6),
> your two URE's need to be on the same stripe in order to cause data
> loss.  The chances of getting an URE somewhere on the disk are roughly
> proportional to the size of the disk - but the chance of getting an URE
> on the same stripe as another URE on another disk are basically
> independent of the disk size, and it is extraordinarily small.

A little bit OT:
is this the same for HW RAID controllers like LSI MegaRAID,
or do they tend to fail the rebuild in case of multiple UREs, even in
different stripes?

> No, you cannot.  Your conclusion here is based on several totally
> incorrect assumptions:
>
> 1. You think that RAID5/RAID6 recovery is more stressful, because the
> parity is "all over the place".  This is wrong.
>
> 2. You think that random IO has higher chance of getting an URE than
> linear IO.  This is wrong.

Totally agree.

> 3. You think that getting an URE on one disk, then getting an URE on a
> second disk, counts as a double failure that will break a single-parity
> redundancy (RAID5, RAID1, RAID6 in degraded mode).  This is wrong - it
> is only a problem if the two UREs are in the same stripe, which is quite
> literally a one in a million chance.

I'm not sure about this.
The posted paper is talking about "standard" raid made with hw raid controllers
and I'm not sure if they are able to finish a rebuild in case of double URE even
if coming from different stripes.

I think they fail the whole rebuild.


* Re: proactive disk replacement
From: David Brown @ 2017-03-21 14:15 UTC (permalink / raw)
  To: Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <09f4c794-8b17-05f5-10b7-6a3fa515bfa9@thelounge.net>

On 21/03/17 14:24, Reindl Harald wrote:
> 
> 
> Am 21.03.2017 um 14:13 schrieb David Brown:
>> On 21/03/17 12:03, Reindl Harald wrote:
>>>
>>> Am 21.03.2017 um 11:54 schrieb Adam Goryachev:
>> <snip>
>>>
>>>> In addition, you claim that a drive larger than 2TB is almost certainly
>>>> going to suffer from a URE during recovery, yet this is exactly the
>>>> situation you will be in when trying to recover a RAID10 with member
>>>> devices 2TB or larger. A single URE on the surviving portion of the
>>>> RAID1 will cause you to lose the entire RAID10 array. On the other
>>>> hand,
>>>> 3 URE's on the three remaining members of the RAID6 will not cause more
>>>> than a hiccup (as long as no more than one URE on the same stripe,
>>>> which
>>>> I would argue is ... exceptionally unlikely).
>>>
>>> given that your disks have the same age, errors on another disk
>>> become more likely once one has failed, and the recovery of a RAID6
>>> takes *many hours* with heavy IO on *all disks*, compared with the
>>> much faster restore of a RAID1/10 - guess in which case a URE is more
>>> likely
>>>
>>> additionally, why should the whole array fail just because a single block
>>> gets lost? there is no parity which needs to be calculated, you just lost a
>>> single block somewhere - RAID1/10 are way easier in their implementation
>>
>> If you have RAID1, and you have an URE, then the data can be recovered
>> from the other half of that RAID1 pair.  If you have had a disk failure
>> (manual for replacement, or a real failure), and you get an URE on the
>> other half of that pair, then you lose data.
>>
>> With RAID6, you need an additional failure (either another full disk
>> failure or an URE in the /same/ stripe) to lose data.  RAID6 has higher
>> redundancy than two-way RAID1 - of this there is /no/ doubt
> 
> yes, but with RAID5/RAID6 *all disks* are involved in the rebuild, with
> a 10 disk RAID10 only one disk needs to be read and the data written to
> the new one - all other disks are not involved in the resync at all

True...

> 
> for most arrays the disks have a similar age and usage pattern, so when
> the first one fails it becomes likely that another one will follow
> before long, and so load and recovery time matter

False.  There is no reason to suspect that - certainly not to within the
hours or day it takes to rebuild your array.  Disk failure pattern shows
a peak within the first month or so (failures due to manufacturing or
handling), then a very low error rate for a few years, then a gradually
increasing rate after that.  There is not a very significant correlation
between drive failures within the same system, nor is there a very
significant correlation between usage and failures.  It might seem
reasonable to suspect that a drive is more likely to fail during a
rebuild since the disk is being heavily used, but that does not appear
to be the case in practice.  You will /spot/ more errors at that point - simply
because you don't see errors in parts of the disk that are not read -
but the rebuilding does not cause them.

And even if it /were/ true, then the key point is if there is an error
that causes data loss.  An error during reading for a RAID1 rebuild
means lost data.  An error during reading for a RAID6 rebuild means you
have to read an extra sector from another disk and correct the mistake.


* Re: proactive disk replacement
From: David Brown @ 2017-03-21 14:26 UTC (permalink / raw)
  To: Gandalf Corvotempesta
  Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid
In-Reply-To: <CAJH6TXih4wv10WDGOA2PT-b8FSetx06D237HS_cpT4+Ap0d0dg@mail.gmail.com>

On 21/03/17 14:26, Gandalf Corvotempesta wrote:
> 2017-03-21 14:02 GMT+01:00 David Brown <david.brown@hesbynett.no>:
>> Note that to cause failure in non-degraded RAID5 (or degraded RAID6),
>> your two URE's need to be on the same stripe in order to cause data
>> loss.  The chances of getting an URE somewhere on the disk are roughly
>> proportional to the size of the disk - but the chance of getting an URE
>> on the same stripe as another URE on another disk are basically
>> independent of the disk size, and it is extraordinarily small.
> 
> A little bit OT:
> is this the same for HW RAID controllers like LSI MegaRAID,
> or do they tend to fail the rebuild in case of multiple UREs, even in
> different stripes?

It should be true, for decent HW RAID setups.  One possible problem is
the famous re-read timeouts - if you use a consumer hard drive with long
re-read timeouts, and have not (or cannot) configure it to have a short
timeout, then a hardware RAID controller might consider a drive to be
completely dead while the drive is simply spending 30 seconds re-trying
its read.  If the raid controller drops the drive, then it is like an
URE in /all/ stripes at once!

> 
>> No, you cannot.  Your conclusion here is based on several totally
>> incorrect assumptions:
>>
>> 1. You think that RAID5/RAID6 recovery is more stressful, because the
>> parity is "all over the place".  This is wrong.
>>
>> 2. You think that random IO has higher chance of getting an URE than
>> linear IO.  This is wrong.
> 
> Totally agree.
> 
>> 3. You think that getting an URE on one disk, then getting an URE on a
>> second disk, counts as a double failure that will break a single-parity
>> redundancy (RAID5, RAID1, RAID6 in degraded mode).  This is wrong - it
>> is only a problem if the two UREs are in the same stripe, which is quite
>> literally a one in a million chance.
> 
> I'm not sure about this.
> The posted paper is talking about "standard" raid made with hw raid controllers
> and I'm not sure if they are able to finish a rebuild in case of double URE even
> if coming from different stripes.
> 
> I think they fail the whole rebuild.
> 

I cannot imagine why that would be the case.

Suppose you have seven drive RAID6, with data blocks ABCDE and parities
PQ.  To make it simpler, assume that on this particular stripe, the
order is ABCDEPQ.  If drive 5 has failed and you are rebuilding, the
RAID system will read in ABCD-P-.  It will not read from drive 5 (since
you are rebuilding it), and it will not bother reading drive 7 because
it doesn't need the Q parity (it /might/ read it in as part of a
streamed read).  It calculates E from ABCD and P, and writes it out.
If, for example, drive 3 gets an URE at this point then it will read the
Q parity and calculate C and E from ABD P and Q.  It will write out E to
the rebuild drive, and also C to the drive with the URE - the drive will
handle sector relocation as needed.  The result is that the stripe
ABCDEPQ is correct on the disk.  The drive with the URE will not be
dropped from the array.
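
The arithmetic behind that example can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions: it uses
the usual RAID6 construction (P is the XOR of the data blocks, Q is a
weighted sum in GF(2^8) with polynomial 0x11d and generator 2), the
helper names are made up for this sketch, and it is not the md code.

```python
# Toy RAID6 stripe: five data blocks A..E plus P and Q (assumed layout ABCDEPQ).
# GF(2^8) log/antilog tables for polynomial 0x11d, generator 2.
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Divide a by a non-zero b in GF(2^8)."""
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 255]


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


def compute_pq(data):
    """Return (P, Q) for a list of equally sized data blocks."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for idx, blk in enumerate(data):
        coef = EXP[idx]                      # g^idx
        for i, byte in enumerate(blk):
            p[i] ^= byte
            q[i] ^= gf_mul(coef, byte)
    return bytes(p), bytes(q)


def recover_two_data(data, x, y, p, q):
    """Rebuild data blocks x and y (given as None in data) from P and Q."""
    n = len(p)
    known = [blk if blk is not None else bytes(n) for blk in data]
    pxy, qxy = compute_pq(known)             # parities of the surviving blocks
    gx, gy = EXP[x], EXP[y]
    dx, dy = bytearray(n), bytearray(n)
    for i in range(n):
        a = p[i] ^ pxy[i]                    # Dx + Dy
        b = q[i] ^ qxy[i]                    # g^x*Dx + g^y*Dy
        dx[i] = gf_div(b ^ gf_mul(gy, a), gx ^ gy)
        dy[i] = a ^ dx[i]
    return bytes(dx), bytes(dy)


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]
p, q = compute_pq(data)

# Drive 5 is being rebuilt: E comes from A, B, C, D and P alone.
e_from_p = xor_blocks(data[0], data[1], data[2], data[3], p)

# Drive 3 also returns an URE on this stripe: C and E come from A, B, D, P, Q.
c, e = recover_two_data([data[0], data[1], None, data[3], None], 2, 4, p, q)
```

Each stripe is handled on its own in this model, which is why an URE in
one stripe has no bearing on the next one.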

Then it moves on to the next stripe, and repeats the process.  An URE
here is independent of an URE in the previous stripe, and errors can
again be corrected.

It is possible that if there are a large number of UREs from a drive,
that the RAID system will consider the whole drive bad and drop it.  But
other than that, UREs will be treated independently.


* Re: proactive disk replacement
From: Wols Lists @ 2017-03-21 15:25 UTC (permalink / raw)
  To: David Brown, Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <58D13598.50403@hesbynett.no>

On 21/03/17 14:15, David Brown wrote:
>> for most arrays the disks have a similar age and usage pattern, so when
>> the first one fails it becomes likely that another one will follow
>> before long, and so load and recovery time matter

> False.  There is no reason to suspect that - certainly not to within the
> hours or day it takes to rebuild your array.  Disk failure pattern shows
> a peak within the first month or so (failures due to manufacturing or
> handling), then a very low error rate for a few years, then a gradually
> increasing rate after that.  There is not a very significant correlation
> between drive failures within the same system, nor is there a very
> significant correlation between usage and failures.

Except your argument and the claim don't match. You're right - disk
failures follow the pattern you describe. BUT.

If the array was created from completely new disks, then the usage
patterns will be very similar, therefore there will be a statistical
correlation between failures as compared to the population as a whole.
(Bit like a false DNA match is much higher in an inbred town, than in a
cosmopolitan city of immigrants.)

EVEN WORSE. The probability of all the drives coming off the same batch,
and sharing the same systematic defects, is much much higher. One only
has to look at the Seagate 3TB Barracuda mess to see a perfect example.

In other words, IFF your array is built of a bunch of identical drives
all bought at the same time, the risk of multiple failure is
significantly higher. How significant that is I don't know, but it is a
very valid reason for replacing your drives at semi-random intervals.

(Completely off topic :-) but a real-world demonstrable example is
couples' initials. "Like chooses like" and if you compare a couple's
first initials against what you would expect from a random sample, there
is a VERY significant spike in couples that share the same initial.)

To put it bluntly, if your array consists of disks with near-identical
characteristics (including manufacturing batch), then your chances of
random multiple failure are noticeably increased. Is it worth worrying
about? If you can do something about it, of course!

Cheers,
Wol


* Re: proactive disk replacement
From: Wols Lists @ 2017-03-21 15:29 UTC (permalink / raw)
  To: David Brown, Reindl Harald, Jeff Allison, Adam Goryachev; +Cc: linux-raid
In-Reply-To: <58D1244E.3040204@hesbynett.no>

On 21/03/17 13:02, David Brown wrote:
> If you have to use two USB carriers for the whole process, try to make
> sure they are connected to separate root hubs so that they don't share
> the bandwidth.  This is not always just a matter of using two USB ports
> - sometimes two adjacent USB ports on a PC share an internal hub.

Having built a bunch of desktop pcs from parts, I'd say adjacent ports
almost certainly share an internal hub. Typically, a single mobo header
will run a wire to a double slot at the front, or a double slot at the
back. So plugging one in at the front, and one at the back, will get
round this unless it's actually just one hub in the ?northbridge.

Cheers,
Wol


* Re: proactive disk replacement
From: Wols Lists @ 2017-03-21 15:31 UTC (permalink / raw)
  To: David Brown, Gandalf Corvotempesta
  Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid
In-Reply-To: <58D1381E.1080101@hesbynett.no>

On 21/03/17 14:26, David Brown wrote:
> It is possible that if there are a large number of UREs from a drive,
> that the RAID system will consider the whole drive bad and drop it.  But
> other than that, UREs will be treated independently.

Doesn't mdadm have a setting that does exactly that? Too many UREs and
the drive gets dropped? I'm sure I've come across that interfering with
rebuilds.

Cheers,
Wol


* Re: proactive disk replacement
From: David Brown @ 2017-03-21 15:41 UTC (permalink / raw)
  To: Wols Lists, Reindl Harald, Adam Goryachev, Jeff Allison; +Cc: linux-raid
In-Reply-To: <58D145F9.1080405@youngman.org.uk>

On 21/03/17 16:25, Wols Lists wrote:
> On 21/03/17 14:15, David Brown wrote:
>>> for most arrays the disks have a similar age and usage pattern, so when
>>> the first one fails it becomes likely that another one will follow
>>> before long, and so load and recovery time matter
> 
>> False.  There is no reason to suspect that - certainly not to within the
>> hours or day it takes to rebuild your array.  Disk failure pattern shows
>> a peak within the first month or so (failures due to manufacturing or
>> handling), then a very low error rate for a few years, then a gradually
>> increasing rate after that.  There is not a very significant correlation
>> between drive failures within the same system, nor is there a very
>> significant correlation between usage and failures.
> 
> Except your argument and the claim don't match. You're right - disk
> failures follow the pattern you describe. BUT.
> 
> If the array was created from completely new disks, then the usage
> patterns will be very similar, therefore there will be a statistical
> correlation between failures as compared to the population as a whole.
> (Bit like a false DNA match is much higher in an inbred town, than in a
> cosmopolitan city of immigrants.)
> 
> EVEN WORSE. The probability of all the drives coming off the same batch,
> and sharing the same systematic defects, is much much higher. One only
> has to look at the Seagate 3TB Barracuda mess to see a perfect example.
> 
> In other words, IFF your array is built of a bunch of identical drives
> all bought at the same time, the risk of multiple failure is
> significantly higher. How significant that is I don't know, but it is a
> very valid reason for replacing your drives at semi-random intervals.
> 

There /is/ a bit of correlation for early-fail drives coming from the
same batch.  But there is little correlation for normal lifetime drives.

If you roll three dice and sum them, the sum will follow a nice
bell-shaped distribution.  If you pick another three dice and roll them,
their sum will follow the same distribution.  But there
is no correlation between the sums.
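
A quick simulation makes the dice point concrete.  This is just a
sketch: the sample size, the seed and the helper names are arbitrary
choices for the illustration.

```python
import random


def roll_sum(rng, dice=3):
    """Sum of `dice` fair six-sided dice."""
    return sum(rng.randint(1, 6) for _ in range(dice))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)                       # fixed seed, arbitrary
first = [roll_sum(rng) for _ in range(20000)]
second = [roll_sum(rng) for _ in range(20000)]

mean_first = sum(first) / len(first)
mean_second = sum(second) / len(second)
correlation = pearson(first, second)
print(f"means: {mean_first:.2f} and {mean_second:.2f}, "
      f"correlation: {correlation:+.4f}")
```

Both sets have the same mean and the same shape, yet knowing one sum
tells you nothing about the other.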

Similarly, maybe you figure out that there is a 10% chance of the drive
dying in the first month, 10% chance of it dying in the next three
years, then 30% for the fourth year, 40% for the fifth year, and 10%
spread out over the following years.  Multiple drives of the same type
bought at the same time, and run in the same conditions (usage patterns,
heat, humidity, etc.) will have the same expected lifetime curves.  But
if one drive fails in its fourth year, that does not affect the
probability of a second drive also failing in the same year - it is
basically independent.
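
To put a rough number on that, here is a small sketch using the 30%
fourth-year figure from above.  The other inputs are assumptions made up
for the illustration: failures are spread evenly over the year, the
drives fail independently, the rebuild takes one day, and nine drives
survive the first failure.

```python
import math


def p_fail_within(days, annual_fail_prob):
    """Chance that one drive fails within `days`, assuming a constant
    hazard over the year: survival(t) = (1 - p) ** (t / 365)."""
    return -math.expm1(days / 365.0 * math.log1p(-annual_fail_prob))


def p_any_of(n, p):
    """Chance that at least one of n independent drives fails."""
    return -math.expm1(n * math.log1p(-p))


REBUILD_DAYS = 1.0        # assumed rebuild time
SURVIVORS = 9             # assumed: ten-drive array, one already failed
ANNUAL_FAIL_PROB = 0.30   # the fourth-year figure used above

per_drive = p_fail_within(REBUILD_DAYS, ANNUAL_FAIL_PROB)
any_other = p_any_of(SURVIVORS, per_drive)
print(f"one given drive during the rebuild: {per_drive:.4%}")
print(f"any of {SURVIVORS} survivors:        {any_other:.4%}")
```

Under these assumptions the chance is about one in a thousand per
surviving drive, and a bit under 1% for all nine together, even in a
year where 30% of the drives are expected to die.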

Now, there will be a little bit of correlation, especially if there are
factors that may significantly affect reliability (such as someone
bumping the server).  But you are still extremely unlikely to find that
after one drive dies, a second drive dies on the same day or so (during
the rebuild) - it is possible, but it is very bad luck.  There is no
statistical basis for thinking that when one drive dies, it is likely
that another one will die too.

Of course, some types of failures can affect several drives - a
motherboard failure, power supply problem, or similar event could kill
all your disks at the same time.  RAID does not avoid the need for backups!

Also early death failures can be correlated with a bad production batch
- mixing different batches helps reduce the risk of total failure.
Similarly, mixing different disk types reduces the risk of total
failures due to systematic errors such as firmware bugs.

> (Completely off topic :-) but a real-world demonstrable example is
> couples' initials. "Like chooses like" and if you compare a couple's
> first initials against what you would expect from a random sample, there
> is a VERY significant spike in couples that share the same initial.)
> 
> To put it bluntly, if your array consists of disks with near-identical
> characteristics (including manufacturing batch), then your chances of
> random multiple failure are noticeably increased. Is it worth worrying
> about? If you can do something about it, of course!
> 


* Re: proactive disk replacement
From: Phil Turmel @ 2017-03-21 16:49 UTC (permalink / raw)
  To: David Brown, Wols Lists, Reindl Harald, Adam Goryachev,
	Jeff Allison
  Cc: linux-raid
In-Reply-To: <58D14998.1060601@hesbynett.no>

On 03/21/2017 11:41 AM, David Brown wrote:

> There /is/ a bit of correlation for early-fail drives coming from
> the same batch.  But there is little correlation for normal lifetime
> drives.
> 
> If you roll three dice and sum them, the expected sum will follow a
> nice Bell curve distribution.  If you pick another three dice and
> roll them, they will follow the same distribution for the expected
> sum.  But there is no correlation between the sums.

Let me add to this:

The correlation is effectively immaterial in a non-degraded raid5 and
singly-degraded raid6 because recovery will succeed as long as any two
errors are in different 4k block/sector locations.  And for non-degraded
raid6, all three UREs must occur in the same block/sector to lose
data. Some participants in this discussion need to read the statistical
description of this stuff here:

http://marc.info/?l=linux-raid&m=139050322510249&w=2

As long as you are 'check' scrubbing every so often (I scrub weekly),
the odds of catastrophe on raid6 are the odds of something *else* taking
out the machine or controller, not the odds of simultaneous drive
failures.

Phil


* Re: proactive disk replacement
From: Phil Turmel @ 2017-03-21 16:55 UTC (permalink / raw)
  To: David Brown, Reindl Harald, Jeff Allison, Adam Goryachev; +Cc: linux-raid
In-Reply-To: <58D1244E.3040204@hesbynett.no>

On 03/21/2017 09:02 AM, David Brown wrote:

> With RAID6 (or three-disk RAID1), you can tolerate /two/ URE's on
> the same stripe.  If you have failed a disk for replacement, you can 
> tolerate one URE.

One nit to pick here:  The UREs have to be in the same 4k block/sector,
not just in the same stripe.  The stripe cache and all parity
calculations are done on strips of 4k blocks, not whole N*chunk stripes.

That makes data loss even less likely.
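
A back-of-the-envelope sketch of the numbers involved.  The URE rate of
1 in 10^14 bits is an assumed spec-sheet style figure, not a
measurement, and the function names are only for this illustration.

```python
import math

URE_PER_BIT = 1e-14          # assumed unrecoverable read error rate per bit
BLOCK_BITS = 4096 * 8        # one 4k block


def expected_ures(disk_bytes, rate=URE_PER_BIT):
    """Expected number of UREs when reading a whole disk once."""
    return disk_bytes * 8 * rate


def p_unreadable(bits, rate=URE_PER_BIT):
    """Chance that a region of `bits` bits contains at least one URE."""
    return -math.expm1(bits * math.log1p(-rate))


def p_matching_block_bad(other_disks, p_block):
    """Chance that the same 4k block is unreadable on at least one of the
    other disks, given independent errors."""
    return -math.expm1(other_disks * math.log1p(-p_block))


p_block = p_unreadable(BLOCK_BITS)
for terabytes in (2, 8):
    size = terabytes * 10**12
    print(f"{terabytes} TB disk: expected UREs per full read = "
          f"{expected_ures(size):.2f}")
print(f"one given 4k block unreadable: {p_block:.2e}")
print(f"matching block bad on any of 5 other disks: "
      f"{p_matching_block_bad(5, p_block):.2e}")
```

The first figure grows with the disk size; the last two do not depend on
it at all, which is the point made earlier in the thread.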

Phil


* Re: proactive disk replacement
From: Phil Turmel @ 2017-03-21 17:00 UTC (permalink / raw)
  To: Wols Lists, David Brown, Gandalf Corvotempesta
  Cc: Reindl Harald, Jeff Allison, Adam Goryachev, linux-raid
In-Reply-To: <58D14764.60909@youngman.org.uk>

On 03/21/2017 11:31 AM, Wols Lists wrote:
> On 21/03/17 14:26, David Brown wrote:
>> It is possible that if there are a large number of UREs from a
>> drive, that the RAID system will consider the whole drive bad and
>> drop it.  But other than that, UREs will be treated independently.
> 
> Doesn't mdadm have a setting that does exactly that? Too many UREs
> and the drive gets dropped? I'm sure I've come across that
> interfering with rebuilds.

Yes.  MD maintains a per-member-device counter of read errors and drops
the device when the counter reaches 20 (twenty).  The counter is
decremented by 10 (ten) once an hour.  A short burst of less than 20
read errors will be tolerated, as long as they don't continue at more
than 10/hour.

Last I checked, this behavior is hard-coded.
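
A toy model of the counter as described above, to show how a burst and a
steady trickle play out.  It is written from the description in this
mail only, not from the md source; the class and method names are made
up, and the once-an-hour decay is modelled on whole-hour boundaries as a
simplification.

```python
class ReadErrorCounter:
    """Per-device read error counter with an hourly decay (toy model)."""

    def __init__(self, threshold=20, decay_per_hour=10):
        self.threshold = threshold
        self.decay_per_hour = decay_per_hour
        self.count = 0
        self.last_decay_hour = 0
        self.dropped = False

    def _decay(self, now_hours):
        """Apply one decay step for every whole hour that has passed."""
        elapsed = int(now_hours) - self.last_decay_hour
        if elapsed > 0:
            self.count = max(0, self.count - elapsed * self.decay_per_hour)
            self.last_decay_hour = int(now_hours)

    def record_error(self, now_hours):
        """Register one read error; return True once the device is dropped."""
        self._decay(now_hours)
        self.count += 1
        if self.count >= self.threshold:
            self.dropped = True
        return self.dropped


# A burst of 20 errors inside one hour: the 20th one drops the device.
burst = ReadErrorCounter()
burst_dropped = [burst.record_error(0.5) for _ in range(20)]

# Ten errors an hour for five hours: the counter never gets past 10.
steady = ReadErrorCounter()
for hour in range(5):
    for _ in range(10):
        steady.record_error(hour + 0.5)
```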

Phil


* Re: [RAID recovery] Unable to recover RAID5 array after disk failure
From: Phil Turmel @ 2017-03-21 17:08 UTC (permalink / raw)
  To: Olivier Swinkels; +Cc: linux-raid
In-Reply-To: <CAJ0QwkKbYuVFdkRn0ggFOqudQswNYnF+koVLSxQ_XTPQnLkhOg@mail.gmail.com>

Hi Olivier,

{ Sorry, lot of work travel lately /-: }

On 03/17/2017 03:25 PM, Olivier Swinkels wrote:

> Hi Phil,
> 
> Did you already have time to look at the results of the fsck check
> and ext2/3/4 superblock search?

Well, there are a few superblock candidates in your output, the lines
showing ".X4R" for the first four bytes, but that converts to a
timestamp of Sat, 14 Sep 2013 12:35:24 GMT.
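
For reference, this is how four bytes turn into that date.  A sketch
only: ext2/3/4 stores its timestamps as little-endian 32-bit seconds
since the epoch, the "." in a hex dump stands for an unprintable byte,
and the value 0x0c used for it here is an assumption worked backwards
from the date above, not read from your output.

```python
import struct
from datetime import datetime, timezone

# 'X', '4', 'R' are 0x58, 0x34, 0x52; the leading 0x0c is an assumed value
# for the unprintable byte shown as "." in the dump.
raw = bytes([0x0C, 0x58, 0x34, 0x52])

(seconds,) = struct.unpack("<I", raw)        # little-endian unsigned 32-bit
stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(seconds, stamp.strftime("%a, %d %b %Y %H:%M:%S GMT"))
```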

Ewww.

> I would really like some feedback, as I'm quite out of ideas.

You should look at the other devices, but with that timestamp, the odds
look very poor.

Sorry.  Possibly try photorec or similar raw data recovery tools.

Phil


* Re: [PATCHv2 2/2] super1: check and output faulty dev role
From: NeilBrown @ 2017-03-21 19:55 UTC (permalink / raw)
  To: Gioh Kim, jes.sorensen; +Cc: linux-raid, linux-kernel, Jack Wang
In-Reply-To: <1490003517-4216-3-git-send-email-gi-oh.kim@profitbricks.com>


On Mon, Mar 20 2017, Gioh Kim wrote:

> From: Jack Wang <jinpu.wang@profitbricks.com>
>
> Output the real dev role in examine_super1, it will help to
> find problem.
>
> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
> ---
>  super1.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/super1.c b/super1.c
> index f3520ac..c903371 100644
> --- a/super1.c
> +++ b/super1.c
> @@ -501,8 +501,10 @@ static void examine_super1(struct supertype *st, char *homehost)
>  #endif
>  	printf("   Device Role : ");
>  	role = role_from_sb(sb);
> -	if (role >= MD_DISK_ROLE_FAULTY)
> -		printf("spare\n");
> +	if (role == MD_DISK_ROLE_SPARE)
> +		printf("Spare\n");
> +	else if (role == MD_DISK_ROLE_FAULTY)
> +		printf("Faulty\n");
>  	else if (role == MD_DISK_ROLE_JOURNAL)
>  		printf("Journal\n");
>  	else if (sb->feature_map & __cpu_to_le32(MD_FEATURE_REPLACEMENT))
> -- 
> 2.5.0

I don't think the distinction between "faulty" and "spare" is really
useful here.  I used to report the difference and it turned out to be
confusing, so we stopped.

This is information stored on some other disk, not the one that is
spare-or-faulty.  All it needs to know is which other devices are
working.  It doesn't need to know about which devices aren't working and
why.
The distinction between 'faulty' and 'spare' is only relevant to the
device itself, and to the array as a whole.

We should probably get rid of the distinction between
MD_DISK_ROLE_FAULTY and MD_DISK_ROLE_SPARE.
Most places that test for it just test >= MD_DISK_ROLE_FAULTY.

NeilBrown


* [PATCH] raid5-ppl: silence a misleading warning message
From: Dan Carpenter @ 2017-03-21 20:43 UTC (permalink / raw)
  To: Shaohua Li, Artur Paszkiewicz; +Cc: linux-raid, kernel-janitors

The "need_cache_flush" variable is never initialized to false, so it may
hold an arbitrary value.  When the variable is true we print a warning
message at the end of the function, so the warning can appear spuriously.

Fixes: 3418d036c81d ("raid5-ppl: Partial Parity Log write logging implementation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 27bad3e2d7ce..86ea9addb51a 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -1070,7 +1070,7 @@ int ppl_init_log(struct r5conf *conf)
 	struct mddev *mddev = conf->mddev;
 	int ret = 0;
 	int i;
-	bool need_cache_flush;
+	bool need_cache_flush = false;
 
 	pr_debug("md/raid:%s: enabling distributed Partial Parity Log\n",
 		 mdname(conf->mddev));


* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
From: NeilBrown @ 2017-03-21 22:00 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz
In-Reply-To: <20170309090003.13298-4-artur.paszkiewicz@intel.com>


On Thu, Mar 09 2017, Artur Paszkiewicz wrote:

> Implement the calculation of partial parity for a stripe and PPL write
> logging functionality. The description of PPL is added to the
> documentation. More details can be found in the comments in raid5-ppl.c.
>
> Attach a page for holding the partial parity data to stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
>
> Partial parity is the xor of not modified data chunks of a stripe and is
> calculated as follows:
>
> - reconstruct-write case:
>   xor data from all not updated disks in a stripe
>
> - read-modify-write case:
>   xor old data and parity from all updated disks in a stripe
>
> Implement it using the async_tx API and integrate into raid_run_ops().
> It must be called when we still have access to old data, so do it when
> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> stored into sh->ppl_page.
>
> Partial parity is not meaningful for full stripe write and is not stored
> in the log or used for recovery, so don't attempt to calculate it when
> stripe has STRIPE_FULL_WRITE.
>
> Put the PPL metadata structures to md_p.h because userspace tools
> (mdadm) will also need to read/write PPL.
>
> Warn about using PPL with enabled disk volatile write-back cache for
> now. It can be removed once disk cache flushing before writing PPL is
> implemented.
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>

Sorry for the delay in getting to this for review...

> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> +					  struct stripe_head *sh)
> +{
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct ppl_io_unit *io;
> +	struct ppl_header *pplhdr;
> +
> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> +	if (!io)
> +		return NULL;
> +
> +	memset(io, 0, sizeof(*io));
> +	io->log = log;
> +	INIT_LIST_HEAD(&io->log_sibling);
> +	INIT_LIST_HEAD(&io->stripe_list);
> +	atomic_set(&io->pending_stripes, 0);
> +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> +
> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);

I'm trying to understand how these two mempool_alloc()s relate, and
particularly why the first one needs to be GFP_ATOMIC, while the second
one can safely be GFP_NOIO.
I see that the allocated memory is freed in different places:
header_page is freed from the bi_endio function as soon as the write
completes, while 'io' is freed later.  But I'm not sure that is enough
to make it safe.

When working with mempools, you need to assume that the pool only
contains one element, and that every time you call mempool_alloc(), it
waits for that one element to be available.  While that doesn't usually
happen, it is possible and if that case isn't handled correctly, the
system can deadlock.

If no memory is available when this mempool_alloc() is called, it will
block.  As it is called from the raid5d thread, the whole array will
block.  So this can only complete safely if the write request has
already been submitted - or if there is some other workqueue which
submits requests after a timeout or similar.
I don't see that in the code.  These ppl_io_unit structures can queue up
and are only submitted later by raid5d (I think).  So if raid5d waits
for one to become free, it will wait forever.

One easy way around this problem (assuming my understanding is correct)
is to just have a single mempool which allocates both a struct
ppl_io_unit and a page.  You would need to define your own alloc/free
routines for the pool but that is easy enough.

Then you only need a single mempool_alloc(), which can sensibly be
GFP_ATOMIC.
If that fails, you queue for later handling as you do now.  If it
succeeds, then you continue to use the memory without any risk of
deadlocking.

Thanks,
NeilBrown

> +	pplhdr = page_address(io->header_page);
> +	clear_page(pplhdr);
> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> +	pplhdr->signature = cpu_to_le32(ppl_conf->signature);
> +
> +	io->seq = atomic64_add_return(1, &ppl_conf->seq);
> +	pplhdr->generation = cpu_to_le64(io->seq);
> +
> +	return io;
> +}


* Fix bug in [md PATCH 02/15] md/raid5: simplfy delaying of writes while metadata is updated.
From: NeilBrown @ 2017-03-22  1:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148954711228.18641.2048575896322496918.stgit@noble>

Like all other MD_SB_CHANGE_* flags used in this patch, this should
be MD_SB_CHANGE_PENDING.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f990f74901d2..8c5365d7f470 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6241,7 +6241,7 @@ static void raid5_do_work(struct work_struct *work)
 			break;
 		handled += batch_size;
 		wait_event_lock_irq(mddev->sb_wait,
-				    !test_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags),
+				    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags),
 				    conf->device_lock);
 	}
 	pr_debug("%d stripes handled\n", handled);
-- 
2.12.0


* Fix bugs in [md PATCH 10/15] md/raid1: stop using bi_phys_segment
From: NeilBrown @ 2017-03-22  1:41 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148954711389.18641.6044680366998154084.stgit@noble>


Using r1_bio->sector in call_bio_endio is more
obviously-correct than bio->bi_iter.bi_sector,
though both should have the same value.

The inc_pending() call in handle_read_error() was missing.
One should always accompany a bio_inc_remaining.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/raid1.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e566407b196f..bea7f149c43c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -257,7 +257,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	 * Wake up any possible resync thread that waits for the device
 	 * to go idle.
 	 */
-	allow_barrier(conf, bi_sector);
+	allow_barrier(conf, r1_bio->sector);
 }
 
 static void raid_end_bio_io(struct r1bio *r1_bio)
@@ -2543,6 +2543,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 
 			r1_bio = alloc_r1bio(mddev, mbio, sectors_handled);
 			set_bit(R1BIO_ReadError, &r1_bio->state);
+			inc_pending(conf, r1_bio->sector);
 
 			goto read_more;
 		} else {
-- 
2.12.0



* [PATCH] percpu-refcount: support synchronous switch to atomic mode.
From: NeilBrown @ 2017-03-22  1:50 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter; +Cc: linux-kernel, Shaohua Li, Linux-RAID


percpu_ref_switch_to_atomic_sync() schedules the switch
to atomic mode, then waits for it to complete.

Also export percpu_ref_switch_to_* so they can be used from modules.

This will be used in md/raid to count the number of pending write
requests to an array.
We occasionally need to check if the count is zero, but most often
we don't care.
We always want updates to the counter to be fast, as in some cases
we count every 4K page.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c           | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 3a481a49546e..c13dceb87b60 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -99,6 +99,7 @@ int __must_check percpu_ref_init(struct percpu_ref *ref,
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_switch);
+void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9ac959ef4cae..d133ed43a375 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -260,6 +260,23 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
+EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
+
+/**
+ * percpu_ref_switch_to_atomic_sync - switch a percpu_ref to atomic mode
+ * @ref: percpu_ref to switch to atomic mode
+ *
+ * Schedule switching the ref to atomic mode, and wait for the
+ * switch to complete.  Caller must ensure that no other thread
+ * will switch back to percpu mode.
+ *
+ */
+void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
+{
+	percpu_ref_switch_to_atomic(ref, NULL);
+	wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
 
 /**
  * percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode
@@ -290,6 +307,7 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
+EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
 
 /**
  * percpu_ref_kill_and_confirm - drop the initial ref and schedule confirmation
-- 
2.12.0



* Improvement for [md PATCH 15/15] MD: use per-cpu counter for writes_pending
From: NeilBrown @ 2017-03-22  1:55 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <148954711465.18641.8222940807591984069.stgit@noble>


__ref_is_percpu() is documented as an internal interface,
so best not to use it.
We don't really need it, as ->sync_checkers is always 0
when the writes_pending is in per-cpu mode.

So change to test ->sync_checkers, and update comments to match.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/md.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index adf2b5bdfd67..b76ac563115e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2266,8 +2266,8 @@ static bool set_in_sync(struct mddev *mddev)
 		    percpu_ref_is_zero(&mddev->writes_pending)) {
 			mddev->in_sync = 1;
 			/*
-			 * Ensure in_sync is visible before switch back
-			 * to percpu
+			 * Ensure ->in_sync is visible before we clear
+			 * ->sync_checkers.
 			 */
 			smp_mb();
 			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
@@ -7920,7 +7920,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	smp_mb(); /* Match smp_mb in set_in_sync() */
 	if (mddev->safemode == 1)
 		mddev->safemode = 0;
-	if (mddev->in_sync || !__ref_is_percpu(&mddev->writes_pending, &notused)) {
+	if (mddev->in_sync || !mddev->sync_checkers) {
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
 			mddev->in_sync = 0;
-- 
2.12.0



* REALLY Fix bug in [md PATCH 02/15] md/raid5: simplfy delaying of writes while metadata is updated.
From: NeilBrown @ 2017-03-22  2:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <87tw6may1v.fsf@notabene.neil.brown.name>


Using r1_bio->sector in call_bio_endio is more
obviously-correct than bio->bi_iter.bi_sector,
though both should have the same value.

The inc_pending() call in handle_read_error() was missing.
One should always accompany a bio_inc_remaining.

Signed-off-by: NeilBrown <neilb@suse.com>
---

Sorry, I left an unused variable...
NeilBrown


 drivers/md/raid1.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e566407b196f..2e2043cdcbf2 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -247,7 +247,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
 {
 	struct bio *bio = r1_bio->master_bio;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
 		bio->bi_error = -EIO;
@@ -257,7 +256,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	 * Wake up any possible resync thread that waits for the device
 	 * to go idle.
 	 */
-	allow_barrier(conf, bi_sector);
+	allow_barrier(conf, r1_bio->sector);
 }
 
 static void raid_end_bio_io(struct r1bio *r1_bio)
@@ -2543,6 +2542,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
 
 			r1_bio = alloc_r1bio(mddev, mbio, sectors_handled);
 			set_bit(R1BIO_ReadError, &r1_bio->state);
+			inc_pending(conf, r1_bio->sector);
 
 			goto read_more;
 		} else {
-- 
2.12.0



* IMPROVEMENT for Improvement for [md PATCH 15/15] MD: use per-cpu counter for writes_pending
From: NeilBrown @ 2017-03-22  2:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <87lgryaxdq.fsf@notabene.neil.brown.name>


__ref_is_percpu() is documented as an internal interface,
so best not to use it.
We don't really need it, as ->sync_checkers is always 0
when the writes_pending is in per-cpu mode.

So change to test ->sync_checkers, and update comments to match.

Signed-off-by: NeilBrown <neilb@suse.com>
---

Sorry, there was an undefined variable in that version.

This one is better.

NeilBrown


 drivers/md/md.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index adf2b5bdfd67..aeed8adeb5f1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2266,8 +2266,8 @@ static bool set_in_sync(struct mddev *mddev)
 		    percpu_ref_is_zero(&mddev->writes_pending)) {
 			mddev->in_sync = 1;
 			/*
-			 * Ensure in_sync is visible before switch back
-			 * to percpu
+			 * Ensure ->in_sync is visible before we clear
+			 * ->sync_checkers.
 			 */
 			smp_mb();
 			set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
@@ -7901,7 +7901,6 @@ EXPORT_SYMBOL(md_done_sync);
  */
 void md_write_start(struct mddev *mddev, struct bio *bi)
 {
-	unsigned long __percpu *notused;
 	int did_change = 0;
 	if (bio_data_dir(bi) != WRITE)
 		return;
@@ -7920,7 +7919,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
 	smp_mb(); /* Match smp_mb in set_in_sync() */
 	if (mddev->safemode == 1)
 		mddev->safemode = 0;
-	if (mddev->in_sync || !__ref_is_percpu(&mddev->writes_pending, &notused)) {
+	if (mddev->in_sync || !mddev->sync_checkers) {
 		spin_lock(&mddev->lock);
 		if (mddev->in_sync) {
 			mddev->in_sync = 0;
-- 
2.12.0



* Re: REALLY Fix bug in [md PATCH 02/15] md/raid5: simplfy delaying of writes while metadata is updated.
From: NeilBrown @ 2017-03-22  2:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, hch
In-Reply-To: <87inn2avsj.fsf@notabene.neil.brown.name>

On Wed, Mar 22 2017, NeilBrown wrote:

> Using r1_bio->sector in call_bio_endio is more
> obviously-correct than bio->bi_iter.bi_sector,
> though both should have the same value.
>
> The inc_pending() call in handle_read_error() was missing.
> One should always accompany a bio_inc_remaining.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>
> Sorry, I left an unused variable...
> NeilBrown

Sorry again - I replied to the wrong email (not a good day...)
and gave the wrong subject.
This should be
  REALLY Fix bugs in [md PATCH 10/15] md/raid1: stop using bi_phys_segment

:-(

NeilBrown


* [PATCH] block: trace completion of all bios.
From: NeilBrown @ 2017-03-22  2:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-raid, dm-devel, Alasdair Kergon, Mike Snitzer,
	Shaohua Li, linux-kernel


Currently only dm and md/raid5 bios trigger trace_block_bio_complete().
Now that we have bio_chain(), it is not possible, in general, for a
driver to know when the bio is really complete.  Only bio_endio()
knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c        | 3 +++
 drivers/md/dm.c    | 1 -
 drivers/md/raid5.c | 8 --------
 3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..c89d83b3ca32 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1838,6 +1838,9 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	if (bio->bi_bdev)
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, bio->bi_error);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..f5f09ace690a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -810,7 +810,6 @@ static void dec_pending(struct dm_io *io, int error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			trace_block_bio_complete(md->queue, bio, io_error);
 			bio->bi_error = io_error;
 			bio_endio(bio);
 		}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9a3b7da34137..f684cb566721 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5141,8 +5141,6 @@ static void raid5_align_endio(struct bio *bi)
 	rdev_dec_pending(rdev, conf->mddev);
 
 	if (!error) {
-		trace_block_bio_complete(bdev_get_queue(raid_bi->bi_bdev),
-					 raid_bi, 0);
 		bio_endio(raid_bi);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
 			wake_up(&conf->wait_for_quiescent);
@@ -5727,10 +5725,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 		md_write_end(mddev);
 	remaining = raid5_dec_bi_active_stripes(bi);
 	if (remaining == 0) {
-
-
-		trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
-					 bi, 0);
 		bio_endio(bi);
 	}
 }
@@ -6138,8 +6132,6 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 	}
 	remaining = raid5_dec_bi_active_stripes(raid_bio);
 	if (remaining == 0) {
-		trace_block_bio_complete(bdev_get_queue(raid_bio->bi_bdev),
-					 raid_bio, 0);
 		bio_endio(raid_bio);
 	}
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
-- 
2.12.0



* Re: recovery does not complete after --add
From: Eyal Lebedinsky @ 2017-03-22  3:12 UTC (permalink / raw)
  To: list linux-raid
In-Reply-To: <b0a98fa0-b271-4ed4-c8fa-15cdafe63fb0@eyal.emu.id.au>

Bump.

I expect those in the know should find this simple to answer?

TIA

On 17/03/17 10:51, Eyal Lebedinsky wrote:
> This is a repost of the issue (from a month ago) that did not get a response then.
>
> Executive summary:
> After '--add'ing a new member a 'recovery' starts automatically but 'sync_max' is not reset
> and the recovery hangs part way through where sync_max happened to be. This is a 7 disk raid6.
>
> Is this a known issue? Was it fixed since? Did I do something wrong?
>
> This machine runs the older f19.
>     $ uname -a
> Linux e7.eyal.emu.id.au 3.14.27-100.fc19.x86_64 #1 SMP Wed Dec 17 19:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> mdadm was built from source:
>     $ sudo mdadm --version
> mdadm - v4.0 - 2017-01-09
>
> The long story:
>
> I had a disk fail in a raid6. After some 'pending' sectors were logged I decided to do a 'check'
> around that location by setting sync_min/max and echo 'check'. This is done with a script doing:
>     # echo 4336657408 >/sys/block/md127/md/sync_min
>     # echo 4339803136 >/sys/block/md127/md/sync_max
>     # echo check      >/sys/block/md127/md/sync_action
> The messages then say
>     Feb 18 13:46:31 e7 kernel: [  976.688691] md: data-check of RAID array md127
>     Feb 18 13:46:31 e7 kernel: [  976.693254] md: minimum _guaranteed_  speed: 150000 KB/sec/disk.
>     Feb 18 13:46:31 e7 kernel: [  976.699479] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
>     Feb 18 13:46:31 e7 kernel: [  976.709420] md: using 128k window, over a total of 3906885120k.
>     Feb 18 13:46:31 e7 kernel: [  976.715457] md: resuming data-check of md127 from checkpoint.
>
> Sure enough this elicited disk errors, but the disk did not recover and it was kicked out of the array.
> Moreover it became unresponsive. It needed a power cycle so I shutdown and rebooted the machine.
>
> messages:
>     ... many i/o errors then sdf completely disappeared ... errors at sectors 4337414{000,040,168}
>     Feb 18 13:47:08 e7 kernel: [ 1014.334781] md: super_written gets error=-5, uptodate=0
>     Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Disk failure on sdf1, disabling device.
>     Feb 18 13:47:08 e7 kernel: [ 1014.340024] md/raid:md127: Operation continuing on 6 devices.
>     Feb 18 13:47:08 e7 kernel: [ 1014.417307] md: md127: data-check interrupted.
>
> After a second power off/on, a check produced the same result. At this point I added a fresh disk:
>     $ sudo mdadm /dev/md127 --add /dev/sdj1
>     $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md127 : active raid6 sdj1[11] sdf1[7](F) sdi1[8] sde1[9] sdh1[12] sdc1[0] sdg1[13] sdd1[10]
>       19534425600 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [UUU_UUU]
>       [>....................]  recovery =  0.7% (29805572/3906885120) finish=509.2min speed=126880K/sec
>       bitmap: 7/30 pages [28KB], 65536KB chunk
>
> messages:
>     Feb 18 14:23:10 e7 kernel: [ 3177.183250] md: bind<sdj1>
>     Feb 18 14:23:10 e7 kernel: [ 3177.255529] md: recovery of RAID array md127
>     Feb 18 14:23:10 e7 kernel: [ 3177.259894] md: minimum _guaranteed_  speed: 150000 KB/sec/disk.
>     Feb 18 14:23:10 e7 kernel: [ 3177.265994] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
>     Feb 18 14:23:10 e7 kernel: [ 3177.275736] md: using 128k window, over a total of 3906885120k.
>
> However, the recovery stopped progressing at one point (my script logs /proc/mdstat every 10 seconds):
>     2017-02-18 20:02:48        [===========>.........]  recovery = 55.4% (2166229192/3906885120) finish=372.8min speed=77803K/sec
>     2017-02-18 20:02:58        [===========>.........]  recovery = 55.4% (2167083344/3906885120) finish=366.2min speed=79159K/sec
>     2017-02-18 20:03:08        [===========>.........]  recovery = 55.4% (2167819876/3906885120) finish=374.8min speed=77316K/sec
>     2017-02-18 20:03:18        [===========>.........]  recovery = 55.5% (2168520428/3906885120) finish=375.4min speed=77157K/sec
>     2017-02-18 20:03:28        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=489.4min speed=59194K/sec
>     2017-02-18 20:03:38        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=608.7min speed=47588K/sec
>     2017-02-18 20:03:48        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=728.1min speed=39786K/sec
>     2017-02-18 20:03:58        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=847.5min speed=34182K/sec
>     ... no progress anymore
>     2017-02-18 22:36:44        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110261.8min speed=262K/sec
>     2017-02-18 22:36:54        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110381.2min speed=262K/sec
>     2017-02-18 22:37:04        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110500.6min speed=262K/sec
>     2017-02-18 22:37:14        [===========>.........]  recovery = 55.5% (2168590848/3906885120) finish=110619.9min speed=261K/sec
>
> After some thinking I realised that it has paused at the point where the earlier 'check' failed. This was unexpected.
> I followed with
>     # echo 'max' >/sys/block/md127/md/sync_max
> the recovery now moves on:
>     2017-02-18 22:37:24        [===========>.........]  recovery = 55.5% (2168938500/3906885120) finish=117500.2min speed=246K/sec
>     2017-02-18 22:37:34        [===========>.........]  recovery = 55.5% (2169997568/3906885120) finish=105201.7min speed=275K/sec
>     2017-02-18 22:37:44        [===========>.........]  recovery = 55.5% (2171066120/3906885120) finish=90962.0min speed=318K/sec
>     2017-02-18 22:37:54        [===========>.........]  recovery = 55.5% (2172125192/3906885120) finish=269.9min speed=107101K/sec
>     2017-02-18 22:38:04        [===========>.........]  recovery = 55.6% (2173114372/3906885120) finish=272.1min speed=106165K/sec
>     2017-02-18 22:38:14        [===========>.........]  recovery = 55.6% (2174004224/3906885120) finish=287.3min speed=100492K/sec
>
> ### and it completed over six hours later:
>     Feb 19 04:49:16 e7 kernel: [55167.633100] md: md127: recovery done.
>
> TIA
>
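
For anyone hitting the same stall, the manual workaround in the report
above could be scripted roughly as follows. This is only a sketch:
clear_sync_max() is a made-up helper name, and it does nothing more
than the 'echo max' shown above. It takes the md sysfs directory as an
argument (for the array in this report that would be
/sys/block/md127/md) and needs root on a real system.

```python
from pathlib import Path


def clear_sync_max(md_dir):
    """Remove a leftover upper limit by writing 'max' to sync_max.

    Returns the value that was in place before, so the caller can log it.
    """
    sync_max = Path(md_dir) / "sync_max"
    previous = sync_max.read_text().strip()
    if previous != "max":
        sync_max.write_text("max\n")
    return previous
```

Calling it after an interrupted ranged check, before adding the new
member, would have avoided the pause described here.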

-- 
Eyal Lebedinsky (eyal@eyal.emu.id.au)


* Re: proactive disk replacement
From: NeilBrown @ 2017-03-22  4:16 UTC (permalink / raw)
  To: Andreas Klauer, Reindl Harald; +Cc: Adam Goryachev, Jeff Allison, linux-raid
In-Reply-To: <20170321124129.GA18865@metamorpher.de>

On Tue, Mar 21 2017, Andreas Klauer wrote:

> On Tue, Mar 21, 2017 at 01:03:22PM +0100, Reindl Harald wrote:
> >> the IO of a RAID5/6 rebuild is hardly linear because the information
> >> (data + parity) is spread all over the disks
>
> It's not "randomly" spread all over. The blocks are always where they belong.
>
> https://en.wikipedia.org/wiki/Standard_RAID_levels#/media/File:RAID_6.svg
>
> It's AAAA, BBBB, CCCC, DDDD. Not DBCA, BADC, ADBC, ...
>
> There is no random I/O involved here, at worst it will decide to not read 
> a parity block because it's not needed but that does not cause huge/random
> jumps for the HDD read heads.

RAID5 resync (after an unclean shutdown) does read the parity.
It reads all devices in parallel and checks parity.  Normally all the
parity is correct so it doesn't write at all.
Occasionally there might be incorrect parity, in which case the head
will seek back and write the correct parity.

RAID5 recovery (when a device was removed and a new device is added)
reads all the *other* devices in parallel, calculates the missing block
(parity or data) and writes it out to the replacement device.  All reads
and writes are sequential.
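
As a rough illustration of that calculation (a sketch only, not the md
code): with the single XOR parity that RAID5 uses, the missing block of
a stripe is the XOR of all the other blocks, whether the missing one
held data or parity. The function names and the tiny 2-byte blocks are
invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(stripe, missing):
    """Rebuild the block at index `missing` from every other block in the stripe."""
    others = [blk for i, blk in enumerate(stripe) if i != missing]
    return xor_blocks(others)


# One stripe: three data blocks plus their parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)
stripe = data + [parity]

# Recovery works the same way for a lost data block or a lost parity block.
assert recover_missing(stripe, 1) == data[1]
assert recover_missing(stripe, 3) == parity

# The resync check: XOR over the whole stripe is all zeroes when parity is correct.
assert xor_blocks(stripe) == bytes(2)
```

Each stripe only needs the blocks at the same offset on the other
devices, which is why the reads can proceed sequentially.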

NeilBrown


>
>> while in case of RAID1/10 it is really linear
>
> Actually RAID 10 has the most interesting layout choices... 
> to this day mdadm is unable to grow/convert some of these.
>
> In a RAID 10 rebuild the HDD might have to jump from end to start.
>
> Of course if you consider metadata updates (progress has to be 
> recorded somewhere?) then ALL rebuilds regardless of RAID level 
> are random I/O in a way.
>
> But such is the fate of a HDD, it's their bread and butter. 
> Any server that does anything other than "idle" does random I/O 24/7.
>
> If there was no other I/O (because the RAID is live during rebuild) 
> and no metadata updates (or external metadata) you could totally do 
> RAID0/1/5/6 rebuilds with tape drives. That's how random it is.
> RAID10 might need a rewind in between.
>
> Regards
> Andreas Klauer

