* Find mismatch in data blocks during raid6 repair
@ 2012-06-20 17:41 Robert Buchholz
2012-06-21 12:38 ` John Robinson
0 siblings, 1 reply; 13+ messages in thread
From: Robert Buchholz @ 2012-06-20 17:41 UTC (permalink / raw)
To: linux-raid
[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]
Hello list,
I have been looking into the repair functionality of the raid6
implementation in recent kernels to find out how the md driver handles
parity mismatches. As I understand it, the handle_parity_checks6
function simply regenerates P and Q from the data blocks if they do
not match. While this makes perfect sense for a single parity
mismatch, when both are wrong it may indicate an error in the data
blocks.
When repairing a full raid6 with no missing drives (raid-devices=n+2),
a single inconsistent data block could be detected: for each of the n
data blocks, assume its device is missing, recover the block from P,
generate Q' and compare it with the actual Q. If there is exactly one
block for which Q' equals Q, rewrite the data block in question.*
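(Purely to illustrate the idea, here is a standalone sketch of that
detection loop -- it is not md code, it ignores the on-disk layout
rotation, and gf_mul() would use lookup tables in any real
implementation; it only assumes the usual RAID-6 definitions
P = xor(D_i) and Q = sum(g^i * D_i) over GF(2^8) with g = 2:)

#include <stdint.h>

/* GF(2^8) multiply with the RAID-6 polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
        uint8_t r = 0;
        while (b) {
                if (b & 1)
                        r ^= a;
                a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
                b >>= 1;
        }
        return r;
}

/*
 * d[0..n-1] are the data blocks of one stripe, p and q the stored
 * parity blocks, len the chunk size.  For every block i: recover it
 * from P (plain XOR of the others), recompute Q with the recovered
 * block substituted and compare against the stored Q.  Returns the
 * index of the single block that makes P and Q consistent again,
 * or -1 if no block (or more than one) fits.
 */
static int find_bad_data_block(uint8_t **d, const uint8_t *p,
                               const uint8_t *q, int n, int len)
{
        int i, j, b, candidate = -1;

        for (i = 0; i < n; i++) {
                int match = 1;

                for (b = 0; b < len && match; b++) {
                        uint8_t rec = p[b], qb = 0, coef = 1;

                        for (j = 0; j < n; j++)
                                if (j != i)
                                        rec ^= d[j][b];
                        for (j = 0; j < n; j++) {
                                qb ^= gf_mul(coef, j == i ? rec : d[j][b]);
                                coef = gf_mul(coef, 2);
                        }
                        if (qb != q[b])
                                match = 0;
                }
                if (match) {
                        if (candidate >= 0)
                                return -1;      /* ambiguous */
                        candidate = i;
                }
        }
        return candidate;
}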
I understand the usual failure mode is to remove a drive from the
array, or to use I/O errors from the kernel to identify incorrect data
blocks. However, this assumes we recognize the error at the time and
thus know which data block is incorrect. That is not always the case:
the drives could be inconsistent after a multi-drive failure, an
unclean shutdown, bit rot, or because one RAID drive was replaced
outside the realm of md using dd(rescue).
Is there a reason this approach is not currently taken? The
performance implications seem to be low: there is no increase in I/O
(in fact, it may decrease, since write-back is reduced by up to a
factor of 2), and the number of parity calculations increases by a
factor of n only in the error case (both the error rate and n can be
assumed to be low).
Cheers
Robert
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-06-20 17:41 Find mismatch in data blocks during raid6 repair Robert Buchholz
@ 2012-06-21 12:38 ` John Robinson
2012-06-21 14:58 ` Robert Buchholz
0 siblings, 1 reply; 13+ messages in thread
From: John Robinson @ 2012-06-21 12:38 UTC (permalink / raw)
To: Robert Buchholz; +Cc: linux-raid
On 20/06/2012 18:41, Robert Buchholz wrote:
[...]
> When repairing a full raid6 with no missing drives (raid-devices=n+2), a
> single inconsistent data block could be detected
Yes, it could. See Neil Brown's blog post as to the many reasons why
this isn't implemented: http://neil.brown.name/blog/20100211050355
Cheers,
John.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-06-21 12:38 ` John Robinson
@ 2012-06-21 14:58 ` Robert Buchholz
2012-06-21 18:23 ` Piergiorgio Sartor
0 siblings, 1 reply; 13+ messages in thread
From: Robert Buchholz @ 2012-06-21 14:58 UTC (permalink / raw)
To: John Robinson; +Cc: linux-raid
[-- Attachment #1: Type: text/plain, Size: 856 bytes --]
Hello John,
On Thursday, June 21, 2012 01:38:38 PM John Robinson wrote:
> On 20/06/2012 18:41, Robert Buchholz wrote:
> [...]
>
> > When repairing a full raid6 with no missing drives
> > (raid-devices=n+2), a single inconsistent data block could be
> > detected
>
> Yes, it could. See Neil Brown's blog post as to the many reasons why
> this isn't implemented: http://neil.brown.name/blog/20100211050355
Thank you for the pointer, I had not found the article before.
I agree with Neil's premise: this should not be run on a mounted RAID,
as changing data blocks can be a problem, and it should not be the
default for resync. However, as both Neil and the commenters point
out, this is a valuable (offline) repair option.
Do you know whether the "smart" algorithm or the API necessary to
construct a user space program are on the agenda?
Cheers
Robert
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-06-21 14:58 ` Robert Buchholz
@ 2012-06-21 18:23 ` Piergiorgio Sartor
2012-06-29 18:16 ` Robert Buchholz
0 siblings, 1 reply; 13+ messages in thread
From: Piergiorgio Sartor @ 2012-06-21 18:23 UTC (permalink / raw)
To: Robert Buchholz; +Cc: John Robinson, linux-raid
Hi Robert,
On Thu, Jun 21, 2012 at 04:58:33PM +0200, Robert Buchholz wrote:
> Hello John,
>
> On Thursday, June 21, 2012 01:38:38 PM John Robinson wrote:
> > On 20/06/2012 18:41, Robert Buchholz wrote:
> > [...]
> >
> > > When repairing a full raid6 with no missing drives
> > > (raid-devices=n+2), a single inconsistent data block could be
> > > detected
> >
> > Yes, it could. See Neil Brown's blog post as to the many reasons why
> > this isn't implemented: http://neil.brown.name/blog/20100211050355
>
> Thank you for the pointer, I did not find the article before.
> I agree with Neil's premise, this should not be run on a mounted raid as
> changing data blocks can be a problem, and it should not be the default
> for resync. However, as both Neil and commenters point out, this is a
> valuable (offline) repair option.
>
> Do you know whether the "smart" algorithm or the API necessary to
> construct a user space program are on the agenda?
a tool is already available, albeit not too much
advertised (my fault, basically).
It is "raid6check" and it is included in "mdadm", at
least in the source repository.
It does *not* repair, but it will tell you which
disk (if possible) and which stripe is incorrect
in a non-degraded (of course) RAID-6.
Any improvement to the code is very very welcome.
Unfortunately I did not have time to follow up
on this, but a "repair" option might be interesting.
bye,
pg
>
> Cheers
>
> Robert
--
piergiorgio
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-06-21 18:23 ` Piergiorgio Sartor
@ 2012-06-29 18:16 ` Robert Buchholz
2012-06-30 11:48 ` Piergiorgio Sartor
0 siblings, 1 reply; 13+ messages in thread
From: Robert Buchholz @ 2012-06-29 18:16 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: John Robinson, linux-raid
[-- Attachment #1.1: Type: text/plain, Size: 1870 bytes --]
Hey Piergiorgio,
On Thursday, June 21, 2012 08:23:27 PM Piergiorgio Sartor wrote:
> a tool is already available, albeit not too much
> advertised (my fault, basically).
>
> It is "raid6check" and it is included in "mdadm", at
> least in the source repository.
That is nice. The tool really is not advertised enough, but I am happy
it exists. Compared to a standard "check" run on a raid, the output of
this is much more helpful in determining the source of a mismatch.
> It does *not* repair, but it will tell you which
> disk (if possible) and which stripe is incorrect
> from an non-degraded (of course) RAID-6.
>
> Any improvement to the code is very very welcome.
>
> Unfortunately I did not have time to follow up the
> thing, but a "repair" option might be interesting.
I'll attach a set of patches that add a repair mode to raid6check.
While the tool can currently detect the failure of a single slot, and
could repair it automatically, I chose to make repair an explicit
action. In fact, even the stripe number and the two slots to repair
are given via the command line.
So for example, given this output of raid6check (check mode):
Error detected at 1: possible failed disk slot: 5 --> /dev/sda1
Error detected at 2: possible failed disk slot: 3 --> /dev/sdb1
Error detected at 3: disk slot unknown
To regenerate 1 and 2, run:
raid6check /dev/md0 repair 1 5 3
raid6check /dev/md0 repair 2 5 3
(the repair arguments require you to always rebuild two blocks, one
of which should result in a no-op in these cases)
Since for stripe 3, two slots must be wrong, the admin has to provide a
guess (and could iterate guesses, provided proper stripe backups):
raid6check /dev/md0 repair 3 5 3
I'd very much like to add cache invalidation (of the data buffered
from the md device) to the tool; maybe someone has a pointer on how
to do that.
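(For reference, one way this might be done -- an untested sketch, not
part of the attached patches -- is to flush and then drop the kernel's
cached copy of the md block device after the repair:)

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Untested sketch: flush dirty buffers for the md device and ask the
 * kernel to drop its cached pages, so later reads see the repaired
 * blocks.  Error handling is deliberately minimal. */
static int invalidate_md_cache(const char *md_dev)
{
        int ret = 0;
        int fd = open(md_dev, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fsync(fd) != 0)
                ret = -1;
        if (ioctl(fd, BLKFLSBUF, 0) != 0)               /* buffer cache */
                ret = -1;
        if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
                ret = -1;                               /* page cache */
        close(fd);
        return ret;
}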
Cheers
Robert
[-- Attachment #1.2: 0001-Extract-function-to-generate-zeroes-and-expose-xor-f.patch --]
[-- Type: text/x-patch, Size: 1593 bytes --]
From 909cf4e15e6bc3eaca49b9d0bd09e55ecf20b059 Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Fri, 29 Jun 2012 19:15:45 +0200
Subject: [PATCH 1/3] Extract function to generate zeroes and expose xor function
---
restripe.c | 25 +++++++++++++++----------
1 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/restripe.c b/restripe.c
index 00e7a82..fe365a3 100644
--- a/restripe.c
+++ b/restripe.c
@@ -211,7 +211,7 @@ static int is_ddf(int layout)
}
-static void xor_blocks(char *target, char **sources, int disks, int size)
+void xor_blocks(char *target, char **sources, int disks, int size)
{
int i, j;
/* Amazingly inefficient... */
@@ -337,6 +337,19 @@ uint8_t *zero;
int zero_size;
/* Following was taken from linux/drivers/md/raid6recov.c */
+void ensure_zero_has_size(int chunk_size)
+{
+ if (zero == NULL || chunk_size > zero_size) {
+ if (zero)
+ free(zero);
+ zero = malloc(chunk_size);
+ if (zero)
+ memset(zero, 0, chunk_size);
+ zero_size = chunk_size;
+ }
+}
+
+
/* Recover two failed data blocks. */
void raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
uint8_t **ptrs)
@@ -510,15 +523,7 @@ int save_stripes(int *source, unsigned long long *offsets,
if (!tables_ready)
make_tables();
-
- if (zero == NULL || chunk_size > zero_size) {
- if (zero)
- free(zero);
- zero = malloc(chunk_size);
- if (zero)
- memset(zero, 0, chunk_size);
- zero_size = chunk_size;
- }
+ ensure_zero_has_size(chunk_size);
len = data_disks * chunk_size;
length_test = length / len;
--
1.7.3.4
[-- Attachment #1.3: 0002-Make-test-a-bash-script-as-it-is.patch --]
[-- Type: text/x-patch, Size: 454 bytes --]
From fcbfdbd2a692f84330a2506c1d57222b6bc5a252 Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Fri, 29 Jun 2012 19:19:29 +0200
Subject: [PATCH 2/3] Make test a bash script (as it is)
---
test | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/test b/test
index 80e1915..9340afe 100755
--- a/test
+++ b/test
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
#
# run test suite for mdadm
user=`id -un`
--
1.7.3.4
[-- Attachment #1.4: 0003-Repair-mode-for-raid6.patch --]
[-- Type: text/x-patch, Size: 10423 bytes --]
From 0afe4f87a63232fd5da52133222ab149163b324d Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Fri, 29 Jun 2012 19:32:49 +0200
Subject: [PATCH 3/3] Repair mode for raid6
In repair mode, raid6check will rewrite one single stripe
by regenerating the data (or parity) of two raid devices that
are specified via the command line.
If you need to rewrite just one slot, pick any other slot
at random.
Note that the repair option will change data on the disks
directly, so both the md layer above as well as any layers
above md (such as filesystems) may be accessing the stripe
data from cached buffers. Either instruct the kernel
to drop the caches or reassemble the raid after repair.
---
raid6check.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++--
tests/19raid6repair | 47 +++++++++++++++++++
2 files changed, 169 insertions(+), 5 deletions(-)
create mode 100644 tests/19raid6repair
diff --git a/raid6check.c b/raid6check.c
index d4b6085..95a0721 100644
--- a/raid6check.c
+++ b/raid6check.c
@@ -31,6 +31,12 @@ int geo_map(int block, unsigned long long stripe, int raid_disks,
int level, int layout);
void qsyndrome(uint8_t *p, uint8_t *q, uint8_t **sources, int disks, int size);
void make_tables(void);
+void ensure_zero_has_size(int chunk_size);
+void raid6_datap_recov(int disks, size_t bytes, int faila, uint8_t **ptrs);
+void raid6_2data_recov(int disks, size_t bytes, int faila, int failb,
+ uint8_t **ptrs);
+void xor_blocks(char *target, char **sources, int disks, int size);
+
/* Collect per stripe consistency information */
void raid6_collect(int chunk_size, uint8_t *p, uint8_t *q,
@@ -103,7 +109,8 @@ int raid6_stats(int *results, int raid_disks, int chunk_size)
int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
int raid_disks, int chunk_size, int level, int layout,
- unsigned long long start, unsigned long long length, char *name[])
+ unsigned long long start, unsigned long long length, char *name[],
+ int repair, int failed_disk1, int failed_disk2)
{
/* read the data and p and q blocks, and check we got them right */
char *stripe_buf = malloc(raid_disks * chunk_size);
@@ -180,10 +187,13 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
qsyndrome(p, q, (uint8_t**)blocks, data_disks, chunk_size);
diskP = geo_map(-1, start, raid_disks, level, layout);
+ diskQ = geo_map(-2, start, raid_disks, level, layout);
+ blocks[data_disks] = stripes[diskP];
+ blocks[data_disks+1] = stripes[diskQ];
+
if (memcmp(p, stripes[diskP], chunk_size) != 0) {
printf("P(%d) wrong at %llu\n", diskP, start);
}
- diskQ = geo_map(-2, start, raid_disks, level, layout);
if (memcmp(q, stripes[diskQ], chunk_size) != 0) {
printf("Q(%d) wrong at %llu\n", diskQ, start);
}
@@ -200,6 +210,80 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
if(disk == -65535) {
printf("Error detected at %llu: disk slot unknown\n", start);
}
+ if(repair == 1) {
+ printf("Repairing stripe %llu\n", start);
+ printf("Assuming slots %d (%s) and %d (%s) are incorrect\n", failed_disk1, name[failed_disk1], failed_disk2, name[failed_disk2]);
+
+ if (failed_disk1 == diskQ || failed_disk2 == diskQ) {
+ int failed_data;
+ if (failed_disk1 == diskQ)
+ failed_data = failed_disk2;
+ else
+ failed_data = failed_disk1;
+ printf("Repairing D/P(%d) and Q\n", failed_data);
+ int failed_block_index = geo_map(failed_data, start, raid_disks, level, layout);
+ char *all_but_failed_blocks[data_disks];
+ for (i=0; i < data_disks; i++)
+ if (failed_block_index == i)
+ all_but_failed_blocks[i] = stripes[diskP];
+ else
+ all_but_failed_blocks[i] = blocks[i];
+ xor_blocks(stripes[failed_data],
+ all_but_failed_blocks, data_disks, chunk_size);
+ qsyndrome(p, (uint8_t*)stripes[diskQ], (uint8_t**)blocks, data_disks, chunk_size);
+ } else {
+ ensure_zero_has_size(chunk_size);
+ if (failed_disk1 == diskP || failed_disk2 == diskP) {
+ int failed_data;
+ if (failed_disk1 == diskP)
+ failed_data = failed_disk2;
+ else
+ failed_data = failed_disk1;
+ int failed_block_index = geo_map(failed_data, start, raid_disks, level, layout);
+ printf("Repairing D(%d) and P\n", failed_data);
+ raid6_datap_recov(raid_disks, chunk_size, failed_block_index, (uint8_t**)blocks);
+ } else {
+ printf("Repairing D and D\n");
+ int failed_block_index1 = geo_map(failed_disk1, start, raid_disks, level, layout);
+ int failed_block_index2 = geo_map(failed_disk2, start, raid_disks, level, layout);
+ if (failed_block_index1 > failed_block_index2) {
+ int t = failed_block_index1;
+ failed_block_index1 = failed_block_index2;
+ failed_block_index2 = t;
+ }
+ raid6_2data_recov(raid_disks, chunk_size, failed_block_index1, failed_block_index2, (uint8_t**)blocks);
+ }
+ }
+ if(mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
+ err = 2;
+ goto exitCheck;
+ }
+ sig[0] = signal(SIGTERM, SIG_IGN);
+ sig[1] = signal(SIGINT, SIG_IGN);
+ sig[2] = signal(SIGQUIT, SIG_IGN);
+ rv = sysfs_set_num(info, NULL, "suspend_lo", start * chunk_size * data_disks);
+ rv |= sysfs_set_num(info, NULL, "suspend_hi", (start + 1) * chunk_size * data_disks);
+ lseek64(source[failed_disk1], offsets[failed_disk1] + start * chunk_size, 0);
+ write(source[failed_disk1], stripes[failed_disk1], chunk_size);
+ lseek64(source[failed_disk2], offsets[failed_disk2] + start * chunk_size, 0);
+ write(source[failed_disk2], stripes[failed_disk2], chunk_size);
+ rv |= sysfs_set_num(info, NULL, "suspend_lo", 0x7FFFFFFFFFFFFFFFULL);
+ rv |= sysfs_set_num(info, NULL, "suspend_hi", 0);
+ rv |= sysfs_set_num(info, NULL, "suspend_lo", 0);
+ signal(SIGQUIT, sig[2]);
+ signal(SIGINT, sig[1]);
+ signal(SIGTERM, sig[0]);
+ if(munlockall() != 0) {
+ err = 3;
+ goto exitCheck;
+ }
+
+ if(rv != 0) {
+ err = rv * 256;
+ goto exitCheck;
+ }
+ }
+
length--;
start++;
@@ -240,6 +324,8 @@ int main(int argc, char *argv[])
int chunk_size = 0;
int layout = -1;
int level = 6;
+ int repair = 0;
+ int failed_disk1, failed_disk2;
unsigned long long start, length;
int i;
int mdfd;
@@ -256,6 +342,7 @@ int main(int argc, char *argv[])
if (argc < 4) {
fprintf(stderr, "Usage: %s md_device start_stripe length_stripes\n", prg);
+ fprintf(stderr, " or: %s md_device repair stripe failed_slot_1 failed_slot_2\n", prg);
exit_err = 1;
goto exitHere;
}
@@ -321,8 +408,38 @@ int main(int argc, char *argv[])
raid_disks = info->array.raid_disks;
chunk_size = info->array.chunk_size;
layout = info->array.layout;
- start = getnum(argv[2], &err);
- length = getnum(argv[3], &err);
+ if (strcmp(argv[2], "repair")==0) {
+ if (argc < 6) {
+ fprintf(stderr, "For repair mode, call %s md_device repair stripe failed_slot_1 failed_slot_2\n", prg);
+ exit_err = 1;
+ goto exitHere;
+ }
+ repair = 1;
+ start = getnum(argv[3], &err);
+ length = 1;
+ failed_disk1 = getnum(argv[4], &err);
+ failed_disk2 = getnum(argv[5], &err);
+
+ if(failed_disk1 > info->array.raid_disks) {
+ fprintf(stderr, "%s: failed_slot_1 index is higher than number of devices in raid\n", prg);
+ exit_err = 4;
+ goto exitHere;
+ }
+ if(failed_disk2 > info->array.raid_disks) {
+ fprintf(stderr, "%s: failed_slot_2 index is higher than number of devices in raid\n", prg);
+ exit_err = 4;
+ goto exitHere;
+ }
+ if(failed_disk1 == failed_disk2) {
+ fprintf(stderr, "%s: failed_slot_1 and failed_slot_2 are the same\n", prg);
+ exit_err = 4;
+ goto exitHere;
+ }
+ }
+ else {
+ start = getnum(argv[2], &err);
+ length = getnum(argv[3], &err);
+ }
if (err) {
fprintf(stderr, "%s: Bad number: %s\n", prg, err);
@@ -380,7 +497,7 @@ int main(int argc, char *argv[])
int rv = check_stripes(info, fds, offsets,
raid_disks, chunk_size, level, layout,
- start, length, disk_name);
+ start, length, disk_name, repair, failed_disk1, failed_disk2);
if (rv != 0) {
fprintf(stderr,
"%s: check_stripes returned %d\n", prg, rv);
diff --git a/tests/19raid6repair b/tests/19raid6repair
new file mode 100644
index 0000000..ed7d52d
--- /dev/null
+++ b/tests/19raid6repair
@@ -0,0 +1,47 @@
+number_of_disks=4
+chunksize_in_kib=512
+chunksize_in_b=$[chunksize_in_kib*1024]
+array_data_size_in_kib=$[chunksize_in_kib*(number_of_disks-2)*number_of_disks]
+array_data_size_in_b=$[array_data_size_in_kib*1024]
+devs="$dev1 $dev2 $dev3 $dev4"
+
+# default 32 sectors
+data_offset_in_kib=$[32/2]
+
+for failure in "$dev3 3 3 2" "$dev3 3 2 3" "$dev3 3 2 1" "$dev3 3 2 0" "$dev4 3 3 0" "$dev4 3 3 1" "$dev4 3 3 2" \
+ "$dev1 3 0 1" "$dev1 3 0 2" "$dev1 3 0 3" "$dev2 3 1 0" "$dev2 3 1 2" "$dev2 3 1 3" ; do
+ failure_split=( $failure )
+ device_with_error=${failure_split[0]}
+ stripe_with_error=${failure_split[1]}
+ repair_params="$stripe_with_error ${failure_split[2]} ${failure_split[3]}"
+ start_of_errors_in_kib=$[data_offset_in_kib+chunksize_in_kib*stripe_with_error]
+
+ # make a raid5 from a file
+ dd if=/dev/urandom of=/tmp/RandFile bs=1024 count=$array_data_size_in_kib
+ mdadm -CR $md0 -l6 -n$number_of_disks -c $chunksize_in_kib $devs
+ dd if=/tmp/RandFile of=$md0 bs=1024 count=$array_data_size_in_kib
+ blockdev --flushbufs $md0; sync
+
+ check wait
+ blockdev --flushbufs $devs; sync
+ echo 3 > /proc/sys/vm/drop_caches
+ cmp -s -n $array_data_size_in_b $md0 /tmp/RandFile || { echo sanity cmp failed ; exit 2; }
+
+ dd if=/dev/urandom of=$device_with_error bs=1024 count=$chunksize_in_kib seek=$start_of_errors_in_kib
+ blockdev --flushbufs $device_with_error; sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ $dir/raid6check $md0 0 0 2>&1 | grep -qs "Error" || { echo should detect errors; exit 2; }
+
+ $dir/raid6check $md0 repair $repair_params > /dev/null || { echo repair failed; exit 2; }
+ blockdev --flushbufs $md0 $devs; sync
+ echo 3 > /proc/sys/vm/drop_caches
+
+ $dir/raid6check $md0 0 0 2>&1 | grep -qs "Error" && { echo errors detected; exit 2; }
+ cmp -s -n $array_data_size_in_b $md0 /tmp/RandFile || { echo cmp failed ; exit 2; }
+
+ mdadm -S $md0
+ udevadm settle
+ blockdev --flushbufs $md0 $devs; sync
+ echo 3 > /proc/sys/vm/drop_caches
+done
\ No newline at end of file
--
1.7.3.4
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-06-29 18:16 ` Robert Buchholz
@ 2012-06-30 11:48 ` Piergiorgio Sartor
2012-07-03 19:10 ` Robert Buchholz
0 siblings, 1 reply; 13+ messages in thread
From: Piergiorgio Sartor @ 2012-06-30 11:48 UTC (permalink / raw)
To: Robert Buchholz; +Cc: Piergiorgio Sartor, John Robinson, linux-raid
Hi Robert,
it is very good that you provided these patches!
Please see the comments below.
On Fri, Jun 29, 2012 at 08:16:13PM +0200, Robert Buchholz wrote:
[...]
> I'll attach a set of patches that add a repair mode to raid6check. While
I think, though I'm not sure, that Neil Brown
should/will apply the patches to the main tree.
In any case I expect a comment from him.
> the tool currently can detect failure of a single slot, and it could
> automatically repair that, I chose to make repair an explicit action.
> In fact, even the slice number and the two slots to repair are given via
> the command line.
>
> So for example, given this output of raid6check (check mode):
> Error detected at 1: possible failed disk slot: 5 --> /dev/sda1
> Error detected at 2: possible failed disk slot: 3 --> /dev/sdb1
> Error detected at 3: disk slot unknown
>
> To regenerate 1 and 2, run:
> raid6check /dev/md0 repair 1 5 3
> raid6check /dev/md0 repair 2 5 3
> (the repair arguments require you to always rebuild two blocks, one of
> which should result in a noop in these cases)
Why always two blocks?
> Since for stripe 3, two slots must be wrong, the admin has to provide a
Well, "unknown" means it is not possible to detect
which one(s).
It could be there are more than 2 corrupted.
The "unknown" case means that the only reasonable thing
would be to rebuild the parities, but nothing more can
be said about the status of the array.
Nevertheless, there is a possibility I was thinking
about, but never had time to implement (even though
the software already has some built-in infrastructure
for it): specifically, a "vertical" statistic.
That is, if there are mismatches and, for example, 90%
of them belong to /dev/sdX while the remaining 10% are
"unknown", then it could be possible to extrapolate
that, for the "unknown" ones, /dev/sdX must be fixed
anyway, and then re-check whether the status is still
"unknown" or some other disk shows up. If one disk is
reported, it could then be fixed.
In the other cases, the parity must be adjusted,
whatever this means in terms of data recovery.
Of course, this is just a statistical assumption, which
means a second, "aggressive" option would have to be
available, with all the warnings of the case.
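(A rough sketch of what I mean, with made-up names rather than
existing raid6check code: count, per slot, how often the check blamed
that slot, and only suggest a repair target for the "unknown" stripes
when one slot clearly dominates:)

/* Illustrative only: blamed[i] counts the stripes where the check
 * identified slot i, "unknown" counts the stripes it could not
 * attribute.  Return the slot to suggest for the "unknown" stripes,
 * or -1 when no slot dominates clearly enough (e.g. threshold 0.9). */
static int suggest_failed_slot(const unsigned long *blamed, int raid_disks,
                               unsigned long unknown, double threshold)
{
        unsigned long total = unknown;
        int i, best = -1;

        for (i = 0; i < raid_disks; i++) {
                total += blamed[i];
                if (best < 0 || blamed[i] > blamed[best])
                        best = i;
        }
        if (best < 0 || total == 0 || blamed[best] == 0 ||
            (double)blamed[best] / (double)total < threshold)
                return -1;
        return best;
}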
> guess (and could iterate guesses, provided proper stripe backups):
> raid6check /dev/md0 repair 3 5 3
Actually, this could also be an improvement, I mean
the possibility to back up stripes, so that other,
more advanced, recovery could be tried and reverted,
if necessary.
Finally, someone should consider using the optimized
raid6 code from the kernel module (can we link that
code directly?), in order to speed up the check/repair.
Thanks again for the patches, good job!
bye,
pg
> [...]
--
piergiorgio
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-06-30 11:48 ` Piergiorgio Sartor
@ 2012-07-03 19:10 ` Robert Buchholz
2012-07-03 20:27 ` Piergiorgio Sartor
0 siblings, 1 reply; 13+ messages in thread
From: Robert Buchholz @ 2012-07-03 19:10 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: John Robinson, linux-raid
[-- Attachment #1: Type: text/plain, Size: 4482 bytes --]
Hey Piergiorgio,
On Saturday, June 30, 2012 01:48:31 PM Piergiorgio Sartor wrote:
> > the tool currently can detect failure of a single slot, and it
> > could automatically repair that, I chose to make repair an
> > explicit action. In fact, even the slice number and the two slots
> > to repair are given via the command line.
> >
> > So for example, given this output of raid6check (check mode):
> > Error detected at 1: possible failed disk slot: 5 --> /dev/sda1
> > Error detected at 2: possible failed disk slot: 3 --> /dev/sdb1
> > Error detected at 3: disk slot unknown
> >
> > To regenerate 1 and 2, run:
> > raid6check /dev/md0 repair 1 5 3
> > raid6check /dev/md0 repair 2 5 3
> > (the repair arguments require you to always rebuild two blocks,
> > one of which should result in a noop in these cases)
>
> Why always two blocks?
The reason is simply to have fewer cases to handle in the code. There
are already three ways to regenerate two blocks (D&D, D/P&Q and D&P),
and there would be two more cases if only one block were to be
repaired. With the original patch, if you can repair two blocks, that
allows you to repair just one (plus one other in addition) as well.
> > Since for stripe 3, two slots must be wrong, the admin has to
> > provide a
> Well, "unknown" means it is not possible to detect
> which one(s).
> It could be there are more than 2 corrupted.
> The "unknown" case means that the only reasonable thing
> would be to rebuild the parities, but nothing more can
> be said about the status of the array.
>
> Nevertheless, there is a possibility which I was thinking
> about, but I never had time to implement (even if the
> software has some already built-in infrastructure for it).
> Specifically, a "vertical" statistic.
> That is, if there are mismatches, and, for example, 90% of
> them belong to /dev/sdX, and the rest 10% are "unknown",
> then it could be possible to extrapolate that, for the
> "unknown", /dev/sdX must be fixed anyway and then re-check
> if the status is still "unknown" or some other disk shows
> up. If one disk is reported, then it could be fixed.
> Other cases, the parity must be adjusted, whatever this
> means in terms of data recovery.
>
> Of course, this is just a statistical assumption, which
> means a second, "aggressive", option will have to be
> available, with all the warnings of the case.
As you point out, it is impossible to determine which of two failed
slots is in error. I would leave such a decision to the admin, but
giving one or more hints may be a nice idea.
Personally, I am recovering from a simultaneous three-disk failure on
a backup storage array. My best hope was to ddrescue "most" of the
data from all three disks onto fresh ones, and I lost a total of a few
KB on each disk. Using the ddrescue log, I can even say which sectors
of each disk were damaged. Interestingly, two disks of the same model
failed on the very same sector (even though they were produced at
different times), so I now have "unknown" slot errors in some stripes.
But with that context information, I am certain I know which slots
need to be repaired.
> > guess (and could iterate guesses, provided proper stripe backups):
> > raid6check /dev/md0 repair 3 5 3
>
> Actually, this could also be an improvement, I mean
> the possibility to backup stripes, so that other,
> advanced, recovery could be tried and reverted, if
> necessary.
That is true, I was thinking about this too. Unfortunately, as far as
I remember, the functions to save and restore stripes in restripe.c do
not save P and Q, which we would need in order to redo the data block
calculation. But with stripe backups, one could even imagine doing
verification on upper layers -- such as verifying file(system)
checksums. I may send another patch implementing this, but I wanted to
get general feedback on the inclusion of such changes first (Neil?).
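(Something along these lines -- a hypothetical helper, not in
restripe.c, reusing the fds/offsets arrays raid6check already has; it
dumps chunk_size bytes of every member, including the P and Q slots,
for one stripe, so a repair attempt could be reverted later:)

#include <stdlib.h>
#include <unistd.h>

/* Hypothetical sketch: copy one complete stripe (all raid_disks
 * members, i.e. data plus P and Q) to out_fd.  The byte offsets
 * mirror how raid6check seeks before writing repaired blocks:
 * offsets[i] + stripe * chunk_size. */
static int backup_stripe(int *fds, unsigned long long *offsets,
                         int raid_disks, int chunk_size,
                         unsigned long long stripe, int out_fd)
{
        char *buf = malloc(chunk_size);
        int i;

        if (!buf)
                return -1;
        for (i = 0; i < raid_disks; i++) {
                if (pread(fds[i], buf, chunk_size,
                          offsets[i] + stripe * chunk_size) != chunk_size ||
                    write(out_fd, buf, chunk_size) != chunk_size) {
                        free(buf);
                        return -1;
                }
        }
        free(buf);
        return 0;
}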
> Finally, someone should consider to use the optimized
> raid6 code, from the kernel module (can we link that
> code directly?), in order to speed up the check/repair.
I am a big supporter of getting it to work first, then making it
fast. Since a full RAID check takes on the order of hours anyway, I do
not mind that repairing blocks from user space takes five minutes when
it could be done in three. That said, I think the faster code in the
kernel is warranted there (as it needs this calculation very often
when a disk has failed), and if it is possible to reuse it easily, we
surely should.
Cheers,
Robert
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-07-03 19:10 ` Robert Buchholz
@ 2012-07-03 20:27 ` Piergiorgio Sartor
2012-07-09 3:43 ` NeilBrown
2012-07-20 10:53 ` Robert Buchholz
0 siblings, 2 replies; 13+ messages in thread
From: Piergiorgio Sartor @ 2012-07-03 20:27 UTC (permalink / raw)
To: Robert Buchholz; +Cc: Piergiorgio Sartor, John Robinson, linux-raid
Hi Robert,
On Tue, Jul 03, 2012 at 09:10:41PM +0200, Robert Buchholz wrote:
[...]
> > Why always two blocks?
>
> The reason is simply to have less cases to handle in the code. There's
> already three ways to regenerate regenerate two blocks (D&D, D/P&Q and
> D&P), and there would be two more cases if only one block was to be
> repaired. With the original patch, if you can repair two blocks, that
> allows you to repair one (and one other in addition) as well.
sorry, I did not express myself clearly.
I mean, a two-parity Reed-Solomon system can
only detect one incorrect slot position, so I would
expect to have the possibility to fix only one slot,
not two.
So, I did not understand why two. I mean, I understand
that a RAID-6 can correct up to exactly two incorrect
slots, but the "unknown" case might have more, and
correcting would then mean no correction or, maybe,
even more damage.
I would prefer, if you agree, to simply tell "raid6check"
to fix a single slot, or the (single) wrong slot it finds
during the check.
Does it make sense to you, or, maybe, you're considering
something I'm missing?
> > Of course, this is just a statistical assumption, which
> > means a second, "aggressive", option will have to be
> > available, with all the warnings of the case.
>
> As you point out, it is impossible to determine which of two failed
> slots are in error. I would leave such decision to an admin, but giving
> one or more "advices" may be a nice idea.
That would be exactly the background.
For example, considering that "raid6check" processes
stripes, but the check is done per byte, already
knowing how many bytes per stripe (or block) need
to be corrected (per device) would hint a lot about
the overall status of the storage.
> Personally, I am recovering from a simultaneous three-disk failure on a
> backup storage. My best hope was to ddrescue "most" from all three disks
> onto fresh ones, and I lost a total of a few KB on each disk. Using the
> ddrescue log, I can even say which sectors of each disk were damaged.
> Interestingly, two disks of the same model failed on the very same
> sector (even though they were produced at different times), so I now
> have "unknown" slot errors in some stripes. But with context
> information, I am certain I know which slots need to be repaired.
That's good!
Did you use "raid6check" for a verification?
[...]
> checksums. I may send another patch implementing this, but I wanted to
> get general feedback on inclusion of such changes first (Neil?).
Yeah, last time Neil mentioned he needs re-triggering :-),
I guess you'll have to add a "[PATCH]" tag to the message too...
> I am a big supporter of getting it to work, then make it fast. Since a
> full raid check takes the magnitude of hours anyway, I do not mind that
> repairing blocks from the user space will take five minutes when it
> could be done in 3. That said, I think the faster code in the kernel is
> warranted (as it needs this calculation very often when a disk is
> failed), and if it is possible to reuse easily, we sure should.
The check is pretty slow, also due to the terminal
print out, which is a bit too verbose, I think.
Anyhow, I'm really happy someone has interest in
improving "raid6check", I hope you'll be able to
improve it and, maybe, someone else will join
the bandwagon... :-)
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-07-03 20:27 ` Piergiorgio Sartor
@ 2012-07-09 3:43 ` NeilBrown
2012-07-20 10:40 ` [PATCH] " Robert Buchholz
2012-07-20 10:53 ` Robert Buchholz
1 sibling, 1 reply; 13+ messages in thread
From: NeilBrown @ 2012-07-09 3:43 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: Robert Buchholz, John Robinson, linux-raid
[-- Attachment #1: Type: text/plain, Size: 4224 bytes --]
On Tue, 3 Jul 2012 22:27:34 +0200 Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> Hi Robert,
>
> On Tue, Jul 03, 2012 at 09:10:41PM +0200, Robert Buchholz wrote:
> [...]
> > > Why always two blocks?
> >
> > The reason is simply to have less cases to handle in the code. There's
> > already three ways to regenerate regenerate two blocks (D&D, D/P&Q and
> > D&P), and there would be two more cases if only one block was to be
> > repaired. With the original patch, if you can repair two blocks, that
> > allows you to repair one (and one other in addition) as well.
>
> sorry, I express myself not clearly.
>
> I mean, a two parities Reed-Solomon system can
> only detect one incorrect slot position, so I would
> expect to have the possibility to fix only one, not
> two slots.
>
> So, I did not understand why two. I mean, I understand
> that a RAID-6 can correct exact up two incorrect slots,
> but the "unknown" case might have more and correcting
> will mean no correction or, maybe, even more damage.
>
> I would prefer, if you agree, to simply tell "raid6check"
> to fix a single slot, or the (single) wrong slots it finds
> during the check.
I think this is a sensible feature to offer. Maybe if "autorepair" is given
in place of "repair", then it should choose which block to repair, and do
that one?
>
> Does it make sense to you, or, maybe, you're considering
> something I'm missing?
>
> > > Of course, this is just a statistical assumption, which
> > > means a second, "aggressive", option will have to be
> > > available, with all the warnings of the case.
> >
> > As you point out, it is impossible to determine which of two failed
> > slots are in error. I would leave such decision to an admin, but giving
> > one or more "advices" may be a nice idea.
>
> That would be exactly the background.
> For example, considering that "raid6check" processes
> stripes, but the check is done per byte, already
> knowing how many bytes per stripe (or block) need
> to be corrected (per device) will hint a lot about
> the overall status of the storage.
>
> > Personally, I am recovering from a simultaneous three-disk failure on a
> > backup storage. My best hope was to ddrescue "most" from all three disks
> > onto fresh ones, and I lost a total of a few KB on each disk. Using the
> > ddrescue log, I can even say which sectors of each disk were damaged.
> > Interestingly, two disks of the same model failed on the very same
> > sector (even though they were produced at different times), so I now
> > have "unknown" slot errors in some stripes. But with context
> > information, I am certain I know which slots need to be repaired.
>
> That's good!
> Did you use "raid6check" for a verification?
>
> [...]
> > checksums. I may send another patch implementing this, but I wanted to
> > get general feedback on inclusion of such changes first (Neil?).
>
> Yeah, last time Neil mentioned he needs re-triggering :-),
> I guess you'll have to add "[PATCH]" tag to the message too...
:-)
I have applied the patches, though with some fairly minor editing (wrapping
lines, moving variable declarations before any statements, removing
tab-at-the-end-of-the-line). They probably won't appear in my public .git
for a little while as I have some other patches that I need to sort through
and organise first.
Thanks,
NeilBrown
>
> > I am a big supporter of getting it to work, then make it fast. Since a
> > full raid check takes the magnitude of hours anyway, I do not mind that
> > repairing blocks from the user space will take five minutes when it
> > could be done in 3. That said, I think the faster code in the kernel is
> > warranted (as it needs this calculation very often when a disk is
> > failed), and if it is possible to reuse easily, we sure should.
>
> The check is pretty slow, also due to the terminal
> print out, which is a bit too verbose, I think.
>
> Anyhow, I'm really happy someone has interest in
> improving "raid6check", I hope you'll be able to
> improve it and, maybe, someone else will join
> the bandwagon... :-)
>
> bye,
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH] Re: Find mismatch in data blocks during raid6 repair
2012-07-09 3:43 ` NeilBrown
@ 2012-07-20 10:40 ` Robert Buchholz
2012-07-20 14:14 ` Robert Buchholz
0 siblings, 1 reply; 13+ messages in thread
From: Robert Buchholz @ 2012-07-20 10:40 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, John Robinson, linux-raid
[-- Attachment #1.1: Type: text/plain, Size: 2383 bytes --]
Hello,
On Monday, July 09, 2012 01:43:08 PM NeilBrown wrote:
> On Tue, 3 Jul 2012 22:27:34 +0200 Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
> > On Tue, Jul 03, 2012 at 09:10:41PM +0200, Robert Buchholz wrote:
> > > > Why always two blocks?
> > >
> > > The reason is simply to have less cases to handle in the code.
> > > There's already three ways to regenerate regenerate two
> > > blocks (D&D, D/P&Q and D&P), and there would be two more
> > > cases if only one block was to be repaired. With the original
> > > patch, if you can repair two blocks, that allows you to
> > > repair one (and one other in addition) as well.>
> > sorry, I express myself not clearly.
> >
> > I mean, a two parities Reed-Solomon system can
> > only detect one incorrect slot position, so I would
> > expect to have the possibility to fix only one, not
> > two slots.
> >
> > So, I did not understand why two. I mean, I understand
> > that a RAID-6 can correct exact up two incorrect slots,
> > but the "unknown" case might have more and correcting
> > will mean no correction or, maybe, even more damage.
> >
> > I would prefer, if you agree, to simply tell "raid6check"
> > to fix a single slot, or the (single) wrong slots it finds
> > during the check.
>
> I think this is a sensible feature to offer. Maybe if "autorepair" is
> given in place of "repair", then it should choose which block to
> repair, and do that one?
That makes sense, I wanted to make it work and get some feedback first.
I'll look into this.
> I have applied the patches, though with some fairly minor editing
> (wrapping lines, moving variable declarations before any statements,
> removing tab-at-the-end-of-the-line). They probably won't appear in
> my public .git for a little while I I have some other patches that I
> need to sort through and organise first.
Thank you for merging and the cleanup.
I have attached three patches against your latest HEAD.
The first fixes compilation of the raid6check tool after the xmalloc
refactoring. The other two are bugfixes in the repair code. The latter
bug is somewhat embarrassing: I used geo_map in the wrong direction,
yielding incorrect data block indices to repair. This will lead to
data corruption on the stripe in most cases (interestingly, not in my
test case, due to the way the raid size and stripe index were chosen).
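(For clarity: geo_map() maps a data block index -- or -1/-2 for P/Q --
to a physical disk slot for a given stripe, while the recovery
routines want the block index. A hedged sketch of the reverse lookup,
not the exact code in the patch:)

/* Given a physical disk slot, find the data block index it holds in
 * this stripe by scanning geo_map() over all data block indices.
 * Returns -1 if the slot holds P or Q for this stripe. */
static int slot_to_block_index(int slot, unsigned long long stripe,
                               int raid_disks, int data_disks,
                               int level, int layout)
{
        int blk;

        for (blk = 0; blk < data_disks; blk++)
                if (geo_map(blk, stripe, raid_disks, level, layout) == slot)
                        return blk;
        return -1;
}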
Cheers,
Robert
[-- Attachment #1.2: 0001-Move-xmalloc-et-al-into-their-own-file.patch --]
[-- Type: text/x-patch, Size: 6207 bytes --]
From 24d8787a7ec2a56e4de62e9347ede9b2c2990617 Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Mon, 16 Jul 2012 23:56:54 +0200
Subject: [PATCH 1/3] Move xmalloc et al into their own file
This avoids code duplication for utilities that do not link to
util.c and everything that comes with it, such as test_stripe and
raid6check
---
Makefile | 12 +++++-----
restripe.c | 22 ------------------
util.c | 40 ---------------------------------
xmalloc.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 78 insertions(+), 68 deletions(-)
create mode 100644 xmalloc.c
diff --git a/Makefile b/Makefile
index 315455b..19cd22c 100644
--- a/Makefile
+++ b/Makefile
@@ -104,10 +104,10 @@ OBJS = mdadm.o config.o policy.o mdstat.o ReadMe.o util.o maps.o lib.o \
Incremental.o \
mdopen.o super0.o super1.o super-ddf.o super-intel.o bitmap.o \
super-mbr.o super-gpt.o \
- restripe.o sysfs.o sha1.o mapfile.o crc32.o sg_io.o msg.o \
+ restripe.o sysfs.o sha1.o mapfile.o crc32.o sg_io.o msg.o xmalloc.o \
platform-intel.o probe_roms.o
-CHECK_OBJS = restripe.o sysfs.o maps.o lib.o
+CHECK_OBJS = restripe.o sysfs.o maps.o lib.o xmalloc.o
SRCS = $(patsubst %.o,%.c,$(OBJS))
@@ -117,7 +117,7 @@ MON_OBJS = mdmon.o monitor.o managemon.o util.o maps.o mdstat.o sysfs.o \
config.o policy.o lib.o \
Kill.o sg_io.o dlink.o ReadMe.o super0.o super1.o super-intel.o \
super-mbr.o super-gpt.o \
- super-ddf.o sha1.o crc32.o msg.o bitmap.o \
+ super-ddf.o sha1.o crc32.o msg.o bitmap.o xmalloc.o \
platform-intel.o probe_roms.o
MON_SRCS = $(patsubst %.o,%.c,$(MON_OBJS))
@@ -126,7 +126,7 @@ STATICSRC = pwgr.c
STATICOBJS = pwgr.o
ASSEMBLE_SRCS := mdassemble.c Assemble.c Manage.c config.c policy.c dlink.c util.c \
- maps.c lib.c \
+ maps.c lib.c xmalloc.c \
super0.c super1.c super-ddf.c super-intel.c sha1.c crc32.c sg_io.c mdstat.c \
platform-intel.c probe_roms.c sysfs.c super-mbr.c super-gpt.c
ASSEMBLE_AUTO_SRCS := mdopen.c
@@ -175,8 +175,8 @@ mdmon : $(MON_OBJS)
$(CC) $(CFLAGS) $(LDFLAGS) $(MON_LDFLAGS) -Wl,-z,now -o mdmon $(MON_OBJS) $(LDLIBS)
msg.o: msg.c msg.h
-test_stripe : restripe.c mdadm.h
- $(CC) $(CXFLAGS) $(LDFLAGS) -o test_stripe -DMAIN restripe.c
+test_stripe : restripe.c xmalloc.o mdadm.h
+ $(CC) $(CXFLAGS) $(LDFLAGS) -o test_stripe xmalloc.o -DMAIN restripe.c
raid6check : raid6check.o mdadm.h $(CHECK_OBJS)
$(CC) $(CXFLAGS) $(LDFLAGS) -o raid6check raid6check.o $(CHECK_OBJS)
diff --git a/restripe.c b/restripe.c
index 1d2da1a..90896c8 100644
--- a/restripe.c
+++ b/restripe.c
@@ -998,26 +998,4 @@ main(int argc, char *argv[])
exit(0);
}
-
-void *xmalloc(size_t len)
-{
- void *rv = malloc(len);
- char *msg;
- if (rv)
- return rv;
- msg = Name ": memory allocation failure - aborting\n";
- write(2, msg, strlen(msg));
- exit(4);
-}
-
-void *xcalloc(size_t num, size_t size)
-{
- void *rv = calloc(num, size);
- char *msg;
- if (rv)
- return rv;
- msg = Name ": memory allocation failure - aborting\n";
- write(2, msg, strlen(msg));
- exit(4);
-}
#endif /* MAIN */
diff --git a/util.c b/util.c
index ad3c150..de12685 100644
--- a/util.c
+++ b/util.c
@@ -1799,43 +1799,3 @@ struct mdinfo *container_choose_spares(struct supertype *st,
}
return disks;
}
-
-void *xmalloc(size_t len)
-{
- void *rv = malloc(len);
- char *msg;
- if (rv)
- return rv;
- msg = Name ": memory allocation failure - aborting\n";
- exit(4+!!write(2, msg, strlen(msg)));
-}
-
-void *xrealloc(void *ptr, size_t len)
-{
- void *rv = realloc(ptr, len);
- char *msg;
- if (rv)
- return rv;
- msg = Name ": memory allocation failure - aborting\n";
- exit(4+!!write(2, msg, strlen(msg)));
-}
-
-void *xcalloc(size_t num, size_t size)
-{
- void *rv = calloc(num, size);
- char *msg;
- if (rv)
- return rv;
- msg = Name ": memory allocation failure - aborting\n";
- exit(4+!!write(2, msg, strlen(msg)));
-}
-
-char *xstrdup(const char *str)
-{
- char *rv = strdup(str);
- char *msg;
- if (rv)
- return rv;
- msg = Name ": memory allocation failure - aborting\n";
- exit(4+!!write(2, msg, strlen(msg)));
-}
diff --git a/xmalloc.c b/xmalloc.c
new file mode 100644
index 0000000..8d42a7c
--- /dev/null
+++ b/xmalloc.c
@@ -0,0 +1,72 @@
+/* mdadm - manage Linux "md" devices aka RAID arrays.
+ *
+ * Copyright (C) 2001-2009 Neil Brown <neilb@suse.de>
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Author: Neil Brown
+ * Email: <neilb@suse.de>
+ */
+
+#include "mdadm.h"
+/*#include <sys/socket.h>
+#include <sys/utsname.h>
+#include <sys/wait.h>
+#include <sys/un.h>
+#include <ctype.h>
+#include <dirent.h>
+#include <signal.h>
+*/
+
+void *xmalloc(size_t len)
+{
+ void *rv = malloc(len);
+ char *msg;
+ if (rv)
+ return rv;
+ msg = Name ": memory allocation failure - aborting\n";
+ exit(4+!!write(2, msg, strlen(msg)));
+}
+
+void *xrealloc(void *ptr, size_t len)
+{
+ void *rv = realloc(ptr, len);
+ char *msg;
+ if (rv)
+ return rv;
+ msg = Name ": memory allocation failure - aborting\n";
+ exit(4+!!write(2, msg, strlen(msg)));
+}
+
+void *xcalloc(size_t num, size_t size)
+{
+ void *rv = calloc(num, size);
+ char *msg;
+ if (rv)
+ return rv;
+ msg = Name ": memory allocation failure - aborting\n";
+ exit(4+!!write(2, msg, strlen(msg)));
+}
+
+char *xstrdup(const char *str)
+{
+ char *rv = strdup(str);
+ char *msg;
+ if (rv)
+ return rv;
+ msg = Name ": memory allocation failure - aborting\n";
+ exit(4+!!write(2, msg, strlen(msg)));
+}
--
1.7.3.4
[-- Attachment #1.3: 0002-raid6check-Fix-off-by-one-in-argument-check.patch --]
[-- Type: text/x-patch, Size: 1136 bytes --]
From bc15e20f19bcced2b271479be9da92e9db335747 Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Thu, 19 Jul 2012 17:14:47 +0200
Subject: [PATCH 2/3] raid6check: Fix off-by-one in argument check
In repair mode, specifying a failed slot that is equal to the number of
devices in the raid could cause a segfault.
---
raid6check.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/raid6check.c b/raid6check.c
index aba8160..dffadbe 100644
--- a/raid6check.c
+++ b/raid6check.c
@@ -416,12 +416,12 @@ int main(int argc, char *argv[])
failed_disk1 = getnum(argv[4], &err);
failed_disk2 = getnum(argv[5], &err);
- if(failed_disk1 > info->array.raid_disks) {
+ if(failed_disk1 >= info->array.raid_disks) {
fprintf(stderr, "%s: failed_slot_1 index is higher than number of devices in raid\n", prg);
exit_err = 4;
goto exitHere;
}
- if(failed_disk2 > info->array.raid_disks) {
+ if(failed_disk2 >= info->array.raid_disks) {
fprintf(stderr, "%s: failed_slot_2 index is higher than number of devices in raid\n", prg);
exit_err = 4;
goto exitHere;
--
1.7.3.4
[-- Attachment #1.4: 0003-raid6check-Repair-mode-used-geo_map-incorrectly.patch --]
[-- Type: text/x-patch, Size: 2647 bytes --]
From 90fe43106d33f22fd21fbfd49f853ec22592fa1f Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Thu, 19 Jul 2012 19:33:22 +0200
Subject: [PATCH 3/3] raid6check: Repair mode used geo_map incorrectly
In repair mode, the data block indices to be repaired were calculated
using geo_map(), which maps a data block index to a disk slot and not
the reverse. Now we simply record the reverse mapping at the point
where geo_map() is called anyway.
---
raid6check.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/raid6check.c b/raid6check.c
index dffadbe..122936d 100644
--- a/raid6check.c
+++ b/raid6check.c
@@ -126,6 +126,7 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
int err = 0;
sighandler_t sig[3];
int rv;
+ int block_index_for_slot[data_disks];
extern int tables_ready;
@@ -172,6 +173,7 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
for (i = 0 ; i < data_disks ; i++) {
int disk = geo_map(i, start, raid_disks, level, layout);
blocks[i] = stripes[disk];
+ block_index_for_slot[disk] = i;
printf("%d->%d\n", i, disk);
}
@@ -216,9 +218,7 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
else
failed_data = failed_disk1;
printf("Repairing D/P(%d) and Q\n", failed_data);
- failed_block_index = geo_map(
- failed_data, start, raid_disks,
- level, layout);
+ failed_block_index = block_index_for_slot[failed_data];
for (i=0; i < data_disks; i++)
if (failed_block_index == i)
all_but_failed_blocks[i] = stripes[diskP];
@@ -235,13 +235,13 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
failed_data = failed_disk2;
else
failed_data = failed_disk1;
- failed_block_index = geo_map(failed_data, start, raid_disks, level, layout);
+ failed_block_index = block_index_for_slot[failed_data];
printf("Repairing D(%d) and P\n", failed_data);
raid6_datap_recov(raid_disks, chunk_size, failed_block_index, (uint8_t**)blocks);
} else {
printf("Repairing D and D\n");
- int failed_block_index1 = geo_map(failed_disk1, start, raid_disks, level, layout);
- int failed_block_index2 = geo_map(failed_disk2, start, raid_disks, level, layout);
+ int failed_block_index1 = block_index_for_slot[failed_disk1];
+ int failed_block_index2 = block_index_for_slot[failed_disk2];
if (failed_block_index1 > failed_block_index2) {
int t = failed_block_index1;
failed_block_index1 = failed_block_index2;
--
1.7.3.4
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-07-03 20:27 ` Piergiorgio Sartor
2012-07-09 3:43 ` NeilBrown
@ 2012-07-20 10:53 ` Robert Buchholz
2012-07-21 16:00 ` Piergiorgio Sartor
1 sibling, 1 reply; 13+ messages in thread
From: Robert Buchholz @ 2012-07-20 10:53 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: John Robinson, linux-raid
[-- Attachment #1: Type: text/plain, Size: 4256 bytes --]
Hello Piergiorgio,
On Tuesday, July 03, 2012 10:27:34 PM Piergiorgio Sartor wrote:
> Hi Robert,
>
> On Tue, Jul 03, 2012 at 09:10:41PM +0200, Robert Buchholz wrote:
> [...]
>
> > > Why always two blocks?
> >
> > The reason is simply to have less cases to handle in the code.
> > There's already three ways to regenerate two blocks
> > (D&D, D/P&Q and D&P), and there would be two more cases if only
> > one block was to be repaired. With the original patch, if you can
> > repair two blocks, that allows you to repair one (and one other
> > in addition) as well.
> sorry, I didn't express myself clearly.
>
> I mean, a two-parity Reed-Solomon system can
> only detect one incorrect slot position, so I would
> expect it to be possible to fix only one slot, not
> two.
>
> So, I did not understand why two. I mean, I understand
> that a RAID-6 can correct exactly up to two incorrect slots,
> but the "unknown" case might have more and correcting
> will mean no correction or, maybe, even more damage.
Well, if two slots have failed and you do not know which, or more than
two have failed, there is no way to recover anything reliably.
I implemented the two slot fix to recover from a case where you *do*
know which two slots failed (e.g., from syslog messages such as this:
end_request: I/O error, dev sdk, sector 3174422). Obviously, this
expects a lot of knowledge from the admin running the command and
selecting the slots and comes with no guarantees that the "repaired"
blocks will contain more of the expected data than before.
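For illustration, here is a rough sketch (my own, not something raid6check
provides) of the arithmetic to turn such a failed sector into the stripe
number that "raid6check md_device repair stripe failed_slot_1 failed_slot_2"
expects. It assumes the sector has already been translated into an offset
within the md member device (partition start subtracted) and that the
member's data offset and the array chunk size are known; the helper name
is made up:

/* sketch: translate a failed member-device sector into a raid6check
 * stripe number; each stripe occupies one chunk on every member device */
#include <stdio.h>

static unsigned long long sector_to_stripe(unsigned long long sector,
                                           unsigned long long data_offset_sectors,
                                           unsigned long long chunk_size_bytes)
{
        return ((sector - data_offset_sectors) * 512ULL) / chunk_size_bytes;
}

int main(void)
{
        /* e.g. 512 KiB chunks, 32-sector data offset, sector from the log above */
        printf("%llu\n", sector_to_stripe(3174422ULL, 32ULL, 512ULL * 1024ULL));
        return 0;
}

The result is what goes into the stripe argument; the two slot numbers are
then the member devices known to be bad at that stripe.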
> I would prefer, if you agree, to simply tell "raid6check"
> to fix a single slot, or the (single) wrong slots it finds
> during the check.
>
> Does it make sense to you, or, maybe, you're considering
> something I'm missing?
This makes perfect sense.
> > > Of course, this is just a statistical assumption, which
> > > means a second, "aggressive", option will have to be
> > > available, with all the warnings of the case.
> >
> > As you point out, it is impossible to determine which of two
> > failed
> > slots are in error. I would leave such decision to an admin, but
> > giving one or more "advices" may be a nice idea.
>
> That would be exactly the background.
> For example, considering that "raid6check" processes
> stripes, but the check is done per byte, already
> knowing how many bytes per stripe (or block) need
> to be corrected (per device) will hint a lot about
> the overall status of the storage.
That piece of information is definitely interesting. What is the
smartest way to determine the number of incorrect bytes for one failed
slot?
> > Personally, I am recovering from a simultaneous three-disk failure
> > on a backup storage. My best hope was to ddrescue "most" from all
> > three disks onto fresh ones, and I lost a total of a few KB on
> > each disk. Using the ddrescue log, I can even say which sectors
> > of each disk were damaged. Interestingly, two disks of the same
> > model failed on the very same sector (even though they were
> > produced at different times), so I now have "unknown" slot errors
> > in some stripes. But with context information, I am certain I
> > know which slots need to be repaired.
> That's good!
> Did you use "raid6check" for a verification?
Yes, since John Robinson pointed me to it earlier in this thread.
> > I am a big supporter of getting it to work, then make it fast.
> > Since a full raid check takes the magnitude of hours anyway, I do
> > not mind that repairing blocks from the user space will take five
> > minutes when it could be done in 3. That said, I think the faster
> > code in the kernel is warranted (as it needs this calculation
> > very often when a disk is failed), and if it is possible to reuse
> > easily, we sure should.
> The check is pretty slow, also due to the terminal
> print out, which is a bit too verbose, I think.
That is true. The stripe geometry output could be optional, especially
when there is no error to be reported.
> Anyhow, I'm really happy someone has interest in
> improving "raid6check", I hope you'll be able to
> improve it and, maybe, someone else will join
> the bandwagon... :-)
Well, thank you for starting it and sorry for my slow replies.
Cheers
Robert
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] Re: Find mismatch in data blocks during raid6 repair
2012-07-20 10:40 ` [PATCH] " Robert Buchholz
@ 2012-07-20 14:14 ` Robert Buchholz
0 siblings, 0 replies; 13+ messages in thread
From: Robert Buchholz @ 2012-07-20 14:14 UTC (permalink / raw)
To: NeilBrown; +Cc: Piergiorgio Sartor, John Robinson, linux-raid
[-- Attachment #1.1: Type: text/plain, Size: 1167 bytes --]
On Friday, July 20, 2012 12:40:04 PM Robert Buchholz wrote:
> I have attached three patches against your latest HEAD.
> The first fixes compilation of the raid6check tool after the xmalloc
> refactoring. The other two are bugfixes in the repair code. The latter
> bug is somewhat embarrassing: I used geo_map in the wrong direction,
> yielding incorrect data block indices to repair. This will lead to
> data corruption on the stripe in most cases (interestingly, not in my
> test case due to the way the raid size and stripe index were chosen).
Please find attached a revised version of patch 3 in my previous email
(0003-raid6check-Repair-mode-used-geo_map-incorrectly.patch). The
original version did not handle the P/Q repair case correctly; this
patch replaces the previous file.
In addition, autorepair mode is added by the other two patches in this
email. To fix all slots that are detected faulty, one can run
# raid6check /dev/mdX 0 0 autorepair
The same considerations regarding caching apply, i.e. the array should
not be in use and caches flushed before any usage.
We should probably update the tool's man page for this.
Cheers,
Robert
[-- Attachment #1.2: 0003-raid6check-Repair-mode-used-geo_map-incorrectly.patch --]
[-- Type: text/x-patch, Size: 5735 bytes --]
From 812c1feae3540a3671492cb4a32b1903aa17dece Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Thu, 19 Jul 2012 19:33:22 +0200
Subject: [PATCH 3/5] raid6check: Repair mode used geo_map incorrectly
In repair mode, the data block indices to be repaired were calculated
using geo_map(), which maps a data block index to a disk slot and not
the reverse. Now we simply record the reverse mapping at the point
where geo_map() is called anyway.
---
raid6check.c | 24 +++++++++++++-----------
tests/19repair-does-not-destroy | 30 ++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 11 deletions(-)
create mode 100644 tests/19repair-does-not-destroy
diff --git a/raid6check.c b/raid6check.c
index dffadbe..51e7cca 100644
--- a/raid6check.c
+++ b/raid6check.c
@@ -116,6 +116,7 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
char *stripe_buf = xmalloc(raid_disks * chunk_size);
char **stripes = xmalloc(raid_disks * sizeof(char*));
char **blocks = xmalloc(raid_disks * sizeof(char*));
+ int *block_index_for_slot = xmalloc(raid_disks * sizeof(int));
uint8_t *p = xmalloc(chunk_size);
uint8_t *q = xmalloc(chunk_size);
int *results = xmalloc(chunk_size * sizeof(int));
@@ -172,6 +173,7 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
for (i = 0 ; i < data_disks ; i++) {
int disk = geo_map(i, start, raid_disks, level, layout);
blocks[i] = stripes[disk];
+ block_index_for_slot[disk] = i;
printf("%d->%d\n", i, disk);
}
@@ -179,7 +181,9 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
diskP = geo_map(-1, start, raid_disks, level, layout);
diskQ = geo_map(-2, start, raid_disks, level, layout);
blocks[data_disks] = stripes[diskP];
+ block_index_for_slot[diskP] = data_disks;
blocks[data_disks+1] = stripes[diskQ];
+ block_index_for_slot[diskQ] = data_disks+1;
if (memcmp(p, stripes[diskP], chunk_size) != 0) {
printf("P(%d) wrong at %llu\n", diskP, start);
@@ -208,23 +212,21 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
if (failed_disk1 == diskQ || failed_disk2 == diskQ) {
char *all_but_failed_blocks[data_disks];
- int failed_data;
+ int failed_data_or_p;
int failed_block_index;
if (failed_disk1 == diskQ)
- failed_data = failed_disk2;
+ failed_data_or_p = failed_disk2;
else
- failed_data = failed_disk1;
- printf("Repairing D/P(%d) and Q\n", failed_data);
- failed_block_index = geo_map(
- failed_data, start, raid_disks,
- level, layout);
+ failed_data_or_p = failed_disk1;
+ printf("Repairing D/P(%d) and Q\n", failed_data_or_p);
+ failed_block_index = block_index_for_slot[failed_data_or_p];
for (i=0; i < data_disks; i++)
if (failed_block_index == i)
all_but_failed_blocks[i] = stripes[diskP];
else
all_but_failed_blocks[i] = blocks[i];
- xor_blocks(stripes[failed_data],
+ xor_blocks(stripes[failed_data_or_p],
all_but_failed_blocks, data_disks, chunk_size);
qsyndrome(p, (uint8_t*)stripes[diskQ], (uint8_t**)blocks, data_disks, chunk_size);
} else {
@@ -235,13 +237,13 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
failed_data = failed_disk2;
else
failed_data = failed_disk1;
- failed_block_index = geo_map(failed_data, start, raid_disks, level, layout);
+ failed_block_index = block_index_for_slot[failed_data];
printf("Repairing D(%d) and P\n", failed_data);
raid6_datap_recov(raid_disks, chunk_size, failed_block_index, (uint8_t**)blocks);
} else {
printf("Repairing D and D\n");
- int failed_block_index1 = geo_map(failed_disk1, start, raid_disks, level, layout);
- int failed_block_index2 = geo_map(failed_disk2, start, raid_disks, level, layout);
+ int failed_block_index1 = block_index_for_slot[failed_disk1];
+ int failed_block_index2 = block_index_for_slot[failed_disk2];
if (failed_block_index1 > failed_block_index2) {
int t = failed_block_index1;
failed_block_index1 = failed_block_index2;
diff --git a/tests/19repair-does-not-destroy b/tests/19repair-does-not-destroy
new file mode 100644
index 0000000..d355e0c
--- /dev/null
+++ b/tests/19repair-does-not-destroy
@@ -0,0 +1,30 @@
+number_of_disks=7
+chunksize_in_kib=512
+array_data_size_in_kib=$[chunksize_in_kib*(number_of_disks-2)*number_of_disks]
+array_data_size_in_b=$[array_data_size_in_kib*1024]
+devs="$dev0 $dev1 $dev2 $dev3 $dev4 $dev5 $dev6"
+
+dd if=/dev/urandom of=/tmp/RandFile bs=1024 count=$array_data_size_in_kib
+mdadm -CR $md0 -l6 -n$number_of_disks -c $chunksize_in_kib $devs
+dd if=/tmp/RandFile of=$md0 bs=1024 count=$array_data_size_in_kib
+blockdev --flushbufs $md0; sync
+check wait
+blockdev --flushbufs $devs; sync
+echo 3 > /proc/sys/vm/drop_caches
+$dir/raid6check $md0 repair 1 2 3 > /dev/null # D D
+$dir/raid6check $md0 repair 8 2 5 > /dev/null # D P
+$dir/raid6check $md0 repair 15 4 6 > /dev/null # D Q
+$dir/raid6check $md0 repair 22 5 6 > /dev/null # P Q
+$dir/raid6check $md0 repair 3 4 0 > /dev/null # Q D
+$dir/raid6check $md0 repair 3 3 1 > /dev/null # P D
+$dir/raid6check $md0 repair 6 4 5 > /dev/null # D<D
+$dir/raid6check $md0 repair 13 5 4 > /dev/null # D>D
+blockdev --flushbufs $devs; sync
+echo 3 > /proc/sys/vm/drop_caches
+$dir/raid6check $md0 0 0 2>&1 | grep -qs "Error" && { echo errors detected; exit 2; }
+cmp -s -n $array_data_size_in_b $md0 /tmp/RandFile || { echo should not mess up correct stripe ; exit 2; }
+
+mdadm -S $md0
+udevadm settle
+blockdev --flushbufs $md0 $devs; sync
+
--
1.7.3.4
[-- Attachment #1.3: 0004-raid6check-Extract-un-locking-into-functions.patch --]
[-- Type: text/x-patch, Size: 5007 bytes --]
From 2fabd1507a0bcf9ae0f2af888e6dcd0fac580027 Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Fri, 20 Jul 2012 16:00:14 +0200
Subject: [PATCH 4/5] raid6check: Extract (un)locking into functions
---
raid6check.c | 90 ++++++++++++++++++++++++++++++---------------------------
1 files changed, 47 insertions(+), 43 deletions(-)
diff --git a/raid6check.c b/raid6check.c
index 51e7cca..4aeafad 100644
--- a/raid6check.c
+++ b/raid6check.c
@@ -107,6 +107,38 @@ int raid6_stats(int *results, int raid_disks, int chunk_size)
return curr_broken_disk;
}
+int lock_stripe(struct mdinfo *info, unsigned long long start,
+ int chunk_size, int data_disks, sighandler_t *sig) {
+ int rv;
+ if(mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
+ return 2;
+ }
+
+ sig[0] = signal(SIGTERM, SIG_IGN);
+ sig[1] = signal(SIGINT, SIG_IGN);
+ sig[2] = signal(SIGQUIT, SIG_IGN);
+
+ rv = sysfs_set_num(info, NULL, "suspend_lo", start * chunk_size * data_disks);
+ rv |= sysfs_set_num(info, NULL, "suspend_hi", (start + 1) * chunk_size * data_disks);
+ return rv * 256;
+}
+
+int unlock_all_stripes(struct mdinfo *info, sighandler_t *sig) {
+ int rv;
+ rv = sysfs_set_num(info, NULL, "suspend_lo", 0x7FFFFFFFFFFFFFFFULL);
+ rv |= sysfs_set_num(info, NULL, "suspend_hi", 0);
+ rv |= sysfs_set_num(info, NULL, "suspend_lo", 0);
+
+ signal(SIGQUIT, sig[2]);
+ signal(SIGINT, sig[1]);
+ signal(SIGTERM, sig[0]);
+
+ if(munlockall() != 0)
+ return 3;
+ return rv * 256;
+}
+
+
int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
int raid_disks, int chunk_size, int level, int layout,
unsigned long long start, unsigned long long length, char *name[],
@@ -120,13 +152,12 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
uint8_t *p = xmalloc(chunk_size);
uint8_t *q = xmalloc(chunk_size);
int *results = xmalloc(chunk_size * sizeof(int));
+ sighandler_t *sig = xmalloc(3 * sizeof(sighandler_t));
int i;
int diskP, diskQ;
int data_disks = raid_disks - 2;
int err = 0;
- sighandler_t sig[3];
- int rv;
extern int tables_ready;
@@ -141,34 +172,19 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
printf("pos --> %llu\n", start);
- if(mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
- err = 2;
+ err = lock_stripe(info, start, chunk_size, data_disks, sig);
+ if(err != 0) {
+ if (err != 2)
+ unlock_all_stripes(info, sig);
goto exitCheck;
}
- sig[0] = signal(SIGTERM, SIG_IGN);
- sig[1] = signal(SIGINT, SIG_IGN);
- sig[2] = signal(SIGQUIT, SIG_IGN);
- rv = sysfs_set_num(info, NULL, "suspend_lo", start * chunk_size * data_disks);
- rv |= sysfs_set_num(info, NULL, "suspend_hi", (start + 1) * chunk_size * data_disks);
for (i = 0 ; i < raid_disks ; i++) {
lseek64(source[i], offsets[i] + start * chunk_size, 0);
read(source[i], stripes[i], chunk_size);
}
- rv |= sysfs_set_num(info, NULL, "suspend_lo", 0x7FFFFFFFFFFFFFFFULL);
- rv |= sysfs_set_num(info, NULL, "suspend_hi", 0);
- rv |= sysfs_set_num(info, NULL, "suspend_lo", 0);
- signal(SIGQUIT, sig[2]);
- signal(SIGINT, sig[1]);
- signal(SIGTERM, sig[0]);
- if(munlockall() != 0) {
- err = 3;
- goto exitCheck;
- }
-
- if(rv != 0) {
- err = rv * 256;
+ err = unlock_all_stripes(info, sig);
+ if(err != 0)
goto exitCheck;
- }
for (i = 0 ; i < data_disks ; i++) {
int disk = geo_map(i, start, raid_disks, level, layout);
@@ -252,34 +268,22 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
raid6_2data_recov(raid_disks, chunk_size, failed_block_index1, failed_block_index2, (uint8_t**)blocks);
}
}
- if(mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
- err = 2;
+
+ err = lock_stripe(info, start, chunk_size, data_disks, sig);
+ if(err != 0) {
+ if (err != 2)
+ unlock_all_stripes(info, sig);
goto exitCheck;
}
- sig[0] = signal(SIGTERM, SIG_IGN);
- sig[1] = signal(SIGINT, SIG_IGN);
- sig[2] = signal(SIGQUIT, SIG_IGN);
- rv = sysfs_set_num(info, NULL, "suspend_lo", start * chunk_size * data_disks);
- rv |= sysfs_set_num(info, NULL, "suspend_hi", (start + 1) * chunk_size * data_disks);
+
lseek64(source[failed_disk1], offsets[failed_disk1] + start * chunk_size, 0);
write(source[failed_disk1], stripes[failed_disk1], chunk_size);
lseek64(source[failed_disk2], offsets[failed_disk2] + start * chunk_size, 0);
write(source[failed_disk2], stripes[failed_disk2], chunk_size);
- rv |= sysfs_set_num(info, NULL, "suspend_lo", 0x7FFFFFFFFFFFFFFFULL);
- rv |= sysfs_set_num(info, NULL, "suspend_hi", 0);
- rv |= sysfs_set_num(info, NULL, "suspend_lo", 0);
- signal(SIGQUIT, sig[2]);
- signal(SIGINT, sig[1]);
- signal(SIGTERM, sig[0]);
- if(munlockall() != 0) {
- err = 3;
- goto exitCheck;
- }
- if(rv != 0) {
- err = rv * 256;
+ err = unlock_all_stripes(info, sig);
+ if(err != 0)
goto exitCheck;
- }
}
--
1.7.3.4
[-- Attachment #1.4: 0005-raid6check-Auto-repair-mode.patch --]
[-- Type: text/x-patch, Size: 4636 bytes --]
From 530bf005a45cd79f77c2a726874a1a7da84064d5 Mon Sep 17 00:00:00 2001
From: Robert Buchholz <rbu@goodpoint.de>
Date: Fri, 20 Jul 2012 16:01:53 +0200
Subject: [PATCH 5/5] raid6check: Auto-repair mode
When calling raid6check in regular scanning mode, specifying
"autorepair" as the last positional parameter will cause it
to automatically repair any single-slot failures it identifies.
---
raid6check.c | 33 ++++++++++++++++++++++++++++++++-
tests/19raid6auto-repair | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+), 1 deletions(-)
create mode 100644 tests/19raid6auto-repair
diff --git a/raid6check.c b/raid6check.c
index 4aeafad..e9a17a7 100644
--- a/raid6check.c
+++ b/raid6check.c
@@ -284,6 +284,35 @@ int check_stripes(struct mdinfo *info, int *source, unsigned long long *offsets,
err = unlock_all_stripes(info, sig);
if(err != 0)
goto exitCheck;
+ } else if (disk >= 0 && repair == 2) {
+ printf("Auto-repairing slot %d (%s)\n", disk, name[disk]);
+ if (disk == diskQ) {
+ qsyndrome(p, (uint8_t*)stripes[diskQ], (uint8_t**)blocks, data_disks, chunk_size);
+ } else {
+ char *all_but_failed_blocks[data_disks];
+ int failed_block_index = block_index_for_slot[disk];
+ for (i=0; i < data_disks; i++)
+ if (failed_block_index == i)
+ all_but_failed_blocks[i] = stripes[diskP];
+ else
+ all_but_failed_blocks[i] = blocks[i];
+ xor_blocks(stripes[disk],
+ all_but_failed_blocks, data_disks, chunk_size);
+ }
+
+ err = lock_stripe(info, start, chunk_size, data_disks, sig);
+ if(err != 0) {
+ if (err != 2)
+ unlock_all_stripes(info, sig);
+ goto exitCheck;
+ }
+
+ lseek64(source[disk], offsets[disk] + start * chunk_size, 0);
+ write(source[disk], stripes[disk], chunk_size);
+
+ err = unlock_all_stripes(info, sig);
+ if(err != 0)
+ goto exitCheck;
}
@@ -343,7 +372,7 @@ int main(int argc, char *argv[])
prg++;
if (argc < 4) {
- fprintf(stderr, "Usage: %s md_device start_stripe length_stripes\n", prg);
+ fprintf(stderr, "Usage: %s md_device start_stripe length_stripes [autorepair]\n", prg);
fprintf(stderr, " or: %s md_device repair stripe failed_slot_1 failed_slot_2\n", prg);
exit_err = 1;
goto exitHere;
@@ -441,6 +470,8 @@ int main(int argc, char *argv[])
else {
start = getnum(argv[2], &err);
length = getnum(argv[3], &err);
+ if (argc >= 5 && strcmp(argv[4], "autorepair")==0)
+ repair = 2;
}
if (err) {
diff --git a/tests/19raid6auto-repair b/tests/19raid6auto-repair
new file mode 100644
index 0000000..6665458
--- /dev/null
+++ b/tests/19raid6auto-repair
@@ -0,0 +1,43 @@
+number_of_disks=5
+chunksize_in_kib=512
+chunksize_in_b=$[chunksize_in_kib*1024]
+array_data_size_in_kib=$[chunksize_in_kib*(number_of_disks-2)*number_of_disks]
+array_data_size_in_b=$[array_data_size_in_kib*1024]
+devs="$dev0 $dev1 $dev2 $dev3 $dev4"
+
+# default 32 sectors
+data_offset_in_kib=$[32/2]
+
+# create a raid6 array and fill it from a random data file
+dd if=/dev/urandom of=/tmp/RandFile bs=1024 count=$array_data_size_in_kib
+mdadm -CR $md0 -l6 -n$number_of_disks -c $chunksize_in_kib $devs
+dd if=/tmp/RandFile of=$md0 bs=1024 count=$array_data_size_in_kib
+blockdev --flushbufs $md0; sync
+check wait
+blockdev --flushbufs $devs; sync
+echo 3 > /proc/sys/vm/drop_caches
+cmp -s -n $array_data_size_in_b $md0 /tmp/RandFile || { echo sanity cmp failed ; exit 2; }
+
+# wipe out 5 chunks on each device
+dd if=/dev/urandom of=$dev0 bs=1024 count=$[5*chunksize_in_kib] seek=$[data_offset_in_kib+chunksize_in_kib*0]
+dd if=/dev/urandom of=$dev1 bs=1024 count=$[5*chunksize_in_kib] seek=$[data_offset_in_kib+chunksize_in_kib*5]
+dd if=/dev/urandom of=$dev2 bs=1024 count=$[5*chunksize_in_kib] seek=$[data_offset_in_kib+chunksize_in_kib*10]
+dd if=/dev/urandom of=$dev3 bs=1024 count=$[5*chunksize_in_kib] seek=$[data_offset_in_kib+chunksize_in_kib*15]
+dd if=/dev/urandom of=$dev4 bs=1024 count=$[5*chunksize_in_kib] seek=$[data_offset_in_kib+chunksize_in_kib*20]
+
+blockdev --flushbufs $devs; sync
+echo 3 > /proc/sys/vm/drop_caches
+
+$dir/raid6check $md0 0 0 2>&1 | grep -qs "Error" || { echo should detect errors; exit 2; }
+
+$dir/raid6check $md0 0 0 autorepair > /dev/null || { echo repair failed; exit 2; }
+blockdev --flushbufs $md0 $devs; sync
+echo 3 > /proc/sys/vm/drop_caches
+
+$dir/raid6check $md0 0 0 2>&1 | grep -qs "Error" && { echo errors detected; exit 2; }
+cmp -s -n $array_data_size_in_b $md0 /tmp/RandFile || { echo cmp failed ; exit 2; }
+
+mdadm -S $md0
+udevadm settle
+blockdev --flushbufs $md0 $devs; sync
+echo 3 > /proc/sys/vm/drop_caches
--
1.7.3.4
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: Find mismatch in data blocks during raid6 repair
2012-07-20 10:53 ` Robert Buchholz
@ 2012-07-21 16:00 ` Piergiorgio Sartor
0 siblings, 0 replies; 13+ messages in thread
From: Piergiorgio Sartor @ 2012-07-21 16:00 UTC (permalink / raw)
To: Robert Buchholz; +Cc: Piergiorgio Sartor, John Robinson, linux-raid
Hi Robert,
On Fri, Jul 20, 2012 at 12:53:10PM +0200, Robert Buchholz wrote:
> Hello Piergiorgio,
>
> On Tuesday, July 03, 2012 10:27:34 PM Piergiorgio Sartor wrote:
> > Hi Robert,
> >
> > On Tue, Jul 03, 2012 at 09:10:41PM +0200, Robert Buchholz wrote:
> > [...]
> >
> > > > Why always two blocks?
> > >
> > > The reason is simply to have less cases to handle in the code.
> > > There's already three ways to regenerate two blocks
> > > (D&D, D/P&Q and D&P), and there would be two more cases if only
> > > one block was to be repaired. With the original patch, if you can
> > > repair two blocks, that allows you to repair one (and one other
> > > in addition) as well.
> > sorry, I didn't express myself clearly.
> >
> > I mean, a two-parity Reed-Solomon system can
> > only detect one incorrect slot position, so I would
> > expect it to be possible to fix only one slot, not
> > two.
> >
> > So, I did not understand why two. I mean, I understand
> > that a RAID-6 can correct exactly up to two incorrect slots,
> > but the "unknown" case might have more and correcting
> > will mean no correction or, maybe, even more damage.
>
> Well, if two slots have failed and you do not know which, or more than
> two have failed, there is no way to recover anything reliably.
> I implemented the two slot fix to recover from a case where you *do*
> know which two slots failed (e.g., from syslog messages such as this:
> end_request: I/O error, dev sdk, sector 3174422). Obviously, this
> expects a lot of knowledge from the admin running the command and
> selecting the slots and comes with no guarantees that the "repaired"
> blocks will contain more of the expected data than before.
OK, thanks, I see.
> > I would prefer, if you agree, to simply tell "raid6check"
> > to fix a single slot, or the (single) wrong slots it finds
> > during the check.
> >
> > Does it make sense to you, or, maybe, you're considering
> > something I'm missing?
>
> This makes perfect sense.
>
> > > > Of course, this is just a statistical assumption, which
> > > > means a second, "aggressive", option will have to be
> > > > available, with all the warnings of the case.
> > >
> > > As you point out, it is impossible to determine which of two
> > > failed
> > > slots are in error. I would leave such decision to an admin, but
> > > giving one or more "advices" may be a nice idea.
> >
> > That would be exactly the background.
> > For example, considering that "raid6check" processes
> > stripes, but the check is done per byte, already
> > knowing how many bytes per stripe (or block) need
> > to be corrected (per device) will hint a lot about
> > the overall status of the storage.
>
> That piece of information is definitely interesting. What is the
> smartest way to determine the number of incorrect bytes for one failed
> slot?
The function "raid6_collect()" of "raif6check"
performs the check per byte position and returns
an array (of int) in which each value has a
status representing the condition of the bytes
of the slot at that position.
The possible values are, if my memory serves me:
-255 = OK
positive integer = failed data (if greater than
number of disks, then "unknown", of course, but
this is checked later)
-1 or -2 = failed P or Q parity
The "raid6_stats()" function uses the returned
array in order to try to detect the status of
the complete slot.
This (bugs aside) tries to be consistent, that
is, differing errors are considered "unknown"
slot status. But this is the minimal possible
approach.
Now, the "results[]" array could also be used
to count how many positions of the slot are
correct or incorrect and how.
I guess this information could be used to
understand better the status of the slot.
Collecting statistics across the slots, that
is for the whole array, would allow a
better assessment of the RAID-6.
It clearly makes a difference whether a single
byte position is incorrect or all of them are,
or whether one byte is -1 and all the others are -2.
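Just to sketch the idea, a minimal example of folding such a per-byte
"results[]" array into counters (helper and struct names are made up;
the value conventions below are taken from my description above, so
treat them as an approximation rather than the exact code):

/* sketch: tally a per-byte results[] array into per-stripe counters;
 * assumed conventions: -255 = OK, -1/-2 = P/Q wrong, 0..raid_disks-1 =
 * failed data slot, anything else = unknown */
#include <stdio.h>

#define MAX_DISKS 256

struct slot_stats {
        unsigned long long ok, p_bad, q_bad, unknown;
        unsigned long long data_bad[MAX_DISKS];
};

static void collect_stats(const int *results, int chunk_size, int raid_disks,
                          struct slot_stats *st)
{
        int i;
        for (i = 0; i < chunk_size; i++) {
                if (results[i] == -255)
                        st->ok++;
                else if (results[i] == -1)
                        st->p_bad++;
                else if (results[i] == -2)
                        st->q_bad++;
                else if (results[i] >= 0 && results[i] < raid_disks)
                        st->data_bad[results[i]]++;
                else
                        st->unknown++;
        }
}

int main(void)
{
        int results[] = { -255, -255, 3, -2 };  /* toy input */
        struct slot_stats st = { 0 };
        collect_stats(results, 4, 7, &st);
        printf("ok=%llu P=%llu Q=%llu D3=%llu unknown=%llu\n",
               st.ok, st.p_bad, st.q_bad, st.data_bad[3], st.unknown);
        return 0;
}

Accumulating these counters over all stripes would give the kind of
per-device overview I mentioned.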
> > > Personally, I am recovering from a simultaneous three-disk failure
> > > on a backup storage. My best hope was to ddrescue "most" from all
> > > three disks onto fresh ones, and I lost a total of a few KB on
> > > each disk. Using the ddrescue log, I can even say which sectors
> > > of each disk were damaged. Interestingly, two disks of the same
> > > model failed on the very same sector (even though they were
> > > produced at different times), so I now have "unknown" slot errors
> > > in some stripes. But with context information, I am certain I
> > > know which slots need to be repaired.
> > That's good!
> > Did you use "raid6check" for a verification?
>
> Yes, since John Robinson pointed me to it earlier in this thread.
>
> > > I am a big supporter of getting it to work, then make it fast.
> > > Since a full raid check takes the magnitude of hours anyway, I do
> > > not mind that repairing blocks from the user space will take five
> > > minutes when it could be done in 3. That said, I think the faster
> > > code in the kernel is warranted (as it needs this calculation
> > > very often when a disk is failed), and if it is possible to reuse
> > > easily, we sure should.
> > The check is pretty slow, also due to the terminal
> > print out, which is a bit too verbose, I think.
>
> That is true. The stripe geometry output could be optional, especially
> when there is no error to be reported.
>
> > Anyhow, I'm really happy someone has interest in
> > improving "raid6check", I hope you'll be able to
> > improve it and, maybe, someone else will join
> > the bandwagon... :-)
>
> Well, thank you for starting it and sorry for my slow replies.
>
>
> Cheers
>
> Robert
bye,
--
piergiorgio
^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages
2012-06-20 17:41 Find mismatch in data blocks during raid6 repair Robert Buchholz
2012-06-21 12:38 ` John Robinson
2012-06-21 14:58 ` Robert Buchholz
2012-06-21 18:23 ` Piergiorgio Sartor
2012-06-29 18:16 ` Robert Buchholz
2012-06-30 11:48 ` Piergiorgio Sartor
2012-07-03 19:10 ` Robert Buchholz
2012-07-03 20:27 ` Piergiorgio Sartor
2012-07-09 3:43 ` NeilBrown
2012-07-20 10:40 ` [PATCH] " Robert Buchholz
2012-07-20 14:14 ` Robert Buchholz
2012-07-20 10:53 ` Robert Buchholz
2012-07-21 16:00 ` Piergiorgio Sartor