Linux RAID subsystem development


* Re: Why do I get different results for 'mdadm --detail' & 'mdadm --examine' for the same array?
From: NeilBrown @ 2011-06-16  2:00 UTC (permalink / raw)
  To: jeffs_linux; +Cc: linux-raid
In-Reply-To: <1308188856.14820.1463669409@webmail.messagingengine.com>

On Wed, 15 Jun 2011 18:47:36 -0700 jeffs_linux@123mail.org wrote:

> > > cat /dev/.mdadm/map
> > > ---------------------------------------------------------------
> > > md126 0.90 19f2b21c:e54f9e1a:be5ad16e:9754ab5e /dev/md/0_0
> > > md127 1.2 79fb7ad4:289bfae5:86c535ff:202960f2 /dev/md127
> > > ---------------------------------------------------------------
> > > 
> > > mdadm --detail --scan
> > >         ARRAY /dev/md127 metadata=1.2 name=jeffadm:jeffadm1
> > >         UUID=d84afb64:e6fa2b64:ff21c975:f9765431
> > >         ARRAY /dev/md/0_0 metadata=0.90
> > >         UUID=19f2b21c:e54f9e1a:be5ad16e:9754ab5e
> > 
> > Yes, that is weird.  I don't know how they came to be out of sync.
> > 
> >   mdadm --incremental --rebuild-map
> > 
> > will fix it..
> 
> Ok.  This is officially the first time that I'll actually try to fix
> anything on my 'production' array.
> 
> I'm reading the manpage -- again! -- and see both the "--incremental"
> and "--rebuild-map" sections.  So I get what they do.
> 
> WHEN can/should I do it?  On my live running array while at runlevel 5? 
> A lower runlevel?  From a separate boot disk?
> 

Any time at all is fine.  The 'map' file is used to help with incremental
assembly of arrays.  When "mdadm -I" is given a device that looks like part of
an array it looks in the map file to find out if any of that array has already
been assembled.
So once everything is assembled it is not interesting any longer.

I think you can even just remove it.  If mdadm needs it and finds it doesn't
exist, it performs the equivalent of "mdadm --incremental --rebuild-map", then
tries again.

So it really is safe to run it at any time that you aren't actively rebooting
or plugging in new devices.

NeilBrown

^ permalink raw reply

* Re: Why do I get different results for 'mdadm --detail' & 'mdadm --examine' for the same array?
From: jeffs_linux @ 2011-06-16  2:09 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <20110616120049.2f01ce88@notabene.brown>

On Thu, 16 Jun 2011 12:00 +1000, "NeilBrown" <neilb@suse.de> wrote:
> So it really is safe to run it at any time that you aren't actively
> rebooting or plugging in new devices.

gulp.  i'll give it a try!

thanks,

jeff

^ permalink raw reply

* Re: md 3.2.1 and xfs kernel panic on Linux 2.6.38
From: NeilBrown @ 2011-06-16  2:26 UTC (permalink / raw)
  To: fibreraid@gmail.com; +Cc: linux-raid, linux-xfs
In-Reply-To: <20110616115531.298328f2@notabene.brown>

On Thu, 16 Jun 2011 11:55:31 +1000 NeilBrown <neilb@suse.de> wrote:


> Hmmm... maybe we are missing the wakeup call.  I can find where we wake-up
> anyone waiting for an overlapping read request to complete, but I cannot find
> where we wake-up someone waiting for when an overlapping write request
> completes.  That should probably go in handle_stripe_clean_event.

On second thoughts, I can find where it is handled correctly for an
overlapping write request.

So I'm now stumped again.

I tried looking at a disassembly listing of make_request to confirm which
'schedule' call it was, but when I disassemble make_request it is about 1700
bytes long, whereas your trace shows:

    make_request+0x138/0x3d0 

That is, it is 0x3d0 (976) bytes long.  That is odd.

Can you get a disassembly to show me?
e.g.
   gdb raid456.ko
   disassemble make_request
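
As a cross-check on the sizes being compared, the offset and length in a
trace token such as make_request+0x138/0x3d0 can be pulled apart
mechanically.  A minimal C sketch, assuming the token always has the
symbol+0xOFFSET/0xSIZE shape; parse_trace_token is a hypothetical helper
for illustration, not part of the kernel or of mdadm:

```c
#include <stdio.h>

/* Parse a stack-trace token of the form "symbol+0xOFF/0xSIZE".
 * Returns 1 on success, 0 if the token does not have that shape.
 * sym must point to a buffer of at least 64 bytes. */
static int parse_trace_token(const char *tok, char *sym,
                             unsigned long *off, unsigned long *size)
{
	/* %63[^+] copies everything before the '+' into sym;
	 * %lx accepts hexadecimal with an optional 0x prefix. */
	return sscanf(tok, "%63[^+]+%lx/%lx", sym, off, size) == 3;
}
```

For the token above this gives an offset of 0x138 into a function that is
0x3d0 (976) bytes long.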

Thanks,
NeilBrown


> 
> Do you have the system still hanging in this state?  If not, can you get it
> back into this state easily?
> If so, you can force a wakeup with the magic incantation:
> 
>  cat /sys/block/mdXX/md/suspend_lo > /sys/block/mdXX/md/suspend_lo
> 
> (with 'XX' suitably substituted).
> 
> If that makes a difference, then I know I am on the right track
> 
> Thanks,
> NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: Why do I get different results for 'mdadm --detail' & 'mdadm --examine' for the same array?
From: jeffs_linux @ 2011-06-16  2:39 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <20110616120049.2f01ce88@notabene.brown>



On Thu, 16 Jun 2011 12:00 +1000, "NeilBrown" <neilb@suse.de> wrote:
> On Wed, 15 Jun 2011 18:47:36 -0700 jeffs_linux@123mail.org wrote:
> 
> > > > cat /dev/.mdadm/map
> > > > ---------------------------------------------------------------
> > > > md126 0.90 19f2b21c:e54f9e1a:be5ad16e:9754ab5e /dev/md/0_0
> > > > md127 1.2 79fb7ad4:289bfae5:86c535ff:202960f2 /dev/md127
> > > > ---------------------------------------------------------------
> > > > 
> > > > mdadm --detail --scan
> > > >         ARRAY /dev/md127 metadata=1.2 name=jeffadm:jeffadm1
> > > >         UUID=d84afb64:e6fa2b64:ff21c975:f9765431
> > > >         ARRAY /dev/md/0_0 metadata=0.90
> > > >         UUID=19f2b21c:e54f9e1a:be5ad16e:9754ab5e
> > > 
> > > Yes, that is weird.  I don't know how they came to be out of sync.
> > >   mdadm --incremental --rebuild-map
> > > will fix it..

cat /dev/.mdadm/map
	md126 0.90 19f2b21c:e54f9e1a:be5ad16e:9754ab5e /dev/md/0_0
	md127 1.2 79fb7ad4:289bfae5:86c535ff:202960f2 /dev/md127
mdadm --incremental --rebuild-map
cat /dev/.mdadm/map
	md127 1.2 79fb7ad4:289bfae5:86c535ff:202960f2 /dev/md/jeffadm:jeffadm1
	md126 0.90 19f2b21c:e54f9e1a:be5ad16e:9754ab5e /dev/md/0_0

Notice that the UUIDs did NOT change, and are thus still out of sync.

BUT, the names did ...

Jeff

^ permalink raw reply

* mdadm and disk failures
From: Mike Power @ 2011-06-16  3:37 UTC (permalink / raw)
  To: linux-raid

I was looking around the wiki and I could not find the answer.

I wanted to know if the raid system will recover bad blocks lost on 
one device when copies of those blocks are available on another.  
From what I understand, when a hard drive loses a block it replaces it 
with another block but it cannot recover the data.  At this point I do 
not know what the raid system does.  My understanding is that, at one 
point in the past, the raid system did nothing: if it read from the disk 
that had the bad block it got garbage for data.  Conceptually the raid 
system could read the data from one disk and write it to the copy on the 
other and restore the data.

Is the raid system capable of recovering and restoring bad blocks?

Mike Power

^ permalink raw reply

* Re: mdadm and disk failures
From: Brad Campbell @ 2011-06-16  4:39 UTC (permalink / raw)
  To: Mike Power; +Cc: linux-raid
In-Reply-To: <4DF97A7B.1070901@dodtsair.com>

On 16/06/11 11:37, Mike Power wrote:
> Is the raid system capable of recovering and restoring bad blocks?
>
Yes, and it does so admirably. You will see a message in your system log to the effect that it has 
recovered the bad sector(s).

Brad

^ permalink raw reply

* Re: mdadm and disk failures
From: Roman Mamedov @ 2011-06-16  4:59 UTC (permalink / raw)
  To: Mike Power; +Cc: linux-raid
In-Reply-To: <4DF97A7B.1070901@dodtsair.com>

[-- Attachment #1: Type: text/plain, Size: 727 bytes --]

On Wed, 15 Jun 2011 20:37:31 -0700
Mike Power <mpower@dodtsair.com> wrote:

> From what I understand when a hard drive loses a block it replaces it 
> with another block but it can not recover the data.
> If it read from the disk that had the block it gets garbage for data.

This is not true: an HDD will never return garbage data just because it needed
to remap a bad sector. Bad sectors return an I/O error (unreadable) instead,
and are not remapped anywhere until you try to WRITE to them. At that moment,
when the hard disk has new user data for that sector, it can finally do the
remap and put the data on a good spare sector.

-- 
With respect,
Roman

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: mdadm and disk failures
From: Mikael Abrahamsson @ 2011-06-16  5:29 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Mike Power, linux-raid
In-Reply-To: <20110616105908.474d22bc@natsu>

On Thu, 16 Jun 2011, Roman Mamedov wrote:

> At that moment, when the hard disk has new user data for that sector, it 
> can finally do the remap and put it somewhere else on a remapped good 
> sector.

Just to be clear, it won't necessarily remap the sector. It might succeed 
in writing the new data to it and determine the sector isn't really bad, 
and not remap it.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply

* Re: md 3.2.1 and xfs kernel panic on Linux 2.6.38
From: Stan Hoeppner @ 2011-06-16  5:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: fibreraid@gmail.com, linux-raid, xfs-oss
In-Reply-To: <20110616115531.298328f2@notabene.brown>

On 6/15/2011 8:55 PM, NeilBrown wrote:
> On Sun, 12 Jun 2011 11:50:01 -0700 "fibreraid@gmail.com"
> <fibreraid@gmail.com> wrote:
> 
>> Hi All,

Hi guys.

I was racking my brain trying to figure out why this thread wasn't
hitting the XFS list, and finally figured it out.  Palm to forehead.

It's 'xfs@oss.sgi.com' not 'linux-xfs@vger.kernel.org'

>> I am benchmarking md RAID with XFS on a server running Linux 2.6.38
>> kernel. The server has 24 x HDD's, dual 2.4GHz 6-core CPUs, and 24GB
>> RAM.

What HBA(s)/RAID card(s)?  BBWC enabled?

>> I created an md0 array using RAID 5, 64k chunk, 23 active drives, and
>> 1 hot-spare. I then created a LVM2 volume group from this md0, and
>> created an LV out of it. The volume was formatted XFS as follows:
>>
>> /sbin/mkfs.xfs -f -l lazy-count=1 -l size=128m -s size=4096
>> /dev/mapper/pool1-vol1

With 22 stripe spindles you should have at least specified '-d sw=22' in
mkfs.xfs.  This would give better performance, though it should have
nothing to do with the panic.
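
The "22 stripe spindles" figure follows from the array geometry: a RAID 5
array of 23 active drives has one drive's worth of parity per stripe, which
leaves 22 data chunks.  A small C sketch of that arithmetic, using
hypothetical helper names; the mapping to mkfs.xfs values (a 64k stripe
unit and a stripe width of 22 units) is an assumption to be checked against
mkfs.xfs(8):

```c
/* Number of data-bearing disks in a parity RAID array.
 * parity_disks is 1 for RAID 5 and 2 for RAID 6. */
static unsigned data_disks(unsigned active_disks, unsigned parity_disks)
{
	return active_disks > parity_disks ? active_disks - parity_disks : 0;
}

/* Size of one full stripe of data in KiB: chunk size times data disks. */
static unsigned long stripe_width_kib(unsigned chunk_kib,
                                      unsigned active_disks,
                                      unsigned parity_disks)
{
	return (unsigned long)chunk_kib *
	       data_disks(active_disks, parity_disks);
}
```

With a 64k chunk and 23 active RAID 5 drives this gives 22 data disks and a
full stripe of 1408 KiB.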

>> I then mounted it as follows:
>>
>> /dev/mapper/pool1-vol1 on /volumes/pool1/vol1 type xfs
>> (rw,_netdev,noatime,nodiratime,osyncisdsync,nobarrier,logbufs=8,delaylog)

I'm wondering if specifying nobarrier might have something to do with
the OP's issue.  Does the system panic when using only

defaults,delaylog

as the mount options?

>> Once md synchronization was complete, I removed one of the active 23
>> drives. After attempting some IO, the md0 array began to rebuild to
>> the hot-spare. In a few hours, it was complete and the md0 array was
>> listed as active and healthy again (though now lacking a hot-spare
>> obviously).
>>
>> As a test, I removed one more drive to see what would happen. As
>> expected, mdadm reported the array as active but degraded, and since
>> there was no hot-spare available, there was no rebuilding happening.
>>
> ....
>>
>> What surprised me though is that I was no longer able to run IO on the
>> md0 device. As a test, I am using fio to generate IO to the XFS
>> mountpoint /volumes/pool1/vol1. However, IO failed. A few minutes
>> later, I received the following kernel dumps in /var/log/messages. Any
>> ideas?

What happens when you test with something other than FIO?  How about
simply touching a file or creating a directory?

>>
>>
>> Jun 12 11:33:54 TESTBA16 kernel: [59435.936575] fio             D
>> ffff88060c6e1a50     0 30463      1 0x00000000
>> Jun 12 11:33:54 TESTBA16 kernel: [59435.936578]  ffff880609887778
>> 0000000000000086 0000000000000001 0000000000000086
>> Jun 12 11:33:54 TESTBA16 kernel: [59435.936581]  0000000000011e40
>> ffff88060c6e16c0 ffff88060c6e1a50 ffff880609887fd8
>> Jun 12 11:33:54 TESTBA16 kernel: [59435.936583]  ffff88060c6e1a58
>> 0000000000011e40 ffff880609886010 0000000000011e40
>> Jun 12 11:33:54 TESTBA16 kernel: [59435.936586] Call Trace:
>> Jun 12 11:33:54 TESTBA16 kernel: [59435.936594]  [<ffffffffa025e698>]
>> make_request+0x138/0x3d0 [raid456]
> 
>>
>> The errors seem to be a combination of XFS and md related messages.
>> Any insight into this issue would be greatly appreciated. Thanks!
>>
> 
> Very peculiar!
> 
> It appears that make_request in raid5.c is entering schedule() in an
> uninterruptible wait.
> There are 4 places where make_request calls schedule.
> 2 can only happen if the  array is being reshaped (e.g. 5 drives to 6 drives)
> but that does not appear to be happening.
> 1 causes an interruptible wait, so it cannot be that one.
> 
> That just leaves the one on line 4105.
> This requires either that the stripe is being reshaped (which we already
> decided isn't happening) or that md/raid5 has received overlapping requests.
> 
> i.e. while one request (either read or write) was pending, another request
> (either read or write, not necessarily the same) arrives for a range of
> sectors which over-laps the previous request.
> 
> When this happens (which it shouldn't because it would be dumb for a
> filesystem to do that, but you never know) md/raid5 will wait for the first
> request to be completely handled before letting the second proceed.
> So we should be waiting here for at most a small fraction of a second.
> Clearly we are waiting longer than that...
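
What "overlapping" means in the description above can be stated precisely
for two sector ranges.  The following C sketch only illustrates that
definition; it is not how raid5.c detects overlaps, and the type and
function names are made up for the example:

```c
#include <stdbool.h>

typedef unsigned long long sector_nr;   /* hypothetical sector index type */

/* True if the half-open sector ranges [a_start, a_start + a_len) and
 * [b_start, b_start + b_len) share at least one sector. */
static bool ranges_overlap(sector_nr a_start, sector_nr a_len,
                           sector_nr b_start, sector_nr b_len)
{
	if (a_len == 0 || b_len == 0)
		return false;           /* an empty request overlaps nothing */
	return a_start < b_start + b_len && b_start < a_start + a_len;
}
```

Two requests that merely touch end to start (for example sectors 0-7 and
8-15) do not overlap under this definition.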

With nobarrier set, I'm wondering if XFS is issuing overlapping writes
to the same sector on the log device.  Maybe the drives aren't
responding quickly enough, causing the excess wait.

> So this cannot possibly happen (as is so often the case when debugging :-)
> 
> Hmmm... maybe we are missing the wakeup call.  I can find where we wake-up
> anyone waiting for an overlapping read request to complete, but I cannot find
> where we wake-up someone waiting for when an overlapping write request
> completes.  That should probably go in handle_stripe_clean_event.

I'm beginning to think this is a case of non-enterprise drives (no TLER,
etc.) being used with a cache-less HBA and without write barriers.  This
would definitely be a recipe for disaster from a data loss standpoint,
though I'm not sure it should cause a kernel panic.

> Do you have the system still hanging in this state?  If not, can you get it
> back into this state easily?
> If so, you can force a wakeup with the magic incantation:
> 
>  cat /sys/block/mdXX/md/suspend_lo > /sys/block/mdXX/md/suspend_lo
> 
> (with 'XX' suitably substituted).
> 
> If that makes a difference, then I know I am on the right track

Is there any downside to introducing such a wake-up for writers?

-- 
Stan

^ permalink raw reply

* Re: Why do I get different results for 'mdadm --detail' & 'mdadm --examine' for the same array?
From: jeffs_linux @ 2011-06-16  6:06 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <1308191963.29354.1463681709@webmail.messagingengine.com>

repeating the process on another box, i just noticed that these two
results' UUIDs match exactly,

 /dev/.mdadm/map       /dev/md/0_0  52f5b43c:e83f7e2a:be6ad32e:0536ab0e 
 mdadm --detail --scan /dev/md/0_0  52f5b43c:e83f7e2a:be6ad32e:0536ab0e

but the OTHER array's two UUIDs

 /dev/.mdadm/map       /dev/md127   79fb7ad4:289bfae5:86c535ff:202960f2 
 mdadm --detail --scan /dev/md127   d47afb79:e5fa9b28:ff35c586:f2602920

are 'transforms' of one another; e.g.,

 mdadm --detail --scan /dev/md127   d47afb79:e5fa9b28:ff35c586:f2602920
 
                                    d4       e5  
                                      7a       fa 
                                        fb       9b
                                          79:      28:...

                                        |
                                        |  couplet order transform
                                        |

                                          d4       e5
                                        7a       fa
                                      fb       9b
                                    79:      28:      ...

 /dev/.mdadm/map       /dev/md127   79fb7ad4:289bfae5:86c535ff:202960f2 

what's happening here?  that can't be a coincidence.

is this a bug?  or maybe an artifact of the type of array?
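
The 'transform' shown above is what you get by reversing the byte order
within each 32-bit word of the UUID.  A short C sketch that reproduces it;
that this comes from an endianness difference in how the two code paths
print the UUID is an assumption, and uuid_swap_words is a hypothetical
name, not an mdadm function:

```c
#include <stdint.h>

/* Reverse the four bytes of one 32-bit word. */
static uint32_t swab32(uint32_t w)
{
	return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
	       ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Byte-swap each of the four 32-bit words of a UUID. */
static void uuid_swap_words(const uint32_t in[4], uint32_t out[4])
{
	for (int i = 0; i < 4; i++)
		out[i] = swab32(in[i]);
}
```

Feeding in d47afb79:e5fa9b28:ff35c586:f2602920 yields
79fb7ad4:289bfae5:86c535ff:202960f2, the value in the map file.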

jeff

^ permalink raw reply

* [PATCH] gcc warnings again
From: Luca Berra @ 2011-06-16  7:05 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1481 bytes --]

hello.
yesterday i tried rebuilding both mdadm 3.1.5 and 3.2.1 with gcc 4.6,
with the following CXFLAGS

x86:    -O2 -g -frecord-gcc-switches -Wstrict-aliasing=2 -pipe -Wformat
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector
--param=ssp-buffer-size=4 -fomit-frame-pointer -mtune=generic
-march=i586 -fasynchronous-unwind-tables

x86_64: -O2 -g -frecord-gcc-switches -Wstrict-aliasing=2 -pipe -Wformat
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector
--param=ssp-buffer-size=4 -fPIC

i found a good number of warnings
unused but set variable
strict aliasing
comparison between signed and unsigned values *on 32bit*

for the unused variables i found fedora already had a patch which is
sensible enough, i did not see it reported here, so i will attach it.

I know -Wstrict-aliasing=2 can give false positives but those looked real
to me, so i fixed those.
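
the pattern used in the strict-aliasing patch, reduced to a standalone C
sketch: instead of casting a byte array to int* and letting a function
write through it, fill a real int array and memcpy it across. store_uuid
is a hypothetical helper, and the sketch assumes a 4-byte int:

```c
#include <string.h>

_Static_assert(sizeof(int) == 4, "sketch assumes 4-byte int");

/* Copy a UUID held as four ints into a 16-byte field.
 * memcpy moves the bytes without ever accessing the destination
 * through an int lvalue, so no aliasing rule is involved. */
static void store_uuid(unsigned char dst[16], const int src[4])
{
	memcpy(dst, src, 16);
}
```

the bytes end up exactly as the cast would have left them; only the way
they get there changes.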

looking at the gpt code in util.c i found i did not like it at all. a
gpt partition entry is currently 128 bytes, but the spec does not say it
is a fixed value, so the code that reads into a buffer in 512-byte
chunks, expecting this to be a multiple of the entry size, is imho incorrect.
my fix was to read each partition entry directly into a struct
GPT_part_entry. the advantage is that the code is very simple to read;
the disadvantage is that it does 128 reads of 128 bytes each, which is
sub-optimal, but i believe readahead will mitigate this a lot.
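
a standalone sketch of the one-read-per-entry idea, for illustration only.
it is not the attached patch: it reads each entry into a buffer sized for
the largest entry it accepts, so an entry size above 128 cannot overrun a
fixed-size struct. last_partition_end and MAX_ENTRY_SIZE are made-up names,
and the offsets used (type GUID at byte 0, last LBA at byte 40) are the
standard GPT entry layout:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define MAX_ENTRY_SIZE 512   /* largest entry size this sketch accepts */

/* Decode a little-endian 64-bit value from 8 bytes. */
static uint64_t le64_from_bytes(const unsigned char *p)
{
	uint64_t v = 0;
	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* fd must be positioned at the first partition entry.
 * Returns 1 on success, 0 on a short read, -1 on an implausible size. */
static int last_partition_end(int fd, unsigned nr_entries,
                              unsigned entry_size, uint64_t *end)
{
	unsigned char buf[MAX_ENTRY_SIZE];
	static const unsigned char empty_guid[16] = {0};

	*end = 0;
	if (entry_size < 48 || entry_size > sizeof(buf))
		return -1;
	for (unsigned n = 0; n < nr_entries; n++) {
		if (read(fd, buf, entry_size) != (ssize_t)entry_size)
			return 0;
		/* an all-zero type GUID marks an unused entry */
		if (memcmp(buf, empty_guid, 16) != 0) {
			uint64_t e = le64_from_bytes(buf + 40);
			if (e > *end)
				*end = e;
		}
	}
	return 1;
}
```

decoding the LBA byte by byte also avoids depending on host endianness or
on struct padding.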

regards,
L.


-- 
Luca Berra -- bluca@comedia.it

[-- Attachment #2: mdadm-3.1.5-unused-param.patch --]
[-- Type: text/plain, Size: 4951 bytes --]

--- mdadm-3.2.1/sysfs.c.param	2011-03-28 11:28:13.599402233 -0400
+++ mdadm-3.2.1/sysfs.c	2011-03-28 11:48:02.593714836 -0400
@@ -418,7 +418,7 @@ int sysfs_set_num(struct mdinfo *sra, st
 int sysfs_uevent(struct mdinfo *sra, char *event)
 {
 	char fname[50];
-	int n;
+	unsigned int n;
 	int fd;
 
 	sprintf(fname, "/sys/block/%s/uevent",
@@ -428,6 +428,11 @@ int sysfs_uevent(struct mdinfo *sra, cha
 		return -1;
 	n = write(fd, event, strlen(event));
 	close(fd);
+	if (n != strlen(event)) {
+		dprintf(Name ": failed to write '%s' to '%s' (%s)\n",
+			event, fname, strerror(errno));
+		return -1;
+	}
 	return 0;
 }	
 
--- mdadm-3.2.1/mdadm.c.param	2011-03-28 10:38:12.035258787 -0400
+++ mdadm-3.2.1/mdadm.c	2011-03-28 10:39:33.346082070 -0400
@@ -103,7 +103,9 @@ int main(int argc, char *argv[])
 	char *shortopt = short_options;
 	int dosyslog = 0;
 	int rebuild_map = 0;
+#if 0
 	int auto_update_home = 0;
+#endif
 	char *subarray = NULL;
 	char *remove_path = NULL;
 	char *udev_filename = NULL;
@@ -1325,11 +1327,13 @@ int main(int argc, char *argv[])
 							cnt++;
 							acnt++;
 						}
+#if 0
 						if (rv2 == 1)
 							/* found something so even though assembly failed  we
 							 * want to avoid auto-updates
 							 */
 							auto_update_home = 0;
+#endif
 					} while (rv2!=2);
 					/* Incase there are stacked devices, we need to go around again */
 				} while (acnt);
--- mdadm-3.2.1/mdmon.c.param	2011-03-28 11:29:41.128681560 -0400
+++ mdadm-3.2.1/mdmon.c	2011-03-28 11:30:54.514946394 -0400
@@ -513,6 +513,9 @@ static int mdmon(char *devname, int devn
 	ignore = dup(0);
 #endif
 
+	if (ignore)
+		ignore++;
+
 	do_manager(container);
 
 	exit(0);
--- mdadm-3.2.1/Grow.c.param	2011-03-28 10:38:12.038259001 -0400
+++ mdadm-3.2.1/Grow.c	2011-03-28 10:45:28.174500010 -0400
@@ -1312,7 +1312,6 @@ int Grow_reshape(char *devname, int fd, 
 	char *subarray = NULL;
 
 	int frozen;
-	int changed = 0;
 	char *container = NULL;
 	char container_buf[20];
 	int cfd = -1;
@@ -1479,7 +1478,6 @@ int Grow_reshape(char *devname, int fd, 
 		if (!quiet)
 			fprintf(stderr, Name ": component size of %s has been set to %lluK\n",
 				devname, size);
-		changed = 1;
 	} else if (array.level != LEVEL_CONTAINER) {
 		size = get_component_size(fd)/2;
 		if (size == 0)
--- mdadm-3.2.1/Query.c.param	2011-03-28 10:38:12.040259145 -0400
+++ mdadm-3.2.1/Query.c	2011-03-28 10:41:19.272668999 -0400
@@ -35,7 +35,7 @@ int Query(char *dev)
 	int fd = open(dev, O_RDONLY);
 	int vers;
 	int ioctlerr;
-	int superror, superrno;
+	int superror;
 	struct mdinfo info;
 	mdu_array_info_t array;
 	struct supertype *st = NULL;
@@ -84,7 +84,6 @@ int Query(char *dev)
 	st = guess_super(fd);
 	if (st) {
 		superror = st->ss->load_super(st, fd, dev);
-		superrno = errno;
 	} else
 		superror = -1;
 	close(fd);
--- mdadm-3.2.1/super1.c.param	2011-03-28 10:38:12.043259360 -0400
+++ mdadm-3.2.1/super1.c	2011-03-28 10:53:14.423905054 -0400
@@ -111,7 +111,6 @@ static unsigned int calc_sb_1_csum(struc
 	unsigned long long newcsum;
 	int size = sizeof(*sb) + __le32_to_cpu(sb->max_dev)*2;
 	unsigned int *isuper = (unsigned int*)sb;
-	int i;
 
 /* make sure I can count... */
 	if (offsetof(struct mdp_superblock_1,data_offset) != 128 ||
@@ -123,7 +122,7 @@ static unsigned int calc_sb_1_csum(struc
 	disk_csum = sb->sb_csum;
 	sb->sb_csum = 0;
 	newcsum = 0;
-	for (i=0; size>=4; size -= 4 ) {
+	for (; size>=4; size -= 4 ) {
 		newcsum += __le32_to_cpu(*isuper);
 		isuper++;
 	}
@@ -387,15 +386,11 @@ static void examine_super1(struct supert
 	printf("   Array State : ");
 	for (d=0; d<__le32_to_cpu(sb->raid_disks) + delta_extra; d++) {
 		int cnt = 0;
-		int me = 0;
 		unsigned int i;
 		for (i=0; i< __le32_to_cpu(sb->max_dev); i++) {
 			unsigned int role = __le16_to_cpu(sb->dev_roles[i]);
-			if (role == d) {
-				if (i == __le32_to_cpu(sb->dev_number))
-					me = 1;
+			if (role == d)
 				cnt++;
-			}
 		}
 		if (cnt > 1) printf("?");
 		else if (cnt == 1) printf("A");
--- mdadm-3.2.1/Incremental.c.param	2011-03-28 10:38:12.045259502 -0400
+++ mdadm-3.2.1/Incremental.c	2011-03-28 11:31:41.924347665 -0400
@@ -707,7 +707,7 @@ static int count_active(struct supertype
 	int cnt = 0;
 	__u64 max_events = 0;
 	char *avail = NULL;
-	int *best;
+	int *best = NULL;
 	char *devmap = NULL;
 	int numdevs = 0;
 	int devnum;
--- mdadm-3.2.1/super-intel.c.param	2011-03-28 10:38:12.048259718 -0400
+++ mdadm-3.2.1/super-intel.c	2011-03-28 11:33:53.898816208 -0400
@@ -6164,7 +6164,7 @@ static int apply_takeover_update(struct 
 {
 	struct imsm_dev *dev = NULL;
 	struct intel_dev *dv;
-	struct imsm_dev *dev_new;
+	struct imsm_dev *dev_new = NULL;
 	struct imsm_map *map;
 	struct dl *dm, *du;
 	int i;
@@ -7008,7 +7008,7 @@ static int imsm_create_metadata_update_f
 	int update_memory_size = 0;
 	struct imsm_update_reshape *u = NULL;
 	struct mdinfo *spares = NULL;
-	int i;
+	int i = -1;
 	int delta_disks = 0;
 	struct mdinfo *dev;
 

[-- Attachment #3: mdadm-3.2.1-strictalias.patch --]
[-- Type: text/plain, Size: 2975 bytes --]

Workaround for strict-aliasing warning

Signed-off-by: Luca Berra <bluca@vodka.it>
---

--- mdadm-3.2.1/Grow.c.strictalias	2011-06-15 14:46:48.281409916 +0000
+++ mdadm-3.2.1/Grow.c	2011-06-15 14:46:48.321410099 +0000
@@ -2914,6 +2914,7 @@ int child_monitor(int afd, struct mdinfo
 	int chunk = sra->array.chunk_size;
 	struct mdinfo *sd;
 	unsigned long stripes;
+	int uuid[4];
 
 	/* set up the backup-super-block.  This requires the
 	 * uuid from the array.
@@ -2941,7 +2942,8 @@ int child_monitor(int afd, struct mdinfo
 
 	memset(&bsb, 0, 512);
 	memcpy(bsb.magic, "md_backup_data-1", 16);
-	st->ss->uuid_from_super(st, (int*)&bsb.set_uuid);
+	st->ss->uuid_from_super(st, uuid);
+	memcpy(bsb.set_uuid, uuid, 16);
 	bsb.mtime = __cpu_to_le64(time(0));
 	bsb.devstart2 = blocks;
 
--- mdadm-3.2.1/super0.c.strictalias	2011-03-28 02:31:20.000000000 +0000
+++ mdadm-3.2.1/super0.c	2011-06-15 14:46:48.321410099 +0000
@@ -423,6 +423,7 @@ static int update_super0(struct supertyp
 	 * ignored.
 	 */
 	int rv = 0;
+	int uuid[4];
 	mdp_super_t *sb = st->sb;
 	if (strcmp(update, "sparc2.2")==0 ) {
 		/* 2.2 sparc put the events in the wrong place
@@ -561,7 +562,8 @@ static int update_super0(struct supertyp
 		if (sb->state & (1<<MD_SB_BITMAP_PRESENT)) {
 			struct bitmap_super_s *bm;
 			bm = (struct bitmap_super_s*)(sb+1);
-			uuid_from_super0(st, (int*)bm->uuid);
+			uuid_from_super0(st, uuid);
+			memcpy(bm->uuid, uuid, 16);
 		}
 	} else if (strcmp(update, "no-bitmap") == 0) {
 		sb->state &= ~(1<<MD_SB_BITMAP_PRESENT);
@@ -987,6 +989,7 @@ static int add_internal_bitmap0(struct s
 	int chunk = *chunkp;
 	mdp_super_t *sb = st->sb;
 	bitmap_super_t *bms = (bitmap_super_t*)(((char*)sb) + MD_SB_BYTES);
+	int uuid[4];
 
 
 	min_chunk = 4096; /* sub-page chunks don't work yet.. */
@@ -1010,7 +1013,8 @@ static int add_internal_bitmap0(struct s
 	memset(bms, 0, sizeof(*bms));
 	bms->magic = __cpu_to_le32(BITMAP_MAGIC);
 	bms->version = __cpu_to_le32(major);
-	uuid_from_super0(st, (int*)bms->uuid);
+	uuid_from_super0(st, uuid);
+	memcpy(bms->uuid, uuid, 16);
 	bms->chunksize = __cpu_to_le32(chunk);
 	bms->daemon_sleep = __cpu_to_le32(delay);
 	bms->sync_size = __cpu_to_le64(size);
--- mdadm-3.2.1/super1.c.strictalias	2011-06-15 14:46:48.281409916 +0000
+++ mdadm-3.2.1/super1.c	2011-06-15 14:46:48.321410099 +0000
@@ -1492,6 +1492,7 @@ add_internal_bitmap1(struct supertype *s
 	int room = 0;
 	struct mdp_superblock_1 *sb = st->sb;
 	bitmap_super_t *bms = (bitmap_super_t*)(((char*)sb) + 1024);
+	int uuid[4];
 
 	switch(st->minor_version) {
 	case 0:
@@ -1579,7 +1580,8 @@ add_internal_bitmap1(struct supertype *s
 	memset(bms, 0, sizeof(*bms));
 	bms->magic = __cpu_to_le32(BITMAP_MAGIC);
 	bms->version = __cpu_to_le32(major);
-	uuid_from_super1(st, (int*)bms->uuid);
+	uuid_from_super1(st, uuid);
+	memcpy(bms->uuid, uuid, 16);
 	bms->chunksize = __cpu_to_le32(chunk);
 	bms->daemon_sleep = __cpu_to_le32(delay);
 	bms->sync_size = __cpu_to_le64(size);

[-- Attachment #4: mdadm-3.2.1-gpt.patch --]
[-- Type: text/plain, Size: 1975 bytes --]

Workaround for strict-aliasing warning
read() returns a ssize_t, not an unsigned
Rework code to not depend on assumptions about part_entry size

Signed-off-by: Luca Berra <bluca@vodka.it>

--- mdadm-3.2.1/util.c.gpt	2011-03-28 02:31:20.000000000 +0000
+++ mdadm-3.2.1/util.c	2011-06-15 21:14:07.039082716 +0000
@@ -1280,9 +1280,8 @@ int must_be_container(int fd)
 static int get_gpt_last_partition_end(int fd, unsigned long long *endofpart)
 {
 	struct GPT gpt;
-	unsigned char buf[512];
 	unsigned char empty_gpt_entry[16]= {0};
-	struct GPT_part_entry *part;
+	struct GPT_part_entry part;
 	unsigned long long curr_part_end;
 	unsigned all_partitions, entry_size;
 	unsigned part_nr;
@@ -1290,8 +1289,9 @@ static int get_gpt_last_partition_end(in
 	*endofpart = 0;
 
 	BUILD_BUG_ON(sizeof(gpt) != 512);
-	/* read GPT header */
+	/* skip protective MBR */
 	lseek(fd, 512, SEEK_SET);
+	/* read GPT header */
 	if (read(fd, &gpt, 512) != 512)
 		return 0;
 
@@ -1308,28 +1308,19 @@ static int get_gpt_last_partition_end(in
 	    entry_size > 512)
 		return -1;
 
-	/* read first GPT partition entries */
-	if (read(fd, buf, 512) != 512)
-		return 0;
-
-	part = (struct GPT_part_entry*)buf;
-
 	for (part_nr=0; part_nr < all_partitions; part_nr++) {
+		/* read partition entry */
+		if (read(fd, &part, entry_size) != (ssize_t)entry_size)
+			return 0;
+
 		/* is this valid partition? */
-		if (memcmp(part->type_guid, empty_gpt_entry, 16) != 0) {
+		if (memcmp(part.type_guid, empty_gpt_entry, 16) != 0) {
 			/* check the last lba for the current partition */
-			curr_part_end = __le64_to_cpu(part->ending_lba);
+			curr_part_end = __le64_to_cpu(part.ending_lba);
 			if (curr_part_end > *endofpart)
 				*endofpart = curr_part_end;
 		}
 
-		part = (struct GPT_part_entry*)((unsigned char*)part + entry_size);
-
-		if ((unsigned char *)part >= buf + 512) {
-			if (read(fd, buf, 512) != 512)
-				return 0;
-			part = (struct GPT_part_entry*)buf;
-		}
 	}
 	return 1;
 }

^ permalink raw reply

* Re: mark spare as active sync device?
From: Lars Täuber @ 2011-06-16  9:29 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <20110616105423.6aefa0f1@notabene.brown>

Hello NeilBrown,

thank you really very much. You saved my day.

Am Thu, 16 Jun 2011 10:54:23 +1000
NeilBrown <neilb@suse.de> schrieb:

> On Wed, 15 Jun 2011 12:10:26 +0200 Lars Täuber <taeuber@bbaw.de> wrote:
> 
> > Hi there,
> > 
> > is it possible to mark a nearly completely synced device as active in sync?
> > 
> > There was failing a device shortly before a spare drive was completely
> > synced. I want to get the directory structure from the device.
> > 
> > md3 : active raid6 sdz[16] sdaf[13] sds[0] sdac[10] sdah[15] sdag[14] sdae
> > [12] sdad [11](F) sdab[9] sdaa[17](F) sdy[6] sdx[5] sdw[4] sdv[3] sdu[2]
> > sdt[1] 27349202944 blocks level 6, 64k chunk, algorithm 2 [16/13]
> > [UUUUUUU__UU_UUUU] [===================>.]  recovery = 98.3%
> > (1921986020/1953514496) finish=25.9min speed=20244K/sec
> > 
> > /dev/sdz was the device that was inserted and is now marked as spare. Is
> > it possible to mark sdz as active and in sync?
> 
> Not really.... and should you mark it as the in-sync member number 7, 8, or
> 11 ??

I can reconstruct the member number of sdz from the previous failure mails
from mdadm. What a nice feature!

> > 
> > Is there any chance to get data from this array?
> 
> Your best bet is to try to 'create' the array again using '--assume-clean'
> and putting the devices that you think are working in the correct place.
> e.g. something like:
> 
>  mdadm -S /dev/md3
>  mdadm -C /dev/md3 --metadata=0.90 --level=6 --algorithm=2 --chunk=64 \
>    --raid-devices=16  --assume-clean \
>    /dev/sds /dev/sdt /dev/sdu /dev/sdv   \
>    /dev/sdw /dev/sdx /dev/sdy /dev/sdz   \
>    missing /dev/sdab /dev/sdac missing   \
>    /dev/sdae /dev/sdaf /dev/sdag /dev/sdah

This was exactly the right command line, except for the algorithm option,
which is not available in my version.

> Then "fsck -n /dev/md3"  and see if it looks reasonably OK.
> If it doesn't, try placing /dev/sdz in place of a different 'missing'.
> 
> Of course you should double check the order of devices that I have given
> here, and all the other details.

Of course!

> NeilBrown

Thanks again for your help and your great tool!
Lars

^ permalink raw reply

* mdadm 3.2.1 segfaulting on boot when missing RaidDevice 0
From: Lawrence, Joe @ 2011-06-16 13:51 UTC (permalink / raw)
  To: linux-raid

Hi,

Using mdadm 3.2.1 from RedHat, I'm seeing occasional segmentation faults
on boot when only the first member (RaidDisk 0) of a RAID 1 pair is
present.

The crash occurs in Assemble.c and seems to be only invoked when the
initramfs is created with an /etc/mdadm.conf containing a line that
specifies an md device that is missing RaidDisk 1.  In RedHat terms,
this is during the Dracut environment, so I'm not sure how easy it would
be to save a core from this.

Analyzing the code backwards from the segfault address (it occurs on
lines 1386 or 1387 depending upon whatever random bits are living on the
heap), I added debugging prints and realized that the loop immediately
after the "If any devices did not get added because the kernel rejected
them ..." comment is stepping past the end of the best[] array.  This
value then indexes devices[].  Notice that the loop on line 1385 starts
at 0 and ends at bestcnt (inclusive).

I can see that on line 850, only 10 best[] entries are allocated, but
then on line 1386, a read of entry [10] is attempted.

dmesg output:

kernel: md: md5 stopped.
kernel: dracut: mdadm: Assemble.c(710) : allocated 16 entries (6400
bytes) @ 0x126bf00
kernel: dracut: mdadm: Assemble.c(837) : i = devices[0].i.disk.raid_disk
= 0
kernel: dracut: mdadm: Assemble.c(850) : allocated 10 entries (40 bytes)
@ 0x12740d0
kernel: md: bind<sdr3>
kernel: md/raid1:md5: active with 1 out of 2 mirrors
kernel: created bitmap (1 pages) for device md5
kernel: md5: bitmap initialized from disk: read 1/1 pages, set 2 bits
kernel: md5: detected capacity change from 0 to 16844251136
kernel: dracut: mdadm: /dev/md5 has been started with 1 drive (out of
2).
kernel: dracut: mdadm: Assemble.c(1386) : best @ 0x12740d0, bestcnt == i
== 10
kernel: md5: unknown partition table

I have tested breaking out of the loop starting on line 1385 when i ==
bestcnt (after my debugging output) and I do not see any further
segmentation faults.  I would think this loop should be rewritten as:

for (i = 0; i < bestcnt; i++) {
...
}

Additional debugging prints revealed that line 837, i =
devices[devcnt].i.disk.raid_disk;  is executed prior to the allocation
of newbest[].  When I had RaidDisk 0 inserted, i=0.  When the other disk
was present, i=1.  In the latter case, enough best[] was allocated and
boot succeeded, though I'm not sure if best[newbestcnt] was ever
properly initialized.  I think we just got lucky and booted ok.
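
For illustration only, here is a minimal stand-alone sketch of the bound
discussed above.  It is not the mdadm source: count_valid() and its
arguments are hypothetical stand-ins, and the only point is that an array
holding bestcnt entries has valid indices 0 .. bestcnt-1.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the loop discussed above: count the slots of
 * best[] that refer to a device (value >= 0).  best[] holds exactly
 * bestcnt entries, so the valid indices are 0 .. bestcnt-1.
 */
int count_valid(const int *best, int bestcnt)
{
	int i, n = 0;

	assert(best != NULL || bestcnt == 0);

	/*
	 * Exclusive bound: "i <= bestcnt" would also read best[bestcnt],
	 * one element past the end of the allocation.
	 */
	for (i = 0; i < bestcnt; i++)
		if (best[i] >= 0)
			n++;
	return n;
}
```

With 10 entries allocated, as in the dmesg output, the last index this loop
reads is 9, so the read of entry [10] never happens.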

The md device was created like so:

mdadm -C /dev/md5 -b internal --level=1 --raid-devices=2 /dev/sdc1
/dev/sdk1

mdadm --detail /dev/md5
/dev/md5:
        Version : 1.2
  Creation Time : Wed Jun 15 15:15:24 2011
     Raid Level : raid1
     Array Size : 8224212 (7.84 GiB 8.42 GB)
  Used Dev Size : 8224212 (7.84 GiB 8.42 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Jun 15 15:16:26 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : yowler.mno.stratus.com:5  (local to host
yowler.mno.stratus.com)
           UUID : d74aa8e8:14f85390:0cc61025:a0046ec2
         Events : 23

    Number   Major   Minor   RaidDevice State
       0      65       17        0      active sync   /dev/sdc1
       1      65       33        1      active sync   /dev/sdk1

and its corresponding entry in /etc/mdadm.conf:

ARRAY /dev/md5 level=raid1 num-devices=2
UUID=471cf895:ba6ef375:a0bd54b3:1a6b3b08

If any other configuration, logs, or debugging information needs to be
provided, I'll be glad to provide it.

Thanks,

-- Joe Lawrence

^ permalink raw reply

* Re: mdadm and disk failures
From: Mike Power @ 2011-06-16 15:34 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid
In-Reply-To: <20110616105908.474d22bc@natsu>

Ahh, I understand more clearly now.  Because the failing drive will
report an IO error, the raid system can respond reactively by getting the
data elsewhere and writing it back to the failing disk.  Earlier I had
thought that the raid system might periodically poll the disks to see
what blocks were lost and correct them.

This makes me much more interested in setting up a raid system above and 
beyond the performance benefit.

^ permalink raw reply

* md0_raid5 process consuming 100% CPU on disk failure
From: Shivani Bhope @ 2011-06-16 19:01 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <542975669.724147.1308250559342.JavaMail.root@zmail02.collab.prod.int.phx2.redhat.com>

Hello all, 

I have been trying to test md raid 5 array faults. I set up a three-disk 
RAID 5, and started some I/O. Then I entered the command: 

mdadm /dev/md0 --fail /dev/sdc 

As soon as the disk is failed, the md0_raid5 process is seen to be taking 
100% CPU. This happens about 2 out of 3 times the test was run. 

The details: 

1. Output for mdadm -D /dev/md0 before the test: 
/dev/md0: 
Version : 1.2 
Creation Time : Tue Jun 14 20:05:07 2011 
Raid Level : raid5 
Array Size : 143129600 (136.50 GiB 146.56 GB) 
Used Dev Size : 71564800 (68.25 GiB 73.28 GB) 
Raid Devices : 3 
Total Devices : 3 
Persistence : Superblock is persistent 

Intent Bitmap : Internal 

Update Time : Wed Jun 15 13:42:59 2011 
State : active 
Active Devices : 3 
Working Devices : 3 
Failed Devices : 0 
Spare Devices : 0 

Layout : left-symmetric 
Chunk Size : 512K 

Name : 0 
UUID : 5a1f5b73:7ce46e00:2b16a389:cadd7ae8 
Events : 4716 

Number Major Minor RaidDevice State 
0 8 16 0 active sync /dev/sdb 
1 8 32 1 active sync /dev/sdc 
3 8 48 2 active sync /dev/sdd 


2. Output for mdadm -D after the test: 

/dev/md0: 
Version : 1.2 
Creation Time : Wed Jun 15 17:44:11 2011 
Raid Level : raid5 
Array Size : 143129600 (136.50 GiB 146.56 GB) 
Used Dev Size : 71564800 (68.25 GiB 73.28 GB) 
Raid Devices : 3 
Total Devices : 3 
Persistence : Superblock is persistent 

Intent Bitmap : Internal 

Update Time : Wed Jun 15 18:18:41 2011 
State : active, degraded 
Active Devices : 2 
Working Devices : 2 
Failed Devices : 1 
Spare Devices : 0 

Layout : left-symmetric 
Chunk Size : 512K 

Name : 0 
UUID : 8663119c:892e5b7d:a4f2be22:5bb2fdd4 
Events : 1049 

Number Major Minor RaidDevice State 
0 8 16 0 active sync /dev/sdb 
1 8 32 1 faulty spare rebuilding /dev/sdc 
3 8 48 2 active sync /dev/sdd 


Note the device is not marked as removed and spare but as
spare rebuilding. 

The system has to be manually power-cycled to recover. 

This system is running RHEL 6.1, kernel version 
2.6.32-131.0.15.el6.x86_64 & mdadm-3.2.1-1.el6.x86_64
The same test was also run on Fedora-15, kernel version 
2.6.38.8-32.fc15.x86_64 & mdadm-3.1.5-2.fc15.x86_64

The output for sysrq-t showing md0_raid5 after the hang is: 

Jun 15 18:20:39 kernel: md0_raid5 R running task 0 1896 2 0x00000080 
Jun 15 18:20:39 kernel: ffff880175e3fbd0 ffffffff814db337 ffff880175e3fb70 ffffffff8104af29 
Jun 15 18:20:39 kernel: 000000000000100c 0000000300000001 ffffe8ffffc00270 ffff88002801e988 
Jun 15 18:20:39 kernel: ffff8801747bdab8 ffff880175e3ffd8 000000000000f598 ffff8801747bdac0 
Jun 15 18:20:39 kernel: Call Trace: 
Jun 15 18:20:39 kernel: [<ffffffff814db337>] ? thread_return+0x4e/0x777 
Jun 15 18:20:39 kernel: [<ffffffff8104af29>] ? __wake_up_common+0x59/0x90 
Jun 15 18:20:39 kernel: [<ffffffff81103c56>] ? __perf_event_task_sched_out+0x36/0x50 
Jun 15 18:20:39 kernel: [<ffffffff8105faba>] __cond_resched+0x2a/0x40 
Jun 15 18:20:39 kernel: [<ffffffff814dbbb0>] _cond_resched+0x30/0x40 
Jun 15 18:20:39 kernel: [<ffffffffa0362cee>] ops_run_io+0x2e/0x350 [raid456] 
Jun 15 18:20:39 kernel: [<ffffffffa03659d1>] handle_stripe+0x501/0x2310 [raid456] 
Jun 15 18:20:39 kernel: [<ffffffff8104f843>] ? __wake_up+0x53/0x70 
Jun 15 18:20:39 kernel: [<ffffffffa0367c7f>] raid5d+0x49f/0x690 [raid456] 
Jun 15 18:20:39 kernel: [<ffffffff813de266>] md_thread+0x116/0x150 
Jun 15 18:20:39 kernel: [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40 
Jun 15 18:20:39 kernel: [<ffffffff813de150>] ? md_thread+0x0/0x150 
Jun 15 18:20:39 kernel: [<ffffffff8108ddf6>] kthread+0x96/0xa0 
Jun 15 18:20:39 kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20 
Jun 15 18:20:39 kernel: [<ffffffff8108dd60>] ? kthread+0x0/0xa0 
Jun 15 18:20:39 kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20 

I tried to increase the stripe_cache_size to 512 and then to 16384.
The problem is still seen. 

What could be the reason for the problem? 

Any pointers will be greatly appreciated. 

Thanks, 
Shivani 


^ permalink raw reply

* raid6 issues
From: Chad Walker @ 2011-06-16 20:28 UTC (permalink / raw)
  To: linux-raid

I have 15 drives in a raid6 plus a spare. I returned home after being
gone for 12 days and one of the drives was marked as faulty. The load
on the machine was crazy, and mdadm stopped responding. I should've done
an strace, sorry. Likewise cat'ing /proc/mdstat was blocking. I
rebooted and mdadm started recovering, but to the faulty drive. I
checked in on /proc/mdstat periodically over the 35-hour recovery.
When it was down to the last bit, /proc/mdstat and mdadm stopped
responding again. I gave it 28 hours, and then when I still couldn't
get any insight into it I rebooted again. Now /proc/mdstat says it's
inactive. And I don't appear to be able to assemble it. I issued
--examine on each of the 16 drives and they all agreed with each other
except for the faulty drive. I popped the faulty drive out and
rebooted again, still no luck assembling.

This is what my /proc/mdstat looks like:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md1 : inactive sdd1[12](S) sdm1[6](S) sdf1[0](S) sdh1[2](S) sdi1[7](S)
sdb1[14](S) sdo1[4](S) sdg1[1](S) sdl1[8](S) sdk1[9](S) sdc1[13](S)
sdn1[3](S) sdj1[10](S) sdp1[15](S) sde1[11](S)
      29302715520 blocks

unused devices: <none>

This is what the --examine for /dev/sd[b-o]1 and /dev/sdq1 look like:
/dev/sdb1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 78e3f473:48bbfc34:0e051622:5c30970b
  Creation Time : Wed Mar 30 14:48:46 2011
     Raid Level : raid6
  Used Dev Size : 1953514368 (1863.02 GiB 2000.40 GB)
     Array Size : 25395686784 (24219.21 GiB 26005.18 GB)
   Raid Devices : 15
  Total Devices : 16
Preferred Minor : 1

    Update Time : Wed Jun 15 07:45:12 2011
          State : active
 Active Devices : 14
Working Devices : 15
 Failed Devices : 1
  Spare Devices : 1
       Checksum : e4ff038f - correct
         Events : 38452

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this    14       8       17       14      active sync   /dev/sdb1

   0     0       8       81        0      active sync   /dev/sdf1
   1     1       8       97        1      active sync   /dev/sdg1
   2     2       8      113        2      active sync   /dev/sdh1
   3     3       8      209        3      active sync   /dev/sdn1
   4     4       8      225        4      active sync   /dev/sdo1
   5     5       0        0        5      faulty removed
   6     6       8      193        6      active sync   /dev/sdm1
   7     7       8      129        7      active sync   /dev/sdi1
   8     8       8      177        8      active sync   /dev/sdl1
   9     9       8      161        9      active sync   /dev/sdk1
  10    10       8      145       10      active sync   /dev/sdj1
  11    11       8       65       11      active sync   /dev/sde1
  12    12       8       49       12      active sync   /dev/sdd1
  13    13       8       33       13      active sync   /dev/sdc1
  14    14       8       17       14      active sync   /dev/sdb1
  15    15      65        1       15      spare   /dev/sdq1

And this is what --examine for /dev/sdp1 looked like:
/dev/sdp1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 78e3f473:48bbfc34:0e051622:5c30970b
  Creation Time : Wed Mar 30 14:48:46 2011
     Raid Level : raid6
  Used Dev Size : 1953514368 (1863.02 GiB 2000.40 GB)
     Array Size : 25395686784 (24219.21 GiB 26005.18 GB)
   Raid Devices : 15
  Total Devices : 16
Preferred Minor : 1

    Update Time : Tue Jun 14 07:35:56 2011
          State : active
 Active Devices : 15
Working Devices : 16
 Failed Devices : 0
  Spare Devices : 1
       Checksum : e4fdb07b - correct
         Events : 38433

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5       8      241        5      active sync   /dev/sdp1

   0     0       8       81        0      active sync   /dev/sdf1
   1     1       8       97        1      active sync   /dev/sdg1
   2     2       8      113        2      active sync   /dev/sdh1
   3     3       8      209        3      active sync   /dev/sdn1
   4     4       8      225        4      active sync   /dev/sdo1
   5     5       8      241        5      active sync   /dev/sdp1
   6     6       8      193        6      active sync   /dev/sdm1
   7     7       8      129        7      active sync   /dev/sdi1
   8     8       8      177        8      active sync   /dev/sdl1
   9     9       8      161        9      active sync   /dev/sdk1
  10    10       8      145       10      active sync   /dev/sdj1
  11    11       8       65       11      active sync   /dev/sde1
  12    12       8       49       12      active sync   /dev/sdd1
  13    13       8       33       13      active sync   /dev/sdc1
  14    14       8       17       14      active sync   /dev/sdb1
  15    15      65        1       15      spare   /dev/sdq1

I was scared to run mdadm --build --level=6 --raid-devices=15 /dev/md1
/dev/sdf1 /dev/sdg1....

system information:
Ubuntu 11.04, kernel 2.6.38, x86_64, mdadm version 3.1.4, 3ware 9650SE

Any advice? There's about 1TB of data on these drives that would cause
my wife to kill me (and about 9TB of data would just irritate her to
lose).

-chad

^ permalink raw reply

* [PATCH] md/raid1: factor out common bio handling code
From: Namhyung Kim @ 2011-06-17  2:10 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

When a normal-write or sync-read/write bio completes, we need to
find the disk number the bio belongs to. Factor that common
code out into a separate function.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 drivers/md/raid1.c |   45 ++++++++++++++++++++++++---------------------
 1 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 39e9f54038a8..7a68516e55f2 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -251,6 +251,23 @@ static inline void update_head_pos(int disk, r1bio_t *r1_bio)
 		r1_bio->sector + (r1_bio->sectors);
 }
 
+/*
+ * Find the disk number which triggered given bio
+ */
+static int find_bio_disk(r1bio_t *r1_bio, struct bio *bio)
+{
+	int i;
+
+	for (i = 0; i < r1_bio->mddev->raid_disks; i++)
+		if (r1_bio->bios[i] == bio)
+			break;
+
+	BUG_ON(i == r1_bio->mddev->raid_disks);
+	update_head_pos(i, r1_bio);
+
+	return i;
+}
+
 static void raid1_end_read_request(struct bio *bio, int error)
 {
 	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -328,10 +345,7 @@ static void raid1_end_write_request(struct bio *bio, int error)
 	conf_t *conf = r1_bio->mddev->private;
 	struct bio *to_put = NULL;
 
-
-	for (mirror = 0; mirror < conf->raid_disks; mirror++)
-		if (r1_bio->bios[mirror] == bio)
-			break;
+	mirror = find_bio_disk(r1_bio, bio);
 
 	/*
 	 * 'one mirror IO has finished' event handler:
@@ -355,8 +369,6 @@ static void raid1_end_write_request(struct bio *bio, int error)
 		 */
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
 
-	update_head_pos(mirror, r1_bio);
-
 	if (behind) {
 		if (test_bit(WriteMostly, &conf->mirrors[mirror].rdev->flags))
 			atomic_dec(&r1_bio->behind_remaining);
@@ -1126,13 +1138,10 @@ abort:
 static void end_sync_read(struct bio *bio, int error)
 {
 	r1bio_t *r1_bio = bio->bi_private;
-	int i;
 
-	for (i=r1_bio->mddev->raid_disks; i--; )
-		if (r1_bio->bios[i] == bio)
-			break;
-	BUG_ON(i < 0);
-	update_head_pos(i, r1_bio);
+	/* this will update head position of appropriate disk */
+	find_bio_disk(r1_bio, bio);
+
 	/*
 	 * we have read a block, now it needs to be re-written,
 	 * or re-read if the read failed.
@@ -1151,14 +1160,10 @@ static void end_sync_write(struct bio *bio, int error)
 	r1bio_t *r1_bio = bio->bi_private;
 	mddev_t *mddev = r1_bio->mddev;
 	conf_t *conf = mddev->private;
-	int i;
-	int mirror=0;
+	int mirror;
+
+	mirror = find_bio_disk(r1_bio, bio);
 
-	for (i = 0; i < conf->raid_disks; i++)
-		if (r1_bio->bios[i] == bio) {
-			mirror = i;
-			break;
-		}
 	if (!uptodate) {
 		sector_t sync_blocks = 0;
 		sector_t s = r1_bio->sector;
@@ -1173,8 +1178,6 @@ static void end_sync_write(struct bio *bio, int error)
 		md_error(mddev, conf->mirrors[mirror].rdev);
 	}
 
-	update_head_pos(mirror, r1_bio);
-
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
 		sector_t s = r1_bio->sectors;
 		put_buf(r1_bio);
-- 
1.7.5.2
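
For readers without the kernel tree at hand, the search-then-assert pattern
that the patch factors out can be modelled in user space roughly as below.
This is only a sketch under stated assumptions: fake_bio, fake_r1bio,
MAX_DISKS and find_slot() are invented for illustration, assert() stands in
for BUG_ON(), and the update_head_pos() side effect is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DISKS 8	/* arbitrary limit for this sketch */

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct fake_bio {
	int id;
};

struct fake_r1bio {
	int raid_disks;
	struct fake_bio *bios[MAX_DISKS];
};

/*
 * Return the slot whose bio pointer matches 'bio'.  As in the patch,
 * failing to find it is treated as a programming error.
 */
int find_slot(const struct fake_r1bio *r1, const struct fake_bio *bio)
{
	int i;

	for (i = 0; i < r1->raid_disks; i++)
		if (r1->bios[i] == bio)
			break;

	assert(i < r1->raid_disks);	/* stands in for BUG_ON() */
	return i;
}
```

Each completion handler can then call the one helper instead of carrying
its own copy of the loop.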


^ permalink raw reply related

* Re: [PATCH] gcc warnings again
From: NeilBrown @ 2011-06-17  4:42 UTC (permalink / raw)
  To: Luca Berra; +Cc: linux-raid
In-Reply-To: <20110616070510.GA8544@maude.comedia.it>

On Thu, 16 Jun 2011 09:05:10 +0200 Luca Berra <bluca@comedia.it> wrote:

> hello.
> yesterday i tried rebuilding both mdadm 3.1.5 and 3.2.1 with gcc 4.6,
> with the following CXFLAGS
> 
> x86:    -O2 -g -frecord-gcc-switches -Wstrict-aliasing=2 -pipe -Wformat
> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector
> --param=ssp-buffer-size=4 -fomit-frame-pointer -mtune=generic
> -march=i586 -fasynchronous-unwind-tables
> 
> x86_64: -O2 -g -frecord-gcc-switches -Wstrict-aliasing=2 -pipe -Wformat
> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fstack-protector
> --param=ssp-buffer-size=4 -fPIC
> 
> i found a good number of warnings
> unused but set variable
> strict aliasing
> comparison between signed and unsigned values *on 32bit*
> 
> for the unused variables i found fedora already had a patch which is
> sensible enough, i did not see it reported here, so i will attach it.
> 
> I know -Wstrict-aliasing=2 can give false positive but those looked real
> to me, so i fixed those.
> 
> looking at the gpt code in util.c i found i did not like it at all, a
> gpt partition entry is currently 128 bytes, but the spec does not say it
> is a fixed value, so the code that reads into a buffer with 512bytes
> chunk expecting this to be a multiple of part_size is imho incorrect.
> my fix was to read each partition entry directly into a struct
> GPT_part_entry, the advantage is that the code is very simple to read,
> the disadvantage is that it is 128 reads of 128 bytes each, which is
> sub-optimal, but i believe readahead will mitigate this a lot.
> 
> regards,
> L.
> 
> 

Hi Luca,
 thanks for these.  I have applied them all, though I made a number of
 changes to the first patch for fixing warnings - nothing major.

They are just in time for 3.2.2 which is nice...

Thanks,
NeilBrown


^ permalink raw reply

* Re: mdadm 3.2.1 segfaulting on boot when missing RaidDevice 0
From: NeilBrown @ 2011-06-17  4:49 UTC (permalink / raw)
  To: Lawrence, Joe; +Cc: linux-raid
In-Reply-To: <CC98C7236C1DB341A7369B3274581F8E063E36@EXHQ1.corp.stratus.com>

On Thu, 16 Jun 2011 09:51:51 -0400 "Lawrence, Joe" <Joe.Lawrence@stratus.com>
wrote:

> Hi,
> 
> Using mdadm 3.2.1 from RedHat, I'm seeing occasional segmentation faults
> on boot when only the first member (RaidDisk 0) of a RAID 1 pair is
> present.
> 
> The crash occurs in Assemble.c and seems to be only invoked when the
> initramfs is created with an /etc/mdadm.conf containing a line that
> specifies an md device that is missing RaidDisk 1.  In RedHat terms,
> this is during the Dracut environment, so I'm not sure how easy it would
> be to save a core from this.
> 
> Analyzing the code backwards from the segfault address (it occurs on
> lines 1386 or 1387 depending upon whatever random bits are living on the
> heap), I added debugging prints and realized that the loop immediately
> after the "If any devices did not get added because the kernel rejected
> them ..." comment is stepping past the end of the best[] array.  This
> value then indexes devices[].  Notice that the loop on line 1385 starts
> at 0 and ends at bestcnt (inclusive).
> 
> I can see that on line 850, only 10 best[] entries are allocated, but
> then on line 1386, a read of entry [10] is attempted.
> 
> dmesg output:
> 
> kernel: md: md5 stopped.
> kernel: dracut: mdadm: Assemble.c(710) : allocated 16 entries (6400
> bytes) @ 0x126bf00
> kernel: dracut: mdadm: Assemble.c(837) : i = devices[0].i.disk.raid_disk
> = 0
> kernel: dracut: mdadm: Assemble.c(850) : allocated 10 entries (40 bytes)
> @ 0x12740d0
> kernel: md: bind<sdr3>
> kernel: md/raid1:md5: active with 1 out of 2 mirrors
> kernel: created bitmap (1 pages) for device md5
> kernel: md5: bitmap initialized from disk: read 1/1 pages, set 2 bits
> kernel: md5: detected capacity change from 0 to 16844251136
> kernel: dracut: mdadm: /dev/md5 has been started with 1 drive (out of
> 2).
> kernel: dracut: mdadm: Assemble.c(1386) : best @ 0x12740d0, bestcnt == i
> == 10
> kernel: md5: unknown partition table
> 
> I have tested breaking out of the loop starting on line 1385 when i ==
> bestcnt (after my debugging output) and I do not see any further
> segmentation faults.  I would think this loop should be rewritten as:
> 
> for (i = 0; i < bestcnt; i++) {
> ...
> }
> 


Thanks for the report and the terrific analysis Joe.

I completely agree with you - that should be 'i < bestcnt'.
It will be fixed in the soon-to-be-released 3.2.2.

Thanks,
NeilBrown



> Additional debugging prints revealed that line 837, i =
> devices[devcnt].i.disk.raid_disk;  is executed prior to the allocation
> of newbest[].  When I had RaidDisk 0 inserted, i=0.  When the other disk
> was present, i=1.  In the latter case, enough best[] was allocated and
> boot succeeded, though I'm not sure if best[newbestcnt] was ever
> properly initialized.  I think we just got lucky and booted ok.
> 
> The md device was created like so:
> 
> mdadm -C /dev/md5 -b internal --level=1 --raid-devices=2 /dev/sdc1
> /dev/sdk1
> 
> mdadm --detail /dev/md5
> /dev/md5:
>         Version : 1.2
>   Creation Time : Wed Jun 15 15:15:24 2011
>      Raid Level : raid1
>      Array Size : 8224212 (7.84 GiB 8.42 GB)
>   Used Dev Size : 8224212 (7.84 GiB 8.42 GB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
> 
>   Intent Bitmap : Internal
> 
>     Update Time : Wed Jun 15 15:16:26 2011
>           State : active
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
> 
>            Name : yowler.mno.stratus.com:5  (local to host
> yowler.mno.stratus.com)
>            UUID : d74aa8e8:14f85390:0cc61025:a0046ec2
>          Events : 23
> 
>     Number   Major   Minor   RaidDevice State
>        0      65       17        0      active sync   /dev/sdc1
>        1      65       33        1      active sync   /dev/sdk1
> 
> 
> and its corresponding entry in /etc/mdadm.conf:
> 
> ARRAY /dev/md5 level=raid1 num-devices=2
> UUID=471cf895:ba6ef375:a0bd54b3:1a6b3b08
> 
> If any other configuration, logs, or debugging information needs to be
> provided, I'll be glad to provide it.
> 
> Thanks,
> 
> -- Joe Lawrence


^ permalink raw reply

* Re: [PATCH] mdadm: Linux 3.x version change
From: Milan Broz @ 2011-06-17 12:18 UTC (permalink / raw)
  To: NeilBrown; +Cc: Namhyung Kim, linux-raid
In-Reply-To: <1307716200-27483-1-git-send-email-namhyung@gmail.com>

On 06/10/2011 04:30 PM, Namhyung Kim wrote:
> As Linux 3.x changes its versioning scheme, we have to deal with
> the 2-digit version number also.
FYI: we have patch below currently in Fedora rawhide
(which already switched to 3.0-rc numbering scheme).

Without it, the system is unable to assemble the RAID during boot
("mdadm -As --auto=yes --run" fails because the wrong version is detected).

It seems that 3.2.2 still has the same problem.

Milan

--- mdadm-3.2.1.old/util.c	2011-03-28 04:31:20.000000000 +0200
+++ mdadm-3.2.1/util.c	2011-06-05 07:59:03.741904751 +0200
@@ -154,8 +154,8 @@ int get_linux_version()
 	a = strtoul(cp, &cp, 10);
 	if (*cp != '.') return -1;
 	b = strtoul(cp+1, &cp, 10);
-	if (*cp != '.') return -1;
-	c = strtoul(cp+1, NULL, 10);
+	if (*cp != '.' && a <= 2) return -1;
+	c = (*cp == '.') ? strtoul(cp+1, NULL, 10) : 0;
 
 	return (a*1000000)+(b*1000)+c;
 }



^ permalink raw reply

* Re: [PATCH] mdadm: Linux 3.x version change
From: NeilBrown @ 2011-06-17 12:54 UTC (permalink / raw)
  To: Milan Broz; +Cc: Namhyung Kim, linux-raid
In-Reply-To: <4DFB4627.7050809@redhat.com>

On Fri, 17 Jun 2011 14:18:47 +0200 Milan Broz <mbroz@redhat.com> wrote:

> On 06/10/2011 04:30 PM, Namhyung Kim wrote:
> > As Linux 3.x changes its versioning scheme, we have to deal with
> > the 2-digit version number also.
> FYI: we have patch below currently in Fedora rawhide
> (which already switched to 3.0-rc numbering scheme).
> 
> Without it, the system is unable to assemble the RAID during boot
> ("mdadm -As --auto=yes --run" fails because the wrong version is detected).
> 
> It seems that 3.2.2 still has the same problem.
> 
> Milan
> 
> --- mdadm-3.2.1.old/util.c	2011-03-28 04:31:20.000000000 +0200
> +++ mdadm-3.2.1/util.c	2011-06-05 07:59:03.741904751 +0200
> @@ -154,8 +154,8 @@ int get_linux_version()
>  	a = strtoul(cp, &cp, 10);
>  	if (*cp != '.') return -1;
>  	b = strtoul(cp+1, &cp, 10);
> -	if (*cp != '.') return -1;
> -	c = strtoul(cp+1, NULL, 10);
> +	if (*cp != '.' && a <= 2) return -1;
> +	c = (*cp == '.') ? strtoul(cp+1, NULL, 10) : 0;
>  
>  	return (a*1000000)+(b*1000)+c;
>  }
> 

Bother - I always seem to miss something :-(

I have added the following (which is different yet again) to git.

According to https://lwn.net/Articles/447572/ it seems it isn't yet 100%
certain that the next version will be 3.0 rather than 3.0.0....  So I might
not need to push out a new version too quickly.

Thanks,
NeilBrown

From f161d047eed634b3380262767f955eb888502e88 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Fri, 17 Jun 2011 22:49:24 +1000
Subject: [PATCH] util: correctly parse shorter linux version numbers.

The next version of Linux might be 3.0.  If it is, get_linux_version
will fail.
So make it more robust.

Reported-by: Namhyung Kim <namhyung@gmail.com>
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 util.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/util.c b/util.c
index 10bbe56..55d171a 100644
--- a/util.c
+++ b/util.c
@@ -146,16 +146,16 @@ int get_linux_version()
 {
 	struct utsname name;
 	char *cp;
-	int a,b,c;
+	int a = 0, b = 0,c = 0;
 	if (uname(&name) <0)
 		return -1;
 
 	cp = name.release;
 	a = strtoul(cp, &cp, 10);
-	if (*cp != '.') return -1;
-	b = strtoul(cp+1, &cp, 10);
-	if (*cp != '.') return -1;
-	c = strtoul(cp+1, NULL, 10);
+	if (*cp == '.')
+		b = strtoul(cp+1, &cp, 10);
+	if (*cp == '.')
+		c = strtoul(cp+1, &cp, 10);
 
 	return (a*1000000)+(b*1000)+c;
 }
-- 
1.7.3.4


^ permalink raw reply related

* Re: [PATCH] mdadm: Linux 3.x version change
From: David Brown @ 2011-06-17 14:07 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <1307716200-27483-1-git-send-email-namhyung@gmail.com>

On 10/06/2011 16:30, Namhyung Kim wrote:
> As Linux 3.x changes its versioning scheme, we have to deal with
> the 2-digit version number also.
>


I don't know what mdadm uses the version number for, but Linus Torvalds 
has had a rant about this sort of thing:

<https://lkml.org/lkml/2011/6/14/293>



^ permalink raw reply

* Re: [PATCH/RFC] Fix resync hang after surprise removal
From: James Paradis @ 2011-06-17 15:42 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid
In-Reply-To: <20110616113656.190fef9f@notabene.brown>


> NeilBrown <neilb@suse.de> wrote:
> Hi,
> thanks for the report and the patch.
> 
> However I don't think the patch really does what you want.
> 
> The two tests are already mutually exclusive as one begins with
> raid_disk >= 0
> and the other with
> raid_disk < 0
> and neither change raid_disk.
> 
> The reason the patch has an effect is the 'break' that has been added.
> i.e. as soon as you find a normal working device you break out of the loop
> and stop looking for spares.
> 
> I think the correct fix is simply:
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4332fc2..91e31e2 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7088,6 +7088,7 @@ static int remove_and_add_spares(mddev_t *mddev)
> list_for_each_entry(rdev, &mddev->disks, same_set) {
> if (rdev->raid_disk >= 0 &&
> !test_bit(In_sync, &rdev->flags) &&
> + !test_bit(Faulty, &rdev->flags) &&
> !test_bit(Blocked, &rdev->flags))
> spares++;
> if (rdev->raid_disk < 0
> 
> 
> i.e. never consider a Faulty device to be a spare.
> 
> It looks like this bug was introduced by commit dfc70645000616777
> in 2.6.26 when we allowed partially recovered devices to remain in the array
> when a different device fails.
> 
> Can you please confirm that this patch removes your symptom?
> 
> Thanks,
> NeilBrown

This patch does indeed fix the problem!  Thanks!

--jim
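
For anyone following along without the kernel source, the condition in the
quoted fix can be modelled in user space roughly as follows.  This is a
sketch under stated assumptions, not the md code: fake_rdev, the flag
constants and count_recovering() are made up for illustration, and plain
bit masks replace test_bit().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for In_sync, Faulty and Blocked. */
enum {
	IN_SYNC = 1 << 0,
	FAULTY  = 1 << 1,
	BLOCKED = 1 << 2
};

/* Hypothetical, simplified stand-in for the kernel's rdev. */
struct fake_rdev {
	int raid_disk;		/* slot in the array, or -1 if unassigned */
	unsigned int flags;
};

/*
 * Count devices that hold a slot but are still recovering.  With the
 * fix, a Faulty device is never counted, so a failed member no longer
 * looks like a spare that needs recovery.
 */
int count_recovering(const struct fake_rdev *devs, size_t n)
{
	size_t i;
	int spares = 0;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (devs[i].raid_disk >= 0 &&
		    !(devs[i].flags & IN_SYNC) &&
		    !(devs[i].flags & FAULTY) &&
		    !(devs[i].flags & BLOCKED))
			spares++;
	return spares;
}
```

Without the FAULTY test, a failed device that still occupies a slot would
be counted here.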


^ permalink raw reply
