linux-raid.vger.kernel.org archive mirror
* raid1 oops, 2.6.16
@ 2006-08-05 14:04 Jason Lunz
  2006-08-18 17:05 ` Paul Clements
  0 siblings, 1 reply; 3+ messages in thread
From: Jason Lunz @ 2006-08-05 14:04 UTC (permalink / raw)
  To: linux-raid

I just had a disk die in a 2.6.16 (debian kernel) raid1 server, and it's
triggered an oops in raid1.

There are a bunch of 2-partition mirrors:

Personalities : [raid1]
md5 : active raid1 hdc7[2](F) hda7[1]
      77625984 blocks [2/1] [_U]

md4 : active raid1 hdc6[2](F) hda6[1]
      16000640 blocks [2/1] [_U]

md3 : active raid1 hdc5[2](F) hda5[1]
      12570752 blocks [2/1] [_U]

md2 : active raid1 hdc3[2](F) hda3[1]
      8000256 blocks [2/1] [_U]

md1 : active raid1 hdc2[2](F) hda2[1]
      2000000 blocks [2/1] [_U]

md0 : active raid1 hdc1[2](F) hda1[1]
      995904 blocks [2/1] [_U]

unused devices: <none>


Amidst all the failure messages for these arrays in dmesg, I have this:


RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:hda1
Unable to handle kernel NULL pointer dereference at virtual address 00000088
 printing eip:
f0831ea8
*pde = 00000000
Oops: 0002 [#1]
Modules linked in: thermal fan button processor ac battery e1000 rtc ext3 jbd mbcache raid1 md_mod ide_disk generic siimage ide_core evdev mousedev
CPU:    0
EIP:    0060:[<f0831ea8>]    Not tainted VLI
EFLAGS: 00010246   (2.6.16-2-686 #1)
EIP is at raid1d+0x2c8/0x4c3 [raid1]
eax: 00000008   ebx: 00000000   ecx: c9c60100   edx: b1a1ac60
esi: 00000000   edi: dd67c6c0   ebp: efd52740   esp: b1ac1f08
ds: 007b   es: 007b   ss: 0068
Process md5_raid1 (pid: 1001, threadinfo=b1ac0000 task=b1bc5a70)
Stack: <0>000000b1 5e341300 003e0387 00000001 00000001 00000008 00000008 025e1458
       00000000 b1bc5b98 00000001 efd5275c b1ac1fa4 7fffffff b026e32b 00000005
       b1ac0000 b1ac1f84 7fffffff 00000000 b1a1aba0 b1ac1f84 b1ac1fa4 7fffffff
Call Trace:
 [<b026e32b>] schedule_timeout+0x13/0x8e
 [<f0864095>] md_thread+0xe3/0xfb [md_mod]
 [<b012522e>] autoremove_wake_function+0x0/0x3a
 [<b026dced>] schedule+0x45f/0x4cd
 [<b012522e>] autoremove_wake_function+0x0/0x3a
 [<f0863fb2>] md_thread+0x0/0xfb [md_mod]
 [<b0124efe>] kthread+0x79/0xa3
 [<b0124e85>] kthread+0x0/0xa3
 [<b01012cd>] kernel_thread_helper+0x5/0xb
Code: 83 7c 24 10 00 8b 47 20 0f 84 dc 00 00 00 89 74 24 0c 39 c6 74 63 85 f6 75 03 8b 75 08 4e 8b 55 04 6b c6 0c 8b 1c 02 8b 44 24 14 <01> 83 88 00 00 00 85 db 74 3f 8b 43 70 a8 04 74 38 6a 01 ff 75
 end_request: I/O error, dev hdc, sector 26464418


The server is still running, but processes (like sync(1)) are getting hung in D
state.

Jason



* Re: raid1 oops, 2.6.16
  2006-08-05 14:04 raid1 oops, 2.6.16 Jason Lunz
@ 2006-08-18 17:05 ` Paul Clements
  2006-08-20 23:42   ` Neil Brown
  0 siblings, 1 reply; 3+ messages in thread
From: Paul Clements @ 2006-08-18 17:05 UTC (permalink / raw)
  To: Jason Lunz, neilb; +Cc: linux-raid

Jason Lunz wrote:
> I just had a disk die in a 2.6.16 (debian kernel) raid1 server, and it's
> triggered an oops in raid1.

We just saw this problem as well, on SUSE 2.6.16.21-0.8. The problem 
code still exists in at least 2.6.18-rc1, and I haven't seen any recent 
patches that would have fixed it, so I believe this is still a live 
bug. Details and a possible fix below.


> RAID1 conf printout:
>  --- wd:1 rd:2
>  disk 1, wo:0, o:1, dev:hda1
> Unable to handle kernel NULL pointer dereference at virtual address 00000088
>  printing eip:
> f0831ea8
> *pde = 00000000
> Oops: 0002 [#1]
> Modules linked in: thermal fan button processor ac battery e1000 rtc ext3 jbd mbcache raid1 md_mod ide_disk generic siimage ide_core evdev mousedev
> CPU:    0
> EIP:    0060:[<f0831ea8>]    Not tainted VLI
> EFLAGS: 00010246   (2.6.16-2-686 #1)
> EIP is at raid1d+0x2c8/0x4c3 [raid1]
> eax: 00000008   ebx: 00000000   ecx: c9c60100   edx: b1a1ac60
> esi: 00000000   edi: dd67c6c0   ebp: efd52740   esp: b1ac1f08
> ds: 007b   es: 007b   ss: 0068
> Process md5_raid1 (pid: 1001, threadinfo=b1ac0000 task=b1bc5a70)
> Stack: <0>000000b1 5e341300 003e0387 00000001 00000001 00000008 00000008 025e1458
>        00000000 b1bc5b98 00000001 efd5275c b1ac1fa4 7fffffff b026e32b 00000005
>        b1ac0000 b1ac1f84 7fffffff 00000000 b1a1aba0 b1ac1f84 b1ac1fa4 7fffffff
> Call Trace:
>  [<b026e32b>] schedule_timeout+0x13/0x8e
>  [<f0864095>] md_thread+0xe3/0xfb [md_mod]
>  [<b012522e>] autoremove_wake_function+0x0/0x3a
>  [<b026dced>] schedule+0x45f/0x4cd
>  [<b012522e>] autoremove_wake_function+0x0/0x3a
>  [<f0863fb2>] md_thread+0x0/0xfb [md_mod]
>  [<b0124efe>] kthread+0x79/0xa3
>  [<b0124e85>] kthread+0x0/0xa3
>  [<b01012cd>] kernel_thread_helper+0x5/0xb
> Code: 83 7c 24 10 00 8b 47 20 0f 84 dc 00 00 00 89 74 24 0c 39 c6 74 63 85 f6 75 03 8b 75 08 4e 8b 55 04 6b c6 0c 8b 1c 02 8b 44 24 14 <01> 83 88 00 00 00 85 db 74 3f 8b 43 70 a8 04 74 38 6a 01 ff 75
>  end_request: I/O error, dev hdc, sector 26464418



We had one of our Fibre Channel arrays hang, and I/O started failing:

Aug 17 23:14:44 tristan kernel: raid1: cciss/c1d1p1: rescheduling sector 466400
Aug 17 23:14:44 tristan kernel: cciss: cmd df340000 timedout
Aug 17 23:14:44 tristan kernel: raid1: cciss/c1d2p1: rescheduling sector 1264808
Aug 17 23:14:45 tristan kernel: cciss: cmd df340250 timedout
Aug 17 23:14:45 tristan kernel: raid1: cciss/c1d3p1: rescheduling sector 1459712
Aug 17 23:14:45 tristan kernel: cciss: cmd df3404a0 timedout
Aug 17 23:14:45 tristan kernel: raid1: cciss/c1d4p1: rescheduling sector 1378648
Aug 17 23:14:46 tristan kernel: cciss: cmd df340b90 timedout
Aug 17 23:14:46 tristan kernel: raid1: Disk failure on cciss/c1d2p1, disabling device.
Aug 17 23:14:46 tristan kernel: 	Operation continuing on 1 devices
Aug 17 23:14:46 tristan kernel: RAID1 conf printout:
Aug 17 23:14:46 tristan kernel:  --- wd:1 rd:2
Aug 17 23:14:46 tristan kernel:  disk 0, wo:0, o:1, dev:sdc1
Aug 17 23:14:46 tristan kernel:  disk 1, wo:1, o:0, dev:cciss/c1d2p1
Aug 17 23:14:46 tristan kernel: RAID1 conf printout:
Aug 17 23:14:46 tristan kernel:  --- wd:1 rd:2
Aug 17 23:14:46 tristan kernel:  disk 0, wo:0, o:1, dev:sdc1

^^^^ The disk got kicked from the array.

Aug 17 23:14:46 tristan kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000088
Aug 17 23:14:46 tristan kernel:  printing eip:
Aug 17 23:14:46 tristan kernel: e0c35105
Aug 17 23:14:46 tristan kernel: *pde = 00000000
Aug 17 23:14:46 tristan kernel: Oops: 0002 [#1]
Aug 17 23:14:46 tristan kernel: SMP
Aug 17 23:14:46 tristan kernel: last sysfs file: /block/sdd/sdd1/size
Aug 17 23:14:46 tristan kernel: Modules linked in: ext3 jbd raid1 ipv6 dm_multipath button battery ac apparmor aamatch_pcre loop dm_mod sworks_agp agpgart ide_cd cpqphp cdrom i2c_piix4 i2c_core tg3 pci_hotplug reiserfs edd fan thermal processor sg qla2xxx firmware_class scsi_transport_fc cciss serverworks sd_mod scsi_mod ide_disk ide_core
Aug 17 23:14:46 tristan kernel: CPU:    0
Aug 17 23:14:47 tristan kernel: EIP:    0060:[<e0c35105>]    Tainted: GF     U VLI
Aug 17 23:14:47 tristan kernel: EFLAGS: 00010286   (2.6.16.21-0.8-smp #1)
Aug 17 23:14:47 tristan kernel: EIP is at raid1d+0x8e0/0xb95 [raid1]
Aug 17 23:14:47 tristan kernel: eax: 00000008   ebx: 00000000   ecx: d06f0840   edx: d06f0840
Aug 17 23:14:47 tristan kernel: esi: 00000000   edi: 00000001   ebp: 00001000   esp: cf095ed4
Aug 17 23:14:47 tristan kernel: ds: 007b   es: 007b   ss: 0068
Aug 17 23:14:47 tristan kernel: Process md1_raid1 (pid: 9332, threadinfo=cf094000 task=de0d3050)
Aug 17 23:14:47 tristan kernel: Stack: <0>dfd00cd0 c1411b80 df067a00 c48d1240 d06f0840 d06f0860 00000001 00134ca8
Aug 17 23:14:47 tristan kernel:        00000000 00000060 00000008 00000000 c0340b08 0000000a de0d318c de0d3050
Aug 17 23:14:47 tristan kernel:        dfd00af0 c1414da0 d4f08a9c 0000011d 00005fea 00000002 c0340b28 0000000a
Aug 17 23:14:47 tristan kernel: Call Trace:
Aug 17 23:14:47 tristan kernel:  [<c011a067>] default_wake_function+0x0/0xc
Aug 17 23:14:47 tristan kernel:  [<c028fa0f>] schedule_timeout+0x13/0x8e
Aug 17 23:14:47 tristan kernel:  [<c022fc7a>] md_thread+0xed/0x104
Aug 17 23:14:47 tristan kernel:  [<c0130977>] autoremove_wake_function+0x0/0x2d
Aug 17 23:14:47 tristan kernel:  [<c022fb8d>] md_thread+0x0/0x104
Aug 17 23:14:47 tristan kernel:  [<c0130895>] kthread+0x9d/0xc9
Aug 17 23:14:47 tristan kernel:  [<c01307f8>] kthread+0x0/0xc9
Aug 17 23:14:47 tristan kernel:  [<c0102005>] kernel_thread_helper+0x5/0xb
Aug 17 23:14:47 tristan kernel: Code: 8b 44 24 0c 3b 70 20 0f 84 d5 00 00 00 eb 95 85 ff 75 07 8b 54 24 10 8b 7a 08 8b 4c 24 10 4f 6b c7 0c 03 41 04 8b 18 8b 44 24 28 <f0> 01 83 88 00 00 00 85 db 74 41 8b 43 70 a8 04 74 3a 8b 53 18


From the objdump of raid1.ko for this kernel, the oops occurred here (at raid1d+0x8e0):

     20eb:       75 07                   jne    20f4 <raid1d+0x8cf>
     20ed:       8b 54 24 10             mov    0x10(%esp),%edx
     20f1:       8b 7a 08                mov    0x8(%edx),%edi
     20f4:       8b 4c 24 10             mov    0x10(%esp),%ecx
     20f8:       4f                      dec    %edi
     20f9:       6b c7 0c                imul   $0xc,%edi,%eax
     20fc:       03 41 04                add    0x4(%ecx),%eax
     20ff:       8b 18                   mov    (%eax),%ebx
     2101:       8b 44 24 28             mov    0x28(%esp),%eax
     2105:       f0 01 83 88 00 00 00    lock add %eax,0x88(%ebx)
^^^^^^^^
     210c:       85 db                   test   %ebx,%ebx
     210e:       74 41                   je     2151 <raid1d+0x92c>
     2110:       8b 43 70                mov    0x70(%ebx),%eax
     2113:       a8 04                   test   $0x4,%al
     2115:       74 3a                   je     2151 <raid1d+0x92c>
     2117:       8b 53 18                mov    0x18(%ebx),%edx
     211a:       8b 4c 24 20             mov    0x20(%esp),%ecx
     211e:       89 54 24 04             mov    %edx,0x4(%esp)

The "lock add" is an atomic_add, of which there is exactly one in raid1d 
(line 1491, raid1.c):

	if (success) {
		/* write it back and re-read */
		int start = d;
		while (d != r1_bio->read_disk) {
			if (d==0)
				d = conf->raid_disks;
			d--;
			rdev = conf->mirrors[d].rdev;
			atomic_add(s, &rdev->corrected_errors);
			if (rdev &&
			    test_bit(In_sync, &rdev->flags)) {
				if (sync_page_io(rdev->bdev,
						 sect + rdev->data_offset,
						 s<<9, conf->tmppage,
						 WRITE) == 0)
					/* Well, this device is dead */
					md_error(mddev, rdev);
			}
		}
	}


This is where raid1d is retrying read errors, rewriting the data using 
data it read from a working disk in the array, which matches what was 
going on at the time on this system. The offset of 0x88 (the virtual 
address of the NULL pointer dereference) matches the offset of 
corrected_errors within the rdev structure. This makes sense: md had 
just kicked the failing drive from the array (each drive has an rdev 
structure), leaving conf->mirrors[d].rdev NULL, so the atomic_add 
dereferenced NULL + 0x88.
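
To make the address arithmetic concrete, here is a minimal userspace 
sketch (not kernel code; the stand-in struct and its 0x88 bytes of 
padding are assumptions chosen only to mirror the layout implied by 
the oops):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for mdk_rdev_t: on this kernel/arch the real
 * structure evidently places corrected_errors at offset 0x88. */
struct fake_rdev {
	char earlier_fields[0x88];	/* padding standing in for the real fields */
	int corrected_errors;
};

int main(void)
{
	/* With rdev == NULL, &rdev->corrected_errors evaluates to
	 * NULL + offsetof(...), i.e. the 00000088 in the oops above. */
	printf("fault address: 0x%08zx\n",
	       offsetof(struct fake_rdev, corrected_errors));
	return 0;
}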

The atomic_add (and any other access to an rdev, for that matter) needs 
to be guarded by a NULL check, I think.
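
For illustration, a sketch of the guard I have in mind (untested, and 
only a fragment of the loop above):

	rdev = conf->mirrors[d].rdev;
	if (rdev &&
	    test_bit(In_sync, &rdev->flags)) {
		/* only touch rdev once we know the disk is still there */
		atomic_add(s, &rdev->corrected_errors);
		if (sync_page_io(rdev->bdev,
				 sect + rdev->data_offset,
				 s<<9, conf->tmppage, WRITE) == 0)
			/* Well, this device is dead */
			md_error(mddev, rdev);
	}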

Neil, does that sound right?

-- 
Paul


* Re: raid1 oops, 2.6.16
  2006-08-18 17:05 ` Paul Clements
@ 2006-08-20 23:42   ` Neil Brown
  0 siblings, 0 replies; 3+ messages in thread
From: Neil Brown @ 2006-08-20 23:42 UTC (permalink / raw)
  To: Paul Clements; +Cc: Jason Lunz, linux-raid

On Friday August 18, paul.clements@steeleye.com wrote:
> 
> The atomic_add (and any other access to an rdev, for that matter) needs 
> to be guarded by a NULL check, I think.
> 
> Neil, does that sound right?

Yes.  Thanks for doing the analysis.

However, after looking closely through the code, I think the atomic_add
really is in the wrong place.
We are increasing 'corrected_errors' before we try to write out the
good data.  We should really wait until after reading back the good
data.  So rather than just protecting the atomic_add with "if (rdev)",
I'm going to move it several lines down.

And while I'm at it, there seem to be several places that reference
->rdev that could be cleaned up.

So I'm thinking of something like the following, against -mm.

I'll make a smaller patch for -stable.

Thanks,
NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |   55 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 34 insertions(+), 21 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2006-08-15 13:33:02.000000000 +1000
+++ ./drivers/md/raid1.c	2006-08-21 09:40:41.000000000 +1000
@@ -930,10 +930,13 @@ static void status(struct seq_file *seq,
 
 	seq_printf(seq, " [%d/%d] [", conf->raid_disks,
 		   conf->raid_disks - mddev->degraded);
-	for (i = 0; i < conf->raid_disks; i++)
+	rcu_read_lock();
+	for (i = 0; i < conf->raid_disks; i++) {
+		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		seq_printf(seq, "%s",
-			      conf->mirrors[i].rdev &&
-			      test_bit(In_sync, &conf->mirrors[i].rdev->flags) ? "U" : "_");
+			   rdev && test_bit(In_sync, &rdev->flags) ? "U" : "_");
+	}
+	rcu_read_unlock();
 	seq_printf(seq, "]");
 }
 
@@ -976,7 +979,6 @@ static void error(mddev_t *mddev, mdk_rd
 static void print_conf(conf_t *conf)
 {
 	int i;
-	mirror_info_t *tmp;
 
 	printk("RAID1 conf printout:\n");
 	if (!conf) {
@@ -986,14 +988,17 @@ static void print_conf(conf_t *conf)
 	printk(" --- wd:%d rd:%d\n", conf->raid_disks - conf->mddev->degraded,
 		conf->raid_disks);
 
+	rcu_read_lock();
 	for (i = 0; i < conf->raid_disks; i++) {
 		char b[BDEVNAME_SIZE];
-		tmp = conf->mirrors + i;
-		if (tmp->rdev)
+		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
+		if (rdev)
 			printk(" disk %d, wo:%d, o:%d, dev:%s\n",
-				i, !test_bit(In_sync, &tmp->rdev->flags), !test_bit(Faulty, &tmp->rdev->flags),
-				bdevname(tmp->rdev->bdev,b));
+			       i, !test_bit(In_sync, &rdev->flags),
+			       !test_bit(Faulty, &rdev->flags),
+			       bdevname(rdev->bdev,b));
 	}
+	rcu_read_unlock();
 }
 
 static void close_sync(conf_t *conf)
@@ -1009,17 +1014,17 @@ static int raid1_spare_active(mddev_t *m
 {
 	int i;
 	conf_t *conf = mddev->private;
-	mirror_info_t *tmp;
 
 	/*
 	 * Find all failed disks within the RAID1 configuration 
-	 * and mark them readable
+	 * and mark them readable.
+	 * Called under mddev lock, so rcu protection not needed.
 	 */
 	for (i = 0; i < conf->raid_disks; i++) {
-		tmp = conf->mirrors + i;
-		if (tmp->rdev 
-		    && !test_bit(Faulty, &tmp->rdev->flags)
-		    && !test_and_set_bit(In_sync, &tmp->rdev->flags)) {
+		mdk_rdev_t *rdev = conf->mirrors[i].rdev;
+		if (rdev
+		    && !test_bit(Faulty, &rdev->flags)
+		    && !test_and_set_bit(In_sync, &rdev->flags)) {
 			unsigned long flags;
 			spin_lock_irqsave(&conf->device_lock, flags);
 			mddev->degraded--;
@@ -1239,7 +1244,7 @@ static void sync_request_write(mddev_t *
 		/* ouch - failed to read all of that.
 		 * Try some synchronous reads of other devices to get
 		 * good data, much like with normal read errors.  Only
-		 * read into the pages we already have so they we don't
+		 * read into the pages we already have so we don't
 		 * need to re-issue the read request.
 		 * We don't need to freeze the array, because being in an
 		 * active sync request, there is no normal IO, and
@@ -1259,6 +1264,10 @@ static void sync_request_write(mddev_t *
 				s = PAGE_SIZE >> 9;
 			do {
 				if (r1_bio->bios[d]->bi_end_io == end_sync_read) {
+					/* No rcu protection needed here devices
+					 * can only be removed when no resync is 
+					 * active, and resync is currently active
+					 */
 					rdev = conf->mirrors[d].rdev;
 					if (sync_page_io(rdev->bdev,
 							 sect + rdev->data_offset,
@@ -1376,6 +1385,11 @@ static void fix_read_error(conf_t *conf,
 			s = PAGE_SIZE >> 9;
 
 		do {
+			/* Note: no rcu protection needed here
+			 * as this is synchronous in the raid1d thread
+			 * which is the thread that might remove
+			 * a device.  If raid1d ever becomes multi-threaded....
+			 */
 			rdev = conf->mirrors[d].rdev;
 			if (rdev &&
 			    test_bit(In_sync, &rdev->flags) &&
@@ -1403,7 +1417,6 @@ static void fix_read_error(conf_t *conf,
 				d = conf->raid_disks;
 			d--;
 			rdev = conf->mirrors[d].rdev;
-			atomic_add(s, &rdev->corrected_errors);
 			if (rdev &&
 			    test_bit(In_sync, &rdev->flags)) {
 				if (sync_page_io(rdev->bdev,
@@ -1429,7 +1442,8 @@ static void fix_read_error(conf_t *conf,
 				    == 0)
 					/* Well, this device is dead */
 					md_error(mddev, rdev);
-				else
+				else {
+					atomic_add(s, &rdev->corrected_errors);
 					printk(KERN_INFO
 					       "raid1:%s: read error corrected "
 					       "(%d sectors at %llu on %s)\n",
@@ -1437,6 +1451,7 @@ static void fix_read_error(conf_t *conf,
 					       (unsigned long long)sect +
 					           rdev->data_offset,
 					       bdevname(rdev->bdev, b));
+				}
 			}
 		}
 		sectors -= s;
@@ -1806,19 +1821,17 @@ static sector_t sync_request(mddev_t *md
 		for (i=0; i<conf->raid_disks; i++) {
 			bio = r1_bio->bios[i];
 			if (bio->bi_end_io == end_sync_read) {
-				md_sync_acct(conf->mirrors[i].rdev->bdev, nr_sectors);
+				md_sync_acct(bio->bi_bdev, nr_sectors);
 				generic_make_request(bio);
 			}
 		}
 	} else {
 		atomic_set(&r1_bio->remaining, 1);
 		bio = r1_bio->bios[r1_bio->read_disk];
-		md_sync_acct(conf->mirrors[r1_bio->read_disk].rdev->bdev,
-			     nr_sectors);
+		md_sync_acct(bio->bi_bdev, nr_sectors);
 		generic_make_request(bio);
 
 	}
-
 	return nr_sectors;
 }
 

