* RE: OOPS in do_try_to_free_pages with VERY large software RAID array
@ 2003-03-11 19:38 Rechenberg, Andrew
  2003-03-11 19:39 ` Martin J. Bligh
  0 siblings, 1 reply; 11+ messages in thread
From: Rechenberg, Andrew @ 2003-03-11 19:38 UTC
  To: Martin J. Bligh, linux-kernel; +Cc: Kevin P. Fleming

Thanks for the help Martin.  It looks like that was the problem.  The
kernel mdstat statistics must have been overwriting some other kernel
memory and giving me my panics.  With the help of Kevin's 2.5 patch I
patched the Red Hat 2.4.18-26 md code to use seq_file and now my big
RAID arrays are syncing and I haven't had a panic yet :D

Thanks again to everyone for the help.  I'll submit the patch to the
linux-raid list as md-seq_file-2.4.18-26.7.x.patch if anyone's
interested.

Regards,
Andy.

-----Original Message-----
From: Martin J. Bligh [mailto:mbligh@aracnet.com] 
Sent: Monday, March 10, 2003 2:28 PM
To: Rechenberg, Andrew; linux-kernel@vger.kernel.org
Subject: Re: OOPS in do_try_to_free_pages with VERY large software RAID
array


> Can anyone help me out please?  I'm trying to create a monster 
> software RAID array and the kernel is not behaving.  On some test 
> hardware I can get 17 RAID1 arrays to begin syncing and will sync with
> /proc/sys/dev/raid/speed_limit_max set to 100000 (the max allowed)
> with no problem.
> 
> We wanted to use 26 RAID1 arrays and then stripe across them to get 
> very high performance.  When I tried to do that this weekend on our 
> production box we started getting kernel panics when the RAID1 arrays 
> started syncing.  This was with speed_limit_max set to 10000 so the 
> rate wasn't very high.  Since we knew 34 disks worked we decided to 
> put the box in to production with just 13 RAID1 arrays and striping 
> across those.  The performance is great compared to our hardware RAID,
> but I would like to get all the disks we purchased for this system
> working.
> 
> This morning I connected 56 disks to our test hardware and tried to 
> reproduce the problem.  With the test hardware, the 26 RAID1 arrays 
> were working OK at speed_limit_max 10000 however the kernel OOPSed 
> when I 'less'ed /proc/mdstat.  It wasn't a hard crash because I could 
> still work.  However when I upped the speed_limit_max to 30000 there 
> was a hard crash.

At a wild guess (OK, I only looked for about 1 minute),
md_status_read_proc is generating more than 4K of information, and
overwriting the end of its 4K page. Throw some debug in there, and get
it to printk how much of the buffer it thinks it's using (just printk sz
every time it changes it). If it's > 4K, convert it to the seq_file
interface.

May not be it, but it seems likely given the unusual scale of what
you're doing, and it's easy to check.

M.




* RE: OOPS in do_try_to_free_pages with VERY large software RAID array
@ 2003-03-11 20:18 Rechenberg, Andrew
  0 siblings, 0 replies; 11+ messages in thread
From: Rechenberg, Andrew @ 2003-03-11 20:18 UTC
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 11158 bytes --]

With the help of Martin Bligh, Kevin Fleming, and Randy Dunlap, it looks
like this problem is caused by the output of /proc/mdstat growing larger
than the single 4K page it is written into, so it overflows that page.

With Kevin's patch from last week and some help from Randy, I patched
the md code in Red Hat 2.4.18-26.7.x to use the seq_file interface for
mdstat.  I've attached the patch.  As with Kevin's patch, it touches
almost everything in drivers/md, as well as adding the necessary methods
to fs/seq_file.c and include/linux/seq_file.h.

I'm currently testing raid1 and raid0 and it seems to work well.  No
panics yet!!! :)  I currently have 26 RAID1 arrays and a big RAID0
stripe across that and I'm running some I/O tests on it now to make sure
that it is stable.  

I haven't tested the raid5, linear, or multipath code, so someone might
want to test that out before using it in production :)

As Kevin indicated in his mail, I can post the patch to a web site if
attachments are a problem.

Thanks to everyone for their help.

Regards,
Andy.


-----Original Message-----
From: Rechenberg, Andrew 
Sent: Monday, March 10, 2003 2:19 PM
To: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org
Subject: OOPS in do_try_to_free_pages with VERY large software RAID
array


Good day.

Can anyone help me out please?  I'm trying to create a monster software
RAID array and the kernel is not behaving.  On some test hardware I can
get 17 RAID1 arrays to begin syncing and will sync with
/proc/sys/dev/raid/speed_limit_max set to 100000 (the max allowed) with
no problem.  

We wanted to use 26 RAID1 arrays and then stripe across them to get very
high performance.  When I tried to do that this weekend on our
production box we started getting kernel panics when the RAID1 arrays
started syncing.  This was with speed_limit_max set to 10000 so the rate
wasn't very high.  Since we knew 34 disks worked we decided to put the
box in to production with just 13 RAID1 arrays and striping across
those.  The performance is great compared to our hardware RAID, but I
would like to get all the disks we purchased for this system working.

This morning I connected 56 disks to our test hardware and tried to
reproduce the problem.  With the test hardware, the 26 RAID1 arrays were
working OK at speed_limit_max 10000 however the kernel OOPSed when I
'less'ed /proc/mdstat.  It wasn't a hard crash because I could still
work.  However when I upped the speed_limit_max to 30000 there was a
hard crash.

I've tried disabling Hyper Threading on these boxes with the 'noht'
kernel boot parameter, but that didn't seem to help.  A lot of what's on
Google points to bad hardware, but I don't think this problem is
hardware-related.  The kernels are stock Red Hat source and have
CONFIG_SD_EXTRA_DEVS set to 64 and have the megaraid2 module patches.

The output from ksymoops for both OOPSes is below along with the
production and test hardware specs and kernel versions.  If anyone can
aid me, please let me know.  This is test hardware so I am free to try
kernel patches or anything else.  If more information is needed please
let me know.  I am subscribed to the RAID list, but not to the LKML so
please CC: me with responses.

Thank you so much for your assistance,
Andy.

Regards,
Andrew Rechenberg
Infrastructure Team, Sherman Financial Group


Production Hardware
--------------------
Dell PE6600
4x1.4GHz Xeon with HT
8GB RAM
2.4.18-19.7.xbigmem-SHR
ProductionHW Modules:
---------------------
Module                  Size  Used by    Not tainted
lp                      8672   0  (autoclean)
parport                35648   0  (autoclean) [lp]
autofs                 11620   0  (autoclean) (unused)
tg3                    47200   1
raid0                   4128   1  (autoclean)
raid1                  15556  13  (autoclean)
loop                   11184   0  (autoclean)
lvm-mod                64608   3
ext3                   67360   7
jbd                    51464   7  [ext3]
aic7xxx               153664  28
megaraid2              37920   7
sd_mod                 12800  70
scsi_mod              110352   3  [aic7xxx megaraid2 sd_mod]

Test Hardware
-------------
Dell PE4600
2x2.4GHz Xeon with HT
4GB RAM
2.4.18-24.7.xbigmem-SHR (includes megaraid2 module)
TestHW Modules
---------------
Module                  Size  Used by    Not tainted
e100                   58500   1
raid1                  15556   0  (autoclean)
loop                   11184   0  (autoclean)
sr_mod                 16088   0  (autoclean) (unused)
cdrom                  32416   0  (autoclean) [sr_mod]
usb-ohci               21856   0  (unused)
usbcore                74400   1  [usb-ohci]
lvm-mod                64608   0
aic7xxx               129856   3
sd_mod                 12832   6
scsi_mod              110800   3  [sr_mod aic7xxx sd_mod]

[root@cinshrinft1 ~]# ksymoops -k /proc/ksyms /tmp/raidoops
ksymoops 2.4.4 on i686 2.4.18-24.7.xbigmem-SHR.  Options used
     -V (default)
     -k /proc/ksyms (specified)
     -l /proc/modules (default)
     -o /lib/modules/2.4.18-24.7.xbigmem-SHR/ (default)
     -m /boot/System.map-2.4.18-24.7.xbigmem-SHR (default)

Error (expand_objects): cannot stat(/lib/lvm-mod.o) for lvm-mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/aic7xxx.o) for aic7xxx
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
ksymoops: No such file or directory
OOPS: 0000
CPU: 3
EIP: 0010 [<c0138320>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010883
eax: 5d305b21  ebx: 000000a3  ecx: 00000006  edx: c5e8a090
esi: c5e8a080  edi: 00000000  ebp: c5e5e2b4  esp: f7ffdf64
ds: 0018   es: 0018   ss: 0018
Call Trace: [<c013aa8e>] do_try_to_free_pages [kernel] 0x3e (0xf7ffdf98))
[<c013add1>] kswapd [kernel] 0x141 (0xf7ffdfd4))
[<c0105000>] stext [kernel] 0x0 (0xf7ffdfe8))
[<c01072a6>] kernel_thread [kernel] 0x26 (0xf7ffdff0))
[<c013ac90>] kswapd [kernel] 0x0 (0xf7ffdff8))
Code: 8b 00 43 39 d0 75 f9 8b 4e 3c 89 da 8b 7e 4c d3 e2 85 ff 74

>>EIP; c0138320 <kmem_cache_reap+1d0/340>   <=====
Trace; c013aa8e <do_try_to_free_pages+3e/1b0>
Trace; c013add1 <kswapd+141/390>
Trace; c0105000 <_stext+0/0>
Trace; c01072a6 <kernel_thread+26/30>
Trace; c013ac90 <kswapd+0/390>
Code;  c0138320 <kmem_cache_reap+1d0/340>
00000000 <_EIP>:
Code;  c0138320 <kmem_cache_reap+1d0/340>   <=====
   0:   8b 00                     mov    (%eax),%eax   <=====
Code;  c0138322 <kmem_cache_reap+1d2/340>
   2:   43                        inc    %ebx
Code;  c0138323 <kmem_cache_reap+1d3/340>
   3:   39 d0                     cmp    %edx,%eax
Code;  c0138325 <kmem_cache_reap+1d5/340>
   5:   75 f9                     jne    0 <_EIP>
Code;  c0138327 <kmem_cache_reap+1d7/340>
   7:   8b 4e 3c                  mov    0x3c(%esi),%ecx
Code;  c013832a <kmem_cache_reap+1da/340>
   a:   89 da                     mov    %ebx,%edx
Code;  c013832c <kmem_cache_reap+1dc/340>
   c:   8b 7e 4c                  mov    0x4c(%esi),%edi
Code;  c013832f <kmem_cache_reap+1df/340>
   f:   d3 e2                     shl    %cl,%edx
Code;  c0138331 <kmem_cache_reap+1e1/340>
  11:   85 ff                     test   %edi,%edi
Code;  c0138333 <kmem_cache_reap+1e3/340>
  13:   74 00                     je     15 <_EIP+0x15> c0138335 <kmem_cache_reap+1e5/340>


4 errors issued.  Results may not be reliable.


[root@cinshrinft1 ~]# ksymoops -k /proc/ksyms /tmp/lessoops
ksymoops 2.4.4 on i686 2.4.18-24.7.xbigmem-SHR.  Options used
     -V (default)
     -k /proc/ksyms (specified)
     -l /proc/modules (default)
     -o /lib/modules/2.4.18-24.7.xbigmem-SHR/ (default)
     -m /boot/System.map-2.4.18-24.7.xbigmem-SHR (default)

Error (expand_objects): cannot stat(/lib/lvm-mod.o) for lvm-mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/aic7xxx.o) for aic7xxx
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
ksymoops: No such file or directory
Mar 10 09:56:03 cinshrinft1 kernel: Unable to handle kernel paging request at virtual address 0a5d306f
Mar 10 09:56:03 cinshrinft1 kernel: c0145084
Mar 10 09:56:03 cinshrinft1 kernel: *pde = 00000000
Mar 10 09:56:03 cinshrinft1 kernel: Oops: 0002
Mar 10 09:56:03 cinshrinft1 kernel: CPU:    3
Mar 10 09:56:03 cinshrinft1 kernel: EIP:    0010:[<c0145084>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Mar 10 09:56:03 cinshrinft1 kernel: EFLAGS: 00010202
Mar 10 09:56:03 cinshrinft1 kernel: eax: 0a5d305b   ebx: eba14c80   ecx: 00000001   edx: eba14c80
Mar 10 09:56:03 cinshrinft1 kernel: esi: 0805db40   edi: fffffff7   ebp: 000003eb   esp: eb7d3f88
Mar 10 09:56:03 cinshrinft1 kernel: ds: 0018   es: 0018   ss: 0018
Mar 10 09:56:03 cinshrinft1 kernel: Process less (pid: 6866, stackpage=eb7d3000)
Mar 10 09:56:03 cinshrinft1 kernel: Stack: eb7d2000 c01440f9 0000000c 00000003 c0116a29 00000001 0805db40 c0121
Mar 10 09:56:03 cinshrinft1 kernel:        eb7d3fac 3e6ca782 eb7d2000 0805db40 00000000 bfffe438 c0108c93 00000
Mar 10 09:56:03 cinshrinft1 kernel:        0805e540 000003eb 0805db40 00000000 bfffe438 00000004 0000002b 00000
Mar 10 09:56:03 cinshrinft1 kernel: Call Trace: [<c01440f9>] sys_write [kernel] 0x19 (0xeb7d3f8c))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0116a29>] smp_apic_timer_interrupt [kernel] 0xa9 (0xeb7d3f98))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0121e52>] sys_time [kernel] 0x12 (0xeb7d3fa4))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0108c93>] system_call [kernel] 0x33 (0xeb7d3fc0))
Mar 10 09:56:03 cinshrinft1 kernel: Code: f0 ff 40 14 f0 ff 43 04 5b c3 89 f6 8b 4c 24 04 f0 ff 49 14

>>EIP; c0145084 <fget+34/40>   <=====
Trace; c01440f9 <sys_write+19/110>
Trace; c0116a29 <smp_apic_timer_interrupt+a9/d0>
Trace; c0121e52 <sys_time+12/60>
Trace; c0108c93 <system_call+33/38>
Code;  c0145084 <fget+34/40>
00000000 <_EIP>:
Code;  c0145084 <fget+34/40>   <=====
   0:   f0 ff 40 14               lock incl 0x14(%eax)   <=====
Code;  c0145088 <fget+38/40>
   4:   f0 ff 43 04               lock incl 0x4(%ebx)
Code;  c014508c <fget+3c/40>
   8:   5b                        pop    %ebx
Code;  c014508d <fget+3d/40>
   9:   c3                        ret
Code;  c014508e <fget+3e/40>
   a:   89 f6                     mov    %esi,%esi
Code;  c0145090 <put_filp+0/50>
   c:   8b 4c 24 04               mov    0x4(%esp,1),%ecx
Code;  c0145094 <put_filp+4/50>
  10:   f0 ff 49 14               lock decl 0x14(%ecx)


4 errors issued.  Results may not be reliable.

[-- Attachment #2: md-seq_file-2.4.18-26.7.x.patch --]
[-- Type: application/octet-stream, Size: 14646 bytes --]

--- ../linux-2.4.18-26.7.x/drivers/md/linear.c	Sun Sep 30 15:26:06 2001
+++ ./drivers/md/linear.c	Tue Mar 11 11:09:04 2003
@@ -22,6 +22,7 @@
 #include <linux/slab.h>
 
 #include <linux/raid/linear.h>
+#include <linux/seq_file.h>
 
 #define MAJOR_NR MD_MAJOR
 #define MD_DRIVER
@@ -153,31 +154,29 @@
 	return 1;
 }
 
-static int linear_status (char *page, mddev_t *mddev)
+static void linear_status (struct seq_file *seq, mddev_t *mddev)
 {
-	int sz = 0;
 
 #undef MD_DEBUG
 #ifdef MD_DEBUG
 	int j;
 	linear_conf_t *conf = mddev_to_conf(mddev);
   
-	sz += sprintf(page+sz, "      ");
+	seq_printf(seq, "      ");
 	for (j = 0; j < conf->nr_zones; j++)
 	{
-		sz += sprintf(page+sz, "[%s",
+		seq_printf(seq, "[%s",
 			partition_name(conf->hash_table[j].dev0->dev));
 
 		if (conf->hash_table[j].dev1)
-			sz += sprintf(page+sz, "/%s] ",
+			seq_printf(seq, "/%s] ",
 			  partition_name(conf->hash_table[j].dev1->dev));
 		else
-			sz += sprintf(page+sz, "] ");
+			seq_printf(seq, "] ");
 	}
-	sz += sprintf(page+sz, "\n");
+	seq_printf(seq, "\n");
 #endif
-	sz += sprintf(page+sz, " %dk rounding", mddev->param.chunk_size/1024);
-	return sz;
+	seq_printf(seq, " %dk rounding", mddev->param.chunk_size/1024);
 }
 
 
--- ../linux-2.4.18-26.7.x/drivers/md/md.c	Mon Feb 24 09:15:39 2003
+++ ./drivers/md/md.c	Tue Mar 11 12:25:46 2003
@@ -36,6 +36,8 @@
 #include <linux/devfs_fs_kernel.h>
 
 #include <linux/init.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
 
 #ifdef CONFIG_KMOD
 #include <linux/kmod.h>
@@ -128,6 +130,15 @@
 	fops: &md_fops,
 };
 
+static int md_state_open_fs(struct inode *inode, struct file *file);
+
+static struct file_operations md_state_fops = {
+	open: md_state_open_fs,
+	read: seq_read,
+	llseek: seq_lseek,
+	release: single_release,
+};
+
 /*
  * Enables to iterate over all existing md arrays
  */
@@ -3070,13 +3081,13 @@
 	return 0;
 }
 
-static int status_unused(char * page)
+static void status_unused(struct seq_file * seq)
 {
-	int sz = 0, i = 0;
+	int i = 0;
 	mdk_rdev_t *rdev;
 	struct md_list_head *tmp;
 
-	sz += sprintf(page + sz, "unused devices: ");
+	seq_printf(seq, "unused devices: ");
 
 	ITERATE_RDEV_ALL(rdev,tmp) {
 		if (!rdev->same_set.next && !rdev->same_set.prev) {
@@ -3084,21 +3095,19 @@
 			 * The device is not yet used by any array.
 			 */
 			i++;
-			sz += sprintf(page + sz, "%s ",
+			seq_printf(seq, "%s ",
 				partition_name(rdev->dev));
 		}
 	}
 	if (!i)
-		sz += sprintf(page + sz, "<none>");
+		seq_printf(seq, "<none>");
 
-	sz += sprintf(page + sz, "\n");
-	return sz;
+	seq_printf(seq, "\n");
 }
 
 
-static int status_resync(char * page, mddev_t * mddev)
+static void status_resync(struct seq_file * seq, mddev_t * mddev)
 {
-	int sz = 0;
 	unsigned long max_blocks, resync, res, dt, db, rt;
 
 	resync = (mddev->curr_resync - atomic_read(&mddev->recovery_active))/2;
@@ -3109,30 +3118,29 @@
 	 */
 	if (!max_blocks) {
 		MD_BUG();
-		return 0;
 	}
 	res = (resync/1024)*1000/(max_blocks/1024 + 1);
 	{
 		int i, x = res/50, y = 20-x;
-		sz += sprintf(page + sz, "[");
+		seq_printf(seq, "[");
 		for (i = 0; i < x; i++)
-			sz += sprintf(page + sz, "=");
-		sz += sprintf(page + sz, ">");
+			seq_printf(seq, "=");
+		seq_printf(seq, ">");
 		for (i = 0; i < y; i++)
-			sz += sprintf(page + sz, ".");
-		sz += sprintf(page + sz, "] ");
+			seq_printf(seq, ".");
+		seq_printf(seq, "] ");
 	}
 	if (!mddev->recovery_running)
 		/*
 		 * true resync
 		 */
-		sz += sprintf(page + sz, " resync =%3lu.%lu%% (%lu/%lu)",
+		seq_printf(seq, " resync =%3lu.%lu%% (%lu/%lu)",
 				res/10, res % 10, resync, max_blocks);
 	else
 		/*
 		 * recovery ...
 		 */
-		sz += sprintf(page + sz, " recovery =%3lu.%lu%% (%lu/%lu)",
+		seq_printf(seq, " recovery =%3lu.%lu%% (%lu/%lu)",
 				res/10, res % 10, resync, max_blocks);
 
 	/*
@@ -3149,50 +3157,47 @@
 	db = resync - (mddev->resync_mark_cnt/2);
 	rt = (dt * ((max_blocks-resync) / (db/100+1)))/100;
 
-	sz += sprintf(page + sz, " finish=%lu.%lumin", rt / 60, (rt % 60)/6);
+	seq_printf(seq, " finish=%lu.%lumin", rt / 60, (rt % 60)/6);
 
-	sz += sprintf(page + sz, " speed=%ldK/sec", db/dt);
-
-	return sz;
+	seq_printf(seq, " speed=%ldK/sec", db/dt);
 }
 
-static int md_status_read_proc(char *page, char **start, off_t off,
-			int count, int *eof, void *data)
+static int md_state_seq_show(struct seq_file *seq, void *dummy)
 {
-	int sz = 0, j, size;
+	int j, size;
 	struct md_list_head *tmp, *tmp2;
 	mdk_rdev_t *rdev;
 	mddev_t *mddev;
 
-	sz += sprintf(page + sz, "Personalities : ");
+	seq_printf(seq, "Personalities : ");
 	for (j = 0; j < MAX_PERSONALITY; j++)
 	if (pers[j])
-		sz += sprintf(page+sz, "[%s] ", pers[j]->name);
+		seq_printf(seq, "[%s] ", pers[j]->name);
 
-	sz += sprintf(page+sz, "\n");
+	seq_printf(seq, "\n");
 
 
-	sz += sprintf(page+sz, "read_ahead ");
+	seq_printf(seq, "read_ahead ");
 	if (read_ahead[MD_MAJOR] == INT_MAX)
-		sz += sprintf(page+sz, "not set\n");
+		seq_printf(seq, "not set\n");
 	else
-		sz += sprintf(page+sz, "%d sectors\n", read_ahead[MD_MAJOR]);
+		seq_printf(seq, "%d sectors\n", read_ahead[MD_MAJOR]);
 
 	ITERATE_MDDEV(mddev,tmp) {
-		sz += sprintf(page + sz, "md%d : %sactive", mdidx(mddev),
+		seq_printf(seq, "md%d : %sactive", mdidx(mddev),
 						mddev->pers ? "" : "in");
 		if (mddev->pers) {
 			if (mddev->ro)
-				sz += sprintf(page + sz, " (read-only)");
-			sz += sprintf(page + sz, " %s", mddev->pers->name);
+				seq_printf(seq, " (read-only)");
+			seq_printf(seq, " %s", mddev->pers->name);
 		}
 
 		size = 0;
 		ITERATE_RDEV(mddev,rdev,tmp2) {
-			sz += sprintf(page + sz, " %s[%d]",
+			seq_printf(seq, " %s[%d]",
 				partition_name(rdev->dev), rdev->desc_nr);
 			if (rdev->faulty) {
-				sz += sprintf(page + sz, "(F)");
+				seq_printf(seq, "(F)");
 				continue;
 			}
 			size += rdev->size;
@@ -3200,33 +3205,40 @@
 
 		if (mddev->nb_dev) {
 			if (mddev->pers)
-				sz += sprintf(page + sz, "\n      %d blocks",
+				seq_printf(seq, "\n      %d blocks",
 						 md_size[mdidx(mddev)]);
 			else
-				sz += sprintf(page + sz, "\n      %d blocks", size);
+				seq_printf(seq, "\n      %d blocks", size);
 		}
 
 		if (!mddev->pers) {
-			sz += sprintf(page+sz, "\n");
+			seq_printf(seq, "\n");
 			continue;
 		}
 
-		sz += mddev->pers->status (page+sz, mddev);
+		mddev->pers->status (seq, mddev);
 
-		sz += sprintf(page+sz, "\n      ");
+		seq_printf(seq, "\n      ");
 		if (mddev->curr_resync) {
-			sz += status_resync (page+sz, mddev);
+			status_resync (seq, mddev);
 		} else {
 			if (md_atomic_read(&mddev->resync_sem.count) != 1)
-				sz += sprintf(page + sz, "	resync=DELAYED");
+				seq_printf(seq, "	resync=DELAYED");
 		}
-		sz += sprintf(page + sz, "\n");
+		seq_printf(seq, "\n");
 	}
-	sz += status_unused(page + sz);
+	status_unused(seq);
 
-	return sz;
+	return 0;
 }
 
+
+static int md_state_open_fs(struct inode *inode, struct file *file)
+{
+	return single_open(file, md_state_seq_show, NULL);
+}
+
+
 int register_md_personality(int pnum, mdk_personality_t *p)
 {
 	if (pnum >= MAX_PERSONALITY) {
@@ -3633,15 +3645,13 @@
 
 	dprintk("md: sizeof(mdp_super_t) = %d\n", (int)sizeof(mdp_super_t));
 
-#ifdef CONFIG_PROC_FS
-	create_proc_read_entry("mdstat", 0, NULL, md_status_read_proc, NULL);
-#endif
 }
 
 int md__init md_init(void)
 {
 	static char * name = "mdrecoveryd";
 	int minor;
+	struct proc_dir_entry * entry = NULL;
 
 	printk(KERN_INFO "md: md driver %d.%d.%d MAX_MD_DEVS=%d, MD_SB_DISKS=%d\n",
 			MD_MAJOR_VERSION, MD_MINOR_VERSION,
@@ -3678,6 +3688,13 @@
 	raid_table_header = register_sysctl_table(raid_root_table, 1);
 
 	md_geninit();
+	entry = create_proc_entry("mdstat", S_IRUGO, NULL);
+	if (!entry)
+		printk(KERN_ALERT
+			"md: bug: couldn't create /proc/mdstat\n");
+	else {
+		entry->proc_fops = &md_state_fops;
+	}
 	return (0);
 }
 
@@ -4005,9 +4022,7 @@
 	devfs_unregister_blkdev(MAJOR_NR,"md");
 	unregister_reboot_notifier(&md_notifier);
 	unregister_sysctl_table(raid_table_header);
-#ifdef CONFIG_PROC_FS
 	remove_proc_entry("mdstat", NULL);
-#endif
 
 	del_gendisk(&md_gendisk);
 
--- ../linux-2.4.18-26.7.x/drivers/md/multipath.c	Mon Feb 25 14:37:58 2002
+++ ./drivers/md/multipath.c	Tue Mar 11 11:09:04 2003
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/raid/multipath.h>
 #include <asm/atomic.h>
+#include <linux/seq_file.h>
 
 #define MAJOR_NR MD_MAJOR
 #define MD_DRIVER
@@ -281,18 +282,17 @@
 	return 0;
 }
 
-static int multipath_status (char *page, mddev_t *mddev)
+static void multipath_status (struct seq_file *seq, mddev_t *mddev)
 {
 	multipath_conf_t *conf = mddev_to_conf(mddev);
-	int sz = 0, i;
+	int i;
 	
-	sz += sprintf (page+sz, " [%d/%d] [", conf->raid_disks,
+	seq_printf (seq, " [%d/%d] [", conf->raid_disks,
 						 conf->working_disks);
 	for (i = 0; i < conf->raid_disks; i++)
-		sz += sprintf (page+sz, "%s",
+		seq_printf (seq, "%s",
 			conf->multipaths[i].operational ? "U" : "_");
-	sz += sprintf (page+sz, "]");
-	return sz;
+	seq_printf (seq, "]");
 }
 
 #define LAST_DISK KERN_ALERT \
--- ../linux-2.4.18-26.7.x/drivers/md/raid0.c	Sun Sep 30 15:26:06 2001
+++ ./drivers/md/raid0.c	Tue Mar 11 11:09:04 2003
@@ -20,6 +20,7 @@
 
 #include <linux/module.h>
 #include <linux/raid/raid0.h>
+#include <linux/seq_file.h>
 
 #define MAJOR_NR MD_MAJOR
 #define MD_DRIVER
@@ -289,41 +290,38 @@
 	return 0;
 }
 			   
-static int raid0_status (char *page, mddev_t *mddev)
+static void raid0_status (struct seq_file *seq, mddev_t *mddev)
 {
-	int sz = 0;
 #undef MD_DEBUG
 #ifdef MD_DEBUG
 	int j, k;
 	raid0_conf_t *conf = mddev_to_conf(mddev);
   
-	sz += sprintf(page + sz, "      ");
+	seq_printf(seq, "      ");
 	for (j = 0; j < conf->nr_zones; j++) {
-		sz += sprintf(page + sz, "[z%d",
+		seq_printf(seq, "[z%d",
 				conf->hash_table[j].zone0 - conf->strip_zone);
 		if (conf->hash_table[j].zone1)
-			sz += sprintf(page+sz, "/z%d] ",
+			seq_printf(seq, "/z%d] ",
 				conf->hash_table[j].zone1 - conf->strip_zone);
 		else
-			sz += sprintf(page+sz, "] ");
+			seq_printf(seq, "] ");
 	}
   
-	sz += sprintf(page + sz, "\n");
+	seq_printf(seq, "\n");
   
 	for (j = 0; j < conf->nr_strip_zones; j++) {
-		sz += sprintf(page + sz, "      z%d=[", j);
+		seq_printf(seq, "      z%d=[", j);
 		for (k = 0; k < conf->strip_zone[j].nb_dev; k++)
-			sz += sprintf (page+sz, "%s/", partition_name(
+			seq_printf (seq, "%s/", partition_name(
 				conf->strip_zone[j].dev[k]->dev));
-		sz--;
-		sz += sprintf (page+sz, "] zo=%d do=%d s=%d\n",
+		seq_printf (seq, "] zo=%d do=%d s=%d\n",
 				conf->strip_zone[j].zone_offset,
 				conf->strip_zone[j].dev_offset,
 				conf->strip_zone[j].size);
 	}
 #endif
-	sz += sprintf(page + sz, " %dk chunks", mddev->param.chunk_size/1024);
-	return sz;
+	seq_printf(seq, " %dk chunks", mddev->param.chunk_size/1024);
 }
 
 static mdk_personality_t raid0_personality=
--- ../linux-2.4.18-26.7.x/drivers/md/raid1.c	Mon Feb 24 09:15:28 2003
+++ ./drivers/md/raid1.c	Tue Mar 11 11:09:05 2003
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/raid/raid1.h>
 #include <asm/atomic.h>
+#include <linux/seq_file.h>
 
 #define MAJOR_NR MD_MAJOR
 #define MD_DRIVER
@@ -714,18 +715,17 @@
 	return (0);
 }
 
-static int raid1_status (char *page, mddev_t *mddev)
+static void raid1_status (struct seq_file *seq, mddev_t *mddev)
 {
 	raid1_conf_t *conf = mddev_to_conf(mddev);
-	int sz = 0, i;
+	int i;
 	
-	sz += sprintf (page+sz, " [%d/%d] [", conf->raid_disks,
+	seq_printf (seq, " [%d/%d] [", conf->raid_disks,
 						 conf->working_disks);
 	for (i = 0; i < conf->raid_disks; i++)
-		sz += sprintf (page+sz, "%s",
+		seq_printf (seq, "%s",
 			conf->mirrors[i].operational ? "U" : "_");
-	sz += sprintf (page+sz, "]");
-	return sz;
+	seq_printf (seq, "]");
 }
 
 #define LAST_DISK KERN_ALERT \
--- ../linux-2.4.18-26.7.x/drivers/md/raid5.c	Mon Feb 24 09:15:36 2003
+++ ./drivers/md/raid5.c	Tue Mar 11 11:09:05 2003
@@ -23,6 +23,7 @@
 #include <linux/raid/raid5.h>
 #include <asm/bitops.h>
 #include <asm/atomic.h>
+#include <linux/seq_file.h>
 
 static mdk_personality_t raid5_personality;
 
@@ -1681,23 +1682,22 @@
 }
 #endif
 
-static int raid5_status (char *page, mddev_t *mddev)
+static void raid5_status (struct seq_file *seq, mddev_t *mddev)
 {
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	mdp_super_t *sb = mddev->sb;
-	int sz = 0, i;
+	int i;
 
-	sz += sprintf (page+sz, " level %d, %dk chunk, algorithm %d", sb->level, sb->chunk_size >> 10, sb->layout);
-	sz += sprintf (page+sz, " [%d/%d] [", conf->raid_disks, conf->working_disks);
+	seq_printf (seq, " level %d, %dk chunk, algorithm %d", sb->level, sb->chunk_size >> 10, sb->layout);
+	seq_printf (seq, " [%d/%d] [", conf->raid_disks, conf->working_disks);
 	for (i = 0; i < conf->raid_disks; i++)
-		sz += sprintf (page+sz, "%s", conf->disks[i].operational ? "U" : "_");
-	sz += sprintf (page+sz, "]");
+		seq_printf (seq, "%s", conf->disks[i].operational ? "U" : "_");
+	seq_printf (seq, "]");
 #if RAID5_DEBUG
 #define D(x) \
-	sz += sprintf (page+sz, "<"#x":%d>", atomic_read(&conf->x))
+	seq_printf (seq, "<"#x":%d>", atomic_read(&conf->x))
 	printall(conf);
 #endif
-	return sz;
 }
 
 static void print_raid5_conf (raid5_conf_t *conf)
--- ../linux-2.4.18-26.7.x/fs/seq_file.c	Mon Feb 24 09:15:39 2003
+++ ./fs/seq_file.c	Tue Mar 11 12:23:48 2003
@@ -295,3 +295,47 @@
 	m->count = m->size;
 	return -1;
 }
+
+static void *single_start(struct seq_file *p, loff_t *pos)
+{
+	return NULL + (*pos == 0);
+}
+
+static void *single_next(struct seq_file *p, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static void single_stop(struct seq_file *p, void *v)
+{
+}
+
+int single_open(struct file *file, int (*show)(struct seq_file *, void*), void *data)
+{
+	struct seq_operations *op = kmalloc(sizeof(*op), GFP_KERNEL);
+	int res = -ENOMEM;
+
+	if (op) {
+		op->start = single_start;
+		op->next = single_next;
+		op->stop = single_stop;
+		op->show = show;
+		res = seq_open(file, op);
+		if (!res)
+			((struct seq_file *)file->private_data)->private = data;
+		else
+			kfree(op);
+		}
+	return res;
+}
+
+int single_release(struct inode *inode, struct file *file)
+{
+	struct seq_operations *op = ((struct seq_file *)file->private_data)->op;
+	int res = seq_release(inode, file);
+	kfree(op);
+	return res;
+}
+
+
--- ../linux-2.4.18-26.7.x/include/linux/seq_file.h	Mon Feb 24 09:15:18 2003
+++ ./include/linux/seq_file.h	Tue Mar 11 12:25:15 2003
@@ -52,5 +52,7 @@
 int seq_printf(struct seq_file *, const char *, ...)
 	__attribute__ ((format (printf,2,3)));
 
+int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
+int single_release(struct inode *, struct file *);
 #endif
 #endif

* RE: OOPS in do_try_to_free_pages with VERY large software RAID array
@ 2003-03-11 17:38 Rechenberg, Andrew
  2003-03-11 17:49 ` Randy.Dunlap
  0 siblings, 1 reply; 11+ messages in thread
From: Rechenberg, Andrew @ 2003-03-11 17:38 UTC
  To: Kevin P. Fleming, Martin J. Bligh; +Cc: linux-kernel

Kevin,

I tried patching md by hand since your patch is for 2.5 but I'm having
some issues.  When I try to make bzImage I'm getting the following
error:

md.c:139: `single_release' undeclared here (not in a function)
md.c:139: initializer element is not constant
md.c:139: (near initialization for `md_state_fops.release')
md.c:140: initializer element is not constant
md.c:140: (near initialization for `md_state_fops')
md.c: In function `md_state_seq_show':
md.c:3219: warning: passing arg 1 of pointer to function from
incompatible pointer type
md.c: In function `md_state_open_fs':
md.c:3238: warning: implicit declaration of function `single_open'
make[3]: *** [md.o] Error 1
make[3]: Leaving directory `/usr/src/linux-flux/drivers/md'
make[2]: *** [first_rule] Error 2
make[2]: Leaving directory `/usr/src/linux-flux/drivers/md'
make[1]: *** [_subdir_md] Error 2
make[1]: Leaving directory `/usr/src/linux-flux/drivers'
make: *** [_dir_drivers] Error 2

Can you tell me if single_release is a 2.5 "thing"?  Can you point
me in the right direction as to how to fix this problem?

Thanks,
Andy.

-----Original Message-----
From: Kevin P. Fleming [mailto:kpfleming@cox.net] 
Sent: Monday, March 10, 2003 2:48 PM
To: Martin J. Bligh
Cc: Rechenberg, Andrew; linux-kernel@vger.kernel.org
Subject: Re: OOPS in do_try_to_free_pages with VERY large software RAID
array


Martin J. Bligh wrote:
> At a wild guess (OK, I only looked for about 1 minute), 
> md_status_read_proc is generating more than 4K of information, and 
> overwriting the end of its 4K page. Throw some debug in there, and
> get it to printk how much of the buffer it thinks it's using (just 
> printk sz every time it changes it). If it's > 4K, convert it to the 
> seq_file interface.
> 
> May not be it, but it seems likely given the unusual scale of what 
> you're doing, and it's easy to check.
> 
> M.

I posted a patch to do exactly this last week to the Linux-RAID mailing
list. If you check the archives you should find it. This problem also
occurs if you use the device-mapper under 2.5.X, because it makes all 256
md minors appear in the tables and /proc/mdstat wants to tell you about
all of them.



* RE: OOPS in do_try_to_free_pages with VERY large software RAID array
@ 2003-03-10 21:23 Rechenberg, Andrew
  2003-03-10 21:52 ` Martin J. Bligh
  0 siblings, 1 reply; 11+ messages in thread
From: Rechenberg, Andrew @ 2003-03-10 21:23 UTC
  To: Martin J. Bligh, linux-kernel

I can see why that would cause the OOPS I get when reading
/proc/mdstat, but the other OOPS happens while the box is just
syncing the RAID arrays.  Could md_status_read_proc overwrite its
buffer and cause an OOPS even if I'm not looking at /proc/mdstat?  I
guess that is a likely cause.

Let me know what you think.

Thanks,
Andy.

-----Original Message-----
From: Martin J. Bligh [mailto:mbligh@aracnet.com] 
Sent: Monday, March 10, 2003 2:28 PM
To: Rechenberg, Andrew; linux-kernel@vger.kernel.org
Subject: Re: OOPS in do_try_to_free_pages with VERY large software RAID
array


> Can anyone help me out please?  I'm trying to create a monster 
> software RAID array and the kernel is not behaving.  On some test 
> hardware I can get 17 RAID1 arrays to begin syncing and will sync with
> /proc/sys/dev/raid/speed_limit_max set to 100000 (the max allowed) 
> with no problem.
> 
> We wanted to use 26 RAID1 arrays and then stripe across them to get 
> very high performance.  When I tried to do that this weekend on our 
> production box we started getting kernel panics when the RAID1 arrays 
> started syncing.  This was with speed_limit_max set to 10000 so the 
> rate wasn't very high.  Since we knew 34 disks worked we decided to 
> put the box in to production with just 13 RAID1 arrays and striping 
> across those.  The performance is great compared to our hardware RAID,
> but I would like to get all the disks we purchase for this system 
> working.
> 
> This morning I connected 56 disks to our test hardware and tried to 
> reproduce the problem.  With the test hardware, the 26 RAID1 arrays 
> were working OK at speed_limit_max 10000 however the kernel OOPSed 
> when I 'less'ed /proc/mdstat.  It wasn't a hard crash because I could 
> still work.  However when I upped the speed_limit_max to 30000 there 
> was a hard crash.

At a wild guess (OK, I only looked for about 1 minute),
md_status_read_proc is generating more than 4K of information, and
overwriting the end of its 4K page. Throw some debug in there, and get
it to printk how much of the buffer it thinks it's using (just printk sz
every time it changes it). If it's > 4K, convert it to the seq_file
interface.

May not be it, but it seems likely given the unusual scale of what
you're doing, and it's easy to check.

M.
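
To make the suggestion above concrete, here is a minimal userspace
sketch in C of the failure mode being described: a handler that formats
one status block per array into a single 4K page while keeping a running
sz.  It is not the md driver's code.  The format_array text, the
sdX/sdY device names and the build_report helper are all made up for
illustration, and the real /proc/mdstat layout differs.  The sketch
bounds every write and returns how many bytes the report wanted, which
is the number the suggested printk would reveal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_SZ 4096  /* one page, as in the "4K page" mentioned above */

/* Hypothetical per-array status text; the real mdstat format differs.
 * Returns the length the text needs, as snprintf does, even when it
 * does not fit in `room` bytes. */
static int format_array(char *dst, size_t room, int idx)
{
    return snprintf(dst, room,
                    "md%d : active raid1 sdX[1] sdY[0]\n"
                    "      35843072 blocks [2/2] [UU]\n"
                    "      [==>..................]  resync = 12.3%% "
                    "(4408832/35843072) finish=52.1min speed=10000K/sec\n",
                    idx);
}

/* Fill `page` (PAGE_SZ bytes) with status for `narrays` arrays.
 * Never writes past the page; returns the total size the report wanted,
 * so a result larger than PAGE_SZ means an unbounded sprintf-style
 * handler would have run off the end of its buffer. */
static size_t build_report(char *page, int narrays)
{
    size_t sz = 0;

    for (int i = 0; i < narrays; i++) {
        /* Space left in the page; zero once the report has outgrown it. */
        size_t room = sz < PAGE_SZ ? PAGE_SZ - sz : 0;
        int n = format_array(room ? page + sz : NULL, room, i);

        assert(n >= 0);   /* snprintf only fails on encoding errors */
        sz += (size_t)n;  /* the value the "printk sz" debug would show */
    }
    return sz;
}
```

Calling build_report(page, n) on a char page[PAGE_SZ] for increasing n
shows where the wanted size crosses 4096.  With this made-up format each
array needs roughly 160 bytes, so one page holds only a couple of dozen
arrays.  A seq_file-based handler avoids the problem by not depending on
a single fixed page.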




* OOPS in do_try_to_free_pages with VERY large software RAID array
@ 2003-03-10 19:20 Rechenberg, Andrew
  2003-03-10 19:27 ` Martin J. Bligh
  0 siblings, 1 reply; 11+ messages in thread
From: Rechenberg, Andrew @ 2003-03-10 19:20 UTC (permalink / raw)
  To: linux-kernel

Good day.

Can anyone help me out please?  I'm trying to create a monster software
RAID array and the kernel is not behaving.  On some test hardware I can
get 17 RAID1 arrays to begin syncing, and they sync with
/proc/sys/dev/raid/speed_limit_max set to 100000 (the max allowed) with
no problem.

We wanted to use 26 RAID1 arrays and then stripe across them to get very
high performance.  When I tried to do that this weekend on our
production box we started getting kernel panics when the RAID1 arrays
started syncing.  This was with speed_limit_max set to 10000 so the rate
wasn't very high.  Since we knew 34 disks worked we decided to put the
box in to production with just 13 RAID1 arrays and striping across
those.  The performance is great compared to our hardware RAID, but I
would like to get all the disks we purchased for this system working.

This morning I connected 56 disks to our test hardware and tried to
reproduce the problem.  With the test hardware, the 26 RAID1 arrays were
working OK at speed_limit_max 10000; however, the kernel OOPSed when I
'less'ed /proc/mdstat.  It wasn't a hard crash because I could still
work.  However, when I upped the speed_limit_max to 30000 there was a
hard crash.

I've tried disabling Hyper Threading on these boxes with the 'noht'
kernel boot parameter, but that didn't seem to help.  A lot of what's on
Google points to bad hardware, but I don't think this problem is
hardware-related.  The kernels are stock Red Hat source and have
CONFIG_SD_EXTRA_DEVS set to 64 and have the megaraid2 module patches.

The output from ksymoops for both OOPSes is below, along with the
production and test hardware specs and kernel versions.  If anyone can
aid me, please let me know.  This is test hardware so I am free to try
kernel patches or anything else.  If more information is needed please
let me know.  I am subscribed to the RAID list, but not to the LKML so
please CC: me with responses.

Thank you so much for your assistance,
Andy.

Regards,
Andrew Rechenberg
Infrastructure Team, Sherman Financial Group


Production Hardware
--------------------
Dell PE6600
4x1.4GHz Xeon with HT
8GB RAM
2.4.18-19.7.xbigmem-SHR
ProductionHW Modules:
---------------------
Module                  Size  Used by    Not tainted
lp                      8672   0  (autoclean)
parport                35648   0  (autoclean) [lp]
autofs                 11620   0  (autoclean) (unused)
tg3                    47200   1
raid0                   4128   1  (autoclean)
raid1                  15556  13  (autoclean)
loop                   11184   0  (autoclean)
lvm-mod                64608   3
ext3                   67360   7
jbd                    51464   7  [ext3]
aic7xxx               153664  28
megaraid2              37920   7
sd_mod                 12800  70
scsi_mod              110352   3  [aic7xxx megaraid2 sd_mod]

Test Hardware
-------------
Dell PE4600
2x2.4GHz Xeon with HT
4GB RAM
2.4.18-24.7.xbigmem-SHR (includes megaraid2 module)
TestHW Modules
---------------
Module                  Size  Used by    Not tainted
e100                   58500   1
raid1                  15556   0  (autoclean)
loop                   11184   0  (autoclean)
sr_mod                 16088   0  (autoclean) (unused)
cdrom                  32416   0  (autoclean) [sr_mod]
usb-ohci               21856   0  (unused)
usbcore                74400   1  [usb-ohci]
lvm-mod                64608   0
aic7xxx               129856   3
sd_mod                 12832   6
scsi_mod              110800   3  [sr_mod aic7xxx sd_mod]

[root@cinshrinft1 ~]# ksymoops -k /proc/ksyms /tmp/raidoops
ksymoops 2.4.4 on i686 2.4.18-24.7.xbigmem-SHR.  Options used
     -V (default)
     -k /proc/ksyms (specified)
     -l /proc/modules (default)
     -o /lib/modules/2.4.18-24.7.xbigmem-SHR/ (default)
     -m /boot/System.map-2.4.18-24.7.xbigmem-SHR (default)

Error (expand_objects): cannot stat(/lib/lvm-mod.o) for lvm-mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/aic7xxx.o) for aic7xxx
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
ksymoops: No such file or directory
OOPS: 0000
CPU: 3
EIP: 0010 [<c0138320>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010883
eax: 5d305b21  ebx: 000000a3  ecx: 00000006  edx: c5e8a090
esi: c5e8a080  edi: 00000000  ebp: c5e5e2b4  esp: f7ffdf64
ds: 0018   es: 0018   ss: 0018
Call Trace: [<c013aa8e>] do_try_to_free_pages [kernel] 0x3e (0xf7ffdf98))
[<c013add1>] kswapd [kernel] 0x141 (0xf7ffdfd4))
[<c0105000>] stext [kernel] 0x0 (0xf7ffdfe8))
[<c01072a6>] kernel_thread [kernel] 0x26 (0xf7ffdff0))
[<c013ac90>] kswapd [kernel] 0x0 (0xf7ffdff8))
Code: 8b 00 43 39 d0 75 f9 8b 4e 3c 89 da 8b 7e 4c d3 e2 85 ff 74

>>EIP; c0138320 <kmem_cache_reap+1d0/340>   <=====
Trace; c013aa8e <do_try_to_free_pages+3e/1b0>
Trace; c013add1 <kswapd+141/390>
Trace; c0105000 <_stext+0/0>
Trace; c01072a6 <kernel_thread+26/30>
Trace; c013ac90 <kswapd+0/390>
Code;  c0138320 <kmem_cache_reap+1d0/340>
00000000 <_EIP>:
Code;  c0138320 <kmem_cache_reap+1d0/340>   <=====
   0:   8b 00                     mov    (%eax),%eax   <=====
Code;  c0138322 <kmem_cache_reap+1d2/340>
   2:   43                        inc    %ebx
Code;  c0138323 <kmem_cache_reap+1d3/340>
   3:   39 d0                     cmp    %edx,%eax
Code;  c0138325 <kmem_cache_reap+1d5/340>
   5:   75 f9                     jne    0 <_EIP>
Code;  c0138327 <kmem_cache_reap+1d7/340>
   7:   8b 4e 3c                  mov    0x3c(%esi),%ecx
Code;  c013832a <kmem_cache_reap+1da/340>
   a:   89 da                     mov    %ebx,%edx
Code;  c013832c <kmem_cache_reap+1dc/340>
   c:   8b 7e 4c                  mov    0x4c(%esi),%edi
Code;  c013832f <kmem_cache_reap+1df/340>
   f:   d3 e2                     shl    %cl,%edx
Code;  c0138331 <kmem_cache_reap+1e1/340>
  11:   85 ff                     test   %edi,%edi
Code;  c0138333 <kmem_cache_reap+1e3/340>
  13:   74 00                     je     15 <_EIP+0x15> c0138335 <kmem_cache_reap+1e5/340>


4 errors issued.  Results may not be reliable.


[root@cinshrinft1 ~]# ksymoops -k /proc/ksyms /tmp/lessoops
ksymoops 2.4.4 on i686 2.4.18-24.7.xbigmem-SHR.  Options used
     -V (default)
     -k /proc/ksyms (specified)
     -l /proc/modules (default)
     -o /lib/modules/2.4.18-24.7.xbigmem-SHR/ (default)
     -m /boot/System.map-2.4.18-24.7.xbigmem-SHR (default)

Error (expand_objects): cannot stat(/lib/lvm-mod.o) for lvm-mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/aic7xxx.o) for aic7xxx
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
ksymoops: No such file or directory
Mar 10 09:56:03 cinshrinft1 kernel: Unable to handle kernel paging request at virtual address 0a5d306f
Mar 10 09:56:03 cinshrinft1 kernel: c0145084
Mar 10 09:56:03 cinshrinft1 kernel: *pde = 00000000
Mar 10 09:56:03 cinshrinft1 kernel: Oops: 0002
Mar 10 09:56:03 cinshrinft1 kernel: CPU:    3
Mar 10 09:56:03 cinshrinft1 kernel: EIP:    0010:[<c0145084>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Mar 10 09:56:03 cinshrinft1 kernel: EFLAGS: 00010202
Mar 10 09:56:03 cinshrinft1 kernel: eax: 0a5d305b   ebx: eba14c80   ecx: 00000001   edx: eba14c80
Mar 10 09:56:03 cinshrinft1 kernel: esi: 0805db40   edi: fffffff7   ebp: 000003eb   esp: eb7d3f88
Mar 10 09:56:03 cinshrinft1 kernel: ds: 0018   es: 0018   ss: 0018
Mar 10 09:56:03 cinshrinft1 kernel: Process less (pid: 6866, stackpage=eb7d3000)
Mar 10 09:56:03 cinshrinft1 kernel: Stack: eb7d2000 c01440f9 0000000c 00000003 c0116a29 00000001 0805db40 c0121
Mar 10 09:56:03 cinshrinft1 kernel:        eb7d3fac 3e6ca782 eb7d2000 0805db40 00000000 bfffe438 c0108c93 00000
Mar 10 09:56:03 cinshrinft1 kernel:        0805e540 000003eb 0805db40 00000000 bfffe438 00000004 0000002b 00000
Mar 10 09:56:03 cinshrinft1 kernel: Call Trace: [<c01440f9>] sys_write [kernel] 0x19 (0xeb7d3f8c))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0116a29>] smp_apic_timer_interrupt [kernel] 0xa9 (0xeb7d3f98))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0121e52>] sys_time [kernel] 0x12 (0xeb7d3fa4))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0108c93>] system_call [kernel] 0x33 (0xeb7d3fc0))
Mar 10 09:56:03 cinshrinft1 kernel: Code: f0 ff 40 14 f0 ff 43 04 5b c3 89 f6 8b 4c 24 04 f0 ff 49 14

>>EIP; c0145084 <fget+34/40>   <=====
Trace; c01440f9 <sys_write+19/110>
Trace; c0116a29 <smp_apic_timer_interrupt+a9/d0>
Trace; c0121e52 <sys_time+12/60>
Trace; c0108c93 <system_call+33/38>
Code;  c0145084 <fget+34/40>
00000000 <_EIP>:
Code;  c0145084 <fget+34/40>   <=====
   0:   f0 ff 40 14               lock incl 0x14(%eax)   <=====
Code;  c0145088 <fget+38/40>
   4:   f0 ff 43 04               lock incl 0x4(%ebx)
Code;  c014508c <fget+3c/40>
   8:   5b                        pop    %ebx
Code;  c014508d <fget+3d/40>
   9:   c3                        ret
Code;  c014508e <fget+3e/40>
   a:   89 f6                     mov    %esi,%esi
Code;  c0145090 <put_filp+0/50>
   c:   8b 4c 24 04               mov    0x4(%esp,1),%ecx
Code;  c0145094 <put_filp+4/50>
  10:   f0 ff 49 14               lock decl 0x14(%ecx)


4 errors issued.  Results may not be reliable.
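
One detail in the two traces above is worth a second look: the eax
values at the faulting instructions (5d305b21 and 0a5d305b) look unlike
the kernel addresses elsewhere in the dumps, but they do decode as
printable ASCII.  The sketch below is an annotation, not part of the
original report.  It renders a register value as the four bytes it
occupies in little-endian i386 memory.  word_to_ascii and
check_trace_values are made-up helper names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Render a 32-bit register value as the four bytes it would occupy in
 * little-endian memory (lowest byte first).  Non-printable bytes become
 * '.', so a newline (0x0a) shows up as '.'. */
static void word_to_ascii(uint32_t word, char out[5])
{
    for (int i = 0; i < 4; i++) {
        unsigned char b = (unsigned char)(word >> (8 * i));
        out[i] = isprint(b) ? (char)b : '.';
    }
    out[4] = '\0';
}

/* Decode the two eax values reported in the traces above. */
static void check_trace_values(void)
{
    char buf[5];

    word_to_ascii(0x5d305b21u, buf);   /* eax in the kswapd OOPS */
    assert(strcmp(buf, "![0]") == 0);

    word_to_ascii(0x0a5d305bu, buf);   /* eax in the 'less' OOPS */
    assert(strcmp(buf, "[0].") == 0);  /* last byte is 0x0a, a newline */
}
```

Both values contain the text "[0]", and the second one ends in a
newline, which resembles the "[N]" device-slot suffix that closes an
array line in /proc/mdstat.  That is consistent with status text having
been written over kernel data, though on its own it does not prove where
the text came from.  The faulting address in the second trace, 0a5d306f,
is simply that eax value plus the 0x14 offset used by the faulting
instruction.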

* OOPS in do_try_to_free_pages with VERY large software RAID array
@ 2003-03-10 19:19 Rechenberg, Andrew
  0 siblings, 0 replies; 11+ messages in thread
From: Rechenberg, Andrew @ 2003-03-10 19:19 UTC (permalink / raw)
  To: linux-kernel!, linux-raid

Good day.

Can anyone help me out please?  I'm trying to create a monster software
RAID array and the kernel is not behaving.  On some test hardware I can
get 17 RAID1 arrays to begin syncing, and they sync with
/proc/sys/dev/raid/speed_limit_max set to 100000 (the max allowed) with
no problem.

We wanted to use 26 RAID1 arrays and then stripe across them to get very
high performance.  When I tried to do that this weekend on our
production box we started getting kernel panics when the RAID1 arrays
started syncing.  This was with speed_limit_max set to 10000 so the rate
wasn't very high.  Since we knew 34 disks worked we decided to put the
box in to production with just 13 RAID1 arrays and striping across
those.  The performance is great compared to our hardware RAID, but I
would like to get all the disks we purchased for this system working.

This morning I connected 56 disks to our test hardware and tried to
reproduce the problem.  With the test hardware, the 26 RAID1 arrays were
working OK at speed_limit_max 10000; however, the kernel OOPSed when I
'less'ed /proc/mdstat.  It wasn't a hard crash because I could still
work.  However, when I upped the speed_limit_max to 30000 there was a
hard crash.

I've tried disabling Hyper Threading on these boxes with the 'noht'
kernel boot parameter, but that didn't seem to help.  A lot of what's on
Google points to bad hardware, but I don't think this problem is
hardware-related.  The kernels are stock Red Hat source and have
CONFIG_SD_EXTRA_DEVS set to 64 and have the megaraid2 module patches.

The output from ksymoops for both OOPSes is below, along with the
production and test hardware specs and kernel versions.  If anyone can
aid me, please let me know.  This is test hardware so I am free to try
kernel patches or anything else.  If more information is needed please
let me know.  I am subscribed to the RAID list, but not to the LKML so
please CC: me with responses.

Thank you so much for your assistance,
Andy.

Regards,
Andrew Rechenberg
Infrastructure Team, Sherman Financial Group


Production Hardware
--------------------
Dell PE6600
4x1.4GHz Xeon with HT
8GB RAM
2.4.18-19.7.xbigmem-SHR
ProductionHW Modules:
---------------------
Module                  Size  Used by    Not tainted
lp                      8672   0  (autoclean)
parport                35648   0  (autoclean) [lp]
autofs                 11620   0  (autoclean) (unused)
tg3                    47200   1
raid0                   4128   1  (autoclean)
raid1                  15556  13  (autoclean)
loop                   11184   0  (autoclean)
lvm-mod                64608   3
ext3                   67360   7
jbd                    51464   7  [ext3]
aic7xxx               153664  28
megaraid2              37920   7
sd_mod                 12800  70
scsi_mod              110352   3  [aic7xxx megaraid2 sd_mod]

Test Hardware
-------------
Dell PE4600
2x2.4GHz Xeon with HT
4GB RAM
2.4.18-24.7.xbigmem-SHR (includes megaraid2 module)
TestHW Modules
---------------
Module                  Size  Used by    Not tainted
e100                   58500   1
raid1                  15556   0  (autoclean)
loop                   11184   0  (autoclean)
sr_mod                 16088   0  (autoclean) (unused)
cdrom                  32416   0  (autoclean) [sr_mod]
usb-ohci               21856   0  (unused)
usbcore                74400   1  [usb-ohci]
lvm-mod                64608   0
aic7xxx               129856   3
sd_mod                 12832   6
scsi_mod              110800   3  [sr_mod aic7xxx sd_mod]

[root@cinshrinft1 ~]# ksymoops -k /proc/ksyms /tmp/raidoops
ksymoops 2.4.4 on i686 2.4.18-24.7.xbigmem-SHR.  Options used
     -V (default)
     -k /proc/ksyms (specified)
     -l /proc/modules (default)
     -o /lib/modules/2.4.18-24.7.xbigmem-SHR/ (default)
     -m /boot/System.map-2.4.18-24.7.xbigmem-SHR (default)

Error (expand_objects): cannot stat(/lib/lvm-mod.o) for lvm-mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/aic7xxx.o) for aic7xxx
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
ksymoops: No such file or directory
OOPS: 0000
CPU: 3
EIP: 0010 [<c0138320>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010883
eax: 5d305b21  ebx: 000000a3  ecx: 00000006  edx: c5e8a090
esi: c5e8a080  edi: 00000000  ebp: c5e5e2b4  esp: f7ffdf64
ds: 0018   es: 0018   ss: 0018
Call Trace: [<c013aa8e>] do_try_to_free_pages [kernel] 0x3e (0xf7ffdf98))
[<c013add1>] kswapd [kernel] 0x141 (0xf7ffdfd4))
[<c0105000>] stext [kernel] 0x0 (0xf7ffdfe8))
[<c01072a6>] kernel_thread [kernel] 0x26 (0xf7ffdff0))
[<c013ac90>] kswapd [kernel] 0x0 (0xf7ffdff8))
Code: 8b 00 43 39 d0 75 f9 8b 4e 3c 89 da 8b 7e 4c d3 e2 85 ff 74

>>EIP; c0138320 <kmem_cache_reap+1d0/340>   <=====
Trace; c013aa8e <do_try_to_free_pages+3e/1b0>
Trace; c013add1 <kswapd+141/390>
Trace; c0105000 <_stext+0/0>
Trace; c01072a6 <kernel_thread+26/30>
Trace; c013ac90 <kswapd+0/390>
Code;  c0138320 <kmem_cache_reap+1d0/340>
00000000 <_EIP>:
Code;  c0138320 <kmem_cache_reap+1d0/340>   <=====
   0:   8b 00                     mov    (%eax),%eax   <=====
Code;  c0138322 <kmem_cache_reap+1d2/340>
   2:   43                        inc    %ebx
Code;  c0138323 <kmem_cache_reap+1d3/340>
   3:   39 d0                     cmp    %edx,%eax
Code;  c0138325 <kmem_cache_reap+1d5/340>
   5:   75 f9                     jne    0 <_EIP>
Code;  c0138327 <kmem_cache_reap+1d7/340>
   7:   8b 4e 3c                  mov    0x3c(%esi),%ecx
Code;  c013832a <kmem_cache_reap+1da/340>
   a:   89 da                     mov    %ebx,%edx
Code;  c013832c <kmem_cache_reap+1dc/340>
   c:   8b 7e 4c                  mov    0x4c(%esi),%edi
Code;  c013832f <kmem_cache_reap+1df/340>
   f:   d3 e2                     shl    %cl,%edx
Code;  c0138331 <kmem_cache_reap+1e1/340>
  11:   85 ff                     test   %edi,%edi
Code;  c0138333 <kmem_cache_reap+1e3/340>
  13:   74 00                     je     15 <_EIP+0x15> c0138335 <kmem_cache_reap+1e5/340>


4 errors issued.  Results may not be reliable.


[root@cinshrinft1 ~]# ksymoops -k /proc/ksyms /tmp/lessoops
ksymoops 2.4.4 on i686 2.4.18-24.7.xbigmem-SHR.  Options used
     -V (default)
     -k /proc/ksyms (specified)
     -l /proc/modules (default)
     -o /lib/modules/2.4.18-24.7.xbigmem-SHR/ (default)
     -m /boot/System.map-2.4.18-24.7.xbigmem-SHR (default)

Error (expand_objects): cannot stat(/lib/lvm-mod.o) for lvm-mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/aic7xxx.o) for aic7xxx
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/sd_mod.o) for sd_mod
ksymoops: No such file or directory
Error (expand_objects): cannot stat(/lib/scsi_mod.o) for scsi_mod
ksymoops: No such file or directory
Mar 10 09:56:03 cinshrinft1 kernel: Unable to handle kernel paging request at virtual address 0a5d306f
Mar 10 09:56:03 cinshrinft1 kernel: c0145084
Mar 10 09:56:03 cinshrinft1 kernel: *pde = 00000000
Mar 10 09:56:03 cinshrinft1 kernel: Oops: 0002
Mar 10 09:56:03 cinshrinft1 kernel: CPU:    3
Mar 10 09:56:03 cinshrinft1 kernel: EIP:    0010:[<c0145084>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Mar 10 09:56:03 cinshrinft1 kernel: EFLAGS: 00010202
Mar 10 09:56:03 cinshrinft1 kernel: eax: 0a5d305b   ebx: eba14c80   ecx: 00000001   edx: eba14c80
Mar 10 09:56:03 cinshrinft1 kernel: esi: 0805db40   edi: fffffff7   ebp: 000003eb   esp: eb7d3f88
Mar 10 09:56:03 cinshrinft1 kernel: ds: 0018   es: 0018   ss: 0018
Mar 10 09:56:03 cinshrinft1 kernel: Process less (pid: 6866, stackpage=eb7d3000)
Mar 10 09:56:03 cinshrinft1 kernel: Stack: eb7d2000 c01440f9 0000000c 00000003 c0116a29 00000001 0805db40 c0121
Mar 10 09:56:03 cinshrinft1 kernel:        eb7d3fac 3e6ca782 eb7d2000 0805db40 00000000 bfffe438 c0108c93 00000
Mar 10 09:56:03 cinshrinft1 kernel:        0805e540 000003eb 0805db40 00000000 bfffe438 00000004 0000002b 00000
Mar 10 09:56:03 cinshrinft1 kernel: Call Trace: [<c01440f9>] sys_write [kernel] 0x19 (0xeb7d3f8c))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0116a29>] smp_apic_timer_interrupt [kernel] 0xa9 (0xeb7d3f98))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0121e52>] sys_time [kernel] 0x12 (0xeb7d3fa4))
Mar 10 09:56:03 cinshrinft1 kernel: [<c0108c93>] system_call [kernel] 0x33 (0xeb7d3fc0))
Mar 10 09:56:03 cinshrinft1 kernel: Code: f0 ff 40 14 f0 ff 43 04 5b c3 89 f6 8b 4c 24 04 f0 ff 49 14

>>EIP; c0145084 <fget+34/40>   <=====
Trace; c01440f9 <sys_write+19/110>
Trace; c0116a29 <smp_apic_timer_interrupt+a9/d0>
Trace; c0121e52 <sys_time+12/60>
Trace; c0108c93 <system_call+33/38>
Code;  c0145084 <fget+34/40>
00000000 <_EIP>:
Code;  c0145084 <fget+34/40>   <=====
   0:   f0 ff 40 14               lock incl 0x14(%eax)   <=====
Code;  c0145088 <fget+38/40>
   4:   f0 ff 43 04               lock incl 0x4(%ebx)
Code;  c014508c <fget+3c/40>
   8:   5b                        pop    %ebx
Code;  c014508d <fget+3d/40>
   9:   c3                        ret
Code;  c014508e <fget+3e/40>
   a:   89 f6                     mov    %esi,%esi
Code;  c0145090 <put_filp+0/50>
   c:   8b 4c 24 04               mov    0x4(%esp,1),%ecx
Code;  c0145094 <put_filp+4/50>
  10:   f0 ff 49 14               lock decl 0x14(%ecx)


4 errors issued.  Results may not be reliable.


Thread overview: 11+ messages
2003-03-11 19:38 OOPS in do_try_to_free_pages with VERY large software RAID array Rechenberg, Andrew
2003-03-11 19:39 ` Martin J. Bligh
  -- strict thread matches above, loose matches on Subject: below --
2003-03-11 20:18 Rechenberg, Andrew
2003-03-11 17:38 Rechenberg, Andrew
2003-03-11 17:49 ` Randy.Dunlap
2003-03-10 21:23 Rechenberg, Andrew
2003-03-10 21:52 ` Martin J. Bligh
2003-03-10 19:20 Rechenberg, Andrew
2003-03-10 19:27 ` Martin J. Bligh
2003-03-10 19:48   ` Kevin P. Fleming
2003-03-10 19:19 Rechenberg, Andrew
