* problems with 2.5.61-mm1
@ 2003-02-15 8:58 Dave Hansen
2003-02-15 9:09 ` Andrew Morton
2003-02-15 18:20 ` Martin J. Bligh
0 siblings, 2 replies; 5+ messages in thread
From: Dave Hansen @ 2003-02-15 8:58 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux Kernel Mailing List, Martin J. Bligh
I've been beating on various versions of 2.5.59 all day long with no
problems that I didn't cause. I started testing 2.5.61-mm1 and ran
into a couple of problems right away.
The first I really doubt is -mm specific. I get _loads_ of these, and
the e1000 isn't working:
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
The e1000 driver hasn't been touched in weeks. Here's my /proc/interrupts:
http://www.sr71.net/linux/interrupts
I'm pretty sure we can see the problem here. Almost all interrupts are
going to CPU0. Is this a summit thing?
The other looks a bit more insidious.
Unable to handle kernel NULL pointer dereference at virtual address 0000003d
c011af77
*pde = 1cf93001
Oops: 0002
CPU: 1
EIP: 0060:[<c011af77>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000029 ebx: de562870 ecx: dfa85074 edx: 000000e4
esi: deefc140 edi: cf5227c0 ebp: cf522780 esp: dcad7f08
ds: 007b es: 007b ss: 0068
Stack: cf522780 000000ff df734c80 00000008 00000000 cc266680 00000100 dfa85074
       deefc100 c6ac3900 c6ac39a4 00000011 00000000 c011b46c 00000011 c6ac3900
       00000004 00000286 00001000 cc266680 fffffff4 bffff7a0 00000011 00000000
[<c011b46c>] copy_process+0x3a4/0x902
[<c011ba1a>] do_fork+0x50/0x166
[<c0126cca>] sys_rt_sigprocmask+0xdc/0x150
[<c010792b>] sys_fork+0x37/0x4a
[<c0109347>] syscall_call+0x7/0xb
Code: f0 ff 40 14 89 03 83 c3 04 83 ea 01 75 e1 8b 44 24 20 f0 ff
>>EIP; c011af77 <copy_files+18f/2c6> <=====
>>ebx; de562870 <END_OF_CODE+1e0f3f2c/????>
>>ecx; dfa85074 <END_OF_CODE+1f616730/????>
>>esi; deefc140 <END_OF_CODE+1ea8d7fc/????>
>>edi; cf5227c0 <END_OF_CODE+f0b3e7c/????>
>>ebp; cf522780 <END_OF_CODE+f0b3e3c/????>
>>esp; dcad7f08 <END_OF_CODE+1c6695c4/????>
Code; c011af77 <copy_files+18f/2c6>
00000000 <_EIP>:
Code; c011af77 <copy_files+18f/2c6> <=====
0: f0 ff 40 14 lock incl 0x14(%eax) <=====
Code; c011af7b <copy_files+193/2c6>
4: 89 03 mov %eax,(%ebx)
Code; c011af7d <copy_files+195/2c6>
6: 83 c3 04 add $0x4,%ebx
Code; c011af80 <copy_files+198/2c6>
9: 83 ea 01 sub $0x1,%edx
Code; c011af83 <copy_files+19b/2c6>
c: 75 e1 jne ffffffef <_EIP+0xffffffef>
Code; c011af85 <copy_files+19d/2c6>
e: 8b 44 24 20 mov 0x20(%esp,1),%eax
Code; c011af89 <copy_files+1a1/2c6>
12: f0 ff 00 lock incl (%eax)
more disassembly
c011af64: 74 1f je c011af85 <copy_files+0x19d>
c011af66: 8b 4c 24 1c mov 0x1c(%esp,1),%ecx
c011af6a: 8b 01 mov (%ecx),%eax
c011af6c: 83 c1 04 add $0x4,%ecx
c011af6f: 85 c0 test %eax,%eax
c011af71: 89 4c 24 1c mov %ecx,0x1c(%esp,1)
c011af75: 74 04 je c011af7b <copy_files+0x193>
c011af77: f0 ff 40 14 lock incl 0x14(%eax) <========
c011af7b: 89 03 mov %eax,(%ebx)
c011af7d: 83 c3 04 add $0x4,%ebx
c011af80: 83 ea 01 sub $0x1,%edx
c011af83: 75 e1 jne c011af66 <copy_files+0x17e>
c011af85: 8b 44 24 20 mov 0x20(%esp,1),%eax
c011af89: f0 ff 40 04 lock incl 0x4(%eax)
c011af8d: 8b 45 08 mov 0x8(%ebp),%eax
c011af90: 89 df mov %ebx,%edi
c011af92: 2b 44 24 18 sub 0x18(%esp,1),%eax
c011af96: 8d 34 85 00 00 00 00 lea 0x0(,%eax,4),%esi
I didn't compile with -g, but I have a hunch it is this:
for (i = open_files; i != 0; i--) {
struct file *f = *old_fds++;
if (f)
get_file(f); <=============
*new_fds++ = f;
}
The offset of f_count in struct file is 0x14. The "test %eax,%eax" is
probably the "if (f)"
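The register dump is consistent with that reading: eax holds 0x29, and
0x29 + 0x14 = 0x3d, which is exactly the faulting address. That would
mean f was not NULL (it passed the test) but a small bogus value read
from the old fd array. A minimal sketch of the arithmetic follows;
struct file_sketch and incl_target() are hypothetical names, and the
padding is only an assumption chosen to put f_count at offset 0x14, not
the real 2.5.61 layout:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the part of struct file that matters here.
 * The padding size is an assumption chosen only so that f_count lands
 * at offset 0x14, the offset reported above.
 */
struct file_sketch {
	unsigned char pad[0x14];	/* fields preceding f_count (assumed) */
	int32_t f_count;		/* what get_file() increments */
};

/* Address touched by "lock incl 0x14(%eax)" for a given value of %eax. */
static uint32_t incl_target(uint32_t eax)
{
	return eax + (uint32_t)offsetof(struct file_sketch, f_count);
}
```

Calling incl_target(0x29) gives 0x3d, matching the address in the oops.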
--
Dave Hansen
haveblue@us.ibm.com
* Re: problems with 2.5.61-mm1
From: Andrew Morton @ 2003-02-15 9:09 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-kernel, Martin.Bligh
Dave Hansen <haveblue@us.ibm.com> wrote:
>
> I've been beating on various versions of 2.5.59 all day long with no
> problems that I didn't cause. I started testing 2.5.61-mm1 and ran
> into a couple of problems right away.
>
> The first I really doubt is -mm specific. I get _loads_ of these, and
> the e1000 isn't working:
> NETDEV WATCHDOG: eth0: transmit timed out
> e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
>
> The e1000 driver hasn't been touched in weeks.
Don't know.
> Here's my /proc/interrupts:
> http://www.sr71.net/linux/interrupts
> I'm pretty sure we can see the problem here. Almost all interrupts are
> going to CPU0. Is this a summit thing?
No, that's the new irq balancing code. It only starts distributing
interrupts to other CPUs when the load gets higher. See how it
spread the ethernet interrupts. Apparently this is as-designed.
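One way to put numbers on the distribution is to total the per-CPU
columns of /proc/interrupts. The helper below is only a sketch:
add_irq_line() is a hypothetical name, and it assumes the usual layout
of an IRQ label ending in a colon followed by one count per CPU:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Add the per-CPU counts from one /proc/interrupts line into totals[].
 * Returns the number of per-CPU columns parsed (0 for the header line).
 */
static int add_irq_line(const char *line, unsigned long *totals, int ncpus)
{
	const char *p = strchr(line, ':');
	int cpu = 0;

	if (!p)
		return 0;	/* header line ("CPU0 CPU1 ...") has no colon */
	p++;
	while (cpu < ncpus) {
		char *end;
		unsigned long n = strtoul(p, &end, 10);

		if (end == p)
			break;	/* reached the controller/device name */
		totals[cpu++] += n;
		p = end;
	}
	return cpu;
}
```

Feeding every line of the file through it leaves one total per CPU in
totals[], which makes a CPU0-heavy split easy to spot.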
> The other looks a bit more insidious.
>
> Unable to handle kernel NULL pointer dereference at virtual address 0000003d
It might be best to test 2.5.61 first.
* Re: problems with 2.5.61-mm1
From: Martin J. Bligh @ 2003-02-15 18:20 UTC (permalink / raw)
To: Dave Hansen, Andrew Morton; +Cc: Linux Kernel Mailing List
No, that's a kirq broke no_irq_balance thing (I presume this is NUMA-Q?).
There's a bootflag option to disable it as well, but that's broken too. I
can't do it right now, but someone needs to go through and fix all the
disable bits so they work.
--On Saturday, February 15, 2003 00:58:59 -0800 Dave Hansen
<haveblue@us.ibm.com> wrote:
> I've been beating on various versions of 2.5.59 all day long with no
> problems that I didn't cause. I started testing 2.5.61-mm1 and ran
> into a couple of problems right away.
>
> The first I really doubt is -mm specific. I get _loads_ of these, and
> the e1000 isn't working:
> NETDEV WATCHDOG: eth0: transmit timed out
> e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
>
> The e1000 driver hasn't been touched in weeks. Here's my
> /proc/interrupts: http://www.sr71.net/linux/interrupts
> I'm pretty sure we can see the problem here. Almost all interrupts are
> going to CPU0. Is this a summit thing?
* Re: problems with 2.5.61-mm1
From: Dave Hansen @ 2003-02-15 18:47 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Andrew Morton, Linux Kernel Mailing List
Martin J. Bligh wrote:
> No, that's a kirq broke no_irq_balance thing (I presume this is NUMA-Q?).
Nope, it's an 8-way Summit box.
I just booted 2.5.61, and the problem still happens there, so, not
surprisingly, it isn't just -mm.
> There's a bootflag option to disable it as well, but that's broken too. I
> can't do it right now, but someone needs to go through and fix all the
> disable bits so they work.
Disabling it is easy. Any idea what might be wrong?
--
Dave Hansen
haveblue@us.ibm.com
* Re: problems with 2.5.61-mm1
From: Martin J. Bligh @ 2003-02-15 21:11 UTC (permalink / raw)
To: Dave Hansen; +Cc: Andrew Morton, Linux Kernel Mailing List
>> No, that's a kirq broke no_irq_balance thing (I presume this is NUMA-Q?).
>
> Nope, it's an 8-way Summit box.
>
> I just booted 2.5.61, and the problem still happens there, so, not
> surprisingly, it isn't just -mm.
Ah, OK. Sorry, "assumptions > coffee" error.
>> There's a bootflag option to disable it as well, but that's broken too. I
>> can't do it right now, but someone needs to go through and fix all
>> the disable bits so they work.
>
> Disabling it is easy. Any idea what might be wrong?
Yup, lots of the code assumes things are in flat logical mode, and/or
that you can target arbitrary bitmasks of CPUs ... see the fix I sent out
yesterday for smp_affinity, for instance.
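To illustrate the difference: in flat logical mode the 8-bit APIC
destination is one bit per CPU, so any set of up to 8 CPUs can be
targeted at once. In cluster logical mode the high nibble is a cluster
ID and only the low nibble is a per-CPU bitmask, so a single
destination cannot span clusters. A rough sketch, not the kernel's
code; flat_dest(), cluster_dest() and the "CPU n lives in cluster
n / 4" mapping are assumptions made for the example:

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUS_PER_CLUSTER 4	/* assumed: low nibble holds one bit per CPU */

/* Flat logical mode: the destination is simply a bitmask of CPUs 0..7. */
static bool flat_dest(uint32_t cpu_mask, uint8_t *dest)
{
	if (cpu_mask == 0 || cpu_mask > 0xff)
		return false;
	*dest = (uint8_t)cpu_mask;
	return true;
}

/*
 * Cluster logical mode: high nibble = cluster ID, low nibble = CPU
 * bitmask within that cluster. A mask is expressible only when every
 * set bit falls inside one cluster.
 */
static bool cluster_dest(uint32_t cpu_mask, uint8_t *dest)
{
	unsigned int cluster;

	for (cluster = 0; cluster < 32 / CPUS_PER_CLUSTER; cluster++) {
		unsigned int shift = cluster * CPUS_PER_CLUSTER;
		uint32_t nibble = (cpu_mask >> shift) & 0xf;
		uint32_t rest = cpu_mask & ~((uint32_t)0xf << shift);

		if (nibble && rest == 0) {
			*dest = (uint8_t)((cluster << 4) | nibble);
			return true;
		}
	}
	return false;	/* empty mask, or CPUs from more than one cluster */
}
```

With these assumptions a mask of CPUs 0 and 4 (0x11) is fine in flat
mode but has no single cluster-mode destination, which is the kind of
case code written for flat mode gets wrong.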
M.