odd memory corruption in 2.5.27?

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* odd memory corruption in 2.5.27?
@ 2002-07-23  6:23 Zwane Mwaikambo
  2002-07-23  6:24 ` Zwane Mwaikambo
  0 siblings, 1 reply; 29+ messages in thread
From: Zwane Mwaikambo @ 2002-07-23  6:23 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linux Kernel

Hi Al,

	kernel is 2.5.27, the box was doing a 'make -j2 bzImage' at the 
time on an NFS mounted filesystem, the server is 2.4.19-re5-ac3

(gdb) list *0xc014b09c
0xc014b09c is in filp_close (open.c:834).
829      */
830     int filp_close(struct file *filp, fl_owner_t id)
831     {
832             int retval;
833
834             if (!file_count(filp)) {
835                     printk(KERN_ERR "VFS: Close: file count is 0\n");
836                     return 0;
837             }
838             retval = 0;

Unable to handle kernel NULL pointer dereference at virtual address 0000008c
 printing eip:
c014b09c
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c014b09c>]    Not tainted
EFLAGS: 00010206
eax: cea36a44   ebx: cea36920   ecx: cb235b30   edx: 00000078
esi: 00000078   edi: 00000001   ebp: cea36920   esp: ca9bfddc
ds: 0018   es: 0018   ss: 0018
Process fixdep (pid: 1015, threadinfo=ca9be000 task=c8245980)
Stack: cea36920 00000001 00000001 0000000a c01202db 00000078 cea36920 ca9be000
       c8245980 cea36920 00000000 c0120e40 00000000 c036bb00 ca9bff28 00000002
       c0314a6f 00000001 c0340018 c0310018 ffffff00 c0108622 00000010 00000202
Call Trace: [<c01202db>] [<c0120e40>] [<c0108622>] [<c0115261>] [<c014cd1d>] 
   [<c01616ac>] [<c0157cda>] [<c0159020>] [<c012f75c>] [<c0114e90>] [<c01080a0>]
   [<c0120018>] [<c014cd1d>] [<c0154970>] [<c0154ed1>] [<c014b066>] [<c01075db>]

Code: 8b 46 14 85 c0 75 12 68 c0 6e 31 c0 e8 13 1b fd ff 5a 31 c0 

>>EIP; c014b09c <filp_close+c/130>   <=====
Trace; c01202db <put_files_struct+4b/d0>
Trace; c0120e40 <do_exit+210/520>
Trace; c0108622 <die+f2/100>
Trace; c0115261 <do_page_fault+3d1/506>
Trace; c014cd1d <fget+4d/70>
Trace; c01616ac <dput+1c/240>
Trace; c0157cda <permission+1a/30>
Trace; c0159020 <link_path_walk+a30/a60>
Trace; c012f75c <zap_pmd_range+4c/60>
Trace; c0114e90 <do_page_fault+0/506>
Trace; c01080a0 <error_code+34/40>
Trace; c0120018 <will_become_orphaned_pgrp+48/d0>
Trace; c014cd1d <fget+4d/70>
Trace; c0154970 <vfs_fstat+10/40>
Trace; c0154ed1 <sys_fstat64+11/30>
Trace; c014b066 <sys_open+66/70>
Trace; c01075db <syscall_call+7/b>
Code;  c014b09c <filp_close+c/130>
00000000 <_EIP>:
Code;  c014b09c <filp_close+c/130>   <=====
   0:   8b 46 14                  mov    0x14(%esi),%eax   <=====
Code;  c014b09f <filp_close+f/130>
   3:   85 c0                     test   %eax,%eax
Code;  c014b0a1 <filp_close+11/130>
   5:   75 12                     jne    19 <_EIP+0x19> c014b0b5 <filp_close+25/130>
Code;  c014b0a3 <filp_close+13/130>
   7:   68 c0 6e 31 c0            push   $0xc0316ec0
Code;  c014b0a8 <filp_close+18/130>
   c:   e8 13 1b fd ff            call   fffd1b24 <_EIP+0xfffd1b24> c011cbc0 <printk+0/210>
Code;  c014b0ad <filp_close+1d/130>
  11:   5a                        pop    %edx
Code;  c014b0ae <filp_close+1e/130>
  12:   31 c0                     xor    %eax,%eax

-- 
function.linuxpower.ca


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: odd memory corruption in 2.5.27?
  2002-07-23  6:23 odd memory corruption in 2.5.27? Zwane Mwaikambo
@ 2002-07-23  6:24 ` Zwane Mwaikambo
  2002-07-23  6:26   ` Zwane Mwaikambo
  0 siblings, 1 reply; 29+ messages in thread
From: Zwane Mwaikambo @ 2002-07-23  6:24 UTC (permalink / raw)
  To: Alexander Viro; +Cc: Linux Kernel

Hi Al,

Hmm this is wild pointer #2, whats -really- freaky is that its the same 
address 0x14(00000078), this was also over NFS.

Unable to handle kernel NULL pointer dereference at virtual address 0000008c
c014cd1d
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<c014cd1d>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: ccefb4c0   ebx: 00000078   ecx: 00000003   edx: ccf30000
esi: ccefb39c   edi: 00000003   ebp: bfffe998   esp: ccf31f5c
ds: 0018   es: 0018   ss: 0018
Stack: ccf31f7c fffffff7 c0154970 ccf31f7c bfffe9e0 c0154ed1 00000003 ccf31f7c
       00000246 00000246 c4ca7000 00000001 00000000 00000000 00000400 00000078
       00000003 c4ca7000 bfffe9c8 c014b066 c16b6a30 c4ca7000 ccf30000 bfffe9e0
Call Trace: [<c0154970>] [<c0154ed1>] [<c014b066>] [<c01075db>]
Code: f0 ff 43 14 f0 ff 46 04 ff 4a 10 8b 42 08 83 e0 08 74 05 e8

>>EIP; c014cd1d <fget+4d/70>   <=====
Trace; c0154970 <vfs_fstat+10/40>
Trace; c0154ed1 <sys_fstat64+11/30>
Trace; c014b066 <sys_open+66/70>
Trace; c01075db <syscall_call+7/b>
Code;  c014cd1d <fget+4d/70>
00000000 <_EIP>:
Code;  c014cd1d <fget+4d/70>   <=====
   0:   f0 ff 43 14               lock incl 0x14(%ebx)   <=====
Code;  c014cd21 <fget+51/70>
   4:   f0 ff 46 04               lock incl 0x4(%esi)
Code;  c014cd25 <fget+55/70>
   8:   ff 4a 10                  decl   0x10(%edx)
Code;  c014cd28 <fget+58/70>
   b:   8b 42 08                  mov    0x8(%edx),%eax
Code;  c014cd2b <fget+5b/70>
   e:   83 e0 08                  and    $0x8,%eax
Code;  c014cd2e <fget+5e/70>
  11:   74 05                     je     18 <_EIP+0x18> c014cd35 <fget+65/70>
Code;  c014cd30 <fget+60/70>

-- 
function.linuxpower.ca


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: odd memory corruption in 2.5.27?
  2002-07-23  6:24 ` Zwane Mwaikambo
@ 2002-07-23  6:26   ` Zwane Mwaikambo
  2002-07-23  7:57     ` Trond Myklebust
  0 siblings, 1 reply; 29+ messages in thread
From: Zwane Mwaikambo @ 2002-07-23  6:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel, Alexander Viro

Hi Trond, Arnaldo, Al
	I tried reproducing using a local filesystem and couldn't 
(machine survived 3 make -j10 kernel compiles). Here is another oops for 
the collection. Al i'll remove you from further CCs now.

client: 2.5.27-serial, 3c905B
server: 2.4.19-pre5-ac3, 3c905B
connection: 100Mb/FD

I got this message before it oopsed;
RPC: garbage, exit EIO

Unable to handle kernel paging request at virtual address 5f707275
c013f822
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c013f822>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 5f707261   ebx: 00000001   ecx: 5f707261   edx: 00000000
esi: c826d2e4   edi: c826d2e4   ebp: c038c000   esp: c038deb0
ds: 0018   es: 0018   ss: 0018
Stack: c028d4fe c826d2e4 cea012cc c028d53c c826d2e4 cde2ca04 cea012cc c028d67d
       c826d2e4 c02a2b00 c02a28d3 c826d2e4 c8e0ae2c c02a2b00 c038c000 c02a2c21
       cea012cc c8e0ae2c c02a2c8b cea012cc 00000001 00000000 cc00fc0c 00000000
Call Trace: [<c028d4fe>] [<c028d53c>] [<c028d67d>] [<c02a2b00>] [<c02a28d3>]
   [<c02a2b00>] [<c02a2c21>] [<c02a2c8b>] [<c0126614>] [<c0122950>] [<c01227e0>]   [<c01224eb>] [<c0109d82>] [<c01053a0>] [<c0107f2a>] [<c01053a0>] [<c01053a0>]   [<c01053ca>] [<c0105472>] [<c0105000>]
Code: 8b 41 14 a9 00 08 00 00 75 14 f0 ff 49 10 0f 94 c0 84 c0 74

>>EIP; c013f822 <__free_pages+2/30>   <=====
Trace; c028d4fe <skb_release_data+1e/80>
Trace; c028d53c <skb_release_data+5c/80>
Trace; c028d67d <__kfree_skb+ad/e0>
Trace; c02a2b00 <ip_evictor+1d0/200>
Trace; c02a28d3 <ip_frag_destroy+43/a0>
Trace; c02a2b00 <ip_evictor+1d0/200>
Trace; c02a2c21 <ip_expire+f1/1a0>
Trace; c02a2c8b <ip_expire+15b/1a0>
Trace; c0126614 <timer_bh+324/400>
Trace; c0122950 <bh_action+70/140>
Trace; c01227e0 <tasklet_hi_action+70/c0>
Trace; c01224eb <do_softirq+7b/f0>
Trace; c0109d82 <do_IRQ+1d2/1e0>
Trace; c01053a0 <default_idle+0/40>
Trace; c0107f2a <common_interrupt+22/28>
Trace; c01053a0 <default_idle+0/40>
Trace; c01053a0 <default_idle+0/40>
Trace; c01053ca <default_idle+2a/40>
Trace; c0105472 <cpu_idle+52/70>
Trace; c0105000 <_stext+0/0>
Code;  c013f822 <__free_pages+2/30>
00000000 <_EIP>:
Code;  c013f822 <__free_pages+2/30>   <=====
   0:   8b 41 14                  mov    0x14(%ecx),%eax   <=====
Code;  c013f825 <__free_pages+5/30>
   3:   a9 00 08 00 00            test   $0x800,%eax
Code;  c013f82a <__free_pages+a/30>
   8:   75 14                     jne    1e <_EIP+0x1e> c013f840 <__free_pages+20/30>
Code;  c013f82c <__free_pages+c/30>
   a:   f0 ff 49 10               lock decl 0x10(%ecx)
Code;  c013f830 <__free_pages+10/30>
   e:   0f 94 c0                  sete   %al
Code;  c013f833 <__free_pages+13/30>
  11:   84 c0                     test   %al,%al
Code;  c013f835 <__free_pages+15/30>
  13:   74 00                     je     15 <_EIP+0x15> c013f837 <__free_pages+17/30>

 <0>Kernel panic: Aiee, killing interrupt handler!


-- 
function.linuxpower.ca


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: odd memory corruption in 2.5.27?
  2002-07-23  6:26   ` Zwane Mwaikambo
@ 2002-07-23  7:57     ` Trond Myklebust
  2002-07-23  8:54       ` Zwane Mwaikambo
  0 siblings, 1 reply; 29+ messages in thread
From: Trond Myklebust @ 2002-07-23  7:57 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: Linux Kernel, Alexander Viro

On Tuesday 23 July 2002 08:26, Zwane Mwaikambo wrote:
> Hi Trond, Arnaldo, Al
> 	I tried reproducing using a local filesystem and couldn't
> (machine survived 3 make -j10 kernel compiles). Here is another oops for
> the collection. Al i'll remove you from further CCs now.
>
> client: 2.5.27-serial, 3c905B
> server: 2.4.19-pre5-ac3, 3c905B
> connection: 100Mb/FD
>
> I got this message before it oopsed;
> RPC: garbage, exit EIO

Just means that some RPC message reply from the server was crap. We should 
deal fine with that sort of thing...

AFAICS The Oops itself happened deep down in the socket layer in the part 
which has to do with reassembling fragments into packets. The garbage 
collector tried to release a fragment that had timed out and Oopsed.

Suggests either memory corruption or else that the networking driver is doing 
something odd ('cos at that point in the socket layer *only* the driver + the 
fragment handler should have touched the skb).

Cheers,
  Trond

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: odd memory corruption in 2.5.27?
  2002-07-23  7:57     ` Trond Myklebust
@ 2002-07-23  8:54       ` Zwane Mwaikambo
  2002-07-23 20:32         ` george anzinger
  0 siblings, 1 reply; 29+ messages in thread
From: Zwane Mwaikambo @ 2002-07-23  8:54 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux Kernel

On Tue, 23 Jul 2002, Trond Myklebust wrote:

> Just means that some RPC message reply from the server was crap. We should 
> deal fine with that sort of thing...
> 
> AFAICS The Oops itself happened deep down in the socket layer in the part 
> which has to do with reassembling fragments into packets. The garbage 
> collector tried to release a fragment that had timed out and Oopsed.
> 
> Suggests either memory corruption or else that the networking driver is doing 
> something odd ('cos at that point in the socket layer *only* the driver + the 
> fragment handler should have touched the skb).

Thanks, that helps quite a bit, i'll see if i can pinpoint it and send it 
to the relevant people.

Thanks,
	Zwane

-- 
function.linuxpower.ca


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: odd memory corruption in 2.5.27?
  2002-07-23  8:54       ` Zwane Mwaikambo
@ 2002-07-23 20:32         ` george anzinger
  2002-07-23 20:47           ` William Lee Irwin III
                             ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: george anzinger @ 2002-07-23 20:32 UTC (permalink / raw)
  To: Zwane Mwaikambo; +Cc: Trond Myklebust, Linux Kernel

Zwane Mwaikambo wrote:
> 
> On Tue, 23 Jul 2002, Trond Myklebust wrote:
> 
> > Just means that some RPC message reply from the server was crap. We should
> > deal fine with that sort of thing...
> >
> > AFAICS The Oops itself happened deep down in the socket layer in the part
> > which has to do with reassembling fragments into packets. The garbage
> > collector tried to release a fragment that had timed out and Oopsed.
> >
> > Suggests either memory corruption or else that the networking driver is doing
> > something odd ('cos at that point in the socket layer *only* the driver + the
> > fragment handler should have touched the skb).
> 
> Thanks, that helps quite a bit, i'll see if i can pinpoint it and send it
> to the relevant people.
> 
I just spent a month tracking down this issue.  It comes
down to the slab allocater using per cpu data structures and
protecting them with a combination of interrupt disables and
spin_locks.  Preemption is allowed (incorrectly) if
interrupts are off and preempt_count goes to zero on the
spin_unlock.  I will wager that this is an SMP machine. 
After the preemption interrupts will be on (schedule() does
that) AND you could be on a different cpu.  Either of these
is a BAD thing.

The proposed fix is to catch the attempted preemption in
preempt_schedule() and just return if the interrupt system
is off.  (Of course there is more that this to it, but I do
believe that the problem is known.  You could blow this
assertion out of the water by asserting that the machine is
NOT smp.)
-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: odd memory corruption in 2.5.27?
  2002-07-23 20:32         ` george anzinger
@ 2002-07-23 20:47           ` William Lee Irwin III
  2002-07-23 23:28           ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Ingo Molnar
  2002-07-24  6:52           ` odd memory corruption in 2.5.27? Zwane Mwaikambo
  2 siblings, 0 replies; 29+ messages in thread
From: William Lee Irwin III @ 2002-07-23 20:47 UTC (permalink / raw)
  To: george anzinger; +Cc: Zwane Mwaikambo, Trond Myklebust, Linux Kernel

On Tue, Jul 23, 2002 at 01:32:11PM -0700, george anzinger wrote:
> I just spent a month tracking down this issue.  It comes
> down to the slab allocater using per cpu data structures and
> protecting them with a combination of interrupt disables and
> spin_locks.  Preemption is allowed (incorrectly) if
> interrupts are off and preempt_count goes to zero on the
> spin_unlock.  I will wager that this is an SMP machine. 
> After the preemption interrupts will be on (schedule() does
> that) AND you could be on a different cpu.  Either of these
> is a BAD thing.
> The proposed fix is to catch the attempted preemption in
> preempt_schedule() and just return if the interrupt system
> is off.  (Of course there is more that this to it, but I do
> believe that the problem is known.  You could blow this
> assertion out of the water by asserting that the machine is
> NOT smp.)

I've been seeing the slab allocator race like mad already (i.e.
BUG at slab.c:1947 and slab.c:1961. I'll test this fix.


Thanks,
Bill

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]
  2002-07-23 20:32         ` george anzinger
  2002-07-23 20:47           ` William Lee Irwin III
@ 2002-07-23 23:28           ` Ingo Molnar
  2002-07-23 23:53             ` george anzinger
                               ` (2 more replies)
  2002-07-24  6:52           ` odd memory corruption in 2.5.27? Zwane Mwaikambo
  2 siblings, 3 replies; 29+ messages in thread
From: Ingo Molnar @ 2002-07-23 23:28 UTC (permalink / raw)
  To: george anzinger
  Cc: Zwane Mwaikambo, Trond Myklebust, Linus Torvalds, Linux Kernel


On Tue, 23 Jul 2002, george anzinger wrote:

> I just spent a month tracking down this issue.  It comes
> down to the slab allocater using per cpu data structures and
> protecting them with a combination of interrupt disables and
> spin_locks.  Preemption is allowed (incorrectly) if
> interrupts are off and preempt_count goes to zero on the
> spin_unlock. [...]

> The proposed fix is to catch the attempted preemption in
> preempt_schedule() and just return if the interrupt system
> is off. [...]

this is most definitely not the correct fix ...

i'm quite convinced that the fix is to avoid illegal preemption, not to
work it around.

i've written debugging code that caught and reported this slab.c bug
within minutes. The code detects the irqs-off condition in schedule(). You
can find it my latest irqlock patchset, at:

   http://redhat.com/~mingo/remove-irqlock-patches/remove-irqlock-2.5.27-G3

it fixes this and other related bugs as well.

Changes in -G3:

 - slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
   also has to check for preemption in the right spot.) This should fix
   the memory corruption.

 - irq_exit() needs to run softirqs if interrupts not active - in the 
   previous patch it ran them when preempt_count() was 0, which is
   incorrect.

 - spinlock macros are updated to enable preemption after enabling 
   interrupts. Besides avoiding false positive warnings, this also 

 - fork.c has to call scheduler_tick() with preemption disabled - 
   otherwise scheduler_tick()'s spin_unlock can preempt!

 - irqs_disabled() macro introduced.

 - [ all other local_irq_enable() or sti instances conditional on
     CONFIG_DEBUG_IRQ_SCHEDULE are to fix false positive warnings. ]

Changes in -G0:

 - fix buggy in_softirq(). Fortunately the bug made the test broader,
   which didnt result in algorithmical breakage, just suboptimal
   performance.

 - move do_softirq() processing into irq_exit() => this also fixes the
   softirq processing bugs present in apic.c IRQ handlers that did not
   test for softirqs after irq_exit().

 - simplify local_bh_enable().

Changes in -F9:

 - replace all instances of:

	local_save_flags(flags);
	local_irq_disable();

   with the shorter form of:

	local_irq_save(flags);

  about 30 files are affected by this change.

Changes in -F8:

 - preempt/hardirq/softirq count separation, cleanups.

 - skbuff.c fix.

 - use irq_count() in scheduler_tick()

Changes in -F3:

 - the entry.S cleanups/speedups by Oleg Nesterov.

 - a rather critical synchronize_irq() bugfix: if a driver frees an 
   interrupt that is still being probed then synchronize_irq() locks up.
   This bug has caused a spurious boot-lockup on one of my testsystems,
   ifconfig would lock up trying to close eth0.

 - remove duplicate definitions from asm-i386/system.h, this fixes 
   compiler warnings.

	Ingo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in  2.5.27?]
  2002-07-23 23:28           ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Ingo Molnar
@ 2002-07-23 23:53             ` george anzinger
  2002-07-23 23:56             ` Linus Torvalds
  2002-07-24  1:08             ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Robert Love
  2 siblings, 0 replies; 29+ messages in thread
From: george anzinger @ 2002-07-23 23:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Zwane Mwaikambo, Trond Myklebust, Linus Torvalds, Linux Kernel

Ingo Molnar wrote:
> 
> On Tue, 23 Jul 2002, george anzinger wrote:
> 
> > I just spent a month tracking down this issue.  It comes
> > down to the slab allocater using per cpu data structures and
> > protecting them with a combination of interrupt disables and
> > spin_locks.  Preemption is allowed (incorrectly) if
> > interrupts are off and preempt_count goes to zero on the
> > spin_unlock. [...]
> 
> > The proposed fix is to catch the attempted preemption in
> > preempt_schedule() and just return if the interrupt system
> > is off. [...]
> 
> this is most definitely not the correct fix ...
> 
> i'm quite convinced that the fix is to avoid illegal preemption, not to
> work it around.

I like this.  The only change I would make is to enable
interrupts in exit.c and entry.S where they are only enabled
under the debug condition now, (mostly I try to avoid
Heisenberg).  I really do like the ability to track down
this problem by just turing on the debug option.
> 
> i've written debugging code that caught and reported this slab.c bug
> within minutes. 

Yeah, its easy when you know what to look for :)

> The code detects the irqs-off condition in schedule(). You
> can find it my latest irqlock patchset, at:
> 
>    http://redhat.com/~mingo/remove-irqlock-patches/remove-irqlock-2.5.27-G3
> 
> it fixes this and other related bugs as well.
> 
> Changes in -G3:
> 
>  - slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
>    also has to check for preemption in the right spot.) This should fix
>    the memory corruption.
> 
>  - irq_exit() needs to run softirqs if interrupts not active - in the
>    previous patch it ran them when preempt_count() was 0, which is
>    incorrect.
> 
>  - spinlock macros are updated to enable preemption after enabling
>    interrupts. Besides avoiding false positive warnings, this also
> 
>  - fork.c has to call scheduler_tick() with preemption disabled -
>    otherwise scheduler_tick()'s spin_unlock can preempt!
> 
>  - irqs_disabled() macro introduced.
> 
>  - [ all other local_irq_enable() or sti instances conditional on
>      CONFIG_DEBUG_IRQ_SCHEDULE are to fix false positive warnings. ]
> 
> Changes in -G0:
> 
>  - fix buggy in_softirq(). Fortunately the bug made the test broader,
>    which didnt result in algorithmical breakage, just suboptimal
>    performance.
> 
>  - move do_softirq() processing into irq_exit() => this also fixes the
>    softirq processing bugs present in apic.c IRQ handlers that did not
>    test for softirqs after irq_exit().
> 
>  - simplify local_bh_enable().
> 
> Changes in -F9:
> 
>  - replace all instances of:
> 
>         local_save_flags(flags);
>         local_irq_disable();
> 
>    with the shorter form of:
> 
>         local_irq_save(flags);
> 
>   about 30 files are affected by this change.
> 
> Changes in -F8:
> 
>  - preempt/hardirq/softirq count separation, cleanups.
> 
>  - skbuff.c fix.
> 
>  - use irq_count() in scheduler_tick()
> 
> Changes in -F3:
> 
>  - the entry.S cleanups/speedups by Oleg Nesterov.
> 
>  - a rather critical synchronize_irq() bugfix: if a driver frees an
>    interrupt that is still being probed then synchronize_irq() locks up.
>    This bug has caused a spurious boot-lockup on one of my testsystems,
>    ifconfig would lock up trying to close eth0.
> 
>  - remove duplicate definitions from asm-i386/system.h, this fixes
>    compiler warnings.
> 
>         Ingo

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]
  2002-07-23 23:28           ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Ingo Molnar
  2002-07-23 23:53             ` george anzinger
@ 2002-07-23 23:56             ` Linus Torvalds
  2002-07-24  0:07               ` Ingo Molnar
  2002-07-24  1:08             ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Robert Love
  2 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2002-07-23 23:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: george anzinger, Zwane Mwaikambo, Trond Myklebust, Linux Kernel

On Wed, 24 Jul 2002, Ingo Molnar wrote:
>
>  - slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
>    also has to check for preemption in the right spot.) This should fix
>    the memory corruption.

That cannot be right.

If we want to drop a spinlock but remain non-preemptible, we should
comment that a _lot_, and not just say "xxxx_no_resched()".

In fact, I personally think that every "spin_unlock_no_resched()" is an
outright BUG. Either the spin_unlock() makes us preemptible (in which case
it doesn't matter from a correctness point whether we schedule
immediately, or whether something else like a vmalloc fault might force us
to schedule soon afterwards), or the spin_unlock is doing something
magical, and we depend on the preemptability to not change.

In the latter case (which should be very very rare indeed), we should just
use

	/* BIG comment about what we're doing. */
	/* We're dropping the spinlock, but we remain non-preemptable */
	__raw_spin_unlock(..);

and then later on, when preemptability is over, we do

	local_irq_enable();
	preempt_enable();

so that we _clearly_ mark out the region where we must not re-schedule.

It is simply not acceptable to just play games with disabling interrupts,
and magically "knowing" that we're not preemptable without making that
clear some way.

Please get rid of spin_unlock_no_schedule() and friends, ok?

			Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]
  2002-07-23 23:56             ` Linus Torvalds
@ 2002-07-24  0:07               ` Ingo Molnar
  2002-07-24  2:15                 ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  0:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: george anzinger, Zwane Mwaikambo, Trond Myklebust, Linux Kernel


On Tue, 23 Jul 2002, Linus Torvalds wrote:

> 	/* BIG comment about what we're doing. */
> 	/* We're dropping the spinlock, but we remain non-preemptable */
> 	__raw_spin_unlock(..);
> 
> and then later on, when preemptability is over, we do
> 
> 	local_irq_enable();
> 	preempt_enable();
> 
> so that we _clearly_ mark out the region where we must not re-schedule.

agreed, this is much nicer.

i removed the spin_unlock_no_resched() define, and modified the slab.c fix
(and the other spin_unlock_no_resched() places) to be more verbose about
what they do.

we still have preempt_enable_no_resched() - that is legitimately needed in
a number of cases - i've added comments to make its usage clear.

these cleanups are in the -G5 patch:

  http://redhat.com/~mingo/remove-irqlock-patches/remove-irqlock-2.5.27-G5

i've test-booted BK-curr + G5, CONFIG_SMP, CONFIG_PREEMPT and
CONFIG_DEBUG_IRQ_SCHEDULE kernel on an SMP box, it works fine.

	Ingo


Changes in -G3:

 - slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
   also has to check for preemption in the right spot.) This should fix
   the memory corruption.

 - irq_exit() needs to run softirqs if interrupts not active - in the 
   previous patch it ran them when preempt_count() was 0, which is
   incorrect.

 - spinlock macros are updated to enable preemption after enabling 
   interrupts. Besides avoiding false positive warnings, this also 

 - fork.c has to call scheduler_tick() with preemption disabled - 
   otherwise scheduler_tick()'s spin_unlock can preempt!

 - irqs_disabled() macro introduced.

 - [ all other local_irq_enable() or sti instances conditional on
     CONFIG_DEBUG_IRQ_SCHEDULE are to fix false positive warnings. ]

Changes in -G0:

 - fix buggy in_softirq(). Fortunately the bug made the test broader,
   which didnt result in algorithmical breakage, just suboptimal
   performance.

 - move do_softirq() processing into irq_exit() => this also fixes the
   softirq processing bugs present in apic.c IRQ handlers that did not
   test for softirqs after irq_exit().

 - simplify local_bh_enable().

Changes in -F9:

 - replace all instances of:

	local_save_flags(flags);
	local_irq_disable();

   with the shorter form of:

	local_irq_save(flags);

  about 30 files are affected by this change.

Changes in -F8:

 - preempt/hardirq/softirq count separation, cleanups.

 - skbuff.c fix.

 - use irq_count() in scheduler_tick()

Changes in -F3:

 - the entry.S cleanups/speedups by Oleg Nesterov.

 - a rather critical synchronize_irq() bugfix: if a driver frees an 
   interrupt that is still being probed then synchronize_irq() locks up.
   This bug has caused a spurious boot-lockup on one of my testsystems,
   ifconfig would lock up trying to close eth0.

 - remove duplicate definitions from asm-i386/system.h, this fixes 
   compiler warnings.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]
  2002-07-24  0:07               ` Ingo Molnar
@ 2002-07-24  2:15                 ` Linus Torvalds
  2002-07-24  8:59                   ` [patch] irqlock patch 2.5.27-H3 Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2002-07-24  2:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: george anzinger, Zwane Mwaikambo, Trond Myklebust, Linux Kernel

Ingo,
 you really shouldn't use a generic #define like IRQ_MASK for something as
obscure as the mask for preemption bits (that apparently gets used
exactly _once_ in the same header file that defines it).

That #define is already used in the kernel inside various files, and from
the things I looked at, other users had more reason to call their stuff
IRQ_MASK than the new code has.

Also, please don't do things like

	#define NR_PREEMPT 256

when what you really are doing is doling out bits rather than "numbers".

So I think you'd be better off doing

	#define PREEMPT_BITS	8
	#define HARDIRQ_BITS	8
	#define SOFTIRQ_BITS	8

	#define PREEMPT_SHIFT	0
	#define HARDIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
	#define SOFTIRQ_SHIFT	(HARDIRQ_SHIFT + PREEMPT_BITS)

	#define __MASK(x)	((1UL << (x))-1)

	#define PREEMPT_MASK	(__MASK(PREEMPT_BITS) << PREEMPT_SHIFT)
	#define HARDIRQ_MASK	(__MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
	#define SOFTIRQ_MASK	(__MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)

	#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
	#define softirq_count()	(preempt_count() & SOFTIRQ_MASK)
	#define irq_count()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))

which creates bitmasks from bit-operations rather than doing non-bitwise
arithmetic to get them from magic values that have to be powers-of-2.

And avoids using too generic a name (ie IRQ_MASK is gone).

Ehh?

		Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch] irqlock patch 2.5.27-H3.
  2002-07-24  2:15                 ` Linus Torvalds
@ 2002-07-24  8:59                   ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  8:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: george anzinger, Zwane Mwaikambo, Trond Myklebust, Linux Kernel


On Tue, 23 Jul 2002, Linus Torvalds wrote:

>  you really shouldn't use a generic #define like IRQ_MASK for something as
> obscure as the mask for preemption bits (that apparently gets used
> exactly _once_ in the same header file that defines it).
> 
> That #define is already used in the kernel inside various files, and from
> the things I looked at, other users had more reason to call their stuff
> IRQ_MASK than the new code has.

(doh - and i've specifically checked the new names for namespace collision
because they looked too generic - apparently not well enough.)

> So I think you'd be better off doing
> 
> 	#define PREEMPT_BITS	8
> 	#define HARDIRQ_BITS	8
> 	#define SOFTIRQ_BITS	8
> 
> 	#define PREEMPT_SHIFT	0
> 	#define HARDIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
> 	#define SOFTIRQ_SHIFT	(HARDIRQ_SHIFT + PREEMPT_BITS)
> 
> 	#define __MASK(x)	((1UL << (x))-1)
> 
> 	#define PREEMPT_MASK	(__MASK(PREEMPT_BITS) << PREEMPT_SHIFT)
> 	#define HARDIRQ_MASK	(__MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
> 	#define SOFTIRQ_MASK	(__MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
> 
> 	#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
> 	#define softirq_count()	(preempt_count() & SOFTIRQ_MASK)
> 	#define irq_count()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))
> 
> which creates bitmasks from bit-operations rather than doing non-bitwise
> arithmetic to get them from magic values that have to be powers-of-2.

yeah, agreed.

i did one more modification in this area, just for educational purposes
it's a better ordering to have:

	- preempt count
	- softirq count
	- hardirq count

the higher positioned the bitmask, the 'stronger' and more atomic the
execution concept is. Latest patch (against BK-curr) is at:

   http://redhat.com/~mingo/remove-irqlock-patches/remove-irqlock-2.5.27-H3

Changes in -H3:

 - init thread needs to have preempt_count of 1 until sched_init(). 
   (William Lee Irwin III)

 - clean up the irq-mask macros. (Linus)

 - add barrier() to irq_enter() and irq_exit(). (based on Oleg Nesterov's
   comment.)

 - move the irqs-off check into preempt_schedule() and remove
   CONFIG_DEBUG_IRQ_SCHEDULE.

Changes in -G5:

 - remove spin_unlock_no_resched() and comment the affected places more
   agressively.

Changes in -G3:

 - slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
   also has to check for preemption in the right spot.) This should fix
   the memory corruption.

 - irq_exit() needs to run softirqs if interrupts not active - in the 
   previous patch it ran them when preempt_count() was 0, which is
   incorrect.

 - spinlock macros are updated to enable preemption after enabling 
   interrupts. Besides avoiding false positive warnings, this also 

 - fork.c has to call scheduler_tick() with preemption disabled - 
   otherwise scheduler_tick()'s spin_unlock can preempt!

 - irqs_disabled() macro introduced.

 - [ all other local_irq_enable() or sti instances conditional on
     CONFIG_DEBUG_IRQ_SCHEDULE are to fix false positive warnings. ]

Changes in -G0:

 - fix buggy in_softirq(). Fortunately the bug made the test broader,
   which didnt result in algorithmical breakage, just suboptimal
   performance.

 - move do_softirq() processing into irq_exit() => this also fixes the
   softirq processing bugs present in apic.c IRQ handlers that did not
   test for softirqs after irq_exit().

 - simplify local_bh_enable().

Changes in -F9:

 - replace all instances of:

	local_save_flags(flags);
	local_irq_disable();

   with the shorter form of:

	local_irq_save(flags);

  about 30 files are affected by this change.

Changes in -F8:

 - preempt/hardirq/softirq count separation, cleanups.

 - skbuff.c fix.

 - use irq_count() in scheduler_tick()

Changes in -F3:

 - the entry.S cleanups/speedups by Oleg Nesterov.

 - a rather critical synchronize_irq() bugfix: if a driver frees an 
   interrupt that is still being probed then synchronize_irq() locks up.
   This bug has caused a spurious boot-lockup on one of my testsystems,
   ifconfig would lock up trying to close eth0.

 - remove duplicate definitions from asm-i386/system.h, this fixes 
   compiler warnings.

	Ingo



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]
  2002-07-23 23:28           ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Ingo Molnar
  2002-07-23 23:53             ` george anzinger
  2002-07-23 23:56             ` Linus Torvalds
@ 2002-07-24  1:08             ` Robert Love
  2002-07-24  3:13               ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Andrew Morton
  2 siblings, 1 reply; 29+ messages in thread
From: Robert Love @ 2002-07-24  1:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: george anzinger, Zwane Mwaikambo, Trond Myklebust, Linus Torvalds,
	Linux Kernel

On Tue, 2002-07-23 at 16:28, Ingo Molnar wrote:

> this is most definitely not the correct fix ...
> 
> i'm quite convinced that the fix is to avoid illegal preemption, not to
> work it around.

I am not sure I am fully convinced one way or the other, but treating
every bit of code as we find it scares me.  The fact is, if a
spin_unlock() can magically reenable interrupts that is a bug.

I don't like relying on chance and the possibility your debug tool found
the problem... but at the same time, Ingo's solution is a lot cleaner.

Linus, Ingo, comments?

Attached is the patch George mentioned, against 2.5.27.

	Robert Love

diff -urN linux-2.5.27/include/asm-i386/system.h linux/include/asm-i386/system.h
--- linux-2.5.27/include/asm-i386/system.h	Sat Jul 20 12:11:05 2002
+++ linux/include/asm-i386/system.h	Tue Jul 23 18:03:47 2002
@@ -270,6 +270,13 @@
 /* Compiling for a 386 proper.	Is it worth implementing via cli/sti?  */
 #endif
 
+#define MASK_IF			0x200
+#define interrupts_enabled()	({ \
+		int flg; \
+		__save_flags(flg); \
+		flg & MASK_IF; \
+})
+
 /*
  * Force strict CPU ordering.
  * And yes, this is required on UP too when we're talking
diff -urN linux-2.5.27/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.27/kernel/sched.c	Sat Jul 20 12:11:11 2002
+++ linux/kernel/sched.c	Tue Jul 23 18:02:13 2002
@@ -899,7 +899,7 @@
 {
 	struct thread_info *ti = current_thread_info();
 
-	if (unlikely(ti->preempt_count))
+	if (unlikely(ti->preempt_count || !interrupts_enabled()))
 		return;
 
 need_resched:


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?]
  2002-07-24  1:08             ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Robert Love
@ 2002-07-24  3:13               ` Andrew Morton
  2002-07-24  3:18                 ` Linus Torvalds
                                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Andrew Morton @ 2002-07-24  3:13 UTC (permalink / raw)
  To: Robert Love
  Cc: Ingo Molnar, george anzinger, Zwane Mwaikambo, Trond Myklebust,
	Linus Torvalds, Linux Kernel

Robert Love wrote:
> 
> ...
> Attached is the patch George mentioned, against 2.5.27.
> 

I'd agree with Robert on that.  The idea should be that,
as much as possible, preempt "just works" as long as the
programmer sticks with standard SMP programming practices.

And yet here we have a case where a spin_unlock() will
go and turn on local interrupts.  Only with CONFIG_PREEMPT,
and even then, extremely rarely.

Plus it would be nice to have an arch-independent way of
querying whether the current CPU has interrupts enabled.
I've needed that several times within debug/devel code.

Robert and George's patch doesn't seem to be optimal though - if
we're not going to preempt at spin_unlock() time, we need to
preempt at local_irq_restore() time.  It'll be untrivial to fix
all this, but this very subtle change to the locking semantics
with CONFIG_PREEMPT is quite nasty.

-

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?]
  2002-07-24  3:13               ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Andrew Morton
@ 2002-07-24  3:18                 ` Linus Torvalds
  2002-07-24  7:13                   ` Ingo Molnar
  2002-07-24  7:34                 ` Ingo Molnar
  2002-07-24  7:37                 ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Ingo Molnar
  2 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2002-07-24  3:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robert Love, Ingo Molnar, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linux Kernel

On Tue, 23 Jul 2002, Andrew Morton wrote:
>
> And yet here we have a case where a spin_unlock() will
> go and turn on local interrupts.  Only with CONFIG_PREEMPT,
> and even then, extremely rarely.

I think that's just a bug, the same way it was a bug that preemtion would
sometimes set tsk->state to TASK_RUNNING.

I think Robert already sent a fix: make "preempt_schedule()" refuse to
schedule if local interrupts are disabled.

That, together with making it a warning (so that we can _fix_ places that
have unbalanced irq/spinlock behaviour) shoul dbe fine. Eventually, if we
think all places are fixed, we can remove the test from
preempt_schedule().

		Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?]
  2002-07-24  3:18                 ` Linus Torvalds
@ 2002-07-24  7:13                   ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  7:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Robert Love, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linux Kernel

On Tue, 23 Jul 2002, Linus Torvalds wrote:

> > And yet here we have a case where a spin_unlock() will
> > go and turn on local interrupts.  Only with CONFIG_PREEMPT,
> > and even then, extremely rarely.
> 
> I think that's just a bug, the same way it was a bug that preemtion would
> sometimes set tsk->state to TASK_RUNNING.
> 
> I think Robert already sent a fix: make "preempt_schedule()" refuse to
> schedule if local interrupts are disabled.

my problem with Robert's patch is that the intention is not debugging, the
intention of the change was make it the standard thing. This just hides
serious bugs like the one in slab.c. I'd suggest to rather fix these bugs
and be aware of them via a debugging mechanism, instead of putting one
more (not quite cheap) check into one of our hotpaths.

> That, together with making it a warning (so that we can _fix_ places
> that have unbalanced irq/spinlock behaviour) shoul dbe fine. [...]

yep - i've moved the check from schedule() to preempt_schedule(), which
clearly is the most serious offender. This enabled the removal of the
CONFIG_DEBUG_IRQ_SCHEDULE define.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?]
  2002-07-24  3:13               ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Andrew Morton
  2002-07-24  3:18                 ` Linus Torvalds
@ 2002-07-24  7:34                 ` Ingo Molnar
  2002-07-24  8:00                   ` [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?] Andrew Morton
  2002-07-24  7:37                 ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Ingo Molnar
  2 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  7:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robert Love, george anzinger, Zwane Mwaikambo, Trond Myklebust,
	Linus Torvalds, Linux Kernel


On Tue, 23 Jul 2002, Andrew Morton wrote:

> Robert and George's patch doesn't seem to be optimal though - if we're
> not going to preempt at spin_unlock() time, we need to preempt at
> local_irq_restore() time.  It'll be untrivial to fix all this, but this
> very subtle change to the locking semantics with CONFIG_PREEMPT is quite
> nasty.

this is precisely the reason why we cannot pretend these bugs do not exist
and just work this around in preempt_schedule(). Code that relies on
cli/sti for atomicity should be pretty rare and limited, there's 1 known
case so far where it leads to bugs.

	Ingo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24  7:34                 ` Ingo Molnar
@ 2002-07-24  8:00                   ` Andrew Morton
  2002-07-24  7:54                     ` Ingo Molnar
                                       ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Andrew Morton @ 2002-07-24  8:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Robert Love, george anzinger, Zwane Mwaikambo, Trond Myklebust,
	Linus Torvalds, Linux Kernel

Ingo Molnar wrote:
> 
> On Tue, 23 Jul 2002, Andrew Morton wrote:
> 
> > Robert and George's patch doesn't seem to be optimal though - if we're
> > not going to preempt at spin_unlock() time, we need to preempt at
> > local_irq_restore() time.  It'll be untrivial to fix all this, but this
> > very subtle change to the locking semantics with CONFIG_PREEMPT is quite
> > nasty.
> 
> this is precisely the reason why we cannot pretend these bugs do not exist
> and just work this around in preempt_schedule().

But there is no bug in slab.  The bug is that spin_unlock() is
scheduling inside local_irq_disable().

> Code that relies on
> cli/sti for atomicity should be pretty rare and limited, there's 1 known
> case so far where it leads to bugs.

Are you implying that all code which does spin_unlock() inside
local_irq_disable() needs to be converted to use _raw_spin_unlock()?
If so then, umm, ugh.  I hope that the debug check is working
for CONFIG_PREEMPT=n.

BTW, what is the situation with spin_unlock_irq[restore]()?  Seems
that these will schedule inside local_irq_disable() quite a lot?

-

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24  8:00                   ` [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?] Andrew Morton
@ 2002-07-24  7:54                     ` Ingo Molnar
  2002-07-24  8:03                     ` William Lee Irwin III
  2002-07-24 16:40                     ` Linus Torvalds
  2 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  7:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robert Love, george anzinger, Zwane Mwaikambo, Trond Myklebust,
	Linus Torvalds, Linux Kernel


On Wed, 24 Jul 2002, Andrew Morton wrote:

> > Code that relies on
> > cli/sti for atomicity should be pretty rare and limited, there's 1 known
> > case so far where it leads to bugs.
> 
> Are you implying that all code which does spin_unlock() inside
> local_irq_disable() needs to be converted to use _raw_spin_unlock()? If
> so then, umm, ugh.  I hope that the debug check is working for
> CONFIG_PREEMPT=n.

yes, it works for CONFIG_PREEMT=n as well.

> BTW, what is the situation with spin_unlock_irq[restore]()?  Seems that
> these will schedule inside local_irq_disable() quite a lot?

i changed the order in my patch - and there's another valid reason for it:  
to slightly reduce the amount of time spent with irqs disabled.

	Ingo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24  8:00                   ` [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?] Andrew Morton
  2002-07-24  7:54                     ` Ingo Molnar
@ 2002-07-24  8:03                     ` William Lee Irwin III
  2002-07-24  8:06                       ` Ingo Molnar
  2002-07-24 16:40                     ` Linus Torvalds
  2 siblings, 1 reply; 29+ messages in thread
From: William Lee Irwin III @ 2002-07-24  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Robert Love, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linus Torvalds, Linux Kernel

Ingo Molnar wrote:
>> Code that relies on
>> cli/sti for atomicity should be pretty rare and limited, there's 1 known
>> case so far where it leads to bugs.

On Wed, Jul 24, 2002 at 01:00:00AM -0700, Andrew Morton wrote:
> Are you implying that all code which does spin_unlock() inside
> local_irq_disable() needs to be converted to use _raw_spin_unlock()?
> If so then, umm, ugh.  I hope that the debug check is working
> for CONFIG_PREEMPT=n.
> BTW, what is the situation with spin_unlock_irq[restore]()?  Seems
> that these will schedule inside local_irq_disable() quite a lot?

Since we're on the subject of preempt_schedule() being done at
inappropriate times, I'm seeing it being called and panicking the
machine before the per-cpu GDT, IDT, TSS, and LDT are loaded in
cpu_init() and suspect that may be a wee bit too early to be sane.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24  8:03                     ` William Lee Irwin III
@ 2002-07-24  8:06                       ` Ingo Molnar
  2002-07-24  8:15                         ` William Lee Irwin III
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  8:06 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Robert Love, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linus Torvalds, Linux Kernel


On Wed, 24 Jul 2002, William Lee Irwin III wrote:

> Since we're on the subject of preempt_schedule() being done at
> inappropriate times, I'm seeing it being called and panicking the
> machine before the per-cpu GDT, IDT, TSS, and LDT are loaded in
> cpu_init() and suspect that may be a wee bit too early to be sane.

i've fixed this in my tree: the init thread needs to start up with a
nonzero preempt_count, and schedule_init() sets it to 0. [schedule_init()  
is the point after we can schedule.]

	Ingo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24  8:06                       ` Ingo Molnar
@ 2002-07-24  8:15                         ` William Lee Irwin III
  2002-07-24  8:17                           ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: William Lee Irwin III @ 2002-07-24  8:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Robert Love, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linus Torvalds, Linux Kernel

On Wed, 24 Jul 2002, William Lee Irwin III wrote:
>> Since we're on the subject of preempt_schedule() being done at
>> inappropriate times, I'm seeing it being called and panicking the
>> machine before the per-cpu GDT, IDT, TSS, and LDT are loaded in
>> cpu_init() and suspect that may be a wee bit too early to be sane.

On Wed, Jul 24, 2002 at 10:06:28AM +0200, Ingo Molnar wrote:
> i've fixed this in my tree: the init thread needs to start up with a
> nonzero preempt_count, and schedule_init() sets it to 0. [schedule_init()  
> is the point after we can schedule.]

Sorry about the duplicated report, I must have missed the fix in the
changelogs. I've been following closely in general partly because my
machines are very sensitive to i386 interrupt changes, and I need to
help do testing and report things early as you're not likely to have a
similar machine.

Since it's where the fixes are going, it looks like I'd better use
your tree. I'll follow up once I've updated and had time to test.
(They are unfortunately very slow to boot.)

Thanks,
Bill

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24  8:15                         ` William Lee Irwin III
@ 2002-07-24  8:17                           ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  8:17 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Robert Love, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linus Torvalds, Linux Kernel


On Wed, 24 Jul 2002, William Lee Irwin III wrote:

> > i've fixed this in my tree: the init thread needs to start up with a
> > nonzero preempt_count, and schedule_init() sets it to 0. [schedule_init()  
> > is the point after we can schedule.]
> 
> Sorry about the duplicated report, I must have missed the fix in the
> changelogs. [...]

sorry - i was too compact. What i wanted to say: "i agree that this is a
bug, and based on your report i've fixed this bug in my tree, it will show
up in the next patch". Thanks!

	Ingo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24  8:00                   ` [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?] Andrew Morton
  2002-07-24  7:54                     ` Ingo Molnar
  2002-07-24  8:03                     ` William Lee Irwin III
@ 2002-07-24 16:40                     ` Linus Torvalds
  2002-07-24 16:49                       ` Robert Love
  2 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2002-07-24 16:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Robert Love, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linux Kernel

On Wed, 24 Jul 2002, Andrew Morton wrote:
>
> But there is no bug in slab.  The bug is that spin_unlock() is
> scheduling inside local_irq_disable().

I disagree.

There is no reason to believe that you can mix irq disables and
spinlocking willy nilly. The two have commonalities, and thinking that you
can build up your own special irq-safe spinlocks by hand just isn't
_true_.

The bug is conceptually definitely in slab.c - although we can't blame it
too much since we've modified some behaviour.

Just as an example, in the not too distant future what _will_ happen is
that

	spin_lock_irqsave()
	..
	spin_unlock_irqrestore()

will not necessarily increment the preemtion count. Why should they?
They've disabled local interrupts, so there's no preemption to protect
against. That's just an _obvious_ optimization.

But that optimization also means that you _have_ to pair the things
properly. It is incorrect and buggy to play pairing games. The same way
you could not pair the global irq lock with a local unlock, you cannot
play games with local_irq_disable() and spinlocks, and expect it to "just
work".

Normal code should always pair these primitives. And, in fact, 99% of all
code _does_ pair them properly, and people would consider anything else
strange.

The 1% (or less) of code that thinks it wants to be clever will have to
live with that decision. We will have code that does the preemption
regions by hand, and some day you will need to do

	spin_lock_irqsave(..)
	...
	...
	preempt_disable();
	spin_unlock(..);
	...
	...
	local_irq_restore(..);
	preempt_enable();

if you want to do what slab apparently wants to do right now.

In short, you should _always_ write out what you actually expect of the
environment. Making the assumption that "spin_lock_irqsave" is 100% the
same as "local_irq_save + preempt_disable + __spin_lock" is a BUG, simply
because it assumes something that has never been guaranteed, and is
nothign but an implementation detail.

> Are you implying that all code which does spin_unlock() inside
> local_irq_disable() needs to be converted to use _raw_spin_unlock()?
> If so then, umm, ugh.  I hope that the debug check is working
> for CONFIG_PREEMPT=n.

I'm saying that preempt will have the irq check (for safety and debugging,
and it will scream if somebody does something bad), and I'm saying
outright that if you expect to not be scheduled, you should _tell_ the
kernel so, instead of just thinking that it's implied by something else.

You can nest spinlocks any amount you want, but no, you cannot just assume
that nesting irq disables and spinlocks gives you the same thing as using
the irq-safe spinlocks.

We can easily add a debugging check to spin_unlock() that says:

	/* Somebody messed up, doesn't hold any other preemption thing
	 * than this lock that is now getting released, and has interrupts
	 * disabled
	 */
	BUG_ON(preempt_count() == 1 && interrupts_enabled())

No?

		Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?]
  2002-07-24 16:40                     ` Linus Torvalds
@ 2002-07-24 16:49                       ` Robert Love
  2002-07-24 20:56                         ` [patch] irqlock patch -G3. [was Re: odd memorycorruptionin2.5.27?] george anzinger
  0 siblings, 1 reply; 29+ messages in thread
From: Robert Love @ 2002-07-24 16:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Ingo Molnar, george anzinger, Zwane Mwaikambo,
	Trond Myklebust, Linux Kernel

On Wed, 2002-07-24 at 09:40, Linus Torvalds wrote:

> Just as an example, in the not too distant future what _will_ happen is
> that
> 
> 	spin_lock_irqsave()
> 	..
> 	spin_unlock_irqrestore()
> 
> will not necessarily increment the preemtion count. Why should they?
> They've disabled local interrupts, so there's no preemption to protect
> against. That's just an _obvious_ optimization.

So obvious it is even in my queue :)

I do not think we are ready yet - there is just way too much code that
does not pair up as you mention... but I have played with the patch and
some debugging to see just how feasible it is.  Fairly soon.

Note that other preemptible kernels (IRIX and BeOS, for example) do not
even have a preemption count -- all spinlocks also disable interrupts. 
Not that I am suggesting that, I am very fond of our preempt_count
mechanism, but its a point to consider even if it were feasible (e.g. we
did not have spinlocks held for hundreds of milliseconds).

> We can easily add a debugging check to spin_unlock() that says:
> 
> 	/* Somebody messed up, doesn't hold any other preemption thing
> 	 * than this lock that is now getting released, and has interrupts
> 	 * disabled
> 	 */
> 	BUG_ON(preempt_count() == 1 && interrupts_enabled())
> 
> No?

Pretty similar to the debugging I played with... as long as it goes away
eventually (who wants this in their unlock path?) we certainly should
add it at some point.

	Robert Love

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memorycorruptionin2.5.27?]
  2002-07-24 16:49                       ` Robert Love
@ 2002-07-24 20:56                         ` george anzinger
  0 siblings, 0 replies; 29+ messages in thread
From: george anzinger @ 2002-07-24 20:56 UTC (permalink / raw)
  To: Robert Love
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Zwane Mwaikambo,
	Trond Myklebust, Linux Kernel

Robert Love wrote:
> 
> On Wed, 2002-07-24 at 09:40, Linus Torvalds wrote:
> 
> > Just as an example, in the not too distant future what _will_ happen is
> > that
> >
> >       spin_lock_irqsave()
> >       ..
> >       spin_unlock_irqrestore()
> >
> > will not necessarily increment the preemtion count. Why should they?
> > They've disabled local interrupts, so there's no preemption to protect
> > against. That's just an _obvious_ optimization.

I thought this was the thing to do back at the beginning of
the preemption saga, but now I am not so sure.  Lately I
have been doing a bit of code in the clock routines.  These
get called from both interrupt context and from normal
context and thus need to use the IRQ locks.  The task entry
does a spin_lock_irq() and the interrupt routine KNOWS
interrupts are off and just does a spin_lock().  The inner
code just knows that it is wrapped by an IRQ off and so does
not use the IRQ version of the spin locks it needs.

The above change would auger that the "inner" code should
use _raw* spinlocks OR should use the nested IRQ locks.  But
an IRQ lock is a LOT more expensive (due to the hardware
sync implied by messing with the interrupt system) than inc
and dec on a counter.

This, by the way, is the same reason that
preempt_enable()/preemp_disable() should be preferred over
local_irq_disable()/local_irq_enable() when either would do
the job.
> 
> So obvious it is even in my queue :)
> 
> I do not think we are ready yet - there is just way too much code that
> does not pair up as you mention... but I have played with the patch and
> some debugging to see just how feasible it is.  Fairly soon.
> 
> Note that other preemptible kernels (IRIX and BeOS, for example) do not
> even have a preemption count -- all spinlocks also disable interrupts.
> Not that I am suggesting that, I am very fond of our preempt_count
> mechanism, but its a point to consider even if it were feasible (e.g. we
> did not have spinlocks held for hundreds of milliseconds).
> 
> > We can easily add a debugging check to spin_unlock() that says:
> >
> >       /* Somebody messed up, doesn't hold any other preemption thing
> >        * than this lock that is now getting released, and has interrupts
> >        * disabled
> >        */
> >       BUG_ON(preempt_count() == 1 && interrupts_enabled())
> >
> > No?
> 
> Pretty similar to the debugging I played with... as long as it goes away
> eventually (who wants this in their unlock path?) we certainly should
> add it at some point.

I think we also need to have debugging code in the
local_irq_enable() to make sure the correct preemption
enable is being done.

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Real time sched:  http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?]
  2002-07-24  3:13               ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Andrew Morton
  2002-07-24  3:18                 ` Linus Torvalds
  2002-07-24  7:34                 ` Ingo Molnar
@ 2002-07-24  7:37                 ` Ingo Molnar
  2 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2002-07-24  7:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robert Love, george anzinger, Zwane Mwaikambo, Trond Myklebust,
	Linus Torvalds, Linux Kernel


On Tue, 23 Jul 2002, Andrew Morton wrote:

> [...] It'll be untrivial to fix all this, but this very subtle change to
> the locking semantics with CONFIG_PREEMPT is quite nasty.

well, preempt is different, lets face it. I'm very strongly against making
the irq-on path check for preemption ...

	Ingo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: odd memory corruption in 2.5.27?
  2002-07-23 20:32         ` george anzinger
  2002-07-23 20:47           ` William Lee Irwin III
  2002-07-23 23:28           ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Ingo Molnar
@ 2002-07-24  6:52           ` Zwane Mwaikambo
  2 siblings, 0 replies; 29+ messages in thread
From: Zwane Mwaikambo @ 2002-07-24  6:52 UTC (permalink / raw)
  To: george anzinger; +Cc: Trond Myklebust, Linux Kernel

On Tue, 23 Jul 2002, george anzinger wrote:

> protecting them with a combination of interrupt disables and
> spin_locks.  Preemption is allowed (incorrectly) if
> interrupts are off and preempt_count goes to zero on the
> spin_unlock.  I will wager that this is an SMP machine. 
> After the preemption interrupts will be on (schedule() does
> that) AND you could be on a different cpu.  Either of these
> is a BAD thing.
> 
> The proposed fix is to catch the attempted preemption in
> preempt_schedule() and just return if the interrupt system
> is off.  (Of course there is more that this to it, but I do
> believe that the problem is known.  You could blow this
> assertion out of the water by asserting that the machine is
> NOT smp.)

I haven't looked at it further than gathering oopses and idly browsing 
surrounding code. About your assertion, you're almost right, its UP box 
running an SMP kernel w/ CONFIG_PREEMT. 

-- 
function.linuxpower.ca


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2002-07-24 21:12 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-07-23  6:23 odd memory corruption in 2.5.27? Zwane Mwaikambo
2002-07-23  6:24 ` Zwane Mwaikambo
2002-07-23  6:26   ` Zwane Mwaikambo
2002-07-23  7:57     ` Trond Myklebust
2002-07-23  8:54       ` Zwane Mwaikambo
2002-07-23 20:32         ` george anzinger
2002-07-23 20:47           ` William Lee Irwin III
2002-07-23 23:28           ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Ingo Molnar
2002-07-23 23:53             ` george anzinger
2002-07-23 23:56             ` Linus Torvalds
2002-07-24  0:07               ` Ingo Molnar
2002-07-24  2:15                 ` Linus Torvalds
2002-07-24  8:59                   ` [patch] irqlock patch 2.5.27-H3 Ingo Molnar
2002-07-24  1:08             ` [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?] Robert Love
2002-07-24  3:13               ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Andrew Morton
2002-07-24  3:18                 ` Linus Torvalds
2002-07-24  7:13                   ` Ingo Molnar
2002-07-24  7:34                 ` Ingo Molnar
2002-07-24  8:00                   ` [patch] irqlock patch -G3. [was Re: odd memory corruptionin2.5.27?] Andrew Morton
2002-07-24  7:54                     ` Ingo Molnar
2002-07-24  8:03                     ` William Lee Irwin III
2002-07-24  8:06                       ` Ingo Molnar
2002-07-24  8:15                         ` William Lee Irwin III
2002-07-24  8:17                           ` Ingo Molnar
2002-07-24 16:40                     ` Linus Torvalds
2002-07-24 16:49                       ` Robert Love
2002-07-24 20:56                         ` [patch] irqlock patch -G3. [was Re: odd memorycorruptionin2.5.27?] george anzinger
2002-07-24  7:37                 ` [patch] irqlock patch -G3. [was Re: odd memory corruption in2.5.27?] Ingo Molnar
2002-07-24  6:52           ` odd memory corruption in 2.5.27? Zwane Mwaikambo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox