* Re: schedule() BUG
[not found] <FJEIIOCBFAIOIDNKLPFJCECODAAA.koji.kawachi@pioneer-pdt.com>
@ 2003-10-07 2:05 ` Steve Scott
2003-10-07 2:05 ` Steve Scott
2003-10-08 16:29 ` Ralf Baechle
0 siblings, 2 replies; 13+ messages in thread
From: Steve Scott @ 2003-10-07 2:05 UTC (permalink / raw)
To: jsun; +Cc: linux-mips, craig.mautner

[-- Attachment #1: Type: text/plain, Size: 4188 bytes --]

We tried the fault.c patch Jun suggested, but it didn't solve the problem we
were having with the BUG() in schedule(). The patch at the beginning of
except_vec3_generic for the Vr5432 bug had previously been installed.

While chasing the BUG() in schedule(), though, we ran across another BUG() in
alloc_skb() in ...linux/net/core/skbuff.c:

alloc_skb called nonatomically from interrupt 80117acc
kernel BUG at skbuff.c:179!

We changed the way sock_init_data initializes the 'allocation' field and
were able to get past this one (see attached sock.c.patch). We're not sure
if this fix needs to be permanent, or if it's just a temporary workaround.

For the schedule() BUG(), all evidence that we collected pointed to some
interrupt causing us to reenter schedule() (i.e., somehow schedule() was
called during an interrupt handler). We suspected something being run
from the timer interrupt bottom half, but were never able to prove it. We
also thought a remote possibility might be a pipeline hazard in the MIPS
causing the EPC register not to update on a nested exception, but NEC says
that can't happen on the Vr5432 that we're using...

We finally worked around the schedule() BUG() by disabling interrupts
during the context switch in schedule(). This workaround required changes
in linux/kernel/sched.c and linux/arch/mips/kernel/r4k_switch.S (see attached
patches).

--steve

> -----Original Message-----
> From: linux-mips-bounce@linux-mips.org
> [mailto:linux-mips-bounce@linux-mips.org]On Behalf Of Jun Sun
> Sent: Wednesday, October 01, 2003 4:50 PM
> To: Craig Mautner
> Cc: linux-mips@linux-mips.org; jsun@mvista.com
> Subject: Re: schedule() BUG
>
>
> On Fri, Sep 12, 2003 at 11:04:16AM -0700, Craig Mautner wrote:
> > We are using mips-linux 2.4.17, gcc 3.2.1 (MontaVista) and crashing in
> > schedule():
> >
> > Unable to handle kernel paging request at virtual address 00000000, epc ==
> > 800153c0, ra == 800153c0
> > $0 : 00000000 9001f800 0000001b 00000000 0000001a 83f56000 8298f4a0 0000001f
> > $8 : 00000001 ffffe2e0 000022e0 00000000 fffffff9 ffffffff 0000000a 00000002
> > $16: 00000000 00000000 82af0000 8298f4a0 83f56000 00000000 80008000 00000000
> > $24: 82af1dc2 00000002 82af0000 82af1ef8 82af1ef8 800153c0
> > epc : 800153c0 Not tainted
> >
> > The code is:
> >
> > {
> >     struct mm_struct *mm = next->mm;
> >     struct mm_struct *oldmm = prev->active_mm;
> >     if (!mm) {
> >         if (next->active_mm) BUG(); <- this is where we crash
> >         next->active_mm = oldmm;
> >         atomic_inc(&oldmm->mm_count);
> >         enter_lazy_tlb(oldmm, next, this_cpu);
> >     }
> >     .
> >     .
> >     .
> >
> > This seems to happen in our case when 'next' points to 'kswapd' although we
> > think it could happen when switching to any kernel task (i.e. those tasks
> > with mm==NULL).
> >
> > We think the culprit is that we are taking an interrupt and rescheduling
> > while at a vulnerable point in 'schedule()'. Interrupts are enabled in line
> > 743. If we get an interrupt any time after line 785:
> >
> >     next->active_mm = oldmm;
> >
> > but before line 806
> >
> >     __schedule_tail()
> >
> > completes the swap, the interrupt can force 'schedule()' to be reentered via
> > 'ret_from_intr()'.
> >
> > If so, 'kswapd's 'active_mm' field will be left non-zero, but 'current' will
> > not have been set to point to 'kswapd'. The next time 'schedule()' tries to
> > switch to 'kswapd', 'next' points to 'kswapd', and
> >
> >     next->mm == NULL
> >     next->active_mm != NULL
> >
> > which is detected as an invalid state, so we hit the BUG.
> >
> > Some questions:
> > Are we looking at this correctly?
> > Has anyone ever seen this before?
> > Is there a published fix?
> >
> > Thanks,
> >
> > -Craig
> >
>
> This is a known problem. Please try the attached patch.
>
> On R5432 CPU, there is also a hardware bug which can cause the same
> problem. Please double-check vec3_generic to see if the workaround is
> at the beginning of the handler.
>
> BTW, 2.4.17 is an old kernel. You really need to upgrade.
>
> Jun
>

[-- Attachment #2: sock.c.patch --]
[-- Type: application/octet-stream, Size: 341 bytes --]

Index: /home/sscott/work/Software/linux/net/core/sock.c
===================================================================
RCS file: /usr/local/CVS/V4000/Software/linux/net/core/sock.c,v
retrieving revision 1.1.1.1
diff -r1.1.1.1 sock.c
1175c1175
< sk->allocation = GFP_KERNEL;
---
> sk->allocation = GFP_ATOMIC; /*GFP_KERNEL;*/

[-- Attachment #3: r4k_switch.S.patch --]
[-- Type: application/octet-stream, Size: 378 bytes --]

Index: /home/sscott/work/Software/linux/arch/mips/kernel/r4k_switch.S
===================================================================
RCS file: /usr/local/CVS/V4000/Software/linux/arch/mips/kernel/r4k_switch.S,v
retrieving revision 1.1.1.1
diff -r1.1.1.1 r4k_switch.S
38a39
> ori t1, 0x1 /* srs - assume ints disabled in schedule(). Reenable when task resumes */

[-- Attachment #4: sched.c.patch --]
[-- Type: application/octet-stream, Size: 391 bytes --]

Index: /home/sscott/work/Software/linux/kernel/sched.c
===================================================================
RCS file: /usr/local/CVS/V4000/Software/linux/kernel/sched.c,v
retrieving revision 1.1.1.1
diff -r1.1.1.1 sched.c
743c743
< spin_unlock_irq(&runqueue_lock);
---
> /*srs spin_unlock_irq(&runqueue_lock); */
747a748
> /*srs*/ spin_unlock_irq(&runqueue_lock);

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: schedule() BUG
2003-10-07 2:05 ` schedule() BUG Steve Scott
2003-10-07 2:05 ` Steve Scott
@ 2003-10-08 16:29 ` Ralf Baechle
1 sibling, 0 replies; 13+ messages in thread
From: Ralf Baechle @ 2003-10-08 16:29 UTC (permalink / raw)
To: Steve Scott; +Cc: jsun, linux-mips, craig.mautner

On Mon, Oct 06, 2003 at 07:05:06PM -0700, Steve Scott wrote:

> We tried the fault.c patch Jun suggested, but it didn't solve the problem we were
> having with the BUG() in schedule(). The patch at the beginning of
> except_vec3_generic for the Vr5432 bug had previously been installed.
>
> While chasing the BUG() in schedule(), though, we ran across another BUG() in
> alloc_skb() in ...linux/net/core/skbuff.c. :
>
> alloc_skb called nonatomically from interrupt 80117acc
> kernel BUG at skbuff.c:179!
>
> We changed the way sock_init_data initializes the 'allocation' field and
> were able to get past this one (see attached sock.c.patch). We're not sure
> if this fix needs to be permanent, or if it's just a temporary workaround.
>
> For the schedule() BUG(), all evidence that we collected pointed to some
> interrupt causing us to reenter schedule() (i.e., somehow schedule() was
> called during an interrupt handler). We suspected something being run
> from the timer interrupt bottom half, but were never able to prove it. We
> also thought a remote possibility might be a pipeline hazard in the MIPS
> causing the EPC register not to update on a nested exception, but NEC says
> that can't happen on the Vr5432 that we're using...

Can't happen on any MIPS.

> We finally worked around the schedule BUG() by disabling interrupts
> during the context switch in schedule(). This workaround required changes
> in linux/kernel/sched.c and linux/arch/mips/kernel/r4k_switch.S (see attached
> patches).

Ouch. Forgive me, but if I weren't already ignoring these patches for being
ed-style, I'd ignore them for being completely broken: these patches are
harmful for performance and probably not going to achieve stability by
anything other than luck ...

Ralf

^ permalink raw reply [flat|nested] 13+ messages in thread
* schedule() BUG
@ 2003-09-12 18:04 Craig Mautner
2003-09-12 18:04 ` Craig Mautner
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Craig Mautner @ 2003-09-12 18:04 UTC (permalink / raw)
To: linux-mips
We are using mips-linux 2.4.17, gcc 3.2.1 (MontaVista) and crashing in
schedule():
kernel BUG at sched.c:784!
Unable to handle kernel paging request at virtual address 00000000, epc ==
800153c0, ra == 800153c0
$0 : 00000000 9001f800 0000001b 00000000 0000001a 83f56000 8298f4a0 0000001f
$8 : 00000001 ffffe2e0 000022e0 00000000 fffffff9 ffffffff 0000000a 00000002
$16: 00000000 00000000 82af0000 8298f4a0 83f56000 00000000 80008000 00000000
$24: 82af1dc2 00000002 82af0000 82af1ef8 82af1ef8 800153c0
epc : 800153c0 Not tainted
The code is:
{
    struct mm_struct *mm = next->mm;
    struct mm_struct *oldmm = prev->active_mm;
    if (!mm) {
        if (next->active_mm) BUG(); <- this is where we crash
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next, this_cpu);
    }
    .
    .
    .
This seems to happen in our case when 'next' points to 'kswapd' although we
think it could happen when switching to any kernel task (i.e. those tasks
with mm==NULL).
We think the culprit is that we are taking an interrupt and rescheduling
while at a vulnerable point in 'schedule()'. Interrupts are enabled in line
743. If we get an interrupt any time after line 785:
next->active_mm = oldmm;
but before line 806
__schedule_tail()
completes the swap, the interrupt can force 'schedule()' to be reentered via
'ret_from_intr()'.
If so, 'kswapd's 'active_mm' field will be left non-zero, but 'current' will
not have been set to point to 'kswapd'. The next time 'schedule()' tries to
switch to 'kswapd', 'next' points to 'kswapd', and
next->mm == NULL
next->active_mm != NULL
which is detected as an invalid state, so we hit the BUG.
Some questions:
Are we looking at this correctly?
Has anyone ever seen this before?
Is there a published fix?
Thanks,
-Craig
-. .-. .-_ Craig Mautner
\ / \ / / ` Coastal Sr. Consulting, Inc.
`-' `-' `--- (858)361-2683
(858)581-0542 (fax)
5580 La Jolla Blvd. #308 La Jolla, CA 92037
mailto:craig.mautner@alumni.ucsd.edu
http://home.san.rr.com/cmautner/csc/craig/
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: schedule() BUG
2003-09-12 18:04 Craig Mautner
2003-09-12 18:04 ` Craig Mautner
@ 2003-09-13 16:30 ` Craig Mautner
2003-09-13 16:30 ` Craig Mautner
2003-09-15 18:59 ` Craig Mautner
2003-10-01 23:50 ` Jun Sun
3 siblings, 1 reply; 13+ messages in thread
From: Craig Mautner @ 2003-09-13 16:30 UTC (permalink / raw)
To: linux-mips

Regarding my previous posting, we had made the assumption that schedule()
could be called from an interrupt that occurred within schedule(). However,
because schedule() runs in kernel mode, ret_from_irq will skip over the call
to schedule() and simply restore the context.

-Craig
-. .-. .-_ Craig Mautner
\ / \ / / ` Coastal Sr. Consulting, Inc.
`-' `-' `--- (858)361-2683
(858)581-0542 (fax)
5580 La Jolla Blvd. #308 La Jolla, CA 92037
mailto:craig.mautner@alumni.ucsd.edu
http://home.san.rr.com/cmautner/csc/craig/

^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: schedule() BUG
2003-09-12 18:04 Craig Mautner
2003-09-12 18:04 ` Craig Mautner
2003-09-13 16:30 ` Craig Mautner
@ 2003-09-15 18:59 ` Craig Mautner
2003-09-15 18:59 ` Craig Mautner
2003-10-01 23:50 ` Jun Sun
3 siblings, 1 reply; 13+ messages in thread
From: Craig Mautner @ 2003-09-15 18:59 UTC (permalink / raw)
To: linux-mips

Regarding my previous posting, we had made the assumption that schedule()
could be called from an interrupt that occurred within schedule(). However,
because schedule() runs in kernel mode, ret_from_irq will skip over the call
to schedule() and simply restore the context.

-Craig
-. .-. .-_ Craig Mautner
\ / \ / / ` Coastal Sr. Consulting, Inc.
`-' `-' `--- (858)361-2683
(858)581-0542 (fax)
5580 La Jolla Blvd. #308 La Jolla, CA 92037
mailto:craig.mautner@alumni.ucsd.edu
http://home.san.rr.com/cmautner/csc/craig/

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: schedule() BUG
2003-09-12 18:04 Craig Mautner
` (2 preceding siblings ...)
2003-09-15 18:59 ` Craig Mautner
@ 2003-10-01 23:50 ` Jun Sun
2003-10-02 0:09 ` Ralf Baechle
2003-10-02 4:28 ` Daniel Jacobowitz
3 siblings, 2 replies; 13+ messages in thread
From: Jun Sun @ 2003-10-01 23:50 UTC (permalink / raw)
To: Craig Mautner; +Cc: linux-mips, jsun

On Fri, Sep 12, 2003 at 11:04:16AM -0700, Craig Mautner wrote:
> We are using mips-linux 2.4.17, gcc 3.2.1 (MontaVista) and crashing in
> schedule():
>
> Unable to handle kernel paging request at virtual address 00000000, epc ==
> 800153c0, ra == 800153c0
> $0 : 00000000 9001f800 0000001b 00000000 0000001a 83f56000 8298f4a0 0000001f
> $8 : 00000001 ffffe2e0 000022e0 00000000 fffffff9 ffffffff 0000000a 00000002
> $16: 00000000 00000000 82af0000 8298f4a0 83f56000 00000000 80008000 00000000
> $24: 82af1dc2 00000002 82af0000 82af1ef8 82af1ef8 800153c0
> epc : 800153c0 Not tainted
>
> The code is:
>
> {
>     struct mm_struct *mm = next->mm;
>     struct mm_struct *oldmm = prev->active_mm;
>     if (!mm) {
>         if (next->active_mm) BUG(); <- this is where we crash
>         next->active_mm = oldmm;
>         atomic_inc(&oldmm->mm_count);
>         enter_lazy_tlb(oldmm, next, this_cpu);
>     }
>     .
>     .
>     .
>
> This seems to happen in our case when 'next' points to 'kswapd' although we
> think it could happen when switching to any kernel task (i.e. those tasks
> with mm==NULL).
>
> We think the culprit is that we are taking an interrupt and rescheduling
> while at a vulnerable point in 'schedule()'. Interrupts are enabled in line
> 743. If we get an interrupt any time after line 785:
>
>     next->active_mm = oldmm;
>
> but before line 806
>
>     __schedule_tail()
>
> completes the swap, the interrupt can force 'schedule()' to be reentered via
> 'ret_from_intr()'.
>
> If so, 'kswapd's 'active_mm' field will be left non-zero, but 'current' will
> not have been set to point to 'kswapd'. The next time 'schedule()' tries to
> switch to 'kswapd', 'next' points to 'kswapd', and
>
>     next->mm == NULL
>     next->active_mm != NULL
>
> which is detected as an invalid state, so we hit the BUG.
>
> Some questions:
> Are we looking at this correctly?
> Has anyone ever seen this before?
> Is there a published fix?
>
> Thanks,
>
> -Craig
>

This is a known problem. Please try the attached patch.

On R5432 CPU, there is also a hardware bug which can cause the same
problem. Please double-check vec3_generic to see if the workaround is
at the beginning of the handler.

BTW, 2.4.17 is an old kernel. You really need to upgrade.

Jun

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: schedule() BUG
2003-10-01 23:50 ` Jun Sun
@ 2003-10-02 0:09 ` Ralf Baechle
2003-10-02 0:39 ` Jun Sun
2003-10-02 4:28 ` Daniel Jacobowitz
1 sibling, 1 reply; 13+ messages in thread
From: Ralf Baechle @ 2003-10-02 0:09 UTC (permalink / raw)
To: Jun Sun; +Cc: Craig Mautner, linux-mips

On Wed, Oct 01, 2003 at 04:50:23PM -0700, Jun Sun wrote:

> This is an known problem. Please try the attached patch.

No patch attached :-)

Ralf

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: schedule() BUG
2003-10-02 0:09 ` Ralf Baechle
@ 2003-10-02 0:39 ` Jun Sun
0 siblings, 0 replies; 13+ messages in thread
From: Jun Sun @ 2003-10-02 0:39 UTC (permalink / raw)
To: Ralf Baechle; +Cc: Craig Mautner, linux-mips, jsun

[-- Attachment #1: Type: text/plain, Size: 240 bytes --]

On Thu, Oct 02, 2003 at 02:09:31AM +0200, Ralf Baechle wrote:
> On Wed, Oct 01, 2003 at 04:50:23PM -0700, Jun Sun wrote:
>
> > This is an known problem. Please try the attached patch.
>
> No patch attached :-)
>

Doh! Here you go.

Jun

[-- Attachment #2: vmalloc-fault-with-no-active-mm.patch --]
[-- Type: text/plain, Size: 414 bytes --]

diff -Nru link/arch/mips/mm/fault.c.orig link/arch/mips/mm/fault.c
--- link/arch/mips/mm/fault.c.orig Fri May 10 18:50:08 2002
+++ link/arch/mips/mm/fault.c Fri May 23 10:39:10 2003
@@ -260,7 +260,7 @@
 pgd_t *pgd, *pgd_k;
 pmd_t *pmd, *pmd_k;
 
- pgd = tsk->active_mm->pgd + offset;
+ pgd = (pgd_t *) pgd_current[smp_processor_id()] + offset;
 pgd_k = init_mm.pgd + offset;
 
 if (!pgd_present(*pgd)) {

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: schedule() BUG
2003-10-01 23:50 ` Jun Sun
2003-10-02 0:09 ` Ralf Baechle
@ 2003-10-02 4:28 ` Daniel Jacobowitz
1 sibling, 0 replies; 13+ messages in thread
From: Daniel Jacobowitz @ 2003-10-02 4:28 UTC (permalink / raw)
To: Jun Sun; +Cc: Craig Mautner, linux-mips

On Wed, Oct 01, 2003 at 04:50:23PM -0700, Jun Sun wrote:
> On Fri, Sep 12, 2003 at 11:04:16AM -0700, Craig Mautner wrote:
> > We are using mips-linux 2.4.17, gcc 3.2.1 (MontaVista) and crashing in
> > schedule():
> >
> > Unable to handle kernel paging request at virtual address 00000000, epc ==
> > 800153c0, ra == 800153c0
> > $0 : 00000000 9001f800 0000001b 00000000 0000001a 83f56000 8298f4a0 0000001f
> > $8 : 00000001 ffffe2e0 000022e0 00000000 fffffff9 ffffffff 0000000a 00000002
> > $16: 00000000 00000000 82af0000 8298f4a0 83f56000 00000000 80008000 00000000
> > $24: 82af1dc2 00000002 82af0000 82af1ef8 82af1ef8 800153c0
> > epc : 800153c0 Not tainted
> >
> > The code is:
> >
> > {
> >     struct mm_struct *mm = next->mm;
> >     struct mm_struct *oldmm = prev->active_mm;
> >     if (!mm) {
> >         if (next->active_mm) BUG(); <- this is where we crash
> >         next->active_mm = oldmm;
> >         atomic_inc(&oldmm->mm_count);
> >         enter_lazy_tlb(oldmm, next, this_cpu);
> >     }
> >     .
> >     .
> >     .
> >
> > This seems to happen in our case when 'next' points to 'kswapd' although we
> > think it could happen when switching to any kernel task (i.e. those tasks
> > with mm==NULL).
> >
> > We think the culprit is that we are taking an interrupt and rescheduling
> > while at a vulnerable point in 'schedule()'. Interrupts are enabled in line
> > 743. If we get an interrupt any time after line 785:
> >
> >     next->active_mm = oldmm;
> >
> > but before line 806
> >
> >     __schedule_tail()
> >
> > completes the swap, the interrupt can force 'schedule()' to be reentered via
> > 'ret_from_intr()'.
> >
> > If so, 'kswapd's 'active_mm' field will be left non-zero, but 'current' will
> > not have been set to point to 'kswapd'. The next time 'schedule()' tries to
> > switch to 'kswapd', 'next' points to 'kswapd', and
> >
> >     next->mm == NULL
> >     next->active_mm != NULL
> >
> > which is detected as an invalid state, so we hit the BUG.
> >
> > Some questions:
> > Are we looking at this correctly?
> > Has anyone ever seen this before?
> > Is there a published fix?
> >
> > Thanks,
> >
> > -Craig
> >
>
> This is a known problem. Please try the attached patch.
>
> On R5432 CPU, there is also a hardware bug which can cause the same
> problem. Please double-check vec3_generic to see if the workaround is
> at the beginning of the handler.
>
> BTW, 2.4.17 is an old kernel. You really need to upgrade.

By the way, in 2.6 the include of <asm/war.h> has vanished from genex.S.
If you want the workaround to be compiled, then you need to re-add that.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2003-10-08 16:30 UTC | newest]
Thread overview: 13+ messages
[not found] <FJEIIOCBFAIOIDNKLPFJCECODAAA.koji.kawachi@pioneer-pdt.com>
2003-10-07 2:05 ` schedule() BUG Steve Scott
2003-10-07 2:05 ` Steve Scott
2003-10-08 16:29 ` Ralf Baechle
2003-09-12 18:04 Craig Mautner
2003-09-12 18:04 ` Craig Mautner
2003-09-13 16:30 ` Craig Mautner
2003-09-13 16:30 ` Craig Mautner
2003-09-15 18:59 ` Craig Mautner
2003-09-15 18:59 ` Craig Mautner
2003-10-01 23:50 ` Jun Sun
2003-10-02 0:09 ` Ralf Baechle
2003-10-02 0:39 ` Jun Sun
2003-10-02 4:28 ` Daniel Jacobowitz