From mboxrd@z Thu Jan 1 00:00:00 1970 Received: with ECARTIS (v1.0.0; list linux-mips); Thu, 02 Oct 2003 00:50:29 +0100 (BST) Received: from gateway-1237.mvista.com ([IPv6:::ffff:12.44.186.158]:7927 "EHLO orion.mvista.com") by linux-mips.org with ESMTP id ; Thu, 2 Oct 2003 00:50:27 +0100 Received: (from jsun@localhost) by orion.mvista.com (8.11.6/8.11.6) id h91NoN929739; Wed, 1 Oct 2003 16:50:23 -0700 Date: Wed, 1 Oct 2003 16:50:23 -0700 From: Jun Sun To: Craig Mautner Cc: linux-mips@linux-mips.org, jsun@mvista.com Subject: Re: schedule() BUG Message-ID: <20031001165023.A26517@mvista.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from craig.mautner@alumni.ucsd.edu on Fri, Sep 12, 2003 at 11:04:16AM -0700 Return-Path: X-Envelope-To: <"|/home/ecartis/ecartis -s linux-mips"> (uid 0) X-Orcpt: rfc822;linux-mips@linux-mips.org Original-Recipient: rfc822;linux-mips@linux-mips.org X-archive-position: 3350 X-ecartis-version: Ecartis v1.0.0 Sender: linux-mips-bounce@linux-mips.org Errors-to: linux-mips-bounce@linux-mips.org X-original-sender: jsun@mvista.com Precedence: bulk X-list: linux-mips On Fri, Sep 12, 2003 at 11:04:16AM -0700, Craig Mautner wrote: > We are using mips-linux 2.4.17, gcc 3.2.1 (MontaVista) and crashing in > schedule(): > > Unable to handle kernel paging request at virtual address 00000000, epc == > 800153c0, ra == 800153c0 > $0 : 00000000 9001f800 0000001b 00000000 0000001a 83f56000 8298f4a0 0000001f > $8 : 00000001 ffffe2e0 000022e0 00000000 fffffff9 ffffffff 0000000a 00000002 > $16: 00000000 00000000 82af0000 8298f4a0 83f56000 00000000 80008000 00000000 > $24: 82af1dc2 00000002 82af0000 82af1ef8 82af1ef8 800153c0 > epc : 800153c0 Not tainted > > The code is: > > { > struct mm_struct *mm = next->mm; > struct mm_struct *oldmm = prev->active_mm; > if (!mm) { > if (next->active_mm) BUG(); <- this is where we crash > next->active_mm = oldmm; > atomic_inc(&oldmm->mm_count); > enter_lazy_tlb(oldmm, next, this_cpu); > } > . > . > . > > This seems to happen in our case when 'next' points to 'kswapd' although we > think it could happen when switching to any kernel task (i.e. those tasks > with mm==NULL). > > We think the culprit is that we are taking an interrupt and rescheduling > while at a vulnerable point in 'schedule()'. Interrupts are enabled in line > 743. If we get an interrupt any time after line 785: > > next->active_mm = oldmm; > > but before line 806 > > __schedule_tail() > > completes the swap, the interrupt can force 'schedule()' to be reentered via > 'ret_from_intr()'. > > If so, 'kswapd's 'active_mm' field will be left non-zero, but 'current' will > not have been set to point to 'kswapd'. The next time 'schedule()' tries to > switch to 'kswapd', 'next' points to 'kswapd', and > > next->mm == NULL > next->active_mm != NULL > > which is detected as an invalid state, so we hit the BUG. > > Some questions: > Are we looking at this correctly? > Has anyone ever seen this before? > Is there a published fix? > > Thanks, > > -Craig > This is an known problem. Please try the attached patch. On R5432 CPU, there is also an hardware bug which can cause the same problem. Please double-check vec3_generic to see if workaround is at the beginning of the handler. BTW, 2.4.17 is an old kernel. You really need to upgrade. Jun