bug in the latest cache code?

Linux MIPS Architecture development

* bug in the latest cache code?
@ 2000-08-10  1:08 Jun Sun
  2000-08-10  2:30 ` Atsushi Nemoto
  2000-08-10 17:38 ` Ralf Baechle
  0 siblings, 2 replies; 4+ messages in thread
From: Jun Sun @ 2000-08-10  1:08 UTC (permalink / raw)
  To: linux, linux-mips

Ralf,

I spent the last few days tracking down a problem where /sbin/init
hangs forever.  I believe it turns out to be a bug introduced by the
recent cache code change.

A new function, r4k_flush_icache_page_i32(), was added recently.  It
calls blast_icache32_page(), which uses Hit cache operations to flush
the cache.  Unfortunately, that generates a TLB fault if the virtual
address is not present in the TLB.  Under certain conditions,
r4k_flush_icache_page_i32() is called in the middle of handling a
page fault, and the Hit cache operation then generates the same page
fault again.  This causes a deadlock (on current->mm->mmap_sem).
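
The re-entry described above can be sketched as a toy model.  This is
only an illustration of the reported sequence, not kernel code: the
names toy_sem, toy_down, toy_up and toy_page_fault are hypothetical,
and the lock reports "would block" instead of actually hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for current->mm->mmap_sem: a non-recursive lock. */
struct toy_sem { bool held; };

/* Returns true if the lock was taken.  false means a real down()
 * would sleep forever, because the caller already holds the lock. */
static bool toy_down(struct toy_sem *s)
{
    if (s->held)
        return false;
    s->held = true;
    return true;
}

static void toy_up(struct toy_sem *s)
{
    assert(s->held);
    s->held = false;
}

enum fault_result { FAULT_HANDLED, FAULT_SELF_DEADLOCK };

/* Hypothetical model of the fault path in the report.  When
 * flush_uses_hit_op is true, the icache flush touches the user
 * address before its TLB entry exists and faults again. */
static enum fault_result toy_page_fault(struct toy_sem *mmap_sem,
                                        bool flush_uses_hit_op)
{
    if (!toy_down(mmap_sem))
        return FAULT_SELF_DEADLOCK;

    enum fault_result r = FAULT_HANDLED;

    /* ... new page allocated, TLB entry not built yet ... */
    if (flush_uses_hit_op)
        r = toy_page_fault(mmap_sem, flush_uses_hit_op); /* nested fault */

    if (r == FAULT_HANDLED)
        toy_up(mmap_sem);   /* never reached in the deadlock case */
    return r;
}
```

With flush_uses_hit_op set to false the fault completes and the lock
is released; with true, the nested fault finds the lock already held,
which corresponds to the hang.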

I read the previous version of the code.  The fix seems to be to use
the indexed cache operation.  Here is the fix; it appears to solve the
problem on my board.

Jun

-----------

static void
r4k_flush_icache_page_i32(struct vm_area_struct *vma, struct page *page,
                      unsigned long address)
{
        if (!(vma->vm_flags & VM_EXEC))
                return;

-        blast_icache32_page(address);
+        address = KSEG0 + (address & PAGE_MASK & (dcache_size - 1));
+        blast_icache32_page_indexed(address);
}
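
The address arithmetic in the patch can be tried in isolation.  The
sketch below is a standalone illustration, not the kernel's code: the
TOY_ names and toy_index_address() are hypothetical, a 4 KB page size
is assumed, and the cache size is passed in as a parameter instead of
being read from dcache_size.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_KSEG0     0x80000000u          /* unmapped, cached segment on 32-bit MIPS */
#define TOY_PAGE_SIZE 4096u                /* assumed page size */
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1u))

/* Map a user virtual address to a KSEG0 address that selects the same
 * cache index.  KSEG0 addresses are not translated through the TLB,
 * so using one for an index operation cannot raise a TLB exception. */
static uint32_t toy_index_address(uint32_t address, uint32_t cache_size)
{
    /* the masking only works for power-of-two cache sizes */
    assert(cache_size != 0 && (cache_size & (cache_size - 1u)) == 0);

    return TOY_KSEG0 + (address & TOY_PAGE_MASK & (cache_size - 1u));
}
```

For the faulting address 0x0fb652e0 from the stack trace later in this
thread and an assumed 16 KB cache, this gives 0x80001000.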


* Re: bug in the latest cache code?
  2000-08-10  1:08 bug in the latest cache code? Jun Sun
@ 2000-08-10  2:30 ` Atsushi Nemoto
  2000-08-10 17:38 ` Ralf Baechle
  1 sibling, 0 replies; 4+ messages in thread
From: Atsushi Nemoto @ 2000-08-10  2:30 UTC (permalink / raw)
  To: jsun; +Cc: linux, linux-mips

>>>>> On Wed, 09 Aug 2000 18:08:12 -0700, Jun Sun <jsun@mvista.com> said:
jsun> A new function, r4k_flush_icache_page_i32(), was added recently.
jsun> It calls blast_icache32_page(), which uses Hit cache operations
jsun> to flush cache.  Unfortunately, that will generate TLB fault if
jsun> virtual address is not present in TLB.  Under certain
jsun> conditions, r4k_flush_icache_page_i32() will be called in the
jsun> middle of handling a page fault, and it will then generate the
jsun> same page fault again with cache hit operation.  This causes a
jsun> deadlock (on current->mm->mmap_sem).

To my knowledge, if the virtual address is not present in the TLB, a
cache hit operation generates a TLB refill exception, not a TLB invalid
exception.  After the TLB refill exception, the cache instruction can
continue execution without a page fault (no deadlock).
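
The distinction between the two exceptions can be sketched with a toy
lookup.  This is a simplified model under stated assumptions, not a
description of any real TLB: entries here hold one page each and carry
no ASID, and all names (toy_tlb_entry, toy_tlb_lookup, the TLB_ enum
values) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_tlb_entry {
    uint32_t vpn;    /* virtual page number */
    bool     valid;  /* V bit */
};

enum toy_tlb_result {
    TLB_HIT,      /* matching entry with V set: access proceeds */
    TLB_REFILL,   /* no matching entry: refill exception */
    TLB_INVALID   /* matching entry with V clear: invalid exception */
};

static enum toy_tlb_result toy_tlb_lookup(const struct toy_tlb_entry *tlb,
                                          size_t n, uint32_t vpn)
{
    assert(tlb != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (tlb[i].vpn == vpn)
            return tlb[i].valid ? TLB_HIT : TLB_INVALID;
    }
    return TLB_REFILL;
}
```

In this model only the TLB_INVALID case corresponds to the path that
reaches the page fault handler; TLB_REFILL is resolved by loading the
entry and retrying the access.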

I hit the same deadlock problem on my r3k (with r4k-like cache) board
with a 2.2.12-based kernel.  I suspected my TLB/cache code first, but
the real cause was in vt.c.  _kd_mksound() modifies the TLB refill
handler code if mips_io_port_base == 0xa0000000.  Modifying the #if
line near _kd_mksound() fixed my problem.

Hope this helps.

---
Atsushi Nemoto


* Re: bug in the latest cache code?
  2000-08-10  1:08 bug in the latest cache code? Jun Sun
  2000-08-10  2:30 ` Atsushi Nemoto
@ 2000-08-10 17:38 ` Ralf Baechle
  2000-08-10 17:50   ` Jun Sun
  1 sibling, 1 reply; 4+ messages in thread
From: Ralf Baechle @ 2000-08-10 17:38 UTC (permalink / raw)
  To: Jun Sun; +Cc: linux-mips, linux-mips, linux-mips

On Wed, Aug 09, 2000 at 06:08:12PM -0700, Jun Sun wrote:

> I spent the last a few days to track down a problem where /sbin/init
> hangs forever.  It turns out, I believe, to be a bug introduced in the
> recent cache code change.
> 
> A new function, r4k_flush_icache_page_i32(), was added recently.  It
> calls blast_icache32_page(), which uses Hit cache operations to flush
> cache.  Unfortunately, that will generate TLB fault if virtual address
> is not present in TLB.  Under certain conditions,
> r4k_flush_icache_page_i32() will be called in the middle of handling a
> page fault, and it will then generate the same page fault again with
> cache hit operation.  This causes a deadlock (on current->mm->mmap_sem).
> 
> I read the previous version of code.  The fix seems to be using the
> indexed cache operation.  Here is the fix, and apparently it fixes the
> problem on my board.

I can see how this may happen and will take care of fixing this one.

We really want to avoid using index operations.  Contrary to what the
comment in the kernel code suggests, they flush the caches more than
necessary, which is pretty expensive.

  Ralf


* Re: bug in the latest cache code?
  2000-08-10 17:38 ` Ralf Baechle
@ 2000-08-10 17:50   ` Jun Sun
  0 siblings, 0 replies; 4+ messages in thread
From: Jun Sun @ 2000-08-10 17:50 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips, linux-mips, linux-mips

Ralf Baechle wrote:
> 
> On Wed, Aug 09, 2000 at 06:08:12PM -0700, Jun Sun wrote:
> 
> > I spent the last a few days to track down a problem where /sbin/init
> > hangs forever.  It turns out, I believe, to be a bug introduced in the
> > recent cache code change.
> >
> > A new function, r4k_flush_icache_page_i32(), was added recently.  It
> > calls blast_icache32_page(), which uses Hit cache operations to flush
> > cache.  Unfortunately, that will generate TLB fault if virtual address
> > is not present in TLB.  Under certain conditions,
> > r4k_flush_icache_page_i32() will be called in the middle of handling a
> > page fault, and it will then generate the same page fault again with
> > cache hit operation.  This causes a deadlock (on current->mm->mmap_sem).
> >
> > I read the previous version of code.  The fix seems to be using the
> > indexed cache operation.  Here is the fix, and apparently it fixes the
> > problem on my board.
> 
> I can see how this may happen and will take care of fixing this one.
> 

Thanks.

Below is the stack trace and some of my notes on this problem.  Hope
this helps.

I agree we should not overuse index operations, but this is a pretty
serious problem.  I don't think we can fix it easily without changing
the arch-independent part of the kernel.

Jun

-------------------------

more traces :
the page fault is caused by r4k_flush_icache_page_i32(), at the first
cache (Hit_....) operation.

call stack when current->mm->mmap_sem has already been taken but
        r4k_flush_icache_page_i32() is still called.

#0  jsun_bug () at r4xx0.c:1971
#1  0x8009aa60 in r4k_flush_icache_page_i32 (vma=0x811401e0, page=0x810476c0,
    address=263607008) at r4xx0.c:1986
#2  0x800b0320 in do_no_page (mm=0x81142080, vma=0x811401e0, address=263607008,
    write_access=0, page_table=0x811fed94) at memory.c:1162
#3  0x800b0508 in handle_mm_fault (mm=0x81142080, vma=0x811401e0,
    address=263607008, write_access=0) at memory.c:1202
#4  0x80094118 in do_page_fault (regs=0x81127f30, write=0, address=263607008)
    at fault.c:93
#5  0x8008ce98 in handle_tlbl () at r4k_misc.S:154

(263607008 = 0xfb652e0)

The epc for the #5 tlbl fault is 0xfb652e0, which means it is a page
fault for the next instruction.

****

annotated calling trace :

handle_tlbl (in asm) - arch/mips/kernel/r4k_misc.S
    do_page_fault - arch/mips/mm/fault.c
        after checking that it is a good area
        switch (handle_mm_fault(....) )  - line 93
            [not visible to gdb
            handle_mm_fault(...)  - mm/memory.c ]
                alloc pte
                handle_pte_fault(...)
                    check the page and
                    do_no_page(...)  - mm/memory.c
                        /* do a bunch of stuff but TLB entry
                           for the new page is not built yet */
                        flush_page_to_ram(new_page);
                        flush_icache_page(...)
                          ( = r4k_flush_icache_page_i32) ;
                                ==> jsun_bug()
