public inbox for linux-ia64@vger.kernel.org
* [Linux-ia64] Preempt problems
@ 2003-02-03 20:17 Peter Chubb
  2003-02-03 21:36 ` Stephane Eranian
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Peter Chubb @ 2003-02-03 20:17 UTC (permalink / raw)
  To: linux-ia64

Hi David,
   Just a heads up --- there are still two major problems with the
patch I submitted for preemption support, so I suggest you don't merge
it yet (although I'd very much appreciate having it reviewed and
tested by third parties).

The two problems I see are:
    -- perfmon missing interrupts *even with CONFIG_PREEMPT turned
off!*
   -- weird kernel page faults apparently during ia64_leave_kernel()
(from the stack backtrace), but with an IP address in region 0, after
some variable amount of uptime, apparently independent of anything
happening on the machine (at least, I haven't yet been able to isolate
anything that causes it).

--
Dr Peter Chubb				    peterc@gelato.unsw.edu.au
You are lost in a maze of BitKeeper repositories, all almost the same.



* Re: [Linux-ia64] Preempt problems
  2003-02-03 20:17 [Linux-ia64] Preempt problems Peter Chubb
@ 2003-02-03 21:36 ` Stephane Eranian
  2003-02-03 22:33 ` Peter Chubb
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Stephane Eranian @ 2003-02-03 21:36 UTC (permalink / raw)
  To: linux-ia64

Peter,

On Tue, Feb 04, 2003 at 07:17:01AM +1100, Peter Chubb wrote:
> 
> Hi David,
>    Just a heads up --- there are still two major problems with the
> patch I submitted for preemption support, so I suggest you don't merge
> it yet (although I'd very much appreciate having it reviewed and
> tested by third parties).
> 
> The two problems I see are:
>     -- perfmon missing interrupts *even with CONFIG_PREEMPT turned
> off!*

As for perfmon, there are some known issues with perfmon and the O(1) 
scheduler (deadlocks during ctxsw in SMP). I am not sure it affects your 
particular test case. I had postponed fixing this because I am working on 
a new perfmon code base for 2.5 in which (hopefully) all problems are gone. 
However a somewhat related issue came up last week and I decided to fix 
some of the problems. I will try to give a new patch to David this week.

As for preemption and perfmon, I haven't had a chance to look at the patch
yet. There are some assumptions about not being preemptable at several places.

-- 
-Stephane



* [Linux-ia64] Preempt problems
  2003-02-03 20:17 [Linux-ia64] Preempt problems Peter Chubb
  2003-02-03 21:36 ` Stephane Eranian
@ 2003-02-03 22:33 ` Peter Chubb
  2003-02-03 23:43 ` David Mosberger
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Peter Chubb @ 2003-02-03 22:33 UTC (permalink / raw)
  To: linux-ia64

>>>>> "Peter" = Peter Chubb <peter@chubb.wattle.id.au> writes:
Peter> * -- weird kernel page faults
Peter> apparently during ia64_leave_kernel() (from the stack
Peter> backtrace), but with an IP address in region 0, after some
Peter> variable amount of uptime, apparently independent of anything
Peter> happening on the machine (at least, I haven't yet been able to
Peter> isolate anything that causes it).


I just got one of these on a straight 2.5.59+davidm's patches, on a
vanilla zx2000.  The kernel address is always the same, at 0xffe6bf10,
regardless of the kernel, and regardless of the command that was
(apparently) running at the time.


I've appended the Oops message.  On a previous (different) kernel I
printed out the iipa:   e00000003ff04420cc1
It also appears bogus.  Note that interrupts are off in the psr.


I'm at a loss as to how to debug this thing now.  Can anyone suggest
anything?  It *could* be a hardware problem (I've had problems with
this machine before).



Unable to handle kernel paging request at virtual address 00000000ffe6bf10
automount[288]: Oops 4294967296

Pid: 288, CPU 0, comm:            automount
psr : 0000101008022018 ifs : 800000000000048d ip  : [<00000000ffe6bf10>]    Not tainted
ip is at 0xffe6bf10
unat: 0000000000000000 pfs : 0000000000000d22 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 769dbd656f5a5865
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
b0  : e00000003ff04430 b6  : e00000003fef6a80 b7  : e00000003fef37b0
f6  : 1003e0000000000003fff f7  : 1003ecccccccccccccccd
f8  : 1000cfffc000000000000 f9  : 1003e0000000000000008
r1  : e00000003fe40000 r2  : e00000003ff04430 r3  : 0000000000000000
r8  : 00000000ffe6bf10 r9  : 0000000000000000 r10 : 0000000000000000
r11 : 0000000000000000 r12 : e0000040428d7380 r13 : e0000040428d0000
r14 : c0000000ff4661f8 r15 : c0000000ff44a040 r16 : 000000000074a040
r17 : 0000000000000001 r18 : 00000000000000ff r19 : 0000000000000200
r20 : 00000000000002ff r21 : a0000000000100c8 r22 : e00000003fef37b0
r23 : e000000004ab8318 r24 : e0000000049eec10 r25 : e000000004aa4228
r26 : 0000000000000006 r27 : e000000004ab93e8 r28 : 0000000000000019
r29 : 0000000000000000 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
 [<e000000004416560>] show_stack+0x80/0xa0 sp=0xe0000040428d6f90 bsp=0xe0000040428d1b58
 [<e00000000442cef0>] die+0x110/0x1a0 sp=0xe0000040428d7150 bsp=0xe0000040428d1b30
 [<e000000004448450>] ia64_do_page_fault+0x310/0x840 sp=0xe0000040428d7150 bsp=0xe0000040428d1ad0
 [<e000000004411520>] ia64_leave_kernel+0x0/0x240 sp=0xe0000040428d71e0 bsp=0xe0000040428d1ad0
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing



* Re: [Linux-ia64] Preempt problems
  2003-02-03 20:17 [Linux-ia64] Preempt problems Peter Chubb
  2003-02-03 21:36 ` Stephane Eranian
  2003-02-03 22:33 ` Peter Chubb
@ 2003-02-03 23:43 ` David Mosberger
  2003-02-05 17:03 ` Joel GUILLET
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Mosberger @ 2003-02-03 23:43 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Tue, 4 Feb 2003 09:33:31 +1100, Peter Chubb <peter@chubb.wattle.id.au> said:

  Peter> I just got one of these on a straight 2.5.59+davidm's
  Peter> patches, on a vanilla zx2000.  The kernel address is always
  Peter> the same, at 0xffe6bf10, regardless of the kernel, and
  Peter> regardless of the command that was (apparently) running at
  Peter> the time.

Can't say I have ever seen anything like this.  It does look like the
top 32 bits somehow get truncated to zero.  Rather weird.
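
For illustration only, here is a minimal C sketch of why such a
truncation lands in region 0.  It assumes the usual IA-64 layout in
which bits 63:61 of a virtual address select the region.  The helper
names region_of() and truncate_to_32() are made up for this example and
are not kernel functions, and the full 64-bit value mentioned in the
comment is hypothetical.

```c
#include <stdint.h>

/* On IA-64, bits 63:61 of a virtual address select one of eight regions. */
static inline unsigned region_of(uint64_t vaddr)
{
    return (unsigned)(vaddr >> 61);
}

/*
 * Model of the suspected corruption: only the low 32 bits survive.
 * For example, a hypothetical 0xe0000000ffe6bf10 would become
 * 0x00000000ffe6bf10.
 */
static inline uint64_t truncate_to_32(uint64_t vaddr)
{
    return (uint64_t)(uint32_t)vaddr;
}
```

With b0 from the oops (0xe00000003ff04430), region_of() gives 7, while
any value passed through truncate_to_32() falls in region 0, which is
where the faulting ip 0xffe6bf10 sits.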

  Peter> I'm at a loss as to how to debug this thing now.  Can anyone
  Peter> suggest anything?  It *could* be a hardware problem (I've had
  Peter> problems with this machine before).

If you had problems before, a hw problem sounds like a real
possibility.  If I can find some time, I'll try your patch on some of
our machines.

	--david



* Re: [Linux-ia64] Preempt problems
  2003-02-03 20:17 [Linux-ia64] Preempt problems Peter Chubb
                   ` (2 preceding siblings ...)
  2003-02-03 23:43 ` David Mosberger
@ 2003-02-05 17:03 ` Joel GUILLET
  2003-02-14 20:05 ` Ray Bryant
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Joel GUILLET @ 2003-02-05 17:03 UTC (permalink / raw)
  To: linux-ia64

Hello,

I've just found a problem when using the preemption patch with SMP on a
2.5.59 kernel (with my version of the _raw_write_trylock() macro).

It seems that the SCSI driver has some problems at boot time.
As a result, one of my partitions (type ext3) sometimes cannot be
mounted.
Here is the only related message in the log file:

Feb  5 17:54:13 zli22 kernel: scsi: Underflow detected - retrying command.


Sometimes the partition is "half-mounted": I can read only some of
the files. If I try to mount it again or to unmount it, I get a message
saying that the device is busy.

		> mount /dev/sda4
		umount: /home: device is busy


It seems to me that it is a deadlock, but it is not easy to identify
because it doesn't happen every time I boot the machine (about one boot
in two on my machines that have 2 cpus).

It could be a problem in "_raw_write_trylock()", but I have made some
tests, and it seems that the macro "returns" the correct value.


---------------
> ... because the rw_lock value is composed with :
> - 1 bit for the write "flag" (the most significant bit of a _long_ value)

  Oh... I meant MSB of an _int_ value !

> - 31 bits for the read flags
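
As a rough illustration of that layout (one writer bit in the MSB of a
32-bit word, 31 bits of reader count), a write trylock can be modelled
as a single compare-and-swap from 0 to the writer bit.  This is only a
sketch using C11 atomics, not the ia64 _raw_write_trylock() macro from
the patch; all sketch_* names and the RW_* constants are hypothetical,
and reader-count overflow is ignored.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RW_WRITE_BIT 0x80000000u  /* MSB of the 32-bit word: writer flag */
#define RW_READ_MASK 0x7fffffffu  /* low 31 bits: number of readers */

typedef struct { _Atomic uint32_t word; } sketch_rwlock_t;

/* Succeeds only if there is no writer and no reader (word == 0). */
static inline bool sketch_write_trylock(sketch_rwlock_t *l)
{
    uint32_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->word, &expected, RW_WRITE_BIT,
        memory_order_acquire, memory_order_relaxed);
}

/* Clears the writer flag and leaves the reader bits untouched. */
static inline void sketch_write_unlock(sketch_rwlock_t *l)
{
    atomic_fetch_and_explicit(&l->word, RW_READ_MASK, memory_order_release);
}

/* Adds one reader unless the writer flag is set. */
static inline bool sketch_read_trylock(sketch_rwlock_t *l)
{
    uint32_t old = atomic_load_explicit(&l->word, memory_order_relaxed);
    while (!(old & RW_WRITE_BIT)) {
        /* On failure, old is reloaded and the writer flag is rechecked. */
        if (atomic_compare_exchange_weak_explicit(
                &l->word, &old, old + 1,
                memory_order_acquire, memory_order_relaxed))
            return true;
    }
    return false;
}

/* Drops one reader. */
static inline void sketch_read_unlock(sketch_rwlock_t *l)
{
    atomic_fetch_sub_explicit(&l->word, 1, memory_order_release);
}
```

The point of the sketch is that the trylock must fail both when the
writer bit is set and when any of the 31 reader bits is non-zero, which
is why it compares against 0 rather than testing the MSB alone.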


Regards

---------
  Joel






* Re: [Linux-ia64] Preempt problems
  2003-02-03 20:17 [Linux-ia64] Preempt problems Peter Chubb
                   ` (3 preceding siblings ...)
  2003-02-05 17:03 ` Joel GUILLET
@ 2003-02-14 20:05 ` Ray Bryant
  2003-02-14 20:11 ` Stephane Eranian
  2003-02-14 21:04 ` Ray Bryant
  6 siblings, 0 replies; 8+ messages in thread
From: Ray Bryant @ 2003-02-14 20:05 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

Stephane Eranian wrote:
> 
> Peter,
> 
> On Tue, Feb 04, 2003 at 07:17:01AM +1100, Peter Chubb wrote:
> >
Stephane,

Does the deadlock you describe here look at all like the bug report that
Jack Steiner has submitted for our Altix kernel?
(We're using the O(1) scheduler and pfmon.)  (It certainly sounds
similar.) Details attached.

Is anyone working on this issue that you know of?

> 
> As for perfmon, there are some known issues with perfmon and the O(1)
> scheduler (deadlocks during ctxsw in SMP). I am not sure it affects your
> particular test case. I had postponed fixing this because I am working on
> a new perfmon code base for 2.5 in which (hopefully) all problems are gone.
> However a somewhat related issue came up last week and I decided to fix
> some of the problems. I will try to give a new patch to David this week.
> 
> As for preemption and perfmon, I haven't had a chance to look at the patch
> yet. There are some assumptions about not being preemptable at several places.
> 
> --
> -Stephane
> 

-- 
Best Regards,
Ray
-----------------------------------------------
                  Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------

[-- Attachment #2: perfmon.deadlock --]
[-- Type: text/plain, Size: 4113 bytes --]

From: pv@relay.sgi.com (steiner@sgi.com)
Subject: BUG 881594 - Deadlock in perfmon - pfm_fetch_regs()
To: raybry@sgi.com, steiner@sgi.com

View Incident: http://co-op.engr.sgi.com/BugWorks/code/bwxquery.cgi?search=Search&wlong=1&view_type=Bug&wi=881594

Submitter : steiner                   Submitter Domain : sgi.com            
Assigned Engineer : raybry            Assigned Engineer Domain : sgi.com    
Assigned Group : linux-mckinley       Category : software                   
Reported by Customer : F              Priority : 2                          
Project : snlinux                     Status : open                         
Description :
Ferarri hung this morning running a mixture of
        0xe000003068b40000 00024267 00024266  0  006  stop  0xe000003068b407d0 code3
        0xe00000305c608000 00024273 00024250  0  000  stop  0xe00000305c6087d0 pfmon
        0xe00000300c738000 00024275 00010688  0  000  stop  0xe00000300c7387d0 go.bottle
        0xe000003051018000 00024278 00010688  0  000  stop  0xe0000030510187d0 go.bottle
        0xe000003020058000 00024281 00010688  0  000  stop  0xe0000030200587d0 go.bottle
        0xe00000306c6e8000 00024284 00010688  0  006  stop  0xe00000306c6e87d0 go.bottle
        0xe000003049098000 00024287 00010688  0  000  stop  0xe0000030490987d0 go.bottle
        0xe000003021558000 00024274 00024273  0  000  stop  0xe0000030215587d0 code3
        0xe00001b030900000 00024295 00024284  0  006  stop  0xe00001b0309007d0 pfmon
        0xe00001b03ad30000 00024296 00024295  0  005  stop  0xe00001b03ad307d0 code3
        0xe00000302a240000 00024297 00024278  0  000  stop  0xe00000302a2407d0 pfmon
        0xe000003028980000 00024298 00024275  0  000  stop  0xe0000030289807d0 pfmon

From the leds, it appeared that cpu 2 & 3 were hard hung & not processing interrupts.
I nmi'ed the system.  Cpu 2 was hung here (I think this is right - 90% confidence. Someone
reset the system before I finished digging out the info I needed):


     1  99  1  smp_call_function_single
     7  ??  2      pfm_fetch_regs
     8  ??  3          pfm_load_regs
     9  ??  4              ia64_load_extra
    10  ??  5                  __switch_to
    11  ??  6                      switch_to
    12  ??  7                          context_switch
    13  ??  8                              schedule


cpu 2 was in the function smp_call_function_single spinning with interrupts disabled trying to
lock call_lock. Cpu 3 was hung the same way that cpu 2 was hung.

Another cpu was holding the call_lock & was waiting for cpu 2 to respond to an IPI. Since
cpu 2 was spinning with interrupts disabled, it was not responding.
The cpu holding the lock was here:
        0xe002000000045af0 smp_call_function+0x470
        0xe002000000045350 smp_flush_tlb_all+0x30
        0xe002000000051550 flush_tlb_range+0x50
        0xe002000000125090 swap_out+0x9f0
        0xe002000000126350 shrink_cache+0xb70
        0xe0020000001269e0 shrink_caches+0x100
        0xe002000000126ad0 try_to_free_pages+0x70
        0xe002000000128b40 balance_classzone+0xe0
        0xe0020000001295c0 __alloc_pages+0x420
        0xe0020000001297c0 __get_free_pages+0xc0
        0xe002000000120260 kmem_cache_grow+0x280
        0xe002000000121580 kmem_cache_alloc+0x460
        0xe00200000033fc60 kmem_zone_zalloc+0xa0
        0xe0020000002d5980 xfs_efd_init+0x80
        0xe00200000030e030 xfs_trans_get_efd+0x30
        0xe00200000029cf40 xfs_bmap_finish+0x1a0
        0xe0020000002e5cb0 xfs_itruncate_finish+0x2d0
        0xe002000000318640 xfs_inactive+0x5e0
        0xe00200000033ea20 vn_rele+0x140
        0xe00200000033c9f0 linvfs_clear_inode+0x30
        0xe0020000001741d0 clear_inode+0x370
        0xe002000000175d70 iput+0x4b0
        0xe002000000170160 d_delete+0x180
        0xe00200000015c690 vfs_unlink+0x650
        0xe00200000015c930 sys_unlink+0x210
        0xe00200000000ea00 ia64_ret_from_syscall

I don't believe that pfm_fetch_regs should be calling smp_call_function_single unless
interrupts are enabled.
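
The rule stated above can be sketched as a guard in a small user-space
model.  This is an assumption-laden illustration, not the perfmon or SMP
code: struct sketch_cpu, sketch_call_function_single() and
sketch_count() are hypothetical names, and the "IPI" is just a direct
function call.

```c
#include <stdbool.h>

/* Hypothetical model of a CPU: only the interrupt-enable state matters. */
struct sketch_cpu {
    int  id;
    bool irqs_enabled;
};

enum sketch_call_result {
    SKETCH_CALL_OK             =  0,
    SKETCH_CALL_WOULD_DEADLOCK = -1
};

/*
 * Refuse a cross-CPU call when the caller has interrupts disabled.
 * In that state the caller cannot answer an IPI from whoever holds the
 * call lock, so spinning for the lock could hang both CPUs.
 */
static int sketch_call_function_single(const struct sketch_cpu *caller,
                                       int target_cpu,
                                       void (*fn)(void *), void *arg)
{
    if (!caller->irqs_enabled)
        return SKETCH_CALL_WOULD_DEADLOCK;

    (void)target_cpu;  /* the model has no real remote CPU */
    fn(arg);           /* stand-in for sending the IPI and waiting */
    return SKETCH_CALL_OK;
}

/* Example callback: counts how many times it was invoked. */
static void sketch_count(void *arg)
{
    ++*(int *)arg;
}
```

In this model a caller with irqs_enabled == false gets
SKETCH_CALL_WOULD_DEADLOCK back instead of spinning, which is the
behaviour the report argues for on the context-switch path.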


* Re: [Linux-ia64] Preempt problems
  2003-02-03 20:17 [Linux-ia64] Preempt problems Peter Chubb
                   ` (4 preceding siblings ...)
  2003-02-14 20:05 ` Ray Bryant
@ 2003-02-14 20:11 ` Stephane Eranian
  2003-02-14 21:04 ` Ray Bryant
  6 siblings, 0 replies; 8+ messages in thread
From: Stephane Eranian @ 2003-02-14 20:11 UTC (permalink / raw)
  To: linux-ia64

Ray,

On Fri, Feb 14, 2003 at 02:05:53PM -0600, Ray Bryant wrote:
> 
> Does the deadlock you describe here look at all like the bug report that
> Jack Steiner has submitted for our Altix kernel?
> (We're using the O(1) scheduler and pfmon.)  (It certainly sounds
> similar.) Details attached.
> 
> Is anyone working on this issue that you know of?
> 
This problem was reported to me by Intel and others. It was affecting 
the 2.5 kernel only. However given that RHAS 2.1 is also using the
O(1) scheduler, I had to fix it in 2.4.18/RHAS. I don't exactly
know which kernel you are using at SGI. Note that the fix has been
propagated to 2.5.60 as well. I can provide the patch to you but
note that it also includes updates to perfmon to bring it closer
to what we have in 2.4.20. My  goal is to minimize the number of 
variations out there ;->


-- 
-Stephane



* Re: [Linux-ia64] Preempt problems
  2003-02-03 20:17 [Linux-ia64] Preempt problems Peter Chubb
                   ` (5 preceding siblings ...)
  2003-02-14 20:11 ` Stephane Eranian
@ 2003-02-14 21:04 ` Ray Bryant
  6 siblings, 0 replies; 8+ messages in thread
From: Ray Bryant @ 2003-02-14 21:04 UTC (permalink / raw)
  To: linux-ia64

Stephane,

We're using a 2.4.19 kernel with a number of patches and SGI fixes, and
some backports from 2.5, as well as our machine-specific stuff.  A more
complete description of what we run is:

--2.4.19 from www.kernel.org
--IA64 patch
--Atlas patches including discontiguous memory support, VM support
--O(1) scheduler -- including some fixes for large CPU counts, etc
--SGI machine specific H/W support
--SGI patches and fixes for bugs and scalability
--PAGG
--CSA (Comprehensive System Accounting)
--XSCSI (our version of SCSI support from IRIX -- this is a
module/closed source)

Of interest for this discussion is the O(1) scheduler and perfmon.

If you could send me your patch, I will take a look at it and see what
applies.

We will eventually be going to RHAS, but at the moment we are running a
"free-bytes" clone of RH 7.2.

I understand exactly why you are trying to keep the number of variations
small.

Thanks,

Stephane Eranian wrote:
> 
> Ray,
> 
> On Fri, Feb 14, 2003 at 02:05:53PM -0600, Ray Bryant wrote:
> >
> > Does the deadlock you describe here look at all like the bug report that
> > Jack Steiner has submitted for our Altix kernel?
> > (We're using the O(1) scheduler and pfmon.)  (It certainly sounds
> > similar.) Details attached.
> >
> > Is anyone working on this issue that you know of?
> >
> This problem was reported to me by Intel and others. It was affecting
> the 2.5 kernel only. However given that RHAS 2.1 is also using the
> O(1) scheduler, I had to fix it in 2.4.18/RHAS. I don't exactly
> know which kernel you are using at SGI. Note that the fix has been
> propagated to 2.5.60 as well. I can provide the patch to you but
> note that it also includes updates to perfmon to bring it closer
> to what we have in 2.4.20. My  goal is to minimize the number of
> variations out there ;->
> 
> --
> -Stephane

-- 
Best Regards,
Ray
-----------------------------------------------
                  Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------

