* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
[not found] <200508261342.j7QDgMCk015398@hiauly1.hia.nrc.ca>
@ 2005-08-27 14:40 ` Randolph Chung
2005-08-27 17:38 ` John David Anglin
0 siblings, 1 reply; 15+ messages in thread
From: Randolph Chung @ 2005-08-27 14:40 UTC (permalink / raw)
To: John David Anglin; +Cc: parisc-linux
> After another couple of hours of poking at this with gdb, I think
> we have a kernel bug. The saved r2 value in the signal context
> appears to contain the pc where the exception occurred instead of
> the r2 value when the exception occurred. As a result, the unwind
> process goes into a loop when an exception occurs in a function
> which hasn't saved r2 in the frame.
I thought I could reproduce this, but I couldn't. Using the attached
program, the signal handler does see the correct (different) r2 and iaoq
values in the signal frame. When compiled with -O2, bar() is a leaf function
with no frame, and r2 in the signal handler points to the return point of
foo() inside of main().
Did I misunderstand the problem that you are describing? I am testing
with 2.6.13-rc6-pa2 64-bit.
randolph
#include <stdio.h>
#include <stdlib.h>	/* exit() */
#include <string.h>	/* memset() */
#include <signal.h>
#include <sys/ucontext.h>

void baz(void)
{
	printf("in baz\n");
}

/* SA_SIGINFO handler: print the r2 and iaoq[0] saved in the signal context. */
void sighandler(int sig, siginfo_t *info, void *data)
{
	struct ucontext *ctx = (struct ucontext *)data;
	struct sigcontext *mctx = &ctx->uc_mcontext;

	printf("in sighandler\n");
	printf("data=%p, r2 = %lx, iaoq[0] = %lx\n", data,
	       (unsigned long)mctx->sc_gr[2],
	       (unsigned long)mctx->sc_iaoq[0]);
	baz();
	exit(0);
}

/* Leaf function with no frame at -O2; the null store raises SIGSEGV. */
void bar(void)
{
	int *x = 0;
	int r2;	/* unused */

	*x = 0;
}

void foo(void)
{
	printf("in foo\n");
	bar();
}

int main(int argc, char **argv)
{
	struct sigaction sact;

	memset(&sact, 0, sizeof(sact));
	sact.sa_flags = SA_SIGINFO;
	sact.sa_sigaction = sighandler;
	sigaction(SIGSEGV, &sact, NULL);
	foo();
	return 0;
}
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-27 14:40 ` [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_( Randolph Chung
@ 2005-08-27 17:38 ` John David Anglin
0 siblings, 0 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-27 17:38 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
> > After another couple of hours of poking at this with gdb, I think
> > we have a kernel bug. The saved r2 value in the signal context
> > appears to contain the pc where the exception occurred instead of
> > the r2 value when the exception occurred. As a result, the unwind
> > process goes into a loop when an exception occurs in a function
> > which hasn't saved r2 in the frame.
>
> I thought I could reproduce this, but I couldn't..... using the attached
> program, the signal handler does see the correct (different) r2 and iaoq
> in the signal frame. When compiled with -O2, bar() is a leaf function
> with no frame, r2 in the signal handler points to the return point of
> foo() inside of main().
>
> Did I misunderstand the problem that you are describing? I am testing
> with 2.6.13-rc6-pa2 64-bit.
> }
>
> void sighandler(int sig, siginfo_t *info, void *data)
> {
> struct ucontext *ctx = (struct ucontext *)data;
> struct sigcontext *mctx = &ctx->uc_mcontext;
> printf("in sighandler\n");
> printf("data=%p, r2 = %x, iaoq[0] = %x\n", data,
> mctx->sc_gr[2], mctx->sc_iaoq[0]);
> baz();
> exit(0);
Actually, the change that I originally posted fixes the problem.
I blew the testing when I tried to take a shortcut after the first
test failed because of a separate problem that had been introduced into
the tree. The problem was that libgcc didn't get rebuilt, as we don't have
proper dependencies on the linux-unwind.h file.
I did more debugging and the signal context generated by the
kernel is fine. The problem is simply that the frame state
built by pa32_fallback_frame_state has to include offsets for
both r2 and iaoq[0]. We need r2 when the function that causes
the exception hasn't saved r2. In that case, using the r2 slot
in the frame state as the unwind column causes the unwind machinery
to think that the function that caused the exception has the
exception point as its return location. You can see this if
you look at the context built in _Unwind_Backtrace for the
function that causes the exception.
We can simply use the r0 slot in the frame state for the return
address column in signal frames since r0 is never saved.
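The difference between the two layouts can be sketched with a toy model.
This is only an illustration under stated assumptions: toy_sigcontext,
toy_frame_state, toy_context and toy_unwind_signal_frame are hypothetical,
simplified stand-ins, not the real kernel or libgcc definitions, and the
CFA is arbitrarily taken to be the address of the context itself.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for illustration only; these are
 * not the real kernel sigcontext or libgcc frame-state types. */
struct toy_sigcontext {
	unsigned long gr[32];   /* general registers at the exception */
	unsigned long iaoq[2];  /* instruction address queue, iaoq[0] = pc */
};

struct toy_frame_state {
	int  saved[32];         /* 1 if the column has a saved-at-offset rule */
	long offset[32];        /* offset of the save slot from the CFA */
	int  retaddr_column;    /* column that carries the return address */
};

struct toy_context {
	unsigned long reg[32];  /* register values recovered for the caller */
	unsigned long ra;       /* address at which unwinding continues */
};

static long slot_offset(const unsigned long *slot, uintptr_t cfa)
{
	return (long)((uintptr_t)slot - cfa);
}

/* Build the frame state for a signal frame and apply it.
 * use_r0_column == 0: iaoq[0] is stored in the r2 column (old layout).
 * use_r0_column != 0: iaoq[0] is stored in the r0 column, r2 keeps gr[2]. */
void toy_unwind_signal_frame(const struct toy_sigcontext *sc,
			     int use_r0_column, struct toy_context *out)
{
	struct toy_frame_state fs;
	uintptr_t cfa = (uintptr_t)sc;  /* assumption: any fixed base works */
	int i;

	memset(&fs, 0, sizeof(fs));
	memset(out, 0, sizeof(*out));

	/* Every general register except r0 is restored from the context. */
	for (i = 1; i < 32; i++) {
		fs.saved[i] = 1;
		fs.offset[i] = slot_offset(&sc->gr[i], cfa);
	}

	/* The return address is iaoq[0]; only the column carrying it differs. */
	fs.retaddr_column = use_r0_column ? 0 : 2;
	fs.saved[fs.retaddr_column] = 1;
	fs.offset[fs.retaddr_column] = slot_offset(&sc->iaoq[0], cfa);

	/* Apply the rules: value = *(CFA + offset). */
	for (i = 0; i < 32; i++)
		if (fs.saved[i])
			out->reg[i] = *(const unsigned long *)
				(cfa + (uintptr_t)fs.offset[i]);
	out->ra = out->reg[fs.retaddr_column];
}
```

With use_r0_column set to 0, the recovered r2 equals iaoq[0], so a function
without a frame appears to return to the exception point. With it set to 1,
the original r2 survives and the return address is still iaoq[0].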
I'm now down to the following failures on my c3k under 2.6.8.1-pa11
(I reverted back because of the crashes under 2.6.13):
FAIL: Serialization output - gij test
FAIL: Serialization output - gij test
WARNING: program timed out.
FAIL: SyncTest execution - gij test
FAIL: negzero output - gij test
FAIL: negzero output - gij test
This is down from ~65 fails two weeks ago. The testsuite is now
running on 2.6.13-rc3-pa1 (64-bit SMP). In the past, the results
for some of the above have varied from one system to another.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
@ 2005-08-26 16:18 Joel Soete
0 siblings, 0 replies; 15+ messages in thread
From: Joel Soete @ 2005-08-26 16:18 UTC (permalink / raw)
To: dave; +Cc: parisc-linux, tsg45800
> > > After another couple of hours of poking at this with gdb, I think
> > > we have a kernel bug. The saved r2 value in the signal context
> > > appears to contain the pc where the exception occurred instead of
> > > the r2 value when the exception occurred. As a result, the unwind
> > > process goes into a loop when an exception occurs in a function
> > > which hasn't saved r2 in the frame.
> > >
> > Ah, that why my stress test made again hang the b2k apparently at the same
> > place as the original mail:
> > hpmc pointing to the same bad value (IAOQ = 0x00000000000172b8)
> >
> > while TOC always show a pb in internal_add_timer():
> > GR[02] == rp = 00000000101319b4
>
> I don't see how these are related. I'm talking about the value of
> r2 saved in the context passed to a signal handler. Normally, the
> code that generated an exception isn't restarted, so a bad value
> for r2 doesn't usually affect the application. In the java case,
> the java runtime is attempting to unwind the stack of the application
> and needs the r2 value for this. Note that this is the context
> passed to a user application.
Ah OK, so my problem is a different one :_(
>
> In the above, the TOC rp value is a kernel address. The hpmc
> IAOQ value is probably in PDC code. Most likely, this occurs when
> the machine is probing devices while rebooting. I see this on my
> c3k all the time when I press TOC.
>
And no chance to get more debugging information, as SysRq doesn't respond ;^(
Thanks,
Joel
^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <ILU3F9$C3BA108CF7E2FE4DECDD367F36FD15E5@scarlet.be>]
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
[not found] <ILU3F9$C3BA108CF7E2FE4DECDD367F36FD15E5@scarlet.be>
@ 2005-08-26 15:29 ` John David Anglin
0 siblings, 0 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-26 15:29 UTC (permalink / raw)
To: Joel Soete; +Cc: parisc-linux, tsg45800
> > After another couple of hours of poking at this with gdb, I think
> > we have a kernel bug. The saved r2 value in the signal context
> > appears to contain the pc where the exception occurred instead of
> > the r2 value when the exception occurred. As a result, the unwind
> > process goes into a loop when an exception occurs in a function
> > which hasn't saved r2 in the frame.
> >
> Ah, that why my stress test made again hang the b2k apparently at the same
> place as the original mail:
> hpmc pointing to the same bad value (IAOQ = 0x00000000000172b8)
>
> while TOC always show a pb in internal_add_timer():
> GR[02] == rp = 00000000101319b4
I don't see how these are related. I'm talking about the value of
r2 saved in the context passed to a signal handler. Normally, the
code that generated an exception isn't restarted, so a bad value
for r2 doesn't usually affect the application. In the java case,
the java runtime is attempting to unwind the stack of the application
and needs the r2 value for this. Note that this is the context
passed to a user application.
In the above, the TOC rp value is a kernel address. The hpmc
IAOQ value is probably in PDC code. Most likely, this occurs when
the machine is probing devices while rebooting. I see this on my
c3k all the time when I press TOC.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <200508241530.j7OFUFwS005854@hiauly1.hia.nrc.ca>]
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
[not found] <200508241530.j7OFUFwS005854@hiauly1.hia.nrc.ca>
@ 2005-08-25 15:59 ` Randolph Chung
2005-08-25 17:04 ` John David Anglin
0 siblings, 1 reply; 15+ messages in thread
From: Randolph Chung @ 2005-08-25 15:59 UTC (permalink / raw)
To: John David Anglin; +Cc: parisc-linux
> Still don't know. I have been working on the java testsuite which
> appears to cause most of the kernel bugs. I fixed a bug last weekend
> where the signal handler caused a SIGSEGV. An incorrect count was
> used to copy a mmapped block. We still have a problem with PR218.exe.
> The DWARF unwind fails to stop and allocates a huge amount of memory.
> PR218 is a simple test to test whether the runtime can catch a null
> pointer exception from a leaf routine. I think there is a bug in
> MD_FALLBACK_FRAME_STATE_FOR. It replaces the %r2 value with the
> iaoq[0] value and I think this fails when we need to unwind through
> a leaf function which doesn't save %r2.
Can you explain more where this is broken?
I'm assuming you are talking about this bit of code:
	fs->regs.reg[2].how = REG_SAVED_OFFSET;
	fs->regs.reg[2].loc.offset = (long) &sc->sc_iaoq[0] - new_cfa;
	fs->retaddr_column = 2;
sc_iaoq[0] should be the pc when the signal handler was triggered, so it
shouldn't matter if it's a leaf func or not. We are simply telling the
unwinder that reg[2] has the return address of the current (signal
handler) function, and the value of reg[2] can be computed with a stack
offset. When is this wrong?
randolph
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-25 15:59 ` Randolph Chung
@ 2005-08-25 17:04 ` John David Anglin
0 siblings, 0 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-25 17:04 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
> > Still don't know. I have been working on the java testsuite which
> > appears to cause most of the kernel bugs. I fixed a bug last weekend
> > where the signal handler caused a SIGSEGV. An incorrect count was
> > used to copy a mmapped block. We still have a problem with PR218.exe.
> > The DWARF unwind fails to stop and allocates a huge amount of memory.
> > PR218 is a simple test to test whether the runtime can catch a null
> > pointer exception from a leaf routine. I think there is a bug in
> > MD_FALLBACK_FRAME_STATE_FOR. It replaces the %r2 value with the
> > iaoq[0] value and I think this fails when we need to unwind through
> > a leaf function which doesn't save %r2.
>
> Can you explain more where this is broken?
No. I need to do more debugging.
> I'm assuming you are talking about this bit of code:
>
> 	fs->regs.reg[2].how = REG_SAVED_OFFSET;
> 	fs->regs.reg[2].loc.offset = (long) &sc->sc_iaoq[0] - new_cfa;
> 	fs->retaddr_column = 2;
Yes. See the alpha implementation that uses an alternate column
for signal handlers.
> sc_iaoq[0] should be the pc when the signal handler was triggered, so it
> shouldn't matter if it's a leaf func or not. We are simply telling the
> unwinder that reg[2] has the return address of the current (signal
> handler) function, and the value of reg[2] can be computed with a stack
> offset. When is this wrong?
I'm not sure. The func that triggers the exception looks like this:
00010cc4 <_ZN5PR2183fooEPS_>:
10cc4: 0f 28 10 9c ldw 4(r25),ret0
10cc8: 37 9c 00 08 ldo 4(ret0),ret0
10ccc: 0f 3c 12 88 stw ret0,4(r25)
10cd0: e8 40 c0 02 bv,n r0(rp)
It doesn't have a frame.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <4305FA88.9040404@tausq.org>]
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
[not found] <4305FA88.9040404@tausq.org>
@ 2005-08-19 18:41 ` John David Anglin
2005-08-20 17:51 ` John David Anglin
1 sibling, 0 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-19 18:41 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
> > TOC:
> >
> > r2 __handle_mm_fault+0x68 (return from call to pte_alloc_map)
> > IIA __handle_mm_fault+0x190
> >
> > It looks as if this might have been caused by Process_3.exe in the
> > libjava GCC testsuite.
>
> Does it always hang in the same place? How much RAM/swap do you have in
> your system? Does it hang more/less if you disable swap?
I've seen this fault before. I don't have enough data at this point
with 2.6.13 to say it always hangs in the same place.
RAM: 1GB
SWAP:
root@hiauly6:/home/dave# cat /proc/swaps
Filename Type Size Used Priority
/dev/sdb1 partition 265032 0 -1
/dev/sda1 partition 263144 0 -2
dave@hiauly6:~$ ulimit -v
524288
The above limit was sufficient to prevent the as crash associated with
the 1GB .block directive.
I don't know the answer to the last question at this time.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
[not found] <4305FA88.9040404@tausq.org>
2005-08-19 18:41 ` John David Anglin
@ 2005-08-20 17:51 ` John David Anglin
2005-08-21 14:37 ` Joel Soete
2005-08-21 15:53 ` Randolph Chung
1 sibling, 2 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-20 17:51 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
> > TOC:
> >
> > r2 __handle_mm_fault+0x68 (return from call to pte_alloc_map)
> > IIA __handle_mm_fault+0x190
> >
> > It looks as if this might have been caused by Process_3.exe in the
> > libjava GCC testsuite.
>
> Does it always hang in the same place? How much RAM/swap do you have in
> your system? Does it hang more/less if you disable swap?
Had another hang last night. Here is the TOC for it:
r2 intr_check_sig+0
IIA handle_interruption+0x30
r26,r5 code = 0x1a = 26 => Data Memory Access Rights Trap
This again occurred in the libjava testsuite:
byte compile: /home/dave/gnu/gcc-4.0/objdir/gcc/gcj -B/home/dave/gnu/gcc-4.0/objdir/hppa-linux/libjava/ -B/home/dave/gnu/gcc-4.0/objdir/gcc/ --encoding=UTF-8 -C -I/home/dave/gnu/gcc-4.0/objdir/hppa-linux/libjava/testsuite/../libgcj-4.1.0.jar -g /home/dave/gnu/gcc-4.0/gcc/libjava/testsuite/libjava.lang/SyncTest.java -d
/home/dave/gnu/gcc-4.0/objdir/hppa-linux/libjava/testsuite 2>@ stdout
PASS: SyncTest byte compilation
SyncTestSyncTest set_ld_library_path_env_vars: ld_library_path=.:/home/dave/gnu/gcc-4.0/objdir/hppa-linux/./libjava/.libs:/home/dave/gnu/gcc-4.0/objdir/gcc
Setting LD_LIBRARY_PATH to .:/home/dave/gnu/gcc-4.0/objdir/hppa-linux/./libjava/.libs:/home/dave/gnu/gcc-4.0/objdir/gcc:.:/home/dave/gnu/gcc-4.0/objdir/hppa-linux/./libjava/.libs:/home/dave/gnu/gcc-4.0/objdir/gcc:.:/home/dave/gnu/gcc-4.0/objdir/hppa-linux/./libjava/.libs:/home/dave/gnu/gcc-4.0/objdir/gcc:/home/dave/gnu/gcc-4.0/objdir/hppa-linux/libstdc++-v3/.libs:/home/dave/gnu/gcc-4.0/objdir/hppa-linux/libmudflap/.libs:/home/dave/gnu/gcc-4.0/objdir/hppa-linux/libssp/.libs:/home/dave/gnu/gcc-4.0/objdir/./gcc:/home/dave/gnu/gcc-4.0/objdir/./prev-gcc:/usr/lib/debug
Exception in thread "Thread-1" Exception in thread "Thread-2" Exception in thread "Thread-3" Exception in thread "Thread-4"
I found that Process_3.exe was generating unaligned exceptions. This
was a result of a new macro define being added to libjava
(UNWRAP_FUNCTION_DESCRIPTOR). This macro was incorrectly doing function
pointer canonicalization resulting in the unaligned exceptions. I was
testing a fix for this when I got the above hang.
I have the feeling that there is at least one more circumstance where
handle_exceptions can loop.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-20 17:51 ` John David Anglin
@ 2005-08-21 14:37 ` Joel Soete
2005-08-21 15:47 ` John David Anglin
2005-08-21 15:53 ` Randolph Chung
1 sibling, 1 reply; 15+ messages in thread
From: Joel Soete @ 2005-08-21 14:37 UTC (permalink / raw)
To: John David Anglin; +Cc: parisc-linux
John David Anglin wrote:
>>>TOC:
>>>
>>>r2 __handle_mm_fault+0x68 (return from call to pte_alloc_map)
>>>IIA __handle_mm_fault+0x190
>>>
>>>It looks as if this might have been caused by Process_3.exe in the
>>>libjava GCC testsuite.
>>
>>Does it always hang in the same place? How much RAM/swap do you have in
>>your system? Does it hang more/less if you disable swap?
>
>
> Had another hang last night. Here is the TOC for it:
>
> r2 intr_check_sig+0
> IIA handle_interruption+0x30
> r26,r5 code = 0x1a = 26 => Data Memory Access Rights Trap
>
Hmm, from a pure user point of view that sounds a bit like a segv?
And by the way, traps.c still contains some other such calls:
[...]
479 void handle_interruption(int code, struct pt_regs *regs)
480 {
[...]
737 force_sig_info(SIGSEGV, &si, current);
[...]
781 force_sig_info(SIGSEGV, &si, current);
[...]
This last one is especially interesting as (if I understand correctly) it ends handle_interruption() for case 26.
Unfortunately, I have not yet found how to fix it:
o as tausq did for signal.c
o s/force_sig_info(SIGSEGV, &si, current)/force_sig(SIGSEGV, current)/
Joel
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-21 14:37 ` Joel Soete
@ 2005-08-21 15:47 ` John David Anglin
0 siblings, 0 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-21 15:47 UTC (permalink / raw)
To: Joel Soete; +Cc: parisc-linux
> 737 force_sig_info(SIGSEGV, &si, current);
> [...]
> 781 force_sig_info(SIGSEGV, &si, current);
> [...]
>
> This last one is specialy interesting as (if I well understand) ending handle_interruption() for case 26.
>
> Unfortunately, not yet find how to fix it:
> o as tausq did for signal.c
> o s/force_sig_info(SIGSEGV, &si, current)/force_sig(SIGSEGV, current)/
Hmmm, possibly all SIGSEGV calls to force_sig_info are potentially
subject to looping. I see ia64 defines a function force_sigsegv_info.
The important difference being the following bit of code:
	if (sig == SIGSEGV) {
		/*
		 * Acquiring siglock around the sa_handler-update is almost
		 * certainly overkill, but this isn't a
		 * performance-critical path and I'd rather play it safe
		 * here than having to debug a nasty race if and when
		 * something changes in kernel/signal.c that would make it
		 * no longer safe to modify sa_handler without holding the
		 * lock.
		 */
		spin_lock_irqsave(&current->sighand->siglock, flags);
		current->sighand->action[sig - 1].sa.sa_handler = SIG_DFL;
		spin_unlock_irqrestore(&current->sighand->siglock, flags);
	}
where it looks as if it resets the handler for the signal to SIG_DFL.
This same bit of code is in force_sigsegv. If we were to do this, I
see that we need to handle SEGV_MAPERR and SI_KERNEL.
I think we may also want to reset the handler for SIGBUS.
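A user-space analogue of that reset can be sketched with sigaction(). This
is only an illustration, not the kernel change discussed above:
toy_reset_to_default and toy_is_default are hypothetical helper names, and
any real fix would live in the kernel's signal delivery path.

```c
#define _POSIX_C_SOURCE 200809L  /* for sigaction() under strict ISO C modes */
#include <signal.h>
#include <string.h>

/* Hypothetical helper: put the disposition of sig back to SIG_DFL, so
 * that a repeated fault terminates the process instead of re-entering
 * the handler. Returns 0 on success, -1 on error (as sigaction() does). */
int toy_reset_to_default(int sig)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = SIG_DFL;
	sigemptyset(&sa.sa_mask);
	return sigaction(sig, &sa, NULL);
}

/* Hypothetical helper: 1 if sig currently has the default disposition,
 * 0 if not, -1 if the query failed. */
int toy_is_default(int sig)
{
	struct sigaction cur;

	if (sigaction(sig, NULL, &cur) != 0)
		return -1;
	return cur.sa_handler == SIG_DFL;
}
```

Calling toy_reset_to_default(SIGSEGV) at the start of a SIGSEGV handler
gives the same one-shot behaviour from user space; installing the handler
with SA_RESETHAND achieves this without an explicit call.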
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-20 17:51 ` John David Anglin
2005-08-21 14:37 ` Joel Soete
@ 2005-08-21 15:53 ` Randolph Chung
2005-08-21 16:46 ` John David Anglin
1 sibling, 1 reply; 15+ messages in thread
From: Randolph Chung @ 2005-08-21 15:53 UTC (permalink / raw)
To: John David Anglin; +Cc: parisc-linux
> I found that Process_3.exe was generating unaligned exceptions. This
> was a result of a new macro define being added to libjava
> (UNWRAP_FUNCTION_DESCRIPTOR). This macro was incorrectly doing function
> pointer canonicalization resulting in the unaligned exceptions. I was
> testing a fix for this when I got the above hang.
Process_3.exe was causing unaligned traps, but it was not the process
running when the system hung, right?
> I have the feeling that there is at least one more circumstance where
> handle_exceptions can loop.
In this case, when the machine hung, was it the same as before, where it
will still respond to sysrq and pings?
In the previous case (stack overflow), I was able to debug this further
by running the hanging process in the background (i.e. ./cc1plus ... &).
Now when it hangs, I was still able to use the console and run strace on
it. I wonder if you ran the gcc testsuite in the background, will the
machine still respond sufficiently when it is "hung" that you can do
some more debugging? strace should tell you if you are having another
signal loop.
randolph
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-21 15:53 ` Randolph Chung
@ 2005-08-21 16:46 ` John David Anglin
0 siblings, 0 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-21 16:46 UTC (permalink / raw)
To: Randolph Chung; +Cc: parisc-linux
> Process_3.exe was causing unaligned traps, but it was not the process
> running when the system hung, right?
I'm not sure. It was the last test run in the log but this isn't a
reliable indication.
I installed a GCC fix last night which fixes Process_3.exe, so it doesn't
generate the unaligned faults anymore. The fix also repairs stack traces
under java. However, there's still a problem when gij is used
(PR libgcj/21692). I haven't figured out how to debug this.
> In this case, when the machine hung, was it the same as before, where it
> will still respond to sysrq and pings?
I'm not sure about sysrq but the machine was still responding to pings.
> In the previous case (stack overflow), I was able to debug this further
> by running the hanging process in the background (i.e. ./cc1plus ... &).
> Now when it hangs, I was still able to use the console and run strace on
> it. I wonder if you ran the gcc testsuite in the background, will the
> machine still respond sufficiently when it is "hung" that you can do
> some more debugging? strace should tell you if you are having another
> signal loop.
I actually run GCC builds and the testsuite in the background all
the time. So, it's definitely possible to do this from the console.
Problem is the machine is downtown at NRC and it's not set up with a
serial console. I know from past experience that ssh connections die.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
^ permalink raw reply [flat|nested] 15+ messages in thread
* [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
@ 2005-08-18 16:01 Joel Soete
2005-08-18 17:24 ` John David Anglin
2005-08-18 23:49 ` Randolph Chung
0 siblings, 2 replies; 15+ messages in thread
From: Joel Soete @ 2005-08-18 16:01 UTC (permalink / raw)
To: parisc-linux
Hello all,
The good stuff first.
I rebuilt glibc 2.3.5-3 with the latest gcc-4.0-4.0.1-5 (as Debian names it;
actually 4.0.2 20050816) and all seems to work fine: i.e. the dump of
libpthread-0.10.so shows that the problem pointed out by Randolph was indeed
fixed by Dave's patch :-).
I then installed it on a chroot disk to rebuild the latest 2.6.13-rc6-pa2
cvs 20050817 first:
no more segv :-).
I also installed this kernel on this chroot disk and it rebooted my b180
quite nicely :-).
I first ran Randolph's test, which now ends correctly with a segv :-)
And finally I launched my stress test (do you remember the 2 loops?), which
is still running (now for about 5h without problems ;^).)
The bad news now.
I also installed the latest gcc-4.0-4.0.1-5 on my unstable b2k install to
rebuild the same 32-bit kernel (2.6.13-rc6-pa2 cvs 20050817), but keeping
libc6 2.3.2.ds1-22 (apart from this, the two systems are Debian unstable,
updated to the same level this morning).
Well, the system reboots fine, but the same test as described above again
made the system hang after only 45 min of testing ;_(
Anyway, this time I could grab a piminfo with a TOC:
[I discarded the HPMC because IIA Offset = 0x00000000000172b8, which is not usable]
----------------- Processor 0 TOC Information -------------------
General Registers 0 - 31
00-03 0000000000000000 00000000104eb810 00000000101319b4 000000001dd45890
04-07 00000000104ec17c 00000000104ebda4 000000000000000a 00000000104ebda4
08-11 00000000104ebdac 00000000103c6010 0000000000200200 0000000010490698
12-15 0000000000000000 00000000ffffffff 0000000000000000 00000000f0400004
16-19 0000000010484140 00000000f000017c 00000000f0000174 00000000104ec17c
20-23 000000000042884a 00000000006a884a 0000000000280000 0000000000000262
24-27 000000000000000a 000000001dd454d0 00000000104ebda4 00000000103bd010
28-31 000000001dd45110 0000000000280000 0000000010484500 000000001012d958
<Press any key to continue (q to quit)>
Control Registers 0 - 31
00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000
04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
08-11 000000000000142e 0000000000000000 00000000000000c0 000000000000003e
12-15 0000000000000000 0000000000000000 000000000010b000 00000000ff800000
16-19 00000243adf4f356 0000000000000000 00000000101318f4 000000000f991280
20-23 0000000010240037 0000000051545110 000000ff0004ff0e 00000000da000000
24-27 0000000000479000 000000000d19a000 0000000000044021 00000000f0412000
28-31 0000000055555555 0000000055555555 0000000010484000 0000000010488000
Space Registers 0 - 7
00-03 00000000 00000a17 00000000 00000a17
04-07 00000000 00000000 00000000 00000000
IIA Space = 0x0000000000000000
IIA Offset = 0x00000000101318f8
CPU State = 0x9e000001
<Press any key to continue (q to quit)>
Memory Error Log Information:
Timestamp =
Thu Aug 18 14:43:55 GMT 2005 (20:05:08:18:14:43:55)
'9000/785 B,C,J Workstation Memory Error Log', rev 0, 64 bytes:
No memory errors logged
I/O Module Error Log Information:
Timestamp =
Thu Aug 18 14:43:55 GMT 2005 (20:05:08:18:14:43:55)
'9000/785 B,C,J Workstation IO Error Log', rev 0, 228 bytes:
Rope Word1 Word2 Word3
------ ------------ ------------
0 ---------- 0x0e0cc249 ------------------
1 0x00000000 0x1e0cc009 0x00000000fed32048
2 ---------- 0x2e0cc009 ------------------
3 ---------- 0x3e0cc009 ------------------
4 ---------- 0x4e0cc009 ------------------
5 ---------- 0x5e0cc009 ------------------
6 ---------- 0x6e0cc009 ------------------
7 ---------- 0x7e0cc009 ------------------
====<>====
The analysis of this TOC give:
----------------- Processor 0 TOC Information -------------------
GR of CPU[0]
00-03 0000000000000000 00000000104eb810 00000000101319b4 000000001dd=
45890
04-07 00000000104ec17c 00000000104ebda4 000000000000000a 00000000104=
ebda4
08-11 00000000104ebdac 00000000103c6010 0000000000200200 00000000104=
90698
12-15 0000000000000000 00000000ffffffff 0000000000000000 00000000f04=
00004
16-19 0000000010484140 00000000f000017c 00000000f0000174 00000000104=
ec17c
20-23 000000000042884a 00000000006a884a 0000000000280000 00000000000=
00262
24-27 000000000000000a 000000001dd454d0 00000000104ebda4 00000000103=
bd010
28-31 000000001dd45110 0000000000280000 0000000010484500 00000000101=
2d958
GR[02] =3D=3D rp =3D 00000000101319b4
Func: cascade, Off: 0x44, Addr: 0x101319b4
...
10131994: 80 83 20 40 cmpb,=3D r3,r4,101319bc <cascade+0x4c>
...
101319b0: 08 05 02 5a copy r5,r26
101319b4: 88 83 3f cf cmpb,<>,n r3,r4,101319a0 <cascade+0x30>
101319b8: 48 7c 00 30 ldw 18(,r3),ret0
101319bc: 0c 63 12 88 stw r3,4(,r3)
GR[22] =3D=3D t1(32bits) =3D=3D arg4(64bits) =3D 0000000000280000
GR[21] =3D=3D t2(32bits) =3D=3D arg5(64bits) =3D 00000000006a884a
GR[20] =3D=3D t3(32bits) =3D=3D arg6(64bits) =3D 000000000042884a
GR[19] =3D=3D t4(32bits) =3D=3D arg7(64bits) =3D 00000000104ec17c
GR[26] =3D=3D arg0 =3D 00000000104ebda4
GR[25] =3D=3D arg1 =3D 000000001dd454d0
GR[24] =3D=3D arg2 =3D 000000000000000a
GR[23] =3D=3D arg3 =3D 0000000000000262
GR[27] =3D=3D dp =3D 00000000103bd010
Func: $global$, Off: 0x0, Addr: 0x103bd010
GR[28] =3D=3D ret0 =3D 000000001dd45110
GR[29] =3D=3D ret1 or sl =3D 0000000000280000
GR[30] =3D=3D sp =3D 0000000010484500
GR[31] =3D=3D ble rp =3D 000000001012d958
Func: __do_softirq, Off: 0xe0, Addr: 0x1012d958
1012d950: e6 c0 20 00 be,l 0(sr4,r22),%sr0,%r31
1012d954: 08 1f 02 42 copy r31,rp
1012d958: e8 1f 1f 05 b,l 1012d8e0 <__do_softirq+0x68>,r0
1012d95c: 0d 06 12 88 stw r6,4(,r8)
CR of CPU[0]
00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000
04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
08-11 000000000000142e 0000000000000000 00000000000000c0 000000000000003e
12-15 0000000000000000 0000000000000000 000000000010b000 00000000ff800000
16-19 00000243adf4f356 0000000000000000 00000000101318f4 000000000f991280
20-23 0000000010240037 0000000051545110 000000ff0004ff0e 00000000da000000
24-27 0000000000479000 000000000d19a000 0000000000044021 00000000f0412000
28-31 0000000055555555 0000000055555555 0000000010484000 0000000010488000
CR[00] == rctr = 0000000000000000
CR[08] == (Protection ID) pidr1 = 000000000000142e
CR[10] == ccr = 00000000000000c0
CR[11] == sar = 000000000000003e
CR[14] == iva = 000000000010b000
CR[15] == eiem = 00000000ff800000
CR[16] == itmr = 00000243adf4f356
CR[17] == pcsq = 0000000000000000
CR[18] == pcoq = 00000000101318f4
CR[19] == iir = 000000000f991280
CR[20] == isr = 0000000010240037
CR[21] == ior = 0000000051545110
CR[22] == ipsw = 000000ff0004ff0e
CR[23] == eirw = 00000000da000000
CR[24] == tr0 (ptov) = 0000000000479000
CR[25] == tr1 (vtop) = 000000000d19a000
CR[26] == tr2 = 0000000000044021
CR[27] == tr3 = 00000000f0412000
CR[28] == tr4 = 0000000055555555
CR[29] == tr5 = 0000000055555555
CR[30] == tr6 = 0000000010484000
CR[31] == tr7 = 0000000010488000
SR of CPU[0]
00-03 00000000 00000a17 00000000 00000a17
04-07 00000000 00000000 00000000 00000000
SR[0] = 00000000
SR[1] = 00000a17
SR[2] = 00000000
SR[3] = 00000a17
SR[4] = 00000000
SR[5] = 00000000
SR[6] = 00000000
SR[7] = 00000000
Need much more work !!!
SR[00] == ts0 = 00000000
SR[01] == ts1 = 00000a17
SR[03] == cpp = 00000a17
Not parsable address!
...
IIA Offset = 0x00000000101318f8
...
e.g. IAOQ = 0x00000000101318f8
Parse IAOQ = 0x00000000101318f8 for CPU[0]
Func: internal_add_timer, Off: 0xdc, Addr: 0x101318f8
101318f0: 0e 79 12 88 stw r25,4(,r19)
101318f4: 0f 99 12 80 stw r25,0(,ret0)
101318f8: e8 40 d0 00 bve (rp)
101318fc: 0f 3c 12 88 stw ret0,4(,r25)
Any idea on this new hang?
Thanks,
Joel
-------------------------------------------------------
NOTE! My email address is changing to ... @scarlet.be
Please make the necessary changes in your address book.
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-18 16:01 Joel Soete
@ 2005-08-18 17:24 ` John David Anglin
2005-08-18 23:49 ` Randolph Chung
1 sibling, 0 replies; 15+ messages in thread
From: John David Anglin @ 2005-08-18 17:24 UTC (permalink / raw)
To: Joel Soete; +Cc: parisc-linux
> Well, the system rebooted fine, but the same test as described above made the
> system hang again after only 45 min of testing ;_(
Yah, my c3k is also hung. It got through one GCC full build and check,
but hung last night in the second.
Randolph's patch only fixed a hang caused by a stack overflow when
catching SIGSEGV. It's not going to fix anything that doesn't SEGV.
Note that my patch was to fix a glibc issue affecting signal handling.
It's doubtful that it would fix the hangs that you have been experiencing
with your stress test.
Dave
--
J. David Anglin dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6602)
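For readers unfamiliar with the scenario mentioned above (catching SIGSEGV), here is a minimal user-space sketch. It is not the kernel patch discussed in this thread; it only illustrates a handler that runs on an alternate signal stack so that it can still run when the normal stack is unusable. The names `survive_segv`, `on_segv`, `recover_point`, and `caught` are hypothetical, chosen for illustration.

```c
#define _XOPEN_SOURCE 700
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

static sigjmp_buf recover_point;      /* where the handler jumps back to */
static volatile sig_atomic_t caught;  /* set once the handler has run */

/* Hypothetical handler: record the fault and unwind to the recovery point. */
static void on_segv(int sig)
{
    (void)sig;
    caught = 1;
    siglongjmp(recover_point, 1);
}

/* Returns 1 if a deliberate SIGSEGV was caught and survived, 0 otherwise. */
int survive_segv(void)
{
    stack_t alt;
    struct sigaction sact;

    /* Give the handler its own stack, separate from the faulting one.
     * The buffer stays allocated because the stack remains registered. */
    alt.ss_sp = malloc(SIGSTKSZ);
    if (alt.ss_sp == NULL)
        return 0;
    alt.ss_size = SIGSTKSZ;
    alt.ss_flags = 0;
    if (sigaltstack(&alt, NULL) != 0)
        return 0;

    memset(&sact, 0, sizeof(sact));
    sigemptyset(&sact.sa_mask);
    sact.sa_flags = SA_ONSTACK;       /* run the handler on the alternate stack */
    sact.sa_handler = on_segv;
    if (sigaction(SIGSEGV, &sact, NULL) != 0)
        return 0;

    caught = 0;
    if (sigsetjmp(recover_point, 1) == 0) {
        volatile int *bad = NULL;
        *bad = 0;                     /* deliberate fault, as in bar() earlier in the thread */
    }
    return caught;
}
```

Calling `survive_segv()` from a small `main()` and printing its result is enough to check whether signal delivery works on a given kernel; a hang that occurs without any SIGSEGV would not be exercised by this sketch.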
* Re: [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_(
2005-08-18 16:01 Joel Soete
2005-08-18 17:24 ` John David Anglin
@ 2005-08-18 23:49 ` Randolph Chung
1 sibling, 0 replies; 15+ messages in thread
From: Randolph Chung @ 2005-08-18 23:49 UTC (permalink / raw)
To: Joel Soete; +Cc: parisc-linux
> I also installed the latest gcc-4.0-4.0.1-5 on my unstable b2k install to
> rebuild the same 32-bit kernel (2.6.13-rc6-pa2 cvs 20050817) but kept libc6
> 2.3.2.ds1-22 (apart from this, the two systems are Debian unstable, updated to the
> same level this morning).
> Well, the system rebooted fine, but the same test as described above made the
> system hang again after only 45 min of testing ;_(
does the system respond to sysrq when it hangs? if so it may be useful
to look at it with kdb.
does it always hang in internal_add_timer?
randolph
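To compare hang locations between runs, it may help to reduce each reported address to a function start plus offset, and to check that the IIR value matches the instruction bytes at PCOQ. The sketch below does only that arithmetic, using values copied from the dump earlier in this thread. The helper names `func_base` and `be_word` are hypothetical, and nothing here interprets the cause of the hang.

```c
#include <stdint.h>

/* Hypothetical helper: start address of a function, given an address
 * inside it and the offset the debugger reported for that address. */
uint32_t func_base(uint32_t addr, uint32_t off)
{
    return addr - off;
}

/* Hypothetical helper: assemble four instruction bytes, as printed in
 * the disassembly (big-endian), into one 32-bit word for comparison
 * with the low half of CR[19] (iir). */
uint32_t be_word(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

With the dump's values, `func_base(0x101318f8, 0xdc)` gives 0x1013181c for internal_add_timer and `func_base(0x101319bc, 0x4c)` gives 0x10131970 for cascade. The bytes `0f 99 12 80` at 101318f4 assemble to 0x0f991280, which equals the reported iir, and 101318f4 is also the reported pcoq.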
end of thread
Thread overview: 15+ messages
[not found] <200508261342.j7QDgMCk015398@hiauly1.hia.nrc.ca>
2005-08-27 14:40 ` [parisc-linux] Latest 2.6.13-rc6-pa2 test pr :_( Randolph Chung
2005-08-27 17:38 ` John David Anglin
2005-08-26 16:18 Joel Soete
[not found] <ILU3F9$C3BA108CF7E2FE4DECDD367F36FD15E5@scarlet.be>
2005-08-26 15:29 ` John David Anglin
[not found] <200508241530.j7OFUFwS005854@hiauly1.hia.nrc.ca>
2005-08-25 15:59 ` Randolph Chung
2005-08-25 17:04 ` John David Anglin
[not found] <4305FA88.9040404@tausq.org>
2005-08-19 18:41 ` John David Anglin
2005-08-20 17:51 ` John David Anglin
2005-08-21 14:37 ` Joel Soete
2005-08-21 15:47 ` John David Anglin
2005-08-21 15:53 ` Randolph Chung
2005-08-21 16:46 ` John David Anglin
-- strict thread matches above, loose matches on Subject: below --
2005-08-18 16:01 Joel Soete
2005-08-18 17:24 ` John David Anglin
2005-08-18 23:49 ` Randolph Chung