* [parisc-linux] 2.5 randomly kills applications with page faults
@ 2002-12-18 16:37 James Bottomley
2002-12-18 17:02 ` Randolph Chung
0 siblings, 1 reply; 12+ messages in thread
From: James Bottomley @ 2002-12-18 16:37 UTC (permalink / raw)
To: parisc-linux; +Cc: James.Bottomley
I find when booting 2.5.51 up on a C380 that applications seem to take random
page faults and die. It seems that the more heavily an application does file
accesses, the more likely it is to suffer from this.
In debugging the problems, so far it has always been stack manipulation
instructions in the user level code causing this. Further, on adding a
register dump to the page fault debugging code, the reason is that the stack
pointer is way out of where it should be for a user process (around 0x4f000),
so I surmise it got clobbered on a rare return path from kernel to user.
Does anyone have any additional information or pointers? I'm trying to audit
entry.S to see if there is a little-used path that can clobber the stack, but
my parisc assembly isn't the best...
James
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-18 16:37 [parisc-linux] 2.5 randomly kills applications with page faults James Bottomley
@ 2002-12-18 17:02 ` Randolph Chung
2002-12-20 22:12 ` James Bottomley
0 siblings, 1 reply; 12+ messages in thread
From: Randolph Chung @ 2002-12-18 17:02 UTC (permalink / raw)
To: James Bottomley; +Cc: parisc-linux
> In debugging the problems, so far it has always been stack manipulation
> instructions in the user level code causing this. Further, on adding a
ditto.... there's a note about this in the todo....
> register dump to the page fault debugging code, the reason is that the stack
> pointer is way out of where it should be for a user process (around 0x4f000),
> so I surmise it got clobbered on a rare return path from kernel to user.
> Does anyone have any additional information and pointers? I'm trying to audit
> entry.S to see if there is a little used path that can clobber the stack, but
> my parisc assembly isn't the best...
>
that's what i thought too, so i went through entry.S as well to see what
i can find. haven't found anything yet :(
i was able to get the kernel to die simply by having a program do
gettimeofday() in a loop with 2.5... i would guess it's a case where we
have to do some work on the syscall return path (resched, softirq, etc)
that's clobbering things, but i don't know what it is.
randolph
--
Randolph Chung
Debian GNU/Linux Developer, hppa/ia64 ports
http://www.tausq.org/
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-18 17:02 ` Randolph Chung
@ 2002-12-20 22:12 ` James Bottomley
2002-12-20 22:19 ` John David Anglin
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: James Bottomley @ 2002-12-20 22:12 UTC (permalink / raw)
To: Randolph Chung; +Cc: James Bottomley, parisc-linux
[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]
randolph@tausq.org said:
> that's what i thought too, so i went through entry.S as well to see
> what i can find. haven't found anything yet :(
OK, I think I found the cause of this and the solution.
The cause is in syscall.S in linux_gateway_entry. Some person (hereinafter
referred to as "the guilty party") added a patch to store the user stack on
the kernel stack temporarily before stashing it correctly in the user pt_regs:
STREG %r1,0(%r30) /* Stick r1 (usp) here for now */
The problem is that they forgot to increment the stack pointer. Thus, if we
take an interruption between this instruction and the corresponding retrieval,
the value can be trashed.
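A minimal sketch of that race, as a toy Python model rather than kernel code. The names (`mem`, `sp`, `interruption`) and the address 100 are made up for illustration; the only things taken from the description above are that the user sp is parked at 0(%r30) without advancing %r30, and that an interruption then uses the same stack.

```python
# Toy model of the race, NOT kernel code. `mem` is a dict of word slots,
# `sp` plays the role of %r30, and the stack grows upward as on parisc.

def interruption(mem, sp, frame_words=4):
    """Hypothetical handler: borrows the current stack, writing from sp upward."""
    for i in range(frame_words):
        mem[sp + i] = "irq-scratch"

def syscall_entry_buggy(usp, interrupted):
    mem, sp = {}, 100          # 100 is an arbitrary kernel stack slot
    mem[sp] = usp              # STREG %r1,0(%r30): sp is NOT advanced
    if interrupted:
        interruption(mem, sp)  # handler assumes everything at sp is free
    return mem[sp]             # LDREG 0(%r30),%r2: "get users sp back"

print(syscall_entry_buggy(0x1234, interrupted=False) == 0x1234)  # True
print(syscall_entry_buggy(0x1234, interrupted=True))             # irq-scratch
```

In the model, the saved value only survives when nothing runs between the store and the load, which matches the "random" character of the failures.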
The fix is simple: increment the stack pointer. I chose 16 to preserve every
alignment I can think of; is that also safe for 64 bit?
With this fix, my system seems fairly solid. It has survived my bitkeeper and
stress tests so far (about 30 min); previously it always collapsed within a few
minutes.
James
P.S. After this little debug frenzy, I don't personally care if I ever see
another line of parisc assembly again, so if another obscure register trashing
problem turns up, my good deed is done...
James
[-- Attachment #2: tmp.diff --]
[-- Type: text/plain , Size: 809 bytes --]
===== arch/parisc/kernel/syscall.S 1.5 vs edited =====
--- 1.5/arch/parisc/kernel/syscall.S Fri Nov 29 04:31:54 2002
+++ edited/arch/parisc/kernel/syscall.S Fri Dec 20 15:46:40 2002
@@ -94,6 +94,7 @@
mtsp %r0,%sr7 /* get kernel space into sr7 */
STREG %r1,0(%r30) /* Stick r1 (usp) here for now */
+ ldo 16(%r30),%r30
mfctl %cr30,%r1 /* get task ptr in %r1 */
LDREG TI_TASK(%r1),%r1
@@ -104,7 +105,8 @@
PSW value is stored. This is needed for gdb and sys_ptrace. */
STREG %r0, TASK_PT_PSW(%r1)
STREG %r2, TASK_PT_GR2(%r1) /* preserve rp */
- LDREG 0(%r30), %r2 /* get users sp back */
+ LDREG -16(%r30), %r2 /* get users sp back */
+ ldo -16(%r30), %r30
STREG %r2, TASK_PT_GR30(%r1) /* ... and save it */
STREG %r19, TASK_PT_GR19(%r1)
STREG %r20, TASK_PT_GR20(%r1)
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-20 22:12 ` James Bottomley
@ 2002-12-20 22:19 ` John David Anglin
2002-12-20 22:37 ` Grant Grundler
2002-12-21 1:38 ` Grant Grundler
2 siblings, 0 replies; 12+ messages in thread
From: John David Anglin @ 2002-12-20 22:19 UTC (permalink / raw)
To: James Bottomley; +Cc: randolph, James.Bottomley, parisc-linux
> @@ -104,7 +105,8 @@
> PSW value is stored. This is needed for gdb and sys_ptrace. */
> STREG %r0, TASK_PT_PSW(%r1)
> STREG %r2, TASK_PT_GR2(%r1) /* preserve rp */
> - LDREG 0(%r30), %r2 /* get users sp back */
> + LDREG -16(%r30), %r2 /* get users sp back */
> + ldo -16(%r30), %r30
I believe you can combine the above two instructions into one with
a ",ma" completer.
Dave
--
J. David Anglin dave.anglin@nrc.ca
National Research Council of Canada (613) 990-0752 (FAX: 952-6605)
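For readers unfamiliar with the base-modification completers, here is a toy Python model of how I understand them: ",ma" (modify after) uses the old base register value as the address and then adds the displacement, while ",mb" (modify before) adds the displacement first and then uses the new value. The `regs` and `mem` dictionaries and the function names are hypothetical; this is a sketch of the semantics, not of any real assembler macro.

```python
# Toy model of base-register modification; `regs` and `mem` are hypothetical.

def store_ma(regs, mem, src, disp, base):
    """Store, modify-after: use the old base as the address, then bump it."""
    mem[regs[base]] = regs[src]
    regs[base] += disp

def load_mb(regs, mem, disp, base, dst):
    """Load, modify-before: bump the base first, then load from the new address."""
    regs[base] += disp
    regs[dst] = mem[regs[base]]

regs = {"r1": 0xCAFE, "r2": 0, "r30": 0x1000}
mem = {}
store_ma(regs, mem, "r1", 16, "r30")  # saves at 0x1000, then r30 becomes 0x1010
load_mb(regs, mem, -16, "r30", "r2")  # r30 drops back to 0x1000, then the load happens
```

Under this model, the store side pairs naturally with modify-after, and a load that wants the slot at -16 pairs with modify-before, so each two-instruction sequence in the patch collapses into one instruction.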
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-20 22:12 ` James Bottomley
2002-12-20 22:19 ` John David Anglin
@ 2002-12-20 22:37 ` Grant Grundler
2002-12-22 7:11 ` Grant Grundler
2002-12-21 1:38 ` Grant Grundler
2 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2002-12-20 22:37 UTC (permalink / raw)
To: James Bottomley; +Cc: Randolph Chung, parisc-linux
On Fri, Dec 20, 2002 at 04:12:37PM -0600, James Bottomley wrote:
> The fix is simple: increment the stack pointer. I chose 16 to preserve every
> alignment I can think of is that also safe for 64 bit?
needs to be 64 bytes - lamont confirmed.
> With this fix, my system seems fairly solid. It survives my bitkeeper and
> stress tests so far (about 30 min) previously it always collapsed within
> a few minutes.
excellent!
> P.S. After this little debug frenzy, I don't personally care if I ever see
> another line of parisc assembly again, so if another obscure register
> trashing problem turns up, my good deed is done...
*G*
/me kowtows to the east several times...
> STREG %r1,0(%r30) /* Stick r1 (usp) here for now */
> + ldo 16(%r30),%r30
As David observed, this wants to use ",ma" and I'll work that out with
lamont/helge offline.
Something will get committed to 2.4.x/2.5.x trees this afternoon.
thanks - what a wonderful Christmas present! :^)
grant
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-20 22:12 ` James Bottomley
2002-12-20 22:19 ` John David Anglin
2002-12-20 22:37 ` Grant Grundler
@ 2002-12-21 1:38 ` Grant Grundler
2002-12-21 1:46 ` James Bottomley
2 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2002-12-21 1:38 UTC (permalink / raw)
To: James Bottomley; +Cc: Randolph Chung, parisc-linux
On Fri, Dec 20, 2002 at 04:12:37PM -0600, James Bottomley wrote:
...
> The problem is that they forgot to increment the stack pointer. Thus, if we
> take an interruption between this instruction and the corresponding
> retrieval, the value can be trashed.
It doesn't look like this bug is present in 2.4.
Richard suspects it was introduced when the task struct was split
from the stack. (I hope I recall his statement correctly)
The new code sequence is:
95 mtsp %r0,%sr7 /* get kernel space into sr7 */
96 STREGM %r1,FRAME_SIZE(%r30) /* save r1 (usp) here for now */
97 mfctl %cr30,%r1 /* get task ptr in %r1 */
98 LDREG TI_TASK(%r1),%r1
105 STREG %r0, TASK_PT_PSW(%r1)
106 STREG %r2, TASK_PT_GR2(%r1) /* preserve rp */
107 LDREGM -FRAME_SIZE(%r30), %r2 /* get users sp back */
108 STREG %r2, TASK_PT_GR30(%r1) /* ... and save it */
where STREGM/LDREGM are new macros that use st<X>,ma instructions.
I'll commit this once I see it boots on my c3000.
But, given the assertion we could take an interrupt between line 96 and
107, would an interrupt between 95/96 cause Bad Things (tm) to happen?
grant
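A rough Python model of why the new sequence should survive an interruption between lines 96 and 107. It is a sketch under assumptions: FRAME_SIZE is set to 64 only because of the "needs to be 64 bytes" remark earlier in the thread, the handler and addresses are invented, and line 107 is modelled as a pre-decrement by FRAME_SIZE followed by the load, which is my reading of the intent.

```python
FRAME_SIZE = 64  # assumed value, taken from the "needs to be 64 bytes" remark

def interruption(mem, sp, nbytes=32):
    """Hypothetical handler: scribbles on the stack from sp upward."""
    for off in range(0, nbytes, 4):
        mem[sp + off] = "irq-scratch"

def syscall_entry_fixed(usp, interrupted):
    mem, sp = {}, 0x1000       # arbitrary kernel stack address
    mem[sp] = usp              # line 96: store at the old sp ...
    sp += FRAME_SIZE           # ... and advance sp in the same instruction
    if interrupted:
        interruption(mem, sp)  # handler now starts above the saved slot
    sp -= FRAME_SIZE           # line 107: step back down ...
    saved = mem[sp]            # ... and load the saved user sp
    return saved, sp

print(syscall_entry_fixed(0x4F000, interrupted=True) == (0x4F000, 0x1000))  # True
```

The saved slot now sits below the stack pointer for the whole window, so the handler's writes land in the reserved frame above it.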
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-21 1:38 ` Grant Grundler
@ 2002-12-21 1:46 ` James Bottomley
2002-12-21 4:34 ` Grant Grundler
0 siblings, 1 reply; 12+ messages in thread
From: James Bottomley @ 2002-12-21 1:46 UTC (permalink / raw)
To: Grant Grundler; +Cc: James Bottomley, Randolph Chung, parisc-linux
grundler@dsl2.external.hp.com said:
> where STREGM/LDREGM are new macros that use st<X>,ma instructions.
Actually, I found STREG,ma and LDREG,mb worked for me.
grundler@dsl2.external.hp.com said:
> but, given the assertion we could take an interrupt between line 96 and
> 107, would an interrupt between 95/96 cause Bad Things (tm) to happen?
Not according to the parisc assembler manual. As long as we can guarantee
that the stack is incremented before the value is stored (which is what
STREG,ma seems to assure), we should be fine.
James
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-21 1:46 ` James Bottomley
@ 2002-12-21 4:34 ` Grant Grundler
2002-12-21 5:03 ` James Bottomley
0 siblings, 1 reply; 12+ messages in thread
From: Grant Grundler @ 2002-12-21 4:34 UTC (permalink / raw)
To: James Bottomley; +Cc: Randolph Chung, parisc-linux
On Fri, Dec 20, 2002 at 07:46:57PM -0600, James Bottomley wrote:
> Actually, I found STREG,ma and LDREG,mb worked for me.
doh! of course!
those are cpp macros, not asm macros.
> > but, given the assertion we could take an interrupt between line 96 and
> > 107, would an interrupt between 95/96 cause Bad Things (tm) to happen?
>
> Not according to the parisc assembler manual. As long as we can guarantee
> that the stack is incremented before the value is stored (which seems to be
> what STREG,ma seems to assure), we should be fine.
I'm not worried about the atomicity of the instruction.
I'm worried about sr7 getting modified without the user stack
pointer getting saved to the proper place. It might not be a
problem at all. I just don't know all the uses of user/kernel
stacks in the interrupt code paths. I'm wondering if the entire
code sequence I quoted needs to block interrupts while setting
up the syscall.
grant
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-21 4:34 ` Grant Grundler
@ 2002-12-21 5:03 ` James Bottomley
0 siblings, 0 replies; 12+ messages in thread
From: James Bottomley @ 2002-12-21 5:03 UTC (permalink / raw)
To: Grant Grundler; +Cc: James Bottomley, Randolph Chung, parisc-linux
grundler@dsl2.external.hp.com said:
> I'm worried about sr7 getting modified without the user stack pointer
> getting saved to the proper place. It might not be a problem at all. I
> just don't know all the uses of user/kernel stacks in the interrupt
> code paths. I'm wondering if the entire code sequence I quoted needs
> to block interrupts while setting up the syscall.
I don't think that's a problem.
An interruption can occur anywhere, and thus it saves all registers. The only
problem is that on parisc there aren't separate irq stacks, so the
interruption expects to be able to use the current kernel stack (whatever it
is). As long as the kernel stack is always correctly set up when %sr7 points
to kernel space, we should be fine. If we take an interruption before zeroing
sr7, we go through the procedure to obtain a kernel stack for an executing
user process (however, in this case, the interruption will stash the registers
in the task structure, so we can't modify the task structure until we've
changed sr7 to kernel space). Also note, we can't use the kernel stack until
sr7 is in kernel space.
James
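The rule described above can be sketched as a small decision function. This is a toy Python model with invented names (`stack_for_interruption`, `task_kernel_stack`), not the entry.S code; it only encodes the stated invariant that %sr7 == 0 means "kernel space, current stack is usable" and anything else means "came from user space, obtain the task's kernel stack".

```python
KERNEL_SPACE = 0  # %sr7 is zeroed by "mtsp %r0,%sr7" on syscall entry

def stack_for_interruption(sr7, r30, task_kernel_stack):
    """Pick the stack an interruption handler would run on (toy model)."""
    if sr7 == KERNEL_SPACE:
        # Already in kernel space: the handler trusts %r30 as a valid
        # kernel stack, so %r30 must be correct whenever sr7 is zero.
        return r30
    # Interrupted a user process: ignore the user %r30 and use the
    # kernel stack that belongs to the task.
    return task_kernel_stack

# Before sr7 is zeroed, the user stack pointer is never used by the handler.
print(hex(stack_for_interruption(sr7=0x2A, r30=0xFAF00000, task_kernel_stack=0x10480000)))
# After sr7 is zeroed, whatever is in %r30 is what the handler writes on.
print(hex(stack_for_interruption(sr7=0, r30=0x10480040, task_kernel_stack=0x10480000)))
```

Seen this way, the window between zeroing sr7 and reserving the slot is safe only if nothing live sits at or above %r30, which is exactly what the unprotected store violated.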
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-20 22:37 ` Grant Grundler
@ 2002-12-22 7:11 ` Grant Grundler
2002-12-22 10:17 ` Helge Deller
2002-12-22 16:35 ` James Bottomley
0 siblings, 2 replies; 12+ messages in thread
From: Grant Grundler @ 2002-12-22 7:11 UTC (permalink / raw)
To: James Bottomley; +Cc: Randolph Chung, parisc-linux
On Fri, Dec 20, 2002 at 03:37:47PM -0700, Grant Grundler wrote:
> > STREG %r1,0(%r30) /* Stick r1 (usp) here for now */
> > + ldo 16(%r30),%r30
>
> As David observed, this wants to use ",ma" and I'll work that out with
> lamont/helge offline.
> Something will get committed to 2.4.x/2.5.x trees this afternoon.
*sigh*. Can someone else test/commit this?
I've not been able to boot a 2.5.51 kernel on either A500-44 or C3000.
I won't commit what I can't test when others are obviously booting
2.5.51 kernels. My last attempts on A500-44 data page fault on
a NULL dereference in "$lctu_loop+8" (GR02 is sys_nanosleep+13c).
Panic seems to be after (AFAICT) all the RC scripts have run.
I'm at the end of my rope on 2.5.51.
I've placed the diff on ftp.parisc-linux.org:/patches/diff-2.5.51-pa5
The patch *looks* right and fixes a few other nits.
Should work on both PA1.1 and PA2.0 machines.
(and we have to use STREGM/LDREGM to support both variants).
thanks again and apologies for not committing it as promised.
grant
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-22 7:11 ` Grant Grundler
@ 2002-12-22 10:17 ` Helge Deller
2002-12-22 16:35 ` James Bottomley
1 sibling, 0 replies; 12+ messages in thread
From: Helge Deller @ 2002-12-22 10:17 UTC (permalink / raw)
To: Grant Grundler, James Bottomley; +Cc: Randolph Chung, parisc-linux
On Sunday 22 December 2002 08:11, Grant Grundler wrote:
> On Fri, Dec 20, 2002 at 03:37:47PM -0700, Grant Grundler wrote:
> > > STREG %r1,0(%r30) /* Stick r1 (usp) here for now */
> > > + ldo 16(%r30),%r30
> >
> > As David observed, this wants to use ",ma" and I'll work that out with
> > lamont/helge offline.
> > Something will get committed to 2.4.x/2.5.x trees this afternoon.
>
> *sigh*. Can someone else test/commit this?
Seems to work without any problems on my 715/64.
I committed your changes to CVS (2.5.51-pa5).
Helge
* Re: [parisc-linux] 2.5 randomly kills applications with page faults
2002-12-22 7:11 ` Grant Grundler
2002-12-22 10:17 ` Helge Deller
@ 2002-12-22 16:35 ` James Bottomley
1 sibling, 0 replies; 12+ messages in thread
From: James Bottomley @ 2002-12-22 16:35 UTC (permalink / raw)
To: Grant Grundler; +Cc: James Bottomley, Randolph Chung, parisc-linux
grundler@dsl2.external.hp.com said:
> I've placed the diff on ftp.parisc-linux.org:/patches/diff-2.5.51-pa5
> The patch *looks* right and fixes a few other nits. Should work on
> both PA1.1 and PA2.0 machines. (and we have to use STREGM/LDREGM to
> support both variants).
Boots and runs fine for me on my C360. I'm on 2.5.52-BK latest, rather than
2.5.51 as the base, though. Seems to be robust to my bitkeeper tests. I'm
going to use this as my development machine kernel now to see if I can find
any other problems.
James
Thread overview: 12+ messages
2002-12-18 16:37 [parisc-linux] 2.5 randomly kills applications with page faults James Bottomley
2002-12-18 17:02 ` Randolph Chung
2002-12-20 22:12 ` James Bottomley
2002-12-20 22:19 ` John David Anglin
2002-12-20 22:37 ` Grant Grundler
2002-12-22 7:11 ` Grant Grundler
2002-12-22 10:17 ` Helge Deller
2002-12-22 16:35 ` James Bottomley
2002-12-21 1:38 ` Grant Grundler
2002-12-21 1:46 ` James Bottomley
2002-12-21 4:34 ` Grant Grundler
2002-12-21 5:03 ` James Bottomley