[parisc-linux] Linux syscall ABI


* [parisc-linux] Linux syscall ABI
@ 2000-02-14  9:30 John Marvin
  2000-02-14 13:34 ` Philipp Rumpf
  0 siblings, 1 reply; 15+ messages in thread
From: John Marvin @ 2000-02-14  9:30 UTC (permalink / raw)
  To: parisc-linux

I've been talking with willy about the Linux syscall ABI, and now I'd
like to get some input from the rest of you regarding how it should
be handled.

As most of you are aware, HP-UX uses some parisc specific features,
namely the gate instruction used on a page mapped with privilege
promotion access rights (i.e. a gateway page), to implement HP-UX
syscalls. HP-UX puts this gateway page at 0xC0000000 in the user's
address space (which on HP-UX is in a shared quadrant, so only one
entry is needed in the TLB for all user processes).

Currently I've implemented a Linux syscall gateway page at 0xC0010000,
but since we don't have anything to be binary compatible with for
parisc linux applications, we can do things differently. I'd like
to throw out a few proposals and see what you all think. Feel free
to suggest other ideas.

Proposal #1:

Don't use a gateway page. Use a more "traditional" trapping instruction,
and handle syscalls in the fault path. We could use a subset of the
available break instructions, or we could "dedicate" a trap (the break
instruction trap handler will have to be shared with debugger support),
like the privileged register trap, or any of a few other traps that
a user program should not run into in the normal course of execution.

The disadvantage with this method is that I don't believe it can be made
to perform as well.  Even if we dedicate a particular trap for handling
syscalls, we still need to do at least 4 mtctl instructions (which on many
parisc processors take 2 states each, and don't bundle for multiple issue)
to reload the space queue and offset queue, plus an rfi instruction, in
order to return to virtual mode in the kernel.  This method also will
defeat any advantages from branch prediction.

All of the other proposals below deal with using a gateway page. I
personally believe that using a gateway page is the better choice.
However, on parisc linux we are capable of supporting a ~4 Gb linear
address space for user processes. I don't think locating the gateway
page at the ~3 Gb mark is a good idea, since it prevents heap expansion
beyond that point (this is a problem I am currently trying to work around
on HP-UX for customers who need this kind of large address space and
are not yet willing to port to 64 bit). I can think of no good reason
to put the gateway page in the middle of the user address space somewhere.
The remaining proposals have to do with where the Linux gateway page
should be located.

I should mention here that we do not currently plan on having any globally
shared quadrants in the user address space for parisc linux. Therefore
whether or not an HP-UX gateway page is mapped into the address space
can be determined on a per process basis. I can see no reason to map
an HP-UX gateway page into the address space for native parisc linux
processes (as opposed to HP-UX processes running on parisc linux).

Proposal #2:

Map the Linux syscall gateway page at the top end of the user address space.
What this top end address would be has yet to be determined. Depending
on how we support mapping I/O devices into the user address space, we
may want to reserve the 0xF0000000-0xFFFFFFFF range for IO (keeping the
device mapped at its equivalent address in the kernel address space).
This may also be necessary for routines like memcpy (so it can easily
determine if the address is an IO mapped address), which if used on IO
addresses have to do things differently, assuming that memcpy is optimized
for performance.

Proposal #3:

Map the Linux syscall gateway page near the bottom end of the user's
address space.  We could define the default text start for parisc linux
processes such that it leaves room for a gateway page below it.

Proposal #4:

Map the Linux syscall gateway page at the very bottom end of the user's
address space, i.e. 0x00000000! Note that gateway pages are execute only,
so processes would still fault on a data null pointer dereference. We
could put some trapping code at the beginning of the gateway page to
catch anyone branching through a null function pointer.

One disadvantage of this proposal is that we could not support the
System V personality null pointer dereference behaviour. This maps
a page of zeros at location 0 so that null pointer dereferences will
return 0 for buggy software. Do we really still need to maintain this
ancient hack?

A slight advantage of this proposal is that it eliminates one instruction
(yes, one whole instruction!) from the syscall path. The general syscall
stub for a user space gateway page looks something like this:

	ldil L%<gateway address>,%r1
	ble  R%<gateway address>(%sr?,%r1)
	ldi <syscall #>,%r20

With the gateway page at 0 we don't need the ldil and can do just:

	ble <gateway page offset>(%sr4,%r0)
	ldi <syscall #>,%r20
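
To make the ldil/ble split concrete, here is a minimal C sketch of how a
gateway address divides into the part ldil loads (L%) and the part the ble
displacement carries (R%). It assumes the usual 21-bit/11-bit split for
these field selectors; the function names are made up for illustration.

```c
#include <stdint.h>

/* Assumption: L% keeps the upper 21 bits, R% keeps the lower 11 bits. */
#define RIGHT_BITS 11u
#define RIGHT_MASK ((1u << RIGHT_BITS) - 1u)   /* 0x7ff */

/* Value that "ldil L%addr,%r1" would leave in %r1. */
static uint32_t left_part(uint32_t addr)
{
    return addr & ~RIGHT_MASK;
}

/* Displacement that "ble R%addr(%sr,%r1)" would add to %r1. */
static uint32_t right_part(uint32_t addr)
{
    return addr & RIGHT_MASK;
}

/* The branch target is simply the two parts added back together. */
static uint32_t branch_target(uint32_t addr)
{
    return left_part(addr) + right_part(addr);
}
```

For the current gateway at 0xC0010000 the left part is the whole address
and the right part is 0. With the gateway at 0 the left part is 0, which
is why the ldil can be dropped and %r0 used as the base.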

Proposal #5:

Locate the gateway page in the kernel address space (space 0).  This will
be more efficient with respect to tlb usage.  It will add an instruction
to the syscall stub (perhaps an instruction or two can be reclaimed
on the gateway page in return, see below).

It is more efficient re: tlb usage for two reasons.  The first reason is
that since there is only one kernel address space, we only need one entry
in the tlb to map the page.  For user space gateway pages every process
will have its own mapping (aliased to the same page).  I should mention
here that every process will have its own unique space value, and we will
not need to flush the tlb on context switches. The second reason is
that we could locate the syscall return path on the gateway page, so
the syscall path will not need to run through another address range
(the syscall return code) that it could miss on. The kernel system
calls are written in C, and therefore cannot do a long branch back onto
the gateway page, which would be necessary if the gateway page is not
located in the kernel address space. If the gateway page is located in
the kernel address space the system calls can return there for the
syscall return path (check pending signals, rescheds, etc.) before
doing a long branch back to user space. We may also be able to save
a few instructions in the syscall path if the return point is the
natural return point for where the branch to the syscall was taken.

The disadvantage is that we would have to load a space register in
the syscall stub. The sequence would be something like this:

	mtsp %r0,%sr0
	ldil L%<gateway address>,%r1
	ble  R%<gateway address>(%sr0,%r1)
	ldi <syscall #>,%r20

If address 0 is available in the kernel address space (and there are
a variety of reasons why it might not be available long term) the
sequence could be shortened to:

	mtsp %r0,%sr0
	ble  <gateway offset>(%sr0,%r0)
	ldi <syscall #>,%r20

Proposal #6:

Locate the gateway page in a space dedicated purely for the gateway
page. This has the advantage of having one global mapping, similar
to proposal #5 above. It also is completely flexible in terms of
where in the address space it could be located, i.e. 0 would be
available. It has the disadvantages (compared to #5) of not being
able to locate the syscall return path on the gateway page. Also
it would take yet another instruction to load a non zero space value
into a space register, e.g: (assuming gateway at address 0)

    ldi <gateway space value>,%r1
    mtsp    %r1,%sr0
    ble  <gateway offset>(%sr0,%r0)
    ldi <syscall #>,%r20

I only mention this possibility to be complete. I personally do not
think it has much going for it.

I haven't proposed more flexible solutions, including what HP-UX
does for 64 bit syscalls, i.e. they pass a pointer to an array of
syscall pointers into the application at startup. This means that
you have to load them from memory.  My opinion is that we don't
need to be that flexible,  but I'm sure some of you will disagree.
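
As a rough illustration of that table-based approach (a sketch only, not
HP-UX's actual layout), the application would receive a table of entry
points and call through it, paying for a load from memory on each call.
Every name below is hypothetical.

```c
#include <stddef.h>

typedef long (*syscall_fn)(long arg);

/* Hypothetical table handed to the application at startup. */
struct syscall_table {
    size_t count;
    const syscall_fn *entries;
};

/* Stand-in "syscalls" so the sketch is self-contained. */
static long demo_answer(long unused) { (void)unused; return 42; }
static long demo_double(long x) { return 2 * x; }

static const syscall_fn demo_entries[] = { demo_answer, demo_double };
static const struct syscall_table demo_table = { 2, demo_entries };

/* The entry point is loaded from memory before the call can be made. */
static long call_via_table(const struct syscall_table *t, size_t nr, long arg)
{
    if (nr >= t->count)
        return -1;              /* unknown syscall number */
    return t->entries[nr](arg);
}
```

Compared with the fixed-address stubs above, this trades one extra memory
load per call for the freedom to move the entry points around.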

So, what do you all think?

John Marvin
jsm@fc.hp.com

* Re: [parisc-linux] Linux syscall ABI
  2000-02-14  9:30 John Marvin
@ 2000-02-14 13:34 ` Philipp Rumpf
  0 siblings, 0 replies; 15+ messages in thread
From: Philipp Rumpf @ 2000-02-14 13:34 UTC (permalink / raw)
  To: John Marvin; +Cc: parisc-linux

> Proposal #1:
> 
> Don't use a gateway page. Use a more "traditional" trapping instruction,

I agree this probably has very bad performance, so we shouldn't do it.

> Proposal #2:
> 
> Map the Linux syscall gateway page at the top end of the user address space.
> What this top end address would be has yet to be determined. Depending
> on how we support mapping I/O devices into the user address space, we
> may want to reserve the 0xF0000000-0xFFFFFFFF range for IO (keeping the
> device mapped at its equivalent address in the kernel address space).

I don't think reserving 0xFXXX XXXX for I/O in userspace is a good idea.
There is no problem with doing userspace I/O using the normal mmap /dev/mem
approach.  (Except maybe HPUX compatibility, which doesn't concern linux-
only processes).

Not using the last page (i.e. 0xffff f000 - 0xffff ffff) sounds like a good
idea to me though as it avoids small negative numbers cast into pointers
getting successfully dereferenced.
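
A small C sketch of why the last page is the one to leave unmapped,
assuming a 32-bit address space and 4 KB pages (the helper names are made
up for illustration):

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u                                 /* assumption: 4 KB pages */
#define LAST_PAGE_START (0xffffffffu - PAGE_SIZE_BYTES + 1u)  /* 0xfffff000 */

/* Bit pattern of a small negative integer after a cast to a 32-bit pointer. */
static uint32_t as_pointer_bits(int32_t value)
{
    return (uint32_t)value;
}

/* True if the address falls in 0xfffff000 - 0xffffffff. */
static int in_last_page(uint32_t addr)
{
    return addr >= LAST_PAGE_START;
}
```

Every value from -1 down to -4096 lands in that page, so leaving it
unmapped turns such a dereference into a fault.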

> This may also be necessary for routines like memcpy (so it can easily
> determine if the address is an IO mapped address), which if used on IO

kernel memcpy() shouldn't ever be called with either an IO or a user address

> One disadvantage of this proposal is that we could not support the
> System V personality null pointer dereference behaviour. This maps
> a page of zeros at location 0 so that null pointer dereferences will
> return 0 for buggy software. Do we really still need to maintain this
> ancient hack?

No, we don't.  We're talking about PER_LINUX binaries here, and those
never expected to be able to dereference NULL pointers.

> The disadvantage is that we would have to load a space register in
> the syscall stub. The sequence would be something like this:
> 
> 	mtsp %r0,%sr0
> 	ldil L%<gateway address>,%r1
> 	ble  R%<gateway address>(%sr0,%r1)
> 	ldi <syscall #>,%r20
> 
> If address 0 is available in the kernel address space (and there are

Of course every page in the region 0xfffc0000 - 0x3 fffc (it's a 17-bit
signed immediate shifted left 2 bits, so that should be -2^18 to 2^18-4)
can be used, so we just need a page within the first 256 KB.
 
> a variety of reasons why it might not be available long term) the
> sequence could be shortened to:
> 
> 	mtsp %r0,%sr0
> 	ble  <gateway offset>(%sr0,%r0)
> 	ldi <syscall #>,%r20

In fact, what's wrong with shortening _this_ sequence to

	ble <gateway offset>(%sr2, %r0)
	ldi <syscall #>,%r20

and teaching userspace to not modify sr2 ?

> I haven't proposed more flexible solutions, including what HP-UX
> does for 64 bit syscalls, i.e. they pass a pointer to an array of
> syscall pointers into the application at startup. This means that
> you have to load them from memory.  My opinion is that we don't
> need to be that flexible,  but I'm sure some of you will disagree.

If you disagree, there's still 252 / 248 KB left for you to play in.

	Philipp

* Re: [parisc-linux] Linux syscall ABI
@ 2000-02-15  5:36 John Marvin
  2000-02-15  6:15 ` willy
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: John Marvin @ 2000-02-15  5:36 UTC (permalink / raw)
  To: parisc-linux

> I don't think reserving 0xFXXX XXXX for I/O in userspace is a good idea.
> There is no problem with doing userspace I/O using the normal mmap /dev/mem
> approach.  (Except maybe HPUX compatibility, which doesn't concern linux-
> only processes).

...

> kernel memcpy() shouldn't ever be called with either an IO or a user address

I was referring to user space memcpy, not kernel memcpy.  The HP-UX user
space memcpy supports use with IO mapped addresses, however it has to
differentiate those addresses in order to not do optimizations that won't
work with IO mapped addresses. Having a dedicated range allows for an
easy test. But perhaps if this is not desirable we can just say that
Linux glibc memcpy is not supported for IO mapped addresses (assuming it
is optimized).
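
To show what I mean by an easy test, here is a minimal C sketch assuming
IO is confined to 0xF0000000-0xFFFFFFFF. The function names are
hypothetical, and this is not the HP-UX implementation.

```c
#include <stdint.h>

#define IO_RANGE_START 0xF0000000u   /* assumption: IO lives in F space only */

/* With a dedicated range, one compare classifies an address. */
static int is_io_address(uint32_t addr)
{
    return addr >= IO_RANGE_START;
}

/* An optimized copy would fall back to a plain loop if either side is IO. */
static int needs_slow_path(uint32_t dst, uint32_t src)
{
    return is_io_address(dst) || is_io_address(src);
}
```

Without a dedicated range, memcpy has no cheap way to tell the two kinds
of address apart at run time.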

> > One disadvantage of this proposal is that we could not support the
> > System V personality null pointer dereference behaviour. This maps
> > a page of zeros at location 0 so that null pointer dereferences will
> > return 0 for buggy software. Do we really still need to maintain this
> > ancient hack?
>
> No, we don't.  We're talking about PER_LINUX binaries here, and those
> never expected to be able to dereference NULL pointers.

I don't know much about PER_SVR4, and why it exists.  Willy pointed it
out to me.  I can see from the kernel source that perhaps it is only there
for sparc.  If it is not necessary for parisc-linux to support then
there is no issue. If it is necessary then I guess I assumed that PER_SVR4
binaries would use the same gateway page as PER_LINUX binaries.

> Of course every page in the region 0xfffc0000 - 0x3 fffc (it's a 17-bit
> signed immediate shifted left 2 bits, so that should be -2^18 to 2^18-4)
> can be used, so we just need a page within the first 256 KB.

This is true for user space. For kernel space, I don't think we can
use anything in F space, unless we map the real IO addresses somewhere
else in virtual space. I'm not sure what assumptions are being made
right now regarding that mapping in the drivers.

I was also thinking that we may want to eventually map physical addresses
directly (with no offset) to virtual addresses, in order to support the
maximum amount of physical memory. But perhaps we can have a 16 Mb offset
instead.

> > a variety of reasons why it might not be available long term) the
> > sequence could be shortened to:
> >
> >       mtsp %r0,%sr0
> >       ble  <gateway offset>(%sr0,%r0)
> >       ldi <syscall #>,%r20
>
> In fact, what's wrong with shortening _this_ sequence to
>
>       ble <gateway offset>(%sr2, %r0)
>       ldi <syscall #>,%r20
>
> and teaching userspace to not modify sr2 ?

I like this idea.  The only disadvantage is that if the user modifies sr2
by mistake, all of a sudden all of the syscalls stop working (for that
process only).  It might be hard to debug.  But, as long as we make sure
that gcc never touches sr2, there should be almost no legitimate reason to
play with space registers in the user address space for Linux processes,
since we are going to have sr4=sr5=sr6=sr7.  In fact, gcc should be
modified to stop using $$dyncall for indirect function pointer calls.  So,
a C programmer will never run into this problem by mistake.  Only people
doing assembly language programming could run into the error.

Now, I am assuming we would set sr2 to 0 and locate the gateway page in
the kernel address space if we chose this proposal.  But this idea has the
flexibility of allowing us to move the gateway page into another space
completely if we ever need to (would require modifications to the tlb miss
handler).  It also has the interesting feature that a programmer could set
sr2 to point into the user address space, and if we choose an offset for
the gateway page in the kernel address space and make that offset also
available for mmap in the user address space, the user could place their
own page at the gateway offset in user space and intercept all syscalls
(there are other ways of doing this, but I just thought it was
interesting).

John Marvin
jsm@fc.hp.com

* Re: [parisc-linux] Linux syscall ABI
  2000-02-15  5:36 [parisc-linux] Linux syscall ABI John Marvin
@ 2000-02-15  6:15 ` willy
  2000-02-15 12:50 ` Philipp Rumpf
  2000-02-15 17:25 ` [parisc-linux] Linux syscall ABI Grant Grundler
  2 siblings, 0 replies; 15+ messages in thread
From: willy @ 2000-02-15  6:15 UTC (permalink / raw)
  To: John Marvin; +Cc: parisc-linux

On Mon, Feb 14, 2000 at 10:36:39PM -0700, John Marvin wrote:
> > No, we don't.  We're talking about PER_LINUX binaries here, and those
> > never expected to be able to dereference NULL pointers.
> 
> I don't know much about PER_SVR4, and why it exists.  Willy pointed it
> out to me.  I can see from the kernel source that perhaps it is only there
> for sparc.  If it is not necessary for parisc-linux to support then
> there is no issue. If it is necessary then I guess I assumed that PER_SVR4
> binaries would use the same gateway page as PER_LINUX binaries.

Linux has a personality() syscall which tells the kernel what operating
system this binary was compiled for.  PER_SVR4 means that this binary
was compiled for SVR4.  We _may_ want to include a PER_HPUX at some
point, but I'm not convinced we need it yet.  I pointed you at that bit
of code to show you that for some binaries Linux did that nasty hack,
rather than to advocate we do it too.
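
For anyone who wants to see where the hack is encoded, here is a small C
sketch against the constants in <sys/personality.h> as shipped with a
current glibc on Linux (an assumption about your headers; the helper name
is made up):

```c
#include <sys/personality.h>

/*
 * Nonzero if a persona value asks for a page of zeros to be mapped at
 * address 0, i.e. the null pointer dereference hack discussed above.
 */
static int maps_page_zero(unsigned long persona)
{
    return (persona & MMAP_PAGE_ZERO) != 0;
}
```

PER_SVR4 carries the MMAP_PAGE_ZERO flag and PER_LINUX does not. A process
can read its own value with personality(0xffffffff), which queries without
changing anything.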

* Re: [parisc-linux] Linux syscall ABI
  2000-02-15  5:36 [parisc-linux] Linux syscall ABI John Marvin
  2000-02-15  6:15 ` willy
@ 2000-02-15 12:50 ` Philipp Rumpf
  2000-02-16 14:04   ` [parisc-linux] Location of HIL protocol docs? Brian S. Julin
  2000-02-15 17:25 ` [parisc-linux] Linux syscall ABI Grant Grundler
  2 siblings, 1 reply; 15+ messages in thread
From: Philipp Rumpf @ 2000-02-15 12:50 UTC (permalink / raw)
  To: John Marvin; +Cc: parisc-linux

> > kernel memcpy() shouldn't ever be called with either an IO or a user address
> 
> I was referring to user space memcpy, not kernel memcpy.  The HP-UX user
> space memcpy supports use with IO mapped addresses, however it has to
> differentiate those addresses in order to not do optimizations that won't
> work with IO mapped addresses. Having a dedicated range allows for an
> easy test. But perhaps if this is not desirable we can just say that
> Linux glibc memcpy is not supported for IO mapped addresses (assuming it
> is optimized).

This sounds to me like a typical case of doing a static optimization (is
this a memcpy() to I/O space, from I/O space, to and from I/O space) at
runtime.

> > > One disadvantage of this proposal is that we could not support the
> > > System V personality null pointer dereference behaviour. This maps
> > > a page of zeros at location 0 so that null pointer dereferences will
> > > return 0 for buggy software. Do we really still need to maintain this
> > > ancient hack?
> >
> > No, we don't.  We're talking about PER_LINUX binaries here, and those
> > never expected to be able to dereference NULL pointers.
> 
> I don't know much about PER_SVR4, and why it exists.  Willy pointed it
> out to me.  I can see from the kernel source that perhaps it is only there
> for sparc.  If it is not necessary for parisc-linux to support then
> there is no issue. If it is necessary then I guess I assumed that PER_SVR4
> binaries would use the same gateway page as PER_LINUX binaries.
> 
> > Of course every page in the region 0xfffc0000 - 0x3 fffc (it's a 17-bit
> > signed immediate shifted left 2 bits, so that should be -2^18 to 2^18-4)
> > can be used, so we just need a page within the first 256 KB.
>  
> This is true for user space. For kernel space, I don't think we can
> use anything in F space, unless we map the real IO addresses somewhere
> else in virtual space.

That's what I meant by "within the first 256 KB". ble <offset>(srX, r0)
gives us the range 0xfffc0000 - 0x3 fffc, we can't use 0xfffc0000 - 0xffffffff,
so we're limited to the first 256 KB.
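
The arithmetic can be checked with a short C sketch. It assumes the 17-bit
signed word displacement described above and 4 KB pages; the function
names are only for illustration.

```c
#include <stdint.h>

#define BLE_DISP_BITS 17      /* signed displacement in words, as stated above */
#define PAGE_BYTES    4096u   /* assumption: 4 KB pages */

/* Most negative byte offset: -2^16 words = -2^18 bytes. */
static int32_t ble_min_offset(void)
{
    return -(1 << (BLE_DISP_BITS - 1)) * 4;
}

/* Most positive byte offset: (2^16 - 1) words = 2^18 - 4 bytes. */
static int32_t ble_max_offset(void)
{
    return ((1 << (BLE_DISP_BITS - 1)) - 1) * 4;
}

/* Target with base %r0 (always zero), wrapped to a 32-bit address. */
static uint32_t ble_target_from_r0(int32_t offset)
{
    return (uint32_t)offset;
}

/* Pages reachable at the bottom of the space, i.e. the first 256 KB. */
static uint32_t usable_low_pages(void)
{
    return (uint32_t)(ble_max_offset() + 4) / PAGE_BYTES;
}
```

That gives 64 candidate pages at the bottom of the space, while the
negative half of the range wraps around into F space.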

> I'm not sure what assumptions are being made
> right now regarding that mapping in the drivers.

Mapping the I/O space to 0xf000 0000 - 0xffff ffff would make sense, IMO, and
shouldn't be a problem with our drivers.

> I was also thinking that we may want to eventually map physical addresses
> directly (with no offset) to virtual addresses, in order to support the
> maximum amount of physical memory.

We agreed upon doing this eventually, didn't we ?

> But perhaps we can have a 16 Mb offset instead.

I think not mapping the first 64 KB and making a copy of page 0 somewhere
else would make sense.  Then we could use the first 64 KB of the virtual
address space to implement gateway pages.

> > > a variety of reasons why it might not be available long term) the
> > > sequence could be shortened to:
> > >
> > >       mtsp %r0,%sr0
> > >       ble  <gateway offset>(%sr0,%r0)
> > >       ldi <syscall #>,%r20
> >
> > In fact, what's wrong with shortening _this_ sequence to
> >
> >       ble <gateway offset>(%sr2, %r0)
> >       ldi <syscall #>,%r20
> >
> > and teaching userspace to not modify sr2 ?
> 
> I like this idea.  The only disadvantage is that if the user modifies sr2
> by mistake, all of a sudden all of the syscalls stop working (for that
> process only).

I don't see a real problem with that.  Modifying SR2 requires either direct
modification (the only code I could see doing that is HP/UX code, which isn't
supposed to execute with PER_LINUX anytime soon) or executing random bytes,
which will always break in unexpected ways.

> It might be hard to debug.  But, as long as we make sure that gcc never
> touches sr2, there should be almost no legitimate reason to
> play with space registers in the user address space for Linux processes,
> since we are going to have sr4=sr5=sr6=sr7.  In fact, gcc should be
> modified to stop using $$dyncall for indirect function pointer calls.  So,

There is an option for that.  Something along the lines of "fast function
calls" (I'll have a look later on).

> Now, I am assuming we would set sr2 to 0 and locate the gateway page in
> the kernel address space if we chose this proposal.  But this idea has the
> flexibility of allowing us to move the gateway page into another space
> completely if we ever need to (would require modifications to the tlb miss
> handler).  It also has the interesting feature that a programmer could set
> sr2 to point into the user address space, and if we choose an offset for
> the gateway page in the kernel address space and make that offset also
> available for mmap in the user address space, the user could place their
> own page at the gateway offset in user space and intercept all syscalls
> (there are other ways of doing this, but I just thought it was
> interesting).

I agree this would be another point in favour of using 0:0 or 0:0x1000 as
default gateway page.

	Philipp

* Re: [parisc-linux] Linux syscall ABI
  2000-02-15  5:36 [parisc-linux] Linux syscall ABI John Marvin
  2000-02-15  6:15 ` willy
  2000-02-15 12:50 ` Philipp Rumpf
@ 2000-02-15 17:25 ` Grant Grundler
  2000-02-15 18:18   ` Philipp Rumpf
  2 siblings, 1 reply; 15+ messages in thread
From: Grant Grundler @ 2000-02-15 17:25 UTC (permalink / raw)
  To: John Marvin; +Cc: parisc-linux

John Marvin wrote:
...
> > Of course every page in the region 0xfffc0000 - 0x3 fffc (it's a 17-bit
> > signed immediate shifted left 2 bits, so that should be -2^18 to 2^18-4)
> > can be used, so we just need a page within the first 256 KB.
>  
> This is true for user space. For kernel space, I don't think we can
> use anything in F space, unless we map the real IO addresses somewhere
> else in virtual space. I'm not sure what assumptions are being made
> right now regarding that mapping in the drivers.

Drivers don't map anything. The assumption for GSC/PCI drivers is the
address given will work with readl/writel (or inb/outb) routines.

For GSC, the HPA is in the struct hp_device->hpa field and readl/writel
are aliases for gsc_readl/gsc_writel.

For PCI devices the base address is in struct pci_dev->resource[].
The PCI drivers know to use MMIO or I/O port space for their device
(often based on #define flags).  And again readl/writel alias to
gsc_readl/gsc_writel for MMIO. inl/outl are defined to be an indirect
function call to either Dino or "lba" PCI services (similar to
how pci_config_read is an indirect call).
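
A minimal C sketch of that kind of indirection, with made-up names and a
fake backend standing in for the Dino or "lba" services (this is not the
actual parisc-linux code):

```c
#include <stdint.h>

/* Hypothetical per-bridge port I/O operations. */
struct pci_port_ops {
    uint32_t (*inl)(uint16_t port);
    void (*outl)(uint32_t value, uint16_t port);
};

/* Fake backend: four 32-bit registers indexed by port / 4. */
static uint32_t fake_ports[4];

static uint32_t fake_inl(uint16_t port)
{
    return fake_ports[(port / 4) % 4];
}

static void fake_outl(uint32_t value, uint16_t port)
{
    fake_ports[(port / 4) % 4] = value;
}

static const struct pci_port_ops fake_bridge_ops = { fake_inl, fake_outl };

/* Chosen once at bus setup time; drivers never see which bridge it is. */
static const struct pci_port_ops *active_port_ops = &fake_bridge_ops;

/* What inl/outl would expand to: an indirect call through the ops table. */
static uint32_t sketch_inl(uint16_t port)
{
    return active_port_ops->inl(port);
}

static void sketch_outl(uint32_t value, uint16_t port)
{
    active_port_ops->outl(value, port);
}
```

The driver calls the same two functions whichever bridge sits underneath.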

gsc_readl/writel take a physical address as the address parameter.
I would like to get away from gsc_read for one simple reason: HPMC.
If a driver incorrectly accesses something it isn't supposed to
we get either an HPMC or undefined behavior. Most likely the HPMC.
If the read/write routines referenced a virtual address, we have
much better chances of getting a data page fault and some decent
debugging information to track down the problem.

This is #2 on my TODO list after reviewing the coherent DMA services
recently introduced in 2.3.

grant

Grant Grundler
Unix Development Lab
+1.408.447.7253

* Re: [parisc-linux] Linux syscall ABI
  2000-02-15 17:25 ` [parisc-linux] Linux syscall ABI Grant Grundler
@ 2000-02-15 18:18   ` Philipp Rumpf
  2000-02-15 19:15     ` Frank Rowand
  2000-02-16  2:34     ` Grant Grundler
  0 siblings, 2 replies; 15+ messages in thread
From: Philipp Rumpf @ 2000-02-15 18:18 UTC (permalink / raw)
  To: Grant Grundler; +Cc: John Marvin, parisc-linux

> Drivers don't map anything. The assumption for GSC/PCI drivers is the
> address given will work with readl/writel (or inb/outb) routines.

PCI drivers call ioremap(), which is free to do the mapping (but I don't
think it should, for PA-RISC).

> For GSC, the HPA is in the struct hp_device->hpa field and readl/writel
> are aliases for gsc_readl/gsc_writel.

which itself will be aliased to simple volatile accesses soon.
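
Roughly what I mean by "simple volatile accesses", as a C sketch with
hypothetical names. Real accessors may also need byte swapping or ordering
guarantees, which this leaves out.

```c
#include <stdint.h>

/* Read a 32-bit register through a volatile pointer (one real load). */
static inline uint32_t sketch_readl(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

/* Write a 32-bit register through a volatile pointer (one real store). */
static inline void sketch_writel(uint32_t value, volatile void *addr)
{
    *(volatile uint32_t *)addr = value;
}
```

Since these take an ordinary pointer, the address is translated like any
other virtual address.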

> gsc_readl/writel take a physical address as the address parameter.
> I would like to get away from gsc_read for one simple reason: HPMC.
> If a driver incorrectly accesses something it isn't supposed to
> we get either an HPMC or undefined behavior. Most likely the HPMC.
> If the read/write routines referenced a virtual address, we have
> much better chances of getting a data page fault and some decent
> debugging information to track down the problem.

HPMC is good debugging information - you've got PIM.  Of course, we want
an HPMC handler too, at some point.  The assembly part just tries to
find out if the machine is still usable, and resets it if it's not. 
If it is, we'd like it to be treated as normal interruption, and then
have a CPU-specific fault handler that reads the interesting registers
and prints a nice message.

	Philipp Rumpf

* Re: [parisc-linux] Linux syscall ABI
  2000-02-15 18:18   ` Philipp Rumpf
@ 2000-02-15 19:15     ` Frank Rowand
  2000-02-16  2:34     ` Grant Grundler
  1 sibling, 0 replies; 15+ messages in thread
From: Frank Rowand @ 2000-02-15 19:15 UTC (permalink / raw)
  To: Philipp Rumpf; +Cc: Grant Grundler, John Marvin, parisc-linux

Philipp Rumpf wrote:
> 
> > Drivers don't map anything. The assumption for GSC/PCI drivers is the
> > address given will work with readl/writel (or inb/outb) routines.
> 
> PCI drivers call ioremap(), which is free to do the mapping (but I don't
> think it should, for PA-RISC).
> 
> > For GSC, the HPA is in the struct hp_device->hpa field and readl/writel
> > are aliases for gsc_readl/gsc_writel.
> 
> which itself will be aliased to simple volatile accesses soon.
> 
> > gsc_readl/writel take a physical address as the address parameter.
> > I would like to get away from gsc_read for one simple reason: HPMC.
> > If a driver incorrectly accesses something it isn't supposed to
> > we get either an HPMC or undefined behavior. Most likely the HPMC.
> > If the read/write routines referenced a virtual address, we have
> > much better chances of getting a data page fault and some decent
> > debugging information to track down the problem.
> 
> HPMC is good debugging information - you've got PIM.  Of course, we want
> an HPMC handler too, at some point.  The assembly part just tries to
> find out if the machine is still usable, and resets it if it's not.
> If it is, we'd like it to be treated as normal interruption, and then
> have a CPU-specific fault handler that reads the interesting registers
> and prints a nice message.
> 
>         Philipp Rumpf

An HPMC may be delayed, relative to the instruction that caused it.  The worst
case is that a context switch _could_ occur before the HPMC occurs (and yes,
we did see this problem with our HP-UX and HP-RT VME systems when a VME
time-out was long enough).  This can make it more difficult to figure out what
instruction was issued to cause the HPMC. The advantage of the page fault is
that you know exactly what instruction caused the fault.

-Frank
-- 
MontaVista Software, Inc
frank_rowand@mvista.com

* Re: [parisc-linux] Linux syscall ABI
  2000-02-15 18:18   ` Philipp Rumpf
  2000-02-15 19:15     ` Frank Rowand
@ 2000-02-16  2:34     ` Grant Grundler
  2000-02-16  9:33       ` Kirk Bresniker
  1 sibling, 1 reply; 15+ messages in thread
From: Grant Grundler @ 2000-02-16  2:34 UTC (permalink / raw)
  To: Philipp Rumpf; +Cc: parisc-linux

Philipp Rumpf wrote:
...
> HPMC is good debugging information - you've got PIM.

IMO, Frank didn't say this clearly:
	PIM is not "good debugging information".

Given the complexity of the systems, knowing *some* (not all)
of the HW state is marginally useful at best. When we get
into debugging driver problems later on, this will be clearer.

Besides the asynchronous nature of HPMCs, PIMs are unique to each
class of box. So decoding a PIM on a K-class is quite different
from the PIM on N or L-class. Only recently have tools been made
internally available to help decode each type of PIM. I wouldn't
hold my breath waiting for those to get published.

> Of course, we want
> an HPMC handler too, at some point.  The assembly part just tries to
> find out if the machine is still usable, and resets it if it's not.
> If it is, we'd like it to be treated as normal interruption, and then
> have a CPU-specific fault handler that reads the interesting registers
> and prints a nice message.

If linux could learn to dump host memory to disk, then HPMCs would be
a bit easier to debug since one could review data structures for suspect
code. I think that's what the HPMC handler is intended for - not
attempting to recover. Attempting to recover from an asynchronous fault
doesn't sound feasible to me. But what do I know anyway....

later,
grant

Grant Grundler
Unix Development Lab
+1.408.447.7253

* Re: [parisc-linux] Linux syscall ABI
  2000-02-16  2:34     ` Grant Grundler
@ 2000-02-16  9:33       ` Kirk Bresniker
  0 siblings, 0 replies; 15+ messages in thread
From: Kirk Bresniker @ 2000-02-16  9:33 UTC (permalink / raw)
  To: Grant Grundler; +Cc: prumpf, parisc-linux

Grant wrote:

| 
| Given the complexity of the systems, knowing *some* (not all)
| of the HW state is marginally useful at best. When we get
| into debugging driver problems later on, this will be clearer.
| 
| Besides the asynchronous nature of HPMCs, PIMs are unique to each
| class of box. So decoding a PIM on a K-class is quite different
| from the PIM on N or L-class. Only recently have tools been made
| internally available to help decode each type of PIM. I wouldn't
| hold my breath waiting for those to get published.

There are two key takeaways from what Grant has said:

1. There are some platform specific tools which help PIM analysis.  As
   someone who has read literally thousands of PIM dumps over 10 years
   worth of server platforms, and as someone who has contributed some of
   the analysis tools, I would say that the tools only automate the
   decoding of status register values (which are all implementation
   specific). There has never been an expert tool which pulls in a 
   PIM dump and spits out the answer. 

2. The platforms which Grant specified are server platforms, not the
   workstations.  In my experience, you're going to find many more
   people familiar with server PIM dump output than workstations, simply
   because of the threshold of pain of the customer base. A server
   customer is much more concerned with getting a full analysis of
   each and every failure than a workstation customer.

In general, for real hardware faults, PIM dumps are usually as good 
as the underlying hardware error logging registers in telling an
expert what has gone wrong. But, in this case, when there is an OS or
OS/hardware interaction, the PIM is usually not enough. 

| 
| If linux could learn to dump host memory to disk, then HPMCs would be
| a bit easier to debug since one could review data structures for suspect
| code. I think that's what the HPMC handler is intended for - not to
| attempt to recover. Attempting to recover from an asynchronous fault
| doesn't sound feasible to me. But what do I know anyway....
| 

I don't know what Grant does (n't) know :), but I second the call for a
core dump.  To give an example of a complex hardware/OS interaction, I
was once debugging a system which was regularly getting OS panics due to
data page faults.  As a hardware engineer I would, as a matter of
principle, blame software and then firmware.  But, the problem was
actually a double bit error due to a bad SRAM in the instruction cache
which was corrupting an instruction.  I only found this out by comparing
instructions and data in the memory dumps with the data stored in
PIM dumps.

As to recovery from HPMCs, I can only speak to the hardware generated
exceptions.  Most of the hardware generated HPMCs are linked to
events which call into question the validity of information. Get a
parity error on a private, dirty cache line? Well that means that there
is no valid copy anywhere. Better to dump PIM and halt immediately
rather than possibly commit bad data to permanent storage.  I think
that you have to be pretty confident to continue with other than
a core dump or tombstone page.

KMB
--
+============================================================+
|       Kirk Bresniker    	(916) 748-2393		     |
|       8000 Foothills Blvd                                  |
|       Roseville, CA 95747-5649                             |
|       kirkb@rose.hp.com                                    |


* Re: [parisc-linux] Linux syscall ABI
@ 2000-02-16 13:57 John Marvin
  2000-02-16 17:41 ` Philipp Rumpf
  0 siblings, 1 reply; 15+ messages in thread
From: John Marvin @ 2000-02-16 13:57 UTC (permalink / raw)
  To: parisc-linux

> This sounds to me like a typical case of doing a static optimization (is
> this a memcpy() to I/O space, from I/O space, to and from I/O space) at
> runtime.

I believe there are some cases in the graphics libraries where it is
not known until runtime whether the destination will be IO (framebuffer)
or memory. But I also tend to agree with you. 99.9% of the use of memcpy
will not be for IO, so it probably would have made more sense for the
graphics libraries, and any other code where there is any possibility
of being handed a pointer to IO space, to handle it in a different way,
rather than having the test be in memcpy.

> > But Perhaps we can have a 16 Mb offset instead.
>
> I think not mapping the first 64 KB and making a copy of page 0 somewhere
> else would make sense.  Then we could use the first 64 KB of the virtual
> address space to implement gateway pages.

We can probably use a smaller offset than 16 Mb but 64 Kb won't work.  We
have to make sure that the kernel space virtual addresses are equivalently
aliased with their physical addresses. 64 Kb would work on a 712, but it
won't work on a C3000.  Currently PCXU supports a maximum external direct
mapped cache size of 4 Mb, and I don't think that has been increased for
PCXW.  I'm not sure what the largest actually implemented direct mapped
cache is, but I know it is at least 2 Mb.  Of course, to take full
advantage of large pages, it might make sense to use a larger offset, i.e.
64 Mb.
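
The constraint above can be sketched as a simple check. This is only an illustration, assuming the kernel mapping is virt = phys + offset and that "equivalently aliased" means both addresses select the same index in a direct mapped cache, i.e. the offset is a multiple of the (power of two) cache size. The function name and macros are hypothetical, not taken from any kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KB(n) ((uint32_t)(n) << 10)
#define MB(n) ((uint32_t)(n) << 20)

/*
 * Assumption: virt = phys + offset, and the two addresses alias
 * equivalently when they fall on the same index of a direct mapped
 * cache, which holds when offset is a multiple of the cache size.
 */
static bool offset_is_equivalently_aliased(uint32_t offset, uint32_t cache_size)
{
    /* The mask trick below only works for a power-of-two cache size. */
    assert(cache_size != 0 && (cache_size & (cache_size - 1)) == 0);
    return (offset & (cache_size - 1)) == 0;
}
```

Under these assumptions a 64 Kb offset fails the check against a 4 Mb cache, while 16 Mb and 64 Mb offsets both pass.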

Rereading what you said above made me realize that you probably were not
talking about a 64 Kb offset. If so, then you are talking about
still using an offset of 0, but just not mapping the first 64 Kb of memory,
i.e. throwing those pages "away" (actually we can probably find ways
to use them). The only problem with this is that we would be prevented
from using maximally large tlb mappings to map the first 64 Mb of memory.
If we moved the offset to 64 Mb we could use 64 Mb page size mappings
to map the kernel address space. The cost of this is that it reduces
the amount of physical memory we can support.  We can't support 4 Gb
(at least not easily), since we need virtual space for the vmalloc area.
So I'm not sure losing 64 Mb of virtual space at the bottom end is that
much of an issue.

What is the largest amount of physical memory we want to support for the
32 bit implementation?  How hard do we want to work to achieve it?  We
can't support more than 4 Gb.  It would take some work to support 4 Gb.
My feeling is that if we supported 3.5 Gb max that would be more than
adequate.  We could use a 64 Mb offset and use 64 Mb page size mappings to
cover the kernel address space.  This should leave enough space for the
vmalloc area.

> >
> > I like this idea.  The only disadvantage is that if the user modifies sr2
> > by mistake, all of a sudden all of the syscalls stop working (for that
> > process only).
>
> I don't see a real problem with that.  Modifying SR2 requires either direct
> modification (the only code I could see doing that is HP/UX code, which isn't
> supposed to execute with PER_LINUX anytime soon) or executing random bytes,
> which will always break in unexpected ways.
>

I agree that it is not a significant enough problem to stop us from doing
this. So, I propose the following:

    1) When we move the kernel virtual mappings we will leave room at
    the bottom to a) properly trap on null pointer dereferences, and
    b) provide room for a Linux syscall gateway page in the kernel
    address space (space 0). This gateway page will be located at an
    offset within the positive offset range of a ble instruction.

    2) We will set sr2 to zero for each process.

    3) We will only map an HP-UX syscall gateway page into HP-UX
    processes, i.e. we will not map any gateway page into the user
    address space for PER_LINUX processes.

    4) Linux syscalls will use the following 2 instruction sequence
    to reach the gateway page:

	ble <gateway offset>(%sr2,%r0)
	ldi <syscall #>,%r20

So, if anyone has a significant problem with this proposal, speak up.
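
As a rough sanity check on point 1, here is a sketch of what "within the positive offset range of a ble instruction" could mean in numbers. It assumes the ble displacement is a 17-bit signed word count (so the largest positive byte offset is 0x3fffc); that figure should be confirmed against the architecture manual. The names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: ble takes a 17-bit signed displacement counted in words. */
#define BLE_DISP_BITS 17
#define INSN_SIZE     4
#define BLE_DISP_MAX  ((((int32_t)1 << (BLE_DISP_BITS - 1)) - 1) * INSN_SIZE)

static_assert(BLE_DISP_MAX == 0x3fffc, "largest positive ble byte offset");

/* True if a gateway entry point at this offset in space 0 can be
 * reached by a single ble with a non-negative displacement. */
static bool ble_gateway_offset_ok(int32_t offset)
{
    if (offset < 0 || offset > BLE_DISP_MAX)
        return false;
    return (offset % INSN_SIZE) == 0;   /* must be instruction aligned */
}
```

Any small, word aligned offset in the first page passes this check; an unaligned or negative one does not.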

John Marvin
jsm@fc.hp.com


* [parisc-linux] Location of HIL protocol docs?
  2000-02-15 12:50 ` Philipp Rumpf
@ 2000-02-16 14:04   ` Brian S. Julin
  2000-02-16 18:42     ` Grant Grundler
  0 siblings, 1 reply; 15+ messages in thread
From: Brian S. Julin @ 2000-02-16 14:04 UTC (permalink / raw)
  To: parisc-linux

A brief poking around HP's site didn't turn me up any links --
and I'd prefer to enhance the existing code if appropriate, 
rather than just reformat it, not to mention debugging might 
go faster if I actually understand what I'm doing :).

Anyone got a link to tech docs on HIL?

--
Brian S. Julin


* Re: [parisc-linux] Linux syscall ABI
  2000-02-16 13:57 John Marvin
@ 2000-02-16 17:41 ` Philipp Rumpf
  0 siblings, 0 replies; 15+ messages in thread
From: Philipp Rumpf @ 2000-02-16 17:41 UTC (permalink / raw)
  To: John Marvin; +Cc: parisc-linux

> > > But Perhaps we can have a 16 Mb offset instead.
> >
> > I think not mapping the first 64 KB and making a copy of page 0 somewhere
> > else would make sense.  Then we could use the first 64 KB of the virtual
> > address space to implement gateway pages.
> 
> We can probably use a smaller offset than 16 Mb but 64 Kb won't work.  We

I didn't propose any offset.  I proposed to do the mapping somewhat like this:

virt		phys
0000 0000 	somewhere (wherever our syscall page is)
0000 1000 	- (invalid)
 ...
0000 f000	- (invalid)
0001 0000 	0001 0000
 ...
01ff f000	01ff f000

for a 32 MB box.  This is why I said we should make a copy of page 0.
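
The table above can be read as a translation function. The sketch below is only an illustration of that table: the physical location of the syscall page (SYSCALL_PAGE_PHYS) is a made-up placeholder, since the proposal only says "somewhere", and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MB(n)          ((uint32_t)(n) << 20)
#define PAGE_SIZE      0x1000u      /* 4 KB pages */
#define UNMAPPED_END   0x10000u     /* first 64 KB is not identity mapped */
#define INVALID_PHYS   UINT32_MAX   /* marker for "no translation" */

/* Hypothetical placeholder: where the copy of the syscall page lives. */
#define SYSCALL_PAGE_PHYS 0x00100000u

/* Translate a kernel virtual address according to the table above. */
static uint32_t kernel_virt_to_phys(uint32_t virt, uint32_t mem_size)
{
    assert(mem_size >= UNMAPPED_END);
    if (virt < PAGE_SIZE)               /* page 0: the syscall page */
        return SYSCALL_PAGE_PHYS + virt;
    if (virt < UNMAPPED_END)            /* 0x1000..0xffff: invalid */
        return INVALID_PHYS;
    if (virt < mem_size)                /* everything else: virt == phys */
        return virt;
    return INVALID_PHYS;                /* beyond installed memory */
}
```

For the 32 MB example, 0x01fff000 translates to itself, 0x0000f000 has no translation, and any address in page 0 lands in the relocated syscall page.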

> Rereading what you said above made me realize that you probably were not
> talking about a 64 Kb offset. If so, then you are talking about
> still using an offset of 0, but just not mapping the first 64 Kb of memory,
> i.e. throwing those pages "away" (actually we can probably find ways
> to use them). The only problem with this is that we would be prevented
> from using maximally large tlb mappings to map the first 64 Mb of memory.

I don't think I care.  Note that this is for PA1.1 anyway, so we don't have
large pages architecturally.

> If we moved the offset to 64 Mb we could use 64 Mb page size mappings
> to map the kernel address space. The cost of this is that it reduces
> the amount of physical memory we can support.  We can't support 4 Gb
> (at least not easily), since we need virtual space for the vmalloc area.
> So I'm not sure losing 64 Mb of virtual space at the bottom end is that
> much of an issue.
> 
> What is the largest amount of physical memory we want to support for the
> 32 bit implementation?  How hard do we want to work to achieve it?  We
> can't support more than 4 Gb.  It would take some work to support 4 Gb.
> My feeling is that if we supported 3.5 Gb max that would be more than
> adequate.  We could use a 64 Mb offset and use 64 Mb page size mappings to
> cover the kernel address space.  This should leave enough space for the
> vmalloc area.

IMHO, don't map the first 64 KB, map the first 3.25 GB - 64 KB, then have
512 MB vmalloc space, then 256 MB I/O space.  (For newer boxes it looks like
the 64 KB we don't map should be more like 1 MB).
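
To check that these three regions add up to exactly 4 GB, here is a small sketch of the layout. The region names and helper functions are hypothetical; only the sizes come from the paragraph above, and the unmapped hole at the bottom is a parameter because it may be 64 KB or closer to 1 MB.

```c
#include <assert.h>
#include <stdint.h>

#define KB(n) ((uint64_t)(n) << 10)
#define MB(n) ((uint64_t)(n) << 20)
#define GB(n) ((uint64_t)(n) << 30)

#define DIRECT_MAP_END (GB(3) + MB(256))    /* 3.25 GB */

struct region { uint64_t start; uint64_t end; };   /* [start, end) */

/* Directly mapped memory, minus an unmapped hole at the bottom. */
static struct region direct_map(uint64_t hole)
{
    assert(hole < DIRECT_MAP_END);
    struct region r = { hole, DIRECT_MAP_END };
    return r;
}

/* 512 MB of vmalloc space directly above the direct mapping. */
static struct region vmalloc_area(void)
{
    struct region r = { DIRECT_MAP_END, DIRECT_MAP_END + MB(512) };
    return r;
}

/* 256 MB of I/O space at the very top of the 32-bit address space. */
static struct region io_area(void)
{
    struct region r = { vmalloc_area().end, vmalloc_area().end + MB(256) };
    return r;
}
```

With these sizes the vmalloc area starts at 0xd0000000, the I/O area starts at 0xf0000000, and the I/O area ends exactly at the 4 GB boundary.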

> > I don't see a real problem with that.  Modifying SR2 requires either direct
> > modification (the only code I could see doing that is HP/UX code, which isn't
> > supposed to execute with PER_LINUX anytime soon) or executing random bytes,
> > which will always break in unexpected ways.
> >
> 
> I agree that it is not a significant enough problem to stop us from doing
> this. So, I propose the following:
> 
>     1) When we move the kernel virtual mappings we will leave room at
>     the bottom to a) properly trap on null pointer dereferences, and
>     b) provide room for a Linux syscall gateway page in the kernel
>     address space (space 0). This gateway page will be located at an
>     offset within the positive offset range of a ble instruction.

If you mean "not map the first 64 KB - 1 MB" by "leave room", I agree.

>     2) We will set sr2 to zero for each process.

Agreed.

>     3) We will only map an HP-UX syscall gateway page into HP-UX
>     processes, i.e. we will not map any gateway page into the user
>     address space for PER_LINUX processes.

agreed.

>     4) Linux syscalls will use the following 2 instruction sequence
>     to reach the gateway page:
> 
> 	ble <gateway offset>(%sr2,%r0)
> 	ldi <syscall #>,%r20

I'd prefer to fix gateway offset now - it's a pretty arbitrary decision,
but it might break binary compatibility later on.  My proposal is 0x100.

So did anyone think about how to write the actual syscall asm statements?
It seems rather hard to me, at least without writing the actual functions
directly and having one more level of indirection ...


* Re: [parisc-linux] Location of HIL protocol docs?
  2000-02-16 14:04   ` [parisc-linux] Location of HIL protocol docs? Brian S. Julin
@ 2000-02-16 18:42     ` Grant Grundler
  0 siblings, 0 replies; 15+ messages in thread
From: Grant Grundler @ 2000-02-16 18:42 UTC (permalink / raw)
  To: Brian S. Julin; +Cc: parisc-linux

"Brian S. Julin" wrote:

> Anyone got a link to tech docs on HIL?

HIL comes off the "WAX" (EISA Bus Adapter chip) on C200.
I assume the same is true for any box which supports EISA.
(ie 715/725, GSCtoPCI workstations).

I don't know if WAX documents will ever get published.
Probably but not right away if they aren't already.

grant

Grant Grundler
Unix Development Lab
+1.408.447.7253


* Re: [parisc-linux] Linux syscall ABI
@ 2000-02-17 14:17 John Marvin
  0 siblings, 0 replies; 15+ messages in thread
From: John Marvin @ 2000-02-17 14:17 UTC (permalink / raw)
  To: parisc-linux

>
> I don't think I care.  Note that this is for PA1.1 anyway, so we don't have
> large pages architecturally.

But we do have an even smaller resource of block tlb's that can also
map a maximum of 64 Mb at a time, and which also need to be aligned
to the same size that they map. For this reason it might be worth
considering having a 64 Mb offset rather than a 0 offset and not mapping
the first 64 Kb.

>
> I'd prefer to fix gateway offset now - it's a pretty arbitrary decision,
> but it might break binary compatibility later on.  My proposal is 0x100.

That's fine with me. We can put break instructions in 0x00-0xfc to catch
anyone branching through a null function pointer.
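
A sketch of that padding, for illustration only: with the gateway entry at 0x100 and 4-byte instructions, 0x00-0xfc is 64 instruction slots. The encoding of the break instruction is passed in by the caller rather than assumed here, and the page buffer and function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GATEWAY_OFFSET 0x100u               /* proposed entry point */
#define INSN_SIZE      4u                   /* bytes per instruction */
#define PAGE_WORDS     (0x1000u / INSN_SIZE)

/* Hypothetical in-memory image of the gateway page, one word per slot. */
static uint32_t gateway_page[PAGE_WORDS];

/*
 * Fill every slot below the gateway entry with a break instruction so a
 * branch through a null function pointer traps instead of running code.
 * Returns the number of slots filled.
 */
static size_t pad_with_breaks(uint32_t *page, size_t nwords, uint32_t break_insn)
{
    size_t guard = GATEWAY_OFFSET / INSN_SIZE;

    assert(page != NULL && nwords >= guard);
    for (size_t i = 0; i < guard; i++)
        page[i] = break_insn;
    return guard;
}
```

The last padded slot is at byte offset 0xfc, so the first real gateway instruction sits at 0x100.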

John Marvin
jsm@fc.hp.com

