Re: Virtual address space exhaustion (was Discontigmem virt_to

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
@ 2002-05-03 18:37 Tony Luck
  2002-05-03 19:01 ` Richard B. Johnson
  0 siblings, 1 reply; 22+ messages in thread
From: Tony Luck @ 2002-05-03 18:37 UTC (permalink / raw)
  To: linux-kernel

Richard B. Johnson wrote:
> One of the Unix characteristics is that the kernel
> address space is shared with each of the process
> address space.

This hasn't been an absolute requirement. There have
been 32-bit Unix implementations that gave separate
4G address spaces to the kernel and to each user
process.  The only real downside to this is that
copyin()/copyout() are more complex. Some processors
provided special instructions to access user-mode
addresses from kernel to mitigate this complexity.

-Tony

__________________________________________________
Do You Yahoo!?
Yahoo! Health - your guide to health and wellness
http://health.yahoo.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 18:37 Virtual address space exhaustion (was Discontigmem virt_to_page() ) Tony Luck
@ 2002-05-03 19:01 ` Richard B. Johnson
  2002-04-27  1:15   ` Pavel Machek
                     ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Richard B. Johnson @ 2002-05-03 19:01 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel

On Fri, 3 May 2002, Tony Luck wrote:

> Richard B. Johnson wrote:
> > One of the Unix characteristics is that the kernel
> > address space is shared with each of the process
> > address space.
> 
> This hasn't been an absolute requirement. There have
> been 32-bit Unix implementations that gave separate
> 4G address spaces to the kernel and to each user
> process.  The only real downside to this is that
> copyin()/copyout() are more complex. Some processors
> provided special instructions to access user-mode
> addresses from kernel to mitigate this complexity.
> 
> -Tony
> 
Really? The only 32-bit Unix's I've seen the details of
are SCO Unix, Interactive Unix, Linux, and BSD Unix.
The other Unix's I've become familiar are Sun-OS, the
original AT&T(Unix System Labs)/SYS-V and DEC Ultrix.
All these Unix's share user address-space with kernel
address-space. This is supposed to be the very thing
that makes Unix different from other VMS/timeshare
Operating Systems.

I think that if this shared address-space doesn't exist
then you don't have "Unix". You have something (perhaps
better), but it's not Unix. For instance VAX/VMS doesn't
share address space. In fact, the VAX/VMS kernel is, itself,
a process. This means it has its own context. This can
be quite useful.

Would you please tell me what Unix has 32-bit address space
which is not shared with the kernel?

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
@ 2002-04-27  1:15   ` Pavel Machek
  2002-05-03 19:09   ` Christoph Hellwig
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2002-04-27  1:15 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Tony Luck, linux-kernel

Hi!

> > > One of the Unix characteristics is that the kernel
> > > address space is shared with each of the process
> > > address space.
> > 
> > This hasn't been an absolute requirement. There have
> > been 32-bit Unix implementations that gave separate
> > 4G address spaces to the kernel and to each user
> > process.  The only real downside to this is that
> > copyin()/copyout() are more complex. Some processors
> > provided special instructions to access user-mode
> > addresses from kernel to mitigate this complexity.
> > 
> Really? The only 32-bit Unix's I've seen the details of
> are SCO Unix, Interactive Unix, Linux, and BSD Unix.
> The other Unix's I've become familiar are Sun-OS, the
> original AT&T(Unix System Labs)/SYS-V and DEC Ultrix.
> All these Unix's share user address-space with kernel
> address-space. This is supposed to be the very thing

Remember userspace being accessed through fs: in linux-2.0 days?

That counts as separate address space to me...
								Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
  2002-04-27  1:15   ` Pavel Machek
@ 2002-05-03 19:09   ` Christoph Hellwig
  2002-05-03 19:17     ` Richard B. Johnson
  2002-05-03 19:38   ` Matti Aarnio
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2002-05-03 19:09 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Tony Luck, linux-kernel

On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
> The other Unix's I've become familiar are Sun-OS, the

SunOS 5 uses separate address spaces on sparcv9 (32 and 64bit).
The same is true for many Linux ports, e.g. sparc64 or s390.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:09   ` Christoph Hellwig
@ 2002-05-03 19:17     ` Richard B. Johnson
  2002-05-03 19:24       ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Richard B. Johnson @ 2002-05-03 19:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Tony Luck, linux-kernel

On Fri, 3 May 2002, Christoph Hellwig wrote:

> On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
> > The other Unix's I've become familiar are Sun-OS, the
> 
> SunOS 5 uses separate address spaces on sparcv9 (32 and 64bit).
> The same is true for many Linux ports, e.g. sparc64 or s390.
> 

No no! I'm not talking about the physical address spaces. Many
CPUs have separate address spaces for separate functions. I'm
taking about the virtual address space that the process sees.
There are no holes in this virtual address space of SunOS, and
no "separate stuff" (I/O space) seen by a user-mode task.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:17     ` Richard B. Johnson
@ 2002-05-03 19:24       ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2002-05-03 19:24 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: Tony Luck, linux-kernel

On Fri, May 03, 2002 at 03:17:35PM -0400, Richard B. Johnson wrote:
> On Fri, 3 May 2002, Christoph Hellwig wrote:
> 
> > On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
> > > The other Unix's I've become familiar are Sun-OS, the
> > 
> > SunOS 5 uses separate address spaces on sparcv9 (32 and 64bit).
> > The same is true for many Linux ports, e.g. sparc64 or s390.
> > 
> 
> No no! I'm not talking about the physical address spaces. Many
> CPUs have separate address spaces for separate functions. I'm
> taking about the virtual address space that the process sees.
> There are no holes in this virtual address space of SunOS, and
> no "separate stuff" (I/O space) seen by a user-mode task.

This thread was about separate user/kernel VIRTUAL address spaces.
Not about holes, I/O spaces or other crap.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
  2002-04-27  1:15   ` Pavel Machek
  2002-05-03 19:09   ` Christoph Hellwig
@ 2002-05-03 19:38   ` Matti Aarnio
  2002-05-03 19:50   ` Tony Luck
  2002-05-03 20:22   ` Jeff Dike
  4 siblings, 0 replies; 22+ messages in thread
From: Matti Aarnio @ 2002-05-03 19:38 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: linux-kernel

On Fri, May 03, 2002 at 03:01:48PM -0400, Richard B. Johnson wrote:
...
> > This hasn't been an absolute requirement. There have
> > been 32-bit Unix implementations that gave separate
> > 4G address spaces to the kernel and to each user
> > process.  The only real downside to this is that
> > copyin()/copyout() are more complex. Some processors
> > provided special instructions to access user-mode
> > addresses from kernel to mitigate this complexity.
> > 
> > -Tony
> 
> Really? The only 32-bit Unix's I've seen the details of
> are SCO Unix, Interactive Unix, Linux, and BSD Unix.

   An example of hardware with fully separable user/kernel spaces
   are Motorola 68020-68060 series processors.

   They have those special instructions to choose (in kernel mode)
   what address spaces to use at which data access phase of the
   special moves.  There is some speed penalty, of course..

...
> Would you please tell me what Unix has 32-bit address space
> which is not shared with the kernel?

   That could be the one called "Linux", if a bunch of conditions
   are met -- beginning with suitable hardware.

> Cheers,
> Dick Johnson
> Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

/Matti Aarnio

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
                     ` (2 preceding siblings ...)
  2002-05-03 19:38   ` Matti Aarnio
@ 2002-05-03 19:50   ` Tony Luck
  2002-05-03 20:22   ` Jeff Dike
  4 siblings, 0 replies; 22+ messages in thread
From: Tony Luck @ 2002-05-03 19:50 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

--- "Richard B. Johnson" <root@chaos.analogic.com>
wrote:
> 
> I think that if this shared address-space doesn't
> exist
> then you don't have "Unix". You have something
> (perhaps
> better), but it's not Unix. 

Looking back a little earlier in the history of Unix,
we see that early versions ran on 16-bit
architectures. Does anyone out there remember Version
6 on the pdp11. It most certainly did not share the
address space (all 64k of it) between user and kernel.
Are you trying to say that what Dennis and Ken wrote
is not "Unix"?

-Tony

__________________________________________________
Do You Yahoo!?
Yahoo! Health - your guide to health and wellness
http://health.yahoo.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 19:01 ` Richard B. Johnson
                     ` (3 preceding siblings ...)
  2002-05-03 19:50   ` Tony Luck
@ 2002-05-03 20:22   ` Jeff Dike
  2002-05-03 19:30     ` Richard B. Johnson
  4 siblings, 1 reply; 22+ messages in thread
From: Jeff Dike @ 2002-05-03 20:22 UTC (permalink / raw)
  To: root; +Cc: linux-kernel

root@chaos.analogic.com said:
> Would you please tell me what Unix has 32-bit address space which is
> not shared with the kernel? 

I'm planning on doing that with UML at some point.

The claim that it's not Unix if it doesn't share the process address space
is just stupid.

				Jeff


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 20:22   ` Jeff Dike
@ 2002-05-03 19:30     ` Richard B. Johnson
  2002-05-03 22:35       ` Martin J. Bligh
  0 siblings, 1 reply; 22+ messages in thread
From: Richard B. Johnson @ 2002-05-03 19:30 UTC (permalink / raw)
  To: Jeff Dike; +Cc: linux-kernel

On Fri, 3 May 2002, Jeff Dike wrote:

> root@chaos.analogic.com said:
> > Would you please tell me what Unix has 32-bit address space which is
> > not shared with the kernel? 
> 
> I'm planning on doing that with UML at some point.
> 
> The claim that it's not Unix if it doesn't share the process address space
> is just stupid.
> 

No. It's not stupid. Unix defines a kind of operating system that
has certain characteristics and/or attributes. Process/kernel shared
address space is one of them. It's a name that has historical
signifigance.

Linux does not have to be Unix. In fact, divorcing virtual address
space may make a better Operating System and it's good that somebody
it planning that. But the result will not be the 25-30 year old
architecture we call Unix. It will be Linux. And it just might
be the thing that makes Linux shine above others, so don't call
this difference stupid.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 19:30     ` Richard B. Johnson
@ 2002-05-03 22:35       ` Martin J. Bligh
  2002-05-05  0:49         ` Denis Vlasenko
  0 siblings, 1 reply; 22+ messages in thread
From: Martin J. Bligh @ 2002-05-03 22:35 UTC (permalink / raw)
  To: root, Jeff Dike; +Cc: linux-kernel

> No. It's not stupid. Unix defines a kind of operating system that
> has certain characteristics and/or attributes. Process/kernel shared
> address space is one of them. It's a name that has historical
> signifigance.

Yes it is stupid. This is a small implementation detail, and has no
real importance whatsoever. People have done this in the past
(Dynix/PTX did it) will do so in the future. Nor does the kernel 
address space have to be global and shared across all tasks
as stated earlier in this thread. What makes it Unix is the interface
it presents to the world, and how it behaves, not the little details
of how it's implemented inside.

M.

PS. I've been told Solaris x86 can do 4Gb for each of kernel
and user space, though I've no first hand experience with that
OS.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-03 22:35       ` Martin J. Bligh
@ 2002-05-05  0:49         ` Denis Vlasenko
  2002-05-05 17:59           ` Martin J. Bligh
  0 siblings, 1 reply; 22+ messages in thread
From: Denis Vlasenko @ 2002-05-05  0:49 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel

On 3 May 2002 20:35, Martin J. Bligh wrote:
> > No. It's not stupid. Unix defines a kind of operating system that
> > has certain characteristics and/or attributes. Process/kernel shared
> > address space is one of them. It's a name that has historical
> > signifigance.
>
> Yes it is stupid. This is a small implementation detail, and has no
> real importance whatsoever. People have done this in the past
> (Dynix/PTX did it) will do so in the future. Nor does the kernel
> address space have to be global and shared across all tasks
> as stated earlier in this thread. What makes it Unix is the interface
> it presents to the world, and how it behaves, not the little details
> of how it's implemented inside.

I'm curious where it is visible to userspace?
(I'm asking for educational purposes)
--
vda

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was Discontigmem virt_to_page() )
  2002-05-05  0:49         ` Denis Vlasenko
@ 2002-05-05 17:59           ` Martin J. Bligh
  0 siblings, 0 replies; 22+ messages in thread
From: Martin J. Bligh @ 2002-05-05 17:59 UTC (permalink / raw)
  To: vda; +Cc: linux-kernel

> On 3 May 2002 20:35, Martin J. Bligh wrote:
>> > No. It's not stupid. Unix defines a kind of operating system that
>> > has certain characteristics and/or attributes. Process/kernel shared
>> > address space is one of them. It's a name that has historical
>> > signifigance.
>> 
>> Yes it is stupid. This is a small implementation detail, and has no
>> real importance whatsoever. People have done this in the past
>> (Dynix/PTX did it) will do so in the future. Nor does the kernel
>> address space have to be global and shared across all tasks
>> as stated earlier in this thread. What makes it Unix is the interface
>> it presents to the world, and how it behaves, not the little details
>> of how it's implemented inside.
> 
> I'm curious where it is visible to userspace?
> (I'm asking for educational purposes)

Where what is visible to userspace? If you mean the bit about 
"the interface it presents to the world", I meant Linux in 
general, not this feature. The whole point is that this is 
invisble to userspace (apart from performance and a lack of
architectural restrictions you might have been expecting) 
therefore it's irrelevant to whether it's "Unix" like or not.

M.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]
@ 2002-05-03  8:38 Andrea Arcangeli
  2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2002-05-03  8:38 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Daniel Phillips, Russell King,
	linux-kernel

On Thu, May 02, 2002 at 11:33:43PM -0700, Martin J. Bligh wrote:
> into kernel address space for ever. That's a fundamental scalability
> problem for a 32 bit machine, and I think we need to fix it. If we
> map only the pages the process is using into the user-kernel address
> space area, rather than the global KVA, we get rid of some of these
> problems. Not that that plan doesn't have its own problems, but ... ;-)

:) As said every workaround has a significant drawback at this point.
Starting flooding the tlb with invlpg and pagetable walking every time
we need to do a set_bit or clear_bit test_bit or an unlock_page is both
overkill at runtime and overcomplex on the software side too to manage
those kernel pools in user memory.

just assume we do that and that you're ok to pay for the hit in general
purpose usage, then the next year how will you plan to workaround the
limitation of 64G of physical ram, are you going to multiplex another
64G of ram via a pci register so you can handle 128G of ram on x86 just
not simultaneously? (but that's ok in theory, the cpu won't notice
you're swapping the ram under it, and you cannot keep mapped in virtual
mem more than 4G anyways simultaneously, so it doesn't matter if some
ram isn't visible on the phsical side either)

I mean, in theory there's no limit, but in practice there's a limit, 64G
is just over the limit for general purpose x86 IMHO, it's at a point
where every workaround for something has a significant performance (or
memory drawback), still very fine for custom apps that needs that much
ram but 32G is the pratical limit of general purpose x86 IMHO.

Ah, and of course you could also use 2M pagetables by default to make it
more usable but still you would run in some huge ram wastage in certain
usages with small files, huge pageins and reads swapout and swapins,
plus it wouldn't be guaranteed to be transparent to the userspace
binaries (for istance mmap offset fields would break backwards
compatibility on the required alignment, that's probably the last
problem though). Despite its also significant drawbacks and the
complexity of the change, probably the 4M pagetables would be the saner
approch to manage more efficiently 64G with only a 800M kernel window.

> Bear in mind that we've sucessfully used 64Gb of ram in a 32 bit 
> virtual addr space a long time ago with Dynix/PTX.

You can use 64G "sucessfully" just now too with 2.4.19pre8 too, as said
in the earlier email there are many applications that doesn't care if
there's only a few meg of zone_normal and for them 2.4.19pre8 is just
fine (actually -aa is much better for the bounce buffers and other vm
fixes in that area). If all the load is in userspace current 2.4 is just
optimal and you'll take advantage of all the ram without problems (let's
assume it's not a numa machine, with numa you'd be better with the fixes
I included in my tree).  But if you need the kernel to do some amount of
work, like vfs caching, blkdev cache, lots of bh on pagecache, lots of
vma, lots of kiobufs, skb etc..  then you'd probably be faster if you
boot with mem=32G or at least you should take actions like recompiling
the kernel as CONFIG_2G that would then break SGA large 1.7G etc...

> > So at the end you'll be left with
> > only say 5/10M per node of zone_normal that will be filled immediatly as
> > soon as you start reading some directory from disk. a few hundred mbyte
> > of vfs cache is the minimum for those machines, this doesn't even take
> > into account bh headers for the pagecache, physical address space
> > pagecache for the buffercache, kiobufs, vma, etc... 
> 
> Bufferheads are another huge problem right now. For a P4 machine, they
> round off to 128 bytes per data structure. I was just looking at a 16Gb
> machine that had completely wedged itself by filling ZONE_NORMAL with 

Go ahead, use -aa or the vm-33 update, I fixed that problem a few days
after hearing about it the first time (with the due credit to Rik in a
comment for showing me such problem btw, I never noticed it before).

> unfreeable overhead - 440Mb of bufferheads alone. Globally mapping the
> bufferheads is probably another thing that'll have to go.
> 
> > It's just that 1G of
> > virtual address space reserved for kernel is too low to handle
> > efficiently 64G of physical ram, this is a fact and you can't 
> > workaround it. 
> 
> Death to global mappings! ;-)
> 
> I'd agree that a 64 bit vaddr space makes much more sense, but we're

This is my whole point yes :)

> stuck with the chips we've got for a little while yet. AMD were a few
> years too late for the bleeding edge Intel arch people amongst us.

Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03  8:38 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Andrea Arcangeli
@ 2002-05-03 15:17 ` Martin J. Bligh
  2002-05-03 15:58   ` Andrea Arcangeli
  2002-05-03 16:02   ` Daniel Phillips
  0 siblings, 2 replies; 22+ messages in thread
From: Martin J. Bligh @ 2002-05-03 15:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: William Lee Irwin III, Daniel Phillips, Russell King,
	linux-kernel

> On Thu, May 02, 2002 at 11:33:43PM -0700, Martin J. Bligh wrote:
>> into kernel address space for ever. That's a fundamental scalability
>> problem for a 32 bit machine, and I think we need to fix it. If we
>> map only the pages the process is using into the user-kernel address
>> space area, rather than the global KVA, we get rid of some of these
>> problems. Not that that plan doesn't have its own problems, but ... ;-)
> 
> :) As said every workaround has a significant drawback at this point.
> Starting flooding the tlb with invlpg and pagetable walking every time
> we need to do a set_bit or clear_bit test_bit or an unlock_page is both
> overkill at runtime and overcomplex on the software side too to manage
> those kernel pools in user memory.

Whilst I take your point in principle, and acknowledge that there is
some cost to pay, I don't believe that the working set of one task is
all that dynamic (see also second para below). Some stuff really is 
global data, that's used by a lot of processes, but lots of other 
things really are per task. If only one process has a given file open, that's the only process that needs to see the pagecache control 
structures for that file. 

We don't have to tlb flush every time we map something in, only when
we delete it. For the sake of illustration, imagine a huge kmap pool
for each task, we just map things in as we need them (say some pagecache
structures when we open a file that's already partly in cache), and
use lazy TLB flushing to tear down those structures for free when we
context switch. If we run out of virtual space, yes, we'll have to 
flush, but I suspect that won't be too bad (for most workloads) if
we careful how we flush.

> just assume we do that and that you're ok to pay for the hit in general
> purpose usage, then the next year how will you plan to workaround the
> limitation of 64G of physical ram,

;-) No, I agree we're pushing the limits here, and I don't want to be
fighting this too much longer. The next generation of machines will 
all have larger virtual address spaces, and I'll be happy when they
arrive. For now, we have to deal with what we have, and support the
machines that are in the marketplace, and ia32 is (to my mind) still
faster than ia64. 

I'm really looking forward to AMD's Hammer architecture, but it's 
simply not here right now, and even when it is, there will be these 
older 32 bit machines in the field for a few years yet to come, and
we have to cope with them as best we can.

> Ah, and of course you could also use 2M pagetables by default to make it
> more usable but still you would run in some huge ram wastage in certain
> usages with small files, huge pageins and reads swapout and swapins,
> plus it wouldn't be guaranteed to be transparent to the userspace
> binaries (for istance mmap offset fields would break backwards
> compatibility on the required alignment, that's probably the last
> problem though). Despite its also significant drawbacks and the
> complexity of the change, probably the 4M pagetables would be the saner
> approch to manage more efficiently 64G with only a 800M kernel window.

Though that'd reduce the size of some of the structures, I'd still
have other concerns (such as tlb size, which is something stupid
like 4 pages, IIRC), and the space wastage you mentioned. Page 
clustering is probably a more useful technique - letting the existing
control structures control groups of pages. For example, one struct
page could control aligned groups of 4 4K pages, giving us an 
effective page size of 16K from the management overhead point of
view (swap in and out in 4 page chunks, etc).

>> Bear in mind that we've sucessfully used 64Gb of ram in a 32 bit 
>> virtual addr space a long time ago with Dynix/PTX.
> 
> You can use 64G "sucessfully" just now too with 2.4.19pre8 too, as said

I said *used*, not *booted* ;-) There's a whole host of problems
we still have to fix yet, and some tradeoffs to be made - we just
have to make those without affecting the people that don't need
them. It won't be easy, but I don't think it'll be impossible either.

>> Bufferheads are another huge problem right now. For a P4 machine, they
>> round off to 128 bytes per data structure. I was just looking at a 16Gb
>> machine that had completely wedged itself by filling ZONE_NORMAL with 
> 
> Go ahead, use -aa or the vm-33 update, I fixed that problem a few days
> after hearing about it the first time (with the due credit to Rik in a
> comment for showing me such problem btw, I never noticed it before).

Thanks - I'll have a close look at that ... I didn't know you'd
already fixed that one.

M.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
@ 2002-05-03 15:58   ` Andrea Arcangeli
  2002-05-03 16:10     ` Martin J. Bligh
  2002-05-03 16:02   ` Daniel Phillips
  1 sibling, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 15:58 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: William Lee Irwin III, Daniel Phillips, Russell King,
	linux-kernel

On Fri, May 03, 2002 at 08:17:23AM -0700, Martin J. Bligh wrote:
> We don't have to tlb flush every time we map something in, only when
> we delete it. For the sake of illustration, imagine a huge kmap pool
> for each task, we just map things in as we need them (say some pagecache

yes, the pool will "cache" the mem_map virtual window for a while, but
the complexity of the pool management isn't trivial, in the page
structure you won't find the associated per-task cached virtual address,
you will need something like a lookup on a data structure associated
with the task struct to find if you just have it in cache or not in the
per-process userspace kmap pool. The current kmap pool is an order of
magnitude simpler thanks to page->virtual but you cannot have a
page->virtual[nr_tasks] array.

Another interesting problem is that 'struct page *' will be as best a
cookie, not a valid pointer anymore, not sure what's the best way to
handle that. Working with pfn would be cleaner rather than working with
a cookie (somebody could dereference the cookie by mistake thinking it's
a page structure old style), but if __alloc_pages returns a pfn a whole
lot of kernel code will break.

> older 32 bit machines in the field for a few years yet to come, and
> we have to cope with them as best we can.

Sure.

> Though that'd reduce the size of some of the structures, I'd still
> have other concerns (such as tlb size, which is something stupid
> like 4 pages, IIRC), and the space wastage you mentioned. Page 

it has 8 pages for data and 2 for instructions, that's 16M data and 4M
of instructions with PAE.  4k pages can be cached with at most 64 slots
for data and 32 entries for instructions, that means 256K of data and
128k of instructions. The main disavantage is that we basically would
waste the 4k tlb slots, and we'd share the same slots with the kernel.
It mostly depend on the workload but in theory the 8 pages for data
could reduce the pte walking (also not to mention a layer less of pte
would make the pte walking faster too). So I think 2M pages could
speedup some application, but the main advantage remains that you
wouldn't need to change the page structure handling.

Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 15:58   ` Andrea Arcangeli
@ 2002-05-03 16:10     ` Martin J. Bligh
  2002-05-03 16:25       ` Andrea Arcangeli
  0 siblings, 1 reply; 22+ messages in thread
From: Martin J. Bligh @ 2002-05-03 16:10 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: William Lee Irwin III, Daniel Phillips, linux-kernel

> Another interesting problem is that 'struct page *' will be as best a
> cookie, not a valid pointer anymore, not sure what's the best way to
> handle that. Working with pfn would be cleaner rather than working with
> a cookie (somebody could dereference the cookie by mistake thinking it's
> a page structure old style), but if __alloc_pages returns a pfn a whole
> lot of kernel code will break.

Yup, a physical address pfn would probably be best.

(such as tlb size, which is something stupid like 4 pages, IIRC)

> it has 8 pages for data and 2 for instructions, that's 16M data and 4M
> of instructions with PAE

What is "it", a P4? I think the sizes are dependant on which chip you're
using. The x440 has the P4 chips, but the NUMA-Q is is P2 or P3 (even
PPro for the oldest ones, but those don't work at the moment with Linux
on multiquad).

M.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:10     ` Martin J. Bligh
@ 2002-05-03 16:25       ` Andrea Arcangeli
  0 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 16:25 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: William Lee Irwin III, Daniel Phillips, linux-kernel

On Fri, May 03, 2002 at 09:10:46AM -0700, Martin J. Bligh wrote:
> > Another interesting problem is that 'struct page *' will be as best a
> > cookie, not a valid pointer anymore, not sure what's the best way to
> > handle that. Working with pfn would be cleaner rather than working with
> > a cookie (somebody could dereference the cookie by mistake thinking it's
> > a page structure old style), but if __alloc_pages returns a pfn a whole
> > lot of kernel code will break.
> 
> Yup, a physical address pfn would probably be best.
> 
> (such as tlb size, which is something stupid like 4 pages, IIRC)

you recall correcty the mean :), it's 8 for data and 2 for instructions.
But I don't think the tlb is the problem, potentially it's a big win for
the big apps like database, more ram addressed via tlb and faster
pagetable lookups, it's the I/O granularity for the pageins that is
probably the most annoying part. Even if you've a fast disk, 2M instead
of kbytes is going to make difference, as well as the fact a 4M per page
and the bh on the pagecache would waste quite lots of ram with small
files.

> > it has 8 pages for data and 2 for instructions, that's 16M data and 4M
> > of instructions with PAE
> 
> What is "it", a P4? I think the sizes are dependant on which chip you're

I didn't read if P4 changes that, nor I checked the athlon yet, I read
it in the usual and a bit old system programmin manual 3.

> using. The x440 has the P4 chips, but the NUMA-Q is is P2 or P3 (even
> PPro for the oldest ones, but those don't work at the moment with Linux
> on multiquad).

that's the P6 family, so the PPro P2 P3 all included (only P5 excluded).

Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
  2002-05-03 15:58   ` Andrea Arcangeli
@ 2002-05-03 16:02   ` Daniel Phillips
  2002-05-03 16:20     ` Andrea Arcangeli
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Phillips @ 2002-05-03 16:02 UTC (permalink / raw)
  To: Martin J. Bligh, Andrea Arcangeli; +Cc: William Lee Irwin III, linux-kernel

On Friday 03 May 2002 17:17, Martin J. Bligh wrote:
> Andrea apparently wrote:
> > Ah, and of course you could also use 2M pagetables by default to make it
> > more usable but still you would run in some huge ram wastage in certain
> > usages with small files, huge pageins and reads swapout and swapins,
> > plus it wouldn't be guaranteed to be transparent to the userspace
> > binaries (for istance mmap offset fields would break backwards
> > compatibility on the required alignment, that's probably the last
> > problem though). Despite its also significant drawbacks and the
> > complexity of the change, probably the 4M pagetables would be the saner
> > approch to manage more efficiently 64G with only a 800M kernel window.
> 
> Though that'd reduce the size of some of the structures, I'd still
> have other concerns (such as tlb size, which is something stupid
> like 4 pages, IIRC), and the space wastage you mentioned. Page 
> clustering is probably a more useful technique - letting the existing
> control structures control groups of pages. For example, one struct
> page could control aligned groups of 4 4K pages, giving us an 
> effective page size of 16K from the management overhead point of
> view (swap in and out in 4 page chunks, etc).

IMHO, this will be a much easier change than storing mem_map in highmem,
and solves 75% of the problem.  It's not just ia32 numa that will benefit
from it.  For example, MIPS supports 16K pages in software, which will
take a lot of load off the tlb.  According to Ralf, there are benefits
re virtual aliasing as well.

-- 
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:02   ` Daniel Phillips
@ 2002-05-03 16:20     ` Andrea Arcangeli
  2002-05-03 16:41       ` Daniel Phillips
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 16:20 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> and solves 75% of the problem.  It's not just ia32 numa that will benefit
> from it.  For example, MIPS supports 16K pages in software, which will

the whole change would be specific to ia32, I don't see the connection
with mips. There would be nothing to share between ia32 2M pages and
mips 16K pages. You can do mips 16K just now indipendently from the
page_size of ia32. 16K should work without surprises because other archs
have pages of this size and even bigger. Nobody has pages large as much
as 2M yet, that's an order of magnitude bigger. 16K for example is just
fine for the read()/pagein/pageout I/O, DMA is usually done in larger
chunks anyways with readahead and async-flushing to be faster (but never
as big as 2M, the highest limit is 512k per scsi command).

Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:20     ` Andrea Arcangeli
@ 2002-05-03 16:41       ` Daniel Phillips
  2002-05-03 16:58         ` Andrea Arcangeli
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Phillips @ 2002-05-03 16:41 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Friday 03 May 2002 18:20, Andrea Arcangeli wrote:
> On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> > and solves 75% of the problem.  It's not just ia32 numa that will benefit
> > from it.  For example, MIPS supports 16K pages in software, which will
> 
> the whole change would be specific to ia32, I don't see the connection
> with mips. There would be nothing to share between ia32 2M pages and
> mips 16K pages.

The topic here is 'page clustering'.  The idea is to use one struct page for
every four 4K page frames on ia32.

-- 
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:41       ` Daniel Phillips
@ 2002-05-03 16:58         ` Andrea Arcangeli
  2002-05-03 18:08           ` Daniel Phillips
  0 siblings, 1 reply; 22+ messages in thread
From: Andrea Arcangeli @ 2002-05-03 16:58 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Fri, May 03, 2002 at 06:41:15PM +0200, Daniel Phillips wrote:
> On Friday 03 May 2002 18:20, Andrea Arcangeli wrote:
> > On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> > > and solves 75% of the problem.  It's not just ia32 numa that will benefit
> > > from it.  For example, MIPS supports 16K pages in software, which will
> > 
> > the whole change would be specific to ia32, I don't see the connection
> > with mips. There would be nothing to share between ia32 2M pages and
> > mips 16K pages.
> 
> The topic here is 'page clustering'.  The idea is to use one struct page for
> every four 4K page frames on ia32.

ah ok, I meant physical hardware pages. physical hardware pages should
be doable without common code changes, a software PAGE_SIZE or the
PAGE_CACHE_SIZE raises non trivial problems instead.

Andrea

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Virtual address space exhaustion (was  Discontigmem virt_to_page() )
  2002-05-03 16:58         ` Andrea Arcangeli
@ 2002-05-03 18:08           ` Daniel Phillips
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Phillips @ 2002-05-03 18:08 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Martin J. Bligh, William Lee Irwin III, linux-kernel

On Friday 03 May 2002 18:58, Andrea Arcangeli wrote:
> On Fri, May 03, 2002 at 06:41:15PM +0200, Daniel Phillips wrote:
> > On Friday 03 May 2002 18:20, Andrea Arcangeli wrote:
> > > On Fri, May 03, 2002 at 06:02:18PM +0200, Daniel Phillips wrote:
> > > > and solves 75% of the problem.  It's not just ia32 numa that will benefit
> > > > from it.  For example, MIPS supports 16K pages in software, which will
> > > 
> > > the whole change would be specific to ia32, I don't see the connection
> > > with mips. There would be nothing to share between ia32 2M pages and
> > > mips 16K pages.
> > 
> > The topic here is 'page clustering'.  The idea is to use one struct page for
> > every four 4K page frames on ia32.
> 
> ah ok, I meant physical hardware pages. physical hardware pages should
> be doable without common code changes, a software PAGE_SIZE or the
> PAGE_CACHE_SIZE raises non trivial problems instead.

Yes, it's not too bad though.  In the swap-in path, the locking would be against
mem_map + (pfn >> 2).  The four pages don't have to be read in and valid all at
the same time - it's ok to take multiple faults on the cluster, not recommended,
but ok.  In the swap-out path, all four page frames have to be swapped out and
invalidated at the same time.

-- 
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2002-05-06  9:49 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-05-03 18:37 Virtual address space exhaustion (was Discontigmem virt_to_page() ) Tony Luck
2002-05-03 19:01 ` Richard B. Johnson
2002-04-27  1:15   ` Pavel Machek
2002-05-03 19:09   ` Christoph Hellwig
2002-05-03 19:17     ` Richard B. Johnson
2002-05-03 19:24       ` Christoph Hellwig
2002-05-03 19:38   ` Matti Aarnio
2002-05-03 19:50   ` Tony Luck
2002-05-03 20:22   ` Jeff Dike
2002-05-03 19:30     ` Richard B. Johnson
2002-05-03 22:35       ` Martin J. Bligh
2002-05-05  0:49         ` Denis Vlasenko
2002-05-05 17:59           ` Martin J. Bligh
  -- strict thread matches above, loose matches on Subject: below --
2002-05-03  8:38 Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?] Andrea Arcangeli
2002-05-03 15:17 ` Virtual address space exhaustion (was Discontigmem virt_to_page() ) Martin J. Bligh
2002-05-03 15:58   ` Andrea Arcangeli
2002-05-03 16:10     ` Martin J. Bligh
2002-05-03 16:25       ` Andrea Arcangeli
2002-05-03 16:02   ` Daniel Phillips
2002-05-03 16:20     ` Andrea Arcangeli
2002-05-03 16:41       ` Daniel Phillips
2002-05-03 16:58         ` Andrea Arcangeli
2002-05-03 18:08           ` Daniel Phillips

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox