* Minutes from Feb 21 LSE Call
@ 2003-02-21 23:48 Hanna Linder
2003-02-22 0:16 ` Larry McVoy
` (2 more replies)
0 siblings, 3 replies; 124+ messages in thread
From: Hanna Linder @ 2003-02-21 23:48 UTC (permalink / raw)
To: lse-tech; +Cc: linux-kernel
LSE Con Call Minutes from Feb 21
Minutes compiled by Hanna Linder hannal@us.ibm.com, please post
corrections to lse-tech@lists.sf.net.
Object Based Reverse Mapping:
(Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)
Dave coded up an initial patch for partial object based rmap
which he sent to linux-mm yesterday. Rik pointed out there is a scalability
problem with the full object based approach. However, a hybrid approach
between regular rmap and object-based may not be too radical for the
2.5/2.6 timeframe.
Ben said none of the users have been complaining about
performance with the existing rmap. Martin disagreed and said Linus,
Andrew Morton and himself have all agreed there is a problem.
One of the problems Martin is already hitting on high-CPU-count machines
with large memory is the space consumed by all the pte-chains filling up
memory and killing the machine. There is also a performance cost to
maintaining the chains.
Ben said they shouldn't be using fork; bash is the
main user of fork and should be changed to use clone instead.
Gerrit said bash is not used as much as Ben might think on
these large systems running real world applications.
Ben said he doesn't see the large-systems problems with
the users he talks to and doesn't agree the full object-based rmap
is needed. Gerrit explained we have very complex workloads running on
very large systems and we are already hitting the space consumption
problem which is a blocker for running Linux on them.
Ben said none of the distros are supporting these large
systems right now. Martin said UL is already starting to support
them. Then it degraded into a distro discussion and Hanna asked
for them to bring it back to the technical side.
In order to show the problem with object-based rmap you have to
add VM pressure to existing benchmarks to see what happens. Martin
agreed to run multiple benchmarks on the same systems to simulate this.
Cliff White of the OSDL offered to help Martin with this.
At the end Ben said the solution for now needs to be
a hybrid with existing rmap. Martin, Rik, and Dave all agreed with Ben.
Then we all agreed to move on to other things.
*ActionItem - someone needs to change bash to use clone instead of fork.
Scheduler Hang as discovered by restarting a large Web application
multiple times:
Rick Lindsley / Hanna Linder
We were seeing a hard hang after restarting a large web
serving application 3-6 times on the 2.5.59 (and up) kernels
(also seen as far back as 2.5.44). It was mainly caused when two
threads each have interrupts disabled and one is spinning on a lock that
the other is holding. The one holding the lock has sent an IPI to all
the other processors telling them to flush their TLBs. But the one
waiting for the spinlock has interrupts turned off and does not receive
that IPI request, so they both sit there waiting forever.
The final fix will be in kernel.org mainline kernel version 2.5.63.
Here are the individual patches which should apply with fuzz to
older kernel versions:
http://linux.bkbits.net:8080/linux-2.5/cset@1.1005?nav=index.html
http://linux.bkbits.net:8080/linux-2.5/cset@1.1004?nav=index.html
Shared Memory Binding:
Matt Dobson -
Shared memory binding API (new). A way for an
application to bind shared memory to nodes. The motivation
is support for large databases that want more control
over their shared memory.
The current allocation scheme is that each process gets
a chunk of shared memory from the same node the process
is located on. Instead of page faulting around to different
nodes dynamically, this API will allow a process to specify
which node or set of nodes to bind the shared memory to.
Work in progress.
Martin - gcc 2.95 vs 3.2.
Martin has done some testing which indicates that gcc 3.2 produces
slightly worse code for the kernel than 2.95 and takes a bit
longer to do so. gcc 3.2 -Os produces larger code than gcc 2.95 -O2.
On his machines -O2 was faster than -Os, but on a CPU with smaller
caches the inverse may be true. More testing may be needed.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-21 23:48 Minutes from Feb 21 LSE Call Hanna Linder
@ 2003-02-22 0:16 ` Larry McVoy
2003-02-22 0:25 ` William Lee Irwin III
` (4 more replies)
2003-02-23 0:42 ` Eric W. Biederman
2003-02-23 3:24 ` Andrew Morton
2 siblings, 5 replies; 124+ messages in thread
From: Larry McVoy @ 2003-02-22 0:16 UTC (permalink / raw)
To: Hanna Linder; +Cc: lse-tech, linux-kernel
> Ben said none of the distros are supporting these large
> systems right now. Martin said UL is already starting to support
> them.
Ben is right. I think IBM and the other big iron companies would be
far better served looking at what they have done with running multiple
instances of Linux on one big machine, like the 390 work. Figure out
how to use that model to scale up. There is simply not a big enough
market to justify shoveling lots of scaling stuff in for huge machines
that only a handful of people can afford. That's the same path which
has sunk all the workstation companies, they all have bloated OS's and
Linux runs circles around them.
In terms of the money and in terms of installed seats, the small Linux
machines outnumber the 4 or more CPU SMP machines easily 10,000:1.
And with the embedded market being one of the few real money makers
for Linux, there will be huge pushback from those companies against
changes which increase memory footprint.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: Minutes from Feb 21 LSE Call
2003-02-22 0:16 ` Larry McVoy
@ 2003-02-22 0:25 ` William Lee Irwin III
2003-02-22 2:24 ` Steven Cole
2003-02-22 0:44 ` Martin J. Bligh
` (3 subsequent siblings)
4 siblings, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-22 0:25 UTC (permalink / raw)
To: Larry McVoy, Hanna Linder, lse-tech, linux-kernel
On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.
Scalability done properly should not degrade performance on smaller
machines, Pee Cees, or even microscopic organisms.
On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.
There's quite a bit of commonality with large x86 highmem there, as
the highmem crew is extremely concerned about the kernel's memory
footprint and is looking to trim kernel memory overhead from every
aspect of its operation they can. Reducing kernel memory footprint
is a crucial part of scalability, in both scaling down to the low end
and scaling up to highmem. =)
-- wli
* Re: Minutes from Feb 21 LSE Call
2003-02-22 0:16 ` Larry McVoy
2003-02-22 0:25 ` William Lee Irwin III
@ 2003-02-22 0:44 ` Martin J. Bligh
2003-02-22 2:47 ` Larry McVoy
2003-02-22 8:32 ` David S. Miller
` (2 subsequent siblings)
4 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-22 0:44 UTC (permalink / raw)
To: Larry McVoy, Hanna Linder; +Cc: lse-tech, linux-kernel
> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.
In your humble opinion.
Unfortunately, as I've pointed out to you before, this doesn't work in
practice. Workloads may not be easily divisible amongst machines, and
you're just pushing all the complex problems out for every userspace
app to solve itself, instead of fixing it once in the kernel.
The fact that you were never able to do this before doesn't mean it's
impossible, it just means that you failed.
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.
And the profit margin on the big machines will outpace the smaller
machines by a similar ratio, inverted. The high-end space is where most
of the money is made by the Linux distros, by selling products like SLES
or Advanced Server to people who can afford to pay for it.
M.
* Re: Minutes from Feb 21 LSE Call
2003-02-22 0:25 ` William Lee Irwin III
@ 2003-02-22 2:24 ` Steven Cole
0 siblings, 0 replies; 124+ messages in thread
From: Steven Cole @ 2003-02-22 2:24 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Larry McVoy, Hanna Linder, lse-tech, LKML
On Fri, 2003-02-21 at 17:25, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> > Ben is right. I think IBM and the other big iron companies would be
> > far better served looking at what they have done with running multiple
> > instances of Linux on one big machine, like the 390 work. Figure out
> > how to use that model to scale up. There is simply not a big enough
> > market to justify shoveling lots of scaling stuff in for huge machines
> > that only a handful of people can afford. That's the same path which
> > has sunk all the workstation companies, they all have bloated OS's and
> > Linux runs circles around them.
mjb> Unfortunately, as I've pointed out to you before, this doesn't work
mjb> in practice. Workloads may not be easily divisible amongst
mjb> machines, and you're just pushing all the complex problems out for
mjb> every userspace app to solve itself, instead of fixing it once in
mjb> the kernel.
Please permit an observer from the sidelines a few comments.
I think all four of you are right, for different reasons.
>
> Scalability done properly should not degrade performance on smaller
> machines, Pee Cees, or even microscopic organisms.
s/should/must/ in the above. That must be a guiding principle.
>
>
> On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> There's quite a bit of commonality with large x86 highmem there, as
> the highmem crew is extremely concerned about the kernel's memory
> footprint and is looking to trim kernel memory overhead from every
> aspect of its operation they can. Reducing kernel memory footprint
> is a crucial part of scalability, in both scaling down to the low end
> and scaling up to highmem. =)
>
>
> -- wli
Since the time between major releases of the kernel seems to be two to
three years now (counting to where the new kernel is really stable),
it is probably worthwhile to think about what high-end systems will
be like when 3.0 is expected.
My guess is that a trend will be machines with increasingly greater cpu
counts with access to the same memory. Why? Because if it can be done,
it will be done. The ability to put more cpus on a single chip may
translate into a Moore's law of increasing cpu counts per machine. And
as Martin points out, the high end machines are where the money is.
In my own unsophisticated opinion, Larry's concept of Cache Coherent
Clusters seems worth further development. And Martin is right about the
need for fixing it in the kernel, again IMHO. But how to fix it in the
kernel? Would something similar to OpenMosix or OpenSSI in a future
kernel be appropriate to get Larry's CCCluster members to cooperate? Or
is it possible to continue the scalability race when CPU counts get to
256, 512, etc.?
Just some thoughts from the sidelines.
Best regards,
Steven
* Re: Minutes from Feb 21 LSE Call
2003-02-22 0:44 ` Martin J. Bligh
@ 2003-02-22 2:47 ` Larry McVoy
2003-02-22 4:32 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: Larry McVoy @ 2003-02-22 2:47 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Larry McVoy, Hanna Linder, lse-tech, linux-kernel
On Fri, Feb 21, 2003 at 04:44:13PM -0800, Martin J. Bligh wrote:
> > Ben is right. I think IBM and the other big iron companies would be
> > far better served looking at what they have done with running multiple
> > instances of Linux on one big machine, like the 390 work. Figure out
> > how to use that model to scale up. There is simply not a big enough
> > market to justify shoveling lots of scaling stuff in for huge machines
> > that only a handful of people can afford. That's the same path which
> > has sunk all the workstation companies, they all have bloated OS's and
> > Linux runs circles around them.
>
> In your humble opinion.
My opinion has nothing to do with it, go benchmark them and see for
yourself. I'm in a pretty good position to back up my statements with
data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
as a pile of others, so we have both the hardware and the software to
do the comparisons. I stand by my statement above and so does anyone else
who has done the measurements. It is much much more pleasant to have
Linux versus any other Unix implementation on the same platform. Let's
keep it that way.
> Unfortunately, as I've pointed out to you before, this doesn't work in
> practice. Workloads may not be easily divisible amongst machines, and
> you're just pushing all the complex problems out for every userspace
> app to solve itself, instead of fixing it once in the kernel.
"fixing it", huh? Your "fixes" may be great for your tiny segment of
the market but they are not going to be welcome if they turn Linux into
BloatOS 9.8.
> The fact that you were never able to do this before doesn't mean it's
> impossible, it just means that you failed.
Thanks for the vote of confidence. I think the thing to focus on,
however, is that *no one* has ever succeeded at what you are trying
to do. And there have been many, many attempts. Your opinion, it
would appear, is that you are smarter than all of the people in all
of those past failed attempts, but you'll forgive me if I'm not
impressed with your optimism.
> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> And the profit margin on the big machines will outpace the smaller
> machines by a similar ratio, inverted.
Really? How about some figures? You'd need HUGE profit margins to
justify your position, how about some actual hard cold numbers?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: Minutes from Feb 21 LSE Call
2003-02-22 2:47 ` Larry McVoy
@ 2003-02-22 4:32 ` Martin J. Bligh
2003-02-22 5:05 ` Larry McVoy
0 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-22 4:32 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel
>> In your humble opinion.
>
> My opinion has nothing to do with it, go benchmark them and see for
> yourself.
Nope, I was referring to this:
>> > Ben is right. I think IBM and the other big iron companies would be
>> > far better served looking at what they have done with running multiple
>> > instances of Linux on one big machine, like the 390 work. Figure out
>> > how to use that model to scale up. There is simply not a big enough
>> > market to justify shoveling lots of scaling stuff in for huge machines
>> > that only a handful of people can afford.
Which I totally disagree with.
>> >That's the same path which
>> > has sunk all the workstation companies, they all have bloated OS's and
>> > Linux runs circles around them.
Not the fact that Linux is capable of stellar things, which I totally
agree with.
> I'm in a pretty good position to back up my statements with
> data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
> as a pile of others, so we have both the hardware and the software to
> do the comparisons. I stand by statement above and so does anyone else
> who has done the measurements.
Oh, I don't doubt it - But I'd be amused to see the measurements,
if you have them to hand.
> It is much much more pleasant to have Linux versus any other Unix
> implementation on the same platform. Let's keep it that way.
Absolutely.
>> Unfortunately, as I've pointed out to you before, this doesn't work in
>> practice. Workloads may not be easily divisible amongst machines, and
>> you're just pushing all the complex problems out for every userspace
>> app to solve itself, instead of fixing it once in the kernel.
>
> "fixing it", huh? Your "fixes" may be great for your tiny segment of
> the market but they are not going to be welcome if they turn Linux into
> BloatOS 9.8.
They won't - the maintainers would never allow us to do that.
>> The fact that you were never able to do this before doesn't mean it's
>> impossible, it just means that you failed.
>
> Thanks for the vote of confidence. I think the thing to focus on,
> however, is that *noone* has ever succeeded at what you are trying
> to do. And there have been many, many attempts. Your opinion, it
> would appear, is that you are smarter than all of the people in all
> of those past failed attempts, but you'll forgive me if I'm not
> impressed with your optimism.
Who said that I was going to single-handedly change the world? What's
different with Linux is the development model. That's why *we* will
succeed where others have failed before. There's some incredible intellect
all around Linux, but that's not all it takes, as you've pointed out.
>> > In terms of the money and in terms of installed seats, the small Linux
>> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
>> > And with the embedded market being one of the few real money makers
>> > for Linux, there will be huge pushback from those companies against
>> > changes which increase memory footprint.
>>
>> And the profit margin on the big machines will outpace the smaller
>> machines by a similar ratio, inverted.
>
> Really? How about some figures? You'd need HUGE profit margins to
> justify your position, how about some actual hard cold numbers?
I don't have them to hand, but if you think anyone's making money on
PCs nowadays, you're delusional (with respect to hardware). With respect
to Linux, what makes you think distros are going to make large amounts
of money from a freely replicatable OS, for tiny embedded systems?
Support for servers, on the other hand, is a different game ...
M.
* Re: Minutes from Feb 21 LSE Call
2003-02-22 4:32 ` Martin J. Bligh
@ 2003-02-22 5:05 ` Larry McVoy
2003-02-22 6:39 ` Martin J. Bligh
2003-02-22 8:38 ` David S. Miller
0 siblings, 2 replies; 124+ messages in thread
From: Larry McVoy @ 2003-02-22 5:05 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Larry McVoy, Hanna Linder, lse-tech, linux-kernel
On Fri, Feb 21, 2003 at 08:32:30PM -0800, Martin J. Bligh wrote:
> > "fixing it", huh? Your "fixes" may be great for your tiny segment of
> > the market but they are not going to be welcome if they turn Linux into
> > BloatOS 9.8.
>
> They won't - the maintainers would never allow us to do that.
The path to hell is paved with good intentions.
> > Really? How about some figures? You'd need HUGE profit margins to
> > justify your position, how about some actual hard cold numbers?
>
> I don't have them to hand, but if you think anyone's making money on
> PCs nowadays, you're delusional (with respect to hardware).
Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
$500M/quarter in profit.
Lots of people working for companies who haven't figured out how to do
it as well as Dell *say* it can't be done but numbers say differently.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: Minutes from Feb 21 LSE Call
2003-02-22 5:05 ` Larry McVoy
@ 2003-02-22 6:39 ` Martin J. Bligh
2003-02-22 8:38 ` Jeff Garzik
2003-02-22 8:38 ` David S. Miller
2003-02-22 8:38 ` David S. Miller
1 sibling, 2 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-22 6:39 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel
>> I don't have them to hand, but if you think anyone's making money on
>> PCs nowadays, you're delusional (with respect to hardware).
>
> Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> $500M/quarter in profit.
>
> Lots of people working for companies who haven't figured out how to do
> it as well as Dell *say* it can't be done but numbers say differently.
And how much of that was profit on PCs running Linux?
M.
* Re: Minutes from Feb 21 LSE Call
2003-02-22 0:16 ` Larry McVoy
2003-02-22 0:25 ` William Lee Irwin III
2003-02-22 0:44 ` Martin J. Bligh
@ 2003-02-22 8:32 ` David S. Miller
2003-02-22 18:20 ` Alan Cox
2003-02-23 0:37 ` Eric W. Biederman
4 siblings, 0 replies; 124+ messages in thread
From: David S. Miller @ 2003-02-22 8:32 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel
On Fri, 2003-02-21 at 16:16, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
While I totally agree with your points, I want to mention that
although this ratio is true, the exact opposite ratio applies to
the price of the service contracts a company can land with the big
machines :-)
* Re: Minutes from Feb 21 LSE Call
2003-02-22 6:39 ` Martin J. Bligh
@ 2003-02-22 8:38 ` Jeff Garzik
2003-02-22 22:18 ` William Lee Irwin III
2003-02-22 8:38 ` David S. Miller
1 sibling, 1 reply; 124+ messages in thread
From: Jeff Garzik @ 2003-02-22 8:38 UTC (permalink / raw)
To: linux-kernel
ia32 big iron. sigh. I think that's unfortunate in a number
of ways, but the main reason, of course, is that highmem is evil :)
Intel can use PAE to "turn back the clock" on ia32. Although googling
doesn't support this speculation, I am willing to bet Intel will
eventually unveil a new PAE that busts the 64GB barrier -- instead of
trying harder to push consumers to 64-bit processors. Processor speed,
FSB speed, PCI bus bandwidth, all these are issues -- but ones that
pale in comparison to the long term effects of highmem on the market.
Enterprise customers will see this as a signal to continue building
around ia32 for the next few years, thoroughly damaging 64-bit
technology sales and development. I bet even IA64 suffers...
at Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
floating around The Register and various rumor web sites, but Intel
is gonna miss that huge profit opportunity too by trying to hack the
ia32 ISA to scale up to big iron -- where it doesn't belong.
Being cynical, one might guess that Intel will treat IA64 as a loss
leader until the other 64-bit competition dies, keeping ia32 at the
top end of the market via silly PAE/PSE hacks. When the existing
64-bit competition disappears, five years down the road, compilers
will have matured sufficiently to make using IA64 boxes feasible.
If you really want to scale, just go to 64-bits, darn it. Don't keep
hacking ia32 ISA -- leave it alone, it's fine as it is, and will live
a nice long life as the future's preferred embedded platform.
64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
but who knows if AMD will survive competition with Intel. PPC64 is
the wild card in all this. I hope it succeeds.
Jeff,
feeling like a silly, random rant after a long drive
...and from a technical perspective, highmem grots up the code, too :)
* Re: Minutes from Feb 21 LSE Call
2003-02-22 5:05 ` Larry McVoy
2003-02-22 6:39 ` Martin J. Bligh
@ 2003-02-22 8:38 ` David S. Miller
2003-02-22 14:34 ` Larry McVoy
1 sibling, 1 reply; 124+ messages in thread
From: David S. Miller @ 2003-02-22 8:38 UTC (permalink / raw)
To: Larry McVoy; +Cc: Martin J. Bligh, Hanna Linder, lse-tech, linux-kernel
On Fri, 2003-02-21 at 21:05, Larry McVoy wrote:
> Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> $500M/quarter in profit.
While I understand these numbers are on the mark, there is a tertiary
issue to realize.
Dell makes money on many things other than thin-margin PCs. And lo'
and behold one of those things is selling the larger Intel based
servers and support contracts to go along with that. And so you're
nearly supporting Martin's arguments for supporting large servers
better under Linux by bringing up Dell's balance sheet :-)
* Re: Minutes from Feb 21 LSE Call
2003-02-22 6:39 ` Martin J. Bligh
2003-02-22 8:38 ` Jeff Garzik
@ 2003-02-22 8:38 ` David S. Miller
1 sibling, 0 replies; 124+ messages in thread
From: David S. Miller @ 2003-02-22 8:38 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Larry McVoy, Hanna Linder, lse-tech, linux-kernel
On Fri, 2003-02-21 at 22:39, Martin J. Bligh wrote:
> > Lots of people working for companies who haven't figured out how to do
> > it as well as Dell *say* it can't be done but numbers say differently.
>
> And how much of that was profit on PCs running Linux?
Or PCs period, they make tons of bucks on servers and associated
support contracts.
* Re: Minutes from Feb 21 LSE Call
2003-02-22 8:38 ` David S. Miller
@ 2003-02-22 14:34 ` Larry McVoy
2003-02-22 15:47 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: Larry McVoy @ 2003-02-22 14:34 UTC (permalink / raw)
To: David S. Miller
Cc: Larry McVoy, Martin J. Bligh, Hanna Linder, lse-tech,
linux-kernel
On Sat, Feb 22, 2003 at 12:38:33AM -0800, David S. Miller wrote:
> On Fri, 2003-02-21 at 21:05, Larry McVoy wrote:
> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> > $500M/quarter in profit.
>
> While I understand these numbers are on the mark, there is a tertiary
> issue to realize.
>
> Dell makes money on many things other than thin-margin PCs. And lo'
> and behold one of those things is selling the larger Intel based
> servers and support contracts to go along with that.
I did some digging trying to find that ratio before I posted last night
and couldn't. You obviously think that the servers are a significant
part of their business. I'd be surprised at that, but that's cool,
what are the numbers? PC's, monitors, disks, laptops, anything with less
than 4 cpus is in the little bucket, so how much revenue does Dell generate
on the 4 CPU and larger servers?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
* Re: Minutes from Feb 21 LSE Call
2003-02-22 14:34 ` Larry McVoy
@ 2003-02-22 15:47 ` Martin J. Bligh
2003-02-22 16:13 ` Larry McVoy
0 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-22 15:47 UTC (permalink / raw)
To: Larry McVoy, David S. Miller; +Cc: lse-tech, linux-kernel
>> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
>> > $500M/quarter in profit.
>>
>> While I understand these numbers are on the mark, there is a tertiary
>> issue to realize.
>>
>> Dell makes money on many things other than thin-margin PCs. And lo'
>> and behold one of those things is selling the larger Intel based
>> servers and support contracts to go along with that.
>
> I did some digging trying to find that ratio before I posted last night
> and couldn't. You obviously think that the servers are a significant
> part of their business. I'd be surprised at that, but that's cool,
> what are the numbers? PC's, monitors, disks, laptops, anything with less
> than 4 cpus is in the little bucket, so how much revenue does Dell generate
> on the 4 CPU and larger servers?
It's not a question of revenue, it's one of profit. Very few people buy
desktops for use with Linux, compared to those that buy them for Windows.
The profit on each PC is small, thus I still think a substantial proportion
of the profit made by hardware vendors by Linux is on servers rather than
desktop PCs. The numbers will be smaller for high end machines, but the
profit margins are much higher.
M.
* Re: Minutes from Feb 21 LSE Call
2003-02-22 15:47 ` Martin J. Bligh
@ 2003-02-22 16:13 ` Larry McVoy
2003-02-22 16:29 ` Martin J. Bligh
2003-02-24 18:00 ` Timothy D. Witham
0 siblings, 2 replies; 124+ messages in thread
From: Larry McVoy @ 2003-02-22 16:13 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Larry McVoy, David S. Miller, lse-tech, linux-kernel
On Sat, Feb 22, 2003 at 07:47:53AM -0800, Martin J. Bligh wrote:
> >> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> >> > $500M/quarter in profit.
> >>
> >> While I understand these numbers are on the mark, there is a tertiary
> >> issue to realize.
> >>
> >> Dell makes money on many things other than thin-margin PCs. And lo'
> >> and behold one of those things is selling the larger Intel based
> >> servers and support contracts to go along with that.
> >
> > I did some digging trying to find that ratio before I posted last night
> > and couldn't. You obviously think that the servers are a significant
> > part of their business. I'd be surprised at that, but that's cool,
> > what are the numbers? PC's, monitors, disks, laptops, anything with less
> > than 4 cpus is in the little bucket, so how much revenue does Dell generate
> > on the 4 CPU and larger servers?
>
> It's not a question of revenue, it's one of profit. Very few people buy
> desktops for use with Linux, compared to those that buy them for Windows.
> The profit on each PC is small, thus I still think a substantial proportion
> of the profit made by hardware vendors by Linux is on servers rather than
> desktop PCs. The numbers will be smaller for high end machines, but the
> profit margins are much higher.
That's all handwaving and has no meaning without numbers. I could care less
if Dell has 99.99% margins on their servers, if they only sell $50M of servers
a quarter that is still less than 10% of their quarterly profit.
So what are the actual *numbers*? Your point makes sense if and only if
people sell lots of servers. I spent a few minutes in google: world wide
server sales are $40B at the moment. The overwhelming majority of that
revenue is small servers. Let's say that Dell has 20% of that market,
that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
you long long odds that that is 90% of their revenue in the server space.
Supposing that's right, that's $200M/quarter in big iron sales. Out of
$8000M/quarter.
I'd love to see data which is different than this but you'll have a tough
time finding it. More and more companies are looking at the cost of
big iron and deciding it doesn't make sense to spend $20K/CPU when they
could be spending $1K/CPU. Look at Google, try selling them some big
iron. Look at Wall Street - abandoning big iron as fast as they can.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 16:13 ` Larry McVoy
@ 2003-02-22 16:29 ` Martin J. Bligh
2003-02-22 16:33 ` Larry McVoy
2003-02-24 18:00 ` Timothy D. Witham
1 sibling, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-22 16:29 UTC (permalink / raw)
To: Larry McVoy; +Cc: David S. Miller, lse-tech, linux-kernel
> That's all handwaving and has no meaning without numbers. I could care less
> if Dell has 99.99% margins on their servers, if they only sell $50M of servers
> a quarter that is still less than 10% of their quarterly profit.
>
> So what are the actual *numbers*? Your point makes sense if and only if
> people sell lots of servers. I spent a few minutes in google: world wide
> server sales are $40B at the moment. The overwhelming majority of that
> revenue is small servers. Let's say that Dell has 20% of that market,
> that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
> you long long odds that that is 90% of their revenue in the server space.
> Supposing that's right, that's $200M/quarter in big iron sales. Out of
> $8000M/quarter.
>
> I'd love to see data which is different than this but you'll have a tough
> time finding it. More and more companies are looking at the cost of
> big iron and deciding it doesn't make sense to spend $20K/CPU when they
> could be spending $1K/CPU. Look at Google, try selling them some big
> iron. Look at Wall Street - abandoning big iron as fast as they can.
But we're talking about linux ... and we're talking about profit, not
revenue. I'd guess that 99% of their desktop sales are for Windows.
And I'd guess they make 100 times as much profit on a big server as they
do on a desktop PC.
Would be nice if someone had real numbers, but I doubt they're published
except in non-free corporate research reports.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 16:29 ` Martin J. Bligh
@ 2003-02-22 16:33 ` Larry McVoy
2003-02-22 16:39 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: Larry McVoy @ 2003-02-22 16:33 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Larry McVoy, David S. Miller, lse-tech, linux-kernel
On Sat, Feb 22, 2003 at 08:29:34AM -0800, Martin J. Bligh wrote:
> > people sell lots of servers. I spent a few minutes in google: world wide
> > server sales are $40B at the moment. The overwhelming majority of that
> > revenue is small servers. Let's say that Dell has 20% of that market,
> > that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
> > you long long odds that that is 90% of their revenue in the server space.
> > Supposing that's right, that's $200M/quarter in big iron sales. Out of
> > $8000M/quarter.
> >
> > I'd love to see data which is different than this but you'll have a tough
> > time finding it. More and more companies are looking at the cost of
> > big iron and deciding it doesn't make sense to spend $20K/CPU when they
> > could be spending $1K/CPU. Look at Google, try selling them some big
> > iron. Look at Wall Street - abandoning big iron as fast as they can.
>
> But we're talking about linux ... and we're talking about profit, not
> revenue. I'd guess that 99% of their desktop sales are for Windows.
> And I'd guess they make 100 times as much profit on a big server as they
> do on a desktop PC.
You are thinking in today's terms. Find the asymptote and project out.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 16:33 ` Larry McVoy
@ 2003-02-22 16:39 ` Martin J. Bligh
2003-02-22 16:59 ` John Bradford
0 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-22 16:39 UTC (permalink / raw)
To: Larry McVoy; +Cc: David S. Miller, lse-tech, linux-kernel
>> But we're talking about linux ... and we're talking about profit, not
>> revenue. I'd guess that 99% of their desktop sales are for Windows.
>> And I'd guess they make 100 times as much profit on a big server as they
>> do on a desktop PC.
>
> You are thinking in today's terms. Find the asymptote and project out.
OK, I predict that Linux will take over the whole of the high end server
market ... if people stop complaining about us fixing scalability. That
should give some nicer numbers ....
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 16:39 ` Martin J. Bligh
@ 2003-02-22 16:59 ` John Bradford
0 siblings, 0 replies; 124+ messages in thread
From: John Bradford @ 2003-02-22 16:59 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: lm, davem, lse-tech, linux-kernel
> OK, I predict that Linux will take over the whole of the high end server
> market ... if people stop complaining about us fixing scalability. That
> should give some nicer numbers ....
Extending the useful life of current hardware will shift profit even
further towards support contracts, and away from hardware sales.
Imagine the performance gain a webserver serving mostly static
content, with light database and scripting usage, is going to see
moving from a 2.4 to a 2.6 kernel. Zero copy and filesystem
improvements alone will extend its useful life dramatically, in my
opinion.
John.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 0:16 ` Larry McVoy
` (2 preceding siblings ...)
2003-02-22 8:32 ` David S. Miller
@ 2003-02-22 18:20 ` Alan Cox
2003-02-22 20:05 ` William Lee Irwin III
2003-02-22 21:36 ` Gerrit Huizenga
2003-02-23 0:37 ` Eric W. Biederman
4 siblings, 2 replies; 124+ messages in thread
From: Alan Cox @ 2003-02-22 18:20 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, Linux Kernel Mailing List
On Sat, 2003-02-22 at 00:16, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.
I think people overestimate the number of large boxes badly. Several IDE
pre-patches didn't work on highmem boxes. It took *ages* for people to
actually notice there was a problem. The desktop world is still 128-256Mb
and some of the crap people push is problematic even there. In the embedded
space where there is a *ton* of money to be made by smart people a lot
of the 2.5 choices look very questionable indeed - but not all by any
means, we are for example close to being able to dump the block layer,
shrink stacks down by using IRQ stacks and other good stuff.
I'm hoping the Montavista and IBM people will swat each others bogons 8)
Alan
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 18:20 ` Alan Cox
@ 2003-02-22 20:05 ` William Lee Irwin III
2003-02-22 21:35 ` Alan Cox
2003-02-22 21:36 ` Gerrit Huizenga
1 sibling, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-22 20:05 UTC (permalink / raw)
To: Alan Cox; +Cc: Larry McVoy, Hanna Linder, lse-tech, Linux Kernel Mailing List
On Sat, 2003-02-22 at 00:16, Larry McVoy wrote:
>> And with the embedded market being one of the few real money makers
>> for Linux, there will be huge pushback from those companies against
>> changes which increase memory footprint.
On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> I think people overestimate the number of large boxes badly. Several IDE
> pre-patches didn't work on highmem boxes. It took *ages* for people to
> actually notice there was a problem. The desktop world is still 128-256Mb
> and some of the crap people push is problematic even there. In the embedded
> space where there is a *ton* of money to be made by smart people a lot
> of the 2.5 choices look very questionable indeed - but not all by any
> means, we are for example close to being able to dump the block layer,
> shrink stacks down by using IRQ stacks and other good stuff.
Well, I've never seen IDE in a highmem box, and there's probably a good
reason for it. The space trimmings sound pretty interesting. IRQ stacks
in general sound good just to mitigate stackblowings due to IRQ pounding.
On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> I'm hoping the Montavista and IBM people will swat each others bogons 8)
Sounds like a bigger win for the bigboxen, since space matters there,
but large-scale SMP efficiency probably doesn't make a difference to
embedded (though I think some 2x embedded systems are floating around).
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 20:05 ` William Lee Irwin III
@ 2003-02-22 21:35 ` Alan Cox
0 siblings, 0 replies; 124+ messages in thread
From: Alan Cox @ 2003-02-22 21:35 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Larry McVoy, Hanna Linder, lse-tech, Linux Kernel Mailing List
On Sat, 2003-02-22 at 20:05, William Lee Irwin III wrote:
> On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote:
> > I'm hoping the Montavista and IBM people will swat each others bogons 8)
>
> Sounds like a bigger win for the bigboxen, since space matters there,
> but large-scale SMP efficiency probably doesn't make a difference to
> embedded (though I think some 2x embedded systems are floating around).
Smaller cleaner code is a win for everyone, and it often pays off in ways
that are not immediately obvious. For example having your entire kernel
working set and running app fitting in the L2 cache happens to be very
good news to most people.
Alan
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 18:20 ` Alan Cox
2003-02-22 20:05 ` William Lee Irwin III
@ 2003-02-22 21:36 ` Gerrit Huizenga
2003-02-22 21:42 ` Christoph Hellwig
2003-02-23 23:23 ` Bill Davidsen
1 sibling, 2 replies; 124+ messages in thread
From: Gerrit Huizenga @ 2003-02-22 21:36 UTC (permalink / raw)
To: Alan Cox; +Cc: Larry McVoy, Hanna Linder, lse-tech, Linux Kernel Mailing List
On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote:
> I think people overestimate the number of large boxes badly. Several IDE
> pre-patches didn't work on highmem boxes. It took *ages* for people to
> actually notice there was a problem. The desktop world is still 128-256Mb
IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
is a fun toy, but bigger than *I* need, even for development purposes.
But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
IDE products for my 8-proc 16 GB machine... And running pre-patches in
a production environment that might expose this would be a little
silly as well.
Probably a bad example to extrapolate large system numbers from.
gerrit
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 21:36 ` Gerrit Huizenga
@ 2003-02-22 21:42 ` Christoph Hellwig
2003-02-23 23:23 ` Bill Davidsen
1 sibling, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2003-02-22 21:42 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Alan Cox, Larry McVoy, Hanna Linder, lse-tech,
Linux Kernel Mailing List
On Sat, Feb 22, 2003 at 01:36:31PM -0800, Gerrit Huizenga wrote:
> IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
> is a fun toy, but bigger than *I* need, even for development purposes.
> But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
> IDE products for my 8-proc 16 GB machine... And running pre-patches in
> a production environment that might expose this would be a little
> silly as well.
>
> Probably a bad example to extrapolate large system numbers from.
At least the SGI Altix does have an IDE/ATAPI CDROM drive :)
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 8:38 ` Jeff Garzik
@ 2003-02-22 22:18 ` William Lee Irwin III
2003-02-23 0:50 ` Martin J. Bligh
2003-02-23 1:17 ` Benjamin LaHaise
0 siblings, 2 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-22 22:18 UTC (permalink / raw)
To: Jeff Garzik; +Cc: linux-kernel
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> ia32 big iron. sigh. I think that's so unfortunate in a number
> of ways, but the main reason, of course, is that highmem is evil :)
> Intel can use PAE to "turn back the clock" on ia32. Although googling
> doesn't support this speculation, I am willing to bet Intel will
> eventually unveil a new PAE that busts the 64GB barrier -- instead of
> trying harder to push consumers to 64-bit processors. Processor speed,
> FSB speed, PCI bus bandwidth, all these are issues -- but ones that
> pale in comparison to the long term effects of highmem on the market.
PAE is a relatively minor insult compared to the FPU, the 50,000 psi
register pressure, variable-length instruction encoding with extremely
difficult to optimize for instruction decoder trickiness, the nauseating
bastardization of segmentation, the microscopic caches and TLB's, the
lack of TLB context tags, frankly bizarre and just-barely-fixable gate
nonsense, the interrupt controller, and ISA DMA.
I've got no idea why this particular system-level ugliness which is
nothing more than a routine pitstop in any bring your own barfbag
reading session of x86 manuals fascinates you so much.
At any rate, if systems (or any other) programming difficulties were
any concern at all, x86 wouldn't be used at all.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Enterprise customers will see this as a signal to continue building
> around ia32 for the next few years, thoroughly damaging 64-bit
> technology sales and development. I bet even IA64 suffers...
> at Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
> floating around The Register and various rumor web sites, but Intel
> is gonna miss that huge profit opportunity too by trying to hack the
> ia32 ISA to scale up to big iron -- where it doesn't belong.
What power do you suppose we have to resist any of this? Intel, the
800lb gorilla, shoves what it wants where it wants to shove it, and
all the "exit only" signs in the world attached to our backsides do
absolutely nothing to deter it whatsoever.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Being cynical, one might guess that Intel will treat IA64 as a loss
> leader until the other 64-bit competition dies, keeping ia32 at the
> top end of the market via silly PAE/PSE hacks. When the existing
> 64-bit competition disappears, five years down the road, compilers
> will have matured sufficiently to make using IA64 boxes feasible.
Sounds relatively natural. I don't have a good notion of the legality
boundaries wrt. to antitrust, but I'd assume they would otherwise do
whatever it takes to either defeat or wipe out competitors.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> If you really want to scale, just go to 64-bits, darn it. Don't keep
> hacking ia32 ISA -- leave it alone, it's fine as it is, and will live
> a nice long life as the future's preferred embedded platform.
Take this up with Intel. The rest of us are at their mercy.
Good luck finding anyone there to listen to it, you'll need it.
On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> 64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
> old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
> but who knows if AMD will survive competition with Intel. PPC64 is
> the wild card in all this. I hope it succeeds.
Alpha is old, dead, and kicking most other cpus' asses from the grave.
I always did like DEC hardware. =(
I'm not sure what's so nice about x86-64; another opcode prefix
controlled extension atop the festering pile of existing x86 crud
sounds every bit as bad as any other attempt to prolong x86. Some of
the system device-level cleanups like the HPET look nice, though.
This success/failure stuff sounds a lot like economics, which is
pretty much even further out of our control than the weather or the
government. What prompted this bit?
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 0:16 ` Larry McVoy
` (3 preceding siblings ...)
2003-02-22 18:20 ` Alan Cox
@ 2003-02-23 0:37 ` Eric W. Biederman
4 siblings, 0 replies; 124+ messages in thread
From: Eric W. Biederman @ 2003-02-23 0:37 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel
Larry McVoy <lm@bitmover.com> writes:
> > Ben said none of the distros are supporting these large
> > systems right now. Martin said UL is already starting to support
> > them.
>
> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.
Larry, it isn't that Linux isn't being scaled in the way you suggest.
But for the people who really care about scalability, having a single
system image is not the most important thing, so making it look like
one system is secondary.
Linux clusters are currently among the top 5 supercomputers of the
world. And there the question is how do you make 1200 machines look
like one. And how do you handle the reliability issues. When MTBF
becomes a predictor for how many times a week someone needs to replace
hardware the problem is very different from a simple SMP.
And there seems to be a fairly substantial market for huge machines,
for people who need high performance. All kinds of problems
require enormous amounts of data crunching.
So far the low hanging fruit on large clusters is still with making
the hardware and the systems actually work. But increasingly having
a single high performance distributed filesystem is becoming
important.
But look at projects like bproc, mosix, and lustre. Not the best
things in the world but the work is getting done. Scalability is
easy. The hard part is making it look like one machine when you are
done.
Eric
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-21 23:48 Minutes from Feb 21 LSE Call Hanna Linder
2003-02-22 0:16 ` Larry McVoy
@ 2003-02-23 0:42 ` Eric W. Biederman
2003-02-23 14:29 ` Rik van Riel
2003-02-23 3:24 ` Andrew Morton
2 siblings, 1 reply; 124+ messages in thread
From: Eric W. Biederman @ 2003-02-23 0:42 UTC (permalink / raw)
To: Hanna Linder; +Cc: lse-tech, linux-kernel
Hanna Linder <hannal@us.ibm.com> writes:
> LSE Con Call Minutes from Feb21
>
> Minutes compiled by Hanna Linder hannal@us.ibm.com, please post
> corrections to lse-tech@lists.sf.net.
>
> Object Based Reverse Mapping:
> (Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)
>
> Ben said none of the users have been complaining about
> performance with the existing rmap. Martin disagreed and said Linus,
> Andrew Morton and himself have all agreed there is a problem.
> One of the problems Martin is already hitting on high cpu machines with
> large memory is the space consumption by all the pte-chains filling up
> memory and killing the machine. There is also a performance impact of
> maintaining the chains.
Note: rmap chains can be restricted to an arbitrary length, or an
arbitrary total count trivially. All you have to do is allow a fixed
limit on the number of people who can map a page simultaneously.
The selection of which chain to unmap can be a bit tricky but is
relatively straightforward. Why doesn't someone who is seeing
this just hack this up?
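Eric's suggestion can be simulated in userspace. The sketch below caps a page's reverse-mapping chain at a fixed length and forces the caller to unmap an existing mapping when the cap is hit; the names `pte_chain_add()` and `pte_chain_evict()` are hypothetical, not the actual 2.5 pte_chain code.

```c
/* Userspace sketch of a capped rmap chain (illustrative names only). */
#include <stddef.h>

#define RMAP_CHAIN_MAX 4              /* arbitrary per-page mapping limit */

struct page_rmap {
    int nmapped;                      /* current chain length */
    void *chain[RMAP_CHAIN_MAX];      /* stand-ins for pte pointers */
};

/* Returns 0 on success, -1 when the chain is full and the caller must
 * first unmap an existing mapping.  A fixed cap bounds both per-page
 * chain length and total chain-entry memory consumption. */
int pte_chain_add(struct page_rmap *p, void *pte)
{
    if (p->nmapped == RMAP_CHAIN_MAX)
        return -1;                    /* caller must pick a victim */
    p->chain[p->nmapped++] = pte;
    return 0;
}

/* Trivial victim policy for illustration: unmap the oldest entry. */
void *pte_chain_evict(struct page_rmap *p)
{
    void *victim = p->chain[0];
    for (int i = 1; i < p->nmapped; i++)
        p->chain[i - 1] = p->chain[i];
    p->nmapped--;
    return victim;
}
```

The tricky part Eric flags, selecting which mapping to unmap, is reduced here to FIFO purely for illustration; a real patch would want a policy that picks the mapping cheapest to re-fault.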
Eric
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 22:18 ` William Lee Irwin III
@ 2003-02-23 0:50 ` Martin J. Bligh
2003-02-23 11:22 ` Magnus Danielson
2003-02-23 19:54 ` Eric W. Biederman
2003-02-23 1:17 ` Benjamin LaHaise
1 sibling, 2 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 0:50 UTC (permalink / raw)
To: William Lee Irwin III, Jeff Garzik; +Cc: linux-kernel
> On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
>> ia32 big iron. sigh. I think that's so unfortunate in a number
>> of ways, but the main reason, of course, is that highmem is evil :)
One phrase ... "price:performance ratio". That's all it's about.
The only thing that will kill 32-bit big iron is the availability of
cheap 64 bit chips. It's a free-market economy.
It's ugly to program, but it's cheap, and it works.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 22:18 ` William Lee Irwin III
2003-02-23 0:50 ` Martin J. Bligh
@ 2003-02-23 1:17 ` Benjamin LaHaise
2003-02-23 5:21 ` Gerrit Huizenga
2003-02-23 9:37 ` William Lee Irwin III
1 sibling, 2 replies; 124+ messages in thread
From: Benjamin LaHaise @ 2003-02-23 1:17 UTC (permalink / raw)
To: William Lee Irwin III, Jeff Garzik, linux-kernel
On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> I'm not sure what's so nice about x86-64; another opcode prefix
> controlled extension atop the festering pile of existing x86 crud
What's nice about x86-64 is that it runs existing 32 bit apps fast and
doesn't suffer from the blisteringly small caches that were part of your
rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
Not to mention that the amount of reengineering in compilers like
gcc required to get decent performance out of it is actually sane.
> sounds every bit as bad as any other attempt to prolong x86. Some of
> the system device-level cleanups like the HPET look nice, though.
HPET is part of one of the PCYY specs and is even available on 32-bit x86;
there are just not that many bug-free implementations yet. Since x86-64 made
it part of the base platform and is testing it from launch, the implementations
actually have a chance of being debugged in the mass-market versions.
-ben
--
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-21 23:48 Minutes from Feb 21 LSE Call Hanna Linder
2003-02-22 0:16 ` Larry McVoy
2003-02-23 0:42 ` Eric W. Biederman
@ 2003-02-23 3:24 ` Andrew Morton
2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh
2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli
2 siblings, 2 replies; 124+ messages in thread
From: Andrew Morton @ 2003-02-23 3:24 UTC (permalink / raw)
To: Hanna Linder; +Cc: lse-tech, linux-kernel
Hanna Linder <hannal@us.ibm.com> wrote:
>
>
> Dave coded up an initial patch for partial object based rmap
> which he sent to linux-mm yesterday.
I've run some numbers on this. Looks like it reclaims most of the
fork/exec/exit rmap overhead.
The testcase is applying and removing 64 kernel patches using my patch
management scripts. I use this because
a) It's a real workload, which someone cares about and
b) It's about as forky as anything is ever likely to be, without being a
stupid microbenchmark.
Testing is on the fast P4-HT, everything in pagecache.
2.4.21-pre4: 8.10 seconds
2.5.62-mm3 with objrmap: 9.95 seconds (+1.85)
2.5.62-mm3 without objrmap: 10.86 seconds (+2.76)
Current 2.5 is 2.76 seconds slower, and this patch reclaims 0.91 of those
seconds.
So who stole the remaining 1.85 seconds? Looks like pte_highmem.
Here is 2.5.62-mm3, with objrmap:
c013042c find_get_page 601 10.7321
c01333dc free_hot_cold_page 641 2.7629
c0207130 __copy_to_user_ll 687 6.6058
c011450c flush_tlb_page 725 6.4732
c0139ba0 clear_page_tables 841 2.4735
c011718c pte_alloc_one 910 6.5000
c013b56c do_anonymous_page 954 1.7667
c013b788 do_no_page 1044 1.6519
c015b59c d_lookup 1096 3.2619
c013ba00 handle_mm_fault 1098 4.6525
c0108d14 system_call 1116 25.3636
c0137240 release_pages 1828 6.4366
c013a1f4 zap_pte_range 2616 4.8806
c013f5c0 page_add_rmap 2776 8.3614
c0139eac copy_page_range 2994 3.5643
c013f70c page_remove_rmap 3132 6.2640
c013adb4 do_wp_page 6712 8.4322
c01172e0 do_page_fault 8788 7.7496
c0106ed8 poll_idle 99878 1189.0238
00000000 total 158601 0.0869
Note one second spent in pte_alloc_one().
Here is 2.4.21-pre4, with the following functions uninlined
pte_t *pte_alloc_one(struct mm_struct *mm, unsigned long address);
pte_t *pte_alloc_one_fast(struct mm_struct *mm, unsigned long address);
void pte_free_fast(pte_t *pte);
void pte_free_slow(pte_t *pte);
c0252950 atomic_dec_and_lock 36 0.4800
c0111778 flush_tlb_mm 37 0.3304
c0129c3c file_read_actor 37 0.2569
c025282c strnlen_user 43 0.5119
c012b35c generic_file_write 46 0.0283
c0114c78 schedule 48 0.0361
c0129050 unlock_page 53 0.4907
c0140974 link_path_walk 57 0.0237
c0116740 copy_mm 62 0.0852
c0130740 __free_pages_ok 62 0.0963
c0126afc handle_mm_fault 63 0.3424
c01254c0 __free_pte 67 0.8816
c0129198 __find_get_page 67 0.9853
c01309c4 rmqueue 70 0.1207
c011ae0c exit_notify 77 0.1075
c0149b34 d_lookup 81 0.2774
c0126874 do_anonymous_page 83 0.3517
c0126960 do_no_page 86 0.2087
c01117e8 flush_tlb_page 105 0.8750
c0106f54 system_call 138 2.4643
c01255c8 copy_page_range 197 0.4603
c0130ffc __free_pages 204 5.6667
c0125774 zap_page_range 262 0.3104
c0126330 do_wp_page 775 1.4904
c0113c18 do_page_fault 864 0.7030
c01052f8 poll_idle 6803 170.0750
00000000 total 11923 0.0087
Note the lack of pte_alloc_one_slow().
So we need the page table cache back.
We cannot put it in slab, because slab does not do highmem.
I believe the best way to solve this is to implement a per-cpu LIFO head
array of known-to-be-zeroed pages in the page allocator. Populate it with
free_zeroed_page(), grab pages from it with __GFP_ZEROED.
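As a rough userspace sketch of that idea, the head array behaves as below. The names `free_zeroed_page()` and the `__GFP_ZEROED`-style fast path come straight from Andrew's description; everything else (sizes, fallbacks) is illustrative, not the actual -mm patch.

```c
/* Userspace sketch of a per-cpu LIFO of known-zeroed pages. */
#include <stdlib.h>

#define PAGE_SIZE 4096
#define ZEROED_BATCH 16                    /* depth of the head array */

static void *zeroed_lifo[ZEROED_BATCH];
static int zeroed_count;

/* Caller guarantees *page is all-zero (e.g. a fully emptied page table).
 * Pushing a not-fully-zeroed page here silently poisons the cache. */
void free_zeroed_page(void *page)
{
    if (zeroed_count < ZEROED_BATCH)
        zeroed_lifo[zeroed_count++] = page;   /* keep it cache-hot */
    else
        free(page);                           /* array full: normal free */
}

/* Fast path pops a known-zero page, skipping the clear; the slow path
 * must allocate and zero-fill. */
void *alloc_zeroed_page(void)
{
    if (zeroed_count > 0)
        return zeroed_lifo[--zeroed_count];   /* LIFO: most recently freed */
    return calloc(1, PAGE_SIZE);              /* slow path pays for the zeroing */
}
```

With something like this, pte_alloc_one() would allocate with the zeroed flag and drop its explicit clear, which is where its second of profile time goes; the invariant in the comment above is exactly what a freeing path violates if it returns a page that is not fully zeroed.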
This is a simple extension to the existing hot and cold head arrays, and I
have patches, and they don't work. Something in the pagetable freeing path
seems to be putting back pages which are not fully zeroed, and I didn't get
onto debugging it.
It would be nice to get it going, because a number of architectures can
perhaps nuke their private pagetable caches.
I shall drop the patches in next-mm/experimental and look hopefully
at Dave ;)
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 1:17 ` Benjamin LaHaise
@ 2003-02-23 5:21 ` Gerrit Huizenga
2003-02-23 8:07 ` David Lang
2003-02-23 9:37 ` William Lee Irwin III
1 sibling, 1 reply; 124+ messages in thread
From: Gerrit Huizenga @ 2003-02-23 5:21 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: William Lee Irwin III, Jeff Garzik, linux-kernel
On Sat, 22 Feb 2003 20:17:24 EST, Benjamin LaHaise wrote:
> On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> > I'm not sure what's so nice about x86-64; another opcode prefix
> > controlled extension atop the festering pile of existing x86 crud
>
> What's nice about x86-64 is that it runs existing 32 bit apps fast and
> doesn't suffer from the blisteringly small caches that were part of your
> rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> Not to mention that the amount of reengineering in compilers like
> gcc required to get decent performance out of it is actually sane.
Four or five years ago the claim was that IA64 would solve all the large
memory problems. Commercial viability and substantial market presence
are still lacking. x86-64 has the same uphill battle. It has a better
architecture for highmem and potentially better architecture for large
systems in general (compared to IA32, not substantially better than, say,
IA64 or PPC64). It also has at least one manufacturer looking at high
end systems. But until those systems have some recognized market share,
the boys with the big pockets aren't likely to make them ubiquitous.
The whole thing about expenses to design and develop combined with the
ROI model have more influence on their deployment than the fact that it
is technically a useful architecture.
gerrit
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 5:21 ` Gerrit Huizenga
@ 2003-02-23 8:07 ` David Lang
2003-02-23 8:20 ` William Lee Irwin III
` (2 more replies)
0 siblings, 3 replies; 124+ messages in thread
From: David Lang @ 2003-02-23 8:07 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Benjamin LaHaise, William Lee Irwin III, Jeff Garzik,
linux-kernel
On Sat, 22 Feb 2003, Gerrit Huizenga wrote:
> On Sat, 22 Feb 2003 20:17:24 EST, Benjamin LaHaise wrote:
> > On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> > > I'm not sure what's so nice about x86-64; another opcode prefix
> > > controlled extension atop the festering pile of existing x86 crud
> >
> > What's nice about x86-64 is that it runs existing 32 bit apps fast and
> > doesn't suffer from the blisteringly small caches that were part of your
> > rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> > Not to mention that the amount of reengineering in compilers like
> > gcc required to get decent performance out of it is actually sane.
>
> Four or five years ago the claim was that IA64 would solve all the large
> memory problems. Commercial viability and substantial market presence
> are still lacking. x86-64 has the same uphill battle. It has a better
> architecture for highmem and potentially better architecture for large
> systems in general (compared to IA32, not substantially better than, say,
> IA64 or PPC64). It also has at least one manufacturer looking at high
> end systems. But until those systems have some recognized market share,
> the boys with the big pockets aren't likely to make them ubiquitous.
> The whole thing about expenses to design and develop combined with the
> ROI model have more influence on their deployment than the fact that it
> is technically a useful architecture.
Gerrit, you missed the prior poster's point. IA64 had the same fundamental
problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
binaries.
The 8086/8088 CPU was nothing special when it was picked to be used on the
IBM PC, but once it was picked it hit a critical mass that has meant that
compatibility with it is critical for a new CPU. The 286 and 386 CPUs were
arguably inferior to other options available at the time, but they had one
feature that absolutely trumped everything else: they could run existing
programs with no modifications, faster than anything else available. With
IA64 Intel forgot this (or decided their name value was so high that
they were immune to the issue). x86-64 takes the same approach that the 286
and 386 did and will be used by people who couldn't care less about 64-bit
stuff simply because it looks to be the fastest x86 CPU available (and if
the SMP features work as advertised it will again give a big boost to the
price/performance of SMP machines due to much cheaper MLB designs). If it
was being marketed by Intel it would be a shoo-in, but AMD does have a bit
of an uphill struggle.
David Lang
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 8:07 ` David Lang
@ 2003-02-23 8:20 ` William Lee Irwin III
2003-02-23 19:17 ` Linus Torvalds
2003-02-23 19:13 ` David Mosberger
2003-02-23 20:48 ` Gerrit Huizenga
2 siblings, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-23 8:20 UTC (permalink / raw)
To: David Lang; +Cc: Gerrit Huizenga, Benjamin LaHaise, Jeff Garzik, linux-kernel
On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
> binaries.
If I didn't know this mattered I wouldn't bother with the barfbags.
I just wouldn't deal with it.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 1:17 ` Benjamin LaHaise
2003-02-23 5:21 ` Gerrit Huizenga
@ 2003-02-23 9:37 ` William Lee Irwin III
1 sibling, 0 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-23 9:37 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Jeff Garzik, linux-kernel
On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
>> I'm not sure what's so nice about x86-64; another opcode prefix
>> controlled extension atop the festering pile of existing x86 crud
On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> What's nice about x86-64 is that it runs existing 32 bit apps fast and
> doesn't suffer from the blisteringly small caches that were part of your
> rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> Not to mention that the amount of reengineering in compilers like
> gcc required to get decent performance out of it is actually sane.
Rant? It was just a catalogue of other things that are nasty. The
point was that PAE's not special, it's one of a very long list of
very ugly uglinesses, and my list wasn't anywhere near exhaustive.
But yes, more cache is good. Unfortunately the amount of baggage from
32-bit x86 stuff still puts a good chunk of systems programming into
the old bring your own barfbag territory.
On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
>> sounds every bit as bad as any other attempt to prolong x86. Some of
>> the system device-level cleanups like the HPET look nice, though.
On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> HPET is part of one of the PCYY specs and even available on 32 bit x86;
> there are just not that many bug-free implementations yet. Since x86-64 made
> it part of the base platform and is testing it from launch, they actually
> have a chance of being debugged in the mass market versions.
Well, it beats the heck out of the TSC and the PIT, and x86-64 is
apparently supposed to have it "for real".
I'm not excited at all about another opcode prefix and pagetable format.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 0:50 ` Martin J. Bligh
@ 2003-02-23 11:22 ` Magnus Danielson
2003-02-23 19:54 ` Eric W. Biederman
1 sibling, 0 replies; 124+ messages in thread
From: Magnus Danielson @ 2003-02-23 11:22 UTC (permalink / raw)
To: mbligh; +Cc: wli, jgarzik, linux-kernel
From: "Martin J. Bligh" <mbligh@aracnet.com>
Subject: Re: Minutes from Feb 21 LSE Call
Date: Sat, 22 Feb 2003 16:50:36 -0800
> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunate in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.
Not all heavy-duty problems need 64 bit; some fit nicely into 32 bit.
There are, however, different 32-bit architectures which they fit into
more or less nicely. SIMD may or may not give the same boost as 64 bit
in itself. This is just like clustering vs. SMP: it depends on the application.
Cheers,
Magnus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 0:42 ` Eric W. Biederman
@ 2003-02-23 14:29 ` Rik van Riel
2003-02-23 17:28 ` Eric W. Biederman
0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2003-02-23 14:29 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Hanna Linder, lse-tech, linux-kernel
On Sat, 22 Feb 2003, Eric W. Biederman wrote:
> Note: rmap chains can be restricted to an arbitrary length, or an
> arbitrary total count trivially. All you have to do is allow a fixed
> limit on the number of people who can map a page simultaneously.
>
> The selection of which chain to unmap can be a bit tricky but is
> relatively straight forward. Why doesn't someone who is seeing
> this just hack this up?
I'm not sure how useful this feature would be. Also,
there are a bunch of corner cases in which you cannot
limit the number of processes mapping a page, think
about eg. mlock, nonlinear vmas and anonymous memory.
All in all I suspect that the cost of such a feature
might be higher than any benefits.
cheers,
Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/
^ permalink raw reply [flat|nested] 124+ messages in thread
* object-based rmap and pte-highmem
2003-02-23 3:24 ` Andrew Morton
@ 2003-02-23 16:14 ` Martin J. Bligh
2003-02-23 19:20 ` Linus Torvalds
2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli
1 sibling, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 16:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: lse-tech, linux-kernel, haveblue, dmccr
> So who stole the remaining 1.85 seconds? Looks like pte_highmem.
I have a plan for that (UKVA) ... we reserve a per-process area with
kernel type protections (either at the top of user space, changing
permissions appropriately, or inside kernel space, changing per-process
vs global appropriately).
This area is permanently mapped into each process, so that there's no
kmap_atomic / tlb_flush_one overhead ... it's highmem backed still.
In order to do fork efficiently, we may need space for 2 sets of
pagetables (12Mb on PAE).
Dave McCracken had an earlier implementation of that, but we never saw
an improvement (quite possibly because the fork double-space wasn't
there) - Dave Hansen is now trying to get something working with current
kernels ... will let you know.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 14:29 ` Rik van Riel
@ 2003-02-23 17:28 ` Eric W. Biederman
2003-02-24 1:42 ` Benjamin LaHaise
0 siblings, 1 reply; 124+ messages in thread
From: Eric W. Biederman @ 2003-02-23 17:28 UTC (permalink / raw)
To: Rik van Riel; +Cc: Hanna Linder, lse-tech, linux-kernel
Rik van Riel <riel@imladris.surriel.com> writes:
> On Sat, 22 Feb 2003, Eric W. Biederman wrote:
>
> > Note: rmap chains can be restricted to an arbitrary length, or an
> > arbitrary total count trivially. All you have to do is allow a fixed
> > limit on the number of people who can map a page simultaneously.
> >
> > The selection of which chain to unmap can be a bit tricky but is
> > relatively straight forward. Why doesn't someone who is seeing
> > this just hack this up?
>
> I'm not sure how useful this feature would be.
The problem. There is no upper bound to how many rmap
entries there can be at one time. And the unbounded
growth can overwhelm a machine.
The goal is to provide an overall system cap on the number
of rmap entries.
> Also,
> there are a bunch of corner cases in which you cannot
> limit the number of processes mapping a page, think
> about eg. mlock, nonlinear vmas and anonymous memory.
Unless something has changed, for nonlinear vmas and anonymous
memory we have been storing enough information in the page tables
to recover the page for ages.
For mlock we want a cap on the number of pages that are locked,
so it should not be a problem. But even then we don't have to
guarantee the page is constantly in the process's page table, simply
that the mlocked page is never swapped out.
> All in all I suspect that the cost of such a feature
> might be higher than any benefits.
Cost? What Cost?
The simple implementation is to walk the page lists and unmap
the pages that are least likely to be used next.
This is not something new. We have been doing this in 2.4.x and
before for years; the unmap path just never freed up rmap entries, in
addition to preparing a page to be paged out.
Eric
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 8:07 ` David Lang
2003-02-23 8:20 ` William Lee Irwin III
@ 2003-02-23 19:13 ` David Mosberger
2003-02-23 23:28 ` Benjamin LaHaise
2003-02-26 8:46 ` Eric W. Biederman
2003-02-23 20:48 ` Gerrit Huizenga
2 siblings, 2 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 19:13 UTC (permalink / raw)
To: David Lang
Cc: Gerrit Huizenga, Benjamin LaHaise, William Lee Irwin III,
Jeff Garzik, linux-kernel
>>>>> On Sun, 23 Feb 2003 00:07:50 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
David.L> Gerrit, you missed the prior poster's point. IA64 has the
David.L> same fundamental problem as the Alpha, PPC, and Sparc
David.L> processors: it doesn't run x86 binaries.
This simply isn't true. Itanium and Itanium 2 have full x86 hardware
built into the chip (for better or worse ;-). The speed isn't as good
as the fastest x86 chips today, but it's faster (~300MHz P6) than the
PCs many of us are using and it certainly meets my needs better than
any other x86 "emulation" I have used in the past (which includes
FX!32 and its relatives for Alpha).
--david
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 8:20 ` William Lee Irwin III
@ 2003-02-23 19:17 ` Linus Torvalds
2003-02-23 19:29 ` David Mosberger
` (3 more replies)
0 siblings, 4 replies; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 19:17 UTC (permalink / raw)
To: linux-kernel
In article <20030223082036.GI10411@holomorphy.com>,
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
>> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
>> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
>> binaries.
>
>If I didn't know this mattered I wouldn't bother with the barfbags.
>I just wouldn't deal with it.
Why?
The x86 is a hell of a lot nicer than the ppc32, for example. On the
x86, you get good performance and you can ignore the design mistakes (ie
segmentation) by just basically turning them off.
On the ppc32, the MMU braindamage is not something you can ignore, you
have to write your OS for it and if you turn it off (ie enable soft-fill
on the ones that support it) you now have to have separate paths in the
OS for it.
And the baroque instruction encoding on the x86 is actually a _good_
thing: it's a rather dense encoding, which means that you win on icache.
It's a bit hard to decode, but who cares? Existing chips do well at
decoding, and thanks to the icache win they tend to perform better - and
they load faster too (which is important - you can make your CPU have
big caches, but _nothing_ saves you from the cold-cache costs).
The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of a
lot better job than the competition when it comes to memory loads and
stores - which helps in general. While the RISC people were off trying
to optimize their compilers to generate loops that used all 32 registers
efficiently, the x86 implementors instead made the chip run fast on
varied loads and used tons of register renaming hardware (and looking at
_memory_ renaming too).
IA64 made all the mistakes anybody else did, and threw out all the good
parts of the x86 because people thought those parts were ugly. They
aren't ugly, they're the "charming oddity" that makes it do well. Look
at them the right way and you realize that a lot of the grottyness is
exactly _why_ the x86 works so well (yeah, and the fact that they are
everywhere ;).
The only real major failure of the x86 is the PAE crud. Let's hope
we'll get to forget it, the same way the DOS people eventually forgot
about their memory extenders.
(Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
will matter, and people can overlook the grottiness there. Right now
Intel doesn't even seem to be interested in "64-bit for the masses", and
maybe IBM will be. AMD certainly seems to be serious about the "masses"
part, which in the end is the only part that really matters).
Linus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: object-based rmap and pte-highmem
2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh
@ 2003-02-23 19:20 ` Linus Torvalds
2003-02-23 20:16 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 19:20 UTC (permalink / raw)
To: linux-kernel
In article <11090000.1046016895@[10.10.2.4]>,
Martin J. Bligh <mbligh@aracnet.com> wrote:
>> So who stole the remaining 1.85 seconds? Looks like pte_highmem.
>
>I have a plan for that (UKVA) ... we reserve a per-process area with
>kernel type protections (either at the top of user space, changing
>permissions appropriately, or inside kernel space, changing per-process
>vs global appropriately).
Nobody ever seems to have solved the threading impact of UKVA's. I told
Andrea about it almost a year ago, and his reaction was "oh, duh!" and
couldn't come up with a solution either.
The thing is, you _cannot_ have a per-thread area, since all threads
share the same TLB. And if it isn't per-thread, you still need all the
locking and all the scalability stuff that the _current_ pte_highmem
code needs, since there are people with thousands of threads in the same
process.
Until somebody _addresses_ this issue with UKVA, I consider UKVA to be a
pipe-dream of people who haven't thought it through.
Linus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:17 ` Linus Torvalds
@ 2003-02-23 19:29 ` David Mosberger
2003-02-23 20:13 ` Martin J. Bligh
2003-02-23 21:34 ` Linus Torvalds
2003-02-23 20:21 ` Xavier Bestel
` (2 subsequent siblings)
3 siblings, 2 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 19:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
>>>>> On Sun, 23 Feb 2003 19:17:30 +0000 (UTC), torvalds@transmeta.com (Linus Torvalds) said:
Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).
But does x86 really work so well? Itanium 2 on 0.13um performs a lot
better than P4 on 0.13um. As far as I can guess, the only reason P4
comes out on 0.13um (and 0.09um) before anything else is due to the
latter part you mention: it's where the volume is today.
--david
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 0:50 ` Martin J. Bligh
2003-02-23 11:22 ` Magnus Danielson
@ 2003-02-23 19:54 ` Eric W. Biederman
1 sibling, 0 replies; 124+ messages in thread
From: Eric W. Biederman @ 2003-02-23 19:54 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: William Lee Irwin III, Jeff Garzik, linux-kernel
"Martin J. Bligh" <mbligh@aracnet.com> writes:
> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunate in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.
I guess ugly to program is in the eye of the beholder. The big platforms
have always seemed much worse to me, where every box feels free to
change things in arbitrary ways for no good reason, or where the OS and
other low-level software must know exactly which motherboard they are
running on to work properly.
Gratuitous incompatibilities are the ugliest thing I have ever seen.
Much less ugly than those are the warts a real platform accumulates
because it is designed to actually be used.
Eric
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:29 ` David Mosberger
@ 2003-02-23 20:13 ` Martin J. Bligh
2003-02-23 22:01 ` David Mosberger
2003-02-23 21:34 ` Linus Torvalds
1 sibling, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 20:13 UTC (permalink / raw)
To: davidm, Linus Torvalds; +Cc: linux-kernel
> Linus> Look at them the right way and you realize that a lot of the
> Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
> Linus> the fact that they are everywhere ;).
>
> But does x86 really work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um. As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.
Care to share those impressive benchmark numbers (for macro-benchmarks)?
Would be interesting to see the difference, and where it wins.
Thanks,
M
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: object-based rmap and pte-highmem
2003-02-23 19:20 ` Linus Torvalds
@ 2003-02-23 20:16 ` Martin J. Bligh
2003-02-23 21:37 ` Linus Torvalds
0 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 20:16 UTC (permalink / raw)
To: Linus Torvalds, linux-kernel
>> I have a plan for that (UKVA) ... we reserve a per-process area with
>> kernel type protections (either at the top of user space, changing
>> permissions appropriately, or inside kernel space, changing per-process
>> vs global appropriately).
>
> Nobody ever seems to have solved the threading impact of UKVA's. I told
> Andrea about it almost a year ago, and his reaction was "oh, duh!" and
> couldn't come up with a solution either.
>
> The thing is, you _cannot_ have a per-thread area, since all threads
> share the same TLB. And if it isn't per-thread, you still need all the
> locking and all the scalability stuff that the _current_ pte_highmem
> code needs, since there are people with thousands of threads in the same
> process.
>
> Until somebody _addresses_ this issue with UKVA, I consider UKVA to be a
> pipe-dream of people who haven't thought it through.
I don't see why that's an issue - the pagetables are per-process, not
per-thread.
Yes, that was a stalling point for sticking kmap in there, which was
amongst my original plotting for it, but the stuff that's per-process
still works.
I'm not suggesting kmapping them dynamically (though it's rather like
permanent kmap), I'm suggesting making enough space so we have them all
there for each process all the time. None of this tiny little window
shifting around stuff ...
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:17 ` Linus Torvalds
2003-02-23 19:29 ` David Mosberger
@ 2003-02-23 20:21 ` Xavier Bestel
2003-02-23 20:50 ` Martin J. Bligh
` (4 more replies)
2003-02-23 21:15 ` John Bradford
2003-02-23 21:55 ` William Lee Irwin III
3 siblings, 5 replies; 124+ messages in thread
From: Xavier Bestel @ 2003-02-23 20:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel Mailing List
On Sun 23/02/2003 at 20:17, Linus Torvalds wrote:
> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares? Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).
Next step: hardware gzip ?
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 8:07 ` David Lang
2003-02-23 8:20 ` William Lee Irwin III
2003-02-23 19:13 ` David Mosberger
@ 2003-02-23 20:48 ` Gerrit Huizenga
2 siblings, 0 replies; 124+ messages in thread
From: Gerrit Huizenga @ 2003-02-23 20:48 UTC (permalink / raw)
To: David Lang
Cc: Benjamin LaHaise, William Lee Irwin III, Jeff Garzik,
linux-kernel
On Sun, 23 Feb 2003 00:07:50 PST, David Lang wrote:
> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
> binaries.
IA64 *can* run IA32 binaries, just more slowly than native IA64 code.
gerrit
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 20:21 ` Xavier Bestel
@ 2003-02-23 20:50 ` Martin J. Bligh
2003-02-23 23:57 ` Alan Cox
2003-02-23 21:35 ` Alan Cox
` (3 subsequent siblings)
4 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 20:50 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linux Kernel Mailing List
>> And the baroque instruction encoding on the x86 is actually a _good_
>> thing: it's a rather dense encoding, which means that you win on icache.
>> It's a bit hard to decode, but who cares? Existing chips do well at
>> decoding, and thanks to the icache win they tend to perform better - and
>> they load faster too (which is important - you can make your CPU have
>> big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?
They did that already ... IBM were demonstrating such a thing a couple of
years ago. Don't see it helping with icache though, as it unpacks between
memory and the processor, IIRC.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:17 ` Linus Torvalds
2003-02-23 19:29 ` David Mosberger
2003-02-23 20:21 ` Xavier Bestel
@ 2003-02-23 21:15 ` John Bradford
2003-02-23 21:45 ` Linus Torvalds
2003-02-23 21:55 ` William Lee Irwin III
3 siblings, 1 reply; 124+ messages in thread
From: John Bradford @ 2003-02-23 21:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
> >If I didn't know this mattered I wouldn't bother with the barfbags.
> >I just wouldn't deal with it.
>
> Why?
>
> The x86 is a hell of a lot nicer than the ppc32, for example. On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.
I could be wrong, but I always thought that Sparc, and a lot of other
architectures could mark arbitrary areas of memory, (such as the
stack), as non-executable, whereas x86 only lets you have one
non-executable segment.
John.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:29 ` David Mosberger
2003-02-23 20:13 ` Martin J. Bligh
@ 2003-02-23 21:34 ` Linus Torvalds
2003-02-23 22:40 ` David Mosberger
1 sibling, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 21:34 UTC (permalink / raw)
To: davidm; +Cc: linux-kernel
On Sun, 23 Feb 2003, David Mosberger wrote:
>
> But does x86 really work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um.
On WHAT benchmark?
Itanium 2 doesn't hold a candle to a P4 on any real-world benchmarks.
As far as I know, the _only_ things Itanium 2 does better on is (a) FP
kernels, partly due to a huge cache and (b) big databases, entirely
because the P4 is crippled with lots of memory because Intel refuses to do
a 64-bit version (because they know it would totally kill ia-64).
Last I saw P4 was kicking ia-64 butt on specint and friends.
That's also ignoring the fact that ia-64 simply CANNOT DO the things a P4
does every single day. You can't put an ia-64 in a reasonable desktop
machine, partly because of pricing, but partly because it would just suck
so horribly at things people expect not to suck (games spring to mind).
And I further bet that using a native distribution (ie totally ignoring
the power and price and bad x86 performance issues), ia-64 will work a lot
worse for people simply because the binaries are bigger. That was quite
painful on alpha, and ia-64 is even worse - to offset the bigger binaries,
you need a faster disk subsystem etc just to not feel slower than a
bog-standard PC.
Code size matters. Price matters. Real world matters. And ia-64 at least
so far falls flat on its face on ALL of these.
> As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.
It's where all the money is ("ia-64: 5 billion dollars in the red and
still sinking") so of _course_ it's where the efforts get put.
Linus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 20:21 ` Xavier Bestel
2003-02-23 20:50 ` Martin J. Bligh
@ 2003-02-23 21:35 ` Alan Cox
2003-02-23 21:41 ` Linus Torvalds
` (2 subsequent siblings)
4 siblings, 0 replies; 124+ messages in thread
From: Alan Cox @ 2003-02-23 21:35 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linus Torvalds, Linux Kernel Mailing List
On Sun, 2003-02-23 at 20:21, Xavier Bestel wrote:
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?
gzip doesn't work because it's not unpackable from an arbitrary point. x86
in many ways is compressed, with common codes carefully bitpacked. A
horrible CISC design constraint for size has come full circle and turned
into a very nice memory/cache optimisation.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: object-based rmap and pte-highmem
2003-02-23 20:16 ` Martin J. Bligh
@ 2003-02-23 21:37 ` Linus Torvalds
2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 21:37 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: linux-kernel
On Sun, 23 Feb 2003, Martin J. Bligh wrote:
> >
> > The thing is, you _cannot_ have a per-thread area, since all threads
> > share the same TLB. And if it isn't per-thread, you still need all the
> > locking and all the scalability stuff that the _current_ pte_highmem
> > code needs, since there are people with thousands of threads in the same
> > process.
>
> I don't see why that's an issue - the pagetables are per-process, not
> per-thread.
Exactly. Which means that UKVA has all the same problems as the current
global map.
There are _NO_ differences. Any problems you have with the current global
map you would have with UKVA in threads. So I don't see what you expect to
win from UKVA.
> Yes, that was a stalling point for sticking kmap in there, which was
> amongst my original plotting for it, but the stuff that's per-process
> still works.
Exactly what _is_ "per-process"? The only thing that is per-process is
stuff that is totally local to the VM, by the linux definition.
And the rmap stuff certainly isn't "local to the VM". Yes, it is torn down
and built up by the VM, but it needs to be traversed by global code.
Linus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 20:21 ` Xavier Bestel
2003-02-23 20:50 ` Martin J. Bligh
2003-02-23 21:35 ` Alan Cox
@ 2003-02-23 21:41 ` Linus Torvalds
2003-02-24 0:01 ` Bill Davidsen
2003-02-24 0:36 ` yodaiken
4 siblings, 0 replies; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 21:41 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linux Kernel Mailing List
On 23 Feb 2003, Xavier Bestel wrote:
> On Sun 23/02/2003 at 20:17, Linus Torvalds wrote:
>
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares? Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?
Not gzip, no. It needs to be a random-access compression with reasonably
small blocks, not something designed for streaming. Which makes it harder
to do right and efficiently.
But ARM has Thumb (not the same thing, but same idea), and at least some
PPC chips have a page-based compressor - IBM calls it "CodePack" in case
you want to google for it.
Linus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 21:15 ` John Bradford
@ 2003-02-23 21:45 ` Linus Torvalds
2003-02-24 1:25 ` Benjamin LaHaise
0 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 21:45 UTC (permalink / raw)
To: John Bradford; +Cc: linux-kernel
On Sun, 23 Feb 2003, John Bradford wrote:
>
> I could be wrong, but I always thought that Sparc, and a lot of other
> architectures could mark arbitrary areas of memory, (such as the
> stack), as non-executable, whereas x86 only lets you have one
> non-executable segment.
The x86 has that stupid "executablility is tied to a segment" thing, which
means that you cannot make things executable on a page-per-page level.
It's a mistake, but it's one that _could_ be fixed in the architecture if
it really mattered, the same way the WP bit got fixed in the i486.
I'm definitely not saying that the x86 is perfect. It clearly isn't. But a
lot of people complain about the wrong things, and a lot of people who
tried to "fix" things just made them worse by throwing out the good parts
too.
Linus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:17 ` Linus Torvalds
` (2 preceding siblings ...)
2003-02-23 21:15 ` John Bradford
@ 2003-02-23 21:55 ` William Lee Irwin III
3 siblings, 0 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-23 21:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
>> If I didn't know this mattered I wouldn't bother with the barfbags.
>> I just wouldn't deal with it.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The x86 is a hell of a lot nicer than the ppc32, for example. On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.
We "basically" turn it off, but I was recently reminded it existed,
as LDT's are apparently wanted by something in userspace. There seem
to be various other unwelcome reminders floating around performance
critical paths as well.
I vaguely remember segmentation being the only way to enforce
execution permissions for mmap(), which we just don't bother doing.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> On the ppc32, the MMU braindamage is not something you can ignore, you
> have to write your OS for it and if you turn it off (ie enable soft-fill
> on the ones that support it) you now have to have separate paths in the
> OS for it.
The hashtables don't bother me very much. They can relatively easily
be front-ended by radix tree pagetables anyway, and if it sucks, well,
no software in the world can save sucky hardware. Hopefully later models
fix it to be fast or disablable. I'm more bothered by x86 lacking ASN's.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares? Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).
I'm not so sure: between things like cacheline-aligning branch targets and
space/time tradeoffs where smaller instructions run slower than longer
sequences of instructions, this stuff gets pretty strange. It still comes
out smaller in the end, but by a smaller-than-expected, though probably
still significant, margin. There's a good chunk of the instruction set
that should probably just be dumped outright, too.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The low register count isn't an issue when you code in any high-level
> language, and it has actually forced x86 implementors to do a hell of a
> lot better job than the competition when it comes to memory loads and
> stores - which helps in general. While the RISC people were off trying
> to optimize their compilers to generate loops that used all 32 registers
> efficiently, the x86 implementors instead made the chip run fast on
> varied loads and used tons of register renaming hardware (and looking at
> _memory_ renaming too).
Invariably we get stuck diving into assembly anyway. =)
This one is basically me getting irked by looking at disassemblies of
random x86 binaries and seeing vast amounts of register spilling. It's
probably not a performance issue aside from code bloat, especially given
the amount of trickery with the weird L1 cache stack magic and so on.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> IA64 made all the mistakes anybody else did, and threw out all the good
> parts of the x86 because people thought those parts were ugly. They
> aren't ugly, they're the "charming oddity" that makes it do well. Look
> at them the right way and you realize that a lot of the grottyness is
> exactly _why_ the x86 works so well (yeah, and the fact that they are
> everywhere ;).
Count me as "not charmed". We've actually tripped over this stuff, and
for the most part you've been personally squashing the super low-level
bugs like the NT flag business and vsyscall segmentation oddities.
IA64 suffers from truly excessive featuritis, and there is a relatively
good chance that some (or all) of its features will be every bit as
unused and hated as segmentation, if it actually survives.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The only real major failure of the x86 is the PAE crud. Let's hope
> we'll get to forget it, the same way the DOS people eventually forgot
> about their memory extenders.
We've not really been able to forget about segments or ISA DMA...
The pessimist in me has more or less already resigned me to PAE as
a fact of life.
On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> (Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
> will matter, and people can overlook the grottiness there. Right now
> Intel doesn't even seem to be interested in "64-bit for the masses", and
> maybe IBM will be. AMD certainly seems to be serious about the "masses"
> part, which in the end is the only part that really matters).
ppc64 is sane in my book (not vendor nepotism; the other "vanilla RISC"
machines get the same rating). No idea about marketing stuff.
-- wli
* Re: Minutes from Feb 21 LSE Call
2003-02-23 20:13 ` Martin J. Bligh
@ 2003-02-23 22:01 ` David Mosberger
2003-02-23 22:12 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: David Mosberger @ 2003-02-23 22:01 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: davidm, Linus Torvalds, linux-kernel
>>>>> On Sun, 23 Feb 2003 12:13:00 -0800, "Martin J. Bligh" <mbligh@aracnet.com> said:
Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).
>> But does x86 really work so well? Itanium 2 on 0.13um performs a
>> lot better than P4 on 0.13um. As far as I can guess, the only
>> reason P4 comes out on 0.13um (and 0.09um) before anything else
>> is due to the latter part you mention: it's where the volume is
>> today.
Martin> Care to share those impressive benchmark numbers (for
Martin> macro-benchmarks)? Would be interesting to see the
Martin> difference, and where it wins.
You can do it two ways: you can look at the numbers Intel has publicly
projected for Madison, or you can compare McKinley with a 0.18um Pentium 4.
--david
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)
2003-02-23 21:37 ` Linus Torvalds
@ 2003-02-23 22:07 ` Martin J. Bligh
2003-02-23 22:10 ` William Lee Irwin III
2003-02-24 3:07 ` Martin J. Bligh
0 siblings, 2 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 22:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
>> > The thing is, you _cannot_ have a per-thread area, since all threads
>> > share the same TLB. And if it isn't per-thread, you still need all the
>> > locking and all the scalability stuff that the _current_ pte_highmem
>> > code needs, since there are people with thousands of threads in the
>> > same process.
>>
>> I don't see why that's an issue - the pagetables are per-process, not
>> per-thread.
>
> Exactly. Which means that UKVA has all the same problems as the current
> global map.
>
> There are _NO_ differences. Any problems you have with the current global
> map you would have with UKVA in threads. So I don't see what you expect
> to win from UKVA.
This is just for PTEs ... for which at the moment we have two choices:
1. Stick them in lowmem (fills up the global space too much).
2. Stick them in highmem - too much overhead doing k(un)map_atomic
as measured by both myself and Andrew.
Using UKVA for PTEs seems to be a better way to implement pte-highmem to me.
If you're walking another process's pagetables, you just kmap them as now,
but I think this will avoid most of the kmapping (if we have space for two
sets of pagetables so we can do a little bit of trickery at fork time).
>> Yes, that was a stalling point for sticking kmap in there, which was
>> amongst my original plotting for it, but the stuff that's per-process
>> still works.
>
> Exactly what _is_ "per-process"? The only thing that is per-process is
> stuff that is totally local to the VM, by the linux definition.
The pagetables.
> And the rmap stuff certainly isn't "local to the VM". Yes, it is torn
> down and built up by the VM, but it needs to be traversed by global code.
Sorry, subject was probably misleading ... I'm just talking about the
PTEs here, not sticking anything to do with rmap into UKVA.
Partially object-based rmap is cool for other reasons, that have little to
do with this. ;-)
M.
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)
2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh
@ 2003-02-23 22:10 ` William Lee Irwin III
2003-02-24 0:31 ` Linus Torvalds
2003-02-24 3:07 ` Martin J. Bligh
1 sibling, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-23 22:10 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Linus Torvalds, linux-kernel
On Sun, Feb 23, 2003 at 02:07:42PM -0800, Martin J. Bligh wrote:
> Using UKVA for PTEs seems to be a better way to implement pte-highmem to me.
> If you're walking another processes' pagetables, you just kmap them as now,
> but I think this will avoid most of the kmap'ing (if we have space for two
> sets of pagetables so we can do a little bit of trickery at fork time).
Another term for "UKVA for pagetables only" is "recursive pagetables",
if this helps clarify anything.
-- wli
* Re: Minutes from Feb 21 LSE Call
2003-02-23 22:01 ` David Mosberger
@ 2003-02-23 22:12 ` Martin J. Bligh
0 siblings, 0 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 22:12 UTC (permalink / raw)
To: davidm; +Cc: Linus Torvalds, linux-kernel
> >> But does x86 really work so well? Itanium 2 on 0.13um performs a
> >> lot better than P4 on 0.13um. As far as I can guess, the only
> >> reason P4 comes out on 0.13um (and 0.09um) before anything else
> >> is due to the latter part you mention: it's where the volume is
> >> today.
>
> Martin> Care to share those impressive benchmark numbers (for
> Martin> macro-benchmarks)? Would be interesting to see the
> Martin> difference, and where it wins.
>
> You can do it two ways: you can look at the numbers Intel has publicly
> projected for Madison, or you can compare McKinley with a 0.18um Pentium 4.
Ummm ... I'm not exactly happy working with Intel's own projections on the
performance of their Itanium chips ... seems a little unscientific ;-)
Presumably when you said "Itanium 2 on 0.13um performs a lot better than P4
on 0.13um." you were referring to some benchmarks you have the results of?
If you can't publish them, fair enough. But if you can, I'd love to see how
it compares ... Itanium seems to be "more interesting" nowadays, though I
can't say I'm happy about the complexity of it.
M.
* Re: Minutes from Feb 21 LSE Call
2003-02-23 21:34 ` Linus Torvalds
@ 2003-02-23 22:40 ` David Mosberger
2003-02-23 22:48 ` David Lang
2003-02-23 23:06 ` Martin J. Bligh
0 siblings, 2 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 22:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, linux-kernel
>>>>> On Sun, 23 Feb 2003 13:34:32 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:
Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
I don't think so. According to Intel [1], the highest clock frequency
for a 0.18um part is 2GHz (both for Xeon and P4; for Xeon MP it's
1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
[2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
--david
[1] http://www.intel.com/support/processors/xeon/corespeeds.htm
[2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
[3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
* Re: Minutes from Feb 21 LSE Call
2003-02-23 22:40 ` David Mosberger
@ 2003-02-23 22:48 ` David Lang
2003-02-23 22:54 ` David Mosberger
2003-02-23 23:06 ` Martin J. Bligh
1 sibling, 1 reply; 124+ messages in thread
From: David Lang @ 2003-02-23 22:48 UTC (permalink / raw)
To: davidm; +Cc: Linus Torvalds, linux-kernel
I would call a 15% lead over the ia64 pretty substantial.
Yes, it's not the same clock speed, but if that's the clock speed they can
achieve on that process, it's an equivalent comparison. The P4 covers a
LOT of sins by ratcheting up its clock speed; what matters is the final
capability, not capability/clock (if capability/clock were what mattered,
the AMD chips would have put Intel out of business and the P4 would be as
common as ia-64).
David Lang
On Sun, 23 Feb 2003, David Mosberger wrote:
> Date: Sun, 23 Feb 2003 14:40:44 -0800
> From: David Mosberger <davidm@napali.hpl.hp.com>
> Reply-To: davidm@hpl.hp.com
> To: Linus Torvalds <torvalds@transmeta.com>
> Cc: davidm@hpl.hp.com, linux-kernel@vger.kernel.org
> Subject: Re: Minutes from Feb 21 LSE Call
>
> >>>>> On Sun, 23 Feb 2003 13:34:32 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:
>
> Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>
> I don't think so. According to Intel [1], the highest clock frequency
> for a 0.18um part is 2GHz (both for Xeon and P4, for Xeon MP it's
> 1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
> [2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
>
> --david
>
> [1] http://www.intel.com/support/processors/xeon/corespeeds.htm
> [2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
> [3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
* Re: Minutes from Feb 21 LSE Call
2003-02-23 22:48 ` David Lang
@ 2003-02-23 22:54 ` David Mosberger
2003-02-23 22:56 ` David Lang
` (2 more replies)
0 siblings, 3 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 22:54 UTC (permalink / raw)
To: David Lang; +Cc: davidm, Linus Torvalds, linux-kernel
>>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
David.L> I would call a 15% lead over the ia64 pretty substantial.
Huh? Did you misread my mail?
2 GHz Xeon: 701 SPECint
1 GHz Itanium 2: 810 SPECint
That is, Itanium 2 is 15% faster.
--david
* Re: Minutes from Feb 21 LSE Call
2003-02-23 22:54 ` David Mosberger
@ 2003-02-23 22:56 ` David Lang
2003-02-24 0:40 ` Linus Torvalds
2003-02-24 1:06 ` dean gaudet
2 siblings, 0 replies; 124+ messages in thread
From: David Lang @ 2003-02-23 22:56 UTC (permalink / raw)
To: davidm; +Cc: Linus Torvalds, linux-kernel
yep, I reversed the numbers
David Lang
On Sun, 23 Feb 2003, David Mosberger wrote:
> Date: Sun, 23 Feb 2003 14:54:12 -0800
> From: David Mosberger <davidm@napali.hpl.hp.com>
> Reply-To: davidm@hpl.hp.com
> To: David Lang <david.lang@digitalinsight.com>
> Cc: davidm@hpl.hp.com, Linus Torvalds <torvalds@transmeta.com>,
> linux-kernel@vger.kernel.org
> Subject: Re: Minutes from Feb 21 LSE Call
>
> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> Huh? Did you misread my mail?
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.
>
> --david
>
* Re: Minutes from Feb 21 LSE Call
2003-02-23 22:40 ` David Mosberger
2003-02-23 22:48 ` David Lang
@ 2003-02-23 23:06 ` Martin J. Bligh
2003-02-23 23:59 ` David Mosberger
1 sibling, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 23:06 UTC (permalink / raw)
To: davidm, Linus Torvalds; +Cc: linux-kernel
> Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>
> I don't think so. According to Intel [1], the highest clockfrequency
> for a 0.18um part is 2GHz (both for Xeon and P4, for Xeon MP it's
> 1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
> [2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
>
> --david
>
> [1] http://www.intel.com/support/processors/xeon/corespeeds.htm
> [2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
> [3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
Got anything more real-world than SPECint type microbenchmarks?
M.
* Re: Minutes from Feb 21 LSE Call
2003-02-22 21:36 ` Gerrit Huizenga
2003-02-22 21:42 ` Christoph Hellwig
@ 2003-02-23 23:23 ` Bill Davidsen
2003-02-24 3:31 ` Gerrit Huizenga
1 sibling, 1 reply; 124+ messages in thread
From: Bill Davidsen @ 2003-02-23 23:23 UTC (permalink / raw)
To: Gerrit Huizenga; +Cc: lse-tech, Linux Kernel Mailing List
On Sat, 22 Feb 2003, Gerrit Huizenga wrote:
> On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote:
> > I think people overestimate the numbner of large boxes badly. Several IDE
> > pre-patches didn't work on highmem boxes. It took *ages* for people to
> > actually notice there was a problem. The desktop world is still 128-256Mb
>
> IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
> is a fun toy, but bigger than *I* need, even for development purposes.
> But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
> IDE products for my 8-proc 16 GB machine... And running pre-patches in
> a production environment that might expose this would be a little
> silly as well.
I don't disagree with most of your point, however there certainly are
legitimate uses for big boxes with small (IDE) disk. Those which first
come to mind are all computational problems, in which a small dataset is
read from disk and then processors beat on the data. More or less common
examples are graphics transformations (original and final data
compressed), engineering calculations such as finite element analysis,
rendering (raytracing) type calculations, and data analysis (things like
setiathome or automated medical image analysis).
IDE drives are very cost effective, and low cost motherboard RAID is
certainly useful for preserving the results of large calculations on small
(relatively) datasets.
--
bill davidsen <davidsen@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:13 ` David Mosberger
@ 2003-02-23 23:28 ` Benjamin LaHaise
2003-02-26 8:46 ` Eric W. Biederman
1 sibling, 0 replies; 124+ messages in thread
From: Benjamin LaHaise @ 2003-02-23 23:28 UTC (permalink / raw)
To: David Mosberger
Cc: David Lang, Gerrit Huizenga, William Lee Irwin III, Jeff Garzik,
linux-kernel
On Sun, Feb 23, 2003 at 11:13:03AM -0800, David Mosberger wrote:
> This simply isn't true. Itanium and Itanium 2 have full x86 hardware
> built into the chip (for better or worse ;-). The speed isn't as good
> as the fastest x86 chips today, but it's faster (~300MHz P6) than the
That hardly counts as reasonably performant: the slowest mainstream chips
from Intel and AMD are clocked well over 1 GHz. At least x86-64 will
improve the performance of the 32 bit databases people have already
invested large amounts of money in, and it will do so without the need
for a massive outlay of funds for a new 64 bit license. Why accept
more than 10x the cost to migrate to ia64 when a new x86-64 will improve
the speed of existing applications, and improve scalability with the
transparent addition of a 64 bit kernel?
-ben
--
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
* Re: Minutes from Feb 21 LSE Call
2003-02-23 20:50 ` Martin J. Bligh
@ 2003-02-23 23:57 ` Alan Cox
2003-02-24 1:26 ` Kenneth Johansson
0 siblings, 1 reply; 124+ messages in thread
From: Alan Cox @ 2003-02-23 23:57 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Xavier Bestel, Linux Kernel Mailing List
On Sun, 2003-02-23 at 20:50, Martin J. Bligh wrote:
> >> And the baroque instruction encoding on the x86 is actually a _good_
> >> thing: it's a rather dense encoding, which means that you win on icache.
> >> It's a bit hard to decode, but who cares? Existing chips do well at
> >> decoding, and thanks to the icache win they tend to perform better - and
> >> they load faster too (which is important - you can make your CPU have
> >> big caches, but _nothing_ saves you from the cold-cache costs).
> >
> > Next step: hardware gzip ?
>
> They did that already ... IBM were demonstrating such a thing a couple of
> years ago. Don't see it helping with icache though, as it unpacks between
> > memory and the processor, IIRC.
I saw the L2/L3 compressed cache thing, and I thought "doh!", and then I
watched and haven't seen it for a long time. What happened to it?
* Re: Minutes from Feb 21 LSE Call
2003-02-23 23:06 ` Martin J. Bligh
@ 2003-02-23 23:59 ` David Mosberger
2003-02-24 3:49 ` Gerrit Huizenga
0 siblings, 1 reply; 124+ messages in thread
From: David Mosberger @ 2003-02-23 23:59 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: davidm, Linus Torvalds, linux-kernel
>>>>> On Sun, 23 Feb 2003 15:06:56 -0800, "Martin J. Bligh" <mbligh@aracnet.com> said:
Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>> I don't think so. According to Intel [1], the highest
>> clock frequency for a 0.18um part is 2GHz (both for Xeon and P4,
>> for Xeon MP it's 1.5GHz). The highest reported SPECint for a
>> 2GHz Xeon seems to be 701 [2]. In comparison, a 1GHz McKinley
>> gets a SPECint of 810 [3].
Martin> Got anything more real-world than SPECint type
Martin> microbenchmarks?
SPECint a microbenchmark? You seem to be redefining the meaning of
the word (last time I checked, lmbench was a microbenchmark).
Ironically, Itanium 2 seems to do even better in the "real world" than
suggested by benchmarks, partly because of the large caches and memory
bandwidth and, I'm guessing, partly because of its straightforward
micro-architecture (e.g., a synchronization operation takes on the
order of 10 cycles, compared to dozens or hundreds of cycles on the
Pentium 4).
BTW: I hope I don't sound too negative on the Pentium 4/Xeon. It's
certainly an excellent performer for many things. I just want to
point out that Itanium 2 also is a good performer, probably more so
than many on this list seem to be willing to give it credit for.
--david
* Re: Minutes from Feb 21 LSE Call
2003-02-23 20:21 ` Xavier Bestel
` (2 preceding siblings ...)
2003-02-23 21:41 ` Linus Torvalds
@ 2003-02-24 0:01 ` Bill Davidsen
2003-02-24 0:36 ` yodaiken
4 siblings, 0 replies; 124+ messages in thread
From: Bill Davidsen @ 2003-02-24 0:01 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linus Torvalds, Linux Kernel Mailing List
On 23 Feb 2003, Xavier Bestel wrote:
> Le dim 23/02/2003 à 20:17, Linus Torvalds a écrit :
>
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares? Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?
If the firmware issues were better defined in Intel ia32 chips, I could
see a gzip instruction pointing to blocks in memory. As a proof of
concept, not a big win.
--
bill davidsen <davidsen@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)
2003-02-23 22:10 ` William Lee Irwin III
@ 2003-02-24 0:31 ` Linus Torvalds
0 siblings, 0 replies; 124+ messages in thread
From: Linus Torvalds @ 2003-02-24 0:31 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Martin J. Bligh, linux-kernel
On Sun, 23 Feb 2003, William Lee Irwin III wrote:
>
> Another term for "UKVA for pagetables only" is "recursive pagetables",
> if this helps clarify anything.
Oh, ok. We did that for alpha, and it was a good deal there (it's actually
architected for alpha). So yes, I don't mind doing it for the page tables,
and it should work fine on x86 too (it's not necessarily a very portable
approach, since it requires that the pmd- and the pte- tables look the
same, which is not always true).
So sure, go ahead with that part.
Linus
* Re: Minutes from Feb 21 LSE Call
2003-02-23 20:21 ` Xavier Bestel
` (3 preceding siblings ...)
2003-02-24 0:01 ` Bill Davidsen
@ 2003-02-24 0:36 ` yodaiken
4 siblings, 0 replies; 124+ messages in thread
From: yodaiken @ 2003-02-24 0:36 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linus Torvalds, Linux Kernel Mailing List
On Sun, Feb 23, 2003 at 09:21:27PM +0100, Xavier Bestel wrote:
> Le dim 23/02/2003 à 20:17, Linus Torvalds a écrit :
>
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares? Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip ?
See ARM "thumb"
* Re: Minutes from Feb 21 LSE Call
2003-02-23 22:54 ` David Mosberger
2003-02-23 22:56 ` David Lang
@ 2003-02-24 0:40 ` Linus Torvalds
2003-02-24 2:32 ` David Mosberger
2003-02-24 1:06 ` dean gaudet
2 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2003-02-24 0:40 UTC (permalink / raw)
To: davidm; +Cc: David Lang, linux-kernel
On Sun, 23 Feb 2003, David Mosberger wrote:
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.
Ehh, and this is with how much cache?
Last I saw, the Itanium 2 machines came with 3MB of integrated L3 caches,
and I suspect that whatever 0.13 Itanium numbers you're looking at are
with the new 6MB caches.
So your "apples to apples" comparison isn't exactly that.
The only thing that is meaningful is "performance at the same time of
general availability". At which point the P4 beats the Itanium 2 senseless
with a 25% higher SPECint. And last I heard, by the time Itanium 2 is up
at 2GHz, the P4 is apparently going to be at 5GHz, comfortably keeping
that 25% lead.
Linus
* Re: Minutes from Feb 21 LSE Call
2003-02-23 22:54 ` David Mosberger
2003-02-23 22:56 ` David Lang
2003-02-24 0:40 ` Linus Torvalds
@ 2003-02-24 1:06 ` dean gaudet
2003-02-24 1:56 ` David Mosberger
2 siblings, 1 reply; 124+ messages in thread
From: dean gaudet @ 2003-02-24 1:06 UTC (permalink / raw)
To: davidm; +Cc: David Lang, Linus Torvalds, linux-kernel
On Sun, 23 Feb 2003, David Mosberger wrote:
> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> Huh? Did you misread my mail?
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.
According to pricewatch, I could buy ten 2GHz Xeons for about the cost
of one Itanium 2 900MHz.
that's not even considering the cost of the motherboards i'd need to plug
those into.
-dean
* Re: Minutes from Feb 21 LSE Call
2003-02-23 21:45 ` Linus Torvalds
@ 2003-02-24 1:25 ` Benjamin LaHaise
0 siblings, 0 replies; 124+ messages in thread
From: Benjamin LaHaise @ 2003-02-24 1:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: John Bradford, linux-kernel
On Sun, Feb 23, 2003 at 01:45:16PM -0800, Linus Torvalds wrote:
> The x86 has that stupid "executablility is tied to a segment" thing, which
> means that you cannot make things executable on a page-per-page level.
> It's a mistake, but it's one that _could_ be fixed in the architecture if
> it really mattered, the same way the WP bit got fixed in the i486.
I've been thinking about this recently, and it turns out that the whole
point is moot with a fixed-address vsyscall page: non-exec stacks are
trivially circumvented by using the vsyscall page as a known starting
point for the exploit. All the other tricks of changing the starting
stack offset and using randomized load addresses don't help at all,
since the exploit can merely use the vsyscall page to perform various
operations. Personally, I'm still a fan of the shared-library vsyscall
trick, which would allow us to randomize its load address and defeat
this problem.
-ben
* Re: Minutes from Feb 21 LSE Call
2003-02-23 23:57 ` Alan Cox
@ 2003-02-24 1:26 ` Kenneth Johansson
2003-02-24 1:53 ` dean gaudet
0 siblings, 1 reply; 124+ messages in thread
From: Kenneth Johansson @ 2003-02-24 1:26 UTC (permalink / raw)
To: Alan Cox; +Cc: Martin J. Bligh, Xavier Bestel, Linux Kernel Mailing List
On Mon, 2003-02-24 at 00:57, Alan Cox wrote:
> On Sun, 2003-02-23 at 20:50, Martin J. Bligh wrote:
> > >> And the baroque instruction encoding on the x86 is actually a _good_
> > >> thing: it's a rather dense encoding, which means that you win on icache.
> > >> It's a bit hard to decode, but who cares? Existing chips do well at
> > >> decoding, and thanks to the icache win they tend to perform better - and
> > >> they load faster too (which is important - you can make your CPU have
> > >> big caches, but _nothing_ saves you from the cold-cache costs).
> > >
> > > Next step: hardware gzip ?
> >
> > They did that already ... IBM were demonstrating such a thing a couple of
> > years ago. Don't see it helping with icache though, as it unpacks between
> > memory and the processory, IIRC.
>
> I saw the L2/L3 compressed cache thing, and I thought "doh!", and I watched and
> I've not seen it for a long time. What happened to it ?
>
http://www-3.ibm.com/chips/techlib/techlib.nsf/products/CodePack
If that is what you are thinking of, it does look like people were not
using it (I know I'm not). It reduces memory for instructions, but that
is all, and memory, it seems, is not a problem, at least not for
instructions.
It does not exist in newer CPUs from IBM; I don't know the official
reason for the removal.
If you really do mean a compressed cache, I don't think anybody has done
that for real.
* Re: Minutes from Feb 21 LSE Call
2003-02-23 17:28 ` Eric W. Biederman
@ 2003-02-24 1:42 ` Benjamin LaHaise
0 siblings, 0 replies; 124+ messages in thread
From: Benjamin LaHaise @ 2003-02-24 1:42 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Rik van Riel, Hanna Linder, lse-tech, linux-kernel
On Sun, Feb 23, 2003 at 10:28:04AM -0700, Eric W. Biederman wrote:
> The problem. There is no upper bound to how many rmap
> entries there can be at one time. And the unbounded
> growth can overwhelm a machine.
Eh? By that logic there's no bound to the number of vmas that can exist
at a given time. But there is a bound on the number that a single process
can force the system into using, and that limit also caps the number of
rmap entries the process can bring into existence. Virtual address space
is not free, and there are already mechanisms in place to limit it; given
that the number of rmap entries is directly proportional to the amount of
virtual address space in use, those limits probably just need proper
configuration.
> The goal is to provide an overall system cap on the number
> of rmap entries.
No, the goal is to have a stable system that performs well under a
variety of workloads. User-exploitable worst-case behaviour is a bad
idea. Hybrid solves that at the expense of added complexity.
-ben
--
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
* Re: Minutes from Feb 21 LSE Call
2003-02-24 1:26 ` Kenneth Johansson
@ 2003-02-24 1:53 ` dean gaudet
0 siblings, 0 replies; 124+ messages in thread
From: dean gaudet @ 2003-02-24 1:53 UTC (permalink / raw)
To: Kenneth Johansson
Cc: Alan Cox, Martin J. Bligh, Xavier Bestel,
Linux Kernel Mailing List
On Sun, 24 Feb 2003, Kenneth Johansson wrote:
> If you really do mean compressed cache I don't think anybody has done
> that for real.
people are doing this *for real* -- it really depends on what you define
as compressed.
ARM thumb is definitely a compression function for code.
x86 native instructions are compressed compared to the RISC-like micro-ops
which a processor like the athlon, p3, and p4 actually executes. for similar
operations, an x86 would probably average 1.5 bytes to encode what a
32-bit RISC would need 4 bytes to encode.
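Taking dean's figures at face value (an assumed 1.5-byte average x86 encoding vs. a fixed 4-byte RISC encoding), the density gap is easy to quantify:

```python
x86_avg_bytes = 1.5  # dean's rough average for an x86 instruction
risc_bytes = 4.0     # fixed-width 32-bit RISC encoding

# Code-size ratio for the same number of operations: RISC code would
# be roughly 2.7x larger than the equivalent x86 code.
ratio = risc_bytes / x86_avg_bytes
print(f"{ratio:.2f}")
```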
-dean
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 1:06 ` dean gaudet
@ 2003-02-24 1:56 ` David Mosberger
2003-02-24 2:15 ` dean gaudet
0 siblings, 1 reply; 124+ messages in thread
From: David Mosberger @ 2003-02-24 1:56 UTC (permalink / raw)
To: dean gaudet; +Cc: davidm, David Lang, Linus Torvalds, linux-kernel
>>>>> On Sun, 23 Feb 2003 17:06:29 -0800 (PST), dean gaudet <dean-list-linux-kernel@arctic.org> said:
Dean> On Sun, 23 Feb 2003, David Mosberger wrote:
>> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
David.L> I would call a 15% lead over the ia64 pretty substantial.
>> Huh? Did you misread my mail?
>> 2 GHz Xeon: 701 SPECint
>> 1 GHz Itanium 2: 810 SPECint
>> That is, Itanium 2 is 15% faster.
Dean> according to pricewatch i could buy ten 2GHz Xeons for about
Dean> the cost of one Itanium 2 900MHz.
Not if you want comparable cache-sizes [1]:
Intel Xeon MP, 2MB L3 cache: $3692
Itanium 2, 1 GHZ, 3MB L3 cache: $4226
Itanium 2, 1 GHZ, 1.5MB L3 cache: $2247
Itanium 2, 900 MHZ, 1.5MB L3 cache: $1338
Intel basically prices things by the cache size.
--david
[1]: http://www.intel.com/intel/finance/pricelist/
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 1:56 ` David Mosberger
@ 2003-02-24 2:15 ` dean gaudet
2003-02-24 3:11 ` David Mosberger
0 siblings, 1 reply; 124+ messages in thread
From: dean gaudet @ 2003-02-24 2:15 UTC (permalink / raw)
To: davidm; +Cc: David Lang, Linus Torvalds, linux-kernel
On Sun, 23 Feb 2003, David Mosberger wrote:
> >>>>> On Sun, 23 Feb 2003 17:06:29 -0800 (PST), dean gaudet <dean-list-linux-kernel@arctic.org> said:
>
> Dean> On Sun, 23 Feb 2003, David Mosberger wrote:
> >> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> >> Huh? Did you misread my mail?
>
> >> 2 GHz Xeon: 701 SPECint
> >> 1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Dean> according to pricewatch i could buy ten 2GHz Xeons for about
> Dean> the cost of one Itanium 2 900MHz.
>
> Not if you want comparable cache-sizes [1]:
somehow i doubt you're quoting Xeon numbers w/2MB of cache above. in
fact, here's a 701 specint with only 512KB of cache @ 2GHz:
http://www.spec.org/osg/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
my point was that if you had comparable die sizes the 15% "advantage"
would disappear. there's a hell of a lot which could be done with the
approximately double die size that the itanium 2 has compared to any of
the commodity x86 parts. but then the cost per part would be
correspondingly higher... which is exactly what is shown in the intel cost
numbers.
a more fair comparison would be your itanium 2 number with this:
http://www.spec.org/osg/cpu2000/results/res2002q4/cpu2000-20021021-01742.html
2MB L2 Xeon @ 2GHz, scores 842.
is this the itanium 2 number you're quoting us?
http://www.spec.org/osg/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
'cause that's with 3MB L3.
-dean
>
> Intel Xeon MP, 2MB L3 cache: $3692
>
> Itanium 2, 1 GHZ, 3MB L3 cache: $4226
> Itanium 2, 1 GHZ, 1.5MB L3 cache: $2247
> Itanium 2, 900 MHZ, 1.5MB L3 cache: $1338
>
> Intel basically prices things by the cache size.
>
> --david
>
> [1]: http://www.intel.com/intel/finance/pricelist/
>
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 0:40 ` Linus Torvalds
@ 2003-02-24 2:32 ` David Mosberger
2003-02-24 2:54 ` Linus Torvalds
0 siblings, 1 reply; 124+ messages in thread
From: David Mosberger @ 2003-02-24 2:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David Lang, linux-kernel
>>>>> On Sun, 23 Feb 2003 16:40:40 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:
Linus> On Sun, 23 Feb 2003, David Mosberger wrote:
>> 2 GHz Xeon: 701 SPECint
>> 1 GHz Itanium 2: 810 SPECint
>> That is, Itanium 2 is 15% faster.
Linus> Ehh, and this is with how much cache?
Linus> Last I saw, the Itanium 2 machines came with 3MB of
Linus> integrated L3 caches, and I suspect that whatever 0.13
Linus> Itanium numbers you're looking at are with the new 6MB
Linus> caches.
Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
we can do some educated guessing:
1GHz Itanium 2, 3MB cache: 810 SPECint
900MHz Itanium 2, 1.5MB cache: 674 SPECint
Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
around 750 SPECint. In reality, it would get slightly less, but most
likely substantially more than 701.
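The scaling estimate can be reproduced directly from the SPECint figures quoted above (pure linear frequency scaling is, as noted, an optimistic assumption):

```python
specint_900 = 674          # 900 MHz / 1.5 MB Itanium 2, quoted above
base_mhz, target_mhz = 900, 1000

# Assume the score scales linearly with clock frequency.
estimate = specint_900 * target_mhz / base_mhz
print(round(estimate))  # ~749, i.e. "around 750 SPECint"
```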
Linus> So your "apples to apples" comparison isn't exactly that.
I never claimed it's an apples to apples comparison. But comparing
same-process chips from the same manufacturer does make for a fairer
"architectural" comparison because it factors out at least some of the
effects caused by volume (there is no reason other than (a) volume and
(b) being designed as a server chip for Itanium chips to come out on
the same process later than the corresponding x86 chips).
Linus> The only thing that is meaningful is "performance at the same
Linus> time of general availability".
You claimed that x86 is inherently superior. I provided data that
shows that much of this apparent superiority is simply an effect of
the larger volume that x86 achieves today. Please don't claim that
x86 wins on technical grounds when it really wins on economic grounds.
--david
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 2:32 ` David Mosberger
@ 2003-02-24 2:54 ` Linus Torvalds
2003-02-24 3:08 ` David Mosberger
2003-02-24 21:42 ` Andrea Arcangeli
0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2003-02-24 2:54 UTC (permalink / raw)
To: davidm; +Cc: David Lang, linux-kernel
On Sun, 23 Feb 2003, David Mosberger wrote:
> >> 2 GHz Xeon: 701 SPECint
> >> 1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> we can do some educated guessing:
>
> 1GHz Itanium 2, 3MB cache: 810 SPECint
> 900MHz Itanium 2, 1.5MB cache: 674 SPECint
>
> Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> around 750 SPECint. In reality, it would get slightly less, but most
> likely substantially more than 701.
And as Dean pointed out:
2Ghz Xeon MP with 2MB L3 cache: 842 SPECint
In other words, the P4 eats the Itanium for breakfast even if you limit it
to 2GHz due to some "process" rule.
And if you don't make up any silly rules, but simply look at "what's
available today", you get
2.8Ghz Xeon MP with 2MB L3 cache: 907 SPECint
or even better (much cheaper CPUs):
3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
AMD Athlon XP 2800+: 933 SPECint
These are systems that you can buy today. With _less_ cache, and clearly
much higher performance (comparing the best-performing
published ia-64 with the best P4 on specint, the P4 is 32% faster). Even
with the "you can only run the P4 at 2GHz because that is all it ever ran
at in 0.18" thing, the ia-64 falls behind.
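The 32% figure follows from the two scores quoted in the thread:

```python
p4_best = 1074    # 3.06 GHz P4, 512 kB L2
ia64_best = 810   # 1 GHz Itanium 2, 3 MB L3

# Relative advantage of the best published P4 score over the best ia-64.
speedup_pct = (p4_best / ia64_best - 1) * 100
print(f"{speedup_pct:.1f}")  # ~32.6
```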
> Linus> The only thing that is meaningful is "performance at the same
> Linus> time of general availability".
>
> You claimed that x86 is inherently superior. I provided data that
> shows that much of this apparent superiority is simply an effect of
> the larger volume that x86 achieves today.
And I showed that your data is flawed. Clearly the P4 outperforms ia-64
on an architectural level _even_ when taking process into account.
Linus
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem)
2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh
2003-02-23 22:10 ` William Lee Irwin III
@ 2003-02-24 3:07 ` Martin J. Bligh
1 sibling, 0 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-24 3:07 UTC (permalink / raw)
To: linux-kernel
> This is just for PTEs ... for which at the moment we have two choices:
> 1. Stick them in lowmem (fills up the global space too much).
> 2. Stick them in highmem - too much overhead doing k(un)map_atomic
> as measured by both myself and Andrew.
Actually Andrew's measurements seem to be a bit different from mine ...
several different things all interacting. I'll try to get some more
measurements from a straight SMP box, and see if they correlate more
closely with what he's seeing.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 2:54 ` Linus Torvalds
@ 2003-02-24 3:08 ` David Mosberger
2003-02-24 21:42 ` Andrea Arcangeli
1 sibling, 0 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-24 3:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David Lang, linux-kernel
>>>>> On Sun, 23 Feb 2003 18:54:41 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:
Linus> In other words, the P4 eats the Itanium for breakfast even if
Linus> you limit it to 2GHz due to some "process" rule.
Ugh, 842 vs 810 is "eating for breakfast"? In my lexicon, that's "in
the same ballpark".
Besides the 2GHz Xeon MP is a 0.13um part.
>> You claimed that x86 is inherently superior. I provided data that
>> shows that much of this apparent superiority is simply an effect of
>> the larger volume that x86 achieves today.
Linus> And I showed that your data is flawed.
No, you did not.
--david
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 2:15 ` dean gaudet
@ 2003-02-24 3:11 ` David Mosberger
0 siblings, 0 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-24 3:11 UTC (permalink / raw)
To: dean gaudet; +Cc: davidm, David Lang, Linus Torvalds, linux-kernel
>>>>> On Sun, 23 Feb 2003 18:15:29 -0800 (PST), dean gaudet <dean-list-linux-kernel@arctic.org> said:
Dean> somehow i doubt you're quoting Xeon numbers w/2MB of cache above.
I quoted the Xeon 0.13um price because there was no 0.18um part with
>512KB cache (for better or worse, Intel basically prices CPUs by
cache-size).
--david
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 23:23 ` Bill Davidsen
@ 2003-02-24 3:31 ` Gerrit Huizenga
2003-02-24 4:02 ` Larry McVoy
0 siblings, 1 reply; 124+ messages in thread
From: Gerrit Huizenga @ 2003-02-24 3:31 UTC (permalink / raw)
To: Bill Davidsen; +Cc: lse-tech, Linux Kernel Mailing List
On Sun, 23 Feb 2003 18:23:01 EST, Bill Davidsen wrote:
> On Sat, 22 Feb 2003, Gerrit Huizenga wrote:
>
> > On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote:
> > > I think people overestimate the number of large boxes badly. Several IDE
> > > pre-patches didn't work on highmem boxes. It took *ages* for people to
> > > actually notice there was a problem. The desktop world is still 128-256Mb
> >
> > IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB
> > is a fun toy, but bigger than *I* need, even for development purposes.
> > But I don't think EMC, Clariion (low end EMC), Shark, etc. have any
> > IDE products for my 8-proc 16 GB machine... And running pre-patches in
> > a production environment that might expose this would be a little
> > silly as well.
>
> I don't disagree with most of your point, however there certainly are
> legitimate uses for big boxes with small (IDE) disk. Those which first
> come to mind are all computational problems, in which a small dataset is
> read from disk and then processors beat on the data. More or less common
> examples are graphics transformations (original and final data
> compressed), engineering calculations such as finite element analysis,
> rendering (raytracing) type calculations, and data analysis (things like
> setiathome or automated medical image analysis).
Yeah and as Christoph pointed out, a lot of big machines have IDE
based CD-ROMs. And, there *are* some IDE disk subsystems with 1 TB
on an IDE bus and such, but there just aren't enough IDE busses or PCI
slots on most big machines to span out to the really high disk capacities
or large numbers of spindles. But some of the compute engines could
either be net-booted (no local disk) or have a cheap, small disk for
boot, small static storage (couple hundred GB range) etc. But most
people don't connect big machines to IDE drive subsystems.
gerrit
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 23:59 ` David Mosberger
@ 2003-02-24 3:49 ` Gerrit Huizenga
2003-02-24 4:07 ` David Mosberger
0 siblings, 1 reply; 124+ messages in thread
From: Gerrit Huizenga @ 2003-02-24 3:49 UTC (permalink / raw)
To: davidm; +Cc: Martin J. Bligh, Linus Torvalds, linux-kernel
On Sun, 23 Feb 2003 15:59:12 PST, David Mosberger wrote:
> >>>>> On Sun, 23 Feb 2003 15:06:56 -0800, "Martin J. Bligh" <mbligh@aracnet.com> said:
> Martin> Got anything more real-world than SPECint type
> Martin> microbenchmarks?
>
> SPECint a microbenchmark? You seem to be redefining the meaning of
> the word (last time I checked, lmbench was a microbenchmark).
>
> Ironically, Itanium 2 seems to do even better in the "real world" than
> suggested by benchmarks, partly because of the large caches, memory
> bandwidth and, I'm guessing, partly because of its straightforward
> micro-architecture (e.g., a synchronization operation takes on the
> order of 10 cycles, as compared to the order of dozens or hundreds of
> cycles on the Pentium 4).
Two major types of high end workloads here (and IA64 is definitely
still in the "high end" category). There are the scientific and
technical style workloads, which SPECcpu (of which CINT and CFP are
the integer and floating point subsets) might reasonably categorize,
and some of the "system" workloads, such as those roughly categorized
by things like TPC-C/H/W/etc, or SPECweb/jbb/jvm/jAppServer which
exercise some more complex, multi-tier interactions.
I haven't seen anything recently on the higher level System benchmarks
for IA64 - I'm not sure that anyone is doing much that is significant
in this space, where IA32 results practically saturate the overall
reported results.
I know SGI is generally more interested in the scientific and
technical area. I would assume that HP would be more interested
in the broader system deployment, except that too much activity in
that area might endanger parisc sales. IBM is doing some stuff in
the IA64 space, but more in IA32 and obviously PPC64. That leaves
NEC and a few others that I don't know about. It may be that IA64
isn't really ready for the system level stuff or that it competes
with too many entrenched platforms to make it economically viable.
But, I would be really interested in seeing anything other than
"scientific and technical" based benchmarks for IA64. I don't think
there is much out there. That implies that nobody is interested in
IA64 or that it doesn't perform "competitively" in that space...
gerrit
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 3:31 ` Gerrit Huizenga
@ 2003-02-24 4:02 ` Larry McVoy
2003-02-24 4:15 ` Russell Leighton
` (2 more replies)
0 siblings, 3 replies; 124+ messages in thread
From: Larry McVoy @ 2003-02-24 4:02 UTC (permalink / raw)
To: Gerrit Huizenga; +Cc: Bill Davidsen, lse-tech, Linux Kernel Mailing List
On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
> But most
> people don't connect big machines to IDE drive subsystems.
3ware controllers. They look like SCSI to the host, but use cheap IDE
drives on the back end. Really nice cards. bkbits.net runs on one.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 3:49 ` Gerrit Huizenga
@ 2003-02-24 4:07 ` David Mosberger
2003-02-24 4:34 ` Martin J. Bligh
2003-02-24 5:02 ` Gerrit Huizenga
0 siblings, 2 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-24 4:07 UTC (permalink / raw)
To: Gerrit Huizenga; +Cc: davidm, Martin J. Bligh, Linus Torvalds, linux-kernel
>>>>> On Sun, 23 Feb 2003 19:49:38 -0800, Gerrit Huizenga <gh@us.ibm.com> said:
Gerrit> I haven't seen anything recently on the higher level System benchmarks
Gerrit> for IA64
Did you miss the TPC-C announcement from last November & December?
rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).
Both world-records for 4-way machines when they were announced (not
sure if that's still true).
--david
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 4:02 ` Larry McVoy
@ 2003-02-24 4:15 ` Russell Leighton
2003-02-24 5:11 ` William Lee Irwin III
2003-02-24 8:07 ` Christoph Hellwig
2 siblings, 0 replies; 124+ messages in thread
From: Russell Leighton @ 2003-02-24 4:15 UTC (permalink / raw)
To: Larry McVoy
Cc: Gerrit Huizenga, Bill Davidsen, lse-tech,
Linux Kernel Mailing List
Yup.
Great price and super price/performance.
Gotta luv it.
Larry McVoy wrote:
>On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
>
>>But most
>>people don't connect big machines to IDE drive subsystems.
>>
>
>3ware controllers. They look like SCSI to the host, but use cheap IDE
>drives on the back end. Really nice cards. bkbits.net runs on one.
>
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 4:07 ` David Mosberger
@ 2003-02-24 4:34 ` Martin J. Bligh
2003-02-24 5:02 ` Gerrit Huizenga
1 sibling, 0 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-24 4:34 UTC (permalink / raw)
To: davidm, Gerrit Huizenga; +Cc: Linus Torvalds, linux-kernel
> Gerrit> I haven't seen anything recently on the higher level System
> Gerrit> benchmarks for IA64
>
> Did you miss the TPC-C announcement from last November & December?
>
> rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
> rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).
>
> Both world-records for 4-way machines when they were announced (not
> sure if that's still true).
Cool - thanks. That's more what I was looking for.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 4:07 ` David Mosberger
2003-02-24 4:34 ` Martin J. Bligh
@ 2003-02-24 5:02 ` Gerrit Huizenga
1 sibling, 0 replies; 124+ messages in thread
From: Gerrit Huizenga @ 2003-02-24 5:02 UTC (permalink / raw)
To: davidm; +Cc: Martin J. Bligh, Linus Torvalds, linux-kernel
On Sun, 23 Feb 2003 20:07:43 PST, David Mosberger wrote:
> >>> On Sun, 23 Feb 2003 19:49:38 -0800, Gerrit Huizenga <gh@us.ibm.com> said:
>
> Gerrit> I haven't seen anything recently on the higher level System benchmarks
> Gerrit> for IA64
>
> Did you miss the TPC-C announcement from last November & December?
>
> rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
> rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).
>
> Both world-records for 4-way machines when they were announced (not
> sure if that's still true).
Yeah, I missed that. And my spot checking didn't catch anything IA64
related. Was there anything else on IA64 that competed with the current
rack of 8-way IA32 boxen, or the upcoming 16-way stuff rolling out
this year? Seems like the larger phys memory support should help on
several of those benchmarks...
The thin number of IA64 results indicates the difference in marketing/sales,
although better price/performance should be able to change that... ;)
Odd that MS is still outdoing Linux (or SQL is outdoing Oracle on Linux).
Will be nice when that changes...
gerrit
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 4:02 ` Larry McVoy
2003-02-24 4:15 ` Russell Leighton
@ 2003-02-24 5:11 ` William Lee Irwin III
2003-02-24 8:07 ` Christoph Hellwig
2 siblings, 0 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-24 5:11 UTC (permalink / raw)
To: Larry McVoy, Gerrit Huizenga, Bill Davidsen, lse-tech,
Linux Kernel Mailing List
On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
>> But most people don't connect big machines to IDE drive subsystems.
>
On Sun, Feb 23, 2003 at 08:02:46PM -0800, Larry McVoy wrote:
> 3ware controllers. They look like SCSI to the host, but use cheap IDE
> drives on the back end. Really nice cards. bkbits.net runs on one.
A quick back-of-the-napkin calculation guesstimates that this 3ware stuff
would max at 6 racks of disks on NUMA-Q or 3/8 of a rack per node
(ignoring cabling, which looks infeasible, but never mind that), which
is a smaller capacity than I remember FC having. NUMA-Q's a bit
optimistic for 3ware because it has buttloads of PCI slots in
comparison to more modern machines.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 4:02 ` Larry McVoy
2003-02-24 4:15 ` Russell Leighton
2003-02-24 5:11 ` William Lee Irwin III
@ 2003-02-24 8:07 ` Christoph Hellwig
2 siblings, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2003-02-24 8:07 UTC (permalink / raw)
To: Larry McVoy, Gerrit Huizenga, Bill Davidsen, lse-tech,
Linux Kernel Mailing List
On Sun, Feb 23, 2003 at 08:02:46PM -0800, Larry McVoy wrote:
> On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote:
> > But most
> > people don't connect big machines to IDE drive subsystems.
>
> 3ware controllers. They look like SCSI to the host, but use cheap IDE
> drives on the back end. Really nice cards. bkbits.net runs on one.
That's true (similar for some nice scsi2ide external raid boxens), but Alan's
original argument was about the Linux IDE driver on big machines, which is used
by neither.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-22 16:13 ` Larry McVoy
2003-02-22 16:29 ` Martin J. Bligh
@ 2003-02-24 18:00 ` Timothy D. Witham
1 sibling, 0 replies; 124+ messages in thread
From: Timothy D. Witham @ 2003-02-24 18:00 UTC (permalink / raw)
To: Larry McVoy; +Cc: Martin J. Bligh, David S. Miller, lse-tech, linux-kernel
On Sat, 2003-02-22 at 08:13, Larry McVoy wrote:
> On Sat, Feb 22, 2003 at 07:47:53AM -0800, Martin J. Bligh wrote:
> > >> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> > >> > $500M/quarter in profit.
> > >>
> > >> While I understand these numbers are on the mark, there is a tertiary
> > >> issue to realize.
> > >>
> > >> Dell makes money on many things other than thin-margin PCs. And lo'
> > >> and behold one of those things is selling the larger Intel based
> > >> servers and support contracts to go along with that.
> > >
> > > I did some digging trying to find that ratio before I posted last night
> > > and couldn't. You obviously think that the servers are a significant
> > > part of their business. I'd be surprised at that, but that's cool,
> > > what are the numbers? PC's, monitors, disks, laptops, anything with less
> > > than 4 cpus is in the little bucket, so how much revenue does Dell generate
> > > on the 4 CPU and larger servers?
> >
> > It's not a question of revenue, it's one of profit. Very few people buy
> > desktops for use with Linux, compared to those that buy them for Windows.
> > The profit on each PC is small, thus I still think a substantial proportion
> > of the profit made by hardware vendors by Linux is on servers rather than
> > desktop PCs. The numbers will be smaller for high end machines, but the
> > profit margins are much higher.
>
> That's all handwaving and has no meaning without numbers. I could care less
> if Dell has 99.99% margins on their servers, if they only sell $50M of servers
> a quarter that is still less than 10% of their quarterly profit.
>
> So what are the actual *numbers*? Your point makes sense if and only if
> people sell lots of server. I spent a few minutes in google: world wide
> server sales are $40B at the moment. The overwhelming majority of that
> revenue is small servers. Let's say that Dell has 20% of that market,
> that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
> you long long odds that that is 90% of their revenue in the server space.
> Supposing that's right, that's $200M/quarter in big iron sales. Out of
> $8000M/quarter.
>
The numbers that I have seen are covered under an NDA so I can't put
them out, but an important point to note is that while there is a very
sharp decrease in the number of servers sold as you go higher up into
the price bands, the total $ in revenue is hourglass-shaped, with
the neck being in the price band that corresponds to a 4-way server.
The total $ spent on the highest band of servers is about equal
to the total $ spent on the lowest price band of servers. But the
margins for the high end are much better than the margins for the
lowest band.
> I'd love to see data which is different than this but you'll have a tough
> time finding it. More and more companies are looking at the cost of
> big iron and deciding it doesn't make sense to spend $20K/CPU when they
> could be spending $1K/CPU. Look at Google, try selling them some big
> iron. Look at Wall Street - abandoning big iron as fast as they can.
Oh, you can see it, it will just cost you about $50,000 to get the
survey from the company that spends all the money putting it together.
On the size of the system, every system should be as big as it needs
to be. Some problems partition nicely, like Google but other ones
do not, like accounts receivable. It all seems to come down to the
question, "Does the data _naturally_ partition?" If it does then
you should either use lots of small servers or a s/390 type solution
with lots of instances. However, if the data doesn't naturally partition,
you should use one large machine, as you will spend more money on
people trying to manage the servers than you would have spent initially
on the hardware.
Also you need to look at the backend systems in places like Wall
Street, those are big machines, have been for a long time and
aren't changing out. But it doesn't make a good story.
Tim
--
Timothy D. Witham <wookie@osdl.org>
Open Source Development Lab, Inc
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-24 2:54 ` Linus Torvalds
2003-02-24 3:08 ` David Mosberger
@ 2003-02-24 21:42 ` Andrea Arcangeli
1 sibling, 0 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-24 21:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David Lang, linux-kernel
On Sun, Feb 23, 2003 at 06:54:41PM -0800, Linus Torvalds wrote:
>
> On Sun, 23 Feb 2003, David Mosberger wrote:
> > >> 2 GHz Xeon: 701 SPECint
> > >> 1 GHz Itanium 2: 810 SPECint
> >
> > >> That is, Itanium 2 is 15% faster.
> >
> > Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> > we can do some educated guessing:
> >
> > 1GHz Itanium 2, 3MB cache: 810 SPECint
> > 900MHz Itanium 2, 1.5MB cache: 674 SPECint
> >
> > Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> > around 750 SPECint. In reality, it would get slightly less, but most
> > likely substantially more than 701.
>
> And as Dean pointed out:
>
> 2Ghz Xeon MP with 2MB L3 cache: 842 SPECint
>
> In other words, the P4 eats the Itanium for breakfast even if you limit it
> to 2GHz due to some "process" rule.
>
> And if you don't make up any silly rules, but simply look at "what's
> available today", you get
>
> 2.8Ghz Xeon MP with 2MB L3 cache: 907 SPECint
>
> or even better (much cheaper CPUs):
>
> 3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
> AMD Athlon XP 2800+: 933 SPECint
>
> These are systems that you can buy today. With _less_ cache, and clearly
> much higher performance (the difference between the best-performing
> published ia-64 and the best P4 on specint, the P4 is 32% faster. Even
> with the "you can only run the P4 at 2GHz because that is all it ever ran
> at in 0.18" thing the ia-64 falls behind.
I agree, especially the cache difference makes any comparison not
interesting to my eyes (it's similar to running dbench with different
pagecache sizes and comparing the results). But I've a side note on
these matters in favour of the 64bit platforms. I could be wrong, but
AFAIK some of the specint testcases generate double the data memory
footprint if compiled 64bit, so I guess some of the testcases should
really be called speclong and not specint. (however I don't think those
testcases alone can explain a global 32% difference, but still there
would be some difference in favour of the 32bit platform)
So in short, I currently believe specint is not a good benchmark to
compare a 64bit cpu to a 32bit cpu, 64bit can only lose in specint if
the cpu is exactly the same but only the data 'longs' are changed to
64bit. To do a real fair comparison one should first change the source
replacing every "long" with either a "long long" or an "int", only then
it will be fair to compare specint results between 32bit and 64bit cpus.
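Andrea's point is just about sizeof(long) under the two ABIs; a minimal sketch of the footprint difference (the array size is chosen arbitrarily):

```python
def data_footprint(n_longs: int, sizeof_long: int) -> int:
    """Bytes occupied by an array of n C 'long' values."""
    return n_longs * sizeof_long

n = 10_000_000
ilp32 = data_footprint(n, 4)  # 32-bit ABI: sizeof(long) == 4
lp64 = data_footprint(n, 8)   # 64-bit ABI: sizeof(long) == 8

# Same source, same element count: the 64bit build touches twice
# the data, so cache and memory-bus pressure roughly doubles.
print(lp64 // ilp32)
```

Rewriting the hot `long` arrays as `int` (or as `long long` on both sides) is exactly the source change Andrea suggests to equalize the data footprint.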
I never used specint myself, so don't ask me more details on this, and
again I could be wrong, but really - if I'm right - somebody should go
over the source and make a kind of unofficial (but official) patch
available to people to generate a specint testsuite usable to compare
32bit with 64bit results, or lots of effort will be wasted by people
pretending to do the impossible. I mean, if the memory bus is the same
hardware in both the 32bit and 64bit runs, the double memory footprint
will run slower and there's nothing the OS or the hardware can do about
it (and dozen mbytes of ram won't fit in l1 cache, not even on the
itanium 8). The benchmark suite really must be fixed to ensure the 32bit
and 64bit compilation will generate the same _data_ memory footprint if
one wants to make comparisons between the two.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-23 3:24 ` Andrew Morton
2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh
@ 2003-02-25 17:17 ` Andrea Arcangeli
2003-02-25 17:43 ` William Lee Irwin III
1 sibling, 1 reply; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 17:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: Hanna Linder, lse-tech, linux-kernel
On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
> 2.4.21-pre4: 8.10 seconds
> 2.5.62-mm3 with objrmap: 9.95 seconds (+1.85)
> 2.5.62-mm3 without objrmap: 10.86 seconds (+0.91)
>
> Current 2.5 is 2.76 seconds slower, and this patch reclaims 0.91 of those
> seconds.
>
>
> So who stole the remaining 1.85 seconds? Looks like pte_highmem.
would you mind adding the line for 2.4.21-pre4aa3? it has pte-highmem so
you can easily find out for sure whether it is pte_highmem that stole >10%
of your fast cpu. A line for the 2.4-rmap patch would also be
interesting.
> Note one second spent in pte_alloc_one().
note the seconds spent in the rmap affected paths too.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli
@ 2003-02-25 17:43 ` William Lee Irwin III
2003-02-25 17:59 ` Andrea Arcangeli
0 siblings, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 17:43 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel
On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
>> So who stole the remaining 1.85 seconds? Looks like pte_highmem.
On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> would you mind to add the line for 2.4.21-pre4aa3? it has pte-highmem so
> you can easily find it out for sure if it is pte_highmem that stole >10%
> of your fast cpu. A line for the 2.4-rmap patch would be also
> interesting.
On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
>> Note one second spent in pte_alloc_one().
On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> note the seconds spent in the rmap affected paths too.
The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
bitblitting hit for pagetables.
I didn't catch the whole profile, so I'll need numbers for rmap paths.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 17:43 ` William Lee Irwin III
@ 2003-02-25 17:59 ` Andrea Arcangeli
2003-02-25 18:04 ` William Lee Irwin III
2003-02-25 18:50 ` William Lee Irwin III
0 siblings, 2 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 17:59 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
> On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
> >> So who stole the remaining 1.85 seconds? Looks like pte_highmem.
>
> On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> > would you mind to add the line for 2.4.21-pre4aa3? it has pte-highmem so
> > you can easily find it out for sure if it is pte_highmem that stole >10%
> > of your fast cpu. A line for the 2.4-rmap patch would be also
> > interesting.
>
> On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote:
> >> Note one second spent in pte_alloc_one().
>
> On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote:
> > note the seconds spent in the rmap affected paths too.
>
> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
> bitblitting hit for pagetables.
I'm talking about do_anonymous_page, do_wp_page, do_no_page, fork, and
all the other places that introduce per-page spinlocks and allocations
of two pieces of ram rather than just one (and in turn potentially
global spinlocks too if the cpu caches are empty). Just grep for
pte_chain_alloc or page_add_rmap in mm/memory.c; that's what I mean. I'm
not talking about pagetables.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 17:59 ` Andrea Arcangeli
@ 2003-02-25 18:04 ` William Lee Irwin III
2003-02-25 18:50 ` William Lee Irwin III
1 sibling, 0 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 18:04 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel
On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
>> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
>> bitblitting hit for pagetables.
On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote:
> I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all
> the other places that introduces spinlocks (per-page) and allocations of
> 2 pieces of ram rather than just 1 (and in turn potentially global
> spinlocks too if the cpu-caches are empty). Just grep for
> pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm
> not talking about pagetables.
Well, pte_alloc_one() has a clear explanation.
The fact that the rmap accounting is not free is not news.
For anonymous pages, performing the analogous vma-based lookup as with
Dave McCracken's patch for file-backed pages would require a
significant rework of anonymous page accounting.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 17:59 ` Andrea Arcangeli
2003-02-25 18:04 ` William Lee Irwin III
@ 2003-02-25 18:50 ` William Lee Irwin III
2003-02-25 19:18 ` Andrea Arcangeli
1 sibling, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 18:50 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel
On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
>> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
>> bitblitting hit for pagetables.
On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote:
> I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all
> the other places that introduces spinlocks (per-page) and allocations of
> 2 pieces of ram rather than just 1 (and in turn potentially global
> spinlocks too if the cpu-caches are empty). Just grep for
> pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm
> not talking about pagetables.
Okay, fished out the profiles (w/Dave's optimization):
00000000 total 158601 0.0869
c0106ed8 poll_idle 99878 1189.0238
c01172e0 do_page_fault 8788 7.7496
c013adb4 do_wp_page 6712 8.4322
c013f70c page_remove_rmap 3132 6.2640
c0139eac copy_page_range 2994 3.5643
c013f5c0 page_add_rmap 2776 8.3614
c013a1f4 zap_pte_range 2616 4.8806
c0137240 release_pages 1828 6.4366
c0108d14 system_call 1116 25.3636
c013ba00 handle_mm_fault 1098 4.6525
c015b59c d_lookup 1096 3.2619
c013b788 do_no_page 1044 1.6519
c013b56c do_anonymous_page 954 1.7667
c011718c pte_alloc_one 910 6.5000
c0139ba0 clear_page_tables 841 2.4735
c011450c flush_tlb_page 725 6.4732
c0207130 __copy_to_user_ll 687 6.6058
c01333dc free_hot_cold_page 641 2.7629
c013042c find_get_page 601 10.7321
Just taking the exception dwarfs anything written in C.
page_add_rmap() absorbs hits from all of the fault routines and
copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range().
do_wp_page() is huge because it's doing bitblitting in-line.
These things aren't cheap with or without rmap. Trimming down
accounting overhead could raise search problems elsewhere.
Whether avoiding the search problem is worth the accounting overhead
could probably use some more investigation, like actually trying the
anonymous page handling rework needed to use vma-based ptov resolution.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 18:50 ` William Lee Irwin III
@ 2003-02-25 19:18 ` Andrea Arcangeli
2003-02-25 19:27 ` Martin J. Bligh
2003-02-25 20:10 ` William Lee Irwin III
0 siblings, 2 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 19:18 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
> On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote:
> >> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the
> >> bitblitting hit for pagetables.
>
> On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote:
> > I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all
> > the other places that introduce spinlocks (per-page) and allocations of
> > 2 pieces of ram rather than just 1 (and in turn potentially global
> > spinlocks too if the cpu-caches are empty). Just grep for
> > pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm
> > not talking about pagetables.
>
> Okay, fished out the profiles (w/Dave's optimization):
>
> 00000000 total 158601 0.0869
> c0106ed8 poll_idle 99878 1189.0238
> c01172e0 do_page_fault 8788 7.7496
> c013adb4 do_wp_page 6712 8.4322
> c013f70c page_remove_rmap 3132 6.2640
> c0139eac copy_page_range 2994 3.5643
> c013f5c0 page_add_rmap 2776 8.3614
> c013a1f4 zap_pte_range 2616 4.8806
> c0137240 release_pages 1828 6.4366
> c0108d14 system_call 1116 25.3636
> c013ba00 handle_mm_fault 1098 4.6525
> c015b59c d_lookup 1096 3.2619
> c013b788 do_no_page 1044 1.6519
> c013b56c do_anonymous_page 954 1.7667
> c011718c pte_alloc_one 910 6.5000
> c0139ba0 clear_page_tables 841 2.4735
> c011450c flush_tlb_page 725 6.4732
> c0207130 __copy_to_user_ll 687 6.6058
> c01333dc free_hot_cold_page 641 2.7629
> c013042c find_get_page 601 10.7321
>
> Just taking the exception dwarfs anything written in C.
>
> page_add_rmap() absorbs hits from all of the fault routines and
> copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range().
> do_wp_page() is huge because it's doing bitblitting in-line.
"absorbing" is a nice word for it. The way I see it, page_add_rmap and
page_remove_rmap are even more expensive than the pagtable zapping.
They're even more expensive than copy_page_range. Also focus on the
numbers on the right that are even more interesting to find what is
worth to optimize away first IMHO
>
> These things aren't cheap with or without rmap. Trimming down
Lots of things aren't cheap, but this isn't a good reason to make them
twice as expensive, especially when they were as cheap as possible and
they're critical hot paths.
> accounting overhead could raise search problems elsewhere.
This is the point indeed, but at least in 2.4 I don't see any
cpu-saving advantage during swapping, because during swapping the cpu
is always idle anyway.
In fact I had to drop lru_cache_add from the anonymous page fault
path too, because it was wasting way too much cpu to get peak
performance (of course you're using per-page spinlocks by hand with
rmap, and lru_cache_add needs a global spinlock, so at least rmap
shouldn't introduce a very big scalability issue, unlike lru_cache_add).
> Whether avoiding the search problem is worth the accounting overhead
> could probably use some more investigation, like actually trying the
> anonymous page handling rework needed to use vma-based ptov resolution.
The only solution is to do rmap lazily, i.e. to start building the rmap
during swapping by walking the pagetables, basically exactly like, in
my recent 2.4 tree, I refill the lru with anonymous pages only after I
start to need that information. So if you never need to page out
several giga of ram heavily (like most of the very high end numa
servers), you'll never waste a single cycle on locking or whatever
other worthless accounting overhead that hurts the performance of all
common workloads.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 19:18 ` Andrea Arcangeli
@ 2003-02-25 19:27 ` Martin J. Bligh
2003-02-25 20:30 ` Andrea Arcangeli
2003-02-25 20:10 ` William Lee Irwin III
1 sibling, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-25 19:27 UTC (permalink / raw)
To: Andrea Arcangeli, William Lee Irwin III, Andrew Morton,
Hanna Linder, lse-tech, linux-kernel
> the only solution is to do rmap lazily, i.e. to start building the rmap
> during swapping by walking the pagetables, basically exactly like I
> refill the lru with anonymous pages only after I start to need this
> information recently in my 2.4 tree, so if you never need to pageout
> heavily several giga of ram (like most of very high end numa servers),
> you'll never waste a single cycle in locking or whatever other worthless
> accounting overhead that hurts performance of all common workloads
Did you see the partially object-based rmap stuff? I think that does
very close to what you want already.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 19:18 ` Andrea Arcangeli
2003-02-25 19:27 ` Martin J. Bligh
@ 2003-02-25 20:10 ` William Lee Irwin III
2003-02-25 20:23 ` Andrea Arcangeli
1 sibling, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 20:10 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel
On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> Just taking the exception dwarfs anything written in C.
>> page_add_rmap() absorbs hits from all of the fault routines and
>> copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range().
>> do_wp_page() is huge because it's doing bitblitting in-line.
On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> "absorbing" is a nice word for it. The way I see it, page_add_rmap and
> page_remove_rmap are even more expensive than the pagetable zapping.
> They're even more expensive than copy_page_range. Also focus on the
> numbers on the right that are even more interesting to find what is
> worth to optimize away first IMHO
Those just divide the number of hits by the size of the function IIRC,
which is useless for some codepath spinning hard in the middle of a
large function or in the presence of over-inlining. It's also greatly
disturbed by spinlock section hackery (as are most profilers).
On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> These things aren't cheap with or without rmap. Trimming down
On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> lots of things aren't cheap, but this isn't a good reason to make them
> twice more expensive, especially if they were as cheap as possible and
> they're critical hot paths.
They weren't as cheap as possible and it's a bad idea to make them so.
SVR4 proved there are limits to the usefulness of lazy evaluation wrt.
pagetable copying and the like.
You're also looking at sampling hits, not end-to-end timings.
After all these disclaimers, trimming down cpu cost is a good idea.
On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> accounting overhead could raise search problems elsewhere.
On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> this is the point indeed, but at least in 2.4 I don't see any cpu saving
> advantage during swapping because during swapping the cpu is always idle
> anyways.
It's probably not swapping that matters, but high turnover of clean data.
No one can really make a concrete assertion without some implementations
of the alternatives, which is why I think they need to be done soon.
Once one or more are there we're set. I'm personally in favor of the
anonymous handling rework as the alternative to pursue, since that
actually retains the locality of reference as opposed to wild pagetable
scanning over random processes, which is highly unpredictable with
respect to locality and even worse with respect to cpu consumption.
On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> In fact I had to drop the lru_cache_add too from the anonymous page fault
> path because it was wasting way too much cpu to get peak performance (of
> course you're using per-page spinlocks by hand with rmap, and
> lru_cache_add needs a global spinlock, so at least rmap shouldn't
> introduce very big scalability issue unlike the lru_cache_add)
The high arrival rates to LRU lists in do_anonymous_page() etc. were
dealt with by the pagevec batching infrastructure in 2.5.x, which is
the primary method by which pagemap_lru_lock contention was addressed.
The "breakup" so to speak is primarily for locality of reference.
Which reminds me, my node-local pgdat allocation patch is pending...
On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> Whether avoiding the search problem is worth the accounting overhead
>> could probably use some more investigation, like actually trying the
>> anonymous page handling rework needed to use vma-based ptov resolution.
On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> the only solution is to do rmap lazily, i.e. to start building the rmap
> during swapping by walking the pagetables, basically exactly like I
> refill the lru with anonymous pages only after I start to need this
> information recently in my 2.4 tree, so if you never need to pageout
> heavily several giga of ram (like most of very high end numa servers),
> you'll never waste a single cycle in locking or whatever other worthless
> accounting overhead that hurts performance of all common workloads
I'd just bite the bullet and do the anonymous rework. Building
pte_chains lazily raises the issue of needing to allocate in order to
free, which is relatively thorny. Maintaining any level of accuracy of
the things with lazy buildup is also problematic. That and the whole
space issue wrt. pte_chains is blown away by the anonymous rework,
which is a significant advantage.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 20:10 ` William Lee Irwin III
@ 2003-02-25 20:23 ` Andrea Arcangeli
2003-02-25 20:46 ` William Lee Irwin III
0 siblings, 1 reply; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 20:23 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
> I'd just bite the bullet and do the anonymous rework. Building
> pte_chains lazily raises the issue of needing to allocate in order to
Note that there is no need to allocate in order to free.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 19:27 ` Martin J. Bligh
@ 2003-02-25 20:30 ` Andrea Arcangeli
2003-02-25 20:53 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 20:30 UTC (permalink / raw)
To: Martin J. Bligh
Cc: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
On Tue, Feb 25, 2003 at 11:27:40AM -0800, Martin J. Bligh wrote:
> > the only solution is to do rmap lazily, i.e. to start building the rmap
> > during swapping by walking the pagetables, basically exactly like I
> > refill the lru with anonymous pages only after I start to need this
> > information recently in my 2.4 tree, so if you never need to pageout
> > heavily several giga of ram (like most of very high end numa servers),
> > you'll never waste a single cycle in locking or whatever other worthless
> > accounting overhead that hurts performance of all common workloads
>
> Did you see the partially object-based rmap stuff? I think that does
> very close to what you want already.
I don't see how it can optimize away the overhead but I didn't look at
it for long.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 20:23 ` Andrea Arcangeli
@ 2003-02-25 20:46 ` William Lee Irwin III
2003-02-25 20:52 ` Andrea Arcangeli
0 siblings, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 20:46 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel
On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
>> I'd just bite the bullet and do the anonymous rework. Building
>> pte_chains lazily raises the issue of needing to allocate in order to
On Tue, Feb 25, 2003 at 09:23:35PM +0100, Andrea Arcangeli wrote:
> note that there is no need of allocate to free.
I've no longer got any idea what you're talking about, then.
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 20:46 ` William Lee Irwin III
@ 2003-02-25 20:52 ` Andrea Arcangeli
0 siblings, 0 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 20:52 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
On Tue, Feb 25, 2003 at 12:46:16PM -0800, William Lee Irwin III wrote:
> On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
> >> I'd just bite the bullet and do the anonymous rework. Building
> >> pte_chains lazily raises the issue of needing to allocate in order to
>
> On Tue, Feb 25, 2003 at 09:23:35PM +0100, Andrea Arcangeli wrote:
> > note that there is no need of allocate to free.
>
> I've no longer got any idea what you're talking about, then.
Were we able to release memory w/o rmap? Yes.
Can we do it again? Yes.
Can we use a bit of the released memory to release further memory more
efficiently with rmap? Yes.
I'm not saying it's easy to implement, but the problem that we'll
need memory in order to release memory doesn't exist, since it also never
existed before rmap was introduced into the kernel. Sure, the early stage
of the swapping would be more cpu-intensive, but that is the feature.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 20:30 ` Andrea Arcangeli
@ 2003-02-25 20:53 ` Martin J. Bligh
2003-02-25 21:17 ` Andrea Arcangeli
0 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-25 20:53 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
>> > the only solution is to do rmap lazily, i.e. to start building the rmap
>> > during swapping by walking the pagetables, basically exactly like I
>> > refill the lru with anonymous pages only after I start to need this
>> > information recently in my 2.4 tree, so if you never need to pageout
>> > heavily several giga of ram (like most of very high end numa servers),
>> > you'll never waste a single cycle in locking or whatever other
>> > worthless accounting overhead that hurts performance of all common
>> > workloads
>>
>> Did you see the partially object-based rmap stuff? I think that does
>> very close to what you want already.
>
> I don't see how it can optimize away the overhead but I didn't look at
> it for long.
Because you don't set up and tear down the rmap pte-chains for every
fault in / delete of any page ... it just works off the vmas.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 21:17 ` Andrea Arcangeli
@ 2003-02-25 21:12 ` Martin J. Bligh
2003-02-25 22:16 ` Andrea Arcangeli
2003-02-25 21:26 ` William Lee Irwin III
1 sibling, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-25 21:12 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel
>> Because you don't set up and tear down the rmap pte-chains for every
>> fault in / delete of any page ... it just works off the vmas.
>
> so basically it uses the rmap that we always had since at least 2.2 for
> everything but anon mappings, right? this is what DaveM did a few years
> back too. This makes lots of sense to me, so at least we avoid the
> duplication of rmap information, even if it won't fix the anonymous page
> overhead, but clearly it's much lower cost for everything but anonymous
> pages.
Right ... and anonymous chains are about 95% single-reference (at least
for the case I looked at), so they're direct-mapped from the struct page
with no chain at all. That cuts out something like 95% of the space
overhead of pte-chains, and 65% of the time (for a kernel compile -j256
on a 16x system). However, it's going to be a little more expensive to
*use* the mappings, so we need to measure that carefully.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 20:53 ` Martin J. Bligh
@ 2003-02-25 21:17 ` Andrea Arcangeli
2003-02-25 21:12 ` Martin J. Bligh
2003-02-25 21:26 ` William Lee Irwin III
0 siblings, 2 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 21:17 UTC (permalink / raw)
To: Martin J. Bligh
Cc: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
On Tue, Feb 25, 2003 at 12:53:44PM -0800, Martin J. Bligh wrote:
> >> > the only solution is to do rmap lazily, i.e. to start building the rmap
> >> > during swapping by walking the pagetables, basically exactly like I
> >> > refill the lru with anonymous pages only after I start to need this
> >> > information recently in my 2.4 tree, so if you never need to pageout
> >> > heavily several giga of ram (like most of very high end numa servers),
> >> > you'll never waste a single cycle in locking or whatever other
> >> > worthless accounting overhead that hurts performance of all common
> >> > workloads
> >>
> >> Did you see the partially object-based rmap stuff? I think that does
> >> very close to what you want already.
> >
> > I don't see how it can optimize away the overhead but I didn't look at
> > it for long.
>
> Because you don't set up and tear down the rmap pte-chains for every
> fault in / delete of any page ... it just works off the vmas.
So basically it uses the rmap that we always had since at least 2.2 for
everything but anon mappings, right? This is what DaveM did a few years
back too. It makes lots of sense to me: at least we avoid duplicating
the rmap information, and even if it won't fix the anonymous page
overhead, it's clearly much lower cost for everything but anonymous
pages.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 21:17 ` Andrea Arcangeli
2003-02-25 21:12 ` Martin J. Bligh
@ 2003-02-25 21:26 ` William Lee Irwin III
2003-02-25 22:18 ` Andrea Arcangeli
2003-02-26 5:24 ` Rik van Riel
1 sibling, 2 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 21:26 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Martin J. Bligh, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
On Tue, Feb 25, 2003 at 12:53:44PM -0800, Martin J. Bligh wrote:
>> Because you don't set up and tear down the rmap pte-chains for every
>> fault in / delete of any page ... it just works off the vmas.
On Tue, Feb 25, 2003 at 10:17:18PM +0100, Andrea Arcangeli wrote:
> so basically it uses the rmap that we always had since at least 2.2 for
> everything but anon mappings, right? this is what DaveM did a few years
> back too. This makes lots of sense to me, so at least we avoid the
> duplication of rmap information, even if it won't fix the anonymous page
> overhead, but clearly it's much lower cost for everything but anonymous
> pages.
This is what the "anonymous rework" is about. There is already a fix
extant for the file-backed case, which I presumed you knew of already,
and so we were speaking of issues with the anonymous case.
My impression thus far is that the anonymous case has not been pressing
with respect to space consumption or cpu time once the file-backed code
is in place, though if it resurfaces as a serious concern the anonymous
rework can be pursued (along with other things).
-- wli
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 21:12 ` Martin J. Bligh
@ 2003-02-25 22:16 ` Andrea Arcangeli
2003-02-25 22:17 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 22:16 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel
On Tue, Feb 25, 2003 at 01:12:55PM -0800, Martin J. Bligh wrote:
> >> Because you don't set up and tear down the rmap pte-chains for every
> >> fault in / delete of any page ... it just works off the vmas.
> >
> > so basically it uses the rmap that we always had since at least 2.2 for
> > everything but anon mappings, right? this is what DaveM did a few years
> > back too. This makes lots of sense to me, so at least we avoid the
> > duplication of rmap information, even if it won't fix the anonymous page
> > overhead, but clearly it's much lower cost for everything but anonymous
> > pages.
>
> Right ... and anonymous chains are about 95% single-reference (at least for
> the case I looked at), so they're direct mapped from the struct page with
> no chain at all. Cuts out something like 95% of the space overhead of
> pte-chains, and 65% of the time (for kernel compile -j256 on 16x system).
> However, it's going to be a little more expensive to *use* the mappings,
> so we need to measure that carefully.
Sure, it is more expensive to use them, but all we care about is
complexity, and they solve the complexity problem just fine, so I
definitely prefer it. Cpu utilization during heavy swapping isn't a big
deal IMHO
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 22:16 ` Andrea Arcangeli
@ 2003-02-25 22:17 ` Martin J. Bligh
2003-02-25 22:37 ` Andrea Arcangeli
0 siblings, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-25 22:17 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel
> Sure, it is more expensive to use them, but all we care about is
> complexity, and they solve the complexity problem just fine, so I
> definitely prefer it. Cpu utilization during heavy swapping isn't a big
> deal IMHO
I totally agree with you. However, the concerns others raised were over
page aging and page stealing (e.g. from pagecache), which might not involve
disk but would also be slower. It probably needs some tuning and tweaking,
but I'm pretty sure it's fundamentally the right approach.
M.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 21:26 ` William Lee Irwin III
@ 2003-02-25 22:18 ` Andrea Arcangeli
2003-02-26 5:24 ` Rik van Riel
1 sibling, 0 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 22:18 UTC (permalink / raw)
To: William Lee Irwin III, Martin J. Bligh, Andrew Morton,
Hanna Linder, lse-tech, linux-kernel
On Tue, Feb 25, 2003 at 01:26:35PM -0800, William Lee Irwin III wrote:
> My impression thus far is that the anonymous case has not been pressing
> with respect to space consumption or cpu time once the file-backed code
> is in place, though if it resurfaces as a serious concern the anonymous
> rework can be pursued (along with other things).
sounds good to me ;)
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 22:17 ` Martin J. Bligh
@ 2003-02-25 22:37 ` Andrea Arcangeli
0 siblings, 0 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 22:37 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel
On Tue, Feb 25, 2003 at 02:17:48PM -0800, Martin J. Bligh wrote:
> > Sure, it is more expensive to use them, but all we care about is
> > complexity, and they solve the complexity problem just fine, so I
> > definitely prefer it. Cpu utilization during heavy swapping isn't a big
> > deal IMHO
>
> I totally agree with you. However the concerns others raised were over
> page aging and page stealing (eg from pagecache), which might not involve
> disk, but would also be slower. It probably need some tuning and tweaking,
> but I'm pretty sure it's fundamentally the right approach.
There's no slowdown at all when we don't need to unmap anything. We
just need to avoid watching the pte young bit in the pagetables unless
we're about to start unmapping stuff. Most machines won't reach the
point where they need to start unmapping stuff. Watching the ptes during
normal pagecache recycling would be wasteful anyway, regardless of which
chain we take to reach the pte.
Andrea
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
2003-02-25 21:26 ` William Lee Irwin III
2003-02-25 22:18 ` Andrea Arcangeli
@ 2003-02-26 5:24 ` Rik van Riel
2003-02-26 5:38 ` William Lee Irwin III
1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2003-02-26 5:24 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Andrea Arcangeli, Martin J. Bligh, Andrew Morton, Hanna Linder,
lse-tech, linux-kernel
On Tue, 25 Feb 2003, William Lee Irwin III wrote:
> My impression thus far is that the anonymous case has not been pressing
> with respect to space consumption or cpu time once the file-backed code
> is in place, though if it resurfaces as a serious concern the anonymous
> rework can be pursued (along with other things).
... but making the anonymous pages use an object based
scheme probably will make things too expensive.
IIRC the object based reverse map patches by bcrl and
davem both failed on the complexities needed to deal
with anonymous pages.
My instinct is that a hybrid system will work well in
most cases and the worst case with mapped files won't
be too bad.
cheers,
Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/
* Re: Minutes from Feb 21 LSE Call
2003-02-26 5:24 ` Rik van Riel
@ 2003-02-26 5:38 ` William Lee Irwin III
2003-02-26 6:01 ` Martin J. Bligh
0 siblings, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-26 5:38 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Martin J. Bligh, Andrew Morton, Hanna Linder,
lse-tech, linux-kernel
On Tue, 25 Feb 2003, William Lee Irwin III wrote:
>> My impression thus far is that the anonymous case has not been pressing
>> with respect to space consumption or cpu time once the file-backed code
>> is in place, though if it resurfaces as a serious concern the anonymous
>> rework can be pursued (along with other things).
On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote:
> ... but making the anonymous pages use an object based
> scheme probably will make things too expensive.
> IIRC the object based reverse map patches by bcrl and
> davem both failed on the complexities needed to deal
> with anonymous pages.
> My instinct is that a hybrid system will work well in
> most cases and the worst case with mapped files won't
> be too bad.
The boxen I'm supposed to babysit need a high degree of resource
consciousness wrt. lowmem allocations, so there is a clear voice
on this issue. IMHO it's still an open question as to whether this
is efficient for replacement concerns, which may yet favor objects.
-- wli
* Re: Minutes from Feb 21 LSE Call
2003-02-26 5:38 ` William Lee Irwin III
@ 2003-02-26 6:01 ` Martin J. Bligh
2003-02-26 6:14 ` William Lee Irwin III
2003-02-26 16:02 ` Rik van Riel
0 siblings, 2 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-26 6:01 UTC (permalink / raw)
To: William Lee Irwin III, Rik van Riel
Cc: Andrea Arcangeli, Andrew Morton, Hanna Linder, lse-tech,
linux-kernel
>>> My impression thus far is that the anonymous case has not been pressing
>>> with respect to space consumption or cpu time once the file-backed code
>>> is in place, though if it resurfaces as a serious concern the anonymous
>>> rework can be pursued (along with other things).
>
> On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote:
>> ... but making the anonymous pages use an object based
>> scheme probably will make things too expensive.
>> IIRC the object based reverse map patches by bcrl and
>> davem both failed on the complexities needed to deal
>> with anonymous pages.
>> My instinct is that a hybrid system will work well in
>> most cases and the worst case with mapped files won't
>> be too bad.
>
> The boxen I'm supposed to babysit need a high degree of resource
> consciousness wrt. lowmem allocations, so there is a clear voice
It seemed, at least on the simple kernel compile tests that I did, that all
the long chains are not anonymous. It killed 95% of the space issue, which
given the simplicity of the patch was pretty damned stunning. Yes, there's
a pointer per page I guess we could kill in the struct page itself, but I
think you already have a better method for killing mem_map bloat ;-)
M.
* Re: Minutes from Feb 21 LSE Call
2003-02-26 6:01 ` Martin J. Bligh
@ 2003-02-26 6:14 ` William Lee Irwin III
2003-02-26 6:32 ` William Lee Irwin III
2003-02-26 16:02 ` Rik van Riel
1 sibling, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-26 6:14 UTC (permalink / raw)
To: Martin J. Bligh
Cc: Rik van Riel, Andrea Arcangeli, Andrew Morton, Hanna Linder,
lse-tech, linux-kernel
At some point in the past, I wrote:
>> The boxen I'm supposed to babysit need a high degree of resource
>> consciousness wrt. lowmem allocations, so there is a clear voice
On Tue, Feb 25, 2003 at 10:01:20PM -0800, Martin J. Bligh wrote:
> It seemed, at least on the simple kernel compile tests that I did, that all
> the long chains are not anonymous. It killed 95% of the space issue, which
> given the simplicity of the patch was pretty damned stunning. Yes, there's
> a pointer per page I guess we could kill in the struct page itself, but I
> think you already have a better method for killing mem_map bloat ;-)
I'm not going to get up in arms about this unless there's a serious
performance issue about to get smacked down and I want a say in how it
gets smacked down. aa is happy with the file-backed stuff, so I'm not
pressing it (much) further.
And yes, page clustering is certainly on its way and fast. I'm getting
very close to the point where a general announcement will be in order.
There's basically "one last big bug" and two bits of gross suboptimality
I want to clean up before bringing the world to bear on it.
-- wli
* Re: Minutes from Feb 21 LSE Call
2003-02-26 6:14 ` William Lee Irwin III
@ 2003-02-26 6:32 ` William Lee Irwin III
0 siblings, 0 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-26 6:32 UTC (permalink / raw)
To: Martin J. Bligh, Rik van Riel, Andrea Arcangeli, Andrew Morton,
Hanna Linder, lse-tech, linux-kernel
On Tue, Feb 25, 2003 at 10:01:20PM -0800, Martin J. Bligh wrote:
>> It seemed, at least on the simple kernel compile tests that I did, that all
>> the long chains are not anonymous. It killed 95% of the space issue, which
>> given the simplicity of the patch was pretty damned stunning. Yes, there's
>> a pointer per page I guess we could kill in the struct page itself, but I
>> think you already have a better method for killing mem_map bloat ;-)
On Tue, Feb 25, 2003 at 10:14:40PM -0800, William Lee Irwin III wrote:
> I'm not going to get up in arms about this unless there's a serious
> performance issue that's going to get smacked down that I want to have
> a say in how it gets smacked down. aa is happy with the filebacked
> stuff, so I'm not pressing it (much) further.
> And yes, page clustering is certainly on its way and fast. I'm getting
> very close to the point where a general announcement will be in order.
> There's basically "one last big bug" and two bits of gross suboptimality
> I want to clean up before bringing the world to bear on it.
Screw it. Here it comes, ready or not. hch, I hope you were right...
-- wli
* Re: Minutes from Feb 21 LSE Call
2003-02-23 19:13 ` David Mosberger
2003-02-23 23:28 ` Benjamin LaHaise
@ 2003-02-26 8:46 ` Eric W. Biederman
1 sibling, 0 replies; 124+ messages in thread
From: Eric W. Biederman @ 2003-02-26 8:46 UTC (permalink / raw)
To: davidm
Cc: David Lang, Gerrit Huizenga, Benjamin LaHaise,
William Lee Irwin III, Jeff Garzik, linux-kernel
David Mosberger <davidm@napali.hpl.hp.com> writes:
> >>>>> On Sun, 23 Feb 2003 00:07:50 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
>
> David.L> Gerrit, you missed the prior poster's point. IA64 has the
> David.L> same fundamental problem as the Alpha, PPC, and Sparc
> David.L> processors: it doesn't run x86 binaries.
>
> This simply isn't true. Itanium and Itanium 2 have full x86 hardware
> built into the chip (for better or worse ;-). The speed isn't as good
> as the fastest x86 chips today, but it's faster (~300MHz P6) than the
> PCs many of us are using and it certainly meets my needs better than
> any other x86 "emulation" I have used in the past (which includes
> FX!32 and its relatives for Alpha).
I have various random x86 binaries that do not work.
My 32bit x86 user space does not run.
A 32bit kernel doesn't have a chance.
So for me at least the 32bit support is not useful in avoiding
converting binaries. For the handful of apps that cannot be
recompiled I suspect the support is good enough so you can get them
to run somehow.
Eric
* Re: Minutes from Feb 21 LSE Call
2003-02-26 6:01 ` Martin J. Bligh
2003-02-26 6:14 ` William Lee Irwin III
@ 2003-02-26 16:02 ` Rik van Riel
2003-02-27 3:48 ` Daniel Phillips
1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2003-02-26 16:02 UTC (permalink / raw)
To: Martin J. Bligh
Cc: William Lee Irwin III, Andrea Arcangeli, Andrew Morton,
Hanna Linder, lse-tech, linux-kernel
On Tue, 25 Feb 2003, Martin J. Bligh wrote:
> > On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote:
> >> ... but making the anonymous pages use an object based
> >> scheme probably will make things too expensive.
> >> My instinct is that a hybrid system will work well in
[snip] "wli wrote something"
> It seemed, at least on the simple kernel compile tests that I did, that
> all the long chains are not anonymous. It killed 95% of the space issue,
> which given the simplicity of the patch was pretty damned stunning. Yes,
> there's a pointer per page I guess we could kill in the struct page
> itself, but I think you already have a better method for killing mem_map
> bloat ;-)
Also, with copy-on-write and mremap after fork, an object based
rmap scheme for anonymous pages is almost certainly far too
complex to be worth it; it simply has too many issues. Just read
the patches by bcrl and davem, things get hairy fast.
The pte chain rmap scheme is clean, but suffers from too much
overhead for file mappings.
As shown by Dave's patch, a hybrid system really is simple and
clean, and it removes most of the pte chain overhead while still
keeping the code nice and efficient.
I think this hybrid system is the way to go, possibly with a few
more tweaks left and right...
regards,
Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/
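The hybrid split Rik argues for — object-based pte lookup for file-backed pages, exact pte chains kept only for anonymous pages — might dispatch roughly as below. The struct layout and names are illustrative assumptions, not the actual 2.5 kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the hybrid rmap scheme: file-backed pages find
 * their ptes by walking the vmas hanging off the owning object (the
 * address_space), so they need no per-pte chain; anonymous pages keep a
 * direct pte chain.  Names are illustrative only. */

enum rmap_kind { RMAP_OBJECT, RMAP_PTE_CHAIN };

struct page {
    void *mapping;        /* non-NULL: file-backed, owned by an object */
    void *pte_chain;      /* non-NULL: anonymous, exact pte chain */
};

static enum rmap_kind rmap_method(const struct page *pg)
{
    /* File-backed: derive the ptes from the object, no chain overhead. */
    if (pg->mapping)
        return RMAP_OBJECT;
    /* Anonymous: fall back to the exact pte chain. */
    return RMAP_PTE_CHAIN;
}
```

Since long chains in practice belong to widely shared file mappings, this one branch is where most of the pte-chain space goes away.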
* Re: Minutes from Feb 21 LSE Call
2003-02-26 16:02 ` Rik van Riel
@ 2003-02-27 3:48 ` Daniel Phillips
0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2003-02-27 3:48 UTC (permalink / raw)
To: Rik van Riel, Martin J. Bligh
Cc: William Lee Irwin III, Andrea Arcangeli, Andrew Morton,
Hanna Linder, lse-tech, linux-kernel
On Wednesday 26 February 2003 17:02, Rik van Riel wrote:
> On Tue, 25 Feb 2003, Martin J. Bligh wrote:
> > It seemed, at least on the simple kernel compile tests that I did, that
> > all the long chains are not anonymous. It killed 95% of the space issue,
> > which given the simplicity of the patch was pretty damned stunning. Yes,
> > there's a pointer per page I guess we could kill in the struct page
> > itself, but I think you already have a better method for killing mem_map
> > bloat ;-)
>
> Also, with copy-on-write and mremap after fork, doing an
> object based rmap scheme for anonymous pages is just complex,
> almost certainly far too complex to be worth it, since it just
> has too many issues. Just read the patches by bcrl and davem,
> things get hairy fast.
>
> The pte chain rmap scheme is clean, but suffers from too much
> overhead for file mappings.
There is a lot of redundancy in the rmap chains that could be exploited. If
a pte page happens to reference a group of (say) 32 anon pages, then you can
set each anon page's page->index to its position in the group and let a
pte_chain node point at the pte of the first page of the group. You can then
find each page's pte by adding its page->index to the pte_chain node's pte
pointer. This allows a single rmap chain to be shared by all the pages in
the group.
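The pointer arithmetic behind this shared-chain idea can be sketched as follows; the types and names are hypothetical stand-ins, not the real pte_chain structures:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the sharing trick: one pte_chain node stores the
 * pte of the first page in a group of (say) 32 anon pages; each page's
 * page->index records its offset within the group, so its pte is simply
 * base + index.  One chain node thus serves all 32 pages. */

typedef uint64_t pte_t;

struct pte_chain_node { pte_t *base_pte; };  /* pte of the group's first page */
struct page           { unsigned index; };   /* offset within the group */

static pte_t *page_to_pte(const struct pte_chain_node *node,
                          const struct page *pg)
{
    /* Recover this page's pte from the shared node by pointer arithmetic. */
    return node->base_pte + pg->index;
}
```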
This much of the idea is simple, however there are some tricky details to
take care of. How does a copy-on-write break out one page of the group from
one of the pte pages? I tried putting a (32 bit) bitmap in each pte_chain
node to indicate which pte entries actually belong to the group, and that
wasn't too bad except for doubling the per-link memory usage, turning a best
case 32x gain into only 16x. It's probably better to break the group up,
creating log2(groupsize) new chains. (This can be avoided in the common case
that you already know every page in the group is going to be copied, as with
a copy_from_user.) Getting rid of the bitmaps makes the single-page case the
same as the current arrangement and makes it easy to let the size of a group
be as large as the capacity of a whole pte page.
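Daniel's 32-bit-bitmap experiment — each shared node tracks which slots still belong to the group, and a copy-on-write fault just clears one bit before giving that page a private entry — could look like this toy sketch (hypothetical names, not a real patch):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the bitmap variant: each shared pte_chain node
 * carries a 32-bit mask of which slots still belong to the group.  A
 * copy-on-write fault on one page clears its bit (the page then gets a
 * private chain entry elsewhere).  The extra word per node is what turns
 * the best-case 32x space win into only 16x. */

struct grouped_node {
    uint32_t members;           /* bit i set: page i still shares this node */
};

/* Break page 'idx' out of the group on COW; returns how many remain. */
static int cow_break(struct grouped_node *node, unsigned idx)
{
    node->members &= ~(1u << idx);
    return __builtin_popcount(node->members);   /* gcc/clang builtin */
}
```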
There's also the problem of detecting groupable clusters of pages, e.g., in
do_anon_page. Swap-out and swap-in introduce more messiness, as does mremap.
In the end, I decided it's not needed in the current cycle, but probably
worth investigating later.
My purpose in bringing it up now is to show that there are still some more
incremental gains to be had without needing radical surgery.
> As shown by Dave's patch, a hybrid system really is simple and
> clean, and it removes most of the pte chain overhead while still
> keeping the code nice and efficient.
>
> I think this hybrid system is the way to go, possibly with a few
> more tweaks left and right...
Emphatically, yes.
Regards,
Daniel
Thread overview: 124+ messages
2003-02-21 23:48 Minutes from Feb 21 LSE Call Hanna Linder
2003-02-22 0:16 ` Larry McVoy
2003-02-22 0:25 ` William Lee Irwin III
2003-02-22 2:24 ` Steven Cole
2003-02-22 0:44 ` Martin J. Bligh
2003-02-22 2:47 ` Larry McVoy
2003-02-22 4:32 ` Martin J. Bligh
2003-02-22 5:05 ` Larry McVoy
2003-02-22 6:39 ` Martin J. Bligh
2003-02-22 8:38 ` Jeff Garzik
2003-02-22 22:18 ` William Lee Irwin III
2003-02-23 0:50 ` Martin J. Bligh
2003-02-23 11:22 ` Magnus Danielson
2003-02-23 19:54 ` Eric W. Biederman
2003-02-23 1:17 ` Benjamin LaHaise
2003-02-23 5:21 ` Gerrit Huizenga
2003-02-23 8:07 ` David Lang
2003-02-23 8:20 ` William Lee Irwin III
2003-02-23 19:17 ` Linus Torvalds
2003-02-23 19:29 ` David Mosberger
2003-02-23 20:13 ` Martin J. Bligh
2003-02-23 22:01 ` David Mosberger
2003-02-23 22:12 ` Martin J. Bligh
2003-02-23 21:34 ` Linus Torvalds
2003-02-23 22:40 ` David Mosberger
2003-02-23 22:48 ` David Lang
2003-02-23 22:54 ` David Mosberger
2003-02-23 22:56 ` David Lang
2003-02-24 0:40 ` Linus Torvalds
2003-02-24 2:32 ` David Mosberger
2003-02-24 2:54 ` Linus Torvalds
2003-02-24 3:08 ` David Mosberger
2003-02-24 21:42 ` Andrea Arcangeli
2003-02-24 1:06 ` dean gaudet
2003-02-24 1:56 ` David Mosberger
2003-02-24 2:15 ` dean gaudet
2003-02-24 3:11 ` David Mosberger
2003-02-23 23:06 ` Martin J. Bligh
2003-02-23 23:59 ` David Mosberger
2003-02-24 3:49 ` Gerrit Huizenga
2003-02-24 4:07 ` David Mosberger
2003-02-24 4:34 ` Martin J. Bligh
2003-02-24 5:02 ` Gerrit Huizenga
2003-02-23 20:21 ` Xavier Bestel
2003-02-23 20:50 ` Martin J. Bligh
2003-02-23 23:57 ` Alan Cox
2003-02-24 1:26 ` Kenneth Johansson
2003-02-24 1:53 ` dean gaudet
2003-02-23 21:35 ` Alan Cox
2003-02-23 21:41 ` Linus Torvalds
2003-02-24 0:01 ` Bill Davidsen
2003-02-24 0:36 ` yodaiken
2003-02-23 21:15 ` John Bradford
2003-02-23 21:45 ` Linus Torvalds
2003-02-24 1:25 ` Benjamin LaHaise
2003-02-23 21:55 ` William Lee Irwin III
2003-02-23 19:13 ` David Mosberger
2003-02-23 23:28 ` Benjamin LaHaise
2003-02-26 8:46 ` Eric W. Biederman
2003-02-23 20:48 ` Gerrit Huizenga
2003-02-23 9:37 ` William Lee Irwin III
2003-02-22 8:38 ` David S. Miller
2003-02-22 8:38 ` David S. Miller
2003-02-22 14:34 ` Larry McVoy
2003-02-22 15:47 ` Martin J. Bligh
2003-02-22 16:13 ` Larry McVoy
2003-02-22 16:29 ` Martin J. Bligh
2003-02-22 16:33 ` Larry McVoy
2003-02-22 16:39 ` Martin J. Bligh
2003-02-22 16:59 ` John Bradford
2003-02-24 18:00 ` Timothy D. Witham
2003-02-22 8:32 ` David S. Miller
2003-02-22 18:20 ` Alan Cox
2003-02-22 20:05 ` William Lee Irwin III
2003-02-22 21:35 ` Alan Cox
2003-02-22 21:36 ` Gerrit Huizenga
2003-02-22 21:42 ` Christoph Hellwig
2003-02-23 23:23 ` Bill Davidsen
2003-02-24 3:31 ` Gerrit Huizenga
2003-02-24 4:02 ` Larry McVoy
2003-02-24 4:15 ` Russell Leighton
2003-02-24 5:11 ` William Lee Irwin III
2003-02-24 8:07 ` Christoph Hellwig
2003-02-23 0:37 ` Eric W. Biederman
2003-02-23 0:42 ` Eric W. Biederman
2003-02-23 14:29 ` Rik van Riel
2003-02-23 17:28 ` Eric W. Biederman
2003-02-24 1:42 ` Benjamin LaHaise
2003-02-23 3:24 ` Andrew Morton
2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh
2003-02-23 19:20 ` Linus Torvalds
2003-02-23 20:16 ` Martin J. Bligh
2003-02-23 21:37 ` Linus Torvalds
2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh
2003-02-23 22:10 ` William Lee Irwin III
2003-02-24 0:31 ` Linus Torvalds
2003-02-24 3:07 ` Martin J. Bligh
2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli
2003-02-25 17:43 ` William Lee Irwin III
2003-02-25 17:59 ` Andrea Arcangeli
2003-02-25 18:04 ` William Lee Irwin III
2003-02-25 18:50 ` William Lee Irwin III
2003-02-25 19:18 ` Andrea Arcangeli
2003-02-25 19:27 ` Martin J. Bligh
2003-02-25 20:30 ` Andrea Arcangeli
2003-02-25 20:53 ` Martin J. Bligh
2003-02-25 21:17 ` Andrea Arcangeli
2003-02-25 21:12 ` Martin J. Bligh
2003-02-25 22:16 ` Andrea Arcangeli
2003-02-25 22:17 ` Martin J. Bligh
2003-02-25 22:37 ` Andrea Arcangeli
2003-02-25 21:26 ` William Lee Irwin III
2003-02-25 22:18 ` Andrea Arcangeli
2003-02-26 5:24 ` Rik van Riel
2003-02-26 5:38 ` William Lee Irwin III
2003-02-26 6:01 ` Martin J. Bligh
2003-02-26 6:14 ` William Lee Irwin III
2003-02-26 6:32 ` William Lee Irwin III
2003-02-26 16:02 ` Rik van Riel
2003-02-27 3:48 ` Daniel Phillips
2003-02-25 20:10 ` William Lee Irwin III
2003-02-25 20:23 ` Andrea Arcangeli
2003-02-25 20:46 ` William Lee Irwin III
2003-02-25 20:52 ` Andrea Arcangeli