* Minutes from Feb 21 LSE Call
@ 2003-02-21 23:48 Hanna Linder
From: Hanna Linder @ 2003-02-21 23:48 UTC (permalink / raw)
To: lse-tech; +Cc: linux-kernel
LSE Con Call Minutes from Feb 21
Minutes compiled by Hanna Linder (hannal@us.ibm.com); please post
corrections to lse-tech@lists.sf.net.
Object Based Reverse Mapping:
(Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga)
Dave coded up an initial patch for partial object based rmap
which he sent to linux-mm yesterday. Rik pointed out there is a scalability
problem with the full object based approach. However, a hybrid approach
between regular rmap and object based may not be too radical for the
2.5/2.6 timeframe.
Ben said none of the users have been complaining about
performance with the existing rmap. Martin disagreed and said that he,
Linus, and Andrew Morton have all agreed there is a problem.
One of the problems Martin is already hitting on high-CPU-count machines
with large memory is the space consumed by all the pte-chains filling up
memory and killing the machine. There is also a performance impact from
maintaining the chains.
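For a rough sense of scale, here is a back-of-envelope sketch of why the pte-chain space cost bites on big shared-memory workloads. All sizes below (per-entry cost, region size, process count) are illustrative assumptions, not measured kernel numbers:

```python
# Rough model of rmap pte-chain overhead: classic rmap keeps one chain
# entry per pte that maps a page, so a region shared by many processes
# multiplies the metadata. Assumed sizes, for illustration only.

PAGE_SIZE = 4096
ENTRY_BYTES = 8                    # assumed cost of one pte_chain entry
region_bytes = 2 * 1024**3         # a 2 GB shared region (e.g. database cache)
processes = 1000                   # processes that have the region mapped

pages = region_bytes // PAGE_SIZE             # 524288 pages
chain_bytes = pages * processes * ENTRY_BYTES
print(chain_bytes)                            # 4194304000 bytes
print(round(chain_bytes / 1024**3, 2))        # ~3.91 GB of chain metadata alone
```

Under these assumptions the bookkeeping is nearly twice the size of the shared region itself, which is the "filling up memory and killing the machine" effect described above.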
Ben said they shouldn't be using fork; bash is the
main user of fork and should be changed to use clone instead.
Gerrit said bash is not used as much as Ben might think on
these large systems running real world applications.
Ben said he doesn't see the large-systems problems with
the users he talks to and doesn't agree the full object based rmap
is needed. Gerrit explained we have very complex workloads running on
very large systems and we are already hitting the space consumption
problem, which is a blocker for running Linux on them.
Ben said none of the distros are supporting these large
systems right now. Martin said UL is already starting to support
them. The conversation then degraded into a distro discussion, and
Hanna asked them to bring it back to the technical side.
To demonstrate the problem with object based rmap, you have to
add VM pressure to existing benchmarks and see what happens. Martin
agreed to run multiple benchmarks on the same systems to simulate this.
Cliff White of the OSDL offered to help Martin with this.
At the end Ben said the solution for now needs to be
a hybrid with existing rmap. Martin, Rik, and Dave all agreed with Ben.
Then we all agreed to move on to other things.
*ActionItem - someone needs to change bash to use clone instead of fork.
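The fork-versus-clone distinction matters because clone(CLONE_VM) shares the parent's address space instead of duplicating it, so the mm (and its rmap chains) is not replicated per child. As a rough analogy only, using a Python thread to stand in for CLONE_VM (this is not what a patched bash would actually do):

```python
import os
import threading

value = {"x": 0}

# fork(): the child gets its own copy of the address space,
# so its write is invisible to the parent.
pid = os.fork()
if pid == 0:
    value["x"] = 42
    os._exit(0)
os.waitpid(pid, 0)
after_fork = value["x"]      # still 0 in the parent

# A thread shares the address space, as clone(CLONE_VM) would,
# so the same write is visible.
t = threading.Thread(target=lambda: value.update(x=42))
t.start()
t.join()
after_thread = value["x"]    # 42

print(after_fork, after_thread)
```

Each fork()ed shell duplicates the page tables of everything the parent has mapped; a CLONE_VM child adds nothing, which is why the fork-heavy bash pattern aggravates the pte-chain cost.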
Scheduler Hang as discovered by restarting a large Web application
multiple times:
Rick Lindsley / Hanna Linder
We were seeing a hard hang after restarting a large web
serving application 3-6 times on the 2.5.59 (and up) kernels
(also seen as far back as 2.5.44). It was mainly caused when two
threads each have interrupts disabled and one is spinning on a lock that
the other is holding. The one holding the lock has sent an IPI to all
the other processors telling them to flush their TLBs. But the one
waiting for the spinlock has interrupts turned off and does not receive
that IPI request. So they both sit there waiting forever.
The final fix will be in kernel.org mainline kernel version 2.5.63.
Here are the individual patches which should apply with fuzz to
older kernel versions:
http://linux.bkbits.net:8080/linux-2.5/cset@1.1005?nav=index.html
http://linux.bkbits.net:8080/linux-2.5/cset@1.1004?nav=index.html
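The shape of the hang can be sketched with a toy model. This is plain Python and purely illustrative: the "services the IPI while spinning" branch stands in for the general idea of the fix, not for the actual 2.5.63 patch logic:

```python
# Toy model (not kernel code) of the hang: CPU A holds a spinlock and
# waits for its TLB-flush IPI to be acknowledged; CPU B spins on that
# lock with interrupts disabled, so it never sees the IPI.

def run(spin_loop_services_ipi, max_steps=1000):
    lock_held_by_a = True
    ipi_acked_by_b = False
    for _ in range(max_steps):
        # CPU A: will not release the lock until B acks the flush IPI.
        if ipi_acked_by_b:
            lock_held_by_a = False
        # CPU B: spinning for the lock, interrupts off.
        if not lock_held_by_a:
            return "progress"
        if spin_loop_services_ipi:
            ipi_acked_by_b = True   # hypothetical fix: poll for pending IPIs while spinning
    return "hang"

print(run(False))  # hang: each CPU waits on the other forever
print(run(True))   # progress: B services the flush, A releases the lock
```

With interrupts off and no polling, neither CPU can ever satisfy the other's wait condition, which matches the behavior described above.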
Shared Memory Binding:
Matt Dobson -
A new shared memory binding API: a way for an
application to bind shared memory to nodes. The motivation
is support for large databases that want more control
over their shared memory.
The current allocation scheme is that each process gets
a chunk of shared memory from the same node the process
is located on. Instead of page faulting around to different
nodes dynamically, this API will allow a process to specify
which node or set of nodes to bind the shared memory to.
Work in progress.
Martin - gcc 2.95 vs 3.2:
Martin has done some testing which indicates that gcc 3.2 produces
slightly worse code for the kernel than 2.95 and takes a bit
longer to do so. gcc 3.2 -Os produces larger code than gcc 2.95 -O2.
On his machines -O2 was faster than -Os, but on a CPU with smaller
caches the reverse may be true. More testing is needed.
* Re: Minutes from Feb 21 LSE Call
From: Larry McVoy @ 2003-02-22 0:16 UTC (permalink / raw)
To: Hanna Linder; +Cc: lse-tech, linux-kernel

> Ben said none of the distros are supporting these large
> systems right now. Martin said UL is already starting to support
> them.

Ben is right. I think IBM and the other big iron companies would be
far better served looking at what they have done with running multiple
instances of Linux on one big machine, like the 390 work. Figure out
how to use that model to scale up. There is simply not a big enough
market to justify shoveling lots of scaling stuff in for huge machines
that only a handful of people can afford. That's the same path which
has sunk all the workstation companies, they all have bloated OS's and
Linux runs circles around them.

In terms of the money and in terms of installed seats, the small Linux
machines outnumber the 4 or more CPU SMP machines easily 10,000:1.
And with the embedded market being one of the few real money makers
for Linux, there will be huge pushback from those companies against
changes which increase memory footprint.
--
Larry McVoy            lm at bitmover.com           http://www.bitmover.com/lm
* Re: Minutes from Feb 21 LSE Call
From: William Lee Irwin III @ 2003-02-22 0:25 UTC (permalink / raw)
To: Larry McVoy, Hanna Linder, lse-tech, linux-kernel

On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.

Scalability done properly should not degrade performance on smaller
machines, Pee Cees, or even microscopic organisms.

On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.

There's quite a bit of commonality with large x86 highmem there, as
the highmem crew is extremely concerned about the kernel's memory
footprint and is looking to trim kernel memory overhead from every
aspect of its operation they can. Reducing kernel memory footprint
is a crucial part of scalability, in both scaling down to the low end
and scaling up to highmem. =)

-- wli
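One concrete illustration of the footprint pressure on large x86 highmem boxes: the struct page array (mem_map) must live in permanently mapped lowmem. The per-struct-page size and lowmem figure below are rough assumptions for illustration; the real values vary by kernel version and config:

```python
# On 32-bit x86 the kernel has roughly 896 MB of permanently mapped
# lowmem, and mem_map (one struct page per physical page) must fit
# there. Assumed sizes, for illustration only.

PAGE_SIZE = 4096
STRUCT_PAGE = 40                  # assumed bytes per struct page
LOWMEM = 896 * 1024**2            # assumed lowmem budget

ram = 64 * 1024**3                # a maxed-out 64 GB PAE machine
pages = ram // PAGE_SIZE          # 16777216 physical pages
mem_map = pages * STRUCT_PAGE     # 671088640 bytes

print(mem_map // 1024**2)         # 640 MB of mem_map...
print(round(mem_map / LOWMEM, 2)) # ...roughly 0.71 of all lowmem
```

Under these assumptions the page bookkeeping alone eats most of lowmem before any other kernel allocation happens, which is why every byte of per-page metadata gets fought over.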
* Re: Minutes from Feb 21 LSE Call
From: Steven Cole @ 2003-02-22 2:24 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Larry McVoy, Hanna Linder, lse-tech, LKML

On Fri, 2003-02-21 at 17:25, William Lee Irwin III wrote:
> On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> > Ben is right. I think IBM and the other big iron companies would be
> > far better served looking at what they have done with running multiple
> > instances of Linux on one big machine, like the 390 work. Figure out
> > how to use that model to scale up. There is simply not a big enough
> > market to justify shoveling lots of scaling stuff in for huge machines
> > that only a handful of people can afford. That's the same path which
> > has sunk all the workstation companies, they all have bloated OS's and
> > Linux runs circles around them.

mjb> Unfortunately, as I've pointed out to you before, this doesn't work
mjb> in practice. Workloads may not be easily divisible amongst
mjb> machines, and you're just pushing all the complex problems out for
mjb> every userspace app to solve itself, instead of fixing it once in
mjb> the kernel.

Please permit an observer from the sidelines a few comments. I think
all four of you are right, for different reasons.

> Scalability done properly should not degrade performance on smaller
> machines, Pee Cees, or even microscopic organisms.

s/should/must/ in the above. That must be a guiding principle.

> On Fri, Feb 21, 2003 at 04:16:18PM -0800, Larry McVoy wrote:
> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> There's quite a bit of commonality with large x86 highmem there, as
> the highmem crew is extremely concerned about the kernel's memory
> footprint and is looking to trim kernel memory overhead from every
> aspect of its operation they can. Reducing kernel memory footprint
> is a crucial part of scalability, in both scaling down to the low end
> and scaling up to highmem. =)

Since the time between major releases of the kernel seems to be two to
three years now (counting to where the new kernel is really stable),
it is probably worthwhile to think about what high-end systems will be
like when 3.0 is expected. My guess is that a trend will be machines
with increasingly greater cpu counts with access to the same memory.
Why? Because if it can be done, it will be done. The ability to put
more cpus on a single chip may translate into a Moore's law of
increasing cpu counts per machine. And as Martin points out, the high
end machines are where the money is.

In my own unsophisticated opinion, Larry's concept of Cache Coherent
Clusters seems worth further development. And Martin is right about
the need for fixing it in the kernel, again IMHO. But how to fix it in
the kernel? Would something similar to OpenMosix or OpenSSI in a
future kernel be appropriate to get Larry's CCCluster members to
cooperate? Or is it possible to continue the scalability race when cpu
counts get to 256, 512, etc.

Just some thoughts from the sidelines.

Best regards,
Steven
* Re: Minutes from Feb 21 LSE Call
From: Martin J. Bligh @ 2003-02-22 0:44 UTC (permalink / raw)
To: Larry McVoy, Hanna Linder; +Cc: lse-tech, linux-kernel

> Ben is right. I think IBM and the other big iron companies would be
> far better served looking at what they have done with running multiple
> instances of Linux on one big machine, like the 390 work. Figure out
> how to use that model to scale up. There is simply not a big enough
> market to justify shoveling lots of scaling stuff in for huge machines
> that only a handful of people can afford. That's the same path which
> has sunk all the workstation companies, they all have bloated OS's and
> Linux runs circles around them.

In your humble opinion. Unfortunately, as I've pointed out to you
before, this doesn't work in practice. Workloads may not be easily
divisible amongst machines, and you're just pushing all the complex
problems out for every userspace app to solve itself, instead of
fixing it once in the kernel.

The fact that you were never able to do this before doesn't mean it's
impossible, it just means that you failed.

> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.

And the profit margin on the big machines will outpace the smaller
machines by a similar ratio, inverted. The high-end space is where most
of the money is made by the Linux distros, by selling products like
SLES or Advanced Server to people who can afford to pay for it.

M.
* Re: Minutes from Feb 21 LSE Call
From: Larry McVoy @ 2003-02-22 2:47 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Larry McVoy, Hanna Linder, lse-tech, linux-kernel

On Fri, Feb 21, 2003 at 04:44:13PM -0800, Martin J. Bligh wrote:
> > Ben is right. I think IBM and the other big iron companies would be
> > far better served looking at what they have done with running multiple
> > instances of Linux on one big machine, like the 390 work. Figure out
> > how to use that model to scale up. There is simply not a big enough
> > market to justify shoveling lots of scaling stuff in for huge machines
> > that only a handful of people can afford. That's the same path which
> > has sunk all the workstation companies, they all have bloated OS's and
> > Linux runs circles around them.
>
> In your humble opinion.

My opinion has nothing to do with it, go benchmark them and see for
yourself. I'm in a pretty good position to back up my statements with
data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
as a pile of others, so we have both the hardware and the software to
do the comparisons. I stand by my statement above and so does anyone
else who has done the measurements.

It is much much more pleasant to have Linux versus any other Unix
implementation on the same platform. Let's keep it that way.

> Unfortunately, as I've pointed out to you before, this doesn't work in
> practice. Workloads may not be easily divisible amongst machines, and
> you're just pushing all the complex problems out for every userspace
> app to solve itself, instead of fixing it once in the kernel.

"fixing it", huh? Your "fixes" may be great for your tiny segment of
the market but they are not going to be welcome if they turn Linux into
BloatOS 9.8.

> The fact that you were never able to do this before doesn't mean it's
> impossible, it just means that you failed.

Thanks for the vote of confidence. I think the thing to focus on,
however, is that *noone* has ever succeeded at what you are trying
to do. And there have been many, many attempts. Your opinion, it
would appear, is that you are smarter than all of the people in all
of those past failed attempts, but you'll forgive me if I'm not
impressed with your optimism.

> > In terms of the money and in terms of installed seats, the small Linux
> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
> > And with the embedded market being one of the few real money makers
> > for Linux, there will be huge pushback from those companies against
> > changes which increase memory footprint.
>
> And the profit margin on the big machines will outpace the smaller
> machines by a similar ratio, inverted.

Really? How about some figures? You'd need HUGE profit margins to
justify your position, how about some actual hard cold numbers?
--
Larry McVoy            lm at bitmover.com           http://www.bitmover.com/lm
* Re: Minutes from Feb 21 LSE Call
From: Martin J. Bligh @ 2003-02-22 4:32 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel

>> In your humble opinion.
>
> My opinion has nothing to do with it, go benchmark them and see for
> yourself.

Nope, I was referring to this:

>> > Ben is right. I think IBM and the other big iron companies would be
>> > far better served looking at what they have done with running multiple
>> > instances of Linux on one big machine, like the 390 work. Figure out
>> > how to use that model to scale up. There is simply not a big enough
>> > market to justify shoveling lots of scaling stuff in for huge machines
>> > that only a handful of people can afford.

Which I totally disagree with.

>> > That's the same path which
>> > has sunk all the workstation companies, they all have bloated OS's and
>> > Linux runs circles around them.

Not the fact that Linux is capable of stellar things, which I totally
agree with.

> I'm in a pretty good position to back up my statements with
> data, we support BitKeeper on AIX, Solaris, IRIX, HP-UX, Tru64, as well
> as a pile of others, so we have both the hardware and the software to
> do the comparisons. I stand by statement above and so does anyone else
> who has done the measurements.

Oh, I don't doubt it - but I'd be amused to see the measurements, if
you have them to hand.

> It is much much more pleasant to have Linux versus any other Unix
> implementation on the same platform. Let's keep it that way.

Absolutely.

>> Unfortunately, as I've pointed out to you before, this doesn't work in
>> practice. Workloads may not be easily divisible amongst machines, and
>> you're just pushing all the complex problems out for every userspace
>> app to solve itself, instead of fixing it once in the kernel.
>
> "fixing it", huh? Your "fixes" may be great for your tiny segment of
> the market but they are not going to be welcome if they turn Linux into
> BloatOS 9.8.

They won't - the maintainers would never allow us to do that.

>> The fact that you were never able to do this before doesn't mean it's
>> impossible, it just means that you failed.
>
> Thanks for the vote of confidence. I think the thing to focus on,
> however, is that *noone* has ever succeeded at what you are trying
> to do. And there have been many, many attempts. Your opinion, it
> would appear, is that you are smarter than all of the people in all
> of those past failed attempts, but you'll forgive me if I'm not
> impressed with your optimism.

Who said that I was going to single-handedly change the world? What's
different with Linux is the development model. That's why *we* will
succeed where others have failed before. There's some incredible
intellect all around Linux, but that's not all it takes, as you've
pointed out.

>> > In terms of the money and in terms of installed seats, the small Linux
>> > machines out number the 4 or more CPU SMP machines easily 10,000:1.
>> > And with the embedded market being one of the few real money makers
>> > for Linux, there will be huge pushback from those companies against
>> > changes which increase memory footprint.
>>
>> And the profit margin on the big machines will outpace the smaller
>> machines by a similar ratio, inverted.
>
> Really? How about some figures? You'd need HUGE profit margins to
> justify your position, how about some actual hard cold numbers?

I don't have them to hand, but if you think anyone's making money on
PCs nowadays, you're delusional (with respect to hardware). With
respect to Linux, what makes you think distros are going to make large
amounts of money from a freely replicatable OS, for tiny embedded
systems? Support for servers, on the other hand, is a different
game ...

M.
* Re: Minutes from Feb 21 LSE Call
From: Larry McVoy @ 2003-02-22 5:05 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Larry McVoy, Hanna Linder, lse-tech, linux-kernel

On Fri, Feb 21, 2003 at 08:32:30PM -0800, Martin J. Bligh wrote:
> > "fixing it", huh? Your "fixes" may be great for your tiny segment of
> > the market but they are not going to be welcome if they turn Linux into
> > BloatOS 9.8.
>
> They won't - the maintainers would never allow us to do that.

The path to hell is paved with good intentions.

> > Really? How about some figures? You'd need HUGE profit margins to
> > justify your position, how about some actual hard cold numbers?
>
> I don't have them to hand, but if you think anyone's making money on
> PCs nowadays, you're delusional (with respect to hardware).

Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
$500M/quarter in profit.

Lots of people working for companies who haven't figured out how to do
it as well as Dell *say* it can't be done but the numbers say
differently.
--
Larry McVoy            lm at bitmover.com           http://www.bitmover.com/lm
* Re: Minutes from Feb 21 LSE Call
From: Martin J. Bligh @ 2003-02-22 6:39 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel

>> I don't have them to hand, but if you think anyone's making money on
>> PCs nowadays, you're delusional (with respect to hardware).
>
> Let's see, Dell has a $66B market cap, revenues of $8B/quarter and
> $500M/quarter in profit.
>
> Lots of people working for companies who haven't figured out how to do
> it as well as Dell *say* it can't be done but numbers say differently.

And how much of that was profit on PCs running Linux?

M.
* Re: Minutes from Feb 21 LSE Call
From: Jeff Garzik @ 2003-02-22 8:38 UTC (permalink / raw)
To: linux-kernel

ia32 big iron. sigh. I think that's so unfortunate in a number
of ways, but the main reason, of course, is that highmem is evil :)

Intel can use PAE to "turn back the clock" on ia32. Although googling
doesn't support this speculation, I am willing to bet Intel will
eventually unveil a new PAE that busts the 64GB barrier -- instead of
trying harder to push consumers to 64-bit processors. Processor speed,
FSB speed, PCI bus bandwidth, all these are issues -- but ones that
pale in comparison to the long term effects of highmem on the market.

Enterprise customers will see this as a signal to continue building
around ia32 for the next few years, thoroughly damaging 64-bit
technology sales and development. I bet even IA64 suffers... at
Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
floating around The Register and various rumor web sites, but Intel
is gonna miss that huge profit opportunity too by trying to hack the
ia32 ISA to scale up to big iron -- where it doesn't belong.

Being cynical, one might guess that Intel will treat IA64 as a loss
leader until the other 64-bit competition dies, keeping ia32 at the
top end of the market via silly PAE/PSE hacks. When the existing
64-bit competition disappears, five years down the road, compilers
will have matured sufficiently to make using IA64 boxes feasible.

If you really want to scale, just go to 64-bits, darn it. Don't keep
hacking the ia32 ISA -- leave it alone, it's fine as it is, and will
live a nice long life as the future's preferred embedded platform.

64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
but who knows if AMD will survive competition with Intel. PPC64 is
the wild card in all this. I hope it succeeds.

	Jeff, feeling like a silly, random rant after a long drive

...and from a technical perspective, highmem grots up the code, too :)
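For reference, the 64GB barrier falls straight out of PAE's 36 physical address bits. A wider PAE of the kind Jeff speculates about would scale the same way; the 40-bit figure below is purely illustrative, not an announced Intel feature:

```python
# Physical address space reachable as a function of physical address bits.
def phys_limit_gib(bits):
    return 2**bits // 1024**3

print(phys_limit_gib(32))   # 4 GiB: plain ia32 physical addressing
print(phys_limit_gib(36))   # 64 GiB: PAE as shipped
print(phys_limit_gib(40))   # 1024 GiB: a hypothetical wider PAE
```

Note that none of this widens the 4GB *virtual* address space of a 32-bit process, which is why highmem gymnastics remain no matter how many physical bits get added.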
* Re: Minutes from Feb 21 LSE Call
From: William Lee Irwin III @ 2003-02-22 22:18 UTC (permalink / raw)
To: Jeff Garzik; +Cc: linux-kernel

On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> ia32 big iron. sigh. I think that's so unfortunately in a number
> of ways, but the main reason, of course, is that highmem is evil :)
> Intel can use PAE to "turn back the clock" on ia32. Although googling
> doesn't support this speculation, I am willing to bet Intel will
> eventually unveil a new PAE that busts the 64GB barrier -- instead of
> trying harder to push consumers to 64-bit processors. Processor speed,
> FSB speed, PCI bus bandwidth, all these are issues -- but ones that
> pale in comparison to the long term effects of highmem on the market.

PAE is a relatively minor insult compared to the FPU, the 50,000 psi
register pressure, variable-length instruction encoding with extremely
difficult to optimize for instruction decoder trickiness, the
nauseating bastardization of segmentation, the microscopic caches and
TLB's, the lack of TLB context tags, frankly bizarre and
just-barely-fixable gate nonsense, the interrupt controller, and ISA
DMA. I've got no idea why this particular system-level ugliness, which
is nothing more than a routine pitstop in any bring-your-own-barfbag
reading session of the x86 manuals, fascinates you so much. At any
rate, if systems (or any other) programming difficulties were any
concern at all, x86 wouldn't be used at all.

On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Enterprise customers will see this as a signal to continue building
> around ia32 for the next few years, thoroughly damaging 64-bit
> technology sales and development. I bet even IA64 suffers...
> at Intel's own hands. Rumors of a "Pentium64" at Intel are constantly
> floating around The Register and various rumor web sites, but Intel
> is gonna miss that huge profit opportunity too by trying to hack the
> ia32 ISA to scale up to big iron -- where it doesn't belong.

What power do you suppose we have to resist any of this? Intel, the
800lb gorilla, shoves what it wants where it wants to shove it, and
all the "exit only" signs in the world attached to our backsides do
absolutely nothing to deter it whatsoever.

On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> Being cynical, one might guess that Intel will treat IA64 as a loss
> leader until the other 64-bit competition dies, keeping ia32 at the
> top end of the market via silly PAE/PSE hacks. When the existing
> 64-bit compettion disappears, five years down the road, compilers
> will have matured sufficiently to make using IA64 boxes feasible.

Sounds relatively natural. I don't have a good notion of the legality
boundaries wrt. antitrust, but I'd assume they would otherwise do
whatever it takes to either defeat or wipe out competitors.

On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> If you really want to scale, just go to 64-bits, darn it. Don't keep
> hacking ia32 ISA -- leave it alone, it's fine as it is, and will live
> a nice long life as the future's preferred embedded platform.

Take this up with Intel. The rest of us are at their mercy. Good luck
finding anyone there to listen to it, you'll need it.

On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> 64-bit. alpha is old tech, and dead. *sniff* sparc64 is mostly
> old tech, and mostly dead. IA64 isn't, yet. x86-64 is _nice_ tech,
> but who knows if AMD will survive competition with Intel. PPC64 is
> the wild card in all this. I hope it succeeds.

Alpha is old, dead, and kicking most other cpus' asses from the grave.
I always did like DEC hardware. =(

I'm not sure what's so nice about x86-64; another opcode prefix
controlled extension atop the festering pile of existing x86 crud
sounds every bit as bad as any other attempt to prolong x86. Some of
the system device-level cleanups like the HPET look nice, though.

This success/failure stuff sounds a lot like economics, which is
pretty much even further out of our control than the weather or the
government. What prompted this bit?

-- wli
* Re: Minutes from Feb 21 LSE Call
From: Martin J. Bligh @ 2003-02-23 0:50 UTC (permalink / raw)
To: William Lee Irwin III, Jeff Garzik; +Cc: linux-kernel

> On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
>> ia32 big iron. sigh. I think that's so unfortunately in a number
>> of ways, but the main reason, of course, is that highmem is evil :)

One phrase ... "price:performance ratio". That's all it's about.
The only thing that will kill 32-bit big iron is the availability of
cheap 64 bit chips. It's a free-market economy.

It's ugly to program, but it's cheap, and it works.

M.
* Re: Minutes from Feb 21 LSE Call
From: Magnus Danielson @ 2003-02-23 11:22 UTC (permalink / raw)
To: mbligh; +Cc: wli, jgarzik, linux-kernel

From: "Martin J. Bligh" <mbligh@aracnet.com>
Subject: Re: Minutes from Feb 21 LSE Call
Date: Sat, 22 Feb 2003 16:50:36 -0800

> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunately in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.

Not all heavy-duty problems demand 64 bit; some fit nicely into 32 bit.
There are, however, different 32-bit architectures which they fit more
or less nicely into. SIMD may or may not give the boost, just as 64 bit
in itself may or may not. This is just like clustering vs. SMP: it
depends on the application.

Cheers,
Magnus
* Re: Minutes from Feb 21 LSE Call
From: Eric W. Biederman @ 2003-02-23 19:54 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: William Lee Irwin III, Jeff Garzik, linux-kernel

"Martin J. Bligh" <mbligh@aracnet.com> writes:
> > On Sat, Feb 22, 2003 at 03:38:10AM -0500, Jeff Garzik wrote:
> >> ia32 big iron. sigh. I think that's so unfortunately in a number
> >> of ways, but the main reason, of course, is that highmem is evil :)
>
> One phrase ... "price:performance ratio". That's all it's about.
> The only thing that will kill 32-bit big iron is the availability of
> cheap 64 bit chips. It's a free-market economy.
>
> It's ugly to program, but it's cheap, and it works.

I guess ugly to program is in the eye of the beholder. The big
platforms have always seemed much worse to me, where every box feels
free to change things in arbitrary ways for no good reason, or where
the OS and other low-level software must know exactly which motherboard
they are running on to work properly. Gratuitous incompatibilities are
the ugliest thing I have ever seen, far uglier than the warts a real
platform accumulates because it is designed to actually be used.

Eric
* Re: Minutes from Feb 21 LSE Call
  2003-02-22 22:18 ` William Lee Irwin III
  2003-02-23  0:50   ` Martin J. Bligh
@ 2003-02-23  1:17   ` Benjamin LaHaise
  2003-02-23  5:21     ` Gerrit Huizenga
  2003-02-23  9:37     ` William Lee Irwin III
  1 sibling, 2 replies; 124+ messages in thread
From: Benjamin LaHaise @ 2003-02-23 1:17 UTC (permalink / raw)
To: William Lee Irwin III, Jeff Garzik, linux-kernel

On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> I'm not sure what's so nice about x86-64; another opcode prefix
> controlled extension atop the festering pile of existing x86 crud

What's nice about x86-64 is that it runs existing 32 bit apps fast and
doesn't suffer from the blisteringly small caches that were part of
your rant. Plus, x86-64 binaries are not horrifically bloated like
ia64. Not to mention that the amount of reengineering in compilers
like gcc required to get decent performance out of it is actually
sane.

> sounds every bit as bad any other attempt to prolong x86. Some of
> the system device -level cleanups like the HPET look nice, though.

HPET is part of one of the PCYY specs and is even available on 32 bit
x86; there are just not that many bug-free implementations yet. Since
x86-64 made it part of the base platform and is testing it from
launch, they actually have a chance at being debugged in the mass
market versions.

-ben
--
Don't email: aart@kvack.org
* Re: Minutes from Feb 21 LSE Call
  2003-02-23  1:17 ` Benjamin LaHaise
@ 2003-02-23  5:21   ` Gerrit Huizenga
  2003-02-23  8:07     ` David Lang
  2003-02-23  9:37   ` William Lee Irwin III
  1 sibling, 1 reply; 124+ messages in thread
From: Gerrit Huizenga @ 2003-02-23 5:21 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: William Lee Irwin III, Jeff Garzik, linux-kernel

On Sat, 22 Feb 2003 20:17:24 EST, Benjamin LaHaise wrote:
> On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> > I'm not sure what's so nice about x86-64; another opcode prefix
> > controlled extension atop the festering pile of existing x86 crud
>
> What's nice about x86-64 is that it runs existing 32 bit apps fast and
> doesn't suffer from the blisteringly small caches that were part of your
> rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> Not to mention that the amount of reengineering in compilers like
> gcc required to get decent performance out of it is actually sane.

Four or five years ago the claim was that IA64 would solve all the
large memory problems. Commercial viability and substantial market
presence are still lacking. x86-64 has the same uphill battle. It has
a better architecture for highmem and a potentially better
architecture for large systems in general (compared to IA32; not
substantially better than, say, IA64 or PPC64). It also has at least
one manufacturer looking at high end systems. But until those systems
have some recognized market share, the boys with the big pockets
aren't likely to make them ubiquitous. The whole question of design
and development expense, combined with the ROI model, has more
influence on their deployment than the fact that it is technically a
useful architecture.

gerrit
* Re: Minutes from Feb 21 LSE Call
  2003-02-23  5:21 ` Gerrit Huizenga
@ 2003-02-23  8:07   ` David Lang
  2003-02-23  8:20     ` William Lee Irwin III
  ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: David Lang @ 2003-02-23 8:07 UTC (permalink / raw)
To: Gerrit Huizenga
Cc: Benjamin LaHaise, William Lee Irwin III, Jeff Garzik, linux-kernel

On Sat, 22 Feb 2003, Gerrit Huizenga wrote:

> On Sat, 22 Feb 2003 20:17:24 EST, Benjamin LaHaise wrote:
> > On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
> > > I'm not sure what's so nice about x86-64; another opcode prefix
> > > controlled extension atop the festering pile of existing x86 crud
> >
> > What's nice about x86-64 is that it runs existing 32 bit apps fast and
> > doesn't suffer from the blisteringly small caches that were part of your
> > rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> > Not to mention that the amount of reengineering in compilers like
> > gcc required to get decent performance out of it is actually sane.
>
> Four or five years ago the claim was that IA64 would solve all the large
> memory problems. Commercial viability and substantial market presence
> is still lacking. x86-64 has the same uphill battle. It has a better
> architecture for highmem and potentially better architecture for large
> systems in general (compared to IA32, not substantially better than, say,
> IA64 or PPC64). It also has at least one manufacturer looking at high
> end systems. But until those systems have some recognized market share,
> the boys with the big pockets aren't likely to make them ubiquitous.
> The whole thing about expenses to design and develop combined with the
> ROI model have more influence on their deployment than the fact that it
> is technically a useful architecture.

Gerrit, you missed the prior poster's point. IA64 has the same
fundamental problem as the Alpha, PPC, and Sparc processors: it
doesn't run x86 binaries.

The 8086/8088 CPU was nothing special when it was picked for the IBM
PC, but once it was picked it hit a critical mass that has meant that
compatibility with it is critical for any new CPU. The 286 and 386
CPUs were arguably inferior to other options available at the time,
but they had one feature that absolutely trumped everything else: they
could run existing programs, with no modifications, faster than
anything else available.

With the IA64, Intel forgot this (or decided their name value was so
high that they were immune to the issue). x86-64 takes the same
approach that the 286 and 386 did, and will be used by people who
couldn't care less about 64 bit stuff simply because it looks to be
the fastest x86 cpu available (and if the SMP features work as
advertised it will again give a big boost to the price/performance of
SMP machines, due to much cheaper MLB designs). If it was being
marketed by Intel it would be a shoo-in, but AMD does have a bit of an
uphill struggle.

David Lang
* Re: Minutes from Feb 21 LSE Call
  2003-02-23  8:07 ` David Lang
@ 2003-02-23  8:20   ` William Lee Irwin III
  2003-02-23 19:17     ` Linus Torvalds
  2003-02-23 19:13   ` David Mosberger
  2003-02-23 20:48   ` Gerrit Huizenga
  2 siblings, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-23 8:20 UTC (permalink / raw)
To: David Lang; +Cc: Gerrit Huizenga, Benjamin LaHaise, Jeff Garzik, linux-kernel

On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
> binaries.

If I didn't know this mattered I wouldn't bother with the barfbags.
I just wouldn't deal with it.

-- wli
* Re: Minutes from Feb 21 LSE Call
  2003-02-23  8:20 ` William Lee Irwin III
@ 2003-02-23 19:17   ` Linus Torvalds
  2003-02-23 19:29     ` David Mosberger
  ` (3 more replies)
  0 siblings, 4 replies; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 19:17 UTC (permalink / raw)
To: linux-kernel

In article <20030223082036.GI10411@holomorphy.com>,
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
>> Gerrit, you missed the prior poster's point. IA64 has the same fundamental
>> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
>> binaries.
>
>If I didn't know this mattered I wouldn't bother with the barfbags.
>I just wouldn't deal with it.

Why? The x86 is a hell of a lot nicer than the ppc32, for example. On
the x86, you get good performance and you can ignore the design
mistakes (ie segmentation) by just basically turning them off. On the
ppc32, the MMU braindamage is not something you can ignore: you have
to write your OS for it, and if you turn it off (ie enable soft-fill
on the ones that support it) you now have to have separate paths in
the OS for it.

And the baroque instruction encoding on the x86 is actually a _good_
thing: it's a rather dense encoding, which means that you win on
icache. It's a bit hard to decode, but who cares? Existing chips do
well at decoding, and thanks to the icache win they tend to perform
better - and they load faster too (which is important - you can make
your CPU have big caches, but _nothing_ saves you from the cold-cache
costs).

The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of
a lot better job than the competition when it comes to memory loads
and stores - which helps in general. While the RISC people were off
trying to optimize their compilers to generate loops that used all 32
registers efficiently, the x86 implementors instead made the chip run
fast on varied loads and used tons of register renaming hardware (and
looked at _memory_ renaming too).

IA64 made all the mistakes anybody else did, and threw out all the
good parts of the x86 because people thought those parts were ugly.
They aren't ugly, they're the "charming oddity" that makes it do well.
Look at them the right way and you realize that a lot of the
grottyness is exactly _why_ the x86 works so well (yeah, and the fact
that they are everywhere ;).

The only real major failure of the x86 is the PAE crud. Let's hope
we'll get to forget it, the same way the DOS people eventually forgot
about their memory extenders.

(Yeah, and maybe IBM will make their ppc64 chips cheap enough that
they will matter, and people can overlook the grottiness there. Right
now Intel doesn't even seem to be interested in "64-bit for the
masses", and maybe IBM will be. AMD certainly seems to be serious
about the "masses" part, which in the end is the only part that really
matters.)

Linus
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 19:17 ` Linus Torvalds
@ 2003-02-23 19:29   ` David Mosberger
  2003-02-23 20:13     ` Martin J. Bligh
  2003-02-23 21:34     ` Linus Torvalds
  2003-02-23 20:21   ` Xavier Bestel
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 19:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

>>>>> On Sun, 23 Feb 2003 19:17:30 +0000 (UTC), torvalds@transmeta.com (Linus Torvalds) said:

Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).

But does x86 really work so well? Itanium 2 on 0.13um performs a lot
better than P4 on 0.13um. As far as I can guess, the only reason P4
comes out on 0.13um (and 0.09um) before anything else is due to the
latter part you mention: it's where the volume is today.

--david
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 19:29 ` David Mosberger
@ 2003-02-23 20:13   ` Martin J. Bligh
  2003-02-23 22:01     ` David Mosberger
  2003-02-23 21:34   ` Linus Torvalds
  1 sibling, 1 reply; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 20:13 UTC (permalink / raw)
To: davidm, Linus Torvalds; +Cc: linux-kernel

> Linus> Look at them the right way and you realize that a lot of the
> Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
> Linus> the fact that they are everywhere ;).
>
> But does x86 really work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um. As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.

Care to share those impressive benchmark numbers (for
macro-benchmarks)? Would be interesting to see the difference, and
where it wins.

Thanks,

M.
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 20:13 ` Martin J. Bligh
@ 2003-02-23 22:01   ` David Mosberger
  2003-02-23 22:12     ` Martin J. Bligh
  0 siblings, 1 reply; 124+ messages in thread
From: David Mosberger @ 2003-02-23 22:01 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: davidm, Linus Torvalds, linux-kernel

>>>>> On Sun, 23 Feb 2003 12:13:00 -0800, "Martin J. Bligh" <mbligh@aracnet.com> said:

Linus> Look at them the right way and you realize that a lot of the
Linus> grottyness is exactly _why_ the x86 works so well (yeah, and
Linus> the fact that they are everywhere ;).

>> But does x86 really work so well? Itanium 2 on 0.13um performs a
>> lot better than P4 on 0.13um. As far as I can guess, the only
>> reason P4 comes out on 0.13um (and 0.09um) before anything else
>> is due to the latter part you mention: it's where the volume is
>> today.

Martin> Care to share those impressive benchmark numbers (for
Martin> macro-benchmarks)? Would be interesting to see the
Martin> difference, and where it wins.

You can do it two ways: you can look at the numbers Intel is publicly
projecting for Madison, or you can compare McKinley with the 0.18um
Pentium 4.

--david
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 22:01 ` David Mosberger
@ 2003-02-23 22:12   ` Martin J. Bligh
  0 siblings, 0 replies; 124+ messages in thread
From: Martin J. Bligh @ 2003-02-23 22:12 UTC (permalink / raw)
To: davidm; +Cc: Linus Torvalds, linux-kernel

> >> But does x86 really work so well? Itanium 2 on 0.13um performs a
> >> lot better than P4 on 0.13um. As far as I can guess, the only
> >> reason P4 comes out on 0.13um (and 0.09um) before anything else
> >> is due to the latter part you mention: it's where the volume is
> >> today.
>
> Martin> Care to share those impressive benchmark numbers (for
> Martin> macro-benchmarks)? Would be interesting to see the
> Martin> difference, and where it wins.
>
> You can do it two ways: you can look at the numbers Intel is publicly
> projecting for Madison, or you can compare McKinley with the 0.18um
> Pentium 4.

Ummm ... I'm not exactly happy working with Intel's own projections on
the performance of their Itanium chips ... seems a little unscientific
;-)

Presumably when you said "Itanium 2 on 0.13um performs a lot better
than P4 on 0.13um" you were referring to some benchmarks you have the
results of? If you can't publish them, fair enough. But if you can,
I'd love to see how it compares ... Itanium seems to be "more
interesting" nowadays, though I can't say I'm happy about the
complexity of it.

M.
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 19:29 ` David Mosberger
  2003-02-23 20:13   ` Martin J. Bligh
@ 2003-02-23 21:34   ` Linus Torvalds
  2003-02-23 22:40     ` David Mosberger
  1 sibling, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2003-02-23 21:34 UTC (permalink / raw)
To: davidm; +Cc: linux-kernel

On Sun, 23 Feb 2003, David Mosberger wrote:
>
> But does x86 really work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um.

On WHAT benchmark? Itanium 2 doesn't hold a candle to a P4 on any
real-world benchmarks.

As far as I know, the _only_ things Itanium 2 does better on are (a)
FP kernels, partly due to a huge cache, and (b) big databases,
entirely because the P4 is crippled with lots of memory since Intel
refuses to do a 64-bit version (because they know it would totally
kill ia-64).

Last I saw, P4 was kicking ia-64 butt on specint and friends.

That's also ignoring the fact that ia-64 simply CANNOT DO the things a
P4 does every single day. You can't put an ia-64 in a reasonable
desktop machine, partly because of pricing, but partly because it
would just suck so horribly at things people expect not to suck (games
spring to mind).

And I further bet that using a native distribution (ie totally
ignoring the power and price and bad x86 performance issues), ia-64
will work a lot worse for people simply because the binaries are
bigger. That was quite painful on alpha, and ia-64 is even worse - to
offset the bigger binaries, you need a faster disk subsystem etc just
to not feel slower than a bog-standard PC.

Code size matters. Price matters. Real world matters. And ia-64, at
least so far, falls flat on its face on ALL of these.

> As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.

It's where all the money is ("ia-64: 5 billion dollars in the red and
still sinking") so of _course_ it's where the efforts get put.

Linus
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 21:34 ` Linus Torvalds
@ 2003-02-23 22:40   ` David Mosberger
  2003-02-23 22:48     ` David Lang
  2003-02-23 23:06     ` Martin J. Bligh
  0 siblings, 2 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 22:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, linux-kernel

>>>>> On Sun, 23 Feb 2003 13:34:32 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:

Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.

I don't think so. According to Intel [1], the highest clock frequency
for a 0.18um part is 2GHz (both for Xeon and P4; for Xeon MP it's
1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
[2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].

--david

[1] http://www.intel.com/support/processors/xeon/corespeeds.htm
[2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
[3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 22:40 ` David Mosberger
@ 2003-02-23 22:48   ` David Lang
  2003-02-23 22:54     ` David Mosberger
  1 sibling, 1 reply; 124+ messages in thread
From: David Lang @ 2003-02-23 22:48 UTC (permalink / raw)
To: davidm; +Cc: Linus Torvalds, linux-kernel

I would call a 15% lead over the ia64 pretty substantial. Yes, it's
not the same clock speed, but if that's the clock speed they can
achieve on that process, it's equivalent. The P4 covers a LOT of sins
by ratcheting up its speed; what matters is the final capability, not
the capability/clock (if capability/clock was what mattered, the AMD
chips would have put Intel out of business and the P4 would be as
common as ia-64).

David Lang

On Sun, 23 Feb 2003, David Mosberger wrote:

> Date: Sun, 23 Feb 2003 14:40:44 -0800
> From: David Mosberger <davidm@napali.hpl.hp.com>
> Reply-To: davidm@hpl.hp.com
> To: Linus Torvalds <torvalds@transmeta.com>
> Cc: davidm@hpl.hp.com, linux-kernel@vger.kernel.org
> Subject: Re: Minutes from Feb 21 LSE Call
>
> >>>>> On Sun, 23 Feb 2003 13:34:32 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:
>
> Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>
> I don't think so. According to Intel [1], the highest clock frequency
> for a 0.18um part is 2GHz (both for Xeon and P4; for Xeon MP it's
> 1.5GHz). The highest reported SPECint for a 2GHz Xeon seems to be 701
> [2]. In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
>
> --david
>
> [1] http://www.intel.com/support/processors/xeon/corespeeds.htm
> [2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
> [3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 22:48 ` David Lang
@ 2003-02-23 22:54   ` David Mosberger
  2003-02-23 22:56     ` David Lang
  ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 22:54 UTC (permalink / raw)
To: David Lang; +Cc: davidm, Linus Torvalds, linux-kernel

>>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:

David.L> I would call a 15% lead over the ia64 pretty substantial.

Huh? Did you misread my mail?

  2 GHz Xeon:      701 SPECint
  1 GHz Itanium 2: 810 SPECint

That is, Itanium 2 is 15% faster.

--david
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 22:54 ` David Mosberger
@ 2003-02-23 22:56   ` David Lang
  2003-02-24  0:40   ` Linus Torvalds
  2003-02-24  1:06   ` dean gaudet
  2 siblings, 0 replies; 124+ messages in thread
From: David Lang @ 2003-02-23 22:56 UTC (permalink / raw)
To: davidm; +Cc: Linus Torvalds, linux-kernel

Yep, I reversed the numbers.

David Lang

On Sun, 23 Feb 2003, David Mosberger wrote:

> Date: Sun, 23 Feb 2003 14:54:12 -0800
> From: David Mosberger <davidm@napali.hpl.hp.com>
> Reply-To: davidm@hpl.hp.com
> To: David Lang <david.lang@digitalinsight.com>
> Cc: davidm@hpl.hp.com, Linus Torvalds <torvalds@transmeta.com>,
>     linux-kernel@vger.kernel.org
> Subject: Re: Minutes from Feb 21 LSE Call
>
> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> Huh? Did you misread my mail?
>
>   2 GHz Xeon:      701 SPECint
>   1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.
>
> --david
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 22:54 ` David Mosberger
  2003-02-23 22:56   ` David Lang
@ 2003-02-24  0:40   ` Linus Torvalds
  2003-02-24  2:32     ` David Mosberger
  2003-02-24  1:06   ` dean gaudet
  2 siblings, 1 reply; 124+ messages in thread
From: Linus Torvalds @ 2003-02-24 0:40 UTC (permalink / raw)
To: davidm; +Cc: David Lang, linux-kernel

On Sun, 23 Feb 2003, David Mosberger wrote:
>
>   2 GHz Xeon:      701 SPECint
>   1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.

Ehh, and this is with how much cache?

Last I saw, the Itanium 2 machines came with 3MB of integrated L3
caches, and I suspect that whatever 0.13 Itanium numbers you're
looking at are with the new 6MB caches.

So your "apples to apples" comparison isn't exactly that.

The only thing that is meaningful is "performance at the same time of
general availability". At which point the P4 beats the Itanium 2
senseless with a 25% higher SpecInt. And last I heard, by the time
Itanium 2 is up at 2GHz, the P4 is apparently going to be at 5GHz,
comfortably keeping that 25% lead.

Linus
* Re: Minutes from Feb 21 LSE Call
  2003-02-24  0:40 ` Linus Torvalds
@ 2003-02-24  2:32   ` David Mosberger
  2003-02-24  2:54     ` Linus Torvalds
  0 siblings, 1 reply; 124+ messages in thread
From: David Mosberger @ 2003-02-24 2:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David Lang, linux-kernel

>>>>> On Sun, 23 Feb 2003 16:40:40 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:

Linus> On Sun, 23 Feb 2003, David Mosberger wrote:

>>   2 GHz Xeon:      701 SPECint
>>   1 GHz Itanium 2: 810 SPECint

>> That is, Itanium 2 is 15% faster.

Linus> Ehh, and this is with how much cache?

Linus> Last I saw, the Itanium 2 machines came with 3MB of
Linus> integrated L3 caches, and I suspect that whatever 0.13
Linus> Itanium numbers you're looking at are with the new 6MB
Linus> caches.

Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
we can do some educated guessing:

  1GHz Itanium 2, 3MB cache:     810 SPECint
  900MHz Itanium 2, 1.5MB cache: 674 SPECint

Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
around 750 SPECint. In reality, it would get slightly less, but most
likely still substantially more than 701.

Linus> So your "apples to apples" comparison isn't exactly that.

I never claimed it's an apples-to-apples comparison. But comparing
same-process chips from the same manufacturer does make for a fairer
"architectural" comparison, because it factors out at least some of
the effects caused by volume (there is no reason other than (a) volume
and (b) being designed as a server chip for Itanium chips to come out
on the same process later than the corresponding x86 chips).

Linus> The only thing that is meaningful is "performance at the same
Linus> time of general availability".

You claimed that x86 is inherently superior. I provided data that
shows that much of this apparent superiority is simply an effect of
the larger volume that x86 achieves today. Please don't claim that x86
wins on technical grounds when it really wins on economic grounds.

--david
* Re: Minutes from Feb 21 LSE Call
  2003-02-24  2:32 ` David Mosberger
@ 2003-02-24  2:54   ` Linus Torvalds
  2003-02-24  3:08     ` David Mosberger
  2003-02-24 21:42     ` Andrea Arcangeli
  0 siblings, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2003-02-24 2:54 UTC (permalink / raw)
To: davidm; +Cc: David Lang, linux-kernel

On Sun, 23 Feb 2003, David Mosberger wrote:
> >>   2 GHz Xeon:      701 SPECint
> >>   1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> we can do some educated guessing:
>
>   1GHz Itanium 2, 3MB cache:     810 SPECint
>   900MHz Itanium 2, 1.5MB cache: 674 SPECint
>
> Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> around 750 SPECint. In reality, it would get slightly less, but most
> likely substantially more than 701.

And as Dean pointed out:

  2GHz Xeon MP with 2MB L3 cache: 842 SPECint

In other words, the P4 eats the Itanium for breakfast even if you
limit it to 2GHz due to some "process" rule.

And if you don't make up any silly rules, but simply look at "what's
available today", you get

  2.8GHz Xeon MP with 2MB L3 cache: 907 SPECint

or even better (much cheaper CPUs):

  3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
  AMD Athlon XP 2800+:              933 SPECint

These are systems that you can buy today, with _less_ cache and
clearly much higher performance (comparing the best-performing
published ia-64 with the best P4 on specint, the P4 is 32% faster).
Even with the "you can only run the P4 at 2GHz because that is all it
ever ran at in 0.18" rule, the ia-64 falls behind.

> Linus> The only thing that is meaningful is "performance at the same
> Linus> time of general availability".
>
> You claimed that x86 is inherently superior. I provided data that
> shows that much of this apparent superiority is simply an effect of
> the larger volume that x86 achieves today.

And I showed that your data is flawed. Clearly the P4 outperforms
ia-64 on an architectural level _even_ when taking process into
account.

Linus
* Re: Minutes from Feb 21 LSE Call
  2003-02-24  2:54 ` Linus Torvalds
@ 2003-02-24  3:08   ` David Mosberger
  2003-02-24 21:42   ` Andrea Arcangeli
  1 sibling, 0 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-24 3:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David Lang, linux-kernel

>>>>> On Sun, 23 Feb 2003 18:54:41 -0800 (PST), Linus Torvalds <torvalds@transmeta.com> said:

Linus> In other words, the P4 eats the Itanium for breakfast even if
Linus> you limit it to 2GHz due to some "process" rule.

Ugh, 842 vs 810 is "eating for breakfast"? In my lexicon, that's "in
the same ballpark". Besides, the 2GHz Xeon MP is a 0.13um part.

>> You claimed that x86 is inherently superior. I provided data that
>> shows that much of this apparent superiority is simply an effect of
>> the larger volume that x86 achieves today.

Linus> And I showed that your data is flawed.

No, you did not.

--david
* Re: Minutes from Feb 21 LSE Call
  2003-02-24  2:54 ` Linus Torvalds
  2003-02-24  3:08   ` David Mosberger
@ 2003-02-24 21:42   ` Andrea Arcangeli
  1 sibling, 0 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-24 21:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: davidm, David Lang, linux-kernel

On Sun, Feb 23, 2003 at 06:54:41PM -0800, Linus Torvalds wrote:
>
> On Sun, 23 Feb 2003, David Mosberger wrote:
> > >>   2 GHz Xeon:      701 SPECint
> > >>   1 GHz Itanium 2: 810 SPECint
> >
> > >> That is, Itanium 2 is 15% faster.
> >
> > Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> > we can do some educated guessing:
> >
> >   1GHz Itanium 2, 3MB cache:     810 SPECint
> >   900MHz Itanium 2, 1.5MB cache: 674 SPECint
> >
> > Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> > around 750 SPECint. In reality, it would get slightly less, but most
> > likely substantially more than 701.
>
> And as Dean pointed out:
>
>   2GHz Xeon MP with 2MB L3 cache: 842 SPECint
>
> In other words, the P4 eats the Itanium for breakfast even if you limit it
> to 2GHz due to some "process" rule.
>
> And if you don't make up any silly rules, but simply look at "what's
> available today", you get
>
>   2.8GHz Xeon MP with 2MB L3 cache: 907 SPECint
>
> or even better (much cheaper CPUs):
>
>   3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
>   AMD Athlon XP 2800+:              933 SPECint
>
> These are systems that you can buy today, with _less_ cache and clearly
> much higher performance (comparing the best-performing published ia-64
> with the best P4 on specint, the P4 is 32% faster). Even with the "you can
> only run the P4 at 2GHz because that is all it ever ran at in 0.18" rule,
> the ia-64 falls behind.

I agree; especially the cache difference makes any such comparison
uninteresting to my eyes (it's similar to running dbench with
different pagecache sizes and comparing the results). But I have a
side note on these matters in favour of the 64bit platforms.

I could be wrong, but AFAIK some of the specint testcases generate a
doubled data memory footprint when compiled 64bit, so I guess some of
the testcases should really be called speclong and not specint.
(However, I don't think those testcases alone can explain a global 32%
difference; still, there would be some difference in favour of the
32bit platform.)

So in short, I currently believe specint is not a good benchmark to
compare a 64bit cpu to a 32bit cpu; 64bit can only lose in specint if
the cpu is exactly the same and only the data 'longs' are changed to
64bit. To do a really fair comparison one should first change the
source, replacing every "long" with either a "long long" or an "int";
only then will it be fair to compare specint results between 32bit and
64bit cpus.

I never used specint myself, so don't ask me for more details on this,
and again I could be wrong. But if I'm right, somebody should really
go over the source and make a kind of unofficial (but official) patch
available, so people can generate a specint testsuite usable to
compare 32bit with 64bit results, or lots of effort will be wasted by
people attempting the impossible. I mean, if the memory bus is the
same hardware in both the 32bit and 64bit runs, the doubled memory
footprint will run slower and there's nothing the OS or the hardware
can do about it (and dozens of mbytes of ram won't fit in l1 cache,
not even on the itanium 8). The benchmark suite really must be fixed
to ensure the 32bit and 64bit compilations generate the same _data_
memory footprint if one wants to make comparisons between the two.

Andrea
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 22:54 ` David Mosberger
  2003-02-23 22:56   ` David Lang
  2003-02-24  0:40   ` Linus Torvalds
@ 2003-02-24  1:06   ` dean gaudet
  2003-02-24  1:56     ` David Mosberger
  2 siblings, 1 reply; 124+ messages in thread
From: dean gaudet @ 2003-02-24 1:06 UTC (permalink / raw)
To: davidm; +Cc: David Lang, Linus Torvalds, linux-kernel

On Sun, 23 Feb 2003, David Mosberger wrote:

> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:
>
> David.L> I would call a 15% lead over the ia64 pretty substantial.
>
> Huh? Did you misread my mail?
>
>   2 GHz Xeon:      701 SPECint
>   1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.

according to pricewatch i could buy ten 2GHz Xeons for about the cost
of one Itanium 2 900MHz. that's not even considering the cost of the
motherboards i'd need to plug those into.

-dean
* Re: Minutes from Feb 21 LSE Call
From: David Mosberger @ 2003-02-24  1:56 UTC (permalink / raw)
To: dean gaudet; +Cc: davidm, David Lang, Linus Torvalds, linux-kernel

>>>>> On Sun, 23 Feb 2003 17:06:29 -0800 (PST), dean gaudet <dean-list-linux-kernel@arctic.org> said:

  Dean> On Sun, 23 Feb 2003, David Mosberger wrote:
  >> >>>>> On Sun, 23 Feb 2003 14:48:48 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:

  David.L> I would call a 15% lead over the ia64 pretty substantial.

  >> Huh?  Did you misread my mail?
  >> 2 GHz Xeon:      701 SPECint
  >> 1 GHz Itanium 2: 810 SPECint
  >> That is, Itanium 2 is 15% faster.

  Dean> according to pricewatch i could buy ten 2GHz Xeons for about
  Dean> the cost of one Itanium 2 900MHz.

Not if you want comparable cache-sizes [1]:

	Intel Xeon MP, 2MB L3 cache:		$3692
	Itanium 2, 1 GHz, 3MB L3 cache:		$4226
	Itanium 2, 1 GHz, 1.5MB L3 cache:	$2247
	Itanium 2, 900 MHz, 1.5MB L3 cache:	$1338

Intel basically prices things by the cache size.

	--david

[1]: http://www.intel.com/intel/finance/pricelist/
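[Editorial aside: the figures quoted in this subthread reduce to rough price/performance numbers. Illustrative arithmetic only, using the SPECint scores and the Itanium 2 list price from the messages above:]

```python
# SPECint2000 scores and Intel list price as quoted in this thread.
xeon_specint = 701       # 2 GHz Xeon
itanium2_specint = 810   # 1 GHz Itanium 2, 3MB L3
itanium2_price = 4226    # list price quoted above for that part

speedup = itanium2_specint / xeon_specint - 1
print(f"Itanium 2 SPECint lead: {speedup:.1%}")        # ~15.5%

dollars_per_point = itanium2_price / itanium2_specint
print(f"~${dollars_per_point:.2f} per SPECint point")
```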
* Re: Minutes from Feb 21 LSE Call
From: dean gaudet @ 2003-02-24  2:15 UTC (permalink / raw)
To: davidm; +Cc: David Lang, Linus Torvalds, linux-kernel

On Sun, 23 Feb 2003, David Mosberger wrote:

> Dean> according to pricewatch i could buy ten 2GHz Xeons for about
> Dean> the cost of one Itanium 2 900MHz.
>
> Not if you want comparable cache-sizes [1]:
>
>	Intel Xeon MP, 2MB L3 cache:		$3692
>	Itanium 2, 1 GHz, 3MB L3 cache:		$4226
>	Itanium 2, 1 GHz, 1.5MB L3 cache:	$2247
>	Itanium 2, 900 MHz, 1.5MB L3 cache:	$1338
>
> Intel basically prices things by the cache size.
>
> [1]: http://www.intel.com/intel/finance/pricelist/

somehow i doubt you're quoting Xeon numbers w/ 2MB of cache above.  in
fact, here's a 701 specint with only 512KB of cache @ 2GHz:

http://www.spec.org/osg/cpu2000/results/res2002q1/cpu2000-20020128-01232.html

my point was that if you had comparable die sizes the 15% "advantage"
would disappear.  there's a hell of a lot which could be done with the
approximately double die size that the itanium 2 has compared to any of
the commodity x86 parts.  but then the cost per part would be
correspondingly higher... which is exactly what is shown in the intel
cost numbers.

a more fair comparison would be your itanium 2 number with this:

http://www.spec.org/osg/cpu2000/results/res2002q4/cpu2000-20021021-01742.html

2MB L2 Xeon @ 2GHz, scores 842.

is this the itanium 2 number you're quoting us?

http://www.spec.org/osg/cpu2000/results/res2002q3/cpu2000-20020711-01469.html

'cause that's with 3MB L3.

-dean
* Re: Minutes from Feb 21 LSE Call
From: David Mosberger @ 2003-02-24  3:11 UTC (permalink / raw)
To: dean gaudet; +Cc: davidm, David Lang, Linus Torvalds, linux-kernel

>>>>> On Sun, 23 Feb 2003 18:15:29 -0800 (PST), dean gaudet <dean-list-linux-kernel@arctic.org> said:

  Dean> somehow i doubt you're quoting Xeon numbers w/2MB of cache above.

I quoted the Xeon 0.13um price because there was no 0.18um part with
>512KB cache (for better or worse, Intel basically prices CPUs by
cache-size).

	--david
* Re: Minutes from Feb 21 LSE Call
From: Martin J. Bligh @ 2003-02-23 23:06 UTC (permalink / raw)
To: davidm, Linus Torvalds; +Cc: linux-kernel

> Linus> Last I saw P4 was kicking ia-64 butt on specint and friends.
>
> I don't think so.  According to Intel [1], the highest clock frequency
> for a 0.18um part is 2GHz (both for Xeon and P4; for Xeon MP it's
> 1.5GHz).  The highest reported SPECint for a 2GHz Xeon seems to be 701
> [2].  In comparison, a 1GHz McKinley gets a SPECint of 810 [3].
>
> [1] http://www.intel.com/support/processors/xeon/corespeeds.htm
> [2] http://www.specbench.org/cpu2000/results/res2002q1/cpu2000-20020128-01232.html
> [3] http://www.specbench.org/cpu2000/results/res2002q3/cpu2000-20020711-01469.html

Got anything more real-world than SPECint type microbenchmarks?

M.
* Re: Minutes from Feb 21 LSE Call
From: David Mosberger @ 2003-02-23 23:59 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: davidm, Linus Torvalds, linux-kernel

>>>>> On Sun, 23 Feb 2003 15:06:56 -0800, "Martin J. Bligh" <mbligh@aracnet.com> said:

  Martin> Got anything more real-world than SPECint type
  Martin> microbenchmarks?

SPECint a microbenchmark?  You seem to be redefining the meaning of
the word (last time I checked, lmbench was a microbenchmark).

Ironically, Itanium 2 seems to do even better in the "real world" than
suggested by benchmarks, partly because of the large caches and memory
bandwidth and, I'm guessing, partly because of its straightforward
micro-architecture (e.g., a synchronization operation takes on the
order of 10 cycles, as compared to on the order of dozens or hundreds
of cycles on the Pentium 4).

BTW: I hope I don't sound too negative on the Pentium 4/Xeon.  It's
certainly an excellent performer for many things.  I just want to
point out that Itanium 2 also is a good performer, probably more so
than many on this list seem willing to give it credit for.

	--david
* Re: Minutes from Feb 21 LSE Call
From: Gerrit Huizenga @ 2003-02-24  3:49 UTC (permalink / raw)
To: davidm; +Cc: Martin J. Bligh, Linus Torvalds, linux-kernel

On Sun, 23 Feb 2003 15:59:12 PST, David Mosberger wrote:
> SPECint a microbenchmark?  You seem to be redefining the meaning of
> the word (last time I checked, lmbench was a microbenchmark).
>
> Ironically, Itanium 2 seems to do even better in the "real world" than
> suggested by benchmarks, partly because of the large caches and memory
> bandwidth and, I'm guessing, partly because of its straightforward
> micro-architecture (e.g., a synchronization operation takes on the
> order of 10 cycles, as compared to on the order of dozens or hundreds
> of cycles on the Pentium 4).

There are two major types of high-end workloads here (and IA64 is
definitely still in the "high end" category).  There are the scientific
and technical style workloads, which SPECcpu (of which CINT and CFP are
the integer and floating point subsets) might reasonably characterize,
and the "system" workloads, such as those roughly characterized by
things like TPC-C/H/W/etc. or SPECweb/jbb/jvm/jAppServer, which
exercise more complex, multi-tier interactions.

I haven't seen anything recently on the higher-level system benchmarks
for IA64 - I'm not sure that anyone is doing much that is significant
in this space, where IA32 results practically saturate the overall
reported results.  I know SGI is generally more interested in the
scientific and technical area.  I would assume that HP would be more
interested in the broader system deployment, except that too much
activity in that area might endanger parisc sales.

IBM is doing some stuff in the IA64 space, but more in IA32 and
obviously PPC64.  That leaves NEC and a few others that I don't know
about.  It may be that IA64 isn't really ready for the system-level
stuff, or that it competes with too many entrenched platforms to make
it economically viable.

But I would be really interested in seeing anything other than
"scientific and technical" based benchmarks for IA64.  I don't think
there is much out there.  That implies either that nobody is interested
in IA64 or that it doesn't perform "competitively" in that space...

gerrit
* Re: Minutes from Feb 21 LSE Call
From: David Mosberger @ 2003-02-24  4:07 UTC (permalink / raw)
To: Gerrit Huizenga; +Cc: davidm, Martin J. Bligh, Linus Torvalds, linux-kernel

>>>>> On Sun, 23 Feb 2003 19:49:38 -0800, Gerrit Huizenga <gh@us.ibm.com> said:

  Gerrit> I haven't seen anything recently on the higher level System
  Gerrit> benchmarks for IA64

Did you miss the TPC-C announcements from last November & December?

 rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
 rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).

Both were world records for 4-way machines when they were announced
(not sure if that's still true).

	--david
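[Editorial aside: TPC-C reports price/performance as total system cost divided by throughput, so the quoted figures imply roughly the following system costs. Back-of-envelope only; published $/tpmC values are rounded, so the true totals differ slightly:]

```python
# tpmC and $/tpmC as quoted above; total cost ≈ tpmC * price-per-tpmC.
results = {
    "rx5670, Oracle 10 on Linux": (80498, 5.30),
    "rx5670, MS SQL on Windows":  (87741, 5.03),
}
totals = {name: tpmc * per_tpmc for name, (tpmc, per_tpmc) in results.items()}
for name, total in totals.items():
    print(f"{name}: ~${total:,.0f} total system cost")
```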
* Re: Minutes from Feb 21 LSE Call
From: Martin J. Bligh @ 2003-02-24  4:34 UTC (permalink / raw)
To: davidm, Gerrit Huizenga; +Cc: Linus Torvalds, linux-kernel

> Gerrit> I haven't seen anything recently on the higher level System
> Gerrit> benchmarks for IA64
>
> Did you miss the TPC-C announcements from last November & December?
>
> rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
> rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).
>
> Both were world records for 4-way machines when they were announced
> (not sure if that's still true).

Cool - thanks, that's more what I was looking for.

M.
* Re: Minutes from Feb 21 LSE Call
From: Gerrit Huizenga @ 2003-02-24  5:02 UTC (permalink / raw)
To: davidm; +Cc: Martin J. Bligh, Linus Torvalds, linux-kernel

On Sun, 23 Feb 2003 20:07:43 PST, David Mosberger wrote:
> Did you miss the TPC-C announcements from last November & December?
>
> rx5670 4-way Itanium 2: 80498 tpmC @ $5.30/transaction (Oracle 10 on Linux).
> rx5670 4-way Itanium 2: 87741 tpmC @ $5.03/transaction (MS SQL on Windows).
>
> Both were world records for 4-way machines when they were announced
> (not sure if that's still true).

Yeah, I missed that.  And my spot checking didn't catch anything IA64
related.  Was there anything else on IA64 that competed with the
current rack of 8-way IA32 boxen, or the upcoming 16-way stuff rolling
out this year?  Seems like the larger physical memory support should
help on several of those benchmarks...

The thin number of IA64 results reflects the difference in
marketing/sales, although better price/performance should be able to
change that... ;)

Odd that MS is still outdoing Linux (or SQL Server is outdoing Oracle
on Linux).  It will be nice when that changes...

gerrit
* Re: Minutes from Feb 21 LSE Call
From: Xavier Bestel @ 2003-02-23 20:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux Kernel Mailing List

On Sun, 23 Feb 2003 at 20:17, Linus Torvalds wrote:
> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares?  Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).

Next step: hardware gzip?
* Re: Minutes from Feb 21 LSE Call
From: Martin J. Bligh @ 2003-02-23 20:50 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linux Kernel Mailing List

>> And the baroque instruction encoding on the x86 is actually a _good_
>> thing: it's a rather dense encoding, which means that you win on icache.
>> It's a bit hard to decode, but who cares?  Existing chips do well at
>> decoding, and thanks to the icache win they tend to perform better - and
>> they load faster too (which is important - you can make your CPU have
>> big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip?

They did that already... IBM were demonstrating such a thing a couple
of years ago.  I don't see it helping with icache though, as it unpacks
between memory and the processor, IIRC.

M.
* Re: Minutes from Feb 21 LSE Call
From: Alan Cox @ 2003-02-23 23:57 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: Xavier Bestel, Linux Kernel Mailing List

On Sun, 2003-02-23 at 20:50, Martin J. Bligh wrote:
> > Next step: hardware gzip?
>
> They did that already... IBM were demonstrating such a thing a couple
> of years ago.  I don't see it helping with icache though, as it unpacks
> between memory and the processor, IIRC.

I saw the L2/L3 compressed cache thing, and I thought "doh!", and I
watched, and I've not seen it for a long time.  What happened to it?
* Re: Minutes from Feb 21 LSE Call
From: Kenneth Johansson @ 2003-02-24  1:26 UTC (permalink / raw)
To: Alan Cox; +Cc: Martin J. Bligh, Xavier Bestel, Linux Kernel Mailing List

On Mon, 2003-02-24 at 00:57, Alan Cox wrote:
> I saw the L2/L3 compressed cache thing, and I thought "doh!", and I
> watched, and I've not seen it for a long time.  What happened to it?

http://www-3.ibm.com/chips/techlib/techlib.nsf/products/CodePack

If that is what you are thinking of, it does look like people were not
using it - I know I'm not.  It reduces the memory needed for
instructions, but that is all, and memory, it seems, is not a problem,
at least not for instructions.  It does not exist in new CPUs from IBM;
I don't know the official reason for the removal.

If you really do mean a compressed cache, I don't think anybody has
done that for real.
* Re: Minutes from Feb 21 LSE Call
From: dean gaudet @ 2003-02-24  1:53 UTC (permalink / raw)
To: Kenneth Johansson; +Cc: Alan Cox, Martin J. Bligh, Xavier Bestel, Linux Kernel Mailing List

On Sun, 24 Feb 2003, Kenneth Johansson wrote:

> If you really do mean a compressed cache, I don't think anybody has
> done that for real.

people are doing this *for real* -- it really depends on what you
define as compressed.

ARM Thumb is definitely a compression function for code.

x86 native instructions are compressed compared to the RISC-like
micro-ops which a processor like the athlon, p3, or p4 actually
executes.  for similar operations, an x86 would average probably 1.5
bytes to encode what a 32-bit RISC would need 4 bytes to encode.

-dean
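[Editorial aside: taking dean's figures at face value - both the ~1.5-byte x86 average and the 4-byte RISC word are his in-thread estimates, not measurements - the icache-capacity implication works out as follows, for a hypothetical cache size:]

```python
# Instructions that fit in a given icache under the rough per-instruction
# encoding sizes claimed above (64KB cache size is a hypothetical).
icache_bytes = 64 * 1024
x86_avg_bytes = 1.5   # dean's estimate for x86
risc_bytes = 4        # fixed 32-bit RISC encoding

density_ratio = (icache_bytes / x86_avg_bytes) / (icache_bytes / risc_bytes)
print(f"x86 icache holds ~{density_ratio:.1f}x more instructions")  # ~2.7x
```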
* Re: Minutes from Feb 21 LSE Call
From: Alan Cox @ 2003-02-23 21:35 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Sun, 2003-02-23 at 20:21, Xavier Bestel wrote:
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip?

gzip doesn't work because it's not unpackable from an arbitrary point.
x86 is in many ways compressed, with common codes carefully bit-packed.
A horrible CISC design constraint for size has come full circle and
turned into a very nice memory/cache optimisation.
* Re: Minutes from Feb 21 LSE Call
From: Linus Torvalds @ 2003-02-23 21:41 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linux Kernel Mailing List

On 23 Feb 2003, Xavier Bestel wrote:
> On Sun, 23 Feb 2003 at 20:17, Linus Torvalds wrote:
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares?  Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip?

Not gzip, no.  It needs to be a random-access compression with
reasonably small blocks, not something designed for streaming.  Which
makes it harder to do right and efficiently.

But ARM has Thumb (not the same thing, but same idea), and at least
some PPC chips have a page-based compressor - IBM calls it "CodePack"
in case you want to google for it.

		Linus
* Re: Minutes from Feb 21 LSE Call
From: Bill Davidsen @ 2003-02-24  0:01 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linus Torvalds, Linux Kernel Mailing List

On 23 Feb 2003, Xavier Bestel wrote:
> On Sun, 23 Feb 2003 at 20:17, Linus Torvalds wrote:
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares?  Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip?

If the firmware issues were better defined in Intel ia32 chips, I
could see a gzip instruction pointing to blocks in memory.  As a proof
of concept, though, not a big win.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Minutes from Feb 21 LSE Call
From: yodaiken @ 2003-02-24  0:36 UTC (permalink / raw)
To: Xavier Bestel; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Sun, Feb 23, 2003 at 09:21:27PM +0100, Xavier Bestel wrote:
> On Sun, 23 Feb 2003 at 20:17, Linus Torvalds wrote:
> > And the baroque instruction encoding on the x86 is actually a _good_
> > thing: it's a rather dense encoding, which means that you win on icache.
> > It's a bit hard to decode, but who cares?  Existing chips do well at
> > decoding, and thanks to the icache win they tend to perform better - and
> > they load faster too (which is important - you can make your CPU have
> > big caches, but _nothing_ saves you from the cold-cache costs).
>
> Next step: hardware gzip?

See ARM "Thumb".
* Re: Minutes from Feb 21 LSE Call
From: John Bradford @ 2003-02-23 21:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

> >If I didn't know this mattered I wouldn't bother with the barfbags.
> >I just wouldn't deal with it.
>
> Why?
>
> The x86 is a hell of a lot nicer than the ppc32, for example.  On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.

I could be wrong, but I always thought that Sparc and a lot of other
architectures could mark arbitrary areas of memory (such as the stack)
as non-executable, whereas x86 only lets you have one non-executable
segment.

John.
* Re: Minutes from Feb 21 LSE Call
From: Linus Torvalds @ 2003-02-23 21:45 UTC (permalink / raw)
To: John Bradford; +Cc: linux-kernel

On Sun, 23 Feb 2003, John Bradford wrote:
>
> I could be wrong, but I always thought that Sparc and a lot of other
> architectures could mark arbitrary areas of memory (such as the stack)
> as non-executable, whereas x86 only lets you have one non-executable
> segment.

The x86 has that stupid "executability is tied to a segment" thing,
which means that you cannot make things executable on a page-by-page
level.  It's a mistake, but it's one that _could_ be fixed in the
architecture if it really mattered, the same way the WP bit got fixed
in the i486.

I'm definitely not saying that the x86 is perfect.  It clearly isn't.
But a lot of people complain about the wrong things, and a lot of
people who tried to "fix" things just made them worse by throwing out
the good parts too.

		Linus
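[Editorial aside: the page-granular execute control being discussed is what the POSIX mprotect() interface expresses. A Python/ctypes sketch of revoking PROT_EXEC on a single page follows; on the x86 chips of the time the hardware would not actually enforce the missing execute bit, which is Linus's point, while later NX-capable CPUs do:]

```python
import ctypes
import mmap

# Map one anonymous read/write/execute page.
page = mmap.mmap(-1, mmap.PAGESIZE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
addr = ctypes.addressof(ctypes.c_char.from_buffer(page))

# Revoke execute permission on just that one page via libc's mprotect().
# CDLL(None) resolves symbols from the running process, a common way to
# reach libc without naming a specific shared object.
libc = ctypes.CDLL(None, use_errno=True)
rc = libc.mprotect(ctypes.c_void_p(addr), mmap.PAGESIZE,
                   mmap.PROT_READ | mmap.PROT_WRITE)
assert rc == 0  # page is now non-executable where the hardware enforces it
```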
* Re: Minutes from Feb 21 LSE Call
From: Benjamin LaHaise @ 2003-02-24  1:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: John Bradford, linux-kernel

On Sun, Feb 23, 2003 at 01:45:16PM -0800, Linus Torvalds wrote:
> The x86 has that stupid "executability is tied to a segment" thing,
> which means that you cannot make things executable on a page-by-page
> level.  It's a mistake, but it's one that _could_ be fixed in the
> architecture if it really mattered, the same way the WP bit got fixed
> in the i486.

I've been thinking about this recently, and it turns out that the
whole point is moot with a fixed-address vsyscall page: non-exec
stacks are trivially circumvented by using the vsyscall page as a
known starting point for the exploit.  All the other tricks of
changing the starting stack offset and using randomized load addresses
don't help at all, since the exploit can merely use the vsyscall page
to perform various operations.

Personally, I'm still a fan of the shared-library vsyscall trick,
which would allow us to randomize its load address and defeat this
problem.

		-ben
* Re: Minutes from Feb 21 LSE Call
From: William Lee Irwin III @ 2003-02-23 21:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
>> If I didn't know this mattered I wouldn't bother with the barfbags.
>> I just wouldn't deal with it.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The x86 is a hell of a lot nicer than the ppc32, for example.  On the
> x86, you get good performance and you can ignore the design mistakes (ie
> segmentation) by just basically turning them off.

We "basically" turn it off, but I was recently reminded it existed, as
LDTs are apparently wanted by something in userspace.  There seem to be
various other unwelcome reminders floating around performance-critical
paths as well.  I vaguely remember segmentation being the only way to
enforce execution permissions for mmap(), which we just don't bother
doing.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> On the ppc32, the MMU braindamage is not something you can ignore, you
> have to write your OS for it and if you turn it off (ie enable soft-fill
> on the ones that support it) you now have to have separate paths in the
> OS for it.

The hashtables don't bother me very much.  They can relatively easily
be front-ended by radix-tree pagetables anyway, and if it sucks, well,
no software in the world can save sucky hardware.  Hopefully later
models fix it to be fast or disablable.  I'm more bothered by x86
lacking ASNs.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> And the baroque instruction encoding on the x86 is actually a _good_
> thing: it's a rather dense encoding, which means that you win on icache.
> It's a bit hard to decode, but who cares?  Existing chips do well at
> decoding, and thanks to the icache win they tend to perform better - and
> they load faster too (which is important - you can make your CPU have
> big caches, but _nothing_ saves you from the cold-cache costs).

I'm not so sure; between things like cacheline-aligning branch targets
and space/time tradeoffs where smaller instructions run slower than
larger sequences of instructions, this stuff gets pretty strange.  It
still comes out smaller in the end, but by a smaller-than-expected
though probably still significant margin.  There's a good chunk of the
instruction set that should probably just be dumped outright, too.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The low register count isn't an issue when you code in any high-level
> language, and it has actually forced x86 implementors to do a hell of a
> lot better job than the competition when it comes to memory loads and
> stores - which helps in general.  While the RISC people were off trying
> to optimize their compilers to generate loops that used all 32 registers
> efficiently, the x86 implementors instead made the chip run fast on
> varied loads and used tons of register renaming hardware (and looking at
> _memory_ renaming too).

Invariably we get stuck diving into assembly anyway. =)  This one is
basically me getting irked by looking at disassemblies of random x86
binaries and seeing vast amounts of register spilling.  It's probably
not a performance issue aside from code bloat, especially given the
amount of trickery with the weird L1 cache stack magic and so on.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> IA64 made all the mistakes anybody else did, and threw out all the good
> parts of the x86 because people thought those parts were ugly.  They
> aren't ugly, they're the "charming oddity" that makes it do well.  Look
> at them the right way and you realize that a lot of the grottyness is
> exactly _why_ the x86 works so well (yeah, and the fact that they are
> everywhere ;).

Count me as "not charmed".  We've actually tripped over this stuff, and
for the most part you've been personally squashing the super-low-level
bugs like the NT flag business and vsyscall segmentation oddities.
IA64 suffers from truly excessive featuritis, and there are relatively
good chances some (or all) of its features will be every bit as unused
and hated as segmentation, if it actually survives.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> The only real major failure of the x86 is the PAE crud.  Let's hope
> we'll get to forget it, the same way the DOS people eventually forgot
> about their memory extenders.

We've not really been able to forget about segments or ISA DMA...  The
pessimist in me has more or less already resigned me to PAE as a fact
of life.

On Sun, Feb 23, 2003 at 07:17:30PM +0000, Linus Torvalds wrote:
> (Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
> will matter, and people can overlook the grottiness there.  Right now
> Intel doesn't even seem to be interested in "64-bit for the masses", and
> maybe IBM will be.  AMD certainly seems to be serious about the "masses"
> part, which in the end is the only part that really matters).

ppc64 is sane in my book (and that's not vendor nepotism; the other
"vanilla RISC" machines get the same rating in my book).  No idea about
the marketing stuff.

-- wli
* Re: Minutes from Feb 21 LSE Call
  2003-02-23  8:07 ` David Lang
  2003-02-23  8:20 ` William Lee Irwin III
@ 2003-02-23 19:13 ` David Mosberger
  2003-02-23 23:28 ` Benjamin LaHaise
  2003-02-26  8:46 ` Eric W. Biederman
  2003-02-23 20:48 ` Gerrit Huizenga
  2 siblings, 2 replies; 124+ messages in thread
From: David Mosberger @ 2003-02-23 19:13 UTC (permalink / raw)
To: David Lang
Cc: Gerrit Huizenga, Benjamin LaHaise, William Lee Irwin III, Jeff Garzik, linux-kernel

>>>>> On Sun, 23 Feb 2003 00:07:50 -0800 (PST), David Lang <david.lang@digitalinsight.com> said:

David.L> Garrit, you missed the preior posters point. IA64 had the
David.L> same fundamental problem as the Alpha, PPC, and Sparc
David.L> processors, it doesn't run x86 binaries.

This simply isn't true. Itanium and Itanium 2 have full x86 hardware built into the chip (for better or worse ;-). The speed isn't as good as the fastest x86 chips today, but it's faster (~300MHz P6) than the PCs many of us are using, and it certainly meets my needs better than any other x86 "emulation" I have used in the past (which includes FX!32 and its relatives for Alpha).

--david

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-23 19:13 ` David Mosberger
@ 2003-02-23 23:28 ` Benjamin LaHaise
  2003-02-26  8:46 ` Eric W. Biederman
  1 sibling, 0 replies; 124+ messages in thread
From: Benjamin LaHaise @ 2003-02-23 23:28 UTC (permalink / raw)
To: David Mosberger
Cc: David Lang, Gerrit Huizenga, William Lee Irwin III, Jeff Garzik, linux-kernel

On Sun, Feb 23, 2003 at 11:13:03AM -0800, David Mosberger wrote:
> This simply isn't true. Itanium and Itanium 2 have full x86 hardware
> built into the chip (for better or worse ;-). The speed isn't as good
> as the fastest x86 chips today, but it's faster (~300MHz P6) than the

That hardly counts as reasonably performant: the slowest mainstream chips from Intel and AMD are clocked well over 1 GHz. At least x86-64 will improve the performance of the 32 bit databases people have already invested large amounts of money in, and it will do so without the need for a massive outlay of funds for a new 64 bit license. Why accept more than 10x the cost to migrate to ia64, when a new x86-64 system will improve the speed of existing applications and improve scalability with the transparent addition of a 64 bit kernel?

-ben
--
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-23 19:13 ` David Mosberger 2003-02-23 23:28 ` Benjamin LaHaise @ 2003-02-26 8:46 ` Eric W. Biederman 1 sibling, 0 replies; 124+ messages in thread From: Eric W. Biederman @ 2003-02-26 8:46 UTC (permalink / raw) To: davidm Cc: David Lang, Gerrit Huizenga, Benjamin LaHaise, William Lee Irwin III, Jeff Garzik, linux-kernel David Mosberger <davidm@napali.hpl.hp.com> writes: > >>>>> On Sun, 23 Feb 2003 00:07:50 -0800 (PST), David Lang > <david.lang@digitalinsight.com> said: > > > David.L> Garrit, you missed the preior posters point. IA64 had the > David.L> same fundamental problem as the Alpha, PPC, and Sparc > David.L> processors, it doesn't run x86 binaries. > > This simply isn't true. Itanium and Itanium 2 have full x86 hardware > built into the chip (for better or worse ;-). The speed isn't as good > as the fastest x86 chips today, but it's faster (~300MHz P6) than the > PCs many of us are using and it certainly meets my needs better than > any other x86 "emulation" I have used in the past (which includes > FX!32 and its relatives for Alpha). I have various random x86 binaries that do not work. My 32bit x86 user space does not run. A 32bit kernel doesn't have a chance. So for me at least the 32bit support is not useful in avoiding converting binaries. For the handful of apps that cannot be recompiled I suspect the support is good enough so you can get them to run somehow. Eric ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-23 8:07 ` David Lang 2003-02-23 8:20 ` William Lee Irwin III 2003-02-23 19:13 ` David Mosberger @ 2003-02-23 20:48 ` Gerrit Huizenga 2 siblings, 0 replies; 124+ messages in thread From: Gerrit Huizenga @ 2003-02-23 20:48 UTC (permalink / raw) To: David Lang Cc: Benjamin LaHaise, William Lee Irwin III, Jeff Garzik, linux-kernel On Sun, 23 Feb 2003 00:07:50 PST, David Lang wrote: > Garrit, you missed the preior posters point. IA64 had the same fundamental > problem as the Alpha, PPC, and Sparc processors, it doesn't run x86 > binaries. IA64 *can* run IA32 binaries, just more slowly than native IA64 code. gerrit ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-23  1:17 ` Benjamin LaHaise
  2003-02-23  5:21 ` Gerrit Huizenga
@ 2003-02-23  9:37 ` William Lee Irwin III
  1 sibling, 0 replies; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-23 9:37 UTC (permalink / raw)
To: Benjamin LaHaise; +Cc: Jeff Garzik, linux-kernel

On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
>> I'm not sure what's so nice about x86-64; another opcode prefix
>> controlled extension atop the festering pile of existing x86 crud

On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> What's nice about x86-64 is that it runs existing 32 bit apps fast and
> doesn't suffer from the blisteringly small caches that were part of your
> rant. Plus, x86-64 binaries are not horrifically bloated like ia64.
> Not to mention that the amount of reengineering in compilers like
> gcc required to get decent performance out of it is actually sane.

Rant? It was just a catalogue of other things that are nasty. The point was that PAE's not special; it's one of a very long list of very ugly uglinesses, and my list wasn't anywhere near exhaustive. But yes, more cache is good. Unfortunately, the amount of baggage from 32-bit x86 stuff still puts a good chunk of systems programming into the old bring-your-own-barfbag territory.

On Sat, Feb 22, 2003 at 02:18:20PM -0800, William Lee Irwin III wrote:
>> sounds every bit as bad as any other attempt to prolong x86. Some of
>> the system device -level cleanups like the HPET look nice, though.

On Sat, Feb 22, 2003 at 08:17:24PM -0500, Benjamin LaHaise wrote:
> HPET is part of one of the PCYY specs and even available on 32 bit x86,
> there are just not that many bug free implements yet. Since x86-64 made
> it part of the base platform and is testing it from launch, they actually
> have a chance at being debugged in the mass market versions.

Well, it beats the heck out of the TSC and the PIT, and x86-64 is apparently supposed to have it "for real". I'm not excited at all about another opcode prefix and pagetable format.

-- wli

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 6:39 ` Martin J. Bligh 2003-02-22 8:38 ` Jeff Garzik @ 2003-02-22 8:38 ` David S. Miller 1 sibling, 0 replies; 124+ messages in thread From: David S. Miller @ 2003-02-22 8:38 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Larry McVoy, Hanna Linder, lse-tech, linux-kernel On Fri, 2003-02-21 at 22:39, Martin J. Bligh wrote: > > Lots of people working for companies who haven't figured out how to do > > it as well as Dell *say* it can't be done but numbers say differently. > > And how much of that was profit on PCs running Linux? Or PCs period, they make tons of bucks on servers and associated support contracts. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 5:05 ` Larry McVoy 2003-02-22 6:39 ` Martin J. Bligh @ 2003-02-22 8:38 ` David S. Miller 2003-02-22 14:34 ` Larry McVoy 1 sibling, 1 reply; 124+ messages in thread From: David S. Miller @ 2003-02-22 8:38 UTC (permalink / raw) To: Larry McVoy; +Cc: Martin J. Bligh, Hanna Linder, lse-tech, linux-kernel On Fri, 2003-02-21 at 21:05, Larry McVoy wrote: > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and > $500M/quarter in profit. While I understand these numbers are on the mark, there is a tertiary issue to realize. Dell makes money on many things other than thin-margin PCs. And lo' and behold one of those things is selling the larger Intel based servers and support contracts to go along with that. And so you're nearly supporting Martin's arguments for supporting large servers better under Linux by bringing up Dell's balance sheet :-) ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 8:38 ` David S. Miller @ 2003-02-22 14:34 ` Larry McVoy 2003-02-22 15:47 ` Martin J. Bligh 0 siblings, 1 reply; 124+ messages in thread From: Larry McVoy @ 2003-02-22 14:34 UTC (permalink / raw) To: David S. Miller Cc: Larry McVoy, Martin J. Bligh, Hanna Linder, lse-tech, linux-kernel On Sat, Feb 22, 2003 at 12:38:33AM -0800, David S. Miller wrote: > On Fri, 2003-02-21 at 21:05, Larry McVoy wrote: > > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and > > $500M/quarter in profit. > > While I understand these numbers are on the mark, there is a tertiary > issue to realize. > > Dell makes money on many things other than thin-margin PCs. And lo' > and behold one of those things is selling the larger Intel based > servers and support contracts to go along with that. I did some digging trying to find that ratio before I posted last night and couldn't. You obviously think that the servers are a significant part of their business. I'd be surprised at that, but that's cool, what are the numbers? PC's, monitors, disks, laptops, anything with less than 4 cpus is in the little bucket, so how much revenue does Dell generate on the 4 CPU and larger servers? -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 14:34 ` Larry McVoy @ 2003-02-22 15:47 ` Martin J. Bligh 2003-02-22 16:13 ` Larry McVoy 0 siblings, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-22 15:47 UTC (permalink / raw) To: Larry McVoy, David S. Miller; +Cc: lse-tech, linux-kernel >> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and >> > $500M/quarter in profit. >> >> While I understand these numbers are on the mark, there is a tertiary >> issue to realize. >> >> Dell makes money on many things other than thin-margin PCs. And lo' >> and behold one of those things is selling the larger Intel based >> servers and support contracts to go along with that. > > I did some digging trying to find that ratio before I posted last night > and couldn't. You obviously think that the servers are a significant > part of their business. I'd be surprised at that, but that's cool, > what are the numbers? PC's, monitors, disks, laptops, anything with less > than 4 cpus is in the little bucket, so how much revenue does Dell generate > on the 4 CPU and larger servers? It's not a question of revenue, it's one of profit. Very few people buy desktops for use with Linux, compared to those that buy them for Windows. The profit on each PC is small, thus I still think a substantial proportion of the profit made by hardware vendors by Linux is on servers rather than desktop PCs. The numbers will be smaller for high end machines, but the profit margins are much higher. M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 15:47 ` Martin J. Bligh @ 2003-02-22 16:13 ` Larry McVoy 2003-02-22 16:29 ` Martin J. Bligh 2003-02-24 18:00 ` Timothy D. Witham 0 siblings, 2 replies; 124+ messages in thread From: Larry McVoy @ 2003-02-22 16:13 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Larry McVoy, David S. Miller, lse-tech, linux-kernel On Sat, Feb 22, 2003 at 07:47:53AM -0800, Martin J. Bligh wrote: > >> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and > >> > $500M/quarter in profit. > >> > >> While I understand these numbers are on the mark, there is a tertiary > >> issue to realize. > >> > >> Dell makes money on many things other than thin-margin PCs. And lo' > >> and behold one of those things is selling the larger Intel based > >> servers and support contracts to go along with that. > > > > I did some digging trying to find that ratio before I posted last night > > and couldn't. You obviously think that the servers are a significant > > part of their business. I'd be surprised at that, but that's cool, > > what are the numbers? PC's, monitors, disks, laptops, anything with less > > than 4 cpus is in the little bucket, so how much revenue does Dell generate > > on the 4 CPU and larger servers? > > It's not a question of revenue, it's one of profit. Very few people buy > desktops for use with Linux, compared to those that buy them for Windows. > The profit on each PC is small, thus I still think a substantial proportion > of the profit made by hardware vendors by Linux is on servers rather than > desktop PCs. The numbers will be smaller for high end machines, but the > profit margins are much higher. That's all handwaving and has no meaning without numbers. I could care less if Dell has 99.99% margins on their servers, if they only sell $50M of servers a quarter that is still less than 10% of their quarterly profit. So what are the actual *numbers*? Your point makes sense if and only if people sell lots of server. 
I spent a few minutes in google: world wide server sales are $40B at the moment. The overwhelming majority of that revenue is small servers. Let's say that Dell has 20% of that market, that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet you long long odds that that is 90% of their revenue in the server space. Supposing that's right, that's $200M/quarter in big iron sales. Out of $8000M/quarter. I'd love to see data which is different than this but you'll have a tough time finding it. More and more companies are looking at the cost of big iron and deciding it doesn't make sense to spend $20K/CPU when they could be spending $1K/CPU. Look at Google, try selling them some big iron. Look at Wall Street - abandoning big iron as fast as they can. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 124+ messages in thread
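The back-of-envelope estimate in the message above can be reproduced in a few lines. This is only a sketch of Larry's stated assumptions, not verified market data: the worldwide $40B server figure is treated as annual revenue (which is what makes his "$2B/quarter" follow), and the 20% market share and 90% small-server split are his guesses.

```python
# Sketch of the back-of-envelope estimate in the email above.
# Every figure here is an assumption quoted from the message,
# not verified market data.
annual_server_market = 40e9          # "world wide server sales are $40B"
dell_share = 0.20                    # "Let's say that Dell has 20%"
dell_server_rev_per_q = annual_server_market * dell_share / 4

small_server_fraction = 0.90         # guess: 1-2 CPU boxes are 90% of it
big_iron_per_q = dell_server_rev_per_q * (1 - small_server_fraction)

dell_total_rev_per_q = 8e9           # "revenues of $8B/quarter"
big_iron_share = big_iron_per_q / dell_total_rev_per_q

print(f"Dell server revenue/quarter: ${dell_server_rev_per_q / 1e9:.1f}B")
print(f"Big-iron revenue/quarter:    ${big_iron_per_q / 1e6:.0f}M")
print(f"Share of total revenue:      {big_iron_share:.1%}")
```

Under those assumptions the numbers come out as in the email: $2B/quarter of server revenue, $200M/quarter of big-iron revenue, or about 2.5% of Dell's total quarterly revenue — which is the crux of the "big iron is a small slice" argument.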
* Re: Minutes from Feb 21 LSE Call 2003-02-22 16:13 ` Larry McVoy @ 2003-02-22 16:29 ` Martin J. Bligh 2003-02-22 16:33 ` Larry McVoy 2003-02-24 18:00 ` Timothy D. Witham 1 sibling, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-22 16:29 UTC (permalink / raw) To: Larry McVoy; +Cc: David S. Miller, lse-tech, linux-kernel > That's all handwaving and has no meaning without numbers. I could care less > if Dell has 99.99% margins on their servers, if they only sell $50M of servers > a quarter that is still less than 10% of their quarterly profit. > > So what are the actual *numbers*? Your point makes sense if and only if > people sell lots of server. I spent a few minutes in google: world wide > server sales are $40B at the moment. The overwhelming majority of that > revenue is small servers. Let's say that Dell has 20% of that market, > that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet > you long long odds that that is 90% of their revenue in the server space. > Supposing that's right, that's $200M/quarter in big iron sales. Out of > $8000M/quarter. > > I'd love to see data which is different than this but you'll have a tough > time finding it. More and more companies are looking at the cost of > big iron and deciding it doesn't make sense to spend $20K/CPU when they > could be spending $1K/CPU. Look at Google, try selling them some big > iron. Look at Wall Street - abandoning big iron as fast as they can. But we're talking about linux ... and we're talking about profit, not revenue. I'd guess that 99% of their desktop sales are for Windows. And I'd guess they make 100 times as much profit on a big server as they do on a desktop PC. Would be nice if someone had real numbers, but I doubt they're published except in non-free corporate research reports. M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 16:29 ` Martin J. Bligh @ 2003-02-22 16:33 ` Larry McVoy 2003-02-22 16:39 ` Martin J. Bligh 0 siblings, 1 reply; 124+ messages in thread From: Larry McVoy @ 2003-02-22 16:33 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Larry McVoy, David S. Miller, lse-tech, linux-kernel On Sat, Feb 22, 2003 at 08:29:34AM -0800, Martin J. Bligh wrote: > > people sell lots of server. I spent a few minutes in google: world wide > > server sales are $40B at the moment. The overwhelming majority of that > > revenue is small servers. Let's say that Dell has 20% of that market, > > that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet > > you long long odds that that is 90% of their revenue in the server space. > > Supposing that's right, that's $200M/quarter in big iron sales. Out of > > $8000M/quarter. > > > > I'd love to see data which is different than this but you'll have a tough > > time finding it. More and more companies are looking at the cost of > > big iron and deciding it doesn't make sense to spend $20K/CPU when they > > could be spending $1K/CPU. Look at Google, try selling them some big > > iron. Look at Wall Street - abandoning big iron as fast as they can. > > But we're talking about linux ... and we're talking about profit, not > revenue. I'd guess that 99% of their desktop sales are for Windows. > And I'd guess they make 100 times as much profit on a big server as they > do on a desktop PC. You are thinking in today's terms. Find the asymptote and project out. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 16:33 ` Larry McVoy @ 2003-02-22 16:39 ` Martin J. Bligh 2003-02-22 16:59 ` John Bradford 0 siblings, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-22 16:39 UTC (permalink / raw) To: Larry McVoy; +Cc: David S. Miller, lse-tech, linux-kernel >> But we're talking about linux ... and we're talking about profit, not >> revenue. I'd guess that 99% of their desktop sales are for Windows. >> And I'd guess they make 100 times as much profit on a big server as they >> do on a desktop PC. > > You are thinking in today's terms. Find the asymptote and project out. OK, I predict that Linux will take over the whole of the high end server market ... if people stop complaining about us fixing scalability. That should give some nicer numbers .... M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-22 16:39 ` Martin J. Bligh
@ 2003-02-22 16:59 ` John Bradford
  0 siblings, 0 replies; 124+ messages in thread
From: John Bradford @ 2003-02-22 16:59 UTC (permalink / raw)
To: Martin J. Bligh; +Cc: lm, davem, lse-tech, linux-kernel

> OK, I predict that Linux will take over the whole of the high end server
> market ... if people stop complaining about us fixing scalability. That
> should give some nicer numbers ....

Extending the useful life of current hardware will shift profit even further towards support contracts, and away from hardware sales. Imagine the performance gain a webserver serving mostly static content, with light database and scripting usage, is going to see moving from a 2.4 -> 2.6 kernel? Zero copy and filesystem improvements alone will extend its useful life dramatically, in my opinion.

John.

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 16:13 ` Larry McVoy 2003-02-22 16:29 ` Martin J. Bligh @ 2003-02-24 18:00 ` Timothy D. Witham 1 sibling, 0 replies; 124+ messages in thread From: Timothy D. Witham @ 2003-02-24 18:00 UTC (permalink / raw) To: Larry McVoy; +Cc: Martin J. Bligh, David S. Miller, lse-tech, linux-kernel On Sat, 2003-02-22 at 08:13, Larry McVoy wrote: > On Sat, Feb 22, 2003 at 07:47:53AM -0800, Martin J. Bligh wrote: > > >> > Let's see, Dell has a $66B market cap, revenues of $8B/quarter and > > >> > $500M/quarter in profit. > > >> > > >> While I understand these numbers are on the mark, there is a tertiary > > >> issue to realize. > > >> > > >> Dell makes money on many things other than thin-margin PCs. And lo' > > >> and behold one of those things is selling the larger Intel based > > >> servers and support contracts to go along with that. > > > > > > I did some digging trying to find that ratio before I posted last night > > > and couldn't. You obviously think that the servers are a significant > > > part of their business. I'd be surprised at that, but that's cool, > > > what are the numbers? PC's, monitors, disks, laptops, anything with less > > > than 4 cpus is in the little bucket, so how much revenue does Dell generate > > > on the 4 CPU and larger servers? > > > > It's not a question of revenue, it's one of profit. Very few people buy > > desktops for use with Linux, compared to those that buy them for Windows. > > The profit on each PC is small, thus I still think a substantial proportion > > of the profit made by hardware vendors by Linux is on servers rather than > > desktop PCs. The numbers will be smaller for high end machines, but the > > profit margins are much higher. > > That's all handwaving and has no meaning without numbers. I could care less > if Dell has 99.99% margins on their servers, if they only sell $50M of servers > a quarter that is still less than 10% of their quarterly profit. 
> So what are the actual *numbers*? Your point makes sense if and only if
> people sell lots of server. I spent a few minutes in google: world wide
> server sales are $40B at the moment. The overwhelming majority of that
> revenue is small servers. Let's say that Dell has 20% of that market,
> that's $2B/quarter. Now let's chop off the 1-2 CPU systems. I'll bet
> you long long odds that that is 90% of their revenue in the server space.
> Supposing that's right, that's $200M/quarter in big iron sales. Out of
> $8000M/quarter.

The numbers that I have seen are covered under an NDA, so I can't put them out, but an important point to note is that while there is a very sharp decrease in the number of servers sold as you go higher up into the price bands, the total $ in revenue is hourglass-shaped, with the neck being in a price band that corresponds to a 4-way server. The total $ spent on the highest band of servers is about equal to the total $ spent on the lowest price band of servers. But the margins for the high end are much better than the margins for the lowest band.

> I'd love to see data which is different than this but you'll have a tough
> time finding it. More and more companies are looking at the cost of
> big iron and deciding it doesn't make sense to spend $20K/CPU when they
> could be spending $1K/CPU. Look at Google, try selling them some big
> iron. Look at Wall Street - abandoning big iron as fast as they can.

Oh, you can see it; it will just cost you about $50,000 to get the survey from the company that spends all the money putting it together.

On the size of the system, every system should be as big as it needs to be. Some problems partition nicely, like Google, but other ones do not, like accounts receivable. It all seems to come down to the question, "Does the data _naturally_ partition?" If it does, then you should either use lots of small servers or an s/390-type solution with lots of instances. However, if the data doesn't naturally partition, you should use one large machine, as you will spend more money on people trying to manage the servers than you would have spent initially on the hardware.

Also, you need to look at the backend systems in places like Wall Street; those are big machines, have been for a long time, and aren't changing out. But it doesn't make a good story.

Tim
--
Timothy D. Witham <wookie@osdl.org>
Open Source Development Lab, Inc

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 0:16 ` Larry McVoy 2003-02-22 0:25 ` William Lee Irwin III 2003-02-22 0:44 ` Martin J. Bligh @ 2003-02-22 8:32 ` David S. Miller 2003-02-22 18:20 ` Alan Cox 2003-02-23 0:37 ` Eric W. Biederman 4 siblings, 0 replies; 124+ messages in thread From: David S. Miller @ 2003-02-22 8:32 UTC (permalink / raw) To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel On Fri, 2003-02-21 at 16:16, Larry McVoy wrote: > In terms of the money and in terms of installed seats, the small Linux > machines out number the 4 or more CPU SMP machines easily 10,000:1. While I totally agree with your points, I want to mention that although this ratio is true, the exact opposite ratio applies to the price of the service contracts a company can land with the big machines :-) ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-22  0:16 ` Larry McVoy
  ` (2 preceding siblings ...)
  2003-02-22  8:32 ` David S. Miller
@ 2003-02-22 18:20 ` Alan Cox
  2003-02-22 20:05 ` William Lee Irwin III
  2003-02-22 21:36 ` Gerrit Huizenga
  2003-02-23  0:37 ` Eric W. Biederman
  4 siblings, 2 replies; 124+ messages in thread
From: Alan Cox @ 2003-02-22 18:20 UTC (permalink / raw)
To: Larry McVoy; +Cc: Hanna Linder, lse-tech, Linux Kernel Mailing List

On Sat, 2003-02-22 at 00:16, Larry McVoy wrote:
> In terms of the money and in terms of installed seats, the small Linux
> machines out number the 4 or more CPU SMP machines easily 10,000:1.
> And with the embedded market being one of the few real money makers
> for Linux, there will be huge pushback from those companies against
> changes which increase memory footprint.

I think people overestimate the number of large boxes badly. Several IDE pre-patches didn't work on highmem boxes. It took *ages* for people to actually notice there was a problem. The desktop world is still 128-256Mb, and some of the crap people push is problematic even there. In the embedded space, where there is a *ton* of money to be made by smart people, a lot of the 2.5 choices look very questionable indeed - but not all by any means; we are, for example, close to being able to dump the block layer, shrink stacks down by using IRQ stacks, and other good stuff.

I'm hoping the Montavista and IBM people will swat each others' bogons 8)

Alan

^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 18:20 ` Alan Cox @ 2003-02-22 20:05 ` William Lee Irwin III 2003-02-22 21:35 ` Alan Cox 2003-02-22 21:36 ` Gerrit Huizenga 1 sibling, 1 reply; 124+ messages in thread From: William Lee Irwin III @ 2003-02-22 20:05 UTC (permalink / raw) To: Alan Cox; +Cc: Larry McVoy, Hanna Linder, lse-tech, Linux Kernel Mailing List On Sat, 2003-02-22 at 00:16, Larry McVoy wrote: >> And with the embedded market being one of the few real money makers >> for Linux, there will be huge pushback from those companies against >> changes which increase memory footprint. On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote: > I think people overestimate the numbner of large boxes badly. Several IDE > pre-patches didn't work on highmem boxes. It took *ages* for people to > actually notice there was a problem. The desktop world is still 128-256Mb > and some of the crap people push is problematic even there. In the embedded > space where there is a *ton* of money to be made by smart people a lot > of the 2.5 choices look very questionable indeed - but not all by any > means, we are for example close to being able to dump the block layer, > shrink stacks down by using IRQ stacks and other good stuff. Well, I've never seen IDE in a highmem box, and there's probably a good reason for it. The space trimmings sound pretty interesting. IRQ stacks in general sound good just to mitigate stackblowings due to IRQ pounding. On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote: > I'm hoping the Montavista and IBM people will swat each others bogons 8) Sounds like a bigger win for the bigboxen, since space matters there, but large-scale SMP efficiency probably doesn't make a difference to embedded (though I think some 2x embedded systems are floating around). -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 20:05 ` William Lee Irwin III @ 2003-02-22 21:35 ` Alan Cox 0 siblings, 0 replies; 124+ messages in thread From: Alan Cox @ 2003-02-22 21:35 UTC (permalink / raw) To: William Lee Irwin III Cc: Larry McVoy, Hanna Linder, lse-tech, Linux Kernel Mailing List On Sat, 2003-02-22 at 20:05, William Lee Irwin III wrote: > On Sat, Feb 22, 2003 at 06:20:19PM +0000, Alan Cox wrote: > > I'm hoping the Montavista and IBM people will swat each others bogons 8) > > Sounds like a bigger win for the bigboxen, since space matters there, > but large-scale SMP efficiency probably doesn't make a difference to > embedded (though I think some 2x embedded systems are floating around). Smaller cleaner code is a win for everyone, and it often pays off in ways that are not immediately obvious. For example having your entire kernel working set and running app fitting in the L2 cache happens to be very good news to most people. Alan ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 18:20 ` Alan Cox 2003-02-22 20:05 ` William Lee Irwin III @ 2003-02-22 21:36 ` Gerrit Huizenga 2003-02-22 21:42 ` Christoph Hellwig 2003-02-23 23:23 ` Bill Davidsen 1 sibling, 2 replies; 124+ messages in thread From: Gerrit Huizenga @ 2003-02-22 21:36 UTC (permalink / raw) To: Alan Cox; +Cc: Larry McVoy, Hanna Linder, lse-tech, Linux Kernel Mailing List On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote: > I think people overestimate the numbner of large boxes badly. Several IDE > pre-patches didn't work on highmem boxes. It took *ages* for people to > actually notice there was a problem. The desktop world is still 128-256Mb IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB is a fun toy, but bigger than *I* need, even for development purposes. But I don't think EMC, Clariion (low end EMC), Shark, etc. have any IDE products for my 8-proc 16 GB machine... And running pre-patches in a production environment that might expose this would be a little silly as well. Probably a bad example to extrapolate large system numbers from. gerrit ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 21:36 ` Gerrit Huizenga @ 2003-02-22 21:42 ` Christoph Hellwig 2003-02-23 23:23 ` Bill Davidsen 1 sibling, 0 replies; 124+ messages in thread From: Christoph Hellwig @ 2003-02-22 21:42 UTC (permalink / raw) To: Gerrit Huizenga Cc: Alan Cox, Larry McVoy, Hanna Linder, lse-tech, Linux Kernel Mailing List On Sat, Feb 22, 2003 at 01:36:31PM -0800, Gerrit Huizenga wrote: > IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB > is a fun toy, but bigger than *I* need, even for development purposes. > But I don't think EMC, Clariion (low end EMC), Shark, etc. have any > IDE products for my 8-proc 16 GB machine... And running pre-patches in > a production environment that might expose this would be a little > silly as well. > > Probably a bad example to extrapolate large system numbers from. At least the SGI Altix does have an IDE/ATAPI CDROM drive :) ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 21:36 ` Gerrit Huizenga 2003-02-22 21:42 ` Christoph Hellwig @ 2003-02-23 23:23 ` Bill Davidsen 2003-02-24 3:31 ` Gerrit Huizenga 1 sibling, 1 reply; 124+ messages in thread From: Bill Davidsen @ 2003-02-23 23:23 UTC (permalink / raw) To: Gerrit Huizenga; +Cc: lse-tech, Linux Kernel Mailing List On Sat, 22 Feb 2003, Gerrit Huizenga wrote: > On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote: > > I think people overestimate the numbner of large boxes badly. Several IDE > > pre-patches didn't work on highmem boxes. It took *ages* for people to > > actually notice there was a problem. The desktop world is still 128-256Mb > > IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB > is a fun toy, but bigger than *I* need, even for development purposes. > But I don't think EMC, Clariion (low end EMC), Shark, etc. have any > IDE products for my 8-proc 16 GB machine... And running pre-patches in > a production environment that might expose this would be a little > silly as well. I don't disagree with most of your point, however there certainly are legitimate uses for big boxes with small (IDE) disk. Those which first come to mind are all computational problems, in which a small dataset is read from disk and then processors beat on the data. More or less common examples are graphics transformations (original and final data compressed), engineering calculations such as finite element analysis, rendering (raytracing) type calculations, and data analysis (things like setiathome or automated medical image analysis). IDE drives are very cost effective, and low cost motherboard RAID is certainly useful for preserving the results of large calculations on small (relatively) datasets. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-23 23:23 ` Bill Davidsen @ 2003-02-24 3:31 ` Gerrit Huizenga 2003-02-24 4:02 ` Larry McVoy 0 siblings, 1 reply; 124+ messages in thread From: Gerrit Huizenga @ 2003-02-24 3:31 UTC (permalink / raw) To: Bill Davidsen; +Cc: lse-tech, Linux Kernel Mailing List On Sun, 23 Feb 2003 18:23:01 EST, Bill Davidsen wrote: > On Sat, 22 Feb 2003, Gerrit Huizenga wrote: > > > On 22 Feb 2003 18:20:19 GMT, Alan Cox wrote: > > > I think people overestimate the number of large boxes badly. Several IDE > > > pre-patches didn't work on highmem boxes. It took *ages* for people to > > > actually notice there was a problem. The desktop world is still 128-256Mb > > > > IDE on big boxes? Is that crack I smell burning? A desktop with 4 GB > > is a fun toy, but bigger than *I* need, even for development purposes. > > But I don't think EMC, Clariion (low end EMC), Shark, etc. have any > > IDE products for my 8-proc 16 GB machine... And running pre-patches in > > a production environment that might expose this would be a little > > silly as well. > > I don't disagree with most of your point, however there certainly are > legitimate uses for big boxes with small (IDE) disk. Those which first > come to mind are all computational problems, in which a small dataset is > read from disk and then processors beat on the data. More or less common > examples are graphics transformations (original and final data > compressed), engineering calculations such as finite element analysis, > rendering (raytracing) type calculations, and data analysis (things like > setiathome or automated medical image analysis). Yeah, and as Christoph pointed out, a lot of big machines have IDE-based CD-ROMs. And there *are* some IDE disk subsystems with 1 TB on an IDE bus and such, but there just aren't enough IDE busses or PCI slots on most big machines to span out to the really high disk capacities or large numbers of spindles. 
But some of the compute engines could either be net-booted (no local disk) or have a cheap, small disk for boot, small static storage (couple hundred GB range) etc. But most people don't connect big machines to IDE drive subsystems. gerrit ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-24 3:31 ` Gerrit Huizenga @ 2003-02-24 4:02 ` Larry McVoy 2003-02-24 4:15 ` Russell Leighton ` (2 more replies) 0 siblings, 3 replies; 124+ messages in thread From: Larry McVoy @ 2003-02-24 4:02 UTC (permalink / raw) To: Gerrit Huizenga; +Cc: Bill Davidsen, lse-tech, Linux Kernel Mailing List On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote: > But most > people don't connect big machines to IDE drive subsystems. 3ware controllers. They look like SCSI to the host, but use cheap IDE drives on the back end. Really nice cards. bkbits.net runs on one. -- --- Larry McVoy lm at bitmover.com http://www.bitmover.com/lm ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-24 4:02 ` Larry McVoy @ 2003-02-24 4:15 ` Russell Leighton 2003-02-24 5:11 ` William Lee Irwin III 2003-02-24 8:07 ` Christoph Hellwig 2 siblings, 0 replies; 124+ messages in thread From: Russell Leighton @ 2003-02-24 4:15 UTC (permalink / raw) To: Larry McVoy Cc: Gerrit Huizenga, Bill Davidsen, lse-tech, Linux Kernel Mailing List Yup. Great price and super price/performance. Gotta luv it. Larry McVoy wrote: >On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote: > >>But most >>people don't connect big machines to IDE drive subsystems. >> > >3ware controllers. They look like SCSI to the host, but use cheap IDE >drives on the back end. Really nice cards. bkbits.net runs on one. > ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-24 4:02 ` Larry McVoy 2003-02-24 4:15 ` Russell Leighton @ 2003-02-24 5:11 ` William Lee Irwin III 2003-02-24 8:07 ` Christoph Hellwig 2 siblings, 0 replies; 124+ messages in thread From: William Lee Irwin III @ 2003-02-24 5:11 UTC (permalink / raw) To: Larry McVoy, Gerrit Huizenga, Bill Davidsen, lse-tech, Linux Kernel Mailing List On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote: >> But most people don't connect big machines to IDE drive subsystems. > On Sun, Feb 23, 2003 at 08:02:46PM -0800, Larry McVoy wrote: > 3ware controllers. They look like SCSI to the host, but use cheap IDE > drives on the back end. Really nice cards. bkbits.net runs on one. A quick back of the napkin estimate guesstimates that this 3ware stuff would max at 6 racks of disks on NUMA-Q or 3/8 of a rack per node (ignoring cabling, which looks infeasible, but never mind that), which is a smaller capacity than I remember FC having. NUMA-Q's a bit optimistic for 3ware because it has buttloads of PCI slots in comparison to more modern machines. -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-24 4:02 ` Larry McVoy 2003-02-24 4:15 ` Russell Leighton 2003-02-24 5:11 ` William Lee Irwin III @ 2003-02-24 8:07 ` Christoph Hellwig 2 siblings, 0 replies; 124+ messages in thread From: Christoph Hellwig @ 2003-02-24 8:07 UTC (permalink / raw) To: Larry McVoy, Gerrit Huizenga, Bill Davidsen, lse-tech, Linux Kernel Mailing List On Sun, Feb 23, 2003 at 08:02:46PM -0800, Larry McVoy wrote: > On Sun, Feb 23, 2003 at 07:31:26PM -0800, Gerrit Huizenga wrote: > > But most > > people don't connect big machines to IDE drive subsystems. > > 3ware controllers. They look like SCSI to the host, but use cheap IDE > drives on the back end. Really nice cards. bkbits.net runs on one. That's true (similar for some nice scsi2ide external raid boxens), but Alan's original argument was about the Linux IDE driver on big machines, which is used by neither. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-22 0:16 ` Larry McVoy ` (3 preceding siblings ...) 2003-02-22 18:20 ` Alan Cox @ 2003-02-23 0:37 ` Eric W. Biederman 4 siblings, 0 replies; 124+ messages in thread From: Eric W. Biederman @ 2003-02-23 0:37 UTC (permalink / raw) To: Larry McVoy; +Cc: Hanna Linder, lse-tech, linux-kernel Larry McVoy <lm@bitmover.com> writes: > > Ben said none of the distros are supporting these large > > systems right now. Martin said UL is already starting to support > > them. > > Ben is right. I think IBM and the other big iron companies would be > far better served looking at what they have done with running multiple > instances of Linux on one big machine, like the 390 work. Figure out > how to use that model to scale up. There is simply not a big enough > market to justify shoveling lots of scaling stuff in for huge machines > that only a handful of people can afford. That's the same path which > has sunk all the workstation companies, they all have bloated OS's and > Linux runs circles around them. Larry, it isn't that Linux isn't being scaled in the way you suggest. But for the people who really care about scalability, having a single system image is not the most important thing, so making it look like one system is secondary. Linux clusters are currently among the top 5 supercomputers of the world. And there the question is how do you make 1200 machines look like one. And how do you handle the reliability issues. When MTBF becomes a predictor for how many times a week someone needs to replace hardware, the problem is very different from a simple SMP. And there seems to be a fairly substantial market for huge machines, for people who need high performance. All kinds of problems require enormous amounts of data crunching. So far the low hanging fruit on large clusters is still with making the hardware and the systems actually work. But increasingly having a single high performance distributed filesystem is becoming important. 
But look at projects like bproc, mosix, and lustre. Not the best things in the world but the work is getting done. Scalability is easy. The hard part is making it look like one machine when you are done. Eric ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-21 23:48 Minutes from Feb 21 LSE Call Hanna Linder 2003-02-22 0:16 ` Larry McVoy @ 2003-02-23 0:42 ` Eric W. Biederman 2003-02-23 14:29 ` Rik van Riel 2003-02-23 3:24 ` Andrew Morton 2 siblings, 1 reply; 124+ messages in thread From: Eric W. Biederman @ 2003-02-23 0:42 UTC (permalink / raw) To: Hanna Linder; +Cc: lse-tech, linux-kernel Hanna Linder <hannal@us.ibm.com> writes: > LSE Con Call Minutes from Feb21 > > Minutes compiled by Hanna Linder hannal@us.ibm.com, please post > corrections to lse-tech@lists.sf.net. > > Object Based Reverse Mapping: > (Dave McCracken, Ben LaHaise, Rik van Riel, Martin Bligh, Gerrit Huizenga) > > Ben said none of the users have been complaining about > performance with the existing rmap. Martin disagreed and said Linus, > Andrew Morton and himself have all agreed there is a problem. > One of the problems Martin is already hitting on high cpu machines with > large memory is the space consumption by all the pte-chains filling up > memory and killing the machine. There is also a performance impact of > maintaining the chains. Note: rmap chains can be restricted to an arbitrary length, or an arbitrary total count trivially. All you have to do is allow a fixed limit on the number of people who can map a page simultaneously. The selection of which chain to unmap can be a bit tricky but is relatively straight forward. Why doesn't someone who is seeing this just hack this up? Eric ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-23 0:42 ` Eric W. Biederman @ 2003-02-23 14:29 ` Rik van Riel 2003-02-23 17:28 ` Eric W. Biederman 0 siblings, 1 reply; 124+ messages in thread From: Rik van Riel @ 2003-02-23 14:29 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Hanna Linder, lse-tech, linux-kernel On Sat, 22 Feb 2003, Eric W. Biederman wrote: > Note: rmap chains can be restricted to an arbitrary length, or an > arbitrary total count trivially. All you have to do is allow a fixed > limit on the number of people who can map a page simultaneously. > > The selection of which chain to unmap can be a bit tricky but is > relatively straight forward. Why doesn't someone who is seeing > this just hack this up? I'm not sure how useful this feature would be. Also, there are a bunch of corner cases in which you cannot limit the number of processes mapping a page, think about eg. mlock, nonlinear vmas and anonymous memory. All in all I suspect that the cost of such a feature might be higher than any benefits. cheers, Rik -- Engineers don't grow up, they grow sideways. http://www.surriel.com/ http://kernelnewbies.org/ ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-23 14:29 ` Rik van Riel @ 2003-02-23 17:28 ` Eric W. Biederman 2003-02-24 1:42 ` Benjamin LaHaise 0 siblings, 1 reply; 124+ messages in thread From: Eric W. Biederman @ 2003-02-23 17:28 UTC (permalink / raw) To: Rik van Riel; +Cc: Hanna Linder, lse-tech, linux-kernel Rik van Riel <riel@imladris.surriel.com> writes: > On Sat, 22 Feb 2003, Eric W. Biederman wrote: > > > Note: rmap chains can be restricted to an arbitrary length, or an > > arbitrary total count trivially. All you have to do is allow a fixed > > limit on the number of people who can map a page simultaneously. > > > > The selection of which chain to unmap can be a bit tricky but is > > relatively straight forward. Why doesn't someone who is seeing > > this just hack this up? > > I'm not sure how useful this feature would be. The problem. There is no upper bound to how many rmap entries there can be at one time. And the unbounded growth can overwhelm a machine. The goal is to provide an overall system cap on the number of rmap entries. > Also, > there are a bunch of corner cases in which you cannot > limit the number of processes mapping a page, think > about eg. mlock, nonlinear vmas and anonymous memory. Unless something has changed for nonlinear vmas, and anonymous memory we have been storing enough information to recover the page in the page tables for ages. For mlock we want a cap on the number of pages that are locked, so it should not be a problem. But even then we don't have to guarantee the page is constantly in the processes page table, simply that the mlocked page is never swapped out. > All in all I suspect that the cost of such a feature > might be higher than any benefits. Cost? What Cost? The simple implementation is to walk the page lists and unmap the pages that are least likely to be used next. This is not something new. We have been doing this in 2.4.x and before for years. 
Before it just never freed up rmap entries, as well as preparing a page to be paged out. Eric ^ permalink raw reply [flat|nested] 124+ messages in thread
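Eric's capping scheme is easy to model outside the kernel. The sketch below is a userspace toy, not kernel code: `RMAP_CHAIN_MAX`, `struct rpage`, and `rmap_add_capped()` are all hypothetical names invented here. It shows the core idea from his mail: a fixed-size per-page chain where adding a mapper beyond the cap evicts an old entry, whose mapping stays recoverable by walking page tables, just as the 2.4-era scanner always did.

```c
#include <assert.h>
#include <stddef.h>

#define RMAP_CHAIN_MAX 4                /* hypothetical per-page cap */

struct rmap_pte { void *pte; };

/* Toy stand-in for struct page: only the reverse-mapping chain. */
struct rpage {
    struct rmap_pte chain[RMAP_CHAIN_MAX];
    int nr_chained;                     /* entries tracked; real mappers may exceed this */
};

/* Record a reverse mapping.  If the chain is already at the cap,
 * evict the oldest entry (FIFO victim selection -- the "a bit tricky"
 * part Eric mentions) and return it so the caller can unmap that pte.
 * Returns NULL when nothing was evicted. */
void *rmap_add_capped(struct rpage *pg, void *pte)
{
    void *evicted = NULL;
    int i;

    if (pg->nr_chained == RMAP_CHAIN_MAX) {
        evicted = pg->chain[0].pte;
        for (i = 1; i < RMAP_CHAIN_MAX; i++)
            pg->chain[i - 1] = pg->chain[i];
        pg->nr_chained--;
    }
    pg->chain[pg->nr_chained++].pte = pte;
    return evicted;
}
```

With the cap at 4, the fifth mapper forces the first one out, so chain memory is bounded at RMAP_CHAIN_MAX entries per page no matter how many processes share the page — which is the system-wide bound Eric is after, at the cost of occasionally re-faulting an evicted mapping.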
* Re: Minutes from Feb 21 LSE Call 2003-02-23 17:28 ` Eric W. Biederman @ 2003-02-24 1:42 ` Benjamin LaHaise 0 siblings, 0 replies; 124+ messages in thread From: Benjamin LaHaise @ 2003-02-24 1:42 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Rik van Riel, Hanna Linder, lse-tech, linux-kernel On Sun, Feb 23, 2003 at 10:28:04AM -0700, Eric W. Biederman wrote: > The problem. There is no upper bound to how many rmap > entries there can be at one time. And the unbounded > growth can overwhelm a machine. Eh? By that logic there's no bound to the number of vmas that can exist at a given time. But there is a bound on the number that a single process can force the system into using, and that limit also caps the number of rmap entries the process can bring into existence. Virtual address space is not free, and there are already mechanisms in place to limit it which, given that the number of rmap entries is directly proportional to the amount of virtual address space in use, probably need proper configuration. > The goal is to provide an overall system cap on the number > of rmap entries. No, the goal is to have a stable system under a variety of workloads that performs well. User-exploitable worst case behaviour is a bad idea. Hybrid solves that at the expense of added complexity. -ben -- Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a> ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-21 23:48 Minutes from Feb 21 LSE Call Hanna Linder 2003-02-22 0:16 ` Larry McVoy 2003-02-23 0:42 ` Eric W. Biederman @ 2003-02-23 3:24 ` Andrew Morton 2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh 2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli 2 siblings, 2 replies; 124+ messages in thread From: Andrew Morton @ 2003-02-23 3:24 UTC (permalink / raw) To: Hanna Linder; +Cc: lse-tech, linux-kernel Hanna Linder <hannal@us.ibm.com> wrote: > > > Dave coded up an initial patch for partial object based rmap > which he sent to linux-mm yesterday. I've run some numbers on this. Looks like it reclaims most of the fork/exec/exit rmap overhead. The testcase is applying and removing 64 kernel patches using my patch management scripts. I use this because a) It's a real workload, which someone cares about and b) It's about as forky as anything is ever likely to be, without being a stupid microbenchmark. Testing is on the fast P4-HT, everything in pagecache. 2.4.21-pre4: 8.10 seconds 2.5.62-mm3 with objrmap: 9.95 seconds (+1.85) 2.5.62-mm3 without objrmap: 10.86 seconds (+0.91) Current 2.5 is 2.76 seconds slower, and this patch reclaims 0.91 of those seconds. So who stole the remaining 1.85 seconds? Looks like pte_highmem. 
Here is 2.5.62-mm3, with objrmap:

c013042c find_get_page           601    10.7321
c01333dc free_hot_cold_page      641     2.7629
c0207130 __copy_to_user_ll       687     6.6058
c011450c flush_tlb_page          725     6.4732
c0139ba0 clear_page_tables       841     2.4735
c011718c pte_alloc_one           910     6.5000
c013b56c do_anonymous_page       954     1.7667
c013b788 do_no_page             1044     1.6519
c015b59c d_lookup               1096     3.2619
c013ba00 handle_mm_fault        1098     4.6525
c0108d14 system_call            1116    25.3636
c0137240 release_pages          1828     6.4366
c013a1f4 zap_pte_range          2616     4.8806
c013f5c0 page_add_rmap          2776     8.3614
c0139eac copy_page_range        2994     3.5643
c013f70c page_remove_rmap       3132     6.2640
c013adb4 do_wp_page             6712     8.4322
c01172e0 do_page_fault          8788     7.7496
c0106ed8 poll_idle             99878  1189.0238
00000000 total                158601     0.0869

Note one second spent in pte_alloc_one().

Here is 2.4.21-pre4, with the following functions uninlined:

pte_t *pte_alloc_one(struct mm_struct *mm, unsigned long address);
pte_t *pte_alloc_one_fast(struct mm_struct *mm, unsigned long address);
void pte_free_fast(pte_t *pte);
void pte_free_slow(pte_t *pte);

c0252950 atomic_dec_and_lock      36     0.4800
c0111778 flush_tlb_mm             37     0.3304
c0129c3c file_read_actor          37     0.2569
c025282c strnlen_user             43     0.5119
c012b35c generic_file_write       46     0.0283
c0114c78 schedule                 48     0.0361
c0129050 unlock_page              53     0.4907
c0140974 link_path_walk           57     0.0237
c0116740 copy_mm                  62     0.0852
c0130740 __free_pages_ok          62     0.0963
c0126afc handle_mm_fault          63     0.3424
c01254c0 __free_pte               67     0.8816
c0129198 __find_get_page          67     0.9853
c01309c4 rmqueue                  70     0.1207
c011ae0c exit_notify              77     0.1075
c0149b34 d_lookup                 81     0.2774
c0126874 do_anonymous_page        83     0.3517
c0126960 do_no_page               86     0.2087
c01117e8 flush_tlb_page          105     0.8750
c0106f54 system_call             138     2.4643
c01255c8 copy_page_range         197     0.4603
c0130ffc __free_pages            204     5.6667
c0125774 zap_page_range          262     0.3104
c0126330 do_wp_page              775     1.4904
c0113c18 do_page_fault           864     0.7030
c01052f8 poll_idle              6803   170.0750
00000000 total                 11923     0.0087

Note the lack of pte_alloc_one_slow(). So we need the page table cache back. 
We cannot put it in slab, because slab does not do highmem. I believe the best way to solve this is to implement a per-cpu LIFO head array of known-to-be-zeroed pages in the page allocator. Populate it with free_zeroed_page(), grab pages from it with __GFP_ZEROED. This is a simple extension to the existing hot and cold head arrays, and I have patches, and they don't work. Something in the pagetable freeing path seems to be putting back pages which are not fully zeroed, and I didn't get onto debugging it. It would be nice to get it going, because a number of architectures can perhaps nuke their private pagetable caches. I shall drop the patches in next-mm/experimental and look hopefully at Dave ;) ^ permalink raw reply [flat|nested] 124+ messages in thread
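Andrew's proposed head-array extension can be sketched in ordinary C. Everything below is a hypothetical userspace model — `free_zeroed_page()` and the `__GFP_ZEROED` semantics are taken from his description, not from any kernel tree — showing the intended fast path: a per-cpu LIFO hands back known-zeroed pages so the pagetable allocator can skip the clear.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define ZERO_STACK_MAX 16               /* assumed per-cpu head-array depth */

/* One CPU's LIFO of pages known to contain only zeroes. */
struct zero_stack {
    void *pages[ZERO_STACK_MAX];
    int top;
};

/* Model of allocating with __GFP_ZEROED: pop a pre-zeroed page when
 * one is cached (hot path, no clearing), otherwise fall back to
 * allocate-and-clear, which is exactly the cost showing up under
 * pte_alloc_one() in the profile above. */
void *alloc_zeroed_page(struct zero_stack *zs)
{
    if (zs->top > 0)
        return zs->pages[--zs->top];
    return calloc(1, PAGE_SIZE);
}

/* Model of free_zeroed_page(): the caller must guarantee the page is
 * still all-zero -- pages sneaking into the pool without that
 * guarantee holding is the bug Andrew describes hitting.
 * Returns 1 if the page was cached, 0 if it was handed back. */
int free_zeroed_page(struct zero_stack *zs, void *page)
{
    if (zs->top < ZERO_STACK_MAX) {
        zs->pages[zs->top++] = page;
        return 1;
    }
    free(page);
    return 0;
}
```

The LIFO order matters: the most recently freed page is the most likely to still be cache-hot, which is the same reasoning behind the existing hot/cold head arrays this scheme extends.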
* object-based rmap and pte-highmem 2003-02-23 3:24 ` Andrew Morton @ 2003-02-23 16:14 ` Martin J. Bligh 2003-02-23 19:20 ` Linus Torvalds 2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli 1 sibling, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-23 16:14 UTC (permalink / raw) To: Andrew Morton; +Cc: lse-tech, linux-kernel, haveblue, dmccr > So who stole the remaining 1.85 seconds? Looks like pte_highmem. I have a plan for that (UKVA) ... we reserve a per-process area with kernel type protections (either at the top of user space, changing permissions appropriately, or inside kernel space, changing per-process vs global appropriately). This area is permanently mapped into each process, so that there's no kmap_atomic / tlb_flush_one overhead ... it's highmem backed still. In order to do fork efficiently, we may need space for 2 sets of pagetables (12Mb on PAE). Dave McCracken had an earlier implementation of that, but we never saw an improvement (quite possibly because the fork double-space wasn't there) - Dave Hansen is now trying to get something working with current kernels ... will let you know. M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: object-based rmap and pte-highmem 2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh @ 2003-02-23 19:20 ` Linus Torvalds 2003-02-23 20:16 ` Martin J. Bligh 0 siblings, 1 reply; 124+ messages in thread From: Linus Torvalds @ 2003-02-23 19:20 UTC (permalink / raw) To: linux-kernel In article <11090000.1046016895@[10.10.2.4]>, Martin J. Bligh <mbligh@aracnet.com> wrote: >> So who stole the remaining 1.85 seconds? Looks like pte_highmem. > >I have a plan for that (UKVA) ... we reserve a per-process area with >kernel type protections (either at the top of user space, changing >permissions appropriately, or inside kernel space, changing per-process >vs global appropriately). Nobody ever seems to have solved the threading impact of UKVA's. I told Andrea about it almost a year ago, and his reaction was "oh, duh!" and he couldn't come up with a solution either. The thing is, you _cannot_ have a per-thread area, since all threads share the same TLB. And if it isn't per-thread, you still need all the locking and all the scalability stuff that the _current_ pte_highmem code needs, since there are people with thousands of threads in the same process. Until somebody _addresses_ this issue with UKVA, I consider UKVA to be a pipe-dream of people who haven't thought it through. Linus ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: object-based rmap and pte-highmem 2003-02-23 19:20 ` Linus Torvalds @ 2003-02-23 20:16 ` Martin J. Bligh 2003-02-23 21:37 ` Linus Torvalds 0 siblings, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-23 20:16 UTC (permalink / raw) To: Linus Torvalds, linux-kernel >> I have a plan for that (UKVA) ... we reserve a per-process area with >> kernel type protections (either at the top of user space, changing >> permissions appropriately, or inside kernel space, changing per-process >> vs global appropriately). > > Nobody ever seems to have solved the threading impact of UKVA's. I told > Andrea about it almost a year ago, and his reaction was "oh, duh!" and > couldn't come up with a solution either. > > The thing is, you _cannot_ have a per-thread area, since all threads > share the same TLB. And if it isn't per-thread, you still need all the > locking and all the scalability stuff that the _current_ pte_highmem > code needs, since there are people with thousands of threads in the same > process. > > Until somebody _addresses_ this issue with UKVA, I consider UKVA to be a > pipe-dream of people who haven't thought it through. I don't see why that's an issue - the pagetables are per-process, not per-thread. Yes, that was a stalling point for sticking kmap in there, which was amongst my original plotting for it, but the stuff that's per-process still works. I'm not suggesting kmapping them dynamically (though it's rather like permanent kmap), I'm suggesting making enough space so we have them all there for each process all the time. None of this tiny little window shifting around stuff ... M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: object-based rmap and pte-highmem 2003-02-23 20:16 ` Martin J. Bligh @ 2003-02-23 21:37 ` Linus Torvalds 2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh 0 siblings, 1 reply; 124+ messages in thread From: Linus Torvalds @ 2003-02-23 21:37 UTC (permalink / raw) To: Martin J. Bligh; +Cc: linux-kernel On Sun, 23 Feb 2003, Martin J. Bligh wrote: > > > > The thing is, you _cannot_ have a per-thread area, since all threads > > share the same TLB. And if it isn't per-thread, you still need all the > > locking and all the scalability stuff that the _current_ pte_highmem > > code needs, since there are people with thousands of threads in the same > > process. > > I don't see why that's an issue - the pagetables are per-process, not > per-thread. Exactly. Which means that UKVA has all the same problems as the current global map. There are _NO_ differences. Any problems you have with the current global map you would have with UKVA in threads. So I don't see what you expect to win from UKVA. > Yes, that was a stalling point for sticking kmap in there, which was > amongst my original plotting for it, but the stuff that's per-process > still works. Exactly what _is_ "per-process"? The only thing that is per-process is stuff that is totally local to the VM, by the linux definition. And the rmap stuff certainly isn't "local to the VM". Yes, it is torn down and built up by the VM, but it needs to be traversed by global code. Linus ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem) 2003-02-23 21:37 ` Linus Torvalds @ 2003-02-23 22:07 ` Martin J. Bligh 2003-02-23 22:10 ` William Lee Irwin III 2003-02-24 3:07 ` Martin J. Bligh 0 siblings, 2 replies; 124+ messages in thread From: Martin J. Bligh @ 2003-02-23 22:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel >> > The thing is, you _cannot_ have a per-thread area, since all threads >> > share the same TLB. And if it isn't per-thread, you still need all the >> > locking and all the scalability stuff that the _current_ pte_highmem >> > code needs, since there are people with thousands of threads in the >> > same process. >> >> I don't see why that's an issue - the pagetables are per-process, not >> per-thread. > > Exactly. Which means that UKVA has all the same problems as the current > global map. > > There are _NO_ differences. Any problems you have with the current global > map you would have with UKVA in threads. So I don't see what you expect > to win from UKVA. This is just for PTEs ... for which at the moment we have two choices: 1. Stick them in lowmem (fills up the global space too much). 2. Stick them in highmem - too much overhead doing k(un)map_atomic as measured by both myself and Andrew. Using UKVA for PTEs seems to be a better way to implement pte-highmem to me. If you're walking another process's pagetables, you just kmap them as now, but I think this will avoid most of the kmap'ing (if we have space for two sets of pagetables so we can do a little bit of trickery at fork time). >> Yes, that was a stalling point for sticking kmap in there, which was >> amongst my original plotting for it, but the stuff that's per-process >> still works. > > Exactly what _is_ "per-process"? The only thing that is per-process is > stuff that is totally local to the VM, by the linux definition. The pagetables. > And the rmap stuff certainly isn't "local to the VM". 
> Yes, it is torn down and built up by the VM, but it needs to be traversed by global code. Sorry, subject was probably misleading ... I'm just talking about the PTEs here, not sticking anything to do with rmap into UKVA. Partially object-based rmap is cool for other reasons, that have little to do with this. ;-) M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem) 2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh @ 2003-02-23 22:10 ` William Lee Irwin III 2003-02-24 0:31 ` Linus Torvalds 2003-02-24 3:07 ` Martin J. Bligh 1 sibling, 1 reply; 124+ messages in thread From: William Lee Irwin III @ 2003-02-23 22:10 UTC (permalink / raw) To: Martin J. Bligh; +Cc: Linus Torvalds, linux-kernel On Sun, Feb 23, 2003 at 02:07:42PM -0800, Martin J. Bligh wrote: > Using UKVA for PTEs seems to be a better way to implement pte-highmem to me. > If you're walking another processes' pagetables, you just kmap them as now, > but I think this will avoid most of the kmap'ing (if we have space for two > sets of pagetables so we can do a little bit of trickery at fork time). Another term for "UKVA for pagetables only" is "recursive pagetables", if this helps clarify anything. -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem) 2003-02-23 22:10 ` William Lee Irwin III @ 2003-02-24 0:31 ` Linus Torvalds 0 siblings, 0 replies; 124+ messages in thread From: Linus Torvalds @ 2003-02-24 0:31 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Martin J. Bligh, linux-kernel On Sun, 23 Feb 2003, William Lee Irwin III wrote: > > Another term for "UKVA for pagetables only" is "recursive pagetables", > if this helps clarify anything. Oh, ok. We did that for alpha, and it was a good deal there (it's actually architected for alpha). So yes, I don't mind doing it for the page tables, and it should work fine on x86 too (it's not necessarily a very portable approach, since it requires that the pmd- and the pte- tables look the same, which is not always true). So sure, go ahead with that part. Linus ^ permalink raw reply [flat|nested] 124+ messages in thread
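For readers unfamiliar with the trick wli names above: with one page-directory slot pointing back at the page directory itself, the MMU's own walk exposes every PTE inside a fixed virtual window, so finding the PTE for an address is pure arithmetic — no kmap, no per-cpu window. The sketch below models 32-bit x86 without PAE (1024 entries per table, 4 KB pages); the slot number `SELF` is an assumed value for illustration, not a kernel constant.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed self-map slot; any otherwise-unused pgd index would do. */
#define SELF 1023u

/* With pgd[SELF] pointing at the page directory itself, all PTEs
 * appear in the 4 MB window starting at SELF << 22.  The PTE for
 * virtual address va sits at word index (va >> 12) within it. */
uint32_t pte_vaddr(uint32_t va)
{
    return (SELF << 22) | ((va >> 12) << 2);
}

/* Recursing one level further (pgd seen through its own self-map
 * entry) exposes the page-directory entries as well. */
uint32_t pde_vaddr(uint32_t va)
{
    return (SELF << 22) | (SELF << 12) | ((va >> 22) << 2);
}
```

This is also why Linus notes it is not very portable: the scheme only works where pmd- and pte-level tables have the same entry format, which holds on x86 and (by architectural design) on alpha, but not everywhere.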
* Re: pte-highmem vs UKVA (was: object-based rmap and pte-highmem) 2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh 2003-02-23 22:10 ` William Lee Irwin III @ 2003-02-24 3:07 ` Martin J. Bligh 1 sibling, 0 replies; 124+ messages in thread From: Martin J. Bligh @ 2003-02-24 3:07 UTC (permalink / raw) To: linux-kernel > This just just for PTEs ... for which at the moment we have two choices: > 1. Stick them in lowmem (fills up the global space too much). > 2. Stick them in highmem - too much overhead doing k(un)map_atomic > as measured by both myself and Andrew. Actually Andrew's measurements seem to be a bit different from mine ... several different things all interacting. I'll try to get some more measurements from a straight SMP box, and see if they correlate more closely with what he's seeing. M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-23 3:24 ` Andrew Morton 2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh @ 2003-02-25 17:17 ` Andrea Arcangeli 2003-02-25 17:43 ` William Lee Irwin III 1 sibling, 1 reply; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 17:17 UTC (permalink / raw) To: Andrew Morton; +Cc: Hanna Linder, lse-tech, linux-kernel On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote: > 2.4.21-pre4: 8.10 seconds > 2.5.62-mm3 with objrmap: 9.95 seconds (+1.85) > 2.5.62-mm3 without objrmap: 10.86 seconds (+0.91) > > Current 2.5 is 2.76 seconds slower, and this patch reclaims 0.91 of those > seconds. > > > So who stole the remaining 1.85 seconds? Looks like pte_highmem. would you mind adding the line for 2.4.21-pre4aa3? It has pte-highmem, so you can easily find out for sure whether it is pte_highmem that stole >10% of your fast cpu. A line for the 2.4-rmap patch would also be interesting. > Note one second spent in pte_alloc_one(). note the seconds spent in the rmap-affected paths too. Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli @ 2003-02-25 17:43 ` William Lee Irwin III 2003-02-25 17:59 ` Andrea Arcangeli 0 siblings, 1 reply; 124+ messages in thread From: William Lee Irwin III @ 2003-02-25 17:43 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote: >> So who stole the remaining 1.85 seconds? Looks like pte_highmem. On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote: > would you mind to add the line for 2.4.21-pre4aa3? it has pte-highmem so > you can easily find it out for sure if it is pte_highmem that stole >10% > of your fast cpu. A line for the 2.4-rmap patch would be also > interesting. On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote: >> Note one second spent in pte_alloc_one(). On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote: > note the seconds spent in the rmap affected paths too. The pagetable cache is gone in 2.5, so pte_alloc_one() takes the bitblitting hit for pagetables. I didn't catch the whole profile, so I'll need numbers for rmap paths. -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 17:43 ` William Lee Irwin III @ 2003-02-25 17:59 ` Andrea Arcangeli 2003-02-25 18:04 ` William Lee Irwin III 2003-02-25 18:50 ` William Lee Irwin III 0 siblings, 2 replies; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 17:59 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote: > On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote: > >> So who stole the remaining 1.85 seconds? Looks like pte_highmem. > > On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote: > > would you mind to add the line for 2.4.21-pre4aa3? it has pte-highmem so > > you can easily find it out for sure if it is pte_highmem that stole >10% > > of your fast cpu. A line for the 2.4-rmap patch would be also > > interesting. > > On Sat, Feb 22, 2003 at 07:24:24PM -0800, Andrew Morton wrote: > >> Note one second spent in pte_alloc_one(). > > On Tue, Feb 25, 2003 at 06:17:27PM +0100, Andrea Arcangeli wrote: > > note the seconds spent in the rmap affected paths too. > > The pagetable cache is gone in 2.5, so pte_alloc_one() takes the > bitblitting hit for pagetables. I'm talking about do_anonymous_page, do_wp_page, do_no_page, fork, and all the other places that introduce spinlocks (per-page) and allocations of 2 pieces of ram rather than just 1 (and in turn potentially global spinlocks too if the cpu-caches are empty). Just grep for pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm not talking about pagetables. Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 17:59 ` Andrea Arcangeli @ 2003-02-25 18:04 ` William Lee Irwin III 2003-02-25 18:50 ` William Lee Irwin III 1 sibling, 0 replies; 124+ messages in thread From: William Lee Irwin III @ 2003-02-25 18:04 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote: >> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the >> bitblitting hit for pagetables. On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote: > I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all > the other places that introduces spinlocks (per-page) and allocations of > 2 pieces of ram rather than just 1 (and in turn potentially global > spinlocks too if the cpu-caches are empty). Just grep for > pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm > not talking about pagetables. Well, pte_alloc_one() has a clear explanation. The fact that the rmap accounting is not free is not news. For anonymous pages performing the analogous vma-based lookup as with Dave McCracken's patch for file-backed pages would require a significant anonymous page accounting rework. -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 17:59 ` Andrea Arcangeli 2003-02-25 18:04 ` William Lee Irwin III @ 2003-02-25 18:50 ` William Lee Irwin III 2003-02-25 19:18 ` Andrea Arcangeli 1 sibling, 1 reply; 124+ messages in thread From: William Lee Irwin III @ 2003-02-25 18:50 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote: >> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the >> bitblitting hit for pagetables. On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote: > I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all > the other places that introduces spinlocks (per-page) and allocations of > 2 pieces of ram rather than just 1 (and in turn potentially global > spinlocks too if the cpu-caches are empty). Just grep for > pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm > not talking about pagetables. Okay, fished out the profiles (w/Dave's optimization): 00000000 total 158601 0.0869 c0106ed8 poll_idle 99878 1189.0238 c01172e0 do_page_fault 8788 7.7496 c013adb4 do_wp_page 6712 8.4322 c013f70c page_remove_rmap 3132 6.2640 c0139eac copy_page_range 2994 3.5643 c013f5c0 page_add_rmap 2776 8.3614 c013a1f4 zap_pte_range 2616 4.8806 c0137240 release_pages 1828 6.4366 c0108d14 system_call 1116 25.3636 c013ba00 handle_mm_fault 1098 4.6525 c015b59c d_lookup 1096 3.2619 c013b788 do_no_page 1044 1.6519 c013b56c do_anonymous_page 954 1.7667 c011718c pte_alloc_one 910 6.5000 c0139ba0 clear_page_tables 841 2.4735 c011450c flush_tlb_page 725 6.4732 c0207130 __copy_to_user_ll 687 6.6058 c01333dc free_hot_cold_page 641 2.7629 c013042c find_get_page 601 10.7321 Just taking the exception dwarfs anything written in C. page_add_rmap() absorbs hits from all of the fault routines and copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range(). 
do_wp_page() is huge because it's doing bitblitting in-line. These things aren't cheap with or without rmap. Trimming down accounting overhead could raise search problems elsewhere. Whether avoiding the search problem is worth the accounting overhead could probably use some more investigation, like actually trying the anonymous page handling rework needed to use vma-based ptov resolution. -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 18:50 ` William Lee Irwin III @ 2003-02-25 19:18 ` Andrea Arcangeli 2003-02-25 19:27 ` Martin J. Bligh 2003-02-25 20:10 ` William Lee Irwin III 0 siblings, 2 replies; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 19:18 UTC (permalink / raw) To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote: > On Tue, Feb 25, 2003 at 09:43:59AM -0800, William Lee Irwin III wrote: > >> The pagetable cache is gone in 2.5, so pte_alloc_one() takes the > >> bitblitting hit for pagetables. > > On Tue, Feb 25, 2003 at 06:59:28PM +0100, Andrea Arcangeli wrote: > > I'm talking about do_anonymous_page, do_wp_page, do_no_page fork and all > > the other places that introduces spinlocks (per-page) and allocations of > > 2 pieces of ram rather than just 1 (and in turn potentially global > > spinlocks too if the cpu-caches are empty). Just grep for > > pte_chain_alloc or page_add_rmap in mm/memory.c, that's what I mean, I'm > > not talking about pagetables. > > Okay, fished out the profiles (w/Dave's optimization): > > 00000000 total 158601 0.0869 > c0106ed8 poll_idle 99878 1189.0238 > c01172e0 do_page_fault 8788 7.7496 > c013adb4 do_wp_page 6712 8.4322 > c013f70c page_remove_rmap 3132 6.2640 > c0139eac copy_page_range 2994 3.5643 > c013f5c0 page_add_rmap 2776 8.3614 > c013a1f4 zap_pte_range 2616 4.8806 > c0137240 release_pages 1828 6.4366 > c0108d14 system_call 1116 25.3636 > c013ba00 handle_mm_fault 1098 4.6525 > c015b59c d_lookup 1096 3.2619 > c013b788 do_no_page 1044 1.6519 > c013b56c do_anonymous_page 954 1.7667 > c011718c pte_alloc_one 910 6.5000 > c0139ba0 clear_page_tables 841 2.4735 > c011450c flush_tlb_page 725 6.4732 > c0207130 __copy_to_user_ll 687 6.6058 > c01333dc free_hot_cold_page 641 2.7629 > c013042c find_get_page 601 10.7321 > > Just taking the exception dwarfs anything written in C. 
> > page_add_rmap() absorbs hits from all of the fault routines and > copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range(). > do_wp_page() is huge because it's doing bitblitting in-line. "absorbing" is a nice word for it. The way I see it, page_add_rmap and page_remove_rmap are even more expensive than the pagetable zapping. They're even more expensive than copy_page_range. Also focus on the numbers in the right column; they are even more interesting for finding what is worth optimizing away first, IMHO. > > These things aren't cheap with or without rmap. Trimming down lots of things aren't cheap, but this isn't a good reason to make them twice as expensive, especially if they were as cheap as possible and they're critical hot paths. > accounting overhead could raise search problems elsewhere. this is the point indeed, but at least in 2.4 I don't see any cpu-saving advantage during swapping, because during swapping the cpu is always idle anyway. In fact I had to drop the lru_cache_add too from the anonymous page fault path because it was wasting way too much cpu to get peak performance (of course you're using per-page spinlocks by hand with rmap, and lru_cache_add needs a global spinlock, so at least rmap shouldn't introduce a very big scalability issue, unlike the lru_cache_add) > Whether avoiding the search problem is worth the accounting overhead > could probably use some more investigation, like actually trying the > anonymous page handling rework needed to use vma-based ptov resolution. the only solution is to do rmap lazily, i.e.
to start building the rmap during swapping by walking the pagetables, basically exactly like I refill the lru with anonymous pages only after I start to need this information recently in my 2.4 tree, so if you never need to pageout heavily several giga of ram (like most of very high end numa servers), you'll never waste a single cycle in locking or whatever other worthless accounting overhead that hurts performance of all common workloads Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 19:18 ` Andrea Arcangeli @ 2003-02-25 19:27 ` Martin J. Bligh 2003-02-25 20:30 ` Andrea Arcangeli 2003-02-25 20:10 ` William Lee Irwin III 1 sibling, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-25 19:27 UTC (permalink / raw) To: Andrea Arcangeli, William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel > the only solution is to do rmap lazily, i.e. to start building the rmap > during swapping by walking the pagetables, basically exactly like I > refill the lru with anonymous pages only after I start to need this > information recently in my 2.4 tree, so if you never need to pageout > heavily several giga of ram (like most of very high end numa servers), > you'll never waste a single cycle in locking or whatever other worthless > accounting overhead that hurts performance of all common workloads Did you see the partially object-based rmap stuff? I think that comes very close to what you want already. M. ^ permalink raw reply [flat|nested] 124+ messages in thread

* Re: Minutes from Feb 21 LSE Call 2003-02-25 19:27 ` Martin J. Bligh @ 2003-02-25 20:30 ` Andrea Arcangeli 2003-02-25 20:53 ` Martin J. Bligh 0 siblings, 1 reply; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 20:30 UTC (permalink / raw) To: Martin J. Bligh Cc: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 11:27:40AM -0800, Martin J. Bligh wrote: > > the only solution is to do rmap lazily, i.e. to start building the rmap > > during swapping by walking the pagetables, basically exactly like I > > refill the lru with anonymous pages only after I start to need this > > information recently in my 2.4 tree, so if you never need to pageout > > heavily several giga of ram (like most of very high end numa servers), > > you'll never waste a single cycle in locking or whatever other worthless > > accounting overhead that hurts performance of all common workloads > > Did you see the partially object-based rmap stuff? I think that does > very close to what you want already. I don't see how it can optimize away the overhead but I didn't look at it for long. Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 20:30 ` Andrea Arcangeli @ 2003-02-25 20:53 ` Martin J. Bligh 2003-02-25 21:17 ` Andrea Arcangeli 0 siblings, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-25 20:53 UTC (permalink / raw) To: Andrea Arcangeli Cc: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel >> > the only solution is to do rmap lazily, i.e. to start building the rmap >> > during swapping by walking the pagetables, basically exactly like I >> > refill the lru with anonymous pages only after I start to need this >> > information recently in my 2.4 tree, so if you never need to pageout >> > heavily several giga of ram (like most of very high end numa servers), >> > you'll never waste a single cycle in locking or whatever other >> > worthless accounting overhead that hurts performance of all common >> > workloads >> >> Did you see the partially object-based rmap stuff? I think that does >> very close to what you want already. > > I don't see how it can optimize away the overhead but I didn't look at > it for long. Because you don't set up and tear down the rmap pte-chains for every fault in / delete of any page ... it just works off the vmas. M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 20:53 ` Martin J. Bligh @ 2003-02-25 21:17 ` Andrea Arcangeli 2003-02-25 21:12 ` Martin J. Bligh 2003-02-25 21:26 ` William Lee Irwin III 0 siblings, 2 replies; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 21:17 UTC (permalink / raw) To: Martin J. Bligh Cc: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 12:53:44PM -0800, Martin J. Bligh wrote: > >> > the only solution is to do rmap lazily, i.e. to start building the rmap > >> > during swapping by walking the pagetables, basically exactly like I > >> > refill the lru with anonymous pages only after I start to need this > >> > information recently in my 2.4 tree, so if you never need to pageout > >> > heavily several giga of ram (like most of very high end numa servers), > >> > you'll never waste a single cycle in locking or whatever other > >> > worthless accounting overhead that hurts performance of all common > >> > workloads > >> > >> Did you see the partially object-based rmap stuff? I think that does > >> very close to what you want already. > > > > I don't see how it can optimize away the overhead but I didn't look at > > it for long. > > Because you don't set up and tear down the rmap pte-chains for every > fault in / delete of any page ... it just works off the vmas. so basically it uses the rmap that we always had since at least 2.2 for everything but anon mappings, right? this is what DaveM did a few years back too. This makes lots of sense to me, so at least we avoid the duplication of rmap information, even if it won't fix the anonymous page overhead, but clearly it's much lower cost for everything but anonymous pages. Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 21:17 ` Andrea Arcangeli @ 2003-02-25 21:12 ` Martin J. Bligh 2003-02-25 22:16 ` Andrea Arcangeli 2003-02-25 21:26 ` William Lee Irwin III 1 sibling, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-25 21:12 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel >> Because you don't set up and tear down the rmap pte-chains for every >> fault in / delete of any page ... it just works off the vmas. > > so basically it uses the rmap that we always had since at least 2.2 for > everything but anon mappings, right? this is what DaveM did a few years > back too. This makes lots of sense to me, so at least we avoid the > duplication of rmap information, even if it won't fix the anonymous page > overhead, but clearly it's much lower cost for everything but anonymous > pages. Right ... and anonymous chains are about 95% single-reference (at least for the case I looked at), so they're direct mapped from the struct page with no chain at all. Cuts out something like 95% of the space overhead of pte-chains, and 65% of the time (for kernel compile -j256 on 16x system). However, it's going to be a little more expensive to *use* the mappings, so we need to measure that carefully. M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 21:12 ` Martin J. Bligh @ 2003-02-25 22:16 ` Andrea Arcangeli 2003-02-25 22:17 ` Martin J. Bligh 0 siblings, 1 reply; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 22:16 UTC (permalink / raw) To: Martin J. Bligh; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel On Tue, Feb 25, 2003 at 01:12:55PM -0800, Martin J. Bligh wrote: > >> Because you don't set up and tear down the rmap pte-chains for every > >> fault in / delete of any page ... it just works off the vmas. > > > > so basically it uses the rmap that we always had since at least 2.2 for > > everything but anon mappings, right? this is what DaveM did a few years > > back too. This makes lots of sense to me, so at least we avoid the > > duplication of rmap information, even if it won't fix the anonymous page > > overhead, but clearly it's much lower cost for everything but anonymous > > pages. > > Right ... and anonymous chains are about 95% single-reference (at least for > the case I looked at), so they're direct mapped from the struct page with > no chain at all. Cuts out something like 95% of the space overhead of > pte-chains, and 65% of the time (for kernel compile -j256 on 16x system). > However, it's going to be a little more expensive to *use* the mappings, > so we need to measure that carefully. Sure, it is more expensive to use them, but all we care about is complexity, and they solve the complexity problem just fine, so I definitely prefer it. Cpu utilization during heavy swapping isn't a big deal IMHO Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 22:16 ` Andrea Arcangeli @ 2003-02-25 22:17 ` Martin J. Bligh 2003-02-25 22:37 ` Andrea Arcangeli 0 siblings, 1 reply; 124+ messages in thread From: Martin J. Bligh @ 2003-02-25 22:17 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel > Sure, it is more expensive to use them, but all we care about is > complexity, and they solve the complexity problem just fine, so I > definitely prefer it. Cpu utilization during heavy swapping isn't a big > deal IMHO I totally agree with you. However the concerns others raised were over page aging and page stealing (e.g. from pagecache), which might not involve disk, but would also be slower. It probably needs some tuning and tweaking, but I'm pretty sure it's fundamentally the right approach. M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 22:17 ` Martin J. Bligh @ 2003-02-25 22:37 ` Andrea Arcangeli 0 siblings, 0 replies; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 22:37 UTC (permalink / raw) To: Martin J. Bligh; +Cc: William Lee Irwin III, Andrew Morton, linux-kernel On Tue, Feb 25, 2003 at 02:17:48PM -0800, Martin J. Bligh wrote: > > Sure, it is more expensive to use them, but all we care about is > > complexity, and they solve the complexity problem just fine, so I > > definitely prefer it. Cpu utilization during heavy swapping isn't a big > > deal IMHO > > I totally agree with you. However the concerns others raised were over > page aging and page stealing (eg from pagecache), which might not involve > disk, but would also be slower. It probably need some tuning and tweaking, > but I'm pretty sure it's fundamentally the right approach. there's no slowdown at all when we don't need to unmap anything. We just need to avoid watching the pte young bit in the pagetables unless we're about to start unmapping stuff. Most machines won't reach the point where they need to start unmapping stuff. Watching the ptes during normal pagecache recycling would be wasteful anyways, regardless what chain we take to reach the pte. Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 21:17 ` Andrea Arcangeli 2003-02-25 21:12 ` Martin J. Bligh @ 2003-02-25 21:26 ` William Lee Irwin III 2003-02-25 22:18 ` Andrea Arcangeli 2003-02-26 5:24 ` Rik van Riel 1 sibling, 2 replies; 124+ messages in thread From: William Lee Irwin III @ 2003-02-25 21:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Martin J. Bligh, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 12:53:44PM -0800, Martin J. Bligh wrote: >> Because you don't set up and tear down the rmap pte-chains for every >> fault in / delete of any page ... it just works off the vmas. On Tue, Feb 25, 2003 at 10:17:18PM +0100, Andrea Arcangeli wrote: > so basically it uses the rmap that we always had since at least 2.2 for > everything but anon mappings, right? this is what DaveM did a few years > back too. This makes lots of sense to me, so at least we avoid the > duplication of rmap information, even if it won't fix the anonymous page > overhead, but clearly it's much lower cost for everything but anonymous > pages. This is what the "anonymous rework" is about. There is already a fix extant for the file-backed case, which I presumed you knew of already, and so we were speaking of issues with the anonymous case. My impression thus far is that the anonymous case has not been pressing with respect to space consumption or cpu time once the file-backed code is in place, though if it resurfaces as a serious concern the anonymous rework can be pursued (along with other things). -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 21:26 ` William Lee Irwin III @ 2003-02-25 22:18 ` Andrea Arcangeli 2003-02-26 5:24 ` Rik van Riel 1 sibling, 0 replies; 124+ messages in thread From: Andrea Arcangeli @ 2003-02-25 22:18 UTC (permalink / raw) To: William Lee Irwin III, Martin J. Bligh, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 01:26:35PM -0800, William Lee Irwin III wrote: > My impression thus far is that the anonymous case has not been pressing > with respect to space consumption or cpu time once the file-backed code > is in place, though if it resurfaces as a serious concern the anonymous > rework can be pursued (along with other things). sounds good to me ;) Andrea ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-25 21:26 ` William Lee Irwin III 2003-02-25 22:18 ` Andrea Arcangeli @ 2003-02-26 5:24 ` Rik van Riel 2003-02-26 5:38 ` William Lee Irwin III 1 sibling, 1 reply; 124+ messages in thread From: Rik van Riel @ 2003-02-26 5:24 UTC (permalink / raw) To: William Lee Irwin III Cc: Andrea Arcangeli, Martin J. Bligh, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, 25 Feb 2003, William Lee Irwin III wrote: > My impression thus far is that the anonymous case has not been pressing > with respect to space consumption or cpu time once the file-backed code > is in place, though if it resurfaces as a serious concern the anonymous > rework can be pursued (along with other things). ... but making the anonymous pages use an object based scheme probably will make things too expensive. IIRC the object based reverse map patches by bcrl and davem both failed on the complexities needed to deal with anonymous pages. My instinct is that a hybrid system will work well in most cases and the worst case with mapped files won't be too bad. cheers, Rik -- Engineers don't grow up, they grow sideways. http://www.surriel.com/ http://kernelnewbies.org/ ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-26 5:24 ` Rik van Riel @ 2003-02-26 5:38 ` William Lee Irwin III 2003-02-26 6:01 ` Martin J. Bligh 0 siblings, 1 reply; 124+ messages in thread From: William Lee Irwin III @ 2003-02-26 5:38 UTC (permalink / raw) To: Rik van Riel Cc: Andrea Arcangeli, Martin J. Bligh, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, 25 Feb 2003, William Lee Irwin III wrote: >> My impression thus far is that the anonymous case has not been pressing >> with respect to space consumption or cpu time once the file-backed code >> is in place, though if it resurfaces as a serious concern the anonymous >> rework can be pursued (along with other things). On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote: > ... but making the anonymous pages use an object based > scheme probably will make things too expensive. > IIRC the object based reverse map patches by bcrl and > davem both failed on the complexities needed to deal > with anonymous pages. > My instinct is that a hybrid system will work well in > most cases and the worst case with mapped files won't > be too bad. The boxen I'm supposed to babysit need a high degree of resource consciousness wrt. lowmem allocations, so there is a clear voice on this issue. IMHO it's still an open question as to whether this is efficient for replacement concerns, which may yet favor objects. -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-26 5:38 ` William Lee Irwin III @ 2003-02-26 6:01 ` Martin J. Bligh 2003-02-26 6:14 ` William Lee Irwin III 2003-02-26 16:02 ` Rik van Riel 0 siblings, 2 replies; 124+ messages in thread From: Martin J. Bligh @ 2003-02-26 6:01 UTC (permalink / raw) To: William Lee Irwin III, Rik van Riel Cc: Andrea Arcangeli, Andrew Morton, Hanna Linder, lse-tech, linux-kernel >>> My impression thus far is that the anonymous case has not been pressing >>> with respect to space consumption or cpu time once the file-backed code >>> is in place, though if it resurfaces as a serious concern the anonymous >>> rework can be pursued (along with other things). > > On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote: >> ... but making the anonymous pages use an object based >> scheme probably will make things too expensive. >> IIRC the object based reverse map patches by bcrl and >> davem both failed on the complexities needed to deal >> with anonymous pages. >> My instinct is that a hybrid system will work well in >> most cases and the worst case with mapped files won't >> be too bad. > > The boxen I'm supposed to babysit need a high degree of resource > consciousness wrt. lowmem allocations, so there is a clear voice It seemed, at least on the simple kernel compile tests that I did, that all the long chains are not anonymous. It killed 95% of the space issue, which given the simplicity of the patch was pretty damned stunning. Yes, there's a pointer per page I guess we could kill in the struct page itself, but I think you already have a better method for killing mem_map bloat ;-) M. ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-26 6:01 ` Martin J. Bligh @ 2003-02-26 6:14 ` William Lee Irwin III 2003-02-26 6:32 ` William Lee Irwin III 2003-02-26 16:02 ` Rik van Riel 1 sibling, 1 reply; 124+ messages in thread From: William Lee Irwin III @ 2003-02-26 6:14 UTC (permalink / raw) To: Martin J. Bligh Cc: Rik van Riel, Andrea Arcangeli, Andrew Morton, Hanna Linder, lse-tech, linux-kernel At some point in the past, I wrote: >> The boxen I'm supposed to babysit need a high degree of resource >> consciousness wrt. lowmem allocations, so there is a clear voice On Tue, Feb 25, 2003 at 10:01:20PM -0800, Martin J. Bligh wrote: > It seemed, at least on the simple kernel compile tests that I did, that all > the long chains are not anonymous. It killed 95% of the space issue, which > given the simplicity of the patch was pretty damned stunning. Yes, there's > a pointer per page I guess we could kill in the struct page itself, but I > think you already have a better method for killing mem_map bloat ;-) I'm not going to get up in arms about this unless there's a serious performance issue that's going to get smacked down that I want to have a say in how it gets smacked down. aa is happy with the filebacked stuff, so I'm not pressing it (much) further. And yes, page clustering is certainly on its way and fast. I'm getting very close to the point where a general announcement will be in order. There's basically "one last big bug" and two bits of gross suboptimality I want to clean up before bringing the world to bear on it. -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-26 6:14 ` William Lee Irwin III @ 2003-02-26 6:32 ` William Lee Irwin III 0 siblings, 0 replies; 124+ messages in thread From: William Lee Irwin III @ 2003-02-26 6:32 UTC (permalink / raw) To: Martin J. Bligh, Rik van Riel, Andrea Arcangeli, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, Feb 25, 2003 at 10:01:20PM -0800, Martin J. Bligh wrote: >> It seemed, at least on the simple kernel compile tests that I did, that all >> the long chains are not anonymous. It killed 95% of the space issue, which >> given the simplicity of the patch was pretty damned stunning. Yes, there's >> a pointer per page I guess we could kill in the struct page itself, but I >> think you already have a better method for killing mem_map bloat ;-) On Tue, Feb 25, 2003 at 10:14:40PM -0800, William Lee Irwin III wrote: > I'm not going to get up in arms about this unless there's a serious > performance issue that's going to get smacked down that I want to have > a say in how it gets smacked down. aa is happy with the filebacked > stuff, so I'm not pressing it (much) further. > And yes, page clustering is certainly on its way and fast. I'm getting > very close to the point where a general announcement will be in order. > There's basically "one last big bug" and two bits of gross suboptimality > I want to clean up before bringing the world to bear on it. Screw it. Here it comes, ready or not. hch, I hope you were right... -- wli ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-26 6:01 ` Martin J. Bligh 2003-02-26 6:14 ` William Lee Irwin III @ 2003-02-26 16:02 ` Rik van Riel 2003-02-27 3:48 ` Daniel Phillips 1 sibling, 1 reply; 124+ messages in thread From: Rik van Riel @ 2003-02-26 16:02 UTC (permalink / raw) To: Martin J. Bligh Cc: William Lee Irwin III, Andrea Arcangeli, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Tue, 25 Feb 2003, Martin J. Bligh wrote: > > On Wed, Feb 26, 2003 at 02:24:18AM -0300, Rik van Riel wrote: > >> ... but making the anonymous pages use an object based > >> scheme probably will make things too expensive. > >> My instinct is that a hybrid system will work well in [snip] "wli wrote something" > It seemed, at least on the simple kernel compile tests that I did, that > all the long chains are not anonymous. It killed 95% of the space issue, > which given the simplicity of the patch was pretty damned stunning. Yes, > there's a pointer per page I guess we could kill in the struct page > itself, but I think you already have a better method for killing mem_map > bloat ;-) Also, with copy-on-write and mremap after fork, doing an object based rmap scheme for anonymous pages is just complex, almost certainly far too complex to be worth it, since it just has too many issues. Just read the patches by bcrl and davem, things get hairy fast. The pte chain rmap scheme is clean, but suffers from too much overhead for file mappings. As shown by Dave's patch, a hybrid system really is simple and clean, and it removes most of the pte chain overhead while still keeping the code nice and efficient. I think this hybrid system is the way to go, possibly with a few more tweaks left and right... regards, Rik -- Engineers don't grow up, they grow sideways. http://www.surriel.com/ http://kernelnewbies.org/ ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call 2003-02-26 16:02 ` Rik van Riel @ 2003-02-27 3:48 ` Daniel Phillips 0 siblings, 0 replies; 124+ messages in thread From: Daniel Phillips @ 2003-02-27 3:48 UTC (permalink / raw) To: Rik van Riel, Martin J. Bligh Cc: William Lee Irwin III, Andrea Arcangeli, Andrew Morton, Hanna Linder, lse-tech, linux-kernel On Wednesday 26 February 2003 17:02, Rik van Riel wrote: > On Tue, 25 Feb 2003, Martin J. Bligh wrote: > > It seemed, at least on the simple kernel compile tests that I did, that > > all the long chains are not anonymous. It killed 95% of the space issue, > > which given the simplicity of the patch was pretty damned stunning. Yes, > > there's a pointer per page I guess we could kill in the struct page > > itself, but I think you already have a better method for killing mem_map > > bloat ;-) > > Also, with copy-on-write and mremap after fork, doing an > object based rmap scheme for anonymous pages is just complex, > almost certainly far too complex to be worth it, since it just > has too many issues. Just read the patches by bcrl and davem, > things get hairy fast. > > The pte chain rmap scheme is clean, but suffers from too much > overhead for file mappings. There is a lot of redundancy in the rmap chains that could be exploited. If a pte page happens to reference a group of (say) 32 anon pages, then you can set each anon page's page->index to its position in the group and let a pte_chain node point at the pte of the first page of the group. You can then find each page's pte by adding its page->index to the pte_chain node's pte pointer. This allows a single rmap chain to be shared by all the pages in the group. This much of the idea is simple, however there are some tricky details to take care of. How does a copy-on-write break out one page of the group from one of the pte pages? 
I tried putting a (32 bit) bitmap in each pte_chain node to indicate which pte entries actually belong to the group, and that wasn't too bad except for doubling the per-link memory usage, turning a best case 32x gain into only 16x. It's probably better to break the group up, creating log2(groupsize) new chains. (This can be avoided in the common case that you already know every page in the group is going to be copied, as with a copy_from_user.) Getting rid of the bitmaps makes the single-page case the same as the current arrangement and makes it easy to let the size of a page be as large as the capacity of a whole pte page. There's also the problem of detecting groupable clusters of pages, e.g., in do_anon_page. Swap-out and swap-in introduce more messiness, as does mremap. In the end, I decided it's not needed in the current cycle, but probably worth investigating later. My purpose in bringing it up now is to show that there are still some more incremental gains to be had without needing radical surgery. > As shown by Dave's patch, a hybrid system really is simple and > clean, and it removes most of the pte chain overhead while still > keeping the code nice and efficient. > > I think this hybrid system is the way to go, possibly with a few > more tweaks left and right... Emphatically, yes. Regards, Daniel ^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-25 19:18 ` Andrea Arcangeli
  2003-02-25 19:27 ` Martin J. Bligh
@ 2003-02-25 20:10 ` William Lee Irwin III
  2003-02-25 20:23 ` Andrea Arcangeli
  1 sibling, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 20:10 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel

On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> Just taking the exception dwarfs anything written in C.
>> page_add_rmap() absorbs hits from all of the fault routines and
>> copy_page_range(). page_remove_rmap() absorbs hits from zap_pte_range().
>> do_wp_page() is huge because it's doing bitblitting in-line.

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> "absorbing" is a nice word for it. The way I see it, page_add_rmap and
> page_remove_rmap are even more expensive than the pagetable zapping.
> They're even more expensive than copy_page_range. Also focus on the
> numbers on the right that are even more interesting to find what is
> worth to optimize away first IMHO

Those just divide the number of hits by the size of the function IIRC, which is useless for some codepath spinning hard in the middle of a large function or in the presence of over-inlining. It's also greatly disturbed by spinlock section hackery (as are most profilers).

On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> These things aren't cheap with or without rmap. Trimming down

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> lots of things aren't cheap, but this isn't a good reason to make them
> twice more expensive, especially if they were as cheap as possible and
> they're critical hot paths.

They weren't as cheap as possible, and it's a bad idea to make them so. SVR4 proved there are limits to the usefulness of lazy evaluation wrt. pagetable copying and the like. You're also looking at sampling hits, not end-to-end timings.
After all these disclaimers, trimming down cpu cost is a good idea.

On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> accounting overhead could raise search problems elsewhere.

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> this is the point indeed, but at least in 2.4 I don't see any cpu saving
> advantage during swapping because during swapping the cpu is always idle
> anyways.

It's probably not swapping that matters, but high turnover of clean data. No one can really make a concrete assertion without some implementations of the alternatives, which is why I think they need to be done soon. Once one or more are there, we're set. I'm personally in favor of the anonymous handling rework as the alternative to pursue, since that actually retains the locality of reference, as opposed to wild pagetable scanning over random processes, which is highly unpredictable with respect to locality and even worse with respect to cpu consumption.

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> Infact I had to drop the lru_cache_add too from the anonymous page fault
> path because it was wasting way too much cpu to get peak performance (of
> course you're using per-page spinlocks by hand with rmap, and
> lru_cache_add needs a global spinlock, so at least rmap shouldn't
> introduce very big scalability issue unlike the lru_cache_add)

The high arrival rates to LRU lists in do_anonymous_page() etc. were dealt with by the pagevec batching infrastructure in 2.5.x, which is the primary method by which pagemap_lru_lock contention was addressed. The "breakup" so to speak is primarily for locality of reference. Which reminds me, my node-local pgdat allocation patch is pending...
On Tue, Feb 25, 2003 at 10:50:08AM -0800, William Lee Irwin III wrote:
>> Whether avoiding the search problem is worth the accounting overhead
>> could probably use some more investigation, like actually trying the
>> anonymous page handling rework needed to use vma-based ptov resolution.

On Tue, Feb 25, 2003 at 08:18:17PM +0100, Andrea Arcangeli wrote:
> the only solution is to do rmap lazily, i.e. to start building the rmap
> during swapping by walking the pagetables, basically exactly like I
> refill the lru with anonymous pages only after I start to need this
> information recently in my 2.4 tree, so if you never need to pageout
> heavily several giga of ram (like most of very high end numa servers),
> you'll never waste a single cycle in locking or whatever other worthless
> accounting overhead that hurts performance of all common workloads

I'd just bite the bullet and do the anonymous rework. Building pte_chains lazily raises the issue of needing to allocate in order to free, which is relatively thorny. Maintaining any level of accuracy of the things with lazy buildup is also problematic. That and the whole space issue wrt. pte_chains is blown away by the anonymous rework, which is a significant advantage.

-- wli

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-25 20:10 ` William Lee Irwin III
@ 2003-02-25 20:23 ` Andrea Arcangeli
  2003-02-25 20:46 ` William Lee Irwin III
  0 siblings, 1 reply; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 20:23 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel

On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
> I'd just bite the bullet and do the anonymous rework. Building
> pte_chains lazily raises the issue of needing to allocate in order to

note that there is no need of allocate to free.

Andrea

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-25 20:23 ` Andrea Arcangeli
@ 2003-02-25 20:46 ` William Lee Irwin III
  2003-02-25 20:52 ` Andrea Arcangeli
  0 siblings, 1 reply; 124+ messages in thread
From: William Lee Irwin III @ 2003-02-25 20:46 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Andrew Morton, Hanna Linder, lse-tech, linux-kernel

On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
>> I'd just bite the bullet and do the anonymous rework. Building
>> pte_chains lazily raises the issue of needing to allocate in order to

On Tue, Feb 25, 2003 at 09:23:35PM +0100, Andrea Arcangeli wrote:
> note that there is no need of allocate to free.

I've no longer got any idea what you're talking about, then.

-- wli

^ permalink raw reply	[flat|nested] 124+ messages in thread
* Re: Minutes from Feb 21 LSE Call
  2003-02-25 20:46 ` William Lee Irwin III
@ 2003-02-25 20:52 ` Andrea Arcangeli
  0 siblings, 0 replies; 124+ messages in thread
From: Andrea Arcangeli @ 2003-02-25 20:52 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, Hanna Linder, lse-tech, linux-kernel

On Tue, Feb 25, 2003 at 12:46:16PM -0800, William Lee Irwin III wrote:
> On Tue, Feb 25, 2003 at 12:10:23PM -0800, William Lee Irwin III wrote:
> >> I'd just bite the bullet and do the anonymous rework. Building
> >> pte_chains lazily raises the issue of needing to allocate in order to
>
> On Tue, Feb 25, 2003 at 09:23:35PM +0100, Andrea Arcangeli wrote:
> > note that there is no need of allocate to free.
>
> I've no longer got any idea what you're talking about, then.

Were we able to release memory w/o rmap: yes. Can we do it again: yes. Can we use a bit of the released memory to release further memory more efficiently with rmap: yes.

I'm not saying it's easy to implement that, but the problem that we'll need memory to release memory doesn't exist, since it also never existed before rmap was introduced into the kernel. Sure, the early stage of the swapping would be more cpu-intensive, but that is the feature.

Andrea

^ permalink raw reply	[flat|nested] 124+ messages in thread
end of thread, other threads:[~2003-02-26 20:47 UTC | newest]

Thread overview: 124+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-02-21 23:48 Minutes from Feb 21 LSE Call Hanna Linder
2003-02-22  0:16 ` Larry McVoy
2003-02-22  0:25 ` William Lee Irwin III
2003-02-22  2:24 ` Steven Cole
2003-02-22  0:44 ` Martin J. Bligh
2003-02-22  2:47 ` Larry McVoy
2003-02-22  4:32 ` Martin J. Bligh
2003-02-22  5:05 ` Larry McVoy
2003-02-22  6:39 ` Martin J. Bligh
2003-02-22  8:38 ` Jeff Garzik
2003-02-22 22:18 ` William Lee Irwin III
2003-02-23  0:50 ` Martin J. Bligh
2003-02-23 11:22 ` Magnus Danielson
2003-02-23 19:54 ` Eric W. Biederman
2003-02-23  1:17 ` Benjamin LaHaise
2003-02-23  5:21 ` Gerrit Huizenga
2003-02-23  8:07 ` David Lang
2003-02-23  8:20 ` William Lee Irwin III
2003-02-23 19:17 ` Linus Torvalds
2003-02-23 19:29 ` David Mosberger
2003-02-23 20:13 ` Martin J. Bligh
2003-02-23 22:01 ` David Mosberger
2003-02-23 22:12 ` Martin J. Bligh
2003-02-23 21:34 ` Linus Torvalds
2003-02-23 22:40 ` David Mosberger
2003-02-23 22:48 ` David Lang
2003-02-23 22:54 ` David Mosberger
2003-02-23 22:56 ` David Lang
2003-02-24  0:40 ` Linus Torvalds
2003-02-24  2:32 ` David Mosberger
2003-02-24  2:54 ` Linus Torvalds
2003-02-24  3:08 ` David Mosberger
2003-02-24 21:42 ` Andrea Arcangeli
2003-02-24  1:06 ` dean gaudet
2003-02-24  1:56 ` David Mosberger
2003-02-24  2:15 ` dean gaudet
2003-02-24  3:11 ` David Mosberger
2003-02-23 23:06 ` Martin J. Bligh
2003-02-23 23:59 ` David Mosberger
2003-02-24  3:49 ` Gerrit Huizenga
2003-02-24  4:07 ` David Mosberger
2003-02-24  4:34 ` Martin J. Bligh
2003-02-24  5:02 ` Gerrit Huizenga
2003-02-23 20:21 ` Xavier Bestel
2003-02-23 20:50 ` Martin J. Bligh
2003-02-23 23:57 ` Alan Cox
2003-02-24  1:26 ` Kenneth Johansson
2003-02-24  1:53 ` dean gaudet
2003-02-23 21:35 ` Alan Cox
2003-02-23 21:41 ` Linus Torvalds
2003-02-24  0:01 ` Bill Davidsen
2003-02-24  0:36 ` yodaiken
2003-02-23 21:15 ` John Bradford
2003-02-23 21:45 ` Linus Torvalds
2003-02-24  1:25 ` Benjamin LaHaise
2003-02-23 21:55 ` William Lee Irwin III
2003-02-23 19:13 ` David Mosberger
2003-02-23 23:28 ` Benjamin LaHaise
2003-02-26  8:46 ` Eric W. Biederman
2003-02-23 20:48 ` Gerrit Huizenga
2003-02-23  9:37 ` William Lee Irwin III
2003-02-22  8:38 ` David S. Miller
2003-02-22  8:38 ` David S. Miller
2003-02-22 14:34 ` Larry McVoy
2003-02-22 15:47 ` Martin J. Bligh
2003-02-22 16:13 ` Larry McVoy
2003-02-22 16:29 ` Martin J. Bligh
2003-02-22 16:33 ` Larry McVoy
2003-02-22 16:39 ` Martin J. Bligh
2003-02-22 16:59 ` John Bradford
2003-02-24 18:00 ` Timothy D. Witham
2003-02-22  8:32 ` David S. Miller
2003-02-22 18:20 ` Alan Cox
2003-02-22 20:05 ` William Lee Irwin III
2003-02-22 21:35 ` Alan Cox
2003-02-22 21:36 ` Gerrit Huizenga
2003-02-22 21:42 ` Christoph Hellwig
2003-02-23 23:23 ` Bill Davidsen
2003-02-24  3:31 ` Gerrit Huizenga
2003-02-24  4:02 ` Larry McVoy
2003-02-24  4:15 ` Russell Leighton
2003-02-24  5:11 ` William Lee Irwin III
2003-02-24  8:07 ` Christoph Hellwig
2003-02-23  0:37 ` Eric W. Biederman
2003-02-23  0:42 ` Eric W. Biederman
2003-02-23 14:29 ` Rik van Riel
2003-02-23 17:28 ` Eric W. Biederman
2003-02-24  1:42 ` Benjamin LaHaise
2003-02-23  3:24 ` Andrew Morton
2003-02-23 16:14 ` object-based rmap and pte-highmem Martin J. Bligh
2003-02-23 19:20 ` Linus Torvalds
2003-02-23 20:16 ` Martin J. Bligh
2003-02-23 21:37 ` Linus Torvalds
2003-02-23 22:07 ` pte-highmem vs UKVA (was: object-based rmap and pte-highmem) Martin J. Bligh
2003-02-23 22:10 ` William Lee Irwin III
2003-02-24  0:31 ` Linus Torvalds
2003-02-24  3:07 ` Martin J. Bligh
2003-02-25 17:17 ` Minutes from Feb 21 LSE Call Andrea Arcangeli
2003-02-25 17:43 ` William Lee Irwin III
2003-02-25 17:59 ` Andrea Arcangeli
2003-02-25 18:04 ` William Lee Irwin III
2003-02-25 18:50 ` William Lee Irwin III
2003-02-25 19:18 ` Andrea Arcangeli
2003-02-25 19:27 ` Martin J. Bligh
2003-02-25 20:30 ` Andrea Arcangeli
2003-02-25 20:53 ` Martin J. Bligh
2003-02-25 21:17 ` Andrea Arcangeli
2003-02-25 21:12 ` Martin J. Bligh
2003-02-25 22:16 ` Andrea Arcangeli
2003-02-25 22:17 ` Martin J. Bligh
2003-02-25 22:37 ` Andrea Arcangeli
2003-02-25 21:26 ` William Lee Irwin III
2003-02-25 22:18 ` Andrea Arcangeli
2003-02-26  5:24 ` Rik van Riel
2003-02-26  5:38 ` William Lee Irwin III
2003-02-26  6:01 ` Martin J. Bligh
2003-02-26  6:14 ` William Lee Irwin III
2003-02-26  6:32 ` William Lee Irwin III
2003-02-26 16:02 ` Rik van Riel
2003-02-27  3:48 ` Daniel Phillips
2003-02-25 20:10 ` William Lee Irwin III
2003-02-25 20:23 ` Andrea Arcangeli
2003-02-25 20:46 ` William Lee Irwin III
2003-02-25 20:52 ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox