* 2048 CPUs [was: Re: New filesystem for Linux] [not found] <787b0d920611041154l69db46abv4c8c467809ada57c@mail.gmail.com> @ 2006-11-04 22:36 ` Mikulas Patocka [not found] ` <787b0d920611050802o460a4000r5ce154e589732a02@mail.gmail.com> ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Mikulas Patocka @ 2006-11-04 22:36 UTC (permalink / raw) To: Albert Cahalan; +Cc: linux-kernel >> Does one Linux kernel run on system with 1024 cpus? I guess it >> must fry spinlocks... (or even lockup due to spinlock livelocks) > > The SGI Altix can have 2048 CPUs. And does it run one image of Linux? Or more images each on few cpus? How do they solve problem with spinlock livelocks? If time-spent-outside-spinlock/time-spent-in-spinlock < number-of-cpus, the spinlock livelock may happen --- this condition is not true normally with 2 or 4 cpus, but for that high amount of cpus, there is a danger. Or do they have some special hardware spinlock instruction that guarantees fairness? Mikulas ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <787b0d920611050802o460a4000r5ce154e589732a02@mail.gmail.com>]
* Re: 2048 CPUs [was: Re: New filesystem for Linux] [not found] ` <787b0d920611050802o460a4000r5ce154e589732a02@mail.gmail.com> @ 2006-11-06 1:47 ` Mikulas Patocka 0 siblings, 0 replies; 13+ messages in thread From: Mikulas Patocka @ 2006-11-06 1:47 UTC (permalink / raw) To: Albert Cahalan; +Cc: linux-kernel On Sun, 5 Nov 2006, Albert Cahalan wrote: > On 11/4/06, Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote: >> >> Does one Linux kernel run on system with 1024 cpus? I guess it >> >> must fry spinlocks... (or even lockup due to spinlock livelocks) >> > >> > The SGI Altix can have 2048 CPUs. >> >> And does it run one image of Linux? Or more images each on few cpus? > > Yes, of course it runs one image of Linux. It's SMP, not a cluster. > If I remember right: > > 2 CPUs per chip > 2 chips per board (with local RAM; this is NUMA) > 512 boards per machine > >> How do they solve problem with spinlock livelocks? >> >> If time-spent-outside-spinlock/time-spent-in-spinlock < number-of-cpus, >> the spinlock livelock may happen --- this condition is not true normally >> with 2 or 4 cpus, but for that high amount of cpus, there is a danger. >> >> Or do they have some special hardware spinlock instruction that guarantees >> fairness? > > Well first of all it's using IA-64 processors. These do atomic operations > via instructions that are essentially "take lock" and "release lock". > The non-CPU hardware probably recognizes these operations by the > special bus cycles and does whatever is needed to ensure fairness. It looks like a normal looping spinlock with rep nop and cmpxchg on i386. Not that I understand IA64 assembler --- but look at ia64_spinlock_contention --- it looks like a loop with pause and nonatomic test and a branch to acquire spinlock atomically with cmpxchg --- nothing that would prevent starving there. > Second of all, I think we just try to avoid having long spinlock hold > times. The SGI people love using RCU. As I recall, SGI also caused > Linux to get a sequence lock for jiffies. That is right, that you should avoid contention --- but the kernel should be stable no matter what userspace programs you run. Or does it mean that scientists can run only trusted programs on Altix that do not perform too many certain syscalls concurently in order to not jam spinlocks? Mikulas ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-04 22:36 ` 2048 CPUs [was: Re: New filesystem for Linux] Mikulas Patocka [not found] ` <787b0d920611050802o460a4000r5ce154e589732a02@mail.gmail.com> @ 2006-11-07 21:26 ` Pavel Machek 2006-11-07 22:36 ` Mikulas Patocka 2006-11-08 3:04 ` David Chinner 2 siblings, 1 reply; 13+ messages in thread From: Pavel Machek @ 2006-11-07 21:26 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Albert Cahalan, linux-kernel Hi! > >The SGI Altix can have 2048 CPUs. > > And does it run one image of Linux? Or more images each > on few cpus? > > How do they solve problem with spinlock livelocks? > > If time-spent-outside-spinlock/time-spent-in-spinlock < > number-of-cpus, the spinlock livelock may happen --- > this condition is not true normally with 2 or 4 cpus, > but for that high amount of cpus, there is a danger. Lets say time-spent-outside-spinlock == time-spent-in-spinlock and number-of-cpus == 2. 1 < 2 , so it should livelock according to you... ...but afaict this should work okay. Even if spinlocks are very unfair, as long as time-outside and time-inside comes in big chunks, it should work. If you are unlucky, one cpu may stall for a while, but... I see no livelock. -- Thanks for all the (sleeping) penguins. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-07 21:26 ` Pavel Machek @ 2006-11-07 22:36 ` Mikulas Patocka 2006-11-07 23:14 ` Pavel Machek 0 siblings, 1 reply; 13+ messages in thread From: Mikulas Patocka @ 2006-11-07 22:36 UTC (permalink / raw) To: Pavel Machek; +Cc: Albert Cahalan, linux-kernel Hi! > Hi! > >>> The SGI Altix can have 2048 CPUs. >> >> And does it run one image of Linux? Or more images each >> on few cpus? >> >> How do they solve problem with spinlock livelocks? >> >> If time-spent-outside-spinlock/time-spent-in-spinlock < >> number-of-cpus, the spinlock livelock may happen --- >> this condition is not true normally with 2 or 4 cpus, >> but for that high amount of cpus, there is a danger. > > Lets say time-spent-outside-spinlock == time-spent-in-spinlock and > number-of-cpus == 2. > > 1 < 2 , so it should livelock according to you... There is off-by-one bug in the condition. It should be: (time_spent_in_spinlock + time_spent_outside_spinlock) / time_spent_in_spinlock < number_of_cpus ... or if you divide it by time_spent_in_spinlock: time_spent_outside_spinlock / time_spent_in_spinlock + 1 < number_of_cpus > ...but afaict this should work okay. Even if spinlocks are very > unfair, as long as time-outside and time-inside comes in big chunks, > it should work. > > If you are unlucky, one cpu may stall for a while, but... I see no > livelock. If some rogue threads (and it may not even be intetional) call the same syscall stressing the one spinlock all the time, other syscalls needing the same spinlock may stall. Maybe there are so few Altices in the world that no one has yet observed it... Mikulas ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-07 22:36 ` Mikulas Patocka @ 2006-11-07 23:14 ` Pavel Machek 2006-11-08 18:26 ` Mikulas Patocka 0 siblings, 1 reply; 13+ messages in thread From: Pavel Machek @ 2006-11-07 23:14 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Albert Cahalan, linux-kernel Hi! > >Lets say time-spent-outside-spinlock == time-spent-in-spinlock and > >number-of-cpus == 2. > > > >1 < 2 , so it should livelock according to you... > > There is off-by-one bug in the condition. It should be: > (time_spent_in_spinlock + time_spent_outside_spinlock) / > time_spent_in_spinlock < number_of_cpus > > ... or if you divide it by time_spent_in_spinlock: > time_spent_outside_spinlock / time_spent_in_spinlock + 1 < number_of_cpus > > >...but afaict this should work okay. Even if spinlocks are very > >unfair, as long as time-outside and time-inside comes in big chunks, > >it should work. > > > >If you are unlucky, one cpu may stall for a while, but... I see no > >livelock. > > If some rogue threads (and it may not even be intetional) call the same > syscall stressing the one spinlock all the time, other syscalls needing > the same spinlock may stall. Fortunately, they'll unstall with probability of 1... so no, I do not think this is real problem. If someone takes semaphore in syscall (we do), same problem may happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes are fair (or mostly fair) these days, but rwlocks may not be or something. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-07 23:14 ` Pavel Machek @ 2006-11-08 18:26 ` Mikulas Patocka 2006-11-10 9:03 ` Pavel Machek 0 siblings, 1 reply; 13+ messages in thread From: Mikulas Patocka @ 2006-11-08 18:26 UTC (permalink / raw) To: Pavel Machek; +Cc: Albert Cahalan, linux-kernel Hi! >>> Lets say time-spent-outside-spinlock == time-spent-in-spinlock and >>> number-of-cpus == 2. >>> >>> 1 < 2 , so it should livelock according to you... >> >> There is off-by-one bug in the condition. It should be: >> (time_spent_in_spinlock + time_spent_outside_spinlock) / >> time_spent_in_spinlock < number_of_cpus >> >> ... or if you divide it by time_spent_in_spinlock: >> time_spent_outside_spinlock / time_spent_in_spinlock + 1 < number_of_cpus >> >>> ...but afaict this should work okay. Even if spinlocks are very >>> unfair, as long as time-outside and time-inside comes in big chunks, >>> it should work. >>> >>> If you are unlucky, one cpu may stall for a while, but... I see no >>> livelock. >> >> If some rogue threads (and it may not even be intetional) call the same >> syscall stressing the one spinlock all the time, other syscalls needing >> the same spinlock may stall. > > Fortunately, they'll unstall with probability of 1... so no, I do not > think this is real problem. You can't tell that CPUs behave exactly probabilistically --- it may happen that one gets out of the wait loop always too late. SMP buses have complex protocols to prevent starvation in case all CPUs are writing to the same cache line and similar --- however it is unusable againt spinlock starvation. > If someone takes semaphore in syscall (we do), same problem may > happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes > are fair (or mostly fair) these days, but rwlocks may not be or > something. Scheduler increases priority of sleeping process, so starving process should be waken up first. But if there are so many processes, that process that passed the semaphore, sleeps and tries to take the semaphore has already increased priority to the level of process that waited on the semaphor, livelock can happen too. Mikulas > Pavel > > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-08 18:26 ` Mikulas Patocka @ 2006-11-10 9:03 ` Pavel Machek 2006-11-10 15:20 ` Mikulas Patocka 0 siblings, 1 reply; 13+ messages in thread From: Pavel Machek @ 2006-11-10 9:03 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Albert Cahalan, linux-kernel Hi! > >>If some rogue threads (and it may not even be intetional) call the same > >>syscall stressing the one spinlock all the time, other syscalls needing > >>the same spinlock may stall. > > > >Fortunately, they'll unstall with probability of 1... so no, I do not > >think this is real problem. > > You can't tell that CPUs behave exactly probabilistically --- it may > happen that one gets out of the wait loop always too late. Well, I don't need them to be _exactly_ probabilistical. Anyway, if you have 2048 CPUs... you can perhaps get some non-broken ones. > >If someone takes semaphore in syscall (we do), same problem may > >happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes > >are fair (or mostly fair) these days, but rwlocks may not be or > >something. > > Scheduler increases priority of sleeping process, so starving process > should be waken up first. But if there are so many processes, that >process I do not think this is how Linux scheduler works. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-10 9:03 ` Pavel Machek @ 2006-11-10 15:20 ` Mikulas Patocka 2006-11-10 16:20 ` Alan Cox 2006-11-12 14:28 ` Pavel Machek 0 siblings, 2 replies; 13+ messages in thread From: Mikulas Patocka @ 2006-11-10 15:20 UTC (permalink / raw) To: Pavel Machek; +Cc: Albert Cahalan, linux-kernel Hi! >>>> If some rogue threads (and it may not even be intetional) call the same >>>> syscall stressing the one spinlock all the time, other syscalls needing >>>> the same spinlock may stall. >>> >>> Fortunately, they'll unstall with probability of 1... so no, I do not >>> think this is real problem. >> >> You can't tell that CPUs behave exactly probabilistically --- it may >> happen that one gets out of the wait loop always too late. > > Well, I don't need them to be _exactly_ probabilistical. > > Anyway, if you have 2048 CPUs... you can perhaps get some non-broken > ones. No intel document guarantees you that if more CPUs simultaneously execute locked cmpxchg in a loop that a CPU will see compare success in a finite time. In fact, CPUs can't guarantee this at all, because they don't know that they're executing a spinlock --- for them its just an instruction stream like anything else. Intel only guarantees that cmpxchg (or any other instruction) completes in finite time, but it doesn't say anything about the result of it. >>> If someone takes semaphore in syscall (we do), same problem may >>> happen, right...? Without need for 2048 cpus. Maybe semaphores/mutexes >>> are fair (or mostly fair) these days, but rwlocks may not be or >>> something. >> >> Scheduler increases priority of sleeping process, so starving process >> should be waken up first. But if there are so many processes, that >> process > > I do not think this is how Linux scheduler works. > Pavel <= 2.4 scheduler worked exactly like this. 2.6 has it more complicated, but does similar thing. But you are right that starvation on semaphore can happen, if the process has too high nice value, it will never be risen above other processes. Mikulas ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-10 15:20 ` Mikulas Patocka @ 2006-11-10 16:20 ` Alan Cox 2006-11-12 14:28 ` Pavel Machek 1 sibling, 0 replies; 13+ messages in thread From: Alan Cox @ 2006-11-10 16:20 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Pavel Machek, Albert Cahalan, linux-kernel > Intel only guarantees that cmpxchg (or any other instruction) completes in > finite time, but it doesn't say anything about the result of it. They don't even guarantee that... (hlt for example) or instructions SMM trapping into stuff that doesn't come back </pedant> Alan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-10 15:20 ` Mikulas Patocka 2006-11-10 16:20 ` Alan Cox @ 2006-11-12 14:28 ` Pavel Machek 2006-11-12 20:07 ` Mikulas Patocka 1 sibling, 1 reply; 13+ messages in thread From: Pavel Machek @ 2006-11-12 14:28 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Albert Cahalan, linux-kernel Hi! > >>You can't tell that CPUs behave exactly > >>probabilistically --- it may > >>happen that one gets out of the wait loop always too > >>late. > > > >Well, I don't need them to be _exactly_ > >probabilistical. > > > >Anyway, if you have 2048 CPUs... you can perhaps get > >some non-broken > >ones. > > No intel document guarantees you that if more CPUs > simultaneously execute locked cmpxchg in a loop that a If we are talking 2048 cpus, we are talking ia64. > CPU will see compare success in a finite time. In fact, > CPUs can't guarantee this at all, because they don't > know that they're executing a spinlock --- for them its > just an instruction stream like anything else. ...even i386 has monitor/mwait these days. Pavel -- Thanks for all the (sleeping) penguins. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-12 14:28 ` Pavel Machek @ 2006-11-12 20:07 ` Mikulas Patocka 2006-11-14 6:24 ` Albert Cahalan 0 siblings, 1 reply; 13+ messages in thread From: Mikulas Patocka @ 2006-11-12 20:07 UTC (permalink / raw) To: Pavel Machek; +Cc: Albert Cahalan, linux-kernel Hi! > Hi! > >>>> You can't tell that CPUs behave exactly >>>> probabilistically --- it may >>>> happen that one gets out of the wait loop always too >>>> late. >>> >>> Well, I don't need them to be _exactly_ >>> probabilistical. >>> >>> Anyway, if you have 2048 CPUs... you can perhaps get >>> some non-broken >>> ones. >> >> No intel document guarantees you that if more CPUs >> simultaneously execute locked cmpxchg in a loop that a > > If we are talking 2048 cpus, we are talking ia64. IA64 spinlock is locked cmpxchg, if failed than pause (i386 equivalent of rep nop) read the value, and if unlocked, try cmpxchg again. There is no fairness in it. >> CPU will see compare success in a finite time. In fact, >> CPUs can't guarantee this at all, because they don't >> know that they're executing a spinlock --- for them its >> just an instruction stream like anything else. > > ...even i386 has monitor/mwait these days. It also doesn't guarantee that subsequent locked instruction will take the lock after finite number of loops. Mikulas > Pavel > -- > Thanks for all the (sleeping) penguins. > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-12 20:07 ` Mikulas Patocka @ 2006-11-14 6:24 ` Albert Cahalan 0 siblings, 0 replies; 13+ messages in thread From: Albert Cahalan @ 2006-11-14 6:24 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Pavel Machek, linux-kernel > >> No intel document guarantees you that if more CPUs > >> simultaneously execute locked cmpxchg in a loop that a > > > > If we are talking 2048 cpus, we are talking ia64. > > IA64 spinlock is locked cmpxchg, if failed than pause (i386 equivalent of > rep nop) read the value, and if unlocked, try cmpxchg again. > > There is no fairness in it. I suppose we could use something better. There is the MCS lock, the related CLH lock, and IBM's improvement on the MCS lock. As with RCU, we'd need to get IBM's permission to use their lock. (so, how did we get permission for RCU?) The basic MCS lock is also patented I think. http://www.cs.rochester.edu/~scott/professional/Dijkstra/presentation.html ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: 2048 CPUs [was: Re: New filesystem for Linux] 2006-11-04 22:36 ` 2048 CPUs [was: Re: New filesystem for Linux] Mikulas Patocka [not found] ` <787b0d920611050802o460a4000r5ce154e589732a02@mail.gmail.com> 2006-11-07 21:26 ` Pavel Machek @ 2006-11-08 3:04 ` David Chinner 2 siblings, 0 replies; 13+ messages in thread From: David Chinner @ 2006-11-08 3:04 UTC (permalink / raw) To: Mikulas Patocka; +Cc: Albert Cahalan, linux-kernel On Sat, Nov 04, 2006 at 11:36:49PM +0100, Mikulas Patocka wrote: > >>Does one Linux kernel run on system with 1024 cpus? I guess it > >>must fry spinlocks... (or even lockup due to spinlock livelocks) > > > >The SGI Altix can have 2048 CPUs. > > And does it run one image of Linux? Or more images each on few cpus? One image. > How do they solve problem with spinlock livelocks? By replacing contended spinlocks withsleeping locks, using no-lock techniques (e.g. per-cpu) or changing the algorithm to remove the contention point. w.r.t filesystem locking scalability, you should read this paper: http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2006-11-14 6:24 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <787b0d920611041154l69db46abv4c8c467809ada57c@mail.gmail.com>
2006-11-04 22:36 ` 2048 CPUs [was: Re: New filesystem for Linux] Mikulas Patocka
[not found] ` <787b0d920611050802o460a4000r5ce154e589732a02@mail.gmail.com>
2006-11-06 1:47 ` Mikulas Patocka
2006-11-07 21:26 ` Pavel Machek
2006-11-07 22:36 ` Mikulas Patocka
2006-11-07 23:14 ` Pavel Machek
2006-11-08 18:26 ` Mikulas Patocka
2006-11-10 9:03 ` Pavel Machek
2006-11-10 15:20 ` Mikulas Patocka
2006-11-10 16:20 ` Alan Cox
2006-11-12 14:28 ` Pavel Machek
2006-11-12 20:07 ` Mikulas Patocka
2006-11-14 6:24 ` Albert Cahalan
2006-11-08 3:04 ` David Chinner
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.