* v2.6.21.5-rt19
@ 2007-07-04 20:49 Thomas Gleixner
2007-07-06 14:10 ` v2.6.21.5-rt19 Rui Nuno Capela
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2007-07-04 20:49 UTC (permalink / raw)
To: LKML
Cc: RT-Users, Ingo Molnar, Steven Rostedt, Fernando Lopez-Lezcano,
jcaceres@ccrma.Stanford.EDU, Carsten Emde
I'm pleased to announce the v2.6.21.5-rt19 kernel on behalf of Ingo.
It can be downloaded from the usual place:
http://people.redhat.com/mingo/realtime-preempt/
More info about the -rt patch set can be found in the RT wiki:
http://rt.wiki.kernel.org
Changes since 2.6.21.5-rt18:
- Fixed a nasty and hard to track down slowness / boot problem on SMP
machines with CONFIG_NOHZ enabled. The problem was caused by the timer
wheel base lock held during the get_next_timer_interrupt() call in the
idle path, which eventually led to a bogus PI boosting of the idle task
and in consequence a stale wrong scheduler selection for the affected
idle task.
Kudos to Carsten Emde, who patiently and meticulously isolated the
problem and provided the traces, which allowed us to identify the root
cause.
Problem solution: Prevent idle task boosting
- back port of the ntp / clock_was_set fix
- integration of the processor_idle fix from Venki Pallipadi, which
resolves boot issues on some platforms
- ep93xx clock events fix from Manfred Gruber
To build a 2.6.21.5-rt19 tree, the following patches should be applied:
http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.21.5.tar.bz2
http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.5-rt19
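For reference, the two downloads above can be turned into a tree roughly as
follows. This is only a dry-run sketch that prints shell commands instead of
running them. The use of wget, the -p1 strip level, and the assumption that
the tarball unpacks into linux-2.6.21.5/ are not stated in the announcement,
and build_commands is a made-up helper name.

```python
"""Dry-run sketch: print the shell commands to assemble a 2.6.21.5-rt19 tree."""

# URLs copied from the announcement.
TARBALL_URL = "http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.21.5.tar.bz2"
RT_PATCH_URL = "http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.5-rt19"


def build_commands(tarball_url, patch_url):
    """Return the commands in order; nothing is executed here."""
    tarball = tarball_url.rsplit("/", 1)[-1]   # linux-2.6.21.5.tar.bz2
    patch = patch_url.rsplit("/", 1)[-1]       # patch-2.6.21.5-rt19
    # Assumption: the tarball unpacks into a directory named after itself.
    tree = tarball[: -len(".tar.bz2")]         # linux-2.6.21.5
    return [
        f"wget {tarball_url}",
        f"wget {patch_url}",
        f"tar xjf {tarball}",
        # Assumption: the patch applies with -p1 from the top of the tree.
        f"cd {tree} && patch -p1 < ../{patch}",
    ]


if __name__ == "__main__":
    for command in build_commands(TARBALL_URL, RT_PATCH_URL):
        print(command)
```

Printing the commands first makes it easy to review them before pasting them
into a shell.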
Thanks,
tglx
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19
2007-07-04 20:49 v2.6.21.5-rt19 Thomas Gleixner
@ 2007-07-06 14:10 ` Rui Nuno Capela
2007-07-06 21:49 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
0 siblings, 1 reply; 19+ messages in thread
From: Rui Nuno Capela @ 2007-07-06 14:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, RT-Users, Ingo Molnar, Steven Rostedt, Fernando Lopez-Lezcano,
jcaceres@ccrma.Stanford.EDU, Carsten Emde

Hi,

On Wed, July 4, 2007 21:49, Thomas Gleixner wrote:
> I'm pleased to announce the v2.6.21.5-rt19 kernel on behalf of Ingo.
>
>
> It can be downloaded from the usual place:
>
> http://people.redhat.com/mingo/realtime-preempt/
>
> More info about the -rt patch set can be found in the RT wiki:
>
> http://rt.wiki.kernel.org
>
> Changes since 2.6.21.5-rt18:
>
> - Fixed a nasty and hard to track down slowness / boot problem on SMP
> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> wheel base lock held during the get_next_timer_interrupt() call in the
> idle path, which eventually led to a bogus PI boosting of the idle task
> and in consequence a stale wrong scheduler selection for the affected idle
> task.
>
> Kudos to Carsten Emde, who patiently and meticulously isolated the
> problem and provided the traces, which allowed to identify the root cause.
>
> Problem solution: Prevent idle task boosting
>
> - back port of the ntp / clock_was_set fix
>
> - integration of the processor_idle fix from Venki Pallipadi, which
> resolves boot issues on some platforms
>
> - ep93xx clock events fix from Manfred Gruber
>

Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on my
Core2 T7200 laptop (fujitsu-siemens amilo i1520).

Although I'm still with my fingers crossed, I can tell the good news are
that 2.6.21.5-rt19 (and -rt20) does behave far better now on the very same
box.

I've more than 8 hours up and running now, without a single glimpse of the
bad symptoms, which used to show in a matter of minutes if not earlier
during init time.

Congratulations. I'm not sure whether this problem can be closed for good,
though, just because you mention the slowness fix applies to CONFIG_NOHZ=Y
and I'm quite sure my badness surged either way.

But at least -rt is usable again here and that just makes me happier :)

Cheers.
--
rncbc aka Rui Nuno Capela
rncbc@rncbc.org
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19
2007-07-06 14:10 ` v2.6.21.5-rt19 Rui Nuno Capela
@ 2007-07-06 21:49 ` Fernando Lopez-Lezcano
2007-07-07 9:15 ` v2.6.21.5-rt19 Ingo Molnar
2007-07-07 9:24 ` v2.6.21.5-rt19 Ingo Molnar
0 siblings, 2 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-06 21:49 UTC (permalink / raw)
To: Rui Nuno Capela
Cc: Thomas Gleixner, LKML, RT-Users, Ingo Molnar, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde

On Fri, 2007-07-06 at 15:10 +0100, Rui Nuno Capela wrote:
> Hi,
>
> On Wed, July 4, 2007 21:49, Thomas Gleixner wrote:
> > I'm pleased to announce the v2.6.21.5-rt19 kernel on behalf of Ingo.
> >
> >
> > It can be downloaded from the usual place:
> >
> > http://people.redhat.com/mingo/realtime-preempt/
> >
> > More info about the -rt patch set can be found in the RT wiki:
> >
> > http://rt.wiki.kernel.org
> >
> > Changes since 2.6.21.5-rt18:
> >
> > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > wheel base lock held during the get_next_timer_interrupt() call in the
> > idle path, which eventually led to a bogus PI boosting of the idle task
> > and in consequence a stale wrong scheduler selection for the affected idle
> > task.
> >
> > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > problem and provided the traces, which allowed to identify the root cause.
> >
> > Problem solution: Prevent idle task boosting
> >
> > - back port of the ntp / clock_was_set fix
> >
> > - integration of the processor_idle fix from Venki Pallipadi, which
> > resolves boot issues on some platforms
> >
> > - ep93xx clock events fix from Manfred Gruber
> >
>
> Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on my
> Core2 T7200 laptop (fujitsu-siemens amilo i1520).
>
> Although I'm still with my fingers crossed, I can tell the good news are
> that 2.6.21.5-rt19 (and -rt20) does behave far better now on the very same
> box.

Yes, it works much better indeed...

Ingo: is there a place where I can read about the changes in different
rtxx releases? What is new/better/fixed in rt20? (I see scheduler
stuff in a diff from rt19 to rt20 but I don't really know what it
means).

-- Fernando

> I've more than 8 hours up and running now, without a single glimpse of the
> bad symptoms, which used to show in a matter of minutes if not earlier
> during init time.
>
> Congratulations. I'm not sure whether this problem can be closed for good,
> though, just because you mention the slowness fix applies to CONFIG_NOHZ=Y
> and I'm quite sure my badness surged either way.
>
> But at least -rt is usable again here and that just makes me happier :)
>
> Cheers.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19
2007-07-06 21:49 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
@ 2007-07-07 9:15 ` Ingo Molnar
2007-07-07 9:24 ` v2.6.21.5-rt19 Ingo Molnar
1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-07 9:15 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> > Although I'm still with my fingers crossed, I can tell the good
> > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > on the very same box.
>
> Yes, it works much better indeed...
>
> Ingo: is there a place where I can read about the changes in different
> rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> stuff in a diff from rt19 to rt20 but I don't really know what it
> means).

rt19 -> rt20 was a pure CFS update - from v18 to v19-almost-final.

Ingo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19
2007-07-06 21:49 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-07 9:15 ` v2.6.21.5-rt19 Ingo Molnar
@ 2007-07-07 9:24 ` Ingo Molnar
2007-07-08 22:36 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2007-07-07 9:24 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> > > Changes since 2.6.21.5-rt18:
> > >
> > > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > > wheel base lock held during the get_next_timer_interrupt() call in the
> > > idle path, which eventually led to a bogus PI boosting of the idle task
> > > and in consequence a stale wrong scheduler selection for the affected idle
> > > task.
> > >
> > > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > > problem and provided the traces, which allowed to identify the root cause.
> > >
> > > Problem solution: Prevent idle task boosting

> > Maybe someone remember me whining about troubles with 2.6.21-rt2..18
> > on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> >
> > Although I'm still with my fingers crossed, I can tell the good
> > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > on the very same box.
>
> Yes, it works much better indeed...
>
> Ingo: is there a place where I can read about the changes in different
> rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> stuff in a diff from rt19 to rt20 but I don't really know what it
> means).

and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
CFS was merged.

i _think_ Rui might have seen two separate problems. Perhaps by the time
we fixed the first problem (which Rui saw since -rt2) we introduced the
other one via -rt11 - which then got fixed in -rt19.

btw., we'd love to get more feedback regarding CFS. CFS is a completely
new scheduler for Linux. It has a design centered around keeping
application latencies down, so it is ultimately real-time friendly, and
it should also make things work better for desktop-ish and audio-ish
stuff as well. (even under SCHED_OTHER)

So it would be nice if you could keep an extra eye on any scheduling
artifacts or regressions, and make sure your favorite workload is still
handled by the Linux scheduler in the utmost best way. I'd like to hear
about any sort of "scheduling behavior / interactivity" regression you
might see, relative to the vanilla kernel. Or if you can see no such
problems then a line of "it works as well as the previous scheduler" is
important info to us too. Thanks!

Ingo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19
2007-07-07 9:24 ` v2.6.21.5-rt19 Ingo Molnar
@ 2007-07-08 22:36 ` Fernando Lopez-Lezcano
2007-07-08 22:50 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-08 23:42 ` v2.6.21.5-rt19 Gabriel C
0 siblings, 2 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-08 22:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde, nando

On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > > > Changes since 2.6.21.5-rt18:
> > > >
> > > > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > > > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > > > wheel base lock held during the get_next_timer_interrupt() call in the
> > > > idle path, which eventually led to a bogus PI boosting of the idle task
> > > > and in consequence a stale wrong scheduler selection for the affected idle
> > > > task.
> > > >
> > > > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > > > problem and provided the traces, which allowed to identify the root cause.
> > > >
> > > > Problem solution: Prevent idle task boosting
>
> > > Maybe someone remember me whining about troubles with 2.6.21-rt2..18
> > > on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> > >
> > > Although I'm still with my fingers crossed, I can tell the good
> > > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > > on the very same box.
> >
> > Yes, it works much better indeed...
> >
> > Ingo: is there a place where I can read about the changes in different
> > rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> > stuff in a diff from rt19 to rt20 but I don't really know what it
> > means).
>
> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
> CFS was merged.
>
> i _think_ Rui might have seen two separate problems. Perhaps by the time
> we fixed the first problem (which Rui saw since -rt2) we introduced the
> other one via -rt11 - which then got fixed in -rt19.

Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
really trying to provide a "stable" rt kernel for audio usage and
including another subsystem into rt is - IMHO - not going to help.
What's the chance of splitting things?

> btw., we'd love to get more feedback regarding CFS. CFS is a completely
> new scheduler for Linux.

Then I'd rather have it separate from rt.

> It has a design centered around keeping
> application latencies down, so it is ultimately real-time friendly, and
> it should also make things work better for desktop-ish and audio-ish
> stuff as well. (even under SCHED_OTHER)

Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
list):

On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> on my desktop but on my laptop it makes Firefox and Tomboy to crash.
> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> problem.
>
> Cheers,
>
> Hector
>
>
> On 7/7/07, Hector Centeno <hcengar@gmail.com> wrote:
> Hi Fernando,
>
> I do have Flash installed but for me Firefox crashes when trying to
> access gmail (which AFAIK doesn't use Flash, does it?). Right now
> Firefox is frozen and I'm typing this email using Konqueror (in Gnome).
> This is ps' output:
>
> hector 3595 1.1 2.2 194352 46336 ? D 16:25 0:03 /usr/lib/firefox-2.0.0.4/firefox-bin
>
> I think the problem is not present in my Desktop but I have to double
> check. In the same laptop using the stock fedora kernel both Tomboy
> and Firefox work fine. My laptop has a centrino duo processor, 2 gigs
> of ram and the Intel GMA950 graphics chip.
>
> Hector

I managed to completely hang firefox (fc7) with flash 9 installed
(unkillable even with -9). Does not seem to happen with flash 7. Have
not tried yet with gmail and flash uninstalled. I'll try to strace it to
see when/why it hangs.

-- Fernando

> So it would be nice if you could keep an extra eye on any scheduling
> artifacts or regressions, and make sure your favorite workload is still
> handled by the Linux scheduler in the utmost best way. I'd like to hear
> about any sort of "scheduling behavior / interactivity" regression you
> might see, relative to the vanilla kernel. Or if you can see no such
> problems then a line of "it works as well as the previous scheduler" is
> important info to us too. Thanks!
^ permalink raw reply [flat|nested] 19+ messages in thread
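The "D" in the ps output quoted in the message above is the process state
column: uninterruptible sleep, which is why kill -9 has no visible effect
until the task wakes up. A small sketch for checking the state of a suspect
pid by reading /proc/<pid>/stat follows; the helper names parse_state and
process_state are made up for illustration and are not from the thread.

```python
"""Sketch: report the scheduler state of a process from /proc/<pid>/stat."""
import os


def parse_state(stat_line):
    """Extract the one-letter state field from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid):
    """Read /proc/<pid>/stat (Linux only) and return the state letter."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_state(stat_file.read())


if __name__ == "__main__":
    # 'D' means uninterruptible sleep: the task is blocked inside the
    # kernel and does not act on signals, SIGKILL included, until it wakes.
    pid = os.getpid()
    print(pid, process_state(pid))
```

Calling process_state() with the pid of the hung firefox-bin would be
expected to return "D" while the process is stuck.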
* Re: v2.6.21.5-rt19
2007-07-08 22:36 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
@ 2007-07-08 22:50 ` Fernando Lopez-Lezcano
2007-07-08 23:42 ` v2.6.21.5-rt19 Gabriel C
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-08 22:50 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde, nando

On Sun, 2007-07-08 at 15:36 -0700, Fernando Lopez-Lezcano wrote:
> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > > > > Changes since 2.6.21.5-rt18:
> > > > >
> > > > > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > > > > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > > > > wheel base lock held during the get_next_timer_interrupt() call in the
> > > > > idle path, which eventually led to a bogus PI boosting of the idle task
> > > > > and in consequence a stale wrong scheduler selection for the affected idle
> > > > > task.
> > > > >
> > > > > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > > > > problem and provided the traces, which allowed to identify the root cause.
> > > > >
> > > > > Problem solution: Prevent idle task boosting
> >
> > > > Maybe someone remember me whining about troubles with 2.6.21-rt2..18
> > > > on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> > > >
> > > > Although I'm still with my fingers crossed, I can tell the good
> > > > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > > > on the very same box.
> > >
> > > Yes, it works much better indeed...
> > >
> > > Ingo: is there a place where I can read about the changes in different
> > > rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> > > stuff in a diff from rt19 to rt20 but I don't really know what it
> > > means).
> >
> > and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
> > CFS was merged.
> >
> > i _think_ Rui might have seen two separate problems. Perhaps by the time
> > we fixed the first problem (which Rui saw since -rt2) we introduced the
> > other one via -rt11 - which then got fixed in -rt19.
>
> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> really trying to provide a "stable" rt kernel for audio usage and
> including another subsystem into rt is - IMHO - not going to help.
> What's the chance of splitting things?
>
> > btw., we'd love to get more feedback regarding CFS. CFS is a completely
> > new scheduler for Linux.
>
> Then I'd rather have it separate from rt.

Please? I would like to provide the least amount of new functionality
that is really necessary in my audio kernels. Audio related requirements
include the rt patch but not a new scheduler.

> > It has a design centered around keeping
> > application latencies down, so it is ultimately real-time friendly, and
> > it should also make things work better for desktop-ish and audio-ish
> > stuff as well. (even under SCHED_OTHER)
>
> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> list):
>
> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> > Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> > on my desktop but on my laptop it makes Firefox and Tomboy to crash.
> > On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> > problem.

It looks to my untrained eye like it is CFS related, I'm attaching the
last part of the strace of firefox while it tries to load a flash site.
The firefox process is left in an unkillable (not even by -9) state.

What else could I provide to debug the problem?
(this is in a T61 laptop with the Intel 7700 processor).

-- Fernando
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19
2007-07-08 22:36 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-08 22:50 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
@ 2007-07-08 23:42 ` Gabriel C
2007-07-09 3:53 ` v2.6.21.5-rt19 Fernando Pablo Lopez-Lezcano
1 sibling, 1 reply; 19+ messages in thread
From: Gabriel C @ 2007-07-08 23:42 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Ingo Molnar, Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users,
Steven Rostedt, jcaceres@ccrma.Stanford.EDU, Carsten Emde

Fernando Lopez-Lezcano wrote:
> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
>
>> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>>
>>>>> Changes since 2.6.21.5-rt18:
>>>>>
>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
>>>>> and in consequence a stale wrong scheduler selection for the affected idle
>>>>> task.
>>>>>
>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
>>>>> problem and provided the traces, which allowed to identify the root cause.
>>>>>
>>>>> Problem solution: Prevent idle task boosting
>>>>>
>>>> Maybe someone remember me whining about troubles with 2.6.21-rt2..18
>>>> on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
>>>>
>>>> Although I'm still with my fingers crossed, I can tell the good
>>>> news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
>>>> on the very same box.
>>>>
>>> Yes, it works much better indeed...
>>>
>>> Ingo: is there a place where I can read about the changes in different
>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler
>>> stuff in a diff from rt19 to rt20 but I don't really know what it
>>> means).
>>>
>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
>> CFS was merged.
>>
>> i _think_ Rui might have seen two separate problems. Perhaps by the time
>> we fixed the first problem (which Rui saw since -rt2) we introduced the
>> other one via -rt11 - which then got fixed in -rt19.
>>
>
> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> really trying to provide a "stable" rt kernel for audio usage and
> including another subsystem into rt is - IMHO - not going to help.
> What's the chance of splitting things?
>
>
>> btw., we'd love to get more feedback regarding CFS. CFS is a completely
>> new scheduler for Linux.
>>
>
> Then I'd rather have it separate from rt.
>
>
>> It has a design centered around keeping
>> application latencies down, so it is ultimately real-time friendly, and
>> it should also make things work better for desktop-ish and audio-ish
>> stuff as well. (even under SCHED_OTHER)
>>
>
> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> list):
>
> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
>
>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
>> on my desktop but on my laptop it makes Firefox and Tomboy to crash.
>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
>> problem.
>>
>> Cheers,
>>
>> Hector
>>
>>
>> On 7/7/07, Hector Centeno <hcengar@gmail.com> wrote:
>> Hi Fernando,
>>
>> I do have Flash installed but for me Firefox crashes when trying to
>> access gmail (which AFAIK doesn't use Flash, does it?). Right now
>> Firefox is frozen and I'm typing this email using Konqueror (in Gnome).
>> This is ps' output:
>>
>> hector 3595 1.1 2.2 194352 46336 ? D 16:25 0:03 /usr/lib/firefox-2.0.0.4/firefox-bin
>>
>> I think the problem is not present in my Desktop but I have to double
>> check. In the same laptop using the stock fedora kernel both Tomboy
>> and Firefox work fine. My laptop has a centrino duo processor, 2 gigs
>> of ram and the Intel GMA950 graphics chip.
>>
>> Hector
>>
>
> I managed to completely hang firefox (fc7) with flash 9 installed
> (unkillable even with -9).

Firefox with flash 9 does not work good , there are a lot bugs reported
about ( just google ) and it hangs on vanilla or whatever other kernels
as well. Not only Firefox but also Swiftfox, Opera, Epiphany etc.

The most time Firefox dies when you use flash 9 and close a window or a
tab.

> Does not seem to happen with flash 7.

Yes flash 7 is fine.

> Have
> not tried yet with gmail and flash uninstalled. I'll try to strace it to
> see when/why it hangs.
>
>
> -- Fernando
>
>
>
>> So it would be nice if you could keep an extra eye on any scheduling
>> artifacts or regressions, and make sure your favorite workload is still
>> handled by the Linux scheduler in the utmost best way. I'd like to hear
>> about any sort of "scheduling behavior / interactivity" regression you
>> might see, relative to the vanilla kernel. Or if you can see no such
>> problems then a line of "it works as well as the previous scheduler" is
>> important info to us too. Thanks!
>>
>
>
>

Regards,

Gabriel C
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19
2007-07-08 23:42 ` v2.6.21.5-rt19 Gabriel C
@ 2007-07-09 3:53 ` Fernando Pablo Lopez-Lezcano
2007-07-09 5:08 ` v2.6.21.5-rt19 (sched_getaffinity?) Fernando Lopez-Lezcano
0 siblings, 1 reply; 19+ messages in thread
From: Fernando Pablo Lopez-Lezcano @ 2007-07-09 3:53 UTC (permalink / raw)
To: Gabriel C
Cc: Ingo Molnar, Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users,
Steven Rostedt, jcaceres@ccrma.Stanford.EDU, Carsten Emde

On Mon, 9 Jul 2007, Gabriel C wrote:
> Fernando Lopez-Lezcano wrote:
>> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
>>> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>>>>>> Changes since 2.6.21.5-rt18:
>>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
>>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
>>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
>>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
>>>>>> and in consequence a stale wrong scheduler selection for the affected idle
>>>>>> task.
>>>>>>
>>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
>>>>>> problem and provided the traces, which allowed to identify the root cause.
>>>>>>
>>>>>> Problem solution: Prevent idle task boosting
>>>>>>
>>>>> Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on
>>>>> my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
>>>>>
>>>>> Although I'm still with my fingers crossed, I can tell the good news
>>>>> are that 2.6.21.5-rt19 (and -rt20) does behave far better now on the
>>>>> very same box.
>>>>>
>>>> Yes, it works much better indeed...
>>>>
>>>> Ingo: is there a place where I can read about the changes in different
>>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler stuff
>>>> in a diff from rt19 to rt20 but I don't really know what it means).
>>>>
>>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when CFS
>>> was merged.
>>>
>>> i _think_ Rui might have seen two separate problems. Perhaps by the time
>>> we fixed the first problem (which Rui saw since -rt2) we introduced the
>>> other one via -rt11 - which then got fixed in -rt19.
>>
>> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
>> really trying to provide a "stable" rt kernel for audio usage and
>> including another subsystem into rt is - IMHO - not going to help.
>> What's the chance of splitting things?
>>
>>> btw., we'd love to get more feedback regarding CFS. CFS is a completely
>>> new scheduler for Linux.
>>
>> Then I'd rather have it separate from rt.
>>
>>> It has a design centered around keeping application latencies down, so it
>>> is ultimately real-time friendly, and it should also make things work
>>> better for desktop-ish and audio-ish stuff as well. (even under
>>> SCHED_OTHER)
>>>
>>
>> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
>> list):
>>
>> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
>>
>>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
>>> on my desktop but on my laptop it makes Firefox and Tomboy to crash.
>>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
>>> problem.
>>>
>> I managed to completely hang firefox (fc7) with flash 9 installed
>> (unkillable even with -9).
>
> Firefox with flash 9 does not work good , there are a lot bugs reported
> about ( just google ) and it hangs on vanilla or whatever other kernels
> as well. Not only Firefox but also Swiftfox, Opera, Epiphany etc.
>
> The most time Firefox dies when you use flash 9 and close a window or a
> tab.

More tests...

The problem is the rt kernel AFAICT, this goes beyond Flash 9, way
beyond:

_OpenOffice_ hangs with 2.6.21.5-rt20, works fine with stock Fedora 7
kernel. Flash 9 hangs with 2.6.21.5-rt20, works fine with the stock Fedora
7 kernel. Same machine booting different kernels, I'd say it is the
kernel.

The only way out for a hung app is a reboot.

Ingo: what would be a good way to trace this? It makes the rt kernels not
very usable at least on this hardware (more tests tomorrow in the CCRMA
machines).

Same on 2.6.21.5-rt18 with CONFIG_NO_HZ not set.

-- Fernando
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-09 3:53 ` v2.6.21.5-rt19 Fernando Pablo Lopez-Lezcano
@ 2007-07-09 5:08 ` Fernando Lopez-Lezcano
2007-07-17 19:32 ` Ingo Molnar
0 siblings, 1 reply; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-09 5:08 UTC (permalink / raw)
To: Gabriel C
Cc: nando, Carsten Emde, jcaceres@ccrma.Stanford.EDU, Steven Rostedt,
RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 4632 bytes --]

On Sun, 2007-07-08 at 20:53 -0700, Fernando Pablo Lopez-Lezcano wrote:
> On Mon, 9 Jul 2007, Gabriel C wrote:
> > Fernando Lopez-Lezcano wrote:
> >> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> >>> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> >>>>>> Changes since 2.6.21.5-rt18:
> >>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
> >>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> >>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
> >>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
> >>>>>> and in consequence a stale wrong scheduler selection for the affected
> >>>>>> idle task.
> >>>>>>
> >>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
> >>>>>> problem and provided the traces, which allowed to identify the root
> >>>>>> cause.
> >>>>>>
> >>>>>> Problem solution: Prevent idle task boosting
> >>>>>>
> >>>>> Maybe someone remembers me whining about troubles with 2.6.21-rt2..18 on
> >>>>> my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> >>>>>
> >>>>> Although I'm still keeping my fingers crossed, the good news is that
> >>>>> 2.6.21.5-rt19 (and -rt20) behave far better now on the very same box.
> >>>>>
> >>>> Yes, it works much better indeed...
> >>>>
> >>>> Ingo: is there a place where I can read about the changes in different
> >>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler stuff
> >>>> in a diff from rt19 to rt20 but I don't really know what it means).
> >>>>
> >>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when CFS
> >>> was merged.
> >>>
> >>> i _think_ Rui might have seen two separate problems. Perhaps by the time
> >>> we fixed the first problem (which Rui saw since -rt2) we introduced the
> >>> other one via -rt11 - which then got fixed in -rt19.
> >>
> >> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> >> really trying to provide a "stable" rt kernel for audio usage and
> >> including another subsystem into rt is - IMHO - not going to help.
> >> What's the chance of splitting things?
> >>
> >>> btw., we'd love to get more feedback regarding CFS. CFS is a completely
> >>> new scheduler for Linux.
> >>
> >> Then I'd rather have it separate from rt.
> >>
> >>> It has a design centered around keeping application latencies down, so it
> >>> is ultimately real-time friendly, and it should also make things work
> >>> better for desktop-ish and audio-ish stuff as well. (even under
> >>> SCHED_OTHER)
> >>>
> >>
> >> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> >> list):
> >>
> >> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> >>
> >>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> >>> on my desktop but on my laptop it makes Firefox and Tomboy crash.
> >>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> >>> problem.
> >>>
> >> I managed to completely hang firefox (fc7) with flash 9 installed
> >> (unkillable even with -9).
> >
> > Firefox with flash 9 does not work well; there are a lot of bugs reported
> > about it (just google) and it hangs on vanilla or whatever other kernels
> > as well. Not only Firefox but also Swiftfox, Opera, Epiphany etc.
> >
> > Most of the time Firefox dies when you use flash 9 and close a window or a
> > tab.
>
> More tests...
>
> The problem is the rt kernel AFAICT, this goes beyond Flash 9, way
> beyond:
>
> _OpenOffice_ hangs with 2.6.21.5-rt20, works fine with stock Fedora 7
> kernel. Flash 9 hangs with 2.6.21.5-rt20, works fine with the stock Fedora
> 7 kernel. Same machine booting different kernels, I'd say it is the
> kernel.
>
> The only way out for a hung app is a reboot.
>
> Ingo: what would be a good way to trace this? It makes the rt kernels not
> very usable at least on this hardware (more tests tomorrow in the CCRMA
> machines).
>
> Same on 2.6.21.5-rt18 with CONFIG_NO_HZ not set.

I forgot to include the output of strace... and of course now I can't
repeat the openoffice hang.

I do get flash 9 (I know, not the best example) and tomboy to hang as
reported by one of my Planet CCRMA users - flash 9 tested working on
stock fedora 7 kernel - and both seem to hang in the same system call:

sched_getaffinity(3528, 32, <unfinished ...>

Full output of strace attached for both cases.

Hopefully this will make the bug immediately obvious to someone :-)
[running on a laptop with the 7700 Intel cpu]

-- Fernando

[-- Attachment #2: firefox.trace.gz --]
[-- Type: application/x-gzip, Size: 155285 bytes --]

[-- Attachment #3: tomboy.trace.gz --]
[-- Type: application/x-gzip, Size: 2960 bytes --]
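For readers who want to exercise the call that strace shows as unfinished, here is a minimal userspace sketch. It assumes Python 3 on Linux, where `os.sched_getaffinity` queries the same CPU affinity mask; the helper name `current_affinity` is made up for illustration and is not part of the thread. On a kernel showing the hang reported above, this call may block the same way, so try it from a shell you can afford to lose.

```python
import os


def current_affinity(pid=0):
    """Return the sorted list of CPUs that `pid` may run on.

    pid=0 means the calling process. Returns None on platforms where
    the affinity call is not available (an assumption made so the
    sketch stays portable).
    """
    if not hasattr(os, "sched_getaffinity"):
        return None
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    print("allowed CPUs:", current_affinity())
```

In the strace line above, the first argument (3528) is the pid being queried and the second (32) is the size in bytes of the CPU mask buffer; the Python wrapper handles the buffer itself.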
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-09 5:08 ` v2.6.21.5-rt19 (sched_getaffinity?) Fernando Lopez-Lezcano
@ 2007-07-17 19:32 ` Ingo Molnar
2007-07-17 19:47 ` Fernando Lopez-Lezcano
2007-07-17 19:56 ` Fernando Lopez-Lezcano
0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-17 19:32 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU, Steven Rostedt,
RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> I do get flash 9 (I know, not the best example) and tomboy to hang as
> reported by one of my Planet CCRMA users - flash 9 tested working on
> stock fedora 7 kernel - and both seem to hang in the same system call:
>
> sched_getaffinity(3528, 32, <unfinished ...>
>
> Full output of strace attached for both cases.

hm, that's weird. Is it completely unkillable at that time? Could you do
a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
a full task state dump via:

echo t > /proc/sysrq-trigger

thanks,

Ingo
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 19:32 ` Ingo Molnar
@ 2007-07-17 19:47 ` Fernando Lopez-Lezcano
2007-07-17 19:56 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 19:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela

On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > reported by one of my Planet CCRMA users - flash 9 tested working on
> > stock fedora 7 kernel - and both seem to hang in the same system call:
> >
> > sched_getaffinity(3528, 32, <unfinished ...>
> >
> > Full output of strace attached for both cases.
>
> hm, that's weird. Is it completely unkillable at that time? Could you do
> a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> a full task state dump via:
>
> echo t > /proc/sysrq-trigger
>
> thanks,

kill -9 does nothing. If there's another way to kill something let me
know :-)

I'll try to get the dump asap.

Hope you had a good time over the long weekend, you certainly deserve
some rest (and congrats on the scheduler inclusion in mainline!)

-- Fernando
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 19:32 ` Ingo Molnar
2007-07-17 19:47 ` Fernando Lopez-Lezcano
@ 2007-07-17 19:56 ` Fernando Lopez-Lezcano
2007-07-17 20:12 ` Ingo Molnar
1 sibling, 1 reply; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 19:56 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela

[-- Attachment #1: Type: text/plain, Size: 751 bytes --]

On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > reported by one of my Planet CCRMA users - flash 9 tested working on
> > stock fedora 7 kernel - and both seem to hang in the same system call:
> >
> > sched_getaffinity(3528, 32, <unfinished ...>
> >
> > Full output of strace attached for both cases.
>
> hm, that's weird. Is it completely unkillable at that time? Could you do
> a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> a full task state dump via:
>
> echo t > /proc/sysrq-trigger

Trace attached... the process stays in D state no matter what.

-- Fernando

[-- Attachment #2: trace1.txt.gz --]
[-- Type: application/x-gzip, Size: 21335 bytes --]
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 19:56 ` Fernando Lopez-Lezcano
@ 2007-07-17 20:12 ` Ingo Molnar
2007-07-17 21:41 ` Fernando Lopez-Lezcano
2007-07-17 23:51 ` Fernando Lopez-Lezcano
0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-17 20:12 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU, Steven Rostedt,
RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> >
> > > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > > reported by one of my Planet CCRMA users - flash 9 tested working on
> > > stock fedora 7 kernel - and both seem to hang in the same system call:
> > >
> > > sched_getaffinity(3528, 32, <unfinished ...>
> > >
> > > Full output of strace attached for both cases.
> >
> > hm, that's weird. Is it completely unkillable at that time? Could you do
> > a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> > a full task state dump via:
> >
> > echo t > /proc/sysrq-trigger
>
> Trace attached... the process stays in D state no matter what.

hm, seems to be related to:

Jul 17 12:51:18 localhost kernel: sched-powersa D [f0aaf930] 00000005 6584 3420 3407

which blocks the cpu-hotplug mutex:

Jul 17 12:51:18 localhost kernel: Call Trace:
Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
Jul 17 12:51:18 localhost kernel: [<c045a647>] __synchronize_sched+0xd/0x5a
Jul 17 12:51:18 localhost kernel: [<c0423732>] arch_reinit_sched_domains+0x18/0x33
Jul 17 12:51:18 localhost kernel: [<c0423789>] sched_power_savings_store+0x3c/0x49
Jul 17 12:51:18 localhost kernel: [<c0552cd4>] sysdev_class_store+0x1e/0x22
Jul 17 12:51:18 localhost kernel: [<c04b195b>] sysfs_write_file+0xa3/0xc6
Jul 17 12:51:18 localhost kernel: [<c047a64a>] vfs_write+0xa8/0x154
Jul 17 12:51:18 localhost kernel: [<c047ac65>] sys_write+0x41/0x67
Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb

and firefox blocks on the same mutex too:

Jul 17 12:51:18 localhost kernel: firefox-bin D [efc44670] 00000012 6368 4388 1
Jul 17 12:51:18 localhost kernel: Call Trace:
Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
Jul 17 12:51:18 localhost kernel: [<c0423c53>] sys_sched_getaffinity+0x1f/0x41
Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
Jul 17 12:51:18 localhost kernel: [<b7f0f410>] 0xb7f0f410

does lockdep pinpoint anything?

Ingo
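Picking the blocked tasks out of a full task state dump by hand is tedious, so here is a rough sketch of how it could be scripted. It is only an illustration: the function name `blocked_tasks` and both regular expressions are assumptions based on the syslog lines quoted in this thread, and dumps from other kernel versions may be laid out differently.

```python
import re

# Task header as it appears in the dump quoted in this thread:
#   "... kernel: <comm> <state> [<task pointer>] ..."
TASK_RE = re.compile(
    r"kernel:\s+(?P<comm>\S+)\s+(?P<state>[RSDTZX])\s+\[[0-9a-f]+\]"
)
# Stack frame: "[<address>] symbol+0xoffset/0xsize"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<sym>[\w.]+)\+0x")


def blocked_tasks(dump_lines):
    """Return [(comm, [symbols...])] for tasks in uninterruptible (D) state."""
    result = []
    frames = None  # frame list of the D-state task being read, if any
    for line in dump_lines:
        task = TASK_RE.search(line)
        if task:
            if task.group("state") == "D":
                frames = []
                result.append((task.group("comm"), frames))
            else:
                frames = None  # skip frames of tasks that are not blocked
            continue
        frame = FRAME_RE.search(line)
        if frame and frames is not None:
            frames.append(frame.group("sym"))
    return result


if __name__ == "__main__":
    import sys

    # Usage (file name is hypothetical): python blocked.py trace1.txt
    with open(sys.argv[1]) as fh:
        for comm, syms in blocked_tasks(fh):
            if "sched_getaffinity" in syms:
                print(comm, "is blocked under sched_getaffinity")
```

Run against the two traces above, this would list both sched-powersa and firefox-bin, since each has sched_getaffinity in its stack.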
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 20:12 ` Ingo Molnar
@ 2007-07-17 21:41 ` Fernando Lopez-Lezcano
2007-07-17 23:51 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 21:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU, Steven Rostedt,
RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela, nando

[-- Attachment #1: Type: text/plain, Size: 3044 bytes --]

On Tue, 2007-07-17 at 22:12 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> > > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > >
> > > > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > > > reported by one of my Planet CCRMA users - flash 9 tested working on
> > > > stock fedora 7 kernel - and both seem to hang in the same system call:
> > > >
> > > > sched_getaffinity(3528, 32, <unfinished ...>
> > > >
> > > > Full output of strace attached for both cases.
> > >
> > > hm, that's weird. Is it completely unkillable at that time? Could you do
> > > a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> > > a full task state dump via:
> > >
> > > echo t > /proc/sysrq-trigger
> >
> > Trace attached... the process stays in D state no matter what.

Just in case, it repeats under 2.6.22.1-rt4 (releases before rt4 did not
boot on my t61 laptop, this one at least does that). I'm including the
(probably redundant) dump.

I have to build a new kernel with prove locking...

-- Fernando

> hm, seems to be related to:
>
> Jul 17 12:51:18 localhost kernel: sched-powersa D [f0aaf930] 00000005 6584 3420 3407
>
> which blocks the cpu-hotplug mutex:
>
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c045a647>] __synchronize_sched+0xd/0x5a
> Jul 17 12:51:18 localhost kernel: [<c0423732>] arch_reinit_sched_domains+0x18/0x33
> Jul 17 12:51:18 localhost kernel: [<c0423789>] sched_power_savings_store+0x3c/0x49
> Jul 17 12:51:18 localhost kernel: [<c0552cd4>] sysdev_class_store+0x1e/0x22
> Jul 17 12:51:18 localhost kernel: [<c04b195b>] sysfs_write_file+0xa3/0xc6
> Jul 17 12:51:18 localhost kernel: [<c047a64a>] vfs_write+0xa8/0x154
> Jul 17 12:51:18 localhost kernel: [<c047ac65>] sys_write+0x41/0x67
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
>
> and firefox blocks on the same mutex too:
>
> Jul 17 12:51:18 localhost kernel: firefox-bin D [efc44670] 00000012 6368 4388 1
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c0423c53>] sys_sched_getaffinity+0x1f/0x41
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
> Jul 17 12:51:18 localhost kernel: [<b7f0f410>] 0xb7f0f410
>
> does lockdep pinpoint anything?
>
> Ingo

[-- Attachment #2: trace2.txt.gz --]
[-- Type: application/x-gzip, Size: 21875 bytes --]
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 20:12 ` Ingo Molnar
2007-07-17 21:41 ` Fernando Lopez-Lezcano
@ 2007-07-17 23:51 ` Fernando Lopez-Lezcano
2007-07-18 7:18 ` Ingo Molnar
1 sibling, 1 reply; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 23:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela

[-- Attachment #1: Type: text/plain, Size: 2997 bytes --]

On Tue, 2007-07-17 at 22:12 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> > > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > >
> > > > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > > > reported by one of my Planet CCRMA users - flash 9 tested working on
> > > > stock fedora 7 kernel - and both seem to hang in the same system call:
> > > >
> > > > sched_getaffinity(3528, 32, <unfinished ...>
> > > >
> > > > Full output of strace attached for both cases.
> > >
> > > hm, that's weird. Is it completely unkillable at that time? Could you do
> > > a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> > > a full task state dump via:
> > >
> > > echo t > /proc/sysrq-trigger
> >
> > Trace attached... the process stays in D state no matter what.
>
> hm, seems to be related to:
>
> Jul 17 12:51:18 localhost kernel: sched-powersa D [f0aaf930] 00000005 6584 3420 3407
>
> which blocks the cpu-hotplug mutex:
>
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c045a647>] __synchronize_sched+0xd/0x5a
> Jul 17 12:51:18 localhost kernel: [<c0423732>] arch_reinit_sched_domains+0x18/0x33
> Jul 17 12:51:18 localhost kernel: [<c0423789>] sched_power_savings_store+0x3c/0x49
> Jul 17 12:51:18 localhost kernel: [<c0552cd4>] sysdev_class_store+0x1e/0x22
> Jul 17 12:51:18 localhost kernel: [<c04b195b>] sysfs_write_file+0xa3/0xc6
> Jul 17 12:51:18 localhost kernel: [<c047a64a>] vfs_write+0xa8/0x154
> Jul 17 12:51:18 localhost kernel: [<c047ac65>] sys_write+0x41/0x67
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
>
> and firefox blocks on the same mutex too:
>
> Jul 17 12:51:18 localhost kernel: firefox-bin D [efc44670] 00000012 6368 4388 1
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c0423c53>] sys_sched_getaffinity+0x1f/0x41
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
> Jul 17 12:51:18 localhost kernel: [<b7f0f410>] 0xb7f0f410
>
> does lockdep pinpoint anything?

Lots of stuff, and at the end the lock report for the problem.
Hopefully some of this will help... I have attached the whole bootup
sequence as logged in /var/log/messages.

-- Fernando

[-- Attachment #2: trace3.txt.gz --]
[-- Type: application/x-gzip, Size: 12462 bytes --]
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 23:51 ` Fernando Lopez-Lezcano
@ 2007-07-18 7:18 ` Ingo Molnar
2007-07-18 13:21 ` Paul E. McKenney
2007-07-18 18:02 ` Fernando Lopez-Lezcano
0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-18 7:18 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU, Steven Rostedt,
RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela, Paul E. McKenney

* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:

> > does lockdep pinpoint anything?
>
> Lots of stuff, and at the end the lock report for the problem.
> Hopefully some of this will help... I have attached the whole bootup
> sequence as logged in /var/log/messages.

yeah, it pinpointed the bug. It seems to be an interaction between
RCU-preempt (Paul Cc:-ed) and sched_mc_power_savings_store():
detach_destroy_domains() uses synchronize_sched() which uses
getaffinity, which takes sched_hotcpu_mutex, and
arch_reinit_sched_domains does it too - see the lockdep report below.
I've added a quick workaround below as well, which should keep your box
from hanging.

Ingo

=============================================
[ INFO: possible recursive locking detected ]
[ 2.6.22-0182.rt4.3.fc7.ccrmart #1
---------------------------------------------
sched-powersave/3251 is trying to acquire lock:
 (sched_hotcpu_mutex){--..}, at: [<c0424a37>] sched_getaffinity+0x14/0x94

but task is already holding lock:
 (sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33

other info that might help us debug this:
1 lock held by sched-powersave/3251:
 #0: (sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33

stack backtrace:
 [<c040600c>] show_trace_log_lvl+0x1a/0x2f
 [<c0406ae8>] show_trace+0x12/0x14
 [<c0406b50>] dump_stack+0x16/0x18
 [<c0446f46>] __lock_acquire+0x172/0xb67
 [<c0447d03>] lock_acquire+0x56/0x6f
 [<c061d414>] _mutex_lock+0x2b/0x38
 [<c0424a37>] sched_getaffinity+0x14/0x94
 [<c0460841>] __synchronize_sched+0x11/0x5f
 [<c0423fa8>] detach_destroy_domains+0x2c/0x30
 [<c04245af>] arch_reinit_sched_domains+0x18/0x33
 [<c0424606>] sched_power_savings_store+0x3c/0x49
 [<c0424634>] sched_mc_power_savings_store+0xe/0x10
 [<c0561f11>] sysdev_class_store+0x20/0x25
 [<c04bbc6c>] sysfs_write_file+0xaf/0xd0
 [<c048183c>] vfs_write+0xaf/0x163
 [<c0481e8a>] sys_write+0x3d/0x61
 [<c040501a>] syscall_call+0x7/0xb
 =======================
thinkpad_acpi: ThinkPad ACPI Extras v0.14

--------------------->
Index: linux-rt.q/kernel/sched.c
===================================================================
--- linux-rt.q.orig/kernel/sched.c
+++ linux-rt.q/kernel/sched.c
@@ -6699,7 +6699,6 @@ static void detach_destroy_domains(const
 
 	for_each_cpu_mask(i, *cpu_map)
 		cpu_attach_domain(NULL, i);
-	synchronize_sched();
 	arch_destroy_sched_domains(cpu_map);
 }
 
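The recursive acquisition that lockdep reports can be illustrated with a small userspace analogy. This is not kernel code: it is a sketch in Python where a plain `threading.Lock` stands in for sched_hotcpu_mutex, the function names with a `_model` suffix are invented for illustration, and a non-blocking acquire is used so the demo reports the self-deadlock instead of hanging the way the kernel path does.

```python
import threading

# Stand-in for sched_hotcpu_mutex: a plain, non-recursive lock.
sched_hotcpu_mutex = threading.Lock()


def sched_getaffinity_model():
    """Inner acquisition. Returns False where the kernel would sleep forever."""
    # The kernel path blocks on the mutex; a non-blocking attempt is
    # used here so the problem is reported rather than reproduced.
    if not sched_hotcpu_mutex.acquire(blocking=False):
        return False
    try:
        return True
    finally:
        sched_hotcpu_mutex.release()


def arch_reinit_sched_domains_model(call_synchronize_sched):
    """Outer path, which already holds the mutex when it calls down."""
    with sched_hotcpu_mutex:
        if call_synchronize_sched:
            # detach_destroy_domains() -> synchronize_sched()
            #   -> sched_getaffinity(), which wants the same mutex again
            return sched_getaffinity_model()
        # With the workaround applied, the inner call is skipped.
        return True


if __name__ == "__main__":
    print("with synchronize_sched():", arch_reinit_sched_domains_model(True))
    print("with the workaround:     ", arch_reinit_sched_domains_model(False))
```

The first call reports that the inner acquisition cannot succeed while the outer holder is the same task, which mirrors the "possible recursive locking" report; the second models the effect of dropping the synchronize_sched() call, as the workaround patch does.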
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-18 7:18 ` Ingo Molnar
@ 2007-07-18 13:21 ` Paul E. McKenney
2007-07-18 18:02 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2007-07-18 13:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: Fernando Lopez-Lezcano, Gabriel C, Carsten Emde,
jcaceres@ccrma.Stanford.EDU, Steven Rostedt, RT-Users, LKML,
Thomas Gleixner, Rui Nuno Capela

On Wed, Jul 18, 2007 at 09:18:52AM +0200, Ingo Molnar wrote:
>
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > > does lockdep pinpoint anything?
> >
> > Lots of stuff, and at the end the lock report for the problem.
> > Hopefully some of this will help... I have attached the whole bootup
> > sequence as logged in /var/log/messages.
>
> yeah, it pinpointed the bug. It seems to be an interaction between
> RCU-preempt (Paul Cc:-ed) and sched_mc_power_savings_store():
> detach_destroy_domains() uses synchronize_sched() which uses
> getaffinity, which takes sched_hotcpu_mutex, and
> arch_reinit_sched_domains does it too - see the lockdep report below.
> I've added a quick workaround below as well, which should keep your box
> from hanging.

Interesting. The "right" way to do this seems to be to put both "classic"
and "realtime" RCU into the kernel. The "classic" RCU would be there to
support synchronize_sched() without calling getaffinity(), while the
"realtime" RCU would be there for the standard RCU API. I will work this
into my -mm efforts.

Thanx, Paul

> =============================================
> [ INFO: possible recursive locking detected ]
> [ 2.6.22-0182.rt4.3.fc7.ccrmart #1
> ---------------------------------------------
> sched-powersave/3251 is trying to acquire lock:
>  (sched_hotcpu_mutex){--..}, at: [<c0424a37>] sched_getaffinity+0x14/0x94
>
> but task is already holding lock:
>  (sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33
>
> other info that might help us debug this:
> 1 lock held by sched-powersave/3251:
>  #0: (sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33
>
> stack backtrace:
>  [<c040600c>] show_trace_log_lvl+0x1a/0x2f
>  [<c0406ae8>] show_trace+0x12/0x14
>  [<c0406b50>] dump_stack+0x16/0x18
>  [<c0446f46>] __lock_acquire+0x172/0xb67
>  [<c0447d03>] lock_acquire+0x56/0x6f
>  [<c061d414>] _mutex_lock+0x2b/0x38
>  [<c0424a37>] sched_getaffinity+0x14/0x94
>  [<c0460841>] __synchronize_sched+0x11/0x5f
>  [<c0423fa8>] detach_destroy_domains+0x2c/0x30
>  [<c04245af>] arch_reinit_sched_domains+0x18/0x33
>  [<c0424606>] sched_power_savings_store+0x3c/0x49
>  [<c0424634>] sched_mc_power_savings_store+0xe/0x10
>  [<c0561f11>] sysdev_class_store+0x20/0x25
>  [<c04bbc6c>] sysfs_write_file+0xaf/0xd0
>  [<c048183c>] vfs_write+0xaf/0x163
>  [<c0481e8a>] sys_write+0x3d/0x61
>  [<c040501a>] syscall_call+0x7/0xb
>  =======================
> thinkpad_acpi: ThinkPad ACPI Extras v0.14
>
> --------------------->
> Index: linux-rt.q/kernel/sched.c
> ===================================================================
> --- linux-rt.q.orig/kernel/sched.c
> +++ linux-rt.q/kernel/sched.c
> @@ -6699,7 +6699,6 @@ static void detach_destroy_domains(const
>
> 	for_each_cpu_mask(i, *cpu_map)
> 		cpu_attach_domain(NULL, i);
> -	synchronize_sched();
> 	arch_destroy_sched_domains(cpu_map);
> }
>
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-18 7:18 ` Ingo Molnar
2007-07-18 13:21 ` Paul E. McKenney
@ 2007-07-18 18:02 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-18 18:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela,
Paul E. McKenney

On Wed, 2007-07-18 at 09:18 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > > does lockdep pinpoint anything?
> >
> > Lots of stuff, and at the end the lock report for the problem.
> > Hopefully some of this will help... I have attached the whole bootup
> > sequence as logged in /var/log/messages.
>
> yeah, it pinpointed the bug. It seems to be an interaction between
> RCU-preempt (Paul Cc:-ed) and sched_mc_power_savings_store():
> detach_destroy_domains() uses synchronize_sched() which uses
> getaffinity, which takes sched_hotcpu_mutex, and
> arch_reinit_sched_domains does it too - see the lockdep report below.
> I've added a quick workaround below as well, which should keep your box
> from hanging.

I can confirm that flash9 does not hang with the patch. Thanks!!!
I presume the same would apply to 2.6.21.x and, say, rt21. I'll test.

But (of course, there's always a but somewhere) I just experienced a
complete hang - 2.6.22.1-rt4 with the little patch. This time there was
something in the logs, maybe it will help? This was when finishing the
install of an additional kernel module rpm package (ipw3945 drivers).

-- Fernando

--------
Jul 18 10:48:15 localhost kernel: BUG: sleeping function called from invalid context modprobe(5001) at kernel/rtmutex.c:636
Jul 18 10:48:15 localhost kernel: in_atomic():1 [00000001], irqs_disabled():0
Jul 18 10:48:15 localhost kernel: [<c0405f34>] show_trace_log_lvl+0x1a/0x2f
Jul 18 10:48:15 localhost kernel: [<c0406a09>] show_trace+0x12/0x14
Jul 18 10:48:15 localhost kernel: [<c0406a71>] dump_stack+0x16/0x18
Jul 18 10:48:15 localhost kernel: [<c0423bfc>] __might_sleep+0xeb/0xf2
Jul 18 10:48:15 localhost kernel: [<c0617242>] __rt_spin_lock+0x24/0x40
Jul 18 10:48:15 localhost kernel: [<c0617266>] rt_spin_lock+0x8/0xa
Jul 18 10:48:15 localhost kernel: [<c04621c9>] get_zone_pcp+0x23/0x33
Jul 18 10:48:15 localhost kernel: [<c0462702>] free_hot_cold_page+0xcf/0x148
Jul 18 10:48:15 localhost kernel: [<c04627b2>] free_hot_page+0xa/0xc
Jul 18 10:48:15 localhost kernel: [<c0462a82>] __free_pages+0x25/0x30
Jul 18 10:48:15 localhost kernel: [<c0462ab6>] free_pages+0x29/0x2b
Jul 18 10:48:15 localhost kernel: [<c047abf3>] quicklist_trim+0xd0/0xf5
Jul 18 10:48:15 localhost kernel: [<c041f5d9>] check_pgt_cache+0x1e/0x20
Jul 18 10:48:15 localhost kernel: [<c046aedf>] free_pgtables+0x52/0x147
Jul 18 10:48:15 localhost kernel: [<c046cdf7>] unmap_region+0xe6/0x135
Jul 18 10:48:15 localhost kernel: [<c046d764>] do_munmap+0x153/0x1b4
Jul 18 10:48:15 localhost kernel: [<c046f3de>] do_mremap+0x413/0x4c3
Jul 18 10:48:15 localhost kernel: [<c046f4c4>] sys_mremap+0x36/0x56
Jul 18 10:48:15 localhost kernel: [<c0404fca>] syscall_call+0x7/0xb
Jul 18 10:48:15 localhost kernel: =======================
Jul 18 10:48:16 localhost kernel: BUG: sleeping function called from invalid context head(5652) at kernel/rtmutex.c:636
Jul 18 10:48:16 localhost kernel: in_atomic():1 [00000001], irqs_disabled():0
Jul 18 10:48:16 localhost kernel: [<c0405f34>] show_trace_log_lvl+0x1a/0x2f
Jul 18 10:48:16 localhost kernel: [<c0406a09>] show_trace+0x12/0x14
Jul 18 10:48:16 localhost kernel: [<c0406a71>] dump_stack+0x16/0x18
Jul 18 10:48:16 localhost kernel: [<c0423bfc>] __might_sleep+0xeb/0xf2
Jul 18 10:48:16 localhost kernel: [<c0617242>] __rt_spin_lock+0x24/0x40
Jul 18 10:48:16 localhost kernel: [<c0617266>] rt_spin_lock+0x8/0xa
Jul 18 10:48:16 localhost kernel: [<c04621c9>] get_zone_pcp+0x23/0x33
Jul 18 10:48:16 localhost kernel: [<c0462702>] free_hot_cold_page+0xcf/0x148
Jul 18 10:48:16 localhost kernel: [<c04627b2>] free_hot_page+0xa/0xc
Jul 18 10:48:16 localhost kernel: [<c0462a82>] __free_pages+0x25/0x30
Jul 18 10:48:16 localhost kernel: [<c0462ab6>] free_pages+0x29/0x2b
Jul 18 10:48:16 localhost kernel: [<c047abf3>] quicklist_trim+0xd0/0xf5
Jul 18 10:48:16 localhost kernel: [<c041f5d9>] check_pgt_cache+0x1e/0x20
Jul 18 10:48:16 localhost kernel: [<c046aedf>] free_pgtables+0x52/0x147
Jul 18 10:48:16 localhost kernel: [<c046cdf7>] unmap_region+0xe6/0x135
Jul 18 10:48:16 localhost kernel: [<c046d764>] do_munmap+0x153/0x1b4
Jul 18 10:48:16 localhost kernel: [<c046d7f5>] sys_munmap+0x30/0x3f
Jul 18 10:48:16 localhost kernel: [<c0404fca>] syscall_call+0x7/0xb
Jul 18 10:48:16 localhost kernel: =======================
Jul 18 10:50:22 localhost syslogd 1.4.2: restart.