* v2.6.21.5-rt19
@ 2007-07-04 20:49 Thomas Gleixner
2007-07-06 14:10 ` v2.6.21.5-rt19 Rui Nuno Capela
0 siblings, 1 reply; 19+ messages in thread
From: Thomas Gleixner @ 2007-07-04 20:49 UTC (permalink / raw)
To: LKML
Cc: RT-Users, Ingo Molnar, Steven Rostedt, Fernando Lopez-Lezcano,
jcaceres@ccrma.Stanford.EDU, Carsten Emde
I'm pleased to announce the v2.6.21.5-rt19 kernel on behalf of Ingo.
It can be downloaded from the usual place:
http://people.redhat.com/mingo/realtime-preempt/
More info about the -rt patch set can be found in the RT wiki:
http://rt.wiki.kernel.org
Changes since 2.6.21.5-rt18:
- Fixed a nasty and hard to track down slowness / boot problem on SMP
machines with CONFIG_NO_HZ enabled. The problem was caused by the timer
wheel base lock being held during the get_next_timer_interrupt() call in
the idle path, which eventually led to a bogus PI boosting of the idle
task and, as a consequence, a stale, wrong scheduler selection for the
affected idle task.
Kudos to Carsten Emde, who patiently and meticulously isolated the
problem and provided the traces, which allowed us to identify the root
cause.
Problem solution: Prevent idle task boosting
- back port of the ntp / clock_was_set fix
- integration of the processor_idle fix from Venki Pallipadi, which
resolves boot issues on some platforms
- ep93xx clock events fix from Manfred Gruber
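For readers unfamiliar with priority inheritance (PI), the first item in the list above can be pictured with a toy model. This is only a sketch under loud assumptions: the Task class, pi_boost(), pick_next() and the priority numbers are hypothetical names and values made up for illustration, not the kernel's data structures or the actual -rt19 change. It shows why boosting the idle task makes the scheduler pick it over real work, and how refusing to boost it avoids that.

```python
class Task:
    """Toy task; a lower prio number means a higher priority."""

    def __init__(self, name, normal_prio, is_idle=False):
        self.name = name
        self.normal_prio = normal_prio
        self.prio = normal_prio      # effective priority, may be boosted
        self.is_idle = is_idle


def pi_boost(owner, waiter, protect_idle):
    """Boost the lock owner to the waiter's priority (toy PI).

    With protect_idle=True the idle task is never boosted, which mirrors
    the "prevent idle task boosting" idea only in spirit.
    """
    if protect_idle and owner.is_idle:
        return owner.prio
    owner.prio = min(owner.prio, waiter.prio)
    return owner.prio


def pick_next(runqueue):
    """Pick the task with the best (numerically lowest) effective priority."""
    return min(runqueue, key=lambda task: task.prio)


def demo(protect_idle):
    # Hypothetical priorities: idle is lowest, the waiter is a high-prio task.
    idle = Task("idle", 140, is_idle=True)
    worker = Task("worker", 120)
    waiter = Task("rt-waiter", 10)
    # The idle task "owns" the lock while the waiter blocks on it.
    pi_boost(idle, waiter, protect_idle)
    return pick_next([idle, worker]).name


print("without protection:", demo(protect_idle=False))
print("with protection:   ", demo(protect_idle=True))
```

In the unprotected case the boosted idle task wins the selection even though a runnable worker exists, which is the kind of wrong scheduler selection the changelog entry describes.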
To build a 2.6.21.5-rt19 tree, the following patches should be applied:
http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.21.5.tar.bz2
http://people.redhat.com/mingo/realtime-preempt/patch-2.6.21.5-rt19
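As a rough sketch, the build steps could be spelled out like this. The two URLs are the ones given above; the helper name build_steps(), the use of wget and tar, and the -p1 strip level are assumptions for illustration, so adjust them to your setup. The script only prints the commands, it does not run them.

```python
BASE_VERSION = "2.6.21.5"   # vanilla kernel the patch applies to
RT_SUFFIX = "rt19"          # -rt release announced in this mail


def build_steps(base, rt):
    """Return the shell commands to fetch and patch the tree (not executed)."""
    tarball = f"linux-{base}.tar.bz2"
    patch = f"patch-{base}-{rt}"
    return [
        f"wget http://kernel.org/pub/linux/kernel/v2.6/{tarball}",
        f"wget http://people.redhat.com/mingo/realtime-preempt/{patch}",
        f"tar xjf {tarball}",
        f"cd linux-{base}",
        # Assumed strip level; check the paths inside the patch if it fails.
        f"patch -p1 < ../{patch}",
    ]


for step in build_steps(BASE_VERSION, RT_SUFFIX):
    print(step)
```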
Thanks,
tglx
* Re: v2.6.21.5-rt19
2007-07-04 20:49 v2.6.21.5-rt19 Thomas Gleixner
@ 2007-07-06 14:10 ` Rui Nuno Capela
2007-07-06 21:49 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
0 siblings, 1 reply; 19+ messages in thread
From: Rui Nuno Capela @ 2007-07-06 14:10 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, RT-Users, Ingo Molnar, Steven Rostedt,
Fernando Lopez-Lezcano, jcaceres@ccrma.Stanford.EDU, Carsten Emde
Hi,
On Wed, July 4, 2007 21:49, Thomas Gleixner wrote:
> I'm pleased to announce the v2.6.21.5-rt19 kernel on behalf of Ingo.
>
>
> It can be downloaded from the usual place:
>
> http://people.redhat.com/mingo/realtime-preempt/
>
> More info about the -rt patch set can be found in the RT wiki:
>
> http://rt.wiki.kernel.org
>
> Changes since 2.6.21.5-rt18:
>
> - Fixed a nasty and hard to track down slowness / boot problem on SMP
> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> wheel base lock held during the get_next_timer_interrupt() call in the
> idle path, which eventually led to a bogus PI boosting of the idle task
> and in consequence a stale wrong scheduler selection for the affected idle
> task.
>
> Kudos to Carsten Emde, who patiently and meticulously isolated the
> problem and provided the traces, which allowed to identify the root cause.
>
> Problem solution: Prevent idle task boosting
>
> - back port of the ntp / clock_was_set fix
>
> - integration of the processor_idle fix from Venki Pallipadi, which
> resolves boot issues on some platforms
>
> - ep93xx clock events fix from Manfred Gruber
>
Maybe someone remembers me whining about troubles with 2.6.21-rt2..18 on my
Core2 T7200 laptop (fujitsu-siemens amilo i1520).
Although I still have my fingers crossed, the good news is that
2.6.21.5-rt19 (and -rt20) behave far better now on the very same box.
I've been up and running for more than 8 hours now, without a single
glimpse of the bad symptoms, which used to show up in a matter of minutes,
if not earlier during init time.
Congratulations. I'm not sure whether this problem can be closed for good,
though, because you mention the slowness fix applies to CONFIG_NO_HZ=y
and I'm quite sure my badness showed up either way.
But at least -rt is usable again here and that just makes me happier :)
Cheers.
--
rncbc aka Rui Nuno Capela
rncbc@rncbc.org
* Re: v2.6.21.5-rt19
2007-07-06 14:10 ` v2.6.21.5-rt19 Rui Nuno Capela
@ 2007-07-06 21:49 ` Fernando Lopez-Lezcano
2007-07-07 9:15 ` v2.6.21.5-rt19 Ingo Molnar
2007-07-07 9:24 ` v2.6.21.5-rt19 Ingo Molnar
0 siblings, 2 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-06 21:49 UTC (permalink / raw)
To: Rui Nuno Capela
Cc: Thomas Gleixner, LKML, RT-Users, Ingo Molnar, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde
On Fri, 2007-07-06 at 15:10 +0100, Rui Nuno Capela wrote:
> Hi,
>
> On Wed, July 4, 2007 21:49, Thomas Gleixner wrote:
> > I'm pleased to announce the v2.6.21.5-rt19 kernel on behalf of Ingo.
> >
> >
> > It can be downloaded from the usual place:
> >
> > http://people.redhat.com/mingo/realtime-preempt/
> >
> > More info about the -rt patch set can be found in the RT wiki:
> >
> > http://rt.wiki.kernel.org
> >
> > Changes since 2.6.21.5-rt18:
> >
> > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > wheel base lock held during the get_next_timer_interrupt() call in the
> > idle path, which eventually led to a bogus PI boosting of the idle task
> > and in consequence a stale wrong scheduler selection for the affected idle
> > task.
> >
> > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > problem and provided the traces, which allowed to identify the root cause.
> >
> > Problem solution: Prevent idle task boosting
> >
> > - back port of the ntp / clock_was_set fix
> >
> > - integration of the processor_idle fix from Venki Pallipadi, which
> > resolves boot issues on some platforms
> >
> > - ep93xx clock events fix from Manfred Gruber
> >
>
> Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on my
> Core2 T7200 laptop (fujitsu-siemens amilo i1520).
>
> Althought I'm still with my fingers crossed, I can tell the good news are
> that 2.6.21.5-rt19 (and -rt20) does behave far better now on the very same
> box.
Yes, it works much better indeed...
Ingo: is there a place where I can read about the changes in different
rtxx releases? What is new/better/fixed in rt20? (I see scheduler stuff
in a diff from rt19 to rt20 but I don't really know what it means).
-- Fernando
> I've more than 8 hours up and running now, without a single glimpse of the
> bad symptoms, which used to show in a matter of minutes if not earlier
> during init time.
>
> Congratulations. I'm not sure whether this problem can be closed for good,
> though, just because you mention the slowness fix applies to CONFIG_NOHZ=Y
> and I'm quite sure my badness surged either way.
>
> But at least -rt is usable again here and that just makes me happier :)
>
> Cheers.
* Re: v2.6.21.5-rt19
2007-07-06 21:49 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
@ 2007-07-07 9:15 ` Ingo Molnar
2007-07-07 9:24 ` v2.6.21.5-rt19 Ingo Molnar
1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-07 9:15 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde
* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > Althought I'm still with my fingers crossed, I can tell the good
> > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > on the very same box.
>
> Yes, it works much better indeed...
>
> Ingo: is there a place where I can read about the changes in different
> rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> stuff in a diff from rt19 to rt20 but I don't really know what it
> means).
rt19 -> rt20 was a pure CFS update - from v18 to v19-almost-final.
Ingo
* Re: v2.6.21.5-rt19
2007-07-06 21:49 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-07 9:15 ` v2.6.21.5-rt19 Ingo Molnar
@ 2007-07-07 9:24 ` Ingo Molnar
2007-07-08 22:36 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2007-07-07 9:24 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde
* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > > Changes since 2.6.21.5-rt18:
> > >
> > > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > > wheel base lock held during the get_next_timer_interrupt() call in the
> > > idle path, which eventually led to a bogus PI boosting of the idle task
> > > and in consequence a stale wrong scheduler selection for the affected idle
> > > task.
> > >
> > > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > > problem and provided the traces, which allowed to identify the root cause.
> > >
> > > Problem solution: Prevent idle task boosting
> > Maybe someone remember me whining about troubles with 2.6.21-rt2..18
> > on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> >
> > Althought I'm still with my fingers crossed, I can tell the good
> > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > on the very same box.
>
> Yes, it works much better indeed...
>
> Ingo: is there a place where I can read about the changes in different
> rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> stuff in a diff from rt19 to rt20 but I don't really know what it
> means).
and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
CFS was merged.
i _think_ Rui might have seen two separate problems. Perhaps by the time
we fixed the first problem (which Rui saw since -rt2) we introduced the
other one via -rt11 - which then got fixed in -rt19.
btw., we'd love to get more feedback regarding CFS. CFS is a completely
new scheduler for Linux. It has a design centered around keeping
application latencies down, so it is ultimately real-time friendly, and
it should also make things work better for desktop-ish and audio-ish
stuff as well. (even under SCHED_OTHER)
So it would be nice if you could keep an extra eye on any scheduling
artifacts or regressions, and make sure your favorite workload is still
handled by the Linux scheduler in the utmost best way. I'd like to hear
about any sort of "scheduling behavior / interactivity" regression you
might see, relative to the vanilla kernel. Or if you can see no such
problems then a line of "it works as well as the previous scheduler" is
important info to us too. Thanks!
Ingo
* Re: v2.6.21.5-rt19
2007-07-07 9:24 ` v2.6.21.5-rt19 Ingo Molnar
@ 2007-07-08 22:36 ` Fernando Lopez-Lezcano
2007-07-08 22:50 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-08 23:42 ` v2.6.21.5-rt19 Gabriel C
0 siblings, 2 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-08 22:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde, nando
On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > > > Changes since 2.6.21.5-rt18:
> > > >
> > > > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > > > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > > > wheel base lock held during the get_next_timer_interrupt() call in the
> > > > idle path, which eventually led to a bogus PI boosting of the idle task
> > > > and in consequence a stale wrong scheduler selection for the affected idle
> > > > task.
> > > >
> > > > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > > > problem and provided the traces, which allowed to identify the root cause.
> > > >
> > > > Problem solution: Prevent idle task boosting
>
> > > Maybe someone remember me whining about troubles with 2.6.21-rt2..18
> > > on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> > >
> > > Althought I'm still with my fingers crossed, I can tell the good
> > > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > > on the very same box.
> >
> > Yes, it works much better indeed...
> >
> > Ingo: is there a place where I can read about the changes in different
> > rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> > stuff in a diff from rt19 to rt20 but I don't really know what it
> > means).
>
> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
> CFS was merged.
>
> i _think_ Rui might have seen two separate problems. Perhaps by the time
> we fixed the first problem (which Rui saw since -rt2) we introduced the
> other one via -rt11 - which then got fixed in -rt19.
Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
really trying to provide a "stable" rt kernel for audio usage and
including another subsystem into rt is - IMHO - not going to help.
What's the chance of splitting things?
> btw., we'd love to get more feedback regarding CFS. CFS is a completely
> new scheduler for Linux.
Then I'd rather have it separate from rt.
> It has a design centered around keeping
> application latencies down, so it is ultimately real-time friendly, and
> it should also make things work better for desktop-ish and audio-ish
> stuff as well. (even under SCHED_OTHER)
Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
list):
On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> on my desktop but on my laptop it makes Firefox and Tomboy crash.
> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> problem.
>
> Cheers,
>
> Hector
>
>
> On 7/7/07, Hector Centeno <hcengar@gmail.com> wrote:
> Hi Fernando,
>
> I do have Flash installed but for me Firefox crashes when trying to
> access gmail (which AFAIK doesn't use Flash, does it?). Right now
> Firefox is frozen and I'm typing this email using Konqueror (in Gnome).
> This is ps' output:
>
> hector 3595 1.1 2.2 194352 46336 ? D 16:25 0:03 /usr/lib/firefox-2.0.0.4/firefox-bin
>
> I think the problem is not present in my Desktop but I have to double
> check. In the same laptop using the stock fedora kernel both Tomboy
> and Firefox work fine. My laptop has a centrino duo processor, 2 gigs
> of ram and the Intel GMA950 graphics chip.
>
> Hector
I managed to completely hang firefox (fc7) with flash 9 installed
(unkillable even with -9). Does not seem to happen with flash 7. Have
not tried yet with gmail and flash uninstalled. I'll try to strace it to
see when/why it hangs.
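The "D" in the ps output quoted above is the uninterruptible-sleep state, which is why a SIGKILL has no visible effect until the task wakes up. A minimal sketch for listing such tasks, assuming a Linux /proc layout; the function names are made up for illustration:

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split on the first " (" and the last ") ".
    """
    pid_part, rest = line.split(" (", 1)
    comm, after = rest.rsplit(") ", 1)
    return int(pid_part), comm, after.split()[0]


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError):
            continue  # process went away or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if os.path.isdir("/proc"):
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Running this while the browser is hung should show the stuck process; the list is usually empty or short-lived on a healthy system.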
-- Fernando
> So it would be nice if you could keep an extra eye on any scheduling
> artifacts or regressions, and make sure your favorite workload is still
> handled by the Linux scheduler in the utmost best way. I'd like to hear
> about any sort of "scheduling behavior / interactivity" regression you
> might see, relative to the vanilla kernel. Or if you can see no such
> problems then a line of "it works as well as the previous scheduler" is
> important info to us too. Thanks!
* Re: v2.6.21.5-rt19
2007-07-08 22:36 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
@ 2007-07-08 22:50 ` Fernando Lopez-Lezcano
2007-07-08 23:42 ` v2.6.21.5-rt19 Gabriel C
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-08 22:50 UTC (permalink / raw)
To: Ingo Molnar
Cc: Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users, Steven Rostedt,
jcaceres@ccrma.Stanford.EDU, Carsten Emde, nando
On Sun, 2007-07-08 at 15:36 -0700, Fernando Lopez-Lezcano wrote:
> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > > > > Changes since 2.6.21.5-rt18:
> > > > >
> > > > > - Fixed a nasty and hard to track down slowness / boot problem on SMP
> > > > > machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> > > > > wheel base lock held during the get_next_timer_interrupt() call in the
> > > > > idle path, which eventually led to a bogus PI boosting of the idle task
> > > > > and in consequence a stale wrong scheduler selection for the affected idle
> > > > > task.
> > > > >
> > > > > Kudos to Carsten Emde, who patiently and meticulously isolated the
> > > > > problem and provided the traces, which allowed to identify the root cause.
> > > > >
> > > > > Problem solution: Prevent idle task boosting
> >
> > > > Maybe someone remember me whining about troubles with 2.6.21-rt2..18
> > > > on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> > > >
> > > > Althought I'm still with my fingers crossed, I can tell the good
> > > > news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
> > > > on the very same box.
> > >
> > > Yes, it works much better indeed...
> > >
> > > Ingo: is there a place where I can read about the changes in different
> > > rtxx releases? What is new/better/fixed in rt20? (I see scheduler
> > > stuff in a diff from rt19 to rt20 but I don't really know what it
> > > means).
> >
> > and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
> > CFS was merged.
> >
> > i _think_ Rui might have seen two separate problems. Perhaps by the time
> > we fixed the first problem (which Rui saw since -rt2) we introduced the
> > other one via -rt11 - which then got fixed in -rt19.
>
> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> really trying to provide a "stable" rt kernel for audio usage and
> including another subsystem into rt is - IMHO - not going to help.
> What's the chance of splitting things?
>
> > btw., we'd love to get more feedback regarding CFS. CFS is a completely
> > new scheduler for Linux.
>
> Then I'd rather have it separate from rt.
Please?
I would like to provide the least amount of new functionality that is
really necessary in my audio kernels. Audio related requirements include
the rt patch but not a new scheduler.
> > It has a design centered around keeping
> > application latencies down, so it is ultimately real-time friendly, and
> > it should also make things work better for desktop-ish and audio-ish
> > stuff as well. (even under SCHED_OTHER)
>
> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> list):
>
> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> > Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> > on my desktop but on my laptop it makes Firefox and Tomboy to crash.
> > On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> > problem.
It looks to my untrained eye like it is CFS related, I'm attaching the
last part of the strace of firefox while it tries to load a flash site.
The firefox process is left in an unkillable (not even by -9) state.
What else could I provide to debug the problem? (this is in a T61 laptop
with the Intel 7700 processor).
-- Fernando
* Re: v2.6.21.5-rt19
2007-07-08 22:36 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-08 22:50 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
@ 2007-07-08 23:42 ` Gabriel C
2007-07-09 3:53 ` v2.6.21.5-rt19 Fernando Pablo Lopez-Lezcano
1 sibling, 1 reply; 19+ messages in thread
From: Gabriel C @ 2007-07-08 23:42 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Ingo Molnar, Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users,
Steven Rostedt, jcaceres@ccrma.Stanford.EDU, Carsten Emde
Fernando Lopez-Lezcano wrote:
> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
>
>> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>>
>>>>> Changes since 2.6.21.5-rt18:
>>>>>
>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
>>>>> and in consequence a stale wrong scheduler selection for the affected idle
>>>>> task.
>>>>>
>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
>>>>> problem and provided the traces, which allowed to identify the root cause.
>>>>>
>>>>> Problem solution: Prevent idle task boosting
>>>>>
>>>> Maybe someone remember me whining about troubles with 2.6.21-rt2..18
>>>> on my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
>>>>
>>>> Althought I'm still with my fingers crossed, I can tell the good
>>>> news are that 2.6.21.5-rt19 (and -rt20) does behave far better now
>>>> on the very same box.
>>>>
>>> Yes, it works much better indeed...
>>>
>>> Ingo: is there a place where I can read about the changes in different
>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler
>>> stuff in a diff from rt19 to rt20 but I don't really know what it
>>> means).
>>>
>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when
>> CFS was merged.
>>
>> i _think_ Rui might have seen two separate problems. Perhaps by the time
>> we fixed the first problem (which Rui saw since -rt2) we introduced the
>> other one via -rt11 - which then got fixed in -rt19.
>>
>
> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> really trying to provide a "stable" rt kernel for audio usage and
> including another subsystem into rt is - IMHO - not going to help.
> What's the chance of splitting things?
>
>
>> btw., we'd love to get more feedback regarding CFS. CFS is a completely
>> new scheduler for Linux.
>>
>
> Then I'd rather have it separate from rt.
>
>
>> It has a design centered around keeping
>> application latencies down, so it is ultimately real-time friendly, and
>> it should also make things work better for desktop-ish and audio-ish
>> stuff as well. (even under SCHED_OTHER)
>>
>
> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> list):
>
> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
>
>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
>> on my desktop but on my laptop it makes Firefox and Tomboy crash.
>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
>> problem.
>>
>> Cheers,
>>
>> Hector
>>
>>
>> On 7/7/07, Hector Centeno <hcengar@gmail.com> wrote:
>> Hi Fernando,
>>
>> I do have Flash installed but for me Firefox crashes when trying to
>> access gmail (which AFAIK doesn't use Flash, does it?). Right now
>> Firefox is frozen and I'm typing this email using Konqueror (in Gnome).
>> This is ps' output:
>>
>> hector 3595 1.1 2.2 194352 46336 ? D 16:25 0:03 /usr/lib/firefox-2.0.0.4/firefox-bin
>>
>> I think the problem is not present in my Desktop but I have to double
>> check. In the same laptop using the stock fedora kernel both Tomboy
>> and Firefox work fine. My laptop has a centrino duo processor, 2 gigs
>> of ram and the Intel GMA950 graphics chip.
>>
>> Hector
>>
>
> I managed to completely hang firefox (fc7) with flash 9 installed
> (unkillable even with -9).
Firefox with flash 9 does not work well; there are a lot of bugs reported
about it (just google), and it hangs on vanilla or
any other kernel as well. Not only Firefox but also Swiftfox,
Opera, Epiphany etc.
Most of the time Firefox dies when you use flash 9 and close a window or a tab.
> Does not seem to happen with flash 7.
Yes flash 7 is fine.
> Have
> not tried yet with gmail and flash uninstalled. I'll try to strace it to
> see when/why it hangs.
>
>
> -- Fernando
>
>
>
>> So it would be nice if you could keep an extra eye on any scheduling
>> artifacts or regressions, and make sure your favorite workload is still
>> handled by the Linux scheduler in the utmost best way. I'd like to hear
>> about any sort of "scheduling behavior / interactivity" regression you
>> might see, relative to the vanilla kernel. Or if you can see no such
>> problems then a line of "it works as well as the previous scheduler" is
>> important info to us too. Thanks!
>>
>
>
>
Regards,
Gabriel C
* Re: v2.6.21.5-rt19
2007-07-08 23:42 ` v2.6.21.5-rt19 Gabriel C
@ 2007-07-09 3:53 ` Fernando Pablo Lopez-Lezcano
2007-07-09 5:08 ` v2.6.21.5-rt19 (sched_getaffinity?) Fernando Lopez-Lezcano
0 siblings, 1 reply; 19+ messages in thread
From: Fernando Pablo Lopez-Lezcano @ 2007-07-09 3:53 UTC (permalink / raw)
To: Gabriel C
Cc: Ingo Molnar, Rui Nuno Capela, Thomas Gleixner, LKML, RT-Users,
Steven Rostedt, jcaceres@ccrma.Stanford.EDU, Carsten Emde
On Mon, 9 Jul 2007, Gabriel C wrote:
> Fernando Lopez-Lezcano wrote:
>> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
>>> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>>>>>> Changes since 2.6.21.5-rt18:
>>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
>>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
>>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
>>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
>>>>>> and in consequence a stale wrong scheduler selection for the affected
>>>>>> idle
>>>>>> task.
>>>>>>
>>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
>>>>>> problem and provided the traces, which allowed to identify the root
>>>>>> cause.
>>>>>>
>>>>>> Problem solution: Prevent idle task boosting
>>>>>>
>>>>> Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on
>>>>> my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
>>>>>
>>>>> Althought I'm still with my fingers crossed, I can tell the good news
>>>>> are that 2.6.21.5-rt19 (and -rt20) does behave far better now on the
>>>>> very same box.
>>>>>
>>>> Yes, it works much better indeed...
>>>>
>>>> Ingo: is there a place where I can read about the changes in different
>>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler stuff
>>>> in a diff from rt19 to rt20 but I don't really know what it means).
>>>>
>>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when CFS
>>> was merged.
>>>
>>> i _think_ Rui might have seen two separate problems. Perhaps by the time
>>> we fixed the first problem (which Rui saw since -rt2) we introduced the
>>> other one via -rt11 - which then got fixed in -rt19.
>>
>> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
>> really trying to provide a "stable" rt kernel for audio usage and
>> including another subsystem into rt is - IMHO - not going to help.
>> What's the chance of splitting things?
>>
>>> btw., we'd love to get more feedback regarding CFS. CFS is a completely
>>> new scheduler for Linux.
>>
>> Then I'd rather have it separate from rt.
>>
>>> It has a design centered around keeping application latencies down, so it
>>> is ultimately real-time friendly, and it should also make things work
>>> better for desktop-ish and audio-ish stuff as well. (even under
>>> SCHED_OTHER)
>>>
>>
>> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
>> list):
>>
>> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
>>
>>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
>>> on my desktop but on my laptop it makes Firefox and Tomboy to crash.
>>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
>>> problem.
>>>
>> I managed to completely hang firefox (fc7) with flash 9 installed
>> (unkillable even with -9).
>
> Firefox with flash 9 does not work good , there are a lot bugs reported
> about ( just google ) and it hangs on vanilla or whatever other kernels
> as well. Not only Firefox but also Swiftfox, Opera, Epiphany etc.
>
> The most time Firefox dies when you use flash 9 and close a window or a
> tab.
More tests...
The problem is the rt kernel AFAICT, this goes beyond Flash 9, way
beyond:
_OpenOffice_ hangs with 2.6.21.5-rt20, works fine with stock Fedora 7
kernel. Flash 9 hangs with 2.6.21.5-rt20, works fine with the stock Fedora
7 kernel. Same machine booting different kernels, I'd say it is the
kernel.
The only way out for a hung app is a reboot.
Ingo: what would be a good way to trace this? It makes the rt kernels not
very usable at least on this hardware (more tests tomorrow in the CCRMA
machines).
Same on 2.6.21.5-rt18 with CONFIG_NO_HZ not set.
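For anyone wanting to double check the same setting on their own box, here is a small sketch that reads a kernel config file and reports one symbol. The helper name config_value() and the sample text are made up for illustration. Where the config lives depends on the distribution; /boot/config-<release> is a common location, and /proc/config.gz exists only if the kernel was built with CONFIG_IKCONFIG_PROC.

```python
def config_value(config_text, symbol):
    """Return the value of a kernel config symbol, or None if it is not set."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {symbol} is not set":
            return None
        if line.startswith(symbol + "="):
            return line.split("=", 1)[1]
    return None  # symbol not mentioned at all


# Made-up sample in the usual .config format, not taken from a real build.
SAMPLE = """\
CONFIG_SMP=y
# CONFIG_NO_HZ is not set
CONFIG_HZ=1000
"""

print("CONFIG_NO_HZ:", config_value(SAMPLE, "CONFIG_NO_HZ"))
print("CONFIG_HZ:", config_value(SAMPLE, "CONFIG_HZ"))
```

To use it on a real system, pass the contents of the config file instead of SAMPLE.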
-- Fernando
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-09 3:53 ` v2.6.21.5-rt19 Fernando Pablo Lopez-Lezcano
@ 2007-07-09 5:08 ` Fernando Lopez-Lezcano
2007-07-17 19:32 ` Ingo Molnar
0 siblings, 1 reply; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-09 5:08 UTC (permalink / raw)
To: Gabriel C
Cc: nando, Carsten Emde, jcaceres@ccrma.Stanford.EDU, Steven Rostedt,
RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela, Ingo Molnar
On Sun, 2007-07-08 at 20:53 -0700, Fernando Pablo Lopez-Lezcano wrote:
> On Mon, 9 Jul 2007, Gabriel C wrote:
> > Fernando Lopez-Lezcano wrote:
> >> On Sat, 2007-07-07 at 11:24 +0200, Ingo Molnar wrote:
> >>> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> >>>>>> Changes since 2.6.21.5-rt18:
> >>>>>> - Fixed a nasty and hard to track down slowness / boot problem on SMP
> >>>>>> machines with CONFIG_NOHZ enabled. The problem was caused by the timer
> >>>>>> wheel base lock held during the get_next_timer_interrupt() call in the
> >>>>>> idle path, which eventually led to a bogus PI boosting of the idle task
> >>>>>> and in consequence a stale wrong scheduler selection for the affected
> >>>>>> idle
> >>>>>> task.
> >>>>>>
> >>>>>> Kudos to Carsten Emde, who patiently and meticulously isolated the
> >>>>>> problem and provided the traces, which allowed to identify the root
> >>>>>> cause.
> >>>>>>
> >>>>>> Problem solution: Prevent idle task boosting
> >>>>>>
> >>>>> Maybe someone remember me whining about troubles with 2.6.21-rt2..18 on
> >>>>> my Core2 T7200 laptop (fujitsu-siemens amilo i1520).
> >>>>>
> >>>>> Althought I'm still with my fingers crossed, I can tell the good news
> >>>>> are that 2.6.21.5-rt19 (and -rt20) does behave far better now on the
> >>>>> very same box.
> >>>>>
> >>>> Yes, it works much better indeed...
> >>>>
> >>>> Ingo: is there a place where I can read about the changes in different
> >>>> rtxx releases? What is new/better/fixed in rt20? (I see scheduler stuff
> >>>> in a diff from rt19 to rt20 but I don't really know what it means).
> >>>>
> >>> and rt18 was a -rt-only NOHZ fix, that bug got introduced in rt11 when CFS
> >>> was merged.
> >>>
> >>> i _think_ Rui might have seen two separate problems. Perhaps by the time
> >>> we fixed the first problem (which Rui saw since -rt2) we introduced the
> >>> other one via -rt11 - which then got fixed in -rt19.
> >>
> >> Ahh, CFS is now part of rt, I was obviously not paying attention... I'm
> >> really trying to provide a "stable" rt kernel for audio usage and
> >> including another subsystem into rt is - IMHO - not going to help.
> >> What's the chance of splitting things?
> >>
> >>> btw., we'd love to get more feedback regarding CFS. CFS is a completely
> >>> new scheduler for Linux.
> >>
> >> Then I'd rather have it separate from rt.
> >>
> >>> It has a design centered around keeping application latencies down, so it
> >>> is ultimately real-time friendly, and it should also make things work
> >>> better for desktop-ish and audio-ish stuff as well. (even under
> >>> SCHED_OTHER)
> >>>
> >>
> >> Maybe this is CFS related? (tail of a thread in the Planet CCRMA mailing
> >> list):
> >>
> >> On Sun, 2007-07-08 at 15:26 -0400, Hector Centeno wrote:
> >>
> >>> Ok, so just to confirm, that 2.6.21-0182.rt19.1.fc7.ccrmart works fine
> >>> on my desktop but on my laptop it makes Firefox and Tomboy to crash.
> >>> On the same laptop using 2.6.21-0182.rt17.1.fc7.ccrmart there is no
> >>> problem.
> >>>
> >> I managed to completely hang firefox (fc7) with flash 9 installed
> >> (unkillable even with -9).
> >
> > Firefox with flash 9 does not work well; there are a lot of bugs reported
> > about it (just google) and it hangs on vanilla or any other kernel
> > as well. Not only Firefox but also Swiftfox, Opera, Epiphany etc.
> >
> > Most of the time Firefox dies when you use flash 9 and close a window or a
> > tab.
>
> More tests...
>
> The problem is the rt kernel AFAICT, this goes beyond Flash 9, way
> beyond:
>
> _OpenOffice_ hangs with 2.6.21.5-rt20, works fine with stock Fedora 7
> kernel. Flash 9 hangs with 2.6.21.5-rt20, works fine with the stock Fedora
> 7 kernel. Same machine booting different kernels, I'd say it is the
> kernel.
>
> The only way out for a hung app is a reboot.
>
> Ingo: what would be a good way to trace this? It makes the rt kernels not
> very usable at least on this hardware (more tests tomorrow in the CCRMA
> machines).
>
> Same on 2.6.21.5-rt18 with CONFIG_NO_HZ not set.
I forgot to include the output of strace... and of course now I can't
repeat the openoffice hang.
I do get flash 9 (I know, not the best example) and tomboy to hang as
reported by one of my Planet CCRMA users - flash 9 tested working on
stock fedora 7 kernel - and both seem to hang in the same system call:
sched_getaffinity(3528, 32, <unfinished ...>
Full output of strace attached for both cases.
Hopefully this will make the bug immediately obvious to someone :-)
[running on a laptop with the 7700 Intel cpu]
-- Fernando
[-- Attachment #2: firefox.trace.gz --]
[-- Type: application/x-gzip, Size: 155285 bytes --]
[-- Attachment #3: tomboy.trace.gz --]
[-- Type: application/x-gzip, Size: 2960 bytes --]
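The call both traces stall in can be exercised from user space. The following is a minimal sketch and not part of the original report: it assumes Linux and Python 3.7+, and probe_affinity is a hypothetical helper name. It issues sched_getaffinity from a child process and gives up after a timeout, so the probing shell does not end up stuck as well.

```python
import subprocess
import sys

# Child code: perform the same system call strace showed, then print the CPU ids.
_CHILD = "import os, sys; print(*sorted(os.sched_getaffinity(int(sys.argv[1]))))"


def probe_affinity(pid=0, timeout=5.0):
    """Return the CPUs pid may run on, or None if the call fails or never returns.

    pid 0 means "the calling task", which here is the child process itself.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD, str(pid)],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = child.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # A task stuck in D state ignores SIGKILL, so do not wait for it here.
        return None
    if child.returncode != 0:
        return None
    return [int(cpu) for cpu in out.split()]


if __name__ == "__main__":
    cpus = probe_affinity()
    print("call did not return" if cpus is None else f"allowed CPUs: {cpus}")
```

On a healthy kernel this prints the allowed CPU list almost immediately; a None result only says the child failed or timed out, it does not identify the lock involved.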
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-09 5:08 ` v2.6.21.5-rt19 (sched_getaffinity?) Fernando Lopez-Lezcano
@ 2007-07-17 19:32 ` Ingo Molnar
2007-07-17 19:47 ` Fernando Lopez-Lezcano
2007-07-17 19:56 ` Fernando Lopez-Lezcano
0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-17 19:32 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela
* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> I do get flash 9 (I know, not the best example) and tomboy to hang as
> reported by one of my Planet CCRMA users - flash 9 tested working on
> stock fedora 7 kernel - and both seem to hang in the same system call:
>
> sched_getaffinity(3528, 32, <unfinished ...>
>
> Full output of strace attached for both cases.
hm, that's weird. Is it completely unkillable at that time? Could you do
a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
a full task state dump via:
echo t > /proc/sysrq-trigger
thanks,
Ingo
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 19:32 ` Ingo Molnar
@ 2007-07-17 19:47 ` Fernando Lopez-Lezcano
2007-07-17 19:56 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 19:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela
On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > reported by one of my Planet CCRMA users - flash 9 tested working on
> > stock fedora 7 kernel - and both seem to hang in the same system call:
> >
> > sched_getaffinity(3528, 32, <unfinished ...>
> >
> > Full output of strace attached for both cases.
>
> hm, that's weird. Is it completely unkillable at that time? Could you do
> a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> a full task state dump via:
>
> echo t > /proc/sysrq-trigger
>
> thanks,
kill -9 does nothing. If there's another way to kill something let me
know :-) I'll try to get the dump asap.
Hope you had a good time over the long weekend, you certainly deserve
some rest (and congrats on the scheduler inclusion in mainline!)
-- Fernando
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 19:32 ` Ingo Molnar
2007-07-17 19:47 ` Fernando Lopez-Lezcano
@ 2007-07-17 19:56 ` Fernando Lopez-Lezcano
2007-07-17 20:12 ` Ingo Molnar
1 sibling, 1 reply; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 19:56 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela
[-- Attachment #1: Type: text/plain, Size: 751 bytes --]
On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > reported by one of my Planet CCRMA users - flash 9 tested working on
> > stock fedora 7 kernel - and both seem to hang in the same system call:
> >
> > sched_getaffinity(3528, 32, <unfinished ...>
> >
> > Full output of strace attached for both cases.
>
> hm, that's weird. Is it completely unkillable at that time? Could you do
> a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> a full task state dump via:
>
> echo t > /proc/sysrq-trigger
Trace attached... the process stays in D state no matter what.
-- Fernando
[-- Attachment #2: trace1.txt.gz --]
[-- Type: application/x-gzip, Size: 21335 bytes --]
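To list which tasks sit in D state without a full sysrq dump, the state field of /proc/<pid>/stat can be read directly. A minimal sketch, assuming Linux procfs and Python 3; the helper names are made up for illustration, and only thread-group leaders are listed (per-thread entries live under /proc/<pid>/task).

```python
import os


def task_state(stat_line):
    """Return (pid, comm, state) parsed from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                pid, comm, state = task_state(stat_file.read())
        except (OSError, ValueError, IndexError):
            # The process exited while we were looking, or the line was odd.
            continue
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A task that shows up here on every run, and that kill -9 does not remove, matches the behaviour described above; short-lived D states during normal disk I/O are expected and harmless.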
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 19:56 ` Fernando Lopez-Lezcano
@ 2007-07-17 20:12 ` Ingo Molnar
2007-07-17 21:41 ` Fernando Lopez-Lezcano
2007-07-17 23:51 ` Fernando Lopez-Lezcano
0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-17 20:12 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela
* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> >
> > > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > > reported by one of my Planet CCRMA users - flash 9 tested working on
> > > stock fedora 7 kernel - and both seem to hang in the same system call:
> > >
> > > sched_getaffinity(3528, 32, <unfinished ...>
> > >
> > > Full output of strace attached for both cases.
> >
> > hm, that's weird. Is it completely unkillable at that time? Could you do
> > a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> > a full task state dump via:
> >
> > echo t > /proc/sysrq-trigger
>
> Trace attached... the process stays in D state no matter what.
hm, seems to be related to:
Jul 17 12:51:18 localhost kernel: sched-powersa D [f0aaf930] 00000005 6584 3420 3407
which blocks the cpu-hotplug mutex:
Jul 17 12:51:18 localhost kernel: Call Trace:
Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
Jul 17 12:51:18 localhost kernel: [<c045a647>] __synchronize_sched+0xd/0x5a
Jul 17 12:51:18 localhost kernel: [<c0423732>] arch_reinit_sched_domains+0x18/0x33
Jul 17 12:51:18 localhost kernel: [<c0423789>] sched_power_savings_store+0x3c/0x49
Jul 17 12:51:18 localhost kernel: [<c0552cd4>] sysdev_class_store+0x1e/0x22
Jul 17 12:51:18 localhost kernel: [<c04b195b>] sysfs_write_file+0xa3/0xc6
Jul 17 12:51:18 localhost kernel: [<c047a64a>] vfs_write+0xa8/0x154
Jul 17 12:51:18 localhost kernel: [<c047ac65>] sys_write+0x41/0x67
Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
and firefox blocks on the same mutex too:
Jul 17 12:51:18 localhost kernel: firefox-bin D [efc44670] 00000012 6368 4388 1
Jul 17 12:51:18 localhost kernel: Call Trace:
Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
Jul 17 12:51:18 localhost kernel: [<c0423c53>] sys_sched_getaffinity+0x1f/0x41
Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
Jul 17 12:51:18 localhost kernel: [<b7f0f410>] 0xb7f0f410
does lockdep pinpoint anything?
Ingo
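Picking the tasks that share a frame out of a long sysrq-t dump can be scripted. The sketch below assumes the syslog layout seen in the lines quoted above (a task header line, then "[<address>] symbol+0xoff/0xlen" frames); the regular expressions and function names are illustrative guesses rather than a general dump parser, and tasks with identical names overwrite each other.

```python
import re

# Task header, e.g. "... kernel: firefox-bin D [efc44670] 00000012 6368 4388 1"
HEADER = re.compile(r"kernel:\s+(\S+)\s+([A-Z])\s+\[[0-9a-f]+\]")
# Stack frame, e.g. "... kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f"
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+([A-Za-z_]\w*)\+0x")

# Abridged from the dump quoted in this thread.
SAMPLE = """\
Jul 17 12:51:18 localhost kernel: sched-powersa D [f0aaf930] 00000005 6584 3420 3407
Jul 17 12:51:18 localhost kernel: Call Trace:
Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
Jul 17 12:51:18 localhost kernel: [<c045a647>] __synchronize_sched+0xd/0x5a
Jul 17 12:51:18 localhost kernel: [<c047ac65>] sys_write+0x41/0x67
Jul 17 12:51:18 localhost kernel: firefox-bin D [efc44670] 00000012 6368 4388 1
Jul 17 12:51:18 localhost kernel: Call Trace:
Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
Jul 17 12:51:18 localhost kernel: [<c0423c53>] sys_sched_getaffinity+0x1f/0x41
Jul 17 12:51:18 localhost kernel: [<b7f0f410>] 0xb7f0f410
""".splitlines()


def traces_by_task(lines):
    """Map each task name to the list of symbols in its call trace."""
    traces = {}
    current = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            current = header.group(1)
            traces[current] = []
            continue
        frame = FRAME.search(line)
        if frame and current is not None:
            traces[current].append(frame.group(1))
    return traces


def tasks_in(lines, symbol):
    """Return the tasks whose trace contains symbol, in order of appearance."""
    return [task for task, frames in traces_by_task(lines).items() if symbol in frames]


if __name__ == "__main__":
    print(tasks_in(SAMPLE, "sched_getaffinity"))
```

Run against the full dump, a query for rt_mutex_lock or sched_getaffinity narrows the list to the tasks worth reading by hand.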
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 20:12 ` Ingo Molnar
@ 2007-07-17 21:41 ` Fernando Lopez-Lezcano
2007-07-17 23:51 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 21:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela,
nando
[-- Attachment #1: Type: text/plain, Size: 3044 bytes --]
On Tue, 2007-07-17 at 22:12 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> > > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > >
> > > > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > > > reported by one of my Planet CCRMA users - flash 9 tested working on
> > > > stock fedora 7 kernel - and both seem to hang in the same system call:
> > > >
> > > > sched_getaffinity(3528, 32, <unfinished ...>
> > > >
> > > > Full output of strace attached for both cases.
> > >
> > > hm, that's weird. Is it completely unkillable at that time? Could you do
> > > a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> > > a full task state dump via:
> > >
> > > echo t > /proc/sysrq-trigger
> >
> > Trace attached... the process stays in D state no matter what.
Just in case: it also reproduces under 2.6.22.1-rt4 (versions before rt4
did not boot on my T61 laptop; this one at least does). I'm including the
(probably redundant) dump.
I have to build a new kernel with prove locking...
-- Fernando
> hm, seems to be related to:
>
> Jul 17 12:51:18 localhost kernel: sched-powersa D [f0aaf930] 00000005 6584 3420 3407
>
> which blocks the cpu-hotplug mutex:
>
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c045a647>] __synchronize_sched+0xd/0x5a
> Jul 17 12:51:18 localhost kernel: [<c0423732>] arch_reinit_sched_domains+0x18/0x33
> Jul 17 12:51:18 localhost kernel: [<c0423789>] sched_power_savings_store+0x3c/0x49
> Jul 17 12:51:18 localhost kernel: [<c0552cd4>] sysdev_class_store+0x1e/0x22
> Jul 17 12:51:18 localhost kernel: [<c04b195b>] sysfs_write_file+0xa3/0xc6
> Jul 17 12:51:18 localhost kernel: [<c047a64a>] vfs_write+0xa8/0x154
> Jul 17 12:51:18 localhost kernel: [<c047ac65>] sys_write+0x41/0x67
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
>
> and firefox blocks on the same mutex too:
>
> Jul 17 12:51:18 localhost kernel: firefox-bin D [efc44670] 00000012 6368 4388 1
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c0423c53>] sys_sched_getaffinity+0x1f/0x41
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
> Jul 17 12:51:18 localhost kernel: [<b7f0f410>] 0xb7f0f410
>
> does lockdep pinpoint anything?
>
> Ingo
[-- Attachment #2: trace2.txt.gz --]
[-- Type: application/x-gzip, Size: 21875 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 20:12 ` Ingo Molnar
2007-07-17 21:41 ` Fernando Lopez-Lezcano
@ 2007-07-17 23:51 ` Fernando Lopez-Lezcano
2007-07-18 7:18 ` Ingo Molnar
1 sibling, 1 reply; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-17 23:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela
[-- Attachment #1: Type: text/plain, Size: 2997 bytes --]
On Tue, 2007-07-17 at 22:12 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > On Tue, 2007-07-17 at 21:32 +0200, Ingo Molnar wrote:
> > > * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > >
> > > > I do get flash 9 (I know, not the best example) and tomboy to hang as
> > > > reported by one of my Planet CCRMA users - flash 9 tested working on
> > > > stock fedora 7 kernel - and both seem to hang in the same system call:
> > > >
> > > > sched_getaffinity(3528, 32, <unfinished ...>
> > > >
> > > > Full output of strace attached for both cases.
> > >
> > > hm, that's weird. Is it completely unkillable at that time? Could you do
> > > a few things: enable CONFIG_PROVE_LOCKING (lockdep), and also try to get
> > > a full task state dump via:
> > >
> > > echo t > /proc/sysrq-trigger
> >
> > Trace attached... the process stays in D state no matter what.
>
> hm, seems to be related to:
>
> Jul 17 12:51:18 localhost kernel: sched-powersa D [f0aaf930] 00000005 6584 3420 3407
>
> which blocks the cpu-hotplug mutex:
>
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c045a647>] __synchronize_sched+0xd/0x5a
> Jul 17 12:51:18 localhost kernel: [<c0423732>] arch_reinit_sched_domains+0x18/0x33
> Jul 17 12:51:18 localhost kernel: [<c0423789>] sched_power_savings_store+0x3c/0x49
> Jul 17 12:51:18 localhost kernel: [<c0552cd4>] sysdev_class_store+0x1e/0x22
> Jul 17 12:51:18 localhost kernel: [<c04b195b>] sysfs_write_file+0xa3/0xc6
> Jul 17 12:51:18 localhost kernel: [<c047a64a>] vfs_write+0xa8/0x154
> Jul 17 12:51:18 localhost kernel: [<c047ac65>] sys_write+0x41/0x67
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
>
> and firefox blocks on the same mutex too:
>
> Jul 17 12:51:18 localhost kernel: firefox-bin D [efc44670] 00000012 6368 4388 1
> Jul 17 12:51:18 localhost kernel: Call Trace:
> Jul 17 12:51:18 localhost kernel: [<c0603f46>] schedule+0xe0/0xfa
> Jul 17 12:51:18 localhost kernel: [<c0604d0d>] rt_mutex_slowlock+0x164/0x20b
> Jul 17 12:51:18 localhost kernel: [<c0604a5c>] rt_mutex_lock+0x3c/0x3f
> Jul 17 12:51:18 localhost kernel: [<c0423bb4>] sched_getaffinity+0x14/0x94
> Jul 17 12:51:18 localhost kernel: [<c0423c53>] sys_sched_getaffinity+0x1f/0x41
> Jul 17 12:51:18 localhost kernel: [<c0404f7c>] syscall_call+0x7/0xb
> Jul 17 12:51:18 localhost kernel: [<b7f0f410>] 0xb7f0f410
>
> does lockdep pinpoint anything?
Lots of stuff, and at the end the lock report for the problem. Hopefully
some of this will help... I have attached the whole bootup sequence as
logged in /var/log/messages.
-- Fernando
[-- Attachment #2: trace3.txt.gz --]
[-- Type: application/x-gzip, Size: 12462 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-17 23:51 ` Fernando Lopez-Lezcano
@ 2007-07-18 7:18 ` Ingo Molnar
2007-07-18 13:21 ` Paul E. McKenney
2007-07-18 18:02 ` Fernando Lopez-Lezcano
0 siblings, 2 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-07-18 7:18 UTC (permalink / raw)
To: Fernando Lopez-Lezcano
Cc: Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela,
Paul E. McKenney
* Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
> > does lockdep pinpoint anything?
>
> Lots of stuff, and at the end the lock report for the problem.
> Hopefully some of this will help... I have attached the whole bootup
> sequence as logged in /var/log/messages.
yeah, it pinpointed the bug. It seems to be an interaction between
RCU-preempt (Paul Cc:-ed) and sched_mc_power_savings_store():
detach_destroy_domains() uses synchronize_sched() which uses
getaffinity, which takes sched_hotcpu_mutex, and
arch_reinit_sched_domains does it too - see the lockdep report below.
I've added a quick workaround below as well, which should keep your box
from hanging.
Ingo
=============================================
[ INFO: possible recursive locking detected ]
[ 2.6.22-0182.rt4.3.fc7.ccrmart #1
---------------------------------------------
sched-powersave/3251 is trying to acquire lock:
(sched_hotcpu_mutex){--..}, at: [<c0424a37>] sched_getaffinity+0x14/0x94
but task is already holding lock:
(sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33
other info that might help us debug this:
1 lock held by sched-powersave/3251:
#0: (sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33
stack backtrace:
[<c040600c>] show_trace_log_lvl+0x1a/0x2f
[<c0406ae8>] show_trace+0x12/0x14
[<c0406b50>] dump_stack+0x16/0x18
[<c0446f46>] __lock_acquire+0x172/0xb67
[<c0447d03>] lock_acquire+0x56/0x6f
[<c061d414>] _mutex_lock+0x2b/0x38
[<c0424a37>] sched_getaffinity+0x14/0x94
[<c0460841>] __synchronize_sched+0x11/0x5f
[<c0423fa8>] detach_destroy_domains+0x2c/0x30
[<c04245af>] arch_reinit_sched_domains+0x18/0x33
[<c0424606>] sched_power_savings_store+0x3c/0x49
[<c0424634>] sched_mc_power_savings_store+0xe/0x10
[<c0561f11>] sysdev_class_store+0x20/0x25
[<c04bbc6c>] sysfs_write_file+0xaf/0xd0
[<c048183c>] vfs_write+0xaf/0x163
[<c0481e8a>] sys_write+0x3d/0x61
[<c040501a>] syscall_call+0x7/0xb
=======================
thinkpad_acpi: ThinkPad ACPI Extras v0.14
--------------------->
Index: linux-rt.q/kernel/sched.c
===================================================================
--- linux-rt.q.orig/kernel/sched.c
+++ linux-rt.q/kernel/sched.c
@@ -6699,7 +6699,6 @@ static void detach_destroy_domains(const
for_each_cpu_mask(i, *cpu_map)
cpu_attach_domain(NULL, i);
- synchronize_sched();
arch_destroy_sched_domains(cpu_map);
}
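The recursion that lockdep reports can be modelled in a few lines. This is a user-space sketch only: threading.Lock stands in for sched_hotcpu_mutex, the *_model functions are hypothetical stand-ins for the kernel functions named in the backtrace, and the inner acquisition is non-blocking so the demo reports the conflict instead of hanging the way the kernel path does.

```python
import threading

# Non-recursive lock, standing in for sched_hotcpu_mutex.
sched_hotcpu_mutex = threading.Lock()


def sched_getaffinity_model():
    """Try to take the lock; return False if it is already held."""
    got_it = sched_hotcpu_mutex.acquire(blocking=False)
    if got_it:
        sched_hotcpu_mutex.release()
    return got_it


def synchronize_sched_model():
    # Per the backtrace, __synchronize_sched() ends up in sched_getaffinity().
    return sched_getaffinity_model()


def detach_destroy_domains_model(call_synchronize_sched):
    if call_synchronize_sched:
        return synchronize_sched_model()
    return True


def arch_reinit_sched_domains_model(call_synchronize_sched=True):
    """Return True if the call chain completes, False if it would self-deadlock."""
    with sched_hotcpu_mutex:  # the outer acquisition
        return detach_destroy_domains_model(call_synchronize_sched)


if __name__ == "__main__":
    print("with synchronize_sched():   ", arch_reinit_sched_domains_model(True))
    print("without synchronize_sched():", arch_reinit_sched_domains_model(False))
```

The second call mirrors the workaround in the patch: with synchronize_sched() dropped from detach_destroy_domains(), the lock is taken only once along this path. Whether skipping the grace period there is safe in general is a separate question the model says nothing about.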
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-18 7:18 ` Ingo Molnar
@ 2007-07-18 13:21 ` Paul E. McKenney
2007-07-18 18:02 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Paul E. McKenney @ 2007-07-18 13:21 UTC (permalink / raw)
To: Ingo Molnar
Cc: Fernando Lopez-Lezcano, Gabriel C, Carsten Emde,
jcaceres@ccrma.Stanford.EDU, Steven Rostedt, RT-Users, LKML,
Thomas Gleixner, Rui Nuno Capela
On Wed, Jul 18, 2007 at 09:18:52AM +0200, Ingo Molnar wrote:
>
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > > does lockdep pinpoint anything?
> >
> > Lots of stuff, and at the end the lock report for the problem.
> > Hopefully some of this will help... I have attached the whole bootup
> > sequence as logged in /var/log/messages.
>
> yeah, it pinpointed the bug. It seems to be an interaction between
> RCU-preempt (Paul Cc:-ed) and sched_mc_power_savings_store():
> detach_destroy_domains() uses synchronize_sched() which uses
> getaffinity, which takes sched_hotcpu_mutex, and
> arch_reinit_sched_domains does it too - see the lockdep report below.
> I've added a quick workaround below as well, which should keep your box
> from hanging.
Interesting. The "right" way to do this seems to be to put both "classic"
and "realtime" RCU into the kernel. The "classic" RCU would be there
to support synchronize_sched() without calling getaffinity(), while
the "realtime" RCU would be there for the standard RCU API.
I will work this into my -mm efforts.
Thanx, Paul
> =============================================
> [ INFO: possible recursive locking detected ]
> [ 2.6.22-0182.rt4.3.fc7.ccrmart #1
> ---------------------------------------------
> sched-powersave/3251 is trying to acquire lock:
> (sched_hotcpu_mutex){--..}, at: [<c0424a37>] sched_getaffinity+0x14/0x94
>
> but task is already holding lock:
> (sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33
>
> other info that might help us debug this:
> 1 lock held by sched-powersave/3251:
> #0: (sched_hotcpu_mutex){--..}, at: [<c04245a5>] arch_reinit_sched_domains+0xe/0x33
>
> stack backtrace:
> [<c040600c>] show_trace_log_lvl+0x1a/0x2f
> [<c0406ae8>] show_trace+0x12/0x14
> [<c0406b50>] dump_stack+0x16/0x18
> [<c0446f46>] __lock_acquire+0x172/0xb67
> [<c0447d03>] lock_acquire+0x56/0x6f
> [<c061d414>] _mutex_lock+0x2b/0x38
> [<c0424a37>] sched_getaffinity+0x14/0x94
> [<c0460841>] __synchronize_sched+0x11/0x5f
> [<c0423fa8>] detach_destroy_domains+0x2c/0x30
> [<c04245af>] arch_reinit_sched_domains+0x18/0x33
> [<c0424606>] sched_power_savings_store+0x3c/0x49
> [<c0424634>] sched_mc_power_savings_store+0xe/0x10
> [<c0561f11>] sysdev_class_store+0x20/0x25
> [<c04bbc6c>] sysfs_write_file+0xaf/0xd0
> [<c048183c>] vfs_write+0xaf/0x163
> [<c0481e8a>] sys_write+0x3d/0x61
> [<c040501a>] syscall_call+0x7/0xb
> =======================
> thinkpad_acpi: ThinkPad ACPI Extras v0.14
>
> --------------------->
> Index: linux-rt.q/kernel/sched.c
> ===================================================================
> --- linux-rt.q.orig/kernel/sched.c
> +++ linux-rt.q/kernel/sched.c
> @@ -6699,7 +6699,6 @@ static void detach_destroy_domains(const
>
> for_each_cpu_mask(i, *cpu_map)
> cpu_attach_domain(NULL, i);
> - synchronize_sched();
> arch_destroy_sched_domains(cpu_map);
> }
>
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: v2.6.21.5-rt19 (sched_getaffinity?)
2007-07-18 7:18 ` Ingo Molnar
2007-07-18 13:21 ` Paul E. McKenney
@ 2007-07-18 18:02 ` Fernando Lopez-Lezcano
1 sibling, 0 replies; 19+ messages in thread
From: Fernando Lopez-Lezcano @ 2007-07-18 18:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: nando, Gabriel C, Carsten Emde, jcaceres@ccrma.Stanford.EDU,
Steven Rostedt, RT-Users, LKML, Thomas Gleixner, Rui Nuno Capela,
Paul E. McKenney
On Wed, 2007-07-18 at 09:18 +0200, Ingo Molnar wrote:
> * Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> wrote:
>
> > > does lockdep pinpoint anything?
> >
> > Lots of stuff, and at the end the lock report for the problem.
> > Hopefully some of this will help... I have attached the whole bootup
> > sequence as logged in /var/log/messages.
>
> yeah, it pinpointed the bug. It seems to be an interaction between
> RCU-preempt (Paul Cc:-ed) and sched_mc_power_savings_store():
> detach_destroy_domains() uses synchronize_sched() which uses
> getaffinity, which takes sched_hotcpu_mutex, and
> arch_reinit_sched_domains does it too - see the lockdep report below.
> I've added a quick workaround below as well, which should keep your box
> from hanging.
I can confirm that flash9 does not hang with the patch.
Thanks!!!
I presume the same would apply to 2.6.21.x and, say, rt21. I'll test.
But (of course, there's always a but somewhere) I just experienced a
complete hang - 2.6.22.1-rt4 with the little patch. This time there was
something in the logs, maybe it will help? This was when finishing the
install of an additional kernel module rpm package (ipw3945 drivers).
-- Fernando
--------
Jul 18 10:48:15 localhost kernel: BUG: sleeping function called from invalid context modprobe(5001) at kernel/rtmutex.c:636
Jul 18 10:48:15 localhost kernel: in_atomic():1 [00000001], irqs_disabled():0
Jul 18 10:48:15 localhost kernel: [<c0405f34>] show_trace_log_lvl+0x1a/0x2f
Jul 18 10:48:15 localhost kernel: [<c0406a09>] show_trace+0x12/0x14
Jul 18 10:48:15 localhost kernel: [<c0406a71>] dump_stack+0x16/0x18
Jul 18 10:48:15 localhost kernel: [<c0423bfc>] __might_sleep+0xeb/0xf2
Jul 18 10:48:15 localhost kernel: [<c0617242>] __rt_spin_lock+0x24/0x40
Jul 18 10:48:15 localhost kernel: [<c0617266>] rt_spin_lock+0x8/0xa
Jul 18 10:48:15 localhost kernel: [<c04621c9>] get_zone_pcp+0x23/0x33
Jul 18 10:48:15 localhost kernel: [<c0462702>] free_hot_cold_page+0xcf/0x148
Jul 18 10:48:15 localhost kernel: [<c04627b2>] free_hot_page+0xa/0xc
Jul 18 10:48:15 localhost kernel: [<c0462a82>] __free_pages+0x25/0x30
Jul 18 10:48:15 localhost kernel: [<c0462ab6>] free_pages+0x29/0x2b
Jul 18 10:48:15 localhost kernel: [<c047abf3>] quicklist_trim+0xd0/0xf5
Jul 18 10:48:15 localhost kernel: [<c041f5d9>] check_pgt_cache+0x1e/0x20
Jul 18 10:48:15 localhost kernel: [<c046aedf>] free_pgtables+0x52/0x147
Jul 18 10:48:15 localhost kernel: [<c046cdf7>] unmap_region+0xe6/0x135
Jul 18 10:48:15 localhost kernel: [<c046d764>] do_munmap+0x153/0x1b4
Jul 18 10:48:15 localhost kernel: [<c046f3de>] do_mremap+0x413/0x4c3
Jul 18 10:48:15 localhost kernel: [<c046f4c4>] sys_mremap+0x36/0x56
Jul 18 10:48:15 localhost kernel: [<c0404fca>] syscall_call+0x7/0xb
Jul 18 10:48:15 localhost kernel: =======================
Jul 18 10:48:16 localhost kernel: BUG: sleeping function called from invalid context head(5652) at kernel/rtmutex.c:636
Jul 18 10:48:16 localhost kernel: in_atomic():1 [00000001], irqs_disabled():0
Jul 18 10:48:16 localhost kernel: [<c0405f34>] show_trace_log_lvl+0x1a/0x2f
Jul 18 10:48:16 localhost kernel: [<c0406a09>] show_trace+0x12/0x14
Jul 18 10:48:16 localhost kernel: [<c0406a71>] dump_stack+0x16/0x18
Jul 18 10:48:16 localhost kernel: [<c0423bfc>] __might_sleep+0xeb/0xf2
Jul 18 10:48:16 localhost kernel: [<c0617242>] __rt_spin_lock+0x24/0x40
Jul 18 10:48:16 localhost kernel: [<c0617266>] rt_spin_lock+0x8/0xa
Jul 18 10:48:16 localhost kernel: [<c04621c9>] get_zone_pcp+0x23/0x33
Jul 18 10:48:16 localhost kernel: [<c0462702>] free_hot_cold_page+0xcf/0x148
Jul 18 10:48:16 localhost kernel: [<c04627b2>] free_hot_page+0xa/0xc
Jul 18 10:48:16 localhost kernel: [<c0462a82>] __free_pages+0x25/0x30
Jul 18 10:48:16 localhost kernel: [<c0462ab6>] free_pages+0x29/0x2b
Jul 18 10:48:16 localhost kernel: [<c047abf3>] quicklist_trim+0xd0/0xf5
Jul 18 10:48:16 localhost kernel: [<c041f5d9>] check_pgt_cache+0x1e/0x20
Jul 18 10:48:16 localhost kernel: [<c046aedf>] free_pgtables+0x52/0x147
Jul 18 10:48:16 localhost kernel: [<c046cdf7>] unmap_region+0xe6/0x135
Jul 18 10:48:16 localhost kernel: [<c046d764>] do_munmap+0x153/0x1b4
Jul 18 10:48:16 localhost kernel: [<c046d7f5>] sys_munmap+0x30/0x3f
Jul 18 10:48:16 localhost kernel: [<c0404fca>] syscall_call+0x7/0xb
Jul 18 10:48:16 localhost kernel: =======================
Jul 18 10:50:22 localhost syslogd 1.4.2: restart.
^ permalink raw reply [flat|nested] 19+ messages in thread
Thread overview: 19+ messages
2007-07-04 20:49 v2.6.21.5-rt19 Thomas Gleixner
2007-07-06 14:10 ` v2.6.21.5-rt19 Rui Nuno Capela
2007-07-06 21:49 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-07 9:15 ` v2.6.21.5-rt19 Ingo Molnar
2007-07-07 9:24 ` v2.6.21.5-rt19 Ingo Molnar
2007-07-08 22:36 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-08 22:50 ` v2.6.21.5-rt19 Fernando Lopez-Lezcano
2007-07-08 23:42 ` v2.6.21.5-rt19 Gabriel C
2007-07-09 3:53 ` v2.6.21.5-rt19 Fernando Pablo Lopez-Lezcano
2007-07-09 5:08 ` v2.6.21.5-rt19 (sched_getaffinity?) Fernando Lopez-Lezcano
2007-07-17 19:32 ` Ingo Molnar
2007-07-17 19:47 ` Fernando Lopez-Lezcano
2007-07-17 19:56 ` Fernando Lopez-Lezcano
2007-07-17 20:12 ` Ingo Molnar
2007-07-17 21:41 ` Fernando Lopez-Lezcano
2007-07-17 23:51 ` Fernando Lopez-Lezcano
2007-07-18 7:18 ` Ingo Molnar
2007-07-18 13:21 ` Paul E. McKenney
2007-07-18 18:02 ` Fernando Lopez-Lezcano