Random 3 to 4.5 second blocks of RT thread

linux-rt-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Random 3 to 4.5 second blocks of RT thread
@ 2013-02-11 15:54 Ralf Müller
  2013-02-11 17:08 ` Bruce Ashfield
  2013-02-11 18:55 ` Stanislav Meduna
  0 siblings, 2 replies; 7+ messages in thread
From: Ralf Müller @ 2013-02-11 15:54 UTC (permalink / raw)
  To: Realtime Mailinglist; +Cc: Thomas Weisser

Hi.

I've got a problem with a box at a customer, where about once a day (overall 20 events in 16 days) a realtime thread blocks for 3 to 4.5 seconds. This threads only job is to define a kind of time normal for other subsystems. It basically is a loop that wakes up every milli second, increments a generation counter, does some statistics and goes to sleep for another millisecond. It does this quite well - except for these random multi second sleeps.

My problem: this only happens on a customer machine (actually two different machines at different places, both with very limited access). I tried to reproduce the bug on two test systems here, which are configured as close to the customer system as possible for me (bios settings, kernel configuration, ... about the same load situation) but I was not able to get such a block for more then 3 month (I get jitters of less then 30 micro seconds what is quite ok for the use case).

The thread is configured to run with SCHED_RR priority 99 via pthread_setschedparam(). There are no other user threads in the system with such a high priority. There is no swap file, memory is locked via mlockall. There are two 24/7 disks configured to never got to suspend, running a md-RAID 1. The thread itself does not do anything time consuming, it does not do anything that can block. All these systems run on an old OpenSUSE 11.2 with kernel 2.6.33.2-jen97-rt. A maybe relevant kernel option set, is "processor.max_cstate=0". I know it's not just a time warp on the system, it actually blocks that long because a connected system reports a communication error.

My next try would be to change the board of a customer system by one of the boards of the test systems and hope the error moves to me. As this is a bit expensive and I'm not sure it will really help I would like to ask if someone here can give any hint how to debug such a problem or has any idea what in a system blocks for 3 to 4.5 seconds and how I can avoid such blocks.

I know the problem is related to old software and the description is at least a bit vague. My real hope is that someone says: "yes, exactly the same problem I had the day before and I solved it by setting this or that option". Hope dies last.

Best regards and thanks in advance
Ralf Müller--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Random 3 to 4.5 second blocks of RT thread
  2013-02-11 15:54 Random 3 to 4.5 second blocks of RT thread Ralf Müller
@ 2013-02-11 17:08 ` Bruce Ashfield
  2013-02-12 12:53   ` Ralf Müller
  2013-02-11 18:55 ` Stanislav Meduna
  1 sibling, 1 reply; 7+ messages in thread
From: Bruce Ashfield @ 2013-02-11 17:08 UTC (permalink / raw)
  To: Ralf Müller; +Cc: Realtime Mailinglist, Thomas Weisser

On Mon, Feb 11, 2013 at 10:54 AM, Ralf Müller <ralf@bj-ig.de> wrote:
> Hi.
>
> I've got a problem with a box at a customer, where about once a day (overall 20 events in 16 days) a realtime thread blocks for 3 to 4.5 seconds. This threads only job is to define a kind of time normal for other subsystems. It basically is a loop that wakes up every milli second, increments a generation counter, does some statistics and goes to sleep for another millisecond. It does this quite well - except for these random multi second sleeps.
>
> My problem: this only happens on a customer machine (actually two different machines at different places, both with very limited access). I tried to reproduce the bug on two test systems here, which are configured as close to the customer system as possible for me (bios settings, kernel configuration, ... about the same load situation) but I was not able to get such a block for more then 3 month (I get jitters of less then 30 micro seconds what is quite ok for the use case).
>
> The thread is configured to run with SCHED_RR priority 99 via pthread_setschedparam(). There are no other user threads in the system with such a high priority. There is no swap file, memory is locked via mlockall. There are two 24/7 disks configured to never got to suspend, running a md-RAID 1. The thread itself does not do anything time consuming, it does not do anything that can block. All these systems run on an old OpenSUSE 11.2 with kernel 2.6.33.2-jen97-rt. A maybe relevant kernel option set, is "processor.max_cstate=0". I know it's not just a time warp on the system, it actually blocks that long because a connected system reports a communication error.
>
> My next try would be to change the board of a customer system by one of the boards of the test systems and hope the error moves to me. As this is a bit expensive and I'm not sure it will really help I would like to ask if someone here can give any hint how to debug such a problem or has any idea what in a system blocks for 3 to 4.5 seconds and how I can avoid such blocks.
>
> I know the problem is related to old software and the description is at least a bit vague. My real hope is that someone says: "yes, exactly the same problem I had the day before and I solved it by setting this or that option". Hope dies last.

resending since gmail decide to mix html into my original message.

Since you mentioned BIOS, I assume that you are running on a x86 platform.
I noted that you didn't mention if SMIs have been ruled out as a cause of the
latency issues. You've probably already ruled this out, but I figured I'd ask
anyway :)

Cheers,

Bruce

>
> Best regards and thanks in advance
> Ralf Müller--
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end"
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Random 3 to 4.5 second blocks of RT thread
  2013-02-11 15:54 Random 3 to 4.5 second blocks of RT thread Ralf Müller
  2013-02-11 17:08 ` Bruce Ashfield
@ 2013-02-11 18:55 ` Stanislav Meduna
  2013-02-12 14:01   ` Ralf Müller
  1 sibling, 1 reply; 7+ messages in thread
From: Stanislav Meduna @ 2013-02-11 18:55 UTC (permalink / raw)
  To: Ralf Müller; +Cc: Realtime Mailinglist, Thomas Weisser

On 11.02.2013 16:54, Ralf Müller wrote:

> It basically is a loop that wakes up every milli second, increments a
> generation counter, does some statistics and goes to sleep for
> another millisecond.

3-4.5 seconds is heavy so this is not very probable, but seeing
that you have a thread with prio 99 and have timer wakeup
problems, depending on what the other realtime threads do maybe
the following could be (part of) the issue:

http://www.spinics.net/lists/linux-rt-users/msg08745.html
http://www.spinics.net/lists/linux-rt-users/msg07804.html

Shortly said: you aren't going to have low latency timers
if the ksoftirqd can be preempted by your threads.

Disclaimer: I have never used RT with 2.6, so I don't know
whether it applies at all.

Regards
-- 
                                         Stano

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Random 3 to 4.5 second blocks of RT thread
  2013-02-11 17:08 ` Bruce Ashfield
@ 2013-02-12 12:53   ` Ralf Müller
  2013-02-12 14:28     ` Bruce Ashfield
  0 siblings, 1 reply; 7+ messages in thread
From: Ralf Müller @ 2013-02-12 12:53 UTC (permalink / raw)
  To: Bruce Ashfield; +Cc: Realtime Mailinglist, Thomas Weisser

Am 11.02.2013 um 18:08 schrieb Bruce Ashfield:

> On Mon, Feb 11, 2013 at 10:54 AM, Ralf Müller <ralf@bj-ig.de> wrote:
>> 
>> I've got a problem with a box at a customer, where about once a day (overall 20 events in 16 days) a realtime thread blocks for 3 to 4.5 seconds. ... It basically is a loop that wakes up every milli second ... It does this quite well - except for these random multi second sleeps.
> 
> Since you mentioned BIOS, I assume that you are running on a x86 platform.

It's x86 - yes.

> I noted that you didn't mention if SMIs have been ruled out as a cause of the
> latency issues. You've probably already ruled this out, but I figured I'd ask
> anyway :)

The question is a good one. What I read about SMI when I started this project, said that expected latencies from SMI would be something from some micro- to a maximum of some milliseconds. As my latency constraints are relatively weak - I can perfectly live with 50 microseconds and I would not be happy, but could at least deal with 10 milliseconds every now and then - I did not follow that trail very far. When SMI can really create 3 to 4 seconds blocks I will have to look into that deeper ... anyway - I will have a look at SMI the next days. Thanks a lot for that hint.

BTW: Are there any links to SMI events in a multi second range? What in a system is done within such a long time span? 

Best regards
Ralf

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Random 3 to 4.5 second blocks of RT thread
  2013-02-11 18:55 ` Stanislav Meduna
@ 2013-02-12 14:01   ` Ralf Müller
  0 siblings, 0 replies; 7+ messages in thread
From: Ralf Müller @ 2013-02-12 14:01 UTC (permalink / raw)
  To: Stanislav Meduna; +Cc: Realtime Mailinglist, Thomas Weisser

Am 11.02.2013 um 19:55 schrieb Stanislav Meduna:

> On 11.02.2013 16:54, Ralf Müller wrote:
> 
>> It basically is a loop that wakes up every milli second, increments a
>> generation counter, does some statistics and goes to sleep for
>> another millisecond.
> 
> 3-4.5 seconds is heavy so this is not very probable, but seeing
> that you have a thread with prio 99 and have timer wakeup
> problems, depending on what the other realtime threads do maybe
> the following could be (part of) the issue:
> 
> http://www.spinics.net/lists/linux-rt-users/msg08745.html
> http://www.spinics.net/lists/linux-rt-users/msg07804.html

Thanks for the links.

> Shortly said: you aren't going to have low latency timers
> if the ksoftirqd can be preempted by your threads.
> 
> Disclaimer: I have never used RT with 2.6, so I don't know
> whether it applies at all.

It is not called ksoftirqd on that kernel - it's sirq-timer, sirq-hrtimer, sirq-rtc there. But yes - looking for the priorities once again I found there is one thread, which is not completely trusted to never ever run wild, which had a higher priority than the timer interrupt threads. So I will adjust the interrupt threads first and see if something changes. Maybe you already made my day :)

So far - thanks a lot. I will keep you informed if this adjustment helps - but it will take at least two days to install and test.

Best regards
Ralf--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Random 3 to 4.5 second blocks of RT thread
  2013-02-12 12:53   ` Ralf Müller
@ 2013-02-12 14:28     ` Bruce Ashfield
  2013-02-13  9:15       ` Ralf Müller
  0 siblings, 1 reply; 7+ messages in thread
From: Bruce Ashfield @ 2013-02-12 14:28 UTC (permalink / raw)
  To: Ralf Müller; +Cc: Realtime Mailinglist, Thomas Weisser

On Tue, Feb 12, 2013 at 7:53 AM, Ralf Müller <ralf@bj-ig.de> wrote:
>
> Am 11.02.2013 um 18:08 schrieb Bruce Ashfield:
>
>> On Mon, Feb 11, 2013 at 10:54 AM, Ralf Müller <ralf@bj-ig.de> wrote:
>>>
>>> I've got a problem with a box at a customer, where about once a day (overall 20 events in 16 days) a realtime thread blocks for 3 to 4.5 seconds. ... It basically is a loop that wakes up every milli second ... It does this quite well - except for these random multi second sleeps.
>>
>> Since you mentioned BIOS, I assume that you are running on a x86 platform.
>
> It's x86 - yes.
>
>> I noted that you didn't mention if SMIs have been ruled out as a cause of the
>> latency issues. You've probably already ruled this out, but I figured I'd ask
>> anyway :)
>
> The question is a good one. What I read about SMI when I started this project, said that expected latencies from SMI would be something from some micro- to a maximum of some milliseconds. As my latency constraints are relatively weak - I can perfectly live with 50 microseconds and I would not be happy, but could at least deal with 10 milliseconds every now and then - I did not follow that trail very far. When SMI can really create 3 to 4 seconds blocks I will have to look into that deeper ... anyway - I will have a look at SMI the next days. Thanks a lot for that hint.
>

I agree that such a long SMI isn't likely, but running the hwlat
detector is a fairly simple
way to see if time is in fact being stolen from your kernel, so it's
something to look into.

> BTW: Are there any links to SMI events in a multi second range? What in a system is done within such a long time span?

I've never seen it first hand, but I have heard of thermal SMIs that
can "borrow" quite
a bit of time (depending on what your platform is doing). Like
anything, not all BIOS/SMIs
are created equal :) Which of course can be more of a problem in your
case, since it
sounds like you aren't on exactly the same h/w as the problematic system.

Anyway, just a thought and something to rule out.

If the customer machines can have tracing enabled, you can always set
a latency threshold
and stop tracing when it is crossed. That should get you insight into
what is happening in
the long latency case.

Cheers,

Bruce

>
> Best regards
> Ralf
>



--
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end"
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Random 3 to 4.5 second blocks of RT thread
  2013-02-12 14:28     ` Bruce Ashfield
@ 2013-02-13  9:15       ` Ralf Müller
  0 siblings, 0 replies; 7+ messages in thread
From: Ralf Müller @ 2013-02-13  9:15 UTC (permalink / raw)
  To: Bruce Ashfield; +Cc: Realtime Mailinglist, Thomas Weisser


Am 12.02.2013 um 15:28 schrieb Bruce Ashfield:
> 
> I agree that such a long SMI isn't likely, but running the hwlat
> detector is a fairly simple way to see if time is in fact being
> stolen from your kernel, so it's something to look into.

hwlat - up to now - was unknown to me. I will give it a try. Thanks.

>> BTW: Are there any links to SMI events in a multi second
>> range? What in a system is done within such a long time span?
> 
> I've never seen it first hand, but I have heard of thermal SMIs that
> can "borrow" quite a bit of time (depending on what your platform
> is doing). Like anything, not all BIOS/SMIs are created equal :)

:)

> Which of course can be more of a problem in your
> case, since it sounds like you aren't on exactly
> the same h/w as the problematic system.

Thats the case - yes. But I can try here, build an installation
packet for the customer system. Remote control the customer - so
he can activate that packet and send me results.

> Anyway, just a thought and something to rule out.

> If the customer machines can have tracing enabled, you can always set
> a latency threshold and stop tracing when it is crossed. That should
> get you insight into what is happening in the long latency case.

You are right - great idea.

Thank you very much
Ralf


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2013-02-13  9:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-02-11 15:54 Random 3 to 4.5 second blocks of RT thread Ralf Müller
2013-02-11 17:08 ` Bruce Ashfield
2013-02-12 12:53   ` Ralf Müller
2013-02-12 14:28     ` Bruce Ashfield
2013-02-13  9:15       ` Ralf Müller
2013-02-11 18:55 ` Stanislav Meduna
2013-02-12 14:01   ` Ralf Müller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).