public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [BUG] Something goes wrong with timer statistics.
@ 2007-05-29 21:38 Ian Kumlien
       [not found] ` <6bffcb0e0705291524y2752d646p94b0bf6ca87af68f@mail.gmail.com>
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Ian Kumlien @ 2007-05-29 21:38 UTC (permalink / raw)
  To: Linux-kernel

Hi, 

As the daystar sets, i try to play some with my new would be
firewall/server, but since this will be running for quite some time i
have been experimenting with powertop to find out what i can do to limit
it's power usage.

But, if i run powertop for too long or a few times to many... this
happens:
http://pomac.netswarm.net/pics/kernel_panic.jpg

If i don't run powertop, it is rock solid... Compiling for hours,
running memtest for hours etc etc... 

-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
       [not found] ` <6bffcb0e0705291524y2752d646p94b0bf6ca87af68f@mail.gmail.com>
@ 2007-05-29 22:35   ` Ian Kumlien
  2007-05-29 22:38   ` Thomas Gleixner
  1 sibling, 0 replies; 25+ messages in thread
From: Ian Kumlien @ 2007-05-29 22:35 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: Thomas Gleixner, Linux-kernel

On ons, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> Hi Ian,
> 
> (Thomas "The Wizard of Time" added to CC :))
> 
> On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > Hi,
> >
> > As the daystar sets, i try to play some with my new would be
> > firewall/server, but since this will be running for quite some time i
> > have been experimenting with powertop to find out what i can do to limit
> > it's power usage.
> >
> > But, if i run powertop for too long or a few times to many... this
> > happens:
> > http://pomac.netswarm.net/pics/kernel_panic.jpg
> >
> > If i don't run powertop, it is rock solid... Compiling for hours,
> > running memtest for hours etc etc...
> 
> (Where is my magic ball?)

(Lost? =))

> Let me guess - Pentium 4 with HT?

Core 2 Duo, without HT =)

> Also, please try the latest hrt tree
> http://tglx.de/projects/hrtimers/2.6.22-rc3/

If you still think i should try that, i will, this box is not in
production yet for a while so... =)

I switched to using 64 bit hpet after the last crash, will this
influence the results?

> Regards,
> Michal
> 
-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 21:38 [BUG] Something goes wrong with timer statistics Ian Kumlien
       [not found] ` <6bffcb0e0705291524y2752d646p94b0bf6ca87af68f@mail.gmail.com>
@ 2007-05-29 22:37 ` David Miller
  2007-05-29 22:42   ` Thomas Gleixner
       [not found]   ` <6bffcb0e0705291541q7e6d6faen5cd6345074c4c92a@mail.gmail.com>
  2007-05-30  0:13 ` Stephane Casset
  2 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2007-05-29 22:37 UTC (permalink / raw)
  To: iank; +Cc: Linux-kernel

From: Ian Kumlien <iank@bredband.net>
Date: Tue, 29 May 2007 23:38:48 +0200

> As the daystar sets, i try to play some with my new would be
> firewall/server, but since this will be running for quite some time i
> have been experimenting with powertop to find out what i can do to limit
> it's power usage.
> 
> But, if i run powertop for too long or a few times to many... this
> happens:
> http://pomac.netswarm.net/pics/kernel_panic.jpg
> 
> If i don't run powertop, it is rock solid... Compiling for hours,
> running memtest for hours etc etc... 

I see this same exact problem on sparc64.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
       [not found] ` <6bffcb0e0705291524y2752d646p94b0bf6ca87af68f@mail.gmail.com>
  2007-05-29 22:35   ` Ian Kumlien
@ 2007-05-29 22:38   ` Thomas Gleixner
  2007-05-29 22:50     ` Ian Kumlien
                       ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Thomas Gleixner @ 2007-05-29 22:38 UTC (permalink / raw)
  To: Michal Piotrowski
  Cc: Ian Kumlien, Linux-kernel, Ingo Molnar, Arjan van de Ven

On Wed, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> Hi Ian,
> 
> (Thomas "The Wizard of Time" added to CC :))

Added more wizards :)

> On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > Hi,
> >
> > As the daystar sets, i try to play some with my new would be
> > firewall/server, but since this will be running for quite some time i
> > have been experimenting with powertop to find out what i can do to limit
> > it's power usage.
> >
> > But, if i run powertop for too long or a few times to many... this
> > happens:
> > http://pomac.netswarm.net/pics/kernel_panic.jpg

This was reported before. It's incredibly hard to reproduce.

Ian, is this happening reproducible ?

> > If i don't run powertop, it is rock solid... Compiling for hours,
> > running memtest for hours etc etc...
> 
> (Where is my magic ball?)
> 
> Let me guess - Pentium 4 with HT?

I might be wrong, but it's a newer core duo thingy.

	tglx



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:37 ` David Miller
@ 2007-05-29 22:42   ` Thomas Gleixner
  2007-05-29 22:44     ` David Miller
       [not found]   ` <6bffcb0e0705291541q7e6d6faen5cd6345074c4c92a@mail.gmail.com>
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2007-05-29 22:42 UTC (permalink / raw)
  To: David Miller; +Cc: iank, Linux-kernel, Ingo Molnar, Arjan van de Ven

On Tue, 2007-05-29 at 15:37 -0700, David Miller wrote:
> From: Ian Kumlien <iank@bredband.net>
> Date: Tue, 29 May 2007 23:38:48 +0200
> 
> > As the daystar sets, i try to play some with my new would be
> > firewall/server, but since this will be running for quite some time i
> > have been experimenting with powertop to find out what i can do to limit
> > it's power usage.
> > 
> > But, if i run powertop for too long or a few times to many... this
> > happens:
> > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > 
> > If i don't run powertop, it is rock solid... Compiling for hours,
> > running memtest for hours etc etc... 
> 
> I see this same exact problem on sparc64.

100% reproducible ?

	tglx




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
       [not found]   ` <6bffcb0e0705291541q7e6d6faen5cd6345074c4c92a@mail.gmail.com>
@ 2007-05-29 22:44     ` David Miller
  2007-05-29 22:51       ` Ian Kumlien
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2007-05-29 22:44 UTC (permalink / raw)
  To: michal.k.k.piotrowski; +Cc: iank, Linux-kernel

From: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
Date: Wed, 30 May 2007 00:41:46 +0200

> On 30/05/07, David Miller <davem@davemloft.net> wrote:
> > From: Ian Kumlien <iank@bredband.net>
> > Date: Tue, 29 May 2007 23:38:48 +0200
> >
> > > As the daystar sets, i try to play some with my new would be
> > > firewall/server, but since this will be running for quite some time i
> > > have been experimenting with powertop to find out what i can do to limit
> > > it's power usage.
> > >
> > > But, if i run powertop for too long or a few times to many... this
> > > happens:
> > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > >
> > > If i don't run powertop, it is rock solid... Compiling for hours,
> > > running memtest for hours etc etc...
> >
> > I see this same exact problem on sparc64.
> 
> Have you tried this patch?
> http://lkml.org/lkml/2007/5/29/392

Of course, I worked with Thomas on the fix :-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:42   ` Thomas Gleixner
@ 2007-05-29 22:44     ` David Miller
  2007-05-29 22:53       ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2007-05-29 22:44 UTC (permalink / raw)
  To: tglx; +Cc: iank, Linux-kernel, mingo, arjan

From: Thomas Gleixner <tglx@linutronix.de>
Date: Wed, 30 May 2007 00:42:12 +0200

> On Tue, 2007-05-29 at 15:37 -0700, David Miller wrote:
> > I see this same exact problem on sparc64.
> 
> 100% reproducible ?

I think so, I'll try to find time to tinker with it a bit
more.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:38   ` Thomas Gleixner
@ 2007-05-29 22:50     ` Ian Kumlien
  2007-05-30  6:52     ` Björn Steinbrink
  2007-05-30 11:38     ` Björn Steinbrink
  2 siblings, 0 replies; 25+ messages in thread
From: Ian Kumlien @ 2007-05-29 22:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Michal Piotrowski, Linux-kernel, Ingo Molnar, Arjan van de Ven

On ons, 2007-05-30 at 00:38 +0200, Thomas Gleixner wrote:
> On Wed, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> > Hi Ian,
> > 
> > (Thomas "The Wizard of Time" added to CC :))
> 
> Added more wizards :)
> 
> > On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > > Hi,
> > >
> > > As the daystar sets, i try to play some with my new would be
> > > firewall/server, but since this will be running for quite some time i
> > > have been experimenting with powertop to find out what i can do to limit
> > > it's power usage.
> > >
> > > But, if i run powertop for too long or a few times to many... this
> > > happens:
> > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> 
> This was reported before. It's incredibly hard to reproduce.
> 
> Ian, is this happening reproducible ?

Seems like it, i haven't actually TRIED to crash yet... 
But i can try if i don't have to crash in the future =)

(The machine is just a motherboard lying on the floor with some disks
attached, so i have to bring out my pocket knife and short circuit the
right pins =))

> > > If i don't run powertop, it is rock solid... Compiling for hours,
> > > running memtest for hours etc etc...
> > 
> > (Where is my magic ball?)
> > 
> > Let me guess - Pentium 4 with HT?
> 
> I might be wrong, but it's a newer core duo thingy.
> 
> 	tglx
> 
-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:44     ` David Miller
@ 2007-05-29 22:51       ` Ian Kumlien
  2007-05-29 23:03         ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Ian Kumlien @ 2007-05-29 22:51 UTC (permalink / raw)
  To: David Miller; +Cc: michal.k.k.piotrowski, Linux-kernel

On tis, 2007-05-29 at 15:44 -0700, David Miller wrote:
> From: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
> Date: Wed, 30 May 2007 00:41:46 +0200
> 
> > On 30/05/07, David Miller <davem@davemloft.net> wrote:
> > > From: Ian Kumlien <iank@bredband.net>
> > > Date: Tue, 29 May 2007 23:38:48 +0200
> > >
> > > > As the daystar sets, i try to play some with my new would be
> > > > firewall/server, but since this will be running for quite some time i
> > > > have been experimenting with powertop to find out what i can do to limit
> > > > it's power usage.
> > > >
> > > > But, if i run powertop for too long or a few times to many... this
> > > > happens:
> > > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > > >
> > > > If i don't run powertop, it is rock solid... Compiling for hours,
> > > > running memtest for hours etc etc...
> > >
> > > I see this same exact problem on sparc64.
> > 
> > Have you tried this patch?
> > http://lkml.org/lkml/2007/5/29/392
> 
> Of course, I worked with Thomas on the fix :-)

I dunno if this applies to me, i run a 32 bit userspace...

-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:44     ` David Miller
@ 2007-05-29 22:53       ` Thomas Gleixner
  0 siblings, 0 replies; 25+ messages in thread
From: Thomas Gleixner @ 2007-05-29 22:53 UTC (permalink / raw)
  To: David Miller; +Cc: iank, Linux-kernel, mingo, arjan

On Tue, 2007-05-29 at 15:44 -0700, David Miller wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Wed, 30 May 2007 00:42:12 +0200
> 
> > On Tue, 2007-05-29 at 15:37 -0700, David Miller wrote:
> > > I see this same exact problem on sparc64.
> > 
> > 100% reproducible ?
> 
> I think so, I'll try to find time to tinker with it a bit
> more.

It seems to be a subtle bug in the timer stats code, but I did not spot
it.

	tglx



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:51       ` Ian Kumlien
@ 2007-05-29 23:03         ` David Miller
  2007-05-30  0:04           ` Ian Kumlien
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2007-05-29 23:03 UTC (permalink / raw)
  To: iank; +Cc: michal.k.k.piotrowski, Linux-kernel

From: Ian Kumlien <iank@bredband.net>
Date: Wed, 30 May 2007 00:51:52 +0200

> On tis, 2007-05-29 at 15:44 -0700, David Miller wrote:
> > From: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
> > Date: Wed, 30 May 2007 00:41:46 +0200
> > 
> > > On 30/05/07, David Miller <davem@davemloft.net> wrote:
> > > > From: Ian Kumlien <iank@bredband.net>
> > > > Date: Tue, 29 May 2007 23:38:48 +0200
> > > >
> > > > > As the daystar sets, i try to play some with my new would be
> > > > > firewall/server, but since this will be running for quite some time i
> > > > > have been experimenting with powertop to find out what i can do to limit
> > > > > it's power usage.
> > > > >
> > > > > But, if i run powertop for too long or a few times to many... this
> > > > > happens:
> > > > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > > > >
> > > > > If i don't run powertop, it is rock solid... Compiling for hours,
> > > > > running memtest for hours etc etc...
> > > >
> > > > I see this same exact problem on sparc64.
> > > 
> > > Have you tried this patch?
> > > http://lkml.org/lkml/2007/5/29/392
> > 
> > Of course, I worked with Thomas on the fix :-)
> 
> I dunno if this applies to me, i run a 32 bit userspace...

That patch fixes a problem on 64-bit kernels, regardless of
userspace, when NOHZ is enabled.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 23:03         ` David Miller
@ 2007-05-30  0:04           ` Ian Kumlien
  0 siblings, 0 replies; 25+ messages in thread
From: Ian Kumlien @ 2007-05-30  0:04 UTC (permalink / raw)
  To: David Miller; +Cc: michal.k.k.piotrowski, Linux-kernel

On tis, 2007-05-29 at 16:03 -0700, David Miller wrote:
> From: Ian Kumlien <iank@bredband.net>
> Date: Wed, 30 May 2007 00:51:52 +0200
> 
> > On tis, 2007-05-29 at 15:44 -0700, David Miller wrote:
> > > From: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
> > > Date: Wed, 30 May 2007 00:41:46 +0200
> > > 
> > > > On 30/05/07, David Miller <davem@davemloft.net> wrote:
> > > > > From: Ian Kumlien <iank@bredband.net>
> > > > > Date: Tue, 29 May 2007 23:38:48 +0200
> > > > >
> > > > > > As the daystar sets, i try to play some with my new would be
> > > > > > firewall/server, but since this will be running for quite some time i
> > > > > > have been experimenting with powertop to find out what i can do to limit
> > > > > > it's power usage.
> > > > > >
> > > > > > But, if i run powertop for too long or a few times to many... this
> > > > > > happens:
> > > > > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > > > > >
> > > > > > If i don't run powertop, it is rock solid... Compiling for hours,
> > > > > > running memtest for hours etc etc...
> > > > >
> > > > > I see this same exact problem on sparc64.
> > > > 
> > > > Have you tried this patch?
> > > > http://lkml.org/lkml/2007/5/29/392
> > > 
> > > Of course, I worked with Thomas on the fix :-)
> > 
> > I dunno if this applies to me, i run a 32 bit userspace...
> 
> That patch fixes a problem on 64-bit kernels, regardless of
> userspace, when NOHZ is enabled.

Sorry, i run all 32 bit, i was on the phone while typing that, so a wee
bit distracted. Anyways, i run a nohz 32 bit kernel.

I dunno how good a intel T7200 is when it comes to 64 bit
kernel/userspace.. (Haven't really read up on it)

-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 21:38 [BUG] Something goes wrong with timer statistics Ian Kumlien
       [not found] ` <6bffcb0e0705291524y2752d646p94b0bf6ca87af68f@mail.gmail.com>
  2007-05-29 22:37 ` David Miller
@ 2007-05-30  0:13 ` Stephane Casset
  2007-05-31 23:53   ` Björn Steinbrink
  2 siblings, 1 reply; 25+ messages in thread
From: Stephane Casset @ 2007-05-30  0:13 UTC (permalink / raw)
  To: Ian Kumlien; +Cc: Linux-kernel

Le Tue, May 29, 2007 at 11:38:48PM +0200, Ian Kumlien écrivait :
> Hi, 
> 
> As the daystar sets, i try to play some with my new would be
> firewall/server, but since this will be running for quite some time i
> have been experimenting with powertop to find out what i can do to limit
> it's power usage.
> 
> But, if i run powertop for too long or a few times to many... this
> happens:
> http://pomac.netswarm.net/pics/kernel_panic.jpg
> 
> If i don't run powertop, it is rock solid... Compiling for hours,
> running memtest for hours etc etc... 

Same here, P4 HT, if I run powertop for 5-10 minutes the system crash :(
It happens with 2.6.22-rc{2,3} at least :( I didn't try before.

I don't have a oops since the last crash was in X, but next time I will
try to catch it. If there is any patch around to fix this I am willing
to test.

Regards
-- 
Stéphane Casset            LOGIDÉE sàrl       Se faire plaisir d'apprendre
1a, rue Pasteur        Tel : +33 388 23 69 77           casset@logidee.com
F-67540 OSTWALD        Fax : +33 388 23 69 77           http://logidee.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:38   ` Thomas Gleixner
  2007-05-29 22:50     ` Ian Kumlien
@ 2007-05-30  6:52     ` Björn Steinbrink
  2007-05-30  7:10       ` Björn Steinbrink
  2007-05-30 11:38     ` Björn Steinbrink
  2 siblings, 1 reply; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-30  6:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Michal Piotrowski, Ian Kumlien, Linux-kernel, Ingo Molnar,
	Arjan van de Ven

On 2007.05.30 00:38:08 +0200, Thomas Gleixner wrote:
> On Wed, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> > Hi Ian,
> > 
> > (Thomas "The Wizard of Time" added to CC :))
> 
> Added more wizards :)
> 
> > On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > > Hi,
> > >
> > > As the daystar sets, i try to play some with my new would be
> > > firewall/server, but since this will be running for quite some time i
> > > have been experimenting with powertop to find out what i can do to limit
> > > it's power usage.
> > >
> > > But, if i run powertop for too long or a few times to many... this
> > > happens:
> > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> 
> This was reported before. It's incredibly hard to reproduce.

Hm, tstats_write() only grabs show_mutex, but it calls reset_entries()
which happily messes with the entries table, shouldn't the table_lock
also be held in that case?

Björn

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-30  6:52     ` Björn Steinbrink
@ 2007-05-30  7:10       ` Björn Steinbrink
  0 siblings, 0 replies; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-30  7:10 UTC (permalink / raw)
  To: Thomas Gleixner, Michal Piotrowski, Ian Kumlien, Linux-kernel,
	Ingo Molnar, Arjan van de Ven

On 2007.05.30 08:52:04 +0200, Björn Steinbrink wrote:
> On 2007.05.30 00:38:08 +0200, Thomas Gleixner wrote:
> > On Wed, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> > > Hi Ian,
> > > 
> > > (Thomas "The Wizard of Time" added to CC :))
> > 
> > Added more wizards :)
> > 
> > > On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > > > Hi,
> > > >
> > > > As the daystar sets, i try to play some with my new would be
> > > > firewall/server, but since this will be running for quite some time i
> > > > have been experimenting with powertop to find out what i can do to limit
> > > > it's power usage.
> > > >
> > > > But, if i run powertop for too long or a few times to many... this
> > > > happens:
> > > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > 
> > This was reported before. It's incredibly hard to reproduce.
> 
> Hm, tstats_write() only grabs show_mutex, but it calls reset_entries()
> which happily messes with the entries table, shouldn't the table_lock
> also be held in that case?

Hrmpf, nevermind, I missed the active flag...

Sorry,
Björn

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-29 22:38   ` Thomas Gleixner
  2007-05-29 22:50     ` Ian Kumlien
  2007-05-30  6:52     ` Björn Steinbrink
@ 2007-05-30 11:38     ` Björn Steinbrink
  2007-05-30 12:44       ` Eric Dumazet
  2 siblings, 1 reply; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-30 11:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Michal Piotrowski, Ian Kumlien, Linux-kernel, Ingo Molnar,
	Arjan van de Ven

On 2007.05.30 00:38:08 +0200, Thomas Gleixner wrote:
> On Wed, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> > Hi Ian,
> > 
> > (Thomas "The Wizard of Time" added to CC :))
> 
> Added more wizards :)
> 
> > On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > > Hi,
> > >
> > > As the daystar sets, i try to play some with my new would be
> > > firewall/server, but since this will be running for quite some time i
> > > have been experimenting with powertop to find out what i can do to limit
> > > it's power usage.
> > >
> > > But, if i run powertop for too long or a few times to many... this
> > > happens:
> > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> 
> This was reported before. It's incredibly hard to reproduce.

OK, second try, this time with a patch. In timer_stats_update_stats,
input is allocated on the stack, so it is uninitialized, in particular
the "next" field is random. Now in tstat_lookup, the new entry "curr" is
initialized with the values from "input" (passed as "entry") and "next"
is set to NULL _after_ it is added to the list, so if a second CPU is
running the fastpath lookup while we're inserting the new element, it
might get the garbage value in "next". The patch below fixes that.

Björn



Initialize the "next" field of a timer stats entry before it is inserted
into the list to avoid a race with the fastpath lookup.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index 868f1bc..5bc8f91 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -202,12 +202,12 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
 	if (curr) {
 		*curr = *entry;
 		curr->count = 0;
+		curr->next = NULL;
 		memcpy(curr->comm, comm, TASK_COMM_LEN);
 		if (prev)
 			prev->next = curr;
 		else
 			*head = curr;
-		curr->next = NULL;
 	}
  out_unlock:
 	spin_unlock(&table_lock);


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-30 11:38     ` Björn Steinbrink
@ 2007-05-30 12:44       ` Eric Dumazet
  2007-05-30 13:14         ` Björn Steinbrink
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2007-05-30 12:44 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Thomas Gleixner, Michal Piotrowski, Ian Kumlien, Linux-kernel,
	Ingo Molnar, Arjan van de Ven

On Wed, 30 May 2007 13:38:25 +0200
Björn Steinbrink <B.Steinbrink@gmx.de> wrote:

> On 2007.05.30 00:38:08 +0200, Thomas Gleixner wrote:
> > On Wed, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> > > Hi Ian,
> > > 
> > > (Thomas "The Wizard of Time" added to CC :))
> > 
> > Added more wizards :)
> > 
> > > On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > > > Hi,
> > > >
> > > > As the daystar sets, i try to play some with my new would be
> > > > firewall/server, but since this will be running for quite some time i
> > > > have been experimenting with powertop to find out what i can do to limit
> > > > it's power usage.
> > > >
> > > > But, if i run powertop for too long or a few times to many... this
> > > > happens:
> > > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > 
> > This was reported before. It's incredibly hard to reproduce.
> 
> OK, second try, this time with a patch. In timer_stats_update_stats,
> input is allocated on the stack, so it is uninitialized, in particular
> the "next" field is random. Now in tstat_lookup, the new entry "curr" is
> initialized with the values from "input" (passed as "entry") and "next"
> is set to NULL _after_ it is added to the list, so if a second CPU is
> running the fastpath lookup while we're inserting the new element, it
> might get the garbage value in "next". The patch below fixes that.
> 
> Björn
> 
> 
> 
> Initialize the "next" field of a timer stats entry before it is inserted
> into the list to avoid a race with the fastpath lookup.
> 
> Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
> ---
> diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
> index 868f1bc..5bc8f91 100644
> --- a/kernel/time/timer_stats.c
> +++ b/kernel/time/timer_stats.c
> @@ -202,12 +202,12 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
>  	if (curr) {
>  		*curr = *entry;
>  		curr->count = 0;
> +		curr->next = NULL;
>  		memcpy(curr->comm, comm, TASK_COMM_LEN);
>  		if (prev)
>  			prev->next = curr;
>  		else
>  			*head = curr;
> -		curr->next = NULL;
>  	}
>   out_unlock:
>  	spin_unlock(&table_lock);
> 

Your analysis might be right, not the fix.

You *cannot* assume curr->next = NULL will be done before insert.

You probably also need a memory barrier.

	curr->next = NULL;
	memcpy(curr->comm, comm, TASK_COMM_LEN);

	smp_mb(); /* commit all writes to curr before insert */

	if (prev)
		prev->next = curr;
	else
		*head = curr;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-30 12:44       ` Eric Dumazet
@ 2007-05-30 13:14         ` Björn Steinbrink
       [not found]           ` <20070531102047.GG19272@pomac.netswarm.net>
  0 siblings, 1 reply; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-30 13:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, Michal Piotrowski, Ian Kumlien, Linux-kernel,
	Ingo Molnar, Arjan van de Ven

On 2007.05.30 14:44:49 +0200, Eric Dumazet wrote:
> On Wed, 30 May 2007 13:38:25 +0200
> Björn Steinbrink <B.Steinbrink@gmx.de> wrote:
> 
> > On 2007.05.30 00:38:08 +0200, Thomas Gleixner wrote:
> > > On Wed, 2007-05-30 at 00:24 +0200, Michal Piotrowski wrote:
> > > > On 29/05/07, Ian Kumlien <iank@bredband.net> wrote:
> > > > > Hi,
> > > > >
> > > > > As the daystar sets, i try to play some with my new would be
> > > > > firewall/server, but since this will be running for quite some time i
> > > > > have been experimenting with powertop to find out what i can do to limit
> > > > > it's power usage.
> > > > >
> > > > > But, if i run powertop for too long or a few times to many... this
> > > > > happens:
> > > > > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > > 
> > > This was reported before. It's incredibly hard to reproduce.
> > 
> > OK, second try, this time with a patch. In timer_stats_update_stats,
> > input is allocated on the stack, so it is uninitialized, in particular
> > the "next" field is random. Now in tstat_lookup, the new entry "curr" is
> > initialized with the values from "input" (passed as "entry") and "next"
> > is set to NULL _after_ it is added to the list, so if a second CPU is
> > running the fastpath lookup while we're inserting the new element, it
> > might get the garbage value in "next". The patch below fixes that.
> > 
> > Björn
> > 
> > 
> > 
> > Initialize the "next" field of a timer stats entry before it is inserted
> > into the list to avoid a race with the fastpath lookup.
> > 
> > Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
> > ---
> > diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
> > index 868f1bc..5bc8f91 100644
> > --- a/kernel/time/timer_stats.c
> > +++ b/kernel/time/timer_stats.c
> > @@ -202,12 +202,12 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
> >  	if (curr) {
> >  		*curr = *entry;
> >  		curr->count = 0;
> > +		curr->next = NULL;
> >  		memcpy(curr->comm, comm, TASK_COMM_LEN);
> >  		if (prev)
> >  			prev->next = curr;
> >  		else
> >  			*head = curr;
> > -		curr->next = NULL;
> >  	}
> >   out_unlock:
> >  	spin_unlock(&table_lock);
> > 
> 
> Your analysis might be right, not the fix.
> 
> You *cannot* assume curr->next = NULL will be done before insert.
> 
> You probably also need a memory barrier.

Ehrm, right. I somehow thought of the spinlock being enough to satisfy
that constraint, but if that was the case, the whole problem wouldn't
exist in the first place. D'oh!

Thanks,
Björn


Initialize the "next" field of a timer stats entry before it is inserted
into the list to avoid a race with the fastpath lookup.

Thanks to Eric Dumazet for reminding me of the memory barrier.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index 868f1bc..ab0ba6c 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -202,12 +202,15 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
 	if (curr) {
 		*curr = *entry;
 		curr->count = 0;
+		curr->next = NULL;
 		memcpy(curr->comm, comm, TASK_COMM_LEN);
+
+		smp_mb(); /* Ensure that curr is initialized before insert */
+
 		if (prev)
 			prev->next = curr;
 		else
 			*head = curr;
-		curr->next = NULL;
 	}
  out_unlock:
 	spin_unlock(&table_lock);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
       [not found]           ` <20070531102047.GG19272@pomac.netswarm.net>
@ 2007-05-31 14:25             ` Björn Steinbrink
  2007-05-31 15:10               ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-31 14:25 UTC (permalink / raw)
  To: iank
  Cc: Eric Dumazet, Thomas Gleixner, Michal Piotrowski, Ingo Molnar,
	Arjan van de Ven, linux-kernel

On 2007.05.31 12:20:47 +0200, iank@bredband.net wrote:
> On Wed, May 30, 2007 at 03:14:58PM +0200, Björn Steinbrink wrote:
> > Initialize the "next" field of a timer stats entry before it is inserted
> > into the list to avoid a race with the fastpath lookup.
> 
> Sorry to say... but this does not fix my problem, however, i can't reach
> that machine at all atm, but i will post the oops later today.

Ok, three times is a charm, right? ;-) The previous patch did fix a
race, but there's still one left. The hash table is never cleared, so it
can still point into the entries array.

As long as we do not race with the insertion, that's not a directly
visible problem as all fields are 0, so we get no match and the lookup
finishes right away. We probably _can_ get weird results though, as the
all-zero entry isn't marked as used, but is used as "head" and "prev" in
this case and even multiple hash entries might point to it in some
cases.  That would eventually lead to lost entries because the "next"
field gets overwritten when alloc_entry() finally claims the entry.

But when we race, we can still end up between "*curr = *entry" (makes
"next" contain garbage) and "curr->next = NULL", while we're doing the
fastpath lookup, because the insertion basically already took place!
Replacing "curr->next = NULL" with "entry->next = NULL" and moving it up
is pointless, as "*curr = *entry" isn't atomic, so instead, we have to
simply clean the hash table when the entries are cleared.

Björn


Fix two races in the timer stats lookup code. One by ensuring that the
initialization of a new entry is finished upon insertion of that entry.
The other by cleaning up the hash table when the entries array is
cleared, so that we don't have "pre-inserted" entries.

Thanks to Eric Dumazet for reminding me of the memory barrier.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index 868f1bc..611b844 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -117,21 +117,6 @@ static struct entry entries[MAX_ENTRIES];
 
 static atomic_t overflow_count;
 
-static void reset_entries(void)
-{
-	nr_entries = 0;
-	memset(entries, 0, sizeof(entries));
-	atomic_set(&overflow_count, 0);
-}
-
-static struct entry *alloc_entry(void)
-{
-	if (nr_entries >= MAX_ENTRIES)
-		return NULL;
-
-	return entries + nr_entries++;
-}
-
 /*
  * The entries are in a hash-table, for fast lookup:
  */
@@ -149,6 +134,22 @@ static struct entry *alloc_entry(void)
 
 static struct entry *tstat_hash_table[TSTAT_HASH_SIZE] __read_mostly;
 
+static void reset_entries(void)
+{
+	nr_entries = 0;
+	memset(entries, 0, sizeof(entries));
+	memset(tstat_hash_table, 0, sizeof(tstat_hash_table));
+	atomic_set(&overflow_count, 0);
+}
+
+static struct entry *alloc_entry(void)
+{
+	if (nr_entries >= MAX_ENTRIES)
+		return NULL;
+
+	return entries + nr_entries++;
+}
+
 static int match_entries(struct entry *entry1, struct entry *entry2)
 {
 	return entry1->timer       == entry2->timer	  &&
@@ -202,12 +203,15 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
 	if (curr) {
 		*curr = *entry;
 		curr->count = 0;
+		curr->next = NULL;
 		memcpy(curr->comm, comm, TASK_COMM_LEN);
+
+		smp_mb(); /* Ensure that curr is initialized before insert */
+
 		if (prev)
 			prev->next = curr;
 		else
 			*head = curr;
-		curr->next = NULL;
 	}
  out_unlock:
 	spin_unlock(&table_lock);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-31 14:25             ` Björn Steinbrink
@ 2007-05-31 15:10               ` Eric Dumazet
  2007-05-31 15:27                 ` Björn Steinbrink
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2007-05-31 15:10 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: iank, Thomas Gleixner, Michal Piotrowski, Ingo Molnar,
	Arjan van de Ven, linux-kernel

On Thu, 31 May 2007 16:25:22 +0200
Björn Steinbrink <B.Steinbrink@gmx.de> wrote:

> On 2007.05.31 12:20:47 +0200, iank@bredband.net wrote:
> > On Wed, May 30, 2007 at 03:14:58PM +0200, Björn Steinbrink wrote:
> > > Initialize the "next" field of a timer stats entry before it is inserted
> > > into the list to avoid a race with the fastpath lookup.
> > 
> > Sorry to say... but this does not fix my problem, however, i can't reach
> > that machine at all atm, but i will post the oops later today.
> 
> Ok, three times is a charm, right? ;-) The previous patch did fix a
> race, but there's still one left. The hash table is never cleared, so it
> can still point into the entries array.
> 
> As long as we do not race with the insertion, that's not a directly
> visible problem as all fields are 0, so we get no match and the lookup
> finishes right away. We probably _can_ get weird results though, as the
> all-zero entry isn't marked as used, but is used as "head" and "prev" in
> this case and even multiple hash entries might point to it in some
> cases.  That would eventually lead to lost entries because the "next"
> field gets overwritten when alloc_entry() finally claims the entry.
> 
> But when we race, we can still end up between "*curr = *entry" (makes
> "next" contain garbage) and "curr->next = NULL", while we're doing the
> fastpath lookup, because the insertion basically already took place!
> Replacing "curr->next = NULL" with "entry->next = NULL" and moving it up
> is pointless, as "*curr = *entry" isn't atomic, so instead, we have to
> simply clean the hash table when the entries are cleared.
> 
> Björn
> 
> 
> Fix two races in the timer stats lookup code. One by ensuring that the
> initialization of a new entry is finished upon insertion of that entry.
> The other by cleaning up the hash table when the entries array is
> cleared, so that we don't have "pre-inserted" entries.
> 
> Thanks to Eric Dumazet for reminding me of the memory barrier.
> 
> Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
> ---
> diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
> index 868f1bc..611b844 100644
> --- a/kernel/time/timer_stats.c
> +++ b/kernel/time/timer_stats.c
> @@ -117,21 +117,6 @@ static struct entry entries[MAX_ENTRIES];
>  
>  static atomic_t overflow_count;
>  
> -static void reset_entries(void)
> -{
> -	nr_entries = 0;
> -	memset(entries, 0, sizeof(entries));
> -	atomic_set(&overflow_count, 0);
> -}
> -
> -static struct entry *alloc_entry(void)
> -{
> -	if (nr_entries >= MAX_ENTRIES)
> -		return NULL;
> -
> -	return entries + nr_entries++;
> -}
> -
>  /*
>   * The entries are in a hash-table, for fast lookup:
>   */
> @@ -149,6 +134,22 @@ static struct entry *alloc_entry(void)
>  
>  static struct entry *tstat_hash_table[TSTAT_HASH_SIZE] __read_mostly;
>  
> +static void reset_entries(void)
> +{
> +	nr_entries = 0;
> +	memset(entries, 0, sizeof(entries));
> +	memset(tstat_hash_table, 0, sizeof(tstat_hash_table));
> +	atomic_set(&overflow_count, 0);
> +}
> +
> +static struct entry *alloc_entry(void)
> +{
> +	if (nr_entries >= MAX_ENTRIES)
> +		return NULL;
> +
> +	return entries + nr_entries++;
> +}
> +
>  static int match_entries(struct entry *entry1, struct entry *entry2)
>  {
>  	return entry1->timer       == entry2->timer	  &&
> @@ -202,12 +203,15 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
>  	if (curr) {
>  		*curr = *entry;
>  		curr->count = 0;
> +		curr->next = NULL;
>  		memcpy(curr->comm, comm, TASK_COMM_LEN);
> +
> +		smp_mb(); /* Ensure that curr is initialized before insert */
> +
>  		if (prev)
>  			prev->next = curr;
>  		else
>  			*head = curr;
> -		curr->next = NULL;
>  	}
>   out_unlock:
>  	spin_unlock(&table_lock);
> -

Well... :) , there is still a memory barrier missing it seems.

Another cpu might see a bad value if 'active=1' is set before tstat_hash_table is really cleared.


 diff -urp linux/kernel/time/timer_stats.c.orig linux/kernel/time/timer_stats.c
--- kernel/time/timer_stats.c.orig
+++ kernel/time/timer_stats.c
@@ -361,6 +361,7 @@ static ssize_t tstats_write(struct file 
                if (!active) {
                        reset_entries();
                        time_start = ktime_get();
+                       smp_mb();
                        active = 1;
                }
                break;



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-31 15:10               ` Eric Dumazet
@ 2007-05-31 15:27                 ` Björn Steinbrink
  2007-05-31 22:59                   ` Ian Kumlien
  0 siblings, 1 reply; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-31 15:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: iank, Thomas Gleixner, Michal Piotrowski, Ingo Molnar,
	Arjan van de Ven, linux-kernel

On 2007.05.31 17:10:07 +0200, Eric Dumazet wrote:
> Well... :) , there is still a memory barrier missing it seems.
> 
> Another cpu might see a bad value if 'active=1' is set before tstat_hash_table is really cleared.

Hm, that even makes the assumption in my first try valid ;-)
Just for the record, this time I thought that the barrier from the
spinlock in timer_stats_update_stats (right before the check for active)
would be enough, but that's obviously running on the wrong cpu if we
race... *sigh*

Thanks,
Björn



Fix two races in the timer stats lookup code. One by ensuring that the
initialization of a new entry is finished upon insertion of that entry.
The other by cleaning up the hash table when the entries array is
cleared, so that we don't have "pre-inserted" entries.

Thanks to Eric Dumazet for reminding me of the memory barriers.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
---
diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index 868f1bc..63d00f9 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -117,21 +117,6 @@ static struct entry entries[MAX_ENTRIES];
 
 static atomic_t overflow_count;
 
-static void reset_entries(void)
-{
-	nr_entries = 0;
-	memset(entries, 0, sizeof(entries));
-	atomic_set(&overflow_count, 0);
-}
-
-static struct entry *alloc_entry(void)
-{
-	if (nr_entries >= MAX_ENTRIES)
-		return NULL;
-
-	return entries + nr_entries++;
-}
-
 /*
  * The entries are in a hash-table, for fast lookup:
  */
@@ -149,6 +134,22 @@ static struct entry *alloc_entry(void)
 
 static struct entry *tstat_hash_table[TSTAT_HASH_SIZE] __read_mostly;
 
+static void reset_entries(void)
+{
+	nr_entries = 0;
+	memset(entries, 0, sizeof(entries));
+	memset(tstat_hash_table, 0, sizeof(tstat_hash_table));
+	atomic_set(&overflow_count, 0);
+}
+
+static struct entry *alloc_entry(void)
+{
+	if (nr_entries >= MAX_ENTRIES)
+		return NULL;
+
+	return entries + nr_entries++;
+}
+
 static int match_entries(struct entry *entry1, struct entry *entry2)
 {
 	return entry1->timer       == entry2->timer	  &&
@@ -202,12 +203,15 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
 	if (curr) {
 		*curr = *entry;
 		curr->count = 0;
+		curr->next = NULL;
 		memcpy(curr->comm, comm, TASK_COMM_LEN);
+
+		smp_mb(); /* Ensure that curr is initialized before insert */
+
 		if (prev)
 			prev->next = curr;
 		else
 			*head = curr;
-		curr->next = NULL;
 	}
  out_unlock:
 	spin_unlock(&table_lock);
@@ -360,6 +364,7 @@ static ssize_t tstats_write(struct file *file, const char __user *buf,
 		if (!active) {
 			reset_entries();
 			time_start = ktime_get();
+			smb_mb();
 			active = 1;
 		}
 		break;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-31 15:27                 ` Björn Steinbrink
@ 2007-05-31 22:59                   ` Ian Kumlien
  2007-05-31 23:09                     ` Björn Steinbrink
  0 siblings, 1 reply; 25+ messages in thread
From: Ian Kumlien @ 2007-05-31 22:59 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Eric Dumazet, Thomas Gleixner, Michal Piotrowski, Ingo Molnar,
	Arjan van de Ven, linux-kernel

On tor, 2007-05-31 at 17:27 +0200, Björn Steinbrink wrote:
> On 2007.05.31 17:10:07 +0200, Eric Dumazet wrote:
> > Well... :) , there is still a memory barrier missing it seems.
> > 
> > Another cpu might see a bad value if 'active=1' is set before tstat_hash_table is really cleared.
> 
> Hm, that even makes the assumption in my first try valid ;-)
> Just for the record, this time I thought that the barrier from the
> spinlock in timer_stats_update_stats (right before the check for active)
> would be enough, but that's obviously running on the wrong cpu if we
> race... *sigh*
> 
> Thanks,
> Björn
> 
> 
> 
> Fix two races in the timer stats lookup code. One by ensuring that the
> initialization of a new entry is finished upon insertion of that entry.
> The other by cleaning up the hash table when the entries array is
> cleared, so that we don't have "pre-inserted" entries.
> 
> Thanks to Eric Dumazet for reminding me of the memory barriers.

Fix the comment below and you can add:
Signed-off-by: Ian Kumlien <pomac@vapor.com>

It's currently been running for the longest period ever, ie, 11 minutes
=)

I'm gonna leave it running during the night and send a status update
when the evil daystar reaches it's peak CET. (i haven't been able to
stop since linus mentioned it... damn it... =))

> Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
> ---

> @@ -360,6 +364,7 @@ static ssize_t tstats_write(struct file *file, const char __user *buf,
>  		if (!active) {
>  			reset_entries();
>  			time_start = ktime_get();
> +			smb_mb();

smb? you mean smp =)

>  			active = 1;
>  		}
>  		break;

-- 
Ian Kumlien <pomac () vapor ! com> -- http://pomac.netswarm.net

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-31 22:59                   ` Ian Kumlien
@ 2007-05-31 23:09                     ` Björn Steinbrink
  2007-06-01  6:21                       ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-31 23:09 UTC (permalink / raw)
  To: Ian Kumlien
  Cc: Eric Dumazet, Thomas Gleixner, Michal Piotrowski, Ingo Molnar,
	Arjan van de Ven, linux-kernel

On 2007.06.01 00:59:11 +0200, Ian Kumlien wrote:
> On tor, 2007-05-31 at 17:27 +0200, Björn Steinbrink wrote:
> > On 2007.05.31 17:10:07 +0200, Eric Dumazet wrote:
> > > Well... :) , there is still a memory barrier missing it seems.
> > > 
> > > Another cpu might see a bad value if 'active=1' is set before tstat_hash_table is really cleared.
> > 
> > Hm, that even makes the assumption in my first try valid ;-)
> > Just for the record, this time I thought that the barrier from the
> > spinlock in timer_stats_update_stats (right before the check for active)
> > would be enough, but that's obviously running on the wrong cpu if we
> > race... *sigh*
> > 
> 
> Fix the comment below and you can add:
> Signed-off-by: Ian Kumlien <pomac@vapor.com>
> 
> It's currently been running for the longest period ever, ie, 11 minutes
> =)

Finally! :-)

> I'm gonna leave it running during the night and send a status update
> when the evil daystar reaches it's peak CET. (i haven't been able to
> stop since linus mentioned it... damn it... =))
> 
> > @@ -360,6 +364,7 @@ static ssize_t tstats_write(struct file *file, const char __user *buf,
> >  		if (!active) {
> >  			reset_entries();
> >  			time_start = ktime_get();
> > +			smb_mb();
> 
> smb? you mean smp =)

No idea how I managed to break that after test compiling it... :-/
Thanks for catching it :-)

So unless I messed up something again, here's the final patch.

Thanks,
Björn


Fix two races in the timer stats lookup code. One by ensuring that the
initialization of a new entry is finished upon insertion of that entry.
The other by cleaning up the hash table when the entries array is
cleared, so that we don't have any "pre-inserted" entries.

Thanks to Eric Dumazet for reminding me of the memory barriers.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Ian Kumlien <pomac@vapor.com>
---
diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index 868f1bc..fa3d380 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -117,21 +117,6 @@ static struct entry entries[MAX_ENTRIES];
 
 static atomic_t overflow_count;
 
-static void reset_entries(void)
-{
-	nr_entries = 0;
-	memset(entries, 0, sizeof(entries));
-	atomic_set(&overflow_count, 0);
-}
-
-static struct entry *alloc_entry(void)
-{
-	if (nr_entries >= MAX_ENTRIES)
-		return NULL;
-
-	return entries + nr_entries++;
-}
-
 /*
  * The entries are in a hash-table, for fast lookup:
  */
@@ -149,6 +134,22 @@ static struct entry *alloc_entry(void)
 
 static struct entry *tstat_hash_table[TSTAT_HASH_SIZE] __read_mostly;
 
+static void reset_entries(void)
+{
+	nr_entries = 0;
+	memset(entries, 0, sizeof(entries));
+	memset(tstat_hash_table, 0, sizeof(tstat_hash_table));
+	atomic_set(&overflow_count, 0);
+}
+
+static struct entry *alloc_entry(void)
+{
+	if (nr_entries >= MAX_ENTRIES)
+		return NULL;
+
+	return entries + nr_entries++;
+}
+
 static int match_entries(struct entry *entry1, struct entry *entry2)
 {
 	return entry1->timer       == entry2->timer	  &&
@@ -202,12 +203,15 @@ static struct entry *tstat_lookup(struct entry *entry, char *comm)
 	if (curr) {
 		*curr = *entry;
 		curr->count = 0;
+		curr->next = NULL;
 		memcpy(curr->comm, comm, TASK_COMM_LEN);
+
+		smp_mb(); /* Ensure that curr is initialized before insert */
+
 		if (prev)
 			prev->next = curr;
 		else
 			*head = curr;
-		curr->next = NULL;
 	}
  out_unlock:
 	spin_unlock(&table_lock);
@@ -360,6 +364,7 @@ static ssize_t tstats_write(struct file *file, const char __user *buf,
 		if (!active) {
 			reset_entries();
 			time_start = ktime_get();
+			smp_mb();
 			active = 1;
 		}
 		break;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-30  0:13 ` Stephane Casset
@ 2007-05-31 23:53   ` Björn Steinbrink
  0 siblings, 0 replies; 25+ messages in thread
From: Björn Steinbrink @ 2007-05-31 23:53 UTC (permalink / raw)
  To: Stephane Casset; +Cc: linux-kernel

On 2007.05.30 02:13:50 +0200, Stephane Casset wrote:
> Le Tue, May 29, 2007 at 11:38:48PM +0200, Ian Kumlien écrivait :
> > Hi, 
> > 
> > As the daystar sets, i try to play some with my new would be
> > firewall/server, but since this will be running for quite some time i
> > have been experimenting with powertop to find out what i can do to limit
> > it's power usage.
> > 
> > But, if i run powertop for too long or a few times to many... this
> > happens:
> > http://pomac.netswarm.net/pics/kernel_panic.jpg
> > 
> > If i don't run powertop, it is rock solid... Compiling for hours,
> > running memtest for hours etc etc... 
> 
> Same here, P4 HT, if I run powertop for 5-10 minutes the system crash :(
> It happens with 2.6.22-rc{2,3} at least :( I didn't try before.
> 
> I don't have a oops since the last crash was in X, but next time I will
> try to catch it. If there is any patch around to fix this I am willing
> to test.

There is finally one that seems to work:
http://lkml.org/lkml/2007/5/31/394

I'd be glad if you could give it a try.

Thanks,
Björn

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [BUG] Something goes wrong with timer statistics.
  2007-05-31 23:09                     ` Björn Steinbrink
@ 2007-06-01  6:21                       ` Ingo Molnar
  0 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2007-06-01  6:21 UTC (permalink / raw)
  To: Björn Steinbrink, Ian Kumlien, Eric Dumazet, Thomas Gleixner,
	Michal Piotrowski, Arjan van de Ven, linux-kernel
  Cc: Arjan van de Ven


* Björn Steinbrink <B.Steinbrink@gmx.de> wrote:

> > Fix the comment below and you can add:
> > Signed-off-by: Ian Kumlien <pomac@vapor.com>
> > 
> > It's currently been running for the longest period ever, ie, 11 minutes
> > =)
> 
> Finally! :-)

cool! Great detective work Björn!

> Thanks to Eric Dumazet for reminding me of the memory barriers.
> 
> Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
> Signed-off-by: Ian Kumlien <pomac@vapor.com>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2007-06-01  6:23 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-29 21:38 [BUG] Something goes wrong with timer statistics Ian Kumlien
     [not found] ` <6bffcb0e0705291524y2752d646p94b0bf6ca87af68f@mail.gmail.com>
2007-05-29 22:35   ` Ian Kumlien
2007-05-29 22:38   ` Thomas Gleixner
2007-05-29 22:50     ` Ian Kumlien
2007-05-30  6:52     ` Björn Steinbrink
2007-05-30  7:10       ` Björn Steinbrink
2007-05-30 11:38     ` Björn Steinbrink
2007-05-30 12:44       ` Eric Dumazet
2007-05-30 13:14         ` Björn Steinbrink
     [not found]           ` <20070531102047.GG19272@pomac.netswarm.net>
2007-05-31 14:25             ` Björn Steinbrink
2007-05-31 15:10               ` Eric Dumazet
2007-05-31 15:27                 ` Björn Steinbrink
2007-05-31 22:59                   ` Ian Kumlien
2007-05-31 23:09                     ` Björn Steinbrink
2007-06-01  6:21                       ` Ingo Molnar
2007-05-29 22:37 ` David Miller
2007-05-29 22:42   ` Thomas Gleixner
2007-05-29 22:44     ` David Miller
2007-05-29 22:53       ` Thomas Gleixner
     [not found]   ` <6bffcb0e0705291541q7e6d6faen5cd6345074c4c92a@mail.gmail.com>
2007-05-29 22:44     ` David Miller
2007-05-29 22:51       ` Ian Kumlien
2007-05-29 23:03         ` David Miller
2007-05-30  0:04           ` Ian Kumlien
2007-05-30  0:13 ` Stephane Casset
2007-05-31 23:53   ` Björn Steinbrink

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox