Re: RFC: starting a kernel-testers group for newbies


* Re: RFC: starting a kernel-testers group for newbies
@ 2008-05-01 16:11 devzero
  2008-05-01 16:26 ` Kok, Auke
  0 siblings, 1 reply; 48+ messages in thread
From: devzero @ 2008-05-01 16:11 UTC (permalink / raw)
  To: linux-kernel

>I'll try to do this:
>- create some Wiki page
>- get a mailing list at vger
>- point newbies to this mailing list
>- tell people there which kernels to test
>- figure out and document stuff like how to bisect between -next kernels
>- help them to do whatever is required for a proper bug report

good idea :)

one more:
reported bugs sometimes get lost in lkml (or elsewhere) and may not get the attention they need.

maybe there are not enough people who put some steadiness into tracking bugs (i.e. put them in bugzilla and make sure they get tracked/resolved) ?

isn't that something for kernel-testers, too?

i'm quite sure that there are people among them who would enjoy helping to track bugs, even if they lack proper programming skills.
imho, it's not only a matter of knowledge, but maybe manpower, too.

so, wouldn't it be helpful if more people helped the kernel developers save time, e.g. by taking over the easier tasks like asking bug reporters for input, helping collect debug data, helping assign bugs to the right people, etc.?

List:       linux-kernel
Subject:    RFC: starting a kernel-testers group for newbies
From:       Adrian Bunk <bunk () kernel ! org>
Date:       2008-05-01 0:31:25
Message-ID: 20080501003125.GM29330 () cs181133002 ! pp ! htv ! fi

On Wed, Apr 30, 2008 at 01:31:08PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 30 Apr 2008, Andrew Morton wrote:
> > 
> > <jumps up and down>
> > 
> > There should be nothing in 2.6.x-rc1 which wasn't in 2.6.x-mm1!
> 
> The problem I see with both -mm and linux-next is that they tend to be 
> better at finding the "physical conflict" kind of issues (ie the merge 
> itself fails) than the "code looks ok but doesn't actually work" kind of 
> issue.
> 
> Why?
> 
> The tester base is simply too small.
> 
> Now, if *that* could be improved, that would be wonderful, but I'm not 
> seeing it as very likely.
> 
> I think we have fairly good penetration these days with the regular -git 
> tree, but I think that one is quite frankly a *lot* less scary than -mm or 
> -next are, and there it has been an absolutely huge boon to get the kernel 
> into the Fedora test-builds etc (and I _think_ Ubuntu and SuSE also 
> started something like that).
> 
> So I'm very pessimistic about getting a lot of test coverage before -rc1.
> 
> Maybe too pessimistic, who knows?

First of all:
I 100% agree with Andrew that our biggest problems are in reviewing code 
and resolving bugs, not in finding bugs (we already have far too many 
unresolved bugs).

But although testing mustn't replace code reviews it is a great help, 
especially for identifying regressions early.

Finding testers should actually be relatively easy since it doesn't 
require much knowledge from the testers.

And it could even solve a second problem:

It could be a way for getting newbies into kernel development.

We actually only rarely have tasks suitable as janitor tasks for 
newbies, and the results of people who know neither the kernel
nor C running checkpatch on files in the kernel have already
been discussed extensively...

I'll try to do this:
- create some Wiki page
- get a mailing list at vger
- point newbies to this mailing list
- tell people there which kernels to test
- figure out and document stuff like how to bisect between -next kernels
- help them to do whatever is required for a proper bug report

> 		Linus

cu
Adrian


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 16:11 RFC: starting a kernel-testers group for newbies devzero
@ 2008-05-01 16:26 ` Kok, Auke
  2008-05-01 17:12   ` Adrian Bunk
  0 siblings, 1 reply; 48+ messages in thread
From: Kok, Auke @ 2008-05-01 16:26 UTC (permalink / raw)
  To: devzero; +Cc: linux-kernel

devzero@web.de wrote:
>> I'll try to do this:
>> - create some Wiki page
>> - get a mailing list at vger
>> - point newbies to this mailing list
>> - tell people there which kernels to test
>> - figure out and document stuff like how to bisect between -next kernels
>> - help them to do whatever is required for a proper bug report
> 
> good idea :)
> 
> one more:
> reported bugs sometimes get lost in lkml (or elsewhere) and may not get the attention they need.
> 
> maybe there are not enough people who put some steadiness into tracking bugs (i.e. put them in bugzilla and make sure they get tracked/resolved) ?


I would say that this should be one of the main goals - have a team of people who
can consistently be assigned bugs for duplication and verification. Getting newly
opened bugs reproduced can be half the solution (and a great way to learn the
kernel on the side).

often the reporter of the bug never confirms that the bug is fixed - another place
where linux testers can help out...

Auke

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 16:26 ` Kok, Auke
@ 2008-05-01 17:12   ` Adrian Bunk
  0 siblings, 0 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-01 17:12 UTC (permalink / raw)
  To: Kok, Auke; +Cc: devzero, linux-kernel

On Thu, May 01, 2008 at 09:26:59AM -0700, Kok, Auke wrote:
> devzero@web.de wrote:
> >> I'll try to do this:
> >> - create some Wiki page
> >> - get a mailing list at vger
> >> - point newbies to this mailing list
> >> - tell people there which kernels to test
> >> - figure out and document stuff like how to bisect between -next kernels
> >> - help them to do whatever is required for a proper bug report
> > 
> > good idea :)
> > 
> > one more:
> > reported bugs sometimes get lost in lkml (or elsewhere) and may not get the attention they need.
> > 
> > maybe there are not enough people who put some steadiness into tracking bugs (i.e. put them in bugzilla and make sure they get tracked/resolved) ?
> 
> I would say that this should be one of the main goals - have a team of people who
> can consistently be assigned bugs for duplication and verification. Getting newly
> opened bugs reproduced can be half the solution (and a great way to learn the
> kernel on the side).
>...

That has already been working in the kernel Bugzilla for years
(with Andrew currently doing most of the work).

You might be in the lucky and unusual position that you have the 
hardware for reproducing most bugs in the drivers you maintain,
but that's not generally true.

Bugs in e1000 have a good chance of being resolved; if you want a really
bad example, think of e.g. a bug in some unmaintained ISA network driver.

We already have 1350 open bugs in the kernel Bugzilla, and the real 
problem is not how to track them but to find someone who resolves them.

> Auke

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
@ 2008-05-01 17:09 devzero
  2008-05-01 17:27 ` Steven Rostedt
  0 siblings, 1 reply; 48+ messages in thread
From: devzero @ 2008-05-01 17:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: rostedt

>We need to send them to a URL that lists all the known bugs and have them pick one,
>any one, and have them solve it. This would be the best way to learn part of the kernel.

what about adding some link to 

"http://bugzilla.kernel.org/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=&long_desc_type=substring&long_desc=&kernel_version_type=allwordssubstr&kernel_version=&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&bug_status=DEFERRED&bug_status=NEEDINFO&emailassigned_to1=1&emailtype1=substring&email1=&emailassigned_to2=1&emailreporter2=1&emailcc2=1&emailtype2=substring&email2=&bugidtype=include&bug_id=&chfieldfrom=&chfieldto=Now&chfieldvalue=&regression=both&cmdtype=doit&order=Bug+Number&field0-0-0=noop&type0-0-0=noop&value0-0-0="

on www.kernel.org or doing that via redirect from "http://bugs.kernel.org" ?

sorting of bugzilla search results could use some enhancement, btw.
for example, it seems i cannot sort by ID top/down or sort by date.

it's not obvious enough what bugs exist - it's all hidden in bugzilla and in lkml (and tons of other bug trackers, forums, mailing lists...).
furthermore, it's also not obvious that everyone is invited to work together with the kernel devs to solve the bugs.
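
The long search link above could also be generated rather than pasted. A minimal sketch using Python's standard library, assuming (not verified against bugzilla.kernel.org) that the empty fields in the original query can be dropped and that the open-bug statuses plus the sort order are the parts that matter:

```python
from urllib.parse import urlencode

BASE = "http://bugzilla.kernel.org/buglist.cgi"

# Statuses copied from the query string quoted above.
OPEN_STATUSES = ["NEW", "ASSIGNED", "REOPENED", "VERIFIED", "DEFERRED", "NEEDINFO"]


def open_bugs_url() -> str:
    """Build a shorter search URL for open bugs (assumption: empty fields are optional)."""
    params = [("query_format", "advanced")]
    params += [("bug_status", s) for s in OPEN_STATUSES]  # repeated key, one per status
    params += [("regression", "both"), ("order", "Bug Number")]
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(open_bugs_url())
```

A short URL like this would be easier to hide behind a redirect than the full query string.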

List:       linux-kernel
Subject:    Re: RFC: starting a kernel-testers group for newbies
From:       Steven Rostedt <rostedt () goodmis ! org>
Date:       2008-05-01 16:38:23
Message-ID: Pine.LNX.4.58.0805011217200.11101 () gandalf ! stny ! rr ! com

On Thu, 1 May 2008, Andrew Morton wrote:
>
> Arjan's fourth fallacy: "We don't make (effective) prioritization
> decisions." lol.  This implies that someone somewhere once sat down and
> wondered which bug he should most effectively work on.  Well, we don't do
> that.  We ignore _all_ the bugs in favour of busily writing new ones.

And actually, core kernel developers are best for writing new bugs.

Really, the way I started out learning how the kernel ticks was to go and
try to solve some bugs that I was seeing (this was years ago). I get
people asking that they want to learn to be a kernel developer and they
ask what new feature should they work on? Well, honestly, the last thing
a newbie kernel developer should be doing is writing new bugs. We need to
send them to a URL that lists all the known bugs and have them pick one,
any one, and have them solve it. This would be the best way to learn part
of the kernel.

I even find that I understand my own code better when I'm in the debugging
phase.

People here mention different places to look at code, and besides the
kerneloops.org I really don't even know where to look for bugs, because I
haven't seen a URL to point me to.

The next time someone asks me how to get started in kernel programming, I
would love to tell them to go and look here, and solve the bugs. I'm
guessing that I should just point them to:

  http://janitor.kernelnewbies.org/

and tell them to focus on real bugs (not just comments and such) to get
fixed if they really want to learn the kernel.

-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 17:09 devzero
@ 2008-05-01 17:27 ` Steven Rostedt
  0 siblings, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2008-05-01 17:27 UTC (permalink / raw)
  To: devzero
  Cc: LKML, Andrew Morton, Adrian Bunk, Arjan van de Ven,
	Linus Torvalds, Rafael J. Wysocki, davem, jirislaby


[ Please don't strip the CC list. It may be a month before those on the CC
  actually read your email ]

Just replying with the missing CC.

-- Steve


On Thu, 1 May 2008 devzero@web.de wrote:

> >We need to send them to a URL that lists all the known bugs and have them pick one,
> >any one, and have them solve it. This would be the best way to learn part of the kernel.
>
> what about adding some link to
>
> "http://bugzilla.kernel.org/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=&long_desc_type=substring&long_desc=&kernel_version_type=allwordssubstr&kernel_version=&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=VERIFIED&bug_status=DEFERRED&bug_status=NEEDINFO&emailassigned_to1=1&emailtype1=substring&email1=&emailassigned_to2=1&emailreporter2=1&emailcc2=1&emailtype2=substring&email2=&bugidtype=include&bug_id=&chfieldfrom=&chfieldto=Now&chfieldvalue=&regression=both&cmdtype=doit&order=Bug+Number&field0-0-0=noop&type0-0-0=noop&value0-0-0="
>
> on www.kernel.org or doing that via redirect from "http://bugs.kernel.org" ?
>
> sorting of bugzilla search results could use some enhancement, btw.
> for example, it seems i cannot sort by ID top/down or sort by date.
>
> it's not obvious enough what bugs exist - it's all hidden in bugzilla and in lkml (and tons of other bug trackers, forums, mailing lists...).
> furthermore, it's also not obvious that everyone is invited to work together with the kernel devs to solve the bugs.
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
@ 2008-05-01 16:36 devzero
  0 siblings, 0 replies; 48+ messages in thread
From: devzero @ 2008-05-01 16:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

>And why can't they work on the bug?  Usually, because they found a
>workaround.  People aren't going to spend months sitting in front of a
>non-functional computer waiting for kernel developers to decide if their
>machine is important enough to fix.  They will find a workaround.  They
>will buy new hardware.  They will discover "noapic" (234000 google hits and
>rising!).  They will swap it with a different machine.  They will switch to
>a different distro which for some reason doesn't trigger the bug.  They
>will use an older kernel.  They will switch to Solaris.  Etcetera.  People
>are clever - they will find a way to get around it.
>
>I figure that after a bug is reported we have maybe 24 to 48 hours to send
>a good response before our chances of _ever_ fixing it have begun to
>decline sharply due to the clever minds at the other end.

yes, this is absolutely true !

List:       linux-kernel
Subject:    Re: RFC: starting a kernel-testers group for newbies
From:       Andrew Morton <akpm () linux-foundation ! org>
Date:       2008-05-01 15:49:19
Message-ID: 20080501084919.8ac6dbdd.akpm () linux-foundation ! org

On Thu, 1 May 2008 16:21:59 +0300 Adrian Bunk <bunk@kernel.org> wrote:

> > > But our current status quo is not OK:
> > > 
> > > Check Rafael's regressions lists asking yourself
> > > "How many regressions are older than two weeks?" 
> > 
> > "ext4 doesn't compile on m68k".
> > YAWN.
> >  
> > Wrong question...
> > "How many bugs that a sizable portion of users will hit in reality are there?"
> > is the right question to ask...
> >...
> 
> "Kernel oops while running kernbench and tbench on powerpc" took more 
> than 2 months to get resolved, and we ship 2.6.25 with this regression.

Precisely.  Cherry-picking a single example such as the 68k thing and then
claiming that it reflects the general is known as a "fallacy".

> Granted that compared to x86 there's not a sizable portion of users 
> crazy enough to run Linux on powerpc machines...

Another fallacy which Arjan is pushing (even though he doesn't appear to
have realised it) is "all hardware is the same".

Well, it isn't.  And most of our bugs are hardware-specific.  So, I'd
venture, most of our bugs don't affect most people.  So, over time, by
Arjan's "important to enough people" observation we just get more and more
and more unfixed bugs.

And I believe this effect has been occurring.

And please stop regaling us with this kerneloops.org stuff.  It just isn't
very interesting, useful or representative when considering the whole
problem.  Very few kernel bugs result in a trace, and when they do they are
usually easy to fix and, because of this, they will get fixed, often
quickly.  I expect netdevwatchdogeth0transmittimedout.org would tell a
different story.

One thing which muddies all this up is that bug reporters vanish.  Over the
years I have sent thousands and thousands of ping emails to people who have
reported bugs via email, three to six months after the fact.  Some were
solved - maybe a fifth.  About the same proportion of reporters reply and
give some reason why they cannot work on the bug.  In the majority of cases
people don't reply at all and I suspect they're in the same category of
cannot-work-on-the-bug.

And why can't they work on the bug?  Usually, because they found a
workaround.  People aren't going to spend months sitting in front of a
non-functional computer waiting for kernel developers to decide if their
machine is important enough to fix.  They will find a workaround.  They
will buy new hardware.  They will discover "noapic" (234000 google hits and
rising!).  They will swap it with a different machine.  They will switch to
a different distro which for some reason doesn't trigger the bug.  They
will use an older kernel.  They will switch to Solaris.  Etcetera.  People
are clever - they will find a way to get around it.

I figure that after a bug is reported we have maybe 24 to 48 hours to send
a good response before our chances of _ever_ fixing it have begun to
decline sharply due to the clever minds at the other end.

Which leads us to Arjan's third fallacy:

   "How many bugs that a sizable portion of users will hit in reality
   are there?" is the right question to ask...

well no, it isn't.  Because approximately zero of the hardware bugs affect
a sizeable portion of users.  With this logic we will end up with more and
more and more and more bugs each of which affect a tiny number of users. 
Hundreds of different bugs.  You know where this process ends up.

Arjan's fourth fallacy: "We don't make (effective) prioritization
decisions." lol.  This implies that someone somewhere once sat down and
wondered which bug he should most effectively work on.  Well, we don't do
that.  We ignore _all_ the bugs in favour of busily writing new ones.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Slow DOWN, please!!!
@ 2008-04-30  2:03 David Miller
  2008-04-30 19:36 ` Rafael J. Wysocki
  0 siblings, 1 reply; 48+ messages in thread
From: David Miller @ 2008-04-30  2:03 UTC (permalink / raw)
  To: linux-kernel

This is starting to get beyond frustrating for me.

Yesterday, I spent the whole day bisecting boot failures
on my system due to the totally untested linux/bitops.h
optimization, which I fully analyzed and debugged.

Today, I had hoped that I could get some work done of my
own, but that's not the case.

Yet another bootup regression got added within the last 24
hours.

I don't mind fixing the regression or two during the merge
window but THIS IS ABSOLUTELY, FUCKING, RIDICULOUS!

The tree breaks every day, and it's becoming an extremely
non-fun environment to work in.

We need to slow down the merging, we need to review things
more, we need people to test their fucking changes!

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Slow DOWN, please!!!
  2008-04-30  2:03 Slow DOWN, please!!! David Miller
@ 2008-04-30 19:36 ` Rafael J. Wysocki
  2008-04-30 20:15   ` Andrew Morton
  0 siblings, 1 reply; 48+ messages in thread
From: Rafael J. Wysocki @ 2008-04-30 19:36 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, Andrew Morton, Linus Torvalds, Jiri Slaby

On Wednesday, 30 of April 2008, David Miller wrote:
> 
> This is starting to get beyond frustrating for me.
> 
> Yesterday, I spent the whole day bisecting boot failures
> on my system due to the totally untested linux/bitops.h
> optimization, which I fully analyzed and debugged.
> 
> Today, I had hoped that I could get some work done of my
> own, but that's not the case.
> 
> Yet another bootup regression got added within the last 24
> hours.
> 
> I don't mind fixing the regression or two during the merge
> window but THIS IS ABSOLUTELY, FUCKING, RIDICULOUS!
> 
> The tree breaks every day, and it's becoming an extremely
> non-fun environment to work in.
> 
> We need to slow down the merging, we need to review things
> more, we need people to test their fucking changes!

Well, I must say I second that.

I'm not seeing regressions myself this time (well, except for the one that
Jiri fixed), but I did find a few of them during the post-2.6.24 merge window
and I wouldn't like to repeat that experience, so to speak.

IMO, the merge window is way too short for actually testing anything.  I rebuild
the kernel once or even twice a day and there's no way I can really test it.
I can only check if it breaks right away.  And if it does, there's no time to
find out what broke it before the next few hundreds of commits land on top of
that.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Slow DOWN, please!!!
  2008-04-30 19:36 ` Rafael J. Wysocki
@ 2008-04-30 20:15   ` Andrew Morton
  2008-04-30 20:31     ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2008-04-30 20:15 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: davem, linux-kernel, torvalds, jirislaby

On Wed, 30 Apr 2008 21:36:57 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:

> IMO, the merge window is way too short for actually testing anything.

<jumps up and down>

There should be nothing in 2.6.x-rc1 which wasn't in 2.6.x-mm1!

_anything_ which appears in 2.6.x-rc1 and which wasn't in 2.6.x-mm1 was
snuck in too late (OK, apart from trivia and bugfixes).

If we decide that we need to fix the oh-shit-lets-slam-this-in-and-hope
problem then I expect we can do so, via fairly reliable means.

But the first attempt at solving it should be to ask people to not do that.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: Slow DOWN, please!!!
  2008-04-30 20:15   ` Andrew Morton
@ 2008-04-30 20:31     ` Linus Torvalds
  2008-05-01  0:31       ` RFC: starting a kernel-testers group for newbies Adrian Bunk
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2008-04-30 20:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rafael J. Wysocki, davem, linux-kernel, jirislaby

On Wed, 30 Apr 2008, Andrew Morton wrote:
> 
> <jumps up and down>
> 
> There should be nothing in 2.6.x-rc1 which wasn't in 2.6.x-mm1!

The problem I see with both -mm and linux-next is that they tend to be 
better at finding the "physical conflict" kind of issues (ie the merge 
itself fails) than the "code looks ok but doesn't actually work" kind of 
issue.

Why?

The tester base is simply too small.

Now, if *that* could be improved, that would be wonderful, but I'm not 
seeing it as very likely.

I think we have fairly good penetration these days with the regular -git 
tree, but I think that one is quite frankly a *lot* less scary than -mm or 
-next are, and there it has been an absolutely huge boon to get the kernel 
into the Fedora test-builds etc (and I _think_ Ubuntu and SuSE also 
started something like that).

So I'm very pessimistic about getting a lot of test coverage before -rc1.

Maybe too pessimistic, who knows?

		Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RFC: starting a kernel-testers group for newbies
  2008-04-30 20:31     ` Linus Torvalds
@ 2008-05-01  0:31       ` Adrian Bunk
  2008-04-30  7:03         ` Arjan van de Ven
  2008-05-01  0:41         ` David Miller
  0 siblings, 2 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-01  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Rafael J. Wysocki, davem, linux-kernel, jirislaby,
	Steven Rostedt

On Wed, Apr 30, 2008 at 01:31:08PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 30 Apr 2008, Andrew Morton wrote:
> > 
> > <jumps up and down>
> > 
> > There should be nothing in 2.6.x-rc1 which wasn't in 2.6.x-mm1!
> 
> The problem I see with both -mm and linux-next is that they tend to be 
> better at finding the "physical conflict" kind of issues (ie the merge 
> itself fails) than the "code looks ok but doesn't actually work" kind of 
> issue.
> 
> Why?
> 
> The tester base is simply too small.
> 
> Now, if *that* could be improved, that would be wonderful, but I'm not 
> seeing it as very likely.
> 
> I think we have fairly good penetration these days with the regular -git 
> tree, but I think that one is quite frankly a *lot* less scary than -mm or 
> -next are, and there it has been an absolutely huge boon to get the kernel 
> into the Fedora test-builds etc (and I _think_ Ubuntu and SuSE also 
> started something like that).
> 
> So I'm very pessimistic about getting a lot of test coverage before -rc1.
> 
> Maybe too pessimistic, who knows?

First of all:
I 100% agree with Andrew that our biggest problems are in reviewing code 
and resolving bugs, not in finding bugs (we already have far too many 
unresolved bugs).

But although testing mustn't replace code reviews it is a great help, 
especially for identifying regressions early.

Finding testers should actually be relatively easy since it doesn't 
require much knowledge from the testers.

And it could even solve a second problem:

It could be a way for getting newbies into kernel development.

We actually only rarely have tasks suitable as janitor tasks for 
newbies, and the results of people who know neither the kernel
nor C running checkpatch on files in the kernel have already
been discussed extensively...

I'll try to do this:
- create some Wiki page
- get a mailing list at vger
- point newbies to this mailing list
- tell people there which kernels to test
- figure out and document stuff like how to bisect between -next kernels
- help them to do whatever is required for a proper bug report
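
For the bisecting item, a minimal sketch of the search that `git bisect` automates may help testers see why it needs only a handful of builds. This is an illustration in Python, not a replacement for the real tool: the snapshot names and the `is_bad` callback are hypothetical, and it assumes the snapshots form one ordered history in which every snapshot after the first bad one is also bad (which may not hold for -next trees).

```python
from typing import Callable, Sequence


def first_bad(snapshots: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad snapshot, assuming snapshots[0] is good,
    snapshots[-1] is bad, and everything after the first bad one is bad."""
    good, bad = 0, len(snapshots) - 1  # indices of a known-good and a known-bad snapshot
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(snapshots[mid]):  # in practice: build, boot and test this snapshot
            bad = mid
        else:
            good = mid
    return snapshots[bad]


if __name__ == "__main__":
    # Hypothetical daily snapshots; pretend the regression arrived in "day-11".
    days = [f"day-{n:02d}" for n in range(1, 17)]
    tested = []

    def is_bad(name: str) -> bool:
        tested.append(name)
        return name >= "day-11"

    print(first_bad(days, is_bad), "found after", len(tested), "tests")
```

Each test halves the remaining range, so the number of kernels a tester has to build grows only with the logarithm of the number of snapshots.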

> 		Linus

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01  0:31       ` RFC: starting a kernel-testers group for newbies Adrian Bunk
@ 2008-04-30  7:03         ` Arjan van de Ven
  2008-05-01  8:13           ` Andrew Morton
  2008-05-01 11:30           ` Adrian Bunk
  2008-05-01  0:41         ` David Miller
  1 sibling, 2 replies; 48+ messages in thread
From: Arjan van de Ven @ 2008-04-30  7:03 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Andrew Morton, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Thu, 1 May 2008 03:31:25 +0300
Adrian Bunk <bunk@kernel.org> wrote:

> On Wed, Apr 30, 2008 at 01:31:08PM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Wed, 30 Apr 2008, Andrew Morton wrote:
> > > 
> > > <jumps up and down>
> > > 
> > > There should be nothing in 2.6.x-rc1 which wasn't in 2.6.x-mm1!
> > 
> > The problem I see with both -mm and linux-next is that they tend to
> > be better at finding the "physical conflict" kind of issues (ie the
> > merge itself fails) than the "code looks ok but doesn't actually
> > work" kind of issue.
> > 
> > Why?
> > 
> > The tester base is simply too small.
> > 
> > Now, if *that* could be improved, that would be wonderful, but I'm
> > not seeing it as very likely.
> > 
> > I think we have fairly good penetration these days with the regular
> > -git tree, but I think that one is quite frankly a *lot* less scary
> > than -mm or -next are, and there it has been an absolutely huge
> > boon to get the kernel into the Fedora test-builds etc (and I
> > _think_ Ubuntu and SuSE also started something like that).
> > 
> > So I'm very pessimistic about getting a lot of test coverage before
> > -rc1.
> > 
> > Maybe too pessimistic, who knows?
> 
> First of all:
> I 100% agree with Andrew that our biggest problems are in reviewing
> code and resolving bugs, not in finding bugs (we already have far too
> many unresolved bugs).

I would argue instead that we don't know which bugs to fix first.
We're never going to fix all bugs, and to be honest, that's ok.
As long as we fix the important bugs, we're doing really well.
And at least for the kerneloops.org reported issues, we're doing quite ok.

For me, 'important' is a combination of the effect of the bug and the number of people
it'll hit. By that measure, a compiler warning on parisc is less important than easy-to-trigger
filesystem corruption in ext3; more people will hit the latter and the effect is more grave.

For oopses and WARN_ON()s we're getting the hang of this now with kerneloops.org,
at least for the oopses that aren't really hard fatal. One thing I learned at least is that
lkml is a poor representation of what people actually hit; it's a very very selective
audience.
oopses/warnons are only a subset of the bugs of course... but still.

So there's a few things we (and you / janitors) can do over time to get better data on what issues
people hit:
1) Get automated collection of issues more widespread. The wider our net, the better we know which
   issues get hit a lot, and simply the more data we have on when things start, when they stop, etc.
   Especially if you get a lot of testers in your project, I'd like them to install the client for easy reporting
   of issues.
2) We should add more WARN_ON()s on "known bad" conditions. If it WARN_ON()s, we can learn about it via
   the automated collection. And we can then do the statistics to figure out which ones happen a lot.
3) We need to get persistent-across-reboot oops saving going; there are some avenues for this.
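
The "do the statistics" step in item 2 can be as simple as counting how often each warning signature is reported. A rough sketch in Python; the report format and the signature strings are made up for illustration and are not how kerneloops.org stores its data:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_signatures(reports: Iterable[str], top: int = 3) -> List[Tuple[str, int]]:
    """Count identical signatures and return the most frequent ones first."""
    return Counter(reports).most_common(top)


if __name__ == "__main__":
    # Hypothetical signatures, one entry per submitted report.
    sample = ["warn_a", "oops_b", "warn_a", "warn_c", "warn_a", "oops_b"]
    for signature, hits in rank_signatures(sample):
        print(f"{hits:3d}  {signature}")
```

The wider the collection net, the more meaningful such a ranking becomes.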

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-04-30  7:03         ` Arjan van de Ven
@ 2008-05-01  8:13           ` Andrew Morton
  2008-04-30 14:15             ` Arjan van de Ven
  2008-05-01  9:16             ` Frans Pop
  2008-05-01 11:30           ` Adrian Bunk
  1 sibling, 2 replies; 48+ messages in thread
From: Andrew Morton @ 2008-05-01  8:13 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Adrian Bunk, Linus Torvalds, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Wed, 30 Apr 2008 00:03:38 -0700 Arjan van de Ven <arjan@infradead.org> wrote:

> > First of all:
> > I 100% agree with Andrew that our biggest problems are in reviewing
> > code and resolving bugs, not in finding bugs (we already have far too
> > many unresolved bugs).
> 
> I would argue instead that we don't know which bugs to fix first.

<boggle>

How about "a bug which we just added"?  One which is repeatable. 
Repeatable by a tester who is prepared to work with us on resolving it. 
Those bugs.

Rafael has a list of them.  We release kernels when that list still has tens of
unfixed regressions dating back up to a couple of months.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01  8:13           ` Andrew Morton
@ 2008-04-30 14:15             ` Arjan van de Ven
  2008-05-01 12:42               ` David Woodhouse
  2008-05-04 12:45               ` Rene Herman
  2008-05-01  9:16             ` Frans Pop
  1 sibling, 2 replies; 48+ messages in thread
From: Arjan van de Ven @ 2008-04-30 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adrian Bunk, Linus Torvalds, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Thu, 1 May 2008 01:13:46 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 30 Apr 2008 00:03:38 -0700 Arjan van de Ven
> <arjan@infradead.org> wrote:
> 
> > > First of all:
> > > I 100% agree with Andrew that our biggest problems are in
> > > reviewing code and resolving bugs, not in finding bugs (we
> > > already have far too many unresolved bugs).
> > 
> > I would argue instead that we don't know which bugs to fix first.
> 
> <boggle>
> 
> How about "a bug which we just added"?  One which is repeatable. 
> Repeatable by a tester who is prepared to work with us on resolving
> it. Those bugs.
> 
> Rafael has a list of them.  We release kernels when that list still
> has tens of unfixed regressions dating back up to a couple of months.
> 

I know he does. But I will still argue that if that is all we work from, and treat
all of those equally, we're doing the wrong thing.
I'm sorry, but I really do not consider "ext4 doesn't compile on m68k", which is
on that list, to be as relevant as an "i915 drm driver crashes" bug which has been among
us for a while and is not on that list, just based on the total user base for either of those.

Does that mean nobody should fix the m68k bug?
Someone who cares about m68k for sure should work on it, or if it's easy for an ext4 developer,
sure. But if the ext4 person has to spend 8 hours on it figuring cross compilers, I say 
we're doing something very wrong here. (no offense to the m68k people, but there's just
a few of you; maybe I should have picked voyager instead)

Maybe that's a "boggle" for you; but for me that's symptomatic of where we are today:
We don't make (effective) prioritization decisions. Such decisions are hard, because it 
effectively means telling people "I'm sorry but your bug is not yet important". That's
unpopular, especially if the reporter is very motivated on lkml. And it will involve a 
certain amount of non-quantifiable judgement calls, which also means we won't always be
right. Another hard thing is that lkml is a very self-selective audience. A bug may be 
reported three times there, but never hit otherwise, while another bug might not be reported
at all (or only once) while thousands and thousands of people are hitting it.

Not that we're doing all that bad, we ARE fixing the bugs (at least the oopses/warnings) that
are frequently hit. So I wouldn't blindly say we're doing a bad job at prioritizing. I would
rather say that if we focus only on what is left afterwards without doing a reality check,
we'll *always* have a negative view of quality, since there will *always* be bugs we don't 
fix. Linux has well over ten million users (much more if you count embedded devices).
A lot of them will have "standard" hardware, and a bunch of them will have "weird" stuff.
Cosmic rays happen. As do overclocking and bad DIMMs. And some BIOSes are just weird etc etc.
If we do not prioritize effectively we'll be stuck forever chasing ghosts, or we'll be stuck
saying "our quality sucks" forever without making progress.

Another trap is to only look at what goes wrong, not at what goes right... we tend to only
see what goes wrong on lkml and it's an easy trap to fall into doomthinking that way.
Are we doing worse on quality? My (subjective) opinion is that we are doing better than last year.
We are focused more on quality. We are fixing the bugs that people hit most. We are fixing most
of the regressions (yes, not all). Subsystems are seeing flat or lower bugcounts/bugrates. Take ACPI, 
the number of outstanding bugs *halved* over the last year. Of course you can pick a single 
bug and say "but this one did not get fixed", but that just loses the big picture (and 
proves the point :). All of this with a growing userbase and a rate of development that's a bit
faster than last year as well.

Can we do better? Always. More testing will help. Both to detect things early, and by 
letting us figure out which bugs are important. Just saying "more testing is not relevant
because we're not even fixing the bugs we have now" is just incorrect. Sorry.
More testers help. A wider range of hardware/usages allows us to find better patterns
in the hard to track down bugs. More testers means more people willing to see if they
can diagnose the bugs at least somewhat themselves, via bisection or otherwise. That's important,
because that's the part of the problem that scales well with a growing userbase.
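
For readers new to bisection, the idea can be sketched in a few lines of
Python. This is only a model: it assumes a linear, ordered list of kernel
versions where the first is known good and the last is known bad, and a
hypothetical is_bad() callback standing in for "build, boot and test this
version". In practice git bisect does this bookkeeping on real history.

```python
def find_first_bad(versions, is_bad):
    """Return the first version for which is_bad() is true.

    Assumes versions[0] is good, versions[-1] is bad, and that every
    version after the first bad one is bad as well.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(versions[middle]):
            bad = middle      # the regression is at middle or earlier
        else:
            good = middle     # the regression came in after middle
    return versions[bad]


if __name__ == "__main__":
    # Ten made-up snapshots; pretend the bug was introduced in "v6".
    snapshots = ["v%d" % n for n in range(10)]
    tested = []

    def is_bad(version):
        tested.append(version)
        return int(version[1:]) >= 6

    print(find_first_bad(snapshots, is_bad), "after testing", tested)
```

Each test halves the remaining range, so the number of kernels a tester has
to build and boot grows only with the logarithm of the number of candidates.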

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-04-30 14:15             ` Arjan van de Ven
@ 2008-05-01 12:42               ` David Woodhouse
  2008-04-30 15:02                 ` Arjan van de Ven
  2008-05-05 10:03                 ` Benny Halevy
  2008-05-04 12:45               ` Rene Herman
  1 sibling, 2 replies; 48+ messages in thread
From: David Woodhouse @ 2008-05-01 12:42 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Adrian Bunk, Linus Torvalds, Rafael J. Wysocki,
	davem, linux-kernel, jirislaby, Steven Rostedt

On Wed, 2008-04-30 at 07:15 -0700, Arjan van de Ven wrote:
> Maybe that's a "boggle" for you; but for me that's symptomatic of
> where we are today: We don't make (effective) prioritization
> decisions. Such decisions are hard, because it effectively means
> telling people "I'm sorry but your bug is not yet important". 

It's not that clear-cut, either. Something which manifests itself as a
build failure or an immediate test failure on m68k alone, might actually
turn out to cause subtle data corruption on other platforms.

You can't always know that it isn't important, just because it only
shows up in some esoteric circumstances. You only really know how
important it was _after_ you've fixed it.

That obviously doesn't help us to prioritise.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 12:42               ` David Woodhouse
@ 2008-04-30 15:02                 ` Arjan van de Ven
  2008-05-05 10:03                 ` Benny Halevy
  1 sibling, 0 replies; 48+ messages in thread
From: Arjan van de Ven @ 2008-04-30 15:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Andrew Morton, Adrian Bunk, Linus Torvalds, Rafael J. Wysocki,
	davem, linux-kernel, jirislaby, Steven Rostedt

On Thu, 01 May 2008 13:42:44 +0100
David Woodhouse <dwmw2@infradead.org> wrote:

> On Wed, 2008-04-30 at 07:15 -0700, Arjan van de Ven wrote:
> > Maybe that's a "boggle" for you; but for me that's symptomatic of
> > where we are today: We don't make (effective) prioritization
> > decisions. Such decisions are hard, because it effectively means
> > telling people "I'm sorry but your bug is not yet important". 
> 
> It's not that clear-cut, either. Something which manifests itself as a
> build failure or an immediate test failure on m68k alone, might
> actually turn out to cause subtle data corruption on other platforms.
> 
> You can't always know that it isn't important, just because it only
> shows up in some esoteric circumstances. You only really know how
> important it was _after_ you've fixed it.
> 
> That obviously doesn't help us to prioritise.

absolutely. I'm not going to argue that prioritization is easy. Or 
that we'll be able to get it right all the time.
Doesn't mean we shouldn't try at least somewhat..

> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 12:42               ` David Woodhouse
  2008-04-30 15:02                 ` Arjan van de Ven
@ 2008-05-05 10:03                 ` Benny Halevy
  1 sibling, 0 replies; 48+ messages in thread
From: Benny Halevy @ 2008-05-05 10:03 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Arjan van de Ven, Andrew Morton, Adrian Bunk, Linus Torvalds,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

On May. 01, 2008, 15:42 +0300, David Woodhouse <dwmw2@infradead.org> wrote:
> On Wed, 2008-04-30 at 07:15 -0700, Arjan van de Ven wrote:
>> Maybe that's a "boggle" for you; but for me that's symptomatic of
>> where we are today: We don't make (effective) prioritization
>> decisions. Such decisions are hard, because it effectively means
>> telling people "I'm sorry but your bug is not yet important". 
> 
> It's not that clear-cut, either. Something which manifests itself as a
> build failure or an immediate test failure on m68k alone, might actually
> turn out to cause subtle data corruption on other platforms.
> 
> You can't always know that it isn't important, just because it only
> shows up in some esoteric circumstances. You only really know how
> important it was _after_ you've fixed it.
> 
> That obviously doesn't help us to prioritise.
> 

Ideally, you'd do an analysis first and then prioritize, based
on the severity of the bug, its exposure, how easy it is to fix,
etc.  If while doing that you already have a fix at hand, you're
almost done :)
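
One possible way to turn those criteria into an ordering, purely as a
sketch: the Bug fields, the 1..5 severity scale, the formula and the sample
numbers below are all made up for illustration, and a real triage would
still need judgement calls.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    title: str
    severity: int      # hypothetical scale: 1 (cosmetic) .. 5 (data loss)
    exposure: int      # rough number of users expected to hit it
    fix_hours: float   # estimated effort needed for a fix


def priority(bug):
    """Higher is more urgent: severe, widely hit and cheap to fix."""
    # Clamp the effort so a near-zero estimate cannot dominate the score.
    return bug.severity * bug.exposure / max(bug.fix_hours, 0.5)


def prioritize(bugs):
    """Return the bugs ordered from most to least urgent."""
    return sorted(bugs, key=priority, reverse=True)


if __name__ == "__main__":
    # Invented entries, not taken from any real bug list.
    bugs = [
        Bug("build failure on a rare platform", 2, 50, 8.0),
        Bug("crash in a widely used driver", 4, 10000, 16.0),
        Bug("wrong value in a status file", 3, 200, 1.0),
    ]
    for bug in prioritize(bugs):
        print("%8.1f  %s" % (priority(bug), bug.title))
```

With these invented numbers the widely hit crash comes first and the
rare-platform build failure last, even though the latter is cheaper to fix.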

Recursively, there's the problem of which bugs you analyze first.
I'm inclined to say that you want to analyze most if not all bug reports
at a higher priority than working on fixing non-critical bugs.

Benny

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-04-30 14:15             ` Arjan van de Ven
  2008-05-01 12:42               ` David Woodhouse
@ 2008-05-04 12:45               ` Rene Herman
  2008-05-04 13:00                 ` Pekka Enberg
  1 sibling, 1 reply; 48+ messages in thread
From: Rene Herman @ 2008-05-04 12:45 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Adrian Bunk, Linus Torvalds, Rafael J. Wysocki,
	davem, linux-kernel, jirislaby, Steven Rostedt

On 30-04-08 16:15, Arjan van de Ven wrote:

> Does that mean nobody should fix the m68k bug? Someone who cares about
> m68k for sure should work on it, or if it's easy for an ext4 developer, 
> sure. But if the ext4 person has to spend 8 hours on it figuring cross
> compilers, I say we're doing something very wrong here. (no offense to
> the m68k people, but there's just a few of you; maybe I should have
> picked voyager instead)

On that note, I'd really like to see better binary availability of cross 
compilers. While it's improved over the last few years mostly due to the 
crossgcc stuff it's still a pain. Ideally, they would be available through 
the distribution package manager even but failing that some dedicated place 
on kernel.org with x86->lots and some of the more widely used other 
combinations would quite definitely be good. Perhaps not really directly 
relevant to this thread as such, but still good.

Andrew maintain{s,ed} a number of them at

http://userweb.kernel.org/~akpm/cross-compilers/

But as you see, most of the stuff there is really old again...

Rene

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-04 12:45               ` Rene Herman
@ 2008-05-04 13:00                 ` Pekka Enberg
  2008-05-04 13:19                   ` Rene Herman
  0 siblings, 1 reply; 48+ messages in thread
From: Pekka Enberg @ 2008-05-04 13:00 UTC (permalink / raw)
  To: Rene Herman
  Cc: Arjan van de Ven, Andrew Morton, Adrian Bunk, Linus Torvalds,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt,
	Vegard Nossum

On Sun, May 4, 2008 at 3:45 PM, Rene Herman <rene.herman@keyaccess.nl> wrote:
>  On that note, I'd really like to see better binary availability of cross
> compilers. While it's improved over the last few years mostly due to the
> crossgcc stuff it's still a pain. Ideally, they would be available through
> the distribution package manager even but failing that some dedicated place
> on kernel.org with x86->lots and some of the more widely used other
> combinations would quite definitely be good. Perhaps not really directly
> relevant to this thread as such, but still good.
>
>  Andrew maintain{s,ed} a number of them at
>
>  http://userweb.kernel.org/~akpm/cross-compilers/
>
>  But as you see, most of the stuff there is really old again...

You're most welcome to help out Vegard to do this:

http://www.kernel.org/pub/tools/crosstool/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-04 13:00                 ` Pekka Enberg
@ 2008-05-04 13:19                   ` Rene Herman
  0 siblings, 0 replies; 48+ messages in thread
From: Rene Herman @ 2008-05-04 13:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Arjan van de Ven, Andrew Morton, Adrian Bunk, Linus Torvalds,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt,
	Vegard Nossum

On 04-05-08 15:00, Pekka Enberg wrote:

> On Sun, May 4, 2008 at 3:45 PM, Rene Herman <rene.herman@keyaccess.nl> wrote:

>>  On that note, I'd really like to see better binary availability of cross
>> compilers. While it's improved over the last few years mostly due to the
>> crossgcc stuff it's still a pain. Ideally, they would be available through
>> the distribution package manager even but failing that some dedicated place
>> on kernel.org with x86->lots and some of the more widely used other
>> combinations would quite definitely be good. Perhaps not really directly
>> relevant to this thread as such, but still good.
>>
>>  Andrew maintain{s,ed} a number of them at
>>
>>  http://userweb.kernel.org/~akpm/cross-compilers/
>>
>>  But as you see, most of the stuff there is really old again...
> 
> You're most welcome to help out Vegard to do this:
> 
> http://www.kernel.org/pub/tools/crosstool/

Ah, thanks, lovely, just new I see (and yes, I meant s/grossgcc/crosstool/). 
Good thing. I'll check it out and see if there's anything to add.

Rene.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01  8:13           ` Andrew Morton
  2008-04-30 14:15             ` Arjan van de Ven
@ 2008-05-01  9:16             ` Frans Pop
  2008-05-01 10:30               ` Enrico Weigelt
  1 sibling, 1 reply; 48+ messages in thread
From: Frans Pop @ 2008-05-01  9:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: arjan, bunk, torvalds, rjw, davem, linux-kernel, jirislaby,
	rostedt

Andrew Morton wrote:
> On Wed, 30 Apr 2008 00:03:38 -0700 Arjan van de Ven <arjan@infradead.org>
> wrote:
>> I would argue instead that we don't know which bugs to fix first.
> 
> How about "a bug which we just added"?

And leave unfixed all the regressions introduced in earlier kernel versions 
and known at the time of the release of that version but still present in 
the current version? Not to mention all the other bugs reported by users of 
recent stable versions?

> One which is repeatable. 
> Repeatable by a tester who is prepared to work with us on resolving it.

That can be true for not-so-recently introduced bugs too.

There are so many bugs out there and developers tend to focus on new ones 
leaving a lot of others unattended, both important and not so important 
ones.

Which ones should someone focus on? Maybe on the ones that someone (helped) 
introduce him/herself. Maybe that should even sometimes be prioritized over 
introducing new bugs^W^W^Wdoing new development.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01  9:16             ` Frans Pop
@ 2008-05-01 10:30               ` Enrico Weigelt
  2008-05-01 13:02                 ` Adrian Bunk
  0 siblings, 1 reply; 48+ messages in thread
From: Enrico Weigelt @ 2008-05-01 10:30 UTC (permalink / raw)
  To: linux kernel list

<big_snip />

Hi folks,

what do you think about Gentoo's "bug-wrangler" concept ?
Maybe we could do something similar:

A tester group (which e.g. should be the entry point for newbies)
is responsible for receiving bug reports from users (maybe even
distro maintainers who're not directly involved in kernel dev.).
They try to reproduce the bugs and find out as much as they can,
then file a report to the actual kernel devs (only critical bugs
are kicked directly to the devs with high priority). Maybe this
group could also keep users informed about fixes and give some
upgrade advice, etc.

This way we can build good technical support (independent
from distributors ;-P), newbies can learn on the job and the
load on kernel devs is reduced, so they can better concentrate
on their core competences.

What do you think about this ?

cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 10:30               ` Enrico Weigelt
@ 2008-05-01 13:02                 ` Adrian Bunk
  0 siblings, 0 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-01 13:02 UTC (permalink / raw)
  To: Enrico Weigelt; +Cc: linux kernel list

On Thu, May 01, 2008 at 12:30:00PM +0200, Enrico Weigelt wrote:
> 
> <big_snip />
> 
> Hi folks,
> 
> 
> what do you think about Gentoo's "bug-wrangler" concept ?
> Maybe we could do something similar:
> 
> A tester group (which e.g. should be the entry point for newbies)
> is responsible for receiving bug reports from users (maybe even
> distro maintainers who're not directly involved in kernel dev.).
> They try to reproduce the bugs and find out as much as they can,
> then file a report to the actual kernel devs (only critical bugs
> are kicked directly to the devs with high priority). Maybe this
> group could also keep users informed about fixes and give some
> upgrade advice, etc.
> 
> This way we can build good technical support (independent
> from distributors ;-P), newbies can learn on the job and the
> load on kernel devs is reduced, so they can better concentrate
> on their core competences.
> 
> What do you think about this ?

Andrew already does more or less this.

The problems are:
- kernel bugs tend to very quickly reach the state where you need expert
  knowledge in some area, and there's definitely not much room for
  newbies in bug handling
- "try to reproduce the bugs" works for much software, but in the 
  kernel bugs often tend to depend on some specific hardware

> cu

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-04-30  7:03         ` Arjan van de Ven
  2008-05-01  8:13           ` Andrew Morton
@ 2008-05-01 11:30           ` Adrian Bunk
  2008-04-30 14:20             ` Arjan van de Ven
  1 sibling, 1 reply; 48+ messages in thread
From: Adrian Bunk @ 2008-05-01 11:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Andrew Morton, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Wed, Apr 30, 2008 at 12:03:38AM -0700, Arjan van de Ven wrote:
> On Thu, 1 May 2008 03:31:25 +0300
> Adrian Bunk <bunk@kernel.org> wrote:
> 
> > On Wed, Apr 30, 2008 at 01:31:08PM -0700, Linus Torvalds wrote:
> > > 
> > > 
> > > On Wed, 30 Apr 2008, Andrew Morton wrote:
> > > > 
> > > > <jumps up and down>
> > > > 
> > > > There should be nothing in 2.6.x-rc1 which wasn't in 2.6.x-mm1!
> > > 
> > > The problem I see with both -mm and linux-next is that they tend to
> > > be better at finding the "physical conflict" kind of issues (ie the
> > > merge itself fails) than the "code looks ok but doesn't actually
> > > work" kind of issue.
> > > 
> > > Why?
> > > 
> > > The tester base is simply too small.
> > > 
> > > Now, if *that* could be improved, that would be wonderful, but I'm
> > > not seeing it as very likely.
> > > 
> > > I think we have fairly good penetration these days with the regular
> > > -git tree, but I think that one is quite frankly a *lot* less scary
> > > than -mm or -next are, and there it has been an absolutely huge
> > > boon to get the kernel into the Fedora test-builds etc (and I
> > > _think_ Ubuntu and SuSE also started something like that).
> > > 
> > > So I'm very pessimistic about getting a lot of test coverage before
> > > -rc1.
> > > 
> > > Maybe too pessimistic, who knows?
> > 
> > First of all:
> > I 100% agree with Andrew that our biggest problems are in reviewing
> > code and resolving bugs, not in finding bugs (we already have far too
> > many unresolved bugs).
> 
> I would argue instead that we don't know which bugs to fix first.
> We're never going to fix all bugs, and to be honest, that's ok.
>...

That might be OK.

But our current status quo is not OK:

Check Rafael's regressions lists asking yourself
"How many regressions are older than two weeks?" 

The kernel Bugzilla currently knows about 212 open regression bugs.
(And many more have not made it into Bugzilla.)

We have unmaintained and de facto unmaintained parts of the kernel where 
even issues that might be easy to fix don't get fixed.

>...
> So there's a few things we (and you / janitors) can do over time to get better data on what issues
> people hit: 
> 1) Get automated collection of issues more wide spread. The wider our net the better we know which
>    issues get hit a lot, and plain the more data we have on when things start, when they stop, etc etc.
>    Especially if you get a lot of testers in your project, I'd like them to install the client for easy reporting
>    of issues.
> 2) We should add more WARN_ON()s on "known bad" conditions. If it WARN_ON()'s, we can learn about it via
>    the automated collection. And we can then do the statistics to figure out which ones happen a lot.
> 3) We need to get persistent-across-reboot oops saving going; there's some venues for this

No disagreement on this, it's just a different issue than our bug fixing 
problem.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 11:30           ` Adrian Bunk
@ 2008-04-30 14:20             ` Arjan van de Ven
  2008-05-01 12:53               ` Rafael J. Wysocki
  2008-05-01 13:21               ` Adrian Bunk
  0 siblings, 2 replies; 48+ messages in thread
From: Arjan van de Ven @ 2008-04-30 14:20 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Linus Torvalds, Andrew Morton, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Thu, 1 May 2008 14:30:38 +0300
Adrian Bunk <bunk@kernel.org> wrote:

> On Wed, Apr 30, 2008 at 12:03:38AM -0700, Arjan van de Ven wrote:
> > On Thu, 1 May 2008 03:31:25 +0300
> > Adrian Bunk <bunk@kernel.org> wrote:
> > 
> > > On Wed, Apr 30, 2008 at 01:31:08PM -0700, Linus Torvalds wrote:
> > > > 
> > > > 
> > > > On Wed, 30 Apr 2008, Andrew Morton wrote:
> > > > > 
> > > > > <jumps up and down>
> > > > > 
> > > > > There should be nothing in 2.6.x-rc1 which wasn't in
> > > > > 2.6.x-mm1!
> > > > 
> > > > The problem I see with both -mm and linux-next is that they
> > > > tend to be better at finding the "physical conflict" kind of
> > > > issues (ie the merge itself fails) than the "code looks ok but
> > > > doesn't actually work" kind of issue.
> > > > 
> > > > Why?
> > > > 
> > > > The tester base is simply too small.
> > > > 
> > > > Now, if *that* could be improved, that would be wonderful, but
> > > > I'm not seeing it as very likely.
> > > > 
> > > > I think we have fairly good penetration these days with the
> > > > regular -git tree, but I think that one is quite frankly a
> > > > *lot* less scary than -mm or -next are, and there it has been
> > > > an absolutely huge boon to get the kernel into the Fedora
> > > > test-builds etc (and I _think_ Ubuntu and SuSE also started
> > > > something like that).
> > > > 
> > > > So I'm very pessimistic about getting a lot of test coverage
> > > > before -rc1.
> > > > 
> > > > Maybe too pessimistic, who knows?
> > > 
> > > First of all:
> > > I 100% agree with Andrew that our biggest problems are in
> > > reviewing code and resolving bugs, not in finding bugs (we
> > > already have far too many unresolved bugs).
> > 
> > I would argue instead that we don't know which bugs to fix first.
> > We're never going to fix all bugs, and to be honest, that's ok.
> >...
> 
> That might be OK.
> 
> But our current status quo is not OK:
> 
> Check Rafael's regressions lists asking yourself
> "How many regressions are older than two weeks?" 

"ext4 doesn't compile on m68k".
YAWN.

Wrong question...
"How many bugs that a sizable portion of users will hit in reality are there?"
is the right question to ask...


> 
> We have unmaintained and de facto unmaintained parts of the kernel
> where even issues that might be easy to fix don't get fixed.

And how many people are hitting those issues? If a part of the kernel is really
important to enough people, there tends to be someone who stands up to either fix
the issue or start de-facto maintaining that part.
And yes I know there's parts where that doesn't hold. But to be honest, there's
not that many of them that have active development (and thus get the biggest
share of regressions)

> 
> >...
> > So there's a few things we (and you / janitors) can do over time to
> > get better data on what issues people hit: 
> > 1) Get automated collection of issues more wide spread. The wider
> > our net the better we know which issues get hit a lot, and plain
> > the more data we have on when things start, when they stop, etc
> > etc. Especially if you get a lot of testers in your project, I'd
> > like them to install the client for easy reporting of issues. 2) We
> > should add more WARN_ON()s on "known bad" conditions. If it
> > WARN_ON()'s, we can learn about it via the automated collection.
> > And we can then do the statistics to figure out which ones happen a
> > lot. 3) We need to get persistent-across-reboot oops saving going;
> > there's some venues for this
> 
> No disagreement on this, its just a different issue than our bug
> fixing problem.

No it's not! Knowing earlier and better which bugs get hit is NOT different
to our bug fixing "problem", it's in fact an essential part to the solution of it!

> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-04-30 14:20             ` Arjan van de Ven
@ 2008-05-01 12:53               ` Rafael J. Wysocki
  2008-05-01 13:21               ` Adrian Bunk
  1 sibling, 0 replies; 48+ messages in thread
From: Rafael J. Wysocki @ 2008-05-01 12:53 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Adrian Bunk, Linus Torvalds, Andrew Morton, davem, linux-kernel,
	jirislaby, Steven Rostedt

On Wednesday, 30 of April 2008, Arjan van de Ven wrote:
> On Thu, 1 May 2008 14:30:38 +0300
> Adrian Bunk <bunk@kernel.org> wrote:
> 
> > On Wed, Apr 30, 2008 at 12:03:38AM -0700, Arjan van de Ven wrote:
> > > On Thu, 1 May 2008 03:31:25 +0300
> > > Adrian Bunk <bunk@kernel.org> wrote:
> > > 
> > > > On Wed, Apr 30, 2008 at 01:31:08PM -0700, Linus Torvalds wrote:
> > > > > 
> > > > > 
> > > > > On Wed, 30 Apr 2008, Andrew Morton wrote:
> > > > > > 
> > > > > > <jumps up and down>
> > > > > > 
> > > > > > There should be nothing in 2.6.x-rc1 which wasn't in
> > > > > > 2.6.x-mm1!
> > > > > 
> > > > > The problem I see with both -mm and linux-next is that they
> > > > > tend to be better at finding the "physical conflict" kind of
> > > > > issues (ie the merge itself fails) than the "code looks ok but
> > > > > doesn't actually work" kind of issue.
> > > > > 
> > > > > Why?
> > > > > 
> > > > > The tester base is simply too small.
> > > > > 
> > > > > Now, if *that* could be improved, that would be wonderful, but
> > > > > I'm not seeing it as very likely.
> > > > > 
> > > > > I think we have fairly good penetration these days with the
> > > > > regular -git tree, but I think that one is quite frankly a
> > > > > *lot* less scary than -mm or -next are, and there it has been
> > > > > an absolutely huge boon to get the kernel into the Fedora
> > > > > test-builds etc (and I _think_ Ubuntu and SuSE also started
> > > > > something like that).
> > > > > 
> > > > > So I'm very pessimistic about getting a lot of test coverage
> > > > > before -rc1.
> > > > > 
> > > > > Maybe too pessimistic, who knows?
> > > > 
> > > > First of all:
> > > > I 100% agree with Andrew that our biggest problems are in
> > > > reviewing code and resolving bugs, not in finding bugs (we
> > > > already have far too many unresolved bugs).
> > > 
> > > I would argue instead that we don't know which bugs to fix first.
> > > We're never going to fix all bugs, and to be honest, that's ok.
> > >...
> > 
> > That might be OK.
> > 
> > But our current status quo is not OK:
> > 
> > Check Rafael's regressions lists asking yourself
> > "How many regressions are older than two weeks?" 
> 
> "ext4 doesn't compile on m68k".
> YAWN.
> 
> Wrong question...
> "How many bugs that a sizable portion of users will hit in reality are there?"
> is the right question to ask...
> 
> 
> > 
> > We have unmaintained and de facto unmaintained parts of the kernel
> > where even issues that might be easy to fix don't get fixed.
> 
> And how many people are hitting those issues? If a part of the kernel is really
> important to enough people, there tends to be someone who stands up to either fix
> the issue or start de-facto maintaining that part.
> And yes I know there's parts where that doesn't hold. But to be honest, there's
> not that many of them that have active development (and thus get the biggest
> share of regressions)
> 
> > 
> > >...
> > > So there's a few things we (and you / janitors) can do over time to
> > > get better data on what issues people hit: 
> > > 1) Get automated collection of issues more wide spread. The wider
> > > our net the better we know which issues get hit a lot, and plain
> > > the more data we have on when things start, when they stop, etc
> > > etc. Especially if you get a lot of testers in your project, I'd
> > > like them to install the client for easy reporting of issues. 2) We
> > > should add more WARN_ON()s on "known bad" conditions. If it
> > > WARN_ON()'s, we can learn about it via the automated collection.
> > > And we can then do the statistics to figure out which ones happen a
> > > lot. 3) We need to get persistent-across-reboot oops saving going;
> > > there's some venues for this
> > 
> > No disagreement on this, its just a different issue than our bug
> > fixing problem.
> 
> No it's not! Knowing earlier and better which bugs get hit is NOT different
> to our bug fixing "problem", it's in fact an essential part to the solution of it!

Agreed.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-04-30 14:20             ` Arjan van de Ven
  2008-05-01 12:53               ` Rafael J. Wysocki
@ 2008-05-01 13:21               ` Adrian Bunk
  2008-05-01 15:49                 ` Andrew Morton
  2008-05-02  2:08                 ` Paul Mackerras
  1 sibling, 2 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-01 13:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Andrew Morton, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Wed, Apr 30, 2008 at 07:20:13AM -0700, Arjan van de Ven wrote:
> On Thu, 1 May 2008 14:30:38 +0300
> Adrian Bunk <bunk@kernel.org> wrote:
> 
> > On Wed, Apr 30, 2008 at 12:03:38AM -0700, Arjan van de Ven wrote:
> > > On Thu, 1 May 2008 03:31:25 +0300
> > > Adrian Bunk <bunk@kernel.org> wrote:
> > > 
> > > > On Wed, Apr 30, 2008 at 01:31:08PM -0700, Linus Torvalds wrote:
> > > > > 
> > > > > 
> > > > > On Wed, 30 Apr 2008, Andrew Morton wrote:
> > > > > > 
> > > > > > <jumps up and down>
> > > > > > 
> > > > > > There should be nothing in 2.6.x-rc1 which wasn't in
> > > > > > 2.6.x-mm1!
> > > > > 
> > > > > The problem I see with both -mm and linux-next is that they
> > > > > tend to be better at finding the "physical conflict" kind of
> > > > > issues (ie the merge itself fails) than the "code looks ok but
> > > > > doesn't actually work" kind of issue.
> > > > > 
> > > > > Why?
> > > > > 
> > > > > The tester base is simply too small.
> > > > > 
> > > > > Now, if *that* could be improved, that would be wonderful, but
> > > > > I'm not seeing it as very likely.
> > > > > 
> > > > > I think we have fairly good penetration these days with the
> > > > > regular -git tree, but I think that one is quite frankly a
> > > > > *lot* less scary than -mm or -next are, and there it has been
> > > > > an absolutely huge boon to get the kernel into the Fedora
> > > > > test-builds etc (and I _think_ Ubuntu and SuSE also started
> > > > > something like that).
> > > > > 
> > > > > So I'm very pessimistic about getting a lot of test coverage
> > > > > before -rc1.
> > > > > 
> > > > > Maybe too pessimistic, who knows?
> > > > 
> > > > First of all:
> > > > I 100% agree with Andrew that our biggest problems are in
> > > > reviewing code and resolving bugs, not in finding bugs (we
> > > > already have far too many unresolved bugs).
> > > 
> > > I would argue instead that we don't know which bugs to fix first.
> > > We're never going to fix all bugs, and to be honest, that's ok.
> > >...
> > 
> > That might be OK.
> > 
> > But our current status quo is not OK:
> > 
> > Check Rafael's regressions lists asking yourself
> > "How many regressions are older than two weeks?" 
> 
> "ext4 doesn't compile on m68k".
> YAWN.
>  
> Wrong question...
> "How many bugs that a sizable portion of users will hit in reality are there?"
> is the right question to ask...
>...

"Kernel oops while running kernbench and tbench on powerpc" took more 
than 2 months to get resolved, and we ship 2.6.25 with this regression.

Granted that compared to x86 there's not a sizable portion of users 
crazy enough to run Linux on powerpc machines...

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 13:21               ` Adrian Bunk
@ 2008-05-01 15:49                 ` Andrew Morton
  2008-05-01  1:13                   ` Arjan van de Ven
                                     ` (2 more replies)
  2008-05-02  2:08                 ` Paul Mackerras
  1 sibling, 3 replies; 48+ messages in thread
From: Andrew Morton @ 2008-05-01 15:49 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Arjan van de Ven, Linus Torvalds, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Thu, 1 May 2008 16:21:59 +0300 Adrian Bunk <bunk@kernel.org> wrote:

> > > But our current status quo is not OK:
> > > 
> > > Check Rafael's regressions lists asking yourself
> > > "How many regressions are older than two weeks?" 
> > 
> > "ext4 doesn't compile on m68k".
> > YAWN.
> >  
> > Wrong question...
> > "How many bugs that a sizable portion of users will hit in reality are there?"
> > is the right question to ask...
> >...
> 
> "Kernel oops while running kernbench and tbench on powerpc" took more 
> than 2 months to get resolved, and we ship 2.6.25 with this regression.

Precisely.  Cherry-picking a single example such as the 68k thing and then
claiming that it reflects the general is known as a "fallacy".

> Granted that compared to x86 there's not a sizable portion of users 
> crazy enough to run Linux on powerpc machines...

Another fallacy which Arjan is pushing (even though he doesn't appear to
have realised it) is "all hardware is the same".

Well, it isn't.  And most of our bugs are hardware-specific.  So, I'd
venture, most of our bugs don't affect most people.  So, over time, by
Arjan's "important to enough people" observation we just get more and more
and more unfixed bugs.

And I believe this effect has been occurring.

And please stop regaling us with this kerneloops.org stuff.  It just isn't
very interesting, useful or representative when considering the whole
problem.  Very few kernel bugs result in a trace, and when they do they are
usually easy to fix and, because of this, they will get fixed, often
quickly.  I expect netdevwatchdogeth0transmittimedout.org would tell a
different story.

One thing which muddies all this up is that bug reporters vanish.  Over the
years I have sent thousands and thousands of ping emails to people who have
reported bugs via email, three to six months after the fact.  Some were
solved - maybe a fifth.  About the same proportion of reporters reply and
give some reason why they cannot work on the bug.  In the majority of cases
people don't reply at all and I suspect they're in the same category of
cannot-work-on-the-bug.

And why can't they work on the bug?  Usually, because they found a
workaround.  People aren't going to spend months sitting in front of a
non-functional computer waiting for kernel developers to decide if their
machine is important enough to fix.  They will find a workaround.  They
will buy new hardware.  They will discover "noapic" (234000 google hits and
rising!).  They will swap it with a different machine.  They will switch to
a different distro which for some reason doesn't trigger the bug.  They
will use an older kernel.  They will switch to Solaris.  Etcetera.  People
are clever - they will find a way to get around it.

I figure that after a bug is reported we have maybe 24 to 48 hours to send
a good response before our chances of _ever_ fixing it have begun to
decline sharply due to the clever minds at the other end.

Which leads us to Arjan's third fallacy:

   "How many bugs that a sizable portion of users will hit in reality
   are there?" is the right question to ask...

well no, it isn't.  Because approximately zero of the hardware bugs affect
a sizeable portion of users.  With this logic we will end up with more and
more and more and more bugs each of which affect a tiny number of users. 
Hundreds of different bugs.  You know where this process ends up.

Arjan's fourth fallacy: "We don't make (effective) prioritization
decisions." lol.  This implies that someone somewhere once sat down and
wondered which bug he should most effectively work on.  Well, we don't do
that.  We ignore _all_ the bugs in favour of busily writing new ones.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 15:49                 ` Andrew Morton
@ 2008-05-01  1:13                   ` Arjan van de Ven
  2008-05-02  9:00                     ` Adrian Bunk
  2008-05-01 16:38                   ` Steven Rostedt
  2008-05-01 17:24                   ` Theodore Tso
  2 siblings, 1 reply; 48+ messages in thread
From: Arjan van de Ven @ 2008-05-01  1:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adrian Bunk, Linus Torvalds, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Thu, 1 May 2008 08:49:19 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > Granted that compared to x86 there's not a sizable portion of users 
> > crazy enough to run Linux on powerpc machines...
> 
> Another fallacy which Arjan is pushing (even though he doesn't appear
> to have realised it) is "all hardware is the same".

no I'm pushing "some classes of hardware are much more popular/relevant
than others".


 
> Well, it isn't.  And most of our bugs are hardware-specific.  So, I'd
> venture, most of our bugs don't affect most people.  So, over time, by
> Arjan's "important to enough people" observation we just get more and
> more and more unfixed bugs.

I did not say "most people". I believe "most people" aren't hitting
bugs right now (or there would be a lot more screaming).
What I do believe is that *within the bugs that hit*, even the hardware
specific ones, there's a clear prioritization by how many people hit
the bug (or have the hardware in general).

> 
> And I believe this effect has been occurring.
> 

> And please stop regaling us with this kerneloops.org stuff.  It just
> isn't very interesting, useful or representative when considering the
> whole problem.  Very few kernel bugs result in a trace, and when they
> do they are usually easy to fix and, because of this, they will get
> fixed, often quickly.  I expect
> netdevwatchdogeth0transmittimedout.org would tell a different story.

now that's a fallacy of your own.. if you care about that one, it's 1)
trivial to track and/or 2) could contain a WARN_ON_ONCE(), at which
point it's automatically tracked. (and more useful information I
suspect, since it suddenly has a full backtrace including driver info
in it)
By your argument we should work hard to make sure we're better at
creating traces for cases where we detect that something goes wrong.
(I would not argue against that fwiw)
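
To make the WARN_ON_ONCE() suggestion concrete, here is a rough userspace
sketch of the "warn once per call site" idea. It is not kernel code and not
the kernel's implementation: MY_WARN_ON_ONCE, warn_on_once(), tx_timed_out()
and the warnings_reported counter are all hypothetical names made up for
illustration, and the counter merely stands in for whatever automated
collector would pick up the trace.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an automated collector: counts emitted warnings. */
static int warnings_reported;

/*
 * Report a bad condition the first time it is seen at a given call site.
 * The caller supplies the per-site "already warned" flag. The truth value
 * of the condition is returned so the caller can still act on it.
 */
static bool warn_on_once(bool cond, bool *warned,
                         const char *file, int line, const char *expr)
{
    if (cond && !*warned) {
        *warned = true;
        warnings_reported++;
        fprintf(stderr, "WARNING: at %s:%d: %s\n", file, line, expr);
    }
    return cond;
}

/* Illustrative macro, loosely modelled on the kernel's WARN_ON_ONCE(). */
#define MY_WARN_ON_ONCE(cond, warned_flag) \
    warn_on_once((cond), &(warned_flag), __FILE__, __LINE__, #cond)

/*
 * Hypothetical "transmit timed out" check: returns true when the last
 * transmit is older than the timeout, and warns only on the first hit.
 */
static bool tx_timed_out(long now, long last_tx, long timeout)
{
    static bool warned; /* one flag for this call site */

    return MY_WARN_ON_ONCE(now - last_tx > timeout, warned);
}
```

The point of the "once" part is that a condition which fires on every
watchdog tick produces a single report per site instead of flooding the log,
while the return value still lets the caller handle the condition each time.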

> I figure that after a bug is reported we have maybe 24 to 48 hours to
> send a good response before our chances of _ever_ fixing it have
> begun to decline sharply due to the clever minds at the other end.
> 
> Which leads us to Arjan's third fallacy:
> 
>    "How many bugs that a sizable portion of users will hit in reality
>    are there?" is the right question to ask...
> 
> well no, it isn't.  Because approximately zero of the hardware bugs

if it's a hardware bug there's little we can do.
If it's a hardware specific bug, yeah then it becomes a function of how
popular that hardware is.

> affect a sizeable portion of users.  With this logic we will end up
> with more and more and more and more bugs each of which affect a tiny
> number of users. Hundreds of different bugs.  You know where this
> process ends up.

Given that a normal PC has maybe 10 components... 
yes we don't want bugcreep that affects common hardware over time.
At the same time, by your argument, a bug that hits a piece of hardware
of which 5 are made (or left on this planet) is equally important to
a bug in something that 
> 
> Arjan's fourth fallacy: "We don't make (effective) prioritization
> decisions." lol.  This implies that someone somewhere once sat down
> and wondered which bug he should most effectively work on.  Well, we
> don't do that.  We ignore _all_ the bugs in favour of busily writing
> new ones

This statement is so ridiculous and so contradictory to what you
said before that I'm not even going to respond to it.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01  1:13                   ` Arjan van de Ven
@ 2008-05-02  9:00                     ` Adrian Bunk
  0 siblings, 0 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-02  9:00 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Linus Torvalds, Rafael J. Wysocki, davem,
	linux-kernel, jirislaby, Steven Rostedt

On Wed, Apr 30, 2008 at 06:13:38PM -0700, Arjan van de Ven wrote:
> On Thu, 1 May 2008 08:49:19 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > Granted that compared to x86 there's not a sizable portion of users 
> > > crazy enough to run Linux on powerpc machines...
> > 
> > Another fallacy which Arjan is pushing (even though he doesn't appear
> > to have realised it) is "all hardware is the same".
> 
> no I'm pushing "some classes of hardware are much more popular/relevant
> than others".

"popular/relevant" is hard to define.

E.g. if we'd go after "popular" we should only keep architectures like 
ARM and x86 and ditch architectures like ia64 and s390 that have puny 
userbases.

And how would you define "relevant"?

> > Well, it isn't.  And most of our bugs are hardware-specific.  So, I'd
> > venture, most of our bugs don't affect most people.  So, over time, by
> > Arjan's "important to enough people" observation we just get more and
> > more and more unfixed bugs.
> 
> I did not say "most people". I believe "most people" aren't hitting
> bugs right now (or there would be a lot more screaming).
> What I do believe is that *within the bugs that hit*, even the hardware
> specific ones, there's a clear prioritization by how many people hit
> the bug (or have the hardware in general).

If your "or have the hardware in general" is meant seriously you have to
convince people that ARM must become a very high priority.

No matter whether one supports your "there's a clear prioritization" 
view or not it anyway doesn't currently work since the areas covered by 
people testing -rc kernels don't even remotely map the most popular 
hardware in the field.

> > And I believe this effect has been occurring.
> 
> > And please stop regaling us with this kerneloops.org stuff.  It just
> > isn't very interesting, useful or representative when considering the
> > whole problem.  Very few kernel bugs result in a trace, and when they
> > do they are usually easy to fix and, because of this, they will get
> > fixed, often quickly.  I expect
> > netdevwatchdogeth0transmittimedout.org would tell a different story.
> 
> now that's a fallacy of your own.. if you care about that one, it's 1)
> trivial to track and/or 2) could contain a WARN_ON_ONCE(), at which
> point it's automatically tracked. (and more useful information I
> suspect, since it suddenly has a full backtrace including driver info
> in it)
> By your argument we should work hard to make sure we're better at
> creating traces for cases we detect something goes wrong.
> (I would not argue against that fwiw)
>...

kerneloops.org catches the easiest to solve bugs (there's a trace) and 
helps in getting them fixed.

That's a very good thing.

And if we get more bugs into this easy to resolve state that would be 
even better.

But it's only a small part of the complete picture of incoming bug 
reports.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 15:49                 ` Andrew Morton
  2008-05-01  1:13                   ` Arjan van de Ven
@ 2008-05-01 16:38                   ` Steven Rostedt
  2008-05-01 17:18                     ` Andrew Morton
  2008-05-01 17:24                   ` Theodore Tso
  2 siblings, 1 reply; 48+ messages in thread
From: Steven Rostedt @ 2008-05-01 16:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adrian Bunk, Arjan van de Ven, Linus Torvalds, Rafael J. Wysocki,
	davem, linux-kernel, jirislaby

On Thu, 1 May 2008, Andrew Morton wrote:
>
> Arjan's fourth fallacy: "We don't make (effective) prioritization
> decisions." lol.  This implies that someone somewhere once sat down and
> wondered which bug he should most effectively work on.  Well, we don't do
> that.  We ignore _all_ the bugs in favour of busily writing new ones.

And actually, core kernel developers are best for writing new bugs.

Really, the way I started out learning how the kernel ticks was to go and
try to solve some bugs that I was seeing (this was years ago). I get
people telling me that they want to learn to be a kernel developer and
asking what new feature they should work on. Well, honestly, the last thing
a newbie kernel developer should be doing is writing new bugs. We need to
send them to a URL that lists all the known bugs and have them pick one,
any one, and have them solve it. This would be the best way to learn part
of the kernel.

I even find that I understand my own code better when I'm in the debugging
phase.

People here mention different places to look at code, and besides the
kerneloops.org I really don't even know where to look for bugs, because I
haven't seen a URL to point me to.

The next time someone asks me how to get started in kernel programming, I
would love to tell them to go and look here, and solve the bugs. I'm
guessing that I should just point them to:

  http://janitor.kernelnewbies.org/

and tell them to focus on real bugs (not just comments and such) to get
fixed if they really want to learn the kernel.

-- Steve

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 16:38                   ` Steven Rostedt
@ 2008-05-01 17:18                     ` Andrew Morton
  0 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2008-05-01 17:18 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: bunk, arjan, torvalds, rjw, davem, linux-kernel, jirislaby

On Thu, 1 May 2008 12:38:23 -0400 (EDT)
Steven Rostedt <rostedt@goodmis.org> wrote:

> People here mention different places to look at code, and besides the
> kerneloops.org I really don't even know where to look for bugs, because I
> haven't seen a URL to point me to.

bugzilla.kernel.org is, umm, improving.

It would be an interesting exercise for someone to spend a few days seeing
how many of the bugzilla reports they personally can reproduce.  I'd guess
"zero".  There's a lesson in that.

The problem with bugzilla will be that it will be hard to find reports
where the reporter will be able to work with you on the fix - we've let
them go cold.

The most fruitful place to find fixable bugs is linux-kernel.  People who
report bugs there are sufficiently motivated to have actually sent the
email and the bug is still recent, so they probably haven't done the
Solaris install yet.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 15:49                 ` Andrew Morton
  2008-05-01  1:13                   ` Arjan van de Ven
  2008-05-01 16:38                   ` Steven Rostedt
@ 2008-05-01 17:24                   ` Theodore Tso
  2008-05-01 19:26                     ` Andrew Morton
  2 siblings, 1 reply; 48+ messages in thread
From: Theodore Tso @ 2008-05-01 17:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Adrian Bunk, Arjan van de Ven, Linus Torvalds, Rafael J. Wysocki,
	davem, linux-kernel, jirislaby, Steven Rostedt

On Thu, May 01, 2008 at 08:49:19AM -0700, Andrew Morton wrote:
> Another fallacy which Arjan is pushing (even though he doesn't appear to
> have realised it) is "all hardware is the same".
> 
> Well, it isn't.  And most of our bugs are hardware-specific.  So, I'd
> venture, most of our bugs don't affect most people.  So, over time, by
> Arjan's "important to enough people" observation we just get more and more
> and more unfixed bugs.
> 
> And I believe this effect has been occurring.

So the question is if we have a thousand bugs which only affect one
person each, and 70 million Linux users, how much should we beat up
ourselves that 1,000 people can't use a particular version of the
Linux kernel, versus the 99.9% of the people for which the kernel
works just fine?

Sometimes, we can't make everyone happy.

At the recent Linux Collaboration Summit, we had a local user walk up
to a microphone, and loosely paraphrased, said, "WHINE WHINE WHINE
WHINE I have a $30 DVD drive that doesn't work with Linux.  WHINE
WHINE WHINE WHINE WHINE What are *you* going to do to fix my problem?"

Some people like James responded very diplomatically, with "Well, you
have to understand, the developer might not have your hardware, and
there's a lot of broken hardware out there, etc., etc."  What I wanted to tell
this user was, "Ask not what the Linux development community can do
for you.  Ask what *you* can do for Linux?"  Suppose this person had
filed a kernel bugzilla bug, and it was one of the hundreds or
thousands of non-handled bugs.  Sure, it's a tragedy that bugs pile
up.  But if they pile up because of crappy hardware, that's not a
major tragedy.  If we can figure out how to blacklist it, and move on,
we should do so.  

> And why can't they work on the bug?  Usually, because they found a
> workaround.  People aren't going to spend months sitting in front of a
> non-functional computer waiting for kernel developers to decide if their
> machine is important enough to fix.  They will find a workaround.  They
> will buy new hardware.

Hey, in this particular case, if this user worked around the problem
by buying new hardware, it was probably the right solution.  As far as
we know we don't have a systematic problem where huge numbers of DVD
drives aren't working, so if there are a few oddball ones that are
out there, we just CAN'T flagellate ourselves over not
fixing all bugs, and letting some bugs pile up.

> Which leads us to Arjan's third fallacy:
> 
>    "How many bugs that a sizable portion of users will hit in reality
>    are there?" is the right question to ask...
> 
> well no, it isn't.  Because approximately zero of the hardware bugs affect
> a sizeable portion of users.  With this logic we will end up with more and
> more and more and more bugs each of which affect a tiny number of users. 
> Hundreds of different bugs.  You know where this process ends up.

... and maybe we can't solve hardware bugs.  Or maybe crappy hardware
isn't worth holding back Linux development.  And I'm not sure ignoring
it is that horrible of a thing.  And in practice, if it's a hardware
bug in something which is very common, it *will* get noticed very
quickly and fixed.  But if it's a hardware bug in some rare piece
of hardware, the user is going to have to either (a) help us fix it,
or (b) decide that his time is more valuable and that buying another
$30 DVD drive might be a better use of his and our time.

Back when I was the serial driver maintainer, I certainly made those
kinds of triage decisions.  I knew the serial driver was working for
the vast majority of Linux users, because if it broke in a major
way, I would hear about it in spades and get lots and lots of hate
mail.  And there were plenty of crappy ISA boards out there; and I
would help them out when I could, and sometimes spend more volunteer
time helping them by changing one or two outb() to outb_p()'s (yes,
that really made a difference; remember, we're talking about crappy PC
class hardware with hardware bugs), but at the end of the day, past a
certain point, even with a willing and cooperative end-user, I would
have to call it a day, and give up, and tell them to get another
serial card.  (And back in the days of ISA boards, we couldn't even
use blacklists.)

And you know what?  Linux didn't collapse into a steaming pile of dung
when I did that.  We're all volunteers, and we need to recognize there
are limits to what we can do --- otherwise, it will be way too easy to
burn out and become a bitter shell of a maintainer....

Even BSD fan boys will realize that in BSD land, you have to do even
more of this; if there's random broken hardware, or simply a lack of a
device driver, very often your only recourse is to work around the
problem by buying another serial card, or wifi card, or whatever.  And
this happens much more with BSD than Linux, simply because they
support fewer devices to begin with.

					- Ted

P.S.  We should really try to categorize bugs so we can figure out
what percentage of the bugs are device driver bugs, and what
percentage are core kernel bugs, which are "if you stress the system
too badly" sort of bugs, or "if you do something bad like yank the USB
stick without unmounting the filesystem first" sort of thing.  I think
if we did this, the numbers wouldn't look quite so scary, because
things like device driver problems with weird sh*t bugs are not
comparable with core functionality bugs in the SLUB allocator, for
example.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 17:24                   ` Theodore Tso
@ 2008-05-01 19:26                     ` Andrew Morton
  2008-05-01 19:39                       ` Steven Rostedt
  2008-05-02 10:23                       ` Andi Kleen
  0 siblings, 2 replies; 48+ messages in thread
From: Andrew Morton @ 2008-05-01 19:26 UTC (permalink / raw)
  To: Theodore Tso
  Cc: bunk, arjan, torvalds, rjw, davem, linux-kernel, jirislaby,
	rostedt

On Thu, 1 May 2008 13:24:34 -0400
Theodore Tso <tytso@MIT.EDU> wrote:

> ... and maybe we can't solve hardware bugs. 

Many, many of these are regressions.  If old-linux works on that
hardware then new-linux can too.

(still wants to know what we did 2-3 years ago which caused thousands of
people to have to resort to using noapic and other apic-related boot option
workarounds)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 19:26                     ` Andrew Morton
@ 2008-05-01 19:39                       ` Steven Rostedt
  2008-05-02 10:23                       ` Andi Kleen
  1 sibling, 0 replies; 48+ messages in thread
From: Steven Rostedt @ 2008-05-01 19:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, bunk, arjan, torvalds, rjw, davem, linux-kernel,
	jirislaby


On Thu, 1 May 2008, Andrew Morton wrote:

> On Thu, 1 May 2008 13:24:34 -0400
> Theodore Tso <tytso@MIT.EDU> wrote:
>
> > ... and maybe we can't solve hardware bugs.
>
> Many, many of these are regressions.  If old-linux works on that
> hardware then new-linux can too.
>
> (still wants to know what we did 2-3 years ago which caused thousands of
> people to have to resort to using noapic and other apic-related boot option
> workarounds)

Perhaps 2-3 years ago more people started using more hardware that
implements APIC. ;-)

-- Steve


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 19:26                     ` Andrew Morton
  2008-05-01 19:39                       ` Steven Rostedt
@ 2008-05-02 10:23                       ` Andi Kleen
  1 sibling, 0 replies; 48+ messages in thread
From: Andi Kleen @ 2008-05-02 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, bunk, arjan, torvalds, rjw, davem, linux-kernel,
	jirislaby, rostedt

Andrew Morton <akpm@linux-foundation.org> writes:
>
> (still wants to know what we did 2-3 years ago which caused thousands of
> people to have to resort to using noapic and other apic-related boot option
> workarounds)

Forcing APIC even when the BIOS didn't support it.

-Andi



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01 13:21               ` Adrian Bunk
  2008-05-01 15:49                 ` Andrew Morton
@ 2008-05-02  2:08                 ` Paul Mackerras
  2008-05-02  3:10                   ` Josh Boyer
  1 sibling, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2008-05-02  2:08 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Arjan van de Ven, Linus Torvalds, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

Adrian Bunk writes:

> "Kernel oops while running kernbench and tbench on powerpc" took more 
> than 2 months to get resolved, and we ship 2.6.25 with this regression.

That was a very subtle bug that only showed up on one particular
powerpc machine.  I was not able to replicate it on any of the powerpc
machines I have here.  Nevertheless, we found it and we have a fix for
it.  I think that's an example of the process working. :)

Paul.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02  2:08                 ` Paul Mackerras
@ 2008-05-02  3:10                   ` Josh Boyer
  2008-05-02  4:09                     ` Paul Mackerras
  0 siblings, 1 reply; 48+ messages in thread
From: Josh Boyer @ 2008-05-02  3:10 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Adrian Bunk, Arjan van de Ven, Linus Torvalds, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

On Fri, 2008-05-02 at 12:08 +1000, Paul Mackerras wrote:
> Adrian Bunk writes:
> 
> > "Kernel oops while running kernbench and tbench on powerpc" took more 
> > than 2 months to get resolved, and we ship 2.6.25 with this regression.
> 
> That was a very subtle bug that only showed up on one particular
> powerpc machine.  I was not able to replicate it on any of the powerpc
> machines I have here.  Nevertheless, we found it and we have a fix for
> it.  I think that's an example of the process working. :)

Was it even a regression in the classical sense of the word?  Seemed
more of a latent bug that was simply never triggered before.

josh


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02  3:10                   ` Josh Boyer
@ 2008-05-02  4:09                     ` Paul Mackerras
  2008-05-02  8:29                       ` Adrian Bunk
  0 siblings, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2008-05-02  4:09 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Adrian Bunk, Arjan van de Ven, Linus Torvalds, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

Josh Boyer writes:

> On Fri, 2008-05-02 at 12:08 +1000, Paul Mackerras wrote:
> > Adrian Bunk writes:
> > 
> > > "Kernel oops while running kernbench and tbench on powerpc" took more 
> > > than 2 months to get resolved, and we ship 2.6.25 with this regression.
> > 
> > That was a very subtle bug that only showed up on one particular
> > powerpc machine.  I was not able to replicate it on any of the powerpc
> > machines I have here.  Nevertheless, we found it and we have a fix for
> > it.  I think that's an example of the process working. :)
> 
> Was it even a regression in the classical sense of the word?  Seemed
> more of a latent bug that was simply never triggered before.

That's right.  The bug has been there basically forever (i.e. since
before 2.6.12-rc2 ;) and no-one has been able to trigger it reliably
before.

Paul.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02  4:09                     ` Paul Mackerras
@ 2008-05-02  8:29                       ` Adrian Bunk
  2008-05-02 10:16                         ` Paul Mackerras
  2008-05-02 14:58                         ` Linus Torvalds
  0 siblings, 2 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-02  8:29 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Josh Boyer, Arjan van de Ven, Linus Torvalds, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

On Fri, May 02, 2008 at 02:09:39PM +1000, Paul Mackerras wrote:
> Josh Boyer writes:
> 
> > On Fri, 2008-05-02 at 12:08 +1000, Paul Mackerras wrote:
> > > Adrian Bunk writes:
> > > 
> > > > "Kernel oops while running kernbench and tbench on powerpc" took more 
> > > > than 2 months to get resolved, and we ship 2.6.25 with this regression.
> > > 
> > > That was a very subtle bug that only showed up on one particular
> > > powerpc machine.  I was not able to replicate it on any of the powerpc
> > > machines I have here.  Nevertheless, we found it and we have a fix for
> > > it.  I think that's an example of the process working. :)
> > 
> > Was it even a regression in the classical sense of the word?  Seemed
> > more of a latent bug that was simply never triggered before.
> 
> That's right.  The bug has been there basically forever (i.e. since
> before 2.6.12-rc2 ;) and no-one has been able to trigger it reliably
> before.

But for users this is a recent regression since 2.6.24 worked
and 2.6.25 does not.

If this problem was on x86 Linus himself and some other core developers 
would most likely have debugged this issue and Linus would have delayed 
the release of 2.6.25 for getting it fixed there.

And stuff that "only showed up on one particular machine" often shows up 
on many machines (we only know in hindsight) and the "one particular 
machine" is often due to the fact that of the many machines that might 
trigger a regression only one was used for testing this -rc kernel.

This is not in any way meant against you personally, and due to the fact 
that the powerpc port is among the better maintained parts of the kernel 
this regression eventually got fixed, but in many other parts of the 
kernel this would have been one more of the many regressions that were 
reported and never fixed.

> Paul.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02  8:29                       ` Adrian Bunk
@ 2008-05-02 10:16                         ` Paul Mackerras
  2008-05-02 11:58                           ` Adrian Bunk
  2008-05-02 14:58                         ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Paul Mackerras @ 2008-05-02 10:16 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Josh Boyer, Arjan van de Ven, Linus Torvalds, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

Adrian Bunk writes:

> > That's right.  The bug has been there basically forever (i.e. since
> > before 2.6.12-rc2 ;) and no-one has been able to trigger it reliably
> > before.
> 
> But for users this is a recent regression since 2.6.24 worked
> and 2.6.25 does not.

I never actually saw a statement to that effect (i.e. that 2.6.24
worked) from Kamalesh.  I think people assumed that because he
reported it against version X that version X-1 worked, but we don't
actually know that.

> If this problem was on x86 Linus himself and some other core developers 
> would most likely have debugged this issue and Linus would have delayed 
> the release of 2.6.25 for getting it fixed there.

If I had been able to replicate it, or if it had been seen on more
than one machine, I would probably have asked Linus to wait while we
fixed it.  

There's a risk management thing happening here.  Delaying a release is
a negative thing in itself, since it means that users have to wait
longer for the improvements we have made.  That has to be balanced
against the negative of some users seeing a regression.  It's not an
absolute, black-and-white kind of thing.  In this case, for a bug
being seen on only one machine, of a somewhat unusual configuration, I
considered it wasn't worth asking to delay the release.

Paul.


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02 10:16                         ` Paul Mackerras
@ 2008-05-02 11:58                           ` Adrian Bunk
  0 siblings, 0 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-02 11:58 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Josh Boyer, Arjan van de Ven, Linus Torvalds, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

On Fri, May 02, 2008 at 08:16:49PM +1000, Paul Mackerras wrote:
> Adrian Bunk writes:
> 
> > > That's right.  The bug has been there basically forever (i.e. since
> > > before 2.6.12-rc2 ;) and no-one has been able to trigger it reliably
> > > before.
> > 
> > But for users this is a recent regression since 2.6.24 worked
> > and 2.6.25 does not.
> 
> I never actually saw a statement to that effect (i.e. that 2.6.24
> worked) from Kamalesh.  I think people assumed that because he
> reported it against version X that version X-1 worked, but we don't
> actually know that.

He reported it as

[BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc

and it was in the 2.6.25 regression lists for ages.

> > If this problem was on x86 Linus himself and some other core developers 
> > would most likely have debugged this issue and Linus would have delayed 
> > the release of 2.6.25 for getting it fixed there.
> 
> If I had been able to replicate it, or if it had been seen on more
> than one machine, I would probably have asked Linus to wait while we
> fixed it.  
> 
> There's a risk management thing happening here.  Delaying a release is
> a negative thing in itself, since it means that users have to wait
> longer for the improvements we have made.  That has to be balanced
> against the negative of some users seeing a regression.  It's not an
> absolute, black-and-white kind of thing.  In this case, for a bug
> being seen on only one machine, of a somewhat unusual configuration, I
> considered it wasn't worth asking to delay the release.

No general disagreement on this.

And my example was not in any way meant against you - it's actually 
unusual and positive that a bug that once got the attention of being
on the regression lists gets fixed later.

Even worse is the situation with regressions people run into when 
upgrading from 2.6.22 to 2.6.24 today...  :-(

> Paul.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02  8:29                       ` Adrian Bunk
  2008-05-02 10:16                         ` Paul Mackerras
@ 2008-05-02 14:58                         ` Linus Torvalds
  2008-05-02 15:44                           ` Carlos R. Mafra
  1 sibling, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2008-05-02 14:58 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Paul Mackerras, Josh Boyer, Arjan van de Ven, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt

On Fri, 2 May 2008, Adrian Bunk wrote:
> 
> But for users this is a recent regression since 2.6.24 worked
> and 2.6.25 does not.

Totally and utterly immaterial.

If it's a timing-related bug, as far as developers are concerned, nothing 
they did introduced the problem.

So anybody who thinks that "process" should have caught it is just being 
stupid. 

Adrian, you're one of the absolutely *worst* in the camp of "everything 
should be perfect". You really need to realize that reality is messy, and 
things cannot be perfect.

You also need to realize and *understand* that aiming for "good" is 
actually much BETTER than trying to aim for "perfect".

Perfect is the enemy of good.

			Linus


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02 14:58                         ` Linus Torvalds
@ 2008-05-02 15:44                           ` Carlos R. Mafra
  2008-05-02 16:28                             ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Carlos R. Mafra @ 2008-05-02 15:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adrian Bunk, Paul Mackerras, Josh Boyer, Arjan van de Ven,
	Andrew Morton, Rafael J. Wysocki, davem, linux-kernel, jirislaby,
	Steven Rostedt

On Fri  2.May'08 at  7:58:25 -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 2 May 2008, Adrian Bunk wrote:
> > 
> > But for users this is a recent regression since 2.6.24 worked
> > and 2.6.25 does not.
> 
> Totally and utterly immaterial.
> 
> If it's a timing-related bug, as far as developers are concerned, nothing 
> they did introduced the problem.
> 
> So anybody who thinks that "process" should have caught it is just being 
> stupid. 

So I would like to ask you what a user should do when facing what is
probably a timing-related bug, as it appears I have the bad luck
of hitting one.

See for example my comments after this one 
http://bugzilla.kernel.org/show_bug.cgi?id=10117#c11

This same problem is still present with yesterday's git, and sometimes
it hangs without hpet=disable and sometimes it doesn't. (And never
with hpet=disable in the boot command line)

And when it hangs I can see only _one_ "Switched to high resolution mode
on CPU x" message before the hang point, and when it boots fine there
is always the two of them in sequence:

Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
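
A minimal sketch of how that check could be automated on a saved boot log,
assuming the message text is exactly as quoted above and that the machine
has two CPUs. The function names and the `expected_cpus` parameter are made
up for illustration and are not part of any kernel tool.

```python
import re

# Message wording is taken from the two log lines quoted in this report.
HRES_RE = re.compile(r"Switched to high resolution mode on CPU (\d+)")


def cpus_in_hres_mode(log_text):
    """Return the sorted CPU numbers that logged the switch message."""
    return sorted({int(m.group(1)) for m in HRES_RE.finditer(log_text)})


def boot_looks_complete(log_text, expected_cpus=2):
    """True if at least `expected_cpus` distinct CPUs reported the switch.

    `expected_cpus=2` is an assumption matching the two-line good case.
    """
    return len(cpus_in_hres_mode(log_text)) >= expected_cpus
```

A log containing only one of the two messages would make
`boot_looks_complete` return False, which matches the hanging case
described above.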

And using vga=6 or vga=0x0364 makes a difference in the probability
of hanging.

I am just waiting for -rc1 to be released to send an email with my
problem again, as I am unable to debug this myself.
I think this is ok from my part, right?




* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02 15:44                           ` Carlos R. Mafra
@ 2008-05-02 16:28                             ` Linus Torvalds
  2008-05-02 17:15                               ` Carlos R. Mafra
  0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2008-05-02 16:28 UTC (permalink / raw)
  To: Carlos R. Mafra
  Cc: Adrian Bunk, Paul Mackerras, Josh Boyer, Arjan van de Ven,
	Andrew Morton, Rafael J. Wysocki, davem, linux-kernel, jirislaby,
	Steven Rostedt

On Fri, 2 May 2008, Carlos R. Mafra wrote:
> 
> So I would like to ask you what a user should do when facing what is
> probably a timing-related bug, as it appears I have the bad luck
> of hitting one.

Quite frankly, it will depend on the bug.

If it's *reliably* timing-related (which sounds crazy, but is not at all 
unheard of), it can be reliably bisected down to some totally unrelated 
commit that doesn't actually introduce the problem at all, but that 
reliably turns it on or off.

That can be very misleading, and can cause us to basically revert a good 
commit, only to not actually fix the bug (and possibly re-introduce the 
bug that the reverted commit tried to fix).

But sometimes it gives us a clue where the timing problem is. But quite 
frankly, that seems to be the exception rather than the rule.

There have been issues that literally seemed to depend on things like 
cacheline placement etc, where changing config options for code that was 
never actually even *run* would change timing just enough to show a bug 
pseudo-reliably or not at all.
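
For a failure that only shows up some of the time, one hedged precaution
when bisecting is to test each candidate commit repeatedly and only call it
good after a run of clean results. The sketch below assumes a hypothetical
`boot_ok` callable that performs one boot test and returns True on success;
`max_boots` is likewise an illustrative parameter. This is not what
`git bisect` does by itself, and it does not help with the case described
above, where an unrelated commit reliably flips the timing.

```python
def classify_commit(boot_ok, max_boots=20):
    """Classify one commit for an intermittent boot failure.

    Returns ("bad", n) at the first failing boot, where n is the number
    of boots tried so far, or ("good", max_boots) if every boot succeeded.
    A "good" verdict is only as strong as max_boots allows.
    """
    for attempt in range(1, max_boots + 1):
        if not boot_ok():
            return ("bad", attempt)
    return ("good", max_boots)
```

A single failing boot is enough to mark the commit bad, while a good
verdict always costs the full `max_boots` runs, so the budget has to be
chosen with the failure rate in mind.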

The good news is that those timing issues are really quite rare. 

The bad news is that when they happen, they are almost totally 
undebuggable. 

> This same problem is still present with yesterday's git, and sometimes
> it hangs without hpet=disable and sometimes it doesn't. (And never
> with hpet=disable in the boot command line)

Hey, it may well be a HPET+NOHZ issue. But it could also be that HPET is 
the thing that just allows you to see the hang.

> And using vga=6 or vga=0x0364 makes a difference in the probability
> of hanging.

.. and yeah, these kinds of really odd and obviously totally unrelated 
issues are a sign of a bug that is either simply hardware instability or 
very subtly timing-related.

The reason I mention hardware instability is that there really are bugs 
that happen due to (for example) power supply instabilities. Brownouts 
under heavy load have been causes of problems, but perhaps surprisingly, 
so has _idle_ time thanks to sleep-states!

The latter is probably due to bad power conditioning on the CPU power 
lines, where the huge current swings (going from high CPU power to low, and 
back again) not only have made some motherboards "sing" (or "hum", 
depending on frequency) but also cause voltage instability, and then 
the CPU crashes.

Am I saying that's the reason you see problems? Probably not. Most 
instabilities really are due to kernel bugs. But hardware instabilities do 
happen, and they can have these kinds of odd effects.

> I am just waiting for -rc1 to be released to send an email with my
> problem again, as I am unable to debug this myself.
> I think this is ok from my part, right?

Yes. You've been a good bug reporter, and kept at it. It's not your fault 
that the bug is hard to pin down. 

Quite frankly, it does sound like the hang happens somewhere around the 

	hpet_init
	hpet_acpi_add
	hpet_resources
	hpet_resources: 0xfed00000 is busy

printk's you added (correct?) and we've had tons of issues with NO_HZ, so 
at a guess it is timer-related.

(And I assume it's stable if/once it gets past that boot hang issue? That 
tends to mean that it's not some hardware instability, it's literally our 
init code).

			Linus


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02 16:28                             ` Linus Torvalds
@ 2008-05-02 17:15                               ` Carlos R. Mafra
  2008-05-02 18:02                                 ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 48+ messages in thread
From: Carlos R. Mafra @ 2008-05-02 17:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Adrian Bunk, Paul Mackerras, Josh Boyer, Arjan van de Ven,
	Andrew Morton, Rafael J. Wysocki, davem, linux-kernel, jirislaby,
	Steven Rostedt, venkatesh.pallipadi

On Fri  2.May'08 at  9:28:08 -0700, Linus Torvalds wrote:

> Quite frankly, it does sound like the hang happens somewhere around the 
> 
> 	hpet_init
> 	hpet_acpi_add
> 	hpet_resources
> 	hpet_resources: 0xfed00000 is busy
> 
> printk's you added (correct?) and we've had tons of issues with NO_HZ, so 
> at a guess it is timer-related.

It happens a bit before that because when it hangs it doesn't 
print the above lines, and when it does not hang these lines are
the ones right after the point where it hangs. 

> (And I assume it's stable if/once it gets past that boot hang issue? 

Yes you are right. When I have luck and the boot succeeds my Sony laptop
is rock solid and the kernel is wonderful (even the card reader works!).

> That
> tends to mean that it's not some hardware instability, it's literally our 
> init code).

A few days ago I found this message in lkml in reply to a hpet patch
http://lkml.org/lkml/2007/5/7/361 in which the reporter also had 
a similar hang, which was cured by hpet=disable. 

So it is in my TODO list to try to check out if that patch is 
in the current -git and whether it can be reverted somehow (I 
added Venki to the Cc: now)

Thanks a lot for the answer!


* RE: RFC: starting a kernel-testers group for newbies
  2008-05-02 17:15                               ` Carlos R. Mafra
@ 2008-05-02 18:02                                 ` Pallipadi, Venkatesh
  2008-05-09 16:32                                   ` Mark Lord
  0 siblings, 1 reply; 48+ messages in thread
From: Pallipadi, Venkatesh @ 2008-05-02 18:02 UTC (permalink / raw)
  To: Carlos R. Mafra, Linus Torvalds
  Cc: Adrian Bunk, Paul Mackerras, Josh Boyer, Arjan van de Ven,
	Andrew Morton, Rafael J. Wysocki, davem, linux-kernel, jirislaby,
	Steven Rostedt, tglx, Len Brown

 

>-----Original Message-----
>From: linux-kernel-owner@vger.kernel.org 
>[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of 
>Carlos R. Mafra
>Sent: Friday, May 02, 2008 10:16 AM
>To: Linus Torvalds
>Cc: Adrian Bunk; Paul Mackerras; Josh Boyer; Arjan van de Ven; 
>Andrew Morton; Rafael J. Wysocki; davem@davemloft.net; 
>linux-kernel@vger.kernel.org; jirislaby@gmail.com; Steven 
>Rostedt; Pallipadi, Venkatesh
>Subject: Re: RFC: starting a kernel-testers group for newbies
>
>On Fri  2.May'08 at  9:28:08 -0700, Linus Torvalds wrote:
>
>> Quite frankly, it does sound like the hang happens somewhere 
>around the 
>> 
>> 	hpet_init
>> 	hpet_acpi_add
>> 	hpet_resources
>> 	hpet_resources: 0xfed00000 is busy
>> 
>> printk's you added (correct?) and we've had tons of issues 
>with NO_HZ, so 
>> at a guess it is timer-related.
>
>It happens a bit before that because when it hangs it doesn't 
>print the above lines, and when it does not hang these lines are
>the ones right after the point where it hangs. 
>
>> (And I assume it's stable if/once it gets past that boot hang issue? 
>
>Yes you are right. When I have luck and the boot succeeds my 
>Sony laptop
>is rock solid and the kernel is wonderful (even the card 
>reader works!).
>
>> That
>> tends to mean that it's not some hardware instability, it's 
>literally our 
>> init code).
>
>A few days ago I found this message in lkml in reply to a hpet patch
>http://lkml.org/lkml/2007/5/7/361 in which the reporter also had 
>a similar hang, which was cured by hpet=disable. 
>
>So it is in my TODO list to try to check out if that patch is 
>in the current -git and whether it can be reverted somehow (I 
>added Venki to the Cc: now)
>
>Thanks a lot for the answer!

It depends on whether HPET is being force detected based on the
chipset or whether it was exported by the BIOS in the ACPI table.

If it was force enabled and above patch is having any effect, then you
should see a message like
> Force enabled HPET at base address 0xfed00000

In any case, of late there seem to be quite a few breakages that are
related to HPET/timer interrupts. One of them was on a system which has
HPET being exported by BIOS
http://bugzilla.kernel.org/show_bug.cgi?id=10409
And the other one where we are force enabling based on chipset
http://bugzilla.kernel.org/show_bug.cgi?id=10561

And then we have the once-in-a-while hangs reported by you, Roman and Mark
here
http://bugzilla.kernel.org/show_bug.cgi?id=10377
http://bugzilla.kernel.org/show_bug.cgi?id=10117


Thanks,
Venki


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-02 18:02                                 ` Pallipadi, Venkatesh
@ 2008-05-09 16:32                                   ` Mark Lord
  2008-05-09 19:30                                     ` Carlos R. Mafra
  0 siblings, 1 reply; 48+ messages in thread
From: Mark Lord @ 2008-05-09 16:32 UTC (permalink / raw)
  To: Pallipadi, Venkatesh
  Cc: Carlos R. Mafra, Linus Torvalds, Adrian Bunk, Paul Mackerras,
	Josh Boyer, Arjan van de Ven, Andrew Morton, Rafael J. Wysocki,
	davem, linux-kernel, jirislaby, Steven Rostedt, tglx, Len Brown

Pallipadi, Venkatesh wrote:
>  
> 
>> -----Original Message-----
>> From: linux-kernel-owner@vger.kernel.org 
>> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of 
>> Carlos R. Mafra
>> Sent: Friday, May 02, 2008 10:16 AM
>> To: Linus Torvalds
>> Cc: Adrian Bunk; Paul Mackerras; Josh Boyer; Arjan van de Ven; 
>> Andrew Morton; Rafael J. Wysocki; davem@davemloft.net; 
>> linux-kernel@vger.kernel.org; jirislaby@gmail.com; Steven 
>> Rostedt; Pallipadi, Venkatesh
>> Subject: Re: RFC: starting a kernel-testers group for newbies
>>
>> On Fri  2.May'08 at  9:28:08 -0700, Linus Torvalds wrote:
>>
>>> Quite frankly, it does sound like the hang happens somewhere 
>> around the 
>>> 	hpet_init
>>> 	hpet_acpi_add
>>> 	hpet_resources
>>> 	hpet_resources: 0xfed00000 is busy
>>>
>>> printk's you added (correct?) and we've had tons of issues 
>> with NO_HZ, so 
>>> at a guess it is timer-related.
>> It happens a bit before that because when it hangs it doesn't 
>> print the above lines, and when it does not hang these lines are
>> the ones right after the point where it hangs. 
>>
>>> (And I assume it's stable if/once it gets past that boot hang issue? 
>> Yes you are right. When I have luck and the boot succeeds my 
>> Sony laptop
>> is rock solid and the kernel is wonderful (even the card 
>> reader works!).
>>
>>> That
>>> tends to mean that it's not some hardware instability, it's 
>> literally our 
>>> init code).
>> A few days ago I found this message in lkml in reply to a hpet patch
>> http://lkml.org/lkml/2007/5/7/361 in which the reporter also had 
>> a similar hang, which was cured by hpet=disable. 
>>
>> So it is in my TODO list to try to check out if that patch is 
>> in the current -git and whether it can be reverted somehow (I 
>> added Venki to the Cc: now)
>>
>> Thanks a lot for the answer!
> 
> It depends on whether HPET is being force detected based on the
> chipset or whether it was exported by the BIOS in the ACPI table.
> 
> If it was force enabled and above patch is having any effect, then you
> should see a message like
>> Force enabled HPET at base address 0xfed00000
> 
> In any case, of late there seem to be quite a few breakages that are
> related to HPET/timer interrupts. One of them was on a system which has
> HPET being exported by BIOS
> http://bugzilla.kernel.org/show_bug.cgi?id=10409
> And the other one where we are force enabling based on chipset
> http://bugzilla.kernel.org/show_bug.cgi?id=10561
> 
> And then we have the once-in-a-while hangs reported by you, Roman and Mark
> here
> http://bugzilla.kernel.org/show_bug.cgi?id=10377
> http://bugzilla.kernel.org/show_bug.cgi?id=10117
..

Yeah.  This particular bug first appeared when NOHZ & HPET were added.
Somebody once suggested it had something to do with an SMI interrupt
happening in the midst of HPET calibration or some such thing.

But nobody who works on the HPET code has ever shown more than a casual
interest in helping to track down and fix whatever the problem is.

Cheers


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-09 16:32                                   ` Mark Lord
@ 2008-05-09 19:30                                     ` Carlos R. Mafra
  2008-05-09 20:39                                       ` Mark Lord
  0 siblings, 1 reply; 48+ messages in thread
From: Carlos R. Mafra @ 2008-05-09 19:30 UTC (permalink / raw)
  To: Mark Lord
  Cc: Pallipadi, Venkatesh, Linus Torvalds, Adrian Bunk, Paul Mackerras,
	Josh Boyer, Arjan van de Ven, Andrew Morton, Rafael J. Wysocki,
	davem, linux-kernel, jirislaby, Steven Rostedt, tglx, Len Brown

On Fri  9.May'08 at 12:32:51 -0400, Mark Lord wrote:
> Pallipadi, Venkatesh wrote:
>>  
>>> -----Original Message-----
>>> From: linux-kernel-owner@vger.kernel.org 
>>> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Carlos R. Mafra
>>> Sent: Friday, May 02, 2008 10:16 AM
>>> To: Linus Torvalds
>>> Cc: Adrian Bunk; Paul Mackerras; Josh Boyer; Arjan van de Ven; Andrew 
>>> Morton; Rafael J. Wysocki; davem@davemloft.net; 
>>> linux-kernel@vger.kernel.org; jirislaby@gmail.com; Steven Rostedt; 
>>> Pallipadi, Venkatesh
>>> Subject: Re: RFC: starting a kernel-testers group for newbies
>>>
>>> On Fri  2.May'08 at  9:28:08 -0700, Linus Torvalds wrote:
>>>
>>>> Quite frankly, it does sound like the hang happens somewhere 
>>> around the 
>>>> 	hpet_init
>>>> 	hpet_acpi_add
>>>> 	hpet_resources
>>>> 	hpet_resources: 0xfed00000 is busy
>>>>
>>>> printk's you added (correct?) and we've had tons of issues 
>>> with NO_HZ, so 
>>>> at a guess it is timer-related.
>>> It happens a bit before that because when it hangs it doesn't print the 
>>> above lines, and when it does not hang these lines are
>>> the ones right after the point where it hangs. 
>>>> (And I assume it's stable if/once it gets past that boot hang issue? 
>>> Yes you are right. When I have luck and the boot succeeds my Sony laptop
>>> is rock solid and the kernel is wonderful (even the card reader works!).
>>>
>>>> That
>>>> tends to mean that it's not some hardware instability, it's 
>>> literally our 
>>>> init code).
>>> A few days ago I found this message in lkml in reply to a hpet patch
>>> http://lkml.org/lkml/2007/5/7/361 in which the reporter also had a 
>>> similar hang, which was cured by hpet=disable. 
>>> So it is in my TODO list to try to check out if that patch is in the 
>>> current -git and whether it can be reverted somehow (I added Venki to the 
>>> Cc: now)
>>>
>>> Thanks a lot for the answer!
>>
>> It depends on whether HPET is being force detected based on the
>> chipset or whether it was exported by the BIOS in the ACPI table.
>>
>> If it was force enabled and above patch is having any effect, then you
>> should see a message like
>>> Force enabled HPET at base address 0xfed00000
>>
>> In any case, of late there seem to be quite a few breakages that are
>> related to HPET/timer interrupts. One of them was on a system which has
>> HPET being exported by BIOS
>> http://bugzilla.kernel.org/show_bug.cgi?id=10409
>> And the other one where we are force enabling based on chipset
>> http://bugzilla.kernel.org/show_bug.cgi?id=10561
>>
>> And then we have the once-in-a-while hangs reported by you, Roman and Mark
>> here
>> http://bugzilla.kernel.org/show_bug.cgi?id=10377
>> http://bugzilla.kernel.org/show_bug.cgi?id=10117
> ..
>
> Yeah.  This particular bug first appeared when NOHZ & HPET were added.
> Somebody once suggested it had something to do with an SMI interrupt
> happening in the midst of HPET calibration or some such thing.
>

I said I was waiting for -rc1 to be released to send another email
about my HPET problem, but curiously with v2.6.26-rc1-6-gafa26be 
my laptop did not hang after 30+ boots and counting. 

Somewhere between 2.6.25-07000-(something) and the above kernel
something happened which changed significantly the probability
of hanging during boot. 

I could not boot more than 3 times in
a row without hanging with kernels up to 2.6.25-07000 (approximately),
and now I am still booting v2.6.26-rc1-6-gafa26be a few times a day
and no hangs yet.

Yesterday I started a "reverse" bisection, trying to find which
commit "fixed" it, but I still didn't finish (but it is past
-7200).

Of course I am not sure if after the 100th boot the latest -git
won't hang but it definitely improved.
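
As a rough way to reason about how many clean boots mean anything, here is
a sketch under a strong assumption that this thread does not establish:
every boot hangs independently with the same fixed probability. The
function names and the 0.95 default are illustrative choices only.

```python
import math


def prob_all_clean(p_hang, boots):
    """Probability that `boots` boots in a row all succeed.

    Assumes independent boots with a fixed per-boot hang probability.
    """
    if not 0.0 <= p_hang <= 1.0:
        raise ValueError("p_hang must be between 0 and 1")
    if boots < 0:
        raise ValueError("boots must not be negative")
    return (1.0 - p_hang) ** boots


def boots_needed(p_hang, confidence=0.95):
    """Smallest number of boots that shows at least one hang with the
    given confidence, if the per-boot hang probability really is p_hang."""
    if not 0.0 < p_hang < 1.0:
        raise ValueError("p_hang must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hang))
```

If the old kernels hung on, say, one boot in four (an assumed figure, not
a measured one), 30 clean boots in a row would be expected by chance less
than once in a thousand tries, so a long clean streak does suggest that
something changed. It cannot show that the hang probability is now zero.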

> But nobody who works on the HPET code has ever shown more than a casual
> interest in helping to track down and fix whatever the problem is.

Well, I would like to thank Venki for his effort because he even
answered some private emails from me about this issue and is 
tracking the bugzillas about it.


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-09 19:30                                     ` Carlos R. Mafra
@ 2008-05-09 20:39                                       ` Mark Lord
  0 siblings, 0 replies; 48+ messages in thread
From: Mark Lord @ 2008-05-09 20:39 UTC (permalink / raw)
  To: Mark Lord, Pallipadi, Venkatesh, Linus Torvalds, Adrian Bunk,
	Paul Mackerras, Josh Boyer, Arjan van de Ven, Andrew Morton,
	Rafael J. Wysocki, davem, linux-kernel, jirislaby, Steven Rostedt,
	tglx, Len Brown

Carlos R. Mafra wrote:
> On Fri  9.May'08 at 12:32:51 -0400, Mark Lord wrote:
>> Pallipadi, Venkatesh wrote:
>>>  
>>>> -----Original Message-----
>>>> From: linux-kernel-owner@vger.kernel.org 
>>>> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Carlos R. Mafra
>>>> Sent: Friday, May 02, 2008 10:16 AM
>>>> To: Linus Torvalds
>>>> Cc: Adrian Bunk; Paul Mackerras; Josh Boyer; Arjan van de Ven; Andrew 
>>>> Morton; Rafael J. Wysocki; davem@davemloft.net; 
>>>> linux-kernel@vger.kernel.org; jirislaby@gmail.com; Steven Rostedt; 
>>>> Pallipadi, Venkatesh
>>>> Subject: Re: RFC: starting a kernel-testers group for newbies
>>>>
>>>> On Fri  2.May'08 at  9:28:08 -0700, Linus Torvalds wrote:
>>>>
>>>>> Quite frankly, it does sound like the hang happens somewhere 
>>>> around the 
>>>>> 	hpet_init
>>>>> 	hpet_acpi_add
>>>>> 	hpet_resources
>>>>> 	hpet_resources: 0xfed00000 is busy
>>>>>
>>>>> printk's you added (correct?) and we've had tons of issues 
>>>> with NO_HZ, so 
>>>>> at a guess it is timer-related.
>>>> It happens a bit before that because when it hangs it doesn't print the 
>>>> above lines, and when it does not hang these lines are
>>>> the ones right after the point where it hangs. 
>>>>> (And I assume it's stable if/once it gets past that boot hang issue? 
>>>> Yes you are right. When I have luck and the boot succeeds my Sony laptop
>>>> is rock solid and the kernel is wonderful (even the card reader works!).
>>>>
>>>>> That
>>>>> tends to mean that it's not some hardware instability, it's 
>>>> literally our 
>>>>> init code).
>>>> A few days ago I found this message in lkml in reply to a hpet patch
>>>> http://lkml.org/lkml/2007/5/7/361 in which the reporter also had a 
>>>> similar hang, which was cured by hpet=disable. 
>>>> So it is in my TODO list to try to check out if that patch is in the 
>>>> current -git and whether it can be reverted somehow (I added Venki to the 
>>>> Cc: now)
>>>>
>>>> Thanks a lot for the answer!
>>> It depends on whether HPET is being force detected based on the
>>> chipset or whether it was exported by the BIOS in the ACPI table.
>>>
>>> If it was force enabled and above patch is having any effect, then you
>>> should see a message like
>>>> Force enabled HPET at base address 0xfed00000
>>> In any case, of late there seem to be quite a few breakages that are
>>> related to HPET/timer interrupts. One of them was on a system which has
>>> HPET being exported by BIOS
>>> http://bugzilla.kernel.org/show_bug.cgi?id=10409
>>> And the other one where we are force enabling based on chipset
>>> http://bugzilla.kernel.org/show_bug.cgi?id=10561
>>>
>>> And then we have the once-in-a-while hangs reported by you, Roman and Mark
>>> here
>>> http://bugzilla.kernel.org/show_bug.cgi?id=10377
>>> http://bugzilla.kernel.org/show_bug.cgi?id=10117
>> ..
>>
>> Yeah.  This particular bug first appeared when NOHZ & HPET were added.
>> Somebody once suggested it had something to do with an SMI interrupt
>> happening in the midst of HPET calibration or some such thing.
>>
> 
> I said I was waiting for -rc1 to be released to send another email
> about my HPET problem, but curiously with v2.6.26-rc1-6-gafa26be 
> my laptop did not hang after 30+ boots and counting. 
> 
> Somewhere between 2.6.25-07000-(something) and the above kernel
> something happened which changed significantly the probability
> of hanging during boot. 
> 
> I could not boot more than 3 times in
> a row without hanging with kernels up to 2.6.25-07000 (approximately),
> and now I am still booting v2.6.26-rc1-6-gafa26be a few times a day
> and no hangs yet.
> 
> Yesterday I started a "reverse" bisection, trying to find which
> commit "fixed" it, but I still didn't finish (but it is past
> -7200).
> 
> Of course I am not sure if after the 100th boot the latest -git
> won't hang but it definitely improved.
> 
>> But nobody who works on the HPET code has ever shown more than a casual
>> interest in helping to track down and fix whatever the problem is.
> 
> Well, I would like to thank Venki for his effort because he even
> answered some private emails from me about this issue and is 
> tracking the bugzillas about it.
..

My experience with this bug, since 2.6.20 or so, has been that it comes
and goes with even the most innocent change in the .config file,
like turning frame pointers on/off.

Cheers


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01  0:31       ` RFC: starting a kernel-testers group for newbies Adrian Bunk
  2008-04-30  7:03         ` Arjan van de Ven
@ 2008-05-01  0:41         ` David Miller
  2008-05-01 13:23           ` Adrian Bunk
  1 sibling, 1 reply; 48+ messages in thread
From: David Miller @ 2008-05-01  0:41 UTC (permalink / raw)
  To: bunk; +Cc: torvalds, akpm, rjw, linux-kernel, jirislaby, rostedt

From: Adrian Bunk <bunk@kernel.org>
Date: Thu, 1 May 2008 03:31:25 +0300

> - get a mailing list at vger

kernel-testers@vger.kernel.org has been created, feel free to
use it


* Re: RFC: starting a kernel-testers group for newbies
  2008-05-01  0:41         ` David Miller
@ 2008-05-01 13:23           ` Adrian Bunk
  0 siblings, 0 replies; 48+ messages in thread
From: Adrian Bunk @ 2008-05-01 13:23 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, akpm, rjw, linux-kernel, jirislaby, rostedt

On Wed, Apr 30, 2008 at 05:41:58PM -0700, David Miller wrote:
> From: Adrian Bunk <bunk@kernel.org>
> Date: Thu, 1 May 2008 03:31:25 +0300
> 
> > - get a mailing list at vger
> 
> kernel-testers@vger.kernel.org has been created, feel free to
> use it

Thanks  :-)
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



end of thread, other threads:[~2008-05-09 20:40 UTC | newest]

Thread overview: 48+ messages
2008-05-01 16:11 RFC: starting a kernel-testers group for newbies devzero
2008-05-01 16:26 ` Kok, Auke
2008-05-01 17:12   ` Adrian Bunk
  -- strict thread matches above, loose matches on Subject: below --
2008-05-01 17:09 devzero
2008-05-01 17:27 ` Steven Rostedt
2008-05-01 16:36 devzero
2008-04-30  2:03 Slow DOWN, please!!! David Miller
2008-04-30 19:36 ` Rafael J. Wysocki
2008-04-30 20:15   ` Andrew Morton
2008-04-30 20:31     ` Linus Torvalds
2008-05-01  0:31       ` RFC: starting a kernel-testers group for newbies Adrian Bunk
2008-04-30  7:03         ` Arjan van de Ven
2008-05-01  8:13           ` Andrew Morton
2008-04-30 14:15             ` Arjan van de Ven
2008-05-01 12:42               ` David Woodhouse
2008-04-30 15:02                 ` Arjan van de Ven
2008-05-05 10:03                 ` Benny Halevy
2008-05-04 12:45               ` Rene Herman
2008-05-04 13:00                 ` Pekka Enberg
2008-05-04 13:19                   ` Rene Herman
2008-05-01  9:16             ` Frans Pop
2008-05-01 10:30               ` Enrico Weigelt
2008-05-01 13:02                 ` Adrian Bunk
2008-05-01 11:30           ` Adrian Bunk
2008-04-30 14:20             ` Arjan van de Ven
2008-05-01 12:53               ` Rafael J. Wysocki
2008-05-01 13:21               ` Adrian Bunk
2008-05-01 15:49                 ` Andrew Morton
2008-05-01  1:13                   ` Arjan van de Ven
2008-05-02  9:00                     ` Adrian Bunk
2008-05-01 16:38                   ` Steven Rostedt
2008-05-01 17:18                     ` Andrew Morton
2008-05-01 17:24                   ` Theodore Tso
2008-05-01 19:26                     ` Andrew Morton
2008-05-01 19:39                       ` Steven Rostedt
2008-05-02 10:23                       ` Andi Kleen
2008-05-02  2:08                 ` Paul Mackerras
2008-05-02  3:10                   ` Josh Boyer
2008-05-02  4:09                     ` Paul Mackerras
2008-05-02  8:29                       ` Adrian Bunk
2008-05-02 10:16                         ` Paul Mackerras
2008-05-02 11:58                           ` Adrian Bunk
2008-05-02 14:58                         ` Linus Torvalds
2008-05-02 15:44                           ` Carlos R. Mafra
2008-05-02 16:28                             ` Linus Torvalds
2008-05-02 17:15                               ` Carlos R. Mafra
2008-05-02 18:02                                 ` Pallipadi, Venkatesh
2008-05-09 16:32                                   ` Mark Lord
2008-05-09 19:30                                     ` Carlos R. Mafra
2008-05-09 20:39                                       ` Mark Lord
2008-05-01  0:41         ` David Miller
2008-05-01 13:23           ` Adrian Bunk
