Automatic Kernel Bug Report

public inbox for linux-kernel@vger.kernel.org

* Automatic Kernel Bug Report
@ 2006-07-09  8:45 Daniel Bonekeeper
  2006-07-09  9:45 ` Michal Piotrowski
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-09  8:45 UTC (permalink / raw)
  To: linux-kernel

Well, this has probably been discussed already. Some distros have
automatic bug reporting tools that are triggered when something bad
happens (I don't know whether that includes kernel problems). But has
anybody thought about a bug report tool that, on an Oops such as a
NULL pointer dereference, creates for example a packed file with the
config used to build the kernel, the kernel version, loaded modules,
some hardware info, backtraces, and everything else that could be
useful for debugging, and sends it to a server to be catalogued?

I know for sure that a lot of people don't usually send bug reports,
either because they are in a hurry and forget, or because they just
don't know how, or that bug reporting even exists. We could have
something that, on certain bad events, sends that info to a userspace
program and lets it handle the bug report automatically (here distros
can be creative).

I'm not sure about including this in distro kernels, since distros
already use some kind of bug report mechanism, and distro kernels are
usually very different from the vanilla one, which could make it
harder to debug the problem. So distros should ship their kernels
with this disabled (or enabled, but with the userspace handler
pointing to them, and not to us).

Wouldn't that be helpful?

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-09  8:45 Automatic Kernel Bug Report Daniel Bonekeeper
@ 2006-07-09  9:45 ` Michal Piotrowski
  2006-07-09 10:29   ` Daniel Bonekeeper
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Piotrowski @ 2006-07-09  9:45 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: linux-kernel

Hi Daniel,

On 09/07/06, Daniel Bonekeeper <thehazard@gmail.com> wrote:
> Well this probably was already discussed. Some distros have automatic
> bug reporting tools that are triggered when something bad happens
> (don't know if includes kernel stuff). But have anybody thought about
> some kind of bug report tool that, under an Oops like a NULL point
> dereference, it creates for example a packed file with the config used
> to build the kernel, the kernel version, loaded modules, some hardware
> info, backtraces, everything that could be useful for debugging, and
> sends to a server to be catalogued ?

How about the oops reporting tool?
http://www.stardust.webpages.pl/files/ort/

[snip]
>
> Wouldn't that be helpful ?
>
> Daniel
>

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)


* Re: Automatic Kernel Bug Report
  2006-07-09  9:45 ` Michal Piotrowski
@ 2006-07-09 10:29   ` Daniel Bonekeeper
  2006-07-09 12:58     ` Adrian Bunk
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-09 10:29 UTC (permalink / raw)
  To: Michal Piotrowski; +Cc: linux-kernel

On 7/9/06, Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote:
> Hi Daniel,
>
> On 09/07/06, Daniel Bonekeeper <thehazard@gmail.com> wrote:
> > Well this probably was already discussed. Some distros have automatic
> > bug reporting tools that are triggered when something bad happens
> > (don't know if includes kernel stuff). But have anybody thought about
> > some kind of bug report tool that, under an Oops like a NULL point
> > dereference, it creates for example a packed file with the config used
> > to build the kernel, the kernel version, loaded modules, some hardware
> > info, backtraces, everything that could be useful for debugging, and
> > sends to a server to be catalogued ?
>
> How about oops reporting tool?
> http://www.stardust.webpages.pl/files/ort/
>
> [snip]
> >
> > Wouldn't that be helpful ?
> >
> > Daniel
> >
>
> Regards,
> Michal
>

Hello Michal.

Yes, something like that =)

Maybe less verbose. Another problem is that, depending on the
situation, the failure may be serious enough to keep a userspace
program from working (so it can neither acknowledge the Oops nor send
a bug report). Also, important information may not be available to
userspace (imagine a machine where the kernel wasn't compiled with
debug options, so those details are not exposed to userspace but are
available in kernelspace). As far as I understood your script, it
requires interactivity to work (so if we have a bunch of servers in a
datacenter 1,000 miles away, we have a problem). My first idea was:

1) In kernelspace we have some kind of function that is called at the
end of the bug handler (BUG_ON for example) or, more generically, in
another place. This function just takes the current bug description
(probably the text passed to printk, which is currently scattered
across the code) and adds it to a list of structs that hold bug
descriptions inside the bug report system.

2a) The bug report system can export those bug descriptors to
userspace via sysfs, for example, where a tool there can do the rest.

2b) We could provide a device like /dev/oops which just returns the
content of the bug report list (in ASCII so shell scripts can read
it). By having a device, we can actually know whether something in
userspace cares about bug reporting: if a process has /dev/oops open
(and is blocked, waiting for new oops reports), we let it handle
that. Again, something serious can happen and the userspace
notifier won't run. Something as simple as easy-to-parse e-mails could
be used to keep it simple (KISS).

2c) Just have the notifier in the kernel. Maybe this won't be
possible, but I thought about something very crude: at boot time,
initrd scripts tell the kernel where to send the bug report (using
UDP; it may be a machine inside their LAN, where the admin can also
get information about those Oopses, or directly servers at vger or
anywhere else). When the Oops occurs, a notifier function inside the
kernel uses the network stack (if available) to send a simple UDP
packet containing the info. Maybe this will be enough to overcome very
bad situations where everything in userspace locks up but the kernel
is still running.

3) At the central bug report server, the bugs get routed. Here we
put a hold on bugs that may not be worth looking at (for example
tainted stuff), and the useful ones (let's say, stuff that just came
out) are routed to maintainers, or just kept on the server where
everybody can have free access to them.
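
Steps 1), 2b) and 3) can be modelled roughly as follows. This is only
a userspace sketch in Python, not kernel code: BugRecord,
BugReportSystem and route() are hypothetical names invented for
illustration, and the blocking queue merely stands in for a reader
sleeping on a /dev/oops-like device.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class BugRecord:
    """One bug description, as step 1) would append it to the list."""
    kind: str                 # e.g. "BUG" or "Oops"
    message: str              # text that would otherwise only reach printk
    kernel_version: str
    tainted: bool = False
    modules: list = field(default_factory=list)

    def to_ascii(self) -> str:
        """Render as 'key: value' lines that a shell script can parse."""
        lines = [
            f"kind: {self.kind}",
            f"version: {self.kernel_version}",
            f"tainted: {int(self.tainted)}",
            f"modules: {' '.join(self.modules)}",
            f"message: {self.message}",
        ]
        return "\n".join(lines) + "\n"


class BugReportSystem:
    """Holds pending records; read() blocks like a read on the device."""

    def __init__(self):
        self._pending = queue.Queue()

    def record(self, rec: BugRecord) -> None:
        # Step 1): called at the end of the (hypothetical) bug handler.
        self._pending.put(rec)

    def read(self, timeout=None) -> str:
        # Step 2b): sleep until a report exists, then return it as ASCII.
        return self._pending.get(timeout=timeout).to_ascii()


def route(rec: BugRecord) -> str:
    """Step 3): hold reports from tainted kernels, forward the rest."""
    return "hold" if rec.tainted else "forward"
```

A reader would loop on read() and hand each ASCII report to whatever
transport it likes; the hold/forward decision is deliberately the
simplest possible policy.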

In any case, I think that user interactivity should be
avoided. At boot time, the initrd tells the kernel what to
do upon a bug (device / call a binary in userspace / send a dump via
UDP, etc.) and provides some kind of contact information, so the
system at the other side, after receiving the bug report, can put the
bug in a catalog, assign an ID to it, and notify the admin of the
server (probably via e-mail) about the bug, asking for details that
may be helpful. The process may seem complex at first glance, but I
think that this is worth having (imagine how many bugs are actually
triggered every single day, but not reported...). This would also end
up as a quality meter (just imagine the current discussion about
uswsusp and suspend2... we would know which one of them has the higher
rate of problems, with statistics over time, so we could see whether
the rate of problems with them is getting higher or not, etc.). Maybe
we could just incorporate Mozilla's bug report tool, if it's currently
available (and not hard to do on the kernel side, if it's a good idea
at all).
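
The UDP path and the catalog on the receiving side could look
something like this. Again this is a Python sketch under assumptions:
send_report(), Catalog and receive_one() are made-up names, the
destination would come from whatever the initrd configured, and a real
in-kernel sender would look nothing like a userspace socket call.

```python
import itertools
import socket


def send_report(report: str, dest: tuple) -> None:
    """Fire-and-forget: one UDP datagram, no retry, no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(report.encode("ascii", "replace"), dest)


class Catalog:
    """Server side: files each report under a fresh numeric ID."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.entries = {}

    def file(self, report: str, sender: str) -> int:
        bug_id = next(self._ids)
        self.entries[bug_id] = (sender, report)
        return bug_id


def receive_one(sock: socket.socket, catalog: Catalog) -> int:
    """Read one datagram from a bound UDP socket and catalog it."""
    data, addr = sock.recvfrom(65535)
    return catalog.file(data.decode("ascii", "replace"), addr[0])
```

Since UDP gives no delivery guarantee, a lost report is simply lost;
that matches the "send it and pray" spirit of the proposal but means
the catalog can only ever undercount.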

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.


* Re: Automatic Kernel Bug Report
  2006-07-09 10:29   ` Daniel Bonekeeper
@ 2006-07-09 12:58     ` Adrian Bunk
  2006-07-09 18:46       ` Daniel Bonekeeper
  0 siblings, 1 reply; 23+ messages in thread
From: Adrian Bunk @ 2006-07-09 12:58 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: Michal Piotrowski, linux-kernel

On Sun, Jul 09, 2006 at 06:29:55AM -0400, Daniel Bonekeeper wrote:
>...
> Maybe less verbal. Another problem is that, depending on the
> situation, the problem may be serious enough to not allow a program in
> userspace to work (and therefore, not acknowledge the Oops nor send a
> bug report). Also, important information may not be available for
> userspace (imagine a machine where the kernel wasn't compiled with
> debug stuff, so those details are not exposed to userspace, but
> available at kernelspace). As far as I understood your script, it
> requires interactivity to work (so if we have a bunch of servers in a
> datacenter at 1k miles, we got a problem). My first idea was:
>...

I'm sorry for being so negative, but it seems you are overdesigning a 
solution for a non-existing problem:

There are cases where the machine is simply dead with exactly zero 
information. These are the really hard ones.

Then there are cases where the kernel is able to print a BUG() or Oops 
to a log file. Or the error message is printed to the screen and the 
user uses a digital camera and sends the photo.

The message is usually enough for starting to debug the problem or 
asking the user for additional information.

But most important, the problem lies in a completely different area:

Interaction between kernel developers and users is not a real problem.
The real problem is the missing developer manpower for handling bug 
reports.

> Daniel

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


* Re: Automatic Kernel Bug Report
  2006-07-09 12:58     ` Adrian Bunk
@ 2006-07-09 18:46       ` Daniel Bonekeeper
  2006-07-09 19:11         ` Adrian Bunk
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-09 18:46 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: Michal Piotrowski, linux-kernel

> I'm sorry for being so negative, but it seems you are overdesigning a
> solution for a non-existing problem:
>
> There are cases where the machine is simply dead with exactly zero
> information. These are the really hard ones.
>

Then really there isn't anything that we can do, except to expect the
kindness of the user in taking a picture of his screen and posting on
the kernel's bugzilla.

> Then there are cases where the kernel is able to print a BUG() or Oops
> to a log file. Or the error message is printed to the screen and the
> user uses a digital camera and sends the photo.

Then again, users may just continue using the machine (without even
noticing the Oops), or notice it but never bother to report it, or
forget, etc.

> The message is usually enough for starting to debug the problem or
> asking the user for additional information.
>
> But most important, the problem lies in a completely different area:
>
> Interaction between kernel devlopers and users is not a real problem.
> The real problem is the missing developer manpower for handling bug
> reports.
>

Well Adrian, this is the other side of the problem. We don't actually
need a kernel monkey to keep looking at every bug that comes in (even
though that would be good, as you stated there is not enough manpower
for it), even more so once something automatically sends Oops
reports to the server, where we could expect thousands of bug reports
daily... but I also believe that not having somebody to look at them
is no excuse for not having these bugs accounted for. For
example, even though we may not debug each and every bug report, we
can have statistics on which modules are reporting more problems (and
therefore have more bugs). For example, I don't really expect
Microsoft to investigate every crash report that users send, but it is
definitely important to have bugs accounted for. Let's say that the
SCSI maintainer just made a big change to the SCSI subsystem and wants
to know how it is going: he just goes to bugzilla and sees statistics
on how the ratio of bug reports changed compared to the last
version, or he can see which functions (based on the EIP, as a
guess) are reporting more problems.

In summary, not being able to investigate each report is no reason
for not being made aware of its existence, and even if we don't
investigate it, having it for statistical purposes is already a great
deal.
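
As a rough illustration of the statistics meant here, assuming each
stored report carries a module name and a kernel version (the field
names and the version strings in the usage note are hypothetical):

```python
from collections import Counter


def reports_per_module(reports):
    """Count how many reports name each module."""
    return Counter(r["module"] for r in reports)


def version_ratio(reports, module, old, new):
    """Reports for `module` on version `new` divided by those on `old`.

    Returns None when there are no reports for `old`, since no ratio
    can be formed in that case.
    """
    counts = Counter(r["version"] for r in reports if r["module"] == module)
    if counts[old] == 0:
        return None
    return counts[new] / counts[old]
```

For instance, reports_per_module(reports).most_common(5) would list the
five noisiest modules, and version_ratio(reports, "scsi_mod", "2.6.16",
"2.6.17") would tell a maintainer whether the newer version attracts
more reports. Raw counts ignore how many machines run each version, so
they show a trend at best, not a bug rate.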

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.


* Re: Automatic Kernel Bug Report
  2006-07-09 18:46       ` Daniel Bonekeeper
@ 2006-07-09 19:11         ` Adrian Bunk
  2006-07-09 20:01           ` Daniel Bonekeeper
  0 siblings, 1 reply; 23+ messages in thread
From: Adrian Bunk @ 2006-07-09 19:11 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: Michal Piotrowski, linux-kernel

On Sun, Jul 09, 2006 at 02:46:28PM -0400, Daniel Bonekeeper wrote:
> >I'm sorry for being so negative, but it seems you are overdesigning a
> >solution for a non-existing problem:
> >
> >There are cases where the machine is simply dead with exactly zero
> >information. These are the really hard ones.
> 
> Then really there isn't anything that we can do, except to expect the
> kindness of the user in taking a picture of his screen and posting on
> the kernel's bugzilla.

No, I'm talking about freezes without anything printed.

As soon as anything is printed, it becomes easier.

> >Then there are cases where the kernel is able to print a BUG() or Oops
> >to a log file. Or the error message is printed to the screen and the
> >user uses a digital camera and sends the photo.
> 
> Then again, users may just continue using the machine (without even
> noticing the Oops), or notice but never care to report it, or forgets,
> etc.

If the user doesn't notice what is written into his logs, the solution 
is to change this (e.g. via logcheck).

And if the user doesn't care, there's no reason for getting the bug
report - a bug report from an unresponsive user is worse than no bug
report.

> >The message is usually enough for starting to debug the problem or
> >asking the user for additional information.
> >
> >But most important, the problem lies in a completely different area:
> >
> >Interaction between kernel devlopers and users is not a real problem.
> >The real problem is the missing developer manpower for handling bug
> >reports.
> 
> Well Adrian, this is the other side of the problem. We don't actually
> need a kernel monkey to keep looking for bugs that comes (even thought
> would be good, but as you stated, there is not enough manpower to do
> that), even more after having something that automatically sends Oops
> reports to the server, where we could expect thousands of bug reports
> daily... but I also believe that not having somebody to look at them
> is not an excuse for not having this bug taken account for. For
>...
> In resume, don't being able to investigate each report isn't a reason
> for not being acknowledged of its existance, and even we don't
> investigate it, having it for statistical purposes is already a great
> deal.

I'm still sure the important points are
- developer manpower and
- responsive bug submitters,
and your proposal doesn't help with either of these.

But this is open source, so feel free to send a patch implementing your 
ideas and prove me wrong.

> Daniel

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: Automatic Kernel Bug Report
  2006-07-09 19:11         ` Adrian Bunk
@ 2006-07-09 20:01           ` Daniel Bonekeeper
  2006-07-09 20:19             ` Valdis.Kletnieks
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-09 20:01 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: linux-kernel

> > >There are cases where the machine is simply dead with exactly zero
> > >information. These are the really hard ones.
> >
> > Then really there isn't anything that we can do, except to expect the
> > kindness of the user in taking a picture of his screen and posting on
> > the kernel's bugzilla.
>
> No, I'm talking about freezes without anything printed.
> As soon as anything is printed, it becomes easier.
>

Hopefully, for some bugs where nothing is printed (e.g. syslog died,
is not running, or we are in kernel context and never get back to user
mode), having a notifier in the kernel may ensure that the bug report
is sent (since we don't need userspace interaction to get it working).

> > >Then there are cases where the kernel is able to print a BUG() or Oops
> > >to a log file. Or the error message is printed to the screen and the
> > >user uses a digital camera and sends the photo.
> >
> > Then again, users may just continue using the machine (without even
> > noticing the Oops), or notice but never care to report it, or forgets,
> > etc.
>
> If the user doesn't notice what is written into his logs, the solution
> is to change this (e.g. via logcheck).
>

Sometimes the user may just be somebody who has only started using
Linux, or who works in an industry that has nothing to do with
computers. He doesn't even know that syslog exists, and even if he
did, he might not care about it. The whole idea of the system is, in
fact, to let the kernel development community know, without any
interaction from the user, that a BUG_ON() was triggered somewhere,
with a basic set of information that could be used to debug the
problem (if somebody decides to pick a random report for a deeper
look, or just groups the reports by frequency and looks at the most
frequent ones).

> And if the user doesn't care, there's no reason for getting the bug
> report - a bug report from a not responsive user is worse than no bug
> report.
>

I disagree. We may not be talking about something that will bulk up
the kernel's bugzilla, so developers and bug trackers won't be
overwhelmed with a flood of less-than-ideal bug reports. We may even
have something totally unrelated to bugzilla to keep track of
those reports. It may end up just providing statistics about the most
faulty pieces of code. Hopefully, a good set of information will
already be sufficient to give a clue about where the bugs lie, and
since the report will include the (assumed) most important info, we
won't need to hear more from the user. If that's not enough, we
still have the option of contacting the user to ask for further
details, because the report will be signed with his e-mail (some kind
of "If you agree to have bug reports sent to Linux developers, bearing
in mind that no confidential information is sent, sign with your
e-mail in /sys/kernel/bugreport/contact", done by the initrd).

> >...
> > In resume, don't being able to investigate each report isn't a reason
> > for not being acknowledged of its existance, and even we don't
> > investigate it, having it for statistical purposes is already a great
> > deal.
>
> I'm still sure the important points are
> - developer manpower and
> - responsive bug submitters,
> and your proposal doesn't help with any of these.
>

Well, neither of the problems you raise can be solved easily, which
doesn't mean that the "problem" can't be mitigated. My proposal may
help _current_ developers know that their code has actual bugs. It
doesn't oblige anybody to work on them. You may agree with me that,
considering the current number of Linux users and the current number
of bug reports in bugzilla, hardly 1% of bugs are reported. If I had
thousands of people using a driver that I wrote for some fancy
wireless adapter, I would like to know whether people are actually
having problems with it, what kind of problems, etc. (and that doesn't
mean that I need to look at each of the thousands of bug reports in
the system, nor that I'm obliged to fix them), and I think that
other developers would like to know that too.

> But this is open source, so feel free to send a patch implementing your
> ideas and prove me wrong.
>

Well, even though I know that it is a good idea, I wouldn't do
anything unless I hear at least a handful of developers saying
that they think this is good. The system itself is worthless if
nobody uses it, like anything else. I can build and maintain
everything by myself; I just need to know whether people think that
this is a good thing to have (and that it would be used), since this
is not something where "well, if nobody uses it, at least I can use it
myself" applies, as with a driver for an exotic device.

Regardless of whether you think that this is useful or not, do you see
any cleaner way to send those reports, bearing in mind that
userspace may not be responsive?

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.


* Re: Automatic Kernel Bug Report
  2006-07-09 20:01           ` Daniel Bonekeeper
@ 2006-07-09 20:19             ` Valdis.Kletnieks
  2006-07-09 20:27               ` Daniel Bonekeeper
  2006-07-09 20:24             ` Diego Calleja
  2006-07-09 20:25             ` Jesper Juhl
  2 siblings, 1 reply; 23+ messages in thread
From: Valdis.Kletnieks @ 2006-07-09 20:19 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: Adrian Bunk, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 652 bytes --]

On Sun, 09 Jul 2006 16:01:58 EDT, Daniel Bonekeeper said:

> Sometimes the user may be just somebody that just started using linux,
> or is in an industry that has nothing related to computers. He doesn't
> even know that syslog exists, and even if he did, he could not even
> care about it.

This user will do whatever his distro tells him to do, which is almost
certainly something *other* than what a kernel.org kernel should do.
If he's running Ubuntu, it should do whatever Ubuntu does.  If he's
on Fedora Core, it should poke the RedHat bugzilla, and so on.

If he's running a kernel.org kernel, it's probably safe to assume *some*
level of clue

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]


* Re: Automatic Kernel Bug Report
  2006-07-09 20:01           ` Daniel Bonekeeper
  2006-07-09 20:19             ` Valdis.Kletnieks
@ 2006-07-09 20:24             ` Diego Calleja
  2006-07-09 20:37               ` Daniel Bonekeeper
  2006-07-09 20:25             ` Jesper Juhl
  2 siblings, 1 reply; 23+ messages in thread
From: Diego Calleja @ 2006-07-09 20:24 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: bunk, linux-kernel

El Sun, 9 Jul 2006 16:01:58 -0400,
"Daniel Bonekeeper" <thehazard@gmail.com> escribió:

> Independent of wheter you think that this is useful or not, do you see
> any cleaner way to send those reports, having in mind that the
> userspace may not be responsive ?

Kdump (http://lse.sourceforge.net/kdump/) would be useful for
many cases.

WRT the bugs where the system completely locks up and doesn't
leave any option for automatic bug reports: you don't really care
about those, because they're _so_ annoying that people report them
manually.


* Re: Automatic Kernel Bug Report
  2006-07-09 20:01           ` Daniel Bonekeeper
  2006-07-09 20:19             ` Valdis.Kletnieks
  2006-07-09 20:24             ` Diego Calleja
@ 2006-07-09 20:25             ` Jesper Juhl
  2 siblings, 0 replies; 23+ messages in thread
From: Jesper Juhl @ 2006-07-09 20:25 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: Adrian Bunk, linux-kernel

On 09/07/06, Daniel Bonekeeper <thehazard@gmail.com> wrote:
...
>
> Hopefully, in some bugs where nothing is printed (i.e., syslog died,
> is not running, or we are in kernel context and never get back to user
> mode), having a notifier on the kernel may ensure that the bug report
> is sent (since we don't need userspace interaction to get it working).
>
...

Have you considered the privacy implications of this?

If you implement something like this it most definitely needs to be
configurable, default to *OFF*, and need explicit user intervention
to turn on.
Also consider that any data transmitted should probably be encrypted
during transmission - not something you want to start doing after you
just Oops'ed.

I for one certainly do *NOT* want my kernel to "phone home" and
disclose information about my computer without my consent - that, in
my book, is called spyware.

Consider this:

If my machine connects to some off-site location and submits an Oops,
at least the following information about me will (or may) be
disclosed:

- My IP address.
- My OS.
- Portions of memory on my machine that may contain sensitive info
(encryption keys for instance or personal data).
- Name(s) of applications I have running.
- gcc version of the compiler I used to build the kernel.
- Details of hardware I have in the box (architecture etc).

and probably a lot more that I've forgotten to include.

I consider the above info privileged and personal and something that
requires my explicit consent to release. It's *NOT* something I want
my computer to submit off-site without me knowing about it.

Also consider that I may be using a laptop and be connected to a
network where I may not be allowed to connect off-site except under a
specific set of circumstances. This thing could make me violate such a
policy without knowing about it.

I may also be connected to a network where the firewall logs all my
outgoing connections and I may not want people to know I'm running
Linux. A Linux kernel Oops from my machine showing up in the firewall
logs would certainly disclose that fact.

There are more things to consider than just "would this be useful for
kernel development" - privacy in this case is a major issue.
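
To make the "default to *OFF*" requirement concrete, here is a small
Python sketch of an opt-in policy. ReportPolicy and filter_report()
are hypothetical names, not an existing interface; the point is only
that nothing leaves the machine unless the user enabled reporting and
explicitly listed each field.

```python
from dataclasses import dataclass


@dataclass
class ReportPolicy:
    """Opt-in policy: disabled, and with no fields allowed, by default."""
    enabled: bool = False
    allowed_fields: frozenset = frozenset()


def filter_report(report: dict, policy: ReportPolicy):
    """Return only the fields the user agreed to send, or None if off."""
    if not policy.enabled:
        return None
    return {k: v for k, v in report.items() if k in policy.allowed_fields}
```

With the default ReportPolicy() the function always returns None.
Filtering fields does nothing about the source IP address or the
visibility of the traffic to a firewall, which would still be disclosed
by the act of sending.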

-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html


* Re: Automatic Kernel Bug Report
  2006-07-09 20:19             ` Valdis.Kletnieks
@ 2006-07-09 20:27               ` Daniel Bonekeeper
  2006-07-10  8:11                 ` Pavel Machek
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-09 20:27 UTC (permalink / raw)
  To: Valdis.Kletnieks@vt.edu; +Cc: Adrian Bunk, linux-kernel

On 7/9/06, Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:
> On Sun, 09 Jul 2006 16:01:58 EDT, Daniel Bonekeeper said:
>
> > Sometimes the user may be just somebody that just started using linux,
> > or is in an industry that has nothing related to computers. He doesn't
> > even know that syslog exists, and even if he did, he could not even
> > care about it.
>
> This user will do whatever his distro tells him to do, which is almost
> certainly something *other* than what a kernel.org kernel should do.
> If he's running Ubuntu, it should do whatever Ubuntu does.  If he's
> on Fedora Core, it should poke the RedHat bugzilla, and so on.
>
> If he's running a kernel.org kernel, it's probably safe to assume *some*
> level of clue
>

This was actually just one example of a circumstance in which somebody
would not report a bug. Dozens of other circumstances could be given,
and it just illustrates why it would be good to have those bug reports
without user interaction. Of course I can go to bugzilla and file a
report upon a bug, but I wouldn't mind having bug reports sent from my
servers automatically, if that's an option. This would ensure that even
bugs hit by non-caring users are, well, reported.

Daniel
-- 
What this world needs is a good five-dollar plasma weapon.


* Re: Automatic Kernel Bug Report
  2006-07-09 20:24             ` Diego Calleja
@ 2006-07-09 20:37               ` Daniel Bonekeeper
  2006-07-09 22:19                 ` Diego Calleja
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-09 20:37 UTC (permalink / raw)
  To: Diego Calleja; +Cc: bunk, linux-kernel

On 7/9/06, Diego Calleja <diegocg@gmail.com> wrote:
> El Sun, 9 Jul 2006 16:01:58 -0400,
> "Daniel Bonekeeper" <thehazard@gmail.com> escribió:
>
> > Independent of wheter you think that this is useful or not, do you see
> > any cleaner way to send those reports, having in mind that the
> > userspace may not be responsive ?
>
> Kdump (http://lse.sourceforge.net/kdump/) would be useful for
> many cases.

I agree.

> WRT to the bugs where the system completely locks up and doesn't
> leaves any option for automatic bug reports: You don't really care
> about those, because they're _so_ annoying that people reports them
> manually.

Yeah, or they just switch the ethernet card, get it working and go out
for a coffee. =]
In those nasty cases, we really can't do much, since we can't
provide a mechanism in the kernel that runs when we detect that it is
not responsive (or can we?). A quick brainstorm: a frozen system can't
preempt tasks. If we keep track of the timestamp of the last time
the scheduler ran, and we see that it was something like 5 or 10
seconds ago, it means that something is very wrong on the kernel side.
We may have several levels of fucked-up-ness in which, at a certain
level, interrupts are still serviced (and we can call our code there to
check the sanity of the system). If we see that we haven't scheduled
for a long time, we can trigger the report system (then again, ideally
we don't need userspace to do that). Then we just pray, hoping that
the report gets through the network. Of course there is no magic
solution, but this could help in a great many cases.
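
The check itself is tiny; a Python model of it, with a hypothetical
StallDetector class standing in for whatever the kernel would do from
a timer interrupt, might look like this:

```python
import time


class StallDetector:
    """Flags a stall when the 'scheduler' has not run for too long."""

    def __init__(self, threshold=10.0, clock=time.monotonic):
        self._clock = clock          # injectable so it can be tested
        self.threshold = threshold   # seconds without scheduling
        self._last = clock()

    def note_schedule(self) -> None:
        # Would be called every time the scheduler runs.
        self._last = self._clock()

    def check(self) -> bool:
        # Would be called from a periodic interrupt; True means stalled.
        return self._clock() - self._last > self.threshold
```

The 10 second default is just the figure from the paragraph above.
This only works while interrupts are still being delivered, which is
exactly the limitation the idea has.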

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.


* Re: Automatic Kernel Bug Report
  2006-07-09 20:37               ` Daniel Bonekeeper
@ 2006-07-09 22:19                 ` Diego Calleja
  2006-07-09 22:49                   ` Daniel Bonekeeper
  0 siblings, 1 reply; 23+ messages in thread
From: Diego Calleja @ 2006-07-09 22:19 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: bunk, linux-kernel

El Sun, 9 Jul 2006 16:37:46 -0400,
"Daniel Bonekeeper" <thehazard@gmail.com> escribió:

> preempt tasks. If we can keep track of the timestamp of the last time
> the schedule ran, and we see that it was like 5 or 10 seconds ago, it
> means that something is very wrong on the kernel side. We may have
> several levels of fucked-up-ness in which, at a certain level,
> interrupts are still called (and we can call our code here to check
> the sanity of the system). If we see that we didn't schedule for a
> long time, we can trigger the report system (then again, ideally we

That's what NMIs are for

Note that while I like the idea of automatically reporting such bugs,
I fully agree with Adrian that it's not a critical issue right now.
It's not that kernel developers are sleeping while users hit tons
of bugs that never get reported. Right now we have _too many_ bug
reports, and developers are not noticing/fixing them as fast as they
get reported. Take a look at kernel.org's bugzilla, and at the kernel
component of the fedora/ubuntu/debian/gentoo bugzillas. We're not
lacking bug reports, quite the contrary: we're getting so many
that some people are starting to question (again) whether the current
development model is the right one and whether we should make
a bug-fix-only release.

In fact, one of the main problems is that there's no "official"
bug reporting tool beyond email. Many kernel developers didn't
like bugzilla when it was started, and as of today, many core
kernel developers still do not even _look_ at bugzilla.kernel.org.
The hard work of akpm and some cool people like the acpi guys has
lately made it possible to start using it as a sort of
"official" bugzilla, but there are still many kernel developers
who need to be convinced (and many bugzilla features that need
to be polished). AFAIK, people are going to talk seriously about
this at OLS.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-09 22:19                 ` Diego Calleja
@ 2006-07-09 22:49                   ` Daniel Bonekeeper
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-09 22:49 UTC (permalink / raw)
  To: Diego Calleja; +Cc: bunk, linux-kernel

...
>
> In fact, one of the main problems is that there's not an "official"
> bug reporting tool beyond email. Many kernel developers didn't
> like bugzilla when it was started, and as for today, many core
> kernel developers still do not even _look_ at bugzilla.kernel.org.
> The hard work of akpm and some cool people like the acpi guys has
> made possible lately to start using bugzilla as a sort of
> "official" bugzilla, but there're still many kernel developers
> that need to be convinced (and many bugzilla features that need
> to be polished). AFAIK, people is going to talk seriusly about
> this in OLS.
>

I see. Well, let's see what's discussed/decided after OLS so I can
have a broader view regarding that.

Daniel


-- 
What this world needs is a good five-dollar plasma weapon.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-09 20:27               ` Daniel Bonekeeper
@ 2006-07-10  8:11                 ` Pavel Machek
  2006-07-10 17:40                   ` Daniel Bonekeeper
  0 siblings, 1 reply; 23+ messages in thread
From: Pavel Machek @ 2006-07-10  8:11 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: Valdis.Kletnieks@vt.edu, Adrian Bunk, linux-kernel

Hi!

> >> Sometimes the user may be just somebody that just started using linux,
> >> or is in an industry that has nothing related to computers. He doesn't
> >> even know that syslog exists, and even if he did, he could not even
> >> care about it.
> >
> >This user will do whatever his distro tells him to do, which is almost
> >certainly something *other* than what a kernel.org kernel should do.
> >If he's running Ubuntu, it should do whatever Ubuntu does.  If he's
> >on Fedora Core, it should poke the RedHat bugzilla, and so on.
> >
> >If he's running a kernel.org kernel, it's probably safe to assume *some*
> >level of clue
> >
> 
> This was actually just an example circunstance of why somebody would
> not report a bug. Dozens of other circunstances may be given, and it
> just illustrates why would be good to have those bug reports without
> user interaction. Of course I can go to bugzilla and fill a report
> upon a bug, but I wouldn't care to have bug reports being sent from my
> servers automatically, if it's an option. This would ensure that even
> bug report from non-caring users are, well, reported.

Well, unless we have some volunteer to go through the bugreports and
sort them/kill the invalid ones/etc... this is going to do more harm
than good.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10  8:11                 ` Pavel Machek
@ 2006-07-10 17:40                   ` Daniel Bonekeeper
  2006-07-10 17:59                     ` Valdis.Kletnieks
  2006-07-10 18:41                     ` Horst von Brand
  0 siblings, 2 replies; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-10 17:40 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Valdis.Kletnieks@vt.edu, Adrian Bunk, linux-kernel

On 7/10/06, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!

Hi ! =)

>
> Well, unless we have some volunteer to go through the bugreports and
> sort them/kill the invalid ones/etc... this is going to do more harm
> than good.

As I said before, I wouldn't care to do that, as long as I know that
it is actually being used (and useful). The system (on the server
side) could automatically route some reports (mark them as "tainted
modules detected", etc., that sort of mechanical stuff), and according
to the frequency of certain bugs, I could check whether they are
actually real bugs. If so, they get reported here on LKML. Since we
can expect maybe tens of thousands of reports per week, it wouldn't be
hard to distinguish real bugs (if we use frequency as a marker). For
example, if the number of reports on Suspend2 rises noticeably
on some just-released kernel, this means that something that was added
isn't working (then comes the manual debugging, where we can see
whether it's a new bug or a regression).
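
A rough sketch of "frequency as a marker", assuming each report has
already been reduced to a dict with hypothetical "kernel" and
"subsystem" keys. The factor of 3 is an arbitrary placeholder, not a
tuned value.

```python
from collections import Counter


def reports_per_release(reports, subsystem):
    """Count reports that mention `subsystem`, grouped by kernel release."""
    return Counter(r["kernel"] for r in reports if r["subsystem"] == subsystem)


def looks_like_surge(counts, old_release, new_release, factor=3.0):
    """True when the new release has at least `factor` times the old count."""
    old = counts.get(old_release, 0)
    new = counts.get(new_release, 0)
    return new > 0 and new >= factor * max(old, 1)
```

These are raw counts; they are not normalized by how many machines
run each release, so a popular release will always look noisier.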

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10 17:40                   ` Daniel Bonekeeper
@ 2006-07-10 17:59                     ` Valdis.Kletnieks
  2006-07-10 22:05                       ` Daniel Bonekeeper
  2006-07-10 18:41                     ` Horst von Brand
  1 sibling, 1 reply; 23+ messages in thread
From: Valdis.Kletnieks @ 2006-07-10 17:59 UTC (permalink / raw)
  To: Daniel Bonekeeper; +Cc: Pavel Machek, Adrian Bunk, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1277 bytes --]

On Mon, 10 Jul 2006 12:40:07 CDT, Daniel Bonekeeper said:

> real bugs. If so, they get reported here on LKML. Since we can expect,
> maybe, dozens of thousands of reports per week, wouldn't be hard to
> distinct between real bugs, etc (if we use frequency as a marker).

Actually, at that level, it *is* hard to distinguish.  I'm sure the RedHat
people have a *very* good idea of exactly how much PEBKAC cruft their bugzilla
gathers - and that's from users clued enough to bugzilla.

It might be interesting to use it to measure how many machines crap out because
of stray single-bit errors due to insufficient ECC on the hardware.

You can't use "a sudden upsurge" in reports as a good regression test, because
the vast majority of boxes are running distro kernels.  RHEL 4.0 just shipped a
2.6.9-34 kernel.  Ubuntu is on a 2.6.15.

And the people who are using kernel.org kernels aren't actually upgrading all
*that* fast either.  You'll get better info by looking at the lkml postings
that say '2.6.mumble regressed my foobar' - that will likely trigger before any
statistical tendency in bug reports gets noticed.

(Visit the bugzilla.mozilla.org, and note that neither 'most frequently
reported' nor 'reported today' give you a really good grasp on *current*
issues....)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10 17:40                   ` Daniel Bonekeeper
  2006-07-10 17:59                     ` Valdis.Kletnieks
@ 2006-07-10 18:41                     ` Horst von Brand
  2006-07-10 21:34                       ` Daniel Bonekeeper
  1 sibling, 1 reply; 23+ messages in thread
From: Horst von Brand @ 2006-07-10 18:41 UTC (permalink / raw)
  To: Daniel Bonekeeper
  Cc: Pavel Machek, Valdis.Kletnieks@vt.edu, Adrian Bunk, linux-kernel

Daniel Bonekeeper <thehazard@gmail.com> wrote:
> On 7/10/06, Pavel Machek <pavel@ucw.cz> wrote:
> > Hi!
> 
> Hi ! =)


Hi all out there!

> > Well, unless we have some volunteer to go through the bugreports and
> > sort them/kill the invalid ones/etc... this is going to do more harm
> > than good.

> As I told before, I wouldn't care to do that,

Who will, then?

>                                               as long as I know that
> it is actually being used (and useful).

If you don't care about the data...

>                                         The system (at the server
> side) could automatically

/Someone/ will have to program/configure/tweak/maintain that...

>                           route some reports (mark them as "tainted
> modules detected", etc, that sort of mechanical stuff),

Mechanical != trivial, and much less == "does it by itself, all alone"...

>                                                         and according
> to the frequency of certain bugs, I could check if they are actually
> real bugs. If so, they get reported here on LKML.

Which helps how in getting more new people up to speed and involved in bug
fixing? (Last I heard, that was the current bottleneck...)

>                                                   Since we can expect,
> maybe, dozens of thousands of reports per week, wouldn't be hard to
> distinct between real bugs, etc (if we use frequency as a marker). For
> example, if the number of reports on Suspend2 get risen up sensitively
> on some just-released kernel, this means that something that was added
> isn't working (so here comes the personal debug, where we can see if
> it's a new bug or a regression)

That kind of stuff is currently sitting in bugzillas all over the
distributions. And again, what is required is people willing to see if they
can reproduce the bug (and that may mean getting an obscure piece of
hardware, etc) and then see if they can fix it. 
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10 18:41                     ` Horst von Brand
@ 2006-07-10 21:34                       ` Daniel Bonekeeper
  2006-07-11 14:16                         ` Horst von Brand
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-10 21:34 UTC (permalink / raw)
  To: Horst von Brand
  Cc: Pavel Machek, Valdis.Kletnieks@vt.edu, Adrian Bunk, linux-kernel

On 7/10/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> Daniel Bonekeeper <thehazard@gmail.com> wrote:
> > On 7/10/06, Pavel Machek <pavel@ucw.cz> wrote:
> > > Hi!
> >
> > Hi ! =)
>
>
> Hi all out there!

Hi Horst!

> > > Well, unless we have some volunteer to go through the bugreports and
> > > sort them/kill the invalid ones/etc... this is going to do more harm
> > > than good.
>
> > As I told before, I wouldn't care to do that,
>
> Who will, then?

What I meant (as you can read in earlier messages) is that I would do
everything (from the kernelspace stuff to maintaining the server[s]
that will receive the reports and the web interface that will
classify them).

> >  as long as I know that
> > it is actually being used (and useful).
>
> If you don't care about the data...

I do! So much that I even suggested the system. =)

> > The system (at the server
> > side) could automatically
>
> /Someone/ will have to program/configure/tweak/maintain that...
>

I'll program, configure and maintain the whole system. By
"automatically" I mean classifying each report by distribution (when
possible), kernel version, release, architecture, type of hardware,
function (EIP) where the bug happened, type of bug (NULL pointer
dereference, BUG_ON(), etc.). That kind of information is already in
the report; I'll just parse it using regular expressions to extract
and classify it (so I don't need to manually check boxes for every
report that comes in).
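
Something along these lines, as a sketch only. The patterns assume
the i386 oops layout (an "EIP is at func+0x.../0x... [module]" line,
the release in parentheses on the EFLAGS line, a "Modules linked in:"
line); other architectures would need their own patterns, and the
field names in the result are made up.

```python
import re

# Patterns assume an i386-style oops layout; adjust for other architectures.
KERNEL_RE = re.compile(r"^EFLAGS:\s+[0-9a-f]+\s+\(([^\s)]+)", re.MULTILINE)
EIP_RE = re.compile(
    r"^EIP is at (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?", re.MULTILINE
)
MODULES_RE = re.compile(r"^Modules linked in:(.*)$", re.MULTILINE)


def bug_type(text):
    """Very coarse classification based on well-known oops phrases."""
    if "NULL pointer dereference" in text:
        return "null-pointer-dereference"
    if "kernel BUG at " in text:
        return "BUG"
    return "unknown"


def classify(text):
    """Extract the mechanical fields from one oops report."""
    kernel = KERNEL_RE.search(text)
    eip = EIP_RE.search(text)
    modules = MODULES_RE.search(text)
    return {
        "kernel": kernel.group(1) if kernel else None,
        "function": eip.group(1) if eip else None,
        "module": eip.group(2) if eip else None,
        "type": bug_type(text),
        "modules": modules.group(1).split() if modules else [],
    }
```

Any field that cannot be matched stays None (or an empty list), so a
partial report is still stored and simply lands in a less specific
bucket.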

> > route some reports (mark them as "tainted
> > modules detected", etc, that sort of mechanical stuff),
>
> Mechanical != trivial, and much less == "does it by itself, all alone"...
>

The system will be mechanical as far as possible. By mechanical I
don't mean "automatically detect if the bug was caused by some binary
nVidia driver messing around, because the automatic disassembler shows
that the nVidia driver is acquiring a lock and never releasing it
under that specific circumstance". I mean "if nvidia.ko is loaded,
mark the report as tainted". I won't let the system just send mails to
LKML reporting bugs unless I have already looked at them and confirmed
that something is wrong. I never intended to create a system that
magically analyzes the reports, checks the source code for bugs and
suggests fixes ( Just Impossible(tm) ), but rather to have a tool that
tells us that a certain kind of motherboard or usb controller is
Oopsing too much, always at the same point. Think of it as a really
large compatibility laboratory.
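
The "if nvidia.ko is loaded" rule is about this simple. The module
set below is a hypothetical starting list and the "modules"/"tainted"
keys are illustrative names for an already-parsed report.

```python
# Hypothetical list of binary-only modules; a real list would need maintenance.
PROPRIETARY_MODULES = {"nvidia", "fglrx"}


def mark_tainted(report):
    """Return a copy of the report with a boolean 'tainted' flag added."""
    loaded = set(report.get("modules", []))
    return {**report, "tainted": bool(loaded & PROPRIETARY_MODULES)}
```

The kernel's own "Tainted:" marker in the oops text could feed the
same flag instead of a hand-kept list.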

> > and according
> > to the frequency of certain bugs, I could check if they are actually
> > real bugs. If so, they get reported here on LKML.
>
> Which helps how in getting more new people up to speed and involved in bug
> fixing? (Last I heard, that was the current bottleneck...)
>

Well, that I don't know. There's already www.kernelnewbies.org for
that, and I see that lots of higher education institutes (aka
universities) are starting to include kernel hacking in their programs
(or at least lots of exercises involving that), so hopefully we'll get
more new people in a few months/years. Again, I believe that not having
enough people to work on bugs is not an excuse for not getting them
acknowledged and catalogued. Just because you don't have anybody to
change your car's tires, does it mean that you don't actually want to
know that they are flat ?

> > Since we can expect,
> > maybe, dozens of thousands of reports per week, wouldn't be hard to
> > distinct between real bugs, etc (if we use frequency as a marker). For
> > example, if the number of reports on Suspend2 get risen up sensitively
> > on some just-released kernel, this means that something that was added
> > isn't working (so here comes the personal debug, where we can see if
> > it's a new bug or a regression)
>
> That kind of stuff is currently sitting in bugzillas all over the
> distributions. And again, what is required is people willing to see if they
> can reproduce the bug (and that may mean getting an obscure piece of
> hardware, etc) and then see if they can fix it.

Yes, they are also full of bug reports for multimedia players,
editors, etc. I agree, every decent distribution has its bugzilla
loaded with all kinds of bugs. Also, there are probably hundreds of
distros around the world, with a good number of them being widely
used. Can you easily tell me whether people using those distros are
having frequent NULL dereferences using sata_via and a VIA VT6420 SATA
RAID Controller with rev 80, when people with the same driver and
controller weren't having such problems 2 months ago ? Or that 95% of
those people have SMP kernels ? You probably can find that out, if you
know exactly what to look for and search using lots of "contains the
string" queries (hopefully everything that you need was pasted into
the report). And then, repeat this for 10 major distros. Understand ?
Since our developers are so few and so busy, having a tool to
automatically compare that stuff is handy. Imagine that you want to
fix a bug in sata_via. You search all reports that reference the
sata_via module (where EIP is in it, for example), and you can have
statistics telling you that 70% of the bugs reported on sata_via come
from machines using some kind of proprietary driver (just made that
up). This is already a very decent clue (unfortunately I don't expect
things to be that easy, but it's a start).
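
A sketch of the kind of statistic meant here, assuming reports are
already parsed into dicts with hypothetical "module" (where EIP was),
"smp" and "tainted" keys.

```python
def share(reports, predicate):
    """Fraction of reports satisfying predicate; 0.0 for an empty list."""
    if not reports:
        return 0.0
    return sum(1 for r in reports if predicate(r)) / len(reports)


def module_summary(reports, module):
    """Summarize the reports whose faulting module is `module`."""
    hits = [r for r in reports if r.get("module") == module]
    return {
        "count": len(hits),
        "smp": share(hits, lambda r: r.get("smp", False)),
        "tainted": share(hits, lambda r: r.get("tainted", False)),
    }
```

module_summary(reports, "sata_via") then answers "how many, what
fraction on SMP, what fraction with a proprietary driver" in one
query instead of a pile of string searches per distro.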

-- 
What this world needs is a good five-dollar plasma weapon.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10 17:59                     ` Valdis.Kletnieks
@ 2006-07-10 22:05                       ` Daniel Bonekeeper
  2006-07-10 23:41                         ` Lee Revell
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-10 22:05 UTC (permalink / raw)
  To: Valdis.Kletnieks@vt.edu; +Cc: Pavel Machek, Adrian Bunk, linux-kernel

On 7/10/06, Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:
> On Mon, 10 Jul 2006 12:40:07 CDT, Daniel Bonekeeper said:
>
> > real bugs. If so, they get reported here on LKML. Since we can expect,
> > maybe, dozens of thousands of reports per week, wouldn't be hard to
> > distinct between real bugs, etc (if we use frequency as a marker).
>
> Actually, at that level, it *is* hard to distinguish.  I'm sure the RedHat
> people have a *very* good idea of exactly how much PEBKAC cruft their bugzilla
> gathers - and that's from users clued enough to bugzilla.

I believe that people at distros are already very busy solving
problems with the whole distro (thousands of programs, libraries,
etc). They have their kernel guys, they hack their own kernels, etc,
and they are very distro-specific. I intend this tool to be a generic,
kernel-only solution. Hopefully we could get distros to
incorporate these reports into their kernels (or point the kernel
report system to themselves and mirror the reports to our central
server). I'm a little concerned about receiving reports from
distro-modded kernels, since they may not be easily debugged. Anyway,
the system will take into account whether or not the kernel is
vanilla, and we can filter on that easily, so there's no problem
there.

> It might be interesting to use it to measure how many machines crap out because
> of stray single-bit errors due to insufficient ECC on the hardware.

That's a good example. Another example: a little while ago
(http://lkml.org/lkml/2006/7/1/70) Daniel Drake from Gentoo was
reporting a problem where page_mapcount(page) was going negative. As
it turned out, it was related to an nVidia proprietary driver that
the machine was running. With the system, he would just need to search
for "Eeek! page_mapcount(page) went negative! (-1)" on 2.6.16.19
kernels (maybe too generic), and he would see that lots of the people
reporting that have, among other things, nVidia drivers running.
That's already a clue on where to start looking. The same applies to
lots of other stuff.
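
A sketch of that search, assuming each stored report keeps its raw
"text", its "kernel" release and its list of loaded "modules" (all
hypothetical field names).

```python
from collections import Counter


def modules_seen_with(reports, message, kernel_prefix=""):
    """Count loaded modules across reports whose text contains `message`."""
    seen = Counter()
    for r in reports:
        if message in r["text"] and r["kernel"].startswith(kernel_prefix):
            # set() so a module listed twice in one report counts once
            seen.update(set(r["modules"]))
    return seen
```

Calling most_common() on the result puts the modules that show up in
nearly every matching report at the top.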

The main difference here is that the system isn't passive like a
bugzilla. The system could gather information about those bug reports
and start working on them, finding and pointing out
relations between the bug, the hardware and the kernel configuration.

> You can't use "a sudden upsurge" in reports as a good regression test, because
> the vast majority of boxes are running distro kernels.  RHEL 4.0 just shipped a
> 2.6.9-34 kernel.  Ubuntu is on a 2.6.15.
>

I agree with you on that. In this case, we can consider just vanilla
users (or RedHat people can do this comparison between their released
kernels, even though the focus of this system isn't distros, but
vanilla kernels). Another thing to point out is that Slackware users,
for example, run vanilla kernels (even though slackware 10.2 is
shipping with 2.6.13, people usually update to the latest one). But
really, unless people use -mm kernels or release candidates, surges
can't really be used to detect regressions. This tool would still
be useful to detect regressions after they have been in the wild for a
while (for example, people with the latest stable 2.6.17.4 are getting
problems with sata_via that they weren't getting before with the same
hardware (here we can discuss how to detect, if possible, whether it's
a regression or a new bug)).

> And the people who are using kernel.org kernels aren't actually upgrading all
> *that* fast either.  You'll get better info by looking at the lkml postings
> that say '2.6.mumble regressed my foobar' - that will likely trigger before any
> statistical tendency in bug reports gets noticed.

Agreed.

> (Visit the bugzilla.mozilla.org, and note that neither 'most frequently
> reported' nor 'reported today' give you a really good grasp on *current*
> issues....)

I don't think that I understand that correctly, but the way I see it,
if bugs don't get fixed, they are still "current issues". Everything
depends on the complexity of the statistical engine that the system
will use. If it detects that people are reporting Oopses frequently on
sata_via and a 2.6.10 kernel, but with newer kernels the bug isn't
reported, it will disregard this issue (even though it stays in
the database, so regressions can be detected).
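
The simplest possible version of that rule, as a sketch. It assumes
purely numeric dotted releases such as "2.6.17.4" (so no -rc or
distro suffixes) and a hypothetical "kernel" key on each report.

```python
def version_key(release):
    """Turn '2.6.17.4' into (2, 6, 17, 4); assumes numeric dotted releases."""
    return tuple(int(part) for part in release.split("."))


def is_current(signature_reports, newer_than):
    """A signature stays current if any report is from a release after newer_than."""
    floor = version_key(newer_than)
    return any(version_key(r["kernel"]) > floor for r in signature_reports)
```

Nothing is deleted when is_current() returns False; the reports stay
stored so a later reappearance can be flagged as a regression.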

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10 22:05                       ` Daniel Bonekeeper
@ 2006-07-10 23:41                         ` Lee Revell
  2006-07-11  1:15                           ` Daniel Bonekeeper
  0 siblings, 1 reply; 23+ messages in thread
From: Lee Revell @ 2006-07-10 23:41 UTC (permalink / raw)
  To: Daniel Bonekeeper
  Cc: Valdis.Kletnieks@vt.edu, Pavel Machek, Adrian Bunk, linux-kernel

On Mon, 2006-07-10 at 18:05 -0400, Daniel Bonekeeper wrote:
> That's a good example. Another example: a little while ago
> (http://lkml.org/lkml/2006/7/1/70) Daniel Drake from Gentoo was
> reporting a problem where page_mapcount(page) was getting negative. As
> it turned out, it was related with a nVidia proprietary driver that
> the machine was running. With the system, we just needed to search for
> "Eeek! page_mapcount(page) went negative! (-1)" on kernels 2.6.16.19
> (maybe too generic), and he would see that lots of people reporting
> that has, between other things, nVidia drivers running. It's already a
> clue on where to start looking for. The same applies for lots of other
> stuff. 

That sounds backwards to me - any kernel bug reporting system should
immediately discard bug reports with the nvidia driver loaded, as such a
kernel is not debuggable.

Lee


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10 23:41                         ` Lee Revell
@ 2006-07-11  1:15                           ` Daniel Bonekeeper
  0 siblings, 0 replies; 23+ messages in thread
From: Daniel Bonekeeper @ 2006-07-11  1:15 UTC (permalink / raw)
  To: Lee Revell
  Cc: Valdis.Kletnieks@vt.edu, Pavel Machek, Adrian Bunk, linux-kernel

On 7/10/06, Lee Revell <rlrevell@joe-job.com> wrote:
> On Mon, 2006-07-10 at 18:05 -0400, Daniel Bonekeeper wrote:
> > That's a good example. Another example: a little while ago
> > (http://lkml.org/lkml/2006/7/1/70) Daniel Drake from Gentoo was
> > reporting a problem where page_mapcount(page) was getting negative. As
> > it turned out, it was related with a nVidia proprietary driver that
> > the machine was running. With the system, we just needed to search for
> > "Eeek! page_mapcount(page) went negative! (-1)" on kernels 2.6.16.19
> > (maybe too generic), and he would see that lots of people reporting
> > that has, between other things, nVidia drivers running. It's already a
> > clue on where to start looking for. The same applies for lots of other
> > stuff.
>
> That sounds backwards to me - any kernel bug reporting system should
> immediately discard bug reports with the nvidia driver loaded, as such a
> kernel is not debuggable.

Our job is the kernel, and anything related to it should not be discarded.
The system has to have the flexibility to provide the same information
ignoring tainted configurations, if that's what you need ("provide
mechanism, not policy").
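
In other words, tainted reports are kept and the filter is the
caller's choice. A tiny sketch, with a hypothetical "tainted" flag on
each stored report:

```python
def select_reports(reports, include_tainted=True):
    """Return all reports, or only untainted ones when the caller asks for that."""
    if include_tainted:
        return list(reports)
    return [r for r in reports if not r.get("tainted", False)]
```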

Daniel

-- 
What this world needs is a good five-dollar plasma weapon.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Automatic Kernel Bug Report
  2006-07-10 21:34                       ` Daniel Bonekeeper
@ 2006-07-11 14:16                         ` Horst von Brand
  0 siblings, 0 replies; 23+ messages in thread
From: Horst von Brand @ 2006-07-11 14:16 UTC (permalink / raw)
  To: Daniel Bonekeeper
  Cc: Pavel Machek, Valdis.Kletnieks@vt.edu, Adrian Bunk, linux-kernel

Daniel Bonekeeper <thehazard@gmail.com> wrote:
> On 7/10/06, Horst von Brand <vonbrand@inf.utfsm.cl> wrote:
> > Daniel Bonekeeper <thehazard@gmail.com> wrote:
> > > On 7/10/06, Pavel Machek <pavel@ucw.cz> wrote:

[...]

> > > > Well, unless we have some volunteer to go through the bugreports and
> > > > sort them/kill the invalid ones/etc... this is going to do more harm
> > > > than good.

> > > As I told before, I wouldn't care to do that,

> > Who will, then?

> What I meant (as you can read on earlier messages) is that I would do
> everything (from kernelspace stuff to maintain the server[s] that will
> receive the reports and the web interface that will classify them)

But to be able to do a halfway credible job at that you /must/ look at the
(ever changing) data flowing through it, and know enough about what it
means (i.e., be rather intimately involved in its use) to do it right...

[...]

> > > route some reports (mark them as "tainted
> > > modules detected", etc, that sort of mechanical stuff),

> > Mechanical != trivial, and much less == "does it by itself, all alone"...

> The system will be mechanical as long as possible. By mechanical I
> don't mean "automatically detect if the bug was caused by some binary
> nVidia driver messing around, because the automatic disassembler shows
> that the nVidia driver is acquiring a lock and never releasing it
> under that specific circunstance". I mean "if nvidia.ko is loaded,
> mark the report as tainted". I won't let the system just send mails to
> LKML reporting bugs unless I already looked at them and confirmed that
> something is wrong. I never intended to create a system that magically
> analyzes the reports and check the source code for bugs and suggest
> fixes ( Just Impossible(tm) ), but rather have a tool where we can
> know that a certain kind of motherboard or usb controller is Oopsing
> too much, always at some point. Think of it as a really large
> compatibility laboratory.

Sounds like a job for massive data-mining... could it be applied to the
bugzillas of the distributions? I bet they are doing that right now (or
would be, if they had the resources to do so...).

> > > and according
> > > to the frequency of certain bugs, I could check if they are actually
> > > real bugs. If so, they get reported here on LKML.

> > Which helps how in getting more new people up to speed and involved in bug
> > fixing? (Last I heard, that was the current bottleneck...)

> Well, that I don't know. There's already www.kernelnewbies.org for
> that, and I see that lots of higher education institutes (aka
> universities) are starting to include kernel hacking in their programs
> (or at least lots of exercises involving that), so hopefully we'll get
> more new people in a few month/years. Again, I believe that not having
> enough people to work on bugs is not an excuse to not get them
> acknowledged and catalogued. Just because you don't have anybody to
> change your car's tires, does it mean that you don't actually want to
> know that they are flat ?

To see that the tires are flat or that the engine stopped requires very
little knowledge of automotive engineering, to find out why and how to fix
it are in a different league. And adding the right instrumentation points
and the measuring equipment plus "artificial intelligence" to make use of
them in the field was probably a large part of the development effort over
the last decade or so.

> > > Since we can expect,
> > > maybe, dozens of thousands of reports per week, wouldn't be hard to
> > > distinct between real bugs, etc (if we use frequency as a marker). For
> > > example, if the number of reports on Suspend2 get risen up sensitively
> > > on some just-released kernel, this means that something that was added
> > > isn't working (so here comes the personal debug, where we can see if
> > > it's a new bug or a regression)

> > That kind of stuff is currently sitting in bugzillas all over the
> > distributions. And again, what is required is people willing to see if they
> > can reproduce the bug (and that may mean getting an obscure piece of
> > hardware, etc) and then see if they can fix it.

> Yes, they are also full of bug reports for multimedia players,
> editors, etc. I agree, every decent distribution have their bugzillas
> loaded with all kinds of bugs.

Neatly sorted by (the user's perception of) the cause.

>                                Also, there are probably hundreds of
> distros around the world, with a good number of them being widely
> used. Can you easily let me know if people using those distros are
> having frequent NULL dereferences using sata_via and a VIA VT6420 SATA
> RAID Controller with rev 80, when people with the same driver and
> controller weren't having such problems 2 months ago ? Or that 95% of
> those people have SMP kernels ? You probably can know that, if you
> know what exactly to look for and search using lots of "contains the
> string" (hopefully everything that you need was pasted on the report).
> And then, repeat this for 10 major distros. Understand ? Since our
> developers are so few and so busy, having a tool to automatically
> compare that stuff is handy. Imagine that you want to fix a bug in
> sata_via. You search all references to sata_via module (where EIP is
> on it, for example), and you can have statistics telling you that 70%
> of bugs reported on sata_via are caused by machines using some kind of
> proprietary driver (just made that up). This is already a very decent
> clue (unfortunatelly I don't expect things to be that easy, but it's a
> start).

Problem is that the "proprietary driver" will most probably cause all kinds
of /different/ havoc in the various configurations out there. Scribbling
over memory at address foo won't always crash driver bar. And the same can
be said of many other problem origins. So your only reliable bet is to have
a single configuration, i.e., the kernel version shipped by /one/
distribution.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2006-07-11 14:19 UTC | newest]

Thread overview: 23+ messages
2006-07-09  8:45 Automatic Kernel Bug Report Daniel Bonekeeper
2006-07-09  9:45 ` Michal Piotrowski
2006-07-09 10:29   ` Daniel Bonekeeper
2006-07-09 12:58     ` Adrian Bunk
2006-07-09 18:46       ` Daniel Bonekeeper
2006-07-09 19:11         ` Adrian Bunk
2006-07-09 20:01           ` Daniel Bonekeeper
2006-07-09 20:19             ` Valdis.Kletnieks
2006-07-09 20:27               ` Daniel Bonekeeper
2006-07-10  8:11                 ` Pavel Machek
2006-07-10 17:40                   ` Daniel Bonekeeper
2006-07-10 17:59                     ` Valdis.Kletnieks
2006-07-10 22:05                       ` Daniel Bonekeeper
2006-07-10 23:41                         ` Lee Revell
2006-07-11  1:15                           ` Daniel Bonekeeper
2006-07-10 18:41                     ` Horst von Brand
2006-07-10 21:34                       ` Daniel Bonekeeper
2006-07-11 14:16                         ` Horst von Brand
2006-07-09 20:24             ` Diego Calleja
2006-07-09 20:37               ` Daniel Bonekeeper
2006-07-09 22:19                 ` Diego Calleja
2006-07-09 22:49                   ` Daniel Bonekeeper
2006-07-09 20:25             ` Jesper Juhl
