oops/warning report for the week of November 26, 2008


* oops/warning report for the week of November 26, 2008
@ 2008-11-26 23:11 Arjan van de Ven
  2008-11-27  0:05 ` Jesse Barnes
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-26 23:11 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Linus Torvalds, NetDev, x86, Andrew Morton, Theodore Ts'o,
	Alan Cox, jesse Barnes

In collecting this report, oopses and warnings with versions prior to 2.6.27 are ignored.
This week, a total of 5450 oopses and warnings were reported for versions 2.6.27 and later,
compared to 2198 reports in the previous week.

This report is a bit different from previous weeks: issues from 2.6.26 and earlier are no
longer counted, which means the top 12 has shuffled quite a bit, with some new star appearances.

Also I've reworked the "are these two backtraces the same" algorithm; the website should now
be presenting a more compact/concise view due to having the backtraces consolidated in a much
more logical (for the human) way.


Per file statistics
936	external/virtualbox/module
602	drivers/pci/slot.c
455	drivers/net/wireless/iwlwifi/iwl-tx.c
364	kernel/power/main.c
274	drivers/net/r8169.c
231	drivers/net/wireless/iwlwifi/iwl-3945-rs.c
231	fs/jbd/journal.c
227	arch/x86/include/asm/mtrr.h
147	drivers/ata/libata-sff.c
137	drivers/net/sis900.c
71	net/ipv4/tcp.c
62	drivers/gpu/drm/radeon/radeon_cp.c


Rank 1: VBoxDrvLinuxIOCtl (warning)
	Reported 934 times (1635 total reports)
	[external] bug in the VirtualBox drivers
	This warning was last seen in version 2.6.28-rc3, and first seen in 2.6.25.11.
	More info: http://www.kerneloops.org/searchweek.php?search=VBoxDrvLinuxIOCtl

Rank 2: pci_create_slot (warning)
	Reported 603 times (639 total reports)
	BIOS provided duplicated slot names, the PCI layer blindly passes to sysfs
	This warning was last seen in version 2.6.27.5, and first seen in 2.6.27-rc7-git1.
	More info: http://www.kerneloops.org/searchweek.php?search=pci_create_slot

Rank 3: iwl_tx_cmd_complete (warning)
	Reported 455 times (693 total reports)
	Bug in the IWL wireless driver; partial fix available
	This warning was last seen in version 2.6.28-rc4, and first seen in 2.6.27-rc9.
	More info: http://www.kerneloops.org/searchweek.php?search=iwl_tx_cmd_complete

Rank 4: suspend_test_finish (warning)
	Reported 362 times (1202 total reports)
	Fedora is shipping with the suspend test on.. and it's failing everywhere.
	The patch to report what fails is in 2.6.28-rc6 and later
	This warning was last seen in version 2.6.28-rc1, and first seen in 2.6.27-rc0-git14.
	More info: http://www.kerneloops.org/searchweek.php?search=suspend_test_finish

Rank 5: dev_watchdog(r8169) (oops)
	Reported 274 times (1414 total reports)
	Network driver not handling timeouts itself.
	This oops was last seen in version 2.6.28-rc4, and first seen in 2.6.26.6.
	More info: http://www.kerneloops.org/searchweek.php?search=dev_watchdog(r8169)

Rank 6: rs_get_rate (oops)
	Reported 232 times (1152 total reports)
	Bug in the Intel IWL wireless drivers
	This oops was last seen in version 2.6.27.5, and first seen in 2.6.25-rc2-git5.
	More info: http://www.kerneloops.org/searchweek.php?search=rs_get_rate

Rank 7: journal_update_superblock (warning)
	Reported 231 times (6506 total reports)
	Likely caused by the user removing a USB stick while mounted
	This warning was last seen in version 2.6.27.7, and first seen in 2.6.24-rc6-git1.
	More info: http://www.kerneloops.org/searchweek.php?search=journal_update_superblock

Rank 8: mtrr_trim_uncached_memory (warning)
	Reported 227 times (619 total reports)
	There is a high number of machines where our MTRR checks trigger. I suspect we are too
	picky in accepting the MTRR configuration.
	This warning was last seen in version 2.6.27.5, and first seen in 2.6.24.
	More info: http://www.kerneloops.org/searchweek.php?search=mtrr_trim_uncached_memory

Rank 9: __atapi_pio_bytes (warning)
	Reported 146 times (224 total reports)
	Alan said this was due to some other layer giving the libata drivers a weird
	scatter gather list. It just happens a lot, and somehow it mostly happens in
	virtualized environments
	This warning was last seen in version 2.6.27.5, and first seen in 2.6.27.4.
	More info: http://www.kerneloops.org/searchweek.php?search=__atapi_pio_bytes

Rank 10: dev_watchdog(sis900) (oops)
	Reported 137 times (1538 total reports)
	This oops was last seen in version 2.6.27.6, and first seen in 2.6.26-rc4-git2.
	More info: http://www.kerneloops.org/searchweek.php?search=dev_watchdog(sis900)

Rank 11: tcp_recvmsg (warning)
	Reported 71 times (167 total reports)
	This warning was last seen in version 2.6.27.5, and first seen in 2.6.25.
	More info: http://www.kerneloops.org/searchweek.php?search=tcp_recvmsg

Rank 12: dev_watchdog(atl1) (oops)
	Reported 56 times (109 total reports)
	This oops was last seen in version 2.6.27.5, and first seen in 2.6.26.6.
	More info: http://www.kerneloops.org/searchweek.php?search=dev_watchdog(atl1)

Rank 13: nv_set_page_attrib_cached (warning)
	Reported 56 times (65 total reports)
	[external] bug in the binary nvidia driver
	warning only shows up in tainted kernels
	This warning was last seen in version 2.6.27.5, and first seen in 2.6.27.5.
	More info: http://www.kerneloops.org/searchweek.php?search=nv_set_page_attrib_cached


* Re: oops/warning report for the week of November 26, 2008
  2008-11-26 23:11 oops/warning report for the week of November 26, 2008 Arjan van de Ven
@ 2008-11-27  0:05 ` Jesse Barnes
  2008-11-27 11:48   ` Ingo Molnar
  2008-11-27 19:42   ` Alex Chiang
  2008-11-27 11:52 ` Ingo Molnar
  2008-11-28 17:18 ` Jay Cliburn
  2 siblings, 2 replies; 25+ messages in thread
From: Jesse Barnes @ 2008-11-27  0:05 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linux Kernel Mailing List, Linus Torvalds, NetDev, x86,
	Andrew Morton, Theodore Ts'o, Alan Cox

On Wednesday, November 26, 2008 3:11 pm Arjan van de Ven wrote:
> Rank 2: pci_create_slot (warning)
> 	Reported 603 times (639 total reports)
> 	BIOS provided duplicated slot names, the PCI layer blindly passes to sysfs
> 	This warning was last seen in version 2.6.27.5, and first seen in
> 2.6.27-rc7-git1. More info:
> http://www.kerneloops.org/searchweek.php?search=pci_create_slot

IIRC we fixed this one post-2.6.27.  I didn't send the patches back to -stable 
because they were a bit big, but if someone were sufficiently motivated I'm 
sure the backport wouldn't be that hard...

Jesse


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27  0:05 ` Jesse Barnes
@ 2008-11-27 11:48   ` Ingo Molnar
  2008-11-27 19:42   ` Alex Chiang
  1 sibling, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2008-11-27 11:48 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Arjan van de Ven, Linux Kernel Mailing List, Linus Torvalds,
	NetDev, x86, Andrew Morton, Theodore Ts'o, Alan Cox


* Jesse Barnes <jbarnes@virtuousgeek.org> wrote:

> On Wednesday, November 26, 2008 3:11 pm Arjan van de Ven wrote:
> > Rank 2: pci_create_slot (warning)
> > 	Reported 603 times (639 total reports)
> > 	BIOS provided duplicated slot names, the PCI layer blindly passes to sysfs
> > 	This warning was last seen in version 2.6.27.5, and first seen in
> > 2.6.27-rc7-git1. More info:
> > http://www.kerneloops.org/searchweek.php?search=pci_create_slot
> 
> IIRC we fixed this one post-2.6.27.  I didn't send the patches back 
> to -stable because they were a bit big, but if someone were 
> sufficiently motivated I'm sure the backport wouldn't be that 
> hard...

having the commit IDs mentioned here would be nice, should anyone feel 
motivated.

	Ingo


* Re: oops/warning report for the week of November 26, 2008
  2008-11-26 23:11 oops/warning report for the week of November 26, 2008 Arjan van de Ven
  2008-11-27  0:05 ` Jesse Barnes
@ 2008-11-27 11:52 ` Ingo Molnar
  2008-11-27 17:02   ` Jesse Barnes
  2008-11-27 18:01   ` Arjan van de Ven
  2008-11-28 17:18 ` Jay Cliburn
  2 siblings, 2 replies; 25+ messages in thread
From: Ingo Molnar @ 2008-11-27 11:52 UTC (permalink / raw)
  To: Arjan van de Ven, Yinghai Lu
  Cc: Linux Kernel Mailing List, Linus Torvalds, NetDev, x86,
	Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes

* Arjan van de Ven <arjan@linux.intel.com> wrote:

> Rank 8: mtrr_trim_uncached_memory (warning)
> 	Reported 227 times (619 total reports)
> 	There is a high number of machines where our MTRR checks 
> 	trigger. I suspect we are too picky in accepting the MTRR 
> 	configuration.

the warning here means: "the BIOS messed up but we fixed it up for 
you just fine".

Should we print a DMI descriptor so that it can be tracked back to the 
bad BIOSen in question? Or should we (partially) silence the warning 
itself? Those BIOS bugs need fixing really: older kernels will boot up 
with bad MTRR settings - resulting in a super-slow system or other 
weirdnesses. We can tone down the message so that it doesn't show up in 
kerneloops.org. It's up to you.

	Ingo


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 11:52 ` Ingo Molnar
@ 2008-11-27 17:02   ` Jesse Barnes
  2008-11-27 18:01   ` Arjan van de Ven
  1 sibling, 0 replies; 25+ messages in thread
From: Jesse Barnes @ 2008-11-27 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arjan van de Ven, Yinghai Lu, Linux Kernel Mailing List,
	Linus Torvalds, NetDev, x86, Andrew Morton, Theodore Ts'o,
	Alan Cox

On Thursday, November 27, 2008 3:52 am Ingo Molnar wrote:
> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> > Rank 8: mtrr_trim_uncached_memory (warning)
> > 	Reported 227 times (619 total reports)
> > 	There is a high number of machines where our MTRR checks
> > 	trigger. I suspect we are too picky in accepting the MTRR
> > 	configuration.
>
> the warning here means: "the BIOS messed up but we fixed it up for
> you just fine".
>
> Should we print a DMI descriptor so that it can be tracked back to the
> bad BIOSen in question? Or should we (partially) silence the warning
> itself? Those BIOS bugs need fixing really: older kernels will boot up
> with bad MTRR settings - resulting in a super-slow system or other
> weirdnesses. We can tone down the message so that it doesn't show up in
> kerneloops.org. It's up to you.

I actually think we're doing something wrong here, since so many platforms 
have this behavior.  It's likely that there's an undocumented, additional 
check needed to determine whether a slot is hot pluggable.  Matthew Garrett 
recently posted a patch to check for ACPI _RMV methods, which should be an 
improvement.  I'll be putting that into linux-next soon for testing.

-- 
Jesse Barnes, Intel Open Source Technology Center


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 11:52 ` Ingo Molnar
  2008-11-27 17:02   ` Jesse Barnes
@ 2008-11-27 18:01   ` Arjan van de Ven
  2008-11-27 20:18     ` Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-27 18:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yinghai Lu, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes

Ingo Molnar wrote:
> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> 
>> Rank 8: mtrr_trim_uncached_memory (warning)
>> 	Reported 227 times (619 total reports)
>> 	There is a high number of machines where our MTRR checks 
>> 	trigger. I suspect we are too picky in accepting the MTRR 
>> 	configuration.
> 
> the warning here means: "the BIOS messed up but we fixed it up for 
> you just fine".

I don't believe that right now.
we see so many of these, including many "there's no MTRRs at all",
that I am seriously suspecting that our code is just incorrect somehow
and triggering too much.



* Re: oops/warning report for the week of November 26, 2008
  2008-11-27  0:05 ` Jesse Barnes
  2008-11-27 11:48   ` Ingo Molnar
@ 2008-11-27 19:42   ` Alex Chiang
  2008-11-27 19:49     ` Arjan van de Ven
  1 sibling, 1 reply; 25+ messages in thread
From: Alex Chiang @ 2008-11-27 19:42 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Arjan van de Ven, Linux Kernel Mailing List, Linus Torvalds,
	NetDev, x86, Andrew Morton, Theodore Ts'o, Alan Cox

* Jesse Barnes <jbarnes@virtuousgeek.org>:
> On Wednesday, November 26, 2008 3:11 pm Arjan van de Ven wrote:
> > Rank 2: pci_create_slot (warning)
> > 	Reported 603 times (639 total reports)
> > 	BIOS provided duplicated slot names, the PCI layer blindly passes to sysfs
> > 	This warning was last seen in version 2.6.27.5, and first seen in
> > 2.6.27-rc7-git1. More info:
> > http://www.kerneloops.org/searchweek.php?search=pci_create_slot
> 
> IIRC we fixed this one post-2.6.27.  I didn't send the patches back to -stable 
> because they were a bit big, but if someone were sufficiently motivated I'm 
> sure the backport wouldn't be that hard...

I can do this backport. A few questions though...

We're seeing a proliferation of this one presumably because
Fedora10 uses 2.6.27.5 as a starting point? If I just backport
the fixes against Greg's latest tree, do I have to do anything
special to make sure they get into the Fedora kernel?

Also, does kerneloops capture any of the machine information,
like DMI output, etc. or does it just get the oops? It would be
nice to see which machines out there have the broken BIOS that
causes this oops.

Thanks.

/ac



* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 19:42   ` Alex Chiang
@ 2008-11-27 19:49     ` Arjan van de Ven
  0 siblings, 0 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-27 19:49 UTC (permalink / raw)
  To: Alex Chiang
  Cc: Jesse Barnes, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox

On Thu, 27 Nov 2008 12:42:10 -0700
Alex Chiang <achiang@hp.com> wrote:

> * Jesse Barnes <jbarnes@virtuousgeek.org>:
> > On Wednesday, November 26, 2008 3:11 pm Arjan van de Ven wrote:
> > > Rank 2: pci_create_slot (warning)
> > > 	Reported 603 times (639 total reports)
> > > 	BIOS provided duplicated slot names, the PCI layer
> > > blindly passes to sysfs This warning was last seen in version
> > > 2.6.27.5, and first seen in 2.6.27-rc7-git1. More info:
> > > http://www.kerneloops.org/searchweek.php?search=pci_create_slot
> > 
> > IIRC we fixed this one post-2.6.27.  I didn't send the patches back
> > to -stable because they were a bit big, but if someone were
> > sufficiently motivated I'm sure the backport wouldn't be that
> > hard...
> 
> I can do this backport. A few questions though...
> 
> We're seeing a proliferation of this one presumably because
> Fedora10 uses 2.6.27.5 as a starting point? If I just backport
> the fixes against Greg's latest tree, do I have to do anything
> special to make sure they get into the Fedora kernel?

Fedora tends to follow -stable quite closely so that ought to be enough

> 
> Also, does kerneloops capture any of the machine information,
> like DMI output, etc. or does it just get the oops? It would be
> nice to see which machines out there have the broken BIOS that
> causes this oops.

right now we do this for oopses, but not for warnings ;(
I'll make a patch to add this; it's generally useful.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 18:01   ` Arjan van de Ven
@ 2008-11-27 20:18     ` Ingo Molnar
  2008-11-27 20:28       ` Arjan van de Ven
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2008-11-27 20:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Yinghai Lu, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> Ingo Molnar wrote:
>> * Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>>> Rank 8: mtrr_trim_uncached_memory (warning)
>>> 	Reported 227 times (619 total reports)
>>> 	There is a high number of machines where our MTRR checks trigger. I
>>> 	suspect we are too picky in accepting the MTRR configuration.
>>
>> the warning here means: "the BIOS messed up but we fixed it up for you 
>> just fine".
>
> I don't believe that right now. we see so many of these, including 
> many "there's no MTRRs at all", that I am seriously suspecting that 
> our code is just incorrect somehow and triggering too much.

well we looked at existing reports and Linux was right to fix them up. 
Show us one that is incorrect, then we can fix it up.

the "no MTRR's" are vmware/(also qemu?) guests not implementing a full 
CPU emulation.

	Ingo


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 20:18     ` Ingo Molnar
@ 2008-11-27 20:28       ` Arjan van de Ven
  2008-11-27 20:47         ` Ingo Molnar
                           ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-27 20:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yinghai Lu, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes

On Thu, 27 Nov 2008 21:18:36 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> 
> > Ingo Molnar wrote:
> >> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> >>
> >>> Rank 8: mtrr_trim_uncached_memory (warning)
> >>> 	Reported 227 times (619 total reports)
> >>> 	There is a high number of machines where our MTRR checks
> >>> 	trigger. I suspect we are too picky in accepting the MTRR
> >>> 	configuration.
> >>
> >> the warning here means: "the BIOS messed up but we fixed it up for
> >> you just fine".
> >
> > I don't believe that right now. we see so many of these, including 
> > many "there's no MTRRs at all", that I am seriously suspecting that 
> > our code is just incorrect somehow and triggering too much.
> 
> well we looked at existing reports and Linux was right to fix them
> up. Show us one that is incorrect, then we can fix it up.
> 
> the "no MTRR's" are vmware/(also qemu?) guests not implementing a
> full CPU emulation.

... and it's still our fault in part, since we don't even check to see
if a cpu claims to support MTRR before complaining about it...

easy to fix though:

From 7e987ae541c41ce908b414fee9d8e2fd2099a083 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Thu, 27 Nov 2008 12:25:47 -0800
Subject: [PATCH] x86: make sure the CPU advertizes MTRR support before complaining about the lack thereof...

We complain loudly if a CPU does not have MTRR support... but we don't check if the CPU
exposes MTRR support in the CPUID flags first. While this might not fix all of the
broken virtualization systems out there, it will at least fix those that properly don't
advertize things they don't support.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 arch/x86/kernel/cpu/mtrr/main.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
index 1159e26..0044e61 100644
--- a/arch/x86/kernel/cpu/mtrr/main.c
+++ b/arch/x86/kernel/cpu/mtrr/main.c
@@ -1567,6 +1567,8 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
 	 * Make sure we only trim uncachable memory on machines that
 	 * support the Intel MTRR architecture:
 	 */
+	if (!cpu_has_mtrr)
+		return 0;
 	if (!is_cpu(INTEL) || disable_mtrr_trim)
 		return 0;
 	rdmsr(MTRRdefType_MSR, def, dummy);
-- 
1.6.0.4
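
The guard this patch adds can be sketched in userspace C. This is a minimal illustration, not the kernel code: cpuid_edx_has_mtrr() and should_check_mtrr_trim() are hypothetical names, and the only hardware fact assumed is that the MTRR feature flag is bit 12 of EDX from CPUID leaf 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 reports MTRR support in EDX bit 12. */
#define CPUID_EDX_MTRR_BIT 12

/* Hypothetical helper: true if the leaf-1 EDX value advertises MTRRs. */
bool cpuid_edx_has_mtrr(uint32_t edx)
{
	return (edx >> CPUID_EDX_MTRR_BIT) & 1u;
}

/*
 * Hypothetical stand-in for the early returns at the top of
 * mtrr_trim_uncached_memory(): bail out quietly unless the CPU advertises
 * MTRRs, an Intel-style MTRR interface is in use, and trimming has not
 * been disabled.
 */
bool should_check_mtrr_trim(uint32_t cpuid1_edx, bool intel_style_if,
			    bool disable_mtrr_trim)
{
	if (!cpuid_edx_has_mtrr(cpuid1_edx))
		return false;
	if (!intel_style_if || disable_mtrr_trim)
		return false;
	return true;
}
```

The feature-bit test comes first so that a CPU which honestly reports "no MTRRs" never reaches the code that complains about them.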



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 20:28       ` Arjan van de Ven
@ 2008-11-27 20:47         ` Ingo Molnar
  2008-11-27 20:53           ` Arjan van de Ven
  2008-11-27 21:18         ` H. Peter Anvin
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2008-11-27 20:47 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Yinghai Lu, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes,
	H. Peter Anvin


* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On Thu, 27 Nov 2008 21:18:36 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Arjan van de Ven <arjan@linux.intel.com> wrote:
> > 
> > > Ingo Molnar wrote:
> > >> * Arjan van de Ven <arjan@linux.intel.com> wrote:
> > >>
> > >>> Rank 8: mtrr_trim_uncached_memory (warning)
> > >>> 	Reported 227 times (619 total reports)
> > >>> 	There is a high number of machines where our MTRR checks
> > >>> 	trigger. I suspect we are too picky in accepting the MTRR
> > >>> 	configuration.
> > >>
> > >> the warning here means: "the BIOS messed up but we fixed it up for
> > >> you just fine".
> > >
> > > I don't believe that right now. we see so many of these, including 
> > > many "there's no MTRRs at all", that I am seriously suspecting that 
> > > our code is just incorrect somehow and triggering too much.
> > 
> > well we looked at existing reports and Linux was right to fix them
> > up. Show us one that is incorrect, then we can fix it up.
> > 
> > the "no MTRR's" are vmware/(also qemu?) guests not implementing a
> > full CPU emulation.
> 
> ... and it's still our fault in part, since we don't even check to 
> see if a cpu claims to support MTRR before complaining about it...
> 
> easy to fix though:

IIRC the problem is that vmware _does_ claim that it supports MTRRs. 

	Ingo


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 20:47         ` Ingo Molnar
@ 2008-11-27 20:53           ` Arjan van de Ven
  2008-11-28  8:34             ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-27 20:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Yinghai Lu, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes,
	H. Peter Anvin

On Thu, 27 Nov 2008 21:47:14 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> IIRC the problem is that vmware _does_ claim that it supports MTRRs. 

it might.
but even if they would fix that, we would still WARN (
at least we should do our side correctly...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 20:28       ` Arjan van de Ven
  2008-11-27 20:47         ` Ingo Molnar
@ 2008-11-27 21:18         ` H. Peter Anvin
  2008-11-27 21:18         ` Yinghai Lu
  2008-11-27 21:42         ` H. Peter Anvin
  3 siblings, 0 replies; 25+ messages in thread
From: H. Peter Anvin @ 2008-11-27 21:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Yinghai Lu, Linux Kernel Mailing List,
	Linus Torvalds, NetDev, x86, Andrew Morton, Theodore Ts'o,
	Alan Cox, jesse Barnes

Arjan van de Ven wrote:
> +	if (!cpu_has_mtrr)
> +		return 0;
>  	if (!is_cpu(INTEL) || disable_mtrr_trim)
>  		return 0;
>  	rdmsr(MTRRdefType_MSR, def, dummy);

cpu_has_mtrr there should presumably replace is_cpu(INTEL).  I'm not 
sure if this can be replaced by use_intel(); in particular use_intel() 
relies on mtrr_if having been initialized.

Looking...

	-hpa (out of town for Thanksgiving)


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 20:28       ` Arjan van de Ven
  2008-11-27 20:47         ` Ingo Molnar
  2008-11-27 21:18         ` H. Peter Anvin
@ 2008-11-27 21:18         ` Yinghai Lu
  2008-11-27 21:42         ` H. Peter Anvin
  3 siblings, 0 replies; 25+ messages in thread
From: Yinghai Lu @ 2008-11-27 21:18 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes

Arjan van de Ven wrote:
> On Thu, 27 Nov 2008 21:18:36 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
>> * Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>>> Ingo Molnar wrote:
>>>> * Arjan van de Ven <arjan@linux.intel.com> wrote:
>>>>
>>>>> Rank 8: mtrr_trim_uncached_memory (warning)
>>>>> 	Reported 227 times (619 total reports)
>>>>> 	There is a high number of machines where our MTRR checks
>>>>> 	trigger. I suspect we are too picky in accepting the MTRR
>>>>> 	configuration.
>>>> the warning here means: "the BIOS messed up but we fixed it up for
>>>> you just fine".
>>> I don't believe that right now. we see so many of these, including 
>>> many "there's no MTRRs at all", that I am seriously suspecting that 
>>> our code is just incorrect somehow and triggering too much.
>> well we looked at existing reports and Linux was right to fix them
>> up. Show us one that is incorrect, then we can fix it up.
>>
>> the "no MTRR's" are vmware/(also qemu?) guests not implementing a
>> full CPU emulation.
> 
> ... and it's still our fault in part, since we don't even check to see
> if a cpu claims to support MTRR before complaining about it...
> 
> easy to fix though:
> 
> From 7e987ae541c41ce908b414fee9d8e2fd2099a083 Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <arjan@linux.intel.com>
> Date: Thu, 27 Nov 2008 12:25:47 -0800
> Subject: [PATCH] x86: make sure the CPU advertizes MTRR support before complaining about the lack thereof...
> 
> We complain loudly if a CPU does not have MTRR support... but we don't check if the CPU
> exposes MTRR support in the CPUID flags first. While this might not fix all of the
> broken virtualization systems out there, it will at least fix those that properly don't
> advertize things they don't support.
> 
> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/mtrr/main.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> index 1159e26..0044e61 100644
> --- a/arch/x86/kernel/cpu/mtrr/main.c
> +++ b/arch/x86/kernel/cpu/mtrr/main.c
> @@ -1567,6 +1567,8 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
>  	 * Make sure we only trim uncachable memory on machines that
>  	 * support the Intel MTRR architecture:
>  	 */
> +	if (!cpu_has_mtrr)
> +		return 0;

That is not needed: we already check that in mtrr_bp_init before this function is called, and it will assign mtrr_if.

And
#define is_cpu(vnd)     (mtrr_if && mtrr_if->vendor == X86_VENDOR_##vnd)

will make sure MTRR support is there.

PS: here INTEL means any CPU that has the same interface as Intel CPUs.

YH

>  	if (!is_cpu(INTEL) || disable_mtrr_trim)
>  		return 0;
>  	rdmsr(MTRRdefType_MSR, def, dummy);



* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 20:28       ` Arjan van de Ven
                           ` (2 preceding siblings ...)
  2008-11-27 21:18         ` Yinghai Lu
@ 2008-11-27 21:42         ` H. Peter Anvin
  3 siblings, 0 replies; 25+ messages in thread
From: H. Peter Anvin @ 2008-11-27 21:42 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Yinghai Lu, Linux Kernel Mailing List,
	Linus Torvalds, NetDev, x86, Andrew Morton, Theodore Ts'o,
	Alan Cox, jesse Barnes

Arjan van de Ven wrote:
> 
> diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> index 1159e26..0044e61 100644
> --- a/arch/x86/kernel/cpu/mtrr/main.c
> +++ b/arch/x86/kernel/cpu/mtrr/main.c
> @@ -1567,6 +1567,8 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
>  	 * Make sure we only trim uncachable memory on machines that
>  	 * support the Intel MTRR architecture:
>  	 */
> +	if (!cpu_has_mtrr)
> +		return 0;
>  	if (!is_cpu(INTEL) || disable_mtrr_trim)
>  		return 0;
>  	rdmsr(MTRRdefType_MSR, def, dummy);

Okay... is_cpu() here is defined as:

#define is_cpu(vnd)      (mtrr_if && mtrr_if->vendor == X86_VENDOR_##vnd)

... so an MTRR interface has been identified.  Therefore testing 
cpu_has_mtrr is redundant.

As far as use_intel() versus is_cpu(INTEL), it looks to me as though the 
two are identical in the current code -- mtrr_if->vendor is never set in 
the generic code, and so defaults to 0 - meaning X86_VENDOR_INTEL.

All in all, it looks like the vendor ID stuff is a bad case of "works by 
accident" in the MTRR code, however, *given the current code* I conclude 
that is_cpu(INTEL) == use_intel() and that neither can be true without 
MTRRs enabled.
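
The conclusion above can be checked with a small userspace sketch. It assumes, as stated in this message, that X86_VENDOR_INTEL is 0 and that the generic code never sets the vendor field. The struct mtrr_ops here is a hypothetical cut-down version holding only that field, and is_intel_style() is a helper invented for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define X86_VENDOR_INTEL 0	/* assumed value, per the message above */
#define X86_VENDOR_AMD   2	/* illustrative non-zero vendor */

/* Hypothetical cut-down ops structure: only the vendor field is kept. */
struct mtrr_ops {
	int vendor;
};

/* NULL until an MTRR interface has been selected. */
const struct mtrr_ops *mtrr_if = NULL;

/* Same shape as the macro quoted above. */
#define is_cpu(vnd) (mtrr_if && mtrr_if->vendor == X86_VENDOR_##vnd)

/* The generic ops never set .vendor, so it is zero-initialized. */
const struct mtrr_ops generic_mtrr_ops = { 0 };
const struct mtrr_ops amd_mtrr_ops = { .vendor = X86_VENDOR_AMD };

/* Hypothetical helper so callers can evaluate the macro as a function. */
bool is_intel_style(const struct mtrr_ops *ops)
{
	mtrr_if = ops;
	return is_cpu(INTEL);
}
```

With no interface selected the macro is false, and with the zero-initialized generic ops it is true, which is the "works by accident" behaviour described above.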

	-hpa


* Re: oops/warning report for the week of November 26, 2008
  2008-11-27 20:53           ` Arjan van de Ven
@ 2008-11-28  8:34             ` Ingo Molnar
  0 siblings, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2008-11-28  8:34 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Yinghai Lu, Linux Kernel Mailing List, Linus Torvalds, NetDev,
	x86, Andrew Morton, Theodore Ts'o, Alan Cox, jesse Barnes,
	H. Peter Anvin

* Arjan van de Ven <arjan@linux.intel.com> wrote:

> On Thu, 27 Nov 2008 21:47:14 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > IIRC the problem is that vmware _does_ claim that it supports MTRRs. 
> 
> it might.
> but even if they would fix that, we would still WARN (
> at least we should do our side correctly...

As pointed out in other parts of the thread, that is not the case.

Anyway, as I said at the outset, if you think we should remove the 
warning altogether, or tweak it, we can do that - it is important to 
have relevant warnings show up in kerneloops.org.

To sum it up: the only remaining MTRR warnings we know of are either:

 1) apparently genuine BIOS bugs that do cause problems if the (new) 
    kernel does not fix them up.

    The MTRR warning is relevant and correct in those cases.

or:

 2) sucky virtualization solutions that cheat the guest OS by faking 
    "MTRR support" in the CPUID info, but not actually showing any 
    MTRRs. These virtualization solutions do not even properly 
    identify themselves to the kernel.

    The MTRR warning is unnecessary in this case.

So what we did in the x86 tree to remove the warning in the second 
case is to properly identify vmware (and in general, virtualization) 
guests.

It was not a simple oneliner:

 earth4:~/tip> gll linus..x86/detect-hyper

 4e42ebd: x86: hypervisor - fix sparse warnings
 c450d78: x86: vmware - fix sparse warnings
 fd8cd7e: x86: vmware: look for DMI string in the product serial key
 6bdbfe9: x86: VMware: Fix vmware_get_tsc code
 395628e: x86: Skip verification by the watchdog for TSC clocksource.
 eca0cd0: x86: Add a synthetic TSC_RELIABLE feature bit.
 88b094f: x86: Hypervisor detection and get tsc_freq from hypervisor
 49ab56a: x86: add X86_FEATURE_HYPERVISOR feature bit
 b2bcc7b: x86: add a synthetic TSC_RELIABLE feature bit

and it will benefit vmware guests in many more areas than just a 
sharper MTRR warning message. That code is queued up for v2.6.29.
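
The idea behind the second case can be sketched roughly as follows. This is not the code in the x86/detect-hyper branch: the function names are hypothetical, and the sketch assumes the hypervisor-present flag in bit 31 of ECX from CPUID leaf 1, which is the bit X86_FEATURE_HYPERVISOR mirrors. Guests that hide this bit would need other detection, such as the DMI string lookup mentioned in the commit list above.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX bit 31: set when running under a hypervisor. */
#define CPUID_ECX_HYPERVISOR_BIT 31

/* Hypothetical helper: true if the leaf-1 ECX value reports a hypervisor. */
bool cpuid_ecx_has_hypervisor(uint32_t ecx)
{
	return (ecx >> CPUID_ECX_HYPERVISOR_BIT) & 1u;
}

/*
 * Hypothetical policy: warn about missing MTRRs only on bare metal, where
 * it points at a BIOS problem. In a guest, the absence is expected.
 */
bool should_warn_no_mtrr(uint32_t cpuid1_ecx, unsigned int num_var_mtrrs)
{
	if (num_var_mtrrs > 0)
		return false;	/* MTRRs are present: nothing to warn about */
	return !cpuid_ecx_has_hypervisor(cpuid1_ecx);
}
```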

	Ingo


* Re: oops/warning report for the week of November 26, 2008
  2008-11-26 23:11 oops/warning report for the week of November 26, 2008 Arjan van de Ven
  2008-11-27  0:05 ` Jesse Barnes
  2008-11-27 11:52 ` Ingo Molnar
@ 2008-11-28 17:18 ` Jay Cliburn
  2008-11-28 17:32   ` Arjan van de Ven
  2 siblings, 1 reply; 25+ messages in thread
From: Jay Cliburn @ 2008-11-28 17:18 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: NetDev

[trimmed the cc list down to netdev only]

On Wed, 26 Nov 2008 15:11:14 -0800
Arjan van de Ven <arjan@linux.intel.com> wrote:

> Rank 12: dev_watchdog(atl1) (oops)
> 	Reported 56 times (109 total reports)
> 	This oops was last seen in version 2.6.27.5, and first seen
> in 2.6.26.6. More info:
> http://www.kerneloops.org/searchweek.php?search=dev_watchdog(atl1)

I can't reproduce this, so I've launched a request at fedoraforum.org
hoping I can snag a Fedora user who's encountering the bug and willing
to test.

The tx timeout reports at kerneloops.org appear to be happening on a
startling variety of network drivers (startling to me, anyway): r8169,
atl1, atl2, sis900, cdc_ether, orinoco_cs, tg3, ne2k-pci, via-rhine,
8139too, ath_pci, e1000, gl620a, sky2, hso, fealnx, forcedeth; probably
others, but I quit looking.

Is it correct to assume all these drivers are showing symptoms of the
poor timeout handling you mentioned in your r8169 comment, or is the
occasional tx timeout to be expected, and the leaders in this category
(r8169, sis900, atl1) are the only ones suffering from deficient
timeout handling?

Jay

* Re: oops/warning report for the week of November 26, 2008
  2008-11-28 17:18 ` Jay Cliburn
@ 2008-11-28 17:32   ` Arjan van de Ven
  2008-11-28 18:36     ` Jay Cliburn
  2008-11-28 19:50     ` Francois Romieu
  0 siblings, 2 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-28 17:32 UTC (permalink / raw)
  To: Jay Cliburn; +Cc: NetDev

On Fri, 28 Nov 2008 11:18:27 -0600
Jay Cliburn <jcliburn@gmail.com> wrote:

> [trimmed the cc list down to netdev only]
> 
> > Rank 12: dev_watchdog(atl1) (oops)
> > 	Reported 56 times (109 total reports)
> > 	This oops was last seen in version 2.6.27.5, and first seen
> > in 2.6.26.6. More info:
> > http://www.kerneloops.org/searchweek.php?search=dev_watchdog(atl1)
> 
> The tx timeout reports at kerneloops.org appear to be happening on a
> startling variety of network drivers (startling to me, anyway): r8169,
> atl1, atl2, sis900, cdc_ether, orinoco_cs, tg3, ne2k-pci, via-rhine,
> 8139too, ath_pci, e1000, gl620a, sky2, hso, fealnx, forcedeth;
> probably others, but I quit looking.

to be specific in counts, the data I have so far is:

 count |           guilty           
-------+----------------------------
  1599 | dev_watchdog(sis900)
  1501 | dev_watchdog(r8169)
   280 | dev_watchdog(via-rhine)
   264 | dev_watchdog(cdc_ether)
   213 | dev_watchdog(usbnet)
   192 | dev_watchdog(8139too)
   164 | dev_watchdog(8390)
   158 | dev_watchdog(via_rhine)
   129 | dev_watchdog(ne2k-pci)
   122 | dev_watchdog(atl1)
   102 | dev_watchdog(atl2)
   101 | dev_watchdog(orinoco)

and then a long tail of sub-100, omitted to keep this mail not too
long; if anyone wants data on his/her driver not in the list, let me
know.

(please don't read too much in the word "guilty"; it's just the name of
the column in the kerneloops.org database used for identifying which
function was the prime suspect of a backtrace)

> 
> Is it correct to assume all these drivers are showing symptoms of the
> poor timeout handling you mentioned in your r8169 comment, or is the
> occasional tx timeout to be expected, and the leaders in this category
> (r8169, sis900, atl1) are the only ones suffering from deficient
> timeout handling?

For me, sis900 and r8169 stand out; if you look at the data in the
table above, both of these are an order of magnitude more frequent than
the rest of the pack. ATL1 isn't doing all that bad in this regard,
although your driver is still a little higher than other popular cards
like tg3, e1000, e1000e etc. (those are all sub-50). 


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

* Re: oops/warning report for the week of November 26, 2008
  2008-11-28 17:32   ` Arjan van de Ven
@ 2008-11-28 18:36     ` Jay Cliburn
  2008-11-28 18:50       ` Arjan van de Ven
  2008-11-28 19:50     ` Francois Romieu
  1 sibling, 1 reply; 25+ messages in thread
From: Jay Cliburn @ 2008-11-28 18:36 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: NetDev

On Fri, 28 Nov 2008 09:32:17 -0800
Arjan van de Ven <arjan@linux.intel.com> wrote:


> to be specific in counts, the data I have so far is:
> 
>  count |           guilty           
> -------+----------------------------
>   1599 | dev_watchdog(sis900)
>   1501 | dev_watchdog(r8169)
>    280 | dev_watchdog(via-rhine)
>    264 | dev_watchdog(cdc_ether)
>    213 | dev_watchdog(usbnet)
>    192 | dev_watchdog(8139too)
>    164 | dev_watchdog(8390)
>    158 | dev_watchdog(via_rhine)
>    129 | dev_watchdog(ne2k-pci)
>    122 | dev_watchdog(atl1)
>    102 | dev_watchdog(atl2)
>    101 | dev_watchdog(orinoco)

> ATL1 isn't doing all that bad in this
> regard, although your driver is still a little higher than other
> popular cards like tg3, e1000, e1000e etc. 

...And that's what troubles me: the L1 chip isn't what I'd characterize
as "popular" -- it's LOM only, and it's found in only about 25
mainboards that I know of (from voluntary user reports) -- yet its
prevalence in the tx timeout list seems to be quickly rising.

Can you produce a list from your database for me that includes the
kernel version for each of the 122 reported atl1 dev_watchdog warnings?
I'd like to see if I can correlate an increase in the warnings with a
particular change we made.

Thanks.

* Re: oops/warning report for the week of November 26, 2008
  2008-11-28 18:36     ` Jay Cliburn
@ 2008-11-28 18:50       ` Arjan van de Ven
  2008-11-28 21:12         ` atl1 transmit timeout Was: " Jay Cliburn
  0 siblings, 1 reply; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-28 18:50 UTC (permalink / raw)
  To: Jay Cliburn; +Cc: NetDev

> 
> > ATL1 isn't doing all that bad in this
> > regard, although your driver is still a little higher than other
> > popular cards like tg3, e1000, e1000e etc. 
> 
> ...And that's what troubles me: the L1 chip isn't what I'd
> characterize as "popular" -- it's LOM only, and it's found in only
> about 25 mainboards that I know of (from voluntary user reports) --
> yet its prevalence in the tx timeout list seems to be quickly rising.
> 
> Can you produce a list from your database for me that includes the
> kernel version for each of the 122 reported atl1 dev_watchdog
> warnings? I'd like to see if I can correlate an increase in the
> warnings with a particular change we made.


=> select count(version), version from oopses where guilty='dev_watchdog(atl1)' group by version order by version desc; 
 count |     version     
-------+-----------------
    93 | 2.6.27.5
     6 | 2.6.27.4
     1 | 2.6.27.3
     1 | 2.6.27.2
     1 | 2.6.27-rc9
     1 | 2.6.27-rc7-git1
     1 | 2.6.27-rc7
     1 | 2.6.27-rc6
     6 | 2.6.27-rc3
     7 | 2.6.27
     4 | 2.6.26.6
(11 rows)
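
The rows above come back in the database's string order, which is why 2.6.27 final is listed below its own release candidates. For correlating the counts with driver changes, a chronological order may be easier to read. The following is a minimal sketch of such a sort, assuming version strings of the form seen in this table only; `version_key` and `ROWS` are hypothetical names and are not part of kerneloops.org.

```python
import re

# (count, version) rows copied from the query output above.
ROWS = [
    (93, "2.6.27.5"),
    (6, "2.6.27.4"),
    (1, "2.6.27.3"),
    (1, "2.6.27.2"),
    (1, "2.6.27-rc9"),
    (1, "2.6.27-rc7-git1"),
    (1, "2.6.27-rc7"),
    (1, "2.6.27-rc6"),
    (6, "2.6.27-rc3"),
    (7, "2.6.27"),
    (4, "2.6.26.6"),
]

# Hypothetical pattern: covers only the shapes that occur in ROWS.
VERSION_RE = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?(?:-git(\d+))?$"
)


def version_key(version):
    """Return a tuple that sorts kernel versions in release order."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    major, minor, patch, stable, rc, git = match.groups()
    if rc is not None:
        # Release candidates come before the final release.
        stage = (0, int(rc), int(git or 0))
    else:
        # The final release comes before its stable updates.
        stage = (1, int(stable or 0), 0)
    return (int(major), int(minor), int(patch)) + stage


chronological = sorted(ROWS, key=lambda row: version_key(row[1]))
for count, version in chronological:
    print("%5d | %s" % (count, version))
```

Distribution suffixes such as the ones in the more detailed query would need a looser pattern; this sketch rejects them rather than guessing.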


or in more detail:

=> select count(full_version), full_version from oopses where \
guilty='dev_watchdog(atl1)' group by full_version order by \
full_version desc;

 count |           full_version            
-------+-----------------------------------
     2 | 2.6.27.5-94.fc10.x86_64
    26 | 2.6.27.5-41.fc9.x86_64
    12 | 2.6.27.5-41.fc9.i686
    15 | 2.6.27.5-37.fc9.x86_64
    19 | 2.6.27.5-37.fc9.i686
    11 | 2.6.27.5-117.fc10.x86_64
     4 | 2.6.27.5-117.fc10.i686
     1 | 2.6.27.5-109.fc10.x86_64
     2 | 2.6.27.5-109.fc10.i686.PAE
     1 | 2.6.27.5-109.fc10.i686
     1 | 2.6.27.4-79.fc10.i686
     1 | 2.6.27.4-68.fc10.x86_64
     3 | 2.6.27.4-68.fc10.i686
     1 | 2.6.27.4-26.fc9.x86_64
     1 | 2.6.27.3-34.rc1.fc10.i686.PAE
     1 | 2.6.27.2-23.rc1.fc10.x86_64
     4 | 2.6.27-wl
     1 | 2.6.27-rc7
     1 | 2.6.27-rc6-wl-AUS32
     6 | 2.6.27-rc3-wl-8KS-UVC
     1 | 2.6.27-7-generic
     1 | 2.6.27-0.398.rc9.fc10.x86_64
     1 | 2.6.27-0.352.rc7.git1.fc10.x86_64
     2 | 2.6.27
     3 | 2.6.26.6-79.fc9.x86_64
     1 | 2.6.26.6-79.fc9.i686
(26 rows)



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

* Re: oops/warning report for the week of November 26, 2008
  2008-11-28 17:32   ` Arjan van de Ven
  2008-11-28 18:36     ` Jay Cliburn
@ 2008-11-28 19:50     ` Francois Romieu
  2008-11-28 20:12       ` Arjan van de Ven
  2008-11-30  8:58       ` Roger Luethi
  1 sibling, 2 replies; 25+ messages in thread
From: Francois Romieu @ 2008-11-28 19:50 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Jay Cliburn, NetDev

Arjan van de Ven <arjan@linux.intel.com> :
[...]
> For me, sis900 and r8169 stand out; if you look at the data in the
> table above, both of these are an order of magnitude more frequent than
> the rest of the pack.

via-rhine + via_rhine = 438: it does not look too good either.
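
The sum above treats the two spellings as one driver. A minimal sketch of that merge over the whole table, assuming that replacing "-" with "_" is enough to identify the same driver (that rule is an assumption of this sketch, not something the kerneloops.org database is known to do):

```python
from collections import Counter

# Counts copied from the dev_watchdog table quoted earlier in this thread.
REPORTS = {
    "sis900": 1599,
    "r8169": 1501,
    "via-rhine": 280,
    "cdc_ether": 264,
    "usbnet": 213,
    "8139too": 192,
    "8390": 164,
    "via_rhine": 158,
    "ne2k-pci": 129,
    "atl1": 122,
    "atl2": 102,
    "orinoco": 101,
}


def normalize(name):
    """Hypothetical rule: treat '-' and '_' spellings as the same driver."""
    return name.replace("-", "_")


def merge_spellings(reports):
    """Sum the report counts of drivers whose normalized names match."""
    merged = Counter()
    for name, count in reports.items():
        merged[normalize(name)] += count
    return merged


merged = merge_spellings(REPORTS)
ranking = merged.most_common()
for name, count in ranking:
    print("%6d | dev_watchdog(%s)" % (count, name))
```

With the spellings merged, via_rhine moves to third place behind sis900 and r8169.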

Is there an (ideally automated) way to retrieve more information ?

The r8169 driver handles three different chipsets and a plethora of
phys. The "XID" line printed by the driver could hint at some specific
PHY for instance.

-- 
Ueimor

* Re: oops/warning report for the week of November 26, 2008
  2008-11-28 19:50     ` Francois Romieu
@ 2008-11-28 20:12       ` Arjan van de Ven
  2008-11-30  8:58       ` Roger Luethi
  1 sibling, 0 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-28 20:12 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Jay Cliburn, NetDev

On Fri, 28 Nov 2008 20:50:18 +0100
Francois Romieu <romieu@fr.zoreil.com> wrote:

> Arjan van de Ven <arjan@linux.intel.com> :
> [...]
> > For me, sis900 and r8169 stand out; if you look at the data in the
> > table above, both of these are an order of magnitude more frequent
> > than the rest of the pack.
> 
> via-rhine + via_rhine = 438: it does not look too good either.
> 
> Is there an (ideally automated) way to retrieve more information ?

this will need help from the driver and a bit of the core
infrastructure.

the code that generates the warning is in net/sched/sch_generic.c:

char drivername[64];
WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit timed out\n",
	  dev->name, netdev_drivername(dev, drivername, 64));
dev->tx_timeout(dev);
 
> The r8169 driver handles three different chipsets and a plethora of
> phys. The "XID" line printed by the driver could hint at some specific
> PHY for instance.

anything you add to that WARN_ONCE will end up on kerneloops.org...
it could be as simple as storing some information in the net dev... or having
a function pointer that can print some useful diagnostics information.


In addition, I'm trying to get a patch into .29 that prints, on x86, some basic
DMI information in every WARN_ON class message; but this won't give you the details
about the actual NIC, at most which motherboard is in use.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

* atl1 transmit timeout Was: Re: oops/warning report for the week of November 26, 2008
  2008-11-28 18:50       ` Arjan van de Ven
@ 2008-11-28 21:12         ` Jay Cliburn
  2008-11-28 21:22           ` Arjan van de Ven
  0 siblings, 1 reply; 25+ messages in thread
From: Jay Cliburn @ 2008-11-28 21:12 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: NetDev

On Fri, 28 Nov 2008 10:50:44 -0800
Arjan van de Ven <arjan@linux.intel.com> wrote:

> => select count(version), version from oopses where
> guilty='dev_watchdog(atl1)' group by version order by version desc;
>  count |     version
> -------+-----------------
>     93 | 2.6.27.5
>      6 | 2.6.27.4
>      1 | 2.6.27.3
>      1 | 2.6.27.2
>      1 | 2.6.27-rc9
>      1 | 2.6.27-rc7-git1
>      1 | 2.6.27-rc7
>      1 | 2.6.27-rc6
>      6 | 2.6.27-rc3
>      7 | 2.6.27
>      4 | 2.6.26.6
> (11 rows)

Wow.  4 hits in 2.6.26, then 118 in 2.6.27.
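
The 4 versus 118 split can be reproduced from the quoted rows by grouping on the first three version components. This is a small sketch; `release_series` and `hits_per_series` are hypothetical helper names, and the rows are copied from the query output above.

```python
import re
from collections import defaultdict

# (count, version) rows copied from the quoted query output.
ROWS = [
    (93, "2.6.27.5"),
    (6, "2.6.27.4"),
    (1, "2.6.27.3"),
    (1, "2.6.27.2"),
    (1, "2.6.27-rc9"),
    (1, "2.6.27-rc7-git1"),
    (1, "2.6.27-rc7"),
    (1, "2.6.27-rc6"),
    (6, "2.6.27-rc3"),
    (7, "2.6.27"),
    (4, "2.6.26.6"),
]


def release_series(version):
    """Return the leading 'x.y.z' part, e.g. '2.6.27' for '2.6.27-rc9'."""
    match = re.match(r"\d+\.\d+\.\d+", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return match.group(0)


def hits_per_series(rows):
    """Sum the report counts per release series."""
    totals = defaultdict(int)
    for count, version in rows:
        totals[release_series(version)] += count
    return dict(totals)


totals = hits_per_series(ROWS)
for series in sorted(totals):
    print("%s: %d" % (series, totals[series]))
```

Note that this counts release candidates and stable updates together with the final release of the same series.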

A history of changes between 2.6.26 and 2.6.27.5 shows a mere six
changes to the driver.

commit    event
======    =====
788a5f3f  2.6.27.5
8dc186c1  atl1: fix vlan tag regression
056c7145  2.6.27.4
322df44b  2.6.27.3
6bcd6d77  2.6.27.2
bc5b8bb6  2.6.27.1
3fa8749e  2.6.27
4330ed8e  2.6.27-rc9
94aca1da  2.6.27-rc8
72d31053  2.6.27-rc7
adee14b2  2.6.27-rc6
24342c34  2.6.27-rc5
82c26a9d  atl1: disable TSO by default
6a55617e  2.6.27-rc4
30a2f3c6  2.6.27-rc3
c2ac3ef3  atl1: deal with hardware rx checksum bug
0967d61e  2.6.27-rc2
6e86841d  2.6.27-rc1
39d48157  atl1: Do not wake queue before queue has been started.
b102df14  atl1: use netdev_alloc_skb
d63ddcec  misc drivers/net endianness noise
bce7f793  2.6.26

The only one that jumps out at me is 39d48157, which contains, in part:

commit 39d48157ac1a0ff3ec81212e5451bfd1bf5f50db
Author: David S. Miller <davem@davemloft.net>
Date:   Mon Jul 21 08:28:37 2008 -0700

    atl1: Do not wake queue before queue has been started.
    
    Based upon a bug report by Alexey Dobriyan, the patch is
    also tested by him and confirmed to fix the problem.
    
    Packet flow during link state events should not be done by
    waking and stopping the TX queue anyways, that is handled
    transparently by netif_carrier_{on,off}().
    
    So, remove the netif_{wake,stop}_queue() calls in the link
    check code, and add the necessary netif_start_queue() call
    to atl1_up().
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

[...]
@@ -2627,6 +2625,7 @@ static s32 atl1_up(struct atl1_adapter *adapter)
        mod_timer(&adapter->watchdog_timer, jiffies);
        atlx_irq_enable(adapter);
        atl1_check_link(adapter);
+       netif_start_queue(netdev);
        return 0;

Would it be reasonable to increase the above mod_timer() expiry to
jiffies + (5 * HZ)?

* Re: atl1 transmit timeout Was: Re: oops/warning report for the week of November 26, 2008
  2008-11-28 21:12         ` atl1 transmit timeout Was: " Jay Cliburn
@ 2008-11-28 21:22           ` Arjan van de Ven
  0 siblings, 0 replies; 25+ messages in thread
From: Arjan van de Ven @ 2008-11-28 21:22 UTC (permalink / raw)
  To: Jay Cliburn; +Cc: NetDev

Jay Cliburn wrote:
> On Fri, 28 Nov 2008 10:50:44 -0800
> Arjan van de Ven <arjan@linux.intel.com> wrote:
> 
>> => select count(version), version from oopses where
>> guilty='dev_watchdog(atl1)' group by version order by version desc;
>>  count |     version
>> -------+-----------------
>>     93 | 2.6.27.5
>>      6 | 2.6.27.4
>>      1 | 2.6.27.3
>>      1 | 2.6.27.2
>>      1 | 2.6.27-rc9
>>      1 | 2.6.27-rc7-git1
>>      1 | 2.6.27-rc7
>>      1 | 2.6.27-rc6
>>      6 | 2.6.27-rc3
>>      7 | 2.6.27
>>      4 | 2.6.26.6
>> (11 rows)
> 
> Wow.  4 hits in 2.6.26, then 118 in 2.6.27.
> 
> A history of changes between 2.6.26 and 2.6.27.5 shows a mere six
> changes to the driver.

one thing to note is that for the .26 kernel, there was not very good data collection of this issue yet.
(Although.. more than 4% I would say; Fedora had the patches to report the driver info backported for quite some time)


* Re: oops/warning report for the week of November 26, 2008
  2008-11-28 19:50     ` Francois Romieu
  2008-11-28 20:12       ` Arjan van de Ven
@ 2008-11-30  8:58       ` Roger Luethi
  1 sibling, 0 replies; 25+ messages in thread
From: Roger Luethi @ 2008-11-30  8:58 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Arjan van de Ven, Jay Cliburn, NetDev

On Fri, 28 Nov 2008 20:50:18 +0100, Francois Romieu wrote:
> Arjan van de Ven <arjan@linux.intel.com> :
> [...]
> > For me, sis900 and r8169 stand out; if you look at the data in the
> > table above, both of these are an order of magnitude more frequent than
> > the rest of the pack.
> 
> via-rhine + via_rhine = 438: it does not look too good either.

Agreed. I was kinda hoping I'd get some clues for free when other drivers
get fixed :-).

> Is there an (ideally automated) way to retrieve more information ?
> 
> The r8169 driver handles three different chipsets and a plethora of
> phys. The "XID" line printed by the driver could hint at some specific
> PHY for instance.

For the Rhine, knowing the PCI rev would help identify problems tied to
specific models.

Roger
