Re: Machine Check Exception Was: NetDev! Please help!


* Re: Machine Check Exception Was: NetDev! Please help!
       [not found] <48D4F85C.8090709@bigtelecom.ru>
@ 2008-09-20 18:31 ` Jarek Poplawski
       [not found] ` <200809202111.01256.denys@visp.net.lb>
  1 sibling, 0 replies; 9+ messages in thread
From: Jarek Poplawski @ 2008-09-20 18:31 UTC (permalink / raw)
  To: Badalian Vyacheslav; +Cc: netdev, LKML

Badalian Vyacheslav wrote, On 09/20/2008 03:19 PM:

> Hello all.


Hi Vyacheslav,


I think it might be something more than netdev. Please, read in the kernel
config the comment to Processor type and features/Machine Check Exception.

I pasted your second message below and Cc linux-kernel.

Jarek P.

> We bought 10 Intel servers and put them to work shaping traffic. After
> 5-15 hours every PC froze! The kernel does not see the TCO watchdog on
> this platform and can't reboot it! The soft watchdog does not reboot the
> PC in this situation either.  =(
> 
> On the screen we see messages like this (when it freezes and I am near the monitor):
> 
> http://www.kerneloops.org/guilty.php?guilty=dev_watchdog&version=2.6.26-release&start=1736704&end=1769471&class=warn
> 
> We also got this once via netconsole:
> 
> [ 1352.245851] netconsole: network logging started
> [ 1458.400133] 802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
> [ 1458.400133] All bugs added by David S. Miller <davem@redhat.com>
> [ 4956.420298] CPU 1: Machine Check Exception: 0000000000000005
> [ 4956.420298] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [ 4956.420300]   Tx Queue             <0>
> [ 4956.420300]   TDH                  <81>
> [ 4956.420301]   TDT                  <81>
> [ 4956.420302]   next_to_use          <81>
> [ 4956.420302]   next_to_clean        <d6>
> [ 4956.420303] buffer_info[next_to_clean]
> [ 4956.420303]   time_stamp           <15498d>
> [ 4956.420304]   next_to_watch        <d6>
> [ 4956.420304]   jiffies              <15511c>
> [ 4956.420305]   next_to_watch.status <1>
> [ 4956.420537] eth1: Detected Tx Unit Hang:
> [ 4956.420538]   TDH                  <b0>
> [ 4956.420538]   TDT                  <b0>
> [ 4956.420539]   next_to_use          <b0>
> [ 4956.420539]   next_to_clean        <5>
> [ 4956.420540] buffer_info[next_to_clean]:
> [ 4956.420540]   time_stamp           <15498e>
> [ 4956.420541]   next_to_watch        <5>
> [ 4956.420542]   jiffies              <15511c>
> [ 4956.420542]   next_to_watch.status <1>
> [ 4956.423064] CPU 1: Bank 0: 3200004000000800
> [ 4956.423190] CPU 1: Bank 5: 3200220024080400
> [ 4956.423315] Kernel panic - not syncing: CPU context corrupt
> [ 4956.423933] Rebooting in 3 seconds..
> [  531.843998] CPU 2: Machine Check Exception: 0000000000000005
> [  531.843998] CPU 0: Machine Check Exception: 0000000000000004
> [  531.844000] CPU 0: Bank 0: 3200004000000800
> [  531.844001] CPU 0: Bank 5: 3200121014040400
> [  531.844002] Kernel panic - not syncing: CPU context corrupt
> [  531.844916] Rebooting in 3 seconds..
> 
> This is the output of lspci:
> 
> 00:00.0 Host bridge: Intel Corporation Server DRAM Controller
> 00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
> 00:1a.0 USB Controller: Intel Corporation USB UHCI Controller #4 (rev 02)
> 00:1a.1 USB Controller: Intel Corporation USB UHCI Controller #5 (rev 02)
> 00:1a.2 USB Controller: Intel Corporation USB UHCI Controller #6 (rev 02)
> 00:1a.7 USB Controller: Intel Corporation USB2 EHCI Controller #2 (rev 02)
> 00:1c.0 PCI bridge: Intel Corporation PCI Express Port 1 (rev 02)
> 00:1c.4 PCI bridge: Intel Corporation PCI Express Port 5 (rev 02)
> 00:1d.0 USB Controller: Intel Corporation USB UHCI Controller #1 (rev 02)
> 00:1d.1 USB Controller: Intel Corporation USB UHCI Controller #2 (rev 02)
> 00:1d.2 USB Controller: Intel Corporation USB UHCI Controller #3 (rev 02)
> 00:1d.7 USB Controller: Intel Corporation USB2 EHCI Controller #1 (rev 02)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
> 00:1f.0 ISA bridge: Intel Corporation LPC Interface Controller (rev 02)
> 00:1f.2 SATA controller: Intel Corporation 6 port SATA AHCI Controller (rev 02)
> 00:1f.3 SMBus: Intel Corporation SMBus Controller (rev 02)
> 02:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G200e [Pilot] ServerEngines (SEP1) (rev 02)
> 03:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)
> 
> lspci -tv
> -[0000:00]-+-00.0  Intel Corporation Server DRAM Controller
>            +-19.0  Intel Corporation 82566DM-2 Gigabit Network Connection
>            +-1a.0  Intel Corporation USB UHCI Controller #4
>            +-1a.1  Intel Corporation USB UHCI Controller #5
>            +-1a.2  Intel Corporation USB UHCI Controller #6
>            +-1a.7  Intel Corporation USB2 EHCI Controller #2
>            +-1c.0-[0000:01]--
>            +-1c.4-[0000:02]----00.0  Matrox Graphics, Inc. MGA G200e [Pilot] ServerEngines (SEP1)
>            +-1d.0  Intel Corporation USB UHCI Controller #1
>            +-1d.1  Intel Corporation USB UHCI Controller #2
>            +-1d.2  Intel Corporation USB UHCI Controller #3
>            +-1d.7  Intel Corporation USB2 EHCI Controller #1
>            +-1e.0-[0000:03]----02.0  Intel Corporation 82541GI Gigabit Ethernet Controller
>            +-1f.0  Intel Corporation LPC Interface Controller
>            +-1f.2  Intel Corporation 6 port SATA AHCI Controller
>            \-1f.3  Intel Corporation SMBus Controller
> 
> All tests were on the 2.6.26.2 kernel... now I will try compiling the
> 2.6.26.5 kernel, but if KernelOops is right, it will not help.
> 
> I hope NetDev can help!
> If any information or testing is needed, please write!
> 
> Thanks to everyone!
> 
> Badalian Vyacheslav.

Badalian Vyacheslav wrote, On 09/20/2008 03:38 PM:

> New crash.... netconsole log:
> 
> [  116.333349] ------------[ cut here ]------------
> [  116.333516] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0xf1/0x110()
> [  116.333690] Modules linked in: netconsole i2c_i801 i2c_core e1000e e1000
> [  116.334199] Pid: 0, comm: swapper Not tainted 2.6.26-gentoo-r1-fw #2
> [  116.334371]  [<c012506f>] warn_on_slowpath+0x5f/0x90
> [  116.334597]  [<c011dd1a>] enqueue_task_fair+0x1a/0x30
> [  116.334823]  [<c011b962>] enqueue_task+0x12/0x30
> [  116.335046]  [<c011b9d3>] activate_task+0x23/0x40
> [  116.335268]  [<c011e01a>] try_to_wake_up+0x6a/0x110
> [  116.335491]  [<c0137c7b>] autoremove_wake_function+0x1b/0x50
> [  116.335718]  [<c011be6b>] __wake_up_common+0x4b/0x80
> [  116.335941]  [<c011cfde>] __wake_up+0x3e/0x60
> [  116.336161]  [<c0134a2b>] insert_work+0x4b/0x70
> [  116.336384]  [<c0134dd7>] __queue_work+0x27/0x40
> [  116.336610]  [<c02d0651>] dev_watchdog+0xf1/0x110
> [  116.337333]  [<c012e055>] run_timer_softirq+0x115/0x170
> [  116.337557]  [<c0122c71>] scheduler_tick+0xa1/0xd0
> [  116.337780]  [<c012a062>] __do_softirq+0x82/0x100
> [  116.338002]  [<c012a117>] do_softirq+0x37/0x40
> [  116.338222]  [<c0114027>] smp_apic_timer_interrupt+0x57/0x90
> [  116.338448]  [<c0105660>] apic_timer_interrupt+0x28/0x30
> [  116.338672]  [<c010a5e2>] mwait_idle+0x32/0x40
> [  116.338894]  [<c010a5b0>] mwait_idle+0x0/0x40
> [  116.339115]  [<c01036e8>] cpu_idle+0x48/0xc0
> [  116.339336]  =======================
> [  116.339499] ---[ end trace e25a40b7dc59df07 ]---
> [  117.655918] CPU 1: Machine Check Exception: 0000000000000005
> [  117.656103] CPU 1: Bank 0: 3200004000000800
> [  117.656604] CPU 1: Bank 5: 3200220024080400
> [  117.656604] Kernel panic - not syncing: CPU context corrupt
> [  117.656624] Rebooting in 3 seconds..
> 
> 
> Thanks

* Re: Machine Check Exception Re: NetDev! Please help!
       [not found]     ` <48D7385D.40107@bigtelecom.ru>
@ 2008-09-22  6:53       ` Jarek Poplawski
  2008-09-22  8:05         ` Jarek Poplawski
  2008-09-22  9:40         ` Badalian Vyacheslav
  0 siblings, 2 replies; 9+ messages in thread
From: Jarek Poplawski @ 2008-09-22  6:53 UTC (permalink / raw)
  To: Badalian Vyacheslav; +Cc: Denys Fedoryshchenko, netdev, linux-kernel

On Mon, Sep 22, 2008 at 10:17:01AM +0400, Badalian Vyacheslav wrote:
> Jarek Poplawski:
> 
> Hello!
> There all requested information.
> I try 2.6.26.5 and again get:
> [143784.513166] CPU 2: Bank 0: 3200004000000800
> [143784.513241] CPU 2: Bank 5: 3200121020080400
> [143784.513241] Kernel panic - not syncing: CPU context corrupt
> [143784.513282] Rebooting in 3 seconds..

Hi,

Actually, I suggested that you read the Machine Check Exception help
because I think you should first test your hardware instead of
sending configs. This type of error isn't usually seen with netdev
bugs.

Since I'm not a hardware expert I added linux-kernel to Cc, and
probably you should do the same (I added it to this one). But, until
you have any better advice I think you should do some long and heavy
testing of your PCs especially for overheating or memory problems.
We can start to analyze other bugs after we are sure the hardware is
OK.

BTW, probably your attachments are too big for the lists and the
message could be dropped. It would be better to add a link to a
server or use bugzilla for this.

Thanks,
Jarek P.
 
> 
> Attached is all the info that I could get from the PC. Maybe the problem
> is that we use Core Duo Quad processors? They are 64-bit, but the kernel
> and software are compiled as 32-bit. On 2 x "OLD HT(2 core) Xeon 32 bit"
> PCs everything works great...
> 
> Simple steps to reproduce:
> Add iptables and tc rules, pass more than 500 Mbit/s of total traffic
> (we have more than 300/200 Mbit/s in/out) from any (many?) IPs that are
> present in the TC rules, and run any CPU-heavy process (like compiling)...
> 
> Thanks for the answers!
> 
> Denys Fedoryshchenko:
> Hello!
> I will try running nmi_watchdog...
> I hope it helps. This PC has a hardware watchdog (the BIOS has parameters
> for it), but the kernel has no module for it - /S3210SH/ (ICH9-R chipset).
> I think the ID was simply not added to the driver. I will try writing to
> its author - wim@iguana.be.
> One question, please... this line:
> [    0.143332] APIC timer registered as dummy, due to nmi_watchdog=1!
> Is this a normal start of nmi_watchdog, or do I need to use nmi_watchdog=2?
> 
> Thanks for answers!
> 
> > Denys Fedoryshchenko wrote, On 09/20/2008 08:11 PM:
> > ...
> >
> >   
> >> P.S. For netdev, I have one more friend who is complaining that shapers are
> >> crashing on Intel machines (he uses TSC and has two different "Core" based
> >> servers, and both are crashing). With HPET I don't have any problem on high
> >> performance shapers (except that it is CPU expensive). It happens on the latest
> >> 2.6.26.5 too. The machine gets a hard lockup, and nothing but a hardware watchdog
> >> is able to recover it. They don't have the experience to find the actual reason
> >> for this issue and they don't know English well enough to report it.
> >>     
> >
> > Is your friend sure it's because of shapers? If he/she can patch
> > there is no need to know English well to report here:
> >
> > Subject: 2.6.26.5 tc not OK
> >
> > Config:
> > 	.config
> >
> > tc script:
> > 	script
> >
> > dmesg:
> > 	dmesg
> >
> > not OK when: script run/script not run
> >
> > patch #1 not OK
> > patch #2 not OK
> > ...
> > patch #2001 OK!
> >
> > Jarek P.

* Re: Machine Check Exception Re: NetDev! Please help!
  2008-09-22  6:53       ` Machine Check Exception " Jarek Poplawski
@ 2008-09-22  8:05         ` Jarek Poplawski
  2008-09-22  9:40         ` Badalian Vyacheslav
  1 sibling, 0 replies; 9+ messages in thread
From: Jarek Poplawski @ 2008-09-22  8:05 UTC (permalink / raw)
  To: Badalian Vyacheslav; +Cc: Denys Fedoryshchenko, netdev, linux-kernel

On Mon, Sep 22, 2008 at 06:53:39AM +0000, Jarek Poplawski wrote:
...
> [...]I think you should do some long and heavy
> testing of your PCs especially for overheating or memory problems.

but also power supply, network card etc.

Jarek P.


* Re: Machine Check Exception Re: NetDev! Please help!
  2008-09-22  6:53       ` Machine Check Exception " Jarek Poplawski
  2008-09-22  8:05         ` Jarek Poplawski
@ 2008-09-22  9:40         ` Badalian Vyacheslav
  2008-09-22 11:24           ` Jarek Poplawski
  1 sibling, 1 reply; 9+ messages in thread
From: Badalian Vyacheslav @ 2008-09-22  9:40 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Denys Fedoryshchenko, netdev, linux-kernel

Thanks for the answer, Jarek!
I posted it to the bug tracker - http://bugzilla.kernel.org/show_bug.cgi?id=11618

I don't think it is a hardware error, because we have this problem on 10
servers with the 2.6.26.2 kernel +)
On Friday night I compiled 2.6.26.5 and had 2 panics on the one PC that has
the maximum load and 1 panic on another PC.
I wrote to the netdev list because the first messages looked like this:

[ 4956.420298] CPU 1: Machine Check Exception: 0000000000000005
[ 4956.420298] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[ 4956.420300]   Tx Queue             <0>
[ 4956.420300]   TDH                  <81>
[ 4956.420301]   TDT                  <81>
[ 4956.420302]   next_to_use          <81>
[ 4956.420302]   next_to_clean        <d6>
[ 4956.420303] buffer_info[next_to_clean]
[ 4956.420303]   time_stamp           <15498d>
[ 4956.420304]   next_to_watch        <d6>
[ 4956.420304]   jiffies              <15511c>
[ 4956.420305]   next_to_watch.status <1>
[ 4956.420537] eth1: Detected Tx Unit Hang:
[ 4956.420538]   TDH                  <b0>
[ 4956.420538]   TDT                  <b0>
[ 4956.420539]   next_to_use          <b0>
[ 4956.420539]   next_to_clean        <5>
[ 4956.420540] buffer_info[next_to_clean]:
[ 4956.420540]   time_stamp           <15498e>
[ 4956.420541]   next_to_watch        <5>
[ 4956.420542]   jiffies              <15511c>
[ 4956.420542]   next_to_watch.status <1>
[ 4956.423064] CPU 1: Bank 0: 3200004000000800
[ 4956.423190] CPU 1: Bank 5: 3200220024080400
[ 4956.423315] Kernel panic - not syncing: CPU context corrupt
[ 4956.423933] Rebooting in 3 seconds..
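
A note on reading the e1000 dump above: time_stamp and jiffies are both printed in hex, so the age of the stuck descriptor can be worked out directly. The following is a minimal Python sketch, not driver code; the helper name parse_hang_fields is made up for illustration. Converting the result to seconds would need the kernel's HZ value, which this thread does not state, so the result is left in jiffies.

```python
import re

def parse_hang_fields(log_text):
    """Collect 'name <hexvalue>' pairs from one 'Detected Tx Unit Hang' dump."""
    fields = {}
    for name, value in re.findall(r"(\w[\w.]*)\s+<([0-9a-fA-F]+)>", log_text):
        fields[name] = int(value, 16)  # all values in the dump are hexadecimal
    return fields

# Field values copied from the eth0 dump in this message.
eth0_dump = """
  TDH                  <81>
  TDT                  <81>
  next_to_use          <81>
  next_to_clean        <d6>
  time_stamp           <15498d>
  next_to_watch        <d6>
  jiffies              <15511c>
  next_to_watch.status <1>
"""

f = parse_hang_fields(eth0_dump)
age = f["jiffies"] - f["time_stamp"]
print(age)  # 0x15511c - 0x15498d = 0x78f = 1935 jiffies
```

In this dump TDH equals TDT while next_to_clean differs from next_to_use, which suggests the driver still had descriptors waiting to be cleaned when the watchdog fired.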

But with 2.6.26.5 I have not seen errors like this for 2 days... Also, if the system has no network load, I can't trigger a panic with cpuburn or by compiling sources...
Anyway, I think it is good that my message also goes to the general mailing list and bugzilla...

I will try to get more info... if you or anyone has an idea how to test this bug, I can do it)

Thanks!


* Re: Machine Check Exception Re: NetDev! Please help!
  2008-09-22  9:40         ` Badalian Vyacheslav
@ 2008-09-22 11:24           ` Jarek Poplawski
  2008-09-22 13:00             ` Badalian Vyacheslav
  0 siblings, 1 reply; 9+ messages in thread
From: Jarek Poplawski @ 2008-09-22 11:24 UTC (permalink / raw)
  To: Badalian Vyacheslav; +Cc: Denys Fedoryshchenko, netdev, linux-kernel

On Mon, Sep 22, 2008 at 01:40:35PM +0400, Badalian Vyacheslav wrote:
> Thanks for the answer, Jarek!
> I posted it to the bug tracker - http://bugzilla.kernel.org/show_bug.cgi?id=11618
> 
> I don't think it is a hardware error, because we have this problem on 10
> servers with the 2.6.26.2 kernel +)
> On Friday night I compiled 2.6.26.5 and had 2 panics on the one PC that has
> the maximum load and 1 panic on another PC.
> I wrote to the netdev list because the first messages looked like this:
> 
> [ 4956.420298] CPU 1: Machine Check Exception: 0000000000000005
> [ 4956.420298] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [ 4956.420300]   Tx Queue             <0>
> [ 4956.420300]   TDH                  <81>
> [ 4956.420301]   TDT                  <81>
> [ 4956.420302]   next_to_use          <81>
> [ 4956.420302]   next_to_clean        <d6>
> [ 4956.420303] buffer_info[next_to_clean]
> [ 4956.420303]   time_stamp           <15498d>
> [ 4956.420304]   next_to_watch        <d6>
> [ 4956.420304]   jiffies              <15511c>
> [ 4956.420305]   next_to_watch.status <1>
> [ 4956.420537] eth1: Detected Tx Unit Hang:
> [ 4956.420538]   TDH                  <b0>
> [ 4956.420538]   TDT                  <b0>
> [ 4956.420539]   next_to_use          <b0>
> [ 4956.420539]   next_to_clean        <5>
> [ 4956.420540] buffer_info[next_to_clean]:
> [ 4956.420540]   time_stamp           <15498e>
> [ 4956.420541]   next_to_watch        <5>
> [ 4956.420542]   jiffies              <15511c>
> [ 4956.420542]   next_to_watch.status <1>
> [ 4956.423064] CPU 1: Bank 0: 3200004000000800
> [ 4956.423190] CPU 1: Bank 5: 3200220024080400
> [ 4956.423315] Kernel panic - not syncing: CPU context corrupt
> [ 4956.423933] Rebooting in 3 seconds..

Yes, similar messages are often netdev problems, but not together with
a Machine Check Exception and "CPU context corrupt", which should mean
some severe hardware problem (unless some bug, probably not in netdev,
triggers them).
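
To make the numbers in these logs easier to read, here is a minimal Python sketch that splits them into their architectural flag bits. The bit positions are the ones documented in the Intel SDM for IA32_MCG_STATUS (the value after "Machine Check Exception:") and IA32_MCi_STATUS (the "Bank N:" values). The function names are illustrative only, and the model-specific fields are returned raw rather than interpreted.

```python
# Bit positions as documented for the machine-check architecture in the Intel SDM.
MCG_STATUS_BITS = {2: "MCIP", 1: "EIPV", 0: "RIPV"}
MCI_STATUS_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                   59: "MISCV", 58: "ADDRV", 57: "PCC"}

def flags(value, table):
    """Return the names of the set bits, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]

def decode_bank(status):
    """Split a bank status value into flags and the two raw error-code fields."""
    return {
        "flags": flags(status, MCI_STATUS_BITS),
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }

# Values copied from the logs in this thread.
print(flags(0x5, MCG_STATUS_BITS))                # ['MCIP', 'RIPV']
print(flags(0x4, MCG_STATUS_BITS))                # ['MCIP']
print(decode_bank(0x3200004000000800)["flags"])   # ['UC', 'EN', 'PCC']
print(decode_bank(0xb200004000000800)["flags"])   # ['VAL', 'UC', 'EN', 'PCC']
```

PCC (processor context corrupt) is set in every bank value reported in this thread, which is consistent with the "CPU context corrupt" panic text. Where the exception value is 0x4, RIPV is clear, which is consistent with the "Unable to continue" panic.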

> 
> But with 2.6.26.5 I have not seen errors like this for 2 days... Also, if the system has no network load, I can't trigger a panic with cpuburn or by compiling sources...
> Anyway, I think it is good that my message also goes to the general mailing list and bugzilla...
> 
> I will try to get more info... if you or anyone has an idea how to test this bug, I can do it)

I see you have some advice in bugzilla. These people really know more
about these things, so you should try this first. I think, they expect
you to compile the most current kernel version (tip) using git for
this. You can do this using the instructions from Ingo Molnar's README.
Make a script from this: from the beginning to the "git checkout ...".
Of course, you have to install git first. After running the commands
it will download the kernel sources to a subdir (takes time). Copy your
config there, make oldconfig, make etc. Then send them dmesg after
rebooting. If you have any problems - write. Alternatively, I guess,
you could try the current 2.6.27-rc7 kernel at least.

Jarek P.

BTW: could you try to trigger this bug with one network card off?


* Re: Machine Check Exception Re: NetDev! Please help!
  2008-09-22 11:24           ` Jarek Poplawski
@ 2008-09-22 13:00             ` Badalian Vyacheslav
  2008-09-22 17:23               ` Jarek Poplawski
  0 siblings, 1 reply; 9+ messages in thread
From: Badalian Vyacheslav @ 2008-09-22 13:00 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Denys Fedoryshchenko, netdev, linux-kernel


> BTW: could you try to trigger this bug with one network card off?
>   


Sure! I stopped eth1 and ran "/etc/init.d/bgpd stop" (this PC no longer
gets routed traffic)....

I ran "emerge portage" 2 times and got:

[25492.187405] CPU 3: Machine Check Exception: 0000000000000005
[25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[25492.187405] Bank 0: b200004000000800
[25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
[25492.187405] Bank 5: b200120014040400
[25497.124884] CPU 1: Machine Check Exception: 0000000000000004
[25497.124884] Kernel panic - not syncing: Unable to continue
[25497.124884] Rebooting in 3 seconds..

Bug tracker updated.... I can reproduce the error on all 10 servers with
2.6.26.5... I use TC and don't want to test 2.6.27-rc because it has (if I
understand correctly) the multiqueue feature, which is not tested...

Thanks!


* Re: Machine Check Exception Re: NetDev! Please help!
  2008-09-22 13:00             ` Badalian Vyacheslav
@ 2008-09-22 17:23               ` Jarek Poplawski
  2008-09-23  7:43                 ` Badalian Vyacheslav
  0 siblings, 1 reply; 9+ messages in thread
From: Jarek Poplawski @ 2008-09-22 17:23 UTC (permalink / raw)
  To: Badalian Vyacheslav; +Cc: Denys Fedoryshchenko, netdev, linux-kernel

On Mon, Sep 22, 2008 at 05:00:57PM +0400, Badalian Vyacheslav wrote:
> 
> > BTW: could you try to trigger this bug with one network card off?
> >   
> 
> 
> Sure! I stopped eth1 and ran "/etc/init.d/bgpd stop" (this PC no longer
> gets routed traffic)....
> 
> I ran "emerge portage" 2 times and got:
> 
> [25492.187405] CPU 3: Machine Check Exception: 0000000000000005
> [25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
> [25492.187405] Bank 0: b200004000000800
> [25492.187405] MCE: The hardware reports a non fatal, correctable incident occurred on CPU 1.
> [25492.187405] Bank 5: b200120014040400
> [25497.124884] CPU 1: Machine Check Exception: 0000000000000004
> [25497.124884] Kernel panic - not syncing: Unable to continue
> [25497.124884] Rebooting in 3 seconds..
> 
> Bug tracker updated.... I can reproduce the error on all 10 servers with
> 2.6.26.5... I use TC and don't want to test 2.6.27-rc because it has (if I
> understand correctly) the multiqueue feature, which is not tested...

Actually, it's quite well tested, especially by Denys, and I doubt it
will be much better in 2.6.27. BTW, maybe also try starting eth1 and
stopping eth0?

Thanks,
Jarek P.


* Re: Machine Check Exception Re: NetDev! Please help!
  2008-09-22 17:23               ` Jarek Poplawski
@ 2008-09-23  7:43                 ` Badalian Vyacheslav
  2008-09-23  9:25                   ` Jarek Poplawski
  0 siblings, 1 reply; 9+ messages in thread
From: Badalian Vyacheslav @ 2008-09-23  7:43 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Denys Fedoryshchenko, netdev, linux-kernel

Hello

I stopped eth1 and eth0, ran "emerge portage", and got the exception. Now I
think the problem is not in the network part.

I missed the discussion about multiqueue and the traffic shaper in 2.6.27
because I had a lot of work and couldn't read the whole netdev list =(
As I understand it, 2.6.27-rc supports multiqueue, but how will it work
with HTB/SFQ?
Must tc rules in 2.6.27 have one root queue (with all queues going into this
tree), or do I need to create many qdiscs and attach them to the device queues
(I read something about queue2band)?
If it is simple for you, can you briefly describe this part of the changes?

P.S. I now think the problem is not in the network part of the kernel, so
I will stop CCing netdev and Denys Fedoryshchenko. Thanks for what you are
doing and thanks for the help!

Thanks.

> Actually, it's quite well tested, especially by Denys, and I doubt it
> will be much better in 2.6.27. BTW, maybe also try starting eth1 and
> stopping eth0?
>
> Thanks,
> Jarek P.
>
>   


* Re: Machine Check Exception Re: NetDev! Please help!
  2008-09-23  7:43                 ` Badalian Vyacheslav
@ 2008-09-23  9:25                   ` Jarek Poplawski
  0 siblings, 0 replies; 9+ messages in thread
From: Jarek Poplawski @ 2008-09-23  9:25 UTC (permalink / raw)
  To: Badalian Vyacheslav; +Cc: Denys Fedoryshchenko, netdev, linux-kernel

On Tue, Sep 23, 2008 at 11:43:08AM +0400, Badalian Vyacheslav wrote:
> Hello
> 
> I stopped eth1 and eth0, ran "emerge portage", and got the exception. Now I
> think the problem is not in the network part.
> 
> I missed the discussion about multiqueue and the traffic shaper in 2.6.27
> because I had a lot of work and couldn't read the whole netdev list =(
> As I understand it, 2.6.27-rc supports multiqueue, but how will it work
> with HTB/SFQ?
> Must tc rules in 2.6.27 have one root queue (with all queues going into this
> tree), or do I need to create many qdiscs and attach them to the device queues
> (I read something about queue2band)?
> If it is simple for you, can you briefly describe this part of the changes?

There are two main cases:

1) The default qdiscs (created while activating a new net device):
depending on the driver (most drivers are still single-queue), an
independent pfifo_fast qdisc is created for each supported tx queue;
if a driver doesn't change this, packets are directed to them
automatically, according to some hash function, which tries to
separate different flows. This should be the fastest solution because
there are separate qdisc and transmit locks, which can be taken by
different CPUs at the same time.

2) Non-default qdiscs (any qdiscs added with tc): there is only one
root qdisc (with its tree) as before, dequeued to all tx queues (if
available). Since there is only one qdisc lock, plus an additional flag
preventing other processes from running the qdisc at the same time,
there is not much advantage from SMP, except in tx locking. All previous
tc configs should work without changes (except sch_prio and sch_rr
used for multiqueuing, now replaced by sch_multiq and act_skbedit).
Probably in some cases adding sch_multiq to a tree to separate
qdisc queues per tx queue could be useful.
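
The difference between the two cases can be illustrated with a toy Python model. This is only a sketch under stated assumptions: the CRC32-based hash, the flow tuple layout, and all function names are invented for illustration and are not what the kernel uses.

```python
import zlib
from collections import defaultdict

def pick_tx_queue(flow, num_tx_queues):
    """Map a flow tuple to a tx queue index; the same flow always gets the same queue.
    CRC32 is a stand-in here, not the kernel's hash."""
    key = "|".join(str(part) for part in flow).encode()
    return zlib.crc32(key) % num_tx_queues

def default_qdiscs(packets, num_tx_queues):
    """Case 1: one independent FIFO per tx queue, selected by flow hash."""
    queues = defaultdict(list)
    for flow, payload in packets:
        queues[pick_tx_queue(flow, num_tx_queues)].append(payload)
    return dict(queues)

def single_root_qdisc(packets):
    """Case 2: every packet passes through one shared root queue, in arrival order."""
    return [payload for _flow, payload in packets]

# Hypothetical flows: (source IP, destination IP, source port, destination port).
packets = [
    (("10.0.0.1", "10.0.0.9", 1234, 80), "a1"),
    (("10.0.0.2", "10.0.0.9", 5678, 80), "b1"),
    (("10.0.0.1", "10.0.0.9", 1234, 80), "a2"),
]
per_queue = default_qdiscs(packets, num_tx_queues=4)
shared = single_root_qdisc(packets)
```

In the first model each per-queue list could be guarded by its own lock, so different CPUs can work on different queues at once. In the second, one lock around the shared list serializes all enqueue and dequeue work, which is the limitation described in case 2.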

Cheers,
Jarek P.
