* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-08 23:36 Ian Pratt
  2006-02-09  1:27 ` Chris Wright
  2006-02-09 15:02 ` Christian Leber
  0 siblings, 2 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-08 23:36 UTC (permalink / raw)
  To: adam, Chris Wright, xen-devel

 
> Ah, finally people experiencing same bug as me!
> 
> It is much much worse for me, as soon as I ping from domU 
> network goes down, started with the subarch change.

For this bug it might actually be helpful to start collecting
information about the hardware it's observed on.

For us, the bug is hard to repro, despite us having tried on several
different machines (2 and 4 way SMP, Opteron and Xeon, tg3 and e1000
NICs).

If the bug is easier to trigger for you, please post a summary of the
hardware and anything unusual about your config (i.e. not default
bridged).

Thanks,
Ian
 
> Adam Wendt
> IPCoast, Inc.
> 
> On Wed, 8 Feb 2006 12:11 , Chris Wright <chrisw@sous-sol.org> sent:
> 
> >* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
> >> Yep, this is the bug I warned y'all about at the summit, but you 
> >> asked for the code to be checked in anyway...
> >
> >Hehe, get what you ask for...
> >
> >> A bug shared is a bug fixed quicker? :-)
> >
> >Let's hope ;-)
> >
> >> For us, this only manifests on x86_64, and arrived with the subarch
> >> xen version of 2.6.12. Extensive inspection of the arch->subarch
> >> conversion suggests that nothing should have changed, so this is
> >> likely a latent bug being triggered by slight timing changes.
> >> 
> >> It sounds like it's rather easier for you to trigger than it was for us
> >> -- we had to run xm-test several times to get it to happen. Happy
> >> hunting, and good luck :-)
> >
> >It's trivial for me to trigger.  I'll keep poking at it.
> >
> >thanks,
> >-chris
> >
> >_______________________________________________
> >Xen-devel mailing list
> >Xen-devel@lists.xensource.com
> >http://lists.xensource.com/xen-devel
> >

* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09 23:55 Kamble, Nitin A
  2006-02-10 11:20 ` Keir Fraser
  2006-02-16  3:07 ` Chris Wright
  0 siblings, 2 replies; 26+ messages in thread
From: Kamble, Nitin A @ 2006-02-09 23:55 UTC (permalink / raw)
  To: Chris Wright, Ian Pratt; +Cc: xen-devel, adam

>  - limiting to 2G works fine, sounds like something with swiotlb

I noticed it too, and it behaves exactly the same. I also see this in the dom0 dmesg:

PCI-DMA: Disabling IOMMU.
WARNING more than 4GB of memory but IOMMU not compiled in.
WARNING 32bit PCI may malfunction.
You might want to enable CONFIG_GART_IOMMU
Memory: 5868412k/6071120k available (3553k kernel code, 202040k reserved, 1376k data, 300k init)

Thanks & Regards,
Nitin
------------------------------------------------------------------------
Open Source Technology Center, Intel Corp

>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Chris Wright
>Sent: Wednesday, February 08, 2006 5:28 PM
>To: Ian Pratt
>Cc: Chris Wright; xen-devel@lists.xensource.com; adam@ipcoast.com
>Subject: Re: [Xen-devel] x86_64 eth0 e1000_clean_tx_irq tx hang
>
>* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
>> For us, the bug is hard to repro, despite us having tried on several
>> different machines (2 and 4 way SMP, Opteron and Xeon, tg3 and e1000
>> NICs).
>
>xeon, 4 cpu (2-ht), e1000, 4G
>
>- works fine w/ 32-bit
>- dom0 is UP (SMP fails as well)
>  - this is dom0 only, no xend, no domUs, no bridging
>  - limiting to 2G works fine, sounds like something with swiotlb
>
>Also, while it was working, I blasted with packets, and eventually got:
>
>irq 19: nobody cared (try booting with the "irqpoll" option)
>
>Call Trace: <IRQ> <ffffffff80148508>{__report_bad_irq+56}
>       <ffffffff80148721>{note_interrupt+449} <ffffffff80147dcc>{handle_IRQ_event+76}
>       <ffffffff80147ec2>{__do_IRQ+162} <ffffffff8011077b>{do_IRQ+75}
>       <ffffffff802f16b5>{evtchn_do_upcall+117} <ffffffff8010e5f1>{do_hypervisor_callback+37}
>       <ffffffff8011ccc5>{ia32_syscall+13} <ffffffff8010a22a>{hypercall_page+554}
>       <ffffffff8010a22a>{hypercall_page+554} <ffffffff802f14de>{force_evtchn_callback+14}
>       <ffffffff80147db5>{handle_IRQ_event+53} <ffffffff80147ea8>{__do_IRQ+136}
>       <ffffffff8011077b>{do_IRQ+75} <ffffffff802f16b5>{evtchn_do_upcall+117}
>       <ffffffff8010e5f1>{do_hypervisor_callback+37} <EOI>
>       <ffffffff8011ccc5>{ia32_syscall+13}
>handlers:
>[<ffffffff80377b80>] (ata_interrupt+0x0/0x1b0)
>[<ffffffff80396570>] (usb_hcd_irq+0x0/0x70)
>Disabling IRQ #19
>
>thanks,
>-chris
>

* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09  2:29 Ian Pratt
  0 siblings, 0 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-09  2:29 UTC (permalink / raw)
  To: Chris Wright; +Cc: xen-devel, adam

> > What devices are on irq 19?
> > 
> > It might be worth trying booting nousb on the kernel command line (or
> > usb-handoff)
> 
> 19:       5748        Phys-irq  libata, uhci_hcd:usb3
> 
> with ata, that effectively killed the box.  trying with 
> nousb, but I wonder if it's not an evtchn problem?

Something else to try might be booting with maxcpus=1 on the xen command
line, but if you're running just a uniproc dom0 this really ought not
make any difference. 

When the box is in a bad state, it might be worth using the serial debug
keys to get some information about the ioapic and event channels.

I'm glad you've got an easy way of repro'ing this. I've just tried again
on a couple of our machines and it took me ages to trigger. 

Thanks,
Ian
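
For the "what devices are on irq 19?" question, the /proc/interrupts
line quoted above already holds the answer. A minimal Python sketch of
reading such a line follows. The function name parse_interrupt_line is
hypothetical, and the parser assumes the layout of the quoted sample
(one or more per-CPU counts, a single-token controller name, then a
comma-separated device list), which may not hold on other kernels.

```python
def parse_interrupt_line(line):
    """Split one /proc/interrupts row into (irq, counts, chip, devices).

    Assumption: the controller column is a single token, as in the
    sample quoted in this thread ("Phys-irq").
    """
    irq, _, rest = line.partition(":")
    tokens = rest.split()

    # Leading numeric tokens are the per-CPU interrupt counts.
    counts = []
    while tokens and tokens[0].isdigit():
        counts.append(int(tokens.pop(0)))

    # Next token is the interrupt controller name.
    chip = tokens.pop(0) if tokens else ""

    # Whatever remains is the comma-separated list of registered handlers.
    devices = [d.strip() for d in " ".join(tokens).split(",") if d.strip()]
    return irq.strip(), counts, chip, devices


# Line copied from the thread.
sample = "19:       5748        Phys-irq  libata, uhci_hcd:usb3"
irq, counts, chip, devices = parse_interrupt_line(sample)
print(f"irq {irq} ({chip}): {devices}, shared={len(devices) > 1}")
```

A line with more than one device means the IRQ is shared, which is
the situation in which a "nobody cared" report disables the line for
every device on it.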

* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09  1:50 Ian Pratt
  2006-02-09  1:59 ` Chris Wright
  2006-02-09  2:29 ` Chris Wright
  0 siblings, 2 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-09  1:50 UTC (permalink / raw)
  To: Chris Wright; +Cc: xen-devel, adam

> > That's interesting, but I'd be surprised if it was an swiotlb thing --
> > it looks so much more like an interrupt problem. e1000 and tg3
> > shouldn't be going anywhere near swiotlb anyhow.
> > 
> > Please can you try a PAE kernel just to check you don't have the 
> > problem.
> 
> It's 64-bit.

Yep, but I'm wondering whether it's worth trying a PAE kernel as that
might give a datapoint to indicate whether swiotlb might be involved. 

My money is still on an interrupt problem (virtual or otherwise),
though.

Ian

* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09  1:29 Ian Pratt
  2006-02-09  1:38 ` Chris Wright
  2006-02-09  1:49 ` Chris Wright
  0 siblings, 2 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-09  1:29 UTC (permalink / raw)
  To: Chris Wright; +Cc: xen-devel, adam

> xeon, 4 cpu (2-ht), e1000, 4G
> 
> - works fine w/ 32-bit
> - dom0 is UP (SMP fails as well)
>   - this is dom0 only, no xend, no domUs, no bridging
>   - limiting to 2G works fine, sounds like something with swiotlb

That's interesting, but I'd be surprised if it was an swiotlb thing --
it looks so much more like an interrupt problem. e1000 and tg3 shouldn't
be going anywhere near swiotlb anyhow.

Please can you try a PAE kernel just to check you don't have the
problem.
 
> Also, while it was working, I blasted with packets, and 
> eventually got:
> 
> irq 19: nobody cared (try booting with the "irqpoll" option)

What devices are on irq 19?

It might be worth trying booting nousb on the kernel command line (or
usb-handoff)

Thanks,
Ian

> Call Trace: <IRQ> <ffffffff80148508>{__report_bad_irq+56}
>        <ffffffff80148721>{note_interrupt+449} <ffffffff80147dcc>{handle_IRQ_event+76}
>        <ffffffff80147ec2>{__do_IRQ+162} <ffffffff8011077b>{do_IRQ+75}
>        <ffffffff802f16b5>{evtchn_do_upcall+117} <ffffffff8010e5f1>{do_hypervisor_callback+37}
>        <ffffffff8011ccc5>{ia32_syscall+13} <ffffffff8010a22a>{hypercall_page+554}
>        <ffffffff8010a22a>{hypercall_page+554} <ffffffff802f14de>{force_evtchn_callback+14}
>        <ffffffff80147db5>{handle_IRQ_event+53} <ffffffff80147ea8>{__do_IRQ+136}
>        <ffffffff8011077b>{do_IRQ+75} <ffffffff802f16b5>{evtchn_do_upcall+117}
>        <ffffffff8010e5f1>{do_hypervisor_callback+37} <EOI>
>        <ffffffff8011ccc5>{ia32_syscall+13}
> handlers:
> [<ffffffff80377b80>] (ata_interrupt+0x0/0x1b0)
> [<ffffffff80396570>] (usb_hcd_irq+0x0/0x70)
> Disabling IRQ #19
> 
> thanks,
> -chris
> 

* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-08 20:01 Ian Pratt
  2006-02-08 20:11 ` Chris Wright
  0 siblings, 1 reply; 26+ messages in thread
From: Ian Pratt @ 2006-02-08 20:01 UTC (permalink / raw)
  To: Chris Wright, xen-devel

> This is against current x86_64 defconfig build:
> 
> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>   Tx Queue             <0>
>   TDH                  <2b>
>   TDT                  <31>
>   next_to_use          <31>
>   next_to_clean        <2b>
> buffer_info[next_to_clean]
>   time_stamp           <10004d5f2>
>   next_to_watch        <2d>
>   jiffies              <10004d7ce>
>   next_to_watch.status <0>
> 
> ... repeat until eventually ...
> 
> NETDEV WATCHDOG: eth0: transmit timed out
> 
> this is on simple scp to dom0 from external box.  after a bit 
> watchdog resets, and ping works, only to repeat itself when I 
> try to scp again

Yep, this is the bug I warned y'all about at the summit, but you asked
for the code to be checked in anyway... 

A bug shared is a bug fixed quicker? :-)

For us, this only manifests on x86_64, and arrived with the subarch xen
version of 2.6.12. Extensive inspection of the arch->subarch conversion
suggests that nothing should have changed, so this is likely a latent
bug being triggered by slight timing changes.

It sounds like it's rather easier for you to trigger than it was for us
-- we had to run xm-test several times to get it to happen. Happy
hunting, and good luck :-)

Ian

* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-08 15:06 Adam Wendt
  0 siblings, 0 replies; 26+ messages in thread
From: Adam Wendt @ 2006-02-08 15:06 UTC (permalink / raw)
  To: Ian Pratt, Chris Wright, xen-devel

Ah, finally people experiencing same bug as me!

It is much much worse for me, as soon as I ping from domU network goes down,
started with the subarch change.

Adam Wendt
IPCoast, Inc.

On Wed, 8 Feb 2006 12:11 , Chris Wright <chrisw@sous-sol.org> sent:

>* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
>> Yep, this is the bug I warned y'all about at the summit, but you asked
>> for the code to be checked in anyway... 
>
>Hehe, get what you ask for...
>
>> A bug shared is a bug fixed quicker? :-)
>
>Let's hope ;-)
>
>> For us, this only manifests on x86_64, and arrived with the subarch xen
>> version of 2.6.12. Extensive inspection of the arch->subarch conversion
>> suggests that nothing should have changed, so this is likely a latent
>> bug being triggered by slight timing changes.
>> 
>> It sounds like it's rather easier for you to trigger than it was for us
>> -- we had to run xm-test several times to get it to happen. Happy
>> hunting, and good luck :-)
>
>It's trivial for me to trigger.  I'll keep poking at it.
>
>thanks,
>-chris
>
>

* x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-07 20:47 Chris Wright
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-07 20:47 UTC (permalink / raw)
  To: xen-devel

This is against current x86_64 defconfig build:

e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <2b>
  TDT                  <31>
  next_to_use          <31>
  next_to_clean        <2b>
buffer_info[next_to_clean]
  time_stamp           <10004d5f2>
  next_to_watch        <2d>
  jiffies              <10004d7ce>
  next_to_watch.status <0>

... repeat until eventually ...

NETDEV WATCHDOG: eth0: transmit timed out

this is on simple scp to dom0 from external box.  after a bit watchdog
resets, and ping works, only to repeat itself when I try to scp again

thanks,
-chris
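
The register dump above can be read mechanically. The Python sketch
below takes the hexadecimal values from the dump and works out how many
descriptors sit between the hardware head (TDH) and tail (TDT), and how
many jiffies the buffer at next_to_clean has been waiting. RING_SIZE =
256 is an assumption, since the thread does not state the ring size,
and the function names are hypothetical.

```python
# Values copied from the "Detected Tx Unit Hang" dump (all hexadecimal).
DUMP = {
    "TDH": "2b",
    "TDT": "31",
    "next_to_use": "31",
    "next_to_clean": "2b",
    "time_stamp": "10004d5f2",
    "next_to_watch": "2d",
    "jiffies": "10004d7ce",
}

RING_SIZE = 256  # assumption: not stated anywhere in the thread


def pending_descriptors(head, tail, ring_size=RING_SIZE):
    """Descriptors handed to the NIC but not yet consumed (ring-wrapped)."""
    return (tail - head) % ring_size


def summarise(dump, ring_size=RING_SIZE):
    v = {name: int(text, 16) for name, text in dump.items()}
    return {
        # Work queued between the hardware head and tail pointers.
        "pending": pending_descriptors(v["TDH"], v["TDT"], ring_size),
        # Age of the buffer at next_to_clean, in timer ticks.
        "stalled_jiffies": v["jiffies"] - v["time_stamp"],
        # Do the driver's software indices match the hardware registers?
        "sw_hw_agree": v["TDH"] == v["next_to_clean"]
        and v["TDT"] == v["next_to_use"],
    }


print(summarise(DUMP))
```

For this dump the driver and the NIC agree on the ring indices, so the
descriptors are queued but their status is never reported done
(next_to_watch.status is 0). Converting the jiffies figure to seconds
needs the kernel's HZ value, which the thread does not give.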


end of thread, other threads:[~2006-02-17  7:26 UTC | newest]

Thread overview: 26+ messages
2006-02-08 23:36 x86_64 eth0 e1000_clean_tx_irq tx hang Ian Pratt
2006-02-09  1:27 ` Chris Wright
2006-02-09 15:02 ` Christian Leber
2006-02-09 17:24   ` Chris Wright
2006-02-09 23:29     ` Christian Leber
  -- strict thread matches above, loose matches on Subject: below --
2006-02-09 23:55 Kamble, Nitin A
2006-02-10 11:20 ` Keir Fraser
2006-02-10 11:33   ` Muli Ben-Yehuda
2006-02-16  3:07 ` Chris Wright
2006-02-16 11:36   ` Keir Fraser
2006-02-16 11:45     ` Jan Beulich
2006-02-16 13:54       ` Keir Fraser
2006-02-16 13:10   ` Guillaume Thouvenin
2006-02-16 13:55     ` Keir Fraser
2006-02-17  7:26       ` Guillaume Thouvenin
2006-02-09  2:29 Ian Pratt
2006-02-09  1:50 Ian Pratt
2006-02-09  1:59 ` Chris Wright
2006-02-09  2:29 ` Chris Wright
2006-02-09  1:29 Ian Pratt
2006-02-09  1:38 ` Chris Wright
2006-02-09  1:49 ` Chris Wright
2006-02-08 20:01 Ian Pratt
2006-02-08 20:11 ` Chris Wright
2006-02-08 15:06 Adam Wendt
2006-02-07 20:47 Chris Wright
