RE: x86_64 eth0 e1000_clean_tx

* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-08 23:36 Ian Pratt
  2006-02-09  1:27 ` Chris Wright
  2006-02-09 15:02 ` Christian Leber
  0 siblings, 2 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-08 23:36 UTC (permalink / raw)
  To: adam, Chris Wright, xen-devel

 
> Ah, finally people experiencing same bug as me!
> 
> It is much much worse for me, as soon as I ping from domU 
> network goes down, started with the subarch change.

For this bug it might actually be helpful to start collecting
information about the hardware it's observed on.

For us, the bug is hard to repro, despite us having tried on several
different machines (2 and 4 way SMP, Opteron and Xeon, tg3 and e1000
NICs).

If the bug is easier to trigger for you, please post a summary of the
hardware and anything unusual about your config (i.e. not default
bridged).

Thanks,
Ian
 
> Adam Wendt
> IPCoast, Inc.
> 
> On Wed, 8 Feb 2006 12:11 , Chris Wright <chrisw@sous-sol.org> sent:
> 
> >* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
> >> Yep, this is the bug I warned y'all about at the summit, but you 
> >> asked for the code to be checked in anyway...
> >
> >Hehe, get what you ask for...
> >
> >> A bug shared is a bug fixed quicker? :-)
> >
> >Let's hope ;-)
> >
> >> For us, this only manifests on x86_64, and arrived with the subarch
> >> xen version of 2.6.12. Extensive inspection of the arch->subarch
> >> conversion suggests that nothing should have changed, so this is
> >> likely a latent bug being triggered by slight timing changes.
> >>
> >> It sounds like it's rather easier for you to trigger than it was for
> >> us -- we had to run xm-test several times to get it to happen. Happy
> >> hunting, and good luck :-)
> >
> >It's trivial for me to trigger.  I'll keep poking at it.
> >
> >thanks,
> >-chris
> >


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-08 23:36 x86_64 eth0 e1000_clean_tx_irq tx hang Ian Pratt
@ 2006-02-09  1:27 ` Chris Wright
  2006-02-09 15:02 ` Christian Leber
  1 sibling, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-09  1:27 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Chris Wright, xen-devel, adam

* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
> For us, the bug is hard to repro, despite us having tried on several
> different machines (2 and 4 way SMP, Opteron and Xeon, tg3 and e1000
> NICs).

xeon, 4 cpu (2-ht), e1000, 4G

- works fine w/ 32-bit
- dom0 is UP (SMP fails as well)
  - this is dom0 only, no xend, no domUs, no bridging
  - limiting to 2G works fine, sounds like something with swiotlb

Also, while it was working, I blasted with packets, and eventually got:

irq 19: nobody cared (try booting with the "irqpoll" option)

Call Trace: <IRQ> <ffffffff80148508>{__report_bad_irq+56}
       <ffffffff80148721>{note_interrupt+449} <ffffffff80147dcc>{handle_IRQ_event+76}
       <ffffffff80147ec2>{__do_IRQ+162} <ffffffff8011077b>{do_IRQ+75}
       <ffffffff802f16b5>{evtchn_do_upcall+117} <ffffffff8010e5f1>{do_hypervisor_callback+37}
       <ffffffff8011ccc5>{ia32_syscall+13} <ffffffff8010a22a>{hypercall_page+554}
       <ffffffff8010a22a>{hypercall_page+554} <ffffffff802f14de>{force_evtchn_callback+14}
       <ffffffff80147db5>{handle_IRQ_event+53} <ffffffff80147ea8>{__do_IRQ+136}
       <ffffffff8011077b>{do_IRQ+75} <ffffffff802f16b5>{evtchn_do_upcall+117}
       <ffffffff8010e5f1>{do_hypervisor_callback+37} <EOI>
       <ffffffff8011ccc5>{ia32_syscall+13}
handlers:
[<ffffffff80377b80>] (ata_interrupt+0x0/0x1b0)
[<ffffffff80396570>] (usb_hcd_irq+0x0/0x70)
Disabling IRQ #19

thanks,
-chris


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-08 23:36 x86_64 eth0 e1000_clean_tx_irq tx hang Ian Pratt
  2006-02-09  1:27 ` Chris Wright
@ 2006-02-09 15:02 ` Christian Leber
  2006-02-09 17:24   ` Chris Wright
  1 sibling, 1 reply; 26+ messages in thread
From: Christian Leber @ 2006-02-09 15:02 UTC (permalink / raw)
  To: xen-devel

On Wed, Feb 08, 2006 at 11:36:06PM -0000, Ian Pratt wrote:

> For this bug it might actually be helpful to start collecting
> information about the hardware it's observed on.
> 
> For us, the bug is hard to repro, despite us having tried on several
> different machines (2 and 4 way SMP, Opteron and Xeon, tg3 and e1000
> NICs).

It's not on Xen, but I get something similar with scp on 2.6.15
(and this Tx Unit Hang seems to be a rare problem):

[4294726.019000] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[4294726.019000] TDH <cb>
[4294726.019000] TDT <cb>
[4294726.019000] next_to_use <cb>
[4294726.019000] next_to_clean <df>
[4294726.019000] buffer_info[next_to_clean]
[4294726.019000] dma <1aa25cce>
[4294726.019000] time_stamp <fffc40e7>
[4294726.019000] next_to_watch <df>
[4294726.019000] jiffies <fffc5183>
[4294726.019000] next_to_watch.status <0>

That happens on AthlonXP+ViaKT600 but not on Intel PIII with Intel 815
chipset.
https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/30476
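
A note on reading the dump above: time_stamp and jiffies are printed in
hex, and their difference is the number of timer ticks the oldest pending
descriptor has been waiting. The following is a minimal sketch of that
arithmetic, not the driver's code. The helper names and the hz parameter
are hypothetical, and the one-second threshold is an assumption about how
the e1000 driver of this era decides that the Tx unit is hung. Converting
ticks to seconds depends on the kernel's HZ setting, which is not stated
in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two 32-bit jiffies samples. Unsigned subtraction
 * stays correct even if the counter wrapped between the two samples. */
static uint32_t ticks_pending(uint32_t jiffies_now, uint32_t time_stamp)
{
    return jiffies_now - time_stamp;
}

/* Wrap-safe "a is later than b", the same idea as the kernel's
 * time_after(). The cast to int32_t assumes a two's-complement target. */
static int is_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Assumed rule: the descriptor has been pending for more than hz ticks,
 * i.e. more than one second. hz is a caller-supplied assumption. */
static int looks_hung(uint32_t jiffies_now, uint32_t time_stamp, uint32_t hz)
{
    return is_after(jiffies_now, time_stamp + hz);
}

/* Values taken from the dump above: jiffies <fffc5183>,
 * time_stamp <fffc40e7>. */
static void self_check(void)
{
    assert(ticks_pending(0xfffc5183u, 0xfffc40e7u) == 0x109cu); /* 4252 ticks */
    assert(looks_hung(0xfffc5183u, 0xfffc40e7u, 1000u));        /* hz = 1000 assumed */
}
```

With these numbers the descriptor had been pending for 4252 ticks, which
is several seconds for any common HZ value and well past a one-second
threshold.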


Christian Leber

-- 
  "Omnis enim res, quae dando non deficit, dum habetur et non datur,
   nondum habetur, quomodo habenda est."       (Aurelius Augustinus)
  Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09 15:02 ` Christian Leber
@ 2006-02-09 17:24   ` Chris Wright
  2006-02-09 23:29     ` Christian Leber
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Wright @ 2006-02-09 17:24 UTC (permalink / raw)
  To: Christian Leber; +Cc: xen-devel

* Christian Leber (christian@leber.de) wrote:
> That happens on AthlonXP+ViaKT600 but not on Intel PIII with Intel 815
> chipset.
> https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/30476

Does that have >=2.6.15.2 patchset?

thanks,
-chris


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09 17:24   ` Chris Wright
@ 2006-02-09 23:29     ` Christian Leber
  0 siblings, 0 replies; 26+ messages in thread
From: Christian Leber @ 2006-02-09 23:29 UTC (permalink / raw)
  To: Chris Wright; +Cc: xen-devel

On Thu, Feb 09, 2006 at 09:24:57AM -0800, Chris Wright wrote:
> > That happens on AthlonXP+ViaKT600 but not on Intel PIII with Intel 815
> > chipset.
> > https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/30476
> 
> Does that have >=2.6.15.2 patchset?

No, but it's >=2.6.15.1 and the 2.6.15.2 changelog doesn't seem to be related
to ethernet drivers.
I also tried 2.6.16-rc2 and it has the same problem.

Christian Leber

-- 
  "Omnis enim res, quae dando non deficit, dum habetur et non datur,
   nondum habetur, quomodo habenda est."       (Aurelius Augustinus)
  Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>


* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09 23:55 Kamble, Nitin A
  2006-02-10 11:20 ` Keir Fraser
  2006-02-16  3:07 ` Chris Wright
  0 siblings, 2 replies; 26+ messages in thread
From: Kamble, Nitin A @ 2006-02-09 23:55 UTC (permalink / raw)
  To: Chris Wright, Ian Pratt; +Cc: xen-devel, adam

>  - limiting to 2G works fine, sounds like something with swiotlb

I noticed exactly the same thing. I also see this in the dom0 dmesg:

PCI-DMA: Disabling IOMMU.
WARNING more than 4GB of memory but IOMMU not compiled in.
WARNING 32bit PCI may malfunction.
You might want to enable CONFIG_GART_IOMMU
Memory: 5868412k/6071120k available (3553k kernel code, 202040k reserved, 1376k data, 300k init)

Thanks & Regards,
Nitin
------------------------------------------------------------------------
Open Source Technology Center, Intel Corp

>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com [mailto:xen-devel-
>bounces@lists.xensource.com] On Behalf Of Chris Wright
>Sent: Wednesday, February 08, 2006 5:28 PM
>To: Ian Pratt
>Cc: Chris Wright; xen-devel@lists.xensource.com; adam@ipcoast.com
>Subject: Re: [Xen-devel] x86_64 eth0 e1000_clean_tx_irq tx hang
>
>* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
>> For us, the bug is hard to repro, despite us having tried on several
>> different machines (2 and 4 way SMP, Opteron and Xeon, tg3 and e1000
>> NICs).
>
>xeon, 4 cpu (2-ht), e1000, 4G
>
>- works fine w/ 32-bit
>- dom0 is UP (SMP fails as well)
>  - this is dom0 only, no xend, no domUs, no bridging
>  - limiting to 2G works fine, sounds like something with swiotlb
>
>Also, while it was working, I blasted with packets, and eventually got:
>
>irq 19: nobody cared (try booting with the "irqpoll" option)
>
>Call Trace: <IRQ> <ffffffff80148508>{__report_bad_irq+56}
>       <ffffffff80148721>{note_interrupt+449} <ffffffff80147dcc>{handle_IRQ_event+76}
>       <ffffffff80147ec2>{__do_IRQ+162} <ffffffff8011077b>{do_IRQ+75}
>       <ffffffff802f16b5>{evtchn_do_upcall+117} <ffffffff8010e5f1>{do_hypervisor_callback+37}
>       <ffffffff8011ccc5>{ia32_syscall+13} <ffffffff8010a22a>{hypercall_page+554}
>       <ffffffff8010a22a>{hypercall_page+554} <ffffffff802f14de>{force_evtchn_callback+14}
>       <ffffffff80147db5>{handle_IRQ_event+53} <ffffffff80147ea8>{__do_IRQ+136}
>       <ffffffff8011077b>{do_IRQ+75} <ffffffff802f16b5>{evtchn_do_upcall+117}
>       <ffffffff8010e5f1>{do_hypervisor_callback+37} <EOI>
>       <ffffffff8011ccc5>{ia32_syscall+13}
>handlers:
>[<ffffffff80377b80>] (ata_interrupt+0x0/0x1b0)
>[<ffffffff80396570>] (usb_hcd_irq+0x0/0x70)
>Disabling IRQ #19
>
>thanks,
>-chris
>


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09 23:55 Kamble, Nitin A
@ 2006-02-10 11:20 ` Keir Fraser
  2006-02-10 11:33   ` Muli Ben-Yehuda
  2006-02-16  3:07 ` Chris Wright
  1 sibling, 1 reply; 26+ messages in thread
From: Keir Fraser @ 2006-02-10 11:20 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Chris Wright, Ian Pratt, xen-devel, adam


On 9 Feb 2006, at 23:55, Kamble, Nitin A wrote:

>>  - limiting to 2G works fine, sounds like something with swiotlb
>
> I noticed it too and exactly same. I also notice this in the dom0 
> dmesg.
>
> PCI-DMA: Disabling IOMMU.
> WARNING more than 4GB of memory but IOMMU not compiled in.
> WARNING 32bit PCI may malfunction.
> You might want to enable CONFIG_GART_IOMMU
> Memory: 5868412k/6071120k available (3553k kernel code, 202040k reserved, 1376k data, 300k init)

That is harmless. In fact our SWIOTLB probably is enabled (look at the 
lines just above the ones you posted). It's because we don't properly 
(yet) respect the new plug-n-play dma_ops structures in x86_64.

I've checked in a temporary fix to remove the above misleading lines.

  -- Keir


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-10 11:20 ` Keir Fraser
@ 2006-02-10 11:33   ` Muli Ben-Yehuda
  0 siblings, 0 replies; 26+ messages in thread
From: Muli Ben-Yehuda @ 2006-02-10 11:33 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Chris Wright, Ian Pratt, xen-devel, adam

On Fri, Feb 10, 2006 at 11:20:50AM +0000, Keir Fraser wrote:

> That is harmless. In fact our SWIOTLB probably is enabled (look at the 
> lines just above the ones you posted). It's because we don't properly 
> (yet) respect the new plug-n-play dma_ops structures in x86_64.
> 
> I've checked in a temporary fix to remove the above misleading
> lines.

There was also a harmless bug in the initial dma_ops patch that caused
the wrong printk in some cases. Jon Mason submitted a fix that is in
mainline now. Not sure if this is the case here, but FYI.

Cheers,
Muli
-- 
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09 23:55 Kamble, Nitin A
  2006-02-10 11:20 ` Keir Fraser
@ 2006-02-16  3:07 ` Chris Wright
  2006-02-16 11:36   ` Keir Fraser
  2006-02-16 13:10   ` Guillaume Thouvenin
  1 sibling, 2 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-16  3:07 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Chris Wright, Ian Pratt, xen-devel, adam

* Kamble, Nitin A (nitin.a.kamble@intel.com) wrote:
> >  - limiting to 2G works fine, sounds like something with swiotlb
> 
> I noticed it too and exactly same. I also notice this in the dom0 dmesg.

After spending hours trying to find something -- anything -- wrong with
irq delivery and e1000 hung tx unit, I went back to my original hunch,
which was swiotlb related.  When TSO is enabled, some debugging showed
this:

swiotlb_map_page: returns d586a000
dma_map_page: returns ffffffffd586a000

Indeed.

 a43:   e8 00 00 00 00          callq  a48 <dma_map_page+0xc8>
                        a44: R_X86_64_PC32      swiotlb_map_page+0xfffffffffffffffc
 a48:   48 63 d8                movslq %eax,%rbx

Whoops.  Prototype mismatch.

And had we been paying attention:

/home/chrisw/hg/xen/xen-unstable/linux-2.6.16-rc2-xen0/arch/x86_64/kernel/../../i386/kernel/pci-dma-xen.c:107: warning: implicit declaration of function ‘swiotlb_map_page’
/home/chrisw/hg/xen/xen-unstable/linux-2.6.16-rc2-xen0/arch/x86_64/kernel/../../i386/kernel/pci-dma-xen.c: In function ‘dma_unmap_page’:
/home/chrisw/hg/xen/xen-unstable/linux-2.6.16-rc2-xen0/arch/x86_64/kernel/../../i386/kernel/pci-dma-xen.c:125: warning: implicit declaration of function ‘swiotlb_unmap_page’

Here's a quick patch that fixes the issue (not ready to apply to
-unstable, since it's a file that's not in sparse tree).  Nitin, this
should fix your problem as well.  I'll work on a proper patch later this
evening or tomorrow morning.

thanks,
-chris
--

--- linux-2.6.16-rc2/include/asm-x86_64/swiotlb.h	2006-02-15 21:42:24.000000000 -0500
+++ linux-2.6.16-rc2-xen0/include/asm-x86_64/swiotlb.h	2006-02-15 21:19:15.000000000 -0500
@@ -38,6 +38,11 @@
 extern void swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sg,
 			 int nents, int direction);
 extern int swiotlb_dma_mapping_error(dma_addr_t dma_addr);
+extern dma_addr_t swiotlb_map_page(struct device *hwdev, struct page *page,
+                                   unsigned long offset, size_t size,
+                                   enum dma_data_direction direction);
+extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dma_address,
+                               size_t size, enum dma_data_direction direction);
 extern void swiotlb_free_coherent (struct device *hwdev, size_t size,
 				   void *vaddr, dma_addr_t dma_handle);
 extern int swiotlb_dma_supported(struct device *hwdev, u64 mask);
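
The failure mode described in this message can be reproduced in
isolation. With no prototype in scope, the compiler assumes that
swiotlb_map_page() returns int, so the caller keeps only the low 32 bits
of the returned bus address and sign-extends them (the movslq above). A
minimal userspace sketch follows; the helper names are hypothetical and
it assumes a two's-complement target, so it illustrates the effect rather
than the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

/* What a caller without a prototype ends up with: the 64-bit return
 * value is treated as a 32-bit int and then widened with sign extension,
 * which is what movslq %eax,%rbx does. */
static uint64_t seen_through_implicit_int(uint64_t real_return)
{
    int32_t truncated = (int32_t)(uint32_t)real_return; /* low 32 bits, signed */
    return (uint64_t)(int64_t)truncated;                /* sign-extend to 64 bits */
}

/* The first value has bit 31 clear and survives. The second is the
 * address from the debug output above and comes back corrupted. */
static void self_check(void)
{
    assert(seen_through_implicit_int(0x7fffffffULL) == 0x7fffffffULL);
    assert(seen_through_implicit_int(0xd586a000ULL) == 0xffffffffd586a000ULL);
}
```

Any address with bit 31 clear passes through unchanged, which would be
consistent with the earlier reports that limiting memory to 2G avoids the
hang. Building with implicit-declaration warnings treated as errors is
one way to catch this class of bug at compile time.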


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-16  3:07 ` Chris Wright
@ 2006-02-16 11:36   ` Keir Fraser
  2006-02-16 11:45     ` Jan Beulich
  2006-02-16 13:10   ` Guillaume Thouvenin
  1 sibling, 1 reply; 26+ messages in thread
From: Keir Fraser @ 2006-02-16 11:36 UTC (permalink / raw)
  To: Chris Wright; +Cc: Ian Pratt, xen-devel, adam


On 16 Feb 2006, at 03:07, Chris Wright wrote:

> Here's a quick patch that fixes the issue (not ready to apply to
> -unstable, since it's a file that's not in sparse tree).  Nitin, this
> should fix your problem as well.  I'll work on a proper patch later 
> this
> evening or tomorrow morning.

Thanks for tracking this one down: it's been our major outstanding bug 
for a while now. We checked in a suitable fix to -unstable (change 
pci-dma-xen.c to explicitly include the asm-i386/mach-xen version of 
swiotlb.h).

  -- Keir


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-16 11:36   ` Keir Fraser
@ 2006-02-16 11:45     ` Jan Beulich
  2006-02-16 13:54       ` Keir Fraser
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Beulich @ 2006-02-16 11:45 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 16.02.06 12:36:29 >>>
>
>On 16 Feb 2006, at 03:07, Chris Wright wrote:
>
>> Here's a quick patch that fixes the issue (not ready to apply to
>> -unstable, since it's a file that's not in sparse tree).  Nitin, this
>> should fix your problem as well.  I'll work on a proper patch later 
>> this
>> evening or tomorrow morning.
>
>Thanks for tracking this one down: it's been our major outstanding bug 
>for a while now. We checked in a suitable fix to -unstable (change 
>pci-dma-xen.c to explicitly include the asm-i386/mach-xen version of 
>swiotlb.h).

This doesn't sound like a good thing to do, as that way all but this one file will include the x86-64 version of it,
and you can easily get things out of sync (if e.g. the x86-64 version changes). I would much favor the change being done
as originally posted; we have a similar fix in our tree.

Jan


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-16 11:45     ` Jan Beulich
@ 2006-02-16 13:54       ` Keir Fraser
  0 siblings, 0 replies; 26+ messages in thread
From: Keir Fraser @ 2006-02-16 13:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel


On 16 Feb 2006, at 11:45, Jan Beulich wrote:

>> Thanks for tracking this one down: it's been our major outstanding bug
>> for a while now. We checked in a suitable fix to -unstable (change
>> pci-dma-xen.c to explicitly include the asm-i386/mach-xen version of
>> swiotlb.h).
>
> This doesn't sound like a good thing to do, as that way all but this 
> one file will include the x86-64 version of it,
> and you can easily get things out of sync (if e.g. the x86-64 version 
> changes). I would much favor the change being done
> as originally posted; we have a similar fix in our tree.

In our tree, pci-dma-xen.c is the only file that uses the core swiotlb 
functions. Since it's an i386 file linked against our xen-i386 swiotlb, 
it seems to make sense for it to include explicitly the i386 swiotlb 
header file. The best fix of course is to merge the swiotlbs: maybe by 
incrementally modifying the xen-specific one to get it closer to the 
generic swiotlb code.

  -- Keir


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-16  3:07 ` Chris Wright
  2006-02-16 11:36   ` Keir Fraser
@ 2006-02-16 13:10   ` Guillaume Thouvenin
  2006-02-16 13:55     ` Keir Fraser
  1 sibling, 1 reply; 26+ messages in thread
From: Guillaume Thouvenin @ 2006-02-16 13:10 UTC (permalink / raw)
  To: Chris Wright; +Cc: Ian Pratt, xen-devel

On Wed, 15 Feb 2006 19:07:15 -0800
Chris Wright <chrisw@sous-sol.org> wrote:
> 
> --- linux-2.6.16-rc2/include/asm-x86_64/swiotlb.h	2006-02-15 21:42:24.000000000 -0500
> +++ linux-2.6.16-rc2-xen0/include/asm-x86_64/swiotlb.h	2006-02-15 21:19:15.000000000 -0500
> @@ -38,6 +38,11 @@
>  extern void swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sg,
>  			 int nents, int direction);
>  extern int swiotlb_dma_mapping_error(dma_addr_t dma_addr);
> +extern dma_addr_t swiotlb_map_page(struct device *hwdev, struct page *page,
> +                                   unsigned long offset, size_t size,
> +                                   enum dma_data_direction direction);
> +extern void swiotlb_unmap_page(struct device *hwdev, dma_addr_t dma_address,
> +                               size_t size, enum dma_data_direction direction);
>  extern void swiotlb_free_coherent (struct device *hwdev, size_t size,
>  				   void *vaddr, dma_addr_t dma_handle);
>  extern int swiotlb_dma_supported(struct device *hwdev, u64 mask);


The patch fixes the problem of the tx hang and it also fixes another
problem on my box. With the xen unstable (changeset 8833), I couldn't
open a ssh connection on the domain 0 until I ran the xend daemon (I
don't know why running the xend daemon allows the connection). With the
patch, I can open a ssh connection as soon as the ssh daemon is running
on domain0.

Just a remark, if I enable PAE, it doesn't solve the problem of the tx
hang on my computer which is an Intel Xeon (1 CPU) with hyper-threading
enabled. I'm using a debian distribution.


thanks,
Guillaume


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-16 13:10   ` Guillaume Thouvenin
@ 2006-02-16 13:55     ` Keir Fraser
  2006-02-17  7:26       ` Guillaume Thouvenin
  0 siblings, 1 reply; 26+ messages in thread
From: Keir Fraser @ 2006-02-16 13:55 UTC (permalink / raw)
  To: Guillaume Thouvenin; +Cc: Chris Wright, Ian Pratt, xen-devel


On 16 Feb 2006, at 13:10, Guillaume Thouvenin wrote:

> Just a remark, if I enable PAE, it doesn't solve the problem of the tx
> hang on my computer which is an Intel Xeon (1 CPU) with hyper-threading
> enabled. I'm using a debian distribution.

Does this go away if you specify 'mem=2G' as a Xen boot parameter?

   -- Keir


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-16 13:55     ` Keir Fraser
@ 2006-02-17  7:26       ` Guillaume Thouvenin
  0 siblings, 0 replies; 26+ messages in thread
From: Guillaume Thouvenin @ 2006-02-17  7:26 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Chris Wright, Ian Pratt, xen-devel

On Thu, 16 Feb 2006 13:55:18 +0000
Keir Fraser <Keir.Fraser@cl.cam.ac.uk> wrote:

> 
> On 16 Feb 2006, at 13:10, Guillaume Thouvenin wrote:
> 
> > Just a remark, if I enable PAE, it doesn't solve the problem of the tx
> > hang on my computer which is an Intel Xeon (1 CPU) with hyper-threading
> > enabled. I'm using a debian distribution.
> 
> Does this go away if you specify 'mem=2G' as a Xen boot parameter?

Yes it goes away.

Guillaume


* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09  2:29 Ian Pratt
  0 siblings, 0 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-09  2:29 UTC (permalink / raw)
  To: Chris Wright; +Cc: xen-devel, adam

> > What devices are on irq 19?
> >
> > It might be worth trying booting nousb on the kernel command line (or
> > usb-handoff)
>
> 19:       5748        Phys-irq  libata, uhci_hcd:usb3
>
> with ata, that effectively killed the box.  trying with nousb, but i
> wonder if it's not evntchn problem?

Something else to try might be booting with maxcpus=1 on the xen command
line, but if you're running just a uniproc dom0 this really ought not
make any difference. 

When the box is in a bad state, it might be worth using the serial debug
keys to get some information about the ioapic and event channels.

I'm glad you've got an easy way of repro'ing this. I've just tried again
on a couple of our machines and it took me ages to trigger. 

Thanks,
Ian


* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09  1:50 Ian Pratt
  2006-02-09  1:59 ` Chris Wright
  2006-02-09  2:29 ` Chris Wright
  0 siblings, 2 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-09  1:50 UTC (permalink / raw)
  To: Chris Wright; +Cc: xen-devel, adam

> > That's interesting, but I'd be surprised if it was an swiotlb thing --
> > it looks so much more like an interrupt problem. e1000 and tg3
> > shouldn't be going anywhere near swiotlb anyhow.
> >
> > Please can you try a PAE kernel just to check you don't have the
> > problem.
>
> It's 64-bit.

Yep, but I'm wondering whether it's worth trying a PAE kernel as that
might give a datapoint to indicate whether swiotlb might be involved. 

My money is still on an interrupt problem (virtual or otherwise),
though.

Ian


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09  1:50 Ian Pratt
@ 2006-02-09  1:59 ` Chris Wright
  2006-02-09  2:29 ` Chris Wright
  1 sibling, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-09  1:59 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Chris Wright, xen-devel, adam

* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
> > > That's interesting, but I'd be surprised if it was an swiotlb thing --
> > > it looks so much more like an interrupt problem. e1000 and tg3
> > > shouldn't be going anywhere near swiotlb anyhow.
> > >
> > > Please can you try a PAE kernel just to check you don't have the
> > > problem.
> >
> > It's 64-bit.
> 
> Yep, but I'm wondering whether it's worth trying a PAE kernel as that
> might give a datapoint to indicate whether swiotlb might be involved. 

Yeah, sorry, I was confused at first.  I'm building PAE atm.

thanks,
-chris


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09  1:50 Ian Pratt
  2006-02-09  1:59 ` Chris Wright
@ 2006-02-09  2:29 ` Chris Wright
  1 sibling, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-09  2:29 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Chris Wright, xen-devel, adam

* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
> Yep, but I'm wondering whether it's worth trying a PAE kernel as that
> might give a datapoint to indicate whether swiotlb might be involved. 

OK, PAE works fine.

thanks,
-chris


* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-09  1:29 Ian Pratt
  2006-02-09  1:38 ` Chris Wright
  2006-02-09  1:49 ` Chris Wright
  0 siblings, 2 replies; 26+ messages in thread
From: Ian Pratt @ 2006-02-09  1:29 UTC (permalink / raw)
  To: Chris Wright; +Cc: xen-devel, adam

> xeon, 4 cpu (2-ht), e1000, 4G
> 
> - works fine w/ 32-bit
> - dom0 is UP (SMP fails as well)
>   - this is dom0 only, no xend, no domUs, no bridging
>   - limiting to 2G works fine, sounds like something with swiotlb

That's interesting, but I'd be surprised if it was an swiotlb thing --
it looks so much more like an interrupt problem. e1000 and tg3 shouldn't
be going anywhere near swiotlb anyhow.

Please can you try a PAE kernel just to check you don't have the
problem.
 
> Also, while it was working, I blasted with packets, and 
> eventually got:
> 
> irq 19: nobody cared (try booting with the "irqpoll" option)

What devices are on irq 19?

It might be worth trying booting nousb on the kernel command line (or
usb-handoff)

Thanks,
Ian

> Call Trace: <IRQ> <ffffffff80148508>{__report_bad_irq+56}
>        <ffffffff80148721>{note_interrupt+449} <ffffffff80147dcc>{handle_IRQ_event+76}
>        <ffffffff80147ec2>{__do_IRQ+162} <ffffffff8011077b>{do_IRQ+75}
>        <ffffffff802f16b5>{evtchn_do_upcall+117} <ffffffff8010e5f1>{do_hypervisor_callback+37}
>        <ffffffff8011ccc5>{ia32_syscall+13} <ffffffff8010a22a>{hypercall_page+554}
>        <ffffffff8010a22a>{hypercall_page+554} <ffffffff802f14de>{force_evtchn_callback+14}
>        <ffffffff80147db5>{handle_IRQ_event+53} <ffffffff80147ea8>{__do_IRQ+136}
>        <ffffffff8011077b>{do_IRQ+75} <ffffffff802f16b5>{evtchn_do_upcall+117}
>        <ffffffff8010e5f1>{do_hypervisor_callback+37} <EOI>
>        <ffffffff8011ccc5>{ia32_syscall+13}
> handlers:
> [<ffffffff80377b80>] (ata_interrupt+0x0/0x1b0)
> [<ffffffff80396570>] (usb_hcd_irq+0x0/0x70)
> Disabling IRQ #19
> 
> thanks,
> -chris
> 


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09  1:29 Ian Pratt
@ 2006-02-09  1:38 ` Chris Wright
  2006-02-09  1:49 ` Chris Wright
  1 sibling, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-09  1:38 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Chris Wright, xen-devel, adam

* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
> > xeon, 4 cpu (2-ht), e1000, 4G
> > 
> > - works fine w/ 32-bit
> > - dom0 is UP (SMP fails as well)
> >   - this is dom0 only, no xend, no domUs, no bridging
> >   - limiting to 2G works fine, sounds like something with swiotlb
> 
> That's interesting, but I'd be surprised if it was an swiotlb thing --
> it looks so much more like an interrupt problem. e1000 and tg3 shouldn't
> be going anywhere near swiotlb anyhow.
> 
> Please can you try a PAE kernel just to check you don't have the
> problem.

It's 64-bit.


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-09  1:29 Ian Pratt
  2006-02-09  1:38 ` Chris Wright
@ 2006-02-09  1:49 ` Chris Wright
  1 sibling, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-09  1:49 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Chris Wright, xen-devel, adam

* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
whoops, missed this part.

> What devices are on irq 19?
> 
> It might be worth trying booting nousb on the kernel command line (or
> usb-handoff)

19:       5748        Phys-irq  libata, uhci_hcd:usb3

with ata, that effectively killed the box.  trying with nousb, but i
wonder if it's not evntchn problem?

thanks,
-chris


* RE: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-08 20:01 Ian Pratt
  2006-02-08 20:11 ` Chris Wright
  0 siblings, 1 reply; 26+ messages in thread
From: Ian Pratt @ 2006-02-08 20:01 UTC (permalink / raw)
  To: Chris Wright, xen-devel

> This is against current x86_64 defconfig build:
> 
> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>   Tx Queue             <0>
>   TDH                  <2b>
>   TDT                  <31>
>   next_to_use          <31>
>   next_to_clean        <2b>
> buffer_info[next_to_clean]
>   time_stamp           <10004d5f2>
>   next_to_watch        <2d>
>   jiffies              <10004d7ce>
>   next_to_watch.status <0>
> 
> ... repeat until eventually ...
> 
> NETDEV WATCHDOG: eth0: transmit timed out
> 
> this is on simple scp to dom0 from external box.  after a bit 
> watchdog resets, and ping works, only to repeat itself when I 
> try to scp again

Yep, this is the bug I warned y'all about at the summit, but you asked
for the code to be checked in anyway... 

A bug shared is a bug fixed quicker? :-)

For us, this only manifests on x86_64, and arrived with the subarch xen
version of 2.6.12. Extensive inspection of the arch->subarch conversion
suggests that nothing should have changed, so this is likely a latent
bug being triggered by slight timing changes.

It sounds like it's rather easier for you to trigger than it was for us
-- we had to run xm-test several times to get it to happen. Happy
hunting, and good luck :-)

Ian


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
  2006-02-08 20:01 Ian Pratt
@ 2006-02-08 20:11 ` Chris Wright
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-08 20:11 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Chris Wright, xen-devel

* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
> Yep, this is the bug I warned y'all about at the summit, but you asked
> for the code to be checked in anyway... 

Hehe, get what you ask for...

> A bug shared is a bug fixed quicker? :-)

Let's hope ;-)

> For us, this only manifests on x86_64, and arrived with the subarch xen
> version of 2.6.12. Extensive inspection of the arch->subarch conversion
> suggests that nothing should have changed, so this is likely a latent
> bug being triggered by slight timing changes.
> 
> It sounds like it's rather easier for you to trigger than it was for us
> -- we had to run xm-test several times to get it to happen. Happy
> hunting, and good luck :-)

It's trivial for me to trigger.  I'll keep poking at it.

thanks,
-chris


* Re: x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-08 15:06 Adam Wendt
  0 siblings, 0 replies; 26+ messages in thread
From: Adam Wendt @ 2006-02-08 15:06 UTC (permalink / raw)
  To: Ian Pratt, Chris Wright, xen-devel

Ah, finally people experiencing same bug as me!

It is much much worse for me, as soon as I ping from domU network goes down,
started with the subarch change.

Adam Wendt
IPCoast, Inc.

On Wed, 8 Feb 2006 12:11 , Chris Wright <chrisw@sous-sol.org> sent:

>* Ian Pratt (m+Ian.Pratt@cl.cam.ac.uk) wrote:
>> Yep, this is the bug I warned y'all about at the summit, but you asked
>> for the code to be checked in anyway... 
>
>Hehe, get what you ask for...
>
>> A bug shared is a bug fixed quicker? :-)
>
>Let's hope ;-)
>
>> For us, this only manifests on x86_64, and arrived with the subarch xen
>> version of 2.6.12. Extensive inspection of the arch->subarch conversion
>> suggests that nothing should have changed, so this is likely a latent
>> bug being triggered by slight timing changes.
>> 
>> It sounds like it's rather easier for you to trigger than it was for us
>> -- we had to run xm-test several times to get it to happen. Happy
>> hunting, and good luck :-)
>
>It's trivial for me to trigger.  I'll keep poking at it.
>
>thanks,
>-chris
>


* x86_64 eth0 e1000_clean_tx_irq tx hang
@ 2006-02-07 20:47 Chris Wright
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wright @ 2006-02-07 20:47 UTC (permalink / raw)
  To: xen-devel

This is against current x86_64 defconfig build:

e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <2b>
  TDT                  <31>
  next_to_use          <31>
  next_to_clean        <2b>
buffer_info[next_to_clean]
  time_stamp           <10004d5f2>
  next_to_watch        <2d>
  jiffies              <10004d7ce>
  next_to_watch.status <0>

... repeat until eventually ...

NETDEV WATCHDOG: eth0: transmit timed out

This is on a simple scp to dom0 from an external box.  After a bit the
watchdog resets and ping works, only for the hang to repeat when I try
to scp again.

thanks,
-chris


Thread overview: 26+ messages
2006-02-08 23:36 x86_64 eth0 e1000_clean_tx_irq tx hang Ian Pratt
2006-02-09  1:27 ` Chris Wright
2006-02-09 15:02 ` Christian Leber
2006-02-09 17:24   ` Chris Wright
2006-02-09 23:29     ` Christian Leber
  -- strict thread matches above, loose matches on Subject: below --
2006-02-09 23:55 Kamble, Nitin A
2006-02-10 11:20 ` Keir Fraser
2006-02-10 11:33   ` Muli Ben-Yehuda
2006-02-16  3:07 ` Chris Wright
2006-02-16 11:36   ` Keir Fraser
2006-02-16 11:45     ` Jan Beulich
2006-02-16 13:54       ` Keir Fraser
2006-02-16 13:10   ` Guillaume Thouvenin
2006-02-16 13:55     ` Keir Fraser
2006-02-17  7:26       ` Guillaume Thouvenin
2006-02-09  2:29 Ian Pratt
2006-02-09  1:50 Ian Pratt
2006-02-09  1:59 ` Chris Wright
2006-02-09  2:29 ` Chris Wright
2006-02-09  1:29 Ian Pratt
2006-02-09  1:38 ` Chris Wright
2006-02-09  1:49 ` Chris Wright
2006-02-08 20:01 Ian Pratt
2006-02-08 20:11 ` Chris Wright
2006-02-08 15:06 Adam Wendt
2006-02-07 20:47 Chris Wright
