* IB softirq race
@ 2012-08-10 13:03 Sebastian Riemer
From: Sebastian Riemer @ 2012-08-10 13:03 UTC
  To: Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz,
	ri-EIkl63zCoXaH+58JC4qpiA@public.gmane.org

Hi Roland,

We've got a gateway machine that is connected to the internet via
Ethernet and to our cloud infrastructure, which provides KVM VMs, via IB.
There must have been a race with softirqs. We've got a custom kernel
module ("xt_ETHOIP6_gw") which handles the Ethernet<>IB traffic.

The trace looks as if our module, together with the tun driver, caused
the kernel warning. What does the "(O)" mean?

Should we look at our kernel module for better locking? What are the
common data structures with which races can occur? Something with the
connection manager?

Cheers,
Sebastian


WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x22c/0x240()
Hardware name: H8DGU
NETDEV WATCHDOG: ib0 (mlx4_core): transmit queue 0 timed out
Modules linked in: ipt_LOG xt_ETHOIP6_gw(O) ip6table_mangle
iptable_mangle ip6table_filter ip6_tables tun(O) bridge stp llc rdma_ucm
rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ib_umad ib_qib
mlx4_ib xt_multiport iptable_filter ip_tables x_tables ib_mthca ib_mad
ib_core kvm_amd kvm psmouse tpm_tis tpm tpm_bios amd64_edac_mod
i2c_piix4 edac_core serio_raw evdev edac_mce_amd button processor
thermal_sys mlx4_en sg usb_storage mlx4_core ixgbe dca mdio [last
unloaded: scsi_wait_scan]
Pid: 3, comm: ksoftirqd/0 Tainted: G O 3.2.8-gw #1
Call Trace:
[<ffffffff81047dbb>] ? warn_slowpath_common+0x7b/0xc0
[<ffffffff81047eb5>] ? warn_slowpath_fmt+0x45/0x50
[<ffffffff81058333>] ? mod_timer+0x153/0x2a0
[<ffffffff81584bec>] ? dev_watchdog+0x22c/0x240
[<ffffffff810572a8>] ? run_timer_softirq+0x158/0x360
[<ffffffff815849c0>] ? __netdev_watchdog_up+0x70/0x70
[<ffffffff8168dc8a>] ? __schedule+0x2ea/0x7e0
[<ffffffff8104e481>] ? __do_softirq+0xb1/0x1e0
[<ffffffff8104e661>] ? run_ksoftirqd+0xb1/0x160
[<ffffffff8104e5b0>] ? __do_softirq+0x1e0/0x1e0
[<ffffffff8104e5b0>] ? __do_softirq+0x1e0/0x1e0
[<ffffffff81069176>] ? kthread+0x96/0xa0
[<ffffffff816997b4>] ? kernel_thread_helper+0x4/0x10
[<ffffffff810690e0>] ? kthread_worker_fn+0x180/0x180
[<ffffffff816997b0>] ? gs_change+0x13/0x13
---[ end trace a4ac921bb1a9d647 ]---
ib0: transmit timeout: latency 1770 msecs
ib0: queue stopped 1, tx_head 39614, tx_tail 39614


-- 
Sebastian Riemer
Linux Kernel Developer

ProfitBricks GmbH • Greifswalder Str. 207 • 10405 Berlin, Germany
www.profitbricks.com • sebastian.riemer-EIkl63zCoXaH+58JC4qpiA@public.gmane.org
Tel.: +49 - 30 - 60 98 56 991 - 915

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Andreas Gauger, Achim Weiss

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: IB softirq race
@ 2012-08-10 20:51   ` Roland Dreier
From: Roland Dreier @ 2012-08-10 20:51 UTC
  To: Sebastian Riemer
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz,
	ri-EIkl63zCoXaH+58JC4qpiA@public.gmane.org

On Fri, Aug 10, 2012 at 6:03 AM, Sebastian Riemer
<sebastian.riemer-EIkl63zCoXaH+58JC4qpiA@public.gmane.org> wrote:
> we've got a gateway machine which is connected to the internet via
> ethernet and is connected with our KVM VMs-providing cloud
> infrastructure via IB.
> There must have been a race with softirqs. We've got a custom kernel
> module ("xt_ETHOIP6_gw") which handles the Ethernet<>IB.

Not sure why you are assuming a race.  The dev_watchdog warning means that
the netdev has been stuck with a full transmit queue for a long time
(timescale of seconds).  Usually this means completions aren't happening
or aren't being reaped for some reason.
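
As a rough illustration of the condition described above, here is a simplified user-space model in Python of the per-queue check that dev_watchdog performs: the queue is stopped, and no transmit has started within the watchdog timeout. The names `TxQueue`, `timed_out` and `watchdog_timeo_ms` are hypothetical and only loosely mirror the kernel's; the real check works in jiffies and lives in net/sched/sch_generic.c. The 1000 ms timeout is an assumption for the example, not a value taken from the trace.

```python
from dataclasses import dataclass


@dataclass
class TxQueue:
    stopped: bool        # driver has stopped the queue (no room to transmit)
    trans_start_ms: int  # time the last transmit was started, in ms


def timed_out(q: TxQueue, now_ms: int, watchdog_timeo_ms: int) -> bool:
    """Return True if the queue is stopped and idle for longer than the timeout."""
    return q.stopped and (now_ms - q.trans_start_ms) > watchdog_timeo_ms


# A stopped queue whose last transmit started 1770 ms ago (the latency
# reported in the log), checked against an assumed 1000 ms timeout.
stuck = TxQueue(stopped=True, trans_start_ms=0)
print(timed_out(stuck, now_ms=1770, watchdog_timeo_ms=1000))  # True
```

A queue that is running, or one that is stopped only briefly, does not trip the check; the warning needs both conditions at once, which is why it points at missing or unreaped completions rather than at a short-lived race.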

Without really looking at your module code, it's hard to guess what the issue
might be (and indeed you may just be triggering some existing bug).

> The trace looks like that it caused the kernel trace together with the
> tun driver. What does the "(O)" mean?

"(O)" means the module with the flag is an out-of-tree module.
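
To pick out the flagged modules from a "Modules linked in:" line, a small Python helper along these lines could be used. The function name `flagged_modules` is hypothetical, and the regular expression assumes the format seen in the trace above: a module name directly followed by capital letters in parentheses.

```python
import re


def flagged_modules(line: str) -> dict:
    """Map module name -> flag letters for entries printed as name(FLAGS)."""
    return dict(re.findall(r"(\w+)\(([A-Z]+)\)", line))


# Excerpt of the module list from the trace in this thread.
modules = (
    "ipt_LOG xt_ETHOIP6_gw(O) ip6table_mangle iptable_mangle "
    "ip6table_filter ip6_tables tun(O) bridge stp llc"
)
print(flagged_modules(modules))  # {'xt_ETHOIP6_gw': 'O', 'tun': 'O'}
```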

 - R.
