From: Pradeep Satyanarayana <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: [Fwd: Crash in bonding]
Date: Mon, 02 Nov 2009 14:43:56 -0800
Message-ID: <4AEF60AC.6030508@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 43 bytes --]

Typo in email address. Resending.

Pradeep

[-- Attachment #2: Crash in bonding.eml --]
[-- Type: message/rfc822, Size: 5269 bytes --]

From: Pradeep Satyanarayana <pradeeps-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
To: monis-smomgflXvOZWk0Htik3J/w@public.gmane.org
Cc: EWG <Openfabrics-ewg-0P3JtQMG0aQdnm+yROfE0A@public.gmane.org>, linux-rdma-u79uwXL29TZpP82i2CBTzA@public.gmane.org, fubar-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
Subject: Crash in bonding
Date: Mon, 02 Nov 2009 14:41:36 -0800
Message-ID: <4AEF6020.7000806-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

This crash was originally reported against RHEL 5.4. However, it can be reproduced quite easily with OFED-1.5 too.
The steps to reproduce the crash are as follows:

1. Run traffic (I used ping) on the IB interfaces through the bond master
2. ifdown ib0
3. ifdown ib1
4. modprobe -r ib_ipoib
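
The steps above can be sketched as a small shell script. This is only a sketch: the PEER address and the DRY_RUN switch are assumptions added for illustration and are not part of the original report. By default the script only prints the commands it would run; set DRY_RUN=0 on a disposable test machine to execute them.

```shell
#!/bin/sh
# Reproduction sketch for the four steps listed above.
# DRY_RUN and PEER are hypothetical knobs, not part of the original report.
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print the commands (default)
PEER="${PEER:-192.168.0.2}"    # assumed address reachable through the bond master

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro() {
    # 1. Keep traffic flowing through the bond master.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ ping $PEER &"
    else
        ping "$PEER" >/dev/null 2>&1 &
    fi
    # 2. and 3. Take down both IPoIB slaves.
    run ifdown ib0
    run ifdown ib1
    # 4. Unload the IPoIB module while ping is still running.
    run modprobe -r ib_ipoib
}

repro
```

Note that ping is deliberately left running while the module is unloaded, since the crash depends on traffic still being sent at that point.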

The stack trace seen most often is as follows:

ID: 0      TASK: ffff81087fc11820  CPU: 13  COMMAND: "swapper"
 #0 [ffff81010ff07ab0] crash_kexec at ffffffff800ac5b9
 #1 [ffff81010ff07b70] __die at ffffffff80065127
 #2 [ffff81010ff07bb0] do_page_fault at ffffffff80066da7
 #3 [ffff81010ff07ca0] error_exit at ffffffff8005dde9
 #4 [ffff81010ff07d58] neigh_connected_output at ffffffff8022cb87
 #5 [ffff81010ff07d88] ip_output at ffffffff800320ac
 #6 [ffff81010ff07db8] ip_queue_xmit at ffffffff8003464d
 #7 [ffff81010ff07e78] tcp_transmit_skb at ffffffff80021d73
 #8 [ffff81010ff07ec8] tcp_retransmit_skb at ffffffff80250ccd
 #9 [ffff81010ff07f08] tcp_write_timer at ffffffff80252652
#10 [ffff81010ff07f28] run_timer_softirq at ffffffff800968be
#11 [ffff81010ff07f58] __do_softirq at ffffffff8001235a
#12 [ffff81010ff07f88] call_softirq at ffffffff8005e2fc
#13 [ffff81010ff07fa0] do_softirq at ffffffff8006cb14
#14 [ffff81010ff07fb0] apic_timer_interrupt at ffffffff8005dc8e
--- <IRQ stack> ---
#15 [ffff81010ff03e48] apic_timer_interrupt at ffffffff8005dc8e
    [exception RIP: mwait_idle+54]
    RIP: ffffffff800571f4  RSP: ffff81010ff03ef0  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 000000000000000d  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff80301698
    RBP: ffff81087fc11a10   R8: ffff81010ff02000   R9: 0000000000000032
    R10: ffff81048e0cc4f0  R11: ffff8103ebafcd18  R12: 0000000005f33f4d
    R13: 00000d12e63d7223  R14: ffff81047fe797a0  R15: ffff81087fc11820
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#16 [ffff81010ff03ef0] cpu_idle at ffffffff8004939e



I was able to set up some breakpoints, and the analysis follows.

cpu 0x1 stopped at breakpoint 0x1 (d000000000ec4214 .bond_release+0x0/0x4d0 [bonding])
        mflr    r0
enter ? for help
1:mon> t
[link register   ] d000000000ecdf80 .bonding_store_slaves+0x304/0x3f0 [bonding]
[c00000000fd97b00] d000000000ecdf70 .bonding_store_slaves+0x2f4/0x3f0 [bonding] (unreliable)
[c00000000fd97bd0] c00000000029a660 .class_device_attr_store+0x44/0x60
[c00000000fd97c40] c00000000015df9c .sysfs_write_file+0x134/0x1b8
[c00000000fd97cf0] c0000000000f8ec4 .vfs_write+0x118/0x200
[c00000000fd97d90] c0000000000f9634 .sys_write+0x4c/0x8c
[c00000000fd97e30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 000000000ff11138
SP (ffd1f300) is in userspace

I did some basic sanity checks and confirmed that we hit a couple of breakpoints, that
the bond master was indeed bond0 as expected, and that the slave device being released was ib1.
After the breakpoints, we crashed:


Faulting instruction address: 0xc00000000034bddc
cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
    pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
    lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
    sp: c0000000e025b530
   msr: 8000000000009032
   dar: d000000000c6fe58
 dsisr: 40000000
  current = 0xc0000000e25f1aa0
  paca    = 0xc00000000053e280
    pid   = 3591, comm = ping
enter ? for help
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c0000000e025b2b0]
    pc: c00000000034bddc: .neigh_resolve_output+0x28c/0x34c
    lr: c00000000034bdc0: .neigh_resolve_output+0x270/0x34c
    sp: c0000000e025b530
   msr: 8000000000009032
   dar: d000000000c6fe58
 dsisr: 40000000
  current = 0xc0000000e25f1aa0
  paca    = 0xc00000000053e280
    pid   = 3591, comm = ping
1:mon> t
[c0000000e025b5e0] c000000000376934 .ip_output+0x358/0x3c0
[c0000000e025b670] c000000000374a04 .ip_push_pending_frames+0x440/0x558
[c0000000e025b720] c000000000397f10 .raw_sendmsg+0x770/0x860
[c0000000e025b860] c0000000003a24f8 .inet_sendmsg+0x7c/0xa8
[c0000000e025b900] c00000000033031c .sock_sendmsg+0x114/0x1b8
[c0000000e025bb00] c000000000331878 .sys_sendmsg+0x218/0x2ac
[c0000000e025bd20] c000000000356314 .compat_sys_sendmsg+0x14/0x28
[c0000000e025bd90] c000000000357914 .compat_sys_socketcall+0x1e4/0x214
[c0000000e025be30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0000000007f03c98
SP (ffb6e570) is in userspace
1:mon>

I looked at the skb and confirmed that it was indeed being sent on bond0.

One thing is apparent at this point: ping is still transmitting even though bond_release()
for ib1 (and, of course, ib0) ran long before.

This is the reason for the crash. Any suggestions as to how to fix this?

Pradeep



