From: Or Gerlitz <ogerlitz@voltaire.com>
To: Jay Vosburgh <fubar@us.ibm.com>, Moni Shoua <monis@voltaire.com>
Cc: Roland Dreier <rolandd@cisco.com>,
	netdev@vger.kernel.org, Moni Levy <monil@voltaire.com>
Subject: bonding / 2.6.24-rc1 issues
Date: Wed, 07 Nov 2007 15:51:26 +0200
Message-ID: <4731C2DE.4000701@voltaire.com>

Jay, Moni

I ran some tests with 2.6.24-rc1 plus the first bonding patch that Jay
sent to netdev last night. Basic operation and failover work fine.
However, I see some crashes that seem to be related to destroying the
bond when the slaves are IPoIB devices. I don't see similar crashes when
enslaving Ethernet devices (Broadcom Corporation NetXtreme BCM5704
Gigabit Ethernet (rev 03)). My compressed .config is attached.

The first type of oops happens when I simply run modprobe -r bonding
after enslaving the IPoIB devices:

> Ethernet Channel Bonding Driver: v3.2.1 (October 15, 2007)
> bonding: MII link monitoring set to 100 ms
> bonding: bond0: setting mode to active-backup (1).
> bonding: bond0: Setting MII monitoring interval to 100.
> NET: Registered protocol family 10
> ADDRCONF(NETDEV_UP): bond0: link is not ready
> bonding: bond0: doing slave updates when interface is down.
> bonding: bond0: Adding slave ib0.
> bonding bond0: master_dev is not up in bond_enslave
> bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0
> bonding: bond0: enslaving ib0 as a backup interface with a down link.
> bonding: bond0: doing slave updates when interface is down.
> bonding: bond0: Adding slave ib1.
> bonding bond0: master_dev is not up in bond_enslave
> bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0
> bonding: bond0: enslaving ib1 as a backup interface with a down link.
> ADDRCONF(NETDEV_UP): bond0: link is not ready
> bonding: bond0: link status definitely up for interface ib0.
> bonding: bond0: link status definitely up for interface ib1.
> bonding: bond0: making interface ib0 the new active one.
> bonding: bond0: first active interface up!
> ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
> eth0: no IPv6 routers present
> bond0: no IPv6 routers present
> bonding: bond0: released all slaves
> Unable to handle kernel paging request at ffffffff880a07ce RIP: 
>  [<ffffffff880a07ce>]
> PGD 203067 PUD 207063 PMD 2060f067 PTE 0
> Oops: 0010 [1] SMP 
> CPU 0 
> Modules linked in: ib_ipoib ib_cm ib_sa ipv6 sg st sd_mod sr_mod scsi_mod e100 ib_mthca ib_mad ib_core i2c_amd8111 i2c_core
> Pid: 14604, comm: bond0 Not tainted 2.6.24-rc1 #1
> RIP: 0010:[<ffffffff880a07ce>]  [<ffffffff880a07ce>]
> RSP: 0018:ffff810008439e98  EFLAGS: 00010247
> RAX: ffff810004da20c0 RBX: ffff810004da20c0 RCX: ffff81000315aa68
> RDX: ffff810004da20c8 RSI: ffff810008439ef0 RDI: ffff81000315aa60
> RBP: ffffffff880a07ce R08: ffff810008438000 R09: ffff81000152d0d8
> R10: ffff810004da20c0 R11: ffff810009574000 R12: 00000000fffffffc
> R13: ffffffffffffffff R14: ffffffff8063b820 R15: 0000000000000000
> FS:  00002af0c528b0a0(0000) GS:ffffffff805d4000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: ffffffff880a07ce CR3: 000000002852f000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process bond0 (pid: 14604, threadinfo ffff810008438000, task ffff8100024970c0)
> Stack:  ffffffff802445c6 ffff810008439f08 ffff810004da20c0 ffffffff80244652
>  ffffffff8024473f 0000000000000000 ffff8100024970c0 ffffffff80248320
>  ffff810008439f08 ffff810008439f08 0000000000000000 0000000000000000
> Call Trace:
>  [<ffffffff802445c6>] run_workqueue+0x83/0x10f
>  [<ffffffff80244652>] worker_thread+0x0/0xf7
>  [<ffffffff8024473f>] worker_thread+0xed/0xf7
>  [<ffffffff80248320>] autoremove_wake_function+0x0/0x2e
>  [<ffffffff80248320>] autoremove_wake_function+0x0/0x2e
>  [<ffffffff80247fe6>] kthread+0x3d/0x63
>  [<ffffffff8020c4a8>] child_rip+0xa/0x12
>  [<ffffffff80247fa9>] kthread+0x0/0x63
>  [<ffffffff8020c49e>] child_rip+0x0/0x12
> 
> 
> Code:  Bad RIP value.
> RIP  [<ffffffff880a07ce>]
>  RSP <ffff810008439e98>
> CR2: ffffffff880a07ce

The second type of oops happens when I run modprobe -r ib_ipoib after
enslavement. I was not able to test this one with Ethernet, as the tg3
driver is built into my kernel:

> Nov  7 14:31:56 dill kernel: bonding: bond0: Setting MII monitoring interval to 100.
> Nov  7 14:31:56 dill kernel: bonding: bond0: Adding slave ib0.
> Nov  7 14:31:56 dill kernel: bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0
> Nov  7 14:31:56 dill kernel: bonding: bond0: Warning: The first slave device specified does not support setting the MAC address. Enabling the fail_over_mac option.<6>bonding: bond0: enslaving ib0 as a backup interface with a down link.
> Nov  7 14:31:56 dill kernel: bonding: bond0: Adding slave ib1.
> Nov  7 14:31:56 dill kernel: bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0
> Nov  7 14:31:56 dill kernel: bonding: bond0: enslaving ib1 as a backup interface with a down link.
> Nov  7 14:31:56 dill kernel: bonding: bond0: link status definitely up for interface ib0.
> Nov  7 14:31:56 dill kernel: bonding: bond0: making interface ib0 the new active one.
> Nov  7 14:31:56 dill kernel: bonding: bond0: first active interface up!
> Nov  7 14:31:56 dill kernel: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
> Nov  7 14:31:56 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
> Nov  7 14:31:56 dill kernel: bonding: bond0: link status definitely up for interface ib1.
> Nov  7 14:31:58 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
> Nov  7 14:32:02 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
> Nov  7 14:32:07 dill kernel: bond0: no IPv6 routers present
> Nov  7 14:32:10 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
> Nov  7 14:32:12 dill ypbind[14475]: broadcast: RPC: Timed out.
> Nov  7 14:32:18 dill kernel: ib0: cm send completion event with wrid 1073741823 (> 64)
> Nov  7 14:32:23 dill kernel: ib0: RX drain timing out
> Nov  7 14:32:23 dill kernel: bonding: bond0: Warning: the permanent HWaddr of ib0 - 80:06:04:04:fe:80 - is still in use by bond0. Set the HWaddr of ib0 to a different address to avoid conflicts.
> Nov  7 14:32:23 dill kernel: bonding: bond0: releasing active interface ib0
> Nov  7 14:32:23 dill kernel: bonding: bond0: making interface ib1 the new active one.
> Nov  7 14:32:23 dill kernel: ib1: multicast join failed for 0001:0000:ffff:0000:0000:0000:0070:5229, status -22
> Nov  7 14:32:23 dill kernel: bonding: bond0: releasing active interface ib1
> Nov  7 14:32:23 dill kernel: bonding: bond0: destroying bond bond0.
> Nov  7 14:32:23 dill kernel: __dev_addr_discard: address leakage! da_users=1
> Nov  7 14:32:23 dill kernel: Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: 
> Nov  7 14:32:23 dill kernel:  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
> Nov  7 14:32:23 dill kernel: PGD 250a067 PUD 40a4067 PMD 0 
> Nov  7 14:32:23 dill kernel: Oops: 0000 [1] SMP 
> Nov  7 14:32:23 dill kernel: CPU 1 
> Nov  7 14:32:23 dill kernel: Modules linked in: ib_ipoib ib_cm ib_sa bonding e100 ipv6 sg st sd_mod sr_mod scsi_mod ib_mthca ib_mad ib_core i2c_amd756 i2c_amd8111 i2c_core
> Nov  7 14:32:23 dill kernel: Pid: 18870, comm: modprobe Not tainted 2.6.24-rc1 #1
> Nov  7 14:32:23 dill kernel: RIP: 0010:[<ffffffff802be76f>]  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
> Nov  7 14:32:23 dill kernel: RSP: 0018:ffff8100264e3da8  EFLAGS: 00010246
> Nov  7 14:32:23 dill kernel: RAX: 0000000000000000 RBX: ffffffff88111959 RCX: 000000000000000a
> Nov  7 14:32:23 dill kernel: RDX: ffff8100264e3fd8 RSI: ffffffff88111959 RDI: 0000000000000000
> Nov  7 14:32:23 dill kernel: RBP: ffffffff88111959 R08: ffff810020705d70 R09: ffff810020012ae8
> Nov  7 14:32:23 dill kernel: R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
> Nov  7 14:32:23 dill kernel: R13: ffff810028d9e000 R14: 0000000000000006 R15: 0000000000515ab0
> Nov  7 14:32:23 dill kernel: FS:  00002adaba330720(0000) GS:ffff81002053dac0(0000) knlGS:0000000000000000
> Nov  7 14:32:23 dill kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Nov  7 14:32:23 dill kernel: CR2: 0000000000000028 CR3: 0000000001cca000 CR4: 00000000000006e0
> Nov  7 14:32:23 dill kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Nov  7 14:32:23 dill kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Nov  7 14:32:23 dill kernel: Process modprobe (pid: 18870, threadinfo ffff8100264e2000, task ffff8100264ed790)
> Nov  7 14:32:23 dill kernel: Stack:  0000000000000000 ffffffff88111959 ffffffff88117680 ffffffff802be87b
> Nov  7 14:32:23 dill kernel:  0000000000000006 0000000000000000 ffff810006578700 ffffffff802bfb83
> Nov  7 14:32:23 dill kernel:  ffff810020705d70 ffff810006578000 0000000000000000 ffffffff88107bd6
> Nov  7 14:32:24 dill kernel: Call Trace:
> Nov  7 14:32:24 dill kernel:  [<ffffffff802be87b>] sysfs_get_dirent+0x21/0x6c
> Nov  7 14:32:24 dill kernel:  [<ffffffff802bfb83>] sysfs_remove_group+0x1b/0x92
> Nov  7 14:32:24 dill kernel:  [<ffffffff88107bd6>] :bonding:bond_release_and_destroy+0x3d/0x44
> Nov  7 14:32:24 dill kernel:  [<ffffffff88107c92>] :bonding:bond_netdev_event+0xb5/0xca
> Nov  7 14:32:24 dill kernel:  [<ffffffff8046e55e>] notifier_call_chain+0x30/0x54
> Nov  7 14:32:24 dill kernel:  [<ffffffff8041845d>] unregister_netdevice+0xc3/0x15a
> Nov  7 14:32:24 dill kernel:  [<ffffffff80418505>] unregister_netdev+0x11/0x17
> Nov  7 14:32:24 dill kernel:  [<ffffffff880f2be4>] :ib_ipoib:ipoib_remove_one+0x64/0xa5
> Nov  7 14:32:24 dill kernel:  [<ffffffff88015069>] :ib_core:ib_unregister_client+0x43/0xfe
> Nov  7 14:32:24 dill kernel:  [<ffffffff880fb071>] :ib_ipoib:ipoib_cleanup_module+0xd/0x2b
> Nov  7 14:32:24 dill kernel:  [<ffffffff802557b1>] sys_delete_module+0x1b1/0x1e2
> Nov  7 14:32:24 dill kernel:  [<ffffffff80329b00>] __downgrade_write+0x5f/0xb1
> Nov  7 14:32:24 dill kernel:  [<ffffffff8026eb2e>] sys_munmap+0x4a/0x56
> Nov  7 14:32:24 dill kernel:  [<ffffffff8020b68e>] system_call+0x7e/0x83
> Nov  7 14:32:24 dill kernel: 
> Nov  7 14:32:24 dill kernel: 
> Nov  7 14:32:24 dill kernel: Code: 48 8b 5f 28 48 85 db 74 1c 48 8b 7b 18 48 89 ee e8 f6 b6 06 
> Nov  7 14:32:24 dill kernel: RIP  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
> Nov  7 14:32:24 dill kernel:  RSP <ffff8100264e3da8>
> Nov  7 14:32:24 dill kernel: CR2: 0000000000000028

The third type of oops happened after I did some failovers and then
removed both slaves from the bond using
echo -$slave > /sys/class/net/bond0/bonding/slaves

> Ethernet Channel Bonding Driver: v3.2.1 (October 15, 2007)
> bonding: MII link monitoring set to 100 ms
> bonding: bond0: setting mode to active-backup (1).
> bonding: bond0: Setting MII monitoring interval to 100.
> ADDRCONF(NETDEV_UP): bond0: link is not ready
> bonding: bond0: doing slave updates when interface is down.
> bonding: bond0: Adding slave ib0.
> bonding bond0: master_dev is not up in bond_enslave
> bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0
> bonding: bond0: enslaving ib0 as a backup interface with a down link.
> bonding: bond0: doing slave updates when interface is down.
> bonding: bond0: Adding slave ib1.
> bonding bond0: master_dev is not up in bond_enslave
> bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0
> bonding: bond0: enslaving ib1 as a backup interface with a down link.
> ADDRCONF(NETDEV_UP): bond0: link is not ready
> bonding: bond0: link status definitely up for interface ib0.
> bonding: bond0: link status definitely up for interface ib1.
> bonding: bond0: making interface ib0 the new active one.
> bonding: bond0: first active interface up!
> ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
> bond0: no IPv6 routers present
> bonding: bond0: link status definitely down for interface ib0, disabling it
> bonding: bond0: making interface ib1 the new active one.
> bonding: bond0: link status definitely up for interface ib0.
> bonding: bond0: link status definitely down for interface ib1, disabling it
> bonding: bond0: making interface ib0 the new active one.
> bonding: bond0: Removing slave ib0
> bonding: bond0: Warning: the permanent HWaddr of ib0 - 80:08:04:04:fe:80 - is still in use by bond0. Set the HWaddr of ib0 to a different address to avoid conflicts.
> bonding: bond0: releasing active interface ib0
> bonding: bond0: Removing slave ib1
> bonding: bond0: releasing backup interface ib1
> bonding: bond0: destroying bond bond0.
> Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: 
>  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
> PGD 48a0067 PUD 285f067 PMD 0 
> Oops: 0000 [1] SMP 
> CPU 1 
> Modules linked in: ib_ipoib ib_cm ib_sa bonding ipv6 sg st sd_mod sr_mod scsi_mod e100 ib_mthca ib_mad ib_core i2c_amd756 i2c_amd8111 i2c_core
> Pid: 16811, comm: bash Not tainted 2.6.24-rc1 #1
> RIP: 0010:[<ffffffff802be76f>]  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
> RSP: 0018:ffff8100049a5dd8  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffffffff880ae959 RCX: 0000000000000002
> RDX: ffff8100049a5fd8 RSI: ffffffff880ae959 RDI: 0000000000000000
> RBP: ffffffff880ae959 R08: ffff8100205f5d70 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 0000000000000000 R14: ffff8100300c7000 R15: ffff8100049a5e69
> FS:  00002afe4e9870a0(0000) GS:ffff81002053dac0(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000028 CR3: 000000000248f000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process bash (pid: 16811, threadinfo ffff8100049a4000, task ffff810001d91750)
> Stack:  0000000000000000 ffffffff880ae959 ffffffff880b4680 ffffffff802be87b
>  0000000000000006 0000000000000000 ffff81000e081700 ffffffff802bfb83
>  ffff8100205f5d70 ffff81000e081000 0000000000000000 ffffffff880a4bd6
> Call Trace:
>  [<ffffffff802be87b>] sysfs_get_dirent+0x21/0x6c
>  [<ffffffff802bfb83>] sysfs_remove_group+0x1b/0x92
>  [<ffffffff880a4bd6>] :bonding:bond_release_and_destroy+0x3d/0x44
>  [<ffffffff880aa685>] :bonding:bonding_store_slaves+0x29a/0x352
>  [<ffffffff8038a0c7>] dev_attr_store+0x1c/0x1e
>  [<ffffffff802be03d>] sysfs_write_file+0xca/0xfc
>  [<ffffffff802832fa>] vfs_write+0xae/0x130
>  [<ffffffff8028343b>] sys_write+0x45/0x6e
>  [<ffffffff8020b68e>] system_call+0x7e/0x83
> 
> 
> Code: 48 8b 5f 28 48 85 db 74 1c 48 8b 7b 18 48 89 ee e8 f6 b6 06 
> RIP  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
>  RSP <ffff8100049a5dd8>
> CR2: 0000000000000028


Here's the script I use to set up the bond and do the enslavement:
> #!/bin/bash
> 
> SLAVE_A=ib0
> SLAVE_B=ib1
> ADDR=192.168.10.118
> 
> #SLAVE_A=eth0
> #SLAVE_B=eth1
> #ADDR=172.30.10.6
> 
> /sbin/modprobe bonding
> 
> echo 1 > /sys/class/net/bond0/bonding/mode
> echo 100 > /sys/class/net/bond0/bonding/miimon
> 
> /sbin/modprobe ib_ipoib
> 
> echo +$SLAVE_A > /sys/class/net/bond0/bonding/slaves
> echo +$SLAVE_B > /sys/class/net/bond0/bonding/slaves
> 
> ifconfig bond0 $ADDR
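
For completeness, the slave removal that leads to the third oops can be
sketched as a small helper. This is only a sketch: the BOND_SYSFS
variable and the list_slaves / release_slaves function names are made up
for illustration and are not part of the script above.

```shell
#!/bin/bash
# Hypothetical teardown helpers (names and BOND_SYSFS are assumptions,
# not part of the original setup script).
BOND_SYSFS=${BOND_SYSFS:-/sys/class/net/bond0/bonding}

# Print the slaves currently listed by the bond, one per line.
list_slaves() {
    tr -s ' ' '\n' < "$BOND_SYSFS/slaves" | sed '/^$/d'
}

# Release every listed slave through sysfs, one write per slave,
# the same way as the manual "echo -$slave" commands.
release_slaves() {
    local slave
    # The slave list is read once, before any write happens.
    for slave in $(list_slaves); do
        echo "-$slave" > "$BOND_SYSFS/slaves"
    done
}
```

Calling release_slaves after the setup script should walk the same path
as the manual echo commands. The other two cases need no helper: they
are plain modprobe -r bonding and modprobe -r ib_ipoib.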

Or.


[-- Attachment #2: config-2.6.24-rc1.bz2 --]
[-- Type: application/octet-stream, Size: 7823 bytes --]
