Netdev List
* Re: [RFC PATCH] ppp: add support for L2 multihop / tunnel switching
From: James Chapman @ 2012-07-10  9:32 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: netdev, linux-ppp
In-Reply-To: <20120709141511.GL19462@kvack.org>

On 09/07/12 15:15, Benjamin LaHaise wrote:
> On Mon, Jul 09, 2012 at 12:52:15PM +0100, James Chapman wrote:
>> As a mechanism for switching PPP interfaces together, this patch is
>> good. For L2TP though, I prefer an approach that would be applicable for
>> all L2TP traffic types, not just PPP.
> 
> *nod*  This seems like a reasonable consideration.
> 
>> L2TP supports many different pseudowire types, and this patch will only
>> be useful for tunnel switching between PPP pseudowires. Whereas if we
>> implement it within the L2TP core, rather than in the PPP code, we would
>> get switching between all pseudowire types. If we add this patch and
>> then subsequently add switching between other pseudowires in the L2TP
>> core (which we're likely to want to do), then we're left with two
>> different interfaces for doing L2TP tunnel switching in the kernel.
> 
> At least for ethernet pseudowires, it can already be implemented by using 
> an ethernet bridge device.  Besides PPP and ethernet pseudowires, what 
> other types are supported at present by the L2TP core?

Only those two at the moment, but others (ATM etc) can be added if and
when there is demand. To do this at an L2TP level avoids using two
linked PPP interfaces in the case of PPP and two bridged l2tpeth
interfaces in the case of ethernet. I envisage a new L2TP netlink API to
join the datapaths of two L2TP sessions together with no devices being
needed. It would work for all L2TP session types, now and in the future.

>>> The reasoning behind using dev_queue_xmit() rather than outputting directly 
>>> to another PPP channel is to enable the use of the traffic shaping and 
>>> queuing features of the kernel on multihop sessions.
>>
>> I'm not sure about using a pseudo packet type to do this. For L2TP, it
>> would seem better to add netfilter/tc support for L2TP data packets,
>> which would let people add rules for, say, traffic in L2TP tunnel x /
>> session y. This would avoid the need for ETH_P_PPP and you could then
>> output directly to the ppp channel.
> 
> The downside of an L2TP specific method is that all the mechanisms need to 
> be duplicated, resulting in a much higher maintenance overhead for the 
> code and functionality, not to mention all the tool changes to go along 
> with that.

Could the same argument be applied to other protocols which have
netfilter/tc support already? Adding support for L2TP would seem
consistent with other protocol implementations. It would also mean that
the same rules would work for all L2TP session types.

> As for the pseudo packet type, it may indeed be better to avoid the pseudo 
> packet type for known PPP packet types.  One of the benefits of going the 
> network device route is that it makes it much easier to implement additional 
> functionality like lawful intercept, which would be yet more functionality 
> that would have to be implemented if the mechanism is L2TP specific.  The 
> pseudo packet type would still be needed for forwarding PPP frames that the 
> kernel doesn't know about (all the *CP packet types and MLPPP come to mind)
> 
> I had thought about doing the packet forwarding in a manner similar to the 
> bridging code -- that is, as a pseudowire bridge in the network core that 
> only works between 2 devices.  That approach might work better for L2TP, as 
> it would be able to pass packets of any type between the 2 endpoints.

For L2TP, I think it should be possible to avoid having devices for
switched L2TP sessions.

> 
> 		-ben
> 
-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



^ permalink raw reply

* Re: [PATCH] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Gao feng @ 2012-07-10  9:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: nhorman, davem, linux-kernel, netdev, lizefan, tj, Eric Dumazet
In-Reply-To: <1341911707.3265.4603.camel@edumazet-glaptop>

On 2012-07-10 17:15, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 16:53 +0800, Gao feng wrote:
>>> Hi Gao
>>>
>>> Is it still needed to call update_netdev_tables() from write_priomap() ?
>>>
>>
>> Yes, I think it's needed,because read_priomap will show all of the net devices,
>>
>> But we may add the netdev after create a netprio cgroup, so the new added netdev's
>> priomap will not be allocated. if we don't call update_netdev_tables in write_priomap,
>> we may access this unallocated memory.
>>
> 
> I realize my question was not clear.
> 
> If we write in write_priomap() a field of a single netdevice,
> why should we allocate memory for all netdevices on the machine ?
> 
> So the question was : Do we really need to call
> update_netdev_tables(alldevs), instead of extend_netdev_table(dev)
> 
> 

I get it.

You are right. Indeed, we only need to call extend_netdev_table
for the netdev which we want to change.

I also read commit f5c38208d32412d72b97a4f0d44af0eb39feb20b
and found out why we need delayed allocation.

I will send a v2 patch.

Thanks!

^ permalink raw reply

* [PATCH iproute2] tc: u32: Fix icmp_code off.
From: Hiroaki SHIMODA @ 2012-07-10  9:53 UTC (permalink / raw)
  To: shemminger; +Cc: netdev

The offset of icmp_code is 21, not 20. Also, offmask should be 0 unless
nexthdr+ is specified.

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
---
 tc/f_u32.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tc/f_u32.c b/tc/f_u32.c
index 975c0b5..7a04634 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -531,7 +531,7 @@ static int parse_ip(int *argc_p, char ***argv_p, struct tc_u32_sel *sel)
 		res = parse_u8(&argc, &argv, sel, 20, 0);
 	} else if (strcmp(*argv, "icmp_code") == 0) {
 		NEXT_ARG();
-		res = parse_u8(&argc, &argv, sel, 20, 1);
+		res = parse_u8(&argc, &argv, sel, 21, 0);
 	} else
 		return -1;
 
-- 
1.7.8.6
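
For readers following the offset argument, here is a minimal sketch of where
the ICMP type and code bytes sit when counted from the start of the IP header.
It assumes an IPv4 header without options (20 bytes); the struct and helper
names (toy_icmphdr, icmp_type_off, icmp_code_off, read_u8) are hypothetical
and are not part of iproute2. It only illustrates the offset part of the fix,
not the offmask handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ICMP header layout for illustration only. */
struct toy_icmphdr {
	uint8_t  type;      /* ICMP type, first byte of the ICMP header */
	uint8_t  code;      /* ICMP code, second byte of the ICMP header */
	uint16_t checksum;
};

/* Assumption: IPv4 header with IHL == 5, i.e. no IP options. */
enum { TOY_IPV4_HDR_LEN = 20 };

/* Byte offset of the ICMP type, counted from the start of the IP header. */
static size_t icmp_type_off(void)
{
	return TOY_IPV4_HDR_LEN + offsetof(struct toy_icmphdr, type);
}

/* Byte offset of the ICMP code, counted from the start of the IP header. */
static size_t icmp_code_off(void)
{
	return TOY_IPV4_HDR_LEN + offsetof(struct toy_icmphdr, code);
}

/* Read a single byte of a raw packet at the given offset. */
static uint8_t read_u8(const uint8_t *pkt, size_t len, size_t off)
{
	assert(off < len);
	return pkt[off];
}
```

Under these assumptions the type is at byte 20 and the code at byte 21, which
is consistent with the one-line change in the patch.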

^ permalink raw reply related

* [PATCH] bridge: fix endian
From: roy.qing.li @ 2012-07-10  9:56 UTC (permalink / raw)
  To: netdev; +Cc: yoshfuji

From: Li RongQing <roy.qing.li@gmail.com>

mld->mld_maxdelay is in network byte order, so we should use ntohs(), not htons().

CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 net/bridge/br_multicast.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index b665812..2d9a066 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1160,7 +1160,7 @@ static int br_ip6_multicast_query(struct net_bridge *br,
 			goto out;
 		}
 		mld = (struct mld_msg *) icmp6_hdr(skb);
-		max_delay = msecs_to_jiffies(htons(mld->mld_maxdelay));
+		max_delay = msecs_to_jiffies(ntohs(mld->mld_maxdelay));
 		if (max_delay)
 			group = &mld->mld_mca;
 	} else if (skb->len >= sizeof(*mld2q)) {
-- 
1.7.1
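
As a side note on the direction of the conversion, the following userspace
sketch decodes and encodes a 16-bit network-order field. The helper names
(read_be16_field, write_be16_field) are hypothetical and only stand in for
the kernel code. On typical platforms htons() and ntohs() perform the same
operation (a byte swap on little-endian, nothing on big-endian), so a change
like the one above mainly states the intent correctly rather than changing
the computed value.

```c
#include <arpa/inet.h>  /* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode a 16-bit field that is stored on the wire in network byte order. */
static uint16_t read_be16_field(const uint8_t *wire)
{
	uint16_t net;

	assert(wire != NULL);
	memcpy(&net, wire, sizeof(net)); /* still in network byte order */
	return ntohs(net);               /* network -> host */
}

/* Encode a host-order value into a 16-bit network-order field. */
static void write_be16_field(uint8_t *wire, uint16_t host)
{
	uint16_t net = htons(host);      /* host -> network */

	assert(wire != NULL);
	memcpy(wire, &net, sizeof(net));
}
```

Reading a field taken from a received packet calls for ntohs(); htons() is
the matching call when filling in a field that is about to be sent.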

^ permalink raw reply related

* getting warn once around skb_try_coalesce
From: Or Gerlitz @ 2012-07-10  9:54 UTC (permalink / raw)
  To: David Miller, Eric Dumazet
  Cc: netdev@vger.kernel.org, Shlomo Pongratz, Erez Shitrit

Hi Dave, Eric,

Another trace that I see here with net-next is this one-time warning. I
always get it on the passive side of TCP. It seems related to GRO, and it
happens only with IPoIB, not with mlx4_en and igb (when igb gets to work
on net-next...)

The latest commit in this area is bad43ca8325f493dcaa0896c2f036276af059c7e
"net: introduce skb_try_coalesce()" from Eric.

Or.

-----------[ cut here ]------------
WARNING: at net/core/skbuff.c:3413 skb_try_coalesce+0x1f8/0x31d()
Hardware name: X7DWU
Modules linked in: drbd lru_cache cn autofs4 sunrpc 8021q ib_ipoib 
rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa 
dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath uinput 
mlx4_ib ib_mad ib_core mlx4_en mlx4_core igb sg joydev kvm microcode 
pcspkr rng_core ioatdma dm_mod dca floppy shpchp button sr_mod ext3 jbd 
usb_storage sd_mod ata_piix libata scsi_mod ehci_hcd uhci_hcd [last 
unloaded: scsi_wait_scan]
Pid: 0, comm: swapper/1 Tainted: G          I 
3.5.0-rc1-00107-gf5bae8a-dirty #57
Call Trace:
  <IRQ>  [<ffffffff8102ab65>] warn_slowpath_common+0x80/0x98
  [<ffffffff8102ab92>] warn_slowpath_null+0x15/0x17
  [<ffffffff812c5a73>] skb_try_coalesce+0x1f8/0x31d
  [<ffffffff8130a6ad>] tcp_try_coalesce+0x4c/0xa0
  [<ffffffff8130a759>] tcp_queue_rcv+0x58/0xe1
  [<ffffffff8130d4ca>] tcp_data_queue+0x1bd/0xa8d
  [<ffffffff8130ecba>] tcp_rcv_established+0x646/0x6fc
  [<ffffffff81314fd7>] ? tcp_v4_rcv+0x427/0xa1b
  [<ffffffff81314892>] tcp_v4_do_rcv+0xd8/0x3f6
  [<ffffffff8136aefb>] ? _raw_spin_lock_nested+0x41/0x48
  [<ffffffff813151a5>] tcp_v4_rcv+0x5f5/0xa1b
  [<ffffffff812f8626>] ip_local_deliver_finish+0x1a1/0x2b2
  [<ffffffff812f84ba>] ? ip_local_deliver_finish+0x35/0x2b2
  [<ffffffff812f87a9>] ip_local_deliver+0x72/0x79
  [<ffffffff812f820d>] ip_rcv_finish+0x399/0x3b1
  [<ffffffff812f845f>] ip_rcv+0x23a/0x260
  [<ffffffff812cd086>] __netif_receive_skb+0x3b2/0x41b
  [<ffffffff812cce0e>] ? __netif_receive_skb+0x13a/0x41b
  [<ffffffff812ce93c>] netif_receive_skb+0xee/0xf7
  [<ffffffff81322512>] ? inet_compat_ioctl+0x1e/0x1e
  [<ffffffff812ceb90>] napi_gro_complete+0x133/0x140
  [<ffffffff812ceaab>] ? napi_gro_complete+0x4e/0x140
  [<ffffffff812ced3d>] dev_gro_receive+0x1a0/0x2fb
  [<ffffffff812cec19>] ? dev_gro_receive+0x7c/0x2fb
  [<ffffffff812cf1c5>] napi_gro_receive+0x105/0x11e
  [<ffffffffa02ed6d4>] ipoib_ib_handle_rx_wc+0x243/0x277 [ib_ipoib]
  [<ffffffffa02ee84e>] ipoib_poll+0xa9/0x12d [ib_ipoib]
  [<ffffffff812cf355>] net_rx_action+0xc1/0x1ee
  [<ffffffff81031e4a>] __do_softirq+0xff/0x1de
  [<ffffffff813735cc>] call_softirq+0x1c/0x30
  [<ffffffff81003174>] do_softirq+0x38/0x80
  [<ffffffff81031b23>] irq_exit+0x4e/0x83
  [<ffffffff810029dd>] do_IRQ+0x98/0xaf
  [<ffffffff8136b92c>] common_interrupt+0x6c/0x6c
  <EOI>  [<ffffffff8100850c>] ? mwait_idle+0x13c/0x208
  [<ffffffff81008503>] ? mwait_idle+0x133/0x208
  [<ffffffff810089f1>] cpu_idle+0x6e/0xab
  [<ffffffff81363763>] start_secondary+0x1b9/0x1bd
---[ end trace fdf1b0e917b37732 ]---

^ permalink raw reply

* Re: [PATCH] bridge: fix endian
From: devendra.aaru @ 2012-07-10 10:04 UTC (permalink / raw)
  To: roy.qing.li; +Cc: netdev, yoshfuji
In-Reply-To: <1341914172-22075-1-git-send-email-roy.qing.li@gmail.com>

As you are making the same change to the drivers in drivers/net/***, I
think a patchset would be better.

But that's just up to you. ;-)

Thanks,

On Tue, Jul 10, 2012 at 3:26 PM,  <roy.qing.li@gmail.com> wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
>
> mld->mld_maxdelay is net endian, so we should use ntohs, not htons
>
> CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
> ---
>  net/bridge/br_multicast.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
> index b665812..2d9a066 100644
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -1160,7 +1160,7 @@ static int br_ip6_multicast_query(struct net_bridge *br,
>                         goto out;
>                 }
>                 mld = (struct mld_msg *) icmp6_hdr(skb);
> -               max_delay = msecs_to_jiffies(htons(mld->mld_maxdelay));
> +               max_delay = msecs_to_jiffies(ntohs(mld->mld_maxdelay));
>                 if (max_delay)
>                         group = &mld->mld_mca;
>         } else if (skb->len >= sizeof(*mld2q)) {
> --
> 1.7.1
>

^ permalink raw reply

* Re: getting warn once around skb_try_coalesce
From: Eric Dumazet @ 2012-07-10 10:18 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, netdev@vger.kernel.org, Shlomo Pongratz,
	Erez Shitrit
In-Reply-To: <4FFBFBD2.6030004@mellanox.com>

On Tue, 2012-07-10 at 12:54 +0300, Or Gerlitz wrote:
> Hi Dave, Eric,
> 
> Another trace that I see here with net-next is this one-time warning. I 
> get it always
> on the passive side of TCP, something that seems related to GRO, it 
> happens only with
> IPoIB, not with mlx4_en and igb (when igb get to work on net-next...)
> 
> The latest commit in this area is bad43ca8325f493dcaa0896c2f036276af059c7e
> "net: introduce skb_try_coalesce()" from Eric.
> 
> Or.
> 
> -----------[ cut here ]------------
> WARNING: at net/core/skbuff.c:3413 skb_try_coalesce+0x1f8/0x31d()

This warning catches skb truesize offenders; most probably it's a driver
issue.

^ permalink raw reply

* Re: [PATCH] ipvs: fix oops on NAT reply in br_nf context
From: Lin Ming @ 2012-07-10 10:24 UTC (permalink / raw)
  To: Simon Horman
  Cc: Julian Anastasov, Massimo Cetra, Eric Dumazet, David S. Miller,
	netdev
In-Reply-To: <20120710085145.GA10014@verge.net.au>

On Tue, Jul 10, 2012 at 4:51 PM, Simon Horman <horms@verge.net.au> wrote:
> On Sat, Jul 07, 2012 at 06:26:10PM +0800, Lin Ming wrote:
>> IPVS should not reset skb->nf_bridge in FORWARD hook
>> by calling nf_reset for NAT replies. It triggers oops in
>> br_nf_forward_finish.
>>
>> [  579.781508] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
>> [  579.781669] IP: [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112
>> [  579.781792] PGD 218f9067 PUD 0
>> [  579.781865] Oops: 0000 [#1] SMP
>> [  579.781945] CPU 0
>> [  579.781983] Modules linked in:
>> [  579.782047]
>> [  579.782080]
>> [  579.782114] Pid: 4644, comm: qemu Tainted: G        W    3.5.0-rc5-00006-g95e69f9 #282 Hewlett-Packard  /30E8
>> [  579.782300] RIP: 0010:[<ffffffff817b1ca5>]  [<ffffffff817b1ca5>] br_nf_forward_finish+0x58/0x112
>> [  579.782455] RSP: 0018:ffff88007b003a98  EFLAGS: 00010287
>> [  579.782541] RAX: 0000000000000008 RBX: ffff8800762ead00 RCX: 000000000001670a
>> [  579.782653] RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff8800762ead00
>> [  579.782845] RBP: ffff88007b003ac8 R08: 0000000000016630 R09: ffff88007b003a90
>> [  579.782957] R10: ffff88007b0038e8 R11: ffff88002da37540 R12: ffff88002da01a02
>> [  579.783066] R13: ffff88002da01a80 R14: ffff88002d83c000 R15: ffff88002d82a000
>> [  579.783177] FS:  0000000000000000(0000) GS:ffff88007b000000(0063) knlGS:00000000f62d1b70
>> [  579.783306] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
>> [  579.783395] CR2: 0000000000000004 CR3: 00000000218fe000 CR4: 00000000000027f0
>> [  579.783505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [  579.783684] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [  579.783795] Process qemu (pid: 4644, threadinfo ffff880021b20000, task ffff880021aba760)
>> [  579.783919] Stack:
>> [  579.783959]  ffff88007693cedc ffff8800762ead00 ffff88002da01a02 ffff8800762ead00
>> [  579.784110]  ffff88002da01a02 ffff88002da01a80 ffff88007b003b18 ffffffff817b26c7
>> [  579.784260]  ffff880080000000 ffffffff81ef59f0 ffff8800762ead00 ffffffff81ef58b0
>> [  579.784477] Call Trace:
>> [  579.784523]  <IRQ>
>> [  579.784562]
>> [  579.784603]  [<ffffffff817b26c7>] br_nf_forward_ip+0x275/0x2c8
>> [  579.784707]  [<ffffffff81704b58>] nf_iterate+0x47/0x7d
>> [  579.784797]  [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae
>> [  579.784906]  [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102
>> [  579.784995]  [<ffffffff817ac32e>] ? br_dev_queue_push_xmit+0xae/0xae
>> [  579.785175]  [<ffffffff8187fa95>] ? _raw_write_unlock_bh+0x19/0x1b
>> [  579.785179]  [<ffffffff817ac417>] __br_forward+0x97/0xa2
>> [  579.785179]  [<ffffffff817ad366>] br_handle_frame_finish+0x1a6/0x257
>> [  579.785179]  [<ffffffff817b2386>] br_nf_pre_routing_finish+0x26d/0x2cb
>> [  579.785179]  [<ffffffff817b2cf0>] br_nf_pre_routing+0x55d/0x5c1
>> [  579.785179]  [<ffffffff81704b58>] nf_iterate+0x47/0x7d
>> [  579.785179]  [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44
>> [  579.785179]  [<ffffffff81704bfb>] nf_hook_slow+0x6d/0x102
>> [  579.785179]  [<ffffffff817ad1c0>] ? br_handle_local_finish+0x44/0x44
>> [  579.785179]  [<ffffffff81551525>] ? sky2_poll+0xb35/0xb54
>> [  579.785179]  [<ffffffff817ad62a>] br_handle_frame+0x213/0x229
>> [  579.785179]  [<ffffffff817ad417>] ? br_handle_frame_finish+0x257/0x257
>> [  579.785179]  [<ffffffff816e3b47>] __netif_receive_skb+0x2b4/0x3f1
>> [  579.785179]  [<ffffffff816e69fc>] process_backlog+0x99/0x1e2
>> [  579.785179]  [<ffffffff816e6800>] net_rx_action+0xdf/0x242
>> [  579.785179]  [<ffffffff8107e8a8>] __do_softirq+0xc1/0x1e0
>> [  579.785179]  [<ffffffff8135a5ba>] ? trace_hardirqs_off_thunk+0x3a/0x6c
>> [  579.785179]  [<ffffffff8188812c>] call_softirq+0x1c/0x30
>>
>> The steps to reproduce as follow,
>>
>> 1. On Host1, setup brige br0(192.168.1.106)
>> 2. Boot a kvm guest(192.168.1.105) on Host1 and start httpd
>> 3. Start IPVS service on Host1
>>    ipvsadm -A -t 192.168.1.106:80 -s rr
>>    ipvsadm -a -t 192.168.1.106:80 -r 192.168.1.105:80 -m
>> 4. Run apache benchmark on Host2(192.168.1.101)
>>    ab -n 1000 http://192.168.1.106/
>>
>> ip_vs_reply4
>>   ip_vs_out
>>     handle_response
>>       ip_vs_notrack
>>         nf_reset()
>>         {
>>           skb->nf_bridge = NULL;
>>         }
>>
>> Actually, IPVS wants in this case just to replace nfct
>> with untracked version. So replace the nf_reset(skb) call
>> in ip_vs_notrack() with a nf_conntrack_put(skb->nfct) call.
>>
>> Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
>> Signed-off-by: Julian Anastasov <ja@ssi.bg>
>
> Actually, I'll queue up this version for 3.5 rather than the previous one
> as it has a better title.

Yes, thanks.

>
> As per my previous comment (repeated here for reference) it seems to me
> that this problem has been present since 2.6.37 and thus is stable material.
>

^ permalink raw reply

* [PATCH v2] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Gao feng @ 2012-07-10 10:44 UTC (permalink / raw)
  To: eric.dumazet
  Cc: nhorman, linux-kernel, netdev, lizefan, tj, Gao feng,
	Eric Dumazet

There are some out-of-bounds accesses in the netprio cgroup.

Currently, before accessing the dev->priomap.priomap array, we only check
whether dev->priomap exists. Because we don't want additional bounds
checks in the fast path, we should make sure that either dev->priomap is
NULL or the size of the dev->priomap.priomap array is equal to
max_prioidx + 1.

It is also not necessary to call update_netdev_tables() in write_priomap();
we only need to allocate the priomap of the net device that is being changed
through net_prio.ifpriomap.

This patch adds a return value to update_netdev_tables() and
extend_netdev_table(), so that when allocating new_priomap fails,
write_priomap() stops accessing the priomap and returns -ENOMEM to
userspace to tell the user what happened.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 net/core/netprio_cgroup.c |   50 +++++++++++++++++++++++++++++++-------------
 1 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index aa907ed..ab59221 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -65,7 +65,7 @@ static void put_prioidx(u32 idx)
 	spin_unlock_irqrestore(&prioidx_map_lock, flags);
 }
 
-static void extend_netdev_table(struct net_device *dev, u32 new_len)
+static int extend_netdev_table(struct net_device *dev, u32 new_len)
 {
 	size_t new_size = sizeof(struct netprio_map) +
 			   ((sizeof(u32) * new_len));
@@ -77,7 +77,7 @@ static void extend_netdev_table(struct net_device *dev, u32 new_len)
 
 	if (!new_priomap) {
 		pr_warn("Unable to alloc new priomap!\n");
-		return;
+		return -ENOMEM;
 	}
 
 	for (i = 0;
@@ -90,10 +90,12 @@ static void extend_netdev_table(struct net_device *dev, u32 new_len)
 	rcu_assign_pointer(dev->priomap, new_priomap);
 	if (old_priomap)
 		kfree_rcu(old_priomap, rcu);
+	return 0;
 }
 
-static void update_netdev_tables(void)
+static int update_netdev_tables(void)
 {
+	int ret = 0;
 	struct net_device *dev;
 	u32 max_len = atomic_read(&max_prioidx) + 1;
 	struct netprio_map *map;
@@ -101,35 +103,49 @@ static void update_netdev_tables(void)
 	rtnl_lock();
 	for_each_netdev(&init_net, dev) {
 		map = rtnl_dereference(dev->priomap);
-		if ((!map) ||
-		    (map->priomap_len < max_len))
-			extend_netdev_table(dev, max_len);
+		/*
+		 * don't allocate priomap if we didn't
+		 * change net_prio.ifpriomap,this will
+		 * speed up skb_update_prio.
+		 */
+		if (map) {
+			ret = extend_netdev_table(dev, max_len);
+			if (ret < 0)
+				break;
+		}
 	}
 	rtnl_unlock();
+	return ret;
 }
 
 static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
 {
 	struct cgroup_netprio_state *cs;
-	int ret;
+	int ret = -EINVAL;
 
 	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
 	if (!cs)
 		return ERR_PTR(-ENOMEM);
 
-	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
-		kfree(cs);
-		return ERR_PTR(-EINVAL);
-	}
+	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
+		goto out;
 
 	ret = get_prioidx(&cs->prioidx);
-	if (ret != 0) {
+	if (ret < 0) {
 		pr_warn("No space in priority index array\n");
-		kfree(cs);
-		return ERR_PTR(ret);
+		goto out;
+	}
+
+	ret = update_netdev_tables();
+	if (ret < 0) {
+		put_prioidx(cs->prioidx);
+		goto out;
 	}
 
 	return &cs->css;
+out:
+	kfree(cs);
+	return ERR_PTR(ret);
 }
 
 static void cgrp_destroy(struct cgroup *cgrp)
@@ -179,6 +195,7 @@ static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
 	char *devname = kstrdup(buffer, GFP_KERNEL);
 	int ret = -EINVAL;
 	u32 prioidx = cgrp_netprio_state(cgrp)->prioidx;
+	u32 max_len = atomic_read(&max_prioidx) + 1;
 	unsigned long priority;
 	char *priostr;
 	struct net_device *dev;
@@ -221,7 +238,10 @@ static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
 	if (!dev)
 		goto out_free_devname;
 
-	update_netdev_tables();
+	ret = extend_netdev_table(dev, max_len);
+	if (ret < 0)
+		goto out_free_devname;
+
 	ret = 0;
 	rcu_read_lock();
 	map = rcu_dereference(dev->priomap);
-- 
1.7.7.6
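
The error-propagation idea in this patch can be sketched in plain userspace C.
This is only an illustration under stated assumptions: toy_priomap,
toy_extend_table and toy_write_prio are hypothetical names, the early return
for an already large enough table is an addition of this sketch, and RTNL
locking and RCU are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for a per-device priority table. */
struct toy_priomap {
	uint32_t len;     /* number of entries in prio[] */
	uint32_t prio[];  /* one priority per cgroup index */
};

/*
 * Grow *mapp so that it holds at least new_len entries.
 * Returns 0 on success or -ENOMEM; on failure the old table is untouched.
 */
static int toy_extend_table(struct toy_priomap **mapp, uint32_t new_len)
{
	struct toy_priomap *old = *mapp;
	struct toy_priomap *new_map;

	if (old && old->len >= new_len)
		return 0; /* already large enough (sketch-only shortcut) */

	new_map = calloc(1, sizeof(*new_map) + sizeof(uint32_t) * new_len);
	if (!new_map)
		return -ENOMEM;

	if (old)
		memcpy(new_map->prio, old->prio, sizeof(uint32_t) * old->len);
	new_map->len = new_len;

	*mapp = new_map;
	free(old); /* the kernel code defers this with kfree_rcu() */
	return 0;
}

/* Caller pattern: extend first, and stop on error instead of indexing. */
static int toy_write_prio(struct toy_priomap **mapp, uint32_t max_idx,
			  uint32_t idx, uint32_t prio)
{
	int ret;

	assert(idx <= max_idx);
	ret = toy_extend_table(mapp, max_idx + 1);
	if (ret < 0)
		return ret;

	(*mapp)->prio[idx] = prio;
	return 0;
}
```

Because the table is either absent or sized to max_idx + 1 after a successful
call, a reader that finds a non-NULL table can index it without a separate
bounds check, which is the property the commit message asks for.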

^ permalink raw reply related

* [PATCH iproute2] tc: u32: Fix firstfrag filter.
From: Hiroaki SHIMODA @ 2012-07-10 10:44 UTC (permalink / raw)
  To: shemminger; +Cc: netdev

With the current firstfrag filter, all non-fragmented packets are matched.
firstfrag should check the MF bit.

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
---
Maybe no one uses this filter.

 tc/f_u32.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tc/f_u32.c b/tc/f_u32.c
index 7a04634..66c3247 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -513,7 +513,7 @@ static int parse_ip(int *argc_p, char ***argv_p, struct tc_u32_sel *sel)
 		res = pack_key16(sel, 0, 0x3FFF, 6, 0);
 	} else if (strcmp(*argv, "firstfrag") == 0) {
 		argc--; argv++;
-		res = pack_key16(sel, 0, 0x1FFF, 6, 0);
+		res = pack_key16(sel, 0x2000, 0x3FFF, 6, 0);
 	} else if (strcmp(*argv, "df") == 0) {
 		argc--; argv++;
 		res = pack_key16(sel, 0x4000, 0x4000, 6, 0);
-- 
1.7.8.6
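
To make the key/mask change concrete, here is a small sketch of a masked
16-bit comparison applied to the IPv4 flags/fragment-offset field (DF is
0x4000, MF is 0x2000, the offset is the low 13 bits). The helper names
(key16_match, old_firstfrag, new_firstfrag) are hypothetical; the key and
mask values are the ones from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_IP_DF      0x4000u  /* "don't fragment" flag */
#define TOY_IP_MF      0x2000u  /* "more fragments" flag */
#define TOY_IP_OFFMASK 0x1FFFu  /* fragment offset, in 8-byte units */

/* Masked comparison: the field matches when its masked bits equal the key. */
static bool key16_match(uint16_t field, uint16_t key, uint16_t mask)
{
	assert((key & ~mask) == 0); /* key bits outside the mask could never match */
	return (field & mask) == key;
}

/* Old rule: offset == 0, flags ignored. */
static bool old_firstfrag(uint16_t frag_field)
{
	return key16_match(frag_field, 0x0000, 0x1FFF);
}

/* New rule: MF set and offset == 0; DF and the reserved bit are ignored. */
static bool new_firstfrag(uint16_t frag_field)
{
	return key16_match(frag_field, 0x2000, 0x3FFF);
}
```

With the old rule an unfragmented packet (offset 0, MF clear) matches, since
only the offset is compared. The new rule additionally requires MF, so only
the first piece of a fragmented datagram matches.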

^ permalink raw reply related

* Re: [PATCH v2] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Eric Dumazet @ 2012-07-10 11:05 UTC (permalink / raw)
  To: Gao feng; +Cc: nhorman, linux-kernel, netdev, lizefan, tj, Eric Dumazet
In-Reply-To: <1341917043-13264-1-git-send-email-gaofeng@cn.fujitsu.com>

On Tue, 2012-07-10 at 18:44 +0800, Gao feng wrote:
> there are some out of bound accesses in netprio cgroup.

> -	update_netdev_tables();
> +	ret = extend_netdev_table(dev, max_len);
> +	if (ret < 0)
> +		goto out_free_devname;
> +
>  	ret = 0;
>  	rcu_read_lock();
>  	map = rcu_dereference(dev->priomap);

It's unfortunately adding a bug.

extend_netdev_table() is protected by RTNL.

^ permalink raw reply

* Re: [PATCH v2] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Eric Dumazet @ 2012-07-10 11:08 UTC (permalink / raw)
  To: Gao feng; +Cc: nhorman, linux-kernel, netdev, lizefan, tj, Eric Dumazet
In-Reply-To: <1341918350.3265.4830.camel@edumazet-glaptop>

On Tue, 2012-07-10 at 13:05 +0200, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 18:44 +0800, Gao feng wrote:
> > there are some out of bound accesses in netprio cgroup.
> 
> > -	update_netdev_tables();
> > +	ret = extend_netdev_table(dev, max_len);
> > +	if (ret < 0)
> > +		goto out_free_devname;
> > +
> >  	ret = 0;
> >  	rcu_read_lock();
> >  	map = rcu_dereference(dev->priomap);
> 
> Its unfortunately adding a bug.
> 
> extend_netdev_table() is protected by RTNL.

Please test your next patch using :

CONFIG_LOCKDEP=y
CONFIG_PROVE_RCU=y

Because rtnl_dereference() should shout if you don't hold RTNL.

^ permalink raw reply

* Re: [PATCH 04/16] mm: allow PF_MEMALLOC from softirq context
From: Mel Gorman @ 2012-07-10 11:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Eric Dumazet
In-Reply-To: <20120709165710.GC3515@breakpoint.cc>

On Mon, Jul 09, 2012 at 06:57:10PM +0200, Sebastian Andrzej Siewior wrote:
> On Mon, Jul 09, 2012 at 11:04:42AM +0100, Mel Gorman wrote:
> > > - lets assume your allocation happens with kmalloc() without __GFP_MEMALLOC
> > >   and current->flags has PF_MEMALLOC ORed and your SLAB pool is empty. This
> > >   forces SLAB to allocate more pages from the buddy allocator with it will
> > >   receive more likely (due to ->current->flags + PF_MEMALLOC) but SLAB will
> > >   drop this extra memory because the page has ->pf_memory (or something like
> > >   that) set and the GFP_FLAGS do not have __GFP_MEMALLOC set.
> > > 
> > 
> > It's recorded if the slab page was allocated from PFMEMALLOC reserves (see
> > patch 2 from the swap over NBD series). slab will use this page for objects
> > but only allocate them to callers that pass a gfp_pfmemalloc_allowed() check.
> > kmalloc() users with either __GFP_MEMALLOC or PF_MEMALLOC will get
> > the pages they need but they will not "leak" to !_GFP_MEMALLOC users as
> > that would potentially deadlock.
> 
> Argh, I missed that gfp_to_alloc_flags() is not only called from
> within the buddy allocater but also from slab. So this is fine then :)
> 

Good to hear. I appreciate you taking the time to give it a solid review
like this looking for holes.

> One thing:
> You only get current->flags |= PF_MEMALLOC in softirq _if_ the skb, which is 
> passed to netif_receive_skb(), was allocated with __GFP_MEMALLOC. That
> means if the NIC's RX allocation did not require an allocation from the
> emergency pool (without ->pfmemalloc set) then you never use this extra
> pool, even if this skb would end up in your swap socket. Also, the other way
> around, where you allocate it from the emergency pool but it is a user
> socket and you could drop it.
> 

While there is a possibility that packets may get dropped later like this,
they still get retransmitted and eventually it'll get through.  This is
not optimal but optimised swap-over-network was not the primary goal of
the series, deadlock avoidance was.

> What about extending sk_set_memalloc() to record socket's ips + ports
> in a separate list so that skb_pfmemalloc_protocol() might use that
> information and decide on per-protocol basis if the skb is worth to
> spend more ressource to deliver it. That means you would enable the
> extra pool if the currently received skb is part of your swap socket and
> not if the skb was allocated from the emergency pool.
> 
> That said, there is nothing wrong with the code as of now and this
> optimization could be added later (if at all).
> 

I think it is a good idea but it could also be done later iff a user had
a serious problem with the performance and that this made a measurable
difference. The series is already quite complex and I'd rather not add to
that complexity without strong motivation.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 11/16] netvm: Propagate page->pfmemalloc from skb_alloc_page to skb
From: Mel Gorman @ 2012-07-10 11:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Eric Dumazet
In-Reply-To: <20120709191856.GD3515@breakpoint.cc>

On Mon, Jul 09, 2012 at 09:18:56PM +0200, Sebastian Andrzej Siewior wrote:
> 
> > I can update e1000 if you like but it's not critical
> > to do so and in fact getting a bug reporting saying that network swap
> > was slow on e1000 would be useful to me in its own way :)
> No, leave as it, I was just curious.
> One thing: Do you think it makes sense to you introduce
> 	#define GFP_NET_RX     (GFP_ATOMIC | __GFP_MEMALLOC)
> 
> and use it within the receive path instead of GFP_ATOMIC?
> 

For now, I'd prefer to keep the __GFP_MEMALLOC flag at the different
callsites because it forces people to think about what it means.  I fear
that GFP_NET_RX may be too easy to misuse without thinking about what the
consequences are.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply

* Re: getting warn once around skb_try_coalesce
From: Eric Dumazet @ 2012-07-10 11:14 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, netdev@vger.kernel.org, Shlomo Pongratz,
	Erez Shitrit
In-Reply-To: <1341915510.3265.4734.camel@edumazet-glaptop>

On Tue, 2012-07-10 at 12:18 +0200, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 12:54 +0300, Or Gerlitz wrote:
> > Hi Dave, Eric,
> > 
> > Another trace that I see here with net-next is this one-time warning. I 
> > get it always
> > on the passive side of TCP, something that seems related to GRO, it 
> > happens only with
> > IPoIB, not with mlx4_en and igb (when igb get to work on net-next...)
> > 
> > The latest commit in this area is bad43ca8325f493dcaa0896c2f036276af059c7e
> > "net: introduce skb_try_coalesce()" from Eric.
> > 
> > Or.
> > 
> > -----------[ cut here ]------------
> > WARNING: at net/core/skbuff.c:3413 skb_try_coalesce+0x1f8/0x31d()
> 
> This warning catch skb truesize offenders, most probably its a driver
> issue.
> 

By the way, this driver does not allocate enough tailroom in skbs, so the
IP/TCP stacks need to reallocate the skb head to pull the IP/TCP headers.
That's not efficient.

I suggest using the following patch:

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 5c1bc99..9939869 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -159,7 +159,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	u64 *mapping;
 
 	if (ipoib_ud_need_sg(priv->max_ib_mtu))
-		buf_size = IPOIB_UD_HEAD_SIZE;
+		buf_size = IPOIB_UD_HEAD_SIZE + 128; /* reserve some tailroom for IP/TCP headers */
 	else
 		buf_size = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
 

^ permalink raw reply related

* Re: getting warn once around skb_try_coalesce
From: Eric Dumazet @ 2012-07-10 11:22 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, netdev@vger.kernel.org, Shlomo Pongratz,
	Erez Shitrit
In-Reply-To: <1341918848.3265.4853.camel@edumazet-glaptop>

On Tue, 2012-07-10 at 13:14 +0200, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 12:18 +0200, Eric Dumazet wrote:
> > On Tue, 2012-07-10 at 12:54 +0300, Or Gerlitz wrote:
> > > Hi Dave, Eric,
> > > 
> > > Another trace that I see here with net-next is this one-time warning. I 
> > > get it always
> > > on the passive side of TCP, something that seems related to GRO, it 
> > > happens only with
> > > IPoIB, not with mlx4_en and igb (when igb get to work on net-next...)
> > > 
> > > The latest commit in this area is bad43ca8325f493dcaa0896c2f036276af059c7e
> > > "net: introduce skb_try_coalesce()" from Eric.
> > > 
> > > Or.
> > > 
> > > -----------[ cut here ]------------
> > > WARNING: at net/core/skbuff.c:3413 skb_try_coalesce+0x1f8/0x31d()
> > 
> > This warning catch skb truesize offenders, most probably its a driver
> > issue.
> > 
> 
> By the way, this driver allocates not enough tailroom in skbs, so IP/TCP
> stacks need to reallocate skb head to pull IP/TCP headers. Thats not
> efficient.
> 
> I suggest using following patch :

And of course we can also fix the truesize bug.
(Not sure it will fix the warning, but it's worth trying.)

Since this driver allocates a full page, it must account for PAGE_SIZE in
truesize, not just the used part of the fragment.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 5c1bc99..e611a924 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -123,7 +123,7 @@ static void ipoib_ud_skb_put_frags(struct ipoib_dev_priv *priv,
 
 		skb_frag_size_set(frag, size);
 		skb->data_len += size;
-		skb->truesize += size;
+		skb->truesize += PAGE_SIZE;
 	} else
 		skb_put(skb, length);
 
@@ -159,7 +159,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	u64 *mapping;
 
 	if (ipoib_ud_need_sg(priv->max_ib_mtu))
-		buf_size = IPOIB_UD_HEAD_SIZE;
+		buf_size = IPOIB_UD_HEAD_SIZE + 128; /* reserve some tailroom for IP/TCP headers */
 	else
 		buf_size = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
 

* Re: [PATCH v2] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Neil Horman @ 2012-07-10 11:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Gao feng, linux-kernel, netdev, lizefan, tj, Eric Dumazet
In-Reply-To: <1341918350.3265.4830.camel@edumazet-glaptop>

On Tue, Jul 10, 2012 at 01:05:50PM +0200, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 18:44 +0800, Gao feng wrote:
> > there are some out of bound accesses in netprio cgroup.
> 
> > -	update_netdev_tables();
> > +	ret = extend_netdev_table(dev, max_len);
> > +	if (ret < 0)
> > +		goto out_free_devname;
> > +
> >  	ret = 0;
> >  	rcu_read_lock();
> >  	map = rcu_dereference(dev->priomap);
> 
> Its unfortunately adding a bug.
> 
> extend_netdev_table() is protected by RTNL.
> 
More specifically it needs to be protected by rtnl, and the call above isn't.
Other than that it looks pretty good to me.
Neil

* [PATCH net-next 0/9] Add Ethernet IPoIB driver
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Or Gerlitz

The eIPoIB driver provides a standard Ethernet netdevice over
the InfiniBand IPoIB interface.

Some services can run only on top of Ethernet L2 interfaces, and cannot be
bound to an IPoIB interface. With this new driver, these services can run
seamlessly.

The main use case of the driver is Ethernet Virtual Switching in
virtualized environments, where an eipoib netdevice can be used as a
Physical Interface (PIF) in the hypervisor domain, allowing the Virtual
Interfaces (VIFs) of guests connected to the same Virtual Switch
to run over the InfiniBand fabric.

This driver supports L2 Switching (Direct Bridging) as well as other L3
Switching modes (e.g. NAT).

Whenever an IPoIB interface is created, one eIPoIB PIF netdevice
is created. The default naming scheme is the same as for other Ethernet
interfaces: ethX. For example, on a system with two IPoIB interfaces,
ib0 and ib1, two interfaces will be created, ethX and ethX+1, where "X"
is the next free Ethernet number in the system.

Running "ethtool -i" on the new interface shows which IPoIB PIF interface
it sits on top of.  For example, "driver: eth_ipoib:ib0" in the output for eth3
indicates that eth3 is the Ethernet interface over the ib0 IPoIB interface.

The driver can be used as an independent interface, or serve in a
virtualization environment as the physical layer for the virtual
interfaces of the guests.

The driver interface (the eipoib interface, also referred to as the parent)
uses slave interfaces, IPoIB clones, which are the VIFs described above.

VIF interfaces are enslaved to/released from the eipoib driver on demand,
according to the management interface provided to user space.

The management interface for the driver uses sysfs entries. Via these sysfs
entries the driver gets details on new VIFs to manage. The driver can
enslave a new VIF (IPoIB cloned interface) or detach from it.

Here are a few sysfs commands that are used to manage the driver,
in a few scenarios:

1. create new clone of IPoIB interface:

	$ echo .Y > /sys/class/net/ibX/create_child

create new clone ibX.Y with the same pkey as ibX, for example:

	$ echo .1 > /sys/class/net/ib0/create_child

will create new interface ib0.1

2. notify parent interface on new VIF to enslave:

	$ echo +ibX.Y > /sys/class/net/ethZ/eth/slaves

where ethZ is the driver interface, for example:

	$ echo +ib0.1 > /sys/class/net/eth4/eth/slaves

will enslave ib0.1 to eth4

3. notify parent interface on VIF details (mac and vlan)

	$ echo +ibX.Y <MAC address> > /sys/class/net/ethZ/eth/vifs

for example:

	$ echo +ib0.1 00:02:c9:43:3b:f1 > /sys/class/net/eth4/eth/vifs

4. notify parent to release VIF:

	$ echo -ibX.Y > /sys/class/net/ethZ/eth/slaves

where ethZ is the driver interface, for example:

	$ echo -ib0.1 > /sys/class/net/eth4/eth/slaves

will release ib0.1 from eth4

5. see the list of ipoib interfaces enslaved under eipoib interface,

	$ cat /sys/class/net/ethX/eth/vifs

for example:
	
	$ cat /sys/class/net/eth4/eth/vifs

	SLAVE=ib0.1      MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
	SLAVE=ib0.2      MAC=52:54:00:60:55:88 VLAN=N/A
	SLAVE=ib0.3      MAC=52:54:00:60:55:89 VLAN=N/A

Note: Each ethX interface has at least one ibX.Y slave to serve the PIF
itself. In the VIFs list of ethX you'll notice that ibX.1 is always created
to serve applications running in the hypervisor directly on top of the ethX
interface.

For IB applications that require native IPoIB interfaces (e.g. RDMA-CM), the
original ipoib interfaces ibX can still be used.  For example, RDMA-CM and
eth_ipoib drivers can co-exist and make use of IPoIB.

The last patch of this series was made so that the series works as is over
net-next. In parallel to the submission of this driver, a patch to modify IPoIB
such that it doesn't assume dst/neighbour on the skb was posted.

The series is made against net-next commit 700db99d0 "ipoib: Need to do
dst_neigh_lookup_skb() outside of priv->lock" because of some issues with
the latest net-next which were reported on netdev today.

Or.

Erez Shitrit (8):
  include/linux: Add private flags for IPoIB interfaces
  IB/ipoib: Add support for acting as VIF
  net/eipoib: Add private header file
  net/eipoib: Add ethtool file support
  net/eipoib: Add sysfs support
  net/eipoib: Add main driver functionality
  net/eipoib: Add Makefile, Kconfig and MAINTAINERS entries
  IB/ipoib: Add support for transmission of skbs w.o dst/neighbour

Or Gerlitz (1):
  IB/ipoib: Add support for clones / multiple childs on the same
    partition

 Documentation/infiniband/ipoib.txt         |   24 +
 MAINTAINERS                                |    6 +
 drivers/infiniband/ulp/ipoib/ipoib.h       |   13 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c    |    9 +
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |    8 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |   83 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |    3 +-
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c  |   46 +-
 drivers/net/Kconfig                        |   15 +
 drivers/net/Makefile                       |    1 +
 drivers/net/eipoib/Makefile                |    4 +
 drivers/net/eipoib/eth_ipoib.h             |  224 ++++
 drivers/net/eipoib/eth_ipoib_ethtool.c     |  147 +++
 drivers/net/eipoib/eth_ipoib_main.c        | 1897 ++++++++++++++++++++++++++++
 drivers/net/eipoib/eth_ipoib_sysfs.c       |  640 ++++++++++
 include/linux/if.h                         |    2 +
 include/rdma/e_ipoib.h                     |   51 +
 17 files changed, 3140 insertions(+), 33 deletions(-)
 create mode 100644 drivers/net/eipoib/Makefile
 create mode 100644 drivers/net/eipoib/eth_ipoib.h
 create mode 100644 drivers/net/eipoib/eth_ipoib_ethtool.c
 create mode 100644 drivers/net/eipoib/eth_ipoib_main.c
 create mode 100644 drivers/net/eipoib/eth_ipoib_sysfs.c
 create mode 100644 include/rdma/e_ipoib.h

* [PATCH net-next 2/9] include/linux: Add private flags for IPoIB interfaces
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz
In-Reply-To: <1341922569-4118-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

The two new bits indicate whether a device is considered a PIF interface,
meaning one of the "main" interfaces (ib0, ib1 etc.), or a cloned interface
(ib0.1, ib1.2 etc.) that is now in use by the eIPoIB driver.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/linux/if.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/if.h b/include/linux/if.h
index 1ec407b..f50dbf2 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -84,6 +84,8 @@
 #define IFF_LIVE_ADDR_CHANGE 0x100000	/* device supports hardware address
 					 * change when it's running */
 
+#define IFF_EIPOIB_PIF  0x200000       /* IPoIB PIF intf (ib0, ib1 etc.) */
+#define IFF_EIPOIB_VIF  0x400000       /* IPoIB VIF intf (ib0.x, ib1.x etc.) */
 
 #define IF_GET_IFACE	0x0001		/* for querying only */
 #define IF_GET_PROTO	0x0002
-- 
1.7.1

* [PATCH net-next 4/9] net/eipoib: Add private header file
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz
In-Reply-To: <1341922569-4118-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

The header file includes all structures, macros and non-static
functions which are used by the driver.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib.h |  224 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 224 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib.h

diff --git a/drivers/net/eipoib/eth_ipoib.h b/drivers/net/eipoib/eth_ipoib.h
new file mode 100644
index 0000000..45871c9
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib.h
@@ -0,0 +1,224 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_ETH_IPOIB_H
+#define _LINUX_ETH_IPOIB_H
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <net/arp.h>
+#include <linux/if_vlan.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <rdma/e_ipoib.h>
+
+/* macros and definitions */
+#define DRV_VERSION		"1.0.0"
+#define DRV_RELDATE		"June 1, 2012"
+#define DRV_NAME		"eth_ipoib"
+#define SDRV_NAME		"ipoib"
+#define DRV_DESCRIPTION		"IP-over-InfiniBand Para Virtualized Driver"
+#define EIPOIB_ABI_VER	1
+
+#undef  pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#define GID_LEN			16
+#define GUID_LEN		8
+
+#define PARENT_VLAN_FEATURES \
+	(NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX | \
+	 NETIF_F_HW_VLAN_FILTER)
+
+#define parent_for_each_slave(_parent, slave)		\
+		list_for_each_entry(slave, &(_parent)->slave_list, list)\
+
+#define PARENT_IS_OK(_parent)				\
+		(((_parent)->dev->flags & IFF_UP) &&	\
+		 netif_running((_parent)->dev)    &&	\
+		 ((_parent)->slave_cnt > 0))
+
+#define IS_E_IPOIB_PROTO(_proto)			\
+		 (((_proto) == htons(ETH_P_ARP)) ||	\
+		 ((_proto) == htons(ETH_P_RARP)) ||	\
+		 ((_proto) == htons(ETH_P_IP)))
+
+enum eipoib_emac_guest_info {
+	VALID,
+	MIGRATED_OUT,
+	INVALID,
+};
+
+/* structs */
+struct eth_arp_data {
+	u8 arp_sha[ETH_ALEN];
+	__be32 arp_sip;
+	u8 arp_dha[ETH_ALEN];
+	__be32 arp_dip;
+} __packed;
+
+struct ipoib_arp_data {
+	u8 arp_sha[INFINIBAND_ALEN];
+	__be32 arp_sip;
+	u8 arp_dha[INFINIBAND_ALEN];
+	__be32 arp_dip;
+} __packed;
+
+/* live migration support structures: */
+struct ip_member {
+	__be32 ip;
+	struct list_head list;
+};
+
+/*
+ * for each slave (emac) saves all the ip over that mac.
+ * the parent keeps that list for live migration.
+ */
+struct guest_emac_info {
+	u8 emac[ETH_ALEN];
+	u16 vlan;
+	struct list_head ip_list;
+	struct list_head list;
+	enum eipoib_emac_guest_info rec_state;
+	int num_of_retries;
+};
+
+struct neigh {
+	struct list_head list;
+	u8 emac[ETH_ALEN];
+	u8 imac[INFINIBAND_ALEN];
+	/* this part is used for neigh_add_list */
+	char cmd[PAGE_SIZE];
+};
+
+struct slave {
+	struct net_device *dev;
+	struct slave *next;
+	struct slave *prev;
+	int    index;
+	struct list_head list;
+	unsigned long jiffies;
+	s8     link;
+	s8     state;
+	u16    pkey;
+	u16    vlan;
+	u8     emac[ETH_ALEN];
+	u8     imac[INFINIBAND_ALEN];
+	struct list_head neigh_list;
+	/* this part is used for vif_add_list */
+	char cmd[PAGE_SIZE];
+};
+
+struct port_stats {
+	/* update PORT_STATS_LEN (number of stat fields)accordingly */
+	unsigned long tx_parent_dropped;
+	unsigned long tx_vif_miss;
+	unsigned long tx_neigh_miss;
+	unsigned long tx_vlan;
+	unsigned long tx_shared;
+	unsigned long tx_proto_errors;
+	unsigned long tx_skb_errors;
+	unsigned long tx_slave_err;
+
+	unsigned long rx_parent_dropped;
+	unsigned long rx_vif_miss;
+	unsigned long rx_neigh_miss;
+	unsigned long rx_vlan;
+	unsigned long rx_shared;
+	unsigned long rx_proto_errors;
+	unsigned long rx_skb_errors;
+	unsigned long rx_slave_err;
+};
+
+struct parent {
+	struct   net_device *dev;
+	int      index;
+	struct   neigh_parms nparms;
+	struct   list_head slave_list;
+	/* never change this value outside the attach/detach wrappers */
+	s32      slave_cnt;
+	rwlock_t lock;
+	struct   net_device_stats stats;
+	struct   port_stats port_stats;
+	struct   list_head parent_list;
+	struct   dev_mc_list *mc_list;
+	u16      flags;
+	struct   list_head vlan_list;
+	struct   workqueue_struct *wq;
+	s8       kill_timers;
+	struct   delayed_work neigh_learn_work;
+	struct   delayed_work vif_learn_work;
+	struct   list_head neigh_add_list;
+	union    ib_gid gid;
+	char     ipoib_main_interface[IFNAMSIZ];
+	struct   list_head emac_ip_list;
+	struct   delayed_work emac_ip_work;
+	struct   delayed_work migrate_out_work;
+};
+
+#define eipoib_slave_get_rcu(dev) \
+	((struct slave *) rcu_dereference(dev->rx_handler_data))
+
+/* name space support for sys/fs */
+struct eipoib_net {
+	struct net	*net;	/* Associated network namespace */
+	struct class_attribute class_attr_eipoib_interfaces;
+};
+
+/* exported from main.c */
+extern int eipoib_net_id;
+extern struct list_head parent_dev_list;
+
+/* functions prototypes */
+int mod_create_sysfs(struct eipoib_net *eipoib_n);
+void mod_destroy_sysfs(struct eipoib_net *eipoib_n);
+void parent_destroy_sysfs_entry(struct parent *parent);
+int parent_create_sysfs_entry(struct parent *parent);
+int create_slave_symlinks(struct net_device *master,
+			  struct net_device *slave);
+void destroy_slave_symlinks(struct net_device *master,
+			    struct net_device *slave);
+int parent_enslave(struct net_device *parent_dev,
+		   struct net_device *slave_dev);
+int parent_release_slave(struct net_device *parent_dev,
+			 struct net_device *slave_dev);
+struct neigh *parent_get_neigh_cmd(char op, char *ifname,
+				   u8 *remac, u8 *rimac);
+struct slave *parent_get_vif_cmd(char op, char *ifname, u8 *lemac);
+ssize_t __parent_store_neighs(struct device *d,
+			      struct device_attribute *attr,
+			      const char *buffer, size_t count);
+void parent_set_ethtool_ops(struct net_device *dev);
+
+#endif /* _LINUX_ETH_IPOIB_H */
-- 
1.7.1

* [PATCH net-next 3/9] IB/ipoib: Add support for acting as VIF
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz
In-Reply-To: <1341922569-4118-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

When an IPoIB interface acts as a VIF for an eIPoIB interface, it uses
the skb cb storage area on the RX flow to place information which
can be of use to the upper layer device.

One such usage example is when an eIPoIB interface needs to generate
a source mac for incoming Ethernet frames.

The IPoIB code checks the VIF private flag on the RX path and, according
to the value of the flag, prepares the skb CB data, etc.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |    6 +++
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |    9 +++++
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   |    8 ++++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   28 ++++++++++++++++
 include/rdma/e_ipoib.h                    |   51 +++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 1 deletions(-)
 create mode 100644 include/rdma/e_ipoib.h

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index a57db27..1d28774 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -52,6 +52,7 @@
 #include <rdma/ib_pack.h>
 #include <rdma/ib_sa.h>
 #include <linux/sched.h>
+#include <rdma/e_ipoib.h>
 
 /* constants */
 
@@ -209,6 +210,7 @@ struct ipoib_cm_rx {
 	unsigned long		jiffies;
 	enum ipoib_cm_state	state;
 	int			recv_count;
+	u32			qpn;
 };
 
 struct ipoib_cm_tx {
@@ -695,6 +697,10 @@ extern int ipoib_recvq_size;
 
 extern struct ib_sa_client ipoib_sa_client;
 
+
+inline void set_skb_oob_cb_data(struct sk_buff *skb, struct ib_wc *wc,
+				struct napi_struct *napi);
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 extern int ipoib_debug_level;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 014504d..dca7952 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -440,6 +440,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	struct net_device *dev = cm_id->context;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_rx *p;
+	struct ipoib_cm_data *data = event->private_data;
 	unsigned psn;
 	int ret;
 
@@ -452,6 +453,10 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	cm_id->context = p;
 	p->state = IPOIB_CM_RX_LIVE;
 	p->jiffies = jiffies;
+
+	/* used to keep track of base qpn in CM mode */
+	p->qpn = be32_to_cpu(data->qpn);
+
 	INIT_LIST_HEAD(&p->list);
 
 	p->qp = ipoib_cm_create_rx_qp(dev, p);
@@ -669,6 +674,10 @@ copied:
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
 	skb->pkt_type = PACKET_HOST;
+	/* if handler is registered on top of ipoib, set skb oob data. */
+	if (skb->dev->priv_flags & IFF_EIPOIB_VIF)
+		set_skb_oob_cb_data(skb, wc, NULL);
+
 	netif_receive_skb(skb);
 
 repost:
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 5c1bc99..da28799 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -300,7 +300,13 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-	napi_gro_receive(&priv->napi, skb);
+	/* if handler is registered on top of ipoib, set skb oob data */
+	if (dev->priv_flags & IFF_EIPOIB_VIF) {
+		set_skb_oob_cb_data(skb, wc, &priv->napi);
+		/* the registered handler will take care of the skb */
+		netif_receive_skb(skb);
+	} else
+		napi_gro_receive(&priv->napi, skb);
 
 repost:
 	if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 704d068..1ccd42f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -91,6 +91,31 @@ static struct ib_client ipoib_client = {
 	.remove = ipoib_remove_one
 };
 
+inline void set_skb_oob_cb_data(struct sk_buff *skb, struct ib_wc *wc,
+				struct napi_struct *napi)
+{
+	struct ipoib_cm_rx *p_cm_ctx = NULL;
+	union skb_cb_data *data = NULL;
+	struct ib_grh *grh = NULL;
+
+	p_cm_ctx = wc->qp->qp_context;
+	data = IPOIB_HANDLER_CB(skb);
+
+	data->rx.slid = wc->slid;
+	data->rx.sqpn = wc->src_qp;
+	data->rx.napi = napi;
+
+	/* if dqpn is mcast, fetch the dgid */
+	grh = (struct ib_grh *)(skb->data - IB_GRH_BYTES - IPOIB_ENCAP_LEN);
+
+	if ((wc->wc_flags & IB_WC_GRH) && grh)
+		memcpy(data->rx.dgid, grh->dgid.raw, 16);
+
+	/* in CM mode, use the "base" qpn as sqpn */
+	if (p_cm_ctx)
+		data->rx.sqpn = p_cm_ctx->qpn;
+}
+
 int ipoib_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -1277,6 +1302,9 @@ static struct net_device *ipoib_add_port(const char *format,
 		goto event_failed;
 	}
 
+	/* indicates pif port */
+	priv->dev->priv_flags |= IFF_EIPOIB_PIF;
+
 	result = register_netdev(priv->dev);
 	if (result) {
 		printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n",
diff --git a/include/rdma/e_ipoib.h b/include/rdma/e_ipoib.h
new file mode 100644
index 0000000..481514e
--- /dev/null
+++ b/include/rdma/e_ipoib.h
@@ -0,0 +1,51 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_ETH_IB_IPOIB_H
+#define _LINUX_ETH_IB_IPOIB_H
+#include <linux/skbuff.h>
+#include <linux/if_infiniband.h>
+#include <rdma/ib_verbs.h>
+
+/* must be <= 48 bytes */
+union skb_cb_data {
+	struct {
+		u32 sqpn;
+		u16 slid;
+		u8  dgid[16];
+		struct napi_struct *napi;
+	} rx;
+};
+
+#define IPOIB_HANDLER_CB(skb) ((union skb_cb_data *)(skb)->cb)
+
+#endif /* _LINUX_ETH_IB_IPOIB_H */
-- 
1.7.1

* [PATCH net-next 6/9] net/eipoib: Add sysfs support
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz
In-Reply-To: <1341922569-4118-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

The management interface for the driver uses sysfs entries. Via these sysfs
entries the driver gets details on new VIFs to manage. The driver can
enslave a new VIF (IPoIB cloned interface) or detach from it.

Here are a few sysfs commands that are used to manage the driver,
in a few scenarios:

1. create new clone of IPoIB interface:

	$ echo .Y > /sys/class/net/ibX/create_child

create new clone ibX.Y with the same pkey as ibX, for example:

	$ echo .1 > /sys/class/net/ib0/create_child

will create new interface ib0.1

2. notify parent interface on new VIF to enslave:

	$ echo +ibX.Y > /sys/class/net/ethZ/eth/slaves

where ethZ is the driver interface, for example:

	$ echo +ib0.1 > /sys/class/net/eth4/eth/slaves

will enslave ib0.1 to eth4

3. notify parent interface on VIF details (mac and vlan)

	$ echo +ibX.Y <MAC address> > /sys/class/net/ethZ/eth/vifs

for example:

	$ echo +ib0.1 00:02:c9:43:3b:f1 > /sys/class/net/eth4/eth/vifs

4. notify parent to release VIF:

	$ echo -ibX.Y > /sys/class/net/ethZ/eth/slaves

where ethZ is the driver interface, for example:

	$ echo -ib0.1 > /sys/class/net/eth4/eth/slaves

will release ib0.1 from eth4

5. see the list of ipoib interfaces enslaved under eipoib interface,

	$ cat /sys/class/net/ethX/eth/vifs

for example:

	$ cat /sys/class/net/eth4/eth/vifs

	SLAVE=ib0.1      MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
	SLAVE=ib0.2      MAC=52:54:00:60:55:88 VLAN=N/A
	SLAVE=ib0.3      MAC=52:54:00:60:55:89 VLAN=N/A

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib_sysfs.c |  640 ++++++++++++++++++++++++++++++++++
 1 files changed, 640 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib_sysfs.c

diff --git a/drivers/net/eipoib/eth_ipoib_sysfs.c b/drivers/net/eipoib/eth_ipoib_sysfs.c
new file mode 100644
index 0000000..be1712e
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib_sysfs.c
@@ -0,0 +1,640 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/in.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/inet.h>
+#include <linux/rtnetlink.h>
+#include <linux/etherdevice.h>
+#include <net/net_namespace.h>
+
+#include "eth_ipoib.h"
+
+#define to_dev(obj)	container_of(obj, struct device, kobj)
+#define to_parent(cd)	((struct parent *)(netdev_priv(to_net_dev(cd))))
+#define MOD_NA_STRING		"N/A"
+
+#define _sprintf(p, buf, format, arg...)				\
+((PAGE_SIZE - (int)(p - buf)) <= 0 ? 0 :				\
+	scnprintf(p, PAGE_SIZE - (int)(p - buf), format, ## arg))\
+
+#define _end_of_line(_p, _buf)					\
+do { if (_p - _buf) /* eat the leftover space */			\
+		_buf[_p - _buf - 1] = '\n';				\
+} while (0)
+
+/* helper functions */
+static int get_emac(u8 *mac, char *s)
+{
+	if (sscanf(s, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
+		   mac + 0, mac + 1, mac + 2, mac + 3, mac + 4,
+		   mac + 5) != 6)
+		return -1;
+
+	return 0;
+}
+
+static int get_imac(u8 *mac, char *s)
+{
+	if (sscanf(s, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:"
+		   "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:"
+		   "%hhx:%hhx:%hhx:%hhx",
+		   mac + 0, mac + 1, mac + 2, mac + 3, mac + 4,
+		   mac + 5, mac + 6, mac + 7, mac + 8, mac + 9,
+		   mac + 10, mac + 11, mac + 12, mac + 13,
+		   mac + 14, mac + 15, mac + 16, mac + 17,
+		   mac + 18, mac + 19) != 20)
+		return -1;
+
+	return 0;
+}
+
+/* show/store functions per module (CLASS_ATTR) */
+static ssize_t show_parents(struct class *cls, struct class_attribute *attr,
+			    char *buf)
+{
+	char *p = buf;
+	struct parent *parent;
+
+	rtnl_lock(); /* because of parent_dev_list */
+
+	list_for_each_entry(parent, &parent_dev_list, parent_list) {
+		p += _sprintf(p, buf, "%s over IB port: %s\n",
+			      parent->dev->name,
+			      parent->ipoib_main_interface);
+	}
+	_end_of_line(p, buf);
+
+	rtnl_unlock();
+	return (ssize_t)(p - buf);
+}
+
+/* show/store functions per parent (DEVICE_ATTR) */
+static ssize_t parent_show_neighs(struct device *d,
+				  struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct neigh *neigh;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave) {
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			p += _sprintf(p, buf, "SLAVE=%-10s EMAC=%pM IMAC=%pM:%pM:%pM:%.2x:%.2x\n",
+				      slave->dev->name,
+				      neigh->emac,
+				      neigh->imac, neigh->imac + 6, neigh->imac + 12,
+				      neigh->imac[18], neigh->imac[19]);
+		}
+	}
+
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+struct neigh *parent_get_neigh_cmd(char op,
+				   char *ifname, u8 *remac, u8 *rimac)
+{
+	struct neigh *neigh_cmd;
+
+	neigh_cmd = kzalloc(sizeof *neigh_cmd, GFP_ATOMIC);
+	if (!neigh_cmd) {
+		pr_err("%s cannot allocate neigh struct\n", ifname);
+		goto out;
+	}
+
+	/*
+	 * populate emac field so it can be used easily
+	 * in neigh_cmd_find_by_mac()
+	 */
+	memcpy(neigh_cmd->emac, remac, ETH_ALEN);
+	memcpy(neigh_cmd->imac, rimac, INFINIBAND_ALEN);
+
+	/* prepare the command as a string */
+	sprintf(neigh_cmd->cmd, "%c%s %pM %pM:%pM:%pM:%.2x:%.2x",
+		op, ifname, remac, rimac, rimac + 6, rimac + 12, rimac[18], rimac[19]);
+out:
+	return neigh_cmd;
+}
+
+/* write_lock_bh(&parent->lock) must be held */
+ssize_t __parent_store_neighs(struct device *d,
+			      struct device_attribute *attr,
+			      const char *buffer, size_t count)
+{
+	char command[IFNAMSIZ + 1] = { 0, };
+	char emac_str[ETH_ALEN * 3] = { 0, };
+	u8 emac[ETH_ALEN];
+	char imac_str[INFINIBAND_ALEN * 3] = { 0, };
+	u8 imac[INFINIBAND_ALEN];
+	char *ifname;
+	int found = 0, ret = count;
+	struct slave *slave = NULL, *slave_tmp;
+	struct neigh *neigh;
+	struct parent *parent = to_parent(d);
+
+	sscanf(buffer, "%16s %17s %59s", command, emac_str, imac_str);
+
+	/* check ifname */
+	ifname = command + 1;
+	if ((strlen(command) <= 1) || !dev_valid_name(ifname) ||
+	    (command[0] != '+' && command[0] != '-'))
+		goto err_no_cmd;
+
+	/* check if ifname exist */
+	parent_for_each_slave(parent, slave_tmp) {
+		if (!strcmp(slave_tmp->dev->name, ifname)) {
+			found = 1;
+			slave = slave_tmp;
+		}
+	}
+
+	if (!found) {
+		pr_err("%s could not find slave\n", ifname);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (get_emac(emac, emac_str)) {
+		pr_err("%s bad emac %s\n", ifname, emac_str);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (get_imac(imac, imac_str)) {
+		pr_err("%s bad imac %s\n", ifname, imac_str);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* process command */
+	if (command[0] == '+') {
+		found = 0;
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			if (!memcmp(neigh->emac, emac, ETH_ALEN))
+				found = 1;
+		}
+
+		if (found) {
+			pr_err("%s: cannot update neigh, slave already has "
+			       "this neigh mac %pM\n",
+			       slave->dev->name, emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		neigh = kzalloc(sizeof *neigh, GFP_ATOMIC);
+		if (!neigh) {
+			pr_err("%s cannot allocate neigh struct\n",
+			       slave->dev->name);
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		/* ready to go */
+		pr_info("%s: slave %s neigh mac is set to %pM\n",
+			ifname, parent->dev->name, emac);
+		memcpy(neigh->emac, emac, ETH_ALEN);
+		memcpy(neigh->imac, imac, INFINIBAND_ALEN);
+
+		list_add_tail(&neigh->list, &slave->neigh_list);
+
+		goto out;
+	}
+
+	if (command[0] == '-') {
+		found = 0;
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			if ((found = !memcmp(neigh->emac, emac, ETH_ALEN)))
+				break;	/* keep neigh pointing at the match */
+		}
+
+		if (!found) {
+			pr_err("%s cannot delete neigh mac %pM\n",
+			       ifname, emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		list_del(&neigh->list);
+		kfree(neigh);
+
+		goto out;
+	}
+
+err_no_cmd:
+	pr_err("%s USAGE: (-|+)ifname emac imac\n", DRV_NAME);
+	ret = -EPERM;
+
+out:
+	return ret;
+}
+
+static ssize_t parent_store_neighs(struct device *d,
+				   struct device_attribute *attr,
+				   const char *buffer, size_t count)
+{
+	struct parent *parent = to_parent(d);
+	ssize_t rc;
+
+	write_lock_bh(&parent->lock);
+	rc = __parent_store_neighs(d, attr, buffer, count);
+	write_unlock_bh(&parent->lock);
+
+	return rc;
+}
+
+static DEVICE_ATTR(neighs, S_IRUGO | S_IWUSR, parent_show_neighs,
+		   parent_store_neighs);
+
+static ssize_t parent_show_vifs(struct device *d,
+				struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave) {
+		if (is_zero_ether_addr(slave->emac)) {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%-17s "
+				      "VLAN=%s\n", slave->dev->name,
+				      MOD_NA_STRING, MOD_NA_STRING);
+		} else if (slave->vlan == VLAN_N_VID) {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%pM VLAN=%s\n",
+				      slave->dev->name,
+				      slave->emac,
+				      MOD_NA_STRING);
+		} else {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%pM VLAN=%d\n",
+				      slave->dev->name,
+				      slave->emac,
+				      slave->vlan);
+		}
+	}
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+static ssize_t parent_store_vifs(struct device *d,
+				 struct device_attribute *attr,
+				 const char *buffer, size_t count)
+{
+	char command[IFNAMSIZ + 1] = { 0, };
+	char mac_str[ETH_ALEN * 3] = { 0, };
+	char *ifname;
+	u8 mac[ETH_ALEN];
+	int found = 0, ret = count;
+	struct slave *slave = NULL, *slave_tmp;
+	struct parent *parent = to_parent(d);
+
+	sscanf(buffer, "%16s %17s", command, mac_str);
+
+	write_lock_bh(&parent->lock);
+
+	/* check ifname */
+	ifname = command + 1;
+	if ((strlen(command) <= 1) || !dev_valid_name(ifname) ||
+	    (command[0] != '+' && command[0] != '-'))
+		goto err_no_cmd;
+
+	/* check if ifname exist */
+	parent_for_each_slave(parent, slave_tmp) {
+		if (!strcmp(slave_tmp->dev->name, ifname)) {
+			found = 1;
+			slave = slave_tmp;
+		}
+	}
+
+	if (!found) {
+		pr_err("%s could not find slave\n", ifname);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* process command */
+	if (command[0] == '+') {
+		if (get_emac(mac, mac_str) || !is_valid_ether_addr(mac)) {
+			pr_err("%s invalid mac input\n", ifname);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (!is_zero_ether_addr(slave->emac)) {
+			pr_err("%s slave %s mac already set to %pM\n",
+			       ifname, slave->dev->name, slave->emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* check another slave has this mac/vlan */
+		found = 0;
+		parent_for_each_slave(parent, slave_tmp) {
+			if (!memcmp(slave_tmp->emac, mac, ETH_ALEN) &&
+			    slave_tmp->vlan == slave->vlan) {
+				pr_err("cannot update %s, slave %s already has"
+				       " vlan 0x%x mac %pM\n",
+				       parent->dev->name, slave->dev->name,
+				       slave_tmp->vlan,
+				       mac);
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+
+		/* ready to go */
+		pr_info("slave %s mac is set to %pM\n",
+			ifname, mac);
+
+		memcpy(slave->emac, mac, ETH_ALEN);
+		goto out;
+	}
+
+	if (command[0] == '-') {
+		if (is_zero_ether_addr(slave->emac)) {
+			pr_err("%s slave mac already unset %pM\n",
+			       ifname, slave->emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		pr_info("slave %s mac is unset (was %pM)\n",
+			ifname, slave->emac);
+
+		goto out;
+	}
+
+err_no_cmd:
+	pr_err("%s USAGE: (-|+)ifname [mac]\n", DRV_NAME);
+	ret = -EPERM;
+
+out:
+	write_unlock_bh(&parent->lock);
+
+	return ret;
+}
+
+static DEVICE_ATTR(vifs, S_IRUGO | S_IWUSR, parent_show_vifs,
+		   parent_store_vifs);
+
+static ssize_t parent_show_slaves(struct device *d,
+				  struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave)
+		p += _sprintf(p, buf, "%s\n", slave->dev->name);
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+static ssize_t parent_store_slaves(struct device *d,
+				   struct device_attribute *attr,
+				   const char *buffer, size_t count)
+{
+	char command[IFNAMSIZ + 1] = { 0, };
+	char *ifname;
+	int res, ret = count;
+	struct slave *slave;
+	struct net_device *dev = NULL;
+	struct parent *parent = to_parent(d);
+
+	/* Quick sanity check -- is the parent interface up? */
+	if (!(parent->dev->flags & IFF_UP)) {
+		pr_warn("%s: doing slave updates when "
+			"interface is down.\n", parent->dev->name);
+	}
+
+	if (!rtnl_trylock()) /* because __dev_get_by_name */
+		return restart_syscall();
+
+	sscanf(buffer, "%16s", command);
+
+	ifname = command + 1;
+	if ((strlen(command) <= 1) || !dev_valid_name(ifname))
+		goto err_no_cmd;
+
+	if (command[0] == '+') {
+		/* Got a slave name in ifname. Is it already in the list? */
+		dev = __dev_get_by_name(&init_net, ifname);
+		if (!dev) {
+			pr_warn("%s: Interface %s does not exist!\n",
+				parent->dev->name, ifname);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		read_lock_bh(&parent->lock);
+		parent_for_each_slave(parent, slave) {
+			if (slave->dev == dev) {
+				pr_err("%s ERR- Interface %s is already enslaved!\n",
+				       parent->dev->name, dev->name);
+				ret = -EPERM;
+			}
+		}
+		read_unlock_bh(&parent->lock);
+
+		if (ret < 0)
+			goto out;
+
+		pr_info("%s: adding slave %s\n",
+			parent->dev->name, ifname);
+
+		res = parent_enslave(parent->dev, dev);
+		if (res)
+			ret = res;
+
+		goto out;
+	}
+
+	if (command[0] == '-') {
+		dev = NULL;
+		parent_for_each_slave(parent, slave)
+			if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) {
+				dev = slave->dev;
+				break;
+			}
+
+		if (dev) {
+			pr_info("%s: removing slave %s\n",
+				parent->dev->name, dev->name);
+			res = parent_release_slave(parent->dev, dev);
+			if (res) {
+				ret = res;
+				goto out;
+			}
+		} else {
+			pr_warn("%s: unable to remove non-existent "
+				"slave for parent %s.\n",
+				ifname, parent->dev->name);
+			ret = -ENODEV;
+		}
+		goto out;
+	}
+
+err_no_cmd:
+	pr_err("%s USAGE: (-|+)ifname\n", DRV_NAME);
+	ret = -EPERM;
+
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static DEVICE_ATTR(slaves, S_IRUGO | S_IWUSR, parent_show_slaves,
+		   parent_store_slaves);
+
+/* sysfs create/destroy functions */
+static struct attribute *per_parent_attrs[] = {
+	&dev_attr_slaves.attr, /* DEVICE_ATTR(slaves..) */
+	&dev_attr_vifs.attr,
+	&dev_attr_neighs.attr,
+	NULL,
+};
+
+/* namespace support */
+static const void *eipoib_namespace(struct class *cls,
+				    const struct class_attribute *attr)
+{
+	const struct eipoib_net *eipoib_n =
+		container_of(attr,
+			     struct eipoib_net, class_attr_eipoib_interfaces);
+	return eipoib_n->net;
+}
+
+static struct attribute_group parent_group = {
+	/* per parent sysfs files under: /sys/class/net/<IF>/eth/.. */
+	.name = "eth",
+	.attrs = per_parent_attrs
+};
+
+int create_slave_symlinks(struct net_device *master,
+			  struct net_device *slave)
+{
+	char linkname[IFNAMSIZ+7];
+	int ret = 0;
+
+	ret = sysfs_create_link(&(slave->dev.kobj), &(master->dev.kobj),
+				"eth_parent");
+	if (ret)
+		return ret;
+
+	sprintf(linkname, "slave_%s", slave->name);
+	ret = sysfs_create_link(&(master->dev.kobj), &(slave->dev.kobj),
+				linkname);
+	return ret;
+
+}
+
+void destroy_slave_symlinks(struct net_device *master,
+			    struct net_device *slave)
+{
+	char linkname[IFNAMSIZ+7];
+
+	sysfs_remove_link(&(slave->dev.kobj), "eth_parent");
+	sprintf(linkname, "slave_%s", slave->name);
+	sysfs_remove_link(&(master->dev.kobj), linkname);
+}
+
+static struct class_attribute class_attr_eth_ipoib_interfaces = {
+	.attr = {
+		.name = "eth_ipoib_interfaces",
+		.mode = S_IWUSR | S_IRUGO,
+	},
+	.show = show_parents,
+	.namespace = eipoib_namespace,
+};
+
+/* per module sysfs file under: /sys/class/net/eth_ipoib_interfaces */
+int mod_create_sysfs(struct eipoib_net *eipoib_n)
+{
+	int rc;
+	/* defined in CLASS_ATTR(eth_ipoib_interfaces..) */
+	eipoib_n->class_attr_eipoib_interfaces =
+		class_attr_eth_ipoib_interfaces;
+
+	sysfs_attr_init(&eipoib_n->class_attr_eipoib_interfaces.attr);
+
+	rc = netdev_class_create_file(&eipoib_n->class_attr_eipoib_interfaces);
+	if (rc)
+		pr_err("%s failed to create sysfs (rc %d)\n",
+		       eipoib_n->class_attr_eipoib_interfaces.attr.name, rc);
+
+	return rc;
+}
+
+void mod_destroy_sysfs(struct eipoib_net *eipoib_n)
+{
+	netdev_class_remove_file(&eipoib_n->class_attr_eipoib_interfaces);
+}
+
+int parent_create_sysfs_entry(struct parent *parent)
+{
+	struct net_device *dev = parent->dev;
+	int rc;
+
+	rc = sysfs_create_group(&(dev->dev.kobj), &parent_group);
+	if (rc)
+		pr_info("failed to create sysfs group\n");
+
+	return rc;
+}
+
+void parent_destroy_sysfs_entry(struct parent *parent)
+{
+	struct net_device *dev = parent->dev;
+
+	sysfs_remove_group(&(dev->dev.kobj), &parent_group);
+}
-- 
1.7.1
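
The "neighs" file in the patch above takes commands of the form
"(-|+)ifname emac imac". As a rough user-space sketch of that parsing,
assuming bounded sscanf field widths and a stricter address format than
get_emac()/get_imac() (exactly two hex digits per byte), something like
the following could be used. struct neigh_cmd, hex_val(), parse_hw_addr(),
parse_neigh_cmd() and the three size macros are hypothetical names made up
for this illustration; they are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IFNAME_MAX 16	/* stand-in for IFNAMSIZ */
#define EMAC_LEN   6	/* stand-in for ETH_ALEN */
#define IMAC_LEN   20	/* stand-in for INFINIBAND_ALEN */

struct neigh_cmd {		/* hypothetical, for illustration only */
	char op;		/* '+' to add, '-' to delete */
	char ifname[IFNAME_MAX];
	uint8_t emac[EMAC_LEN];
	uint8_t imac[IMAC_LEN];
};

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse "xx:xx:...:xx" into exactly len bytes; 0 on success, -1 otherwise. */
static int parse_hw_addr(const char *s, uint8_t *out, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		int hi = hex_val(s[0]);
		int lo = hi < 0 ? -1 : hex_val(s[1]);

		if (hi < 0 || lo < 0)
			return -1;
		out[i] = (uint8_t)(hi << 4 | lo);
		s += 2;
		if (i + 1 < len) {
			if (*s != ':')
				return -1;
			s++;
		}
	}
	return *s == '\0' ? 0 : -1;
}

/* Parse one "(-|+)ifname emac imac" line; 0 on success, -1 otherwise. */
static int parse_neigh_cmd(const char *line, struct neigh_cmd *cmd)
{
	char name[IFNAME_MAX + 1];	/* op character plus interface name */
	char emac_str[EMAC_LEN * 3];
	char imac_str[IMAC_LEN * 3];

	assert(line && cmd);

	/* The field widths keep every token inside its buffer. */
	if (sscanf(line, "%16s %17s %59s", name, emac_str, imac_str) != 3)
		return -1;
	if ((name[0] != '+' && name[0] != '-') || name[1] == '\0')
		return -1;
	if (parse_hw_addr(emac_str, cmd->emac, EMAC_LEN) ||
	    parse_hw_addr(imac_str, cmd->imac, IMAC_LEN))
		return -1;

	cmd->op = name[0];
	strcpy(cmd->ifname, name + 1);	/* at most 15 characters plus NUL */
	return 0;
}
```

Each width is one less than the size of its destination buffer, so an
over-long token is split rather than written past the end of the array.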


* [PATCH net-next 1/9] IB/ipoib: Add support for clones / multiple childs on the same partition
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Or Gerlitz, Erez Shitrit
In-Reply-To: <1341922569-4118-1-git-send-email-ogerlitz@mellanox.com>

Allow creating "clone" child interfaces which further partition an
IPoIB interface into sub-interfaces that use the same pkey as either
their parent or an already created child interface.

Each child now has a child index, which together with the pkey is
used as the identifier of the created network device.

All kinds of children are still created/deleted through sysfs, in a
similar manner to the way legacy child interfaces are.

A major use case for clone children is virtualization, where a
per-VM NIC is desired at the hypervisor level, such as the solution
provided by the newly introduced Ethernet IPoIB driver.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
---
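The accepted input forms ('pkey', 'pkey.child_index', '.child_index') and
the resulting interface names can be tried out in user space. The sketch
below is a minimal approximation under stated assumptions, not the kernel
code itself: parse_child_spec() and child_name() are hypothetical names,
parent_pkey stands in for priv->pkey, and the 0x8000 membership bit that
create_child ORs into the pkey is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Accepts "pkey", "pkey.index" or ".index"; for the last form the pkey is
 * taken from parent_pkey. Returns 0 on success, -1 on invalid input.
 */
static int parse_child_spec(const char *buf, int parent_pkey,
			    int *pkey, int *child_index)
{
	int ret;

	assert(buf && pkey && child_index);

	*pkey = *child_index = -1;

	ret = sscanf(buf, "%i.%i", pkey, child_index);
	if (ret == 1) {			/* just pkey, implicit index 0 */
		*child_index = 0;
	} else if (ret != 2) {		/* ".index": same pkey as the parent */
		*pkey = parent_pkey;
		if (sscanf(buf, ".%i", child_index) != 1 || *child_index == 0)
			return -1;
	}

	if (*child_index < 0 || *child_index > 0xff)
		return -1;
	if (*pkey < 0 || *pkey > 0xffff)
		return -1;
	return 0;
}

/* Build the interface name; -1 on a bad pairing or if out is too small. */
static int child_name(char *out, size_t len, const char *parent,
		      int parent_pkey, int pkey, int child_index)
{
	int n;

	if (pkey != parent_pkey && child_index == 0)	/* legacy child */
		n = snprintf(out, len, "%s.%04x", parent, (unsigned int)pkey);
	else if (pkey != parent_pkey)			/* non-legacy child */
		n = snprintf(out, len, "%s.%04x.%d", parent,
			     (unsigned int)pkey, child_index);
	else if (child_index != 0)			/* same-pkey child */
		n = snprintf(out, len, "%s.%d", parent, child_index);
	else
		return -1;

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a parent pkey of 0xffff, the inputs "0x8001", "0x8001.1" and ".1"
map to the names ib0.8001, ib0.8001.1 and ib0.1 respectively.
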
 Documentation/infiniband/ipoib.txt         |   24 ++++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib.h       |    7 +++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |   48 +++++++++++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |    3 +-
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c  |   46 ++++++++++++++++++--------
 5 files changed, 99 insertions(+), 29 deletions(-)

diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt
index 64eeb55..b3a704d 100644
--- a/Documentation/infiniband/ipoib.txt
+++ b/Documentation/infiniband/ipoib.txt
@@ -24,6 +24,30 @@ Partitions and P_Keys
   The P_Key for any interface is given by the "pkey" file, and the
   main interface for a subinterface is in "parent."
 
+Clones
++
+  It's possible to further partition an IPoIB interface and create
+  "clone" child interfaces which use the same pkey as either their
+  parent or an already created child interface. Each child now has
+  a child index, which together with the pkey is used as the identifier
+  of the created network device.
+
+  All kinds of children are still created/deleted through sysfs, in a
+  similar manner to the way legacy child interfaces are, for example:
+
+    echo 0x8001.1 > /sys/class/net/ib0/create_child
+
+  will create an interface named ib0.8001.1 with P_Key 0x8001 and index 1
+
+    echo .1 > /sys/class/net/ib0/create_child
+
+  will create an interface named ib0.1 with same P_Key as ib0 and index 1
+
+  To remove a subinterface, use the "delete_child" file:
+
+    echo 0x8001.1 > /sys/class/net/ib0/delete_child
+    echo .1 > /sys/class/net/ib0/delete_child
+
 Datagram vs Connected modes
 
   The IPoIB driver supports two modes of operation: datagram and
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 86df632..a57db27 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -332,6 +332,7 @@ struct ipoib_dev_priv {
 	struct net_device *parent;
 	struct list_head child_intfs;
 	struct list_head list;
+	int child_index;
 
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 	struct ipoib_cm_dev_priv cm;
@@ -490,8 +491,10 @@ void ipoib_transport_dev_cleanup(struct net_device *dev);
 void ipoib_event(struct ib_event_handler *handler,
 		 struct ib_event *record);
 
-int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
-int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
+int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey,
+						unsigned char child_index);
+int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey,
+						unsigned char child_index);
 
 void ipoib_pkey_poll(struct work_struct *work);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index bbee4b2..704d068 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1095,17 +1095,44 @@ int ipoib_add_umcast_attr(struct net_device *dev)
 	return device_create_file(&dev->dev, &dev_attr_umcast);
 }
 
+int parse_child(struct device *dev, const char *buf, int *pkey,
+		int *child_index)
+{
+	int ret;
+	struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev));
+
+	*pkey = *child_index = -1;
+
+	/* 'pkey' or 'pkey.child_index' or '.child_index' are allowed */
+	ret = sscanf(buf, "%i.%i", pkey, child_index);
+	if (ret == 1)  /* just pkey, implicit child index is 0 */
+		*child_index = 0;
+	else  if (ret != 2) { /* pkey same as parent, specified child index */
+		*pkey = priv->pkey;
+		ret  = sscanf(buf, ".%i", child_index);
+		if (ret != 1 || *child_index == 0)
+			return -EINVAL;
+	}
+
+	if (*child_index < 0 || *child_index > 0xff)
+		return -EINVAL;
+
+	if (*pkey < 0 || *pkey > 0xffff)
+		return -EINVAL;
+
+	ipoib_dbg(priv, "parse_child inp %s out pkey %04x index %d\n",
+		buf, *pkey, *child_index);
+	return 0;
+}
+
 static ssize_t create_child(struct device *dev,
 			    struct device_attribute *attr,
 			    const char *buf, size_t count)
 {
-	int pkey;
+	int pkey, child_index;
 	int ret;
 
-	if (sscanf(buf, "%i", &pkey) != 1)
-		return -EINVAL;
-
-	if (pkey < 0 || pkey > 0xffff)
+	if (parse_child(dev, buf, &pkey, &child_index))
 		return -EINVAL;
 
 	/*
@@ -1114,7 +1141,7 @@ static ssize_t create_child(struct device *dev,
 	 */
 	pkey |= 0x8000;
 
-	ret = ipoib_vlan_add(to_net_dev(dev), pkey);
+	ret = ipoib_vlan_add(to_net_dev(dev), pkey, child_index);
 
 	return ret ? ret : count;
 }
@@ -1124,16 +1151,13 @@ static ssize_t delete_child(struct device *dev,
 			    struct device_attribute *attr,
 			    const char *buf, size_t count)
 {
-	int pkey;
+	int pkey, child_index;
 	int ret;
 
-	if (sscanf(buf, "%i", &pkey) != 1)
-		return -EINVAL;
-
-	if (pkey < 0 || pkey > 0xffff)
+	if (parse_child(dev, buf, &pkey, &child_index))
 		return -EINVAL;
 
-	ret = ipoib_vlan_delete(to_net_dev(dev), pkey);
+	ret = ipoib_vlan_delete(to_net_dev(dev), pkey, child_index);
 
 	return ret ? ret : count;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 049a997..2131772 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -167,7 +167,8 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
 	}
 
-	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
+	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size,
+				     priv->child_index % priv->ca->num_comp_vectors);
 	if (IS_ERR(priv->recv_cq)) {
 		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
 		goto out_free_mr;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index d7e9740..2d35cb4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -49,7 +49,8 @@ static ssize_t show_parent(struct device *d, struct device_attribute *attr,
 }
 static DEVICE_ATTR(parent, S_IRUGO, show_parent, NULL);
 
-int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
+int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey,
+		unsigned char child_index)
 {
 	struct ipoib_dev_priv *ppriv, *priv;
 	char intf_name[IFNAMSIZ];
@@ -65,25 +66,40 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 	mutex_lock(&ppriv->vlan_mutex);
 
 	/*
-	 * First ensure this isn't a duplicate. We check the parent device and
-	 * then all of the child interfaces to make sure the Pkey doesn't match.
+	 * First ensure this isn't a duplicate. We check all of the child
+	 * interfaces to make sure the Pkey AND the child index
+	 * don't match.
 	 */
-	if (ppriv->pkey == pkey) {
-		result = -ENOTUNIQ;
-		priv = NULL;
-		goto err;
-	}
-
 	list_for_each_entry(priv, &ppriv->child_intfs, list) {
-		if (priv->pkey == pkey) {
+		if (priv->pkey == pkey && priv->child_index == child_index) {
 			result = -ENOTUNIQ;
 			priv = NULL;
 			goto err;
 		}
 	}
 
-	snprintf(intf_name, sizeof intf_name, "%s.%04x",
-		 ppriv->dev->name, pkey);
+	/*
+	 * For non-legacy and same-pkey children we wanted to use the
+	 * notation ibN.pkey:index and ibN:index, but this is problematic
+	 * with tools like ifconfig, which treat devices with ":" in their
+	 * names as aliases that are restricted, e.g. w.r.t. counters.
+	 */
+	if (ppriv->pkey != pkey && child_index == 0) /* legacy child */
+		snprintf(intf_name, sizeof intf_name, "%s.%04x",
+			 ppriv->dev->name, pkey);
+	else if (ppriv->pkey != pkey && child_index != 0) /* non-legacy child */
+		snprintf(intf_name, sizeof intf_name, "%s.%04x.%d",
+			 ppriv->dev->name, pkey, child_index);
+	else if (ppriv->pkey == pkey && child_index != 0) /* same pkey child */
+		snprintf(intf_name, sizeof intf_name, "%s.%d",
+			 ppriv->dev->name, child_index);
+	else  {
+		ipoib_warn(ppriv, "wrong pkey/child_index pairing %04x %d\n",
+			   pkey, child_index);
+		result = -EINVAL;
+		goto err;
+	}
+
 	priv = ipoib_intf_alloc(intf_name);
 	if (!priv) {
 		result = -ENOMEM;
@@ -101,6 +117,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 		goto err;
 
 	priv->pkey = pkey;
+	priv->child_index = child_index;
 
 	memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN);
 	priv->dev->broadcast[8] = pkey >> 8;
@@ -157,7 +174,8 @@ err:
 	return result;
 }
 
-int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey)
+int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey,
+		unsigned char child_index)
 {
 	struct ipoib_dev_priv *ppriv, *priv, *tpriv;
 	struct net_device *dev = NULL;
@@ -171,7 +189,7 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey)
 		return restart_syscall();
 	mutex_lock(&ppriv->vlan_mutex);
 	list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) {
-		if (priv->pkey == pkey) {
+		if (priv->pkey == pkey && priv->child_index == child_index) {
 			unregister_netdevice(priv->dev);
 			ipoib_dev_cleanup(priv->dev);
 			list_del(&priv->list);
-- 
1.7.1


* [PATCH net-next 5/9] net/eipoib: Add ethtool file support
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz
In-Reply-To: <1341922569-4118-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

Via ethtool, the driver reports its version, its ABI version, the PIF
interface it runs on, and various statistics.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
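The statistics names live in a fixed-width string table whose entry count
is derived with sizeof rather than maintained by hand. A minimal user-space
sketch of that technique follows; GSTRING_LEN stands in for
ETH_GSTRING_LEN, and stat_names, PUB_LEN, PRIV_LEN, STATS_LEN and
copy_stat_names() are hypothetical names chosen for this example, with a
much shorter table than the driver's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GSTRING_LEN 32	/* stand-in for ETH_GSTRING_LEN */

static const char stat_names[][GSTRING_LEN] = {
	/* "public" counters */
	"rx_packets", "tx_packets",
#define PUB_LEN		2

	/* "private" counters */
	"tx_vif_miss", "rx_vif_miss", "tx_neigh_miss",
#define PRIV_LEN	3
};

/* Every row is GSTRING_LEN bytes, so this is the number of rows. */
#define STATS_LEN (sizeof(stat_names) / GSTRING_LEN)

/* Catch a table that was extended without updating the section sizes. */
_Static_assert(PUB_LEN + PRIV_LEN == STATS_LEN,
	       "section lengths must add up to the table size");

/*
 * Copy all names into a flat buffer of STATS_LEN * GSTRING_LEN bytes, the
 * layout a get_strings() style callback fills in. Returns the entry count.
 */
static size_t copy_stat_names(char *data)
{
	size_t i;

	assert(data);

	for (i = 0; i < STATS_LEN; i++)
		memcpy(data + i * GSTRING_LEN, stat_names[i], GSTRING_LEN);

	return i;
}
```

Copying whole GSTRING_LEN rows keeps every slot NUL-padded, and the
compile-time check ties the per-section lengths to the table size.
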
 drivers/net/eipoib/eth_ipoib_ethtool.c |  147 ++++++++++++++++++++++++++++++++
 1 files changed, 147 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib_ethtool.c

diff --git a/drivers/net/eipoib/eth_ipoib_ethtool.c b/drivers/net/eipoib/eth_ipoib_ethtool.c
new file mode 100644
index 0000000..b5c20ec
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib_ethtool.c
@@ -0,0 +1,147 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "eth_ipoib.h"
+
+static void parent_ethtool_get_drvinfo(struct net_device *parent_dev,
+				       struct ethtool_drvinfo *drvinfo)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+
+	if (strlen(DRV_NAME) + strlen(parent->ipoib_main_interface) + 1 > 31)
+		strncpy(drvinfo->driver, "driver name is too long", 32);
+	else
+		sprintf(drvinfo->driver, "%s:%s",
+			DRV_NAME, parent->ipoib_main_interface);
+
+	strncpy(drvinfo->version, DRV_VERSION, 32);
+
+	/* indicates ABI version */
+	snprintf(drvinfo->fw_version, 32, "%d", EIPOIB_ABI_VER);
+}
+
+static const char parent_strings[][ETH_GSTRING_LEN] = {
+	/* public statistics */
+	"rx_packets", "tx_packets", "rx_bytes",
+	"tx_bytes", "rx_errors", "tx_errors",
+	"rx_dropped", "tx_dropped", "multicast",
+	"collisions", "rx_length_errors", "rx_over_errors",
+	"rx_crc_errors", "rx_frame_errors", "rx_fifo_errors",
+	"rx_missed_errors", "tx_aborted_errors", "tx_carrier_errors",
+	"tx_fifo_errors", "tx_heartbeat_errors", "tx_window_errors",
+#define PUB_STATS_LEN	21
+
+	/* private statistics */
+	"tx_parent_dropped",
+	"tx_vif_miss",
+	"tx_neigh_miss",
+	"tx_vlan",
+	"tx_shared",
+	"tx_proto_errors",
+	"tx_skb_errors",
+	"tx_slave_err",
+
+	"rx_parent_dropped",
+	"rx_vif_miss",
+	"rx_neigh_miss",
+	"rx_vlan",
+	"rx_shared",
+	"rx_proto_errors",
+	"rx_skb_errors",
+	"rx_slave_err",
+#define PORT_STATS_LEN	(8 * 2)
+
+};
+
+#define PARENT_STATS_LEN (sizeof(parent_strings) / ETH_GSTRING_LEN)
+
+static void parent_get_strings(struct net_device *parent_dev,
+			       uint32_t stringset, uint8_t *data)
+{
+	int index = 0, stats_off = 0, i;
+
+	if (stringset != ETH_SS_STATS)
+		return;
+
+	/* Add main counters */
+	for (i = 0; i < PUB_STATS_LEN; i++)
+		strcpy(data + (index++) * ETH_GSTRING_LEN,
+		       parent_strings[i + stats_off]);
+	stats_off += PUB_STATS_LEN;
+
+	for (i = 0; i < PORT_STATS_LEN; i++)
+		strcpy(data + (index++) * ETH_GSTRING_LEN,
+		       parent_strings[i + stats_off]);
+	stats_off += PORT_STATS_LEN;
+
+}
+
+static void parent_get_ethtool_stats(struct net_device *parent_dev,
+				     struct ethtool_stats *stats,
+				     uint64_t *data)
+{
+	struct parent *parent = netdev_priv(parent_dev);
+	int index = 0, i;
+
+	read_lock(&parent->lock);
+
+	for (i = 0; i < PUB_STATS_LEN; i++)
+		data[index++] = ((unsigned long *) &parent->stats)[i];
+
+	for (i = 0; i < PORT_STATS_LEN; i++)
+		data[index++] = ((unsigned long *) &parent->port_stats)[i];
+
+	read_unlock(&parent->lock);
+}
+
+static int parent_get_sset_count(struct net_device *parent_dev, int sset)
+{
+	switch (sset) {
+	case ETH_SS_STATS:
+		return PARENT_STATS_LEN;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static const struct ethtool_ops parent_ethtool_ops = {
+	.get_drvinfo		= parent_ethtool_get_drvinfo,
+	.get_strings		= parent_get_strings,
+	.get_ethtool_stats	= parent_get_ethtool_stats,
+	.get_sset_count		= parent_get_sset_count,
+	.get_link		= ethtool_op_get_link,
+};
+
+void parent_set_ethtool_ops(struct net_device *dev)
+{
+	SET_ETHTOOL_OPS(dev, &parent_ethtool_ops);
+}
-- 
1.7.1


* [PATCH net-next 8/9] net/eipoib: Add Makefile, Kconfig and MAINTAINERS entries
From: Or Gerlitz @ 2012-07-10 12:16 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, Erez Shitrit, Or Gerlitz
In-Reply-To: <1341922569-4118-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

Add a Kconfig entry under drivers/net and a MAINTAINERS entry for eIPoIB;
also add the driver Makefile.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 MAINTAINERS                 |    6 ++++++
 drivers/net/Kconfig         |   15 +++++++++++++++
 drivers/net/Makefile        |    1 +
 drivers/net/eipoib/Makefile |    4 ++++
 4 files changed, 26 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/Makefile

diff --git a/MAINTAINERS b/MAINTAINERS
index 8da1373..582f8de 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2618,6 +2618,12 @@ L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/ibm/ehea/
 
+EIPoIB (Ethernet services over IPoIB) DRIVER
+M:	Erez Shitrit <erezsh@mellanox.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	drivers/net/eipoib/
+
 EMBEDDED LINUX
 M:	Paul Gortmaker <paul.gortmaker@windriver.com>
 M:	Matt Mackall <mpm@selenic.com>
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c2bd80..09c0352 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -68,6 +68,21 @@ config DUMMY
 	  To compile this driver as a module, choose M here: the module
 	  will be called dummy.
 
+config E_IPOIB
+	tristate "Ethernet Services over IPoIB"
+	depends on INFINIBAND_IPOIB
+	---help---
+	  This driver supports Ethernet protocol over InfiniBand IPoIB devices.
+	  Some services can run only on top of Ethernet L2 interfaces, and
+	  cannot be bound to an IPoIB interface. With this new driver, these services
+	  can run seamlessly.
+
+	  The main use case of the driver is the Ethernet Virtual Switching used in
+	  virtualized environments, where an eipoib netdevice can be used as a
+	  Physical Interface (PIF) in the hypervisor domain, and allow other guests'
+	  Virtual Interfaces (VIF) connected to the same Virtual Switch to run over
+	  the InfiniBand fabric.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 3d375ca..2c3409e 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_CAIF) += caif/
 obj-$(CONFIG_CAN) += can/
 obj-$(CONFIG_ETRAX_ETHERNET) += cris/
 obj-$(CONFIG_NET_DSA) += dsa/
+obj-$(CONFIG_E_IPOIB) += eipoib/
 obj-$(CONFIG_ETHERNET) += ethernet/
 obj-$(CONFIG_FDDI) += fddi/
 obj-$(CONFIG_HIPPI) += hippi/
diff --git a/drivers/net/eipoib/Makefile b/drivers/net/eipoib/Makefile
new file mode 100644
index 0000000..b64e96e
--- /dev/null
+++ b/drivers/net/eipoib/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_E_IPOIB)                         := eth_ipoib.o
+eth_ipoib-y                                    := eth_ipoib_main.o \
+                                                  eth_ipoib_sysfs.o \
+                                                  eth_ipoib_ethtool.o
-- 
1.7.1


