Netdev List
* [PATCH net-next v3 4/4] bpf: add __must_check attributes to refcount manipulating helpers
From: Daniel Borkmann @ 2016-11-19  0:45 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, bblanco, zhiyisun, ranas, saeedm, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1479514784.git.daniel@iogearbox.net>

Helpers like bpf_prog_add(), bpf_prog_inc(), bpf_map_inc() can fail
with an error, so make sure the caller properly checks their return
value and not just ignores it, which could worst-case lead to use
after free.
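
For readers outside the kernel tree, here is a minimal user-space sketch
of what the annotation buys us. It assumes GCC or Clang, where the
kernel's __must_check boils down to the warn_unused_result attribute.
All demo_* names and the small DEMO_MAX_REFS limit are made up for
illustration, and the stand-in helper returns NULL on failure rather
than an ERR_PTR() value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: same attribute the kernel's __must_check expands to. */
#define demo_must_check __attribute__((warn_unused_result))

#define DEMO_MAX_REFS 4	/* hypothetical small limit, for illustration */

struct demo_prog {
	int refcnt;
};

/* Stand-in for a refcount helper that can fail: returns NULL and leaves
 * the count untouched if the limit would be exceeded.
 */
static demo_must_check struct demo_prog *demo_prog_add(struct demo_prog *prog,
							int i)
{
	if (prog->refcnt + i > DEMO_MAX_REFS)
		return NULL;
	prog->refcnt += i;
	return prog;
}

/* A caller that honours the return value. Writing demo_prog_add(prog, n);
 * as a bare statement instead would trigger -Wunused-result.
 */
static int demo_attach(struct demo_prog *prog, int nr_users)
{
	if (!demo_prog_add(prog, nr_users))
		return -EBUSY;
	return 0;
}
```

The point of the attribute is only to move the mistake from runtime to
compile time; it does not change what the helper does.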

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 01c1487..69d0a7f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -233,14 +233,14 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
-struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
+struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
-struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
+struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
 void bpf_prog_put(struct bpf_prog *prog);
 
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
-struct bpf_map *bpf_map_inc(struct bpf_map *map, bool uref);
+struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
 int bpf_map_precharge_memlock(u32 pages);
@@ -299,7 +299,8 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 {
 	return ERR_PTR(-EOPNOTSUPP);
 }
-static inline struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
+static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog,
+							  int i)
 {
 	return ERR_PTR(-EOPNOTSUPP);
 }
@@ -311,7 +312,8 @@ static inline void bpf_prog_sub(struct bpf_prog *prog, int i)
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
-static inline struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+
+static inline struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog)
 {
 	return ERR_PTR(-EOPNOTSUPP);
 }
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next v3 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
From: Daniel Borkmann @ 2016-11-19  0:45 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, bblanco, zhiyisun, ranas, saeedm, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1479514784.git.daniel@iogearbox.net>

There are multiple issues in mlx5e_xdp_set():

1) The batched bpf_prog_add() is currently not checked for errors. When
   doing so, it should be done at an earlier point in time to make sure
   that we cannot fail anymore at the time we want to set the program for
   each channel. The batched refs short-cut can only be performed when we
   don't need to perform a reset for changing the rq type and the device
   was in opened state. In case the device was not in opened state, then
   the next mlx5e_open_locked() will acquire the refs from the control prog
   via mlx5e_create_rq(), same when we need to perform a reset.

2) When swapping the priv->xdp_prog, then no extra reference count must be
   taken since we got that from call path via dev_change_xdp_fd() already.
   Otherwise, we'd never be able to release the program. Also, bpf_prog_add()
   without checking the return code could fail.
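
The ordering described in the two points above can be sketched in plain
user-space C roughly as follows. This is not the driver code: the
demo_* names, the fixed channel count and the small refcount limit are
assumptions for illustration, and locking and the reset path are left
out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_REFS     8	/* hypothetical limit */
#define DEMO_NUM_CHANNELS 4	/* hypothetical, invariant channel count */

struct demo_prog {
	int refcnt;
};

struct demo_priv {
	struct demo_prog *xdp_prog;			/* control reference */
	struct demo_prog *rq_prog[DEMO_NUM_CHANNELS];	/* one per channel */
};

static struct demo_prog *demo_prog_add(struct demo_prog *prog, int i)
{
	if (prog->refcnt + i > DEMO_MAX_REFS)
		return NULL;
	prog->refcnt += i;
	return prog;
}

static void demo_prog_put(struct demo_prog *prog)
{
	prog->refcnt--;
}

/* The caller already holds one reference on prog; it is handed over to
 * priv on success and stays with the caller on failure.
 */
static int demo_xdp_set(struct demo_priv *priv, struct demo_prog *prog)
{
	struct demo_prog *old;
	int i;

	/* Step 1: the only thing that can fail happens before any state
	 * is changed.
	 */
	if (prog && !demo_prog_add(prog, DEMO_NUM_CHANNELS))
		return -EBUSY;

	/* Step 2: swap the control pointer without taking an extra ref. */
	old = priv->xdp_prog;
	priv->xdp_prog = prog;
	if (old)
		demo_prog_put(old);

	/* Step 3: hand the batched references over to the channels. */
	for (i = 0; i < DEMO_NUM_CHANNELS; i++) {
		old = priv->rq_prog[i];
		priv->rq_prog[i] = prog;
		if (old)
			demo_prog_put(old);
	}
	return 0;
}
```

With this ordering a failed batched add leaves both the old program and
the counts exactly as they were, which is the property the patch is
after.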

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 54bae79..491cff9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3144,11 +3144,21 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 
 	if (was_opened && reset)
 		mlx5e_close_locked(netdev);
+	if (was_opened && !reset) {
+		/* num_channels is invariant here, so we can take the
+		 * batched reference right upfront.
+		 */
+		prog = bpf_prog_add(prog, priv->params.num_channels);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto unlock;
+		}
+	}
 
-	/* exchange programs */
+	/* exchange programs, extra prog reference we got from caller
+	 * as long as we don't fail from this point onwards.
+	 */
 	old_prog = xchg(&priv->xdp_prog, prog);
-	if (prog)
-		bpf_prog_add(prog, 1);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
@@ -3164,7 +3174,6 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	/* exchanging programs w/o reset, we update ref counts on behalf
 	 * of the channels RQs here.
 	 */
-	bpf_prog_add(prog, priv->params.num_channels);
 	for (i = 0; i < priv->params.num_channels; i++) {
 		struct mlx5e_channel *c = priv->channel[i];
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next v3 0/4] Couple of BPF refcount fixes for mlx5
From: Daniel Borkmann @ 2016-11-19  0:44 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, bblanco, zhiyisun, ranas, saeedm, netdev,
	Daniel Borkmann

Various mlx5 bugs on eBPF refcount handling found during review.
Last patch in series adds a __must_check to BPF helpers to make
sure we won't run into it again w/o compiler complaining first.

v2 -> v3:

 - Just reworked patch 2/4 so we don't need bpf_prog_sub().
 - Rebased, rest as is.

v1 -> v2:

 - After discussion with Alexei, we agreed upon rebasing the
   patches against net-next.
 - Since this now targets net-next, I've also added the __must_check to
   force future users to check for errors.
 - Fixed up commit message #2.
 - Simplify assignment from patch #1 based on Saeed's feedback
   on previous set.

Thanks a lot!

Daniel Borkmann (4):
  bpf, mlx5: fix mlx5e_create_rq taking reference on prog
  bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
  bpf, mlx5: drop priv->xdp_prog reference on netdev cleanup
  bpf: add __must_check attributes to refcount manipulating helpers

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 33 +++++++++++++++++------
 include/linux/bpf.h                               | 12 +++++----
 kernel/bpf/syscall.c                              |  1 +
 3 files changed, 33 insertions(+), 13 deletions(-)

-- 
1.9.3

^ permalink raw reply

* [PATCH net-next v3 3/4] bpf, mlx5: drop priv->xdp_prog reference on netdev cleanup
From: Daniel Borkmann @ 2016-11-19  0:45 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, bblanco, zhiyisun, ranas, saeedm, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1479514784.git.daniel@iogearbox.net>

mlx5e_xdp_set() is currently the only place where we drop reference on the
prog sitting in priv->xdp_prog when it's exchanged by a new one. We also
need to make sure that we eventually release that reference, for example,
in case the netdev is dismantled, otherwise we leak the program.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 491cff9..6957608 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3705,6 +3705,9 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 
 	if (MLX5_CAP_GEN(mdev, vport_group_manager))
 		mlx5_eswitch_unregister_vport_rep(esw, 0);
+
+	if (priv->xdp_prog)
+		bpf_prog_put(priv->xdp_prog);
 }
 
 static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next v3 1/4] bpf, mlx5: fix mlx5e_create_rq taking reference on prog
From: Daniel Borkmann @ 2016-11-19  0:45 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, bblanco, zhiyisun, ranas, saeedm, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1479514784.git.daniel@iogearbox.net>

In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but
without checking the return value. bpf_prog_add() can fail since 92117d8443bc
("bpf: fix refcnt overflow"), so we really must check it. Take the reference
right when we assign it to the rq from priv->xdp_prog, and just drop the
reference on error path. Destruction in mlx5e_destroy_rq() looks good, though.
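
As a rough user-space illustration of the pattern (take the reference
at assignment time, check for an error pointer, drop the reference on
the error path), something like the following could be used. The
demo_err_ptr()/demo_is_err()/demo_ptr_err() helpers only imitate the
kernel's ERR_PTR() family, and every demo_* name, the fail_late switch
and the tiny refcount limit are assumptions made up for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095
#define DEMO_MAX_REFS  2	/* hypothetical limit, small to force failure */

/* User-space imitation of ERR_PTR(), PTR_ERR() and IS_ERR(). */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_prog {
	int refcnt;
};

struct demo_rq {
	struct demo_prog *xdp_prog;
	int wq_ready;
};

static struct demo_prog *demo_prog_inc(struct demo_prog *prog)
{
	if (prog->refcnt + 1 > DEMO_MAX_REFS)
		return demo_err_ptr(-EBUSY);
	prog->refcnt++;
	return prog;
}

static void demo_prog_put(struct demo_prog *prog)
{
	prog->refcnt--;
}

/* fail_late simulates a later setup step failing after the ref is taken. */
static int demo_create_rq(struct demo_rq *rq, struct demo_prog *ctrl_prog,
			  int fail_late)
{
	int err;

	rq->wq_ready = 1;	/* stands in for the work queue creation */

	rq->xdp_prog = ctrl_prog ? demo_prog_inc(ctrl_prog) : NULL;
	if (demo_is_err(rq->xdp_prog)) {
		err = (int)demo_ptr_err(rq->xdp_prog);
		rq->xdp_prog = NULL;
		goto err_destroy;
	}

	if (fail_late) {
		err = -ENOMEM;
		goto err_destroy;
	}
	return 0;

err_destroy:
	if (rq->xdp_prog)
		demo_prog_put(rq->xdp_prog);
	rq->xdp_prog = NULL;	/* sketch only: avoid a stale pointer */
	rq->wq_ready = 0;	/* stands in for the work queue teardown */
	return err;
}
```

Resetting the pointer to NULL before jumping to the error label is what
keeps the shared error path from calling put on an error value.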

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 13 +++++++++----
 kernel/bpf/syscall.c                              |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index bd0732d..54bae79 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -513,7 +513,13 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->channel = c;
 	rq->ix      = c->ix;
 	rq->priv    = c->priv;
-	rq->xdp_prog = priv->xdp_prog;
+
+	rq->xdp_prog = priv->xdp_prog ? bpf_prog_inc(priv->xdp_prog) : NULL;
+	if (IS_ERR(rq->xdp_prog)) {
+		err = PTR_ERR(rq->xdp_prog);
+		rq->xdp_prog = NULL;
+		goto err_rq_wq_destroy;
+	}
 
 	rq->buff.map_dir = DMA_FROM_DEVICE;
 	if (rq->xdp_prog)
@@ -590,12 +596,11 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
 	rq->page_cache.head = 0;
 	rq->page_cache.tail = 0;
 
-	if (rq->xdp_prog)
-		bpf_prog_add(rq->xdp_prog, 1);
-
 	return 0;
 
 err_rq_wq_destroy:
+	if (rq->xdp_prog)
+		bpf_prog_put(rq->xdp_prog);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ce1b7de..eb15498 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -696,6 +696,7 @@ struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
 {
 	return bpf_prog_add(prog, 1);
 }
+EXPORT_SYMBOL_GPL(bpf_prog_inc);
 
 static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *type)
 {
-- 
1.9.3

^ permalink raw reply related

* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Eric Dumazet @ 2016-11-19  0:41 UTC (permalink / raw)
  To: Jarno Rajahalme
  Cc: Eric W. Biederman, Paul E. McKenney, Cong Wang, Rolf Neugebauer,
	LKML, Linux Kernel Network Developers, Justin Cormack,
	Ian Campbell, Eric Dumazet
In-Reply-To: <884D43D4-024E-4485-94E6-1E8DFF972483@gmail.com>

On Fri, 2016-11-18 at 16:38 -0800, Jarno Rajahalme wrote:

> This fixes the problem for me, so for whatever it’s worth:
> 
> Tested-by: Jarno Rajahalme <jarno@ovn.org>
> 

Thanks for testing !

https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=e88a2766143a27bfe6704b4493b214de4094cf29

^ permalink raw reply

* Re: Long delays creating a netns after deleting one (possibly RCU related)
From: Jarno Rajahalme @ 2016-11-19  0:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric W. Biederman, Paul E. McKenney, Cong Wang, Rolf Neugebauer,
	LKML, Linux Kernel Network Developers, Justin Cormack,
	Ian Campbell, Eric Dumazet
In-Reply-To: <1479164967.8455.87.camel@edumazet-glaptop3.roam.corp.google.com>


> On Nov 14, 2016, at 3:09 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>> 
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>> 
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise to add rcu grace period combining by Eric Dumazet.
>>> 
>>> Eric since people are hitting noticeable stalls because of the rcu grace
>>> period taking a long time do you think you could look at this code path
>>> a bit more?
>>> 
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet <edumazet@google.com>
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>> 
>> Absolutely, I will take a look asap.
> 
> The worst offender should be fixed by the following patch.
> 
> busy poll needs to poll the physical device, not a virtual one...
> 
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
> 		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
> 
> 		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
> 		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
> 		napi_enable(&cell->napi);
> 	}
> 
> 
> 
> 
> 

This fixes the problem for me, so for whatever it’s worth:

Tested-by: Jarno Rajahalme <jarno@ovn.org>

^ permalink raw reply

* [PATCH net-next v2 5/5] af_packet: Use virtio_net_hdr_from_skb() directly.
From: Jarno Rajahalme @ 2016-11-18 23:40 UTC (permalink / raw)
  To: netdev; +Cc: jarno
In-Reply-To: <1479512442-61601-1-git-send-email-jarno@ovn.org>

Remove static function __packet_rcv_vnet(), which only called
virtio_net_hdr_from_skb() and BUG()ged out if an error code was
returned.  Instead, call virtio_net_hdr_from_skb() from the former
call sites of __packet_rcv_vnet() and actually use the error handling
code that is already there.
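
To make the difference concrete, here is a small user-space sketch of
the "check the helper directly and use the existing error path" shape.
The demo_* names and GSO values are hypothetical and the conversion is
reduced to a single field; it is not the af_packet code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

struct demo_hdr {
	unsigned char gso_type;
};

enum { DEMO_GSO_NONE = 0, DEMO_GSO_TCPV4 = 1 };	/* hypothetical values */

/* Stand-in for a header conversion that clears the header itself and
 * can fail on input it does not understand.
 */
static int demo_hdr_from_pkt(int pkt_gso, struct demo_hdr *hdr)
{
	memset(hdr, 0, sizeof(*hdr));

	switch (pkt_gso) {
	case DEMO_GSO_NONE:
	case DEMO_GSO_TCPV4:
		hdr->gso_type = (unsigned char)pkt_gso;
		return 0;
	default:
		return -EINVAL;
	}
}

/* The caller tests the result in place and reports the error to its own
 * caller instead of aborting.
 */
static int demo_rcv(int pkt_gso, struct demo_hdr *out)
{
	if (demo_hdr_from_pkt(pkt_gso, out))
		return -EINVAL;
	return 0;
}
```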

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 net/packet/af_packet.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1816b77..fab9bbf 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1967,15 +1967,6 @@ static unsigned int run_filter(struct sk_buff *skb,
 	return res;
 }
 
-static int __packet_rcv_vnet(const struct sk_buff *skb,
-			     struct virtio_net_hdr *vnet_hdr)
-{
-	if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le()))
-		BUG();
-
-	return 0;
-}
-
 static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
 			   size_t *len)
 {
@@ -1985,7 +1976,7 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
 		return -EINVAL;
 	*len -= sizeof(vnet_hdr);
 
-	if (__packet_rcv_vnet(skb, &vnet_hdr))
+	if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le()))
 		return -EINVAL;
 
 	return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
@@ -2244,8 +2235,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	spin_unlock(&sk->sk_receive_queue.lock);
 
 	if (po->has_vnet_hdr) {
-		if (__packet_rcv_vnet(skb, h.raw + macoff -
-					   sizeof(struct virtio_net_hdr))) {
+		if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
+					    sizeof(struct virtio_net_hdr),
+					    vio_le())) {
 			spin_lock(&sk->sk_receive_queue.lock);
 			goto drop_n_account;
 		}
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v2 4/5] af_packet: Use virtio_net_hdr_to_skb().
From: Jarno Rajahalme @ 2016-11-18 23:40 UTC (permalink / raw)
  To: netdev; +Cc: jarno
In-Reply-To: <1479512442-61601-1-git-send-email-jarno@ovn.org>

Use the common virtio_net_hdr_to_skb() instead of open coding it.
Other call sites were changed by commit fd2a0437dc, but this one was
missed, maybe because it is split in two parts of the source code.

Interim comparisons of 'vnet_hdr->gso_type' still work as both the
vnet_hdr and skb notions of gso_type are zero when there is no gso.

Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 net/packet/af_packet.c | 51 +++-----------------------------------------------
 1 file changed, 3 insertions(+), 48 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index abe6c0b..1816b77 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2388,8 +2388,6 @@ static void tpacket_set_protocol(const struct net_device *dev,
 
 static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
 {
-	unsigned short gso_type = 0;
-
 	if ((vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
 	    (__virtio16_to_cpu(vio_le(), vnet_hdr->csum_start) +
 	     __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset) + 2 >
@@ -2401,29 +2399,6 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
 	if (__virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len) > len)
 		return -EINVAL;
 
-	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
-		case VIRTIO_NET_HDR_GSO_TCPV4:
-			gso_type = SKB_GSO_TCPV4;
-			break;
-		case VIRTIO_NET_HDR_GSO_TCPV6:
-			gso_type = SKB_GSO_TCPV6;
-			break;
-		case VIRTIO_NET_HDR_GSO_UDP:
-			gso_type = SKB_GSO_UDP;
-			break;
-		default:
-			return -EINVAL;
-		}
-
-		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
-			gso_type |= SKB_GSO_TCP_ECN;
-
-		if (vnet_hdr->gso_size == 0)
-			return -EINVAL;
-	}
-
-	vnet_hdr->gso_type = gso_type;	/* changes type, temporary storage */
 	return 0;
 }
 
@@ -2443,27 +2418,6 @@ static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
 	return __packet_snd_vnet_parse(vnet_hdr, *len);
 }
 
-static int packet_snd_vnet_gso(struct sk_buff *skb,
-			       struct virtio_net_hdr *vnet_hdr)
-{
-	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-		u16 s = __virtio16_to_cpu(vio_le(), vnet_hdr->csum_start);
-		u16 o = __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset);
-
-		if (!skb_partial_csum_set(skb, s, o))
-			return -EINVAL;
-	}
-
-	skb_shinfo(skb)->gso_size =
-		__virtio16_to_cpu(vio_le(), vnet_hdr->gso_size);
-	skb_shinfo(skb)->gso_type = vnet_hdr->gso_type;
-
-	/* Header must be checked, and gso_segs computed. */
-	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
-	skb_shinfo(skb)->gso_segs = 0;
-	return 0;
-}
-
 static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		void *frame, struct net_device *dev, void *data, int tp_len,
 		__be16 proto, unsigned char *addr, int hlen, int copylen,
@@ -2723,7 +2677,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			}
 		}
 
-		if (po->has_vnet_hdr && packet_snd_vnet_gso(skb, vnet_hdr)) {
+		if (po->has_vnet_hdr && virtio_net_hdr_to_skb(skb, vnet_hdr,
+							      vio_le())) {
 			tp_len = -EINVAL;
 			goto tpacket_error;
 		}
@@ -2914,7 +2869,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	packet_pick_tx_queue(dev, skb);
 
 	if (po->has_vnet_hdr) {
-		err = packet_snd_vnet_gso(skb, &vnet_hdr);
+		err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
 		if (err)
 			goto out_free;
 		len += sizeof(vnet_hdr);
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v2 3/5] virtio_net: Do not clear memory for struct virtio_net_hdr twice.
From: Jarno Rajahalme @ 2016-11-18 23:40 UTC (permalink / raw)
  To: netdev; +Cc: jarno
In-Reply-To: <1479512442-61601-1-git-send-email-jarno@ovn.org>

virtio_net_hdr_from_skb() clears the memory for the header, so there
is no point for the callers to do the same.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 drivers/net/tun.c          | 3 +--
 include/linux/virtio_net.h | 2 +-
 net/packet/af_packet.c     | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3b8d8cc..64e694c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1360,8 +1360,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	}
 
 	if (vnet_hdr_sz) {
-		struct virtio_net_hdr gso = { 0 }; /* no info leak */
-		int ret;
+		struct virtio_net_hdr gso;
 
 		if (iov_iter_count(iter) < vnet_hdr_sz)
 			return -EINVAL;
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 74f1e33..6620400 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -58,7 +58,7 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 					  struct virtio_net_hdr *hdr,
 					  bool little_endian)
 {
-	memset(hdr, 0, sizeof(*hdr));
+	memset(hdr, 0, sizeof(*hdr));   /* no info leak */
 
 	if (skb_is_gso(skb)) {
 		struct skb_shared_info *sinfo = skb_shinfo(skb);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d2238b2..abe6c0b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1970,8 +1970,6 @@ static unsigned int run_filter(struct sk_buff *skb,
 static int __packet_rcv_vnet(const struct sk_buff *skb,
 			     struct virtio_net_hdr *vnet_hdr)
 {
-	*vnet_hdr = (const struct virtio_net_hdr) { 0 };
-
 	if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le()))
 		BUG();
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v2 2/5] virtio_net.h: Fix comment.
From: Jarno Rajahalme @ 2016-11-18 23:40 UTC (permalink / raw)
  To: netdev; +Cc: jarno
In-Reply-To: <1479512442-61601-1-git-send-email-jarno@ovn.org>

Fix incorrect comment after the final #endif.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 include/linux/virtio_net.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1c912f8..74f1e33 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 	return 0;
 }
 
-#endif /* _LINUX_VIRTIO_BYTEORDER */
+#endif /* _LINUX_VIRTIO_NET_H */
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next v2 1/5] virtio_net: Simplify call sites for virtio_net_hdr_{from,to}_skb().
From: Jarno Rajahalme @ 2016-11-18 23:40 UTC (permalink / raw)
  To: netdev; +Cc: jarno

No point storing the return value of virtio_net_hdr_to_skb() or
virtio_net_hdr_from_skb() to a variable when the value is used only
once as a boolean in an immediately following if statement.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 drivers/net/macvtap.c | 5 ++---
 drivers/net/tun.c     | 8 +++-----
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 070e329..5da9861 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 		if (iov_iter_count(iter) < vnet_hdr_len)
 			return -EINVAL;
 
-		ret = virtio_net_hdr_from_skb(skb, &vnet_hdr,
-					      macvtap_is_little_endian(q));
-		if (ret)
+		if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
+					    macvtap_is_little_endian(q)))
 			BUG();
 
 		if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1588469..3b8d8cc 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1252,8 +1252,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 		return -EFAULT;
 	}
 
-	err = virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun));
-	if (err) {
+	if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
 		this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
 		kfree_skb(skb);
 		return -EINVAL;
@@ -1367,9 +1366,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 		if (iov_iter_count(iter) < vnet_hdr_sz)
 			return -EINVAL;
 
-		ret = virtio_net_hdr_from_skb(skb, &gso,
-					      tun_is_little_endian(tun));
-		if (ret) {
+		if (virtio_net_hdr_from_skb(skb, &gso,
+					    tun_is_little_endian(tun))) {
 			struct skb_shared_info *sinfo = skb_shinfo(skb);
 			pr_err("unexpected GSO type: "
 			       "0x%x, gso_size %d, hdr_len %d\n",
-- 
2.1.4

^ permalink raw reply related

* Re: [mm PATCH v3 21/23] mm: Add support for releasing multiple instances of a page
From: Andrew Morton @ 2016-11-18 23:27 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: linux-mm, netdev, linux-kernel
In-Reply-To: <20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com>

On Thu, 10 Nov 2016 06:36:06 -0500 Alexander Duyck <alexander.h.duyck@intel.com> wrote:

> This patch adds a function that allows us to batch free a page that has
> multiple references outstanding.  Specifically this function can be used to
> drop a page being used in the page frag alloc cache.  With this drivers can
> make use of functionality similar to the page frag alloc cache without
> having to do any workarounds for the fact that there is no function that
> frees multiple references.
> 
> ...
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -506,6 +506,8 @@ extern void free_hot_cold_page(struct page *page, bool cold);
>  extern void free_hot_cold_page_list(struct list_head *list, bool cold);
>  
>  struct page_frag_cache;
> +extern void __page_frag_drain(struct page *page, unsigned int order,
> +			      unsigned int count);
>  extern void *__alloc_page_frag(struct page_frag_cache *nc,
>  			       unsigned int fragsz, gfp_t gfp_mask);
>  extern void __free_page_frag(void *addr);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fbfead..54fea40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3912,6 +3912,20 @@ static struct page *__page_frag_refill(struct page_frag_cache *nc,
>  	return page;
>  }
>  
> +void __page_frag_drain(struct page *page, unsigned int order,
> +		       unsigned int count)
> +{
> +	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> +
> +	if (page_ref_sub_and_test(page, count)) {
> +		if (order == 0)
> +			free_hot_cold_page(page, false);
> +		else
> +			__free_pages_ok(page, order);
> +	}
> +}
> +EXPORT_SYMBOL(__page_frag_drain);

It's an exported-to-modules library function.  It should be documented,
please?  The page-frag API is only partially documented, but that's no
excuse.

And perhaps documentation will help explain the naming choice.  Why
"drain"?  I'd have expected "put"?

And why the leading underscores.  The page-frag API is pretty weird :(

And inconsistent.  __alloc_page_frag -> page_frag_alloc,
__free_page_frag -> page_frag_free(), etc.  I must have been asleep
when I let that lot through.
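
For what it's worth, the semantics being discussed (drop several
references in one go and free only when the count hits zero) can be
modelled in user space along these lines. This is only a sketch: the
demo_* names are made up, the "free" is a flag, and there is no
atomicity or order handling as in the real function.

```c
#include <assert.h>

struct demo_page {
	unsigned int refcount;
	int freed;		/* set instead of really freeing anything */
};

/* Subtract count references; returns nonzero if that was the last one. */
static int demo_ref_sub_and_test(struct demo_page *page, unsigned int count)
{
	page->refcount -= count;
	return page->refcount == 0;
}

/* Batched put: release count references with a single call. */
static void demo_frag_drain(struct demo_page *page, unsigned int count)
{
	assert(page->refcount != 0);

	if (demo_ref_sub_and_test(page, count))
		page->freed = 1;
}
```

Seen this way the operation is a "put" of count references, which is
presumably why that name comes to mind first.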


^ permalink raw reply

* Re: [PATCH 3/5] virtio_net: Add XDP support
From: Eric Dumazet @ 2016-11-18 23:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
	netdev, bblanco, brouer
In-Reply-To: <20161118190017.16137.13910.stgit@john-Precision-Tower-5810>

On Fri, 2016-11-18 at 11:00 -0800, John Fastabend wrote:
> From: Shrijeet Mukherjee <shrijeet@gmail.com>


>  #include <linux/slab.h>
> @@ -81,6 +82,8 @@ struct receive_queue {
>  
>  	struct napi_struct napi;
>  
> +	struct bpf_prog *xdp_prog;

Please add proper sparse annotation, as in 

	struct bpf_prog __rcu *xdp_prog;

And run sparse ;)

CONFIG_SPARSE_RCU_POINTER=y

make C=2 drivers/net/virtio_net.o

^ permalink raw reply

* Re: [PATCH 3/5] virtio_net: Add XDP support
From: Eric Dumazet @ 2016-11-18 23:21 UTC (permalink / raw)
  To: John Fastabend
  Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
	netdev, bblanco, brouer
In-Reply-To: <20161118190017.16137.13910.stgit@john-Precision-Tower-5810>

On Fri, 2016-11-18 at 11:00 -0800, John Fastabend wrote:


>  static void free_receive_bufs(struct virtnet_info *vi)
>  {
> +	struct bpf_prog *old_prog;
>  	int i;
>  
>  	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		while (vi->rq[i].pages)
>  			__free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
> +
> +		old_prog = rcu_dereference(vi->rq[i].xdp_prog);

Seems wrong to me.

Are you sure lockdep (with CONFIG_PROVE_RCU=y) was happy with this ?

> +		RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
> +		if (old_prog)
> +			bpf_prog_put(old_prog);
>  	}
>  }
>  
> 

^ permalink raw reply

* [PATCH] net: fix bogus cast in skb_pagelen() and use unsigned variables
From: Alexey Dobriyan @ 2016-11-19  1:08 UTC (permalink / raw)
  To: davem; +Cc: netdev

1) cast to "int" is unnecessary:
   u8 will be promoted to int before decrementing,
   small positive numbers fit into "int", so their values won't be changed
   during promotion.

   Once everything is int including loop counters, signedness doesn't
   matter: 32-bit operations will stay 32-bit operations.

   But! Someone tried to make this loop smart by making everything of
   the same type apparently in an attempt to optimise it.
   Do the optimization, just differently.
   Do the cast where it matters. :^)

2) frag size is an unsigned entity and the sum of fragment sizes is
   also unsigned.

Make everything unsigned, leave no MOVSX instruction behind.
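
A user-space sketch of the resulting loop, for anyone who wants to
convince themselves that the empty case still terminates. The struct
and the demo_* names are assumptions for illustration; the behaviour of
the (int) cast on an out-of-range unsigned value is implementation
defined in ISO C, and the sketch relies on the usual GCC/Clang
two's-complement wraparound.

```c
#include <assert.h>
#include <stddef.h>

struct demo_frags {
	unsigned char nr_frags;		/* u8, like nr_frags in the text */
	unsigned int size[17];
};

static unsigned int demo_pagelen(const struct demo_frags *f,
				 unsigned int headlen)
{
	unsigned int i, len = 0;

	/* nr_frags is promoted to int before the subtraction, so 0 - 1
	 * is -1, which is stored as UINT_MAX in i. The (int) cast in the
	 * condition turns that back into -1 and the loop body is skipped.
	 */
	for (i = f->nr_frags - 1; (int)i >= 0; i--)
		len += f->size[i];

	return len + headlen;
}
```

With three fragments of 10, 20 and 30 bytes and a 5 byte head this
returns 65; with no fragments it returns just the head length.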


	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
	function                                     old     new   delta
	skb_cow_data                                 835     834      -1
	ip_do_fragment                              2549    2548      -1
	ip6_fragment                                3130    3128      -2
	Total: Before=154865032, After=154865028, chg -0.00%


Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/linux/skbuff.h |    6 +++---
 net/ipv4/ip_output.c   |    2 +-
 net/ipv6/ip6_output.c  |    2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1799,11 +1799,11 @@ static inline unsigned int skb_headlen(const struct sk_buff *skb)
 	return skb->len - skb->data_len;
 }
 
-static inline int skb_pagelen(const struct sk_buff *skb)
+static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 {
-	int i, len = 0;
+	unsigned int i, len = 0;
 
-	for (i = (int)skb_shinfo(skb)->nr_frags - 1; i >= 0; i--)
+	for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
 		len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
 	return len + skb_headlen(skb);
 }
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -581,7 +581,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	 */
 	if (skb_has_frag_list(skb)) {
 		struct sk_buff *frag, *frag2;
-		int first_len = skb_pagelen(skb);
+		unsigned int first_len = skb_pagelen(skb);
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -625,7 +625,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 
 	hroom = LL_RESERVED_SPACE(rt->dst.dev);
 	if (skb_has_frag_list(skb)) {
-		int first_len = skb_pagelen(skb);
+		unsigned int first_len = skb_pagelen(skb);
 		struct sk_buff *frag2;
 
 		if (first_len - hlen > mtu ||

^ permalink raw reply

* [PATCH] netlink: smaller nla_attr_minlen table
From: Alexey Dobriyan @ 2016-11-19  0:59 UTC (permalink / raw)
  To: davem; +Cc: netdev

Length of a netlink attribute may be u16 but lengths of basic attributes
are much smaller, so small we can save 16 bytes of .rodata and pocket
change inside .text.

16-bit is worse on x86-64 than 8-bit because of operand size override prefix.
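
The .rodata part of the saving is easy to reproduce in user space. The
sketch below assumes 16 table slots, which is what the 32 -> 16 byte
change in the size report implies; the DEMO_* indices and names are
hypothetical and only a few entries are filled in.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TYPE_COUNT 16	/* assumed from the 32 -> 16 byte delta */

enum { DEMO_U8 = 1, DEMO_U16, DEMO_U32, DEMO_U64 };	/* hypothetical */

static const uint16_t demo_minlen_u16[DEMO_TYPE_COUNT] = {
	[DEMO_U8]	= sizeof(uint8_t),
	[DEMO_U16]	= sizeof(uint16_t),
	[DEMO_U32]	= sizeof(uint32_t),
	[DEMO_U64]	= sizeof(uint64_t),
};

static const uint8_t demo_minlen_u8[DEMO_TYPE_COUNT] = {
	[DEMO_U8]	= sizeof(uint8_t),
	[DEMO_U16]	= sizeof(uint16_t),
	[DEMO_U32]	= sizeof(uint32_t),
	[DEMO_U64]	= sizeof(uint64_t),
};

/* Returns 1 if both tables hold the same values in every slot. */
static int demo_tables_agree(void)
{
	int i;

	for (i = 0; i < DEMO_TYPE_COUNT; i++)
		if (demo_minlen_u16[i] != demo_minlen_u8[i])
			return 0;
	return 1;
}
```

Since every minimum length here is far below 256, the narrower element
type loses nothing.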

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-19 (-19)
	function                                     old     new   delta
	validate_nla                                 418     417      -1
	nla_policy_len                                66      64      -2
	nla_attr_minlen                               32      16     -16
	Total: Before=154865051, After=154865032, chg -0.00%


Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 lib/nlattr.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -14,7 +14,7 @@
 #include <linux/types.h>
 #include <net/netlink.h>
 
-static const u16 nla_attr_minlen[NLA_TYPE_MAX+1] = {
+static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
 	[NLA_U8]	= sizeof(u8),
 	[NLA_U16]	= sizeof(u16),
 	[NLA_U32]	= sizeof(u32),


* Re: [net-next] af_packet: Use virtio_net_hdr_to_skb().
From: Jarno Rajahalme @ 2016-11-18 22:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20161118.113527.2164504813195182869.davem@davemloft.net>

Sorry for my transgressions and wasting your time. I’ll send a v2 in a moment.

  Jarno
 
> On Nov 18, 2016, at 8:35 AM, David Miller <davem@davemloft.net> wrote:
> 
> From: Jarno Rajahalme <jarno@ovn.org>
> Date: Wed, 16 Nov 2016 18:06:42 -0800
> 
>> Use the common virtio_net_hdr_to_skb() instead of open coding it.
>> Other call sites were changed by commit fd2a0437dc, but this one was
>> missed, maybe because it is split in two parts of the source code.
>> 
>> Also fix other call sites to be more uniform.
>> 
>> Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> 
> This patch is doing many more things than just this.
> 
> Do not mix unrelated changes together:
> 
>> @@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>> 		if (iov_iter_count(iter) < vnet_hdr_len)
>> 			return -EINVAL;
>> 
>> -		ret = virtio_net_hdr_from_skb(skb, &vnet_hdr,
>> -					      macvtap_is_little_endian(q));
>> -		if (ret)
>> +		if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
>> +					    macvtap_is_little_endian(q)))
>> 			BUG();
>> 
>> 		if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
> 
> This has nothing to do with modifying code to use virtio_net_hdr_to_skb(), it
> doesn't belong in this patch.
> 
>> @@ -1361,15 +1360,12 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>> 	}
>> 
>> 	if (vnet_hdr_sz) {
>> -		struct virtio_net_hdr gso = { 0 }; /* no info leak */
>> -		int ret;
>> -
>> +		struct virtio_net_hdr gso;
> 
> This is _extremely_ opaque.  The initializer is trying to prevent kernel memory
> info leaks onto the network or into user space.
> 
> Maybe this transformation is valid but:
> 
> 1) YOU DON'T EVEN MENTION IT IN YOUR COMMIT MESSAGE.
> 
> 2) It's unrelated to this specific change, therefore it belongs in
>   a separate change.
> 
> 3) You don't explain that it is a valid transformation, nor why.
> 
> It is extremely disappointing to catch unrelated, potentially far
> reaching things embedded in a patch when I review it.
> 
> Please do not ever do this.
> 
>> @@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>> 	return 0;
>> }
>> 
>> -#endif /* _LINUX_VIRTIO_BYTEORDER */
>> +#endif /* _LINUX_VIRTIO_NET_H */
> 
> Another unrelated change.
> 
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 11db0d6..09abb88 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1971,8 +1971,6 @@ static unsigned int run_filter(struct sk_buff *skb,
>> static int __packet_rcv_vnet(const struct sk_buff *skb,
>> 			     struct virtio_net_hdr *vnet_hdr)
>> {
>> -	*vnet_hdr = (const struct virtio_net_hdr) { 0 };
>> -
> 
> There is no way this belongs in this patch, and again you do not explain
> why removing this initializer is valid.


* [PATCH] netlink: use "unsigned int" in nla_next()
From: Alexey Dobriyan @ 2016-11-19  0:54 UTC (permalink / raw)
  To: davem; +Cc: netdev

->nla_len is an unsigned entity (it's a length, after all) and a u16,
thus it can't overflow when aligned and stored in an int/unsigned int.

(nlmsg_next has the same code, but I haven't yet convinced myself
it is correct to do so).

There is pointer arithmetic in this function, and an unsigned offset
generates smaller code:

	add/remove: 0/0 grow/shrink: 1/64 up/down: 5/-309 (-304)
	function                                     old     new   delta
	nl80211_set_wiphy                           1444    1449      +5
	team_nl_cmd_options_set                      997     995      -2
	tcf_em_tree_validate                         872     870      -2
	switchdev_port_bridge_setlink                352     350      -2
	switchdev_port_br_afspec                     312     310      -2
	rtm_to_fib_config                            428     426      -2
	qla4xxx_sysfs_ddb_set_param                 2193    2191      -2
	qla4xxx_iface_set_param                     4470    4468      -2
	ovs_nla_free_flow_actions                    152     150      -2
	output_userspace                             518     516      -2
		...
	nl80211_set_reg                              654     649      -5
	validate_scan_freqs                          148     142      -6
	validate_linkmsg                             288     282      -6
	nl80211_parse_connkeys                       489     483      -6
	nlattr_set                                   231     224      -7
	nf_tables_delsetelem                         267     260      -7
	do_setlink                                  3416    3408      -8
	netlbl_cipsov4_add_std                      1672    1659     -13
	nl80211_parse_sched_scan                    2902    2888     -14
	nl80211_trigger_scan                        1738    1720     -18
	do_execute_actions                          2821    2738     -83
	Total: Before=154865355, After=154865051, chg -0.00%
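
The no-overflow claim can be checked in userspace with a small sketch. It assumes the usual 4-byte netlink attribute alignment; the DEMO_* macros and demo_* helpers are hypothetical names that mirror, but are not, the kernel's NLA_ALIGN() and nla_next().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4-byte attribute alignment; names are illustrative. */
#define DEMO_NLA_ALIGNTO 4u
#define DEMO_NLA_ALIGN(len) \
	(((len) + DEMO_NLA_ALIGNTO - 1) & ~(DEMO_NLA_ALIGNTO - 1))

/* Aligned total length of an attribute whose length field is 16 bits. */
static unsigned int demo_totlen(uint16_t nla_len)
{
	return DEMO_NLA_ALIGN((unsigned int)nla_len);
}

/* Advance past one attribute, shrinking the remaining byte count. */
static const char *demo_next(const char *attr, uint16_t nla_len,
			     int *remaining)
{
	unsigned int totlen = demo_totlen(nla_len);

	*remaining -= totlen;
	return attr + totlen;
}
```

Even the largest possible length field, 0xffff, only rounds up to 65536, which fits comfortably in either int or unsigned int.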

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 include/net/netlink.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -713,7 +713,7 @@ static inline bool nla_ok(const struct nlattr *nla, int remaining)
  */
 static inline struct nlattr *nla_next(const struct nlattr *nla, int *remaining)
 {
-	int totlen = NLA_ALIGN(nla->nla_len);
+	unsigned int totlen = NLA_ALIGN(nla->nla_len);
 
 	*remaining -= totlen;
 	return (struct nlattr *) ((char *) nla + totlen);


* [PATCH] net: make struct napi_alloc_cache::skb_count unsigned int
From: Alexey Dobriyan @ 2016-11-19  0:47 UTC (permalink / raw)
  To: davem; +Cc: netdev

size_t is way too much for an integer not exceeding 64.

Space savings: 10 bytes!

	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-10 (-10)
	function                                     old     new   delta
	napi_consume_skb                             165     163      -2
	__kfree_skb_flush                             56      53      -3
	__kfree_skb_defer                             97      92      -5
	Total: Before=154865639, After=154865629, chg -0.00%
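
Note that all three shrinking symbols above are functions, so the saving is in .text. A userspace sketch with simplified, hypothetical structs (a plain pointer stands in for struct page_frag_cache) suggests why the data layout itself need not change: the narrower counter is followed by padding so that the pointer array stays aligned. The code-size gain presumably comes from cheaper 32-bit instruction encodings, which is an assumption and not something the sketch measures.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache capacity, taken from "not exceeding 64" above. */
#define DEMO_CACHE_SIZE 64

/* Simplified stand-ins for the struct before and after the change. */
struct demo_cache_size_t {
	void *page;
	size_t skb_count;
	void *skb_cache[DEMO_CACHE_SIZE];
};

struct demo_cache_uint {
	void *page;
	unsigned int skb_count;
	void *skb_cache[DEMO_CACHE_SIZE];
};

/* Returns 1 if narrowing the counter leaves size and array offset alone. */
static int demo_layout_unchanged(void)
{
	return sizeof(struct demo_cache_size_t) ==
		       sizeof(struct demo_cache_uint) &&
	       offsetof(struct demo_cache_size_t, skb_cache) ==
		       offsetof(struct demo_cache_uint, skb_cache);
}

/* Bounded push: the counter never exceeds DEMO_CACHE_SIZE. */
static int demo_cache_push(struct demo_cache_uint *c, void *skb)
{
	if (c->skb_count >= DEMO_CACHE_SIZE)
		return -1;
	c->skb_cache[c->skb_count++] = skb;
	return 0;
}
```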

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 net/core/skbuff.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -354,7 +354,7 @@ EXPORT_SYMBOL(build_skb);
 
 struct napi_alloc_cache {
 	struct page_frag_cache page;
-	size_t skb_count;
+	unsigned int skb_count;
 	void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
 


* [RFC 09/10] IB/hfi1: Virtual Network Interface Controller (VNIC) support
From: Vishwanathapura, Niranjana @ 2016-11-18 22:42 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Dennis Dalessandro, Niranjana Vishwanathapura,
	Andrzej Kacprowski
In-Reply-To: <1479508938-63799-1-git-send-email-niranjana.vishwanathapura@intel.com>

HFI1 HW-specific support for VNIC functionality. Add support to create
VNIC devices on the HFI VNIC Bus. Also implement the bus operations to
allocate resources and to transmit and receive Omni-Path encapsulated
Ethernet packets.

Dynamically allocate a set of contexts for VNIC when the first vnic
port is instantiated. Allocate VNIC contexts from the user context pool
and return them to the same pool when they are freed. Set aside
enough MSI-X interrupts for VNIC contexts and assign them when the
contexts are allocated. On the receive side, use an RSM rule to
spread TCP/UDP streams among VNIC contexts.
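
The spreading can be pictured with a small userspace model of the RSM map table: 256 one-byte entries packed eight per 64-bit register, with the VNIC slice filled round-robin from the available contexts. This is only a sketch of the idea; the demo_* names, the context numbers and the slice size used in the usage note are hypothetical, and real hardware access goes through CSR reads and writes rather than a plain array.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAP_REGS        32	/* 64-bit map registers */
#define DEMO_ENTRIES_PER_REG 8	/* one context number per byte */
#define DEMO_MAP_ENTRIES     (DEMO_MAP_REGS * DEMO_ENTRIES_PER_REG)

/*
 * Fill map entries [start, start + nentries) with context numbers taken
 * round-robin from ctxts[]. Returns 0 on success, -1 if the slice does
 * not fit in the table or there are no contexts.
 */
static int demo_fill_vnic_map(uint64_t regs[DEMO_MAP_REGS],
			      unsigned int start, unsigned int nentries,
			      const uint8_t *ctxts, unsigned int nctxts)
{
	unsigned int i;

	if (nctxts == 0 || start + nentries > DEMO_MAP_ENTRIES)
		return -1;

	for (i = 0; i < nentries; i++) {
		unsigned int idx = start + i;
		unsigned int shift = (idx % DEMO_ENTRIES_PER_REG) * 8;
		uint64_t *reg = &regs[idx / DEMO_ENTRIES_PER_REG];

		/* Replace only this entry's byte, keep its neighbours. */
		*reg &= ~(0xffull << shift);
		*reg |= (uint64_t)ctxts[i % nctxts] << shift;
	}
	return 0;
}

/* Read back the context number stored in map entry idx. */
static uint8_t demo_map_lookup(const uint64_t regs[DEMO_MAP_REGS],
			       unsigned int idx)
{
	return (regs[idx / DEMO_ENTRIES_PER_REG] >>
		((idx % DEMO_ENTRIES_PER_REG) * 8)) & 0xff;
}
```

With three contexts 10, 11 and 12 and an 8-entry slice starting at entry 5, entries 5 through 12 would read 10, 11, 12, 10, 11, 12, 10, 11, so a hash-like index extracted from the packet header lands on the contexts in turn.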

Change-Id: I1b275a7585d6c2e3573039a9137014031f1f5c7e
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@intel.com>
---
 drivers/infiniband/hw/hfi1/Kconfig        |   2 +-
 drivers/infiniband/hw/hfi1/Makefile       |   3 +-
 drivers/infiniband/hw/hfi1/aspm.h         |  13 +-
 drivers/infiniband/hw/hfi1/chip.c         | 270 ++++++++++++---
 drivers/infiniband/hw/hfi1/chip.h         |   2 +
 drivers/infiniband/hw/hfi1/debugfs.c      |   6 +-
 drivers/infiniband/hw/hfi1/driver.c       |  78 ++++-
 drivers/infiniband/hw/hfi1/file_ops.c     |  25 +-
 drivers/infiniband/hw/hfi1/hfi.h          |  50 ++-
 drivers/infiniband/hw/hfi1/init.c         |  44 ++-
 drivers/infiniband/hw/hfi1/mad.c          |   8 +-
 drivers/infiniband/hw/hfi1/pio.c          |  17 +
 drivers/infiniband/hw/hfi1/pio.h          |   6 +
 drivers/infiniband/hw/hfi1/sysfs.c        |   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   6 +-
 drivers/infiniband/hw/hfi1/user_pages.c   |   3 +-
 drivers/infiniband/hw/hfi1/vnic.h         | 155 +++++++++
 drivers/infiniband/hw/hfi1/vnic_device.c  | 168 +++++++++
 drivers/infiniband/hw/hfi1/vnic_main.c    | 555 ++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/vnic_sdma.c    |  60 ++++
 include/rdma/opa_port_info.h              |   2 +-
 21 files changed, 1376 insertions(+), 99 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/vnic.h
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_device.c
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_main.c
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_sdma.c

diff --git a/drivers/infiniband/hw/hfi1/Kconfig b/drivers/infiniband/hw/hfi1/Kconfig
index f6ea088..6c07117 100644
--- a/drivers/infiniband/hw/hfi1/Kconfig
+++ b/drivers/infiniband/hw/hfi1/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_HFI1
 	tristate "Intel OPA Gen1 support"
-	depends on X86_64 && INFINIBAND_RDMAVT && I2C
+	depends on X86_64 && INFINIBAND_RDMAVT && I2C && HFI_VNIC_BUS
 	select MMU_NOTIFIER
 	select CRC32
 	select I2C_ALGOBIT
diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 0cf97a0..c579f98 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -6,13 +6,14 @@
 # Called from the kernel module build system.
 #
 obj-$(CONFIG_INFINIBAND_HFI1) += hfi1.o
+ccflags-y += -I$(src)/../../sw/intel/vnic/include
 
 hfi1-y := affinity.o chip.o device.o driver.o efivar.o \
 	eprom.o file_ops.o firmware.o \
 	init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
 	qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \
 	uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
-	verbs_txreq.o
+	verbs_txreq.o vnic_main.o vnic_device.o vnic_sdma.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/hw/hfi1/aspm.h b/drivers/infiniband/hw/hfi1/aspm.h
index 0d58fe3..3a01b69 100644
--- a/drivers/infiniband/hw/hfi1/aspm.h
+++ b/drivers/infiniband/hw/hfi1/aspm.h
@@ -229,14 +229,17 @@ static inline void aspm_ctx_timer_function(unsigned long data)
 	spin_unlock_irqrestore(&rcd->aspm_lock, flags);
 }
 
-/* Disable interrupt processing for verbs contexts when PSM contexts are open */
+/*
+ * Disable interrupt processing for verbs contexts when PSM or VNIC contexts
+ * are open.
+ */
 static inline void aspm_disable_all(struct hfi1_devdata *dd)
 {
 	struct hfi1_ctxtdata *rcd;
 	unsigned long flags;
 	unsigned i;
 
-	for (i = 0; i < dd->first_user_ctxt; i++) {
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
 		rcd = dd->rcd[i];
 		del_timer_sync(&rcd->aspm_timer);
 		spin_lock_irqsave(&rcd->aspm_lock, flags);
@@ -260,7 +263,7 @@ static inline void aspm_enable_all(struct hfi1_devdata *dd)
 	if (aspm_mode != ASPM_MODE_DYNAMIC)
 		return;
 
-	for (i = 0; i < dd->first_user_ctxt; i++) {
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
 		rcd = dd->rcd[i];
 		spin_lock_irqsave(&rcd->aspm_lock, flags);
 		rcd->aspm_intr_enable = true;
@@ -276,7 +279,7 @@ static inline void aspm_ctx_init(struct hfi1_ctxtdata *rcd)
 		    (unsigned long)rcd);
 	rcd->aspm_intr_supported = rcd->dd->aspm_supported &&
 		aspm_mode == ASPM_MODE_DYNAMIC &&
-		rcd->ctxt < rcd->dd->first_user_ctxt;
+		rcd->ctxt < rcd->dd->first_dyn_alloc_ctxt;
 }
 
 static inline void aspm_init(struct hfi1_devdata *dd)
@@ -286,7 +289,7 @@ static inline void aspm_init(struct hfi1_devdata *dd)
 	spin_lock_init(&dd->aspm_lock);
 	dd->aspm_supported = aspm_hw_l1_supported(dd);
 
-	for (i = 0; i < dd->first_user_ctxt; i++)
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; i++)
 		aspm_ctx_init(dd->rcd[i]);
 
 	/* Start with ASPM disabled */
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index f87d805..ceb472b 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -125,9 +125,16 @@ struct flag_table {
 #define DEFAULT_KRCVQS		  2
 #define MIN_KERNEL_KCTXTS         2
 #define FIRST_KERNEL_KCTXT        1
-/* sizes for both the QP and RSM map tables */
-#define NUM_MAP_ENTRIES		256
-#define NUM_MAP_REGS             32
+
+/*
+ * RSM instance allocation
+ *   0 - Verbs
+ *   1 - User Fecn Handling
+ *   2 - Vnic
+ */
+#define RSM_INS_VERBS             0
+#define RSM_INS_FECN              1
+#define RSM_INS_VNIC              2
 
 /* Bit offset into the GUID which carries HFI id information */
 #define GUID_HFI_INDEX_SHIFT     39
@@ -138,8 +145,7 @@ struct flag_table {
 #define is_emulator_p(dd) ((((dd)->irev) & 0xf) == 3)
 #define is_emulator_s(dd) ((((dd)->irev) & 0xf) == 4)
 
-/* RSM fields */
-
+/* RSM fields for Verbs */
 /* packet type */
 #define IB_PACKET_TYPE         2ull
 #define QW_SHIFT               6ull
@@ -169,6 +175,28 @@ struct flag_table {
 /* QPN[m+n:1] QW 1, OFFSET 1 */
 #define QPN_SELECT_OFFSET      ((1ull << QW_SHIFT) | (1ull))
 
+/* RSM fields for Vnic */
+/* L2_TYPE: QW 0, OFFSET 61 - for match */
+#define L2_TYPE_QW             0ull
+#define L2_TYPE_BIT_OFFSET     61ull
+#define L2_TYPE_OFFSET(off)    ((L2_TYPE_QW << QW_SHIFT) | (off))
+#define L2_TYPE_MATCH_OFFSET   L2_TYPE_OFFSET(L2_TYPE_BIT_OFFSET)
+#define L2_TYPE_MASK           3ull
+#define L2_16B_VALUE           2ull
+
+/* L4_TYPE QW 1, OFFSET 0 - for match */
+#define L4_TYPE_QW              1ull
+#define L4_TYPE_BIT_OFFSET      0ull
+#define L4_TYPE_OFFSET(off)     ((L4_TYPE_QW << QW_SHIFT) | (off))
+#define L4_TYPE_MATCH_OFFSET    L4_TYPE_OFFSET(L4_TYPE_BIT_OFFSET)
+#define L4_16B_TYPE_MASK        0xFFull
+#define L4_16B_ETH_VALUE        0x78ull
+
+/* 16B VESWID - for select */
+#define L4_16B_HDR_VESWID_OFFSET  ((2 << QW_SHIFT) | (16ull))
+/* 16B ENTROPY - for select */
+#define L2_16B_ENTROPY_OFFSET     ((1 << QW_SHIFT) | (32ull))
+
 /* defines to build power on SC2VL table */
 #define SC2VL_VAL( \
 	num, \
@@ -1045,6 +1073,7 @@ static int wait_logical_linkstate(struct hfi1_pportdata *ppd, u32 state,
 static int qos_rmt_entries(struct hfi1_devdata *dd, unsigned int *mp,
 			   unsigned int *np);
 static void clear_full_mgmt_pkey(struct hfi1_pportdata *ppd);
+static void clear_rsm_rule(struct hfi1_devdata *dd, u8 rule_index);
 
 /*
  * Error interrupt table entry.  This is used as input to the interrupt
@@ -6712,7 +6741,13 @@ static void rxe_kernel_unfreeze(struct hfi1_devdata *dd)
 	int i;
 
 	/* enable all kernel contexts */
-	for (i = 0; i < dd->n_krcv_queues; i++) {
+	for (i = 0; i < dd->num_rcv_contexts; i++) {
+		struct hfi1_ctxtdata *rcd = dd->rcd[i];
+
+		/* Ensure all non-user contexts(including vnic) are enabled */
+		if (!rcd || !rcd->sc || (rcd->sc->type == SC_USER))
+			continue;
+
 		rcvmask = HFI1_RCVCTRL_CTXT_ENB;
 		/* HFI1_RCVCTRL_TAILUPD_[ENB|DIS] needs to be set explicitly */
 		rcvmask |= HFI1_CAP_KGET_MASK(dd->rcd[i]->flags, DMA_RTAIL) ?
@@ -8004,7 +8039,9 @@ static void is_rcv_avail_int(struct hfi1_devdata *dd, unsigned int source)
 	if (likely(source < dd->num_rcv_contexts)) {
 		rcd = dd->rcd[source];
 		if (rcd) {
-			if (source < dd->first_user_ctxt)
+			/* Check for non-user contexts, including vnic */
+			if ((source < dd->first_dyn_alloc_ctxt) ||
+			    (rcd->sc && (rcd->sc->type == SC_KERNEL)))
 				rcd->do_interrupt(rcd, 0);
 			else
 				handle_user_interrupt(rcd);
@@ -8032,7 +8069,8 @@ static void is_rcv_urgent_int(struct hfi1_devdata *dd, unsigned int source)
 		rcd = dd->rcd[source];
 		if (rcd) {
 			/* only pay attention to user urgent interrupts */
-			if (source >= dd->first_user_ctxt)
+			if ((source >= dd->first_dyn_alloc_ctxt) &&
+			    (!rcd->sc || (rcd->sc->type == SC_USER)))
 				handle_user_interrupt(rcd);
 			return;	/* OK */
 		}
@@ -12733,7 +12771,7 @@ static int request_msix_irqs(struct hfi1_devdata *dd)
 	first_sdma = last_general;
 	last_sdma = first_sdma + dd->num_sdma;
 	first_rx = last_sdma;
-	last_rx = first_rx + dd->n_krcv_queues;
+	last_rx = first_rx + dd->n_krcv_queues + HFI1_NUM_VNIC_CTXT;
 
 	/*
 	 * Sanity check - the code expects all SDMA chip source
@@ -12747,7 +12785,7 @@ static int request_msix_irqs(struct hfi1_devdata *dd)
 		const char *err_info;
 		irq_handler_t handler;
 		irq_handler_t thread = NULL;
-		void *arg;
+		void *arg = NULL;
 		int idx;
 		struct hfi1_ctxtdata *rcd = NULL;
 		struct sdma_engine *sde = NULL;
@@ -12774,24 +12812,24 @@ static int request_msix_irqs(struct hfi1_devdata *dd)
 		} else if (first_rx <= i && i < last_rx) {
 			idx = i - first_rx;
 			rcd = dd->rcd[idx];
-			/* no interrupt if no rcd */
-			if (!rcd)
-				continue;
-			/*
-			 * Set the interrupt register and mask for this
-			 * context's interrupt.
-			 */
-			rcd->ireg = (IS_RCVAVAIL_START + idx) / 64;
-			rcd->imask = ((u64)1) <<
-					((IS_RCVAVAIL_START + idx) % 64);
-			handler = receive_context_interrupt;
-			thread = receive_context_thread;
-			arg = rcd;
-			snprintf(me->name, sizeof(me->name),
-				 DRIVER_NAME "_%d kctxt%d", dd->unit, idx);
-			err_info = "receive context";
-			remap_intr(dd, IS_RCVAVAIL_START + idx, i);
-			me->type = IRQ_RCVCTXT;
+			if (rcd) {
+				/*
+				 * Set the interrupt register and mask for this
+				 * context's interrupt.
+				 */
+				rcd->ireg = (IS_RCVAVAIL_START + idx) / 64;
+				rcd->imask = ((u64)1) <<
+					  ((IS_RCVAVAIL_START + idx) % 64);
+				handler = receive_context_interrupt;
+				thread = receive_context_thread;
+				arg = rcd;
+				snprintf(me->name, sizeof(me->name),
+					 DRIVER_NAME "_%d kctxt%d",
+					 dd->unit, idx);
+				err_info = "receive context";
+				remap_intr(dd, IS_RCVAVAIL_START + idx, i);
+				me->type = IRQ_RCVCTXT;
+			}
 		} else {
 			/* not in our expected range - complain, then
 			 * ignore it
@@ -12829,6 +12867,67 @@ static int request_msix_irqs(struct hfi1_devdata *dd)
 	return ret;
 }
 
+void hfi1_reset_vnic_msix_info(struct hfi1_ctxtdata *rcd)
+{
+	int idx = rcd->ctxt;
+	struct hfi1_devdata *dd = rcd->dd;
+	int i = 1 + dd->num_sdma + idx;
+	struct hfi1_msix_entry *me = &dd->msix_entries[i];
+
+	if (!me->arg) /* => no irq, no affinity */
+		return;
+
+	hfi1_put_irq_affinity(dd, me);
+	free_irq(me->msix.vector, me->arg);
+
+	me->arg = NULL;
+}
+
+void hfi1_set_vnic_msix_info(struct hfi1_ctxtdata *rcd)
+{
+	int idx = rcd->ctxt;
+	void *arg = rcd;
+	int ret;
+	struct hfi1_devdata *dd = rcd->dd;
+	int i = 1 + dd->num_sdma + idx;
+	struct hfi1_msix_entry *me = &dd->msix_entries[i];
+
+	/*
+	 * Set the interrupt register and mask for this
+	 * context's interrupt.
+	 */
+	rcd->ireg = (IS_RCVAVAIL_START + idx) / 64;
+	rcd->imask = ((u64)1) <<
+		  ((IS_RCVAVAIL_START + idx) % 64);
+
+	snprintf(me->name, sizeof(me->name),
+		 DRIVER_NAME "_%d kctxt%d", dd->unit, idx);
+	me->name[sizeof(me->name) - 1] = 0;
+	me->type = IRQ_RCVCTXT;
+
+	remap_intr(dd, IS_RCVAVAIL_START + idx, i);
+
+	ret = request_threaded_irq(me->msix.vector, receive_context_interrupt,
+				   receive_context_thread, 0, me->name, arg);
+	if (ret) {
+		dd_dev_err(dd, "vnic irq request (vector %d, idx %d) fail %d\n",
+			   me->msix.vector, idx, ret);
+		return;
+	}
+	/*
+	 * assign arg after request_irq call, so it will be
+	 * cleaned up
+	 */
+	me->arg = arg;
+
+	ret = hfi1_get_irq_affinity(dd, me);
+	if (ret) {
+		dd_dev_err(dd,
+			   "unable to pin IRQ %d\n", ret);
+		free_irq(me->msix.vector, me->arg);
+	}
+}
+
 /*
  * Set the general handler to accept all interrupts, remap all
  * chip interrupts back to MSI-X 0.
@@ -12860,7 +12959,7 @@ static int set_up_interrupts(struct hfi1_devdata *dd)
 	 *	N interrupts - one per used SDMA engine
 	 *	M interrupt - one per kernel receive context
 	 */
-	total = 1 + dd->num_sdma + dd->n_krcv_queues;
+	total = 1 + dd->num_sdma + dd->n_krcv_queues + HFI1_NUM_VNIC_CTXT;
 
 	entries = kcalloc(total, sizeof(*entries), GFP_KERNEL);
 	if (!entries) {
@@ -12925,7 +13024,8 @@ static int set_up_interrupts(struct hfi1_devdata *dd)
  *
  *	num_rcv_contexts - number of contexts being used
  *	n_krcv_queues - number of kernel contexts
- *	first_user_ctxt - first non-kernel context in array of contexts
+ *	first_dyn_alloc_ctxt - first dynamically allocated context
+ *                             in array of contexts
  *	freectxts  - number of free user contexts
  *	num_send_contexts - number of PIO send contexts being used
  */
@@ -13002,10 +13102,14 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 		total_contexts = num_kernel_contexts + num_user_contexts;
 	}
 
-	/* the first N are kernel contexts, the rest are user contexts */
+	/* Accommodate VNIC contexts */
+	if ((total_contexts + HFI1_NUM_VNIC_CTXT) <= dd->chip_rcv_contexts)
+		total_contexts += HFI1_NUM_VNIC_CTXT;
+
+	/* the first N are kernel contexts, the rest are user/vnic contexts */
 	dd->num_rcv_contexts = total_contexts;
 	dd->n_krcv_queues = num_kernel_contexts;
-	dd->first_user_ctxt = num_kernel_contexts;
+	dd->first_dyn_alloc_ctxt = num_kernel_contexts;
 	dd->num_user_contexts = num_user_contexts;
 	dd->freectxts = num_user_contexts;
 	dd_dev_info(dd,
@@ -13461,11 +13565,8 @@ static void reset_rxe_csrs(struct hfi1_devdata *dd)
 		write_csr(dd, RCV_COUNTER_ARRAY32 + (8 * i), 0);
 	for (i = 0; i < RXE_NUM_64_BIT_COUNTERS; i++)
 		write_csr(dd, RCV_COUNTER_ARRAY64 + (8 * i), 0);
-	for (i = 0; i < RXE_NUM_RSM_INSTANCES; i++) {
-		write_csr(dd, RCV_RSM_CFG + (8 * i), 0);
-		write_csr(dd, RCV_RSM_SELECT + (8 * i), 0);
-		write_csr(dd, RCV_RSM_MATCH + (8 * i), 0);
-	}
+	for (i = 0; i < RXE_NUM_RSM_INSTANCES; i++)
+		clear_rsm_rule(dd, i);
 	for (i = 0; i < 32; i++)
 		write_csr(dd, RCV_RSM_MAP_TABLE + (8 * i), 0);
 
@@ -13824,6 +13925,16 @@ static void add_rsm_rule(struct hfi1_devdata *dd, u8 rule_index,
 		  (u64)rrd->value2 << RCV_RSM_MATCH_VALUE2_SHIFT);
 }
 
+/*
+ * Clear a receive side mapping rule.
+ */
+static void clear_rsm_rule(struct hfi1_devdata *dd, u8 rule_index)
+{
+	write_csr(dd, RCV_RSM_CFG + (8 * rule_index), 0);
+	write_csr(dd, RCV_RSM_SELECT + (8 * rule_index), 0);
+	write_csr(dd, RCV_RSM_MATCH + (8 * rule_index), 0);
+}
+
 /* return the number of RSM map table entries that will be used for QOS */
 static int qos_rmt_entries(struct hfi1_devdata *dd, unsigned int *mp,
 			   unsigned int *np)
@@ -13939,7 +14050,7 @@ static void init_qos(struct hfi1_devdata *dd, struct rsm_map_table *rmt)
 	rrd.value2 = LRH_SC_VALUE;
 
 	/* add rule 0 */
-	add_rsm_rule(dd, 0, &rrd);
+	add_rsm_rule(dd, RSM_INS_VERBS, &rrd);
 
 	/* mark RSM map entries as used */
 	rmt->used += rmt_entries;
@@ -13969,7 +14080,7 @@ static void init_user_fecn_handling(struct hfi1_devdata *dd,
 	/*
 	 * RSM will extract the destination context as an index into the
 	 * map table.  The destination contexts are a sequential block
-	 * in the range first_user_ctxt...num_rcv_contexts-1 (inclusive).
+	 * in the range first_dyn_alloc_ctxt...num_rcv_contexts-1 (inclusive).
 	 * Map entries are accessed as offset + extracted value.  Adjust
 	 * the added offset so this sequence can be placed anywhere in
 	 * the table - as long as the entries themselves do not wrap.
@@ -13977,9 +14088,9 @@ static void init_user_fecn_handling(struct hfi1_devdata *dd,
 	 * start with that to allow for a "negative" offset.
 	 */
 	offset = (u8)(NUM_MAP_ENTRIES + (int)rmt->used -
-						(int)dd->first_user_ctxt);
+						(int)dd->first_dyn_alloc_ctxt);
 
-	for (i = dd->first_user_ctxt, idx = rmt->used;
+	for (i = dd->first_dyn_alloc_ctxt, idx = rmt->used;
 				i < dd->num_rcv_contexts; i++, idx++) {
 		/* replace with identity mapping */
 		regoff = (idx % 8) * 8;
@@ -14013,11 +14124,84 @@ static void init_user_fecn_handling(struct hfi1_devdata *dd,
 	rrd.value2 = 1;
 
 	/* add rule 1 */
-	add_rsm_rule(dd, 1, &rrd);
+	add_rsm_rule(dd, RSM_INS_FECN, &rrd);
 
 	rmt->used += dd->num_user_contexts;
 }
 
+/* Initialize RSM for VNIC */
+void hfi1_init_vnic_rsm(struct hfi1_devdata *dd)
+{
+	u8 i, j;
+	u8 ctx_id = 0;
+	u64 reg;
+	u32 regoff;
+	struct rsm_rule_data rrd;
+
+	if (hfi1_vnic_is_rsm_full(dd, NUM_VNIC_MAP_ENTRIES)) {
+		dd_dev_err(dd, "Vnic RSM disabled, rmt entries used = %d\n",
+			   dd->vnic.rmt_start);
+		return;
+	}
+
+	dev_dbg(&(dd)->pcidev->dev, "Vnic rsm start = %d, end %d\n",
+		dd->vnic.rmt_start,
+		dd->vnic.rmt_start + NUM_VNIC_MAP_ENTRIES);
+
+	/* Update RSM mapping table, 32 regs, 256 entries - 1 ctx per byte */
+	regoff = RCV_RSM_MAP_TABLE + (dd->vnic.rmt_start / 8) * 8;
+	reg = read_csr(dd, regoff);
+	for (i = 0; i < NUM_VNIC_MAP_ENTRIES; i++) {
+		/* Update map register with vnic context */
+		j = (dd->vnic.rmt_start + i) % 8;
+		reg &= ~(0xffllu << (j * 8));
+		reg |= (u64)dd->vnic.ctxt[ctx_id++]->ctxt << (j * 8);
+		/* Wrap up vnic ctx index */
+		ctx_id %= dd->vnic.num_ctxt;
+		/* Write back map register */
+		if (j == 7 || ((i + 1) == NUM_VNIC_MAP_ENTRIES)) {
+			dev_dbg(&(dd)->pcidev->dev,
+				"Vnic rsm map reg[%d] =0x%llx\n",
+				regoff - RCV_RSM_MAP_TABLE, reg);
+
+			write_csr(dd, regoff, reg);
+			regoff += 8;
+			if (i < (NUM_VNIC_MAP_ENTRIES - 1))
+				reg = read_csr(dd, regoff);
+		}
+	}
+
+	/* Add rule for vnic */
+	rrd.offset = dd->vnic.rmt_start;
+	rrd.pkt_type = 4;
+	/* Match 16B packets */
+	rrd.field1_off = L2_TYPE_MATCH_OFFSET;
+	rrd.mask1 = L2_TYPE_MASK;
+	rrd.value1 = L2_16B_VALUE;
+	/* Match ETH L4 packets */
+	rrd.field2_off = L4_TYPE_MATCH_OFFSET;
+	rrd.mask2 = L4_16B_TYPE_MASK;
+	rrd.value2 = L4_16B_ETH_VALUE;
+	/* Calc context from veswid and entropy */
+	rrd.index1_off = L4_16B_HDR_VESWID_OFFSET;
+	rrd.index1_width = ilog2(NUM_VNIC_MAP_ENTRIES);
+	rrd.index2_off = L2_16B_ENTROPY_OFFSET;
+	rrd.index2_width = ilog2(NUM_VNIC_MAP_ENTRIES);
+	add_rsm_rule(dd, RSM_INS_VNIC, &rrd);
+
+	/* Enable RSM if not already enabled */
+	add_rcvctrl(dd, RCV_CTRL_RCV_RSM_ENABLE_SMASK);
+}
+
+void hfi1_deinit_vnic_rsm(struct hfi1_devdata *dd)
+{
+	clear_rsm_rule(dd, RSM_INS_VNIC);
+
+	/* Disable RSM if used only by vnic */
+	if (dd->vnic.rmt_start == 0)
+		clear_rcvctrl(dd, RCV_CTRL_RCV_RSM_ENABLE_SMASK);
+}
+
 static void init_rxe(struct hfi1_devdata *dd)
 {
 	struct rsm_map_table *rmt;
@@ -14030,6 +14214,8 @@ static void init_rxe(struct hfi1_devdata *dd)
 	init_qos(dd, rmt);
 	init_user_fecn_handling(dd, rmt);
 	complete_rsm_map_table(dd, rmt);
+	/* record number of used rsm map entries for vnic */
+	dd->vnic.rmt_start = rmt->used;
 	kfree(rmt);
 
 	/*
diff --git a/drivers/infiniband/hw/hfi1/chip.h b/drivers/infiniband/hw/hfi1/chip.h
index 9234525..1e177f5 100644
--- a/drivers/infiniband/hw/hfi1/chip.h
+++ b/drivers/infiniband/hw/hfi1/chip.h
@@ -1355,6 +1355,8 @@ void hfi1_put_tid(struct hfi1_devdata *dd, u32 index,
 int hfi1_set_ctxt_pkey(struct hfi1_devdata *dd, unsigned ctxt, u16 pkey);
 int hfi1_clear_ctxt_pkey(struct hfi1_devdata *dd, unsigned ctxt);
 void hfi1_read_link_quality(struct hfi1_devdata *dd, u8 *link_quality);
+void hfi1_init_vnic_rsm(struct hfi1_devdata *dd);
+void hfi1_deinit_vnic_rsm(struct hfi1_devdata *dd);
 
 /*
  * Interrupt source table.
diff --git a/drivers/infiniband/hw/hfi1/debugfs.c b/drivers/infiniband/hw/hfi1/debugfs.c
index 632ba21..a088151 100644
--- a/drivers/infiniband/hw/hfi1/debugfs.c
+++ b/drivers/infiniband/hw/hfi1/debugfs.c
@@ -169,7 +169,7 @@ static int _opcode_stats_seq_show(struct seq_file *s, void *v)
 	struct hfi1_ibdev *ibd = (struct hfi1_ibdev *)s->private;
 	struct hfi1_devdata *dd = dd_from_dev(ibd);
 
-	for (j = 0; j < dd->first_user_ctxt; j++) {
+	for (j = 0; j < dd->first_dyn_alloc_ctxt; j++) {
 		if (!dd->rcd[j])
 			continue;
 		n_packets += dd->rcd[j]->opstats->stats[i].n_packets;
@@ -195,7 +195,7 @@ static void *_ctx_stats_seq_start(struct seq_file *s, loff_t *pos)
 
 	if (!*pos)
 		return SEQ_START_TOKEN;
-	if (*pos >= dd->first_user_ctxt)
+	if (*pos >= dd->first_dyn_alloc_ctxt)
 		return NULL;
 	return pos;
 }
@@ -209,7 +209,7 @@ static void *_ctx_stats_seq_next(struct seq_file *s, void *v, loff_t *pos)
 		return pos;
 
 	++*pos;
-	if (*pos >= dd->first_user_ctxt)
+	if (*pos >= dd->first_dyn_alloc_ctxt)
 		return NULL;
 	return pos;
 }
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index 6563e4d..c60db8f 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -59,6 +59,7 @@
 #include "trace.h"
 #include "qp.h"
 #include "sdma.h"
+#include "vnic.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
@@ -859,20 +860,42 @@ int handle_receive_interrupt_dma_rtail(struct hfi1_ctxtdata *rcd, int thread)
 	return last;
 }
 
-static inline void set_all_nodma_rtail(struct hfi1_devdata *dd)
+static inline void set_nodma_rtail(struct hfi1_devdata *dd, u8 ctxt)
 {
 	int i;
 
-	for (i = HFI1_CTRL_CTXT + 1; i < dd->first_user_ctxt; i++)
+	/*
+	 * For dynamically allocated kernel contexts (like vnic) switch
+	 * interrupt handler only for that context. Otherwise, switch
+	 * interrupt handler for all statically allocated kernel contexts.
+	 */
+	if (ctxt >= dd->first_dyn_alloc_ctxt) {
+		dd->rcd[ctxt]->do_interrupt =
+			&handle_receive_interrupt_nodma_rtail;
+		return;
+	}
+
+	for (i = HFI1_CTRL_CTXT + 1; i < dd->first_dyn_alloc_ctxt; i++)
 		dd->rcd[i]->do_interrupt =
 			&handle_receive_interrupt_nodma_rtail;
 }
 
-static inline void set_all_dma_rtail(struct hfi1_devdata *dd)
+static inline void set_dma_rtail(struct hfi1_devdata *dd, u8 ctxt)
 {
 	int i;
 
-	for (i = HFI1_CTRL_CTXT + 1; i < dd->first_user_ctxt; i++)
+	/*
+	 * For dynamically allocated kernel contexts (like vnic) switch
+	 * interrupt handler only for that context. Otherwise, switch
+	 * interrupt handler for all statically allocated kernel contexts.
+	 */
+	if (ctxt >= dd->first_dyn_alloc_ctxt) {
+		dd->rcd[ctxt]->do_interrupt =
+			&handle_receive_interrupt_dma_rtail;
+		return;
+	}
+
+	for (i = HFI1_CTRL_CTXT + 1; i < dd->first_dyn_alloc_ctxt; i++)
 		dd->rcd[i]->do_interrupt =
 			&handle_receive_interrupt_dma_rtail;
 }
@@ -882,8 +905,13 @@ void set_all_slowpath(struct hfi1_devdata *dd)
 	int i;
 
 	/* HFI1_CTRL_CTXT must always use the slow path interrupt handler */
-	for (i = HFI1_CTRL_CTXT + 1; i < dd->first_user_ctxt; i++)
-		dd->rcd[i]->do_interrupt = &handle_receive_interrupt;
+	for (i = HFI1_CTRL_CTXT + 1; i < dd->num_rcv_contexts; i++) {
+		struct hfi1_ctxtdata *rcd = dd->rcd[i];
+
+		if ((i < dd->first_dyn_alloc_ctxt) ||
+		    (rcd && rcd->sc && (rcd->sc->type == SC_KERNEL)))
+			rcd->do_interrupt = &handle_receive_interrupt;
+	}
 }
 
 static inline int set_armed_to_active(struct hfi1_ctxtdata *rcd,
@@ -993,7 +1021,7 @@ int handle_receive_interrupt(struct hfi1_ctxtdata *rcd, int thread)
 				last = RCV_PKT_DONE;
 			if (needset) {
 				dd_dev_info(dd, "Switching to NO_DMA_RTAIL\n");
-				set_all_nodma_rtail(dd);
+				set_nodma_rtail(dd, rcd->ctxt);
 				needset = 0;
 			}
 		} else {
@@ -1015,7 +1043,7 @@ int handle_receive_interrupt(struct hfi1_ctxtdata *rcd, int thread)
 			if (needset) {
 				dd_dev_info(dd,
 					    "Switching to DMA_RTAIL\n");
-				set_all_dma_rtail(dd);
+				set_dma_rtail(dd, rcd->ctxt);
 				needset = 0;
 			}
 		}
@@ -1063,10 +1091,10 @@ void receive_interrupt_work(struct work_struct *work)
 	set_link_state(ppd, HLS_UP_ACTIVE);
 
 	/*
-	 * Interrupt all kernel contexts that could have had an
-	 * interrupt during auto activation.
+	 * Interrupt all statically allocated kernel contexts that could
+	 * have had an interrupt during auto activation.
 	 */
-	for (i = HFI1_CTRL_CTXT; i < dd->first_user_ctxt; i++)
+	for (i = HFI1_CTRL_CTXT; i < dd->first_dyn_alloc_ctxt; i++)
 		force_recv_intr(dd->rcd[i]);
 }
 
@@ -1280,7 +1308,8 @@ int hfi1_reset_device(int unit)
 
 	spin_lock_irqsave(&dd->uctxt_lock, flags);
 	if (dd->rcd)
-		for (i = dd->first_user_ctxt; i < dd->num_rcv_contexts; i++) {
+		for (i = dd->first_dyn_alloc_ctxt;
+		     i < dd->num_rcv_contexts; i++) {
 			if (!dd->rcd[i] || !dd->rcd[i]->cnt)
 				continue;
 			spin_unlock_irqrestore(&dd->uctxt_lock, flags);
@@ -1358,13 +1387,34 @@ int process_receive_ib(struct hfi1_packet *packet)
 	return RHF_RCV_CONTINUE;
 }
 
+/*
+ * Check if packet is for VNIC
+ */
+static inline bool hfi1_is_vnic_packet(struct hfi1_packet *packet)
+{
+	/* Packet received in VNIC context via RSM */
+	if (packet->rcd->is_vnic)
+		return true;
+
+	if ((HFI1_GET_L2_TYPE(packet->ebuf) == HFI1_L2_TYPE_HDR_16B) &&
+	    (HFI1_GET_L4_TYPE(packet->ebuf) == HFI1_VNIC_L4_ETHR))
+		return true;
+
+	/* Not a VNIC packet */
+	return false;
+}
+
 int process_receive_bypass(struct hfi1_packet *packet)
 {
-	if (unlikely(rhf_err_flags(packet->rhf)))
+	if (unlikely(rhf_err_flags(packet->rhf))) {
 		handle_eflags(packet);
+	} else if (hfi1_is_vnic_packet(packet)) {
+		hfi1_vnic_bypass_rcv(packet);
+		return RHF_RCV_CONTINUE;
+	}
 
 	dd_dev_err(packet->rcd->dd,
-		   "Bypass packets are not supported in normal operation. Dropping\n");
+		   "Unsupported bypass packet. Dropping\n");
 	incr_cntr64(&packet->rcd->dd->sw_rcv_bypass_packet_errors);
 	return RHF_RCV_CONTINUE;
 }
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 677efa0..863fbbb 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -576,8 +576,8 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)
 		 * knows where it's own bitmap is within the page.
 		 */
 		memaddr = (unsigned long)(dd->events +
-					  ((uctxt->ctxt - dd->first_user_ctxt) *
-					   HFI1_MAX_SHARED_CTXTS)) & PAGE_MASK;
+				  ((uctxt->ctxt - dd->first_dyn_alloc_ctxt) *
+				   HFI1_MAX_SHARED_CTXTS)) & PAGE_MASK;
 		memlen = PAGE_SIZE;
 		/*
 		 * v3.7 removes VM_RESERVED but the effect is kept by
@@ -746,7 +746,7 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
 	 * Clear any left over, unhandled events so the next process that
 	 * gets this context doesn't get confused.
 	 */
-	ev = dd->events + ((uctxt->ctxt - dd->first_user_ctxt) *
+	ev = dd->events + ((uctxt->ctxt - dd->first_dyn_alloc_ctxt) *
 			   HFI1_MAX_SHARED_CTXTS) + fdata->subctxt;
 	*ev = 0;
 
@@ -895,12 +895,18 @@ static int find_shared_ctxt(struct file *fp,
 
 		if (!(dd && (dd->flags & HFI1_PRESENT) && dd->kregbase))
 			continue;
-		for (i = dd->first_user_ctxt; i < dd->num_rcv_contexts; i++) {
+		for (i = dd->first_dyn_alloc_ctxt;
+		     i < dd->num_rcv_contexts; i++) {
 			struct hfi1_ctxtdata *uctxt = dd->rcd[i];
 
 			/* Skip ctxts which are not yet open */
 			if (!uctxt || !uctxt->cnt)
 				continue;
+
+			/* Skip dynamically allocated kernel contexts */
+			if (uctxt->sc && (uctxt->sc->type == SC_KERNEL))
+				continue;
+
 			/* Skip ctxt if it doesn't match the requested one */
 			if (memcmp(uctxt->uuid, uinfo->uuid,
 				   sizeof(uctxt->uuid)) ||
@@ -946,7 +952,8 @@ static int allocate_ctxt(struct file *fp, struct hfi1_devdata *dd,
 		return -EIO;
 	}
 
-	for (ctxt = dd->first_user_ctxt; ctxt < dd->num_rcv_contexts; ctxt++)
+	for (ctxt = dd->first_dyn_alloc_ctxt;
+	     ctxt < dd->num_rcv_contexts; ctxt++)
 		if (!dd->rcd[ctxt])
 			break;
 
@@ -1292,7 +1299,7 @@ static int get_base_info(struct file *fp, void __user *ubase, __u32 len)
 	 */
 	binfo.user_regbase = HFI1_MMAP_TOKEN(UREGS, uctxt->ctxt,
 					    fd->subctxt, 0);
-	offset = offset_in_page((((uctxt->ctxt - dd->first_user_ctxt) *
+	offset = offset_in_page((((uctxt->ctxt - dd->first_dyn_alloc_ctxt) *
 		    HFI1_MAX_SHARED_CTXTS) + fd->subctxt) *
 		  sizeof(*dd->events));
 	binfo.events_bufbase = HFI1_MMAP_TOKEN(EVENTS, uctxt->ctxt,
@@ -1386,12 +1393,12 @@ int hfi1_set_uevent_bits(struct hfi1_pportdata *ppd, const int evtbit)
 	}
 
 	spin_lock_irqsave(&dd->uctxt_lock, flags);
-	for (ctxt = dd->first_user_ctxt; ctxt < dd->num_rcv_contexts;
+	for (ctxt = dd->first_dyn_alloc_ctxt; ctxt < dd->num_rcv_contexts;
 	     ctxt++) {
 		uctxt = dd->rcd[ctxt];
 		if (uctxt) {
 			unsigned long *evs = dd->events +
-				(uctxt->ctxt - dd->first_user_ctxt) *
+				(uctxt->ctxt - dd->first_dyn_alloc_ctxt) *
 				HFI1_MAX_SHARED_CTXTS;
 			int i;
 			/*
@@ -1463,7 +1470,7 @@ static int user_event_ack(struct hfi1_ctxtdata *uctxt, int subctxt,
 	if (!dd->events)
 		return 0;
 
-	evs = dd->events + ((uctxt->ctxt - dd->first_user_ctxt) *
+	evs = dd->events + ((uctxt->ctxt - dd->first_dyn_alloc_ctxt) *
 			    HFI1_MAX_SHARED_CTXTS) + subctxt;
 
 	for (i = 0; i <= _HFI1_MAX_EVENT_BIT; i++) {
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index a66d198..2ff3453 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -54,6 +54,7 @@
 #include <linux/list.h>
 #include <linux/scatterlist.h>
 #include <linux/slab.h>
+#include <linux/idr.h>
 #include <linux/io.h>
 #include <linux/fs.h>
 #include <linux/completion.h>
@@ -66,6 +67,7 @@
 #include <linux/i2c-algo-bit.h>
 #include <rdma/ib_hdrs.h>
 #include <linux/rhashtable.h>
+#include <linux/netdevice.h>
 #include <rdma/rdma_vt.h>
 
 #include "chip_registers.h"
@@ -337,6 +339,12 @@ struct hfi1_ctxtdata {
 	 * packets with the wrong interrupt handler.
 	 */
 	int (*do_interrupt)(struct hfi1_ctxtdata *rcd, int threaded);
+
+	/* Indicates that this is vnic context */
+	bool is_vnic;
+
+	/* vnic queue index this context is mapped to */
+	u8 vnic_q_idx;
 };
 
 /*
@@ -831,6 +839,31 @@ struct hfi1_asic_data {
 	struct hfi1_i2c_bus *i2c_bus1;
 };
 
+/* sizes for both the QP and RSM map tables */
+#define NUM_MAP_ENTRIES	 256
+#define NUM_MAP_REGS      32
+
+/*
+ * Number of VNIC contexts used. Ensure it is less than or equal to
+ * max queues supported by VNIC (HFI_VNIC_MAX_QUEUE).
+ */
+#define HFI1_NUM_VNIC_CTXT   8
+
+/* Number of VNIC RSM entries */
+#define NUM_VNIC_MAP_ENTRIES     8
+
+/* Virtual NIC information */
+struct hfi1_vnic_data {
+	struct hfi1_ctxtdata *ctxt[HFI1_NUM_VNIC_CTXT];
+	u8 num_vports;
+	struct hfi_vnic_ctrl_device *ctrl_dev;
+	struct idr vesw_idr;
+	u8 rmt_start;
+	u8 num_ctxt;
+};
+
+struct hfi1_vnic_vport_info;
+
 /* device data struct now contains only "general per-device" info.
  * fields related to a physical IB port are in a hfi1_pportdata struct.
  */
@@ -1140,6 +1173,9 @@ struct hfi1_devdata {
 	send_routine process_dma_send;
 	void (*pio_inline_send)(struct hfi1_devdata *dd, struct pio_buf *pbuf,
 				u64 pbc, const void *from, size_t count);
+	int (*process_vnic_dma_send)(struct hfi1_devdata *dd, u8 q_idx,
+				     struct hfi1_vnic_vport_info *vinfo,
+				     struct sk_buff *skb, u64 pbc, u8 plen);
 	/* hfi1_pportdata, points to array of (physical) port-specific
 	 * data structs, indexed by pidx (0..n-1)
 	 */
@@ -1151,8 +1187,8 @@ struct hfi1_devdata {
 	u16 flags;
 	/* Number of physical ports available */
 	u8 num_pports;
-	/* Lowest context number which can be used by user processes */
-	u8 first_user_ctxt;
+	/* Lowest context number which can be used by user processes or VNIC */
+	u8 first_dyn_alloc_ctxt;
 	/* adding a new field here would make it part of this cacheline */
 
 	/* seqlock for sc2vl */
@@ -1192,8 +1228,16 @@ struct hfi1_devdata {
 	struct rhashtable sdma_rht;
 
 	struct kobject kobj;
+
+	/* vnic data */
+	struct hfi1_vnic_data vnic;
 };
 
+static inline bool hfi1_vnic_is_rsm_full(struct hfi1_devdata *dd, int spare)
+{
+	return (dd->vnic.rmt_start + spare) > NUM_MAP_ENTRIES;
+}
+
 /* 8051 firmware version helper */
 #define dc8051_ver(a, b) ((a) << 8 | (b))
 #define dc8051_ver_maj(a) ((a & 0xff00) >> 8)
@@ -1259,6 +1303,8 @@ void hfi1_init_pportdata(struct pci_dev *, struct hfi1_pportdata *,
 int handle_receive_interrupt_nodma_rtail(struct hfi1_ctxtdata *, int);
 int handle_receive_interrupt_dma_rtail(struct hfi1_ctxtdata *, int);
 void set_all_slowpath(struct hfi1_devdata *dd);
+void hfi1_set_vnic_msix_info(struct hfi1_ctxtdata *rcd);
+void hfi1_reset_vnic_msix_info(struct hfi1_ctxtdata *rcd);
 
 extern const struct pci_device_id hfi1_pci_tbl[];
 
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 60db615..b897c94 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -65,6 +65,7 @@
 #include "verbs.h"
 #include "aspm.h"
 #include "affinity.h"
+#include "vnic.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
@@ -139,7 +140,7 @@ int hfi1_create_ctxts(struct hfi1_devdata *dd)
 		goto nomem;
 
 	/* create one or more kernel contexts */
-	for (i = 0; i < dd->first_user_ctxt; ++i) {
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; ++i) {
 		struct hfi1_pportdata *ppd;
 		struct hfi1_ctxtdata *rcd;
 
@@ -213,9 +214,9 @@ struct hfi1_ctxtdata *hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, u32 ctxt,
 	u32 base;
 
 	if (dd->rcv_entries.nctxt_extra >
-	    dd->num_rcv_contexts - dd->first_user_ctxt)
+	    dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt)
 		kctxt_ngroups = (dd->rcv_entries.nctxt_extra -
-				 (dd->num_rcv_contexts - dd->first_user_ctxt));
+			 (dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt));
 	rcd = kzalloc(sizeof(*rcd), GFP_KERNEL);
 	if (rcd) {
 		u32 rcvtids, max_entries;
@@ -237,10 +238,10 @@ struct hfi1_ctxtdata *hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, u32 ctxt,
 		 * Calculate the context's RcvArray entry starting point.
 		 * We do this here because we have to take into account all
 		 * the RcvArray entries that previous context would have
-		 * taken and we have to account for any extra groups
-		 * assigned to the kernel or user contexts.
+		 * taken and we have to account for any extra groups assigned
+		 * to the static (kernel) or dynamic (vnic/user) contexts.
 		 */
-		if (ctxt < dd->first_user_ctxt) {
+		if (ctxt < dd->first_dyn_alloc_ctxt) {
 			if (ctxt < kctxt_ngroups) {
 				base = ctxt * (dd->rcv_entries.ngroups + 1);
 				rcd->rcv_array_groups++;
@@ -248,7 +249,7 @@ struct hfi1_ctxtdata *hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, u32 ctxt,
 				base = kctxt_ngroups +
 					(ctxt * dd->rcv_entries.ngroups);
 		} else {
-			u16 ct = ctxt - dd->first_user_ctxt;
+			u16 ct = ctxt - dd->first_dyn_alloc_ctxt;
 
 			base = ((dd->n_krcv_queues * dd->rcv_entries.ngroups) +
 				kctxt_ngroups);
@@ -327,7 +328,8 @@ struct hfi1_ctxtdata *hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, u32 ctxt,
 		}
 		rcd->egrbufs.rcvtid_size = HFI1_MAX_EAGER_BUFFER_SIZE;
 
-		if (ctxt < dd->first_user_ctxt) { /* N/A for PSM contexts */
+		/* Applicable only for statically created kernel contexts */
+		if (ctxt < dd->first_dyn_alloc_ctxt) {
 			rcd->opstats = kzalloc(sizeof(*rcd->opstats),
 				GFP_KERNEL);
 			if (!rcd->opstats)
@@ -591,7 +593,7 @@ static void enable_chip(struct hfi1_devdata *dd)
 	 * Enable kernel ctxts' receive and receive interrupt.
 	 * Other ctxts done as user opens and initializes them.
 	 */
-	for (i = 0; i < dd->first_user_ctxt; ++i) {
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; ++i) {
 		rcvmask = HFI1_RCVCTRL_CTXT_ENB | HFI1_RCVCTRL_INTRAVAIL_ENB;
 		rcvmask |= HFI1_CAP_KGET_MASK(dd->rcd[i]->flags, DMA_RTAIL) ?
 			HFI1_RCVCTRL_TAILUPD_ENB : HFI1_RCVCTRL_TAILUPD_DIS;
@@ -685,6 +687,7 @@ int hfi1_init(struct hfi1_devdata *dd, int reinit)
 	dd->process_pio_send = hfi1_verbs_send_pio;
 	dd->process_dma_send = hfi1_verbs_send_dma;
 	dd->pio_inline_send = pio_copy;
+	dd->process_vnic_dma_send = hfi1_vnic_send_dma;
 
 	if (is_ax(dd)) {
 		atomic_set(&dd->drop_packet, DROP_PACKET_ON);
@@ -720,7 +723,7 @@ int hfi1_init(struct hfi1_devdata *dd, int reinit)
 	}
 
 	/* dd->rcd can be NULL if early initialization failed */
-	for (i = 0; dd->rcd && i < dd->first_user_ctxt; ++i) {
+	for (i = 0; dd->rcd && i < dd->first_dyn_alloc_ctxt; ++i) {
 		/*
 		 * Set up the (kernel) rcvhdr queue and egr TIDs.  If doing
 		 * re-init, the simplest way to handle this is to free
@@ -1401,7 +1404,7 @@ static void postinit_cleanup(struct hfi1_devdata *dd)
 
 static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
-	int ret = 0, j, pidx, initfail;
+	int ret = 0, j, pidx, initfail, vnicfail;
 	struct hfi1_devdata *dd = ERR_PTR(-EINVAL);
 	struct hfi1_pportdata *ppd;
 
@@ -1507,7 +1510,12 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (j)
 		dd_dev_err(dd, "Failed to create /dev devices: %d\n", -j);
 
-	if (initfail || ret) {
+	/* setup vnic */
+	vnicfail = hfi1_vnic_setup(dd);
+	if (vnicfail)
+		dd_dev_err(dd, "vnic setup failed %d\n", vnicfail);
+
+	if (initfail || ret || vnicfail) {
 		stop_timers(dd);
 		flush_workqueue(ib_wq);
 		for (pidx = 0; pidx < dd->num_pports; ++pidx) {
@@ -1518,6 +1526,8 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 				ppd->hfi1_wq = NULL;
 			}
 		}
+		if (!vnicfail)
+			hfi1_vnic_cleanup(dd);
 		if (!j)
 			hfi1_device_remove(dd);
 		if (!ret)
@@ -1542,6 +1552,9 @@ static void remove_one(struct pci_dev *pdev)
 {
 	struct hfi1_devdata *dd = pci_get_drvdata(pdev);
 
+	/* cleanup vnic */
+	hfi1_vnic_cleanup(dd);
+
 	/* close debugfs files before ib unregister */
 	hfi1_dbg_ibdev_exit(&dd->verbs_dev);
 	/* unregister from IB core */
@@ -1588,8 +1601,11 @@ int hfi1_create_rcvhdrq(struct hfi1_devdata *dd, struct hfi1_ctxtdata *rcd)
 		amt = PAGE_ALIGN(rcd->rcvhdrq_cnt * rcd->rcvhdrqentsize *
 				 sizeof(u32));
 
-		gfp_flags = (rcd->ctxt >= dd->first_user_ctxt) ?
-			GFP_USER : GFP_KERNEL;
+		if ((rcd->ctxt < dd->first_dyn_alloc_ctxt) ||
+		    (rcd->sc && (rcd->sc->type == SC_KERNEL)))
+			gfp_flags = GFP_KERNEL;
+		else
+			gfp_flags = GFP_USER;
 		rcd->rcvhdrq = dma_zalloc_coherent(
 			&dd->pcidev->dev, amt, &rcd->rcvhdrq_dma,
 			gfp_flags | __GFP_COMP);
diff --git a/drivers/infiniband/hw/hfi1/mad.c b/drivers/infiniband/hw/hfi1/mad.c
index 0ef62e6..7e90442 100644
--- a/drivers/infiniband/hw/hfi1/mad.c
+++ b/drivers/infiniband/hw/hfi1/mad.c
@@ -53,6 +53,7 @@
 #include "mad.h"
 #include "trace.h"
 #include "qp.h"
+#include "vnic.h"
 
 /* the reset value from the FM is supposed to be 0xffff, handle both */
 #define OPA_LINK_WIDTH_RESET_OLD 0x0fff
@@ -650,9 +651,11 @@ static int __subn_get_opa_portinfo(struct opa_smp *smp, u32 am, u8 *data,
 					OPA_PI_MASK_PORT_ACTIVE_OPTOMIZE : 0);
 
 	pi->port_packet_format.supported =
-		cpu_to_be16(OPA_PORT_PACKET_FORMAT_9B);
+		cpu_to_be16(OPA_PORT_PACKET_FORMAT_9B |
+			    OPA_PORT_PACKET_FORMAT_16B);
 	pi->port_packet_format.enabled =
-		cpu_to_be16(OPA_PORT_PACKET_FORMAT_9B);
+		cpu_to_be16(OPA_PORT_PACKET_FORMAT_9B |
+			    OPA_PORT_PACKET_FORMAT_16B);
 
 	/* flit_control.interleave is (OPA V1, version .76):
 	 * bits		use
@@ -678,6 +681,7 @@ static int __subn_get_opa_portinfo(struct opa_smp *smp, u32 am, u8 *data,
 	pi->resptimevalue = 3;
 
 	pi->local_port_num = port;
+	pi->num_vesw_port_supported = HFI_MAX_NUM_VNICS;
 
 	/* buffer info for FM */
 	pi->overall_buffer_space = cpu_to_be16(dd->link_credits);
diff --git a/drivers/infiniband/hw/hfi1/pio.c b/drivers/infiniband/hw/hfi1/pio.c
index 86a7f36..45e36b0 100644
--- a/drivers/infiniband/hw/hfi1/pio.c
+++ b/drivers/infiniband/hw/hfi1/pio.c
@@ -710,6 +710,7 @@ struct send_context *sc_alloc(struct hfi1_devdata *dd, int type,
 {
 	struct send_context_info *sci;
 	struct send_context *sc = NULL;
+	int req_type = type;
 	dma_addr_t dma;
 	unsigned long flags;
 	u64 reg;
@@ -736,6 +737,13 @@ struct send_context *sc_alloc(struct hfi1_devdata *dd, int type,
 		return NULL;
 	}
 
+	/*
+	 * VNIC contexts are dynamically allocated.
+	 * Hence, pick a user context for VNIC.
+	 */
+	if (type == SC_VNIC)
+		type = SC_USER;
+
 	spin_lock_irqsave(&dd->sc_lock, flags);
 	ret = sc_hw_alloc(dd, type, &sw_index, &hw_context);
 	if (ret) {
@@ -745,6 +753,15 @@ struct send_context *sc_alloc(struct hfi1_devdata *dd, int type,
 		return NULL;
 	}
 
+	/*
+	 * VNIC contexts are used by kernel driver.
+	 * Hence, mark them as kernel contexts.
+	 */
+	if (req_type == SC_VNIC) {
+		dd->send_contexts[sw_index].type = SC_KERNEL;
+		type = SC_KERNEL;
+	}
+
 	sci = &dd->send_contexts[sw_index];
 	sci->sc = sc;
 
diff --git a/drivers/infiniband/hw/hfi1/pio.h b/drivers/infiniband/hw/hfi1/pio.h
index 867e5ff..22e19d5 100644
--- a/drivers/infiniband/hw/hfi1/pio.h
+++ b/drivers/infiniband/hw/hfi1/pio.h
@@ -54,6 +54,12 @@
 #define SC_USER   3	/* must be the last one: it may take all left */
 #define SC_MAX    4	/* count of send context types */
 
+/*
+ * SC_VNIC types are allocated (dynamically) from the user context pool
+ * (SC_USER) and used by the kernel driver as kernel contexts (SC_KERNEL).
+ */
+#define SC_VNIC   SC_MAX
+
 /* invalid send context index */
 #define INVALID_SCI 0xff
 
diff --git a/drivers/infiniband/hw/hfi1/sysfs.c b/drivers/infiniband/hw/hfi1/sysfs.c
index edba224..916ce94 100644
--- a/drivers/infiniband/hw/hfi1/sysfs.c
+++ b/drivers/infiniband/hw/hfi1/sysfs.c
@@ -543,7 +543,7 @@ static ssize_t show_nctxts(struct device *device,
 	 * give a more accurate picture of total contexts available.
 	 */
 	return scnprintf(buf, PAGE_SIZE, "%u\n",
-			 min(dd->num_rcv_contexts - dd->first_user_ctxt,
+			 min(dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt,
 			     (u32)dd->sc_sizes[SC_USER].count));
 }
 
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index 64d2652..fdcd686 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -612,7 +612,7 @@ int hfi1_user_exp_rcv_invalid(struct file *fp, struct hfi1_tid_info *tinfo)
 	struct hfi1_filedata *fd = fp->private_data;
 	struct hfi1_ctxtdata *uctxt = fd->uctxt;
 	unsigned long *ev = uctxt->dd->events +
-		(((uctxt->ctxt - uctxt->dd->first_user_ctxt) *
+		(((uctxt->ctxt - uctxt->dd->first_dyn_alloc_ctxt) *
 		  HFI1_MAX_SHARED_CTXTS) + fd->subctxt);
 	u32 *array;
 	int ret = 0;
@@ -1016,8 +1016,8 @@ static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode)
 			 * process in question.
 			 */
 			ev = uctxt->dd->events +
-				(((uctxt->ctxt - uctxt->dd->first_user_ctxt) *
-				  HFI1_MAX_SHARED_CTXTS) + fdata->subctxt);
+			  (((uctxt->ctxt - uctxt->dd->first_dyn_alloc_ctxt) *
+			    HFI1_MAX_SHARED_CTXTS) + fdata->subctxt);
 			set_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
 		}
 		fdata->invalid_tid_idx++;
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 20f4ddc..7238a34 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -73,7 +73,8 @@ bool hfi1_can_pin_pages(struct hfi1_devdata *dd, struct mm_struct *mm,
 {
 	unsigned long ulimit = rlimit(RLIMIT_MEMLOCK), pinned, cache_limit,
 		size = (cache_size * (1UL << 20)); /* convert to bytes */
-	unsigned usr_ctxts = dd->num_rcv_contexts - dd->first_user_ctxt;
+	unsigned int usr_ctxts =
+			dd->num_rcv_contexts - dd->first_dyn_alloc_ctxt;
 	bool can_lock = capable(CAP_IPC_LOCK);
 
 	/*
diff --git a/drivers/infiniband/hw/hfi1/vnic.h b/drivers/infiniband/hw/hfi1/vnic.h
new file mode 100644
index 0000000..d91c35b
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/vnic.h
@@ -0,0 +1,155 @@
+#ifndef _HFI1_VNIC_H
+#define _HFI1_VNIC_H
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include "hfi_vnic.h"
+#include "hfi.h"
+
+#define HFI1_VNIC_ICRC_LEN   4
+#define HFI1_VNIC_TAIL_LEN   1
+#define HFI1_VNIC_ICRC_TAIL_LEN  (HFI1_VNIC_ICRC_LEN + HFI1_VNIC_TAIL_LEN)
+
+#define HFI1_VNIC_MAX_TXQ     16
+#define HFI1_VNIC_MAX_PAD     12
+
+/* L2 header definitions */
+#define HFI1_L2_TYPE_OFFSET     0x7
+#define HFI1_L2_TYPE_SHFT       0x5
+#define HFI1_L2_TYPE_MASK       0x3
+#define HFI1_L2_TYPE_HDR_16B    0x2
+
+#define HFI1_GET_L2_TYPE(hdr)                                            \
+	((*((u8 *)(hdr) + HFI1_L2_TYPE_OFFSET) >> HFI1_L2_TYPE_SHFT) &   \
+	 HFI1_L2_TYPE_MASK)
+
+/* L4 type definitions */
+#define HFI1_L4_TYPE_OFFSET 8
+
+#define HFI1_GET_L4_TYPE(data)   \
+	(*((u8 *)(data) + HFI1_L4_TYPE_OFFSET))
+
+#define HFI1_VNIC_L4_ETHR  0x78
+
+/* L4 header definitions */
+#define HFI1_VNIC_L4_HDR_OFFSET  18
+
+#define HFI1_VNIC_GET_L4_HDR(data)   \
+	(*((u16 *)((u8 *)(data) + HFI1_VNIC_L4_HDR_OFFSET)))
+
+#define HFI1_VNIC_GET_VESWID(data)   \
+	(HFI1_VNIC_GET_L4_HDR(data) & 0xFF)
+
+/* Service class */
+#define HFI1_VNIC_SC_OFFSET_LOW 6
+#define HFI1_VNIC_SC_OFFSET_HI  7
+#define HFI1_VNIC_SC_SHIFT      4
+
+/**
+ * struct hfi1_vnic_notifier - VNIC notifier structure
+ * @cb: vnic callback function
+ */
+struct hfi1_vnic_notifier {
+	hfi_vnic_evt_cb_fn  cb;
+};
+
+/**
+ * struct hfi1_vnic_vport_info - HFI1 VNIC virtual port information
+ * @dd: device data pointer
+ * @vdev: vnic device
+ * @event_flags: event notification flags
+ * @notifier: vnic notifier
+ * @skbq: Array of queues for received socket buffers
+ */
+struct hfi1_vnic_vport_info {
+	struct hfi1_devdata *dd;
+
+	struct hfi1_vnic_notifier __rcu *notifier;
+	DECLARE_BITMAP(event_flags, HFI_VNIC_NUM_EVTS);
+	struct hfi_vnic_device *vdev;
+
+	struct sk_buff_head skbq[HFI1_NUM_VNIC_CTXT];
+};
+
+static inline struct hfi1_devdata *vnic_dev2dd(struct hfi_vnic_device *vdev)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+
+	return vinfo->dd;
+}
+
+/* setup the last plen bytes of pad */
+static inline void hfi1_vnic_update_pad(unsigned char *pad, u8 plen)
+{
+	pad[HFI1_VNIC_MAX_PAD - 1] = plen - HFI1_VNIC_ICRC_TAIL_LEN;
+}
+
+/* vnic hfi1 internal functions */
+int hfi1_vnic_setup(struct hfi1_devdata *dd);
+void hfi1_vnic_cleanup(struct hfi1_devdata *dd);
+int hfi1_vnic_add_ctrl_port(struct hfi1_devdata *dd, struct device *parent);
+void hfi1_vnic_rem_ctrl_port(struct hfi1_devdata *dd);
+
+void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet);
+
+/* vnic device bus ops */
+int hfi1_vnic_init(struct hfi_vnic_device *vdev);
+void hfi1_vnic_deinit(struct hfi_vnic_device *vdev);
+int hfi1_vnic_open(struct hfi_vnic_device *vdev, hfi_vnic_evt_cb_fn cb);
+void hfi1_vnic_close(struct hfi_vnic_device *vdev);
+int hfi1_vnic_put_skb(struct hfi_vnic_device *vdev,
+		      u8 q_idx, struct sk_buff *skb);
+u8 hfi1_vnic_select_queue(struct hfi_vnic_device *vdev, u8 vl, u8 entropy);
+struct sk_buff *hfi1_vnic_get_skb(struct hfi_vnic_device *vdev, u8 q_idx);
+u16 hfi1_vnic_get_read_avail(struct hfi_vnic_device *vdev, u8 q_idx);
+bool hfi1_vnic_get_write_avail(struct hfi_vnic_device *vdev, u8 q_idx);
+void hfi1_vnic_config_notify(struct hfi_vnic_device *vdev, u8 evt, bool enable);
+int hfi1_vnic_send_dma(struct hfi1_devdata *dd, u8 q_idx,
+		       struct hfi1_vnic_vport_info *vinfo,
+		       struct sk_buff *skb, u64 pbc, u8 plen);
+
+#endif /* _HFI1_VNIC_H */
diff --git a/drivers/infiniband/hw/hfi1/vnic_device.c b/drivers/infiniband/hw/hfi1/vnic_device.c
new file mode 100644
index 0000000..468e197
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/vnic_device.c
@@ -0,0 +1,168 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+/*
+ * This file contains HFI1 VNIC device handling functions
+ */
+
+#include <linux/idr.h>
+#include <linux/module.h>
+
+#include "vnic.h"
+
+/* vnic operations */
+static struct hfi_vnic_ops hfi1_vnic_ops = {
+	.init = hfi1_vnic_init,
+	.deinit = hfi1_vnic_deinit,
+	.open = hfi1_vnic_open,
+	.close = hfi1_vnic_close,
+	.put_skb = hfi1_vnic_put_skb,
+	.get_skb = hfi1_vnic_get_skb,
+	.get_read_avail = hfi1_vnic_get_read_avail,
+	.get_write_avail = hfi1_vnic_get_write_avail,
+	.select_queue = hfi1_vnic_select_queue,
+	.config_notify = hfi1_vnic_config_notify
+};
+
+/* hfi1_vdev_create - add vnic device on vnic bus */
+static int hfi1_vdev_create(struct hfi_vnic_ctrl_device *cdev,
+			    u8 port_num, u8 vport_num)
+{
+	struct hfi1_devdata *dd = (struct hfi1_devdata *)cdev->hfi_priv;
+	struct hfi1_vnic_vport_info *vinfo;
+	struct hfi_vnic_device *vdev;
+	struct hfi_vnic_info hfi_info;
+
+	if (!port_num || (port_num > dd->num_pports))
+		return -EINVAL;
+
+	vinfo = kzalloc(sizeof(*vinfo), GFP_KERNEL);
+	if (!vinfo)
+		return -ENOMEM;
+
+	vinfo->dd = dd;
+	hfi_info.num_tx_q = 1;
+	hfi_info.num_rx_q = HFI1_NUM_VNIC_CTXT;
+	hfi_info.cap = HFI_VNIC_CAP_SG;
+	vdev = hfi_vnic_device_register(cdev, port_num, vport_num, vinfo,
+					&hfi1_vnic_ops, hfi_info);
+	if (IS_ERR(vdev)) {
+		kfree(vinfo);
+		return PTR_ERR(vdev);
+	}
+	return 0;
+}
+
+/* hfi1_vdev_destroy - remove vnic device from vnic bus */
+static void hfi1_vdev_destroy(struct hfi_vnic_ctrl_device *cdev,
+			      u8 port_num, u8 vport_num)
+{
+	struct hfi1_vnic_vport_info *vinfo;
+	struct hfi_vnic_device *vdev;
+
+	vdev = hfi_vnic_get_dev(cdev, port_num, vport_num);
+	if (!vdev)
+		return;
+
+	vinfo = vdev->hfi_priv;
+	hfi_vnic_device_unregister(vdev);
+	kfree(vinfo);
+}
+
+/* hfi1_vnic_add_vport - add vnic port */
+static int hfi1_vnic_add_vport(struct hfi_vnic_ctrl_device *cdev,
+			       u8 port_num, u8 vport_num)
+{
+	int rc;
+
+	rc = hfi1_vdev_create(cdev, port_num, vport_num);
+	if (rc)
+		dev_err(&cdev->dev, "error adding vnic port (%d:%d): %d\n",
+			port_num, vport_num, rc);
+
+	return rc;
+}
+
+/* hfi1_vnic_rem_vport - remove vnic port */
+static void hfi1_vnic_rem_vport(struct hfi_vnic_ctrl_device *cdev,
+				u8 port_num, u8 vport_num)
+{
+	hfi1_vdev_destroy(cdev, port_num, vport_num);
+}
+
+/* vnic control operations */
+static struct hfi_vnic_ctrl_ops hfi1_vnic_ctrl_ops = {
+	.add_vport = hfi1_vnic_add_vport,
+	.rem_vport = hfi1_vnic_rem_vport
+};
+
+/* hfi1_vnic_add_ctrl_port - add vnic control port */
+int hfi1_vnic_add_ctrl_port(struct hfi1_devdata *dd, struct device *parent)
+{
+	struct ib_device *ibdev = &dd->verbs_dev.rdi.ibdev;
+	struct hfi_vnic_ctrl_device *cdev;
+	int rc;
+
+	cdev = hfi_vnic_ctrl_device_register(parent, ibdev, dd->num_pports,
+					     dd, &hfi1_vnic_ctrl_ops);
+	if (IS_ERR(cdev)) {
+		rc = PTR_ERR(cdev);
+		dev_err(parent, "error adding vnic control port %d: %d\n",
+			dd->unit, rc);
+		return rc;
+	}
+
+	dd->vnic.ctrl_dev = cdev;
+	return 0;
+}
+
+/* hfi1_vnic_rem_ctrl_port - remove vnic control port */
+void hfi1_vnic_rem_ctrl_port(struct hfi1_devdata *dd)
+{
+	hfi_vnic_ctrl_device_unregister(dd->vnic.ctrl_dev);
+	dd->vnic.ctrl_dev = NULL;
+}
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c
new file mode 100644
index 0000000..82e30bd
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -0,0 +1,555 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+/*
+ * This file contains HFI1 support for VNIC functionality
+ */
+
+#include <linux/io.h>
+
+#include "vnic.h"
+
+#define HFI1_VNIC_RCV_Q_SIZE   1024
+
+static DEFINE_SPINLOCK(vdev_cntr_lock);
+
+static inline u8 hfi1_vnic_get_sc5(u8 *hdr)
+{
+	return  (((*(hdr + HFI1_VNIC_SC_OFFSET_LOW)) >> HFI1_VNIC_SC_SHIFT) |
+		 (((*(hdr + HFI1_VNIC_SC_OFFSET_HI)) & 0x1) <<
+		  HFI1_VNIC_SC_SHIFT));
+}
+
+static int setup_vnic_ctxt(struct hfi1_devdata *dd, struct hfi1_ctxtdata *uctxt)
+{
+	unsigned int rcvctrl_ops = 0;
+	int ret;
+
+	ret = hfi1_init_ctxt(uctxt->sc);
+	if (ret)
+		goto done;
+
+	uctxt->do_interrupt = &handle_receive_interrupt;
+
+	/* Now allocate the RcvHdr queue and eager buffers. */
+	ret = hfi1_create_rcvhdrq(dd, uctxt);
+	if (ret)
+		goto done;
+
+	ret = hfi1_setup_eagerbufs(uctxt);
+	if (ret)
+		goto done;
+
+	set_bit(HFI1_CTXT_SETUP_DONE, &uctxt->event_flags);
+
+	if (uctxt->rcvhdrtail_kvaddr)
+		clear_rcvhdrtail(uctxt);
+
+	rcvctrl_ops = HFI1_RCVCTRL_CTXT_ENB;
+	rcvctrl_ops |= HFI1_RCVCTRL_INTRAVAIL_ENB;
+
+	if (!HFI1_CAP_KGET_MASK(uctxt->flags, MULTI_PKT_EGR))
+		rcvctrl_ops |= HFI1_RCVCTRL_ONE_PKT_EGR_ENB;
+	if (HFI1_CAP_KGET_MASK(uctxt->flags, NODROP_EGR_FULL))
+		rcvctrl_ops |= HFI1_RCVCTRL_NO_EGR_DROP_ENB;
+	if (HFI1_CAP_KGET_MASK(uctxt->flags, NODROP_RHQ_FULL))
+		rcvctrl_ops |= HFI1_RCVCTRL_NO_RHQ_DROP_ENB;
+	if (HFI1_CAP_KGET_MASK(uctxt->flags, DMA_RTAIL))
+		rcvctrl_ops |= HFI1_RCVCTRL_TAILUPD_ENB;
+
+	hfi1_rcvctrl(uctxt->dd, rcvctrl_ops, uctxt->ctxt);
+
+	uctxt->is_vnic = true;
+done:
+	return ret;
+}
+
+static int allocate_vnic_ctxt(struct hfi1_devdata *dd,
+			      struct hfi1_ctxtdata **vnic_ctxt)
+{
+	struct hfi1_ctxtdata *uctxt;
+	unsigned int ctxt;
+	int ret;
+
+	if (dd->flags & HFI1_FROZEN)
+		return -EIO;
+
+	for (ctxt = dd->first_dyn_alloc_ctxt;
+	     ctxt < dd->num_rcv_contexts; ctxt++)
+		if (!dd->rcd[ctxt])
+			break;
+
+	if (ctxt == dd->num_rcv_contexts)
+		return -EBUSY;
+
+	uctxt = hfi1_create_ctxtdata(dd->pport, ctxt, dd->node);
+	if (!uctxt) {
+		dd_dev_err(dd, "Unable to create ctxtdata, failing open\n");
+		return -ENOMEM;
+	}
+
+	uctxt->flags = HFI1_CAP_KGET(MULTI_PKT_EGR) |
+			HFI1_CAP_KGET(NODROP_RHQ_FULL) |
+			HFI1_CAP_KGET(NODROP_EGR_FULL) |
+			HFI1_CAP_KGET(DMA_RTAIL);
+	uctxt->seq_cnt = 1;
+
+	/* Allocate and enable a PIO send context */
+	uctxt->sc = sc_alloc(dd, SC_VNIC, uctxt->rcvhdrqentsize,
+			     uctxt->numa_id);
+
+	ret = uctxt->sc ? 0 : -ENOMEM;
+	if (ret)
+		goto bail;
+
+	dev_dbg(&dd->vnic.ctrl_dev->dev,
+		"allocated send context %u(%u)\n",
+		uctxt->sc->sw_index, uctxt->sc->hw_context);
+	ret = sc_enable(uctxt->sc);
+	if (ret)
+		goto bail;
+
+	if (dd->num_msix_entries)
+		hfi1_set_vnic_msix_info(uctxt);
+
+	hfi1_stats.sps_ctxts++;
+	dev_dbg(&dd->vnic.ctrl_dev->dev, "created vnic context %d\n",
+		uctxt->ctxt);
+	*vnic_ctxt = uctxt;
+
+	return ret;
+bail:
+	/*
+	 * hfi1_free_ctxtdata() also releases send_context
+	 * structure if uctxt->sc is not null
+	 */
+	dd->rcd[uctxt->ctxt] = NULL;
+	hfi1_free_ctxtdata(dd, uctxt);
+	dev_dbg(&dd->vnic.ctrl_dev->dev, "allocation failed. rc %d\n", ret);
+	return ret;
+}
+
+static void deallocate_vnic_ctxt(struct hfi1_devdata *dd,
+				 struct hfi1_ctxtdata *uctxt)
+{
+	unsigned long flags;
+
+	dev_dbg(&dd->vnic.ctrl_dev->dev, "closing vnic context %d\n",
+		uctxt->ctxt);
+	flush_wc();
+
+	if (dd->num_msix_entries)
+		hfi1_reset_vnic_msix_info(uctxt);
+
+	spin_lock_irqsave(&dd->uctxt_lock, flags);
+	/*
+	 * Disable receive context and interrupt available, reset all
+	 * RcvCtxtCtrl bits to default values.
+	 */
+	hfi1_rcvctrl(dd, HFI1_RCVCTRL_CTXT_DIS |
+		     HFI1_RCVCTRL_TIDFLOW_DIS |
+		     HFI1_RCVCTRL_INTRAVAIL_DIS |
+		     HFI1_RCVCTRL_ONE_PKT_EGR_DIS |
+		     HFI1_RCVCTRL_NO_RHQ_DROP_DIS |
+		     HFI1_RCVCTRL_NO_EGR_DROP_DIS, uctxt->ctxt);
+	/*
+	 * VNIC contexts are allocated from user context pool.
+	 * Release them back to user context pool.
+	 *
+	 * Reset context integrity checks to default.
+	 * (writes to CSRs probably belong in chip.c)
+	 */
+	write_kctxt_csr(dd, uctxt->sc->hw_context, SEND_CTXT_CHECK_ENABLE,
+			hfi1_pkt_default_send_ctxt_mask(dd, SC_USER));
+	sc_disable(uctxt->sc);
+
+	dd->send_contexts[uctxt->sc->sw_index].type = SC_USER;
+	spin_unlock_irqrestore(&dd->uctxt_lock, flags);
+
+	dd->rcd[uctxt->ctxt] = NULL;
+	uctxt->event_flags = 0;
+
+	hfi1_clear_tids(uctxt);
+	hfi1_clear_ctxt_pkey(dd, uctxt->ctxt);
+
+	hfi1_stats.sps_ctxts--;
+	hfi1_free_ctxtdata(dd, uctxt);
+}
+
+int hfi1_vnic_setup(struct hfi1_devdata *dd)
+{
+	idr_init(&dd->vnic.vesw_idr);
+	return hfi1_vnic_add_ctrl_port(dd, &dd->pcidev->dev);
+}
+
+void hfi1_vnic_cleanup(struct hfi1_devdata *dd)
+{
+	hfi1_vnic_rem_ctrl_port(dd);
+	idr_destroy(&dd->vnic.vesw_idr);
+}
+
+static u64 create_bypass_pbc(u32 vl, u32 dw_len)
+{
+	u64 pbc;
+
+	pbc = ((u64)PBC_IHCRC_NONE << PBC_INSERT_HCRC_SHIFT)
+		| PBC_INSERT_BYPASS_ICRC | PBC_CREDIT_RETURN
+		| PBC_PACKET_BYPASS
+		| ((vl & PBC_VL_MASK) << PBC_VL_SHIFT)
+		| (dw_len & PBC_LENGTH_DWS_MASK) << PBC_LENGTH_DWS_SHIFT;
+
+	return pbc;
+}
+
+int hfi1_vnic_put_skb(struct hfi_vnic_device *vdev,
+		      u8 q_idx, struct sk_buff *skb)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+	struct hfi1_devdata *dd = vinfo->dd;
+	u32 vl, pkt_len, total_len;
+	u8 sc5, pad_len;
+	int ret = 0;
+	u64 pbc;
+
+	if (q_idx >= vdev->hfi_info.num_tx_q) {
+		dev_kfree_skb_any(skb);
+		return -EINVAL;
+	}
+
+	/* add tail padding (for 8 bytes size alignment) and icrc */
+	pad_len = -(skb->len + HFI1_VNIC_ICRC_TAIL_LEN) & 0x7;
+	pad_len += HFI1_VNIC_ICRC_TAIL_LEN;
+
+	/*
+	 * pkt_len is how much data we have to write; it includes header and
+	 * data. total_len is the length of the packet in Dwords plus the PBC;
+	 * it should not include the CRC.
+	 */
+	pkt_len = (skb->len + pad_len) >> 2;
+	total_len = pkt_len + 2; /* PBC + packet */
+
+	sc5 = hfi1_vnic_get_sc5(skb->data);
+	vl = sc_to_vlt(dd, sc5);
+	pbc = create_bypass_pbc(vl, total_len);
+
+	dev_dbg(&vdev->dev, "%s: pbc 0x%016llX len %d pad_len %d\n",
+		__func__, pbc, skb->len, pad_len);
+
+	ret = dd->process_vnic_dma_send(dd, q_idx, vinfo, skb,
+					pbc, pad_len);
+
+	if (ret) {
+		if (ret == -ENOMEM)
+			vdev->hfi_stats[q_idx].tx_fifo_errors++;
+		else if (ret != -EBUSY)
+			vdev->hfi_stats[q_idx].tx_logic_errors++;
+	}
+
+	return ret;
+}
+
+u8 hfi1_vnic_select_queue(struct hfi_vnic_device *vdev, u8 vl, u8 entropy)
+{
+	return 0;
+}
+
+bool hfi1_vnic_get_write_avail(struct hfi_vnic_device *vdev, u8 q_idx)
+{
+	if (q_idx >= vdev->hfi_info.num_tx_q)
+		return false;
+
+	return true;
+}
+
+void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet)
+{
+	struct hfi1_devdata *dd = packet->rcd->dd;
+	struct hfi1_vnic_vport_info *vinfo;
+	struct hfi_vnic_device *vdev = NULL;
+	struct hfi1_vnic_notifier *notifier;
+	struct sk_buff *skb;
+	int l4_type, vesw_id = -1;
+	u8 q_idx;
+
+	rcu_read_lock();
+	l4_type = HFI1_GET_L4_TYPE(packet->ebuf);
+	if (l4_type == HFI1_VNIC_L4_ETHR) {
+		vesw_id = HFI1_VNIC_GET_VESWID(packet->ebuf);
+		vdev = idr_find(&dd->vnic.vesw_idr, vesw_id);
+
+		/*
+		 * In case of invalid vesw id, update the rx_bad_veswid
+		 * error count of first available vdev.
+		 */
+		if (unlikely(!vdev)) {
+			struct hfi_vnic_device *vdev_tmp;
+			int id_tmp = 0;
+
+			vdev_tmp =  idr_get_next(&dd->vnic.vesw_idr, &id_tmp);
+			if (vdev_tmp) {
+				spin_lock(&vdev_cntr_lock);
+				vdev_tmp->hfi_stats[0].rx_bad_veswid++;
+				spin_unlock(&vdev_cntr_lock);
+			}
+		}
+	}
+
+	if (unlikely(!vdev)) {
+		dev_warn(&dd->vnic.ctrl_dev->dev,
+			 "Invalid packet received, l4 %d vesw id %d, ctx %d\n",
+			 l4_type, vesw_id, packet->rcd->ctxt);
+		goto rcv_done;
+	}
+
+	vinfo = vdev->hfi_priv;
+	q_idx = packet->rcd->vnic_q_idx;
+	notifier = rcu_dereference(vinfo->notifier);
+	if (!notifier || !notifier->cb) {
+		vdev->hfi_stats[q_idx].rx_logic_errors++;
+		goto rcv_done;
+	}
+
+	if (skb_queue_len(&vinfo->skbq[q_idx]) > HFI1_VNIC_RCV_Q_SIZE) {
+		vdev->hfi_stats[q_idx].rx_fifo_errors++;
+		goto rcv_done;
+	}
+
+	skb = netdev_alloc_skb(vdev->netdev, packet->tlen);
+	if (!skb) {
+		vdev->hfi_stats[q_idx].rx_missed_errors++;
+		goto rcv_done;
+	}
+	memcpy(skb->data, packet->ebuf, packet->tlen);
+	skb_put(skb, packet->tlen);
+
+	skb_queue_tail(&vinfo->skbq[q_idx], skb);
+	if (test_bit((HFI_VNIC_EVT_RX0 + q_idx), vinfo->event_flags))
+		notifier->cb(vdev, HFI_VNIC_EVT_RX0 + q_idx);
+
+rcv_done:
+	rcu_read_unlock();
+}
+
+u16 hfi1_vnic_get_read_avail(struct hfi_vnic_device *vdev, u8 q_idx)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+
+	if (q_idx >= vdev->hfi_info.num_rx_q)
+		return 0;
+
+	return skb_queue_len(&vinfo->skbq[q_idx]);
+}
+
+struct sk_buff *hfi1_vnic_get_skb(struct hfi_vnic_device *vdev, u8 q_idx)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+	unsigned char *pad_info;
+	struct sk_buff *skb;
+
+	if (q_idx >= vdev->hfi_info.num_rx_q)
+		return NULL;
+
+	skb = skb_dequeue(&vinfo->skbq[q_idx]);
+	if (!skb)
+		return NULL;
+
+	/* remove tail padding and icrc */
+	pad_info = skb->data + skb->len - 1;
+	skb_trim(skb, (skb->len - HFI1_VNIC_ICRC_TAIL_LEN -
+		       ((*pad_info) & 0x7)));
+
+	return skb;
+}
+
+void hfi1_vnic_config_notify(struct hfi_vnic_device *vdev, u8 evt, bool enable)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+
+	if (enable)
+		set_bit(evt, vinfo->event_flags);
+	else
+		clear_bit(evt, vinfo->event_flags);
+}
+
+int hfi1_vnic_open(struct hfi_vnic_device *vdev, hfi_vnic_evt_cb_fn cb)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+	struct hfi1_devdata *dd = vinfo->dd;
+	struct hfi1_vnic_notifier *notifier;
+	int i, rc;
+
+	if (!cb)
+		return -EINVAL;
+
+	notifier = kmalloc(sizeof(*notifier), GFP_KERNEL);
+	if (!notifier)
+		return -ENOMEM;
+
+	notifier->cb = cb;
+
+	/* ensure virtual eth switch id is valid */
+	if (!vdev->vesw_id) {
+		rc = -EINVAL;
+		goto open_fail;
+	}
+
+	rc = idr_alloc(&dd->vnic.vesw_idr, vdev, vdev->vesw_id,
+		       vdev->vesw_id + 1, GFP_NOWAIT);
+	if (rc < 0)
+		goto open_fail;
+
+	for (i = 0; i < HFI1_NUM_VNIC_CTXT; i++)
+		skb_queue_head_init(&vinfo->skbq[i]);
+
+	/* Enable all events */
+	for (i = 0; i < HFI_VNIC_NUM_EVTS; i++)
+		set_bit(i, vinfo->event_flags);
+
+	rcu_assign_pointer(vinfo->notifier, notifier);
+	synchronize_rcu();
+	return 0;
+
+open_fail:
+	kfree(notifier);
+	return rc;
+}
+
+void hfi1_vnic_close(struct hfi_vnic_device *vdev)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+	struct hfi1_devdata *dd = vinfo->dd;
+	struct hfi1_vnic_notifier *notifier;
+	u8 i;
+
+	idr_remove(&dd->vnic.vesw_idr, vdev->vesw_id);
+	notifier = rcu_access_pointer(vinfo->notifier);
+	rcu_assign_pointer(vinfo->notifier, NULL);
+	synchronize_rcu();
+	kfree(notifier);
+
+	/* remove unread skbs */
+	for (i = 0; i < HFI1_NUM_VNIC_CTXT; i++)
+		skb_queue_purge(&vinfo->skbq[i]);
+}
+
+static int hfi1_vnic_allot_ctxt(struct hfi1_devdata *dd,
+				struct hfi1_ctxtdata **vnic_ctxt)
+{
+	int rc;
+
+	rc = allocate_vnic_ctxt(dd, vnic_ctxt);
+	if (rc) {
+		dd_dev_err(dd, "vnic ctxt alloc failed %d\n", rc);
+		return rc;
+	}
+
+	rc = setup_vnic_ctxt(dd, *vnic_ctxt);
+	if (rc) {
+		dd_dev_err(dd, "vnic ctxt setup failed %d\n", rc);
+		deallocate_vnic_ctxt(dd, *vnic_ctxt);
+		*vnic_ctxt = NULL;
+	}
+
+	return rc;
+}
+
+int hfi1_vnic_init(struct hfi_vnic_device *vdev)
+{
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+	struct hfi1_devdata *dd = vinfo->dd;
+	int i, rc = 0;
+
+	mutex_lock(&hfi1_mutex);
+	for (i = dd->vnic.num_ctxt; i < vdev->hfi_info.num_rx_q; i++) {
+		rc = hfi1_vnic_allot_ctxt(dd, &dd->vnic.ctxt[i]);
+		if (rc)
+			break;
+		dd->vnic.ctxt[i]->vnic_q_idx = i;
+	}
+
+	if (i < vdev->hfi_info.num_rx_q) {
+		/*
+		 * If the required number of contexts could not be
+		 * allocated, release the contexts allocated by
+		 * this call.
+		 */
+		while (i-- > dd->vnic.num_ctxt) {
+			deallocate_vnic_ctxt(dd, dd->vnic.ctxt[i]);
+			dd->vnic.ctxt[i] = NULL;
+		}
+		goto alloc_fail;
+	}
+
+	if (dd->vnic.num_ctxt != i) {
+		dd->vnic.num_ctxt = i;
+		hfi1_init_vnic_rsm(dd);
+	}
+
+	dd->vnic.num_vports++;
+	vinfo->vdev = vdev;
+alloc_fail:
+	mutex_unlock(&hfi1_mutex);
+	return rc;
+}
+
+void hfi1_vnic_deinit(struct hfi_vnic_device *vdev)
+{
+	struct hfi1_devdata *dd = vnic_dev2dd(vdev);
+	int i;
+
+	mutex_lock(&hfi1_mutex);
+	if (--dd->vnic.num_vports == 0) {
+		for (i = 0; i < dd->vnic.num_ctxt; i++) {
+			deallocate_vnic_ctxt(dd, dd->vnic.ctxt[i]);
+			dd->vnic.ctxt[i] = NULL;
+		}
+		hfi1_deinit_vnic_rsm(dd);
+		dd->vnic.num_ctxt = 0;
+	}
+
+	mutex_unlock(&hfi1_mutex);
+}
diff --git a/drivers/infiniband/hw/hfi1/vnic_sdma.c b/drivers/infiniband/hw/hfi1/vnic_sdma.c
new file mode 100644
index 0000000..66abad0
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/vnic_sdma.c
@@ -0,0 +1,60 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+/*
+ * This file contains HFI1 support for VNIC SDMA functionality
+ */
+
+#include "sdma.h"
+#include "vnic.h"
+
+int hfi1_vnic_send_dma(struct hfi1_devdata *dd, u8 q_idx,
+		       struct hfi1_vnic_vport_info *vinfo,
+		       struct sk_buff *skb, u64 pbc, u8 plen)
+{
+	return 0;
+}
diff --git a/include/rdma/opa_port_info.h b/include/rdma/opa_port_info.h
index 9303e0e..84caa5b 100644
--- a/include/rdma/opa_port_info.h
+++ b/include/rdma/opa_port_info.h
@@ -410,7 +410,7 @@ struct opa_port_info {
 
 	u8     resptimevalue;		        /* 3 res, 5 bits */
 	u8     local_port_num;
-	u8     reserved12;
+	u8     num_vesw_port_supported;
 	u8     reserved13;                       /* was guid_cap */
 } __attribute__ ((packed));
 
-- 
1.8.3.1
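
For reference, the tail-padding arithmetic in hfi1_vnic_put_skb() above can
be tried out in user space. The following is only a sketch under stated
assumptions: struct tx_len and vnic_tx_len() are hypothetical names, and the
ICRC/tail length is passed in as a parameter because the value of
HFI1_VNIC_ICRC_TAIL_LEN is not shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type; not part of the patch. */
struct tx_len {
	uint8_t  pad_len;	/* alignment padding plus ICRC/tail bytes */
	uint32_t pkt_dwords;	/* padded packet length in 4-byte words */
	uint32_t total_dwords;	/* pkt_dwords plus the 8-byte (2-dword) PBC */
};

/*
 * Mirror of the length computation in hfi1_vnic_put_skb(): pad the packet
 * so that (len + icrc_tail_len + padding) is a multiple of 8 bytes.
 */
static struct tx_len vnic_tx_len(uint32_t len, uint8_t icrc_tail_len)
{
	struct tx_len out;

	/* at most 7 padding bytes are added; keep the u8 sum from wrapping */
	assert(icrc_tail_len <= UINT8_MAX - 7);

	/* bytes needed to round (len + icrc_tail_len) up to a multiple of 8 */
	out.pad_len = (uint8_t)(-(len + icrc_tail_len) & 0x7);
	out.pad_len += icrc_tail_len;

	out.pkt_dwords = (len + out.pad_len) >> 2;
	out.total_dwords = out.pkt_dwords + 2;	/* PBC + packet */
	return out;
}
```

With a 60-byte frame and an assumed ICRC/tail length of 5, this yields 7
alignment bytes, pad_len 12, an 18-dword packet and a total of 20 dwords.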

* [RFC 06/10] IB/hfi-vnic: VNIC MAC table support
From: Vishwanathapura, Niranjana @ 2016-11-18 22:42 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Dennis Dalessandro, Niranjana Vishwanathapura,
	Sadanand Warrier
In-Reply-To: <1479508938-63799-1-git-send-email-niranjana.vishwanathapura@intel.com>

The HFI VNIC MAC table contains the MAC address to DLID mappings provided
by the Ethernet manager. During transmission, the MAC table provides the
MAC address to DLID translation. Implement the MAC table as a simple hash
list, and add support for the Ethernet manager to update and query it.
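
As a rough illustration of the lookup side, a minimal user-space sketch of
such a hash list is given here. It is not the driver code: mac_tbl,
mac_node, mac_tbl_add() and mac_tbl_lookup() are hypothetical names, the
bucket is selected directly by the last MAC octet instead of going through
the kernel's hash_min(), and RCU and locking are left out. A DLID of 0
stands for "no entry".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAC_LEN		6
#define TBL_BUCKETS	256	/* one bucket per value of the key octet */

struct mac_node {
	struct mac_node *next;
	uint8_t  mac[MAC_LEN];
	uint32_t dlid;
};

struct mac_tbl {
	struct mac_node *bucket[TBL_BUCKETS];
};

/* The key is the last octet of the MAC address. */
static unsigned int mac_key(const uint8_t *mac)
{
	return mac[MAC_LEN - 1];
}

/* Add a MAC -> DLID mapping; returns 0 on success, -1 if out of memory. */
static int mac_tbl_add(struct mac_tbl *tbl, const uint8_t *mac, uint32_t dlid)
{
	struct mac_node *node;

	assert(dlid != 0);	/* 0 is reserved to mean "no entry" */

	node = calloc(1, sizeof(*node));
	if (!node)
		return -1;

	memcpy(node->mac, mac, MAC_LEN);
	node->dlid = dlid;
	node->next = tbl->bucket[mac_key(mac)];
	tbl->bucket[mac_key(mac)] = node;
	return 0;
}

/* Walk only the bucket selected by the key; return 0 if nothing matches. */
static uint32_t mac_tbl_lookup(const struct mac_tbl *tbl, const uint8_t *mac)
{
	const struct mac_node *node;

	for (node = tbl->bucket[mac_key(mac)]; node; node = node->next)
		if (!memcmp(node->mac, mac, MAC_LEN))
			return node->dlid;
	return 0;
}
```

Two addresses that share the last octet land in the same bucket and are
told apart by the full memcmp(). A caller that gets 0 back would fall
through to its default DLID selection.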

Change-Id: Ibe88bcd65ac47c316d2ac4ef746b12f82dcea274
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
---
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c        | 236 +++++++++++++++++++++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h     |  53 ++++-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c       |   4 +
 3 files changed, 292 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
index 5a5e5a7..ffdd7b3 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
@@ -105,6 +105,238 @@
 
 #define HFI_VNIC_SC_MASK 0x1f
 
+/*
+ * Using a simple hash table for mac table implementation with the last octet
+ * of mac address as a key.
+ */
+static void hfi_vnic_free_mac_tbl(struct hlist_head *mactbl)
+{
+	struct hfi_vnic_mac_tbl_node *node;
+	struct hlist_node *tmp;
+	int bkt;
+
+	if (!mactbl)
+		return;
+
+	vnic_hash_for_each_safe(mactbl, bkt, tmp, node, hlist) {
+		hash_del(&node->hlist);
+		kfree(node);
+	}
+	kfree(mactbl);
+}
+
+static struct hlist_head *hfi_vnic_alloc_mac_tbl(void)
+{
+	u32 size = sizeof(struct hlist_head) * HFI_VNIC_MAC_TBL_SIZE;
+	struct hlist_head *mactbl;
+
+	mactbl = kzalloc(size, GFP_KERNEL);
+	if (!mactbl)
+		return ERR_PTR(-ENOMEM);
+
+	vnic_hash_init(mactbl);
+	return mactbl;
+}
+
+/* hfi_vnic_release_mac_tbl - empty and free the mac table */
+void hfi_vnic_release_mac_tbl(struct hfi_vnic_adapter *adapter)
+{
+	struct hlist_head *mactbl;
+
+	mutex_lock(&adapter->mactbl_lock);
+	mactbl = rcu_access_pointer(adapter->mactbl);
+	rcu_assign_pointer(adapter->mactbl, NULL);
+	synchronize_rcu();
+	hfi_vnic_free_mac_tbl(mactbl);
+	mutex_unlock(&adapter->mactbl_lock);
+}
+
+/*
+ * hfi_vnic_query_mac_tbl - query the mac table for a section
+ *
+ * This function implements query of a specific section of the mac table.
+ * The function also expects the requested range to be valid.
+ */
+void hfi_vnic_query_mac_tbl(struct hfi_vnic_adapter *adapter,
+			    struct hfi_veswport_mactable *tbl)
+{
+	struct hfi_vnic_mac_tbl_node *node;
+	struct hlist_head *mactbl;
+	int bkt;
+	u16 loffset, lnum_entries;
+
+	rcu_read_lock();
+	mactbl = rcu_dereference(adapter->mactbl);
+	if (!mactbl)
+		goto get_mac_done;
+
+	loffset = be16_to_cpu(tbl->offset);
+	lnum_entries = be16_to_cpu(tbl->num_entries);
+
+	vnic_hash_for_each(mactbl, bkt, node, hlist) {
+		struct __hfi_vnic_mactable_entry *nentry = &node->entry;
+		struct hfi_veswport_mactable_entry *entry;
+
+		if ((node->index < loffset) ||
+		    (node->index >= (loffset + lnum_entries)))
+			continue;
+
+		/* populate entry in the tbl corresponding to the index */
+		entry = &tbl->tbl_entries[node->index - loffset];
+		memcpy(entry->mac_addr, nentry->mac_addr,
+		       ARRAY_SIZE(entry->mac_addr));
+		memcpy(entry->mac_addr_mask, nentry->mac_addr_mask,
+		       ARRAY_SIZE(entry->mac_addr_mask));
+		entry->dlid_sd.dw = cpu_to_be32(nentry->dlid_sd.dw);
+	}
+	tbl->mac_tbl_digest = cpu_to_be32(adapter->info.vport.mac_tbl_digest);
+get_mac_done:
+	rcu_read_unlock();
+}
+
+/*
+ * hfi_vnic_update_mac_tbl - update mac table section
+ *
+ * This function updates the specified section of the mac table.
+ * The procedure includes following steps.
+ *  - Allocate a new mac (hash) table.
+ *  - Add the specified entries to the new table.
+ *    (except the ones that are requested to be deleted).
+ *  - Add all the other entries from the old mac table.
+ *  - If there is a failure, free the new table and return.
+ *  - Switch to the new table.
+ *  - Free the old table and return.
+ *
+ * The function also expects the requested range to be valid.
+ */
+int hfi_vnic_update_mac_tbl(struct hfi_vnic_adapter *adapter,
+			    struct hfi_veswport_mactable *tbl)
+{
+	struct hfi_vnic_mac_tbl_node *node, *new_node;
+	struct hlist_head *new_mactbl, *old_mactbl;
+	int i, bkt, rc = 0;
+	u8 key;
+	u16 loffset, lnum_entries;
+
+	mutex_lock(&adapter->mactbl_lock);
+	/* allocate new mac table */
+	new_mactbl = hfi_vnic_alloc_mac_tbl();
+	if (IS_ERR(new_mactbl)) {
+		mutex_unlock(&adapter->mactbl_lock);
+		return PTR_ERR(new_mactbl);
+	}
+
+	loffset = be16_to_cpu(tbl->offset);
+	lnum_entries = be16_to_cpu(tbl->num_entries);
+
+	/* add updated entries to the new mac table */
+	for (i = 0; i < lnum_entries; i++) {
+		struct __hfi_vnic_mactable_entry *nentry;
+		struct hfi_veswport_mactable_entry *entry =
+							&tbl->tbl_entries[i];
+		u8 *mac_addr = entry->mac_addr;
+		u8 empty_mac[ETH_ALEN] = { 0 };
+
+		v_dbg("new mac entry %4d: %02x:%02x:%02x:%02x:%02x:%02x %x\n",
+		      loffset + i, mac_addr[0], mac_addr[1], mac_addr[2],
+		      mac_addr[3], mac_addr[4], mac_addr[5],
+		      entry->dlid_sd.dw);
+
+		/* if the entry is being removed, do not add it */
+		if (!memcmp(mac_addr, empty_mac, ARRAY_SIZE(empty_mac)))
+			continue;
+
+		node = kzalloc(sizeof(*node), GFP_KERNEL);
+		if (!node) {
+			rc = -ENOMEM;
+			goto updt_done;
+		}
+
+		node->index = loffset + i;
+		nentry = &node->entry;
+		memcpy(nentry->mac_addr, entry->mac_addr,
+		       ARRAY_SIZE(nentry->mac_addr));
+		memcpy(nentry->mac_addr_mask, entry->mac_addr_mask,
+		       ARRAY_SIZE(nentry->mac_addr_mask));
+		nentry->dlid_sd.dw = be32_to_cpu(entry->dlid_sd.dw);
+		key = node->entry.mac_addr[HFI_VNIC_MAC_HASH_IDX];
+		vnic_hash_add(new_mactbl, &node->hlist, key);
+	}
+
+	/* add other entries from current mac table to new mac table */
+	old_mactbl = rcu_access_pointer(adapter->mactbl);
+	if (!old_mactbl)
+		goto switch_tbl;
+
+	vnic_hash_for_each(old_mactbl, bkt, node, hlist) {
+		if ((node->index >= loffset) &&
+		    (node->index < (loffset + lnum_entries)))
+			continue;
+
+		new_node = kzalloc(sizeof(*new_node), GFP_KERNEL);
+		if (!new_node) {
+			rc = -ENOMEM;
+			goto updt_done;
+		}
+
+		new_node->index = node->index;
+		memcpy(&new_node->entry, &node->entry, sizeof(node->entry));
+		key = new_node->entry.mac_addr[HFI_VNIC_MAC_HASH_IDX];
+		vnic_hash_add(new_mactbl, &new_node->hlist, key);
+	}
+
+switch_tbl:
+	/* switch to new table */
+	rcu_assign_pointer(adapter->mactbl, new_mactbl);
+	synchronize_rcu();
+
+	adapter->info.vport.mac_tbl_digest = be32_to_cpu(tbl->mac_tbl_digest);
+updt_done:
+	/* upon failure, free the new table; otherwise, free the old table */
+	if (rc)
+		hfi_vnic_free_mac_tbl(new_mactbl);
+	else
+		hfi_vnic_free_mac_tbl(old_mactbl);
+
+	mutex_unlock(&adapter->mactbl_lock);
+	return rc;
+}
+
+/* hfi_vnic_chk_mac_tbl - check mac table for dlid */
+static uint32_t hfi_vnic_chk_mac_tbl(struct hfi_vnic_adapter *adapter,
+				     struct ethhdr *mac_hdr)
+{
+	struct hfi_vnic_mac_tbl_node *node;
+	struct hlist_head *mactbl;
+	u32 dlid = 0;
+	u8 key;
+
+	rcu_read_lock();
+	mactbl = rcu_dereference(adapter->mactbl);
+	if (!mactbl)
+		goto chk_done;
+
+	key = mac_hdr->h_dest[HFI_VNIC_MAC_HASH_IDX];
+	vnic_hash_for_each_possible(mactbl, node, hlist, key) {
+		struct __hfi_vnic_mactable_entry *entry = &node->entry;
+
+		/* if related to source mac, skip */
+		if (entry->dlid_sd.sd_is_src_mac)
+			continue;
+
+		if (!memcmp(node->entry.mac_addr, mac_hdr->h_dest,
+			    ARRAY_SIZE(node->entry.mac_addr))) {
+			/* mac address found */
+			dlid = node->entry.dlid_sd.dlid;
+			break;
+		}
+	}
+
+chk_done:
+	rcu_read_unlock();
+	return dlid;
+}
+
 /* hfi_vnic_get_dlid - find and return the DLID */
 static uint32_t hfi_vnic_get_dlid(struct hfi_vnic_adapter *adapter,
 				  struct sk_buff *skb, u8 def_port)
@@ -113,6 +345,10 @@ static uint32_t hfi_vnic_get_dlid(struct hfi_vnic_adapter *adapter,
 	struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
 	u32 dlid;
 
+	dlid = hfi_vnic_chk_mac_tbl(adapter, mac_hdr);
+	if (dlid)
+		return dlid;
+
 	if (is_multicast_ether_addr(mac_hdr->h_dest)) {
 		dlid = info->vesw.u_mcast_dlid;
 	} else {
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
index c48e676..21a43f6 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
@@ -261,6 +261,8 @@ struct hfi_vnic_rx_queue {
  * @lock: adapter lock
  * @rxq: receive queue array
  * @info: virtual ethernet switch port information
+ * @mactbl: hash table of MAC entries
+ * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
  * @flow_tbl: flow to default port redirection table
  * @q_sum_cntrs: per queue EM summary counters
@@ -283,7 +285,11 @@ struct hfi_vnic_adapter {
 
 	struct hfi_vnic_rx_queue  rxq[HFI_VNIC_MAX_QUEUE];
 
-	struct __hfi_veswport_info info;
+	struct __hfi_veswport_info  info;
+	struct hlist_head  __rcu   *mactbl;
+
+	/* Lock used to protect updates to mac table */
+	struct mutex mactbl_lock;
 
 	/* Lock used to protect access to vnic counters */
 	struct mutex stats_lock;
@@ -303,6 +309,25 @@ struct hfi_vnic_adapter {
 	struct __hfi_vnic_error_counters    err_cntrs;
 };
 
+/* Same as hfi_veswport_mactable_entry, but without bitwise attribute */
+struct __hfi_vnic_mactable_entry {
+	u8                         mac_addr[ETH_ALEN];
+	u8                         mac_addr_mask[ETH_ALEN];
+	union __hfi_vnic_dlid_sd   dlid_sd;
+} __packed;
+
+/**
+ * struct hfi_vnic_mac_tbl_node - HFI VNIC mac table node
+ * @hlist: hash list handle
+ * @index: index of entry in the mac table
+ * @entry: entry in the table
+ */
+struct hfi_vnic_mac_tbl_node {
+	struct hlist_node                    hlist;
+	u16                                  index;
+	struct __hfi_vnic_mactable_entry     entry;
+};
+
 #define v_dbg(format, arg...) \
 	netdev_dbg(adapter->netdev, format, ## arg)
 #define v_err(format, arg...) \
@@ -326,12 +351,38 @@ struct hfi_vnic_adapter {
 #define HFI_VNIC_MAC_TBL_HASH_BITS    8
 #define HFI_VNIC_MAC_TBL_SIZE  BIT(HFI_VNIC_MAC_TBL_HASH_BITS)
 
+/* VNIC HASH MACROS */
+#define vnic_hash_init(hashtable) __hash_init(hashtable, HFI_VNIC_MAC_TBL_SIZE)
+
+#define vnic_hash_add(hashtable, node, key)                                   \
+	hlist_add_head(node,                                                  \
+		&hashtable[hash_min(key, ilog2(HFI_VNIC_MAC_TBL_SIZE))])
+
+#define vnic_hash_for_each_safe(name, bkt, tmp, obj, member)                  \
+	for ((bkt) = 0, obj = NULL;                                           \
+		    !obj && (bkt) < HFI_VNIC_MAC_TBL_SIZE; (bkt)++)           \
+		hlist_for_each_entry_safe(obj, tmp, &name[bkt], member)
+
+#define vnic_hash_for_each_possible(name, obj, member, key)                   \
+	hlist_for_each_entry(obj,                                             \
+		&name[hash_min(key, ilog2(HFI_VNIC_MAC_TBL_SIZE))], member)
+
+#define vnic_hash_for_each(name, bkt, obj, member)                            \
+	for ((bkt) = 0, obj = NULL;                                           \
+		    !obj && (bkt) < HFI_VNIC_MAC_TBL_SIZE; (bkt)++)           \
+		hlist_for_each_entry(obj, &name[bkt], member)
+
 extern char hfi_vnic_driver_name[];
 extern const char hfi_vnic_driver_version[];
 
 int hfi_vnic_encap_skb(struct hfi_vnic_adapter *adapter, struct sk_buff *skb);
 int hfi_vnic_decap_skb(struct hfi_vnic_rx_queue *rxq, struct sk_buff *skb);
 u8 hfi_vnic_calc_entropy(struct hfi_vnic_adapter *adapter, struct sk_buff *skb);
+void hfi_vnic_release_mac_tbl(struct hfi_vnic_adapter *adapter);
+void hfi_vnic_query_mac_tbl(struct hfi_vnic_adapter *adapter,
+			    struct hfi_veswport_mactable *tbl);
+int hfi_vnic_update_mac_tbl(struct hfi_vnic_adapter *adapter,
+			    struct hfi_veswport_mactable *tbl);
 void hfi_vnic_update_stats(struct net_device *netdev);
 void hfi_vnic_set_ethtool_ops(struct net_device *ndev);
 
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
index f134225..ee18610 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
@@ -625,6 +625,7 @@ static int hfi_vnic_drv_probe(struct device *dev)
 	netdev->netdev_ops = &hfi_netdev_ops;
 	netdev->hard_header_len += HFI_VNIC_SKB_HEADROOM;
 	mutex_init(&adapter->lock);
+	mutex_init(&adapter->mactbl_lock);
 	mutex_init(&adapter->stats_lock);
 	strcpy(netdev->name, "veth%d");
 
@@ -652,6 +653,7 @@ static int hfi_vnic_drv_probe(struct device *dev)
 	vdev->bus_ops->deinit(vdev);
 hw_err:
 	mutex_destroy(&adapter->lock);
+	mutex_destroy(&adapter->mactbl_lock);
 	mutex_destroy(&adapter->stats_lock);
 	free_netdev(netdev);
 	dev_err(dev, "initialization failed %d\n", rc);
@@ -668,7 +670,9 @@ static int hfi_vnic_drv_remove(struct device *dev)
 
 	unregister_netdev(vdev->netdev);
 	vdev->bus_ops->deinit(vdev);
+	hfi_vnic_release_mac_tbl(adapter);
 	mutex_destroy(&adapter->lock);
+	mutex_destroy(&adapter->mactbl_lock);
 	mutex_destroy(&adapter->stats_lock);
 	free_netdev(vdev->netdev);
 
-- 
1.8.3.1
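
The update procedure described in the hfi_vnic_update_mac_tbl() comment
(build a new table, carry over untouched entries, switch, free the old one)
can be sketched in user space roughly as follows. This is a simplified
sketch, not the driver code: struct entry, struct table, table_update() and
table_find() are hypothetical, a flat array replaces the hash list, a DLID
of 0 marks an entry for deletion (the patch uses an all-zero MAC address
for that), and the RCU publish/synchronize step is reduced to a plain
pointer assignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct entry {
	uint16_t index;		/* position in the manager's table */
	uint32_t dlid;
};

struct table {
	size_t count;
	struct entry e[];	/* C99 flexible array member */
};

/*
 * Replace the section [offset, offset + n) of *tblp with dlids[].
 * On failure the old table is left untouched; on success the old
 * table is freed after the switch.
 */
static int table_update(struct table **tblp, uint16_t offset,
			const uint32_t *dlids, uint16_t n)
{
	struct table *old_tbl, *new_tbl;
	size_t old_count, i;

	assert(tblp != NULL);
	old_tbl = *tblp;
	old_count = old_tbl ? old_tbl->count : 0;

	/* worst case: every updated entry and every old entry is kept */
	new_tbl = malloc(sizeof(*new_tbl) +
			 (old_count + n) * sizeof(struct entry));
	if (!new_tbl)
		return -1;
	new_tbl->count = 0;

	/* add the updated entries, except the ones requested to be deleted */
	for (i = 0; i < n; i++) {
		if (!dlids[i])
			continue;
		new_tbl->e[new_tbl->count].index = (uint16_t)(offset + i);
		new_tbl->e[new_tbl->count].dlid = dlids[i];
		new_tbl->count++;
	}

	/* carry over the old entries that lie outside the updated section */
	for (i = 0; i < old_count; i++) {
		uint16_t idx = old_tbl->e[i].index;

		if (idx >= offset && idx < offset + n)
			continue;
		new_tbl->e[new_tbl->count++] = old_tbl->e[i];
	}

	/* switch to the new table, then free the old one */
	*tblp = new_tbl;
	free(old_tbl);
	return 0;
}

/* Return the DLID stored for index, or 0 if there is none. */
static uint32_t table_find(const struct table *tbl, uint16_t index)
{
	size_t i;

	for (i = 0; tbl && i < tbl->count; i++)
		if (tbl->e[i].index == index)
			return tbl->e[i].dlid;
	return 0;
}
```

Because readers only ever see either the complete old table or the complete
new one, a failed update leaves lookups unaffected. In the kernel, the
pointer switch would need rcu_assign_pointer() and a grace period before
the old table is freed, as the patch does.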

* [RFC 05/10] IB/hfi-vnic: VNIC statistics support
From: Vishwanathapura, Niranjana @ 2016-11-18 22:42 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Dennis Dalessandro, Niranjana Vishwanathapura
In-Reply-To: <1479508938-63799-1-git-send-email-niranjana.vishwanathapura@intel.com>

The HFI VNIC driver maintains various counters, including the standard
netdev counters and the counters defined by the Ethernet manager.
Add an ethtool hook to read them.
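
The descriptor-table approach used for the ethtool counters (a name plus the
size and offset of each field) can be illustrated in user space as below.
This is only a sketch: struct demo_adapter, DEMO_STAT(), demo_stats[] and
demo_get_stats() are hypothetical stand-ins, and the flat counter layout is
an assumption made for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the structure that holds the counters. */
struct demo_adapter {
	uint64_t tx_packets;
	uint32_t tx_dlid_zero;
	uint64_t rx_runt;
};

struct stat_desc {
	const char *name;
	size_t size;		/* sizeof the counter field */
	size_t offset;		/* offset of the field in demo_adapter */
};

#define DEMO_STAT(m) { #m, sizeof(((struct demo_adapter *)0)->m), \
		       offsetof(struct demo_adapter, m) }

static const struct stat_desc demo_stats[] = {
	DEMO_STAT(tx_packets),
	DEMO_STAT(tx_dlid_zero),
	DEMO_STAT(rx_runt),
};

#define DEMO_NUM_STATS (sizeof(demo_stats) / sizeof(demo_stats[0]))

/* Copy every described counter into out[], widened to 64 bits. */
static void demo_get_stats(const struct demo_adapter *a, uint64_t *out)
{
	size_t i;

	for (i = 0; i < DEMO_NUM_STATS; i++) {
		const char *p = (const char *)a + demo_stats[i].offset;

		if (demo_stats[i].size == sizeof(uint64_t)) {
			uint64_t v;

			memcpy(&v, p, sizeof(v));
			out[i] = v;
		} else {
			uint32_t v;

			assert(demo_stats[i].size == sizeof(uint32_t));
			memcpy(&v, p, sizeof(v));
			out[i] = v;
		}
	}
}
```

Adding a counter then only requires a new field and one more DEMO_STAT()
line; the read loop stays unchanged.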

Change-Id: I6d828c2ce5eeae73d611174a985ff41f83480562
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c        |  19 +-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c      | 131 +++++++++++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h     |  84 +++++++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c       | 260 ++++++++++++++++++++-
 4 files changed, 486 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
index 9804c6d..5a5e5a7 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
@@ -210,8 +210,10 @@ int hfi_vnic_encap_skb(struct hfi_vnic_adapter *adapter, struct sk_buff *skb)
 	hdr->slid_high = info->vport.encap_slid >> 20;
 
 	dlid = hfi_vnic_get_dlid(adapter, skb, def_port);
-	if (unlikely(!dlid))
+	if (unlikely(!dlid)) {
+		adapter->q_err_cntrs[skb->queue_mapping].tx_dlid_zero++;
 		return -EFAULT;
+	}
 
 	hdr->dlid = dlid;
 	hdr->dlid_high = dlid >> 20;
@@ -234,6 +236,19 @@ int hfi_vnic_encap_skb(struct hfi_vnic_adapter *adapter, struct sk_buff *skb)
 /* hfi_vnic_decap_skb - strip OPA header from the skb (ethernet) packet */
 int hfi_vnic_decap_skb(struct hfi_vnic_rx_queue *rxq, struct sk_buff *skb)
 {
+	struct hfi_vnic_adapter *adapter = rxq->adapter;
+	int max_len = adapter->netdev->mtu + VLAN_ETH_HLEN;
+	int rc = -EFAULT;
+
 	skb_pull(skb, HFI_VNIC_HDR_LEN);
-	return 0;
+
+	/* Validate Packet length */
+	if (skb->len > max_len)
+		adapter->q_err_cntrs[rxq->idx].rx_oversize++;
+	else if (skb->len < ETH_ZLEN)
+		adapter->q_err_cntrs[rxq->idx].rx_runt++;
+	else
+		rc = 0;
+
+	return rc;
 }
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c
index 32bb9ce..ab4b00d 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c
@@ -54,6 +54,83 @@
 #include "hfi_vnic.h"
 #include "hfi_vnic_internal.h"
 
+enum {NETDEV_STATS, VNIC_STATS};
+
+struct vnic_stats {
+	char stat_string[ETH_GSTRING_LEN];
+	struct {
+		int type;
+		int sizeof_stat;
+		int stat_offset;
+	};
+};
+
+#define VNIC_STAT(m)            { VNIC_STATS,                               \
+				  FIELD_SIZEOF(struct hfi_vnic_adapter, m), \
+				  offsetof(struct hfi_vnic_adapter, m) }
+#define VNIC_NETDEV_STAT(m)     { NETDEV_STATS,                             \
+				  FIELD_SIZEOF(struct net_device, m),       \
+				  offsetof(struct net_device, m) }
+
+static struct vnic_stats vnic_gstrings_stats[] = {
+	/* NETDEV stats */
+	{"rx_packets", VNIC_NETDEV_STAT(stats.rx_packets)},
+	{"tx_packets", VNIC_NETDEV_STAT(stats.tx_packets)},
+	{"rx_bytes", VNIC_NETDEV_STAT(stats.rx_bytes)},
+	{"tx_bytes", VNIC_NETDEV_STAT(stats.tx_bytes)},
+	{"rx_errors", VNIC_NETDEV_STAT(stats.rx_errors)},
+	{"tx_errors", VNIC_NETDEV_STAT(stats.tx_errors)},
+	{"rx_dropped", VNIC_NETDEV_STAT(stats.rx_dropped)},
+	{"tx_dropped", VNIC_NETDEV_STAT(stats.tx_dropped)},
+
+	{"rx_fifo_errors", VNIC_NETDEV_STAT(stats.rx_fifo_errors)},
+	{"rx_missed_errors", VNIC_NETDEV_STAT(stats.rx_missed_errors)},
+	{"tx_carrier_errors", VNIC_NETDEV_STAT(stats.tx_carrier_errors)},
+	{"tx_fifo_errors", VNIC_NETDEV_STAT(stats.tx_fifo_errors)},
+
+	/* SUMMARY counters */
+	{"tx_unicast", VNIC_STAT(sum_cntrs.tx_grp.unicast)},
+	{"tx_mcastbcast", VNIC_STAT(sum_cntrs.tx_grp.mcastbcast)},
+	{"tx_untagged", VNIC_STAT(sum_cntrs.tx_grp.untagged)},
+	{"tx_vlan", VNIC_STAT(sum_cntrs.tx_grp.vlan)},
+
+	{"tx_64_size", VNIC_STAT(sum_cntrs.tx_grp.xx_64_size)},
+	{"tx_65_127", VNIC_STAT(sum_cntrs.tx_grp.xx_65_127)},
+	{"tx_128_255", VNIC_STAT(sum_cntrs.tx_grp.xx_128_255)},
+	{"tx_256_511", VNIC_STAT(sum_cntrs.tx_grp.xx_256_511)},
+	{"tx_512_1023", VNIC_STAT(sum_cntrs.tx_grp.xx_512_1023)},
+	{"tx_1024_1518", VNIC_STAT(sum_cntrs.tx_grp.xx_1024_1518)},
+	{"tx_1519_max", VNIC_STAT(sum_cntrs.tx_grp.xx_1519_max)},
+
+	{"rx_unicast", VNIC_STAT(sum_cntrs.rx_grp.unicast)},
+	{"rx_mcastbcast", VNIC_STAT(sum_cntrs.rx_grp.mcastbcast)},
+	{"rx_untagged", VNIC_STAT(sum_cntrs.rx_grp.untagged)},
+	{"rx_vlan", VNIC_STAT(sum_cntrs.rx_grp.vlan)},
+
+	{"rx_64_size", VNIC_STAT(sum_cntrs.rx_grp.xx_64_size)},
+	{"rx_65_127", VNIC_STAT(sum_cntrs.rx_grp.xx_65_127)},
+	{"rx_128_255", VNIC_STAT(sum_cntrs.rx_grp.xx_128_255)},
+	{"rx_256_511", VNIC_STAT(sum_cntrs.rx_grp.xx_256_511)},
+	{"rx_512_1023", VNIC_STAT(sum_cntrs.rx_grp.xx_512_1023)},
+	{"rx_1024_1518", VNIC_STAT(sum_cntrs.rx_grp.xx_1024_1518)},
+	{"rx_1519_max", VNIC_STAT(sum_cntrs.rx_grp.xx_1519_max)},
+
+	/* ERROR counters */
+	{"tx_smac_filt", VNIC_STAT(err_cntrs.tx_smac_filt)},
+	{"tx_dlid_zero", VNIC_STAT(err_cntrs.tx_dlid_zero)},
+	{"tx_logic", VNIC_STAT(err_cntrs.tx_logic)},
+	{"tx_drop_state", VNIC_STAT(err_cntrs.tx_drop_state)},
+
+	{"rx_bad_veswid", VNIC_STAT(err_cntrs.rx_bad_veswid)},
+	{"rx_runt", VNIC_STAT(err_cntrs.rx_runt)},
+	{"rx_oversize", VNIC_STAT(err_cntrs.rx_oversize)},
+	{"rx_eth_down", VNIC_STAT(err_cntrs.rx_eth_down)},
+	{"rx_drop_state", VNIC_STAT(err_cntrs.rx_drop_state)},
+	{"rx_logic", VNIC_STAT(err_cntrs.rx_logic)},
+};
+
+#define VNIC_STATS_LEN  ARRAY_SIZE(vnic_gstrings_stats)
+
 /* vnic_get_drvinfo - get driver info */
 static void vnic_get_drvinfo(struct net_device *netdev,
 			     struct ethtool_drvinfo *drvinfo)
@@ -68,10 +145,64 @@ static void vnic_get_drvinfo(struct net_device *netdev,
 		sizeof(drvinfo->bus_info));
 }
 
+/* vnic_get_sset_count - get string set count */
+static int vnic_get_sset_count(struct net_device *netdev, int sset)
+{
+	return (sset == ETH_SS_STATS) ? VNIC_STATS_LEN : -EOPNOTSUPP;
+}
+
+/* vnic_get_ethtool_stats - get statistics */
+static void vnic_get_ethtool_stats(struct net_device *netdev,
+				   struct ethtool_stats *stats, u64 *data)
+{
+	struct hfi_vnic_adapter *adapter = netdev_priv(netdev);
+	int i;
+	char *p = NULL;
+
+	mutex_lock(&adapter->stats_lock);
+	hfi_vnic_update_stats(netdev);
+	for (i = 0; i < VNIC_STATS_LEN; i++) {
+		switch (vnic_gstrings_stats[i].type) {
+		case NETDEV_STATS:
+			p = (char *)netdev +
+			  vnic_gstrings_stats[i].stat_offset;
+			break;
+		case VNIC_STATS:
+			p = (char *)adapter +
+			  vnic_gstrings_stats[i].stat_offset;
+			break;
+		default:
+			p = NULL;
+		}
+
+		if (p)
+			data[i] = (vnic_gstrings_stats[i].sizeof_stat ==
+			   sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
+	}
+	mutex_unlock(&adapter->stats_lock);
+}
+
+/* vnic_get_strings - get strings */
+static void vnic_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
+{
+	int i;
+
+	if (stringset != ETH_SS_STATS)
+		return;
+
+	for (i = 0; i < VNIC_STATS_LEN; i++)
+		memcpy(data + i * ETH_GSTRING_LEN,
+		       vnic_gstrings_stats[i].stat_string,
+		       ETH_GSTRING_LEN);
+}
+
 /* ethtool ops */
 static const struct ethtool_ops hfi_vnic_ethtool_ops = {
 	.get_drvinfo = vnic_get_drvinfo,
 	.get_link = ethtool_op_get_link,
+	.get_strings = vnic_get_strings,
+	.get_sset_count = vnic_get_sset_count,
+	.get_ethtool_stats = vnic_get_ethtool_stats,
 };
 
 /* hfi_vnic_set_ethtool_ops - set ethtool ops */
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
index 4dbb117..c48e676 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
@@ -94,6 +94,64 @@ enum hfi_vnic_flags_t {
 struct hfi_vnic_adapter;
 
 /**
+ * struct __hfi_vnic_summary_counters - HFI summary counters
+ *
+ * Same as __hfi_veswport_summary_counters without bitwise
+ * attribute and reserved fields.
+ */
+struct __hfi_vnic_summary_counters {
+	u64  tx_errors;
+	u64  rx_errors;
+	u64  tx_packets;
+	u64  rx_packets;
+	u64  tx_bytes;
+	u64  rx_bytes;
+
+	/* Group of histogram statistic counters */
+	struct __hfi_vnic_group_scs {
+		u64  unicast;
+		u64  mcastbcast;
+
+		u64  untagged;
+		u64  vlan;
+
+		u64  xx_64_size;
+		u64  xx_65_127;
+		u64  xx_128_255;
+		u64  xx_256_511;
+		u64  xx_512_1023;
+		u64  xx_1024_1518;
+		u64  xx_1519_max;
+	} tx_grp;
+
+	struct __hfi_vnic_group_scs rx_grp;
+
+} __packed;
+
+/**
+ * struct __hfi_vnic_error_counters - HFI error counters
+ *
+ * Same as hfi_veswport_error_counters without bitwise
+ * attribute and reserved fields.
+ */
+struct __hfi_vnic_error_counters {
+	u64  tx_errors;
+	u64  rx_errors;
+
+	u64  tx_smac_filt;
+	u64  tx_dlid_zero;
+	u64  tx_logic;
+	u64  tx_drop_state;
+
+	u64  rx_bad_veswid;
+	u64  rx_runt;
+	u64  rx_oversize;
+	u64  rx_eth_down;
+	u64  rx_drop_state;
+	u64  rx_logic;
+} __packed;
+
+/**
  * struct __hfi_vesw_info - HFI vnic virtual switch info
  *
  * Same as hfi_vesw_info without bitwise attribute.
@@ -203,7 +261,17 @@ struct hfi_vnic_rx_queue {
  * @lock: adapter lock
  * @rxq: receive queue array
  * @info: virtual ethernet switch port information
+ * @stats_lock: statistics lock
  * @flow_tbl: flow to default port redirection table
+ * @q_sum_cntrs: per queue EM summary counters
+ * @q_err_cntrs: per queue EM error counters
+ * @q_rx_logic_errors: per queue rx logic (default) errors
+ * @q_tx_logic_errors: per queue tx logic (default) errors
+ * @q_tx_halt: per queue tx halt counts
+ * @q_tx_restart: per queue tx restart counts
+ * @q_tx_wakeup: per queue tx wakeup counts
+ * @sum_cntrs: Total EM summary counters (from all queues)
+ * @err_cntrs: Total EM error counters (from all queues)
  */
 struct hfi_vnic_adapter {
 	struct net_device        *netdev;
@@ -217,7 +285,22 @@ struct hfi_vnic_adapter {
 
 	struct __hfi_veswport_info info;
 
+	/* Lock used to protect access to vnic counters */
+	struct mutex stats_lock;
+
 	u8 flow_tbl[HFI_VNIC_FLOW_TBL_SIZE];
+
+	struct __hfi_vnic_summary_counters  q_sum_cntrs[HFI_VNIC_MAX_QUEUE];
+	struct __hfi_vnic_error_counters    q_err_cntrs[HFI_VNIC_MAX_QUEUE];
+	u64 q_rx_logic_errors[HFI_VNIC_MAX_QUEUE];
+	u64 q_tx_logic_errors[HFI_VNIC_MAX_QUEUE];
+
+	u64 q_tx_halt[HFI_VNIC_MAX_QUEUE];
+	u64 q_tx_restart[HFI_VNIC_MAX_QUEUE];
+	u64 q_tx_wakeup[HFI_VNIC_MAX_QUEUE];
+
+	struct __hfi_vnic_summary_counters  sum_cntrs;
+	struct __hfi_vnic_error_counters    err_cntrs;
 };
 
 #define v_dbg(format, arg...) \
@@ -249,6 +332,7 @@ struct hfi_vnic_adapter {
 int hfi_vnic_encap_skb(struct hfi_vnic_adapter *adapter, struct sk_buff *skb);
 int hfi_vnic_decap_skb(struct hfi_vnic_rx_queue *rxq, struct sk_buff *skb);
 u8 hfi_vnic_calc_entropy(struct hfi_vnic_adapter *adapter, struct sk_buff *skb);
+void hfi_vnic_update_stats(struct net_device *netdev);
 void hfi_vnic_set_ethtool_ops(struct net_device *ndev);
 
 #endif /* _HFI_VNIC_INTERNAL_H */
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
index 7121637..f134225 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
@@ -63,6 +63,235 @@
 
 #define HFI_VNIC_MIN_ETH_MTU (ETH_ZLEN - ETH_HLEN)
 
+#define SUM_GRP_COUNTERS(adpt, summary, x_grp) do {                     \
+		u64 *src64, *dst64;                                     \
+		for (src64 = &summary->x_grp.unicast,                   \
+			dst64 = &adpt->sum_cntrs.x_grp.unicast;         \
+			dst64 <= &adpt->sum_cntrs.x_grp.xx_1519_max;) { \
+			*dst64++ += *src64++;                           \
+		}                                                       \
+	} while (0)
+
+/* hfi_vnic_update_stats - update statistics */
+void hfi_vnic_update_stats(struct net_device *netdev)
+{
+	struct hfi_vnic_adapter *adapter = netdev_priv(netdev);
+	struct hfi_vnic_device *vdev = adapter->vdev;
+	struct hfi_vnic_stats h_stats = { 0 };
+	u64 tx_logic_errors = 0;
+	u64 rx_logic_errors = 0;
+	u8 i;
+
+	/* first clear the total counters */
+	memset(&adapter->sum_cntrs, 0, sizeof(adapter->sum_cntrs));
+	memset(&adapter->err_cntrs, 0, sizeof(adapter->err_cntrs));
+
+	/* add tx counters on different queues */
+	for (i = 0; i < vdev->hfi_info.num_tx_q; i++) {
+		struct hfi_vnic_stats *hfi_stats = &vdev->hfi_stats[i];
+		struct __hfi_vnic_summary_counters *sum_cntrs =
+						&adapter->q_sum_cntrs[i];
+		struct __hfi_vnic_error_counters *err_cntrs =
+						&adapter->q_err_cntrs[i];
+
+		h_stats.tx_fifo_errors += hfi_stats->tx_fifo_errors;
+		h_stats.tx_carrier_errors += hfi_stats->tx_carrier_errors;
+		h_stats.tx_logic_errors += hfi_stats->tx_logic_errors;
+
+		SUM_GRP_COUNTERS(adapter, sum_cntrs, tx_grp);
+		adapter->sum_cntrs.tx_packets += sum_cntrs->tx_packets;
+		adapter->sum_cntrs.tx_bytes += sum_cntrs->tx_bytes;
+
+		adapter->err_cntrs.tx_smac_filt += err_cntrs->tx_smac_filt;
+		adapter->err_cntrs.tx_dlid_zero += err_cntrs->tx_dlid_zero;
+		adapter->err_cntrs.tx_drop_state += err_cntrs->tx_drop_state;
+
+		tx_logic_errors += adapter->q_tx_logic_errors[i];
+	}
+
+	/* add rx counters on different queues */
+	for (i = 0; i < vdev->hfi_info.num_rx_q; i++) {
+		struct hfi_vnic_stats *hfi_stats = &vdev->hfi_stats[i];
+		struct __hfi_vnic_summary_counters *sum_cntrs =
+						&adapter->q_sum_cntrs[i];
+		struct __hfi_vnic_error_counters *err_cntrs =
+						&adapter->q_err_cntrs[i];
+
+		h_stats.rx_fifo_errors += hfi_stats->rx_fifo_errors;
+		h_stats.rx_missed_errors += hfi_stats->rx_missed_errors;
+		h_stats.rx_bad_veswid += hfi_stats->rx_bad_veswid;
+		h_stats.rx_logic_errors += hfi_stats->rx_logic_errors;
+
+		SUM_GRP_COUNTERS(adapter, sum_cntrs, rx_grp);
+		adapter->sum_cntrs.rx_packets += sum_cntrs->rx_packets;
+		adapter->sum_cntrs.rx_bytes += sum_cntrs->rx_bytes;
+
+		adapter->err_cntrs.rx_drop_state += err_cntrs->rx_drop_state;
+		adapter->err_cntrs.rx_runt += err_cntrs->rx_runt;
+		adapter->err_cntrs.rx_oversize += err_cntrs->rx_oversize;
+
+		rx_logic_errors += adapter->q_rx_logic_errors[i];
+	}
+
+	/* update hfi errors */
+	netdev->stats.rx_fifo_errors = h_stats.rx_fifo_errors;
+	netdev->stats.tx_fifo_errors = h_stats.tx_fifo_errors;
+	netdev->stats.rx_missed_errors = h_stats.rx_missed_errors;
+	netdev->stats.tx_carrier_errors = h_stats.tx_carrier_errors;
+	adapter->err_cntrs.rx_bad_veswid = h_stats.rx_bad_veswid;
+
+	/* update tx counters */
+	netdev->stats.tx_packets = adapter->sum_cntrs.tx_packets;
+	netdev->stats.tx_bytes = adapter->sum_cntrs.tx_bytes;
+
+	adapter->err_cntrs.tx_logic = netdev->stats.tx_carrier_errors +
+				      netdev->stats.tx_fifo_errors +
+				      h_stats.tx_logic_errors +
+				      tx_logic_errors;
+
+	netdev->stats.tx_errors = adapter->err_cntrs.tx_smac_filt +
+				  adapter->err_cntrs.tx_dlid_zero +
+				  adapter->err_cntrs.tx_drop_state +
+				  adapter->err_cntrs.tx_logic;
+
+	netdev->stats.tx_dropped = netdev->stats.tx_errors;
+	adapter->sum_cntrs.tx_errors = netdev->stats.tx_errors;
+	adapter->err_cntrs.tx_errors = netdev->stats.tx_errors;
+
+	/* update rx counters */
+	netdev->stats.rx_packets = adapter->sum_cntrs.rx_packets;
+	netdev->stats.rx_bytes = adapter->sum_cntrs.rx_bytes;
+	netdev->stats.multicast = adapter->sum_cntrs.rx_grp.mcastbcast;
+	netdev->stats.rx_over_errors = adapter->err_cntrs.rx_oversize;
+	netdev->stats.rx_length_errors = adapter->err_cntrs.rx_oversize +
+					 adapter->err_cntrs.rx_runt;
+
+	adapter->err_cntrs.rx_logic = netdev->stats.rx_missed_errors +
+				      netdev->stats.rx_fifo_errors +
+				      h_stats.rx_logic_errors +
+				      rx_logic_errors;
+
+	netdev->stats.rx_errors = adapter->err_cntrs.rx_bad_veswid +
+				  adapter->err_cntrs.rx_runt +
+				  adapter->err_cntrs.rx_oversize +
+				  adapter->err_cntrs.rx_eth_down +
+				  adapter->err_cntrs.rx_drop_state +
+				  adapter->err_cntrs.rx_logic;
+
+	netdev->stats.rx_dropped = netdev->stats.rx_errors;
+	adapter->sum_cntrs.rx_errors = netdev->stats.rx_errors;
+	adapter->err_cntrs.rx_errors = netdev->stats.rx_errors;
+}
+
+/* update_len_counters - update pkt's len histogram counters */
+static inline void update_len_counters(struct __hfi_vnic_group_scs *grp,
+				       int len)
+{
+	/* account for 4 byte FCS */
+	if (len >= 1515)
+		grp->xx_1519_max++;
+	else if (len >= 1020)
+		grp->xx_1024_1518++;
+	else if (len >= 508)
+		grp->xx_512_1023++;
+	else if (len >= 252)
+		grp->xx_256_511++;
+	else if (len >= 124)
+		grp->xx_128_255++;
+	else if (len >= 61)
+		grp->xx_65_127++;
+	else
+		grp->xx_64_size++;
+}
+
+/* hfi_vnic_update_tx_counters - update transmit counters */
+static void hfi_vnic_update_tx_counters(struct net_device *netdev, u8 q_idx,
+					struct sk_buff *skb, int err)
+{
+	struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+	struct hfi_vnic_adapter *adapter = netdev_priv(netdev);
+	struct __hfi_vnic_group_scs *grp_cntrs =
+			&adapter->q_sum_cntrs[q_idx].tx_grp;
+	u16 vlan_tci;
+
+	adapter->q_sum_cntrs[q_idx].tx_packets++;
+	adapter->q_sum_cntrs[q_idx].tx_bytes += skb->len + ETH_FCS_LEN;
+
+	update_len_counters(grp_cntrs, skb->len);
+
+	/* rest of the counts are for good packets only */
+	if (err)
+		return;
+
+	if (is_multicast_ether_addr(mac_hdr->h_dest))
+		grp_cntrs->mcastbcast++;
+	else
+		grp_cntrs->unicast++;
+
+	if (!__vlan_get_tag(skb, &vlan_tci))
+		grp_cntrs->vlan++;
+	else
+		grp_cntrs->untagged++;
+}
+
+/* hfi_vnic_update_rx_counters - update receive counters */
+static void hfi_vnic_update_rx_counters(struct net_device *netdev, u8 q_idx,
+					struct sk_buff *skb, int err)
+{
+	struct ethhdr *mac_hdr = (struct ethhdr *)skb->data;
+	struct hfi_vnic_adapter *adapter = netdev_priv(netdev);
+	struct __hfi_vnic_group_scs *grp_cntrs =
+			&adapter->q_sum_cntrs[q_idx].rx_grp;
+	u16 vlan_tci;
+
+	adapter->q_sum_cntrs[q_idx].rx_packets++;
+	adapter->q_sum_cntrs[q_idx].rx_bytes += skb->len + ETH_FCS_LEN;
+
+	update_len_counters(grp_cntrs, skb->len);
+
+	/* rest of the counts are for good packets only */
+	if (err)
+		return;
+
+	if (is_multicast_ether_addr(mac_hdr->h_dest))
+		grp_cntrs->mcastbcast++;
+	else
+		grp_cntrs->unicast++;
+
+	if (!__vlan_get_tag(skb, &vlan_tci))
+		grp_cntrs->vlan++;
+	else
+		grp_cntrs->untagged++;
+}
+
+static struct rtnl_link_stats64 *
+hfi_vnic_get_stats64(struct net_device *netdev,
+		     struct rtnl_link_stats64 *stats)
+{
+	struct hfi_vnic_adapter *adapter = netdev_priv(netdev);
+
+	mutex_lock(&adapter->stats_lock);
+	hfi_vnic_update_stats(netdev);
+
+	stats->rx_packets = netdev->stats.rx_packets;
+	stats->tx_packets = netdev->stats.tx_packets;
+	stats->rx_bytes = netdev->stats.rx_bytes;
+	stats->tx_bytes = netdev->stats.tx_bytes;
+	stats->rx_errors = netdev->stats.rx_errors;
+	stats->tx_errors = netdev->stats.tx_errors;
+	stats->rx_dropped = netdev->stats.rx_dropped;
+	stats->tx_dropped = netdev->stats.tx_dropped;
+	stats->multicast = netdev->stats.multicast;
+	stats->rx_length_errors = netdev->stats.rx_length_errors;
+	stats->rx_over_errors = netdev->stats.rx_over_errors;
+	stats->rx_fifo_errors = netdev->stats.rx_fifo_errors;
+	stats->rx_missed_errors = netdev->stats.rx_missed_errors;
+	stats->tx_carrier_errors = netdev->stats.tx_carrier_errors;
+	stats->tx_fifo_errors = netdev->stats.tx_fifo_errors;
+	mutex_unlock(&adapter->stats_lock);
+	return stats;
+}
+
 /* hfi_vnic_maybe_stop_tx - stop tx queue if required */
 static void hfi_vnic_maybe_stop_tx(struct hfi_vnic_adapter *adapter, u8 q_idx)
 {
@@ -72,6 +301,7 @@ static void hfi_vnic_maybe_stop_tx(struct hfi_vnic_adapter *adapter, u8 q_idx)
 	if (!vdev->bus_ops->get_write_avail(vdev, q_idx))
 		return;
 
+	adapter->q_tx_restart[q_idx]++;
 	netif_start_subqueue(vdev->netdev, q_idx);
 }
 
@@ -87,12 +317,15 @@ static netdev_tx_t hfi_netdev_start_xmit(struct sk_buff *skb,
 
 	v_dbg("xmit: queue %d skb len %d\n", q_idx, skb->len);
 	if (unlikely(adapter->info.vport.oper_state !=
-		     HFI_VNIC_STATE_FORWARDING))
+		     HFI_VNIC_STATE_FORWARDING)) {
+		adapter->q_err_cntrs[q_idx].tx_drop_state++;
 		goto tx_finish;
+	}
 
 	/* pad to ensure mininum ethernet packet length */
 	if (unlikely(skb->len < ETH_ZLEN)) {
 		if (skb_padto(skb, ETH_ZLEN)) {
+			adapter->q_tx_logic_errors[q_idx]++;
 			skip_skb_free = true;
 			goto tx_finish;
 		}
@@ -106,16 +339,19 @@ static netdev_tx_t hfi_netdev_start_xmit(struct sk_buff *skb,
 	/* Get reference to skb as hfi driver might release it */
 	skb_get(skb);
 	rc = vdev->bus_ops->put_skb(vdev, q_idx, skb);
-	/* remove the header */
+	/* remove the header before updating tx counters */
 	skb_pull(skb, HFI_VNIC_HDR_LEN);
 
 tx_finish:
 	if (unlikely(rc == -EBUSY)) {
 		hfi_vnic_maybe_stop_tx(adapter, q_idx);
+		adapter->q_tx_halt[q_idx]++;
 		dev_kfree_skb_any(skb);
 		return NETDEV_TX_BUSY;
 	}
 
+	/* update tx counters */
+	hfi_vnic_update_tx_counters(netdev, q_idx, skb, rc);
 	if (!skip_skb_free)
 		dev_kfree_skb_any(skb);
 	return NETDEV_TX_OK;
@@ -128,6 +364,7 @@ static void vnic_handle_rx(struct hfi_vnic_rx_queue *rxq,
 	struct hfi_vnic_adapter *adapter = rxq->adapter;
 	struct hfi_vnic_device *vdev = adapter->vdev;
 	struct sk_buff *skb;
+	int rc;
 
 	while (1) {
 		if (*work_done >= work_to_do)
@@ -137,7 +374,11 @@ static void vnic_handle_rx(struct hfi_vnic_rx_queue *rxq,
 		if (!skb)
 			break;
 
-		if (hfi_vnic_decap_skb(rxq, skb)) {
+		rc = hfi_vnic_decap_skb(rxq, skb);
+
+		/* update rx counters */
+		hfi_vnic_update_rx_counters(adapter->netdev, rxq->idx, skb, rc);
+		if (rc) {
 			dev_kfree_skb_any(skb);
 			continue;
 		}
@@ -183,8 +424,10 @@ static void vnic_event_cb(struct hfi_vnic_device *vdev, u8 evt)
 	if (evt < vdev->hfi_info.num_rx_q) {
 		q_idx = evt;
 		if (unlikely(adapter->info.vport.oper_state !=
-			     HFI_VNIC_STATE_FORWARDING))
+			     HFI_VNIC_STATE_FORWARDING)) {
+			adapter->q_err_cntrs[q_idx].rx_drop_state++;
 			return;
+		}
 
 		rxq = &adapter->rxq[q_idx];
 		if (napi_schedule_prep(&rxq->napi)) {
@@ -198,9 +441,10 @@ static void vnic_event_cb(struct hfi_vnic_device *vdev, u8 evt)
 	    (evt < (HFI_VNIC_EVT_TX0 + vdev->hfi_info.num_tx_q))) {
 		q_idx = evt - HFI_VNIC_EVT_TX0;
 
-		if (__netif_subqueue_stopped(vdev->netdev, q_idx))
+		if (__netif_subqueue_stopped(vdev->netdev, q_idx)) {
 			netif_wake_subqueue(vdev->netdev, q_idx);
-
+			adapter->q_tx_wakeup[q_idx]++;
+		}
 		return;
 	}
 	v_err("Invalid event\n");
@@ -341,6 +585,7 @@ static int hfi_netdev_close(struct net_device *netdev)
 	.ndo_stop = hfi_netdev_close,
 	.ndo_start_xmit = hfi_netdev_start_xmit,
 	.ndo_change_mtu = hfi_netdev_change_mtu,
+	.ndo_get_stats64 = hfi_vnic_get_stats64,
 	.ndo_select_queue = hfi_vnic_select_queue,
 	.ndo_set_mac_address = hfi_vnic_set_mac_addr,
 };
@@ -380,6 +625,7 @@ static int hfi_vnic_drv_probe(struct device *dev)
 	netdev->netdev_ops = &hfi_netdev_ops;
 	netdev->hard_header_len += HFI_VNIC_SKB_HEADROOM;
 	mutex_init(&adapter->lock);
+	mutex_init(&adapter->stats_lock);
 	strcpy(netdev->name, "veth%d");
 
 	hfi_vnic_set_ethtool_ops(netdev);
@@ -406,6 +652,7 @@ static int hfi_vnic_drv_probe(struct device *dev)
 	vdev->bus_ops->deinit(vdev);
 hw_err:
 	mutex_destroy(&adapter->lock);
+	mutex_destroy(&adapter->stats_lock);
 	free_netdev(netdev);
 	dev_err(dev, "initialization failed %d\n", rc);
 
@@ -422,6 +669,7 @@ static int hfi_vnic_drv_remove(struct device *dev)
 	unregister_netdev(vdev->netdev);
 	vdev->bus_ops->deinit(vdev);
 	mutex_destroy(&adapter->lock);
+	mutex_destroy(&adapter->stats_lock);
 	free_netdev(vdev->netdev);
 
 	dev_info(dev, "removed\n");
-- 
1.8.3.1
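
Note for readers: update_len_counters() in the patch above compares the
skb length against thresholds such as 61 and 1515 because the skb does
not include the 4 byte FCS, while the histogram buckets are named after
on-wire frame sizes. The sketch below spells out that arithmetic by
adding the FCS first and then comparing against the bucket boundaries.
The demo_* names are hypothetical and only illustrate the mapping; the
result should match the patch's thresholds.

```c
#include <assert.h>

#define DEMO_FCS_LEN 4	/* Ethernet frame check sequence, in bytes */

enum demo_bucket {
	DEMO_64,
	DEMO_65_127,
	DEMO_128_255,
	DEMO_256_511,
	DEMO_512_1023,
	DEMO_1024_1518,
	DEMO_1519_MAX,
};

/* Map an skb length (without FCS) to its on-wire size bucket. */
static enum demo_bucket demo_len_bucket(int skb_len)
{
	int wire_len;

	assert(skb_len >= 0);
	wire_len = skb_len + DEMO_FCS_LEN;

	if (wire_len >= 1519)
		return DEMO_1519_MAX;
	if (wire_len >= 1024)
		return DEMO_1024_1518;
	if (wire_len >= 512)
		return DEMO_512_1023;
	if (wire_len >= 256)
		return DEMO_256_511;
	if (wire_len >= 128)
		return DEMO_128_255;
	if (wire_len >= 65)
		return DEMO_65_127;
	return DEMO_64;
}
```

For example, a 60 byte skb is a 64 byte frame on the wire and lands in
the first bucket, while a 1515 byte skb is a 1519 byte frame and lands
in the last one.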


* [RFC 10/10] IB/hfi1: VNIC SDMA support
From: Vishwanathapura, Niranjana @ 2016-11-18 22:42 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma, netdev, Dennis Dalessandro, Niranjana Vishwanathapura
In-Reply-To: <1479508938-63799-1-git-send-email-niranjana.vishwanathapura@intel.com>

HFI1 VNIC SDMA support enables transmission of VNIC packets over SDMA.
Map VNIC queues to SDMA engines and support halting and wakeup of the
VNIC queues.

Change-Id: I2d2d23bda9fb8a7194d9722e23bc69b110cdcf86
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
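Note for readers: the halt and wakeup handling in this patch can be read
as a two-state machine per Tx queue (ACTIVE and DEFERRED). The userspace
sketch below models only that state machine. It assumes that the wakeup
path returns the queue to ACTIVE and that write availability is simply
"state is ACTIVE"; neither is confirmed by the portion of the patch shown
here. All demo_* names and the descriptor bookkeeping are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_Q_ACTIVE   (1u << 0)
#define DEMO_Q_DEFERRED (1u << 1)

/* Hypothetical model of one Tx queue bound to one DMA engine. */
struct demo_txq {
	unsigned int state;
	unsigned int free_desc;	/* free descriptors on the engine */
	unsigned int halts;
	unsigned int wakeups;
};

static void demo_txq_init(struct demo_txq *q, unsigned int ndesc)
{
	q->state = DEMO_Q_ACTIVE;
	q->free_desc = ndesc;
	q->halts = 0;
	q->wakeups = 0;
}

/* The upper driver may only submit while the queue is active. */
static bool demo_write_avail(const struct demo_txq *q)
{
	return q->state == DEMO_Q_ACTIVE;
}

/* Try to queue one packet that needs 'need' descriptors. */
static int demo_send(struct demo_txq *q, unsigned int need)
{
	assert(q);
	if (q->state != DEMO_Q_ACTIVE)
		return -ECOMM;

	if (q->free_desc < need) {
		/* "sleep": defer the queue until descriptors return */
		q->state = DEMO_Q_DEFERRED;
		q->halts++;
		return -EBUSY;
	}

	q->free_desc -= need;
	return 0;
}

/* Completion path: descriptors come back, wake a deferred queue. */
static void demo_complete(struct demo_txq *q, unsigned int freed)
{
	q->free_desc += freed;
	if (q->state == DEMO_Q_DEFERRED) {
		q->state = DEMO_Q_ACTIVE;
		q->wakeups++;
	}
}
```

In this model -EBUSY tells the caller to stop the queue and retry later,
which mirrors how the netdev side of the series treats -EBUSY from
put_skb().
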
 drivers/infiniband/hw/hfi1/hfi.h         |   1 +
 drivers/infiniband/hw/hfi1/vnic.h        |  30 +++-
 drivers/infiniband/hw/hfi1/vnic_device.c |   2 +-
 drivers/infiniband/hw/hfi1/vnic_main.c   |  22 ++-
 drivers/infiniband/hw/hfi1/vnic_sdma.c   | 260 +++++++++++++++++++++++++++++++
 5 files changed, 311 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 2ff3453..f476188 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -855,6 +855,7 @@ struct hfi1_asic_data {
 /* Virtual NIC information */
 struct hfi1_vnic_data {
 	struct hfi1_ctxtdata *ctxt[HFI1_NUM_VNIC_CTXT];
+	struct kmem_cache *txreq_cache;
 	u8 num_vports;
 	struct hfi_vnic_ctrl_device *ctrl_dev;
 	struct idr vesw_idr;
diff --git a/drivers/infiniband/hw/hfi1/vnic.h b/drivers/infiniband/hw/hfi1/vnic.h
index d91c35b..4bdfe2b 100644
--- a/drivers/infiniband/hw/hfi1/vnic.h
+++ b/drivers/infiniband/hw/hfi1/vnic.h
@@ -49,6 +49,7 @@
 
 #include "hfi_vnic.h"
 #include "hfi.h"
+#include "sdma.h"
 
 #define HFI1_VNIC_ICRC_LEN   4
 #define HFI1_VNIC_TAIL_LEN   1
@@ -90,6 +91,26 @@
 #define HFI1_VNIC_SC_SHIFT      4
 
 /**
+ * struct hfi1_vnic_sdma - VNIC per Tx ring SDMA information
+ * @dd - device data pointer
+ * @sde - sdma engine
+ * @vinfo - vnic info pointer
+ * @wait - iowait structure
+ * @stx - sdma tx request
+ * @state - vnic Tx ring SDMA state
+ * @q_idx - vnic Tx queue index
+ */
+struct hfi1_vnic_sdma {
+	struct hfi1_devdata *dd;
+	struct sdma_engine  *sde;
+	struct hfi1_vnic_vport_info *vinfo;
+	struct iowait wait;
+	struct sdma_txreq stx;
+	unsigned int state;
+	u8 q_idx;
+};
+
+/**
  * struct hfi1_vnic_notifier - VNIC notifer structure
  * @cb - vnic callback function
  */
@@ -104,6 +125,7 @@ struct hfi1_vnic_notifier {
  * @event_flags: event notification flags
  * @notifier: vnic notifier
  * @skbq: Array of queues for received socket buffers
+ * @sdma: VNIC SDMA structure per TXQ
  */
 struct hfi1_vnic_vport_info {
 	struct hfi1_devdata *dd;
@@ -112,7 +134,8 @@ struct hfi1_vnic_vport_info {
 	DECLARE_BITMAP(event_flags, HFI_VNIC_NUM_EVTS);
 	struct hfi_vnic_device *vdev;
 
-	struct sk_buff_head skbq[HFI1_NUM_VNIC_CTXT];
+	struct sk_buff_head    skbq[HFI1_NUM_VNIC_CTXT];
+	struct hfi1_vnic_sdma  sdma[HFI1_VNIC_MAX_TXQ];
 };
 
 static inline struct hfi1_devdata *vnic_dev2dd(struct hfi_vnic_device *vdev)
@@ -131,10 +154,15 @@ static inline void hfi1_vnic_update_pad(unsigned char *pad, u8 plen)
 /* vnic hfi1 internal functions */
 int hfi1_vnic_setup(struct hfi1_devdata *dd);
 void hfi1_vnic_cleanup(struct hfi1_devdata *dd);
+int hfi1_vnic_txreq_init(struct hfi1_devdata *dd);
+void hfi1_vnic_txreq_deinit(struct hfi1_devdata *dd);
 int hfi1_vnic_add_ctrl_port(struct hfi1_devdata *dd, struct device *parent);
 void hfi1_vnic_rem_ctrl_port(struct hfi1_devdata *dd);
 
 void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet);
+void hfi1_vnic_sdma_init(struct hfi1_vnic_vport_info *vinfo);
+bool hfi1_vnic_sdma_write_avail(struct hfi1_vnic_vport_info *vinfo,
+				u8 q_idx);
 
 /* vnic device bus ops */
 int hfi1_vnic_init(struct hfi_vnic_device *vdev);
diff --git a/drivers/infiniband/hw/hfi1/vnic_device.c b/drivers/infiniband/hw/hfi1/vnic_device.c
index 468e197..5fb1a49 100644
--- a/drivers/infiniband/hw/hfi1/vnic_device.c
+++ b/drivers/infiniband/hw/hfi1/vnic_device.c
@@ -85,7 +85,7 @@ static int hfi1_vdev_create(struct hfi_vnic_ctrl_device *cdev,
 		return -ENOMEM;
 
 	vinfo->dd = dd;
-	hfi_info.num_tx_q = 1;
+	hfi_info.num_tx_q = dd->chip_sdma_engines;
 	hfi_info.num_rx_q = HFI1_NUM_VNIC_CTXT;
 	hfi_info.cap = HFI_VNIC_CAP_SG;
 	vdev = hfi_vnic_device_register(cdev, port_num, vport_num, vinfo,
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c
index 82e30bd..a21e4cd 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -294,15 +294,21 @@ int hfi1_vnic_put_skb(struct hfi_vnic_device *vdev,
 
 u8 hfi1_vnic_select_queue(struct hfi_vnic_device *vdev, u8 vl, u8 entropy)
 {
-	return 0;
+	struct hfi1_devdata *dd = (struct hfi1_devdata *)vdev->cdev->hfi_priv;
+	struct sdma_engine *sde;
+
+	sde = sdma_select_engine_vl(dd, entropy, vl);
+	return sde->this_idx;
 }
 
 bool hfi1_vnic_get_write_avail(struct hfi_vnic_device *vdev, u8 q_idx)
 {
+	struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+
 	if (q_idx >= vdev->hfi_info.num_tx_q)
 		return false;
 
-	return true;
+	return hfi1_vnic_sdma_write_avail(vinfo, q_idx);
 }
 
 void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet)
@@ -504,6 +510,13 @@ int hfi1_vnic_init(struct hfi_vnic_device *vdev)
 	int i, rc = 0;
 
 	mutex_lock(&hfi1_mutex);
+
+	if (!dd->vnic.num_vports) {
+		rc = hfi1_vnic_txreq_init(dd);
+		if (rc)
+			goto txreq_fail;
+	}
+
 	for (i = dd->vnic.num_ctxt; i < vdev->hfi_info.num_rx_q; i++) {
 		rc = hfi1_vnic_allot_ctxt(dd, &dd->vnic.ctxt[i]);
 		if (rc)
@@ -531,7 +544,11 @@ int hfi1_vnic_init(struct hfi_vnic_device *vdev)
 
 	dd->vnic.num_vports++;
 	vinfo->vdev = vdev;
+	hfi1_vnic_sdma_init(vinfo);
 alloc_fail:
+	if (!dd->vnic.num_vports)
+		hfi1_vnic_txreq_deinit(dd);
+txreq_fail:
 	mutex_unlock(&hfi1_mutex);
 	return rc;
 }
@@ -549,6 +566,7 @@ void hfi1_vnic_deinit(struct hfi_vnic_device *vdev)
 		}
 		hfi1_deinit_vnic_rsm(dd);
 		dd->vnic.num_ctxt = 0;
+		hfi1_vnic_txreq_deinit(dd);
 	}
 
 	mutex_unlock(&hfi1_mutex);
diff --git a/drivers/infiniband/hw/hfi1/vnic_sdma.c b/drivers/infiniband/hw/hfi1/vnic_sdma.c
index 66abad0..e9754dd 100644
--- a/drivers/infiniband/hw/hfi1/vnic_sdma.c
+++ b/drivers/infiniband/hw/hfi1/vnic_sdma.c
@@ -52,9 +52,269 @@
 #include "sdma.h"
 #include "vnic.h"
 
+#define HFI1_VNIC_SDMA_Q_ACTIVE   BIT(0)
+#define HFI1_VNIC_SDMA_Q_DEFERRED BIT(1)
+
+#define HFI1_VNIC_TXREQ_NAME_LEN   32
+#define HFI1_VNIC_SDMA_DESC_WTRMRK 64
+#define HFI1_VNIC_SDMA_RETRY_COUNT 1
+
+/*
+ * struct vnic_txreq - VNIC transmit descriptor
+ * @txreq: sdma transmit request
+ * @sdma: vnic sdma pointer
+ * @skb: skb to send
+ * @pad: pad buffer
+ * @plen: pad length
+ * @pbc_val: pbc value
+ * @retry_count: tx retry count
+ */
+struct vnic_txreq {
+	struct sdma_txreq       txreq;
+	struct hfi1_vnic_sdma   *sdma;
+
+	struct sk_buff         *skb;
+	unsigned char           pad[HFI1_VNIC_MAX_PAD];
+	u16                     plen;
+	__le64                  pbc_val;
+
+	u32                     retry_count;
+};
+
+static void vnic_sdma_complete(struct sdma_txreq *txreq,
+			       int status)
+{
+	struct vnic_txreq *tx = container_of(txreq, struct vnic_txreq, txreq);
+	struct hfi1_vnic_sdma *vnic_sdma = tx->sdma;
+
+	sdma_txclean(vnic_sdma->dd, txreq);
+	dev_kfree_skb_any(tx->skb);
+	kmem_cache_free(vnic_sdma->dd->vnic.txreq_cache, tx);
+}
+
+static noinline int build_vnic_ulp_payload(struct sdma_engine *sde,
+					   struct vnic_txreq *tx)
+{
+	int i, ret = 0;
+
+	ret = sdma_txadd_kvaddr(
+		sde->dd,
+		&tx->txreq,
+		tx->skb->data,
+		skb_headlen(tx->skb));
+	if (ret)
+		goto bail_txadd;
+
+	for (i = 0; i < skb_shinfo(tx->skb)->nr_frags; i++) {
+		struct skb_frag_struct *frag = &skb_shinfo(tx->skb)->frags[i];
+
+		/* combine physically contiguous fragments later? */
+		ret = sdma_txadd_page(sde->dd,
+				      &tx->txreq,
+				      skb_frag_page(frag),
+				      frag->page_offset,
+				      skb_frag_size(frag));
+		if (ret)
+			goto bail_txadd;
+	}
+
+	if (tx->plen)
+		ret = sdma_txadd_kvaddr(sde->dd, &tx->txreq,
+					tx->pad + HFI1_VNIC_MAX_PAD - tx->plen,
+					tx->plen);
+
+bail_txadd:
+	return ret;
+}
+
+static int build_vnic_tx_desc(struct sdma_engine *sde,
+			      struct vnic_txreq *tx,
+			      u64 pbc)
+{
+	int ret = 0;
+	u16 hdrbytes = 2 << 2;  /* PBC */
+
+	ret = sdma_txinit_ahg(
+		&tx->txreq,
+		0,
+		hdrbytes + tx->skb->len + tx->plen,
+		0,
+		0,
+		NULL,
+		0,
+		vnic_sdma_complete);
+	if (ret)
+		goto bail_txadd;
+
+	/* add pbc */
+	tx->pbc_val = cpu_to_le64(pbc);
+	ret = sdma_txadd_kvaddr(
+		sde->dd,
+		&tx->txreq,
+		&tx->pbc_val,
+		hdrbytes);
+	if (ret)
+		goto bail_txadd;
+
+	/* add the ulp payload */
+	ret = build_vnic_ulp_payload(sde, tx);
+bail_txadd:
+	return ret;
+}
+
 int hfi1_vnic_send_dma(struct hfi1_devdata *dd, u8 q_idx,
 		       struct hfi1_vnic_vport_info *vinfo,
 		       struct sk_buff *skb, u64 pbc, u8 plen)
 {
+	struct hfi1_vnic_sdma *vnic_sdma = &vinfo->sdma[q_idx];
+	struct sdma_engine *sde = vnic_sdma->sde;
+	struct vnic_txreq *tx;
+	int ret = -ECOMM;
+
+	if (READ_ONCE(vnic_sdma->state) != HFI1_VNIC_SDMA_Q_ACTIVE)
+		goto tx_err;
+
+	if (!sde || !sdma_running(sde))
+		goto tx_err;
+
+	tx = kmem_cache_alloc(dd->vnic.txreq_cache, GFP_ATOMIC);
+	if (!tx) {
+		ret = -ENOMEM;
+		goto tx_err;
+	}
+
+	tx->sdma = vnic_sdma;
+	tx->skb = skb;
+	hfi1_vnic_update_pad(tx->pad, plen);
+	tx->plen = plen;
+	ret = build_vnic_tx_desc(sde, tx, pbc);
+	if (unlikely(ret))
+		goto free_desc;
+	tx->retry_count = 0;
+
+	ret = sdma_send_txreq(sde, &vnic_sdma->wait, &tx->txreq);
+	/* When -ECOMM, sdma callback will be called with ABORT status */
+	if (ret && unlikely(ret != -ECOMM))
+		goto free_desc;
+
+	return ret;
+
+free_desc:
+	sdma_txclean(dd, &tx->txreq);
+	kmem_cache_free(dd->vnic.txreq_cache, tx);
+tx_err:
+	if (ret != -EBUSY)
+		dev_kfree_skb_any(skb);
+	return ret;
+}
+
+/*
+ * hfi1_vnic_sdma_sleep - vnic sdma sleep function
+ *
+ * This function gets called from sdma_send_txreq() when there are not enough
+ * sdma descriptors available to send the packet. It adds the Tx queue's wait
+ * structure to the sdma engine's dmawait list so that the queue is woken up
+ * when descriptors become available.
+ */
+static int hfi1_vnic_sdma_sleep(struct sdma_engine *sde,
+				struct iowait *wait,
+				struct sdma_txreq *txreq,
+				unsigned int seq)
+{
+	struct hfi1_vnic_sdma *vnic_sdma =
+		container_of(wait, struct hfi1_vnic_sdma, wait);
+	struct hfi1_ibdev *dev = &vnic_sdma->dd->verbs_dev;
+	struct vnic_txreq *tx = container_of(txreq, struct vnic_txreq, txreq);
+
+	if (sdma_progress(sde, seq, txreq))
+		if (tx->retry_count++ < HFI1_VNIC_SDMA_RETRY_COUNT)
+			return -EAGAIN;
+
+	vnic_sdma->state = HFI1_VNIC_SDMA_Q_DEFERRED;
+	write_seqlock(&dev->iowait_lock);
+	if (list_empty(&vnic_sdma->wait.list))
+		list_add_tail(&vnic_sdma->wait.list, &sde->dmawait);
+	write_sequnlock(&dev->iowait_lock);
+	return -EBUSY;
+}
+
+/*
+ * hfi1_vnic_sdma_wakeup - vnic sdma wakeup function
+ *
+ * This function gets called when SDMA descriptors become available and the Tx
+ * queue's wait structure was previously added to the sdma engine's dmawait
+ * list. It notifies the upper driver about the Tx queue wakeup.
+ */
+static void hfi1_vnic_sdma_wakeup(struct iowait *wait, int reason)
+{
+	struct hfi1_vnic_sdma *vnic_sdma =
+		container_of(wait, struct hfi1_vnic_sdma, wait);
+	struct hfi1_vnic_vport_info *vinfo = vnic_sdma->vinfo;
+	u8 evt = HFI_VNIC_EVT_TX0 + vnic_sdma->q_idx;
+	struct hfi1_vnic_notifier *notifier;
+
+	vnic_sdma->state = HFI1_VNIC_SDMA_Q_ACTIVE;
+	notifier = rcu_dereference(vinfo->notifier);
+	if (notifier && notifier->cb && test_bit(evt, vinfo->event_flags))
+		notifier->cb(vinfo->vdev, evt);
+}
+
+inline bool hfi1_vnic_sdma_write_avail(struct hfi1_vnic_vport_info *vinfo,
+				       u8 q_idx)
+{
+	struct hfi1_vnic_sdma *vnic_sdma = &vinfo->sdma[q_idx];
+
+	return (READ_ONCE(vnic_sdma->state) == HFI1_VNIC_SDMA_Q_ACTIVE);
+}
+
+void hfi1_vnic_sdma_init(struct hfi1_vnic_vport_info *vinfo)
+{
+	int i;
+
+	for (i = 0; i < vinfo->vdev->hfi_info.num_tx_q; i++) {
+		struct hfi1_vnic_sdma *vnic_sdma = &vinfo->sdma[i];
+
+		iowait_init(&vnic_sdma->wait, 0, NULL, hfi1_vnic_sdma_sleep,
+			    hfi1_vnic_sdma_wakeup, NULL);
+		vnic_sdma->sde = &vinfo->dd->per_sdma[i];
+		vnic_sdma->dd = vinfo->dd;
+		vnic_sdma->vinfo = vinfo;
+		vnic_sdma->q_idx = i;
+		vnic_sdma->state = HFI1_VNIC_SDMA_Q_ACTIVE;
+
+		/* Add a free descriptor watermark for wakeups */
+		if (vnic_sdma->sde->descq_cnt >= HFI1_VNIC_SDMA_DESC_WTRMRK) {
+			INIT_LIST_HEAD(&vnic_sdma->stx.list);
+			vnic_sdma->stx.num_desc = HFI1_VNIC_SDMA_DESC_WTRMRK;
+			list_add_tail(&vnic_sdma->stx.list,
+				      &vnic_sdma->wait.tx_head);
+		}
+	}
+}
+
+static void hfi1_vnic_txreq_kmem_cache_ctor(void *obj)
+{
+	struct vnic_txreq *tx = (struct vnic_txreq *)obj;
+
+	memset(tx, 0, sizeof(*tx));
+}
+
+int hfi1_vnic_txreq_init(struct hfi1_devdata *dd)
+{
+	char buf[HFI1_VNIC_TXREQ_NAME_LEN];
+
+	snprintf(buf, sizeof(buf), "hfi1_%u_vnic_txreq_cache", dd->unit);
+	dd->vnic.txreq_cache = kmem_cache_create(buf,
+					  sizeof(struct vnic_txreq),
+					  0, SLAB_HWCACHE_ALIGN,
+					  hfi1_vnic_txreq_kmem_cache_ctor);
+	if (!dd->vnic.txreq_cache)
+		return -ENOMEM;
 	return 0;
 }
+
+void hfi1_vnic_txreq_deinit(struct hfi1_devdata *dd)
+{
+	kmem_cache_destroy(dd->vnic.txreq_cache);
+	dd->vnic.txreq_cache = NULL;
+}
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [RFC 08/10] IB/hfi-vnic: VNIC Ethernet Management Agent (VEMA) driver
From: Vishwanathapura, Niranjana @ 2016-11-18 22:42 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Dennis Dalessandro, Sadanand Warrier, Niranjana Vishwanathapura,
	Tanya K Jajodia, Sudeep Dutt
In-Reply-To: <1479508938-63799-1-git-send-email-niranjana.vishwanathapura-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

The HFI VEMA driver interfaces with the InfiniBand MAD stack to exchange
management information packets with the Ethernet Manager (EM). It also
interfaces with the HFI VNIC netdev driver to SET/GET the management
information. The information exchanged with the EM includes class port
details, encapsulation configuration, various counters, the unicast and
multicast MAC lists, and the MAC table. The driver also supports sending
traps to the EM.
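
As a rough illustration of the field handling this agent relies on, the
sketch below mirrors three pieces of arithmetic from the patch in plain
userspace C: extracting the trap service level from trap_sl_rsvd,
extracting the vport number from the attribute modifier, and expanding the
response-time exponent of 18 into microseconds. The function names are
hypothetical and exist only for this example; the shifts and masks follow
GET_TRAP_SL_FROM_CLASS_PORT_INFO(), vema_get_vport_num() and
HFI_VNIC_TRAP_TIMEOUT, and the attribute modifier is assumed to be already
converted to host byte order.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the trap service level sits in bits 3..7 of the
 * trap_sl_rsvd byte, as in GET_TRAP_SL_FROM_CLASS_PORT_INFO().
 */
static inline uint8_t trap_sl_from_class_port_info(uint8_t trap_sl_rsvd)
{
	return (trap_sl_rsvd >> 3) & 0x1f;
}

/*
 * Hypothetical helper: the vport number sits in bits 16..23 of the
 * attribute modifier (host byte order assumed), as in vema_get_vport_num().
 */
static inline uint8_t vport_num_from_attr_mod(uint32_t attr_mod_host)
{
	return (attr_mod_host >> 16) & 0xff;
}

/*
 * Hypothetical helper: a response-time exponent e stands for
 * 4.096 usec * 2^e. Computed here as 4096 ns * 2^e, truncated to usecs,
 * which is the same arithmetic as HFI_VNIC_TRAP_TIMEOUT for e == 18.
 */
static inline uint64_t resp_time_usecs(unsigned int exponent)
{
	return (4096ULL << exponent) / 1000;
}
```

With an exponent of 18 the last helper yields 1073741 usecs, i.e. the
1.0737 sec figure quoted in the code comments.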

Change-Id: I7439f96858c9019455da1e924a0201eb27177b85
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sadanand Warrier <sadanand.warrier-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Tanya K Jajodia <tanya.k.jajodia-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sudeep Dutt <sudeep.dutt-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile |    2 +-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h     |    9 +
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c       |    9 +-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c         | 1024 ++++++++++++++++++++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c   |    2 +-
 5 files changed, 1043 insertions(+), 3 deletions(-)
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
index 375cd09..e05b72b 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
@@ -5,4 +5,4 @@ ccflags-y += -I$(src)/../include
 obj-$(CONFIG_HFI_VNIC) += hfi_vnic.o
 
 hfi_vnic-y := hfi_vnic_netdev.o hfi_vnic_encap.o hfi_vnic_ethtool.o \
-              hfi_vnic_vema_iface.o
+              hfi_vnic_vema.o hfi_vnic_vema_iface.o
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
index 8ebed89..fbebf68 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
@@ -268,6 +268,8 @@ struct hfi_vnic_rx_queue {
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
  * @flow_tbl: flow to default port redirection table
+ * @trap_timeout: trap timeout
+ * @trap_count: no. of traps allowed within timeout period
  * @q_sum_cntrs: per queue EM summary counters
  * @q_err_cntrs: per queue EM error counters
  * @q_rx_logic_errors: per queue rx logic (default) errors
@@ -301,6 +303,8 @@ struct hfi_vnic_adapter {
 	struct mutex stats_lock;
 
 	u8 flow_tbl[HFI_VNIC_FLOW_TBL_SIZE];
+	unsigned long trap_timeout;
+	u8            trap_count;
 
 	struct __hfi_vnic_summary_counters  q_sum_cntrs[HFI_VNIC_MAX_QUEUE];
 	struct __hfi_vnic_error_counters    q_err_cntrs[HFI_VNIC_MAX_QUEUE];
@@ -410,4 +414,9 @@ void hfi_vnic_set_per_veswport_info(struct hfi_vnic_adapter *adapter,
 void hfi_vnic_vema_report_event(struct hfi_vnic_adapter *adapter, u8 event);
 void hfi_vnic_set_ethtool_ops(struct net_device *ndev);
 
+int hfi_vnic_vema_init(void);
+void hfi_vnic_vema_deinit(void);
+void hfi_vnic_vema_send_trap(struct hfi_vnic_adapter *adapter,
+			     struct __hfi_veswport_trap *data, u32 lid);
+
 #endif /* _HFI_VNIC_INTERNAL_H */
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
index 75a3fd2..4ee5bb6 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
@@ -855,9 +855,15 @@ static int __init hfi_vnic_init_module(void)
 	pr_info("HFI Virtual Network Driver - %s\n",
 		hfi_vnic_driver_version);
 
-	rc = hfi_vnic_driver_register(&hfi_vnic_drv);
+	rc = hfi_vnic_vema_init();
 	if (rc)
+		return rc;
+
+	rc = hfi_vnic_driver_register(&hfi_vnic_drv);
+	if (rc) {
 		pr_err("VNIC driver register failed %d\n", rc);
+		hfi_vnic_vema_deinit();
+	}
 
 	return rc;
 }
@@ -867,6 +873,7 @@ static int __init hfi_vnic_init_module(void)
 static void __exit hfi_vnic_exit_module(void)
 {
 	hfi_vnic_driver_unregister(&hfi_vnic_drv);
+	hfi_vnic_vema_deinit();
 }
 module_exit(hfi_vnic_exit_module);
 
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c
new file mode 100644
index 0000000..b947cdf
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c
@@ -0,0 +1,1024 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+/*
+ * This file contains HFI Virtual Network Interface Controller (VNIC)
+ * Ethernet Management Agent (EMA) driver
+ */
+
+#include <rdma/ib_addr.h>
+#include <rdma/ib_smi.h>
+
+#include "hfi_vnic.h"
+#include "hfi_vnic_internal.h"
+
+/*
+ * The trap service level is kept in bits 3 to 7 in the trap_sl_rsvd
+ * field in the class port info MAD.
+ */
+#define GET_TRAP_SL_FROM_CLASS_PORT_INFO(x)  (((x) >> 3) & 0x1f)
+
+/* Cap trap bursts to a reasonable limit good for normal cases */
+#define HFI_VNIC_TRAP_BURST_LIMIT 4
+
+/*
+ * VNIC trap limit timeout.
+ * Inverse of cap_mask2 response timeout (1.0737 secs) = 0.9
+ * secs approx. See IB spec 13.4.6.2.1 PortInfoSubnetTimeout and
+ * 13.4.9 Traps.
+ */
+#define HFI_VNIC_TRAP_TIMEOUT  ((4096 * (1UL << 18)) / 1000)
+
+#define HFI_VNIC_UNSUP_ATTR  \
+		cpu_to_be16(IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB)
+
+#define HFI_VNIC_INVAL_ATTR  \
+		cpu_to_be16(IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE)
+
+#define HFI_VNIC_CLASS_CAP_TRAP  cpu_to_be16(1 << 8)
+
+struct hfi_class_port_info {
+	u8 base_version;
+	u8 class_version;
+	__be16 cap_mask;
+	__be32 cap_mask2_resp_time;
+
+	u8 redirect_gid[16];
+	__be32 redirect_tc_fl;
+	__be32 redirect_lid;
+	__be32 redirect_sl_qp;
+	__be32 redirect_qkey;
+
+	u8 trap_gid[16];
+	__be32 trap_tc_fl;
+	__be32 trap_lid;
+	__be32 trap_hl_qp;
+	__be32 trap_qkey;
+
+	__be16 trap_pkey;
+	__be16 redirect_pkey;
+
+	u8 trap_sl_rsvd;
+	u8 reserved[3];
+} __packed;
+
+/**
+ * struct hfi_vnic_vema_port -- VNIC VEMA port details
+ * @cdev:     pointer to device
+ * @mad_agent: pointer to mad agent for port
+ * @class_port_info: Class port info information.
+ * @tid: Transaction id
+ * @port_num:  port number on HFI device
+ * @lock: adapter interface lock
+ * @vnic_mask: Bit mask for vnic presence
+ */
+struct hfi_vnic_vema_port {
+	struct hfi_vnic_ctrl_device    *cdev;
+	struct ib_mad_agent            *mad_agent;
+	struct hfi_class_port_info      class_port_info;
+	u64                             tid;
+	u8                              port_num;
+
+	/* Lock to query/update network adapter */
+	struct mutex                    lock;
+	DECLARE_BITMAP(vnic_mask, HFI_MAX_VPORTS_SUPPORTED);
+};
+
+static const char hfi_vnic_ctrl_driver_name[] = "hfi_vnic_ctrl";
+
+/**
+ * vema_get_vport_num -- Get the vnic from the mad
+ * @recvd_mad:  Received mad
+ *
+ * Return: returns value of the vnic port number
+ */
+static inline u8 vema_get_vport_num(struct hfi_vnic_vema_mad *recvd_mad)
+{
+	return be32_to_cpu(recvd_mad->mad_hdr.attr_mod) >> 16 & 0xff;
+}
+
+/**
+ * vema_mac_tbl_req_ok -- Check if mac request has correct values
+ * @mac_tbl: mac table
+ *
+ * This function checks for the validity of the offset and number of
+ * entries required.
+ *
+ * Return: true if offset and num_entries are valid
+ */
+static inline bool vema_mac_tbl_req_ok(struct hfi_veswport_mactable *mac_tbl)
+{
+	u16 offset, num_entries;
+	u16 req_entries = ((HFI_VNIC_EMA_DATA - sizeof(*mac_tbl)) /
+			   sizeof(mac_tbl->tbl_entries[0]));
+
+	offset = be16_to_cpu(mac_tbl->offset);
+	num_entries = be16_to_cpu(mac_tbl->num_entries);
+
+	return ((num_entries <= req_entries) &&
+		(offset + num_entries <= HFI_VNIC_MAC_TBL_MAX_ENTRIES));
+}
+
+/**
+ * vema_parms_from_recv_mad -- Get req params from recvd mad
+ * @recvd_mad: received mad
+ * @port: ptr to port struct on which MAD was recvd
+ * @adapter: ptr to ptr to adapter to be filled in
+ *
+ * Return: 0 if success, else non-zero.
+ */
+static int vema_parms_from_recv_mad(struct hfi_vnic_vema_mad *recvd_mad,
+				    struct hfi_vnic_vema_port *port,
+				    struct hfi_vnic_adapter **adapter)
+{
+	struct hfi_vnic_device *vdev;
+	u8 vport_num;
+
+	vport_num = vema_get_vport_num(recvd_mad);
+	vdev = hfi_vnic_get_dev(port->cdev, port->port_num, vport_num);
+	if (!vdev) {
+		dev_err(&port->cdev->dev,
+			"%s:vnic_num %d vdev access err\n", __func__,
+			vport_num);
+		return -EINVAL;
+	}
+	*adapter = netdev_priv(vdev->netdev);
+
+	return 0;
+}
+
+/*
+ * Return the power-on default (POD) values in the port info structure
+ * in big endian format as required by MAD.
+ */
+static inline void vema_get_pod_values(struct hfi_veswport_info *port_info)
+{
+	memset(port_info, 0, sizeof(*port_info));
+	port_info->vport.max_mac_tbl_ent =
+		cpu_to_be16(HFI_VNIC_MAC_TBL_MAX_ENTRIES);
+	port_info->vport.max_smac_ent =
+		cpu_to_be16(HFI_VNIC_MAX_SMAC_LIMIT);
+	port_info->vport.oper_state = HFI_VNIC_STATE_DROP_ALL;
+	port_info->vport.config_state = HFI_VNIC_STATE_DROP_ALL;
+}
+
+/**
+ * vema_create_vnic -- Create a new vnic device
+ * @port: ptr to hfi_vnic_vema_port struct
+ * @vport_num: Vnic number (to be created)
+ * @adapter: Ptr to ptr to adapter associated with vnic
+ *
+ * Create the new vnic device.
+ * Return a pointer to the adapter structure within the new vnic in
+ * the third variable.
+ */
+static int vema_create_vnic(struct hfi_vnic_vema_port *port, u8 vport_num,
+			    struct hfi_vnic_adapter **adapter)
+{
+	struct hfi_vnic_device *vdev;
+	int ret;
+
+	ret = port->cdev->ctrl_ops->add_vport(port->cdev, port->port_num,
+					      vport_num);
+	if (ret) {
+		dev_err(&port->cdev->dev,
+			"%s:vnic %d not created\n", __func__, vport_num);
+	} else {
+		vdev = hfi_vnic_get_dev(port->cdev, port->port_num, vport_num);
+		if (!vdev) {
+			dev_err(&port->cdev->dev,
+				"%s:vnic_num %d vdev access err\n", __func__,
+				vport_num);
+			ret = -ENODEV;
+		} else {
+			*adapter = netdev_priv(vdev->netdev);
+			set_bit(vport_num, port->vnic_mask);
+		}
+	}
+
+	return ret;
+}
+
+/**
+ * vema_get_class_port_info -- Get class info for port
+ * @port:  Port on which MAD was received
+ * @recvd_mad: pointer to the received mad
+ * @rsp_mad:   pointer to response mad
+ *
+ * This function copies the latest class port info value set for the
+ * port and stores it for generating traps
+ */
+static void vema_get_class_port_info(struct hfi_vnic_vema_port *port,
+				     struct hfi_vnic_vema_mad *recvd_mad,
+				     struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_class_port_info *port_info;
+
+	port_info = (struct hfi_class_port_info *)rsp_mad->data;
+	memcpy(port_info, &port->class_port_info, sizeof(*port_info));
+	port_info->base_version = OPA_MGMT_BASE_VERSION;
+	port_info->class_version = HFI_EMA_CLASS_VERSION;
+
+	/* Agent generates traps */
+	port_info->cap_mask = HFI_VNIC_CLASS_CAP_TRAP;
+
+	/*
+	 * Since a get routine is always sent by the EM first we
+	 * set the expected response time to
+	 * 4.096 usec * 2^18 == 1.0737 sec here.
+	 */
+	port_info->cap_mask2_resp_time = cpu_to_be32(18);
+}
+
+/**
+ * vema_set_class_port_info -- Set class info for port
+ * @port:  Port on which MAD was received
+ * @recvd_mad: pointer to the received mad
+ * @rsp_mad:   pointer to response mad
+ *
+ * This function updates the port class info for the specific vnic
+ * and sets up the response mad data
+ */
+static void vema_set_class_port_info(struct hfi_vnic_vema_port *port,
+				     struct hfi_vnic_vema_mad *recvd_mad,
+				     struct hfi_vnic_vema_mad *rsp_mad)
+{
+	memcpy(&port->class_port_info, recvd_mad->data,
+	       sizeof(port->class_port_info));
+
+	vema_get_class_port_info(port, recvd_mad, rsp_mad);
+}
+
+/**
+ * vema_get_veswport_info -- Get veswport info
+ * @port:      source port on which MAD was received
+ * @recvd_mad: pointer to the received mad
+ * @rsp_mad:   pointer to response mad
+ */
+static void vema_get_veswport_info(struct hfi_vnic_vema_port *port,
+				   struct hfi_vnic_vema_mad *recvd_mad,
+				   struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_veswport_info *port_info =
+				  (struct hfi_veswport_info *)rsp_mad->data;
+	struct hfi_vnic_adapter *adapter;
+	u8 vport_num;
+
+	vport_num = vema_get_vport_num(recvd_mad);
+
+	if (test_bit(vport_num, port->vnic_mask)) {
+		if (vema_parms_from_recv_mad(recvd_mad, port, &adapter)) {
+			dev_err(&port->cdev->dev,
+				"%s:vnic adapter not found\n", __func__);
+			goto err_exit;
+		} else {
+			memset(port_info, 0, sizeof(*port_info));
+			hfi_vnic_get_vesw_info(adapter, &port_info->vesw);
+			hfi_vnic_get_per_veswport_info(adapter,
+						       &port_info->vport);
+		}
+	} else {
+		vema_get_pod_values(port_info);
+	}
+	return;
+
+err_exit:
+	rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+}
+
+/**
+ * vema_set_veswport_info -- Set veswport info
+ * @port:      source port on which MAD was received
+ * @recvd_mad: pointer to the received mad
+ * @rsp_mad:   pointer to response mad
+ *
+ * This function sets the veswport info for the vnic, creating it if needed
+ */
+static void vema_set_veswport_info(struct hfi_vnic_vema_port *port,
+				   struct hfi_vnic_vema_mad *recvd_mad,
+				   struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_veswport_info *port_info;
+	struct hfi_vnic_adapter *adapter;
+	u8 vport_num;
+
+	vport_num = vema_get_vport_num(recvd_mad);
+
+	if (test_bit(vport_num, port->vnic_mask)) {
+		if (vema_parms_from_recv_mad(recvd_mad, port, &adapter))
+			goto err_exit;
+	} else if (vema_create_vnic(port, vport_num, &adapter)) {
+		dev_err(&port->cdev->dev,
+			"%s:vnic %d not created\n", __func__, vport_num);
+		goto err_exit;
+	}
+
+	port_info = (struct hfi_veswport_info *)recvd_mad->data;
+	hfi_vnic_set_vesw_info(adapter, &port_info->vesw);
+	hfi_vnic_set_per_veswport_info(adapter, &port_info->vport);
+
+	/* Process the new config settings */
+	hfi_vnic_process_vema_config(adapter);
+
+	vema_get_veswport_info(port, recvd_mad, rsp_mad);
+	return;
+
+err_exit:
+	rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+}
+
+/**
+ * vema_get_mac_entries -- Get MAC entries in VNIC MAC table
+ * @port:      source port on which MAD was received
+ * @recvd_mad: pointer to the received mad
+ * @rsp_mad:   pointer to response mad
+ *
+ * This function gets the MAC entries that are programmed into
+ * the VNIC MAC forwarding table. It checks for the validity of
+ * the index into the MAC table and the number of entries that
+ * are to be retrieved.
+ */
+static void vema_get_mac_entries(struct hfi_vnic_vema_port *port,
+				 struct hfi_vnic_vema_mad *recvd_mad,
+				 struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_veswport_mactable *mac_tbl_in, *mac_tbl_out;
+	struct hfi_vnic_adapter *adapter;
+
+	if (vema_parms_from_recv_mad(recvd_mad, port, &adapter)) {
+		rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+		return;
+	}
+
+	mac_tbl_in = (struct hfi_veswport_mactable *)recvd_mad->data;
+	mac_tbl_out = (struct hfi_veswport_mactable *)rsp_mad->data;
+
+	if (vema_mac_tbl_req_ok(mac_tbl_in)) {
+		mac_tbl_out->offset = mac_tbl_in->offset;
+		mac_tbl_out->num_entries = mac_tbl_in->num_entries;
+		hfi_vnic_query_mac_tbl(adapter, mac_tbl_out);
+	} else {
+		rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+	}
+}
+
+/**
+ * vema_set_mac_entries -- Set MAC entries in VNIC MAC table
+ * @port:      source port on which MAD was received
+ * @recvd_mad: pointer to the received mad
+ * @rsp_mad:   pointer to response mad
+ *
+ * This function sets the MAC entries in the VNIC forwarding table.
+ * It checks for the validity of the index and the number of forwarding
+ * table entries to be programmed.
+ */
+static void vema_set_mac_entries(struct hfi_vnic_vema_port *port,
+				 struct hfi_vnic_vema_mad *recvd_mad,
+				 struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_veswport_mactable *mac_tbl;
+	struct hfi_vnic_adapter *adapter;
+
+	if (vema_parms_from_recv_mad(recvd_mad, port, &adapter)) {
+		rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+		return;
+	}
+
+	mac_tbl = (struct hfi_veswport_mactable *)recvd_mad->data;
+	if (vema_mac_tbl_req_ok(mac_tbl)) {
+		if (hfi_vnic_update_mac_tbl(adapter, mac_tbl))
+			rsp_mad->mad_hdr.status = HFI_VNIC_UNSUP_ATTR;
+	} else {
+		rsp_mad->mad_hdr.status = HFI_VNIC_UNSUP_ATTR;
+	}
+	vema_get_mac_entries(port, recvd_mad, rsp_mad);
+}
+
+/**
+ * vema_set_delete_vesw -- Reset VESW info to POD values
+ * @port:      source port on which MAD was received
+ * @recvd_mad: pointer to the received mad
+ * @rsp_mad:   pointer to response mad
+ *
+ * This function clears all the fields of veswport info for the requested vesw
+ * and sets them back to the power-on default values. It does not delete the
+ * vesw.
+ */
+static void vema_set_delete_vesw(struct hfi_vnic_vema_port *port,
+				 struct hfi_vnic_vema_mad *recvd_mad,
+				 struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_veswport_info *port_info =
+				  (struct hfi_veswport_info *)rsp_mad->data;
+	struct hfi_vnic_adapter *adapter;
+
+	if (vema_parms_from_recv_mad(recvd_mad, port, &adapter)) {
+		rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+		return;
+	}
+
+	vema_get_pod_values(port_info);
+	hfi_vnic_set_vesw_info(adapter, &port_info->vesw);
+	hfi_vnic_set_per_veswport_info(adapter, &port_info->vport);
+
+	/* Process the new config settings */
+	hfi_vnic_process_vema_config(adapter);
+
+	hfi_vnic_release_mac_tbl(adapter);
+
+	vema_get_veswport_info(port, recvd_mad, rsp_mad);
+}
+
+/**
+ * vema_get_mac_list -- Get the unicast/multicast macs.
+ * @port:      source port on which MAD was received
+ * @recvd_mad: Received mad contains fields to set vnic parameters
+ * @rsp_mad:   Response mad to be built
+ * @attr_id:   Attribute ID indicating multicast or unicast mac list
+ */
+static void vema_get_mac_list(struct hfi_vnic_vema_port *port,
+			      struct hfi_vnic_vema_mad *recvd_mad,
+			      struct hfi_vnic_vema_mad *rsp_mad,
+			      u16 attr_id)
+{
+	struct hfi_veswport_iface_macs *macs_in, *macs_out;
+	int max_entries = (HFI_VNIC_EMA_DATA - sizeof(*macs_out)) / ETH_ALEN;
+	struct hfi_vnic_adapter *adapter;
+
+	if (vema_parms_from_recv_mad(recvd_mad, port, &adapter)) {
+		rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+		return;
+	}
+
+	macs_in = (struct hfi_veswport_iface_macs *)recvd_mad->data;
+	macs_out = (struct hfi_veswport_iface_macs *)rsp_mad->data;
+
+	macs_out->start_idx = macs_in->start_idx;
+	if (macs_in->num_macs_in_msg)
+		macs_out->num_macs_in_msg = macs_in->num_macs_in_msg;
+	else
+		macs_out->num_macs_in_msg = cpu_to_be16(max_entries);
+
+	if (attr_id == HFI_EM_ATTR_IFACE_MCAST_MACS)
+		hfi_vnic_query_mcast_macs(adapter, macs_out);
+	else
+		hfi_vnic_query_ucast_macs(adapter, macs_out);
+}
+
+/**
+ * vema_get_summary_counters -- Gets summary counters.
+ * @port:      source port on which MAD was received
+ * @recvd_mad: Received mad contains fields to set vnic parameters
+ * @rsp_mad:   Response mad to be built
+ */
+static void vema_get_summary_counters(struct hfi_vnic_vema_port *port,
+				      struct hfi_vnic_vema_mad *recvd_mad,
+				      struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_veswport_summary_counters *cntrs;
+	struct hfi_vnic_adapter *adapter;
+
+	if (vema_parms_from_recv_mad(recvd_mad, port, &adapter)) {
+		rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+		return;
+	}
+	cntrs = (struct hfi_veswport_summary_counters *)rsp_mad->data;
+	hfi_vnic_get_summary_counters(adapter, cntrs);
+}
+
+/**
+ * vema_get_error_counters -- Gets error counters.
+ * @port:      source port on which MAD was received
+ * @recvd_mad: Received mad contains fields to set vnic parameters
+ * @rsp_mad:   Response mad to be built
+ */
+static void vema_get_error_counters(struct hfi_vnic_vema_port *port,
+				    struct hfi_vnic_vema_mad *recvd_mad,
+				    struct hfi_vnic_vema_mad *rsp_mad)
+{
+	struct hfi_veswport_error_counters *cntrs;
+	struct hfi_vnic_adapter *adapter;
+
+	if (vema_parms_from_recv_mad(recvd_mad, port, &adapter)) {
+		rsp_mad->mad_hdr.status = HFI_VNIC_INVAL_ATTR;
+		return;
+	}
+	cntrs = (struct hfi_veswport_error_counters *)rsp_mad->data;
+	hfi_vnic_get_error_counters(adapter, cntrs);
+}
+
+/**
+ * vema_get -- Process received get MAD
+ * @port:      source port on which MAD was received
+ * @recvd_mad: Received mad
+ * @rsp_mad:   Response mad to be built
+ */
+static void vema_get(struct hfi_vnic_vema_port *port,
+		     struct hfi_vnic_vema_mad *recvd_mad,
+		     struct hfi_vnic_vema_mad *rsp_mad)
+{
+	u16 attr_id = be16_to_cpu(recvd_mad->mad_hdr.attr_id);
+
+	switch (attr_id) {
+	case HFI_EM_ATTR_CLASS_PORT_INFO:
+		vema_get_class_port_info(port, recvd_mad, rsp_mad);
+		break;
+	case HFI_EM_ATTR_VESWPORT_INFO:
+		vema_get_veswport_info(port, recvd_mad, rsp_mad);
+		break;
+	case HFI_EM_ATTR_VESWPORT_MAC_ENTRIES:
+		vema_get_mac_entries(port, recvd_mad, rsp_mad);
+		break;
+	case HFI_EM_ATTR_IFACE_UCAST_MACS:
+		/* fall through */
+	case HFI_EM_ATTR_IFACE_MCAST_MACS:
+		vema_get_mac_list(port, recvd_mad, rsp_mad, attr_id);
+		break;
+	case HFI_EM_ATTR_VESWPORT_SUMMARY_COUNTERS:
+		vema_get_summary_counters(port, recvd_mad, rsp_mad);
+		break;
+	case HFI_EM_ATTR_VESWPORT_ERROR_COUNTERS:
+		vema_get_error_counters(port, recvd_mad, rsp_mad);
+		break;
+	default:
+		rsp_mad->mad_hdr.status = HFI_VNIC_UNSUP_ATTR;
+		break;
+	}
+}
+
+/**
+ * vema_set -- Process received set MAD
+ * @port:      source port on which MAD was received
+ * @recvd_mad: Received mad contains fields to set vnic parameters
+ * @rsp_mad:   Response mad to be built
+ */
+static void vema_set(struct hfi_vnic_vema_port *port,
+		     struct hfi_vnic_vema_mad *recvd_mad,
+		     struct hfi_vnic_vema_mad *rsp_mad)
+{
+	u16 attr_id = be16_to_cpu(recvd_mad->mad_hdr.attr_id);
+
+	switch (attr_id) {
+	case HFI_EM_ATTR_CLASS_PORT_INFO:
+		vema_set_class_port_info(port, recvd_mad, rsp_mad);
+		break;
+	case HFI_EM_ATTR_VESWPORT_INFO:
+		vema_set_veswport_info(port, recvd_mad, rsp_mad);
+		break;
+	case HFI_EM_ATTR_VESWPORT_MAC_ENTRIES:
+		vema_set_mac_entries(port, recvd_mad, rsp_mad);
+		break;
+	case HFI_EM_ATTR_DELETE_VESW:
+		vema_set_delete_vesw(port, recvd_mad, rsp_mad);
+		break;
+	default:
+		rsp_mad->mad_hdr.status = HFI_VNIC_UNSUP_ATTR;
+		break;
+	}
+}
+
+/**
+ * vema_send -- Send handler for VEMA MAD agent
+ * @mad_agent: pointer to the mad agent
+ * @mad_wc:    pointer to mad send work completion information
+ *
+ * Free all the data structures associated with the sent MAD
+ */
+static void vema_send(struct ib_mad_agent *mad_agent,
+		      struct ib_mad_send_wc *mad_wc)
+{
+	ib_destroy_ah(mad_wc->send_buf->ah);
+	ib_free_send_mad(mad_wc->send_buf);
+}
+
+/**
+ * vema_recv -- Recv handler for VEMA MAD agent
+ * @mad_agent: pointer to the mad agent
+ * @send_buf: Send buffer if found, else NULL
+ * @mad_wc:    pointer to mad receive work completion information
+ *
+ * Handle only set and get methods and respond to other methods
+ * as unsupported. Allocate response buffer and address handle
+ * for the response MAD.
+ */
+static void vema_recv(struct ib_mad_agent *mad_agent,
+		      struct ib_mad_send_buf *send_buf,
+		      struct ib_mad_recv_wc *mad_wc)
+{
+	struct hfi_vnic_vema_port *port;
+	struct ib_ah              *ah;
+	struct ib_mad_send_buf    *rsp;
+	struct hfi_vnic_vema_mad  *vema_mad;
+
+	if (!mad_wc || !mad_wc->recv_buf.mad)
+		return;
+
+	port = mad_agent->context;
+	ah = ib_create_ah_from_wc(mad_agent->qp->pd, mad_wc->wc,
+				  mad_wc->recv_buf.grh, mad_agent->port_num);
+	if (IS_ERR(ah))
+		goto free_recv_mad;
+
+	rsp = ib_create_send_mad(mad_agent, mad_wc->wc->src_qp,
+				 mad_wc->wc->pkey_index, 0,
+				 IB_MGMT_VENDOR_HDR, HFI_VNIC_EMA_DATA,
+				 GFP_KERNEL, OPA_MGMT_BASE_VERSION);
+	if (IS_ERR(rsp))
+		goto err_rsp;
+
+	rsp->ah = ah;
+	vema_mad = rsp->mad;
+	memcpy(vema_mad, mad_wc->recv_buf.mad, IB_MGMT_VENDOR_HDR);
+	vema_mad->mad_hdr.method = IB_MGMT_METHOD_GET_RESP;
+	vema_mad->mad_hdr.status = 0;
+
+	/* Lock ensures network adapter is not removed */
+	mutex_lock(&port->lock);
+
+	switch (mad_wc->recv_buf.mad->mad_hdr.method) {
+	case IB_MGMT_METHOD_GET:
+		vema_get(port, (struct hfi_vnic_vema_mad *)mad_wc->recv_buf.mad,
+			 vema_mad);
+		break;
+	case IB_MGMT_METHOD_SET:
+		vema_set(port, (struct hfi_vnic_vema_mad *)mad_wc->recv_buf.mad,
+			 vema_mad);
+		break;
+	default:
+		vema_mad->mad_hdr.status = HFI_VNIC_UNSUP_ATTR;
+		break;
+	}
+	mutex_unlock(&port->lock);
+
+	if (!ib_post_send_mad(rsp, NULL)) {
+		/*
+		 * with post send successful ah and send mad
+		 * will be destroyed in send handler
+		 */
+		goto free_recv_mad;
+	}
+
+	ib_free_send_mad(rsp);
+
+err_rsp:
+	ib_destroy_ah(ah);
+free_recv_mad:
+	ib_free_recv_mad(mad_wc);
+}
+
+/**
+ * vema_get_port -- Gets the hfi_vnic_vema_port
+ * @cdev: pointer to control dev
+ * @port_num: Port number
+ *
+ * This function loops through the ports and returns
+ * the hfi_vnic_vema port structure that is associated
+ * with the HFI port number
+ *
+ * Return: ptr to requested hfi_vnic_vema_port structure
+ *         if success, NULL if not
+ */
+static struct hfi_vnic_vema_port *
+vema_get_port(struct hfi_vnic_ctrl_device *cdev, u16 port_num)
+{
+	struct hfi_vnic_vema_port *port_base;
+
+	if (!port_num || port_num > cdev->num_ports)
+		return NULL;
+
+	port_base = (struct hfi_vnic_vema_port *)dev_get_drvdata(&cdev->dev);
+	return port_base + (port_num - 1);
+}
+
+/**
+ * vema_unregister -- Unregisters agent
+ * @cdev: pointer to control device
+ *
+ * This deletes the registration by VEMA for MADs
+ */
+static void vema_unregister(struct hfi_vnic_ctrl_device *cdev)
+{
+	struct hfi_vnic_vema_port *port, *port_base;
+	int i, j;
+
+	port_base = (struct hfi_vnic_vema_port *)dev_get_drvdata(&cdev->dev);
+	for (i = 0, port = port_base; i < cdev->num_ports; i++, port++) {
+		/* Lock ensures no MAD is being processed */
+		mutex_lock(&port->lock);
+		for (j = 0; j < HFI_MAX_VPORTS_SUPPORTED; j++) {
+			if (test_bit(j, port->vnic_mask)) {
+				port->cdev->ctrl_ops->rem_vport(cdev,
+								port->port_num,
+								j);
+				clear_bit(j, port->vnic_mask);
+			}
+		}
+		mutex_unlock(&port->lock);
+		if (port->mad_agent)
+			ib_unregister_mad_agent(port->mad_agent);
+
+		mutex_destroy(&port->lock);
+	}
+
+	kfree(port_base);
+}
+
+/**
+ * vema_register -- Registers agent
+ * @cdev: pointer to control device
+ *
+ * This function registers the handlers for the VEMA MADs
+ *
+ * Return: 0 on success, non-zero otherwise
+ */
+static int vema_register(struct hfi_vnic_ctrl_device *cdev)
+{
+	struct hfi_vnic_vema_port *port, *port_base;
+
+	struct ib_mad_reg_req reg_req = {
+		.mgmt_class = HFI_MGMT_CLASS_INTEL_EMA,
+		.mgmt_class_version = OPA_MGMT_BASE_VERSION,
+		.oui = { INTEL_OUI_1, INTEL_OUI_2, INTEL_OUI_3 }
+	};
+	int i;
+
+	port_base = kcalloc(cdev->num_ports, sizeof(*port), GFP_KERNEL);
+	if (!port_base)
+		return -ENOMEM;
+
+	dev_set_drvdata(&cdev->dev, port_base);
+
+	set_bit(IB_MGMT_METHOD_GET, reg_req.method_mask);
+	set_bit(IB_MGMT_METHOD_SET, reg_req.method_mask);
+
+	/* register mad agent for each port on dev */
+	for (i = 0, port = port_base; i < cdev->num_ports; i++, port++) {
+		port->cdev = cdev;
+		port->port_num = i + 1;
+		mutex_init(&port->lock);
+		port->mad_agent = ib_register_mad_agent(cdev->ibdev, i + 1,
+							IB_QPT_GSI,
+							&reg_req,
+							IB_MGMT_RMPP_VERSION,
+							vema_send,
+							vema_recv,
+							port,
+							0);
+		if (IS_ERR(port->mad_agent)) {
+			int ret = PTR_ERR(port->mad_agent);
+
+			port->mad_agent = NULL;
+			vema_unregister(cdev);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * hfi_vnic_vema_send_trap -- This function sends a trap to the EM
+ * @adapter: pointer to vnic adapter
+ * @data: pointer to trap data filled by calling function
+ * @lid:  issuer's LID (encap_slid from vesw_port_info)
+ *
+ * This function is called from the VNIC driver to send a trap if there
+ * is something the EM should be notified about. These events currently
+ * are
+ * 1) UNICAST INTERFACE MACADDRESS changes
+ * 2) MULTICAST INTERFACE MACADDRESS changes
+ * 3) ETHERNET LINK STATUS changes
+ * While allocating the send MAD, the remote side QPN used is 1
+ * as this is the well-known QP.
+ *
+ */
+void hfi_vnic_vema_send_trap(struct hfi_vnic_adapter *adapter,
+			     struct __hfi_veswport_trap *data, u32 lid)
+{
+	struct hfi_vnic_ctrl_device *cdev = adapter->vdev->cdev;
+	struct ib_mad_send_buf *send_buf;
+	struct hfi_vnic_vema_port *port;
+	struct ib_device *ibp;
+	struct hfi_vnic_vema_mad_trap *trap_mad;
+	struct hfi_class_port_info *class;
+	struct ib_ah_attr ah_attr;
+	struct ib_ah *ah;
+	struct hfi_veswport_trap *trap;
+	u32 trap_lid;
+	u16 pkey_idx;
+
+	if (!cdev)
+		goto err_exit;
+	ibp = cdev->ibdev;
+	port = vema_get_port(cdev, data->hfiportnum);
+	if (!port || !port->mad_agent)
+		goto err_exit;
+
+	if (time_before(jiffies, adapter->trap_timeout)) {
+		if (adapter->trap_count == HFI_VNIC_TRAP_BURST_LIMIT) {
+			v_warn("Trap rate exceeded\n");
+			goto err_exit;
+		} else {
+			adapter->trap_count++;
+		}
+	} else {
+		adapter->trap_count = 0;
+	}
+
+	class = &port->class_port_info;
+	/* Set up address handle */
+	memset(&ah_attr, 0, sizeof(ah_attr));
+	ah_attr.sl = GET_TRAP_SL_FROM_CLASS_PORT_INFO(class->trap_sl_rsvd);
+	ah_attr.port_num = port->port_num;
+	trap_lid = be32_to_cpu(class->trap_lid);
+	/*
+	 * Check the trap LID for validity; it must not be zero.
+	 * The trap sink could change after we build the MAD, but since
+	 * trap delivery is not guaranteed we do not take a lock: a lock
+	 * would not prevent the change from happening anyway.
+	 */
+	if (!trap_lid) {
+		dev_err(&cdev->dev, "%s: Invalid dlid\n", __func__);
+		goto err_exit;
+	}
+
+	ah_attr.dlid = trap_lid;
+	ah = ib_create_ah(port->mad_agent->qp->pd, &ah_attr);
+	if (IS_ERR(ah)) {
+		dev_err(&cdev->dev,
+			"%s:Couldn't create new AH = %p\n", __func__, ah);
+		dev_err(&cdev->dev,
+			"%s:dlid = %d, sl = %d, port = %d\n", __func__,
+			ah_attr.dlid, ah_attr.sl, ah_attr.port_num);
+		goto err_exit;
+	}
+
+	if (ib_find_pkey(ibp, data->hfiportnum, IB_DEFAULT_PKEY_FULL,
+			 &pkey_idx) < 0) {
+		dev_err(&cdev->dev,
+			"%s:full key not found, defaulting to partial\n",
+			__func__);
+		if (ib_find_pkey(ibp, data->hfiportnum, IB_DEFAULT_PKEY_PARTIAL,
+				 &pkey_idx) < 0)
+			pkey_idx = 1;
+	}
+
+	send_buf = ib_create_send_mad(port->mad_agent, 1, pkey_idx, 0,
+				      IB_MGMT_VENDOR_HDR, IB_MGMT_MAD_DATA,
+				      GFP_KERNEL, OPA_MGMT_BASE_VERSION);
+	if (IS_ERR(send_buf)) {
+		dev_err(&cdev->dev, "%s:Couldn't allocate send buf\n",
+			__func__);
+		goto err_sndbuf;
+	}
+
+	send_buf->ah = ah;
+
+	/* Set up common MAD hdr */
+	trap_mad = send_buf->mad;
+	trap_mad->mad_hdr.base_version = OPA_MGMT_BASE_VERSION;
+	trap_mad->mad_hdr.mgmt_class = HFI_MGMT_CLASS_INTEL_EMA;
+	trap_mad->mad_hdr.class_version = HFI_EMA_CLASS_VERSION;
+	trap_mad->mad_hdr.method = IB_MGMT_METHOD_TRAP;
+	port->tid++;
+	trap_mad->mad_hdr.tid = cpu_to_be64(port->tid);
+	trap_mad->mad_hdr.attr_id = IB_SMP_ATTR_NOTICE;
+
+	/* Set up vendor OUI */
+	trap_mad->oui[0] = INTEL_OUI_1;
+	trap_mad->oui[1] = INTEL_OUI_2;
+	trap_mad->oui[2] = INTEL_OUI_3;
+
+	/* Setup notice attribute portion */
+	trap_mad->notice.gen_type = HFI_INTEL_EMA_NOTICE_TYPE_INFO << 1;
+	trap_mad->notice.oui_1 = INTEL_OUI_1;
+	trap_mad->notice.oui_2 = INTEL_OUI_2;
+	trap_mad->notice.oui_3 = INTEL_OUI_3;
+	trap_mad->notice.issuer_lid = cpu_to_be32(lid);
+
+	/* copy the actual trap data */
+	trap = (struct hfi_veswport_trap *)trap_mad->notice.raw_data;
+	trap->fabric_id = cpu_to_be16(data->fabric_id);
+	trap->veswid = cpu_to_be16(data->veswid);
+	trap->veswportnum = cpu_to_be32(data->veswportnum);
+	trap->hfiportnum = cpu_to_be16(data->hfiportnum);
+	trap->veswportindex = data->veswportindex;
+	trap->opcode = data->opcode;
+
+	/* On a successful send, set up the rate limit timeout; otherwise bail */
+	if (ib_post_send_mad(send_buf, NULL)) {
+		ib_free_send_mad(send_buf);
+	} else {
+		if (adapter->trap_count)
+			return;
+		adapter->trap_timeout = jiffies +
+					usecs_to_jiffies(HFI_VNIC_TRAP_TIMEOUT);
+		return;
+	}
+
+err_sndbuf:
+	ib_destroy_ah(ah);
+err_exit:
+	v_err("%s: Aborting trap\n", __func__);
+}
+
+/* hfi_vnic_ctrl_drv_probe - control device initialization routine */
+static int hfi_vnic_ctrl_drv_probe(struct device *dev)
+{
+	struct hfi_vnic_ctrl_device *cdev = container_of(dev,
+					 struct hfi_vnic_ctrl_device, dev);
+	int rc;
+
+	/* Initialize hfi vnic management agent (vema) */
+	rc = vema_register(cdev);
+	if (!rc)
+		dev_info(dev, "initialized\n");
+
+	return rc;
+}
+
+/* hfi_vnic_ctrl_drv_remove - control device removal routine */
+static int hfi_vnic_ctrl_drv_remove(struct device *dev)
+{
+	struct hfi_vnic_ctrl_device *cdev = container_of(dev,
+					 struct hfi_vnic_ctrl_device, dev);
+
+	vema_unregister(cdev);
+
+	dev_info(dev, "removed\n");
+	return 0;
+}
+
+/* HFI Virtual Network Control Driver */
+static struct hfi_vnic_ctrl_driver hfi_vnic_ctrl_drv = {
+	.drvwrap = {
+		.type = HFI_VNIC_CTRL_DRV,
+		.driver = {
+			.name   = hfi_vnic_ctrl_driver_name,
+			.probe  = hfi_vnic_ctrl_drv_probe,
+			.remove = hfi_vnic_ctrl_drv_remove
+		}
+	}
+};
+
+/* hfi_vnic_vema_init - initialize vema */
+int __init hfi_vnic_vema_init(void)
+{
+	int rc;
+
+	rc = hfi_vnic_ctrl_driver_register(&hfi_vnic_ctrl_drv);
+	if (rc)
+		pr_err("VNIC ctrl driver register failed %d\n", rc);
+
+	return rc;
+}
+
+/* hfi_vnic_vema_deinit - deinitialize vema */
+void hfi_vnic_vema_deinit(void)
+{
+	hfi_vnic_ctrl_driver_unregister(&hfi_vnic_ctrl_drv);
+}
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c
index 4a87826..98ddaaf 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c
@@ -72,7 +72,7 @@ void hfi_vnic_vema_report_event(struct hfi_vnic_adapter *adapter, u8 event)
 	trap_data.veswportindex = vdev->vport_num;
 	trap_data.opcode = event;
 
-	/* Need to send trap here */
+	hfi_vnic_vema_send_trap(adapter, &trap_data, info->vport.encap_slid);
 }
 
 /**
-- 
1.8.3.1
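
For readers following the trap rate limiting in hfi_vnic_vema_send_trap(),
here is a minimal userspace sketch of the same windowed burst check. It is
not part of the patch. The names trap_limiter, trap_allowed(), trap_sent()
and the two constants are hypothetical stand-ins for adapter->trap_count,
adapter->trap_timeout, HFI_VNIC_TRAP_BURST_LIMIT and HFI_VNIC_TRAP_TIMEOUT,
whose real values are not shown in this hunk. A plain increasing tick
counter replaces jiffies, so wraparound handling (time_before()) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values; the driver's real constants are not in this hunk. */
#define TRAP_BURST_LIMIT  4
#define TRAP_WINDOW_TICKS 500

struct trap_limiter {
	uint64_t window_end;	/* tick at which the current window expires */
	unsigned int count;	/* traps counted inside the current window */
};

/*
 * Mirrors the check at the top of the send path: inside the window the
 * counter grows until it reaches the burst limit, after which traps are
 * dropped. Once the window has expired the counter is reset.
 */
static bool trap_allowed(struct trap_limiter *rl, uint64_t now)
{
	assert(rl);

	if (now < rl->window_end) {
		if (rl->count == TRAP_BURST_LIMIT)
			return false;
		rl->count++;
	} else {
		rl->count = 0;
	}
	return true;
}

/*
 * Mirrors the success branch after posting the MAD: a new window is
 * opened only by the first trap sent after a reset (count == 0).
 */
static void trap_sent(struct trap_limiter *rl, uint64_t now)
{
	assert(rl);

	if (rl->count == 0)
		rl->window_end = now + TRAP_WINDOW_TICKS;
}
```

With this structure, the trap that opens a window is not counted, so up to
TRAP_BURST_LIMIT + 1 traps can pass within one window in the sketch.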
