Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 5/7] ipv6 addrconf: implement RFC7559 router solicitation backoff
From: Maciej Żenczykowski @ 2016-09-25  2:14 UTC (permalink / raw)
  To: Maciej Żenczykowski, David S . Miller
  Cc: Linux NetDev, Erik Kline, Lorenzo Colitti
In-Reply-To: <1474733109-18355-5-git-send-email-zenczykowski@gmail.com>

Ok, so that seems to have all sorts of __divdi3 or __aeabi_ldivmod
undefined errors on 32-bit platforms (ie. arm/m68k/i386).

^ permalink raw reply

* [PATCH][V2] mlxsw: spectrum: remove redundant check if err is zero
From: Colin King @ 2016-09-25  1:03 UTC (permalink / raw)
  To: Hariprasad S, netdev; +Cc: linux-kernel

From: Colin Ian King <colin.king@canonical.com>

There is an earlier check and return if err is non-zero, so
the check to see if it is zero is redundant in every iteration
of the loop and hence the check can be removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
index 2a61617..1073673 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_filter.c
@@ -117,11 +117,11 @@ static int validate_filter(struct net_device *dev,
 	return 0;
 }
 
-static unsigned int get_filter_steerq(struct net_device *dev,
-				      struct ch_filter_specification *fs)
+static int get_filter_steerq(struct net_device *dev,
+			     struct ch_filter_specification *fs)
 {
 	struct adapter *adapter = netdev2adap(dev);
-	unsigned int iq;
+	int iq;
 
 	/* If the user has requested steering matching Ingress Packets
 	 * to a specific Queue Set, we need to make sure it's in range
@@ -443,10 +443,10 @@ int __cxgb4_set_filter(struct net_device *dev, int filter_id,
 		       struct filter_ctx *ctx)
 {
 	struct adapter *adapter = netdev2adap(dev);
-	unsigned int max_fidx, fidx, iq;
+	unsigned int max_fidx, fidx;
 	struct filter_entry *f;
 	u32 iconf;
-	int ret;
+	int iq, ret;
 
 	max_fidx = adapter->tids.nftids;
 	if (filter_id != (max_fidx + adapter->tids.nsftids - 1) &&
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH net-next 4/4] net/sched: act_mirred: Implement ingress actions
From: Cong Wang @ 2016-09-25  0:20 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Jamal Hadi Salim, David S. Miller, Eric Dumazet,
	Linux Kernel Network Developers
In-Reply-To: <20160923184030.75124289@halley>

On Fri, Sep 23, 2016 at 8:40 AM, Shmulik Ladkani
<shmulik.ladkani@gmail.com> wrote:
> On Fri, 23 Sep 2016 08:48:33 -0400 Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> > Even today, one may create loops using existing 'egress redirect',
>> > e.g. this rediculously errorneous construct:
>> >
>> >  # ip l add v0 type veth peer name v0p
>> >  # tc filter add dev v0p parent ffff: basic \
>> >     action mirred egress redirect dev v0
>>
>> I think we actually recover from this one by eventually
>> dropping (theres a ttl field).
>
> [off topic]
>
> Don't know about that :) cpu fan got very noisy, 3 of 4 cores at 100%,
> and after one second I got:
>
> # ip -s l show type veth
> 16: v0p@v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether a2:64:ff:10:dd:85 brd ff:ff:ff:ff:ff:ff
>     RX: bytes  packets  errors  dropped overrun mcast
>     71660305923 469890864 0       0       0       0
>     TX: bytes  packets  errors  dropped carrier collsns
>     3509       24       0       0       0       0
> 17: v0@v0p: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 52:a2:34:f6:7c:ec brd ff:ff:ff:ff:ff:ff
>     RX: bytes  packets  errors  dropped overrun mcast
>     3509       24       0       0       0       0
>     TX: bytes  packets  errors  dropped carrier collsns
>     71660713017 469893555 0       0       0       0

These ghost packets never enter IP stack, I don't think TTL
helps.

^ permalink raw reply

* Re: [PATCH net] Revert "net: ethernet: bcmgenet: use phydev from struct net_device"
From: David Miller @ 2016-09-25  0:10 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, tremyfr, jaedon.shin
In-Reply-To: <1474747110-6496-1-git-send-email-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Sat, 24 Sep 2016 12:58:30 -0700

> There is already a commit:
> 
> Revert "net: ethernet: bcmgenet: use phy_ethtool_{get|set}_link_ksettings"
> 
> which should make this apply cleanly to "net" now.

But look at net-next, it got re-added there.

This is going to be a bit of a merge hassle, and this is why I pushed
back on the other attempt to revert this thing.

^ permalink raw reply

* Re: [PATCH net-next] gre: use nla_get_be32() to extract flowinfo
From: David Miller @ 2016-09-25  0:08 UTC (permalink / raw)
  To: lrichard; +Cc: netdev
In-Reply-To: <1474740064-13580-1-git-send-email-lrichard@redhat.com>

From: Lance Richardson <lrichard@redhat.com>
Date: Sat, 24 Sep 2016 14:01:04 -0400

> Eliminate a sparse endianness mismatch warning, use nla_get_be32() to
> extract a __be32 value instead of nla_get_u32().
> 
> Signed-off-by: Lance Richardson <lrichard@redhat.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 4/4] net/sched: act_mirred: Implement ingress actions
From: Cong Wang @ 2016-09-25  0:07 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Jamal Hadi Salim, David S. Miller, Eric Dumazet,
	Linux Kernel Network Developers
In-Reply-To: <20160923081106.73fb48df@halley>

On Thu, Sep 22, 2016 at 10:11 PM, Shmulik Ladkani
<shmulik.ladkani@gmail.com> wrote:
> Was wondering why it's missing, googled a bit with no meaningful
> results, so speculated the following:
>
> Some time long ago, initial 'mirred' purpose was to facilitate ifb.
> Therefore 'egress redirect' was implemented. Jamal probably left the
> 'ingress' support for a later time :)
>
> One interesting usecase for 'ingress redirect' is creating "rx bouncing"
> construct (like macvlan/macvtap/ipvlan) but applied according to custom
> logic.

We have done this for our containers for a long time. We simply
redirect packets to veth TX then flow to veth RX of course.

One problem to use your code for us is that, the RX side of veth
is inside containers, not visible to outside, perhaps we need some
more parameter to tell the netns before the device name/index?
Thoughts?

>
>> It may be around preventing loops maybe.
>
> Could be, but personally, I treat these constructs as (powerful)
> building blocks, and "with great power comes great responsibility".
>
> Even today, one may create loops using existing 'egress redirect',
> e.g. this rediculously errorneous construct:
>
>  # ip l add v0 type veth peer name v0p
>  # tc filter add dev v0p parent ffff: basic \
>     action mirred egress redirect dev v0

Detecting such loops should not be hard technically, like we do
for reclassification. We might need some bits in skb to detect
this specific case. Anyway, I don't think it is a blocker, just need
more tests to catch some corner cases.

Thanks.

^ permalink raw reply

* Re: [PATCH] Net Driver: Add Cypress GX3 VID=04b4 PID=3610.
From: David Miller @ 2016-09-24 23:57 UTC (permalink / raw)
  To: chris.roth-/KKvz3x1pcI
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAPZMiRZLBhn93OXhHVNyPgWOTMreeN=BemCbGno=epD89cX3FA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

From: Chris Roth <chris.roth-/KKvz3x1pcI@public.gmane.org>
Date: Sat, 24 Sep 2016 10:59:04 -0600

> Due to my lack of familiarity with the how git send-email works, I've
> unintentionally had my name listed as the first 'from' whereas I
> intended Allan Chou to be listed as the first 'from' in the patch. If
> anyone can correct this on my behalf, I would appreciate it.

Simply just submit a 'v2' of this patch with this issue fixed.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 4/4] net/sched: act_mirred: Implement ingress actions
From: Cong Wang @ 2016-09-24 23:50 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: David S. Miller, Jamal Hadi Salim, Eric Dumazet,
	Linux Kernel Network Developers, Shmulik Ladkani
In-Reply-To: <1474550512-7552-5-git-send-email-shmulik.ladkani@gmail.com>

On Thu, Sep 22, 2016 at 6:21 AM, Shmulik Ladkani
<shmulik.ladkani@ravellosystems.com> wrote:
> From: Shmulik Ladkani <shmulik.ladkani@gmail.com>
>
> Up until now, 'action mirred' supported only egress actions (either
> TCA_EGRESS_REDIR or TCA_EGRESS_MIRROR).
>
> This patch implements the corresponding ingress actions
> TCA_INGRESS_REDIR and TCA_INGRESS_MIRROR.
>
> This allows attaching filters whose target is to hand matching skbs into
> the rx processing of a specified device.

I like this idea, this idea actually came to my mind when we
tried to redirect packets from eth0 to veth device in containers.
I remember I already brought this up to Jamal before (either personally
or publicly), but I forgot why I stopped.

This would reduce some latency for our Mesos containers, we
would just skip one veth TX in our scenario.

^ permalink raw reply

* [PATCH net-next 8/8] rxrpc: Implement slow-start
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

Implement RxRPC slow-start, which is similar to RFC 5681 for TCP.  A
tracepoint is added to log the state of the congestion management algorithm
and the decisions it makes.

Notes:

 (1) Since we send fixed-size DATA packets (apart from the final packet in
     each phase), counters and calculations are in terms of packets rather
     than bytes.

 (2) The ACK packet carries the equivalent of TCP SACK.

 (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly
     suited to SACK of a small number of packets.  It seems that, almost
     inevitably, by the time three 'duplicate' ACKs have been seen, we have
     narrowed the loss down to one or two missing packets, and the
     FLIGHT_SIZE calculation ends up as 2.

 (4) In rxrpc_resend(), if there was no data that apparently needed
     retransmission, we transmit a PING ACK to ask the peer to tell us what
     its Rx window state is.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/trace/events/rxrpc.h |   45 +++++++++++
 net/rxrpc/ar-internal.h      |   53 +++++++++++++
 net/rxrpc/call_event.c       |   36 ++++++++-
 net/rxrpc/call_object.c      |   13 +++
 net/rxrpc/conn_event.c       |    1 
 net/rxrpc/input.c            |  169 +++++++++++++++++++++++++++++++++++++++++-
 net/rxrpc/misc.c             |   19 +++++
 net/rxrpc/output.c           |    9 ++
 net/rxrpc/sendmsg.c          |    7 +-
 9 files changed, 339 insertions(+), 13 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index 56475497043d..ada12d00118c 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -570,6 +570,51 @@ TRACE_EVENT(rxrpc_retransmit,
 		      __entry->expiry)
 	    );
 
+TRACE_EVENT(rxrpc_congest,
+	    TP_PROTO(struct rxrpc_call *call, struct rxrpc_ack_summary *summary,
+		     rxrpc_serial_t ack_serial, enum rxrpc_congest_change change),
+
+	    TP_ARGS(call, summary, ack_serial, change),
+
+	    TP_STRUCT__entry(
+		    __field(struct rxrpc_call *,		call		)
+		    __field(enum rxrpc_congest_change,		change		)
+		    __field(rxrpc_seq_t,			hard_ack	)
+		    __field(rxrpc_seq_t,			top		)
+		    __field(rxrpc_seq_t,			lowest_nak	)
+		    __field(rxrpc_serial_t,			ack_serial	)
+		    __field_struct(struct rxrpc_ack_summary,	sum		)
+			     ),
+
+	    TP_fast_assign(
+		    __entry->call	= call;
+		    __entry->change	= change;
+		    __entry->hard_ack	= call->tx_hard_ack;
+		    __entry->top	= call->tx_top;
+		    __entry->lowest_nak	= call->acks_lowest_nak;
+		    __entry->ack_serial	= ack_serial;
+		    memcpy(&__entry->sum, summary, sizeof(__entry->sum));
+			   ),
+
+	    TP_printk("c=%p %08x %s %08x %s cw=%u ss=%u nr=%u,%u nw=%u,%u r=%u b=%u u=%u d=%u l=%x%s%s%s",
+		      __entry->call,
+		      __entry->ack_serial,
+		      rxrpc_ack_names[__entry->sum.ack_reason],
+		      __entry->hard_ack,
+		      rxrpc_congest_modes[__entry->sum.mode],
+		      __entry->sum.cwnd,
+		      __entry->sum.ssthresh,
+		      __entry->sum.nr_acks, __entry->sum.nr_nacks,
+		      __entry->sum.nr_new_acks, __entry->sum.nr_new_nacks,
+		      __entry->sum.nr_rot_new_acks,
+		      __entry->top - __entry->hard_ack,
+		      __entry->sum.cumulative_acks,
+		      __entry->sum.dup_acks,
+		      __entry->lowest_nak, __entry->sum.new_low_nack ? "!" : "",
+		      rxrpc_congest_changes[__entry->change],
+		      __entry->sum.retrans_timeo ? " rTxTo" : "")
+	    );
+
 #endif /* _TRACE_RXRPC_H */
 
 /* This part must be outside protection */
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index b1e697fc9ffb..ca96e547cb9a 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -402,6 +402,7 @@ enum rxrpc_call_flag {
 	RXRPC_CALL_RX_LAST,		/* Received the last packet (at rxtx_top) */
 	RXRPC_CALL_TX_LAST,		/* Last packet in Tx buffer (at rxtx_top) */
 	RXRPC_CALL_PINGING,		/* Ping in process */
+	RXRPC_CALL_RETRANS_TIMEOUT,	/* Retransmission due to timeout occurred */
 };
 
 /*
@@ -447,6 +448,17 @@ enum rxrpc_call_completion {
 };
 
 /*
+ * Call Tx congestion management modes.
+ */
+enum rxrpc_congest_mode {
+	RXRPC_CALL_SLOW_START,
+	RXRPC_CALL_CONGEST_AVOIDANCE,
+	RXRPC_CALL_PACKET_LOSS,
+	RXRPC_CALL_FAST_RETRANSMIT,
+	NR__RXRPC_CONGEST_MODES
+};
+
+/*
  * RxRPC call definition
  * - matched by { connection, call_id }
  */
@@ -518,6 +530,20 @@ struct rxrpc_call {
 						 * not hard-ACK'd packet follows this.
 						 */
 	rxrpc_seq_t		tx_top;		/* Highest Tx slot allocated. */
+
+	/* TCP-style slow-start congestion control [RFC5681].  Since the SMSS
+	 * is fixed, we keep these numbers in terms of segments (ie. DATA
+	 * packets) rather than bytes.
+	 */
+#define RXRPC_TX_SMSS		RXRPC_JUMBO_DATALEN
+	u8			cong_cwnd;	/* Congestion window size */
+	u8			cong_extra;	/* Extra to send for congestion management */
+	u8			cong_ssthresh;	/* Slow-start threshold */
+	enum rxrpc_congest_mode	cong_mode:8;	/* Congestion management mode */
+	u8			cong_dup_acks;	/* Count of ACKs showing missing packets */
+	u8			cong_cumul_acks; /* Cumulative ACK count */
+	ktime_t			cong_tstamp;	/* Last time cwnd was changed */
+
 	rxrpc_seq_t		rx_hard_ack;	/* Dead slot in buffer; the first received but not
 						 * consumed packet follows this.
 						 */
@@ -539,12 +565,13 @@ struct rxrpc_call {
 	ktime_t			ackr_ping_time;	/* Time last ping sent */
 
 	/* transmission-phase ACK management */
+	ktime_t			acks_latest_ts;	/* Timestamp of latest ACK received */
 	rxrpc_serial_t		acks_latest;	/* serial number of latest ACK received */
 	rxrpc_seq_t		acks_lowest_nak; /* Lowest NACK in the buffer (or ==tx_hard_ack) */
 };
 
 /*
- * Summary of a new ACK and the changes it made.
+ * Summary of a new ACK and the changes it made to the Tx buffer packet states.
  */
 struct rxrpc_ack_summary {
 	u8			ack_reason;
@@ -554,6 +581,14 @@ struct rxrpc_ack_summary {
 	u8			nr_new_nacks;		/* Number of new NACKs in packet */
 	u8			nr_rot_new_acks;	/* Number of rotated new ACKs */
 	bool			new_low_nack;		/* T if new low NACK found */
+	bool			retrans_timeo;		/* T if reTx due to timeout happened */
+	u8			flight_size;		/* Number of unreceived transmissions */
+	/* Place to stash values for tracing */
+	enum rxrpc_congest_mode	mode:8;
+	u8			cwnd;
+	u8			ssthresh;
+	u8			dup_acks;
+	u8			cumulative_acks;
 };
 
 enum rxrpc_skb_trace {
@@ -709,6 +744,7 @@ extern const char rxrpc_timer_traces[rxrpc_timer__nr_trace][8];
 enum rxrpc_propose_ack_trace {
 	rxrpc_propose_ack_client_tx_end,
 	rxrpc_propose_ack_input_data,
+	rxrpc_propose_ack_ping_for_lost_ack,
 	rxrpc_propose_ack_ping_for_lost_reply,
 	rxrpc_propose_ack_ping_for_params,
 	rxrpc_propose_ack_respond_to_ack,
@@ -729,6 +765,21 @@ enum rxrpc_propose_ack_outcome {
 extern const char rxrpc_propose_ack_traces[rxrpc_propose_ack__nr_trace][8];
 extern const char *const rxrpc_propose_ack_outcomes[rxrpc_propose_ack__nr_outcomes];
 
+enum rxrpc_congest_change {
+	rxrpc_cong_begin_retransmission,
+	rxrpc_cong_cleared_nacks,
+	rxrpc_cong_new_low_nack,
+	rxrpc_cong_no_change,
+	rxrpc_cong_progress,
+	rxrpc_cong_retransmit_again,
+	rxrpc_cong_rtt_window_end,
+	rxrpc_cong_saw_nack,
+	rxrpc_congest__nr_change
+};
+
+extern const char rxrpc_congest_modes[NR__RXRPC_CONGEST_MODES][10];
+extern const char rxrpc_congest_changes[rxrpc_congest__nr_change][9];
+
 extern const char *const rxrpc_pkts[];
 extern const char const rxrpc_ack_names[RXRPC_ACK__INVALID + 1][4];
 
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index 05b94d1acf52..0e8478012212 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -147,6 +147,14 @@ void rxrpc_propose_ACK(struct rxrpc_call *call, u8 ack_reason,
 }
 
 /*
+ * Handle congestion being detected by the retransmit timeout.
+ */
+static void rxrpc_congestion_timeout(struct rxrpc_call *call)
+{
+	set_bit(RXRPC_CALL_RETRANS_TIMEOUT, &call->flags);
+}
+
+/*
  * Perform retransmission of NAK'd and unack'd packets.
  */
 static void rxrpc_resend(struct rxrpc_call *call)
@@ -154,9 +162,9 @@ static void rxrpc_resend(struct rxrpc_call *call)
 	struct rxrpc_skb_priv *sp;
 	struct sk_buff *skb;
 	rxrpc_seq_t cursor, seq, top;
-	ktime_t now = ktime_get_real(), max_age, oldest,  resend_at;
+	ktime_t now = ktime_get_real(), max_age, oldest, resend_at, ack_ts;
 	int ix;
-	u8 annotation, anno_type;
+	u8 annotation, anno_type, retrans = 0, unacked = 0;
 
 	_enter("{%d,%d}", call->tx_hard_ack, call->tx_top);
 
@@ -193,10 +201,13 @@ static void rxrpc_resend(struct rxrpc_call *call)
 					oldest = skb->tstamp;
 				continue;
 			}
+			if (!(annotation & RXRPC_TX_ANNO_RESENT))
+				unacked++;
 		}
 
 		/* Okay, we need to retransmit a packet. */
 		call->rxtx_annotations[ix] = RXRPC_TX_ANNO_RETRANS | annotation;
+		retrans++;
 		trace_rxrpc_retransmit(call, seq, annotation | anno_type,
 				       ktime_to_ns(ktime_sub(skb->tstamp, max_age)));
 	}
@@ -210,6 +221,25 @@ static void rxrpc_resend(struct rxrpc_call *call)
 		    * reached the nsec timeout yet.
 		    */
 
+	if (unacked)
+		rxrpc_congestion_timeout(call);
+
+	/* If there was nothing that needed retransmission then it's likely
+	 * that an ACK got lost somewhere.  Send a ping to find out instead of
+	 * retransmitting data.
+	 */
+	if (!retrans) {
+		rxrpc_set_timer(call, rxrpc_timer_set_for_resend);
+		spin_unlock_bh(&call->lock);
+		ack_ts = ktime_sub(now, call->acks_latest_ts);
+		if (ktime_to_ns(ack_ts) < call->peer->rtt)
+			goto out;
+		rxrpc_propose_ACK(call, RXRPC_ACK_PING, 0, 0, true, false,
+				  rxrpc_propose_ack_ping_for_lost_ack);
+		rxrpc_send_call_packet(call, RXRPC_PACKET_TYPE_ACK);
+		goto out;
+	}
+
 	/* Now go through the Tx window and perform the retransmissions.  We
 	 * have to drop the lock for each send.  If an ACK comes in whilst the
 	 * lock is dropped, it may clear some of the retransmission markers for
@@ -260,6 +290,7 @@ static void rxrpc_resend(struct rxrpc_call *call)
 
 out_unlock:
 	spin_unlock_bh(&call->lock);
+out:
 	_leave("");
 }
 
@@ -293,6 +324,7 @@ recheck_state:
 	if (time_after_eq(now, call->expire_at)) {
 		rxrpc_abort_call("EXP", call, 0, RX_CALL_TIMEOUT, ETIME);
 		set_bit(RXRPC_CALL_EV_ABORT, &call->events);
+		goto recheck_state;
 	}
 
 	if (test_and_clear_bit(RXRPC_CALL_EV_ACK, &call->events) ||
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index a53f4c2c0025..d4b3293b78fa 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -160,6 +160,14 @@ struct rxrpc_call *rxrpc_alloc_call(gfp_t gfp)
 	call->rx_winsize = rxrpc_rx_window_size;
 	call->tx_winsize = 16;
 	call->rx_expect_next = 1;
+
+	if (RXRPC_TX_SMSS > 2190)
+		call->cong_cwnd = 2;
+	else if (RXRPC_TX_SMSS > 1095)
+		call->cong_cwnd = 3;
+	else
+		call->cong_cwnd = 4;
+	call->cong_ssthresh = RXRPC_RXTX_BUFF_SIZE - 1;
 	return call;
 
 nomem_2:
@@ -176,6 +184,7 @@ static struct rxrpc_call *rxrpc_alloc_client_call(struct sockaddr_rxrpc *srx,
 						  gfp_t gfp)
 {
 	struct rxrpc_call *call;
+	ktime_t now;
 
 	_enter("");
 
@@ -185,6 +194,9 @@ static struct rxrpc_call *rxrpc_alloc_client_call(struct sockaddr_rxrpc *srx,
 	call->state = RXRPC_CALL_CLIENT_AWAIT_CONN;
 	call->service_id = srx->srx_service;
 	call->tx_phase = true;
+	now = ktime_get_real();
+	call->acks_latest_ts = now;
+	call->cong_tstamp = now;
 
 	_leave(" = %p", call);
 	return call;
@@ -325,6 +337,7 @@ void rxrpc_incoming_call(struct rxrpc_sock *rx,
 	call->state		= RXRPC_CALL_SERVER_ACCEPTING;
 	if (sp->hdr.securityIndex > 0)
 		call->state	= RXRPC_CALL_SERVER_SECURING;
+	call->cong_tstamp	= skb->tstamp;
 
 	/* Set the channel for this call.  We don't get channel_lock as we're
 	 * only defending against the data_ready handler (which we're called
diff --git a/net/rxrpc/conn_event.c b/net/rxrpc/conn_event.c
index a1cf1ec5f29e..37609ce89f52 100644
--- a/net/rxrpc/conn_event.c
+++ b/net/rxrpc/conn_event.c
@@ -97,6 +97,7 @@ static void rxrpc_conn_retransmit_call(struct rxrpc_connection *conn,
 		pkt.info.maxMTU		= htonl(mtu);
 		pkt.info.rwind		= htonl(rxrpc_rx_window_size);
 		pkt.info.jumbo_max	= htonl(rxrpc_rx_jumbo_max);
+		pkt.whdr.flags		|= RXRPC_SLOW_START_OK;
 		len += sizeof(pkt.ack) + sizeof(pkt.info);
 		break;
 	}
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 0344f4494eb7..094720dd1eaf 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -37,6 +37,166 @@ static void rxrpc_proto_abort(const char *why,
 }
 
 /*
+ * Do TCP-style congestion management [RFC 5681].
+ */
+static void rxrpc_congestion_management(struct rxrpc_call *call,
+					struct sk_buff *skb,
+					struct rxrpc_ack_summary *summary)
+{
+	enum rxrpc_congest_change change = rxrpc_cong_no_change;
+	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
+	unsigned int cumulative_acks = call->cong_cumul_acks;
+	unsigned int cwnd = call->cong_cwnd;
+	bool resend = false;
+
+	summary->flight_size =
+		(call->tx_top - call->tx_hard_ack) - summary->nr_acks;
+
+	if (test_and_clear_bit(RXRPC_CALL_RETRANS_TIMEOUT, &call->flags)) {
+		summary->retrans_timeo = true;
+		call->cong_ssthresh = max_t(unsigned int,
+					    summary->flight_size / 2, 2);
+		cwnd = 1;
+		if (cwnd > call->cong_ssthresh &&
+		    call->cong_mode == RXRPC_CALL_SLOW_START) {
+			call->cong_mode = RXRPC_CALL_CONGEST_AVOIDANCE;
+			call->cong_tstamp = skb->tstamp;
+			cumulative_acks = 0;
+		}
+	}
+
+	cumulative_acks += summary->nr_new_acks;
+	cumulative_acks += summary->nr_rot_new_acks;
+	if (cumulative_acks > 255)
+		cumulative_acks = 255;
+
+	summary->mode = call->cong_mode;
+	summary->cwnd = call->cong_cwnd;
+	summary->ssthresh = call->cong_ssthresh;
+	summary->cumulative_acks = cumulative_acks;
+	summary->dup_acks = call->cong_dup_acks;
+
+	switch (call->cong_mode) {
+	case RXRPC_CALL_SLOW_START:
+		if (summary->nr_nacks > 0)
+			goto packet_loss_detected;
+		if (summary->cumulative_acks > 0)
+			cwnd += 1;
+		if (cwnd > call->cong_ssthresh) {
+			call->cong_mode = RXRPC_CALL_CONGEST_AVOIDANCE;
+			call->cong_tstamp = skb->tstamp;
+		}
+		goto out;
+
+	case RXRPC_CALL_CONGEST_AVOIDANCE:
+		if (summary->nr_nacks > 0)
+			goto packet_loss_detected;
+
+		/* We analyse the number of packets that get ACK'd per RTT
+		 * period and increase the window if we managed to fill it.
+		 */
+		if (call->peer->rtt_usage == 0)
+			goto out;
+		if (ktime_before(skb->tstamp,
+				 ktime_add_ns(call->cong_tstamp,
+					      call->peer->rtt)))
+			goto out_no_clear_ca;
+		change = rxrpc_cong_rtt_window_end;
+		call->cong_tstamp = skb->tstamp;
+		if (cumulative_acks >= cwnd)
+			cwnd++;
+		goto out;
+
+	case RXRPC_CALL_PACKET_LOSS:
+		if (summary->nr_nacks == 0)
+			goto resume_normality;
+
+		if (summary->new_low_nack) {
+			change = rxrpc_cong_new_low_nack;
+			call->cong_dup_acks = 1;
+			if (call->cong_extra > 1)
+				call->cong_extra = 1;
+			goto send_extra_data;
+		}
+
+		call->cong_dup_acks++;
+		if (call->cong_dup_acks < 3)
+			goto send_extra_data;
+
+		change = rxrpc_cong_begin_retransmission;
+		call->cong_mode = RXRPC_CALL_FAST_RETRANSMIT;
+		call->cong_ssthresh = max_t(unsigned int,
+					    summary->flight_size / 2, 2);
+		cwnd = call->cong_ssthresh + 3;
+		call->cong_extra = 0;
+		call->cong_dup_acks = 0;
+		resend = true;
+		goto out;
+
+	case RXRPC_CALL_FAST_RETRANSMIT:
+		if (!summary->new_low_nack) {
+			if (summary->nr_new_acks == 0)
+				cwnd += 1;
+			call->cong_dup_acks++;
+			if (call->cong_dup_acks == 2) {
+				change = rxrpc_cong_retransmit_again;
+				call->cong_dup_acks = 0;
+				resend = true;
+			}
+		} else {
+			change = rxrpc_cong_progress;
+			cwnd = call->cong_ssthresh;
+			if (summary->nr_nacks == 0)
+				goto resume_normality;
+		}
+		goto out;
+
+	default:
+		BUG();
+		goto out;
+	}
+
+resume_normality:
+	change = rxrpc_cong_cleared_nacks;
+	call->cong_dup_acks = 0;
+	call->cong_extra = 0;
+	call->cong_tstamp = skb->tstamp;
+	if (cwnd <= call->cong_ssthresh)
+		call->cong_mode = RXRPC_CALL_SLOW_START;
+	else
+		call->cong_mode = RXRPC_CALL_CONGEST_AVOIDANCE;
+out:
+	cumulative_acks = 0;
+out_no_clear_ca:
+	if (cwnd >= RXRPC_RXTX_BUFF_SIZE - 1)
+		cwnd = RXRPC_RXTX_BUFF_SIZE - 1;
+	call->cong_cwnd = cwnd;
+	call->cong_cumul_acks = cumulative_acks;
+	trace_rxrpc_congest(call, summary, sp->hdr.serial, change);
+	if (resend && !test_and_set_bit(RXRPC_CALL_EV_RESEND, &call->events))
+		rxrpc_queue_call(call);
+	return;
+
+packet_loss_detected:
+	change = rxrpc_cong_saw_nack;
+	call->cong_mode = RXRPC_CALL_PACKET_LOSS;
+	call->cong_dup_acks = 0;
+	goto send_extra_data;
+
+send_extra_data:
+	/* Send some previously unsent DATA if we have some to advance the ACK
+	 * state.
+	 */
+	if (call->rxtx_annotations[call->tx_top & RXRPC_RXTX_BUFF_MASK] &
+	    RXRPC_TX_ANNO_LAST ||
+	    summary->nr_acks != call->tx_top - call->tx_hard_ack) {
+		call->cong_extra++;
+		wake_up(&call->waitq);
+	}
+	goto out_no_clear_ca;
+}
+
+/*
  * Ping the other end to fill our RTT cache and to retrieve the rwind
  * and MTU parameters.
  */
@@ -524,7 +684,6 @@ static void rxrpc_input_soft_acks(struct rxrpc_call *call, u8 *acks,
 				  rxrpc_seq_t seq, int nr_acks,
 				  struct rxrpc_ack_summary *summary)
 {
-	bool resend = false;
 	int ix;
 	u8 annotation, anno_type;
 
@@ -556,16 +715,11 @@ static void rxrpc_input_soft_acks(struct rxrpc_call *call, u8 *acks,
 				continue;
 			call->rxtx_annotations[ix] =
 				RXRPC_TX_ANNO_NAK | annotation;
-			resend = true;
 			break;
 		default:
 			return rxrpc_proto_abort("SFT", call, 0);
 		}
 	}
-
-	if (resend &&
-	    !test_and_set_bit(RXRPC_CALL_EV_RESEND, &call->events))
-		rxrpc_queue_call(call);
 }
 
 /*
@@ -663,6 +817,7 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 		       sp->hdr.serial, call->acks_latest);
 		return;
 	}
+	call->acks_latest_ts = skb->tstamp;
 	call->acks_latest = sp->hdr.serial;
 
 	if (before(hard_ack, call->tx_hard_ack) ||
@@ -692,6 +847,8 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 		rxrpc_propose_ACK(call, RXRPC_ACK_PING, skew, sp->hdr.serial,
 				  false, true,
 				  rxrpc_propose_ack_ping_for_lost_reply);
+
+	return rxrpc_congestion_management(call, skb, &summary);
 }
 
 /*
diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c
index a608769343e6..aedb8978226d 100644
--- a/net/rxrpc/misc.c
+++ b/net/rxrpc/misc.c
@@ -200,6 +200,7 @@ const char rxrpc_timer_traces[rxrpc_timer__nr_trace][8] = {
 const char rxrpc_propose_ack_traces[rxrpc_propose_ack__nr_trace][8] = {
 	[rxrpc_propose_ack_client_tx_end]	= "ClTxEnd",
 	[rxrpc_propose_ack_input_data]		= "DataIn ",
+	[rxrpc_propose_ack_ping_for_lost_ack]	= "LostAck",
 	[rxrpc_propose_ack_ping_for_lost_reply]	= "LostRpl",
 	[rxrpc_propose_ack_ping_for_params]	= "Params ",
 	[rxrpc_propose_ack_respond_to_ack]	= "Rsp2Ack",
@@ -214,3 +215,21 @@ const char *const rxrpc_propose_ack_outcomes[rxrpc_propose_ack__nr_outcomes] = {
 	[rxrpc_propose_ack_update]		= " Update",
 	[rxrpc_propose_ack_subsume]		= " Subsume",
 };
+
+const char rxrpc_congest_modes[NR__RXRPC_CONGEST_MODES][10] = {
+	[RXRPC_CALL_SLOW_START]		= "SlowStart",
+	[RXRPC_CALL_CONGEST_AVOIDANCE]	= "CongAvoid",
+	[RXRPC_CALL_PACKET_LOSS]	= "PktLoss  ",
+	[RXRPC_CALL_FAST_RETRANSMIT]	= "FastReTx ",
+};
+
+const char rxrpc_congest_changes[rxrpc_congest__nr_change][9] = {
+	[rxrpc_cong_begin_retransmission]	= " Retrans",
+	[rxrpc_cong_cleared_nacks]		= " Cleared",
+	[rxrpc_cong_new_low_nack]		= " NewLowN",
+	[rxrpc_cong_no_change]			= "",
+	[rxrpc_cong_progress]			= " Progres",
+	[rxrpc_cong_retransmit_again]		= " ReTxAgn",
+	[rxrpc_cong_rtt_window_end]		= " RttWinE",
+	[rxrpc_cong_saw_nack]			= " SawNack",
+};
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 3eb01445e814..cf43a715685e 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -157,6 +157,8 @@ int rxrpc_send_call_packet(struct rxrpc_call *call, u8 type)
 		spin_unlock_bh(&call->lock);
 
 
+		pkt->whdr.flags |= RXRPC_SLOW_START_OK;
+
 		iov[0].iov_len += sizeof(pkt->ack) + n;
 		iov[1].iov_base = &pkt->ackinfo;
 		iov[1].iov_len	= sizeof(pkt->ackinfo);
@@ -276,8 +278,11 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb)
 	msg.msg_controllen = 0;
 	msg.msg_flags = 0;
 
-	/* If our RTT cache needs working on, request an ACK. */
-	if ((call->peer->rtt_usage < 3 && sp->hdr.seq & 1) ||
+	/* If our RTT cache needs working on, request an ACK.  Also request
+	 * ACKs if a DATA packet appears to have been lost.
+	 */
+	if (call->cong_mode == RXRPC_CALL_FAST_RETRANSMIT ||
+	    (call->peer->rtt_usage < 3 && sp->hdr.seq & 1) ||
 	    ktime_before(ktime_add_ms(call->peer->rtt_last_req, 1000),
 			 ktime_get_real()))
 		whdr.flags |= RXRPC_REQUEST_ACK;
diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c
index 99939372b5a4..1f8040d82395 100644
--- a/net/rxrpc/sendmsg.c
+++ b/net/rxrpc/sendmsg.c
@@ -45,7 +45,9 @@ static int rxrpc_wait_for_tx_window(struct rxrpc_sock *rx,
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		ret = 0;
-		if (call->tx_top - call->tx_hard_ack < call->tx_winsize)
+		if (call->tx_top - call->tx_hard_ack <
+		    min_t(unsigned int, call->tx_winsize,
+			  call->cong_cwnd + call->cong_extra))
 			break;
 		if (call->state >= RXRPC_CALL_COMPLETE) {
 			ret = -call->error;
@@ -203,7 +205,8 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
 			_debug("alloc");
 
 			if (call->tx_top - call->tx_hard_ack >=
-			    call->tx_winsize) {
+			    min_t(unsigned int, call->tx_winsize,
+				  call->cong_cwnd + call->cong_extra)) {
 				ret = -EAGAIN;
 				if (msg->msg_flags & MSG_DONTWAIT)
 					goto maybe_error;

^ permalink raw reply related

* [PATCH net-next 7/8] rxrpc: Schedule an ACK if the reply to a client call appears overdue
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

If we've sent all the request data in a client call but haven't seen any
sign of the reply data yet, schedule an ACK to be sent to the server to
find out if the reply data got lost.

If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to
demand a response to find out whether we need to retransmit.

If the server says it has received all of the data, we send an IDLE ACK to
tell the server that we haven't received anything in the receive phase as
yet.

To make this work, a non-immediate PING ACK must carry a delay.  I've chosen
the same as the IDLE ACK for the moment.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-internal.h |    2 ++
 net/rxrpc/call_event.c  |    1 +
 net/rxrpc/input.c       |    8 ++++++++
 net/rxrpc/misc.c        |    2 ++
 4 files changed, 13 insertions(+)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 1a700b6a998b..b1e697fc9ffb 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -707,7 +707,9 @@ enum rxrpc_timer_trace {
 extern const char rxrpc_timer_traces[rxrpc_timer__nr_trace][8];
 
 enum rxrpc_propose_ack_trace {
+	rxrpc_propose_ack_client_tx_end,
 	rxrpc_propose_ack_input_data,
+	rxrpc_propose_ack_ping_for_lost_reply,
 	rxrpc_propose_ack_ping_for_params,
 	rxrpc_propose_ack_respond_to_ack,
 	rxrpc_propose_ack_respond_to_ping,
diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index d5bf9ce7ec6f..05b94d1acf52 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -100,6 +100,7 @@ static void __rxrpc_propose_ACK(struct rxrpc_call *call, u8 ack_reason,
 			expiry = rxrpc_soft_ack_delay;
 		break;
 
+	case RXRPC_ACK_PING:
 	case RXRPC_ACK_IDLE:
 		if (rxrpc_idle_ack_delay < expiry)
 			expiry = rxrpc_idle_ack_delay;
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index dd699667eeef..0344f4494eb7 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -138,6 +138,8 @@ static bool rxrpc_end_tx_phase(struct rxrpc_call *call, bool reply_begun,
 
 	write_unlock(&call->state_lock);
 	if (call->state == RXRPC_CALL_CLIENT_AWAIT_REPLY) {
+		rxrpc_propose_ACK(call, RXRPC_ACK_IDLE, 0, 0, false, true,
+				  rxrpc_propose_ack_client_tx_end);
 		trace_rxrpc_transmit(call, rxrpc_transmit_await_reply);
 	} else {
 		trace_rxrpc_transmit(call, rxrpc_transmit_end);
@@ -684,6 +686,12 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 		return;
 	}
 
+	if (call->rxtx_annotations[call->tx_top & RXRPC_RXTX_BUFF_MASK] &
+	    RXRPC_TX_ANNO_LAST &&
+	    summary.nr_acks == call->tx_top - hard_ack)
+		rxrpc_propose_ACK(call, RXRPC_ACK_PING, skew, sp->hdr.serial,
+				  false, true,
+				  rxrpc_propose_ack_ping_for_lost_reply);
 }
 
 /*
diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c
index 901c012a2700..a608769343e6 100644
--- a/net/rxrpc/misc.c
+++ b/net/rxrpc/misc.c
@@ -198,7 +198,9 @@ const char rxrpc_timer_traces[rxrpc_timer__nr_trace][8] = {
 };
 
 const char rxrpc_propose_ack_traces[rxrpc_propose_ack__nr_trace][8] = {
+	[rxrpc_propose_ack_client_tx_end]	= "ClTxEnd",
 	[rxrpc_propose_ack_input_data]		= "DataIn ",
+	[rxrpc_propose_ack_ping_for_lost_reply]	= "LostRpl",
 	[rxrpc_propose_ack_ping_for_params]	= "Params ",
 	[rxrpc_propose_ack_respond_to_ack]	= "Rsp2Ack",
 	[rxrpc_propose_ack_respond_to_ping]	= "Rsp2Png",

^ permalink raw reply related

* [PATCH net-next 6/8] rxrpc: Generate a summary of the ACK state for later use
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

Generate a summary of the Tx buffer packet state when an ACK is received
for use in a later patch that does congestion management.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-internal.h |   14 ++++++++++++++
 net/rxrpc/input.c       |   45 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index cdd35e2b40ba..1a700b6a998b 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -540,6 +540,20 @@ struct rxrpc_call {
 
 	/* transmission-phase ACK management */
 	rxrpc_serial_t		acks_latest;	/* serial number of latest ACK received */
+	rxrpc_seq_t		acks_lowest_nak; /* Lowest NACK in the buffer (or ==tx_hard_ack) */
+};
+
+/*
+ * Summary of a new ACK and the changes it made.
+ */
+struct rxrpc_ack_summary {
+	u8			ack_reason;
+	u8			nr_acks;		/* Number of ACKs in packet */
+	u8			nr_nacks;		/* Number of NACKs in packet */
+	u8			nr_new_acks;		/* Number of new ACKs in packet */
+	u8			nr_new_nacks;		/* Number of new NACKs in packet */
+	u8			nr_rot_new_acks;	/* Number of rotated new ACKs */
+	bool			new_low_nack;		/* T if new low NACK found */
 };
 
 enum rxrpc_skb_trace {
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index bda11eb2ab2a..dd699667eeef 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -56,12 +56,20 @@ static void rxrpc_send_ping(struct rxrpc_call *call, struct sk_buff *skb,
 /*
  * Apply a hard ACK by advancing the Tx window.
  */
-static void rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to)
+static void rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to,
+				   struct rxrpc_ack_summary *summary)
 {
 	struct sk_buff *skb, *list = NULL;
 	int ix;
 	u8 annotation;
 
+	if (call->acks_lowest_nak == call->tx_hard_ack) {
+		call->acks_lowest_nak = to;
+	} else if (before_eq(call->acks_lowest_nak, to)) {
+		summary->new_low_nack = true;
+		call->acks_lowest_nak = to;
+	}
+
 	spin_lock(&call->lock);
 
 	while (before(call->tx_hard_ack, to)) {
@@ -77,6 +85,8 @@ static void rxrpc_rotate_tx_window(struct rxrpc_call *call, rxrpc_seq_t to)
 
 		if (annotation & RXRPC_TX_ANNO_LAST)
 			set_bit(RXRPC_CALL_TX_LAST, &call->flags);
+		if ((annotation & RXRPC_TX_ANNO_MASK) != RXRPC_TX_ANNO_ACK)
+			summary->nr_rot_new_acks++;
 	}
 
 	spin_unlock(&call->lock);
@@ -147,6 +157,7 @@ bad_state:
  */
 static bool rxrpc_receiving_reply(struct rxrpc_call *call)
 {
+	struct rxrpc_ack_summary summary = { 0 };
 	rxrpc_seq_t top = READ_ONCE(call->tx_top);
 
 	if (call->ackr_reason) {
@@ -159,7 +170,7 @@ static bool rxrpc_receiving_reply(struct rxrpc_call *call)
 	}
 
 	if (!test_bit(RXRPC_CALL_TX_LAST, &call->flags))
-		rxrpc_rotate_tx_window(call, top);
+		rxrpc_rotate_tx_window(call, top, &summary);
 	if (!test_bit(RXRPC_CALL_TX_LAST, &call->flags)) {
 		rxrpc_proto_abort("TXL", call, top);
 		return false;
@@ -508,7 +519,8 @@ static void rxrpc_input_ackinfo(struct rxrpc_call *call, struct sk_buff *skb,
  * the time the ACK was sent.
  */
 static void rxrpc_input_soft_acks(struct rxrpc_call *call, u8 *acks,
-				  rxrpc_seq_t seq, int nr_acks)
+				  rxrpc_seq_t seq, int nr_acks,
+				  struct rxrpc_ack_summary *summary)
 {
 	bool resend = false;
 	int ix;
@@ -521,14 +533,23 @@ static void rxrpc_input_soft_acks(struct rxrpc_call *call, u8 *acks,
 		annotation &= ~RXRPC_TX_ANNO_MASK;
 		switch (*acks++) {
 		case RXRPC_ACK_TYPE_ACK:
+			summary->nr_acks++;
 			if (anno_type == RXRPC_TX_ANNO_ACK)
 				continue;
+			summary->nr_new_acks++;
 			call->rxtx_annotations[ix] =
 				RXRPC_TX_ANNO_ACK | annotation;
 			break;
 		case RXRPC_ACK_TYPE_NACK:
+			if (!summary->nr_nacks &&
+			    call->acks_lowest_nak != seq) {
+				call->acks_lowest_nak = seq;
+				summary->new_low_nack = true;
+			}
+			summary->nr_nacks++;
 			if (anno_type == RXRPC_TX_ANNO_NAK)
 				continue;
+			summary->nr_new_nacks++;
 			if (anno_type == RXRPC_TX_ANNO_RETRANS)
 				continue;
 			call->rxtx_annotations[ix] =
@@ -558,7 +579,7 @@ static void rxrpc_input_soft_acks(struct rxrpc_call *call, u8 *acks,
 static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 			    u16 skew)
 {
-	u8 ack_reason;
+	struct rxrpc_ack_summary summary = { 0 };
 	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
 	union {
 		struct rxrpc_ackpacket ack;
@@ -581,10 +602,10 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 	first_soft_ack = ntohl(buf.ack.firstPacket);
 	hard_ack = first_soft_ack - 1;
 	nr_acks = buf.ack.nAcks;
-	ack_reason = (buf.ack.reason < RXRPC_ACK__INVALID ?
-		      buf.ack.reason : RXRPC_ACK__INVALID);
+	summary.ack_reason = (buf.ack.reason < RXRPC_ACK__INVALID ?
+			      buf.ack.reason : RXRPC_ACK__INVALID);
 
-	trace_rxrpc_rx_ack(call, first_soft_ack, ack_reason, nr_acks);
+	trace_rxrpc_rx_ack(call, first_soft_ack, summary.ack_reason, nr_acks);
 
 	_proto("Rx ACK %%%u { m=%hu f=#%u p=#%u s=%%%u r=%s n=%u }",
 	       sp->hdr.serial,
@@ -592,7 +613,7 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 	       first_soft_ack,
 	       ntohl(buf.ack.previousPacket),
 	       acked_serial,
-	       rxrpc_ack_names[ack_reason],
+	       rxrpc_ack_names[summary.ack_reason],
 	       buf.ack.nAcks);
 
 	if (buf.ack.reason == RXRPC_ACK_PING_RESPONSE)
@@ -649,12 +670,13 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
 		return rxrpc_proto_abort("AKN", call, 0);
 
 	if (after(hard_ack, call->tx_hard_ack))
-		rxrpc_rotate_tx_window(call, hard_ack);
+		rxrpc_rotate_tx_window(call, hard_ack, &summary);
 
 	if (nr_acks > 0) {
 		if (skb_copy_bits(skb, sp->offset, buf.acks, nr_acks) < 0)
 			return rxrpc_proto_abort("XSA", call, 0);
-		rxrpc_input_soft_acks(call, buf.acks, first_soft_ack, nr_acks);
+		rxrpc_input_soft_acks(call, buf.acks, first_soft_ack, nr_acks,
+				      &summary);
 	}
 
 	if (test_bit(RXRPC_CALL_TX_LAST, &call->flags)) {
@@ -669,11 +691,12 @@ static void rxrpc_input_ack(struct rxrpc_call *call, struct sk_buff *skb,
  */
 static void rxrpc_input_ackall(struct rxrpc_call *call, struct sk_buff *skb)
 {
+	struct rxrpc_ack_summary summary = { 0 };
 	struct rxrpc_skb_priv *sp = rxrpc_skb(skb);
 
 	_proto("Rx ACKALL %%%u", sp->hdr.serial);
 
-	rxrpc_rotate_tx_window(call, call->tx_top);
+	rxrpc_rotate_tx_window(call, call->tx_top, &summary);
 	if (test_bit(RXRPC_CALL_TX_LAST, &call->flags))
 		rxrpc_end_tx_phase(call, false, "ETL");
 }

^ permalink raw reply related

* [PATCH net-next 5/8] rxrpc: Delay the resend timer to allow for nsec->jiffies conv error
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

When determining the resend timer value, we have a value in nsec but the
timer is in jiffies which may be a million or more times more coarse.
nsecs_to_jiffies() rounds down - which means that the resend timeout
expressed as jiffies is very likely earlier than the one expressed as
nanoseconds from which it was derived.

The problem is that rxrpc_resend() gets triggered by the timer, but can't
then find anything to resend yet.  It sets the timer again - but gets
kicked off immediately again and again until the nanosecond-based expiry
time is reached and we actually retransmit.

Fix this by adding 1 to the jiffies-based resend_at value to counteract the
rounding and make sure that the timer happens after the nanosecond-based
expiry is passed.

Alternatives would be to adjust the timestamp on the packets to align
with the jiffie scale or to switch back to using jiffie-timestamps.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/call_event.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/rxrpc/call_event.c b/net/rxrpc/call_event.c
index a78a92fe5d77..d5bf9ce7ec6f 100644
--- a/net/rxrpc/call_event.c
+++ b/net/rxrpc/call_event.c
@@ -200,8 +200,14 @@ static void rxrpc_resend(struct rxrpc_call *call)
 				       ktime_to_ns(ktime_sub(skb->tstamp, max_age)));
 	}
 
-	resend_at = ktime_sub(ktime_add_ms(oldest, rxrpc_resend_timeout), now);
-	call->resend_at = jiffies + nsecs_to_jiffies(ktime_to_ns(resend_at));
+	resend_at = ktime_add_ms(oldest, rxrpc_resend_timeout);
+	call->resend_at = jiffies +
+		nsecs_to_jiffies(ktime_to_ns(ktime_sub(resend_at, now))) +
+		1; /* We have to make sure that the calculated jiffies value
+		    * falls at or after the nsec value, or we shall loop
+		    * ceaselessly because the timer times out, but we haven't
+		    * reached the nsec timeout yet.
+		    */
 
 	/* Now go through the Tx window and perform the retransmissions.  We
 	 * have to drop the lock for each send.  If an ACK comes in whilst the

^ permalink raw reply related

* [PATCH net-next 4/8] rxrpc: Reinitialise the call ACK and timer state for client reply phase
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

Clear the ACK reason, ACK timer and resend timer when entering the client
reply phase when the first DATA packet is received.  New ACKs will be
proposed once the data is queued.

The resend timer is no longer relevant and we need to cancel ACKs scheduled
to probe for a lost reply.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-internal.h |    1 +
 net/rxrpc/input.c       |    9 +++++++++
 net/rxrpc/misc.c        |    1 +
 3 files changed, 11 insertions(+)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index e3bf9c0e3ad1..cdd35e2b40ba 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -682,6 +682,7 @@ extern const char rxrpc_rtt_rx_traces[rxrpc_rtt_rx__nr_trace][5];
 
 enum rxrpc_timer_trace {
 	rxrpc_timer_begin,
+	rxrpc_timer_init_for_reply,
 	rxrpc_timer_expired,
 	rxrpc_timer_set_for_ack,
 	rxrpc_timer_set_for_resend,
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 757c16f033a0..bda11eb2ab2a 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -149,6 +149,15 @@ static bool rxrpc_receiving_reply(struct rxrpc_call *call)
 {
 	rxrpc_seq_t top = READ_ONCE(call->tx_top);
 
+	if (call->ackr_reason) {
+		spin_lock_bh(&call->lock);
+		call->ackr_reason = 0;
+		call->resend_at = call->expire_at;
+		call->ack_at = call->expire_at;
+		spin_unlock_bh(&call->lock);
+		rxrpc_set_timer(call, rxrpc_timer_init_for_reply);
+	}
+
 	if (!test_bit(RXRPC_CALL_TX_LAST, &call->flags))
 		rxrpc_rotate_tx_window(call, top);
 	if (!test_bit(RXRPC_CALL_TX_LAST, &call->flags)) {
diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c
index a473fd7dabaa..901c012a2700 100644
--- a/net/rxrpc/misc.c
+++ b/net/rxrpc/misc.c
@@ -191,6 +191,7 @@ const char rxrpc_rtt_rx_traces[rxrpc_rtt_rx__nr_trace][5] = {
 const char rxrpc_timer_traces[rxrpc_timer__nr_trace][8] = {
 	[rxrpc_timer_begin]			= "Begin ",
 	[rxrpc_timer_expired]			= "*EXPR*",
+	[rxrpc_timer_init_for_reply]		= "IniRpl",
 	[rxrpc_timer_set_for_ack]		= "SetAck",
 	[rxrpc_timer_set_for_send]		= "SetTx ",
 	[rxrpc_timer_set_for_resend]		= "SetRTx",

^ permalink raw reply related

* [PATCH net-next 3/8] rxrpc: Include the last reply DATA serial number in the final ACK
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

In a client call, include the serial number of the last DATA packet of the
reply in the final ACK.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/recvmsg.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index a7458c398b9e..038ae62ddb4d 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -133,7 +133,7 @@ static int rxrpc_recvmsg_new_call(struct rxrpc_sock *rx,
 /*
  * End the packet reception phase.
  */
-static void rxrpc_end_rx_phase(struct rxrpc_call *call)
+static void rxrpc_end_rx_phase(struct rxrpc_call *call, rxrpc_serial_t serial)
 {
 	_enter("%d,%s", call->debug_id, rxrpc_call_states[call->state]);
 
@@ -141,7 +141,7 @@ static void rxrpc_end_rx_phase(struct rxrpc_call *call)
 	ASSERTCMP(call->rx_hard_ack, ==, call->rx_top);
 
 	if (call->state == RXRPC_CALL_CLIENT_RECV_REPLY) {
-		rxrpc_propose_ACK(call, RXRPC_ACK_IDLE, 0, 0, true, false,
+		rxrpc_propose_ACK(call, RXRPC_ACK_IDLE, 0, serial, true, false,
 				  rxrpc_propose_ack_terminal_ack);
 		rxrpc_send_call_packet(call, RXRPC_PACKET_TYPE_ACK);
 	}
@@ -202,7 +202,7 @@ static void rxrpc_rotate_rx_window(struct rxrpc_call *call)
 	_debug("%u,%u,%02x", hard_ack, top, flags);
 	trace_rxrpc_receive(call, rxrpc_receive_rotate, serial, hard_ack);
 	if (flags & RXRPC_LAST_PACKET) {
-		rxrpc_end_rx_phase(call);
+		rxrpc_end_rx_phase(call, serial);
 	} else {
 		/* Check to see if there's an ACK that needs sending. */
 		if (after_eq(hard_ack, call->ackr_consumed + 2) ||

^ permalink raw reply related

* [PATCH net-next 2/8] rxrpc: Send an immediate ACK if we fill in a hole
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

Send an immediate ACK if we fill in a hole in the buffer left by an
out-of-sequence packet.  This may allow the congestion management in the peer
to avoid a retransmission if packets got reordered on the wire.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/input.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index 349698d87ad1..757c16f033a0 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -331,8 +331,16 @@ next_subpacket:
 	call->rxtx_annotations[ix] = annotation;
 	smp_wmb();
 	call->rxtx_buffer[ix] = skb;
-	if (after(seq, call->rx_top))
+	if (after(seq, call->rx_top)) {
 		smp_store_release(&call->rx_top, seq);
+	} else if (before(seq, call->rx_top)) {
+		/* Send an immediate ACK if we fill in a hole */
+		if (!ack) {
+			ack = RXRPC_ACK_DELAY;
+			ack_serial = serial;
+		}
+		immediate_ack = true;
+	}
 	if (flags & RXRPC_LAST_PACKET) {
 		set_bit(RXRPC_CALL_RX_LAST, &call->flags);
 		trace_rxrpc_receive(call, rxrpc_receive_queue_last, serial, seq);

^ permalink raw reply related

* [PATCH net-next 1/8] rxrpc: Send an ACK after every few DATA packets we receive
From: David Howells @ 2016-09-24 23:25 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel
In-Reply-To: <147475949656.15231.3434038490632820605.stgit@warthog.procyon.org.uk>

Send an ACK if we haven't sent one for the last two packets we've received.
This keeps the other end apprised of where we've got to - which is
important if they're doing slow-start.

We do this in recvmsg so that we can dispatch a packet directly without the
need to wake up the background thread.

This should possibly be made configurable in future.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 net/rxrpc/ar-internal.h |    3 +++
 net/rxrpc/misc.c        |    1 +
 net/rxrpc/output.c      |   25 +++++++++++++++++--------
 net/rxrpc/recvmsg.c     |   13 ++++++++++++-
 4 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 042dbcc52654..e3bf9c0e3ad1 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -533,6 +533,8 @@ struct rxrpc_call {
 	u16			ackr_skew;	/* skew on packet being ACK'd */
 	rxrpc_serial_t		ackr_serial;	/* serial of packet being ACK'd */
 	rxrpc_seq_t		ackr_prev_seq;	/* previous sequence number received */
+	rxrpc_seq_t		ackr_consumed;	/* Highest packet shown consumed */
+	rxrpc_seq_t		ackr_seen;	/* Highest packet shown seen */
 	rxrpc_serial_t		ackr_ping;	/* Last ping sent */
 	ktime_t			ackr_ping_time;	/* Time last ping sent */
 
@@ -695,6 +697,7 @@ enum rxrpc_propose_ack_trace {
 	rxrpc_propose_ack_respond_to_ack,
 	rxrpc_propose_ack_respond_to_ping,
 	rxrpc_propose_ack_retry_tx,
+	rxrpc_propose_ack_rotate_rx,
 	rxrpc_propose_ack_terminal_ack,
 	rxrpc_propose_ack__nr_trace
 };
diff --git a/net/rxrpc/misc.c b/net/rxrpc/misc.c
index 1ca14835d87f..a473fd7dabaa 100644
--- a/net/rxrpc/misc.c
+++ b/net/rxrpc/misc.c
@@ -202,6 +202,7 @@ const char rxrpc_propose_ack_traces[rxrpc_propose_ack__nr_trace][8] = {
 	[rxrpc_propose_ack_respond_to_ack]	= "Rsp2Ack",
 	[rxrpc_propose_ack_respond_to_ping]	= "Rsp2Png",
 	[rxrpc_propose_ack_retry_tx]		= "RetryTx",
+	[rxrpc_propose_ack_rotate_rx]		= "RxAck  ",
 	[rxrpc_propose_ack_terminal_ack]	= "ClTerm ",
 };
 
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index 0c563e325c9d..3eb01445e814 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -36,7 +36,9 @@ struct rxrpc_pkt_buffer {
  * Fill out an ACK packet.
  */
 static size_t rxrpc_fill_out_ack(struct rxrpc_call *call,
-				 struct rxrpc_pkt_buffer *pkt)
+				 struct rxrpc_pkt_buffer *pkt,
+				 rxrpc_seq_t *_hard_ack,
+				 rxrpc_seq_t *_top)
 {
 	rxrpc_serial_t serial;
 	rxrpc_seq_t hard_ack, top, seq;
@@ -48,6 +50,8 @@ static size_t rxrpc_fill_out_ack(struct rxrpc_call *call,
 	serial = call->ackr_serial;
 	hard_ack = READ_ONCE(call->rx_hard_ack);
 	top = smp_load_acquire(&call->rx_top);
+	*_hard_ack = hard_ack;
+	*_top = top;
 
 	pkt->ack.bufferSpace	= htons(8);
 	pkt->ack.maxSkew	= htons(call->ackr_skew);
@@ -96,6 +100,7 @@ int rxrpc_send_call_packet(struct rxrpc_call *call, u8 type)
 	struct msghdr msg;
 	struct kvec iov[2];
 	rxrpc_serial_t serial;
+	rxrpc_seq_t hard_ack, top;
 	size_t len, n;
 	bool ping = false;
 	int ioc, ret;
@@ -146,7 +151,7 @@ int rxrpc_send_call_packet(struct rxrpc_call *call, u8 type)
 			goto out;
 		}
 		ping = (call->ackr_reason == RXRPC_ACK_PING);
-		n = rxrpc_fill_out_ack(call, pkt);
+		n = rxrpc_fill_out_ack(call, pkt, &hard_ack, &top);
 		call->ackr_reason = 0;
 
 		spin_unlock_bh(&call->lock);
@@ -203,18 +208,22 @@ int rxrpc_send_call_packet(struct rxrpc_call *call, u8 type)
 	if (ping)
 		call->ackr_ping_time = ktime_get_real();
 
-	if (ret < 0 && call->state < RXRPC_CALL_COMPLETE) {
-		switch (type) {
-		case RXRPC_PACKET_TYPE_ACK:
+	if (type == RXRPC_PACKET_TYPE_ACK &&
+	    call->state < RXRPC_CALL_COMPLETE) {
+		if (ret < 0) {
 			clear_bit(RXRPC_CALL_PINGING, &call->flags);
 			rxrpc_propose_ACK(call, pkt->ack.reason,
 					  ntohs(pkt->ack.maxSkew),
 					  ntohl(pkt->ack.serial),
 					  true, true,
 					  rxrpc_propose_ack_retry_tx);
-			break;
-		case RXRPC_PACKET_TYPE_ABORT:
-			break;
+		} else {
+			spin_lock_bh(&call->lock);
+			if (after(hard_ack, call->ackr_consumed))
+				call->ackr_consumed = hard_ack;
+			if (after(top, call->ackr_seen))
+				call->ackr_seen = top;
+			spin_unlock_bh(&call->lock);
 		}
 	}
 
diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
index 8c7f3de45bac..a7458c398b9e 100644
--- a/net/rxrpc/recvmsg.c
+++ b/net/rxrpc/recvmsg.c
@@ -201,8 +201,19 @@ static void rxrpc_rotate_rx_window(struct rxrpc_call *call)
 
 	_debug("%u,%u,%02x", hard_ack, top, flags);
 	trace_rxrpc_receive(call, rxrpc_receive_rotate, serial, hard_ack);
-	if (flags & RXRPC_LAST_PACKET)
+	if (flags & RXRPC_LAST_PACKET) {
 		rxrpc_end_rx_phase(call);
+	} else {
+		/* Check to see if there's an ACK that needs sending. */
+		if (after_eq(hard_ack, call->ackr_consumed + 2) ||
+		    after_eq(top, call->ackr_seen + 2) ||
+		    (hard_ack == top && after(hard_ack, call->ackr_consumed)))
+			rxrpc_propose_ACK(call, RXRPC_ACK_DELAY, 0, serial,
+					  true, false,
+					  rxrpc_propose_ack_rotate_rx);
+		if (call->ackr_reason)
+			rxrpc_send_call_packet(call, RXRPC_PACKET_TYPE_ACK);
+	}
 }
 
 /*

^ permalink raw reply related

* [PATCH net-next 0/8] rxrpc: Implement slow-start and other bits
From: David Howells @ 2016-09-24 23:24 UTC (permalink / raw)
  To: netdev; +Cc: dhowells, linux-afs, linux-kernel


This set of patches implements the RxRPC slow-start feature for AF_RXRPC to
improve performance and handling of occasional packet loss.  This is more or
less the same as TCP slow start [RFC 5681].  Firstly, there are some ACK
generation improvements:

 (1) Send ACKs regularly to apprise the peer of our state so that they can do
     congestion management of their own.

 (2) Send an ACK when we fill in a hole in the buffer so that the peer can
     find out that we did this thus forestalling retransmission.

 (3) Note the final DATA packet's serial number in the final ACK for
     correlation purposes.

and a couple of bug fixes:

 (4) Reinitialise the ACK state and clear the ACK and resend timers upon
     entering the client reply reception phase to kill off any pending probe
     ACKs.

 (5) Delay the resend timer to allow for nsec->jiffies conversion errors.

and then there's the slow-start pieces:

 (6) Summarise an ACK.

 (7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to
     try and find out what happened to it.

 (8) Implement the slow start feature.

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-rewrite

Tagged thusly:

	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
	rxrpc-rewrite-20160924

David
---
David Howells (8):
      rxrpc: Send an ACK after every few DATA packets we receive
      rxrpc: Send an immediate ACK if we fill in a hole
      rxrpc: Include the last reply DATA serial number in the final ACK
      rxrpc: Reinitialise the call ACK and timer state for client reply phase
      rxrpc: Delay the resend timer to allow for nsec->jiffies conv error
      rxrpc: Generate a summary of the ACK state for later use
      rxrpc: Schedule an ACK if the reply to a client call appears overdue
      rxrpc: Implement slow-start


 include/trace/events/rxrpc.h |   45 ++++++++
 net/rxrpc/ar-internal.h      |   71 ++++++++++++
 net/rxrpc/call_event.c       |   47 +++++++-
 net/rxrpc/call_object.c      |   13 ++
 net/rxrpc/conn_event.c       |    1 
 net/rxrpc/input.c            |  241 +++++++++++++++++++++++++++++++++++++++---
 net/rxrpc/misc.c             |   23 ++++
 net/rxrpc/output.c           |   34 ++++--
 net/rxrpc/recvmsg.c          |   19 +++
 net/rxrpc/sendmsg.c          |    7 +
 10 files changed, 463 insertions(+), 38 deletions(-)

^ permalink raw reply

* [PATCH v5 12/16] IB/pvrdma: Add Queue Pair support
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan
In-Reply-To: <cover.1474759181.git.aditr@vmware.com>

This patch adds the ability to create, modify, query and destroy QPs. The
PVRDMA device supports RC, UD and GSI QPs.

Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
---
Changes v4->v5:
 - Updated include for headers in UAPI folder.
 - Update to pvrdma_cmd_post for creating/destroying/querying/modifying QPs.
 - Use the pvrdma_sge struct when posting WRs/allocating QP memory.
 - Removed two set but unused variables.

Changes v3->v4:
 - Removed an unnecessary switch case.
 - Unified the returns in pvrdma_create_qp to use one exit point.
 - Renamed pvrdma_flush_cqe to _pvrdma_flush_cqe since we need a lock to
 be held when calling this.
 - Updated to use wrapper for UAR write for QP.
 - Updated conversion function to func_name(dst, src) format.
 - Renamed max_gs to max_sg.
 - Renamed cap variable to req_cap in pvrdma_set_sq/rq_size.
 - Changed dev_warn to dev_warn_ratelimited in pvrdma_post_send/recv.
 - Added nesting locking for flushing CQs when destroying/resetting a QP.
 - Added missing ret value.

Changes v2->v3:
 - Removed boolean in pvrdma_cmd_post.
---
 drivers/infiniband/hw/pvrdma/pvrdma_qp.c | 973 +++++++++++++++++++++++++++++++
 1 file changed, 973 insertions(+)
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_qp.c

diff --git a/drivers/infiniband/hw/pvrdma/pvrdma_qp.c b/drivers/infiniband/hw/pvrdma/pvrdma_qp.c
new file mode 100644
index 0000000..52c689d
--- /dev/null
+++ b/drivers/infiniband/hw/pvrdma/pvrdma_qp.c
@@ -0,0 +1,973 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <asm/page.h>
+#include <linux/io.h>
+#include <linux/wait.h>
+#include <rdma/ib_addr.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+#include <rdma/pvrdma-abi.h>
+
+#include "pvrdma.h"
+
+static inline void get_cqs(struct pvrdma_qp *qp, struct pvrdma_cq **send_cq,
+			   struct pvrdma_cq **recv_cq)
+{
+	*send_cq = to_vcq(qp->ibqp.send_cq);
+	*recv_cq = to_vcq(qp->ibqp.recv_cq);
+}
+
+static void pvrdma_lock_cqs(struct pvrdma_cq *scq, struct pvrdma_cq *rcq,
+			    unsigned long *scq_flags,
+			    unsigned long *rcq_flags)
+	__acquires(scq->cq_lock) __acquires(rcq->cq_lock)
+{
+	if (scq == rcq) {
+		spin_lock_irqsave(&scq->cq_lock, *scq_flags);
+		__acquire(rcq->cq_lock);
+	} else if (scq->cq_handle < rcq->cq_handle) {
+		spin_lock_irqsave(&scq->cq_lock, *scq_flags);
+		spin_lock_irqsave_nested(&rcq->cq_lock, *rcq_flags,
+					 SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock_irqsave(&rcq->cq_lock, *rcq_flags);
+		spin_lock_irqsave_nested(&scq->cq_lock, *scq_flags,
+					 SINGLE_DEPTH_NESTING);
+	}
+}
+
+static void pvrdma_unlock_cqs(struct pvrdma_cq *scq, struct pvrdma_cq *rcq,
+			      unsigned long *scq_flags,
+			      unsigned long *rcq_flags)
+	__releases(scq->cq_lock) __releases(rcq->cq_lock)
+{
+	if (scq == rcq) {
+		__release(rcq->cq_lock);
+		spin_unlock_irqrestore(&scq->cq_lock, *scq_flags);
+	} else if (scq->cq_handle < rcq->cq_handle) {
+		spin_unlock_irqrestore(&rcq->cq_lock, *rcq_flags);
+		spin_unlock_irqrestore(&scq->cq_lock, *scq_flags);
+	} else {
+		spin_unlock_irqrestore(&scq->cq_lock, *scq_flags);
+		spin_unlock_irqrestore(&rcq->cq_lock, *rcq_flags);
+	}
+}
+
+static void pvrdma_reset_qp(struct pvrdma_qp *qp)
+{
+	struct pvrdma_cq *scq, *rcq;
+	unsigned long scq_flags, rcq_flags;
+
+	/* Clean up cqes */
+	get_cqs(qp, &scq, &rcq);
+	pvrdma_lock_cqs(scq, rcq, &scq_flags, &rcq_flags);
+
+	_pvrdma_flush_cqe(qp, scq);
+	if (scq != rcq)
+		_pvrdma_flush_cqe(qp, rcq);
+
+	pvrdma_unlock_cqs(scq, rcq, &scq_flags, &rcq_flags);
+
+	/*
+	 * Reset queuepair. The checks are because usermode queuepairs won't
+	 * have kernel ringstates.
+	 */
+	if (qp->rq.ring) {
+		atomic_set(&qp->rq.ring->cons_head, 0);
+		atomic_set(&qp->rq.ring->prod_tail, 0);
+	}
+	if (qp->sq.ring) {
+		atomic_set(&qp->sq.ring->cons_head, 0);
+		atomic_set(&qp->sq.ring->prod_tail, 0);
+	}
+}
+
+static int pvrdma_set_rq_size(struct pvrdma_dev *dev,
+			      struct ib_qp_cap *req_cap,
+			      struct pvrdma_qp *qp)
+{
+	if (req_cap->max_recv_wr > dev->dsr->caps.max_qp_wr ||
+	    req_cap->max_recv_sge > dev->dsr->caps.max_sge) {
+		dev_warn(&dev->pdev->dev, "recv queue size invalid\n");
+		return -EINVAL;
+	}
+
+	qp->rq.wqe_cnt = roundup_pow_of_two(max(1U, req_cap->max_recv_wr));
+	qp->rq.max_sg = roundup_pow_of_two(max(1U, req_cap->max_recv_sge));
+
+	/* Write back */
+	req_cap->max_recv_wr = qp->rq.wqe_cnt;
+	req_cap->max_recv_sge = qp->rq.max_sg;
+
+	qp->rq.wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_rq_wqe_hdr) +
+					     sizeof(struct pvrdma_sge) *
+					     qp->rq.max_sg);
+	qp->npages_recv = (qp->rq.wqe_cnt * qp->rq.wqe_size + PAGE_SIZE - 1) /
+			  PAGE_SIZE;
+
+	return 0;
+}
+
+static int pvrdma_set_sq_size(struct pvrdma_dev *dev, struct ib_qp_cap *req_cap,
+			      enum ib_qp_type type, struct pvrdma_qp *qp)
+{
+	if (req_cap->max_send_wr > dev->dsr->caps.max_qp_wr ||
+	    req_cap->max_send_sge > dev->dsr->caps.max_sge) {
+		dev_warn(&dev->pdev->dev, "send queue size invalid\n");
+		return -EINVAL;
+	}
+
+	qp->sq.wqe_cnt = roundup_pow_of_two(max(1U, req_cap->max_send_wr));
+	qp->sq.max_sg = roundup_pow_of_two(max(1U, req_cap->max_send_sge));
+
+	/* Write back */
+	req_cap->max_send_wr = qp->sq.wqe_cnt;
+	req_cap->max_send_sge = qp->sq.max_sg;
+
+	qp->sq.wqe_size = roundup_pow_of_two(sizeof(struct pvrdma_sq_wqe_hdr) +
+					     sizeof(struct pvrdma_sge) *
+					     qp->sq.max_sg);
+	/* Note: one extra page for the header. */
+	qp->npages_send = 1 + (qp->sq.wqe_cnt * qp->sq.wqe_size +
+			       PAGE_SIZE - 1) / PAGE_SIZE;
+
+	return 0;
+}
+
+/**
+ * pvrdma_create_qp - create queue pair
+ * @pd: protection domain
+ * @init_attr: queue pair attributes
+ * @udata: user data
+ *
+ * @return: the ib_qp pointer on success, otherwise returns an errno.
+ */
+struct ib_qp *pvrdma_create_qp(struct ib_pd *pd,
+			       struct ib_qp_init_attr *init_attr,
+			       struct ib_udata *udata)
+{
+	struct pvrdma_qp *qp = NULL;
+	struct pvrdma_dev *dev = to_vdev(pd->device);
+	union pvrdma_cmd_req req;
+	union pvrdma_cmd_resp rsp;
+	struct pvrdma_cmd_create_qp *cmd = &req.create_qp;
+	struct pvrdma_cmd_create_qp_resp *resp = &rsp.create_qp_resp;
+	struct pvrdma_create_qp ucmd;
+	unsigned long flags;
+	int ret;
+
+	if (init_attr->create_flags) {
+		dev_warn(&dev->pdev->dev,
+			 "invalid create queuepair flags %#x\n",
+			 init_attr->create_flags);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (init_attr->qp_type != IB_QPT_RC &&
+	    init_attr->qp_type != IB_QPT_UD &&
+	    init_attr->qp_type != IB_QPT_GSI) {
+		dev_warn(&dev->pdev->dev, "queuepair type %d not supported\n",
+			 init_attr->qp_type);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (!atomic_add_unless(&dev->num_qps, 1, dev->dsr->caps.max_qp))
+		return ERR_PTR(-ENOMEM);
+
+	switch (init_attr->qp_type) {
+	case IB_QPT_GSI:
+		if (init_attr->port_num == 0 ||
+		    init_attr->port_num > pd->device->phys_port_cnt ||
+		    udata) {
+			dev_warn(&dev->pdev->dev, "invalid queuepair attrs\n");
+			ret = -EINVAL;
+			goto err_qp;
+		}
+		/* fall through */
+	case IB_QPT_RC:
+	case IB_QPT_UD:
+		qp = kzalloc(sizeof(*qp), GFP_KERNEL);
+		if (!qp) {
+			ret = -ENOMEM;
+			goto err_qp;
+		}
+
+		spin_lock_init(&qp->sq.lock);
+		spin_lock_init(&qp->rq.lock);
+		mutex_init(&qp->mutex);
+		atomic_set(&qp->refcnt, 1);
+		init_waitqueue_head(&qp->wait);
+
+		qp->state = IB_QPS_RESET;
+
+		if (pd->uobject && udata) {
+			dev_dbg(&dev->pdev->dev,
+				"create queuepair from user space\n");
+
+			if (ib_copy_from_udata(&ucmd, udata, sizeof(ucmd))) {
+				ret = -EFAULT;
+				goto err_qp;
+			}
+
+			/* set qp->sq.wqe_cnt, shift, buf_size.. */
+			qp->rumem = ib_umem_get(pd->uobject->context,
+						ucmd.rbuf_addr,
+						ucmd.rbuf_size, 0, 0);
+			if (IS_ERR(qp->rumem)) {
+				ret = PTR_ERR(qp->rumem);
+				goto err_qp;
+			}
+
+			qp->sumem = ib_umem_get(pd->uobject->context,
+						ucmd.sbuf_addr,
+						ucmd.sbuf_size, 0, 0);
+			if (IS_ERR(qp->sumem)) {
+				ib_umem_release(qp->rumem);
+				ret = PTR_ERR(qp->sumem);
+				goto err_qp;
+			}
+
+			qp->npages_send = ib_umem_page_count(qp->sumem);
+			qp->npages_recv = ib_umem_page_count(qp->rumem);
+			qp->npages = qp->npages_send + qp->npages_recv;
+		} else {
+			qp->is_kernel = true;
+
+			ret = pvrdma_set_sq_size(to_vdev(pd->device),
+						 &init_attr->cap,
+						 init_attr->qp_type, qp);
+			if (ret)
+				goto err_qp;
+
+			ret = pvrdma_set_rq_size(to_vdev(pd->device),
+						 &init_attr->cap, qp);
+			if (ret)
+				goto err_qp;
+
+			qp->npages = qp->npages_send + qp->npages_recv;
+
+			/* Skip header page. */
+			qp->sq.offset = PAGE_SIZE;
+
+			/* Recv queue pages are after send pages. */
+			qp->rq.offset = qp->npages_send * PAGE_SIZE;
+		}
+
+		if (qp->npages < 0 || qp->npages > PVRDMA_PAGE_DIR_MAX_PAGES) {
+			dev_warn(&dev->pdev->dev,
+				 "overflow pages in queuepair\n");
+			ret = -EINVAL;
+			goto err_umem;
+		}
+
+		ret = pvrdma_page_dir_init(dev, &qp->pdir, qp->npages,
+					   qp->is_kernel);
+		if (ret) {
+			dev_warn(&dev->pdev->dev,
+				 "could not allocate page directory\n");
+			goto err_umem;
+		}
+
+		if (!qp->is_kernel) {
+			pvrdma_page_dir_insert_umem(&qp->pdir, qp->sumem, 0);
+			pvrdma_page_dir_insert_umem(&qp->pdir, qp->rumem,
+						    qp->npages_send);
+		} else {
+			/* Ring state is always the first page. */
+			qp->sq.ring = qp->pdir.pages[0];
+			qp->rq.ring = &qp->sq.ring[1];
+		}
+		break;
+	default:
+		ret = -EINVAL;
+		goto err_qp;
+	}
+
+	/* Not supported */
+	init_attr->cap.max_inline_data = 0;
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_CREATE_QP;
+	cmd->pd_handle = to_vpd(pd)->pd_handle;
+	cmd->send_cq_handle = to_vcq(init_attr->send_cq)->cq_handle;
+	cmd->recv_cq_handle = to_vcq(init_attr->recv_cq)->cq_handle;
+	cmd->max_send_wr = init_attr->cap.max_send_wr;
+	cmd->max_recv_wr = init_attr->cap.max_recv_wr;
+	cmd->max_send_sge = init_attr->cap.max_send_sge;
+	cmd->max_recv_sge = init_attr->cap.max_recv_sge;
+	cmd->max_inline_data = init_attr->cap.max_inline_data;
+	cmd->sq_sig_all = (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR) ? 1 : 0;
+	cmd->qp_type = ib_qp_type_to_pvrdma(init_attr->qp_type);
+	cmd->access_flags = IB_ACCESS_LOCAL_WRITE;
+	cmd->total_chunks = qp->npages;
+	cmd->send_chunks = qp->npages_send - 1;
+	cmd->pdir_dma = qp->pdir.dir_dma;
+
+	dev_dbg(&dev->pdev->dev, "create queuepair with %d, %d, %d, %d\n",
+		cmd->max_send_wr, cmd->max_recv_wr, cmd->max_send_sge,
+		cmd->max_recv_sge);
+
+	ret = pvrdma_cmd_post(dev, &req, &rsp, PVRDMA_CMD_CREATE_QP_RESP);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not create queuepair, error: %d\n", ret);
+		goto err_pdir;
+	}
+
+	/* max_send_wr/_recv_wr/_send_sge/_recv_sge/_inline_data */
+	qp->qp_handle = resp->qpn;
+	qp->port = init_attr->port_num;
+	qp->ibqp.qp_num = resp->qpn;
+	spin_lock_irqsave(&dev->qp_tbl_lock, flags);
+	dev->qp_tbl[qp->qp_handle % dev->dsr->caps.max_qp] = qp;
+	spin_unlock_irqrestore(&dev->qp_tbl_lock, flags);
+
+	return &qp->ibqp;
+
+err_pdir:
+	pvrdma_page_dir_cleanup(dev, &qp->pdir);
+err_umem:
+	if (pd->uobject && udata) {
+		if (qp->rumem)
+			ib_umem_release(qp->rumem);
+		if (qp->sumem)
+			ib_umem_release(qp->sumem);
+	}
+err_qp:
+	kfree(qp);
+	atomic_dec(&dev->num_qps);
+
+	return ERR_PTR(ret);
+}
+
+static void pvrdma_free_qp(struct pvrdma_qp *qp)
+{
+	struct pvrdma_dev *dev = to_vdev(qp->ibqp.device);
+	struct pvrdma_cq *scq;
+	struct pvrdma_cq *rcq;
+	unsigned long flags, scq_flags, rcq_flags;
+
+	/* In case cq is polling */
+	get_cqs(qp, &scq, &rcq);
+	pvrdma_lock_cqs(scq, rcq, &scq_flags, &rcq_flags);
+
+	_pvrdma_flush_cqe(qp, scq);
+	if (scq != rcq)
+		_pvrdma_flush_cqe(qp, rcq);
+
+	spin_lock_irqsave(&dev->qp_tbl_lock, flags);
+	dev->qp_tbl[qp->qp_handle] = NULL;
+	spin_unlock_irqrestore(&dev->qp_tbl_lock, flags);
+
+	pvrdma_unlock_cqs(scq, rcq, &scq_flags, &rcq_flags);
+
+	atomic_dec(&qp->refcnt);
+	wait_event(qp->wait, !atomic_read(&qp->refcnt));
+
+	pvrdma_page_dir_cleanup(dev, &qp->pdir);
+
+	kfree(qp);
+
+	atomic_dec(&dev->num_qps);
+}
+
+/**
+ * pvrdma_destroy_qp - destroy a queue pair
+ * @qp: the queue pair to destroy
+ *
+ * @return: 0 on success.
+ */
+int pvrdma_destroy_qp(struct ib_qp *qp)
+{
+	struct pvrdma_qp *vqp = to_vqp(qp);
+	union pvrdma_cmd_req req;
+	struct pvrdma_cmd_destroy_qp *cmd = &req.destroy_qp;
+	int ret;
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_DESTROY_QP;
+	cmd->qp_handle = vqp->qp_handle;
+
+	ret = pvrdma_cmd_post(to_vdev(qp->device), &req, NULL, 0);
+	if (ret < 0)
+		dev_warn(&to_vdev(qp->device)->pdev->dev,
+			 "destroy queuepair failed, error: %d\n", ret);
+
+	pvrdma_free_qp(vqp);
+
+	return 0;
+}
+
+/**
+ * pvrdma_modify_qp - modify queue pair attributes
+ * @ibqp: the queue pair
+ * @attr: the new queue pair's attributes
+ * @attr_mask: attributes mask
+ * @udata: user data
+ *
+ * @returns 0 on success, otherwise returns an errno.
+ */
+int pvrdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		     int attr_mask, struct ib_udata *udata)
+{
+	struct pvrdma_dev *dev = to_vdev(ibqp->device);
+	struct pvrdma_qp *qp = to_vqp(ibqp);
+	union pvrdma_cmd_req req;
+	union pvrdma_cmd_resp rsp;
+	struct pvrdma_cmd_modify_qp *cmd = &req.modify_qp;
+	int cur_state, next_state;
+	int ret;
+
+	/* Sanity checking. Should need lock here */
+	mutex_lock(&qp->mutex);
+	cur_state = (attr_mask & IB_QP_CUR_STATE) ? attr->cur_qp_state :
+		qp->state;
+	next_state = (attr_mask & IB_QP_STATE) ? attr->qp_state : cur_state;
+
+	if (!ib_modify_qp_is_ok(cur_state, next_state, ibqp->qp_type,
+				attr_mask, IB_LINK_LAYER_ETHERNET)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (attr_mask & IB_QP_PORT) {
+		if (attr->port_num == 0 ||
+		    attr->port_num > ibqp->device->phys_port_cnt) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (attr_mask & IB_QP_MIN_RNR_TIMER) {
+		if (attr->min_rnr_timer > 31) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (attr_mask & IB_QP_PKEY_INDEX) {
+		if (attr->pkey_index >= dev->dsr->caps.max_pkeys) {
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (attr_mask & IB_QP_QKEY)
+		qp->qkey = attr->qkey;
+
+	if (cur_state == next_state && cur_state == IB_QPS_RESET) {
+		ret = 0;
+		goto out;
+	}
+
+	qp->state = next_state;
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_MODIFY_QP;
+	cmd->qp_handle = qp->qp_handle;
+	cmd->attr_mask = ib_qp_attr_mask_to_pvrdma(attr_mask);
+	cmd->attrs.qp_state = ib_qp_state_to_pvrdma(attr->qp_state);
+	cmd->attrs.cur_qp_state =
+		ib_qp_state_to_pvrdma(attr->cur_qp_state);
+	cmd->attrs.path_mtu = ib_mtu_to_pvrdma(attr->path_mtu);
+	cmd->attrs.path_mig_state =
+		ib_mig_state_to_pvrdma(attr->path_mig_state);
+	cmd->attrs.qkey = attr->qkey;
+	cmd->attrs.rq_psn = attr->rq_psn;
+	cmd->attrs.sq_psn = attr->sq_psn;
+	cmd->attrs.dest_qp_num = attr->dest_qp_num;
+	cmd->attrs.qp_access_flags =
+		ib_access_flags_to_pvrdma(attr->qp_access_flags);
+	cmd->attrs.pkey_index = attr->pkey_index;
+	cmd->attrs.alt_pkey_index = attr->alt_pkey_index;
+	cmd->attrs.en_sqd_async_notify = attr->en_sqd_async_notify;
+	cmd->attrs.sq_draining = attr->sq_draining;
+	cmd->attrs.max_rd_atomic = attr->max_rd_atomic;
+	cmd->attrs.max_dest_rd_atomic = attr->max_dest_rd_atomic;
+	cmd->attrs.min_rnr_timer = attr->min_rnr_timer;
+	cmd->attrs.port_num = attr->port_num;
+	cmd->attrs.timeout = attr->timeout;
+	cmd->attrs.retry_cnt = attr->retry_cnt;
+	cmd->attrs.rnr_retry = attr->rnr_retry;
+	cmd->attrs.alt_port_num = attr->alt_port_num;
+	cmd->attrs.alt_timeout = attr->alt_timeout;
+	ib_qp_cap_to_pvrdma(&cmd->attrs.cap, &attr->cap);
+	ib_ah_attr_to_pvrdma(&cmd->attrs.ah_attr, &attr->ah_attr);
+	ib_ah_attr_to_pvrdma(&cmd->attrs.alt_ah_attr, &attr->alt_ah_attr);
+
+	ret = pvrdma_cmd_post(dev, &req, &rsp, PVRDMA_CMD_MODIFY_QP_RESP);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not modify queuepair, error: %d\n", ret);
+	} else if (rsp.hdr.err > 0) {
+		dev_warn(&dev->pdev->dev,
+			 "cannot modify queuepair, error: %d\n", rsp.hdr.err);
+		ret = -EINVAL;
+	}
+
+	if (ret == 0 && next_state == IB_QPS_RESET)
+		pvrdma_reset_qp(qp);
+
+out:
+	mutex_unlock(&qp->mutex);
+
+	return ret;
+}
+
+static inline void *get_sq_wqe(struct pvrdma_qp *qp, int n)
+{
+	return pvrdma_page_dir_get_ptr(&qp->pdir,
+				       qp->sq.offset + n * qp->sq.wqe_size);
+}
+
+static inline void *get_rq_wqe(struct pvrdma_qp *qp, int n)
+{
+	return pvrdma_page_dir_get_ptr(&qp->pdir,
+				       qp->rq.offset + n * qp->rq.wqe_size);
+}
+
+static int set_reg_seg(struct pvrdma_sq_wqe_hdr *wqe_hdr, struct ib_reg_wr *wr)
+{
+	struct pvrdma_user_mr *mr = to_vmr(wr->mr);
+
+	wqe_hdr->wr.fast_reg.iova_start = mr->ibmr.iova;
+	wqe_hdr->wr.fast_reg.pl_pdir_dma = mr->pdir.dir_dma;
+	wqe_hdr->wr.fast_reg.page_shift = mr->page_shift;
+	wqe_hdr->wr.fast_reg.page_list_len = mr->npages;
+	wqe_hdr->wr.fast_reg.length = mr->ibmr.length;
+	wqe_hdr->wr.fast_reg.access_flags = wr->access;
+	wqe_hdr->wr.fast_reg.rkey = wr->key;
+
+	return pvrdma_page_dir_insert_page_list(&mr->pdir, mr->pages,
+						mr->npages);
+}
+
+/**
+ * pvrdma_post_send - post send work request entries on a QP
+ * @ibqp: the QP
+ * @wr: work request list to post
+ * @bad_wr: the first bad WR returned
+ *
+ * @return: 0 on success, otherwise errno returned.
+ */
+int pvrdma_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		     struct ib_send_wr **bad_wr)
+{
+	struct pvrdma_qp *qp = to_vqp(ibqp);
+	struct pvrdma_dev *dev = to_vdev(ibqp->device);
+	unsigned long flags;
+	struct pvrdma_sq_wqe_hdr *wqe_hdr;
+	struct pvrdma_sge *sge;
+	int i, index;
+	int nreq;
+	int ret;
+
+	/*
+	 * In states lower than RTS, we can fail immediately. In other states,
+	 * just post and let the device figure it out.
+	 */
+	if (qp->state < IB_QPS_RTS) {
+		*bad_wr = wr;
+		return -EINVAL;
+	}
+
+	spin_lock_irqsave(&qp->sq.lock, flags);
+
+	index = pvrdma_idx(&qp->sq.ring->prod_tail, qp->sq.wqe_cnt);
+	for (nreq = 0; wr; nreq++, wr = wr->next) {
+		unsigned int tail;
+
+		if (unlikely(!pvrdma_idx_ring_has_space(
+				qp->sq.ring, qp->sq.wqe_cnt, &tail))) {
+			dev_warn_ratelimited(&dev->pdev->dev,
+					     "send queue is full\n");
+			*bad_wr = wr;
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		if (unlikely(wr->num_sge > qp->sq.max_sg || wr->num_sge < 0)) {
+			dev_warn_ratelimited(&dev->pdev->dev,
+					     "send SGE overflow\n");
+			*bad_wr = wr;
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (unlikely(wr->opcode < 0)) {
+			dev_warn_ratelimited(&dev->pdev->dev,
+					     "invalid send opcode\n");
+			*bad_wr = wr;
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/*
+		 * Only support UD, RC.
+		 * Need to check opcode table for thorough checking.
+		 * opcode		_UD	_UC	_RC
+		 * _SEND		x	x	x
+		 * _SEND_WITH_IMM	x	x	x
+		 * _RDMA_WRITE			x	x
+		 * _RDMA_WRITE_WITH_IMM		x	x
+		 * _LOCAL_INV			x	x
+		 * _SEND_WITH_INV		x	x
+		 * _RDMA_READ				x
+		 * _ATOMIC_CMP_AND_SWP			x
+		 * _ATOMIC_FETCH_AND_ADD		x
+		 * _MASK_ATOMIC_CMP_AND_SWP		x
+		 * _MASK_ATOMIC_FETCH_AND_ADD		x
+		 * _REG_MR				x
+		 *
+		 */
+		if (qp->ibqp.qp_type != IB_QPT_UD &&
+		    qp->ibqp.qp_type != IB_QPT_RC &&
+			wr->opcode != IB_WR_SEND) {
+			dev_warn_ratelimited(&dev->pdev->dev,
+					     "unsupported queuepair type\n");
+			*bad_wr = wr;
+			ret = -EINVAL;
+			goto out;
+		} else if (qp->ibqp.qp_type == IB_QPT_UD ||
+			   qp->ibqp.qp_type == IB_QPT_GSI) {
+			if (wr->opcode != IB_WR_SEND &&
+			    wr->opcode != IB_WR_SEND_WITH_IMM) {
+				dev_warn_ratelimited(&dev->pdev->dev,
+						     "invalid send opcode\n");
+				*bad_wr = wr;
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+
+		wqe_hdr = (struct pvrdma_sq_wqe_hdr *)get_sq_wqe(qp, index);
+		memset(wqe_hdr, 0, sizeof(*wqe_hdr));
+		wqe_hdr->wr_id = wr->wr_id;
+		wqe_hdr->num_sge = wr->num_sge;
+		wqe_hdr->opcode = ib_wr_opcode_to_pvrdma(wr->opcode);
+		wqe_hdr->send_flags = ib_send_flags_to_pvrdma(wr->send_flags);
+		if (wr->opcode == IB_WR_SEND_WITH_IMM ||
+		    wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM)
+			wqe_hdr->ex.imm_data = wr->ex.imm_data;
+
+		switch (qp->ibqp.qp_type) {
+		case IB_QPT_GSI:
+		case IB_QPT_UD:
+			if (unlikely(!ud_wr(wr)->ah)) {
+				dev_warn_ratelimited(&dev->pdev->dev,
+						     "invalid address handle\n");
+				*bad_wr = wr;
+				ret = -EINVAL;
+				goto out;
+			}
+
+			/*
+			 * Use qkey from qp context if high order bit set,
+			 * otherwise from work request.
+			 */
+			wqe_hdr->wr.ud.remote_qpn = ud_wr(wr)->remote_qpn;
+			wqe_hdr->wr.ud.remote_qkey =
+				ud_wr(wr)->remote_qkey & 0x80000000 ?
+				qp->qkey : ud_wr(wr)->remote_qkey;
+			wqe_hdr->wr.ud.av = to_vah(ud_wr(wr)->ah)->av;
+
+			break;
+		case IB_QPT_RC:
+			switch (wr->opcode) {
+			case IB_WR_RDMA_READ:
+			case IB_WR_RDMA_WRITE:
+			case IB_WR_RDMA_WRITE_WITH_IMM:
+				wqe_hdr->wr.rdma.remote_addr =
+					rdma_wr(wr)->remote_addr;
+				wqe_hdr->wr.rdma.rkey = rdma_wr(wr)->rkey;
+				break;
+			case IB_WR_LOCAL_INV:
+			case IB_WR_SEND_WITH_INV:
+				wqe_hdr->ex.invalidate_rkey =
+					wr->ex.invalidate_rkey;
+				break;
+			case IB_WR_ATOMIC_CMP_AND_SWP:
+			case IB_WR_ATOMIC_FETCH_AND_ADD:
+				wqe_hdr->wr.atomic.remote_addr =
+					atomic_wr(wr)->remote_addr;
+				wqe_hdr->wr.atomic.rkey = atomic_wr(wr)->rkey;
+				wqe_hdr->wr.atomic.compare_add =
+					atomic_wr(wr)->compare_add;
+				if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP)
+					wqe_hdr->wr.atomic.swap =
+						atomic_wr(wr)->swap;
+				break;
+			case IB_WR_REG_MR:
+				ret = set_reg_seg(wqe_hdr, reg_wr(wr));
+				if (ret < 0) {
+					dev_warn_ratelimited(&dev->pdev->dev,
+							     "Failed to set fast register work request\n");
+					*bad_wr = wr;
+					goto out;
+				}
+				break;
+			default:
+				break;
+			}
+
+			break;
+		default:
+			dev_warn_ratelimited(&dev->pdev->dev,
+					     "invalid queuepair type\n");
+			ret = -EINVAL;
+			*bad_wr = wr;
+			goto out;
+		}
+
+		sge = (struct pvrdma_sge *)(wqe_hdr + 1);
+		for (i = 0; i < wr->num_sge; i++) {
+			/* Need to check wqe_size 0 or max size */
+			sge->addr = wr->sg_list[i].addr;
+			sge->length = wr->sg_list[i].length;
+			sge->lkey = wr->sg_list[i].lkey;
+			sge++;
+		}
+
+		/* Make sure wqe is written before index update */
+		smp_wmb();
+
+		index++;
+		if (unlikely(index >= qp->sq.wqe_cnt))
+			index = 0;
+		/* Update shared sq ring */
+		pvrdma_idx_ring_inc(&qp->sq.ring->prod_tail,
+				    qp->sq.wqe_cnt);
+	}
+
+	ret = 0;
+
+out:
+	spin_unlock_irqrestore(&qp->sq.lock, flags);
+
+	if (!ret)
+		pvrdma_write_uar_qp(dev, PVRDMA_UAR_QP_SEND | qp->qp_handle);
+
+	return ret;
+}
+
+/**
+ * pvrdma_post_receive - post receive work request entries on a QP
+ * @ibqp: the QP
+ * @wr: the work request list to post
+ * @bad_wr: the first bad WR returned
+ *
+ * @return: 0 on success, otherwise errno returned.
+ */
+int pvrdma_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		     struct ib_recv_wr **bad_wr)
+{
+	struct pvrdma_dev *dev = to_vdev(ibqp->device);
+	unsigned long flags;
+	struct pvrdma_qp *qp = to_vqp(ibqp);
+	struct pvrdma_rq_wqe_hdr *wqe_hdr;
+	struct pvrdma_sge *sge;
+	int index, nreq;
+	int ret = 0;
+	int i;
+
+	/*
+	 * In the RESET state, we can fail immediately. For other states,
+	 * just post and let the device figure it out.
+	 */
+	if (qp->state == IB_QPS_RESET) {
+		*bad_wr = wr;
+		return -EINVAL;
+	}
+
+	spin_lock_irqsave(&qp->rq.lock, flags);
+
+	index = pvrdma_idx(&qp->rq.ring->prod_tail, qp->rq.wqe_cnt);
+	for (nreq = 0; wr; nreq++, wr = wr->next) {
+		unsigned int tail;
+
+		if (unlikely(wr->num_sge > qp->rq.max_sg ||
+			     wr->num_sge < 0)) {
+			ret = -EINVAL;
+			*bad_wr = wr;
+			dev_warn_ratelimited(&dev->pdev->dev,
+					     "recv SGE overflow\n");
+			goto out;
+		}
+
+		if (unlikely(!pvrdma_idx_ring_has_space(
+				qp->rq.ring, qp->rq.wqe_cnt, &tail))) {
+			ret = -ENOMEM;
+			*bad_wr = wr;
+			dev_warn_ratelimited(&dev->pdev->dev,
+					     "recv queue full\n");
+			goto out;
+		}
+
+		wqe_hdr = (struct pvrdma_rq_wqe_hdr *)get_rq_wqe(qp, index);
+		wqe_hdr->wr_id = wr->wr_id;
+		wqe_hdr->num_sge = wr->num_sge;
+		wqe_hdr->total_len = 0;
+
+		sge = (struct pvrdma_sge *)(wqe_hdr + 1);
+		for (i = 0; i < wr->num_sge; i++) {
+			sge->addr = wr->sg_list[i].addr;
+			sge->length = wr->sg_list[i].length;
+			sge->lkey = wr->sg_list[i].lkey;
+			sge++;
+		}
+
+		/* Make sure wqe is written before index update */
+		smp_wmb();
+
+		index++;
+		if (unlikely(index >= qp->rq.wqe_cnt))
+			index = 0;
+		/* Update shared rq ring */
+		pvrdma_idx_ring_inc(&qp->rq.ring->prod_tail,
+				    qp->rq.wqe_cnt);
+	}
+
+	spin_unlock_irqrestore(&qp->rq.lock, flags);
+
+	pvrdma_write_uar_qp(dev, PVRDMA_UAR_QP_RECV | qp->qp_handle);
+
+	return ret;
+
+out:
+	spin_unlock_irqrestore(&qp->rq.lock, flags);
+
+	return ret;
+}
+
+/**
+ * pvrdma_query_qp - query a queue pair's attributes
+ * @ibqp: the queue pair to query
+ * @attr: the queue pair's attributes
+ * @attr_mask: attributes mask
+ * @init_attr: initial queue pair attributes
+ *
+ * @returns 0 on success, otherwise returns an errno.
+ */
+int pvrdma_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		    int attr_mask, struct ib_qp_init_attr *init_attr)
+{
+	struct pvrdma_dev *dev = to_vdev(ibqp->device);
+	struct pvrdma_qp *qp = to_vqp(ibqp);
+	union pvrdma_cmd_req req;
+	union pvrdma_cmd_resp rsp;
+	struct pvrdma_cmd_query_qp *cmd = &req.query_qp;
+	struct pvrdma_cmd_query_qp_resp *resp = &rsp.query_qp_resp;
+	int ret = 0;
+
+	mutex_lock(&qp->mutex);
+
+	if (qp->state == IB_QPS_RESET) {
+		attr->qp_state = IB_QPS_RESET;
+		goto out;
+	}
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_QUERY_QP;
+	cmd->qp_handle = qp->qp_handle;
+	cmd->attr_mask = ib_qp_attr_mask_to_pvrdma(attr_mask);
+
+	ret = pvrdma_cmd_post(dev, &req, &rsp, PVRDMA_CMD_QUERY_QP_RESP);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not query queuepair, error: %d\n", ret);
+		goto out;
+	}
+
+	attr->qp_state = pvrdma_qp_state_to_ib(resp->attrs.qp_state);
+	attr->cur_qp_state =
+		pvrdma_qp_state_to_ib(resp->attrs.cur_qp_state);
+	attr->path_mtu = pvrdma_mtu_to_ib(resp->attrs.path_mtu);
+	attr->path_mig_state =
+		pvrdma_mig_state_to_ib(resp->attrs.path_mig_state);
+	attr->qkey = resp->attrs.qkey;
+	attr->rq_psn = resp->attrs.rq_psn;
+	attr->sq_psn = resp->attrs.sq_psn;
+	attr->dest_qp_num = resp->attrs.dest_qp_num;
+	attr->qp_access_flags =
+		pvrdma_access_flags_to_ib(resp->attrs.qp_access_flags);
+	attr->pkey_index = resp->attrs.pkey_index;
+	attr->alt_pkey_index = resp->attrs.alt_pkey_index;
+	attr->en_sqd_async_notify = resp->attrs.en_sqd_async_notify;
+	attr->sq_draining = resp->attrs.sq_draining;
+	attr->max_rd_atomic = resp->attrs.max_rd_atomic;
+	attr->max_dest_rd_atomic = resp->attrs.max_dest_rd_atomic;
+	attr->min_rnr_timer = resp->attrs.min_rnr_timer;
+	attr->port_num = resp->attrs.port_num;
+	attr->timeout = resp->attrs.timeout;
+	attr->retry_cnt = resp->attrs.retry_cnt;
+	attr->rnr_retry = resp->attrs.rnr_retry;
+	attr->alt_port_num = resp->attrs.alt_port_num;
+	attr->alt_timeout = resp->attrs.alt_timeout;
+	pvrdma_qp_cap_to_ib(&attr->cap, &resp->attrs.cap);
+	pvrdma_ah_attr_to_ib(&attr->ah_attr, &resp->attrs.ah_attr);
+	pvrdma_ah_attr_to_ib(&attr->alt_ah_attr, &resp->attrs.alt_ah_attr);
+
+	qp->state = attr->qp_state;
+
+	ret = 0;
+
+out:
+	attr->cur_qp_state = attr->qp_state;
+
+	init_attr->event_handler = qp->ibqp.event_handler;
+	init_attr->qp_context = qp->ibqp.qp_context;
+	init_attr->send_cq = qp->ibqp.send_cq;
+	init_attr->recv_cq = qp->ibqp.recv_cq;
+	init_attr->srq = qp->ibqp.srq;
+	init_attr->xrcd = NULL;
+	init_attr->cap = attr->cap;
+	init_attr->sq_sig_type = 0;
+	init_attr->qp_type = qp->ibqp.qp_type;
+	init_attr->create_flags = 0;
+	init_attr->port_num = qp->port;
+
+	mutex_unlock(&qp->mutex);
+	return ret;
+}
-- 
2.7.4

^ permalink raw reply related

* [PATCH v5 14/16] IB/pvrdma: Add Kconfig and Makefile
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan
In-Reply-To: <cover.1474759181.git.aditr@vmware.com>

This patch adds a Kconfig and Makefile for the PVRDMA driver.

Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
---
Changes v3->v4:
 - Enforced dependency on VMXNet3
---
 drivers/infiniband/hw/pvrdma/Kconfig  | 7 +++++++
 drivers/infiniband/hw/pvrdma/Makefile | 3 +++
 2 files changed, 10 insertions(+)
 create mode 100644 drivers/infiniband/hw/pvrdma/Kconfig
 create mode 100644 drivers/infiniband/hw/pvrdma/Makefile

diff --git a/drivers/infiniband/hw/pvrdma/Kconfig b/drivers/infiniband/hw/pvrdma/Kconfig
new file mode 100644
index 0000000..b345679
--- /dev/null
+++ b/drivers/infiniband/hw/pvrdma/Kconfig
@@ -0,0 +1,7 @@
+config INFINIBAND_PVRDMA
+	tristate "VMware Paravirtualized RDMA Driver"
+	depends on NETDEVICES && ETHERNET && PCI && INET && VMXNET3
+	---help---
+	  This driver provides low-level support for VMware Paravirtual
+	  RDMA adapter. It interacts with the VMXNet3 driver to provide
+	  Ethernet capabilities.
diff --git a/drivers/infiniband/hw/pvrdma/Makefile b/drivers/infiniband/hw/pvrdma/Makefile
new file mode 100644
index 0000000..e6f078b
--- /dev/null
+++ b/drivers/infiniband/hw/pvrdma/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_INFINIBAND_PVRDMA) += pvrdma.o
+
+pvrdma-y := pvrdma_cmd.o pvrdma_cq.o pvrdma_doorbell.o pvrdma_main.o pvrdma_misc.o pvrdma_mr.o pvrdma_qp.o pvrdma_verbs.o
-- 
2.7.4

^ permalink raw reply related

* [PATCH v5 13/16] IB/pvrdma: Add the main driver module for PVRDMA
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan
In-Reply-To: <cover.1474759181.git.aditr@vmware.com>

This patch adds the support to register a RDMA device with the kernel RDMA
stack as well as a kernel module. This also initializes the underlying
virtual PCI device.

Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
---
Changes v4->v5:
 - Removed two unnecessary lines.
 - Updated include for headers in UAPI folder.
 - Update to pvrdma_cmd_post for add/delete GIDs.
 - Add error code in dev_warn if pvrdma_cmd_post failed.

Changes v3->v4:
 - Fixed some checkpatch warnings.
 - Added support for new get_dev_fw_str API.
 - Added event workqueue for netdevice events.
 - Restructured the pvrdma_pci_remove function a little bit.

Changes v2->v3:
 - Removed boolean in pvrdma_cmd_post.

Changes v1->v2:
 - Addressed 32-bit build errors
 - Cosmetic change to avoid if else in intr0_handler
 - Removed unnecessary return assignment.
---
 drivers/infiniband/hw/pvrdma/pvrdma_main.c | 1220 ++++++++++++++++++++++++++++
 1 file changed, 1220 insertions(+)
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_main.c

diff --git a/drivers/infiniband/hw/pvrdma/pvrdma_main.c b/drivers/infiniband/hw/pvrdma/pvrdma_main.c
new file mode 100644
index 0000000..94cbbb9
--- /dev/null
+++ b/drivers/infiniband/hw/pvrdma/pvrdma_main.c
@@ -0,0 +1,1220 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <linux/errno.h>
+#include <linux/inetdevice.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <rdma/ib_addr.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+#include <net/addrconf.h>
+#include <rdma/pvrdma-abi.h>
+
+#include "pvrdma.h"
+
+#define DRV_NAME	"pvrdma"
+#define DRV_VERSION	"1.0"
+#define DRV_RELDATE	"January 1, 2013"
+
+static const char pvrdma_version[] =
+	DRV_NAME ": PVRDMA InfiniBand driver v"
+	DRV_VERSION " (" DRV_RELDATE ")\n";
+
+static DEFINE_MUTEX(pvrdma_device_list_lock);
+static LIST_HEAD(pvrdma_device_list);
+static struct workqueue_struct *event_wq;
+
+static int pvrdma_add_gid(struct ib_device *ibdev,
+			  u8 port_num,
+			  unsigned int index,
+			  const union ib_gid *gid,
+			  const struct ib_gid_attr *attr,
+			  void **context);
+static int pvrdma_del_gid(struct ib_device *ibdev,
+			  u8 port_num,
+			  unsigned int index,
+			  void **context);
+
+
+static ssize_t show_hca(struct device *device, struct device_attribute *attr,
+			char *buf)
+{
+	return sprintf(buf, "PVRDMA%s\n", DRV_VERSION);
+}
+
+static ssize_t show_rev(struct device *device, struct device_attribute *attr,
+			char *buf)
+{
+	return sprintf(buf, "%d\n", PVRDMA_REV_ID);
+}
+
+static ssize_t show_board(struct device *device, struct device_attribute *attr,
+			  char *buf)
+{
+	return sprintf(buf, "%d\n", PVRDMA_BOARD_ID);
+}
+
+static DEVICE_ATTR(hw_rev,   S_IRUGO, show_rev,	   NULL);
+static DEVICE_ATTR(hca_type, S_IRUGO, show_hca,	   NULL);
+static DEVICE_ATTR(board_id, S_IRUGO, show_board,  NULL);
+
+static struct device_attribute *pvrdma_class_attributes[] = {
+	&dev_attr_hw_rev,
+	&dev_attr_hca_type,
+	&dev_attr_board_id
+};
+
+static void pvrdma_get_fw_ver_str(struct ib_device *device, char *str,
+				  size_t str_len)
+{
+	struct pvrdma_dev *dev =
+		container_of(device, struct pvrdma_dev, ib_dev);
+	snprintf(str, str_len, "%d.%d.%d\n",
+		 (int) (dev->dsr->caps.fw_ver >> 32),
+		 (int) (dev->dsr->caps.fw_ver >> 16) & 0xffff,
+		 (int) dev->dsr->caps.fw_ver & 0xffff);
+}
+
+static int pvrdma_init_device(struct pvrdma_dev *dev)
+{
+	/*  Initialize some device related stuff */
+	spin_lock_init(&dev->cmd_lock);
+	sema_init(&dev->cmd_sema, 1);
+	atomic_set(&dev->num_qps, 0);
+	atomic_set(&dev->num_cqs, 0);
+	atomic_set(&dev->num_pds, 0);
+	atomic_set(&dev->num_ahs, 0);
+
+	return 0;
+}
+
+static int pvrdma_port_immutable(struct ib_device *ibdev, u8 port_num,
+				 struct ib_port_immutable *immutable)
+{
+	struct ib_port_attr attr;
+	int err;
+
+	err = pvrdma_query_port(ibdev, port_num, &attr);
+	if (err)
+		return err;
+
+	immutable->pkey_tbl_len = attr.pkey_tbl_len;
+	immutable->gid_tbl_len = attr.gid_tbl_len;
+	immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE;
+	immutable->max_mad_size = IB_MGMT_MAD_SIZE;
+	return 0;
+}
+
+static struct net_device *pvrdma_get_netdev(struct ib_device *ibdev,
+					    u8 port_num)
+{
+	struct net_device *netdev;
+	struct pvrdma_dev *dev = to_vdev(ibdev);
+
+	if (port_num != 1)
+		return NULL;
+
+	rcu_read_lock();
+	netdev = dev->netdev;
+	if (netdev)
+		dev_hold(netdev);
+	rcu_read_unlock();
+
+	return netdev;
+}
+
+static int pvrdma_register_device(struct pvrdma_dev *dev)
+{
+	int ret = -1;
+	int i = 0;
+
+	strlcpy(dev->ib_dev.name, "pvrdma%d", IB_DEVICE_NAME_MAX);
+	dev->ib_dev.node_guid = dev->dsr->caps.node_guid;
+	dev->sys_image_guid = dev->dsr->caps.sys_image_guid;
+	dev->flags = 0;
+	dev->ib_dev.owner = THIS_MODULE;
+	dev->ib_dev.num_comp_vectors = 1;
+	dev->ib_dev.dma_device = &dev->pdev->dev;
+	dev->ib_dev.uverbs_abi_ver = PVRDMA_UVERBS_ABI_VERSION;
+	dev->ib_dev.uverbs_cmd_mask =
+		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT)		|
+		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE)	|
+		(1ull << IB_USER_VERBS_CMD_QUERY_PORT)		|
+		(1ull << IB_USER_VERBS_CMD_ALLOC_PD)		|
+		(1ull << IB_USER_VERBS_CMD_DEALLOC_PD)		|
+		(1ull << IB_USER_VERBS_CMD_REG_MR)		|
+		(1ull << IB_USER_VERBS_CMD_DEREG_MR)		|
+		(1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL)	|
+		(1ull << IB_USER_VERBS_CMD_CREATE_CQ)		|
+		(1ull << IB_USER_VERBS_CMD_POLL_CQ)		|
+		(1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ)	|
+		(1ull << IB_USER_VERBS_CMD_DESTROY_CQ)		|
+		(1ull << IB_USER_VERBS_CMD_CREATE_QP)		|
+		(1ull << IB_USER_VERBS_CMD_MODIFY_QP)		|
+		(1ull << IB_USER_VERBS_CMD_QUERY_QP)		|
+		(1ull << IB_USER_VERBS_CMD_DESTROY_QP)		|
+		(1ull << IB_USER_VERBS_CMD_POST_SEND)		|
+		(1ull << IB_USER_VERBS_CMD_POST_RECV)		|
+		(1ull << IB_USER_VERBS_CMD_CREATE_AH)		|
+		(1ull << IB_USER_VERBS_CMD_DESTROY_AH);
+
+	dev->ib_dev.node_type = RDMA_NODE_IB_CA;
+	dev->ib_dev.phys_port_cnt = dev->dsr->caps.phys_port_cnt;
+
+	dev->ib_dev.query_device = pvrdma_query_device;
+	dev->ib_dev.query_port = pvrdma_query_port;
+	dev->ib_dev.query_gid = pvrdma_query_gid;
+	dev->ib_dev.query_pkey = pvrdma_query_pkey;
+	dev->ib_dev.modify_port	= pvrdma_modify_port;
+	dev->ib_dev.alloc_ucontext = pvrdma_alloc_ucontext;
+	dev->ib_dev.dealloc_ucontext = pvrdma_dealloc_ucontext;
+	dev->ib_dev.mmap = pvrdma_mmap;
+	dev->ib_dev.alloc_pd = pvrdma_alloc_pd;
+	dev->ib_dev.dealloc_pd = pvrdma_dealloc_pd;
+	dev->ib_dev.create_ah = pvrdma_create_ah;
+	dev->ib_dev.destroy_ah = pvrdma_destroy_ah;
+	dev->ib_dev.create_qp = pvrdma_create_qp;
+	dev->ib_dev.modify_qp = pvrdma_modify_qp;
+	dev->ib_dev.query_qp = pvrdma_query_qp;
+	dev->ib_dev.destroy_qp = pvrdma_destroy_qp;
+	dev->ib_dev.post_send = pvrdma_post_send;
+	dev->ib_dev.post_recv = pvrdma_post_recv;
+	dev->ib_dev.create_cq = pvrdma_create_cq;
+	dev->ib_dev.modify_cq = pvrdma_modify_cq;
+	dev->ib_dev.resize_cq = pvrdma_resize_cq;
+	dev->ib_dev.destroy_cq = pvrdma_destroy_cq;
+	dev->ib_dev.poll_cq = pvrdma_poll_cq;
+	dev->ib_dev.req_notify_cq = pvrdma_req_notify_cq;
+	dev->ib_dev.get_dma_mr = pvrdma_get_dma_mr;
+	dev->ib_dev.reg_user_mr	= pvrdma_reg_user_mr;
+	dev->ib_dev.dereg_mr = pvrdma_dereg_mr;
+	dev->ib_dev.alloc_mr = pvrdma_alloc_mr;
+	dev->ib_dev.map_mr_sg = pvrdma_map_mr_sg;
+	dev->ib_dev.add_gid = pvrdma_add_gid;
+	dev->ib_dev.del_gid = pvrdma_del_gid;
+	dev->ib_dev.get_netdev = pvrdma_get_netdev;
+	dev->ib_dev.get_port_immutable = pvrdma_port_immutable;
+	dev->ib_dev.get_link_layer = pvrdma_port_link_layer;
+	dev->ib_dev.get_dev_fw_str = pvrdma_get_fw_ver_str;
+
+	mutex_init(&dev->port_mutex);
+	spin_lock_init(&dev->desc_lock);
+
+	dev->cq_tbl = kcalloc(dev->dsr->caps.max_cq, sizeof(void *),
+			      GFP_KERNEL);
+	if (!dev->cq_tbl)
+		return ret;
+	spin_lock_init(&dev->cq_tbl_lock);
+
+	dev->qp_tbl = kcalloc(dev->dsr->caps.max_qp, sizeof(void *),
+			      GFP_KERNEL);
+	if (!dev->qp_tbl)
+		goto err_cq_free;
+	spin_lock_init(&dev->qp_tbl_lock);
+
+	ret = ib_register_device(&dev->ib_dev, NULL);
+	if (ret)
+		goto err_qp_free;
+
+	for (i = 0; i < ARRAY_SIZE(pvrdma_class_attributes); ++i) {
+		ret = device_create_file(&dev->ib_dev.dev,
+					 pvrdma_class_attributes[i]);
+		if (ret)
+			goto err_class;
+	}
+
+	dev->ib_active = true;
+
+	return 0;
+
+err_class:
+	ib_unregister_device(&dev->ib_dev);
+err_qp_free:
+	kfree(dev->qp_tbl);
+err_cq_free:
+	kfree(dev->cq_tbl);
+
+	return ret;
+}
+
+static irqreturn_t pvrdma_intr0_handler(int irq, void *dev_id)
+{
+	u32 icr = PVRDMA_INTR_CAUSE_RESPONSE;
+	struct pvrdma_dev *dev = dev_id;
+
+	dev_dbg(&dev->pdev->dev, "interrupt 0 (response) handler\n");
+
+	if (dev->intr.type != PVRDMA_INTR_TYPE_MSIX) {
+		/* Legacy intr */
+		icr = pvrdma_read_reg(dev, PVRDMA_REG_ICR);
+		if (icr == 0)
+			return IRQ_NONE;
+	}
+
+	if (icr == PVRDMA_INTR_CAUSE_RESPONSE)
+		complete(&dev->cmd_done);
+
+	return IRQ_HANDLED;
+}
+
+static void pvrdma_qp_event(struct pvrdma_dev *dev, u32 qpn, int type)
+{
+	struct pvrdma_qp *qp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->qp_tbl_lock, flags);
+	qp = dev->qp_tbl[qpn % dev->dsr->caps.max_qp];
+	if (qp)
+		atomic_inc(&qp->refcnt);
+	spin_unlock_irqrestore(&dev->qp_tbl_lock, flags);
+
+	if (qp && qp->ibqp.event_handler) {
+		struct ib_qp *ibqp = &qp->ibqp;
+		struct ib_event e;
+
+		e.device = ibqp->device;
+		e.element.qp = ibqp;
+		e.event = type; /* 1:1 mapping for now. */
+		ibqp->event_handler(&e, ibqp->qp_context);
+	}
+	if (qp) {
+		atomic_dec(&qp->refcnt);
+		if (atomic_read(&qp->refcnt) == 0)
+			wake_up(&qp->wait);
+	}
+}
+
+static void pvrdma_cq_event(struct pvrdma_dev *dev, u32 cqn, int type)
+{
+	struct pvrdma_cq *cq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->cq_tbl_lock, flags);
+	cq = dev->cq_tbl[cqn % dev->dsr->caps.max_cq];
+	if (cq)
+		atomic_inc(&cq->refcnt);
+	spin_unlock_irqrestore(&dev->cq_tbl_lock, flags);
+
+	if (cq && cq->ibcq.event_handler) {
+		struct ib_cq *ibcq = &cq->ibcq;
+		struct ib_event e;
+
+		e.device = ibcq->device;
+		e.element.cq = ibcq;
+		e.event = type; /* 1:1 mapping for now. */
+		ibcq->event_handler(&e, ibcq->cq_context);
+	}
+	if (cq) {
+		atomic_dec(&cq->refcnt);
+		if (atomic_read(&cq->refcnt) == 0)
+			wake_up(&cq->wait);
+	}
+}
+
+static void pvrdma_dispatch_event(struct pvrdma_dev *dev, int port,
+				  enum ib_event_type event)
+{
+	struct ib_event ib_event;
+
+	memset(&ib_event, 0, sizeof(ib_event));
+	ib_event.device = &dev->ib_dev;
+	ib_event.element.port_num = port;
+	ib_event.event = event;
+	ib_dispatch_event(&ib_event);
+}
+
+static void pvrdma_dev_event(struct pvrdma_dev *dev, u8 port, int type)
+{
+	if (port < 1 || port > dev->dsr->caps.phys_port_cnt) {
+		dev_warn(&dev->pdev->dev, "event on port %d\n", port);
+		return;
+	}
+
+	pvrdma_dispatch_event(dev, port, type);
+}
+
+static inline struct pvrdma_eqe *get_eqe(struct pvrdma_dev *dev, unsigned int i)
+{
+	return (struct pvrdma_eqe *)pvrdma_page_dir_get_ptr(
+					&dev->async_pdir,
+					PAGE_SIZE +
+					sizeof(struct pvrdma_eqe) * i);
+}
+
+static irqreturn_t pvrdma_intr1_handler(int irq, void *dev_id)
+{
+	struct pvrdma_dev *dev = dev_id;
+	struct pvrdma_ring *ring = &dev->async_ring_state->rx;
+	int ring_slots = (dev->dsr->async_ring_pages.num_pages - 1) *
+			 PAGE_SIZE / sizeof(struct pvrdma_eqe);
+	unsigned int head;
+
+	dev_dbg(&dev->pdev->dev, "interrupt 1 (async event) handler\n");
+
+	/*
+	 * Don't process events until the IB device is registered. Otherwise
+	 * we'll try to ib_dispatch_event() on an invalid device.
+	 */
+	if (!dev->ib_active)
+		return IRQ_HANDLED;
+
+	while (pvrdma_idx_ring_has_data(ring, ring_slots, &head) > 0) {
+		struct pvrdma_eqe *eqe;
+
+		eqe = get_eqe(dev, head);
+
+		switch (eqe->type) {
+		case PVRDMA_EVENT_QP_FATAL:
+		case PVRDMA_EVENT_QP_REQ_ERR:
+		case PVRDMA_EVENT_QP_ACCESS_ERR:
+		case PVRDMA_EVENT_COMM_EST:
+		case PVRDMA_EVENT_SQ_DRAINED:
+		case PVRDMA_EVENT_PATH_MIG:
+		case PVRDMA_EVENT_PATH_MIG_ERR:
+		case PVRDMA_EVENT_QP_LAST_WQE_REACHED:
+			pvrdma_qp_event(dev, eqe->info, eqe->type);
+			break;
+
+		case PVRDMA_EVENT_CQ_ERR:
+			pvrdma_cq_event(dev, eqe->info, eqe->type);
+			break;
+
+		case PVRDMA_EVENT_SRQ_ERR:
+		case PVRDMA_EVENT_SRQ_LIMIT_REACHED:
+			break;
+
+		case PVRDMA_EVENT_PORT_ACTIVE:
+		case PVRDMA_EVENT_PORT_ERR:
+		case PVRDMA_EVENT_LID_CHANGE:
+		case PVRDMA_EVENT_PKEY_CHANGE:
+		case PVRDMA_EVENT_SM_CHANGE:
+		case PVRDMA_EVENT_CLIENT_REREGISTER:
+		case PVRDMA_EVENT_GID_CHANGE:
+			pvrdma_dev_event(dev, eqe->info, eqe->type);
+			break;
+
+		case PVRDMA_EVENT_DEVICE_FATAL:
+			pvrdma_dev_event(dev, 1, eqe->type);
+			break;
+
+		default:
+			break;
+		}
+
+		pvrdma_idx_ring_inc(&ring->cons_head, ring_slots);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static inline struct pvrdma_cqne *get_cqne(struct pvrdma_dev *dev,
+					   unsigned int i)
+{
+	return (struct pvrdma_cqne *)pvrdma_page_dir_get_ptr(
+					&dev->cq_pdir,
+					PAGE_SIZE +
+					sizeof(struct pvrdma_cqne) * i);
+}
+
+static irqreturn_t pvrdma_intrx_handler(int irq, void *dev_id)
+{
+	struct pvrdma_dev *dev = dev_id;
+	struct pvrdma_ring *ring = &dev->cq_ring_state->rx;
+	int ring_slots = (dev->dsr->cq_ring_pages.num_pages - 1) * PAGE_SIZE /
+			 sizeof(struct pvrdma_cqne);
+	unsigned int head;
+	unsigned long flags;
+
+	dev_dbg(&dev->pdev->dev, "interrupt x (completion) handler\n");
+
+	while (pvrdma_idx_ring_has_data(ring, ring_slots, &head) > 0) {
+		struct pvrdma_cqne *cqne;
+		struct pvrdma_cq *cq;
+
+		cqne = get_cqne(dev, head);
+		spin_lock_irqsave(&dev->cq_tbl_lock, flags);
+		cq = dev->cq_tbl[cqne->info % dev->dsr->caps.max_cq];
+		if (cq)
+			atomic_inc(&cq->refcnt);
+		spin_unlock_irqrestore(&dev->cq_tbl_lock, flags);
+
+		if (cq && cq->ibcq.comp_handler)
+			cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+		if (cq) {
+			atomic_dec(&cq->refcnt);
+			if (atomic_read(&cq->refcnt))
+				wake_up(&cq->wait);
+		}
+		pvrdma_idx_ring_inc(&ring->cons_head, ring_slots);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static void pvrdma_disable_msi_all(struct pvrdma_dev *dev)
+{
+	if (dev->intr.type == PVRDMA_INTR_TYPE_MSIX)
+		pci_disable_msix(dev->pdev);
+	else if (dev->intr.type == PVRDMA_INTR_TYPE_MSI)
+		pci_disable_msi(dev->pdev);
+}
+
+static void pvrdma_free_irq(struct pvrdma_dev *dev)
+{
+	int i;
+
+	dev_dbg(&dev->pdev->dev, "freeing interrupts\n");
+
+	if (dev->intr.type == PVRDMA_INTR_TYPE_MSIX) {
+		for (i = 0; i < dev->intr.size; i++) {
+			if (dev->intr.enabled[i]) {
+				free_irq(dev->intr.msix_entry[i].vector, dev);
+				dev->intr.enabled[i] = 0;
+			}
+		}
+	} else if (dev->intr.type == PVRDMA_INTR_TYPE_INTX ||
+		   dev->intr.type == PVRDMA_INTR_TYPE_MSI) {
+		free_irq(dev->pdev->irq, dev);
+	}
+}
+
+static void pvrdma_enable_intrs(struct pvrdma_dev *dev)
+{
+	dev_dbg(&dev->pdev->dev, "enable interrupts\n");
+	pvrdma_write_reg(dev, PVRDMA_REG_IMR, 0);
+}
+
+static void pvrdma_disable_intrs(struct pvrdma_dev *dev)
+{
+	dev_dbg(&dev->pdev->dev, "disable interrupts\n");
+	pvrdma_write_reg(dev, PVRDMA_REG_IMR, ~0);
+}
+
+static int pvrdma_enable_msix(struct pci_dev *pdev, struct pvrdma_dev *dev)
+{
+	int i;
+	int ret;
+
+	for (i = 0; i < PVRDMA_MAX_INTERRUPTS; i++) {
+		dev->intr.msix_entry[i].entry = i;
+		dev->intr.msix_entry[i].vector = i;
+
+		switch (i) {
+		case 0:
+			/* CMD ring handler */
+			dev->intr.handler[i] = pvrdma_intr0_handler;
+			break;
+		case 1:
+			/* Async event ring handler */
+			dev->intr.handler[i] = pvrdma_intr1_handler;
+			break;
+		default:
+			/* Completion queue handler */
+			dev->intr.handler[i] = pvrdma_intrx_handler;
+			break;
+		}
+	}
+
+	ret = pci_enable_msix(pdev, dev->intr.msix_entry,
+			      PVRDMA_MAX_INTERRUPTS);
+	if (!ret) {
+		dev->intr.type = PVRDMA_INTR_TYPE_MSIX;
+		dev->intr.size = PVRDMA_MAX_INTERRUPTS;
+	} else if (ret > 0) {
+		ret = pci_enable_msix(pdev, dev->intr.msix_entry, ret);
+		if (!ret) {
+			dev->intr.type = PVRDMA_INTR_TYPE_MSIX;
+			dev->intr.size = ret;
+		} else {
+			dev->intr.size = 0;
+		}
+	}
+
+	dev_dbg(&pdev->dev, "using interrupt type %d, size %d\n",
+		dev->intr.type, dev->intr.size);
+
+	return ret;
+}
+
+static int pvrdma_alloc_intrs(struct pvrdma_dev *dev)
+{
+	int ret = 0;
+	int i;
+
+	if (pci_find_capability(dev->pdev, PCI_CAP_ID_MSIX) &&
+	    pvrdma_enable_msix(dev->pdev, dev)) {
+		/* Try MSI */
+		ret = pci_enable_msi(dev->pdev);
+		if (!ret) {
+			dev->intr.type = PVRDMA_INTR_TYPE_MSI;
+		} else {
+			/* Legacy INTR */
+			dev->intr.type = PVRDMA_INTR_TYPE_INTX;
+		}
+	}
+
+	/* Request First IRQ */
+	switch (dev->intr.type) {
+	case PVRDMA_INTR_TYPE_INTX:
+	case PVRDMA_INTR_TYPE_MSI:
+		ret = request_irq(dev->pdev->irq, pvrdma_intr0_handler,
+				  IRQF_SHARED, DRV_NAME, dev);
+		if (ret) {
+			dev_err(&dev->pdev->dev,
+				"failed to request interrupt\n");
+			goto disable_msi;
+		}
+		break;
+	case PVRDMA_INTR_TYPE_MSIX:
+		ret = request_irq(dev->intr.msix_entry[0].vector,
+				  pvrdma_intr0_handler, 0, DRV_NAME, dev);
+		if (ret) {
+			dev_err(&dev->pdev->dev,
+				"failed to request interrupt 0\n");
+			goto disable_msi;
+		}
+		dev->intr.enabled[0] = 1;
+		break;
+	default:
+		/* Not reached */
+		break;
+	}
+
+	/* For MSIX: request intr for each vector */
+	if (dev->intr.size > 1) {
+		ret = request_irq(dev->intr.msix_entry[1].vector,
+				  pvrdma_intr1_handler, 0, DRV_NAME, dev);
+		if (ret) {
+			dev_err(&dev->pdev->dev,
+				"failed to request interrupt 1\n");
+			goto free_irq;
+		}
+		dev->intr.enabled[1] = 1;
+
+		for (i = 2; i < dev->intr.size; i++) {
+			ret = request_irq(dev->intr.msix_entry[i].vector,
+					  pvrdma_intrx_handler, 0,
+					  DRV_NAME, dev);
+			if (ret) {
+				dev_err(&dev->pdev->dev,
+					"failed to request interrupt %d\n", i);
+				goto free_irq;
+			}
+			dev->intr.enabled[i] = 1;
+		}
+	}
+
+	return 0;
+
+free_irq:
+	pvrdma_free_irq(dev);
+disable_msi:
+	pvrdma_disable_msi_all(dev);
+	return ret;
+}
+
+static void pvrdma_free_slots(struct pvrdma_dev *dev)
+{
+	struct pci_dev *pdev = dev->pdev;
+
+	if (dev->resp_slot)
+		dma_free_coherent(&pdev->dev, PAGE_SIZE, dev->resp_slot,
+				  dev->dsr->resp_slot_dma);
+	if (dev->cmd_slot)
+		dma_free_coherent(&pdev->dev, PAGE_SIZE, dev->cmd_slot,
+				  dev->dsr->cmd_slot_dma);
+}
+
+static int pvrdma_add_gid_at_index(struct pvrdma_dev *dev,
+				   const union ib_gid *gid,
+				   int index)
+{
+	int ret;
+	union pvrdma_cmd_req req;
+	struct pvrdma_cmd_create_bind *cmd_bind = &req.create_bind;
+
+	if (!dev->sgid_tbl) {
+		dev_warn(&dev->pdev->dev, "sgid table not initialized\n");
+		return -EINVAL;
+	}
+
+	memset(cmd_bind, 0, sizeof(*cmd_bind));
+	cmd_bind->hdr.cmd = PVRDMA_CMD_CREATE_BIND;
+	memcpy(cmd_bind->new_gid, gid->raw, 16);
+	cmd_bind->mtu = ib_mtu_enum_to_int(IB_MTU_1024);
+	cmd_bind->vlan = 0xfff;
+	cmd_bind->index = index;
+	cmd_bind->gid_type = PVRDMA_GID_TYPE_FLAG_ROCE_V1;
+
+	ret = pvrdma_cmd_post(dev, &req, NULL, 0);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not create binding, error: %d\n", ret);
+		return -EINVAL;
+	}
+	memcpy(&dev->sgid_tbl[index], gid, sizeof(*gid));
+	return 0;
+}
+
+static int pvrdma_add_gid(struct ib_device *ibdev,
+			  u8 port_num,
+			  unsigned int index,
+			  const union ib_gid *gid,
+			  const struct ib_gid_attr *attr,
+			  void **context)
+{
+	struct pvrdma_dev *dev = to_vdev(ibdev);
+
+	return pvrdma_add_gid_at_index(dev, gid, index);
+}
+
+static int pvrdma_del_gid_at_index(struct pvrdma_dev *dev, int index)
+{
+	int ret;
+	union pvrdma_cmd_req req;
+	struct pvrdma_cmd_destroy_bind *cmd_dest = &req.destroy_bind;
+
+	/* Update sgid table. */
+	if (!dev->sgid_tbl) {
+		dev_warn(&dev->pdev->dev, "sgid table not initialized\n");
+		return -EINVAL;
+	}
+
+	memset(cmd_dest, 0, sizeof(*cmd_dest));
+	cmd_dest->hdr.cmd = PVRDMA_CMD_DESTROY_BIND;
+	memcpy(cmd_dest->dest_gid, &dev->sgid_tbl[index], 16);
+	cmd_dest->index = index;
+
+	ret = pvrdma_cmd_post(dev, &req, NULL, 0);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not destroy binding, error: %d\n", ret);
+		return ret;
+	}
+	memset(&dev->sgid_tbl[index], 0, 16);
+	return 0;
+}
+
+static int pvrdma_del_gid(struct ib_device *ibdev,
+			  u8 port_num,
+			  unsigned int index,
+			  void **context)
+{
+	struct pvrdma_dev *dev = to_vdev(ibdev);
+
+	dev_dbg(&dev->pdev->dev, "removing gid at index %u from %s",
+		index, dev->netdev->name);
+
+	return pvrdma_del_gid_at_index(dev, index);
+}
+
+static void pvrdma_netdevice_event_handle(struct pvrdma_dev *dev,
+					  unsigned long event)
+{
+	switch (event) {
+	case NETDEV_REBOOT:
+	case NETDEV_DOWN:
+		pvrdma_dispatch_event(dev, 1, IB_EVENT_PORT_ERR);
+		break;
+	case NETDEV_UP:
+		pvrdma_dispatch_event(dev, 1, IB_EVENT_PORT_ACTIVE);
+		break;
+	default:
+		dev_dbg(&dev->pdev->dev, "ignore netdevice event %ld on %s\n",
+			event, dev->ib_dev.name);
+		break;
+	}
+}
+
+static void pvrdma_netdevice_event_work(struct work_struct *work)
+{
+	struct pvrdma_netdevice_work *netdev_work;
+	struct pvrdma_dev *dev;
+
+	netdev_work = container_of(work, struct pvrdma_netdevice_work, work);
+
+	mutex_lock(&pvrdma_device_list_lock);
+	list_for_each_entry(dev, &pvrdma_device_list, device_link) {
+		if (dev->netdev == netdev_work->event_netdev) {
+			pvrdma_netdevice_event_handle(dev, netdev_work->event);
+			break;
+		}
+	}
+	mutex_unlock(&pvrdma_device_list_lock);
+
+	kfree(netdev_work);
+}
+
+static int pvrdma_netdevice_event(struct notifier_block *this,
+				  unsigned long event, void *ptr)
+{
+	struct net_device *event_netdev = netdev_notifier_info_to_dev(ptr);
+	struct pvrdma_netdevice_work *netdev_work;
+
+	netdev_work = kmalloc(sizeof(*netdev_work), GFP_ATOMIC);
+	if (!netdev_work)
+		return NOTIFY_BAD;
+
+	INIT_WORK(&netdev_work->work, pvrdma_netdevice_event_work);
+	netdev_work->event_netdev = event_netdev;
+	netdev_work->event = event;
+	queue_work(event_wq, &netdev_work->work);
+
+	return NOTIFY_DONE;
+}
+
+static int pvrdma_pci_probe(struct pci_dev *pdev,
+			    const struct pci_device_id *id)
+{
+	struct pci_dev *pdev_net;
+	struct pvrdma_dev *dev;
+	int ret;
+	unsigned long start;
+	unsigned long len;
+	unsigned int version;
+	dma_addr_t slot_dma = 0;
+
+	dev_dbg(&pdev->dev, "initializing driver %s\n", pci_name(pdev));
+
+	/* Allocate zero-out device */
+	dev = (struct pvrdma_dev *)ib_alloc_device(sizeof(*dev));
+	if (!dev) {
+		dev_err(&pdev->dev, "failed to allocate IB device\n");
+		return -ENOMEM;
+	}
+
+	mutex_lock(&pvrdma_device_list_lock);
+	list_add(&dev->device_link, &pvrdma_device_list);
+	mutex_unlock(&pvrdma_device_list_lock);
+
+	ret = pvrdma_init_device(dev);
+	if (ret)
+		goto err_free_device;
+
+	dev->pdev = pdev;
+	pci_set_drvdata(pdev, dev);
+
+	ret = pci_enable_device(pdev);
+	if (ret) {
+		dev_err(&pdev->dev, "cannot enable PCI device\n");
+		goto err_free_device;
+	}
+
+	/*
+	 * DEBUG: Checking PCI: 0x20200; _MEM; len=0x2000
+	 */
+	dev_dbg(&pdev->dev, "PCI resource flags BAR0 %#lx\n",
+		pci_resource_flags(pdev, 0));
+	dev_dbg(&pdev->dev, "PCI resource len %#llx\n",
+		(unsigned long long)pci_resource_len(pdev, 0));
+	dev_dbg(&pdev->dev, "PCI resource start %#llx\n",
+		(unsigned long long)pci_resource_start(pdev, 0));
+	dev_dbg(&pdev->dev, "PCI resource flags BAR1 %#lx\n",
+		pci_resource_flags(pdev, 1));
+	dev_dbg(&pdev->dev, "PCI resource len %#llx\n",
+		(unsigned long long)pci_resource_len(pdev, 1));
+	dev_dbg(&pdev->dev, "PCI resource start %#llx\n",
+		(unsigned long long)pci_resource_start(pdev, 1));
+
+	if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM) ||
+	    !(pci_resource_flags(pdev, 1) & IORESOURCE_MEM)) {
+		dev_err(&pdev->dev, "PCI BAR region not MMIO\n");
+		ret = -ENOMEM;
+		goto err_free_device;
+	}
+
+	ret = pci_request_regions(pdev, DRV_NAME);
+	if (ret) {
+		dev_err(&pdev->dev, "cannot request PCI resources\n");
+		goto err_disable_pdev;
+	}
+
+	/* Enable 64-Bit DMA */
+	if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
+		ret = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
+		if (ret != 0) {
+			dev_err(&pdev->dev,
+				"pci_set_consistent_dma_mask failed\n");
+			goto err_free_resource;
+		}
+	} else {
+		ret = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+		if (ret != 0) {
+			dev_err(&pdev->dev,
+				"pci_set_dma_mask failed\n");
+			goto err_free_resource;
+		}
+	}
+
+	pci_set_master(pdev);
+
+	/* Map register space */
+	start = pci_resource_start(dev->pdev, PVRDMA_PCI_RESOURCE_REG);
+	len = pci_resource_len(dev->pdev, PVRDMA_PCI_RESOURCE_REG);
+	dev->regs = ioremap(start, len);
+	if (!dev->regs) {
+		dev_err(&pdev->dev, "register mapping failed\n");
+		ret = -ENOMEM;
+		goto err_free_resource;
+	}
+
+	/* Setup per-device UAR. */
+	dev->driver_uar.index = 0;
+	dev->driver_uar.pfn =
+		pci_resource_start(dev->pdev, PVRDMA_PCI_RESOURCE_UAR) >>
+		PAGE_SHIFT;
+	dev->driver_uar.map =
+		ioremap(dev->driver_uar.pfn << PAGE_SHIFT, PAGE_SIZE);
+	if (!dev->driver_uar.map) {
+		dev_err(&pdev->dev, "failed to remap UAR pages\n");
+		ret = -ENOMEM;
+		goto err_unmap_regs;
+	}
+
+	version = pvrdma_read_reg(dev, PVRDMA_REG_VERSION);
+	dev_info(&pdev->dev, "device version %d, driver version %d\n",
+		 version, PVRDMA_VERSION);
+	if (version < PVRDMA_VERSION) {
+		dev_err(&pdev->dev, "incompatible device version\n");
+		goto err_uar_unmap;
+	}
+
+	dev->dsr = dma_alloc_coherent(&pdev->dev, sizeof(*dev->dsr),
+				      &dev->dsrbase, GFP_KERNEL);
+	if (!dev->dsr) {
+		dev_err(&pdev->dev, "failed to allocate shared region\n");
+		ret = -ENOMEM;
+		goto err_uar_unmap;
+	}
+
+	/* Setup the shared region */
+	memset(dev->dsr, 0, sizeof(*dev->dsr));
+	dev->dsr->driver_version = PVRDMA_VERSION;
+	dev->dsr->gos_info.gos_bits = sizeof(void *) == 4 ?
+		PVRDMA_GOS_BITS_32 :
+		PVRDMA_GOS_BITS_64;
+	dev->dsr->gos_info.gos_type = PVRDMA_GOS_TYPE_LINUX;
+	dev->dsr->gos_info.gos_ver = 1;
+	dev->dsr->uar_pfn = dev->driver_uar.pfn;
+
+	/* Command slot. */
+	dev->cmd_slot = dma_alloc_coherent(&pdev->dev, PAGE_SIZE,
+					   &slot_dma, GFP_KERNEL);
+	if (!dev->cmd_slot) {
+		ret = -ENOMEM;
+		goto err_free_dsr;
+	}
+
+	dev->dsr->cmd_slot_dma = (u64)slot_dma;
+
+	/* Response slot. */
+	dev->resp_slot = dma_alloc_coherent(&pdev->dev, PAGE_SIZE,
+					    &slot_dma, GFP_KERNEL);
+	if (!dev->resp_slot) {
+		ret = -ENOMEM;
+		goto err_free_slots;
+	}
+
+	dev->dsr->resp_slot_dma = (u64)slot_dma;
+
+	/* Async event ring */
+	dev->dsr->async_ring_pages.num_pages = 4;
+	ret = pvrdma_page_dir_init(dev, &dev->async_pdir,
+				   dev->dsr->async_ring_pages.num_pages, true);
+	if (ret)
+		goto err_free_slots;
+	dev->async_ring_state = dev->async_pdir.pages[0];
+	dev->dsr->async_ring_pages.pdir_dma = dev->async_pdir.dir_dma;
+
+	/* CQ notification ring */
+	dev->dsr->cq_ring_pages.num_pages = 4;
+	ret = pvrdma_page_dir_init(dev, &dev->cq_pdir,
+				   dev->dsr->cq_ring_pages.num_pages, true);
+	if (ret)
+		goto err_free_async_ring;
+	dev->cq_ring_state = dev->cq_pdir.pages[0];
+	dev->dsr->cq_ring_pages.pdir_dma = dev->cq_pdir.dir_dma;
+
+	/*
+	 * Write the PA of the shared region to the device. The writes must be
+	 * ordered such that the high bits are written last. When the writes
+	 * complete, the device will have filled out the capabilities.
+	 */
+
+	pvrdma_write_reg(dev, PVRDMA_REG_DSRLOW, (u32)dev->dsrbase);
+	pvrdma_write_reg(dev, PVRDMA_REG_DSRHIGH,
+			 (u32)((u64)(dev->dsrbase) >> 32));
+
+	/* Make sure the write is complete before reading status. */
+	mb();
+
+	/* Currently, the driver only supports RoCE mode. */
+	if (dev->dsr->caps.mode != PVRDMA_DEVICE_MODE_ROCE) {
+		dev_err(&pdev->dev, "unsupported transport %d\n",
+			dev->dsr->caps.mode);
+		ret = -EINVAL;
+		goto err_free_cq_ring;
+	}
+
+	/* Currently, the driver only supports RoCE V1. */
+	if (!(dev->dsr->caps.gid_types & PVRDMA_GID_TYPE_FLAG_ROCE_V1)) {
+		dev_err(&pdev->dev, "driver needs RoCE v1 support\n");
+		ret = -EINVAL;
+		goto err_free_cq_ring;
+	}
+
+	/* Paired vmxnet3 will have same bus, slot. But func will be 0 */
+	pdev_net = pci_get_slot(pdev->bus, PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+	if (!pdev_net) {
+		dev_err(&pdev->dev, "failed to find paired net device\n");
+		ret = -ENODEV;
+		goto err_free_cq_ring;
+	}
+
+	if (pdev_net->vendor != PCI_VENDOR_ID_VMWARE ||
+	    pdev_net->device != PCI_DEVICE_ID_VMWARE_VMXNET3) {
+		dev_err(&pdev->dev, "failed to find paired vmxnet3 device\n");
+		pci_dev_put(pdev_net);
+		ret = -ENODEV;
+		goto err_free_cq_ring;
+	}
+
+	dev->netdev = pci_get_drvdata(pdev_net);
+	pci_dev_put(pdev_net);
+	if (!dev->netdev) {
+		dev_err(&pdev->dev, "failed to get vmxnet3 device\n");
+		ret = -ENODEV;
+		goto err_free_cq_ring;
+	}
+
+	dev_info(&pdev->dev, "paired device to %s\n", dev->netdev->name);
+
+	/* Interrupt setup */
+	ret = pvrdma_alloc_intrs(dev);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to allocate interrupts\n");
+		ret = -ENOMEM;
+		goto err_netdevice;
+	}
+
+	/* Allocate UAR table. */
+	ret = pvrdma_uar_table_init(dev);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to allocate UAR table\n");
+		ret = -ENOMEM;
+		goto err_free_intrs;
+	}
+
+	/* Allocate GID table */
+	dev->sgid_tbl = kcalloc(dev->dsr->caps.gid_tbl_len,
+				sizeof(union ib_gid), GFP_KERNEL);
+	if (!dev->sgid_tbl) {
+		ret = -ENOMEM;
+		goto err_free_uar_table;
+	}
+	dev_dbg(&pdev->dev, "gid table len %d\n", dev->dsr->caps.gid_tbl_len);
+
+	pvrdma_enable_intrs(dev);
+
+	/* Activate pvrdma device */
+	pvrdma_write_reg(dev, PVRDMA_REG_CTL, PVRDMA_DEVICE_CTL_ACTIVATE);
+
+	/* Make sure the write is complete before reading status. */
+	mb();
+
+	/* Check if device was successfully activated */
+	ret = pvrdma_read_reg(dev, PVRDMA_REG_ERR);
+	if (ret != 0) {
+		dev_err(&pdev->dev, "failed to activate device\n");
+		ret = -EINVAL;
+		goto err_disable_intr;
+	}
+
+	/* Register IB device */
+	ret = pvrdma_register_device(dev);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to register IB device\n");
+		goto err_disable_intr;
+	}
+
+	dev->nb_netdev.notifier_call = pvrdma_netdevice_event;
+	ret = register_netdevice_notifier(&dev->nb_netdev);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to register netdevice events\n");
+		goto err_unreg_ibdev;
+	}
+
+	dev_info(&pdev->dev, "attached to device\n");
+	return 0;
+
+err_unreg_ibdev:
+	ib_unregister_device(&dev->ib_dev);
+err_disable_intr:
+	pvrdma_disable_intrs(dev);
+	kfree(dev->sgid_tbl);
+err_free_uar_table:
+	pvrdma_uar_table_cleanup(dev);
+err_free_intrs:
+	pvrdma_free_irq(dev);
+	pvrdma_disable_msi_all(dev);
+err_netdevice:
+	unregister_netdevice_notifier(&dev->nb_netdev);
+err_free_cq_ring:
+	pvrdma_page_dir_cleanup(dev, &dev->cq_pdir);
+err_free_async_ring:
+	pvrdma_page_dir_cleanup(dev, &dev->async_pdir);
+err_free_slots:
+	pvrdma_free_slots(dev);
+err_free_dsr:
+	dma_free_coherent(&pdev->dev, sizeof(*dev->dsr), dev->dsr,
+			  dev->dsrbase);
+err_uar_unmap:
+	iounmap(dev->driver_uar.map);
+err_unmap_regs:
+	iounmap(dev->regs);
+err_free_resource:
+	pci_release_regions(pdev);
+err_disable_pdev:
+	pci_disable_device(pdev);
+	pci_set_drvdata(pdev, NULL);
+err_free_device:
+	mutex_lock(&pvrdma_device_list_lock);
+	list_del(&dev->device_link);
+	mutex_unlock(&pvrdma_device_list_lock);
+	ib_dealloc_device(&dev->ib_dev);
+	return ret;
+}
+
+static void pvrdma_pci_remove(struct pci_dev *pdev)
+{
+	struct pvrdma_dev *dev = pci_get_drvdata(pdev);
+
+	if (!dev)
+		return;
+
+	dev_info(&pdev->dev, "detaching from device\n");
+
+	unregister_netdevice_notifier(&dev->nb_netdev);
+	dev->nb_netdev.notifier_call = NULL;
+
+	flush_workqueue(event_wq);
+
+	/* Unregister ib device */
+	ib_unregister_device(&dev->ib_dev);
+
+	mutex_lock(&pvrdma_device_list_lock);
+	list_del(&dev->device_link);
+	mutex_unlock(&pvrdma_device_list_lock);
+
+	pvrdma_disable_intrs(dev);
+	pvrdma_free_irq(dev);
+	pvrdma_disable_msi_all(dev);
+
+	/* Deactivate pvrdma device */
+	pvrdma_write_reg(dev, PVRDMA_REG_CTL, PVRDMA_DEVICE_CTL_RESET);
+	pvrdma_page_dir_cleanup(dev, &dev->cq_pdir);
+	pvrdma_page_dir_cleanup(dev, &dev->async_pdir);
+	pvrdma_free_slots(dev);
+
+	iounmap(dev->regs);
+	kfree(dev->sgid_tbl);
+	kfree(dev->cq_tbl);
+	kfree(dev->qp_tbl);
+	pvrdma_uar_table_cleanup(dev);
+	iounmap(dev->driver_uar.map);
+
+	ib_dealloc_device(&dev->ib_dev);
+
+	/* Free pci resources */
+	pci_release_regions(pdev);
+	pci_disable_device(pdev);
+	pci_set_drvdata(pdev, NULL);
+}
+
+static struct pci_device_id pvrdma_pci_table[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_VMWARE, PCI_DEVICE_ID_VMWARE_PVRDMA), },
+	{ 0 },
+};
+
+MODULE_DEVICE_TABLE(pci, pvrdma_pci_table);
+
+static struct pci_driver pvrdma_driver = {
+	.name		= DRV_NAME,
+	.id_table	= pvrdma_pci_table,
+	.probe		= pvrdma_pci_probe,
+	.remove		= pvrdma_pci_remove,
+};
+
+static int __init pvrdma_init(void)
+{
+	int err;
+
+	event_wq = alloc_ordered_workqueue("pvrdma_event_wq", WQ_MEM_RECLAIM);
+	if (!event_wq)
+		return -ENOMEM;
+
+	err = pci_register_driver(&pvrdma_driver);
+	if (err)
+		destroy_workqueue(event_wq);
+
+	return err;
+}
+
+static void __exit pvrdma_cleanup(void)
+{
+	pci_unregister_driver(&pvrdma_driver);
+
+	destroy_workqueue(event_wq);
+}
+
+module_init(pvrdma_init);
+module_exit(pvrdma_cleanup);
+
+MODULE_AUTHOR("VMware, Inc");
+MODULE_DESCRIPTION("VMware Virtual RDMA/InfiniBand driver");
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE("Dual BSD/GPL");
-- 
2.7.4

^ permalink raw reply related

* [PATCH v5 03/16] IB/pvrdma: Add virtual device RDMA structures
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan
In-Reply-To: <cover.1474759181.git.aditr@vmware.com>

This patch adds the various Verbs structures that we support in the
virtual RDMA device. We have re-mapped the ones from the RDMA core stack
to make sure we can maintain compatibility with our backend.

Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
---
Changes v4->v5:
 - Removed __ prefix for unsigned vars.

Changes v3->v4:
 - Moved the pvrdma_sge struct to pvrdma_uapi.h.
---
 drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h | 444 +++++++++++++++++++++++++
 1 file changed, 444 insertions(+)
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h

diff --git a/drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h b/drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h
new file mode 100644
index 0000000..290b6d8
--- /dev/null
+++ b/drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h
@@ -0,0 +1,444 @@
+/*
+ * [PLEASE NOTE:  VMWARE, INC. ELECTS TO USE AND DISTRIBUTE THIS COMPONENT
+ * UNDER THE TERMS OF THE OpenIB.org BSD license.  THE ORIGINAL LICENSE TERMS
+ * ARE REPRODUCED BELOW ONLY AS A REFERENCE.]
+ *
+ * Copyright (c) 2004 Mellanox Technologies Ltd.  All rights reserved.
+ * Copyright (c) 2004 Infinicon Corporation.  All rights reserved.
+ * Copyright (c) 2004 Intel Corporation.  All rights reserved.
+ * Copyright (c) 2004 Topspin Corporation.  All rights reserved.
+ * Copyright (c) 2004 Voltaire Corporation.  All rights reserved.
+ * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2005, 2006, 2007 Cisco Systems.  All rights reserved.
+ * Copyright (c) 2015-2016 VMware, Inc.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef __PVRDMA_IB_VERBS_H__
+#define __PVRDMA_IB_VERBS_H__
+
+#include <linux/types.h>
+
+union pvrdma_gid {
+	u8	raw[16];
+	struct {
+		__be64	subnet_prefix;
+		__be64	interface_id;
+	} global;
+};
+
+enum pvrdma_link_layer {
+	PVRDMA_LINK_LAYER_UNSPECIFIED,
+	PVRDMA_LINK_LAYER_INFINIBAND,
+	PVRDMA_LINK_LAYER_ETHERNET,
+};
+
+enum pvrdma_mtu {
+	PVRDMA_MTU_256  = 1,
+	PVRDMA_MTU_512  = 2,
+	PVRDMA_MTU_1024 = 3,
+	PVRDMA_MTU_2048 = 4,
+	PVRDMA_MTU_4096 = 5,
+};
+
+static inline int pvrdma_mtu_enum_to_int(enum pvrdma_mtu mtu)
+{
+	switch (mtu) {
+	case PVRDMA_MTU_256:	return  256;
+	case PVRDMA_MTU_512:	return  512;
+	case PVRDMA_MTU_1024:	return 1024;
+	case PVRDMA_MTU_2048:	return 2048;
+	case PVRDMA_MTU_4096:	return 4096;
+	default:		return   -1;
+	}
+}
+
+static inline enum pvrdma_mtu pvrdma_mtu_int_to_enum(int mtu)
+{
+	switch (mtu) {
+	case 256:	return PVRDMA_MTU_256;
+	case 512:	return PVRDMA_MTU_512;
+	case 1024:	return PVRDMA_MTU_1024;
+	case 2048:	return PVRDMA_MTU_2048;
+	case 4096:
+	default:	return PVRDMA_MTU_4096;
+	}
+}
+
+enum pvrdma_port_state {
+	PVRDMA_PORT_NOP			= 0,
+	PVRDMA_PORT_DOWN		= 1,
+	PVRDMA_PORT_INIT		= 2,
+	PVRDMA_PORT_ARMED		= 3,
+	PVRDMA_PORT_ACTIVE		= 4,
+	PVRDMA_PORT_ACTIVE_DEFER	= 5,
+};
+
+enum pvrdma_port_cap_flags {
+	PVRDMA_PORT_SM				= 1 <<  1,
+	PVRDMA_PORT_NOTICE_SUP			= 1 <<  2,
+	PVRDMA_PORT_TRAP_SUP			= 1 <<  3,
+	PVRDMA_PORT_OPT_IPD_SUP			= 1 <<  4,
+	PVRDMA_PORT_AUTO_MIGR_SUP		= 1 <<  5,
+	PVRDMA_PORT_SL_MAP_SUP			= 1 <<  6,
+	PVRDMA_PORT_MKEY_NVRAM			= 1 <<  7,
+	PVRDMA_PORT_PKEY_NVRAM			= 1 <<  8,
+	PVRDMA_PORT_LED_INFO_SUP		= 1 <<  9,
+	PVRDMA_PORT_SM_DISABLED			= 1 << 10,
+	PVRDMA_PORT_SYS_IMAGE_GUID_SUP		= 1 << 11,
+	PVRDMA_PORT_PKEY_SW_EXT_PORT_TRAP_SUP	= 1 << 12,
+	PVRDMA_PORT_EXTENDED_SPEEDS_SUP		= 1 << 14,
+	PVRDMA_PORT_CM_SUP			= 1 << 16,
+	PVRDMA_PORT_SNMP_TUNNEL_SUP		= 1 << 17,
+	PVRDMA_PORT_REINIT_SUP			= 1 << 18,
+	PVRDMA_PORT_DEVICE_MGMT_SUP		= 1 << 19,
+	PVRDMA_PORT_VENDOR_CLASS_SUP		= 1 << 20,
+	PVRDMA_PORT_DR_NOTICE_SUP		= 1 << 21,
+	PVRDMA_PORT_CAP_MASK_NOTICE_SUP		= 1 << 22,
+	PVRDMA_PORT_BOOT_MGMT_SUP		= 1 << 23,
+	PVRDMA_PORT_LINK_LATENCY_SUP		= 1 << 24,
+	PVRDMA_PORT_CLIENT_REG_SUP		= 1 << 25,
+	PVRDMA_PORT_IP_BASED_GIDS		= 1 << 26,
+	PVRDMA_PORT_CAP_FLAGS_MAX		= PVRDMA_PORT_IP_BASED_GIDS,
+};
+
+enum pvrdma_port_width {
+	PVRDMA_WIDTH_1X		= 1,
+	PVRDMA_WIDTH_4X		= 2,
+	PVRDMA_WIDTH_8X		= 4,
+	PVRDMA_WIDTH_12X	= 8,
+};
+
+static inline int pvrdma_width_enum_to_int(enum pvrdma_port_width width)
+{
+	switch (width) {
+	case PVRDMA_WIDTH_1X:	return  1;
+	case PVRDMA_WIDTH_4X:	return  4;
+	case PVRDMA_WIDTH_8X:	return  8;
+	case PVRDMA_WIDTH_12X:	return 12;
+	default:		return -1;
+	}
+}
+
+enum pvrdma_port_speed {
+	PVRDMA_SPEED_SDR	= 1,
+	PVRDMA_SPEED_DDR	= 2,
+	PVRDMA_SPEED_QDR	= 4,
+	PVRDMA_SPEED_FDR10	= 8,
+	PVRDMA_SPEED_FDR	= 16,
+	PVRDMA_SPEED_EDR	= 32,
+};
+
+struct pvrdma_port_attr {
+	enum pvrdma_port_state	state;
+	enum pvrdma_mtu		max_mtu;
+	enum pvrdma_mtu		active_mtu;
+	u32			gid_tbl_len;
+	u32			port_cap_flags;
+	u32			max_msg_sz;
+	u32			bad_pkey_cntr;
+	u32			qkey_viol_cntr;
+	u16			pkey_tbl_len;
+	u16			lid;
+	u16			sm_lid;
+	u8			lmc;
+	u8			max_vl_num;
+	u8			sm_sl;
+	u8			subnet_timeout;
+	u8			init_type_reply;
+	u8			active_width;
+	u8			active_speed;
+	u8			phys_state;
+	u8			reserved[2];
+};
+
+struct pvrdma_global_route {
+	union pvrdma_gid	dgid;
+	u32			flow_label;
+	u8			sgid_index;
+	u8			hop_limit;
+	u8			traffic_class;
+	u8			reserved;
+};
+
+struct pvrdma_grh {
+	__be32			version_tclass_flow;
+	__be16			paylen;
+	u8			next_hdr;
+	u8			hop_limit;
+	union pvrdma_gid	sgid;
+	union pvrdma_gid	dgid;
+};
+
+enum pvrdma_ah_flags {
+	PVRDMA_AH_GRH = 1,
+};
+
+enum pvrdma_rate {
+	PVRDMA_RATE_PORT_CURRENT	= 0,
+	PVRDMA_RATE_2_5_GBPS		= 2,
+	PVRDMA_RATE_5_GBPS		= 5,
+	PVRDMA_RATE_10_GBPS		= 3,
+	PVRDMA_RATE_20_GBPS		= 6,
+	PVRDMA_RATE_30_GBPS		= 4,
+	PVRDMA_RATE_40_GBPS		= 7,
+	PVRDMA_RATE_60_GBPS		= 8,
+	PVRDMA_RATE_80_GBPS		= 9,
+	PVRDMA_RATE_120_GBPS		= 10,
+	PVRDMA_RATE_14_GBPS		= 11,
+	PVRDMA_RATE_56_GBPS		= 12,
+	PVRDMA_RATE_112_GBPS		= 13,
+	PVRDMA_RATE_168_GBPS		= 14,
+	PVRDMA_RATE_25_GBPS		= 15,
+	PVRDMA_RATE_100_GBPS		= 16,
+	PVRDMA_RATE_200_GBPS		= 17,
+	PVRDMA_RATE_300_GBPS		= 18,
+};
+
+struct pvrdma_ah_attr {
+	struct pvrdma_global_route	grh;
+	u16				dlid;
+	u16				vlan_id;
+	u8				sl;
+	u8				src_path_bits;
+	u8				static_rate;
+	u8				ah_flags;
+	u8				port_num;
+	u8				dmac[6];
+	u8				reserved;
+};
+
+enum pvrdma_wc_status {
+	PVRDMA_WC_SUCCESS,
+	PVRDMA_WC_LOC_LEN_ERR,
+	PVRDMA_WC_LOC_QP_OP_ERR,
+	PVRDMA_WC_LOC_EEC_OP_ERR,
+	PVRDMA_WC_LOC_PROT_ERR,
+	PVRDMA_WC_WR_FLUSH_ERR,
+	PVRDMA_WC_MW_BIND_ERR,
+	PVRDMA_WC_BAD_RESP_ERR,
+	PVRDMA_WC_LOC_ACCESS_ERR,
+	PVRDMA_WC_REM_INV_REQ_ERR,
+	PVRDMA_WC_REM_ACCESS_ERR,
+	PVRDMA_WC_REM_OP_ERR,
+	PVRDMA_WC_RETRY_EXC_ERR,
+	PVRDMA_WC_RNR_RETRY_EXC_ERR,
+	PVRDMA_WC_LOC_RDD_VIOL_ERR,
+	PVRDMA_WC_REM_INV_RD_REQ_ERR,
+	PVRDMA_WC_REM_ABORT_ERR,
+	PVRDMA_WC_INV_EECN_ERR,
+	PVRDMA_WC_INV_EEC_STATE_ERR,
+	PVRDMA_WC_FATAL_ERR,
+	PVRDMA_WC_RESP_TIMEOUT_ERR,
+	PVRDMA_WC_GENERAL_ERR,
+};
+
+enum pvrdma_wc_opcode {
+	PVRDMA_WC_SEND,
+	PVRDMA_WC_RDMA_WRITE,
+	PVRDMA_WC_RDMA_READ,
+	PVRDMA_WC_COMP_SWAP,
+	PVRDMA_WC_FETCH_ADD,
+	PVRDMA_WC_BIND_MW,
+	PVRDMA_WC_LSO,
+	PVRDMA_WC_LOCAL_INV,
+	PVRDMA_WC_FAST_REG_MR,
+	PVRDMA_WC_MASKED_COMP_SWAP,
+	PVRDMA_WC_MASKED_FETCH_ADD,
+	PVRDMA_WC_RECV = 1 << 7,
+	PVRDMA_WC_RECV_RDMA_WITH_IMM,
+};
+
+enum pvrdma_wc_flags {
+	PVRDMA_WC_GRH			= 1 << 0,
+	PVRDMA_WC_WITH_IMM		= 1 << 1,
+	PVRDMA_WC_WITH_INVALIDATE	= 1 << 2,
+	PVRDMA_WC_IP_CSUM_OK		= 1 << 3,
+	PVRDMA_WC_WITH_SMAC		= 1 << 4,
+	PVRDMA_WC_WITH_VLAN		= 1 << 5,
+	PVRDMA_WC_FLAGS_MAX		= PVRDMA_WC_WITH_VLAN,
+};
+
+enum pvrdma_cq_notify_flags {
+	PVRDMA_CQ_SOLICITED		= 1 << 0,
+	PVRDMA_CQ_NEXT_COMP		= 1 << 1,
+	PVRDMA_CQ_SOLICITED_MASK	= PVRDMA_CQ_SOLICITED |
+					  PVRDMA_CQ_NEXT_COMP,
+	PVRDMA_CQ_REPORT_MISSED_EVENTS	= 1 << 2,
+};
+
+struct pvrdma_qp_cap {
+	u32	max_send_wr;
+	u32	max_recv_wr;
+	u32	max_send_sge;
+	u32	max_recv_sge;
+	u32	max_inline_data;
+	u32	reserved;
+};
+
+enum pvrdma_sig_type {
+	PVRDMA_SIGNAL_ALL_WR,
+	PVRDMA_SIGNAL_REQ_WR,
+};
+
+enum pvrdma_qp_type {
+	PVRDMA_QPT_SMI,
+	PVRDMA_QPT_GSI,
+	PVRDMA_QPT_RC,
+	PVRDMA_QPT_UC,
+	PVRDMA_QPT_UD,
+	PVRDMA_QPT_RAW_IPV6,
+	PVRDMA_QPT_RAW_ETHERTYPE,
+	PVRDMA_QPT_RAW_PACKET = 8,
+	PVRDMA_QPT_XRC_INI = 9,
+	PVRDMA_QPT_XRC_TGT,
+	PVRDMA_QPT_MAX,
+};
+
+enum pvrdma_qp_create_flags {
+	PVRDMA_QP_CREATE_IPOPVRDMA_UD_LSO		= 1 << 0,
+	PVRDMA_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1,
+};
+
+enum pvrdma_qp_attr_mask {
+	PVRDMA_QP_STATE			= 1 << 0,
+	PVRDMA_QP_CUR_STATE		= 1 << 1,
+	PVRDMA_QP_EN_SQD_ASYNC_NOTIFY	= 1 << 2,
+	PVRDMA_QP_ACCESS_FLAGS		= 1 << 3,
+	PVRDMA_QP_PKEY_INDEX		= 1 << 4,
+	PVRDMA_QP_PORT			= 1 << 5,
+	PVRDMA_QP_QKEY			= 1 << 6,
+	PVRDMA_QP_AV			= 1 << 7,
+	PVRDMA_QP_PATH_MTU		= 1 << 8,
+	PVRDMA_QP_TIMEOUT		= 1 << 9,
+	PVRDMA_QP_RETRY_CNT		= 1 << 10,
+	PVRDMA_QP_RNR_RETRY		= 1 << 11,
+	PVRDMA_QP_RQ_PSN		= 1 << 12,
+	PVRDMA_QP_MAX_QP_RD_ATOMIC	= 1 << 13,
+	PVRDMA_QP_ALT_PATH		= 1 << 14,
+	PVRDMA_QP_MIN_RNR_TIMER		= 1 << 15,
+	PVRDMA_QP_SQ_PSN		= 1 << 16,
+	PVRDMA_QP_MAX_DEST_RD_ATOMIC	= 1 << 17,
+	PVRDMA_QP_PATH_MIG_STATE	= 1 << 18,
+	PVRDMA_QP_CAP			= 1 << 19,
+	PVRDMA_QP_DEST_QPN		= 1 << 20,
+	PVRDMA_QP_ATTR_MASK_MAX		= PVRDMA_QP_DEST_QPN,
+};
+
+enum pvrdma_qp_state {
+	PVRDMA_QPS_RESET,
+	PVRDMA_QPS_INIT,
+	PVRDMA_QPS_RTR,
+	PVRDMA_QPS_RTS,
+	PVRDMA_QPS_SQD,
+	PVRDMA_QPS_SQE,
+	PVRDMA_QPS_ERR,
+};
+
+enum pvrdma_mig_state {
+	PVRDMA_MIG_MIGRATED,
+	PVRDMA_MIG_REARM,
+	PVRDMA_MIG_ARMED,
+};
+
+enum pvrdma_mw_type {
+	PVRDMA_MW_TYPE_1 = 1,
+	PVRDMA_MW_TYPE_2 = 2,
+};
+
+struct pvrdma_qp_attr {
+	enum pvrdma_qp_state	qp_state;
+	enum pvrdma_qp_state	cur_qp_state;
+	enum pvrdma_mtu		path_mtu;
+	enum pvrdma_mig_state	path_mig_state;
+	u32			qkey;
+	u32			rq_psn;
+	u32			sq_psn;
+	u32			dest_qp_num;
+	u32			qp_access_flags;
+	u16			pkey_index;
+	u16			alt_pkey_index;
+	u8			en_sqd_async_notify;
+	u8			sq_draining;
+	u8			max_rd_atomic;
+	u8			max_dest_rd_atomic;
+	u8			min_rnr_timer;
+	u8			port_num;
+	u8			timeout;
+	u8			retry_cnt;
+	u8			rnr_retry;
+	u8			alt_port_num;
+	u8			alt_timeout;
+	u8			reserved[5];
+	struct pvrdma_qp_cap	cap;
+	struct pvrdma_ah_attr	ah_attr;
+	struct pvrdma_ah_attr	alt_ah_attr;
+};
+
+enum pvrdma_wr_opcode {
+	PVRDMA_WR_RDMA_WRITE,
+	PVRDMA_WR_RDMA_WRITE_WITH_IMM,
+	PVRDMA_WR_SEND,
+	PVRDMA_WR_SEND_WITH_IMM,
+	PVRDMA_WR_RDMA_READ,
+	PVRDMA_WR_ATOMIC_CMP_AND_SWP,
+	PVRDMA_WR_ATOMIC_FETCH_AND_ADD,
+	PVRDMA_WR_LSO,
+	PVRDMA_WR_SEND_WITH_INV,
+	PVRDMA_WR_RDMA_READ_WITH_INV,
+	PVRDMA_WR_LOCAL_INV,
+	PVRDMA_WR_FAST_REG_MR,
+	PVRDMA_WR_MASKED_ATOMIC_CMP_AND_SWP,
+	PVRDMA_WR_MASKED_ATOMIC_FETCH_AND_ADD,
+	PVRDMA_WR_BIND_MW,
+	PVRDMA_WR_REG_SIG_MR,
+};
+
+enum pvrdma_send_flags {
+	PVRDMA_SEND_FENCE	= 1 << 0,
+	PVRDMA_SEND_SIGNALED	= 1 << 1,
+	PVRDMA_SEND_SOLICITED	= 1 << 2,
+	PVRDMA_SEND_INLINE	= 1 << 3,
+	PVRDMA_SEND_IP_CSUM	= 1 << 4,
+	PVRDMA_SEND_FLAGS_MAX	= PVRDMA_SEND_IP_CSUM,
+};
+
+enum pvrdma_access_flags {
+	PVRDMA_ACCESS_LOCAL_WRITE	= 1 << 0,
+	PVRDMA_ACCESS_REMOTE_WRITE	= 1 << 1,
+	PVRDMA_ACCESS_REMOTE_READ	= 1 << 2,
+	PVRDMA_ACCESS_REMOTE_ATOMIC	= 1 << 3,
+	PVRDMA_ACCESS_MW_BIND		= 1 << 4,
+	PVRDMA_ZERO_BASED		= 1 << 5,
+	PVRDMA_ACCESS_ON_DEMAND		= 1 << 6,
+	PVRDMA_ACCESS_FLAGS_MAX		= PVRDMA_ACCESS_ON_DEMAND,
+};
+
+#endif /* __PVRDMA_IB_VERBS_H__ */
-- 
2.7.4

^ permalink raw reply related

* [PATCH v5 00/16] Add Paravirtual RDMA Driver
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan

Hi Doug, others,

This patch series adds a driver for a paravirtual RDMA device. The device
is developed for VMware's Virtual Machines and allows existing RDMA
applications to continue to use existing Verbs API when deployed in VMs on
ESXi. We recently did a presentation in the OFA Workshop [1] regarding this
device.

Description and RDMA Support
============================
The virtual device is exposed as a dual function PCIe device. One part is
a virtual network device (VMXNet3) which provides networking properties
like MAC, IP addresses to the RDMA part of the device. The networking
properties are used to register GIDs required by RDMA applications to
communicate.

These patches add support and the all required infrastructure for letting
applications use such a device. We support the mandatory Verbs API as well
as the base memory management extensions (Local Inv, Send with Inv and Fast
Register Work Requests). We currently support both Reliable Connected and
Unreliable Datagram QPs but do not support Shared Receive Queues (SRQs).
Also, we support the following types of Work Requests:
 o Send/Receive (with or without Immediate Data)
 o RDMA Write (with or without Immediate Data)
 o RDMA Read
 o Local Invalidate
 o Send with Invalidate
 o Fast Register Work Requests

This version only adds support for version 1 of RoCE. We will add RoCEv2
support in a future patch. We do support registration of both MAC-based and
IP-based GIDs. I have also created a git tree for our user-level driver [2].

Testing
=======
We have tested this internally for various types of Guest OS - Red Hat,
Centos, Ubuntu 12.04/14.04/16.04, Oracle Enterprise Linux, SLES 12
using backported versions of this driver. The tests included several runs
of the performance tests (included with OFED), Intel MPI PingPong benchmark
on OpenMPI, krping for FRWRs. Mellanox has been kind enough to test the
backported version of the driver internally on their hardware using a
VMware provided ESX build. I have also applied and tested this with Doug's
k.o/for-4.9 branch (commit 64278fe). Note, that this patch series should be
applied all together. I split out the commits so that it may be easier to
review.

PVRDMA Resources
================
[1] OFA Workshop Presentation - 
https://openfabrics.org/images/eventpresos/2016presentations/102parardma.pdf
[2] Libpvrdma User-level library - 
http://git.openfabrics.org/?p=~aditr/libpvrdma.git;a=summary
---
Changes v4->v5:
 - PATCH [02/16]
     - Moved pvrdma_uapi.h and pvrdma_user.h into common UAPI folder.
     - Renamed to pvrdma-uapi.h and pvrdma-abi.h respectively.
     - Prefixed unsigned vars with __.
 - PATCH [03/16]
     - Removed __ prefix for unsigned vars.
 - PATCH [04/16]
     - Update include for headers moved to UAPI.
     - Removed __ prefix for unsigned vars.
 - PATCH [05/16]
     - Update include for headers in UAPI folder.
     - Removed setting any properties that are reported by device as 0.
     - Simplified modify_port.
     - PD should be allocated first in kernel then in device.
     - Update to pvrdma_cmd_post for creating/destroying PD, Query port/device.
 - PATCH [06/16]
     - pvrdma_cmd_post takes the response code.
 - PATCH [07/16]
     - Correct var type passed to dma_alloc_coherent.
 - PATCH [08/16]
     - Moved the timeout to pvrdma_cmd_recv.
     - Added additional response code parameter to pvrdma_cmd_post.
 - PATCH [09/16]
     - Updated include for headers in UAPI folder.
     - Changed from EINVAL to ENOMEM if atomic add fails.
     - Added error code if destroy cq command failed.
     - Update to pvrdma_cmd_post for creating/destroying CQ.
 - PATCH [11/16]
     - Check the access flags correctly for DMA MR.
     - Update to pvrdma_cmd_post for creating/destroying MRs.
 - PATCH [12/16]
     - Updated include for headers in UAPI folder.
     - Update to pvrdma_cmd_post for creating/destroying/querying/modifying QPs.
     - Use the pvrdma_sge struct when posting WRs/allocating QP memory.
     - Removed two set but unused variables.
 - PATCH [13/16]
     - Removed two unnecessary lines.
     - Updated include for headers in UAPI folder.
     - Update to pvrdma_cmd_post for add/delete GIDs.
     - Add error code in dev_warn if pvrdma_cmd_post failed.
 - PATCH [16/16]
     - Added pvrdma files to common UAPI folder.

Changes v3->v4:
 - Rebased on for-4.9 branch - commit 64278fe89b729
   ("Merge branch 'hns-roce' into k.o/for-4.9")
 - PATCH [01/16]
     - New in v4 - Moved vmxnet3 id to pci_ids.h
 - PATCH [02,03/16]
     - pvrdma_sge was moved into pvrdma_uapi.h
 - PATCH [04/16]
     - Removed explicit enum values.
 - PATCH [05/16]
     - Renamed priviledged -> privileged.
     - Added error numbers for command errors.
     - Removed unnecessary goto in modify_device.
     - Moved pd allocation to after command execution.
     - Removed an incorrect atomic_dec.
 - PATCH [06/16]
     - Renamed priviledged -> privileged.
     - Renamed pvrdma_flush_cqe to _pvrdma_flush_cqe since we hold a lock
     to call it.
     - Added wrapper functions for writing to UARs for CQ/QP.
     - The conversion functions are updated as func_name(dst, src) format.
     - Renamed max_gs to max_sg.
     - Added work struct for net device events.
 - PATCH [07/16]
     - Updated conversion functions to func_name(dst, src) format.
     - Removed unneeded local variables.
 - PATCH [08/16]
     - Removed the min check and added a BUILD_BUG_ON check for size.
 - PATCH [09/16]
     - Added a pvrdma_destroy_cq in the error path.
     - Renamed pvrdma_flush_cqe to _pvrdma_flush_cqe since we need a lock to
     be held while calling this.
     - Updated to use wrapper for UAR write for CQ.
     - Ensure that poll_cq does not return error values.
 - PATCH [10/16]
     - Removed an unnecessary comment.
 - PATCH [11/16]
     - Changed access flag check for DMA MR to using bit operation.
     - Removed some local variables.
 - PATCH [12/16]
     - Removed an unnecessary switch case.
     - Unified the returns in pvrdma_create_qp to use one exit point.
     - Renamed pvrdma_flush_cqe to _pvrdma_flush_cqe since we need a lock to
     be held when calling this.
     - Updated to use wrapper for UAR write for QP.
     - Updated conversion function to func_name(dst, src) format.
     - Renamed max_gs to max_sg.
     - Renamed cap variable to req_cap in pvrdma_set_sq/rq_size.
     - Changed dev_warn to dev_warn_ratelimited in pvrdma_post_send/recv.
     - Added nesting locking for flushing CQs when destroying/resetting a QP.
     - Added missing ret value.
 - PATCH [13/16]
     - Fixed some checkpatch warnings.
     - Added support for new get_dev_fw_str API.
     - Added event workqueue for netdevice events.
     - Restructured the pvrdma_pci_remove function a little bit.
 - PATCH [14/16]
     - Enforced dependency on VMXNet3 module.

Changes v2->v3:
 - I reordered the patches so that the definitions of enums, structures is
 before their use (suggested by Yuval Shaia) so its easier to review.
 - Removed an unneccesary bool in pvrdma_cmd_post (suggested by Yuval Shaia).
 - Made the use of comma at end of enums consistent across files (suggested
 by Leon Romanovsky).

Changes v1->v2:
 - Patch [07/15] - Addressed Yuval Shaia's comments and 32-bit build errors.

---
Adit Ranadive (16):
  vmxnet3: Move PCI Id to pci_ids.h
  IB/pvrdma: Add user-level shared functions
  IB/pvrdma: Add virtual device RDMA structures
  IB/pvrdma: Add the paravirtual RDMA device specification
  IB/pvrdma: Add functions for Verbs support
  IB/pvrdma: Add paravirtual rdma device
  IB/pvrdma: Add helper functions
  IB/pvrdma: Add device command support
  IB/pvrdma: Add support for Completion Queues
  IB/pvrdma: Add UAR support
  IB/pvrdma: Add support for memory regions
  IB/pvrdma: Add Queue Pair support
  IB/pvrdma: Add the main driver module for PVRDMA
  IB/pvrdma: Add Kconfig and Makefile
  IB: Add PVRDMA driver
  MAINTAINERS: Update for PVRDMA driver

 MAINTAINERS                                    |    9 +
 drivers/infiniband/Kconfig                     |    1 +
 drivers/infiniband/hw/Makefile                 |    1 +
 drivers/infiniband/hw/pvrdma/Kconfig           |    7 +
 drivers/infiniband/hw/pvrdma/Makefile          |    3 +
 drivers/infiniband/hw/pvrdma/pvrdma.h          |  473 +++++++++
 drivers/infiniband/hw/pvrdma/pvrdma_cmd.c      |  117 +++
 drivers/infiniband/hw/pvrdma/pvrdma_cq.c       |  426 +++++++++
 drivers/infiniband/hw/pvrdma/pvrdma_defs.h     |  301 ++++++
 drivers/infiniband/hw/pvrdma/pvrdma_dev_api.h  |  342 +++++++
 drivers/infiniband/hw/pvrdma/pvrdma_doorbell.c |  127 +++
 drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h |  444 +++++++++
 drivers/infiniband/hw/pvrdma/pvrdma_main.c     | 1220 ++++++++++++++++++++++++
 drivers/infiniband/hw/pvrdma/pvrdma_misc.c     |  304 ++++++
 drivers/infiniband/hw/pvrdma/pvrdma_mr.c       |  334 +++++++
 drivers/infiniband/hw/pvrdma/pvrdma_qp.c       |  973 +++++++++++++++++++
 drivers/infiniband/hw/pvrdma/pvrdma_verbs.c    |  577 +++++++++++
 drivers/infiniband/hw/pvrdma/pvrdma_verbs.h    |  108 +++
 drivers/net/vmxnet3/vmxnet3_int.h              |    3 +-
 include/linux/pci_ids.h                        |    1 +
 include/uapi/rdma/Kbuild                       |    2 +
 include/uapi/rdma/pvrdma-abi.h                 |   99 ++
 include/uapi/rdma/pvrdma-uapi.h                |  255 +++++
 23 files changed, 6125 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/hw/pvrdma/Kconfig
 create mode 100644 drivers/infiniband/hw/pvrdma/Makefile
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma.h
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_cmd.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_cq.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_defs.h
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_dev_api.h
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_doorbell.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_ib_verbs.h
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_main.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_misc.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_mr.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_qp.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_verbs.c
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_verbs.h
 create mode 100644 include/uapi/rdma/pvrdma-abi.h
 create mode 100644 include/uapi/rdma/pvrdma-uapi.h

-- 
2.7.4

^ permalink raw reply

* [PATCH v5 16/16] MAINTAINERS: Update for PVRDMA driver
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan
In-Reply-To: <cover.1474759181.git.aditr@vmware.com>

Add maintainer info for the PVRDMA driver.

Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
---
Changes v4->v5:
 - Added pvrdma files to common UAPI folder.
---
 MAINTAINERS | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 87e23cd..5023dc0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12615,6 +12615,15 @@ S:	Maintained
 F:	drivers/scsi/vmw_pvscsi.c
 F:	drivers/scsi/vmw_pvscsi.h
 
+VMWARE PVRDMA DRIVER
+M:	Adit Ranadive <aditr@vmware.com>
+M:	VMware PV-Drivers <pv-drivers@vmware.com>
+L:	linux-rdma@vger.kernel.org
+S:	Maintained
+F:	drivers/infiniband/hw/pvrdma/
+F:	include/uapi/rdma/pvrdma-abi.h
+F:	include/uapi/rdma/pvrdma-uapi.h
+
 VOLTAGE AND CURRENT REGULATOR FRAMEWORK
 M:	Liam Girdwood <lgirdwood@gmail.com>
 M:	Mark Brown <broonie@kernel.org>
-- 
2.7.4

^ permalink raw reply related

* [PATCH v5 15/16] IB: Add PVRDMA driver
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan
In-Reply-To: <cover.1474759181.git.aditr@vmware.com>

This patch updates the InfiniBand subsystem to build the PVRDMA driver.

Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
---
 drivers/infiniband/Kconfig     | 1 +
 drivers/infiniband/hw/Makefile | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 19a418a..dff4bcf 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -88,5 +88,6 @@ source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
 
 source "drivers/infiniband/hw/hfi1/Kconfig"
+source "drivers/infiniband/hw/pvrdma/Kconfig"
 
 endif # INFINIBAND
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index 21fe401..c8a7a36 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -10,3 +10,4 @@ obj-$(CONFIG_INFINIBAND_OCRDMA)		+= ocrdma/
 obj-$(CONFIG_INFINIBAND_USNIC)		+= usnic/
 obj-$(CONFIG_INFINIBAND_HFI1)		+= hfi1/
 obj-$(CONFIG_INFINIBAND_HNS)		+= hns/
+obj-$(CONFIG_INFINIBAND_PVRDMA)		+= pvrdma/
-- 
2.7.4

^ permalink raw reply related

* [PATCH v5 11/16] IB/pvrdma: Add support for memory regions
From: Adit Ranadive @ 2016-09-24 23:21 UTC (permalink / raw)
  To: dledford, linux-rdma, pv-drivers
  Cc: Adit Ranadive, netdev, linux-pci, jhansen, asarwade, georgezhang,
	bryantan
In-Reply-To: <cover.1474759181.git.aditr@vmware.com>

This patch adds support for creating and destroying memory regions. The
PVRDMA device supports User MRs, DMA MRs (no Remote Read/Write support),
Fast Register MRs.

Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Reviewed-by: George Zhang <georgezhang@vmware.com>
Reviewed-by: Aditya Sarwade <asarwade@vmware.com>
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
---
Changes v4->v5:
 - Check the access flags correctly for DMA MR.
 - Update to pvrdma_cmd_post for creating/destroying MRs.

Changes v3->v4:
 - Changed access flag check for DMA MR to using bit operation.
 - Removed some local variables.

Changes v2->v3:
 - Removed boolean in pvrdma_cmd_post.
---
 drivers/infiniband/hw/pvrdma/pvrdma_mr.c | 334 +++++++++++++++++++++++++++++++
 1 file changed, 334 insertions(+)
 create mode 100644 drivers/infiniband/hw/pvrdma/pvrdma_mr.c

diff --git a/drivers/infiniband/hw/pvrdma/pvrdma_mr.c b/drivers/infiniband/hw/pvrdma/pvrdma_mr.c
new file mode 100644
index 0000000..8519f32
--- /dev/null
+++ b/drivers/infiniband/hw/pvrdma/pvrdma_mr.c
@@ -0,0 +1,334 @@
+/*
+ * Copyright (c) 2012-2016 VMware, Inc.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of EITHER the GNU General Public License
+ * version 2 as published by the Free Software Foundation or the BSD
+ * 2-Clause License. This program is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; WITHOUT EVEN THE IMPLIED
+ * WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License version 2 for more details at
+ * http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program available in the file COPYING in the main
+ * directory of this source tree.
+ *
+ * The BSD 2-Clause License
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <linux/list.h>
+#include <linux/slab.h>
+
+#include "pvrdma.h"
+
+/**
+ * pvrdma_get_dma_mr - get a DMA memory region
+ * @pd: protection domain
+ * @acc: access flags
+ *
+ * @return: ib_mr pointer on success, otherwise returns an errno.
+ */
+struct ib_mr *pvrdma_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct pvrdma_dev *dev = to_vdev(pd->device);
+	struct pvrdma_user_mr *mr;
+	union pvrdma_cmd_req req;
+	union pvrdma_cmd_resp rsp;
+	struct pvrdma_cmd_create_mr *cmd = &req.create_mr;
+	struct pvrdma_cmd_create_mr_resp *resp = &rsp.create_mr_resp;
+	int ret;
+
+	/* Support only LOCAL_WRITE flag for DMA MRs */
+	if (acc & ~IB_ACCESS_LOCAL_WRITE) {
+		dev_warn(&dev->pdev->dev,
+			 "unsupported dma mr access flags %#x\n", acc);
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_CREATE_MR;
+	cmd->pd_handle = to_vpd(pd)->pd_handle;
+	cmd->access_flags = acc;
+	cmd->flags = PVRDMA_MR_FLAG_DMA;
+
+	ret = pvrdma_cmd_post(dev, &req, &rsp, PVRDMA_CMD_CREATE_MR_RESP);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not get DMA mem region, error: %d\n", ret);
+		kfree(mr);
+		return ERR_PTR(ret);
+	}
+
+	mr->mmr.mr_handle = resp->mr_handle;
+	mr->ibmr.lkey = resp->lkey;
+	mr->ibmr.rkey = resp->rkey;
+
+	return &mr->ibmr;
+}
+
+/**
+ * pvrdma_reg_user_mr - register a userspace memory region
+ * @pd: protection domain
+ * @start: starting address
+ * @length: length of region
+ * @virt_addr: I/O virtual address
+ * @access_flags: access flags for memory region
+ * @udata: user data
+ *
+ * @return: ib_mr pointer on success, otherwise returns an errno.
+ */
+struct ib_mr *pvrdma_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
+				 u64 virt_addr, int access_flags,
+				 struct ib_udata *udata)
+{
+	struct pvrdma_dev *dev = to_vdev(pd->device);
+	struct pvrdma_user_mr *mr = NULL;
+	struct ib_umem *umem;
+	union pvrdma_cmd_req req;
+	union pvrdma_cmd_resp rsp;
+	struct pvrdma_cmd_create_mr *cmd = &req.create_mr;
+	struct pvrdma_cmd_create_mr_resp *resp = &rsp.create_mr_resp;
+	int nchunks;
+	int ret;
+	int entry;
+	struct scatterlist *sg;
+
+	if (length == 0 || length > dev->dsr->caps.max_mr_size) {
+		dev_warn(&dev->pdev->dev, "invalid mem region length\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	umem = ib_umem_get(pd->uobject->context, start,
+			   length, access_flags, 0);
+	if (IS_ERR(umem)) {
+		dev_warn(&dev->pdev->dev,
+			 "could not get umem for mem region\n");
+		return ERR_CAST(umem);
+	}
+
+	nchunks = 0;
+	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry)
+		nchunks += sg_dma_len(sg) >> PAGE_SHIFT;
+
+	if (nchunks < 0 || nchunks > PVRDMA_PAGE_DIR_MAX_PAGES) {
+		dev_warn(&dev->pdev->dev, "overflow %d pages in mem region\n",
+			 nchunks);
+		ret = -EINVAL;
+		goto err_umem;
+	}
+
+	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr) {
+		ret = -ENOMEM;
+		goto err_umem;
+	}
+
+	mr->mmr.iova = virt_addr;
+	mr->mmr.size = length;
+	mr->umem = umem;
+
+	ret = pvrdma_page_dir_init(dev, &mr->pdir, nchunks, false);
+	if (ret) {
+		dev_warn(&dev->pdev->dev,
+			 "could not allocate page directory\n");
+		goto err_umem;
+	}
+
+	ret = pvrdma_page_dir_insert_umem(&mr->pdir, mr->umem, 0);
+	if (ret)
+		goto err_pdir;
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_CREATE_MR;
+	cmd->start = start;
+	cmd->length = length;
+	cmd->pd_handle = to_vpd(pd)->pd_handle;
+	cmd->access_flags = access_flags;
+	cmd->nchunks = nchunks;
+	cmd->pdir_dma = mr->pdir.dir_dma;
+
+	ret = pvrdma_cmd_post(dev, &req, &rsp, PVRDMA_CMD_CREATE_MR_RESP);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not register mem region, error: %d\n", ret);
+		goto err_pdir;
+	}
+
+	mr->mmr.mr_handle = resp->mr_handle;
+	mr->ibmr.lkey = resp->lkey;
+	mr->ibmr.rkey = resp->rkey;
+
+	return &mr->ibmr;
+
+err_pdir:
+	pvrdma_page_dir_cleanup(dev, &mr->pdir);
+err_umem:
+	ib_umem_release(umem);
+	kfree(mr);
+
+	return ERR_PTR(ret);
+}
+
+/**
+ * pvrdma_alloc_mr - allocate a memory region
+ * @pd: protection domain
+ * @mr_type: type of memory region
+ * @max_num_sg: maximum number of pages
+ *
+ * @return: ib_mr pointer on success, otherwise returns an errno.
+ */
+struct ib_mr *pvrdma_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
+			      u32 max_num_sg)
+{
+	struct pvrdma_dev *dev = to_vdev(pd->device);
+	struct pvrdma_user_mr *mr;
+	union pvrdma_cmd_req req;
+	union pvrdma_cmd_resp rsp;
+	struct pvrdma_cmd_create_mr *cmd = &req.create_mr;
+	struct pvrdma_cmd_create_mr_resp *resp = &rsp.create_mr_resp;
+	int size = max_num_sg * sizeof(u64);
+	int ret;
+
+	if (mr_type != IB_MR_TYPE_MEM_REG ||
+	    max_num_sg > PVRDMA_MAX_FAST_REG_PAGES)
+		return ERR_PTR(-EINVAL);
+
+	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	mr->pages = kzalloc(size, GFP_KERNEL);
+	if (!mr->pages) {
+		ret = -ENOMEM;
+		goto freemr;
+	}
+
+	ret = pvrdma_page_dir_init(dev, &mr->pdir, max_num_sg, false);
+	if (ret) {
+		dev_warn(&dev->pdev->dev,
+			 "failed to allocate page dir for mr\n");
+		ret = -ENOMEM;
+		goto freepages;
+	}
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_CREATE_MR;
+	cmd->pd_handle = to_vpd(pd)->pd_handle;
+	cmd->access_flags = 0;
+	cmd->flags = PVRDMA_MR_FLAG_FRMR;
+	cmd->nchunks = max_num_sg;
+
+	ret = pvrdma_cmd_post(dev, &req, &rsp, PVRDMA_CMD_CREATE_MR_RESP);
+	if (ret < 0) {
+		dev_warn(&dev->pdev->dev,
+			 "could not create FR mem region, error: %d\n", ret);
+		goto freepdir;
+	}
+
+	mr->max_pages = max_num_sg;
+	mr->mmr.mr_handle = resp->mr_handle;
+	mr->ibmr.lkey = resp->lkey;
+	mr->ibmr.rkey = resp->rkey;
+	mr->page_shift = PAGE_SHIFT;
+	mr->umem = NULL;
+
+	return &mr->ibmr;
+
+freepdir:
+	pvrdma_page_dir_cleanup(dev, &mr->pdir);
+freepages:
+	kfree(mr->pages);
+freemr:
+	kfree(mr);
+	return ERR_PTR(ret);
+}
+
+/**
+ * pvrdma_dereg_mr - deregister a memory region
+ * @ibmr: memory region
+ *
+ * @return: 0 on success.
+ */
+int pvrdma_dereg_mr(struct ib_mr *ibmr)
+{
+	struct pvrdma_user_mr *mr = to_vmr(ibmr);
+	struct pvrdma_dev *dev = to_vdev(ibmr->device);
+	union pvrdma_cmd_req req;
+	struct pvrdma_cmd_destroy_mr *cmd = &req.destroy_mr;
+	int ret;
+
+	memset(cmd, 0, sizeof(*cmd));
+	cmd->hdr.cmd = PVRDMA_CMD_DESTROY_MR;
+	cmd->mr_handle = mr->mmr.mr_handle;
+	ret = pvrdma_cmd_post(dev, &req, NULL, 0);
+	if (ret < 0)
+		dev_warn(&dev->pdev->dev,
+			 "could not deregister mem region, error: %d\n", ret);
+
+	pvrdma_page_dir_cleanup(dev, &mr->pdir);
+	if (mr->umem)
+		ib_umem_release(mr->umem);
+
+	kfree(mr->pages);
+	kfree(mr);
+
+	return 0;
+}
+
+static int pvrdma_set_page(struct ib_mr *ibmr, u64 addr)
+{
+	struct pvrdma_user_mr *mr = to_vmr(ibmr);
+
+	if (mr->npages == mr->max_pages)
+		return -ENOMEM;
+
+	mr->pages[mr->npages++] = addr;
+	return 0;
+}
+
+int pvrdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		     unsigned int *sg_offset)
+{
+	struct pvrdma_user_mr *mr = to_vmr(ibmr);
+	struct pvrdma_dev *dev = to_vdev(ibmr->device);
+	int ret;
+
+	mr->npages = 0;
+
+	ret = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, pvrdma_set_page);
+	if (ret < 0)
+		dev_warn(&dev->pdev->dev, "could not map sg to pages\n");
+
+	return ret;
+}
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox