Netdev List
 help / color / mirror / Atom feed
* [PATCH net 1/2] tcp: Merge tx_flags and tskey in tcp_collapse_retrans
From: Martin KaFai Lau @ 2016-04-20  5:39 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461130769-1442865-1-git-send-email-kafai@fb.com>

If two skbs are merged/collapsed during retransmission, the current
logic does not merge the tx_flags and tskey.  The end result is
the SCM_TSTAMP_ACK timestamp could be missing for a packet.

The patch:
1. Merge the tx_flags
2. Overwrite the prev_skb's tskey with the next_skb's tskey

BPF Output Before:
~~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~~
packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_output.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7d2dc01..5bc3c30 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2441,6 +2441,20 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
+static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+				    const struct sk_buff *next_skb)
+{
+	const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
+	u8 tsflags = next_shinfo->tx_flags & SKBTX_ANY_TSTAMP;
+
+	if (unlikely(tsflags)) {
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+		shinfo->tx_flags |= tsflags;
+		shinfo->tskey = next_shinfo->tskey;
+	}
+}
+
 /* Collapses two adjacent SKB's during retransmission. */
 static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
@@ -2484,6 +2498,8 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_adjust_pcount(sk, next_skb, tcp_skb_pcount(next_skb));
 
+	tcp_skb_collapse_tstamp(skb, next_skb);
+
 	sk_wmem_free_skb(sk, next_skb);
 }
 
-- 
2.5.1

^ permalink raw reply related

* [PATCH net 0/2] tcp: Merge timestamp info when coalescing skbs
From: Martin KaFai Lau @ 2016-04-20  5:39 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team

This series is separated from the RFC series related to
tcp_sendmsg(MSG_EOR) and it is targeting for the net branch.
This patchset is focusing on fixing cases where TCP
timestamp could be lost after coalescing skbs.

A BPF prog is used to kprobe to sock_queue_err_skb()
and print out the value of serr->ee.ee_data.  The BPF
prog (run-able from bcc) is attached here:

BPF prog used for testing:
~~~~~
#!/usr/bin/env python

from __future__ import print_function
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/errqueue.h>

#ifdef memset
#undef memset
#endif

int trace_err_skb(struct pt_regs *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->si;
	struct sock *sk = (struct sock *)ctx->di;
	struct sock_exterr_skb *serr;
	u32 ee_data = 0;

	if (!sk || !skb)
		return 0;

	serr = SKB_EXT_ERR(skb);
	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
	bpf_trace_printk("ee_data:%u\\n", ee_data);

	return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()

^ permalink raw reply

* [PATCH net 2/2] tcp: Merge tx_flags and tskey in tcp_shifted_skb
From: Martin KaFai Lau @ 2016-04-20  5:39 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Neal Cardwell, Soheil Hassas Yeganeh,
	Willem de Bruijn, Yuchung Cheng, Kernel Team
In-Reply-To: <1461130769-1442865-1-git-send-email-kafai@fb.com>

After receiving sacks, tcp_shifted_skb() will collapse
skbs if possible.  tx_flags and tskey also have to be
merged.

This patch reuses the tcp_skb_collapse_tstamp() to handle
them.

BPF Output Before:
~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~
<...>-2024  [007] d.s.    88.644374: : ee_data:14599

Packetdrill Script:
~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/tcp.h     | 2 ++
 net/ipv4/tcp_input.c  | 1 +
 net/ipv4/tcp_output.c | 4 ++--
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f..6db1022 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -552,6 +552,8 @@ void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
 bool tcp_schedule_loss_probe(struct sock *sk);
+void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+			     const struct sk_buff *next_skb);
 
 /* tcp_input.c */
 void tcp_resume_early_retransmit(struct sock *sk);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0edb071..c124c3c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1309,6 +1309,7 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 	if (skb == tcp_highest_sack(sk))
 		tcp_advance_highest_sack(sk, skb);
 
+	tcp_skb_collapse_tstamp(prev, skb);
 	tcp_unlink_write_queue(skb, sk);
 	sk_wmem_free_skb(sk, skb);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5bc3c30..441ae9d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2441,8 +2441,8 @@ u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
-static void tcp_skb_collapse_tstamp(struct sk_buff *skb,
-				    const struct sk_buff *next_skb)
+void tcp_skb_collapse_tstamp(struct sk_buff *skb,
+			     const struct sk_buff *next_skb)
 {
 	const struct skb_shared_info *next_shinfo = skb_shinfo(next_skb);
 	u8 tsflags = next_shinfo->tx_flags & SKBTX_ANY_TSTAMP;
-- 
2.5.1

^ permalink raw reply related

* RE: [PATCH net-next 5/5] qed*: Add support for read/write of eeprom
From: Yuval Mintz @ 2016-04-20  5:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Sudarsana Kalluru
In-Reply-To: <CO2PR11MB00888E4DBF49BE0A98ACB381976B0@CO2PR11MB0088.namprd11.prod.outlook.com>

> >> The third option I see is to completely ditch this, stating that
> >> although We obviously CAN set the non-volatile memory we CAN'T do it
> >> with the regular API, and to move into doing this via some
> >> proprietary API such as debugfs.
> 
> > Why go to debugfs rather than gracefully extending the ethtool stuff
> > to explicitly support what you need?
> 
> Gracefully extend it to where?
> 
> The most I can learn from this is that some adapters are in need of additional
> metadata for doing this sort of stuff.
> The state-machine itself is propriety and I don't think anyone would gain from
> trying to formalize anything else from it.
> If by 'graceful extension' of ethtool API you mean adding an additional field or
> two for metadata - sure, I can do that, I just don't think of it as graceful. ;-)
> 
> And if this is actually the way we want to take this, I would still [don't get upset
> ;-)] argue in favor of overloading the 'magic' field for this purpose, as bnx2x and
> bnxt are already doing.
> Basically if the 'magic' would also contain metadata that could be verified as
> appropriate [I.e., by having valid ranges], it would still fit as a safety measure for
> guaranteeing the correctness of the write request.

Hi Dave,

I'll probably send V2 [relatively] shortly, and it looks like I'll drop this one out
as I don't yet understand the expected direction through which to solve this.

I'd appreciate any suggested pointers here.

Thanks,
Yuval

^ permalink raw reply

* Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Roopa Prabhu @ 2016-04-20  3:54 UTC (permalink / raw)
  To: David Miller; +Cc: nicolas.dichtel, eric.dumazet, netdev, jhs, tgraf
In-Reply-To: <20160419.195009.1052027353987244150.davem@davemloft.net>

On 4/19/16, 4:50 PM, David Miller wrote:
> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Date: Tue, 19 Apr 2016 21:08:21 +0200
>
>> Le 19/04/2016 20:47, Eric Dumazet a écrit :
>>> Since we want to use this in other places, we could define a helper.
>>>
>>> nla_align_64bit(skb, attribute)  or something.
>> Yes, with the corresponding nla_total_size_64bit()
> Good, idea, committed the following:
>
> Roopa, please use these helpers in your RTM_GETSTATS patch.

will do.
>
> Thank you.
>
> ====================
> [PATCH] net: Add helpers for 64-bit aligning netlink attributes.
>
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  
these look really nice.

Thanks!

^ permalink raw reply

* Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Roopa Prabhu @ 2016-04-20  3:53 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, jhs, tgraf, nicolas.dichtel
In-Reply-To: <20160419.184920.1193931883675948435.davem@davemloft.net>

On 4/19/16, 3:49 PM, David Miller wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
> Date: Tue, 19 Apr 2016 12:05:00 -0700
>
>> ok, will do. one thing though, for GETSTATS, if I need a pad
>> attribute like IFLA_PAD, I will need to add a new stats attribute
>> IFLA_STATS_PAD and burn a bit for it in filter_mask too.  In which
>> case, I am wondering if we should live with the copy. I will take
>> any suggestions here.
> I don't think the copy is appropriate, especially if the existing full
> link state dump gets away without it.  We're adding this facility for
> performance reasons after all.
>
> You have several options to avoid wasting filter mask space. For
> example, you could use IFLA_STATS_UNSPEC, which should be OK since
> only new applications will use these.
>
> Or you could make IFLA_STATS_PAD the first attribute, and define the
> filter mask as relative to it.  Ie. IFLA_STATS_LINK_64 uses bit
> (IFLA_STATS_LINK_64 - IFLA_STATS_PAD), etc.
ok ack, makes sense.

thanks,
Roopa

^ permalink raw reply

* [PATCH 2/3] ipvs: optimize release of connections in OPS mode
From: Simon Horman @ 2016-04-20  2:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Marco Angaroni, Simon Horman
In-Reply-To: <1461120394-5548-1-git-send-email-horms@verge.net.au>

From: Marco Angaroni <marcoangaroni@gmail.com>

One-packet-scheduling is the most expensive mode in IPVS from
performance point of view: for each packet to be processed a new
connection data structure is created and, after packet is sent,
deleted by starting a new timer set to expire immediately.

SIP persistent-engine needs OPS mode to have Call-ID based load
balancing, so OPS mode performance has negative impact in SIP
protocol load balancing.

This patch aims to improve performance of OPS mode by means of the
following changes in the release mechanism of OPS connections:
a) call expire callback ip_vs_conn_expire() directly instead of
   starting a timer programmed to fire immediately.
b) avoid call_rcu() overhead inside expire callback, since OPS
   connection are not inserted in the hash-table and last just the
   time to process the packet, hence there is no concurrent access
   to such data structures.

Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_conn.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 85ca189bdc3d..dd75d4120099 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -104,6 +104,7 @@ static inline void ct_write_unlock_bh(unsigned int key)
 	spin_unlock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l);
 }
 
+static void ip_vs_conn_expire(unsigned long data);
 
 /*
  *	Returns hash value for IPVS connection entry
@@ -453,10 +454,16 @@ ip_vs_conn_out_get_proto(struct netns_ipvs *ipvs, int af,
 }
 EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto);
 
+static void __ip_vs_conn_put_notimer(struct ip_vs_conn *cp)
+{
+	__ip_vs_conn_put(cp);
+	ip_vs_conn_expire((unsigned long)cp);
+}
+
 /*
  *      Put back the conn and restart its timer with its timeout
  */
-void ip_vs_conn_put(struct ip_vs_conn *cp)
+static void __ip_vs_conn_put_timer(struct ip_vs_conn *cp)
 {
 	unsigned long t = (cp->flags & IP_VS_CONN_F_ONE_PACKET) ?
 		0 : cp->timeout;
@@ -465,6 +472,16 @@ void ip_vs_conn_put(struct ip_vs_conn *cp)
 	__ip_vs_conn_put(cp);
 }
 
+void ip_vs_conn_put(struct ip_vs_conn *cp)
+{
+	if ((cp->flags & IP_VS_CONN_F_ONE_PACKET) &&
+	    (atomic_read(&cp->refcnt) == 1) &&
+	    !timer_pending(&cp->timer))
+		/* expire connection immediately */
+		__ip_vs_conn_put_notimer(cp);
+	else
+		__ip_vs_conn_put_timer(cp);
+}
 
 /*
  *	Fill a no_client_port connection with a client port number
@@ -834,7 +851,10 @@ static void ip_vs_conn_expire(unsigned long data)
 		ip_vs_unbind_dest(cp);
 		if (cp->flags & IP_VS_CONN_F_NO_CPORT)
 			atomic_dec(&ip_vs_conn_no_cport_cnt);
-		call_rcu(&cp->rcu_head, ip_vs_conn_rcu_free);
+		if (cp->flags & IP_VS_CONN_F_ONE_PACKET)
+			ip_vs_conn_rcu_free(&cp->rcu_head);
+		else
+			call_rcu(&cp->rcu_head, ip_vs_conn_rcu_free);
 		atomic_dec(&ipvs->conn_count);
 		return;
 	}
@@ -850,7 +870,7 @@ static void ip_vs_conn_expire(unsigned long data)
 	if (ipvs->sync_state & IP_VS_STATE_MASTER)
 		ip_vs_sync_conn(ipvs, cp, sysctl_sync_threshold(ipvs));
 
-	ip_vs_conn_put(cp);
+	__ip_vs_conn_put_timer(cp);
 }
 
 /* Modify timer, so that it expires as soon as possible.
-- 
2.7.0.rc3.207.g0ac5344

^ permalink raw reply related

* [PATCH 3/3] ipvs: don't alter conntrack in OPS mode
From: Simon Horman @ 2016-04-20  2:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Marco Angaroni, Simon Horman
In-Reply-To: <1461120394-5548-1-git-send-email-horms@verge.net.au>

From: Marco Angaroni <marcoangaroni@gmail.com>

When using OPS mode in conjunction with SIP persistent-engine, packets
originating from the same ip-address/port could be balanced to different
real servers, and (to properly handle SIP responses) OPS connections
are created in the in-out direction too, where ip_vs_update_conntrack()
is called to modify the reply tuple.

As a result, there can be collision of conntrack tuples, causing random
packet drops, as explained below:

conntrack1: orig=CIP->VIP, reply=RIP1->CIP
conntrack2: orig=RIP2->CIP, reply=CIP->VIP

Tuple CIP->VIP is both in orig of conntrack1 and reply of conntrack2.
The collision triggers packet drop inside nf_conntrack processing.

In addition, the current implementation deletes the conntrack object at
every expire of an OPS connection (once every forwarded packet), to have
it recreated from scratch at next packet traversing IPVS.

Since in OPS mode, by definition, we don't expect any associated
response, the choices implemented in this patch are:
a) don't call nf_conntrack_alter_reply() for OPS connections inside
   ip_vs_update_conntrack().
b) don't delete the conntrack object at OPS connection expire.

The result is that created conntrack objects for each tuple CIP->VIP,
RIP-N->CIP, etc. are left in UNREPLIED state and not modified by IPVS
OPS connection management. This eliminates packet drops and leaves
a single conntrack object for each tuple packets are sent from.

Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_conn.c | 3 ++-
 net/netfilter/ipvs/ip_vs_nfct.c | 4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index dd75d4120099..292365ffa4f0 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -836,7 +836,8 @@ static void ip_vs_conn_expire(unsigned long data)
 		if (cp->control)
 			ip_vs_control_del(cp);
 
-		if (cp->flags & IP_VS_CONN_F_NFCT) {
+		if ((cp->flags & IP_VS_CONN_F_NFCT) &&
+		    !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
 			/* Do not access conntracks during subsys cleanup
 			 * because nf_conntrack_find_get can not be used after
 			 * conntrack cleanup for the net.
diff --git a/net/netfilter/ipvs/ip_vs_nfct.c b/net/netfilter/ipvs/ip_vs_nfct.c
index 30434fb133df..f04fd8df210b 100644
--- a/net/netfilter/ipvs/ip_vs_nfct.c
+++ b/net/netfilter/ipvs/ip_vs_nfct.c
@@ -93,6 +93,10 @@ ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp, int outin)
 	if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ)
 		return;
 
+	/* Never alter conntrack for OPS conns (no reply is expected) */
+	if (cp->flags & IP_VS_CONN_F_ONE_PACKET)
+		return;
+
 	/* Alter reply only in original direction */
 	if (CTINFO2DIR(ctinfo) != IP_CT_DIR_ORIGINAL)
 		return;
-- 
2.7.0.rc3.207.g0ac5344


^ permalink raw reply related

* [PATCH 1/3] ipvs: handle connections started by real-servers
From: Simon Horman @ 2016-04-20  2:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Marco Angaroni, Simon Horman
In-Reply-To: <1461120394-5548-1-git-send-email-horms@verge.net.au>

From: Marco Angaroni <marcoangaroni@gmail.com>

When using LVS-NAT and SIP persistence-egine over UDP, the following
limitations are present with current implementation:

  1) To actually have load-balancing based on Call-ID header, you need to
     use one-packet-scheduling mode. But with one-packet-scheduling the
     connection is deleted just after packet is forwarded, so SIP responses
     coming from real-servers do not match any connection and SNAT is
     not applied.

  2) If you do not use "-o" option, IPVS behaves as normal UDP load
     balancer, so different SIP calls (each one identified by a different
     Call-ID) coming from the same ip-address/port go to the same
     real-server. So basically you don’t have load-balancing based on
     Call-ID as intended.

  3) Call-ID is not learned when a new SIP call is started by a real-server
     (inside-to-outside direction), but only in the outside-to-inside
     direction. This would be a general problem for all SIP servers acting
     as Back2BackUserAgent.

This patch aims to solve problems 1) and 3) while keeping OPS mode
mandatory for SIP-UDP, so that 2) is not a problem anymore.

The basic mechanism implemented is to make packets, that do not match any
existent connection but come from real-servers, create new connections
instead of let them pass without any effect.
When such packets pass through ip_vs_out(), if their source ip address and
source port match a configured real-server, a new connection is
automatically created in the same way as it would have happened if the
packet had come from outside-to-inside direction. A new connection template
is created too if the virtual-service is persistent and there is no
matching connection template found. The new connection automatically
created, if the service had "-o" option, is an OPS connection that lasts
only the time to forward the packet, just like it happens on the
ingress side.

The main part of this mechanism is implemented inside a persistent-engine
specific callback (at the moment only SIP persistent engine exists) and
is triggered only for UDP packets, since connection oriented protocols, by
using different set of ports (typically ephemeral ports) to open new
outgoing connections, should not need this feature.

The following requisites are needed for automatic connection creation; if
any is missing the packet simply goes the same way as before.
a) virtual-service is not fwmark based (this is because fwmark services
   do not store address and port of the virtual-service, required to
   build the connection data).
b) virtual-service and real-servers must not have been configured with
   omitted port (this is again to have all data to create the connection).

Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 include/net/ip_vs.h               |  17 +++++
 net/netfilter/ipvs/ip_vs_core.c   | 154 ++++++++++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_ctl.c    |  46 +++++++++++-
 net/netfilter/ipvs/ip_vs_pe_sip.c |  15 ++++
 4 files changed, 231 insertions(+), 1 deletion(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index a6cc576fd467..af4c10ebb241 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -731,6 +731,12 @@ struct ip_vs_pe {
 	u32 (*hashkey_raw)(const struct ip_vs_conn_param *p, u32 initval,
 			   bool inverse);
 	int (*show_pe_data)(const struct ip_vs_conn *cp, char *buf);
+	/* create connections for real-server outgoing packets */
+	struct ip_vs_conn* (*conn_out)(struct ip_vs_service *svc,
+				       struct ip_vs_dest *dest,
+				       struct sk_buff *skb,
+				       const struct ip_vs_iphdr *iph,
+				       __be16 dport, __be16 cport);
 };
 
 /* The application module object (a.k.a. app incarnation) */
@@ -874,6 +880,7 @@ struct netns_ipvs {
 	/* Service counters */
 	atomic_t		ftpsvc_counter;
 	atomic_t		nullsvc_counter;
+	atomic_t		conn_out_counter;
 
 #ifdef CONFIG_SYSCTL
 	/* 1/rate drop and drop-entry variables */
@@ -1147,6 +1154,12 @@ static inline int sysctl_cache_bypass(struct netns_ipvs *ipvs)
  */
 const char *ip_vs_proto_name(unsigned int proto);
 void ip_vs_init_hash_table(struct list_head *table, int rows);
+struct ip_vs_conn *ip_vs_new_conn_out(struct ip_vs_service *svc,
+				      struct ip_vs_dest *dest,
+				      struct sk_buff *skb,
+				      const struct ip_vs_iphdr *iph,
+				      __be16 dport,
+				      __be16 cport);
 #define IP_VS_INIT_HASH_TABLE(t) ip_vs_init_hash_table((t), ARRAY_SIZE((t)))
 
 #define IP_VS_APP_TYPE_FTP	1
@@ -1378,6 +1391,10 @@ ip_vs_service_find(struct netns_ipvs *ipvs, int af, __u32 fwmark, __u16 protocol
 bool ip_vs_has_real_service(struct netns_ipvs *ipvs, int af, __u16 protocol,
 			    const union nf_inet_addr *daddr, __be16 dport);
 
+struct ip_vs_dest *
+ip_vs_find_real_service(struct netns_ipvs *ipvs, int af, __u16 protocol,
+			const union nf_inet_addr *daddr, __be16 dport);
+
 int ip_vs_use_count_inc(void);
 void ip_vs_use_count_dec(void);
 int ip_vs_register_nl_ioctl(void);
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b9a4082afa3a..f3bac2e9a25a 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -68,6 +68,7 @@ EXPORT_SYMBOL(ip_vs_conn_put);
 #ifdef CONFIG_IP_VS_DEBUG
 EXPORT_SYMBOL(ip_vs_get_debug_level);
 #endif
+EXPORT_SYMBOL(ip_vs_new_conn_out);
 
 static int ip_vs_net_id __read_mostly;
 /* netns cnt used for uniqueness */
@@ -1100,6 +1101,143 @@ static inline bool is_new_conn_expected(const struct ip_vs_conn *cp,
 	}
 }
 
+/* Generic function to create new connections for outgoing RS packets
+ *
+ * Pre-requisites for successful connection creation:
+ * 1) Virtual Service is NOT fwmark based:
+ *    In fwmark-VS actual vaddr and vport are unknown to IPVS
+ * 2) Real Server and Virtual Service were NOT configured without port:
+ *    This is to allow match of different VS to the same RS ip-addr
+ */
+struct ip_vs_conn *ip_vs_new_conn_out(struct ip_vs_service *svc,
+				      struct ip_vs_dest *dest,
+				      struct sk_buff *skb,
+				      const struct ip_vs_iphdr *iph,
+				      __be16 dport,
+				      __be16 cport)
+{
+	struct ip_vs_conn_param param;
+	struct ip_vs_conn *ct = NULL, *cp = NULL;
+	const union nf_inet_addr *vaddr, *daddr, *caddr;
+	union nf_inet_addr snet;
+	__be16 vport;
+	unsigned int flags;
+
+	EnterFunction(12);
+	vaddr = &svc->addr;
+	vport = svc->port;
+	daddr = &iph->saddr;
+	caddr = &iph->daddr;
+
+	/* check pre-requisites are satisfied */
+	if (svc->fwmark)
+		return NULL;
+	if (!vport || !dport)
+		return NULL;
+
+	/* for persistent service first create connection template */
+	if (svc->flags & IP_VS_SVC_F_PERSISTENT) {
+		/* apply netmask the same way ingress-side does */
+#ifdef CONFIG_IP_VS_IPV6
+		if (svc->af == AF_INET6)
+			ipv6_addr_prefix(&snet.in6, &caddr->in6,
+					 (__force __u32)svc->netmask);
+		else
+#endif
+			snet.ip = caddr->ip & svc->netmask;
+		/* fill params and create template if not existent */
+		if (ip_vs_conn_fill_param_persist(svc, skb, iph->protocol,
+						  &snet, 0, vaddr,
+						  vport, &param) < 0)
+			return NULL;
+		ct = ip_vs_ct_in_get(&param);
+		if (!ct) {
+			ct = ip_vs_conn_new(&param, dest->af, daddr, dport,
+					    IP_VS_CONN_F_TEMPLATE, dest, 0);
+			if (!ct) {
+				kfree(param.pe_data);
+				return NULL;
+			}
+			ct->timeout = svc->timeout;
+		} else {
+			kfree(param.pe_data);
+		}
+	}
+
+	/* connection flags */
+	flags = ((svc->flags & IP_VS_SVC_F_ONEPACKET) &&
+		 iph->protocol == IPPROTO_UDP) ? IP_VS_CONN_F_ONE_PACKET : 0;
+	/* create connection */
+	ip_vs_conn_fill_param(svc->ipvs, svc->af, iph->protocol,
+			      caddr, cport, vaddr, vport, &param);
+	cp = ip_vs_conn_new(&param, dest->af, daddr, dport, flags, dest, 0);
+	if (!cp) {
+		if (ct)
+			ip_vs_conn_put(ct);
+		return NULL;
+	}
+	if (ct) {
+		ip_vs_control_add(cp, ct);
+		ip_vs_conn_put(ct);
+	}
+	ip_vs_conn_stats(cp, svc);
+
+	/* return connection (will be used to handle outgoing packet) */
+	IP_VS_DBG_BUF(6, "New connection RS-initiated:%c c:%s:%u v:%s:%u "
+		      "d:%s:%u conn->flags:%X conn->refcnt:%d\n",
+		      ip_vs_fwd_tag(cp),
+		      IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport),
+		      IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport),
+		      IP_VS_DBG_ADDR(cp->af, &cp->daddr), ntohs(cp->dport),
+		      cp->flags, atomic_read(&cp->refcnt));
+	LeaveFunction(12);
+	return cp;
+}
+
+/* Handle outgoing packets which are considered requests initiated by
+ * real servers, so that subsequent responses from external client can be
+ * routed to the right real server.
+ * Used also for outgoing responses in OPS mode.
+ *
+ * Connection management is handled by persistent-engine specific callback.
+ */
+static struct ip_vs_conn *__ip_vs_rs_conn_out(unsigned int hooknum,
+					      struct netns_ipvs *ipvs,
+					      int af, struct sk_buff *skb,
+					      const struct ip_vs_iphdr *iph)
+{
+	struct ip_vs_dest *dest;
+	struct ip_vs_conn *cp = NULL;
+	__be16 _ports[2], *pptr;
+
+	if (hooknum == NF_INET_LOCAL_IN)
+		return NULL;
+
+	pptr = frag_safe_skb_hp(skb, iph->len,
+				sizeof(_ports), _ports, iph);
+	if (!pptr)
+		return NULL;
+
+	rcu_read_lock();
+	dest = ip_vs_find_real_service(ipvs, af, iph->protocol,
+				       &iph->saddr, pptr[0]);
+	if (dest) {
+		struct ip_vs_service *svc;
+		struct ip_vs_pe *pe;
+
+		svc = rcu_dereference(dest->svc);
+		if (svc) {
+			pe = rcu_dereference(svc->pe);
+			if (pe && pe->conn_out)
+				cp = pe->conn_out(svc, dest, skb, iph,
+						  pptr[0], pptr[1]);
+		}
+	}
+	rcu_read_unlock();
+
+	return cp;
+}
+
 /* Handle response packets: rewrite addresses and send away...
  */
 static unsigned int
@@ -1245,6 +1383,22 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, in
 
 	if (likely(cp))
 		return handle_response(af, skb, pd, cp, &iph, hooknum);
+
+	/* Check for real-server-started requests */
+	if (atomic_read(&ipvs->conn_out_counter)) {
+		/* Currently only for UDP:
+		 * connection oriented protocols typically use
+		 * ephemeral ports for outgoing connections, so
+		 * related incoming responses would not match any VS
+		 */
+		if (pp->protocol == IPPROTO_UDP) {
+			cp = __ip_vs_rs_conn_out(hooknum, ipvs, af, skb, &iph);
+			if (likely(cp))
+				return handle_response(af, skb, pd, cp, &iph,
+						       hooknum);
+		}
+	}
+
 	if (sysctl_nat_icmp_send(ipvs) &&
 	    (pp->protocol == IPPROTO_TCP ||
 	     pp->protocol == IPPROTO_UDP ||
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 404b2a4f4b5b..6794391c5a32 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -567,6 +567,36 @@ bool ip_vs_has_real_service(struct netns_ipvs *ipvs, int af, __u16 protocol,
 	return false;
 }
 
+/* Find real service record by <proto,addr,port>.
+ * In case of multiple records with the same <proto,addr,port>, only
+ * the first found record is returned.
+ *
+ * To be called under RCU lock.
+ */
+struct ip_vs_dest *ip_vs_find_real_service(struct netns_ipvs *ipvs, int af,
+					   __u16 protocol,
+					   const union nf_inet_addr *daddr,
+					   __be16 dport)
+{
+	unsigned int hash;
+	struct ip_vs_dest *dest;
+
+	/* Check for "full" addressed entries */
+	hash = ip_vs_rs_hashkey(af, daddr, dport);
+
+	hlist_for_each_entry_rcu(dest, &ipvs->rs_table[hash], d_list) {
+		if (dest->port == dport &&
+		    dest->af == af &&
+		    ip_vs_addr_equal(af, &dest->addr, daddr) &&
+			(dest->protocol == protocol || dest->vfwmark)) {
+			/* HIT */
+			return dest;
+		}
+	}
+
+	return NULL;
+}
+
 /* Lookup destination by {addr,port} in the given service
  * Called under RCU lock.
  */
@@ -1253,6 +1283,8 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
 		atomic_inc(&ipvs->ftpsvc_counter);
 	else if (svc->port == 0)
 		atomic_inc(&ipvs->nullsvc_counter);
+	if (svc->pe && svc->pe->conn_out)
+		atomic_inc(&ipvs->conn_out_counter);
 
 	ip_vs_start_estimator(ipvs, &svc->stats);
 
@@ -1293,6 +1325,7 @@ ip_vs_edit_service(struct ip_vs_service *svc, struct ip_vs_service_user_kern *u)
 	struct ip_vs_scheduler *sched = NULL, *old_sched;
 	struct ip_vs_pe *pe = NULL, *old_pe = NULL;
 	int ret = 0;
+	bool new_pe_conn_out, old_pe_conn_out;
 
 	/*
 	 * Lookup the scheduler, by 'u->sched_name'
@@ -1355,8 +1388,16 @@ ip_vs_edit_service(struct ip_vs_service *svc, struct ip_vs_service_user_kern *u)
 	svc->netmask = u->netmask;
 
 	old_pe = rcu_dereference_protected(svc->pe, 1);
-	if (pe != old_pe)
+	if (pe != old_pe) {
 		rcu_assign_pointer(svc->pe, pe);
+		/* check for optional methods in new pe */
+		new_pe_conn_out = (pe && pe->conn_out) ? true : false;
+		old_pe_conn_out = (old_pe && old_pe->conn_out) ? true : false;
+		if (new_pe_conn_out && !old_pe_conn_out)
+			atomic_inc(&svc->ipvs->conn_out_counter);
+		if (old_pe_conn_out && !new_pe_conn_out)
+			atomic_dec(&svc->ipvs->conn_out_counter);
+	}
 
 out:
 	ip_vs_scheduler_put(old_sched);
@@ -1389,6 +1430,8 @@ static void __ip_vs_del_service(struct ip_vs_service *svc, bool cleanup)
 
 	/* Unbind persistence engine, keep svc->pe */
 	old_pe = rcu_dereference_protected(svc->pe, 1);
+	if (old_pe && old_pe->conn_out)
+		atomic_dec(&ipvs->conn_out_counter);
 	ip_vs_pe_put(old_pe);
 
 	/*
@@ -3957,6 +4000,7 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
 		    (unsigned long) ipvs);
 	atomic_set(&ipvs->ftpsvc_counter, 0);
 	atomic_set(&ipvs->nullsvc_counter, 0);
+	atomic_set(&ipvs->conn_out_counter, 0);
 
 	/* procfs stats */
 	ipvs->tot_stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
diff --git a/net/netfilter/ipvs/ip_vs_pe_sip.c b/net/netfilter/ipvs/ip_vs_pe_sip.c
index 0a6eb5c0d9e9..d07ef9e31c12 100644
--- a/net/netfilter/ipvs/ip_vs_pe_sip.c
+++ b/net/netfilter/ipvs/ip_vs_pe_sip.c
@@ -143,6 +143,20 @@ static int ip_vs_sip_show_pe_data(const struct ip_vs_conn *cp, char *buf)
 	return cp->pe_data_len;
 }
 
+static struct ip_vs_conn *
+ip_vs_sip_conn_out(struct ip_vs_service *svc,
+		   struct ip_vs_dest *dest,
+		   struct sk_buff *skb,
+		   const struct ip_vs_iphdr *iph,
+		   __be16 dport,
+		   __be16 cport)
+{
+	if (likely(iph->protocol == IPPROTO_UDP))
+		return ip_vs_new_conn_out(svc, dest, skb, iph, dport, cport);
+	/* currently no need to handle other than UDP */
+	return NULL;
+}
+
 static struct ip_vs_pe ip_vs_sip_pe =
 {
 	.name =			"sip",
@@ -153,6 +167,7 @@ static struct ip_vs_pe ip_vs_sip_pe =
 	.ct_match =		ip_vs_sip_ct_match,
 	.hashkey_raw =		ip_vs_sip_hashkey_raw,
 	.show_pe_data =		ip_vs_sip_show_pe_data,
+	.conn_out =		ip_vs_sip_conn_out,
 };
 
 static int __init ip_vs_sip_init(void)
-- 
2.7.0.rc3.207.g0ac5344


^ permalink raw reply related

* [GIT PULL nf-next 0/3] IPVS Updates for v4.6
From: Simon Horman @ 2016-04-20  2:46 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Simon Horman

Hi Pablo,

please consider these enhancements to the IPVS. They allow SIP connections
originating from real-servers to be load balanced by the SIP psersitence
engine as is already implemented in the other direction. And for better one
packet scheduling (OPS) performance.

The following changes since commit 4a96300cec88729415683db8a2b909563b09fbaa:

  netfilter: ctnetlink: restore inlining for netlink message size calculation (2016-04-18 22:14:40 +0200)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next.git tags/ipvs-for-v4.7

for you to fetch changes up to 8fb04d9fc70a67ccabf71dbabf92d7f6fca64a16:

  ipvs: don't alter conntrack in OPS mode (2016-04-20 12:34:17 +1000)

----------------------------------------------------------------
Marco Angaroni (3):
      ipvs: handle connections started by real-servers
      ipvs: optimize release of connections in OPS mode
      ipvs: don't alter conntrack in OPS mode

 include/net/ip_vs.h               |  17 +++++
 net/netfilter/ipvs/ip_vs_conn.c   |  29 ++++++-
 net/netfilter/ipvs/ip_vs_core.c   | 154 ++++++++++++++++++++++++++++++++++++++
 net/netfilter/ipvs/ip_vs_ctl.c    |  46 +++++++++++-
 net/netfilter/ipvs/ip_vs_nfct.c   |   4 +
 net/netfilter/ipvs/ip_vs_pe_sip.c |  15 ++++
 6 files changed, 260 insertions(+), 5 deletions(-)

^ permalink raw reply

* Re: [PATCH V2] net: stmmac: socfpga: Remove re-registration of reset controller
From: Dinh Nguyen @ 2016-04-20  2:04 UTC (permalink / raw)
  To: Marek Vasut, netdev
  Cc: peppe.cavallaro, alexandre.torgue, Matthew Gerlach,
	David S . Miller
In-Reply-To: <1461110753-7641-1-git-send-email-marex@denx.de>



On 04/19/2016 07:05 PM, Marek Vasut wrote:
> Both socfpga_dwmac_parse_data() in dwmac-socfpga.c and stmmac_dvr_probe()
> in stmmac_main.c functions call devm_reset_control_get() to register an
> reset controller for the stmmac. This results in an attempt to register
> two reset controllers for the same non-shared reset line.
> 
> The first attempt to register the reset controller works fine. The second
> attempt fails with warning from the reset controller core, see below.
> The warning is produced because the reset line is non-shared and thus
> it is allowed to have only up-to one reset controller associated with
> that reset line, not two or more.
> 
> The solution is not great. Since the hardware needs to toggle the reset
> before calling stmmac_dvr_probe() to perform mandatory preconfiguration,
> this patch splits socfpga_dwmac_init_probe() from socfpga_dwmac_init().
> 
> The socfpga_dwmac_init_probe() temporarily registers the reset controller,
> performs the pre-configuration and unregisters the reset controller again.
> This function is only called from the socfpga_dwmac_probe().
> 
> The original socfpga_dwmac_init() is tweaked to use reset controller
> pointer from the stmmac_priv (private data of the stmmac core) instead
> of the local instance, which was used before.
> 
> Finally, plat_dat->exit and socfpga_dwmac_exit() is no longer necessary,
> since the functionality is already performed by the stmmac core.
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1 at drivers/reset/core.c:187 __of_reset_control_get+0x218/0x270
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc4-next-20160419-00015-gabb2477-dirty #4
> Hardware name: Altera SOCFPGA
> [<c010f290>] (unwind_backtrace) from [<c010b82c>] (show_stack+0x10/0x14)
> [<c010b82c>] (show_stack) from [<c0373da4>] (dump_stack+0x94/0xa8)
> [<c0373da4>] (dump_stack) from [<c011bcc0>] (__warn+0xec/0x104)
> [<c011bcc0>] (__warn) from [<c011bd88>] (warn_slowpath_null+0x20/0x28)
> [<c011bd88>] (warn_slowpath_null) from [<c03a6eb4>] (__of_reset_control_get+0x218/0x270)
> [<c03a6eb4>] (__of_reset_control_get) from [<c03a701c>] (__devm_reset_control_get+0x54/0x90)
> [<c03a701c>] (__devm_reset_control_get) from [<c041fa30>] (stmmac_dvr_probe+0x1b4/0x8e8)
> [<c041fa30>] (stmmac_dvr_probe) from [<c04298c8>] (socfpga_dwmac_probe+0x1b8/0x28c)
> [<c04298c8>] (socfpga_dwmac_probe) from [<c03d6ffc>] (platform_drv_probe+0x4c/0xb0)
> [<c03d6ffc>] (platform_drv_probe) from [<c03d54ec>] (driver_probe_device+0x224/0x2bc)
> [<c03d54ec>] (driver_probe_device) from [<c03d5630>] (__driver_attach+0xac/0xb0)
> [<c03d5630>] (__driver_attach) from [<c03d382c>] (bus_for_each_dev+0x6c/0xa0)
> [<c03d382c>] (bus_for_each_dev) from [<c03d4ad4>] (bus_add_driver+0x1a4/0x21c)
> [<c03d4ad4>] (bus_add_driver) from [<c03d60ac>] (driver_register+0x78/0xf8)
> [<c03d60ac>] (driver_register) from [<c0101760>] (do_one_initcall+0x40/0x170)
> [<c0101760>] (do_one_initcall) from [<c0800e38>] (kernel_init_freeable+0x1dc/0x27c)
> [<c0800e38>] (kernel_init_freeable) from [<c05d1bd4>] (kernel_init+0x8/0x114)
> [<c05d1bd4>] (kernel_init) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
> ---[ end trace 059d2fbe87608fa9 ]---
> 

Odd, I hadn't seen this error before. I would have noticed it, but I
haven't ran linux-next for a few days.

I see that this error was introduced on linux-next/next-20160414. The
error was not there prior to that. While I agree that the extra call to
get the reset control is not needed, I wonder what commit between
next-20160413 and next-20160414 exposed this error. I'll try to dig this
some more.

Dinh

^ permalink raw reply

* RE: [PATCH 5/5] drivers/net: support hdlc function for QE-UCC
From: Qiang Zhao @ 2016-04-20  2:28 UTC (permalink / raw)
  To: Christophe Leroy, davem@davemloft.net
  Cc: gregkh@linuxfoundation.org, Xiaobo Xie,
	linux-kernel@vger.kernel.org, oss@buserror.net,
	netdev@vger.kernel.org, akpm@linux-foundation.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <57165B27.60000@c-s.fr>

On 20/04/2016 12:22AM, Christophe Leroy <christophe.leroy@c-s.fr> wrote
> -----Original Message-----
> From: Christophe Leroy [mailto:christophe.leroy@c-s.fr]
> Sent: Wednesday, April 20, 2016 12:22 AM
> To: Qiang Zhao <qiang.zhao@nxp.com>; davem@davemloft.net
> Cc: gregkh@linuxfoundation.org; Xiaobo Xie <xiaobo.xie@nxp.com>; linux-
> kernel@vger.kernel.org; oss@buserror.net; netdev@vger.kernel.org;
> akpm@linux-foundation.org; linuxppc-dev@lists.ozlabs.org
> Subject: Re: [PATCH 5/5] drivers/net: support hdlc function for QE-UCC
> 
> Le 30/03/2016 10:50, Zhao Qiang a écrit :
> > The driver add hdlc support for Freescale QUICC Engine.
> > It support NMSI and TSA mode.
> When using TSA, how does the TSA gets configured ? Especially how do you
> describe which Timeslot is switched to HDLC channels ?

the TSA is configured statically according to device tree node. 
For " which Timeslot is switched to HDLC channels ", there is a property 
"fsl,tx-timeslot-mask" in device tree to describe it.

> Is it possible to route some Timeslots to one UCC for HDLC, and route some
> others to another UCC for an ALSA sound driver ?

The feature you describe is not supported at present.

> The QE also have a QMC which allows to split all timeslots to a given UCC into
> independant channels that can either be used with HDLC or transparents (for
> audio for instance). Do you intent to also support QMC ?

new QE use UMCC instead of QMC in old QE, we have started to develop UMCC.
 
> According to the compatible property, it looks like your driver is for freescale
> T1040. The MPC83xx also has a Quick Engine, would it work on it too ?

The driver is common, but tested on t1040, it is needed to add node to MPC83xx
If you want to test on mpc83xx.

-Zhao Qiang

^ permalink raw reply

* [PATCH net-next V5 2/2] intel: ixgbevf: Support Windows hosts (Hyper-V)
From: K. Y. Srinivasan @ 2016-04-20  2:17 UTC (permalink / raw)
  To: davem, netdev, linux-kernel, devel, olaf, apw, jasowang, eli,
	jackm, yevgenyp, john.ronciak, intel-wired-lan, alexander.duyck
In-Reply-To: <1461118677-28142-1-git-send-email-kys@microsoft.com>

On Hyper-V, the VF/PF communication is a via software mediated path
as opposed to the hardware mailbox. Make the necessary
adjustments to support Hyper-V.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
---
	V2: Addressed most of the comments from
	    Alexander Duyck <alexander.duyck@gmail.com>
	    and Rustad, Mark D <mark.d.rustad@intel.com>.

	V3: Addressed additional comments from
	    Alexander Duyck <alexander.duyck@gmail.com>

	V4: Addressed kbuild errors reported by:
	    kbuild test robot <lkp@intel.com>

	V5: Addressed additional comments from
	    Alexander Duyck <alexander.duyck@gmail.com>

 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |   12 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   31 +++-
 drivers/net/ethernet/intel/ixgbevf/mbx.c          |   12 ++
 drivers/net/ethernet/intel/ixgbevf/vf.c           |  216 +++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/vf.h           |    2 +
 5 files changed, 266 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 5ac60ee..3296d27 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -460,9 +460,13 @@ enum ixbgevf_state_t {
 
 enum ixgbevf_boards {
 	board_82599_vf,
+	board_82599_vf_hv,
 	board_X540_vf,
+	board_X540_vf_hv,
 	board_X550_vf,
+	board_X550_vf_hv,
 	board_X550EM_x_vf,
+	board_X550EM_x_vf_hv,
 };
 
 enum ixgbevf_xcast_modes {
@@ -477,6 +481,13 @@ extern const struct ixgbevf_info ixgbevf_X550_vf_info;
 extern const struct ixgbevf_info ixgbevf_X550EM_x_vf_info;
 extern const struct ixgbe_mbx_operations ixgbevf_mbx_ops;
 
+
+extern const struct ixgbevf_info ixgbevf_82599_vf_hv_info;
+extern const struct ixgbevf_info ixgbevf_X540_vf_hv_info;
+extern const struct ixgbevf_info ixgbevf_X550_vf_hv_info;
+extern const struct ixgbevf_info ixgbevf_X550EM_x_vf_hv_info;
+extern const struct ixgbe_mbx_operations ixgbevf_hv_mbx_ops;
+
 /* needed by ethtool.c */
 extern const char ixgbevf_driver_name[];
 extern const char ixgbevf_driver_version[];
@@ -494,6 +505,7 @@ void ixgbevf_free_rx_resources(struct ixgbevf_ring *);
 void ixgbevf_free_tx_resources(struct ixgbevf_ring *);
 void ixgbevf_update_stats(struct ixgbevf_adapter *adapter);
 int ethtool_ioctl(struct ifreq *ifr);
+bool ixgbevf_on_hyperv(struct ixgbe_hw *hw);
 
 extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector);
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 007cbe0..c4bb480 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -62,10 +62,14 @@ static char ixgbevf_copyright[] =
 	"Copyright (c) 2009 - 2015 Intel Corporation.";
 
 static const struct ixgbevf_info *ixgbevf_info_tbl[] = {
-	[board_82599_vf] = &ixgbevf_82599_vf_info,
-	[board_X540_vf]  = &ixgbevf_X540_vf_info,
-	[board_X550_vf]  = &ixgbevf_X550_vf_info,
-	[board_X550EM_x_vf] = &ixgbevf_X550EM_x_vf_info,
+	[board_82599_vf]	= &ixgbevf_82599_vf_info,
+	[board_82599_vf_hv]	= &ixgbevf_82599_vf_hv_info,
+	[board_X540_vf]		= &ixgbevf_X540_vf_info,
+	[board_X540_vf_hv]	= &ixgbevf_X540_vf_hv_info,
+	[board_X550_vf]		= &ixgbevf_X550_vf_info,
+	[board_X550_vf_hv]	= &ixgbevf_X550_vf_hv_info,
+	[board_X550EM_x_vf]	= &ixgbevf_X550EM_x_vf_info,
+	[board_X550EM_x_vf_hv]	= &ixgbevf_X550EM_x_vf_hv_info,
 };
 
 /* ixgbevf_pci_tbl - PCI Device ID Table
@@ -78,9 +82,13 @@ static const struct ixgbevf_info *ixgbevf_info_tbl[] = {
  */
 static const struct pci_device_id ixgbevf_pci_tbl[] = {
 	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_82599_VF), board_82599_vf },
+	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_82599_VF_HV), board_82599_vf_hv },
 	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_X540_VF), board_X540_vf },
+	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_X540_VF_HV), board_X540_vf_hv },
 	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_X550_VF), board_X550_vf },
+	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_X550_VF_HV), board_X550_vf_hv },
 	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_X550EM_X_VF), board_X550EM_x_vf },
+	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_X550EM_X_VF_HV), board_X550EM_x_vf_hv},
 	/* required last entry */
 	{0, }
 };
@@ -1795,7 +1803,10 @@ static void ixgbevf_configure_rx(struct ixgbevf_adapter *adapter)
 		ixgbevf_setup_vfmrqc(adapter);
 
 	/* notify the PF of our intent to use this size of frame */
-	ixgbevf_rlpml_set_vf(hw, netdev->mtu + ETH_HLEN + ETH_FCS_LEN);
+	if (!ixgbevf_on_hyperv(hw))
+		ixgbevf_rlpml_set_vf(hw, netdev->mtu + ETH_HLEN + ETH_FCS_LEN);
+	else
+		ixgbevf_hv_rlpml_set_vf(hw, netdev->mtu + ETH_HLEN + ETH_FCS_LEN);
 
 	/* Setup the HW Rx Head and Tail Descriptor Pointers and
 	 * the Base and Length of the Rx Descriptor Ring
@@ -2056,7 +2067,10 @@ static void ixgbevf_negotiate_api(struct ixgbevf_adapter *adapter)
 	spin_lock_bh(&adapter->mbx_lock);
 
 	while (api[idx] != ixgbe_mbox_api_unknown) {
-		err = ixgbevf_negotiate_api_version(hw, api[idx]);
+		if (!ixgbevf_on_hyperv(hw))
+			err = ixgbevf_negotiate_api_version(hw, api[idx]);
+		else
+			err = ixgbevf_hv_negotiate_api_version(hw, api[idx]);
 		if (!err)
 			break;
 		idx++;
@@ -3727,7 +3741,10 @@ static int ixgbevf_change_mtu(struct net_device *netdev, int new_mtu)
 	netdev->mtu = new_mtu;
 
 	/* notify the PF of our intent to use this size of frame */
-	ixgbevf_rlpml_set_vf(hw, max_frame);
+	if (!ixgbevf_on_hyperv(hw))
+		ixgbevf_rlpml_set_vf(hw, max_frame);
+	else
+		ixgbevf_hv_rlpml_set_vf(hw, max_frame);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.c b/drivers/net/ethernet/intel/ixgbevf/mbx.c
index dc68fea..298a0da 100644
--- a/drivers/net/ethernet/intel/ixgbevf/mbx.c
+++ b/drivers/net/ethernet/intel/ixgbevf/mbx.c
@@ -346,3 +346,15 @@ const struct ixgbe_mbx_operations ixgbevf_mbx_ops = {
 	.check_for_rst	= ixgbevf_check_for_rst_vf,
 };
 
+/**
+ * Mailbox operations when running on Hyper-V.
+ * On Hyper-V, PF/VF communiction is not through the
+ * hardware mailbox; this communication is through
+ * a software mediated path.
+ * Most mail box operations are noop while running on
+ * Hyper-V.
+ */
+const struct ixgbe_mbx_operations ixgbevf_hv_mbx_ops = {
+	.init_params	= ixgbevf_init_mbx_params_vf,
+	.check_for_rst	= ixgbevf_check_for_rst_vf,
+};
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c b/drivers/net/ethernet/intel/ixgbevf/vf.c
index 4d613a4..7208956 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -27,6 +27,13 @@
 #include "vf.h"
 #include "ixgbevf.h"
 
+/*
+ * On Hyper-V, to reset, we need to read from this offset
+ * from the PCI config space. This is the mechanism used on
+ * Hyper-V to support PF/VF communication.
+ */
+#define IXGBE_HV_RESET_OFFSET           0x201
+
 /**
  *  ixgbevf_start_hw_vf - Prepare hardware for Tx/Rx
  *  @hw: pointer to hardware structure
@@ -126,6 +133,27 @@ static s32 ixgbevf_reset_hw_vf(struct ixgbe_hw *hw)
 }
 
 /**
+ * Hyper-V variant; the VF/PF communication is through the PCI
+ * config space.
+ */
+static s32 ixgbevf_hv_reset_hw_vf(struct ixgbe_hw *hw)
+{
+#if IS_ENABLED(CONFIG_PCI_MMCONFIG)
+	struct ixgbevf_adapter *adapter = hw->back;
+	int i;
+
+	for (i = 0; i < 6; i++)
+		pci_read_config_byte(adapter->pdev,
+				     (i + IXGBE_HV_RESET_OFFSET),
+				     &hw->mac.perm_addr[i]);
+	return 0;
+#else
+	pr_err("PCI_MMCONFIG needs to be enabled for Hyper-V\n");
+	return -EOPNOTSUPP;
+#endif
+}
+
+/**
  *  ixgbevf_stop_hw_vf - Generic stop Tx/Rx units
  *  @hw: pointer to hardware structure
  *
@@ -258,6 +286,11 @@ static s32 ixgbevf_set_uc_addr_vf(struct ixgbe_hw *hw, u32 index, u8 *addr)
 	return ret_val;
 }
 
+static s32 ixgbevf_hv_set_uc_addr_vf(struct ixgbe_hw *hw, u32 index, u8 *addr)
+{
+	return -EOPNOTSUPP;
+}
+
 /**
  * ixgbevf_get_reta_locked - get the RSS redirection table (RETA) contents.
  * @adapter: pointer to the port handle
@@ -416,6 +449,26 @@ static s32 ixgbevf_set_rar_vf(struct ixgbe_hw *hw, u32 index, u8 *addr,
 	return ret_val;
 }
 
+/**
+ *  ixgbevf_hv_set_rar_vf - set device MAC address Hyper-V variant
+ *  @hw: pointer to hardware structure
+ *  @index: Receive address register to write
+ *  @addr: Address to put into receive address register
+ *  @vmdq: Unused in this implementation
+ *
+ * We don't really allow setting the device MAC address. However,
+ * if the address being set is the permanent MAC address we will
+ * permit that.
+ **/
+static s32 ixgbevf_hv_set_rar_vf(struct ixgbe_hw *hw, u32 index, u8 *addr,
+				 u32 vmdq)
+{
+	if (ether_addr_equal(addr, hw->mac.perm_addr))
+		return 0;
+
+	return -EOPNOTSUPP;
+}
+
 static void ixgbevf_write_msg_read_ack(struct ixgbe_hw *hw,
 				       u32 *msg, u16 size)
 {
@@ -473,6 +526,15 @@ static s32 ixgbevf_update_mc_addr_list_vf(struct ixgbe_hw *hw,
 }
 
 /**
+ * Hyper-V variant - just a stub.
+ */
+static s32 ixgbevf_hv_update_mc_addr_list_vf(struct ixgbe_hw *hw,
+					  struct net_device *netdev)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
  *  ixgbevf_update_xcast_mode - Update Multicast mode
  *  @hw: pointer to the HW structure
  *  @netdev: pointer to net device structure
@@ -513,6 +575,15 @@ static s32 ixgbevf_update_xcast_mode(struct ixgbe_hw *hw,
 }
 
 /**
+ * Hyper-V variant - just a stub.
+ */
+static s32 ixgbevf_hv_update_xcast_mode(struct ixgbe_hw *hw,
+					struct net_device *netdev, int xcast_mode)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
  *  ixgbevf_set_vfta_vf - Set/Unset VLAN filter table address
  *  @hw: pointer to the HW structure
  *  @vlan: 12 bit VLAN ID
@@ -551,6 +622,15 @@ mbx_err:
 }
 
 /**
+ * Hyper-V variant - just a stub.
+ */
+static s32 ixgbevf_hv_set_vfta_vf(struct ixgbe_hw *hw, u32 vlan, u32 vind,
+				  bool vlan_on)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
  *  ixgbevf_setup_mac_link_vf - Setup MAC link settings
  *  @hw: pointer to hardware structure
  *  @speed: Unused in this implementation
@@ -656,6 +736,67 @@ out:
 }
 
 /**
+ * Hyper-V variant; there is no mailbox communication.
+ */
+static s32 ixgbevf_hv_check_mac_link_vf(struct ixgbe_hw *hw,
+					ixgbe_link_speed *speed,
+					bool *link_up,
+					bool autoneg_wait_to_complete)
+{
+	struct ixgbe_mbx_info *mbx = &hw->mbx;
+	struct ixgbe_mac_info *mac = &hw->mac;
+	u32 links_reg;
+
+	/* If we were hit with a reset drop the link */
+	if (!mbx->ops.check_for_rst(hw) || !mbx->timeout)
+		mac->get_link_status = true;
+
+	if (!mac->get_link_status)
+		goto out;
+
+	/* if link status is down no point in checking to see if pf is up */
+	links_reg = IXGBE_READ_REG(hw, IXGBE_VFLINKS);
+	if (!(links_reg & IXGBE_LINKS_UP))
+		goto out;
+
+	/* for SFP+ modules and DA cables on 82599 it can take up to 500usecs
+	 * before the link status is correct
+	 */
+	if (mac->type == ixgbe_mac_82599_vf) {
+		int i;
+
+		for (i = 0; i < 5; i++) {
+			udelay(100);
+			links_reg = IXGBE_READ_REG(hw, IXGBE_VFLINKS);
+
+			if (!(links_reg & IXGBE_LINKS_UP))
+				goto out;
+		}
+	}
+
+	switch (links_reg & IXGBE_LINKS_SPEED_82599) {
+	case IXGBE_LINKS_SPEED_10G_82599:
+		*speed = IXGBE_LINK_SPEED_10GB_FULL;
+		break;
+	case IXGBE_LINKS_SPEED_1G_82599:
+		*speed = IXGBE_LINK_SPEED_1GB_FULL;
+		break;
+	case IXGBE_LINKS_SPEED_100_82599:
+		*speed = IXGBE_LINK_SPEED_100_FULL;
+		break;
+	}
+
+	/* if we passed all the tests above then the link is up and we no
+	 * longer need to check for link
+	 */
+	mac->get_link_status = false;
+
+out:
+	*link_up = !mac->get_link_status;
+	return 0;
+}
+
+/**
  *  ixgbevf_rlpml_set_vf - Set the maximum receive packet length
  *  @hw: pointer to the HW structure
  *  @max_size: value to assign to max frame size
@@ -670,6 +811,25 @@ void ixgbevf_rlpml_set_vf(struct ixgbe_hw *hw, u16 max_size)
 }
 
 /**
+ *  ixgbevf_hv_rlpml_set_vf - Set the maximum receive packet length
+ *  @hw: pointer to the HW structure
+ *  @max_size: value to assign to max frame size
+ *  Hyper-V variant.
+ **/
+void ixgbevf_hv_rlpml_set_vf(struct ixgbe_hw *hw, u16 max_size)
+{
+	u32 reg;
+
+	/* If we are on Hyper-V, we implement
+	 * this functionality differently.
+	 */
+	reg =  IXGBE_READ_REG(hw, IXGBE_VFRXDCTL(0));
+	/* CRC == 4 */
+	reg |= ((max_size + 4) | IXGBE_RXDCTL_RLPML_EN);
+	IXGBE_WRITE_REG(hw, IXGBE_VFRXDCTL(0), reg);
+}
+
+/**
  *  ixgbevf_negotiate_api_version - Negotiate supported API version
  *  @hw: pointer to the HW structure
  *  @api: integer containing requested API version
@@ -703,6 +863,22 @@ int ixgbevf_negotiate_api_version(struct ixgbe_hw *hw, int api)
 	return err;
 }
 
+/**
+ *  ixgbevf_hv_negotiate_api_version - Negotiate supported API version
+ *  @hw: pointer to the HW structure
+ *  @api: integer containing requested API version
+ *  Hyper-V version - only ixgbe_mbox_api_10 supported.
+ **/
+int ixgbevf_hv_negotiate_api_version(struct ixgbe_hw *hw, int api)
+{
+	/* Hyper-V only supports api version ixgbe_mbox_api_10
+	 */
+	if (api != ixgbe_mbox_api_10)
+		return IXGBE_ERR_INVALID_ARGUMENT;
+
+	return 0;
+}
+
 int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int *num_tcs,
 		       unsigned int *default_tc)
 {
@@ -776,22 +952,62 @@ static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
 	.set_vfta		= ixgbevf_set_vfta_vf,
 };
 
+static const struct ixgbe_mac_operations ixgbevf_hv_mac_ops = {
+	.init_hw		= ixgbevf_init_hw_vf,
+	.reset_hw		= ixgbevf_hv_reset_hw_vf,
+	.start_hw		= ixgbevf_start_hw_vf,
+	.get_mac_addr		= ixgbevf_get_mac_addr_vf,
+	.stop_adapter		= ixgbevf_stop_hw_vf,
+	.setup_link		= ixgbevf_setup_mac_link_vf,
+	.check_link		= ixgbevf_hv_check_mac_link_vf,
+	.set_rar		= ixgbevf_hv_set_rar_vf,
+	.update_mc_addr_list	= ixgbevf_hv_update_mc_addr_list_vf,
+	.update_xcast_mode	= ixgbevf_hv_update_xcast_mode,
+	.set_uc_addr		= ixgbevf_hv_set_uc_addr_vf,
+	.set_vfta		= ixgbevf_hv_set_vfta_vf,
+};
+
 const struct ixgbevf_info ixgbevf_82599_vf_info = {
 	.mac = ixgbe_mac_82599_vf,
 	.mac_ops = &ixgbevf_mac_ops,
 };
 
+const struct ixgbevf_info ixgbevf_82599_vf_hv_info = {
+	.mac = ixgbe_mac_82599_vf,
+	.mac_ops = &ixgbevf_hv_mac_ops,
+};
+
 const struct ixgbevf_info ixgbevf_X540_vf_info = {
 	.mac = ixgbe_mac_X540_vf,
 	.mac_ops = &ixgbevf_mac_ops,
 };
 
+const struct ixgbevf_info ixgbevf_X540_vf_hv_info = {
+	.mac = ixgbe_mac_X540_vf,
+	.mac_ops = &ixgbevf_hv_mac_ops,
+};
+
 const struct ixgbevf_info ixgbevf_X550_vf_info = {
 	.mac = ixgbe_mac_X550_vf,
 	.mac_ops = &ixgbevf_mac_ops,
 };
 
+const struct ixgbevf_info ixgbevf_X550_vf_hv_info = {
+	.mac = ixgbe_mac_X550_vf,
+	.mac_ops = &ixgbevf_hv_mac_ops,
+};
+
 const struct ixgbevf_info ixgbevf_X550EM_x_vf_info = {
 	.mac = ixgbe_mac_X550EM_x_vf,
 	.mac_ops = &ixgbevf_mac_ops,
 };
+
+const struct ixgbevf_info ixgbevf_X550EM_x_vf_hv_info = {
+	.mac = ixgbe_mac_X550EM_x_vf,
+	.mac_ops = &ixgbevf_hv_mac_ops,
+};
+
+bool ixgbevf_on_hyperv(struct ixgbe_hw *hw)
+{
+	return hw->mbx.ops.check_for_msg == NULL;
+}
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h
index ef9f773..658883e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -208,7 +208,9 @@ static inline u32 ixgbe_read_reg_array(struct ixgbe_hw *hw, u32 reg,
 #define IXGBE_READ_REG_ARRAY(h, r, o) ixgbe_read_reg_array(h, r, o)
 
 void ixgbevf_rlpml_set_vf(struct ixgbe_hw *hw, u16 max_size);
+void ixgbevf_hv_rlpml_set_vf(struct ixgbe_hw *hw, u16 max_size);
 int ixgbevf_negotiate_api_version(struct ixgbe_hw *hw, int api);
+int ixgbevf_hv_negotiate_api_version(struct ixgbe_hw *hw, int api);
 int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int *num_tcs,
 		       unsigned int *default_tc);
 int ixgbevf_get_reta_locked(struct ixgbe_hw *hw, u32 *reta, int num_rx_queues);
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH net-next V5 1/2] ethernet: intel: Add the device ID's presented while running on Hyper-V
From: K. Y. Srinivasan @ 2016-04-20  2:17 UTC (permalink / raw)
  To: davem, netdev, linux-kernel, devel, olaf, apw, jasowang, eli,
	jackm, yevgenyp, john.ronciak, intel-wired-lan, alexander.duyck
In-Reply-To: <1461118655-28103-1-git-send-email-kys@microsoft.com>

Intel SR-IOV cards present different ID when running on Hyper-V.
Add the device IDs presented while running on Hyper-V.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
---
	V5: No change from V1

 drivers/net/ethernet/intel/ixgbevf/defines.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 5843458..1306a0d 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -33,6 +33,11 @@
 #define IXGBE_DEV_ID_X550_VF		0x1565
 #define IXGBE_DEV_ID_X550EM_X_VF	0x15A8
 
+#define IXGBE_DEV_ID_82599_VF_HV	0x152E
+#define IXGBE_DEV_ID_X540_VF_HV		0x1530
+#define IXGBE_DEV_ID_X550_VF_HV		0x1564
+#define IXGBE_DEV_ID_X550EM_X_VF_HV	0x15A9
+
 #define IXGBE_VF_IRQ_CLEAR_MASK		7
 #define IXGBE_VF_MAX_TX_QUEUES		8
 #define IXGBE_VF_MAX_RX_QUEUES		8
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH net-next V5 0/2] ethernet: intel: Support Hyper-V hosts
From: K. Y. Srinivasan @ 2016-04-20  2:17 UTC (permalink / raw)
  To: davem, netdev, linux-kernel, devel, olaf, apw, jasowang, eli,
	jackm, yevgenyp, john.ronciak, intel-wired-lan, alexander.duyck

Make adjustments to the Intel 10G VF driver to support
running on Hyper-V hosts.

K. Y. Srinivasan (2):
  ethernet: intel: Add the device ID's presented while running on
    Hyper-V
  intel: ixgbevf: Support Windows hosts (Hyper-V)

 drivers/net/ethernet/intel/ixgbevf/defines.h      |    5 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |   12 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   31 +++-
 drivers/net/ethernet/intel/ixgbevf/mbx.c          |   12 ++
 drivers/net/ethernet/intel/ixgbevf/vf.c           |  216 +++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbevf/vf.h           |    2 +
 6 files changed, 271 insertions(+), 7 deletions(-)

-- 
1.7.4.1

^ permalink raw reply

* Re: [PATCH net-next v5] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: David Ahern @ 2016-04-20  1:53 UTC (permalink / raw)
  To: Johannes Berg, David Miller
  Cc: eric.dumazet, roopa, netdev, jhs, tgraf, nicolas.dichtel,
	egrumbach
In-Reply-To: <1461094901.2766.26.camel@sipsolutions.net>

On 4/19/16 1:41 PM, Johannes Berg wrote:
> On Tue, 2016-04-19 at 14:23 -0400, David Miller wrote:
>>
>> I like this nlattr flag idea, it's opt-in and any tool can be updated
>> to use the new facility.
>
> Right.
>
>> I'd be willing the backport the nlattr flag bit change to all stable
>> releases as well.
>
> I'm not really convinced that helps much - tools still can't really
> rely on the kernel supporting it.
>
> One thing that might work is that a tool might say it only wants to
> support kernels that have this change (assuming we backport it to
> enough kernels etc.); in that case the tool could add some absolutely
> must-have information (like the IFINDEX or whatever, depends on the
> command) with the new flag, this would get rejected since unpatched
> kernels wouldn't understand the flag and wouldn't find that must-have
> information.
>
> Nevertheless, I think it's most reliable with new netlink commands that
> are known to be available only on kernels understand and treating the
> flag correctly.

The kernel can set a flag in the response that it acknowledges the new 
attribute/flag. I did that for filtering neigh dumps -- 21fdd092acc7.

^ permalink raw reply

* Re: [PATCH net-next 0/7] net: network drivers should not depend on geneve/vxlan
From: Hannes Frederic Sowa @ 2016-04-20  1:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jesse
In-Reply-To: <20160419.211134.2163748309501010595.davem@davemloft.net>

On 20.04.2016 03:11, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 20 Apr 2016 03:06:13 +0200
> 
>> On Wed, Apr 20, 2016, at 02:27, David Miller wrote:
>>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>>> Date: Mon, 18 Apr 2016 21:19:41 +0200
>>>
>>>> This patchset removes the dependency of network drivers on vxlan or
>>>> geneve, so those don't get autoloaded when the nic driver is loaded.
>>>>
>>>> Also audited the code such that vxlan_get_rx_port and geneve_get_rx_port
>>>> are not called without rtnl lock.
>>>
>>> In net-next, the 'qed' driver has tunneling offload support too.
>>> Don't you need to update this series to handle that driver as
>>> well?
>>
>> I audited qede as well:
>>
>> qede calls {vxlan,geneve}_get_rx_port only from ndo_open which isn't
>> reused anywhere else in the driver, only from ndo_open, which holds
>> rtnl_lock also. Thus the driver is safe and doesn't need a change.
> 
> I'm talking about your final patches which elide the dependencies.

The dependency will be removed with only the last two patches for all
drivers. The static inline functions {vxlan,geneve}_get_rx_port now
simply expand to a call to call_netdevice_notifiers, which is provided
by the core kernel and not by the vxlan or geneve module anymore.

During testing I saw that very often some drivers actually reused their
*_attach or *_open function from other callbacks, e.g. pci ones. While
they often didn't care about rtnl_lock, we now require that for the new
get_rx_port functions, thus I added them.

I hadn't had to modify all drivers with vxlan offloading support, e.g.
also mlx5 and the broadcom drivers, bnx2x and bnxt, were already fine.

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net-next] macvlan: fix failure during registration
From: Francesco Ruggeri @ 2016-04-20  1:13 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, David S. Miller, Mahesh Bandewar
In-Reply-To: <87wpnur4a4.fsf@x220.int.ebiederm.org>

On Mon, Apr 18, 2016 at 2:33 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> But please see if you can get macvlan_init to do the necessary work.
> That should simplify everything, and make clever games unnecessary.

Using macvlan_init seems to work. I will test it a couple of days and
then resubmit.

Thanks,
Francesco

>
> Eric

^ permalink raw reply

* Re: [PATCH net-next 0/7] net: network drivers should not depend on geneve/vxlan
From: David Miller @ 2016-04-20  1:11 UTC (permalink / raw)
  To: hannes; +Cc: netdev, jesse
In-Reply-To: <1461114373.2649930.583890089.0C126826@webmail.messagingengine.com>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 20 Apr 2016 03:06:13 +0200

> On Wed, Apr 20, 2016, at 02:27, David Miller wrote:
>> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Date: Mon, 18 Apr 2016 21:19:41 +0200
>> 
>> > This patchset removes the dependency of network drivers on vxlan or
>> > geneve, so those don't get autoloaded when the nic driver is loaded.
>> > 
>> > Also audited the code such that vxlan_get_rx_port and geneve_get_rx_port
>> > are not called without rtnl lock.
>> 
>> In net-next, the 'qed' driver has tunneling offload support too.
>> Don't you need to update this series to handle that driver as
>> well?
> 
> I audited qede as well:
> 
> qede calls {vxlan,geneve}_get_rx_port only from ndo_open which isn't
> reused anywhere else in the driver, only from ndo_open, which holds
> rtnl_lock also. Thus the driver is safe and doesn't need a change.

I'm talking about your final patches which elide the dependencies.

^ permalink raw reply

* Re: [PATCH net-next] tcp-tso: do not split TSO packets at retransmit time
From: David Miller @ 2016-04-20  1:10 UTC (permalink / raw)
  To: eric.dumazet; +Cc: edumazet, netdev, ncardwell, ycheng
In-Reply-To: <1461113390.10638.233.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 19 Apr 2016 17:49:50 -0700

> On Tue, 2016-04-19 at 20:36 -0400, David Miller wrote:
>> From: Eric Dumazet <edumazet@google.com>
>> Date: Mon, 18 Apr 2016 13:56:12 -0700
>> 
>> > 1 % packet losses are common today, and at 100Gbit speeds, this
>> > translates to ~80,000 losses per second. If we are unlucky and
>> > first MSS of a 45-MSS TSO is lost, we are cooking 44 MSS segments
>> > at rtx instead of a single 44-MSS TSO packet.
>> 
>> I'm having trouble understanding this.
>> 
>> If the first mss is lost, then we simply chop the 45 MSS TSO skb into
>> two pieces.  The first piece is a 1 MSS chunk for the retransmit, and
>> the second piece is remaining 44 MSS TSO skb.
>> 
>> I am pretty sure that is what the current stack does, and regardless
>> it is certainly what I intended it to do all those years ago when I
>> wrote this code. :-)
>> 
>> The only case where I can see this patch helping is when we have to
>> retransmit multi-mss chunks.  And yes indeed, it might be a useful
>> optimization to TSO those frames rather than sending them one MSS at a
>> time.
> 
> Yeah, it looks like I got the changelog wrong. We definitely see these
> 1-MSS splits during retransmits all the time, and we had to change the
> sch_fq flow_limit from 100 to 1000 packets to cope with that. (TCP Small
> Queues does not guard TCP from sending hundred of rtx at the same time)

Ok, please rewrite the commit log message so that it is more accuate.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 0/7] net: network drivers should not depend on geneve/vxlan
From: Hannes Frederic Sowa @ 2016-04-20  1:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jesse
In-Reply-To: <20160419.202750.1844614056540086769.davem@davemloft.net>

On Wed, Apr 20, 2016, at 02:27, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Mon, 18 Apr 2016 21:19:41 +0200
> 
> > This patchset removes the dependency of network drivers on vxlan or
> > geneve, so those don't get autoloaded when the nic driver is loaded.
> > 
> > Also audited the code such that vxlan_get_rx_port and geneve_get_rx_port
> > are not called without rtnl lock.
> 
> In net-next, the 'qed' driver has tunneling offload support too.
> Don't you need to update this series to handle that driver as
> well?

I audited qede as well:

qede calls {vxlan,geneve}_get_rx_port only from ndo_open which isn't
reused anywhere else in the driver, only from ndo_open, which holds
rtnl_lock also. Thus the driver is safe and doesn't need a change.

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net-next] tcp-tso: do not split TSO packets at retransmit time
From: Eric Dumazet @ 2016-04-20  0:49 UTC (permalink / raw)
  To: David Miller; +Cc: edumazet, netdev, ncardwell, ycheng
In-Reply-To: <20160419.203656.505846026424405432.davem@davemloft.net>

On Tue, 2016-04-19 at 20:36 -0400, David Miller wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 18 Apr 2016 13:56:12 -0700
> 
> > 1 % packet losses are common today, and at 100Gbit speeds, this
> > translates to ~80,000 losses per second. If we are unlucky and
> > first MSS of a 45-MSS TSO is lost, we are cooking 44 MSS segments
> > at rtx instead of a single 44-MSS TSO packet.
> 
> I'm having trouble understanding this.
> 
> If the first mss is lost, then we simply chop the 45 MSS TSO skb into
> two pieces.  The first piece is a 1 MSS chunk for the retransmit, and
> the second piece is remaining 44 MSS TSO skb.
> 
> I am pretty sure that is what the current stack does, and regardless
> it is certainly what I intended it to do all those years ago when I
> wrote this code. :-)
> 
> The only case where I can see this patch helping is when we have to
> retransmit multi-mss chunks.  And yes indeed, it might be a useful
> optimization to TSO those frames rather than sending them one MSS at a
> time.

Yeah, it looks like I got the changelog wrong. We definitely see these
1-MSS splits during retransmits all the time, and we had to change the
sch_fq flow_limit from 100 to 1000 packets to cope with that. (TCP Small
Queues does not guard TCP from sending hundred of rtx at the same time)

^ permalink raw reply

* Re: [PATCH] VSOCK: Only check error on skb_recv_datagram when skb is NULL
From: David Miller @ 2016-04-20  0:42 UTC (permalink / raw)
  To: jhansen; +Cc: pv-drivers, netdev, gregkh, linux-kernel, virtualization
In-Reply-To: <1461049132-59927-1-git-send-email-jhansen@vmware.com>

From: Jorgen Hansen <jhansen@vmware.com>
Date: Mon, 18 Apr 2016 23:58:52 -0700

> If skb_recv_datagram returns an skb, we should ignore the err
> value returned. Otherwise, datagram receives will return EAGAIN
> when they have to wait for a datagram.
> 
> Acked-by: Adit Ranadive <aditr@vmware.com>
> Signed-off-by: Jorgen Hansen <jhansen@vmware.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] tcp-tso: do not split TSO packets at retransmit time
From: David Miller @ 2016-04-20  0:36 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, ncardwell, ycheng, eric.dumazet
In-Reply-To: <1461012972-15757-1-git-send-email-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Mon, 18 Apr 2016 13:56:12 -0700

> 1 % packet losses are common today, and at 100Gbit speeds, this
> translates to ~80,000 losses per second. If we are unlucky and
> first MSS of a 45-MSS TSO is lost, we are cooking 44 MSS segments
> at rtx instead of a single 44-MSS TSO packet.

I'm having trouble understanding this.

If the first mss is lost, then we simply chop the 45 MSS TSO skb into
two pieces.  The first piece is a 1 MSS chunk for the retransmit, and
the second piece is remaining 44 MSS TSO skb.

I am pretty sure that is what the current stack does, and regardless
it is certainly what I intended it to do all those years ago when I
wrote this code. :-)

The only case where I can see this patch helping is when we have to
retransmit multi-mss chunks.  And yes indeed, it might be a useful
optimization to TSO those frames rather than sending them one MSS at a
time.

^ permalink raw reply

* Re: [PATCH net-next] net: dsa: kill circular reference with slave priv
From: David Miller @ 2016-04-20  0:29 UTC (permalink / raw)
  To: vivien.didelot; +Cc: netdev, linux-kernel, kernel, f.fainelli, andrew
In-Reply-To: <1461010224-10808-1-git-send-email-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Date: Mon, 18 Apr 2016 16:10:24 -0400

> The dsa_slave_priv structure does not need a pointer to its net_device.
> Kill it.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Applied.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox