* [PATCH -next v3 0/3] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-11-03 13:01 UTC (permalink / raw)
To: netdev
Here is v3 of the patchset, addressing Erics comments wrt. validating vs.
tcp_ecn sysctl when decoding cookie timestamp.
When using syn cookies, then do not simply trust that the echoed timestamp
was not modified to make sure that ecn is not turned on magically when it
is disabled on the host.
The first two patches, which were not part of earlier series, prepare
the cookie code for the ecn route metrics change by allowing is to
more easily use the existing dst object for ecn validation.
The 3rd patch adds the ecn route metric feature support.
It is almost the same as in v2, except that we'll now also test the
dst_features when decoding a syn cookie timestamp that indicates ecn support.
These three patches then allow turning on explicit congestion notification
based on the destination network.
For example, assuming the default tcp_ecn sysctl '2', the following will
enable ecn (tcp_ecn=1 behaviour, i.e. request ecn to be enabled for a
tcp connection) for all connections to hosts inside the 192.168.2/24 network:
ip route change 192.168.2.0/24 dev eth0 features ecn
Having a more fine-grained per-route setting can be beneficial for
various reasons, for example 1) within data centers, or 2) local ISPs
may deploy ECN support for their own video/streaming services [1], etc.
Joint work with Daniel Borkmann, feature suggested by Hannes Frederic Sowa.
The patch to enable this in iproute2 will be posted shortly, it is currently
also available here:
http://git.breakpoint.cc/cgit/fw/iproute2.git/commit/?h=iproute_features&id=8843d2d8973fb81c78a7efe6d42e3a17d739003e
[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
^ permalink raw reply
* [PATCH -next v3 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: Florian Westphal @ 2014-11-03 13:01 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415019720-17106-1-git-send-email-fw@strlen.de>
Was a bit more difficult to read than needed due to magic shifts;
add defines and document the used encoding scheme.
Joint work with Daniel Borkmann.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
This patch was not part of earlier versions of the set.
net/ipv4/syncookies.c | 50 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 15 deletions(-)
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 4ac7bca..c3792c0 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -19,10 +19,6 @@
#include <net/tcp.h>
#include <net/route.h>
-/* Timestamps: lowest bits store TCP options */
-#define TSBITS 6
-#define TSMASK (((__u32)1 << TSBITS) - 1)
-
extern int sysctl_tcp_syncookies;
static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
@@ -30,6 +26,30 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
#define COOKIEBITS 24 /* Upper bits store count */
#define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
+/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
+ * stores TCP options:
+ *
+ * MSB LSB
+ * | 31 ... 6 | 5 | 4 | 3 2 1 0 |
+ * | Timestamp | ECN | SACK | WScale |
+ *
+ * When we receive a valid cookie-ACK, we look at the echoed tsval (if
+ * any) to figure out which TCP options we should use for the rebuilt
+ * connection.
+ *
+ * A WScale setting of '0xf' (which is an invalid scaling value)
+ * means that original syn did not include the TCP window scaling option.
+ */
+#define TS_OPT_WSCALE_MASK 0xf
+#define TS_OPT_SACK BIT(4)
+#define TS_OPT_ECN BIT(5)
+/* There is no TS_OPT_TIMESTAMP:
+ * if ACK contains timestamp option, we already know it was
+ * requested/supported by the syn/synack exchange.
+ */
+#define TSBITS 6
+#define TSMASK (((__u32)1 << TSBITS) - 1)
+
static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS],
ipv4_cookie_scratch);
@@ -67,9 +87,11 @@ __u32 cookie_init_timestamp(struct request_sock *req)
ireq = inet_rsk(req);
- options = ireq->wscale_ok ? ireq->snd_wscale : 0xf;
- options |= ireq->sack_ok << 4;
- options |= ireq->ecn_ok << 5;
+ options = ireq->wscale_ok ? ireq->snd_wscale : TS_OPT_WSCALE_MASK;
+ if (ireq->sack_ok)
+ options |= TS_OPT_SACK;
+ if (ireq->ecn_ok)
+ options |= TS_OPT_ECN;
ts = ts_now & ~TSMASK;
ts |= options;
@@ -219,16 +241,13 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
* additional tcp options in the timestamp.
* This extracts these options from the timestamp echo.
*
- * The lowest 4 bits store snd_wscale.
- * next 2 bits indicate SACK and ECN support.
- *
* return false if we decode an option that should not be.
*/
bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
struct net *net, bool *ecn_ok)
{
/* echoed timestamp, lowest bits contain options */
- u32 options = tcp_opt->rcv_tsecr & TSMASK;
+ u32 options = tcp_opt->rcv_tsecr;
if (!tcp_opt->saw_tstamp) {
tcp_clear_options(tcp_opt);
@@ -238,19 +257,20 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
if (!sysctl_tcp_timestamps)
return false;
- tcp_opt->sack_ok = (options & (1 << 4)) ? TCP_SACK_SEEN : 0;
- *ecn_ok = (options >> 5) & 1;
+ tcp_opt->sack_ok = (options & TS_OPT_SACK) ? TCP_SACK_SEEN : 0;
+ *ecn_ok = options & TS_OPT_ECN;
if (*ecn_ok && !net->ipv4.sysctl_tcp_ecn)
return false;
if (tcp_opt->sack_ok && !sysctl_tcp_sack)
return false;
- if ((options & 0xf) == 0xf)
+ if ((options & TS_OPT_WSCALE_MASK) == TS_OPT_WSCALE_MASK)
return true; /* no window scaling */
tcp_opt->wscale_ok = 1;
- tcp_opt->snd_wscale = options & 0xf;
+ tcp_opt->snd_wscale = options & TS_OPT_WSCALE_MASK;
+
return sysctl_tcp_window_scaling != 0;
}
EXPORT_SYMBOL(cookie_check_timestamp);
--
2.0.4
^ permalink raw reply related
* [PATCH -next v3 2/3] syncookies: split cookie_check_timestamp() into two functions
From: Florian Westphal @ 2014-11-03 13:01 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415019720-17106-1-git-send-email-fw@strlen.de>
The function cookie_check_timestamp(), both called from IPv4/6 context,
is being used to decode the echoed timestamp from the SYN/ACK into TCP
options used for follow-up communication with the peer.
We can remove ECN handling from that function, split it into a separate
one, and simply rename the original function into cookie_decode_options().
cookie_decode_options() just fills in tcp_option struct based on the
echoed timestamp received from the peer. Anything that fails in this
function will actually discard the request socket.
While this is the natural place for decoding options such as ECN which
commit 172d69e63c7f ("syncookies: add support for ECN") added, we argue
that in particular for ECN handling, it can be checked at a later point
in time as the request sock would actually not need to be dropped from
this, but just ECN support turned off.
Therefore, we split this functionality into cookie_ecn_ok(), which tells
|us if the timestamp indicates ECN support AND the tcp_ecn sysctl is enabled.
This prepares for per-route ECN support: just looking at the tcp_ecn sysctl
won't be enough anymore at that point; if the timestamp indicates ECN
and sysctl tcp_ecn == 0, we will also need to check the ECN dst metric.
This would mean adding a route lookup to cookie_check_timestamp(), which
we definitely want to avoid. As we already do a route lookup at a later
point in cookie_{v4,v6}_check(), we can simply make use of that as well
for the new cookie_ecn_ok() function w/o any additional cost.
Joint work with Daniel Borkmann.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
This patch was not part of earlier versions of the set.
include/net/tcp.h | 9 ++++-----
net/ipv4/syncookies.c | 31 +++++++++++++++++++++----------
net/ipv6/syncookies.c | 5 ++---
3 files changed, 27 insertions(+), 18 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3a35b15..36c5084 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -490,17 +490,16 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
u16 *mssp);
__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
__u16 *mss);
-#endif
-
__u32 cookie_init_timestamp(struct request_sock *req);
-bool cookie_check_timestamp(struct tcp_options_received *opt, struct net *net,
- bool *ecn_ok);
+bool cookie_timestamp_decode(struct tcp_options_received *opt);
+bool cookie_ecn_ok(const struct tcp_options_received *opt,
+ const struct net *net);
/* From net/ipv6/syncookies.c */
int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
u32 cookie);
struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb);
-#ifdef CONFIG_SYN_COOKIES
+
u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
const struct tcphdr *th, u16 *mssp);
__u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index c3792c0..6de7725 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -241,10 +241,10 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
* additional tcp options in the timestamp.
* This extracts these options from the timestamp echo.
*
- * return false if we decode an option that should not be.
+ * return false if we decode a tcp option that is disabled
+ * on the host.
*/
-bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
- struct net *net, bool *ecn_ok)
+bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
{
/* echoed timestamp, lowest bits contain options */
u32 options = tcp_opt->rcv_tsecr;
@@ -258,9 +258,6 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
return false;
tcp_opt->sack_ok = (options & TS_OPT_SACK) ? TCP_SACK_SEEN : 0;
- *ecn_ok = options & TS_OPT_ECN;
- if (*ecn_ok && !net->ipv4.sysctl_tcp_ecn)
- return false;
if (tcp_opt->sack_ok && !sysctl_tcp_sack)
return false;
@@ -273,7 +270,22 @@ bool cookie_check_timestamp(struct tcp_options_received *tcp_opt,
return sysctl_tcp_window_scaling != 0;
}
-EXPORT_SYMBOL(cookie_check_timestamp);
+EXPORT_SYMBOL(cookie_timestamp_decode);
+
+bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
+ const struct net *net)
+{
+ bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
+
+ if (!ecn_ok)
+ return false;
+
+ if (net->ipv4.sysctl_tcp_ecn)
+ return true;
+
+ return false;
+}
+EXPORT_SYMBOL(cookie_ecn_ok);
struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
{
@@ -289,7 +301,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
int mss;
struct rtable *rt;
__u8 rcv_wscale;
- bool ecn_ok = false;
struct flowi4 fl4;
if (!sysctl_tcp_syncookies || !th->ack || th->rst)
@@ -310,7 +321,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
memset(&tcp_opt, 0, sizeof(tcp_opt));
tcp_parse_options(skb, &tcp_opt, 0, NULL);
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ if (!cookie_timestamp_decode(&tcp_opt))
goto out;
ret = NULL;
@@ -328,7 +339,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
ireq->ir_loc_addr = ip_hdr(skb)->daddr;
ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
ireq->ir_mark = inet_request_mark(sk, skb);
- ireq->ecn_ok = ecn_ok;
ireq->snd_wscale = tcp_opt.snd_wscale;
ireq->sack_ok = tcp_opt.sack_ok;
ireq->wscale_ok = tcp_opt.wscale_ok;
@@ -377,6 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
dst_metric(&rt->dst, RTAX_INITRWND));
ireq->rcv_wscale = rcv_wscale;
+ ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
ret = get_cookie_sock(sk, skb, req, &rt->dst);
/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index be291ba..52cc8cb 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -166,7 +166,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
int mss;
struct dst_entry *dst;
__u8 rcv_wscale;
- bool ecn_ok = false;
if (!sysctl_tcp_syncookies || !th->ack || th->rst)
goto out;
@@ -186,7 +185,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
memset(&tcp_opt, 0, sizeof(tcp_opt));
tcp_parse_options(skb, &tcp_opt, 0, NULL);
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ if (!cookie_timestamp_decode(&tcp_opt))
goto out;
ret = NULL;
@@ -223,7 +222,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
req->expires = 0UL;
req->num_retrans = 0;
- ireq->ecn_ok = ecn_ok;
ireq->snd_wscale = tcp_opt.snd_wscale;
ireq->sack_ok = tcp_opt.sack_ok;
ireq->wscale_ok = tcp_opt.wscale_ok;
@@ -264,6 +262,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
dst_metric(dst, RTAX_INITRWND));
ireq->rcv_wscale = rcv_wscale;
+ ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
ret = get_cookie_sock(sk, skb, req, dst);
out:
--
2.0.4
^ permalink raw reply related
* [PATCH -next v3 3/3] net: allow setting ecn via routing table
From: Florian Westphal @ 2014-11-03 13:02 UTC (permalink / raw)
To: netdev; +Cc: Florian Westphal, Daniel Borkmann
In-Reply-To: <1415019720-17106-1-git-send-email-fw@strlen.de>
This patch allows to set ECN on a per-route basis in case the sysctl
tcp_ecn is not set to 1. In other words, when ECN is set for specific
routes, it provides a tcp_ecn=1 behaviour for that route while the rest
of the stack acts according to the global settings.
One can use 'ip route change dev $dev $net features ecn' to toggle this.
Having a more fine-grained per-route setting can be beneficial for various
reasons, for example, 1) within data centers, or 2) local ISPs may deploy
ECN support for their own video/streaming services [1], etc.
There was a recent measurement study/paper [2] which scanned the Alexa's
publicly available top million websites list from a vantage point in US,
Europe and Asia:
Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
blamed to commit 255cac91c3 ("tcp: extend ECN sysctl to allow server-side
only ECN") ;)); the break in connectivity on-path was found is about
1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
more common in the negotiation phase (and mostly seen in the Alexa
middle band, ranks around 50k-150k): from 12-thousand hosts on which
there _may_ be ECN-linked connection failures, only 79 failed with RST
when _not_ failing with RST when ECN is not requested.
It's unclear though, how much equipment in the wild actually marks CE
when buffers start to fill up.
We thought about a fallback to non-ECN for retransmitted SYNs as another
global option (which could perhaps one day be made default), but as Eric
points out, there's much more work needed to detect broken middleboxes.
Two examples Eric mentioned are buggy firewalls that accept only a single
SYN per flow, and middleboxes that successfully let an ECN flow establish,
but later mark CE for all packets (so cwnd converges to 1).
[1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
[2] http://ecn.ethz.ch/
Joint work with Daniel Borkmann.
Reference: http://thread.gmane.org/gmane.linux.network/335797
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
Changes since v2:
- alter new cookie_ecn_ok() ok function to also evaluate
RTAX_FEATURE_ECN if timestamp-ecn-bit is set and the ecn sysctl
is off.
- extra blank line in tcp_output.c to appease checkpatch.pl
include/net/tcp.h | 2 +-
net/ipv4/syncookies.c | 6 +++---
net/ipv4/tcp_input.c | 25 +++++++++++++++----------
net/ipv4/tcp_output.c | 13 +++++++++++--
net/ipv6/syncookies.c | 2 +-
5 files changed, 31 insertions(+), 17 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 36c5084..f50f29faf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -493,7 +493,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
__u32 cookie_init_timestamp(struct request_sock *req);
bool cookie_timestamp_decode(struct tcp_options_received *opt);
bool cookie_ecn_ok(const struct tcp_options_received *opt,
- const struct net *net);
+ const struct net *net, const struct dst_entry *dst);
/* From net/ipv6/syncookies.c */
int __cookie_v6_check(const struct ipv6hdr *iph, const struct tcphdr *th,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 6de7725..45fe60c 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -273,7 +273,7 @@ bool cookie_timestamp_decode(struct tcp_options_received *tcp_opt)
EXPORT_SYMBOL(cookie_timestamp_decode);
bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
- const struct net *net)
+ const struct net *net, const struct dst_entry *dst)
{
bool ecn_ok = tcp_opt->rcv_tsecr & TS_OPT_ECN;
@@ -283,7 +283,7 @@ bool cookie_ecn_ok(const struct tcp_options_received *tcp_opt,
if (net->ipv4.sysctl_tcp_ecn)
return true;
- return false;
+ return dst_feature(dst, RTAX_FEATURE_ECN);
}
EXPORT_SYMBOL(cookie_ecn_ok);
@@ -387,7 +387,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
dst_metric(&rt->dst, RTAX_INITRWND));
ireq->rcv_wscale = rcv_wscale;
- ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
+ ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
ret = get_cookie_sock(sk, skb, req, &rt->dst);
/* ip_queue_xmit() depends on our flow being setup
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4e4617e..9db942a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5876,20 +5876,22 @@ static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
*/
static void tcp_ecn_create_request(struct request_sock *req,
const struct sk_buff *skb,
- const struct sock *listen_sk)
+ const struct sock *listen_sk,
+ struct dst_entry *dst)
{
const struct tcphdr *th = tcp_hdr(skb);
const struct net *net = sock_net(listen_sk);
bool th_ecn = th->ece && th->cwr;
- bool ect, need_ecn;
+ bool ect, need_ecn, ecn_ok;
if (!th_ecn)
return;
ect = !INET_ECN_is_not_ect(TCP_SKB_CB(skb)->ip_dsfield);
need_ecn = tcp_ca_needs_ecn(listen_sk);
+ ecn_ok = net->ipv4.sysctl_tcp_ecn || dst_feature(dst, RTAX_FEATURE_ECN);
- if (!ect && !need_ecn && net->ipv4.sysctl_tcp_ecn)
+ if (!ect && !need_ecn && ecn_ok)
inet_rsk(req)->ecn_ok = 1;
else if (ect && need_ecn)
inet_rsk(req)->ecn_ok = 1;
@@ -5954,13 +5956,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
if (security_inet_conn_request(sk, skb, req))
goto drop_and_free;
- if (!want_cookie || tmp_opt.tstamp_ok)
- tcp_ecn_create_request(req, skb, sk);
-
- if (want_cookie) {
- isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
- req->cookie_ts = tmp_opt.tstamp_ok;
- } else if (!isn) {
+ if (!want_cookie && !isn) {
/* VJ's idea. We save last timestamp seen
* from the destination in peer table, when entering
* state TIME-WAIT, and check against it before
@@ -6008,6 +6004,15 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
goto drop_and_free;
}
+ tcp_ecn_create_request(req, skb, sk, dst);
+
+ if (want_cookie) {
+ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
+ req->cookie_ts = tmp_opt.tstamp_ok;
+ if (!tmp_opt.tstamp_ok)
+ inet_rsk(req)->ecn_ok = 0;
+ }
+
tcp_rsk(req)->snt_isn = isn;
tcp_openreq_init_rwin(req, sk, dst);
fastopen = !want_cookie &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a3d453b..0b88158 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -333,10 +333,19 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
+ tcp_ca_needs_ecn(sk);
+
+ if (!use_ecn) {
+ const struct dst_entry *dst = __sk_dst_get(sk);
+
+ if (dst && dst_feature(dst, RTAX_FEATURE_ECN))
+ use_ecn = true;
+ }
tp->ecn_flags = 0;
- if (sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
- tcp_ca_needs_ecn(sk)) {
+
+ if (use_ecn) {
TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
tp->ecn_flags = TCP_ECN_OK;
if (tcp_ca_needs_ecn(sk))
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 52cc8cb..7337fc7 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -262,7 +262,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
dst_metric(dst, RTAX_INITRWND));
ireq->rcv_wscale = rcv_wscale;
- ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk));
+ ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst);
ret = get_cookie_sock(sk, skb, req, dst);
out:
--
2.0.4
^ permalink raw reply related
* [PATCH v2 4/5] stmmac: pci: convert to use dev_* macros
From: Andy Shevchenko @ 2014-11-03 13:02 UTC (permalink / raw)
To: Giuseppe Cavallaro, netdev, Kweh Hock Leong, David S . Miller,
Vince Bridgers, Rayagond K
Cc: Andy Shevchenko
In-Reply-To: <1415019737-32736-1-git-send-email-andriy.shevchenko@linux.intel.com>
Instead of pr_* macros let's use dev_* macros which provide device name.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index 5357a3f..5084699 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -78,8 +78,8 @@ static int stmmac_pci_probe(struct pci_dev *pdev,
/* Enable pci device */
ret = pcim_enable_device(pdev);
if (ret) {
- pr_err("%s : ERROR: failed to enable %s device\n", __func__,
- pci_name(pdev));
+ dev_err(&pdev->dev, "%s: ERROR: failed to enable device\n",
+ __func__);
return ret;
}
@@ -100,7 +100,7 @@ static int stmmac_pci_probe(struct pci_dev *pdev,
priv = stmmac_dvr_probe(&pdev->dev, &plat_dat,
pcim_iomap_table(pdev)[i]);
if (IS_ERR(priv)) {
- pr_err("%s: main driver probe failed", __func__);
+ dev_err(&pdev->dev, "%s: main driver probe failed\n", __func__);
return PTR_ERR(priv);
}
priv->dev->irq = pdev->irq;
@@ -108,7 +108,7 @@ static int stmmac_pci_probe(struct pci_dev *pdev,
pci_set_drvdata(pdev, priv->dev);
- pr_debug("STMMAC platform driver registration completed");
+ dev_dbg(&pdev->dev, "STMMAC PCI driver registration completed\n");
return 0;
}
--
2.1.1
^ permalink raw reply related
* [PATCH v2 0/5] stmmac: pci: various cleanups and fixes
From: Andy Shevchenko @ 2014-11-03 13:02 UTC (permalink / raw)
To: Giuseppe Cavallaro, netdev, Kweh Hock Leong, David S . Miller,
Vince Bridgers, Rayagond K
Cc: Andy Shevchenko
There are few cleanups and fixes regarding to stmmac PCI driver.
This has been tested on Intel Galileo board with recent net-next tree.
Since v1:
- remove already applied patch
- append patch 1/5
- rework patch 3/5 to be functional compatible with original code
Andy Shevchenko (5):
stmmac: pci: use defined constant instead of magic number
stmmac: pci: convert to use dev_pm_ops
stmmac: pci: use managed resources
stmmac: pci: convert to use dev_* macros
stmmac: pci: remove FSF address
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 84 ++++++++----------------
1 file changed, 26 insertions(+), 58 deletions(-)
--
2.1.1
^ permalink raw reply
* [PATCH v2 5/5] stmmac: pci: remove FSF address
From: Andy Shevchenko @ 2014-11-03 13:02 UTC (permalink / raw)
To: Giuseppe Cavallaro, netdev, Kweh Hock Leong, David S . Miller,
Vince Bridgers, Rayagond K
Cc: Andy Shevchenko
In-Reply-To: <1415019737-32736-1-git-send-email-andriy.shevchenko@linux.intel.com>
The FSF address is subject to change, thus remove it from the file.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index 5084699..068ef8e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -12,10 +12,6 @@
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
- You should have received a copy of the GNU General Public License along with
- this program; if not, write to the Free Software Foundation, Inc.,
- 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
-
The full GNU General Public License is included in this distribution in
the file called "COPYING".
--
2.1.1
^ permalink raw reply related
* [PATCH v2 2/5] stmmac: pci: convert to use dev_pm_ops
From: Andy Shevchenko @ 2014-11-03 13:02 UTC (permalink / raw)
To: Giuseppe Cavallaro, netdev, Kweh Hock Leong, David S . Miller,
Vince Bridgers, Rayagond K
Cc: Andy Shevchenko
In-Reply-To: <1415019737-32736-1-git-send-email-andriy.shevchenko@linux.intel.com>
Convert system PM callbacks to use dev_pm_ops. In addition remove the PCI calls
related to a power state since the bus code cares about this already.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 27 ++++++++++--------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index e45794d..f19ac8e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -150,30 +150,26 @@ static void stmmac_pci_remove(struct pci_dev *pdev)
pci_disable_device(pdev);
}
-#ifdef CONFIG_PM
-static int stmmac_pci_suspend(struct pci_dev *pdev, pm_message_t state)
+#ifdef CONFIG_PM_SLEEP
+static int stmmac_pci_suspend(struct device *dev)
{
+ struct pci_dev *pdev = to_pci_dev(dev);
struct net_device *ndev = pci_get_drvdata(pdev);
- int ret;
- ret = stmmac_suspend(ndev);
- pci_save_state(pdev);
- pci_set_power_state(pdev, pci_choose_state(pdev, state));
-
- return ret;
+ return stmmac_suspend(ndev);
}
-static int stmmac_pci_resume(struct pci_dev *pdev)
+static int stmmac_pci_resume(struct device *dev)
{
+ struct pci_dev *pdev = to_pci_dev(dev);
struct net_device *ndev = pci_get_drvdata(pdev);
- pci_set_power_state(pdev, PCI_D0);
- pci_restore_state(pdev);
-
return stmmac_resume(ndev);
}
#endif
+static SIMPLE_DEV_PM_OPS(stmmac_pm_ops, stmmac_pci_suspend, stmmac_pci_resume);
+
#define STMMAC_VENDOR_ID 0x700
#define STMMAC_DEVICE_ID 0x1108
@@ -190,10 +186,9 @@ struct pci_driver stmmac_pci_driver = {
.id_table = stmmac_id_table,
.probe = stmmac_pci_probe,
.remove = stmmac_pci_remove,
-#ifdef CONFIG_PM
- .suspend = stmmac_pci_suspend,
- .resume = stmmac_pci_resume,
-#endif
+ .driver = {
+ .pm = &stmmac_pm_ops,
+ },
};
MODULE_DESCRIPTION("STMMAC 10/100/1000 Ethernet PCI driver");
--
2.1.1
^ permalink raw reply related
* [PATCH v2 1/5] stmmac: pci: use defined constant instead of magic number
From: Andy Shevchenko @ 2014-11-03 13:02 UTC (permalink / raw)
To: Giuseppe Cavallaro, netdev, Kweh Hock Leong, David S . Miller,
Vince Bridgers, Rayagond K
Cc: Andy Shevchenko
In-Reply-To: <1415019737-32736-1-git-send-email-andriy.shevchenko@linux.intel.com>
The last standard PCI resource is defined as PCI_STD_RESOURCE_END. Thus, we
could use it instead of plain integer.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index e17a970..e45794d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -90,7 +90,7 @@ static int stmmac_pci_probe(struct pci_dev *pdev,
}
/* Get the base address of device */
- for (i = 0; i <= 5; i++) {
+ for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
if (pci_resource_len(pdev, i) == 0)
continue;
addr = pci_iomap(pdev, i, 0);
--
2.1.1
^ permalink raw reply related
* [PATCH v2 3/5] stmmac: pci: use managed resources
From: Andy Shevchenko @ 2014-11-03 13:02 UTC (permalink / raw)
To: Giuseppe Cavallaro, netdev, Kweh Hock Leong, David S . Miller,
Vince Bridgers, Rayagond K
Cc: Andy Shevchenko
In-Reply-To: <1415019737-32736-1-git-send-email-andriy.shevchenko@linux.intel.com>
Migrate pci driver to managed resources to reduce boilerplate error handling
code.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c | 43 ++++++------------------
1 file changed, 10 insertions(+), 33 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index f19ac8e..5357a3f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -71,46 +71,37 @@ static void stmmac_default_data(void)
static int stmmac_pci_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
{
- int ret = 0;
- void __iomem *addr = NULL;
- struct stmmac_priv *priv = NULL;
+ struct stmmac_priv *priv;
int i;
+ int ret;
/* Enable pci device */
- ret = pci_enable_device(pdev);
+ ret = pcim_enable_device(pdev);
if (ret) {
pr_err("%s : ERROR: failed to enable %s device\n", __func__,
pci_name(pdev));
return ret;
}
- if (pci_request_regions(pdev, STMMAC_RESOURCE_NAME)) {
- pr_err("%s: ERROR: failed to get PCI region\n", __func__);
- ret = -ENODEV;
- goto err_out_req_reg_failed;
- }
/* Get the base address of device */
for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
if (pci_resource_len(pdev, i) == 0)
continue;
- addr = pci_iomap(pdev, i, 0);
- if (addr == NULL) {
- pr_err("%s: ERROR: cannot map register memory aborting",
- __func__);
- ret = -EIO;
- goto err_out_map_failed;
- }
+ ret = pcim_iomap_regions(pdev, BIT(i), pci_name(pdev));
+ if (ret)
+ return ret;
break;
}
+
pci_set_master(pdev);
stmmac_default_data();
- priv = stmmac_dvr_probe(&(pdev->dev), &plat_dat, addr);
+ priv = stmmac_dvr_probe(&pdev->dev, &plat_dat,
+ pcim_iomap_table(pdev)[i]);
if (IS_ERR(priv)) {
pr_err("%s: main driver probe failed", __func__);
- ret = PTR_ERR(priv);
- goto err_out;
+ return PTR_ERR(priv);
}
priv->dev->irq = pdev->irq;
priv->wol_irq = pdev->irq;
@@ -120,15 +111,6 @@ static int stmmac_pci_probe(struct pci_dev *pdev,
pr_debug("STMMAC platform driver registration completed");
return 0;
-
-err_out:
- pci_clear_master(pdev);
-err_out_map_failed:
- pci_release_regions(pdev);
-err_out_req_reg_failed:
- pci_disable_device(pdev);
-
- return ret;
}
/**
@@ -141,13 +123,8 @@ err_out_req_reg_failed:
static void stmmac_pci_remove(struct pci_dev *pdev)
{
struct net_device *ndev = pci_get_drvdata(pdev);
- struct stmmac_priv *priv = netdev_priv(ndev);
stmmac_dvr_remove(ndev);
-
- pci_iounmap(pdev, priv->ioaddr);
- pci_release_regions(pdev);
- pci_disable_device(pdev);
}
#ifdef CONFIG_PM_SLEEP
--
2.1.1
^ permalink raw reply related
* [PATCH net] sfc: don't BUG_ON efx->max_channels == 0 in probe
From: Edward Cree @ 2014-11-03 14:14 UTC (permalink / raw)
To: davem; +Cc: netdev, sshah, linux-net-drivers
efx_ef10_probe() was BUGging out if the BAR2 size was 0. This is
unnecessarily violent; instead we should just fail to probe the device.
Kept a WARN_ON as this problem indicates a broken or misconfigured NIC.
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/ef10.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 002d4cd..a77f05c 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -180,7 +180,8 @@ static int efx_ef10_probe(struct efx_nic *efx)
EFX_MAX_CHANNELS,
resource_size(&efx->pci_dev->resource[EFX_MEM_BAR]) /
(EFX_VI_PAGE_SIZE * EFX_TXQ_TYPES));
- BUG_ON(efx->max_channels == 0);
+ if (WARN_ON(efx->max_channels == 0))
+ return -EIO;
nic_data = kzalloc(sizeof(*nic_data), GFP_KERNEL);
if (!nic_data)
--
1.7.11.7
^ permalink raw reply related
* RE: [PATCH -next v3 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: David Laight @ 2014-11-03 14:24 UTC (permalink / raw)
To: 'Florian Westphal', netdev@vger.kernel.org; +Cc: Daniel Borkmann
In-Reply-To: <1415019720-17106-2-git-send-email-fw@strlen.de>
From: Florian Westphal
> Was a bit more difficult to read than needed due to magic shifts;
> add defines and document the used encoding scheme.
>
> Joint work with Daniel Borkmann.
>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> This patch was not part of earlier versions of the set.
>
> net/ipv4/syncookies.c | 50 +++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 35 insertions(+), 15 deletions(-)
>
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 4ac7bca..c3792c0 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -19,10 +19,6 @@
> #include <net/tcp.h>
> #include <net/route.h>
>
> -/* Timestamps: lowest bits store TCP options */
> -#define TSBITS 6
> -#define TSMASK (((__u32)1 << TSBITS) - 1)
> -
> extern int sysctl_tcp_syncookies;
>
> static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
> @@ -30,6 +26,30 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
> #define COOKIEBITS 24 /* Upper bits store count */
> #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
>
> +/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
> + * stores TCP options:
> + *
> + * MSB LSB
> + * | 31 ... 6 | 5 | 4 | 3 2 1 0 |
> + * | Timestamp | ECN | SACK | WScale |
> + *
> + * When we receive a valid cookie-ACK, we look at the echoed tsval (if
> + * any) to figure out which TCP options we should use for the rebuilt
> + * connection.
> + *
> + * A WScale setting of '0xf' (which is an invalid scaling value)
> + * means that original syn did not include the TCP window scaling option.
> + */
> +#define TS_OPT_WSCALE_MASK 0xf
> +#define TS_OPT_SACK BIT(4)
> +#define TS_OPT_ECN BIT(5)
> +/* There is no TS_OPT_TIMESTAMP:
> + * if ACK contains timestamp option, we already know it was
> + * requested/supported by the syn/synack exchange.
> + */
> +#define TSBITS 6
> +#define TSMASK (((__u32)1 << TSBITS) - 1)
Personally I'd define all the values as hex constants instead of mixing
and matching the defines.
So probably just:
#define TS_OPT_WSCALE_MASK 0x0f
#define TS_OPT_SACK 0x10
#define TS_OPT_ECN 0x20
#define TSMASK 0x3f
David
^ permalink raw reply
* Re: [PATCH -next v3 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: Daniel Borkmann @ 2014-11-03 14:33 UTC (permalink / raw)
To: David Laight; +Cc: 'Florian Westphal', netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1C9E44B1@AcuExch.aculab.com>
On 11/03/2014 03:24 PM, David Laight wrote:
> From: Florian Westphal
>> Was a bit more difficult to read than needed due to magic shifts;
>> add defines and document the used encoding scheme.
>>
>> Joint work with Daniel Borkmann.
>>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>> Signed-off-by: Florian Westphal <fw@strlen.de>
>> ---
>> This patch was not part of earlier versions of the set.
>>
>> net/ipv4/syncookies.c | 50 +++++++++++++++++++++++++++++++++++---------------
>> 1 file changed, 35 insertions(+), 15 deletions(-)
>>
>> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
>> index 4ac7bca..c3792c0 100644
>> --- a/net/ipv4/syncookies.c
>> +++ b/net/ipv4/syncookies.c
>> @@ -19,10 +19,6 @@
>> #include <net/tcp.h>
>> #include <net/route.h>
>>
>> -/* Timestamps: lowest bits store TCP options */
>> -#define TSBITS 6
>> -#define TSMASK (((__u32)1 << TSBITS) - 1)
>> -
>> extern int sysctl_tcp_syncookies;
>>
>> static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
>> @@ -30,6 +26,30 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
>> #define COOKIEBITS 24 /* Upper bits store count */
>> #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
>>
>> +/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
>> + * stores TCP options:
>> + *
>> + * MSB LSB
>> + * | 31 ... 6 | 5 | 4 | 3 2 1 0 |
>> + * | Timestamp | ECN | SACK | WScale |
>> + *
>> + * When we receive a valid cookie-ACK, we look at the echoed tsval (if
>> + * any) to figure out which TCP options we should use for the rebuilt
>> + * connection.
>> + *
>> + * A WScale setting of '0xf' (which is an invalid scaling value)
>> + * means that original syn did not include the TCP window scaling option.
>> + */
>> +#define TS_OPT_WSCALE_MASK 0xf
>> +#define TS_OPT_SACK BIT(4)
>> +#define TS_OPT_ECN BIT(5)
>> +/* There is no TS_OPT_TIMESTAMP:
>> + * if ACK contains timestamp option, we already know it was
>> + * requested/supported by the syn/synack exchange.
>> + */
>> +#define TSBITS 6
>> +#define TSMASK (((__u32)1 << TSBITS) - 1)
>
> Personally I'd define all the values as hex constants instead of mixing
> and matching the defines.
>
> So probably just:
> #define TS_OPT_WSCALE_MASK 0x0f
> #define TS_OPT_SACK 0x10
> #define TS_OPT_ECN 0x20
> #define TSMASK 0x3f
If you look at the above comment and then take a peek at the actual TS_OPT_*,
it is much easier to follow. Moreover, how is having TSMASK as 0x3f better?!
Currently, it is a constant calculated based upon TSBITS.
^ permalink raw reply
* stmmac: potential circular locking dependency
From: Emilio López @ 2014-11-03 14:36 UTC (permalink / raw)
To: peppe.cavallaro, maxime.ripard; +Cc: netdev, linux-kernel, linux-arm-kernel
Hi everyone,
I was playing with iperf on my Cubietruck today when I hit this lockdep
report/breakage on stmmac. It seems to be fairly reproducible by
1) Being on a GbE network
2) "iperf -s" on the stmmmac device
3) "iperf -c <IP of above device> -d -P 5" on some other device on the LAN
Here it goes:
======================================================
[ INFO: possible circular locking dependency detected ]
3.18.0-rc3-00003-g7180edf #916 Not tainted
-------------------------------------------------------
iperf/141 is trying to acquire lock:
(&(&dev->tx_global_lock)->rlock){+.-...}, at: [<c025927c>]
stmmac_tx_clean+0x350/0x43c
but task is already holding lock:
(&(&priv->tx_lock)->rlock){+.-...}, at: [<c0258f5c>]
stmmac_tx_clean+0x30/0x43c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&(&priv->tx_lock)->rlock){+.-...}:
[<c03f57d0>] _raw_spin_lock+0x5c/0x94
[<c0259ec4>] stmmac_xmit+0x88/0x620
[<c0319058>] dev_hard_start_xmit+0x230/0x49c
[<c033773c>] sch_direct_xmit+0xdc/0x20c
[<c03194dc>] __dev_queue_xmit+0x218/0x604
[<c0319954>] dev_queue_xmit+0x1c/0x20
[<c032896c>] neigh_resolve_output+0x180/0x254
[<c03a36fc>] ip6_finish_output2+0x188/0x8a4
[<c03a7440>] ip6_output+0xc8/0x398
[<c03c81ec>] mld_sendpack+0x2e0/0x6d8
[<c03ca564>] mld_ifc_timer_expire+0x1f0/0x308
[<c007c06c>] call_timer_fn+0xb4/0x1f0
[<c007c7d0>] run_timer_softirq+0x224/0x2f0
[<c0023fb0>] __do_softirq+0x1d4/0x3e0
[<c00244e0>] irq_exit+0x9c/0xd0
[<c006df48>] __handle_domain_irq+0x70/0xb4
[<c0008754>] gic_handle_irq+0x34/0x6c
[<c0013484>] __irq_svc+0x44/0x5c
[<c00628f0>] lock_acquire+0xec/0x17c
[<c00628f0>] lock_acquire+0xec/0x17c
[<c03f57d0>] _raw_spin_lock+0x5c/0x94
[<c00ddeac>] do_read_fault.isra.93+0xa8/0x2a0
[<c00e0144>] handle_mm_fault+0x44c/0x8dc
[<c0018670>] do_page_fault+0x160/0x2d8
[<c0008564>] do_PrefetchAbort+0x44/0xa8
[<c0013960>] ret_from_exception+0x0/0x20
[<b6eb0120>] 0xb6eb0120
-> #1 (_xmit_ETHER#2){+.-...}:
[<c03f57d0>] _raw_spin_lock+0x5c/0x94
[<c0338254>] dev_deactivate_many+0xd0/0x250
[<c0338410>] dev_deactivate+0x3c/0x4c
[<c032e908>] linkwatch_do_dev+0x50/0x84
[<c032eb74>] __linkwatch_run_queue+0xdc/0x148
[<c032ec1c>] linkwatch_event+0x3c/0x44
[<c00367d0>] process_one_work+0x1ec/0x510
[<c0036b50>] worker_thread+0x5c/0x4d8
[<c003c748>] kthread+0xe8/0xfc
[<c000ed28>] ret_from_fork+0x14/0x20
-> #0 (&(&dev->tx_global_lock)->rlock){+.-...}:
[<c00628e0>] lock_acquire+0xdc/0x17c
[<c03f57d0>] _raw_spin_lock+0x5c/0x94
[<c025927c>] stmmac_tx_clean+0x350/0x43c
[<c02593c0>] stmmac_poll+0x3c/0x618
[<c031a760>] net_rx_action+0x178/0x28c
[<c0023fb0>] __do_softirq+0x1d4/0x3e0
[<c00244e0>] irq_exit+0x9c/0xd0
[<c006df48>] __handle_domain_irq+0x70/0xb4
[<c0008754>] gic_handle_irq+0x34/0x6c
[<c0013484>] __irq_svc+0x44/0x5c
[<c002435c>] __local_bh_enable_ip+0x9c/0xfc
[<c002435c>] __local_bh_enable_ip+0x9c/0xfc
[<c03f5f64>] _raw_read_unlock_bh+0x40/0x44
[<c03ad050>] inet6_dump_addr+0x33c/0x530
[<c03ad2a0>] inet6_dump_ifaddr+0x1c/0x20
[<c032a0fc>] rtnl_dump_all+0x50/0xf4
[<c033c020>] netlink_dump+0xc0/0x250
[<c033c3e4>] netlink_recvmsg+0x234/0x300
[<c02ff7b4>] sock_recvmsg+0xa4/0xc8
[<c03000bc>] ___sys_recvmsg.part.33+0xe4/0x1c0
[<c0302158>] __sys_recvmsg+0x60/0x90
[<c03021a0>] SyS_recvmsg+0x18/0x1c
[<c000ec60>] ret_fast_syscall+0x0/0x48
other info that might help us debug this:
Chain exists of:
&(&dev->tx_global_lock)->rlock --> _xmit_ETHER#2 -->
&(&priv->tx_lock)->rlock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&(&priv->tx_lock)->rlock);
lock(_xmit_ETHER#2);
lock(&(&priv->tx_lock)->rlock);
lock(&(&dev->tx_global_lock)->rlock);
*** DEADLOCK ***
3 locks held by iperf/141:
#0: (rtnl_mutex){+.+.+.}, at: [<c033bf88>] netlink_dump+0x28/0x250
#1: (rcu_read_lock){......}, at: [<c03acd14>] inet6_dump_addr+0x0/0x530
#2: (&(&priv->tx_lock)->rlock){+.-...}, at: [<c0258f5c>]
stmmac_tx_clean+0x30/0x43c
stack backtrace:
CPU: 0 PID: 141 Comm: iperf Not tainted 3.18.0-rc3-00003-g7180edf #916
[<c0015df4>] (unwind_backtrace) from [<c0012850>] (show_stack+0x20/0x24)
[<c0012850>] (show_stack) from [<c03ef6dc>] (dump_stack+0x9c/0xbc)
[<c03ef6dc>] (dump_stack) from [<c005bbc0>] (print_circular_bug+0x21c/0x33c)
[<c005bbc0>] (print_circular_bug) from [<c0061e88>]
(__lock_acquire+0x2060/0x2148)
[<c0061e88>] (__lock_acquire) from [<c00628e0>] (lock_acquire+0xdc/0x17c)
[<c00628e0>] (lock_acquire) from [<c03f57d0>] (_raw_spin_lock+0x5c/0x94)
[<c03f57d0>] (_raw_spin_lock) from [<c025927c>]
(stmmac_tx_clean+0x350/0x43c)
[<c025927c>] (stmmac_tx_clean) from [<c02593c0>] (stmmac_poll+0x3c/0x618)
[<c02593c0>] (stmmac_poll) from [<c031a760>] (net_rx_action+0x178/0x28c)
[<c031a760>] (net_rx_action) from [<c0023fb0>] (__do_softirq+0x1d4/0x3e0)
[<c0023fb0>] (__do_softirq) from [<c00244e0>] (irq_exit+0x9c/0xd0)
[<c00244e0>] (irq_exit) from [<c006df48>] (__handle_domain_irq+0x70/0xb4)
[<c006df48>] (__handle_domain_irq) from [<c0008754>]
(gic_handle_irq+0x34/0x6c)
[<c0008754>] (gic_handle_irq) from [<c0013484>] (__irq_svc+0x44/0x5c)
Exception stack(0xcabc1be0 to 0xcabc1c28)
1be0: 00000001 2df53000 00000000 caf15e80 cabc0000 00000201 ca9a9840
c03ad050
1c00: ca8d9404 00000000 ca9b4f50 cabc1c44 c08732d0 cabc1c28 c005d5d0
c002435c
1c20: 20000013 ffffffff
[<c0013484>] (__irq_svc) from [<c002435c>] (__local_bh_enable_ip+0x9c/0xfc)
[<c002435c>] (__local_bh_enable_ip) from [<c03f5f64>]
(_raw_read_unlock_bh+0x40/0x44)
[<c03f5f64>] (_raw_read_unlock_bh) from [<c03ad050>]
(inet6_dump_addr+0x33c/0x530)
[<c03ad050>] (inet6_dump_addr) from [<c03ad2a0>]
(inet6_dump_ifaddr+0x1c/0x20)
[<c03ad2a0>] (inet6_dump_ifaddr) from [<c032a0fc>] (rtnl_dump_all+0x50/0xf4)
[<c032a0fc>] (rtnl_dump_all) from [<c033c020>] (netlink_dump+0xc0/0x250)
[<c033c020>] (netlink_dump) from [<c033c3e4>] (netlink_recvmsg+0x234/0x300)
[<c033c3e4>] (netlink_recvmsg) from [<c02ff7b4>] (sock_recvmsg+0xa4/0xc8)
[<c02ff7b4>] (sock_recvmsg) from [<c03000bc>]
(___sys_recvmsg.part.33+0xe4/0x1c0)
[<c03000bc>] (___sys_recvmsg.part.33) from [<c0302158>]
(__sys_recvmsg+0x60/0x90)
[<c0302158>] (__sys_recvmsg) from [<c03021a0>] (SyS_recvmsg+0x18/0x1c)
[<c03021a0>] (SyS_recvmsg) from [<c000ec60>] (ret_fast_syscall+0x0/0x48)
---------------------------------
Cheers,
Emilio
^ permalink raw reply
* Re: stmmac: potential circular locking dependency
From: Giuseppe CAVALLARO @ 2014-11-03 14:39 UTC (permalink / raw)
To: Emilio López, maxime.ripard; +Cc: netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <545792E8.1020208@elopez.com.ar>
Hello Emilio
I have a subset of new patches to review and fix locks in the
driver. I will plan to send them in the next days.
Sincerely
Peppe
On 11/3/2014 3:36 PM, Emilio López wrote:
> Hi everyone,
>
> I was playing with iperf on my Cubietruck today when I hit this lockdep
> report/breakage on stmmac. It seems to be fairly reproducible by
>
> 1) Being on a GbE network
> 2) "iperf -s" on the stmmmac device
> 3) "iperf -c <IP of above device> -d -P 5" on some other device on the LAN
>
> Here it goes:
>
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.18.0-rc3-00003-g7180edf #916 Not tainted
> -------------------------------------------------------
> iperf/141 is trying to acquire lock:
> (&(&dev->tx_global_lock)->rlock){+.-...}, at: [<c025927c>]
> stmmac_tx_clean+0x350/0x43c
>
> but task is already holding lock:
> (&(&priv->tx_lock)->rlock){+.-...}, at: [<c0258f5c>]
> stmmac_tx_clean+0x30/0x43c
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&(&priv->tx_lock)->rlock){+.-...}:
> [<c03f57d0>] _raw_spin_lock+0x5c/0x94
> [<c0259ec4>] stmmac_xmit+0x88/0x620
> [<c0319058>] dev_hard_start_xmit+0x230/0x49c
> [<c033773c>] sch_direct_xmit+0xdc/0x20c
> [<c03194dc>] __dev_queue_xmit+0x218/0x604
> [<c0319954>] dev_queue_xmit+0x1c/0x20
> [<c032896c>] neigh_resolve_output+0x180/0x254
> [<c03a36fc>] ip6_finish_output2+0x188/0x8a4
> [<c03a7440>] ip6_output+0xc8/0x398
> [<c03c81ec>] mld_sendpack+0x2e0/0x6d8
> [<c03ca564>] mld_ifc_timer_expire+0x1f0/0x308
> [<c007c06c>] call_timer_fn+0xb4/0x1f0
> [<c007c7d0>] run_timer_softirq+0x224/0x2f0
> [<c0023fb0>] __do_softirq+0x1d4/0x3e0
> [<c00244e0>] irq_exit+0x9c/0xd0
> [<c006df48>] __handle_domain_irq+0x70/0xb4
> [<c0008754>] gic_handle_irq+0x34/0x6c
> [<c0013484>] __irq_svc+0x44/0x5c
> [<c00628f0>] lock_acquire+0xec/0x17c
> [<c00628f0>] lock_acquire+0xec/0x17c
> [<c03f57d0>] _raw_spin_lock+0x5c/0x94
> [<c00ddeac>] do_read_fault.isra.93+0xa8/0x2a0
> [<c00e0144>] handle_mm_fault+0x44c/0x8dc
> [<c0018670>] do_page_fault+0x160/0x2d8
> [<c0008564>] do_PrefetchAbort+0x44/0xa8
> [<c0013960>] ret_from_exception+0x0/0x20
> [<b6eb0120>] 0xb6eb0120
>
> -> #1 (_xmit_ETHER#2){+.-...}:
> [<c03f57d0>] _raw_spin_lock+0x5c/0x94
> [<c0338254>] dev_deactivate_many+0xd0/0x250
> [<c0338410>] dev_deactivate+0x3c/0x4c
> [<c032e908>] linkwatch_do_dev+0x50/0x84
> [<c032eb74>] __linkwatch_run_queue+0xdc/0x148
> [<c032ec1c>] linkwatch_event+0x3c/0x44
> [<c00367d0>] process_one_work+0x1ec/0x510
> [<c0036b50>] worker_thread+0x5c/0x4d8
> [<c003c748>] kthread+0xe8/0xfc
> [<c000ed28>] ret_from_fork+0x14/0x20
>
> -> #0 (&(&dev->tx_global_lock)->rlock){+.-...}:
> [<c00628e0>] lock_acquire+0xdc/0x17c
> [<c03f57d0>] _raw_spin_lock+0x5c/0x94
> [<c025927c>] stmmac_tx_clean+0x350/0x43c
> [<c02593c0>] stmmac_poll+0x3c/0x618
> [<c031a760>] net_rx_action+0x178/0x28c
> [<c0023fb0>] __do_softirq+0x1d4/0x3e0
> [<c00244e0>] irq_exit+0x9c/0xd0
> [<c006df48>] __handle_domain_irq+0x70/0xb4
> [<c0008754>] gic_handle_irq+0x34/0x6c
> [<c0013484>] __irq_svc+0x44/0x5c
> [<c002435c>] __local_bh_enable_ip+0x9c/0xfc
> [<c002435c>] __local_bh_enable_ip+0x9c/0xfc
> [<c03f5f64>] _raw_read_unlock_bh+0x40/0x44
> [<c03ad050>] inet6_dump_addr+0x33c/0x530
> [<c03ad2a0>] inet6_dump_ifaddr+0x1c/0x20
> [<c032a0fc>] rtnl_dump_all+0x50/0xf4
> [<c033c020>] netlink_dump+0xc0/0x250
> [<c033c3e4>] netlink_recvmsg+0x234/0x300
> [<c02ff7b4>] sock_recvmsg+0xa4/0xc8
> [<c03000bc>] ___sys_recvmsg.part.33+0xe4/0x1c0
> [<c0302158>] __sys_recvmsg+0x60/0x90
> [<c03021a0>] SyS_recvmsg+0x18/0x1c
> [<c000ec60>] ret_fast_syscall+0x0/0x48
>
> other info that might help us debug this:
>
> Chain exists of:
> &(&dev->tx_global_lock)->rlock --> _xmit_ETHER#2 -->
> &(&priv->tx_lock)->rlock
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&(&priv->tx_lock)->rlock);
> lock(_xmit_ETHER#2);
> lock(&(&priv->tx_lock)->rlock);
> lock(&(&dev->tx_global_lock)->rlock);
>
> *** DEADLOCK ***
>
> 3 locks held by iperf/141:
> #0: (rtnl_mutex){+.+.+.}, at: [<c033bf88>] netlink_dump+0x28/0x250
> #1: (rcu_read_lock){......}, at: [<c03acd14>] inet6_dump_addr+0x0/0x530
> #2: (&(&priv->tx_lock)->rlock){+.-...}, at: [<c0258f5c>]
> stmmac_tx_clean+0x30/0x43c
>
> stack backtrace:
> CPU: 0 PID: 141 Comm: iperf Not tainted 3.18.0-rc3-00003-g7180edf #916
> [<c0015df4>] (unwind_backtrace) from [<c0012850>] (show_stack+0x20/0x24)
> [<c0012850>] (show_stack) from [<c03ef6dc>] (dump_stack+0x9c/0xbc)
> [<c03ef6dc>] (dump_stack) from [<c005bbc0>]
> (print_circular_bug+0x21c/0x33c)
> [<c005bbc0>] (print_circular_bug) from [<c0061e88>]
> (__lock_acquire+0x2060/0x2148)
> [<c0061e88>] (__lock_acquire) from [<c00628e0>] (lock_acquire+0xdc/0x17c)
> [<c00628e0>] (lock_acquire) from [<c03f57d0>] (_raw_spin_lock+0x5c/0x94)
> [<c03f57d0>] (_raw_spin_lock) from [<c025927c>]
> (stmmac_tx_clean+0x350/0x43c)
> [<c025927c>] (stmmac_tx_clean) from [<c02593c0>] (stmmac_poll+0x3c/0x618)
> [<c02593c0>] (stmmac_poll) from [<c031a760>] (net_rx_action+0x178/0x28c)
> [<c031a760>] (net_rx_action) from [<c0023fb0>] (__do_softirq+0x1d4/0x3e0)
> [<c0023fb0>] (__do_softirq) from [<c00244e0>] (irq_exit+0x9c/0xd0)
> [<c00244e0>] (irq_exit) from [<c006df48>] (__handle_domain_irq+0x70/0xb4)
> [<c006df48>] (__handle_domain_irq) from [<c0008754>]
> (gic_handle_irq+0x34/0x6c)
> [<c0008754>] (gic_handle_irq) from [<c0013484>] (__irq_svc+0x44/0x5c)
> Exception stack(0xcabc1be0 to 0xcabc1c28)
> 1be0: 00000001 2df53000 00000000 caf15e80 cabc0000 00000201 ca9a9840
> c03ad050
> 1c00: ca8d9404 00000000 ca9b4f50 cabc1c44 c08732d0 cabc1c28 c005d5d0
> c002435c
> 1c20: 20000013 ffffffff
> [<c0013484>] (__irq_svc) from [<c002435c>] (__local_bh_enable_ip+0x9c/0xfc)
> [<c002435c>] (__local_bh_enable_ip) from [<c03f5f64>]
> (_raw_read_unlock_bh+0x40/0x44)
> [<c03f5f64>] (_raw_read_unlock_bh) from [<c03ad050>]
> (inet6_dump_addr+0x33c/0x530)
> [<c03ad050>] (inet6_dump_addr) from [<c03ad2a0>]
> (inet6_dump_ifaddr+0x1c/0x20)
> [<c03ad2a0>] (inet6_dump_ifaddr) from [<c032a0fc>]
> (rtnl_dump_all+0x50/0xf4)
> [<c032a0fc>] (rtnl_dump_all) from [<c033c020>] (netlink_dump+0xc0/0x250)
> [<c033c020>] (netlink_dump) from [<c033c3e4>] (netlink_recvmsg+0x234/0x300)
> [<c033c3e4>] (netlink_recvmsg) from [<c02ff7b4>] (sock_recvmsg+0xa4/0xc8)
> [<c02ff7b4>] (sock_recvmsg) from [<c03000bc>]
> (___sys_recvmsg.part.33+0xe4/0x1c0)
> [<c03000bc>] (___sys_recvmsg.part.33) from [<c0302158>]
> (__sys_recvmsg+0x60/0x90)
> [<c0302158>] (__sys_recvmsg) from [<c03021a0>] (SyS_recvmsg+0x18/0x1c)
> [<c03021a0>] (SyS_recvmsg) from [<c000ec60>] (ret_fast_syscall+0x0/0x48)
> ---------------------------------
>
> Cheers,
>
> Emilio
>
>
^ permalink raw reply
* RE: [PATCH -next v3 1/3] syncookies: avoid magic values and document which-bit-is-what-option
From: David Laight @ 2014-11-03 14:41 UTC (permalink / raw)
To: 'Daniel Borkmann'
Cc: 'Florian Westphal', netdev@vger.kernel.org
In-Reply-To: <54579240.8060309@redhat.com>
From: Daniel Borkmann
> On 11/03/2014 03:24 PM, David Laight wrote:
> > From: Florian Westphal
> >> Was a bit more difficult to read than needed due to magic shifts;
> >> add defines and document the used encoding scheme.
> >>
> >> Joint work with Daniel Borkmann.
> >>
> >> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> >> Signed-off-by: Florian Westphal <fw@strlen.de>
> >> ---
> >> This patch was not part of earlier versions of the set.
> >>
> >> net/ipv4/syncookies.c | 50 +++++++++++++++++++++++++++++++++++---------------
> >> 1 file changed, 35 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> >> index 4ac7bca..c3792c0 100644
> >> --- a/net/ipv4/syncookies.c
> >> +++ b/net/ipv4/syncookies.c
> >> @@ -19,10 +19,6 @@
> >> #include <net/tcp.h>
> >> #include <net/route.h>
> >>
> >> -/* Timestamps: lowest bits store TCP options */
> >> -#define TSBITS 6
> >> -#define TSMASK (((__u32)1 << TSBITS) - 1)
> >> -
> >> extern int sysctl_tcp_syncookies;
> >>
> >> static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
> >> @@ -30,6 +26,30 @@ static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS] __read_mostly;
> >> #define COOKIEBITS 24 /* Upper bits store count */
> >> #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
> >>
> >> +/* TCP Timestamp: 6 lowest bits of timestamp sent in the cookie SYN-ACK
> >> + * stores TCP options:
> >> + *
> >> + * MSB LSB
> >> + * | 31 ... 6 | 5 | 4 | 3 2 1 0 |
> >> + * | Timestamp | ECN | SACK | WScale |
> >> + *
> >> + * When we receive a valid cookie-ACK, we look at the echoed tsval (if
> >> + * any) to figure out which TCP options we should use for the rebuilt
> >> + * connection.
> >> + *
> >> + * A WScale setting of '0xf' (which is an invalid scaling value)
> >> + * means that original syn did not include the TCP window scaling option.
> >> + */
> >> +#define TS_OPT_WSCALE_MASK 0xf
> >> +#define TS_OPT_SACK BIT(4)
> >> +#define TS_OPT_ECN BIT(5)
> >> +/* There is no TS_OPT_TIMESTAMP:
> >> + * if ACK contains timestamp option, we already know it was
> >> + * requested/supported by the syn/synack exchange.
> >> + */
> >> +#define TSBITS 6
> >> +#define TSMASK (((__u32)1 << TSBITS) - 1)
> >
> > Personally I'd define all the values as hex constants instead of mixing
> > and matching the defines.
> >
> > So probably just:
> > #define TS_OPT_WSCALE_MASK 0x0f
> > #define TS_OPT_SACK 0x10
> > #define TS_OPT_ECN 0x20
> > #define TSMASK 0x3f
>
> If you look at the above comment and then take a peek at the actual TS_OPT_*,
> it is much easier to follow. Moreover, how is having TSMASK as 0x3f better?!
> Currently, it is a constant calculated based upon TSBITS.
TSMASK is also (TS_OPT_WSCALE_MASK | TS_OPT_SACK | TS_OPT_ECN) defining
the values in hex makes this even more clear.
Defining TSBITS from the mask would save the extra definition - which might
be easier done by replacing the shifts with multiply/divide by (TSMASK + 1)
(probably in a #define/inline function to make the code easier to read.
David
^ permalink raw reply
* Re: stmmac: potential circular locking dependency
From: Emilio López @ 2014-11-03 14:48 UTC (permalink / raw)
To: Giuseppe CAVALLARO; +Cc: maxime.ripard, netdev, linux-kernel, linux-arm-kernel
In-Reply-To: <545793B6.5010109@st.com>
El 03/11/14 a las 11:39, Giuseppe CAVALLARO escibió:
> Hello Emilio
>
> I have a subset of new patches to review and fix locks in the
> driver. I will plan to send them in the next days.
Great then :) Please Cc: me on the patches when you send them
Cheers,
Emilio
^ permalink raw reply
* RE: [PATCH 0/1] mv643xx_eth: Disable TSO by default
From: David Laight @ 2014-11-03 14:51 UTC (permalink / raw)
To: 'Eric Dumazet', Ezequiel Garcia
Cc: netdev@vger.kernel.org, David Miller, Thomas Petazzoni,
Gregory Clement, Tawfik Bayouk, Lior Amsalem, Nadav Haklai
In-Reply-To: <1414862766.31792.7.camel@edumazet-glaptop2.roam.corp.google.com>
From: Eric Dumazet
> On Sat, 2014-11-01 at 12:30 -0300, Ezequiel Garcia wrote:
> > Several users ([1], [2]) have been reporting data corruption with TSO on
> > Kirkwood platforms (i.e. using the mv643xx_eth driver).
> >
> > Until we manage to find what's causing this, this simple patch will make
> > the TSO path disabled by default. This patch should be queued for stable,
> > fixing the TSO feature introduced in v3.16.
> >
> > The corruption itself is very easy to reproduce: checking md5sum on a mounted
> > NFS directory gives a different result each time. Same tests using the mvneta
> > driver (Armada 370/38x/XP SoC) pass with no issues.
> >
> > Frankly, I'm a bit puzzled about this, and so any ideas or debugging hints
> > are well received.
>
> lack of barriers maybe ?
>
> It seems you might need to populate all TX descriptors but delay the
> first, like doing the populate in descending order.
>
> If you take a look at txq_submit_skb(), you'll see the final
> desc->cmd_sts = cmd_sts (line 959) is done _after_ frags were cooked by
> txq_submit_frag_skb()
>
> You should kick the nick only when all TX descriptors are ready and
> committed to memory.
Don't forget that the nick might process the first descriptor without
being given a 'kick' - it will read it when it finishes processing the
previous frame.
This also means that you have to be careful about the order of the writes
to the first descriptor.
David
^ permalink raw reply
* [net-next 01/11] i40e: Add condition to enter fdir flush and reinit
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
To: davem
Cc: Akeem G Abodunrin, netdev, nhorman, sassmann, jogreene,
Patrick Lu, Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
When FD_SB/ATR are not enabled, do not allow flow director flush
and reinit.
Change-ID: Iafe261c1862992981615815551abd1ed9fada0a8
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 834c9ff..0eccd82 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5211,6 +5211,9 @@ static void i40e_fdir_flush_and_replay(struct i40e_pf *pf)
int flush_wait_retry = 50;
int reg;
+ if (!(pf->flags & (I40E_FLAG_FD_SB_ENABLED | I40E_FLAG_FD_ATR_ENABLED)))
+ return;
+
if (time_after(jiffies, pf->fd_flush_timestamp +
(I40E_MIN_FD_FLUSH_INTERVAL * HZ))) {
set_bit(__I40E_FD_FLUSH_REQUESTED, &pf->state);
@@ -5272,6 +5275,9 @@ static void i40e_fdir_reinit_subtask(struct i40e_pf *pf)
if (test_bit(__I40E_DOWN, &pf->state))
return;
+ if (!(pf->flags & (I40E_FLAG_FD_SB_ENABLED | I40E_FLAG_FD_ATR_ENABLED)))
+ return;
+
if ((pf->fd_add_err >= I40E_MAX_FD_PROGRAM_ERROR) &&
(i40e_get_current_atr_cnt(pf) >= pf->fd_atr_cnt) &&
(i40e_get_current_atr_cnt(pf) > pf->fdir_pf_filter_count))
--
1.9.3
^ permalink raw reply related
* [net-next 00/11][pull request] Intel Wired LAN Driver Updates 2014-11-03
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene
This series contains updates to i40e and i40evf.
Akeem adds a check for i40e so that flow director flush and reinit are
not done when flow director is not enabled.
Mitch fixes the i40evf driver to properly handle multiple admin queue
messages, by reinit the msg_size field each time we go through the loop.
Without this, we may receive truncated messages due to the firmware
thinking we have insufficient buffer size. Also fixes the link checking
logic to only check the carrier state if the interface is actually
open, which allows link changes to be reported correctly without spamming
the VFs. Updates i40e to inset the VSI ID in the QTX_CTL register
when configuring queues for VMDq VSIs.
Paul adds support for 10G-base-T in i40evf.
Jesse fixes i40e where the call to irq_dynamic_disable() was turning off
the interrupt completely when trying to set ITR to 0 (for lowest
moderation).
Shannon removes debugfs dump stats function, since it was not being
kept up-to-date and was redundant with the ethtool output. Also, scales
back the LAN MSIx usage to force queue/vector sharing and leave some
vectors for Flow Director, VMDq, etc. when there are more cores than
vectors available to the PF. Cleans up the error reporting for
get_lump() resource tracking errors. Also adds a check for the
debug module parameter earlier to be able to catch the early configuration
phase admin queue messages.
The following are changes since commit 55b42b5ca2dcf143465968697fe6c6503b05fca1:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next master
Akeem G Abodunrin (1):
i40e: Add condition to enter fdir flush and reinit
Jesse Brandeburg (1):
i40e: avoid disable of interrupt when changing ITR
Mitch Williams (4):
i40evf: properly handle multiple AQ messages
i40e: fix link checking logic
i40e: configure VM ID in qtx_ctl
i40e: properly parse MDET registers
Paul M Stillwell Jr (1):
i40evf: Add support for 10G base T parts
Shannon Nelson (4):
i40e: remove debugfs dump stats
i40e: scale msix vector use when more cores than vectors
i40e: better wording for resource tracking errors
i40e: enable debug earlier
drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 93 +------------------------
drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 -
drivers/net/ethernet/intel/i40e/i40e_main.c | 67 ++++++++++++------
drivers/net/ethernet/intel/i40evf/i40e_common.c | 1 +
drivers/net/ethernet/intel/i40evf/i40e_type.h | 1 +
drivers/net/ethernet/intel/i40evf/i40evf_main.c | 4 +-
6 files changed, 51 insertions(+), 117 deletions(-)
--
1.9.3
^ permalink raw reply
* [net-next 03/11] i40e: fix link checking logic
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
To: davem
Cc: Mitch Williams, netdev, nhorman, sassmann, jogreene, Patrick Lu,
Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Mitch Williams <mitch.a.williams@intel.com>
If the interface is closed, but VFs exist, current code will spam all
the VFs with link messages every second. This is because the link event
code was looking at netif_carrier_ok() without checking to see if the
interface was actually open.
Refactor the logic to only check the carrier state if the interface is
actually open. This allows link changes to be reported correctly without
spamming the VFs.
Change-ID: If136e79bb3820d21ea4e39e332e8a9604efc2b2a
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 0eccd82..f95c04a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5358,6 +5358,7 @@ static void i40e_veb_link_event(struct i40e_veb *veb, bool link_up)
static void i40e_link_event(struct i40e_pf *pf)
{
bool new_link, old_link;
+ struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi];
/* set this to force the get_link_status call to refresh state */
pf->hw.phy.get_link_info = true;
@@ -5366,10 +5367,12 @@ static void i40e_link_event(struct i40e_pf *pf)
new_link = i40e_get_link_status(&pf->hw);
if (new_link == old_link &&
- new_link == netif_carrier_ok(pf->vsi[pf->lan_vsi]->netdev))
+ (test_bit(__I40E_DOWN, &vsi->state) ||
+ new_link == netif_carrier_ok(vsi->netdev)))
return;
- if (!test_bit(__I40E_DOWN, &pf->vsi[pf->lan_vsi]->state))
- i40e_print_link_message(pf->vsi[pf->lan_vsi], new_link);
+
+ if (!test_bit(__I40E_DOWN, &vsi->state))
+ i40e_print_link_message(vsi, new_link);
/* Notify the base of the switch tree connected to
* the link. Floating VEBs are not notified.
@@ -5377,7 +5380,7 @@ static void i40e_link_event(struct i40e_pf *pf)
if (pf->lan_veb != I40E_NO_VEB && pf->veb[pf->lan_veb])
i40e_veb_link_event(pf->veb[pf->lan_veb], new_link);
else
- i40e_vsi_link_event(pf->vsi[pf->lan_vsi], new_link);
+ i40e_vsi_link_event(vsi, new_link);
if (pf->vf)
i40e_vc_notify_link_state(pf);
--
1.9.3
^ permalink raw reply related
* [net-next 04/11] i40evf: Add support for 10G base T parts
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
To: davem
Cc: Paul M Stillwell Jr, netdev, nhorman, sassmann, jogreene,
Patrick Lu, Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
Add 10G-Base-T support in i40evf.
Change-ID: I98a1c3138d7d6572fe7903a7c1c4692cae3260d5
Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40evf/i40e_common.c | 1 +
drivers/net/ethernet/intel/i40evf/i40e_type.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_common.c b/drivers/net/ethernet/intel/i40evf/i40e_common.c
index 9525605..28c40c5 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_common.c
@@ -50,6 +50,7 @@ i40e_status i40e_set_mac_type(struct i40e_hw *hw)
case I40E_DEV_ID_QSFP_A:
case I40E_DEV_ID_QSFP_B:
case I40E_DEV_ID_QSFP_C:
+ case I40E_DEV_ID_10G_BASE_T:
hw->mac.type = I40E_MAC_XL710;
break;
case I40E_DEV_ID_VF:
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_type.h b/drivers/net/ethernet/intel/i40evf/i40e_type.h
index 1537643..8fe34fc 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40evf/i40e_type.h
@@ -43,6 +43,7 @@
#define I40E_DEV_ID_QSFP_A 0x1583
#define I40E_DEV_ID_QSFP_B 0x1584
#define I40E_DEV_ID_QSFP_C 0x1585
+#define I40E_DEV_ID_10G_BASE_T 0x1586
#define I40E_DEV_ID_VF 0x154C
#define I40E_DEV_ID_VF_HV 0x1571
--
1.9.3
^ permalink raw reply related
* [net-next 02/11] i40evf: properly handle multiple AQ messages
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
To: davem
Cc: Mitch Williams, netdev, nhorman, sassmann, jogreene, Patrick Lu,
Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Mitch Williams <mitch.a.williams@intel.com>
When we receive an admin queue message, the msg_size field in the event
struct gets overwritten. Because of this, we need to reinit the field
each time we go through the loop. Without this we may receive truncated
messages due to the firmware thinking we have insufficient buffer size.
Change-ID: I21dcca5114d91365d731169965ce3ffec0e4a190
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40evf/i40evf_main.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index dabe6a4..b2f01eb 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -1647,10 +1647,8 @@ static void i40evf_adminq_task(struct work_struct *work)
v_msg->v_retval, event.msg_buf,
event.msg_size);
if (pending != 0) {
- dev_info(&adapter->pdev->dev,
- "%s: ARQ: Pending events %d\n",
- __func__, pending);
memset(event.msg_buf, 0, I40EVF_MAX_AQ_BUF_SIZE);
+ event.msg_size = I40EVF_MAX_AQ_BUF_SIZE;
}
} while (pending);
--
1.9.3
^ permalink raw reply related
* [net-next 05/11] i40e: avoid disable of interrupt when changing ITR
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
To: davem
Cc: Jesse Brandeburg, netdev, nhorman, sassmann, jogreene, Patrick Lu,
Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Jesse Brandeburg <jesse.brandeburg@intel.com>
The call to irq_dynamic_disable was turning off the interrupt completely
when trying to set ITR to 0 (for lowest moderation). Just remove the
call as setting the values to 0 later in this function will suffice.
Change-ID: I47caf1ecbe65653cf63ec833db93094cd83fd84d
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-By: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 12adc08..b6e745f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1574,7 +1574,6 @@ static int i40e_set_coalesce(struct net_device *netdev,
vsi->rx_itr_setting = ec->rx_coalesce_usecs;
} else if (ec->rx_coalesce_usecs == 0) {
vsi->rx_itr_setting = ec->rx_coalesce_usecs;
- i40e_irq_dynamic_disable(vsi, vector);
if (ec->use_adaptive_rx_coalesce)
netif_info(pf, drv, netdev,
"Rx-secs=0, need to disable adaptive-Rx for a complete disable\n");
@@ -1589,7 +1588,6 @@ static int i40e_set_coalesce(struct net_device *netdev,
vsi->tx_itr_setting = ec->tx_coalesce_usecs;
} else if (ec->tx_coalesce_usecs == 0) {
vsi->tx_itr_setting = ec->tx_coalesce_usecs;
- i40e_irq_dynamic_disable(vsi, vector);
if (ec->use_adaptive_tx_coalesce)
netif_info(pf, drv, netdev,
"Tx-secs=0, need to disable adaptive-Tx for a complete disable\n");
--
1.9.3
^ permalink raw reply related
* [net-next 08/11] i40e: better wording for resource tracking errors
From: Jeff Kirsher @ 2014-11-03 14:56 UTC (permalink / raw)
To: davem
Cc: Shannon Nelson, netdev, nhorman, sassmann, jogreene, Patrick Lu,
Jeff Kirsher
In-Reply-To: <1415026599-16232-1-git-send-email-jeffrey.t.kirsher@intel.com>
From: Shannon Nelson <shannon.nelson@intel.com>
Tweak and homogenize the error reporting for get_lump() resource
tracking errors.
Change-ID: I11330161cc6ad8d04371c499c63071c816171c3b
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Patrick Lu <patrick.lu@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 83fee7f..6a481bf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7957,8 +7957,8 @@ static int i40e_vsi_setup_vectors(struct i40e_vsi *vsi)
vsi->num_q_vectors, vsi->idx);
if (vsi->base_vector < 0) {
dev_info(&pf->pdev->dev,
- "failed to get queue tracking for VSI %d, err=%d\n",
- vsi->seid, vsi->base_vector);
+ "failed to get tracking for %d vectors for VSI %d, err=%d\n",
+ vsi->num_q_vectors, vsi->seid, vsi->base_vector);
i40e_vsi_free_q_vectors(vsi);
ret = -ENOENT;
goto vector_setup_out;
@@ -7994,8 +7994,9 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs, vsi->idx);
if (ret < 0) {
- dev_info(&pf->pdev->dev, "VSI %d get_lump failed %d\n",
- vsi->seid, ret);
+ dev_info(&pf->pdev->dev,
+ "failed to get tracking for %d queues for VSI %d err=%d\n",
+ vsi->alloc_queue_pairs, vsi->seid, ret);
goto err_vsi;
}
vsi->base_queue = ret;
@@ -8124,8 +8125,9 @@ struct i40e_vsi *i40e_vsi_setup(struct i40e_pf *pf, u8 type,
ret = i40e_get_lump(pf, pf->qp_pile, vsi->alloc_queue_pairs,
vsi->idx);
if (ret < 0) {
- dev_info(&pf->pdev->dev, "VSI %d get_lump failed %d\n",
- vsi->seid, ret);
+ dev_info(&pf->pdev->dev,
+ "failed to get tracking for %d queues for VSI %d err=%d\n",
+ vsi->alloc_queue_pairs, vsi->seid, ret);
goto err_vsi;
}
vsi->base_queue = ret;
--
1.9.3
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox