Netdev List


* [PATCH net-next v3 5/9] net: introduce new macro net_get_random_once
From: Hannes Frederic Sowa @ 2013-10-17  5:31 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, Ingo Molnar, Steven Rostedt, Jason Baron,
	Peter Zijlstra, Eric Dumazet, David S. Miller
In-Reply-To: <1381987923-1524-1-git-send-email-hannes@stressinduktion.org>

net_get_random_once is a new macro which handles the initialization
of secret keys. It can be called in the fast path: only the
initialization takes the spinlock and is rather slow. It should be
called just before the key is first used, so that entropy extraction is
delayed as long as possible to get better randomness. It returns true
if the call initialized the key.

The usage of static_keys for net_get_random_once is a bit uncommon so
it needs some further explanation why this actually works:

=== In the simple non-HAVE_JUMP_LABEL case we actually have ===
no constraints on using static_key_(true|false) on keys initialized with
STATIC_KEY_INIT_(FALSE|TRUE). So this path just expands in favor of
the likely case that the initialization is already done. The key is
initialized like this:

___done_key = { .enabled = ATOMIC_INIT(0) }

The check

                if (!static_key_true(&___done_key))                     \

expands into (pseudo code)

                if (!likely(___done_key > 0))

, so we take the fast path as soon as ___done_key is increased from the
helper function.
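
The non-jump-label behaviour described above can be modelled in plain
userspace C. This is only a sketch under stated assumptions: the names
model_key, model_slow_path and model_get_random_once are hypothetical,
rand() is a placeholder for get_random_bytes() and is not a secure
source, likely() is rebuilt from the GCC/Clang builtin, and locking is
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's likely() hint (GCC/Clang builtin). */
#define likely(x) __builtin_expect(!!(x), 1)

/* Hypothetical model of a static key without jump labels:
 * only the counter matters.
 */
struct model_key {
	int enabled;
};

/* Hypothetical slow path: fill the buffer once and bump the counter.
 * rand() is a placeholder for get_random_bytes() and is NOT secure.
 */
static bool model_slow_path(void *buf, size_t nbytes, struct model_key *key)
{
	unsigned char *p = buf;
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < nbytes; i++)
		p[i] = (unsigned char)rand();
	key->enabled++;
	return true;
}

/* Returns true only for the call that initialized the buffer. */
static bool model_get_random_once(void *buf, size_t nbytes,
				  struct model_key *key)
{
	if (!likely(key->enabled > 0))
		return model_slow_path(buf, nbytes, key);
	return false;	/* fast path: counter already raised */
}
```

Calling model_get_random_once twice on the same key returns true the
first time and false afterwards, with the counter staying at 1.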

=== If HAVE_JUMP_LABELs are available this depends ===
on patching of jumps into the prepared NOPs, which is done in
jump_label_init at boot-up time (from start_kernel). It is forbidden
and dangerous to use net_get_random_once in functions which are called
before that!

At compilation time NOPs are generated at the call sites of
net_get_random_once. E.g. net/ipv6/inet6_hashtable.c:inet6_ehashfn (we
need to call net_get_random_once two times in inet6_ehashfn, so two NOPs):

      71:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      76:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Both will be patched to the actual jumps to the end of the function to
call __net_get_random_once at boot time as explained above.

arch_static_branch is optimized and inlined for false as return value and
actually also returns false in case the NOP is placed in the instruction
stream. So in the fast case we get a "return false". But because we
initialize ___done_key with (enabled != (entries & 1)) this call-site
will get patched up at boot thus returning true. The final check looks
like this:

                if (!static_key_true(&___done_key))                     \
                        ___ret = __net_get_random_once(buf,             \

expands to

                if (!!static_key_false(&___done_key))                     \
                        ___ret = __net_get_random_once(buf,             \

So we get true at boot time and as soon as static_key_slow_inc is called
on the key it will invert the logic and return false for the fast path.
static_key_slow_inc will change the branch because it got initialized
with .enabled == 0. After static_key_slow_inc is called on the key the
branch is replaced with a nop again.
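
The inversion described above can be checked with a small truth-table
model. This is a simplified sketch, not the kernel's jump label code:
the struct and the model_* helpers are hypothetical, and "patched" is
assumed to mean that a jump replaces the NOP whenever the key state
differs from the branch default stored in bit 0 of entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a jump-label key: 'enabled' is the
 * counter and 'true_branch' mirrors bit 0 of the entries pointer.
 */
struct model_jump_key {
	int enabled;
	bool true_branch;
};

/* Assumption: a jump is patched over the NOP when the key state differs
 * from the branch default; otherwise the NOP stays in place.
 */
static bool model_jump_patched(const struct model_jump_key *key)
{
	assert(key != NULL);
	return (key->enabled > 0) != key->true_branch;
}

/* Models static_key_false(): returns false while the NOP is in place. */
static bool model_static_key_false(const struct model_jump_key *key)
{
	return model_jump_patched(key);
}

/* Models static_key_true() as the negation of static_key_false(). */
static bool model_static_key_true(const struct model_jump_key *key)
{
	return !model_static_key_false(key);
}

/* True when the call site would enter __net_get_random_once(). */
static bool model_takes_slow_path(const struct model_jump_key *key)
{
	return !model_static_key_true(key);
}
```

With enabled == 0 and true_branch set, the model takes the slow path;
once enabled is raised to 1 it falls through to the fast path.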

=== Misc: ===
The helper defers the increment into a workqueue so we don't
have problems calling this code from atomic sections. A separate boolean
(___done) guards the case where we enter net_get_random_once again before
the increment happened.
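
The interaction between the deferred increment and the ___done guard
can be sketched in userspace as follows. All names here (once_state,
once_slow_path, once_run_deferred, once_get_random) are hypothetical,
a flag stands in for the queued work item, rand() is an insecure
placeholder, and the spinlock is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct once_state {
	bool done;		/* set as soon as the buffer is filled */
	int key_enabled;	/* fast-path switch, raised later */
	bool work_queued;	/* stands in for the scheduled work item */
};

static bool once_slow_path(void *buf, size_t nbytes, struct once_state *st)
{
	unsigned char *p = buf;
	size_t i;

	assert(buf != NULL);
	if (st->done)
		return false;	/* re-entered before the deferred increment */

	for (i = 0; i < nbytes; i++)
		p[i] = (unsigned char)rand();	/* placeholder, not secure */
	st->done = true;
	st->work_queued = true;	/* defer the key increment */
	return true;
}

/* Models the work item: raise the key at most once. */
static void once_run_deferred(struct once_state *st)
{
	if (st->work_queued && st->key_enabled == 0)
		st->key_enabled++;
	st->work_queued = false;
}

static bool once_get_random(void *buf, size_t nbytes, struct once_state *st)
{
	if (st->key_enabled > 0)
		return false;	/* fast path */
	return once_slow_path(buf, nbytes, st);
}
```

A second call made before once_run_deferred still reaches the slow
path, but the done flag keeps it from overwriting the key.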

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
I tested this patchset with !CC_HAVE_ASM_GOTO and with CC_HAVE_ASM_GOTO
on x86_64.

I quickly reviewed that all architectures which implement HAVE_JUMP_LABEL
also patch all branch sites on boot-up. But this needs further review
as this is a security sensitive patch series.

Thank you!

 include/linux/net.h | 25 +++++++++++++++++++++++++
 net/core/utils.c    | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+)

diff --git a/include/linux/net.h b/include/linux/net.h
index ca9ec85..a489705 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -239,6 +239,31 @@ do {								\
 #define net_random()		prandom_u32()
 #define net_srandom(seed)	prandom_seed((__force u32)(seed))

+bool __net_get_random_once(void *buf, int nbytes, bool *done,
+			   struct static_key *done_key);
+
+#ifdef HAVE_JUMP_LABEL
+#define ___NET_RANDOM_STATIC_KEY_INIT ((struct static_key) \
+		{ .enabled = ATOMIC_INIT(0), .entries = (void *)1 })
+#else /* !HAVE_JUMP_LABEL */
+#define ___NET_RANDOM_STATIC_KEY_INIT STATIC_KEY_INIT_FALSE
+#endif /* HAVE_JUMP_LABEL */
+
+/* BE CAREFUL: this function is not interrupt safe */
+#define net_get_random_once(buf, nbytes)				\
+	({								\
+		bool ___ret = false;					\
+		static bool ___done = false;				\
+		static struct static_key ___done_key =			\
+			___NET_RANDOM_STATIC_KEY_INIT;			\
+		if (!static_key_true(&___done_key))			\
+			___ret = __net_get_random_once(buf,		\
+						       nbytes,		\
+						       &___done,	\
+						       &___done_key);	\
+		___ret;							\
+	})
+
 int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec,
 		   size_t num, size_t len);
 int kernel_recvmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec,
diff --git a/net/core/utils.c b/net/core/utils.c
index aa88e23..bf09371 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -338,3 +338,51 @@ void inet_proto_csum_replace16(__sum16 *sum, struct sk_buff *skb,
 				  csum_unfold(*sum)));
 }
 EXPORT_SYMBOL(inet_proto_csum_replace16);
+
+struct __net_random_once_work {
+	struct work_struct work;
+	struct static_key *key;
+};
+
+static void __net_random_once_deferred(struct work_struct *w)
+{
+	struct __net_random_once_work *work =
+		container_of(w, struct __net_random_once_work, work);
+	if (!static_key_enabled(work->key))
+		static_key_slow_inc(work->key);
+	kfree(work);
+}
+
+static void __net_random_once_disable_jump(struct static_key *key)
+{
+	struct __net_random_once_work *w;
+
+	w = kmalloc(sizeof(*w), GFP_ATOMIC);
+	if (!w)
+		return;
+
+	INIT_WORK(&w->work, __net_random_once_deferred);
+	w->key = key;
+	schedule_work(&w->work);
+}
+
+bool __net_get_random_once(void *buf, int nbytes, bool *done,
+			   struct static_key *done_key)
+{
+	static DEFINE_SPINLOCK(lock);
+
+	spin_lock_bh(&lock);
+	if (*done) {
+		spin_unlock_bh(&lock);
+		return false;
+	}
+
+	get_random_bytes(buf, nbytes);
+	*done = true;
+	spin_unlock_bh(&lock);
+
+	__net_random_once_disable_jump(done_key);
+
+	return true;
+}
+EXPORT_SYMBOL(__net_get_random_once);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v3 6/9] inet: split syncookie keys for ipv4 and ipv6 and initialize with net_get_random_once
From: Hannes Frederic Sowa @ 2013-10-17  5:32 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Florian Westphal, Eric Dumazet, David S. Miller
In-Reply-To: <1381987923-1524-1-git-send-email-hannes@stressinduktion.org>

This patch splits the secret key for syncookies for ipv4 and ipv6 and
initializes them with net_get_random_once. This change was the reason I
did this series. I think the initialization of the syncookie_secret is
way too early.

Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/tcp.h     |  1 -
 net/ipv4/syncookies.c | 15 +++++----------
 net/ipv6/syncookies.c | 12 +++++++++---
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1db3a01..0fa8fdc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -475,7 +475,6 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size);
 void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
 
 /* From syncookies.c */
-extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
 int __cookie_v4_check(const struct iphdr *iph, const struct tcphdr *th,
 		      u32 cookie);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 3b64c59..b95331e 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -25,15 +25,7 @@
 
 extern int sysctl_tcp_syncookies;
 
-__u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
-EXPORT_SYMBOL(syncookie_secret);
-
-static __init int init_syncookies(void)
-{
-	get_random_bytes(syncookie_secret, sizeof(syncookie_secret));
-	return 0;
-}
-__initcall(init_syncookies);
+static u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
 
 #define COOKIEBITS 24	/* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
@@ -44,8 +36,11 @@ static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS],
 static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
 		       u32 count, int c)
 {
-	__u32 *tmp = __get_cpu_var(ipv4_cookie_scratch);
+	__u32 *tmp;
+
+	net_get_random_once(syncookie_secret, sizeof(syncookie_secret));
 
+	tmp  = __get_cpu_var(ipv4_cookie_scratch);
 	memcpy(tmp + 4, syncookie_secret[c], sizeof(syncookie_secret[c]));
 	tmp[0] = (__force u32)saddr;
 	tmp[1] = (__force u32)daddr;
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index d04d3f1..535a3ad 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -24,6 +24,8 @@
 #define COOKIEBITS 24	/* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
 
+static u32 syncookie6_secret[2][16-4+SHA_DIGEST_WORDS];
+
 /* RFC 2460, Section 8.3:
  * [ipv6 tcp] MSS must be computed as the maximum packet size minus 60 [..]
  *
@@ -61,14 +63,18 @@ static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS],
 static u32 cookie_hash(const struct in6_addr *saddr, const struct in6_addr *daddr,
 		       __be16 sport, __be16 dport, u32 count, int c)
 {
-	__u32 *tmp = __get_cpu_var(ipv6_cookie_scratch);
+	__u32 *tmp;
+
+	net_get_random_once(syncookie6_secret, sizeof(syncookie6_secret));
+
+	tmp  = __get_cpu_var(ipv6_cookie_scratch);
 
 	/*
 	 * we have 320 bits of information to hash, copy in the remaining
-	 * 192 bits required for sha_transform, from the syncookie_secret
+	 * 192 bits required for sha_transform, from the syncookie6_secret
 	 * and overwrite the digest with the secret
 	 */
-	memcpy(tmp + 10, syncookie_secret[c], 44);
+	memcpy(tmp + 10, syncookie6_secret[c], 44);
 	memcpy(tmp, saddr, 16);
 	memcpy(tmp + 4, daddr, 16);
 	tmp[8] = ((__force u32)sport << 16) + (__force u32)dport;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v3 7/9] inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once
From: Hannes Frederic Sowa @ 2013-10-17  5:32 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Eric Dumazet, David S. Miller
In-Reply-To: <1381987923-1524-1-git-send-email-hannes@stressinduktion.org>

Initialize the ehash and ipv6_hash_secrets with net_get_random_once.

Each compilation unit gets its own secret now:
  ipv4/inet_hashtables.o
  ipv4/udp.o
  ipv6/inet6_hashtables.o
  ipv6/udp.o
  rds/connection.o

The functions still get inlined into the hashing functions. In the fast
path we have at most two (needed in ipv6) if (unlikely(...)).
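
The per-function secret pattern can be sketched in userspace like this.
It is an illustration only: sketch_random_once, sketch_mix and the two
hash functions are hypothetical names, the mixing step is not the
kernel's jhash, rand() is an insecure placeholder, and the code is
single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical one-time initializer; rand() is a placeholder and not a
 * secure source of randomness.
 */
static bool sketch_random_once(uint32_t *secret, bool *done)
{
	assert(secret != NULL && done != NULL);
	if (*done)
		return false;
	*secret = (uint32_t)rand();
	*done = true;
	return true;
}

/* Illustrative mixing step only, not the kernel's jhash. */
static uint32_t sketch_mix(uint32_t a, uint32_t b, uint32_t initval)
{
	uint32_t h = initval ^ (a * 0x9e3779b1u);

	h ^= h >> 15;
	h += b * 0x85ebca6bu;
	return h ^ (h >> 13);
}

/* Each hash function owns a private, lazily initialized secret. */
static uint32_t sketch_tcp_hash(uint32_t laddr, uint32_t faddr)
{
	static uint32_t secret;
	static bool done;

	sketch_random_once(&secret, &done);
	return sketch_mix(laddr, faddr, secret);
}

static uint32_t sketch_udp_hash(uint32_t laddr, uint32_t faddr)
{
	static uint32_t secret;
	static bool done;

	sketch_random_once(&secret, &done);
	return sketch_mix(laddr, faddr, secret);
}
```

Because each secret is initialized exactly once, repeated calls with
the same addresses return the same hash value within one function.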

Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/inet_sock.h     |  4 ----
 net/ipv4/af_inet.c          | 27 ---------------------------
 net/ipv4/inet_hashtables.c  |  4 ++++
 net/ipv4/udp.c              |  6 +++++-
 net/ipv6/af_inet6.c         |  5 -----
 net/ipv6/inet6_hashtables.c | 15 ++++++++++++---
 net/ipv6/udp.c              | 17 ++++++++++++++---
 net/rds/connection.c        | 12 +++++++++---
 8 files changed, 44 insertions(+), 46 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 7a6c7f8..1833c3f 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -204,10 +204,6 @@ static inline void inet_sk_copy_descendant(struct sock *sk_to,
 
 int inet_sk_rebuild_header(struct sock *sk);
 
-extern u32 inet_ehash_secret;
-extern u32 ipv6_hash_secret;
-void build_ehash_secret(void);
-
 static inline unsigned int __inet_ehashfn(const __be32 laddr,
 					  const __u16 lport,
 					  const __be32 faddr,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 35913fb..b6bdd82 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -245,29 +245,6 @@ out:
 }
 EXPORT_SYMBOL(inet_listen);
 
-u32 inet_ehash_secret __read_mostly;
-EXPORT_SYMBOL(inet_ehash_secret);
-
-u32 ipv6_hash_secret __read_mostly;
-EXPORT_SYMBOL(ipv6_hash_secret);
-
-/*
- * inet_ehash_secret must be set exactly once, and to a non nul value
- * ipv6_hash_secret must be set exactly once.
- */
-void build_ehash_secret(void)
-{
-	u32 rnd;
-
-	do {
-		get_random_bytes(&rnd, sizeof(rnd));
-	} while (rnd == 0);
-
-	if (cmpxchg(&inet_ehash_secret, 0, rnd) == 0)
-		get_random_bytes(&ipv6_hash_secret, sizeof(ipv6_hash_secret));
-}
-EXPORT_SYMBOL(build_ehash_secret);
-
 /*
  *	Create an inet socket.
  */
@@ -284,10 +261,6 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
 	int try_loading_module = 0;
 	int err;
 
-	if (unlikely(!inet_ehash_secret))
-		if (sock->type != SOCK_RAW && sock->type != SOCK_DGRAM)
-			build_ehash_secret();
-
 	sock->state = SS_UNCONNECTED;
 
 	/* Look for the requested type/protocol pair. */
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 18aa668..8b9cf27 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -28,6 +28,10 @@ static unsigned int inet_ehashfn(struct net *net, const __be32 laddr,
 				 const __u16 lport, const __be32 faddr,
 				 const __be16 fport)
 {
+	static u32 inet_ehash_secret __read_mostly;
+
+	net_get_random_once(&inet_ehash_secret, sizeof(inet_ehash_secret));
+
 	return __inet_ehashfn(laddr, lport, faddr, fport,
 			      inet_ehash_secret + net_hash_mix(net));
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b4437c7..89909dd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -411,8 +411,12 @@ static unsigned int udp_ehashfn(struct net *net, const __be32 laddr,
 				 const __u16 lport, const __be32 faddr,
 				 const __be16 fport)
 {
+	static u32 udp_ehash_secret __read_mostly;
+
+	net_get_random_once(&udp_ehash_secret, sizeof(udp_ehash_secret));
+
 	return __inet_ehashfn(laddr, lport, faddr, fport,
-			      inet_ehash_secret + net_hash_mix(net));
+			      udp_ehash_secret + net_hash_mix(net));
 }
 
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index a2cb07c..20af1fb 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -110,11 +110,6 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
 	int try_loading_module = 0;
 	int err;
 
-	if (sock->type != SOCK_RAW &&
-	    sock->type != SOCK_DGRAM &&
-	    !inet_ehash_secret)
-		build_ehash_secret();
-
 	/* Look for the requested type/protocol pair. */
 lookup_protocol:
 	err = -ESOCKTNOSUPPORT;
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index fa7dd38..262e13c 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -29,10 +29,19 @@ static unsigned int inet6_ehashfn(struct net *net,
 				  const struct in6_addr *faddr,
 				  const __be16 fport)
 {
-	const u32 lhash = (__force u32)laddr->s6_addr32[3];
-	const u32 fhash = __ipv6_addr_jhash(faddr, ipv6_hash_secret);
+	static u32 inet6_ehash_secret __read_mostly;
+	static u32 ipv6_hash_secret __read_mostly;
+
+	u32 lhash, fhash;
+
+	net_get_random_once(&inet6_ehash_secret, sizeof(inet6_ehash_secret));
+	net_get_random_once(&ipv6_hash_secret, sizeof(ipv6_hash_secret));
+
+	lhash = (__force u32)laddr->s6_addr32[3];
+	fhash = __ipv6_addr_jhash(faddr, ipv6_hash_secret);
+
 	return __inet6_ehashfn(lhash, lport, fhash, fport,
-			       inet_ehash_secret + net_hash_mix(net));
+			       inet6_ehash_secret + net_hash_mix(net));
 }
 
 static int inet6_sk_ehashfn(const struct sock *sk)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 324bd36..44fc4e3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -59,10 +59,21 @@ static unsigned int udp6_ehashfn(struct net *net,
 				  const struct in6_addr *faddr,
 				  const __be16 fport)
 {
-	const u32 lhash = (__force u32)laddr->s6_addr32[3];
-	const u32 fhash = __ipv6_addr_jhash(faddr, ipv6_hash_secret);
+	static u32 udp6_ehash_secret __read_mostly;
+	static u32 udp_ipv6_hash_secret __read_mostly;
+
+	u32 lhash, fhash;
+
+	net_get_random_once(&udp6_ehash_secret,
+			    sizeof(udp6_ehash_secret));
+	net_get_random_once(&udp_ipv6_hash_secret,
+			    sizeof(udp_ipv6_hash_secret));
+
+	lhash = (__force u32)laddr->s6_addr32[3];
+	fhash = __ipv6_addr_jhash(faddr, udp_ipv6_hash_secret);
+
 	return __inet6_ehashfn(lhash, lport, fhash, fport,
-			       inet_ehash_secret + net_hash_mix(net));
+			       udp6_ehash_secret + net_hash_mix(net));
 }
 
 int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2)
diff --git a/net/rds/connection.c b/net/rds/connection.c
index 45e2366..378c3a6 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -51,10 +51,16 @@ static struct kmem_cache *rds_conn_slab;
 
 static struct hlist_head *rds_conn_bucket(__be32 laddr, __be32 faddr)
 {
+	static u32 rds_hash_secret __read_mostly;
+
+	unsigned long hash;
+
+	net_get_random_once(&rds_hash_secret, sizeof(rds_hash_secret));
+
 	/* Pass NULL, don't need struct net for hash */
-	unsigned long hash = __inet_ehashfn(be32_to_cpu(laddr), 0,
-					    be32_to_cpu(faddr), 0,
-					    inet_ehash_secret);
+	hash = __inet_ehashfn(be32_to_cpu(laddr), 0,
+			      be32_to_cpu(faddr), 0,
+			      rds_hash_secret);
 	return &rds_conn_hash[hash & RDS_CONNECTION_HASH_MASK];
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v3 8/9] tcp: switch tcp_fastopen key generation to net_get_random_once
From: Hannes Frederic Sowa @ 2013-10-17  5:32 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Yuchung Cheng, Eric Dumazet, David S. Miller
In-Reply-To: <1381987923-1524-1-git-send-email-hannes@stressinduktion.org>

Changed key initialization of tcp_fastopen cookies to net_get_random_once.

If the user sets a custom key net_get_random_once must be called at
least once to ensure we don't overwrite the user provided key when the
first cookie is generated later on.
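
The ordering problem can be sketched in userspace as below. The
sketch_* functions are hypothetical stand-ins for the kernel functions
named in the comments, rand() is an insecure placeholder, and there is
no locking or RCU here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_KEY_LEN 16

static unsigned char active_key[SKETCH_KEY_LEN];	/* the published key */
static bool active_key_set;

/* Stand-in for tcp_fastopen_reset_cipher(): publish a key. */
static void sketch_reset_cipher(const unsigned char *key, size_t len)
{
	assert(len == SKETCH_KEY_LEN);
	memcpy(active_key, key, len);
	active_key_set = true;
}

/* Stand-in for tcp_fastopen_init_key_once(). */
static void sketch_init_key_once(bool publish)
{
	static unsigned char key[SKETCH_KEY_LEN];
	static bool done;
	size_t i;

	if (done)
		return;
	for (i = 0; i < sizeof(key); i++)
		key[i] = (unsigned char)rand();	/* placeholder, not secure */
	done = true;
	if (publish)
		sketch_reset_cipher(key, sizeof(key));
}

/* User sets a key: consume the one-time init first, without publishing. */
static void sketch_set_user_key(const unsigned char *user_key)
{
	sketch_init_key_once(false);
	sketch_reset_cipher(user_key, SKETCH_KEY_LEN);
}

/* First cookie generation: publishes a random key only if the one-time
 * init has not run yet.
 */
static void sketch_cookie_gen(void)
{
	sketch_init_key_once(true);
}
```

If sketch_set_user_key runs before the first sketch_cookie_gen, the
user key stays published because the one-time init is already spent.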

Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/tcp.h          |  2 +-
 net/ipv4/sysctl_net_ipv4.c |  5 +++++
 net/ipv4/tcp_fastopen.c    | 27 ++++++++++++++++-----------
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fa8fdc..027b663 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1322,7 +1322,7 @@ extern struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
 int tcp_fastopen_reset_cipher(void *key, unsigned int len);
 void tcp_fastopen_cookie_gen(__be32 src, __be32 dst,
 			     struct tcp_fastopen_cookie *foc);
-
+void tcp_fastopen_init_key_once(bool publish);
 #define TCP_FASTOPEN_KEY_LENGTH 16
 
 /* Fastopen key context */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index c08f096..4b161d5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -274,6 +274,11 @@ static int proc_tcp_fastopen_key(struct ctl_table *ctl, int write,
 			ret = -EINVAL;
 			goto bad_key;
 		}
+		/* Generate a dummy secret but don't publish it. This
+		 * is needed so we don't regenerate a new key on the
+		 * first invocation of tcp_fastopen_cookie_gen
+		 */
+		tcp_fastopen_init_key_once(false);
 		tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH);
 	}
 
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index ab7bd35..766032b 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -14,6 +14,20 @@ struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
 
 static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock);
 
+void tcp_fastopen_init_key_once(bool publish)
+{
+	static u8 key[TCP_FASTOPEN_KEY_LENGTH];
+
+	/* tcp_fastopen_reset_cipher publishes the new context
+	 * atomically, so we allow this race happening here.
+	 *
+	 * All call sites of tcp_fastopen_cookie_gen also check
+	 * for a valid cookie, so this is an acceptable risk.
+	 */
+	if (net_get_random_once(key, sizeof(key)) && publish)
+		tcp_fastopen_reset_cipher(key, sizeof(key));
+}
+
 static void tcp_fastopen_ctx_free(struct rcu_head *head)
 {
 	struct tcp_fastopen_context *ctx =
@@ -70,6 +84,8 @@ void tcp_fastopen_cookie_gen(__be32 src, __be32 dst,
 	__be32 path[4] = { src, dst, 0, 0 };
 	struct tcp_fastopen_context *ctx;
 
+	tcp_fastopen_init_key_once(true);
+
 	rcu_read_lock();
 	ctx = rcu_dereference(tcp_fastopen_ctx);
 	if (ctx) {
@@ -78,14 +94,3 @@ void tcp_fastopen_cookie_gen(__be32 src, __be32 dst,
 	}
 	rcu_read_unlock();
 }
-
-static int __init tcp_fastopen_init(void)
-{
-	__u8 key[TCP_FASTOPEN_KEY_LENGTH];
-
-	get_random_bytes(key, sizeof(key));
-	tcp_fastopen_reset_cipher(key, sizeof(key));
-	return 0;
-}
-
-late_initcall(tcp_fastopen_init);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v3 9/9] net: switch net_secret key generation to net_get_random_once
From: Hannes Frederic Sowa @ 2013-10-17  5:32 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, Eric Dumazet, David S. Miller
In-Reply-To: <1381987923-1524-1-git-send-email-hannes@stressinduktion.org>

Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/core/secure_seq.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 3f1ec15..b02fd16 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -7,6 +7,7 @@
 #include <linux/hrtimer.h>
 #include <linux/ktime.h>
 #include <linux/string.h>
+#include <linux/net.h>
 
 #include <net/secure_seq.h>
 
@@ -16,18 +17,7 @@ static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned;
 
 static void net_secret_init(void)
 {
-	u32 tmp;
-	int i;
-
-	if (likely(net_secret[0]))
-		return;
-
-	for (i = NET_SECRET_SIZE; i > 0;) {
-		do {
-			get_random_bytes(&tmp, sizeof(tmp));
-		} while (!tmp);
-		cmpxchg(&net_secret[--i], 0, tmp);
-	}
+	net_get_random_once(net_secret, sizeof(net_secret));
 }
 
 #ifdef CONFIG_INET
-- 
1.8.3.1

^ permalink raw reply related

* RE: [PATCH net v2] be2net: Warn users of possible broken functionality on BE2 cards with very old FW versions with latest driver
From: Somnath Kotur @ 2013-10-17  5:41 UTC (permalink / raw)
  To: Ivan Vecera; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <525EA9E7.409@redhat.com>

> -----Original Message-----
> From: Ivan Vecera [mailto:ivecera@redhat.com]
> Sent: Wednesday, October 16, 2013 8:30 PM
> To: Somnath Kotur
> Cc: David Miller; netdev@vger.kernel.org
> Subject: Re: [PATCH net v2] be2net: Warn users of possible broken
> functionality on BE2 cards with very old FW versions with latest driver
> 
> On 10/07/2013 06:31 PM, David Miller wrote:
> > From: Somnath Kotur <somnath.kotur@emulex.com>
> > Date: Thu, 3 Oct 2013 15:34:29 +0530
> >
> >> +	if (BE2_chip(adapter) && memcmp(adapter->fw_ver, "4.", 2) < 0) {
> >> +		dev_err(dev, "Firmware version is too old.IRQs may not
> work\n");
> >
> > So many grammatical mistakes in one line.
> >
> > First sentence got a period, second one did not.
> >
> > Missing space between period and second sentence.
> >
> Som, any plan to send v3?
> 
> Ivan
Hi Ivan,
   Yes. The problem was that I was trying to stay within the 80-character
line limit; the missing space above would have pushed the line over it,
needlessly requiring an extra line to accommodate one character.
I will rework the sentence using Joe Perches's suggestions and address
Ben Hutchings's concerns as well.
Stay tuned.

Thanks
Som

^ permalink raw reply

* RE: [PATCH 0/2] be2net: patch set
From: Sathya Perla @ 2013-10-17  6:00 UTC (permalink / raw)
  To: Sathya Perla, netdev@vger.kernel.org
In-Reply-To: <1381837869-9543-1-git-send-email-sathya.perla@emulex.com>


> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf
> Of Sathya Perla
> 
> Pls apply the following fixes to the net tree. Thanks.
> 
> Vasundhara Volam (2):
>   be2net: pass if_id for v1 and V2 versions of TX_CREATE cmd
>   be2net: drop non-tso frames longer than mtu
> 
Dave, I'll re-send this patch-set without the "drop non-tso frames" patch.
Based on Eric's comments, it seems better to place the mtu checks in pktgen and
before ndo_start_xmit() invocation so that all drivers/devices may benefit from the check.

thanks,
-Sathya

^ permalink raw reply

* [PATCH] be2net: pass if_id for v1 and V2 versions of TX_CREATE cmd
From: Sathya Perla @ 2013-10-17  6:17 UTC (permalink / raw)
  To: netdev

From: Vasundhara Volam <vasundhara.volam@emulex.com>

It is a required field for all TX_CREATE cmd versions > 0.
This fixes a driver initialization failure, caused by recent SH-R Firmwares
(versions > 10.0.639.0) failing the TX_CREATE cmd when if_id field is
not passed.

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
---

Dave, pls consider queuing this patch for -stable tree as it fixes
a driver load failure. Thanks.

 drivers/net/ethernet/emulex/benet/be_cmds.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index bd0e0c0..c08fd32 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -1198,7 +1198,6 @@ int be_cmd_txq_create(struct be_adapter *adapter, struct be_tx_obj *txo)
 
 	if (lancer_chip(adapter)) {
 		req->hdr.version = 1;
-		req->if_id = cpu_to_le16(adapter->if_handle);
 	} else if (BEx_chip(adapter)) {
 		if (adapter->function_caps & BE_FUNCTION_CAPS_SUPER_NIC)
 			req->hdr.version = 2;
@@ -1206,6 +1205,8 @@ int be_cmd_txq_create(struct be_adapter *adapter, struct be_tx_obj *txo)
 		req->hdr.version = 2;
 	}
 
+	if (req->hdr.version > 0)
+		req->if_id = cpu_to_le16(adapter->if_handle);
 	req->num_pages = PAGES_4K_SPANNED(q_mem->va, q_mem->size);
 	req->ulp_num = BE_ULP1_NUM;
 	req->type = BE_ETH_TX_RING_TYPE_STANDARD;
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH 5/5] net: rfkill: gpio: add ACPI support
From: Mika Westerberg @ 2013-10-17  7:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Heikki Krogerus, John W. Linville, Johannes Berg, Rhyland Klein,
	linux-acpi, linux-wireless, netdev
In-Reply-To: <2878506.7lf24R85t6@vostro.rjw.lan>

On Wed, Oct 16, 2013 at 10:55:01PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, October 16, 2013 01:53:43 PM Heikki Krogerus wrote:
> > Including ACPI ID for Broadcom GPS receiver BCM4752.
> > 
> > Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
> > ---
> >  net/rfkill/rfkill-gpio.c | 31 ++++++++++++++++++++++++++++++-
> >  1 file changed, 30 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
> > index 2dd78c6..5620d3c 100644
> > --- a/net/rfkill/rfkill-gpio.c
> > +++ b/net/rfkill/rfkill-gpio.c
> > @@ -24,6 +24,8 @@
> >  #include <linux/platform_device.h>
> >  #include <linux/clk.h>
> >  #include <linux/slab.h>
> > +#include <linux/acpi.h>
> > +#include <linux/acpi_gpio.h>
> >  
> >  #include <linux/rfkill-gpio.h>
> >  
> > @@ -70,6 +72,23 @@ static const struct rfkill_ops rfkill_gpio_ops = {
> >  	.set_block = rfkill_gpio_set_power,
> >  };
> >  
> > +static int rfkill_gpio_acpi_probe(struct device *dev,
> > +				  struct rfkill_gpio_data *rfkill)
> > +{
> > +	const struct acpi_device_id *id;
> > +
> > +	id = acpi_match_device(dev->driver->acpi_match_table, dev);
> > +	if (!id)
> > +		return -ENODEV;
> > +
> > +	rfkill->name = dev_name(dev);
> > +	rfkill->type = (unsigned)id->driver_data;
> > +	rfkill->reset_gpio = acpi_get_gpio_by_index(dev, 0, NULL);
> > +	rfkill->shutdown_gpio = acpi_get_gpio_by_index(dev, 1, NULL);
> > +
> > +	return 0;
> > +}
> > +
> >  static int rfkill_gpio_probe(struct platform_device *pdev)
> >  {
> >  	struct rfkill_gpio_platform_data *pdata = pdev->dev.platform_data;
> > @@ -82,7 +101,11 @@ static int rfkill_gpio_probe(struct platform_device *pdev)
> >  	if (!rfkill)
> >  		return -ENOMEM;
> >  
> > -	if (pdata) {
> > +	if (ACPI_HANDLE(&pdev->dev)) {
> > +		ret = rfkill_gpio_acpi_probe(&pdev->dev, rfkill);
> > +		if (ret)
> > +			return ret;
> > +	} else if (pdata) {
> >  		clk_name = pdata->power_clk_name;
> >  		rfkill->name = pdata->name;
> >  		rfkill->type = pdata->type;
> > @@ -170,12 +193,18 @@ static int rfkill_gpio_remove(struct platform_device *pdev)
> >  	return 0;
> >  }
> >  
> > +static const struct acpi_device_id rfkill_acpi_match[] = {
> > +	{ "BCM4752", RFKILL_TYPE_GPS },
> > +	{ },
> > +};
> > +
> >  static struct platform_driver rfkill_gpio_driver = {
> >  	.probe = rfkill_gpio_probe,
> >  	.remove = rfkill_gpio_remove,
> >  	.driver = {
> >  		.name = "rfkill_gpio",
> >  		.owner = THIS_MODULE,
> > +		.acpi_match_table = ACPI_PTR(rfkill_acpi_match),
> >  	},
> >  };
> 
> Looks good to me.
> 
> Has Mika seen this?

Yes, saw it now and looks good to me as well.

Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>

for the whole series, for what it's worth.

^ permalink raw reply

* Re: [PATCH 2/3] ipvs: avoid rcu_barrier during netns cleanup
From: Pablo Neira Ayuso @ 2013-10-17  8:11 UTC (permalink / raw)
  To: Simon Horman
  Cc: Julian Anastasov, lvs-devel, netdev, netfilter-devel,
	Wensong Zhang
In-Reply-To: <20131017004939.GB21728@verge.net.au>

On Thu, Oct 17, 2013 at 09:49:39AM +0900, Simon Horman wrote:
> On Wed, Oct 16, 2013 at 10:52:14PM +0300, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Wed, 16 Oct 2013, Pablo Neira Ayuso wrote:
> > 
> > > I can enqueue this fix to nf if you like. No need to resend, I can
> > > manually apply.
> > > 
> > > Let me know.
> > 
> > 	It is not critical. I waited weeks the net tree to be
> > copied into net-next because it collides with the recent
> > "ipvs: make the service replacement more robust" change in
> > net tree :) But if a rcu_barrier in the netns cleanup looks
> > scary enough you can push it to nf. IMHO, it just adds
> > unneeded delay there.
> 
> If it is not critical I would prefer for it to travel through
> nf-next. Though I do not feel strongly about this.

Will enqueue for nf-next.

I'd appreciate if you can recover the tradition of attaching a short
evaluation in the cover letter as I do when I send pull requests to
David. Thanks!

^ permalink raw reply

* Re: [Xen-devel] [PATCH net] xen-netback: add the scenario which now beyond the range time_after_eq().
From: Jan Beulich @ 2013-10-17  8:26 UTC (permalink / raw)
  To: Jason Luan
  Cc: david.vrabel, ian.campbell, wei.liu2, xen-devel, annie.li, netdev
In-Reply-To: <1381944167-24918-1-git-send-email-jianhai.luan@oracle.com>

>>> On 16.10.13 at 19:22, Jason Luan <jianhai.luan@oracle.com> wrote:
> time_after_eq() only works if the delta is < MAX_ULONG/2.
> 
> If netfront sends at a very low rate, the time between subsequent calls
> to tx_credit_exceeded() may exceed MAX_ULONG/2 and the test for
> timer_after_eq() will be incorrect.  Credit will not be replenished and
> the guest may become unable to send (e.g., if prior to the long gap, all
> credit was exhausted).
> 
> We should add the scenario which now beyond next_credit+MAX_UNLONG/2. Because
> the fact now must be not before than expire, time_before(now, expire) == true
> will verify the scenario.
>     time_after_eq(now, next_credit) || time_before (now, expire)
>     ==
>     !time_in_range_open(now, expire, next_credit)

So first of all, this can only happen with a 32-bit netback. And the
gap between activity that cannot be covered is well over 240 days long.
_If_ this really needs dealing with, then why is extending this from
240+ to 480+ days sufficient? I.e. why don't you simply
change to 64-bit jiffy values, and use time_after_eq64()?
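
A minimal user-space sketch of the wraparound in question, assuming
32-bit jiffies are modelled with uint32_t and 64-bit ones with uint64_t.
The helper names (after_eq32, after_eq64, gap_seen_32, gap_seen_64) are
made up for illustration; they mirror the signed-difference idea behind
time_after_eq() and time_after_eq64() rather than reproduce the kernel
macros.

```c
#include <stdint.h>

/* Hypothetical stand-in for time_after_eq() with 32-bit jiffies:
 * a is "at or after" b if the wrapped difference is non-negative. */
static int after_eq32(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Same comparison on 64-bit counters, in the spirit of time_after_eq64(). */
static int after_eq64(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) >= 0;
}

/* Is "now = next_credit + gap" still seen as after next_credit
 * when jiffies are only 32 bits wide? */
static int gap_seen_32(uint32_t next_credit, uint64_t gap)
{
	uint32_t now = next_credit + (uint32_t)gap;

	return after_eq32(now, next_credit);
}

/* The same question with 64-bit jiffies. */
static int gap_seen_64(uint64_t next_credit, uint64_t gap)
{
	return after_eq64(next_credit + gap, next_credit);
}
```

With a gap of 0x7fffffff jiffies the 32-bit comparison still answers
"after"; one jiffy more and it flips, while the 64-bit variant keeps
giving the expected answer.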

Jan

> Signed-off-by: Jason Luan <jianhai.luan@oracle.com>
> ---
>  drivers/net/xen-netback/netback.c |    7 +++++--
>  1 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/netback.c 
> b/drivers/net/xen-netback/netback.c
> index f3e591c..31eedaf 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -1194,8 +1194,11 @@ static bool tx_credit_exceeded(struct xenvif *vif, 
> unsigned size)
>  	if (timer_pending(&vif->credit_timeout))
>  		return true;
>  
> -	/* Passed the point where we can replenish credit? */
> -	if (time_after_eq(now, next_credit)) {
> +	/* Credit should be replenished when now does not fall into the
> +	 * range from expires to next_credit, and time_in_range_open()
> +	 * is used to verify whether this case happens.
> +	 */
> +	if (!time_in_range_open(now, vif->credit_timeout.expires, next_credit)) {
>  		vif->credit_timeout.expires = now;
>  		tx_add_credit(vif);
>  	}
> -- 
> 1.7.6.5
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 

^ permalink raw reply

* Re: [PATCH net-next] netfilter: xt_socket: use sock_gen_put()
From: Pablo Neira Ayuso @ 2013-10-17  8:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, netfilter-devel
In-Reply-To: <1381507405.4971.108.camel@edumazet-glaptop.roam.corp.google.com>

On Fri, Oct 11, 2013 at 09:03:25AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> TCP listener refactoring, part 7 :
> 
> Use sock_gen_put() instead of xt_socket_put_sk() for future
> SYN_RECV support.

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH 2/3] ipvs: avoid rcu_barrier during netns cleanup
From: Simon Horman @ 2013-10-17  8:30 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Julian Anastasov, lvs-devel, netdev, netfilter-devel,
	Wensong Zhang
In-Reply-To: <20131017081142.GA5324@localhost>

On Thu, Oct 17, 2013 at 10:11:42AM +0200, Pablo Neira Ayuso wrote:
> On Thu, Oct 17, 2013 at 09:49:39AM +0900, Simon Horman wrote:
> > On Wed, Oct 16, 2013 at 10:52:14PM +0300, Julian Anastasov wrote:
> > > 
> > > 	Hello,
> > > 
> > > On Wed, 16 Oct 2013, Pablo Neira Ayuso wrote:
> > > 
> > > > I can enqueue this fix to nf if you like. No need to resend, I can
> > > > manually apply.
> > > > 
> > > > Let me know.
> > > 
> > > 	It is not critical. I waited weeks the net tree to be
> > > copied into net-next because it collides with the recent
> > > "ipvs: make the service replacement more robust" change in
> > > net tree :) But if a rcu_barrier in the netns cleanup looks
> > > scary enough you can push it to nf. IMHO, it just adds
> > > unneeded delay there.
> > 
> > If it is not critical I would prefer for it to travel through
> > nf-next. Though I do not feel strongly about this.
> 
> Will enqueue for nf-next.
> 
> I'd appreciate if you can recover the tradition of attaching a short
> evaluation in the cover letter as I do when I send pull requests to
> David. Thanks!

Sure, will do.

^ permalink raw reply

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
From: Ingo Molnar @ 2013-10-17  8:41 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev
In-Reply-To: <20131017003421.GA31470@hmsreliant.think-freely.org>


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > > 
> > > > So, early testing results today.  I wrote a test module that, allocated a 4k
> > > > buffer, initalized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > > 
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > > 
> > > Impressive, but could you try again with data out of cache ?
> > 
> > So I tried your patch on a GRE tunnel and got following results on a
> > single TCP flow. (short result : no visible difference)
> > 
> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing accross at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of 
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.  
> To get my average results, I ran the test in a loop 10 times, averaged 
> them, and divided by 100000.
> 
> Based on these, prefetching is obviously a a good improvement, but not 
> as good as parallel execution, and the winner by far is doing both.

But in the actual usecase mentioned the packet data was likely cache-cold, 
it just arrived in the NIC and an IRQ got sent. Your testcase uses a 
super-hot 4K buffer that fits into the L1 cache. So it's apples to 
oranges.

To correctly simulate the workload you'd have to:

 - allocate a buffer larger than your L2 cache.

 - to measure the effects of the prefetches you'd also have to randomize
   the individual buffer positions. See how 'perf bench numa' implements a
   random walk via --data_rand_walk, in tools/perf/bench/numa.c.
   Otherwise the CPU might learn your simplistic stream direction and the
   L2 cache might hw-prefetch your data, interfering with any explicit 
   prefetches the code does. In many real-life usecases packet buffers are
   scattered.
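
A rough user-space sketch of the two points above, assuming a 32 MiB
pool is enough to exceed the L2 cache on the test machine. All names
here (POOL_SIZE, shuffle_order, toy_csum, walk_pool) are hypothetical,
toy_csum() is only a placeholder for csum_partial(), and this is not
how 'perf bench numa' implements its random walk.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096u                 /* one checksummed buffer */
#define POOL_SIZE  (32u * 1024 * 1024)   /* assumed to be larger than L2 */
#define NCHUNKS    (POOL_SIZE / CHUNK_SIZE)

/* Fill order[] with 0..n-1, then shuffle it (Fisher-Yates). */
static void shuffle_order(size_t *order, size_t n, unsigned int seed)
{
	size_t i;

	srand(seed);
	for (i = 0; i < n; i++)
		order[i] = i;
	for (i = n; i > 1; i--) {
		size_t j = (size_t)rand() % i;
		size_t tmp = order[i - 1];

		order[i - 1] = order[j];
		order[j] = tmp;
	}
}

/* Placeholder for csum_partial(): plain sum of little-endian 16-bit words. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] | ((uint32_t)buf[i + 1] << 8);
	return sum;
}

/* Checksum each chunk of the pool once, in the shuffled order, so the
 * hardware prefetcher cannot simply follow a linear stream. */
static uint32_t walk_pool(const uint8_t *pool, const size_t *order, size_t n)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += toy_csum(pool + order[i] * CHUNK_SIZE, CHUNK_SIZE);
	return total;
}
```

A caller would allocate POOL_SIZE bytes and an order array of NCHUNKS
entries, fill the pool with random data, and time walk_pool() rather
than a loop over a single hot 4K buffer.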

Also, it would be nice to see standard deviation noise numbers when two 
averages are close to each other, to be able to tell whether differences 
are statistically significant or not.

For example 'perf stat --repeat' will output stddev for you:

  comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'

   Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):

       0.189084480 seconds time elapsed                                          ( +- 11.95% )

The last '+-' percentage is the noise of the measurement.
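
For numbers collected by hand, a similar figure can be computed
directly. The sketch below takes the "Base results" runs quoted above
and derives their mean, sample standard deviation and relative spread.
The helper names are made up, the square root uses Newton's method only
to avoid depending on libm, and the percentage is in the same spirit as
the perf output rather than a claim about how perf computes it.

```c
#include <stddef.h>

/* Square root by Newton's method, so the sketch needs no libm. */
static double sqrt_newton(double x)
{
	double r;
	int i;

	if (x <= 0.0)
		return 0.0;
	r = x > 1.0 ? x : 1.0;
	for (i = 0; i < 64; i++)
		r = 0.5 * (r + x / r);
	return r;
}

static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / n : 0.0;
}

/* Sample standard deviation (n - 1 in the denominator). */
static double stddev(const double *v, size_t n)
{
	double m = mean(v, n), acc = 0.0;
	size_t i;

	if (n < 2)
		return 0.0;
	for (i = 0; i < n; i++)
		acc += (v[i] - m) * (v[i] - m);
	return sqrt_newton(acc / (n - 1));
}

/* Spread relative to the mean, in percent. */
static double rel_noise_pct(const double *v, size_t n)
{
	double m = mean(v, n);

	return m != 0.0 ? 100.0 * stddev(v, n) / m : 0.0;
}

/* "Base results" runs from the thread: ns per 100000 csum_partial() calls. */
static const double base_runs[10] = {
	89417240, 85170397, 85208407, 89422794, 91645494,
	103655144, 86063791, 75647774, 83502921, 85847372,
};
```

Dividing mean(base_runs, 10) by 100000 gives the per-call average in
nanoseconds; rel_noise_pct() on each set of runs would show whether two
close averages, such as the last two sets, are actually
distinguishable.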

Also note that you can inspect many cache behavior details of your 
algorithm via perf stat - the -ddd option will give you a laundry list:

  aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
  ...

     Total time: 0.095 [sec]

 Performance counter stats for 'perf bench sched messaging' (20 runs):

       1519.128721 task-clock (msec)         #   12.305 CPUs utilized            ( +-  0.34% )
            22,882 context-switches          #    0.015 M/sec                    ( +-  2.84% )
             3,927 cpu-migrations            #    0.003 M/sec                    ( +-  2.74% )
            16,616 page-faults               #    0.011 M/sec                    ( +-  0.17% )
     2,327,978,366 cycles                    #    1.532 GHz                      ( +-  1.61% ) [36.43%]
     1,715,561,189 stalled-cycles-frontend   #   73.69% frontend cycles idle     ( +-  1.76% ) [38.05%]
       715,715,454 stalled-cycles-backend    #   30.74% backend  cycles idle     ( +-  2.25% ) [39.85%]
     1,253,106,346 instructions              #    0.54  insns per cycle        
                                             #    1.37  stalled cycles per insn  ( +-  1.71% ) [49.68%]
       241,181,126 branches                  #  158.763 M/sec                    ( +-  1.43% ) [47.83%]
         4,232,053 branch-misses             #    1.75% of all branches          ( +-  1.23% ) [48.63%]
       431,907,354 L1-dcache-loads           #  284.313 M/sec                    ( +-  1.00% ) [48.37%]
        20,550,528 L1-dcache-load-misses     #    4.76% of all L1-dcache hits    ( +-  0.82% ) [47.61%]
         7,435,847 LLC-loads                 #    4.895 M/sec                    ( +-  0.94% ) [36.11%]
         2,419,201 LLC-load-misses           #   32.53% of all LL-cache hits     ( +-  2.93% ) [ 7.33%]
       448,638,547 L1-icache-loads           #  295.326 M/sec                    ( +-  2.43% ) [21.75%]
        22,066,490 L1-icache-load-misses     #    4.92% of all L1-icache hits    ( +-  2.54% ) [30.66%]
       475,557,948 dTLB-loads                #  313.047 M/sec                    ( +-  1.96% ) [37.96%]
         6,741,523 dTLB-load-misses          #    1.42% of all dTLB cache hits   ( +-  2.38% ) [37.05%]
     1,268,628,660 iTLB-loads                #  835.103 M/sec                    ( +-  1.75% ) [36.45%]
            74,192 iTLB-load-misses          #    0.01% of all iTLB cache hits   ( +-  2.88% ) [36.19%]
         4,466,526 L1-dcache-prefetches      #    2.940 M/sec                    ( +-  1.61% ) [36.17%]
         2,396,311 L1-dcache-prefetch-misses #    1.577 M/sec                    ( +-  1.55% ) [35.71%]

       0.123459566 seconds time elapsed                                          ( +-  0.58% )

There's also a number of prefetch counters that might be useful:

 aldebaran:~> perf list | grep prefetch
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

Thanks,

	Ingo

^ permalink raw reply

* RE: [Xen-devel] [PATCH net-next v4 0/5] xen-netback: IPv6 offload support
From: Paul Durrant @ 2013-10-17  8:42 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel@lists.xen.org, netdev@vger.kernel.org
In-Reply-To: <1381942838.30409.26.camel@kazak.uk.xensource.com>

> -----Original Message-----
> From: Ian Campbell
> Sent: 16 October 2013 18:01
> To: Paul Durrant
> Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org
> Subject: Re: [Xen-devel] [PATCH net-next v4 0/5] xen-netback: IPv6 offload
> support
> 
> On Wed, 2013-10-16 at 17:53 +0100, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Ian Campbell
> > > Sent: 16 October 2013 17:20
> > > To: Paul Durrant
> > > Cc: xen-devel@lists.xen.org; netdev@vger.kernel.org
> > > Subject: Re: [Xen-devel] [PATCH net-next v4 0/5] xen-netback: IPv6
> offload
> > > support
> > >
> > > On Fri, 2013-10-11 at 16:06 +0100, Paul Durrant wrote:
> > > > This patch series adds support for checksum and large packet offloads
> into
> > > > xen-netback.
> > > > Testing has mainly been done using the Microsoft network hardware
> > > > certification suite running in Server 2008R2 VMs with Citrix PV
> frontends.
> > >
> > > Are there any Linux netfront patches in existence/the pipeline to take
> > > advantage of this?
> > >
> >
> > I was waiting for the backend patches to be accepted first ;-)
> 
> I think it would be useful to get et least an RFC so others can try it
> etc.
> 

Well, everyone can build and use the Windows frontend :-) (https://github.com/xenserver/win-xenvif/tree/upstream)

I'll try to hack up something in xen-netfront soon.

  Paul

^ permalink raw reply

* Re: [Xen-devel] [PATCH net] xen-netback: add the scenario which now beyond the range time_after_eq().
From: jianhai luan @ 2013-10-17  9:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: david.vrabel, ian.campbell, wei.liu2, xen-devel, annie.li, netdev
In-Reply-To: <525FBB4F02000078000FBB30@nat28.tlf.novell.com>


On 2013-10-17 16:26, Jan Beulich wrote:
>>>> On 16.10.13 at 19:22, Jason Luan <jianhai.luan@oracle.com> wrote:
>> time_after_eq() only works if the delta is < MAX_ULONG/2.
>>
>> If netfront sends at a very low rate, the time between subsequent calls
>> to tx_credit_exceeded() may exceed MAX_ULONG/2 and the test for
>> timer_after_eq() will be incorrect.  Credit will not be replenished and
>> the guest may become unable to send (e.g., if prior to the long gap, all
>> credit was exhausted).
>>
>> We should add the scenario which now beyond next_credit+MAX_UNLONG/2. Because
>> the fact now must be not before than expire, time_before(now, expire) == true
>> will verify the scenario.
>>      time_after_eq(now, next_credit) || time_before (now, expire)
>>      ==
>>      !time_in_range_open(now, expire, next_credit)
> So first of all this must be with a 32-bit netback. And the not
> coverable gap between activity is well over 240 days long. _If_
> this really needs dealing with, then why is extending this from
> 240+ to 480+ days sufficient? I.e. why don't you simply
> change to 64-bit jiffy values, and use time_after_eq64()?

Yes, the issue can only be reproduced in a 32-bit Dom0 (going beyond
MAX_ULONG/2 on 64-bit would take an extremely long time).

I think the gap should be considered in every environment, even if this
change only extends it to 480+ days. If now falls in the gap, a timer
will be pending and credit will be replenished in time. Please run the
attached test program.

If we use time_after_eq64(), then expire, next_credit and the other
members will have to be u64.
>
> Jan
>
>> Signed-off-by: Jason Luan <jianhai.luan@oracle.com>
>> ---
>>   drivers/net/xen-netback/netback.c |    7 +++++--
>>   1 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/xen-netback/netback.c
>> b/drivers/net/xen-netback/netback.c
>> index f3e591c..31eedaf 100644
>> --- a/drivers/net/xen-netback/netback.c
>> +++ b/drivers/net/xen-netback/netback.c
>> @@ -1194,8 +1194,11 @@ static bool tx_credit_exceeded(struct xenvif *vif,
>> unsigned size)
>>   	if (timer_pending(&vif->credit_timeout))
>>   		return true;
>>   
>> -	/* Passed the point where we can replenish credit? */
>> -	if (time_after_eq(now, next_credit)) {
>> +	/* Credit should be replenished when now does not fall into the
>> +	 * range from expires to next_credit, and time_in_range_open()
>> +	 * is used to verify whether this case happens.
>> +	 */
>> +	if (!time_in_range_open(now, vif->credit_timeout.expires, next_credit)) {
>>   		vif->credit_timeout.expires = now;
>>   		tx_add_credit(vif);
>>   	}
>> -- 
>> 1.7.6.5
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>
>

^ permalink raw reply

* Re: [Xen-devel] [PATCH net] xen-netback: add the scenario which now beyond the range time_after_eq().
From: jianhai luan @ 2013-10-17  9:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: david.vrabel, ian.campbell, wei.liu2, xen-devel, annie.li, netdev
In-Reply-To: <525FA79F.8060601@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 2985 bytes --]


On 2013-10-17 17:02, jianhai luan wrote:
>
> On 2013-10-17 16:26, Jan Beulich wrote:
>>>>> On 16.10.13 at 19:22, Jason Luan <jianhai.luan@oracle.com> wrote:
>>> time_after_eq() only works if the delta is < MAX_ULONG/2.
>>>
>>> If netfront sends at a very low rate, the time between subsequent calls
>>> to tx_credit_exceeded() may exceed MAX_ULONG/2 and the test for
>>> timer_after_eq() will be incorrect.  Credit will not be replenished and
>>> the guest may become unable to send (e.g., if prior to the long gap, 
>>> all
>>> credit was exhausted).
>>>
>>> We should add the scenario which now beyond 
>>> next_credit+MAX_UNLONG/2. Because
>>> the fact now must be not before than expire, time_before(now, 
>>> expire) == true
>>> will verify the scenario.
>>>      time_after_eq(now, next_credit) || time_before (now, expire)
>>>      ==
>>>      !time_in_range_open(now, expire, next_credit)
>> So first of all this must be with a 32-bit netback. And the not
>> coverable gap between activity is well over 240 days long. _If_
>> this really needs dealing with, then why is extending this from
>> 240+ to 480+ days sufficient? I.e. why don't you simply
>> change to 64-bit jiffy values, and use time_after_eq64()?
>
> Yes, the issue only can be  reproduced in 32-bit Dom0 (Beyond 
> MAX_ULONG/2 in 64-bit will need long long time)
>
> I think the gap should be think all environment even now extending 
> 480+. if now fall in the gap,  one timer will be pending and replenish 
> will be in time.  Please run the attachment test program.
>

Sorry for missing the attachment in my previous mail. Please see the
attached file.
> If use time_after_eq64(), expire ,next_credit and other member will 
> must be u64.
>>
>> Jan
>>
>>> Signed-off-by: Jason Luan <jianhai.luan@oracle.com>
>>> ---
>>>   drivers/net/xen-netback/netback.c |    7 +++++--
>>>   1 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/xen-netback/netback.c
>>> b/drivers/net/xen-netback/netback.c
>>> index f3e591c..31eedaf 100644
>>> --- a/drivers/net/xen-netback/netback.c
>>> +++ b/drivers/net/xen-netback/netback.c
>>> @@ -1194,8 +1194,11 @@ static bool tx_credit_exceeded(struct xenvif 
>>> *vif,
>>> unsigned size)
>>>       if (timer_pending(&vif->credit_timeout))
>>>           return true;
>>>   -    /* Passed the point where we can replenish credit? */
>>> -    if (time_after_eq(now, next_credit)) {
>>> +    /* Credit should be replenished when now does not fall into the
>>> +     * range from expires to next_credit, and time_in_range_open()
>>> +     * is used to verify whether this case happens.
>>> +     */
>>> +    if (!time_in_range_open(now, vif->credit_timeout.expires, 
>>> next_credit)) {
>>>           vif->credit_timeout.expires = now;
>>>           tx_add_credit(vif);
>>>       }
>>> -- 
>>> 1.7.6.5
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xen.org
>>> http://lists.xen.org/xen-devel
>>
>>
>


[-- Attachment #2: main.c --]
[-- Type: text/plain, Size: 961 bytes --]

#include <stdio.h>

#define typecheck(type,x) \
({	type __dummy; \
	typeof(x) __dummy2; \
	(void)(&__dummy == &__dummy2); \
	1; \
})

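/*
 * 8-bit model of the kernel's jiffies comparison macros. The difference
 * is cast to signed char so that wraparound shows up after 256 ticks.
 * (Plain char may be unsigned on some ABIs, which would break the test.)
 */
#define time_after(a, b)		\
	(typecheck(unsigned char, a) && \
	 typecheck(unsigned char, b) && \
	 ((signed char)((b) - (a)) < 0))
#define time_before(a,b)	time_after(b,a)

#define time_after_eq(a,b)		\
	(typecheck(unsigned char, a) && \
	 typecheck(unsigned char, b) && \
	 ((signed char)((a) - (b)) >= 0))
#define time_before_eq(a, b) time_after_eq(b,a)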

void do_nothing()
{
	return;
}

int main()
{
	unsigned char expire, now, next;
	unsigned char delta = 10;
	int i, j;

	for(i = 0; i < 256; i++) {
		expire = i;
		next = expire + delta;

		printf("\n\n\n[%u ... %u]\n", expire, next);
		now = expire;
		for(j=0; j < 1024; j++, now++) {	
			if(j%256 == 0) printf("\n");

			if (time_after_eq(now, next) ||
				time_before(now, expire)) {
				do_nothing();
			}
			else {
				printf("    now=%d\n", (char)now);
			}
		}
	}
	
	return 0;
}

^ permalink raw reply

* Re: [Xen-devel] [PATCH net] xen-netback: add the scenario which now beyond the range time_after_eq().
From: David Vrabel @ 2013-10-17  9:15 UTC (permalink / raw)
  To: jianhai luan
  Cc: Jan Beulich, ian.campbell, wei.liu2, xen-devel, annie.li, netdev
In-Reply-To: <525FA79F.8060601@oracle.com>

On 17/10/13 10:02, jianhai luan wrote:
> 
> On 2013-10-17 16:26, Jan Beulich wrote:
>>>>> On 16.10.13 at 19:22, Jason Luan <jianhai.luan@oracle.com> wrote:
>>> time_after_eq() only works if the delta is < MAX_ULONG/2.
>>>
>>> If netfront sends at a very low rate, the time between subsequent calls
>>> to tx_credit_exceeded() may exceed MAX_ULONG/2 and the test for
>>> timer_after_eq() will be incorrect.  Credit will not be replenished and
>>> the guest may become unable to send (e.g., if prior to the long gap, all
>>> credit was exhausted).
>>>
>>> We should add the scenario which now beyond next_credit+MAX_UNLONG/2.
>>> Because
>>> the fact now must be not before than expire, time_before(now, expire)
>>> == true
>>> will verify the scenario.
>>>      time_after_eq(now, next_credit) || time_before (now, expire)
>>>      ==
>>>      !time_in_range_open(now, expire, next_credit)
>> So first of all this must be with a 32-bit netback. And the not
>> coverable gap between activity is well over 240 days long. _If_
>> this really needs dealing with, then why is extending this from
>> 240+ to 480+ days sufficient? I.e. why don't you simply
>> change to 64-bit jiffy values, and use time_after_eq64()?
> 
> Yes, the issue only can be  reproduced in 32-bit Dom0 (Beyond
> MAX_ULONG/2 in 64-bit will need long long time)
> 
> I think the gap should be think all environment even now extending 480+.
> if now fall in the gap,  one timer will be pending and replenish will be
> in time.  Please run the attachment test program.
> 
> If use time_after_eq64(), expire ,next_credit and other member will must
> be u64.

Yes, you'll need to store next_credit as a u64 in vif instead of
calculating it in tx_credit_exceeded from expires (which is only an
unsigned long).

David

^ permalink raw reply

* Re: [Xen-devel] [PATCH net] xen-netback: add the scenario which now beyond the range time_after_eq().
From: Jan Beulich @ 2013-10-17  9:26 UTC (permalink / raw)
  To: jianhai luan
  Cc: david.vrabel, ian.campbell, wei.liu2, xen-devel, annie.li, netdev
In-Reply-To: <525FA79F.8060601@oracle.com>

>>> On 17.10.13 at 11:02, jianhai luan <jianhai.luan@oracle.com> wrote:
> On 2013-10-17 16:26, Jan Beulich wrote:
>>>>> On 16.10.13 at 19:22, Jason Luan <jianhai.luan@oracle.com> wrote:
>>> time_after_eq() only works if the delta is < MAX_ULONG/2.
>>>
>>> If netfront sends at a very low rate, the time between subsequent calls
>>> to tx_credit_exceeded() may exceed MAX_ULONG/2 and the test for
>>> timer_after_eq() will be incorrect.  Credit will not be replenished and
>>> the guest may become unable to send (e.g., if prior to the long gap, all
>>> credit was exhausted).
>>>
>>> We should add the scenario which now beyond next_credit+MAX_UNLONG/2. 
> Because
>>> the fact now must be not before than expire, time_before(now, expire) == 
> true
>>> will verify the scenario.
>>>      time_after_eq(now, next_credit) || time_before (now, expire)
>>>      ==
>>>      !time_in_range_open(now, expire, next_credit)
>> So first of all this must be with a 32-bit netback. And the not
>> coverable gap between activity is well over 240 days long. _If_
>> this really needs dealing with, then why is extending this from
>> 240+ to 480+ days sufficient? I.e. why don't you simply
>> change to 64-bit jiffy values, and use time_after_eq64()?
> 
> Yes, the issue only can be  reproduced in 32-bit Dom0 (Beyond 
> MAX_ULONG/2 in 64-bit will need long long time)
> 
> I think the gap should be think all environment even now extending 480+. 
> if now fall in the gap,  one timer will be pending and replenish will be 
> in time.  Please run the attachment test program.

Not sure what this is supposed to tell me. I recognize that there
are overflow conditions not handled properly, but (a) I have a
hard time thinking of a sensible guest that sits idle for over 240
days (host uptime usually isn't even coming close to that due to
maintenance requirements) and (b) if there is such a sensible
guest, then I can't see why dealing with one being idle for over
480 days should be required too.

> If use time_after_eq64(), expire ,next_credit and other member will must 
> be u64.

Exactly - that's what I was telling you to do.

Jan

^ permalink raw reply

* Re: [PATCH net-next] {selinux, af_key} Rework pfkey_sadb2xfrm_user_sec_ctx
From: Steffen Klassert @ 2013-10-17  9:51 UTC (permalink / raw)
  To: Fan Du; +Cc: Paul Moore, davem, netdev
In-Reply-To: <525F3EBD.80406@windriver.com>

On Thu, Oct 17, 2013 at 09:34:53AM +0800, Fan Du wrote:
> 
> 
> On 2013年10月16日 23:15, Paul Moore wrote:
> >
> >The fact that you are now changing sadb_x_sec_ctx->sadb_x_sec_len whenever
> >pfkey_sadb2xfrm_user_sec_ctx() is called raises an eyebrow.  Can you elaborate
> >on why this is not a problem?
> >
> Thanks for your attention, Paul.
> 
> sadb_x_sec_ctx is extra headers passed down from user space, the usage of
> of this data structure falls down to one of pfkey_funcs function only for
> one time, more specifically speaking, it's only used by SELINUX for security
> checking for each operation. In other words, sadb_x_sec_ctx involves with a
> one shot business here. So the original codes seems do a lots of extra job
> which could easily be avoid using casting operation.
> 

Since the selinux people have to live with that change in the first place,
I'd like to see an ack from one of the selinux maintainers before I take
it into ipsec-next. Paul?

^ permalink raw reply

* Re: [PATCH ipsec] xfrm: prevent ipcomp scratch buffer race condition
From: Steffen Klassert @ 2013-10-17  9:55 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Michal Kubecek, David S. Miller, netdev
In-Reply-To: <20131016123205.GA9982@gondor.apana.org.au>

On Wed, Oct 16, 2013 at 08:32:05PM +0800, Herbert Xu wrote:
> On Mon, Oct 14, 2013 at 06:03:34PM +0200, Michal Kubecek wrote:
> > In ipcomp_compress(), sortirq is enabled too early, allowing the
> > per-cpu scratch buffer to be rewritten by ipcomp_decompress()
> > (called on the same CPU in softirq context) between populating
> > the buffer and copying the compressed data to the skb.
> 
> Good catch.
> 
> > Add similar protection into ipcomp_decompress() as it can be
> > called from process context as well (even if such scenario seems
> > a bit artificial).
> 
> I don't think this is possible or otherwise xfrm_input will
> dead-lock.
> 

Michal, please incorporate the feedback from Herbert and Eric,
I'll take it into the ipsec tree then. Thanks!

^ permalink raw reply

* Re: [Xen-devel] [PATCH net] xen-netback: add the scenario which now beyond the range time_after_eq().
From: jianhai luan @ 2013-10-17  9:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: david.vrabel, ian.campbell, wei.liu2, xen-devel, annie.li, netdev
In-Reply-To: <525FC98002000078000FBBB5@nat28.tlf.novell.com>


On 2013-10-17 17:26, Jan Beulich wrote:
>>>> On 17.10.13 at 11:02, jianhai luan <jianhai.luan@oracle.com> wrote:
>> On 2013-10-17 16:26, Jan Beulich wrote:
>>>>>> On 16.10.13 at 19:22, Jason Luan <jianhai.luan@oracle.com> wrote:
>>>> time_after_eq() only works if the delta is < MAX_ULONG/2.
>>>>
>>>> If netfront sends at a very low rate, the time between subsequent calls
>>>> to tx_credit_exceeded() may exceed MAX_ULONG/2 and the test for
>>>> timer_after_eq() will be incorrect.  Credit will not be replenished and
>>>> the guest may become unable to send (e.g., if prior to the long gap, all
>>>> credit was exhausted).
>>>>
>>>> We should add the scenario which now beyond next_credit+MAX_UNLONG/2.
>> Because
>>>> the fact now must be not before than expire, time_before(now, expire) ==
>> true
>>>> will verify the scenario.
>>>>       time_after_eq(now, next_credit) || time_before (now, expire)
>>>>       ==
>>>>       !time_in_range_open(now, expire, next_credit)
>>> So first of all this must be with a 32-bit netback. And the not
>>> coverable gap between activity is well over 240 days long. _If_
>>> this really needs dealing with, then why is extending this from
>>> 240+ to 480+ days sufficient? I.e. why don't you simply
>>> change to 64-bit jiffy values, and use time_after_eq64()?
>> Yes, the issue only can be  reproduced in 32-bit Dom0 (Beyond
>> MAX_ULONG/2 in 64-bit will need long long time)
>>
>> I think the gap should be think all environment even now extending 480+.
>> if now fall in the gap,  one timer will be pending and replenish will be
>> in time.  Please run the attachment test program.
> Not sure what this is supposed to tell me. I recognize that there
> are overflow conditions not handled properly, but (a) I have a
> hard time thinking of a sensible guest that sits idle for over 240
> days (host uptime usually isn't even coming close to that due to
> maintenance requirements) and (b) if there is such a sensible
> guest, then I can't see why dealing with one being idle for over
> 480 days should be required too.

The issue can be reproduced when now is beyond MAX_ULONG/2 (if the guest
sends very few packets).
The time needed for jiffies to go beyond MAX_ULONG/2 is:
     HZ         days
    100        248.55        (((0xffffffff/2)/HZ)/3600)/24
    250         99.42        (((0xffffffff/2)/HZ)/3600)/24
   1000         24.86        (((0xffffffff/2)/HZ)/3600)/24

Because we use HZ=250, the issue was found once uptime exceeded 100 days.
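
The figures in the table can be rechecked with a few lines of C. This
is only a sketch of the arithmetic, assuming a 32-bit jiffies counter;
the function name is made up.

```c
/* Days until a 32-bit jiffies counter has advanced by half its range
 * (0xffffffff / 2 ticks) at the given tick rate. */
static double days_to_half_wrap(unsigned int hz)
{
	const double half_range = 0xffffffffu / 2;	/* 2147483647 ticks */

	return half_range / hz / 3600.0 / 24.0;
}
```

Calling it with 100, 250 and 1000 gives roughly 248.55, 99.42 and 24.86
days respectively.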

Jason
>> If use time_after_eq64(), expire ,next_credit and other member will must
>> be u64.
> Exactly - that's what I was telling you to do.
>
> Jan
>

^ permalink raw reply

* Re: [PATCH] WAN: Adding support for Infineon PEF2256 E1 chipset
From: Mark Rutland @ 2013-10-17 10:17 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: rob.herring@calxeda.com, Pawel Moll, Stephen Warren, Ian Campbell,
	Rob Landley, grant.likely@linaro.org, Krzysztof Halasa,
	devicetree@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	jerome.chantelauze@c-s.fr
In-Reply-To: <201310161525.r9GFPZI5006238@localhost.localdomain>

On Wed, Oct 16, 2013 at 04:25:35PM +0100, Christophe Leroy wrote:
> The patch adds WAN support for Infineon PEF2256 E1 Chipset.
> 
> Signed-off-by: Jerome Chantelauze <jerome.chantelauze@c-s.fr>
> Acked-by: Christophe Leroy <christophe.leroy@c-s.fr>

> +static ssize_t fs_attr_mode_store(struct device *dev,
> +                       struct device_attribute *attr,  const char *buf,
> +                       size_t count)
> +{
> +       struct net_device *ndev = dev_get_drvdata(dev);
> +       struct pef2256_dev_priv *priv = dev_to_hdlc(ndev)->priv;
> +       u32 value;
> +       int ret = kstrtol(buf, 10, (long int *)&value);

u32 is not the same as long int.
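
To spell the problem out: on a 64-bit kernel a long is 8 bytes, so
writing the result through a cast u32 pointer stores 8 bytes into a
4-byte variable. In the kernel, kstrtou32() would be the closer fit.
The user-space sketch below shows the same idea, parsing into a wide
temporary and range-checking before assigning; parse_u32() is a
hypothetical helper, not a kernel function.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper in the kstrto*() calling style: returns 0 or a
 * negative errno, and writes *res only on success. */
static int parse_u32(const char *s, int base, uint32_t *res)
{
	unsigned long long tmp;
	char *end;

	/* strtoull() would silently accept these, so reject them first. */
	if (s[0] == '-' || isspace((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	tmp = strtoull(s, &end, base);
	if (end == s)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')		/* sysfs writes often end in a newline */
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (errno == ERANGE || tmp > UINT32_MAX)
		return -ERANGE;		/* does not fit in 32 bits */

	*res = (uint32_t)tmp;
	return 0;
}
```

The point of the temporary is that the destination is only ever written
with a value known to fit, whatever sizeof(long) happens to be.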

> +       int reconfigure = (value != priv->mode);
> +
> +       if (ret != 0)
> +               return ret;
> +
> +       if (value != MASTER_MODE && value != SLAVE_MODE)
> +               return -EINVAL;
> +
> +       priv->mode = value;
> +       if (reconfigure && priv->init_done) {
> +               pef2256_close(ndev);
> +               init_FALC(priv);
> +               pef2256_open(ndev);
> +       }
> +
> +       return count;

What if count is not the number of characters read?

[...]

> +
> +       /* TS 0 is reserved */
> +       if (value & 0x80000000)
> +               return -EINVAL;

Magic numbers should be turned into constants.
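
For instance, something along these lines. The macro and function names
are made up for illustration; only the bit position and the "TS 0 is
reserved" meaning come from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: bit 31 of the mask stands for time slot 0. */
#define PEF2256_TS0_RESERVED  (UINT32_C(1) << 31)

/* A time slot mask is acceptable only if it leaves TS 0 alone. */
bool ts_mask_is_valid(uint32_t mask)
{
    return (mask & PEF2256_TS0_RESERVED) == 0;
}
```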

> +static ssize_t fs_attr_Rx_TS_show(struct device *dev,
> +                       struct device_attribute *attr, char *buf)
> +{
> +       struct net_device *ndev = dev_get_drvdata(dev);
> +       struct pef2256_dev_priv *priv = dev_to_hdlc(ndev)->priv;
> +
> +       return sprintf(buf, "0x%08x\n", priv->Rx_TS);
> +}
> +
> +
> +static ssize_t fs_attr_Rx_TS_store(struct device *dev,
> +                       struct device_attribute *attr,  const char *buf,
> +                       size_t count)
> +{
> +       struct net_device *ndev = dev_get_drvdata(dev);
> +       struct pef2256_dev_priv *priv = dev_to_hdlc(ndev)->priv;
> +       u32 value;
> +       int ret = kstrtol(buf, 10, (long int *)&value);

I'm not sure what the rules are regarding this, but why do we show this
in hexadecimal but read it in decimal?
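
To illustrate the mismatch, here is a small userspace model (the helper
name is hypothetical) that formats a value the way the show handler does
and parses the text back with a given base:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Format as the show() handler does, then parse the text back. */
uint32_t show_then_store(uint32_t value, int base)
{
    char buf[16];

    snprintf(buf, sizeof(buf), "0x%08x\n", (unsigned int)value);
    /* Base 0 auto-detects the "0x" prefix; base 10 stops at the 'x'. */
    return (uint32_t)strtoul(buf, NULL, base);
}
```

With base 0 the value survives the round trip. With base 10, strtoul()
stops at the 'x' and yields 0; as far as I know the kernel's kstrto*
helpers are stricter and reject the string outright. Either way, what the
file shows cannot be written back unchanged.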

[...]

> +int Config_HDLC(struct pef2256_dev_priv *priv)
> +{
> +       int i;
> +       int TS_idx;
> +       struct pef2256_regs *base_addr;

That sounds suspicious. Using structs for the offsets of registers isn't
very portable...

It would be preferable to #define the offsets.

> +       u8 dummy;
> +
> +       /* Set framer E1 address */
> +       base_addr = (struct pef2256_regs *)priv->base_addr;

That looks even more suspicious...

> +
> +       /* Read to remove pending IT */
> +       dummy = base_addr->ISR0;
> +       dummy = base_addr->ISR1;

You should use MMIO accessors here (readl, writel, etc). You have no
idea how the compiler may reorganise, coalesce or throw away accesses,
nor how those accesses will be made. Additionally, without the requisite
barriers you have no guarantee the CPU won't reorder these accesses.

The compiler is within its rights here to throw away these accesses as
the results are never used. This is broken.

With some constants for the register offsets, this would be:

readb(base_addr + REG_ISR0);
readb(base_addr + REG_ISR1);

Which won't be reordered or thrown away by either the compiler or CPU.
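
As a rough userspace model of why the accessor matters (this is not the
kernel implementation; model_readb() and clear_pending() are made-up
names and the offsets are placeholders, not datasheet values):

```c
#include <stdint.h>

/* Placeholder offsets: real values must come from the PEF2256 datasheet. */
#define REG_ISR0 0x00
#define REG_ISR1 0x01

/*
 * Stand-in for readb(): the volatile-qualified access obliges the compiler
 * to perform the load even when the result is discarded. The kernel's
 * readb() additionally takes care of architecture-specific ordering.
 */
uint8_t model_readb(const volatile void *addr)
{
    return *(const volatile uint8_t *)addr;
}

/* Read both status registers purely for the side effect of the read. */
void clear_pending(const volatile uint8_t *base)
{
    (void)model_readb(base + REG_ISR0);
    (void)model_readb(base + REG_ISR1);
}
```

No dummy variable is needed, and the intent (read for side effect) is
explicit at the call site.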

> +
> +       /* Mask HDLC 1 Transmit IT */
> +       base_addr->IMR1 |= 1;
> +       base_addr->IMR1 |= 1 << 4;
> +       base_addr->IMR1 |= 1 << 5;
> +
> +       /* Mask HDLC 1 Receive IT */
> +       base_addr->IMR0 |= 1;
> +       base_addr->IMR0 |= 1 << 7;
> +       base_addr->IMR1 |= 1 << 6;
> +
> +       udelay((2 * 32) * 125);

Why the udelay, and how was the delay period (2 * 32 * 125) derived?

Is this to account for the lack of barriers, or does the hardware have a
requirement that there's a delay? 

If the former, please fix. If the latter, please comment the udelay to
make this clear.

> +
> +       /* MODE.HRAC = 0 (Receiver inactive)
> +          MODE.DIV = 0 (Data normal operation)
> +          for FALC V2.2 : MODE.HDLCI = 0 (normal operation) */
> +       /* MODE.MDS2:0 = 100 (No address comparison) */
> +       /* MODE.HRAC = 1 (Receiver active) */
> +       out_8(&(base_addr->MODE), 1 << 3);

Why are you using an MMIO accessor here but not elsewhere?

Not all architectures seem to have out_8, but I think iowrite8/writeb
will work (though I'm not sure what the intended difference between
writeb and out_8 is).

> +       /* CCR1.EITS = 1 (Enable internal Time Slot 31:0 Signaling)
> +          CCR1.XMFA = 0 (No transmit multiframe alignment)
> +          CCR1.RFT1:0 = 00 (RFIFO sur 32 bytes) */
> +       /* setting up Interframe Time Fill */
> +       /* CCR1.ITF = 1 (Interframe Time Fill Continuous flag) */
> +       out_8(&(base_addr->CCR1), 0x10 | (1 << 3));
> +       /* CCR2.XCRC = 0 (Transmit CRC ON)
> +          CCR2.RCRC = 0 (Receive CRC ON, no write in RFIFO)
> +          CCR2.RADD = 0 (No write address in RFIFO) */
> +       out_8(&(base_addr->CCR2), 0x00);
> +
> +       udelay((2 * 32) * 125);

Please explain all udelay instances.

[...]

> +                               setbits8(&(base_addr->TTR1), 1 << i);

I'm not aware of a generic equivalent to setbits8, but it seems like
writeb(readb(ADDR) | bits, ADDR) would do the same.
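
Spelled out as a helper, in a userspace model where rmw_readb() and
rmw_writeb() are hypothetical stand-ins for the kernel's readb() and
writeb():

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel MMIO accessors. */
uint8_t rmw_readb(const volatile void *addr)
{
    return *(const volatile uint8_t *)addr;
}

void rmw_writeb(uint8_t val, volatile void *addr)
{
    *(volatile uint8_t *)addr = val;
}

/* Set the given bits in an 8-bit register: read, OR, write back. */
void rmw_setbits8(volatile void *addr, uint8_t bits)
{
    rmw_writeb((uint8_t)(rmw_readb(addr) | bits), addr);
}
```

Note that the read-modify-write sequence is not atomic, so callers that
can race on the same register still need their own locking.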

[...]

> +static int pef2256_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
> +{
> +       int ret;
> +
> +       ret = hdlc_ioctl(dev, ifr, cmd);
> +       return ret;
> +}

This seems a bit useless -- can't you just assign hdlc_ioctl to
pef2256_ops::ndo_do_ioctl directly?

> +static const struct of_device_id pef2256_match[];
> +static int pef2256_probe(struct platform_device *ofdev)

s/ofdev/pdev -- platform_device has nothing to do with OF.

> +{
> +       const struct of_device_id *match;
> +       struct pef2256_dev_priv *priv;
> +       int ret = -ENOMEM;
> +       struct net_device *netdev;
> +       hdlc_device *hdlc;
> +       int sys_ret;
> +       struct pef2256_regs *base_addr;
> +       struct device_node *np = (&ofdev->dev)->of_node;
> +       const u32 *data;
> +       int len;
> +
> +       match = of_match_device(pef2256_match, &ofdev->dev);
> +       if (!match)
> +               return -EINVAL;

Why not:

if (!pdev->dev.of_node)
	return -EINVAL;

You shouldn't have an of_node unless one of your compatible strings
matched, and this way you don't have to iterate over the list again.

> +
> +       dev_err(&ofdev->dev, "Found PEF2256\n");
> +
> +       priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +       if (!priv)
> +               return ret;
> +
> +       priv->dev = &ofdev->dev;
> +
> +       data = of_get_property(np, "data-rate", &len);
> +       if (!data || len != 4) {

Use of_property_read_u32.
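
One reason to prefer the helper: property cells are stored big-endian in
the device tree, and of_property_read_u32 does the length check and the
byte-order conversion for you. A raw *data dereference only yields the
right value on a big-endian CPU. The conversion amounts to the following
(userspace sketch, dt_cell_to_cpu() is a made-up name):

```c
#include <stdint.h>

/*
 * Decode one 32-bit device tree cell. The bytes are combined explicitly,
 * so the result is the same on little- and big-endian hosts.
 */
uint32_t dt_cell_to_cpu(const uint8_t cell[4])
{
    return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
           ((uint32_t)cell[2] << 8)  |  (uint32_t)cell[3];
}
```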

> +               dev_err(&ofdev->dev, "failed to read data-rate -> using 8Mb\n");
> +               priv->data_rate = DATA_RATE_8M;
> +       } else
> +               priv->data_rate = *data;
> +
> +       data = of_get_property(np, "channel-phase", &len);
> +       if (!data || len != 4) {

Use of_property_read_u32.

> +               dev_err(&ofdev->dev, "failed to read channel phase -> using 0\n");
> +               priv->channel_phase = CHANNEL_PHASE_0;
> +       } else
> +               priv->channel_phase = *data;
> +
> +       data = of_get_property(np, "rising-edge-sync-pulse", NULL);

Use of_property_read_string.

> +       if (!data) {
> +               dev_err(&ofdev->dev, "failed to read rising edge sync pulse -> using \"transmit\"\n");
> +               strcpy(priv->rising_edge_sync_pulse, "transmit");
> +       } else if (strcmp((char *)data, "transmit") &&
> +                       strcmp((char *)data, "receive")) {
> +               dev_err(&ofdev->dev, "invalid rising edge sync pulse -> using \"transmit\"\n");
> +               strcpy(priv->rising_edge_sync_pulse, "transmit");
> +       } else
> +               strncpy(priv->rising_edge_sync_pulse, (char *)data, 10);
> +
> +       priv->irq = of_irq_to_resource(np, 0, NULL);
> +       if (!priv->irq) {
> +               dev_err(priv->dev, "no irq defined\n");
> +               return -EINVAL;
> +       }

The irq will have already been parsed, and will be in your
platform_device's set of resources. You can use platform_get_irq to get
at it rather than getting the of_ code to attempt to map it again.

Why are you storing the IRQ resource, rather than the irq itself? Surely
the irq number is easier to deal with?

> +       netdev = alloc_hdlcdev(priv);
> +       if (!netdev) {
> +               ret = -ENOMEM;
> +               return ret;

You leak priv and the priv->base_addr mapping here.

[...]

> +       ret = register_hdlc_device(netdev);
> +       if (ret < 0) {
> +               pr_err("unable to register\n");
> +               return ret;

You leak the priv, priv->base_addr, and netdev here.

> +       }
> +
> +       sys_ret = 0;
> +       sys_ret |= device_create_file(priv->dev, &dev_attr_mode);
> +       sys_ret |= device_create_file(priv->dev, &dev_attr_Tx_TS);
> +       sys_ret |= device_create_file(priv->dev, &dev_attr_Rx_TS);
> +       sys_ret |= device_create_file(priv->dev, &dev_attr_regs);

Huh? Can't any of these fail individually?

> +
> +       if (sys_ret) {
> +               device_remove_file(priv->dev, &dev_attr_mode);

What about the other files?

> +               unregister_hdlc_device(priv->netdev);
> +               free_netdev(priv->netdev);

What about priv and priv->base_addr?

Why is there not a return here? We'll fall out to the main body and
return 0, as if everything's OK...
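
The usual shape for this is a chain of goto labels that unwinds in
reverse order of acquisition. A self-contained userspace model follows;
acquire() and release() are hypothetical stand-ins for the real
allocation, mapping and registration calls, and fail_step selects which
step fails:

```c
#include <errno.h>

static int live;    /* number of resources currently held */

/* Pretend to acquire a resource; step == fail_step simulates failure. */
static int acquire(int step, int fail_step)
{
    if (step == fail_step)
        return -ENOMEM;
    live++;
    return 0;
}

static void release(void)
{
    live--;
}

int live_resources(void)
{
    return live;
}

/* Model of a probe function with four acquisition steps. */
int probe_model(int fail_step)
{
    int ret;

    ret = acquire(1, fail_step);    /* stands in for allocating priv */
    if (ret)
        return ret;
    ret = acquire(2, fail_step);    /* mapping the registers */
    if (ret)
        goto free_priv;
    ret = acquire(3, fail_step);    /* allocating the netdev */
    if (ret)
        goto unmap;
    ret = acquire(4, fail_step);    /* registering the device */
    if (ret)
        goto free_netdev;
    return 0;

free_netdev:
    release();
unmap:
    release();
free_priv:
    release();
    return ret;
}
```

Every failure path returns the error and leaves nothing held. Managed
allocations such as devm_kzalloc() would remove some of these labels.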

> +       }
> +
> +       priv->init_done = 0;
> +
> +       return 0;
> +}

[...]

> +
> +
> +/*
> + * Suppression du module
> + */
> +static int pef2256_remove(struct platform_device *ofdev)
> +{
> +       struct net_device *ndev = dev_get_drvdata(&ofdev->dev);
> +       struct pef2256_dev_priv *priv = dev_to_hdlc(ndev)->priv;
> +
> +       device_remove_file(priv->dev, &dev_attr_Rx_TS);
> +       device_remove_file(priv->dev, &dev_attr_Tx_TS);
> +       device_remove_file(priv->dev, &dev_attr_mode);
> +
> +       unregister_hdlc_device(priv->netdev);
> +       free_netdev(priv->netdev);

What about priv and priv->base_addr?

> +
> +       /* Do E1 stuff */
> +
> +       dev_set_drvdata(&ofdev->dev, NULL);
> +       kfree(ofdev);

Is that meant to be done here? Isn't that the job of the core code?

[...]

> +static int __init pef2256_init(void)
> +{
> +       int ret;
> +       ret = platform_driver_register(&pef2256_driver);
> +       return ret;
> +}
> +module_init(pef2256_init);
> +
> +
> +static void __exit pef2256_exit(void)
> +{
> +       platform_driver_unregister(&pef2256_driver);
> +}
> +module_exit(pef2256_exit);

Use module_platform_driver?

> +/* Framer E1 registers */
> +union pef2256_Fifo {
> +       u8      XFIFO[sizeof(u16)];             /* Transmit FIFO */
> +       u8      RFIFO[sizeof(u16)];             /* Receive FIFO */
> +};

Huh? Why sizeof(u16) rather than 2?

> +struct pef2256_regs {
> +       union pef2256_Fifo      FIFO;   /* 0x00/0x01    FIFO (Tx or rx) */
> +       unsigned char   CMDR;   /* 0x02 Command Register */
> +       unsigned char   MODE;   /* 0x03 Mode Register */
> +       unsigned char   RAH1;   /* 0x04 Receive Address High 1 */
> +       unsigned char   RAH2;   /* 0x05 Receive Address High 2 */
> +       unsigned char   RAL1;   /* 0x06 Receive Address Low 1 */

[...]

Please do not use structures for calculation of register offsets.

> +       unsigned short  FEC;    /* 0x50/0x51 Framing Error Counter */
> +       unsigned short  CVC;    /* 0x52/0x53 Code Violation Counter */
> +       unsigned short  CEC1;   /* 0x54/0x55 CRC Error Counter 1 */
> +       unsigned short  EBC;    /* 0x56/0x57 E-Bit Error Counter */
> +       unsigned short  CEC2;   /* 0x58/0x59 CRC Error Counter 2 */
> +       unsigned short  CEC3;   /* 0x5A/0x5B CRC Error Counter 3 */

These may not be the size you expect.
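
unsigned short is only guaranteed to be at least 16 bits wide, and
padding is up to the ABI. With explicit offsets and byte-wide reads
neither matters. A userspace sketch: the offsets are the ones given in
the struct comments, while the low-byte-first order and the helper name
are assumptions to be checked against the datasheet:

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets taken from the comments in the quoted struct. */
#define REG_FEC  0x50    /* Framing Error Counter */
#define REG_CVC  0x52    /* Code Violation Counter */

/* Assemble a 16-bit counter from two 8-bit register reads. */
uint16_t read_counter16(const volatile uint8_t *base, size_t off)
{
    uint8_t lo = base[off];        /* assumed: low byte at lower address */
    uint8_t hi = base[off + 1];

    return (uint16_t)(lo | (hi << 8));
}
```

In the driver the two loads would go through readb() rather than a plain
volatile access.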

> diff -urN a/Documentation/devicetree/bindings/net/pef2256.txt b/Documentation/devicetree/bindings/net/pef2256.txt
> --- a/Documentation/devicetree/bindings/net/pef2256.txt 1970-01-01 01:00:00.000000000 +0100
> +++ b/Documentation/devicetree/bindings/net/pef2256.txt 2013-10-13 15:05:42.000000000 +0200
> @@ -0,0 +1,29 @@
> +* Wan on Infineon pef2256 E1 controller

A brief description would be helpful. Is there any publicly available
documentation?

> +
> +Required properties:
> +- compatible: Should be "infineon,pef2256"

s/Should be/Should contain/ -- variants may exist in future.

> +- reg: Address and length of the register set for the device

Is there only the one register bank?

> +- interrupts: Should contain interrupts

How many? What do they correspond to?

> +
> +Optional properties:
> +- data-rate: Data rate on the system highway.
> +  Supported values are: 2, 4, 8, 16.
> +  8 if not defined.

What is the "system highway"? Is this configuration, or is this a
property of the device that cannot be probed?

> +- channel-phase: First time slot transmission channel phase.
> +  Supported values are: 0, 1, 2, 3, 4, 5, 6, 7.
> +  0 if not defined.

Similarly?

> +- rising-edge-sync-pulse: rising edge synchronous pulse.
> +  Supported values are: "receive", "transmit".
> +  "transmit" if not defined.

I'm not sure what this means. Could you elaborate?

Thanks,
Mark.

* Re: [Xen-devel] [PATCH net] xen-netback: add the scenario which now beyond the range time_after_eq().
From: jianhai luan @ 2013-10-17 10:19 UTC (permalink / raw)
  To: David Vrabel
  Cc: Jan Beulich, ian.campbell, wei.liu2, xen-devel, annie.li, netdev
In-Reply-To: <525FAABE.5080806@citrix.com>


On 2013-10-17 17:15, David Vrabel wrote:
> On 17/10/13 10:02, jianhai luan wrote:
>> On 2013-10-17 16:26, Jan Beulich wrote:
>>>>>> On 16.10.13 at 19:22, Jason Luan <jianhai.luan@oracle.com> wrote:
>>>> time_after_eq() only works if the delta is < MAX_ULONG/2.
>>>>
>>>> If netfront sends at a very low rate, the time between subsequent calls
>>>> to tx_credit_exceeded() may exceed MAX_ULONG/2 and the test for
>>>> timer_after_eq() will be incorrect.  Credit will not be replenished and
>>>> the guest may become unable to send (e.g., if prior to the long gap, all
>>>> credit was exhausted).
>>>>
>>>> We should add the scenario which now beyond next_credit+MAX_UNLONG/2.
>>>> Because
>>>> the fact now must be not before than expire, time_before(now, expire)
>>>> == true
>>>> will verify the scenario.
>>>>       time_after_eq(now, next_credit) || time_before (now, expire)
>>>>       ==
>>>>       !time_in_range_open(now, expire, next_credit)
>>> So first of all this must be with a 32-bit netback. And the not
>>> coverable gap between activity is well over 240 days long. _If_
>>> this really needs dealing with, then why is extending this from
>>> 240+ to 480+ days sufficient? I.e. why don't you simply
>>> change to 64-bit jiffy values, and use time_after_eq64()?
>> Yes, the issue only can be  reproduced in 32-bit Dom0 (Beyond
>> MAX_ULONG/2 in 64-bit will need long long time)
>>
>> I think the gap should be think all environment even now extending 480+.
>> if now fall in the gap,  one timer will be pending and replenish will be
>> in time.  Please run the attachment test program.
>>
>> If use time_after_eq64(), expire ,next_credit and other member will must
>> be u64.
> Yes, you'll need to store next_credit as a u64 in vif instead of
> calculating it in tx_credit_exceeded from expires (which is only an
> unsigned long).

I know that.  Even if we use u64, time_after_eq() will still give the
wrong answer in theory (not in practice, because that would take an
extremely long time).
I think the two better ways to fix this are:
   - Use time_before() to detect whether now is beyond MAX_ULONG/2
   - Add another timer to check and update expire every MAX_ULONG>>2.

Because the second way can't be verified in practice (it takes too long
to wait for jiffies to increase that far), I chose the first.
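
A 32-bit userspace model of the two checks may make the first option
easier to follow. The comparison helpers mirror how the kernel's
time_after_eq() and time_before() are defined, but with fixed 32-bit
types; all names here are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * 32-bit models of the jiffies comparisons. The unsigned-to-signed
 * conversion relies on two's complement wrapping, as the kernel does.
 */
bool after_eq32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

bool before32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Existing test: replenish once now has reached next_credit. */
bool replenish_old(uint32_t now, uint32_t next_credit)
{
    return after_eq32(now, next_credit);
}

/* Proposed test: also replenish when now appears to lie before expire. */
bool replenish_new(uint32_t now, uint32_t expire, uint32_t next_credit)
{
    return after_eq32(now, next_credit) || before32(now, expire);
}
```

With expire = 1000, next_credit = 1100 and now a little more than 2^31
jiffies past next_credit, replenish_old() returns false while
replenish_new() returns true. As pointed out earlier in the thread, this
widens the window that is handled correctly rather than removing the
limit altogether.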
>
> David
