Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v2] net: remove unnecessary initializations in net_dev_init
From: David Miller @ 2014-01-22  0:45 UTC (permalink / raw)
  To: sd; +Cc: netdev
In-Reply-To: <1390069167-14080-1-git-send-email-sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Sat, 18 Jan 2014 19:19:27 +0100

> softnet_data is already set to 0, no need to use memset or initialize
> specific fields to 0 or NULL afterwards.
> 
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] bond: make slave_sysfs_ops static
From: David Miller @ 2014-01-22  0:46 UTC (permalink / raw)
  To: stephen; +Cc: fubar, vfalico, andy, sfeldma, netdev
In-Reply-To: <20140118145418.4b439f15@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Sat, 18 Jan 2014 14:54:18 -0800

> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

Applied.

^ permalink raw reply

* Re: [PATCH] net: gre: don't pull skb if dealing with icmp message
From: David Miller @ 2014-01-22  0:49 UTC (permalink / raw)
  To: duanj.fnst; +Cc: dborkman, netdev
In-Reply-To: <52DB8E4A.7050809@cn.fujitsu.com>

From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Date: Sun, 19 Jan 2014 16:35:22 +0800

> 
> When dealing with icmp messages, the skb->data points the
> ip header that triggered the sending of the icmp message.
> 
> In gre_cisco_err(), the parse_gre_header() is called, and the
> iptunnel_pull_header() is called to pull the skb at the end of
> the parse_gre_header(). Unfortunately, the ipgre_err still needs
> the skb->data points the ip header following the icmp layer,
> and those ip addresses in ip header will be used to look up
> tunnel by ip_tunnel_lookup().
> 
> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>

Conditional pulling of headers based upon caller context leads to
impossible to maintain code.

Please find another way to fix this.

^ permalink raw reply

* Re: [PATCH v2] ipv4: remove the useless argument from ip_tunnel_hash()
From: David Miller @ 2014-01-22  0:49 UTC (permalink / raw)
  To: duanj.fnst; +Cc: netdev
In-Reply-To: <52DB903E.9050301@cn.fujitsu.com>

From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Date: Sun, 19 Jan 2014 16:43:42 +0800

> 
> Since commit c544193214("GRE: Refactor GRE tunneling code")
> introduced function ip_tunnel_hash(), the argument itn is no 
> longer in use, so remove it.
> 
> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] packet: fix a couple of cppcheck warnings
From: David Miller @ 2014-01-22  0:52 UTC (permalink / raw)
  To: dborkman; +Cc: netdev
In-Reply-To: <1390128413-7075-1-git-send-email-dborkman@redhat.com>

From: Daniel Borkmann <dborkman@redhat.com>
Date: Sun, 19 Jan 2014 11:46:53 +0100

> Doesn't bring much, but also doesn't hurt us to fix 'em:
> 
> 1) In tpacket_rcv() flush dcache page we can restirct the scope
>    for start and end and remove one layer of indent.
> 
> 2) In tpacket_destruct_skb() we can restirct the scope for ph.
> 
> 3) In alloc_one_pg_vec_page() we can remove the NULL assignment
>    and change spacing a bit.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net v2] tcp: delete redundant calls of tcp_mtup_init()
From: David Miller @ 2014-01-22  0:52 UTC (permalink / raw)
  To: panweiping3; +Cc: netdev
In-Reply-To: <1a8e9e92721ef792aa017b89a4d6376dca5dc4f6.1390135376.git.panweiping3@gmail.com>

From: Weiping Pan <panweiping3@gmail.com>
Date: Sun, 19 Jan 2014 20:44:46 +0800

> As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen
> sock, we can delete the redundant calls of tcp_mtup_init() in
> tcp_{v4,v6}_syn_recv_sock().
> 
> Signed-off-by: Weiping Pan <panweiping3@gmail.com>

Looks good, applied.

^ permalink raw reply

* Re: [PATCH net-next v2] ipv6: enable anycast addresses as source addresses in ICMPv6 error messages
From: David Miller @ 2014-01-22  0:53 UTC (permalink / raw)
  To: fx.lebail
  Cc: netdev, dlstevens, billfink, hannes, kuznet, jmorris, yoshfuji,
	kaber
In-Reply-To: <1390147236-3660-1-git-send-email-fx.lebail@yahoo.com>

From: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Date: Sun, 19 Jan 2014 17:00:36 +0100

> - Uses ipv6_anycast_destination() in icmp6_send().
> 
> Suggested-by: Bill Fink <billfink@mindspring.com>
> Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] ipv6: protect protocols not handling ipv4 from v4 connection/bind attempts
From: David Miller @ 2014-01-22  0:59 UTC (permalink / raw)
  To: hannes; +Cc: netdev
In-Reply-To: <20140120041639.GA27055@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 20 Jan 2014 05:16:39 +0100

> Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow
> connecting and binding to them. sendmsg logic does already check msg->name
> for this but must trust already connected sockets which could be set up
> for connection to ipv4 address family.
> 
> Per-socket flag ipv6only is of no use here, as it is under users control
> by setsockopt.
> 
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Applied.  Is the plan to add support to at least ping?

Thanks.

^ permalink raw reply

* Re: [PATCH] 8021q: update description
From: David Miller @ 2014-01-22  1:02 UTC (permalink / raw)
  To: yegorslists; +Cc: netdev
In-Reply-To: <1390211379-3374-1-git-send-email-yegorslists@googlemail.com>

From: yegorslists@googlemail.com
Date: Mon, 20 Jan 2014 10:49:39 +0100

> From: Yegor Yefremov <yegorslists@googlemail.com>
> 
> Replace deprecated 'vconfig' tool with 'ip' from 'iproute2'. Add
> some beautifications like replacing 'ethernet' with 'Ethernet' and
> removing unneeded spaces.
> 
> Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>

Applied.

 ...
>  	  <http://www.candelatech.com/~greear/vlan.html>

We should also probably get rid of this URL reference, it hasn't been
updated in almost 9 years, talks about 2.2 and 2.4.x kernels, and
recommends using vconfig.

^ permalink raw reply

* [PATCH net-next v4 1/3] random32: add prandom_u32_max and convert open coded users
From: Hannes Frederic Sowa @ 2014-01-22  1:02 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, Jakub Zawadzki, Eric Dumazet, linux-kernel
In-Reply-To: <1390352550-4511-1-git-send-email-hannes@stressinduktion.org>

From: Daniel Borkmann <dborkman@redhat.com>

Many functions have open coded a function that returns a random
number in range [0,N-1]. Under the assumption that we have a PRNG
such as taus113 with being well distributed in [0, ~0U] space,
we can implement such a function as uword t = (n*m')>>32, where
m' is a random number obtained from PRNG, n the right open interval
border and t our resulting random number, with n,m',t in u32 universe.

Lets go with Joe and simply call it prandom_u32_max(), although
technically we have an right open interval endpoint, but that we
have documented. Other users can further be migrated to the new
prandom_u32_max() function later on; for now, we need to make sure
to migrate reciprocal_divide() users for the reciprocal_divide()
follow-up fixup since their function signatures are going to change.

Joint work with Hannes Frederic Sowa.

Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/team/team_mode_random.c |  8 +-------
 include/linux/random.h              | 18 +++++++++++++++++-
 net/packet/af_packet.c              |  2 +-
 net/sched/sch_choke.c               |  9 +--------
 4 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/drivers/net/team/team_mode_random.c b/drivers/net/team/team_mode_random.c
index 7f032e2..cd2f692 100644
--- a/drivers/net/team/team_mode_random.c
+++ b/drivers/net/team/team_mode_random.c
@@ -13,20 +13,14 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/skbuff.h>
-#include <linux/reciprocal_div.h>
 #include <linux/if_team.h>
 
-static u32 random_N(unsigned int N)
-{
-	return reciprocal_divide(prandom_u32(), N);
-}
-
 static bool rnd_transmit(struct team *team, struct sk_buff *skb)
 {
 	struct team_port *port;
 	int port_index;
 
-	port_index = random_N(team->en_port_count);
+	port_index = prandom_u32_max(team->en_port_count);
 	port = team_get_port_by_index_rcu(team, port_index);
 	if (unlikely(!port))
 		goto drop;
diff --git a/include/linux/random.h b/include/linux/random.h
index 4002b3d..3e89ac9 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -8,7 +8,6 @@
 
 #include <uapi/linux/random.h>
 
-
 extern void add_device_randomness(const void *, unsigned int);
 extern void add_input_randomness(unsigned int type, unsigned int code,
 				 unsigned int value);
@@ -38,6 +37,23 @@ struct rnd_state {
 u32 prandom_u32_state(struct rnd_state *state);
 void prandom_bytes_state(struct rnd_state *state, void *buf, int nbytes);
 
+/**
+ * prandom_u32_max - returns a pseduo-random number in interval [0, ep_ro)
+ * @ep_ro: right open interval endpoint
+ *
+ * Returns a pseduo-random number that is in interval [0, ep_ro). Note
+ * that the result depends on PRNG being well distributed in [0, ~0U]
+ * u32 space. Here we use maximally equidistributed combined Tausworthe
+ * generator, that is, prandom_u32(). This is useful when requesting a
+ * random index of an array containing ep_ro elements, for example.
+ *
+ * Returns: pseduo-random number in interval [0, ep_ro)
+ */
+static inline u32 prandom_u32_max(u32 ep_ro)
+{
+	return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
+}
+
 /*
  * Handle minimum values for seeds
  */
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 59fb3db..df3cbdd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1289,7 +1289,7 @@ static unsigned int fanout_demux_rnd(struct packet_fanout *f,
 				     struct sk_buff *skb,
 				     unsigned int num)
 {
-	return reciprocal_divide(prandom_u32(), num);
+	return prandom_u32_max(num);
 }
 
 static unsigned int fanout_demux_rollover(struct packet_fanout *f,
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index ddd73cb..2aee028 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -14,7 +14,6 @@
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/skbuff.h>
-#include <linux/reciprocal_div.h>
 #include <linux/vmalloc.h>
 #include <net/pkt_sched.h>
 #include <net/inet_ecn.h>
@@ -77,12 +76,6 @@ struct choke_sched_data {
 	struct sk_buff **tab;
 };
 
-/* deliver a random number between 0 and N - 1 */
-static u32 random_N(unsigned int N)
-{
-	return reciprocal_divide(prandom_u32(), N);
-}
-
 /* number of elements in queue including holes */
 static unsigned int choke_len(const struct choke_sched_data *q)
 {
@@ -233,7 +226,7 @@ static struct sk_buff *choke_peek_random(const struct choke_sched_data *q,
 	int retrys = 3;
 
 	do {
-		*pidx = (q->head + random_N(choke_len(q))) & q->tab_mask;
+		*pidx = (q->head + prandom_u32_max(choke_len(q))) & q->tab_mask;
 		skb = q->tab[*pidx];
 		if (skb)
 			return skb;
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next v4 2/3] net: introduce reciprocal_scale helper and convert users
From: Hannes Frederic Sowa @ 2014-01-22  1:02 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, Jakub Zawadzki, Eric Dumazet, linux-kernel
In-Reply-To: <1390352550-4511-1-git-send-email-hannes@stressinduktion.org>

From: Daniel Borkmann <dborkman@redhat.com>

As David Laight suggests, we shouldn't necessarily call this
reciprocal_divide() when users didn't requested a reciprocal_value();
lets keep the basic idea and call it reciprocal_scale(). More
background information on this topic can be found in [1].

Joint work with Hannes Frederic Sowa.

  [1] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

Suggested-by: David Laight <david.laight@aculab.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/linux/kernel.h | 19 +++++++++++++++++++
 include/net/codel.h    |  4 +---
 net/packet/af_packet.c |  3 +--
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index ecb8754..03d8a6b 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -193,6 +193,25 @@ extern int _cond_resched(void);
 		(__x < 0) ? -__x : __x;		\
 	})
 
+/**
+ * reciprocal_scale - "scale" a value into range [0, ep_ro)
+ * @val: value
+ * @ep_ro: right open interval endpoint
+ *
+ * Perform a "reciprocal multiplication" in order to "scale" a value into
+ * range [0, ep_ro), where the upper interval endpoint is right-open.
+ * This is useful, e.g. for accessing a index of an array containing
+ * ep_ro elements, for example. Think of it as sort of modulus, only that
+ * the result isn't that of modulo. ;) Note that if initial input is a
+ * small value, then result will return 0.
+ *
+ * Return: a result based on val in interval [0, ep_ro).
+ */
+static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
+{
+	return (u32)(((u64) val * ep_ro) >> 32);
+}
+
 #if defined(CONFIG_MMU) && \
 	(defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
 void might_fault(void);
diff --git a/include/net/codel.h b/include/net/codel.h
index 3b04ff5..fe0eab3 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -46,7 +46,6 @@
 #include <linux/skbuff.h>
 #include <net/pkt_sched.h>
 #include <net/inet_ecn.h>
-#include <linux/reciprocal_div.h>
 
 /* Controlling Queue Delay (CoDel) algorithm
  * =========================================
@@ -211,10 +210,9 @@ static codel_time_t codel_control_law(codel_time_t t,
 				      codel_time_t interval,
 				      u32 rec_inv_sqrt)
 {
-	return t + reciprocal_divide(interval, rec_inv_sqrt << REC_INV_SQRT_SHIFT);
+	return t + reciprocal_scale(interval, rec_inv_sqrt << REC_INV_SQRT_SHIFT);
 }
 
-
 static bool codel_should_drop(const struct sk_buff *skb,
 			      struct Qdisc *sch,
 			      struct codel_vars *vars,
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index df3cbdd..9734616 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -88,7 +88,6 @@
 #include <linux/virtio_net.h>
 #include <linux/errqueue.h>
 #include <linux/net_tstamp.h>
-#include <linux/reciprocal_div.h>
 #include <linux/percpu.h>
 #ifdef CONFIG_INET
 #include <net/inet_common.h>
@@ -1262,7 +1261,7 @@ static unsigned int fanout_demux_hash(struct packet_fanout *f,
 				      struct sk_buff *skb,
 				      unsigned int num)
 {
-	return reciprocal_divide(skb->rxhash, num);
+	return reciprocal_scale(skb->rxhash, num);
 }
 
 static unsigned int fanout_demux_lb(struct packet_fanout *f,
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next v4 3/3] reciprocal_divide: update/correction of the algorithm
From: Hannes Frederic Sowa @ 2014-01-22  1:02 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Austin S Hemmelgarn, linux-kernel, Jesse Gross,
	Jamal Hadi Salim, Stephen Hemminger, Matt Mackall, Pekka Enberg,
	Christoph Lameter, Andy Gospodarek, Veaceslav Falico,
	Jay Vosburgh, Jakub Zawadzki, Daniel Borkmann
In-Reply-To: <1390352550-4511-1-git-send-email-hannes@stressinduktion.org>

Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.

This has been fixed by Eric Dumazet in commit aee636c4809fa5
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
where very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d8d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:

1) Initialization:

  int l = ceil(log_2 d)
  uword m' = floor((1<<32)*((1<<l)-d)/d)+1
  int sh_1 = min(l,1)
  int sh_2 = max(l-1,0)

2) For q = n/d, all uword:

  uword t = (n*m')>>32
  q = (t+((n-t)>>sh_1))>>sh_2

The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.

Joint work with Daniel Borkmann.

  [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
  [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
  [3] https://gmplib.org/~tege/division-paper.pdf
  [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
  [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
  [6] http://www.agner.org/optimize/asmlib.zip

Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/bonding/bond_main.c    | 24 ++++++++++++++++-------
 drivers/net/bonding/bond_netlink.c |  4 ----
 drivers/net/bonding/bond_options.c | 15 ++++++++++-----
 drivers/net/bonding/bond_sysfs.c   |  5 -----
 drivers/net/bonding/bonding.h      |  3 +++
 include/linux/flex_array.h         |  3 ++-
 include/linux/reciprocal_div.h     | 39 ++++++++++++++++++++------------------
 include/linux/slab_def.h           |  4 +++-
 include/net/red.h                  |  3 ++-
 lib/flex_array.c                   |  7 ++++++-
 lib/reciprocal_div.c               | 24 +++++++++++++++++++----
 net/sched/sch_netem.c              |  6 ++++--
 12 files changed, 88 insertions(+), 49 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3220b48..f100bd9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -79,7 +79,6 @@
 #include <net/pkt_sched.h>
 #include <linux/rculist.h>
 #include <net/flow_keys.h>
-#include <linux/reciprocal_div.h>
 #include "bonding.h"
 #include "bond_3ad.h"
 #include "bond_alb.h"
@@ -3596,8 +3595,9 @@ static void bond_xmit_slave_id(struct bonding *bond, struct sk_buff *skb, int sl
  */
 static u32 bond_rr_gen_slave_id(struct bonding *bond)
 {
-	int packets_per_slave = bond->params.packets_per_slave;
 	u32 slave_id;
+	struct reciprocal_value reciprocal_packets_per_slave;
+	int packets_per_slave = bond->params.packets_per_slave;
 
 	switch (packets_per_slave) {
 	case 0:
@@ -3607,8 +3607,10 @@ static u32 bond_rr_gen_slave_id(struct bonding *bond)
 		slave_id = bond->rr_tx_counter;
 		break;
 	default:
+		reciprocal_packets_per_slave =
+			bond->params.reciprocal_packets_per_slave;
 		slave_id = reciprocal_divide(bond->rr_tx_counter,
-					     packets_per_slave);
+					     reciprocal_packets_per_slave);
 		break;
 	}
 	bond->rr_tx_counter++;
@@ -4343,10 +4345,18 @@ static int bond_check_params(struct bond_params *params)
 	params->resend_igmp = resend_igmp;
 	params->min_links = min_links;
 	params->lp_interval = lp_interval;
-	if (packets_per_slave > 1)
-		params->packets_per_slave = reciprocal_value(packets_per_slave);
-	else
-		params->packets_per_slave = packets_per_slave;
+	params->packets_per_slave = packets_per_slave;
+	if (packets_per_slave > 0) {
+		params->reciprocal_packets_per_slave =
+			reciprocal_value(packets_per_slave);
+	} else {
+		/* reciprocal_packets_per_slave is unused if
+		 * packets_per_slave is 0 or 1, just initialize it
+		 */
+		params->reciprocal_packets_per_slave =
+			(struct reciprocal_value) { 0 };
+	}
+
 	if (primary) {
 		strncpy(params->primary, primary, IFNAMSIZ);
 		params->primary[IFNAMSIZ - 1] = 0;
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 21c6488..e852655 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -19,7 +19,6 @@
 #include <linux/if_ether.h>
 #include <net/netlink.h>
 #include <net/rtnetlink.h>
-#include <linux/reciprocal_div.h>
 #include "bonding.h"
 
 int bond_get_slave(struct net_device *slave_dev, struct sk_buff *skb)
@@ -452,9 +451,6 @@ static int bond_fill_info(struct sk_buff *skb,
 		goto nla_put_failure;
 
 	packets_per_slave = bond->params.packets_per_slave;
-	if (packets_per_slave > 1)
-		packets_per_slave = reciprocal_value(packets_per_slave);
-
 	if (nla_put_u32(skb, IFLA_BOND_PACKETS_PER_SLAVE,
 			packets_per_slave))
 		goto nla_put_failure;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 945a666..85e4348 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -16,7 +16,6 @@
 #include <linux/netdevice.h>
 #include <linux/rwlock.h>
 #include <linux/rcupdate.h>
-#include <linux/reciprocal_div.h>
 #include "bonding.h"
 
 int bond_option_mode_set(struct bonding *bond, int mode)
@@ -671,11 +670,17 @@ int bond_option_packets_per_slave_set(struct bonding *bond,
 		pr_warn("%s: Warning: packets_per_slave has effect only in balance-rr mode\n",
 			bond->dev->name);
 
-	if (packets_per_slave > 1)
-		bond->params.packets_per_slave =
+	bond->params.packets_per_slave = packets_per_slave;
+	if (packets_per_slave > 0) {
+		bond->params.reciprocal_packets_per_slave =
 			reciprocal_value(packets_per_slave);
-	else
-		bond->params.packets_per_slave = packets_per_slave;
+	} else {
+		/* reciprocal_packets_per_slave is unused if
+		 * packets_per_slave is 0 or 1, just initialize it
+		 */
+		bond->params.reciprocal_packets_per_slave =
+			(struct reciprocal_value) { 0 };
+	}
 
 	return 0;
 }
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 011f163..c083e9a 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -39,7 +39,6 @@
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 #include <linux/nsproxy.h>
-#include <linux/reciprocal_div.h>
 
 #include "bonding.h"
 
@@ -1374,10 +1373,6 @@ static ssize_t bonding_show_packets_per_slave(struct device *d,
 {
 	struct bonding *bond = to_bond(d);
 	unsigned int packets_per_slave = bond->params.packets_per_slave;
-
-	if (packets_per_slave > 1)
-		packets_per_slave = reciprocal_value(packets_per_slave);
-
 	return sprintf(buf, "%u\n", packets_per_slave);
 }
 
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 8a935f8..0a616c4 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -23,6 +23,8 @@
 #include <linux/netpoll.h>
 #include <linux/inetdevice.h>
 #include <linux/etherdevice.h>
+#include <linux/reciprocal_div.h>
+
 #include "bond_3ad.h"
 #include "bond_alb.h"
 
@@ -171,6 +173,7 @@ struct bond_params {
 	int resend_igmp;
 	int lp_interval;
 	int packets_per_slave;
+	struct reciprocal_value reciprocal_packets_per_slave;
 };
 
 struct bond_parm_tbl {
diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
index 6843cf1..b6efb0c 100644
--- a/include/linux/flex_array.h
+++ b/include/linux/flex_array.h
@@ -2,6 +2,7 @@
 #define _FLEX_ARRAY_H
 
 #include <linux/types.h>
+#include <linux/reciprocal_div.h>
 #include <asm/page.h>
 
 #define FLEX_ARRAY_PART_SIZE PAGE_SIZE
@@ -22,7 +23,7 @@ struct flex_array {
 			int element_size;
 			int total_nr_elements;
 			int elems_per_part;
-			u32 reciprocal_elems;
+			struct reciprocal_value reciprocal_elems;
 			struct flex_array_part *parts[];
 		};
 		/*
diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h
index f9c90b3..8c5a3fb 100644
--- a/include/linux/reciprocal_div.h
+++ b/include/linux/reciprocal_div.h
@@ -4,29 +4,32 @@
 #include <linux/types.h>
 
 /*
- * This file describes reciprocical division.
+ * This algorithm is based on the paper "Division by Invariant
+ * Integers Using Multiplication" by Torbjörn Granlund and Peter
+ * L. Montgomery.
  *
- * This optimizes the (A/B) problem, when A and B are two u32
- * and B is a known value (but not known at compile time)
+ * The assembler implementation from Agner Fog, which this code is
+ * based on, can be found here:
+ * http://www.agner.org/optimize/asmlib.zip
  *
- * The math principle used is :
- *   Let RECIPROCAL_VALUE(B) be (((1LL << 32) + (B - 1))/ B)
- *   Then A / B = (u32)(((u64)(A) * (R)) >> 32)
- *
- * This replaces a divide by a multiply (and a shift), and
- * is generally less expensive in CPU cycles.
+ * This optimization for A/B is helpful if the divisor B is mostly
+ * runtime invariant. The reciprocal of B is calculated in the
+ * slow-path with reciprocal_value(). The fast-path can then just use
+ * a much faster multiplication operation with a variable dividend A
+ * to calculate the division A/B.
  */
 
-/*
- * Computes the reciprocal value (R) for the value B of the divisor.
- * Should not be called before each reciprocal_divide(),
- * or else the performance is slower than a normal divide.
- */
-extern u32 reciprocal_value(u32 B);
+struct reciprocal_value {
+	u32 m;
+	u8 sh1, sh2;
+};
 
+struct reciprocal_value reciprocal_value(u32 d);
 
-static inline u32 reciprocal_divide(u32 A, u32 R)
+static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
 {
-	return (u32)(((u64)A * R) >> 32);
+	u32 t = (u32)(((u64)a * R.m) >> 32);
+	return (t + ((a - t) >> R.sh1)) >> R.sh2;
 }
-#endif
+
+#endif /* _LINUX_RECIPROCAL_DIV_H */
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 09bfffb..96e8aba 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_SLAB_DEF_H
 #define	_LINUX_SLAB_DEF_H
 
+#include <linux/reciprocal_div.h>
+
 /*
  * Definitions unique to the original Linux SLAB allocator.
  */
@@ -12,7 +14,7 @@ struct kmem_cache {
 	unsigned int shared;
 
 	unsigned int size;
-	u32 reciprocal_buffer_size;
+	struct reciprocal_value reciprocal_buffer_size;
 /* 2) touched by every alloc & free from the backend */
 
 	unsigned int flags;		/* constant flags */
diff --git a/include/net/red.h b/include/net/red.h
index 168bb2f..76e0b5f 100644
--- a/include/net/red.h
+++ b/include/net/red.h
@@ -130,7 +130,8 @@ struct red_parms {
 	u32		qth_max;	/* Max avg length threshold: Wlog scaled */
 	u32		Scell_max;
 	u32		max_P;		/* probability, [0 .. 1.0] 32 scaled */
-	u32		max_P_reciprocal; /* reciprocal_value(max_P / qth_delta) */
+	/* reciprocal_value(max_P / qth_delta) */
+	struct reciprocal_value	max_P_reciprocal;
 	u32		qth_delta;	/* max_th - min_th */
 	u32		target_min;	/* min_th + 0.4*(max_th - min_th) */
 	u32		target_max;	/* min_th + 0.6*(max_th - min_th) */
diff --git a/lib/flex_array.c b/lib/flex_array.c
index 6948a66..2eed22f 100644
--- a/lib/flex_array.c
+++ b/lib/flex_array.c
@@ -90,8 +90,8 @@ struct flex_array *flex_array_alloc(int element_size, unsigned int total,
 {
 	struct flex_array *ret;
 	int elems_per_part = 0;
-	int reciprocal_elems = 0;
 	int max_size = 0;
+	struct reciprocal_value reciprocal_elems = { 0 };
 
 	if (element_size) {
 		elems_per_part = FLEX_ARRAY_ELEMENTS_PER_PART(element_size);
@@ -119,6 +119,11 @@ EXPORT_SYMBOL(flex_array_alloc);
 static int fa_element_to_part_nr(struct flex_array *fa,
 					unsigned int element_nr)
 {
+	/*
+	 * if element_size == 0 we don't get here, so we never touch
+	 * the zeroed fa->reciprocal_elems, which would yield invalid
+	 * results
+	 */
 	return reciprocal_divide(element_nr, fa->reciprocal_elems);
 }
 
diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
index 75510e9..4641524 100644
--- a/lib/reciprocal_div.c
+++ b/lib/reciprocal_div.c
@@ -1,11 +1,27 @@
+#include <linux/kernel.h>
 #include <asm/div64.h>
 #include <linux/reciprocal_div.h>
 #include <linux/export.h>
 
-u32 reciprocal_value(u32 k)
+/*
+ * For a description of the algorithm please have a look at
+ * include/linux/reciprocal_div.h
+ */
+
+struct reciprocal_value reciprocal_value(u32 d)
 {
-	u64 val = (1LL << 32) + (k - 1);
-	do_div(val, k);
-	return (u32)val;
+	struct reciprocal_value R;
+	u64 m;
+	int l;
+
+	l = fls(d - 1);
+	m = ((1ULL << 32) * ((1ULL << l) - d));
+	do_div(m, d);
+	++m;
+	R.m = (u32)m;
+	R.sh1 = min(l, 1);
+	R.sh2 = max(l - 1, 0);
+
+	return R;
 }
 EXPORT_SYMBOL(reciprocal_value);
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index a2bfc37..de1059a 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -91,7 +91,7 @@ struct netem_sched_data {
 	u64 rate;
 	s32 packet_overhead;
 	u32 cell_size;
-	u32 cell_size_reciprocal;
+	struct reciprocal_value cell_size_reciprocal;
 	s32 cell_overhead;
 
 	struct crndstate {
@@ -725,9 +725,11 @@ static void get_rate(struct Qdisc *sch, const struct nlattr *attr)
 	q->rate = r->rate;
 	q->packet_overhead = r->packet_overhead;
 	q->cell_size = r->cell_size;
+	q->cell_overhead = r->cell_overhead;
 	if (q->cell_size)
 		q->cell_size_reciprocal = reciprocal_value(q->cell_size);
-	q->cell_overhead = r->cell_overhead;
+	else
+		q->cell_size_reciprocal = (struct reciprocal_value) { 0 };
 }
 
 static int get_loss_clg(struct Qdisc *sch, const struct nlattr *attr)
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next v4 0/3] reciprocal_divide update
From: Hannes Frederic Sowa @ 2014-01-22  1:02 UTC (permalink / raw)
  To: netdev

This patch is on top of aee636c4809fa5 ("bpf: do not use reciprocal
divide") from Eric that sits in net tree. It will not create a merge
conflict, but it depends on this one, so we suggest, if possible, to
merge net into net-next.

We are proposing this change with only small modifications from the
v2 version, namely updating the name of trim to reciprocal_scale
(as commented on by Ben Hutchings and Eric Dumazet, thanks!).

We thought about introducing the reciprocal_divide algorithm in
parallel to the one already used by the kernel but faced organizational
issues, leading us to the conclusion that it is best to just replace
the old one: We could not come up with names for the different
implementations and also with a way to describe the differences to
guide developers which one to choose in which situation. This is
because we cannot specify the correct semantics for the version
which is currently used by the kernel. Altough it seems to not be
causing problems in the kernel, we cannot surely say so in the
case of flex_array for the future. Current usage seems ok, but
future users could run into problems.

Changelog:

v1->v2:
 - changed name to prandom_u32_max in p1
 - changed name to trim in p2
 - reworked code in p3
v2->v3:
 - p1 and p3 stays unchanged, only small update in commit
   message in p3
 - changed name to reciprocal_scale in p2
 - fixed kernel doc format
v3->v4:
 - pseduo -> pseudo (thanks to Tilman Schmidt)

Thanks !

Daniel Borkmann (2):
  random32: add prandom_u32_max and convert open coded users
  net: introduce reciprocal_scale helper and convert users

Hannes Frederic Sowa (1):
  reciprocal_divide: update/correction of the algorithm

 drivers/net/bonding/bond_main.c     | 24 ++++++++++++++++-------
 drivers/net/bonding/bond_netlink.c  |  4 ----
 drivers/net/bonding/bond_options.c  | 15 +++++++++----- 
 drivers/net/bonding/bond_sysfs.c    |  5 -----
 drivers/net/bonding/bonding.h       |  3 +++
 drivers/net/team/team_mode_random.c |  8 +-------
 include/linux/flex_array.h          |  3 ++-
 include/linux/kernel.h              | 19 ++++++++++++++++++
 include/linux/random.h              | 18 ++++++++++++++++-
 include/linux/reciprocal_div.h      | 39 ++++++++++++++++++++-----------------
 include/linux/slab_def.h            |  4 +++-
 include/net/codel.h                 |  4 +---
 include/net/red.h                   |  3 ++-
 lib/flex_array.c                    |  7 ++++++-
 lib/reciprocal_div.c                | 24 +++++++++++++++++++----
 net/packet/af_packet.c              |  5 ++---
 net/sched/sch_choke.c               |  9 +--------
 net/sched/sch_netem.c               |  6 ++++--
 18 files changed, 129 insertions(+), 71 deletions(-)

^ permalink raw reply

* Re: [patch] rxrpc: out of bound read in debug code
From: David Miller @ 2014-01-22  1:03 UTC (permalink / raw)
  To: dan.carpenter; +Cc: netdev, kernel-janitors
In-Reply-To: <20140120102859.GA14233@elgon.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Mon, 20 Jan 2014 13:28:59 +0300

> Smatch complains because we are using an untrusted index into the
> rxrpc_acks[] array.  It's just a read and it's only in the debug code,
> but it's simple enough to add a check and fix it.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks Dan.

^ permalink raw reply

* Re: [PATCH net-next v6 0/2] stmmac: fix kernel crashes for jumbo frames
From: David Miller @ 2014-01-22  1:05 UTC (permalink / raw)
  To: vbridgers2013
  Cc: devicetree, netdev, peppe.cavallaro, robh+dt, pawel.moll,
	mark.rutland, ijc+devicetree, galak, dinguyen, rayagond
In-Reply-To: <1390217941-22967-1-git-send-email-vbridgers2013@gmail.com>


Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net] ipv6: protect protocols not handling ipv4 from v4 connection/bind attempts
From: Hannes Frederic Sowa @ 2014-01-22  1:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20140121.165941.1593478240200500821.davem@davemloft.net>

On Tue, Jan 21, 2014 at 04:59:41PM -0800, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Mon, 20 Jan 2014 05:16:39 +0100
> 
> > Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow
> > connecting and binding to them. sendmsg logic does already check msg->name
> > for this but must trust already connected sockets which could be set up
> > for connection to ipv4 address family.
> > 
> > Per-socket flag ipv6only is of no use here, as it is under users control
> > by setsockopt.
> > 
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> Applied.  Is the plan to add support to at least ping?

Yes, this is planned but for the meantime we should disable it.

Thanks!

^ permalink raw reply

* Re: [PATCH net-next v4 0/3] reciprocal_divide update
From: Hannes Frederic Sowa @ 2014-01-22  1:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1390352550-4511-1-git-send-email-hannes@stressinduktion.org>

On Wed, Jan 22, 2014 at 02:02:27AM +0100, Hannes Frederic Sowa wrote:
> v1->v2:
>  - changed name to prandom_u32_max in p1
>  - changed name to trim in p2
>  - reworked code in p3
> v2->v3:
>  - p1 and p3 stays unchanged, only small update in commit
>    message in p3
>  - changed name to reciprocal_scale in p2
>  - fixed kernel doc format
> v3->v4:
>  - pseduo -> pseudo (thanks to Tilman Schmidt)

Oh, David, I messed up the rebase and the change is actually not part of this
series. Please discard this.

I'll double-check and resend soon and finally get my needed sleep then. ;)

Thanks,

  Hannes

^ permalink raw reply

* [PATCH net-next v5 1/3] random32: add prandom_u32_max and convert open coded users
From: Hannes Frederic Sowa @ 2014-01-22  1:29 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, Jakub Zawadzki, Eric Dumazet, linux-kernel
In-Reply-To: <1390354181-5080-1-git-send-email-hannes@stressinduktion.org>

From: Daniel Borkmann <dborkman@redhat.com>

Many functions have open coded a function that returns a random
number in range [0,N-1]. Under the assumption that we have a PRNG
such as taus113 with being well distributed in [0, ~0U] space,
we can implement such a function as uword t = (n*m')>>32, where
m' is a random number obtained from PRNG, n the right open interval
border and t our resulting random number, with n,m',t in u32 universe.

Lets go with Joe and simply call it prandom_u32_max(), although
technically we have an right open interval endpoint, but that we
have documented. Other users can further be migrated to the new
prandom_u32_max() function later on; for now, we need to make sure
to migrate reciprocal_divide() users for the reciprocal_divide()
follow-up fixup since their function signatures are going to change.

Joint work with Hannes Frederic Sowa.

Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/team/team_mode_random.c |  8 +-------
 include/linux/random.h              | 18 +++++++++++++++++-
 net/packet/af_packet.c              |  2 +-
 net/sched/sch_choke.c               |  9 +--------
 4 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/drivers/net/team/team_mode_random.c b/drivers/net/team/team_mode_random.c
index 7f032e2..cd2f692 100644
--- a/drivers/net/team/team_mode_random.c
+++ b/drivers/net/team/team_mode_random.c
@@ -13,20 +13,14 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/skbuff.h>
-#include <linux/reciprocal_div.h>
 #include <linux/if_team.h>
 
-static u32 random_N(unsigned int N)
-{
-	return reciprocal_divide(prandom_u32(), N);
-}
-
 static bool rnd_transmit(struct team *team, struct sk_buff *skb)
 {
 	struct team_port *port;
 	int port_index;
 
-	port_index = random_N(team->en_port_count);
+	port_index = prandom_u32_max(team->en_port_count);
 	port = team_get_port_by_index_rcu(team, port_index);
 	if (unlikely(!port))
 		goto drop;
diff --git a/include/linux/random.h b/include/linux/random.h
index 4002b3d..1cfce0e 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -8,7 +8,6 @@
 
 #include <uapi/linux/random.h>
 
-
 extern void add_device_randomness(const void *, unsigned int);
 extern void add_input_randomness(unsigned int type, unsigned int code,
 				 unsigned int value);
@@ -38,6 +37,23 @@ struct rnd_state {
 u32 prandom_u32_state(struct rnd_state *state);
 void prandom_bytes_state(struct rnd_state *state, void *buf, int nbytes);
 
+/**
+ * prandom_u32_max - returns a pseudo-random number in interval [0, ep_ro)
+ * @ep_ro: right open interval endpoint
+ *
+ * Returns a pseudo-random number that is in interval [0, ep_ro). Note
+ * that the result depends on PRNG being well distributed in [0, ~0U]
+ * u32 space. Here we use maximally equidistributed combined Tausworthe
+ * generator, that is, prandom_u32(). This is useful when requesting a
+ * random index of an array containing ep_ro elements, for example.
+ *
+ * Returns: pseudo-random number in interval [0, ep_ro)
+ */
+static inline u32 prandom_u32_max(u32 ep_ro)
+{
+	return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
+}
+
 /*
  * Handle minimum values for seeds
  */
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 59fb3db..df3cbdd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1289,7 +1289,7 @@ static unsigned int fanout_demux_rnd(struct packet_fanout *f,
 				     struct sk_buff *skb,
 				     unsigned int num)
 {
-	return reciprocal_divide(prandom_u32(), num);
+	return prandom_u32_max(num);
 }
 
 static unsigned int fanout_demux_rollover(struct packet_fanout *f,
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index ddd73cb..2aee028 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -14,7 +14,6 @@
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/skbuff.h>
-#include <linux/reciprocal_div.h>
 #include <linux/vmalloc.h>
 #include <net/pkt_sched.h>
 #include <net/inet_ecn.h>
@@ -77,12 +76,6 @@ struct choke_sched_data {
 	struct sk_buff **tab;
 };
 
-/* deliver a random number between 0 and N - 1 */
-static u32 random_N(unsigned int N)
-{
-	return reciprocal_divide(prandom_u32(), N);
-}
-
 /* number of elements in queue including holes */
 static unsigned int choke_len(const struct choke_sched_data *q)
 {
@@ -233,7 +226,7 @@ static struct sk_buff *choke_peek_random(const struct choke_sched_data *q,
 	int retrys = 3;
 
 	do {
-		*pidx = (q->head + random_N(choke_len(q))) & q->tab_mask;
+		*pidx = (q->head + prandom_u32_max(choke_len(q))) & q->tab_mask;
 		skb = q->tab[*pidx];
 		if (skb)
 			return skb;
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next v5 3/3] reciprocal_divide: update/correction of the algorithm
From: Hannes Frederic Sowa @ 2014-01-22  1:29 UTC (permalink / raw)
  To: netdev
  Cc: Eric Dumazet, Austin S Hemmelgarn, linux-kernel, Jesse Gross,
	Jamal Hadi Salim, Stephen Hemminger, Matt Mackall, Pekka Enberg,
	Christoph Lameter, Andy Gospodarek, Veaceslav Falico,
	Jay Vosburgh, Jakub Zawadzki, Daniel Borkmann
In-Reply-To: <1390354181-5080-1-git-send-email-hannes@stressinduktion.org>

Jakub Zawadzki noticed that some divisions by reciprocal_divide()
were not correct [1][2], which he could also show with BPF code
after divisions are transformed into reciprocal_value() for runtime
invariance which can be passed to reciprocal_divide() later on;
reverse in BPF dump ended up with a different, off-by-one K in
some situations.

This has been fixed by Eric Dumazet in commit aee636c4809fa5
("bpf: do not use reciprocal divide"). This follow-up patch
improves reciprocal_value() and reciprocal_divide() to work in
all cases by using Granlund and Montgomery method, so that also
future use is safe and without any non-obvious side-effects.
Known problems with the old implementation were that division by 1
always returned 0 and some off-by-ones when the dividend and divisor
where very large. This seemed to not be problematic with its
current users, as far as we can tell. Eric Dumazet checked for
the slab usage, we cannot surely say so in the case of flex_array.
Still, in order to fix that, we propose an extension from the
original implementation from commit 6a2d7a955d8d resp. [3][4],
by using the algorithm proposed in "Division by Invariant Integers
Using Multiplication" [5], Torbjörn Granlund and Peter L.
Montgomery, that is, pseudocode for q = n/d where q, n, d is in
u32 universe:

1) Initialization:

  int l = ceil(log_2 d)
  uword m' = floor((1<<32)*((1<<l)-d)/d)+1
  int sh_1 = min(l,1)
  int sh_2 = max(l-1,0)

2) For q = n/d, all uword:

  uword t = (n*m')>>32
  q = (t+((n-t)>>sh_1))>>sh_2

The assembler implementation from Agner Fog [6] also helped a lot
while implementing. We have tested the implementation on x86_64,
ppc64, i686, s390x; on x86_64/haswell we're still half the latency
compared to normal divide.

Joint work with Daniel Borkmann.

  [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c
  [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
  [3] https://gmplib.org/~tege/division-paper.pdf
  [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
  [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556
  [6] http://www.agner.org/optimize/asmlib.zip

Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jesse Gross <jesse@nicira.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Veaceslav Falico <vfalico@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 drivers/net/bonding/bond_main.c    | 24 ++++++++++++++++-------
 drivers/net/bonding/bond_netlink.c |  4 ----
 drivers/net/bonding/bond_options.c | 15 ++++++++++-----
 drivers/net/bonding/bond_sysfs.c   |  5 -----
 drivers/net/bonding/bonding.h      |  3 +++
 include/linux/flex_array.h         |  3 ++-
 include/linux/reciprocal_div.h     | 39 ++++++++++++++++++++------------------
 include/linux/slab_def.h           |  4 +++-
 include/net/red.h                  |  3 ++-
 lib/flex_array.c                   |  7 ++++++-
 lib/reciprocal_div.c               | 24 +++++++++++++++++++----
 net/sched/sch_netem.c              |  6 ++++--
 12 files changed, 88 insertions(+), 49 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3220b48..f100bd9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -79,7 +79,6 @@
 #include <net/pkt_sched.h>
 #include <linux/rculist.h>
 #include <net/flow_keys.h>
-#include <linux/reciprocal_div.h>
 #include "bonding.h"
 #include "bond_3ad.h"
 #include "bond_alb.h"
@@ -3596,8 +3595,9 @@ static void bond_xmit_slave_id(struct bonding *bond, struct sk_buff *skb, int sl
  */
 static u32 bond_rr_gen_slave_id(struct bonding *bond)
 {
-	int packets_per_slave = bond->params.packets_per_slave;
 	u32 slave_id;
+	struct reciprocal_value reciprocal_packets_per_slave;
+	int packets_per_slave = bond->params.packets_per_slave;
 
 	switch (packets_per_slave) {
 	case 0:
@@ -3607,8 +3607,10 @@ static u32 bond_rr_gen_slave_id(struct bonding *bond)
 		slave_id = bond->rr_tx_counter;
 		break;
 	default:
+		reciprocal_packets_per_slave =
+			bond->params.reciprocal_packets_per_slave;
 		slave_id = reciprocal_divide(bond->rr_tx_counter,
-					     packets_per_slave);
+					     reciprocal_packets_per_slave);
 		break;
 	}
 	bond->rr_tx_counter++;
@@ -4343,10 +4345,18 @@ static int bond_check_params(struct bond_params *params)
 	params->resend_igmp = resend_igmp;
 	params->min_links = min_links;
 	params->lp_interval = lp_interval;
-	if (packets_per_slave > 1)
-		params->packets_per_slave = reciprocal_value(packets_per_slave);
-	else
-		params->packets_per_slave = packets_per_slave;
+	params->packets_per_slave = packets_per_slave;
+	if (packets_per_slave > 0) {
+		params->reciprocal_packets_per_slave =
+			reciprocal_value(packets_per_slave);
+	} else {
+		/* reciprocal_packets_per_slave is unused if
+		 * packets_per_slave is 0 or 1, just initialize it
+		 */
+		params->reciprocal_packets_per_slave =
+			(struct reciprocal_value) { 0 };
+	}
+
 	if (primary) {
 		strncpy(params->primary, primary, IFNAMSIZ);
 		params->primary[IFNAMSIZ - 1] = 0;
diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index 21c6488..e852655 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -19,7 +19,6 @@
 #include <linux/if_ether.h>
 #include <net/netlink.h>
 #include <net/rtnetlink.h>
-#include <linux/reciprocal_div.h>
 #include "bonding.h"
 
 int bond_get_slave(struct net_device *slave_dev, struct sk_buff *skb)
@@ -452,9 +451,6 @@ static int bond_fill_info(struct sk_buff *skb,
 		goto nla_put_failure;
 
 	packets_per_slave = bond->params.packets_per_slave;
-	if (packets_per_slave > 1)
-		packets_per_slave = reciprocal_value(packets_per_slave);
-
 	if (nla_put_u32(skb, IFLA_BOND_PACKETS_PER_SLAVE,
 			packets_per_slave))
 		goto nla_put_failure;
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 945a666..85e4348 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -16,7 +16,6 @@
 #include <linux/netdevice.h>
 #include <linux/rwlock.h>
 #include <linux/rcupdate.h>
-#include <linux/reciprocal_div.h>
 #include "bonding.h"
 
 int bond_option_mode_set(struct bonding *bond, int mode)
@@ -671,11 +670,17 @@ int bond_option_packets_per_slave_set(struct bonding *bond,
 		pr_warn("%s: Warning: packets_per_slave has effect only in balance-rr mode\n",
 			bond->dev->name);
 
-	if (packets_per_slave > 1)
-		bond->params.packets_per_slave =
+	bond->params.packets_per_slave = packets_per_slave;
+	if (packets_per_slave > 0) {
+		bond->params.reciprocal_packets_per_slave =
 			reciprocal_value(packets_per_slave);
-	else
-		bond->params.packets_per_slave = packets_per_slave;
+	} else {
+		/* reciprocal_packets_per_slave is unused if
+		 * packets_per_slave is 0 or 1, just initialize it
+		 */
+		bond->params.reciprocal_packets_per_slave =
+			(struct reciprocal_value) { 0 };
+	}
 
 	return 0;
 }
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 011f163..c083e9a 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -39,7 +39,6 @@
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 #include <linux/nsproxy.h>
-#include <linux/reciprocal_div.h>
 
 #include "bonding.h"
 
@@ -1374,10 +1373,6 @@ static ssize_t bonding_show_packets_per_slave(struct device *d,
 {
 	struct bonding *bond = to_bond(d);
 	unsigned int packets_per_slave = bond->params.packets_per_slave;
-
-	if (packets_per_slave > 1)
-		packets_per_slave = reciprocal_value(packets_per_slave);
-
 	return sprintf(buf, "%u\n", packets_per_slave);
 }
 
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 8a935f8..0a616c4 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -23,6 +23,8 @@
 #include <linux/netpoll.h>
 #include <linux/inetdevice.h>
 #include <linux/etherdevice.h>
+#include <linux/reciprocal_div.h>
+
 #include "bond_3ad.h"
 #include "bond_alb.h"
 
@@ -171,6 +173,7 @@ struct bond_params {
 	int resend_igmp;
 	int lp_interval;
 	int packets_per_slave;
+	struct reciprocal_value reciprocal_packets_per_slave;
 };
 
 struct bond_parm_tbl {
diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
index 6843cf1..b6efb0c 100644
--- a/include/linux/flex_array.h
+++ b/include/linux/flex_array.h
@@ -2,6 +2,7 @@
 #define _FLEX_ARRAY_H
 
 #include <linux/types.h>
+#include <linux/reciprocal_div.h>
 #include <asm/page.h>
 
 #define FLEX_ARRAY_PART_SIZE PAGE_SIZE
@@ -22,7 +23,7 @@ struct flex_array {
 			int element_size;
 			int total_nr_elements;
 			int elems_per_part;
-			u32 reciprocal_elems;
+			struct reciprocal_value reciprocal_elems;
 			struct flex_array_part *parts[];
 		};
 		/*
diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h
index f9c90b3..8c5a3fb 100644
--- a/include/linux/reciprocal_div.h
+++ b/include/linux/reciprocal_div.h
@@ -4,29 +4,32 @@
 #include <linux/types.h>
 
 /*
- * This file describes reciprocical division.
+ * This algorithm is based on the paper "Division by Invariant
+ * Integers Using Multiplication" by Torbjörn Granlund and Peter
+ * L. Montgomery.
  *
- * This optimizes the (A/B) problem, when A and B are two u32
- * and B is a known value (but not known at compile time)
+ * The assembler implementation from Agner Fog, which this code is
+ * based on, can be found here:
+ * http://www.agner.org/optimize/asmlib.zip
  *
- * The math principle used is :
- *   Let RECIPROCAL_VALUE(B) be (((1LL << 32) + (B - 1))/ B)
- *   Then A / B = (u32)(((u64)(A) * (R)) >> 32)
- *
- * This replaces a divide by a multiply (and a shift), and
- * is generally less expensive in CPU cycles.
+ * This optimization for A/B is helpful if the divisor B is mostly
+ * runtime invariant. The reciprocal of B is calculated in the
+ * slow-path with reciprocal_value(). The fast-path can then just use
+ * a much faster multiplication operation with a variable dividend A
+ * to calculate the division A/B.
  */
 
-/*
- * Computes the reciprocal value (R) for the value B of the divisor.
- * Should not be called before each reciprocal_divide(),
- * or else the performance is slower than a normal divide.
- */
-extern u32 reciprocal_value(u32 B);
+struct reciprocal_value {
+	u32 m;
+	u8 sh1, sh2;
+};
 
+struct reciprocal_value reciprocal_value(u32 d);
 
-static inline u32 reciprocal_divide(u32 A, u32 R)
+static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
 {
-	return (u32)(((u64)A * R) >> 32);
+	u32 t = (u32)(((u64)a * R.m) >> 32);
+	return (t + ((a - t) >> R.sh1)) >> R.sh2;
 }
-#endif
+
+#endif /* _LINUX_RECIPROCAL_DIV_H */
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 09bfffb..96e8aba 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_SLAB_DEF_H
 #define	_LINUX_SLAB_DEF_H
 
+#include <linux/reciprocal_div.h>
+
 /*
  * Definitions unique to the original Linux SLAB allocator.
  */
@@ -12,7 +14,7 @@ struct kmem_cache {
 	unsigned int shared;
 
 	unsigned int size;
-	u32 reciprocal_buffer_size;
+	struct reciprocal_value reciprocal_buffer_size;
 /* 2) touched by every alloc & free from the backend */
 
 	unsigned int flags;		/* constant flags */
diff --git a/include/net/red.h b/include/net/red.h
index 168bb2f..76e0b5f 100644
--- a/include/net/red.h
+++ b/include/net/red.h
@@ -130,7 +130,8 @@ struct red_parms {
 	u32		qth_max;	/* Max avg length threshold: Wlog scaled */
 	u32		Scell_max;
 	u32		max_P;		/* probability, [0 .. 1.0] 32 scaled */
-	u32		max_P_reciprocal; /* reciprocal_value(max_P / qth_delta) */
+	/* reciprocal_value(max_P / qth_delta) */
+	struct reciprocal_value	max_P_reciprocal;
 	u32		qth_delta;	/* max_th - min_th */
 	u32		target_min;	/* min_th + 0.4*(max_th - min_th) */
 	u32		target_max;	/* min_th + 0.6*(max_th - min_th) */
diff --git a/lib/flex_array.c b/lib/flex_array.c
index 6948a66..2eed22f 100644
--- a/lib/flex_array.c
+++ b/lib/flex_array.c
@@ -90,8 +90,8 @@ struct flex_array *flex_array_alloc(int element_size, unsigned int total,
 {
 	struct flex_array *ret;
 	int elems_per_part = 0;
-	int reciprocal_elems = 0;
 	int max_size = 0;
+	struct reciprocal_value reciprocal_elems = { 0 };
 
 	if (element_size) {
 		elems_per_part = FLEX_ARRAY_ELEMENTS_PER_PART(element_size);
@@ -119,6 +119,11 @@ EXPORT_SYMBOL(flex_array_alloc);
 static int fa_element_to_part_nr(struct flex_array *fa,
 					unsigned int element_nr)
 {
+	/*
+	 * if element_size == 0 we don't get here, so we never touch
+	 * the zeroed fa->reciprocal_elems, which would yield invalid
+	 * results
+	 */
 	return reciprocal_divide(element_nr, fa->reciprocal_elems);
 }
 
diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
index 75510e9..4641524 100644
--- a/lib/reciprocal_div.c
+++ b/lib/reciprocal_div.c
@@ -1,11 +1,27 @@
+#include <linux/kernel.h>
 #include <asm/div64.h>
 #include <linux/reciprocal_div.h>
 #include <linux/export.h>
 
-u32 reciprocal_value(u32 k)
+/*
+ * For a description of the algorithm please have a look at
+ * include/linux/reciprocal_div.h
+ */
+
+struct reciprocal_value reciprocal_value(u32 d)
 {
-	u64 val = (1LL << 32) + (k - 1);
-	do_div(val, k);
-	return (u32)val;
+	struct reciprocal_value R;
+	u64 m;
+	int l;
+
+	l = fls(d - 1);
+	m = ((1ULL << 32) * ((1ULL << l) - d));
+	do_div(m, d);
+	++m;
+	R.m = (u32)m;
+	R.sh1 = min(l, 1);
+	R.sh2 = max(l - 1, 0);
+
+	return R;
 }
 EXPORT_SYMBOL(reciprocal_value);
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index a2bfc37..de1059a 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -91,7 +91,7 @@ struct netem_sched_data {
 	u64 rate;
 	s32 packet_overhead;
 	u32 cell_size;
-	u32 cell_size_reciprocal;
+	struct reciprocal_value cell_size_reciprocal;
 	s32 cell_overhead;
 
 	struct crndstate {
@@ -725,9 +725,11 @@ static void get_rate(struct Qdisc *sch, const struct nlattr *attr)
 	q->rate = r->rate;
 	q->packet_overhead = r->packet_overhead;
 	q->cell_size = r->cell_size;
+	q->cell_overhead = r->cell_overhead;
 	if (q->cell_size)
 		q->cell_size_reciprocal = reciprocal_value(q->cell_size);
-	q->cell_overhead = r->cell_overhead;
+	else
+		q->cell_size_reciprocal = (struct reciprocal_value) { 0 };
 }
 
 static int get_loss_clg(struct Qdisc *sch, const struct nlattr *attr)
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next v5 0/0] reciprocal_divide update
From: Hannes Frederic Sowa @ 2014-01-22  1:29 UTC (permalink / raw)
  To: netdev

This patch is on top of aee636c4809fa5 ("bpf: do not use reciprocal
divide") from Eric that sits in net tree. It will not create a merge
conflict, but it depends on this one, so we suggest, if possible, to
merge net into net-next.

We are proposing this change with only small modifications from the
v2 version, namely updating the name of trim to reciprocal_scale
(as commented on by Ben Hutchings and Eric Dumazet, thanks!).

We thought about introducing the reciprocal_divide algorithm in
parallel to the one already used by the kernel but faced organizational
issues, leading us to the conclusion that it is best to just replace
the old one: We could not come up with names for the different
implementations and also with a way to describe the differences to
guide developers which one to choose in which situation. This is
because we cannot specify the correct semantics for the version
which is currently used by the kernel. Altough it seems to not be
causing problems in the kernel, we cannot surely say so in the
case of flex_array for the future. Current usage seems ok, but
future users could run into problems.

Changelog:

v1->v2:
 - changed name to prandom_u32_max in p1
 - changed name to trim in p2
 - reworked code in p3
v2->v3:
 - p1 and p3 stays unchanged, only small update in commit
   message in p3
 - changed name to reciprocal_scale in p2
 - fixed kernel doc format
v3->v4:
 - pseduo -> pseudo (thanks to Tilman Schmidt)
v4->v5:
 - fix pseduo -> pseudo for real now, sorry for the noise

Thanks !

Daniel Borkmann (2):
  random32: add prandom_u32_max and convert open coded users
  net: introduce reciprocal_scale helper and convert users

Hannes Frederic Sowa (1):
  reciprocal_divide: update/correction of the algorithm

 drivers/net/bonding/bond_main.c     | 24 ++++++++++++++++-------
 drivers/net/bonding/bond_netlink.c  |  4 ----
 drivers/net/bonding/bond_options.c  | 15 +++++++++-----
 drivers/net/bonding/bond_sysfs.c    |  5 -----
 drivers/net/bonding/bonding.h       |  3 +++
 drivers/net/team/team_mode_random.c |  8 +-------
 include/linux/flex_array.h          |  3 ++-
 include/linux/kernel.h              | 19 ++++++++++++++++++
 include/linux/random.h              | 18 ++++++++++++++++-
 include/linux/reciprocal_div.h      | 39 ++++++++++++++++++++-----------------
 include/linux/slab_def.h            |  4 +++-
 include/net/codel.h                 |  4 +---
 include/net/red.h                   |  3 ++-
 lib/flex_array.c                    |  7 ++++++-
 lib/reciprocal_div.c                | 24 +++++++++++++++++++----   
 net/packet/af_packet.c              |  5 ++---
 net/sched/sch_choke.c               |  9 +--------
 net/sched/sch_netem.c               |  6 ++++--
 18 files changed, 129 insertions(+), 71 deletions(-)

^ permalink raw reply

* [PATCH net-next v5 2/3] net: introduce reciprocal_scale helper and convert users
From: Hannes Frederic Sowa @ 2014-01-22  1:29 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, Jakub Zawadzki, Eric Dumazet, linux-kernel
In-Reply-To: <1390354181-5080-1-git-send-email-hannes@stressinduktion.org>

From: Daniel Borkmann <dborkman@redhat.com>

As David Laight suggests, we shouldn't necessarily call this
reciprocal_divide() when users didn't requested a reciprocal_value();
lets keep the basic idea and call it reciprocal_scale(). More
background information on this topic can be found in [1].

Joint work with Hannes Frederic Sowa.

  [1] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

Suggested-by: David Laight <david.laight@aculab.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/linux/kernel.h | 19 +++++++++++++++++++
 include/net/codel.h    |  4 +---
 net/packet/af_packet.c |  3 +--
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index ecb8754..03d8a6b 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -193,6 +193,25 @@ extern int _cond_resched(void);
 		(__x < 0) ? -__x : __x;		\
 	})
 
+/**
+ * reciprocal_scale - "scale" a value into range [0, ep_ro)
+ * @val: value
+ * @ep_ro: right open interval endpoint
+ *
+ * Perform a "reciprocal multiplication" in order to "scale" a value into
+ * range [0, ep_ro), where the upper interval endpoint is right-open.
+ * This is useful, e.g. for accessing a index of an array containing
+ * ep_ro elements, for example. Think of it as sort of modulus, only that
+ * the result isn't that of modulo. ;) Note that if initial input is a
+ * small value, then result will return 0.
+ *
+ * Return: a result based on val in interval [0, ep_ro).
+ */
+static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
+{
+	return (u32)(((u64) val * ep_ro) >> 32);
+}
+
 #if defined(CONFIG_MMU) && \
 	(defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
 void might_fault(void);
diff --git a/include/net/codel.h b/include/net/codel.h
index 3b04ff5..fe0eab3 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -46,7 +46,6 @@
 #include <linux/skbuff.h>
 #include <net/pkt_sched.h>
 #include <net/inet_ecn.h>
-#include <linux/reciprocal_div.h>
 
 /* Controlling Queue Delay (CoDel) algorithm
  * =========================================
@@ -211,10 +210,9 @@ static codel_time_t codel_control_law(codel_time_t t,
 				      codel_time_t interval,
 				      u32 rec_inv_sqrt)
 {
-	return t + reciprocal_divide(interval, rec_inv_sqrt << REC_INV_SQRT_SHIFT);
+	return t + reciprocal_scale(interval, rec_inv_sqrt << REC_INV_SQRT_SHIFT);
 }
 
-
 static bool codel_should_drop(const struct sk_buff *skb,
 			      struct Qdisc *sch,
 			      struct codel_vars *vars,
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index df3cbdd..9734616 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -88,7 +88,6 @@
 #include <linux/virtio_net.h>
 #include <linux/errqueue.h>
 #include <linux/net_tstamp.h>
-#include <linux/reciprocal_div.h>
 #include <linux/percpu.h>
 #ifdef CONFIG_INET
 #include <net/inet_common.h>
@@ -1262,7 +1261,7 @@ static unsigned int fanout_demux_hash(struct packet_fanout *f,
 				      struct sk_buff *skb,
 				      unsigned int num)
 {
-	return reciprocal_divide(skb->rxhash, num);
+	return reciprocal_scale(skb->rxhash, num);
 }
 
 static unsigned int fanout_demux_lb(struct packet_fanout *f,
-- 
1.8.4.2

^ permalink raw reply related

* Re: [PATCH] DT: net: document Ethernet bindings in one place
From: Rob Herring @ 2014-01-22  1:30 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergei Shtylyov, netdev, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree@vger.kernel.org, Rob Landley,
	linux-doc@vger.kernel.org
In-Reply-To: <CAGVrzcb3X3soJNJCE5+YSpQrr+EdycCRFkptPvBCgFg4CbGJ4Q@mail.gmail.com>

On Tue, Jan 21, 2014 at 1:56 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> 2014/1/20 Rob Herring <robherring2@gmail.com>:
>> On Mon, Jan 20, 2014 at 3:45 PM, Sergei Shtylyov
>> <sergei.shtylyov@cogentembedded.com> wrote:
>>> On 01/20/2014 05:06 PM, Rob Herring wrote:

[snip]

>>>>> +- phy-connection-type: the same as "phy-mode" property (but described in
>>>>> ePAPR);
>>>>> +- phy-handle: phandle, specifies a reference to a node representing a
>>>>> PHY
>>>>> +  device (this property is described in ePAPR);
>>>>> +- phy: the same as "phy-handle" property (but actually ad-hoc one).
>>>
>>>
>>>> Mark this as deprecated in favor of phy-handle.
>>>
>>>
>>>    Here situation is more optimistic. Quite many drivers still use
>>> "phy-handle", though some use even more exotic props I didn't document here.
>>
>> Perhaps flagging as "Not recommended for new bindings" would be nicer wording...
>
> Ok, so what's the deal here, can't we use phy-handle? There is
> currently no infrastructure in drivers/net/ or drivers/of/ to look for
> the "phy-handle" nor "phy" properties as phandles to fetch an Ethernet
> PHY device tree node. If we are to provide one common place where we
> want to do this, we will have to look for both properties, why can't
> we mandate the use of "phy-handle" since this is the ePAPR compliant
> one?

My count is 3 using "phy" and 11 using "phy-handle". Even if we
mandate phy-handle, we still have to support both properties to
maintain compatibility with existing DTs. But nothing new should use
"phy".  I'm not sure that something common really helps here. Drivers
using "phy" can continue to and new ones use "phy-handle". After a
quick scan, there doesn't appear to be any multi-line pattern in the
drivers we could factor out.

Rob

^ permalink raw reply

* Re: ipv4_dst_destroy panic regression after 3.10.15
From: dormando @ 2014-01-22  1:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, netdev, linux-kernel@vger.kernel.org,
	Alexei Starovoitov
In-Reply-To: <CAADnVQKQFQh2JewwmXWBkyB86rOJs1mXHrgECikaMxGqFW3axA@mail.gmail.com>

> On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
> >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> >> > > Hi,
> >> > >
> >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> >> > >
> >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> > > <4>[196727.311713] Stack:
> >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> >> > > <4>[196727.311885] Call Trace:
> >> > > <4>[196727.311907]  <IRQ>
> >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> >> > > <4>[196727.312722]  <EOI>
> >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> >> > >
> >> > >
> >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> >> > > am trying a revert or two.
> >> > >
> >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> >> > > crashed, so the change exists between .15 and .25.
> >> >
> >> > Please try following fix, thanks for the report !
> >> >
> >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >> > index 25071b48921c..78a50a22298a 100644
> >> > --- a/net/ipv4/route.c
> >> > +++ b/net/ipv4/route.c
> >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> >> > *dst)
> >> >
> >> >     if (!list_empty(&rt->rt_uncached)) {
> >> >             spin_lock_bh(&rt_uncached_lock);
> >> > -           list_del(&rt->rt_uncached);
> >> > +           list_del_init(&rt->rt_uncached);
> >> >             spin_unlock_bh(&rt_uncached_lock);
> >> >     }
> >> >  }
> >> >
> >>
> >> Problem could come from this commit, in linux 3.10.23,
> >> you also could try to revert it
> >>
> >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> >> Author: Alexei Starovoitov <ast@plumgrid.com>
> >> Date:   Tue Nov 19 19:12:34 2013 -0800
> >>
> >>     ipv4: fix race in concurrent ip_route_input_slow()
> >>
> >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
> >>
> >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
> >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
> >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
> >>     via rt_cache_route()
> >>     Most of the time they succeed, but on occasion the following two lines:
> >>         orig = *p;
> >>         prev = cmpxchg(p, orig, rt);
> >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
> >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
> >>     so dst is leaking. dst_destroy() is never called and 'lo' device
> >>     refcnt doesn't go to zero, which can be seen in the logs as:
> >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
> >>     Adding mdelay() between above two lines makes it easily reproducible.
> >>     Fix it similar to nh_pcpu_rth_output case.
> >>
> >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
> >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> >>     Signed-off-by: David S. Miller <davem@davemloft.net>
> >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>
> >
> > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> > my best guess. I have a kernel running in production with only this
> > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
> >
> > I'm building a kernel now with your route patch, but 62713c4 *not*
> > reverted, which I can throw on a different machine. Does this sound like a
> > good idea?
>
> the traces of your crash don't look similar to dst leak that was fixed by
> commit 62713c4...
>
> To trigger the 'fix' logic of the 62713c4 add the following diff:
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f6c6ab1..8972676 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> struct rtable *rt)
>                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
>         }
>         orig = *p;
> -
> +       mdelay(100);
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig)
>
> I've been running with it for a day without issues.
> Note that it will stress both 'input' and 'output' ways of adding dsts to
> rt_uncached list...
> and 'output' was there for ages.
>
> If mdelay() helps to reproduce it in minutes instead of days
> then we're on the right path.
> Could you share details of your workload?
> In our case it was a lot of namespaces with a ton of processes
> talking to each other, forcefully killed and restarted.
> Do you see the same crash in the latest tree?
>
> PS sorry for double posts. netdev email bounced few times for me...
>

So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
my machines have survived. One four days, another two days. They'd barely
make it a day before.

So this patch is introducing a new problem. Weakening the dst reference
too much in a false positive case?

I'm not entirely sure if the mdelay(100) thing will help much. I can try
it but it's kind of obnoxious to reboot these machines (far away, ipmi
only) too often. Unless you folks have any new ideas before I go off and
do that?

^ permalink raw reply

* IPV6 routing problem
From: Sharat Masetty @ 2014-01-22  1:41 UTC (permalink / raw)
  To: netdev

 I have an IPV6 routing problem that has only surfaced on a 3.10
kernel version. This problem is not seen on 3.4 kernel. I will keep
the problem statement as brief as possible.

I have two interfaces say eth0 and eth1. I have a host route to a
destination over eth1, but this route has a global gateway
address(via) of the router on the link. When this global gateway is
present in the route, the kernel does not seem to pick up the route,
but falls back to the default route which is eth0. When this gateway
is removed from the route or if the gateway is changed to a link local
address of the router, instead of the global address, then routing
seems to work as expected and kernel picks up interface eth1.

== does not work ==
<dest-addr> via <global scope -gateway addr>dev eth1 metric 1024
<global scope -gateway addr> dev eth1  metric 1024

I checked the git commits diff between 3.4 and 3.10 for ipv6/route.c
and ipv6/ip6_fib.c, but looks like they are close to an year apart
with lots of changes. Hence I wanted to reach out to you to see if any
recent changes in IPV6 routing is causing this difference in behavior.

Any help would be greatly appreciated.

Thanks
Sharat

^ permalink raw reply

* Re: [PATCH net-next 1/5] bonding: The fail_over_mac should be set only in ACTIVE_BACKUP mode
From: Ding Tianhong @ 2014-01-22  1:59 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Veaceslav Falico, David S. Miller, Netdev, Andy Gospodarek
In-Reply-To: <20332.1390350326@death.nxdomain>

On 2014/1/22 8:25, Jay Vosburgh wrote:
> Ding Tianhong <dingtianhong@huawei.com> wrote:
> 
>> According the bonding.txt, the option fail_over_mac only affect for
>> AB mode, but in currect code, the parameter could be set to active
>> or follow in every mode, this will cause bonding could not set all
>> slaves of an RR or XOR mode to the same MAC address at enslavement
>> time, so reset fail_over_mac to 0 if the mode is not ACTIVE_BACKUP.
> 
> 	The correct way to fix this in general is to permit setting an
> option at any time, but only have it take effect in active-backup mode.
> This minimizes ordering requirements when setting options.
> 
ok

> 	I would instead modify the bond enslave and removal processing
> to check the mode in addition to fail_over_mac when setting a slave's
> MAC during enslavement.  The change active slave processing already only
> calls the fail_over_mac function when in active-backup mode.  This
> should also be a simpler change set.
> 
agree, the current modify actually more redundant.

>> Fix the wrong variables for pr_err().
>>
>> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
>> ---
>> drivers/net/bonding/bond_main.c | 8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 3220b48..ecff04e 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -4307,12 +4307,14 @@ static int bond_check_params(struct bond_params *params)
>> 						      fail_over_mac_tbl);
>> 		if (fail_over_mac_value == -1) {
>> 			pr_err("Error: invalid fail_over_mac \"%s\"\n",
>> -			       arp_validate == NULL ? "NULL" : arp_validate);
>> +			       fail_over_mac == NULL ? "NULL" : fail_over_mac);
> 
> 	This part is ok.
> 
>> 			return -EINVAL;
>> 		}
>>
>> -		if (bond_mode != BOND_MODE_ACTIVEBACKUP)
>> -			pr_warning("Warning: fail_over_mac only affects active-backup mode.\n");
>> +		if (bond_mode != BOND_MODE_ACTIVEBACKUP) {
>> +			pr_warning("Warning: fail_over_mac only affects active-backup mode, set it to 0.\n");
>> +			fail_over_mac_value = BOND_FOM_NONE;
>> +		}
> 
> 	This part is not.
> 
> 	I would additionally NAK patches 2, 3, and 4 (noting that 4
> inhibits the change to fail_over_mac, but not the warning message that
> we're changing it).
> 
> 	Patch 5 is ok, although it really has nothing to do with
> fail_over_mac.
> 
> 	-J
> 

The whole patchset need to be rebuild, thanks for reviewing.

Regards
Ding

>> 	} else {
>> 		fail_over_mac_value = BOND_FOM_NONE;
>> 	}
>> -- 
>> 1.8.0
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 
> 
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox