Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] bpf: do not use reciprocal divide
From: Martin Schwidefsky @ 2014-01-15 15:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Heiko Carstens, Hannes Frederic Sowa, netdev, dborkman,
	darkjames-ws, Mircea Gherzan, Russell King, Matt Evans
In-Reply-To: <1389795926.31367.334.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, 15 Jan 2014 06:25:26 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-01-15 at 06:21 -0800, Eric Dumazet wrote:
> > On Wed, 2014-01-15 at 11:51 +0100, Heiko Carstens wrote:
> > > On Wed, Jan 15, 2014 at 09:13:22AM +0100, Martin Schwidefsky wrote:
> > > > On Wed, 15 Jan 2014 09:00:07 +0100
> > > > Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > > 
> > > > > On Tue, Jan 14, 2014 at 11:02:41PM -0800, Eric Dumazet wrote:
> > > > > > diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
> > > > > > index 16871da37371..e349dc7d0992 100644
> > > > > > --- a/arch/s390/net/bpf_jit_comp.c
> > > > > > +++ b/arch/s390/net/bpf_jit_comp.c
> > > > > > @@ -371,11 +371,11 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
> > > > > >  		/* dr %r4,%r12 */
> > > > > >  		EMIT2(0x1d4c);
> > > > > >  		break;
> > > > > > -	case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K) */
> > > > > > -		/* m %r4,<d(K)>(%r13) */
> > > > > > -		EMIT4_DISP(0x5c40d000, EMIT_CONST(K));
> > > > > > -		/* lr %r5,%r4 */
> > > > > > -		EMIT2(0x1854);
> > > > > > +	case BPF_S_ALU_DIV_K: /* A /= K */
> > > > > > +		/* lhi %r4,0 */
> > > > > > +		EMIT4(0xa7480000);
> > > > > > +		/* d %r4,<d(K)>(%r13) */
> > > > > > +		EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
> > > > > >  		break;
> > > > > 
> > > > > The s390 part looks good.
> > > > 
> > > > Does it? The divide instruction is signed, for the special
> > > > case of K==1 this can now cause an exception if the quotient
> > > > gets too large. We should add a check for K==1 and do nothing
> > > > in this case. With a divisor of at least 2 the result will
> > > > stay in the limit.
> > > 
> > > Indeed. That's quite subtle.
> > 
> > net/core/filter.c does :
> > 
> > A /= K;
> > 
> > Why is this working in generic code (if K == 1), not in s390 one ?
> 
> Note that I copied code found in BPF_S_ALU_MOD_K, so this one would need
> a fix as well.

Hmm, that is true BPF_S_ALU_DIV_X, BPF_S_ALU_MOD_K and BPF_S_ALU_MOD_X all
suffer from the same problem. As the BPF jit is only used for 64-bit 
kernel the simplest solution is to replace the "dr" instruction with "dlr".

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: Andrew Bennieston @ 2014-01-15 15:35 UTC (permalink / raw)
  To: annie li; +Cc: Wei Liu, netdev, ian.campbell, xen-devel
In-Reply-To: <52D697FB.3000304@oracle.com>

On 15/01/14 14:15, annie li wrote:
>
> On 2014-1-15 19:02, Andrew Bennieston wrote:
>> On 15/01/14 10:07, Wei Liu wrote:
>>> On Fri, Jan 10, 2014 at 06:48:38AM +0800, Annie Li wrote:
>>>> Current netfront only grants pages for grant copy, not for grant
>>>> transfer, so
>>>> remove corresponding transfer code and add receiving copy code in
>>>> xennet_release_rx_bufs.
>>>>
>>>
>>> This path seldom gets call -- not that many people unload xen-netfront
>>> driver. If Annie has tested this patch and it works as expected I think
>>> it's fine.
>>>
>> In XenServer we have seen a number of cases where unplugging and
>> replugging VIFs results in leakage of grant references, eventually
>> leading to a case where you cannot plug a VIF (after ~ 400 such
>> cycles)...
>>
>> It's worth pointing out, as far as this patch is concerned, that
>> gnttab_end_foreign_access() can fail,
>
> Just like what Wei mentioned, it is gnttab_end_foreign_access_ref here,
> right?
Yes, sorry - I forgot to type the _ref part!
>
>> which is not taken into account here.
>
> Good point, gnttab_end_foreign_access_ref fails for grant which is in use.
>
> Thanks
> Annie
>>
>> Andrew.
>>
>>> I'm not netfront maintainer but I'm happy to add
>>> Acked-by: Wei Liu <wei.liu2@citrix.com>
>>> if Annie confirms she's tested this patch.
>>>
>>> Wei.
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xen.org
>>> http://lists.xen.org/xen-devel
>>>
>>
>

^ permalink raw reply

* [PATCH] dm9000: fix a lot of checkpatch issues
From: Barry Song @ 2014-01-15 15:31 UTC (permalink / raw)
  To: davem, nikita, grinberg, jg1.han, mugunthanvnm
  Cc: netdev, workgroup.linux, Barry Song

From: Barry Song <Baohua.Song@csr.com>

recently, dm9000 codes have many checkpatch errors and warnings:

WARNING: please, no space before tabs
3: FILE: dm9000.c:3:
+ * ^ICopyright (C) 1997  Sten Wang$

WARNING: please, no space before tabs
5: FILE: dm9000.c:5:
+ * ^IThis program is free software; you can redistribute it and/or$

WARNING: please, no space before tabs
6: FILE: dm9000.c:6:
+ * ^Imodify it under the terms of the GNU General Public License$

WARNING: please, no space before tabs
7: FILE: dm9000.c:7:
+ * ^Ias published by the Free Software Foundation; either version 2$

WARNING: please, no space before tabs
8: FILE: dm9000.c:8:
+ * ^Iof the License, or (at your option) any later version.$

WARNING: please, no space before tabs
10: FILE: dm9000.c:10:
+ * ^IThis program is distributed in the hope that it will be useful,$

WARNING: please, no space before tabs
11: FILE: dm9000.c:11:
+ * ^Ibut WITHOUT ANY WARRANTY; without even the implied warranty of$

WARNING: please, no space before tabs
12: FILE: dm9000.c:12:
+ * ^IMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the$

WARNING: please, no space before tabs
13: FILE: dm9000.c:13:
+ * ^IGNU General Public License for more details.$

WARNING: do not add new typedefs
97: FILE: dm9000.c:97:
+typedef struct board_info {

ERROR: spaces prohibited around that ':' (ctx:WxV)
113: FILE: dm9000.c:113:
+	unsigned int	in_suspend :1;
 	            	           ^

ERROR: spaces prohibited around that ':' (ctx:WxV)
114: FILE: dm9000.c:114:
+	unsigned int	wake_supported :1;
 	            	               ^

This patch fixes important errors in it.

Signed-off-by: Barry Song <Baohua.Song@csr.com>
---
 drivers/net/ethernet/davicom/dm9000.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c
index 7080ad6..878b748 100644
--- a/drivers/net/ethernet/davicom/dm9000.c
+++ b/drivers/net/ethernet/davicom/dm9000.c
@@ -110,8 +110,8 @@ typedef struct board_info {
 	u8		imr_all;
 
 	unsigned int	flags;
-	unsigned int	in_suspend :1;
-	unsigned int	wake_supported :1;
+	unsigned int	in_suspend:1;
+	unsigned int	wake_supported:1;
 
 	enum dm9000_type type;
 
@@ -162,7 +162,7 @@ static inline board_info_t *to_dm9000_board(struct net_device *dev)
  *   Read a byte from I/O port
  */
 static u8
-ior(board_info_t * db, int reg)
+ior(board_info_t *db, int reg)
 {
 	writeb(reg, db->io_addr);
 	return readb(db->io_data);
@@ -173,7 +173,7 @@ ior(board_info_t * db, int reg)
  */
 
 static void
-iow(board_info_t * db, int reg, int value)
+iow(board_info_t *db, int reg, int value)
 {
 	writeb(reg, db->io_addr);
 	writeb(value, db->io_data);
@@ -745,9 +745,9 @@ static const struct ethtool_ops dm9000_ethtool_ops = {
 	.get_link		= dm9000_get_link,
 	.get_wol		= dm9000_get_wol,
 	.set_wol		= dm9000_set_wol,
- 	.get_eeprom_len		= dm9000_get_eeprom_len,
- 	.get_eeprom		= dm9000_get_eeprom,
- 	.set_eeprom		= dm9000_set_eeprom,
+	.get_eeprom_len		= dm9000_get_eeprom_len,
+	.get_eeprom		= dm9000_get_eeprom,
+	.set_eeprom		= dm9000_set_eeprom,
 };
 
 static void dm9000_show_carrier(board_info_t *db,
@@ -795,7 +795,7 @@ dm9000_poll_work(struct work_struct *w)
 		}
 	} else
 		mii_check_media(&db->mii, netif_msg_link(db), 0);
-	
+
 	if (netif_running(ndev))
 		dm9000_schedule_poll(db);
 }
@@ -1252,12 +1252,11 @@ static irqreturn_t dm9000_wol_interrupt(int irq, void *dev_id)
 			dev_info(db->dev, "wake by link status change\n");
 		if (wcr & WCR_SAMPLEST)
 			dev_info(db->dev, "wake by sample packet\n");
-		if (wcr & WCR_MAGICST )
+		if (wcr & WCR_MAGICST)
 			dev_info(db->dev, "wake by magic packet\n");
 		if (!(wcr & (WCR_LINKST | WCR_SAMPLEST | WCR_MAGICST)))
 			dev_err(db->dev, "wake signalled with no reason? "
 				"NSR=0x%02x, WSR=0x%02x\n", nsr, wcr);
-
 	}
 
 	spin_unlock_irqrestore(&db->lock, flags);
@@ -1314,7 +1313,7 @@ dm9000_open(struct net_device *dev)
 
 	mii_check_media(&db->mii, netif_msg_link(db), 1);
 	netif_start_queue(dev);
-	
+
 	dm9000_schedule_poll(db);
 
 	return 0;
@@ -1628,7 +1627,7 @@ dm9000_probe(struct platform_device *pdev)
 
 	if (!is_valid_ether_addr(ndev->dev_addr)) {
 		/* try reading from mac */
-		
+
 		mac_src = "chip";
 		for (i = 0; i < 6; i++)
 			ndev->dev_addr[i] = ior(db, i+DM9000_PAR);
-- 
1.7.5.4

^ permalink raw reply related

* Re: [PATCH net] bpf: do not use reciprocal divide
From: Martin Schwidefsky @ 2014-01-15 15:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Heiko Carstens, Hannes Frederic Sowa, netdev, dborkman,
	darkjames-ws, Mircea Gherzan, Russell King, Matt Evans
In-Reply-To: <1389795716.31367.333.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, 15 Jan 2014 06:21:56 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2014-01-15 at 11:51 +0100, Heiko Carstens wrote:
> > On Wed, Jan 15, 2014 at 09:13:22AM +0100, Martin Schwidefsky wrote:
> > > On Wed, 15 Jan 2014 09:00:07 +0100
> > > Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > 
> > > > On Tue, Jan 14, 2014 at 11:02:41PM -0800, Eric Dumazet wrote:
> > > > > diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
> > > > > index 16871da37371..e349dc7d0992 100644
> > > > > --- a/arch/s390/net/bpf_jit_comp.c
> > > > > +++ b/arch/s390/net/bpf_jit_comp.c
> > > > > @@ -371,11 +371,11 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
> > > > >  		/* dr %r4,%r12 */
> > > > >  		EMIT2(0x1d4c);
> > > > >  		break;
> > > > > -	case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K) */
> > > > > -		/* m %r4,<d(K)>(%r13) */
> > > > > -		EMIT4_DISP(0x5c40d000, EMIT_CONST(K));
> > > > > -		/* lr %r5,%r4 */
> > > > > -		EMIT2(0x1854);
> > > > > +	case BPF_S_ALU_DIV_K: /* A /= K */
> > > > > +		/* lhi %r4,0 */
> > > > > +		EMIT4(0xa7480000);
> > > > > +		/* d %r4,<d(K)>(%r13) */
> > > > > +		EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
> > > > >  		break;
> > > > 
> > > > The s390 part looks good.
> > > 
> > > Does it? The divide instruction is signed, for the special
> > > case of K==1 this can now cause an exception if the quotient
> > > gets too large. We should add a check for K==1 and do nothing
> > > in this case. With a divisor of at least 2 the result will
> > > stay in the limit.
> > 
> > Indeed. That's quite subtle.
> 
> net/core/filter.c does :
> 
> A /= K;
> 
> Why is this working in generic code (if K == 1), not in s390 one ?

The C compiler naturally can to a u32/u32 division, it either uses
the "dlr" instruction which is unsigned, or uses a call to a function
to do the u32/u32 math. See the lovely code in arch/s390/lib/div64.c
for the kernel version of that code.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply

* [PATCH net-next v3 3/3] packet: use percpu mmap tx frame pending refcount
From: Daniel Borkmann @ 2014-01-15 15:25 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1389799536-23750-1-git-send-email-dborkman@redhat.com>

In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
and one atomic_dec() call in skb destructor and use a percpu
reference count instead in order to determine if packets are
still pending to be sent out. Micro-benchmark with [1] that has
been slightly modified (that is, protcol = 0 in socket(2) and
bind(2)), example on a rather crappy testing machine; I expect
it to scale and have even better results on bigger machines:

./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

With patch:    4,022,015 cyc
Without patch: 4,812,994 cyc

time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

With patch:
  real         1m32.241s
  user         0m0.287s
  sys          1m29.316s

Without patch:
  real         1m38.386s
  user         0m0.265s
  sys          1m35.572s

In function tpacket_snd(), it is okay to use packet_read_pending()
since in fast-path we short-circuit the condition already with
ph != NULL, since we have next frames to process. In case we have
MSG_DONTWAIT, we also do not execute this path as need_wait is
false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
okay to call a packet_read_pending(), because when we ever reach
that path, we're done processing outgoing frames anyway and only
look if there are skbs still outstanding to be orphaned. We can
stay lockless in this percpu counter since it's acceptable when we
reach this path for the sum to be imprecise first, but we'll level
out at 0 after all pending frames have reached the skb destructor
eventually through tx reclaim. When people pin a tx process to
particular CPUs, we expect overflows to happen in the reference
counter as on one CPU we expect heavy increase; and distributed
through ksoftirqd on all CPUs a decrease, for example. As
David Laight points out, since the C language doesn't define the
result of signed int overflow (i.e. rather than wrap, it is
allowed to saturate as a possible outcome), we have to use
unsigned int as reference count. The sum over all CPUs when tx
is complete will result in 0 again.

The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
can _only_ be set from inside tpacket_snd() path and we made sure
to increase tx_ring.pending in any case before we called po->xmit(skb).
So testing for tx_ring.pending == 0 is not too useful. Instead, it
would rather have been useful to test if lower layers didn't orphan
the skb so that we're missing ring slots being put back to
TP_STATUS_AVAILABLE. But such a bug will be caught in user space
already as we end up realizing that we do not have any
TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.

Btw, in case of RX_RING path, we do not make use of the pending
member, therefore we also don't need to use up any percpu memory
here. Also note that __alloc_percpu() already returns a zero-filled
percpu area, so initialization is done already.

  [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/packet/af_packet.c | 66 +++++++++++++++++++++++++++++++++++++++++++++-----
 net/packet/diag.c      |  1 +
 net/packet/internal.h  |  2 +-
 3 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d5495d8..12f2f72 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -89,6 +89,7 @@
 #include <linux/errqueue.h>
 #include <linux/net_tstamp.h>
 #include <linux/reciprocal_div.h>
+#include <linux/percpu.h>
 #ifdef CONFIG_INET
 #include <net/inet_common.h>
 #endif
@@ -1168,6 +1169,47 @@ static void packet_increment_head(struct packet_ring_buffer *buff)
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
 }

+static void packet_inc_pending(struct packet_ring_buffer *rb)
+{
+	this_cpu_inc(*rb->pending_refcnt);
+}
+
+static void packet_dec_pending(struct packet_ring_buffer *rb)
+{
+	this_cpu_dec(*rb->pending_refcnt);
+}
+
+static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
+{
+	unsigned int refcnt = 0;
+	int cpu;
+
+	/* We don't use pending refcount in rx_ring. */
+	if (rb->pending_refcnt == NULL)
+		return 0;
+
+	for_each_possible_cpu(cpu)
+		refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);
+
+	return refcnt;
+}
+
+static int packet_alloc_pending(struct packet_sock *po)
+{
+	po->rx_ring.pending_refcnt = NULL;
+
+	po->tx_ring.pending_refcnt = alloc_percpu(unsigned int);
+	if (unlikely(po->tx_ring.pending_refcnt == NULL))
+		return -ENOBUFS;
+
+	return 0;
+}
+
+static void packet_free_pending(struct packet_sock *po)
+{
+	free_percpu(po->tx_ring.pending_refcnt);
+}
+
 static bool packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 {
 	struct sock *sk = &po->sk;
@@ -2014,8 +2056,7 @@ static void tpacket_destruct_skb(struct sk_buff *skb)
 		__u32 ts;

 		ph = skb_shinfo(skb)->destructor_arg;
-		BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
-		atomic_dec(&po->tx_ring.pending);
+		packet_dec_pending(&po->tx_ring);

 		ts = __packet_set_timestamp(po, ph, skb);
 		__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
@@ -2236,7 +2277,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		skb_set_queue_mapping(skb, packet_pick_tx_queue(dev));
 		skb->destructor = tpacket_destruct_skb;
 		__packet_set_status(po, ph, TP_STATUS_SENDING);
-		atomic_inc(&po->tx_ring.pending);
+		packet_inc_pending(&po->tx_ring);

 		status = TP_STATUS_SEND_REQUEST;
 		err = po->xmit(skb);
@@ -2256,8 +2297,14 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
-	} while (likely((ph != NULL) || (need_wait &&
-					 atomic_read(&po->tx_ring.pending))));
+	} while (likely((ph != NULL) ||
+		/* Note: packet_read_pending() might be slow if we have
+		 * to call it as it's per_cpu variable, but in fast-path
+		 * we already short-circuit the loop with the first
+		 * condition, and luckily don't have to go that path
+		 * anyway.
+		 */
+		 (need_wait && packet_read_pending(&po->tx_ring))));

 	err = len_sum;
 	goto out_put;
@@ -2556,6 +2603,7 @@ static int packet_release(struct socket *sock)
 	/* Purge queues */

 	skb_queue_purge(&sk->sk_receive_queue);
+	packet_free_pending(po);
 	sk_refcnt_debug_release(sk);

 	sock_put(sk);
@@ -2717,6 +2765,10 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	po->num = proto;
 	po->xmit = dev_queue_xmit;

+	err = packet_alloc_pending(po);
+	if (err)
+		goto out2;
+
 	packet_cached_dev_reset(po);

 	sk->sk_destruct = packet_sock_destruct;
@@ -2749,6 +2801,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	preempt_enable();

 	return 0;
+out2:
+	sk_free(sk);
 out:
 	return err;
 }
@@ -3676,7 +3730,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	if (!closing) {
 		if (atomic_read(&po->mapped))
 			goto out;
-		if (atomic_read(&rb->pending))
+		if (packet_read_pending(rb))
 			goto out;
 	}

diff --git a/net/packet/diag.c b/net/packet/diag.c
index a9584a2..533ce4f 100644
--- a/net/packet/diag.c
+++ b/net/packet/diag.c
@@ -3,6 +3,7 @@
 #include <linux/net.h>
 #include <linux/netdevice.h>
 #include <linux/packet_diag.h>
+#include <linux/percpu.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>

diff --git a/net/packet/internal.h b/net/packet/internal.h
index 0a87d7b..eb9580a 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -64,7 +64,7 @@ struct packet_ring_buffer {
 	unsigned int		pg_vec_pages;
 	unsigned int		pg_vec_len;

-	atomic_t		pending;
+	unsigned int __percpu	*pending_refcnt;

 	struct tpacket_kbdq_core	prb_bdqc;
 };
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 2/3] packet: don't unconditionally schedule() in case of MSG_DONTWAIT
From: Daniel Borkmann @ 2014-01-15 15:25 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1389799536-23750-1-git-send-email-dborkman@redhat.com>

In tpacket_snd(), when we've discovered a first frame that is
not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
we exit the send routine in case of MSG_DONTWAIT, since we've
finished traversing the mmaped send ring buffer and don't care
about pending frames.

While doing so, we still unconditionally call an expensive
schedule() in the packet_current_frame() "error" path, which
is unnecessary in this case since it's enough to just quit
the function.

Also, in case MSG_DONTWAIT is not set, we should rather test
for need_resched() first and do schedule() only if necessary
since meanwhile pending frames could already have finished
processing and called skb destructor.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/packet/af_packet.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 85bb38c..d5495d8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2156,6 +2156,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int err, reserve = 0;
 	void *ph;
 	struct sockaddr_ll *saddr = (struct sockaddr_ll *)msg->msg_name;
+	bool need_wait = !(msg->msg_flags & MSG_DONTWAIT);
 	int tp_len, size_max;
 	unsigned char *addr;
 	int len_sum = 0;
@@ -2198,10 +2199,10 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 	do {
 		ph = packet_current_frame(po, &po->tx_ring,
-				TP_STATUS_SEND_REQUEST);
-
+					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
-			schedule();
+			if (need_wait && need_resched())
+				schedule();
 			continue;
 		}
 
@@ -2255,10 +2256,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
-	} while (likely((ph != NULL) ||
-			((!(msg->msg_flags & MSG_DONTWAIT)) &&
-			 (atomic_read(&po->tx_ring.pending))))
-		);
+	} while (likely((ph != NULL) || (need_wait &&
+					 atomic_read(&po->tx_ring.pending))));
 
 	err = len_sum;
 	goto out_put;
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 1/3] packet: improve socket create/bind latency in some cases
From: Daniel Borkmann @ 2014-01-15 15:25 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1389799536-23750-1-git-send-email-dborkman@redhat.com>

Most people acquire PF_PACKET sockets with a protocol argument in
the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
all its sockets. Most likely, at some point in time a subsequent
bind() call will follow, e.g. in libpcap with ...

  memset(&sll, 0, sizeof(sll));
  sll.sll_family          = AF_PACKET;
  sll.sll_ifindex         = ifindex;
  sll.sll_protocol        = htons(ETH_P_ALL);

... as arguments. What happens in the kernel is that already
in socket() syscall, we install a proto hook via register_prot_hook()
if our protocol argument is != 0. Yet, in bind() we're almost
doing the same work by doing a unregister_prot_hook() with an
expensive synchronize_net() call in case during socket() the proto
was != 0, plus follow-up register_prot_hook() with a bound device
to it this time, in order to limit traffic we get.

In the case when the protocol and user supplied device index (== 0)
does not change from socket() to bind(), we can spare us doing
the same work twice. Similarly for re-binding to the same device
and protocol. For these scenarios, we can decrease create/bind
latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
with this patch.

Alternatively, for the first case, if people care, they should
simply create their sockets with proto == 0 argument and define
the protocol during bind() as this saves a call to synchronize_net()
as well (sock-bind-3 case).

In all other cases, we're tied to user space behaviour we must not
change, also since a bind() is not strictly required. Thus, we need
the synchronize_net() to make sure no asynchronous packet processing
paths still refer to the previous elements of po->prot_hook.

In case of mmap()ed sockets, the workflow that includes bind() is
socket() -> setsockopt(<ring>) -> bind(). In that case, a pair of
{__unregister, register}_prot_hook is being called from setsockopt()
in order to install the new protocol receive handler. Thus, when
we call bind and can skip a re-hook, we have already previously
installed the new handler. For fanout, this is handled different
entirely, so we should be good.

Timings on an i7-3520M machine:

  * sock-bind-1:   89 us
  * sock-bind-2: 7447 us
  * sock-bind-3:   75 us

sock-bind-1:
  socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

sock-bind-2:
  socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

sock-bind-3:
  socket(PF_PACKET, SOCK_RAW, 0) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/packet/af_packet.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 279467b..85bb38c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2567,9 +2567,12 @@ static int packet_release(struct socket *sock)
  *	Attach a packet hook.
  */

-static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protocol)
+static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 proto)
 {
 	struct packet_sock *po = pkt_sk(sk);
+	const struct net_device *dev_curr;
+	__be16 proto_curr;
+	bool need_rehook;

 	if (po->fanout) {
 		if (dev)
@@ -2579,21 +2582,29 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
 	}

 	lock_sock(sk);
-
 	spin_lock(&po->bind_lock);
-	unregister_prot_hook(sk, true);

-	po->num = protocol;
-	po->prot_hook.type = protocol;
-	if (po->prot_hook.dev)
-		dev_put(po->prot_hook.dev);
+	proto_curr = po->prot_hook.type;
+	dev_curr = po->prot_hook.dev;
+
+	need_rehook = proto_curr != proto || dev_curr != dev;
+
+	if (need_rehook) {
+		unregister_prot_hook(sk, true);

-	po->prot_hook.dev = dev;
-	po->ifindex = dev ? dev->ifindex : 0;
+		po->num = proto;
+		po->prot_hook.type = proto;
+
+		if (po->prot_hook.dev)
+			dev_put(po->prot_hook.dev);

-	packet_cached_dev_assign(po, dev);
+		po->prot_hook.dev = dev;
+
+		po->ifindex = dev ? dev->ifindex : 0;
+		packet_cached_dev_assign(po, dev);
+	}

-	if (protocol == 0)
+	if (proto == 0 || !need_rehook)
 		goto out_unlock;

 	if (!dev || (dev->flags & IFF_UP)) {
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 0/3] pf_packet updates
From: Daniel Borkmann @ 2014-01-15 15:25 UTC (permalink / raw)
  To: davem; +Cc: netdev

Changelog:

v1->v2:
 - put assignments under bind lock in patch 1
 - added 2 more relevant patches

v2->v3:
 - made refcnt unsigned in patch 3
 - rest unchanged

Thanks !

Daniel Borkmann (3):
  packet: improve socket create/bind latency in some cases
  packet: don't unconditionally schedule() in case of MSG_DONTWAIT
  packet: use percpu mmap tx frame pending refcount

 net/packet/af_packet.c | 106 +++++++++++++++++++++++++++++++++++++++----------
 net/packet/diag.c      |   1 +
 net/packet/internal.h  |   2 +-
 3 files changed, 87 insertions(+), 22 deletions(-)

-- 
1.7.11.7

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next] xen-netfront: add support for IPv6 offloads
From: Andrew Cooper @ 2014-01-15 15:25 UTC (permalink / raw)
  To: Paul Durrant; +Cc: netdev, xen-devel, Boris Ostrovsky, David Vrabel
In-Reply-To: <1389799111-8372-1-git-send-email-paul.durrant@citrix.com>

On 15/01/14 15:18, Paul Durrant wrote:
> This patch adds support for IPv6 checksum offload and GSO when those
> features are available in the backend.
>
> Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> ---
>  drivers/net/xen-netfront.c |   48 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 43 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index c41537b..321759f 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -617,7 +617,9 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  		tx->flags |= XEN_NETTXF_extra_info;
>  
>  		gso->u.gso.size = skb_shinfo(skb)->gso_size;
> -		gso->u.gso.type = XEN_NETIF_GSO_TYPE_TCPV4;
> +		gso->u.gso.type = (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) ?
> +			XEN_NETIF_GSO_TYPE_TCPV6 :
> +			XEN_NETIF_GSO_TYPE_TCPV4;
>  		gso->u.gso.pad = 0;
>  		gso->u.gso.features = 0;
>  
> @@ -809,15 +811,18 @@ static int xennet_set_skb_gso(struct sk_buff *skb,
>  		return -EINVAL;
>  	}
>  
> -	/* Currently only TCPv4 S.O. is supported. */
> -	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4) {
> +	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4 &&
> +	    gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV6) {
>  		if (net_ratelimit())
>  			pr_warn("Bad GSO type %d\n", gso->u.gso.type);
>  		return -EINVAL;
>  	}
>  
>  	skb_shinfo(skb)->gso_size = gso->u.gso.size;
> -	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
> +	skb_shinfo(skb)->gso_type =
> +		(gso->u.gso.type == XEN_NETIF_GSO_TYPE_TCPV4) ?
> +		SKB_GSO_TCPV4 :
> +		SKB_GSO_TCPV6;
>  
>  	/* Header must be checked, and gso_segs computed. */
>  	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> @@ -1191,6 +1196,15 @@ static netdev_features_t xennet_fix_features(struct net_device *dev,
>  			features &= ~NETIF_F_SG;
>  	}
>  
> +	if (features & NETIF_F_IPV6_CSUM) {
> +		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
> +				 "feature-ipv6-csum-offload", "%d", &val) < 0)
> +			val = 0;
> +
> +		if (!val)
> +			features &= ~NETIF_F_IPV6_CSUM;
> +	}
> +
>  	if (features & NETIF_F_TSO) {
>  		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
>  				 "feature-gso-tcpv4", "%d", &val) < 0)
> @@ -1200,6 +1214,15 @@ static netdev_features_t xennet_fix_features(struct net_device *dev,
>  			features &= ~NETIF_F_TSO;
>  	}
>  
> +	if (features & NETIF_F_TSO6) {
> +		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
> +				 "feature-gso-tcpv6", "%d", &val) < 0)
> +			val = 0;
> +
> +		if (!val)
> +			features &= ~NETIF_F_TSO6;
> +	}
> +
>  	return features;
>  }
>  
> @@ -1338,7 +1361,9 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
>  	netif_napi_add(netdev, &np->napi, xennet_poll, 64);
>  	netdev->features        = NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
>  				  NETIF_F_GSO_ROBUST;
> -	netdev->hw_features	= NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_TSO;
> +	netdev->hw_features	= NETIF_F_SG |
> +				  NETIF_F_IPV6_CSUM |
> +				  NETIF_F_TSO | NETIF_F_TSO6;
>  
>  	/*
>           * Assume that all hw features are available for now. This set
> @@ -1716,6 +1741,19 @@ again:
>  		goto abort_transaction;
>  	}
>  
> +	err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv6", "%d", 1);

"%d", 1 results in a constant string.  xenbus_write() would avoid a
transitory memory allocation.

~Andrew

> +	if (err) {
> +		message = "writing feature-gso-tcpv6";
> +		goto abort_transaction;
> +	}
> +
> +	err = xenbus_printf(xbt, dev->nodename, "feature-ipv6-csum-offload",
> +			    "%d", 1);
> +	if (err) {
> +		message = "writing feature-ipv6-csum-offload";
> +		goto abort_transaction;
> +	}
> +
>  	err = xenbus_transaction_end(xbt, 0);
>  	if (err) {
>  		if (err == -EAGAIN)

^ permalink raw reply

* [PATCH net-next] xen-netfront: add support for IPv6 offloads
From: Paul Durrant @ 2014-01-15 15:18 UTC (permalink / raw)
  To: netdev, xen-devel
  Cc: Paul Durrant, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	David Vrabel

This patch adds support for IPv6 checksum offload and GSO when those
features are available in the backend.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
---
 drivers/net/xen-netfront.c |   48 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index c41537b..321759f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -617,7 +617,9 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		tx->flags |= XEN_NETTXF_extra_info;
 
 		gso->u.gso.size = skb_shinfo(skb)->gso_size;
-		gso->u.gso.type = XEN_NETIF_GSO_TYPE_TCPV4;
+		gso->u.gso.type = (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) ?
+			XEN_NETIF_GSO_TYPE_TCPV6 :
+			XEN_NETIF_GSO_TYPE_TCPV4;
 		gso->u.gso.pad = 0;
 		gso->u.gso.features = 0;
 
@@ -809,15 +811,18 @@ static int xennet_set_skb_gso(struct sk_buff *skb,
 		return -EINVAL;
 	}
 
-	/* Currently only TCPv4 S.O. is supported. */
-	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4) {
+	if (gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV4 &&
+	    gso->u.gso.type != XEN_NETIF_GSO_TYPE_TCPV6) {
 		if (net_ratelimit())
 			pr_warn("Bad GSO type %d\n", gso->u.gso.type);
 		return -EINVAL;
 	}
 
 	skb_shinfo(skb)->gso_size = gso->u.gso.size;
-	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+	skb_shinfo(skb)->gso_type =
+		(gso->u.gso.type == XEN_NETIF_GSO_TYPE_TCPV4) ?
+		SKB_GSO_TCPV4 :
+		SKB_GSO_TCPV6;
 
 	/* Header must be checked, and gso_segs computed. */
 	skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
@@ -1191,6 +1196,15 @@ static netdev_features_t xennet_fix_features(struct net_device *dev,
 			features &= ~NETIF_F_SG;
 	}
 
+	if (features & NETIF_F_IPV6_CSUM) {
+		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+				 "feature-ipv6-csum-offload", "%d", &val) < 0)
+			val = 0;
+
+		if (!val)
+			features &= ~NETIF_F_IPV6_CSUM;
+	}
+
 	if (features & NETIF_F_TSO) {
 		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
 				 "feature-gso-tcpv4", "%d", &val) < 0)
@@ -1200,6 +1214,15 @@ static netdev_features_t xennet_fix_features(struct net_device *dev,
 			features &= ~NETIF_F_TSO;
 	}
 
+	if (features & NETIF_F_TSO6) {
+		if (xenbus_scanf(XBT_NIL, np->xbdev->otherend,
+				 "feature-gso-tcpv6", "%d", &val) < 0)
+			val = 0;
+
+		if (!val)
+			features &= ~NETIF_F_TSO6;
+	}
+
 	return features;
 }
 
@@ -1338,7 +1361,9 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
 	netif_napi_add(netdev, &np->napi, xennet_poll, 64);
 	netdev->features        = NETIF_F_IP_CSUM | NETIF_F_RXCSUM |
 				  NETIF_F_GSO_ROBUST;
-	netdev->hw_features	= NETIF_F_IP_CSUM | NETIF_F_SG | NETIF_F_TSO;
+	netdev->hw_features	= NETIF_F_SG |
+				  NETIF_F_IPV6_CSUM |
+				  NETIF_F_TSO | NETIF_F_TSO6;
 
 	/*
          * Assume that all hw features are available for now. This set
@@ -1716,6 +1741,19 @@ again:
 		goto abort_transaction;
 	}
 
+	err = xenbus_printf(xbt, dev->nodename, "feature-gso-tcpv6", "%d", 1);
+	if (err) {
+		message = "writing feature-gso-tcpv6";
+		goto abort_transaction;
+	}
+
+	err = xenbus_printf(xbt, dev->nodename, "feature-ipv6-csum-offload",
+			    "%d", 1);
+	if (err) {
+		message = "writing feature-ipv6-csum-offload";
+		goto abort_transaction;
+	}
+
 	err = xenbus_transaction_end(xbt, 0);
 	if (err) {
 		if (err == -EAGAIN)
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH net-next v4 1/9] xen-netback: Introduce TX grant map definitions
From: Zoltan Kiss @ 2014-01-15 15:16 UTC (permalink / raw)
  To: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies
  Cc: Zoltan Kiss
In-Reply-To: <1389731995-9887-2-git-send-email-zoltan.kiss@citrix.com>

On 14/01/14 20:39, Zoltan Kiss wrote:
> @@ -1677,6 +1793,31 @@ static void xenvif_idx_release(struct xenvif *vif, u16 pending_idx,
>   	vif->mmap_pages[pending_idx] = NULL;
>   }
>
> +void xenvif_idx_unmap(struct xenvif *vif, u16 pending_idx)
> +{
> +	int ret;
> +	struct gnttab_unmap_grant_ref tx_unmap_op;
> +
> +	if (vif->grant_tx_handle[pending_idx] == NETBACK_INVALID_HANDLE) {
> +		netdev_err(vif->dev,
> +			   "Trying to unmap invalid handle! pending_idx: %x\n",
> +			   pending_idx);
> +		return;
> +	}
> +	gnttab_set_unmap_op(&tx_unmap_op,
> +			    idx_to_kaddr(vif, pending_idx),
> +			    GNTMAP_host_map,
> +			    vif->grant_tx_handle[pending_idx]);
> +	ret = gnttab_unmap_refs(&tx_unmap_op,
> +				&vif->mmap_pages[pending_idx],
> +				1);
> +
> +	ret = HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref,
> +					&tx_unmap_op,
> +					1);
> +	BUG_ON(ret);
> +	vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
> +}

Awkward mistake, I forgot to delete the hypercall ... Even more 
interesting, it caused troubles only very rarely ...

Zoli

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next] xen-netfront: clean up code in xennet_release_rx_bufs
From: David Vrabel @ 2014-01-15 15:13 UTC (permalink / raw)
  To: annie li; +Cc: Wei Liu, xen-devel, netdev, ian.campbell
In-Reply-To: <52D69864.9030207@oracle.com>

On 15/01/14 14:17, annie li wrote:
> 
> I am thinking of two ways, and they can be implemented in new patches.
> 1. If gnttab_end_foreign_access_ref succeeds, then kfree_skb is called
> to free skb. Otherwise, using gnttab_end_foreign_access to release ref
> and pages.
> 2. Add a similar deferred way of gnttab_end_foreign_access in
> gnttab_end_foreign_access_ref.

Something like the following (untested!) patch is what I'm thinking of.

Fixups to existing drivers (except netfront) are included but may not be
quite correct.

8<----------
>From 76c254c8e020f4427e8f37c867622f0bfd5ac85f Mon Sep 17 00:00:00 2001
From: David Vrabel <david.vrabel@citrix.com>
Date: Wed, 15 Jan 2014 15:04:52 +0000
Subject: [PATCH] HACK! xen/gnttab: make gnttab_end_foreign_access() more
useful

This is UNTESTED and is an example of the sort of change I'm looking
for.

Freeing the page in gnttab_end_foreign_access() means it cannot be
used where the pages are freed in some other way (e.g., as part of a
kfree_skb()).

Leave the freeing of the page to the caller.  If the page still has
foreign users, take an additional reference before adding it to the
deferred list.

Hack all the users of the call to do something resembling the right
thing.  Not quite sure on the blkfront changes.
---
 drivers/block/xen-blkfront.c    |   22 +++++++++++++---------
 drivers/char/tpm/xen-tpmfront.c |    3 +--
 drivers/pci/xen-pcifront.c      |    3 +--
 drivers/xen/grant-table.c       |   19 +++++++++++--------
 include/xen/grant_table.h       |   11 ++++++-----
 5 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index aa846a4..a586496 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -504,7 +504,7 @@ static void gnttab_handle_deferred(unsigned long unused)
 			if (entry->page) {
 				pr_debug("freeing g.e. %#x (pfn %#lx)\n",
 					 entry->ref, page_to_pfn(entry->page));
-				__free_page(entry->page);
+				put_page(entry->page);
 			} else
 				pr_info("freeing g.e. %#x\n", entry->ref);
 			kfree(entry);
@@ -555,15 +555,18 @@ static void gnttab_add_deferred(grant_ref_t ref,
bool readonly,
 }

 void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
-			       unsigned long page)
+			       struct page *page)
 {
-	if (gnttab_end_foreign_access_ref(ref, readonly)) {
+	if (gnttab_end_foreign_access_ref(ref, readonly))
 		put_free_entry(ref);
-		if (page != 0)
-			free_page(page);
-	} else
-		gnttab_add_deferred(ref, readonly,
-				    page ? virt_to_page(page) : NULL);
+	else {
+		/*
+		 * Take a reference to the page so it's not freed
+		 * while the foreign domain still has access to it.
+		 */
+		get_page(page);
+		gnttab_add_deferred(ref, readonly, page);
+	}
 }
 EXPORT_SYMBOL_GPL(gnttab_end_foreign_access);

diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 694dcaf..ffa3ce6 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -91,13 +91,14 @@ bool gnttab_trans_grants_available(void);
 int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly);

 /*
- * Eventually end access through the given grant reference, and once that
- * access has been ended, free the given page too.  Access will be ended
- * immediately iff the grant entry is not in use, otherwise it will happen
- * some time later.  page may be 0, in which case no freeing will occur.
+ * Eventually end access through the given grant reference, if the
+ * page is still in use an additional reference is taken and released
+ * when access has ended.  Access will be ended immediately iff the
+ * grant entry is not in use, otherwise it will happen some time
+ * later.
  */
 void gnttab_end_foreign_access(grant_ref_t ref, int readonly,
-			       unsigned long page);
+			       struct page *page);

 int gnttab_grant_foreign_transfer(domid_t domid, unsigned long pfn);

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index c4a4c90..45a2a01 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -931,14 +931,16 @@ static void blkif_free(struct blkfront_info *info,
int suspend)
 	if (!list_empty(&info->grants)) {
 		list_for_each_entry_safe(persistent_gnt, n,
 		                         &info->grants, node) {
+			struct page *page = pfn_to_page(persistent_gnt->pfn);
+
 			list_del(&persistent_gnt->node);
 			if (persistent_gnt->gref != GRANT_INVALID_REF) {
 				gnttab_end_foreign_access(persistent_gnt->gref,
-				                          0, 0UL);
+				                          0, page);
 				info->persistent_gnts_c--;
 			}
 			if (info->feature_persistent)
-				__free_page(pfn_to_page(persistent_gnt->pfn));
+				__free_page(page);
 			kfree(persistent_gnt);
 		}
 	}
@@ -970,10 +972,13 @@ static void blkif_free(struct blkfront_info *info,
int suspend)
 		       info->shadow[i].req.u.indirect.nr_segments :
 		       info->shadow[i].req.u.rw.nr_segments;
 		for (j = 0; j < segs; j++) {
+			struct page *page;
+
 			persistent_gnt = info->shadow[i].grants_used[j];
-			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+			page = pfn_to_page(persistent_gnt->pfn);
+			gnttab_end_foreign_access(persistent_gnt->gref, 0, page);
 			if (info->feature_persistent)
-				__free_page(pfn_to_page(persistent_gnt->pfn));
+				__free_page(page);
 			kfree(persistent_gnt);
 		}

@@ -1010,10 +1015,11 @@ free_shadow:
 	/* Free resources associated with old device channel. */
 	if (info->ring_ref != GRANT_INVALID_REF) {
 		gnttab_end_foreign_access(info->ring_ref, 0,
-					  (unsigned long)info->ring.sring);
+					  virt_to_page(info->ring.sring));
 		info->ring_ref = GRANT_INVALID_REF;
 		info->ring.sring = NULL;
 	}
+	free_page((unsigned long)info->ring.sring);
 	if (info->irq)
 		unbind_from_irqhandler(info->irq, info);
 	info->evtchn = info->irq = 0;
@@ -1053,7 +1059,7 @@ static void blkif_completion(struct blk_shadow *s,
struct blkfront_info *info,
 	}
 	/* Add the persistent grant into the list of free grants */
 	for (i = 0; i < nseg; i++) {
-		if (gnttab_query_foreign_access(s->grants_used[i]->gref)) {
+		if (gnttab_end_foreign_access_ref(s->grants_used[i]->gref, 0)) {
 			/*
 			 * If the grant is still mapped by the backend (the
 			 * backend has chosen to make this grant persistent)
@@ -1072,14 +1078,13 @@ static void blkif_completion(struct blk_shadow
*s, struct blkfront_info *info,
 			 * so it will not be picked again unless we run out of
 			 * persistent grants.
 			 */
-			gnttab_end_foreign_access(s->grants_used[i]->gref, 0, 0UL);
 			s->grants_used[i]->gref = GRANT_INVALID_REF;
 			list_add_tail(&s->grants_used[i]->node, &info->grants);
 		}
 	}
 	if (s->req.operation == BLKIF_OP_INDIRECT) {
 		for (i = 0; i < INDIRECT_GREFS(nseg); i++) {
-			if (gnttab_query_foreign_access(s->indirect_grants[i]->gref)) {
+			if (gnttab_end_foreign_access_ref(s->indirect_grants[i]->gref, 0)) {
 				if (!info->feature_persistent)
 					pr_alert_ratelimited("backed has not unmapped grant: %u\n",
 							     s->indirect_grants[i]->gref);
@@ -1088,7 +1093,6 @@ static void blkif_completion(struct blk_shadow *s,
struct blkfront_info *info,
 			} else {
 				struct page *indirect_page;

-				gnttab_end_foreign_access(s->indirect_grants[i]->gref, 0, 0UL);
 				/*
 				 * Add the used indirect page back to the list of
 				 * available pages for indirect grefs.
diff --git a/drivers/char/tpm/xen-tpmfront.c
b/drivers/char/tpm/xen-tpmfront.c
index c8ff4df..00d1132 100644
--- a/drivers/char/tpm/xen-tpmfront.c
+++ b/drivers/char/tpm/xen-tpmfront.c
@@ -315,8 +315,7 @@ static void ring_free(struct tpm_private *priv)
 	if (priv->ring_ref)
 		gnttab_end_foreign_access(priv->ring_ref, 0,
 				(unsigned long)priv->shr);
-	else
-		free_page((unsigned long)priv->shr);
+	free_page((unsigned long)priv->shr);

 	if (priv->chip && priv->chip->vendor.irq)
 		unbind_from_irqhandler(priv->chip->vendor.irq, priv);
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index f7197a7..253a129 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -759,8 +759,7 @@ static void free_pdev(struct pcifront_device *pdev)
 	if (pdev->gnt_ref != INVALID_GRANT_REF)
 		gnttab_end_foreign_access(pdev->gnt_ref, 0 /* r/w page */,
 					  (unsigned long)pdev->sh_info);
-	else
-		free_page((unsigned long)pdev->sh_info);
+	free_page((unsigned long)pdev->sh_info);

 	dev_set_drvdata(&pdev->xdev->dev, NULL);

^ permalink raw reply related

* Re: [PATCH] net/ipv4: don't use module_init in non-modular gre_offload
From: Eric Dumazet @ 2014-01-15 15:11 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: David S. Miller, netdev, Eric Dumazet
In-Reply-To: <1389797534-15657-1-git-send-email-paul.gortmaker@windriver.com>

On Wed, 2014-01-15 at 09:52 -0500, Paul Gortmaker wrote:
> Recent commit 438e38fadca2f6e57eeecc08326c8a95758594d4
> ("gre_offload: statically build GRE offloading support") added
> new module_init/module_exit calls to the gre_offload.c file.
> 
> The file is obj-y and can't be anything other than built-in.
> Currently it can never be built modular, so using module_init
> as an alias for __initcall can be somewhat misleading.
> 
> Fix this up now, so that we can relocate module_init from
> init.h into module.h in the future.  If we don't do this, we'd
> have to add module.h to obviously non-modular code, and that
> would be a worse thing.  We also make the inclusion explicit.
> 
> Note that direct use of __initcall is discouraged, vs. one
> of the priority categorized subgroups.  As __initcall gets
> mapped onto device_initcall, our use of device_initcall
> directly in this change means that the runtime impact is
> zero -- it will remain at level 6 in initcall ordering.
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> 
> diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
> index 29512e3e7e7c..ed968daf2380 100644
> --- a/net/ipv4/gre_offload.c
> +++ b/net/ipv4/gre_offload.c
> @@ -11,6 +11,7 @@
>   */
>  
>  #include <linux/skbuff.h>
> +#include <linux/init.h>
>  #include <net/protocol.h>
>  #include <net/gre.h>
>  
> @@ -283,11 +284,10 @@ static int __init gre_offload_init(void)
>  {
>  	return inet_add_offload(&gre_offload, IPPROTO_GRE);
>  }
> +device_initcall(gre_offload_init);
>  
>  static void __exit gre_offload_exit(void)
>  {
>  	inet_del_offload(&gre_offload, IPPROTO_GRE);
>  }
> -
> -module_init(gre_offload_init);
> -module_exit(gre_offload_exit);
> +__exitcall(gre_offload_exit);

I guess we could remove gre_offload_exit() instead of using
this strange __exitcall()

^ permalink raw reply

* Re: [PATCH v2 net] bpf: do not use reciprocal divide
From: Matt Evans @ 2014-01-15 15:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Heiko Carstens, Martin Schwidefsky, Hannes Frederic Sowa, netdev,
	dborkman, darkjames-ws, Mircea Gherzan, Russell King
In-Reply-To: <1389797407.31367.340.camel@edumazet-glaptop2.roam.corp.google.com>

Hi Eric,

On 2014-01-15 14:50, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> At first Jakub Zawadzki noticed that some divisions by 
> reciprocal_divide
> were not correct. (off by one in some cases)
> http://www.wireshark.org/~darkjames/reciprocal-buggy.c
> 
> He could also show this with BPF:
> http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c
> 
> The reciprocal divide in linux kernel is not generic enough,
> lets remove its use in BPF, as it is not worth the pain with
> current cpus.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
> Cc: Mircea Gherzan <mgherzan@gmail.com>
> Cc: Daniel Borkmann <dxchgb@gmail.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Matt Evans <matt@ozlabs.org>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: David S. Miller <davem@davemloft.net>
> ---
> v2: Fixed sparc code as David kindly suggested
>     Added tests on K being 1 (divide by 1 is a nop
>                               modulo by 1 clears A),
>     as Martin Schwidefsky seems concerned by this case.
> 
> Please review arch code to make sure I made no mistake, thanks !

PPC looks fine; I had a look at the core/ARM parts which also look good.

I'd forgotten where the DIV0 checking occurred, so I also benefited from 
your hint to Heiko. :)


Cheers,

Matt

> 
>  arch/arm/net/bpf_jit_32.c       |    6 +++---
>  arch/powerpc/net/bpf_jit_comp.c |    7 ++++---
>  arch/s390/net/bpf_jit_comp.c    |   17 ++++++++++++-----
>  arch/sparc/net/bpf_jit_comp.c   |   17 ++++++++++++++---
>  arch/x86/net/bpf_jit_comp.c     |   14 ++++++++++----
>  net/core/filter.c               |   30 ++----------------------------
>  6 files changed, 45 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index 9ed155ad0f97..271b5e971568 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -641,10 +641,10 @@ load_ind:
>  			emit(ARM_MUL(r_A, r_A, r_X), ctx);
>  			break;
>  		case BPF_S_ALU_DIV_K:
> -			/* current k == reciprocal_value(userspace k) */
> +			if (k == 1)
> +				break;
>  			emit_mov_i(r_scratch, k, ctx);
> -			/* A = top 32 bits of the product */
> -			emit(ARM_UMULL(r_scratch, r_A, r_A, r_scratch), ctx);
> +			emit_udiv(r_A, r_A, r_scratch, ctx);
>  			break;
>  		case BPF_S_ALU_DIV_X:
>  			update_on_xread(ctx);
> diff --git a/arch/powerpc/net/bpf_jit_comp.c 
> b/arch/powerpc/net/bpf_jit_comp.c
> index ac3c2a10dafd..555034f8505e 100644
> --- a/arch/powerpc/net/bpf_jit_comp.c
> +++ b/arch/powerpc/net/bpf_jit_comp.c
> @@ -223,10 +223,11 @@ static int bpf_jit_build_body(struct sk_filter
> *fp, u32 *image,
>  			}
>  			PPC_DIVWU(r_A, r_A, r_X);
>  			break;
> -		case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
> +		case BPF_S_ALU_DIV_K: /* A /= K */
> +			if (K == 1)
> +				break;
>  			PPC_LI32(r_scratch1, K);
> -			/* Top 32 bits of 64bit result -> A */
> -			PPC_MULHWU(r_A, r_A, r_scratch1);
> +			PPC_DIVWU(r_A, r_A, r_scratch1);
>  			break;
>  		case BPF_S_ALU_AND_X:
>  			ctx->seen |= SEEN_XREG;
> diff --git a/arch/s390/net/bpf_jit_comp.c 
> b/arch/s390/net/bpf_jit_comp.c
> index 16871da37371..fc0fa77728e1 100644
> --- a/arch/s390/net/bpf_jit_comp.c
> +++ b/arch/s390/net/bpf_jit_comp.c
> @@ -371,11 +371,13 @@ static int bpf_jit_insn(struct bpf_jit *jit,
> struct sock_filter *filter,
>  		/* dr %r4,%r12 */
>  		EMIT2(0x1d4c);
>  		break;
> -	case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K) */
> -		/* m %r4,<d(K)>(%r13) */
> -		EMIT4_DISP(0x5c40d000, EMIT_CONST(K));
> -		/* lr %r5,%r4 */
> -		EMIT2(0x1854);
> +	case BPF_S_ALU_DIV_K: /* A /= K */
> +		if (K == 1)
> +			break;
> +		/* lhi %r4,0 */
> +		EMIT4(0xa7480000);
> +		/* d %r4,<d(K)>(%r13) */
> +		EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
>  		break;
>  	case BPF_S_ALU_MOD_X: /* A %= X */
>  		jit->seen |= SEEN_XREG | SEEN_RET0;
> @@ -391,6 +393,11 @@ static int bpf_jit_insn(struct bpf_jit *jit,
> struct sock_filter *filter,
>  		EMIT2(0x1854);
>  		break;
>  	case BPF_S_ALU_MOD_K: /* A %= K */
> +		if (K == 1) {
> +			/* lhi %r5,0 */
> +			EMIT4(0xa7580000);
> +			break;
> +		}
>  		/* lhi %r4,0 */
>  		EMIT4(0xa7480000);
>  		/* d %r4,<d(K)>(%r13) */
> diff --git a/arch/sparc/net/bpf_jit_comp.c 
> b/arch/sparc/net/bpf_jit_comp.c
> index 218b6b23c378..01fe9946d388 100644
> --- a/arch/sparc/net/bpf_jit_comp.c
> +++ b/arch/sparc/net/bpf_jit_comp.c
> @@ -497,9 +497,20 @@ void bpf_jit_compile(struct sk_filter *fp)
>  			case BPF_S_ALU_MUL_K:	/* A *= K */
>  				emit_alu_K(MUL, K);
>  				break;
> -			case BPF_S_ALU_DIV_K:	/* A /= K */
> -				emit_alu_K(MUL, K);
> -				emit_read_y(r_A);
> +			case BPF_S_ALU_DIV_K:	/* A /= K with K != 0*/
> +				if (K == 1)
> +					break;
> +				emit_write_y(G0);
> +#ifdef CONFIG_SPARC32
> +				/* The Sparc v8 architecture requires
> +				 * three instructions between a %y
> +				 * register write and the first use.
> +				 */
> +				emit_nop();
> +				emit_nop();
> +				emit_nop();
> +#endif
> +				emit_alu_K(DIV, K);
>  				break;
>  			case BPF_S_ALU_DIV_X:	/* A /= X; */
>  				emit_cmpi(r_X, 0);
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 26328e800869..4ed75dd81d05 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -359,15 +359,21 @@ void bpf_jit_compile(struct sk_filter *fp)
>  				EMIT2(0x89, 0xd0);	/* mov %edx,%eax */
>  				break;
>  			case BPF_S_ALU_MOD_K: /* A %= K; */
> +				if (K == 1) {
> +					CLEAR_A();
> +					break;
> +				}
>  				EMIT2(0x31, 0xd2);	/* xor %edx,%edx */
>  				EMIT1(0xb9);EMIT(K, 4);	/* mov imm32,%ecx */
>  				EMIT2(0xf7, 0xf1);	/* div %ecx */
>  				EMIT2(0x89, 0xd0);	/* mov %edx,%eax */
>  				break;
> -			case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
> -				EMIT3(0x48, 0x69, 0xc0); /* imul imm32,%rax,%rax */
> -				EMIT(K, 4);
> -				EMIT4(0x48, 0xc1, 0xe8, 0x20); /* shr $0x20,%rax */
> +			case BPF_S_ALU_DIV_K: /* A /= K */
> +				if (K == 1)
> +					break;
> +				EMIT2(0x31, 0xd2);	/* xor %edx,%edx */
> +				EMIT1(0xb9);EMIT(K, 4);	/* mov imm32,%ecx */
> +				EMIT2(0xf7, 0xf1);	/* div %ecx */
>  				break;
>  			case BPF_S_ALU_AND_X:
>  				seen |= SEEN_XREG;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 01b780856db2..ad30d626a5bd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -36,7 +36,6 @@
>  #include <asm/uaccess.h>
>  #include <asm/unaligned.h>
>  #include <linux/filter.h>
> -#include <linux/reciprocal_div.h>
>  #include <linux/ratelimit.h>
>  #include <linux/seccomp.h>
>  #include <linux/if_vlan.h>
> @@ -166,7 +165,7 @@ unsigned int sk_run_filter(const struct sk_buff 
> *skb,
>  			A /= X;
>  			continue;
>  		case BPF_S_ALU_DIV_K:
> -			A = reciprocal_divide(A, K);
> +			A /= K;
>  			continue;
>  		case BPF_S_ALU_MOD_X:
>  			if (X == 0)
> @@ -553,11 +552,6 @@ int sk_chk_filter(struct sock_filter *filter,
> unsigned int flen)
>  		/* Some instructions need special checks */
>  		switch (code) {
>  		case BPF_S_ALU_DIV_K:
> -			/* check for division by zero */
> -			if (ftest->k == 0)
> -				return -EINVAL;
> -			ftest->k = reciprocal_value(ftest->k);
> -			break;
>  		case BPF_S_ALU_MOD_K:
>  			/* check for division by zero */
>  			if (ftest->k == 0)
> @@ -853,27 +847,7 @@ void sk_decode_filter(struct sock_filter *filt,
> struct sock_filter *to)
>  	to->code = decodes[code];
>  	to->jt = filt->jt;
>  	to->jf = filt->jf;
> -
> -	if (code == BPF_S_ALU_DIV_K) {
> -		/*
> -		 * When loaded this rule user gave us X, which was
> -		 * translated into R = r(X). Now we calculate the
> -		 * RR = r(R) and report it back. If next time this
> -		 * value is loaded and RRR = r(RR) is calculated
> -		 * then the R == RRR will be true.
> -		 *
> -		 * One exception. X == 1 translates into R == 0 and
> -		 * we can't calculate RR out of it with r().
> -		 */
> -
> -		if (filt->k == 0)
> -			to->k = 1;
> -		else
> -			to->k = reciprocal_value(filt->k);
> -
> -		BUG_ON(reciprocal_value(to->k) != filt->k);
> -	} else
> -		to->k = filt->k;
> +	to->k = filt->k;
>  }
> 
>  int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
> unsigned int len)

^ permalink raw reply

* Re: [PATCH net] bpf: do not use reciprocal divide
From: Heiko Carstens @ 2014-01-15 15:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Hannes Frederic Sowa, netdev, dborkman, darkjames-ws,
	Mircea Gherzan, Russell King, Matt Evans, Martin Schwidefsky
In-Reply-To: <1389795398.31367.329.camel@edumazet-glaptop2.roam.corp.google.com>

On Wed, Jan 15, 2014 at 06:16:38AM -0800, Eric Dumazet wrote:
> On Wed, 2014-01-15 at 09:00 +0100, Heiko Carstens wrote:
> > Are you sure you want to remove the k == 0 check? Is there something
> > else that would prevent a division by zero?
> 
> This is done by factoring the two cases, modulo and divide :
> 
>  vi +553 net/core/filter.c 
> 
>                 /* Some instructions need special checks */
>                 switch (code) {
>                 case BPF_S_ALU_DIV_K:
>                 case BPF_S_ALU_MOD_K:
>                         /* check for division by zero */
>                         if (ftest->k == 0)
>                                 return -EINVAL;
>                         break;

Oh, sorry. I missed the fallthrough.

^ permalink raw reply

* Re: [PATCH net-next] xen-netback: Rework rx_work_todo
From: Zoltan Kiss @ 2014-01-15 15:08 UTC (permalink / raw)
  To: Wei Liu; +Cc: ian.campbell, xen-devel, netdev, linux-kernel, jonathan.davies
In-Reply-To: <20140115145951.GP5698@zion.uk.xensource.com>

On 15/01/14 14:59, Wei Liu wrote:
> On Wed, Jan 15, 2014 at 02:52:41PM +0000, Zoltan Kiss wrote:
>> On 15/01/14 14:45, Wei Liu wrote:
>>>>>> The recent patch to fix receive side flow control (11b57f) solved the spinning
>>>>>>>>> thread problem, however caused an another one. The receive side can stall, if:
>>>>>>>>> - xenvif_rx_action sets rx_queue_stopped to false
>>>>>>>>> - interrupt happens, and sets rx_event to true
>>>>>>>>> - then xenvif_kthread sets rx_event to false
>>>>>>>>>
>>>>>>>
>>>>>>> If you mean "rx_work_todo" returns false.
>>>>>>>
>>>>>>> In this case
>>>>>>>
>>>>>>> (!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
>>>>>>>
>>>>>>> can still be true, can't it?
>>>>> Sorry, I should wrote rx_queue_stopped to true
>>>>>
>>> In this case, if rx_queue_stopped is true, then we're expecting frontend
>>> to notify us, right?
>>>
>>> rx_queue_stopped is set to true if we cannot make any progress to queue
>>> packet into the ring. In that situation we can expect frontend will send
>>> notification to backend after it goes through the backlog in the ring.
>>> That means rx_event is set to true, and rx_work_todo is true again. So
>>> the ring is actually not stalled in this case as well. Did I miss
>>> something?
>>>
>>
>> Yes, we expect the guest to notify us, and it does, and we set
>> rx_event to true (see second point), but then the thread set it to
>> false (see third point). Talking with Paul, another solution could
>> be to set rx_event false before calling xenvif_rx_action. But using
>> rx_last_skb_slots makes it quicker for the thread to see if it
>> doesn't have to do anything.
>>
>
> OK, this is a better explaination. So actually there's no bug in the
> original implementation and your patch is sort of an improvement.
>
> Could you send a new version of this patch with relevant information in
> commit message? Talking to people offline is faster, but I would like to
> have public discussion and relevant information archived in a searchable
> form. Thanks.

No, there is a bug in the original implementation:
- [THREAD] xenvif_rx_action sets rx_queue_stopped to true
- [INTERRUPT] interrupt happens, and sets rx_event to true
- [THREAD] then xenvif_kthread sets rx_event to false
- [THREAD] rx_work_todo never returns true anymore

I will update the explanation and send in the patch again.

Zoli

^ permalink raw reply

* Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm
From: Hannes Frederic Sowa @ 2014-01-15 15:02 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: netdev, dborkman, eric.dumazet, linux-kernel, darkjames-ws
In-Reply-To: <52D57BAB.5070709@infradead.org>

On Tue, Jan 14, 2014 at 10:02:19AM -0800, Randy Dunlap wrote:
> Just trivia (coding style and spelling):
> 
> > diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
> > index 75510e9..b741b30 100644
> > --- a/lib/reciprocal_div.c
> > +++ b/lib/reciprocal_div.c
> > @@ -1,11 +1,25 @@
> > +#include <linux/kernel.h>
> >  #include <asm/div64.h>
> >  #include <linux/reciprocal_div.h>
> >  #include <linux/export.h>
> >  
> > -u32 reciprocal_value(u32 k)
> > +/* For a description of the algorithmus please look at
> 
>                                algorithms
> 
> > + * linux/reciprocal_div.h
> > + */
> 
> and kernel coding style for multi-line comments is like so:
> 
> /*
>  * For a description of the algorithms, please look at
>  * linux/reciprocal_div.h
>  */
> 
> > +
> > +struct reciprocal_value reciprocal_value(u32 d)
> >  {

Thanks, fixed those stuff locally along with a better description. I am
used to netdev comments. ;)

^ permalink raw reply

* Re: [PATCH net-next] xen-netback: Rework rx_work_todo
From: Wei Liu @ 2014-01-15 14:59 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: Wei Liu, ian.campbell, xen-devel, netdev, linux-kernel,
	jonathan.davies
In-Reply-To: <52D6A0B9.2030204@citrix.com>

On Wed, Jan 15, 2014 at 02:52:41PM +0000, Zoltan Kiss wrote:
> On 15/01/14 14:45, Wei Liu wrote:
> >>>>The recent patch to fix receive side flow control (11b57f) solved the spinning
> >>>>> >>thread problem, however caused an another one. The receive side can stall, if:
> >>>>> >>- xenvif_rx_action sets rx_queue_stopped to false
> >>>>> >>- interrupt happens, and sets rx_event to true
> >>>>> >>- then xenvif_kthread sets rx_event to false
> >>>>> >>
> >>>> >
> >>>> >If you mean "rx_work_todo" returns false.
> >>>> >
> >>>> >In this case
> >>>> >
> >>>> >(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
> >>>> >
> >>>> >can still be true, can't it?
> >>>Sorry, I should wrote rx_queue_stopped to true
> >>>
> >In this case, if rx_queue_stopped is true, then we're expecting frontend
> >to notify us, right?
> >
> >rx_queue_stopped is set to true if we cannot make any progress to queue
> >packet into the ring. In that situation we can expect frontend will send
> >notification to backend after it goes through the backlog in the ring.
> >That means rx_event is set to true, and rx_work_todo is true again. So
> >the ring is actually not stalled in this case as well. Did I miss
> >something?
> >
> 
> Yes, we expect the guest to notify us, and it does, and we set
> rx_event to true (see second point), but then the thread set it to
> false (see third point). Talking with Paul, another solution could
> be to set rx_event false before calling xenvif_rx_action. But using
> rx_last_skb_slots makes it quicker for the thread to see if it
> doesn't have to do anything.
> 

OK, this is a better explaination. So actually there's no bug in the
original implementation and your patch is sort of an improvement.

Could you send a new version of this patch with relevant information in
commit message? Talking to people offline is faster, but I would like to
have public discussion and relevant information archived in a searchable
form. Thanks.

Wei.

> Zoli

^ permalink raw reply

* Re: [PATCH v5 0/2] ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of IP6 routes
From: Thomas Haller @ 2014-01-15 14:58 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, jiri, netdev, stephen, dcbw
In-Reply-To: <20140115144331.GF19945@order.stressinduktion.org>

[-- Attachment #1: Type: text/plain, Size: 670 bytes --]

On Wed, 2014-01-15 at 15:43 +0100, Hannes Frederic Sowa wrote:
> On Wed, Jan 15, 2014 at 03:36:57PM +0100, Thomas Haller wrote:
> > v1 -> v2: add a second commit, handling NOPREFIXROUTE in ip6_del_addr.
> > v2 -> v3: reword commit messages, code comments and some refactoring.
> > v3 -> v4: refactor, rename variables, add enum
> > v4 -> v5: rebase, so that patch applies cleanly to current net-next/master
> 
> Huh? I thought those were already applied?
> 
> Hmm, they seem to be in no repository? I hope that no other patches are
> affected, too.
> 
> Greetings,
> 
>   Hannes
> 

I thought so too, but I cannot find it, so I re-posted v5.

Thomas

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] xen-netback: Rework rx_work_todo
From: Zoltan Kiss @ 2014-01-15 14:52 UTC (permalink / raw)
  To: Wei Liu; +Cc: ian.campbell, xen-devel, netdev, linux-kernel, jonathan.davies
In-Reply-To: <20140115144519.GO5698@zion.uk.xensource.com>

On 15/01/14 14:45, Wei Liu wrote:
>>>> The recent patch to fix receive side flow control (11b57f) solved the spinning
>>>> > >>thread problem, however caused an another one. The receive side can stall, if:
>>>> > >>- xenvif_rx_action sets rx_queue_stopped to false
>>>> > >>- interrupt happens, and sets rx_event to true
>>>> > >>- then xenvif_kthread sets rx_event to false
>>>> > >>
>>> > >
>>> > >If you mean "rx_work_todo" returns false.
>>> > >
>>> > >In this case
>>> > >
>>> > >(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
>>> > >
>>> > >can still be true, can't it?
>> >Sorry, I should wrote rx_queue_stopped to true
>> >
> In this case, if rx_queue_stopped is true, then we're expecting frontend
> to notify us, right?
>
> rx_queue_stopped is set to true if we cannot make any progress to queue
> packet into the ring. In that situation we can expect frontend will send
> notification to backend after it goes through the backlog in the ring.
> That means rx_event is set to true, and rx_work_todo is true again. So
> the ring is actually not stalled in this case as well. Did I miss
> something?
>

Yes, we expect the guest to notify us, and it does, and we set rx_event 
to true (see second point), but then the thread set it to false (see 
third point). Talking with Paul, another solution could be to set 
rx_event false before calling xenvif_rx_action. But using 
rx_last_skb_slots makes it quicker for the thread to see if it doesn't 
have to do anything.

Zoli

^ permalink raw reply

* [PATCH] net/ipv4: don't use module_init in non-modular gre_offload
From: Paul Gortmaker @ 2014-01-15 14:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Paul Gortmaker, Eric Dumazet

Recent commit 438e38fadca2f6e57eeecc08326c8a95758594d4
("gre_offload: statically build GRE offloading support") added
new module_init/module_exit calls to the gre_offload.c file.

The file is obj-y and can't be anything other than built-in.
Currently it can never be built modular, so using module_init
as an alias for __initcall can be somewhat misleading.

Fix this up now, so that we can relocate module_init from
init.h into module.h in the future.  If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.  We also make the inclusion explicit.

Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups.  As __initcall gets
mapped onto device_initcall, our use of device_initcall
directly in this change means that the runtime impact is
zero -- it will remain at level 6 in initcall ordering.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 29512e3e7e7c..ed968daf2380 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -11,6 +11,7 @@
  */

 #include <linux/skbuff.h>
+#include <linux/init.h>
 #include <net/protocol.h>
 #include <net/gre.h>

@@ -283,11 +284,10 @@ static int __init gre_offload_init(void)
 {
 	return inet_add_offload(&gre_offload, IPPROTO_GRE);
 }
+device_initcall(gre_offload_init);

 static void __exit gre_offload_exit(void)
 {
 	inet_del_offload(&gre_offload, IPPROTO_GRE);
 }
-
-module_init(gre_offload_init);
-module_exit(gre_offload_exit);
+__exitcall(gre_offload_exit);
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH v2 net] bpf: do not use reciprocal divide
From: Eric Dumazet @ 2014-01-15 14:50 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Martin Schwidefsky, Hannes Frederic Sowa, netdev, dborkman,
	darkjames-ws, Mircea Gherzan, Russell King, Matt Evans
In-Reply-To: <1389795926.31367.334.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>

At first Jakub Zawadzki noticed that some divisions by reciprocal_divide
were not correct. (off by one in some cases)
http://www.wireshark.org/~darkjames/reciprocal-buggy.c

He could also show this with BPF:
http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c

The reciprocal divide in linux kernel is not generic enough,
lets remove its use in BPF, as it is not worth the pain with
current cpus.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Daniel Borkmann <dxchgb@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Matt Evans <matt@ozlabs.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
---
v2: Fixed sparc code as David kindly suggested
    Added tests on K being 1 (divide by 1 is a nop
                              modulo by 1 clears A),
    as Martin Schwidefsky seems concerned by this case.

Please review arch code to make sure I made no mistake, thanks !

 arch/arm/net/bpf_jit_32.c       |    6 +++---
 arch/powerpc/net/bpf_jit_comp.c |    7 ++++---
 arch/s390/net/bpf_jit_comp.c    |   17 ++++++++++++-----
 arch/sparc/net/bpf_jit_comp.c   |   17 ++++++++++++++---
 arch/x86/net/bpf_jit_comp.c     |   14 ++++++++++----
 net/core/filter.c               |   30 ++----------------------------
 6 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 9ed155ad0f97..271b5e971568 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -641,10 +641,10 @@ load_ind:
 			emit(ARM_MUL(r_A, r_A, r_X), ctx);
 			break;
 		case BPF_S_ALU_DIV_K:
-			/* current k == reciprocal_value(userspace k) */
+			if (k == 1)
+				break;
 			emit_mov_i(r_scratch, k, ctx);
-			/* A = top 32 bits of the product */
-			emit(ARM_UMULL(r_scratch, r_A, r_A, r_scratch), ctx);
+			emit_udiv(r_A, r_A, r_scratch, ctx);
 			break;
 		case BPF_S_ALU_DIV_X:
 			update_on_xread(ctx);
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index ac3c2a10dafd..555034f8505e 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -223,10 +223,11 @@ static int bpf_jit_build_body(struct sk_filter *fp, u32 *image,
 			}
 			PPC_DIVWU(r_A, r_A, r_X);
 			break;
-		case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
+		case BPF_S_ALU_DIV_K: /* A /= K */
+			if (K == 1)
+				break;
 			PPC_LI32(r_scratch1, K);
-			/* Top 32 bits of 64bit result -> A */
-			PPC_MULHWU(r_A, r_A, r_scratch1);
+			PPC_DIVWU(r_A, r_A, r_scratch1);
 			break;
 		case BPF_S_ALU_AND_X:
 			ctx->seen |= SEEN_XREG;
diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index 16871da37371..fc0fa77728e1 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -371,11 +371,13 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
 		/* dr %r4,%r12 */
 		EMIT2(0x1d4c);
 		break;
-	case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K) */
-		/* m %r4,<d(K)>(%r13) */
-		EMIT4_DISP(0x5c40d000, EMIT_CONST(K));
-		/* lr %r5,%r4 */
-		EMIT2(0x1854);
+	case BPF_S_ALU_DIV_K: /* A /= K */
+		if (K == 1)
+			break;
+		/* lhi %r4,0 */
+		EMIT4(0xa7480000);
+		/* d %r4,<d(K)>(%r13) */
+		EMIT4_DISP(0x5d40d000, EMIT_CONST(K));
 		break;
 	case BPF_S_ALU_MOD_X: /* A %= X */
 		jit->seen |= SEEN_XREG | SEEN_RET0;
@@ -391,6 +393,11 @@ static int bpf_jit_insn(struct bpf_jit *jit, struct sock_filter *filter,
 		EMIT2(0x1854);
 		break;
 	case BPF_S_ALU_MOD_K: /* A %= K */
+		if (K == 1) {
+			/* lhi %r5,0 */
+			EMIT4(0xa7580000);
+			break;
+		}
 		/* lhi %r4,0 */
 		EMIT4(0xa7480000);
 		/* d %r4,<d(K)>(%r13) */
diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
index 218b6b23c378..01fe9946d388 100644
--- a/arch/sparc/net/bpf_jit_comp.c
+++ b/arch/sparc/net/bpf_jit_comp.c
@@ -497,9 +497,20 @@ void bpf_jit_compile(struct sk_filter *fp)
 			case BPF_S_ALU_MUL_K:	/* A *= K */
 				emit_alu_K(MUL, K);
 				break;
-			case BPF_S_ALU_DIV_K:	/* A /= K */
-				emit_alu_K(MUL, K);
-				emit_read_y(r_A);
+			case BPF_S_ALU_DIV_K:	/* A /= K with K != 0*/
+				if (K == 1)
+					break;
+				emit_write_y(G0);
+#ifdef CONFIG_SPARC32
+				/* The Sparc v8 architecture requires
+				 * three instructions between a %y
+				 * register write and the first use.
+				 */
+				emit_nop();
+				emit_nop();
+				emit_nop();
+#endif
+				emit_alu_K(DIV, K);
 				break;
 			case BPF_S_ALU_DIV_X:	/* A /= X; */
 				emit_cmpi(r_X, 0);
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 26328e800869..4ed75dd81d05 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -359,15 +359,21 @@ void bpf_jit_compile(struct sk_filter *fp)
 				EMIT2(0x89, 0xd0);	/* mov %edx,%eax */
 				break;
 			case BPF_S_ALU_MOD_K: /* A %= K; */
+				if (K == 1) {
+					CLEAR_A();
+					break;
+				}
 				EMIT2(0x31, 0xd2);	/* xor %edx,%edx */
 				EMIT1(0xb9);EMIT(K, 4);	/* mov imm32,%ecx */
 				EMIT2(0xf7, 0xf1);	/* div %ecx */
 				EMIT2(0x89, 0xd0);	/* mov %edx,%eax */
 				break;
-			case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
-				EMIT3(0x48, 0x69, 0xc0); /* imul imm32,%rax,%rax */
-				EMIT(K, 4);
-				EMIT4(0x48, 0xc1, 0xe8, 0x20); /* shr $0x20,%rax */
+			case BPF_S_ALU_DIV_K: /* A /= K */
+				if (K == 1)
+					break;
+				EMIT2(0x31, 0xd2);	/* xor %edx,%edx */
+				EMIT1(0xb9);EMIT(K, 4);	/* mov imm32,%ecx */
+				EMIT2(0xf7, 0xf1);	/* div %ecx */
 				break;
 			case BPF_S_ALU_AND_X:
 				seen |= SEEN_XREG;
diff --git a/net/core/filter.c b/net/core/filter.c
index 01b780856db2..ad30d626a5bd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -36,7 +36,6 @@
 #include <asm/uaccess.h>
 #include <asm/unaligned.h>
 #include <linux/filter.h>
-#include <linux/reciprocal_div.h>
 #include <linux/ratelimit.h>
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
@@ -166,7 +165,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
 			A /= X;
 			continue;
 		case BPF_S_ALU_DIV_K:
-			A = reciprocal_divide(A, K);
+			A /= K;
 			continue;
 		case BPF_S_ALU_MOD_X:
 			if (X == 0)
@@ -553,11 +552,6 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
 		/* Some instructions need special checks */
 		switch (code) {
 		case BPF_S_ALU_DIV_K:
-			/* check for division by zero */
-			if (ftest->k == 0)
-				return -EINVAL;
-			ftest->k = reciprocal_value(ftest->k);
-			break;
 		case BPF_S_ALU_MOD_K:
 			/* check for division by zero */
 			if (ftest->k == 0)
@@ -853,27 +847,7 @@ void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to)
 	to->code = decodes[code];
 	to->jt = filt->jt;
 	to->jf = filt->jf;
-
-	if (code == BPF_S_ALU_DIV_K) {
-		/*
-		 * When loaded this rule user gave us X, which was
-		 * translated into R = r(X). Now we calculate the
-		 * RR = r(R) and report it back. If next time this
-		 * value is loaded and RRR = r(RR) is calculated
-		 * then the R == RRR will be true.
-		 *
-		 * One exception. X == 1 translates into R == 0 and
-		 * we can't calculate RR out of it with r().
-		 */
-
-		if (filt->k == 0)
-			to->k = 1;
-		else
-			to->k = reciprocal_value(filt->k);
-
-		BUG_ON(reciprocal_value(to->k) != filt->k);
-	} else
-		to->k = filt->k;
+	to->k = filt->k;
 }
 
 int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf, unsigned int len)

^ permalink raw reply related

* Re: [PATCH net-next] xen-netback: Rework rx_work_todo
From: Wei Liu @ 2014-01-15 14:45 UTC (permalink / raw)
  To: Zoltan Kiss
  Cc: Wei Liu, ian.campbell, xen-devel, netdev, linux-kernel,
	jonathan.davies
In-Reply-To: <52D67536.4030106@citrix.com>

On Wed, Jan 15, 2014 at 11:47:02AM +0000, Zoltan Kiss wrote:
> On 15/01/14 10:37, Wei Liu wrote:
> >On Tue, Jan 14, 2014 at 07:28:39PM +0000, Zoltan Kiss wrote:
> >>The recent patch to fix receive side flow control (11b57f) solved the spinning
> >>thread problem, however caused an another one. The receive side can stall, if:
> >>- xenvif_rx_action sets rx_queue_stopped to false
> >>- interrupt happens, and sets rx_event to true
> >>- then xenvif_kthread sets rx_event to false
> >>
> >
> >If you mean "rx_work_todo" returns false.
> >
> >In this case
> >
> >(!skb_queue_empty(&vif->rx_queue) && !vif->rx_queue_stopped) || vif->rx_event;
> >
> >can still be true, can't it?
> Sorry, I should wrote rx_queue_stopped to true
> 

In this case, if rx_queue_stopped is true, then we're expecting frontend
to notify us, right?

rx_queue_stopped is set to true if we cannot make any progress to queue
packet into the ring. In that situation we can expect frontend will send
notification to backend after it goes through the backlog in the ring.
That means rx_event is set to true, and rx_work_todo is true again. So
the ring is actually not stalled in this case as well. Did I miss
something?

> >
> >>Also, through rx_event a malicious guest can force the RX thread to spin. This
> >>patch ditch that two variable, and rework rx_work_todo. If the thread finds it
> >
> >This seems to be a bigger problem. Can you elaborate?
> My mistake too. I forgot that rx_action set it to false, so it's not
> really a spinning. However the thread should still run
> xenvif_rx_action to figure out there is no space in the ring before
> it sets rx_event to false. In my patch we can quit earlier.
> 
> Zoli

^ permalink raw reply

* Re: [PATCH v5 0/2] ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of IP6 routes
From: Hannes Frederic Sowa @ 2014-01-15 14:43 UTC (permalink / raw)
  To: Thomas Haller; +Cc: David Miller, jiri, netdev, stephen, dcbw
In-Reply-To: <1389796619-13440-1-git-send-email-thaller@redhat.com>

On Wed, Jan 15, 2014 at 03:36:57PM +0100, Thomas Haller wrote:
> v1 -> v2: add a second commit, handling NOPREFIXROUTE in ip6_del_addr.
> v2 -> v3: reword commit messages, code comments and some refactoring.
> v3 -> v4: refactor, rename variables, add enum
> v4 -> v5: rebase, so that patch applies cleanly to current net-next/master

Huh? I thought those were already applied?

Hmm, they seem to be in no repository? I hope that no other patches are
affected, too.

Greetings,

  Hannes

^ permalink raw reply

* [PATCH v5 2/2] ipv6 addrconf: don't cleanup prefix route for IFA_F_NOPREFIXROUTE
From: Thomas Haller @ 2014-01-15 14:36 UTC (permalink / raw)
  To: David Miller; +Cc: hannes, jiri, netdev, stephen, dcbw, Thomas Haller
In-Reply-To: <1389796619-13440-1-git-send-email-thaller@redhat.com>

Refactor the deletion/update of prefix routes when removing an
address. Now also consider IFA_F_NOPREFIXROUTE and if there is an address
present with this flag, to not cleanup the route. Instead, assume
that userspace is taking care of this route.

Also perform the same cleanup, when userspace changes an existing address
to add NOPREFIXROUTE (to an address that didn't have this flag). This is
done because when the address was added, a prefix route was created for it.
Since the user now wants to handle this route by himself, we cleanup this
route.

This cleanup of the route is not totally robust. There is no guarantee,
that the route we are about to delete was really the one added by the
kernel. This behavior does not change by the patch, and in practice it
should work just fine.

Signed-off-by: Thomas Haller <thaller@redhat.com>
---
 net/ipv6/addrconf.c | 184 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 109 insertions(+), 75 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 85100c1..6913a82 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -900,15 +900,95 @@ out:
 	goto out2;
 }
 
+enum cleanup_prefix_rt_t {
+	CLEANUP_PREFIX_RT_NOP,    /* no cleanup action for prefix route */
+	CLEANUP_PREFIX_RT_DEL,    /* delete the prefix route */
+	CLEANUP_PREFIX_RT_EXPIRE, /* update the lifetime of the prefix route */
+};
+
+/*
+ * Check, whether the prefix for ifp would still need a prefix route
+ * after deleting ifp. The function returns one of the CLEANUP_PREFIX_RT_*
+ * constants.
+ *
+ * 1) we don't purge prefix if address was not permanent.
+ *    prefix is managed by its own lifetime.
+ * 2) we also don't purge, if the address was IFA_F_NOPREFIXROUTE.
+ * 3) if there are no addresses, delete prefix.
+ * 4) if there are still other permanent address(es),
+ *    corresponding prefix is still permanent.
+ * 5) if there are still other addresses with IFA_F_NOPREFIXROUTE,
+ *    don't purge the prefix, assume user space is managing it.
+ * 6) otherwise, update prefix lifetime to the
+ *    longest valid lifetime among the corresponding
+ *    addresses on the device.
+ *    Note: subsequent RA will update lifetime.
+ **/
+static enum cleanup_prefix_rt_t
+check_cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long *expires)
+{
+	struct inet6_ifaddr *ifa;
+	struct inet6_dev *idev = ifp->idev;
+	unsigned long lifetime;
+	enum cleanup_prefix_rt_t action = CLEANUP_PREFIX_RT_DEL;
+
+	*expires = jiffies;
+
+	list_for_each_entry(ifa, &idev->addr_list, if_list) {
+		if (ifa == ifp)
+			continue;
+		if (!ipv6_prefix_equal(&ifa->addr, &ifp->addr,
+				       ifp->prefix_len))
+			continue;
+		if (ifa->flags & (IFA_F_PERMANENT | IFA_F_NOPREFIXROUTE))
+			return CLEANUP_PREFIX_RT_NOP;
+
+		action = CLEANUP_PREFIX_RT_EXPIRE;
+
+		spin_lock(&ifa->lock);
+
+		lifetime = addrconf_timeout_fixup(ifa->valid_lft, HZ);
+		/*
+		 * Note: Because this address is
+		 * not permanent, lifetime <
+		 * LONG_MAX / HZ here.
+		 */
+		if (time_before(*expires, ifa->tstamp + lifetime * HZ))
+			*expires = ifa->tstamp + lifetime * HZ;
+		spin_unlock(&ifa->lock);
+	}
+
+	return action;
+}
+
+static void
+cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long expires, bool del_rt)
+{
+	struct rt6_info *rt;
+
+	rt = addrconf_get_prefix_route(&ifp->addr,
+				       ifp->prefix_len,
+				       ifp->idev->dev,
+				       0, RTF_GATEWAY | RTF_DEFAULT);
+	if (rt) {
+		if (del_rt)
+			ip6_del_rt(rt);
+		else {
+			if (!(rt->rt6i_flags & RTF_EXPIRES))
+				rt6_set_expires(rt, expires);
+			ip6_rt_put(rt);
+		}
+	}
+}
+
+
 /* This function wants to get referenced ifp and releases it before return */
 
 static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 {
-	struct inet6_ifaddr *ifa, *ifn;
-	struct inet6_dev *idev = ifp->idev;
 	int state;
-	int deleted = 0, onlink = 0;
-	unsigned long expires = jiffies;
+	enum cleanup_prefix_rt_t action = CLEANUP_PREFIX_RT_NOP;
+	unsigned long expires;
 
 	spin_lock_bh(&ifp->state_lock);
 	state = ifp->state;
@@ -922,7 +1002,7 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 	hlist_del_init_rcu(&ifp->addr_lst);
 	spin_unlock_bh(&addrconf_hash_lock);
 
-	write_lock_bh(&idev->lock);
+	write_lock_bh(&ifp->idev->lock);
 
 	if (ifp->flags&IFA_F_TEMPORARY) {
 		list_del(&ifp->tmp_list);
@@ -933,45 +1013,13 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 		__in6_ifa_put(ifp);
 	}
 
-	list_for_each_entry_safe(ifa, ifn, &idev->addr_list, if_list) {
-		if (ifa == ifp) {
-			list_del_init(&ifp->if_list);
-			__in6_ifa_put(ifp);
+	if (ifp->flags & IFA_F_PERMANENT && !(ifp->flags & IFA_F_NOPREFIXROUTE))
+		action = check_cleanup_prefix_route(ifp, &expires);
 
-			if (!(ifp->flags & IFA_F_PERMANENT) || onlink > 0)
-				break;
-			deleted = 1;
-			continue;
-		} else if (ifp->flags & IFA_F_PERMANENT) {
-			if (ipv6_prefix_equal(&ifa->addr, &ifp->addr,
-					      ifp->prefix_len)) {
-				if (ifa->flags & IFA_F_PERMANENT) {
-					onlink = 1;
-					if (deleted)
-						break;
-				} else {
-					unsigned long lifetime;
-
-					if (!onlink)
-						onlink = -1;
-
-					spin_lock(&ifa->lock);
-
-					lifetime = addrconf_timeout_fixup(ifa->valid_lft, HZ);
-					/*
-					 * Note: Because this address is
-					 * not permanent, lifetime <
-					 * LONG_MAX / HZ here.
-					 */
-					if (time_before(expires,
-							ifa->tstamp + lifetime * HZ))
-						expires = ifa->tstamp + lifetime * HZ;
-					spin_unlock(&ifa->lock);
-				}
-			}
-		}
-	}
-	write_unlock_bh(&idev->lock);
+	list_del_init(&ifp->if_list);
+	__in6_ifa_put(ifp);
+
+	write_unlock_bh(&ifp->idev->lock);
 
 	addrconf_del_dad_timer(ifp);
 
@@ -979,38 +1027,9 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 
 	inet6addr_notifier_call_chain(NETDEV_DOWN, ifp);
 
-	/*
-	 * Purge or update corresponding prefix
-	 *
-	 * 1) we don't purge prefix here if address was not permanent.
-	 *    prefix is managed by its own lifetime.
-	 * 2) if there are no addresses, delete prefix.
-	 * 3) if there are still other permanent address(es),
-	 *    corresponding prefix is still permanent.
-	 * 4) otherwise, update prefix lifetime to the
-	 *    longest valid lifetime among the corresponding
-	 *    addresses on the device.
-	 *    Note: subsequent RA will update lifetime.
-	 *
-	 * --yoshfuji
-	 */
-	if ((ifp->flags & IFA_F_PERMANENT) && onlink < 1) {
-		struct rt6_info *rt;
-
-		rt = addrconf_get_prefix_route(&ifp->addr,
-					       ifp->prefix_len,
-					       ifp->idev->dev,
-					       0, RTF_GATEWAY | RTF_DEFAULT);
-
-		if (rt) {
-			if (onlink == 0) {
-				ip6_del_rt(rt);
-				rt = NULL;
-			} else if (!(rt->rt6i_flags & RTF_EXPIRES)) {
-				rt6_set_expires(rt, expires);
-			}
-		}
-		ip6_rt_put(rt);
+	if (action != CLEANUP_PREFIX_RT_NOP) {
+		cleanup_prefix_route(ifp, expires,
+			action == CLEANUP_PREFIX_RT_DEL);
 	}
 
 	/* clean up prefsrc entries */
@@ -3636,6 +3655,7 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags,
 	clock_t expires;
 	unsigned long timeout;
 	bool was_managetempaddr;
+	bool had_prefixroute;
 
 	if (!valid_lft || (prefered_lft > valid_lft))
 		return -EINVAL;
@@ -3664,6 +3684,8 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags,
 
 	spin_lock_bh(&ifp->lock);
 	was_managetempaddr = ifp->flags & IFA_F_MANAGETEMPADDR;
+	had_prefixroute = ifp->flags & IFA_F_PERMANENT &&
+			  !(ifp->flags & IFA_F_NOPREFIXROUTE);
 	ifp->flags &= ~(IFA_F_DEPRECATED | IFA_F_PERMANENT | IFA_F_NODAD |
 			IFA_F_HOMEADDRESS | IFA_F_MANAGETEMPADDR |
 			IFA_F_NOPREFIXROUTE);
@@ -3679,6 +3701,18 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags,
 	if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) {
 		addrconf_prefix_route(&ifp->addr, ifp->prefix_len, ifp->idev->dev,
 				      expires, flags);
+	} else if (had_prefixroute) {
+		enum cleanup_prefix_rt_t action;
+		unsigned long rt_expires;
+
+		write_lock_bh(&ifp->idev->lock);
+		action = check_cleanup_prefix_route(ifp, &rt_expires);
+		write_unlock_bh(&ifp->idev->lock);
+
+		if (action != CLEANUP_PREFIX_RT_NOP) {
+			cleanup_prefix_route(ifp, rt_expires,
+				action == CLEANUP_PREFIX_RT_DEL);
+		}
 	}
 
 	if (was_managetempaddr || ifp->flags & IFA_F_MANAGETEMPADDR) {
-- 
1.8.4.2

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox