netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3] pf_packet updates
@ 2013-08-28 20:13 Daniel Borkmann
  2013-08-29  5:39 ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2013-08-28 20:13 UTC (permalink / raw)
  To: davem; +Cc: netdev

Daniel Borkmann (3):
  net: packet: add random fanout scheduler
  net: packet: use reciprocal_divide in fanout_demux_hash
  net: packet: document available fanout policies

 Documentation/networking/packet_mmap.txt |  8 ++++++++
 include/uapi/linux/if_packet.h           |  1 +
 net/packet/af_packet.c                   | 15 +++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

-- 
1.7.11.7

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-28 20:13 [PATCH net-next 0/3] pf_packet updates Daniel Borkmann
@ 2013-08-29  5:39 ` David Miller
  2013-08-29  6:20   ` Daniel Borkmann
  2013-08-29 10:25   ` Eric Dumazet
  0 siblings, 2 replies; 18+ messages in thread
From: David Miller @ 2013-08-29  5:39 UTC (permalink / raw)
  To: dborkman; +Cc: netdev

From: Daniel Borkmann <dborkman@redhat.com>
Date: Wed, 28 Aug 2013 22:13:08 +0200

> Daniel Borkmann (3):
>   net: packet: add random fanout scheduler
>   net: packet: use reciprocal_divide in fanout_demux_hash
>   net: packet: document available fanout policies

Please add the missing reciprocal_divide.h include to the second
patch, as per Eric Dumazet's feedback, and resubmit this series.

Thanks.


* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29  5:39 ` David Miller
@ 2013-08-29  6:20   ` Daniel Borkmann
  2013-08-29 20:43     ` David Miller
  2013-08-29 10:25   ` Eric Dumazet
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2013-08-29  6:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Cong Wang

On 08/29/2013 07:39 AM, David Miller wrote:
> From: Daniel Borkmann <dborkman@redhat.com>
> Date: Wed, 28 Aug 2013 22:13:08 +0200
>
>> Daniel Borkmann (3):
>>    net: packet: add random fanout scheduler
>>    net: packet: use reciprocal_divide in fanout_demux_hash
>>    net: packet: document available fanout policies
>
> Please add the missing reciprocal_divide.h include to the second
> patch, as per Eric Dumazet's feedback, and resubmit this series.

That is already the case in the first patch of the series. It adds:

...
+#include <linux/reciprocal_div.h>
...
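For reference, the multiply-shift that this era's reciprocal_divide() performs can be sketched in user space; `fanout_slot` is an illustrative name, not a kernel function:

```c
#include <stdint.h>

/* User-space sketch (assumption: matches the kernel's 2013-era
 * reciprocal_divide()): ((u64)hash * num) >> 32 maps a 32-bit hash
 * uniformly into [0, num) without a costly '%' instruction. */
static uint32_t fanout_slot(uint32_t hash, uint32_t num)
{
	return (uint32_t)(((uint64_t)hash * num) >> 32);
}
```

For any num >= 1 the result always stays inside the fanout group, and the maximum hash maps to slot num - 1.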


* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29  5:39 ` David Miller
  2013-08-29  6:20   ` Daniel Borkmann
@ 2013-08-29 10:25   ` Eric Dumazet
  2013-08-29 16:53     ` David Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2013-08-29 10:25 UTC (permalink / raw)
  To: David Miller; +Cc: dborkman, netdev

On Thu, 2013-08-29 at 01:39 -0400, David Miller wrote:
> From: Daniel Borkmann <dborkman@redhat.com>
> Date: Wed, 28 Aug 2013 22:13:08 +0200
> 
> > Daniel Borkmann (3):
> >   net: packet: add random fanout scheduler
> >   net: packet: use reciprocal_divide in fanout_demux_hash
> >   net: packet: document available fanout policies
> 
> Please add the missing reciprocal_divide.h include to the second
> patch, as per Eric Dumazet's feedback, and resubmit this series.

(It was Cong Wang's feedback ;) )

Thanks


* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29 10:25   ` Eric Dumazet
@ 2013-08-29 16:53     ` David Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2013-08-29 16:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dborkman, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Aug 2013 03:25:45 -0700

> On Thu, 2013-08-29 at 01:39 -0400, David Miller wrote:
>> From: Daniel Borkmann <dborkman@redhat.com>
>> Date: Wed, 28 Aug 2013 22:13:08 +0200
>> 
>> > Daniel Borkmann (3):
>> >   net: packet: add random fanout scheduler
>> >   net: packet: use reciprocal_divide in fanout_demux_hash
>> >   net: packet: document available fanout policies
>> 
>> Please add the missing reciprocal_divide.h include to the second
>> patch, as per Eric Dumazet's feedback, and resubmit this series.
> 
> (It was Cong Wang's feedback ;) )

Sorry Eric, I am just too anxious to give you credit everywhere that I
can. :-)

Anyway, thanks for explaining, Daniel; I've put these patches back into
the to-apply queue.

Thanks!


* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29  6:20   ` Daniel Borkmann
@ 2013-08-29 20:43     ` David Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2013-08-29 20:43 UTC (permalink / raw)
  To: dborkman; +Cc: netdev, amwang

From: Daniel Borkmann <dborkman@redhat.com>
Date: Thu, 29 Aug 2013 08:20:14 +0200

> On 08/29/2013 07:39 AM, David Miller wrote:
>> From: Daniel Borkmann <dborkman@redhat.com>
>> Date: Wed, 28 Aug 2013 22:13:08 +0200
>>
>>> Daniel Borkmann (3):
>>>    net: packet: add random fanout scheduler
>>>    net: packet: use reciprocal_divide in fanout_demux_hash
>>>    net: packet: document available fanout policies
>>
>> Please add the missing reciprocal_divide.h include to the second
>> patch, as per Eric Dumazet's feedback, and resubmit this series.
> 
> That is already the case in the first patch of the series. It adds:
> 
> ...
> +#include <linux/reciprocal_div.h>

Series applied, thanks Daniel.


* [PATCH net-next 0/3] PF_PACKET updates
@ 2013-12-06 10:36 Daniel Borkmann
  2013-12-10  1:24 ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2013-12-06 10:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, brouer

Patch descriptions in individual patches.

In order to avoid a possible merge conflict, we suggest taking
the first patch through the net tree, merging net into net-next,
and applying the remaining two patches on top of it. Dave, please
let us know if you would like to handle this differently.

For patch 3, we'll send a man-page update as a follow-up.

Thanks!

Daniel Borkmann (3):
  packet: fix send path when running with proto == 0
  net: dev: move inline skb_needs_linearize helper to header
  packet: introduce PACKET_QDISC_BYPASS socket option

 Documentation/networking/packet_mmap.txt |  31 ++++++
 include/linux/skbuff.h                   |  18 ++++
 include/uapi/linux/if_packet.h           |   1 +
 net/core/dev.c                           |  15 ---
 net/packet/af_packet.c                   | 156 +++++++++++++++++++++++--------
 net/packet/internal.h                    |   1 +
 6 files changed, 170 insertions(+), 52 deletions(-)

-- 
1.8.3.1


* Re: [PATCH net-next 0/3] PF_PACKET updates
  2013-12-06 10:36 [PATCH net-next 0/3] PF_PACKET updates Daniel Borkmann
@ 2013-12-10  1:24 ` David Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2013-12-10  1:24 UTC (permalink / raw)
  To: dborkman; +Cc: netdev, brouer

From: Daniel Borkmann <dborkman@redhat.com>
Date: Fri,  6 Dec 2013 11:36:14 +0100

> Patch descriptions in individual patches.
> 
> In order to avoid a possible merge conflict, we suggest taking
> the first patch through the net tree, merging net into net-next,
> and applying the remaining two patches on top of it. Dave, please
> let us know if you would like to handle this differently.
> 
> For patch 3, we'll send a man-page update as a follow-up.

Patch #1 applied to 'net' and queued up for -stable.

Patch #2 and #3 applied to 'net-next'.

Thanks!


* [PATCH net-next 0/3] pf_packet updates
@ 2014-01-12 16:22 Daniel Borkmann
  2014-01-12 16:22 ` [PATCH net-next 1/3] packet: improve socket create/bind latency in some cases Daniel Borkmann
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Daniel Borkmann @ 2014-01-12 16:22 UTC (permalink / raw)
  To: davem; +Cc: netdev

Daniel Borkmann (3):
  packet: improve socket create/bind latency in some cases
  packet: don't unconditionally schedule() in case of MSG_DONTWAIT
  packet: use percpu mmap tx frame pending refcount

 net/packet/af_packet.c | 105 +++++++++++++++++++++++++++++++++++++++----------
 net/packet/diag.c      |   1 +
 net/packet/internal.h  |   2 +-
 3 files changed, 86 insertions(+), 22 deletions(-)

-- 
1.7.11.7


* [PATCH net-next 1/3] packet: improve socket create/bind latency in some cases
  2014-01-12 16:22 [PATCH net-next 0/3] pf_packet updates Daniel Borkmann
@ 2014-01-12 16:22 ` Daniel Borkmann
  2014-01-12 16:22 ` [PATCH net-next 2/3] packet: don't unconditionally schedule() in case of MSG_DONTWAIT Daniel Borkmann
  2014-01-12 16:22 ` [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount Daniel Borkmann
  2 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2014-01-12 16:22 UTC (permalink / raw)
  To: davem; +Cc: netdev

Most people acquire PF_PACKET sockets with a protocol argument in
the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
all its sockets. Most likely, at some point in time a subsequent
bind() call will follow, e.g. in libpcap with ...

  memset(&sll, 0, sizeof(sll));
  sll.sll_family          = AF_PACKET;
  sll.sll_ifindex         = ifindex;
  sll.sll_protocol        = htons(ETH_P_ALL);

... as arguments. In the kernel, the socket() syscall already
installs a protocol hook via register_prot_hook() if the protocol
argument is != 0. Yet, in bind() we do almost the same work again:
an unregister_prot_hook() with an expensive synchronize_net() call
(in case the proto passed to socket() was != 0), followed by a
register_prot_hook(), this time with a bound device, in order to
limit the traffic we receive.

When the protocol and the user-supplied device index (== 0) do
not change from socket() to bind(), we can spare ourselves doing
the same work twice. The same holds for re-binding to the same
device and protocol. For these scenarios, this patch decreases
create/bind latency from ~7447us (sock-bind-2 case) to ~89us
(sock-bind-1 case).

Alternatively, for the first case, if people care, they should
simply create their sockets with a proto == 0 argument and set
the protocol during bind(), as this saves the synchronize_net()
call as well (sock-bind-3 case).

In all other cases, we're tied to user space behaviour we must not
change, also since a bind() is not strictly required. Thus, we need
the synchronize_net() to make sure no asynchronous packet processing
paths still refer to the previous elements of po->prot_hook.

In the case of mmap()ed sockets, the workflow that includes bind()
is socket() -> setsockopt(<ring>) -> bind(). There, a pair of
{__unregister, register}_prot_hook calls is made from setsockopt()
in order to install the new protocol receive handler. Thus, by the
time we call bind() and can skip the re-hook, the new handler has
already been installed. Fanout is handled entirely differently, so
we should be good there.

Timings on an i7-3520M machine:

  * sock-bind-1:   89 us
  * sock-bind-2: 7447 us
  * sock-bind-3:   75 us

sock-bind-1:
  socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

sock-bind-2:
  socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0

sock-bind-3:
  socket(PF_PACKET, SOCK_RAW, 0) = 3
  bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
           pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
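The cheap sock-bind-3 pattern above can be sketched in user space as follows; `open_psock_lazy` is a hypothetical helper name, and without CAP_NET_RAW the socket() call fails with EPERM:

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

/* sock-bind-3: create the socket with proto == 0 so that socket()
 * installs no protocol hook, and supply the protocol only at bind()
 * time. Returns the fd, or -errno on failure (e.g. -EPERM when run
 * without CAP_NET_RAW). */
static int open_psock_lazy(int ifindex, unsigned short proto)
{
	struct sockaddr_ll sll;
	int fd = socket(PF_PACKET, SOCK_RAW, 0);

	if (fd < 0)
		return -errno;

	memset(&sll, 0, sizeof(sll));
	sll.sll_family   = AF_PACKET;
	sll.sll_ifindex  = ifindex;
	sll.sll_protocol = htons(proto);

	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		int err = -errno;
		close(fd);
		return err;
	}
	return fd;
}
```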

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 v1->v2:
  - applied Dave's feedback to move assignments under bind lock
  - removed cleanup part

 net/packet/af_packet.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 279467b..85bb38c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2567,9 +2567,12 @@ static int packet_release(struct socket *sock)
  *	Attach a packet hook.
  */
 
-static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protocol)
+static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 proto)
 {
 	struct packet_sock *po = pkt_sk(sk);
+	const struct net_device *dev_curr;
+	__be16 proto_curr;
+	bool need_rehook;
 
 	if (po->fanout) {
 		if (dev)
@@ -2579,21 +2582,29 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
 	}
 
 	lock_sock(sk);
-
 	spin_lock(&po->bind_lock);
-	unregister_prot_hook(sk, true);
 
-	po->num = protocol;
-	po->prot_hook.type = protocol;
-	if (po->prot_hook.dev)
-		dev_put(po->prot_hook.dev);
+	proto_curr = po->prot_hook.type;
+	dev_curr = po->prot_hook.dev;
+
+	need_rehook = proto_curr != proto || dev_curr != dev;
+
+	if (need_rehook) {
+		unregister_prot_hook(sk, true);
 
-	po->prot_hook.dev = dev;
-	po->ifindex = dev ? dev->ifindex : 0;
+		po->num = proto;
+		po->prot_hook.type = proto;
+
+		if (po->prot_hook.dev)
+			dev_put(po->prot_hook.dev);
 
-	packet_cached_dev_assign(po, dev);
+		po->prot_hook.dev = dev;
+
+		po->ifindex = dev ? dev->ifindex : 0;
+		packet_cached_dev_assign(po, dev);
+	}
 
-	if (protocol == 0)
+	if (proto == 0 || !need_rehook)
 		goto out_unlock;
 
 	if (!dev || (dev->flags & IFF_UP)) {
-- 
1.7.11.7


* [PATCH net-next 2/3] packet: don't unconditionally schedule() in case of MSG_DONTWAIT
  2014-01-12 16:22 [PATCH net-next 0/3] pf_packet updates Daniel Borkmann
  2014-01-12 16:22 ` [PATCH net-next 1/3] packet: improve socket create/bind latency in some cases Daniel Borkmann
@ 2014-01-12 16:22 ` Daniel Borkmann
  2014-01-12 16:22 ` [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount Daniel Borkmann
  2 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2014-01-12 16:22 UTC (permalink / raw)
  To: davem; +Cc: netdev

In tpacket_snd(), when we encounter a first frame that is not in
status TP_STATUS_SEND_REQUEST and thus get a NULL buffer back, we
exit the send routine in case of MSG_DONTWAIT, since we have
finished traversing the mmap()ed send ring buffer and don't care
about pending frames.

While doing so, we still unconditionally call an expensive
schedule() in the packet_current_frame() "error" path, which is
unnecessary in this case, since it is enough to just quit the
function.

Also, in case MSG_DONTWAIT is not set, we should rather test for
need_resched() first and call schedule() only if necessary, since
in the meantime pending frames could already have finished
processing and invoked the skb destructor.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/packet/af_packet.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 85bb38c..d5495d8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2156,6 +2156,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int err, reserve = 0;
 	void *ph;
 	struct sockaddr_ll *saddr = (struct sockaddr_ll *)msg->msg_name;
+	bool need_wait = !(msg->msg_flags & MSG_DONTWAIT);
 	int tp_len, size_max;
 	unsigned char *addr;
 	int len_sum = 0;
@@ -2198,10 +2199,10 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 	do {
 		ph = packet_current_frame(po, &po->tx_ring,
-				TP_STATUS_SEND_REQUEST);
-
+					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
-			schedule();
+			if (need_wait && need_resched())
+				schedule();
 			continue;
 		}
 
@@ -2255,10 +2256,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
-	} while (likely((ph != NULL) ||
-			((!(msg->msg_flags & MSG_DONTWAIT)) &&
-			 (atomic_read(&po->tx_ring.pending))))
-		);
+	} while (likely((ph != NULL) || (need_wait &&
+					 atomic_read(&po->tx_ring.pending))));
 
 	err = len_sum;
 	goto out_put;
-- 
1.7.11.7


* [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount
  2014-01-12 16:22 [PATCH net-next 0/3] pf_packet updates Daniel Borkmann
  2014-01-12 16:22 ` [PATCH net-next 1/3] packet: improve socket create/bind latency in some cases Daniel Borkmann
  2014-01-12 16:22 ` [PATCH net-next 2/3] packet: don't unconditionally schedule() in case of MSG_DONTWAIT Daniel Borkmann
@ 2014-01-12 16:22 ` Daniel Borkmann
  2014-01-13  5:51   ` Cong Wang
  2014-01-15  1:16   ` David Miller
  2 siblings, 2 replies; 18+ messages in thread
From: Daniel Borkmann @ 2014-01-12 16:22 UTC (permalink / raw)
  To: davem; +Cc: netdev

In PF_PACKET's packet mmap(), we can avoid one atomic_inc() and
one atomic_dec() call in the skb destructor by using a percpu
reference count instead in order to determine whether packets are
still pending to be sent out. Micro-benchmark with [1], slightly
modified (that is, protocol = 0 in socket(2) and bind(2)); example
on an i7-3520M machine:

./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:

With patch:    4,022,015 cyc
Without patch: 4,812,994 cyc

time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:

With patch:
  real         1m32.241s
  user         0m0.287s
  sys          1m29.316s

Without patch:
  real         1m38.386s
  user         0m0.265s
  sys          1m35.572s

In tpacket_snd(), it is okay to use packet_read_pending(), since
in the fast path we already short-circuit the condition with
ph != NULL when there are further frames to process.

In case we have MSG_DONTWAIT, we do not execute this path either,
as need_wait is false here anyway; and without the MSG_DONTWAIT
flag, it is okay to call packet_read_pending(), because whenever
we reach that path we are done processing frames anyway and only
check whether there are skbs still outstanding to be orphaned.

We can remove the BUG_ON() in tpacket_destruct_skb() as well: the
pending count can _only_ be increased from inside the tpacket_snd()
path, and we make sure to increase tx_ring.pending in any case
before calling po->xmit(skb), so testing for tx_ring.pending == 0
is not very useful. Instead, it would rather have been useful to
test whether lower layers failed to orphan the skb, which would
leave ring slots never being put back to TP_STATUS_AVAILABLE; but
such a bug would be caught in user space anyway, as we would end
up realizing that no TP_STATUS_AVAILABLE slots are left. Therefore,
we're all set.

Btw, in the RX_RING path we do not make use of the pending member,
so we don't need to use up any percpu memory there either.

  [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
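The scheme described above can be modeled in user space roughly like this (all names here are illustrative rather than the kernel's, and a plain array stands in for the percpu allocation):

```c
#include <stddef.h>

#define NR_CPUS 4  /* illustrative; stands in for for_each_possible_cpu() */

/* User-space model: the TX ring gets one counter per CPU, while the
 * RX ring keeps pending_refcnt == NULL and always reads back 0. */
struct ring_model {
	unsigned int *pending_refcnt;	/* NR_CPUS slots, or NULL */
};

static void model_inc_pending(struct ring_model *rb, int cpu)
{
	rb->pending_refcnt[cpu]++;
}

static void model_dec_pending(struct ring_model *rb, int cpu)
{
	rb->pending_refcnt[cpu]--;
}

static unsigned int model_read_pending(const struct ring_model *rb)
{
	unsigned int sum = 0;
	int i;

	if (rb->pending_refcnt == NULL)	/* rx_ring case */
		return 0;
	for (i = 0; i < NR_CPUS; i++)
		sum += rb->pending_refcnt[i];
	return sum;
}
```

Note that a decrement may run on a CPU whose slot is still 0 (the destructor fires in ksoftirqd context); that slot wraps, but the unsigned sum across all slots still yields the exact number of outstanding frames.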

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/packet/af_packet.c | 65 +++++++++++++++++++++++++++++++++++++++++++++-----
 net/packet/diag.c      |  1 +
 net/packet/internal.h  |  2 +-
 3 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d5495d8..6803be98 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -89,6 +89,7 @@
 #include <linux/errqueue.h>
 #include <linux/net_tstamp.h>
 #include <linux/reciprocal_div.h>
+#include <linux/percpu.h>
 #ifdef CONFIG_INET
 #include <net/inet_common.h>
 #endif
@@ -1168,6 +1169,46 @@ static void packet_increment_head(struct packet_ring_buffer *buff)
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
 }
 
+static void packet_inc_pending(struct packet_ring_buffer *rb)
+{
+	this_cpu_inc(*rb->pending_refcnt);
+}
+
+static void packet_dec_pending(struct packet_ring_buffer *rb)
+{
+	this_cpu_dec(*rb->pending_refcnt);
+}
+
+static int packet_read_pending(const struct packet_ring_buffer *rb)
+{
+	int i, refcnt = 0;
+
+	/* We don't use pending refcount in rx_ring. */
+	if (rb->pending_refcnt == NULL)
+		return 0;
+
+	for_each_possible_cpu(i)
+		refcnt += *per_cpu_ptr(rb->pending_refcnt, i);
+
+	return refcnt;
+}
+
+static int packet_alloc_pending(struct packet_sock *po)
+{
+	po->rx_ring.pending_refcnt = NULL;
+
+	po->tx_ring.pending_refcnt = alloc_percpu(int);
+	if (unlikely(po->tx_ring.pending_refcnt == NULL))
+		return -ENOBUFS;
+
+	return 0;
+}
+
+static void packet_free_pending(struct packet_sock *po)
+{
+	free_percpu(po->tx_ring.pending_refcnt);
+}
+
 static bool packet_rcv_has_room(struct packet_sock *po, struct sk_buff *skb)
 {
 	struct sock *sk = &po->sk;
@@ -2014,8 +2055,7 @@ static void tpacket_destruct_skb(struct sk_buff *skb)
 		__u32 ts;
 
 		ph = skb_shinfo(skb)->destructor_arg;
-		BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
-		atomic_dec(&po->tx_ring.pending);
+		packet_dec_pending(&po->tx_ring);
 
 		ts = __packet_set_timestamp(po, ph, skb);
 		__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
@@ -2236,7 +2276,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		skb_set_queue_mapping(skb, packet_pick_tx_queue(dev));
 		skb->destructor = tpacket_destruct_skb;
 		__packet_set_status(po, ph, TP_STATUS_SENDING);
-		atomic_inc(&po->tx_ring.pending);
+		packet_inc_pending(&po->tx_ring);
 
 		status = TP_STATUS_SEND_REQUEST;
 		err = po->xmit(skb);
@@ -2256,8 +2296,14 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
-	} while (likely((ph != NULL) || (need_wait &&
-					 atomic_read(&po->tx_ring.pending))));
+	} while (likely((ph != NULL) ||
+		/* Note: packet_read_pending() might be slow if we have
+		 * to call it as it's per_cpu variable, but in fast-path
+		 * we already short-circuit the loop with the first
+		 * condition, and luckily don't have to go that path
+		 * anyway.
+		 */
+		 (need_wait && packet_read_pending(&po->tx_ring))));
 
 	err = len_sum;
 	goto out_put;
@@ -2556,6 +2602,7 @@ static int packet_release(struct socket *sock)
 	/* Purge queues */
 
 	skb_queue_purge(&sk->sk_receive_queue);
+	packet_free_pending(po);
 	sk_refcnt_debug_release(sk);
 
 	sock_put(sk);
@@ -2717,6 +2764,10 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
 
+	err = packet_alloc_pending(po);
+	if (err)
+		goto out2;
+
 	packet_cached_dev_reset(po);
 
 	sk->sk_destruct = packet_sock_destruct;
@@ -2749,6 +2800,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	preempt_enable();
 
 	return 0;
+out2:
+	sk_free(sk);
 out:
 	return err;
 }
@@ -3676,7 +3729,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 	if (!closing) {
 		if (atomic_read(&po->mapped))
 			goto out;
-		if (atomic_read(&rb->pending))
+		if (packet_read_pending(rb))
 			goto out;
 	}
 
diff --git a/net/packet/diag.c b/net/packet/diag.c
index a9584a2..533ce4f 100644
--- a/net/packet/diag.c
+++ b/net/packet/diag.c
@@ -3,6 +3,7 @@
 #include <linux/net.h>
 #include <linux/netdevice.h>
 #include <linux/packet_diag.h>
+#include <linux/percpu.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 0a87d7b..a47d5b5 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -64,7 +64,7 @@ struct packet_ring_buffer {
 	unsigned int		pg_vec_pages;
 	unsigned int		pg_vec_len;
 
-	atomic_t		pending;
+	int __percpu		*pending_refcnt;
 
 	struct tpacket_kbdq_core	prb_bdqc;
 };
-- 
1.7.11.7


* Re: [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount
  2014-01-12 16:22 ` [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount Daniel Borkmann
@ 2014-01-13  5:51   ` Cong Wang
  2014-01-13  9:55     ` David Laight
  2014-01-13 11:19     ` Daniel Borkmann
  2014-01-15  1:16   ` David Miller
  1 sibling, 2 replies; 18+ messages in thread
From: Cong Wang @ 2014-01-13  5:51 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David Miller, netdev

On Sun, Jan 12, 2014 at 8:22 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> +static void packet_inc_pending(struct packet_ring_buffer *rb)
> +{
> +       this_cpu_inc(*rb->pending_refcnt);
> +}
> +
> +static void packet_dec_pending(struct packet_ring_buffer *rb)
> +{
> +       this_cpu_dec(*rb->pending_refcnt);
> +}
> +
> +static int packet_read_pending(const struct packet_ring_buffer *rb)
> +{
> +       int i, refcnt = 0;
> +
> +       /* We don't use pending refcount in rx_ring. */
> +       if (rb->pending_refcnt == NULL)
> +               return 0;
> +
> +       for_each_possible_cpu(i)
> +               refcnt += *per_cpu_ptr(rb->pending_refcnt, i);
> +
> +       return refcnt;
> +}

How is this supposed to work? Since there is no lock,
you can't read an accurate refcnt. Take a look at lib/percpu_counter.c.

I guess that for some reason you don't care about the accuracy?
Then at least you need a comment in the code.

Thanks.


* RE: [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount
  2014-01-13  5:51   ` Cong Wang
@ 2014-01-13  9:55     ` David Laight
  2014-01-13 11:19     ` Daniel Borkmann
  1 sibling, 0 replies; 18+ messages in thread
From: David Laight @ 2014-01-13  9:55 UTC (permalink / raw)
  To: 'Cong Wang', Daniel Borkmann; +Cc: David Miller, netdev

From: Cong Wang
> On Sun, Jan 12, 2014 at 8:22 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> > +static void packet_inc_pending(struct packet_ring_buffer *rb)
> > +{
> > +       this_cpu_inc(*rb->pending_refcnt);
> > +}
> > +
> > +static void packet_dec_pending(struct packet_ring_buffer *rb)
> > +{
> > +       this_cpu_dec(*rb->pending_refcnt);
> > +}
> > +
> > +static int packet_read_pending(const struct packet_ring_buffer *rb)
> > +{
> > +       int i, refcnt = 0;
> > +
> > +       /* We don't use pending refcount in rx_ring. */
> > +       if (rb->pending_refcnt == NULL)
> > +               return 0;
> > +
> > +       for_each_possible_cpu(i)
> > +               refcnt += *per_cpu_ptr(rb->pending_refcnt, i);
> > +
> > +       return refcnt;
> > +}
> 
> How is this supposed to work? Since there is no lock,
> you can't read accurate refcnt. Take a look at lib/percpu_counter.c.
> 
> I guess for some reason you don't care the accuracy?
> Then at least you need to comment in the code.

Hmmm... did the code ever work?

The value looks like the number of active transmits (s/pending/tx_pending/).
The test of the number of pending tx frames sits at the bottom of a loop;
I'd expect some action to be locked against this check - but even
the atomic read doesn't provide that.

Check what happens if the count reads as 1 just before another CPU
decrements it to zero (or reads 0 just before it is incremented).
In one of those cases something is likely to go wrong.

I'm actually surprised that there isn't a mutex covering the code path.

	David


* Re: [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount
  2014-01-13  5:51   ` Cong Wang
  2014-01-13  9:55     ` David Laight
@ 2014-01-13 11:19     ` Daniel Borkmann
  2014-01-13 11:58       ` David Laight
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2014-01-13 11:19 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, netdev

On 01/13/2014 06:51 AM, Cong Wang wrote:
> On Sun, Jan 12, 2014 at 8:22 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>> +static void packet_inc_pending(struct packet_ring_buffer *rb)
>> +{
>> +       this_cpu_inc(*rb->pending_refcnt);
>> +}
>> +
>> +static void packet_dec_pending(struct packet_ring_buffer *rb)
>> +{
>> +       this_cpu_dec(*rb->pending_refcnt);
>> +}
>> +
>> +static int packet_read_pending(const struct packet_ring_buffer *rb)
>> +{
>> +       int i, refcnt = 0;
>> +
>> +       /* We don't use pending refcount in rx_ring. */
>> +       if (rb->pending_refcnt == NULL)
>> +               return 0;
>> +
>> +       for_each_possible_cpu(i)
>> +               refcnt += *per_cpu_ptr(rb->pending_refcnt, i);
>> +
>> +       return refcnt;
>> +}
>
> How is this supposed to work? Since there is no lock,
> you can't read an accurate refcnt. Take a look at lib/percpu_counter.c.
>
> I guess that for some reason you don't care about the accuracy?

Yep, not per se. Look at how we do net device reference counting.
The reason is that we call packet_read_pending() *only* after we
have finished processing all frames in the TX_RING, and we wait
for completion in case MSG_DONTWAIT is *not* set; when that
happens, we're back to 0.

But I think I found a different problem with this idea. It could
happen with net devices as well, but is probably less likely there,
as the hold/puts might be better distributed among CPUs. For
TX_RING, however, if we pin the process to a particular CPU, and
since the destructor is invoked through ksoftirqd, we could end up
with an imbalance, and if the process runs long enough, eventually
overflow the counter for one particular CPU. We could work around
that, but I don't think it's worth the effort.

Dave, please drop the 3rd patch of the series, thanks.

> Then at least you need a comment in the code.
>
> Thanks.
>


* RE: [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount
  2014-01-13 11:19     ` Daniel Borkmann
@ 2014-01-13 11:58       ` David Laight
  2014-01-13 12:57         ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: David Laight @ 2014-01-13 11:58 UTC (permalink / raw)
  To: 'Daniel Borkmann', Cong Wang; +Cc: David Miller, netdev

From: Daniel Borkmann
...
> But I think I found a different problem with this idea. It could
> happen with net devices as well, but is probably less likely there,
> as the hold/puts might be better distributed among CPUs. For
> TX_RING, however, if we pin the process to a particular CPU, and
> since the destructor is invoked through ksoftirqd, we could end up
> with an imbalance, and if the process runs long enough, eventually
> overflow the counter for one particular CPU. We could work around
> that, but I don't think it's worth the effort.

The sum will be 'correct' when summed across all the CPUs, even if
one of the values has wrapped - provided all the arithmetic is
unsigned and the variables are all of the same type.

	David
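This modular-arithmetic property can be checked with a small sketch; `summed_pending` is a made-up helper that models the worst case the thread discusses, with every increment landing on one CPU and every decrement on another:

```c
/* Hypothetical model: all increments land on CPU 0's counter, all
 * decrements on CPU 1's, so CPU 1's unsigned counter wraps below
 * zero. As long as both counters are unsigned and of the same
 * width, the modular sum still equals the true outstanding count. */
static unsigned int summed_pending(unsigned int incs_cpu0, unsigned int decs_cpu1)
{
	unsigned int a = 0, b = 0;

	a += incs_cpu0;	/* per-CPU counter on CPU 0 */
	b -= decs_cpu1;	/* per-CPU counter on CPU 1, wraps */
	return a + b;	/* wraps back to the exact total */
}
```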


* Re: [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount
  2014-01-13 11:58       ` David Laight
@ 2014-01-13 12:57         ` Daniel Borkmann
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2014-01-13 12:57 UTC (permalink / raw)
  To: David Laight; +Cc: Cong Wang, David Miller, netdev

On 01/13/2014 12:58 PM, David Laight wrote:
> From: Daniel Borkmann
> ...
>> But I think I found a different problem with this idea. It could
>> happen with net devices as well, but is probably less likely there,
>> as the hold/puts might be better distributed among CPUs. For
>> TX_RING, however, if we pin the process to a particular CPU, and
>> since the destructor is invoked through ksoftirqd, we could end up
>> with an imbalance, and if the process runs long enough, eventually
>> overflow the counter for one particular CPU. We could work around
>> that, but I don't think it's worth the effort.
>
> The sum will be 'correct' when summed across all the CPUs, even if
> one of the values has wrapped - provided all the arithmetic is
> unsigned and the variables are all of the same type.

You are right, sorry, too much coffee this morning, I missed that.

Then, imho, this should work.


* Re: [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount
  2014-01-12 16:22 ` [PATCH net-next 3/3] packet: use percpu mmap tx frame pending refcount Daniel Borkmann
  2014-01-13  5:51   ` Cong Wang
@ 2014-01-15  1:16   ` David Miller
  1 sibling, 0 replies; 18+ messages in thread
From: David Miller @ 2014-01-15  1:16 UTC (permalink / raw)
  To: dborkman; +Cc: netdev

From: Daniel Borkmann <dborkman@redhat.com>
Date: Sun, 12 Jan 2014 17:22:48 +0100

> +static int packet_read_pending(const struct packet_ring_buffer *rb)
> +{
> +	int i, refcnt = 0;
> +
> +	/* We don't use pending refcount in rx_ring. */
> +	if (rb->pending_refcnt == NULL)
> +		return 0;
> +
> +	for_each_possible_cpu(i)
> +		refcnt += *per_cpu_ptr(rb->pending_refcnt, i);
> +
> +	return refcnt;
> +}

David Laight stated that this works properly only if all the arithmetic
is unsigned.

Therefore, maybe use "unsigned int" for refcnt here?

Can you respin this series with that fixed?

Thanks.

