netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3] pf_packet updates
@ 2013-08-28 20:13 Daniel Borkmann
  2013-08-29  5:39 ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Borkmann @ 2013-08-28 20:13 UTC (permalink / raw)
  To: davem; +Cc: netdev

Daniel Borkmann (3):
  net: packet: add random fanout scheduler
  net: packet: use reciprocal_divide in fanout_demux_hash
  net: packet: document available fanout policies

 Documentation/networking/packet_mmap.txt |  8 ++++++++
 include/uapi/linux/if_packet.h           |  1 +
 net/packet/af_packet.c                   | 15 +++++++++++++--
 3 files changed, 22 insertions(+), 2 deletions(-)

-- 
1.7.11.7

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-28 20:13 Daniel Borkmann
@ 2013-08-29  5:39 ` David Miller
  2013-08-29  6:20   ` Daniel Borkmann
  2013-08-29 10:25   ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: David Miller @ 2013-08-29  5:39 UTC (permalink / raw)
  To: dborkman; +Cc: netdev

From: Daniel Borkmann <dborkman@redhat.com>
Date: Wed, 28 Aug 2013 22:13:08 +0200

> Daniel Borkmann (3):
>   net: packet: add random fanout scheduler
>   net: packet: use reciprocal_divide in fanout_demux_hash
>   net: packet: document available fanout policies

Please add the missing reciprocal_divide.h include to the second
patch, as per Eric Dumazet's feedback, and resubmit this series.

Thanks.

* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29  5:39 ` David Miller
@ 2013-08-29  6:20   ` Daniel Borkmann
  2013-08-29 20:43     ` David Miller
  2013-08-29 10:25   ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Daniel Borkmann @ 2013-08-29  6:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Cong Wang

On 08/29/2013 07:39 AM, David Miller wrote:
> From: Daniel Borkmann <dborkman@redhat.com>
> Date: Wed, 28 Aug 2013 22:13:08 +0200
>
>> Daniel Borkmann (3):
>>    net: packet: add random fanout scheduler
>>    net: packet: use reciprocal_divide in fanout_demux_hash
>>    net: packet: document available fanout policies
>
> Please add the missing reciprocal_divide.h include to the second
> patch, as per Eric Dumazet's feedback, and resubmit this series.

That is already the case in the first patch of the series. It adds:

...
+#include <linux/reciprocal_div.h>
...

* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29  5:39 ` David Miller
  2013-08-29  6:20   ` Daniel Borkmann
@ 2013-08-29 10:25   ` Eric Dumazet
  2013-08-29 16:53     ` David Miller
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2013-08-29 10:25 UTC (permalink / raw)
  To: David Miller; +Cc: dborkman, netdev

On Thu, 2013-08-29 at 01:39 -0400, David Miller wrote:
> From: Daniel Borkmann <dborkman@redhat.com>
> Date: Wed, 28 Aug 2013 22:13:08 +0200
> 
> > Daniel Borkmann (3):
> >   net: packet: add random fanout scheduler
> >   net: packet: use reciprocal_divide in fanout_demux_hash
> >   net: packet: document available fanout policies
> 
> Please add the missing reciprocal_divide.h include to the second
> patch, as per Eric Dumazet's feedback, and resubmit this series.

(It was Cong Wang feedback ;) )

Thanks

* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29 10:25   ` Eric Dumazet
@ 2013-08-29 16:53     ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2013-08-29 16:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dborkman, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Aug 2013 03:25:45 -0700

> On Thu, 2013-08-29 at 01:39 -0400, David Miller wrote:
>> From: Daniel Borkmann <dborkman@redhat.com>
>> Date: Wed, 28 Aug 2013 22:13:08 +0200
>> 
>> > Daniel Borkmann (3):
>> >   net: packet: add random fanout scheduler
>> >   net: packet: use reciprocal_divide in fanout_demux_hash
>> >   net: packet: document available fanout policies
>> 
>> Please add the missing reciprocal_divide.h include to the second
>> patch, as per Eric Dumazet's feedback, and resubmit this series.
> 
> (It was Cong Wang feedback ;) )

Sorry Eric, I am just too anxious to give you credit everywhere that I
can. :-)

Anyway, thanks for explaining, Daniel; I've put these patches back
into the to-apply queue.

Thanks!

* Re: [PATCH net-next 0/3] pf_packet updates
  2013-08-29  6:20   ` Daniel Borkmann
@ 2013-08-29 20:43     ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2013-08-29 20:43 UTC (permalink / raw)
  To: dborkman; +Cc: netdev, amwang

From: Daniel Borkmann <dborkman@redhat.com>
Date: Thu, 29 Aug 2013 08:20:14 +0200

> On 08/29/2013 07:39 AM, David Miller wrote:
>> From: Daniel Borkmann <dborkman@redhat.com>
>> Date: Wed, 28 Aug 2013 22:13:08 +0200
>>
>>> Daniel Borkmann (3):
>>>    net: packet: add random fanout scheduler
>>>    net: packet: use reciprocal_divide in fanout_demux_hash
>>>    net: packet: document available fanout policies
>>
>> Please add the missing reciprocal_divide.h include to the second
>> patch, as per Eric Dumazet's feedback, and resubmit this series.
> 
> That is already the case in the first patch of the series. It adds:
> 
> ...
> +#include <linux/reciprocal_div.h>

Series applied, thanks Daniel.

* [PATCH net-next 0/3] PF_PACKET updates
@ 2013-12-06 10:36 Daniel Borkmann
  2013-12-06 10:36 ` [PATCH net-next 1/3] packet: fix send path when running with proto == 0 Daniel Borkmann
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Daniel Borkmann @ 2013-12-06 10:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, brouer

Patch descriptions in individual patches.

To avoid a possible merge conflict, we suggest taking the first
patch through the net tree, merging net into net-next, and
applying the remaining two patches on top. Dave, please let us
know if you would like to handle this differently.

For patch 3 we'll send a man-page update as a follow-up.

Thanks!

Daniel Borkmann (3):
  packet: fix send path when running with proto == 0
  net: dev: move inline skb_needs_linearize helper to header
  packet: introduce PACKET_QDISC_BYPASS socket option

 Documentation/networking/packet_mmap.txt |  31 ++++++
 include/linux/skbuff.h                   |  18 ++++
 include/uapi/linux/if_packet.h           |   1 +
 net/core/dev.c                           |  15 ---
 net/packet/af_packet.c                   | 156 +++++++++++++++++++++++--------
 net/packet/internal.h                    |   1 +
 6 files changed, 170 insertions(+), 52 deletions(-)

-- 
1.8.3.1

* [PATCH net-next 1/3] packet: fix send path when running with proto == 0
  2013-12-06 10:36 [PATCH net-next 0/3] PF_PACKET updates Daniel Borkmann
@ 2013-12-06 10:36 ` Daniel Borkmann
  2013-12-06 10:36 ` [PATCH net-next 2/3] net: dev: move inline skb_needs_linearize helper to header Daniel Borkmann
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2013-12-06 10:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, brouer

Commit e40526cb20b5 introduced a cached dev pointer that gets
hooked into register_prot_hook() and __unregister_prot_hook() to
update the device used for the send path.

We need to fix this up, as otherwise it will not work with
sockets created with protocol = 0 and with sll_protocol = 0
passed via sockaddr_ll when doing the bind.

So instead, assign the pointer directly. The compiler can inline
these helper functions automagically.

While at it, also mark the cached dev fast path as likely(), and
document this variant of socket creation, as it does not seem to
be widely used (apparently not even the author of TX_RING was
aware of it in his reference example [1]). Tested with the
reproducer from e40526cb20b5.

 [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example

Fixes: e40526cb20b5 ("packet: fix use after free race in send path when dev is released")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 Documentation/networking/packet_mmap.txt | 10 +++++
 net/packet/af_packet.c                   | 65 ++++++++++++++++++++------------
 2 files changed, 50 insertions(+), 25 deletions(-)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index c012236..8e48e3b 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -123,6 +123,16 @@ Transmission process is similar to capture as shown below.
 [shutdown]  close() --------> destruction of the transmission socket and
                               deallocation of all associated resources.
 
+Socket creation and destruction is also straightforward, and is done
+the same way as in capturing described in the previous paragraph:
+
+ int fd = socket(PF_PACKET, mode, 0);
+
+The protocol can optionally be 0 in case we only want to transmit
+via this socket, which avoids an expensive call to packet_rcv().
+In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
+set. Otherwise, use htons(ETH_P_ALL) or any other protocol of choice.
+
 Binding the socket to your network interface is mandatory (with zero copy) to
 know the header size of frames used in the circular buffer.
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ac27c86..cf09061 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -237,6 +237,30 @@ struct packet_skb_cb {
 static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
 static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
+static struct net_device *packet_cached_dev_get(struct packet_sock *po)
+{
+	struct net_device *dev;
+
+	rcu_read_lock();
+	dev = rcu_dereference(po->cached_dev);
+	if (likely(dev))
+		dev_hold(dev);
+	rcu_read_unlock();
+
+	return dev;
+}
+
+static void packet_cached_dev_assign(struct packet_sock *po,
+				     struct net_device *dev)
+{
+	rcu_assign_pointer(po->cached_dev, dev);
+}
+
+static void packet_cached_dev_reset(struct packet_sock *po)
+{
+	RCU_INIT_POINTER(po->cached_dev, NULL);
+}
+
 /* register_prot_hook must be invoked with the po->bind_lock held,
  * or from a context in which asynchronous accesses to the packet
  * socket is not possible (packet_create()).
@@ -246,12 +270,10 @@ static void register_prot_hook(struct sock *sk)
 	struct packet_sock *po = pkt_sk(sk);
 
 	if (!po->running) {
-		if (po->fanout) {
+		if (po->fanout)
 			__fanout_link(sk, po);
-		} else {
+		else
 			dev_add_pack(&po->prot_hook);
-			rcu_assign_pointer(po->cached_dev, po->prot_hook.dev);
-		}
 
 		sock_hold(sk);
 		po->running = 1;
@@ -270,12 +292,11 @@ static void __unregister_prot_hook(struct sock *sk, bool sync)
 	struct packet_sock *po = pkt_sk(sk);
 
 	po->running = 0;
-	if (po->fanout) {
+
+	if (po->fanout)
 		__fanout_unlink(sk, po);
-	} else {
+	else
 		__dev_remove_pack(&po->prot_hook);
-		RCU_INIT_POINTER(po->cached_dev, NULL);
-	}
 
 	__sock_put(sk);
 
@@ -2059,19 +2080,6 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	return tp_len;
 }
 
-static struct net_device *packet_cached_dev_get(struct packet_sock *po)
-{
-	struct net_device *dev;
-
-	rcu_read_lock();
-	dev = rcu_dereference(po->cached_dev);
-	if (dev)
-		dev_hold(dev);
-	rcu_read_unlock();
-
-	return dev;
-}
-
 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 {
 	struct sk_buff *skb;
@@ -2088,7 +2096,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 	mutex_lock(&po->pg_vec_lock);
 
-	if (saddr == NULL) {
+	if (likely(saddr == NULL)) {
 		dev	= packet_cached_dev_get(po);
 		proto	= po->num;
 		addr	= NULL;
@@ -2242,7 +2250,7 @@ static int packet_snd(struct socket *sock,
 	 *	Get and verify the address.
 	 */
 
-	if (saddr == NULL) {
+	if (likely(saddr == NULL)) {
 		dev	= packet_cached_dev_get(po);
 		proto	= po->num;
 		addr	= NULL;
@@ -2451,6 +2459,8 @@ static int packet_release(struct socket *sock)
 
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, false);
+	packet_cached_dev_reset(po);
+
 	if (po->prot_hook.dev) {
 		dev_put(po->prot_hook.dev);
 		po->prot_hook.dev = NULL;
@@ -2506,14 +2516,17 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
 
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, true);
+
 	po->num = protocol;
 	po->prot_hook.type = protocol;
 	if (po->prot_hook.dev)
 		dev_put(po->prot_hook.dev);
-	po->prot_hook.dev = dev;
 
+	po->prot_hook.dev = dev;
 	po->ifindex = dev ? dev->ifindex : 0;
 
+	packet_cached_dev_assign(po, dev);
+
 	if (protocol == 0)
 		goto out_unlock;
 
@@ -2626,7 +2639,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	po = pkt_sk(sk);
 	sk->sk_family = PF_PACKET;
 	po->num = proto;
-	RCU_INIT_POINTER(po->cached_dev, NULL);
+
+	packet_cached_dev_reset(po);
 
 	sk->sk_destruct = packet_sock_destruct;
 	sk_refcnt_debug_inc(sk);
@@ -3337,6 +3351,7 @@ static int packet_notifier(struct notifier_block *this,
 						sk->sk_error_report(sk);
 				}
 				if (msg == NETDEV_UNREGISTER) {
+					packet_cached_dev_reset(po);
 					po->ifindex = -1;
 					if (po->prot_hook.dev)
 						dev_put(po->prot_hook.dev);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next 2/3] net: dev: move inline skb_needs_linearize helper to header
  2013-12-06 10:36 [PATCH net-next 0/3] PF_PACKET updates Daniel Borkmann
  2013-12-06 10:36 ` [PATCH net-next 1/3] packet: fix send path when running with proto == 0 Daniel Borkmann
@ 2013-12-06 10:36 ` Daniel Borkmann
  2013-12-06 10:36 ` [PATCH net-next 3/3] packet: introduce PACKET_QDISC_BYPASS socket option Daniel Borkmann
  2013-12-10  1:24 ` [PATCH net-next 0/3] PF_PACKET updates David Miller
  3 siblings, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2013-12-06 10:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, brouer

As we need it elsewhere, move the inline skb_needs_linearize()
helper function over to the skbuff.h include file. While at it,
also convert the return type from 'int' to 'bool' and add proper
kernel doc.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h | 18 ++++++++++++++++++
 net/core/dev.c         | 15 ---------------
 2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bec1cc7..7100531 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2395,6 +2395,24 @@ static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
 	return buffer;
 }
 
+/**
+ *	skb_needs_linearize - check if we need to linearize a given skb
+ *			      depending on the given device features.
+ *	@skb: socket buffer to check
+ *	@features: net device features
+ *
+ *	Returns true if either:
+ *	1. skb has frag_list and the device doesn't support FRAGLIST, or
+ *	2. skb is fragmented and the device does not support SG.
+ */
+static inline bool skb_needs_linearize(struct sk_buff *skb,
+				       netdev_features_t features)
+{
+	return skb_is_nonlinear(skb) &&
+	       ((skb_has_frag_list(skb) && !(features & NETIF_F_FRAGLIST)) ||
+		(skb_shinfo(skb)->nr_frags && !(features & NETIF_F_SG)));
+}
+
 static inline void skb_copy_from_linear_data(const struct sk_buff *skb,
 					     void *to,
 					     const unsigned int len)
diff --git a/net/core/dev.c b/net/core/dev.c
index ba3b7ea..fc38a36 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2523,21 +2523,6 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_skb_features);
 
-/*
- * Returns true if either:
- *	1. skb has frag_list and the device doesn't support FRAGLIST, or
- *	2. skb is fragmented and the device does not support SG.
- */
-static inline int skb_needs_linearize(struct sk_buff *skb,
-				      netdev_features_t features)
-{
-	return skb_is_nonlinear(skb) &&
-			((skb_has_frag_list(skb) &&
-				!(features & NETIF_F_FRAGLIST)) ||
-			(skb_shinfo(skb)->nr_frags &&
-				!(features & NETIF_F_SG)));
-}
-
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq, void *accel_priv)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net-next 3/3] packet: introduce PACKET_QDISC_BYPASS socket option
  2013-12-06 10:36 [PATCH net-next 0/3] PF_PACKET updates Daniel Borkmann
  2013-12-06 10:36 ` [PATCH net-next 1/3] packet: fix send path when running with proto == 0 Daniel Borkmann
  2013-12-06 10:36 ` [PATCH net-next 2/3] net: dev: move inline skb_needs_linearize helper to header Daniel Borkmann
@ 2013-12-06 10:36 ` Daniel Borkmann
  2013-12-10  1:24 ` [PATCH net-next 0/3] PF_PACKET updates David Miller
  3 siblings, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2013-12-06 10:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, brouer

This patch introduces a PACKET_QDISC_BYPASS socket option that
allows using an xmit() function similar to pktgen's instead of
taking the dev_queue_xmit() path. This can be very useful when
PF_PACKET applications need to be used in a pktgen-like
scenario, but with full, flexible packet payloads that need to
be provided.

By default, nothing changes in behaviour for normal PF_PACKET
TX users, so everything stays as is for applications. New users,
however, can now set PACKET_QDISC_BYPASS if needed to i) prevent
their own packets from reentering packet_rcv() and ii) push
frames directly to the driver.

In doing so we can increase pps (here 64 byte packets) for
PF_PACKET a bit:

  # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
  1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
  2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
  3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
  4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
  5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
  6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
  7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
  8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
  9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
 10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
 11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
 [...]
 20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081

 [**]: qdisc path with packet_rcv(), which is probably how most
       people use it (hopefully not anymore if not needed)

The test was done using a modified trafgen, sending a simple
static 64 byte packet on all CPUs. The trick in the fast
"qdisc path" case is to avoid reentering packet_rcv() by
setting the RAW socket protocol to zero, as in:
socket(PF_PACKET, SOCK_RAW, 0);

Tradeoffs are documented in this patch as well: clearly, if
queues are busy, we will drop more packets, tc disciplines are
ignored, and these packets are not visible to taps anymore. For
a pktgen-like scenario, we argue that this is acceptable.

The pointer to the xmit function has been placed in a packet
socket structure hole between cached_dev and prot_hook, which
is hot anyway as we're working on cached_dev in each send path.

Done in joint work together with Jesper Dangaard Brouer.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 We will send a patch for the man-page after this is accepted.

 Documentation/networking/packet_mmap.txt | 21 ++++++++
 include/uapi/linux/if_packet.h           |  1 +
 net/packet/af_packet.c                   | 91 +++++++++++++++++++++++++++-----
 net/packet/internal.h                    |  1 +
 4 files changed, 102 insertions(+), 12 deletions(-)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 8e48e3b..4288ffa 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -953,6 +953,27 @@ int main(int argc, char **argp)
 }
 
 -------------------------------------------------------------------------------
++ PACKET_QDISC_BYPASS
+-------------------------------------------------------------------------------
+
+If there is a requirement to load the network with many packets in a similar
+fashion to what pktgen does, you might set the following option after socket
+creation:
+
+    int one = 1;
+    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
+
+This has the side effect that packets sent through PF_PACKET will bypass the
+kernel's qdisc layer and are forcibly pushed to the driver directly. This means
+packets are not buffered, tc disciplines are ignored, increased loss can occur
+and such packets are also not visible to other PF_PACKET sockets anymore. So,
+you have been warned; generally, this can be useful for stress testing various
+components of a system.
+
+By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
+on PF_PACKET sockets.
+
+-------------------------------------------------------------------------------
 + PACKET_TIMESTAMP
 -------------------------------------------------------------------------------
 
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index dbf0666..1e24aa7 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -51,6 +51,7 @@ struct sockaddr_ll {
 #define PACKET_TIMESTAMP		17
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
+#define PACKET_QDISC_BYPASS		20
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index cf09061..2f4af56 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -237,6 +237,48 @@ struct packet_skb_cb {
 static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
 static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
+static int packet_direct_xmit(struct sk_buff *skb)
+{
+	struct net_device *dev = skb->dev;
+	const struct net_device_ops *ops = dev->netdev_ops;
+	netdev_features_t features;
+	struct netdev_queue *txq;
+	u16 queue_map;
+	int ret;
+
+	if (unlikely(!netif_running(dev) ||
+		     !netif_carrier_ok(dev))) {
+		kfree_skb(skb);
+		return NET_XMIT_DROP;
+	}
+
+	features = netif_skb_features(skb);
+	if (skb_needs_linearize(skb, features) &&
+	    __skb_linearize(skb)) {
+		kfree_skb(skb);
+		return NET_XMIT_DROP;
+	}
+
+	queue_map = skb_get_queue_mapping(skb);
+	txq = netdev_get_tx_queue(dev, queue_map);
+
+	__netif_tx_lock_bh(txq);
+	if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
+		ret = NETDEV_TX_BUSY;
+		kfree_skb(skb);
+		goto out;
+	}
+
+	ret = ops->ndo_start_xmit(skb, dev);
+	if (likely(dev_xmit_complete(ret)))
+		txq_trans_update(txq);
+	else
+		kfree_skb(skb);
+out:
+	__netif_tx_unlock_bh(txq);
+	return ret;
+}
+
 static struct net_device *packet_cached_dev_get(struct packet_sock *po)
 {
 	struct net_device *dev;
@@ -261,6 +303,16 @@ static void packet_cached_dev_reset(struct packet_sock *po)
 	RCU_INIT_POINTER(po->cached_dev, NULL);
 }
 
+static bool packet_use_direct_xmit(const struct packet_sock *po)
+{
+	return po->xmit == packet_direct_xmit;
+}
+
+static u16 packet_pick_tx_queue(struct net_device *dev)
+{
+	return (u16) smp_processor_id() % dev->real_num_tx_queues;
+}
+
 /* register_prot_hook must be invoked with the po->bind_lock held,
  * or from a context in which asynchronous accesses to the packet
  * socket is not possible (packet_create()).
@@ -1992,9 +2044,10 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 
 	skb_reserve(skb, hlen);
 	skb_reset_network_header(skb);
-	skb_probe_transport_header(skb, 0);
 
-	if (po->tp_tx_has_off) {
+	if (!packet_use_direct_xmit(po))
+		skb_probe_transport_header(skb, 0);
+	if (unlikely(po->tp_tx_has_off)) {
 		int off_min, off_max, off;
 		off_min = po->tp_hdrlen - sizeof(struct sockaddr_ll);
 		off_max = po->tx_ring.frame_size - tp_len;
@@ -2164,12 +2217,13 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			}
 		}
 
+		skb_set_queue_mapping(skb, packet_pick_tx_queue(dev));
 		skb->destructor = tpacket_destruct_skb;
 		__packet_set_status(po, ph, TP_STATUS_SENDING);
 		atomic_inc(&po->tx_ring.pending);
 
 		status = TP_STATUS_SEND_REQUEST;
-		err = dev_queue_xmit(skb);
+		err = po->xmit(skb);
 		if (unlikely(err > 0)) {
 			err = net_xmit_errno(err);
 			if (err && __packet_get_status(po, ph) ==
@@ -2228,8 +2282,7 @@ static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 	return skb;
 }
 
-static int packet_snd(struct socket *sock,
-			  struct msghdr *msg, size_t len)
+static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
 	struct sockaddr_ll *saddr = (struct sockaddr_ll *)msg->msg_name;
@@ -2374,6 +2427,7 @@ static int packet_snd(struct socket *sock,
 	skb->dev = dev;
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
+	skb_set_queue_mapping(skb, packet_pick_tx_queue(dev));
 
 	if (po->has_vnet_hdr) {
 		if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
@@ -2394,16 +2448,12 @@ static int packet_snd(struct socket *sock,
 		len += vnet_hdr_len;
 	}
 
-	skb_probe_transport_header(skb, reserve);
-
+	if (!packet_use_direct_xmit(po))
+		skb_probe_transport_header(skb, reserve);
 	if (unlikely(extra_len == 4))
 		skb->no_fcs = 1;
 
-	/*
-	 *	Now send it
-	 */
-
-	err = dev_queue_xmit(skb);
+	err = po->xmit(skb);
 	if (err > 0 && (err = net_xmit_errno(err)) != 0)
 		goto out_unlock;
 
@@ -2425,6 +2475,7 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
 {
 	struct sock *sk = sock->sk;
 	struct packet_sock *po = pkt_sk(sk);
+
 	if (po->tx_ring.pg_vec)
 		return tpacket_snd(po, msg);
 	else
@@ -2639,6 +2690,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	po = pkt_sk(sk);
 	sk->sk_family = PF_PACKET;
 	po->num = proto;
+	po->xmit = dev_queue_xmit;
 
 	packet_cached_dev_reset(po);
 
@@ -3218,6 +3270,18 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->tp_tx_has_off = !!val;
 		return 0;
 	}
+	case PACKET_QDISC_BYPASS:
+	{
+		int val;
+
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
+		return 0;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3310,6 +3374,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_TX_HAS_OFF:
 		val = po->tp_tx_has_off;
 		break;
+	case PACKET_QDISC_BYPASS:
+		val = packet_use_direct_xmit(po);
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 1035fa2..0a87d7b 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -114,6 +114,7 @@ struct packet_sock {
 	unsigned int		tp_tx_has_off:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
+	int			(*xmit)(struct sk_buff *skb);
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
 };
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next 0/3] PF_PACKET updates
  2013-12-06 10:36 [PATCH net-next 0/3] PF_PACKET updates Daniel Borkmann
                   ` (2 preceding siblings ...)
  2013-12-06 10:36 ` [PATCH net-next 3/3] packet: introduce PACKET_QDISC_BYPASS socket option Daniel Borkmann
@ 2013-12-10  1:24 ` David Miller
  3 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2013-12-10  1:24 UTC (permalink / raw)
  To: dborkman; +Cc: netdev, brouer

From: Daniel Borkmann <dborkman@redhat.com>
Date: Fri,  6 Dec 2013 11:36:14 +0100

> Patch descriptions in individual patches.
> 
> To avoid a possible merge conflict, we suggest taking the first
> patch through the net tree, merging net into net-next, and
> applying the remaining two patches on top. Dave, please let us
> know if you would like to handle this differently.
> 
> For patch 3 we'll send a man-page update as a follow-up.

Patch #1 applied to 'net' and queued up for -stable.

Patch #2 and #3 applied to 'net-next'.

Thanks!

* [PATCH net-next 0/3] pf_packet updates
@ 2014-01-12 16:22 Daniel Borkmann
  0 siblings, 0 replies; 12+ messages in thread
From: Daniel Borkmann @ 2014-01-12 16:22 UTC (permalink / raw)
  To: davem; +Cc: netdev

Daniel Borkmann (3):
  packet: improve socket create/bind latency in some cases
  packet: don't unconditionally schedule() in case of MSG_DONTWAIT
  packet: use percpu mmap tx frame pending refcount

 net/packet/af_packet.c | 105 +++++++++++++++++++++++++++++++++++++++----------
 net/packet/diag.c      |   1 +
 net/packet/internal.h  |   2 +-
 3 files changed, 86 insertions(+), 22 deletions(-)

-- 
1.7.11.7

^ permalink raw reply	[flat|nested] 12+ messages in thread
