Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [net-next-2.6 PATCH 1/4 revised] TCPCT part 1a: extend struct tcp_request_sock
From: William Allen Simpson @ 2009-10-18 15:57 UTC (permalink / raw)
  To: Linux Kernel Network Developers
In-Reply-To: <4AD8AFC0.1090101@gmail.com>

William Allen Simpson wrote:
> Pass additional parameters associated with sending SYNACK.  This
> is not as straightforward or architecturally clean as previously
> proposed, and has the unfortunate side effect of potentially
> including otherwise unneeded headers for related protocols, but
> that problem will affect very few files.
> ---
>  include/net/extend_request_sock.h |   37 
> +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
>  create mode 100644 include/net/extend_request_sock.h
> 
This technique appears to be unworkable:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9971870..30c4808 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -71,6 +71,7 @@
  #include <net/timewait_sock.h>
  #include <net/xfrm.h>
  #include <net/netdma.h>
+#include <net/extend_request_sock.h>

  #include <linux/inet.h>
  #include <linux/ipv6.h>
@@ -1195,6 +1196,15 @@ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
  	.send_reset	=	tcp_v4_send_reset,
  };

+struct request_sock_ops tcp4_extend_request_sock_ops __read_mostly = {
+	.family		=	PF_INET,
+	.obj_size	=	sizeof(struct extend_request_sock),
+	.rtx_syn_ack	=	tcp_v4_send_synack,
+	.send_ack	=	tcp_v4_reqsk_send_ack,
+	.destructor	=	tcp_v4_reqsk_destructor,
+	.send_reset	=	tcp_v4_send_reset,
+};
+

...

+		req = inet_reqsk_alloc(&tcp4_extend_request_sock_ops);
+		if (NULL == req)
+			goto drop;
+

Many hours of investigation demonstrated that the obj_size isn't actually
used to allocate the structure.  Heck, it's not even checked to determine
whether there's enough room!  Instead, the kernel crashes later, as the
extended variables are accessed!

Returning to the architecturally clean parameters of the previous patch
series, that has the distinct advantage of actually working....

^ permalink raw reply related

* Re: PF_RING: Include in main line kernel?
From: Evgeniy Polyakov @ 2009-10-18 14:50 UTC (permalink / raw)
  To: Harald Welte; +Cc: Luca Deri, Brent Cook, Brad Doctor, netdev
In-Reply-To: <20091018125014.GH27747@prithivi.gnumonks.org>

On Sun, Oct 18, 2009 at 02:50:14PM +0200, Harald Welte (laforge@gnumonks.org) wrote:
> > contrary to other socket types, PF_RING allows
> > - packets to be filtered using both BPF and ACL-like filters
> > - parsing information is returned as metadata with the packet (i.e.
> > you don't have to parse the packet again as it happens with BPF)
> > - ACL-like filters allows you to specify advanced features such as
> > port ranges or packet payload match
> 
> So it seems there is some added features over the existing functionality, plus
> probably increased performance mainly to hooking earlier in the packet receive
> flow.
> 
> What would normally be done is to try to make incremental changes
> to the existing code and extend their features/performacne, rather than
> adding something relatively similar alternative.

PF_PACKET as is can not be made faster - it requires a packet copy, so
virtually this is an end of the game, while mapped packet socket is
quite different and does not require that expensive copy. And while
currently difference between both goes down, it still exists and may
hummer some use cases.

PF_RING uses another ring structure and I saw comparisons of both (many
years ago though), where pf_ring was faster. Unfortunately there is no
way to easily adopt its mapping into pf_packet ring without breaking
compatibility, but I wonder whether performance different between both
still exists and can it be a main factor for the preference. If
difference is not visible, than I believe the only way for PF_RING is to
extend existing packet sockets with its other features.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Evgeniy Polyakov @ 2009-10-18 14:18 UTC (permalink / raw)
  To: Harald Welte; +Cc: David Miller, greearb, deri, shemminger, brad.doctor, netdev
In-Reply-To: <20091018124337.GE27747@prithivi.gnumonks.org>

On Sun, Oct 18, 2009 at 02:43:37PM +0200, Harald Welte (laforge@gnumonks.org) wrote:
> How does it make it any easier?  Even right now you can implement an entire
> protocol family in your own module, either by registering as netpoll handler,
> or even using the regular dev_add_pack().

Well, it does, since packet will be processed by the main stack after
that, and module will work with the copy only. But I agree that this is
a weak argument.

If it is still a blocking one, what about implementing additional
gpl-only list of handlers which will have 'consumed' skb check? I
believe it would be enough to put it only in single place after the
bridge?

-- 
	Evgeniy Polyakov

^ permalink raw reply

* [PATCH 4/4 v3] net: Fix for dst_negative_advice
From: Krishna Kumar @ 2009-10-18 13:08 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.

(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/dst.h      |   12 ++++++++++--
 net/core/sock.c        |    6 ++++++
 net/dccp/timer.c       |    4 ++--
 net/decnet/af_decnet.c |    2 +-
 net/ipv4/tcp_timer.c   |    4 ++--
 5 files changed, 21 insertions(+), 7 deletions(-)

diff -ruNp org/include/net/dst.h new/include/net/dst.h
--- org/include/net/dst.h	2009-10-16 21:30:56.000000000 +0530
+++ new/include/net/dst.h	2009-10-16 21:31:30.000000000 +0530
@@ -222,11 +222,19 @@ static inline void dst_confirm(struct ds
 		neigh_confirm(dst->neighbour);
 }
 
-static inline void dst_negative_advice(struct dst_entry **dst_p)
+static inline void dst_negative_advice(struct dst_entry **dst_p,
+				       struct sock *sk)
 {
 	struct dst_entry * dst = *dst_p;
-	if (dst && dst->ops->negative_advice)
+	if (dst && dst->ops->negative_advice) {
 		*dst_p = dst->ops->negative_advice(dst);
+
+		if (dst != *dst_p) {
+			extern void sk_reset_txq(struct sock *sk);
+
+			sk_reset_txq(sk);
+		}
+	}
 }
 
 static inline void dst_link_failure(struct sk_buff *skb)
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/core/sock.c	2009-10-16 21:32:33.000000000 +0530
@@ -352,6 +352,12 @@ discard_and_relse:
 }
 EXPORT_SYMBOL(sk_receive_skb);
 
+void sk_reset_txq(struct sock *sk)
+{
+	sk_tx_queue_clear(sk);
+}
+EXPORT_SYMBOL(sk_reset_txq);
+
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = sk->sk_dst_cache;
diff -ruNp org/net/dccp/timer.c new/net/dccp/timer.c
--- org/net/dccp/timer.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/dccp/timer.c	2009-10-16 21:31:30.000000000 +0530
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct soc
 
 	if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
 		if (icsk->icsk_retransmits != 0)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ?
 			    : sysctl_dccp_request_retries;
 	} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct soc
 			   Golden words :-).
 		   */
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_dccp_retries2;
diff -ruNp org/net/decnet/af_decnet.c new/net/decnet/af_decnet.c
--- org/net/decnet/af_decnet.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/decnet/af_decnet.c	2009-10-16 21:31:30.000000000 +0530
@@ -1955,7 +1955,7 @@ static int dn_sendmsg(struct kiocb *iocb
 	}
 
 	if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
-		dst_negative_advice(&sk->sk_dst_cache);
+		dst_negative_advice(&sk->sk_dst_cache, sk);
 
 	mss = scp->segsize_rem;
 	fctype = scp->services_rem & NSP_FC_MASK;
diff -ruNp org/net/ipv4/tcp_timer.c new/net/ipv4/tcp_timer.c
--- org/net/ipv4/tcp_timer.c	2009-10-16 21:30:56.000000000 +0530
+++ new/net/ipv4/tcp_timer.c	2009-10-16 21:31:30.000000000 +0530
@@ -141,14 +141,14 @@ static int tcp_write_timeout(struct sock
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
-			dst_negative_advice(&sk->sk_dst_cache);
+			dst_negative_advice(&sk->sk_dst_cache, sk);
 		}
 
 		retry_until = sysctl_tcp_retries2;

^ permalink raw reply

* [PATCH 3/4 v3] net: IPv6 changes
From: Krishna Kumar @ 2009-10-18 13:08 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

IPv6: Reset sk_tx_queue_mapping when dst_cache is reset. Use existing
macro to do the work.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/ipv6/inet6_connection_sock.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff -ruNp org/net/ipv6/inet6_connection_sock.c new/net/ipv6/inet6_connection_sock.c
--- org/net/ipv6/inet6_connection_sock.c	2009-10-16 21:29:19.000000000 +0530
+++ new/net/ipv6/inet6_connection_sock.c	2009-10-16 21:31:00.000000000 +0530
@@ -168,8 +168,7 @@ struct dst_entry *__inet6_csk_dst_check(
 	if (dst) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
 		if (rt->rt6i_flow_cache_genid != atomic_read(&flow_cache_genid)) {
-			sk->sk_dst_cache = NULL;
-			dst_release(dst);
+			__sk_dst_reset(sk);
 			dst = NULL;
 		}
 	}

^ permalink raw reply

* [PATCH 2/4 v3] net: Use sk_tx_queue_mapping for connected sockets
From: Krishna Kumar @ 2009-10-18 13:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 net/core/dev.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2009-10-16 18:53:40.000000000 +0530
+++ new/net/core/dev.c	2009-10-16 21:30:38.000000000 +0530
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	const struct net_device_ops *ops = dev->netdev_ops;
-	u16 queue_index = 0;
+	u16 queue_index;
+	struct sock *sk = skb->sk;
+
+	if (sk_tx_queue_recorded(sk)) {
+		queue_index = sk_tx_queue_get(sk);
+	} else {
+		const struct net_device_ops *ops = dev->netdev_ops;
 
-	if (ops->ndo_select_queue)
-		queue_index = ops->ndo_select_queue(dev, skb);
-	else if (dev->real_num_tx_queues > 1)
-		queue_index = skb_tx_hash(dev, skb);
+		if (ops->ndo_select_queue) {
+			queue_index = ops->ndo_select_queue(dev, skb);
+		} else {
+			queue_index = 0;
+			if (dev->real_num_tx_queues > 1)
+				queue_index = skb_tx_hash(dev, skb);
+
+			if (sk && sk->sk_dst_cache)
+				sk_record_tx_queue(sk, queue_index);
+		}
+	}
 
 	skb_set_queue_mapping(skb, queue_index);
 	return netdev_get_tx_queue(dev, queue_index);

^ permalink raw reply

* [PATCH 1/4 v3] net: Introduce sk_tx_queue_mapping
From: Krishna Kumar @ 2009-10-18 13:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1
In-Reply-To: <20091018130727.3960.32107.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>

Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/net/sock.h |   26 ++++++++++++++++++++++++++
 net/core/sock.c    |    5 ++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff -ruNp org/include/net/sock.h new/include/net/sock.h
--- org/include/net/sock.h	2009-10-16 18:53:40.000000000 +0530
+++ new/include/net/sock.h	2009-10-16 21:38:44.000000000 +0530
@@ -107,6 +107,7 @@ struct net;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for UDP/UDP-Lite protocol
  *	@skc_refcnt: reference count
+ *	@skc_tx_queue_mapping: tx queue number for this connection
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_family: network address family
  *	@skc_state: Connection state
@@ -128,6 +129,7 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	atomic_t		skc_refcnt;
+	int			skc_tx_queue_mapping;
 
 	unsigned int		skc_hash;
 	unsigned short		skc_family;
@@ -215,6 +217,7 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
+#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
 
 #define sk_copy_start		__sk_common.skc_hash
 #define sk_hash			__sk_common.skc_hash
@@ -1094,8 +1097,29 @@ static inline void sock_put(struct sock 
 extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			  const int nested);
 
+static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
+{
+	sk->sk_tx_queue_mapping = tx_queue;
+}
+
+static inline void sk_tx_queue_clear(struct sock *sk)
+{
+	sk->sk_tx_queue_mapping = -1;
+}
+
+static inline int sk_tx_queue_get(const struct sock *sk)
+{
+	return sk->sk_tx_queue_mapping;
+}
+
+static inline bool sk_tx_queue_recorded(const struct sock *sk)
+{
+	return (sk && sk->sk_tx_queue_mapping >= 0);
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
+	sk_tx_queue_clear(sk);
 	sk->sk_socket = sock;
 }
 
@@ -1152,6 +1176,7 @@ __sk_dst_set(struct sock *sk, struct dst
 {
 	struct dst_entry *old_dst;
 
+	sk_tx_queue_clear(sk);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = dst;
 	dst_release(old_dst);
@@ -1170,6 +1195,7 @@ __sk_dst_reset(struct sock *sk)
 {
 	struct dst_entry *old_dst;
 
+	sk_tx_queue_clear(sk);
 	old_dst = sk->sk_dst_cache;
 	sk->sk_dst_cache = NULL;
 	dst_release(old_dst);
diff -ruNp org/net/core/sock.c new/net/core/sock.c
--- org/net/core/sock.c	2009-10-16 18:53:40.000000000 +0530
+++ new/net/core/sock.c	2009-10-16 21:29:02.000000000 +0530
@@ -357,6 +357,7 @@ struct dst_entry *__sk_dst_check(struct 
 	struct dst_entry *dst = sk->sk_dst_cache;
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+		sk_tx_queue_clear(sk);
 		sk->sk_dst_cache = NULL;
 		dst_release(dst);
 		return NULL;
@@ -953,7 +954,8 @@ static void sock_copy(struct sock *nsk, 
 	void *sptr = nsk->sk_security;
 #endif
 	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
-		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt));
+		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
+		     sizeof(osk->sk_tx_queue_mapping));
 	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
 	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
@@ -997,6 +999,7 @@ static struct sock *sk_prot_alloc(struct
 
 		if (!try_module_get(prot->owner))
 			goto out_free_sec;
+		sk_tx_queue_clear(sk);
 	}
 
 	return sk;

^ permalink raw reply

* [PATCH 0/4 v3] net: Implement fast TX queue selection
From: Krishna Kumar @ 2009-10-18 13:07 UTC (permalink / raw)
  To: davem; +Cc: netdev, herbert, Krishna Kumar, dada1

From: Krishna Kumar <krkumar2@in.ibm.com>

Notes:
    1.  Eric suggested:
	- To use u16 for txq#, but I am using an "int" for now as that
	  avoids one unnecessary subtraction during tx.
	- An improvement of caching the txq at connection establishment
	  time (TBD later) so as to use rxq# = txq#.
	- Drivers can call sk_tx_queue_set() to set the txq if they are
	  going to call skb_tx_hash() internally.
    2. v3 patch stress tested with 1000 netperfs, reboot's, etc.

Changelog [from v2]
--------------------
	1. Changed names of functions setting, getting and returning the
	   txq#; and added a new one to reset the txq#.
	2. Free sk doesn't need to reset txq#.

Changelog [from v1]
--------------------
	1. Changed IPv6 code to call __sk_dst_reset() directly.
	2. Removed the patch re-arranging ("encapsulating") __sk_dst_reset()

Multiqueue cards on routers/firewalls set skb->queue_mapping on
input which helps in faster xmit. Implement fast queue selection
for locally generated packets also, by saving the txq# for
connected sockets (in dev_pick_tx) and use it in subsequent
iterations. Locally generated packets for a connection will xmit
on the same txq, but routing & firewall loads should not be
affected by this patch. Tests shows the distribution across txq's
for 1-4 netperf sessions is similar to existing code.


                   Testing & results:
                   ------------------

1. Cycles/Iter (C/I) used by dev_pick_tx:
         (B -> Billion,   M -> Million)
   |--------------|------------------------|------------------------|
   |              |          ORG           |          NEW           |
   |  Test        |--------|---------|-----|--------|---------|-----|
   |              | Cycles |  Iters  | C/I | Cycles | Iters   | C/I |
   |--------------|--------|---------|-----|--------|---------|-----|
   | [TCP_STREAM, | 3.98 B | 12.47 M | 320 | 1.95 B | 12.92 M | 152 |
   |  UDP_STREAM, |        |         |     |        |         |     |         
   |  TCP_RR,     |        |         |     |        |         |     |        
   |  UDP_RR]     |        |         |     |        |         |     |        
   |--------------|--------|---------|-----|--------|---------|-----|        
   | [TCP_STREAM, | 8.92 B | 29.66 M | 300 | 3.82 B | 38.88 M | 98  |        
   |  TCP_RR,     |        |         |     |        |         |     |         
   |  UDP_RR]     |        |         |     |        |         |     |         
   |--------------|--------|---------|-----|--------|---------|-----|

2. Stress test (over 48 hours) : 1000 netperfs running combination
   of TCP_STREAM/RR, UDP_STREAM/RR (v4/6, NODELAY/~NODELAY for all
   tests), with some ssh sessions, reboots, modprobe -r driver, etc.

3. Performance test (10 hours): Single 10 hour netperf run of
   TCP_STREAM/RR, TCP_STREAM + NO_DELAY and UDP_RR. Results show an
   improvement in both performance and cpu utilization.

Tested on a 4-processor AMD Opteron 2.8 GHz system with 1GB memory,
10G Chelsio card. Each BW number is the sum of 3 iterations of
individual tests using 512, 16K, 64K & 128K I/O sizes, in Mb/s:

------------------------  TCP Tests  -----------------------
#procs  Org BW     New BW (%)     Org SD     New SD (%)
------------------------------------------------------------
1       77777.7    81011.0 (4.15)    42.3     40.2 (-5.11)
4       91599.2    91878.8 (.30)    955.9    919.3 (-3.83)
6       89533.3    91792.2 (2.52)  2262.0   2143.0 (-5.25)
8       87507.5    89161.9 (1.89)  4363.4   4073.6 (-6.64)
10      85152.4    85607.8 (.53)   6890.4   6851.2 (-.56)
------------------------------------------------------------

------------------------- TCP NO_DELAY Tests ---------------
#procs  Org BW     New BW (%)      Org SD      New SD (%)
------------------------------------------------------------
1       57001.9    57888.0 (1.55)     67.7      70.2 (3.75)
4       69555.1    69957.4 (.57)     823.0     834.3 (1.36)
6       71359.3    71918.7 (.78)    1740.8    1724.5 (-.93)
8       72577.6    72496.1 (-.11)   2955.4    2937.7 (-.59)
10      70829.6    71444.2 (.86)    4826.1    4673.4 (-3.16)
------------------------------------------------------------

----------------------- Request Response Tests --------------------
#procs  Org TPS     New TPS (%)      Org SD    New SD (%)
(1-10)
-------------------------------------------------------------------
TCP     1019245.9   1042626.4 (2.29) 16352.9   16459.8 (.65)
UDP     934598.64   942956.9  (.89)  11607.3   11593.2 (-.12)
-------------------------------------------------------------------

Thanks,

- KK

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:56 UTC (permalink / raw)
  To: Luca Deri; +Cc: David Miller, bcook, brad.doctor, netdev
In-Reply-To: <C803B824-8249-44BA-ACCD-D9AE4AA21F92@ntop.org>

Dear Luca,

On Wed, Oct 14, 2009 at 10:34:17PM +0200, Luca Deri wrote:

> so do you want me to start porting PF_RING facilities into
> PF_PACKET? As I have said I;m not against this: my goal is to
> include this work into the linux kernel, as it has been separate for
> too long.

I would suggest you do some analysis of each of the features that PF_RING
currently offers, and think about how each those features could be added
as an incremental change to PF_PACKET.

Then post this as a RFC to the list, see what people think about each of those
issues and follow with the actual implementation.  If you think the actual
implementation might be easier than to make a written proposal, that would
of course also be great.

Regards,
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:50 UTC (permalink / raw)
  To: Luca Deri; +Cc: Brent Cook, Brad Doctor, netdev
In-Reply-To: <903D7FEC-34E8-4F70-ABC0-E56A3A5FBBE6@ntop.org>

On Wed, Oct 14, 2009 at 09:54:26PM +0200, Luca Deri wrote:
> Brent
> contrary to other socket types, PF_RING allows
> - packets to be filtered using both BPF and ACL-like filters
> - parsing information is returned as metadata with the packet (i.e.
> you don't have to parse the packet again as it happens with BPF)
> - ACL-like filters allows you to specify advanced features such as
> port ranges or packet payload match

So it seems there is some added features over the existing functionality, plus
probably increased performance mainly to hooking earlier in the packet receive
flow.

What would normally be done is to try to make incremental changes
to the existing code and extend their features/performacne, rather than
adding something relatively similar alternative.

> I agree with you that PF_RING has some overlaps with PACKET_RX/
> TX_RING, but the main idea behind PF_RING is not just to accelerate
> packet capture. For instance in PF_RING you can have actions
> attached to rules, or extend PF_RING filtering/packet handling by
> means of plugins.

Is this in the actual kernel code?  I am not sure whether people generally want
to see ayet another packet filter in Linux ;)

Regards,
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:43 UTC (permalink / raw)
  To: David Miller; +Cc: greearb, zbr, deri, shemminger, brad.doctor, netdev
In-Reply-To: <20091014.144923.112167161.davem@davemloft.net>

Hi Dave,

On Wed, Oct 14, 2009 at 02:49:23PM -0700, David Miller wrote:

> > Maybe something similar to the attached patch?
> 
> This is not something I'm interested in applying.
> 
> It makes implementing proprietary complete networking stacks
> for Linux way too easy.

How does it make it any easier?  Even right now you can implement an entire
protocol family in your own module, either by registering as netpoll handler,
or even using the regular dev_add_pack().

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:45 UTC (permalink / raw)
  To: Ben Greear
  Cc: Evgeniy Polyakov, David Miller, deri, shemminger, brad.doctor,
	netdev
In-Reply-To: <4AD74EF1.6080106@candelatech.com>

Hi Ben,

On Thu, Oct 15, 2009 at 09:33:53AM -0700, Ben Greear wrote:

> Bonding, bridging, mac-vlans, pktgen-rx logic, sniffers, and others could.
> The only trick is ordering...it may be better to keep the hard-coded hooks in
> a definite order than to allow users to switch them around.

why does this sound like the netfilter hooks with their priorities to me?  The
only difference is that netfilter hooks are inside the layer 3 protocols at the
moment, whereas this is one layer lower...

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* [OT] ntop / GPL (was Re: PF_RING: Include in main line kernel?)
From: Harald Welte @ 2009-10-18 12:47 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Brad Doctor, netdev, Luca Deri
In-Reply-To: <4AD60053.1030804@gmail.com>

Hi Jarek, Brad, Luca,

[putting my gpl-violations.org hat on]

On Wed, Oct 14, 2009 at 06:46:11PM +0200, Jarek Poplawski wrote:
> Brad Doctor wrote, On 10/14/2009 04:33 PM:
> 
> > Download ntop
> > 
> > ntop is distributed under the GNU GPL. In order to be entitled to download
> > ntop you must accept the GNU license. 
> 
> I can't find such a thing neither in GNU GPL v2:

This is true.  The GPL does never need to be accepted for mere use (i.e.
running) the program.  This is at least true for the continental european
copyright systems, where any legally obtained copy of a program implicitly
carries the permission for running the program.  Only for any other activity
you will need to accept the license.

but, like others posted in this thread, ntop is not the PF_RING code.

-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* Re: PF_RING: Include in main line kernel?
From: Harald Welte @ 2009-10-18 12:38 UTC (permalink / raw)
  To: Brad Doctor; +Cc: netdev, Luca Deri
In-Reply-To: <a07586b0910140733s1976cd05u39286de42d9fac23@mail.gmail.com>

Hi Brad and Luca,

On Wed, Oct 14, 2009 at 08:33:08AM -0600, Brad Doctor wrote:

> On behalf of the users and developers of the PF_RING project, we would
> like to ask consideration to include the PF_RING module in the main
> line kernel.

First of all, let me state that I think the mainline support for nProbe/nTop is
something that I have been hoping for many years.  I think the performance you
are achieving is remarkable, and it would be very usable to have this
capability of high performance zero-copy packet access from userspace as a
stock feature of the Linux kernel.

The actual PF_RING implementation has been criticized a couple of times even in
the past.  One general point I remember from past discussions in the kernel
network community was that there is too much overlap with PF_PACKET, and that
this could possibly be extended with a ring buffer rather than replaced with a
fairly similar alternative mechanism.

So let's see what kind of solution the current discussion thread will come up
with... let's hope eventually we'll have the functionality in the kernel.
-- 
- Harald Welte <laforge@gnumonks.org>           http://laforge.gnumonks.org/
============================================================================
"Privacy in residential applications is a desirable marketing option."
                                                  (ETSI EN 300 175-7 Ch. A6)

^ permalink raw reply

* [PATCH][RFC]: ingress socket filter by mark
From: jamal @ 2009-10-18 12:42 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Atis Elsts, eric.dumazet, Maciej Żenczykowski

[-- Attachment #1: Type: text/plain, Size: 960 bytes --]

Maciej forced me to dig into this ;->

at the socket level if a packet arrives with a different mark than
what we bind to, drop it. I have tested this patch and it drops a packet
with mismatching mark.

There are several approaches - and i think the patch suggestion i have
made here maybe too strict. I assume that if someone binds to a mark,
they want to not only send packets with that mark but receive
only if that mark is set. 
A looser check would be something along the line accept as well if mark
is not set i.e
if (sk->sk_mark && skb->mark && sk->sk_mark != skb->mark)

Alternatively i could add one bit in the socket flags and have it so
that check is made only if app has been explicit:
if (sock_flag(sk, SOCK_CHK_SOMARK) && sk->sk_mark != skb->mark) drop

Another approach  is to set sock filter from app. I dont like this
approach because it will be the least usable from app level and would be
the least simple from kernel level.

cheers,
jamal

[-- Attachment #2: filt-sock-m --]
[-- Type: text/x-patch, Size: 375 bytes --]

diff --git a/net/core/filter.c b/net/core/filter.c
index d1d779c..6fcf577 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,9 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
 	if (err)
 		return err;

+	if (sk->sk_mark && sk->sk_mark != skb->mark)
+		return -EPERM;
+
 	rcu_read_lock_bh();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter) {

^ permalink raw reply related

* Re: [PATCH] net: Fix RPF to work with policy routing
From: jamal @ 2009-10-18 12:13 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Atis Elsts, eric.dumazet, Maciej Żenczykowski
In-Reply-To: <1255867954.4815.25.camel@dogo.mojatatu.com>

On Sun, 2009-10-18 at 08:12 -0400, jamal wrote:
> policy routing never worked with mark.

I meant policy routing, mark and RPF never worked together ;->

cheers,
jamal


^ permalink raw reply

* [PATCH] net: Fix RPF to work with policy routing
From: jamal @ 2009-10-18 12:12 UTC (permalink / raw)
  To: netdev, David Miller; +Cc: Atis Elsts, eric.dumazet, Maciej Żenczykowski

[-- Attachment #1: Type: text/plain, Size: 129 bytes --]


policy routing never worked with mark.

I tested this with ping and the skbedit patch i posted a few days back.

cheers,
jamal


[-- Attachment #2: policy-mark-rpf --]
[-- Type: text/plain, Size: 3076 bytes --]

commit f7c6fd2465d8e6f4f89c5d1262da10b4a6d499d0
Author: Jamal Hadi Salim <hadi@cyberus.ca>
Date:   Sun Oct 18 08:04:51 2009 -0400

    [PATCH] net: Fix RPF to work with policy routing
    Policy routing is not looked up by mark on reverse path filtering.
    This fixes it.
    
    Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index ef91fe9..4d22fab 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -210,7 +210,8 @@ extern struct fib_table *fib_get_table(struct net *net, u32 id);
 extern const struct nla_policy rtm_ipv4_policy[];
 extern void		ip_fib_init(void);
 extern int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
-			       struct net_device *dev, __be32 *spec_dst, u32 *itag);
+			       struct net_device *dev, __be32 *spec_dst,
+			       u32 *itag, u32 mark);
 extern void fib_select_default(struct net *net, const struct flowi *flp,
 			       struct fib_result *res);
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index e2f9505..aa00398 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -229,14 +229,17 @@ unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
  */
 
 int fib_validate_source(__be32 src, __be32 dst, u8 tos, int oif,
-			struct net_device *dev, __be32 *spec_dst, u32 *itag)
+			struct net_device *dev, __be32 *spec_dst,
+			u32 *itag, u32 mark)
 {
 	struct in_device *in_dev;
 	struct flowi fl = { .nl_u = { .ip4_u =
 				      { .daddr = src,
 					.saddr = dst,
 					.tos = tos } },
+			    .mark = mark,
 			    .iif = oif };
+
 	struct fib_result res;
 	int no_addr, rpf;
 	int ret;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 278f46f..9744fc5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1852,7 +1852,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 			goto e_inval;
 		spec_dst = inet_select_addr(dev, 0, RT_SCOPE_LINK);
 	} else if (fib_validate_source(saddr, 0, tos, 0,
-					dev, &spec_dst, &itag) < 0)
+					dev, &spec_dst, &itag, 0) < 0)
 		goto e_inval;
 
 	rth = dst_alloc(&ipv4_dst_ops);
@@ -1965,7 +1965,7 @@ static int __mkroute_input(struct sk_buff *skb,
 
 
 	err = fib_validate_source(saddr, daddr, tos, FIB_RES_OIF(*res),
-				  in_dev->dev, &spec_dst, &itag);
+				  in_dev->dev, &spec_dst, &itag, skb->mark);
 	if (err < 0) {
 		ip_handle_martian_source(in_dev->dev, in_dev, skb, daddr,
 					 saddr);
@@ -2139,7 +2139,7 @@ static int ip_route_input_slow(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		int result;
 		result = fib_validate_source(saddr, daddr, tos,
 					     net->loopback_dev->ifindex,
-					     dev, &spec_dst, &itag);
+					     dev, &spec_dst, &itag, skb->mark);
 		if (result < 0)
 			goto martian_source;
 		if (result)
@@ -2168,7 +2168,7 @@ brd_input:
 		spec_dst = inet_select_addr(dev, 0, RT_SCOPE_LINK);
 	else {
 		err = fib_validate_source(saddr, 0, tos, 0, dev, &spec_dst,
-					  &itag);
+					  &itag, skb->mark);
 		if (err < 0)
 			goto martian_source;
 		if (err)

^ permalink raw reply related

* Re: [PATCH] iputils: ping by mark
From: jamal @ 2009-10-18 11:37 UTC (permalink / raw)
  To: Maciej Żenczykowski; +Cc: Rob.Townley, YOSHIFUJI Hideaki, netdev
In-Reply-To: <55a4f86e0910171846w45dda0e9w86f570b087c1543d@mail.gmail.com>

On Sat, 2009-10-17 at 18:46 -0700, Maciej Żenczykowski wrote:
> Try it with a udp packet or a tcp connection - so_mark and ip rule
> fwmark only work for raw sockets (and maybe some other special cases),
> unless you're lucky and the ip(6)tables mangle module just happens to
> rerun the routing decision (because it mangles the packet in some
> other way...).

It works fine with tcp and udp and to emphasize: i have never seen it
broken.
Above you mention iptables - I dont use it and that maybe the missing
part in our discussion.
I should note though that rpf is broken with policy routing;-> Now that
you got me going on this, I will post a patch. 

cheers,
jamal

^ permalink raw reply

* Re: Question about IPV6 forwarding and proxy_ndp
From: Maciej Żenczykowski @ 2009-10-18  7:37 UTC (permalink / raw)
  To: Linux Networking, YOSHIFUJI Hideaki
In-Reply-To: <55a4f86e0910180030s52484bffs4ca42eb1ff4d5131@mail.gmail.com>

Just a quick follow up.

The current logic for when to do proxy ndp seems to be:

network_device_is_forwarding && (global_proxy_ndp_is_on ||
network_device_is configured_for_proxy_ndp)

Maybe this should be:

(network_device_is_forwarding && global_proxy_ndp_is_on) ||
network_device_is configured_for_proxy_ndp

After all, if the admin has explicitly set proxy ndp on this specific
device, then maybe he knows best?

Alternatively it could also just be:

global_proxy_ndp_is_on || network_device_is configured_for_proxy_ndp

and not bother with checking forwarding at all.


[It should be pointed out that ipv6/conf/all/forwarding is a very
different beast than ipv6/conf/<device>/forwarding, the first globally
turns on ipv6 forwarding, the second switches a device between 'I am a
host' and 'I am a router' modes of operation]

^ permalink raw reply

* Question about IPV6 forwarding and proxy_ndp
From: Maciej Żenczykowski @ 2009-10-18  7:30 UTC (permalink / raw)
  To: Linux Networking, YOSHIFUJI Hideaki

I would like to have a machine:

* do IPv6 forwarding
- thus "echo 1 > /proc/sys/net/ipv6/conf/all/forwarding"

* accept router advertisements and default route from eth0
- thus "echo 0 > /proc/sys/net/ipv6/conf/eth0/forwarding"

* do proxy NDP on eth0 for a specific v6 address
- thus "echo 1 > /proc/sys/net/ipv6/conf/eth0/proxy_ndp"
- and "ip -6 neigh add proxy ${PROXIED_V6_IP} dev eth0"

Problem is, this doesn't work (no ndp responses for the proxied v6 ip)...
While, if I "echo 1 > /proc/sys/net/ipv6/conf/eth0/forwarding" than
proxy ndp starts working, but I lose my default route...

I think the problem is at:

http://lxr.linux.no/linux+v2.6.31/net/ipv6/ndisc.c#L832

 831                if (ipv6_chk_acast_addr(net, dev, &msg->target) ||
 832                    (idev->cnf.forwarding &&
 833                     (net->ipv6.devconf_all->proxy_ndp ||
idev->cnf.proxy_ndp) &&
 834                     (is_router = pndisc_is_router(&msg->target,
dev)) >= 0)) {
 835                        if (!(NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED) &&
 836                            skb->pkt_type != PACKET_HOST &&
 837                            inc != 0 &&
 838                            idev->nd_parms->proxy_delay != 0) {

notice that we require idev->cnf.forwarding to be true.

Should this perhaps be changed to
    (net->ipv6.devconf_all->forwarding || idev->cnf.forwarding)

or even just
    net->ipv6.devconf_all->forwarding
?

It's not clear to me what the reasoning behind that if statement is...

- Maciej

^ permalink raw reply

* Re: [PATCH] genetlink: Optimize and one bug fix in genl_generate_id()
From: David Miller @ 2009-10-18  6:58 UTC (permalink / raw)
  To: krkumar2; +Cc: netdev
In-Reply-To: <20091015055507.30128.60763.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Thu, 15 Oct 2009 11:25:07 +0530

> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> 1. GENL_MIN_ID is a valid id -> no need to start at
>    GENL_MIN_ID + 1.
> 2. Avoid going through the ids two times: If we start at
>    GENL_MIN_ID+1 (*or bigger*) and all ids are over!, the
>    code iterates through the list twice (*or lesser*).
> 3. Simplify code - no need to start at idx=0 which gets
>    reset to GENL_MIN_ID.
> 
> Patch on net-next-2.6. Reboot test shows that first id
> passed to genl_register_family was 16, next two were
> GENL_ID_GENERATE and genl_generate_id returned 17 & 18
> (user level testing of same code shows expected values
> across entire range of MIN/MAX).
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] genetlink: Optimize genl_register_family()
From: David Miller @ 2009-10-18  6:58 UTC (permalink / raw)
  To: krkumar2; +Cc: netdev
In-Reply-To: <20091015055453.30128.12160.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Thu, 15 Oct 2009 11:24:53 +0530

> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> genl_register_family() doesn't need to call genl_family_find_byid
> when GENL_ID_GENERATE is passed during register.
> 
> Patch on net-next-2.6, compile and reboot testing only.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

Applied.

^ permalink raw reply

* Re: [PATCH] cxgb3: No need to wake queue in xmit handler
From: David Miller @ 2009-10-18  6:57 UTC (permalink / raw)
  To: krkumar2; +Cc: netdev, divy
In-Reply-To: <20091015055419.30109.16062.sendpatchset@localhost.localdomain>

From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Thu, 15 Oct 2009 11:24:19 +0530

> The xmit handler doesn't need to wake the queue after stopping
> it temporarily (some other drivers are doing the same).
> 
> Patch on net-next-2.6, multiple netperf sessions tested.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

Applied.

^ permalink raw reply

* Re: [patch 0/5] s390: networking patches for linux-next
From: David Miller @ 2009-10-18  6:57 UTC (permalink / raw)
  To: ursula.braun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens
In-Reply-To: <20091015085454.124154000@linux.vnet.ibm.com>

From: Ursula Braun <ursula.braun@de.ibm.com>
Date: Thu, 15 Oct 2009 10:54:54 +0200

> here are small improvements for net/iucv/ and drivers/s390/net/ .
> They apply to linux-next.

All applied, thanks.

^ permalink raw reply

* Re: [PATCH] can: provide library functions for skb allocation
From: David Miller @ 2009-10-18  6:55 UTC (permalink / raw)
  To: wg; +Cc: netdev, socketcan-core, haas, anantgole, mkl
In-Reply-To: <4AD6E9CA.3070605@grandegger.com>

From: Wolfgang Grandegger <wg@grandegger.com>
Date: Thu, 15 Oct 2009 11:22:18 +0200

> +	skb->protocol = __constant_htons(ETH_P_CAN);

Please don't use __constant_htonX() for runtime invocatios.
It's only for situation which must be compile time evaluations
such as case statements and static initializations.

GCC can figure out that's it's constant if you just use
plan htonX().

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox