Netdev List

* Re: [PATCH] net: reduce net_rx_action() latency to 2 HZ
From: Eric Dumazet @ 2013-03-21 17:43 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: David Miller, netdev, stable, Willy Tarreau, Tom Herbert,
	Steven Rostedt
In-Reply-To: <514B429C.5070605@windriver.com>

On Thu, 2013-03-21 at 13:25 -0400, Paul Gortmaker wrote:

> That is also reasonably portable back to 2.6.34.  And it is more
> interesting too: it will matter in a preempt_rt context as well,
> once RT moves ahead off the current 3.6 baseline, which still
> has the old count-limit of 10 vs the new 2ms time limit.
> 
> RT (3.4 and 3.6 based) currently has this patch from Steven:
> http://git.kernel.org/cgit/linux/kernel/git/paulg/3.6-rt-patches.git/tree/net-tx-action-avoid-livelock-on-rt.patch

Interesting, as Google has an internal patch removing this trylock() as
well.

I think I should upstream it eventually ;)

commit 2f0a3f573b531dc57c268fd809dc65169edae369
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Dec 13 09:18:01 2012 -0800

    net-dev_xmit_hold_queues: fix a busy loop in net_tx_action
    
    Under load, net_tx_action() fails to acquire qdisc lock
    and reschedules qdisc in a never ending loop.
    
    The spin_trylock() has almost no chance to succeed because
    of ticket spinlocks and xmit_hold_queue holding the lock for long
    periods of time.
    

^ permalink raw reply

* Re: [PATCH 2/3] netlink: Remove an unused pointer in netlink_skb_parms
From: Andy Lutomirski @ 2013-03-21 17:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <87d2ut1cpj.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

On Wed, Mar 20, 2013 at 11:36 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> net/ipv4/inet_diag.c:                      sk_user_ns(NETLINK_CB(in_skb).ssk),
> net/ipv4/inet_diag.c:                             sk_user_ns(NETLINK_CB(cb->skb).ssk),
> net/ipv4/inet_diag.c:                                          sk_user_ns(NETLINK_CB(cb->skb).ssk),
> net/ipv4/udp_diag.c:                    sk_user_ns(NETLINK_CB(cb->skb).ssk),
> net/ipv4/udp_diag.c:                       sk_user_ns(NETLINK_CB(in_skb).ssk),
> net/netfilter/nfnetlink_log.c:                                         sk_user_ns(NETLINK_CB(skb).ssk));
> net/netlink/af_netlink.c:               NETLINK_CB(skb).ssk = ssk;
> net/sched/cls_flow.c:               sk_user_ns(NETLINK_CB(in_skb).ssk) != &init_user_ns)
>
> I count 8 uses.

Whoops.  I clearly fail at grepping and building with the correct
configuration.  I'll drop this patch entirely -- it's independent of
the other two.

>
> Eric
>
>>
>> diff --git a/include/linux/netlink.h b/include/linux/netlink.h
>> index e0f746b..9ac1201 100644
>> --- a/include/linux/netlink.h
>> +++ b/include/linux/netlink.h
>> @@ -19,7 +19,6 @@ struct netlink_skb_parms {
>>       struct scm_creds        creds;          /* Skb credentials      */
>>       __u32                   portid;
>>       __u32                   dst_group;
>> -     struct sock             *ssk;
>>  };
>>
>>  #define NETLINK_CB(skb)              (*(struct netlink_skb_parms*)&((skb)->cb))



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH net-next] net: fix psock_fanout selftest hash collision
From: Willem de Bruijn @ 2013-03-21 17:27 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David Miller, netdev
In-Reply-To: <514AA946.6020603@redhat.com>

On Thu, Mar 21, 2013 at 2:31 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
> On 03/21/2013 01:07 AM, Willem de Bruijn wrote:
>>
>> On Wed, Mar 20, 2013 at 1:59 PM, David Miller <davem@davemloft.net> wrote:
>>>
>>> From: David Miller <davem@davemloft.net>
>>> Date: Wed, 20 Mar 2013 12:33:44 -0400 (EDT)
>>>
>>>> From: Willem de Bruijn <willemb@google.com>
>>>> Date: Wed, 20 Mar 2013 02:42:44 -0400
>>>>
>>>>> Fix flaky results with PACKET_FANOUT_HASH depending on whether the
>>>>> two flows hash into the same packet socket or not.
>>>>>
>>>>> Also adds tests for PACKET_FANOUT_LB and PACKET_FANOUT_CPU and
>>>>> replaces the counting method with a packet ring.
>>>>>
>>>>> Signed-off-by: Willem de Bruijn <willemb@google.com>
>>>>
>>>>
>>>> Applied, thanks.  I'll retest on my sparc64 box later today.
>>>
>>>
>>> Unfortunately, it's still broken there:
>>
>>
>> This looks like a new problem. Now the counters all stay zero.
>>
>> I am looking into it. I have not been able to reproduce this on my
>> x86_64 so far, so just brought a sparc32 up in qemu. Had less luck
>> with sparc64, but impressive that it works at all. Come to think of
>> it, is this a 64-bit kernel with 32-bit userland? Perhaps that
>> affects packet ring memory layout.
>
>
> That can affect the ring buffer in case of TPACKET_V1, which is default
> if not specified otherwise. See Documentation/networking/packet_mmap.txt
> +514

Thanks, Daniel. In that case, the following should fix it.
Unfortunately, I don't have the hardware to verify, but it still
passes on my platforms. Let me know if you prefer it as a regular
patch instead of inline.

diff --git a/tools/testing/selftests/net/psock_fanout.c b/tools/testing/selftests/net/psock_fanout.c
index 226e5e3..59bd636 100644
--- a/tools/testing/selftests/net/psock_fanout.c
+++ b/tools/testing/selftests/net/psock_fanout.c
@@ -182,7 +182,13 @@ static char *sock_fanout_open_ring(int fd)
                .tp_frame_nr   = RING_NUM_FRAMES,
        };
        char *ring;
+       int val = TPACKET_V2;

+       if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, (void *) &val,
+                      sizeof(val))) {
+               perror("packetsock ring setsockopt version");
+               exit(1);
+       }
        if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req,
                       sizeof(req))) {
                perror("packetsock ring setsockopt");
@@ -201,7 +207,7 @@ static char *sock_fanout_open_ring(int fd)

 static int sock_fanout_read_ring(int fd, void *ring)
 {
-       struct tpacket_hdr *header = ring;
+       struct tpacket2_hdr *header = ring;
        int count = 0;

        while (header->tp_status & TP_STATUS_USER && count < RING_NUM_FRAMES) {

^ permalink raw reply related
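
For readers following the TPACKET_V1 versus TPACKET_V2 point in the message above, here is a minimal sketch of why the V1 frame header is sensitive to a 64-bit kernel with 32-bit userland. The two structs are local mirrors written for illustration (the demo_ names are hypothetical, and the V2 mirror lists only the leading fields); the real definitions live in <linux/if_packet.h>. The V1 header starts with an unsigned long, so every later field moves when the word size changes, while the V2 header uses fixed-width types only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the TPACKET_V1 frame header: tp_status is an
 * unsigned long, so its width follows the ABI of whoever compiles it. */
struct demo_tpacket_v1_hdr {
	unsigned long	tp_status;
	unsigned int	tp_len;
	unsigned int	tp_snaplen;
	unsigned short	tp_mac;
	unsigned short	tp_net;
	unsigned int	tp_sec;
	unsigned int	tp_usec;
};

/* Local mirror of the leading TPACKET_V2 fields: fixed-width types only. */
struct demo_tpacket_v2_hdr {
	uint32_t	tp_status;
	uint32_t	tp_len;
	uint32_t	tp_snaplen;
	uint16_t	tp_mac;
	uint16_t	tp_net;
	uint32_t	tp_sec;
	uint32_t	tp_nsec;
};

/* Offset of tp_len in the V1 mirror: equals sizeof(unsigned long). */
size_t demo_v1_len_offset(void)
{
	return offsetof(struct demo_tpacket_v1_hdr, tp_len);
}

/* Offset of tp_len in the V2 mirror: 4 on every ABI. */
size_t demo_v2_len_offset(void)
{
	return offsetof(struct demo_tpacket_v2_hdr, tp_len);
}

/* Usage: the V1 offset tracks the word size, the V2 offset does not. */
void demo_tpacket_layout_check(void)
{
	assert(demo_v1_len_offset() == sizeof(unsigned long));
	assert(demo_v2_len_offset() == 4);
}
```

Built as 32-bit and as 64-bit, demo_v1_len_offset() gives different answers, which is the kind of mismatch that would leave a reader polling the wrong status word; demo_v2_len_offset() stays the same.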

* Re: [PATCH net 3/5] net/mlx4_en: Remove ethtool flow steering rules before releasing QPs
From: Sergei Shtylyov @ 2013-03-21 18:28 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: davem, netdev, amirv, jackm, hadarh
In-Reply-To: <1363881355-21137-4-git-send-email-ogerlitz@mellanox.com>

Hello.

On 03/21/2013 06:55 PM, Or Gerlitz wrote:

> From: Hadar Hen Zion <hadarh@mellanox.com>
>
> Fix the ethtool flow steering rules cleanup to be carried out before
> releasing the RX QPs.
>
> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   22 +++++++++++-----------
>   1 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 995d4b6..f278b10 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -1637,6 +1637,17 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
>   	/* Flush multicast filter */
>   	mlx4_SET_MCAST_FLTR(mdev->dev, priv->port, 0, 1, MLX4_MCAST_CONFIG);
>   
> +	/* Remove flow steering rules for the port*/

    Could you add a space before */, even though it was missing before?

WBR, Sergei

^ permalink raw reply

* Re: [PATCH] net: reduce net_rx_action() latency to 2 HZ
From: Paul Gortmaker @ 2013-03-21 17:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, stable, Willy Tarreau, Tom Herbert,
	Steven Rostedt
In-Reply-To: <1363879647.4431.8.camel@edumazet-glaptop>

On 13-03-21 11:27 AM, Eric Dumazet wrote:
> On Thu, 2013-03-21 at 11:03 -0400, Paul Gortmaker wrote:
>> [CC'ing stable & Willy - for the older releases not fed by
>> http://patchwork.ozlabs.org/bundle/davem/stable/ ]
>>
>> On Tue, Mar 5, 2013 at 12:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> From: Eric Dumazet <edumazt@google.com>
>>>
>>> We should use time_after_eq() to get maximum latency of two ticks,
>>> instead of three.
>>>
>>> Bug added in commit 24f8b2385 (net: increase receive packet quantum)
>>
>> I'm not sure what applications would notice the extra tick, but 24f8b takes
>> us back to 2.6.29.  It cherry picks cleanly onto 2.6.34, so it probably also
>> does the same for Willy's 2.6.32 longterm too.
>>
>> Commit is now mainline d114a3338747255518 - v3.9-rc3~36^2~34.
> 
> BQL (Bytes Queue Limit) relies on TX completion being run often, and
> Qdisc being serviced often as well. If net_rx_action() hogs the cpu,
> net_tx_action() is delayed and NIC can stall.
> 
> I wrote this patch because I was investigating a regression when a
> Google application began using BQL enabled kernels.
> 
> About the latency in itself, following commit is way more interesting.
> 
> commit c10d73671ad30f5 (softirq: reduce latencies)
> 
> As without it, I could trigger more than 50ms latencies for the poor
> user thread interrupted by softirq processing.

That is also reasonably portable back to 2.6.34.  And it is more
interesting too: it will matter in a preempt_rt context as well,
once RT moves ahead off the current 3.6 baseline, which still
has the old count-limit of 10 vs the new 2ms time limit.

RT (3.4 and 3.6 based) currently has this patch from Steven:
http://git.kernel.org/cgit/linux/kernel/git/paulg/3.6-rt-patches.git/tree/net-tx-action-avoid-livelock-on-rt.patch

Anyway, thanks for the heads up on this commit.
Paul.

^ permalink raw reply
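
The time_after_eq() point quoted in the message above ("maximum latency of two ticks, instead of three") can be checked with a small userspace sketch. It assumes the loop limit is set to the starting tick plus 2, as the commit description implies; the demo_ macros below are local stand-ins modelled on the kernel's jiffies comparison helpers, not the kernel macros themselves.

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-safe tick comparisons, modelled on the kernel's jiffies
 * helpers but without their type checks. Local to this sketch. */
#define demo_time_after(a, b)		((long)((b) - (a)) < 0)
#define demo_time_after_eq(a, b)	((long)((a) - (b)) >= 0)

/* Return how many ticks past `start` the limit check first fires when
 * the limit is start + 2. use_after_eq selects the non-strict test. */
static unsigned long demo_ticks_until_break(unsigned long start,
					    int use_after_eq)
{
	unsigned long time_limit = start + 2;
	unsigned long now = start;

	for (;;) {
		int expired = use_after_eq ?
			demo_time_after_eq(now, time_limit) :
			demo_time_after(now, time_limit);

		if (expired)
			return now - start;
		now++;	/* one tick elapses; unsigned overflow wraps */
	}
}

/* Usage: strict comparison allows three ticks, non-strict allows two,
 * including when the tick counter wraps around. */
void demo_time_limit_check(void)
{
	assert(demo_ticks_until_break(100, 0) == 3);
	assert(demo_ticks_until_break(100, 1) == 2);
	assert(demo_ticks_until_break(ULONG_MAX - 1, 0) == 3);
	assert(demo_ticks_until_break(ULONG_MAX - 1, 1) == 2);
}
```

The signed cast of the unsigned difference is what keeps the comparison correct across counter wraparound, which is why the sketch does not compare the raw values directly.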

* Re: [PATCH 0/4] ss: Get netlink sockets info via sock-diag
From: Stephen Hemminger @ 2013-03-21 17:00 UTC (permalink / raw)
  To: Andrey Vagin; +Cc: netdev, Pavel Emelyanov
In-Reply-To: <1363858406-1489-1-git-send-email-avagin@openvz.org>

On Thu, 21 Mar 2013 13:33:22 +0400
Andrey Vagin <avagin@openvz.org> wrote:

> Cc: Stephen Hemminger <stephen@networkplumber.org>
> 
> Andrey Vagin (4):
>   ss: handle socket diag request in a separate function
>   ss: create a function to print info about netlink sockets
>   ss: show destination address for netlink sockets
>   ss: Get netlink sockets info via sock-diag
> 
>  include/linux/netlink_diag.h |  40 ++++++++
>  misc/ss.c                    | 235 +++++++++++++++++++++++++++++--------------
>  2 files changed, 198 insertions(+), 77 deletions(-)
>  create mode 100644 include/linux/netlink_diag.h
> 

Since this depends on functionality not yet in the upstream kernel,
please resubmit it during the 3.10 merge window. At the start of the
merge window, headers are synchronized with the kernel headers.

^ permalink raw reply

* Re: [PATCH 3/3] netlink: Diag core and basic socket info dumping (v2)
From: David Miller @ 2013-03-21 16:38 UTC (permalink / raw)
  To: avagin
  Cc: linux-kernel, netdev, xemul, edumazet, pablo, ebiederm, gaofeng,
	tgraf
In-Reply-To: <1363883628-7249-4-git-send-email-avagin@openvz.org>

From: Andrey Vagin <avagin@openvz.org>
Date: Thu, 21 Mar 2013 20:33:48 +0400

> The netlink_diag can be built as a module, just like it's done in
> unix sockets.
> 
> The core dumping message carries the basic info about netlink sockets:
> family, type and protocol, portid, dst_group, dst_portid, state.
> 
> Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.
> 
> Netlink sockets can be filtered by protocols.
> 
> The socket inode number and cookie are reserved for future per-socket info
> retrieval. The per-protocol filtering is also reserved for future by
> requiring the sdiag_protocol to be zero.
> 
> The file /proc/net/netlink doesn't provide enough information for
> dumping netlink sockets. It doesn't provide dst_group, dst_portid,
> groups above 32.
> 
> v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.
> 
> Acked-by: Pavel Emelyanov <xemul@parallels.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Gao feng <gaofeng@cn.fujitsu.com>
> Cc: Thomas Graf <tgraf@suug.ch>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>

Applied to net-next

^ permalink raw reply

* Re: [PATCH 2/3] net: prepare netlink code for netlink diag
From: David Miller @ 2013-03-21 16:38 UTC (permalink / raw)
  To: avagin
  Cc: linux-kernel, netdev, xemul, edumazet, pablo, ebiederm, gaofeng,
	tgraf
In-Reply-To: <1363883628-7249-3-git-send-email-avagin@openvz.org>

From: Andrey Vagin <avagin@openvz.org>
Date: Thu, 21 Mar 2013 20:33:47 +0400

> Move a few declarations into a header.
> 
> Acked-by: Pavel Emelyanov <xemul@parallels.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Gao feng <gaofeng@cn.fujitsu.com>
> Cc: Thomas Graf <tgraf@suug.ch>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH 1/3] net: fix *_DIAG_MAX constants
From: David Miller @ 2013-03-21 16:37 UTC (permalink / raw)
  To: avagin; +Cc: linux-kernel, netdev, xemul, edumazet, paulmck, dhowells
In-Reply-To: <1363883628-7249-2-git-send-email-avagin@openvz.org>

From: Andrey Vagin <avagin@openvz.org>
Date: Thu, 21 Mar 2013 20:33:46 +0400

> Follow the common pattern and define *_DIAG_MAX like:
> 
>         [...]
>         __XXX_DIAG_MAX,
> };
> 
> Because everyone is used to doing:
> 
>         struct nlattr *attrs[XXX_DIAG_MAX+1];
> 
>         nla_parse([...], XXX_DIAG_MAX, [...]
> 
> Reported-by: Thomas Graf <tgraf@suug.ch>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Pavel Emelyanov <xemul@parallels.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: David Howells <dhowells@redhat.com>
> Signed-off-by: Andrey Vagin <avagin@openvz.org>

Applied to 'net' and queued up for -stable.

Thanks.

^ permalink raw reply

* [PATCH 2/3] net: prepare netlink code for netlink diag
From: Andrey Vagin @ 2013-03-21 16:33 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: Pavel Emelyanov, Andrey Vagin, David S. Miller, Eric Dumazet,
	Pablo Neira Ayuso, Eric W. Biederman, Gao feng, Thomas Graf
In-Reply-To: <1363883628-7249-1-git-send-email-avagin@openvz.org>

Move a few declarations into a header.

Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 net/netlink/af_netlink.c | 59 ++++-----------------------------------------
 net/netlink/af_netlink.h | 62 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 54 deletions(-)
 create mode 100644 net/netlink/af_netlink.h

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 1e3fd5b..a500ce2 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -61,28 +61,7 @@
 #include <net/scm.h>
 #include <net/netlink.h>
 
-#define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
-#define NLGRPLONGS(x)	(NLGRPSZ(x)/sizeof(unsigned long))
-
-struct netlink_sock {
-	/* struct sock has to be the first member of netlink_sock */
-	struct sock		sk;
-	u32			portid;
-	u32			dst_portid;
-	u32			dst_group;
-	u32			flags;
-	u32			subscriptions;
-	u32			ngroups;
-	unsigned long		*groups;
-	unsigned long		state;
-	wait_queue_head_t	wait;
-	struct netlink_callback	*cb;
-	struct mutex		*cb_mutex;
-	struct mutex		cb_def_mutex;
-	void			(*netlink_rcv)(struct sk_buff *skb);
-	void			(*netlink_bind)(int group);
-	struct module		*module;
-};
+#include "af_netlink.h"
 
 struct listeners {
 	struct rcu_head		rcu;
@@ -94,48 +73,20 @@ struct listeners {
 #define NETLINK_BROADCAST_SEND_ERROR	0x4
 #define NETLINK_RECV_NO_ENOBUFS	0x8
 
-static inline struct netlink_sock *nlk_sk(struct sock *sk)
-{
-	return container_of(sk, struct netlink_sock, sk);
-}
-
 static inline int netlink_is_kernel(struct sock *sk)
 {
 	return nlk_sk(sk)->flags & NETLINK_KERNEL_SOCKET;
 }
 
-struct nl_portid_hash {
-	struct hlist_head	*table;
-	unsigned long		rehash_time;
-
-	unsigned int		mask;
-	unsigned int		shift;
-
-	unsigned int		entries;
-	unsigned int		max_shift;
-
-	u32			rnd;
-};
-
-struct netlink_table {
-	struct nl_portid_hash	hash;
-	struct hlist_head	mc_list;
-	struct listeners __rcu	*listeners;
-	unsigned int		flags;
-	unsigned int		groups;
-	struct mutex		*cb_mutex;
-	struct module		*module;
-	void			(*bind)(int group);
-	int			registered;
-};
-
-static struct netlink_table *nl_table;
+struct netlink_table *nl_table;
+EXPORT_SYMBOL_GPL(nl_table);
 
 static DECLARE_WAIT_QUEUE_HEAD(nl_table_wait);
 
 static int netlink_dump(struct sock *sk);
 
-static DEFINE_RWLOCK(nl_table_lock);
+DEFINE_RWLOCK(nl_table_lock);
+EXPORT_SYMBOL_GPL(nl_table_lock);
 static atomic_t nl_table_users = ATOMIC_INIT(0);
 
 #define nl_deref_protected(X) rcu_dereference_protected(X, lockdep_is_held(&nl_table_lock));
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
new file mode 100644
index 0000000..d9acb2a
--- /dev/null
+++ b/net/netlink/af_netlink.h
@@ -0,0 +1,62 @@
+#ifndef _AF_NETLINK_H
+#define _AF_NETLINK_H
+
+#include <net/sock.h>
+
+#define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
+#define NLGRPLONGS(x)	(NLGRPSZ(x)/sizeof(unsigned long))
+
+struct netlink_sock {
+	/* struct sock has to be the first member of netlink_sock */
+	struct sock		sk;
+	u32			portid;
+	u32			dst_portid;
+	u32			dst_group;
+	u32			flags;
+	u32			subscriptions;
+	u32			ngroups;
+	unsigned long		*groups;
+	unsigned long		state;
+	wait_queue_head_t	wait;
+	struct netlink_callback	*cb;
+	struct mutex		*cb_mutex;
+	struct mutex		cb_def_mutex;
+	void			(*netlink_rcv)(struct sk_buff *skb);
+	void			(*netlink_bind)(int group);
+	struct module		*module;
+};
+
+static inline struct netlink_sock *nlk_sk(struct sock *sk)
+{
+	return container_of(sk, struct netlink_sock, sk);
+}
+
+struct nl_portid_hash {
+	struct hlist_head	*table;
+	unsigned long		rehash_time;
+
+	unsigned int		mask;
+	unsigned int		shift;
+
+	unsigned int		entries;
+	unsigned int		max_shift;
+
+	u32			rnd;
+};
+
+struct netlink_table {
+	struct nl_portid_hash	hash;
+	struct hlist_head	mc_list;
+	struct listeners __rcu	*listeners;
+	unsigned int		flags;
+	unsigned int		groups;
+	struct mutex		*cb_mutex;
+	struct module		*module;
+	void			(*bind)(int group);
+	int			registered;
+};
+
+extern struct netlink_table *nl_table;
+extern rwlock_t nl_table_lock;
+
+#endif
-- 
1.8.1.4

^ permalink raw reply related
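
As a side note on the NLGRPSZ()/NLGRPLONGS() macros that the patch above moves into af_netlink.h, the sketch below shows what they compute: the size of the groups bitmap, rounded up to whole unsigned longs. DEMO_ALIGN is a userspace stand-in for the kernel's ALIGN() and is an assumption of this example; like the original macros, it assumes 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's ALIGN(); `a` must be a power of 2. */
#define DEMO_ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* Same shape as the macros in af_netlink.h, built on DEMO_ALIGN. */
#define DEMO_NLGRPSZ(x)		(DEMO_ALIGN(x, sizeof(unsigned long) * 8) / 8)
#define DEMO_NLGRPLONGS(x)	(DEMO_NLGRPSZ(x) / sizeof(unsigned long))

/* Bytes needed for a bitmap with one bit per multicast group. */
size_t demo_group_bytes(size_t ngroups)
{
	return DEMO_NLGRPSZ(ngroups);
}

/* The same bitmap size expressed in unsigned longs. */
size_t demo_group_longs(size_t ngroups)
{
	return DEMO_NLGRPLONGS(ngroups);
}

/* Usage: sizes round up to a whole number of unsigned longs. */
void demo_group_size_check(void)
{
	size_t bits = sizeof(unsigned long) * 8;

	assert(demo_group_bytes(0) == 0);
	assert(demo_group_bytes(1) == sizeof(unsigned long));
	assert(demo_group_bytes(bits) == sizeof(unsigned long));
	assert(demo_group_bytes(bits + 1) == 2 * sizeof(unsigned long));
	assert(demo_group_longs(bits + 1) == 2);
}
```

This is the length that sk_diag_dump_groups() in patch 3/3 passes to nla_put() when it copies nlk->groups, so a reader of that attribute should expect a multiple of the kernel's unsigned long size.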

* [PATCH 1/3] net: fix *_DIAG_MAX constants
From: Andrey Vagin @ 2013-03-21 16:33 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: Pavel Emelyanov, Andrey Vagin, David S. Miller, Eric Dumazet,
	Paul E. McKenney, David Howells
In-Reply-To: <1363883628-7249-1-git-send-email-avagin@openvz.org>

Follow the common pattern and define *_DIAG_MAX like:

        [...]
        __XXX_DIAG_MAX,
};

Because everyone is used to doing:

        struct nlattr *attrs[XXX_DIAG_MAX+1];

        nla_parse([...], XXX_DIAG_MAX, [...]

Reported-by: Thomas Graf <tgraf@suug.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 include/uapi/linux/packet_diag.h | 4 +++-
 include/uapi/linux/unix_diag.h   | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/packet_diag.h b/include/uapi/linux/packet_diag.h
index 93f5fa9..afafd70 100644
--- a/include/uapi/linux/packet_diag.h
+++ b/include/uapi/linux/packet_diag.h
@@ -33,9 +33,11 @@ enum {
 	PACKET_DIAG_TX_RING,
 	PACKET_DIAG_FANOUT,
 
-	PACKET_DIAG_MAX,
+	__PACKET_DIAG_MAX,
 };
 
+#define PACKET_DIAG_MAX (__PACKET_DIAG_MAX - 1)
+
 struct packet_diag_info {
 	__u32	pdi_index;
 	__u32	pdi_version;
diff --git a/include/uapi/linux/unix_diag.h b/include/uapi/linux/unix_diag.h
index b8a2494..b9e2a6a 100644
--- a/include/uapi/linux/unix_diag.h
+++ b/include/uapi/linux/unix_diag.h
@@ -39,9 +39,11 @@ enum {
 	UNIX_DIAG_MEMINFO,
 	UNIX_DIAG_SHUTDOWN,
 
-	UNIX_DIAG_MAX,
+	__UNIX_DIAG_MAX,
 };
 
+#define UNIX_DIAG_MAX (__UNIX_DIAG_MAX - 1)
+
 struct unix_diag_vfs {
 	__u32	udiag_vfs_ino;
 	__u32	udiag_vfs_dev;
-- 
1.8.1.4

^ permalink raw reply related
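
To make the *_DIAG_MAX convention in the patch above concrete, here is a minimal sketch with a hypothetical attribute set (the DEMO_* names are not a real uapi, and demo_type_is_valid() is only a stand-in for a parser's range check, not nla_parse() itself). With the sentinel pattern, DEMO_DIAG_MAX equals the last real attribute, so an array sized MAX+1 has exactly one slot per attribute and a "type <= max" check rejects the sentinel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute set, named DEMO_* to avoid implying a real uapi. */
enum {
	DEMO_DIAG_NAME,
	DEMO_DIAG_MEMINFO,
	DEMO_DIAG_SHUTDOWN,

	__DEMO_DIAG_MAX,	/* sentinel: one past the last real attribute */
};

#define DEMO_DIAG_MAX (__DEMO_DIAG_MAX - 1)

/* Stand-in for the table a parser fills: one slot per attribute type,
 * indexed 0..DEMO_DIAG_MAX inclusive. */
struct demo_attr_table {
	const void *attrs[DEMO_DIAG_MAX + 1];
};

/* Stand-in for a parser's range check against a caller-supplied maximum. */
static int demo_type_is_valid(int type, int maxtype)
{
	return type >= 0 && type <= maxtype;
}

/* Usage: MAX is the last real attribute and the sentinel is rejected. */
void demo_diag_max_check(void)
{
	struct demo_attr_table tbl = { { NULL } };

	assert(DEMO_DIAG_MAX == DEMO_DIAG_SHUTDOWN);
	assert(sizeof(tbl.attrs) / sizeof(tbl.attrs[0]) == 3);
	assert(demo_type_is_valid(DEMO_DIAG_SHUTDOWN, DEMO_DIAG_MAX));
	assert(!demo_type_is_valid(__DEMO_DIAG_MAX, DEMO_DIAG_MAX));
}
```

Had DEMO_DIAG_MAX been the last enumerator itself, as in the pre-patch headers, the same MAX+1 array would carry one slot for an attribute type that does not exist.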

* [PATCH 3/3] netlink: Diag core and basic socket info dumping (v2)
From: Andrey Vagin @ 2013-03-21 16:33 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: Pavel Emelyanov, Andrey Vagin, David S. Miller, Eric Dumazet,
	Pablo Neira Ayuso, Eric W. Biederman, Gao feng, Thomas Graf
In-Reply-To: <1363883628-7249-1-git-send-email-avagin@openvz.org>

The netlink_diag can be built as a module, just like it's done in
unix sockets.

The core dumping message carries the basic info about netlink sockets:
family, type and protocol, portid, dst_group, dst_portid, state.

Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.

Netlink sockets can be filtered by protocols.

The socket inode number and cookie are reserved for future per-socket info
retrieval. The per-protocol filtering is also reserved for future by
requiring the sdiag_protocol to be zero.

The file /proc/net/netlink doesn't provide enough information for
dumping netlink sockets. It doesn't provide dst_group, dst_portid,
groups above 32.

v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.

Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
---
 include/uapi/linux/netlink_diag.h |  42 +++++++++
 net/Kconfig                       |   1 +
 net/netlink/Kconfig               |  10 ++
 net/netlink/Makefile              |   3 +
 net/netlink/diag.c                | 188 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 244 insertions(+)
 create mode 100644 include/uapi/linux/netlink_diag.h
 create mode 100644 net/netlink/Kconfig
 create mode 100644 net/netlink/diag.c

diff --git a/include/uapi/linux/netlink_diag.h b/include/uapi/linux/netlink_diag.h
new file mode 100644
index 0000000..88009a3
--- /dev/null
+++ b/include/uapi/linux/netlink_diag.h
@@ -0,0 +1,42 @@
+#ifndef __NETLINK_DIAG_H__
+#define __NETLINK_DIAG_H__
+
+#include <linux/types.h>
+
+struct netlink_diag_req {
+	__u8	sdiag_family;
+	__u8	sdiag_protocol;
+	__u16	pad;
+	__u32	ndiag_ino;
+	__u32	ndiag_show;
+	__u32	ndiag_cookie[2];
+};
+
+struct netlink_diag_msg {
+	__u8	ndiag_family;
+	__u8	ndiag_type;
+	__u8	ndiag_protocol;
+	__u8	ndiag_state;
+
+	__u32	ndiag_portid;
+	__u32	ndiag_dst_portid;
+	__u32	ndiag_dst_group;
+	__u32	ndiag_ino;
+	__u32	ndiag_cookie[2];
+};
+
+enum {
+	NETLINK_DIAG_MEMINFO,
+	NETLINK_DIAG_GROUPS,
+
+	__NETLINK_DIAG_MAX,
+};
+
+#define NETLINK_DIAG_MAX (__NETLINK_DIAG_MAX - 1)
+
+#define NDIAG_PROTO_ALL		((__u8) ~0)
+
+#define NDIAG_SHOW_MEMINFO	0x00000001 /* show memory info of a socket */
+#define NDIAG_SHOW_GROUPS	0x00000002 /* show groups of a netlink socket */
+
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index 6f676ab..2ddc904 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -217,6 +217,7 @@ source "net/dns_resolver/Kconfig"
 source "net/batman-adv/Kconfig"
 source "net/openvswitch/Kconfig"
 source "net/vmw_vsock/Kconfig"
+source "net/netlink/Kconfig"
 
 config RPS
 	boolean
diff --git a/net/netlink/Kconfig b/net/netlink/Kconfig
new file mode 100644
index 0000000..5d6e8c0
--- /dev/null
+++ b/net/netlink/Kconfig
@@ -0,0 +1,10 @@
+#
+# Netlink Sockets
+#
+
+config NETLINK_DIAG
+	tristate "NETLINK: socket monitoring interface"
+	default n
+	---help---
+	  Support for NETLINK socket monitoring interface used by the ss tool.
+	  If unsure, say Y.
diff --git a/net/netlink/Makefile b/net/netlink/Makefile
index bdd6ddf..e837917 100644
--- a/net/netlink/Makefile
+++ b/net/netlink/Makefile
@@ -3,3 +3,6 @@
 #
 
 obj-y  				:= af_netlink.o genetlink.o
+
+obj-$(CONFIG_NETLINK_DIAG)	+= netlink_diag.o
+netlink_diag-y			:= diag.o
diff --git a/net/netlink/diag.c b/net/netlink/diag.c
new file mode 100644
index 0000000..5ffb1d1
--- /dev/null
+++ b/net/netlink/diag.c
@@ -0,0 +1,188 @@
+#include <linux/module.h>
+
+#include <net/sock.h>
+#include <linux/netlink.h>
+#include <linux/sock_diag.h>
+#include <linux/netlink_diag.h>
+
+#include "af_netlink.h"
+
+static int sk_diag_dump_groups(struct sock *sk, struct sk_buff *nlskb)
+{
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (nlk->groups == NULL)
+		return 0;
+
+	return nla_put(nlskb, NETLINK_DIAG_GROUPS, NLGRPSZ(nlk->ngroups),
+		       nlk->groups);
+}
+
+static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
+			struct netlink_diag_req *req,
+			u32 portid, u32 seq, u32 flags, int sk_ino)
+{
+	struct nlmsghdr *nlh;
+	struct netlink_diag_msg *rep;
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
+			flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	rep = nlmsg_data(nlh);
+	rep->ndiag_family	= AF_NETLINK;
+	rep->ndiag_type		= sk->sk_type;
+	rep->ndiag_protocol	= sk->sk_protocol;
+	rep->ndiag_state	= sk->sk_state;
+
+	rep->ndiag_ino		= sk_ino;
+	rep->ndiag_portid	= nlk->portid;
+	rep->ndiag_dst_portid	= nlk->dst_portid;
+	rep->ndiag_dst_group	= nlk->dst_group;
+	sock_diag_save_cookie(sk, rep->ndiag_cookie);
+
+	if ((req->ndiag_show & NDIAG_SHOW_GROUPS) &&
+	    sk_diag_dump_groups(sk, skb))
+		goto out_nlmsg_trim;
+
+	if ((req->ndiag_show & NDIAG_SHOW_MEMINFO) &&
+	    sock_diag_put_meminfo(sk, skb, NETLINK_DIAG_MEMINFO))
+		goto out_nlmsg_trim;
+
+	return nlmsg_end(skb, nlh);
+
+out_nlmsg_trim:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static int __netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
+				int protocol, int s_num)
+{
+	struct netlink_table *tbl = &nl_table[protocol];
+	struct nl_portid_hash *hash = &tbl->hash;
+	struct net *net = sock_net(skb->sk);
+	struct netlink_diag_req *req;
+	struct sock *sk;
+	int ret = 0, num = 0, i;
+
+	req = nlmsg_data(cb->nlh);
+
+	for (i = 0; i <= hash->mask; i++) {
+		sk_for_each(sk, &hash->table[i]) {
+			if (!net_eq(sock_net(sk), net))
+				continue;
+			if (num < s_num) {
+				num++;
+				continue;
+			}
+
+			if (sk_diag_fill(sk, skb, req,
+					 NETLINK_CB(cb->skb).portid,
+					 cb->nlh->nlmsg_seq,
+					 NLM_F_MULTI,
+					 sock_i_ino(sk)) < 0) {
+				ret = 1;
+				goto done;
+			}
+
+			num++;
+		}
+	}
+
+	sk_for_each_bound(sk, &tbl->mc_list) {
+		if (sk_hashed(sk))
+			continue;
+		if (!net_eq(sock_net(sk), net))
+			continue;
+		if (num < s_num) {
+			num++;
+			continue;
+		}
+
+		if (sk_diag_fill(sk, skb, req,
+				 NETLINK_CB(cb->skb).portid,
+				 cb->nlh->nlmsg_seq,
+				 NLM_F_MULTI,
+				 sock_i_ino(sk)) < 0) {
+			ret = 1;
+			goto done;
+		}
+		num++;
+	}
+done:
+	cb->args[0] = num;
+	cb->args[1] = protocol;
+
+	return ret;
+}
+
+static int netlink_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct netlink_diag_req *req;
+	int s_num = cb->args[0];
+
+	req = nlmsg_data(cb->nlh);
+
+	read_lock(&nl_table_lock);
+
+	if (req->sdiag_protocol == NDIAG_PROTO_ALL) {
+		int i;
+
+		for (i = cb->args[1]; i < MAX_LINKS; i++) {
+			if (__netlink_diag_dump(skb, cb, i, s_num))
+				break;
+			s_num = 0;
+		}
+	} else {
+		if (req->sdiag_protocol >= MAX_LINKS) {
+			read_unlock(&nl_table_lock);
+			return -ENOENT;
+		}
+
+		__netlink_diag_dump(skb, cb, req->sdiag_protocol, s_num);
+	}
+
+	read_unlock(&nl_table_lock);
+
+	return skb->len;
+}
+
+static int netlink_diag_handler_dump(struct sk_buff *skb, struct nlmsghdr *h)
+{
+	int hdrlen = sizeof(struct netlink_diag_req);
+	struct net *net = sock_net(skb->sk);
+
+	if (nlmsg_len(h) < hdrlen)
+		return -EINVAL;
+
+	if (h->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = netlink_diag_dump,
+		};
+		return netlink_dump_start(net->diag_nlsk, skb, h, &c);
+	} else
+		return -EOPNOTSUPP;
+}
+
+static const struct sock_diag_handler netlink_diag_handler = {
+	.family = AF_NETLINK,
+	.dump = netlink_diag_handler_dump,
+};
+
+static int __init netlink_diag_init(void)
+{
+	return sock_diag_register(&netlink_diag_handler);
+}
+
+static void __exit netlink_diag_exit(void)
+{
+	sock_diag_unregister(&netlink_diag_handler);
+}
+
+module_init(netlink_diag_init);
+module_exit(netlink_diag_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, 16 /* AF_NETLINK */);
-- 
1.8.1.4

^ permalink raw reply related
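
The resume logic in __netlink_diag_dump() above (skip the first s_num sockets, emit until the buffer fills, then store the running count in cb->args[0]) can be tried out in isolation. The sketch below is a simplified userspace model over a plain array with a single table; struct demo_dump_cb and demo_dump() are hypothetical names, and the per-protocol cursor kept in cb->args[1] is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor, playing the role of cb->args[0] in the patch. */
struct demo_dump_cb {
	long skip;	/* entries already delivered by earlier calls */
};

/* Copy entries from src[] into out[] (capacity out_cap), resuming after
 * cb->skip entries. Returns the number written and updates cb->skip. */
static size_t demo_dump(const int *src, size_t n_src,
			int *out, size_t out_cap, struct demo_dump_cb *cb)
{
	long s_num = cb->skip;
	long num = 0;
	size_t written = 0;
	size_t i;

	for (i = 0; i < n_src; i++) {
		if (num < s_num) {	/* delivered in a previous pass */
			num++;
			continue;
		}
		if (written == out_cap)	/* buffer full: stop here */
			break;
		out[written++] = src[i];
		num++;
	}
	cb->skip = num;		/* where the next call resumes */
	return written;
}

/* Usage: repeated calls with a 2-entry buffer deliver each entry once. */
void demo_dump_check(void)
{
	const int src[5] = { 10, 11, 12, 13, 14 };
	int out[2], all[5];
	struct demo_dump_cb cb = { 0 };
	size_t got, total = 0, k;

	while ((got = demo_dump(src, 5, out, 2, &cb)) > 0)
		for (k = 0; k < got; k++)
			all[total++] = out[k];

	assert(total == 5);
	for (k = 0; k < 5; k++)
		assert(all[k] == src[k]);
}
```

A skip-count cursor like this is only stable if the underlying set does not change between calls; the sketch sidesteps that by using a constant array.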

* Re: [PATCH net] vhost/net: fix heads usage of ubuf_info
From: Ben Hutchings @ 2013-03-21 16:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, netdev, linux-kernel, nab, virtualization, David Miller,
	basil.gor
In-Reply-To: <20130321162813.GG1925@redhat.com>

On Thu, 2013-03-21 at 18:28 +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2013 at 04:23:48PM +0000, Ben Hutchings wrote:
> > On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote:
> > > On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote:
> > > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Date: Sun, 17 Mar 2013 14:46:09 +0200
> > > > 
> > > > > ubuf info allocator uses guest controlled head as an index,
> > > > > so a malicious guest could put the same head entry in the ring twice,
> > > > > and we will get two callbacks on the same value.
> > > > > To fix use upend_idx which is guaranteed to be unique.
> > > > > 
> > > > > Reported-by: Rusty Russell <rusty@rustcorp.com.au>
> > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > 
> > > > Applied and queued up for -stable, thanks.
> > > > 
> > > > And thankfully you got the stable URL wrong,
> > > 
> > > Yes I wrote stable@kernel.org that's what an old copy
> > > says here:
> > > https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt
> > > 
> > > I should have known better than look at it on the 'net.  The top
> > > 'Everything you ever wanted to know about Linux 2.6 -stable releases.'
> > > is a big hint that it's stale.
> > > Any idea who maintains this? Better update it or remove it or redirect to git.
> > 
> > Rob Landley maintains it, but he's been having trouble updating it since
> > all the upload mechanisms were changed on kernel.org.
> > 
> > (My stable maintenance scripts still match the old address, anyway.  Not
> > sure about Greg's.)
> > 
> > Ben.
> 
> I hope you mean it will match both the old and the new address?

Yes, of course!

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH 0/3] netlink: implement socket diag for netlink sockets (v2)
From: Andrey Vagin @ 2013-03-21 16:33 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: Pavel Emelyanov, Andrey Vagin, David S. Miller, Eric Dumazet,
	Pablo Neira Ayuso, Eric W. Biederman, Gao feng, Thomas Graf

v2: fix *_DIAG_MAX constants

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>

Andrey Vagin (3):
  net: fix *_DIAG_MAX constants
  net: prepare netlink code for netlink diag
  netlink: Diag core and basic socket info dumping (v2)

 include/uapi/linux/netlink_diag.h |  42 +++++++++
 include/uapi/linux/packet_diag.h  |   4 +-
 include/uapi/linux/unix_diag.h    |   4 +-
 net/Kconfig                       |   1 +
 net/netlink/Kconfig               |  10 ++
 net/netlink/Makefile              |   3 +
 net/netlink/af_netlink.c          |  59 +-----------
 net/netlink/af_netlink.h          |  62 +++++++++++++
 net/netlink/diag.c                | 188 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 317 insertions(+), 56 deletions(-)
 create mode 100644 include/uapi/linux/netlink_diag.h
 create mode 100644 net/netlink/Kconfig
 create mode 100644 net/netlink/af_netlink.h
 create mode 100644 net/netlink/diag.c

-- 
1.8.1.4

* Re: [PATCH net-next] gro: relax ID check in inet_gro_receive()
From: David Miller @ 2013-03-21 16:31 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, dmitry, eilong, pshelar, hkchu, maze
In-Reply-To: <1363882091.4431.20.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 21 Mar 2013 09:08:11 -0700

> I understand your concern, but this check in GRO brings nothing at all.

It brings reversibility, a fundamental rule of our segmentation
offloads.

* Re: [PATCH net-next] gro: relax ID check in inet_gro_receive()
From: David Miller @ 2013-03-21 16:31 UTC (permalink / raw)
  To: bhutchings; +Cc: eric.dumazet, netdev, dmitry, eilong, pshelar, hkchu, maze
In-Reply-To: <1363882825.2736.4.camel@bwh-desktop.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 21 Mar 2013 16:20:25 +0000

> On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 20 Mar 2013 21:52:33 -0700
>> 
>> > GRE TSO support doesn't increment the ID in the inner IP header.
>> 
>> Is this a fundamental limitation of doing TSO over GRE or
>> were the Broadcom folks just being lazy with their firmware
>> implementation?
>> 
>> I really don't want to apply this patch, because ipv4 frames
>> even with DF set should have an incrementing ID field, in
>> order to accommodate various header compression schemes.
>> 
>> We go out of our way to do this for normal unencapsulated TCP stream
>> packets, rather than set the ID field to zero (which we did for some
>> time until the compression issue was pointed out to us).
> 
> Besides which, GRO has been reliably reversible until now.  (gso_size is
> available through packet sockets, even if tcpdump doesn't appear to use
> it yet.)  Ignoring IPv4 IDs will break that guarantee.

Right, even ignoring the header compression issues, our segmentation
offloads must be perfectly reversible.
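
To illustrate the reversibility point, here is a minimal userspace C
sketch, not kernel code. struct agg_pkt and the merge_strict(),
merge_relaxed(), resegment() and round_trips() helpers are hypothetical
names made up for this illustration, and it assumes the segmenter
regenerates IPv4 IDs by incrementing from the first segment's ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an aggregated packet: only the fields needed here. */
struct agg_pkt {
	uint16_t base_id;	/* IPv4 ID of the first merged segment */
	unsigned int count;	/* number of merged segments */
};

/* Strict rule: segment i must carry ID base_id + i (mod 2^16).
 * Returns 1 if all n segments merge, 0 if a mismatch stops the merge. */
int merge_strict(struct agg_pkt *agg, const uint16_t *ids, size_t n)
{
	assert(n > 0);
	agg->base_id = ids[0];
	agg->count = 1;
	for (size_t i = 1; i < n; i++) {
		if ((uint16_t)(agg->base_id + agg->count) != ids[i])
			return 0;
		agg->count++;
	}
	return 1;
}

/* Relaxed rule: IDs after the first are ignored, everything merges. */
void merge_relaxed(struct agg_pkt *agg, const uint16_t *ids, size_t n)
{
	assert(n > 0);
	agg->base_id = ids[0];
	agg->count = (unsigned int)n;
}

/* Resegment: regenerate one ID per segment by incrementing from base_id. */
void resegment(const struct agg_pkt *agg, uint16_t *out)
{
	for (unsigned int i = 0; i < agg->count; i++)
		out[i] = (uint16_t)(agg->base_id + i);
}

/* Returns 1 if resegmenting reproduces the original IDs exactly. */
int round_trips(const struct agg_pkt *agg, const uint16_t *ids)
{
	uint16_t out[64];

	assert(agg->count <= 64);
	resegment(agg, out);
	for (unsigned int i = 0; i < agg->count; i++)
		if (out[i] != ids[i])
			return 0;
	return 1;
}
```

In this model, segments with incrementing IDs merge under the strict
rule and round-trip exactly. Segments that all carry the same ID only
merge under the relaxed rule, and then the regenerated IDs no longer
match what was received.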

* Re: [PATCH net] vhost/net: fix heads usage of ubuf_info
From: Michael S. Tsirkin @ 2013-03-21 16:28 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: kvm, netdev, linux-kernel, nab, virtualization, David Miller,
	basil.gor
In-Reply-To: <1363883028.2736.7.camel@bwh-desktop.uk.solarflarecom.com>

On Thu, Mar 21, 2013 at 04:23:48PM +0000, Ben Hutchings wrote:
> On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote:
> > On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote:
> > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > Date: Sun, 17 Mar 2013 14:46:09 +0200
> > > 
> > > > ubuf info allocator uses guest controlled head as an index,
> > > > so a malicious guest could put the same head entry in the ring twice,
> > > > and we will get two callbacks on the same value.
> > > > To fix use upend_idx which is guaranteed to be unique.
> > > > 
> > > > Reported-by: Rusty Russell <rusty@rustcorp.com.au>
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > 
> > > Applied and queued up for -stable, thanks.
> > > 
> > > And thankfully you got the stable URL wrong,
> > 
> > Yes I wrote stable@kernel.org that's what an old copy
> > says here:
> > https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt
> > 
> > I should have known better than look at it on the 'net.  The top
> > 'Everything you ever wanted to know about Linux 2.6 -stable releases.'
> > is a big hint that it's stale.
> > Any idea who maintains this? Better update it or remove it or redirect to git.
> 
> Rob Landley maintains it, but he's been having trouble updating it since
> all the upload mechanisms were changed on kernel.org.
> 
> (My stable maintenance scripts still match the old address, anyway.  Not
> sure about Greg's.)
> 
> Ben.

I hope you mean it will match both the old and the new address?


> -- 
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.

* Re: [PATCH net] vhost/net: fix heads usage of ubuf_info
From: Ben Hutchings @ 2013-03-21 16:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, rusty, jasowang, basil.gor, nab, kvm,
	virtualization, netdev, linux-kernel
In-Reply-To: <20130321060218.GB23908@redhat.com>

On Thu, 2013-03-21 at 08:02 +0200, Michael S. Tsirkin wrote:
> On Sun, Mar 17, 2013 at 02:29:55PM -0400, David Miller wrote:
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > Date: Sun, 17 Mar 2013 14:46:09 +0200
> > 
> > > ubuf info allocator uses guest controlled head as an index,
> > > so a malicious guest could put the same head entry in the ring twice,
> > > and we will get two callbacks on the same value.
> > > To fix use upend_idx which is guaranteed to be unique.
> > > 
> > > Reported-by: Rusty Russell <rusty@rustcorp.com.au>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > Applied and queued up for -stable, thanks.
> > 
> > And thankfully you got the stable URL wrong,
> 
> Yes I wrote stable@kernel.org that's what an old copy
> says here:
> https://www.kernel.org/doc/Documentation/stable_kernel_rules.txt
> 
> I should have known better than look at it on the 'net.  The top
> 'Everything you ever wanted to know about Linux 2.6 -stable releases.'
> is a big hint that it's stale.
> Any idea who maintains this? Better update it or remove it or redirect to git.

Rob Landley maintains it, but he's been having trouble updating it since
all the upload mechanisms were changed on kernel.org.

(My stable maintenance scripts still match the old address, anyway.  Not
sure about Greg's.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* Re: [PATCH net-next] gro: relax ID check in inet_gro_receive()
From: Ben Hutchings @ 2013-03-21 16:20 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, dmitry, eilong, pshelar, hkchu, maze
In-Reply-To: <20130321.114616.279859400813363663.davem@davemloft.net>

On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Mar 2013 21:52:33 -0700
> 
> > GRE TSO support doesn't increment the ID in the inner IP header.
> 
> Is this a fundamental limitation of doing TSO over GRE or
> were the Broadcom folks just being lazy with their firmware
> implementation?
> 
> I really don't want to apply this patch, because ipv4 frames
> even with DF set should have an incrementing ID field, in
> order to accommodate various header compression schemes.
> 
> We go out of our way to do this for normal unencapsulated TCP stream
> packets, rather than set the ID field to zero (which we did for some
> time until the compression issue was pointed out to us).

Besides which, GRO has been reliably reversible until now.  (gso_size is
available through packet sockets, even if tcpdump doesn't appear to use
it yet.)  Ignoring IPv4 IDs will break that guarantee.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* Re: RFC: mac802154 Packet Queueing and Slave Devices
From: Alan Ott @ 2013-03-21 16:09 UTC (permalink / raw)
  To: Alexander Smirnov, Dmitry Eremin-Solenikov, slapin, Tony Cheneau
  Cc: linux-zigbee-devel, netdev, Eric Dumazet
In-Reply-To: <504D37A7.60109@signal11.us>

On 09/09/2012 08:43 PM, Alan Ott wrote:
> Tony and I were recently talking about packet queueing on 802.15.4. What
> currently happens (in net/mac802154/tx.c) is that each tx packet (skb)
> is stuck on a work queue, and the worker function then sends each packet
> to the hardware driver in order.
> 
> The problem with this is that it defeats the netif flow control. The
> networking layer thinks the packet is sent as soon as it's put on the
> workqueue (because the function that queues it returns NETDEV_TX_OK to
> the networking layer), and the workqueue can then get arbitrarily large
> if an application tries to send a lot of data. (Tony has shown this with
> iperf)
> 

So I tried fixing this using netif_stop_queue() and netif_wake_queue(),
the standard way. The flow control works, but I'm now losing packets.

It happens like this:

ipv6           -> 6lowpan   -> net core -> mac802154         -> hardware
 single packet     fragment                 netif_stop_queue()
                   fragment
                   fragment
                   fragment

  Above: a single ipv6 packet is split into fragments by 6lowpan. Each
  fragment is sent through the networking core where it ends up in
  mac802154, which will call netif_stop_queue() and netif_wake_queue()
  for flow control as packets are sent.

The problem is that since many ieee802154 hardware devices can only hold
one packet at a time in their tx buffer, netif_stop_queue() gets called
after the first fragment. Since the 6lowpan code is trying to send 4
packets in the above case, the remaining 3 get dropped because they're
handed to the networking core (dev_queue_xmit()) while the queue is
stopped.
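
The drop scenario described above can be modelled in userspace C. This
is only a sketch under assumptions: struct model_dev and the model_*()
functions are hypothetical, the one-frame tx buffer follows the
description above, and the queue_stopped flag merely stands in for
netif_stop_queue()/netif_wake_queue() rather than calling them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device whose tx buffer holds a single frame. */
struct model_dev {
	bool queue_stopped;	/* stands in for the netif queue state */
	unsigned int sent;
	unsigned int dropped;
};

/* Hand one frame to the device; a frame arriving while stopped is dropped. */
void model_xmit(struct model_dev *dev)
{
	if (dev->queue_stopped) {
		dev->dropped++;
		return;
	}
	dev->sent++;
	dev->queue_stopped = true;	/* like netif_stop_queue() */
}

/* Transmit-complete event: the single buffer slot is free again. */
void model_tx_done(struct model_dev *dev)
{
	dev->queue_stopped = false;	/* like netif_wake_queue() */
}

/* Submit all fragments back to back, without waiting for tx completion. */
void model_send_fragments(struct model_dev *dev, unsigned int nfrags)
{
	assert(nfrags > 0);
	for (unsigned int i = 0; i < nfrags; i++)
		model_xmit(dev);
}
```

Calling model_send_fragments() with 4 fragments on an idle device
leaves sent at 1 and dropped at 3, which is the behaviour described
above.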

So as a solution, one could envision 6lowpan putting the fragments into
a queue, and submitting one at a time, as the queue gets woken. The
problem with this is that there's no way to get notification for when a
queue is woken. I checked both ppp and ax25 (which would seem to have
this same issue), and they both have a fragment queue, but they rely on
external events (mostly unrelated to the queue being woken) to trigger
sending packets from the queue (see calls to ax25_kick()). That seems
hacky at best.
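
A minimal userspace sketch of the fragment-queue idea follows. All of
the names (struct frag_queue, struct one_slot_dev, fq_push(), fq_kick(),
on_queue_woken()) are hypothetical. In particular, on_queue_woken()
assumes a wake notification that, as noted above, does not currently
exist, so this only shows what the approach would look like if it did.

```c
#include <assert.h>
#include <stdbool.h>

#define FRAG_QUEUE_MAX 16

/* Simple linear (non-wrapping) queue of fragment ids, stand-ins for skbs. */
struct frag_queue {
	int frags[FRAG_QUEUE_MAX];
	unsigned int head;	/* next fragment to submit */
	unsigned int tail;	/* next free slot */
};

/* Device with a one-frame tx buffer. */
struct one_slot_dev {
	bool busy;
	unsigned int sent;
};

/* Queue a fragment; returns false if the queue is full. */
bool fq_push(struct frag_queue *q, int frag)
{
	if (q->tail == FRAG_QUEUE_MAX)
		return false;
	q->frags[q->tail++] = frag;
	return true;
}

/* Submit at most one fragment if the device is idle. */
bool fq_kick(struct frag_queue *q, struct one_slot_dev *dev)
{
	assert(q->head <= q->tail);
	if (dev->busy || q->head == q->tail)
		return false;
	q->head++;
	dev->busy = true;
	dev->sent++;
	return true;
}

/* Hypothetical hook: called when the lower queue is woken. */
void on_queue_woken(struct frag_queue *q, struct one_slot_dev *dev)
{
	dev->busy = false;
	fq_kick(q, dev);
}
```

With 4 fragments queued, one fq_kick() followed by three wake events
sends all 4 without dropping any; a further wake event finds the queue
empty and leaves the device idle.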

A thread from pppoe[1] discusses a similar issue. The patch
from that email was never merged. Even so, their solution seems a bit
hacky too, because it would basically cause a kick to (in this case)
6lowpan whenever an skb gets destroyed (i.e. after it's sent). With the
desire for 6lowpan to be a generic protocol[2], one would want it to be
efficient on MAC layers which do support longer queues[3].

What am I missing here? What's the right way to do this?

Alan.

[1] http://thread.gmane.org/gmane.linux.network/233089
[2] There has been some discussion about using 6lowpan on Bluetooth
low-energy.
[3] There's also the case where 2 6lowpan instances are attached to
the same hardware, or where 6lowpan and raw are being used concurrently.

* Re: [PATCH net-next] gro: relax ID check in inet_gro_receive()
From: Eric Dumazet @ 2013-03-21 16:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, dmitry, eilong, pshelar, hkchu, maze
In-Reply-To: <20130321.114616.279859400813363663.davem@davemloft.net>

On Thu, 2013-03-21 at 11:46 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Mar 2013 21:52:33 -0700
> 
> > GRE TSO support doesn't increment the ID in the inner IP header.
> 
> Is this a fundamental limitation of doing TSO over GRE or
> were the Broadcom folks just being lazy with their firmware
> implementation?
> 

Well, I suspect this hardware is not capable of doing the proper ID
manipulation twice. (inner and outer header)

Still, TSO support lets a single GRE flow go from 3Gbps to 9Gbps on
our hosts. So even if the inner IP id is 'broken', we are going to use
TSO.

Note we are limited by the receiver, as the receiver has to perform the
TCP checksum in software (bnx2x doesn't support CHECKSUM_COMPLETE yet).

Hopefully next firmware or NIC will do the right thing.

> I really don't want to apply this patch, because ipv4 frames
> even with DF set should have an incrementing ID field, in
> order to accommodate various header compression schemes.
> 
> We go out of our way to do this for normal unencapsulated TCP stream
> packets, rather than set the ID field to zero (which we did for some
> time until the compression issue was pointed out to us).

I understand your concern, but this check in GRO brings nothing at all.

Once we receive frames with a 'bad IPv4 ID', should we accept them or
drop them?

The TCP stack doesn't care on receive (obviously, as this ID is not a
concern for the transport layer), so GRO should do the same, as GRO is a
best effort to reduce cpu load.

I fully understand the 'tos' check because of proper ECN support, but
the ttl check or id check are totally useless and time consuming.

GRO aggregation should work roughly the same as TCP coalescing, and we
don't care about IP ID or ttl in the TCP stack.
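
For readers following along, the kind of per-field comparison being
discussed can be sketched in userspace C. This is a simplified model,
not the code in inet_gro_receive(): struct ip_fields, flush_strict() and
flush_tos_only() are hypothetical names, and flush_tos_only() is only a
variant that keeps the tos comparison alone, as argued above, not the
proposed patch.

```c
#include <assert.h>
#include <stdint.h>

/* Only the IPv4 header fields discussed in this thread. */
struct ip_fields {
	uint8_t tos;
	uint8_t ttl;
	uint16_t id;
};

/* Strict model: nonzero means "do not merge". tos and ttl must match,
 * and the new ID must equal the held packet's first ID plus the number
 * of segments already merged (mod 2^16). */
unsigned int flush_strict(const struct ip_fields *held,
			  unsigned int held_count,
			  const struct ip_fields *next)
{
	assert(held_count > 0);
	return (unsigned int)(held->ttl ^ next->ttl) |
	       (unsigned int)(held->tos ^ next->tos) |
	       (unsigned int)((uint16_t)(held->id + held_count) ^ next->id);
}

/* Hypothetical variant: only tos (which carries the ECN bits) is compared. */
unsigned int flush_tos_only(const struct ip_fields *held,
			    const struct ip_fields *next)
{
	return (unsigned int)(held->tos ^ next->tos);
}
```

In this model a segment that repeats the previous ID is refused by
flush_strict() but accepted by flush_tos_only(), while a segment with a
different tos is refused by both.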

* Re: [PATCH net 0/5] Mellanox Core and Ethernet driver fixes 2013-03-21
From: David Miller @ 2013-03-21 16:05 UTC (permalink / raw)
  To: ogerlitz; +Cc: netdev, amirv, jackm, hadarh
In-Reply-To: <1363881355-21137-1-git-send-email-ogerlitz@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Thu, 21 Mar 2013 17:55:50 +0200

> Here's a batch of mlx4 driver fixes for 3.9, mostly SRIOV/Flow-steering
> related. Series done against the net tree as of commit 5a3da1f
> "inet: limit length of fragment queue hash table bucket lists"

Series applied, thanks.

* Re: [PATCH net-next 3/3] gianfar: Remove superfluous kernel_dropped local counter
From: David Miller @ 2013-03-21 16:02 UTC (permalink / raw)
  To: claudiu.manoil; +Cc: netdev
In-Reply-To: <1363871535-29612-3-git-send-email-claudiu.manoil@freescale.com>

From: Claudiu Manoil <claudiu.manoil@freescale.com>
Date: Thu, 21 Mar 2013 15:12:15 +0200

> The GRO_DROP return code is handled by the core network layer.
> The current kernel approach is to factorize this kind of statistics into
> the upper layers, instead of having all the drivers maintaining them.
> 
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>

Applied.

* Re: [PATCH net-next 2/3] gianfar: Cleanup dead code and minor formatting
From: David Miller @ 2013-03-21 16:02 UTC (permalink / raw)
  To: claudiu.manoil; +Cc: netdev
In-Reply-To: <1363871535-29612-2-git-send-email-claudiu.manoil@freescale.com>

From: Claudiu Manoil <claudiu.manoil@freescale.com>
Date: Thu, 21 Mar 2013 15:12:14 +0200

> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>

Applied.

* Re: [PATCH net-next 1/3] gianfar: Remove 'maybe-uninitialized' compile warning
From: David Miller @ 2013-03-21 16:02 UTC (permalink / raw)
  To: claudiu.manoil; +Cc: netdev
In-Reply-To: <1363871535-29612-1-git-send-email-claudiu.manoil@freescale.com>

From: Claudiu Manoil <claudiu.manoil@freescale.com>
Date: Thu, 21 Mar 2013 15:12:13 +0200

> Warning message:
> warning: 'budget_per_q' may be used uninitialized in this function
> 
> budget_per_q won't be used uninitialized since the only time
> it doesn't get initialized is when entering gfar_poll with
> num_act_queues == 0, meaning rstat_rxf == 0, in which case
> budget_per_q is not utilized (as it has no meaning).
> Initialize budget_per_q to 0 though to suppress this compile
> warning.
> 
> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>

Applied.
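
The pattern described in the quoted commit message can be sketched in
userspace C. This is not the gfar_poll() code: count_active_queues() and
poll_sketch() are hypothetical, the 8-queue limit is arbitrary, and the
sketch assumes num_act_queues is the number of set bits in rstat_rxf,
which the message implies but does not state.

```c
#include <assert.h>

/* Count set bits; each bit of rstat_rxf stands for one active rx queue. */
unsigned int count_active_queues(unsigned int rstat_rxf)
{
	unsigned int n = 0;

	for (; rstat_rxf; rstat_rxf &= rstat_rxf - 1)
		n++;
	return n;
}

/* Returns the total budget handed out across active queues. */
int poll_sketch(unsigned int rstat_rxf, int budget)
{
	unsigned int num_act_queues = count_active_queues(rstat_rxf);
	int budget_per_q = 0;	/* explicit init, only to silence the warning */
	int work_done = 0;

	assert(budget >= 0);

	/* Assigned only when at least one queue is active. */
	if (num_act_queues)
		budget_per_q = budget / (int)num_act_queues;

	/* Read only for bits set in rstat_rxf, so never read when
	 * num_act_queues == 0; the compiler cannot prove that. */
	for (unsigned int q = 0; q < 8; q++) {
		if (rstat_rxf & (1u << q))
			work_done += budget_per_q;
	}
	return work_done;
}
```

The value is only read on paths where it was assigned, so the explicit
zero changes nothing at runtime in this model.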
