Netdev List


* Re: ipv4: broadcast sometimes leaves wrong interface (since commit e066008b38ca9ace1b6de8dbbac8ed460640791d)
From: Julian Anastasov @ 2011-11-30  0:56 UTC (permalink / raw)
  To: David Miller; +Cc: jeroen, netdev
In-Reply-To: <20111129.182333.365445684859306013.davem@davemloft.net>


	Hello,

On Tue, 29 Nov 2011, David Miller wrote:

> From: Julian Anastasov <ja@ssi.bg>
> Date: Wed, 30 Nov 2011 01:06:46 +0200 (EET)
> 
> > 	Maybe the solution is to convert inet_addr_lst
> > from an hlist to a normal list, so that we can append new
> > addresses at the tail and __ip_dev_find finds the first
> > device where the IP was added.
> 
> Sure, but do we really want to guarantee this behavior forever?

	I remember a similar issue with ipsec%d interfaces
years ago. Maybe the PPP servers should be configured
to use another local IP for PPP devices, because broadcasts
and multicasts may need their own local address: they can use
addresses to refer to output devices.

	Not sure what else we can do; for now we have to waste
1-2KB for this to work, until someone recreates the eth
addresses and changes the ordering.
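
	A rough userspace sketch of why the insertion order matters
for the lookup. All names here (addr_entry, addr_bucket, bucket_*) are
hypothetical stand-ins, not the kernel's inet_addr_lst code: with
head insertion the lookup returns the most recently added device,
with tail append it returns the first one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (address, device) entry; not the kernel's struct in_ifaddr. */
struct addr_entry {
	uint32_t addr;
	int ifindex;
	struct addr_entry *next;
};

/* One hash bucket, kept as a singly linked list with a tail pointer. */
struct addr_bucket {
	struct addr_entry *head;
	struct addr_entry *tail;
};

/* hlist-style insertion: the newest entry becomes the first one seen. */
static void bucket_add_head(struct addr_bucket *b, struct addr_entry *e)
{
	assert(b && e);
	e->next = b->head;
	b->head = e;
	if (!b->tail)
		b->tail = e;
}

/* Tail append: the oldest entry stays first in lookup order. */
static void bucket_add_tail(struct addr_bucket *b, struct addr_entry *e)
{
	assert(b && e);
	e->next = NULL;
	if (b->tail)
		b->tail->next = e;
	else
		b->head = e;
	b->tail = e;
}

/* Return the ifindex of the first entry matching addr, or -1 if none. */
static int bucket_find(const struct addr_bucket *b, uint32_t addr)
{
	const struct addr_entry *e;

	for (e = b->head; e; e = e->next)
		if (e->addr == addr)
			return e->ifindex;
	return -1;
}
```

	Adding the same address first on ifindex 2 and then on
ifindex 7 makes bucket_find() return 7 after bucket_add_head() but
2 after bucket_add_tail().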

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH v3 05/10] bql: Byte queue limits
From: Stephen Hemminger @ 2011-11-30  0:59 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.2.00.1111222140410.15246@pokey.mtv.corp.google.com>

It is great to see new features, but we keep adding stuff with no
visible user documentation!

Every new feature added to networking should have an entry in
Documentation/networking/ as part of the patchset.

^ permalink raw reply

* Re: [PATCH v7 03/10] socket: initial cgroup code.
From: KAMEZAWA Hiroyuki @ 2011-11-30  1:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, paul-inf54ven1CmVyaH7bEyXVA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kirill-oKw7cIdHH8eLwutG50LtGA, avagin-bzQdu9zFT3WakBO8gow8eQ,
	devel-GEFAQzZX7r8dnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1322611021-1730-4-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Tue, 29 Nov 2011 21:56:54 -0200
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> The goal of this work is to move the memory pressure tcp
> controls to a cgroup, instead of just relying on global
> conditions.
> 
> To avoid excessive overhead in the network fast paths,
> the code that accounts allocated memory to a cgroup is
> hidden inside a static_branch(). This branch is patched out
> until the first non-root cgroup is created. So when nobody
> is using cgroups, even if it is mounted, no significant performance
> penalty should be seen.
> 
> This patch handles the generic part of the code, and has nothing
> tcp-specific.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Acked-by: Kirill A. Shutemov<kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-ZTyZJ+IMMTwvtab9mdV7tw@public.gmane.org>
> CC: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> CC: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> CC: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

<snip>

> +extern struct jump_label_key memcg_socket_limit_enabled;
>  static inline bool sk_has_memory_pressure(const struct sock *sk)
>  {
>  	return sk->sk_prot->memory_pressure != NULL;
> @@ -873,6 +900,17 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
>  {
>  	if (!sk->sk_prot->memory_pressure)
>  		return false;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	if (static_branch(&memcg_socket_limit_enabled)) {
> +		struct cg_proto *cg_proto = sk->sk_cgrp;
> +
> +		if (!cg_proto)
> +			goto nocgroup;
> +		return !!*cg_proto->memory_pressure;
> +	} else

What is this dangling 'else' for?


> +nocgroup:
> +#endif
> +
>  	return !!*sk->sk_prot->memory_pressure;
>  }
>  
> @@ -880,52 +918,176 @@ static inline void sk_leave_memory_pressure(struct sock *sk)
>  {
>  	int *memory_pressure = sk->sk_prot->memory_pressure;
>  
> -	if (memory_pressure && *memory_pressure)
> +	if (!memory_pressure)
> +		return;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	if (static_branch(&memcg_socket_limit_enabled)) {
> +		struct cg_proto *cg_proto = sk->sk_cgrp;
> +
> +		if (!cg_proto)
> +			goto nocgroup;
> +
> +		for (; cg_proto; cg_proto = cg_proto->parent)
> +			if (*cg_proto->memory_pressure)
> +				*cg_proto->memory_pressure = 0;
> +	}
> +nocgroup:
> +#endif

Hmm... can't we find a good way to avoid this #ifdef?

I guess, like the NUMA_BUILD macro in page_alloc.c, you can define

if (HAS_KMEM_LIMIT && static_branch(&.....)).

For example,
==
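#include <stdio.h>

#define HAS_SPECIAL     0

/* Declared but never defined: the call below is dead code. */
void call(void);

int main(int argc, char *argv[])
{
        /* Constant-false condition; the compiler is expected to drop the call. */
        if (HAS_SPECIAL)
                call();

        printf("Hey!\n");
        return 0;
}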
==

This compiles.

So, I guess...

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
#define do_memcg_kmem_account static_branch(&memcg_socket_limit_enabled)
#else
#define do_memcg_kmem_account 0
#endif

might be good (not tested).


BTW, I don't think 'goto nocgroup' is good.
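
One possible shape without the goto, the dangling else, or the #ifdef in
the function body, as a userspace sketch only. The *_sketch types,
HAS_KMEM_LIMIT and socket_limit_enabled are hypothetical stand-ins (the
last one for static_branch(&memcg_socket_limit_enabled)); this is not the
patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: defined to 1 when the kmem controller is configured in, else 0. */
#ifndef HAS_KMEM_LIMIT
#define HAS_KMEM_LIMIT 1
#endif

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct cg_proto_sketch {
	int *memory_pressure;
};

struct sock_sketch {
	int *memory_pressure;		/* global per-protocol flag, may be NULL */
	struct cg_proto_sketch *cgrp;	/* NULL when not accounted to a cgroup */
};

/* Stand-in for static_branch(&memcg_socket_limit_enabled). */
static bool socket_limit_enabled;

static bool under_memory_pressure(const struct sock_sketch *sk)
{
	assert(sk);
	if (!sk->memory_pressure)
		return false;

	/* The whole branch folds away when HAS_KMEM_LIMIT is 0. */
	if (HAS_KMEM_LIMIT && socket_limit_enabled && sk->cgrp)
		return *sk->cgrp->memory_pressure != 0;

	/* No cgroup attached (or feature off): fall back to the global flag. */
	return *sk->memory_pressure != 0;
}
```

The NULL cgroup case falls through to the global check, which is what the
'goto nocgroup' was doing.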



> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3becb24..12a08bf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -377,6 +377,40 @@ enum mem_type {
>  #define MEM_CGROUP_RECLAIM_SOFT_BIT	0x2
>  #define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
>  
> +static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> +{
> +	return (memcg == root_mem_cgroup);
> +}
> +

Why do you need to move this definition?



Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe cgroups" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] sky2: add bql support
From: Stephen Hemminger @ 2011-11-30  1:15 UTC (permalink / raw)
  To: David Miller; +Cc: Tom Herbert, netdev
In-Reply-To: <20111128201907.6d31d4c3@nehalam.linuxnetplumber.net>

This adds support for byte queue limits and aggregates statistics
update (suggestion from Eric).

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/ethernet/marvell/sky2.c	2011-11-28 19:37:50.739557687 -0800
+++ b/drivers/net/ethernet/marvell/sky2.c	2011-11-29 16:44:52.786657199 -0800
@@ -1110,6 +1110,7 @@ static void tx_init(struct sky2_port *sk
 	sky2->tx_prod = sky2->tx_cons = 0;
 	sky2->tx_tcpsum = 0;
 	sky2->tx_last_mss = 0;
+	netdev_reset_queue(sky2->netdev);
 
 	le = get_tx_le(sky2, &sky2->tx_prod);
 	le->addr = 0;
@@ -1971,6 +1972,7 @@ static netdev_tx_t sky2_xmit_frame(struc
 	if (tx_avail(sky2) <= MAX_SKB_TX_LE)
 		netif_stop_queue(dev);
 
+	netdev_sent_queue(dev, skb->len);
 	sky2_put_idx(hw, txqaddr[sky2->port], sky2->tx_prod);
 
 	return NETDEV_TX_OK;
@@ -2002,7 +2004,8 @@ mapping_error:
 static void sky2_tx_complete(struct sky2_port *sky2, u16 done)
 {
 	struct net_device *dev = sky2->netdev;
-	unsigned idx;
+	u16 idx;
+	unsigned int bytes_compl = 0, pkts_compl = 0;
 
 	BUG_ON(done >= sky2->tx_ring_size);
 
@@ -2017,10 +2020,8 @@ static void sky2_tx_complete(struct sky2
 			netif_printk(sky2, tx_done, KERN_DEBUG, dev,
 				     "tx done %u\n", idx);
 
-			u64_stats_update_begin(&sky2->tx_stats.syncp);
-			++sky2->tx_stats.packets;
-			sky2->tx_stats.bytes += skb->len;
-			u64_stats_update_end(&sky2->tx_stats.syncp);
+			pkts_compl++;
+			bytes_compl += skb->len;
 
 			re->skb = NULL;
 			dev_kfree_skb_any(skb);
@@ -2031,6 +2032,13 @@ static void sky2_tx_complete(struct sky2
 
 	sky2->tx_cons = idx;
 	smp_mb();
+
+	netdev_completed_queue(dev, pkts_compl, bytes_compl);
+
+	u64_stats_update_begin(&sky2->tx_stats.syncp);
+	sky2->tx_stats.packets += pkts_compl;
+	sky2->tx_stats.bytes += bytes_compl;
+	u64_stats_update_end(&sky2->tx_stats.syncp);
 }
 
 static void sky2_tx_reset(struct sky2_hw *hw, unsigned port)
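
For readers new to the pattern, a userspace sketch of the accounting this
patch relies on: bytes are added when a frame is queued, and completions
are summed in the loop and reported once afterwards. The names
(queue_sketch, tx_stats_sketch, queue_sent, tx_complete) are hypothetical
and only mimic the roles of netdev_sent_queue(), netdev_completed_queue()
and the u64_stats section; none of this is driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters; the driver protects these with a u64_stats section. */
struct tx_stats_sketch {
	unsigned long long packets;
	unsigned long long bytes;
	unsigned int updates;	/* how many times the stats section was entered */
};

/* Hypothetical queue state: bytes handed to hardware, not yet completed. */
struct queue_sketch {
	unsigned int inflight_bytes;
};

/* Transmit path: account the bytes of one queued frame. */
static void queue_sent(struct queue_sketch *q, unsigned int bytes)
{
	q->inflight_bytes += bytes;
}

/* Completion path: sum up the batch, then report it in one step. */
static void tx_complete(struct queue_sketch *q, struct tx_stats_sketch *st,
			const unsigned int *lens, size_t n)
{
	unsigned int pkts_compl = 0, bytes_compl = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		pkts_compl++;
		bytes_compl += lens[i];
	}

	/* Completions must never exceed what was reported as sent. */
	assert(bytes_compl <= q->inflight_bytes);
	q->inflight_bytes -= bytes_compl;

	/* One stats update per batch instead of one per packet. */
	st->packets += pkts_compl;
	st->bytes += bytes_compl;
	st->updates++;
}
```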

^ permalink raw reply

* Re: RCU'ed dst_get_neighbour()
From: Marc Aurele La France @ 2011-11-30  1:17 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Eric Dumazet, David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <CAL1RGDUcYJKBSCthTL1AFgpRvzjoaEpHjGFhratKP-98ufRKnw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 29 Nov 2011, Roland Dreier wrote:
> On Tue, Nov 29, 2011 at 1:31 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Here is the result of this audit, please double check and test it, I
>> only compiled this.

> Thanks Eric... I'll queue this up and send it on once we get a good
> report from Marc.

I can confirm that Eric's patch, retrofitted to 3.1.3, fixes the problem.

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi-yfeSBMgouQgsA/PxXw9srA@public.gmane.org         |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v7 04/10] tcp memory pressure controls
From: KAMEZAWA Hiroyuki @ 2011-11-30  1:49 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, paul-inf54ven1CmVyaH7bEyXVA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kirill-oKw7cIdHH8eLwutG50LtGA, avagin-bzQdu9zFT3WakBO8gow8eQ,
	devel-GEFAQzZX7r8dnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki
In-Reply-To: <1322611021-1730-5-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Tue, 29 Nov 2011 21:56:55 -0200
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> This patch introduces memory pressure controls for the tcp
> protocol. It uses the generic socket memory pressure code
> introduced in earlier patches, and fills in the
> necessary data in cg_proto struct.
> 
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu-LdfC7J4mv27QFUHtdCDX3A@public.gmane.org>
> CC: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Some comments.



> ---
>  Documentation/cgroups/memory.txt |    2 +
>  include/linux/memcontrol.h       |    3 ++
>  include/net/sock.h               |    2 +
>  include/net/tcp_memcontrol.h     |   17 +++++++++
>  mm/memcontrol.c                  |   36 +++++++++++++++++--
>  net/core/sock.c                  |   42 ++++++++++++++++++++--
>  net/ipv4/Makefile                |    1 +
>  net/ipv4/tcp_ipv4.c              |    8 ++++-
>  net/ipv4/tcp_memcontrol.c        |   73 ++++++++++++++++++++++++++++++++++++++
>  net/ipv6/tcp_ipv6.c              |    4 ++
>  10 files changed, 181 insertions(+), 7 deletions(-)
>  create mode 100644 include/net/tcp_memcontrol.h
>  create mode 100644 net/ipv4/tcp_memcontrol.c
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 3cf9d96..1e43da4 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -299,6 +299,8 @@ and set kmem extension config option carefully.
>  thresholds. The Memory Controller allows them to be controlled individually
>  per cgroup, instead of globally.
>  
> +* tcp memory pressure: sockets memory pressure for the tcp protocol.
> +
>  3. User Interface
>  
>  0. Configuration
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 60964c3..fa2482a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -85,6 +85,9 @@ extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  extern struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm);
>  
> +extern struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont);
> +extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> +

use 'memcg' please.

> -static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
> +struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
>  {
>  	return container_of(cgroup_subsys_state(cont,
>  				mem_cgroup_subsys_id), struct mem_cgroup,
> @@ -4717,14 +4732,27 @@ static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
>  
>  	ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
>  			       ARRAY_SIZE(kmem_cgroup_files));
> +
> +	if (!ret)
> +		ret = mem_cgroup_sockets_init(cont, ss);
>  	return ret;
>  };

You do the initialization here. The reason, I think, is that
'proto_list' is not available at creation of the root cgroup, so
you need to delay setup until mounting.

If so, please add a comment or find another way.
This does not seem very clean to me.




> +static DEFINE_RWLOCK(proto_list_lock);
> +static LIST_HEAD(proto_list);
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +int mem_cgroup_sockets_init(struct cgroup *cgrp, struct cgroup_subsys *ss)
> +{
> +	struct proto *proto;
> +	int ret = 0;
> +
> +	read_lock(&proto_list_lock);
> +	list_for_each_entry(proto, &proto_list, node) {
> +		if (proto->init_cgroup)
> +			ret = proto->init_cgroup(cgrp, ss);
> +			if (ret)
> +				goto out;
> +	}

The indentation seems bad, or {} is missing.
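
With the braces added, the loop would presumably look like the following
userspace sketch. proto_sketch, init_ok/init_fail and sockets_init are
hypothetical names, and an array stands in for proto_list; the point is
only that 'ret' is tested inside the same block that assigns it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct proto with an optional init hook. */
struct proto_sketch {
	int (*init_cgroup)(void);
};

static int init_calls;

static int init_ok(void)
{
	init_calls++;
	return 0;
}

static int init_fail(void)
{
	init_calls++;
	return -12;	/* e.g. -ENOMEM */
}

/* Call every available init hook, stopping at the first error. */
static int sockets_init(const struct proto_sketch *protos, size_t n)
{
	size_t i;
	int ret = 0;

	assert(protos || n == 0);
	for (i = 0; i < n; i++) {
		if (protos[i].init_cgroup) {
			ret = protos[i].init_cgroup();
			if (ret)
				break;
		}
	}
	return ret;
}
```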


> +EXPORT_SYMBOL(memcg_tcp_enter_memory_pressure);
> +
> +int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
> +{
> +	/*
> +	 * The root cgroup does not use res_counters, but rather,
> +	 * rely on the data already collected by the network
> +	 * subsystem
> +	 */
> +	struct res_counter *res_parent = NULL;
> +	struct cg_proto *cg_proto;
> +	struct tcp_memcontrol *tcp;
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> +
> +	cg_proto = tcp_prot.proto_cgroup(memcg);
> +	if (!cg_proto)
> +		return 0;
> +
> +	tcp = tcp_from_cgproto(cg_proto);
> +	cg_proto->parent = tcp_prot.proto_cgroup(parent);
> +
> +	tcp->tcp_prot_mem[0] = sysctl_tcp_mem[0];
> +	tcp->tcp_prot_mem[1] = sysctl_tcp_mem[1];
> +	tcp->tcp_prot_mem[2] = sysctl_tcp_mem[2];
> +	tcp->tcp_memory_pressure = 0;

Question:

Will this value be updated when an admin changes the sysctl?

I guess the sysctl is set by a system init script or similar, which may
run later than mounting the cgroup.
I don't want to write a guideline like 'please set the sysctl value before
mounting the cgroup'.


Thanks,
-Kame


^ permalink raw reply

* Re: [PATCH] sky2: add bql support
From: David Miller @ 2011-11-30  1:55 UTC (permalink / raw)
  To: shemminger; +Cc: therbert, netdev
In-Reply-To: <20111129171533.146cec9f@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 29 Nov 2011 17:15:33 -0800

> This adds support for byte queue limits and aggregates statistics
> update (suggestion from Eric).
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [PATCH v7 06/10] tcp buffer limitation: per-cgroup limit
From: KAMEZAWA Hiroyuki @ 2011-11-30  2:00 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, paul-inf54ven1CmVyaH7bEyXVA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kirill-oKw7cIdHH8eLwutG50LtGA, avagin-bzQdu9zFT3WakBO8gow8eQ,
	devel-GEFAQzZX7r8dnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1322611021-1730-7-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Tue, 29 Nov 2011 21:56:57 -0200
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> This patch uses the "tcp.limit_in_bytes" field of the kmem_cgroup to
> effectively control the amount of kernel memory pinned by a cgroup.
> 
> This value is ignored in the root cgroup, and in all others,
> caps the value specified by the admin in the net namespaces'
> view of tcp_sysctl_mem.
> 
> If namespaces are being used, the admin is allowed to set a
> value bigger than cgroup's maximum, the same way it is allowed
> to set pretty much unlimited values in a real box.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

You need one more fix.
(Please add a changelog.)


> +static int tcp_update_limit(struct mem_cgroup *memcg, u64 val)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct tcp_memcontrol *tcp;
> +	struct cg_proto *cg_proto;
> +	u64 old_lim;
> +	int i;
> +	int ret;
> +
> +	cg_proto = tcp_prot.proto_cgroup(memcg);
> +	if (!cg_proto)
> +		return -EINVAL;
> +
> +	tcp = tcp_from_cgproto(cg_proto);
> +
> +	old_lim = res_counter_read_u64(&tcp->tcp_memory_allocated, RES_LIMIT);
> +	ret = res_counter_set_limit(&tcp->tcp_memory_allocated, val);
> +	if (ret)
> +		return ret;
> +
> +	for (i = 0; i < 3; i++)
> +		tcp->tcp_prot_mem[i] = min_t(long, val >> PAGE_SHIFT,
> +					     net->ipv4.sysctl_tcp_mem[i]);
> +
> +	if (val == RESOURCE_MAX)
> +		jump_label_dec(&memcg_socket_limit_enabled);

if (val == RESOURCE_MAX && old_lim != RESOURCE_MAX)
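
To illustrate why the old limit has to be checked, here is a userspace
sketch with a plain counter standing in for the jump label's enable count.
LIMIT_MAX, limit_users and update_limit are hypothetical names, and the
increment branch is an assumption about the part of the function not
quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define LIMIT_MAX UINT64_MAX	/* stand-in for RESOURCE_MAX */

static int limit_users;		/* stand-in for the jump label's enable count */

struct tcp_limit_sketch {
	uint64_t limit;
};

static void update_limit(struct tcp_limit_sketch *t, uint64_t val)
{
	uint64_t old_lim = t->limit;

	t->limit = val;

	/* Only touch the count on a real transition, never on MAX -> MAX. */
	if (val == LIMIT_MAX && old_lim != LIMIT_MAX)
		limit_users--;		/* limited -> unlimited */
	else if (val != LIMIT_MAX && old_lim == LIMIT_MAX)
		limit_users++;		/* unlimited -> limited (assumed) */

	/* Without the old_lim check, writing MAX twice would go negative. */
	assert(limit_users >= 0);
}
```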

Thanks,
-Kame


^ permalink raw reply

* Re: RCU'ed dst_get_neighbour()
From: Roland Dreier @ 2011-11-30  2:00 UTC (permalink / raw)
  To: Marc Aurele La France
  Cc: Eric Dumazet, David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.WNT.2.00.1111291815410.1288@TSI>

On Tue, Nov 29, 2011 at 5:17 PM, Marc Aurele La France <tsi-yfeSBMgouQgsA/PxXw9srA@public.gmane.org> wrote:
> On Tue, 29 Nov 2011, Roland Dreier wrote:
>>
>> On Tue, Nov 29, 2011 at 1:31 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>> wrote:
>>>
>>> Here is the result of this audit, please double check and test it, I
>>> only compiled this.
>
>> Thanks Eric... I'll queue this up and send it on once we get a good
>> report from Marc.
>
> I can confirm that Eric's patch, retrofitted to 3.1.3, fixes the problem.

Oh... the problem was already in 3.1?

I'll tag the patch for stable too.

 - R.

^ permalink raw reply

* Re: [PATCH v7 08/10] Display current tcp failcnt in kmem cgroup
From: KAMEZAWA Hiroyuki @ 2011-11-30  2:01 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, ebiederm, davem, gthelen, netdev,
	linux-mm, kirill, avagin, devel, eric.dumazet, cgroups
In-Reply-To: <1322611021-1730-9-git-send-email-glommer@parallels.com>

On Tue, 29 Nov 2011 21:56:59 -0200
Glauber Costa <glommer@parallels.com> wrote:

> This patch introduces kmem.tcp.failcnt file, living in the
> kmem_cgroup filesystem. Following the pattern in the other
> memcg resources, this files keeps a counter of how many times
> allocation failed due to limits being hit in this cgroup.
> The root cgroup will always show a failcnt of 0.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: David S. Miller <davem@davemloft.net>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v7 09/10] Display maximum tcp memory allocation in kmem cgroup
From: KAMEZAWA Hiroyuki @ 2011-11-30  2:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, paul-inf54ven1CmVyaH7bEyXVA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kirill-oKw7cIdHH8eLwutG50LtGA, avagin-bzQdu9zFT3WakBO8gow8eQ,
	devel-GEFAQzZX7r8dnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1322611021-1730-10-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Tue, 29 Nov 2011 21:57:00 -0200
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> This patch introduces kmem.tcp.max_usage_in_bytes file, living in the
> kmem_cgroup filesystem. The root cgroup will display a value equal
> to RESOURCE_MAX. This is to avoid introducing any locking schemes in
> the network paths when cgroups are not being actively used.
> 
> All others, will see the maximum memory ever used by this cgroup.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>


^ permalink raw reply

* Re: [PATCH v7 00/10] Request for Inclusion: per-cgroup tcp memory pressure
From: KAMEZAWA Hiroyuki @ 2011-11-30  2:11 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, paul-inf54ven1CmVyaH7bEyXVA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kirill-oKw7cIdHH8eLwutG50LtGA, avagin-bzQdu9zFT3WakBO8gow8eQ,
	devel-GEFAQzZX7r8dnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1322611021-1730-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Tue, 29 Nov 2011 21:56:51 -0200
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> Hi,
> 
> This patchset implements per-cgroup tcp memory pressure controls. It did not change
> significantly since last submission: rather, it just merges the comments Kame had.
> Most of them are style-related and/or Documentation, but there are two real bugs he
> managed to spot (thanks)
> 
> Please let me know if there is anything else I should address.
> 

After reading all the code again, something seems strange to me. Could you clarify?

Here.
==
+void sock_update_memcg(struct sock *sk)
+{
+	/* right now a socket spends its whole life in the same cgroup */
+	if (sk->sk_cgrp) {
+		WARN_ON(1);
+		return;
+	}
+	if (static_branch(&memcg_socket_limit_enabled)) {
+		struct mem_cgroup *memcg;
+
+		BUG_ON(!sk->sk_prot->proto_cgroup);
+
+		rcu_read_lock();
+		memcg = mem_cgroup_from_task(current);
+		if (!mem_cgroup_is_root(memcg))
+			sk->sk_cgrp = sk->sk_prot->proto_cgroup(memcg);
+		rcu_read_unlock();
==

sk->sk_cgrp is set from a memcg without taking any reference count.

So there is no check preventing rmdir() and freeing of the memcg.

Is there a css_get() or mem_cgroup_get() somewhere?
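
What I have in mind is roughly the following, shown as a userspace sketch
rather than memcg code. group_sketch, sock_ref_sketch and the helper
functions are hypothetical; the get/put pair plays the role that
css_get() or mem_cgroup_get() would play, so the group outlives rmdir()
while a socket still points at it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical group object; refs counts the sockets pointing at it. */
struct group_sketch {
	int refs;
	bool removed;	/* rmdir() was requested */
	bool freed;	/* memory would have been released */
};

struct sock_ref_sketch {
	struct group_sketch *grp;
};

/* Drop one reference; free only if rmdir() already happened. */
static void group_put(struct group_sketch *g)
{
	assert(g->refs > 0);
	if (--g->refs == 0 && g->removed)
		g->freed = true;
}

/* Take a reference before publishing the pointer in the socket. */
static void sock_attach(struct sock_ref_sketch *sk, struct group_sketch *g)
{
	assert(!sk->grp);
	g->refs++;
	sk->grp = g;
}

/* Socket teardown: give the reference back. */
static void sock_release(struct sock_ref_sketch *sk)
{
	if (sk->grp) {
		group_put(sk->grp);
		sk->grp = NULL;
	}
}

/* rmdir(): the group is freed now only if no socket holds it. */
static void group_rmdir(struct group_sketch *g)
{
	g->removed = true;
	if (g->refs == 0)
		g->freed = true;
}
```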

Thanks,
-Kame




^ permalink raw reply

* Re: [PATCH v7 10/10] Disable task moving when using kernel memory accounting
From: KAMEZAWA Hiroyuki @ 2011-11-30  2:22 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, paul-inf54ven1CmVyaH7bEyXVA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kirill-oKw7cIdHH8eLwutG50LtGA, avagin-bzQdu9zFT3WakBO8gow8eQ,
	devel-GEFAQzZX7r8dnm+yROfE0A, eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1322611021-1730-11-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Tue, 29 Nov 2011 21:57:01 -0200
Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:

> Since this code is still experimental, we are leaving the exact
> details of how to move tasks between cgroups when kernel memory
> accounting is used as future work.
> 
> For now, we simply disallow movement if there are any pending
> accounted memory.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Hiroyouki Kamezawa <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> ---
>  mm/memcontrol.c |   23 ++++++++++++++++++++++-
>  1 files changed, 22 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a31a278..dd9a6d9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5453,10 +5453,19 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
>  {
>  	int ret = 0;
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
> +	struct mem_cgroup *from = mem_cgroup_from_task(p);
> +
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
> +	if (from != memcg && !mem_cgroup_is_root(from) &&
> +	    res_counter_read_u64(&from->tcp_mem.tcp_memory_allocated, RES_USAGE)) {
> +		printk(KERN_WARNING "Can't move tasks between cgroups: "
> +			"Kernel memory held.\n");
> +		return 1;
> +	}
> +#endif

I wonder... reading all the code again, this check is incorrect.

Hm, let me clarify. IIUC, in the old code, moving was prevented because you held
a reference count on the cgroup, which could cause trouble at rmdir() by leaking the refcnt.

BTW, because a socket is a resource shared between cgroups, changes in mm->owner
may cause a task to move between cgroups implicitly. So, if you allow the resource
to leak here, I guess... you can take the mem_cgroup_get() refcnt, which is
memcg-local, and allow rmdir(). Then this limitation may disappear.

Then users will be happy, but admins will have unseen kernel resource usage in
an unpopulated (removed by rmdir) memcg. Hm, big trouble?

Thanks,
-Kame


^ permalink raw reply

* Problem with the first ICMP_REDIRECT message
From: Li Wei @ 2011-11-30  3:10 UTC (permalink / raw)
  To: netdev

Hi all, 

I am doing some tests on ICMP_REDIRECT messages and found that I never receive
the first ICMP_REDIRECT message, although the following REDIRECT messages are normal.

My test environment is as follows:
Three PCs:
PC A:
	IP: 192.168.0.1 MAC: HW:0A
PC B:
	IP: 192.168.0.2 MAC: HW:0B
	module nf_nat loaded and at least one rule in the nat table (the rule content does not matter)
PC C:
	IP: 192.168.0.3 MAC: HW:0C

Enable IP forwarding on PC B:
# echo 1 > /proc/sys/net/ipv4/ip_forward

Add a static ARP entry on PC A:
# arp -s 192.168.0.3 HW:0B

Ping 192.168.0.3 from PC A:
# ping -c1 192.168.0.3

I expect PC A to receive an ICMP_REDIRECT message from PC B, but nothing
is received.

Ping 192.168.0.3 three times from PC A:
# ping -c3 192.168.0.3 

PC A gets two ICMP_REDIRECT messages from PC B; the first one is missing.

After some code searching, I found that in nf_nat_icmp_reply_translation() the first
ICMP_REDIRECT message is dropped because ct->status is not IPS_NAT_DONE_MASK.

Does anyone have a suggestion?

^ permalink raw reply

* Re: [PATCH] at91_ether: use gpio_is_valid for phy IRQ line
From: Jean-Christophe PLAGNIOL-VILLARD @ 2011-11-30  4:44 UTC (permalink / raw)
  To: David Miller
  Cc: nicolas.ferre, jamie, netdev, sfr, linux-next, linux-kernel,
	linux-arm-kernel
In-Reply-To: <20111129.185342.105454076866922618.davem@davemloft.net>

On 18:53 Tue 29 Nov     , David Miller wrote:
> From: Nicolas Ferre <nicolas.ferre@atmel.com>
> Date: Thu, 24 Nov 2011 22:21:14 +0100
> 
> > Use the generic gpiolib gpio_is_valid() function to test
> > if the phy IRQ line GPIO is actually provided.
> > 
> > For non-connected or non-existing phy IRQ lines, -EINVAL
> > value is used for phy_irq_pin field of struct at91_eth_data.
> > 
> > Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
> 
> I'm assuming this goes through the ARM tree, because in both of my networking
> trees there is no ARM at91 implementation of gpio_is_valid().
Yes, the patch series this depends on is in arm-soc.

Can we have your ack or sob?

Best Regards,
J.

^ permalink raw reply


* Re: RCU'ed dst_get_neighbour()
From: Marc Aurele La France @ 2011-11-30  5:15 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Eric Dumazet, David Miller, netdev, linux-rdma
In-Reply-To: <CAL1RGDWzVRc4jFtsKUC6FGrdAWn3_OThXpbChCETCR8UCrwDDA@mail.gmail.com>

On Tue, 29 Nov 2011, Roland Dreier wrote:
> On Tue, Nov 29, 2011 at 5:17 PM, Marc Aurele La France <tsi@ualberta.ca> wrote:
>> On Tue, 29 Nov 2011, Roland Dreier wrote:
>>> On Tue, Nov 29, 2011 at 1:31 PM, Eric Dumazet <eric.dumazet@gmail.com>
>>> wrote:
>>>> Here is the result of this audit, please double check and test it, I
>>>> only compiled this.

>>> Thanks Eric... I'll queue this up and send it on once we get a good
>>> report from Marc.

>> I can confirm that Eric's patch, retrofitted to 3.1.3, fixes the problem.

> Oh... the problem was already in 3.1?

Yes, but not in anything earlier.

Marc.

+----------------------------------+----------------------------------+
|  Marc Aurele La France           |  work:   1-780-492-9310          |
|  Academic Information and        |  fax:    1-780-492-1729          |
|    Communications Technologies   |  email:  tsi@ualberta.ca         |
|  352 General Services Building   +----------------------------------+
|  University of Alberta           |                                  |
|  Edmonton, Alberta               |    Standard disclaimers apply    |
|  T6G 2H1                         |                                  |
|  CANADA                          |                                  |
+----------------------------------+----------------------------------+

^ permalink raw reply

* Re: [RFC net-next] Include selection of congestion control algorithm in that which is inherited across an accept() call
From: Eric Dumazet @ 2011-11-30  5:21 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev
In-Reply-To: <20111130003112.C04FE29005FE@tardy>

On Tuesday, 29 November 2011 at 16:31 -0800, Rick Jones wrote:
> From: Rick Jones <rick.jones2@hp.com>
> 
> Include congestion control algorithm in what is inherited across an
> accept() call.
> 
> Signed-off-by: Rick Jones <rick.jones2@hp.com>
> 

Hi Rick, thanks for working on this!

I am still not sure why the new fields are needed; please see my
comment inline:

> ---
> 
...
>  include/linux/ipv6.h     |    3 +++
>  include/net/inet_sock.h  |    3 +++
>  net/ipv4/tcp_ipv4.c      |    3 +++
>  net/ipv4/tcp_minisocks.c |    2 +-
>  net/ipv6/tcp_ipv6.c      |    3 +++
>  5 files changed, 13 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
> index 0c99776..1d0dde3 100644
> --- a/include/linux/ipv6.h
> +++ b/include/linux/ipv6.h
> @@ -265,11 +265,14 @@ static inline int inet6_iif(const struct sk_buff *skb)
>  	return IP6CB(skb)->iif;
>  }
>  
> +struct tcp_congestion_ops;
> +
>  struct inet6_request_sock {
>  	struct in6_addr		loc_addr;
>  	struct in6_addr		rmt_addr;
>  	struct sk_buff		*pktopts;
>  	int			iif;
> +	const struct tcp_congestion_ops *cong_ops;
>  };
>  
>  struct tcp6_request_sock {
> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> index f941964..9ec68a3 100644
> --- a/include/net/inet_sock.h
> +++ b/include/net/inet_sock.h
> @@ -69,6 +69,8 @@ struct ip_options_data {
>  	char			data[40];
>  };
>  
> +struct tcp_congestion_ops;
> +
>  struct inet_request_sock {
>  	struct request_sock	req;
>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> @@ -89,6 +91,7 @@ struct inet_request_sock {
>  				no_srccheck: 1;
>  	kmemcheck_bitfield_end(flags);
>  	struct ip_options_rcu	*opt;
> +	const struct tcp_congestion_ops *cong_ops;
>  };
>  
>  static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index a9db4b1..79c02e2 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1254,6 +1254,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>  	const u8 *hash_location;
>  	struct request_sock *req;
>  	struct inet_request_sock *ireq;
> +	struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	struct dst_entry *dst = NULL;
>  	__be32 saddr = ip_hdr(skb)->saddr;
> @@ -1341,6 +1342,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
>  	ireq->rmt_addr = saddr;
>  	ireq->no_srccheck = inet_sk(sk)->transparent;
>  	ireq->opt = tcp_v4_save_options(sk, skb);
> +	ireq->cong_ops = (icsk->icsk_ca_ops) ? icsk->icsk_ca_ops :
> +					&tcp_init_congestion_ops;
>  
>  	if (security_inet_conn_request(sk, skb, req))
>  		goto drop_and_free;
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 945efff..be338d7 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -495,7 +495,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
>  		newtp->frto_counter = 0;
>  		newtp->frto_highmark = 0;
>  
> -		newicsk->icsk_ca_ops = &tcp_init_congestion_ops;
> +		newicsk->icsk_ca_ops = ireq->cong_ops;


At this point, sk points to the listener socket, which is locked. Why not
simply copy inet_csk(sk)->icsk_ca_ops to newicsk->icsk_ca_ops?

As a matter of fact, it was already copied when newsk was created (as a
clone of sk).

>  
>  		tcp_set_ca_state(newsk, TCP_CA_Open);
>  		tcp_init_xmit_timers(newsk);
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 9d74eee..c689779 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1164,6 +1164,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
>  	const u8 *hash_location;
>  	struct request_sock *req;
>  	struct inet6_request_sock *treq;
> +	struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct ipv6_pinfo *np = inet6_sk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	__u32 isn = TCP_SKB_CB(skb)->when;
> @@ -1254,6 +1255,8 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
>  		TCP_ECN_create_request(req, tcp_hdr(skb));
>  
>  	treq->iif = sk->sk_bound_dev_if;
> +	treq->cong_ops = (icsk->icsk_ca_ops) ? icsk->icsk_ca_ops :
> +					&tcp_init_congestion_ops;
>  
>  	/* So that link locals have meaning */
>  	if (!sk->sk_bound_dev_if &&


So my suggestion would be to use this two-line patch instead:

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 945efff..6b066e2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -495,8 +495,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->frto_counter = 0;
 		newtp->frto_highmark = 0;
 
-		newicsk->icsk_ca_ops = &tcp_init_congestion_ops;
-
 		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		skb_queue_head_init(&newtp->out_of_order_queue);
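
As a rough userspace illustration of the point above (the child socket
starts as a clone of the listener, so the pointer is inherited unless it is
explicitly overwritten), here is a minimal model. All names in it
(model_sock, ca_ops, clone_sock and so on) are hypothetical stand-ins and
not the kernel's real structures or functions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct ca_ops { const char *name; };

/* Models tcp_init_congestion_ops (the placeholder assigned to new children). */
static const struct ca_ops init_ops = { "init" };
/* Models an algorithm that was selected on the listener socket. */
static const struct ca_ops cubic_ops = { "cubic" };

struct model_sock { const struct ca_ops *ca_ops; };

/* Models the clone step: the child starts as a copy of the listener. */
static struct model_sock clone_sock(const struct model_sock *listener)
{
	struct model_sock child;

	memcpy(&child, listener, sizeof(child));
	return child;
}

/* Behaviour before the patch: the inherited pointer is overwritten. */
static struct model_sock create_child_with_reset(const struct model_sock *listener)
{
	struct model_sock child = clone_sock(listener);

	child.ca_ops = &init_ops;
	return child;
}

/* Behaviour with the assignment removed: the cloned pointer survives. */
static struct model_sock create_child_inherit(const struct model_sock *listener)
{
	return clone_sock(listener);
}
```

In this model, removing the single assignment is enough for the child to
keep the listener's choice, which is why no extra request_sock field is
needed to carry it.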

^ permalink raw reply related

* Re: RCU'ed dst_get_neighbour()
From: Eric Dumazet @ 2011-11-30  5:26 UTC (permalink / raw)
  To: Marc Aurele La France
  Cc: Roland Dreier, David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.WNT.2.00.1111292212280.208@TSI>

On Tuesday, 29 November 2011 at 22:15 -0700, Marc Aurele La France wrote:
> On Tue, 29 Nov 2011, Roland Dreier wrote:
> > On Tue, Nov 29, 2011 at 5:17 PM, Marc Aurele La France <tsi@ualberta.ca> wrote:
> >> On Tue, 29 Nov 2011, Roland Dreier wrote:
> >>> On Tue, Nov 29, 2011 at 1:31 PM, Eric Dumazet <eric.dumazet@gmail.com>
> >>> wrote:
> >>>> Here is the result of this audit, please double check and test it, I
> >>>> only compiled this.
> 
> >>> Thanks Eric... I'll queue this up and send it on once we get a good
> >>> report from Marc.
> 
> >> I can confirm that Eric's patch, retrofitted to 3.1.3, fixes the problem.
> 
> > Oh... the problem was already in 3.1?
> 
> Yes, but not in anything earlier.
> 

Yes:

# git describe --contains f2c31e32b378a6653f
v3.1-rc1~24^2~11

It all depends on whether f2c31e32b378a6653f is backported to 3.0 someday,
since it fixes a bug added in commit f39925dbde778 (ipv4: Cache learned
redirect information in inetpeer).

# git describe --contains  f39925dbde778  
v2.6.39-rc1~468^2~349



^ permalink raw reply

* Re: [RFC net-next] Include selection of congestion control algorithm in that which is inherited across an accept() call
From: Eric Dumazet @ 2011-11-30  5:40 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev, Yuchung Cheng
In-Reply-To: <1322630461.2596.61.camel@edumazet-laptop>

On Wednesday, 30 November 2011 at 06:21 +0100, Eric Dumazet wrote:

> So my suggestion would be to use this two lines patch instead :
> 
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 945efff..6b066e2 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -495,8 +495,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
>  		newtp->frto_counter = 0;
>  		newtp->frto_highmark = 0;
>  
> -		newicsk->icsk_ca_ops = &tcp_init_congestion_ops;
> -
>  		tcp_set_ca_state(newsk, TCP_CA_Open);
>  		tcp_init_xmit_timers(newsk);
>  		skb_queue_head_init(&newtp->out_of_order_queue);
> 

Please test this change and, if it's OK, resubmit your patch with the
appropriate Documentation change, as pointed out by Yuchung Cheng.

(File to change: Documentation/networking/ip-sysctl.txt)

You could clearly state that the congestion control algorithm selected on
the listener socket, if any, takes precedence over the system default
TCP congestion control value.

Thanks!
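
For anyone testing this from userspace, a minimal sketch using the
TCP_CONGESTION socket option follows. The helper names
(set_congestion_control, get_congestion_control) and the CA_NAME_LEN buffer
size are assumptions made for this example only; the setsockopt() and
getsockopt() calls are the standard Linux interface.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Buffer size assumed for this sketch; large enough for typical names. */
#define CA_NAME_LEN 16

/* Select a congestion control algorithm by name on a TCP socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
static int set_congestion_control(int fd, const char *name)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
			  name, (socklen_t)strlen(name));
}

/* Read back the algorithm name currently attached to the socket.
 * The result is always NUL-terminated on success. */
static int get_congestion_control(int fd, char *buf, size_t buflen)
{
	socklen_t len = (socklen_t)buflen;

	if (buflen == 0)
		return -1;
	memset(buf, 0, buflen);
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) < 0)
		return -1;
	buf[buflen - 1] = '\0';
	return 0;
}
```

A test would call set_congestion_control() on the listener before listen(),
then call get_congestion_control() on the descriptor returned by accept()
and compare the two names. Whether they match depends on the kernel having
the change discussed in this thread.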

^ permalink raw reply

* Re: [PATCH] at91_ether: use gpio_is_valid for phy IRQ line
From: David Miller @ 2011-11-30  5:40 UTC (permalink / raw)
  To: plagnioj
  Cc: nicolas.ferre, jamie, netdev, sfr, linux-next, linux-kernel,
	linux-arm-kernel
In-Reply-To: <20111130044403.GY15008@game.jcrosoft.org>

From: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
Date: Wed, 30 Nov 2011 05:44:03 +0100

> On 18:53 Tue 29 Nov     , David Miller wrote:
>> From: Nicolas Ferre <nicolas.ferre@atmel.com>
>> Date: Thu, 24 Nov 2011 22:21:14 +0100
>> 
>> > Use the generic gpiolib gpio_is_valid() function to test
>> > if the phy IRQ line GPIO is actually provided.
>> > 
>> > For non-connected or non-existing phy IRQ lines, -EINVAL
>> > value is used for phy_irq_pin field of struct at91_eth_data.
>> > 
>> > Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
>> 
>> I'm assuming this goes through the ARM tree, because in both of my networking
>> trees there is no ARM at91 implementation of gpio_is_valid().
> yes the depending patch series is in the arm-soc
> 
> can we have your ack or sob?

Acked-by: David S. Miller <davem@davemloft.net>
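
For readers unfamiliar with the convention in the patch description, here
is a small userspace model of the check. It assumes that a GPIO number is
valid when it is non-negative and below a platform limit. MODEL_NR_GPIOS,
model_gpio_is_valid(), struct model_eth_data and phy_irq_available() are
hypothetical names for this sketch, and the limit of 256 is an arbitrary
assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed upper bound; the real limit is defined per architecture. */
#define MODEL_NR_GPIOS 256

/* Models gpio_is_valid(): negative values such as -EINVAL are rejected. */
static bool model_gpio_is_valid(int number)
{
	return number >= 0 && number < MODEL_NR_GPIOS;
}

/* Models the board data: phy_irq_pin holds a GPIO number or -EINVAL. */
struct model_eth_data {
	int phy_irq_pin;
};

/* The driver only sets up the PHY interrupt when the pin is valid. */
static bool phy_irq_available(const struct model_eth_data *data)
{
	return model_gpio_is_valid(data->phy_irq_pin);
}
```

With this convention a board that does not wire the PHY IRQ line sets
phy_irq_pin to -EINVAL, and the driver can skip the interrupt setup without
treating GPIO 0 as a special "not connected" value.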

^ permalink raw reply

* Re: 3.2.0-rc3+ (Linus git - 883381d9f1c5a6329bbb796e23ae52c939940310) - INFO: suspicious RCU usage
From: Eric Dumazet @ 2011-11-30  6:05 UTC (permalink / raw)
  To: Miles Lane, David Miller
  Cc: LKML, Alexey Kuznetsov, Dipankar Sarma, Paul E. McKenney, netdev
In-Reply-To: <1322632602.23721.0.camel@edumazet-laptop>

On Wednesday, 30 November 2011 at 06:56 +0100, Eric Dumazet wrote:
> On Wednesday, 30 November 2011 at 00:09 -0500, Miles Lane wrote:
> > I got the following message while running "find /proc | xargs head" to
> > see whether accessing any proc files would cause trouble.
> > 
> > [  442.112632] [ INFO: suspicious RCU usage. ]
> > [  442.112634] -------------------------------
> > [  442.112637] include/net/dst.h:91 suspicious rcu_dereference_check() usage!
> > [  442.112640]
> > [  442.112641] other info that might help us debug this:
> > [  442.112643]
> > [  442.112645]
> > [  442.112646] rcu_scheduler_active = 1, debug_locks = 1
> > [  442.112650] 2 locks held by head/4903:
> > [  442.112652]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff810e160a>] seq_read+0x38/0x35d
> > [  442.112665]  #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff812f831e>] rcu_read_lock_bh+0x0/0x35
> > [  442.112676]
> > [  442.112677] stack backtrace:
> > [  442.112681] Pid: 4903, comm: head Not tainted 3.2.0-rc3+ #42
> > [  442.112684] Call Trace:
> > [  442.112691]  [<ffffffff8105cdd2>] lockdep_rcu_suspicious+0xaf/0xb8
> > [  442.112697]  [<ffffffff812f8e02>] dst_get_neighbour.isra.31+0x44/0x4c
> > [  442.112702]  [<ffffffff812f8e86>] rt_cache_seq_show+0x46/0x194
> > [  442.112708]  [<ffffffff812f8351>] ? rcu_read_lock_bh+0x33/0x35
> > [  442.112714]  [<ffffffff8104b57a>] ? rcu_read_lock_bh_held+0x9/0x37
> > [  442.112719]  [<ffffffff812f85c2>] ? rt_cache_get_first+0x72/0x116
> > [  442.112725]  [<ffffffff810e184d>] seq_read+0x27b/0x35d
> > [  442.112730]  [<ffffffff810e15d2>] ? seq_lseek+0xda/0xda
> > [  442.112736]  [<ffffffff8110e3e1>] proc_reg_read+0x8e/0xad
> > [  442.112741]  [<ffffffff810c6e23>] vfs_read+0xa0/0xc7
> > [  442.112746]  [<ffffffff810c80bf>] ? fget_light+0x35/0x98
> > [  442.112750]  [<ffffffff810c6e8f>] sys_read+0x45/0x69
> > [  442.112756]  [<ffffffff8136a27b>] system_call_fastpath+0x16/0x1b
> 
> 
> Thanks for the report Miles, I'll provide a patch in a couple of minutes
> after testing it
> 
> 

[PATCH] ipv4: fix lockdep splat in rt_cache_seq_show

After commit f2c31e32b378 (fix NULL dereferences in check_peer_redir()),
dst_get_neighbour() must be called inside an rcu_read_lock() /
rcu_read_unlock() section. rt_cache_seq_show() only holds
rcu_read_lock_bh(), which triggers the lockdep warning.

Reported-by: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/route.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5c28472..57e01bc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -417,9 +417,13 @@ static int rt_cache_seq_show(struct seq_file *seq, void *v)
 	else {
 		struct rtable *r = v;
 		struct neighbour *n;
-		int len;
+		int len, HHUptod;
 
+		rcu_read_lock();
 		n = dst_get_neighbour(&r->dst);
+		HHUptod = (n && (n->nud_state & NUD_CONNECTED)) ? 1 : 0;
+		rcu_read_unlock();
+
 		seq_printf(seq, "%s\t%08X\t%08X\t%8X\t%d\t%u\t%d\t"
 			      "%08X\t%d\t%u\t%u\t%02X\t%d\t%1d\t%08X%n",
 			r->dst.dev ? r->dst.dev->name : "*",
@@ -433,7 +437,7 @@ static int rt_cache_seq_show(struct seq_file *seq, void *v)
 			      dst_metric(&r->dst, RTAX_RTTVAR)),
 			r->rt_key_tos,
 			-1,
-			(n && (n->nud_state & NUD_CONNECTED)) ? 1 : 0,
+			HHUptod,
 			r->rt_spec_dst, &len);
 
 		seq_printf(seq, "%*s\n", 127 - len, "");
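
The pattern used by the patch (dereference the pointer only inside the
read-side section, and let only a plain integer escape it) can be modelled
in userspace as below. This is a sketch, not kernel code: the lock is
reduced to a depth counter, and every identifier prefixed with model_ or
MODEL_ is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag standing in for NUD_CONNECTED. */
#define MODEL_NUD_CONNECTED 0x01u

struct model_neigh {
	unsigned int nud_state;
};

/* Nesting depth of the modelled read-side critical section. */
static int read_depth;

static void model_read_lock(void)
{
	read_depth++;
}

static void model_read_unlock(void)
{
	assert(read_depth > 0);
	read_depth--;
}

/* Models dst_get_neighbour(): only legal inside the read-side section. */
static const struct model_neigh *model_get_neighbour(const struct model_neigh *slot)
{
	assert(read_depth > 0);
	return slot;
}

/* Take the snapshot under the lock; the pointer never leaves the section. */
static int snapshot_connected(const struct model_neigh *slot)
{
	const struct model_neigh *n;
	int connected;

	model_read_lock();
	n = model_get_neighbour(slot);
	connected = (n && (n->nud_state & MODEL_NUD_CONNECTED)) ? 1 : 0;
	model_read_unlock();

	return connected;
}
```

Copying the flag into a local before unlocking means the later formatting
call never touches the neighbour entry, so the critical section stays short.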

^ permalink raw reply related

* Re: [GIT PULL v2] Open vSwitch
From: Jesse Gross @ 2011-11-30  6:18 UTC (permalink / raw)
  To: Herbert Xu
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	jhs-jkUAjuhPggJWk0Htik3J/w, David Miller
In-Reply-To: <20111128130409.GB16828-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>

On Mon, Nov 28, 2011 at 5:04 AM, Herbert Xu <herbert-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org> wrote:
> On Wed, Nov 23, 2011 at 07:22:56AM -0500, jamal wrote:
>> I cant find one - you may. After staring at the code, I am also now
>> questioning if the existing bridge code couldnt have been re-used with
>> some small tweaks.
>
> I wasn't able to find any functionality that could not be easily
> done with the existing classifier/action code.
>
> Whether we want to go down this route though is open to debate
> as someone would have to actually implement this :)

Thanks for taking the time to go through the code, Herbert.  I think
this conversation has suffered from being a little vague and high
level, so it helps a lot to have more people who have looked at the
code.

The main part that worries me about moving to a different approach is
the impedance mismatch that occurs from the fact that Open vSwitch is
modeling a switch and tc is not.  As Jamal alluded to above, it's
actually the bridge code which is more conceptually similar.  In my
experience, combining two disparate models makes things harder over
the long run, not easier.  It also tends to show up more in some of
the edges like userspace/kernel compatibility.

What I'd like to do is start a clean conversation (this one is far too
long already) about what an Open vSwitch built using these components
would look like and really go into the details and design
implications.

^ permalink raw reply

* Re: [GIT PULL v2] Open vSwitch
From: Jesse Gross @ 2011-11-30  6:21 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Fischer, Anna, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	jhs-jkUAjuhPggJWk0Htik3J/w@public.gmane.org, David Miller
In-Reply-To: <20111128145157.GA17678-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>

On Mon, Nov 28, 2011 at 6:51 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> There are other issues with the hash implementation.  For example,
> there seems to be no limit on the number of collisions in each
> bucket.  As the hash table growth code simply continues when it
> fails to expand, this means that the number of collisions may
> rise without bound.

It's userspace which is managing the entries in the kernel hash table
and it has some intelligence about aging out entries (and specifically
about doing it more aggressively as the number of entries increases),
so it's not really unbounded.  In practice, userspace actually keeps
the number of entries much smaller than the maximum size of the table.

^ permalink raw reply

* Integration of Open vSwitch
From: Jesse Gross @ 2011-11-30  6:25 UTC (permalink / raw)
  To: Herbert Xu, jhs-jkUAjuhPggJWk0Htik3J/w
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, Chris Wright, Eric Dumazet, netdev,
	John Fastabend, Stephen Hemminger, David Miller

Hi Herbert and Jamal (and everyone else),

Sorry about starting yet another thread but the other one went in so
many directions that I think a lot of things got lost in it.  As I
mentioned before, I'd like to have a bit of a design discussion of
what it would look like if Open vSwitch were to use some of the
existing components (and really focus on just that).  There were a
number of suggestions made about using parts of the bridge, tc,
netfilter, etc. and some of them overlap or conflict so I don't quite
have a coherent solution in mind.  Would you guys mind walking through
what each of you envision it looking like?

Thanks,
Jesse

^ permalink raw reply
