Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 02/17] netfilter: add namespace support for l4proto
From: Pablo Neira Ayuso @ 2012-05-23 10:25 UTC (permalink / raw)
  To: Gao feng
  Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano,
	Gao feng
In-Reply-To: <1336985547-31960-3-git-send-email-gaofeng@cn.fujitsu.com>

On Mon, May 14, 2012 at 04:52:12PM +0800, Gao feng wrote:
> From: Gao feng <gaofeng@cn.fujitus.com>
> 
> -nf_ct_(un)register_sysctl are changed to support net namespace,
>  use (un)register_net_sysctl_table replaces (un)register_sysctl_paths.
>  and in nf_ct_unregister_sysctl,kfree table only when users is 0.
> 
> -Add the struct net as param of nf_conntrack_l4proto_(un)register.
>  register or unregister the l4proto only when the net is init_net.
> 
> -nf_conntrack_l4proto_register call init_net to initial the pernet
>  data of l4proto.
> 
> -nf_ct_l4proto_net is used to get the pernet data of l4proto.
> 
> -use init_net as a param of nf_conntrack_l4proto_(un)register.
> 
> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Gao feng <gaofeng@cn.fujitus.com>
> ---
>  include/net/netfilter/nf_conntrack_l4proto.h   |   13 +-
>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   18 +-
>  net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   18 +-
>  net/netfilter/nf_conntrack_proto.c             |  245 ++++++++++++++----------
>  net/netfilter/nf_conntrack_proto_dccp.c        |   10 +-
>  net/netfilter/nf_conntrack_proto_gre.c         |    6 +-
>  net/netfilter/nf_conntrack_proto_sctp.c        |   10 +-
>  net/netfilter/nf_conntrack_proto_udplite.c     |   10 +-
>  8 files changed, 191 insertions(+), 139 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
> index a90eab5..a93dcd5 100644
> --- a/include/net/netfilter/nf_conntrack_l4proto.h
> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
> @@ -12,7 +12,7 @@
>  #include <linux/netlink.h>
>  #include <net/netlink.h>
>  #include <net/netfilter/nf_conntrack.h>
> -
> +#include <net/netns/generic.h>

Minor nitpick: make sure there's still one line between this structure
below and the include headers.

>  struct seq_file;
>  
>  struct nf_conntrack_l4proto {
> @@ -129,8 +129,15 @@ nf_ct_l4proto_find_get(u_int16_t l3proto, u_int8_t l4proto);
>  extern void nf_ct_l4proto_put(struct nf_conntrack_l4proto *p);
>  
>  /* Protocol registration. */
> -extern int nf_conntrack_l4proto_register(struct nf_conntrack_l4proto *proto);
> -extern void nf_conntrack_l4proto_unregister(struct nf_conntrack_l4proto *proto);
> +extern int nf_conntrack_l4proto_register(struct net *net,
> +					 struct nf_conntrack_l4proto *proto);
> +extern void nf_conntrack_l4proto_unregister(struct net *net,
> +					    struct nf_conntrack_l4proto *proto);
> +
> +extern int nf_ct_l4proto_register_sysctl(struct net *net,
> +					 struct nf_conntrack_l4proto *l4proto);
> +extern void nf_ct_l4proto_unregister_sysctl(struct net *net,
> +					    struct nf_conntrack_l4proto *l4proto);
>  
>  /* Generic netlink helpers */
>  extern int nf_ct_port_tuple_to_nlattr(struct sk_buff *skb,
> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> index 91747d4..46ec515 100644
> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> @@ -391,19 +391,19 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>  		return ret;
>  	}
>  
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_tcp4);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_tcp4);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv4: can't register tcp.\n");
>  		goto cleanup_sockopt;
>  	}
>  
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_udp4);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_udp4);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv4: can't register udp.\n");
>  		goto cleanup_tcp;
>  	}
>  
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_icmp);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_icmp);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv4: can't register icmp.\n");
>  		goto cleanup_udp;
> @@ -434,11 +434,11 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>   cleanup_ipv4:
>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
>   cleanup_icmp:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmp);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
>   cleanup_udp:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp4);
>   cleanup_tcp:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp4);
>   cleanup_sockopt:
>  	nf_unregister_sockopt(&so_getorigdst);
>  	return ret;
> @@ -452,9 +452,9 @@ static void __exit nf_conntrack_l3proto_ipv4_fini(void)
>  #endif
>  	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmp);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp4);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp4);
>  	nf_unregister_sockopt(&so_getorigdst);
>  }
>  
> diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
> index fe925e4..55f379f 100644
> --- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
> +++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
> @@ -341,19 +341,19 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>  	need_conntrack();
>  	nf_defrag_ipv6_enable();
>  
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_tcp6);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_tcp6);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv6: can't register tcp.\n");
>  		return ret;
>  	}
>  
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_udp6);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_udp6);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv6: can't register udp.\n");
>  		goto cleanup_tcp;
>  	}
>  
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_icmpv6);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_icmpv6);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv6: can't register icmpv6.\n");
>  		goto cleanup_udp;
> @@ -377,11 +377,11 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>   cleanup_ipv6:
>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
>   cleanup_icmpv6:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmpv6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
>   cleanup_udp:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp6);
>   cleanup_tcp:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp6);
>  	return ret;
>  }
>  
> @@ -390,9 +390,9 @@ static void __exit nf_conntrack_l3proto_ipv6_fini(void)
>  	synchronize_net();
>  	nf_unregister_hooks(ipv6_conntrack_ops, ARRAY_SIZE(ipv6_conntrack_ops));
>  	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmpv6);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp6);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp6);
>  }
>  
>  module_init(nf_conntrack_l3proto_ipv6_init);
> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> index 8b631b0..6d68727 100644
> --- a/net/netfilter/nf_conntrack_proto.c
> +++ b/net/netfilter/nf_conntrack_proto.c
> @@ -35,30 +35,39 @@ EXPORT_SYMBOL_GPL(nf_ct_l3protos);
>  static DEFINE_MUTEX(nf_ct_proto_mutex);
>  
>  #ifdef CONFIG_SYSCTL
> -static int
> -nf_ct_register_sysctl(struct ctl_table_header **header, const char *path,
> -		      struct ctl_table *table, unsigned int *users)
> +int
> +nf_ct_register_sysctl(struct net *net,
> +		      struct ctl_table_header **header,
> +		      const char *path,
> +		      struct ctl_table *table,
> +		      unsigned int *users)
>  {
>  	if (*header == NULL) {
> -		*header = register_net_sysctl(&init_net, path, table);
> +		*header = register_net_sysctl(net, path, table);
>  		if (*header == NULL)
>  			return -ENOMEM;
>  	}
>  	if (users != NULL)
>  		(*users)++;
> +
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(nf_ct_register_sysctl);

I don't see why we need to export nf_ct_register_sysctl. I think this
is a left-over from the previous patchset.

> -static void
> +void
>  nf_ct_unregister_sysctl(struct ctl_table_header **header,
> -			struct ctl_table *table, unsigned int *users)
> +			struct ctl_table **table,
> +			unsigned int *users)
>  {
>  	if (users != NULL && --*users > 0)
>  		return;
>  
>  	unregister_net_sysctl_table(*header);
> +	kfree(*table);
>  	*header = NULL;
> +	*table = NULL;
>  }
> +EXPORT_SYMBOL_GPL(nf_ct_unregister_sysctl);

Same thing. I don't find any external user of this new exported
function in your entire patchset.

You have to fix this.

>  #endif
>  
>  struct nf_conntrack_l4proto *
> @@ -167,7 +176,8 @@ static int nf_ct_l3proto_register_sysctl(struct nf_conntrack_l3proto *l3proto)
>  
>  #ifdef CONFIG_SYSCTL
>  	if (l3proto->ctl_table != NULL) {
> -		err = nf_ct_register_sysctl(&l3proto->ctl_table_header,
> +		err = nf_ct_register_sysctl(&init_net,
> +					    &l3proto->ctl_table_header,
>  					    l3proto->ctl_table_path,
>  					    l3proto->ctl_table, NULL);
>  	}
> @@ -180,7 +190,7 @@ static void nf_ct_l3proto_unregister_sysctl(struct nf_conntrack_l3proto *l3proto
>  #ifdef CONFIG_SYSCTL
>  	if (l3proto->ctl_table_header != NULL)
>  		nf_ct_unregister_sysctl(&l3proto->ctl_table_header,
> -					l3proto->ctl_table, NULL);
> +					&l3proto->ctl_table, NULL);
>  #endif
>  }
>  
> @@ -243,137 +253,172 @@ void nf_conntrack_l3proto_unregister(struct nf_conntrack_l3proto *proto)
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_unregister);
>  
> -static int nf_ct_l4proto_register_sysctl(struct nf_conntrack_l4proto *l4proto)
> +static struct nf_proto_net *nf_ct_l4proto_net(struct net *net,
> +					      struct nf_conntrack_l4proto *l4proto)
>  {
> -	int err = 0;
> +	if (l4proto->net_id)
> +		return net_generic(net, *l4proto->net_id);
> +	else
> +		return NULL;
> +}
>  
> +int nf_ct_l4proto_register_sysctl(struct net *net,
> +				  struct nf_conntrack_l4proto *l4proto)
> +{
> +	int err = 0;
> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net, l4proto);
> +	if (pn == NULL)
> +		return 0;
>  #ifdef CONFIG_SYSCTL
> -	if (l4proto->ctl_table != NULL) {
> -		err = nf_ct_register_sysctl(l4proto->ctl_table_header,
> +	if (pn->ctl_table != NULL) {
> +		err = nf_ct_register_sysctl(net,
> +					    &pn->ctl_table_header,
>  					    "net/netfilter",
> -					    l4proto->ctl_table,
> -					    l4proto->ctl_table_users);
> -		if (err < 0)
> +					    pn->ctl_table,
> +					    &pn->users);
> +		if (err < 0) {
> +			kfree(pn->ctl_table);
> +			pn->ctl_table = NULL;
                               ^^^^^^^^^^^
Do you really need to set this above to NULL? Is there any existing
bug trap? If not, it's superfluous, please, remove it.

>  			goto out;
> +		}
>  	}
>  #ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	if (l4proto->ctl_compat_table != NULL) {
> -		err = nf_ct_register_sysctl(&l4proto->ctl_compat_table_header,
> +	if (l4proto->compat && pn->ctl_compat_table != NULL) {
> +		err = nf_ct_register_sysctl(net,
> +					    &pn->ctl_compat_header,
>  					    "net/ipv4/netfilter",
> -					    l4proto->ctl_compat_table, NULL);
> +					    pn->ctl_compat_table,
> +					    NULL);
>  		if (err == 0)
>  			goto out;
> -		nf_ct_unregister_sysctl(l4proto->ctl_table_header,
> -					l4proto->ctl_table,
> -					l4proto->ctl_table_users);
> +
> +		kfree(pn->ctl_compat_table);
> +		pn->ctl_compat_table = NULL;
> +		nf_ct_unregister_sysctl(&pn->ctl_table_header,
> +					&pn->ctl_table,
> +					&pn->users);
>  	}
>  #endif /* CONFIG_NF_CONNTRACK_PROC_COMPAT */
>  out:
>  #endif /* CONFIG_SYSCTL */
>  	return err;
>  }
> +EXPORT_SYMBOL_GPL(nf_ct_l4proto_register_sysctl);
>  
> -static void nf_ct_l4proto_unregister_sysctl(struct nf_conntrack_l4proto *l4proto)
> +void nf_ct_l4proto_unregister_sysctl(struct net *net,
> +				     struct nf_conntrack_l4proto *l4proto)
>  {
> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net, l4proto);
> +	if (pn == NULL)
> +		return;
>  #ifdef CONFIG_SYSCTL
> -	if (l4proto->ctl_table_header != NULL &&
> -	    *l4proto->ctl_table_header != NULL)
> -		nf_ct_unregister_sysctl(l4proto->ctl_table_header,
> -					l4proto->ctl_table,
> -					l4proto->ctl_table_users);
> +	if (pn->ctl_table_header != NULL)
> +		nf_ct_unregister_sysctl(&pn->ctl_table_header,
> +					&pn->ctl_table,
> +					&pn->users);
> +
>  #ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	if (l4proto->ctl_compat_table_header != NULL)
> -		nf_ct_unregister_sysctl(&l4proto->ctl_compat_table_header,
> -					l4proto->ctl_compat_table, NULL);
> +	if (l4proto->compat && pn->ctl_compat_header != NULL)
> +		nf_ct_unregister_sysctl(&pn->ctl_compat_header,
> +					&pn->ctl_compat_table,
> +					NULL);
>  #endif /* CONFIG_NF_CONNTRACK_PROC_COMPAT */
> +#else
> +	pn->users--;
>  #endif /* CONFIG_SYSCTL */
>  }
> +EXPORT_SYMBOL_GPL(nf_ct_l4proto_unregister_sysctl);
>  
>  /* FIXME: Allow NULL functions and sub in pointers to generic for
>     them. --RR */
> -int nf_conntrack_l4proto_register(struct nf_conntrack_l4proto *l4proto)
> +int nf_conntrack_l4proto_register(struct net *net,
> +				  struct nf_conntrack_l4proto *l4proto)
>  {
>  	int ret = 0;

Minor nitpick: you save this amount of edits in this function that
result from the extra tabbing by moving all ...

if (net == &init_net) {
    ... this code ...
}

into some new static int nf_conntrack_l4proto_register_net(...) that
will be called by nf_conntrack_l4proto_register.

It will result more maintainable code. We still stick to 80-chars
columns, saving that extra tabbing makes the code more readable.

>  
> -	if (l4proto->l3proto >= PF_MAX)
> -		return -EBUSY;
> -
> -	if ((l4proto->to_nlattr && !l4proto->nlattr_size)
> -		|| (l4proto->tuple_to_nlattr && !l4proto->nlattr_tuple_size))
> -		return -EINVAL;
> -
> -	mutex_lock(&nf_ct_proto_mutex);
> -	if (!nf_ct_protos[l4proto->l3proto]) {
> -		/* l3proto may be loaded latter. */
> -		struct nf_conntrack_l4proto __rcu **proto_array;
> -		int i;
> -
> -		proto_array = kmalloc(MAX_NF_CT_PROTO *
> -				      sizeof(struct nf_conntrack_l4proto *),
> -				      GFP_KERNEL);
> -		if (proto_array == NULL) {
> -			ret = -ENOMEM;
> +	if (net == &init_net) {
> +		if (l4proto->l3proto >= PF_MAX)
> +			return -EBUSY;
> +
> +		if ((l4proto->to_nlattr && !l4proto->nlattr_size)
> +			|| (l4proto->tuple_to_nlattr && !l4proto->nlattr_tuple_size))
> +			return -EINVAL;
> +
> +		mutex_lock(&nf_ct_proto_mutex);
> +		if (!nf_ct_protos[l4proto->l3proto]) {
> +			/* l3proto may be loaded latter. */
> +			struct nf_conntrack_l4proto __rcu **proto_array;
> +			int i;
> +
> +			proto_array = kmalloc(MAX_NF_CT_PROTO *
> +					      sizeof(struct nf_conntrack_l4proto *),
> +					      GFP_KERNEL);
> +			if (proto_array == NULL) {
> +				ret = -ENOMEM;
> +				goto out_unlock;
> +			}
> +
> +			for (i = 0; i < MAX_NF_CT_PROTO; i++)
> +				RCU_INIT_POINTER(proto_array[i], &nf_conntrack_l4proto_generic);
> +
> +			/* Before making proto_array visible to lockless readers,
> +			 * we must make sure its content is committed to memory.
> +			 */
> +			smp_wmb();
> +
> +			nf_ct_protos[l4proto->l3proto] = proto_array;
> +		} else if (rcu_dereference_protected(
> +				nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> +				lockdep_is_held(&nf_ct_proto_mutex)
> +				) != &nf_conntrack_l4proto_generic) {
> +			ret = -EBUSY;
>  			goto out_unlock;
>  		}
>  
> -		for (i = 0; i < MAX_NF_CT_PROTO; i++)
> -			RCU_INIT_POINTER(proto_array[i], &nf_conntrack_l4proto_generic);
> -
> -		/* Before making proto_array visible to lockless readers,
> -		 * we must make sure its content is committed to memory.
> -		 */
> -		smp_wmb();
> -
> -		nf_ct_protos[l4proto->l3proto] = proto_array;
> -	} else if (rcu_dereference_protected(
> -			nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> -			lockdep_is_held(&nf_ct_proto_mutex)
> -			) != &nf_conntrack_l4proto_generic) {
> -		ret = -EBUSY;
> -		goto out_unlock;
> -	}
> -
> -	ret = nf_ct_l4proto_register_sysctl(l4proto);
> -	if (ret < 0)
> -		goto out_unlock;
> -
> -	l4proto->nla_size = 0;
> -	if (l4proto->nlattr_size)
> -		l4proto->nla_size += l4proto->nlattr_size();
> -	if (l4proto->nlattr_tuple_size)
> -		l4proto->nla_size += 3 * l4proto->nlattr_tuple_size();
> -
> -	rcu_assign_pointer(nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> -			   l4proto);
> +		l4proto->nla_size = 0;
> +		if (l4proto->nlattr_size)
> +			l4proto->nla_size += l4proto->nlattr_size();
> +		if (l4proto->nlattr_tuple_size)
> +			l4proto->nla_size += 3 * l4proto->nlattr_tuple_size();
>  
> +		rcu_assign_pointer(nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> +				   l4proto);
>  out_unlock:
> -	mutex_unlock(&nf_ct_proto_mutex);
> -	return ret;
> +		mutex_unlock(&nf_ct_proto_mutex);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	if (l4proto->init_net) {
> +		ret = l4proto->init_net(net, l4proto->compat);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	return nf_ct_l4proto_register_sysctl(net, l4proto);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_register);
>  
> -void nf_conntrack_l4proto_unregister(struct nf_conntrack_l4proto *l4proto)
> +void nf_conntrack_l4proto_unregister(struct net *net,
> +				     struct nf_conntrack_l4proto *l4proto)
>  {
> -	struct net *net;
> -
> -	BUG_ON(l4proto->l3proto >= PF_MAX);
> -
> -	mutex_lock(&nf_ct_proto_mutex);
> -	BUG_ON(rcu_dereference_protected(
> -			nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> -			lockdep_is_held(&nf_ct_proto_mutex)
> -			) != l4proto);
> -	rcu_assign_pointer(nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> -			   &nf_conntrack_l4proto_generic);
> -	nf_ct_l4proto_unregister_sysctl(l4proto);
> -	mutex_unlock(&nf_ct_proto_mutex);
> -
> -	synchronize_rcu();
> +	if (net == &init_net) {

Same thing as above here.

> +		BUG_ON(l4proto->l3proto >= PF_MAX);
> +		mutex_lock(&nf_ct_proto_mutex);
> +
> +		BUG_ON(rcu_dereference_protected(
> +				nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> +				lockdep_is_held(&nf_ct_proto_mutex)
> +				) != l4proto);
> +		rcu_assign_pointer(nf_ct_protos[l4proto->l3proto][l4proto->l4proto],
> +				   &nf_conntrack_l4proto_generic);
> +		mutex_unlock(&nf_ct_proto_mutex);
> +
> +		synchronize_rcu();
> +	}
> +	nf_ct_l4proto_unregister_sysctl(net, l4proto);
>  
>  	/* Remove all contrack entries for this protocol */
>  	rtnl_lock();
> -	for_each_net(net)
> -		nf_ct_iterate_cleanup(net, kill_l4proto, l4proto);
> +	nf_ct_iterate_cleanup(net, kill_l4proto, l4proto);
>  	rtnl_unlock();
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_unregister);
> @@ -383,7 +428,7 @@ int nf_conntrack_proto_init(void)
>  	unsigned int i;
>  	int err;
>  
> -	err = nf_ct_l4proto_register_sysctl(&nf_conntrack_l4proto_generic);
> +	err = nf_ct_l4proto_register_sysctl(&init_net, &nf_conntrack_l4proto_generic);
>  	if (err < 0)
>  		return err;
>  
> @@ -397,7 +442,7 @@ void nf_conntrack_proto_fini(void)
>  {
>  	unsigned int i;
>  
> -	nf_ct_l4proto_unregister_sysctl(&nf_conntrack_l4proto_generic);
> +	nf_ct_l4proto_unregister_sysctl(&init_net, &nf_conntrack_l4proto_generic);
>  
>  	/* free l3proto protocol tables */
>  	for (i = 0; i < PF_MAX; i++)
> diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
> index ef706a4..5a8e037 100644
> --- a/net/netfilter/nf_conntrack_proto_dccp.c
> +++ b/net/netfilter/nf_conntrack_proto_dccp.c
> @@ -945,17 +945,17 @@ static int __init nf_conntrack_proto_dccp_init(void)
>  	if (err < 0)
>  		goto err1;
>  
> -	err = nf_conntrack_l4proto_register(&dccp_proto4);
> +	err = nf_conntrack_l4proto_register(&init_net, &dccp_proto4);
>  	if (err < 0)
>  		goto err2;
>  
> -	err = nf_conntrack_l4proto_register(&dccp_proto6);
> +	err = nf_conntrack_l4proto_register(&init_net, &dccp_proto6);
>  	if (err < 0)
>  		goto err3;
>  	return 0;
>  
>  err3:
> -	nf_conntrack_l4proto_unregister(&dccp_proto4);
> +	nf_conntrack_l4proto_unregister(&init_net, &dccp_proto4);
>  err2:
>  	unregister_pernet_subsys(&dccp_net_ops);
>  err1:
> @@ -965,8 +965,8 @@ err1:
>  static void __exit nf_conntrack_proto_dccp_fini(void)
>  {
>  	unregister_pernet_subsys(&dccp_net_ops);
> -	nf_conntrack_l4proto_unregister(&dccp_proto6);
> -	nf_conntrack_l4proto_unregister(&dccp_proto4);
> +	nf_conntrack_l4proto_unregister(&init_net, &dccp_proto6);
> +	nf_conntrack_l4proto_unregister(&init_net, &dccp_proto4);
>  }
>  
>  module_init(nf_conntrack_proto_dccp_init);
> diff --git a/net/netfilter/nf_conntrack_proto_gre.c b/net/netfilter/nf_conntrack_proto_gre.c
> index 4bf6b4e..132f0d2 100644
> --- a/net/netfilter/nf_conntrack_proto_gre.c
> +++ b/net/netfilter/nf_conntrack_proto_gre.c
> @@ -396,18 +396,18 @@ static int __init nf_ct_proto_gre_init(void)
>  {
>  	int rv;
>  
> -	rv = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_gre4);
> +	rv = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_gre4);
>  	if (rv < 0)
>  		return rv;
>  	rv = register_pernet_subsys(&proto_gre_net_ops);
>  	if (rv < 0)
> -		nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_gre4);
> +		nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_gre4);
>  	return rv;
>  }
>  
>  static void __exit nf_ct_proto_gre_fini(void)
>  {
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_gre4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_gre4);
>  	unregister_pernet_subsys(&proto_gre_net_ops);
>  }
>  
> diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
> index 996db2f..97bbc20 100644
> --- a/net/netfilter/nf_conntrack_proto_sctp.c
> +++ b/net/netfilter/nf_conntrack_proto_sctp.c
> @@ -791,12 +791,12 @@ static int __init nf_conntrack_proto_sctp_init(void)
>  {
>  	int ret;
>  
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_sctp4);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_sctp4);
>  	if (ret) {
>  		pr_err("nf_conntrack_l4proto_sctp4: protocol register failed\n");
>  		goto out;
>  	}
> -	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_sctp6);
> +	ret = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_sctp6);
>  	if (ret) {
>  		pr_err("nf_conntrack_l4proto_sctp6: protocol register failed\n");
>  		goto cleanup_sctp4;
> @@ -805,15 +805,15 @@ static int __init nf_conntrack_proto_sctp_init(void)
>  	return ret;
>  
>   cleanup_sctp4:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_sctp4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_sctp4);
>   out:
>  	return ret;
>  }
>  
>  static void __exit nf_conntrack_proto_sctp_fini(void)
>  {
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_sctp6);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_sctp4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_sctp6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_sctp4);
>  }
>  
>  module_init(nf_conntrack_proto_sctp_init);
> diff --git a/net/netfilter/nf_conntrack_proto_udplite.c b/net/netfilter/nf_conntrack_proto_udplite.c
> index 4d60a53..fa142a8 100644
> --- a/net/netfilter/nf_conntrack_proto_udplite.c
> +++ b/net/netfilter/nf_conntrack_proto_udplite.c
> @@ -299,23 +299,23 @@ static int __init nf_conntrack_proto_udplite_init(void)
>  {
>  	int err;
>  
> -	err = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_udplite4);
> +	err = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_udplite4);
>  	if (err < 0)
>  		goto err1;
> -	err = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_udplite6);
> +	err = nf_conntrack_l4proto_register(&init_net, &nf_conntrack_l4proto_udplite6);
>  	if (err < 0)
>  		goto err2;
>  	return 0;
>  err2:
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udplite4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udplite4);
>  err1:
>  	return err;
>  }
>  
>  static void __exit nf_conntrack_proto_udplite_exit(void)
>  {
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udplite6);
> -	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udplite4);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udplite6);
> +	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udplite4);
>  }
>  
>  module_init(nf_conntrack_proto_udplite_init);
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 03/17] netfilter: add namespace support for l3proto
From: Pablo Neira Ayuso @ 2012-05-23 10:29 UTC (permalink / raw)
  To: Gao feng; +Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <1336985547-31960-4-git-send-email-gaofeng@cn.fujitsu.com>

On Mon, May 14, 2012 at 04:52:13PM +0800, Gao feng wrote:
> -Add the struct net as param of nf_conntrack_l3proto_(un)register.
>  register or unregister the l3proto only when the net is init_net.
> 
> -The new struct nf_ip_net is used to store the sysctl header and data
>  of l3proto_ipv4,l4proto_tcp(6),l4proto_udp(6),l4proto_icmp(v6).
>  because the protos such tcp and tcp6 use the same data,so making
>  nf_ip_net as a field of netns_ct is the easiest way to manager it.
> 
> -nf_ct_l3proto_register_sysctl call init_net to initial the pernet data
>  of l3proto.
> 
> -nf_ct_l3proto_net is used to get the pernet data of l3proto.
> 
> -export nf_conntrack_l3proto_(un)register
> 
> -use init_net as param of nf_conntrack_l3proto_(un)register.
> 
> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> ---
>  include/net/netfilter/nf_conntrack_l3proto.h   |    6 +-
>  include/net/netns/conntrack.h                  |    8 ++
>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |    6 +-
>  net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |    6 +-
>  net/netfilter/nf_conntrack_proto.c             |  127 +++++++++++++++---------
>  5 files changed, 97 insertions(+), 56 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
> index 9766005..d6df8c7 100644
> --- a/include/net/netfilter/nf_conntrack_l3proto.h
> +++ b/include/net/netfilter/nf_conntrack_l3proto.h
> @@ -79,8 +79,10 @@ struct nf_conntrack_l3proto {
>  extern struct nf_conntrack_l3proto __rcu *nf_ct_l3protos[AF_MAX];
>  
>  /* Protocol registration. */
> -extern int nf_conntrack_l3proto_register(struct nf_conntrack_l3proto *proto);
> -extern void nf_conntrack_l3proto_unregister(struct nf_conntrack_l3proto *proto);
> +extern int nf_conntrack_l3proto_register(struct net *net,
> +					 struct nf_conntrack_l3proto *proto);
> +extern void nf_conntrack_l3proto_unregister(struct net *net,
> +					    struct nf_conntrack_l3proto *proto);
>  extern struct nf_conntrack_l3proto *nf_ct_l3proto_find_get(u_int16_t l3proto);
>  extern void nf_ct_l3proto_put(struct nf_conntrack_l3proto *p);
>  
> diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
> index 1f53038..94992e9 100644
> --- a/include/net/netns/conntrack.h
> +++ b/include/net/netns/conntrack.h
> @@ -20,6 +20,13 @@ struct nf_proto_net {
>  	unsigned int		users;
>  };
>  
> +struct nf_ip_net {
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
> +	struct ctl_table_header *ctl_table_header;
> +	struct ctl_table	*ctl_table;
> +#endif
> +};
> +
>  struct netns_ct {
>  	atomic_t		count;
>  	unsigned int		expect_count;
> @@ -40,6 +47,7 @@ struct netns_ct {
>  	unsigned int		sysctl_log_invalid; /* Log invalid packets */
>  	int			sysctl_auto_assign_helper;
>  	bool			auto_assign_helper_warned;
> +	struct nf_ip_net	proto;
                                ^^^^^
please, rename this to something like nf_ct_proto.

>  #ifdef CONFIG_SYSCTL
>  	struct ctl_table_header	*sysctl_header;
>  	struct ctl_table_header	*acct_sysctl_header;
> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> index 46ec515..0c0fb90 100644
> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> @@ -409,7 +409,7 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>  		goto cleanup_udp;
>  	}
>  
> -	ret = nf_conntrack_l3proto_register(&nf_conntrack_l3proto_ipv4);
> +	ret = nf_conntrack_l3proto_register(&init_net, &nf_conntrack_l3proto_ipv4);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv4: can't register ipv4\n");
>  		goto cleanup_icmp;
> @@ -432,7 +432,7 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
>  	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
>  #endif
>   cleanup_ipv4:
> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv4);
>   cleanup_icmp:
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
>   cleanup_udp:
> @@ -451,7 +451,7 @@ static void __exit nf_conntrack_l3proto_ipv4_fini(void)
>  	nf_conntrack_ipv4_compat_fini();
>  #endif
>  	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv4);
> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv4);
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmp);
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp4);
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp4);
> diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
> index 55f379f..6cfbe7b 100644
> --- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
> +++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
> @@ -359,7 +359,7 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>  		goto cleanup_udp;
>  	}
>  
> -	ret = nf_conntrack_l3proto_register(&nf_conntrack_l3proto_ipv6);
> +	ret = nf_conntrack_l3proto_register(&init_net, &nf_conntrack_l3proto_ipv6);
>  	if (ret < 0) {
>  		pr_err("nf_conntrack_ipv6: can't register ipv6\n");
>  		goto cleanup_icmpv6;
> @@ -375,7 +375,7 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
>  	return ret;
>  
>   cleanup_ipv6:
> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv6);
>   cleanup_icmpv6:
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
>   cleanup_udp:
> @@ -389,7 +389,7 @@ static void __exit nf_conntrack_l3proto_ipv6_fini(void)
>  {
>  	synchronize_net();
>  	nf_unregister_hooks(ipv6_conntrack_ops, ARRAY_SIZE(ipv6_conntrack_ops));
> -	nf_conntrack_l3proto_unregister(&nf_conntrack_l3proto_ipv6);
> +	nf_conntrack_l3proto_unregister(&init_net, &nf_conntrack_l3proto_ipv6);
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_icmpv6);
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_udp6);
>  	nf_conntrack_l4proto_unregister(&init_net, &nf_conntrack_l4proto_tcp6);
> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> index 6d68727..7ee6653 100644
> --- a/net/netfilter/nf_conntrack_proto.c
> +++ b/net/netfilter/nf_conntrack_proto.c
> @@ -170,85 +170,116 @@ static int kill_l4proto(struct nf_conn *i, void *data)
>  	       nf_ct_l3num(i) == l4proto->l3proto;
>  }
>  
> -static int nf_ct_l3proto_register_sysctl(struct nf_conntrack_l3proto *l3proto)
> +static struct nf_ip_net *nf_ct_l3proto_net(struct net *net,
> +					   struct nf_conntrack_l3proto *l3proto)
> +{
> +	if (l3proto->l3proto == PF_INET)
> +		return &net->ct.proto;
> +	else
> +		return NULL;
> +}
> +
> +static int nf_ct_l3proto_register_sysctl(struct net *net,
> +					 struct nf_conntrack_l3proto *l3proto)
>  {
>  	int err = 0;
> +	struct nf_ip_net *in = nf_ct_l3proto_net(net, l3proto);
>  
> -#ifdef CONFIG_SYSCTL
> -	if (l3proto->ctl_table != NULL) {
> -		err = nf_ct_register_sysctl(&init_net,
> -					    &l3proto->ctl_table_header,
> +	if (in == NULL)
> +		return 0;

Under what circunstances that in be NULL?

> +
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
> +	if (in->ctl_table != NULL) {
> +		err = nf_ct_register_sysctl(net,
> +					    &in->ctl_table_header,
>  					    l3proto->ctl_table_path,
> -					    l3proto->ctl_table, NULL);
> +					    in->ctl_table,
> +					    NULL);
> +		if (err < 0) {
> +			kfree(in->ctl_table);
> +			in->ctl_table = NULL;

do we need this extra NULL assignment?

> +		}
>  	}
>  #endif
>  	return err;
>  }
>  
> -static void nf_ct_l3proto_unregister_sysctl(struct nf_conntrack_l3proto *l3proto)
> +static void nf_ct_l3proto_unregister_sysctl(struct net *net,
> +					    struct nf_conntrack_l3proto *l3proto)
>  {
> -#ifdef CONFIG_SYSCTL
> -	if (l3proto->ctl_table_header != NULL)
> -		nf_ct_unregister_sysctl(&l3proto->ctl_table_header,
> -					&l3proto->ctl_table, NULL);
> +	struct nf_ip_net *in = nf_ct_l3proto_net(net, l3proto);
> +
> +	if (in == NULL)
> +		return;
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
> +	if (in->ctl_table_header != NULL)
> +		nf_ct_unregister_sysctl(&in->ctl_table_header,
> +					&in->ctl_table,
> +					NULL);
>  #endif
>  }
>  
> -int nf_conntrack_l3proto_register(struct nf_conntrack_l3proto *proto)
> +int nf_conntrack_l3proto_register(struct net *net,
> +				  struct nf_conntrack_l3proto *proto)
>  {
>  	int ret = 0;
> -	struct nf_conntrack_l3proto *old;
> -
> -	if (proto->l3proto >= AF_MAX)
> -		return -EBUSY;
>  
> -	if (proto->tuple_to_nlattr && !proto->nlattr_tuple_size)
> -		return -EINVAL;
> +	if (net == &init_net) {

Same things as in previous patch. Move...

if (net == &init_net) {
     ... this code ...
}

into some static int nf_conntrack_l3proto_register_net function.

> +		struct nf_conntrack_l3proto *old;
>  
> -	mutex_lock(&nf_ct_proto_mutex);
> -	old = rcu_dereference_protected(nf_ct_l3protos[proto->l3proto],
> -					lockdep_is_held(&nf_ct_proto_mutex));
> -	if (old != &nf_conntrack_l3proto_generic) {
> -		ret = -EBUSY;
> -		goto out_unlock;
> -	}
> +		if (proto->l3proto >= AF_MAX)
> +			return -EBUSY;
>  
> -	ret = nf_ct_l3proto_register_sysctl(proto);
> -	if (ret < 0)
> -		goto out_unlock;
> +		if (proto->tuple_to_nlattr && !proto->nlattr_tuple_size)
> +			return -EINVAL;
>  
> -	if (proto->nlattr_tuple_size)
> -		proto->nla_size = 3 * proto->nlattr_tuple_size();
> +		mutex_lock(&nf_ct_proto_mutex);
> +		old = rcu_dereference_protected(nf_ct_l3protos[proto->l3proto],
> +						lockdep_is_held(&nf_ct_proto_mutex));
> +		if (old != &nf_conntrack_l3proto_generic) {
> +			ret = -EBUSY;
> +			goto out_unlock;
> +		}
>  
> -	rcu_assign_pointer(nf_ct_l3protos[proto->l3proto], proto);
> +		if (proto->nlattr_tuple_size)
> +			proto->nla_size = 3 * proto->nlattr_tuple_size();
>  
> +		rcu_assign_pointer(nf_ct_l3protos[proto->l3proto], proto);
>  out_unlock:
> -	mutex_unlock(&nf_ct_proto_mutex);
> -	return ret;
> +		mutex_unlock(&nf_ct_proto_mutex);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	if (proto->init_net) {
> +		ret = proto->init_net(net);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	return nf_ct_l3proto_register_sysctl(net, proto);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_register);
>  
> -void nf_conntrack_l3proto_unregister(struct nf_conntrack_l3proto *proto)
> +void nf_conntrack_l3proto_unregister(struct net *net,
> +				     struct nf_conntrack_l3proto *proto)
>  {
> -	struct net *net;
> -
> -	BUG_ON(proto->l3proto >= AF_MAX);
> +	if (net == &init_net) {
> +		BUG_ON(proto->l3proto >= AF_MAX);

Same thing as above.

>  
> -	mutex_lock(&nf_ct_proto_mutex);
> -	BUG_ON(rcu_dereference_protected(nf_ct_l3protos[proto->l3proto],
> -					 lockdep_is_held(&nf_ct_proto_mutex)
> -					 ) != proto);
> -	rcu_assign_pointer(nf_ct_l3protos[proto->l3proto],
> -			   &nf_conntrack_l3proto_generic);
> -	nf_ct_l3proto_unregister_sysctl(proto);
> -	mutex_unlock(&nf_ct_proto_mutex);
> +		mutex_lock(&nf_ct_proto_mutex);
> +		BUG_ON(rcu_dereference_protected(nf_ct_l3protos[proto->l3proto],
> +						 lockdep_is_held(&nf_ct_proto_mutex)
> +						 ) != proto);
> +		rcu_assign_pointer(nf_ct_l3protos[proto->l3proto],
> +				   &nf_conntrack_l3proto_generic);
> +		mutex_unlock(&nf_ct_proto_mutex);
>  
> -	synchronize_rcu();
> +		synchronize_rcu();
> +	}
> +	nf_ct_l3proto_unregister_sysctl(net, proto);
>  
>  	/* Remove all contrack entries for this protocol */
>  	rtnl_lock();
> -	for_each_net(net)
> -		nf_ct_iterate_cleanup(net, kill_l3proto, proto);
> +	nf_ct_iterate_cleanup(net, kill_l3proto, proto);
>  	rtnl_unlock();
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_unregister);
> -- 
> 1.7.7.6
> 

^ permalink raw reply

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
From: Jason Wang @ 2012-05-23 10:31 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm,
	davem
In-Reply-To: <1337702145.11592.3.camel@oc3660625478.ibm.com>

On 05/22/2012 11:55 PM, Shirley Ma wrote:
> On Tue, 2012-05-22 at 18:05 +0800, Jason Wang wrote:
>> On 05/21/2012 11:42 PM, Shirley Ma wrote:
>>> On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
>>>>>> - tx polling depends on skb_orphan() which is often called by
>>>> device
>>>>>> driver when it place the packet into the queue of the devices
>>>> instead
>>>>>> of  when the packets were sent. So it was too early for vhost to
>> be
>>>>>> notified.
>>>>> Then do you think it's better to replace with vhost_poll_queue
>> here
>>>>> instead?
>>>> Just like what does this patch do - calling vhost_poll_queue() in
>>>> vhost_zerocopy_callback().
>>>>>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND,
>> it's
>>>>>> highly possible that guest needs to be notified when the pending
>>>>>> packets
>>>>>> isn't so much.
>>>>> In which situation the guest needs to be notified when there is no
>>>> TX
>>>>> besides buffers run out?
>>>> Consider guest call virtqueue_enable_cb_delayed() which means it
>> only
>>>> need to be notified when 3/4 of pending buffers ( about 178 buffers
>>>> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
>>>> notify
>>>> guest when about 60 buffers were pending. Since tx polling is only
>>>> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
>>>> would not be notified to run and guest would never get the
>> interrupt
>>>> it
>>>> expected to re-enable the queue.
>>> So it seems we still need vhost_enable_notify() in handle_tx when
>> there
>>> is no tx in zerocopy case.
>>>
>>> Do you know which one is more expensive: the cost of
>> vhost_poll_queue()
>>> in each zerocopy callback or calling vhost_enable_notify()?
>> Didn't follow here, do you mean vhost_signal() here?
> I meant removing the code in handle_tx for zerocopy as below:
>
> +	if (zcopy) {
>                          /* If more outstanding DMAs, queue the work.
>                           * Handle upend_idx wrap around
>                           */
>                          num_pends = likely(vq->upend_idx>= vq->done_idx) ?
>                                      (vq->upend_idx - vq->done_idx) :
>                                      (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
> +			/* zerocopy vhost_enable_notify is under zerocopy callback
> +			 * since it could be too early to notify here */
> +			break;
> +	}
> -                       if (unlikely(num_pends>  VHOST_MAX_PEND)) {
> -                                tx_poll_start(net, sock);
> -                                set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> -                                break;
> -                        }
>                          if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>                                  vhost_disable_notify(&net->dev, vq);
>                                  continue;
>                          }
>                          break;

Didn't think this can work well as the notification from guest were 
disabled forever.

>
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 04/17] netfilter: add namespace support for l4proto_generic
From: Pablo Neira Ayuso @ 2012-05-23 10:32 UTC (permalink / raw)
  To: Gao feng; +Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <1336985547-31960-5-git-send-email-gaofeng@cn.fujitsu.com>

On Mon, May 14, 2012 at 04:52:14PM +0800, Gao feng wrote:
> implement and export nf_conntrack_proto_generic_[init,fini],
> nf_conntrack_[init,cleanup]_net call them to register or unregister
> the sysctl of generic proto.
> 
> implement generic_net_init,it's used to initial the pernet
> data for generic proto.
> 
> and use nf_generic_net.timeout to replace nf_ct_generic_timeout in
> get_timeouts function.
> 
> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> ---
>  include/net/netfilter/nf_conntrack_l4proto.h |    2 +
>  include/net/netns/conntrack.h                |    6 +++
>  net/netfilter/nf_conntrack_core.c            |    8 +++-
>  net/netfilter/nf_conntrack_proto.c           |   21 +++++-----
>  net/netfilter/nf_conntrack_proto_generic.c   |   55 ++++++++++++++++++++++++-
>  5 files changed, 76 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
> index a93dcd5..0d329b9 100644
> --- a/include/net/netfilter/nf_conntrack_l4proto.h
> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
> @@ -118,6 +118,8 @@ struct nf_conntrack_l4proto {
>  
>  /* Existing built-in generic protocol */
>  extern struct nf_conntrack_l4proto nf_conntrack_l4proto_generic;
> +extern int nf_conntrack_proto_generic_init(struct net *net);
> +extern void nf_conntrack_proto_generic_fini(struct net *net);
>  
>  #define MAX_NF_CT_PROTO 256
>  
> diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
> index 94992e9..3381b80 100644
> --- a/include/net/netns/conntrack.h
> +++ b/include/net/netns/conntrack.h
> @@ -20,7 +20,13 @@ struct nf_proto_net {
>  	unsigned int		users;
>  };
>  
> +struct nf_generic_net {
> +	struct nf_proto_net pn;
> +	unsigned int timeout;
> +};
> +
>  struct nf_ip_net {
> +	struct nf_generic_net   generic;
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
>  	struct ctl_table_header *ctl_table_header;
>  	struct ctl_table	*ctl_table;
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 32c5909..fd33e91 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1353,6 +1353,7 @@ static void nf_conntrack_cleanup_net(struct net *net)
>  	}
>  
>  	nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
> +	nf_conntrack_proto_generic_fini(net);
>  	nf_conntrack_helper_fini(net);
>  	nf_conntrack_timeout_fini(net);
>  	nf_conntrack_ecache_fini(net);
> @@ -1586,9 +1587,12 @@ static int nf_conntrack_init_net(struct net *net)
>  	ret = nf_conntrack_helper_init(net);
>  	if (ret < 0)
>  		goto err_helper;
> -
> +	ret = nf_conntrack_proto_generic_init(net);
> +	if (ret < 0)
> +		goto err_generic;
>  	return 0;
> -
> +err_generic:
> +	nf_conntrack_helper_fini(net);
>  err_helper:
>  	nf_conntrack_timeout_fini(net);
>  err_timeout:
> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> index 7ee6653..9b4bf6d 100644
> --- a/net/netfilter/nf_conntrack_proto.c
> +++ b/net/netfilter/nf_conntrack_proto.c
> @@ -287,10 +287,16 @@ EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_unregister);
>  static struct nf_proto_net *nf_ct_l4proto_net(struct net *net,
>  					      struct nf_conntrack_l4proto *l4proto)
>  {
> -	if (l4proto->net_id)
> -		return net_generic(net, *l4proto->net_id);
> -	else
> -		return NULL;
> +	switch (l4proto->l4proto) {
> +	case 255: /* l4proto_generic */
> +		return (struct nf_proto_net *)&net->ct.proto.generic;
> +	default:
> +		if (l4proto->net_id)
> +			return net_generic(net, *l4proto->net_id);
> +		else
> +			return NULL;
> +	}
> +	return NULL;
>  }
>  
>  int nf_ct_l4proto_register_sysctl(struct net *net,
> @@ -457,11 +463,6 @@ EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_unregister);
>  int nf_conntrack_proto_init(void)
>  {
>  	unsigned int i;
> -	int err;
> -
> -	err = nf_ct_l4proto_register_sysctl(&init_net, &nf_conntrack_l4proto_generic);
> -	if (err < 0)
> -		return err;

I like that all protocols sysctl are registered by
nf_conntrack_proto_init. Can you keep using that?

>  	for (i = 0; i < AF_MAX; i++)
>  		rcu_assign_pointer(nf_ct_l3protos[i],
> @@ -473,8 +474,6 @@ void nf_conntrack_proto_fini(void)
>  {
>  	unsigned int i;
>  
> -	nf_ct_l4proto_unregister_sysctl(&init_net, &nf_conntrack_l4proto_generic);
> -
>  	/* free l3proto protocol tables */
>  	for (i = 0; i < PF_MAX; i++)
>  		kfree(nf_ct_protos[i]);
> diff --git a/net/netfilter/nf_conntrack_proto_generic.c b/net/netfilter/nf_conntrack_proto_generic.c
> index d8923d5..7976a64 100644
> --- a/net/netfilter/nf_conntrack_proto_generic.c
> +++ b/net/netfilter/nf_conntrack_proto_generic.c
> @@ -14,6 +14,11 @@
>  
>  static unsigned int nf_ct_generic_timeout __read_mostly = 600*HZ;
>  
> +static inline struct nf_generic_net *generic_pernet(struct net *net)
> +{
> +	return &net->ct.proto.generic;
> +}
> +
>  static bool generic_pkt_to_tuple(const struct sk_buff *skb,
>  				 unsigned int dataoff,
>  				 struct nf_conntrack_tuple *tuple)
> @@ -42,7 +47,7 @@ static int generic_print_tuple(struct seq_file *s,
>  
>  static unsigned int *generic_get_timeouts(struct net *net)
>  {
> -	return &nf_ct_generic_timeout;
> +	return &(generic_pernet(net)->timeout);
>  }
>  
>  /* Returns verdict for packet, or -1 for invalid. */
> @@ -110,7 +115,6 @@ static struct ctl_table_header *generic_sysctl_header;
>  static struct ctl_table generic_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_generic_timeout",
> -		.data		= &nf_ct_generic_timeout,
>  		.maxlen		= sizeof(unsigned int),
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec_jiffies,
> @@ -121,7 +125,6 @@ static struct ctl_table generic_sysctl_table[] = {
>  static struct ctl_table generic_compat_sysctl_table[] = {
>  	{
>  		.procname	= "ip_conntrack_generic_timeout",
> -		.data		= &nf_ct_generic_timeout,
>  		.maxlen		= sizeof(unsigned int),
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec_jiffies,
> @@ -131,10 +134,39 @@ static struct ctl_table generic_compat_sysctl_table[] = {
>  #endif /* CONFIG_NF_CONNTRACK_PROC_COMPAT */
>  #endif /* CONFIG_SYSCTL */
>  
> +static int generic_init_net(struct net *net, u_int8_t compat)
> +{
> +	struct nf_generic_net *gn = generic_pernet(net);
> +	struct nf_proto_net *pn = (struct nf_proto_net *)gn;
> +	gn->timeout = nf_ct_generic_timeout;
> +#ifdef CONFIG_SYSCTL
> +	pn->ctl_table = kmemdup(generic_sysctl_table,
> +				sizeof(generic_sysctl_table),
> +				GFP_KERNEL);
> +	if (!pn->ctl_table)
> +		return -ENOMEM;
> +	pn->ctl_table[0].data = &gn->timeout;
> +
> +#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> +	pn->ctl_compat_table = kmemdup(generic_compat_sysctl_table,
> +				       sizeof(generic_compat_sysctl_table),
> +				       GFP_KERNEL);
> +	if (!pn->ctl_compat_table) {
> +		kfree(pn->ctl_table);
> +		pn->ctl_table = NULL;
> +		return -ENOMEM;
> +	}
> +	pn->ctl_compat_table[0].data = &gn->timeout;
> +#endif
> +#endif
> +	return 0;
> +}
> +
>  struct nf_conntrack_l4proto nf_conntrack_l4proto_generic __read_mostly =
>  {
>  	.l3proto		= PF_UNSPEC,
>  	.l4proto		= 255,
> +	.compat			= 1,
>  	.name			= "unknown",
>  	.pkt_to_tuple		= generic_pkt_to_tuple,
>  	.invert_tuple		= generic_invert_tuple,
> @@ -158,4 +190,21 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_generic __read_mostly =
>  	.ctl_compat_table	= generic_compat_sysctl_table,
>  #endif
>  #endif
> +	.init_net		= generic_init_net,
>  };
> +
> +int nf_conntrack_proto_generic_init(struct net *net)
> +{
> +	int ret = 0;
> +	ret = generic_init_net(net, nf_conntrack_l4proto_generic.compat);
> +	if (ret < 0)
> +		return ret;
> +	return nf_ct_l4proto_register_sysctl(net,
> +					     &nf_conntrack_l4proto_generic);
> +}
> +
> +void nf_conntrack_proto_generic_fini(struct net *net)
> +{
> +	nf_ct_l4proto_unregister_sysctl(net,
> +					&nf_conntrack_l4proto_generic);
> +}
> -- 
> 1.7.7.6
> 

^ permalink raw reply

* Re: [PATCH 15/17] netfilter: cleanup sysctl for l4proto and l3proto
From: Pablo Neira Ayuso @ 2012-05-23 10:38 UTC (permalink / raw)
  To: Gao feng; +Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <1336985547-31960-16-git-send-email-gaofeng@cn.fujitsu.com>

On Mon, May 14, 2012 at 04:52:25PM +0800, Gao feng wrote:
> delete no useless sysctl data for l4proto and l3proto.
> 
> Acked-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> ---
>  include/net/netfilter/nf_conntrack_l3proto.h   |    2 --
>  include/net/netfilter/nf_conntrack_l4proto.h   |   10 ----------
>  net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |    1 -
>  net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |    8 --------
>  net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    5 -----
>  net/netfilter/nf_conntrack_proto_generic.c     |    8 --------
>  net/netfilter/nf_conntrack_proto_sctp.c        |   15 ---------------
>  net/netfilter/nf_conntrack_proto_tcp.c         |   15 ---------------
>  net/netfilter/nf_conntrack_proto_udp.c         |   15 ---------------
>  net/netfilter/nf_conntrack_proto_udplite.c     |   12 ------------
>  10 files changed, 0 insertions(+), 91 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
> index d6df8c7..6f7c13f 100644
> --- a/include/net/netfilter/nf_conntrack_l3proto.h
> +++ b/include/net/netfilter/nf_conntrack_l3proto.h
> @@ -64,9 +64,7 @@ struct nf_conntrack_l3proto {
>  	size_t nla_size;
>  
>  #ifdef CONFIG_SYSCTL
> -	struct ctl_table_header	*ctl_table_header;
>  	const char		*ctl_table_path;
> -	struct ctl_table	*ctl_table;
>  #endif /* CONFIG_SYSCTL */
>  
>  	/* Init l3proto pernet data */
> diff --git a/include/net/netfilter/nf_conntrack_l4proto.h b/include/net/netfilter/nf_conntrack_l4proto.h
> index 0d329b9..4881df34 100644
> --- a/include/net/netfilter/nf_conntrack_l4proto.h
> +++ b/include/net/netfilter/nf_conntrack_l4proto.h
> @@ -95,16 +95,6 @@ struct nf_conntrack_l4proto {
>  		const struct nla_policy *nla_policy;
>  	} ctnl_timeout;
>  #endif
> -
> -#ifdef CONFIG_SYSCTL
> -	struct ctl_table_header	**ctl_table_header;
> -	struct ctl_table	*ctl_table;
> -	unsigned int		*ctl_table_users;
> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	struct ctl_table_header	*ctl_compat_table_header;
> -	struct ctl_table	*ctl_compat_table;
> -#endif
> -#endif

Interesting. This structure is added in patch 1/17, then it's remove
in patch 15/17.

Probably I'm missing anything, but why are you doing it like that?

>  	int	*net_id;
>  	/* Init l4proto pernet data */
>  	int (*init_net)(struct net *net, u_int8_t compat);
> diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> index 1dd17ed..173da4d 100644
> --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
> @@ -379,7 +379,6 @@ struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv4 __read_mostly = {
>  #endif
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
>  	.ctl_table_path  = "net/ipv4/netfilter",
> -	.ctl_table	 = ip_ct_sysctl_table,
>  #endif
>  	.init_net	= ipv4_init_net,
>  	.me		 = THIS_MODULE,
> diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
> index f468d10..90da247 100644
> --- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
> +++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
> @@ -313,7 +313,6 @@ icmp_timeout_nla_policy[CTA_TIMEOUT_ICMP_MAX+1] = {
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
>  
>  #ifdef CONFIG_SYSCTL
> -static struct ctl_table_header *icmp_sysctl_header;
>  static struct ctl_table icmp_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_icmp_timeout",
> @@ -394,12 +393,5 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_icmp __read_mostly =
>  		.nla_policy	= icmp_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_header	= &icmp_sysctl_header,
> -	.ctl_table		= icmp_sysctl_table,
> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	.ctl_compat_table	= icmp_compat_sysctl_table,
> -#endif
> -#endif
>  	.init_net		= icmp_init_net,
>  };
> diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
> index 3cb422e..12ca315 100644
> --- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
> +++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
> @@ -320,7 +320,6 @@ icmpv6_timeout_nla_policy[CTA_TIMEOUT_ICMPV6_MAX+1] = {
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
>  
>  #ifdef CONFIG_SYSCTL
> -static struct ctl_table_header *icmpv6_sysctl_header;
>  static struct ctl_table icmpv6_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_icmpv6_timeout",
> @@ -376,9 +375,5 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_icmpv6 __read_mostly =
>  		.nla_policy	= icmpv6_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_header	= &icmpv6_sysctl_header,
> -	.ctl_table		= icmpv6_sysctl_table,
> -#endif
>  	.init_net		= icmpv6_init_net,
>  };
> diff --git a/net/netfilter/nf_conntrack_proto_generic.c b/net/netfilter/nf_conntrack_proto_generic.c
> index 7976a64..0f87a77 100644
> --- a/net/netfilter/nf_conntrack_proto_generic.c
> +++ b/net/netfilter/nf_conntrack_proto_generic.c
> @@ -111,7 +111,6 @@ generic_timeout_nla_policy[CTA_TIMEOUT_GENERIC_MAX+1] = {
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
>  
>  #ifdef CONFIG_SYSCTL
> -static struct ctl_table_header *generic_sysctl_header;
>  static struct ctl_table generic_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_generic_timeout",
> @@ -183,13 +182,6 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_generic __read_mostly =
>  		.nla_policy	= generic_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_header	= &generic_sysctl_header,
> -	.ctl_table		= generic_sysctl_table,
> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	.ctl_compat_table	= generic_compat_sysctl_table,
> -#endif
> -#endif
>  	.init_net		= generic_init_net,
>  };
>  
> diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
> index 3f0fdf8..291cef4 100644
> --- a/net/netfilter/nf_conntrack_proto_sctp.c
> +++ b/net/netfilter/nf_conntrack_proto_sctp.c
> @@ -610,8 +610,6 @@ sctp_timeout_nla_policy[CTA_TIMEOUT_SCTP_MAX+1] = {
>  
>  
>  #ifdef CONFIG_SYSCTL
> -static unsigned int sctp_sysctl_table_users;
> -static struct ctl_table_header *sctp_sysctl_header;
>  static struct ctl_table sctp_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_sctp_timeout_closed",
> @@ -791,14 +789,6 @@ static struct nf_conntrack_l4proto nf_conntrack_l4proto_sctp4 __read_mostly = {
>  		.nla_policy	= sctp_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &sctp_sysctl_table_users,
> -	.ctl_table_header	= &sctp_sysctl_header,
> -	.ctl_table		= sctp_sysctl_table,
> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	.ctl_compat_table	= sctp_compat_sysctl_table,
> -#endif
> -#endif
>  	.net_id			= &sctp_net_id,
>  	.init_net		= sctp_init_net,
>  };
> @@ -834,11 +824,6 @@ static struct nf_conntrack_l4proto nf_conntrack_l4proto_sctp6 __read_mostly = {
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
>  #endif
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &sctp_sysctl_table_users,
> -	.ctl_table_header	= &sctp_sysctl_header,
> -	.ctl_table		= sctp_sysctl_table,
> -#endif
>  	.net_id			= &sctp_net_id,
>  	.init_net		= sctp_init_net,
>  };
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> index dd19350..4d16b8a 100644
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -1363,8 +1363,6 @@ static const struct nla_policy tcp_timeout_nla_policy[CTA_TIMEOUT_TCP_MAX+1] = {
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
>  
>  #ifdef CONFIG_SYSCTL
> -static unsigned int tcp_sysctl_table_users;
> -static struct ctl_table_header *tcp_sysctl_header;
>  static struct ctl_table tcp_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_tcp_timeout_syn_sent",
> @@ -1634,14 +1632,6 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_tcp4 __read_mostly =
>  		.nla_policy	= tcp_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &tcp_sysctl_table_users,
> -	.ctl_table_header	= &tcp_sysctl_header,
> -	.ctl_table		= tcp_sysctl_table,
> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	.ctl_compat_table	= tcp_compat_sysctl_table,
> -#endif
> -#endif
>  	.init_net		= tcp_init_net,
>  };
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_tcp4);
> @@ -1679,11 +1669,6 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_tcp6 __read_mostly =
>  		.nla_policy	= tcp_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &tcp_sysctl_table_users,
> -	.ctl_table_header	= &tcp_sysctl_header,
> -	.ctl_table		= tcp_sysctl_table,
> -#endif
>  	.init_net		= tcp_init_net,
>  };
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_tcp6);
> diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
> index 072ef9c..c38ab58 100644
> --- a/net/netfilter/nf_conntrack_proto_udp.c
> +++ b/net/netfilter/nf_conntrack_proto_udp.c
> @@ -199,8 +199,6 @@ udp_timeout_nla_policy[CTA_TIMEOUT_UDP_MAX+1] = {
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
>  
>  #ifdef CONFIG_SYSCTL
> -static unsigned int udp_sysctl_table_users;
> -static struct ctl_table_header *udp_sysctl_header;
>  static struct ctl_table udp_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_udp_timeout",
> @@ -307,14 +305,6 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_udp4 __read_mostly =
>  		.nla_policy	= udp_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &udp_sysctl_table_users,
> -	.ctl_table_header	= &udp_sysctl_header,
> -	.ctl_table		= udp_sysctl_table,
> -#ifdef CONFIG_NF_CONNTRACK_PROC_COMPAT
> -	.ctl_compat_table	= udp_compat_sysctl_table,
> -#endif
> -#endif
>  	.init_net		= udp_init_net,
>  };
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_udp4);
> @@ -347,11 +337,6 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_udp6 __read_mostly =
>  		.nla_policy	= udp_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &udp_sysctl_table_users,
> -	.ctl_table_header	= &udp_sysctl_header,
> -	.ctl_table		= udp_sysctl_table,
> -#endif
>  	.init_net		= udp_init_net,
>  };
>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_udp6);
> diff --git a/net/netfilter/nf_conntrack_proto_udplite.c b/net/netfilter/nf_conntrack_proto_udplite.c
> index 1e90cf5..cb3dc81 100644
> --- a/net/netfilter/nf_conntrack_proto_udplite.c
> +++ b/net/netfilter/nf_conntrack_proto_udplite.c
> @@ -215,8 +215,6 @@ udplite_timeout_nla_policy[CTA_TIMEOUT_UDPLITE_MAX+1] = {
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
>  
>  #ifdef CONFIG_SYSCTL
> -static unsigned int udplite_sysctl_table_users;
> -static struct ctl_table_header *udplite_sysctl_header;
>  static struct ctl_table udplite_sysctl_table[] = {
>  	{
>  		.procname	= "nf_conntrack_udplite_timeout",
> @@ -288,11 +286,6 @@ static struct nf_conntrack_l4proto nf_conntrack_l4proto_udplite4 __read_mostly =
>  		.nla_policy	= udplite_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &udplite_sysctl_table_users,
> -	.ctl_table_header	= &udplite_sysctl_header,
> -	.ctl_table		= udplite_sysctl_table,
> -#endif
>  	.net_id			= &udplite_net_id,
>  	.init_net		= udplite_init_net,
>  };
> @@ -326,11 +319,6 @@ static struct nf_conntrack_l4proto nf_conntrack_l4proto_udplite6 __read_mostly =
>  		.nla_policy	= udplite_timeout_nla_policy,
>  	},
>  #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
> -#ifdef CONFIG_SYSCTL
> -	.ctl_table_users	= &udplite_sysctl_table_users,
> -	.ctl_table_header	= &udplite_sysctl_header,
> -	.ctl_table		= udplite_sysctl_table,
> -#endif
>  	.net_id			= &udplite_net_id,
>  	.init_net		= udplite_init_net,
>  };
> -- 
> 1.7.7.6
> 

^ permalink raw reply

* Re: [PATCH 16/17] netfilter: add namespace support for cttimeout
From: Pablo Neira Ayuso @ 2012-05-23 10:41 UTC (permalink / raw)
  To: Gao feng; +Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <1336985547-31960-17-git-send-email-gaofeng@cn.fujitsu.com>

On Mon, May 14, 2012 at 04:52:26PM +0800, Gao feng wrote:
> add struct net as a param of ctnl_timeout.nlattr_to_obj,
> 
> modify ctnl_timeout_parse_policy and cttimeout_new_timeout
> to transmit struct net to nlattr_to_obj.

Please, merge your patch 16 and 17 into one single patch.

>  	unsigned int *timeouts = data;
>  
> diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
> index 291cef4..a28f3c4 100644
> --- a/net/netfilter/nf_conntrack_proto_sctp.c
> +++ b/net/netfilter/nf_conntrack_proto_sctp.c
> @@ -562,7 +562,8 @@ static int sctp_nlattr_size(void)
>  #include <linux/netfilter/nfnetlink.h>
>  #include <linux/netfilter/nfnetlink_cttimeout.h>
>  
> -static int sctp_timeout_nlattr_to_obj(struct nlattr *tb[], void *data)
> +static int sctp_timeout_nlattr_to_obj(struct nlattr *tb[],
> +				      struct net *net, void *data)

The interface modification and the use of the new *net parameter
should go together, ie. merge patch 16 and 17 :-).

^ permalink raw reply

* Re: [PATCH v3 00/17] netfilter: add namespace support for netfilter protos
From: Pablo Neira Ayuso @ 2012-05-23 10:42 UTC (permalink / raw)
  To: Gao feng; +Cc: netfilter-devel, netdev, serge.hallyn, ebiederm, dlezcano
In-Reply-To: <1336985547-31960-1-git-send-email-gaofeng@cn.fujitsu.com>

On Mon, May 14, 2012 at 04:52:10PM +0800, Gao feng wrote:
> Currently the sysctl of netfilter proto is not isolated, so when
> changing proto's sysctl in container will cause the host's sysctl
> be changed too. it's not expected.
> 
> This patch set adds the namespace support for netfilter protos.
> 
> impletement four pernet_operations to register sysctl and initial
> pernet data for proto.
> 
> -ipv4_net_ops is used to register tcp4(compat),
>  udp4(compat),icmp(compat),ipv4(compat).
> -ipv6_net_ops is used to register tcp6,udp6 and icmpv6.
> -sctp_net_ops is used to register sctp4(compat) and sctp6.
> -udplite_net_ops is used to register udplite4 and udplite6
> 
> extern l[3,4]proto (sysctl) register functions to make them support
> namespace.
> 
> finailly add namespace support for cttimeout.

This requires another spin. It looks way better than previous version
but I don't want to take the patchset and then send another batch to
David to remove the .compat field, the unrequired export of couple of
symbols, and so on...

Thanks!

^ permalink raw reply

* Re: [README] Two merges pending...
From: Pablo Neira Ayuso @ 2012-05-23 10:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linville, netfilter-devel, linux-wireless
In-Reply-To: <20120522.152840.815291543305899477.davem@davemloft.net>

Hi David,

On Tue, May 22, 2012 at 03:28:40PM -0400, David Miller wrote:
[...]
> Pablo also has a prearranged netfilter merge pending.

I give up on that merge that I requested.

I thought that some namespace changes for conntrack that have been
lying on the table for one week could be merged. But they are not ready
yet, they require another spin.

That means they'll have to wait until window opens again, they need
some extra time.

I'll stick to one pull request for bugfixes then.

Thanks!

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-23 12:09 UTC (permalink / raw)
  To: Kieran Mansley, Jeff Kirsher, Alex Duyck; +Cc: Ben Hutchings, netdev
In-Reply-To: <1337766246.3361.2447.camel@edumazet-glaptop>

On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:

> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
> bytes overhead per skb, it should go away.


Here is the patch for ixgbe, for reference.

My machine is now able to receive a netperf TCP_STREAM full speed
(10Gb), even with GRO/LRO off. (TCPRcvCoalesce counter increasing very
fast too)

Its not an official patch yet, because :

1) I need to properly align DMA buffers to reserve NET_SKB_PAD bytes
(not all workloads are like TCP, and some headroom is needed for
tunnels)

2) Must cope with MTU > 1500 cases

3) Should not be done if NET_IP_ALIGN is not null (I dont know if ixgbe
hardware can do the DMA to non aligned area on receive)

This patch saves 1024 bytes per incoming skb. (skb->head directly mapped
to the frag containing the frame, instead of a separate memory area)

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   82 ++++++++--------
 1 file changed, 46 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index bf20457..d05693a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1511,39 +1511,41 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
 		return true;
 	}
 
-	/*
-	 * it is valid to use page_address instead of kmap since we are
-	 * working with pages allocated out of the lomem pool per
-	 * alloc_page(GFP_ATOMIC)
-	 */
-	va = skb_frag_address(frag);
+	if (!skb_headlen(skb)) {
+		/*
+		 * it is valid to use page_address instead of kmap since we are
+		 * working with pages allocated out of the lowmem pool per
+		 * alloc_page(GFP_ATOMIC)
+		 */
+		va = skb_frag_address(frag);
 
-	/*
-	 * we need the header to contain the greater of either ETH_HLEN or
-	 * 60 bytes if the skb->len is less than 60 for skb_pad.
-	 */
-	pull_len = skb_frag_size(frag);
-	if (pull_len > 256)
-		pull_len = ixgbe_get_headlen(va, pull_len);
+		/*
+		 * we need the header to contain the greater of either ETH_HLEN or
+		 * 60 bytes if the skb->len is less than 60 for skb_pad.
+		 */
+		pull_len = skb_frag_size(frag);
+		if (pull_len > 256)
+			pull_len = ixgbe_get_headlen(va, pull_len);
 
-	/* align pull length to size of long to optimize memcpy performance */
-	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
+		/* align pull length to size of long to optimize memcpy performance */
+		skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
 
-	/* update all of the pointers */
-	skb_frag_size_sub(frag, pull_len);
-	frag->page_offset += pull_len;
-	skb->data_len -= pull_len;
-	skb->tail += pull_len;
+		/* update all of the pointers */
+		skb_frag_size_sub(frag, pull_len);
+		frag->page_offset += pull_len;
+		skb->data_len -= pull_len;
+		skb->tail += pull_len;
 
-	/*
-	 * if we sucked the frag empty then we should free it,
-	 * if there are other frags here something is screwed up in hardware
-	 */
-	if (skb_frag_size(frag) == 0) {
-		BUG_ON(skb_shinfo(skb)->nr_frags != 1);
-		skb_shinfo(skb)->nr_frags = 0;
-		__skb_frag_unref(frag);
-		skb->truesize -= ixgbe_rx_bufsz(rx_ring);
+		/*
+		 * if we sucked the frag empty then we should free it,
+		 * if there are other frags here something is screwed up in hardware
+		 */
+		if (skb_frag_size(frag) == 0) {
+			BUG_ON(skb_shinfo(skb)->nr_frags != 1);
+			skb_shinfo(skb)->nr_frags = 0;
+			__skb_frag_unref(frag);
+			skb->truesize -= ixgbe_rx_bufsz(rx_ring);
+		}
 	}
 
 	/* if skb_pad returns an error the skb was freed */
@@ -1662,6 +1664,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		struct sk_buff *skb;
 		struct page *page;
 		u16 ntc;
+		unsigned int len;
+		bool addfrag = true;
 
 		/* return some buffers to hardware, one at a time is too slow */
 		if (cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
@@ -1687,7 +1691,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		prefetchw(page);
 
 		skb = rx_buffer->skb;
-
+		len = le16_to_cpu(rx_desc->wb.upper.length);
 		if (likely(!skb)) {
 			void *page_addr = page_address(page) +
 					  rx_buffer->page_offset;
@@ -1698,9 +1702,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			prefetch(page_addr + L1_CACHE_BYTES);
 #endif
 
-			/* allocate a skb to store the frags */
-			skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
-							IXGBE_RX_HDR_SIZE);
+			/* allocate a skb to store the frag */
+			if (len <= 256) {
+				skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
+								256);
+			} else {
+				skb = build_skb(page_addr, ixgbe_rx_bufsz(rx_ring));
+				addfrag = false;
+			}
 			if (unlikely(!skb)) {
 				rx_ring->rx_stats.alloc_rx_buff_failed++;
 				break;
@@ -1729,9 +1738,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 						      DMA_FROM_DEVICE);
 		}
 
-		/* pull page into skb */
-		ixgbe_add_rx_frag(rx_ring, rx_buffer, skb,
-				  le16_to_cpu(rx_desc->wb.upper.length));
+		if (addfrag)
+			ixgbe_add_rx_frag(rx_ring, rx_buffer, skb, len);
+		else
+			__skb_put(skb, len);
 
 		if (ixgbe_can_reuse_page(rx_buffer)) {
 			/* hand second half of page back to the ring */

^ permalink raw reply related

* Re: [PATCH] xen/netback: calculate correctly the SKB slots.
From: Konrad Rzeszutek Wilk @ 2012-05-23 13:12 UTC (permalink / raw)
  To: Simon Graham
  Cc: Ian Campbell, Ben Hutchings, xen-devel@lists.xensource.com,
	netdev@vger.kernel.org, davem@davemloft.net,
	linux-kernel@vger.kernel.org, Adnan Misherfi
In-Reply-To: <F02ED76F3FCF8C468AD22A7618C05BBBB18DBFFE61@FTLPMAILBOX01.citrite.net>

On Tue, May 22, 2012 at 03:01:55PM -0400, Simon Graham wrote:
> > >
> > > > >  	int i, copy_off;
> > > > >
> > > > >  	count = DIV_ROUND_UP(
> > > > > -			offset_in_page(skb->data)+skb_headlen(skb),
> > PAGE_SIZE);
> > > > > +			offset_in_page(skb->data + skb_headlen(skb)),
> > PAGE_SIZE);
> > > >
> > > > The new version would be equivalent to:
> > > > 	count = offset_in_page(skb->data + skb_headlen(skb)) != 0;
> > > > which is not right, as netbk_gop_skb() will use one slot per page.
> > >
> > > Just outside the context of this patch we separately count the frag
> > > pages.
> > >
> > > However I think you are right if skb->data covers > 1 page, since the
> > > new version can only ever return 0 or 1. I expect this patch papers
> > over
> > > the underlying issue by not stopping often enough, rather than
> > actually
> > > fixing the underlying issue.
> > 
> > Ah, any thoughts? Have you guys seen this behavior as well?
> 
> We ran into this same problem and the fix we've been running with for a while now (been meaning to submit it!) is:

Where is the patchqueue of the patches? Is it only on the src.rpm or
is it in some nice mercurial tree? Asking b/c if we run into other trouble
it would be also time-saving for us (and I presume other companies
too) to check that. Thanks!

^ permalink raw reply

* RE: [PATCH] xen/netback: calculate correctly the SKB slots.
From: Simon Graham @ 2012-05-23 14:17 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Ian Campbell, Ben Hutchings, xen-devel@lists.xensource.com,
	netdev@vger.kernel.org, davem@davemloft.net,
	linux-kernel@vger.kernel.org, Adnan Misherfi
In-Reply-To: <20120523131242.GA15406@phenom.dumpdata.com>

> > We ran into this same problem and the fix we've been running with for
> a while now (been meaning to submit it!) is:
> 
> Where is the patchqueue of the patches? Is it only on the src.rpm or
> is it in some nice mercurial tree? Asking b/c if we run into other
> trouble
> it would be also time-saving for us (and I presume other companies
> too) to check that. Thanks!

Currently our patchqueue is only in the source iso and not in an externally visible git tree - sorry!

Simon

^ permalink raw reply

* [PATCH 1/1] net: add dev_loopback_xmit() to avoid duplicate code
From: Michel Machado @ 2012-05-23 14:20 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, Eric Dumazet, Jiri Pirko,
	Michał Mirosław, Ben Hutchings

Add dev_loopback_xmit() in order to deduplicate functions
ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).

I was about to reinvent the wheel when I noticed that
ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
need and are not IP-only functions, but they were not available to reuse
elsewhere.

ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
understand that this is harmless, and should be be in
dev_loopback_xmit().

Signed-off-by: Michel Machado <michel@digirati.com.br>
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
CC: Ben Hutchings <bhutchings@solarflare.com>
---

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7fd468..1c49c21 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1626,6 +1626,7 @@ extern int		dev_alloc_name(struct net_device *dev, const char *name);
 extern int		dev_open(struct net_device *dev);
 extern int		dev_close(struct net_device *dev);
 extern void		dev_disable_lro(struct net_device *dev);
+extern int		dev_loopback_xmit(struct sk_buff *newskb);
 extern int		dev_queue_xmit(struct sk_buff *skb);
 extern int		register_netdevice(struct net_device *dev);
 extern void		unregister_netdevice_queue(struct net_device *dev,
diff --git a/net/core/dev.c b/net/core/dev.c
index cd09819..c6e29ea6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2476,6 +2476,23 @@ static DEFINE_PER_CPU(int, xmit_recursion);
 #define RECURSION_LIMIT 10
 
 /**
+ *	dev_loopback_xmit - loop back @skb
+ *	@skb: buffer to transmit
+ */
+int dev_loopback_xmit(struct sk_buff *skb)
+{
+	skb_reset_mac_header(skb);
+	__skb_pull(skb, skb_network_offset(skb));
+	skb->pkt_type = PACKET_LOOPBACK;
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
+	WARN_ON(!skb_dst(skb));
+	skb_dst_force(skb);
+	netif_rx_ni(skb);
+	return 0;
+}
+EXPORT_SYMBOL(dev_loopback_xmit);
+
+/**
  *	dev_queue_xmit - transmit a buffer
  *	@skb: buffer to transmit
  *
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 451f97c..d34ddde 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -113,19 +113,6 @@ int ip_local_out(struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(ip_local_out);
 
-/* dev_loopback_xmit for use with netfilter. */
-static int ip_dev_loopback_xmit(struct sk_buff *newskb)
-{
-	skb_reset_mac_header(newskb);
-	__skb_pull(newskb, skb_network_offset(newskb));
-	newskb->pkt_type = PACKET_LOOPBACK;
-	newskb->ip_summed = CHECKSUM_UNNECESSARY;
-	WARN_ON(!skb_dst(newskb));
-	skb_dst_force(newskb);
-	netif_rx_ni(newskb);
-	return 0;
-}
-
 static inline int ip_select_ttl(struct inet_sock *inet, struct dst_entry *dst)
 {
 	int ttl = inet->uc_ttl;
@@ -281,7 +268,7 @@ int ip_mc_output(struct sk_buff *skb)
 			if (newskb)
 				NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING,
 					newskb, NULL, newskb->dev,
-					ip_dev_loopback_xmit);
+					dev_loopback_xmit);
 		}
 
 		/* Multicasts with ttl 0 must not go beyond the host */
@@ -296,7 +283,7 @@ int ip_mc_output(struct sk_buff *skb)
 		struct sk_buff *newskb = skb_clone(skb, GFP_ATOMIC);
 		if (newskb)
 			NF_HOOK(NFPROTO_IPV4, NF_INET_POST_ROUTING, newskb,
-				NULL, newskb->dev, ip_dev_loopback_xmit);
+				NULL, newskb->dev, dev_loopback_xmit);
 	}
 
 	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d99fdc6..8443c00 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -83,19 +83,6 @@ int ip6_local_out(struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(ip6_local_out);
 
-/* dev_loopback_xmit for use with netfilter. */
-static int ip6_dev_loopback_xmit(struct sk_buff *newskb)
-{
-	skb_reset_mac_header(newskb);
-	__skb_pull(newskb, skb_network_offset(newskb));
-	newskb->pkt_type = PACKET_LOOPBACK;
-	newskb->ip_summed = CHECKSUM_UNNECESSARY;
-	WARN_ON(!skb_dst(newskb));
-
-	netif_rx_ni(newskb);
-	return 0;
-}
-
 static int ip6_finish_output2(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -121,7 +108,7 @@ static int ip6_finish_output2(struct sk_buff *skb)
 			if (newskb)
 				NF_HOOK(NFPROTO_IPV6, NF_INET_POST_ROUTING,
 					newskb, NULL, newskb->dev,
-					ip6_dev_loopback_xmit);
+					dev_loopback_xmit);
 
 			if (ipv6_hdr(skb)->hop_limit == 0) {
 				IP6_INC_STATS(dev_net(dev), idev,

^ permalink raw reply related

* Re: Kernel doesn't propagate DNSSL to userspace (e.g. NetworkManager)
From: Dan Williams @ 2012-05-23 14:24 UTC (permalink / raw)
  To: Pavel Simerda; +Cc: netdev, Dan Williams, danw
In-Reply-To: <fb24ca84-9976-42ab-b81a-73955b753731@zmail19.collab.prod.int.phx2.redhat.com>

On Wed, 2012-05-23 at 04:16 -0400, Pavel Simerda wrote:
> Hi,
> 
> I filed a bugreport to about lack of DNSSL support in kernel:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=824121
> 
> While NetworkManager recieves RDNSS neighbor discovery user option from kernel, it doesn't recieve DNSSL at all. This can be debugged with NetworkManager
> (or hopefully some better testing tool) and radvdump (to check if DNSSL is present).
> 
> radvdump reports DNSSL is there, NetworkManager gets no netlink message from kernel.
> 
> kernel-3.3.4-4.fc17.x86_64
> NetworkManager-0.9.4.0-7.git20120403.fc17.x86_64
> 
> Dave Jones asked me to contact this ML directly. I'm offlist.

So Pierre Ossman sent some RFC patches for DNSSL back in December 2010,
but Dave Miller wanted actual formal submissions which Pierre never got
around to doing.  I'll resubmit Pierre's second patch, which exports all
options the kernel doesn't care about to userspace.

Sun, 12 Dec 2010:
"[RFC][PATCH] Export all RA options that we don't handle to userspace"

Dan

^ permalink raw reply

* [PATCH net] ipx: restore token ring define to include/linux/ipx.h
From: Paul Gortmaker @ 2012-05-23 14:43 UTC (permalink / raw)
  To: davem; +Cc: netdev, Paul Gortmaker, Stephen Hemminger

Commit 211ed865108e24697b44bee5daac502ee6bdd4a4

    "net: delete all instances of special processing for token ring"

removed the define for IPX_FRAME_TR_8022.

While it is unlikely, we can't be 100% sure that there aren't
random userspace consumers of this value, so restore it.

The only instance I could find was in ncpfs-2.2.6, and it was
safe as-is, since it used #ifdef IPX_FRAME_TR_8022 around the
two use cases it had, but there may be other userspace packages
without similar ifdefs.

Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

diff --git a/include/linux/ipx.h b/include/linux/ipx.h
index 8f02439..3d48014 100644
--- a/include/linux/ipx.h
+++ b/include/linux/ipx.h
@@ -38,7 +38,7 @@ struct ipx_interface_definition {
 #define IPX_FRAME_8022		2
 #define IPX_FRAME_ETHERII	3
 #define IPX_FRAME_8023		4
-/* obsolete token ring was	5 */
+#define IPX_FRAME_TR_8022       5 /* obsolete */
 	unsigned char ipx_special;
 #define IPX_SPECIAL_NONE	0
 #define IPX_PRIMARY		1
-- 
1.7.9.1

^ permalink raw reply related

* Re: Kernel doesn't propagate DNSSL to userspace (e.g. NetworkManager)
From: Dan Williams @ 2012-05-23 14:46 UTC (permalink / raw)
  To: Pavel Simerda; +Cc: netdev, Dan Williams, danw
In-Reply-To: <1337783065.25721.1.camel@dcbw.foobar.com>

On Wed, 2012-05-23 at 09:24 -0500, Dan Williams wrote:
> On Wed, 2012-05-23 at 04:16 -0400, Pavel Simerda wrote:
> > Hi,
> > 
> > I filed a bugreport to about lack of DNSSL support in kernel:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=824121
> > 
> > While NetworkManager recieves RDNSS neighbor discovery user option from kernel, it doesn't recieve DNSSL at all. This can be debugged with NetworkManager
> > (or hopefully some better testing tool) and radvdump (to check if DNSSL is present).
> > 
> > radvdump reports DNSSL is there, NetworkManager gets no netlink message from kernel.
> > 
> > kernel-3.3.4-4.fc17.x86_64
> > NetworkManager-0.9.4.0-7.git20120403.fc17.x86_64
> > 
> > Dave Jones asked me to contact this ML directly. I'm offlist.
> 
> So Pierre Ossman sent some RFC patches for DNSSL back in December 2010,
> but Dave Miller wanted actual formal submissions which Pierre never got
> around to doing.  I'll resubmit Pierre's second patch, which exports all
> options the kernel doesn't care about to userspace.
> 
> Sun, 12 Dec 2010:
> "[RFC][PATCH] Export all RA options that we don't handle to userspace"

Well, it appears that e35f30c1 from 2012-04-06 (should be in the 3.4
kernel?) actually adds support for passing DNSSL to userspace, basically
what Pierre's first patch from 2010 did.  So thanks Alexey :)  I think
we should be good now for DNSSL?

Dan

^ permalink raw reply

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
From: Andrew Theurer @ 2012-05-23 14:52 UTC (permalink / raw)
  To: Liu ping fan
  Cc: Shirley Ma, kvm, netdev, linux-kernel, qemu-devel, Avi Kivity,
	Michael S. Tsirkin, Srivatsa Vaddagiri, Rusty Russell,
	Anthony Liguori, Ryan Harper, Shirley Ma, Krishna Kumar,
	Tom Lendacky
In-Reply-To: <CAFgQCTsdXqitVGKm5jQjtC--yvZy6x04zRduJ+_tXqUnkncdog@mail.gmail.com>

On 05/22/2012 04:28 AM, Liu ping fan wrote:
> On Sat, May 19, 2012 at 12:14 AM, Shirley Ma<mashirle@us.ibm.com>  wrote:
>> On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
>>> Currently, the guest can not know the NUMA info of the vcpu, which
>>> will
>>> result in performance drawback.
>>>
>>> This is the discovered and experiment by
>>>          Shirley Ma<xma@us.ibm.com>
>>>          Krishna Kumar<krkumar2@in.ibm.com>
>>>          Tom Lendacky<toml@us.ibm.com>
>>> Refer to -
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
>>> we can see the big perfermance gap between NUMA aware and unaware.
>>>
>>> Enlightened by their discovery, I think, we can do more work -- that
>>> is to
>>> export NUMA info of host to guest.
>>
>> There three problems we've found:
>>
>> 1. KVM doesn't support NUMA load balancer. Even there are no other
>> workloads in the system, and the number of vcpus on the guest is smaller
>> than the number of cpus per node, the vcpus could be scheduled on
>> different nodes.
>>
>> Someone is working on in-kernel solution. Andrew Theurer has a working
>> user-space NUMA aware VM balancer, it requires libvirt and cgroups
>> (which is default for RHEL6 systems).
>>
> Interesting, and I found that "sched/numa: Introduce
> sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
> But I think from the guest view, it can not tell whether the two vcpus
> are on the same host node. For example,
> vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
> expensive if it pull_task from vcpu-a and
> choose vcpu-b to push.  And my idea is to export such info to guest,
> still working on it.

The long term solution is to two-fold:
1) Guests that are quite large (in that they cannot fit in a host NUMA 
node) must have static mulit-node NUMA topology implemented by Qemu. 
That is here today, but we do not do it automatically, which is probably 
going to be a VM management responsibility.
2) Host scheduler and NUMA code must be enhanced to get better placement 
of Qemu memory and threads.  For single-node vNUMA guests, this is easy, 
put it all in one node.  For mulit-node vNUMA guests, the host must 
understand that some Qemu memory belongs with certain vCPU threads 
(which make up one of the guests vNUMA nodes), and then place that 
memory/threads in a specific host node (and continue for other 
memory/threads for each Qemu vNUMA node).

Note that even if a guest's memory/threads for a vNUMA node are 
relocated to another host node (which will be necessary) the NUMA 
characteristics of guest are still maintained (as all those vCPUs and 
memory are still "close" to each other).

The problem with exposing the host's NUMA info directly to the guest is 
that (1) vCPUs will get relocated, so their topology info in the guest 
will have to change over time. IMO that is a bad idea.  We have a hard 
enough time getting applications to work with a static NUMA info.  To 
get applications to react to changing NUMA topology is not going to turn 
out well. (2) Every single guest would have to have the same number of 
NUMA nodes defined as the host.  That is overkill, especially for small 
guests.
>
>
>> 2. The host scheduler is not aware the relationship between guest vCPUs
>> and vhost. So it's possible for host scheduler to schedule per-device
>> vhost thread on the same cpu on which the vCPU kick a TX packet, or
>> schecule vhost thread on different node than the vCPU for; For RX packet
>> it's possible for vhost delivers RX packet on the vCPU running on
>> different node too.
>>
> Yes. I notice this point in your original patch.
>
>> 3. per-device vhost thread is not scaled.
>>
> What about the scale-ability of per-vm * host_NUMA_NODE? When we make
> advantage of multi-core,  we produce mulit vcpu threads for one VM.
> So what about the emulated device? Is it acceptable to scale to take
> advantage of host NUMA attr.  After all, how many nodes on which the
> VM
> can be run on are the user's control.  It is a balance of
> scale-ability and performance.
>
>> So the problems are in host scheduling and vhost thread scalability. I
>> am not sure how much help from exposing NUMA info from host to guest.
>>
>> Have you tested these patched? How much performance gain here?
>>
> Sorry, not yet.  As you have mentioned, the vhost thread scalability
> is a big problem. So I want to see others' opinion before going on.
>
> Thanks and regards,
> pingfan
>
>
>> Thanks
>> Shirley
>>
>>> So here comes the idea:
>>> 1. export host numa info through guest's sched domain to its scheduler
>>>    Export vcpu's NUMA info to guest scheduler(I think mem NUMA problem
>>>    has been handled by host).  So the guest's lb will consider the
>>> cost.
>>>    I am still working on this, and my original idea is to export these
>>> info
>>>    through "static struct sched_domain_topology_level
>>> *sched_domain_topology"
>>>    to guest.
>>>
>>> 2. Do a better emulation of virt mach exported to guest.
>>>    In real world, the devices are limited by kinds of reasons to own
>>> the NUMA
>>>    property. But as to Qemu, the device is emulated by thread, which
>>> inherit
>>>    the NUMA attr in nature.  We can implement the device as components
>>> of many
>>>    logic units, each of the unit is backed by a thread in different
>>> host node.
>>>    Currently, I want to start the work on vhost. But I think, maybe in
>>>    future, the iothread in Qemu can also has such attr.
>>>
>>>
>>> Forgive me, for the limited time, I can not have more better
>>> understand of
>>> vhost/virtio_net drivers. These patches are just draft, _FAR_, _FAR_
>>> from work.
>>> I will do more detail work for them in future.
>>>
>>> To easy the review, the following is the sum up of the 2nd point of
>>> the idea.
>>> As for the 1st point of the idea, it is not reflected in the patches.
>>>
>>> --spread/shrink the vhost_workers over the host nodes as demanded from
>>> Qemu.
>>>    And we can consider each vhost_worker as an independent net logic
>>> device
>>>    embeded in physical device "vhost_net".  At the meanwhile, we spread
>>> vcpu
>>>    threads over the host node.
>>>    The vrings on guest are allocated PAGE_SIZE align separately, so
>>> they can
>>>    will only be mapped into different host node, so vhost_worker in the
>>> same
>>>    node can access it with the least cost. So does the vq on guest.
>>>
>>> --virtio_net driver will changes and talk with the logic device. And
>>> which
>>>    logic device it will talk to is determined by on which vcpu it is
>>> scheduled.
>>>
>>> --the binding of vcpus and vhost_worker is implemented by:
>>>    for call direction, vq-a in the node-A will have a dedicated irq-a.
>>> And
>>>    we set the irq-a's affinity to vcpus in node-A.
>>>    for kick direction, kick register-b trigger different eventfd-b
>>> which wake up
>>>    vhost_worker-b.
>>>
-Andrew Theurer

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Eric Dumazet @ 2012-05-23 15:08 UTC (permalink / raw)
  To: Sergio Correia; +Cc: netdev
In-Reply-To: <CAJyhjX2f_XFgVZpS1hLy5CWKY7eYA5MzQSHsfRgc40-2Rxztkw@mail.gmail.com>

On Tue, 2012-05-22 at 11:47 -0300, Sergio Correia wrote:
> Hi Eric,
...
> Yes, it's an Atheros AR9285 adapter.
> This morning I did a make mrproper before rebuilding the kernel
> (should I always do that?), but the warning has just appeared again.

OK, I am taking a look at this problem, thanks.

^ permalink raw reply

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
From: Michael S. Tsirkin @ 2012-05-23 15:16 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Krishna Kumar, Rusty Russell, Shirley Ma, kvm, netdev, Shirley Ma,
	qemu-devel, Liu ping fan, linux-kernel, Tom Lendacky, Ryan Harper,
	Avi Kivity, Anthony Liguori, Srivatsa Vaddagiri
In-Reply-To: <4FBCF99F.4070409@linux.vnet.ibm.com>

On Wed, May 23, 2012 at 09:52:15AM -0500, Andrew Theurer wrote:
> On 05/22/2012 04:28 AM, Liu ping fan wrote:
> >On Sat, May 19, 2012 at 12:14 AM, Shirley Ma<mashirle@us.ibm.com>  wrote:
> >>On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
> >>>Currently, the guest can not know the NUMA info of the vcpu, which
> >>>will
> >>>result in performance drawback.
> >>>
> >>>This is the discovered and experiment by
> >>>         Shirley Ma<xma@us.ibm.com>
> >>>         Krishna Kumar<krkumar2@in.ibm.com>
> >>>         Tom Lendacky<toml@us.ibm.com>
> >>>Refer to -
> >>>http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
> >>>we can see the big perfermance gap between NUMA aware and unaware.
> >>>
> >>>Enlightened by their discovery, I think, we can do more work -- that
> >>>is to
> >>>export NUMA info of host to guest.
> >>
> >>There three problems we've found:
> >>
> >>1. KVM doesn't support NUMA load balancer. Even there are no other
> >>workloads in the system, and the number of vcpus on the guest is smaller
> >>than the number of cpus per node, the vcpus could be scheduled on
> >>different nodes.
> >>
> >>Someone is working on in-kernel solution. Andrew Theurer has a working
> >>user-space NUMA aware VM balancer, it requires libvirt and cgroups
> >>(which is default for RHEL6 systems).
> >>
> >Interesting, and I found that "sched/numa: Introduce
> >sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
> >But I think from the guest view, it can not tell whether the two vcpus
> >are on the same host node. For example,
> >vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
> >expensive if it pull_task from vcpu-a and
> >choose vcpu-b to push.  And my idea is to export such info to guest,
> >still working on it.
> 
> The long term solution is to two-fold:
> 1) Guests that are quite large (in that they cannot fit in a host
> NUMA node) must have static mulit-node NUMA topology implemented by
> Qemu. That is here today, but we do not do it automatically, which
> is probably going to be a VM management responsibility.
> 2) Host scheduler and NUMA code must be enhanced to get better
> placement of Qemu memory and threads.  For single-node vNUMA guests,
> this is easy, put it all in one node.  For mulit-node vNUMA guests,
> the host must understand that some Qemu memory belongs with certain
> vCPU threads (which make up one of the guests vNUMA nodes), and then
> place that memory/threads in a specific host node (and continue for
> other memory/threads for each Qemu vNUMA node).

And for IO, we need multiqueue devices such that each
node can have its own queue in its local memory.

-- 
MST

^ permalink raw reply

* [PATCH] dca: check against empty dca_domains list before unregister provider
From: Maciej Sosnowski @ 2012-05-23 15:27 UTC (permalink / raw)
  To: dan.j.williams; +Cc: jiang.liu, chenkeping, linux-kernel, netdev, linux-pci

When providers get blocked unregister_dca_providers() is called ending up
with dca_providers and dca_domain lists emptied. Dca should be prevented from
trying to unregister any provider if dca_domain list is found empty.

Reported-by: Jiang Liu <jiang.liu@huawei.com>
Tested-by: Gaohuai Han <hangaohuai@huawei.com>
Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
---

 drivers/dca/dca-core.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/dca/dca-core.c b/drivers/dca/dca-core.c
index bc6f5fa..819dfda 100644
--- a/drivers/dca/dca-core.c
+++ b/drivers/dca/dca-core.c
@@ -420,6 +420,11 @@ void unregister_dca_provider(struct dca_
 
 	raw_spin_lock_irqsave(&dca_lock, flags);
 
+	if (list_empty(&dca_domains)) {
+		raw_spin_unlock_irqrestore(&dca_lock, flags);
+		return;
+	}
+
 	list_del(&dca->node);
 
 	pci_rc = dca_pci_rc_from_dev(dev);

^ permalink raw reply related

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Sergio Correia @ 2012-05-23 15:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337785681.3361.2923.camel@edumazet-glaptop>

On Wed, May 23, 2012 at 12:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-05-22 at 11:47 -0300, Sergio Correia wrote:
>> Hi Eric,
> ...
>> Yes, it's an Atheros AR9285 adapter.
>> This morning I did a make mrproper before rebuilding the kernel
>> (should I always do that?), but the warning has just appeared again.
>
> OK, I am taking a look at this problem, thanks.
>

Thanks. Let me know if you need additional info. As of now, my dmesg
basically shows only those warnings.

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Alexander Duyck @ 2012-05-23 16:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <1337774978.3361.2744.camel@edumazet-glaptop>

On 05/23/2012 05:09 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:
>
>> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
>> bytes overhead per skb, it should go away.
>
> Here is the patch for ixgbe, for reference.
I'm confused as to what this is trying to accomplish.

Currently the way the ixgbe driver works is that we allocate the
skb->head using netdev_alloc_skb, which after your recent changes should
be using a head frag.  If the buffer is less than 256 bytes we have
pushed the entire buffer into the head frag, and if it is more we only
pull everything up to the end of the TCP header.  In either case if we
are merging TCP flows we should be able to drop one page or the other
along with the sk_buff giving us a total truesize addition after merge
of ~1K for less than 256 bytes or 2K for a full sized frame.

I'll try to take a look at this today as it is in our interest to have
TCP performing as well as possible on ixgbe.

Thanks,

Alex


 
>
> My machine is now able to receive a netperf TCP_STREAM full speed
> (10Gb), even with GRO/LRO off. (TCPRcvCoalesce counter increasing very
> fast too)
>
> Its not an official patch yet, because :
>
> 1) I need to properly align DMA buffers to reserve NET_SKB_PAD bytes
> (not all workloads are like TCP, and some headroom is needed for
> tunnels)
>
> 2) Must cope with MTU > 1500 cases
>
> 3) Should not be done if NET_IP_ALIGN is not null (I dont know if ixgbe
> hardware can do the DMA to non aligned area on receive)
>
> This patch saves 1024 bytes per incoming skb. (skb->head directly mapped
> to the frag containing the frame, instead of a separate memory area)
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   82 ++++++++--------
>  1 file changed, 46 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index bf20457..d05693a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1511,39 +1511,41 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
>  		return true;
>  	}
>  
> -	/*
> -	 * it is valid to use page_address instead of kmap since we are
> -	 * working with pages allocated out of the lomem pool per
> -	 * alloc_page(GFP_ATOMIC)
> -	 */
> -	va = skb_frag_address(frag);
> +	if (!skb_headlen(skb)) {
> +		/*
> +		 * it is valid to use page_address instead of kmap since we are
> +		 * working with pages allocated out of the lowmem pool per
> +		 * alloc_page(GFP_ATOMIC)
> +		 */
> +		va = skb_frag_address(frag);
>  
> -	/*
> -	 * we need the header to contain the greater of either ETH_HLEN or
> -	 * 60 bytes if the skb->len is less than 60 for skb_pad.
> -	 */
> -	pull_len = skb_frag_size(frag);
> -	if (pull_len > 256)
> -		pull_len = ixgbe_get_headlen(va, pull_len);
> +		/*
> +		 * we need the header to contain the greater of either ETH_HLEN or
> +		 * 60 bytes if the skb->len is less than 60 for skb_pad.
> +		 */
> +		pull_len = skb_frag_size(frag);
> +		if (pull_len > 256)
> +			pull_len = ixgbe_get_headlen(va, pull_len);
>  
> -	/* align pull length to size of long to optimize memcpy performance */
> -	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
> +		/* align pull length to size of long to optimize memcpy performance */
> +		skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
>  
> -	/* update all of the pointers */
> -	skb_frag_size_sub(frag, pull_len);
> -	frag->page_offset += pull_len;
> -	skb->data_len -= pull_len;
> -	skb->tail += pull_len;
> +		/* update all of the pointers */
> +		skb_frag_size_sub(frag, pull_len);
> +		frag->page_offset += pull_len;
> +		skb->data_len -= pull_len;
> +		skb->tail += pull_len;
>  
> -	/*
> -	 * if we sucked the frag empty then we should free it,
> -	 * if there are other frags here something is screwed up in hardware
> -	 */
> -	if (skb_frag_size(frag) == 0) {
> -		BUG_ON(skb_shinfo(skb)->nr_frags != 1);
> -		skb_shinfo(skb)->nr_frags = 0;
> -		__skb_frag_unref(frag);
> -		skb->truesize -= ixgbe_rx_bufsz(rx_ring);
> +		/*
> +		 * if we sucked the frag empty then we should free it,
> +		 * if there are other frags here something is screwed up in hardware
> +		 */
> +		if (skb_frag_size(frag) == 0) {
> +			BUG_ON(skb_shinfo(skb)->nr_frags != 1);
> +			skb_shinfo(skb)->nr_frags = 0;
> +			__skb_frag_unref(frag);
> +			skb->truesize -= ixgbe_rx_bufsz(rx_ring);
> +		}
>  	}
>  
>  	/* if skb_pad returns an error the skb was freed */
> @@ -1662,6 +1664,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>  		struct sk_buff *skb;
>  		struct page *page;
>  		u16 ntc;
> +		unsigned int len;
> +		bool addfrag = true;
>  
>  		/* return some buffers to hardware, one at a time is too slow */
>  		if (cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
> @@ -1687,7 +1691,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>  		prefetchw(page);
>  
>  		skb = rx_buffer->skb;
> -
> +		len = le16_to_cpu(rx_desc->wb.upper.length);
>  		if (likely(!skb)) {
>  			void *page_addr = page_address(page) +
>  					  rx_buffer->page_offset;
> @@ -1698,9 +1702,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>  			prefetch(page_addr + L1_CACHE_BYTES);
>  #endif
>  
> -			/* allocate a skb to store the frags */
> -			skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
> -							IXGBE_RX_HDR_SIZE);
> +			/* allocate a skb to store the frag */
> +			if (len <= 256) {
> +				skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
> +								256);
> +			} else {
> +				skb = build_skb(page_addr, ixgbe_rx_bufsz(rx_ring));
> +				addfrag = false;
> +			}
>  			if (unlikely(!skb)) {
>  				rx_ring->rx_stats.alloc_rx_buff_failed++;
>  				break;
> @@ -1729,9 +1738,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>  						      DMA_FROM_DEVICE);
>  		}
>  
> -		/* pull page into skb */
> -		ixgbe_add_rx_frag(rx_ring, rx_buffer, skb,
> -				  le16_to_cpu(rx_desc->wb.upper.length));
> +		if (addfrag)
> +			ixgbe_add_rx_frag(rx_ring, rx_buffer, skb, len);
> +		else
> +			__skb_put(skb, len);
>  
>  		if (ixgbe_can_reuse_page(rx_buffer)) {
>  			/* hand second half of page back to the ring */
>
>

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-23 16:12 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <4FBD0A85.4040407@intel.com>

On Wed, 2012-05-23 at 09:04 -0700, Alexander Duyck wrote:
> On 05/23/2012 05:09 AM, Eric Dumazet wrote:
> > On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:
> >
> >> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
> >> bytes overhead per skb, it should go away.
> >
> > Here is the patch for ixgbe, for reference.
> I'm confused as to what this is trying to accomplish.
> 
> Currently the way the ixgbe driver works is that we allocate the
> skb->head using netdev_alloc_skb, which after your recent changes should
> be using a head frag.  If the buffer is less than 256 bytes we have
> pushed the entire buffer into the head frag, and if it is more we only
> pull everything up to the end of the TCP header.  In either case if we
> are merging TCP flows we should be able to drop one page or the other
> along with the sk_buff giving us a total truesize addition after merge
> of ~1K for less than 256 bytes or 2K for a full sized frame.
> 
> I'll try to take a look at this today as it is in our interest to have
> TCP performing as well as possible on ixgbe.

With current driver, a MTU=1500 frame uses :

sk_buff (256 bytes)
skb->head : 1024 bytes  (or more exaclty now : 512 + 384)
one fragment of 2048 bytes

At skb free time,  one kfree(sk_buff) and two put_page().

After this patch :

sk_buff (256 bytes)
skb->head : 2048 bytes 

At skb free time, one kfree(sk_buff) and only one put_page().

Note that my patch doesnt change the 256 bytes threshold: Small frames
wont have one fragment and their use is :

sk_buff (256 bytes)
skb->head : 512 + 384 bytes 

^ permalink raw reply

* Re: WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
From: Eric Dumazet @ 2012-05-23 16:36 UTC (permalink / raw)
  To: Sergio Correia; +Cc: netdev
In-Reply-To: <CAJyhjX02wjsVE2pM26a+7xTrAJwvLZo0UL-W6h8zMKEiC5qTug@mail.gmail.com>

On Wed, 2012-05-23 at 12:56 -0300, Sergio Correia wrote:
> On Wed, May 23, 2012 at 12:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2012-05-22 at 11:47 -0300, Sergio Correia wrote:
> >> Hi Eric,
> > ...
> >> Yes, it's an Atheros AR9285 adapter.
> >> This morning I did a make mrproper before rebuilding the kernel
> >> (should I always do that?), but the warning has just appeared again.
> >
> > OK, I am taking a look at this problem, thanks.
> >
> 
> Thanks. Let me know if you need additional info. As of now, my dmesg
> basically shows only those warnings.

I believe I found the bug and am testing a fix right now.

By the way, we might have the same problem in tcp collapses.

TCP coalescing (introduced in linux-3.5) triggers the problem faster.

Please test following patch :

 net/ipv4/tcp_input.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cfa2aa1..b224eb8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4555,6 +4555,11 @@ static bool tcp_try_coalesce(struct sock *sk,
 
 	if (tcp_hdr(from)->fin)
 		return false;
+
+	/* Its possible this segment overlaps with prior segment in queue */
+	if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq)
+		return false;
+
 	if (!skb_try_coalesce(to, from, fragstolen, &delta))
 		return false;
 

^ permalink raw reply related

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-23 16:39 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <1337789530.3361.2992.camel@edumazet-glaptop>

On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:

> With current driver, a MTU=1500 frame uses :
> 
> sk_buff (256 bytes)
> skb->head : 1024 bytes  (or more exaclty now : 512 + 384)

By the way, NET_SKB_PAD adds 64 bytes so its 64 + 512 + 384 = 960

^ permalink raw reply

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Alexander Duyck @ 2012-05-23 16:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev
In-Reply-To: <1337789530.3361.2992.camel@edumazet-glaptop>

On 05/23/2012 09:12 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 09:04 -0700, Alexander Duyck wrote:
>> On 05/23/2012 05:09 AM, Eric Dumazet wrote:
>>> On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:
>>>
>>>> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
>>>> bytes overhead per skb, it should go away.
>>> Here is the patch for ixgbe, for reference.
>> I'm confused as to what this is trying to accomplish.
>>
>> Currently the way the ixgbe driver works is that we allocate the
>> skb->head using netdev_alloc_skb, which after your recent changes should
>> be using a head frag.  If the buffer is less than 256 bytes we have
>> pushed the entire buffer into the head frag, and if it is more we only
>> pull everything up to the end of the TCP header.  In either case if we
>> are merging TCP flows we should be able to drop one page or the other
>> along with the sk_buff giving us a total truesize addition after merge
>> of ~1K for less than 256 bytes or 2K for a full sized frame.
>>
>> I'll try to take a look at this today as it is in our interest to have
>> TCP performing as well as possible on ixgbe.
> With current driver, a MTU=1500 frame uses :
>
> sk_buff (256 bytes)
> skb->head : 1024 bytes  (or more exaclty now : 512 + 384)
> one fragment of 2048 bytes
>
> At skb free time,  one kfree(sk_buff) and two put_page().
>
> After this patch :
>
> sk_buff (256 bytes)
> skb->head : 2048 bytes 
>
> At skb free time, one kfree(sk_buff) and only one put_page().
>
> Note that my patch doesnt change the 256 bytes threshold: Small frames
> wont have one fragment and their use is :
>
> sk_buff (256 bytes)
> skb->head : 512 + 384 bytes 
>
>
Right, but the problem is that in order to make this work the we are
dropping the padding for head and hoping to have room for shared info. 
This is going to kill performance for things like routing workloads
since the entire head is going to have to be copied over to make space
for NET_SKB_PAD.  Also this assumes no RSC being enabled.  RSC is
normally enabled by default.  If it is turned on we are going to start
receiving full 2K buffers which will cause even more issues since there
wouldn't be any room for shared info in the 2K frame.

The way the driver is currently written probably provides the optimal
setup for truesize given the circumstances.  In order to support
receiving at least 1 full 1500 byte frame per fragment, and supporting
RSC I have to support receiving up to 2K of data.  If we try to make it
all part of one paged receive we would then have to either reduce the
receive buffer size to 1K in hardware and span multiple fragments for a
1.5K frame or allocate a 3K buffer so we would have room to add
NET_SKB_PAD and the shared info on the end.  At which point we are back
to the extra 1K again, only in that case we cannot trim it off later via
skb_try_coalesce.  In the 3K buffer case we would be over a 1/2 page
which means we can only get one buffer per page instead of 2 in which
case we might as well just round it up to 4K and be honest.

The reason I am confused is that I thought the skb_try_coalesce function
was supposed to be what addressed these types of issues.  If these
packets go through that function they should be stripping the sk_buff
and possibly even the skb->head if we used the fragment since the only
thing that is going to end up in the head would be the TCP header which
should have been pulled prior to trying to coalesce.

I will need to investigate this further to understand what is going on. 
I realize that dealing with 3K of memory for buffer storage is not
ideal, but all of the alternatives lean more toward 4K when fully
implemented.  I'll try and see what alternative solutions we might have
available.

Thanks,

Alex

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox