Netdev List
* Re: [PATCH 15/25] netfilter: nf_nat: support IPv6 in SIP NAT helper
From: Eric W. Biederman @ 2012-09-04 23:14 UTC
  To: pablo; +Cc: netfilter-devel, davem, netdev
In-Reply-To: <1346716452-3080-16-git-send-email-pablo@netfilter.org>

pablo@netfilter.org writes:

> From: Patrick McHardy <kaber@trash.net>
>
> Add IPv6 support to the SIP NAT helper. There are no functional differences
> to IPv4 NAT, just different formats for addresses.

Am I missing something here?  It looks like you are implementing port
translation for ipv6.

Simple address translation I can understand.  Especially when it
conforms to RFC6296 and doesn't need to look beyond the addresses and
doesn't even need to recompute checksums.
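
For reference, a minimal sketch of the checksum-neutral mapping described in RFC6296, assuming /48 prefixes. The function names (oc_add, oc_sum, nptv6_translate) are hypothetical and are not taken from this patch or from any kernel API. The point is only that the one's complement sum of the address words is preserved, so the TCP/UDP pseudo-header checksum stays valid without looking past the addresses.

```c
#include <stdint.h>

/* Add two 16-bit words in one's complement arithmetic (end-around carry). */
static uint16_t oc_add(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	return (uint16_t)((s & 0xffff) + (s >> 16));
}

/* One's complement sum of n 16-bit words. */
static uint16_t oc_sum(const uint16_t *w, int n)
{
	uint16_t s = 0;

	for (int i = 0; i < n; i++)
		s = oc_add(s, w[i]);
	return s;
}

/*
 * Hypothetical helper: replace the /48 prefix of addr (words 0..2) and
 * fold the difference between the two prefix sums into word 3, so the
 * sum over all eight words does not change.
 */
static void nptv6_translate(uint16_t addr[8], const uint16_t from[3],
			    const uint16_t to[3])
{
	/* adjustment = sum(from) - sum(to), in one's complement */
	uint16_t adj = oc_add(oc_sum(from, 3), (uint16_t)~oc_sum(to, 3));

	addr[0] = to[0];
	addr[1] = to[1];
	addr[2] = to[2];
	addr[3] = oc_add(addr[3], adj);
	if (addr[3] == 0xffff)	/* RFC6296: 0xFFFF is written as zero */
		addr[3] = 0;
}
```

With the example prefixes from RFC6296 (FD01:0203:0405::/48 to 2001:0DB8:0001::/48), the address FD01:0203:0405:0001::1234 maps to 2001:0DB8:0001:D550::1234, and no port or payload is touched.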

I can understand having a policy that explicitly mangles various
aspects of a packet as it comes through.  Sometimes people have weird
local policies and need to do weird and peculiar things.

I can understand connection tracking, and have no problems with
connection tracking as it does not get in the way of protocol evolution.

However this looks like full automatic address and port translation like
we have with ipv4.


I can't understand implementing port translation for ipv6.  We don't
have a shortage of addresses, so there is no need to share addresses.


There has been a lot of work done with protocol design and figuring
out how to work through NAT boxes.  One of the most complete
descriptions is in RFC5245 ICE.

Frequently protocol designers suggest that ALGs like this not be used
because they get in the way of protocol enhancements like ICE that
provide more general ways to work through the challenges.

RFC5245 can handle any kind of common NAT except both sides doing port
translation.  As long as one side can predict the port on the other side
of the NAT device ICE can successfully establish a connection.

Please tell me I am missing something and you are not generalizing code
that breaks protocols without any known work around?

Given that RFC5245 is strongly suggested if not required for ipv6 sip
support I really fail to see why generalizing the sip nat code from ipv4
to ipv6 makes a bit of sense.

RFC6314 "NAT Traversal Practices for Client-Server SIP" may be
interesting for more background on what people are implementing.

As a statement of how protocol designers feel about ALGs, section
18.6 of RFC5245 is interesting:

   ICE works best through ALGs when the signaling is run over TLS.  This
   prevents the ALG from manipulating the SDP messages and interfering
   with ICE operation.  Implementations that are expected to be deployed
   behind ALGs SHOULD provide for TLS transport of the SDP.
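
To make "manipulating the SDP messages" concrete, here is a rough userspace sketch of the kind of rewrite an ALG performs on the connection line. It is not the kernel code from the quoted patch. The helper name sdp_rewrite_conn6 is hypothetical, and it assumes the body is a NUL-terminated string with a "c=IN IP6 " line.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy sdp into out, replacing the address on the
 * first "c=IN IP6 " line with new_addr.  Returns the new body length,
 * or -1 if there is no such line or out is too small.
 */
static int sdp_rewrite_conn6(const char *sdp, const char *new_addr,
			     char *out, size_t outlen)
{
	static const char key[] = "c=IN IP6 ";
	const char *line = sdp;
	const char *addr, *end;
	int n;

	/* Walk line by line until one starts with the key. */
	while (line && strncmp(line, key, sizeof(key) - 1) != 0) {
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	if (!line)
		return -1;

	addr = line + sizeof(key) - 1;
	end = addr + strcspn(addr, "\r\n");	/* end of the old address */

	/* Everything before the address, the new address, then the rest. */
	n = snprintf(out, outlen, "%.*s%s%s",
		     (int)(addr - sdp), sdp, new_addr, end);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return n;
}
```

Because the textual address can change length, the body length changes too, which is why the quoted helper also has to rewrite Content-Length (mangle_content_len) and, for TCP, adjust sequence numbers. None of this is possible when the signaling runs over TLS.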


So I am trying to understand how IPv6 ALGs make sense in general, as
IPv6 prefix translation is checksum neutral so they are in general
unneeded.  And how IPv6 ALGs for sip make sense as the protocol
maintainers report that best results are had without ALGs.
My own experiments confirm that ALGs are not needed for SIP if
you have ICE implementations on both sides.

What am I missing?

Eric


> Signed-off-by: Patrick McHardy <kaber@trash.net>
> ---
>  include/linux/netfilter/nf_conntrack_sip.h |    9 +-
>  net/ipv4/netfilter/Kconfig                 |    5 -
>  net/ipv4/netfilter/Makefile                |    1 -
>  net/ipv4/netfilter/nf_nat_sip.c            |  580 --------------------------
>  net/netfilter/Kconfig                      |    5 +
>  net/netfilter/Makefile                     |    1 +
>  net/netfilter/nf_conntrack_sip.c           |   68 ++--
>  net/netfilter/nf_nat_sip.c                 |  609 ++++++++++++++++++++++++++++
>  8 files changed, 653 insertions(+), 625 deletions(-)
>  delete mode 100644 net/ipv4/netfilter/nf_nat_sip.c
>  create mode 100644 net/netfilter/nf_nat_sip.c
>
> diff --git a/include/linux/netfilter/nf_conntrack_sip.h b/include/linux/netfilter/nf_conntrack_sip.h
> index 1afc669..387bdd0 100644
> --- a/include/linux/netfilter/nf_conntrack_sip.h
> +++ b/include/linux/netfilter/nf_conntrack_sip.h
> @@ -99,10 +99,8 @@ enum sip_header_types {
>  enum sdp_header_types {
>  	SDP_HDR_UNSPEC,
>  	SDP_HDR_VERSION,
> -	SDP_HDR_OWNER_IP4,
> -	SDP_HDR_CONNECTION_IP4,
> -	SDP_HDR_OWNER_IP6,
> -	SDP_HDR_CONNECTION_IP6,
> +	SDP_HDR_OWNER,
> +	SDP_HDR_CONNECTION,
>  	SDP_HDR_MEDIA,
>  };
>  
> @@ -111,7 +109,8 @@ extern unsigned int (*nf_nat_sip_hook)(struct sk_buff *skb,
>  				       unsigned int dataoff,
>  				       const char **dptr,
>  				       unsigned int *datalen);
> -extern void (*nf_nat_sip_seq_adjust_hook)(struct sk_buff *skb, s16 off);
> +extern void (*nf_nat_sip_seq_adjust_hook)(struct sk_buff *skb,
> +					  unsigned int protoff, s16 off);
>  extern unsigned int (*nf_nat_sip_expect_hook)(struct sk_buff *skb,
>  					      unsigned int protoff,
>  					      unsigned int dataoff,
> diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
> index 52c4a87..30197f8 100644
> --- a/net/ipv4/netfilter/Kconfig
> +++ b/net/ipv4/netfilter/Kconfig
> @@ -242,11 +242,6 @@ config NF_NAT_H323
>  	depends on NF_CONNTRACK && NF_NAT_IPV4
>  	default NF_NAT_IPV4 && NF_CONNTRACK_H323
>  
> -config NF_NAT_SIP
> -	tristate
> -	depends on NF_CONNTRACK && NF_NAT_IPV4
> -	default NF_NAT_IPV4 && NF_CONNTRACK_SIP
> -
>  # mangle + specific targets
>  config IP_NF_MANGLE
>  	tristate "Packet mangling"
> diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile
> index 8baa496..8914abf 100644
> --- a/net/ipv4/netfilter/Makefile
> +++ b/net/ipv4/netfilter/Makefile
> @@ -23,7 +23,6 @@ obj-$(CONFIG_NF_DEFRAG_IPV4) += nf_defrag_ipv4.o
>  obj-$(CONFIG_NF_NAT_H323) += nf_nat_h323.o
>  obj-$(CONFIG_NF_NAT_IRC) += nf_nat_irc.o
>  obj-$(CONFIG_NF_NAT_PPTP) += nf_nat_pptp.o
> -obj-$(CONFIG_NF_NAT_SIP) += nf_nat_sip.o
>  obj-$(CONFIG_NF_NAT_SNMP_BASIC) += nf_nat_snmp_basic.o
>  obj-$(CONFIG_NF_NAT_TFTP) += nf_nat_tftp.o
>  
> diff --git a/net/ipv4/netfilter/nf_nat_sip.c b/net/ipv4/netfilter/nf_nat_sip.c
> deleted file mode 100644
> index 47a4718..0000000
> --- a/net/ipv4/netfilter/nf_nat_sip.c
> +++ /dev/null
> @@ -1,580 +0,0 @@
> -/* SIP extension for NAT alteration.
> - *
> - * (C) 2005 by Christian Hentschel <chentschel@arnet.com.ar>
> - * based on RR's ip_nat_ftp.c and other modules.
> - * (C) 2007 United Security Providers
> - * (C) 2007, 2008 Patrick McHardy <kaber@trash.net>
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - */
> -
> -#include <linux/module.h>
> -#include <linux/skbuff.h>
> -#include <linux/ip.h>
> -#include <net/ip.h>
> -#include <linux/udp.h>
> -#include <linux/tcp.h>
> -
> -#include <net/netfilter/nf_nat.h>
> -#include <net/netfilter/nf_nat_helper.h>
> -#include <net/netfilter/nf_conntrack_helper.h>
> -#include <net/netfilter/nf_conntrack_expect.h>
> -#include <linux/netfilter/nf_conntrack_sip.h>
> -
> -MODULE_LICENSE("GPL");
> -MODULE_AUTHOR("Christian Hentschel <chentschel@arnet.com.ar>");
> -MODULE_DESCRIPTION("SIP NAT helper");
> -MODULE_ALIAS("ip_nat_sip");
> -
> -
> -static unsigned int mangle_packet(struct sk_buff *skb, unsigned int protoff,
> -				  unsigned int dataoff,
> -				  const char **dptr, unsigned int *datalen,
> -				  unsigned int matchoff, unsigned int matchlen,
> -				  const char *buffer, unsigned int buflen)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	struct tcphdr *th;
> -	unsigned int baseoff;
> -
> -	if (nf_ct_protonum(ct) == IPPROTO_TCP) {
> -		th = (struct tcphdr *)(skb->data + ip_hdrlen(skb));
> -		baseoff = ip_hdrlen(skb) + th->doff * 4;
> -		matchoff += dataoff - baseoff;
> -
> -		if (!__nf_nat_mangle_tcp_packet(skb, ct, ctinfo,
> -						protoff, matchoff, matchlen,
> -						buffer, buflen, false))
> -			return 0;
> -	} else {
> -		baseoff = ip_hdrlen(skb) + sizeof(struct udphdr);
> -		matchoff += dataoff - baseoff;
> -
> -		if (!nf_nat_mangle_udp_packet(skb, ct, ctinfo,
> -					      protoff, matchoff, matchlen,
> -					      buffer, buflen))
> -			return 0;
> -	}
> -
> -	/* Reload data pointer and adjust datalen value */
> -	*dptr = skb->data + dataoff;
> -	*datalen += buflen - matchlen;
> -	return 1;
> -}
> -
> -static int map_addr(struct sk_buff *skb, unsigned int protoff,
> -		    unsigned int dataoff,
> -		    const char **dptr, unsigned int *datalen,
> -		    unsigned int matchoff, unsigned int matchlen,
> -		    union nf_inet_addr *addr, __be16 port)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> -	char buffer[sizeof("nnn.nnn.nnn.nnn:nnnnn")];
> -	unsigned int buflen;
> -	__be32 newaddr;
> -	__be16 newport;
> -
> -	if (ct->tuplehash[dir].tuple.src.u3.ip == addr->ip &&
> -	    ct->tuplehash[dir].tuple.src.u.udp.port == port) {
> -		newaddr = ct->tuplehash[!dir].tuple.dst.u3.ip;
> -		newport = ct->tuplehash[!dir].tuple.dst.u.udp.port;
> -	} else if (ct->tuplehash[dir].tuple.dst.u3.ip == addr->ip &&
> -		   ct->tuplehash[dir].tuple.dst.u.udp.port == port) {
> -		newaddr = ct->tuplehash[!dir].tuple.src.u3.ip;
> -		newport = ct->tuplehash[!dir].tuple.src.u.udp.port;
> -	} else
> -		return 1;
> -
> -	if (newaddr == addr->ip && newport == port)
> -		return 1;
> -
> -	buflen = sprintf(buffer, "%pI4:%u", &newaddr, ntohs(newport));
> -
> -	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -			     matchoff, matchlen, buffer, buflen);
> -}
> -
> -static int map_sip_addr(struct sk_buff *skb, unsigned int protoff,
> -			unsigned int dataoff,
> -			const char **dptr, unsigned int *datalen,
> -			enum sip_header_types type)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	unsigned int matchlen, matchoff;
> -	union nf_inet_addr addr;
> -	__be16 port;
> -
> -	if (ct_sip_parse_header_uri(ct, *dptr, NULL, *datalen, type, NULL,
> -				    &matchoff, &matchlen, &addr, &port) <= 0)
> -		return 1;
> -	return map_addr(skb, protoff, dataoff, dptr, datalen,
> -			matchoff, matchlen, &addr, port);
> -}
> -
> -static unsigned int ip_nat_sip(struct sk_buff *skb, unsigned int protoff,
> -			       unsigned int dataoff,
> -			       const char **dptr, unsigned int *datalen)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> -	unsigned int coff, matchoff, matchlen;
> -	enum sip_header_types hdr;
> -	union nf_inet_addr addr;
> -	__be16 port;
> -	int request, in_header;
> -
> -	/* Basic rules: requests and responses. */
> -	if (strnicmp(*dptr, "SIP/2.0", strlen("SIP/2.0")) != 0) {
> -		if (ct_sip_parse_request(ct, *dptr, *datalen,
> -					 &matchoff, &matchlen,
> -					 &addr, &port) > 0 &&
> -		    !map_addr(skb, protoff, dataoff, dptr, datalen,
> -			      matchoff, matchlen, &addr, port))
> -			return NF_DROP;
> -		request = 1;
> -	} else
> -		request = 0;
> -
> -	if (nf_ct_protonum(ct) == IPPROTO_TCP)
> -		hdr = SIP_HDR_VIA_TCP;
> -	else
> -		hdr = SIP_HDR_VIA_UDP;
> -
> -	/* Translate topmost Via header and parameters */
> -	if (ct_sip_parse_header_uri(ct, *dptr, NULL, *datalen,
> -				    hdr, NULL, &matchoff, &matchlen,
> -				    &addr, &port) > 0) {
> -		unsigned int olen, matchend, poff, plen, buflen, n;
> -		char buffer[sizeof("nnn.nnn.nnn.nnn:nnnnn")];
> -
> -		/* We're only interested in headers related to this
> -		 * connection */
> -		if (request) {
> -			if (addr.ip != ct->tuplehash[dir].tuple.src.u3.ip ||
> -			    port != ct->tuplehash[dir].tuple.src.u.udp.port)
> -				goto next;
> -		} else {
> -			if (addr.ip != ct->tuplehash[dir].tuple.dst.u3.ip ||
> -			    port != ct->tuplehash[dir].tuple.dst.u.udp.port)
> -				goto next;
> -		}
> -
> -		olen = *datalen;
> -		if (!map_addr(skb, protoff, dataoff, dptr, datalen,
> -			      matchoff, matchlen, &addr, port))
> -			return NF_DROP;
> -
> -		matchend = matchoff + matchlen + *datalen - olen;
> -
> -		/* The maddr= parameter (RFC 2361) specifies where to send
> -		 * the reply. */
> -		if (ct_sip_parse_address_param(ct, *dptr, matchend, *datalen,
> -					       "maddr=", &poff, &plen,
> -					       &addr, true) > 0 &&
> -		    addr.ip == ct->tuplehash[dir].tuple.src.u3.ip &&
> -		    addr.ip != ct->tuplehash[!dir].tuple.dst.u3.ip) {
> -			buflen = sprintf(buffer, "%pI4",
> -					&ct->tuplehash[!dir].tuple.dst.u3.ip);
> -			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -					   poff, plen, buffer, buflen))
> -				return NF_DROP;
> -		}
> -
> -		/* The received= parameter (RFC 2361) contains the address
> -		 * from which the server received the request. */
> -		if (ct_sip_parse_address_param(ct, *dptr, matchend, *datalen,
> -					       "received=", &poff, &plen,
> -					       &addr, false) > 0 &&
> -		    addr.ip == ct->tuplehash[dir].tuple.dst.u3.ip &&
> -		    addr.ip != ct->tuplehash[!dir].tuple.src.u3.ip) {
> -			buflen = sprintf(buffer, "%pI4",
> -					&ct->tuplehash[!dir].tuple.src.u3.ip);
> -			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -					   poff, plen, buffer, buflen))
> -				return NF_DROP;
> -		}
> -
> -		/* The rport= parameter (RFC 3581) contains the port number
> -		 * from which the server received the request. */
> -		if (ct_sip_parse_numerical_param(ct, *dptr, matchend, *datalen,
> -						 "rport=", &poff, &plen,
> -						 &n) > 0 &&
> -		    htons(n) == ct->tuplehash[dir].tuple.dst.u.udp.port &&
> -		    htons(n) != ct->tuplehash[!dir].tuple.src.u.udp.port) {
> -			__be16 p = ct->tuplehash[!dir].tuple.src.u.udp.port;
> -			buflen = sprintf(buffer, "%u", ntohs(p));
> -			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -					   poff, plen, buffer, buflen))
> -				return NF_DROP;
> -		}
> -	}
> -
> -next:
> -	/* Translate Contact headers */
> -	coff = 0;
> -	in_header = 0;
> -	while (ct_sip_parse_header_uri(ct, *dptr, &coff, *datalen,
> -				       SIP_HDR_CONTACT, &in_header,
> -				       &matchoff, &matchlen,
> -				       &addr, &port) > 0) {
> -		if (!map_addr(skb, protoff, dataoff, dptr, datalen,
> -			      matchoff, matchlen,
> -			      &addr, port))
> -			return NF_DROP;
> -	}
> -
> -	if (!map_sip_addr(skb, protoff, dataoff, dptr, datalen, SIP_HDR_FROM) ||
> -	    !map_sip_addr(skb, protoff, dataoff, dptr, datalen, SIP_HDR_TO))
> -		return NF_DROP;
> -
> -	return NF_ACCEPT;
> -}
> -
> -static void ip_nat_sip_seq_adjust(struct sk_buff *skb, s16 off)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	const struct tcphdr *th;
> -
> -	if (nf_ct_protonum(ct) != IPPROTO_TCP || off == 0)
> -		return;
> -
> -	th = (struct tcphdr *)(skb->data + ip_hdrlen(skb));
> -	nf_nat_set_seq_adjust(ct, ctinfo, th->seq, off);
> -}
> -
> -/* Handles expected signalling connections and media streams */
> -static void ip_nat_sip_expected(struct nf_conn *ct,
> -				struct nf_conntrack_expect *exp)
> -{
> -	struct nf_nat_range range;
> -
> -	/* This must be a fresh one. */
> -	BUG_ON(ct->status & IPS_NAT_DONE_MASK);
> -
> -	/* For DST manip, map port here to where it's expected. */
> -	range.flags = (NF_NAT_RANGE_MAP_IPS | NF_NAT_RANGE_PROTO_SPECIFIED);
> -	range.min_proto = range.max_proto = exp->saved_proto;
> -	range.min_addr = range.max_addr = exp->saved_addr;
> -	nf_nat_setup_info(ct, &range, NF_NAT_MANIP_DST);
> -
> -	/* Change src to where master sends to, but only if the connection
> -	 * actually came from the same source. */
> -	if (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3.ip ==
> -	    ct->master->tuplehash[exp->dir].tuple.src.u3.ip) {
> -		range.flags = NF_NAT_RANGE_MAP_IPS;
> -		range.min_addr = range.max_addr
> -			= ct->master->tuplehash[!exp->dir].tuple.dst.u3;
> -		nf_nat_setup_info(ct, &range, NF_NAT_MANIP_SRC);
> -	}
> -}
> -
> -static unsigned int ip_nat_sip_expect(struct sk_buff *skb, unsigned int protoff,
> -				      unsigned int dataoff,
> -				      const char **dptr, unsigned int *datalen,
> -				      struct nf_conntrack_expect *exp,
> -				      unsigned int matchoff,
> -				      unsigned int matchlen)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> -	__be32 newip;
> -	u_int16_t port;
> -	char buffer[sizeof("nnn.nnn.nnn.nnn:nnnnn")];
> -	unsigned int buflen;
> -
> -	/* Connection will come from reply */
> -	if (ct->tuplehash[dir].tuple.src.u3.ip == ct->tuplehash[!dir].tuple.dst.u3.ip)
> -		newip = exp->tuple.dst.u3.ip;
> -	else
> -		newip = ct->tuplehash[!dir].tuple.dst.u3.ip;
> -
> -	/* If the signalling port matches the connection's source port in the
> -	 * original direction, try to use the destination port in the opposite
> -	 * direction. */
> -	if (exp->tuple.dst.u.udp.port ==
> -	    ct->tuplehash[dir].tuple.src.u.udp.port)
> -		port = ntohs(ct->tuplehash[!dir].tuple.dst.u.udp.port);
> -	else
> -		port = ntohs(exp->tuple.dst.u.udp.port);
> -
> -	exp->saved_addr = exp->tuple.dst.u3;
> -	exp->tuple.dst.u3.ip = newip;
> -	exp->saved_proto.udp.port = exp->tuple.dst.u.udp.port;
> -	exp->dir = !dir;
> -	exp->expectfn = ip_nat_sip_expected;
> -
> -	for (; port != 0; port++) {
> -		int ret;
> -
> -		exp->tuple.dst.u.udp.port = htons(port);
> -		ret = nf_ct_expect_related(exp);
> -		if (ret == 0)
> -			break;
> -		else if (ret != -EBUSY) {
> -			port = 0;
> -			break;
> -		}
> -	}
> -
> -	if (port == 0)
> -		return NF_DROP;
> -
> -	if (exp->tuple.dst.u3.ip != exp->saved_addr.ip ||
> -	    exp->tuple.dst.u.udp.port != exp->saved_proto.udp.port) {
> -		buflen = sprintf(buffer, "%pI4:%u", &newip, port);
> -		if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -				   matchoff, matchlen, buffer, buflen))
> -			goto err;
> -	}
> -	return NF_ACCEPT;
> -
> -err:
> -	nf_ct_unexpect_related(exp);
> -	return NF_DROP;
> -}
> -
> -static int mangle_content_len(struct sk_buff *skb, unsigned int protoff,
> -			      unsigned int dataoff,
> -			      const char **dptr, unsigned int *datalen)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	unsigned int matchoff, matchlen;
> -	char buffer[sizeof("65536")];
> -	int buflen, c_len;
> -
> -	/* Get actual SDP length */
> -	if (ct_sip_get_sdp_header(ct, *dptr, 0, *datalen,
> -				  SDP_HDR_VERSION, SDP_HDR_UNSPEC,
> -				  &matchoff, &matchlen) <= 0)
> -		return 0;
> -	c_len = *datalen - matchoff + strlen("v=");
> -
> -	/* Now, update SDP length */
> -	if (ct_sip_get_header(ct, *dptr, 0, *datalen, SIP_HDR_CONTENT_LENGTH,
> -			      &matchoff, &matchlen) <= 0)
> -		return 0;
> -
> -	buflen = sprintf(buffer, "%u", c_len);
> -	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -			     matchoff, matchlen, buffer, buflen);
> -}
> -
> -static int mangle_sdp_packet(struct sk_buff *skb, unsigned int protoff,
> -			     unsigned int dataoff,
> -			     const char **dptr, unsigned int *datalen,
> -			     unsigned int sdpoff,
> -			     enum sdp_header_types type,
> -			     enum sdp_header_types term,
> -			     char *buffer, int buflen)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	unsigned int matchlen, matchoff;
> -
> -	if (ct_sip_get_sdp_header(ct, *dptr, sdpoff, *datalen, type, term,
> -				  &matchoff, &matchlen) <= 0)
> -		return -ENOENT;
> -	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -			     matchoff, matchlen, buffer, buflen) ? 0 : -EINVAL;
> -}
> -
> -static unsigned int ip_nat_sdp_addr(struct sk_buff *skb, unsigned int protoff,
> -				    unsigned int dataoff,
> -				    const char **dptr, unsigned int *datalen,
> -				    unsigned int sdpoff,
> -				    enum sdp_header_types type,
> -				    enum sdp_header_types term,
> -				    const union nf_inet_addr *addr)
> -{
> -	char buffer[sizeof("nnn.nnn.nnn.nnn")];
> -	unsigned int buflen;
> -
> -	buflen = sprintf(buffer, "%pI4", &addr->ip);
> -	if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen,
> -			      sdpoff, type, term, buffer, buflen))
> -		return 0;
> -
> -	return mangle_content_len(skb, protoff, dataoff, dptr, datalen);
> -}
> -
> -static unsigned int ip_nat_sdp_port(struct sk_buff *skb, unsigned int protoff,
> -				    unsigned int dataoff,
> -				    const char **dptr, unsigned int *datalen,
> -				    unsigned int matchoff,
> -				    unsigned int matchlen,
> -				    u_int16_t port)
> -{
> -	char buffer[sizeof("nnnnn")];
> -	unsigned int buflen;
> -
> -	buflen = sprintf(buffer, "%u", port);
> -	if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> -			   matchoff, matchlen, buffer, buflen))
> -		return 0;
> -
> -	return mangle_content_len(skb, protoff, dataoff, dptr, datalen);
> -}
> -
> -static unsigned int ip_nat_sdp_session(struct sk_buff *skb, unsigned int protoff,
> -				       unsigned int dataoff,
> -				       const char **dptr, unsigned int *datalen,
> -				       unsigned int sdpoff,
> -				       const union nf_inet_addr *addr)
> -{
> -	char buffer[sizeof("nnn.nnn.nnn.nnn")];
> -	unsigned int buflen;
> -
> -	/* Mangle session description owner and contact addresses */
> -	buflen = sprintf(buffer, "%pI4", &addr->ip);
> -	if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff,
> -			       SDP_HDR_OWNER_IP4, SDP_HDR_MEDIA,
> -			       buffer, buflen))
> -		return 0;
> -
> -	switch (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff,
> -				  SDP_HDR_CONNECTION_IP4, SDP_HDR_MEDIA,
> -				  buffer, buflen)) {
> -	case 0:
> -	/*
> -	 * RFC 2327:
> -	 *
> -	 * Session description
> -	 *
> -	 * c=* (connection information - not required if included in all media)
> -	 */
> -	case -ENOENT:
> -		break;
> -	default:
> -		return 0;
> -	}
> -
> -	return mangle_content_len(skb, protoff, dataoff, dptr, datalen);
> -}
> -
> -/* So, this packet has hit the connection tracking matching code.
> -   Mangle it, and change the expectation to match the new version. */
> -static unsigned int ip_nat_sdp_media(struct sk_buff *skb, unsigned int protoff,
> -				     unsigned int dataoff,
> -				     const char **dptr, unsigned int *datalen,
> -				     struct nf_conntrack_expect *rtp_exp,
> -				     struct nf_conntrack_expect *rtcp_exp,
> -				     unsigned int mediaoff,
> -				     unsigned int medialen,
> -				     union nf_inet_addr *rtp_addr)
> -{
> -	enum ip_conntrack_info ctinfo;
> -	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> -	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> -	u_int16_t port;
> -
> -	/* Connection will come from reply */
> -	if (ct->tuplehash[dir].tuple.src.u3.ip ==
> -	    ct->tuplehash[!dir].tuple.dst.u3.ip)
> -		rtp_addr->ip = rtp_exp->tuple.dst.u3.ip;
> -	else
> -		rtp_addr->ip = ct->tuplehash[!dir].tuple.dst.u3.ip;
> -
> -	rtp_exp->saved_addr = rtp_exp->tuple.dst.u3;
> -	rtp_exp->tuple.dst.u3.ip = rtp_addr->ip;
> -	rtp_exp->saved_proto.udp.port = rtp_exp->tuple.dst.u.udp.port;
> -	rtp_exp->dir = !dir;
> -	rtp_exp->expectfn = ip_nat_sip_expected;
> -
> -	rtcp_exp->saved_addr = rtcp_exp->tuple.dst.u3;
> -	rtcp_exp->tuple.dst.u3.ip = rtp_addr->ip;
> -	rtcp_exp->saved_proto.udp.port = rtcp_exp->tuple.dst.u.udp.port;
> -	rtcp_exp->dir = !dir;
> -	rtcp_exp->expectfn = ip_nat_sip_expected;
> -
> -	/* Try to get same pair of ports: if not, try to change them. */
> -	for (port = ntohs(rtp_exp->tuple.dst.u.udp.port);
> -	     port != 0; port += 2) {
> -		int ret;
> -
> -		rtp_exp->tuple.dst.u.udp.port = htons(port);
> -		ret = nf_ct_expect_related(rtp_exp);
> -		if (ret == -EBUSY)
> -			continue;
> -		else if (ret < 0) {
> -			port = 0;
> -			break;
> -		}
> -		rtcp_exp->tuple.dst.u.udp.port = htons(port + 1);
> -		ret = nf_ct_expect_related(rtcp_exp);
> -		if (ret == 0)
> -			break;
> -		else if (ret != -EBUSY) {
> -			nf_ct_unexpect_related(rtp_exp);
> -			port = 0;
> -			break;
> -		}
> -	}
> -
> -	if (port == 0)
> -		goto err1;
> -
> -	/* Update media port. */
> -	if (rtp_exp->tuple.dst.u.udp.port != rtp_exp->saved_proto.udp.port &&
> -	    !ip_nat_sdp_port(skb, protoff, dataoff, dptr, datalen,
> -			     mediaoff, medialen, port))
> -		goto err2;
> -
> -	return NF_ACCEPT;
> -
> -err2:
> -	nf_ct_unexpect_related(rtp_exp);
> -	nf_ct_unexpect_related(rtcp_exp);
> -err1:
> -	return NF_DROP;
> -}
> -
> -static struct nf_ct_helper_expectfn sip_nat = {
> -        .name           = "sip",
> -        .expectfn       = ip_nat_sip_expected,
> -};
> -
> -static void __exit nf_nat_sip_fini(void)
> -{
> -	RCU_INIT_POINTER(nf_nat_sip_hook, NULL);
> -	RCU_INIT_POINTER(nf_nat_sip_seq_adjust_hook, NULL);
> -	RCU_INIT_POINTER(nf_nat_sip_expect_hook, NULL);
> -	RCU_INIT_POINTER(nf_nat_sdp_addr_hook, NULL);
> -	RCU_INIT_POINTER(nf_nat_sdp_port_hook, NULL);
> -	RCU_INIT_POINTER(nf_nat_sdp_session_hook, NULL);
> -	RCU_INIT_POINTER(nf_nat_sdp_media_hook, NULL);
> -	nf_ct_helper_expectfn_unregister(&sip_nat);
> -	synchronize_rcu();
> -}
> -
> -static int __init nf_nat_sip_init(void)
> -{
> -	BUG_ON(nf_nat_sip_hook != NULL);
> -	BUG_ON(nf_nat_sip_seq_adjust_hook != NULL);
> -	BUG_ON(nf_nat_sip_expect_hook != NULL);
> -	BUG_ON(nf_nat_sdp_addr_hook != NULL);
> -	BUG_ON(nf_nat_sdp_port_hook != NULL);
> -	BUG_ON(nf_nat_sdp_session_hook != NULL);
> -	BUG_ON(nf_nat_sdp_media_hook != NULL);
> -	RCU_INIT_POINTER(nf_nat_sip_hook, ip_nat_sip);
> -	RCU_INIT_POINTER(nf_nat_sip_seq_adjust_hook, ip_nat_sip_seq_adjust);
> -	RCU_INIT_POINTER(nf_nat_sip_expect_hook, ip_nat_sip_expect);
> -	RCU_INIT_POINTER(nf_nat_sdp_addr_hook, ip_nat_sdp_addr);
> -	RCU_INIT_POINTER(nf_nat_sdp_port_hook, ip_nat_sdp_port);
> -	RCU_INIT_POINTER(nf_nat_sdp_session_hook, ip_nat_sdp_session);
> -	RCU_INIT_POINTER(nf_nat_sdp_media_hook, ip_nat_sdp_media);
> -	nf_ct_helper_expectfn_register(&sip_nat);
> -	return 0;
> -}
> -
> -module_init(nf_nat_sip_init);
> -module_exit(nf_nat_sip_fini);
> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
> index 2eee9f1..bf3e464 100644
> --- a/net/netfilter/Kconfig
> +++ b/net/netfilter/Kconfig
> @@ -390,6 +390,11 @@ config NF_NAT_FTP
>  	depends on NF_CONNTRACK && NF_NAT
>  	default NF_NAT && NF_CONNTRACK_FTP
>  
> +config NF_NAT_SIP
> +	tristate
> +	depends on NF_CONNTRACK && NF_NAT
> +	default NF_NAT && NF_CONNTRACK_SIP
> +
>  endif # NF_CONNTRACK
>  
>  # transparent proxy support
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index 7d6e1ea..7d6d1a0 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -57,6 +57,7 @@ obj-$(CONFIG_NF_NAT_PROTO_SCTP) += nf_nat_proto_sctp.o
>  # NAT helpers
>  obj-$(CONFIG_NF_NAT_AMANDA) += nf_nat_amanda.o
>  obj-$(CONFIG_NF_NAT_FTP) += nf_nat_ftp.o
> +obj-$(CONFIG_NF_NAT_SIP) += nf_nat_sip.o
>  
>  # transparent proxy support
>  obj-$(CONFIG_NETFILTER_TPROXY) += nf_tproxy_core.o
> diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
> index d517490..df8f4f2 100644
> --- a/net/netfilter/nf_conntrack_sip.c
> +++ b/net/netfilter/nf_conntrack_sip.c
> @@ -57,7 +57,8 @@ unsigned int (*nf_nat_sip_hook)(struct sk_buff *skb, unsigned int protoff,
>  				unsigned int *datalen) __read_mostly;
>  EXPORT_SYMBOL_GPL(nf_nat_sip_hook);
>  
> -void (*nf_nat_sip_seq_adjust_hook)(struct sk_buff *skb, s16 off) __read_mostly;
> +void (*nf_nat_sip_seq_adjust_hook)(struct sk_buff *skb, unsigned int protoff,
> +				   s16 off) __read_mostly;
>  EXPORT_SYMBOL_GPL(nf_nat_sip_seq_adjust_hook);
>  
>  unsigned int (*nf_nat_sip_expect_hook)(struct sk_buff *skb,
> @@ -742,13 +743,18 @@ static int sdp_addr_len(const struct nf_conn *ct, const char *dptr,
>   * be tolerant and also accept records terminated with a single newline
>   * character". We handle both cases.
>   */
> -static const struct sip_header ct_sdp_hdrs[] = {
> -	[SDP_HDR_VERSION]		= SDP_HDR("v=", NULL, digits_len),
> -	[SDP_HDR_OWNER_IP4]		= SDP_HDR("o=", "IN IP4 ", sdp_addr_len),
> -	[SDP_HDR_CONNECTION_IP4]	= SDP_HDR("c=", "IN IP4 ", sdp_addr_len),
> -	[SDP_HDR_OWNER_IP6]		= SDP_HDR("o=", "IN IP6 ", sdp_addr_len),
> -	[SDP_HDR_CONNECTION_IP6]	= SDP_HDR("c=", "IN IP6 ", sdp_addr_len),
> -	[SDP_HDR_MEDIA]			= SDP_HDR("m=", NULL, media_len),
> +static const struct sip_header ct_sdp_hdrs_v4[] = {
> +	[SDP_HDR_VERSION]	= SDP_HDR("v=", NULL, digits_len),
> +	[SDP_HDR_OWNER]		= SDP_HDR("o=", "IN IP4 ", sdp_addr_len),
> +	[SDP_HDR_CONNECTION]	= SDP_HDR("c=", "IN IP4 ", sdp_addr_len),
> +	[SDP_HDR_MEDIA]		= SDP_HDR("m=", NULL, media_len),
> +};
> +
> +static const struct sip_header ct_sdp_hdrs_v6[] = {
> +	[SDP_HDR_VERSION]	= SDP_HDR("v=", NULL, digits_len),
> +	[SDP_HDR_OWNER]		= SDP_HDR("o=", "IN IP6 ", sdp_addr_len),
> +	[SDP_HDR_CONNECTION]	= SDP_HDR("c=", "IN IP6 ", sdp_addr_len),
> +	[SDP_HDR_MEDIA]		= SDP_HDR("m=", NULL, media_len),
>  };
>  
>  /* Linear string search within SDP header values */
> @@ -774,11 +780,14 @@ int ct_sip_get_sdp_header(const struct nf_conn *ct, const char *dptr,
>  			  enum sdp_header_types term,
>  			  unsigned int *matchoff, unsigned int *matchlen)
>  {
> -	const struct sip_header *hdr = &ct_sdp_hdrs[type];
> -	const struct sip_header *thdr = &ct_sdp_hdrs[term];
> +	const struct sip_header *hdrs, *hdr, *thdr;
>  	const char *start = dptr, *limit = dptr + datalen;
>  	int shift = 0;
>  
> +	hdrs = nf_ct_l3num(ct) == NFPROTO_IPV4 ? ct_sdp_hdrs_v4 : ct_sdp_hdrs_v6;
> +	hdr = &hdrs[type];
> +	thdr = &hdrs[term];
> +
>  	for (dptr += dataoff; dptr < limit; dptr++) {
>  		/* Find beginning of line */
>  		if (*dptr != '\r' && *dptr != '\n')
> @@ -945,12 +954,12 @@ static int set_expected_rtp_rtcp(struct sk_buff *skb, unsigned int protoff,
>  		    exp->class != class)
>  			break;
>  #ifdef CONFIG_NF_NAT_NEEDED
> -		if (exp->tuple.src.l3num == AF_INET && !direct_rtp &&
> -		    (exp->saved_addr.ip != exp->tuple.dst.u3.ip ||
> +		if (!direct_rtp &&
> +		    (!nf_inet_addr_cmp(&exp->saved_addr, &exp->tuple.dst.u3) ||
>  		     exp->saved_proto.udp.port != exp->tuple.dst.u.udp.port) &&
>  		    ct->status & IPS_NAT_MASK) {
> -			daddr->ip		= exp->saved_addr.ip;
> -			tuple.dst.u3.ip		= exp->saved_addr.ip;
> +			*daddr			= exp->saved_addr;
> +			tuple.dst.u3		= exp->saved_addr;
>  			tuple.dst.u.udp.port	= exp->saved_proto.udp.port;
>  			direct_rtp = 1;
>  		} else
> @@ -987,8 +996,7 @@ static int set_expected_rtp_rtcp(struct sk_buff *skb, unsigned int protoff,
>  			  IPPROTO_UDP, NULL, &rtcp_port);
>  
>  	nf_nat_sdp_media = rcu_dereference(nf_nat_sdp_media_hook);
> -	if (nf_nat_sdp_media && nf_ct_l3num(ct) == NFPROTO_IPV4 &&
> -	    ct->status & IPS_NAT_MASK && !direct_rtp)
> +	if (nf_nat_sdp_media && ct->status & IPS_NAT_MASK && !direct_rtp)
>  		ret = nf_nat_sdp_media(skb, protoff, dataoff, dptr, datalen,
>  				       rtp_exp, rtcp_exp,
>  				       mediaoff, medialen, daddr);
> @@ -1044,15 +1052,12 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
>  	unsigned int i;
>  	union nf_inet_addr caddr, maddr, rtp_addr;
>  	unsigned int port;
> -	enum sdp_header_types c_hdr;
>  	const struct sdp_media_type *t;
>  	int ret = NF_ACCEPT;
>  	typeof(nf_nat_sdp_addr_hook) nf_nat_sdp_addr;
>  	typeof(nf_nat_sdp_session_hook) nf_nat_sdp_session;
>  
>  	nf_nat_sdp_addr = rcu_dereference(nf_nat_sdp_addr_hook);
> -	c_hdr = nf_ct_l3num(ct) == AF_INET ? SDP_HDR_CONNECTION_IP4 :
> -					     SDP_HDR_CONNECTION_IP6;
>  
>  	/* Find beginning of session description */
>  	if (ct_sip_get_sdp_header(ct, *dptr, 0, *datalen,
> @@ -1066,7 +1071,7 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
>  	 * the end of the session description. */
>  	caddr_len = 0;
>  	if (ct_sip_parse_sdp_addr(ct, *dptr, sdpoff, *datalen,
> -				  c_hdr, SDP_HDR_MEDIA,
> +				  SDP_HDR_CONNECTION, SDP_HDR_MEDIA,
>  				  &matchoff, &matchlen, &caddr) > 0)
>  		caddr_len = matchlen;
>  
> @@ -1096,7 +1101,7 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
>  		/* The media description overrides the session description. */
>  		maddr_len = 0;
>  		if (ct_sip_parse_sdp_addr(ct, *dptr, mediaoff, *datalen,
> -					  c_hdr, SDP_HDR_MEDIA,
> +					  SDP_HDR_CONNECTION, SDP_HDR_MEDIA,
>  					  &matchoff, &matchlen, &maddr) > 0) {
>  			maddr_len = matchlen;
>  			memcpy(&rtp_addr, &maddr, sizeof(rtp_addr));
> @@ -1113,11 +1118,10 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
>  			return ret;
>  
>  		/* Update media connection address if present */
> -		if (maddr_len && nf_nat_sdp_addr &&
> -		    nf_ct_l3num(ct) == NFPROTO_IPV4 && ct->status & IPS_NAT_MASK) {
> +		if (maddr_len && nf_nat_sdp_addr && ct->status & IPS_NAT_MASK) {
>  			ret = nf_nat_sdp_addr(skb, protoff, dataoff,
> -					      dptr, datalen,
> -					      mediaoff, c_hdr, SDP_HDR_MEDIA,
> +					      dptr, datalen, mediaoff,
> +					      SDP_HDR_CONNECTION, SDP_HDR_MEDIA,
>  					      &rtp_addr);
>  			if (ret != NF_ACCEPT)
>  				return ret;
> @@ -1127,8 +1131,7 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
>  
>  	/* Update session connection and owner addresses */
>  	nf_nat_sdp_session = rcu_dereference(nf_nat_sdp_session_hook);
> -	if (nf_nat_sdp_session && nf_ct_l3num(ct) == NFPROTO_IPV4 &&
> -	    ct->status & IPS_NAT_MASK)
> +	if (nf_nat_sdp_session && ct->status & IPS_NAT_MASK)
>  		ret = nf_nat_sdp_session(skb, protoff, dataoff,
>  					 dptr, datalen, sdpoff, &rtp_addr);
>  
> @@ -1293,8 +1296,7 @@ static int process_register_request(struct sk_buff *skb, unsigned int protoff,
>  	exp->flags = NF_CT_EXPECT_PERMANENT | NF_CT_EXPECT_INACTIVE;
>  
>  	nf_nat_sip_expect = rcu_dereference(nf_nat_sip_expect_hook);
> -	if (nf_nat_sip_expect && nf_ct_l3num(ct) == NFPROTO_IPV4 &&
> -	    ct->status & IPS_NAT_MASK)
> +	if (nf_nat_sip_expect && ct->status & IPS_NAT_MASK)
>  		ret = nf_nat_sip_expect(skb, protoff, dataoff, dptr, datalen,
>  					exp, matchoff, matchlen);
>  	else {
> @@ -1476,8 +1478,7 @@ static int process_sip_msg(struct sk_buff *skb, struct nf_conn *ct,
>  	else
>  		ret = process_sip_response(skb, protoff, dataoff, dptr, datalen);
>  
> -	if (ret == NF_ACCEPT && nf_ct_l3num(ct) == NFPROTO_IPV4 &&
> -	    ct->status & IPS_NAT_MASK) {
> +	if (ret == NF_ACCEPT && ct->status & IPS_NAT_MASK) {
>  		nf_nat_sip = rcu_dereference(nf_nat_sip_hook);
>  		if (nf_nat_sip && !nf_nat_sip(skb, protoff, dataoff,
>  					      dptr, datalen))
> @@ -1560,11 +1561,10 @@ static int sip_help_tcp(struct sk_buff *skb, unsigned int protoff,
>  		datalen  = datalen + diff - msglen;
>  	}
>  
> -	if (ret == NF_ACCEPT && nf_ct_l3num(ct) == NFPROTO_IPV4 &&
> -	    ct->status & IPS_NAT_MASK) {
> +	if (ret == NF_ACCEPT && ct->status & IPS_NAT_MASK) {
>  		nf_nat_sip_seq_adjust = rcu_dereference(nf_nat_sip_seq_adjust_hook);
>  		if (nf_nat_sip_seq_adjust)
> -			nf_nat_sip_seq_adjust(skb, tdiff);
> +			nf_nat_sip_seq_adjust(skb, protoff, tdiff);
>  	}
>  
>  	return ret;
> diff --git a/net/netfilter/nf_nat_sip.c b/net/netfilter/nf_nat_sip.c
> new file mode 100644
> index 0000000..f4db3a7
> --- /dev/null
> +++ b/net/netfilter/nf_nat_sip.c
> @@ -0,0 +1,609 @@
> +/* SIP extension for NAT alteration.
> + *
> + * (C) 2005 by Christian Hentschel <chentschel@arnet.com.ar>
> + * based on RR's ip_nat_ftp.c and other modules.
> + * (C) 2007 United Security Providers
> + * (C) 2007, 2008, 2011, 2012 Patrick McHardy <kaber@trash.net>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/inet.h>
> +#include <linux/udp.h>
> +#include <linux/tcp.h>
> +
> +#include <net/netfilter/nf_nat.h>
> +#include <net/netfilter/nf_nat_helper.h>
> +#include <net/netfilter/nf_conntrack_helper.h>
> +#include <net/netfilter/nf_conntrack_expect.h>
> +#include <linux/netfilter/nf_conntrack_sip.h>
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Christian Hentschel <chentschel@arnet.com.ar>");
> +MODULE_DESCRIPTION("SIP NAT helper");
> +MODULE_ALIAS("ip_nat_sip");
> +
> +
> +static unsigned int mangle_packet(struct sk_buff *skb, unsigned int protoff,
> +				  unsigned int dataoff,
> +				  const char **dptr, unsigned int *datalen,
> +				  unsigned int matchoff, unsigned int matchlen,
> +				  const char *buffer, unsigned int buflen)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	struct tcphdr *th;
> +	unsigned int baseoff;
> +
> +	if (nf_ct_protonum(ct) == IPPROTO_TCP) {
> +		th = (struct tcphdr *)(skb->data + protoff);
> +		baseoff = protoff + th->doff * 4;
> +		matchoff += dataoff - baseoff;
> +
> +		if (!__nf_nat_mangle_tcp_packet(skb, ct, ctinfo,
> +						protoff, matchoff, matchlen,
> +						buffer, buflen, false))
> +			return 0;
> +	} else {
> +		baseoff = protoff + sizeof(struct udphdr);
> +		matchoff += dataoff - baseoff;
> +
> +		if (!nf_nat_mangle_udp_packet(skb, ct, ctinfo,
> +					      protoff, matchoff, matchlen,
> +					      buffer, buflen))
> +			return 0;
> +	}
> +
> +	/* Reload data pointer and adjust datalen value */
> +	*dptr = skb->data + dataoff;
> +	*datalen += buflen - matchlen;
> +	return 1;
> +}
> +
> +static int sip_sprintf_addr(const struct nf_conn *ct, char *buffer,
> +			    const union nf_inet_addr *addr, bool delim)
> +{
> +	if (nf_ct_l3num(ct) == NFPROTO_IPV4)
> +		return sprintf(buffer, "%pI4", &addr->ip);
> +	else {
> +		if (delim)
> +			return sprintf(buffer, "[%pI6c]", &addr->ip6);
> +		else
> +			return sprintf(buffer, "%pI6c", &addr->ip6);
> +	}
> +}
> +
> +static int sip_sprintf_addr_port(const struct nf_conn *ct, char *buffer,
> +				 const union nf_inet_addr *addr, u16 port)
> +{
> +	if (nf_ct_l3num(ct) == NFPROTO_IPV4)
> +		return sprintf(buffer, "%pI4:%u", &addr->ip, port);
> +	else
> +		return sprintf(buffer, "[%pI6c]:%u", &addr->ip6, port);
> +}
> +
> +static int map_addr(struct sk_buff *skb, unsigned int protoff,
> +		    unsigned int dataoff,
> +		    const char **dptr, unsigned int *datalen,
> +		    unsigned int matchoff, unsigned int matchlen,
> +		    union nf_inet_addr *addr, __be16 port)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> +	char buffer[INET6_ADDRSTRLEN + sizeof("[]:nnnnn")];
> +	unsigned int buflen;
> +	union nf_inet_addr newaddr;
> +	__be16 newport;
> +
> +	if (nf_inet_addr_cmp(&ct->tuplehash[dir].tuple.src.u3, addr) &&
> +	    ct->tuplehash[dir].tuple.src.u.udp.port == port) {
> +		newaddr = ct->tuplehash[!dir].tuple.dst.u3;
> +		newport = ct->tuplehash[!dir].tuple.dst.u.udp.port;
> +	} else if (nf_inet_addr_cmp(&ct->tuplehash[dir].tuple.dst.u3, addr) &&
> +		   ct->tuplehash[dir].tuple.dst.u.udp.port == port) {
> +		newaddr = ct->tuplehash[!dir].tuple.src.u3;
> +		newport = ct->tuplehash[!dir].tuple.src.u.udp.port;
> +	} else
> +		return 1;
> +
> +	if (nf_inet_addr_cmp(&newaddr, addr) && newport == port)
> +		return 1;
> +
> +	buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, ntohs(newport));
> +	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +			     matchoff, matchlen, buffer, buflen);
> +}
> +
> +static int map_sip_addr(struct sk_buff *skb, unsigned int protoff,
> +			unsigned int dataoff,
> +			const char **dptr, unsigned int *datalen,
> +			enum sip_header_types type)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	unsigned int matchlen, matchoff;
> +	union nf_inet_addr addr;
> +	__be16 port;
> +
> +	if (ct_sip_parse_header_uri(ct, *dptr, NULL, *datalen, type, NULL,
> +				    &matchoff, &matchlen, &addr, &port) <= 0)
> +		return 1;
> +	return map_addr(skb, protoff, dataoff, dptr, datalen,
> +			matchoff, matchlen, &addr, port);
> +}
> +
> +static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
> +			       unsigned int dataoff,
> +			       const char **dptr, unsigned int *datalen)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> +	unsigned int coff, matchoff, matchlen;
> +	enum sip_header_types hdr;
> +	union nf_inet_addr addr;
> +	__be16 port;
> +	int request, in_header;
> +
> +	/* Basic rules: requests and responses. */
> +	if (strnicmp(*dptr, "SIP/2.0", strlen("SIP/2.0")) != 0) {
> +		if (ct_sip_parse_request(ct, *dptr, *datalen,
> +					 &matchoff, &matchlen,
> +					 &addr, &port) > 0 &&
> +		    !map_addr(skb, protoff, dataoff, dptr, datalen,
> +			      matchoff, matchlen, &addr, port))
> +			return NF_DROP;
> +		request = 1;
> +	} else
> +		request = 0;
> +
> +	if (nf_ct_protonum(ct) == IPPROTO_TCP)
> +		hdr = SIP_HDR_VIA_TCP;
> +	else
> +		hdr = SIP_HDR_VIA_UDP;
> +
> +	/* Translate topmost Via header and parameters */
> +	if (ct_sip_parse_header_uri(ct, *dptr, NULL, *datalen,
> +				    hdr, NULL, &matchoff, &matchlen,
> +				    &addr, &port) > 0) {
> +		unsigned int olen, matchend, poff, plen, buflen, n;
> +		char buffer[INET6_ADDRSTRLEN + sizeof("[]:nnnnn")];
> +
> +		/* We're only interested in headers related to this
> +		 * connection */
> +		if (request) {
> +			if (!nf_inet_addr_cmp(&addr,
> +					&ct->tuplehash[dir].tuple.src.u3) ||
> +			    port != ct->tuplehash[dir].tuple.src.u.udp.port)
> +				goto next;
> +		} else {
> +			if (!nf_inet_addr_cmp(&addr,
> +					&ct->tuplehash[dir].tuple.dst.u3) ||
> +			    port != ct->tuplehash[dir].tuple.dst.u.udp.port)
> +				goto next;
> +		}
> +
> +		olen = *datalen;
> +		if (!map_addr(skb, protoff, dataoff, dptr, datalen,
> +			      matchoff, matchlen, &addr, port))
> +			return NF_DROP;
> +
> +		matchend = matchoff + matchlen + *datalen - olen;
> +
> +		/* The maddr= parameter (RFC 3261) specifies where to send
> +		 * the reply. */
> +		if (ct_sip_parse_address_param(ct, *dptr, matchend, *datalen,
> +					       "maddr=", &poff, &plen,
> +					       &addr, true) > 0 &&
> +		    nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.src.u3) &&
> +		    !nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.dst.u3)) {
> +			buflen = sip_sprintf_addr(ct, buffer,
> +					&ct->tuplehash[!dir].tuple.dst.u3,
> +					true);
> +			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +					   poff, plen, buffer, buflen))
> +				return NF_DROP;
> +		}
> +
> +		/* The received= parameter (RFC 3261) contains the address
> +		 * from which the server received the request. */
> +		if (ct_sip_parse_address_param(ct, *dptr, matchend, *datalen,
> +					       "received=", &poff, &plen,
> +					       &addr, false) > 0 &&
> +		    nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.dst.u3) &&
> +		    !nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.src.u3)) {
> +			buflen = sip_sprintf_addr(ct, buffer,
> +					&ct->tuplehash[!dir].tuple.src.u3,
> +					false);
> +			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +					   poff, plen, buffer, buflen))
> +				return NF_DROP;
> +		}
> +
> +		/* The rport= parameter (RFC 3581) contains the port number
> +		 * from which the server received the request. */
> +		if (ct_sip_parse_numerical_param(ct, *dptr, matchend, *datalen,
> +						 "rport=", &poff, &plen,
> +						 &n) > 0 &&
> +		    htons(n) == ct->tuplehash[dir].tuple.dst.u.udp.port &&
> +		    htons(n) != ct->tuplehash[!dir].tuple.src.u.udp.port) {
> +			__be16 p = ct->tuplehash[!dir].tuple.src.u.udp.port;
> +			buflen = sprintf(buffer, "%u", ntohs(p));
> +			if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +					   poff, plen, buffer, buflen))
> +				return NF_DROP;
> +		}
> +	}
> +
> +next:
> +	/* Translate Contact headers */
> +	coff = 0;
> +	in_header = 0;
> +	while (ct_sip_parse_header_uri(ct, *dptr, &coff, *datalen,
> +				       SIP_HDR_CONTACT, &in_header,
> +				       &matchoff, &matchlen,
> +				       &addr, &port) > 0) {
> +		if (!map_addr(skb, protoff, dataoff, dptr, datalen,
> +			      matchoff, matchlen,
> +			      &addr, port))
> +			return NF_DROP;
> +	}
> +
> +	if (!map_sip_addr(skb, protoff, dataoff, dptr, datalen, SIP_HDR_FROM) ||
> +	    !map_sip_addr(skb, protoff, dataoff, dptr, datalen, SIP_HDR_TO))
> +		return NF_DROP;
> +
> +	return NF_ACCEPT;
> +}
> +
> +static void nf_nat_sip_seq_adjust(struct sk_buff *skb, unsigned int protoff,
> +				  s16 off)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	const struct tcphdr *th;
> +
> +	if (nf_ct_protonum(ct) != IPPROTO_TCP || off == 0)
> +		return;
> +
> +	th = (struct tcphdr *)(skb->data + protoff);
> +	nf_nat_set_seq_adjust(ct, ctinfo, th->seq, off);
> +}
> +
> +/* Handles expected signalling connections and media streams */
> +static void nf_nat_sip_expected(struct nf_conn *ct,
> +				struct nf_conntrack_expect *exp)
> +{
> +	struct nf_nat_range range;
> +
> +	/* This must be a fresh one. */
> +	BUG_ON(ct->status & IPS_NAT_DONE_MASK);
> +
> +	/* For DST manip, map port here to where it's expected. */
> +	range.flags = (NF_NAT_RANGE_MAP_IPS | NF_NAT_RANGE_PROTO_SPECIFIED);
> +	range.min_proto = range.max_proto = exp->saved_proto;
> +	range.min_addr = range.max_addr = exp->saved_addr;
> +	nf_nat_setup_info(ct, &range, NF_NAT_MANIP_DST);
> +
> +	/* Change src to where master sends to, but only if the connection
> +	 * actually came from the same source. */
> +	if (nf_inet_addr_cmp(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3,
> +			     &ct->master->tuplehash[exp->dir].tuple.src.u3)) {
> +		range.flags = NF_NAT_RANGE_MAP_IPS;
> +		range.min_addr = range.max_addr
> +			= ct->master->tuplehash[!exp->dir].tuple.dst.u3;
> +		nf_nat_setup_info(ct, &range, NF_NAT_MANIP_SRC);
> +	}
> +}
> +
> +static unsigned int nf_nat_sip_expect(struct sk_buff *skb, unsigned int protoff,
> +				      unsigned int dataoff,
> +				      const char **dptr, unsigned int *datalen,
> +				      struct nf_conntrack_expect *exp,
> +				      unsigned int matchoff,
> +				      unsigned int matchlen)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> +	union nf_inet_addr newaddr;
> +	u_int16_t port;
> +	char buffer[INET6_ADDRSTRLEN + sizeof("[]:nnnnn")];
> +	unsigned int buflen;
> +
> +	/* Connection will come from reply */
> +	if (nf_inet_addr_cmp(&ct->tuplehash[dir].tuple.src.u3,
> +			     &ct->tuplehash[!dir].tuple.dst.u3))
> +		newaddr = exp->tuple.dst.u3;
> +	else
> +		newaddr = ct->tuplehash[!dir].tuple.dst.u3;
> +
> +	/* If the signalling port matches the connection's source port in the
> +	 * original direction, try to use the destination port in the opposite
> +	 * direction. */
> +	if (exp->tuple.dst.u.udp.port ==
> +	    ct->tuplehash[dir].tuple.src.u.udp.port)
> +		port = ntohs(ct->tuplehash[!dir].tuple.dst.u.udp.port);
> +	else
> +		port = ntohs(exp->tuple.dst.u.udp.port);
> +
> +	exp->saved_addr = exp->tuple.dst.u3;
> +	exp->tuple.dst.u3 = newaddr;
> +	exp->saved_proto.udp.port = exp->tuple.dst.u.udp.port;
> +	exp->dir = !dir;
> +	exp->expectfn = nf_nat_sip_expected;
> +
> +	for (; port != 0; port++) {
> +		int ret;
> +
> +		exp->tuple.dst.u.udp.port = htons(port);
> +		ret = nf_ct_expect_related(exp);
> +		if (ret == 0)
> +			break;
> +		else if (ret != -EBUSY) {
> +			port = 0;
> +			break;
> +		}
> +	}
> +
> +	if (port == 0)
> +		return NF_DROP;
> +
> +	if (!nf_inet_addr_cmp(&exp->tuple.dst.u3, &exp->saved_addr) ||
> +	    exp->tuple.dst.u.udp.port != exp->saved_proto.udp.port) {
> +		buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, port);
> +		if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +				   matchoff, matchlen, buffer, buflen))
> +			goto err;
> +	}
> +	return NF_ACCEPT;
> +
> +err:
> +	nf_ct_unexpect_related(exp);
> +	return NF_DROP;
> +}
> +
> +static int mangle_content_len(struct sk_buff *skb, unsigned int protoff,
> +			      unsigned int dataoff,
> +			      const char **dptr, unsigned int *datalen)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	unsigned int matchoff, matchlen;
> +	char buffer[sizeof("65536")];
> +	int buflen, c_len;
> +
> +	/* Get actual SDP length */
> +	if (ct_sip_get_sdp_header(ct, *dptr, 0, *datalen,
> +				  SDP_HDR_VERSION, SDP_HDR_UNSPEC,
> +				  &matchoff, &matchlen) <= 0)
> +		return 0;
> +	c_len = *datalen - matchoff + strlen("v=");
> +
> +	/* Now, update SDP length */
> +	if (ct_sip_get_header(ct, *dptr, 0, *datalen, SIP_HDR_CONTENT_LENGTH,
> +			      &matchoff, &matchlen) <= 0)
> +		return 0;
> +
> +	buflen = sprintf(buffer, "%u", c_len);
> +	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +			     matchoff, matchlen, buffer, buflen);
> +}
> +
> +static int mangle_sdp_packet(struct sk_buff *skb, unsigned int protoff,
> +			     unsigned int dataoff,
> +			     const char **dptr, unsigned int *datalen,
> +			     unsigned int sdpoff,
> +			     enum sdp_header_types type,
> +			     enum sdp_header_types term,
> +			     char *buffer, int buflen)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	unsigned int matchlen, matchoff;
> +
> +	if (ct_sip_get_sdp_header(ct, *dptr, sdpoff, *datalen, type, term,
> +				  &matchoff, &matchlen) <= 0)
> +		return -ENOENT;
> +	return mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +			     matchoff, matchlen, buffer, buflen) ? 0 : -EINVAL;
> +}
> +
> +static unsigned int nf_nat_sdp_addr(struct sk_buff *skb, unsigned int protoff,
> +				    unsigned int dataoff,
> +				    const char **dptr, unsigned int *datalen,
> +				    unsigned int sdpoff,
> +				    enum sdp_header_types type,
> +				    enum sdp_header_types term,
> +				    const union nf_inet_addr *addr)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	char buffer[INET6_ADDRSTRLEN];
> +	unsigned int buflen;
> +
> +	buflen = sip_sprintf_addr(ct, buffer, addr, false);
> +	if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen,
> +			      sdpoff, type, term, buffer, buflen))
> +		return 0;
> +
> +	return mangle_content_len(skb, protoff, dataoff, dptr, datalen);
> +}
> +
> +static unsigned int nf_nat_sdp_port(struct sk_buff *skb, unsigned int protoff,
> +				    unsigned int dataoff,
> +				    const char **dptr, unsigned int *datalen,
> +				    unsigned int matchoff,
> +				    unsigned int matchlen,
> +				    u_int16_t port)
> +{
> +	char buffer[sizeof("nnnnn")];
> +	unsigned int buflen;
> +
> +	buflen = sprintf(buffer, "%u", port);
> +	if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
> +			   matchoff, matchlen, buffer, buflen))
> +		return 0;
> +
> +	return mangle_content_len(skb, protoff, dataoff, dptr, datalen);
> +}
> +
> +static unsigned int nf_nat_sdp_session(struct sk_buff *skb, unsigned int protoff,
> +				       unsigned int dataoff,
> +				       const char **dptr, unsigned int *datalen,
> +				       unsigned int sdpoff,
> +				       const union nf_inet_addr *addr)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	char buffer[INET6_ADDRSTRLEN];
> +	unsigned int buflen;
> +
> +	/* Mangle session description owner and contact addresses */
> +	buflen = sip_sprintf_addr(ct, buffer, addr, false);
> +	if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff,
> +			      SDP_HDR_OWNER, SDP_HDR_MEDIA, buffer, buflen))
> +		return 0;
> +
> +	switch (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff,
> +				  SDP_HDR_CONNECTION, SDP_HDR_MEDIA,
> +				  buffer, buflen)) {
> +	case 0:
> +	/*
> +	 * RFC 2327:
> +	 *
> +	 * Session description
> +	 *
> +	 * c=* (connection information - not required if included in all media)
> +	 */
> +	case -ENOENT:
> +		break;
> +	default:
> +		return 0;
> +	}
> +
> +	return mangle_content_len(skb, protoff, dataoff, dptr, datalen);
> +}
> +
> +/* So, this packet has hit the connection tracking matching code.
> +   Mangle it, and change the expectation to match the new version. */
> +static unsigned int nf_nat_sdp_media(struct sk_buff *skb, unsigned int protoff,
> +				     unsigned int dataoff,
> +				     const char **dptr, unsigned int *datalen,
> +				     struct nf_conntrack_expect *rtp_exp,
> +				     struct nf_conntrack_expect *rtcp_exp,
> +				     unsigned int mediaoff,
> +				     unsigned int medialen,
> +				     union nf_inet_addr *rtp_addr)
> +{
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
> +	enum ip_conntrack_dir dir = CTINFO2DIR(ctinfo);
> +	u_int16_t port;
> +
> +	/* Connection will come from reply */
> +	if (nf_inet_addr_cmp(&ct->tuplehash[dir].tuple.src.u3,
> +			     &ct->tuplehash[!dir].tuple.dst.u3))
> +		*rtp_addr = rtp_exp->tuple.dst.u3;
> +	else
> +		*rtp_addr = ct->tuplehash[!dir].tuple.dst.u3;
> +
> +	rtp_exp->saved_addr = rtp_exp->tuple.dst.u3;
> +	rtp_exp->tuple.dst.u3 = *rtp_addr;
> +	rtp_exp->saved_proto.udp.port = rtp_exp->tuple.dst.u.udp.port;
> +	rtp_exp->dir = !dir;
> +	rtp_exp->expectfn = nf_nat_sip_expected;
> +
> +	rtcp_exp->saved_addr = rtcp_exp->tuple.dst.u3;
> +	rtcp_exp->tuple.dst.u3 = *rtp_addr;
> +	rtcp_exp->saved_proto.udp.port = rtcp_exp->tuple.dst.u.udp.port;
> +	rtcp_exp->dir = !dir;
> +	rtcp_exp->expectfn = nf_nat_sip_expected;
> +
> +	/* Try to get same pair of ports: if not, try to change them. */
> +	for (port = ntohs(rtp_exp->tuple.dst.u.udp.port);
> +	     port != 0; port += 2) {
> +		int ret;
> +
> +		rtp_exp->tuple.dst.u.udp.port = htons(port);
> +		ret = nf_ct_expect_related(rtp_exp);
> +		if (ret == -EBUSY)
> +			continue;
> +		else if (ret < 0) {
> +			port = 0;
> +			break;
> +		}
> +		rtcp_exp->tuple.dst.u.udp.port = htons(port + 1);
> +		ret = nf_ct_expect_related(rtcp_exp);
> +		if (ret == 0)
> +			break;
> +		else if (ret != -EBUSY) {
> +			nf_ct_unexpect_related(rtp_exp);
> +			port = 0;
> +			break;
> +		}
> +	}
> +
> +	if (port == 0)
> +		goto err1;
> +
> +	/* Update media port. */
> +	if (rtp_exp->tuple.dst.u.udp.port != rtp_exp->saved_proto.udp.port &&
> +	    !nf_nat_sdp_port(skb, protoff, dataoff, dptr, datalen,
> +			     mediaoff, medialen, port))
> +		goto err2;
> +
> +	return NF_ACCEPT;
> +
> +err2:
> +	nf_ct_unexpect_related(rtp_exp);
> +	nf_ct_unexpect_related(rtcp_exp);
> +err1:
> +	return NF_DROP;
> +}
> +
> +static struct nf_ct_helper_expectfn sip_nat = {
> +	.name		= "sip",
> +	.expectfn	= nf_nat_sip_expected,
> +};
> +
> +static void __exit nf_nat_sip_fini(void)
> +{
> +	RCU_INIT_POINTER(nf_nat_sip_hook, NULL);
> +	RCU_INIT_POINTER(nf_nat_sip_seq_adjust_hook, NULL);
> +	RCU_INIT_POINTER(nf_nat_sip_expect_hook, NULL);
> +	RCU_INIT_POINTER(nf_nat_sdp_addr_hook, NULL);
> +	RCU_INIT_POINTER(nf_nat_sdp_port_hook, NULL);
> +	RCU_INIT_POINTER(nf_nat_sdp_session_hook, NULL);
> +	RCU_INIT_POINTER(nf_nat_sdp_media_hook, NULL);
> +	nf_ct_helper_expectfn_unregister(&sip_nat);
> +	synchronize_rcu();
> +}
> +
> +static int __init nf_nat_sip_init(void)
> +{
> +	BUG_ON(nf_nat_sip_hook != NULL);
> +	BUG_ON(nf_nat_sip_seq_adjust_hook != NULL);
> +	BUG_ON(nf_nat_sip_expect_hook != NULL);
> +	BUG_ON(nf_nat_sdp_addr_hook != NULL);
> +	BUG_ON(nf_nat_sdp_port_hook != NULL);
> +	BUG_ON(nf_nat_sdp_session_hook != NULL);
> +	BUG_ON(nf_nat_sdp_media_hook != NULL);
> +	RCU_INIT_POINTER(nf_nat_sip_hook, nf_nat_sip);
> +	RCU_INIT_POINTER(nf_nat_sip_seq_adjust_hook, nf_nat_sip_seq_adjust);
> +	RCU_INIT_POINTER(nf_nat_sip_expect_hook, nf_nat_sip_expect);
> +	RCU_INIT_POINTER(nf_nat_sdp_addr_hook, nf_nat_sdp_addr);
> +	RCU_INIT_POINTER(nf_nat_sdp_port_hook, nf_nat_sdp_port);
> +	RCU_INIT_POINTER(nf_nat_sdp_session_hook, nf_nat_sdp_session);
> +	RCU_INIT_POINTER(nf_nat_sdp_media_hook, nf_nat_sdp_media);
> +	nf_ct_helper_expectfn_register(&sip_nat);
> +	return 0;
> +}
> +
> +module_init(nf_nat_sip_init);
> +module_exit(nf_nat_sip_fini);

^ permalink raw reply

* Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
From: Pedro Alves @ 2012-09-04 22:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Sasha Levin, Tejun Heo, torvalds, akpm,
	linux-kernel, linux-mm, paul.gortmaker, davem, mingo, ebiederm,
	aarcange, ericvh, netdev, josh, eric.dumazet, axboe, agk,
	dm-devel, neilb, ccaulfie, teigland, Trond.Myklebust, bfields,
	fweisbec, jesse, venkat.x.venkatsubra, ejt, snitzer, edumazet,
	linux-nfs, dev, rds-devel, lw
In-Reply-To: <1346798509.27919.25.camel@gandalf.local.home>

On 09/04/2012 11:41 PM, Steven Rostedt wrote:
> Ah, I missed the condition with the rec == &pg->records[pg->index]. But
> if ftrace_pages_start is NULL, the rec = &pg->records[pg->index] will
> fault.

Right.

> 
> You could do something like rec = pg ? &pg->records[pg->index] : NULL,

Right.

> but IIRC, the comma operator does not guarantee evaluation order. That
> is, the compiler is allowed to process "a , b" as "b; a;" and not "a;
> b;".

Not true.  The comma operator introduces a sequence point.  It's the comma
that separates function parameters that doesn't guarantee ordering.
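
To make the distinction concrete, here is a minimal sketch of the
"assign, then test" pattern discussed above.  The names struct page,
read_index() and init_then_read() are made up for illustration and are
not ftrace code; the only thing relied on is the comma operator's
sequence point.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	int index;
};

/* Counts how often a page was actually dereferenced. */
static int deref_calls;

static int read_index(const struct page *pg)
{
	deref_calls++;
	return pg->index;
}

/*
 * Comma operator: the left operand is fully evaluated (including its
 * side effects) before the right one, and the result is the right
 * operand.  The store to *out therefore happens before the NULL check
 * reads it.  Returns -1 when there is no page.
 */
static int init_then_read(const struct page *start, const struct page **out)
{
	assert(out != NULL);
	return (*out = start, *out ? read_index(*out) : -1);
}

/*
 * Contrast: in a call such as f(g(), h()) the comma only separates
 * arguments, and C leaves the order of g() and h() unspecified.
 */
```

Calling init_then_read(NULL, &pg) leaves pg NULL and never reaches
read_index(), which is the property the for-loop initializer needs.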

-- 
Pedro Alves

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
From: Steven Rostedt @ 2012-09-04 22:41 UTC (permalink / raw)
  To: Pedro Alves
  Cc: snitzer, neilb, fweisbec, Trond.Myklebust, bfields,
	paul.gortmaker, dm-devel, agk, aarcange, rds-devel,
	eric.dumazet, venkat.x.venkatsubra, ccaulfie, mingo, dev,
	ericvh, josh, lw, Mathieu Desnoyers, Sasha Levin, axboe,
	linux-nfs, edumazet, linux-mm, netdev, linux-kernel, ejt,
	ebiederm, Tejun Heo, teigland, akpm, torvalds, davem
In-Reply-To: <504677C8.3050801-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, 2012-09-04 at 22:51 +0100, Pedro Alves wrote:
> On 09/04/2012 09:59 PM, Steven Rostedt wrote:
> > On Tue, 2012-09-04 at 18:21 +0100, Pedro Alves wrote:
> >> On 09/04/2012 06:17 PM, Steven Rostedt wrote:
> >>> On Tue, 2012-09-04 at 17:40 +0100, Pedro Alves wrote:
> >>>
> >>>> BTW, you can also go a step further and remove the need to close with double }},
> >>>> with something like:
> >>>>
> >>>> #define do_for_each_ftrace_rec(pg, rec)                                          \
> >>>>         for (pg = ftrace_pages_start, rec = &pg->records[pg->index];             \
> >>>>              pg && rec == &pg->records[pg->index];                               \
> >>>>              pg = pg->next)                                                      \
> >>>>           for (rec = pg->records; rec < &pg->records[pg->index]; rec++)
> >>>>
> >>>
> >>> Yeah, but why bother? It's hidden in a macro, and the extra '{ }' shows
> >>> that this is something "special".
> >>
> >> The point of both changes is that there's nothing special in the end
> >> at all.  It all just works...
> >>
> > 
> > It would still fail on a 'break'. The 'while' macro tells us that it is
> > > special, because in the end, it won't work.
> 
> Please explain why it would fail on a 'break'.
> 

Ah, I missed the condition with the rec == &pg->records[pg->index]. But
if ftrace_pages_start is NULL, the rec = &pg->records[pg->index] will
fault.

You could do something like rec = pg ? &pg->records[pg->index] : NULL,
but IIRC, the comma operator does not guarantee evaluation order. That
is, the compiler is allowed to process "a , b" as "b; a;" and not "a;
b;".
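
For reference, a sketch of one way to get single-brace iteration that
is safe for an empty list and still stops on 'break'.  This is a
variant written for illustration, not the macro quoted above: struct
rec, struct page and for_each_rec() are hypothetical stand-ins for the
ftrace types, and the "did the inner loop finish?" check is done in
the advance step, so the initializer needs neither a dereference nor a
comma.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ftrace page and record types. */
struct rec {
	int id;
};

struct page {
	struct page *next;
	struct rec *records;
	int index;			/* number of records in use */
};

/*
 * Single-brace iteration over every record of every page.  The outer
 * loop advances only when the inner loop ran to completion (rec reached
 * the end of the page); after a 'break' it sets pg to NULL and stops.
 * A NULL start is never dereferenced.
 */
#define for_each_rec(start, pg, rec)					\
	for (pg = (start); pg;						\
	     pg = (rec == &pg->records[pg->index]) ? pg->next : NULL)	\
		for (rec = pg->records;					\
		     rec < &pg->records[pg->index]; rec++)

/* Count the records visited up to and including the one matching id. */
static int visited_until(struct page *start, int id)
{
	struct page *pg;
	struct rec *rec;
	int seen = 0;

	for_each_rec(start, pg, rec) {
		seen++;
		if (rec->id == id)
			break;	/* ends the whole walk, not just this page */
	}
	return seen;
}

/* Usage: two pages holding ids 1, 2 and 3. */
static void for_each_rec_example(void)
{
	struct rec r1[] = { { 1 }, { 2 } };
	struct rec r2[] = { { 3 } };
	struct page p2 = { NULL, r2, 1 };
	struct page p1 = { &p2, r1, 2 };

	assert(visited_until(&p1, 3) == 3);	/* crosses into page 2 */
	assert(visited_until(&p1, 1) == 1);	/* break on first record */
	assert(visited_until(&p1, 9) == 3);	/* no match: visits all */
	assert(visited_until(NULL, 1) == 0);	/* empty list is safe */
}
```

The trade-off is one extra comparison per page; 'continue' in the body
behaves as in an ordinary for loop.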

-- Steve

^ permalink raw reply

* Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
From: Pedro Alves @ 2012-09-04 21:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Mathieu Desnoyers, Sasha Levin, Tejun Heo, torvalds, akpm,
	linux-kernel, linux-mm, paul.gortmaker, davem, mingo, ebiederm,
	aarcange, ericvh, netdev, josh, eric.dumazet, axboe, agk,
	dm-devel, neilb, ccaulfie, teigland, Trond.Myklebust, bfields,
	fweisbec, jesse, venkat.x.venkatsubra, ejt, snitzer, edumazet,
	linux-nfs, dev, rds-devel, lw
In-Reply-To: <1346792345.27919.18.camel@gandalf.local.home>

On 09/04/2012 09:59 PM, Steven Rostedt wrote:
> On Tue, 2012-09-04 at 18:21 +0100, Pedro Alves wrote:
>> On 09/04/2012 06:17 PM, Steven Rostedt wrote:
>>> On Tue, 2012-09-04 at 17:40 +0100, Pedro Alves wrote:
>>>
>>>> BTW, you can also go a step further and remove the need to close with double }},
>>>> with something like:
>>>>
>>>> #define do_for_each_ftrace_rec(pg, rec)                                          \
>>>>         for (pg = ftrace_pages_start, rec = &pg->records[pg->index];             \
>>>>              pg && rec == &pg->records[pg->index];                               \
>>>>              pg = pg->next)                                                      \
>>>>           for (rec = pg->records; rec < &pg->records[pg->index]; rec++)
>>>>
>>>
>>> Yeah, but why bother? It's hidden in a macro, and the extra '{ }' shows
>>> that this is something "special".
>>
>> The point of both changes is that there's nothing special in the end
>> at all.  It all just works...
>>
> 
> It would still fail on a 'break'. The 'while' macro tells us that it is
> special, because in the end, it won't work.

Please explain why it would fail on a 'break'.

-- 
Pedro Alves

^ permalink raw reply

* Re: [PATCH 2/3] ipvs: Fix faulty IPv6 extension header handling in IPVS
From: Jesper Dangaard Brouer @ 2012-09-04 21:25 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: netdev, Hans Schillstrom, lvs-devel, Julian Anastasov,
	Simon Horman, Wensong Zhang, netfilter-devel
In-Reply-To: <Pine.GSO.4.63.1208262311110.16771@stinky-local.trash.net>

On Mon, 20 Aug 2012, Jesper Dangaard Brouer wrote:

[cut]

> This patch contains a lot of API changes.  This is done, to avoid
> the costly scan of finding the IPv6 headers, via ipv6_find_hdr().

(Small correction: ipv6_find_hdr() is not that costly in the common
case of no extension headers.)

> Finding the IPv6 headers is done as early as possible, and passed
> on as a pointer "struct ip_vs_iphdr *" to the affected functions.

Passing the "struct ip_vs_iphdr" around actually makes sense.  It reminds
me of the way netfilter/iptables passes struct xt_action_param to each
rule, which contains the same information as ip_vs_iphdr.  (Note that
IPVS registers its hooks at a lower level and doesn't get passed an
xt_action_param.)

Thus, perhaps we should keep these API changes, even if we decide to
optimize ipv6_find_hdr() (as proposed by my RFC patch).

Perhaps we should consider adding a "family" to ip_vs_iphdr, as is done
in xt_action_param.  This could help us collapse the IPv4 and IPv6
code, but I can see that other structs in IPVS already carry this info,
so I'm not sure it's relevant.
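
A rough sketch of the "parse once, pass a pointer, carry the family"
idea, as a userspace illustration only.  struct hdr_info, enum family,
fill_hdr_info(), is_tcp() and addr_len() are hypothetical names and
not the real ip_vs_iphdr API; the IPv6 branch assumes no extension
headers, which a real parser would have to walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in; not the real struct ip_vs_iphdr. */
enum family { FAM_IPV4 = 4, FAM_IPV6 = 6 };

struct hdr_info {
	enum family family;	/* address family, set once by the parser */
	uint8_t protocol;	/* transport protocol number */
	size_t len;		/* offset of the transport header */
};

/*
 * Fill *h from the start of a raw packet.  Returns 0 on success and
 * -1 on a truncated or unknown packet.
 */
static int fill_hdr_info(const uint8_t *pkt, size_t pktlen,
			 struct hdr_info *h)
{
	assert(pkt != NULL && h != NULL);
	if (pktlen < 1)
		return -1;

	switch (pkt[0] >> 4) {
	case 4:
		if (pktlen < 20)
			return -1;
		h->family = FAM_IPV4;
		h->len = (size_t)(pkt[0] & 0x0f) * 4;	/* IHL in words */
		h->protocol = pkt[9];
		return (h->len >= 20 && h->len <= pktlen) ? 0 : -1;
	case 6:
		if (pktlen < 40)
			return -1;
		h->family = FAM_IPV6;
		h->len = 40;		/* fixed header only */
		h->protocol = pkt[6];	/* Next Header */
		return 0;
	default:
		return -1;
	}
}

/* Consumers shared by both families: they never re-parse the packet. */
static int is_tcp(const struct hdr_info *h)
{
	return h->protocol == 6;	/* TCP */
}

static size_t addr_len(const struct hdr_info *h)
{
	return h->family == FAM_IPV4 ? 4 : 16;
}
```

With the family stored in the struct, helpers such as addr_len() need
no separate IPv4 and IPv6 variants.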



^ permalink raw reply

* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
From: Michael S. Tsirkin @ 2012-09-04 21:21 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Eric W. Biederman, Or Gerlitz, davem, roland, netdev, sean.hefty,
	Erez Shitrit, Ali Ayoub, Doug Ledford
In-Reply-To: <CAJZOPZJdmDY8rqHJ+jeuG2rLMj9CnwnemkBG=nxD=z9JBFQCRQ@mail.gmail.com>

On Tue, Sep 04, 2012 at 09:50:09PM +0300, Or Gerlitz wrote:
> > And just to stress the point, document the limitations as well.
> 
> sure, not that I see concrete limitations for the **user** at this point, but
> if there are such, will put them clearly written.

Hmm, I'm afraid you mistook a short list of some major
bugs that jumped out at me for an exhaustive list.
This was not intended as such.

Here's how to find some of the limitations in your design:
1. look through list archives. Some were pointed out to you
2. list everything ethernet does that you don't, or do differently
3. list everything ipoib does that you don't, or do differently
4. list any extra setup work required on behalf of the user
5. check various overheads, compare with native ipoib and alternatives
   such as routing
6. if you still have an empty list of limitations and disadvantages,
   look again :)

The point is to have documentation that is useful both
for reviewers - so they can know which bugs are known
and which need to be reported; and for users -
who should not be expected to be familiar with
internals of your implementation.

-- 
MST


* [PATCH v3 3/3] iproute2: use libgenl for ipl2tp
From: Julian Anastasov @ 2012-09-04 21:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1346792597-2427-1-git-send-email-ja@ssi.bg>

	Use the common code from libgenl.c

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urp iproute2-3.5.1-tcp_metrics/ip/ipl2tp.c iproute2-3.5.1-ipl2tp-genl/ip/ipl2tp.c
--- iproute2-3.5.1-tcp_metrics/ip/ipl2tp.c	2012-08-13 18:13:58.000000000 +0300
+++ iproute2-3.5.1-ipl2tp-genl/ip/ipl2tp.c	2012-09-04 01:40:55.775000653 +0300
@@ -25,6 +25,7 @@
 
 #include <linux/genetlink.h>
 #include <linux/l2tp.h>
+#include "libgenl.h"
 
 #include "utils.h"
 #include "ip_common.h"
@@ -747,67 +748,6 @@ static int do_show(int argc, char **argv
 	return 0;
 }
 
-static int genl_parse_getfamily(struct nlmsghdr *nlh)
-{
-	struct rtattr *tb[CTRL_ATTR_MAX + 1];
-	struct genlmsghdr *ghdr = NLMSG_DATA(nlh);
-	int len = nlh->nlmsg_len;
-	struct rtattr *attrs;
-
-	if (nlh->nlmsg_type != GENL_ID_CTRL) {
-		fprintf(stderr, "Not a controller message, nlmsg_len=%d "
-			"nlmsg_type=0x%x\n", nlh->nlmsg_len, nlh->nlmsg_type);
-		return -1;
-	}
-
-	if (ghdr->cmd != CTRL_CMD_NEWFAMILY) {
-		fprintf(stderr, "Unknown controller command %d\n", ghdr->cmd);
-		return -1;
-	}
-
-	len -= NLMSG_LENGTH(GENL_HDRLEN);
-
-	if (len < 0) {
-		fprintf(stderr, "wrong controller message len %d\n", len);
-		return -1;
-	}
-
-	attrs = (struct rtattr *) ((char *) ghdr + GENL_HDRLEN);
-	parse_rtattr(tb, CTRL_ATTR_MAX, attrs, len);
-
-	if (tb[CTRL_ATTR_FAMILY_ID] == NULL) {
-		fprintf(stderr, "Missing family id TLV\n");
-		return -1;
-	}
-
-	return rta_getattr_u16(tb[CTRL_ATTR_FAMILY_ID]);
-}
-
-int genl_ctrl_resolve_family(const char *family)
-{
-	struct {
-		struct nlmsghdr         n;
-		struct genlmsghdr	g;
-		char                    buf[1024];
-	} req;
-
-	memset(&req, 0, sizeof(req));
-	req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
-	req.n.nlmsg_flags = NLM_F_REQUEST;
-	req.n.nlmsg_type = GENL_ID_CTRL;
-	req.g.cmd = CTRL_CMD_GETFAMILY;
-
-	addattr_l(&req.n, 1024, CTRL_ATTR_FAMILY_NAME,
-		  family, strlen(family) + 1);
-
-	if (rtnl_talk(&genl_rth, &req.n, 0, 0, &req.n) < 0) {
-		fprintf(stderr, "Error talking to the kernel\n");
-		return -2;
-	}
-
-	return genl_parse_getfamily(&req.n);
-}
-
 int do_ipl2tp(int argc, char **argv)
 {
 	if (genl_family < 0) {
@@ -816,7 +756,8 @@ int do_ipl2tp(int argc, char **argv)
 			exit(1);
 		}
 
-		genl_family = genl_ctrl_resolve_family(L2TP_GENL_NAME);
+		genl_family = libgenl_resolve_family(&genl_rth,
+						     L2TP_GENL_NAME);
 		if (genl_family < 0)
 			exit(1);
 	}

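For readers who have not looked at what the shared resolver puts on the
wire: the request is a netlink header, a generic netlink header, and one
attribute carrying the NUL-terminated family name.  The sketch below only
reproduces the length arithmetic.  The struct names are hypothetical
stand-ins that mirror the kernel header layouts as I understand them
(16-byte nlmsghdr, 4-byte genlmsghdr, 4-byte attribute header, everything
aligned to 4 bytes); it does not talk to the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Netlink pads headers and attributes to 4-byte boundaries. */
#define ALIGN4(n)	(((size_t)(n) + 3) & ~(size_t)3)

/* Hypothetical mirrors of nlmsghdr, genlmsghdr and rtattr. */
struct nl_hdr   { uint32_t len; uint16_t type; uint16_t flags;
		  uint32_t seq; uint32_t pid; };
struct genl_hdr { uint8_t cmd; uint8_t version; uint16_t reserved; };
struct attr_hdr { uint16_t len; uint16_t type; };

/* Total message length of a "resolve family by name" request. */
size_t resolve_request_len(const char *family)
{
	size_t len, payload;

	assert(family != NULL);
	/* Same result as NLMSG_LENGTH(GENL_HDRLEN) in the patch. */
	len = ALIGN4(sizeof(struct nl_hdr)) + ALIGN4(sizeof(struct genl_hdr));
	/* The name is sent with its trailing NUL: strlen(family) + 1. */
	payload = strlen(family) + 1;
	/* Appending an attribute aligns the old length, then adds the TLV. */
	len = ALIGN4(len) + ALIGN4(sizeof(struct attr_hdr) + payload);
	return len;
}
```

Under these assumptions the "l2tp" request is 32 bytes and the
"tcp_metrics" request is 36 bytes, both far below the 1024-byte buffer
the request struct reserves.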

* [PATCH v3 2/3] iproute2: add support for tcp_metrics
From: Julian Anastasov @ 2012-09-04 21:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1346792597-2427-1-git-send-email-ja@ssi.bg>

	ip tcp_metrics/tcpmetrics

v2:
- On flush we should provide the address in req2
- "flush all" should use single del, just like when "all" is not provided
- Explain printed information in man page

v3:
- move genl code into new files: libgenl.h and libgenl.c

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urpN iproute2-3.5.1/include/libgenl.h iproute2-3.5.1-tcp_metrics/include/libgenl.h
--- iproute2-3.5.1/include/libgenl.h	1970-01-01 02:00:00.000000000 +0200
+++ iproute2-3.5.1-tcp_metrics/include/libgenl.h	2012-09-04 01:10:25.606915458 +0300
@@ -0,0 +1,25 @@
+#ifndef __LIBGENL_H__
+#define __LIBGENL_H__
+
+#include "libnetlink.h"
+
+#define GENL_DEFINE_REQUEST(req, hdrsize, bufsiz)			\
+struct {								\
+	struct nlmsghdr		n;					\
+	struct genlmsghdr	g;					\
+	char			buf[NLMSG_ALIGN(hdrsize) + bufsiz];	\
+} req
+
+#define GENL_INIT_REQUEST(req, family, hdrsize, ver, cmd_, flags)	\
+	do {								\
+		memset(&req, 0, sizeof(req));				\
+		req.n.nlmsg_type = family;				\
+		req.n.nlmsg_flags = flags;				\
+		req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN + hdrsize);	\
+		req.g.cmd = cmd_;					\
+		req.g.version = ver;					\
+	} while (0)
+
+int libgenl_resolve_family(struct rtnl_handle *grth, const char *family);
+
+#endif /* __LIBGENL_H__ */
diff -urpN iproute2-3.5.1/include/linux/tcp_metrics.h iproute2-3.5.1-tcp_metrics/include/linux/tcp_metrics.h
--- iproute2-3.5.1/include/linux/tcp_metrics.h	1970-01-01 02:00:00.000000000 +0200
+++ iproute2-3.5.1-tcp_metrics/include/linux/tcp_metrics.h	2012-08-23 09:50:54.385569009 +0300
@@ -0,0 +1,54 @@
+/* tcp_metrics.h - TCP Metrics Interface */
+
+#ifndef _LINUX_TCP_METRICS_H
+#define _LINUX_TCP_METRICS_H
+
+#include <linux/types.h>
+
+/* NETLINK_GENERIC related info
+ */
+#define TCP_METRICS_GENL_NAME		"tcp_metrics"
+#define TCP_METRICS_GENL_VERSION	0x1
+
+enum tcp_metric_index {
+	TCP_METRIC_RTT,
+	TCP_METRIC_RTTVAR,
+	TCP_METRIC_SSTHRESH,
+	TCP_METRIC_CWND,
+	TCP_METRIC_REORDERING,
+
+	/* Always last.  */
+	__TCP_METRIC_MAX,
+};
+
+#define TCP_METRIC_MAX	(__TCP_METRIC_MAX - 1)
+
+enum {
+	TCP_METRICS_ATTR_UNSPEC,
+	TCP_METRICS_ATTR_ADDR_IPV4,		/* u32 */
+	TCP_METRICS_ATTR_ADDR_IPV6,		/* binary */
+	TCP_METRICS_ATTR_AGE,			/* msecs */
+	TCP_METRICS_ATTR_TW_TSVAL,		/* u32, raw, rcv tsval */
+	TCP_METRICS_ATTR_TW_TS_STAMP,		/* s32, sec age */
+	TCP_METRICS_ATTR_VALS,			/* nested +1, u32 */
+	TCP_METRICS_ATTR_FOPEN_MSS,		/* u16 */
+	TCP_METRICS_ATTR_FOPEN_SYN_DROPS,	/* u16, count of drops */
+	TCP_METRICS_ATTR_FOPEN_SYN_DROP_TS,	/* msecs age */
+	TCP_METRICS_ATTR_FOPEN_COOKIE,		/* binary */
+
+	__TCP_METRICS_ATTR_MAX,
+};
+
+#define TCP_METRICS_ATTR_MAX	(__TCP_METRICS_ATTR_MAX - 1)
+
+enum {
+	TCP_METRICS_CMD_UNSPEC,
+	TCP_METRICS_CMD_GET,
+	TCP_METRICS_CMD_DEL,
+
+	__TCP_METRICS_CMD_MAX,
+};
+
+#define TCP_METRICS_CMD_MAX	(__TCP_METRICS_CMD_MAX - 1)
+
+#endif /* _LINUX_TCP_METRICS_H */
diff -urpN iproute2-3.5.1/ip/Makefile iproute2-3.5.1-tcp_metrics/ip/Makefile
--- iproute2-3.5.1/ip/Makefile	2012-08-13 18:13:58.000000000 +0300
+++ iproute2-3.5.1-tcp_metrics/ip/Makefile	2012-09-04 00:58:57.434883914 +0300
@@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o ipr
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
     iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
-    iplink_macvlan.o iplink_macvtap.o ipl2tp.o
+    iplink_macvlan.o iplink_macvtap.o ipl2tp.o tcp_metrics.o
 
 RTMONOBJ=rtmon.o
 
diff -urpN iproute2-3.5.1/ip/ip.c iproute2-3.5.1-tcp_metrics/ip/ip.c
--- iproute2-3.5.1/ip/ip.c	2012-08-13 18:13:58.000000000 +0300
+++ iproute2-3.5.1-tcp_metrics/ip/ip.c	2012-08-23 10:21:20.917653464 +0300
@@ -45,7 +45,7 @@ static void usage(void)
 "       ip [ -force ] -batch filename\n"
 "where  OBJECT := { link | addr | addrlabel | route | rule | neigh | ntable |\n"
 "                   tunnel | tuntap | maddr | mroute | mrule | monitor | xfrm |\n"
-"                   netns | l2tp }\n"
+"                   netns | l2tp | tcp_metrics }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "                    -f[amily] { inet | inet6 | ipx | dnet | link } |\n"
 "                    -l[oops] { maximum-addr-flush-attempts } |\n"
@@ -78,6 +78,8 @@ static const struct cmd {
 	{ "tunl",	do_iptunnel },
 	{ "tuntap",	do_iptuntap },
 	{ "tap",	do_iptuntap },
+	{ "tcpmetrics",	do_tcp_metrics },
+	{ "tcp_metrics",do_tcp_metrics },
 	{ "monitor",	do_ipmonitor },
 	{ "xfrm",	do_xfrm },
 	{ "mroute",	do_multiroute },
diff -urpN iproute2-3.5.1/ip/ip_common.h iproute2-3.5.1-tcp_metrics/ip/ip_common.h
--- iproute2-3.5.1/ip/ip_common.h	2012-08-13 18:13:58.000000000 +0300
+++ iproute2-3.5.1-tcp_metrics/ip/ip_common.h	2012-08-23 10:19:11.005647457 +0300
@@ -42,6 +42,7 @@ extern int do_multirule(int argc, char *
 extern int do_netns(int argc, char **argv);
 extern int do_xfrm(int argc, char **argv);
 extern int do_ipl2tp(int argc, char **argv);
+extern int do_tcp_metrics(int argc, char **argv);
 
 static inline int rtm_get_table(struct rtmsg *r, struct rtattr **tb)
 {
diff -urpN iproute2-3.5.1/ip/tcp_metrics.c iproute2-3.5.1-tcp_metrics/ip/tcp_metrics.c
--- iproute2-3.5.1/ip/tcp_metrics.c	1970-01-01 02:00:00.000000000 +0200
+++ iproute2-3.5.1-tcp_metrics/ip/tcp_metrics.c	2012-09-04 01:12:19.706921106 +0300
@@ -0,0 +1,434 @@
+/*
+ * tcp_metrics.c	"ip tcp_metrics/tcpmetrics"
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		version 2 as published by the Free Software Foundation;
+ *
+ * Authors:	Julian Anastasov <ja@ssi.bg>, August 2012
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+#include <sys/ioctl.h>
+#include <linux/if.h>
+
+#include <linux/genetlink.h>
+#include <linux/tcp_metrics.h>
+
+#include "utils.h"
+#include "ip_common.h"
+#include "libgenl.h"
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: ip tcp_metrics/tcpmetrics { COMMAND | help }\n");
+	fprintf(stderr, "       ip tcp_metrics { show | flush } SELECTOR\n");
+	fprintf(stderr, "       ip tcp_metrics delete [ address ] ADDRESS\n");
+	fprintf(stderr, "SELECTOR := [ [ address ] PREFIX ]\n");
+	exit(-1);
+}
+
+/* netlink socket */
+static struct rtnl_handle grth = { .fd = -1 };
+static int genl_family = -1;
+
+
+#define INIT_GENL_ACTION(req, cmd, ack)					\
+	GENL_INIT_REQUEST(req, genl_family, 0,				\
+			  TCP_METRICS_GENL_VERSION, cmd,		\
+			  NLM_F_REQUEST | ((ack) ? NLM_F_ACK : 0))
+
+#define INIT_GENL_DUMP(req)						\
+	GENL_INIT_REQUEST(req, genl_family, 0,				\
+			  TCP_METRICS_GENL_VERSION,			\
+			  TCP_METRICS_CMD_GET,				\
+			  NLM_F_DUMP | NLM_F_REQUEST)
+
+#define CMD_LIST	0x0001	/* list, lst, show		*/
+#define CMD_DEL		0x0002	/* delete, remove		*/
+#define CMD_FLUSH	0x0004	/* flush			*/
+
+static struct {
+	char	*name;
+	int	code;
+} cmds[] = {
+	{	"list",		CMD_LIST	},
+	{	"lst",		CMD_LIST	},
+	{	"show",		CMD_LIST	},
+	{	"delete",	CMD_DEL		},
+	{	"remove",	CMD_DEL		},
+	{	"flush",	CMD_FLUSH	},
+};
+
+static char *metric_name[TCP_METRIC_MAX + 1] = {
+	[TCP_METRIC_RTT]		= "rtt",
+	[TCP_METRIC_RTTVAR]		= "rttvar",
+	[TCP_METRIC_SSTHRESH]		= "ssthresh",
+	[TCP_METRIC_CWND]		= "cwnd",
+	[TCP_METRIC_REORDERING]		= "reordering",
+};
+
+static struct
+{
+	int flushed;
+	char *flushb;
+	int flushp;
+	int flushe;
+	int cmd;
+	inet_prefix addr;
+} f;
+
+static int flush_update(void)
+{
+	if (rtnl_send_check(&grth, f.flushb, f.flushp) < 0) {
+		perror("Failed to send flush request\n");
+		return -1;
+	}
+	f.flushp = 0;
+	return 0;
+}
+
+static int process_msg(const struct sockaddr_nl *who, struct nlmsghdr *n,
+		       void *arg)
+{
+	FILE *fp = (FILE *) arg;
+	struct genlmsghdr *ghdr;
+	struct rtattr *attrs[TCP_METRICS_ATTR_MAX + 1], *a;
+	int len = n->nlmsg_len;
+	char abuf[256];
+	inet_prefix addr;
+	int family, i, atype;
+
+	if (n->nlmsg_type != genl_family)
+		return -1;
+
+	len -= NLMSG_LENGTH(GENL_HDRLEN);
+	if (len < 0)
+		return -1;
+
+	ghdr = NLMSG_DATA(n);
+	if (ghdr->cmd != TCP_METRICS_CMD_GET)
+		return 0;
+
+	parse_rtattr(attrs, TCP_METRICS_ATTR_MAX, (void *) ghdr + GENL_HDRLEN,
+		     len);
+
+	a = attrs[TCP_METRICS_ATTR_ADDR_IPV4];
+	if (a) {
+		if (f.addr.family && f.addr.family != AF_INET)
+			return 0;
+		memcpy(&addr.data, RTA_DATA(a), 4);
+		addr.bytelen = 4;
+		family = AF_INET;
+		atype = TCP_METRICS_ATTR_ADDR_IPV4;
+	} else {
+		a = attrs[TCP_METRICS_ATTR_ADDR_IPV6];
+		if (a) {
+			if (f.addr.family && f.addr.family != AF_INET6)
+				return 0;
+			memcpy(&addr.data, RTA_DATA(a), 16);
+			addr.bytelen = 16;
+			family = AF_INET6;
+			atype = TCP_METRICS_ATTR_ADDR_IPV6;
+		} else
+			return 0;
+	}
+
+	if (f.addr.family && f.addr.bitlen >= 0 &&
+	    inet_addr_match(&addr, &f.addr, f.addr.bitlen))
+		return 0;
+
+	if (f.flushb) {
+		struct nlmsghdr *fn;
+		GENL_DEFINE_REQUEST(req2, 0, 128);
+
+		INIT_GENL_ACTION(req2, TCP_METRICS_CMD_DEL, 0);
+		addattr_l(&req2.n, sizeof(req2), atype, &addr.data,
+			  addr.bytelen);
+
+		if (NLMSG_ALIGN(f.flushp) + req2.n.nlmsg_len > f.flushe) {
+			if (flush_update())
+				return -1;
+		}
+		fn = (struct nlmsghdr *) (f.flushb + NLMSG_ALIGN(f.flushp));
+		memcpy(fn, &req2.n, req2.n.nlmsg_len);
+		fn->nlmsg_seq = ++grth.seq;
+		f.flushp = (((char *) fn) + req2.n.nlmsg_len) - f.flushb;
+		f.flushed++;
+		if (show_stats < 2)
+			return 0;
+	}
+
+	if (f.cmd & (CMD_DEL | CMD_FLUSH))
+		fprintf(fp, "Deleted ");
+
+	fprintf(fp, "%s",
+		format_host(family, RTA_PAYLOAD(a), &addr.data,
+			    abuf, sizeof(abuf)));
+
+	a = attrs[TCP_METRICS_ATTR_AGE];
+	if (a) {
+		__u64 val = rta_getattr_u64(a);
+
+		fprintf(fp, " age %llu.%03llusec",
+			val / 1000, val % 1000);
+	}
+
+	a = attrs[TCP_METRICS_ATTR_TW_TS_STAMP];
+	if (a) {
+		__s32 val = (__s32) rta_getattr_u32(a);
+		__u32 tsval;
+
+		a = attrs[TCP_METRICS_ATTR_TW_TSVAL];
+		tsval = a ? rta_getattr_u32(a) : 0;
+		fprintf(fp, " tw_ts %u/%dsec ago", tsval, val);
+	}
+
+	a = attrs[TCP_METRICS_ATTR_VALS];
+	if (a) {
+		struct rtattr *m[TCP_METRIC_MAX + 1 + 1];
+
+		parse_rtattr_nested(m, TCP_METRIC_MAX + 1, a);
+
+		for (i = 0; i < TCP_METRIC_MAX + 1; i++) {
+			__u32 val;
+
+			a = m[i + 1];
+			if (!a)
+				continue;
+			if (metric_name[i])
+				fprintf(fp, " %s ", metric_name[i]);
+			else
+				fprintf(fp, " metric_%d ", i);
+			val = rta_getattr_u32(a);
+			switch (i) {
+			case TCP_METRIC_RTT:
+			case TCP_METRIC_RTTVAR:
+				fprintf(fp, "%ums", val);
+				break;
+			case TCP_METRIC_SSTHRESH:
+			case TCP_METRIC_CWND:
+			case TCP_METRIC_REORDERING:
+			default:
+				fprintf(fp, "%u", val);
+				break;
+			}
+		}
+	}
+
+	a = attrs[TCP_METRICS_ATTR_FOPEN_MSS];
+	if (a)
+		fprintf(fp, " fo_mss %u", rta_getattr_u16(a));
+
+	a = attrs[TCP_METRICS_ATTR_FOPEN_SYN_DROPS];
+	if (a) {
+		__u16 syn_loss = rta_getattr_u16(a);
+		__u64 ts;
+
+		a = attrs[TCP_METRICS_ATTR_FOPEN_SYN_DROP_TS];
+		ts = a ? rta_getattr_u64(a) : 0;
+
+		fprintf(fp, " fo_syn_drops %u/%llu.%03llusec ago",
+			syn_loss, ts / 1000, ts % 1000);
+	}
+
+	a = attrs[TCP_METRICS_ATTR_FOPEN_COOKIE];
+	if (a) {
+		char cookie[32 + 1];
+		unsigned char *ptr = RTA_DATA(a);
+		int i, max = RTA_PAYLOAD(a);
+
+		if (max > 16)
+			max = 16;
+		cookie[0] = 0;
+		for (i = 0; i < max; i++)
+			sprintf(cookie + i + i, "%02x", ptr[i]);
+		fprintf(fp, " fo_cookie %s", cookie);
+	}
+
+	fprintf(fp, "\n");
+
+	fflush(fp);
+	return 0;
+}
+
+static int tcpm_do_cmd(int cmd, int argc, char **argv)
+{
+	GENL_DEFINE_REQUEST(req, 0, 1024);
+	int atype = -1;
+	int code, ack;
+
+	memset(&f, 0, sizeof(f));
+	f.addr.bitlen = -1;
+	f.addr.family = preferred_family;
+
+	switch (preferred_family) {
+	case AF_UNSPEC:
+	case AF_INET:
+	case AF_INET6:
+		break;
+	default:
+		fprintf(stderr, "Unsupported family:%d\n", preferred_family);
+		return -1;
+	}
+
+	for (; argc > 0; argc--, argv++) {
+		char *who = "address";
+
+		if (strcmp(*argv, "addr") == 0 ||
+		    strcmp(*argv, "address") == 0) {
+			who = *argv;
+			NEXT_ARG();
+		}
+		if (matches(*argv, "help") == 0)
+			usage();
+		if (f.addr.bitlen >= 0)
+			duparg2(who, *argv);
+
+		get_prefix(&f.addr, *argv, preferred_family);
+		if (f.addr.bytelen && f.addr.bytelen * 8 == f.addr.bitlen) {
+			if (f.addr.family == AF_INET)
+				atype = TCP_METRICS_ATTR_ADDR_IPV4;
+			else if (f.addr.family == AF_INET6)
+				atype = TCP_METRICS_ATTR_ADDR_IPV6;
+		}
+		if ((CMD_DEL & cmd) && atype < 0) {
+			fprintf(stderr, "Error: a specific IP address is expected rather than \"%s\"\n",
+				*argv);
+			return -1;
+		}
+
+		argc--; argv++;
+	}
+
+	if (cmd == CMD_DEL && atype < 0)
+		missarg("address");
+
+	/* flush for exact address ? Single del */
+	if (cmd == CMD_FLUSH && atype >= 0)
+		cmd = CMD_DEL;
+	/* flush for all addresses ? Single del without address */
+	if (cmd == CMD_FLUSH && f.addr.bitlen <= 0 &&
+	    preferred_family == AF_UNSPEC) {
+		cmd = CMD_DEL;
+		code = TCP_METRICS_CMD_DEL;
+		ack = 1;
+	} else if (cmd == CMD_DEL) {
+		code = TCP_METRICS_CMD_DEL;
+		ack = 1;
+	} else {	/* CMD_FLUSH, CMD_LIST */
+		code = TCP_METRICS_CMD_GET;
+		ack = 0;
+	}
+
+	if (genl_family < 0) {
+		if (rtnl_open_byproto(&grth, 0, NETLINK_GENERIC) < 0) {
+			fprintf(stderr, "Cannot open generic netlink socket\n");
+			exit(1);
+		}
+		genl_family = libgenl_resolve_family(&grth,
+						     TCP_METRICS_GENL_NAME);
+		if (genl_family < 0)
+			exit(1);
+	}
+
+	if (!(cmd & CMD_FLUSH) && (atype >= 0 || (cmd & CMD_DEL))) {
+		INIT_GENL_ACTION(req, code, ack);
+		if (atype >= 0)
+			addattr_l(&req.n, sizeof(req), atype, &f.addr.data,
+				  f.addr.bytelen);
+	} else {
+		INIT_GENL_DUMP(req);
+	}
+
+	f.cmd = cmd;
+	if (cmd & CMD_FLUSH) {
+		int round = 0;
+		char flushb[4096-512];
+
+		f.flushb = flushb;
+		f.flushp = 0;
+		f.flushe = sizeof(flushb);
+
+		for (;;) {
+			req.n.nlmsg_seq = grth.dump = ++grth.seq;
+			if (rtnl_send(&grth, &req, req.n.nlmsg_len) < 0) {
+				perror("Failed to send flush request");
+				exit(1);
+			}
+			f.flushed = 0;
+			if (rtnl_dump_filter(&grth, process_msg, stdout) < 0) {
+				fprintf(stderr, "Flush terminated\n");
+				exit(1);
+			}
+			if (f.flushed == 0) {
+				if (round == 0) {
+					fprintf(stderr, "Nothing to flush.\n");
+				} else if (show_stats)
+					printf("*** Flush is complete after %d round%s ***\n",
+					       round, round > 1 ? "s" : "");
+				fflush(stdout);
+				return 0;
+			}
+			round++;
+			if (flush_update() < 0)
+				exit(1);
+			if (show_stats) {
+				printf("\n*** Round %d, deleting %d entries ***\n",
+				       round, f.flushed);
+				fflush(stdout);
+			}
+		}
+		return 0;
+	}
+
+	if (ack) {
+		if (rtnl_talk(&grth, &req.n, 0, 0, NULL) < 0)
+			return -2;
+	} else if (atype >= 0) {
+		if (rtnl_talk(&grth, &req.n, 0, 0, &req.n) < 0)
+			return -2;
+		if (process_msg(NULL, &req.n, stdout) < 0) {
+			fprintf(stderr, "Dump terminated\n");
+			exit(1);
+		}
+	} else {
+		req.n.nlmsg_seq = grth.dump = ++grth.seq;
+		if (rtnl_send(&grth, &req, req.n.nlmsg_len) < 0) {
+			perror("Failed to send dump request");
+			exit(1);
+		}
+
+		if (rtnl_dump_filter(&grth, process_msg, stdout) < 0) {
+			fprintf(stderr, "Dump terminated\n");
+			exit(1);
+		}
+	}
+	return 0;
+}
+
+int do_tcp_metrics(int argc, char **argv)
+{
+	int i;
+
+	if (argc < 1)
+		return tcpm_do_cmd(CMD_LIST, 0, NULL);
+	for (i = 0; i < ARRAY_SIZE(cmds); i++) {
+		if (matches(argv[0], cmds[i].name) == 0)
+			return tcpm_do_cmd(cmds[i].code, argc-1, argv+1);
+	}
+	if (matches(argv[0], "help") == 0)
+		usage();
+
+	fprintf(stderr, "Command \"%s\" is unknown, "
+			"try \"ip tcp_metrics help\".\n", *argv);
+	exit(-1);
+}
+
diff -urpN iproute2-3.5.1/lib/Makefile iproute2-3.5.1-tcp_metrics/lib/Makefile
--- iproute2-3.5.1/lib/Makefile	2012-08-13 18:13:58.000000000 +0300
+++ iproute2-3.5.1-tcp_metrics/lib/Makefile	2012-09-04 00:58:43.950883370 +0300
@@ -2,7 +2,7 @@ CFLAGS += -fPIC
 
 UTILOBJ=utils.o rt_names.o ll_types.o ll_proto.o ll_addr.o inet_proto.o
 
-NLOBJ=ll_map.o libnetlink.o
+NLOBJ=libgenl.o ll_map.o libnetlink.o
 
 all: libnetlink.a libutil.a
 
diff -urpN iproute2-3.5.1/lib/libgenl.c iproute2-3.5.1-tcp_metrics/lib/libgenl.c
--- iproute2-3.5.1/lib/libgenl.c	1970-01-01 02:00:00.000000000 +0200
+++ iproute2-3.5.1-tcp_metrics/lib/libgenl.c	2012-09-04 01:10:05.794915224 +0300
@@ -0,0 +1,65 @@
+/*
+ * libgenl.c	GENL library
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/genetlink.h>
+#include "libgenl.h"
+
+static int libgenl_parse_getfamily(struct nlmsghdr *nlh)
+{
+	struct rtattr *tb[CTRL_ATTR_MAX + 1];
+	struct genlmsghdr *ghdr = NLMSG_DATA(nlh);
+	int len = nlh->nlmsg_len;
+	struct rtattr *attrs;
+
+	if (nlh->nlmsg_type != GENL_ID_CTRL) {
+		fprintf(stderr, "Not a controller message, nlmsg_len=%d "
+			"nlmsg_type=0x%x\n", nlh->nlmsg_len, nlh->nlmsg_type);
+		return -1;
+	}
+
+	len -= NLMSG_LENGTH(GENL_HDRLEN);
+
+	if (len < 0) {
+		fprintf(stderr, "wrong controller message len %d\n", len);
+		return -1;
+	}
+
+	if (ghdr->cmd != CTRL_CMD_NEWFAMILY) {
+		fprintf(stderr, "Unknown controller command %d\n", ghdr->cmd);
+		return -1;
+	}
+
+	attrs = (struct rtattr *) ((char *) ghdr + GENL_HDRLEN);
+	parse_rtattr(tb, CTRL_ATTR_MAX, attrs, len);
+
+	if (tb[CTRL_ATTR_FAMILY_ID] == NULL) {
+		fprintf(stderr, "Missing family id TLV\n");
+		return -1;
+	}
+
+	return rta_getattr_u16(tb[CTRL_ATTR_FAMILY_ID]);
+}
+
+int libgenl_resolve_family(struct rtnl_handle *grth, const char *family)
+{
+	GENL_DEFINE_REQUEST(req, 0, 1024);
+
+	GENL_INIT_REQUEST(req, GENL_ID_CTRL, 0, 0, CTRL_CMD_GETFAMILY,
+			  NLM_F_REQUEST);
+
+	addattr_l(&req.n, 1024, CTRL_ATTR_FAMILY_NAME,
+		  family, strlen(family) + 1);
+
+	if (rtnl_talk(grth, &req.n, 0, 0, &req.n) < 0) {
+		fprintf(stderr, "Error talking to the kernel\n");
+		return -2;
+	}
+
+	return libgenl_parse_getfamily(&req.n);
+}
+
diff -urpN iproute2-3.5.1/man/man8/Makefile iproute2-3.5.1-tcp_metrics/man/man8/Makefile
--- iproute2-3.5.1/man/man8/Makefile	2012-08-13 18:13:58.000000000 +0300
+++ iproute2-3.5.1-tcp_metrics/man/man8/Makefile	2012-08-23 14:45:12.506385476 +0300
@@ -7,7 +7,8 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnsta
 	bridge.8 rtstat.8 ctstat.8 nstat.8 routef.8 \
 	ip-tunnel.8 ip-rule.8 ip-ntable.8 \
 	ip-monitor.8 tc-stab.8 tc-hfsc.8 ip-xfrm.8 ip-netns.8 \
-	ip-neighbour.8 ip-mroute.8 ip-maddress.8 ip-addrlabel.8 
+	ip-neighbour.8 ip-mroute.8 ip-maddress.8 ip-addrlabel.8 \
+	ip-tcp_metrics.8
 
 
 all: $(TARGETS)
diff -urpN iproute2-3.5.1/man/man8/ip-tcp_metrics.8 iproute2-3.5.1-tcp_metrics/man/man8/ip-tcp_metrics.8
--- iproute2-3.5.1/man/man8/ip-tcp_metrics.8	1970-01-01 02:00:00.000000000 +0200
+++ iproute2-3.5.1-tcp_metrics/man/man8/ip-tcp_metrics.8	2012-08-25 17:37:30.868321617 +0300
@@ -0,0 +1,143 @@
+.TH "IP\-TCP_METRICS" 8 "23 Aug 2012" "iproute2" "Linux"
+.SH "NAME"
+ip-tcp_metrics \- management for TCP Metrics
+.SH "SYNOPSIS"
+.sp
+.ad l
+.in +8
+.ti -8
+.B ip
+.RI "[ " OPTIONS " ]"
+.B tcp_metrics
+.RI "{ " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.BR "ip tcp_metrics" " { " show " | " flush " }
+.IR SELECTOR
+
+.ti -8
+.BR "ip tcp_metrics delete " [ " address " ]
+.IR ADDRESS
+
+.ti -8
+.IR SELECTOR " := " 
+.RB "[ [ " address " ] "
+.IR PREFIX " ]"
+
+.SH "DESCRIPTION"
+.B ip tcp_metrics
+is used to manipulate entries in the kernel that keep TCP information
+for IPv4 and IPv6 destinations. The entries are created when
+TCP sockets want to share information for destinations and are
+stored in a cache keyed by the destination address. The saved
+information may include values for metrics (initially obtained from
+routes), recent TSVAL for TIME-WAIT recycling purposes, state for the
+Fast Open feature, etc.
+For performance reasons the cache can not grow above configured limit
+and the older entries are replaced with fresh information, sometimes
+reclaimed and used for new destinations. The kernel never removes
+entries, they can be flushed only with this tool.
+
+.SS ip tcp_metrics show - show cached entries
+
+.TP
+.BI address " PREFIX " (default)
+IPv4/IPv6 prefix or address. If no prefix is provided all entries are shown.
+
+.LP
+The output may contain the following information:
+
+.BI age " <S.MMM>" sec
+- time after the entry was created, reset or updated with metrics
+from sockets. The entry is reset and refreshed on use with metrics from
+route if the metrics are not updated in last hour. Not all cached values
+reset the age on update.
+
+.BI cwnd " <N>"
+- CWND metric value
+
+.BI fo_cookie " <HEX-STRING>"
+- Cookie value received in SYN-ACK to be used by Fast Open for next SYNs
+
+.BI fo_mss " <N>"
+- MSS value received in SYN-ACK to be used by Fast Open for next SYNs
+
+.BI fo_syn_drops " <N>/<S.MMM>" "sec ago"
+- Number of drops of initial outgoing Fast Open SYNs with data
+detected by monitoring the received SYN-ACK after SYN retransmission.
+The seconds show the time after last SYN drop and together with
+the drop count can be used to disable Fast Open for some time.
+
+.BI reordering " <N>"
+- Reordering metric value
+
+.BI rtt " <N>" ms
+- RTT metric value
+
+.BI rttvar " <N>" ms
+- RTTVAR metric value
+
+.BI ssthresh " <SSTHRESH>"
+- SSTHRESH metric value
+
+.BI tw_ts " <TSVAL>/<SEC>" "sec ago"
+- recent TSVAL and the seconds after saving it into TIME-WAIT socket
+
+.SS ip tcp_metrics delete - delete single entry
+
+.TP
+.BI address " ADDRESS " (default)
+IPv4/IPv6 address. The address is a required argument.
+
+.SS ip tcp_metrics flush - flush entries
+This command flushes the entries selected by some criteria.
+
+.PP
+This command has the same arguments as
+.B show.
+
+.SH "EXAMPLES"
+.PP
+ip tcp_metrics show address 192.168.0.0/24
+.RS 4
+Shows the entries for destinations from subnet
+.RE
+.PP
+ip tcp_metrics show 192.168.0.0/24
+.RS 4
+The same but address keyword is optional
+.RE
+.PP
+ip tcp_metrics
+.RS 4
+Show all is the default action
+.RE
+.PP
+ip tcp_metrics delete 192.168.0.1
+.RS 4
+Removes the entry for 192.168.0.1 from cache.
+.RE
+.PP
+ip tcp_metrics flush 192.168.0.0/24
+.RS 4
+Removes entries for destinations from subnet
+.RE
+.PP
+ip tcp_metrics flush all
+.RS 4
+Removes all entries from cache
+.RE
+.PP
+ip -6 tcp_metrics flush all
+.RS 4
+Removes all IPv6 entries from cache keeping the IPv4 entries.
+.RE
+
+.SH SEE ALSO
+.br
+.BR ip (8)
+
+.SH AUTHOR
+Original Manpage by Julian Anastasov <ja@ssi.bg>
diff -urpN iproute2-3.5.1/man/man8/ip.8 iproute2-3.5.1-tcp_metrics/man/man8/ip.8
--- iproute2-3.5.1/man/man8/ip.8	2012-08-13 18:13:58.000000000 +0300
+++ iproute2-3.5.1-tcp_metrics/man/man8/ip.8	2012-08-23 10:28:12.541672495 +0300
@@ -15,7 +15,7 @@ ip \- show / manipulate routing, devices
 .IR OBJECT " := { "
 .BR link " | " addr " | " addrlabel " | " route " | " rule " | " neigh " | "\
  ntable " | " tunnel " | " tuntap " | " maddr " | "  mroute " | " mrule " | "\
- monitor " | " xfrm " | " netns " | "  l2tp " }"
+ monitor " | " xfrm " | " netns " | "  l2tp " | "  tcp_metrics " }"
 .sp
 
 .ti -8
@@ -156,6 +156,10 @@ host addresses.
 - rule in routing policy database.
 
 .TP
+.B tcp_metrics/tcpmetrics
+- manage TCP Metrics
+
+.TP
 .B tunnel
 - tunnel over IP.
 
@@ -215,6 +219,7 @@ was written by Alexey N. Kuznetsov and a
 .BR ip-ntable (8),
 .BR ip-route (8),
 .BR ip-rule (8),
+.BR ip-tcp_metrics (8),
 .BR ip-tunnel (8),
 .BR ip-xfrm (8)
 .br

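The v2 notes above ("flush all" should use a single del) are easier to
follow with the request selection written out on its own.  The sketch
below restates the decision that tcpm_do_cmd() makes in this patch as a
pure function.  The enum names and plan_request() are hypothetical and
exist only for illustration; exact_addr means a full-length host address
was given, bitlen is -1 when no address was given and 0 for "all".

```c
#include <assert.h>

enum user_cmd { USER_LIST, USER_DELETE, USER_FLUSH };

enum plan {
	PLAN_GET_ONE,		/* single GET carrying the address */
	PLAN_GET_DUMP,		/* dump, filter in user space */
	PLAN_DEL_ONE,		/* single DEL carrying the address */
	PLAN_DEL_ALL,		/* single DEL without address: flush all */
	PLAN_DUMP_THEN_DEL,	/* dump, then batched DELs for matches */
	PLAN_ERROR,		/* delete needs a specific address */
};

enum plan plan_request(enum user_cmd cmd, int exact_addr, int bitlen,
		       int family_unspec)
{
	assert(bitlen >= -1);

	if (cmd == USER_DELETE)
		return exact_addr ? PLAN_DEL_ONE : PLAN_ERROR;

	if (cmd == USER_FLUSH) {
		/* flush for an exact address is just a single del */
		if (exact_addr)
			return PLAN_DEL_ONE;
		/* no prefix and no -4/-6: let the kernel flush everything */
		if (bitlen <= 0 && family_unspec)
			return PLAN_DEL_ALL;
		/* prefix or family filter: match entries in user space */
		return PLAN_DUMP_THEN_DEL;
	}

	return exact_addr ? PLAN_GET_ONE : PLAN_GET_DUMP;
}
```

For example, "ip tcp_metrics flush all" maps to PLAN_DEL_ALL, while
"ip -6 tcp_metrics flush all" maps to PLAN_DUMP_THEN_DEL because the
family filter has to be applied per entry.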

* [PATCH v3 1/3] tcp: add generic netlink support for tcp_metrics
From: Julian Anastasov @ 2012-09-04 21:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1346792597-2427-1-git-send-email-ja@ssi.bg>

	Add support for genl "tcp_metrics". Locking is
unchanged, except that entries can now be unlinked and
deleted after a grace period. We implement get/del for a
single entry, and dump to support show/flush filtering
in user space. Del without an address attribute flushes
all addresses, sadly under genl_mutex.

v2:
- remove rcu_assign_pointer as suggested by Eric Dumazet,
it is not needed because there are no other writes under lock
- move the flushing code in tcp_metrics_flush_all

v3:
- remove synchronize_rcu on flush as suggested by Eric Dumazet

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/linux/Kbuild        |    1 +
 include/linux/tcp_metrics.h |   54 +++++++
 net/ipv4/tcp_metrics.c      |  354 +++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 396 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/tcp_metrics.h

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 1f2c1c7..90da0af 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -363,6 +363,7 @@ header-y += sysctl.h
 header-y += sysinfo.h
 header-y += taskstats.h
 header-y += tcp.h
+header-y += tcp_metrics.h
 header-y += telephony.h
 header-y += termios.h
 header-y += time.h
diff --git a/include/linux/tcp_metrics.h b/include/linux/tcp_metrics.h
new file mode 100644
index 0000000..cb5157b
--- /dev/null
+++ b/include/linux/tcp_metrics.h
@@ -0,0 +1,54 @@
+/* tcp_metrics.h - TCP Metrics Interface */
+
+#ifndef _LINUX_TCP_METRICS_H
+#define _LINUX_TCP_METRICS_H
+
+#include <linux/types.h>
+
+/* NETLINK_GENERIC related info
+ */
+#define TCP_METRICS_GENL_NAME		"tcp_metrics"
+#define TCP_METRICS_GENL_VERSION	0x1
+
+enum tcp_metric_index {
+	TCP_METRIC_RTT,
+	TCP_METRIC_RTTVAR,
+	TCP_METRIC_SSTHRESH,
+	TCP_METRIC_CWND,
+	TCP_METRIC_REORDERING,
+
+	/* Always last.  */
+	__TCP_METRIC_MAX,
+};
+
+#define TCP_METRIC_MAX	(__TCP_METRIC_MAX - 1)
+
+enum {
+	TCP_METRICS_ATTR_UNSPEC,
+	TCP_METRICS_ATTR_ADDR_IPV4,		/* u32 */
+	TCP_METRICS_ATTR_ADDR_IPV6,		/* binary */
+	TCP_METRICS_ATTR_AGE,			/* msecs */
+	TCP_METRICS_ATTR_TW_TSVAL,		/* u32, raw, rcv tsval */
+	TCP_METRICS_ATTR_TW_TS_STAMP,		/* s32, sec age */
+	TCP_METRICS_ATTR_VALS,			/* nested +1, u32 */
+	TCP_METRICS_ATTR_FOPEN_MSS,		/* u16 */
+	TCP_METRICS_ATTR_FOPEN_SYN_DROPS,	/* u16, count of drops */
+	TCP_METRICS_ATTR_FOPEN_SYN_DROP_TS,	/* msecs age */
+	TCP_METRICS_ATTR_FOPEN_COOKIE,		/* binary */
+
+	__TCP_METRICS_ATTR_MAX,
+};
+
+#define TCP_METRICS_ATTR_MAX	(__TCP_METRICS_ATTR_MAX - 1)
+
+enum {
+	TCP_METRICS_CMD_UNSPEC,
+	TCP_METRICS_CMD_GET,
+	TCP_METRICS_CMD_DEL,
+
+	__TCP_METRICS_CMD_MAX,
+};
+
+#define TCP_METRICS_CMD_MAX	(__TCP_METRICS_CMD_MAX - 1)
+
+#endif /* _LINUX_TCP_METRICS_H */
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 0abe67b..988edb6 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -8,6 +8,7 @@
 #include <linux/init.h>
 #include <linux/tcp.h>
 #include <linux/hash.h>
+#include <linux/tcp_metrics.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/net_namespace.h>
@@ -17,20 +18,10 @@
 #include <net/ipv6.h>
 #include <net/dst.h>
 #include <net/tcp.h>
+#include <net/genetlink.h>
 
 int sysctl_tcp_nometrics_save __read_mostly;
 
-enum tcp_metric_index {
-	TCP_METRIC_RTT,
-	TCP_METRIC_RTTVAR,
-	TCP_METRIC_SSTHRESH,
-	TCP_METRIC_CWND,
-	TCP_METRIC_REORDERING,
-
-	/* Always last.  */
-	TCP_METRIC_MAX,
-};
-
 struct tcp_fastopen_metrics {
 	u16	mss;
 	u16	syn_loss:10;		/* Recurring Fast Open SYN losses */
@@ -45,8 +36,10 @@ struct tcp_metrics_block {
 	u32				tcpm_ts;
 	u32				tcpm_ts_stamp;
 	u32				tcpm_lock;
-	u32				tcpm_vals[TCP_METRIC_MAX];
+	u32				tcpm_vals[TCP_METRIC_MAX + 1];
 	struct tcp_fastopen_metrics	tcpm_fastopen;
+
+	struct rcu_head			rcu_head;
 };
 
 static bool tcp_metric_locked(struct tcp_metrics_block *tm,
@@ -690,6 +683,325 @@ void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
 	rcu_read_unlock();
 }
 
+static struct genl_family tcp_metrics_nl_family = {
+	.id		= GENL_ID_GENERATE,
+	.hdrsize	= 0,
+	.name		= TCP_METRICS_GENL_NAME,
+	.version	= TCP_METRICS_GENL_VERSION,
+	.maxattr	= TCP_METRICS_ATTR_MAX,
+	.netnsok	= true,
+};
+
+static struct nla_policy tcp_metrics_nl_policy[TCP_METRICS_ATTR_MAX + 1] = {
+	[TCP_METRICS_ATTR_ADDR_IPV4]	= { .type = NLA_U32, },
+	[TCP_METRICS_ATTR_ADDR_IPV6]	= { .type = NLA_BINARY,
+					    .len = sizeof(struct in6_addr), },
+	/* Following attributes are not received for GET/DEL,
+	 * we keep them for reference
+	 */
+#if 0
+	[TCP_METRICS_ATTR_AGE]		= { .type = NLA_MSECS, },
+	[TCP_METRICS_ATTR_TW_TSVAL]	= { .type = NLA_U32, },
+	[TCP_METRICS_ATTR_TW_TS_STAMP]	= { .type = NLA_S32, },
+	[TCP_METRICS_ATTR_VALS]		= { .type = NLA_NESTED, },
+	[TCP_METRICS_ATTR_FOPEN_MSS]	= { .type = NLA_U16, },
+	[TCP_METRICS_ATTR_FOPEN_SYN_DROPS]	= { .type = NLA_U16, },
+	[TCP_METRICS_ATTR_FOPEN_SYN_DROP_TS]	= { .type = NLA_MSECS, },
+	[TCP_METRICS_ATTR_FOPEN_COOKIE]	= { .type = NLA_BINARY,
+					    .len = TCP_FASTOPEN_COOKIE_MAX, },
+#endif
+};
+
+/* Add attributes, caller cancels its header on failure */
+static int tcp_metrics_fill_info(struct sk_buff *msg,
+				 struct tcp_metrics_block *tm)
+{
+	struct nlattr *nest;
+	int i;
+
+	switch (tm->tcpm_addr.family) {
+	case AF_INET:
+		if (nla_put_be32(msg, TCP_METRICS_ATTR_ADDR_IPV4,
+				tm->tcpm_addr.addr.a4) < 0)
+			goto nla_put_failure;
+		break;
+	case AF_INET6:
+		if (nla_put(msg, TCP_METRICS_ATTR_ADDR_IPV6, 16,
+			    tm->tcpm_addr.addr.a6) < 0)
+			goto nla_put_failure;
+		break;
+	default:
+		return -EAFNOSUPPORT;
+	}
+
+	if (nla_put_msecs(msg, TCP_METRICS_ATTR_AGE,
+			  jiffies - tm->tcpm_stamp) < 0)
+		goto nla_put_failure;
+	if (tm->tcpm_ts_stamp) {
+		if (nla_put_s32(msg, TCP_METRICS_ATTR_TW_TS_STAMP,
+				(s32) (get_seconds() - tm->tcpm_ts_stamp)) < 0)
+			goto nla_put_failure;
+		if (nla_put_u32(msg, TCP_METRICS_ATTR_TW_TSVAL,
+				tm->tcpm_ts) < 0)
+			goto nla_put_failure;
+	}
+
+	{
+		int n = 0;
+
+		nest = nla_nest_start(msg, TCP_METRICS_ATTR_VALS);
+		if (!nest)
+			goto nla_put_failure;
+		for (i = 0; i < TCP_METRIC_MAX + 1; i++) {
+			if (!tm->tcpm_vals[i])
+				continue;
+			if (nla_put_u32(msg, i + 1, tm->tcpm_vals[i]) < 0)
+				goto nla_put_failure;
+			n++;
+		}
+		if (n)
+			nla_nest_end(msg, nest);
+		else
+			nla_nest_cancel(msg, nest);
+	}
+
+	{
+		struct tcp_fastopen_metrics tfom_copy[1], *tfom;
+		unsigned int seq;
+
+		do {
+			seq = read_seqbegin(&fastopen_seqlock);
+			tfom_copy[0] = tm->tcpm_fastopen;
+		} while (read_seqretry(&fastopen_seqlock, seq));
+
+		tfom = tfom_copy;
+		if (tfom->mss &&
+		    nla_put_u16(msg, TCP_METRICS_ATTR_FOPEN_MSS,
+				tfom->mss) < 0)
+			goto nla_put_failure;
+		if (tfom->syn_loss &&
+		    (nla_put_u16(msg, TCP_METRICS_ATTR_FOPEN_SYN_DROPS,
+				tfom->syn_loss) < 0 ||
+		     nla_put_msecs(msg, TCP_METRICS_ATTR_FOPEN_SYN_DROP_TS,
+				jiffies - tfom->last_syn_loss) < 0))
+			goto nla_put_failure;
+		if (tfom->cookie.len > 0 &&
+		    nla_put(msg, TCP_METRICS_ATTR_FOPEN_COOKIE,
+			    tfom->cookie.len, tfom->cookie.val) < 0)
+			goto nla_put_failure;
+	}
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
+static int tcp_metrics_dump_info(struct sk_buff *skb,
+				 struct netlink_callback *cb,
+				 struct tcp_metrics_block *tm)
+{
+	void *hdr;
+
+	hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).pid, cb->nlh->nlmsg_seq,
+			  &tcp_metrics_nl_family, NLM_F_MULTI,
+			  TCP_METRICS_CMD_GET);
+	if (!hdr)
+		return -EMSGSIZE;
+
+	if (tcp_metrics_fill_info(skb, tm) < 0)
+		goto nla_put_failure;
+
+	return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+	genlmsg_cancel(skb, hdr);
+	return -EMSGSIZE;
+}
+
+static int tcp_metrics_nl_dump(struct sk_buff *skb,
+			       struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
+	unsigned int row, s_row = cb->args[0];
+	int s_col = cb->args[1], col = s_col;
+
+	for (row = s_row; row < max_rows; row++, s_col = 0) {
+		struct tcp_metrics_block *tm;
+		struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash + row;
+
+		rcu_read_lock();
+		for (col = 0, tm = rcu_dereference(hb->chain); tm;
+		     tm = rcu_dereference(tm->tcpm_next), col++) {
+			if (col < s_col)
+				continue;
+			if (tcp_metrics_dump_info(skb, cb, tm) < 0) {
+				rcu_read_unlock();
+				goto done;
+			}
+		}
+		rcu_read_unlock();
+	}
+
+done:
+	cb->args[0] = row;
+	cb->args[1] = col;
+	return skb->len;
+}
+
+static int parse_nl_addr(struct genl_info *info, struct inetpeer_addr *addr,
+			 unsigned int *hash, int optional)
+{
+	struct nlattr *a;
+
+	a = info->attrs[TCP_METRICS_ATTR_ADDR_IPV4];
+	if (a) {
+		addr->family = AF_INET;
+		addr->addr.a4 = nla_get_be32(a);
+		*hash = (__force unsigned int) addr->addr.a4;
+		return 0;
+	}
+	a = info->attrs[TCP_METRICS_ATTR_ADDR_IPV6];
+	if (a) {
+		if (nla_len(a) != sizeof(struct in6_addr))
+			return -EINVAL;
+		addr->family = AF_INET6;
+		memcpy(addr->addr.a6, nla_data(a), sizeof(addr->addr.a6));
+		*hash = ipv6_addr_hash((struct in6_addr *) addr->addr.a6);
+		return 0;
+	}
+	return optional ? 1 : -EAFNOSUPPORT;
+}
+
+static int tcp_metrics_nl_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+	struct tcp_metrics_block *tm;
+	struct inetpeer_addr addr;
+	unsigned int hash;
+	struct sk_buff *msg;
+	struct net *net = genl_info_net(info);
+	void *reply;
+	int ret;
+
+	ret = parse_nl_addr(info, &addr, &hash, 0);
+	if (ret < 0)
+		return ret;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	reply = genlmsg_put_reply(msg, info, &tcp_metrics_nl_family, 0,
+				  info->genlhdr->cmd);
+	if (!reply)
+		goto nla_put_failure;
+
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	ret = -ESRCH;
+	rcu_read_lock();
+	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
+	     tm = rcu_dereference(tm->tcpm_next)) {
+		if (addr_same(&tm->tcpm_addr, &addr)) {
+			ret = tcp_metrics_fill_info(msg, tm);
+			break;
+		}
+	}
+	rcu_read_unlock();
+	if (ret < 0)
+		goto out_free;
+
+	genlmsg_end(msg, reply);
+	return genlmsg_reply(msg, info);
+
+nla_put_failure:
+	ret = -EMSGSIZE;
+
+out_free:
+	nlmsg_free(msg);
+	return ret;
+}
+
+#define deref_locked_genl(p)	\
+	rcu_dereference_protected(p, lockdep_genl_is_held() && \
+				     lockdep_is_held(&tcp_metrics_lock))
+
+#define deref_genl(p)	rcu_dereference_protected(p, lockdep_genl_is_held())
+
+static int tcp_metrics_flush_all(struct net *net)
+{
+	unsigned int max_rows = 1U << net->ipv4.tcp_metrics_hash_log;
+	struct tcpm_hash_bucket *hb = net->ipv4.tcp_metrics_hash;
+	struct tcp_metrics_block *tm;
+	unsigned int row;
+
+	for (row = 0; row < max_rows; row++, hb++) {
+		spin_lock_bh(&tcp_metrics_lock);
+		tm = deref_locked_genl(hb->chain);
+		if (tm)
+			hb->chain = NULL;
+		spin_unlock_bh(&tcp_metrics_lock);
+		while (tm) {
+			struct tcp_metrics_block *next;
+
+			next = deref_genl(tm->tcpm_next);
+			kfree_rcu(tm, rcu_head);
+			tm = next;
+		}
+	}
+	return 0;
+}
+
+static int tcp_metrics_nl_cmd_del(struct sk_buff *skb, struct genl_info *info)
+{
+	struct tcpm_hash_bucket *hb;
+	struct tcp_metrics_block *tm;
+	struct tcp_metrics_block __rcu **pp;
+	struct inetpeer_addr addr;
+	unsigned int hash;
+	struct net *net = genl_info_net(info);
+	int ret;
+
+	ret = parse_nl_addr(info, &addr, &hash, 1);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		return tcp_metrics_flush_all(net);
+
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
+	hb = net->ipv4.tcp_metrics_hash + hash;
+	pp = &hb->chain;
+	spin_lock_bh(&tcp_metrics_lock);
+	for (tm = deref_locked_genl(*pp); tm;
+	     pp = &tm->tcpm_next, tm = deref_locked_genl(*pp)) {
+		if (addr_same(&tm->tcpm_addr, &addr)) {
+			*pp = tm->tcpm_next;
+			break;
+		}
+	}
+	spin_unlock_bh(&tcp_metrics_lock);
+	if (!tm)
+		return -ESRCH;
+	kfree_rcu(tm, rcu_head);
+	return 0;
+}
+
+static struct genl_ops tcp_metrics_nl_ops[] = {
+	{
+		.cmd = TCP_METRICS_CMD_GET,
+		.doit = tcp_metrics_nl_cmd_get,
+		.dumpit = tcp_metrics_nl_dump,
+		.policy = tcp_metrics_nl_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = TCP_METRICS_CMD_DEL,
+		.doit = tcp_metrics_nl_cmd_del,
+		.policy = tcp_metrics_nl_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
 static unsigned int tcpmhash_entries;
 static int __init set_tcpmhash_entries(char *str)
 {
@@ -753,5 +1065,21 @@ static __net_initdata struct pernet_operations tcp_net_metrics_ops = {
 
 void __init tcp_metrics_init(void)
 {
-	register_pernet_subsys(&tcp_net_metrics_ops);
+	int ret;
+
+	ret = register_pernet_subsys(&tcp_net_metrics_ops);
+	if (ret < 0)
+		goto cleanup;
+	ret = genl_register_family_with_ops(&tcp_metrics_nl_family,
+					    tcp_metrics_nl_ops,
+					    ARRAY_SIZE(tcp_metrics_nl_ops));
+	if (ret < 0)
+		goto cleanup_subsys;
+	return;
+
+cleanup_subsys:
+	unregister_pernet_subsys(&tcp_net_metrics_ops);
+
+cleanup:
+	return;
 }
-- 
1.7.3.4
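
The resumable dump in tcp_metrics_nl_dump() (row/col saved in cb->args) can be
sketched in plain user-space C. This is only an illustration under stated
assumptions: ex_entry, ex_cursor, ex_dump and the "budget" parameter (standing
in for remaining skb space) are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define EX_ROWS 4

struct ex_entry {
	int id;
	struct ex_entry *next;
};

/* Saved position, playing the role of cb->args[0] and cb->args[1]. */
struct ex_cursor {
	unsigned int row;
	int col;
};

/*
 * Copy up to `budget` ids into out[], resuming from *cur.
 * Returns the number of ids emitted; 0 means the dump is complete.
 */
int ex_dump(struct ex_entry *table[EX_ROWS], struct ex_cursor *cur,
	    int *out, int budget)
{
	unsigned int row;
	int s_col = cur->col, col = s_col;
	int n = 0;

	assert(budget > 0);
	for (row = cur->row; row < EX_ROWS; row++, s_col = 0) {
		struct ex_entry *e;

		for (col = 0, e = table[row]; e; e = e->next, col++) {
			if (col < s_col)
				continue;	/* already sent in an earlier call */
			if (n == budget)
				goto done;	/* out of room: remember this spot */
			out[n++] = e->id;
		}
	}
done:
	cur->row = row;
	cur->col = col;
	return n;
}
```

Calling ex_dump() repeatedly with the same cursor walks the whole table in
chunks. Skipping by column count assumes the chain did not change between
calls, so a dump of a live table is best-effort.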

^ permalink raw reply related

* [PATCH v3 0/3] Interface for TCP Metrics
From: Julian Anastasov @ 2012-09-04 21:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Stephen Hemminger

	This patchset contains 3 patches, one for kernel
and two for iproute2. We add DUMP/GET/DEL support for
genl "tcp_metrics" and minimal support for filtering
by address prefix and family in user space.

	I tested show/del/flush, filtering by family and by IPv4 prefix,
and the output for metrics such as rtt, rttvar, cwnd and ssthresh (after
modifying the route). I did not test the output for fast open.

	Some corrections to the output may be desired.

v2:
- patch 1: remove rcu_assign_pointer, add tcp_metrics_flush_all
- patch 2: properly flush by specifying address, improve man page

v3:
- patch 1: remove synchronize_rcu
- patch 2: create libgenl.h and libgenl.c
- new patch 3: use libgenl.c also in ipl2tp.c

^ permalink raw reply

* Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
From: Steven Rostedt @ 2012-09-04 20:59 UTC (permalink / raw)
  To: Pedro Alves
  Cc: snitzer-H+wXaHxf7aLQT0dZR+AlfA, neilb-l3A5Bk7waGM,
	fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA,
	bfields-uC3wQj2KruNg9hUCZPvPmw,
	paul.gortmaker-CWA4WttNNZF54TAoqtyWWQ,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	aarcange-H+wXaHxf7aLQT0dZR+AlfA, rds-devel-N0ozoZBvEnrZJqsBc5GL+g,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	venkat.x.venkatsubra-QHcLZuEGTsvQT0dZR+AlfA,
	ccaulfie-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	dev-yBygre7rU0TnMu66kgdUjQ, ericvh-Re5JQEeQqe8AvxtiuMwx3w,
	josh-iaAMLnmF4UmaiuxdJuQwMA, lw-BthXqXjhjHXQFUHtdCDX3A,
	Mathieu Desnoyers, Sasha Levin, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, edumazet-hpIqsD4AKlfQT0dZR+AlfA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, ejt-H+wXaHxf7aLQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, Tejun Heo,
	teigland-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <50463883.8080706-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, 2012-09-04 at 18:21 +0100, Pedro Alves wrote:
> On 09/04/2012 06:17 PM, Steven Rostedt wrote:
> > On Tue, 2012-09-04 at 17:40 +0100, Pedro Alves wrote:
> > 
> >> BTW, you can also go a step further and remove the need to close with double }},
> >> with something like:
> >>
> >> #define do_for_each_ftrace_rec(pg, rec)                                          \
> >>         for (pg = ftrace_pages_start, rec = &pg->records[pg->index];             \
> >>              pg && rec == &pg->records[pg->index];                               \
> >>              pg = pg->next)                                                      \
> >>           for (rec = pg->records; rec < &pg->records[pg->index]; rec++)
> >>
> > 
> > Yeah, but why bother? It's hidden in a macro, and the extra '{ }' shows
> > that this is something "special".
> 
> The point of both changes is that there's nothing special in the end
> at all.  It all just works...
> 

It would still fail on a 'break'. The 'while' macro tells us that it is
special, because in the end, it won't work.

-- Steve
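
The general hazard with 'break' inside a two-level loop macro can be shown
with a simplified sketch. This is not the macro quoted above (it drops the
extra condition on rec) and makes no claim about that version; ex_page,
ex_for_each_rec and ex_count_visited are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define EX_RECS 4

struct ex_page {
	int records[EX_RECS];
	int index;		/* number of valid records */
	struct ex_page *next;
};

/* Simplified two-level iterator: pages outside, records inside. */
#define ex_for_each_rec(pg, rec, start)					\
	for (pg = (start); pg; pg = pg->next)				\
		for (rec = pg->records; rec < &pg->records[pg->index]; rec++)

/* Count records visited when the body tries to stop at the first negative one. */
int ex_count_visited(struct ex_page *start)
{
	struct ex_page *pg;
	int *rec;
	int visited = 0;

	assert(start != NULL);
	ex_for_each_rec(pg, rec, start) {
		if (*rec < 0)
			break;	/* leaves only the inner (record) loop */
		visited++;
	}
	return visited;
}
```

Although the caller sees a single loop, the 'break' only ends the walk of the
current page, and iteration continues with the next page.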

^ permalink raw reply

* Re: [PATCH v2 1/2] tcp: add generic netlink support for tcp_metrics
From: Julian Anastasov @ 2012-09-04 21:00 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, shemminger, paulmck
In-Reply-To: <20120904.154708.2099268692603008096.davem@davemloft.net>


	Hello,

On Tue, 4 Sep 2012, David Miller wrote:

> From: Julian Anastasov <ja@ssi.bg>
> Date: Mon, 3 Sep 2012 11:22:15 +0300 (EEST)
> 
> > 	BTW, is it appropriate to use kmem_cache for
> > metrics and as result call_rcu for freeing?
> 
> I think it would work as things are implemented currently in
> slab/slub/slob, however I would not rely upon it.
> 
> If you do move to a SLAB cache for the tcp metrics objects,
> you might consider SLAB_DESTROY_BY_RCU.  It's a very delicate
> facility (read the huge comment in linux/slab.h) but I think
> it provides the semantics we need for TCP metrics blobs.

	ok, I'll not mix it with the genl support
because I'm not prepared to do it soon...

> Looking forward to v3 :-)

	Sending...

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [Bug 47021] New: kernel panic with l2tpv3 & mtu > 1500
From: Eric Dumazet @ 2012-09-04 20:57 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, a1bert
In-Reply-To: <20120904125458.7d97ec38@nehalam.linuxnetplumber.net>

On Tue, 2012-09-04 at 12:54 -0700, Stephen Hemminger wrote:

> 
> I guess nobody ever looked inside this code. That seems like an obvious bug.

It seems that l2tp lacks ECN support as well.

^ permalink raw reply

* [PATCH] can: mcp251x: avoid repeated frame bug
From: Marc Kleine-Budde @ 2012-09-04 20:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-can, Benoît Locher, stable, Marc Kleine-Budde
In-Reply-To: <1346792142-17609-1-git-send-email-mkl@pengutronix.de>

From: Benoît Locher <Benoit.Locher@skf.com>

The MCP2515 has a silicon bug causing repeated frame transmission, see section
5 of MCP2515 Rev. B Silicon Errata Revision G (March 2007).

Basically, setting TXBnCTRL.TXREQ in either SPI mode (00 or 11) will eventually
cause the bug. The workaround proposed by Microchip is to use mode 00 and send
a RTS command on the SPI bus to initiate the transmission.

Cc: <stable@vger.kernel.org>
Signed-off-by: Benoît Locher <Benoit.Locher@skf.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
 drivers/net/can/mcp251x.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/can/mcp251x.c b/drivers/net/can/mcp251x.c
index a580db2..26e7129 100644
--- a/drivers/net/can/mcp251x.c
+++ b/drivers/net/can/mcp251x.c
@@ -83,6 +83,11 @@
 #define INSTRUCTION_LOAD_TXB(n)	(0x40 + 2 * (n))
 #define INSTRUCTION_READ_RXB(n)	(((n) == 0) ? 0x90 : 0x94)
 #define INSTRUCTION_RESET	0xC0
+#define RTS_TXB0		0x01
+#define RTS_TXB1		0x02
+#define RTS_TXB2		0x04
+#define INSTRUCTION_RTS(n)	(0x80 | ((n) & 0x07))
+
 
 /* MPC251x registers */
 #define CANSTAT	      0x0e
@@ -397,6 +402,7 @@ static void mcp251x_hw_tx_frame(struct spi_device *spi, u8 *buf,
 static void mcp251x_hw_tx(struct spi_device *spi, struct can_frame *frame,
 			  int tx_buf_idx)
 {
+	struct mcp251x_priv *priv = dev_get_drvdata(&spi->dev);
 	u32 sid, eid, exide, rtr;
 	u8 buf[SPI_TRANSFER_BUF_LEN];
 
@@ -418,7 +424,10 @@ static void mcp251x_hw_tx(struct spi_device *spi, struct can_frame *frame,
 	buf[TXBDLC_OFF] = (rtr << DLC_RTR_SHIFT) | frame->can_dlc;
 	memcpy(buf + TXBDAT_OFF, frame->data, frame->can_dlc);
 	mcp251x_hw_tx_frame(spi, buf, frame->can_dlc, tx_buf_idx);
-	mcp251x_write_reg(spi, TXBCTRL(tx_buf_idx), TXBCTRL_TXREQ);
+
+	/* use INSTRUCTION_RTS, to avoid "repeated frame problem" */
+	priv->spi_tx_buf[0] = INSTRUCTION_RTS(1 << tx_buf_idx);
+	mcp251x_spi_trans(priv->spi, 1);
 }
 
 static void mcp251x_hw_rx_frame(struct spi_device *spi, u8 *buf,
-- 
1.7.10
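
As a small illustration of the RTS command byte built by the patch, the
sketch below copies the INSTRUCTION_RTS() arithmetic into a stand-alone
helper. EX_INSTRUCTION_RTS and ex_rts_command are hypothetical names used
only here; the actual SPI transfer is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as INSTRUCTION_RTS(n) in the patch: 1000 0nnn. */
#define EX_INSTRUCTION_RTS(n)	(0x80 | ((n) & 0x07))

/* Build the one-byte RTS command for TX buffer 0, 1 or 2. */
uint8_t ex_rts_command(unsigned int tx_buf_idx)
{
	assert(tx_buf_idx < 3);	/* three TX buffers: TXB0, TXB1, TXB2 */
	return (uint8_t)EX_INSTRUCTION_RTS(1u << tx_buf_idx);
}
```

Each low bit selects one transmit buffer, so buffer index 0 yields 0x81,
index 1 yields 0x82 and index 2 yields 0x84.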


^ permalink raw reply related

* pull-request: can 2012-09-04
From: Marc Kleine-Budde @ 2012-09-04 20:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-can

Hello David,

this patch is for the v3.6 release cycle. Benoît Locher fixed a repeated frame
bug in the mcp251x driver. He implemented the workaround suggested by the
errata sheet.

regards, Marc

--

The following changes since commit 5002200599429e83fc13e0d9a2d4788b79515b0c:

  net: qmi_wwan: add several new Gobi devices (2012-09-01 22:49:34 -0400)

are available in the git repository at:

  git://gitorious.org/linux-can/linux-can.git fixes-for-3.6

for you to fetch changes up to cab32f39dcc5b35db96497dc0a026b5dea76e4e7:

  can: mcp251x: avoid repeated frame bug (2012-09-03 20:12:06 +0200)

----------------------------------------------------------------
Benoît Locher (1):
      can: mcp251x: avoid repeated frame bug

 drivers/net/can/mcp251x.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)



^ permalink raw reply

* Re: sctp_close/sk_free: kernel BUG at arch/x86/mm/physaddr.c:18!
From: Marc Kleine-Budde @ 2012-09-04 20:42 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Fengguang Wu, networking, linux-can
In-Reply-To: <87mx15zfze.fsf@xmission.com>

On 09/04/2012 10:32 PM, Eric W. Biederman wrote:
>>> FYI, another kconfig triggering a slightly different oops on tree
>>>
>>>         git://gitorious.org/linux-can/linux-can-next led-trigger
>>
>> This in turn means the problem doesn't come from the CAN patches, as
>> both trees have different CAN patches. I'm adding Eric W. Biederman on
>> Cc as he contributed some sctp patches between v3.6 and net-next/master.
> 
> Anything is possible, but this seems unlikely as I don't think I touched
> anything close to that part of the code.
> 
> This most definitely looks like a memory stomp somewhere.
> 
> sk->inet_sk->inet_opt has a bad value.
> 
> I am puzzled, though: what are we doing with both ipv4 and ipv6 release
> state on the same socket path?  Is this some crazy ipv6 socket
> doing sctp with only ipv4 addresses?

It's Wu's testcase; can you show us the code?
Eric, in case you haven't seen it, this is another oops, from a slightly
different tree (a handful of different CAN patches).

> [  233.046014] kfree_debugcheck: out of range ptr ea6000000bb8h.
> [  233.047399] ------------[ cut here ]------------
> [  233.048393] kernel BUG at /c/kernel-tests/src/stable/mm/slab.c:3074!
> [  233.048393] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [  233.048393] Modules linked in:
> [  233.048393] CPU 0 
> [  233.048393] Pid: 3929, comm: trinity-watchdo Not tainted 3.6.0-rc3+ #4192 Bochs Bochs
> [  233.048393] RIP: 0010:[<ffffffff81169653>]  [<ffffffff81169653>] kfree_debugcheck+0x27/0x2d
> [  233.048393] RSP: 0018:ffff88000facbca8  EFLAGS: 00010092
> [  233.048393] RAX: 0000000000000031 RBX: 0000ea6000000bb8 RCX: 00000000a189a188
> [  233.048393] RDX: 000000000000a189 RSI: ffffffff8108ad32 RDI: ffffffff810d30f9
> [  233.048393] RBP: ffff88000facbcb8 R08: 0000000000000002 R09: ffffffff843846f0
> [  233.048393] R10: ffffffff810ae37c R11: 0000000000000908 R12: 0000000000000202
> [  233.048393] R13: ffffffff823dbd5a R14: ffff88000ec5bea8 R15: ffffffff8363c780
> [  233.048393] FS:  00007faa6899c700(0000) GS:ffff88001f200000(0000) knlGS:0000000000000000
> [  233.048393] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  233.048393] CR2: 00007faa6841019c CR3: 0000000012c82000 CR4: 00000000000006f0
> [  233.048393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  233.048393] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  233.048393] Process trinity-watchdo (pid: 3929, threadinfo ffff88000faca000, task ffff88000faec600)
> [  233.048393] Stack:
> [  233.048393]  0000000000000000 0000ea6000000bb8 ffff88000facbce8 ffffffff8116ad81
> [  233.048393]  ffff88000ff588a0 ffff88000ff58850 ffff88000ff588a0 0000000000000000
> [  233.048393]  ffff88000facbd08 ffffffff823dbd5a ffffffff823dbcb0 ffff88000ff58850
> [  233.048393] Call Trace:
> [  233.048393]  [<ffffffff8116ad81>] kfree+0x5f/0xca
> [  233.048393]  [<ffffffff823dbd5a>] inet_sock_destruct+0xaa/0x13c
> [  233.048393]  [<ffffffff823dbcb0>] ? inet_sk_rebuild_header+0x319/0x319
> [  233.048393]  [<ffffffff8231c307>] __sk_free+0x21/0x14b
> [  233.048393]  [<ffffffff8231c4bd>] sk_free+0x26/0x2a
> [  233.048393]  [<ffffffff825372db>] sctp_close+0x215/0x224
> [  233.048393]  [<ffffffff810d6835>] ? lock_release+0x16f/0x1b9
> [  233.048393]  [<ffffffff823daf12>] inet_release+0x7e/0x85
> [  233.048393]  [<ffffffff82317d15>] sock_release+0x1f/0x77
> [  233.048393]  [<ffffffff82317d94>] sock_close+0x27/0x2b
> [  233.048393]  [<ffffffff81173bbe>] __fput+0x101/0x20a
> [  233.048393]  [<ffffffff81173cd5>] ____fput+0xe/0x10
> [  233.048393]  [<ffffffff810a3794>] task_work_run+0x5d/0x75
> [  233.048393]  [<ffffffff8108da70>] do_exit+0x290/0x7f5
> [  233.048393]  [<ffffffff82707415>] ? retint_swapgs+0x13/0x1b
> [  233.048393]  [<ffffffff8108e23f>] do_group_exit+0x7b/0xba
> [  233.048393]  [<ffffffff8108e295>] sys_exit_group+0x17/0x17
> [  233.048393]  [<ffffffff8270de10>] tracesys+0xdd/0xe2
> [  233.048393] Code: 59 01 5d c3 55 48 89 e5 53 41 50 0f 1f 44 00 00 48 89 fb e8 d4 b0 f0 ff 84 c0 75 11 48 89 de 48 c7 c7 fc fa f7 82 e8 0d 0f 57 01 <0f> 0b 5f 5b 5d c3 55 48 89 e5 0f 1f 44 00 00 48 63 87 d8 00 00 
> [  233.048393] RIP  [<ffffffff81169653>] kfree_debugcheck+0x27/0x2d
> [  233.048393]  RSP <ffff88000facbca8>

Wu is running a bisect, let's hope that gives us a result.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


^ permalink raw reply

* Re: sctp_close/sk_free: kernel BUG at arch/x86/mm/physaddr.c:18!
From: Eric W. Biederman @ 2012-09-04 20:32 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: Fengguang Wu, networking, linux-can
In-Reply-To: <5046361C.5070602@pengutronix.de>

Marc Kleine-Budde <mkl@pengutronix.de> writes:

> On 09/04/2012 04:04 PM, Fengguang Wu wrote:
>> FYI, another kconfig triggering a slightly different oops on tree
>> 
>>         git://gitorious.org/linux-can/linux-can-next led-trigger
>
> This in turn means the problem doesn't come from the CAN patches, as
> both trees have different CAN patches. I'm adding Eric W. Biederman on
> Cc as he contributed some sctp patches between v3.6 and net-next/master.

Anything is possible, but this seems unlikely as I don't think I touched
anything close to that part of the code.

This most definitely looks like a memory stomp somewhere.

sk->inet_sk->inet_opt has a bad value.

I am puzzled, though: what are we doing with both ipv4 and ipv6 release
state on the same socket path?  Is this some crazy ipv6 socket
doing sctp with only ipv4 addresses?

Eric

> Marc
>> [   96.267311] ------------[ cut here ]------------
>> [   96.268294] kernel BUG at /c/kernel-tests/src/stable/arch/x86/mm/physaddr.c:18!
>> [   96.269988] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
>> [   96.270636] Modules linked in:
>> [   96.270636] CPU 0 
>> [   96.270636] Pid: 2116, comm: trinity Not tainted 3.6.0-rc3+ #2679 Bochs Bochs
>> [   96.270636] RIP: 0010:[<ffffffff8102b22b>]  [<ffffffff8102b22b>] __phys_addr+0x46/0x6b
>> [   96.270636] RSP: 0018:ffff880019585c98  EFLAGS: 00010213
>> [   96.270636] RAX: ffff87ffffffffff RBX: 0000ea6000000bb8 RCX: 0000000000000000
>> [   96.270636] RDX: 0000000000000000 RSI: 0000000000000296 RDI: 0000ea6000000bb8
>> [   96.270636] RBP: ffff880019585c98 R08: 0000000000000058 R09: 0000000000000008
>> [   96.270636] R10: 000000000000000a R11: 0000000000000058 R12: ffff8800195f7718
>> [   96.270636] R13: ffffffff816521cf R14: ffffea0000000000 R15: 0000000000000000
>> [   96.270636] FS:  00007fa19b534700(0000) GS:ffff88001f200000(0000) knlGS:0000000000000000
>> [   96.270636] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   96.270636] CR2: 00007fa19b03eba0 CR3: 000000001957b000 CR4: 00000000000006f0
>> [   96.270636] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [   96.270636] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [   96.270636] Process trinity (pid: 2116, threadinfo ffff880019584000, task ffff88001af2c680)
>> [   96.270636] Stack:
>> [   96.270636]  ffff880019585cd8 ffffffff811091d7 0000000000000000 ffff88001b1ef200
>> [   96.270636]  ffff88001b1ef4d0 0000000000000000 ffff88001b1617b0 0000000000000000
>> [   96.270636]  ffff880019585cf8 ffffffff816521cf ffff88001b1ef200 ffff88001b1ef248
>> [   96.270636] Call Trace:
>> [   96.270636]  [<ffffffff811091d7>] kfree+0x63/0x162
>> [   96.270636]  [<ffffffff816521cf>] inet_sock_destruct+0x112/0x1ca
>> [   96.270636]  [<ffffffff815f6fa4>] __sk_free+0x1d/0x114
>> [   96.270636]  [<ffffffff815f710b>] sk_free+0x1c/0x1e
>> [   96.270636]  [<ffffffff816d59d5>] sctp_close+0x21a/0x229
>> [   96.270636]  [<ffffffff810810f6>] ? lock_release_holdtime.part.6+0xb2/0xb7
>> [   96.270636]  [<ffffffff81651b3e>] ? inet_release+0x65/0xc3
>> [   96.270636]  [<ffffffff81651b93>] inet_release+0xba/0xc3
>> [   96.270636]  [<ffffffff81651af9>] ? inet_release+0x20/0xc3
>> [   96.270636]  [<ffffffff81674134>] inet6_release+0x30/0x3c
>> [   96.270636]  [<ffffffff815f2317>] sock_release+0x1f/0x77
>> [   96.270636]  [<ffffffff815f2396>] sock_close+0x27/0x2b
>> [   96.270636]  [<ffffffff8110ec22>] __fput+0xf0/0x24b
>> [   96.270636]  [<ffffffff8110ed8b>] ____fput+0xe/0x10
>> [   96.270636]  [<ffffffff8104f370>] task_work_run+0x5d/0x75
>> [   96.270636]  [<ffffffff81038a66>] do_exit+0x26b/0x7d7
>> [   96.270636]  [<ffffffff81725a95>] ? retint_swapgs+0x13/0x1b
>> [   96.270636]  [<ffffffff8103925b>] do_group_exit+0x7b/0xba
>> [   96.270636]  [<ffffffff810392b1>] sys_exit_group+0x17/0x17
>> [   96.270636]  [<ffffffff8172c78e>] tracesys+0xd0/0xd5
>> [   96.270636] Code: 00 80 48 01 c7 48 81 ff ff ff ff 1f 76 02 0f 0b 48 89 f8 48 03 05 f6 bd ae 00 eb 32 48 b8 ff ff ff ff ff 87 ff ff 48 39 c7 77 02 <0f> 0b 0f b6 0d 55 57 ba 00 48 b8 00 00 00 00 00 78 00 00 48 01 
>> [   96.270636] RIP  [<ffffffff8102b22b>] __phys_addr+0x46/0x6b
>> [   96.270636]  RSP <ffff880019585c98>


^ permalink raw reply

* Re: [PATCH 2/2] RDMA/cxgb4: Update RDMA/cxgb4 due to macro definition removal in cxgb4 driver
From: David Miller @ 2012-09-04 19:59 UTC (permalink / raw)
  To: vipul-ut6Up61K2wZBDgjK7y7TUQ
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	roland-BHEL68pLQRGGvPXPguhicg, divy-ut6Up61K2wZBDgjK7y7TUQ,
	dm-ut6Up61K2wZBDgjK7y7TUQ, kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW,
	santosh-ut6Up61K2wZBDgjK7y7TUQ, sivasu-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <1346405072-24561-3-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

From: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Date: Fri, 31 Aug 2012 14:54:32 +0530

> cxgb4 driver removed the duplicate definitions of registers which requires
> update in RDMA/cxgb4 driver.
> 
> Signed-off-by: Santosh Rastapur <santosh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
> Signed-off-by: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
> Reviewed-by: Sivakumar Subramani <sivasu-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

If you do this in a separate change, the build is broken between the
two changes.

Never do this; the tree must be fully bisectable, and must build and
work at each and every step along the way.

^ permalink raw reply

* Re: [RFC PATCH] ipv6: fix handling of blackhole and prohibit routes
From: David Miller @ 2012-09-04 19:58 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <503F78C8.3070807@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 30 Aug 2012 16:29:28 +0200

> Comments are welcome.

I don't see why we have to create new flags for this.

Handle it like ipv4, where the RTN_* type dictates whether the
route is blackhole, prohibit, or other type of route.
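
A type-driven approach of this kind could look roughly like the sketch below.
It is an assumption-laden illustration, not kernel code: the enum is a local
stand-in for the RTN_* types, and the table, its error values and the helper
ex_route_lookup_error are hypothetical choices made for this example.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for RTN_* route types (values are illustrative only). */
enum ex_route_type {
	EX_RTN_UNICAST,
	EX_RTN_BLACKHOLE,
	EX_RTN_UNREACHABLE,
	EX_RTN_PROHIBIT,
	EX_RTN_MAX,
};

/* The route type alone selects the lookup result; no extra flags needed. */
static const int ex_route_error[EX_RTN_MAX] = {
	[EX_RTN_UNICAST]	= 0,
	[EX_RTN_BLACKHOLE]	= -EINVAL,
	[EX_RTN_UNREACHABLE]	= -EHOSTUNREACH,
	[EX_RTN_PROHIBIT]	= -EACCES,
};

int ex_route_lookup_error(enum ex_route_type type)
{
	assert(type < EX_RTN_MAX);
	return ex_route_error[type];
}
```

With a table indexed by type, adding another special route kind means adding
one table entry rather than a new flag bit.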

^ permalink raw reply

* Re: [Bug 47021] New: kernel panic with l2tpv3 & mtu > 1500
From: David Miller @ 2012-09-04 19:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, netdev, a1bert
In-Reply-To: <1346788334.13121.82.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 04 Sep 2012 21:52:14 +0200

> [PATCH] l2tp: fix a typo in l2tp_eth_dev_recv()
> 
> While investigating l2tp bug, I hit a bug in eth_type_trans(),
> because not enough bytes were pulled in skb head.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

LoL, applied and queued up for -stable.

^ permalink raw reply

* Re: [Bug 47021] New: kernel panic with l2tpv3 & mtu > 1500
From: Stephen Hemminger @ 2012-09-04 19:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, a1bert
In-Reply-To: <1346788334.13121.82.camel@edumazet-glaptop>

On Tue, 04 Sep 2012 21:52:14 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> From: Eric Dumazet <edumazet@google.com>
> 
> On Tue, 2012-09-04 at 09:07 -0700, Stephen Hemminger wrote:
> > 
> > Begin forwarded message:
> > 
> > Date: Tue,  4 Sep 2012 16:06:06 +0000 (UTC)
> > From: bugzilla-daemon@bugzilla.kernel.org
> > To: shemminger@linux-foundation.org
> > Subject: [Bug 47021] New: kernel panic with l2tpv3 & mtu > 1500
> > 
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=47021
> > 
> >            Summary: kernel panic with l2tpv3 & mtu > 1500
> >            Product: Networking
> >            Version: 2.5
> >     Kernel Version: 3.2.28
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >         AssignedTo: shemminger@linux-foundation.org
> >         ReportedBy: a1bert@atlas.cz
> >         Regression: No
> > 
> > 
> > first l2tpv3 packet that enters physical device with MTU > 1500 causes kernel
> > panic
> > 
> > tunnel created using:
> > 
> > l2tpv3tun add tunnel tunnel_id 1 peer_tunnel_id 1 encap udp udp_sport 5001
> > udp_dport 5001 local 192.168.1.1 remote 192.168.1.2
> > 
> > l2tpv3tun add session tunnel_id 1 session_id 1 peer_session_id 1 dev l2tpeth1
> > 
> > 
> > reproducible: always
> > 
> 
> Seems following patch is needed, not sure if it helps
> 
> [PATCH] l2tp: fix a typo in l2tp_eth_dev_recv()
> 
> While investigating l2tp bug, I hit a bug in eth_type_trans(),
> because not enough bytes were pulled in skb head.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/l2tp/l2tp_eth.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
> index f9ee74d..3bfb34a 100644
> --- a/net/l2tp/l2tp_eth.c
> +++ b/net/l2tp/l2tp_eth.c
> @@ -153,7 +153,7 @@ static void l2tp_eth_dev_recv(struct l2tp_session *session, struct sk_buff *skb,
>  		print_hex_dump_bytes("", DUMP_PREFIX_OFFSET, skb->data, length);
>  	}
>  
> -	if (!pskb_may_pull(skb, sizeof(ETH_HLEN)))
> +	if (!pskb_may_pull(skb, ETH_HLEN))
>  		goto error;

I guess nobody ever looked inside this code. That seems like an obvious bug.
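
Why the typo matters can be checked in a few lines of stand-alone C. The
sketch assumes only that ETH_HLEN is the integer constant 14; EX_ETH_HLEN and
the two helpers are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>

#define EX_ETH_HLEN 14	/* Ethernet header length in bytes */

/* sizeof applied to an integer constant yields the size of int. */
static_assert(sizeof(EX_ETH_HLEN) == sizeof(int),
	      "sizeof sees an int constant, not a header length");

/* Length the buggy call requested from pskb_may_pull(). */
size_t ex_buggy_pull_len(void)
{
	return sizeof(EX_ETH_HLEN);
}

/* Length the fixed call requests. */
size_t ex_fixed_pull_len(void)
{
	return EX_ETH_HLEN;
}
```

On common platforms the buggy form asks for only sizeof(int) bytes, fewer
than the 14 bytes of header that eth_type_trans() goes on to read.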

^ permalink raw reply

* Re: [PATCH] net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries
From: David Miller @ 2012-09-04 19:52 UTC (permalink / raw)
  To: yamato; +Cc: netdev, linux-kernel
In-Reply-To: <1346273069-5390-1-git-send-email-yamato@redhat.com>

From: Masatake YAMATO <yamato@redhat.com>
Date: Thu, 30 Aug 2012 05:44:29 +0900

> lsof reports some of socket descriptors as "can't identify protocol" like:
> 
>     [yamato@localhost]/tmp% sudo lsof | grep dbus | grep iden
>     dbus-daem   652          dbus    6u     sock ... 17812 can't identify protocol
>     dbus-daem   652          dbus   34u     sock ... 24689 can't identify protocol
>     dbus-daem   652          dbus   42u     sock ... 24739 can't identify protocol
>     dbus-daem   652          dbus   48u     sock ... 22329 can't identify protocol
>     ...
> 
> lsof cannot resolve the protocol used by a socket because procfs
> doesn't provide a mapping between the inode number on sockfs and the
> protocol type of the socket.
> 
> To improve the situation, this patch adds an extended attribute named
> 'system.sockprotoname' in which the protocol name for
> /proc/PID/fd/SOCKET is stored. lsof can then learn the protocol for a
> given /proc/PID/fd/SOCKET with the getxattr system call.
> 
> A few weeks ago I submitted a patch for the same purpose. That patch
> introduced /proc/net/sockfs, which enumerates the inodes and protocols
> of all sockets alive on a system. However, it was rejected because (1)
> a global lock was needed, and (2) the layout of struct socket was
> changed by the patch.
> 
> This patch doesn't use any global lock; and doesn't change the layout
> of any structs.
> 
> In this patch, a protocol name is stored in dentry->d_name of sockfs
> when a new socket is associated with a file descriptor. Before this
> patch dentry->d_name was not used; it was just filled with an empty
> string. lsof may use an extended attribute named
> 'system.sockprotoname' to retrieve the value of dentry->d_name.
> 
> It would be nice to see the protocol name with ls -l
> /proc/PID/fd. However, "socket:[#INODE]", the name format returned
> from sockfs_dname(), was already defined. To keep compatibility
> between kernel and userland, the extended attribute is used to
> expose the value of dentry->d_name instead.
> 
> Signed-off-by: Masatake YAMATO <yamato@redhat.com>

This looks a lot more reasonable than your previous attempt.

Applied to net-next, thanks a lot.
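
For readers who want to try the attribute from userspace, here is a minimal sketch of the lookup described in the patch text. The helper names sock_proto_name() and print_tcp_socket_proto() are made up for illustration; only getxattr(2) and the attribute name 'system.sockprotoname' come from the description above. The sketch assumes a kernel that carries this patch and treats any failure as "not available" rather than guessing at specific error codes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Read the protocol name of the socket behind /proc/self/fd/<fd>.
 * Returns the length of the name, or -1 if the attribute cannot be read
 * (kernel without the patch, non-socket descriptor, restricted /proc).
 */
static ssize_t sock_proto_name(int fd, char *buf, size_t buflen)
{
	char path[64];
	ssize_t n;

	assert(buf != NULL && buflen > 1);
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);

	/* getxattr() follows the /proc/PID/fd symlink to the socket inode. */
	n = getxattr(path, "system.sockprotoname", buf, buflen - 1);
	if (n < 0)
		return -1;

	/* Terminate explicitly and measure with strlen(), so it does not
	 * matter whether the returned size counts a trailing NUL. */
	buf[n] = '\0';
	return (ssize_t)strlen(buf);
}

/* Usage sketch: open a TCP socket and print what the kernel reports. */
static int print_tcp_socket_proto(void)
{
	char name[32];
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (sock_proto_name(fd, name, sizeof(name)) >= 0)
		printf("fd %d: %s\n", fd, name);
	else
		printf("fd %d: protocol name not available\n", fd);
	close(fd);
	return 0;
}
```

A tool like lsof would do the same lookup against /proc/PID/fd/N of another process, subject to the usual /proc permission checks.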


* Re: Fw: [Bug 47021] New: kernel panic with l2tpv3 & mtu > 1500
From: Eric Dumazet @ 2012-09-04 19:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, a1bert
In-Reply-To: <20120904090756.71b34feb@nehalam.linuxnetplumber.net>

From: Eric Dumazet <edumazet@google.com>

On Tue, 2012-09-04 at 09:07 -0700, Stephen Hemminger wrote:
> 
> Begin forwarded message:
> 
> Date: Tue,  4 Sep 2012 16:06:06 +0000 (UTC)
> From: bugzilla-daemon@bugzilla.kernel.org
> To: shemminger@linux-foundation.org
> Subject: [Bug 47021] New: kernel panic with l2tpv3 & mtu > 1500
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=47021
> 
>            Summary: kernel panic with l2tpv3 & mtu > 1500
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 3.2.28
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: a1bert@atlas.cz
>         Regression: No
> 
> 
> The first l2tpv3 packet that enters a physical device with MTU > 1500 causes
> a kernel panic.
> 
> Tunnel created using:
> 
> l2tpv3tun add tunnel tunnel_id 1 peer_tunnel_id 1 encap udp udp_sport 5001
> udp_dport 5001 local 192.168.1.1 remote 192.168.1.2
> 
> l2tpv3tun add session tunnel_id 1 session_id 1 peer_session_id 1 dev l2tpeth1
> 
> 
> reproducible: always
> 

It seems the following patch is needed; not sure if it helps.

[PATCH] l2tp: fix a typo in l2tp_eth_dev_recv()

While investigating an l2tp bug, I hit a bug in eth_type_trans(),
because not enough bytes were pulled into the skb head.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/l2tp/l2tp_eth.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index f9ee74d..3bfb34a 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -153,7 +153,7 @@ static void l2tp_eth_dev_recv(struct l2tp_session *session, struct sk_buff *skb,
 		print_hex_dump_bytes("", DUMP_PREFIX_OFFSET, skb->data, length);
 	}
 
-	if (!pskb_may_pull(skb, sizeof(ETH_HLEN)))
+	if (!pskb_may_pull(skb, ETH_HLEN))
 		goto error;
 
 	secpath_reset(skb);


* Re: [PATCH V2 09/12] net/eipoib: Add main driver functionality
From: Or Gerlitz @ 2012-09-04 19:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Michael S. Tsirkin, Or Gerlitz, davem, roland, netdev, sean.hefty,
	Erez Shitrit, Ali Ayoub, Doug Ledford
In-Reply-To: <87a9x537r0.fsf@xmission.com>

On Tue, Sep 4, 2012 at 10:31 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Or Gerlitz <or.gerlitz@gmail.com> writes:
>> On Tue, Sep 4, 2012 at 12:22 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>>> Documentation we will fix,
>>> And just to stress the point, document the limitations as well.
>> Sure. Not that I see concrete limitations for the **user** at this point, but
>> if there are any, we will document them clearly.

> All ethernet protocols not working except IPv4 is a huge concrete limitation.

Oh, sure, that WILL be documented. Currently eIPoIB can deliver only what
IPoIB can, which is IPv4, ARP/RARP, and IPv6+ND; for IPv6, see below.

> So far you are still playing with a design that is strongly NOT
> ethernet.  So calling it eIPoIB will continue to be a LIE.

> You are still playing with an implementation that doesn't even dream
> of supporting IPv6 which makes it so far from ethernet I can't imagine
> anyone taking your code seriously.

This design can and will support IPv6. The IPv6 ND handling will follow the
path we are discussing now for the IPv4 ARPs, e.g. not within the driver.
Could you be more specific?

> Any implementation that breaks a naive ARP implementation also breaks
> IPv6.  Not to mention everything else that runs over ethernet.

I'm not sure I follow the comment about a naive implementation. The eIPoIB
solution will include a kernel driver along with a supporting user-space
portion; this follows a comment made by the community about mangling ARPs
in a network driver.

> If you are clever you can use the current IPoIB hardware acceleration
> but you need to do something different so that you can either encode
> or imply the MAC address so you won't have to munge ethernet protocols.

I'm not sure I follow here either. We have a new design under which the
original VM MAC is preserved on the RX side; we will no longer generate a
MAC for the Ethernet frames sent by VMs that we reconstruct on the RX side.

> Just for fun you might want to consider what it takes to support 2 VMs
> in the same VLAN that share the same IP address (but different MAC
> addresses) for failover purposes.

Will look into that.

Or.

^ permalink raw reply

* Re: [PATCH v2 1/2] tcp: add generic netlink support for tcp_metrics
From: David Miller @ 2012-09-04 19:47 UTC (permalink / raw)
  To: ja; +Cc: eric.dumazet, netdev, shemminger, paulmck
In-Reply-To: <alpine.LFD.2.00.1209031041340.1663@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Mon, 3 Sep 2012 11:22:15 +0300 (EEST)

> 	BTW, is it appropriate to use kmem_cache for
> metrics and, as a result, call_rcu for freeing?

I think it would work as things are implemented currently in
slab/slub/slob, however I would not rely upon it.

If you do move to a SLAB cache for the tcp metrics objects,
you might consider SLAB_DESTROY_BY_RCU.  It's a very delicate
facility (read the huge comment in linux/slab.h) but I think
it provides the semantics we need for TCP metrics blobs.

Looking forward to v3 :-)
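
As a rough illustration of why SLAB_DESTROY_BY_RCU is delicate: the flag only keeps the underlying memory type-stable across an RCU grace period, while an individual object may be freed and reused for a different key at any time. A lockless reader therefore has to pin the object and then re-check its identity. The userspace sketch below models just that validation step with C11 atomics. struct metrics_obj and all function names are hypothetical, and in the kernel this step would run under rcu_read_lock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a TCP metrics entry in a type-stable slab. */
struct metrics_obj {
	atomic_int refcnt;	/* 0 means the slot is currently free */
	int key;		/* identity; changes when the slot is reused */
};

/* Take a reference only if the object is still live (refcnt != 0). */
static int get_unless_zero(struct metrics_obj *obj)
{
	int old = atomic_load(&obj->refcnt);

	while (old != 0) {
		/* On failure, `old` is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&obj->refcnt, &old, old + 1))
			return 1;
	}
	return 0;
}

/* Drop a reference previously taken. */
static void put(struct metrics_obj *obj)
{
	assert(atomic_load(&obj->refcnt) > 0);
	atomic_fetch_sub(&obj->refcnt, 1);
}

/*
 * `candidate` was found by a lockless list walk, so by now the slot may
 * have been freed or reused for another key. Pin it first, then verify
 * it is still the object we were looking for.
 */
static struct metrics_obj *lookup_validate(struct metrics_obj *candidate,
					   int key)
{
	if (!get_unless_zero(candidate))
		return NULL;		/* freed: caller restarts the walk */
	if (candidate->key != key) {
		put(candidate);		/* reused for another key: restart */
		return NULL;
	}
	return candidate;		/* pinned and verified */
}
```

With plain call_rcu freeing the second check is unnecessary, because an object cannot be reused while a reader still holds a pointer to it; that is the trade-off between the two approaches.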

