Netdev List


* Re: [Ksummit-2010-discuss] [v2] Remaining BKL users, what to do
From: Ville Syrjälä @ 2010-10-20 16:14 UTC
  To: Dave Airlie
  Cc: Arnd Bergmann, Jan Kara, Greg KH, Anders Larsen, dri-devel,
	ksummit-2010-discuss, Mikulas Patocka, codalist, Theodore Kilgore,
	Bryan Schumaker, Christoph Hellwig, Petr Vandrovec,
	Arnaldo Carvalho de Melo, linux-media, Samuel Ortiz,
	Evgeniy Dushistov, Steven Rostedt, autofs, Jan Harkes, netdev,
	linux-kernel, linux-fsdevel, Andrew Hendry
In-Reply-To: <AANLkTinw=Wzh2Ucj6zKSoqC8J3Yq9xDr3mKMUB7K6Yyo@mail.gmail.com>

On Wed, Oct 20, 2010 at 06:50:58AM +1000, Dave Airlie wrote:
> On Tue, Oct 19, 2010 at 11:26 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tuesday 19 October 2010, Arnd Bergmann wrote:
> >> On Tuesday 19 October 2010 06:52:32 Dave Airlie wrote:
> >> > > I might be able to find some hardware still lying around here that uses an
> >> > > i810. Not sure unless I go hunting it. But I get the impression that if
> >> > > the kernel is a single-CPU kernel there is not any problem anyway? Don't
> >> > > distros offer a non-smp kernel as an installation option in case the user
> >> > > needs it? So in reality how big a problem is this?
> >> >
> >> > Not anymore, which is my old point of making a fuss. Nowadays in the
> >> > modern distro world, we supply a single kernel that can at runtime
> >> > decide if it's running on SMP or UP and rewrite the text section
> >> > appropriately with locks etc. It's like magic, and something like
> >> > marking drivers as BROKEN_ON_SMP at compile time is really wrong when
> >> > what you want now is a runtime warning if someone tries to hotplug a
> >> > CPU with a known iffy driver loaded or if someone tries to load the
> >> > driver when we are already in SMP mode.
> >>
> >> We could make the driver run-time non-SMP by adding
> >>
> >>       if (num_present_cpus() > 1) {
> >>               pr_err("i810 no longer supports SMP\n");
> >>               return -EINVAL;
> >>       }
> >>
> >> to the init function. That would cover the vast majority of the
> >> users of i810 hardware, I guess.
> >
> > Some research showed that Intel never supported i810/i815 SMP setups,
> > but there was indeed one company (http://www.acorpusa.com at the time,
> > now owned by a domain squatter) that made i815E based dual Pentium-III
> > boards like this one: http://cgi.ebay.com/280319795096
> 
> Also, that board uses the i815EP, which has no on-board GPU enabled (P means no on-board GPU).

A quick search seems to indicate that an i815E variant also existed.

-- 
Ville Syrjälä
syrjala@sci.fi
http://www.sci.fi/~syrjala/


* Re: [RFC PATCH 1/9] ipvs network name space aware
From: Paul E. McKenney @ 2010-10-20 16:02 UTC
  To: Hans Schillstrom
  Cc: Daniel Lezcano, lvs-devel@vger.kernel.org, netdev@vger.kernel.org,
	netfilter-devel@vger.kernel.org, horms@verge.net.au, ja@ssi.bg,
	wensong@linux-vs.org
In-Reply-To: <201010201025.20950.hans.schillstrom@ericsson.com>

On Wed, Oct 20, 2010 at 10:25:19AM +0200, Hans Schillstrom wrote:
> On Tuesday 19 October 2010 20:44:36 Paul E. McKenney wrote:
> > On Mon, Oct 18, 2010 at 03:23:48PM +0200, Hans Schillstrom wrote:
> > > On Monday 18 October 2010 13:37:38 Daniel Lezcano wrote:
> > > > On 10/18/2010 11:54 AM, Hans Schillstrom wrote:
> > > > > On Monday 18 October 2010 10:59:25 Daniel Lezcano wrote:
> > > > >
> > > > >> On 10/08/2010 01:16 PM, Hans Schillstrom wrote:
> > > > >>
> > > > >>> This part contains the include files
> > > > >>> where include/net/netns/ip_vs.h is new and contains all moved vars.
> > > > >>>
> > > > >>> SUMMARY
> > > > >>>
> > > > >>>    include/net/ip_vs.h                     |  136 ++++---
> > > > >>>    include/net/net_namespace.h             |    2 +
> > > > >>>    include/net/netns/ip_vs.h               |  112 +++++
> > > > >>>
> > > > >>> Signed-off-by:Hans Schillstrom<hans.schillstrom@ericsson.com>
> > > > >>> ---
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >> [ ... ]
> > > > >>
> > > > >>
> > > > >>>    #ifdef CONFIG_IP_VS_IPV6
> > > > >>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> > > > >>> index bd10a79..b59cdc5 100644
> > > > >>> --- a/include/net/net_namespace.h
> > > > >>> +++ b/include/net/net_namespace.h
> > > > >>> @@ -15,6 +15,7 @@
> > > > >>>    #include<net/netns/ipv4.h>
> > > > >>>    #include<net/netns/ipv6.h>
> > > > >>>    #include<net/netns/dccp.h>
> > > > >>> +#include<net/netns/ip_vs.h>
> > > > >>>    #include<net/netns/x_tables.h>
> > > > >>>    #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
> > > > >>>    #include<net/netns/conntrack.h>
> > > > >>> @@ -91,6 +92,7 @@ struct net {
> > > > >>>    	struct sk_buff_head	wext_nlevents;
> > > > >>>    #endif
> > > > >>>    	struct net_generic	*gen;
> > > > >>> +	struct netns_ipvs       *ipvs;
> > > > >>>    };
> > > > >>>
> > > > >>>
> > > > >> IMHO, it would be better to use the net_generic infra-structure instead
> > > > >> of adding a new field in the netns structure.
> > > > >>
> > > > >>
> > > > >>
> > > > > > I realized that too, but the performance penalty is quite high with net_generic :-(
> > > > > > But on the other hand, if you are going to backport it (without recompiling the kernel),
> > > > > > you are going to need it!
> > > > >
> > > >
> > > > Hmm, yes. We don't want to have the init_net_ns performances to be impacted.
> > > >
> > > > You use here a pointer which will be dereferenced like the net_generic,
> > > > I don't think there will be
> > > > a big difference between using net_generic and using a pointer in the
> > > > net namespace structure.
> > > >
> > > > The difference is the id usage, but this one is based on the idr which
> > > > is quite fast.
> > > >
> > >
> > > I'm not so sure about that, have a look at net_generic and rcu_read_lock
> > > and compare
> > >  ipvs = net->ipvs;
> > > vs.
> > >  ipvs = net_generic(net, id)
> > >
> > > static inline void *net_generic(struct net *net, int id)
> > > {
> > > 	struct net_generic *ng;
> > > 	void *ptr;
> > >
> > > 	rcu_read_lock();
> > > 	ng = rcu_dereference(net->gen);
> > > 	BUG_ON(id == 0 || id > ng->len);
> > > 	ptr = ng->ptr[id - 1];
> > > 	rcu_read_unlock();
> > >
> > > 	return ptr;
> > > }
> > > ...
> > > static inline void rcu_read_lock(void)
> > > {
> > >         __rcu_read_lock();
> > >         __acquire(RCU);
> > >         rcu_read_acquire();
> > > }
> > >
> > > Another way of doing it is to pass the ipvs ptr instead of the net ptr,
> > > and add *net to the ipvs struct.
> > >
> > > > We should experiment a bit here to compare both solutions.
> > > > Agreed
> > > >
> > > I single stepped through the rcu_read_lock() on a x86_64
> > > and it's quite many "stepi" that you need to enter :-(
> >
> > Was this by chance with lockdep enabled?  If not, could you please send
> > your .config?
> >
> > 							Thanx, Paul
> 
> No lockdep, but what I meant is that net_generic is not as fast as a plain ptr->xxx.
> IPVS has hooks in the netfilter chain, and gets a huge amount of packets.
> 
> I don't think IPVS is a candidate for net_generic; it should have its own part in "struct net".
> That was my point.
> (No criticism of locking or net_generic.)

You said that there were a lot of "stepi" commands to get through
rcu_read_lock() on x86_64.  This is quite surprising, especially if you
built with CONFIG_TREE_RCU.  Even if you built with CONFIG_TREE_PREEMPT_RCU,
you should only see something like the following from rcu_read_lock():

000000b7 <__rcu_read_lock>:
      b7:	55                   	push   %ebp
      b8:	64 a1 00 00 00 00    	mov    %fs:0x0,%eax
      be:	ff 80 80 01 00 00    	incl   0x180(%eax)
      c4:	89 e5                	mov    %esp,%ebp
      c6:	5d                   	pop    %ebp
      c7:	c3                   	ret    

Unless you have some sort of debugging options turned on.  Or unless
six instructions count as "quite many" stepi commands.  ;-)

So I am quite curious, independent of whether or not IPVS is a candidate
for net_generic.  That choice for IPVS is not mine to make, and I will
trust the relevant developers and maintainers to make the right choice,
whether that be RCU or something else.  Even I do not claim that RCU
is the right tool for all jobs!  ;-)

							Thanx, Paul


* dead code in networking core
From: Stephen Hemminger @ 2010-10-20 15:52 UTC
  To: David Miller; +Cc: netdev

The following APIs are exported but unused in current code:
  net/core/dev_addr_lists.o
    __hw_addr_del_multiple
    dev_addr_add_multiple
    dev_addr_del_multiple

  net/core/timestamping.o
    skb_clone_tx_timestamp
    skb_complete_tx_timestamp

Any plans to use these soon?


* Re: [RFC PATCH 5/9] ipvs network name space aware
From: Simon Horman @ 2010-10-20 15:21 UTC
  To: Hans Schillstrom
  Cc: lvs-devel, netdev, netfilter-devel, ja, wensong, daniel.lezcano
In-Reply-To: <201010081317.04167.hans.schillstrom@ericsson.com>

On Fri, Oct 08, 2010 at 01:17:02PM +0200, Hans Schillstrom wrote:
> This patch just contains ip_vs_ctl
> 
> Signed-off-by:Hans Schillstrom <hans.schillstrom@ericsson.com>
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index ca8ec8c..7e99cbc 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c

[ snip ]

> @@ -3377,62 +3383,131 @@ static void ip_vs_genl_unregister(void)
>  }
> 
>  /* End of Generic Netlink interface definitions */
> +/*
> + * per netns intit/exit func.
> + */
> +int /*__net_init*/ __ip_vs_control_init(struct net *net)

Can you describe why __net_init is commented out?

[ snip ]


* [PATCH -next 2/2] ibmveth: Free irq on error path
From: Denis Kirjanov @ 2010-10-20 14:21 UTC
  To: davem; +Cc: rcj, netdev

Free irq on error path.

Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
---
 drivers/net/ibmveth.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ibmveth.c b/drivers/net/ibmveth.c
index 2ae8336..c454b45 100644
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -641,7 +641,7 @@ static int ibmveth_open(struct net_device *netdev)
 	if (!adapter->bounce_buffer) {
 		netdev_err(netdev, "unable to allocate bounce buffer\n");
 		rc = -ENOMEM;
-		goto err_out;
+		goto err_out_free_irq;
 	}
 	adapter->bounce_buffer_dma =
 	    dma_map_single(&adapter->vdev->dev, adapter->bounce_buffer,
@@ -649,7 +649,7 @@ static int ibmveth_open(struct net_device *netdev)
 	if (dma_mapping_error(dev, adapter->bounce_buffer_dma)) {
 		netdev_err(netdev, "unable to map bounce buffer\n");
 		rc = -ENOMEM;
-		goto err_out;
+		goto err_out_free_irq;
 	}
 
 	netdev_dbg(netdev, "initial replenish cycle\n");
@@ -661,6 +661,8 @@ static int ibmveth_open(struct net_device *netdev)
 
 	return 0;
 
+err_out_free_irq:
+	free_irq(netdev->irq, netdev);
 err_out:
 	ibmveth_cleanup(adapter);
 	napi_disable(&adapter->napi);
-- 
1.7.2.2



* [PATCH -next 1/2] ibmveth: Cleanup error handling inside ibmveth_open
From: Denis Kirjanov @ 2010-10-20 14:21 UTC
  To: davem; +Cc: rcj, netdev

Remove duplicated code in one place.

Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
---
 drivers/net/ibmveth.c |   44 ++++++++++++++++++++------------------------
 1 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ibmveth.c b/drivers/net/ibmveth.c
index b3e157e..2ae8336 100644
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -546,9 +546,8 @@ static int ibmveth_open(struct net_device *netdev)
 	if (!adapter->buffer_list_addr || !adapter->filter_list_addr) {
 		netdev_err(netdev, "unable to allocate filter or buffer list "
 			   "pages\n");
-		ibmveth_cleanup(adapter);
-		napi_disable(&adapter->napi);
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto err_out;
 	}
 
 	adapter->rx_queue.queue_len = sizeof(struct ibmveth_rx_q_entry) *
@@ -558,9 +557,8 @@ static int ibmveth_open(struct net_device *netdev)
 
 	if (!adapter->rx_queue.queue_addr) {
 		netdev_err(netdev, "unable to allocate rx queue pages\n");
-		ibmveth_cleanup(adapter);
-		napi_disable(&adapter->napi);
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto err_out;
 	}
 
 	dev = &adapter->vdev->dev;
@@ -578,9 +576,8 @@ static int ibmveth_open(struct net_device *netdev)
 	    (dma_mapping_error(dev, adapter->rx_queue.queue_dma))) {
 		netdev_err(netdev, "unable to map filter or buffer list "
 			   "pages\n");
-		ibmveth_cleanup(adapter);
-		napi_disable(&adapter->napi);
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto err_out;
 	}
 
 	adapter->rx_queue.index = 0;
@@ -611,9 +608,8 @@ static int ibmveth_open(struct net_device *netdev)
 				     adapter->filter_list_dma,
 				     rxq_desc.desc,
 				     mac_address);
-		ibmveth_cleanup(adapter);
-		napi_disable(&adapter->napi);
-		return -ENONET;
+		rc = -ENONET;
+		goto err_out;
 	}
 
 	for (i = 0; i < IBMVETH_NUM_BUFF_POOLS; i++) {
@@ -622,9 +618,8 @@ static int ibmveth_open(struct net_device *netdev)
 		if (ibmveth_alloc_buffer_pool(&adapter->rx_buff_pool[i])) {
 			netdev_err(netdev, "unable to alloc pool\n");
 			adapter->rx_buff_pool[i].active = 0;
-			ibmveth_cleanup(adapter);
-			napi_disable(&adapter->napi);
-			return -ENOMEM ;
+			rc = -ENOMEM;
+			goto err_out;
 		}
 	}
 
@@ -638,27 +633,23 @@ static int ibmveth_open(struct net_device *netdev)
 			rc = h_free_logical_lan(adapter->vdev->unit_address);
 		} while (H_IS_LONG_BUSY(rc) || (rc == H_BUSY));
 
-		ibmveth_cleanup(adapter);
-		napi_disable(&adapter->napi);
-		return rc;
+		goto err_out;
 	}
 
 	adapter->bounce_buffer =
 	    kmalloc(netdev->mtu + IBMVETH_BUFF_OH, GFP_KERNEL);
 	if (!adapter->bounce_buffer) {
 		netdev_err(netdev, "unable to allocate bounce buffer\n");
-		ibmveth_cleanup(adapter);
-		napi_disable(&adapter->napi);
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto err_out;
 	}
 	adapter->bounce_buffer_dma =
 	    dma_map_single(&adapter->vdev->dev, adapter->bounce_buffer,
 			   netdev->mtu + IBMVETH_BUFF_OH, DMA_BIDIRECTIONAL);
 	if (dma_mapping_error(dev, adapter->bounce_buffer_dma)) {
 		netdev_err(netdev, "unable to map bounce buffer\n");
-		ibmveth_cleanup(adapter);
-		napi_disable(&adapter->napi);
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto err_out;
 	}
 
 	netdev_dbg(netdev, "initial replenish cycle\n");
@@ -669,6 +660,11 @@ static int ibmveth_open(struct net_device *netdev)
 	netdev_dbg(netdev, "open complete\n");
 
 	return 0;
+
+err_out:
+	ibmveth_cleanup(adapter);
+	napi_disable(&adapter->napi);
+	return rc;
 }
 
 static int ibmveth_close(struct net_device *netdev)
-- 
1.7.2.2



* Re: [PATCH 5/9] tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
From: Balazs Scheidler @ 2010-10-20 14:07 UTC
  To: YOSHIFUJI Hideaki
  Cc: KOVACS Krisztian, netdev, netfilter-devel, Patrick McHardy,
	David Miller
In-Reply-To: <4CBEE45D.2080201@linux-ipv6.org>

On Wed, 2010-10-20 at 21:45 +0900, YOSHIFUJI Hideaki wrote:
> (2010/10/20 20:21), KOVACS Krisztian wrote:
> > From: Balazs Scheidler<bazsi@balabit.hu>
> > 
> > Signed-off-by: Balazs Scheidler<bazsi@balabit.hu>
> > Signed-off-by: KOVACS Krisztian<hidden@balabit.hu>
> > ---
> >   net/ipv6/af_inet6.c |    2 +-
> >   1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> > index 6022098..9480572 100644
> > --- a/net/ipv6/af_inet6.c
> > +++ b/net/ipv6/af_inet6.c
> > @@ -343,7 +343,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
> >   			 */
> >   			v4addr = LOOPBACK4_IPV6;
> >   			if (!(addr_type&  IPV6_ADDR_MULTICAST))	{
> > -				if (!ipv6_chk_addr(net,&addr->sin6_addr,
> > +				if (!inet->transparent&&  !ipv6_chk_addr(net,&addr->sin6_addr,
> >   						   dev, 0)) {
> >   					err = -EADDRNOTAVAIL;
> >   					goto out_unlock;
> > 
> > 
> 
> As I wrote before in other thread, this does not seem sufficient --
> well, it is sufficient to allow non-local bind, but before we're
> allowing this, we need add checks of source address in sending side.

Can you please elaborate or point us to the other thread? Is it some
kind of address-type check that we miss?

-- 
Bazsi




* Re: [RFC PATCH 3/9] ipvs network name space aware
From: Simon Horman @ 2010-10-20 14:03 UTC
  To: Hans Schillstrom
  Cc: lvs-devel, netdev, netfilter-devel, ja, wensong, daniel.lezcano
In-Reply-To: <201010081316.57914.hans.schillstrom@ericsson.com>

On Fri, Oct 08, 2010 at 01:16:57PM +0200, Hans Schillstrom wrote:
> 
> This patch just contains ip_vs_conn.c
> and does the normal
>  - moving vars to struct ipvs
>  - adding per netns init and exit
> 
> proc_fs required some extra work with adding/changing private data to get the net ptr.

I am currently working on rebasing this patch against the
current nf-next-2.6 tree, which includes persistence engines,
and I noticed a few things.

> Signed-off-by:Hans Schillstrom <hans.schillstrom@ericsson.com>
> 
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index b71c69a..c47828f 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -47,7 +47,7 @@
> 
>  /*
>   * Connection hash size. Default is what was selected at compile time.
> -*/
> + */
>  int ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
>  module_param_named(conn_tab_bits, ip_vs_conn_tab_bits, int, 0444);
>  MODULE_PARM_DESC(conn_tab_bits, "Set connections' hash size");

This fragment is not needed.

> @@ -56,23 +56,12 @@ MODULE_PARM_DESC(conn_tab_bits, "Set connections' hash size");
>  int ip_vs_conn_tab_size;
>  int ip_vs_conn_tab_mask;
> 
> -/*
> - *  Connection hash table: for input and output packets lookups of IPVS
> - */
> -static struct list_head *ip_vs_conn_tab;
> -
> -/*  SLAB cache for IPVS connections */
> -static struct kmem_cache *ip_vs_conn_cachep __read_mostly;
> -
> -/*  counter for current IPVS connections */
> -static atomic_t ip_vs_conn_count = ATOMIC_INIT(0);
> -
> -/*  counter for no client port connections */
> -static atomic_t ip_vs_conn_no_cport_cnt = ATOMIC_INIT(0);
> -
>  /* random value for IPVS connection hash */
>  static unsigned int ip_vs_conn_rnd;
> 
> +/* cache name cnt */
> +static atomic_t conn_cache_nr = ATOMIC_INIT(0);
> +
>  /*
>   *  Fine locking granularity for big connection hash table
>   */
> @@ -153,7 +142,7 @@ static unsigned int ip_vs_conn_hashkey(int af, unsigned proto,
>   *	Hashes ip_vs_conn in ip_vs_conn_tab by proto,addr,port.
>   *	returns bool success.
>   */
> -static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
> +static inline int ip_vs_conn_hash(struct net *net, struct ip_vs_conn *cp)
>  {
>  	unsigned hash;
>  	int ret;
> @@ -168,7 +157,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
>  	spin_lock(&cp->lock);
> 
>  	if (!(cp->flags & IP_VS_CONN_F_HASHED)) {
> -		list_add(&cp->c_list, &ip_vs_conn_tab[hash]);
> +		list_add(&cp->c_list, &net->ipvs->conn_tab[hash]);
>  		cp->flags |= IP_VS_CONN_F_HASHED;
>  		atomic_inc(&cp->refcnt);
>  		ret = 1;
> @@ -221,18 +210,20 @@ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
>   *	s_addr, s_port: pkt source address (foreign host)
>   *	d_addr, d_port: pkt dest address (load balancer)
>   */
> -static inline struct ip_vs_conn *__ip_vs_conn_in_get
> -(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
> - const union nf_inet_addr *d_addr, __be16 d_port)
> +static inline struct ip_vs_conn *
> +__ip_vs_conn_in_get(struct net *net, int af, int protocol,
> +		    const union nf_inet_addr *s_addr, __be16 s_port,
> +		    const union nf_inet_addr *d_addr, __be16 d_port)
>  {
>  	unsigned hash;
>  	struct ip_vs_conn *cp;
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>  	hash = ip_vs_conn_hashkey(af, protocol, s_addr, s_port);
> 
>  	ct_read_lock(hash);
> 
> -	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
> +	list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
>  		if (cp->af == af &&
>  		    ip_vs_addr_equal(af, s_addr, &cp->caddr) &&
>  		    ip_vs_addr_equal(af, d_addr, &cp->vaddr) &&
> @@ -251,16 +242,18 @@ static inline struct ip_vs_conn *__ip_vs_conn_in_get
>  	return NULL;
>  }
> 
> -struct ip_vs_conn *ip_vs_conn_in_get
> -(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
> - const union nf_inet_addr *d_addr, __be16 d_port)
> +struct ip_vs_conn *
> +ip_vs_conn_in_get(struct net *net, int af, int protocol,
> +		  const union nf_inet_addr *s_addr, __be16 s_port,
> +		  const union nf_inet_addr *d_addr, __be16 d_port)
>  {
>  	struct ip_vs_conn *cp;
> 
> -	cp = __ip_vs_conn_in_get(af, protocol, s_addr, s_port, d_addr, d_port);
> -	if (!cp && atomic_read(&ip_vs_conn_no_cport_cnt))
> -		cp = __ip_vs_conn_in_get(af, protocol, s_addr, 0, d_addr,
> -					 d_port);
> +	cp = __ip_vs_conn_in_get(net, af, protocol,
> +				 s_addr, s_port, d_addr, d_port);
> +	if (!cp && atomic_read(&net->ipvs->conn_no_cport_cnt))
> +		cp = __ip_vs_conn_in_get(net, af, protocol,
> +					 s_addr, 0, d_addr, d_port);
> 
>  	IP_VS_DBG_BUF(9, "lookup/in %s %s:%d->%s:%d %s\n",
>  		      ip_vs_proto_name(protocol),
> @@ -278,35 +271,41 @@ ip_vs_conn_in_get_proto(int af, const struct sk_buff *skb,
>  			unsigned int proto_off, int inverse)
>  {
>  	__be16 _ports[2], *pptr;
> +	struct net *net = dev_net(skb->dev);
> 
>  	pptr = skb_header_pointer(skb, proto_off, sizeof(_ports), _ports);
>  	if (pptr == NULL)
>  		return NULL;
> 
> +	BUG_ON(!net);

Can you explain why BUG_ON is here?

>  	if (likely(!inverse))
> -		return ip_vs_conn_in_get(af, iph->protocol,
> +		return ip_vs_conn_in_get(net, af, iph->protocol,
>  					 &iph->saddr, pptr[0],
>  					 &iph->daddr, pptr[1]);
>  	else
> -		return ip_vs_conn_in_get(af, iph->protocol,
> +		return ip_vs_conn_in_get(net, af, iph->protocol,
>  					 &iph->daddr, pptr[1],
>  					 &iph->saddr, pptr[0]);
>  }
>  EXPORT_SYMBOL_GPL(ip_vs_conn_in_get_proto);
> 
> -/* Get reference to connection template */
> -struct ip_vs_conn *ip_vs_ct_in_get
> -(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
> - const union nf_inet_addr *d_addr, __be16 d_port)
> +/*
> + *  Get reference to connection template
> + */
> +struct ip_vs_conn *
> +ip_vs_ct_in_get(struct net *net, int af, int protocol,
> +		const union nf_inet_addr *s_addr, __be16 s_port,
> +		const union nf_inet_addr *d_addr, __be16 d_port)
>  {
>  	unsigned hash;
>  	struct ip_vs_conn *cp;
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>  	hash = ip_vs_conn_hashkey(af, protocol, s_addr, s_port);
> 
>  	ct_read_lock(hash);
> 
> -	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
> +	list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
>  		if (cp->af == af &&
>  		    ip_vs_addr_equal(af, s_addr, &cp->caddr) &&
>  		    /* protocol should only be IPPROTO_IP if
> @@ -341,12 +340,14 @@ struct ip_vs_conn *ip_vs_ct_in_get
>   *	s_addr, s_port: pkt source address (inside host)
>   *	d_addr, d_port: pkt dest address (foreign host)
>   */
> -struct ip_vs_conn *ip_vs_conn_out_get
> -(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
> - const union nf_inet_addr *d_addr, __be16 d_port)
> +struct ip_vs_conn *
> +ip_vs_conn_out_get(struct net *net, int af, int protocol,
> +		   const union nf_inet_addr *s_addr, __be16 s_port,
> +		   const union nf_inet_addr *d_addr, __be16 d_port)
>  {
>  	unsigned hash;
>  	struct ip_vs_conn *cp, *ret=NULL;
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>  	/*
>  	 *	Check for "full" addressed entries
> @@ -355,7 +356,7 @@ struct ip_vs_conn *ip_vs_conn_out_get
> 
>  	ct_read_lock(hash);
> 
> -	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
> +	list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
>  		if (cp->af == af &&
>  		    ip_vs_addr_equal(af, d_addr, &cp->caddr) &&
>  		    ip_vs_addr_equal(af, s_addr, &cp->daddr) &&
> @@ -386,17 +387,19 @@ ip_vs_conn_out_get_proto(int af, const struct sk_buff *skb,
>  			 unsigned int proto_off, int inverse)
>  {
>  	__be16 _ports[2], *pptr;
> +	struct net *net = dev_net(skb->dev);
> 
>  	pptr = skb_header_pointer(skb, proto_off, sizeof(_ports), _ports);
>  	if (pptr == NULL)
>  		return NULL;
> 
> +	BUG_ON(!net);
>  	if (likely(!inverse))
> -		return ip_vs_conn_out_get(af, iph->protocol,
> +		return ip_vs_conn_out_get(net, af, iph->protocol,
>  					  &iph->saddr, pptr[0],
>  					  &iph->daddr, pptr[1]);
>  	else
> -		return ip_vs_conn_out_get(af, iph->protocol,
> +		return ip_vs_conn_out_get(net, af, iph->protocol,
>  					  &iph->daddr, pptr[1],
>  					  &iph->saddr, pptr[0]);
>  }
> @@ -408,7 +411,7 @@ EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto);
>  void ip_vs_conn_put(struct ip_vs_conn *cp)
>  {
>  	unsigned long t = (cp->flags & IP_VS_CONN_F_ONE_PACKET) ?
> -		0 : cp->timeout;
> +			   0 : cp->timeout;
>  	mod_timer(&cp->timer, jiffies+t);
> 
>  	__ip_vs_conn_put(cp);
> @@ -418,19 +421,19 @@ void ip_vs_conn_put(struct ip_vs_conn *cp)
>  /*
>   *	Fill a no_client_port connection with a client port number
>   */
> -void ip_vs_conn_fill_cport(struct ip_vs_conn *cp, __be16 cport)
> +void ip_vs_conn_fill_cport(struct net *net, struct ip_vs_conn *cp, __be16 cport)
>  {
>  	if (ip_vs_conn_unhash(cp)) {
>  		spin_lock(&cp->lock);
>  		if (cp->flags & IP_VS_CONN_F_NO_CPORT) {
> -			atomic_dec(&ip_vs_conn_no_cport_cnt);
> +			atomic_dec(&net->ipvs->conn_no_cport_cnt);
>  			cp->flags &= ~IP_VS_CONN_F_NO_CPORT;
>  			cp->cport = cport;
>  		}
>  		spin_unlock(&cp->lock);
> 
>  		/* hash on new dport */
> -		ip_vs_conn_hash(cp);
> +		ip_vs_conn_hash(net, cp);
>  	}
>  }
> 
> @@ -561,12 +564,12 @@ ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest)
>   * Check if there is a destination for the connection, if so
>   * bind the connection to the destination.
>   */
> -struct ip_vs_dest *ip_vs_try_bind_dest(struct ip_vs_conn *cp)
> +struct ip_vs_dest *ip_vs_try_bind_dest(struct net *net, struct ip_vs_conn *cp)
>  {
>  	struct ip_vs_dest *dest;
> 
>  	if ((cp) && (!cp->dest)) {
> -		dest = ip_vs_find_dest(cp->af, &cp->daddr, cp->dport,
> +		dest = ip_vs_find_dest(net, cp->af, &cp->daddr, cp->dport,
>  				       &cp->vaddr, cp->vport,
>  				       cp->protocol);
>  		ip_vs_bind_dest(cp, dest);
> @@ -638,7 +641,7 @@ static inline void ip_vs_unbind_dest(struct ip_vs_conn *cp)
>   *	If available, return 1, otherwise invalidate this connection
>   *	template and return 0.
>   */
> -int ip_vs_check_template(struct ip_vs_conn *ct)
> +int ip_vs_check_template(struct net *net, struct ip_vs_conn *ct)
>  {
>  	struct ip_vs_dest *dest = ct->dest;
> 
> @@ -647,7 +650,7 @@ int ip_vs_check_template(struct ip_vs_conn *ct)
>  	 */
>  	if ((dest == NULL) ||
>  	    !(dest->flags & IP_VS_DEST_F_AVAILABLE) ||
> -	    (sysctl_ip_vs_expire_quiescent_template &&
> +	    (net->ipvs->sysctl_expire_quiescent_template &&
>  	     (atomic_read(&dest->weight) == 0))) {
>  		IP_VS_DBG_BUF(9, "check_template: dest not available for "
>  			      "protocol %s s:%s:%d v:%s:%d "
> @@ -668,7 +671,7 @@ int ip_vs_check_template(struct ip_vs_conn *ct)
>  				ct->dport = htons(0xffff);
>  				ct->vport = htons(0xffff);
>  				ct->cport = 0;
> -				ip_vs_conn_hash(ct);
> +				ip_vs_conn_hash(net, ct);
>  			}
>  		}
> 
> @@ -720,16 +723,17 @@ static void ip_vs_conn_expire(unsigned long data)
>  		if (unlikely(cp->app != NULL))
>  			ip_vs_unbind_app(cp);
>  		ip_vs_unbind_dest(cp);
> +		BUG_ON(!cp->net);
>  		if (cp->flags & IP_VS_CONN_F_NO_CPORT)
> -			atomic_dec(&ip_vs_conn_no_cport_cnt);
> -		atomic_dec(&ip_vs_conn_count);
> +			atomic_dec(&cp->net->ipvs->conn_no_cport_cnt);
> +		atomic_dec(&cp->net->ipvs->conn_count);
> 
> -		kmem_cache_free(ip_vs_conn_cachep, cp);
> +		kmem_cache_free(cp->net->ipvs->conn_cachep, cp);
>  		return;
>  	}
> 
>  	/* hash it back to the table */
> -	ip_vs_conn_hash(cp);
> +	ip_vs_conn_hash(cp->net, cp);
> 
>    expire_later:
>  	IP_VS_DBG(7, "delayed: conn->refcnt-1=%d conn->n_control=%d\n",
> @@ -748,18 +752,22 @@ void ip_vs_conn_expire_now(struct ip_vs_conn *cp)
> 
> 
>  /*
> - *	Create a new connection entry and hash it into the ip_vs_conn_tab
> + *	Create a new connection entry and hash it into the ip_vs_conn_tab,
> + * 	netns ptr will be stored in ip_vs_con here.
>   */
>  struct ip_vs_conn *
> -ip_vs_conn_new(int af, int proto, const union nf_inet_addr *caddr, __be16 cport,
> +ip_vs_conn_new(struct net *net, int af, int proto,
> +	       const union nf_inet_addr *caddr, __be16 cport,
>  	       const union nf_inet_addr *vaddr, __be16 vport,
> -	       const union nf_inet_addr *daddr, __be16 dport, unsigned flags,
> -	       struct ip_vs_dest *dest)
> +	       const union nf_inet_addr *daddr, __be16 dport,
> +	       unsigned flags, struct ip_vs_dest *dest)
>  {
>  	struct ip_vs_conn *cp;
> -	struct ip_vs_protocol *pp = ip_vs_proto_get(proto);
> +	struct ip_vs_proto_data *pd = ip_vs_proto_data_get(net, proto);
> +	struct ip_vs_protocol *pp;
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
> -	cp = kmem_cache_zalloc(ip_vs_conn_cachep, GFP_ATOMIC);
> +	cp = kmem_cache_zalloc(ipvs->conn_cachep, GFP_ATOMIC);
>  	if (cp == NULL) {
>  		IP_VS_ERR_RL("%s(): no memory\n", __func__);
>  		return NULL;
> @@ -790,9 +798,9 @@ ip_vs_conn_new(int af, int proto, const union nf_inet_addr *caddr, __be16 cport,
>  	atomic_set(&cp->n_control, 0);
>  	atomic_set(&cp->in_pkts, 0);
> 
> -	atomic_inc(&ip_vs_conn_count);
> +	atomic_inc(&ipvs->conn_count);
>  	if (flags & IP_VS_CONN_F_NO_CPORT)
> -		atomic_inc(&ip_vs_conn_no_cport_cnt);
> +		atomic_inc(&ipvs->conn_no_cport_cnt);
> 
>  	/* Bind the connection with a destination server */
>  	ip_vs_bind_dest(cp, dest);
> @@ -808,12 +816,14 @@ ip_vs_conn_new(int af, int proto, const union nf_inet_addr *caddr, __be16 cport,
>  	else
>  #endif
>  		ip_vs_bind_xmit(cp);
> -
> -	if (unlikely(pp && atomic_read(&pp->appcnt)))
> -		ip_vs_bind_app(cp, pp);
> -
> +	cp->net = net;	/* netns ptr  needed in timer */
> +	if( pd ) {
> +		pp = pd->pp;
> +		if (unlikely(pp && atomic_read(&pd->appcnt)))
> +			ip_vs_bind_app(net, cp, pp);
> +	}
>  	/* Hash it in the ip_vs_conn_tab finally */
> -	ip_vs_conn_hash(cp);
> +	ip_vs_conn_hash(net, cp);
> 
>  	return cp;
>  }
> @@ -824,16 +834,33 @@ ip_vs_conn_new(int af, int proto, const union nf_inet_addr *caddr, __be16 cport,
>   */
>  #ifdef CONFIG_PROC_FS
> 
> +struct ipvs_private {
> +	struct seq_net_private p;
> +	void *private;
> +};
> +
> +static inline void ipvs_seq_priv_set(struct seq_file *seq, void *data)
> +{
> +	struct ipvs_private *ipriv=(struct ipvs_private *)seq->private;
> +	ipriv->private = data;
> +}
> +static inline void *ipvs_seq_priv_get(struct seq_file *seq)
> +{
> +	return ((struct ipvs_private *)seq->private)->private;
> +}
> +
>  static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
>  {
>  	int idx;
>  	struct ip_vs_conn *cp;
> +	struct net *net = seq_file_net(seq);
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>  	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
>  		ct_read_lock_bh(idx);
> -		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
> +		list_for_each_entry(cp, &ipvs->conn_tab[idx], c_list) {
>  			if (pos-- == 0) {
> -				seq->private = &ip_vs_conn_tab[idx];
> +				ipvs_seq_priv_set(seq, &ipvs->conn_tab[idx]);
>  				return cp;
>  			}
>  		}
> @@ -845,15 +872,17 @@ static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
> 
>  static void *ip_vs_conn_seq_start(struct seq_file *seq, loff_t *pos)
>  {
> -	seq->private = NULL;
> +	ipvs_seq_priv_set(seq, NULL);
>  	return *pos ? ip_vs_conn_array(seq, *pos - 1) :SEQ_START_TOKEN;
>  }
> -
> + /* netns: conn_tab OK */
>  static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>  {
>  	struct ip_vs_conn *cp = v;
> -	struct list_head *e, *l = seq->private;
> +	struct list_head *e, *l = ipvs_seq_priv_get(seq);
>  	int idx;
> +	struct net *net = seq_file_net(seq);
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>  	++*pos;
>  	if (v == SEQ_START_TOKEN)
> @@ -863,27 +892,28 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>  	if ((e = cp->c_list.next) != l)
>  		return list_entry(e, struct ip_vs_conn, c_list);
> 
> -	idx = l - ip_vs_conn_tab;
> +	idx = l - ipvs->conn_tab;
>  	ct_read_unlock_bh(idx);
> 
>  	while (++idx < ip_vs_conn_tab_size) {
>  		ct_read_lock_bh(idx);
> -		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
> -			seq->private = &ip_vs_conn_tab[idx];
> +		list_for_each_entry(cp, &ipvs->conn_tab[idx], c_list) {
> +			ipvs_seq_priv_set(seq, &ipvs->conn_tab[idx]);
>  			return cp;
>  		}
>  		ct_read_unlock_bh(idx);
>  	}
> -	seq->private = NULL;
> +	ipvs_seq_priv_set(seq, NULL);
>  	return NULL;
>  }
> -
> +/* netns: conn_tab OK */
>  static void ip_vs_conn_seq_stop(struct seq_file *seq, void *v)
>  {
> -	struct list_head *l = seq->private;
> +	struct list_head *l = ipvs_seq_priv_get(seq);
> +	struct net *net = seq_file_net(seq);
> 
>  	if (l)
> -		ct_read_unlock_bh(l - ip_vs_conn_tab);
> +		ct_read_unlock_bh(l - net->ipvs->conn_tab);
>  }
> 
>  static int ip_vs_conn_seq_show(struct seq_file *seq, void *v)
> @@ -928,7 +958,16 @@ static const struct seq_operations ip_vs_conn_seq_ops = {
> 
>  static int ip_vs_conn_open(struct inode *inode, struct file *file)
>  {
> -	return seq_open(file, &ip_vs_conn_seq_ops);
> +	int ret;
> +	struct ipvs_private *priv;
> +
> +	ret = seq_open_net(inode, file, &ip_vs_conn_seq_ops,
> +			   sizeof(struct ipvs_private));
> +	if (!ret) {
> +		priv = ((struct seq_file *)file->private_data)->private;
> +		priv->private = NULL;
> +	}
> +	return ret;
>  }
> 
>  static const struct file_operations ip_vs_conn_fops = {
> @@ -936,7 +975,8 @@ static const struct file_operations ip_vs_conn_fops = {
>  	.open    = ip_vs_conn_open,
>  	.read    = seq_read,
>  	.llseek  = seq_lseek,
> -	.release = seq_release,
> +	.release = seq_release_private,
> +
>  };
> 
>  static const char *ip_vs_origin_name(unsigned flags)
> @@ -991,7 +1031,17 @@ static const struct seq_operations ip_vs_conn_sync_seq_ops = {
> 
>  static int ip_vs_conn_sync_open(struct inode *inode, struct file *file)
>  {
> -	return seq_open(file, &ip_vs_conn_sync_seq_ops);
> +	int ret;
> +	struct ipvs_private *ipriv;
> +
> +	ret = seq_open_net(inode, file, &ip_vs_conn_sync_seq_ops,
> +			   sizeof(struct ipvs_private));
> +	if (!ret) {
> +		ipriv = ((struct seq_file *)file->private_data)->private;
> +		ipriv->private = NULL;
> +	}
> +	return ret;
> +//	return seq_open(file, &ip_vs_conn_sync_seq_ops);
>  }
> 
>  static const struct file_operations ip_vs_conn_sync_fops = {
> @@ -999,7 +1049,7 @@ static const struct file_operations ip_vs_conn_sync_fops = {
>  	.open    = ip_vs_conn_sync_open,
>  	.read    = seq_read,
>  	.llseek  = seq_lseek,
> -	.release = seq_release,
> +	.release = seq_release_private,
>  };
> 
>  #endif
> @@ -1036,11 +1086,14 @@ static inline int todrop_entry(struct ip_vs_conn *cp)
>  	return 1;
>  }
> 
> -/* Called from keventd and must protect itself from softirqs */
> -void ip_vs_random_dropentry(void)
> +/* Called from keventd and must protect itself from softirqs
> + * netns: conn_tab OK
> + */
> +void ip_vs_random_dropentry(struct net *net)
>  {
>  	int idx;
>  	struct ip_vs_conn *cp;
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>  	/*
>  	 * Randomly scan 1/32 of the whole table every second
> @@ -1053,7 +1106,7 @@ void ip_vs_random_dropentry(void)
>  		 */
>  		ct_write_lock_bh(hash);
> 
> -		list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
> +		list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
>  			if (cp->flags & IP_VS_CONN_F_TEMPLATE)
>  				/* connection template */
>  				continue;
> @@ -1091,11 +1144,13 @@ void ip_vs_random_dropentry(void)
> 
>  /*
>   *      Flush all the connection entries in the ip_vs_conn_tab
> + * netns: conn_tab OK
>   */
> -static void ip_vs_conn_flush(void)
> +static void ip_vs_conn_flush(struct net *net)
>  {
>  	int idx;
>  	struct ip_vs_conn *cp;
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>    flush_again:
>  	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> @@ -1104,7 +1159,7 @@ static void ip_vs_conn_flush(void)
>  		 */
>  		ct_write_lock_bh(idx);
> 
> -		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
> +		list_for_each_entry(cp, &ipvs->conn_tab[idx], c_list) {
> 
>  			IP_VS_DBG(4, "del connection\n");
>  			ip_vs_conn_expire_now(cp);
> @@ -1118,16 +1173,17 @@ static void ip_vs_conn_flush(void)
> 
>  	/* the counter may be not NULL, because maybe some conn entries
>  	   are run by slow timer handler or unhashed but still referred */
> -	if (atomic_read(&ip_vs_conn_count) != 0) {
> +	if (atomic_read(&ipvs->conn_count) != 0) {
>  		schedule();
>  		goto flush_again;
>  	}
>  }
> 
> 
> -int __init ip_vs_conn_init(void)
> +int __net_init __ip_vs_conn_init(struct net *net)
>  {
>  	int idx;
> +	struct netns_ipvs *ipvs = net->ipvs;
> 
>  	/* Compute size and mask */
>  	ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;
> @@ -1136,19 +1192,26 @@ int __init ip_vs_conn_init(void)
>  	/*
>  	 * Allocate the connection hash table and initialize its list heads
>  	 */
> -	ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size *
> +	ipvs->conn_tab = vmalloc(ip_vs_conn_tab_size *
>  				 sizeof(struct list_head));
> -	if (!ip_vs_conn_tab)
> +	if (!ipvs->conn_tab)
>  		return -ENOMEM;
> 
>  	/* Allocate ip_vs_conn slab cache */
> -	ip_vs_conn_cachep = kmem_cache_create("ip_vs_conn",
> +	/* Todo: find a better way to name the cache */
> +	snprintf(ipvs->conn_cname, sizeof(ipvs->conn_cname)-1,
> +			"ipvs_conn_%d", atomic_read(&conn_cache_nr) );
> +	atomic_inc(&conn_cache_nr);
> +
> +	ipvs->conn_cachep = kmem_cache_create(ipvs->conn_cname,
>  					      sizeof(struct ip_vs_conn), 0,
>  					      SLAB_HWCACHE_ALIGN, NULL);
> -	if (!ip_vs_conn_cachep) {
> -		vfree(ip_vs_conn_tab);
> +	if (!ipvs->conn_cachep) {
> +		vfree(ipvs->conn_tab);
>  		return -ENOMEM;
>  	}
> +	atomic_set(&ipvs->conn_count, 0);
> +	atomic_set(&ipvs->conn_no_cport_cnt, 0);
> 
>  	pr_info("Connection hash table configured "
>  		"(size=%d, memory=%ldKbytes)\n",
> @@ -1158,31 +1221,46 @@ int __init ip_vs_conn_init(void)
>  		  sizeof(struct ip_vs_conn));
> 
>  	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
> -		INIT_LIST_HEAD(&ip_vs_conn_tab[idx]);
> +		INIT_LIST_HEAD(&ipvs->conn_tab[idx]);
>  	}
> 
>  	for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++)  {
>  		rwlock_init(&__ip_vs_conntbl_lock_array[idx].l);
>  	}
> 
> -	proc_net_fops_create(&init_net, "ip_vs_conn", 0, &ip_vs_conn_fops);
> -	proc_net_fops_create(&init_net, "ip_vs_conn_sync", 0, &ip_vs_conn_sync_fops);
> -
> -	/* calculate the random value for connection hash */
> -	get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));
> +	proc_net_fops_create(net, "ip_vs_conn", 0, &ip_vs_conn_fops);
> +	proc_net_fops_create(net, "ip_vs_conn_sync", 0, &ip_vs_conn_sync_fops);
> 
>  	return 0;
>  }
> +/* Cleanup and release all netns related ... */
> +static void __net_exit __ip_vs_conn_cleanup(struct net *net) {
> 
> +	/* flush all the connection entries first */
> +	ip_vs_conn_flush(net);
> +	/* Release the empty cache */
> +	kmem_cache_destroy(net->ipvs->conn_cachep);
> +	proc_net_remove(net, "ip_vs_conn");
> +	proc_net_remove(net, "ip_vs_conn_sync");
> +	vfree(net->ipvs->conn_tab);
> +}
> +static struct pernet_operations ipvs_conn_ops = {
> +	.init = __ip_vs_conn_init,
> +	.exit = __ip_vs_conn_cleanup,
> +};
> 
> -void ip_vs_conn_cleanup(void)
> +int __init ip_vs_conn_init(void)
>  {
> -	/* flush all the connection entries first */
> -	ip_vs_conn_flush();
> +	int rv;
> 
> -	/* Release the empty cache */
> -	kmem_cache_destroy(ip_vs_conn_cachep);
> -	proc_net_remove(&init_net, "ip_vs_conn");
> -	proc_net_remove(&init_net, "ip_vs_conn_sync");
> -	vfree(ip_vs_conn_tab);
> +	rv = register_pernet_subsys(&ipvs_conn_ops);
> +
> +	/* calculate the random value for connection hash */
> +	get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));
> +	return rv;
> +}
> +
> +void ip_vs_conn_cleanup(void)
> +{
> +	unregister_pernet_subsys(&ipvs_conn_ops);
>  }
> 
> -- 
> Regards
> Hans Schillstrom <hans.schillstrom@ericsson.com>
> --
> To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
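
One detail of the quoted seq_file changes may be worth spelling out: ip_vs_conn_seq_next() and ip_vs_conn_seq_stop() recover the bucket index by subtracting the table base from the saved bucket pointer, so the subtraction is only meaningful against the table of the same namespace. A minimal userspace sketch of that arithmetic, assuming a cut-down stand-in structure (struct netns_ipvs_sketch and bucket_index() are hypothetical names, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical, cut-down per-namespace state; the patch keeps the
 * real table in net->ipvs->conn_tab. */
struct netns_ipvs_sketch {
	struct list_head *conn_tab;	/* array of hash buckets */
	size_t conn_tab_size;		/* number of buckets */
};

/* Recover the bucket index from a pointer to a bucket head, as
 * "idx = l - ipvs->conn_tab" does in ip_vs_conn_seq_next(). */
static size_t bucket_index(const struct netns_ipvs_sketch *ipvs,
			   const struct list_head *l)
{
	ptrdiff_t idx = l - ipvs->conn_tab;

	/* Only valid when l points into this namespace's own table. */
	assert(idx >= 0 && (size_t)idx < ipvs->conn_tab_size);
	return (size_t)idx;
}
```

The assert is an addition for the sketch only; the quoted kernel code performs the subtraction without a range check.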

^ permalink raw reply

* Re: [PATCH 5/9] tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
From: YOSHIFUJI Hideaki @ 2010-10-20 12:45 UTC (permalink / raw)
  To: KOVACS Krisztian; +Cc: netdev, netfilter-devel, Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.93956.stgit@este.odu>

Hello.

(2010/10/20 20:21), KOVACS Krisztian wrote:
> From: Balazs Scheidler <bazsi@balabit.hu>
> 
> Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
> ---
>  net/ipv6/af_inet6.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 6022098..9480572 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -343,7 +343,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
>  			 */
>  			v4addr = LOOPBACK4_IPV6;
>  			if (!(addr_type & IPV6_ADDR_MULTICAST))	{
> -				if (!ipv6_chk_addr(net, &addr->sin6_addr,
> +				if (!inet->transparent && !ipv6_chk_addr(net, &addr->sin6_addr,
>  						   dev, 0)) {
>  					err = -EADDRNOTAVAIL;
>  					goto out_unlock;
> 
> 

As I wrote before in another thread, this does not seem sufficient.
It is enough to allow non-local binds, but before we allow them,
we need to add source address checks on the sending side.
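
For reference, the bind-time condition the patch changes can be modeled as a small truth table. This is only a sketch under assumptions: bind_check_sketch() is a hypothetical name, addr_is_local stands in for the result of ipv6_chk_addr(), and the send-side source address check is deliberately not modeled because its exact form is not specified in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the unicast check in inet6_bind() with the
 * patch applied: a non-local address is refused unless the socket
 * has its transparent flag set. */
static int bind_check_sketch(bool transparent, bool addr_is_local)
{
	if (!transparent && !addr_is_local)
		return -EADDRNOTAVAIL;
	return 0;
}

/* Walks all four input combinations. */
static void bind_check_examples(void)
{
	assert(bind_check_sketch(false, true) == 0);
	assert(bind_check_sketch(false, false) == -EADDRNOTAVAIL);
	assert(bind_check_sketch(true, true) == 0);
	assert(bind_check_sketch(true, false) == 0);	/* new with the patch */
}
```

Only the last combination changes behaviour, and it is exactly the case where a socket ends up bound to an address the host does not own.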

Regards,

--yoshfuji

^ permalink raw reply

* [PATCH 5/9] tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 net/ipv6/af_inet6.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 6022098..9480572 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -343,7 +343,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 			 */
 			v4addr = LOOPBACK4_IPV6;
 			if (!(addr_type & IPV6_ADDR_MULTICAST))	{
-				if (!ipv6_chk_addr(net, &addr->sin6_addr,
+				if (!inet->transparent && !ipv6_chk_addr(net, &addr->sin6_addr,
 						   dev, 0)) {
 					err = -EADDRNOTAVAIL;
 					goto out_unlock;



^ permalink raw reply related

* [PATCH 4/9] tproxy: added tproxy sockopt interface in the IPV6 layer
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

Support for the IPV6_RECVORIGDSTADDR sockopt for UDP sockets was contributed by
Harry Mason.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 include/linux/in6.h      |    4 ++++
 include/linux/ipv6.h     |    4 +++-
 net/ipv6/datagram.c      |   19 +++++++++++++++++++
 net/ipv6/ipv6_sockglue.c |   23 +++++++++++++++++++++++
 4 files changed, 49 insertions(+), 1 deletions(-)

diff --git a/include/linux/in6.h b/include/linux/in6.h
index c4bf46f..097a34b 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -268,6 +268,10 @@ struct in6_flowlabel_req {
 /* RFC5082: Generalized Ttl Security Mechanism */
 #define IPV6_MINHOPCOUNT		73
 
+#define IPV6_ORIGDSTADDR        74
+#define IPV6_RECVORIGDSTADDR    IPV6_ORIGDSTADDR
+#define IPV6_TRANSPARENT        75
+
 /*
  * Multicast Routing:
  * see include/linux/mroute6.h.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index e62683b..8e429d0 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -341,7 +341,9 @@ struct ipv6_pinfo {
 				odstopts:1,
                                 rxflow:1,
 				rxtclass:1,
-				rxpmtu:1;
+				rxpmtu:1,
+				rxorigdstaddr:1;
+				/* 2 bits hole */
 		} bits;
 		__u16		all;
 	} rxopt;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index ef371aa..320bdb8 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -577,6 +577,25 @@ int datagram_recv_ctl(struct sock *sk, struct msghdr *msg, struct sk_buff *skb)
 		u8 *ptr = nh + opt->dst1;
 		put_cmsg(msg, SOL_IPV6, IPV6_2292DSTOPTS, (ptr[1]+1)<<3, ptr);
 	}
+	if (np->rxopt.bits.rxorigdstaddr) {
+		struct sockaddr_in6 sin6;
+		u16 *ports = (u16 *) skb_transport_header(skb);
+
+		if (skb_transport_offset(skb) + 4 <= skb->len) {
+			/* All current transport protocols have the port numbers in the
+			 * first four bytes of the transport header and this function is
+			 * written with this assumption in mind.
+			 */
+
+			sin6.sin6_family = AF_INET6;
+			ipv6_addr_copy(&sin6.sin6_addr, &ipv6_hdr(skb)->daddr);
+			sin6.sin6_port = ports[1];
+			sin6.sin6_flowinfo = 0;
+			sin6.sin6_scope_id = 0;
+
+			put_cmsg(msg, SOL_IPV6, IPV6_ORIGDSTADDR, sizeof(sin6), &sin6);
+		}
+	}
 	return 0;
 }
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index a7f66bc..0553867 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -342,6 +342,21 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 		retv = 0;
 		break;
 
+	case IPV6_TRANSPARENT:
+		if (optlen < sizeof(int))
+			goto e_inval;
+		/* we don't have a separate transparent bit for IPv6; we use the one in the IPv4 socket */
+		inet_sk(sk)->transparent = valbool;
+		retv = 0;
+		break;
+
+	case IPV6_RECVORIGDSTADDR:
+		if (optlen < sizeof(int))
+			goto e_inval;
+		np->rxopt.bits.rxorigdstaddr = valbool;
+		retv = 0;
+		break;
+
 	case IPV6_HOPOPTS:
 	case IPV6_RTHDRDSTOPTS:
 	case IPV6_RTHDR:
@@ -1104,6 +1119,14 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 		break;
 	}
 
+	case IPV6_TRANSPARENT:
+		val = inet_sk(sk)->transparent;
+		break;
+
+	case IPV6_RECVORIGDSTADDR:
+		val = np->rxopt.bits.rxorigdstaddr;
+		break;
+
 	case IPV6_UNICAST_HOPS:
 	case IPV6_MULTICAST_HOPS:
 	{
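
The port handling in the datagram_recv_ctl() hunk above relies on the destination port sitting in bytes 2-3 of the transport header, guarded by a four-byte bounds check. A minimal userspace sketch of the same idea, assuming a flat packet buffer instead of an skb (orig_dst_port() and its parameters are hypothetical names, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the destination port, still in network byte order, out of the
 * transport header that starts at offset thoff in pkt. Returns false
 * when fewer than four header bytes are available. */
static bool orig_dst_port(const uint8_t *pkt, size_t len, size_t thoff,
			  uint16_t *port_be)
{
	assert(pkt != NULL && port_be != NULL);

	/* Same bound as "skb_transport_offset(skb) + 4 <= skb->len". */
	if (thoff > len || len - thoff < 4)
		return false;

	/* Bytes 0-1 hold the source port, bytes 2-3 the destination
	 * port; memcpy avoids an unaligned 16-bit load. */
	memcpy(port_be, pkt + thoff + 2, sizeof(*port_be));
	return true;
}
```

Like the patch, the sketch leaves the port in network byte order, which is what a sockaddr_in6 expects in sin6_port.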



^ permalink raw reply related

* [PATCH 1/9] tproxy: split off ipv6 defragmentation to a separate module
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

As with IPv4, TProxy needs IPv6 defragmentation but does not
require connection tracking. Since defragmentation was coupled
with conntrack, I split the two apart, creating an nf_defrag_ipv6 module
similar to the already existing nf_defrag_ipv4.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 include/net/netfilter/ipv6/nf_defrag_ipv6.h    |    6 +
 net/ipv6/netfilter/Makefile                    |    5 +
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   78 +-------------
 net/ipv6/netfilter/nf_conntrack_reasm.c        |   12 ++
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c      |  131 ++++++++++++++++++++++++
 5 files changed, 154 insertions(+), 78 deletions(-)
 create mode 100644 include/net/netfilter/ipv6/nf_defrag_ipv6.h
 create mode 100644 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c

diff --git a/include/net/netfilter/ipv6/nf_defrag_ipv6.h b/include/net/netfilter/ipv6/nf_defrag_ipv6.h
new file mode 100644
index 0000000..94dd54d
--- /dev/null
+++ b/include/net/netfilter/ipv6/nf_defrag_ipv6.h
@@ -0,0 +1,6 @@
+#ifndef _NF_DEFRAG_IPV6_H
+#define _NF_DEFRAG_IPV6_H
+
+extern void nf_defrag_ipv6_enable(void);
+
+#endif /* _NF_DEFRAG_IPV6_H */
diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
index aafbba3..3f8e4a3 100644
--- a/net/ipv6/netfilter/Makefile
+++ b/net/ipv6/netfilter/Makefile
@@ -11,10 +11,11 @@ obj-$(CONFIG_IP6_NF_RAW) += ip6table_raw.o
 obj-$(CONFIG_IP6_NF_SECURITY) += ip6table_security.o
 
 # objects for l3 independent conntrack
-nf_conntrack_ipv6-objs  :=  nf_conntrack_l3proto_ipv6.o nf_conntrack_proto_icmpv6.o nf_conntrack_reasm.o
+nf_conntrack_ipv6-objs  :=  nf_conntrack_l3proto_ipv6.o nf_conntrack_proto_icmpv6.o
+nf_defrag_ipv6-objs := nf_defrag_ipv6_hooks.o nf_conntrack_reasm.o
 
 # l3 independent conntrack
-obj-$(CONFIG_NF_CONNTRACK_IPV6) += nf_conntrack_ipv6.o
+obj-$(CONFIG_NF_CONNTRACK_IPV6) += nf_conntrack_ipv6.o nf_defrag_ipv6.o
 
 # matches
 obj-$(CONFIG_IP6_NF_MATCH_AH) += ip6t_ah.o
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index ff43461..c8af58b 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -16,7 +16,6 @@
 #include <linux/module.h>
 #include <linux/skbuff.h>
 #include <linux/icmp.h>
-#include <linux/sysctl.h>
 #include <net/ipv6.h>
 #include <net/inet_frag.h>
 
@@ -29,6 +28,7 @@
 #include <net/netfilter/nf_conntrack_core.h>
 #include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
+#include <net/netfilter/ipv6/nf_defrag_ipv6.h>
 #include <net/netfilter/nf_log.h>
 
 static bool ipv6_pkt_to_tuple(const struct sk_buff *skb, unsigned int nhoff,
@@ -189,53 +189,6 @@ out:
 	return nf_conntrack_confirm(skb);
 }
 
-static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
-						struct sk_buff *skb)
-{
-	u16 zone = NF_CT_DEFAULT_ZONE;
-
-	if (skb->nfct)
-		zone = nf_ct_zone((struct nf_conn *)skb->nfct);
-
-#ifdef CONFIG_BRIDGE_NETFILTER
-	if (skb->nf_bridge &&
-	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
-		return IP6_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
-#endif
-	if (hooknum == NF_INET_PRE_ROUTING)
-		return IP6_DEFRAG_CONNTRACK_IN + zone;
-	else
-		return IP6_DEFRAG_CONNTRACK_OUT + zone;
-
-}
-
-static unsigned int ipv6_defrag(unsigned int hooknum,
-				struct sk_buff *skb,
-				const struct net_device *in,
-				const struct net_device *out,
-				int (*okfn)(struct sk_buff *))
-{
-	struct sk_buff *reasm;
-
-	/* Previously seen (loopback)?  */
-	if (skb->nfct && !nf_ct_is_template((struct nf_conn *)skb->nfct))
-		return NF_ACCEPT;
-
-	reasm = nf_ct_frag6_gather(skb, nf_ct6_defrag_user(hooknum, skb));
-	/* queued */
-	if (reasm == NULL)
-		return NF_STOLEN;
-
-	/* error occured or not fragmented */
-	if (reasm == skb)
-		return NF_ACCEPT;
-
-	nf_ct_frag6_output(hooknum, reasm, (struct net_device *)in,
-			   (struct net_device *)out, okfn);
-
-	return NF_STOLEN;
-}
-
 static unsigned int __ipv6_conntrack_in(struct net *net,
 					unsigned int hooknum,
 					struct sk_buff *skb,
@@ -288,13 +241,6 @@ static unsigned int ipv6_conntrack_local(unsigned int hooknum,
 
 static struct nf_hook_ops ipv6_conntrack_ops[] __read_mostly = {
 	{
-		.hook		= ipv6_defrag,
-		.owner		= THIS_MODULE,
-		.pf		= NFPROTO_IPV6,
-		.hooknum	= NF_INET_PRE_ROUTING,
-		.priority	= NF_IP6_PRI_CONNTRACK_DEFRAG,
-	},
-	{
 		.hook		= ipv6_conntrack_in,
 		.owner		= THIS_MODULE,
 		.pf		= NFPROTO_IPV6,
@@ -309,13 +255,6 @@ static struct nf_hook_ops ipv6_conntrack_ops[] __read_mostly = {
 		.priority	= NF_IP6_PRI_CONNTRACK,
 	},
 	{
-		.hook		= ipv6_defrag,
-		.owner		= THIS_MODULE,
-		.pf		= NFPROTO_IPV6,
-		.hooknum	= NF_INET_LOCAL_OUT,
-		.priority	= NF_IP6_PRI_CONNTRACK_DEFRAG,
-	},
-	{
 		.hook		= ipv6_confirm,
 		.owner		= THIS_MODULE,
 		.pf		= NFPROTO_IPV6,
@@ -387,10 +326,6 @@ struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv6 __read_mostly = {
 	.nlattr_to_tuple	= ipv6_nlattr_to_tuple,
 	.nla_policy		= ipv6_nla_policy,
 #endif
-#ifdef CONFIG_SYSCTL
-	.ctl_table_path		= nf_net_netfilter_sysctl_path,
-	.ctl_table		= nf_ct_ipv6_sysctl_table,
-#endif
 	.me			= THIS_MODULE,
 };
 
@@ -403,16 +338,12 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
 	int ret = 0;
 
 	need_conntrack();
+	nf_defrag_ipv6_enable();
 
-	ret = nf_ct_frag6_init();
-	if (ret < 0) {
-		pr_err("nf_conntrack_ipv6: can't initialize frag6.\n");
-		return ret;
-	}
 	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_tcp6);
 	if (ret < 0) {
 		pr_err("nf_conntrack_ipv6: can't register tcp.\n");
-		goto cleanup_frag6;
+		return ret;
 	}
 
 	ret = nf_conntrack_l4proto_register(&nf_conntrack_l4proto_udp6);
@@ -450,8 +381,6 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
 	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp6);
  cleanup_tcp:
 	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
- cleanup_frag6:
-	nf_ct_frag6_cleanup();
 	return ret;
 }
 
@@ -463,7 +392,6 @@ static void __exit nf_conntrack_l3proto_ipv6_fini(void)
 	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_icmpv6);
 	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_udp6);
 	nf_conntrack_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
-	nf_ct_frag6_cleanup();
 }
 
 module_init(nf_conntrack_l3proto_ipv6_init);
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 138a8b3..bb669b4 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -73,7 +73,7 @@ static struct inet_frags nf_frags;
 static struct netns_frags nf_init_frags;
 
 #ifdef CONFIG_SYSCTL
-struct ctl_table nf_ct_ipv6_sysctl_table[] = {
+struct ctl_table nf_ct_frag6_sysctl_table[] = {
 	{
 		.procname	= "nf_conntrack_frag6_timeout",
 		.data		= &nf_init_frags.timeout,
@@ -97,6 +97,8 @@ struct ctl_table nf_ct_ipv6_sysctl_table[] = {
 	},
 	{ }
 };
+
+static struct ctl_table_header *nf_ct_frag6_sysctl_header;
 #endif
 
 static unsigned int nf_hashfn(struct inet_frag_queue *q)
@@ -623,11 +625,19 @@ int nf_ct_frag6_init(void)
 	inet_frags_init_net(&nf_init_frags);
 	inet_frags_init(&nf_frags);
 
+	nf_ct_frag6_sysctl_header = register_sysctl_paths(nf_net_netfilter_sysctl_path,
+							  nf_ct_frag6_sysctl_table);
+	if (!nf_ct_frag6_sysctl_header)
+		return -ENOMEM;
+
 	return 0;
 }
 
 void nf_ct_frag6_cleanup(void)
 {
+	unregister_sysctl_table(nf_ct_frag6_sysctl_header);
+	nf_ct_frag6_sysctl_header = NULL;
+
 	inet_frags_fini(&nf_frags);
 
 	nf_init_frags.low_thresh = 0;
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
new file mode 100644
index 0000000..99abfb5
--- /dev/null
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -0,0 +1,131 @@
+/* (C) 1999-2001 Paul `Rusty' Russell
+ * (C) 2002-2004 Netfilter Core Team <coreteam@netfilter.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/ipv6.h>
+#include <linux/in6.h>
+#include <linux/netfilter.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/icmp.h>
+#include <linux/sysctl.h>
+#include <net/ipv6.h>
+#include <net/inet_frag.h>
+
+#include <linux/netfilter_ipv6.h>
+#include <linux/netfilter_bridge.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_l4proto.h>
+#include <net/netfilter/nf_conntrack_l3proto.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
+#include <net/netfilter/ipv6/nf_defrag_ipv6.h>
+
+static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
+						struct sk_buff *skb)
+{
+	u16 zone = NF_CT_DEFAULT_ZONE;
+
+	if (skb->nfct)
+		zone = nf_ct_zone((struct nf_conn *)skb->nfct);
+
+#ifdef CONFIG_BRIDGE_NETFILTER
+	if (skb->nf_bridge &&
+	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
+		return IP6_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
+#endif
+	if (hooknum == NF_INET_PRE_ROUTING)
+		return IP6_DEFRAG_CONNTRACK_IN + zone;
+	else
+		return IP6_DEFRAG_CONNTRACK_OUT + zone;
+
+}
+
+static unsigned int ipv6_defrag(unsigned int hooknum,
+				struct sk_buff *skb,
+				const struct net_device *in,
+				const struct net_device *out,
+				int (*okfn)(struct sk_buff *))
+{
+	struct sk_buff *reasm;
+
+	/* Previously seen (loopback)?	*/
+	if (skb->nfct && !nf_ct_is_template((struct nf_conn *)skb->nfct))
+		return NF_ACCEPT;
+
+	reasm = nf_ct_frag6_gather(skb, nf_ct6_defrag_user(hooknum, skb));
+	/* queued */
+	if (reasm == NULL)
+		return NF_STOLEN;
+
+	/* error occurred or not fragmented */
+	if (reasm == skb)
+		return NF_ACCEPT;
+
+	nf_ct_frag6_output(hooknum, reasm, (struct net_device *)in,
+			   (struct net_device *)out, okfn);
+
+	return NF_STOLEN;
+}
+
+static struct nf_hook_ops ipv6_defrag_ops[] = {
+	{
+		.hook		= ipv6_defrag,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV6,
+		.hooknum	= NF_INET_PRE_ROUTING,
+		.priority	= NF_IP6_PRI_CONNTRACK_DEFRAG,
+	},
+	{
+		.hook		= ipv6_defrag,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV6,
+		.hooknum	= NF_INET_LOCAL_OUT,
+		.priority	= NF_IP6_PRI_CONNTRACK_DEFRAG,
+	},
+};
+
+static int __init nf_defrag_init(void)
+{
+	int ret = 0;
+
+	ret = nf_ct_frag6_init();
+	if (ret < 0) {
+		pr_err("nf_defrag_ipv6: can't initialize frag6.\n");
+		return ret;
+	}
+	ret = nf_register_hooks(ipv6_defrag_ops, ARRAY_SIZE(ipv6_defrag_ops));
+	if (ret < 0) {
+		pr_err("nf_defrag_ipv6: can't register hooks\n");
+		goto cleanup_frag6;
+	}
+	return ret;
+
+cleanup_frag6:
+	nf_ct_frag6_cleanup();
+	return ret;
+
+}
+
+static void __exit nf_defrag_fini(void)
+{
+	nf_unregister_hooks(ipv6_defrag_ops, ARRAY_SIZE(ipv6_defrag_ops));
+	nf_ct_frag6_cleanup();
+}
+
+void nf_defrag_ipv6_enable(void)
+{
+}
+EXPORT_SYMBOL_GPL(nf_defrag_ipv6_enable);
+
+module_init(nf_defrag_init);
+module_exit(nf_defrag_fini);
+
+MODULE_LICENSE("GPL");



^ permalink raw reply related

* [PATCH 1/3] tproxy: kick out TIME_WAIT sockets in case a new connection comes in with the same tuple
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112142.6538.25550.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

Without tproxy redirections an incoming SYN kicks out conflicting
TIME_WAIT sockets, in order to handle clients that reuse ports
within the TIME_WAIT period.

The same mechanism didn't work when TProxy was involved in finding
the proper socket, as the time_wait processing code looked up the
listening socket assuming that the listener addr/port matches those
of the established connection.

This is not the case with TProxy, as the tproxy rule may change the
listener addr/port.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 include/net/netfilter/nf_tproxy_core.h |    6 ++-
 net/netfilter/nf_tproxy_core.c         |   29 ++++++++++----
 net/netfilter/xt_TPROXY.c              |   68 ++++++++++++++++++++++++++++++--
 net/netfilter/xt_socket.c              |    2 -
 4 files changed, 90 insertions(+), 15 deletions(-)

diff --git a/include/net/netfilter/nf_tproxy_core.h b/include/net/netfilter/nf_tproxy_core.h
index 208b46f..b3a8942 100644
--- a/include/net/netfilter/nf_tproxy_core.h
+++ b/include/net/netfilter/nf_tproxy_core.h
@@ -8,12 +8,16 @@
 #include <net/inet_sock.h>
 #include <net/tcp.h>
 
+#define NFT_LOOKUP_ANY         0
+#define NFT_LOOKUP_LISTENER    1
+#define NFT_LOOKUP_ESTABLISHED 2
+
 /* look up and get a reference to a matching socket */
 extern struct sock *
 nf_tproxy_get_sock_v4(struct net *net, const u8 protocol,
 		      const __be32 saddr, const __be32 daddr,
 		      const __be16 sport, const __be16 dport,
-		      const struct net_device *in, bool listening);
+		      const struct net_device *in, int lookup_type);
 
 static inline void
 nf_tproxy_put_sock(struct sock *sk)
diff --git a/net/netfilter/nf_tproxy_core.c b/net/netfilter/nf_tproxy_core.c
index daab8c4..2ce945c 100644
--- a/net/netfilter/nf_tproxy_core.c
+++ b/net/netfilter/nf_tproxy_core.c
@@ -22,21 +22,34 @@ struct sock *
 nf_tproxy_get_sock_v4(struct net *net, const u8 protocol,
 		      const __be32 saddr, const __be32 daddr,
 		      const __be16 sport, const __be16 dport,
-		      const struct net_device *in, bool listening_only)
+		      const struct net_device *in, int lookup_type)
 {
 	struct sock *sk;
 
 	/* look up socket */
 	switch (protocol) {
 	case IPPROTO_TCP:
-		if (listening_only)
-			sk = __inet_lookup_listener(net, &tcp_hashinfo,
-						    daddr, ntohs(dport),
-						    in->ifindex);
-		else
+		switch (lookup_type) {
+		case NFT_LOOKUP_ANY:
 			sk = __inet_lookup(net, &tcp_hashinfo,
 					   saddr, sport, daddr, dport,
 					   in->ifindex);
+			break;
+		case NFT_LOOKUP_LISTENER:
+			sk = inet_lookup_listener(net, &tcp_hashinfo,
+						    daddr, dport,
+						    in->ifindex);
+			break;
+		case NFT_LOOKUP_ESTABLISHED:
+			sk = inet_lookup_established(net, &tcp_hashinfo,
+						    saddr, sport, daddr, dport,
+						    in->ifindex);
+			break;
+		default:
+			WARN_ON(1);
+			sk = NULL;
+			break;
+		}
 		break;
 	case IPPROTO_UDP:
 		sk = udp4_lib_lookup(net, saddr, sport, daddr, dport,
@@ -47,8 +60,8 @@ nf_tproxy_get_sock_v4(struct net *net, const u8 protocol,
 		sk = NULL;
 	}
 
-	pr_debug("tproxy socket lookup: proto %u %08x:%u -> %08x:%u, listener only: %d, sock %p\n",
-		 protocol, ntohl(saddr), ntohs(sport), ntohl(daddr), ntohs(dport), listening_only, sk);
+	pr_debug("tproxy socket lookup: proto %u %08x:%u -> %08x:%u, lookup type: %d, sock %p\n",
+		 protocol, ntohl(saddr), ntohs(sport), ntohl(daddr), ntohs(dport), lookup_type, sk);
 
 	return sk;
 }
diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index c61294d..67cbed8 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -24,6 +24,57 @@
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
 #include <net/netfilter/nf_tproxy_core.h>
 
+/**
+ * tproxy_handle_time_wait() - handle TCP TIME_WAIT reopen redirections
+ * @skb:	The skb being processed.
+ * @par:	Iptables target parameters.
+ * @sk:		The TIME_WAIT TCP socket found by the lookup.
+ *
+ * We have to handle SYN packets arriving to TIME_WAIT sockets
+ * differently: instead of reopening the connection we should rather
+ * redirect the new connection to the proxy if there's a listener
+ * socket present.
+ *
+ * tproxy_handle_time_wait() consumes the socket reference passed in.
+ *
+ * Returns the listener socket if there's one, the TIME_WAIT socket if
+ * no such listener is found, or NULL if the TCP header is incomplete.
+ */
+static struct sock *
+tproxy_handle_time_wait(struct sk_buff *skb, const struct xt_action_param *par, struct sock *sk)
+{
+	const struct iphdr *iph = ip_hdr(skb);
+	const struct xt_tproxy_target_info *tgi = par->targinfo;
+	struct tcphdr _hdr, *hp;
+
+	hp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(_hdr), &_hdr);
+	if (hp == NULL) {
+		inet_twsk_put(inet_twsk(sk));
+		return NULL;
+	}
+
+	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
+		/* SYN to a TIME_WAIT socket, we'd rather redirect it
+		 * to a listener socket if there's one */
+		struct sock *sk2;
+
+		sk2 = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
+					    iph->saddr, tgi->laddr ? tgi->laddr : iph->daddr,
+					    hp->source, tgi->lport ? tgi->lport : hp->dest,
+					    par->in, NFT_LOOKUP_LISTENER);
+		if (sk2) {
+			/* yeah, there's one, let's kill the TIME_WAIT
+			 * socket and redirect to the listener
+			 */
+			inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
+			inet_twsk_put(inet_twsk(sk));
+			sk = sk2;
+		}
+	}
+
+	return sk;
+}
+
 static unsigned int
 tproxy_tg(struct sk_buff *skb, const struct xt_action_param *par)
 {
@@ -37,11 +88,18 @@ tproxy_tg(struct sk_buff *skb, const struct xt_action_param *par)
 		return NF_DROP;
 
 	sk = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
-				   iph->saddr,
-				   tgi->laddr ? tgi->laddr : iph->daddr,
-				   hp->source,
-				   tgi->lport ? tgi->lport : hp->dest,
-				   par->in, true);
+				   iph->saddr, iph->daddr,
+				   hp->source, hp->dest,
+				   par->in, NFT_LOOKUP_ESTABLISHED);
+
+	/* UDP has no TCP_TIME_WAIT state, so we never enter here */
+	if (sk && sk->sk_state == TCP_TIME_WAIT)
+		sk = tproxy_handle_time_wait(skb, par, sk);
+	else if (!sk)
+		sk = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
+					   iph->saddr, tgi->laddr ? tgi->laddr : iph->daddr,
+					   hp->source, tgi->lport ? tgi->lport : hp->dest,
+					   par->in, NFT_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
 	if (sk && nf_tproxy_assign_sock(skb, sk)) {
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 1ca8990..266faa0 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -142,7 +142,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 #endif
 
 	sk = nf_tproxy_get_sock_v4(dev_net(skb->dev), protocol,
-				   saddr, daddr, sport, dport, par->in, false);
+				   saddr, daddr, sport, dport, par->in, NFT_LOOKUP_ANY);
 	if (sk != NULL) {
 		bool wildcard;
 		bool transparent = true;
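
The TIME_WAIT handling in tproxy_handle_time_wait() above boils down to a
two-input decision: is the packet a pure SYN, and did the listener lookup find
anything? The following is a minimal user-space sketch of that decision only.
All sketch_* names are hypothetical stand-ins, not kernel symbols, and the
reference counting and inet_twsk_deschedule() call are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the four TCP header flags the patch tests. */
struct sketch_tcp_flags {
	bool syn, rst, ack, fin;
};

enum sketch_target {
	SKETCH_KEEP_TIME_WAIT,	/* hand the packet to the TIME_WAIT socket */
	SKETCH_USE_LISTENER,	/* drop the TIME_WAIT socket, use the listener */
};

/* Same test as "hp->syn && !hp->rst && !hp->ack && !hp->fin". */
static bool sketch_is_pure_syn(const struct sketch_tcp_flags *f)
{
	assert(f != NULL);
	return f->syn && !f->rst && !f->ack && !f->fin;
}

/* Only a pure SYN with a listener present is redirected to the proxy. */
static enum sketch_target
sketch_time_wait_decision(const struct sketch_tcp_flags *f, bool listener_found)
{
	if (sketch_is_pure_syn(f) && listener_found)
		return SKETCH_USE_LISTENER;
	return SKETCH_KEEP_TIME_WAIT;
}
```

In the patch itself the "listener_found" input is the result of the
NFT_LOOKUP_LISTENER lookup, which is only attempted for a pure SYN.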




* [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: Robin Holt @ 2010-10-20 12:03 UTC (permalink / raw)
  To: David S. Miller
  Cc: Willy Tarreau, linux-kernel, netdev, linux-sctp, Alexey Kuznetsov,
	Pekka Savola (ipv6), James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, Vlad Yasevich, Sridhar Samudrala
In-Reply-To: <20101020120336.967805943@gulag1.americas.sgi.com>

[-- Attachment #1: sysctl_tcp_udp_mem_max_overflows --]
[-- Type: text/plain, Size: 4097 bytes --]

Subject: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.

On a 16TB x86_64 machine, sysctl_tcp_mem[2], sysctl_udp_mem[2], and
sysctl_sctp_mem[2] can overflow a signed int.  Cap limit so that they are
as large as possible without overflowing.

Signed-off-by: Robin Holt <holt@sgi.com>
To: "David S. Miller" <davem@davemloft.net>
Cc: Willy Tarreau <w@1wt.eu>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-sctp@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>

---

David, I believe this is also intended for stable.  If you agree, please
add the stable maintainers to the Cc: list.
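
For readers who want to check the numbers, here is a user-space sketch of the
arithmetic. It assumes 4 KiB pages (PAGE_SHIFT of 12), a 32-bit int, and that
the sysctl arrays hold int values, as the INT_MAX bound suggests; the sketch_*
names are hypothetical. With exactly 2^32 pages (16 TB) the uncapped limit
comes out as 2147483648 and the [2] entry as 3221225472; with the new min()
they become 1431655764 and 2147483646.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages, as on x86_64 */

/* Same arithmetic as the init code; cap != 0 adds the new min(). */
static uint64_t sketch_limit(uint64_t nr_pages, int cap)
{
	const uint64_t ceiling = (uint64_t)INT_MAX * 4 / 3 / 2;
	uint64_t limit = nr_pages;

	if (limit > (UINT64_C(1) << (28 - SKETCH_PAGE_SHIFT)))
		limit = UINT64_C(1) << (28 - SKETCH_PAGE_SHIFT);
	limit >>= 20 - SKETCH_PAGE_SHIFT;
	limit = (limit * (nr_pages >> (20 - SKETCH_PAGE_SHIFT))) >>
		(SKETCH_PAGE_SHIFT - 11);
	if (limit < 128)
		limit = 128;
	if (cap && limit > ceiling)
		limit = ceiling;
	return limit;
}

/* What would be assigned to sysctl_*_mem[2], kept in 64 bits. */
static uint64_t sketch_mem2(uint64_t limit)
{
	return limit / 4 * 3 * 2;
}

/* Worked check for exactly 2^32 pages (16 TB of 4 KiB pages). */
static void sketch_selfcheck(void)
{
	uint64_t uncapped = sketch_limit(UINT64_C(1) << 32, 0);
	uint64_t capped = sketch_limit(UINT64_C(1) << 32, 1);

	assert(sketch_mem2(uncapped) > (uint64_t)INT_MAX);	/* would not fit */
	assert(sketch_mem2(capped) <= (uint64_t)INT_MAX);	/* fits in an int */
}
```

The ceiling is chosen so that limit / 4 * 3 * 2, the largest of the three
values, stays at or below INT_MAX.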

 net/ipv4/tcp.c      |    4 +++-
 net/ipv4/udp.c      |    4 +++-
 net/sctp/protocol.c |    4 +++-
 3 files changed, 9 insertions(+), 3 deletions(-)

Index: pv1010932/net/ipv4/tcp.c
===================================================================
--- pv1010932.orig/net/ipv4/tcp.c	2010-10-02 06:11:59.737449853 -0500
+++ pv1010932/net/ipv4/tcp.c	2010-10-02 06:12:35.445454593 -0500
@@ -3271,12 +3271,14 @@ void __init tcp_init(void)
 
 	/* Set the pressure threshold to be a fraction of global memory that
 	 * is up to 1/2 at 256 MB, decreasing toward zero with the amount of
-	 * memory, with a floor of 128 pages.
+	 * memory, with a floor of 128 pages, and a ceiling that prevents an
+	 * integer overflow.
 	 */
 	nr_pages = totalram_pages - totalhigh_pages;
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_tcp_mem[0] = limit / 4 * 3;
 	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
Index: pv1010932/net/ipv4/udp.c
===================================================================
--- pv1010932.orig/net/ipv4/udp.c	2010-10-02 06:11:59.737449853 -0500
+++ pv1010932/net/ipv4/udp.c	2010-10-02 06:12:35.453453784 -0500
@@ -2167,12 +2167,14 @@ void __init udp_init(void)
 	udp_table_init(&udp_table, "UDP");
 	/* Set the pressure threshold up by the same strategy of TCP. It is a
 	 * fraction of global memory that is up to 1/2 at 256 MB, decreasing
-	 * toward zero with the amount of memory, with a floor of 128 pages.
+	 * toward zero with the amount of memory, with a floor of 128 pages,
+	 * and a ceiling that prevents an integer overflow.
 	 */
 	nr_pages = totalram_pages - totalhigh_pages;
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_udp_mem[0] = limit / 4 * 3;
 	sysctl_udp_mem[1] = limit;
 	sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;
Index: pv1010932/net/sctp/protocol.c
===================================================================
--- pv1010932.orig/net/sctp/protocol.c	2010-10-02 06:11:59.000000000 -0500
+++ pv1010932/net/sctp/protocol.c	2010-10-02 06:13:17.727949810 -0500
@@ -1162,7 +1162,8 @@ SCTP_STATIC __init int sctp_init(void)
 
 	/* Set the pressure threshold to be a fraction of global memory that
 	 * is up to 1/2 at 256 MB, decreasing toward zero with the amount of
-	 * memory, with a floor of 128 pages.
+	 * memory, with a floor of 128 pages, and a ceiling that prevents an
+	 * integer overflow.
 	 * Note this initializes the data in sctpv6_prot too
 	 * Unabashedly stolen from tcp_init
 	 */
@@ -1170,6 +1171,7 @@ SCTP_STATIC __init int sctp_init(void)
 	limit = min(nr_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
 	limit = (limit * (nr_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
 	limit = max(limit, 128UL);
+	limit = min(limit, INT_MAX * 4UL / 3 / 2);
 	sysctl_sctp_mem[0] = limit / 4 * 3;
 	sysctl_sctp_mem[1] = limit;
 	sysctl_sctp_mem[2] = sysctl_sctp_mem[0] * 2;


* [PATCH 2/3] tproxy: add lookup type checks for UDP in nf_tproxy_get_sock_v4()
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112142.6538.25550.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

Also, inline this function as the lookup_type is always a literal
and inlining removes branches performed at runtime.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
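
The UDP check added in the diff below can be read as a small predicate over
the socket's state. This is a user-space sketch of that predicate only; the
sketch_* names are hypothetical, and the value 1 for the listener lookup type
is an assumption (only ANY = 0 and ESTABLISHED = 2 are visible in the hunks
quoted in this thread).

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_lookup {
	SKETCH_LOOKUP_ANY = 0,
	SKETCH_LOOKUP_LISTENER = 1,	/* assumed value */
	SKETCH_LOOKUP_ESTABLISHED = 2,
};

/*
 * Returns true if a UDP socket found by the generic lookup should be kept
 * for the given lookup type, false if it should be released again.
 *
 * connected: the socket is in the TCP_ESTABLISHED state
 * wildcard:  the socket's receive address is 0.0.0.0
 */
static bool sketch_udp_keep(enum sketch_lookup type, bool connected, bool wildcard)
{
	/* Precondition of this sketch, not something the patch checks. */
	assert(type >= SKETCH_LOOKUP_ANY && type <= SKETCH_LOOKUP_ESTABLISHED);

	if (type == SKETCH_LOOKUP_ANY)
		return true;
	if (type == SKETCH_LOOKUP_ESTABLISHED && (!connected || wildcard))
		return false;
	if (type == SKETCH_LOOKUP_LISTENER && connected)
		return false;
	return true;
}
```

Note that a wildcard-bound, unconnected socket is kept for a listener lookup,
which matches the "xt_TPROXY needs 0 bound listeners too" comment in the diff.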
 include/net/netfilter/nf_tproxy_core.h |  116 +++++++++++++++++++++++++++++++-
 net/netfilter/nf_tproxy_core.c         |   48 -------------
 2 files changed, 114 insertions(+), 50 deletions(-)

diff --git a/include/net/netfilter/nf_tproxy_core.h b/include/net/netfilter/nf_tproxy_core.h
index b3a8942..1027d7f 100644
--- a/include/net/netfilter/nf_tproxy_core.h
+++ b/include/net/netfilter/nf_tproxy_core.h
@@ -13,11 +13,123 @@
 #define NFT_LOOKUP_ESTABLISHED 2
 
 /* look up and get a reference to a matching socket */
-extern struct sock *
+
+
+/* This function is used by the 'TPROXY' target and the 'socket'
+ * match. The following lookups are supported:
+ *
+ * Explicit TProxy target rule
+ * ===========================
+ *
+ * This is used when the user wants to intercept a connection matching
+ * an explicit iptables rule. In this case the sockets are assumed
+ * matching in preference order:
+ *
+ *   - match: if there's a fully established connection matching the
+ *     _packet_ tuple, it is returned, assuming the redirection
+ *     already took place and we process a packet belonging to an
+ *     established connection
+ *
+ *   - match: if there's a listening socket matching the redirection
+ *     (e.g. on-port & on-ip of the connection), it is returned,
+ *     regardless if it was bound to 0.0.0.0 or an explicit
+ *     address. The reasoning is that if there's an explicit rule, it
+ *     does not really matter if the listener is bound to an interface
+ *     or to 0. The user already stated that he wants redirection
+ *     (since he added the rule).
+ *
+ * "socket" match based redirection (no specific rule)
+ * ===================================================
+ *
+ * There are connections with dynamic endpoints (e.g. FTP data
+ * connection) that the user is unable to add explicit rules
+ * for. These are taken care of by a generic "socket" rule. It is
+ * assumed that the proxy application is trusted to open such
+ * connections without explicit iptables rule (except of course the
+ * generic 'socket' rule). In this case the following sockets are
+ * matched in preference order:
+ *
+ *   - match: if there's a fully established connection matching the
+ *     _packet_ tuple
+ *
+ *   - match: if there's a non-zero bound listener (possibly with a
+ *     non-local address) We don't accept zero-bound listeners, since
+ *     then local services could intercept traffic going through the
+ *     box.
+ *
+ * Please note that there's an overlap between what a TPROXY target
+ * and a socket match will match. Normally if you have both rules the
+ * "socket" match will be the first one, effectively all packets
+ * belonging to established connections going through that one.
+ */
+static inline struct sock *
 nf_tproxy_get_sock_v4(struct net *net, const u8 protocol,
 		      const __be32 saddr, const __be32 daddr,
 		      const __be16 sport, const __be16 dport,
-		      const struct net_device *in, int lookup_type);
+		      const struct net_device *in, int lookup_type)
+{
+	struct sock *sk;
+
+	/* look up socket */
+	switch (protocol) {
+	case IPPROTO_TCP:
+		switch (lookup_type) {
+		case NFT_LOOKUP_ANY:
+			sk = __inet_lookup(net, &tcp_hashinfo,
+					   saddr, sport, daddr, dport,
+					   in->ifindex);
+			break;
+		case NFT_LOOKUP_LISTENER:
+			sk = inet_lookup_listener(net, &tcp_hashinfo,
+						    daddr, dport,
+						    in->ifindex);
+
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too */
+
+			break;
+		case NFT_LOOKUP_ESTABLISHED:
+			sk = inet_lookup_established(net, &tcp_hashinfo,
+						    saddr, sport, daddr, dport,
+						    in->ifindex);
+			break;
+		default:
+			WARN_ON(1);
+			sk = NULL;
+			break;
+		}
+		break;
+	case IPPROTO_UDP:
+		sk = udp4_lib_lookup(net, saddr, sport, daddr, dport,
+				     in->ifindex);
+		if (sk && lookup_type != NFT_LOOKUP_ANY) {
+			int connected = (sk->sk_state == TCP_ESTABLISHED);
+			int wildcard = (inet_sk(sk)->inet_rcv_saddr == 0);
+
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too */
+			if ((lookup_type == NFT_LOOKUP_ESTABLISHED && (!connected || wildcard)) ||
+			    (lookup_type == NFT_LOOKUP_LISTENER && connected)) {
+				sock_put(sk);
+				sk = NULL;
+			}
+		}
+		break;
+	default:
+		WARN_ON(1);
+		sk = NULL;
+	}
+
+	pr_debug("tproxy socket lookup: proto %u %08x:%u -> %08x:%u, lookup type: %d, sock %p\n",
+		 protocol, ntohl(saddr), ntohs(sport), ntohl(daddr), ntohs(dport), lookup_type, sk);
+
+	return sk;
+}
+
 
 static inline void
 nf_tproxy_put_sock(struct sock *sk)
diff --git a/net/netfilter/nf_tproxy_core.c b/net/netfilter/nf_tproxy_core.c
index 2ce945c..4d87bef 100644
--- a/net/netfilter/nf_tproxy_core.c
+++ b/net/netfilter/nf_tproxy_core.c
@@ -18,54 +18,6 @@
 #include <net/udp.h>
 #include <net/netfilter/nf_tproxy_core.h>
 
-struct sock *
-nf_tproxy_get_sock_v4(struct net *net, const u8 protocol,
-		      const __be32 saddr, const __be32 daddr,
-		      const __be16 sport, const __be16 dport,
-		      const struct net_device *in, int lookup_type)
-{
-	struct sock *sk;
-
-	/* look up socket */
-	switch (protocol) {
-	case IPPROTO_TCP:
-		switch (lookup_type) {
-		case NFT_LOOKUP_ANY:
-			sk = __inet_lookup(net, &tcp_hashinfo,
-					   saddr, sport, daddr, dport,
-					   in->ifindex);
-			break;
-		case NFT_LOOKUP_LISTENER:
-			sk = inet_lookup_listener(net, &tcp_hashinfo,
-						    daddr, dport,
-						    in->ifindex);
-			break;
-		case NFT_LOOKUP_ESTABLISHED:
-			sk = inet_lookup_established(net, &tcp_hashinfo,
-						    saddr, sport, daddr, dport,
-						    in->ifindex);
-			break;
-		default:
-			WARN_ON(1);
-			sk = NULL;
-			break;
-		}
-		break;
-	case IPPROTO_UDP:
-		sk = udp4_lib_lookup(net, saddr, sport, daddr, dport,
-				     in->ifindex);
-		break;
-	default:
-		WARN_ON(1);
-		sk = NULL;
-	}
-
-	pr_debug("tproxy socket lookup: proto %u %08x:%u -> %08x:%u, lookup type: %d, sock %p\n",
-		 protocol, ntohl(saddr), ntohs(sport), ntohl(daddr), ntohs(dport), lookup_type, sk);
-
-	return sk;
-}
-EXPORT_SYMBOL_GPL(nf_tproxy_get_sock_v4);
 
 static void
 nf_tproxy_destructor(struct sk_buff *skb)




* [PATCH 3/3] tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112142.6538.25550.stgit@este.odu>

When __inet_inherit_port() is called on a tproxy connection the wrong locks are
held for the inet_bind_bucket it is added to. __inet_inherit_port() made an
implicit assumption that the listener's port number (and thus its bind bucket)
is the same as the child's. Unfortunately, if you're using the TPROXY target to
redirect skbs to a transparent proxy that assumption is not true anymore and
things break.

This patch adds code to __inet_inherit_port() so that it can handle this case
by looking up or creating a new bind bucket for the child socket and updates
callers of __inet_inherit_port() to gracefully handle __inet_inherit_port()
failing.

Reported by and original patch from Stephen Buck <stephen.buck@exinda.com>.
See http://marc.info/?t=128169268200001&r=1&w=2 for the original discussion.

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
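
The "look up or create a bind bucket" step described above can be sketched in
user space as follows. This is an illustration under simplifying assumptions,
not the kernel code: the sketch_* types are hypothetical, there is a single
hash chain, no locking, no network namespace check, and an integer counter
stands in for the owners list.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, much reduced stand-in for struct inet_bind_bucket. */
struct sketch_bucket {
	unsigned short port;
	int owners;			/* stands in for the tb->owners list */
	struct sketch_bucket *next;
};

/* One hash chain; the real code holds head->lock while walking it. */
struct sketch_chain {
	struct sketch_bucket *head;
};

static struct sketch_bucket *
sketch_find(const struct sketch_chain *chain, unsigned short port)
{
	struct sketch_bucket *tb;

	for (tb = chain->head; tb; tb = tb->next)
		if (tb->port == port)
			return tb;
	return NULL;
}

/* Returns 0 and stores the child's bucket in *out, or -ENOMEM. */
static int
sketch_inherit_port(struct sketch_chain *chain, struct sketch_bucket *listener_tb,
		    unsigned short child_port, struct sketch_bucket **out)
{
	struct sketch_bucket *tb = listener_tb;

	assert(chain && listener_tb && out);
	if (tb->port != child_port) {
		/* Redirected: the listener's bucket is the wrong one. */
		tb = sketch_find(chain, child_port);
		if (!tb) {
			tb = calloc(1, sizeof(*tb));
			if (!tb)
				return -ENOMEM;
			tb->port = child_port;
			tb->next = chain->head;
			chain->head = tb;
		}
	}
	tb->owners++;
	*out = tb;
	return 0;
}
```

Because the create step can fail, the function has to return an error, which
is why the callers in the diff now check the return value before hashing the
new socket.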
 include/net/inet_hashtables.h |    2 +-
 net/dccp/ipv4.c               |   10 +++++++---
 net/dccp/ipv6.c               |   10 +++++++---
 net/ipv4/inet_hashtables.c    |   28 ++++++++++++++++++++++++++--
 net/ipv4/tcp_ipv4.c           |   10 +++++++---
 net/ipv6/tcp_ipv6.c           |   12 ++++++++----
 6 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 74358d1..e9c2ed8 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -245,7 +245,7 @@ static inline int inet_sk_listen_hashfn(const struct sock *sk)
 }
 
 /* Caller must disable local BH processing. */
-extern void __inet_inherit_port(struct sock *sk, struct sock *child);
+extern int __inet_inherit_port(struct sock *sk, struct sock *child);
 
 extern void inet_put_port(struct sock *sk);
 
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index d4a166f..3f69ea1 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -392,7 +392,7 @@ struct sock *dccp_v4_request_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	newsk = dccp_create_openreq_child(sk, req, skb);
 	if (newsk == NULL)
-		goto exit;
+		goto exit_nonewsk;
 
 	sk_setup_caps(newsk, dst);
 
@@ -409,16 +409,20 @@ struct sock *dccp_v4_request_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	dccp_sync_mss(newsk, dst_mtu(dst));
 
+	if (__inet_inherit_port(sk, newsk) < 0) {
+		sock_put(newsk);
+		goto exit;
+	}
 	__inet_hash_nolisten(newsk, NULL);
-	__inet_inherit_port(sk, newsk);
 
 	return newsk;
 
 exit_overflow:
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+exit_nonewsk:
+	dst_release(dst);
 exit:
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-	dst_release(dst);
 	return NULL;
 }
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 6e3f325..dca711d 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -564,7 +564,7 @@ static struct sock *dccp_v6_request_recv_sock(struct sock *sk,
 
 	newsk = dccp_create_openreq_child(sk, req, skb);
 	if (newsk == NULL)
-		goto out;
+		goto out_nonewsk;
 
 	/*
 	 * No need to charge this sock to the relevant IPv6 refcnt debug socks
@@ -632,18 +632,22 @@ static struct sock *dccp_v6_request_recv_sock(struct sock *sk,
 	newinet->inet_daddr = newinet->inet_saddr = LOOPBACK4_IPV6;
 	newinet->inet_rcv_saddr = LOOPBACK4_IPV6;
 
+	if (__inet_inherit_port(sk, newsk) < 0) {
+		sock_put(newsk);
+		goto out;
+	}
 	__inet6_hash(newsk, NULL);
-	__inet_inherit_port(sk, newsk);
 
 	return newsk;
 
 out_overflow:
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+out_nonewsk:
+	dst_release(dst);
 out:
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
 	if (opt != NULL && opt != np->opt)
 		sock_kfree_s(sk, opt, opt->tot_len);
-	dst_release(dst);
 	return NULL;
 }
 
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index fb7ad5a..1b344f3 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -101,19 +101,43 @@ void inet_put_port(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_put_port);
 
-void __inet_inherit_port(struct sock *sk, struct sock *child)
+int __inet_inherit_port(struct sock *sk, struct sock *child)
 {
 	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
-	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(child)->inet_num,
+	unsigned short port = inet_sk(child)->inet_num;
+	const int bhash = inet_bhashfn(sock_net(sk), port,
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
 	struct inet_bind_bucket *tb;
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
+	if (tb->port != port) {
+		/* NOTE: using tproxy and redirecting skbs to a proxy
+		 * on a different listener port breaks the assumption
+		 * that the listener socket's icsk_bind_hash is the same
+		 * as that of the child socket. We have to look up or
+		 * create a new bind bucket for the child here. */
+		struct hlist_node *node;
+		inet_bind_bucket_for_each(tb, node, &head->chain) {
+			if (net_eq(ib_net(tb), sock_net(sk)) &&
+			    tb->port == port)
+				break;
+		}
+		if (!node) {
+			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
+						     sock_net(sk), head, port);
+			if (!tb) {
+				spin_unlock(&head->lock);
+				return -ENOMEM;
+			}
+		}
+	}
 	sk_add_bind_node(child, &tb->owners);
 	inet_csk(child)->icsk_bind_hash = tb;
 	spin_unlock(&head->lock);
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(__inet_inherit_port);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a0232f3..8f8527d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1422,7 +1422,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	newsk = tcp_create_openreq_child(sk, req, skb);
 	if (!newsk)
-		goto exit;
+		goto exit_nonewsk;
 
 	newsk->sk_gso_type = SKB_GSO_TCPV4;
 	sk_setup_caps(newsk, dst);
@@ -1469,16 +1469,20 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 #endif
 
+	if (__inet_inherit_port(sk, newsk) < 0) {
+		sock_put(newsk);
+		goto exit;
+	}
 	__inet_hash_nolisten(newsk, NULL);
-	__inet_inherit_port(sk, newsk);
 
 	return newsk;
 
 exit_overflow:
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+exit_nonewsk:
+	dst_release(dst);
 exit:
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-	dst_release(dst);
 	return NULL;
 }
 EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8d93f6d..7e41e2c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1409,7 +1409,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	newsk = tcp_create_openreq_child(sk, req, skb);
 	if (newsk == NULL)
-		goto out;
+		goto out_nonewsk;
 
 	/*
 	 * No need to charge this sock to the relevant IPv6 refcnt debug socks
@@ -1497,18 +1497,22 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 #endif
 
+	if (__inet_inherit_port(sk, newsk) < 0) {
+		sock_put(newsk);
+		goto out;
+	}
 	__inet6_hash(newsk, NULL);
-	__inet_inherit_port(sk, newsk);
 
 	return newsk;
 
 out_overflow:
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-out:
-	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+out_nonewsk:
 	if (opt && opt != np->opt)
 		sock_kfree_s(sk, opt, opt->tot_len);
 	dst_release(dst);
+out:
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
 	return NULL;
 }
 




* [PATCH 0/3] tproxy fixes for current upstream code
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller

The following series fixes a handful of issues which have been found in the
current upstream IPv4 tproxy code:

  * an issue with how port redirection interacts with TCP TIME_WAIT sockets
  * UDP socket lookup fixes so that now it prefers connected sockets, etc.
  * fix for a bind hash issue which could trigger crashes when port redirection
    was used.

---

Balazs Scheidler (2):
      tproxy: kick out TIME_WAIT sockets in case a new connection comes in with the same tuple
      tproxy: add lookup type checks for UDP in nf_tproxy_get_sock_v4()

KOVACS Krisztian (1):
      tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()


 include/net/inet_hashtables.h          |    2 -
 include/net/netfilter/nf_tproxy_core.h |  120 +++++++++++++++++++++++++++++++-
 net/dccp/ipv4.c                        |   10 ++-
 net/dccp/ipv6.c                        |   10 ++-
 net/ipv4/inet_hashtables.c             |   28 +++++++
 net/ipv4/tcp_ipv4.c                    |   10 ++-
 net/ipv6/tcp_ipv6.c                    |   12 ++-
 net/netfilter/nf_tproxy_core.c         |   35 ---------
 net/netfilter/xt_TPROXY.c              |   68 +++++++++++++++++-
 net/netfilter/xt_socket.c              |    2 -
 10 files changed, 238 insertions(+), 59 deletions(-)

-- 
KOVACS Krisztian



* [PATCH 6/9] tproxy: added IPv6 socket lookup function to nf_tproxy_core
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 include/net/netfilter/nf_tproxy_core.h |   72 ++++++++++++++++++++++++++++++++
 1 files changed, 71 insertions(+), 1 deletions(-)

diff --git a/include/net/netfilter/nf_tproxy_core.h b/include/net/netfilter/nf_tproxy_core.h
index 1027d7f..cd85b3b 100644
--- a/include/net/netfilter/nf_tproxy_core.h
+++ b/include/net/netfilter/nf_tproxy_core.h
@@ -5,7 +5,8 @@
 #include <linux/in.h>
 #include <linux/skbuff.h>
 #include <net/sock.h>
-#include <net/inet_sock.h>
+#include <net/inet_hashtables.h>
+#include <net/inet6_hashtables.h>
 #include <net/tcp.h>
 
 #define NFT_LOOKUP_ANY         0
@@ -130,6 +131,75 @@ nf_tproxy_get_sock_v4(struct net *net, const u8 protocol,
 	return sk;
 }
 
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+static inline struct sock *
+nf_tproxy_get_sock_v6(struct net *net, const u8 protocol,
+		      const struct in6_addr *saddr, const struct in6_addr *daddr,
+		      const __be16 sport, const __be16 dport,
+		      const struct net_device *in, int lookup_type)
+{
+	struct sock *sk;
+
+	/* look up socket */
+	switch (protocol) {
+	case IPPROTO_TCP:
+		switch (lookup_type) {
+		case NFT_LOOKUP_ANY:
+			sk = inet6_lookup(net, &tcp_hashinfo,
+					  saddr, sport, daddr, dport,
+					  in->ifindex);
+			break;
+		case NFT_LOOKUP_LISTENER:
+			sk = inet6_lookup_listener(net, &tcp_hashinfo,
+						   daddr, ntohs(dport),
+						   in->ifindex);
+
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too */
+
+			break;
+		case NFT_LOOKUP_ESTABLISHED:
+			sk = __inet6_lookup_established(net, &tcp_hashinfo,
+							saddr, sport, daddr, ntohs(dport),
+							in->ifindex);
+			break;
+		default:
+			WARN_ON(1);
+			sk = NULL;
+			break;
+		}
+		break;
+	case IPPROTO_UDP:
+		sk = udp6_lib_lookup(net, saddr, sport, daddr, dport,
+				     in->ifindex);
+		if (sk && lookup_type != NFT_LOOKUP_ANY) {
+			int connected = (sk->sk_state == TCP_ESTABLISHED);
+			int wildcard = ipv6_addr_any(&inet6_sk(sk)->rcv_saddr);
+
+			/* NOTE: we return listeners even if bound to
+			 * 0.0.0.0, those are filtered out in
+			 * xt_socket, since xt_TPROXY needs 0 bound
+			 * listeners too */
+			if ((lookup_type == NFT_LOOKUP_ESTABLISHED && (!connected || wildcard)) ||
+			    (lookup_type == NFT_LOOKUP_LISTENER && connected)) {
+				sock_put(sk);
+				sk = NULL;
+			}
+		}
+		break;
+	default:
+		WARN_ON(1);
+		sk = NULL;
+	}
+
+	pr_debug("tproxy socket lookup: proto %u %pI6:%u -> %pI6:%u, lookup type: %d, sock %p\n",
+		 protocol, saddr, ntohs(sport), daddr, ntohs(dport), lookup_type, sk);
+
+	return sk;
+}
+#endif
 
 static inline void
 nf_tproxy_put_sock(struct sock *sk)




* [PATCH 7/9] tproxy: added IPv6 support to the TPROXY target
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

This requires a new revision as the old target structure was
IPv4 specific.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
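
To illustrate why a new revision is needed, here is a user-space sketch of the
two layouts. The sketch_* types are hypothetical mirrors of the structures in
the diff below; in particular, the union is an assumed approximation of
union nf_inet_addr. The v0 address field is 4 bytes wide and cannot hold an
IPv6 address, while the v1 field is 16 bytes. The helper shows the mark
update from the target function, which is the same for both revisions.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>

/* Mirror of the IPv4-only revision 0 layout. */
struct sketch_info_v0 {
	uint32_t mark_mask;
	uint32_t mark_value;
	uint32_t laddr;			/* __be32 in the kernel header */
	uint16_t lport;
};

/* Assumed approximation of union nf_inet_addr. */
union sketch_inet_addr {
	uint32_t all[4];
	uint32_t ip;
	uint32_t ip6[4];
	struct in_addr in;
	struct in6_addr in6;
};

/* Mirror of the revision 1 layout, usable for IPv4 and IPv6. */
struct sketch_info_v1 {
	uint32_t mark_mask;
	uint32_t mark_value;
	union sketch_inet_addr laddr;
	uint16_t lport;
};

static_assert(sizeof(((struct sketch_info_v0 *)0)->laddr) == 4,
	      "v0 holds an IPv4 address only");
static_assert(sizeof(((struct sketch_info_v1 *)0)->laddr) == 16,
	      "v1 has room for an IPv6 address");

/* Same expression as "(skb->mark & ~mark_mask) ^ mark_value". */
static uint32_t sketch_apply_mark(uint32_t mark, uint32_t mask, uint32_t value)
{
	return (mark & ~mask) ^ value;
}
```

Since the value is applied with XOR, any bits set in mark_value outside
mark_mask toggle the corresponding bits of the existing mark rather than
being ignored.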
 include/linux/netfilter/xt_TPROXY.h |   15 +-
 net/netfilter/xt_TPROXY.c           |  262 ++++++++++++++++++++++++++++++-----
 2 files changed, 236 insertions(+), 41 deletions(-)

diff --git a/include/linux/netfilter/xt_TPROXY.h b/include/linux/netfilter/xt_TPROXY.h
index 152e8f9..7b4e06d 100644
--- a/include/linux/netfilter/xt_TPROXY.h
+++ b/include/linux/netfilter/xt_TPROXY.h
@@ -1,14 +1,21 @@
-#ifndef _XT_TPROXY_H_target
-#define _XT_TPROXY_H_target
+#ifndef _XT_TPROXY_H
+#define _XT_TPROXY_H
 
 /* TPROXY target is capable of marking the packet to perform
  * redirection. We can get rid of that whenever we get support for
  * mutliple targets in the same rule. */
-struct xt_tproxy_target_info {
+struct xt_tproxy_target_info_v0 {
 	u_int32_t mark_mask;
 	u_int32_t mark_value;
 	__be32 laddr;
 	__be16 lport;
 };
 
-#endif /* _XT_TPROXY_H_target */
+struct xt_tproxy_target_info_v1 {
+	u_int32_t mark_mask;
+	u_int32_t mark_value;
+	union nf_inet_addr laddr;
+	__be16 lport;
+};
+
+#endif /* _XT_TPROXY_H */
diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 67cbed8..6ce76d6 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -1,7 +1,7 @@
 /*
  * Transparent proxy support for Linux/iptables
  *
- * Copyright (c) 2006-2007 BalaBit IT Ltd.
+ * Copyright (c) 2006-2010 BalaBit IT Ltd.
  * Author: Balazs Scheidler, Krisztian Kovacs
  *
  * This program is free software; you can redistribute it and/or modify
@@ -19,15 +19,18 @@
 
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter_ipv4/ip_tables.h>
+#include <linux/netfilter_ipv6/ip6_tables.h>
 #include <linux/netfilter/xt_TPROXY.h>
 
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
+#include <net/netfilter/ipv6/nf_defrag_ipv6.h>
 #include <net/netfilter/nf_tproxy_core.h>
 
 /**
- * tproxy_handle_time_wait() - handle TCP TIME_WAIT reopen redirections
+ * tproxy_handle_time_wait4() - handle IPv4 TCP TIME_WAIT reopen redirections
  * @skb:	The skb being processed.
- * @par:	Iptables target parameters.
+ * @laddr:	IPv4 address to redirect to or zero.
+ * @lport:	TCP port to redirect to or zero.
  * @sk:		The TIME_WAIT TCP socket found by the lookup.
  *
  * We have to handle SYN packets arriving to TIME_WAIT sockets
@@ -35,16 +38,16 @@
  * redirect the new connection to the proxy if there's a listener
  * socket present.
  *
- * tproxy_handle_time_wait() consumes the socket reference passed in.
+ * tproxy_handle_time_wait4() consumes the socket reference passed in.
  *
  * Returns the listener socket if there's one, the TIME_WAIT socket if
  * no such listener is found, or NULL if the TCP header is incomplete.
  */
 static struct sock *
-tproxy_handle_time_wait(struct sk_buff *skb, const struct xt_action_param *par, struct sock *sk)
+tproxy_handle_time_wait4(struct sk_buff *skb, __be32 laddr, __be16 lport,
+			struct sock *sk)
 {
 	const struct iphdr *iph = ip_hdr(skb);
-	const struct xt_tproxy_target_info *tgi = par->targinfo;
 	struct tcphdr _hdr, *hp;
 
 	hp = skb_header_pointer(skb, ip_hdrlen(skb), sizeof(_hdr), &_hdr);
@@ -59,13 +62,64 @@ tproxy_handle_time_wait(struct sk_buff *skb, const struct xt_action_param *par,
 		struct sock *sk2;
 
 		sk2 = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
-					    iph->saddr, tgi->laddr ? tgi->laddr : iph->daddr,
-					    hp->source, tgi->lport ? tgi->lport : hp->dest,
-					    par->in, NFT_LOOKUP_LISTENER);
+					    iph->saddr, laddr ? laddr : iph->daddr,
+					    hp->source, lport ? lport : hp->dest,
+					    skb->dev, NFT_LOOKUP_LISTENER);
+		if (sk2) {
+			inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
+			inet_twsk_put(inet_twsk(sk));
+			sk = sk2;
+		}
+	}
+
+	return sk;
+}
+
+/**
+ * tproxy_handle_time_wait6() - handle IPv6 TCP TIME_WAIT reopen redirections
+ * @skb:	The skb being processed.
+ * @tproto:	Transport protocol.
+ * @thoff:	Transport protocol header offset.
+ * @par:	Iptables target parameters.
+ * @sk:		The TIME_WAIT TCP socket found by the lookup.
+ *
+ * We have to handle SYN packets arriving to TIME_WAIT sockets
+ * differently: instead of reopening the connection we should rather
+ * redirect the new connection to the proxy if there's a listener
+ * socket present.
+ *
+ * tproxy_handle_time_wait6() consumes the socket reference passed in.
+ *
+ * Returns the listener socket if there's one, the TIME_WAIT socket if
+ * no such listener is found, or NULL if the TCP header is incomplete.
+ */
+static struct sock *
+tproxy_handle_time_wait6(struct sk_buff *skb, int tproto, int thoff,
+			 const struct xt_action_param *par,
+			 struct sock *sk)
+{
+	const struct ipv6hdr *iph = ipv6_hdr(skb);
+	struct tcphdr _hdr, *hp;
+	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
+
+	hp = skb_header_pointer(skb, thoff, sizeof(_hdr), &_hdr);
+	if (hp == NULL) {
+		inet_twsk_put(inet_twsk(sk));
+		return NULL;
+	}
+
+	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
+		/* SYN to a TIME_WAIT socket, we'd rather redirect it
+		 * to a listener socket if there's one */
+		struct sock *sk2;
+
+		sk2 = nf_tproxy_get_sock_v6(dev_net(skb->dev), tproto,
+					    &iph->saddr,
+					    !ipv6_addr_any(&tgi->laddr.in6) ? &tgi->laddr.in6 : &iph->daddr,
+					    hp->source,
+					    tgi->lport ? tgi->lport : hp->dest,
+					    skb->dev, NFT_LOOKUP_LISTENER);
 		if (sk2) {
-			/* yeah, there's one, let's kill the TIME_WAIT
-			 * socket and redirect to the listener
-			 */
 			inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
 			inet_twsk_put(inet_twsk(sk));
 			sk = sk2;
@@ -76,10 +130,10 @@ tproxy_handle_time_wait(struct sk_buff *skb, const struct xt_action_param *par,
 }
 
 static unsigned int
-tproxy_tg(struct sk_buff *skb, const struct xt_action_param *par)
+tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
+	   u_int32_t mark_mask, u_int32_t mark_value)
 {
 	const struct iphdr *iph = ip_hdr(skb);
-	const struct xt_tproxy_target_info *tgi = par->targinfo;
 	struct udphdr _hdr, *hp;
 	struct sock *sk;
 
@@ -87,39 +141,140 @@ tproxy_tg(struct sk_buff *skb, const struct xt_action_param *par)
 	if (hp == NULL)
 		return NF_DROP;
 
+	/* check if there's an ongoing connection on the packet
+	 * addresses, this happens if the redirect already happened
+	 * and the current packet belongs to an already established
+	 * connection */
 	sk = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
 				   iph->saddr, iph->daddr,
 				   hp->source, hp->dest,
-				   par->in, NFT_LOOKUP_ESTABLISHED);
+				   skb->dev, NFT_LOOKUP_ESTABLISHED);
 
 	/* UDP has no TCP_TIME_WAIT state, so we never enter here */
 	if (sk && sk->sk_state == TCP_TIME_WAIT)
-		sk = tproxy_handle_time_wait(skb, par, sk);
+		/* reopening a TIME_WAIT connection needs special handling */
+		sk = tproxy_handle_time_wait4(skb, laddr, lport, sk);
 	else if (!sk)
+		/* no, there's no established connection, check if
+		 * there's a listener on the redirected addr/port */
 		sk = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
-					   iph->saddr, tgi->laddr ? tgi->laddr : iph->daddr,
-					   hp->source, tgi->lport ? tgi->lport : hp->dest,
-					   par->in, NFT_LOOKUP_LISTENER);
+					   iph->saddr, laddr ? laddr : iph->daddr,
+					   hp->source, lport ? lport : hp->dest,
+					   skb->dev, NFT_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
 	if (sk && nf_tproxy_assign_sock(skb, sk)) {
 		/* This should be in a separate target, but we don't do multiple
 		   targets on the same rule yet */
-		skb->mark = (skb->mark & ~tgi->mark_mask) ^ tgi->mark_value;
+		skb->mark = (skb->mark & ~mark_mask) ^ mark_value;
 
-		pr_debug("redirecting: proto %u %08x:%u -> %08x:%u, mark: %x\n",
-			 iph->protocol, ntohl(iph->daddr), ntohs(hp->dest),
-			 ntohl(tgi->laddr), ntohs(tgi->lport), skb->mark);
+		pr_debug("redirecting: proto %u %pI4:%u -> %pI4:%u, mark: %x\n",
+			 iph->protocol, &iph->daddr, ntohs(hp->dest),
+			 &laddr, ntohs(lport), skb->mark);
 		return NF_ACCEPT;
 	}
 
 	pr_debug("no socket, dropping: proto %u %08x:%u -> %08x:%u, mark: %x\n",
 		 iph->protocol, ntohl(iph->daddr), ntohs(hp->dest),
-		 ntohl(tgi->laddr), ntohs(tgi->lport), skb->mark);
+		 ntohl(laddr), ntohs(lport), skb->mark);
+	return NF_DROP;
+}
+
+static unsigned int
+tproxy_tg4_v0(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	const struct xt_tproxy_target_info_v0 *tgi = par->targinfo;
+
+	return tproxy_tg4(skb, tgi->laddr, tgi->lport, tgi->mark_mask, tgi->mark_value);
+}
+
+static unsigned int
+tproxy_tg4_v1(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
+
+	return tproxy_tg4(skb, tgi->laddr.ip, tgi->lport, tgi->mark_mask, tgi->mark_value);
+}
+
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+static unsigned int
+tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	const struct ipv6hdr *iph = ipv6_hdr(skb);
+	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
+	struct udphdr _hdr, *hp;
+	struct sock *sk;
+	int thoff;
+	int tproto;
+
+	tproto = ipv6_find_hdr(skb, &thoff, -1, NULL);
+	if (tproto < 0) {
+		pr_debug("unable to find transport header in IPv6 packet, dropping\n");
+		return NF_DROP;
+	}
+
+	hp = skb_header_pointer(skb, thoff, sizeof(_hdr), &_hdr);
+	if (hp == NULL) {
+		pr_debug("unable to grab transport header contents in IPv6 packet, dropping\n");
+		return NF_DROP;
+	}
+
+	/* check if there's an ongoing connection on the packet
+	 * addresses, this happens if the redirect already happened
+	 * and the current packet belongs to an already established
+	 * connection */
+	sk = nf_tproxy_get_sock_v6(dev_net(skb->dev), tproto,
+				   &iph->saddr, &iph->daddr,
+				   hp->source, hp->dest,
+				   par->in, NFT_LOOKUP_ESTABLISHED);
+
+	/* UDP has no TCP_TIME_WAIT state, so we never enter here */
+	if (sk && sk->sk_state == TCP_TIME_WAIT)
+		/* reopening a TIME_WAIT connection needs special handling */
+		sk = tproxy_handle_time_wait6(skb, tproto, thoff, par, sk);
+	else if (!sk)
+		/* no there's no established connection, check if
+		 * there's a listener on the redirected addr/port */
+		sk = nf_tproxy_get_sock_v6(dev_net(skb->dev), tproto,
+					   &iph->saddr,
+					   !ipv6_addr_any(&tgi->laddr.in6) ? &tgi->laddr.in6 : &iph->daddr,
+					   hp->source,
+					   tgi->lport ? tgi->lport : hp->dest,
+					   par->in, NFT_LOOKUP_LISTENER);
+
+	/* NOTE: assign_sock consumes our sk reference */
+	if (sk && nf_tproxy_assign_sock(skb, sk)) {
+		/* This should be in a separate target, but we don't do multiple
+		   targets on the same rule yet */
+		skb->mark = (skb->mark & ~tgi->mark_mask) ^ tgi->mark_value;
+
+		pr_debug("redirecting: proto %u %pI6:%u -> %pI6:%u, mark: %x\n",
+			 tproto, &iph->saddr, ntohs(hp->dest),
+			 &tgi->laddr.in6, ntohs(tgi->lport), skb->mark);
+		return NF_ACCEPT;
+	}
+
+	pr_debug("no socket, dropping: proto %u %pI6:%u -> %pI6:%u, mark: %x\n",
+		 tproto, &iph->saddr, ntohs(hp->dest),
+		 &tgi->laddr.in6, ntohs(tgi->lport), skb->mark);
 	return NF_DROP;
 }
 
-static int tproxy_tg_check(const struct xt_tgchk_param *par)
+static int tproxy_tg6_check(const struct xt_tgchk_param *par)
+{
+	const struct ip6t_ip6 *i = par->entryinfo;
+
+	if ((i->proto == IPPROTO_TCP || i->proto == IPPROTO_UDP)
+	    && !(i->flags & IP6T_INV_PROTO))
+		return 0;
+
+	pr_info("Can be used only in combination with "
+		"either -p tcp or -p udp\n");
+	return -EINVAL;
+}
+#endif
+
+static int tproxy_tg4_check(const struct xt_tgchk_param *par)
 {
 	const struct ipt_ip *i = par->entryinfo;
 
@@ -132,31 +287,64 @@ static int tproxy_tg_check(const struct xt_tgchk_param *par)
 	return -EINVAL;
 }
 
-static struct xt_target tproxy_tg_reg __read_mostly = {
-	.name		= "TPROXY",
-	.family		= AF_INET,
-	.table		= "mangle",
-	.target		= tproxy_tg,
-	.targetsize	= sizeof(struct xt_tproxy_target_info),
-	.checkentry	= tproxy_tg_check,
-	.hooks		= 1 << NF_INET_PRE_ROUTING,
-	.me		= THIS_MODULE,
+static struct xt_target tproxy_tg_reg[] __read_mostly = {
+	{
+		.name		= "TPROXY",
+		.family		= NFPROTO_IPV4,
+		.table		= "mangle",
+		.target		= tproxy_tg4_v0,
+		.revision	= 0,
+		.targetsize	= sizeof(struct xt_tproxy_target_info_v0),
+		.checkentry	= tproxy_tg4_check,
+		.hooks		= 1 << NF_INET_PRE_ROUTING,
+		.me		= THIS_MODULE,
+	},
+	{
+		.name		= "TPROXY",
+		.family		= NFPROTO_IPV4,
+		.table		= "mangle",
+		.target		= tproxy_tg4_v1,
+		.revision	= 1,
+		.targetsize	= sizeof(struct xt_tproxy_target_info_v1),
+		.checkentry	= tproxy_tg4_check,
+		.hooks		= 1 << NF_INET_PRE_ROUTING,
+		.me		= THIS_MODULE,
+	},
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	{
+		.name		= "TPROXY",
+		.family		= NFPROTO_IPV6,
+		.table		= "mangle",
+		.target		= tproxy_tg6_v1,
+		.revision	= 1,
+		.targetsize	= sizeof(struct xt_tproxy_target_info_v1),
+		.checkentry	= tproxy_tg6_check,
+		.hooks		= 1 << NF_INET_PRE_ROUTING,
+		.me		= THIS_MODULE,
+	},
+#endif
+
 };
 
 static int __init tproxy_tg_init(void)
 {
 	nf_defrag_ipv4_enable();
-	return xt_register_target(&tproxy_tg_reg);
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	nf_defrag_ipv6_enable();
+#endif
+
+	return xt_register_targets(tproxy_tg_reg, ARRAY_SIZE(tproxy_tg_reg));
 }
 
 static void __exit tproxy_tg_exit(void)
 {
-	xt_unregister_target(&tproxy_tg_reg);
+	xt_unregister_targets(tproxy_tg_reg, ARRAY_SIZE(tproxy_tg_reg));
 }
 
 module_init(tproxy_tg_init);
 module_exit(tproxy_tg_exit);
 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Krisztian Kovacs");
+MODULE_AUTHOR("Balazs Scheidler, Krisztian Kovacs");
 MODULE_DESCRIPTION("Netfilter transparent proxy (TPROXY) target module.");
 MODULE_ALIAS("ipt_TPROXY");
+MODULE_ALIAS("ip6t_TPROXY");




* [PATCH 9/9] tproxy: use the interface primary IP address as a default value for --on-ip
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

The REDIRECT target and the older TProxy versions used the primary address
of the incoming interface as the default value of the --on-ip parameter.
This was unintentionally changed during the initial TProxy submission and
caused confusion among users.

Since IPv6 has no notion of a primary address, we just select the first
address on the list. This way the socket lookup finds wildcard-bound sockets
properly; we cannot really do better without the user telling us the
IPv6 address of the proxy.

This is implemented for both IPv4 and IPv6.
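
The fallback order described above can be sketched in plain userspace C.
This is only an illustration, not the kernel code: struct iface_addrs and
pick_redirect_addr() are hypothetical names, and the address list is a simple
array rather than the in_device/inet6_dev lists that the patch walks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an interface's address list; not a kernel type. */
struct iface_addrs {
	const uint32_t *addrs;	/* addresses in list order */
	size_t count;
};

/*
 * Pick the address to redirect to:
 *   1. the user-supplied --on-ip value, if non-zero;
 *   2. otherwise the first address configured on the incoming interface;
 *   3. otherwise the packet's original destination address.
 */
static uint32_t pick_redirect_addr(uint32_t user_laddr,
				   const struct iface_addrs *dev,
				   uint32_t daddr)
{
	/* A non-empty list must come with a valid array. */
	assert(dev == NULL || dev->count == 0 || dev->addrs != NULL);

	if (user_laddr)
		return user_laddr;

	if (dev && dev->count > 0 && dev->addrs[0])
		return dev->addrs[0];

	return daddr;
}
```

With an explicit address the interface is never consulted; with none, the
first interface address wins, and the packet's own destination is used only
when the interface has no address at all.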

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 net/netfilter/xt_TPROXY.c |  198 +++++++++++++++++++++++++++++----------------
 1 files changed, 128 insertions(+), 70 deletions(-)

diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 6ce76d6..a90ff20 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -16,15 +16,41 @@
 #include <net/checksum.h>
 #include <net/udp.h>
 #include <net/inet_sock.h>
-
+#include <linux/inetdevice.h>
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter_ipv4/ip_tables.h>
-#include <linux/netfilter_ipv6/ip6_tables.h>
-#include <linux/netfilter/xt_TPROXY.h>
 
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+#include <net/if_inet6.h>
+#include <net/addrconf.h>
+#include <linux/netfilter_ipv6/ip6_tables.h>
 #include <net/netfilter/ipv6/nf_defrag_ipv6.h>
+#endif
+
 #include <net/netfilter/nf_tproxy_core.h>
+#include <linux/netfilter/xt_TPROXY.h>
+
+static inline __be32
+tproxy_laddr4(struct sk_buff *skb, __be32 user_laddr, __be32 daddr)
+{
+	struct in_device *indev;
+	__be32 laddr;
+
+	if (user_laddr)
+		return user_laddr;
+
+	laddr = 0;
+	rcu_read_lock();
+	indev = __in_dev_get_rcu(skb->dev);
+	for_primary_ifa(indev) {
+		laddr = ifa->ifa_local;
+		break;
+	} endfor_ifa(indev);
+	rcu_read_unlock();
+
+	return laddr ? laddr : daddr;
+}
 
 /**
  * tproxy_handle_time_wait4() - handle IPv4 TCP TIME_WAIT reopen redirections
@@ -75,60 +101,6 @@ tproxy_handle_time_wait4(struct sk_buff *skb, __be32 laddr, __be16 lport,
 	return sk;
 }
 
-/**
- * tproxy_handle_time_wait6() - handle IPv6 TCP TIME_WAIT reopen redirections
- * @skb:	The skb being processed.
- * @tproto:	Transport protocol.
- * @thoff:	Transport protocol header offset.
- * @par:	Iptables target parameters.
- * @sk:		The TIME_WAIT TCP socket found by the lookup.
- *
- * We have to handle SYN packets arriving to TIME_WAIT sockets
- * differently: instead of reopening the connection we should rather
- * redirect the new connection to the proxy if there's a listener
- * socket present.
- *
- * tproxy_handle_time_wait6() consumes the socket reference passed in.
- *
- * Returns the listener socket if there's one, the TIME_WAIT socket if
- * no such listener is found, or NULL if the TCP header is incomplete.
- */
-static struct sock *
-tproxy_handle_time_wait6(struct sk_buff *skb, int tproto, int thoff,
-			 const struct xt_action_param *par,
-			 struct sock *sk)
-{
-	const struct ipv6hdr *iph = ipv6_hdr(skb);
-	struct tcphdr _hdr, *hp;
-	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
-
-	hp = skb_header_pointer(skb, thoff, sizeof(_hdr), &_hdr);
-	if (hp == NULL) {
-		inet_twsk_put(inet_twsk(sk));
-		return NULL;
-	}
-
-	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
-		/* SYN to a TIME_WAIT socket, we'd rather redirect it
-		 * to a listener socket if there's one */
-		struct sock *sk2;
-
-		sk2 = nf_tproxy_get_sock_v6(dev_net(skb->dev), tproto,
-					    &iph->saddr,
-					    !ipv6_addr_any(&tgi->laddr.in6) ? &tgi->laddr.in6 : &iph->daddr,
-					    hp->source,
-					    tgi->lport ? tgi->lport : hp->dest,
-					    skb->dev, NFT_LOOKUP_LISTENER);
-		if (sk2) {
-			inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
-			inet_twsk_put(inet_twsk(sk));
-			sk = sk2;
-		}
-	}
-
-	return sk;
-}
-
 static unsigned int
 tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
 	   u_int32_t mark_mask, u_int32_t mark_value)
@@ -150,6 +122,10 @@ tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
 				   hp->source, hp->dest,
 				   skb->dev, NFT_LOOKUP_ESTABLISHED);
 
+	laddr = tproxy_laddr4(skb, laddr, iph->daddr);
+	if (!lport)
+		lport = hp->dest;
+
 	/* UDP has no TCP_TIME_WAIT state, so we never enter here */
 	if (sk && sk->sk_state == TCP_TIME_WAIT)
 		/* reopening a TIME_WAIT connection needs special handling */
@@ -158,8 +134,8 @@ tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
 		/* no, there's no established connection, check if
 		 * there's a listener on the redirected addr/port */
 		sk = nf_tproxy_get_sock_v4(dev_net(skb->dev), iph->protocol,
-					   iph->saddr, laddr ? laddr : iph->daddr,
-					   hp->source, lport ? lport : hp->dest,
+					   iph->saddr, laddr,
+					   hp->source, lport,
 					   skb->dev, NFT_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
@@ -174,9 +150,9 @@ tproxy_tg4(struct sk_buff *skb, __be32 laddr, __be16 lport,
 		return NF_ACCEPT;
 	}
 
-	pr_debug("no socket, dropping: proto %u %08x:%u -> %08x:%u, mark: %x\n",
-		 iph->protocol, ntohl(iph->daddr), ntohs(hp->dest),
-		 ntohl(laddr), ntohs(lport), skb->mark);
+	pr_debug("no socket, dropping: proto %u %pI4:%u -> %pI4:%u, mark: %x\n",
+		 iph->protocol, &iph->saddr, ntohs(hp->source),
+		 &iph->daddr, ntohs(hp->dest), skb->mark);
 	return NF_DROP;
 }
 
@@ -197,6 +173,85 @@ tproxy_tg4_v1(struct sk_buff *skb, const struct xt_action_param *par)
 }
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+
+static inline const struct in6_addr *
+tproxy_laddr6(struct sk_buff *skb, const struct in6_addr *user_laddr, const struct in6_addr *daddr)
+{
+	struct inet6_dev *indev;
+	struct inet6_ifaddr *ifa;
+	struct in6_addr *laddr;
+
+	if (!ipv6_addr_any(user_laddr))
+		return user_laddr;
+	laddr = NULL;
+
+	rcu_read_lock();
+	indev = __in6_dev_get(skb->dev);
+	if (indev)
+		list_for_each_entry(ifa, &indev->addr_list, if_list) {
+			/* FIXME: address selection */
+			laddr = &ifa->addr;
+			break;
+		}
+	rcu_read_unlock();
+
+	return laddr ? laddr : daddr;
+}
+
+/**
+ * tproxy_handle_time_wait6() - handle IPv6 TCP TIME_WAIT reopen redirections
+ * @skb:	The skb being processed.
+ * @tproto:	Transport protocol.
+ * @thoff:	Transport protocol header offset.
+ * @par:	Iptables target parameters.
+ * @sk:		The TIME_WAIT TCP socket found by the lookup.
+ *
+ * We have to handle SYN packets arriving to TIME_WAIT sockets
+ * differently: instead of reopening the connection we should rather
+ * redirect the new connection to the proxy if there's a listener
+ * socket present.
+ *
+ * tproxy_handle_time_wait6() consumes the socket reference passed in.
+ *
+ * Returns the listener socket if there's one, the TIME_WAIT socket if
+ * no such listener is found, or NULL if the TCP header is incomplete.
+ */
+static struct sock *
+tproxy_handle_time_wait6(struct sk_buff *skb, int tproto, int thoff,
+			 const struct xt_action_param *par,
+			 struct sock *sk)
+{
+	const struct ipv6hdr *iph = ipv6_hdr(skb);
+	struct tcphdr _hdr, *hp;
+	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
+
+	hp = skb_header_pointer(skb, thoff, sizeof(_hdr), &_hdr);
+	if (hp == NULL) {
+		inet_twsk_put(inet_twsk(sk));
+		return NULL;
+	}
+
+	if (hp->syn && !hp->rst && !hp->ack && !hp->fin) {
+		/* SYN to a TIME_WAIT socket, we'd rather redirect it
+		 * to a listener socket if there's one */
+		struct sock *sk2;
+
+		sk2 = nf_tproxy_get_sock_v6(dev_net(skb->dev), tproto,
+					    &iph->saddr,
+					    tproxy_laddr6(skb, &tgi->laddr.in6, &iph->daddr),
+					    hp->source,
+					    tgi->lport ? tgi->lport : hp->dest,
+					    skb->dev, NFT_LOOKUP_LISTENER);
+		if (sk2) {
+			inet_twsk_deschedule(inet_twsk(sk), &tcp_death_row);
+			inet_twsk_put(inet_twsk(sk));
+			sk = sk2;
+		}
+	}
+
+	return sk;
+}
+
 static unsigned int
 tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 {
@@ -204,6 +259,8 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 	const struct xt_tproxy_target_info_v1 *tgi = par->targinfo;
 	struct udphdr _hdr, *hp;
 	struct sock *sk;
+	const struct in6_addr *laddr;
+	__be16 lport;
 	int thoff;
 	int tproto;
 
@@ -228,6 +285,9 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 				   hp->source, hp->dest,
 				   par->in, NFT_LOOKUP_ESTABLISHED);
 
+	laddr = tproxy_laddr6(skb, &tgi->laddr.in6, &iph->daddr);
+	lport = tgi->lport ? tgi->lport : hp->dest;
+
 	/* UDP has no TCP_TIME_WAIT state, so we never enter here */
 	if (sk && sk->sk_state == TCP_TIME_WAIT)
 		/* reopening a TIME_WAIT connection needs special handling */
@@ -236,10 +296,8 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 		/* no there's no established connection, check if
 		 * there's a listener on the redirected addr/port */
 		sk = nf_tproxy_get_sock_v6(dev_net(skb->dev), tproto,
-					   &iph->saddr,
-					   !ipv6_addr_any(&tgi->laddr.in6) ? &tgi->laddr.in6 : &iph->daddr,
-					   hp->source,
-					   tgi->lport ? tgi->lport : hp->dest,
+					   &iph->saddr, laddr,
+					   hp->source, lport,
 					   par->in, NFT_LOOKUP_LISTENER);
 
 	/* NOTE: assign_sock consumes our sk reference */
@@ -249,14 +307,14 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 		skb->mark = (skb->mark & ~tgi->mark_mask) ^ tgi->mark_value;
 
 		pr_debug("redirecting: proto %u %pI6:%u -> %pI6:%u, mark: %x\n",
-			 tproto, &iph->saddr, ntohs(hp->dest),
-			 &tgi->laddr.in6, ntohs(tgi->lport), skb->mark);
+			 tproto, &iph->saddr, ntohs(hp->source),
+			 laddr, ntohs(lport), skb->mark);
 		return NF_ACCEPT;
 	}
 
 	pr_debug("no socket, dropping: proto %u %pI6:%u -> %pI6:%u, mark: %x\n",
-		 tproto, &iph->saddr, ntohs(hp->dest),
-		 &tgi->laddr.in6, ntohs(tgi->lport), skb->mark);
+		 tproto, &iph->saddr, ntohs(hp->source),
+		 &iph->daddr, ntohs(hp->dest), skb->mark);
 	return NF_DROP;
 }
 




* [PATCH 8/9] tproxy: added IPv6 support to the socket match
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

The ICMP extraction bits were contributed by Harry Mason.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 net/netfilter/xt_socket.c |  165 ++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 266faa0..1dc2784 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -14,6 +14,7 @@
 #include <linux/skbuff.h>
 #include <linux/netfilter/x_tables.h>
 #include <linux/netfilter_ipv4/ip_tables.h>
+#include <linux/netfilter_ipv6/ip6_tables.h>
 #include <net/tcp.h>
 #include <net/udp.h>
 #include <net/icmp.h>
@@ -21,6 +22,7 @@
 #include <net/inet_sock.h>
 #include <net/netfilter/nf_tproxy_core.h>
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
+#include <net/netfilter/ipv6/nf_defrag_ipv6.h>
 
 #include <linux/netfilter/xt_socket.h>
 
@@ -30,7 +32,7 @@
 #endif
 
 static int
-extract_icmp_fields(const struct sk_buff *skb,
+extract_icmp4_fields(const struct sk_buff *skb,
 		    u8 *protocol,
 		    __be32 *raddr,
 		    __be32 *laddr,
@@ -86,7 +88,6 @@ extract_icmp_fields(const struct sk_buff *skb,
 	return 0;
 }
 
-
 static bool
 socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 	     const struct xt_socket_mtinfo1 *info)
@@ -115,7 +116,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 		dport = hp->dest;
 
 	} else if (iph->protocol == IPPROTO_ICMP) {
-		if (extract_icmp_fields(skb, &protocol, &saddr, &daddr,
+		if (extract_icmp4_fields(skb, &protocol, &saddr, &daddr,
 					&sport, &dport))
 			return false;
 	} else {
@@ -165,32 +166,157 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 			sk = NULL;
 	}
 
-	pr_debug("proto %u %08x:%u -> %08x:%u (orig %08x:%u) sock %p\n",
-		 protocol, ntohl(saddr), ntohs(sport),
-		 ntohl(daddr), ntohs(dport),
-		 ntohl(iph->daddr), hp ? ntohs(hp->dest) : 0, sk);
+	pr_debug("proto %u %pI4:%u -> %pI4:%u (orig %pI4:%u) sock %p\n",
+		 protocol, &saddr, ntohs(sport),
+		 &daddr, ntohs(dport),
+		 &iph->daddr, hp ? ntohs(hp->dest) : 0, sk);
 
 	return (sk != NULL);
 }
 
 static bool
-socket_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
+socket_mt4_v0(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	return socket_match(skb, par, NULL);
 }
 
 static bool
-socket_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
+socket_mt4_v1(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	return socket_match(skb, par, par->matchinfo);
 }
 
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+
+static int
+extract_icmp6_fields(const struct sk_buff *skb,
+		     unsigned int outside_hdrlen,
+		     u8 *protocol,
+		     struct in6_addr **raddr,
+		     struct in6_addr **laddr,
+		     __be16 *rport,
+		     __be16 *lport)
+{
+	struct ipv6hdr *inside_iph, _inside_iph;
+	struct icmp6hdr *icmph, _icmph;
+	__be16 *ports, _ports[2];
+	u8 inside_nexthdr;
+	int inside_hdrlen;
+
+	icmph = skb_header_pointer(skb, outside_hdrlen,
+				   sizeof(_icmph), &_icmph);
+	if (icmph == NULL)
+		return 1;
+
+	if (icmph->icmp6_type & ICMPV6_INFOMSG_MASK)
+		return 1;
+
+	inside_iph = skb_header_pointer(skb, outside_hdrlen + sizeof(_icmph), sizeof(_inside_iph), &_inside_iph);
+	if (inside_iph == NULL)
+		return 1;
+	inside_nexthdr = inside_iph->nexthdr;
+
+	inside_hdrlen = ipv6_skip_exthdr(skb, outside_hdrlen + sizeof(_icmph) + sizeof(_inside_iph), &inside_nexthdr);
+	if (inside_hdrlen < 0)
+		return 1; /* hjm: Packet has no/incomplete transport layer headers. */
+
+	if (inside_nexthdr != IPPROTO_TCP &&
+	    inside_nexthdr != IPPROTO_UDP)
+		return 1;
+
+	ports = skb_header_pointer(skb, inside_hdrlen,
+				   sizeof(_ports), &_ports);
+	if (ports == NULL)
+		return 1;
+
+	/* the inside IP packet is the one quoted from our side, thus
+	 * its saddr is the local address */
+	*protocol = inside_nexthdr;
+	*laddr = &inside_iph->saddr;
+	*lport = ports[0];
+	*raddr = &inside_iph->daddr;
+	*rport = ports[1];
+
+	return 0;
+}
+
+static bool
+socket_mt6_v1(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	struct ipv6hdr *iph = ipv6_hdr(skb);
+	struct udphdr _hdr, *hp = NULL;
+	struct sock *sk;
+	struct in6_addr *daddr, *saddr;
+	__be16 dport, sport;
+	int thoff;
+	u8 tproto;
+	const struct xt_socket_mtinfo1 *info = (struct xt_socket_mtinfo1 *) par->matchinfo;
+
+	tproto = ipv6_find_hdr(skb, &thoff, -1, NULL);
+	if (tproto < 0) {
+		pr_debug("unable to find transport header in IPv6 packet, dropping\n");
+		return false;
+	}
+
+	if (tproto == IPPROTO_UDP || tproto == IPPROTO_TCP) {
+		hp = skb_header_pointer(skb, thoff,
+					sizeof(_hdr), &_hdr);
+		if (hp == NULL)
+			return false;
+
+		saddr = &iph->saddr;
+		sport = hp->source;
+		daddr = &iph->daddr;
+		dport = hp->dest;
+
+	} else if (tproto == IPPROTO_ICMPV6) {
+		if (extract_icmp6_fields(skb, thoff, &tproto, &saddr, &daddr,
+					 &sport, &dport))
+			return false;
+	} else {
+		return false;
+	}
+
+	sk = nf_tproxy_get_sock_v6(dev_net(skb->dev), tproto,
+				   saddr, daddr, sport, dport, par->in, NFT_LOOKUP_ANY);
+	if (sk != NULL) {
+		bool wildcard;
+		bool transparent = true;
+
+		/* Ignore sockets listening on INADDR_ANY */
+		wildcard = (sk->sk_state != TCP_TIME_WAIT &&
+			    ipv6_addr_any(&inet6_sk(sk)->rcv_saddr));
+
+		/* Ignore non-transparent sockets,
+		   if XT_SOCKET_TRANSPARENT is used */
+		if (info && info->flags & XT_SOCKET_TRANSPARENT)
+			transparent = ((sk->sk_state != TCP_TIME_WAIT &&
+					inet_sk(sk)->transparent) ||
+				       (sk->sk_state == TCP_TIME_WAIT &&
+					inet_twsk(sk)->tw_transparent));
+
+		nf_tproxy_put_sock(sk);
+
+		if (wildcard || !transparent)
+			sk = NULL;
+	}
+
+	pr_debug("proto %u %pI6:%u -> %pI6:%u "
+		 "(orig %pI6:%u) sock %p\n",
+		 tproto, saddr, ntohs(sport),
+		 daddr, ntohs(dport),
+		 &iph->daddr, hp ? ntohs(hp->dest) : 0, sk);
+
+	return (sk != NULL);
+}
+#endif
+
 static struct xt_match socket_mt_reg[] __read_mostly = {
 	{
 		.name		= "socket",
 		.revision	= 0,
 		.family		= NFPROTO_IPV4,
-		.match		= socket_mt_v0,
+		.match		= socket_mt4_v0,
 		.hooks		= (1 << NF_INET_PRE_ROUTING) |
 				  (1 << NF_INET_LOCAL_IN),
 		.me		= THIS_MODULE,
@@ -199,17 +325,33 @@ static struct xt_match socket_mt_reg[] __read_mostly = {
 		.name		= "socket",
 		.revision	= 1,
 		.family		= NFPROTO_IPV4,
-		.match		= socket_mt_v1,
+		.match		= socket_mt4_v1,
 		.matchsize	= sizeof(struct xt_socket_mtinfo1),
 		.hooks		= (1 << NF_INET_PRE_ROUTING) |
 				  (1 << NF_INET_LOCAL_IN),
 		.me		= THIS_MODULE,
 	},
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	{
+		.name		= "socket",
+		.revision	= 1,
+		.family		= NFPROTO_IPV6,
+		.match		= socket_mt6_v1,
+		.matchsize	= sizeof(struct xt_socket_mtinfo1),
+		.hooks		= (1 << NF_INET_PRE_ROUTING) |
+				  (1 << NF_INET_LOCAL_IN),
+		.me		= THIS_MODULE,
+	},
+#endif
 };
 
 static int __init socket_mt_init(void)
 {
 	nf_defrag_ipv4_enable();
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	nf_defrag_ipv6_enable();
+#endif
+
 	return xt_register_matches(socket_mt_reg, ARRAY_SIZE(socket_mt_reg));
 }
 
@@ -225,3 +367,4 @@ MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Krisztian Kovacs, Balazs Scheidler");
 MODULE_DESCRIPTION("x_tables socket match module");
 MODULE_ALIAS("ipt_socket");
+MODULE_ALIAS("ip6t_socket");




* [PATCH 3/9] tproxy: added udp6_lib_lookup function
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

Just like with IPv4, we need access to the UDP hash table to look up local
sockets, but instead of exporting the global udp_table, export a lookup
function.
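
As a rough userspace illustration of that design choice (all names here are
hypothetical, not kernel symbols): the table stays file-local, an internal
helper takes the table as a parameter, and only a thin wrapper bound to the
private table is visible to other code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct entry {
	uint16_t port;
	const char *owner;
};

/* File-local table: other translation units cannot reach it directly. */
static const struct entry port_table[] = {
	{ 53, "resolver" },
	{ 8080, "proxy" },
};

/* Internal helper that takes the table explicitly. */
static const struct entry *lookup_in(const struct entry *tbl, size_t n,
				     uint16_t port)
{
	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (tbl[i].port == port)
			return &tbl[i];
	return NULL;
}

/* The only symbol callers need: binds the helper to the private table. */
const struct entry *port_lookup(uint16_t port)
{
	return lookup_in(port_table,
			 sizeof(port_table) / sizeof(port_table[0]), port);
}
```

Callers can only ask questions of the table through port_lookup(); its layout
and locking can change later without touching them.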

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 include/net/udp.h |    3 +++
 net/ipv6/udp.c    |    8 ++++++++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index a184d34..200b828 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -183,6 +183,9 @@ extern int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 extern struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
 				    __be32 daddr, __be16 dport,
 				    int dif);
+extern struct sock *udp6_lib_lookup(struct net *net, const struct in6_addr *saddr, __be16 sport,
+				    const struct in6_addr *daddr, __be16 dport,
+				    int dif);
 
 /*
  * 	SNMP statistics for UDP and UDP-Lite
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 33e3683..c84dad4 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -320,6 +320,14 @@ static struct sock *__udp6_lib_lookup_skb(struct sk_buff *skb,
 				 udptable);
 }
 
+struct sock *udp6_lib_lookup(struct net *net, const struct in6_addr *saddr, __be16 sport,
+			     const struct in6_addr *daddr, __be16 dport, int dif)
+{
+	return __udp6_lib_lookup(net, saddr, sport, daddr, dport, dif, &udp_table);
+}
+EXPORT_SYMBOL_GPL(udp6_lib_lookup);
+
+
 /*
  * 	This should be easy, if there is something there we
  * 	return it, otherwise we block.




* [PATCH 2/9] tproxy: added const specifiers to udp lookup functions
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller
In-Reply-To: <20101020112118.6260.31618.stgit@este.odu>

From: Balazs Scheidler <bazsi@balabit.hu>

The parameters of various UDP lookup functions were non-const, even though
they could be const. TProxy holds some const references, and instead of
casting the const away, I added const specifiers along the call path.
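
A small userspace sketch of why this matters (struct addr128, addr_equal() and
matches_any() are hypothetical names, with struct addr128 standing in for
struct in6_addr): once the read-only function takes const pointers, a caller
that only holds a const reference can call it without any cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit address type standing in for struct in6_addr. */
struct addr128 {
	uint8_t b[16];
};

/* Read-only comparison: const parameters let callers pass const data. */
static int addr_equal(const struct addr128 *a, const struct addr128 *b)
{
	assert(a != NULL && b != NULL);
	return memcmp(a->b, b->b, sizeof(a->b)) == 0;
}

/* This caller holds only const references and needs no cast. */
static int matches_any(const struct addr128 *needle,
		       const struct addr128 *list, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (addr_equal(needle, &list[i]))
			return 1;
	return 0;
}
```

Had addr_equal() taken non-const pointers, matches_any() would have needed
either a cast or non-const parameters of its own, which is the ripple effect
the const specifiers avoid.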

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
---
 net/ipv6/udp.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 5acb356..33e3683 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -122,8 +122,8 @@ static void udp_v6_rehash(struct sock *sk)
 
 static inline int compute_score(struct sock *sk, struct net *net,
 				unsigned short hnum,
-				struct in6_addr *saddr, __be16 sport,
-				struct in6_addr *daddr, __be16 dport,
+				const struct in6_addr *saddr, __be16 sport,
+				const struct in6_addr *daddr, __be16 dport,
 				int dif)
 {
 	int score = -1;
@@ -239,8 +239,8 @@ exact_match:
 }
 
 static struct sock *__udp6_lib_lookup(struct net *net,
-				      struct in6_addr *saddr, __be16 sport,
-				      struct in6_addr *daddr, __be16 dport,
+				      const struct in6_addr *saddr, __be16 sport,
+				      const struct in6_addr *daddr, __be16 dport,
 				      int dif, struct udp_table *udptable)
 {
 	struct sock *sk, *result;




* [PATCH 0/9] tproxy: add IPv6 support
From: KOVACS Krisztian @ 2010-10-20 11:21 UTC (permalink / raw)
  To: netdev, netfilter-devel; +Cc: Patrick McHardy, David Miller

The following series adds IPv6 support for tproxy. The parts touching
non-Netfilter code include exporting the UDP lookup function, adding the
sockopt infrastructure for getting the original destination address and
allowing non-local binds if the IP_TRANSPARENT socket option is set.

The Netfilter changes split the defragmentation code off from conntrack,
add IPv6 socket lookup helpers to the tproxy core module and update the
socket match and the TPROXY target.

The last patch in the series tries to make it easier to use the TPROXY target
by selecting a meaningful address to redirect to in case the user did not
explicitly specify it with '--on-ip'.

---

Balazs Scheidler (9):
      tproxy: split off ipv6 defragmentation to a separate module
      tproxy: added const specifiers to udp lookup functions
      tproxy: added udp6_lib_lookup function
      tproxy: added tproxy sockopt interface in the IPV6 layer
      tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
      tproxy: added IPv6 socket lookup function to nf_tproxy_core
      tproxy: added IPv6 support to the TPROXY target
      tproxy: added IPv6 support to the socket match
      tproxy: use the interface primary IP address as a default value for --on-ip


 include/linux/in6.h                            |    4 
 include/linux/ipv6.h                           |    4 
 include/linux/netfilter/xt_TPROXY.h            |   15 +
 include/net/netfilter/ipv6/nf_defrag_ipv6.h    |    6 
 include/net/netfilter/nf_tproxy_core.h         |   72 +++++
 include/net/udp.h                              |    3 
 net/ipv6/af_inet6.c                            |    2 
 net/ipv6/datagram.c                            |   19 +
 net/ipv6/ipv6_sockglue.c                       |   23 ++
 net/ipv6/netfilter/Makefile                    |    5 
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   78 ------
 net/ipv6/netfilter/nf_conntrack_reasm.c        |   12 +
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c      |  131 ++++++++++
 net/ipv6/udp.c                                 |   16 +
 net/netfilter/xt_TPROXY.c                      |  324 +++++++++++++++++++++---
 net/netfilter/xt_socket.c                      |  165 +++++++++++-
 16 files changed, 740 insertions(+), 139 deletions(-)
 create mode 100644 include/net/netfilter/ipv6/nf_defrag_ipv6.h
 create mode 100644 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c

-- 
KOVACS Krisztian


