* [PATCH] [IPVS] transparent proxying
@ 2006-11-29 6:21 Horms
2006-11-29 14:15 ` Thomas Graf
2006-11-29 15:26 ` Wensong Zhang
0 siblings, 2 replies; 10+ messages in thread
From: Horms @ 2006-11-29 6:21 UTC (permalink / raw)
To: netdev
Cc: David Miller, Julian Anastasov, Wensong Zhang, Joseph Mack NA3T,
Jinhua Luo
This seems to be a pretty clean solution to a real problem.
Ultimately I would like to see IPVS move into the forward chain.
This seems to be a nice way to explore that, without breaking
any existing setups.
--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/
[IPVS] transparent proxying
Patch from Jinhua Luo <home_king@163.com> to allow a web cluster to use
transparent proxying. It works by simply grabbing packets that have the
fwmark set and have not already been processed by ipvs (ip_vs_out) and
throwing them into ip_vs_in.
See: http://archive.linuxvirtualserver.org/html/lvs-users/2006-11/msg00261.html
Normally LVS packets are processed by ip_vs_in on the INPUT chain,
and packets that are processed in this way never show up on the FORWARD
chain, so they won't hit this rule.
This patch seems like a good precursor to moving LVS permanently to
the FORWARD chain, as I'm struggling to think of how it could break things.
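For reference, the kind of setup this enables might look roughly like the
sketch below. The mark value, scheduler and real-server addresses are made
up for illustration; the point is just that a fwmark virtual service can
now catch marked traffic on the FORWARD chain.

# mark web traffic being forwarded through the director
iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1

# fwmark virtual service; with this patch, marked packets on the
# FORWARD chain are handed to ip_vs_in() and scheduled as usual
ipvsadm -A -f 1 -s rr
ipvsadm -a -f 1 -r 192.168.10.1:80 -m
ipvsadm -a -f 1 -r 192.168.10.2:80 -m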
The changes to the original patch are:
* Reformatted to use tabs for indentation (instead of 4 spaces)
* Reformatted to be < 80 columns wide
* Added some comments
* Rewrote description (this text)
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jinhua Luo <home_king@163.com>
Index: linux-2.6/net/ipv4/ipvs/ip_vs_core.c
===================================================================
--- linux-2.6.orig/net/ipv4/ipvs/ip_vs_core.c 2006-11-28 15:30:00.000000000 +0900
+++ linux-2.6/net/ipv4/ipvs/ip_vs_core.c 2006-11-29 10:27:49.000000000 +0900
@@ -23,7 +23,9 @@
* Changes:
* Paul `Rusty' Russell properly handle non-linear skbs
* Harald Welte don't use nfcache
- *
+ * Jinhua Luo redirect packets with fwmark on
+ * NF_IP_FORWARD chain to ip_vs_in(),
+ * mainly for transparent cache cluster
*/
#include <linux/module.h>
@@ -1070,6 +1072,26 @@
return ip_vs_in_icmp(pskb, &r, hooknum);
}
+/*
+ * This is hooked into the NF_IP_FORWARD chain. It catches
+ * packets that have not already been handled by ipvs (out)
+ * and have a fwmark set. This is to allow transparent proxying
+ * of fwmark virtual services.
+ *
+ * It will not process packets that are handled by ipvs (in),
+ * as they never traverse the NF_IP_FORWARD chain.
+ */
+static unsigned int
+ip_vs_forward_with_fwmark(unsigned int hooknum, struct sk_buff **pskb,
+			  const struct net_device *in,
+			  const struct net_device *out,
+			  int (*okfn)(struct sk_buff *))
+{
+	if ((*pskb)->ipvs_property || !(*pskb)->nfmark)
+		return NF_ACCEPT;
+
+	return ip_vs_in(hooknum, pskb, in, out, okfn);
+}
/* After packet filtering, forward packet through VS/DR, VS/TUN,
or VS/NAT(change destination), so that filtering rules can be
@@ -1082,6 +1104,16 @@
.priority = 100,
};
+/* Allow transparent proxying by fishing packets
+ * out of the forward chain. */
+static struct nf_hook_ops ip_vs_forward_with_fwmark_ops = {
+	.hook = ip_vs_forward_with_fwmark,
+	.owner = THIS_MODULE,
+	.pf = PF_INET,
+	.hooknum = NF_IP_FORWARD,
+	.priority = 101,
+};
+
/* After packet filtering, change source only for VS/NAT */
static struct nf_hook_ops ip_vs_out_ops = {
.hook = ip_vs_out,
@@ -1160,9 +1192,17 @@
goto cleanup_postroutingops;
}
+	ret = nf_register_hook(&ip_vs_forward_with_fwmark_ops);
+	if (ret < 0) {
+		IP_VS_ERR("can't register forward_with_fwmark hook.\n");
+		goto cleanup_forwardicmpops;
+	}
+
IP_VS_INFO("ipvs loaded.\n");
return ret;
+ cleanup_forwardicmpops:
+ nf_unregister_hook(&ip_vs_forward_icmp_ops);
cleanup_postroutingops:
nf_unregister_hook(&ip_vs_post_routing_ops);
cleanup_outops:
@@ -1182,6 +1222,7 @@
static void __exit ip_vs_cleanup(void)
{
+ nf_unregister_hook(&ip_vs_forward_with_fwmark_ops);
nf_unregister_hook(&ip_vs_forward_icmp_ops);
nf_unregister_hook(&ip_vs_post_routing_ops);
nf_unregister_hook(&ip_vs_out_ops);
* Re: [PATCH] [IPVS] transparent proxying
2006-11-29 6:21 Horms
@ 2006-11-29 14:15 ` Thomas Graf
2006-11-29 14:46 ` Horms
2006-11-29 15:26 ` Wensong Zhang
1 sibling, 1 reply; 10+ messages in thread
From: Thomas Graf @ 2006-11-29 14:15 UTC (permalink / raw)
To: Horms
Cc: netdev, David Miller, Julian Anastasov, Wensong Zhang,
Joseph Mack NA3T, Jinhua Luo
* Horms <horms@verge.net.au> 2006-11-29 15:21
> [intro and most of the patch snipped; the hunk in question:]
>
> +	if ((*pskb)->ipvs_property || !(*pskb)->nfmark)
> +		return NF_ACCEPT;
This patch seems to be based on an old tree; I've renamed nfmark
to mark in net-2.6.20. The terms fwmark and nfmark shouldn't be
used anymore.
* Re: [PATCH] [IPVS] transparent proxying
2006-11-29 14:15 ` Thomas Graf
@ 2006-11-29 14:46 ` Horms
2006-12-18 3:19 ` Horms
0 siblings, 1 reply; 10+ messages in thread
From: Horms @ 2006-11-29 14:46 UTC (permalink / raw)
To: Thomas Graf
Cc: netdev, David Miller, Julian Anastasov, Wensong Zhang,
Joseph Mack NA3T, Jinhua Luo
On Wed, Nov 29, 2006 at 03:15:23PM +0100, Thomas Graf wrote:
> * Horms <horms@verge.net.au> 2006-11-29 15:21
> > [intro and patch snipped]
>
> This patch seems to be based on an old tree; I've renamed nfmark
> to mark in net-2.6.20. The terms fwmark and nfmark shouldn't be
> used anymore.
Sorry, I based this patch on Linus's tree. I'll port it to net-2.6.20.
--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/
* Re: [PATCH] [IPVS] transparent proxying
2006-11-29 6:21 Horms
2006-11-29 14:15 ` Thomas Graf
@ 2006-11-29 15:26 ` Wensong Zhang
1 sibling, 0 replies; 10+ messages in thread
From: Wensong Zhang @ 2006-11-29 15:26 UTC (permalink / raw)
To: Horms; +Cc: netdev, David Miller, Julian Anastasov, Joseph Mack NA3T,
Jinhua Luo
Hi Horms,
I see that this patch probably makes the IPVS code a bit more complicated
and packet traversal less efficient.
If I remember correctly, policy-based routing has worked with IPVS for
transparent cache clusters in kernel 2.2 and 2.4 for a long time. It
should work in kernel 2.6 too.
For example, we can use iptables/ipchains to mark all web traffic with
fwmark 1, then use policy-based routing to route all web traffic through
NF_IP_LOCAL_IN, so that ip_vs_in can capture the packets and load
balance packets to cache servers.
ip rule add prio 100 fwmark 1 table 100
ip route add local 0/0 dev lo table 100
ipvsadm -A -f 1 -s wlc
ipvsadm -a -f 1 -w 100 -r cache1
ipvsadm -a -f 1 -w 100 -r cache2
ipvsadm -a -f 1 -w 100 -r cache3
...
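For completeness, the marking step assumed above might look something like
this (port match as appropriate for the web traffic in question):

iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1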
Cheers,
Wensong
Horms wrote:
> This seems to be a pretty clean solution to a real problem.
>
> Ultimately I would like to see IPVS move into the forward chain.
> This seems to be a nice way to explore that, without breaking
> any existing setups.
>
>
* Re: [PATCH] [IPVS] transparent proxying
@ 2006-11-30 1:49 home_king
2006-12-01 15:41 ` Wensong Zhang
0 siblings, 1 reply; 10+ messages in thread
From: home_king @ 2006-11-30 1:49 UTC (permalink / raw)
To: Wensong Zhang
Cc: Horms, netdev, David Miller, Julian Anastasov, Joseph Mack NA3T
Hi Wensong, thanks for your appraisal.
> I see that this patch probably makes the IPVS code a bit more
> complicated and packet traversal less efficient.
In my opinion, worrying about side effects on packet throughput is not
necessary. First, normal packets with a mark rarely appear on the
NF_IP_FORWARD chain; people marking packets for network administration
purposes usually do so on the NF_IP_LOCAL_IN or NF_IP_OUTPUT chains.
Second, the new hook fn is called after the ipvs SNAT hook fn, and it
passes over packets handled by the latter simply by checking the
ipvs_property flag, so it does not disturb the SNAT job. Third, the new
hook fn is just a thin wrapper around ip_vs_in(). Given that all packets
which go through NF_IP_LOCAL_IN are already checked entirely by
ip_vs_in(), whether they are virtual-server related or not, why should we
mind that a comparatively small number of packets going through
NF_IP_FORWARD are checked too?
> If I remember correctly, policy-based routing has worked with IPVS for
> transparent cache clusters in kernel 2.2 and 2.4 for a long time. It
> should work in kernel 2.6 too.
Indeed, policy routing can help too, but the patch provides a native way
to deploy transparent proxying, and this approach does not disturb the
surrounding networking configuration, such as policy routing settings,
iptables rules, etc.
* Re: [PATCH] [IPVS] transparent proxying
2006-11-30 1:49 [PATCH] [IPVS] transparent proxying home_king
@ 2006-12-01 15:41 ` Wensong Zhang
0 siblings, 0 replies; 10+ messages in thread
From: Wensong Zhang @ 2006-12-01 15:41 UTC (permalink / raw)
To: home_king; +Cc: Horms, netdev, David Miller, Julian Anastasov, Joseph Mack NA3T
Hi Jinhua,
home_king wrote:
> Hi Wensong, thanks for your appraisal.
>
> In my opinion, worrying about side effects on packet throughput is not
> necessary. [...] Given that all packets which go through NF_IP_LOCAL_IN
> are already checked entirely by ip_vs_in(), whether they are
> virtual-server related or not, why should we mind that a comparatively
> small number of packets going through NF_IP_FORWARD are checked too?
>
I see that every firewall-marked packet will be checked by ip_vs_in(),
no matter whether the packet is related to IPVS or not. It's a bit less
efficient.
> > If I remember correctly, policy-based routing has worked with IPVS for
> > transparent cache clusters in kernel 2.2 and 2.4 for a long time. It
> > should work in kernel 2.6 too.
>
> Indeed, policy routing can help too, but the patch provides a native way
> to deploy transparent proxying, and this approach does not disturb the
> surrounding networking configuration, such as policy routing settings,
> iptables rules, etc.
I am afraid that the method used in the patch is not native: it breaks on
IP fragments. IPVS is a kind of layer-4 switching; it routes packets by
checking layer-4 information such as addresses and port numbers.
ip_vs_in() is hooked at NF_IP_LOCAL_IN, so all the packets received by
ip_vs_in() have already been defragmented. On the NF_IP_FORWARD hook there
may be IP fragments, and ip_vs_in() cannot handle those fragments.

I think it's probably better to let each part do its own job in the
design.
Cheers,
Wensong
* Re: [PATCH] [IPVS] transparent proxying
@ 2006-12-04 5:53 home_king
2006-12-04 17:20 ` Wensong Zhang
0 siblings, 1 reply; 10+ messages in thread
From: home_king @ 2006-12-04 5:53 UTC (permalink / raw)
To: Wensong Zhang, Horms
Cc: netdev, David Miller, Julian Anastasov, Joseph Mack NA3T,
Jinhua Luo
> I am afraid that the method used in the patch is not native: it breaks
> on IP fragments. IPVS is a kind of layer-4 switching; it routes packets
> by checking layer-4 information such as addresses and port numbers.
> ip_vs_in() is hooked at NF_IP_LOCAL_IN, so all the packets received by
> ip_vs_in() have already been defragmented. On the NF_IP_FORWARD hook
> there may be IP fragments, and ip_vs_in() cannot handle those fragments.
However, your analysis is a bit inaccurate, I think.
As far as I know, policy routing's conjunction with fwmark works only
under certain preconditions, the most important of which is the
defragmentation provided by NF_CONNTRACK. That is, the routing core works
at layer 3: it can route IP fragments using the IP information alone, and
it does not care about layer-4 information such as service ports. Firewall
marking is what gets layer 4 involved. To retrieve the full layer-4
header, netfilter has no choice but to defragment on behalf of the routing
core, which is a key function of NF_CONNTRACK.

In a word, without NF_CONNTRACK, neither policy routing nor my patch can
cope with fragmented packets!
I will give you some proof of this.

Looking at the netfilter source:

See the quotation below from /usr/include/linux/netfilter_ipv4.h. The
defragmentation hook of netfilter has the highest priority, so it is
called before any other hooks, including those of ipvs & iptables.
-----quote start-----
enum nf_ip_hook_priorities {
	NF_IP_PRI_FIRST = INT_MIN,
	NF_IP_PRI_CONNTRACK_DEFRAG = -400,
	...
	NF_IP_PRI_LAST = INT_MAX,
};
-----quote end-----
And see ip_conntrack_standalone.c, which defines the defrag hooks on the
PREROUTING and OUTPUT chains with NF_IP_PRI_CONNTRACK_DEFRAG priority.
Needless to say, all packets flowing through the INPUT & FORWARD chains
have already been defragmented by it; in other words, once CONNTRACK is
enabled, you cannot see any fragment on the INPUT & FORWARD chains, or on
the other chains either.
-----quote start-----
static struct nf_hook_ops ip_conntrack_defrag_ops = {
	.hook = ip_conntrack_defrag,
	.owner = THIS_MODULE,
	.pf = PF_INET,
	.hooknum = NF_IP_PRE_ROUTING,
	.priority = NF_IP_PRI_CONNTRACK_DEFRAG,
};

static struct nf_hook_ops ip_conntrack_defrag_local_out_ops = {
	.hook = ip_conntrack_defrag,
	.owner = THIS_MODULE,
	.pf = PF_INET,
	.hooknum = NF_IP_LOCAL_OUT,
	.priority = NF_IP_PRI_CONNTRACK_DEFRAG,
};
-----quote end-----
On the other hand, I wrote a simple program -- test_udp_fragment.c --
to test it.
-----test code start--------------
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <stdio.h>	/* printf, perror */
#include <stdlib.h>
#include <string.h>	/* memset */
#include <netinet/in.h>	/* struct sockaddr_in, htons */
#include <arpa/inet.h>	/* inet_addr */

int main(int argc, char *argv[])
{
#ifndef AS_SERVER
	if (argc < 2) {
		printf("SYNTAX: %s <server ip>\n", argv[0]);
		exit(EXIT_SUCCESS);
	}
#endif

	int sockfd;
	sockfd = socket(PF_INET, SOCK_DGRAM, 0);
	if (sockfd < 0) {
		perror("socket");
		exit(EXIT_FAILURE);
	}

#define MSG_SIZE 10000 /* bigger than the MTU */
#define BUF_SIZE (MSG_SIZE + 1)

	char buf[BUF_SIZE];
	memset(buf, 0, BUF_SIZE);

	struct sockaddr_in test_addr;
	test_addr.sin_family = AF_INET;
	test_addr.sin_port = htons(10000);
#ifdef AS_SERVER
	test_addr.sin_addr.s_addr = inet_addr("0.0.0.0");
	if (bind(sockfd,
		 (struct sockaddr *) &test_addr, sizeof(test_addr)) < 0) {
		perror("bind");
		exit(EXIT_FAILURE);
	}
	ssize_t r = 0;

	while (1) {
		r = recv(sockfd, buf, MSG_SIZE, MSG_WAITALL);
		if (r < MSG_SIZE) {
			printf("truncated!\n");
			exit(EXIT_FAILURE);
		}
		printf("recv message: %s\n", buf);
	}
#else
	memset(buf, 'A', MSG_SIZE);

	test_addr.sin_addr.s_addr = inet_addr(argv[1]);
	ssize_t s = 0;
	s = sendto(sockfd, buf, MSG_SIZE, 0, (struct sockaddr *) &test_addr,
		   sizeof(test_addr));
	if (s != MSG_SIZE) {
		perror("send failed");
		exit(EXIT_FAILURE);
	}
#endif

	exit(EXIT_SUCCESS);
}
-----test code end--------------
The program above implements a simple UDP server & a simple UDP client.
The client sends a message of MSG_SIZE bytes (filled with 'A') to the
server, and the server receives and prints out the message.
The MSG_SIZE (here I take 10000 as an example) is far bigger than the
normal Ethernet NIC MTU (1500), so the outgoing message will be fragmented.
Say the server's IP (SIP) is 172.16.100.254 and the client's IP (CIP) is
172.16.100.63; the client's default gateway is SIP.
I apply the following settings:
@ Server
# Mark the client's udp access
iptables -t mangle -A PREROUTING -p udp -s 172.16.100.63 --dport 10000 \
-j MARK --set-mark 1
# REDIRECT the forward packets marked with 1 to localhost
ip rule add prio 100 fwmark 1 table 100
ip route add local 0/0 dev lo table 100
# Compile the test programs and copy the source to the client
gcc -DAS_SERVER -o /tmp/server test_udp_fragment.c
gcc -o /tmp/client test_udp_fragment.c
scp test_udp_fragment.c 172.16.100.63:/tmp/
# Start the server
/tmp/server
@ Client
# Send a message to some external site, such as google.com
/tmp/client 64.233.189.104
The test has two different results:
1. When CONNTRACK is enabled in the kernel running on the server
recv message: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAA....
2. When CONNTRACK is disabled in the kernel running on the server
[Nothing printed out!!!]
See, without CONNTRACK the policy route fails to handle the
defragmentation problem too! In fact, it routes only the first IP
fragment of the message to the INPUT chain, ignoring the other IP
fragments and letting them wander on the FORWARD chain, so the
defragmentation job in ip_local_deliver() can never succeed and the
test program above never receives the message from the kernel.
Besides this test program, you can easily validate this fact with the
iptables LOG target:
@ Server
# Set a LOG rule just after the MARK rule
iptables -t mangle -A PREROUTING -m mark --mark 1 -j LOG
Without CONNTRACK you will see only the first IP fragment in your log
file; with CONNTRACK you will see an entire IP packet!
All in all, the conclusion is that:
with CONNTRACK, my TP patch works without any defragmentation problem;
without CONNTRACK, neither my TP patch nor the policy-routing rule for TP
works!
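In practice that just means making sure connection tracking is enabled on
the director before the mark rules are used; on 2.6 kernels of this era
that would be the ip_conntrack module (assuming it is built as a module):

modprobe ip_conntrack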
So I maintain that my patch enables native support for TP.
Cheers,
Jinhua
* Re: [PATCH] [IPVS] transparent proxying
2006-12-04 5:53 home_king
@ 2006-12-04 17:20 ` Wensong Zhang
0 siblings, 0 replies; 10+ messages in thread
From: Wensong Zhang @ 2006-12-04 17:20 UTC (permalink / raw)
To: home_king; +Cc: Horms, netdev, David Miller, Julian Anastasov, Joseph Mack NA3T
Hi Jinhua,
home_king wrote:
> > I am afraid that the method used in the patch is not native: it
> > breaks on IP fragments. [...]
>
> However, your analysis is a bit inaccurate, I think.
>
> As far as I know, policy routing's conjunction with fwmark works only
> under certain preconditions, the most important of which is the
> defragmentation provided by NF_CONNTRACK. [...]
>
> In a word, without NF_CONNTRACK, neither policy routing nor my patch
> can cope with fragmented packets!
>
OK, it's my mistake. :-) We mark packets according to port number, and
those packets go through the FORWARD chain by default, so we have to do
defragmentation before firewall marking.
> I will give you some proof of this.
>
> [netfilter source quotations, test program, setup and test results
> snipped]
>
> All in all, the conclusion is that:
> with CONNTRACK, my TP patch works without any defragmentation problem;
> without CONNTRACK, neither my TP patch nor the policy-routing rule for
> TP works!
>
Thanks a lot for providing the detailed testing program. However, I
prefer to have ip_vs_in() hooked at NF_IP_LOCAL_IN only.
> So I maintain that my patch enables native support for TP.
>
>
> Cheers,
>
> Jinhua
Cheers,
Wensong
* Re: [PATCH] [IPVS] transparent proxying
2006-11-29 14:46 ` Horms
@ 2006-12-18 3:19 ` Horms
2006-12-18 14:17 ` Thomas Graf
0 siblings, 1 reply; 10+ messages in thread
From: Horms @ 2006-12-18 3:19 UTC (permalink / raw)
To: Thomas Graf
Cc: netdev, David Miller, Julian Anastasov, Wensong Zhang,
Joseph Mack NA3T, Jinhua Luo
On Wed, Nov 29, 2006 at 11:46:22PM +0900, Horms wrote:
> On Wed, Nov 29, 2006 at 03:15:23PM +0100, Thomas Graf wrote:
[split]
> > This patch seems to be based on an old tree; I've renamed nfmark
> > to mark in net-2.6.20. The terms fwmark and nfmark shouldn't be
> > used anymore.
>
> Sorry, I based this patch on Linus's tree. I'll port it to net-2.6.20.
This took too long for me to get around to :(
Am I correct in thinking I just need to replace fwmark with mark?
If so, the updated version is below.
--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/
[IPVS] transparent proxying
Patch from home_king <home_king@163.com> to allow a web cluster to use
transparent proxying. It works by simply grabbing packets that have the
fwmark set and have not already been processed by ipvs (ip_vs_out) and
throwing them into ip_vs_in.
See: http://archive.linuxvirtualserver.org/html/lvs-users/2006-11/msg00261.html
Normally LVS packets are processed by ip_vs_in on the INPUT chain,
and packets that are processed in this way never show up on the FORWARD
chain, so they won't hit this rule.
This patch seems like a good precursor to moving LVS permanently to
the FORWARD chain, as I'm struggling to think of how it could break things.
Reformatted to use tabs for indentation (instead of 4 spaces)
Reformatted to be < 80 columns wide
Updated fwmark to mark
Cc: Jinhua Luo <home_king@163.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Index: net-2.6/net/ipv4/ipvs/ip_vs_core.c
===================================================================
--- net-2.6.orig/net/ipv4/ipvs/ip_vs_core.c 2006-12-18 11:46:10.000000000 +0900
+++ net-2.6/net/ipv4/ipvs/ip_vs_core.c 2006-12-18 12:13:32.000000000 +0900
@@ -23,7 +23,9 @@
* Changes:
* Paul `Rusty' Russell properly handle non-linear skbs
* Harald Welte don't use nfcache
- *
+ * Jinhua Luo redirect packets with fwmark on
+ * NF_IP_FORWARD chain to ip_vs_in(),
+ * mainly for transparent cache cluster
*/
#include <linux/module.h>
@@ -1070,6 +1072,26 @@
return ip_vs_in_icmp(pskb, &r, hooknum);
}
+/*
+ * This is hooked into the NF_IP_FORWARD chain. It catches
+ * packets that have not already been handled by ipvs (out)
+ * and have a fwmark set. This is to allow transparent proxying
+ * of fwmark virtual services.
+ *
+ * It will not process packets that are handled by ipvs (in),
+ * as they never traverse the NF_IP_FORWARD chain.
+ */
+static unsigned int
+ip_vs_forward_with_fwmark(unsigned int hooknum, struct sk_buff **pskb,
+			  const struct net_device *in,
+			  const struct net_device *out,
+			  int (*okfn)(struct sk_buff *))
+{
+	if ((*pskb)->ipvs_property || !(*pskb)->mark)
+		return NF_ACCEPT;
+
+	return ip_vs_in(hooknum, pskb, in, out, okfn);
+}
/* After packet filtering, forward packet through VS/DR, VS/TUN,
or VS/NAT(change destination), so that filtering rules can be
@@ -1082,6 +1104,16 @@
.priority = 100,
};
+/* Allow transparent proxying by fishing packets
+ * out of the forward chain. */
+static struct nf_hook_ops ip_vs_forward_with_fwmark_ops = {
+	.hook = ip_vs_forward_with_fwmark,
+	.owner = THIS_MODULE,
+	.pf = PF_INET,
+	.hooknum = NF_IP_FORWARD,
+	.priority = 101,
+};
+
/* After packet filtering, change source only for VS/NAT */
static struct nf_hook_ops ip_vs_out_ops = {
.hook = ip_vs_out,
@@ -1160,9 +1192,17 @@
goto cleanup_postroutingops;
}
+	ret = nf_register_hook(&ip_vs_forward_with_fwmark_ops);
+	if (ret < 0) {
+		IP_VS_ERR("can't register forward_with_fwmark hook.\n");
+		goto cleanup_forwardicmpops;
+	}
+
IP_VS_INFO("ipvs loaded.\n");
return ret;
+ cleanup_forwardicmpops:
+ nf_unregister_hook(&ip_vs_forward_icmp_ops);
cleanup_postroutingops:
nf_unregister_hook(&ip_vs_post_routing_ops);
cleanup_outops:
@@ -1182,6 +1222,7 @@
static void __exit ip_vs_cleanup(void)
{
+ nf_unregister_hook(&ip_vs_forward_with_fwmark_ops);
nf_unregister_hook(&ip_vs_forward_icmp_ops);
nf_unregister_hook(&ip_vs_post_routing_ops);
nf_unregister_hook(&ip_vs_out_ops);
* Re: [PATCH] [IPVS] transparent proxying
2006-12-18 3:19 ` Horms
@ 2006-12-18 14:17 ` Thomas Graf
0 siblings, 0 replies; 10+ messages in thread
From: Thomas Graf @ 2006-12-18 14:17 UTC (permalink / raw)
To: Horms
Cc: netdev, David Miller, Julian Anastasov, Wensong Zhang,
Joseph Mack NA3T, Jinhua Luo
* Horms <horms@verge.net.au> 2006-12-18 12:19
> On Wed, Nov 29, 2006 at 11:46:22PM +0900, Horms wrote:
> > On Wed, Nov 29, 2006 at 03:15:23PM +0100, Thomas Graf wrote:
>
> [split]
>
> > > This patch seems to be based on an old tree; I've renamed nfmark
> > > to mark in net-2.6.20. The terms fwmark and nfmark shouldn't be
> > > used anymore.
> >
> > Sorry, I based this patch on Linus's tree. I'll port it to net-2.6.20.
>
> This took too long for me to get around to :(
> Am I correct in thinking I just need to replace fwmark with mark?
Yes, that's all there is to it.