* Re: [PATCH] [IPVS] transparent proxying
@ 2006-11-30  1:49 home_king
  2006-12-01 15:41 ` Wensong Zhang
From: home_king @ 2006-11-30  1:49 UTC (permalink / raw)
  To: Wensong Zhang
  Cc: Horms, netdev, David Miller, Julian Anastasov, Joseph Mack NA3T

Hi, Wensong. Thanks for your appraisal.

 > I see that this patch probably makes IPVS code a bit complicated and
 > packet traversing less efficiently.

In my opinion, there is no need to worry about a side effect on packet
throughput. First, marked packets rarely appear on the NF_IP_FORWARD
chain; people who mark packets for administrative purposes usually do
so for traffic on the NF_IP_LOCAL_IN or NF_IP_LOCAL_OUT chain. Second,
the new hook function is called after the IPVS SNAT hook function and
skips packets already handled by the latter simply by checking the
ipvs_property flag, so it does not disturb the SNAT job. Third, the
new hook function is just a thin wrapper around ip_vs_in(). Given that
every packet traversing NF_IP_LOCAL_IN is already fully inspected by
ip_vs_in(), whether it is related to a virtual server or not, why
should we mind that a comparatively small number of packets traversing
NF_IP_FORWARD get inspected too?

 > If I remember correctly, policy-based routing can work with IPVS in
 > kernel 2.2 and 2.4 for transparent cache cluster for a long time. It
 > should work in kernel 2.6 too.

Indeed, policy routing can help too, but the patch provides a native
way to deploy a transparent proxy, and it does so without touching the
surrounding networking configuration, such as policy routing rules,
iptables rules, etc.
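
For illustration only, a minimal sketch of how a transparent cache
cluster might then be configured on the director (the mark value,
addresses and scheduler below are just examples, not part of the
patch):

# mark forwarded web traffic in the mangle table
iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1
# define an fwmark-based virtual service and add the cache boxes
ipvsadm -A -f 1 -s rr
ipvsadm -a -f 1 -r 192.168.1.10:80 -m
ipvsadm -a -f 1 -r 192.168.1.11:80 -m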


* Re: [PATCH] [IPVS] transparent proxying
@ 2006-12-04  5:53 home_king
  2006-12-04 17:20 ` Wensong Zhang
From: home_king @ 2006-12-04  5:53 UTC (permalink / raw)
  To: Wensong Zhang, Horms
  Cc: netdev, David Miller, Julian Anastasov, Joseph Mack NA3T,
	Jinhua Luo

 > I am afraid that the method used in the patch is not native, it
 > breaks on IP fragments. IPVS is a kind of layer-4 switching, it
 > routes packets by checking layer-4 information such as address and
 > port number. ip_vs_in() is hooked at NF_IP_LOCAL_IN, so that all
 > the packets received by ip_vs_in() are already defragmented. On the
 > NF_IP_FORWARD hook, there may be some IP fragments, and ip_vs_in()
 > cannot handle those IP fragments.

However, I think your analysis is a bit inaccurate.

As far as I know, policy routing combined with fwmark only works under
certain preconditions, the most important of which is the
defragmentation provided by NF_CONNTRACK. That is, the routing core
works at layer 3: it can route every IP fragment from the IP header
alone and does not care about layer-4 information such as service
ports. Firewall marks bring layer 4 into the picture, and to retrieve
the full layer-4 header netfilter has no choice but to defragment the
packet for the routing core, which is a key function of NF_CONNTRACK.

In short, without NF_CONNTRACK neither policy routing nor my patch can
cope with the defragmentation problem!
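
(For completeness: on a 2.6 kernel this simply means having connection
tracking available. A minimal sketch, assuming the classic
ip_conntrack implementation: build with CONFIG_IP_NF_CONNTRACK=y, or,
if it is built as a module, load it before the test with
"modprobe ip_conntrack".)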

I will give you some evidence for this.

Looking at the netfilter source:

See the quotation below from /usr/include/linux/netfilter_ipv4.h: the
defragmentation done by netfilter has the highest priority, so the
corresponding hook is called before any other hook, including those of
IPVS and iptables.

-----quote start-----
enum nf_ip_hook_priorities {
        NF_IP_PRI_FIRST = INT_MIN,
        NF_IP_PRI_CONNTRACK_DEFRAG = -400,
    ...
        NF_IP_PRI_LAST = INT_MAX,
};
-----quote end-----


And see ip_conntrack_standalone.c, which defines the defrag hooks on
the PREROUTING and OUTPUT chains with NF_IP_PRI_CONNTRACK_DEFRAG
priority. Needless to say, all packets that reach the INPUT and
FORWARD chains have already been defragmented by it; in other words,
once CONNTRACK is enabled you cannot see any fragment on the INPUT or
FORWARD chain, nor on the other chains.

-----quote start-----
static struct nf_hook_ops ip_conntrack_defrag_ops = {
    .hook        = ip_conntrack_defrag,
    .owner        = THIS_MODULE,
    .pf        = PF_INET,
    .hooknum    = NF_IP_PRE_ROUTING,
    .priority    = NF_IP_PRI_CONNTRACK_DEFRAG,
};

static struct nf_hook_ops ip_conntrack_defrag_local_out_ops = {
    .hook        = ip_conntrack_defrag,
    .owner        = THIS_MODULE,
    .pf        = PF_INET,
    .hooknum    = NF_IP_LOCAL_OUT,
    .priority    = NF_IP_PRI_CONNTRACK_DEFRAG,
};
-----quote end-----


On the other hand, I wrote a simple program -- test_udp_fragment.c --
to test it.

-----test code start--------------

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
#ifndef AS_SERVER
    if (argc < 2) {
        printf("SYNTAX: %s <server ip>\n", argv[0]);
        exit(EXIT_SUCCESS);
    }
#endif

    int sockfd;
    sockfd = socket(PF_INET, SOCK_DGRAM, 0);
    if (sockfd < 0) {
        perror("socket");
        exit(EXIT_FAILURE);
    }
#define MSG_SIZE 10000          /* bigger than MTU */
#define BUF_SIZE (MSG_SIZE + 1)

    char buf[BUF_SIZE];
    memset(buf, 0, BUF_SIZE);

    struct sockaddr_in test_addr;
    memset(&test_addr, 0, sizeof(test_addr));
    test_addr.sin_family = AF_INET;
    test_addr.sin_port = htons(10000);
#ifdef AS_SERVER
    test_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(sockfd,
             (struct sockaddr *) &test_addr, sizeof(test_addr)) < 0) {
        perror("bind");
        exit(EXIT_FAILURE);
    }
    ssize_t r = 0;

    while (1) {
        r = recv(sockfd, buf, MSG_SIZE, MSG_WAITALL);
        if (r < MSG_SIZE) {
            printf("truncated!\n");
            exit(EXIT_FAILURE);
        }
        printf("recv message: %s\n", buf);
    }
#else
    memset(buf, 'A', MSG_SIZE);

    test_addr.sin_addr.s_addr = inet_addr(argv[1]);
    ssize_t s = 0;
    s = sendto(sockfd, buf, MSG_SIZE, 0, (struct sockaddr *) &test_addr,
               sizeof(test_addr));
    if (s != MSG_SIZE) {
        perror("send failed");
        exit(EXIT_FAILURE);
    }
#endif

    exit(EXIT_SUCCESS);
}
-----test code end--------------

The program above implements a simple UDP server and a simple UDP
client.

The client sends a message of MSG_SIZE bytes (filled with 'A') to the
server, and the server receives and prints out the message.

MSG_SIZE (here I take 10000 as an example) is far bigger than the
usual Ethernet MTU (1500), so the outgoing message will be fragmented.
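
(Roughly, assuming a 1500-byte MTU and 20-byte IP headers with no
options: the 10000-byte payload plus the 8-byte UDP header gives 10008
bytes of IP payload, and each fragment can carry at most 1480 of them,
so the datagram leaves the client as about ceil(10008/1480) = 7
fragments: 6 x 1480 + 1128.)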

Assume the server's IP (SIP) is 172.16.100.254 and the client's IP
(CIP) is 172.16.100.63. The client's default gateway is SIP.

I apply the following settings:

@ Server

# Mark the client's udp access
iptables -t mangle -A PREROUTING -p udp -s 172.16.100.63 --dport 10000 \
  -j MARK --set-mark 1
# Redirect the forwarded packets marked with 1 to the local host
ip rule add prio 100 fwmark 1 table 100
ip route add local 0/0 dev lo table 100
# Compile the test program and copy the client binary to the client box
gcc -DAS_SERVER -o /tmp/server test_udp_fragment.c
gcc -o /tmp/client test_udp_fragment.c
scp /tmp/client 172.16.100.63:/tmp/
# Start the server
/tmp/server

@ Client

# Send a message to some external site, e.g. google.com
/tmp/client 64.233.189.104


The test has two different results:

1. When CONNTRACK is enabled in the kernel running on the server:
recv message: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAA....

2. When CONNTRACK is disabled in the kernel running on the server:
[Nothing is printed out!]

See, without CONNTRACK the policy route fails to handle the
defragmentation problem too! It only routes the first IP fragment of
the message to the INPUT chain, ignores the other fragments and lets
them wander off on the FORWARD chain, so the reassembly done by
ip_local_deliver() can never succeed and the test program above never
receives the message from the kernel.


Besides this test program, you can easily validate this fact with the
iptables LOG target:

@ Server

# Set a LOG rule just after the MARK rule
iptables -t mangle -A PREROUTING -m mark --mark 1 -j LOG

You will see that, without CONNTRACK, only the first IP fragment shows
up in your log file, while with CONNTRACK you see one entire
(reassembled) IP packet!
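
(Another way to watch this, assuming tcpdump is available on the
server, is to capture only fragmented packets on the incoming
interface -- eth0 here is just an example -- with the classic filter
for a non-zero fragment offset or MF flag:

tcpdump -n -i eth0 'ip[6:2] & 0x3fff != 0'

With CONNTRACK enabled you should still see the fragments arrive on
the wire, but they are reassembled before they reach the INPUT and
FORWARD chains.)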




All in all, the conclusion is:
if you use CONNTRACK, my TP patch works without any defragmentation
problem; if you don't use CONNTRACK, neither my TP patch nor a policy
routing rule for TP works.

So I still think that my patch enables native support for TP.


Cheers,

Jinhua


* [PATCH] [IPVS] transparent proxying
@ 2006-11-29  6:21 Horms
  2006-11-29 14:15 ` Thomas Graf
  2006-11-29 15:26 ` Wensong Zhang
From: Horms @ 2006-11-29  6:21 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Julian Anastasov, Wensong Zhang, Joseph Mack NA3T,
	Jinhua Luo

This seems to be a pretty clean solution to a real problem.

Ultimately I would like to see IPVS move into the forward chain.
This seems to be a nice way to explore that, without breaking
any existing setups.

-- 
Horms
  H: http://www.vergenet.net/~horms/
  W: http://www.valinux.co.jp/en/

[IPVS] transparent proxying

Patch from Jinhua Luo <home_king@163.com> to allow a web cluster to
use transparent proxying. It works by simply grabbing packets that
have the fwmark set and have not already been processed by ipvs
(ip_vs_out) and throwing them into ip_vs_in.

See: http://archive.linuxvirtualserver.org/html/lvs-users/2006-11/msg00261.html

Normally LVS packets are processed by ip_vs_in on the INPUT chain, and
packets that are processed in this way never show up on the FORWARD
chain, so they won't hit this new hook.

This patch seems like a good precursor to moving LVS permanently to
the FORWARD chain, as I'm struggling to think how it could break
things.

The changes to the original patch are:

* Reformatted to use tabs for indentation (instead of 4 spaces)
* Reformatted to be < 80 columns wide
* Added some comments
* Rewrote description (this text)

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Jinhua Luo <home_king@163.com>

Index: linux-2.6/net/ipv4/ipvs/ip_vs_core.c
===================================================================
--- linux-2.6.orig/net/ipv4/ipvs/ip_vs_core.c	2006-11-28 15:30:00.000000000 +0900
+++ linux-2.6/net/ipv4/ipvs/ip_vs_core.c	2006-11-29 10:27:49.000000000 +0900
@@ -23,7 +23,9 @@
  * Changes:
  *	Paul `Rusty' Russell		properly handle non-linear skbs
  *	Harald Welte			don't use nfcache
- *
+ *	Jinhua Luo                      redirect packets with fwmark on
+ *					NF_IP_FORWARD chain to ip_vs_in(),
+ *					mainly for transparent cache cluster
  */
 
 #include <linux/module.h>
@@ -1070,6 +1072,26 @@
 	return ip_vs_in_icmp(pskb, &r, hooknum);
 }
 
+/*
+ * 	This is hooked into the NF_IP_FORWARD. It catches
+ * 	packets that have not already been handled by ipvs (out)
+ * 	and have a fwmark set. This is to allow transparent proxying
+ * 	of fwmark virtual services.
+ *
+ * 	It will not process packets that are handled by ipvs (in)
+ * 	as they never traverse the NF_IP_FORWARD.
+ */
+static unsigned int
+ip_vs_forward_with_fwmark(unsigned int hooknum, struct sk_buff **pskb,
+			  const struct net_device *in,
+			  const struct net_device *out,
+			  int (*okfn)(struct sk_buff *))
+{
+	if ((*pskb)->ipvs_property || ! (*pskb)->nfmark)
+		return NF_ACCEPT;
+
+	return ip_vs_in(hooknum, pskb, in, out, okfn);
+}
 
 /* After packet filtering, forward packet through VS/DR, VS/TUN,
    or VS/NAT(change destination), so that filtering rules can be
@@ -1082,6 +1104,16 @@
 	.priority       = 100,
 };
 
+/* Allow transparent proxying by fishing packets
+ * out of the forward chain. */
+static struct nf_hook_ops ip_vs_forward_with_fwmark_ops = {
+	.hook		= ip_vs_forward_with_fwmark,
+	.owner		= THIS_MODULE,
+	.pf		= PF_INET,
+	.hooknum	= NF_IP_FORWARD,
+	.priority	= 101,
+};
+
 /* After packet filtering, change source only for VS/NAT */
 static struct nf_hook_ops ip_vs_out_ops = {
 	.hook		= ip_vs_out,
@@ -1160,9 +1192,17 @@
 		goto cleanup_postroutingops;
 	}
 
+	ret = nf_register_hook(&ip_vs_forward_with_fwmark_ops);
+	if (ret < 0) {
+		IP_VS_ERR("can't register forward_with_fwmark hook.\n");
+		goto cleanup_forwardicmpops;
+	}
+
 	IP_VS_INFO("ipvs loaded.\n");
 	return ret;
 
+  cleanup_forwardicmpops:
+	nf_unregister_hook(&ip_vs_forward_icmp_ops);
   cleanup_postroutingops:
 	nf_unregister_hook(&ip_vs_post_routing_ops);
   cleanup_outops:
@@ -1182,6 +1222,7 @@
 
 static void __exit ip_vs_cleanup(void)
 {
+	nf_unregister_hook(&ip_vs_forward_with_fwmark_ops);
 	nf_unregister_hook(&ip_vs_forward_icmp_ops);
 	nf_unregister_hook(&ip_vs_post_routing_ops);
 	nf_unregister_hook(&ip_vs_out_ops);


