* Re: Fwd: LVS on local node
From: Eric Dumazet @ 2010-07-22 6:56 UTC (permalink / raw)
To: Franchoze Eric; +Cc: wensong, lvs-devel, netdev, netfilter-devel
In-Reply-To: <27901279770680@web67.yandex.ru>
Le jeudi 22 juillet 2010 à 07:51 +0400, Franchoze Eric a écrit :
> Hello,
>
> I'm trying to do load balancing of incoming traffic to my applications. This applications are not very smp friendly, and I want try to run some instances according to number of cpus on single machine. And balance load of incoming traffic/connections to this applications.
> Looks like is should be similar to http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.localnode.html
>
> linux kernel 2.6.32 with or without hide interface patches. Tried different configurations but could not see packets on application layer.
>
> 192.168.1.165 - eth0 - interface for external connections
> 195.0.0.1 - dummy0 - virtual interface, real application is binded to that address.
>
> Configuration is:
> -A -t 192.168.1.165:1234 -s wlc
> -a -t 192.168.1.165:1234 -r 195.0.0.1:1234 -g -w
>
> #ipvsadm -L -n
> IP Virtual Server version 1.2.1 (size=4096)
> Prot LocalAddress:Port Scheduler Flags
> -> RemoteAddress:Port Forward Weight ActiveConn InActConn
> TCP 192.168.1.165:1234 wlc
> -> 195.0.0.1:1234 Local 1 0 0
> #
>
> Log:
> [ 2106.897409] IPVS: lookup/out TCP 192.168.1.165:44847->192.168.1.165:1234 not hit
> [ 2106.897412] IPVS: lookup service: fwm 0 TCP 192.168.1.165:1234 hit
> [ 2106.897414] IPVS: ip_vs_wlc_schedule(): Scheduling...
> [ 2106.897416] IPVS: WLC: server 195.0.0.1:1234 activeconns 0 refcnt 2 weight 1 overhead 1
> [ 2106.897418] IPVS: Enter: ip_vs_conn_new, net/netfilter/ipvs/ip_vs_conn.c line 693
> [ 2106.897421] IPVS: Bind-dest TCP c:192.168.1.165:44847 v:192.168.1.165:1234 d:195.0.0.1:1234 fwd:L s:0 conn->flags:181 conn->refcnt:1 dest->refcnt:3
> [ 2106.897425] IPVS: Schedule fwd:L c:192.168.1.165:44847 v:192.168.1.165:1234 d:195.0.0.1:1234 conn->flags:1C1 conn->refcnt:2
> [ 2106.897429] IPVS: TCP input [S...] 195.0.0.1:1234->192.168.1.165:44847 state: NONE->SYN_RECV conn->refcnt:2
> [ 2106.897431] IPVS: Enter: ip_vs_null_xmit, net/netfilter/ipvs/ip_vs_xmit.c line 212
> [ 2106.897439] IPVS: lookup/in TCP 192.168.1.165:1234->192.168.1.165:44847 not hit
> [ 2106.897441] IPVS: lookup/out TCP 192.168.1.165:1234->192.168.1.165:44847 not hit
> [ 2107.277535] IPVS: packet type=1 proto=17 daddr=255.255.255.255 ignored
> [ 2108.542691] IPVS: packet type=1 proto=17 daddr=192.168.1.255 ignored
>
> As the result, server application does receive anything on accept(). I tried to make dummy0 a hidden device and play with arp settings. But without result.
>
> I will be happy to hear any idea how to do connection in this environment.
>
lvs seems not very SMP friendly and a bit complex.
I would use an iptables setup and a slighly modified REDIRECT target
(and/or a nf_nat_setup_info() change)
Say you have 8 daemons listening on different ports (1000 to 1007)
iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --rxhash-dist --to-port 1000-1007
rxhash would be provided by RPS on recent kernels or locally computed if
not already provided by core network (or old kernel)
This rule would be triggered only at connection establishment.
conntracking take care of following packets and is SMP friendly.
^ permalink raw reply
* Re: [patch v2.6 4/4] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Simon Horman @ 2010-07-22 6:57 UTC (permalink / raw)
To: Jan Engelhardt
Cc: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel,
Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
David S. Miller, Hannes Eder
In-Reply-To: <alpine.LSU.2.01.1007220823340.4980@obet.zrqbmnf.qr>
On Thu, Jul 22, 2010 at 08:25:01AM +0200, Jan Engelhardt wrote:
>
> On Thursday 2010-07-22 03:38, Simon Horman wrote:
> >
> >I must confess that I'm not familiar with using enum in this way.
> >Can I confirm that you are suggesting the following?
> >
> >enum {
> > XT_IPVS_IPVS_PROPERTY = 1 << 0, /* all other options imply this one */
> > XT_IPVS_PROTO = 1 << 1,
> > XT_IPVS_VADDR = 1 << 2,
> > XT_IPVS_VPORT = 1 << 3,
> > XT_IPVS_DIR = 1 << 4,
> > XT_IPVS_METHOD = 1 << 5,
> > XT_IPVS_VPORTCTL = 1 << 6,
> > XT_IPVS_MASK = (1 << 7) - 1,
> > XT_IPVS_ONCE_MASK = (XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
> >};
>
> Yes; You may drop the () in ONCE_MASK though.
Thanks; and yes, silliness on my part.
^ permalink raw reply
* Re: [PATCH net-next] sysfs: add entry to indicate network interfaces with random MAC address
From: Ian Campbell @ 2010-07-22 7:12 UTC (permalink / raw)
To: David Miller
Cc: gregory.v.rose, leedom, shemminger, andy, harald, bhutchings,
sassmann, netdev, linux-kernel, gospo, alexander.h.duyck
In-Reply-To: <20100721.123324.237334251.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 2163 bytes --]
On Wed, 2010-07-21 at 12:33 -0700, David Miller wrote:
> From: "Rose, Gregory V" <gregory.v.rose@intel.com>
> Date: Wed, 21 Jul 2010 12:02:17 -0700
>
> >>From: David Miller <davem@davemloft.net>
> >>Date: Wed, 21 Jul 2010 11:48:51 -0700 (PDT)
> >>
> >>> You could do things like have the PF controller use the root
> >>filesystem
> >>> ID label to construct the VF's MAC address, or something like that.
> >>
> >>And here I of course mean the root filesystem of the guest the VF will
> >>be given to.
> >
> > I suppose you could do that but then the VM is going to have to be
> > allowed to set its own MAC address. There is a lot of opposition
> > and concern about allowing VMs to set their own MAC address.
>
> Why would that be necessary? The host with the PF creating the guest
> has access to the "device" and thus the root filesystem of the guest,
> and thus could pull in the root filesystem "key" and instantiate the
> VF's MAC before booting the guest.
Most VM host toolstacks allow you to store a MAC address for each
virtual NIC in the metadata associated with the VM. This MAC address is
either given by the user when they create the virtual NIC, random with
locally administered bit set or random in the VM vendors OID space. This
ensures the VM configuration remains consistent with time.
Why would they not continue to do the same for SR-IOV passthrough NICs?
As a fallback some toolstacks will generate a random address if the NIC
configuration doesn't specify one but if you want a persistent address
for a guest why would you not just configure it that way? Accessing the
guest root filesystem might be a nicer fallback than random generation
when users haven't explicitly configured a MAC but isn't there a chance
of a VM admin controlling the MAC address by manipulating the root
filesystem? What do you do if there is an address clash in this case,
relabelling the root filesystem is a bit of a faff. Also the root
filesystem could be contained within an LVM volume or encrypted or
whatever.
Ian.
--
Ian Campbell
Military intelligence is a contradiction in terms.
-- Groucho Marx
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
^ permalink raw reply
* [patch v2.8 0/4] IPVS full NAT support + netfilter 'ipvs' match support
From: Simon Horman @ 2010-07-22 7:35 UTC (permalink / raw)
To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
This is a repost of a patch-series posted by Hannes Eder last September.
This is v2 of the patch series and I don't see any outstanding objections to
it in the mailing list archives.
Series v2.8 Convert XT_IPVS_IPVS_* from #defines to an enum,
as suggested by Jan Engelhardt <jengelh@medozas.de>
Series v2.7 Fixes header miss-match between kernel and user-space
Series v2.6 fixes the arguments to of %pI4
Series v2.5 fixes some problems introduced in v2.4.
Series v2.4 addresses all of the concerns that Patrick McHardy raised
witht the v2.3 series.
The original cover-email from Hannes follows.
The diffstat output has been updated to reflect changes by me.
Mark Brooks has tested the v2.7 patchset, and found no problems.
Details of his test follow Hannes's cover. I have made minor
edits to Mark's email but not the results.
----------------------------------------------------------------------
From: Hannes Eder <heder@google.com>
The following series implements full NAT support for IPVS. The
approach is via a minimal change to IPVS (make friends with
nf_conntrack) and adding a netfilter matcher, kernel- and user-space
part, i.e. xt_ipvs and libxt_ipvs.
Example usage:
% ipvsadm -A -t 192.168.100.30:80 -s rr
% ipvsadm -a -t 192.168.100.30:80 -r 192.168.10.20:80 -m
# ...
# Source NAT for VIP 192.168.100.30:80
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 80 -j SNAT --to-source 192.168.10.10
or SNAT-ing only a specific real server:
% iptables -t nat -A POSTROUTING --dst 192.168.11.20 \
> -m ipvs --vaddr 192.168.100.30/32 -j SNAT --to-source 192.168.10.10
First of all, thanks for all the feedback. This is the changelog for v2:
- Make ip_vs_ftp work again. Setup nf_conntrack expectations for
related data connections (based on Julian's patch see
http://www.ssi.bg/~ja/nfct/) and let nf_conntrack/nf_nat do the
packet mangling and the TCP sequence adjusting.
This change rises the question how to deal with ip_vs_sync? Does it
work together with conntrackd? Wild idea: what about getting rid of
ip_vs_sync and piggy packing all on nf_conntrack and use conntrackd?
Any comments on this?
- xt_ipvs: add new rule '--vportctl port' to match the VIP port of the
controlling connection, e.g. port 21 for FTP. Can be used to match
a related data connection for FTP:
# SNAT FTP control connection
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 21 -j SNAT --to-source 192.168.10.10
# SNAT FTP passive data connection
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vportctl 21 -j SNAT --to-source 192.168.10.10
- xt_ipvs: use 'par->family' instead of 'skb->protocol'
- xt_ipvs: add ipvs_mt_check and restrict to NFPROTO_IPV4 and NFPROTO_IPV6
- Call nf_conntrack_alter_reply(), so helper lookup is performed based
on the changed tuple.
Changes to the linux kernel
(nf-next-2.6, "bridge: add per bridge device controls for invoking iptables")
Hannes Eder (3):
netfilter: xt_ipvs (netfilter matcher for IPVS)
IPVS: make friends with nf_conntrack
IPVS: make FTP work with full NAT support
include/linux/netfilter/xt_ipvs.h | 27 +++++
include/net/ip_vs.h | 2
net/netfilter/Kconfig | 10 ++
net/netfilter/Makefile | 1
net/netfilter/ipvs/Kconfig | 4
net/netfilter/ipvs/ip_vs_app.c | 43 ---------
net/netfilter/ipvs/ip_vs_core.c | 37 --------
net/netfilter/ipvs/ip_vs_ftp.c | 174 +++++++++++++++++++++++++++++++++++---
net/netfilter/ipvs/ip_vs_proto.c | 1
net/netfilter/ipvs/ip_vs_xmit.c | 29 ++++++
net/netfilter/xt_ipvs.c | 189 +++++++++++++++++++++++++++++++++++++
11 files changed, 422 insertions(+), 95 deletions(-)
create mode 100644 include/linux/netfilter/xt_ipvs.h
create mode 100644 net/netfilter/xt_ipvs.c
Changes to iptables
(iptables.git, "xt_quota: also document negation")
Hannes Eder (1):
libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
configure.ac | 10 1
extensions/libxt_ipvs.c | 365 +++++++++++++++++++++++++++++++++++++
extensions/libxt_ipvs.man | 24 ++
include/linux/netfilter/xt_ipvs.h | 27 +++
4 files changed, 424 insertions(+), 2 deletions(-)
create mode 100644 extensions/libxt_ipvs.c
create mode 100644 extensions/libxt_ipvs.man
create mode 100644 include/linux/netfilter/xt_ipvs.h
----------------------------------------------------------------------
From: Mark Brooks <mark@loadbalancer.org>
I'm going to detail my setup and what I did to test/confirm this (you can
probably skip this bit if you want bit I thought I should include it for
completeness)
The Loadbalancer
IPVS 1.2.1
iptables 1.4.8 --patched
kernel - 2.6.35-rc1 --patched
eth0 ip 192.168.17.93
eth0:45 192.168.18.21 (I would have used eth1 but couldn't find a test box
spare with 2 network cards in)
My test box -
eth0 192.168.18.1
The Webserver -
192.168.17.4:80
Commands to setup ipvs and iptables
IPVS
ipvsadm -A -t 192.168.18.21:80 -s rr
ipvsadm -a -t 192.168.18.21:80 -r 192.168.17.4:80 -m
iptables
/usr/local/sbin/iptables -t nat -A POSTROUTING -m ipvs --vaddr
192.168.18.21/24 --vport 80 -j SNAT --to-source 192.168.17.93
iptables shows -
iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
SNAT all -- anywhere anywhere vaddr
192.168.18.0/24 vport 80 to:192.168.17.93
ipvsadm shows -
ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.168.18.21:80 rr
-> 192.168.17.4:80 Masq 1 0 0
Connected to the IP from my browser page loaded fine and you can see in the
apache log -
"192.168.17.93 - - [21/Jul/2010:08:44:00 -0400] "GET / HTTP/1.1" 200 82"
Finally I ran a couple of tests with httperf for about 6 hours to see if
anything strange happened
httperf --hog --server 192.168.18.21 --num-con 250 --ra $NUMBER --timeout 5
A maximum number of connections of 250 at rates between 1 and 250
connections per second. Every connection completed fine and there appeared
to be no problems.
^ permalink raw reply
* [patch v2.8 2/4] IPVS: make friends with nf_conntrack
From: Simon Horman @ 2010-07-22 7:35 UTC (permalink / raw)
To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>
[-- Attachment #1: IPVS-make-friends-with-nf_conntrack.patch --]
[-- Type: text/plain, Size: 5649 bytes --]
From: Hannes Eder <heder@google.com>
Update the nf_conntrack tuple in reply direction, as we will see
traffic from the real server (RIP) to the client (CIP). Once this is
done we can use netfilters SNAT in POSTROUTING, especially with
xt_ipvs, to do source NAT, e.g.:
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 80 \
> -j SNAT --to-source 192.168.10.10
[ minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/Kconfig | 2 +-
net/netfilter/ipvs/ip_vs_core.c | 36 ------------------------------------
net/netfilter/ipvs/ip_vs_xmit.c | 29 +++++++++++++++++++++++++++++
3 files changed, 30 insertions(+), 37 deletions(-)
v2.4
As per advice from Patrick McHardy
* Use nf_conntrack_untracked() instead of &nf_conntrack_untracked
v2.1, v2.2, v2.3
No change
Index: nf-next-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/Kconfig 2010-07-07 13:24:31.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/Kconfig 2010-07-07 13:38:23.000000000 +0900
@@ -3,7 +3,7 @@
#
menuconfig IP_VS
tristate "IP virtual server support"
- depends on NET && INET && NETFILTER
+ depends on NET && INET && NETFILTER && NF_CONNTRACK
---help---
IP Virtual Server support will let you build a high-performance
virtual server based on cluster of two or more real servers. This
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_core.c 2010-07-07 13:23:37.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c 2010-07-07 13:38:23.000000000 +0900
@@ -536,26 +536,6 @@ int ip_vs_leave(struct ip_vs_service *sv
return NF_DROP;
}
-
-/*
- * It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
- * chain, and is used for VS/NAT.
- * It detects packets for VS/NAT connections and sends the packets
- * immediately. This can avoid that iptable_nat mangles the packets
- * for VS/NAT.
- */
-static unsigned int ip_vs_post_routing(unsigned int hooknum,
- struct sk_buff *skb,
- const struct net_device *in,
- const struct net_device *out,
- int (*okfn)(struct sk_buff *))
-{
- if (!skb->ipvs_property)
- return NF_ACCEPT;
- /* The packet was sent from IPVS, exit this chain */
- return NF_STOP;
-}
-
__sum16 ip_vs_checksum_complete(struct sk_buff *skb, int offset)
{
return csum_fold(skb_checksum(skb, offset, skb->len - offset, 0));
@@ -1499,14 +1479,6 @@ static struct nf_hook_ops ip_vs_ops[] __
.hooknum = NF_INET_FORWARD,
.priority = 99,
},
- /* Before the netfilter connection tracking, exit from POST_ROUTING */
- {
- .hook = ip_vs_post_routing,
- .owner = THIS_MODULE,
- .pf = PF_INET,
- .hooknum = NF_INET_POST_ROUTING,
- .priority = NF_IP_PRI_NAT_SRC-1,
- },
#ifdef CONFIG_IP_VS_IPV6
/* After packet filtering, forward packet through VS/DR, VS/TUN,
* or VS/NAT(change destination), so that filtering rules can be
@@ -1535,14 +1507,6 @@ static struct nf_hook_ops ip_vs_ops[] __
.hooknum = NF_INET_FORWARD,
.priority = 99,
},
- /* Before the netfilter connection tracking, exit from POST_ROUTING */
- {
- .hook = ip_vs_post_routing,
- .owner = THIS_MODULE,
- .pf = PF_INET6,
- .hooknum = NF_INET_POST_ROUTING,
- .priority = NF_IP6_PRI_NAT_SRC-1,
- },
#endif
};
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_xmit.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_xmit.c 2010-07-07 13:23:37.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_xmit.c 2010-07-07 13:42:22.000000000 +0900
@@ -28,6 +28,7 @@
#include <net/ip6_route.h>
#include <linux/icmpv6.h>
#include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
#include <linux/netfilter_ipv4.h>
#include <net/ip_vs.h>
@@ -348,6 +349,30 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb
}
#endif
+static void
+ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp)
+{
+ struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+ struct nf_conntrack_tuple new_tuple;
+
+ if (ct == NULL || nf_ct_is_untracked(ct) || nf_ct_is_confirmed(ct))
+ return;
+
+ /*
+ * The connection is not yet in the hashtable, so we update it.
+ * CIP->VIP will remain the same, so leave the tuple in
+ * IP_CT_DIR_ORIGINAL untouched. When the reply comes back from the
+ * real-server we will see RIP->DIP.
+ */
+ new_tuple = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+ new_tuple.src.u3 = cp->daddr;
+ /*
+ * This will also take care of UDP and other protocols.
+ */
+ new_tuple.src.u.tcp.port = cp->dport;
+ nf_conntrack_alter_reply(ct, &new_tuple);
+}
+
/*
* NAT transmitter (only for outside-to-inside nat forwarding)
* Not used for related ICMP
@@ -403,6 +428,8 @@ ip_vs_nat_xmit(struct sk_buff *skb, stru
IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
+ ip_vs_update_conntrack(skb, cp);
+
/* FIXME: when application helper enlarges the packet and the length
is larger than the MTU of outgoing device, there will be still
MTU problem. */
@@ -479,6 +506,8 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, s
IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
+ ip_vs_update_conntrack(skb, cp);
+
/* FIXME: when application helper enlarges the packet and the length
is larger than the MTU of outgoing device, there will be still
MTU problem. */
^ permalink raw reply
* [patch v2.8 3/4] IPVS: make FTP work with full NAT support
From: Simon Horman @ 2010-07-22 7:35 UTC (permalink / raw)
To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>
[-- Attachment #1: IPVS-make-FTP-work-with-full-NAT-support.patch --]
[-- Type: text/plain, Size: 12646 bytes --]
From: Hannes Eder <heder@google.com>
Use nf_conntrack/nf_nat code to do the packet mangling and the TCP
sequence adjusting. The function 'ip_vs_skb_replace' is now dead
code, so it is removed.
To SNAT FTP, use something like:
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 21 -j SNAT --to-source 192.168.10.10
and for the data connections in passive mode:
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vportctl 21 -j SNAT --to-source 192.168.10.10
using '-m state --state RELATED' would also works.
Make sure the kernel modules ip_vs_ftp, nf_conntrack_ftp, and
nf_nat_ftp are loaded.
[ up-port and minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
include/net/ip_vs.h | 2
net/netfilter/ipvs/Kconfig | 2
net/netfilter/ipvs/ip_vs_app.c | 43 ---------
net/netfilter/ipvs/ip_vs_core.c | 1
net/netfilter/ipvs/ip_vs_ftp.c | 174 ++++++++++++++++++++++++++++++++++++---
5 files changed, 164 insertions(+), 58 deletions(-)
v2.6
* pointer arguments for %pI4
v2.5
* Use nf_ct_is_untracked(ct) instead of nf_ct_is_untracked(),
the latter is blatantly incorrect
* Return 0 (and thus drop the packet) if mangling wasn't attempted
v2.4
As suggested by Patrick McHardy
* Use nf_conntrack_untracked() instead of &nf_conntrack_untracked
* Fix ip_vs_ftp_out logic
- Don't call nf_nat_mangle_tcp_packet() unless ct is valid and tracked
- Only call ip_vs_expect_relatedi() if nf_nat_mangle_tcp_packet()
succeeds
- Note that packets are dropped if mangling fails
Other
* Drop unrelated cosmetic change to sizing of buf in ip_vs_ftp_out()
v2.3
* Up-port
* Drop buf_len = snprintf() change - its a separate, cosmetic, fix
As suggested by Patrick McHardy
* Use %pI4 instead of NIPQUAD
v2.2
* No change
v2.1
* Up-port
Index: nf-next-2.6/include/net/ip_vs.h
===================================================================
--- nf-next-2.6.orig/include/net/ip_vs.h 2010-07-11 17:30:19.000000000 +0900
+++ nf-next-2.6/include/net/ip_vs.h 2010-07-11 17:33:33.000000000 +0900
@@ -736,8 +736,6 @@ extern void ip_vs_app_inc_put(struct ip_
extern int ip_vs_app_pkt_out(struct ip_vs_conn *, struct sk_buff *skb);
extern int ip_vs_app_pkt_in(struct ip_vs_conn *, struct sk_buff *skb);
-extern int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
- char *o_buf, int o_len, char *n_buf, int n_len);
extern int ip_vs_app_init(void);
extern void ip_vs_app_cleanup(void);
Index: nf-next-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/Kconfig 2010-07-11 17:33:06.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/Kconfig 2010-07-11 17:33:33.000000000 +0900
@@ -235,7 +235,7 @@ comment 'IPVS application helper'
config IP_VS_FTP
tristate "FTP protocol helper"
- depends on IP_VS_PROTO_TCP
+ depends on IP_VS_PROTO_TCP && NF_NAT
---help---
FTP is a protocol that transfers IP address and/or port number in
the payload. In the virtual server via Network Address Translation,
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_app.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_app.c 2010-07-11 17:30:19.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_app.c 2010-07-11 17:33:33.000000000 +0900
@@ -569,49 +569,6 @@ static const struct file_operations ip_v
};
#endif
-
-/*
- * Replace a segment of data with a new segment
- */
-int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
- char *o_buf, int o_len, char *n_buf, int n_len)
-{
- int diff;
- int o_offset;
- int o_left;
-
- EnterFunction(9);
-
- diff = n_len - o_len;
- o_offset = o_buf - (char *)skb->data;
- /* The length of left data after o_buf+o_len in the skb data */
- o_left = skb->len - (o_offset + o_len);
-
- if (diff <= 0) {
- memmove(o_buf + n_len, o_buf + o_len, o_left);
- memcpy(o_buf, n_buf, n_len);
- skb_trim(skb, skb->len + diff);
- } else if (diff <= skb_tailroom(skb)) {
- skb_put(skb, diff);
- memmove(o_buf + n_len, o_buf + o_len, o_left);
- memcpy(o_buf, n_buf, n_len);
- } else {
- if (pskb_expand_head(skb, skb_headroom(skb), diff, pri))
- return -ENOMEM;
- skb_put(skb, diff);
- memmove(skb->data + o_offset + n_len,
- skb->data + o_offset + o_len, o_left);
- skb_copy_to_linear_data_offset(skb, o_offset, n_buf, n_len);
- }
-
- /* must update the iph total length here */
- ip_hdr(skb)->tot_len = htons(skb->len);
-
- LeaveFunction(9);
- return 0;
-}
-
-
int __init ip_vs_app_init(void)
{
/* we will replace it with proc_net_ipvs_create() soon */
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_core.c 2010-07-11 17:33:06.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c 2010-07-11 17:33:33.000000000 +0900
@@ -54,7 +54,6 @@
EXPORT_SYMBOL(register_ip_vs_scheduler);
EXPORT_SYMBOL(unregister_ip_vs_scheduler);
-EXPORT_SYMBOL(ip_vs_skb_replace);
EXPORT_SYMBOL(ip_vs_proto_name);
EXPORT_SYMBOL(ip_vs_conn_new);
EXPORT_SYMBOL(ip_vs_conn_in_get);
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_ftp.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_ftp.c 2010-07-11 17:30:19.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_ftp.c 2010-07-11 18:01:58.000000000 +0900
@@ -20,6 +20,17 @@
*
* Author: Wouter Gadeyne
*
+ *
+ * Code for ip_vs_expect_related and ip_vs_expect_callback is taken from
+ * http://www.ssi.bg/~ja/nfct/:
+ *
+ * ip_vs_nfct.c: Netfilter connection tracking support for IPVS
+ *
+ * Portions Copyright (C) 2001-2002
+ * Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland.
+ *
+ * Portions Copyright (C) 2003-2008
+ * Julian Anastasov
*/
#define KMSG_COMPONENT "IPVS"
@@ -32,6 +43,9 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_nat_helper.h>
#include <linux/gfp.h>
#include <net/protocol.h>
#include <net/tcp.h>
@@ -43,6 +57,16 @@
#define SERVER_STRING "227 Entering Passive Mode ("
#define CLIENT_STRING "PORT "
+#define FMT_TUPLE "%pI4:%u->%pI4:%u/%u"
+#define ARG_TUPLE(T) &(T)->src.u3.ip, ntohs((T)->src.u.all), \
+ &(T)->dst.u3.ip, ntohs((T)->dst.u.all), \
+ (T)->dst.protonum
+
+#define FMT_CONN "%pI4:%u->%pI4:%u->%pI4:%u/%u:%u"
+#define ARG_CONN(C) &((C)->caddr.ip), ntohs((C)->cport), \
+ &((C)->vaddr.ip), ntohs((C)->vport), \
+ &((C)->daddr.ip), ntohs((C)->dport), \
+ (C)->protocol, (C)->state
/*
* List of ports (up to IP_VS_APP_MAX_PORTS) to be handled by helper
@@ -123,6 +147,119 @@ static int ip_vs_ftp_get_addrport(char *
return 1;
}
+/*
+ * Called from init_conntrack() as expectfn handler.
+ */
+static void
+ip_vs_expect_callback(struct nf_conn *ct,
+ struct nf_conntrack_expect *exp)
+{
+ struct nf_conntrack_tuple *orig, new_reply;
+ struct ip_vs_conn *cp;
+
+ if (exp->tuple.src.l3num != PF_INET)
+ return;
+
+ /*
+ * We assume that no NF locks are held before this callback.
+ * ip_vs_conn_out_get and ip_vs_conn_in_get should match their
+ * expectations even if they use wildcard values, now we provide the
+ * actual values from the newly created original conntrack direction.
+ * The conntrack is confirmed when packet reaches IPVS hooks.
+ */
+
+ /* RS->CLIENT */
+ orig = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+ cp = ip_vs_conn_out_get(exp->tuple.src.l3num, orig->dst.protonum,
+ &orig->src.u3, orig->src.u.tcp.port,
+ &orig->dst.u3, orig->dst.u.tcp.port);
+ if (cp) {
+ /* Change reply CLIENT->RS to CLIENT->VS */
+ new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+ IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+ FMT_TUPLE ", found inout cp=" FMT_CONN "\n",
+ __func__, ct, ct->status,
+ ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+ ARG_CONN(cp));
+ new_reply.dst.u3 = cp->vaddr;
+ new_reply.dst.u.tcp.port = cp->vport;
+ IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", " FMT_TUPLE
+ ", inout cp=" FMT_CONN "\n",
+ __func__, ct,
+ ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+ ARG_CONN(cp));
+ goto alter;
+ }
+
+ /* CLIENT->VS */
+ cp = ip_vs_conn_in_get(exp->tuple.src.l3num, orig->dst.protonum,
+ &orig->src.u3, orig->src.u.tcp.port,
+ &orig->dst.u3, orig->dst.u.tcp.port);
+ if (cp) {
+ /* Change reply VS->CLIENT to RS->CLIENT */
+ new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+ IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+ FMT_TUPLE ", found outin cp=" FMT_CONN "\n",
+ __func__, ct, ct->status,
+ ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+ ARG_CONN(cp));
+ new_reply.src.u3 = cp->daddr;
+ new_reply.src.u.tcp.port = cp->dport;
+ IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", "
+ FMT_TUPLE ", outin cp=" FMT_CONN "\n",
+ __func__, ct,
+ ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+ ARG_CONN(cp));
+ goto alter;
+ }
+
+ IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuple=" FMT_TUPLE
+ " - unknown expect\n",
+ __func__, ct, ct->status, ARG_TUPLE(orig));
+ return;
+
+alter:
+ /* Never alter conntrack for non-NAT conns */
+ if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_MASQ)
+ nf_conntrack_alter_reply(ct, &new_reply);
+ ip_vs_conn_put(cp);
+ return;
+}
+
+/*
+ * Create NF conntrack expectation with wildcard (optional) source port.
+ * Then the default callback function will alter the reply and will confirm
+ * the conntrack entry when the first packet comes.
+ */
+static void
+ip_vs_expect_related(struct sk_buff *skb, struct nf_conn *ct,
+ struct ip_vs_conn *cp, u_int8_t proto,
+ const __be16 *port, int from_rs)
+{
+ struct nf_conntrack_expect *exp;
+
+ BUG_ON(!ct || ct == &nf_conntrack_untracked);
+
+ exp = nf_ct_expect_alloc(ct);
+ if (!exp)
+ return;
+
+ if (from_rs)
+ nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+ nf_ct_l3num(ct), &cp->daddr, &cp->caddr,
+ proto, port, &cp->cport);
+ else
+ nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+ nf_ct_l3num(ct), &cp->caddr, &cp->vaddr,
+ proto, port, &cp->vport);
+
+ exp->expectfn = ip_vs_expect_callback;
+
+ IP_VS_DBG(7, "%s(): ct=%p, expect tuple=" FMT_TUPLE "\n",
+ __func__, ct, ARG_TUPLE(&exp->tuple));
+ nf_ct_expect_related(exp);
+ nf_ct_expect_put(exp);
+}
/*
* Look at outgoing ftp packets to catch the response to a PASV command
@@ -149,7 +286,9 @@ static int ip_vs_ftp_out(struct ip_vs_ap
struct ip_vs_conn *n_cp;
char buf[24]; /* xxx.xxx.xxx.xxx,ppp,ppp\000 */
unsigned buf_len;
- int ret;
+ int ret = 0;
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct;
#ifdef CONFIG_IP_VS_IPV6
/* This application helper doesn't work with IPv6 yet,
@@ -219,19 +358,26 @@ static int ip_vs_ftp_out(struct ip_vs_ap
buf_len = strlen(buf);
+ ct = nf_ct_get(skb, &ctinfo);
+ if (ct && !nf_ct_is_untracked(ct)) {
+ /* If mangling fails this function will return 0
+ * which will cause the packet to be dropped.
+ * Mangling can only fail under memory pressure,
+ * hopefully it will succeed on the retransmitted
+ * packet.
+ */
+ ret = nf_nat_mangle_tcp_packet(skb, ct, ctinfo,
+ start-data, end-start,
+ buf, buf_len);
+ if (ret)
+ ip_vs_expect_related(skb, ct, n_cp,
+ IPPROTO_TCP, NULL, 0);
+ }
+
/*
- * Calculate required delta-offset to keep TCP happy
+ * Not setting 'diff' is intentional, otherwise the sequence
+ * would be adjusted twice.
*/
- *diff = buf_len - (end-start);
-
- if (*diff == 0) {
- /* simply replace it with new passive address */
- memcpy(start, buf, buf_len);
- ret = 1;
- } else {
- ret = !ip_vs_skb_replace(skb, GFP_ATOMIC, start,
- end-start, buf, buf_len);
- }
cp->app_data = NULL;
ip_vs_tcp_conn_listen(n_cp);
@@ -263,6 +409,7 @@ static int ip_vs_ftp_in(struct ip_vs_app
union nf_inet_addr to;
__be16 port;
struct ip_vs_conn *n_cp;
+ struct nf_conn *ct;
#ifdef CONFIG_IP_VS_IPV6
/* This application helper doesn't work with IPv6 yet,
@@ -349,6 +496,11 @@ static int ip_vs_ftp_in(struct ip_vs_app
ip_vs_control_add(n_cp, cp);
}
+ ct = (struct nf_conn *)skb->nfct;
+ if (ct && ct != &nf_conntrack_untracked)
+ ip_vs_expect_related(skb, ct, n_cp,
+ IPPROTO_TCP, &n_cp->dport, 1);
+
/*
* Move tunnel to listen state
*/
^ permalink raw reply
* [patch v2.8 4/4] [patch v2.2 4/4] [PATCH v2.1 4/4] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Simon Horman @ 2010-07-22 7:35 UTC (permalink / raw)
To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>
[-- Attachment #1: libxt_ipvs-user-space-lib-for-netfilter-matcher-xt_ipvs.patch --]
[-- Type: text/plain, Size: 14174 bytes --]
From: Hannes Eder <heder@google.com>
The user-space library for the netfilter matcher xt_ipvs.
[ trivial up-port by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Acked-by: Simon Horman <horms@verge.net.au>
configure.ac | 10 -
extensions/libxt_ipvs.c | 365 +++++++++++++++++++++++++++++++++++++
extensions/libxt_ipvs.man | 24 ++
include/linux/netfilter/xt_ipvs.h | 27 +++
4 files changed, 424 insertions(+), 2 deletions(-)
create mode 100644 extensions/libxt_ipvs.c
create mode 100644 extensions/libxt_ipvs.man
create mode 100644 include/linux/netfilter/xt_ipvs.h
v2.8
* There is no need to define _XT_IPVS_H as a value
As suggested by Jan Engelhardt
* Use an enum instead of #ifdefs for flags and masks
v2.7
* Update struct xt_ipvs_mtinfo to use __u8 instead of __16 for the l4proto
and fwd_method to reflect the same change to the kernel copy
of struct xt_ipvs_mtinfo.
v2.1
* Trival up-port
Index: iptables/configure.ac
===================================================================
--- iptables.orig/configure.ac 2010-07-22 10:41:05.000000000 +0900
+++ iptables/configure.ac 2010-07-22 10:41:13.000000000 +0900
@@ -52,12 +52,18 @@ AC_ARG_WITH([pkgconfigdir], AS_HELP_STRI
[Path to the pkgconfig directory [[LIBDIR/pkgconfig]]]),
[pkgconfigdir="$withval"], [pkgconfigdir='${libdir}/pkgconfig'])
-AC_CHECK_HEADER([linux/dccp.h])
-
blacklist_modules="";
+
+AC_CHECK_HEADER([linux/dccp.h])
if test "$ac_cv_header_linux_dccp_h" != "yes"; then
blacklist_modules="$blacklist_modules dccp";
fi;
+
+AC_CHECK_HEADER([linux/ip_vs.h])
+if test "$ac_cv_header_linux_ip_vs_h" != "yes"; then
+ blacklist_modules="$blacklist_modules ipvs";
+fi;
+
AC_SUBST([blacklist_modules])
AM_CONDITIONAL([ENABLE_STATIC], [test "$enable_static" = "yes"])
Index: iptables/extensions/libxt_ipvs.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ iptables/extensions/libxt_ipvs.c 2010-07-22 10:41:13.000000000 +0900
@@ -0,0 +1,365 @@
+/*
+ * Shared library add-on to iptables to add IPVS matching.
+ *
+ * Detailed doc is in the kernel module source net/netfilter/xt_ipvs.c
+ *
+ * Author: Hannes Eder <heder@google.com>
+ */
+#include <sys/types.h>
+#include <assert.h>
+#include <ctype.h>
+#include <errno.h>
+#include <getopt.h>
+#include <netdb.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <xtables.h>
+#include <linux/ip_vs.h>
+#include <linux/netfilter/xt_ipvs.h>
+
+static const struct option ipvs_mt_opts[] = {
+ { .name = "ipvs", .has_arg = false, .val = '0' },
+ { .name = "vproto", .has_arg = true, .val = '1' },
+ { .name = "vaddr", .has_arg = true, .val = '2' },
+ { .name = "vport", .has_arg = true, .val = '3' },
+ { .name = "vdir", .has_arg = true, .val = '4' },
+ { .name = "vmethod", .has_arg = true, .val = '5' },
+ { .name = "vportctl", .has_arg = true, .val = '6' },
+ { .name = NULL }
+};
+
+static void ipvs_mt_help(void)
+{
+ printf(
+"IPVS match options:\n"
+"[!] --ipvs packet belongs to an IPVS connection\n"
+"\n"
+"Any of the following options implies --ipvs (even negated)\n"
+"[!] --vproto protocol VIP protocol to match; by number or name,\n"
+" e.g. \"tcp\"\n"
+"[!] --vaddr address[/mask] VIP address to match\n"
+"[!] --vport port VIP port to match; by number or name,\n"
+" e.g. \"http\"\n"
+" --vdir {ORIGINAL|REPLY} flow direction of packet\n"
+"[!] --vmethod {GATE|IPIP|MASQ} IPVS forwarding method used\n"
+"[!] --vportctl port VIP port of the controlling connection to\n"
+" match, e.g. 21 for FTP\n"
+ );
+}
+
+static void ipvs_mt_parse_addr_and_mask(const char *arg,
+ union nf_inet_addr *address,
+ union nf_inet_addr *mask,
+ unsigned int family)
+{
+ struct in_addr *addr = NULL;
+ struct in6_addr *addr6 = NULL;
+ unsigned int naddrs = 0;
+
+ if (family == NFPROTO_IPV4) {
+ xtables_ipparse_any(arg, &addr, &mask->in, &naddrs);
+ if (naddrs > 1)
+ xtables_error(PARAMETER_PROBLEM,
+ "multiple IP addresses not allowed");
+ if (naddrs == 1)
+ memcpy(&address->in, addr, sizeof(*addr));
+ } else if (family == NFPROTO_IPV6) {
+ xtables_ip6parse_any(arg, &addr6, &mask->in6, &naddrs);
+ if (naddrs > 1)
+ xtables_error(PARAMETER_PROBLEM,
+ "multiple IP addresses not allowed");
+ if (naddrs == 1)
+ memcpy(&address->in6, addr6, sizeof(*addr6));
+ } else {
+ /* Hu? */
+ assert(false);
+ }
+}
+
+/* Function which parses command options; returns true if it ate an option */
+static int ipvs_mt_parse(int c, char **argv, int invert, unsigned int *flags,
+ const void *entry, struct xt_entry_match **match,
+ unsigned int family)
+{
+ struct xt_ipvs_mtinfo *data = (void *)(*match)->data;
+ char *p = NULL;
+ u_int8_t op = 0;
+
+ if ('0' <= c && c <= '6') {
+ static const int ops[] = {
+ XT_IPVS_IPVS_PROPERTY,
+ XT_IPVS_PROTO,
+ XT_IPVS_VADDR,
+ XT_IPVS_VPORT,
+ XT_IPVS_DIR,
+ XT_IPVS_METHOD,
+ XT_IPVS_VPORTCTL
+ };
+ op = ops[c - '0'];
+ } else
+ return 0;
+
+ if (*flags & op & XT_IPVS_ONCE_MASK)
+ goto multiple_use;
+
+ switch (c) {
+ case '0': /* --ipvs */
+ /* Nothing to do here. */
+ break;
+
+ case '1': /* --vproto */
+ /* Canonicalize into lower case */
+ for (p = optarg; *p != '\0'; ++p)
+ *p = tolower(*p);
+
+ data->l4proto = xtables_parse_protocol(optarg);
+ break;
+
+ case '2': /* --vaddr */
+ ipvs_mt_parse_addr_and_mask(optarg, &data->vaddr,
+ &data->vmask, family);
+ break;
+
+ case '3': /* --vport */
+ data->vport = htons(xtables_parse_port(optarg, "tcp"));
+ break;
+
+ case '4': /* --vdir */
+ xtables_param_act(XTF_NO_INVERT, "ipvs", "--vdir", invert);
+ if (strcasecmp(optarg, "ORIGINAL") == 0) {
+ data->bitmask |= XT_IPVS_DIR;
+ data->invert &= ~XT_IPVS_DIR;
+ } else if (strcasecmp(optarg, "REPLY") == 0) {
+ data->bitmask |= XT_IPVS_DIR;
+ data->invert |= XT_IPVS_DIR;
+ } else {
+ xtables_param_act(XTF_BAD_VALUE,
+ "ipvs", "--vdir", optarg);
+ }
+ break;
+
+ case '5': /* --vmethod */
+ if (strcasecmp(optarg, "GATE") == 0)
+ data->fwd_method = IP_VS_CONN_F_DROUTE;
+ else if (strcasecmp(optarg, "IPIP") == 0)
+ data->fwd_method = IP_VS_CONN_F_TUNNEL;
+ else if (strcasecmp(optarg, "MASQ") == 0)
+ data->fwd_method = IP_VS_CONN_F_MASQ;
+ else
+ xtables_param_act(XTF_BAD_VALUE,
+ "ipvs", "--vmethod", optarg);
+ break;
+
+ case '6': /* --vportctl */
+ data->vportctl = htons(xtables_parse_port(optarg, "tcp"));
+ break;
+
+ default:
+ /* Hu? How did we come here? */
+ assert(false);
+ return 0;
+ }
+
+ if (op & XT_IPVS_ONCE_MASK) {
+ if (data->invert & XT_IPVS_IPVS_PROPERTY)
+ xtables_error(PARAMETER_PROBLEM,
+ "! --ipvs cannot be together with"
+ " other options");
+ data->bitmask |= XT_IPVS_IPVS_PROPERTY;
+ }
+
+ data->bitmask |= op;
+ if (invert)
+ data->invert |= op;
+ *flags |= op;
+ return 1;
+
+multiple_use:
+ xtables_error(PARAMETER_PROBLEM,
+ "multiple use of the same IPVS option is not allowed");
+}
+
+static int ipvs_mt4_parse(int c, char **argv, int invert, unsigned int *flags,
+ const void *entry, struct xt_entry_match **match)
+{
+ return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+ NFPROTO_IPV4);
+}
+
+static int ipvs_mt6_parse(int c, char **argv, int invert, unsigned int *flags,
+ const void *entry, struct xt_entry_match **match)
+{
+ return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+ NFPROTO_IPV6);
+}
+
+static void ipvs_mt_check(unsigned int flags)
+{
+ if (flags == 0)
+ xtables_error(PARAMETER_PROBLEM,
+ "IPVS: At least one option is required");
+}
+
+/* Shamelessly copied from libxt_conntrack.c */
+static void ipvs_mt_dump_addr(const union nf_inet_addr *addr,
+ const union nf_inet_addr *mask,
+ unsigned int family, bool numeric)
+{
+ char buf[BUFSIZ];
+
+ if (family == NFPROTO_IPV4) {
+ if (!numeric && addr->ip == 0) {
+ printf("anywhere ");
+ return;
+ }
+ if (numeric)
+ strcpy(buf, xtables_ipaddr_to_numeric(&addr->in));
+ else
+ strcpy(buf, xtables_ipaddr_to_anyname(&addr->in));
+ strcat(buf, xtables_ipmask_to_numeric(&mask->in));
+ printf("%s ", buf);
+ } else if (family == NFPROTO_IPV6) {
+ if (!numeric && addr->ip6[0] == 0 && addr->ip6[1] == 0 &&
+ addr->ip6[2] == 0 && addr->ip6[3] == 0) {
+ printf("anywhere ");
+ return;
+ }
+ if (numeric)
+ strcpy(buf, xtables_ip6addr_to_numeric(&addr->in6));
+ else
+ strcpy(buf, xtables_ip6addr_to_anyname(&addr->in6));
+ strcat(buf, xtables_ip6mask_to_numeric(&mask->in6));
+ printf("%s ", buf);
+ }
+}
+
+static void ipvs_mt_dump(const void *ip, const struct xt_ipvs_mtinfo *data,
+ unsigned int family, bool numeric, const char *prefix)
+{
+ if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+ if (data->invert & XT_IPVS_IPVS_PROPERTY)
+ printf("! ");
+ printf("%sipvs ", prefix);
+ }
+
+ if (data->bitmask & XT_IPVS_PROTO) {
+ if (data->invert & XT_IPVS_PROTO)
+ printf("! ");
+ printf("%sproto %u ", prefix, data->l4proto);
+ }
+
+ if (data->bitmask & XT_IPVS_VADDR) {
+ if (data->invert & XT_IPVS_VADDR)
+ printf("! ");
+
+ printf("%svaddr ", prefix);
+ ipvs_mt_dump_addr(&data->vaddr, &data->vmask, family, numeric);
+ }
+
+ if (data->bitmask & XT_IPVS_VPORT) {
+ if (data->invert & XT_IPVS_VPORT)
+ printf("! ");
+
+ printf("%svport %u ", prefix, ntohs(data->vport));
+ }
+
+ if (data->bitmask & XT_IPVS_DIR) {
+ if (data->invert & XT_IPVS_DIR)
+ printf("%svdir REPLY ", prefix);
+ else
+ printf("%svdir ORIGINAL ", prefix);
+ }
+
+ if (data->bitmask & XT_IPVS_METHOD) {
+ if (data->invert & XT_IPVS_METHOD)
+ printf("! ");
+
+ printf("%svmethod ", prefix);
+ switch (data->fwd_method) {
+ case IP_VS_CONN_F_DROUTE:
+ printf("GATE ");
+ break;
+ case IP_VS_CONN_F_TUNNEL:
+ printf("IPIP ");
+ break;
+ case IP_VS_CONN_F_MASQ:
+ printf("MASQ ");
+ break;
+ default:
+ /* Hu? */
+ printf("UNKNOWN ");
+ break;
+ }
+ }
+
+ if (data->bitmask & XT_IPVS_VPORTCTL) {
+ if (data->invert & XT_IPVS_VPORTCTL)
+ printf("! ");
+
+ printf("%svportctl %u ", prefix, ntohs(data->vportctl));
+ }
+}
+
+static void ipvs_mt4_print(const void *ip, const struct xt_entry_match *match,
+ int numeric)
+{
+ const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+ ipvs_mt_dump(ip, data, NFPROTO_IPV4, numeric, "");
+}
+
+static void ipvs_mt6_print(const void *ip, const struct xt_entry_match *match,
+ int numeric)
+{
+ const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+ ipvs_mt_dump(ip, data, NFPROTO_IPV6, numeric, "");
+}
+
+static void ipvs_mt4_save(const void *ip, const struct xt_entry_match *match)
+{
+ const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+ ipvs_mt_dump(ip, data, NFPROTO_IPV4, true, "--");
+}
+
+static void ipvs_mt6_save(const void *ip, const struct xt_entry_match *match)
+{
+ const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+ ipvs_mt_dump(ip, data, NFPROTO_IPV6, true, "--");
+}
+
+static struct xtables_match ipvs_matches_reg[] = {
+ {
+ .version = XTABLES_VERSION,
+ .name = "ipvs",
+ .revision = 0,
+ .family = NFPROTO_IPV4,
+ .size = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+ .userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+ .help = ipvs_mt_help,
+ .parse = ipvs_mt4_parse,
+ .final_check = ipvs_mt_check,
+ .print = ipvs_mt4_print,
+ .save = ipvs_mt4_save,
+ .extra_opts = ipvs_mt_opts,
+ },
+ {
+ .version = XTABLES_VERSION,
+ .name = "ipvs",
+ .revision = 0,
+ .family = NFPROTO_IPV6,
+ .size = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+ .userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+ .help = ipvs_mt_help,
+ .parse = ipvs_mt6_parse,
+ .final_check = ipvs_mt_check,
+ .print = ipvs_mt6_print,
+ .save = ipvs_mt6_save,
+ .extra_opts = ipvs_mt_opts,
+ },
+};
+
+void _init(void)
+{
+ xtables_register_matches(ipvs_matches_reg,
+ ARRAY_SIZE(ipvs_matches_reg));
+}
Index: iptables/extensions/libxt_ipvs.man
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ iptables/extensions/libxt_ipvs.man 2010-07-22 10:41:13.000000000 +0900
@@ -0,0 +1,24 @@
+Match IPVS connection properties.
+.TP
+[\fB!\fR] \fB\-\-ipvs\fP
+packet belongs to an IPVS connection
+.TP
+Any of the following options implies \-\-ipvs (even negated)
+.TP
+[\fB!\fR] \fB\-\-vproto\fP \fIprotocol\fP
+VIP protocol to match; by number or name, e.g. "tcp"
+.TP
+[\fB!\fR] \fB\-\-vaddr\fP \fIaddress\fP[\fB/\fP\fImask\fP]
+VIP address to match
+.TP
+[\fB!\fR] \fB\-\-vport\fP \fIport\fP
+VIP port to match; by number or name, e.g. "http"
+.TP
+\fB\-\-vdir\fP {\fBORIGINAL\fP|\fBREPLY\fP}
+flow direction of packet
+.TP
+[\fB!\fR] \fB\-\-vmethod\fP {\fBGATE\fP|\fBIPIP\fP|\fBMASQ\fP}
+IPVS forwarding method used
+.TP
+[\fB!\fR] \fB\-\-vportctl\fP \fIport\fP
+VIP port of the controlling connection to match, e.g. 21 for FTP
Index: iptables/include/linux/netfilter/xt_ipvs.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ iptables/include/linux/netfilter/xt_ipvs.h 2010-07-22 10:39:44.000000000 +0900
@@ -0,0 +1,27 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H
+
+enum {
+ XT_IPVS_IPVS_PROPERTY = 1 << 0, /* all other options imply this one */
+ XT_IPVS_PROTO = 1 << 1,
+ XT_IPVS_VADDR = 1 << 2,
+ XT_IPVS_VPORT = 1 << 3,
+ XT_IPVS_DIR = 1 << 4,
+ XT_IPVS_METHOD = 1 << 5,
+ XT_IPVS_VPORTCTL = 1 << 6,
+ XT_IPVS_MASK = (1 << 7) - 1,
+ XT_IPVS_ONCE_MASK = XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY
+};
+
+struct xt_ipvs_mtinfo {
+ union nf_inet_addr vaddr, vmask;
+ __be16 vport;
+ __u8 l4proto;
+ __u8 fwd_method;
+ __be16 vportctl;
+
+ __u8 invert;
+ __u8 bitmask;
+};
+
+#endif /* _XT_IPVS_H */
^ permalink raw reply
* Re: macvtap: Limit packet queue length
From: Herbert Xu @ 2010-07-22 7:44 UTC (permalink / raw)
To: David S. Miller, netdev, Arnd Bergmann; +Cc: Mark Wagner, Chris Wright
In-Reply-To: <20100722064157.GA25913@gondor.apana.org.au>
On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> Hi:
>
> macvtap: Limit packet queue length
Chris has informed me that he's already tried a similar patch
and it only makes the problem worse :)
The issue is that the macvtap TX queue length defaults to zero.
So here is an updated patch which addresses this:
macvtap: Limit packet queue length
Mark Wagner reported OOM symptoms when sending UDP traffic over
a macvtap link to a kvm receiver.
This appears to be caused by the fact that macvtap packet queues
are unlimited in length. This means that if the receiver can't
keep up with the rate of flow, then we will hit OOM. Of course
it gets worse if the OOM killer then decides to kill the receiver.
This patch imposes a cap on the packet queue length, in the same
way as the tuntap driver, using the device TX queue length.
Please note that macvtap currently has no way of giving congestion
notification, that means the software device TX queue cannot be
used and packets will always be dropped once the macvtap driver
queue fills up.
This shouldn't be a great problem for the scenario where macvtap
is used to feed a kvm receiver, as the traffic is most likely
external in origin so congestion notification can't be applied
anyway.
Of course, if anybody decides to complain about guest-to-guest
UDP packet loss down the track, then we may have to revisit this.
Incidentally, this patch also fixes a real memory leak when
macvtap_get_queue fails.
Chris Wright noticed that for this patch to work, we need a
non-zero TX queue length. This patch includes his work to change
the default macvtap TX queue length to 500.
Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 87e8d4c..f15fe2c 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -499,7 +499,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
.ndo_validate_addr = eth_validate_addr,
};
-static void macvlan_setup(struct net_device *dev)
+void macvlan_common_setup(struct net_device *dev)
{
ether_setup(dev);
@@ -508,6 +508,12 @@ static void macvlan_setup(struct net_device *dev)
dev->destructor = free_netdev;
dev->header_ops = &macvlan_hard_header_ops,
dev->ethtool_ops = &macvlan_ethtool_ops;
+}
+EXPORT_SYMBOL_GPL(macvlan_common_setup);
+
+static void macvlan_setup(struct net_device *dev)
+{
+ macvlan_common_setup(dev);
dev->tx_queue_len = 0;
}
@@ -705,7 +711,6 @@ int macvlan_link_register(struct rtnl_link_ops *ops)
/* common fields */
ops->priv_size = sizeof(struct macvlan_dev);
ops->get_tx_queues = macvlan_get_tx_queues;
- ops->setup = macvlan_setup;
ops->validate = macvlan_validate;
ops->maxtype = IFLA_MACVLAN_MAX;
ops->policy = macvlan_policy;
@@ -719,6 +724,7 @@ EXPORT_SYMBOL_GPL(macvlan_link_register);
static struct rtnl_link_ops macvlan_link_ops = {
.kind = "macvlan",
+ .setup = macvlan_setup,
.newlink = macvlan_newlink,
.dellink = macvlan_dellink,
};
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index a8a94e2..ff02b83 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -180,11 +180,18 @@ static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
{
struct macvtap_queue *q = macvtap_get_queue(dev, skb);
if (!q)
- return -ENOLINK;
+ goto drop;
+
+ if (skb_queue_len(&q->sk.sk_receive_queue) >= dev->tx_queue_len)
+ goto drop;
skb_queue_tail(&q->sk.sk_receive_queue, skb);
wake_up_interruptible_poll(sk_sleep(&q->sk), POLLIN | POLLRDNORM | POLLRDBAND);
- return 0;
+ return NET_RX_SUCCESS;
+
+drop:
+ kfree_skb(skb);
+ return NET_RX_DROP;
}
/*
@@ -235,8 +242,15 @@ static void macvtap_dellink(struct net_device *dev,
macvlan_dellink(dev, head);
}
+static void macvtap_setup(struct net_device *dev)
+{
+ macvlan_common_setup(dev);
+ dev->tx_queue_len = TUN_READQ_SIZE;
+}
+
static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
.kind = "macvtap",
+ .setup = macvtap_setup,
.newlink = macvtap_newlink,
.dellink = macvtap_dellink,
};
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9ea047a..1ffaeff 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -67,6 +67,8 @@ static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
}
}
+extern void macvlan_common_setup(struct net_device *dev);
+
extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[],
int (*receive)(struct sk_buff *skb),
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related
* Re: macvtap: Limit packet queue length
From: Chris Wright @ 2010-07-22 7:47 UTC (permalink / raw)
To: Herbert Xu
Cc: David S. Miller, netdev, Arnd Bergmann, Mark Wagner, Chris Wright
In-Reply-To: <20100722074431.GA26744@gondor.apana.org.au>
* Herbert Xu (herbert@gondor.hengli.com.au) wrote:
> On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> > Hi:
> >
> > macvtap: Limit packet queue length
>
> Chris has informed me that he's already tried a similar patch
> and it only makes the problem worse :)
>
> The issue is that the macvtap TX queue length defaults to zero.
>
> So here is an updated patch which addresses this:
>
> macvtap: Limit packet queue length
>
> Mark Wagner reported OOM symptoms when sending UDP traffic over
> a macvtap link to a kvm receiver.
>
> This appears to be caused by the fact that macvtap packet queues
> are unlimited in length. This means that if the receiver can't
> keep up with the rate of flow, then we will hit OOM. Of course
> it gets worse if the OOM killer then decides to kill the receiver.
>
> This patch imposes a cap on the packet queue length, in the same
> way as the tuntap driver, using the device TX queue length.
>
> Please note that macvtap currently has no way of giving congestion
> notification, that means the software device TX queue cannot be
> used and packets will always be dropped once the macvtap driver
> queue fills up.
>
> This shouldn't be a great problem for the scenario where macvtap
> is used to feed a kvm receiver, as the traffic is most likely
> external in origin so congestion notification can't be applied
> anyway.
>
> Of course, if anybody decides to complain about guest-to-guest
> UDP packet loss down the track, then we may have to revisit this.
>
> Incidentally, this patch also fixes a real memory leak when
> macvtap_get_queue fails.
>
> Chris Wright noticed that for this patch to work, we need a
> non-zero TX queue length. This patch includes his work to change
> the default macvtap TX queue length to 500.
>
> Reported-by: Mark Wagner <mwagner@redhat.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Thanks Herbert.
-chris
^ permalink raw reply
* [patch v2.8 1/4] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Simon Horman @ 2010-07-22 7:35 UTC (permalink / raw)
To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
Cc: Malcolm Turnbull, Mark Brooks, Wensong Zhang, Julius Volz,
Patrick McHardy, David S. Miller, Hannes Eder, Jan Engelhardt
In-Reply-To: <20100722073547.504156161@vergenet.net>
[-- Attachment #1: netfilter-xt_ipvs-netfilter-matcher-for-IPVS.patch --]
[-- Type: text/plain, Size: 8937 bytes --]
From: Hannes Eder <heder@google.com>
This implements the kernel-space side of the netfilter matcher xt_ipvs.
[ minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
include/linux/netfilter/xt_ipvs.h | 27 ++++
net/netfilter/Kconfig | 10 +
net/netfilter/Makefile | 1
net/netfilter/ipvs/ip_vs_proto.c | 1
net/netfilter/xt_ipvs.c | 189 +++++++++++++++++++++++++++++++++++++
5 files changed, 228 insertions(+), 0 deletions(-)
create mode 100644 include/linux/netfilter/xt_ipvs.h
create mode 100644 net/netfilter/xt_ipvs.c
v2.8
* Trivial rediff
As suggested by Jan Engelhardt
* Use an enum instead of #ifdefs for flags and masks
v2.5
* Use nf_ct_is_untracked(ct) instead of ct == nf_ct_is_untracked(ct),
the later is blatantly incorrect.
v2.4
As per advice from Patrick McHardy
* Reduce size of l4proto and fwd_method members of struct xt_ipvs_mtinf
from __u16 to __u8
* Use nf_conntrack_untracked() instead of &nf_conntrack_untracked
v2.3
As per advice from Patrick McHardy
* Don't define a value for _XT_IPVS_H in xt_ipvs.h
* Depend on NF_CONNTRACK
* Update to new API
- ipvs_mt_check() should return an int rather than a bool
- Change type of ipvs_mt()'s par parameter from
struct xt_action_param to struct xt_match_param
- Make ipvs_mt()'s par parameter non-const
v2.1, v2.2
No Change
Index: nf-next-2.6/include/linux/netfilter/xt_ipvs.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ nf-next-2.6/include/linux/netfilter/xt_ipvs.h 2010-07-22 10:39:44.000000000 +0900
@@ -0,0 +1,27 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H
+
+enum {
+ XT_IPVS_IPVS_PROPERTY = 1 << 0, /* all other options imply this one */
+ XT_IPVS_PROTO = 1 << 1,
+ XT_IPVS_VADDR = 1 << 2,
+ XT_IPVS_VPORT = 1 << 3,
+ XT_IPVS_DIR = 1 << 4,
+ XT_IPVS_METHOD = 1 << 5,
+ XT_IPVS_VPORTCTL = 1 << 6,
+ XT_IPVS_MASK = (1 << 7) - 1,
+ XT_IPVS_ONCE_MASK = XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY
+};
+
+struct xt_ipvs_mtinfo {
+ union nf_inet_addr vaddr, vmask;
+ __be16 vport;
+ __u8 l4proto;
+ __u8 fwd_method;
+ __be16 vportctl;
+
+ __u8 invert;
+ __u8 bitmask;
+};
+
+#endif /* _XT_IPVS_H */
Index: nf-next-2.6/net/netfilter/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/Kconfig 2010-07-22 10:13:21.000000000 +0900
+++ nf-next-2.6/net/netfilter/Kconfig 2010-07-22 10:13:42.000000000 +0900
@@ -742,6 +742,16 @@ config NETFILTER_XT_MATCH_IPRANGE
If unsure, say M.
+config NETFILTER_XT_MATCH_IPVS
+ tristate '"ipvs" match support'
+ depends on IP_VS
+ depends on NETFILTER_ADVANCED
+ depends on NF_CONNTRACK
+ help
+ This option allows you to match against IPVS properties of a packet.
+
+ If unsure, say N.
+
config NETFILTER_XT_MATCH_LENGTH
tristate '"length" match support'
depends on NETFILTER_ADVANCED
Index: nf-next-2.6/net/netfilter/Makefile
===================================================================
--- nf-next-2.6.orig/net/netfilter/Makefile 2010-07-22 10:13:21.000000000 +0900
+++ nf-next-2.6/net/netfilter/Makefile 2010-07-22 10:13:42.000000000 +0900
@@ -77,6 +77,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_HASHLIMI
obj-$(CONFIG_NETFILTER_XT_MATCH_HELPER) += xt_helper.o
obj-$(CONFIG_NETFILTER_XT_MATCH_HL) += xt_hl.o
obj-$(CONFIG_NETFILTER_XT_MATCH_IPRANGE) += xt_iprange.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_IPVS) += xt_ipvs.o
obj-$(CONFIG_NETFILTER_XT_MATCH_LENGTH) += xt_length.o
obj-$(CONFIG_NETFILTER_XT_MATCH_LIMIT) += xt_limit.o
obj-$(CONFIG_NETFILTER_XT_MATCH_MAC) += xt_mac.o
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_proto.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_proto.c 2010-07-22 10:13:21.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_proto.c 2010-07-22 10:13:42.000000000 +0900
@@ -98,6 +98,7 @@ struct ip_vs_protocol * ip_vs_proto_get(
return NULL;
}
+EXPORT_SYMBOL(ip_vs_proto_get);
/*
Index: nf-next-2.6/net/netfilter/xt_ipvs.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ nf-next-2.6/net/netfilter/xt_ipvs.c 2010-07-22 10:13:42.000000000 +0900
@@ -0,0 +1,189 @@
+/*
+ * xt_ipvs - kernel module to match IPVS connection properties
+ *
+ * Author: Hannes Eder <heder@google.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#ifdef CONFIG_IP_VS_IPV6
+#include <net/ipv6.h>
+#endif
+#include <linux/ip_vs.h>
+#include <linux/types.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_ipvs.h>
+#include <net/netfilter/nf_conntrack.h>
+
+#include <net/ip_vs.h>
+
+MODULE_AUTHOR("Hannes Eder <heder@google.com>");
+MODULE_DESCRIPTION("Xtables: match IPVS connection properties");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_ipvs");
+MODULE_ALIAS("ip6t_ipvs");
+
+/* borrowed from xt_conntrack */
+static bool ipvs_mt_addrcmp(const union nf_inet_addr *kaddr,
+ const union nf_inet_addr *uaddr,
+ const union nf_inet_addr *umask,
+ unsigned int l3proto)
+{
+ if (l3proto == NFPROTO_IPV4)
+ return ((kaddr->ip ^ uaddr->ip) & umask->ip) == 0;
+#ifdef CONFIG_IP_VS_IPV6
+ else if (l3proto == NFPROTO_IPV6)
+ return ipv6_masked_addr_cmp(&kaddr->in6, &umask->in6,
+ &uaddr->in6) == 0;
+#endif
+ else
+ return false;
+}
+
+static bool
+ipvs_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+ const struct xt_ipvs_mtinfo *data = par->matchinfo;
+ /* ipvs_mt_check ensures that family is only NFPROTO_IPV[46]. */
+ const u_int8_t family = par->family;
+ struct ip_vs_iphdr iph;
+ struct ip_vs_protocol *pp;
+ struct ip_vs_conn *cp;
+ bool match = true;
+
+ if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+ match = skb->ipvs_property ^
+ !!(data->invert & XT_IPVS_IPVS_PROPERTY);
+ goto out;
+ }
+
+ /* other flags than XT_IPVS_IPVS_PROPERTY are set */
+ if (!skb->ipvs_property) {
+ match = false;
+ goto out;
+ }
+
+ ip_vs_fill_iphdr(family, skb_network_header(skb), &iph);
+
+ if (data->bitmask & XT_IPVS_PROTO)
+ if ((iph.protocol == data->l4proto) ^
+ !(data->invert & XT_IPVS_PROTO)) {
+ match = false;
+ goto out;
+ }
+
+ pp = ip_vs_proto_get(iph.protocol);
+ if (unlikely(!pp)) {
+ match = false;
+ goto out;
+ }
+
+ /*
+ * Check if the packet belongs to an existing entry
+ */
+ cp = pp->conn_out_get(family, skb, pp, &iph, iph.len, 1 /* inverse */);
+ if (unlikely(cp == NULL)) {
+ match = false;
+ goto out;
+ }
+
+ /*
+ * We found a connection, i.e. ct != 0, make sure to call
+ * __ip_vs_conn_put before returning. In our case jump to out_put_con.
+ */
+
+ if (data->bitmask & XT_IPVS_VPORT)
+ if ((cp->vport == data->vport) ^
+ !(data->invert & XT_IPVS_VPORT)) {
+ match = false;
+ goto out_put_cp;
+ }
+
+ if (data->bitmask & XT_IPVS_VPORTCTL)
+ if ((cp->control != NULL &&
+ cp->control->vport == data->vportctl) ^
+ !(data->invert & XT_IPVS_VPORTCTL)) {
+ match = false;
+ goto out_put_cp;
+ }
+
+ if (data->bitmask & XT_IPVS_DIR) {
+ enum ip_conntrack_info ctinfo;
+ struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+
+ if (ct == NULL || nf_ct_is_untracked(ct)) {
+ match = false;
+ goto out_put_cp;
+ }
+
+ if ((ctinfo >= IP_CT_IS_REPLY) ^
+ !!(data->invert & XT_IPVS_DIR)) {
+ match = false;
+ goto out_put_cp;
+ }
+ }
+
+ if (data->bitmask & XT_IPVS_METHOD)
+ if (((cp->flags & IP_VS_CONN_F_FWD_MASK) == data->fwd_method) ^
+ !(data->invert & XT_IPVS_METHOD)) {
+ match = false;
+ goto out_put_cp;
+ }
+
+ if (data->bitmask & XT_IPVS_VADDR) {
+ if (ipvs_mt_addrcmp(&cp->vaddr, &data->vaddr,
+ &data->vmask, family) ^
+ !(data->invert & XT_IPVS_VADDR)) {
+ match = false;
+ goto out_put_cp;
+ }
+ }
+
+out_put_cp:
+ __ip_vs_conn_put(cp);
+out:
+ pr_debug("match=%d\n", match);
+ return match;
+}
+
+static int ipvs_mt_check(const struct xt_mtchk_param *par)
+{
+ if (par->family != NFPROTO_IPV4
+#ifdef CONFIG_IP_VS_IPV6
+ && par->family != NFPROTO_IPV6
+#endif
+ ) {
+ pr_info("protocol family %u not supported\n", par->family);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct xt_match xt_ipvs_mt_reg __read_mostly = {
+ .name = "ipvs",
+ .revision = 0,
+ .family = NFPROTO_UNSPEC,
+ .match = ipvs_mt,
+ .checkentry = ipvs_mt_check,
+ .matchsize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+ .me = THIS_MODULE,
+};
+
+static int __init ipvs_mt_init(void)
+{
+ return xt_register_match(&xt_ipvs_mt_reg);
+}
+
+static void __exit ipvs_mt_exit(void)
+{
+ xt_unregister_match(&xt_ipvs_mt_reg);
+}
+
+module_init(ipvs_mt_init);
+module_exit(ipvs_mt_exit);
^ permalink raw reply
* Re: [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
From: Koki Sanagi @ 2010-07-22 8:39 UTC (permalink / raw)
To: Neil Horman
Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
fweisbec, mathieu.desnoyers
In-Reply-To: <20100721105645.GA21259@hmsreliant.think-freely.org>
(2010/07/21 19:56), Neil Horman wrote:
> On Wed, Jul 21, 2010 at 04:02:57PM +0900, Koki Sanagi wrote:
>> (2010/07/20 20:50), Neil Horman wrote:
>>> On Tue, Jul 20, 2010 at 09:49:10AM +0900, Koki Sanagi wrote:
>>>> [RFC PATCH v3 4/5] skb: add tracepoints to freeing skb
>>>> This patch adds tracepoint to consume_skb, dev_kfree_skb_irq and
>>>> skb_free_datagram_locked. Combinating with tracepoint on dev_hard_start_xmit,
>>>> we can check how long it takes to free transmited packets. And using it, we can
>>>> calculate how many packets driver had at that time. It is useful when a drop of
>>>> transmited packet is a problem.
>>>>
>>>> <idle>-0 [001] 241409.218333: consume_skb: skbaddr=dd6b2fb8
>>>> <idle>-0 [001] 241409.490555: dev_kfree_skb_irq: skbaddr=f5e29840
>>>>
>>>> udp-recv-302 [001] 515031.206008: skb_free_datagram_locked: skbaddr=f5b1d900
>>>>
>>>>
>>>> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>>>> ---
>>>> include/trace/events/skb.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>>>> net/core/datagram.c | 1 +
>>>> net/core/dev.c | 2 ++
>>>> net/core/skbuff.c | 1 +
>>>> 4 files changed, 46 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
>>>> index 4b2be6d..84c9041 100644
>>>> --- a/include/trace/events/skb.h
>>>> +++ b/include/trace/events/skb.h
>>>> @@ -35,6 +35,48 @@ TRACE_EVENT(kfree_skb,
>>>> __entry->skbaddr, __entry->protocol, __entry->location)
>>>> );
>>>>
>>>> +DECLARE_EVENT_CLASS(free_skb,
>>>> +
>>>> + TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> + TP_ARGS(skb),
>>>> +
>>>> + TP_STRUCT__entry(
>>>> + __field( void *, skbaddr )
>>>> + ),
>>>> +
>>>> + TP_fast_assign(
>>>> + __entry->skbaddr = skb;
>>>> + ),
>>>> +
>>>> + TP_printk("skbaddr=%p", __entry->skbaddr)
>>>> +
>>>> +);
>>>> +
>>>> +DEFINE_EVENT(free_skb, consume_skb,
>>>> +
>>>> + TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> + TP_ARGS(skb)
>>>> +
>>>> +);
>>>> +
>>>> +DEFINE_EVENT(free_skb, dev_kfree_skb_irq,
>>>> +
>>>> + TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> + TP_ARGS(skb)
>>>> +
>>>> +);
>>>> +
>>>> +DEFINE_EVENT(free_skb, skb_free_datagram_locked,
>>>> +
>>>> + TP_PROTO(struct sk_buff *skb),
>>>> +
>>>> + TP_ARGS(skb)
>>>> +
>>>> +);
>>>> +
>>>
>>> Why create these last two tracepoints at all? dev_kfree_skb_irq will eventually
>>> pass through kfree_skb anyway, getting picked up by the tracepoint there, the
>>> while the latter won't (since it uses __kfree_skb instead), I think that could
>>> be fixed up by add a call to trace_kfree_skb there directly, saving you two
>>> tracepoints.
>>>
>>> Neil
>>>
>> I think dev_kfree_skb_irq isn't chased by trace_kfree_skb or trace_consume_skb
>> completely. Because net_tx_action frees skb by __kfree_skb. So it is better to
>> add trace_kfree_skb before it. skb_free_datagram_locked is same.
>>
> It isn't, you're right, but that was the point I made above. Those missed areas
> could be easily handled by adding calls to trace_kfree_skb which already exists,
> to the missed areas. Then you don't need to create those new tracepoints. The
> way your doing this, if someone wants to trace all skb frees in debugfs, they
> would have to enable three tracepoints, not just one. Not that thats the point
> of your patch, but its something to consider, and it simplifies your code.
> Neil
>
O.K. I've re-made a patch to use trace_kfree_skb instead of
trace_dev_kfree_skb_irq and trace_skb_free_datagram_locked.
But I've got a problem.
I should use not __builtin_return_address, but macro or function which returns
current address. But I don't know any macro like that. Do you know any solution ?
Koki Sanagi.
---
include/trace/events/skb.h | 17 +++++++++++++++++
net/core/datagram.c | 1 +
net/core/dev.c | 2 ++
net/core/skbuff.c | 1 +
4 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..75ce9d5 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -35,6 +35,23 @@ TRACE_EVENT(kfree_skb,
__entry->skbaddr, __entry->protocol, __entry->location)
);
+TRACE_EVENT(consume_skb,
+
+ TP_PROTO(struct sk_buff *skb),
+
+ TP_ARGS(skb),
+
+ TP_STRUCT__entry(
+ __field( void *, skbaddr )
+ ),
+
+ TP_fast_assign(
+ __entry->skbaddr = skb;
+ ),
+
+ TP_printk("skbaddr=%p", __entry->skbaddr)
+);
+
TRACE_EVENT(skb_copy_datagram_iovec,
TP_PROTO(const struct sk_buff *skb, int len),
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 251997a..96dab4f 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -243,6 +243,7 @@ void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb)
unlock_sock_fast(sk, slow);
/* skb is now orphaned, can be freed outside of locked section */
+ trace_kfree_skb(skb, __builtin_return_address(0));
__kfree_skb(skb);
}
EXPORT_SYMBOL(skb_free_datagram_locked);
diff --git a/net/core/dev.c b/net/core/dev.c
index e6a911f..faded6f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -131,6 +131,7 @@
#include <linux/random.h>
#include <trace/events/napi.h>
#include <trace/events/net.h>
+#include <trace/events/skb.h>
#include <linux/pci.h>
#include "net-sysfs.h"
@@ -2577,6 +2578,7 @@ static void net_tx_action(struct softirq_action *h)
clist = clist->next;
WARN_ON(atomic_read(&skb->users));
+ trace_kfree_skb(skb, __builtin_return_address(0));
__kfree_skb(skb);
}
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 76d33ca..ce0bc36 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
smp_rmb();
else if (likely(!atomic_dec_and_test(&skb->users)))
return;
+ trace_consume_skb(skb);
__kfree_skb(skb);
}
EXPORT_SYMBOL(consume_skb);
^ permalink raw reply related
* Re: [RFC PATCH v3 1/5] irq: add tracepoint to softirq_raise
From: Koki Sanagi @ 2010-07-22 8:41 UTC (permalink / raw)
To: Neil Horman
Cc: netdev, linux-kernel, davem, kaneshige.kenji, izumi.taku,
kosaki.motohiro, laijs, scott.a.mcmillan, rostedt, eric.dumazet,
fweisbec, mathieu.desnoyers
In-Reply-To: <20100721111419.GB21259@hmsreliant.think-freely.org>
(2010/07/21 20:14), Neil Horman wrote:
> On Wed, Jul 21, 2010 at 03:57:05PM +0900, Koki Sanagi wrote:
>> (2010/07/20 20:04), Neil Horman wrote:
>>> On Tue, Jul 20, 2010 at 09:45:31AM +0900, Koki Sanagi wrote:
>>>> From: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>>
>>>> Add a tracepoint for tracing when softirq action is raised.
>>>>
>>>> It and the existed tracepoints complete softirq's tracepoints:
>>>> softirq_raise, softirq_entry and softirq_exit.
>>>>
>>>> And when this tracepoint is used in combination with
>>>> the softirq_entry tracepoint we can determine
>>>> the softirq raise latency.
>>>>
>>>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>>>> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>>>> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
>>>>
>>>> [ factorize softirq events with DECLARE_EVENT_CLASS ]
>>>> Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
>>>> ---
>>>> include/linux/interrupt.h | 8 +++++-
>>>> include/trace/events/irq.h | 57 ++++++++++++++++++++++++++-----------------
>>>> kernel/softirq.c | 4 +-
>>>> 3 files changed, 43 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>>>> index c233113..1cb5726 100644
>>>> --- a/include/linux/interrupt.h
>>>> +++ b/include/linux/interrupt.h
>>>> @@ -18,6 +18,7 @@
>>>> #include <asm/atomic.h>
>>>> #include <asm/ptrace.h>
>>>> #include <asm/system.h>
>>>> +#include <trace/events/irq.h>
>>>>
>>>> /*
>>>> * These correspond to the IORESOURCE_IRQ_* defines in
>>>> @@ -402,7 +403,12 @@ asmlinkage void do_softirq(void);
>>>> asmlinkage void __do_softirq(void);
>>>> extern void open_softirq(int nr, void (*action)(struct softirq_action *));
>>>> extern void softirq_init(void);
>>>> -#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
>>>> +static inline void __raise_softirq_irqoff(unsigned int nr)
>>>> +{
>>>> + trace_softirq_raise(nr);
>>>> + or_softirq_pending(1UL << nr);
>>>> +}
>>>> +
>>> We already have tracepoints in irq_enter and irq_exit. If the goal here is to
>>> detect latency during packet processing, cant the delta in time between those
>>> two points be used to determine interrupt handling latency?
>>
>> Certainly, the time between irq_entry and irq_exit is not directly related to
>> latency during packet processing. But it's indirectly related it.
>> Because softirq_entry isn't passed until irq exits and softirq_entry time is
>> related to packet processing latency. So I show it as a reference.
>>
> Its not directly related no, but look at it, the amount of processing between
> irq_exit and softirq_entry is minimal. The information you are trying to
> extract by computing the delta from irq_entry to softirq_entry is almost exactly
> the same as that from irq_entry to irq_exit. For that matter, since you're
> trying to guage lantency for packet processing, I expect you could get the same
> delta by measuring irq_entry to napi_poll tracepoint time, and save the hassle
> of needing to filter on softirq processing that doesn't relate to packet
> processing.
Yeah, to determine interrput latency, we need either one irq_exit or
softirq_entry, not both.
And I think softirq_entry should be left because there is a possibility that
softirq isn't triggered immidiately after irq_exit.
softirq_exit isn't needed because it is not related to packet processing.
softirq_raise is needed because it connects irq_entry and softirq_entry but
there is no need to show it. Currently, my idea is like the following.
irq_entry(+0.000000msec,irq=77:eth3)
|
softirq_entry(+0.003562msec)
|
|---netif_receive_skb(+0.006279msec,len=100)
| |
| skb_copy_datagram_iovec(+0.038778msec, 2285:sshd)
|
napi_poll_exit(+0.017160msec, eth3)
>
>>>
>>>
>>>> extern void raise_softirq_irqoff(unsigned int nr);
>>>> extern void raise_softirq(unsigned int nr);
>>>> extern void wakeup_softirqd(void);
>>>> diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
>>>> index 0e4cfb6..717744c 100644
>>>> --- a/include/trace/events/irq.h
>>>> +++ b/include/trace/events/irq.h
>>>> @@ -5,7 +5,9 @@
>>>> #define _TRACE_IRQ_H
>>>>
>>>> #include <linux/tracepoint.h>
>>>> -#include <linux/interrupt.h>
>>>> +
>>>> +struct irqaction;
>>>> +struct softirq_action;
>>>>
>>>> #define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
>>>> #define show_softirq_name(val) \
>>>> @@ -84,56 +86,65 @@ TRACE_EVENT(irq_handler_exit,
>>>>
>>>> DECLARE_EVENT_CLASS(softirq,
>>>>
>>>> - TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
>>>> + TP_PROTO(unsigned int nr),
>>>>
>>>> - TP_ARGS(h, vec),
>>>> + TP_ARGS(nr),
>>>>
>>>> TP_STRUCT__entry(
>>>> - __field( int, vec )
>>>> + __field( unsigned int, vec )
>>>> ),
>>>>
>>>> TP_fast_assign(
>>>> - __entry->vec = (int)(h - vec);
>>>> + __entry->vec = nr;
>>>> ),
>>>>
>>>> TP_printk("vec=%d [action=%s]", __entry->vec,
>>>> - show_softirq_name(__entry->vec))
>>>> + show_softirq_name(__entry->vec))
>>>> +);
>>>> +
>>>> +/**
>>>> + * softirq_raise - called immediately when a softirq is raised
>>>> + * @nr: softirq vector number
>>>> + *
>>>> + * Tracepoint for tracing when softirq action is raised.
>>>> + * Also, when used in combination with the softirq_entry tracepoint
>>>> + * we can determine the softirq raise latency.
>>>> + */
>>>> +DEFINE_EVENT(softirq, softirq_raise,
>>>> +
>>>> + TP_PROTO(unsigned int nr),
>>>> +
>>>> + TP_ARGS(nr)
>>>> );
>>>>
>>>> /**
>>>> * softirq_entry - called immediately before the softirq handler
>>>> - * @h: pointer to struct softirq_action
>>>> - * @vec: pointer to first struct softirq_action in softirq_vec array
>>>> + * @nr: softirq vector number
>>>> *
>>>> - * The @h parameter, contains a pointer to the struct softirq_action
>>>> - * which has a pointer to the action handler that is called. By subtracting
>>>> - * the @vec pointer from the @h pointer, we can determine the softirq
>>>> - * number. Also, when used in combination with the softirq_exit tracepoint
>>>> + * Tracepoint for tracing when softirq action starts.
>>>> + * Also, when used in combination with the softirq_exit tracepoint
>>>> * we can determine the softirq latency.
>>>> */
>>>> DEFINE_EVENT(softirq, softirq_entry,
>>>>
>>>> - TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
>>>> + TP_PROTO(unsigned int nr),
>>>>
>>>> - TP_ARGS(h, vec)
>>>> + TP_ARGS(nr)
>>>> );
>>>>
>>>> /**
>>>> * softirq_exit - called immediately after the softirq handler returns
>>>> - * @h: pointer to struct softirq_action
>>>> - * @vec: pointer to first struct softirq_action in softirq_vec array
>>>> + * @nr: softirq vector number
>>>> *
>>>> - * The @h parameter contains a pointer to the struct softirq_action
>>>> - * that has handled the softirq. By subtracting the @vec pointer from
>>>> - * the @h pointer, we can determine the softirq number. Also, when used in
>>>> - * combination with the softirq_entry tracepoint we can determine the softirq
>>>> - * latency.
>>>> + * Tracepoint for tracing when softirq action ends.
>>>> + * Also, when used in combination with the softirq_entry tracepoint
>>>> + * we can determine the softirq latency.
>>>> */
>>>> DEFINE_EVENT(softirq, softirq_exit,
>>>>
>>>> - TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
>>>> + TP_PROTO(unsigned int nr),
>>>>
>>>> - TP_ARGS(h, vec)
>>>> + TP_ARGS(nr)
>>>> );
>>>>
>>>> #endif /* _TRACE_IRQ_H */
>>>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>>>> index 825e112..6790599 100644
>>>> --- a/kernel/softirq.c
>>>> +++ b/kernel/softirq.c
>>>> @@ -215,9 +215,9 @@ restart:
>>>> int prev_count = preempt_count();
>>>> kstat_incr_softirqs_this_cpu(h - softirq_vec);
>>>>
>>>> - trace_softirq_entry(h, softirq_vec);
>>>> + trace_softirq_entry(h - softirq_vec);
>>>> h->action(h);
>>>> - trace_softirq_exit(h, softirq_vec);
>>>> + trace_softirq_exit(h - softirq_vec);
>>>
>>> You're loosing information here by reducing the numbers of parameters in this
>>> tracepoint. How many other tracepoint scripts rely on having both pointers
>>> handy? Why not just do the pointer math inside your tracehook instead?
>>
>> In __raise_softirq_irqoff macro there is no method to refer softirq_vec, so it
>> can't use softirq DECLARE_EVENT_CLASS as is.
>> Currently, there is no script using softirq_entry or softirq_exit.
>>
> That shouldn't matter, just pass in NULL for softirq_vec in
> __raise_softirq_irqoff as the second argument to the trace function. You may
> need to fix up the class definition so that the assignment or printk doesn't try
> to dereference that pointer when its NULL, but thats easy enough, and it avoids
> breaking any other perf scripts floating out there.
> Neil
>
>> Thanks,
>> Koki Sanagi.
>>
>>>
>>>> if (unlikely(prev_count != preempt_count())) {
>>>> printk(KERN_ERR "huh, entered softirq %td %s %p"
>>>> "with preempt_count %08x,"
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>>
>>
>>
>>
>
>
^ permalink raw reply
* Re: Fwd: LVS on local node
From: Changli Gao @ 2010-07-22 9:10 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Franchoze Eric, wensong, lvs-devel, netdev, netfilter-devel
In-Reply-To: <1279781811.2405.15.camel@edumazet-laptop>
On Thu, Jul 22, 2010 at 2:56 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> lvs seems not very SMP friendly and a bit complex.
>
> I would use an iptables setup and a slighly modified REDIRECT target
> (and/or a nf_nat_setup_info() change)
>
> Say you have 8 daemons listening on different ports (1000 to 1007)
>
> iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --rxhash-dist --to-port 1000-1007
>
> rxhash would be provided by RPS on recent kernels or locally computed if
> not already provided by core network (or old kernel)
>
> This rule would be triggered only at connection establishment.
> conntracking take care of following packets and is SMP friendly.
>
>
I think maybe REDIRECT is enough. If the public port is one of the
real ports, you need to append "random" option to iptables target
REDIRECT. If not, "REDIRECT --to-ports 1000-1007" is good enough, and
the destination port will be selected in the round-robin manner.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
* [PATCH] Driver-core: Fix bluetooth network device rename regression
From: Eric W. Biederman @ 2010-07-22 9:16 UTC (permalink / raw)
To: Greg KH
Cc: Andrew Morton, Greg KH, Rafael J. Wysocki, Maciej W. Rozycki,
Kay Sievers, Johannes Berg, netdev
In-Reply-To: <m139vd5sms.fsf_-_@fess.ebiederm.org>
With CONFIG_SYSFS_DEPRECATED_V2 enabled I can rename any network device
anything as long as the new name does not conflict with another network
device.
With CONFIG_SYSFS_DEPRECATED_V2 disabled without this fix bluetooth benp
devices, and the mac80211_hwsim driver can not be renamed to any arbitrary
name that happens to conflict with any other name that is used in their
parent devices directory.
The device model usage of the bluetooth bnep driver has not changed since
the current device model sysfs laytout was introduced. Making this a latent
and very annoying regression.
This regression was reported at the time it was introduced and apprently a
few cases were missed by:
commit 864062457a2e444227bd368ca5f2a2b740de604f
Author: Kay Sievers <kay.sievers@vrfy.org>
Date: Wed Mar 14 03:25:56 2007 +0100
driver core: fix namespace issue with devices assigned to classes
- uses a kset in "struct class" to keep track of all directories
belonging to this class
- merges with the /sys/devices/virtual logic.
- removes the namespace-dir if the last member of that class
leaves the directory.
There may be locking or refcounting fixes left, I stopped when it seemed
to work with network and sound modules. :)
From: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
That bug fix had a completely undocumented and apparently deliberate omission
where it does not apply in some cases. Those omitted cases cover the
bluetooth network driver.
What makes this regression a serious issue now is the introduction of network
namespace support in sysfs took this from a mild bug to complete driver
non-function.
Several reasons have been put forward not to use this one line bug fix in
the driver core.
- We don't have special cases in the driver core.
This is non-sense we currently have special cases for block devices
scattered all throughout the driver core.
- The driver is at fault.
The bluetooth driver's driver core usage predates the introduction of the
current sysfs layout. It has been 3 years and no one has bothered
to change the driver in all of this time.
If this was really a driver issue that someone cared about and not simply
an academic issue the driver should have been fixed long since.
I offer these reasons to make the change.
- There is not enough information available for sysfs to support rename
or delete of symlinks when only the source directory but not the destination
directory is tagged.
- All of the proposals put forth will change the sysfs layout slightly in
one way or another.
- A network device special case makes sense as network devices are unique
in placing a user choosen name sysfs.
- The driver that are affected are different in sysfs from all other
network devices which likely already consuses any userspace software
that cares about the exact device layout.
- The change is one line that is obviously correct and has no chance
of affecting anything that it is not a network device.
- Real users have hit this problem and reported this bug and those users
deserve drivers that work.
The fix is a trivial change to get_device_parent to always create
the network class directory in any parent of a network device,
Ignoring the incorrect, non-obvious and esoteric rules that the
driver core is trying to impose on the creation of class directories.
Ideally I would remove those crazy special cases get_device_parent for all
devices but in testing it was observed there are other devices such as the
input layer that don't create the class directories today, and applying the
change broadly is likely to break something in userspace and there is no
need to take that risk.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
drivers/base/core.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 9630fbd..ffb8443 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -673,7 +673,7 @@ static struct kobject *get_device_parent(struct device *dev,
*/
if (parent == NULL)
parent_kobj = virtual_device_parent(dev);
- else if (parent->class)
+ else if (parent->class && (strcmp(dev->class->name, "net") != 0))
return &parent->kobj;
else
parent_kobj = &parent->kobj;
--
1.6.5.2.143.g8cc62
^ permalink raw reply related
* Re: [PATCH] CAN: Add Flexcan CAN controller driver
From: Wolfgang Grandegger @ 2010-07-22 9:25 UTC (permalink / raw)
To: Marc Kleine-Budde
Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4C475BA6.3030505-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
On 07/21/2010 10:42 PM, Marc Kleine-Budde wrote:
> Marc Kleine-Budde wrote:
>> Wolfgang Grandegger wrote:
>>> I realized a few issues. You can add my "acked-by" when they are fixed.
>>
>> thanks for the review.
>
> [...]
>
>>>> +static void flexcan_poll_err_frame(struct net_device *dev,
>>>> + struct can_frame *cf, u32 reg_esr)
>>>> +{
>>>> + struct flexcan_priv *priv = netdev_priv(dev);
>>>> + int error_warning = 0, rx_errors = 0, tx_errors = 0;
>>>> +
>>>> + if (reg_esr & FLEXCAN_ESR_BIT1_ERR) {
>>>> + rx_errors = 1;
>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>> + cf->data[2] |= CAN_ERR_PROT_BIT1;
>>>> + }
>>>> +
>>>> + if (reg_esr & FLEXCAN_ESR_BIT0_ERR) {
>>>> + rx_errors = 1;
>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>> + cf->data[2] |= CAN_ERR_PROT_BIT0;
>>>> + }
>>>> +
>>>> + if (reg_esr & FLEXCAN_ESR_FRM_ERR) {
>>>> + rx_errors = 1;
>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>> + cf->data[2] |= CAN_ERR_PROT_FORM;
>>>> + }
>>>> +
>>>> + if (reg_esr & FLEXCAN_ESR_STF_ERR) {
>>>> + rx_errors = 1;
>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>> + cf->data[2] |= CAN_ERR_PROT_STUFF;
>>>> + }
>>>> +
>>>> +
>>>> + if (reg_esr & FLEXCAN_ESR_ACK_ERR) {
>>>> + tx_errors = 1;
>>>> + cf->can_id |= CAN_ERR_ACK;
>>> This is a bus-error as well. Therefore I think it should be:
>>>
>>> if (reg_esr & FLEXCAN_ESR_ACK_ERR) {
>>> tx_errors = 1;
>>> cf->can_id |= CAN_ERR_ACK;
>>> cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>> cf->data[3] |= CAN_ERR_PROT_LOC_ACK; /* ACK slot */
>>> }
>>>
>>> I need to check what CAN_ERR_ACK is intended for. Then, cf->can_id could
>>> be preset with "CAN_ERR_PROT | CAN_ERR_BUSERROR" at the beginning.
>
> This controller issues an ACK error if there are no other nodes on the
> CAN bus to send a ACK that the message has been received. Or all
> remaining Nodes are in bus off state.
Mainly this bus error can cause the infamous IRQ and message flooding
when no cable is connected.
> From the datasheet:
> "This bit indicates that an acknowledge (ACK) error has been detected by
> the transmitter node; that is, a dominant bit has not been detected
> during the ACK SLOT."
That's what the above error describes, like on the SJA1000, apart from
setting CAN_ERR_ACK. Setting CAN_ERR_ACK is somehow bogus, but just
leave it for the time being. I will fix it globally when time permits.
Wolfgang.
^ permalink raw reply
* Re: macvtap: Limit packet queue length
From: Arnd Bergmann @ 2010-07-22 9:30 UTC (permalink / raw)
To: Herbert Xu; +Cc: David S. Miller, netdev, Mark Wagner, Chris Wright
In-Reply-To: <20100722074431.GA26744@gondor.apana.org.au>
On Thursday 22 July 2010, Herbert Xu wrote:
> On Thu, Jul 22, 2010 at 02:41:57PM +0800, Herbert Xu wrote:
> > Hi:
> >
> > macvtap: Limit packet queue length
>
> Chris has informed me that he's already tried a similar patch
> and it only makes the problem worse :)
>
> The issue is that the macvtap TX queue length defaults to zero.
>
> So here is an updated patch which addresses this:
Thanks for debugging this and coming up with a solution.
I'm currently travelling, so I can't easily work on it myself.
> Please note that macvtap currently has no way of giving congestion
> notification, that means the software device TX queue cannot be
> used and packets will always be dropped once the macvtap driver
> queue fills up.
This is something I was planning to look into for doing it right,
and then I forgot about it. I'll investigate what could be done
to get proper flow control once I get back to the office.
> Chris Wright noticed that for this patch to work, we need a
> non-zero TX queue length. This patch includes his work to change
> the default macvtap TX queue length to 500.
The only problem I can see with this patch is making it depend on
the *TX* queue length. The point is that unlike tun/tap, the
macvtap network interface's point of view is that this is the
receive queue, not the transmit queue.
In the TX direction, we really don't queue, since we simply forward
to the lowerdev tx queue, so exposing the tunable to user space
as the tx queue length is a little bit awkward, as well as inconsistent
between macvtap and macvlan.
> Reported-by: Mark Wagner <mwagner@redhat.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As long as we're missing a better solution,
Acked-by: Arnd Bergmann <arnd@arndb.de>
^ permalink raw reply
* Re: [PATCH] CAN: Add Flexcan CAN controller driver
From: Marc Kleine-Budde @ 2010-07-22 9:33 UTC (permalink / raw)
To: Wolfgang Grandegger
Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4C480E7E.70607-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>
[-- Attachment #1.1: Type: text/plain, Size: 3381 bytes --]
Wolfgang Grandegger wrote:
> On 07/21/2010 10:42 PM, Marc Kleine-Budde wrote:
>> Marc Kleine-Budde wrote:
>>> Wolfgang Grandegger wrote:
>>>> I realized a few issues. You can add my "acked-by" when they are fixed.
>>> thanks for the review.
>> [...]
>>
>>>>> +static void flexcan_poll_err_frame(struct net_device *dev,
>>>>> + struct can_frame *cf, u32 reg_esr)
>>>>> +{
>>>>> + struct flexcan_priv *priv = netdev_priv(dev);
>>>>> + int error_warning = 0, rx_errors = 0, tx_errors = 0;
>>>>> +
>>>>> + if (reg_esr & FLEXCAN_ESR_BIT1_ERR) {
>>>>> + rx_errors = 1;
>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>> + cf->data[2] |= CAN_ERR_PROT_BIT1;
>>>>> + }
>>>>> +
>>>>> + if (reg_esr & FLEXCAN_ESR_BIT0_ERR) {
>>>>> + rx_errors = 1;
>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>> + cf->data[2] |= CAN_ERR_PROT_BIT0;
>>>>> + }
>>>>> +
>>>>> + if (reg_esr & FLEXCAN_ESR_FRM_ERR) {
>>>>> + rx_errors = 1;
>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>> + cf->data[2] |= CAN_ERR_PROT_FORM;
>>>>> + }
>>>>> +
>>>>> + if (reg_esr & FLEXCAN_ESR_STF_ERR) {
>>>>> + rx_errors = 1;
>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>> + cf->data[2] |= CAN_ERR_PROT_STUFF;
>>>>> + }
>>>>> +
>>>>> +
>>>>> + if (reg_esr & FLEXCAN_ESR_ACK_ERR) {
>>>>> + tx_errors = 1;
>>>>> + cf->can_id |= CAN_ERR_ACK;
>>>> This is a bus-error as well. Therefore I think it should be:
>>>>
>>>> if (reg_esr & FLEXCAN_ESR_ACK_ERR) {
>>>> tx_errors = 1;
>>>> cf->can_id |= CAN_ERR_ACK;
>>>> cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>> cf->data[3] |= CAN_ERR_PROT_LOC_ACK; /* ACK slot */
>>>> }
>>>>
>>>> I need to check what CAN_ERR_ACK is intended for. Then, cf->can_id could
>>>> be preset with "CAN_ERR_PROT | CAN_ERR_BUSERROR" at the beginning.
>> This controller issues an ACK error if there are no other nodes on the
>> CAN bus to send a ACK that the message has been received. Or all
>> remaining Nodes are in bus off state.
>
> Mainly this bus error can cause the infamous IRQ and message flooding
> when no cable is connected.
No cable connected can (if your node doesn't have an activated on baord
termination) result in no termination at all, and this may result in a
different error.
At least it does on the at91. I haven't checked with the flexcan.
The subtile difference is that the CAN controller isn't allowed to go
into bus-off with a proper terminated bus when it recevies no ACKs, but
going to bus off on a not terminated bus is okay.
>> From the datasheet:
>> "This bit indicates that an acknowledge (ACK) error has been detected by
>> the transmitter node; that is, a dominant bit has not been detected
>> during the ACK SLOT."
>
> That's what the above error describes, like on the SJA1000, apart from
> setting CAN_ERR_ACK. Setting CAN_ERR_ACK is somehow bogus, but just
> leave it for the time being. I will fix it globally when time permits.
Now I'm confused. What's the meaning of CAN_ERR_ACK? When should it be used?
Cheers, Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 260 bytes --]
[-- Attachment #2: Type: text/plain, Size: 188 bytes --]
_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core
^ permalink raw reply
* Re: [PATCH] CAN: Add Flexcan CAN controller driver
From: Wolfgang Grandegger @ 2010-07-22 9:44 UTC (permalink / raw)
To: Marc Kleine-Budde
Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4C481064.8090809-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org>
On 07/22/2010 11:33 AM, Marc Kleine-Budde wrote:
> Wolfgang Grandegger wrote:
>> On 07/21/2010 10:42 PM, Marc Kleine-Budde wrote:
>>> Marc Kleine-Budde wrote:
>>>> Wolfgang Grandegger wrote:
>>>>> I realized a few issues. You can add my "acked-by" when they are fixed.
>>>> thanks for the review.
>>> [...]
>>>
>>>>>> +static void flexcan_poll_err_frame(struct net_device *dev,
>>>>>> + struct can_frame *cf, u32 reg_esr)
>>>>>> +{
>>>>>> + struct flexcan_priv *priv = netdev_priv(dev);
>>>>>> + int error_warning = 0, rx_errors = 0, tx_errors = 0;
>>>>>> +
>>>>>> + if (reg_esr & FLEXCAN_ESR_BIT1_ERR) {
>>>>>> + rx_errors = 1;
>>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>>> + cf->data[2] |= CAN_ERR_PROT_BIT1;
>>>>>> + }
>>>>>> +
>>>>>> + if (reg_esr & FLEXCAN_ESR_BIT0_ERR) {
>>>>>> + rx_errors = 1;
>>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>>> + cf->data[2] |= CAN_ERR_PROT_BIT0;
>>>>>> + }
>>>>>> +
>>>>>> + if (reg_esr & FLEXCAN_ESR_FRM_ERR) {
>>>>>> + rx_errors = 1;
>>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>>> + cf->data[2] |= CAN_ERR_PROT_FORM;
>>>>>> + }
>>>>>> +
>>>>>> + if (reg_esr & FLEXCAN_ESR_STF_ERR) {
>>>>>> + rx_errors = 1;
>>>>>> + cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>>> + cf->data[2] |= CAN_ERR_PROT_STUFF;
>>>>>> + }
>>>>>> +
>>>>>> +
>>>>>> + if (reg_esr & FLEXCAN_ESR_ACK_ERR) {
>>>>>> + tx_errors = 1;
>>>>>> + cf->can_id |= CAN_ERR_ACK;
>>>>> This is a bus-error as well. Therefore I think it should be:
>>>>>
>>>>> if (reg_esr & FLEXCAN_ESR_ACK_ERR) {
>>>>> tx_errors = 1;
>>>>> cf->can_id |= CAN_ERR_ACK;
>>>>> cf->can_id |= CAN_ERR_PROT | CAN_ERR_BUSERROR;
>>>>> cf->data[3] |= CAN_ERR_PROT_LOC_ACK; /* ACK slot */
>>>>> }
>>>>>
>>>>> I need to check what CAN_ERR_ACK is intended for. Then, cf->can_id could
>>>>> be preset with "CAN_ERR_PROT | CAN_ERR_BUSERROR" at the beginning.
>>> This controller issues an ACK error if there are no other nodes on the
>>> CAN bus to send a ACK that the message has been received. Or all
>>> remaining Nodes are in bus off state.
>>
>> Mainly this bus error can cause the infamous IRQ and message flooding
>> when no cable is connected.
>
> No cable connected can (if your node doesn't have an activated on baord
> termination) result in no termination at all, and this may result in a
> different error.
>
> At least it does on the at91. I haven't checked with the flexcan.
>
> The subtile difference is that the CAN controller isn't allowed to go
> into bus-off with a proper terminated bus when it recevies no ACKs, but
> going to bus off on a not terminated bus is okay.
>
>>> From the datasheet:
>>> "This bit indicates that an acknowledge (ACK) error has been detected by
>>> the transmitter node; that is, a dominant bit has not been detected
>>> during the ACK SLOT."
>>
>> That's what the above error describes, like on the SJA1000, apart from
>> setting CAN_ERR_ACK. Setting CAN_ERR_ACK is somehow bogus, but just
>> leave it for the time being. I will fix it globally when time permits.
>
> Now I'm confused. What's the meaning of CAN_ERR_ACK? When should it be used?
The type of the error is already defined via "CAN_ERR_PROT |
CAN_ERR_BUSERROR". The details of "CAN_ERR_PROT" are described in
data[2..3]. Just for the ACK errors we have CAN_ERR_ACK, but not for the
other bus errors and I ask myself why CAN_ERR_ACK was introduced.
If it does not have another meaning, I tend to remove it (only the AT91
driver actually uses it).
Wolfgang.
^ permalink raw reply
* Re: Fwd: LVS on local node
From: Eric Dumazet @ 2010-07-22 9:46 UTC (permalink / raw)
To: Changli Gao, Patrick McHardy, Jan Engelhardt
Cc: Franchoze Eric, wensong, lvs-devel, netdev, netfilter-devel
In-Reply-To: <AANLkTimneLE2xKg6fie5u3Cvjzcrq4VnK6wMT3pno0JK@mail.gmail.com>
Le jeudi 22 juillet 2010 à 17:10 +0800, Changli Gao a écrit :
>
> I think maybe REDIRECT is enough. If the public port is one of the
> real ports, you need to append "random" option to iptables target
> REDIRECT. If not, "REDIRECT --to-ports 1000-1007" is good enough, and
> the destination port will be selected in the round-robin manner.
>
Yes, on 2.6.32, no RPS, so undocumented --random option is probably the
best we can offer. (random option was added in 2.6.22)
iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --random --to-port 1000-1007
Here is a patch to add "random" help to REDIRECT iptables target
Thanks
[PATCH] extensions: REDIRECT: add random help
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/extensions/libipt_REDIRECT.c b/extensions/libipt_REDIRECT.c
index 3dfcadf..324d0eb 100644
--- a/extensions/libipt_REDIRECT.c
+++ b/extensions/libipt_REDIRECT.c
@@ -17,7 +17,8 @@ static void REDIRECT_help(void)
printf(
"REDIRECT target options:\n"
" --to-ports <port>[-<port>]\n"
-" Port (range) to map to.\n");
+" Port (range) to map to.\n"
+" [--random]\n");
}
static const struct option REDIRECT_opts[] = {
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* [PATCH] nf_nat_core: merge the same lines
From: Changli Gao @ 2010-07-22 9:49 UTC (permalink / raw)
To: Patrick McHardy; +Cc: David S. Miller, netfilter-devel, netdev, Changli Gao
nf_nat_core: merge the same lines
proto->unique_tuple() will be called finally, if the previous calls fail. This
patch checks the false condition of (range->flags & IP_NAT_RANGE_PROTO_RANDOM)
instead to avoid duplicate line of code: proto->unique_tuple().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/ipv4/netfilter/nf_nat_core.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index c7719b2..037a3a6 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -261,14 +261,9 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple,
rcu_read_lock();
proto = __nf_nat_proto_find(orig_tuple->dst.protonum);
- /* Change protocol info to have some randomization */
- if (range->flags & IP_NAT_RANGE_PROTO_RANDOM) {
- proto->unique_tuple(tuple, range, maniptype, ct);
- goto out;
- }
-
/* Only bother mapping if it's not already in range and unique */
- if ((!(range->flags & IP_NAT_RANGE_PROTO_SPECIFIED) ||
+ if (!(range->flags & IP_NAT_RANGE_PROTO_RANDOM) &&
+ (!(range->flags & IP_NAT_RANGE_PROTO_SPECIFIED) ||
proto->in_range(tuple, maniptype, &range->min, &range->max)) &&
!nf_nat_used_tuple(tuple, ct))
goto out;
^ permalink raw reply related
* Re: Fwd: LVS on local node
From: Changli Gao @ 2010-07-22 9:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: Patrick McHardy, Jan Engelhardt, Franchoze Eric, wensong,
lvs-devel, netdev, netfilter-devel
In-Reply-To: <1279791964.2467.12.camel@edumazet-laptop>
On Thu, Jul 22, 2010 at 5:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 22 juillet 2010 à 17:10 +0800, Changli Gao a écrit :
>
>>
>> I think maybe REDIRECT is enough. If the public port is one of the
>> real ports, you need to append "random" option to iptables target
>> REDIRECT. If not, "REDIRECT --to-ports 1000-1007" is good enough, and
>> the destination port will be selected in the round-robin manner.
>>
>
> Yes, on 2.6.32, no RPS, so undocumented --random option is probably the
> best we can offer. (random option was added in 2.6.22)
>
> iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --random --to-port 1000-1007
>
> Here is a patch to add "random" help to REDIRECT iptables target
>
FYI: the random option is documented in the manual page of iptables.
REDIRECT
This target is only valid in the nat table, in the PREROUTING and OUT-
PUT chains, and user-defined chains which are only called from those
chains. It redirects the packet to the machine itself by changing the
destination IP to the primary address of the incoming interface
(locally-generated packets are mapped to the 127.0.0.1 address).
--to-ports port[-port]
This specifies a destination port or range of ports to use:
without this, the destination port is never altered. This is
only valid if the rule also specifies -p tcp or -p udp.
--random
If option --random is used then port mapping will be randomized
(kernel >= 2.6.22).
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] sysfs: Don't allow the creation of symlinks we can't remove
From: Johannes Berg @ 2010-07-22 9:54 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Greg KH, Andrew Morton, Rafael J. Wysocki, Maciej W. Rozycki,
Kay Sievers, Greg KH, netdev
In-Reply-To: <m1630q7x5v.fsf_-_@fess.ebiederm.org>
On Thu, 2010-07-08 at 09:31 -0700, Eric W. Biederman wrote:
> Recently my tagged sysfs support revealed a flaw in the device core
> that a few rare drivers are running into such that we don't always put
> network devices in a class subdirectory named net/.
>
> Since we are not creating the class directory the network devices wind
> up in a non-tagged directory, but the symlinks to the network devices
> from /sys/class/net are in a tagged directory. All of which works
> until we go to remove or rename the symlink. When we remove or rename
> a symlink we look in the namespace of the target of the symlink.
> Since the target of the symlink is in a non-tagged sysfs directory we
> don't have a namespace to look in, and we fail to remove the symlink.
>
> Detect this problem up front and simply don't create symlinks we won't
> be able to remove later. This prevents symlink leakage and fails in
> a much clearer and more understandable way.
Eric, I was looking into sysfs netns support for wireless, and with this
patch applied I just get the warning and no network interfaces.
Was there any patch that was supposed to fix hwsim?
johannes
^ permalink raw reply
* Re: Fwd: LVS on local node
From: Eric Dumazet @ 2010-07-22 9:59 UTC (permalink / raw)
To: Changli Gao
Cc: Patrick McHardy, Jan Engelhardt, Franchoze Eric, wensong,
lvs-devel, netdev, netfilter-devel
In-Reply-To: <AANLkTimLZG_-4wbKy8nrajHVKQvvGtRKEd8qHNoU3swJ@mail.gmail.com>
Le jeudi 22 juillet 2010 à 17:52 +0800, Changli Gao a écrit :
>
> FYI: the random option is documented in the manual page of iptables.
>
> REDIRECT
> This target is only valid in the nat table, in the PREROUTING and OUT-
> PUT chains, and user-defined chains which are only called from those
> chains. It redirects the packet to the machine itself by changing the
> destination IP to the primary address of the incoming interface
> (locally-generated packets are mapped to the 127.0.0.1 address).
>
> --to-ports port[-port]
> This specifies a destination port or range of ports to use:
> without this, the destination port is never altered. This is
> only valid if the rule also specifies -p tcp or -p udp.
>
> --random
> If option --random is used then port mapping will be randomized
> (kernel >= 2.6.22).
>
>
Note my patch has nothing to do with the man page, its already up2date.
I usually dont read the Fine manuals, do you ?
Try :
iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --help
REDIRECT target options:
--to-ports <port>[-<port>]
Port (range) to map to.
You see [--random] is missing.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] sysfs: Don't allow the creation of symlinks we can't remove
From: Eric W. Biederman @ 2010-07-22 10:05 UTC (permalink / raw)
To: Johannes Berg
Cc: Greg KH, Andrew Morton, Rafael J. Wysocki, Maciej W. Rozycki,
Kay Sievers, Greg KH, netdev
In-Reply-To: <1279792459.12439.0.camel@jlt3.sipsolutions.net>
Johannes Berg <johannes@sipsolutions.net> writes:
> On Thu, 2010-07-08 at 09:31 -0700, Eric W. Biederman wrote:
>> Recently my tagged sysfs support revealed a flaw in the device core
>> that a few rare drivers are running into such that we don't always put
>> network devices in a class subdirectory named net/.
>>
>> Since we are not creating the class directory the network devices wind
>> up in a non-tagged directory, but the symlinks to the network devices
>> from /sys/class/net are in a tagged directory. All of which works
>> until we go to remove or rename the symlink. When we remove or rename
>> a symlink we look in the namespace of the target of the symlink.
>> Since the target of the symlink is in a non-tagged sysfs directory we
>> don't have a namespace to look in, and we fail to remove the symlink.
>>
>> Detect this problem up front and simply don't create symlinks we won't
>> be able to remove later. This prevents symlink leakage and fails in
>> a much clearer and more understandable way.
>
> Eric, I was looking into sysfs netns support for wireless, and with this
> patch applied I just get the warning and no network interfaces.
The warning patch just makes things fail faster. Although I get some of the
wireless interfaces for hwsim when I use this one.
> Was there any patch that was supposed to fix hwsim?
- If you have my patches that fix CONFIG_SYSFS_DEPRECATED,
you should find everything works there.
As for a proper fix I have just resent my one liner to
drives/base/core.c I can't think of a better option right now.
For hwsim it is arguable, but the behaviour of sysfs for the
bluetooth bnep driver is very clearly a 3 year old regression,
and the cause is exactly the same.
Eric
^ permalink raw reply
* Re: Fwd: LVS on local node
From: Changli Gao @ 2010-07-22 10:06 UTC (permalink / raw)
To: Eric Dumazet
Cc: Patrick McHardy, Jan Engelhardt, Franchoze Eric, wensong,
lvs-devel, netdev, netfilter-devel
In-Reply-To: <1279792792.2467.15.camel@edumazet-laptop>
On Thu, Jul 22, 2010 at 5:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 22 juillet 2010 à 17:52 +0800, Changli Gao a écrit :
>
>>
>> FYI: the random option is documented in the manual page of iptables.
>>
>> REDIRECT
>> This target is only valid in the nat table, in the PREROUTING and OUT-
>> PUT chains, and user-defined chains which are only called from those
>> chains. It redirects the packet to the machine itself by changing the
>> destination IP to the primary address of the incoming interface
>> (locally-generated packets are mapped to the 127.0.0.1 address).
>>
>> --to-ports port[-port]
>> This specifies a destination port or range of ports to use:
>> without this, the destination port is never altered. This is
>> only valid if the rule also specifies -p tcp or -p udp.
>>
>> --random
>> If option --random is used then port mapping will be randomized
>> (kernel >= 2.6.22).
>>
>>
>
> Note my patch has nothing to do with the man page, its already up2date.
>
> I usually dont read the Fine manuals, do you ?
Yea. And I don't object your patch. so I add FYI. Thanks.
>
> Try :
>
> iptables -t nat -A PREROUTING -p tcp --dport 1234 -j REDIRECT --help
>
> REDIRECT target options:
> --to-ports <port>[-<port>]
> Port (range) to map to.
>
>
> You see [--random] is missing.
>
>
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox