* [PATCH 14/25] ipvs: fix ip_vs_try_bind_dest to rebind app and transmitter
2012-05-08 0:21 [PATCH 00/25] netfilter updates for net-next (upcoming 3.5) pablo
@ 2012-05-08 0:22 ` pablo
0 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 0:22 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Initially, when the synced connection is created, we
use the forwarding method provided by the master, but once we
bind to a destination it can change. As a result, we must
update the application and the transmitter.
As ip_vs_try_bind_dest is always called for connections
that require dest binding, there is no need to validate the
cp and dest pointers.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_conn.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 7647f3b..9d237d7 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -612,14 +612,33 @@ struct ip_vs_dest *ip_vs_try_bind_dest(struct ip_vs_conn *cp)
{
struct ip_vs_dest *dest;
- if ((cp) && (!cp->dest)) {
- dest = ip_vs_find_dest(ip_vs_conn_net(cp), cp->af, &cp->daddr,
- cp->dport, &cp->vaddr, cp->vport,
- cp->protocol, cp->fwmark, cp->flags);
+ dest = ip_vs_find_dest(ip_vs_conn_net(cp), cp->af, &cp->daddr,
+ cp->dport, &cp->vaddr, cp->vport,
+ cp->protocol, cp->fwmark, cp->flags);
+ if (dest) {
+ struct ip_vs_proto_data *pd;
+
+ /* Applications work depending on the forwarding method
+ * but better to reassign them always when binding dest */
+ if (cp->app)
+ ip_vs_unbind_app(cp);
+
ip_vs_bind_dest(cp, dest);
- return dest;
- } else
- return NULL;
+
+ /* Update its packet transmitter */
+ cp->packet_xmit = NULL;
+#ifdef CONFIG_IP_VS_IPV6
+ if (cp->af == AF_INET6)
+ ip_vs_bind_xmit_v6(cp);
+ else
+#endif
+ ip_vs_bind_xmit(cp);
+
+ pd = ip_vs_proto_data_get(ip_vs_conn_net(cp), cp->protocol);
+ if (pd && atomic_read(&pd->appcnt))
+ ip_vs_bind_app(cp, pd->pp);
+ }
+ return dest;
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
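For readers who want the order of operations in the patch above without the kernel context, here is a minimal userspace sketch. Everything named sketch_* is a hypothetical stand-in and not part of IPVS; the destination lookup and the application lookup are reduced to plain arguments, and the transmitter is reduced to an enum. It only illustrates the sequence: drop the old application, bind the destination, reset and reselect the transmitter by address family, then rebind the application if the protocol has any registered.

```c
#include <stddef.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Hypothetical transmitter choices; the kernel stores a function pointer. */
enum sketch_xmit { SKETCH_XMIT_NONE, SKETCH_XMIT_V4, SKETCH_XMIT_V6 };

/* Hypothetical, heavily simplified stand-in for struct ip_vs_conn. */
struct sketch_conn {
	int af;				/* AF_INET or AF_INET6 */
	int app_bound;			/* non-zero if an application is bound */
	const void *dest;		/* bound destination, or NULL */
	enum sketch_xmit packet_xmit;	/* currently selected transmitter */
};

/*
 * found_dest: result of the destination lookup (NULL if none found).
 * proto_has_apps: non-zero if the protocol has registered applications.
 */
static const void *sketch_try_bind_dest(struct sketch_conn *cp,
					const void *found_dest,
					int proto_has_apps)
{
	if (found_dest == NULL)
		return NULL;		/* nothing to bind to yet */

	cp->app_bound = 0;		/* unbind the app chosen at sync time */
	cp->dest = found_dest;		/* bind the real destination */

	/* Forget the old transmitter, then pick one by address family. */
	cp->packet_xmit = SKETCH_XMIT_NONE;
	cp->packet_xmit = (cp->af == AF_INET6) ? SKETCH_XMIT_V6
					       : SKETCH_XMIT_V4;

	if (proto_has_apps)		/* rebind only if apps are registered */
		cp->app_bound = 1;

	return found_dest;
}
```

A connection that arrives from the master with one transmitter and an application bound ends up, after a successful call, with the transmitter matching its address family and with the application bound again only when the protocol has applications.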
* [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5)
@ 2012-05-08 7:49 pablo
2012-05-08 7:49 ` [PATCH 01/25] netfilter: nf_ct_ecache: refactor notifier registration pablo
` (25 more replies)
0 siblings, 26 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Pablo Neira Ayuso <pablo@netfilter.org>
Hi David,
Second version including requested updates.
The following patchset contains the Netfilter updates for net-next.
Most notably:
* The new /proc/sys/net/netfilter/nf_conntrack_helper entry, from
Eric Leblond, which allows disabling the automatic conntrack helper
assignment. This patch also prints a warning to inform the user
that this behaviour will be removed at some point. The automatic
conntrack helper assignment may allow attackers to open holes in
the firewall to access the protected network segments (with
incorrect configurations). More information on this issue at:
https://home.regit.org/netfilter-en/secure-use-of-helpers/
In the near future, all conntrack helpers will be explicitly
attached via the CT target, as we discussed at length during
the last netfilter workshop.
* One new sysctl, from Florian Westphal, to translate the input device
to the vlan device name. He required this to get the REDIRECT target
working with vlan-on-top-of-bridge setups.
* Major improvements in the ip_vs_sync code from Julian Anastasov.
They aim to improve scalability and to address possible message
loss due to socket overrun under a high rate of synchronization
messages.
* Several minor memory allocation flag fixes from IPVS
contributors.
* Eric Leblond's patch exposed one problem that becomes noticeable
if a) automatic helper assignment is disabled, b) NAT is
in use, and c) the CT target is used to attach a conntrack helper
to a non-standard port. This fix is from me.
* One small update to allow updating the expectation timeout from
Kelvie Wong.
* Finally, removal of ip[6]_queue support, from me. Both have long
been marked as obsolete, and we now have nfnetlink_queue, which is
far more flexible.
You can pull these changes from:
git://1984.lsi.us.es/net-next master
If time allows, I'd like to send a second batch. There are several patches
still on netfilter-devel that are very close to being in shape.
Thanks!
Eric Dumazet (1):
netfilter: nf_conntrack: use this_cpu_inc()
Eric Leblond (1):
netfilter: nf_ct_helper: allow to disable automatic helper assignment
Florian Westphal (1):
netfilter: bridge: optionally set indev to vlan
H Hartley Sweeten (2):
ipvs: ip_vs_ftp: local functions should not be exposed globally
ipvs: ip_vs_proto: local functions should not be exposed globally
Hans Schillstrom (1):
net: export sysctl_[r|w]mem_max symbols needed by ip_vs_sync
Julian Anastasov (14):
ipvs: timeout tables do not need GFP_ATOMIC allocation
ipvs: LBLC scheduler does not need GFP_ATOMIC allocation on init
ipvs: DH scheduler does not need GFP_ATOMIC allocation
ipvs: WRR scheduler does not need GFP_ATOMIC allocation
ipvs: LBLCR scheduler does not need GFP_ATOMIC allocation on init
ipvs: SH scheduler does not need GFP_ATOMIC allocation
ipvs: ignore IP_VS_CONN_F_NOOUTPUT in backup server
ipvs: remove check for IP_VS_CONN_F_SYNC from ip_vs_bind_dest
ipvs: fix ip_vs_try_bind_dest to rebind app and transmitter
ipvs: always update some of the flags bits in backup
ipvs: wakeup master thread
ipvs: reduce sync rate with time thresholds
ipvs: add support for sync threads
ipvs: optimize the use of flags in ip_vs_bind_dest
Kelvie Wong (1):
netfilter: nf_ct_expect: partially implement ctnetlink_change_expect
Pablo Neira Ayuso (2):
netfilter: nf_conntrack: fix explicit helper attachment and NAT
netfilter: remove ip_queue support
Sasha Levin (1):
ipvs: use GFP_KERNEL allocation where possible
Tony Zelenoff (1):
netfilter: nf_ct_ecache: refactor notifier registration
Documentation/ABI/removed/ip_queue | 9 +
Documentation/networking/ip-sysctl.txt | 13 +-
include/linux/ip_vs.h | 5 +
include/linux/netfilter/nf_conntrack_common.h | 4 +
include/linux/netfilter_ipv4/Kbuild | 1 -
include/linux/netfilter_ipv4/ip_queue.h | 72 ---
include/linux/netlink.h | 2 +-
include/net/ip_vs.h | 87 +++-
include/net/netfilter/nf_conntrack.h | 10 +-
include/net/netfilter/nf_conntrack_helper.h | 4 +-
include/net/netns/conntrack.h | 3 +
net/bridge/br_netfilter.c | 26 +-
net/core/sock.c | 2 +
net/ipv4/netfilter/Makefile | 3 -
net/ipv4/netfilter/ip_queue.c | 639 ------------------------
net/ipv6/netfilter/Kconfig | 22 -
net/ipv6/netfilter/Makefile | 1 -
net/ipv6/netfilter/ip6_queue.c | 641 ------------------------
net/netfilter/ipvs/ip_vs_conn.c | 69 ++-
net/netfilter/ipvs/ip_vs_core.c | 30 +-
net/netfilter/ipvs/ip_vs_ctl.c | 70 ++-
net/netfilter/ipvs/ip_vs_dh.c | 2 +-
net/netfilter/ipvs/ip_vs_ftp.c | 2 +-
net/netfilter/ipvs/ip_vs_lblc.c | 2 +-
net/netfilter/ipvs/ip_vs_lblcr.c | 2 +-
net/netfilter/ipvs/ip_vs_proto.c | 6 +-
net/netfilter/ipvs/ip_vs_sh.c | 2 +-
net/netfilter/ipvs/ip_vs_sync.c | 662 +++++++++++++++++--------
net/netfilter/ipvs/ip_vs_wrr.c | 2 +-
net/netfilter/nf_conntrack_core.c | 15 +-
net/netfilter/nf_conntrack_ecache.c | 10 +-
net/netfilter/nf_conntrack_helper.c | 120 ++++-
net/netfilter/nf_conntrack_netlink.c | 10 +-
security/selinux/nlmsgtab.c | 13 -
34 files changed, 853 insertions(+), 1708 deletions(-)
create mode 100644 Documentation/ABI/removed/ip_queue
delete mode 100644 include/linux/netfilter_ipv4/ip_queue.h
delete mode 100644 net/ipv4/netfilter/ip_queue.c
delete mode 100644 net/ipv6/netfilter/ip6_queue.c
--
1.7.9.5
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 01/25] netfilter: nf_ct_ecache: refactor notifier registration
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 02/25] netfilter: nf_ct_helper: allow to disable automatic helper assignment pablo
` (24 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Tony Zelenoff <antonz@parallels.com>
* the useless initialization of the ret variable is removed
* duplicated unlock-and-return code is merged, so the code flow
of the functions is simpler
Signed-off-by: Tony Zelenoff <antonz@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_conntrack_ecache.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index 5bd3047d..3a3409f 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -84,7 +84,7 @@ EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events);
int nf_conntrack_register_notifier(struct net *net,
struct nf_ct_event_notifier *new)
{
- int ret = 0;
+ int ret;
struct nf_ct_event_notifier *notify;
mutex_lock(&nf_ct_ecache_mutex);
@@ -95,8 +95,7 @@ int nf_conntrack_register_notifier(struct net *net,
goto out_unlock;
}
rcu_assign_pointer(net->ct.nf_conntrack_event_cb, new);
- mutex_unlock(&nf_ct_ecache_mutex);
- return ret;
+ ret = 0;
out_unlock:
mutex_unlock(&nf_ct_ecache_mutex);
@@ -121,7 +120,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_unregister_notifier);
int nf_ct_expect_register_notifier(struct net *net,
struct nf_exp_event_notifier *new)
{
- int ret = 0;
+ int ret;
struct nf_exp_event_notifier *notify;
mutex_lock(&nf_ct_ecache_mutex);
@@ -132,8 +131,7 @@ int nf_ct_expect_register_notifier(struct net *net,
goto out_unlock;
}
rcu_assign_pointer(net->ct.nf_expect_event_cb, new);
- mutex_unlock(&nf_ct_ecache_mutex);
- return ret;
+ ret = 0;
out_unlock:
mutex_unlock(&nf_ct_ecache_mutex);
--
1.7.9.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
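The single-exit shape this refactoring arrives at can be sketched in userspace as follows. This is only an illustration under assumptions: a pthread mutex stands in for the kernel mutex, a plain pointer stands in for the RCU-protected callback, and the -EBUSY value for the "already registered" case is assumed because that branch is not visible in the hunks above. All sketch_* names are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t sketch_mutex = PTHREAD_MUTEX_INITIALIZER;
static const void *sketch_cb;	/* currently registered notifier, or NULL */

/* Register a notifier; at most one may be registered at a time. */
static int sketch_register_notifier(const void *new_cb)
{
	int ret;	/* no initializer needed: every path assigns it */

	pthread_mutex_lock(&sketch_mutex);
	if (sketch_cb != NULL) {
		ret = -EBUSY;	/* assumed error code for "already set" */
		goto out_unlock;
	}
	sketch_cb = new_cb;
	ret = 0;

out_unlock:
	pthread_mutex_unlock(&sketch_mutex);	/* the only unlock site */
	return ret;
}
```

With one unlock site, the success path and the error path cannot drift apart, and the compiler can warn if a future path forgets to set ret.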
* [PATCH 02/25] netfilter: nf_ct_helper: allow to disable automatic helper assignment
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
2012-05-08 7:49 ` [PATCH 01/25] netfilter: nf_ct_ecache: refactor notifier registration pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 03/25] netfilter: nf_conntrack: use this_cpu_inc() pablo
` (23 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Eric Leblond <eric@regit.org>
This patch allows you to disable automatic conntrack helper
lookup based on TCP/UDP ports, e.g.:
echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper
[ Note: flows that already got a helper will keep using it even
if automatic helper assignment has been disabled ]
Once this behaviour has been disabled, you have to explicitly
use the iptables CT target to attach helpers to flows.
There are good reasons to stop supporting automatic helper
assignment. For further information, please read:
http://www.netfilter.org/news.html#2012-04-03
This patch also adds one message to inform the user that automatic
helper assignment is deprecated and will be removed soon (it is
printed only once, for the first flow that gets a helper attached,
to make it as unobtrusive as possible).
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_conntrack_helper.h | 4 +-
include/net/netns/conntrack.h | 3 +
net/netfilter/nf_conntrack_core.c | 15 ++--
net/netfilter/nf_conntrack_helper.c | 108 ++++++++++++++++++++++++---
4 files changed, 109 insertions(+), 21 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack_helper.h b/include/net/netfilter/nf_conntrack_helper.h
index 5767dc2..1d18894 100644
--- a/include/net/netfilter/nf_conntrack_helper.h
+++ b/include/net/netfilter/nf_conntrack_helper.h
@@ -60,8 +60,8 @@ static inline struct nf_conn_help *nfct_help(const struct nf_conn *ct)
return nf_ct_ext_find(ct, NF_CT_EXT_HELPER);
}
-extern int nf_conntrack_helper_init(void);
-extern void nf_conntrack_helper_fini(void);
+extern int nf_conntrack_helper_init(struct net *net);
+extern void nf_conntrack_helper_fini(struct net *net);
extern int nf_conntrack_broadcast_help(struct sk_buff *skb,
unsigned int protoff,
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index 7a911ec..a053a19 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -26,11 +26,14 @@ struct netns_ct {
int sysctl_tstamp;
int sysctl_checksum;
unsigned int sysctl_log_invalid; /* Log invalid packets */
+ int sysctl_auto_assign_helper;
+ bool auto_assign_helper_warned;
#ifdef CONFIG_SYSCTL
struct ctl_table_header *sysctl_header;
struct ctl_table_header *acct_sysctl_header;
struct ctl_table_header *tstamp_sysctl_header;
struct ctl_table_header *event_sysctl_header;
+ struct ctl_table_header *helper_sysctl_header;
#endif
char *slabname;
};
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index cf0747c..32c5909 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1336,7 +1336,6 @@ static void nf_conntrack_cleanup_init_net(void)
while (untrack_refs() > 0)
schedule();
- nf_conntrack_helper_fini();
nf_conntrack_proto_fini();
#ifdef CONFIG_NF_CONNTRACK_ZONES
nf_ct_extend_unregister(&nf_ct_zone_extend);
@@ -1354,6 +1353,7 @@ static void nf_conntrack_cleanup_net(struct net *net)
}
nf_ct_free_hashtable(net->ct.hash, net->ct.htable_size);
+ nf_conntrack_helper_fini(net);
nf_conntrack_timeout_fini(net);
nf_conntrack_ecache_fini(net);
nf_conntrack_tstamp_fini(net);
@@ -1504,10 +1504,6 @@ static int nf_conntrack_init_init_net(void)
if (ret < 0)
goto err_proto;
- ret = nf_conntrack_helper_init();
- if (ret < 0)
- goto err_helper;
-
#ifdef CONFIG_NF_CONNTRACK_ZONES
ret = nf_ct_extend_register(&nf_ct_zone_extend);
if (ret < 0)
@@ -1525,10 +1521,8 @@ static int nf_conntrack_init_init_net(void)
#ifdef CONFIG_NF_CONNTRACK_ZONES
err_extend:
- nf_conntrack_helper_fini();
-#endif
-err_helper:
nf_conntrack_proto_fini();
+#endif
err_proto:
return ret;
}
@@ -1589,9 +1583,14 @@ static int nf_conntrack_init_net(struct net *net)
ret = nf_conntrack_timeout_init(net);
if (ret < 0)
goto err_timeout;
+ ret = nf_conntrack_helper_init(net);
+ if (ret < 0)
+ goto err_helper;
return 0;
+err_helper:
+ nf_conntrack_timeout_fini(net);
err_timeout:
nf_conntrack_ecache_fini(net);
err_ecache:
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 436b7cb..55234dd 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -34,6 +34,66 @@ static struct hlist_head *nf_ct_helper_hash __read_mostly;
static unsigned int nf_ct_helper_hsize __read_mostly;
static unsigned int nf_ct_helper_count __read_mostly;
+static bool nf_ct_auto_assign_helper __read_mostly = true;
+module_param_named(nf_conntrack_helper, nf_ct_auto_assign_helper, bool, 0644);
+MODULE_PARM_DESC(nf_conntrack_helper,
+ "Enable automatic conntrack helper assignment (default 1)");
+
+#ifdef CONFIG_SYSCTL
+static struct ctl_table helper_sysctl_table[] = {
+ {
+ .procname = "nf_conntrack_helper",
+ .data = &init_net.ct.sysctl_auto_assign_helper,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {}
+};
+
+static int nf_conntrack_helper_init_sysctl(struct net *net)
+{
+ struct ctl_table *table;
+
+ table = kmemdup(helper_sysctl_table, sizeof(helper_sysctl_table),
+ GFP_KERNEL);
+ if (!table)
+ goto out;
+
+ table[0].data = &net->ct.sysctl_auto_assign_helper;
+
+ net->ct.helper_sysctl_header = register_net_sysctl_table(net,
+ nf_net_netfilter_sysctl_path, table);
+ if (!net->ct.helper_sysctl_header) {
+ pr_err("nf_conntrack_helper: can't register to sysctl.\n");
+ goto out_register;
+ }
+ return 0;
+
+out_register:
+ kfree(table);
+out:
+ return -ENOMEM;
+}
+
+static void nf_conntrack_helper_fini_sysctl(struct net *net)
+{
+ struct ctl_table *table;
+
+ table = net->ct.helper_sysctl_header->ctl_table_arg;
+ unregister_net_sysctl_table(net->ct.helper_sysctl_header);
+ kfree(table);
+}
+#else
+static int nf_conntrack_helper_init_sysctl(struct net *net)
+{
+ return 0;
+}
+
+static void nf_conntrack_helper_fini_sysctl(struct net *net)
+{
+}
+#endif /* CONFIG_SYSCTL */
/* Stupid hash, but collision free for the default registrations of the
* helpers currently in the kernel. */
@@ -118,6 +178,7 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
{
struct nf_conntrack_helper *helper = NULL;
struct nf_conn_help *help;
+ struct net *net = nf_ct_net(ct);
int ret = 0;
if (tmpl != NULL) {
@@ -127,8 +188,17 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
}
help = nfct_help(ct);
- if (helper == NULL)
+ if (net->ct.sysctl_auto_assign_helper && helper == NULL) {
helper = __nf_ct_helper_find(&ct->tuplehash[IP_CT_DIR_REPLY].tuple);
+ if (unlikely(!net->ct.auto_assign_helper_warned && helper)) {
+ pr_info("nf_conntrack: automatic helper "
+ "assignment is deprecated and it will "
+ "be removed soon. Use the iptables CT target "
+ "to attach helpers instead.\n");
+ net->ct.auto_assign_helper_warned = true;
+ }
+ }
+
if (helper == NULL) {
if (help)
RCU_INIT_POINTER(help->helper, NULL);
@@ -315,28 +385,44 @@ static struct nf_ct_ext_type helper_extend __read_mostly = {
.id = NF_CT_EXT_HELPER,
};
-int nf_conntrack_helper_init(void)
+int nf_conntrack_helper_init(struct net *net)
{
int err;
- nf_ct_helper_hsize = 1; /* gets rounded up to use one page */
- nf_ct_helper_hash = nf_ct_alloc_hashtable(&nf_ct_helper_hsize, 0);
- if (!nf_ct_helper_hash)
- return -ENOMEM;
+ net->ct.auto_assign_helper_warned = false;
+ net->ct.sysctl_auto_assign_helper = nf_ct_auto_assign_helper;
- err = nf_ct_extend_register(&helper_extend);
+ if (net_eq(net, &init_net)) {
+ nf_ct_helper_hsize = 1; /* gets rounded up to use one page */
+ nf_ct_helper_hash =
+ nf_ct_alloc_hashtable(&nf_ct_helper_hsize, 0);
+ if (!nf_ct_helper_hash)
+ return -ENOMEM;
+
+ err = nf_ct_extend_register(&helper_extend);
+ if (err < 0)
+ goto err1;
+ }
+
+ err = nf_conntrack_helper_init_sysctl(net);
if (err < 0)
- goto err1;
+ goto out_sysctl;
return 0;
+out_sysctl:
+ if (net_eq(net, &init_net))
+ nf_ct_extend_unregister(&helper_extend);
err1:
nf_ct_free_hashtable(nf_ct_helper_hash, nf_ct_helper_hsize);
return err;
}
-void nf_conntrack_helper_fini(void)
+void nf_conntrack_helper_fini(struct net *net)
{
- nf_ct_extend_unregister(&helper_extend);
- nf_ct_free_hashtable(nf_ct_helper_hash, nf_ct_helper_hsize);
+ nf_conntrack_helper_fini_sysctl(net);
+ if (net_eq(net, &init_net)) {
+ nf_ct_extend_unregister(&helper_extend);
+ nf_ct_free_hashtable(nf_ct_helper_hash, nf_ct_helper_hsize);
+ }
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
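The decision made in __nf_ct_try_assign_helper after this patch can be summarised by the userspace sketch below. It is a simplification under assumptions: helpers are represented by plain strings, the port-based lookup result is passed in as an argument, and the sketch_* names are hypothetical. It shows that an explicitly attached helper always wins, that the port-based lookup is consulted only when the per-namespace switch is on, and that the deprecation message is printed once per namespace.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the two fields added to struct netns_ct. */
struct sketch_netns {
	int  sysctl_auto_assign_helper;	/* per-namespace switch */
	bool auto_assign_helper_warned;	/* set after the first warning */
};

/*
 * explicit_helper: helper taken from a template (CT target), or NULL.
 * port_helper:     what a port-based lookup would return, or NULL.
 */
static const char *sketch_pick_helper(struct sketch_netns *net,
				      const char *explicit_helper,
				      const char *port_helper)
{
	const char *helper = explicit_helper;

	if (net->sysctl_auto_assign_helper && helper == NULL) {
		helper = port_helper;
		if (!net->auto_assign_helper_warned && helper) {
			fprintf(stderr,
				"automatic helper assignment is deprecated\n");
			net->auto_assign_helper_warned = true;
		}
	}
	return helper;
}
```

With the switch off, a flow gets a helper only through explicit attachment, which is the behaviour the commit message describes for "echo 0".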
* [PATCH 03/25] netfilter: nf_conntrack: use this_cpu_inc()
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
2012-05-08 7:49 ` [PATCH 01/25] netfilter: nf_ct_ecache: refactor notifier registration pablo
2012-05-08 7:49 ` [PATCH 02/25] netfilter: nf_ct_helper: allow to disable automatic helper assignment pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 04/25] netfilter: bridge: optionally set indev to vlan pablo
` (22 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Eric Dumazet <edumazet@google.com>
this_cpu_inc() is IRQ safe and faster than
local_bh_disable()/__this_cpu_inc()/local_bh_enable(), at least on x86.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/net/netfilter/nf_conntrack.h | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index ab86036..cce7f6a 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -321,14 +321,8 @@ extern unsigned int nf_conntrack_max;
extern unsigned int nf_conntrack_hash_rnd;
void init_nf_conntrack_hash_rnd(void);
-#define NF_CT_STAT_INC(net, count) \
- __this_cpu_inc((net)->ct.stat->count)
-#define NF_CT_STAT_INC_ATOMIC(net, count) \
-do { \
- local_bh_disable(); \
- __this_cpu_inc((net)->ct.stat->count); \
- local_bh_enable(); \
-} while (0)
+#define NF_CT_STAT_INC(net, count) __this_cpu_inc((net)->ct.stat->count)
+#define NF_CT_STAT_INC_ATOMIC(net, count) this_cpu_inc((net)->ct.stat->count)
#define MODULE_ALIAS_NFCT_HELPER(helper) \
MODULE_ALIAS("nfct-helper-" helper)
--
1.7.9.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 04/25] netfilter: bridge: optionally set indev to vlan
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (2 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 03/25] netfilter: nf_conntrack: use this_cpu_inc() pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` pablo
` (21 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Florian Westphal <fw@strlen.de>
If the net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled, bridge
netfilter removes the vlan header temporarily and then feeds the packet
to ip(6)tables.
When the new "bridge-nf-pass-vlan-input-dev" sysctl is on
(default off), bridge netfilter will also set the
in-interface to the vlan interface, if such an interface exists.
This is needed to make the iptables REDIRECT target work with
"vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to
match the vlan device name.
Also update Documentation with current brnf default settings.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
Documentation/networking/ip-sysctl.txt | 13 +++++++++++--
net/bridge/br_netfilter.c | 26 ++++++++++++++++++++++++--
2 files changed, 35 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index bd80ba5..edff76d 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1287,13 +1287,22 @@ bridge-nf-call-ip6tables - BOOLEAN
bridge-nf-filter-vlan-tagged - BOOLEAN
1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
0 : disable this.
- Default: 1
+ Default: 0
bridge-nf-filter-pppoe-tagged - BOOLEAN
1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
0 : disable this.
- Default: 1
+ Default: 0
+bridge-nf-pass-vlan-input-dev - BOOLEAN
+ 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan
+ interface on the bridge and set the netfilter input device to the vlan.
+ This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT
+ target work with vlan-on-top-of-bridge interfaces. When no matching
+ vlan interface is found, or this switch is off, the input device is
+ set to the bridge interface.
+ 0: disable bridge netfilter vlan interface lookup.
+ Default: 0
proc/sys/net/sctp/* Variables:
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index dec4f38..2dca7fb 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -54,12 +54,14 @@ static int brnf_call_ip6tables __read_mostly = 1;
static int brnf_call_arptables __read_mostly = 1;
static int brnf_filter_vlan_tagged __read_mostly = 0;
static int brnf_filter_pppoe_tagged __read_mostly = 0;
+static int brnf_pass_vlan_indev __read_mostly = 0;
#else
#define brnf_call_iptables 1
#define brnf_call_ip6tables 1
#define brnf_call_arptables 1
#define brnf_filter_vlan_tagged 0
#define brnf_filter_pppoe_tagged 0
+#define brnf_pass_vlan_indev 0
#endif
#define IS_IP(skb) \
@@ -503,6 +505,19 @@ bridged_dnat:
return 0;
}
+static struct net_device *brnf_get_logical_dev(struct sk_buff *skb, const struct net_device *dev)
+{
+ struct net_device *vlan, *br;
+
+ br = bridge_parent(dev);
+ if (brnf_pass_vlan_indev == 0 || !vlan_tx_tag_present(skb))
+ return br;
+
+ vlan = __vlan_find_dev_deep(br, vlan_tx_tag_get(skb) & VLAN_VID_MASK);
+
+ return vlan ? vlan : br;
+}
+
/* Some common code for IPv4/IPv6 */
static struct net_device *setup_pre_routing(struct sk_buff *skb)
{
@@ -515,7 +530,7 @@ static struct net_device *setup_pre_routing(struct sk_buff *skb)
nf_bridge->mask |= BRNF_NF_BRIDGE_PREROUTING;
nf_bridge->physindev = skb->dev;
- skb->dev = bridge_parent(skb->dev);
+ skb->dev = brnf_get_logical_dev(skb, skb->dev);
if (skb->protocol == htons(ETH_P_8021Q))
nf_bridge->mask |= BRNF_8021Q;
else if (skb->protocol == htons(ETH_P_PPP_SES))
@@ -778,7 +793,7 @@ static unsigned int br_nf_forward_ip(unsigned int hook, struct sk_buff *skb,
else
skb->protocol = htons(ETH_P_IPV6);
- NF_HOOK(pf, NF_INET_FORWARD, skb, bridge_parent(in), parent,
+ NF_HOOK(pf, NF_INET_FORWARD, skb, brnf_get_logical_dev(skb, in), parent,
br_nf_forward_finish);
return NF_STOLEN;
@@ -1006,6 +1021,13 @@ static ctl_table brnf_table[] = {
.mode = 0644,
.proc_handler = brnf_sysctl_call_tables,
},
+ {
+ .procname = "bridge-nf-pass-vlan-input-dev",
+ .data = &brnf_pass_vlan_indev,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = brnf_sysctl_call_tables,
+ },
{ }
};
--
1.7.9.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
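The selection rule implemented by brnf_get_logical_dev can be modelled in userspace as below. This is a sketch under assumptions: devices are reduced to a name and a VLAN ID, the vlan lookup is a linear search over an array supplied by the caller, and all sketch_* names are hypothetical. The 0x0fff mask reflects the 12-bit VLAN ID field of an 802.1Q tag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_VLAN_VID_MASK 0x0fff	/* low 12 bits of the tag: VLAN ID */

/* Hypothetical stand-in for struct net_device. */
struct sketch_dev {
	const char *name;
	uint16_t vid;	/* VLAN ID for vlan devices, unused for the bridge */
};

/*
 * Return the device netfilter should see as the input interface:
 * the matching vlan device if the switch is on and the packet is
 * tagged, otherwise the bridge itself.
 */
static const struct sketch_dev *
sketch_get_logical_dev(int pass_vlan_indev, bool tag_present, uint16_t tci,
		       const struct sketch_dev *br,
		       const struct sketch_dev *vlans, size_t n_vlans)
{
	size_t i;

	if (pass_vlan_indev == 0 || !tag_present)
		return br;

	for (i = 0; i < n_vlans; i++)
		if (vlans[i].vid == (tci & SKETCH_VLAN_VID_MASK))
			return &vlans[i];

	return br;	/* no matching vlan interface: fall back to bridge */
}
```

Masking the tag before the comparison means the priority bits carried in the same 16-bit field do not affect which vlan device is chosen.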
* [PATCH 05/25] ipvs: timeout tables do not need GFP_ATOMIC allocation
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (4 preceding siblings ...)
2012-05-08 7:49 ` pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 06/25] ipvs: LBLC scheduler does not need GFP_ATOMIC allocation on init pablo
` (19 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
The timeout tables are allocated only on initialization.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_proto.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index 6eda11d..a981b7c 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -196,7 +196,7 @@ void ip_vs_protocol_timeout_change(struct netns_ipvs *ipvs, int flags)
int *
ip_vs_create_timeout_table(int *table, int size)
{
- return kmemdup(table, size, GFP_ATOMIC);
+ return kmemdup(table, size, GFP_KERNEL);
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 06/25] ipvs: LBLC scheduler does not need GFP_ATOMIC allocation on init
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (5 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 05/25] ipvs: timeout tables do not need GFP_ATOMIC allocation pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 07/25] ipvs: DH scheduler does not need GFP_ATOMIC allocation pablo
` (18 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Schedulers are initialized and bound to services only
on commands.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_lblc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 27c24f1..7ba1672 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -342,7 +342,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc)
/*
* Allocate the ip_vs_lblc_table for this service
*/
- tbl = kmalloc(sizeof(*tbl), GFP_ATOMIC);
+ tbl = kmalloc(sizeof(*tbl), GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
--
1.7.9.5
* [PATCH 07/25] ipvs: DH scheduler does not need GFP_ATOMIC allocation
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (6 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 06/25] ipvs: LBLC scheduler does not need GFP_ATOMIC allocation on init pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 08/25] ipvs: WRR " pablo
` (17 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Schedulers are initialized and bound to services only
on commands.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_dh.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_dh.c b/net/netfilter/ipvs/ip_vs_dh.c
index 1a53a7a..8b7dca9 100644
--- a/net/netfilter/ipvs/ip_vs_dh.c
+++ b/net/netfilter/ipvs/ip_vs_dh.c
@@ -149,7 +149,7 @@ static int ip_vs_dh_init_svc(struct ip_vs_service *svc)
/* allocate the DH table for this service */
tbl = kmalloc(sizeof(struct ip_vs_dh_bucket)*IP_VS_DH_TAB_SIZE,
- GFP_ATOMIC);
+ GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
--
1.7.9.5
* [PATCH 08/25] ipvs: WRR scheduler does not need GFP_ATOMIC allocation
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (7 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 07/25] ipvs: DH scheduler does not need GFP_ATOMIC allocation pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 10/25] ipvs: SH " pablo
` (16 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Schedulers are initialized and bound to services only
on commands.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_wrr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_wrr.c b/net/netfilter/ipvs/ip_vs_wrr.c
index fd0d4e0..231be7d 100644
--- a/net/netfilter/ipvs/ip_vs_wrr.c
+++ b/net/netfilter/ipvs/ip_vs_wrr.c
@@ -84,7 +84,7 @@ static int ip_vs_wrr_init_svc(struct ip_vs_service *svc)
/*
* Allocate the mark variable for WRR scheduling
*/
- mark = kmalloc(sizeof(struct ip_vs_wrr_mark), GFP_ATOMIC);
+ mark = kmalloc(sizeof(struct ip_vs_wrr_mark), GFP_KERNEL);
if (mark == NULL)
return -ENOMEM;
--
1.7.9.5
* [PATCH 10/25] ipvs: SH scheduler does not need GFP_ATOMIC allocation
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (8 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 08/25] ipvs: WRR " pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 11/25] ipvs: use GFP_KERNEL allocation where possible pablo
` (15 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Schedulers are initialized and bound to services only
on commands.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_sh.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_sh.c b/net/netfilter/ipvs/ip_vs_sh.c
index 91e97ee..0512652 100644
--- a/net/netfilter/ipvs/ip_vs_sh.c
+++ b/net/netfilter/ipvs/ip_vs_sh.c
@@ -162,7 +162,7 @@ static int ip_vs_sh_init_svc(struct ip_vs_service *svc)
/* allocate the SH table for this service */
tbl = kmalloc(sizeof(struct ip_vs_sh_bucket)*IP_VS_SH_TAB_SIZE,
- GFP_ATOMIC);
+ GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
--
1.7.9.5
* [PATCH 11/25] ipvs: use GFP_KERNEL allocation where possible
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (9 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 10/25] ipvs: SH " pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 12/25] ipvs: ignore IP_VS_CONN_F_NOOUTPUT in backup server pablo
` (14 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Sasha Levin <levinsasha928@gmail.com>
Use GFP_KERNEL instead of GFP_ATOMIC when registering an ipvs protocol.
This is safe since it will always run from a process context.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/ipvs/ip_vs_proto.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index a981b7c..8726488 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -71,7 +71,7 @@ register_ip_vs_proto_netns(struct net *net, struct ip_vs_protocol *pp)
struct netns_ipvs *ipvs = net_ipvs(net);
unsigned int hash = IP_VS_PROTO_HASH(pp->protocol);
struct ip_vs_proto_data *pd =
- kzalloc(sizeof(struct ip_vs_proto_data), GFP_ATOMIC);
+ kzalloc(sizeof(struct ip_vs_proto_data), GFP_KERNEL);
if (!pd)
return -ENOMEM;
--
1.7.9.5
* [PATCH 12/25] ipvs: ignore IP_VS_CONN_F_NOOUTPUT in backup server
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (10 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 11/25] ipvs: use GFP_KERNEL allocation where possible pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 13/25] ipvs: remove check for IP_VS_CONN_F_SYNC from ip_vs_bind_dest pablo
` (13 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
As IP_VS_CONN_F_NOOUTPUT is derived from the
forwarding method, we should get it from conn_flags, just
as we do for the IP_VS_CONN_F_FWD_MASK bits when binding
to the real server.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_conn.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 4a09b78..f562e63 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -567,7 +567,7 @@ ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest)
if (!(cp->flags & IP_VS_CONN_F_TEMPLATE))
conn_flags &= ~IP_VS_CONN_F_INACTIVE;
/* connections inherit forwarding method from dest */
- cp->flags &= ~IP_VS_CONN_F_FWD_MASK;
+ cp->flags &= ~(IP_VS_CONN_F_FWD_MASK | IP_VS_CONN_F_NOOUTPUT);
}
cp->flags |= conn_flags;
cp->dest = dest;
--
1.7.9.5
* [PATCH 13/25] ipvs: remove check for IP_VS_CONN_F_SYNC from ip_vs_bind_dest
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (11 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 12/25] ipvs: ignore IP_VS_CONN_F_NOOUTPUT in backup server pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 14/25] ipvs: fix ip_vs_try_bind_dest to rebind app and transmitter pablo
` (12 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
As the IP_VS_CONN_F_INACTIVE bit is properly set
in cp->flags for all kinds of connections, we do not need to
add special checks for synced connections when updating
the activeconns/inactconns counters for the first time. Now
the logic will look just like in ip_vs_unbind_dest.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_conn.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index f562e63..1c1bb30 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -585,11 +585,11 @@ ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest)
/* Update the connection counters */
if (!(cp->flags & IP_VS_CONN_F_TEMPLATE)) {
- /* It is a normal connection, so increase the inactive
- connection counter because it is in TCP SYNRECV
- state (inactive) or other protocol inacive state */
- if ((cp->flags & IP_VS_CONN_F_SYNC) &&
- (!(cp->flags & IP_VS_CONN_F_INACTIVE)))
+ /* It is a normal connection, so modify the counters
+ * according to the flags, later the protocol can
+ * update them on state change
+ */
+ if (!(cp->flags & IP_VS_CONN_F_INACTIVE))
atomic_inc(&dest->activeconns);
else
atomic_inc(&dest->inactconns);
--
1.7.9.5
* [PATCH 14/25] ipvs: fix ip_vs_try_bind_dest to rebind app and transmitter
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (12 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 13/25] ipvs: remove check for IP_VS_CONN_F_SYNC from ip_vs_bind_dest pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 15/25] ipvs: always update some of the flags bits in backup pablo
` (11 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Initially, when the synced connection is created, we
use the forwarding method provided by the master, but once we
bind to a destination it can be changed. As a result, we must
update the application and the transmitter.
As ip_vs_try_bind_dest is always called for connections
that require dest binding, there is no need to validate the
cp and dest pointers.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_conn.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 1c1bb30..fd74f88 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -613,14 +613,33 @@ struct ip_vs_dest *ip_vs_try_bind_dest(struct ip_vs_conn *cp)
{
struct ip_vs_dest *dest;
- if ((cp) && (!cp->dest)) {
- dest = ip_vs_find_dest(ip_vs_conn_net(cp), cp->af, &cp->daddr,
- cp->dport, &cp->vaddr, cp->vport,
- cp->protocol, cp->fwmark, cp->flags);
+ dest = ip_vs_find_dest(ip_vs_conn_net(cp), cp->af, &cp->daddr,
+ cp->dport, &cp->vaddr, cp->vport,
+ cp->protocol, cp->fwmark, cp->flags);
+ if (dest) {
+ struct ip_vs_proto_data *pd;
+
+ /* Applications work depending on the forwarding method
+ * but better to reassign them always when binding dest */
+ if (cp->app)
+ ip_vs_unbind_app(cp);
+
ip_vs_bind_dest(cp, dest);
- return dest;
- } else
- return NULL;
+
+ /* Update its packet transmitter */
+ cp->packet_xmit = NULL;
+#ifdef CONFIG_IP_VS_IPV6
+ if (cp->af == AF_INET6)
+ ip_vs_bind_xmit_v6(cp);
+ else
+#endif
+ ip_vs_bind_xmit(cp);
+
+ pd = ip_vs_proto_data_get(ip_vs_conn_net(cp), cp->protocol);
+ if (pd && atomic_read(&pd->appcnt))
+ ip_vs_bind_app(cp, pd->pp);
+ }
+ return dest;
}
--
1.7.9.5
* [PATCH 15/25] ipvs: always update some of the flags bits in backup
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (13 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 14/25] ipvs: fix ip_vs_try_bind_dest to rebind app and transmitter pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 16/25] ipvs: wakeup master thread pablo
` (10 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
As the goal is to mirror the inactconns/activeconns
counters in the backup server, make sure the cp->flags are
updated even if cp is not yet bound to dest. If cp->flags
are not updated, ip_vs_bind_dest will rely only on the initial
flags when updating the counters. To avoid mistakes and
complicated checks for protocol state, rely only on the
IP_VS_CONN_F_INACTIVE bit when updating the counters.
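The flag handling described above can be sketched in plain userspace C.
This is only an illustration, not the kernel code: the struct, the
function name backup_update_flags and the shortened CONN_F_* names are
hypothetical, and the numeric flag values are assumed to match
include/linux/ip_vs.h.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed flag values, for illustration only */
#define CONN_F_INACTIVE        0x0100u
#define CONN_F_SEQ_MASK        0x0600u
#define CONN_F_TEMPLATE        0x1000u
/* Bits the backup is allowed to update from a sync message */
#define CONN_F_BACKUP_UPD_MASK (CONN_F_INACTIVE | CONN_F_SEQ_MASK)

/* Hypothetical stand-in for the per-destination counters */
struct dest_counters {
	int activeconns;
	int inactconns;
};

/*
 * Merge flags received from the master into the current flags.
 * If the INACTIVE bit changed for a non-template connection that is
 * bound to a destination, move it between the two counters.
 * dest may be NULL when the connection is not bound yet.
 */
static unsigned int backup_update_flags(unsigned int cur, unsigned int recv,
					struct dest_counters *dest)
{
	if (((cur ^ recv) & CONN_F_INACTIVE) &&
	    !(recv & CONN_F_TEMPLATE) && dest) {
		if (recv & CONN_F_INACTIVE) {
			dest->activeconns--;
			dest->inactconns++;
		} else {
			dest->activeconns++;
			dest->inactconns--;
		}
	}
	/* Take only the updatable bits from the message, keep the rest */
	return (recv & CONN_F_BACKUP_UPD_MASK) |
	       (cur & ~CONN_F_BACKUP_UPD_MASK);
}
```

Note that the forwarding bits in the received flags are ignored on
update; only the INACTIVE and sequence bits are taken from the message.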
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
include/linux/ip_vs.h | 5 +++
net/netfilter/ipvs/ip_vs_sync.c | 65 ++++++++++++++-------------------------
2 files changed, 28 insertions(+), 42 deletions(-)
diff --git a/include/linux/ip_vs.h b/include/linux/ip_vs.h
index be0ef3d..8a2d438 100644
--- a/include/linux/ip_vs.h
+++ b/include/linux/ip_vs.h
@@ -89,6 +89,7 @@
#define IP_VS_CONN_F_TEMPLATE 0x1000 /* template, not connection */
#define IP_VS_CONN_F_ONE_PACKET 0x2000 /* forward only one packet */
+/* Initial bits allowed in backup server */
#define IP_VS_CONN_F_BACKUP_MASK (IP_VS_CONN_F_FWD_MASK | \
IP_VS_CONN_F_NOOUTPUT | \
IP_VS_CONN_F_INACTIVE | \
@@ -97,6 +98,10 @@
IP_VS_CONN_F_TEMPLATE \
)
+/* Bits allowed to update in backup server */
+#define IP_VS_CONN_F_BACKUP_UPD_MASK (IP_VS_CONN_F_INACTIVE | \
+ IP_VS_CONN_F_SEQ_MASK)
+
/* Flags that are not sent to backup server start from bit 16 */
#define IP_VS_CONN_F_NFCT (1 << 16) /* use netfilter conntrack */
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index f4e0b6c..eeed767 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -731,9 +731,30 @@ static void ip_vs_proc_conn(struct net *net, struct ip_vs_conn_param *param,
else
cp = ip_vs_ct_in_get(param);
- if (cp && param->pe_data) /* Free pe_data */
+ if (cp) {
+ /* Free pe_data */
kfree(param->pe_data);
- if (!cp) {
+
+ dest = cp->dest;
+ if ((cp->flags ^ flags) & IP_VS_CONN_F_INACTIVE &&
+ !(flags & IP_VS_CONN_F_TEMPLATE) && dest) {
+ if (flags & IP_VS_CONN_F_INACTIVE) {
+ atomic_dec(&dest->activeconns);
+ atomic_inc(&dest->inactconns);
+ } else {
+ atomic_inc(&dest->activeconns);
+ atomic_dec(&dest->inactconns);
+ }
+ }
+ flags &= IP_VS_CONN_F_BACKUP_UPD_MASK;
+ flags |= cp->flags & ~IP_VS_CONN_F_BACKUP_UPD_MASK;
+ cp->flags = flags;
+ if (!dest) {
+ dest = ip_vs_try_bind_dest(cp);
+ if (dest)
+ atomic_dec(&dest->refcnt);
+ }
+ } else {
/*
* Find the appropriate destination for the connection.
* If it is not found the connection will remain unbound
@@ -742,18 +763,6 @@ static void ip_vs_proc_conn(struct net *net, struct ip_vs_conn_param *param,
dest = ip_vs_find_dest(net, type, daddr, dport, param->vaddr,
param->vport, protocol, fwmark, flags);
- /* Set the approprite ativity flag */
- if (protocol == IPPROTO_TCP) {
- if (state != IP_VS_TCP_S_ESTABLISHED)
- flags |= IP_VS_CONN_F_INACTIVE;
- else
- flags &= ~IP_VS_CONN_F_INACTIVE;
- } else if (protocol == IPPROTO_SCTP) {
- if (state != IP_VS_SCTP_S_ESTABLISHED)
- flags |= IP_VS_CONN_F_INACTIVE;
- else
- flags &= ~IP_VS_CONN_F_INACTIVE;
- }
cp = ip_vs_conn_new(param, daddr, dport, flags, dest, fwmark);
if (dest)
atomic_dec(&dest->refcnt);
@@ -763,34 +772,6 @@ static void ip_vs_proc_conn(struct net *net, struct ip_vs_conn_param *param,
IP_VS_DBG(2, "BACKUP, add new conn. failed\n");
return;
}
- } else if (!cp->dest) {
- dest = ip_vs_try_bind_dest(cp);
- if (dest)
- atomic_dec(&dest->refcnt);
- } else if ((cp->dest) && (cp->protocol == IPPROTO_TCP) &&
- (cp->state != state)) {
- /* update active/inactive flag for the connection */
- dest = cp->dest;
- if (!(cp->flags & IP_VS_CONN_F_INACTIVE) &&
- (state != IP_VS_TCP_S_ESTABLISHED)) {
- atomic_dec(&dest->activeconns);
- atomic_inc(&dest->inactconns);
- cp->flags |= IP_VS_CONN_F_INACTIVE;
- } else if ((cp->flags & IP_VS_CONN_F_INACTIVE) &&
- (state == IP_VS_TCP_S_ESTABLISHED)) {
- atomic_inc(&dest->activeconns);
- atomic_dec(&dest->inactconns);
- cp->flags &= ~IP_VS_CONN_F_INACTIVE;
- }
- } else if ((cp->dest) && (cp->protocol == IPPROTO_SCTP) &&
- (cp->state != state)) {
- dest = cp->dest;
- if (!(cp->flags & IP_VS_CONN_F_INACTIVE) &&
- (state != IP_VS_SCTP_S_ESTABLISHED)) {
- atomic_dec(&dest->activeconns);
- atomic_inc(&dest->inactconns);
- cp->flags &= ~IP_VS_CONN_F_INACTIVE;
- }
}
if (opt)
--
1.7.9.5
* [PATCH 16/25] ipvs: wakeup master thread
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (14 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 15/25] ipvs: always update some of the flags bits in backup pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 17/25] ipvs: reduce sync rate with time thresholds pablo
` (9 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
A high rate of sync messages in the master can lead to
overflowing the socket buffer and dropping the messages.
A fixed sleep of 1 second without wakeup events is not suitable
for loaded masters.
Use delayed_work to schedule sending for queued messages
and limit the delay to IPVS_SYNC_SEND_DELAY (20ms). This will
reduce the rate of wakeups, but to avoid sending long bursts we
wake up the master thread after IPVS_SYNC_WAKEUP_RATE (8) messages.
Add a hard limit for the queued messages before sending
by using the "sync_qlen_max" sysctl var. It defaults to 1/32 of
the memory pages but actually represents a number of messages.
It will protect us from allocating large parts of memory
when the sending rate is lower than the queuing rate.
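The queueing policy from the last two paragraphs can be modelled in
userspace C roughly as below. This is a sketch under assumptions: the
struct and function names are hypothetical, the delayed work and the
thread wakeup are replaced by plain counters, and locking is left out.

```c
#include <assert.h>

#define WAKEUP_RATE 8	/* mirrors IPVS_SYNC_WAKEUP_RATE */

/* Hypothetical model of the master's sync queue state */
struct sync_queue {
	int len;		/* buffers waiting to be sent */
	unsigned int delay;	/* buffers queued since the queue was empty */
	int qlen_max;		/* hard limit (sync_qlen_max) */
	int timer_armed;	/* stands in for schedule_delayed_work() */
	int wakeups;		/* stands in for wake_up_process() calls */
	int dropped;		/* buffers released because of the limit */
};

/* Queue one buffer, or drop it when the hard limit is reached */
static void queue_buffer(struct sync_queue *q)
{
	if (q->len >= q->qlen_max) {
		q->dropped++;
		return;
	}
	if (q->len == 0)
		q->timer_armed = 1;	/* first buffer arms the delay timer */
	q->len++;
	if (++q->delay == WAKEUP_RATE)
		q->wakeups++;		/* wake the sender early */
}

/* Remove one buffer; returns 0 when the queue is empty */
static int dequeue_buffer(struct sync_queue *q)
{
	if (q->len == 0)
		return 0;
	q->len--;
	if (q->len == 0)
		q->delay = 0;		/* start counting again */
	return 1;
}
```

With a limit of 10, queueing 12 buffers in a row accepts 10, drops 2 and
triggers exactly one early wakeup, at the 8th buffer.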
As suggested by Pablo, add new sysctl var
"sync_sock_size" to configure the SNDBUF (master) or
RCVBUF (slave) socket limit. Default value is 0 (preserve
system defaults).
Change the master thread to detect and block on
SNDBUF overflow, so that we do not drop messages when
the socket limit is low but the sync_qlen_max limit is
not reached. On ENOBUFS or other errors just drop the
messages.
Change master thread to enter TASK_INTERRUPTIBLE
state early, so that we do not miss wakeups due to messages or
kthread_should_stop event.
Thanks to Pablo Neira Ayuso for his valuable feedback!
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
include/net/ip_vs.h | 29 ++++++++
net/netfilter/ipvs/ip_vs_ctl.c | 16 +++++
net/netfilter/ipvs/ip_vs_sync.c | 149 ++++++++++++++++++++++++++++++---------
3 files changed, 162 insertions(+), 32 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index a903a82..8721a78 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -870,6 +870,8 @@ struct netns_ipvs {
#endif
int sysctl_snat_reroute;
int sysctl_sync_ver;
+ int sysctl_sync_qlen_max;
+ int sysctl_sync_sock_size;
int sysctl_cache_bypass;
int sysctl_expire_nodest_conn;
int sysctl_expire_quiescent_template;
@@ -890,6 +892,9 @@ struct netns_ipvs {
struct timer_list est_timer; /* Estimation timer */
/* ip_vs_sync */
struct list_head sync_queue;
+ int sync_queue_len;
+ unsigned int sync_queue_delay;
+ struct delayed_work master_wakeup_work;
spinlock_t sync_lock;
struct ip_vs_sync_buff *sync_buff;
spinlock_t sync_buff_lock;
@@ -912,6 +917,10 @@ struct netns_ipvs {
#define DEFAULT_SYNC_THRESHOLD 3
#define DEFAULT_SYNC_PERIOD 50
#define DEFAULT_SYNC_VER 1
+#define IPVS_SYNC_WAKEUP_RATE 8
+#define IPVS_SYNC_QLEN_MAX (IPVS_SYNC_WAKEUP_RATE * 4)
+#define IPVS_SYNC_SEND_DELAY (HZ / 50)
+#define IPVS_SYNC_CHECK_PERIOD HZ
#ifdef CONFIG_SYSCTL
@@ -930,6 +939,16 @@ static inline int sysctl_sync_ver(struct netns_ipvs *ipvs)
return ipvs->sysctl_sync_ver;
}
+static inline int sysctl_sync_qlen_max(struct netns_ipvs *ipvs)
+{
+ return ipvs->sysctl_sync_qlen_max;
+}
+
+static inline int sysctl_sync_sock_size(struct netns_ipvs *ipvs)
+{
+ return ipvs->sysctl_sync_sock_size;
+}
+
#else
static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -947,6 +966,16 @@ static inline int sysctl_sync_ver(struct netns_ipvs *ipvs)
return DEFAULT_SYNC_VER;
}
+static inline int sysctl_sync_qlen_max(struct netns_ipvs *ipvs)
+{
+ return IPVS_SYNC_QLEN_MAX;
+}
+
+static inline int sysctl_sync_sock_size(struct netns_ipvs *ipvs)
+{
+ return 0;
+}
+
#endif
/*
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index b8d0df7..854e9a6 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1718,6 +1718,18 @@ static struct ctl_table vs_vars[] = {
.proc_handler = &proc_do_sync_mode,
},
{
+ .procname = "sync_qlen_max",
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sync_sock_size",
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "cache_bypass",
.maxlen = sizeof(int),
.mode = 0644,
@@ -3662,6 +3674,10 @@ int __net_init ip_vs_control_net_init_sysctl(struct net *net)
tbl[idx++].data = &ipvs->sysctl_snat_reroute;
ipvs->sysctl_sync_ver = 1;
tbl[idx++].data = &ipvs->sysctl_sync_ver;
+ ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+ tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+ ipvs->sysctl_sync_sock_size = 0;
+ tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
tbl[idx++].data = &ipvs->sysctl_cache_bypass;
tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
tbl[idx++].data = &ipvs->sysctl_expire_quiescent_template;
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index eeed767..eafc1d2 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -307,11 +307,15 @@ static inline struct ip_vs_sync_buff *sb_dequeue(struct netns_ipvs *ipvs)
spin_lock_bh(&ipvs->sync_lock);
if (list_empty(&ipvs->sync_queue)) {
sb = NULL;
+ __set_current_state(TASK_INTERRUPTIBLE);
} else {
sb = list_entry(ipvs->sync_queue.next,
struct ip_vs_sync_buff,
list);
list_del(&sb->list);
+ ipvs->sync_queue_len--;
+ if (!ipvs->sync_queue_len)
+ ipvs->sync_queue_delay = 0;
}
spin_unlock_bh(&ipvs->sync_lock);
@@ -358,9 +362,16 @@ static inline void sb_queue_tail(struct netns_ipvs *ipvs)
struct ip_vs_sync_buff *sb = ipvs->sync_buff;
spin_lock(&ipvs->sync_lock);
- if (ipvs->sync_state & IP_VS_STATE_MASTER)
+ if (ipvs->sync_state & IP_VS_STATE_MASTER &&
+ ipvs->sync_queue_len < sysctl_sync_qlen_max(ipvs)) {
+ if (!ipvs->sync_queue_len)
+ schedule_delayed_work(&ipvs->master_wakeup_work,
+ max(IPVS_SYNC_SEND_DELAY, 1));
+ ipvs->sync_queue_len++;
list_add_tail(&sb->list, &ipvs->sync_queue);
- else
+ if ((++ipvs->sync_queue_delay) == IPVS_SYNC_WAKEUP_RATE)
+ wake_up_process(ipvs->master_thread);
+ } else
ip_vs_sync_buff_release(sb);
spin_unlock(&ipvs->sync_lock);
}
@@ -379,6 +390,7 @@ get_curr_sync_buff(struct netns_ipvs *ipvs, unsigned long time)
time_after_eq(jiffies - ipvs->sync_buff->firstuse, time)) {
sb = ipvs->sync_buff;
ipvs->sync_buff = NULL;
+ __set_current_state(TASK_RUNNING);
} else
sb = NULL;
spin_unlock_bh(&ipvs->sync_buff_lock);
@@ -392,26 +404,23 @@ get_curr_sync_buff(struct netns_ipvs *ipvs, unsigned long time)
void ip_vs_sync_switch_mode(struct net *net, int mode)
{
struct netns_ipvs *ipvs = net_ipvs(net);
+ struct ip_vs_sync_buff *sb;
+ spin_lock_bh(&ipvs->sync_buff_lock);
if (!(ipvs->sync_state & IP_VS_STATE_MASTER))
- return;
- if (mode == sysctl_sync_ver(ipvs) || !ipvs->sync_buff)
- return;
+ goto unlock;
+ sb = ipvs->sync_buff;
+ if (mode == sysctl_sync_ver(ipvs) || !sb)
+ goto unlock;
- spin_lock_bh(&ipvs->sync_buff_lock);
/* Buffer empty ? then let buf_create do the job */
- if (ipvs->sync_buff->mesg->size <= sizeof(struct ip_vs_sync_mesg)) {
- kfree(ipvs->sync_buff);
+ if (sb->mesg->size <= sizeof(struct ip_vs_sync_mesg)) {
+ ip_vs_sync_buff_release(sb);
ipvs->sync_buff = NULL;
- } else {
- spin_lock_bh(&ipvs->sync_lock);
- if (ipvs->sync_state & IP_VS_STATE_MASTER)
- list_add_tail(&ipvs->sync_buff->list,
- &ipvs->sync_queue);
- else
- ip_vs_sync_buff_release(ipvs->sync_buff);
- spin_unlock_bh(&ipvs->sync_lock);
- }
+ } else
+ sb_queue_tail(ipvs);
+
+unlock:
spin_unlock_bh(&ipvs->sync_buff_lock);
}
@@ -1130,6 +1139,28 @@ static void ip_vs_process_message(struct net *net, __u8 *buffer,
/*
+ * Setup sndbuf (mode=1) or rcvbuf (mode=0)
+ */
+static void set_sock_size(struct sock *sk, int mode, int val)
+{
+ /* setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val)); */
+ /* setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val)); */
+ lock_sock(sk);
+ if (mode) {
+ val = clamp_t(int, val, (SOCK_MIN_SNDBUF + 1) / 2,
+ sysctl_wmem_max);
+ sk->sk_sndbuf = val * 2;
+ sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
+ } else {
+ val = clamp_t(int, val, (SOCK_MIN_RCVBUF + 1) / 2,
+ sysctl_rmem_max);
+ sk->sk_rcvbuf = val * 2;
+ sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+ }
+ release_sock(sk);
+}
+
+/*
* Setup loopback of outgoing multicasts on a sending socket
*/
static void set_mcast_loop(struct sock *sk, u_char loop)
@@ -1305,6 +1336,9 @@ static struct socket *make_send_sock(struct net *net)
set_mcast_loop(sock->sk, 0);
set_mcast_ttl(sock->sk, 1);
+ result = sysctl_sync_sock_size(ipvs);
+ if (result > 0)
+ set_sock_size(sock->sk, 1, result);
result = bind_mcastif_addr(sock, ipvs->master_mcast_ifn);
if (result < 0) {
@@ -1350,6 +1384,9 @@ static struct socket *make_receive_sock(struct net *net)
sk_change_net(sock->sk, net);
/* it is equivalent to the REUSEADDR option in user-space */
sock->sk->sk_reuse = 1;
+ result = sysctl_sync_sock_size(ipvs);
+ if (result > 0)
+ set_sock_size(sock->sk, 0, result);
result = sock->ops->bind(sock, (struct sockaddr *) &mcast_addr,
sizeof(struct sockaddr));
@@ -1392,18 +1429,22 @@ ip_vs_send_async(struct socket *sock, const char *buffer, const size_t length)
return len;
}
-static void
+static int
ip_vs_send_sync_msg(struct socket *sock, struct ip_vs_sync_mesg *msg)
{
int msize;
+ int ret;
msize = msg->size;
/* Put size in network byte order */
msg->size = htons(msg->size);
- if (ip_vs_send_async(sock, (char *)msg, msize) != msize)
- pr_err("ip_vs_send_async error\n");
+ ret = ip_vs_send_async(sock, (char *)msg, msize);
+ if (ret >= 0 || ret == -EAGAIN)
+ return ret;
+ pr_err("ip_vs_send_async error %d\n", ret);
+ return 0;
}
static int
@@ -1428,36 +1469,75 @@ ip_vs_receive(struct socket *sock, char *buffer, const size_t buflen)
return len;
}
+/* Wakeup the master thread for sending */
+static void master_wakeup_work_handler(struct work_struct *work)
+{
+ struct netns_ipvs *ipvs = container_of(work, struct netns_ipvs,
+ master_wakeup_work.work);
+
+ spin_lock_bh(&ipvs->sync_lock);
+ if (ipvs->sync_queue_len &&
+ ipvs->sync_queue_delay < IPVS_SYNC_WAKEUP_RATE) {
+ ipvs->sync_queue_delay = IPVS_SYNC_WAKEUP_RATE;
+ wake_up_process(ipvs->master_thread);
+ }
+ spin_unlock_bh(&ipvs->sync_lock);
+}
+
+/* Get next buffer to send */
+static inline struct ip_vs_sync_buff *
+next_sync_buff(struct netns_ipvs *ipvs)
+{
+ struct ip_vs_sync_buff *sb;
+
+ sb = sb_dequeue(ipvs);
+ if (sb)
+ return sb;
+ /* Do not delay entries in buffer for more than 2 seconds */
+ return get_curr_sync_buff(ipvs, 2 * HZ);
+}
static int sync_thread_master(void *data)
{
struct ip_vs_sync_thread_data *tinfo = data;
struct netns_ipvs *ipvs = net_ipvs(tinfo->net);
+ struct sock *sk = tinfo->sock->sk;
struct ip_vs_sync_buff *sb;
pr_info("sync thread started: state = MASTER, mcast_ifn = %s, "
"syncid = %d\n",
ipvs->master_mcast_ifn, ipvs->master_syncid);
- while (!kthread_should_stop()) {
- while ((sb = sb_dequeue(ipvs))) {
- ip_vs_send_sync_msg(tinfo->sock, sb->mesg);
- ip_vs_sync_buff_release(sb);
+ for (;;) {
+ sb = next_sync_buff(ipvs);
+ if (unlikely(kthread_should_stop()))
+ break;
+ if (!sb) {
+ schedule_timeout(IPVS_SYNC_CHECK_PERIOD);
+ continue;
}
-
- /* check if entries stay in ipvs->sync_buff for 2 seconds */
- sb = get_curr_sync_buff(ipvs, 2 * HZ);
- if (sb) {
- ip_vs_send_sync_msg(tinfo->sock, sb->mesg);
- ip_vs_sync_buff_release(sb);
+ while (ip_vs_send_sync_msg(tinfo->sock, sb->mesg) < 0) {
+ int ret = 0;
+
+ __wait_event_interruptible(*sk_sleep(sk),
+ sock_writeable(sk) ||
+ kthread_should_stop(),
+ ret);
+ if (unlikely(kthread_should_stop()))
+ goto done;
}
-
- schedule_timeout_interruptible(HZ);
+ ip_vs_sync_buff_release(sb);
}
+done:
+ __set_current_state(TASK_RUNNING);
+ if (sb)
+ ip_vs_sync_buff_release(sb);
+
/* clean up the sync_buff queue */
while ((sb = sb_dequeue(ipvs)))
ip_vs_sync_buff_release(sb);
+ __set_current_state(TASK_RUNNING);
/* clean up the current sync_buff */
sb = get_curr_sync_buff(ipvs, 0);
@@ -1538,6 +1618,10 @@ int start_sync_thread(struct net *net, int state, char *mcast_ifn, __u8 syncid)
realtask = &ipvs->master_thread;
name = "ipvs_master:%d";
threadfn = sync_thread_master;
+ ipvs->sync_queue_len = 0;
+ ipvs->sync_queue_delay = 0;
+ INIT_DELAYED_WORK(&ipvs->master_wakeup_work,
+ master_wakeup_work_handler);
sock = make_send_sock(net);
} else if (state == IP_VS_STATE_BACKUP) {
if (ipvs->backup_thread)
@@ -1623,6 +1707,7 @@ int stop_sync_thread(struct net *net, int state)
spin_lock_bh(&ipvs->sync_lock);
ipvs->sync_state &= ~IP_VS_STATE_MASTER;
spin_unlock_bh(&ipvs->sync_lock);
+ cancel_delayed_work_sync(&ipvs->master_wakeup_work);
retc = kthread_stop(ipvs->master_thread);
ipvs->master_thread = NULL;
} else if (state == IP_VS_STATE_BACKUP) {
--
1.7.9.5
* [PATCH 17/25] ipvs: reduce sync rate with time thresholds
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (15 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 16/25] ipvs: wakeup master thread pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 18/25] ipvs: add support for sync threads pablo
` (8 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Add two new sysctl vars to control the sync rate. The
main idea is to reduce the rate for connection templates, because
currently it depends on the packet rate for controlled connections.
This mechanism should also be useful for normal connections
with high traffic.
sync_refresh_period: in seconds, difference in reported connection
timer that triggers new sync message. It can be used to
avoid sync messages for the specified period (or half of
the connection timeout if it is lower) if connection state
is not changed from last sync.
sync_retries: integer, 0..3, defines the number of sync retries,
sent with a period of sync_refresh_period/8. Useful to protect
against loss of sync messages.
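The suppression window described above can be sketched in userspace C.
This is only an illustration modelled on the max()/min_t() expressions
in ip_vs_sync_conn_needed() from this patch; the names HZ_SKETCH,
refresh_window and refresh_suppressed are hypothetical, and the tick
rate of 1000 is an assumption.

```c
#include <stdlib.h>

#define HZ_SKETCH 1000L	/* assumed tick rate, for illustration only */

/* Window below which a change in the reported end time is ignored:
 * the smaller of refresh_period and max(timeout / 2, 10 seconds). */
static long refresh_window(long refresh_period, long timeout)
{
	long min_diff = timeout / 2;

	if (min_diff < 10 * HZ_SKETCH)
		min_diff = 10 * HZ_SKETCH;
	return refresh_period < min_diff ? refresh_period : min_diff;
}

/* Non-zero when the new end time is too close to the last reported
 * one, so a refresh sync message would be skipped (retries aside). */
static int refresh_suppressed(long new_end, long old_end,
			      long refresh_period, long timeout)
{
	return labs(new_end - old_end) <
	       refresh_window(refresh_period, timeout);
}
```

With a 60 second refresh period and a 900 second timeout the window is
the full 60 seconds; for short timeouts it shrinks, but this sketch
never lets the timeout-derived bound drop below 10 seconds.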
Allow sysctl_sync_threshold to be used with
sysctl_sync_period=0, so that only a single sync message is sent
if sync_refresh_period is also 0.
Add a new field "sync_endtime" to the connection structure to
hold the reported time when the connection expires. The 2 lowest
bits represent the retry count.
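As a rough illustration of how one word can carry both values, the
following userspace sketch packs a time and a 2-bit retry counter.
The helper names (pack_endtime, endtime_retries, endtime_expires) are
hypothetical and do not appear in the patch, which performs the
masking inline with "& ~3UL" and "& 3".

```c
/* Hypothetical helpers, not part of the patch: the 2 low bits of the
 * value hold the retry count, the remaining bits hold the time. */
#define RETRY_MASK 3UL

static unsigned long pack_endtime(unsigned long expires,
				  unsigned int retries)
{
	/* Drop the 2 low bits of the time and store the retries there */
	return (expires & ~RETRY_MASK) | (retries & RETRY_MASK);
}

static unsigned int endtime_retries(unsigned long v)
{
	return (unsigned int)(v & RETRY_MASK);
}

static unsigned long endtime_expires(unsigned long v)
{
	return v & ~RETRY_MASK;
}
```

The cost of this layout is that the stored time loses 2 bits of
resolution, which is why the patch rounds jiffies with "& ~3UL" before
comparing or storing it.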
As sysctl_sync_period can now be 0, read it with ACCESS_ONCE
into a local variable so that the value that is checked is the
same value used as divisor, avoiding division by zero.
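The read-once pattern can be sketched in userspace as follows. This is
a simplified illustration, not the kernel code: period_match is a
hypothetical name, the volatile cast stands in for ACCESS_ONCE, and
the sketch simply returns 0 for a zero period, whereas the patch
handles that case with a separate threshold check.

```c
/* Returns 1 when pkts hits the threshold within the period.
 * The setting is read exactly once into a local, so the value that
 * is tested is the same value that is used as the divisor. */
static int period_match(const int *period_setting, int pkts,
			int threshold)
{
	/* volatile read: stand-in for the kernel's ACCESS_ONCE() */
	int period = *(const volatile int *)period_setting;

	if (period <= 0)
		return 0;	/* simplification: no periodic sync */
	return pkts % period == threshold;
}
```

Reading the setting twice (once for the test, once for the modulo)
would leave a window in which a concurrent sysctl write could set it
to 0 between the two reads.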
Special thanks to Aleksey Chudov for being patient with me,
for his extensive reports and helping in all tests.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
include/net/ip_vs.h | 30 +++++++++-
net/netfilter/ipvs/ip_vs_conn.c | 7 ++-
net/netfilter/ipvs/ip_vs_core.c | 30 +---------
net/netfilter/ipvs/ip_vs_ctl.c | 25 +++++++-
net/netfilter/ipvs/ip_vs_sync.c | 121 +++++++++++++++++++++++++++++++++------
5 files changed, 165 insertions(+), 48 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 8721a78..941df45 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -505,6 +505,7 @@ struct ip_vs_conn {
* state transition triggerd
* synchronization
*/
+ unsigned long sync_endtime; /* jiffies + sent_retries */
/* Control members */
struct ip_vs_conn *control; /* Master control connection */
@@ -876,6 +877,8 @@ struct netns_ipvs {
int sysctl_expire_nodest_conn;
int sysctl_expire_quiescent_template;
int sysctl_sync_threshold[2];
+ unsigned int sysctl_sync_refresh_period;
+ int sysctl_sync_retries;
int sysctl_nat_icmp_send;
/* ip_vs_lblc */
@@ -917,10 +920,13 @@ struct netns_ipvs {
#define DEFAULT_SYNC_THRESHOLD 3
#define DEFAULT_SYNC_PERIOD 50
#define DEFAULT_SYNC_VER 1
+#define DEFAULT_SYNC_REFRESH_PERIOD (0U * HZ)
+#define DEFAULT_SYNC_RETRIES 0
#define IPVS_SYNC_WAKEUP_RATE 8
#define IPVS_SYNC_QLEN_MAX (IPVS_SYNC_WAKEUP_RATE * 4)
#define IPVS_SYNC_SEND_DELAY (HZ / 50)
#define IPVS_SYNC_CHECK_PERIOD HZ
+#define IPVS_SYNC_FLUSH_TIME (HZ * 2)
#ifdef CONFIG_SYSCTL
@@ -931,7 +937,17 @@ static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
static inline int sysctl_sync_period(struct netns_ipvs *ipvs)
{
- return ipvs->sysctl_sync_threshold[1];
+ return ACCESS_ONCE(ipvs->sysctl_sync_threshold[1]);
+}
+
+static inline unsigned int sysctl_sync_refresh_period(struct netns_ipvs *ipvs)
+{
+ return ACCESS_ONCE(ipvs->sysctl_sync_refresh_period);
+}
+
+static inline int sysctl_sync_retries(struct netns_ipvs *ipvs)
+{
+ return ipvs->sysctl_sync_retries;
}
static inline int sysctl_sync_ver(struct netns_ipvs *ipvs)
@@ -961,6 +977,16 @@ static inline int sysctl_sync_period(struct netns_ipvs *ipvs)
return DEFAULT_SYNC_PERIOD;
}
+static inline unsigned int sysctl_sync_refresh_period(struct netns_ipvs *ipvs)
+{
+ return DEFAULT_SYNC_REFRESH_PERIOD;
+}
+
+static inline int sysctl_sync_retries(struct netns_ipvs *ipvs)
+{
+ return DEFAULT_SYNC_RETRIES & 3;
+}
+
static inline int sysctl_sync_ver(struct netns_ipvs *ipvs)
{
return DEFAULT_SYNC_VER;
@@ -1248,7 +1274,7 @@ extern struct ip_vs_dest *ip_vs_try_bind_dest(struct ip_vs_conn *cp);
extern int start_sync_thread(struct net *net, int state, char *mcast_ifn,
__u8 syncid);
extern int stop_sync_thread(struct net *net, int state);
-extern void ip_vs_sync_conn(struct net *net, struct ip_vs_conn *cp);
+extern void ip_vs_sync_conn(struct net *net, struct ip_vs_conn *cp, int pkts);
/*
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index fd74f88..4f3205d 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -762,7 +762,8 @@ int ip_vs_check_template(struct ip_vs_conn *ct)
static void ip_vs_conn_expire(unsigned long data)
{
struct ip_vs_conn *cp = (struct ip_vs_conn *)data;
- struct netns_ipvs *ipvs = net_ipvs(ip_vs_conn_net(cp));
+ struct net *net = ip_vs_conn_net(cp);
+ struct netns_ipvs *ipvs = net_ipvs(net);
cp->timeout = 60*HZ;
@@ -827,6 +828,9 @@ static void ip_vs_conn_expire(unsigned long data)
atomic_read(&cp->refcnt)-1,
atomic_read(&cp->n_control));
+ if (ipvs->sync_state & IP_VS_STATE_MASTER)
+ ip_vs_sync_conn(net, cp, sysctl_sync_threshold(ipvs));
+
ip_vs_conn_put(cp);
}
@@ -900,6 +904,7 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p,
/* Set its state and timeout */
cp->state = 0;
cp->timeout = 3*HZ;
+ cp->sync_endtime = jiffies & ~3UL;
/* Bind its packet transmitter */
#ifdef CONFIG_IP_VS_IPV6
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b5a5c73..7ce5819 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1613,34 +1613,8 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb, int af)
else
pkts = atomic_add_return(1, &cp->in_pkts);
- if ((ipvs->sync_state & IP_VS_STATE_MASTER) &&
- cp->protocol == IPPROTO_SCTP) {
- if ((cp->state == IP_VS_SCTP_S_ESTABLISHED &&
- (pkts % sysctl_sync_period(ipvs)
- == sysctl_sync_threshold(ipvs))) ||
- (cp->old_state != cp->state &&
- ((cp->state == IP_VS_SCTP_S_CLOSED) ||
- (cp->state == IP_VS_SCTP_S_SHUT_ACK_CLI) ||
- (cp->state == IP_VS_SCTP_S_SHUT_ACK_SER)))) {
- ip_vs_sync_conn(net, cp);
- goto out;
- }
- }
-
- /* Keep this block last: TCP and others with pp->num_states <= 1 */
- else if ((ipvs->sync_state & IP_VS_STATE_MASTER) &&
- (((cp->protocol != IPPROTO_TCP ||
- cp->state == IP_VS_TCP_S_ESTABLISHED) &&
- (pkts % sysctl_sync_period(ipvs)
- == sysctl_sync_threshold(ipvs))) ||
- ((cp->protocol == IPPROTO_TCP) && (cp->old_state != cp->state) &&
- ((cp->state == IP_VS_TCP_S_FIN_WAIT) ||
- (cp->state == IP_VS_TCP_S_CLOSE) ||
- (cp->state == IP_VS_TCP_S_CLOSE_WAIT) ||
- (cp->state == IP_VS_TCP_S_TIME_WAIT)))))
- ip_vs_sync_conn(net, cp);
-out:
- cp->old_state = cp->state;
+ if (ipvs->sync_state & IP_VS_STATE_MASTER)
+ ip_vs_sync_conn(net, cp, pkts);
ip_vs_conn_put(cp);
return ret;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 854e9a6..83bdbbc 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1599,6 +1599,10 @@ static int ip_vs_zero_all(struct net *net)
}
#ifdef CONFIG_SYSCTL
+
+static int zero;
+static int three = 3;
+
static int
proc_do_defense_mode(ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
@@ -1632,7 +1636,8 @@ proc_do_sync_threshold(ctl_table *table, int write,
memcpy(val, valp, sizeof(val));
rc = proc_dointvec(table, write, buffer, lenp, ppos);
- if (write && (valp[0] < 0 || valp[1] < 0 || valp[0] >= valp[1])) {
+ if (write && (valp[0] < 0 || valp[1] < 0 ||
+ (valp[0] >= valp[1] && valp[1]))) {
/* Restore the correct value */
memcpy(valp, val, sizeof(val));
}
@@ -1755,6 +1760,20 @@ static struct ctl_table vs_vars[] = {
.proc_handler = proc_do_sync_threshold,
},
{
+ .procname = "sync_refresh_period",
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_jiffies,
+ },
+ {
+ .procname = "sync_retries",
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &three,
+ },
+ {
.procname = "nat_icmp_send",
.maxlen = sizeof(int),
.mode = 0644,
@@ -3685,6 +3704,10 @@ int __net_init ip_vs_control_net_init_sysctl(struct net *net)
ipvs->sysctl_sync_threshold[1] = DEFAULT_SYNC_PERIOD;
tbl[idx].data = &ipvs->sysctl_sync_threshold;
tbl[idx++].maxlen = sizeof(ipvs->sysctl_sync_threshold);
+ ipvs->sysctl_sync_refresh_period = DEFAULT_SYNC_REFRESH_PERIOD;
+ tbl[idx++].data = &ipvs->sysctl_sync_refresh_period;
+ ipvs->sysctl_sync_retries = clamp_t(int, DEFAULT_SYNC_RETRIES, 0, 3);
+ tbl[idx++].data = &ipvs->sysctl_sync_retries;
tbl[idx++].data = &ipvs->sysctl_nat_icmp_send;
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index eafc1d2..4aa9290 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -451,11 +451,94 @@ ip_vs_sync_buff_create_v0(struct netns_ipvs *ipvs)
return sb;
}
+/* Check if conn should be synced.
+ * pkts: conn packets, use sysctl_sync_threshold to avoid packet check
+ * - (1) sync_refresh_period: reduce sync rate. Additionally, retry
+ * sync_retries times with period of sync_refresh_period/8
+ * - (2) if both sync_refresh_period and sync_period are 0 send sync only
+ * for state changes or only once when pkts matches sync_threshold
+ * - (3) templates: rate can be reduced only with sync_refresh_period or
+ * with (2)
+ */
+static int ip_vs_sync_conn_needed(struct netns_ipvs *ipvs,
+ struct ip_vs_conn *cp, int pkts)
+{
+ unsigned long orig = ACCESS_ONCE(cp->sync_endtime);
+ unsigned long now = jiffies;
+ unsigned long n = (now + cp->timeout) & ~3UL;
+ unsigned int sync_refresh_period;
+ int sync_period;
+ int force;
+
+ /* Check if we sync in current state */
+ if (unlikely(cp->flags & IP_VS_CONN_F_TEMPLATE))
+ force = 0;
+ else if (likely(cp->protocol == IPPROTO_TCP)) {
+ if (!((1 << cp->state) &
+ ((1 << IP_VS_TCP_S_ESTABLISHED) |
+ (1 << IP_VS_TCP_S_FIN_WAIT) |
+ (1 << IP_VS_TCP_S_CLOSE) |
+ (1 << IP_VS_TCP_S_CLOSE_WAIT) |
+ (1 << IP_VS_TCP_S_TIME_WAIT))))
+ return 0;
+ force = cp->state != cp->old_state;
+ if (force && cp->state != IP_VS_TCP_S_ESTABLISHED)
+ goto set;
+ } else if (unlikely(cp->protocol == IPPROTO_SCTP)) {
+ if (!((1 << cp->state) &
+ ((1 << IP_VS_SCTP_S_ESTABLISHED) |
+ (1 << IP_VS_SCTP_S_CLOSED) |
+ (1 << IP_VS_SCTP_S_SHUT_ACK_CLI) |
+ (1 << IP_VS_SCTP_S_SHUT_ACK_SER))))
+ return 0;
+ force = cp->state != cp->old_state;
+ if (force && cp->state != IP_VS_SCTP_S_ESTABLISHED)
+ goto set;
+ } else {
+ /* UDP or another protocol with single state */
+ force = 0;
+ }
+
+ sync_refresh_period = sysctl_sync_refresh_period(ipvs);
+ if (sync_refresh_period > 0) {
+ long diff = n - orig;
+ long min_diff = max(cp->timeout >> 1, 10UL * HZ);
+
+ /* Avoid sync if difference is below sync_refresh_period
+ * and below the half timeout.
+ */
+ if (abs(diff) < min_t(long, sync_refresh_period, min_diff)) {
+ int retries = orig & 3;
+
+ if (retries >= sysctl_sync_retries(ipvs))
+ return 0;
+ if (time_before(now, orig - cp->timeout +
+ (sync_refresh_period >> 3)))
+ return 0;
+ n |= retries + 1;
+ }
+ }
+ sync_period = sysctl_sync_period(ipvs);
+ if (sync_period > 0) {
+ if (!(cp->flags & IP_VS_CONN_F_TEMPLATE) &&
+ pkts % sync_period != sysctl_sync_threshold(ipvs))
+ return 0;
+ } else if (sync_refresh_period <= 0 &&
+ pkts != sysctl_sync_threshold(ipvs))
+ return 0;
+
+set:
+ cp->old_state = cp->state;
+ n = cmpxchg(&cp->sync_endtime, orig, n);
+ return n == orig || force;
+}
+
/*
* Version 0 , could be switched in by sys_ctl.
* Add an ip_vs_conn information into the current sync_buff.
*/
-void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp)
+static void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp,
+ int pkts)
{
struct netns_ipvs *ipvs = net_ipvs(net);
struct ip_vs_sync_mesg_v0 *m;
@@ -468,6 +551,9 @@ void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp)
if (cp->flags & IP_VS_CONN_F_ONE_PACKET)
return;
+ if (!ip_vs_sync_conn_needed(ipvs, cp, pkts))
+ return;
+
spin_lock(&ipvs->sync_buff_lock);
if (!ipvs->sync_buff) {
ipvs->sync_buff =
@@ -513,8 +599,14 @@ void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp)
spin_unlock(&ipvs->sync_buff_lock);
/* synchronize its controller if it has */
- if (cp->control)
- ip_vs_sync_conn(net, cp->control);
+ cp = cp->control;
+ if (cp) {
+ if (cp->flags & IP_VS_CONN_F_TEMPLATE)
+ pkts = atomic_add_return(1, &cp->in_pkts);
+ else
+ pkts = sysctl_sync_threshold(ipvs);
+ ip_vs_sync_conn(net, cp, pkts);
+ }
}
/*
@@ -522,7 +614,7 @@ void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp)
* Called by ip_vs_in.
* Sending Version 1 messages
*/
-void ip_vs_sync_conn(struct net *net, struct ip_vs_conn *cp)
+void ip_vs_sync_conn(struct net *net, struct ip_vs_conn *cp, int pkts)
{
struct netns_ipvs *ipvs = net_ipvs(net);
struct ip_vs_sync_mesg *m;
@@ -532,13 +624,16 @@ void ip_vs_sync_conn(struct net *net, struct ip_vs_conn *cp)
/* Handle old version of the protocol */
if (sysctl_sync_ver(ipvs) == 0) {
- ip_vs_sync_conn_v0(net, cp);
+ ip_vs_sync_conn_v0(net, cp, pkts);
return;
}
/* Do not sync ONE PACKET */
if (cp->flags & IP_VS_CONN_F_ONE_PACKET)
goto control;
sloop:
+ if (!ip_vs_sync_conn_needed(ipvs, cp, pkts))
+ goto control;
+
/* Sanity checks */
pe_name_len = 0;
if (cp->pe_data_len) {
@@ -653,16 +748,10 @@ control:
cp = cp->control;
if (!cp)
return;
- /*
- * Reduce sync rate for templates
- * i.e only increment in_pkts for Templates.
- */
- if (cp->flags & IP_VS_CONN_F_TEMPLATE) {
- int pkts = atomic_add_return(1, &cp->in_pkts);
-
- if (pkts % sysctl_sync_period(ipvs) != 1)
- return;
- }
+ if (cp->flags & IP_VS_CONN_F_TEMPLATE)
+ pkts = atomic_add_return(1, &cp->in_pkts);
+ else
+ pkts = sysctl_sync_threshold(ipvs);
goto sloop;
}
@@ -1494,7 +1583,7 @@ next_sync_buff(struct netns_ipvs *ipvs)
if (sb)
return sb;
/* Do not delay entries in buffer for more than 2 seconds */
- return get_curr_sync_buff(ipvs, 2 * HZ);
+ return get_curr_sync_buff(ipvs, IPVS_SYNC_FLUSH_TIME);
}
static int sync_thread_master(void *data)
--
1.7.9.5
* [PATCH 18/25] ipvs: add support for sync threads
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (16 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 17/25] ipvs: reduce sync rate with time thresholds pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 19/25] ipvs: optimize the use of flags in ip_vs_bind_dest pablo
` (7 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
Allow master and backup servers to use multiple threads
for sync traffic. Add the sysctl variable "sync_ports" to define
the number of threads. Each thread uses a single UDP port:
thread 0 uses the default port 8848 and the last thread
uses port 8848+sync_ports-1.
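The port layout can be sketched as below. The function names are
hypothetical; the base port 8848 comes from this message and the upper
bound of 64 mirrors IPVS_SYNC_PORTS_MAX (1 << 6) and the clamp() call
in start_sync_thread() from this patch.

```c
#define SYNC_PORT_BASE	8848	/* default sync port named above */
#define SYNC_PORTS_MAX	64	/* mirrors IPVS_SYNC_PORTS_MAX */

/* Number of threads actually started for a given sync_ports value */
static int sync_thread_count(int sync_ports)
{
	if (sync_ports < 1)
		return 1;
	if (sync_ports > SYNC_PORTS_MAX)
		return SYNC_PORTS_MAX;
	return sync_ports;
}

/* UDP port used by the thread with the given id (0-based) */
static int sync_thread_port(int id)
{
	return SYNC_PORT_BASE + id;
}
```

For example, with sync_ports=4 the threads would use ports 8848
through 8851.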
Sync traffic for connections is distributed across the
master threads based on the cp address, but a given connection
is always assigned to the same thread to avoid reordering of
its sync messages.
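A userspace sketch of this kind of address-based selection, modelled
on select_master_thread_id() in this patch. The names ilog2_sketch and
select_thread are hypothetical, and the object size is passed
explicitly here instead of using sizeof(*cp).

```c
#include <stddef.h>
#include <stdint.h>

/* floor(log2(n)) for n > 0; stand-in for the kernel's ilog2() */
static unsigned int ilog2_sketch(size_t n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Pick a thread id from an object address. Low bits that barely vary
 * between objects of this size are shifted out first, then the result
 * is masked. threads_mask must be (power-of-two thread count) - 1. */
static int select_thread(const void *obj, size_t obj_size,
			 int threads_mask)
{
	uintptr_t addr = (uintptr_t)obj;

	return (int)((addr >> (1 + ilog2_sketch(obj_size))) &
		     (uintptr_t)threads_mask);
}
```

Because the id depends only on the address, the same connection always
maps to the same thread for as long as it exists, which is what keeps
its sync messages in order.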
Remove ip_vs_sync_switch_mode because this check
for sync mode change is still risky. Instead, check for mode
change under sync_buff_lock.
Make sure the backup sockets do not block on reading.
Special thanks to Aleksey Chudov for helping in all tests.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
include/net/ip_vs.h | 34 +++-
net/netfilter/ipvs/ip_vs_conn.c | 7 +
net/netfilter/ipvs/ip_vs_ctl.c | 29 ++-
net/netfilter/ipvs/ip_vs_sync.c | 401 ++++++++++++++++++++++++---------------
4 files changed, 305 insertions(+), 166 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 941df45..75824e2 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -785,6 +785,16 @@ struct ip_vs_app {
void (*timeout_change)(struct ip_vs_app *app, int flags);
};
+struct ipvs_master_sync_state {
+ struct list_head sync_queue;
+ struct ip_vs_sync_buff *sync_buff;
+ int sync_queue_len;
+ unsigned int sync_queue_delay;
+ struct task_struct *master_thread;
+ struct delayed_work master_wakeup_work;
+ struct netns_ipvs *ipvs;
+};
+
/* IPVS in network namespace */
struct netns_ipvs {
int gen; /* Generation */
@@ -871,6 +881,7 @@ struct netns_ipvs {
#endif
int sysctl_snat_reroute;
int sysctl_sync_ver;
+ int sysctl_sync_ports;
int sysctl_sync_qlen_max;
int sysctl_sync_sock_size;
int sysctl_cache_bypass;
@@ -894,16 +905,11 @@ struct netns_ipvs {
spinlock_t est_lock;
struct timer_list est_timer; /* Estimation timer */
/* ip_vs_sync */
- struct list_head sync_queue;
- int sync_queue_len;
- unsigned int sync_queue_delay;
- struct delayed_work master_wakeup_work;
spinlock_t sync_lock;
- struct ip_vs_sync_buff *sync_buff;
+ struct ipvs_master_sync_state *ms;
spinlock_t sync_buff_lock;
- struct sockaddr_in sync_mcast_addr;
- struct task_struct *master_thread;
- struct task_struct *backup_thread;
+ struct task_struct **backup_threads;
+ int threads_mask;
int send_mesg_maxlen;
int recv_mesg_maxlen;
volatile int sync_state;
@@ -927,6 +933,7 @@ struct netns_ipvs {
#define IPVS_SYNC_SEND_DELAY (HZ / 50)
#define IPVS_SYNC_CHECK_PERIOD HZ
#define IPVS_SYNC_FLUSH_TIME (HZ * 2)
+#define IPVS_SYNC_PORTS_MAX (1 << 6)
#ifdef CONFIG_SYSCTL
@@ -955,6 +962,11 @@ static inline int sysctl_sync_ver(struct netns_ipvs *ipvs)
return ipvs->sysctl_sync_ver;
}
+static inline int sysctl_sync_ports(struct netns_ipvs *ipvs)
+{
+ return ACCESS_ONCE(ipvs->sysctl_sync_ports);
+}
+
static inline int sysctl_sync_qlen_max(struct netns_ipvs *ipvs)
{
return ipvs->sysctl_sync_qlen_max;
@@ -992,6 +1004,11 @@ static inline int sysctl_sync_ver(struct netns_ipvs *ipvs)
return DEFAULT_SYNC_VER;
}
+static inline int sysctl_sync_ports(struct netns_ipvs *ipvs)
+{
+ return 1;
+}
+
static inline int sysctl_sync_qlen_max(struct netns_ipvs *ipvs)
{
return IPVS_SYNC_QLEN_MAX;
@@ -1242,7 +1259,6 @@ extern struct ip_vs_stats ip_vs_stats;
extern const struct ctl_path net_vs_ctl_path[];
extern int sysctl_ip_vs_sync_ver;
-extern void ip_vs_sync_switch_mode(struct net *net, int mode);
extern struct ip_vs_service *
ip_vs_service_get(struct net *net, int af, __u32 fwmark, __u16 protocol,
const union nf_inet_addr *vaddr, __be16 vport);
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 4f3205d..c7edf20 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -619,12 +619,19 @@ struct ip_vs_dest *ip_vs_try_bind_dest(struct ip_vs_conn *cp)
if (dest) {
struct ip_vs_proto_data *pd;
+ spin_lock(&cp->lock);
+ if (cp->dest) {
+ spin_unlock(&cp->lock);
+ return dest;
+ }
+
/* Applications work depending on the forwarding method
* but better to reassign them always when binding dest */
if (cp->app)
ip_vs_unbind_app(cp);
ip_vs_bind_dest(cp, dest);
+ spin_unlock(&cp->lock);
/* Update its packet transmitter */
cp->packet_xmit = NULL;
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 83bdbbc..0e599a4 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1657,9 +1657,24 @@ proc_do_sync_mode(ctl_table *table, int write,
if ((*valp < 0) || (*valp > 1)) {
/* Restore the correct value */
*valp = val;
- } else {
- struct net *net = current->nsproxy->net_ns;
- ip_vs_sync_switch_mode(net, val);
+ }
+ }
+ return rc;
+}
+
+static int
+proc_do_sync_ports(ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int *valp = table->data;
+ int val = *valp;
+ int rc;
+
+ rc = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (write && (*valp != val)) {
+ if (*valp < 1 || !is_power_of_2(*valp)) {
+ /* Restore the correct value */
+ *valp = val;
}
}
return rc;
@@ -1723,6 +1738,12 @@ static struct ctl_table vs_vars[] = {
.proc_handler = &proc_do_sync_mode,
},
{
+ .procname = "sync_ports",
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_do_sync_ports,
+ },
+ {
.procname = "sync_qlen_max",
.maxlen = sizeof(int),
.mode = 0644,
@@ -3693,6 +3714,8 @@ int __net_init ip_vs_control_net_init_sysctl(struct net *net)
tbl[idx++].data = &ipvs->sysctl_snat_reroute;
ipvs->sysctl_sync_ver = 1;
tbl[idx++].data = &ipvs->sysctl_sync_ver;
+ ipvs->sysctl_sync_ports = 1;
+ tbl[idx++].data = &ipvs->sysctl_sync_ports;
ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
ipvs->sysctl_sync_sock_size = 0;
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 4aa9290..8550f37 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -196,6 +196,7 @@ struct ip_vs_sync_thread_data {
struct net *net;
struct socket *sock;
char *buf;
+ int id;
};
/* Version 0 definition of packet sizes */
@@ -271,13 +272,6 @@ struct ip_vs_sync_buff {
unsigned char *end;
};
-/* multicast addr */
-static struct sockaddr_in mcast_addr = {
- .sin_family = AF_INET,
- .sin_port = cpu_to_be16(IP_VS_SYNC_PORT),
- .sin_addr.s_addr = cpu_to_be32(IP_VS_SYNC_GROUP),
-};
-
/*
* Copy of struct ip_vs_seq
* From unaligned network order to aligned host order
@@ -300,22 +294,22 @@ static void hton_seq(struct ip_vs_seq *ho, struct ip_vs_seq *no)
put_unaligned_be32(ho->previous_delta, &no->previous_delta);
}
-static inline struct ip_vs_sync_buff *sb_dequeue(struct netns_ipvs *ipvs)
+static inline struct ip_vs_sync_buff *
+sb_dequeue(struct netns_ipvs *ipvs, struct ipvs_master_sync_state *ms)
{
struct ip_vs_sync_buff *sb;
spin_lock_bh(&ipvs->sync_lock);
- if (list_empty(&ipvs->sync_queue)) {
+ if (list_empty(&ms->sync_queue)) {
sb = NULL;
__set_current_state(TASK_INTERRUPTIBLE);
} else {
- sb = list_entry(ipvs->sync_queue.next,
- struct ip_vs_sync_buff,
+ sb = list_entry(ms->sync_queue.next, struct ip_vs_sync_buff,
list);
list_del(&sb->list);
- ipvs->sync_queue_len--;
- if (!ipvs->sync_queue_len)
- ipvs->sync_queue_delay = 0;
+ ms->sync_queue_len--;
+ if (!ms->sync_queue_len)
+ ms->sync_queue_delay = 0;
}
spin_unlock_bh(&ipvs->sync_lock);
@@ -338,7 +332,7 @@ ip_vs_sync_buff_create(struct netns_ipvs *ipvs)
kfree(sb);
return NULL;
}
- sb->mesg->reserved = 0; /* old nr_conns i.e. must be zeo now */
+ sb->mesg->reserved = 0; /* old nr_conns i.e. must be zero now */
sb->mesg->version = SYNC_PROTO_VER;
sb->mesg->syncid = ipvs->master_syncid;
sb->mesg->size = sizeof(struct ip_vs_sync_mesg);
@@ -357,20 +351,21 @@ static inline void ip_vs_sync_buff_release(struct ip_vs_sync_buff *sb)
kfree(sb);
}
-static inline void sb_queue_tail(struct netns_ipvs *ipvs)
+static inline void sb_queue_tail(struct netns_ipvs *ipvs,
+ struct ipvs_master_sync_state *ms)
{
- struct ip_vs_sync_buff *sb = ipvs->sync_buff;
+ struct ip_vs_sync_buff *sb = ms->sync_buff;
spin_lock(&ipvs->sync_lock);
if (ipvs->sync_state & IP_VS_STATE_MASTER &&
- ipvs->sync_queue_len < sysctl_sync_qlen_max(ipvs)) {
- if (!ipvs->sync_queue_len)
- schedule_delayed_work(&ipvs->master_wakeup_work,
+ ms->sync_queue_len < sysctl_sync_qlen_max(ipvs)) {
+ if (!ms->sync_queue_len)
+ schedule_delayed_work(&ms->master_wakeup_work,
max(IPVS_SYNC_SEND_DELAY, 1));
- ipvs->sync_queue_len++;
- list_add_tail(&sb->list, &ipvs->sync_queue);
- if ((++ipvs->sync_queue_delay) == IPVS_SYNC_WAKEUP_RATE)
- wake_up_process(ipvs->master_thread);
+ ms->sync_queue_len++;
+ list_add_tail(&sb->list, &ms->sync_queue);
+ if ((++ms->sync_queue_delay) == IPVS_SYNC_WAKEUP_RATE)
+ wake_up_process(ms->master_thread);
} else
ip_vs_sync_buff_release(sb);
spin_unlock(&ipvs->sync_lock);
@@ -381,15 +376,15 @@ static inline void sb_queue_tail(struct netns_ipvs *ipvs)
* than the specified time or the specified time is zero.
*/
static inline struct ip_vs_sync_buff *
-get_curr_sync_buff(struct netns_ipvs *ipvs, unsigned long time)
+get_curr_sync_buff(struct netns_ipvs *ipvs, struct ipvs_master_sync_state *ms,
+ unsigned long time)
{
struct ip_vs_sync_buff *sb;
spin_lock_bh(&ipvs->sync_buff_lock);
- if (ipvs->sync_buff &&
- time_after_eq(jiffies - ipvs->sync_buff->firstuse, time)) {
- sb = ipvs->sync_buff;
- ipvs->sync_buff = NULL;
+ sb = ms->sync_buff;
+ if (sb && time_after_eq(jiffies - sb->firstuse, time)) {
+ ms->sync_buff = NULL;
__set_current_state(TASK_RUNNING);
} else
sb = NULL;
@@ -397,31 +392,10 @@ get_curr_sync_buff(struct netns_ipvs *ipvs, unsigned long time)
return sb;
}
-/*
- * Switch mode from sending version 0 or 1
- * - must handle sync_buf
- */
-void ip_vs_sync_switch_mode(struct net *net, int mode)
+static inline int
+select_master_thread_id(struct netns_ipvs *ipvs, struct ip_vs_conn *cp)
{
- struct netns_ipvs *ipvs = net_ipvs(net);
- struct ip_vs_sync_buff *sb;
-
- spin_lock_bh(&ipvs->sync_buff_lock);
- if (!(ipvs->sync_state & IP_VS_STATE_MASTER))
- goto unlock;
- sb = ipvs->sync_buff;
- if (mode == sysctl_sync_ver(ipvs) || !sb)
- goto unlock;
-
- /* Buffer empty ? then let buf_create do the job */
- if (sb->mesg->size <= sizeof(struct ip_vs_sync_mesg)) {
- ip_vs_sync_buff_release(sb);
- ipvs->sync_buff = NULL;
- } else
- sb_queue_tail(ipvs);
-
-unlock:
- spin_unlock_bh(&ipvs->sync_buff_lock);
+ return ((long) cp >> (1 + ilog2(sizeof(*cp)))) & ipvs->threads_mask;
}
/*
@@ -543,6 +517,9 @@ static void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp,
struct netns_ipvs *ipvs = net_ipvs(net);
struct ip_vs_sync_mesg_v0 *m;
struct ip_vs_sync_conn_v0 *s;
+ struct ip_vs_sync_buff *buff;
+ struct ipvs_master_sync_state *ms;
+ int id;
int len;
if (unlikely(cp->af != AF_INET))
@@ -555,20 +532,37 @@ static void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp,
return;
spin_lock(&ipvs->sync_buff_lock);
- if (!ipvs->sync_buff) {
- ipvs->sync_buff =
- ip_vs_sync_buff_create_v0(ipvs);
- if (!ipvs->sync_buff) {
+ if (!(ipvs->sync_state & IP_VS_STATE_MASTER)) {
+ spin_unlock(&ipvs->sync_buff_lock);
+ return;
+ }
+
+ id = select_master_thread_id(ipvs, cp);
+ ms = &ipvs->ms[id];
+ buff = ms->sync_buff;
+ if (buff) {
+ m = (struct ip_vs_sync_mesg_v0 *) buff->mesg;
+ /* Send buffer if it is for v1 */
+ if (!m->nr_conns) {
+ sb_queue_tail(ipvs, ms);
+ ms->sync_buff = NULL;
+ buff = NULL;
+ }
+ }
+ if (!buff) {
+ buff = ip_vs_sync_buff_create_v0(ipvs);
+ if (!buff) {
spin_unlock(&ipvs->sync_buff_lock);
pr_err("ip_vs_sync_buff_create failed.\n");
return;
}
+ ms->sync_buff = buff;
}
len = (cp->flags & IP_VS_CONN_F_SEQ_MASK) ? FULL_CONN_SIZE :
SIMPLE_CONN_SIZE;
- m = (struct ip_vs_sync_mesg_v0 *)ipvs->sync_buff->mesg;
- s = (struct ip_vs_sync_conn_v0 *)ipvs->sync_buff->head;
+ m = (struct ip_vs_sync_mesg_v0 *) buff->mesg;
+ s = (struct ip_vs_sync_conn_v0 *) buff->head;
/* copy members */
s->reserved = 0;
@@ -589,12 +583,12 @@ static void ip_vs_sync_conn_v0(struct net *net, struct ip_vs_conn *cp,
m->nr_conns++;
m->size += len;
- ipvs->sync_buff->head += len;
+ buff->head += len;
/* check if there is a space for next one */
- if (ipvs->sync_buff->head + FULL_CONN_SIZE > ipvs->sync_buff->end) {
- sb_queue_tail(ipvs);
- ipvs->sync_buff = NULL;
+ if (buff->head + FULL_CONN_SIZE > buff->end) {
+ sb_queue_tail(ipvs, ms);
+ ms->sync_buff = NULL;
}
spin_unlock(&ipvs->sync_buff_lock);
@@ -619,6 +613,9 @@ void ip_vs_sync_conn(struct net *net, struct ip_vs_conn *cp, int pkts)
struct netns_ipvs *ipvs = net_ipvs(net);
struct ip_vs_sync_mesg *m;
union ip_vs_sync_conn *s;
+ struct ip_vs_sync_buff *buff;
+ struct ipvs_master_sync_state *ms;
+ int id;
__u8 *p;
unsigned int len, pe_name_len, pad;
@@ -645,6 +642,13 @@ sloop:
}
spin_lock(&ipvs->sync_buff_lock);
+ if (!(ipvs->sync_state & IP_VS_STATE_MASTER)) {
+ spin_unlock(&ipvs->sync_buff_lock);
+ return;
+ }
+
+ id = select_master_thread_id(ipvs, cp);
+ ms = &ipvs->ms[id];
#ifdef CONFIG_IP_VS_IPV6
if (cp->af == AF_INET6)
@@ -663,27 +667,32 @@ sloop:
/* check if there is a space for this one */
pad = 0;
- if (ipvs->sync_buff) {
- pad = (4 - (size_t)ipvs->sync_buff->head) & 3;
- if (ipvs->sync_buff->head + len + pad > ipvs->sync_buff->end) {
- sb_queue_tail(ipvs);
- ipvs->sync_buff = NULL;
+ buff = ms->sync_buff;
+ if (buff) {
+ m = buff->mesg;
+ pad = (4 - (size_t) buff->head) & 3;
+ /* Send buffer if it is for v0 */
+ if (buff->head + len + pad > buff->end || m->reserved) {
+ sb_queue_tail(ipvs, ms);
+ ms->sync_buff = NULL;
+ buff = NULL;
pad = 0;
}
}
- if (!ipvs->sync_buff) {
- ipvs->sync_buff = ip_vs_sync_buff_create(ipvs);
- if (!ipvs->sync_buff) {
+ if (!buff) {
+ buff = ip_vs_sync_buff_create(ipvs);
+ if (!buff) {
spin_unlock(&ipvs->sync_buff_lock);
pr_err("ip_vs_sync_buff_create failed.\n");
return;
}
+ ms->sync_buff = buff;
+ m = buff->mesg;
}
- m = ipvs->sync_buff->mesg;
- p = ipvs->sync_buff->head;
- ipvs->sync_buff->head += pad + len;
+ p = buff->head;
+ buff->head += pad + len;
m->size += pad + len;
/* Add ev. padding from prev. sync_conn */
while (pad--)
@@ -834,6 +843,7 @@ static void ip_vs_proc_conn(struct net *net, struct ip_vs_conn_param *param,
kfree(param->pe_data);
dest = cp->dest;
+ spin_lock(&cp->lock);
if ((cp->flags ^ flags) & IP_VS_CONN_F_INACTIVE &&
!(flags & IP_VS_CONN_F_TEMPLATE) && dest) {
if (flags & IP_VS_CONN_F_INACTIVE) {
@@ -847,6 +857,7 @@ static void ip_vs_proc_conn(struct net *net, struct ip_vs_conn_param *param,
flags &= IP_VS_CONN_F_BACKUP_UPD_MASK;
flags |= cp->flags & ~IP_VS_CONN_F_BACKUP_UPD_MASK;
cp->flags = flags;
+ spin_unlock(&cp->lock);
if (!dest) {
dest = ip_vs_try_bind_dest(cp);
if (dest)
@@ -1399,9 +1410,15 @@ static int bind_mcastif_addr(struct socket *sock, char *ifname)
/*
* Set up sending multicast socket over UDP
*/
-static struct socket *make_send_sock(struct net *net)
+static struct socket *make_send_sock(struct net *net, int id)
{
struct netns_ipvs *ipvs = net_ipvs(net);
+ /* multicast addr */
+ struct sockaddr_in mcast_addr = {
+ .sin_family = AF_INET,
+ .sin_port = cpu_to_be16(IP_VS_SYNC_PORT + id),
+ .sin_addr.s_addr = cpu_to_be32(IP_VS_SYNC_GROUP),
+ };
struct socket *sock;
int result;
@@ -1453,9 +1470,15 @@ error:
/*
* Set up receiving multicast socket over UDP
*/
-static struct socket *make_receive_sock(struct net *net)
+static struct socket *make_receive_sock(struct net *net, int id)
{
struct netns_ipvs *ipvs = net_ipvs(net);
+ /* multicast addr */
+ struct sockaddr_in mcast_addr = {
+ .sin_family = AF_INET,
+ .sin_port = cpu_to_be16(IP_VS_SYNC_PORT + id),
+ .sin_addr.s_addr = cpu_to_be32(IP_VS_SYNC_GROUP),
+ };
struct socket *sock;
int result;
@@ -1549,10 +1572,10 @@ ip_vs_receive(struct socket *sock, char *buffer, const size_t buflen)
iov.iov_base = buffer;
iov.iov_len = (size_t)buflen;
- len = kernel_recvmsg(sock, &msg, &iov, 1, buflen, 0);
+ len = kernel_recvmsg(sock, &msg, &iov, 1, buflen, MSG_DONTWAIT);
if (len < 0)
- return -1;
+ return len;
LeaveFunction(7);
return len;
@@ -1561,44 +1584,47 @@ ip_vs_receive(struct socket *sock, char *buffer, const size_t buflen)
/* Wakeup the master thread for sending */
static void master_wakeup_work_handler(struct work_struct *work)
{
- struct netns_ipvs *ipvs = container_of(work, struct netns_ipvs,
- master_wakeup_work.work);
+ struct ipvs_master_sync_state *ms =
+ container_of(work, struct ipvs_master_sync_state,
+ master_wakeup_work.work);
+ struct netns_ipvs *ipvs = ms->ipvs;
spin_lock_bh(&ipvs->sync_lock);
- if (ipvs->sync_queue_len &&
- ipvs->sync_queue_delay < IPVS_SYNC_WAKEUP_RATE) {
- ipvs->sync_queue_delay = IPVS_SYNC_WAKEUP_RATE;
- wake_up_process(ipvs->master_thread);
+ if (ms->sync_queue_len &&
+ ms->sync_queue_delay < IPVS_SYNC_WAKEUP_RATE) {
+ ms->sync_queue_delay = IPVS_SYNC_WAKEUP_RATE;
+ wake_up_process(ms->master_thread);
}
spin_unlock_bh(&ipvs->sync_lock);
}
/* Get next buffer to send */
static inline struct ip_vs_sync_buff *
-next_sync_buff(struct netns_ipvs *ipvs)
+next_sync_buff(struct netns_ipvs *ipvs, struct ipvs_master_sync_state *ms)
{
struct ip_vs_sync_buff *sb;
- sb = sb_dequeue(ipvs);
+ sb = sb_dequeue(ipvs, ms);
if (sb)
return sb;
/* Do not delay entries in buffer for more than 2 seconds */
- return get_curr_sync_buff(ipvs, IPVS_SYNC_FLUSH_TIME);
+ return get_curr_sync_buff(ipvs, ms, IPVS_SYNC_FLUSH_TIME);
}
static int sync_thread_master(void *data)
{
struct ip_vs_sync_thread_data *tinfo = data;
struct netns_ipvs *ipvs = net_ipvs(tinfo->net);
+ struct ipvs_master_sync_state *ms = &ipvs->ms[tinfo->id];
struct sock *sk = tinfo->sock->sk;
struct ip_vs_sync_buff *sb;
pr_info("sync thread started: state = MASTER, mcast_ifn = %s, "
- "syncid = %d\n",
- ipvs->master_mcast_ifn, ipvs->master_syncid);
+ "syncid = %d, id = %d\n",
+ ipvs->master_mcast_ifn, ipvs->master_syncid, tinfo->id);
for (;;) {
- sb = next_sync_buff(ipvs);
+ sb = next_sync_buff(ipvs, ms);
if (unlikely(kthread_should_stop()))
break;
if (!sb) {
@@ -1624,12 +1650,12 @@ done:
ip_vs_sync_buff_release(sb);
/* clean up the sync_buff queue */
- while ((sb = sb_dequeue(ipvs)))
+ while ((sb = sb_dequeue(ipvs, ms)))
ip_vs_sync_buff_release(sb);
__set_current_state(TASK_RUNNING);
/* clean up the current sync_buff */
- sb = get_curr_sync_buff(ipvs, 0);
+ sb = get_curr_sync_buff(ipvs, ms, 0);
if (sb)
ip_vs_sync_buff_release(sb);
@@ -1648,8 +1674,8 @@ static int sync_thread_backup(void *data)
int len;
pr_info("sync thread started: state = BACKUP, mcast_ifn = %s, "
- "syncid = %d\n",
- ipvs->backup_mcast_ifn, ipvs->backup_syncid);
+ "syncid = %d, id = %d\n",
+ ipvs->backup_mcast_ifn, ipvs->backup_syncid, tinfo->id);
while (!kthread_should_stop()) {
wait_event_interruptible(*sk_sleep(tinfo->sock->sk),
@@ -1661,7 +1687,8 @@ static int sync_thread_backup(void *data)
len = ip_vs_receive(tinfo->sock, tinfo->buf,
ipvs->recv_mesg_maxlen);
if (len <= 0) {
- pr_err("receiving message error\n");
+ if (len != -EAGAIN)
+ pr_err("receiving message error\n");
break;
}
@@ -1685,90 +1712,140 @@ static int sync_thread_backup(void *data)
int start_sync_thread(struct net *net, int state, char *mcast_ifn, __u8 syncid)
{
struct ip_vs_sync_thread_data *tinfo;
- struct task_struct **realtask, *task;
+ struct task_struct **array = NULL, *task;
struct socket *sock;
struct netns_ipvs *ipvs = net_ipvs(net);
- char *name, *buf = NULL;
+ char *name;
int (*threadfn)(void *data);
+ int id, count;
int result = -ENOMEM;
IP_VS_DBG(7, "%s(): pid %d\n", __func__, task_pid_nr(current));
IP_VS_DBG(7, "Each ip_vs_sync_conn entry needs %Zd bytes\n",
sizeof(struct ip_vs_sync_conn_v0));
+ if (!ipvs->sync_state) {
+ count = clamp(sysctl_sync_ports(ipvs), 1, IPVS_SYNC_PORTS_MAX);
+ ipvs->threads_mask = count - 1;
+ } else
+ count = ipvs->threads_mask + 1;
if (state == IP_VS_STATE_MASTER) {
- if (ipvs->master_thread)
+ if (ipvs->ms)
return -EEXIST;
strlcpy(ipvs->master_mcast_ifn, mcast_ifn,
sizeof(ipvs->master_mcast_ifn));
ipvs->master_syncid = syncid;
- realtask = &ipvs->master_thread;
- name = "ipvs_master:%d";
+ name = "ipvs-m:%d:%d";
threadfn = sync_thread_master;
- ipvs->sync_queue_len = 0;
- ipvs->sync_queue_delay = 0;
- INIT_DELAYED_WORK(&ipvs->master_wakeup_work,
- master_wakeup_work_handler);
- sock = make_send_sock(net);
} else if (state == IP_VS_STATE_BACKUP) {
- if (ipvs->backup_thread)
+ if (ipvs->backup_threads)
return -EEXIST;
strlcpy(ipvs->backup_mcast_ifn, mcast_ifn,
sizeof(ipvs->backup_mcast_ifn));
ipvs->backup_syncid = syncid;
- realtask = &ipvs->backup_thread;
- name = "ipvs_backup:%d";
+ name = "ipvs-b:%d:%d";
threadfn = sync_thread_backup;
- sock = make_receive_sock(net);
} else {
return -EINVAL;
}
- if (IS_ERR(sock)) {
- result = PTR_ERR(sock);
- goto out;
- }
+ if (state == IP_VS_STATE_MASTER) {
+ struct ipvs_master_sync_state *ms;
- set_sync_mesg_maxlen(net, state);
- if (state == IP_VS_STATE_BACKUP) {
- buf = kmalloc(ipvs->recv_mesg_maxlen, GFP_KERNEL);
- if (!buf)
- goto outsocket;
+ ipvs->ms = kzalloc(count * sizeof(ipvs->ms[0]), GFP_KERNEL);
+ if (!ipvs->ms)
+ goto out;
+ ms = ipvs->ms;
+ for (id = 0; id < count; id++, ms++) {
+ INIT_LIST_HEAD(&ms->sync_queue);
+ ms->sync_queue_len = 0;
+ ms->sync_queue_delay = 0;
+ INIT_DELAYED_WORK(&ms->master_wakeup_work,
+ master_wakeup_work_handler);
+ ms->ipvs = ipvs;
+ }
+ } else {
+ array = kzalloc(count * sizeof(struct task_struct *),
+ GFP_KERNEL);
+ if (!array)
+ goto out;
}
+ set_sync_mesg_maxlen(net, state);
- tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL);
- if (!tinfo)
- goto outbuf;
-
- tinfo->net = net;
- tinfo->sock = sock;
- tinfo->buf = buf;
+ tinfo = NULL;
+ for (id = 0; id < count; id++) {
+ if (state == IP_VS_STATE_MASTER)
+ sock = make_send_sock(net, id);
+ else
+ sock = make_receive_sock(net, id);
+ if (IS_ERR(sock)) {
+ result = PTR_ERR(sock);
+ goto outtinfo;
+ }
+ tinfo = kmalloc(sizeof(*tinfo), GFP_KERNEL);
+ if (!tinfo)
+ goto outsocket;
+ tinfo->net = net;
+ tinfo->sock = sock;
+ if (state == IP_VS_STATE_BACKUP) {
+ tinfo->buf = kmalloc(ipvs->recv_mesg_maxlen,
+ GFP_KERNEL);
+ if (!tinfo->buf)
+ goto outtinfo;
+ }
+ tinfo->id = id;
- task = kthread_run(threadfn, tinfo, name, ipvs->gen);
- if (IS_ERR(task)) {
- result = PTR_ERR(task);
- goto outtinfo;
+ task = kthread_run(threadfn, tinfo, name, ipvs->gen, id);
+ if (IS_ERR(task)) {
+ result = PTR_ERR(task);
+ goto outtinfo;
+ }
+ tinfo = NULL;
+ if (state == IP_VS_STATE_MASTER)
+ ipvs->ms[id].master_thread = task;
+ else
+ array[id] = task;
}
/* mark as active */
- *realtask = task;
+
+ if (state == IP_VS_STATE_BACKUP)
+ ipvs->backup_threads = array;
+ spin_lock_bh(&ipvs->sync_buff_lock);
ipvs->sync_state |= state;
+ spin_unlock_bh(&ipvs->sync_buff_lock);
/* increase the module use count */
ip_vs_use_count_inc();
return 0;
-outtinfo:
- kfree(tinfo);
-outbuf:
- kfree(buf);
outsocket:
sk_release_kernel(sock->sk);
+
+outtinfo:
+ if (tinfo) {
+ sk_release_kernel(tinfo->sock->sk);
+ kfree(tinfo->buf);
+ kfree(tinfo);
+ }
+ count = id;
+ while (count-- > 0) {
+ if (state == IP_VS_STATE_MASTER)
+ kthread_stop(ipvs->ms[count].master_thread);
+ else
+ kthread_stop(array[count]);
+ }
+ kfree(array);
+
out:
+ if (!(ipvs->sync_state & IP_VS_STATE_MASTER)) {
+ kfree(ipvs->ms);
+ ipvs->ms = NULL;
+ }
return result;
}
@@ -1776,39 +1853,60 @@ out:
int stop_sync_thread(struct net *net, int state)
{
struct netns_ipvs *ipvs = net_ipvs(net);
+ struct task_struct **array;
+ int id;
int retc = -EINVAL;
IP_VS_DBG(7, "%s(): pid %d\n", __func__, task_pid_nr(current));
if (state == IP_VS_STATE_MASTER) {
- if (!ipvs->master_thread)
+ if (!ipvs->ms)
return -ESRCH;
- pr_info("stopping master sync thread %d ...\n",
- task_pid_nr(ipvs->master_thread));
-
/*
* The lock synchronizes with sb_queue_tail(), so that we don't
* add sync buffers to the queue, when we are already in
* progress of stopping the master sync daemon.
*/
- spin_lock_bh(&ipvs->sync_lock);
+ spin_lock_bh(&ipvs->sync_buff_lock);
+ spin_lock(&ipvs->sync_lock);
ipvs->sync_state &= ~IP_VS_STATE_MASTER;
- spin_unlock_bh(&ipvs->sync_lock);
- cancel_delayed_work_sync(&ipvs->master_wakeup_work);
- retc = kthread_stop(ipvs->master_thread);
- ipvs->master_thread = NULL;
+ spin_unlock(&ipvs->sync_lock);
+ spin_unlock_bh(&ipvs->sync_buff_lock);
+
+ retc = 0;
+ for (id = ipvs->threads_mask; id >= 0; id--) {
+ struct ipvs_master_sync_state *ms = &ipvs->ms[id];
+ int ret;
+
+ pr_info("stopping master sync thread %d ...\n",
+ task_pid_nr(ms->master_thread));
+ cancel_delayed_work_sync(&ms->master_wakeup_work);
+ ret = kthread_stop(ms->master_thread);
+ if (retc >= 0)
+ retc = ret;
+ }
+ kfree(ipvs->ms);
+ ipvs->ms = NULL;
} else if (state == IP_VS_STATE_BACKUP) {
- if (!ipvs->backup_thread)
+ if (!ipvs->backup_threads)
return -ESRCH;
- pr_info("stopping backup sync thread %d ...\n",
- task_pid_nr(ipvs->backup_thread));
-
ipvs->sync_state &= ~IP_VS_STATE_BACKUP;
- retc = kthread_stop(ipvs->backup_thread);
- ipvs->backup_thread = NULL;
+ array = ipvs->backup_threads;
+ retc = 0;
+ for (id = ipvs->threads_mask; id >= 0; id--) {
+ int ret;
+
+ pr_info("stopping backup sync thread %d ...\n",
+ task_pid_nr(array[id]));
+ ret = kthread_stop(array[id]);
+ if (retc >= 0)
+ retc = ret;
+ }
+ kfree(array);
+ ipvs->backup_threads = NULL;
}
/* decrease the module use count */
@@ -1825,13 +1923,8 @@ int __net_init ip_vs_sync_net_init(struct net *net)
struct netns_ipvs *ipvs = net_ipvs(net);
__mutex_init(&ipvs->sync_mutex, "ipvs->sync_mutex", &__ipvs_sync_key);
- INIT_LIST_HEAD(&ipvs->sync_queue);
spin_lock_init(&ipvs->sync_lock);
spin_lock_init(&ipvs->sync_buff_lock);
-
- ipvs->sync_mcast_addr.sin_family = AF_INET;
- ipvs->sync_mcast_addr.sin_port = cpu_to_be16(IP_VS_SYNC_PORT);
- ipvs->sync_mcast_addr.sin_addr.s_addr = cpu_to_be32(IP_VS_SYNC_GROUP);
return 0;
}
--
1.7.9.5
* [PATCH 19/25] ipvs: optimize the use of flags in ip_vs_bind_dest
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (17 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 18/25] ipvs: add support for sync threads pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 20/25] ipvs: ip_vs_ftp: local functions should not be exposed globally pablo
` (6 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Julian Anastasov <ja@ssi.bg>
cp->flags is marked volatile, but ip_vs_bind_dest
can safely modify the flags, so save some CPU cycles by
using a temporary variable.
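As a rough userspace illustration of the idea (not the kernel code itself), the sketch below reads a volatile flags word once into a local, does all the bit manipulation on the local, and writes it back once. The demo_* names and the DEMO_F_* values are made up for this example and are not taken from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values only; assumptions for this sketch. */
#define DEMO_F_FWD_MASK  0x0007u
#define DEMO_F_NOOUTPUT  0x0008u
#define DEMO_F_SYNC      0x0020u
#define DEMO_F_INACTIVE  0x0100u
#define DEMO_F_TEMPLATE  0x1000u

struct demo_conn {
	volatile uint32_t flags;	/* each access is a real load or store */
};

/*
 * Merge dest_flags into conn->flags using one volatile load and one
 * volatile store, mirroring the shape of the change in the patch.
 */
static uint32_t demo_bind_flags(struct demo_conn *conn, uint32_t dest_flags)
{
	uint32_t flags;

	assert(conn);
	flags = conn->flags;		/* single volatile read */

	if (flags & DEMO_F_SYNC) {
		/* synced, non-template connections keep their activity flag */
		if (!(flags & DEMO_F_TEMPLATE))
			dest_flags &= ~DEMO_F_INACTIVE;
		/* forwarding method is inherited from the destination */
		flags &= ~(DEMO_F_FWD_MASK | DEMO_F_NOOUTPUT);
	}
	flags |= dest_flags;
	conn->flags = flags;		/* single volatile write */
	return flags;
}
```

Returning the local lets a caller test the final flags without touching the volatile field again, which is what the later hunks of the patch do with the TEMPLATE and INACTIVE checks.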
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_conn.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index c7edf20..1548df9 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -548,6 +548,7 @@ static inline void
ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest)
{
unsigned int conn_flags;
+ __u32 flags;
/* if dest is NULL, then return directly */
if (!dest)
@@ -559,17 +560,19 @@ ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest)
conn_flags = atomic_read(&dest->conn_flags);
if (cp->protocol != IPPROTO_UDP)
conn_flags &= ~IP_VS_CONN_F_ONE_PACKET;
+ flags = cp->flags;
/* Bind with the destination and its corresponding transmitter */
- if (cp->flags & IP_VS_CONN_F_SYNC) {
+ if (flags & IP_VS_CONN_F_SYNC) {
/* if the connection is not template and is created
* by sync, preserve the activity flag.
*/
- if (!(cp->flags & IP_VS_CONN_F_TEMPLATE))
+ if (!(flags & IP_VS_CONN_F_TEMPLATE))
conn_flags &= ~IP_VS_CONN_F_INACTIVE;
/* connections inherit forwarding method from dest */
- cp->flags &= ~(IP_VS_CONN_F_FWD_MASK | IP_VS_CONN_F_NOOUTPUT);
+ flags &= ~(IP_VS_CONN_F_FWD_MASK | IP_VS_CONN_F_NOOUTPUT);
}
- cp->flags |= conn_flags;
+ flags |= conn_flags;
+ cp->flags = flags;
cp->dest = dest;
IP_VS_DBG_BUF(7, "Bind-dest %s c:%s:%d v:%s:%d "
@@ -584,12 +587,12 @@ ip_vs_bind_dest(struct ip_vs_conn *cp, struct ip_vs_dest *dest)
atomic_read(&dest->refcnt));
/* Update the connection counters */
- if (!(cp->flags & IP_VS_CONN_F_TEMPLATE)) {
+ if (!(flags & IP_VS_CONN_F_TEMPLATE)) {
/* It is a normal connection, so modify the counters
* according to the flags, later the protocol can
* update them on state change
*/
- if (!(cp->flags & IP_VS_CONN_F_INACTIVE))
+ if (!(flags & IP_VS_CONN_F_INACTIVE))
atomic_inc(&dest->activeconns);
else
atomic_inc(&dest->inactconns);
--
1.7.9.5
* [PATCH 20/25] ipvs: ip_vs_ftp: local functions should not be exposed globally
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (18 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 19/25] ipvs: optimize the use of flags in ip_vs_bind_dest pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 21/25] ipvs: ip_vs_proto: " pablo
` (5 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: H Hartley Sweeten <hartleys@visionengravers.com>
Functions not referenced outside of a source file should be marked
static to prevent them from being exposed globally.
This quiets the sparse warning:
warning: symbol 'ip_vs_ftp_init' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_ftp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
index debb8c7..091bec9 100644
--- a/net/netfilter/ipvs/ip_vs_ftp.c
+++ b/net/netfilter/ipvs/ip_vs_ftp.c
@@ -483,7 +483,7 @@ static struct pernet_operations ip_vs_ftp_ops = {
.exit = __ip_vs_ftp_exit,
};
-int __init ip_vs_ftp_init(void)
+static int __init ip_vs_ftp_init(void)
{
int rv;
--
1.7.9.5
* [PATCH 21/25] ipvs: ip_vs_proto: local functions should not be exposed globally
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (19 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 20/25] ipvs: ip_vs_ftp: local functions should not be exposed globally pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 22/25] net: export sysctl_[r|w]mem_max symbols needed by ip_vs_sync pablo
` (4 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: H Hartley Sweeten <hartleys@visionengravers.com>
Functions not referenced outside of a source file should be marked
static to prevent them from being exposed globally.
This quiets the sparse warning:
warning: symbol '__ipvs_proto_data_get' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_proto.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index 8726488..e3f4bb0 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -153,7 +153,7 @@ EXPORT_SYMBOL(ip_vs_proto_get);
/*
* get ip_vs_protocol object data by netns and proto
*/
-struct ip_vs_proto_data *
+static struct ip_vs_proto_data *
__ipvs_proto_data_get(struct netns_ipvs *ipvs, unsigned short proto)
{
struct ip_vs_proto_data *pd;
--
1.7.9.5
* [PATCH 22/25] net: export sysctl_[r|w]mem_max symbols needed by ip_vs_sync
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (20 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 21/25] ipvs: ip_vs_proto: " pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 23/25] netfilter: nf_ct_expect: partially implement ctnetlink_change_expect pablo
` (3 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Hans Schillstrom <hans.schillstrom@ericsson.com>
To build ip_vs as a module, sysctl_rmem_max and sysctl_wmem_max
need to be exported.
The dependency was added by the "ipvs: wakeup master thread" patch.
Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/core/sock.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/core/sock.c b/net/core/sock.c
index c7e60ea..ac3131a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -258,7 +258,9 @@ static struct lock_class_key af_callback_keys[AF_MAX];
/* Run time adjustable parameters. */
__u32 sysctl_wmem_max __read_mostly = SK_WMEM_MAX;
+EXPORT_SYMBOL(sysctl_wmem_max);
__u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
+EXPORT_SYMBOL(sysctl_rmem_max);
__u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
__u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
--
1.7.9.5
* [PATCH 23/25] netfilter: nf_ct_expect: partially implement ctnetlink_change_expect
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (21 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 22/25] net: export sysctl_[r|w]mem_max symbols needed by ip_vs_sync pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 24/25] netfilter: nf_conntrack: fix explicit helper attachment and NAT pablo
` (2 subsequent siblings)
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Kelvie Wong <kelvie@ieee.org>
This refreshes the "timeout" attribute in existing expectations if one is
given.
The use case for this would be for userspace helpers to extend the lifetime
of the expectation when requested, as this is not possible right now
without deleting/recreating the expectation.
I use this specifically for forwarding DCERPC traffic through:
DCERPC has a port mapper daemon that chooses a (seemingly) random port for
future traffic to go to. We expect this traffic (with a reasonable
timeout), but sometimes the port mapper will tell the client to continue
using the same port. This allows us to extend the expectation accordingly.
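A small userspace model of the refresh logic may help when reading the hunk below. It is only a sketch: demo_timer and demo_refresh are hypothetical stand-ins for the kernel timer and for ctnetlink_change_expect, and the current tick count and tick rate are passed in as plain parameters instead of jiffies and HZ.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel timer: armed or not, plus expiry. */
struct demo_timer {
	int pending;
	unsigned long expires;
};

/*
 * Re-arm the timer "timeout_be" seconds from now. The timeout arrives
 * in network byte order, as a netlink attribute would carry it.
 * If the timer is no longer pending (it already fired), report -ETIME
 * and leave the expiry alone, as the patch does when del_timer() fails.
 */
static int demo_refresh(struct demo_timer *t, unsigned long now,
			unsigned long hz, uint32_t timeout_be)
{
	assert(t);
	if (!t->pending)
		return -ETIME;

	/* seconds -> ticks, relative to the current tick count */
	t->expires = now + (unsigned long)ntohl(timeout_be) * hz;
	return 0;
}
```

The early -ETIME return matters: an expectation whose timer has already fired is on its way out, so re-arming it would race with its removal.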
Signed-off-by: Kelvie Wong <kelvie@ieee.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_conntrack_netlink.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 462ec2d..6f4b00a 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -2080,7 +2080,15 @@ static int
ctnetlink_change_expect(struct nf_conntrack_expect *x,
const struct nlattr * const cda[])
{
- return -EOPNOTSUPP;
+ if (cda[CTA_EXPECT_TIMEOUT]) {
+ if (!del_timer(&x->timeout))
+ return -ETIME;
+
+ x->timeout.expires = jiffies +
+ ntohl(nla_get_be32(cda[CTA_EXPECT_TIMEOUT])) * HZ;
+ add_timer(&x->timeout);
+ }
+ return 0;
}
static const struct nla_policy exp_nat_nla_policy[CTA_EXPECT_NAT_MAX+1] = {
--
1.7.9.5
* [PATCH 24/25] netfilter: nf_conntrack: fix explicit helper attachment and NAT
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (22 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 23/25] netfilter: nf_ct_expect: partially implement ctnetlink_change_expect pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 7:49 ` [PATCH 25/25] netfilter: remove ip_queue support pablo
2012-05-08 16:49 ` [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) David Miller
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Pablo Neira Ayuso <pablo@netfilter.org>
Explicit helper attachment via the CT target is broken with NAT
if non-standard ports are used. This problem was hidden behind
the automatic helper assignment routine. Thus, it becomes more
noticeable now that we can disable the automatic helper assignment
with Eric Leblond's:
9e8ac5a netfilter: nf_ct_helper: allow to disable automatic helper assignment
Basically, nf_conntrack_alter_reply asks for the helper to be
looked up again if NAT is enabled. Unfortunately, we don't have the
conntrack template at that point anymore.
Since we don't want to rely on the automatic helper assignment,
we can skip the second look-up and stick to the helper that was
attached by iptables. With the CT target, the user is in full
control of helper attachment, thus, the policy is to trust what
the user explicitly configures via iptables (no automatic magic
anymore).
Interestingly, this bug was hidden by the automatic helper look-up
code. But it can be easily triggered if you attach the helper to
a non-standard port, e.g.
iptables -I PREROUTING -t raw -p tcp --dport 8888 \
-j CT --helper ftp
and you have disabled the automatic helper assignment.
I added the IPS_HELPER_BIT that allows us to differentiate between
a helper that has been explicitly attached and those that have been
automatically assigned. I didn't come up with a better solution
(having backward compatibility in mind).
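The effect of the new status bit can be sketched in plain C. This is a simplified model, not the kernel function: demo_ct and demo_try_assign_helper are hypothetical, helpers are represented by name strings, and the automatic look-up result is passed in as a parameter (NULL meaning the look-up found nothing, as happens on a non-standard port).

```c
#include <assert.h>
#include <stddef.h>

/* Same bit number as IPS_HELPER_BIT in the patch. */
#define DEMO_IPS_HELPER_BIT 13

/* Hypothetical, stripped-down conntrack entry. */
struct demo_ct {
	unsigned long status;
	const char *helper;
};

/*
 * tmpl_helper: helper attached explicitly via the template, or NULL.
 * auto_helper: result of the automatic look-up, or NULL if none matched.
 */
static int demo_try_assign_helper(struct demo_ct *ct,
				  const char *tmpl_helper,
				  const char *auto_helper)
{
	const char *helper = NULL;

	assert(ct != NULL);

	/* An explicitly attached helper is never overridden or dropped. */
	if (ct->status & (1UL << DEMO_IPS_HELPER_BIT))
		return 0;

	if (tmpl_helper != NULL) {
		helper = tmpl_helper;
		ct->status |= 1UL << DEMO_IPS_HELPER_BIT;
	}
	if (helper == NULL)
		helper = auto_helper;

	ct->helper = helper;
	return 0;
}
```

In this model, a second call without a template (the NAT re-lookup case) leaves the explicitly attached helper in place; without the early return it would be replaced by whatever the automatic look-up produced, possibly nothing.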
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
include/linux/netfilter/nf_conntrack_common.h | 4 ++++
net/netfilter/nf_conntrack_helper.c | 13 ++++++++++++-
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 0d3dd66..d146872 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -83,6 +83,10 @@ enum ip_conntrack_status {
/* Conntrack is a fake untracked entry */
IPS_UNTRACKED_BIT = 12,
IPS_UNTRACKED = (1 << IPS_UNTRACKED_BIT),
+
+ /* Conntrack got a helper explicitly attached via CT target. */
+ IPS_HELPER_BIT = 13,
+ IPS_HELPER = (1 << IPS_HELPER_BIT),
};
/* Connection tracking event types */
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index 55234dd..fee8fd7 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -181,10 +181,21 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
struct net *net = nf_ct_net(ct);
int ret = 0;
+ /* We already got a helper explicitly attached. The function
+ * nf_conntrack_alter_reply - in case NAT is in use - asks for looking
+ * the helper up again. Since now the user is in full control of
+ * making consistent helper configurations, skip this automatic
+ * re-lookup, otherwise we'll lose the helper.
+ */
+ if (test_bit(IPS_HELPER_BIT, &ct->status))
+ return 0;
+
if (tmpl != NULL) {
help = nfct_help(tmpl);
- if (help != NULL)
+ if (help != NULL) {
helper = help->helper;
+ set_bit(IPS_HELPER_BIT, &ct->status);
+ }
}
help = nfct_help(ct);
--
1.7.9.5
* [PATCH 25/25] netfilter: remove ip_queue support
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (23 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 24/25] netfilter: nf_conntrack: fix explicit helper attachment and NAT pablo
@ 2012-05-08 7:49 ` pablo
2012-05-08 16:49 ` [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) David Miller
25 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 7:49 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev
From: Pablo Neira Ayuso <pablo@netfilter.org>
This patch removes ip_queue support which was marked as obsolete
years ago. The nfnetlink_queue module provides a more advanced
user-space packet queueing mechanism.
This patch also removes capability code included in SELinux that
refers to ip_queue. Otherwise, we break compilation.
Several warnings have been sent regarding this to the mailing list
in the past month without anyone raising a hand to stop this
with a strong argument.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
Documentation/ABI/removed/ip_queue | 9 +
include/linux/netfilter_ipv4/Kbuild | 1 -
include/linux/netfilter_ipv4/ip_queue.h | 72 ----
include/linux/netlink.h | 2 +-
net/ipv4/netfilter/Makefile | 3 -
net/ipv4/netfilter/ip_queue.c | 639 ------------------------------
net/ipv6/netfilter/Kconfig | 22 --
net/ipv6/netfilter/Makefile | 1 -
net/ipv6/netfilter/ip6_queue.c | 641 -------------------------------
security/selinux/nlmsgtab.c | 13 -
10 files changed, 10 insertions(+), 1393 deletions(-)
create mode 100644 Documentation/ABI/removed/ip_queue
delete mode 100644 include/linux/netfilter_ipv4/ip_queue.h
delete mode 100644 net/ipv4/netfilter/ip_queue.c
delete mode 100644 net/ipv6/netfilter/ip6_queue.c
diff --git a/Documentation/ABI/removed/ip_queue b/Documentation/ABI/removed/ip_queue
new file mode 100644
index 0000000..3243613
--- /dev/null
+++ b/Documentation/ABI/removed/ip_queue
@@ -0,0 +1,9 @@
+What: ip_queue
+Date: finally removed in kernel v3.5.0
+Contact: Pablo Neira Ayuso <pablo@netfilter.org>
+Description:
+ ip_queue has been replaced by nfnetlink_queue which provides
+ more advanced queueing mechanism to user-space. The ip_queue
+ module was already announced to become obsolete years ago.
+
+Users:
diff --git a/include/linux/netfilter_ipv4/Kbuild b/include/linux/netfilter_ipv4/Kbuild
index 31f8bec..c61b8fb 100644
--- a/include/linux/netfilter_ipv4/Kbuild
+++ b/include/linux/netfilter_ipv4/Kbuild
@@ -1,4 +1,3 @@
-header-y += ip_queue.h
header-y += ip_tables.h
header-y += ipt_CLUSTERIP.h
header-y += ipt_ECN.h
diff --git a/include/linux/netfilter_ipv4/ip_queue.h b/include/linux/netfilter_ipv4/ip_queue.h
deleted file mode 100644
index a03507f..0000000
--- a/include/linux/netfilter_ipv4/ip_queue.h
+++ /dev/null
@@ -1,72 +0,0 @@
-/*
- * This is a module which is used for queueing IPv4 packets and
- * communicating with userspace via netlink.
- *
- * (C) 2000 James Morris, this code is GPL.
- */
-#ifndef _IP_QUEUE_H
-#define _IP_QUEUE_H
-
-#ifdef __KERNEL__
-#ifdef DEBUG_IPQ
-#define QDEBUG(x...) printk(KERN_DEBUG ## x)
-#else
-#define QDEBUG(x...)
-#endif /* DEBUG_IPQ */
-#else
-#include <net/if.h>
-#endif /* ! __KERNEL__ */
-
-/* Messages sent from kernel */
-typedef struct ipq_packet_msg {
- unsigned long packet_id; /* ID of queued packet */
- unsigned long mark; /* Netfilter mark value */
- long timestamp_sec; /* Packet arrival time (seconds) */
- long timestamp_usec; /* Packet arrvial time (+useconds) */
- unsigned int hook; /* Netfilter hook we rode in on */
- char indev_name[IFNAMSIZ]; /* Name of incoming interface */
- char outdev_name[IFNAMSIZ]; /* Name of outgoing interface */
- __be16 hw_protocol; /* Hardware protocol (network order) */
- unsigned short hw_type; /* Hardware type */
- unsigned char hw_addrlen; /* Hardware address length */
- unsigned char hw_addr[8]; /* Hardware address */
- size_t data_len; /* Length of packet data */
- unsigned char payload[0]; /* Optional packet data */
-} ipq_packet_msg_t;
-
-/* Messages sent from userspace */
-typedef struct ipq_mode_msg {
- unsigned char value; /* Requested mode */
- size_t range; /* Optional range of packet requested */
-} ipq_mode_msg_t;
-
-typedef struct ipq_verdict_msg {
- unsigned int value; /* Verdict to hand to netfilter */
- unsigned long id; /* Packet ID for this verdict */
- size_t data_len; /* Length of replacement data */
- unsigned char payload[0]; /* Optional replacement packet */
-} ipq_verdict_msg_t;
-
-typedef struct ipq_peer_msg {
- union {
- ipq_verdict_msg_t verdict;
- ipq_mode_msg_t mode;
- } msg;
-} ipq_peer_msg_t;
-
-/* Packet delivery modes */
-enum {
- IPQ_COPY_NONE, /* Initial mode, packets are dropped */
- IPQ_COPY_META, /* Copy metadata */
- IPQ_COPY_PACKET /* Copy metadata + packet (range) */
-};
-#define IPQ_COPY_MAX IPQ_COPY_PACKET
-
-/* Types of messages */
-#define IPQM_BASE 0x10 /* standard netlink messages below this */
-#define IPQM_MODE (IPQM_BASE + 1) /* Mode request from peer */
-#define IPQM_VERDICT (IPQM_BASE + 2) /* Verdict from peer */
-#define IPQM_PACKET (IPQM_BASE + 3) /* Packet from kernel */
-#define IPQM_MAX (IPQM_BASE + 4)
-
-#endif /*_IP_QUEUE_H*/
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index a2092f5..0f628ff 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -7,7 +7,7 @@
#define NETLINK_ROUTE 0 /* Routing/device hook */
#define NETLINK_UNUSED 1 /* Unused number */
#define NETLINK_USERSOCK 2 /* Reserved for user mode socket protocols */
-#define NETLINK_FIREWALL 3 /* Firewalling hook */
+#define NETLINK_FIREWALL 3 /* Unused number, formerly ip_queue */
#define NETLINK_SOCK_DIAG 4 /* socket monitoring */
#define NETLINK_NFLOG 5 /* netfilter/iptables ULOG */
#define NETLINK_XFRM 6 /* ipsec */
diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile
index 240b684..c20674d 100644
--- a/net/ipv4/netfilter/Makefile
+++ b/net/ipv4/netfilter/Makefile
@@ -66,6 +66,3 @@ obj-$(CONFIG_IP_NF_ARP_MANGLE) += arpt_mangle.o
# just filtering instance of ARP tables for now
obj-$(CONFIG_IP_NF_ARPFILTER) += arptable_filter.o
-
-obj-$(CONFIG_IP_NF_QUEUE) += ip_queue.o
-
diff --git a/net/ipv4/netfilter/ip_queue.c b/net/ipv4/netfilter/ip_queue.c
deleted file mode 100644
index 94d45e1..0000000
--- a/net/ipv4/netfilter/ip_queue.c
+++ /dev/null
@@ -1,639 +0,0 @@
-/*
- * This is a module which is used for queueing IPv4 packets and
- * communicating with userspace via netlink.
- *
- * (C) 2000-2002 James Morris <jmorris@intercode.com.au>
- * (C) 2003-2005 Netfilter Core Team <coreteam@netfilter.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-#include <linux/module.h>
-#include <linux/skbuff.h>
-#include <linux/init.h>
-#include <linux/ip.h>
-#include <linux/notifier.h>
-#include <linux/netdevice.h>
-#include <linux/netfilter.h>
-#include <linux/netfilter_ipv4/ip_queue.h>
-#include <linux/netfilter_ipv4/ip_tables.h>
-#include <linux/netlink.h>
-#include <linux/spinlock.h>
-#include <linux/sysctl.h>
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
-#include <linux/security.h>
-#include <linux/net.h>
-#include <linux/mutex.h>
-#include <linux/slab.h>
-#include <net/net_namespace.h>
-#include <net/sock.h>
-#include <net/route.h>
-#include <net/netfilter/nf_queue.h>
-#include <net/ip.h>
-
-#define IPQ_QMAX_DEFAULT 1024
-#define IPQ_PROC_FS_NAME "ip_queue"
-#define NET_IPQ_QMAX 2088
-#define NET_IPQ_QMAX_NAME "ip_queue_maxlen"
-
-typedef int (*ipq_cmpfn)(struct nf_queue_entry *, unsigned long);
-
-static unsigned char copy_mode __read_mostly = IPQ_COPY_NONE;
-static unsigned int queue_maxlen __read_mostly = IPQ_QMAX_DEFAULT;
-static DEFINE_SPINLOCK(queue_lock);
-static int peer_pid __read_mostly;
-static unsigned int copy_range __read_mostly;
-static unsigned int queue_total;
-static unsigned int queue_dropped = 0;
-static unsigned int queue_user_dropped = 0;
-static struct sock *ipqnl __read_mostly;
-static LIST_HEAD(queue_list);
-static DEFINE_MUTEX(ipqnl_mutex);
-
-static inline void
-__ipq_enqueue_entry(struct nf_queue_entry *entry)
-{
- list_add_tail(&entry->list, &queue_list);
- queue_total++;
-}
-
-static inline int
-__ipq_set_mode(unsigned char mode, unsigned int range)
-{
- int status = 0;
-
- switch(mode) {
- case IPQ_COPY_NONE:
- case IPQ_COPY_META:
- copy_mode = mode;
- copy_range = 0;
- break;
-
- case IPQ_COPY_PACKET:
- if (range > 0xFFFF)
- range = 0xFFFF;
- copy_range = range;
- copy_mode = mode;
- break;
-
- default:
- status = -EINVAL;
-
- }
- return status;
-}
-
-static void __ipq_flush(ipq_cmpfn cmpfn, unsigned long data);
-
-static inline void
-__ipq_reset(void)
-{
- peer_pid = 0;
- net_disable_timestamp();
- __ipq_set_mode(IPQ_COPY_NONE, 0);
- __ipq_flush(NULL, 0);
-}
-
-static struct nf_queue_entry *
-ipq_find_dequeue_entry(unsigned long id)
-{
- struct nf_queue_entry *entry = NULL, *i;
-
- spin_lock_bh(&queue_lock);
-
- list_for_each_entry(i, &queue_list, list) {
- if ((unsigned long)i == id) {
- entry = i;
- break;
- }
- }
-
- if (entry) {
- list_del(&entry->list);
- queue_total--;
- }
-
- spin_unlock_bh(&queue_lock);
- return entry;
-}
-
-static void
-__ipq_flush(ipq_cmpfn cmpfn, unsigned long data)
-{
- struct nf_queue_entry *entry, *next;
-
- list_for_each_entry_safe(entry, next, &queue_list, list) {
- if (!cmpfn || cmpfn(entry, data)) {
- list_del(&entry->list);
- queue_total--;
- nf_reinject(entry, NF_DROP);
- }
- }
-}
-
-static void
-ipq_flush(ipq_cmpfn cmpfn, unsigned long data)
-{
- spin_lock_bh(&queue_lock);
- __ipq_flush(cmpfn, data);
- spin_unlock_bh(&queue_lock);
-}
-
-static struct sk_buff *
-ipq_build_packet_message(struct nf_queue_entry *entry, int *errp)
-{
- sk_buff_data_t old_tail;
- size_t size = 0;
- size_t data_len = 0;
- struct sk_buff *skb;
- struct ipq_packet_msg *pmsg;
- struct nlmsghdr *nlh;
- struct timeval tv;
-
- switch (ACCESS_ONCE(copy_mode)) {
- case IPQ_COPY_META:
- case IPQ_COPY_NONE:
- size = NLMSG_SPACE(sizeof(*pmsg));
- break;
-
- case IPQ_COPY_PACKET:
- if (entry->skb->ip_summed == CHECKSUM_PARTIAL &&
- (*errp = skb_checksum_help(entry->skb)))
- return NULL;
-
- data_len = ACCESS_ONCE(copy_range);
- if (data_len == 0 || data_len > entry->skb->len)
- data_len = entry->skb->len;
-
- size = NLMSG_SPACE(sizeof(*pmsg) + data_len);
- break;
-
- default:
- *errp = -EINVAL;
- return NULL;
- }
-
- skb = alloc_skb(size, GFP_ATOMIC);
- if (!skb)
- goto nlmsg_failure;
-
- old_tail = skb->tail;
- nlh = NLMSG_PUT(skb, 0, 0, IPQM_PACKET, size - sizeof(*nlh));
- pmsg = NLMSG_DATA(nlh);
- memset(pmsg, 0, sizeof(*pmsg));
-
- pmsg->packet_id = (unsigned long )entry;
- pmsg->data_len = data_len;
- tv = ktime_to_timeval(entry->skb->tstamp);
- pmsg->timestamp_sec = tv.tv_sec;
- pmsg->timestamp_usec = tv.tv_usec;
- pmsg->mark = entry->skb->mark;
- pmsg->hook = entry->hook;
- pmsg->hw_protocol = entry->skb->protocol;
-
- if (entry->indev)
- strcpy(pmsg->indev_name, entry->indev->name);
- else
- pmsg->indev_name[0] = '\0';
-
- if (entry->outdev)
- strcpy(pmsg->outdev_name, entry->outdev->name);
- else
- pmsg->outdev_name[0] = '\0';
-
- if (entry->indev && entry->skb->dev &&
- entry->skb->mac_header != entry->skb->network_header) {
- pmsg->hw_type = entry->skb->dev->type;
- pmsg->hw_addrlen = dev_parse_header(entry->skb,
- pmsg->hw_addr);
- }
-
- if (data_len)
- if (skb_copy_bits(entry->skb, 0, pmsg->payload, data_len))
- BUG();
-
- nlh->nlmsg_len = skb->tail - old_tail;
- return skb;
-
-nlmsg_failure:
- kfree_skb(skb);
- *errp = -EINVAL;
- printk(KERN_ERR "ip_queue: error creating packet message\n");
- return NULL;
-}
-
-static int
-ipq_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
-{
- int status = -EINVAL;
- struct sk_buff *nskb;
-
- if (copy_mode == IPQ_COPY_NONE)
- return -EAGAIN;
-
- nskb = ipq_build_packet_message(entry, &status);
- if (nskb == NULL)
- return status;
-
- spin_lock_bh(&queue_lock);
-
- if (!peer_pid)
- goto err_out_free_nskb;
-
- if (queue_total >= queue_maxlen) {
- queue_dropped++;
- status = -ENOSPC;
- if (net_ratelimit())
- printk (KERN_WARNING "ip_queue: full at %d entries, "
- "dropping packets(s). Dropped: %d\n", queue_total,
- queue_dropped);
- goto err_out_free_nskb;
- }
-
- /* netlink_unicast will either free the nskb or attach it to a socket */
- status = netlink_unicast(ipqnl, nskb, peer_pid, MSG_DONTWAIT);
- if (status < 0) {
- queue_user_dropped++;
- goto err_out_unlock;
- }
-
- __ipq_enqueue_entry(entry);
-
- spin_unlock_bh(&queue_lock);
- return status;
-
-err_out_free_nskb:
- kfree_skb(nskb);
-
-err_out_unlock:
- spin_unlock_bh(&queue_lock);
- return status;
-}
-
-static int
-ipq_mangle_ipv4(ipq_verdict_msg_t *v, struct nf_queue_entry *e)
-{
- int diff;
- struct iphdr *user_iph = (struct iphdr *)v->payload;
- struct sk_buff *nskb;
-
- if (v->data_len < sizeof(*user_iph))
- return 0;
- diff = v->data_len - e->skb->len;
- if (diff < 0) {
- if (pskb_trim(e->skb, v->data_len))
- return -ENOMEM;
- } else if (diff > 0) {
- if (v->data_len > 0xFFFF)
- return -EINVAL;
- if (diff > skb_tailroom(e->skb)) {
- nskb = skb_copy_expand(e->skb, skb_headroom(e->skb),
- diff, GFP_ATOMIC);
- if (!nskb) {
- printk(KERN_WARNING "ip_queue: error "
- "in mangle, dropping packet\n");
- return -ENOMEM;
- }
- kfree_skb(e->skb);
- e->skb = nskb;
- }
- skb_put(e->skb, diff);
- }
- if (!skb_make_writable(e->skb, v->data_len))
- return -ENOMEM;
- skb_copy_to_linear_data(e->skb, v->payload, v->data_len);
- e->skb->ip_summed = CHECKSUM_NONE;
-
- return 0;
-}
-
-static int
-ipq_set_verdict(struct ipq_verdict_msg *vmsg, unsigned int len)
-{
- struct nf_queue_entry *entry;
-
- if (vmsg->value > NF_MAX_VERDICT || vmsg->value == NF_STOLEN)
- return -EINVAL;
-
- entry = ipq_find_dequeue_entry(vmsg->id);
- if (entry == NULL)
- return -ENOENT;
- else {
- int verdict = vmsg->value;
-
- if (vmsg->data_len && vmsg->data_len == len)
- if (ipq_mangle_ipv4(vmsg, entry) < 0)
- verdict = NF_DROP;
-
- nf_reinject(entry, verdict);
- return 0;
- }
-}
-
-static int
-ipq_set_mode(unsigned char mode, unsigned int range)
-{
- int status;
-
- spin_lock_bh(&queue_lock);
- status = __ipq_set_mode(mode, range);
- spin_unlock_bh(&queue_lock);
- return status;
-}
-
-static int
-ipq_receive_peer(struct ipq_peer_msg *pmsg,
- unsigned char type, unsigned int len)
-{
- int status = 0;
-
- if (len < sizeof(*pmsg))
- return -EINVAL;
-
- switch (type) {
- case IPQM_MODE:
- status = ipq_set_mode(pmsg->msg.mode.value,
- pmsg->msg.mode.range);
- break;
-
- case IPQM_VERDICT:
- status = ipq_set_verdict(&pmsg->msg.verdict,
- len - sizeof(*pmsg));
- break;
- default:
- status = -EINVAL;
- }
- return status;
-}
-
-static int
-dev_cmp(struct nf_queue_entry *entry, unsigned long ifindex)
-{
- if (entry->indev)
- if (entry->indev->ifindex == ifindex)
- return 1;
- if (entry->outdev)
- if (entry->outdev->ifindex == ifindex)
- return 1;
-#ifdef CONFIG_BRIDGE_NETFILTER
- if (entry->skb->nf_bridge) {
- if (entry->skb->nf_bridge->physindev &&
- entry->skb->nf_bridge->physindev->ifindex == ifindex)
- return 1;
- if (entry->skb->nf_bridge->physoutdev &&
- entry->skb->nf_bridge->physoutdev->ifindex == ifindex)
- return 1;
- }
-#endif
- return 0;
-}
-
-static void
-ipq_dev_drop(int ifindex)
-{
- ipq_flush(dev_cmp, ifindex);
-}
-
-#define RCV_SKB_FAIL(err) do { netlink_ack(skb, nlh, (err)); return; } while (0)
-
-static inline void
-__ipq_rcv_skb(struct sk_buff *skb)
-{
- int status, type, pid, flags;
- unsigned int nlmsglen, skblen;
- struct nlmsghdr *nlh;
- bool enable_timestamp = false;
-
- skblen = skb->len;
- if (skblen < sizeof(*nlh))
- return;
-
- nlh = nlmsg_hdr(skb);
- nlmsglen = nlh->nlmsg_len;
- if (nlmsglen < sizeof(*nlh) || skblen < nlmsglen)
- return;
-
- pid = nlh->nlmsg_pid;
- flags = nlh->nlmsg_flags;
-
- if(pid <= 0 || !(flags & NLM_F_REQUEST) || flags & NLM_F_MULTI)
- RCV_SKB_FAIL(-EINVAL);
-
- if (flags & MSG_TRUNC)
- RCV_SKB_FAIL(-ECOMM);
-
- type = nlh->nlmsg_type;
- if (type < NLMSG_NOOP || type >= IPQM_MAX)
- RCV_SKB_FAIL(-EINVAL);
-
- if (type <= IPQM_BASE)
- return;
-
- if (!capable(CAP_NET_ADMIN))
- RCV_SKB_FAIL(-EPERM);
-
- spin_lock_bh(&queue_lock);
-
- if (peer_pid) {
- if (peer_pid != pid) {
- spin_unlock_bh(&queue_lock);
- RCV_SKB_FAIL(-EBUSY);
- }
- } else {
- enable_timestamp = true;
- peer_pid = pid;
- }
-
- spin_unlock_bh(&queue_lock);
- if (enable_timestamp)
- net_enable_timestamp();
- status = ipq_receive_peer(NLMSG_DATA(nlh), type,
- nlmsglen - NLMSG_LENGTH(0));
- if (status < 0)
- RCV_SKB_FAIL(status);
-
- if (flags & NLM_F_ACK)
- netlink_ack(skb, nlh, 0);
-}
-
-static void
-ipq_rcv_skb(struct sk_buff *skb)
-{
- mutex_lock(&ipqnl_mutex);
- __ipq_rcv_skb(skb);
- mutex_unlock(&ipqnl_mutex);
-}
-
-static int
-ipq_rcv_dev_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
- struct net_device *dev = ptr;
-
- if (!net_eq(dev_net(dev), &init_net))
- return NOTIFY_DONE;
-
- /* Drop any packets associated with the downed device */
- if (event == NETDEV_DOWN)
- ipq_dev_drop(dev->ifindex);
- return NOTIFY_DONE;
-}
-
-static struct notifier_block ipq_dev_notifier = {
- .notifier_call = ipq_rcv_dev_event,
-};
-
-static int
-ipq_rcv_nl_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
- struct netlink_notify *n = ptr;
-
- if (event == NETLINK_URELEASE && n->protocol == NETLINK_FIREWALL) {
- spin_lock_bh(&queue_lock);
- if ((net_eq(n->net, &init_net)) && (n->pid == peer_pid))
- __ipq_reset();
- spin_unlock_bh(&queue_lock);
- }
- return NOTIFY_DONE;
-}
-
-static struct notifier_block ipq_nl_notifier = {
- .notifier_call = ipq_rcv_nl_event,
-};
-
-#ifdef CONFIG_SYSCTL
-static struct ctl_table_header *ipq_sysctl_header;
-
-static ctl_table ipq_table[] = {
- {
- .procname = NET_IPQ_QMAX_NAME,
- .data = &queue_maxlen,
- .maxlen = sizeof(queue_maxlen),
- .mode = 0644,
- .proc_handler = proc_dointvec
- },
- { }
-};
-#endif
-
-#ifdef CONFIG_PROC_FS
-static int ip_queue_show(struct seq_file *m, void *v)
-{
- spin_lock_bh(&queue_lock);
-
- seq_printf(m,
- "Peer PID : %d\n"
- "Copy mode : %hu\n"
- "Copy range : %u\n"
- "Queue length : %u\n"
- "Queue max. length : %u\n"
- "Queue dropped : %u\n"
- "Netlink dropped : %u\n",
- peer_pid,
- copy_mode,
- copy_range,
- queue_total,
- queue_maxlen,
- queue_dropped,
- queue_user_dropped);
-
- spin_unlock_bh(&queue_lock);
- return 0;
-}
-
-static int ip_queue_open(struct inode *inode, struct file *file)
-{
- return single_open(file, ip_queue_show, NULL);
-}
-
-static const struct file_operations ip_queue_proc_fops = {
- .open = ip_queue_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = single_release,
- .owner = THIS_MODULE,
-};
-#endif
-
-static const struct nf_queue_handler nfqh = {
- .name = "ip_queue",
- .outfn = &ipq_enqueue_packet,
-};
-
-static int __init ip_queue_init(void)
-{
- int status = -ENOMEM;
- struct proc_dir_entry *proc __maybe_unused;
-
- netlink_register_notifier(&ipq_nl_notifier);
- ipqnl = netlink_kernel_create(&init_net, NETLINK_FIREWALL, 0,
- ipq_rcv_skb, NULL, THIS_MODULE);
- if (ipqnl == NULL) {
- printk(KERN_ERR "ip_queue: failed to create netlink socket\n");
- goto cleanup_netlink_notifier;
- }
-
-#ifdef CONFIG_PROC_FS
- proc = proc_create(IPQ_PROC_FS_NAME, 0, init_net.proc_net,
- &ip_queue_proc_fops);
- if (!proc) {
- printk(KERN_ERR "ip_queue: failed to create proc entry\n");
- goto cleanup_ipqnl;
- }
-#endif
- register_netdevice_notifier(&ipq_dev_notifier);
-#ifdef CONFIG_SYSCTL
- ipq_sysctl_header = register_sysctl_paths(net_ipv4_ctl_path, ipq_table);
-#endif
- status = nf_register_queue_handler(NFPROTO_IPV4, &nfqh);
- if (status < 0) {
- printk(KERN_ERR "ip_queue: failed to register queue handler\n");
- goto cleanup_sysctl;
- }
- return status;
-
-cleanup_sysctl:
-#ifdef CONFIG_SYSCTL
- unregister_sysctl_table(ipq_sysctl_header);
-#endif
- unregister_netdevice_notifier(&ipq_dev_notifier);
- proc_net_remove(&init_net, IPQ_PROC_FS_NAME);
-cleanup_ipqnl: __maybe_unused
- netlink_kernel_release(ipqnl);
- mutex_lock(&ipqnl_mutex);
- mutex_unlock(&ipqnl_mutex);
-
-cleanup_netlink_notifier:
- netlink_unregister_notifier(&ipq_nl_notifier);
- return status;
-}
-
-static void __exit ip_queue_fini(void)
-{
- nf_unregister_queue_handlers(&nfqh);
-
- ipq_flush(NULL, 0);
-
-#ifdef CONFIG_SYSCTL
- unregister_sysctl_table(ipq_sysctl_header);
-#endif
- unregister_netdevice_notifier(&ipq_dev_notifier);
- proc_net_remove(&init_net, IPQ_PROC_FS_NAME);
-
- netlink_kernel_release(ipqnl);
- mutex_lock(&ipqnl_mutex);
- mutex_unlock(&ipqnl_mutex);
-
- netlink_unregister_notifier(&ipq_nl_notifier);
-}
-
-MODULE_DESCRIPTION("IPv4 packet queue handler");
-MODULE_AUTHOR("James Morris <jmorris@intercode.com.au>");
-MODULE_LICENSE("GPL");
-MODULE_ALIAS_NET_PF_PROTO(PF_NETLINK, NETLINK_FIREWALL);
-
-module_init(ip_queue_init);
-module_exit(ip_queue_fini);
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index d33cddd..1013534 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -25,28 +25,6 @@ config NF_CONNTRACK_IPV6
To compile it as a module, choose M here. If unsure, say N.
-config IP6_NF_QUEUE
- tristate "IP6 Userspace queueing via NETLINK (OBSOLETE)"
- depends on INET && IPV6 && NETFILTER
- depends on NETFILTER_ADVANCED
- ---help---
-
- This option adds a queue handler to the kernel for IPv6
- packets which enables users to receive the filtered packets
- with QUEUE target using libipq.
-
- This option enables the old IPv6-only "ip6_queue" implementation
- which has been obsoleted by the new "nfnetlink_queue" code (see
- CONFIG_NETFILTER_NETLINK_QUEUE).
-
- (C) Fernando Anton 2001
- IPv64 Project - Work based in IPv64 draft by Arturo Azcorra.
- Universidad Carlos III de Madrid
- Universidad Politecnica de Alcala de Henares
- email: <fanton@it.uc3m.es>.
-
- To compile it as a module, choose M here. If unsure, say N.
-
config IP6_NF_IPTABLES
tristate "IP6 tables support (required for filtering)"
depends on INET && IPV6
diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
index d4dfd0a..534d3f2 100644
--- a/net/ipv6/netfilter/Makefile
+++ b/net/ipv6/netfilter/Makefile
@@ -6,7 +6,6 @@
obj-$(CONFIG_IP6_NF_IPTABLES) += ip6_tables.o
obj-$(CONFIG_IP6_NF_FILTER) += ip6table_filter.o
obj-$(CONFIG_IP6_NF_MANGLE) += ip6table_mangle.o
-obj-$(CONFIG_IP6_NF_QUEUE) += ip6_queue.o
obj-$(CONFIG_IP6_NF_RAW) += ip6table_raw.o
obj-$(CONFIG_IP6_NF_SECURITY) += ip6table_security.o
diff --git a/net/ipv6/netfilter/ip6_queue.c b/net/ipv6/netfilter/ip6_queue.c
deleted file mode 100644
index a34c9e4..0000000
--- a/net/ipv6/netfilter/ip6_queue.c
+++ /dev/null
@@ -1,641 +0,0 @@
-/*
- * This is a module which is used for queueing IPv6 packets and
- * communicating with userspace via netlink.
- *
- * (C) 2001 Fernando Anton, this code is GPL.
- * IPv64 Project - Work based in IPv64 draft by Arturo Azcorra.
- * Universidad Carlos III de Madrid - Leganes (Madrid) - Spain
- * Universidad Politecnica de Alcala de Henares - Alcala de H. (Madrid) - Spain
- * email: fanton@it.uc3m.es
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-#include <linux/module.h>
-#include <linux/skbuff.h>
-#include <linux/init.h>
-#include <linux/ipv6.h>
-#include <linux/notifier.h>
-#include <linux/netdevice.h>
-#include <linux/netfilter.h>
-#include <linux/netlink.h>
-#include <linux/spinlock.h>
-#include <linux/sysctl.h>
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
-#include <linux/mutex.h>
-#include <linux/slab.h>
-#include <net/net_namespace.h>
-#include <net/sock.h>
-#include <net/ipv6.h>
-#include <net/ip6_route.h>
-#include <net/netfilter/nf_queue.h>
-#include <linux/netfilter_ipv4/ip_queue.h>
-#include <linux/netfilter_ipv4/ip_tables.h>
-#include <linux/netfilter_ipv6/ip6_tables.h>
-
-#define IPQ_QMAX_DEFAULT 1024
-#define IPQ_PROC_FS_NAME "ip6_queue"
-#define NET_IPQ_QMAX_NAME "ip6_queue_maxlen"
-
-typedef int (*ipq_cmpfn)(struct nf_queue_entry *, unsigned long);
-
-static unsigned char copy_mode __read_mostly = IPQ_COPY_NONE;
-static unsigned int queue_maxlen __read_mostly = IPQ_QMAX_DEFAULT;
-static DEFINE_SPINLOCK(queue_lock);
-static int peer_pid __read_mostly;
-static unsigned int copy_range __read_mostly;
-static unsigned int queue_total;
-static unsigned int queue_dropped = 0;
-static unsigned int queue_user_dropped = 0;
-static struct sock *ipqnl __read_mostly;
-static LIST_HEAD(queue_list);
-static DEFINE_MUTEX(ipqnl_mutex);
-
-static inline void
-__ipq_enqueue_entry(struct nf_queue_entry *entry)
-{
- list_add_tail(&entry->list, &queue_list);
- queue_total++;
-}
-
-static inline int
-__ipq_set_mode(unsigned char mode, unsigned int range)
-{
- int status = 0;
-
- switch(mode) {
- case IPQ_COPY_NONE:
- case IPQ_COPY_META:
- copy_mode = mode;
- copy_range = 0;
- break;
-
- case IPQ_COPY_PACKET:
- if (range > 0xFFFF)
- range = 0xFFFF;
- copy_range = range;
- copy_mode = mode;
- break;
-
- default:
- status = -EINVAL;
-
- }
- return status;
-}
-
-static void __ipq_flush(ipq_cmpfn cmpfn, unsigned long data);
-
-static inline void
-__ipq_reset(void)
-{
- peer_pid = 0;
- net_disable_timestamp();
- __ipq_set_mode(IPQ_COPY_NONE, 0);
- __ipq_flush(NULL, 0);
-}
-
-static struct nf_queue_entry *
-ipq_find_dequeue_entry(unsigned long id)
-{
- struct nf_queue_entry *entry = NULL, *i;
-
- spin_lock_bh(&queue_lock);
-
- list_for_each_entry(i, &queue_list, list) {
- if ((unsigned long)i == id) {
- entry = i;
- break;
- }
- }
-
- if (entry) {
- list_del(&entry->list);
- queue_total--;
- }
-
- spin_unlock_bh(&queue_lock);
- return entry;
-}
-
-static void
-__ipq_flush(ipq_cmpfn cmpfn, unsigned long data)
-{
- struct nf_queue_entry *entry, *next;
-
- list_for_each_entry_safe(entry, next, &queue_list, list) {
- if (!cmpfn || cmpfn(entry, data)) {
- list_del(&entry->list);
- queue_total--;
- nf_reinject(entry, NF_DROP);
- }
- }
-}
-
-static void
-ipq_flush(ipq_cmpfn cmpfn, unsigned long data)
-{
- spin_lock_bh(&queue_lock);
- __ipq_flush(cmpfn, data);
- spin_unlock_bh(&queue_lock);
-}
-
-static struct sk_buff *
-ipq_build_packet_message(struct nf_queue_entry *entry, int *errp)
-{
- sk_buff_data_t old_tail;
- size_t size = 0;
- size_t data_len = 0;
- struct sk_buff *skb;
- struct ipq_packet_msg *pmsg;
- struct nlmsghdr *nlh;
- struct timeval tv;
-
- switch (ACCESS_ONCE(copy_mode)) {
- case IPQ_COPY_META:
- case IPQ_COPY_NONE:
- size = NLMSG_SPACE(sizeof(*pmsg));
- break;
-
- case IPQ_COPY_PACKET:
- if (entry->skb->ip_summed == CHECKSUM_PARTIAL &&
- (*errp = skb_checksum_help(entry->skb)))
- return NULL;
-
- data_len = ACCESS_ONCE(copy_range);
- if (data_len == 0 || data_len > entry->skb->len)
- data_len = entry->skb->len;
-
- size = NLMSG_SPACE(sizeof(*pmsg) + data_len);
- break;
-
- default:
- *errp = -EINVAL;
- return NULL;
- }
-
- skb = alloc_skb(size, GFP_ATOMIC);
- if (!skb)
- goto nlmsg_failure;
-
- old_tail = skb->tail;
- nlh = NLMSG_PUT(skb, 0, 0, IPQM_PACKET, size - sizeof(*nlh));
- pmsg = NLMSG_DATA(nlh);
- memset(pmsg, 0, sizeof(*pmsg));
-
- pmsg->packet_id = (unsigned long )entry;
- pmsg->data_len = data_len;
- tv = ktime_to_timeval(entry->skb->tstamp);
- pmsg->timestamp_sec = tv.tv_sec;
- pmsg->timestamp_usec = tv.tv_usec;
- pmsg->mark = entry->skb->mark;
- pmsg->hook = entry->hook;
- pmsg->hw_protocol = entry->skb->protocol;
-
- if (entry->indev)
- strcpy(pmsg->indev_name, entry->indev->name);
- else
- pmsg->indev_name[0] = '\0';
-
- if (entry->outdev)
- strcpy(pmsg->outdev_name, entry->outdev->name);
- else
- pmsg->outdev_name[0] = '\0';
-
- if (entry->indev && entry->skb->dev &&
- entry->skb->mac_header != entry->skb->network_header) {
- pmsg->hw_type = entry->skb->dev->type;
- pmsg->hw_addrlen = dev_parse_header(entry->skb, pmsg->hw_addr);
- }
-
- if (data_len)
- if (skb_copy_bits(entry->skb, 0, pmsg->payload, data_len))
- BUG();
-
- nlh->nlmsg_len = skb->tail - old_tail;
- return skb;
-
-nlmsg_failure:
- kfree_skb(skb);
- *errp = -EINVAL;
- printk(KERN_ERR "ip6_queue: error creating packet message\n");
- return NULL;
-}
-
-static int
-ipq_enqueue_packet(struct nf_queue_entry *entry, unsigned int queuenum)
-{
- int status = -EINVAL;
- struct sk_buff *nskb;
-
- if (copy_mode == IPQ_COPY_NONE)
- return -EAGAIN;
-
- nskb = ipq_build_packet_message(entry, &status);
- if (nskb == NULL)
- return status;
-
- spin_lock_bh(&queue_lock);
-
- if (!peer_pid)
- goto err_out_free_nskb;
-
- if (queue_total >= queue_maxlen) {
- queue_dropped++;
- status = -ENOSPC;
- if (net_ratelimit())
- printk (KERN_WARNING "ip6_queue: fill at %d entries, "
- "dropping packet(s). Dropped: %d\n", queue_total,
- queue_dropped);
- goto err_out_free_nskb;
- }
-
- /* netlink_unicast will either free the nskb or attach it to a socket */
- status = netlink_unicast(ipqnl, nskb, peer_pid, MSG_DONTWAIT);
- if (status < 0) {
- queue_user_dropped++;
- goto err_out_unlock;
- }
-
- __ipq_enqueue_entry(entry);
-
- spin_unlock_bh(&queue_lock);
- return status;
-
-err_out_free_nskb:
- kfree_skb(nskb);
-
-err_out_unlock:
- spin_unlock_bh(&queue_lock);
- return status;
-}
-
-static int
-ipq_mangle_ipv6(ipq_verdict_msg_t *v, struct nf_queue_entry *e)
-{
- int diff;
- struct ipv6hdr *user_iph = (struct ipv6hdr *)v->payload;
- struct sk_buff *nskb;
-
- if (v->data_len < sizeof(*user_iph))
- return 0;
- diff = v->data_len - e->skb->len;
- if (diff < 0) {
- if (pskb_trim(e->skb, v->data_len))
- return -ENOMEM;
- } else if (diff > 0) {
- if (v->data_len > 0xFFFF)
- return -EINVAL;
- if (diff > skb_tailroom(e->skb)) {
- nskb = skb_copy_expand(e->skb, skb_headroom(e->skb),
- diff, GFP_ATOMIC);
- if (!nskb) {
- printk(KERN_WARNING "ip6_queue: OOM "
- "in mangle, dropping packet\n");
- return -ENOMEM;
- }
- kfree_skb(e->skb);
- e->skb = nskb;
- }
- skb_put(e->skb, diff);
- }
- if (!skb_make_writable(e->skb, v->data_len))
- return -ENOMEM;
- skb_copy_to_linear_data(e->skb, v->payload, v->data_len);
- e->skb->ip_summed = CHECKSUM_NONE;
-
- return 0;
-}
-
-static int
-ipq_set_verdict(struct ipq_verdict_msg *vmsg, unsigned int len)
-{
- struct nf_queue_entry *entry;
-
- if (vmsg->value > NF_MAX_VERDICT || vmsg->value == NF_STOLEN)
- return -EINVAL;
-
- entry = ipq_find_dequeue_entry(vmsg->id);
- if (entry == NULL)
- return -ENOENT;
- else {
- int verdict = vmsg->value;
-
- if (vmsg->data_len && vmsg->data_len == len)
- if (ipq_mangle_ipv6(vmsg, entry) < 0)
- verdict = NF_DROP;
-
- nf_reinject(entry, verdict);
- return 0;
- }
-}
-
-static int
-ipq_set_mode(unsigned char mode, unsigned int range)
-{
- int status;
-
- spin_lock_bh(&queue_lock);
- status = __ipq_set_mode(mode, range);
- spin_unlock_bh(&queue_lock);
- return status;
-}
-
-static int
-ipq_receive_peer(struct ipq_peer_msg *pmsg,
- unsigned char type, unsigned int len)
-{
- int status = 0;
-
- if (len < sizeof(*pmsg))
- return -EINVAL;
-
- switch (type) {
- case IPQM_MODE:
- status = ipq_set_mode(pmsg->msg.mode.value,
- pmsg->msg.mode.range);
- break;
-
- case IPQM_VERDICT:
- status = ipq_set_verdict(&pmsg->msg.verdict,
- len - sizeof(*pmsg));
- break;
- default:
- status = -EINVAL;
- }
- return status;
-}
-
-static int
-dev_cmp(struct nf_queue_entry *entry, unsigned long ifindex)
-{
- if (entry->indev)
- if (entry->indev->ifindex == ifindex)
- return 1;
-
- if (entry->outdev)
- if (entry->outdev->ifindex == ifindex)
- return 1;
-#ifdef CONFIG_BRIDGE_NETFILTER
- if (entry->skb->nf_bridge) {
- if (entry->skb->nf_bridge->physindev &&
- entry->skb->nf_bridge->physindev->ifindex == ifindex)
- return 1;
- if (entry->skb->nf_bridge->physoutdev &&
- entry->skb->nf_bridge->physoutdev->ifindex == ifindex)
- return 1;
- }
-#endif
- return 0;
-}
-
-static void
-ipq_dev_drop(int ifindex)
-{
- ipq_flush(dev_cmp, ifindex);
-}
-
-#define RCV_SKB_FAIL(err) do { netlink_ack(skb, nlh, (err)); return; } while (0)
-
-static inline void
-__ipq_rcv_skb(struct sk_buff *skb)
-{
- int status, type, pid, flags;
- unsigned int nlmsglen, skblen;
- struct nlmsghdr *nlh;
- bool enable_timestamp = false;
-
- skblen = skb->len;
- if (skblen < sizeof(*nlh))
- return;
-
- nlh = nlmsg_hdr(skb);
- nlmsglen = nlh->nlmsg_len;
- if (nlmsglen < sizeof(*nlh) || skblen < nlmsglen)
- return;
-
- pid = nlh->nlmsg_pid;
- flags = nlh->nlmsg_flags;
-
- if(pid <= 0 || !(flags & NLM_F_REQUEST) || flags & NLM_F_MULTI)
- RCV_SKB_FAIL(-EINVAL);
-
- if (flags & MSG_TRUNC)
- RCV_SKB_FAIL(-ECOMM);
-
- type = nlh->nlmsg_type;
- if (type < NLMSG_NOOP || type >= IPQM_MAX)
- RCV_SKB_FAIL(-EINVAL);
-
- if (type <= IPQM_BASE)
- return;
-
- if (!capable(CAP_NET_ADMIN))
- RCV_SKB_FAIL(-EPERM);
-
- spin_lock_bh(&queue_lock);
-
- if (peer_pid) {
- if (peer_pid != pid) {
- spin_unlock_bh(&queue_lock);
- RCV_SKB_FAIL(-EBUSY);
- }
- } else {
- enable_timestamp = true;
- peer_pid = pid;
- }
-
- spin_unlock_bh(&queue_lock);
- if (enable_timestamp)
- net_enable_timestamp();
-
- status = ipq_receive_peer(NLMSG_DATA(nlh), type,
- nlmsglen - NLMSG_LENGTH(0));
- if (status < 0)
- RCV_SKB_FAIL(status);
-
- if (flags & NLM_F_ACK)
- netlink_ack(skb, nlh, 0);
-}
-
-static void
-ipq_rcv_skb(struct sk_buff *skb)
-{
- mutex_lock(&ipqnl_mutex);
- __ipq_rcv_skb(skb);
- mutex_unlock(&ipqnl_mutex);
-}
-
-static int
-ipq_rcv_dev_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
- struct net_device *dev = ptr;
-
- if (!net_eq(dev_net(dev), &init_net))
- return NOTIFY_DONE;
-
- /* Drop any packets associated with the downed device */
- if (event == NETDEV_DOWN)
- ipq_dev_drop(dev->ifindex);
- return NOTIFY_DONE;
-}
-
-static struct notifier_block ipq_dev_notifier = {
- .notifier_call = ipq_rcv_dev_event,
-};
-
-static int
-ipq_rcv_nl_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
- struct netlink_notify *n = ptr;
-
- if (event == NETLINK_URELEASE && n->protocol == NETLINK_IP6_FW) {
- spin_lock_bh(&queue_lock);
- if ((net_eq(n->net, &init_net)) && (n->pid == peer_pid))
- __ipq_reset();
- spin_unlock_bh(&queue_lock);
- }
- return NOTIFY_DONE;
-}
-
-static struct notifier_block ipq_nl_notifier = {
- .notifier_call = ipq_rcv_nl_event,
-};
-
-#ifdef CONFIG_SYSCTL
-static struct ctl_table_header *ipq_sysctl_header;
-
-static ctl_table ipq_table[] = {
- {
- .procname = NET_IPQ_QMAX_NAME,
- .data = &queue_maxlen,
- .maxlen = sizeof(queue_maxlen),
- .mode = 0644,
- .proc_handler = proc_dointvec
- },
- { }
-};
-#endif
-
-#ifdef CONFIG_PROC_FS
-static int ip6_queue_show(struct seq_file *m, void *v)
-{
- spin_lock_bh(&queue_lock);
-
- seq_printf(m,
- "Peer PID : %d\n"
- "Copy mode : %hu\n"
- "Copy range : %u\n"
- "Queue length : %u\n"
- "Queue max. length : %u\n"
- "Queue dropped : %u\n"
- "Netfilter dropped : %u\n",
- peer_pid,
- copy_mode,
- copy_range,
- queue_total,
- queue_maxlen,
- queue_dropped,
- queue_user_dropped);
-
- spin_unlock_bh(&queue_lock);
- return 0;
-}
-
-static int ip6_queue_open(struct inode *inode, struct file *file)
-{
- return single_open(file, ip6_queue_show, NULL);
-}
-
-static const struct file_operations ip6_queue_proc_fops = {
- .open = ip6_queue_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = single_release,
- .owner = THIS_MODULE,
-};
-#endif
-
-static const struct nf_queue_handler nfqh = {
- .name = "ip6_queue",
- .outfn = &ipq_enqueue_packet,
-};
-
-static int __init ip6_queue_init(void)
-{
- int status = -ENOMEM;
- struct proc_dir_entry *proc __maybe_unused;
-
- netlink_register_notifier(&ipq_nl_notifier);
- ipqnl = netlink_kernel_create(&init_net, NETLINK_IP6_FW, 0,
- ipq_rcv_skb, NULL, THIS_MODULE);
- if (ipqnl == NULL) {
- printk(KERN_ERR "ip6_queue: failed to create netlink socket\n");
- goto cleanup_netlink_notifier;
- }
-
-#ifdef CONFIG_PROC_FS
- proc = proc_create(IPQ_PROC_FS_NAME, 0, init_net.proc_net,
- &ip6_queue_proc_fops);
- if (!proc) {
- printk(KERN_ERR "ip6_queue: failed to create proc entry\n");
- goto cleanup_ipqnl;
- }
-#endif
- register_netdevice_notifier(&ipq_dev_notifier);
-#ifdef CONFIG_SYSCTL
- ipq_sysctl_header = register_sysctl_paths(net_ipv6_ctl_path, ipq_table);
-#endif
- status = nf_register_queue_handler(NFPROTO_IPV6, &nfqh);
- if (status < 0) {
- printk(KERN_ERR "ip6_queue: failed to register queue handler\n");
- goto cleanup_sysctl;
- }
- return status;
-
-cleanup_sysctl:
-#ifdef CONFIG_SYSCTL
- unregister_sysctl_table(ipq_sysctl_header);
-#endif
- unregister_netdevice_notifier(&ipq_dev_notifier);
- proc_net_remove(&init_net, IPQ_PROC_FS_NAME);
-
-cleanup_ipqnl: __maybe_unused
- netlink_kernel_release(ipqnl);
- mutex_lock(&ipqnl_mutex);
- mutex_unlock(&ipqnl_mutex);
-
-cleanup_netlink_notifier:
- netlink_unregister_notifier(&ipq_nl_notifier);
- return status;
-}
-
-static void __exit ip6_queue_fini(void)
-{
- nf_unregister_queue_handlers(&nfqh);
-
- ipq_flush(NULL, 0);
-
-#ifdef CONFIG_SYSCTL
- unregister_sysctl_table(ipq_sysctl_header);
-#endif
- unregister_netdevice_notifier(&ipq_dev_notifier);
- proc_net_remove(&init_net, IPQ_PROC_FS_NAME);
-
- netlink_kernel_release(ipqnl);
- mutex_lock(&ipqnl_mutex);
- mutex_unlock(&ipqnl_mutex);
-
- netlink_unregister_notifier(&ipq_nl_notifier);
-}
-
-MODULE_DESCRIPTION("IPv6 packet queue handler");
-MODULE_LICENSE("GPL");
-MODULE_ALIAS_NET_PF_PROTO(PF_NETLINK, NETLINK_IP6_FW);
-
-module_init(ip6_queue_init);
-module_exit(ip6_queue_fini);
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index 0920ea3..d309e7f 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -14,7 +14,6 @@
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if.h>
-#include <linux/netfilter_ipv4/ip_queue.h>
#include <linux/inet_diag.h>
#include <linux/xfrm.h>
#include <linux/audit.h>
@@ -70,12 +69,6 @@ static struct nlmsg_perm nlmsg_route_perms[] =
{ RTM_SETDCB, NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
};
-static struct nlmsg_perm nlmsg_firewall_perms[] =
-{
- { IPQM_MODE, NETLINK_FIREWALL_SOCKET__NLMSG_WRITE },
- { IPQM_VERDICT, NETLINK_FIREWALL_SOCKET__NLMSG_WRITE },
-};
-
static struct nlmsg_perm nlmsg_tcpdiag_perms[] =
{
{ TCPDIAG_GETSOCK, NETLINK_TCPDIAG_SOCKET__NLMSG_READ },
@@ -145,12 +138,6 @@ int selinux_nlmsg_lookup(u16 sclass, u16 nlmsg_type, u32 *perm)
sizeof(nlmsg_route_perms));
break;
- case SECCLASS_NETLINK_FIREWALL_SOCKET:
- case SECCLASS_NETLINK_IP6FW_SOCKET:
- err = nlmsg_perm(nlmsg_type, perm, nlmsg_firewall_perms,
- sizeof(nlmsg_firewall_perms));
- break;
-
case SECCLASS_NETLINK_TCPDIAG_SOCKET:
err = nlmsg_perm(nlmsg_type, perm, nlmsg_tcpdiag_perms,
sizeof(nlmsg_tcpdiag_perms));
--
1.7.9.5
* Re: [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5)
2012-05-08 7:49 [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) pablo
` (24 preceding siblings ...)
2012-05-08 7:49 ` [PATCH 25/25] netfilter: remove ip_queue support pablo
@ 2012-05-08 16:49 ` David Miller
2012-05-08 17:10 ` Pablo Neira Ayuso
25 siblings, 1 reply; 31+ messages in thread
From: David Miller @ 2012-05-08 16:49 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
From: pablo@netfilter.org
Date: Tue, 8 May 2012 09:49:29 +0200
> Second version including requested updates.
There were lots of conflicts, due to my merge of net into net-next.
Those were easy enough, but the result doesn't build.
net/netfilter/nf_conntrack_helper.c: In function ‘nf_conntrack_helper_init_sysctl’:
net/netfilter/nf_conntrack_helper.c:65:2: error: implicit declaration of function ‘register_net_sysctl_table’ [-Werror=implicit-function-declaration]
net/netfilter/nf_conntrack_helper.c:66:4: error: ‘nf_net_netfilter_sysctl_path’ undeclared (first use in this function)
net/netfilter/nf_conntrack_helper.c:66:4: note: each undeclared identifier is reported only once for each function it appears in
* Re: [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5)
2012-05-08 16:49 ` [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5) David Miller
@ 2012-05-08 17:10 ` Pablo Neira Ayuso
2012-05-08 17:12 ` David Miller
0 siblings, 1 reply; 31+ messages in thread
From: Pablo Neira Ayuso @ 2012-05-08 17:10 UTC (permalink / raw)
To: David Miller; +Cc: netfilter-devel, netdev
On Tue, May 08, 2012 at 12:49:26PM -0400, David Miller wrote:
> From: pablo@netfilter.org
> Date: Tue, 8 May 2012 09:49:29 +0200
>
> > Second version including requested updates.
>
> There were lots of conflicts, due to my merge of net into net-next.
>
> Those were easy enough, but the result doesn't build.
>
> net/netfilter/nf_conntrack_helper.c: In function ‘nf_conntrack_helper_init_sysctl’:
> net/netfilter/nf_conntrack_helper.c:65:2: error: implicit declaration of function ‘register_net_sysctl_table’ [-Werror=implicit-function-declaration]
> net/netfilter/nf_conntrack_helper.c:66:4: error: ‘nf_net_netfilter_sysctl_path’ undeclared (first use in this function)
> net/netfilter/nf_conntrack_helper.c:66:4: note: each undeclared identifier is reported only once for each function it appears in
Strange, this compiles here. Probably you have to add the following
include to net/netfilter/nf_conntrack_helper.c:
#include <net/net_namespace.h>
I can rebase all my patches on top of a fresh tree, fix all conflicts
myself, and send you another patchset that applies to your current
tree.
Please, let me know how to proceed.
* Re: [PATCH 00/25] [v2] netfilter updates for net-next (upcoming 3.5)
2012-05-08 17:10 ` Pablo Neira Ayuso
@ 2012-05-08 17:12 ` David Miller
0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2012-05-08 17:12 UTC (permalink / raw)
To: pablo; +Cc: netfilter-devel, netdev
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Tue, 8 May 2012 19:10:49 +0200
> I can rebase all my patches on top of fresh tree, fix all conflicts
> myself and send you another patchset that will apply to your current
> tree.
>
> Please, let me know how to proceed.
Yes, please build a tree that compiles properly :-)
The error is pretty trivial to reproduce, just "allmodconfig" like I
do.
* [PATCH 14/25] ipvs: fix ip_vs_try_bind_dest to rebind app and transmitter
2012-05-08 18:37 [PATCH 00/25] [v3] " pablo
@ 2012-05-08 18:38 ` pablo
0 siblings, 0 replies; 31+ messages in thread
From: pablo @ 2012-05-08 18:38 UTC (permalink / raw)
To: netdev; +Cc: davem, openbsc
From: Julian Anastasov <ja@ssi.bg>
Initially, when the synced connection is created, we
use the forwarding method provided by the master, but once we
bind to a destination it can change. As a result, we must
update the application and the transmitter.
As ip_vs_try_bind_dest is always called for connections
that require dest binding, there is no need to validate the
cp and dest pointers.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
net/netfilter/ipvs/ip_vs_conn.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 1c1bb30..fd74f88 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -613,14 +613,33 @@ struct ip_vs_dest *ip_vs_try_bind_dest(struct ip_vs_conn *cp)
 {
 	struct ip_vs_dest *dest;
 
-	if ((cp) && (!cp->dest)) {
-		dest = ip_vs_find_dest(ip_vs_conn_net(cp), cp->af, &cp->daddr,
-				       cp->dport, &cp->vaddr, cp->vport,
-				       cp->protocol, cp->fwmark, cp->flags);
+	dest = ip_vs_find_dest(ip_vs_conn_net(cp), cp->af, &cp->daddr,
+			       cp->dport, &cp->vaddr, cp->vport,
+			       cp->protocol, cp->fwmark, cp->flags);
+	if (dest) {
+		struct ip_vs_proto_data *pd;
+
+		/* Applications work depending on the forwarding method
+		 * but better to reassign them always when binding dest */
+		if (cp->app)
+			ip_vs_unbind_app(cp);
+
 		ip_vs_bind_dest(cp, dest);
-		return dest;
-	} else
-		return NULL;
+
+		/* Update its packet transmitter */
+		cp->packet_xmit = NULL;
+#ifdef CONFIG_IP_VS_IPV6
+		if (cp->af == AF_INET6)
+			ip_vs_bind_xmit_v6(cp);
+		else
+#endif
+			ip_vs_bind_xmit(cp);
+
+		pd = ip_vs_proto_data_get(ip_vs_conn_net(cp), cp->protocol);
+		if (pd && atomic_read(&pd->appcnt))
+			ip_vs_bind_app(cp, pd->pp);
+	}
+	return dest;
 }
 
 
--
1.7.9.5
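
Editorial note: the rebind order described in the commit message can be
illustrated with a small userspace model. This is only a sketch, not kernel
code. All of the types and helpers below (struct conn, struct dest,
bind_xmit, try_bind_dest, the two xmit functions) are hypothetical stand-ins
invented for the example, and it assumes that binding the destination is
what replaces the forwarding method received from the master.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's IPVS types; every name here is
 * made up for illustration and does not match the real structures. */
enum fwd_method { FWD_MASQ, FWD_DROUTE };

struct dest {
	enum fwd_method fwd;	/* forwarding method configured locally */
};

struct conn;
typedef const char *(*xmit_fn)(const struct conn *cp);

struct conn {
	enum fwd_method fwd;	/* initially the method sent by the master */
	struct dest *dest;	/* bound destination, or NULL */
	const char *app;	/* bound application helper, or NULL */
	xmit_fn packet_xmit;	/* transmitter chosen from fwd */
};

static const char *xmit_nat(const struct conn *cp) { (void)cp; return "nat"; }
static const char *xmit_dr(const struct conn *cp) { (void)cp; return "dr"; }

/* Pick the transmitter that matches the current forwarding method. */
static void bind_xmit(struct conn *cp)
{
	cp->packet_xmit = (cp->fwd == FWD_MASQ) ? xmit_nat : xmit_dr;
}

/* found:     result of the destination lookup, NULL if nothing matched.
 * proto_app: application registered for the protocol, NULL if none.
 * Mirrors the order in the patch: drop the old app, bind the dest,
 * reset and rebind the transmitter, then bind the app again. */
static struct dest *try_bind_dest(struct conn *cp, struct dest *found,
				  const char *proto_app)
{
	if (found) {
		cp->app = NULL;		/* unbind any stale application */
		cp->dest = found;
		cp->fwd = found->fwd;	/* assumption: binding may change the method */
		cp->packet_xmit = NULL;
		bind_xmit(cp);
		if (proto_app)
			cp->app = proto_app;
	}
	return found;
}
```

In this model, a connection created with FWD_MASQ and later bound to a
destination configured for FWD_DROUTE ends up with xmit_dr as its
transmitter. When the lookup finds nothing, the connection is left untouched
and NULL is returned, as in the patched function.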