From: Patrick McHardy <kaber@trash.net>
To: davem@davemloft.net
Cc: netdev@vger.kernel.org, Patrick McHardy <kaber@trash.net>,
netfilter-devel@vger.kernel.org
Subject: IPVS 05/62: Allow boot time change of hash size
Date: Tue, 16 Feb 2010 15:55:26 +0100 (MET) [thread overview]
Message-ID: <20100216145523.2796.46110.sendpatchset@x2.localnet> (raw)
In-Reply-To: <20100216145517.2796.40634.sendpatchset@x2.localnet>
commit 6f7edb4881bf51300060e89915926e070ace8c4d
Author: Catalin(ux) M. BOIE <catab@embedromix.ro>
Date: Tue Jan 5 05:50:24 2010 +0100
IPVS: Allow boot time change of hash size
I was very frustrated about the fact that I have to recompile the kernel
to change the hash size. So, I created this patch.
If IPVS is built-in you can append ip_vs.conn_tab_bits=?? to kernel
command line, or, if you built IPVS as modules, you can add
options ip_vs conn_tab_bits=??.
To keep everything backward compatible, you still can select the size at
compile time, and that will be used as default.
It has been about a year since this patch was originally posted
and subsequently dropped on the basis of insufficient test data.
Mark Bergsma has provided the following test results which seem
to strongly support the need for larger hash table sizes:
We do however run into the same problem with the default setting (212 =
4096 entries), as most of our LVS balancers handle around a million
connections/SLAB entries at any point in time (around 100-150 kpps
load). With only 4096 hash table entries this implies that each entry
consists of a linked list of 256 connections *on average*.
To provide some statistics, I did an oprofile run on an 2.6.31 kernel,
with both the default 4096 table size, and the same kernel recompiled
with IP_VS_CONN_TAB_BITS set to 18 (218 = 262144 entries). I built a
quick test setup with a part of Wikimedia/Wikipedia's live traffic
mirrored by the switch to the test host.
With the default setting, at ~ 120 kpps packet load we saw a typical %si
CPU usage of around 30-35%, and oprofile reported a hot spot in
ip_vs_conn_in_get:
samples % image name app name
symbol name
1719761 42.3741 ip_vs.ko ip_vs.ko ip_vs_conn_in_get
302577 7.4554 bnx2 bnx2 /bnx2
181984 4.4840 vmlinux vmlinux __ticket_spin_lock
128636 3.1695 vmlinux vmlinux ip_route_input
74345 1.8318 ip_vs.ko ip_vs.ko ip_vs_conn_out_get
68482 1.6874 vmlinux vmlinux mwait_idle
After loading the recompiled kernel with 218 entries, %si CPU usage
dropped in half to around 12-18%, and oprofile looks much healthier,
with only 7% spent in ip_vs_conn_in_get:
samples % image name app name
symbol name
265641 14.4616 bnx2 bnx2 /bnx2
143251 7.7986 vmlinux vmlinux __ticket_spin_lock
140661 7.6576 ip_vs.ko ip_vs.ko ip_vs_conn_in_get
94364 5.1372 vmlinux vmlinux mwait_idle
86267 4.6964 vmlinux vmlinux ip_route_input
[ horms@verge.net.au: trivial up-port and minor style fixes ]
Signed-off-by: Catalin(ux) M. BOIE <catab@embedromix.ro>
Cc: Mark Bergsma <mark@wikimedia.org>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 8dc3296..a816c37 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -26,6 +26,11 @@
#include <linux/ipv6.h> /* for struct ipv6hdr */
#include <net/ipv6.h> /* for ipv6_addr_copy */
+
+/* Connections' size value needed by ip_vs_ctl.c */
+extern int ip_vs_conn_tab_size;
+
+
struct ip_vs_iphdr {
int len;
__u8 protocol;
@@ -592,17 +597,6 @@ extern void ip_vs_init_hash_table(struct list_head *table, int rows);
* (from ip_vs_conn.c)
*/
-/*
- * IPVS connection entry hash table
- */
-#ifndef CONFIG_IP_VS_TAB_BITS
-#define CONFIG_IP_VS_TAB_BITS 12
-#endif
-
-#define IP_VS_CONN_TAB_BITS CONFIG_IP_VS_TAB_BITS
-#define IP_VS_CONN_TAB_SIZE (1 << IP_VS_CONN_TAB_BITS)
-#define IP_VS_CONN_TAB_MASK (IP_VS_CONN_TAB_SIZE - 1)
-
enum {
IP_VS_DIR_INPUT = 0,
IP_VS_DIR_OUTPUT,
diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 79a6980..c71e543 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -68,6 +68,10 @@ config IP_VS_TAB_BITS
each hash entry uses 8 bytes, so you can estimate how much memory is
needed for your box.
+ You can overwrite this number setting conn_tab_bits module parameter
+ or by appending ip_vs.conn_tab_bits=? to the kernel command line
+ if IP VS was compiled built-in.
+
comment "IPVS transport protocol load balancing support"
config IP_VS_PROTO_TCP
diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 27c30cf..60bb41a 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -40,6 +40,21 @@
#include <net/ip_vs.h>
+#ifndef CONFIG_IP_VS_TAB_BITS
+#define CONFIG_IP_VS_TAB_BITS 12
+#endif
+
+/*
+ * Connection hash size. Default is what was selected at compile time.
+*/
+int ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
+module_param_named(conn_tab_bits, ip_vs_conn_tab_bits, int, 0444);
+MODULE_PARM_DESC(conn_tab_bits, "Set connections' hash size");
+
+/* size and mask values */
+int ip_vs_conn_tab_size;
+int ip_vs_conn_tab_mask;
+
/*
* Connection hash table: for input and output packets lookups of IPVS
*/
@@ -125,11 +140,11 @@ static unsigned int ip_vs_conn_hashkey(int af, unsigned proto,
if (af == AF_INET6)
return jhash_3words(jhash(addr, 16, ip_vs_conn_rnd),
(__force u32)port, proto, ip_vs_conn_rnd)
- & IP_VS_CONN_TAB_MASK;
+ & ip_vs_conn_tab_mask;
#endif
return jhash_3words((__force u32)addr->ip, (__force u32)port, proto,
ip_vs_conn_rnd)
- & IP_VS_CONN_TAB_MASK;
+ & ip_vs_conn_tab_mask;
}
@@ -760,7 +775,7 @@ static void *ip_vs_conn_array(struct seq_file *seq, loff_t pos)
int idx;
struct ip_vs_conn *cp;
- for(idx = 0; idx < IP_VS_CONN_TAB_SIZE; idx++) {
+ for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
ct_read_lock_bh(idx);
list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
if (pos-- == 0) {
@@ -797,7 +812,7 @@ static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
idx = l - ip_vs_conn_tab;
ct_read_unlock_bh(idx);
- while (++idx < IP_VS_CONN_TAB_SIZE) {
+ while (++idx < ip_vs_conn_tab_size) {
ct_read_lock_bh(idx);
list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
seq->private = &ip_vs_conn_tab[idx];
@@ -976,8 +991,8 @@ void ip_vs_random_dropentry(void)
/*
* Randomly scan 1/32 of the whole table every second
*/
- for (idx = 0; idx < (IP_VS_CONN_TAB_SIZE>>5); idx++) {
- unsigned hash = net_random() & IP_VS_CONN_TAB_MASK;
+ for (idx = 0; idx < (ip_vs_conn_tab_size>>5); idx++) {
+ unsigned hash = net_random() & ip_vs_conn_tab_mask;
/*
* Lock is actually needed in this loop.
@@ -1029,7 +1044,7 @@ static void ip_vs_conn_flush(void)
struct ip_vs_conn *cp;
flush_again:
- for (idx=0; idx<IP_VS_CONN_TAB_SIZE; idx++) {
+ for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
/*
* Lock is actually needed in this loop.
*/
@@ -1060,10 +1075,15 @@ int __init ip_vs_conn_init(void)
{
int idx;
+ /* Compute size and mask */
+ ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;
+ ip_vs_conn_tab_mask = ip_vs_conn_tab_size - 1;
+
/*
* Allocate the connection hash table and initialize its list heads
*/
- ip_vs_conn_tab = vmalloc(IP_VS_CONN_TAB_SIZE*sizeof(struct list_head));
+ ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size *
+ sizeof(struct list_head));
if (!ip_vs_conn_tab)
return -ENOMEM;
@@ -1078,12 +1098,12 @@ int __init ip_vs_conn_init(void)
pr_info("Connection hash table configured "
"(size=%d, memory=%ldKbytes)\n",
- IP_VS_CONN_TAB_SIZE,
- (long)(IP_VS_CONN_TAB_SIZE*sizeof(struct list_head))/1024);
+ ip_vs_conn_tab_size,
+ (long)(ip_vs_conn_tab_size*sizeof(struct list_head))/1024);
IP_VS_DBG(0, "Each connection entry needs %Zd bytes at least\n",
sizeof(struct ip_vs_conn));
- for (idx = 0; idx < IP_VS_CONN_TAB_SIZE; idx++) {
+ for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
INIT_LIST_HEAD(&ip_vs_conn_tab[idx]);
}
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 6bde12d..93420ea 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1843,7 +1843,7 @@ static int ip_vs_info_seq_show(struct seq_file *seq, void *v)
if (v == SEQ_START_TOKEN) {
seq_printf(seq,
"IP Virtual Server version %d.%d.%d (size=%d)\n",
- NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
+ NVERSION(IP_VS_VERSION_CODE), ip_vs_conn_tab_size);
seq_puts(seq,
"Prot LocalAddress:Port Scheduler Flags\n");
seq_puts(seq,
@@ -2374,7 +2374,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
char buf[64];
sprintf(buf, "IP Virtual Server version %d.%d.%d (size=%d)",
- NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
+ NVERSION(IP_VS_VERSION_CODE), ip_vs_conn_tab_size);
if (copy_to_user(user, buf, strlen(buf)+1) != 0) {
ret = -EFAULT;
goto out;
@@ -2387,7 +2387,7 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
{
struct ip_vs_getinfo info;
info.version = IP_VS_VERSION_CODE;
- info.size = IP_VS_CONN_TAB_SIZE;
+ info.size = ip_vs_conn_tab_size;
info.num_services = ip_vs_num_services;
if (copy_to_user(user, &info, sizeof(info)) != 0)
ret = -EFAULT;
@@ -3231,7 +3231,7 @@ static int ip_vs_genl_get_cmd(struct sk_buff *skb, struct genl_info *info)
case IPVS_CMD_GET_INFO:
NLA_PUT_U32(msg, IPVS_INFO_ATTR_VERSION, IP_VS_VERSION_CODE);
NLA_PUT_U32(msg, IPVS_INFO_ATTR_CONN_TAB_SIZE,
- IP_VS_CONN_TAB_SIZE);
+ ip_vs_conn_tab_size);
break;
}
next prev parent reply other threads:[~2010-02-16 14:55 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-02-16 14:55 netfilter 00/62: netfilter update Patrick McHardy
2010-02-16 14:55 ` netfilter 01/62: SNMP NAT: correct the size argument to kzalloc Patrick McHardy
2010-02-16 14:55 ` netfilter 02/62: xt_recent: save 8 bytes per htable Patrick McHardy
2010-02-16 14:55 ` netfilter 03/62: xtables: do not grab random bytes at __init Patrick McHardy
2010-02-16 14:55 ` netfilter 04/62: xtables: obtain random bytes earlier, in checkentry Patrick McHardy
2010-02-16 14:55 ` Patrick McHardy [this message]
2010-02-16 14:55 ` netfilter 06/62: nf_nat_ftp: remove (*mangle[]) array and functions, use %pI4 Patrick McHardy
2010-02-16 14:55 ` ipvs 07/62: use standardized format in sprintf Patrick McHardy
2010-02-16 14:55 ` netfilter 08/62: xt_osf: change %pi4 to %pI4 Patrick McHardy
2010-02-16 14:55 ` netfilter 09/62: nfnetlink: netns support Patrick McHardy
2010-02-16 14:55 ` netfilter 10/62: ctnetlink: " Patrick McHardy
2010-02-16 14:55 ` netfilter 11/62: xt_connlimit: " Patrick McHardy
2010-02-16 14:55 ` netfilter 12/62: netns: Patrick McHardy
2010-02-16 14:55 ` netfilter 13/62: xt_hashlimit: simplify seqfile code Patrick McHardy
2010-02-16 14:55 ` netfilter 14/62: xtables: add struct xt_mtchk_param::net Patrick McHardy
2010-02-16 14:55 ` netfilter 15/62: xtables: add struct xt_mtdtor_param::net Patrick McHardy
2010-02-16 14:55 ` netfilter 16/62: xt_recent: netns support Patrick McHardy
2010-02-16 14:55 ` netfilter 17/62: xt_hashlimit: " Patrick McHardy
2010-02-16 14:55 ` netfilter 18/62: nfnetlink_queue: simplify warning message Patrick McHardy
2010-02-16 14:55 ` netfilter 19/62: nf_conntrack_ipv6: delete the redundant macro definitions Patrick McHardy
2010-02-16 14:55 ` IPv6 20/62: reassembly: replace magic number with " Patrick McHardy
2010-02-16 15:43 ` Joe Perches
2010-02-16 15:47 ` Patrick McHardy
2010-02-17 4:40 ` [PATCH] ipv6.h: reassembly: replace calculated magic number with multiplication Joe Perches
2010-02-17 7:38 ` David Miller
2010-02-16 14:55 ` netfiltr 21/62: ipt_CLUSTERIP: simplify seq_file codeA Patrick McHardy
2010-02-16 14:55 ` netfilter 22/62: xtables: CONFIG_COMPAT redux Patrick McHardy
2010-02-16 14:55 ` netfilter 23/62: xt_TCPMSS: SYN packets are allowed to contain data Patrick McHardy
2010-02-16 14:55 ` netfilter 24/62: xt_hashlimit: fix race condition and simplify locking Patrick McHardy
2010-02-17 16:43 ` [PATCH net-next-2.6] xt_hashlimit: fix locking Eric Dumazet
2010-02-17 20:08 ` Patrick McHardy
2010-02-17 21:39 ` David Miller
2010-02-16 14:55 ` netfilter 25/62: ctnetlink: only assign helpers for matching protocols Patrick McHardy
2010-02-16 14:55 ` netfilter 26/62: add struct net * to target parameters Patrick McHardy
2010-02-16 14:55 ` netfilter 27/62: nf_conntrack: split up IPCT_STATUS event Patrick McHardy
2010-02-16 14:55 ` netfilter 28/62: ctnetlink: support selective event delivery Patrick McHardy
2010-02-16 14:55 ` netfilter 29/62: nf_conntrack: support conntrack templates Patrick McHardy
2010-02-16 14:56 ` netfilter 30/62: xtables: add CT target Patrick McHardy
2010-02-16 14:56 ` netfilter 31/62: fix build failure with CONNTRACK=y NAT=n Patrick McHardy
2010-02-16 14:56 ` netfilter 32/62: xtables: consistent struct compat_xt_counters definition Patrick McHardy
2010-02-16 14:56 ` netfilter 33/62: xtables: symmetric COMPAT_XT_ALIGN definition Patrick McHardy
2010-02-16 14:56 ` netfilter 34/62: ctnetlink: add missing netlink attribute policies Patrick McHardy
2010-02-16 14:56 ` netfilter 35/62: xtables: compact table hook functions (1/2) Patrick McHardy
2010-02-16 14:56 ` netfilter 36/62: xtables: compact table hook functions (2/2) Patrick McHardy
2010-02-16 14:56 ` netfilter 37/62: xtables: use xt_table for hook instantiation Patrick McHardy
2010-02-16 14:56 ` netfilter 38/62: xtables: generate initial table on-demand Patrick McHardy
2010-02-16 14:56 ` netfilter 39/62: ctnetlink: dump expectation helper name Patrick McHardy
2010-02-16 14:56 ` netfilter 40/62: nf_conntrack: show helper and class in /proc/net/nf_conntrack_expect Patrick McHardy
2010-02-16 14:56 ` netfilter 41/62: nf_conntrack_sip: fix ct_sip_parse_request() REGISTER request parsing Patrick McHardy
2010-02-16 14:56 ` netfilter 42/62: nf_conntrack_sip: pass data offset to NAT functions Patrick McHardy
2010-02-16 14:56 ` netfilter 43/62: nf_conntrack_sip: add TCP support Patrick McHardy
2010-02-16 14:56 ` netfilter 44/62: nf_nat: support mangling a single TCP packet multiple times Patrick McHardy
2010-02-16 14:56 ` netfilter 45/62: nf_nat_sip: add TCP support Patrick McHardy
2010-02-16 14:56 ` netfilter 46/62: nf_conntrack_sip: add T.38 FAX support Patrick McHardy
2010-02-16 14:56 ` netfilter 47/62: xtables: fix mangle tables Patrick McHardy
2010-02-16 14:56 ` netfilter 48/62: nf_conntrack: elegantly simplify nf_ct_exp_net() Patrick McHardy
2010-02-16 14:56 ` netfilter 49/62: don't use INIT_RCU_HEAD() Patrick McHardy
2010-02-16 14:56 ` netfilter 50/62: xt_recent: inform user when hitcount is too large Patrick McHardy
2010-02-16 14:56 ` netfilter 51/62: iptables: remove unused function arguments Patrick McHardy
2010-02-16 14:56 ` netfilter 52/62: reduce NF_HOOK by one argument Patrick McHardy
2010-02-16 14:56 ` netfilter 53/62: get rid of the grossness in netfilter.h Patrick McHardy
2010-02-16 14:56 ` netfilter 54/62: xtables: print details on size mismatch Patrick McHardy
2010-02-16 14:56 ` netfilter 55/62: xtables: constify args in compat copying functions Patrick McHardy
2010-02-16 14:56 ` netfilter 56/62: xtables: add const qualifiers Patrick McHardy
2010-02-16 14:56 ` netfilter 57/62: nf_conntrack: pass template to l4proto ->error() handler Patrick McHardy
2010-02-16 14:56 ` netfilter 58/62: nf_conntrack: add support for "conntrack zones" Patrick McHardy
2010-02-16 14:56 ` netfilter 59/62: ctnetlink: add zone support Patrick McHardy
2010-02-16 14:56 ` netfilter 60/62: ebtables: abort if next_offset is too small Patrick McHardy
2010-02-16 14:56 ` netfilter 61/62: ebtables: avoid explicit XT_ALIGN() in match/targets Patrick McHardy
2010-02-16 14:56 ` netfilter 62/62: CONFIG_COMPAT: allow delta to exceed 32767 Patrick McHardy
2010-02-16 19:21 ` netfilter 00/62: netfilter update David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100216145523.2796.46110.sendpatchset@x2.localnet \
--to=kaber@trash.net \
--cc=davem@davemloft.net \
--cc=netdev@vger.kernel.org \
--cc=netfilter-devel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).