* [RFC PATCHv4 0/5] ipvs: Use kthreads for stats
@ 2022-09-20 13:53 Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 1/5] ipvs: add rcu protection to stats Julian Anastasov
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: Julian Anastasov @ 2022-09-20 13:53 UTC (permalink / raw)
To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
Hello,
Posting v4 (no new ideas for now). Patch 5 is just
for debugging; do not apply it if not needed.
This patchset implements stats estimation in
kthread context. Simple tests do not show any problem.
If testing, check that the calculated chain_max_len does
not deviate much.
Overview of the basic concepts. More in the
commit messages...
RCU Locking:
- As stats are now RCU-locked, tot_stats, svc and dest,
which hold estimator structures, are now always freed
via RCU callback. This ensures an RCU grace period has
passed after the ip_vs_stop_estimator() call. A minimal
sketch of the pattern is shown right below.
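Using the names from patch 1, the pattern looks like this
(only the free path of ipvs->tot_stats is shown):

struct ip_vs_stats_rcu {
	struct ip_vs_stats	s;
	struct rcu_head		rcu_head;
};

static void ip_vs_stats_rcu_free(struct rcu_head *head)
{
	struct ip_vs_stats_rcu *rs;

	rs = container_of(head, struct ip_vs_stats_rcu, rcu_head);
	free_percpu(rs->s.cpustats);
	kfree(rs);
}

/* instead of freeing synchronously on netns cleanup: */
call_rcu(&ipvs->tot_stats->rcu_head, ip_vs_stats_rcu_free);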
Kthread data:
- every kthread works over its own data structure and all
such structures are attached to an array. For now we
apply an rlimit (RLIMIT_NPROC) as the max number of
kthreads to create.
- even when a kthread structure exists, its task
may not be running, e.g. before the first service is
added, while the sysctl var is set to an empty cpulist or
when run_estimation is 0.
- the allocated kthread context may grow from 1 to 50
allocated tick structures, which saves memory for
setups with a small number of estimators
- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty
- every kthread data structure allows a limited number
of estimators. Kthread 0 is also used to initially
calculate the number of estimators to allow in every
chain, considering a sub-100 microsecond cond_resched
rate. This number can be from 1 to hundreds; see the
tick/chain arithmetic sketch after this list.
- kthread 0 has the additional job of optimizing the
adding of estimators: they are first added to a
temp list (est_temp_list) and later kthread 0
distributes them to the other kthreads. The optimization
relies on the fact that a newly added estimator
should be estimated only after 2 seconds, so we have
time to offload the chain insertion from the controlling
process to kthread 0.
- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked into
the chains just before the currently estimated one, based
on add_row. This ensures their estimation will start after
2 seconds. If estimators are added in bursts, a common
case when all services and dests are configured initially,
we may spread the estimators over more chains. This
reduces the chain imbalance.
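For reference, the tick/chain arithmetic below only restates
the defines from patch 2: a 2-second period is split into 50
ticks of 40ms, and one tick processes at most 48 chains at a
~100us cond_resched rate, i.e. ~4.8ms or ~12% of the tick:

/* 2s period split into 50 ticks => one tick every 2*HZ/50 jiffies (40ms) */
#define IPVS_EST_NTICKS		50
#define IPVS_EST_TICK		((2 * HZ) / IPVS_EST_NTICKS)

/* 2s * 1000ms * 10 (100us units per ms) / 8 (12.5%) / 50 ticks = 50,
 * rounded down to a multiple of 8 => 48 chains per tick,
 * 48 chains * ~100us = ~4.8ms = ~12% of the 40ms tick
 */
#define IPVS_EST_CHAIN_FACTOR	ALIGN_DOWN(2 * 1000 * 10 / 8 / IPVS_EST_NTICKS, 8)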
There are things that I don't like but for now
I don't have a better idea for them:
- the calculation of chain_max_len can go wrong, depending
on the current load, CPU speed, memory speed, running in a
VM and whether the tested estimators are in the CPU cache,
even if we are doing it in SCHED_FIFO mode. I expect such
noise to be insignificant but who knows. The timing
fragment after this list shows the idea.
- ip_vs_stop_estimator is not a simple unlinking of a
list node; we spend cycles to account for the removed
estimator
- __ip_vs_mutex is a global mutex for all netns but it
protects hash tables that are still global ones.
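This is the core timing fragment from ip_vs_est_calc_limits()
in patch 2, stripped of the allocation and retry logic; chain
holds ntest freshly allocated estimators:

	ktime_t t1, t2;
	s64 diff, val;
	s32 min_est = 0;
	int max = 1;

	rcu_read_lock();
	t1 = ktime_get();
	ip_vs_chain_estimation(&chain);
	t2 = ktime_get();
	rcu_read_unlock();

	diff = ktime_to_ns(ktime_sub(t2, t1));
	val = diff;
	do_div(val, ntest);			/* ns per estimator */
	if (!min_est || val < min_est) {
		min_est = val;
		val = 95 * NSEC_PER_USEC;	/* goal: ~95us per chain */
		if (val >= min_est) {
			do_div(val, min_est);
			max = (int)val;		/* chain_max_len candidate */
		} else {
			max = 1;
		}
	}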
Changes in v4:
Patch 2:
* kthread 0 can start with a calculation phase in SCHED_FIFO
mode to determine a chain_max_len suitable for a 100us
cond_resched rate and 12% CPU usage within a 40ms tick.
The current value of IPVS_EST_TICK_CHAINS=48 gives a tick
time of 4.8ms (i.e. in units of 100us), which is 12% of the
max tick time of 40ms. The question is how reliable such a
calculation test will be.
* est_calc_phase indicates a mode where we dequeue estimators
from the kthreads, apply the new chain_max_len and enqueue
all estimators to the kthreads again, done by kthread 0
* est->ktid can now be -1 to indicate the est is in
est_temp_list, ready to be distributed to a kthread by kt 0,
done in ip_vs_est_drain_temp_list(). Kthread 0 data is now
released only after the data for the other kthreads
* ip_vs_start_estimator was not setting ret = 0
* READ_ONCE not needed for volatile jiffies
Patch 3:
* restrict the cpulist based on the cpus_allowed of the
process that assigns the cpulist, not on cpu_possible_mask
* a change of the cpulist will trigger the calc phase
Patch 5:
* print message every minute, not 2 seconds
Changes in v3:
Patch 2:
* calculate chain_max_len (was IPVS_EST_CHAIN_DEPTH) but
it needs further tuning based on real estimation tests
* est_max_threads is set from rlimit(RLIMIT_NPROC). I don't
see an analog to get_ucounts_value() to get the max value.
* the atomic bitop for td->present is not needed,
remove it
* start filling based on est_row after 2 ticks are
fully allocated. As 2/50 is 4%, this can be increased
further.
Changes in v2:
Patch 2:
* kd->mutex is gone, cond_resched rate determined by
IPVS_EST_CHAIN_DEPTH
* IPVS_EST_MAX_COUNT is a hard limit now
* kthread data is now 1-50 allocated tick structures,
each containing heads for limited chains. Bitmaps
should allow faster access. We avoid large
allocations for structs.
* as the td->present bitmap is shared, use atomic bitops
* ip_vs_start_estimator now returns error code
* _bh locking removed from stats->lock
* bump arg is gone from ip_vs_est_reload_start
* prepare for upcoming changes that remove _irq
from u64_stats_fetch_begin_irq/u64_stats_fetch_retry_irq
* est_add_ktid is now always valid
Patch 3:
* use .. in est_nice docs
Julian Anastasov (5):
ipvs: add rcu protection to stats
ipvs: use kthreads for stats estimation
ipvs: add est_cpulist and est_nice sysctl vars
ipvs: run_estimation should control the kthread tasks
ipvs: debug the tick time
Documentation/networking/ipvs-sysctl.rst | 24 +-
include/net/ip_vs.h | 133 +++-
net/netfilter/ipvs/ip_vs_core.c | 10 +-
net/netfilter/ipvs/ip_vs_ctl.c | 371 ++++++++--
net/netfilter/ipvs/ip_vs_est.c | 895 +++++++++++++++++++++--
5 files changed, 1301 insertions(+), 132 deletions(-)
--
2.37.3
* [RFC PATCHv4 1/5] ipvs: add rcu protection to stats
2022-09-20 13:53 [RFC PATCHv4 0/5] ipvs: Use kthreads for stats Julian Anastasov
@ 2022-09-20 13:53 ` Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation Julian Anastasov
` (3 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Julian Anastasov @ 2022-09-20 13:53 UTC (permalink / raw)
To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
In preparation for using RCU locking for the list
of estimators, make sure the struct ip_vs_stats
instances are released after an RCU grace period by
using RCU callbacks. This affects ipvs->tot_stats,
where we cannot use RCU callbacks for ipvs itself,
so we use an allocated struct ip_vs_stats_rcu. For
services and dests we force RCU callbacks for all
cases.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
include/net/ip_vs.h | 8 ++++-
net/netfilter/ipvs/ip_vs_core.c | 10 ++++--
net/netfilter/ipvs/ip_vs_ctl.c | 64 ++++++++++++++++++++++-----------
3 files changed, 57 insertions(+), 25 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index ff1804a0c469..bd8ae137e43b 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -405,6 +405,11 @@ struct ip_vs_stats {
struct ip_vs_kstats kstats0; /* reset values */
};
+struct ip_vs_stats_rcu {
+ struct ip_vs_stats s;
+ struct rcu_head rcu_head;
+};
+
struct dst_entry;
struct iphdr;
struct ip_vs_conn;
@@ -688,6 +693,7 @@ struct ip_vs_dest {
union nf_inet_addr vaddr; /* virtual IP address */
__u32 vfwmark; /* firewall mark of service */
+ struct rcu_head rcu_head;
struct list_head t_list; /* in dest_trash */
unsigned int in_rs_table:1; /* we are in rs_table */
};
@@ -869,7 +875,7 @@ struct netns_ipvs {
atomic_t conn_count; /* connection counter */
/* ip_vs_ctl */
- struct ip_vs_stats tot_stats; /* Statistics & est. */
+ struct ip_vs_stats_rcu *tot_stats; /* Statistics & est. */
int num_services; /* no of virtual services */
int num_services6; /* IPv6 virtual services */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 51ad557a525b..fcdaef1fcccf 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -143,7 +143,7 @@ ip_vs_in_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
s->cnt.inbytes += skb->len;
u64_stats_update_end(&s->syncp);
- s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+ s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.inpkts++;
s->cnt.inbytes += skb->len;
@@ -179,7 +179,7 @@ ip_vs_out_stats(struct ip_vs_conn *cp, struct sk_buff *skb)
s->cnt.outbytes += skb->len;
u64_stats_update_end(&s->syncp);
- s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+ s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.outpkts++;
s->cnt.outbytes += skb->len;
@@ -208,7 +208,7 @@ ip_vs_conn_stats(struct ip_vs_conn *cp, struct ip_vs_service *svc)
s->cnt.conns++;
u64_stats_update_end(&s->syncp);
- s = this_cpu_ptr(ipvs->tot_stats.cpustats);
+ s = this_cpu_ptr(ipvs->tot_stats->s.cpustats);
u64_stats_update_begin(&s->syncp);
s->cnt.conns++;
u64_stats_update_end(&s->syncp);
@@ -2448,6 +2448,10 @@ static void __exit ip_vs_cleanup(void)
ip_vs_conn_cleanup();
ip_vs_protocol_cleanup();
ip_vs_control_cleanup();
+ /* common rcu_barrier() used by:
+ * - ip_vs_control_cleanup()
+ */
+ rcu_barrier();
pr_info("ipvs unloaded.\n");
}
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index efab2b06d373..44c79fd1779c 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -483,17 +483,14 @@ static void ip_vs_service_rcu_free(struct rcu_head *head)
ip_vs_service_free(svc);
}
-static void __ip_vs_svc_put(struct ip_vs_service *svc, bool do_delay)
+static void __ip_vs_svc_put(struct ip_vs_service *svc)
{
if (atomic_dec_and_test(&svc->refcnt)) {
IP_VS_DBG_BUF(3, "Removing service %u/%s:%u\n",
svc->fwmark,
IP_VS_DBG_ADDR(svc->af, &svc->addr),
ntohs(svc->port));
- if (do_delay)
- call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
- else
- ip_vs_service_free(svc);
+ call_rcu(&svc->rcu_head, ip_vs_service_rcu_free);
}
}
@@ -780,14 +777,22 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, int dest_af,
return dest;
}
+static void ip_vs_dest_rcu_free(struct rcu_head *head)
+{
+ struct ip_vs_dest *dest;
+
+ dest = container_of(head, struct ip_vs_dest, rcu_head);
+ free_percpu(dest->stats.cpustats);
+ ip_vs_dest_put_and_free(dest);
+}
+
static void ip_vs_dest_free(struct ip_vs_dest *dest)
{
struct ip_vs_service *svc = rcu_dereference_protected(dest->svc, 1);
__ip_vs_dst_cache_reset(dest);
- __ip_vs_svc_put(svc, false);
- free_percpu(dest->stats.cpustats);
- ip_vs_dest_put_and_free(dest);
+ __ip_vs_svc_put(svc);
+ call_rcu(&dest->rcu_head, ip_vs_dest_rcu_free);
}
/*
@@ -811,6 +816,16 @@ static void ip_vs_trash_cleanup(struct netns_ipvs *ipvs)
}
}
+static void ip_vs_stats_rcu_free(struct rcu_head *head)
+{
+ struct ip_vs_stats_rcu *rs = container_of(head,
+ struct ip_vs_stats_rcu,
+ rcu_head);
+
+ free_percpu(rs->s.cpustats);
+ kfree(rs);
+}
+
static void
ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
{
@@ -923,7 +938,7 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
if (old_svc != svc) {
ip_vs_zero_stats(&dest->stats);
__ip_vs_bind_svc(dest, svc);
- __ip_vs_svc_put(old_svc, true);
+ __ip_vs_svc_put(old_svc);
}
}
@@ -1571,7 +1586,7 @@ static void __ip_vs_del_service(struct ip_vs_service *svc, bool cleanup)
/*
* Free the service if nobody refers to it
*/
- __ip_vs_svc_put(svc, true);
+ __ip_vs_svc_put(svc);
/* decrease the module use count */
ip_vs_use_count_dec();
@@ -1761,7 +1776,7 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs)
}
}
- ip_vs_zero_stats(&ipvs->tot_stats);
+ ip_vs_zero_stats(&ipvs->tot_stats->s);
return 0;
}
@@ -2255,7 +2270,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
seq_puts(seq,
" Conns Packets Packets Bytes Bytes\n");
- ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats);
+ ip_vs_copy_stats(&show, &net_ipvs(net)->tot_stats->s);
seq_printf(seq, "%8LX %8LX %8LX %16LX %16LX\n\n",
(unsigned long long)show.conns,
(unsigned long long)show.inpkts,
@@ -2279,7 +2294,7 @@ static int ip_vs_stats_show(struct seq_file *seq, void *v)
static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
{
struct net *net = seq_file_single_net(seq);
- struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats;
+ struct ip_vs_stats *tot_stats = &net_ipvs(net)->tot_stats->s;
struct ip_vs_cpu_stats __percpu *cpustats = tot_stats->cpustats;
struct ip_vs_kstats kstats;
int i;
@@ -4106,7 +4121,6 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
kfree(tbl);
return -ENOMEM;
}
- ip_vs_start_estimator(ipvs, &ipvs->tot_stats);
ipvs->sysctl_tbl = tbl;
/* Schedule defense work */
INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
@@ -4117,6 +4131,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
expire_nodest_conn_handler);
+ ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
return 0;
}
@@ -4128,7 +4143,7 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
cancel_delayed_work_sync(&ipvs->defense_work);
cancel_work_sync(&ipvs->defense_work.work);
unregister_net_sysctl_table(ipvs->sysctl_hdr);
- ip_vs_stop_estimator(ipvs, &ipvs->tot_stats);
+ ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
if (!net_eq(net, &init_net))
kfree(ipvs->sysctl_tbl);
@@ -4164,17 +4179,20 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
atomic_set(&ipvs->conn_out_counter, 0);
/* procfs stats */
- ipvs->tot_stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
- if (!ipvs->tot_stats.cpustats)
+ ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
+ if (!ipvs->tot_stats)
return -ENOMEM;
+ ipvs->tot_stats->s.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
+ if (!ipvs->tot_stats->s.cpustats)
+ goto err_tot_stats;
for_each_possible_cpu(i) {
struct ip_vs_cpu_stats *ipvs_tot_stats;
- ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats.cpustats, i);
+ ipvs_tot_stats = per_cpu_ptr(ipvs->tot_stats->s.cpustats, i);
u64_stats_init(&ipvs_tot_stats->syncp);
}
- spin_lock_init(&ipvs->tot_stats.lock);
+ spin_lock_init(&ipvs->tot_stats->s.lock);
#ifdef CONFIG_PROC_FS
if (!proc_create_net("ip_vs", 0, ipvs->net->proc_net,
@@ -4206,7 +4224,10 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
err_vs:
#endif
- free_percpu(ipvs->tot_stats.cpustats);
+ free_percpu(ipvs->tot_stats->s.cpustats);
+
+err_tot_stats:
+ kfree(ipvs->tot_stats);
return -ENOMEM;
}
@@ -4219,7 +4240,7 @@ void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
remove_proc_entry("ip_vs", ipvs->net->proc_net);
#endif
- free_percpu(ipvs->tot_stats.cpustats);
+ call_rcu(&ipvs->tot_stats->rcu_head, ip_vs_stats_rcu_free);
}
int __init ip_vs_register_nl_ioctl(void)
@@ -4279,5 +4300,6 @@ void ip_vs_control_cleanup(void)
{
EnterFunction(2);
unregister_netdevice_notifier(&ip_vs_dst_notifier);
+ /* relying on common rcu_barrier() in ip_vs_cleanup() */
LeaveFunction(2);
}
--
2.37.3
* [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation
2022-09-20 13:53 [RFC PATCHv4 0/5] ipvs: Use kthreads for stats Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 1/5] ipvs: add rcu protection to stats Julian Anastasov
@ 2022-09-20 13:53 ` Julian Anastasov
2022-10-01 10:52 ` Jiri Wiesner
2022-09-20 13:53 ` [RFC PATCHv4 3/5] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
` (2 subsequent siblings)
4 siblings, 1 reply; 9+ messages in thread
From: Julian Anastasov @ 2022-09-20 13:53 UTC (permalink / raw)
To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
Estimating all entries in a single list in timer context
causes large latency with multiple rules.
Spread the estimator structures over multiple chains and
use kthread(s) for the estimation. Every chain is
processed under RCU lock. The first kthread determines
the parameters to use, e.g. the maximum number of
estimators to process per kthread, based on the chain
length, allowing a sub-100us cond_resched rate and
estimation taking 1/8 of the CPU.
The first kthread also plays the role of distributor of
newly added estimators to all kthreads.
We also add the delayed work est_reload_work that makes
sure the kthread tasks are properly started/stopped.
ip_vs_start_estimator() is changed to report errors,
which allows us to safely store the estimators in
allocated structures.
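For a quick overview, every kthread task runs a loop roughly
like the condensed sketch below (trimmed from
ip_vs_estimation_kthread() and ip_vs_tick_estimation() in this
patch; kd is the kthread's ip_vs_est_kt_data, the timer resync
and the calc phase are omitted):

	while (!kthread_should_stop()) {
		/* sleep until the next tick (2s / IPVS_EST_NTICKS) is due */
		set_current_state(TASK_IDLE);
		schedule_timeout(IPVS_EST_TICK);

		/* walk the chains attached to the current tick row */
		rcu_read_lock();
		td = rcu_dereference(kd->ticks[row]);
		if (td) {
			for_each_set_bit(cid, td->present, IPVS_EST_TICK_CHAINS) {
				ip_vs_chain_estimation(&td->chains[cid]);
				cond_resched_rcu();
				td = rcu_dereference(kd->ticks[row]);
				if (!td)
					break;
			}
		}
		rcu_read_unlock();

		row = (row + 1) % IPVS_EST_NTICKS;
	}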
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
include/net/ip_vs.h | 73 ++-
net/netfilter/ipvs/ip_vs_ctl.c | 141 ++++--
net/netfilter/ipvs/ip_vs_est.c | 864 ++++++++++++++++++++++++++++++---
3 files changed, 973 insertions(+), 105 deletions(-)
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index bd8ae137e43b..2601636de648 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -42,6 +42,8 @@ static inline struct netns_ipvs *net_ipvs(struct net* net)
/* Connections' size value needed by ip_vs_ctl.c */
extern int ip_vs_conn_tab_size;
+extern struct mutex __ip_vs_mutex;
+
struct ip_vs_iphdr {
int hdr_flags; /* ipvs flags */
__u32 off; /* Where IP or IPv4 header starts */
@@ -365,7 +367,7 @@ struct ip_vs_cpu_stats {
/* IPVS statistics objects */
struct ip_vs_estimator {
- struct list_head list;
+ struct hlist_node list;
u64 last_inbytes;
u64 last_outbytes;
@@ -378,6 +380,55 @@ struct ip_vs_estimator {
u64 outpps;
u64 inbps;
u64 outbps;
+
+ s32 ktid:16, /* kthread ID, -1=temp list */
+ ktrow:8, /* row ID for kthread */
+ ktcid:8; /* chain ID for kthread */
+};
+
+/* Process estimators in multiple timer ticks */
+#define IPVS_EST_NTICKS 50
+/* Estimation uses a 2-second period */
+#define IPVS_EST_TICK ((2 * HZ) / IPVS_EST_NTICKS)
+
+/* Desired number of chains per tick (chain load factor in 100us units),
+ * 48=4.8ms of 40ms tick (12% CPU usage):
+ * 2 sec * 1000 ms in sec * 10 (100us in ms) / 8 (12.5%) / IPVS_EST_NTICKS
+ */
+#define IPVS_EST_CHAIN_FACTOR ALIGN_DOWN(2 * 1000 * 10 / 8 / IPVS_EST_NTICKS, 8)
+
+/* Compiled number of chains per tick
+ * The defines should match cond_resched_rcu
+ */
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
+#define IPVS_EST_TICK_CHAINS IPVS_EST_CHAIN_FACTOR
+#else
+#define IPVS_EST_TICK_CHAINS 1
+#endif
+
+/* Multiple chains processed in same tick */
+struct ip_vs_est_tick_data {
+ struct hlist_head chains[IPVS_EST_TICK_CHAINS];
+ DECLARE_BITMAP(present, IPVS_EST_TICK_CHAINS);
+ DECLARE_BITMAP(full, IPVS_EST_TICK_CHAINS);
+ int chain_len[IPVS_EST_TICK_CHAINS];
+};
+
+/* Context for estimation kthread */
+struct ip_vs_est_kt_data {
+ struct netns_ipvs *ipvs;
+ struct task_struct *task; /* task if running */
+ struct ip_vs_est_tick_data __rcu *ticks[IPVS_EST_NTICKS];
+ DECLARE_BITMAP(avail, IPVS_EST_NTICKS); /* tick has space for ests */
+ unsigned long est_timer; /* estimation timer (jiffies) */
+ int tick_len[IPVS_EST_NTICKS]; /* est count */
+ int id; /* ktid per netns */
+ int chain_max_len; /* max ests per tick chain */
+ int tick_max_len; /* max ests per tick */
+ int est_count; /* attached ests to kthread */
+ int est_max_count; /* max ests per kthread */
+ int add_row; /* row for new ests */
+ int est_row; /* estimated row */
};
/*
@@ -948,9 +999,17 @@ struct netns_ipvs {
struct ctl_table_header *lblcr_ctl_header;
struct ctl_table *lblcr_ctl_table;
/* ip_vs_est */
- struct list_head est_list; /* estimator list */
- spinlock_t est_lock;
- struct timer_list est_timer; /* Estimation timer */
+ struct delayed_work est_reload_work;/* Reload kthread tasks */
+ struct mutex est_mutex; /* protect kthread tasks */
+ struct hlist_head est_temp_list; /* Ests during calc phase */
+ struct ip_vs_est_kt_data **est_kt_arr; /* Array of kthread data ptrs */
+ unsigned long est_max_threads;/* rlimit */
+ int est_calc_phase; /* Calculation phase */
+ int est_chain_max_len;/* Calculated chain_max_len */
+ int est_kt_count; /* Allocated ptrs */
+ int est_add_ktid; /* ktid where to add ests */
+ atomic_t est_genid; /* kthreads reload genid */
+ atomic_t est_genid_done; /* applied genid */
/* ip_vs_sync */
spinlock_t sync_lock;
struct ipvs_master_sync_state *ms;
@@ -1481,10 +1540,14 @@ int stop_sync_thread(struct netns_ipvs *ipvs, int state);
void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct ip_vs_conn *cp, int pkts);
/* IPVS rate estimator prototypes (from ip_vs_est.c) */
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats);
void ip_vs_zero_estimator(struct ip_vs_stats *stats);
void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats);
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs);
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+ struct ip_vs_est_kt_data *kd);
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
/* Various IPVS packet transmitters (from ip_vs_xmit.c) */
int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 44c79fd1779c..587c91cd3750 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -49,8 +49,7 @@
MODULE_ALIAS_GENL_FAMILY(IPVS_GENL_NAME);
-/* semaphore for IPVS sockopts. And, [gs]etsockopt may sleep. */
-static DEFINE_MUTEX(__ip_vs_mutex);
+DEFINE_MUTEX(__ip_vs_mutex); /* Serialize configuration with sockopt/netlink */
/* sysctl variables */
@@ -241,6 +240,47 @@ static void defense_work_handler(struct work_struct *work)
}
#endif
+static void est_reload_work_handler(struct work_struct *work)
+{
+ struct netns_ipvs *ipvs =
+ container_of(work, struct netns_ipvs, est_reload_work.work);
+ int genid_done = atomic_read(&ipvs->est_genid_done);
+ unsigned long delay = HZ / 10; /* repeat startups after failure */
+ bool repeat = false;
+ int genid;
+ int id;
+
+ mutex_lock(&ipvs->est_mutex);
+ genid = atomic_read(&ipvs->est_genid);
+ for (id = 0; id < ipvs->est_kt_count; id++) {
+ struct ip_vs_est_kt_data *kd = ipvs->est_kt_arr[id];
+
+ /* netns clean up started, abort delayed work */
+ if (!ipvs->enable)
+ goto unlock;
+ if (!kd)
+ continue;
+ /* New config ? Stop kthread tasks */
+ if (genid != genid_done)
+ ip_vs_est_kthread_stop(kd);
+ if (!kd->task) {
+ /* Do not start kthreads above 0 in calc phase */
+ if ((!id || !ipvs->est_calc_phase) &&
+ ip_vs_est_kthread_start(ipvs, kd) < 0)
+ repeat = true;
+ }
+ }
+
+ atomic_set(&ipvs->est_genid_done, genid);
+
+ if (repeat)
+ queue_delayed_work(system_long_wq, &ipvs->est_reload_work,
+ delay);
+
+unlock:
+ mutex_unlock(&ipvs->est_mutex);
+}
+
int
ip_vs_use_count_inc(void)
{
@@ -831,7 +871,7 @@ ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
{
#define IP_VS_SHOW_STATS_COUNTER(c) dst->c = src->kstats.c - src->kstats0.c
- spin_lock_bh(&src->lock);
+ spin_lock(&src->lock);
IP_VS_SHOW_STATS_COUNTER(conns);
IP_VS_SHOW_STATS_COUNTER(inpkts);
@@ -841,7 +881,7 @@ ip_vs_copy_stats(struct ip_vs_kstats *dst, struct ip_vs_stats *src)
ip_vs_read_estimator(dst, src);
- spin_unlock_bh(&src->lock);
+ spin_unlock(&src->lock);
}
static void
@@ -862,7 +902,7 @@ ip_vs_export_stats_user(struct ip_vs_stats_user *dst, struct ip_vs_kstats *src)
static void
ip_vs_zero_stats(struct ip_vs_stats *stats)
{
- spin_lock_bh(&stats->lock);
+ spin_lock(&stats->lock);
/* get current counters as zero point, rates are zeroed */
@@ -876,7 +916,7 @@ ip_vs_zero_stats(struct ip_vs_stats *stats)
ip_vs_zero_estimator(stats);
- spin_unlock_bh(&stats->lock);
+ spin_unlock(&stats->lock);
}
/*
@@ -957,7 +997,6 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct ip_vs_dest *dest,
spin_unlock_bh(&dest->dst_lock);
if (add) {
- ip_vs_start_estimator(svc->ipvs, &dest->stats);
list_add_rcu(&dest->n_list, &svc->destinations);
svc->num_dests++;
sched = rcu_dereference_protected(svc->scheduler, 1);
@@ -979,6 +1018,7 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
{
struct ip_vs_dest *dest;
unsigned int atype, i;
+ int ret;
EnterFunction(2);
@@ -1003,9 +1043,10 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
return -EINVAL;
}
+ ret = -ENOMEM;
dest = kzalloc(sizeof(struct ip_vs_dest), GFP_KERNEL);
if (dest == NULL)
- return -ENOMEM;
+ goto err;
dest->stats.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
if (!dest->stats.cpustats)
@@ -1017,6 +1058,12 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
u64_stats_init(&ip_vs_dest_stats->syncp);
}
+ spin_lock_init(&dest->stats.lock);
+
+ ret = ip_vs_start_estimator(svc->ipvs, &dest->stats);
+ if (ret < 0)
+ goto err_cpustats;
+
dest->af = udest->af;
dest->protocol = svc->protocol;
dest->vaddr = svc->addr;
@@ -1032,15 +1079,19 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
INIT_HLIST_NODE(&dest->d_list);
spin_lock_init(&dest->dst_lock);
- spin_lock_init(&dest->stats.lock);
__ip_vs_update_dest(svc, dest, udest, 1);
LeaveFunction(2);
return 0;
+err_cpustats:
+ free_percpu(dest->stats.cpustats);
+
err_alloc:
kfree(dest);
- return -ENOMEM;
+
+err:
+ return ret;
}
@@ -1102,14 +1153,18 @@ ip_vs_add_dest(struct ip_vs_service *svc, struct ip_vs_dest_user_kern *udest)
IP_VS_DBG_ADDR(svc->af, &dest->vaddr),
ntohs(dest->vport));
+ ret = ip_vs_start_estimator(svc->ipvs, &dest->stats);
+ if (ret < 0)
+ goto err;
__ip_vs_update_dest(svc, dest, udest, 1);
- ret = 0;
} else {
/*
* Allocate and initialize the dest structure
*/
ret = ip_vs_new_dest(svc, udest);
}
+
+err:
LeaveFunction(2);
return ret;
@@ -1397,6 +1452,10 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
sched = NULL;
}
+ ret = ip_vs_start_estimator(ipvs, &svc->stats);
+ if (ret < 0)
+ goto out_err;
+
/* Bind the ct retriever */
RCU_INIT_POINTER(svc->pe, pe);
pe = NULL;
@@ -1409,8 +1468,6 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
if (svc->pe && svc->pe->conn_out)
atomic_inc(&ipvs->conn_out_counter);
- ip_vs_start_estimator(ipvs, &svc->stats);
-
/* Count only IPv4 services for old get/setsockopt interface */
if (svc->af == AF_INET)
ipvs->num_services++;
@@ -1421,8 +1478,15 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct ip_vs_service_user_kern *u,
ip_vs_svc_hash(svc);
*svc_p = svc;
- /* Now there is a service - full throttle */
- ipvs->enable = 1;
+
+ if (!ipvs->enable) {
+ /* Now there is a service - full throttle */
+ ipvs->enable = 1;
+
+ /* Start estimation for first time */
+ ip_vs_est_reload_start(ipvs);
+ }
+
return 0;
@@ -2311,13 +2375,13 @@ static int ip_vs_stats_percpu_show(struct seq_file *seq, void *v)
u64 conns, inpkts, outpkts, inbytes, outbytes;
do {
- start = u64_stats_fetch_begin_irq(&u->syncp);
+ start = u64_stats_fetch_begin(&u->syncp);
conns = u->cnt.conns;
inpkts = u->cnt.inpkts;
outpkts = u->cnt.outpkts;
inbytes = u->cnt.inbytes;
outbytes = u->cnt.outbytes;
- } while (u64_stats_fetch_retry_irq(&u->syncp, start));
+ } while (u64_stats_fetch_retry(&u->syncp, start));
seq_printf(seq, "%3X %8LX %8LX %8LX %16LX %16LX\n",
i, (u64)conns, (u64)inpkts,
@@ -4041,13 +4105,16 @@ static void ip_vs_genl_unregister(void)
static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
{
struct net *net = ipvs->net;
- int idx;
struct ctl_table *tbl;
+ int idx, ret;
atomic_set(&ipvs->dropentry, 0);
spin_lock_init(&ipvs->dropentry_lock);
spin_lock_init(&ipvs->droppacket_lock);
spin_lock_init(&ipvs->securetcp_lock);
+ INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
+ INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
+ expire_nodest_conn_handler);
if (!net_eq(net, &init_net)) {
tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
@@ -4115,24 +4182,27 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].mode = 0444;
#endif
+ ret = -ENOMEM;
ipvs->sysctl_hdr = register_net_sysctl(net, "net/ipv4/vs", tbl);
- if (ipvs->sysctl_hdr == NULL) {
- if (!net_eq(net, &init_net))
- kfree(tbl);
- return -ENOMEM;
- }
+ if (!ipvs->sysctl_hdr)
+ goto err;
ipvs->sysctl_tbl = tbl;
+
+ ret = ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
+ if (ret < 0)
+ goto err;
+
/* Schedule defense work */
- INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
queue_delayed_work(system_long_wq, &ipvs->defense_work,
DEFENSE_TIMER_PERIOD);
- /* Init delayed work for expiring no dest conn */
- INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
- expire_nodest_conn_handler);
-
- ip_vs_start_estimator(ipvs, &ipvs->tot_stats->s);
return 0;
+
+err:
+ unregister_net_sysctl_table(ipvs->sysctl_hdr);
+ if (!net_eq(net, &init_net))
+ kfree(tbl);
+ return ret;
}
static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
@@ -4165,6 +4235,7 @@ static struct notifier_block ip_vs_dst_notifier = {
int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
{
+ int ret = -ENOMEM;
int i, idx;
/* Initialize rs_table */
@@ -4178,10 +4249,12 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
atomic_set(&ipvs->nullsvc_counter, 0);
atomic_set(&ipvs->conn_out_counter, 0);
+ INIT_DELAYED_WORK(&ipvs->est_reload_work, est_reload_work_handler);
+
/* procfs stats */
ipvs->tot_stats = kzalloc(sizeof(*ipvs->tot_stats), GFP_KERNEL);
if (!ipvs->tot_stats)
- return -ENOMEM;
+ goto out;
ipvs->tot_stats->s.cpustats = alloc_percpu(struct ip_vs_cpu_stats);
if (!ipvs->tot_stats->s.cpustats)
goto err_tot_stats;
@@ -4207,7 +4280,8 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
goto err_percpu;
#endif
- if (ip_vs_control_net_init_sysctl(ipvs))
+ ret = ip_vs_control_net_init_sysctl(ipvs);
+ if (ret < 0)
goto err;
return 0;
@@ -4228,13 +4302,16 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
err_tot_stats:
kfree(ipvs->tot_stats);
- return -ENOMEM;
+
+out:
+ return ret;
}
void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
{
ip_vs_trash_cleanup(ipvs);
ip_vs_control_net_cleanup_sysctl(ipvs);
+ cancel_delayed_work_sync(&ipvs->est_reload_work);
#ifdef CONFIG_PROC_FS
remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 9a1a7af6a186..63241690072c 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -30,9 +30,6 @@
long interval, it is easy to implement a user level daemon which
periodically reads those statistical counters and measure rate.
- Currently, the measurement is activated by slow timer handler. Hope
- this measurement will not introduce too much load.
-
We measure rate during the last 8 seconds every 2 seconds:
avgrate = avgrate*(1-W) + rate*W
@@ -47,68 +44,76 @@
to 32-bit values for conns, packets, bps, cps and pps.
* A lot of code is taken from net/core/gen_estimator.c
- */
-
-/*
- * Make a summary from each cpu
+ KEY POINTS:
+ - cpustats counters are updated per-cpu in SoftIRQ context with BH disabled
+ - kthreads read the cpustats to update the estimators (svcs, dests, total)
+ - the states of estimators can be read (get stats) or modified (zero stats)
+ from processes
+
+ KTHREADS:
+ - estimators are added initially to est_temp_list and later kthread 0
+ distributes them to one or many kthreads for estimation
+ - kthread contexts are created and attached to array
+ - the kthread tasks are started when first service is added, before that
+ the total stats are not estimated
+ - the kthread context holds lists with estimators (chains) which are
+ processed every 2 seconds
+ - as estimators can be added dynamically and in bursts, we try to spread
+ them to multiple chains which are estimated at different time
+ - on start, kthread 0 enters calculation phase to determine the chain limits
+ and the limit of estimators per kthread
+ - est_add_ktid: ktid where to add new ests, can point to empty slot where
+ we should add kt data
*/
-static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
- struct ip_vs_cpu_stats __percpu *stats)
-{
- int i;
- bool add = false;
- for_each_possible_cpu(i) {
- struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i);
- unsigned int start;
- u64 conns, inpkts, outpkts, inbytes, outbytes;
-
- if (add) {
- do {
- start = u64_stats_fetch_begin(&s->syncp);
- conns = s->cnt.conns;
- inpkts = s->cnt.inpkts;
- outpkts = s->cnt.outpkts;
- inbytes = s->cnt.inbytes;
- outbytes = s->cnt.outbytes;
- } while (u64_stats_fetch_retry(&s->syncp, start));
- sum->conns += conns;
- sum->inpkts += inpkts;
- sum->outpkts += outpkts;
- sum->inbytes += inbytes;
- sum->outbytes += outbytes;
- } else {
- add = true;
- do {
- start = u64_stats_fetch_begin(&s->syncp);
- sum->conns = s->cnt.conns;
- sum->inpkts = s->cnt.inpkts;
- sum->outpkts = s->cnt.outpkts;
- sum->inbytes = s->cnt.inbytes;
- sum->outbytes = s->cnt.outbytes;
- } while (u64_stats_fetch_retry(&s->syncp, start));
- }
- }
-}
+static struct lock_class_key __ipvs_est_key;
+static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs);
+static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs);
-static void estimation_timer(struct timer_list *t)
+static void ip_vs_chain_estimation(struct hlist_head *chain)
{
struct ip_vs_estimator *e;
+ struct ip_vs_cpu_stats *c;
struct ip_vs_stats *s;
u64 rate;
- struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer);
- if (!sysctl_run_estimation(ipvs))
- goto skip;
+ hlist_for_each_entry_rcu(e, chain, list) {
+ u64 conns, inpkts, outpkts, inbytes, outbytes;
+ u64 kconns = 0, kinpkts = 0, koutpkts = 0;
+ u64 kinbytes = 0, koutbytes = 0;
+ unsigned int start;
+ int i;
+
+ if (kthread_should_stop())
+ break;
- spin_lock(&ipvs->est_lock);
- list_for_each_entry(e, &ipvs->est_list, list) {
s = container_of(e, struct ip_vs_stats, est);
+ for_each_possible_cpu(i) {
+ c = per_cpu_ptr(s->cpustats, i);
+ do {
+ start = u64_stats_fetch_begin(&c->syncp);
+ conns = c->cnt.conns;
+ inpkts = c->cnt.inpkts;
+ outpkts = c->cnt.outpkts;
+ inbytes = c->cnt.inbytes;
+ outbytes = c->cnt.outbytes;
+ } while (u64_stats_fetch_retry(&c->syncp, start));
+ kconns += conns;
+ kinpkts += inpkts;
+ koutpkts += outpkts;
+ kinbytes += inbytes;
+ koutbytes += outbytes;
+ }
spin_lock(&s->lock);
- ip_vs_read_cpu_stats(&s->kstats, s->cpustats);
+
+ s->kstats.conns = kconns;
+ s->kstats.inpkts = kinpkts;
+ s->kstats.outpkts = koutpkts;
+ s->kstats.inbytes = kinbytes;
+ s->kstats.outbytes = koutbytes;
/* scaled by 2^10, but divided 2 seconds */
rate = (s->kstats.conns - e->last_conns) << 9;
@@ -133,30 +138,742 @@ static void estimation_timer(struct timer_list *t)
e->outbps += ((s64)rate - (s64)e->outbps) >> 2;
spin_unlock(&s->lock);
}
- spin_unlock(&ipvs->est_lock);
+}
+
+static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
+{
+ struct ip_vs_est_tick_data *td;
+ int cid;
+
+ rcu_read_lock();
+ td = rcu_dereference(kd->ticks[row]);
+ if (!td)
+ goto out;
+ for_each_set_bit(cid, td->present, IPVS_EST_TICK_CHAINS) {
+ if (kthread_should_stop())
+ break;
+ ip_vs_chain_estimation(&td->chains[cid]);
+ cond_resched_rcu();
+ td = rcu_dereference(kd->ticks[row]);
+ if (!td)
+ break;
+ }
-skip:
- mod_timer(&ipvs->est_timer, jiffies + 2*HZ);
+out:
+ rcu_read_unlock();
}
-void ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
+static int ip_vs_estimation_kthread(void *data)
{
- struct ip_vs_estimator *est = &stats->est;
+ struct ip_vs_est_kt_data *kd = data;
+ struct netns_ipvs *ipvs = kd->ipvs;
+ int row = kd->est_row;
+ unsigned long now;
+ int id = kd->id;
+ long gap;
+
+ if (id > 0) {
+ if (!ipvs->est_chain_max_len)
+ return 0;
+ } else {
+ if (!ipvs->est_chain_max_len) {
+ ipvs->est_calc_phase = 1;
+ /* commit est_calc_phase before reading est_genid */
+ smp_mb();
+ }
+
+ /* kthread 0 will handle the calc phase */
+ if (ipvs->est_calc_phase)
+ ip_vs_est_calc_phase(ipvs);
+ }
+
+ while (1) {
+ if (!id && !hlist_empty(&ipvs->est_temp_list))
+ ip_vs_est_drain_temp_list(ipvs);
+ set_current_state(TASK_IDLE);
+ if (kthread_should_stop())
+ break;
+
+ /* before estimation, check if we should sleep */
+ now = jiffies;
+ gap = kd->est_timer - now;
+ if (gap > 0) {
+ if (gap > IPVS_EST_TICK) {
+ kd->est_timer = now - IPVS_EST_TICK;
+ gap = IPVS_EST_TICK;
+ }
+ schedule_timeout(gap);
+ } else {
+ __set_current_state(TASK_RUNNING);
+ if (gap < -8 * IPVS_EST_TICK)
+ kd->est_timer = now;
+ }
+
+ if (sysctl_run_estimation(ipvs) && kd->tick_len[row])
+ ip_vs_tick_estimation(kd, row);
- INIT_LIST_HEAD(&est->list);
+ row++;
+ if (row >= IPVS_EST_NTICKS)
+ row = 0;
+ WRITE_ONCE(kd->est_row, row);
+ kd->est_timer += IPVS_EST_TICK;
+ }
+ __set_current_state(TASK_RUNNING);
- spin_lock_bh(&ipvs->est_lock);
- list_add(&est->list, &ipvs->est_list);
- spin_unlock_bh(&ipvs->est_lock);
+ return 0;
+}
+
+/* Schedule stop/start for kthread tasks */
+void ip_vs_est_reload_start(struct netns_ipvs *ipvs)
+{
+ /* Ignore reloads before first service is added */
+ if (!ipvs->enable)
+ return;
+ /* Bump the kthread configuration genid */
+ atomic_inc(&ipvs->est_genid);
+ queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0);
}
+/* Start kthread task with current configuration */
+int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
+ struct ip_vs_est_kt_data *kd)
+{
+ unsigned long now;
+ int ret = 0;
+ long gap;
+
+ lockdep_assert_held(&ipvs->est_mutex);
+
+ if (kd->task)
+ goto out;
+ now = jiffies;
+ gap = kd->est_timer - now;
+ /* Sync est_timer if task is starting later */
+ if (abs(gap) > 4 * IPVS_EST_TICK)
+ kd->est_timer = now;
+ kd->task = kthread_create(ip_vs_estimation_kthread, kd, "ipvs-e:%d:%d",
+ ipvs->gen, kd->id);
+ if (IS_ERR(kd->task)) {
+ ret = PTR_ERR(kd->task);
+ kd->task = NULL;
+ goto out;
+ }
+
+ pr_info("starting estimator thread %d...\n", kd->id);
+ wake_up_process(kd->task);
+
+out:
+ return ret;
+}
+
+void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd)
+{
+ if (kd->task) {
+ pr_info("stopping estimator thread %d...\n", kd->id);
+ kthread_stop(kd->task);
+ kd->task = NULL;
+ }
+}
+
+/* Apply parameters to kthread */
+static void ip_vs_est_set_params(struct netns_ipvs *ipvs,
+ struct ip_vs_est_kt_data *kd)
+{
+ kd->chain_max_len = ipvs->est_chain_max_len;
+ /* We are using single chain on RCU preemption */
+ if (IPVS_EST_TICK_CHAINS == 1)
+ kd->chain_max_len *= IPVS_EST_CHAIN_FACTOR;
+ kd->tick_max_len = IPVS_EST_TICK_CHAINS * kd->chain_max_len;
+ kd->est_max_count = IPVS_EST_NTICKS * kd->tick_max_len;
+}
+
+/* Create and start estimation kthread in a free or new array slot */
+static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
+{
+ struct ip_vs_est_kt_data *kd = NULL;
+ int id = ipvs->est_kt_count;
+ int ret = -ENOMEM;
+ void *arr = NULL;
+ int i;
+
+ if ((unsigned long)ipvs->est_kt_count >= ipvs->est_max_threads &&
+ ipvs->enable && ipvs->est_max_threads)
+ return -EINVAL;
+
+ mutex_lock(&ipvs->est_mutex);
+
+ for (i = 0; i < id; i++) {
+ if (!ipvs->est_kt_arr[i])
+ break;
+ }
+ if (i >= id) {
+ arr = krealloc_array(ipvs->est_kt_arr, id + 1,
+ sizeof(struct ip_vs_est_kt_data *),
+ GFP_KERNEL);
+ if (!arr)
+ goto out;
+ ipvs->est_kt_arr = arr;
+ } else {
+ id = i;
+ }
+ kd = kzalloc(sizeof(*kd), GFP_KERNEL);
+ if (!kd)
+ goto out;
+ kd->ipvs = ipvs;
+ bitmap_fill(kd->avail, IPVS_EST_NTICKS);
+ kd->est_timer = jiffies;
+ kd->id = id;
+ ip_vs_est_set_params(ipvs, kd);
+ /* Start kthread tasks only when services are present */
+ if (ipvs->enable) {
+ ret = ip_vs_est_kthread_start(ipvs, kd);
+ if (ret < 0)
+ goto out;
+ }
+
+ if (arr)
+ ipvs->est_kt_count++;
+ ipvs->est_kt_arr[id] = kd;
+ kd = NULL;
+ /* Use most recent kthread for new ests */
+ ipvs->est_add_ktid = id;
+ ret = 0;
+
+out:
+ mutex_unlock(&ipvs->est_mutex);
+ kfree(kd);
+
+ return ret;
+}
+
+/* Select ktid where to add new ests: available, unused or new slot */
+static void ip_vs_est_update_ktid(struct netns_ipvs *ipvs)
+{
+ int ktid, best = ipvs->est_kt_count;
+ struct ip_vs_est_kt_data *kd;
+
+ for (ktid = 0; ktid < ipvs->est_kt_count; ktid++) {
+ kd = ipvs->est_kt_arr[ktid];
+ if (kd) {
+ if (kd->est_count < kd->est_max_count) {
+ best = ktid;
+ break;
+ }
+ } else if (ktid < best) {
+ best = ktid;
+ }
+ }
+ ipvs->est_add_ktid = best;
+}
+
+/* Add estimator to current kthread (est_add_ktid) */
+static int ip_vs_enqueue_estimator(struct netns_ipvs *ipvs,
+ struct ip_vs_estimator *est)
+{
+ struct ip_vs_est_kt_data *kd = NULL;
+ struct ip_vs_est_tick_data *td;
+ int ktid, row, crow, cid, ret;
+
+ if (ipvs->est_add_ktid < ipvs->est_kt_count) {
+ kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+ if (kd)
+ goto add_est;
+ }
+
+ ret = ip_vs_est_add_kthread(ipvs);
+ if (ret < 0)
+ goto out;
+ kd = ipvs->est_kt_arr[ipvs->est_add_ktid];
+
+add_est:
+ ktid = kd->id;
+ /* For small number of estimators prefer to use few ticks,
+ * otherwise try to add into the last estimated row.
+ * est_row and add_row point after the row we should use
+ */
+ if (kd->est_count >= 2 * kd->tick_max_len)
+ crow = READ_ONCE(kd->est_row);
+ else
+ crow = kd->add_row;
+ crow--;
+ if (crow < 0)
+ crow = IPVS_EST_NTICKS - 1;
+ row = crow;
+ if (crow < IPVS_EST_NTICKS - 1) {
+ crow++;
+ row = find_last_bit(kd->avail, crow);
+ }
+ if (row >= crow)
+ row = find_last_bit(kd->avail, IPVS_EST_NTICKS);
+
+ td = rcu_dereference_protected(kd->ticks[row], 1);
+ if (!td) {
+ td = kzalloc(sizeof(*td), GFP_KERNEL);
+ if (!td) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ rcu_assign_pointer(kd->ticks[row], td);
+ }
+
+ cid = find_first_zero_bit(td->full, IPVS_EST_TICK_CHAINS);
+
+ kd->est_count++;
+ kd->tick_len[row]++;
+ if (!td->chain_len[cid])
+ __set_bit(cid, td->present);
+ td->chain_len[cid]++;
+ est->ktid = ktid;
+ est->ktrow = row;
+ est->ktcid = cid;
+ hlist_add_head_rcu(&est->list, &td->chains[cid]);
+
+ if (td->chain_len[cid] >= kd->chain_max_len) {
+ __set_bit(cid, td->full);
+ if (kd->tick_len[row] >= kd->tick_max_len) {
+ __clear_bit(row, kd->avail);
+ /* Next time search from previous row */
+ kd->add_row = row;
+ }
+ }
+
+ /* Update est_add_ktid to point to first available/empty kt slot */
+ if (kd->est_count == kd->est_max_count)
+ ip_vs_est_update_ktid(ipvs);
+
+ ret = 0;
+
+out:
+ return ret;
+}
+
+/* Start estimation for stats */
+int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
+{
+ struct ip_vs_estimator *est = &stats->est;
+ int ret;
+
+ /* Get rlimit only from process that adds service, not from
+ * net_init/kthread. Clamp limit depending on est->ktid size.
+ */
+ if (!ipvs->est_max_threads && ipvs->enable)
+ ipvs->est_max_threads = min_t(unsigned long,
+ rlimit(RLIMIT_NPROC), SHRT_MAX);
+
+ est->ktid = -1;
+
+ /* We prefer this code to be short, kthread 0 will requeue the
+ * estimator to available chain. If tasks are disabled, we
+ * will not allocate much memory, just for kt 0.
+ */
+ ret = 0;
+ if (!ipvs->est_kt_count || !ipvs->est_kt_arr[0])
+ ret = ip_vs_est_add_kthread(ipvs);
+ if (ret >= 0)
+ hlist_add_head(&est->list, &ipvs->est_temp_list);
+ else
+ INIT_HLIST_NODE(&est->list);
+ return ret;
+}
+
+static void ip_vs_est_kthread_destroy(struct ip_vs_est_kt_data *kd)
+{
+ if (kd) {
+ if (kd->task) {
+ pr_info("stop unused estimator thread %d...\n", kd->id);
+ kthread_stop(kd->task);
+ }
+ kfree(kd);
+ }
+}
+
+/* Unlink estimator from chain */
void ip_vs_stop_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
{
struct ip_vs_estimator *est = &stats->est;
+ struct ip_vs_est_tick_data *td;
+ struct ip_vs_est_kt_data *kd;
+ int ktid = est->ktid;
+ int row = est->ktrow;
+ int cid = est->ktcid;
+
+ /* Failed to add to chain ? */
+ if (hlist_unhashed(&est->list))
+ return;
+
+ /* On return, estimator can be freed, dequeue it now */
+
+ /* In est_temp_list ? */
+ if (ktid < 0) {
+ hlist_del(&est->list);
+ goto end_kt0;
+ }
+
+ hlist_del_rcu(&est->list);
+ kd = ipvs->est_kt_arr[ktid];
+ td = rcu_dereference_protected(kd->ticks[row], 1);
+ __clear_bit(cid, td->full);
+ td->chain_len[cid]--;
+ if (!td->chain_len[cid])
+ __clear_bit(cid, td->present);
+ kd->tick_len[row]--;
+ __set_bit(row, kd->avail);
+ if (!kd->tick_len[row]) {
+ RCU_INIT_POINTER(kd->ticks[row], NULL);
+ kfree_rcu(td);
+ }
+ kd->est_count--;
+ if (kd->est_count) {
+ /* This kt slot can become available just now, prefer it */
+ if (ktid < ipvs->est_add_ktid)
+ ipvs->est_add_ktid = ktid;
+ return;
+ }
+
+ if (ktid > 0) {
+ mutex_lock(&ipvs->est_mutex);
+ ip_vs_est_kthread_destroy(kd);
+ ipvs->est_kt_arr[ktid] = NULL;
+ if (ktid == ipvs->est_kt_count - 1) {
+ ipvs->est_kt_count--;
+ while (ipvs->est_kt_count > 1 &&
+ !ipvs->est_kt_arr[ipvs->est_kt_count - 1])
+ ipvs->est_kt_count--;
+ }
+ mutex_unlock(&ipvs->est_mutex);
- spin_lock_bh(&ipvs->est_lock);
- list_del(&est->list);
- spin_unlock_bh(&ipvs->est_lock);
+ /* This slot is now empty, prefer another available kt slot */
+ if (ktid == ipvs->est_add_ktid)
+ ip_vs_est_update_ktid(ipvs);
+ }
+
+end_kt0:
+ /* kt 0 is freed after all other kthreads and chains are empty */
+ if (ipvs->est_kt_count == 1 && hlist_empty(&ipvs->est_temp_list)) {
+ kd = ipvs->est_kt_arr[0];
+ if (!kd || !kd->est_count) {
+ mutex_lock(&ipvs->est_mutex);
+ if (kd) {
+ ip_vs_est_kthread_destroy(kd);
+ ipvs->est_kt_arr[0] = NULL;
+ }
+ ipvs->est_kt_count--;
+ mutex_unlock(&ipvs->est_mutex);
+ ipvs->est_add_ktid = 0;
+ }
+ }
+}
+
+/* Register all ests from est_temp_list to kthreads */
+static void ip_vs_est_drain_temp_list(struct netns_ipvs *ipvs)
+{
+ struct ip_vs_estimator *est;
+
+ while (1) {
+ int max = 16;
+
+ mutex_lock(&__ip_vs_mutex);
+
+ while (max-- > 0) {
+ est = hlist_entry_safe(ipvs->est_temp_list.first,
+ struct ip_vs_estimator, list);
+ if (est) {
+ if (kthread_should_stop())
+ goto unlock;
+ hlist_del_init(&est->list);
+ if (ip_vs_enqueue_estimator(ipvs, est) >= 0)
+ continue;
+ est->ktid = -1;
+ hlist_add_head(&est->list,
+ &ipvs->est_temp_list);
+ /* Abort, some entries will not be estimated
+ * until next attempt
+ */
+ }
+ goto unlock;
+ }
+ mutex_unlock(&__ip_vs_mutex);
+ cond_resched();
+ }
+
+unlock:
+ mutex_unlock(&__ip_vs_mutex);
+}
+
+/* Calculate limits for all kthreads */
+static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
+{
+ struct ip_vs_stats **arr = NULL, **a, *s;
+ struct ip_vs_estimator *est;
+ int i, j, n = 0, ntest = 1;
+ struct hlist_head chain;
+ bool is_fifo = false;
+ s32 min_est = 0;
+ ktime_t t1, t2;
+ s64 diff, val;
+ int retry = 0;
+ int max = 2;
+ int ret = 1;
+
+ INIT_HLIST_HEAD(&chain);
+ for (;;) {
+ /* Too much tests? */
+ if (n >= 128)
+ goto out;
+
+ /* Dequeue old estimators from chain to avoid CPU caching */
+ for (;;) {
+ est = hlist_entry_safe(chain.first,
+ struct ip_vs_estimator,
+ list);
+ if (!est)
+ break;
+ hlist_del_init(&est->list);
+ }
+
+ /* Use only new estimators for test */
+ a = krealloc_array(arr, n + ntest, sizeof(*arr), GFP_KERNEL);
+ if (!a)
+ goto out;
+ arr = a;
+
+ for (j = 0; j < ntest; j++) {
+ arr[n] = kcalloc(1, sizeof(*arr[n]), GFP_KERNEL);
+ if (!arr[n])
+ goto out;
+ s = arr[n];
+ n++;
+
+ spin_lock_init(&s->lock);
+ s->cpustats = alloc_percpu(struct ip_vs_cpu_stats);
+ if (!s->cpustats)
+ goto out;
+ for_each_possible_cpu(i) {
+ struct ip_vs_cpu_stats *cs;
+
+ cs = per_cpu_ptr(s->cpustats, i);
+ u64_stats_init(&cs->syncp);
+ }
+ hlist_add_head(&s->est.list, &chain);
+ }
+
+ cond_resched();
+ if (!is_fifo) {
+ is_fifo = true;
+ sched_set_fifo(current);
+ }
+ rcu_read_lock();
+ t1 = ktime_get();
+ ip_vs_chain_estimation(&chain);
+ t2 = ktime_get();
+ rcu_read_unlock();
+
+ if (!ipvs->enable || kthread_should_stop())
+ goto stop;
+
+ diff = ktime_to_ns(ktime_sub(t2, t1));
+ if (diff <= 1 || diff >= NSEC_PER_SEC)
+ continue;
+ val = diff;
+ do_div(val, ntest);
+ if (!min_est || val < min_est) {
+ min_est = val;
+ /* goal: 95usec per chain */
+ val = 95 * NSEC_PER_USEC;
+ if (val >= min_est) {
+ do_div(val, min_est);
+ max = (int)val;
+ } else {
+ max = 1;
+ }
+ }
+ /* aim is to test below 100us */
+ if (diff < 50 * NSEC_PER_USEC)
+ ntest *= 2;
+ else
+ retry++;
+ /* Do at least 3 large tests to avoid scheduling noise */
+ if (retry >= 3)
+ break;
+ }
+
+out:
+ if (is_fifo)
+ sched_set_normal(current, 0);
+ for (;;) {
+ est = hlist_entry_safe(chain.first, struct ip_vs_estimator,
+ list);
+ if (!est)
+ break;
+ hlist_del_init(&est->list);
+ }
+ for (i = 0; i < n; i++) {
+ free_percpu(arr[i]->cpustats);
+ kfree(arr[i]);
+ }
+ kfree(arr);
+ *chain_max_len = max;
+ return ret;
+
+stop:
+ ret = 0;
+ goto out;
+}
+
+/* Calculate the parameters and apply them in context of kt #0
+ * ECP: est_calc_phase
+ * ECML: est_chain_max_len
+ * ECP ECML Insert Chain enable Description
+ * ---------------------------------------------------------------------------
+ * 0 0 est_temp_list 0 create kt #0 context
+ * 0 0 est_temp_list 0->1 service added, start kthread #0 task
+ * 0->1 0 est_temp_list 1 kt task #0 started, enters calc phase
+ * 1 0 est_temp_list 1 kt #0: determine est_chain_max_len,
+ * stop tasks, move ests to est_temp_list
+ * and free kd for kthreads 1..last
+ * 1->0 0->N kt chains 1 ests can go to kthreads
+ * 0 N kt chains 1 drain est_temp_list, create new kthread
+ * contexts, start tasks, estimate
+ */
+static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
+{
+ int genid = atomic_read(&ipvs->est_genid);
+ struct ip_vs_est_tick_data *td;
+ struct ip_vs_est_kt_data *kd;
+ struct ip_vs_estimator *est;
+ struct ip_vs_stats *stats;
+ int chain_max_len;
+ int id, row, cid;
+ bool last, last_td;
+ int step;
+
+ if (!ip_vs_est_calc_limits(ipvs, &chain_max_len))
+ return;
+
+ mutex_lock(&__ip_vs_mutex);
+
+ /* Stop all other tasks, so that we can immediately move the
+ * estimators to est_temp_list without RCU grace period
+ */
+ mutex_lock(&ipvs->est_mutex);
+ for (id = 1; id < ipvs->est_kt_count; id++) {
+ /* netns clean up started, abort */
+ if (!ipvs->enable)
+ goto unlock2;
+ kd = ipvs->est_kt_arr[id];
+ if (!kd)
+ continue;
+ ip_vs_est_kthread_stop(kd);
+ }
+ mutex_unlock(&ipvs->est_mutex);
+
+ /* Move all estimators to est_temp_list but carefully,
+ * all estimators and kthread data can be released while
+ * we reschedule. Even for kthread 0.
+ */
+ step = 0;
+
+next_kt:
+ /* Destroy contexts backwards */
+ id = ipvs->est_kt_count - 1;
+ if (id < 0)
+ goto end_dequeue;
+ kd = ipvs->est_kt_arr[id];
+ if (!kd)
+ goto end_dequeue;
+ /* kt 0 can exist with empty chains */
+ if (!id && kd->est_count <= 1)
+ goto end_dequeue;
+
+ row = -1;
+
+next_row:
+ row++;
+ if (row >= IPVS_EST_NTICKS)
+ goto next_kt;
+ if (!ipvs->enable)
+ goto unlock;
+ td = rcu_dereference_protected(kd->ticks[row], 1);
+ if (!td)
+ goto next_row;
+
+ cid = 0;
+
+walk_chain:
+ if (kthread_should_stop())
+ goto unlock;
+ step++;
+ if (!(step & 63)) {
+ /* Give chance estimators to be added (to est_temp_list)
+ * and deleted (releasing kthread contexts)
+ */
+ mutex_unlock(&__ip_vs_mutex);
+ cond_resched();
+ mutex_lock(&__ip_vs_mutex);
+
+ /* Current kt released ? */
+ if (id + 1 != ipvs->est_kt_count)
+ goto next_kt;
+ if (kd != ipvs->est_kt_arr[id])
+ goto end_dequeue;
+ /* Current td released ? */
+ if (td != rcu_dereference_protected(kd->ticks[row], 1))
+ goto next_row;
+ /* No fatal changes on the current kd and td */
+ }
+ est = hlist_entry_safe(td->chains[cid].first, struct ip_vs_estimator,
+ list);
+ if (!est) {
+ cid++;
+ if (cid >= IPVS_EST_TICK_CHAINS)
+ goto next_row;
+ goto walk_chain;
+ }
+ /* We can cheat and increase est_count to protect kt 0 context
+ * from release but we prefer to keep the last estimator
+ */
+ last = kd->est_count <= 1;
+ /* Do not free kt #0 data */
+ if (!id && last)
+ goto end_dequeue;
+ last_td = kd->tick_len[row] <= 1;
+ stats = container_of(est, struct ip_vs_stats, est);
+ ip_vs_stop_estimator(ipvs, stats);
+ /* Tasks are stopped, move without RCU grace period */
+ est->ktid = -1;
+ hlist_add_head(&est->list, &ipvs->est_temp_list);
+ /* kd freed ? */
+ if (last)
+ goto next_kt;
+ /* td freed ? */
+ if (last_td)
+ goto next_row;
+ goto walk_chain;
+
+end_dequeue:
+ /* All estimators removed while calculating ? */
+ if (!ipvs->est_kt_count)
+ goto unlock;
+ kd = ipvs->est_kt_arr[0];
+ if (!kd)
+ goto unlock;
+ ipvs->est_chain_max_len = chain_max_len;
+ ip_vs_est_set_params(ipvs, kd);
+
+ pr_info("using max %d ests per chain, %d per kthread\n",
+ kd->chain_max_len, kd->est_max_count);
+
+ mutex_lock(&ipvs->est_mutex);
+
+ /* We completed the calc phase, new calc phase not requested */
+ if (genid == atomic_read(&ipvs->est_genid))
+ ipvs->est_calc_phase = 0;
+
+unlock2:
+ mutex_unlock(&ipvs->est_mutex);
+
+unlock:
+ mutex_unlock(&__ip_vs_mutex);
}
void ip_vs_zero_estimator(struct ip_vs_stats *stats)
@@ -191,14 +908,25 @@ void ip_vs_read_estimator(struct ip_vs_kstats *dst, struct ip_vs_stats *stats)
int __net_init ip_vs_estimator_net_init(struct netns_ipvs *ipvs)
{
- INIT_LIST_HEAD(&ipvs->est_list);
- spin_lock_init(&ipvs->est_lock);
- timer_setup(&ipvs->est_timer, estimation_timer, 0);
- mod_timer(&ipvs->est_timer, jiffies + 2 * HZ);
+ INIT_HLIST_HEAD(&ipvs->est_temp_list);
+ ipvs->est_kt_arr = NULL;
+ ipvs->est_max_threads = 0;
+ ipvs->est_calc_phase = 0;
+ ipvs->est_chain_max_len = 0;
+ ipvs->est_kt_count = 0;
+ ipvs->est_add_ktid = 0;
+ atomic_set(&ipvs->est_genid, 0);
+ atomic_set(&ipvs->est_genid_done, 0);
+ __mutex_init(&ipvs->est_mutex, "ipvs->est_mutex", &__ipvs_est_key);
return 0;
}
void __net_exit ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs)
{
- del_timer_sync(&ipvs->est_timer);
+ int i;
+
+ for (i = 0; i < ipvs->est_kt_count; i++)
+ ip_vs_est_kthread_destroy(ipvs->est_kt_arr[i]);
+ kfree(ipvs->est_kt_arr);
+ mutex_destroy(&ipvs->est_mutex);
}
--
2.37.3
* [RFC PATCHv4 3/5] ipvs: add est_cpulist and est_nice sysctl vars
2022-09-20 13:53 [RFC PATCHv4 0/5] ipvs: Use kthreads for stats Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 1/5] ipvs: add rcu protection to stats Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-09-20 13:53 ` Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 4/5] ipvs: run_estimation should control the kthread tasks Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 5/5] ipvs: debug the tick time Julian Anastasov
4 siblings, 0 replies; 9+ messages in thread
From: Julian Anastasov @ 2022-09-20 13:53 UTC (permalink / raw)
To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
Allow the kthreads for stats estimation to be configured
with a specific cpulist (isolation) and niceness
(scheduling priority).
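Applying these settings to an estimator kthread task boils
down to the standard helpers, roughly as in the sketch below
(a sketch only, not necessarily the exact hunk in this patch;
sysctl_est_nice() and sysctl_est_cpulist() are the accessors
added here):

	/* apply the sysctl settings when (re)starting a kthread task */
	set_user_nice(kd->task, sysctl_est_nice(ipvs));
	set_cpus_allowed_ptr(kd->task, sysctl_est_cpulist(ipvs));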
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
Documentation/networking/ipvs-sysctl.rst | 20 ++++
include/net/ip_vs.h | 50 ++++++++
net/netfilter/ipvs/ip_vs_ctl.c | 141 ++++++++++++++++++++++-
net/netfilter/ipvs/ip_vs_est.c | 11 +-
4 files changed, 219 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 387fda80f05f..1b778705d706 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -129,6 +129,26 @@ drop_packet - INTEGER
threshold. When the mode 3 is set, the always mode drop rate
is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
+est_cpulist - CPULIST
+ Allowed CPUs for estimation kthreads
+
+ Syntax: standard cpulist format
+ empty list - stop kthread tasks and estimation
+ default - the system's housekeeping CPUs for kthreads
+
+ Example:
+ "all": all possible CPUs
+ "0-N": all possible CPUs, N denotes last CPU number
+ "0,1-N:1/2": first and all CPUs with odd number
+ "": empty list
+
+est_nice - INTEGER
+ default 0
+ Valid range: -20 (more favorable) .. 19 (less favorable)
+
+ Niceness value to use for the estimation kthreads (scheduling
+ priority)
+
expire_nodest_conn - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 2601636de648..73e19794bbe1 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -29,6 +29,7 @@
#include <net/netfilter/nf_conntrack.h>
#endif
#include <net/net_namespace.h> /* Netw namespace */
+#include <linux/sched/isolation.h>
#define IP_VS_HDR_INVERSE 1
#define IP_VS_HDR_ICMP 2
@@ -365,6 +366,9 @@ struct ip_vs_cpu_stats {
struct u64_stats_sync syncp;
};
+/* Default nice for estimator kthreads */
+#define IPVS_EST_NICE 0
+
/* IPVS statistics objects */
struct ip_vs_estimator {
struct hlist_node list;
@@ -989,6 +993,12 @@ struct netns_ipvs {
int sysctl_schedule_icmp;
int sysctl_ignore_tunneled;
int sysctl_run_estimation;
+#ifdef CONFIG_SYSCTL
+ cpumask_var_t sysctl_est_cpulist; /* kthread cpumask */
+ int est_cpulist_valid; /* cpulist set */
+ int sysctl_est_nice; /* kthread nice */
+ int est_stopped; /* stop tasks */
+#endif
/* ip_vs_lblc */
int sysctl_lblc_expiration;
@@ -1142,6 +1152,19 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
return ipvs->sysctl_run_estimation;
}
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+ if (ipvs->est_cpulist_valid)
+ return ipvs->sysctl_est_cpulist;
+ else
+ return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+ return ipvs->sysctl_est_nice;
+}
+
#else
static inline int sysctl_sync_threshold(struct netns_ipvs *ipvs)
@@ -1239,6 +1262,16 @@ static inline int sysctl_run_estimation(struct netns_ipvs *ipvs)
return 1;
}
+static inline const struct cpumask *sysctl_est_cpulist(struct netns_ipvs *ipvs)
+{
+ return housekeeping_cpumask(HK_TYPE_KTHREAD);
+}
+
+static inline int sysctl_est_nice(struct netns_ipvs *ipvs)
+{
+ return IPVS_EST_NICE;
+}
+
#endif
/* IPVS core functions
@@ -1549,6 +1582,23 @@ int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
struct ip_vs_est_kt_data *kd);
void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
+static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+ ipvs->est_stopped = ipvs->est_cpulist_valid &&
+ cpumask_empty(sysctl_est_cpulist(ipvs));
+#endif
+}
+
+static inline bool ip_vs_est_stopped(struct netns_ipvs *ipvs)
+{
+#ifdef CONFIG_SYSCTL
+ return ipvs->est_stopped;
+#else
+ return false;
+#endif
+}
+
/* Various IPVS packet transmitters (from ip_vs_xmit.c) */
int ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
struct ip_vs_protocol *pp, struct ip_vs_iphdr *iph);
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 587c91cd3750..4cc45e24d6e2 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -263,7 +263,7 @@ static void est_reload_work_handler(struct work_struct *work)
/* New config ? Stop kthread tasks */
if (genid != genid_done)
ip_vs_est_kthread_stop(kd);
- if (!kd->task) {
+ if (!kd->task && !ip_vs_est_stopped(ipvs)) {
/* Do not start kthreads above 0 in calc phase */
if ((!id || !ipvs->est_calc_phase) &&
ip_vs_est_kthread_start(ipvs, kd) < 0)
@@ -1922,6 +1922,120 @@ proc_do_sync_ports(struct ctl_table *table, int write,
return rc;
}
+static int ipvs_proc_est_cpumask_set(struct ctl_table *table, void *buffer)
+{
+ struct netns_ipvs *ipvs = table->extra2;
+ cpumask_var_t *valp = table->data;
+ cpumask_var_t newmask;
+ int ret;
+
+ if (!zalloc_cpumask_var(&newmask, GFP_KERNEL))
+ return -ENOMEM;
+
+ ret = cpulist_parse(buffer, newmask);
+ if (ret)
+ goto out;
+
+ mutex_lock(&ipvs->est_mutex);
+
+ if (!ipvs->est_cpulist_valid) {
+ if (!zalloc_cpumask_var(valp, GFP_KERNEL)) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+ ipvs->est_cpulist_valid = 1;
+ }
+ cpumask_and(newmask, newmask, &current->cpus_mask);
+ cpumask_copy(*valp, newmask);
+ ipvs->est_calc_phase = 1;
+ ip_vs_est_reload_start(ipvs);
+
+unlock:
+ mutex_unlock(&ipvs->est_mutex);
+
+out:
+ free_cpumask_var(newmask);
+ return ret;
+}
+
+static int ipvs_proc_est_cpumask_get(struct ctl_table *table, void *buffer,
+ size_t size)
+{
+ struct netns_ipvs *ipvs = table->extra2;
+ cpumask_var_t *valp = table->data;
+ struct cpumask *mask;
+ int ret;
+
+ mutex_lock(&ipvs->est_mutex);
+
+ if (ipvs->est_cpulist_valid)
+ mask = *valp;
+ else
+ mask = (struct cpumask *)housekeeping_cpumask(HK_TYPE_KTHREAD);
+ ret = scnprintf(buffer, size, "%*pbl\n", cpumask_pr_args(mask));
+
+ mutex_unlock(&ipvs->est_mutex);
+
+ return ret;
+}
+
+static int ipvs_proc_est_cpulist(struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+
+ /* Ignore both read and write(append) if *ppos not 0 */
+ if (*ppos || !*lenp) {
+ *lenp = 0;
+ return 0;
+ }
+ if (write) {
+ /* proc_sys_call_handler() appends terminator */
+ ret = ipvs_proc_est_cpumask_set(table, buffer);
+ if (ret >= 0)
+ *ppos += *lenp;
+ } else {
+ /* proc_sys_call_handler() allocates 1 byte for terminator */
+ ret = ipvs_proc_est_cpumask_get(table, buffer, *lenp + 1);
+ if (ret >= 0) {
+ *lenp = ret;
+ *ppos += *lenp;
+ ret = 0;
+ }
+ }
+ return ret;
+}
+
+static int ipvs_proc_est_nice(struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct netns_ipvs *ipvs = table->extra2;
+ int *valp = table->data;
+ int val = *valp;
+ int ret;
+
+ struct ctl_table tmp_table = {
+ .data = &val,
+ .maxlen = sizeof(int),
+ .mode = table->mode,
+ };
+
+ ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+ if (write && ret >= 0) {
+ if (val < MIN_NICE || val > MAX_NICE) {
+ ret = -EINVAL;
+ } else {
+ mutex_lock(&ipvs->est_mutex);
+ if (*valp != val) {
+ *valp = val;
+ ip_vs_est_reload_start(ipvs);
+ }
+ mutex_unlock(&ipvs->est_mutex);
+ }
+ }
+ return ret;
+}
+
/*
* IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
* Do not change order or insert new entries without
@@ -2098,6 +2212,18 @@ static struct ctl_table vs_vars[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "est_cpulist",
+ .maxlen = NR_CPUS, /* unused */
+ .mode = 0644,
+ .proc_handler = ipvs_proc_est_cpulist,
+ },
+ {
+ .procname = "est_nice",
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = ipvs_proc_est_nice,
+ },
#ifdef CONFIG_IP_VS_DEBUG
{
.procname = "debug_level",
@@ -4115,6 +4241,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
INIT_DELAYED_WORK(&ipvs->defense_work, defense_work_handler);
INIT_DELAYED_WORK(&ipvs->expire_nodest_conn_work,
expire_nodest_conn_handler);
+ ipvs->est_stopped = 0;
if (!net_eq(net, &init_net)) {
tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
@@ -4176,6 +4303,15 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
ipvs->sysctl_run_estimation = 1;
tbl[idx++].data = &ipvs->sysctl_run_estimation;
+
+ ipvs->est_cpulist_valid = 0;
+ tbl[idx].extra2 = ipvs;
+ tbl[idx++].data = &ipvs->sysctl_est_cpulist;
+
+ ipvs->sysctl_est_nice = IPVS_EST_NICE;
+ tbl[idx].extra2 = ipvs;
+ tbl[idx++].data = &ipvs->sysctl_est_nice;
+
#ifdef CONFIG_IP_VS_DEBUG
/* Global sysctls must be ro in non-init netns */
if (!net_eq(net, &init_net))
@@ -4215,6 +4351,9 @@ static void __net_exit ip_vs_control_net_cleanup_sysctl(struct netns_ipvs *ipvs)
unregister_net_sysctl_table(ipvs->sysctl_hdr);
ip_vs_stop_estimator(ipvs, &ipvs->tot_stats->s);
+ if (ipvs->est_cpulist_valid)
+ free_cpumask_var(ipvs->sysctl_est_cpulist);
+
if (!net_eq(net, &init_net))
kfree(ipvs->sysctl_tbl);
}
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 63241690072c..800ed1ade9f9 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -57,6 +57,9 @@
- kthread contexts are created and attached to array
- the kthread tasks are started when first service is added, before that
the total stats are not estimated
+ - when configuration (cpulist/nice) is changed, the tasks are restarted
+ by work (est_reload_work)
+ - kthread tasks are stopped while the cpulist is empty
- the kthread context holds lists with estimators (chains) which are
processed every 2 seconds
- as estimators can be added dynamically and in bursts, we try to spread
@@ -229,6 +232,7 @@ void ip_vs_est_reload_start(struct netns_ipvs *ipvs)
/* Ignore reloads before first service is added */
if (!ipvs->enable)
return;
+ ip_vs_est_stopped_recalc(ipvs);
/* Bump the kthread configuration genid */
atomic_inc(&ipvs->est_genid);
queue_delayed_work(system_long_wq, &ipvs->est_reload_work, 0);
@@ -259,6 +263,9 @@ int ip_vs_est_kthread_start(struct netns_ipvs *ipvs,
goto out;
}
+ set_user_nice(kd->task, sysctl_est_nice(ipvs));
+ set_cpus_allowed_ptr(kd->task, sysctl_est_cpulist(ipvs));
+
pr_info("starting estimator thread %d...\n", kd->id);
wake_up_process(kd->task);
@@ -325,7 +332,7 @@ static int ip_vs_est_add_kthread(struct netns_ipvs *ipvs)
kd->id = id;
ip_vs_est_set_params(ipvs, kd);
/* Start kthread tasks only when services are present */
- if (ipvs->enable) {
+ if (ipvs->enable && !ip_vs_est_stopped(ipvs)) {
ret = ip_vs_est_kthread_start(ipvs, kd);
if (ret < 0)
goto out;
@@ -699,7 +706,7 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
out:
if (is_fifo)
- sched_set_normal(current, 0);
+ sched_set_normal(current, sysctl_est_nice(ipvs));
for (;;) {
est = hlist_entry_safe(chain.first, struct ip_vs_estimator,
list);
--
2.37.3
* [RFC PATCHv4 4/5] ipvs: run_estimation should control the kthread tasks
2022-09-20 13:53 [RFC PATCHv4 0/5] ipvs: Use kthreads for stats Julian Anastasov
` (2 preceding siblings ...)
2022-09-20 13:53 ` [RFC PATCHv4 3/5] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
@ 2022-09-20 13:53 ` Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 5/5] ipvs: debug the tick time Julian Anastasov
4 siblings, 0 replies; 9+ messages in thread
From: Julian Anastasov @ 2022-09-20 13:53 UTC (permalink / raw)
To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
Change the run_estimation flag to start/stop the kthread tasks.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
Documentation/networking/ipvs-sysctl.rst | 4 ++--
include/net/ip_vs.h | 6 +++--
net/netfilter/ipvs/ip_vs_ctl.c | 29 +++++++++++++++++++++++-
net/netfilter/ipvs/ip_vs_est.c | 2 +-
4 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 1b778705d706..3fb5fa142eef 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -324,8 +324,8 @@ run_estimation - BOOLEAN
0 - disabled
not 0 - enabled (default)
- If disabled, the estimation will be stop, and you can't see
- any update on speed estimation data.
+ If disabled, the estimation will be suspended and kthread tasks
+ stopped.
You can always re-enable estimation by setting this value to 1.
But be careful, the first estimation after re-enable is not
diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 73e19794bbe1..e41fb40945ca 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1585,8 +1585,10 @@ void ip_vs_est_kthread_stop(struct ip_vs_est_kt_data *kd);
static inline void ip_vs_est_stopped_recalc(struct netns_ipvs *ipvs)
{
#ifdef CONFIG_SYSCTL
- ipvs->est_stopped = ipvs->est_cpulist_valid &&
- cpumask_empty(sysctl_est_cpulist(ipvs));
+ /* Stop tasks while cpulist is empty or if disabled with flag */
+ ipvs->est_stopped = !sysctl_run_estimation(ipvs) ||
+ (ipvs->est_cpulist_valid &&
+ cpumask_empty(sysctl_est_cpulist(ipvs)));
#endif
}
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 4cc45e24d6e2..1c5249fff6c4 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2036,6 +2036,32 @@ static int ipvs_proc_est_nice(struct ctl_table *table, int write,
return ret;
}
+static int ipvs_proc_run_estimation(struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct netns_ipvs *ipvs = table->extra2;
+ int *valp = table->data;
+ int val = *valp;
+ int ret;
+
+ struct ctl_table tmp_table = {
+ .data = &val,
+ .maxlen = sizeof(int),
+ .mode = table->mode,
+ };
+
+ ret = proc_dointvec(&tmp_table, write, buffer, lenp, ppos);
+ if (write && ret >= 0) {
+ mutex_lock(&ipvs->est_mutex);
+ if (*valp != val) {
+ *valp = val;
+ ip_vs_est_reload_start(ipvs);
+ }
+ mutex_unlock(&ipvs->est_mutex);
+ }
+ return ret;
+}
+
/*
* IPVS sysctl table (under the /proc/sys/net/ipv4/vs/)
* Do not change order or insert new entries without
@@ -2210,7 +2236,7 @@ static struct ctl_table vs_vars[] = {
.procname = "run_estimation",
.maxlen = sizeof(int),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = ipvs_proc_run_estimation,
},
{
.procname = "est_cpulist",
@@ -4302,6 +4328,7 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
ipvs->sysctl_run_estimation = 1;
+ tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_run_estimation;
ipvs->est_cpulist_valid = 0;
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 800ed1ade9f9..38a6c8ab308b 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -212,7 +212,7 @@ static int ip_vs_estimation_kthread(void *data)
kd->est_timer = now;
}
- if (sysctl_run_estimation(ipvs) && kd->tick_len[row])
+ if (kd->tick_len[row])
ip_vs_tick_estimation(kd, row);
row++;
--
2.37.3
* [RFC PATCHv4 5/5] ipvs: debug the tick time
2022-09-20 13:53 [RFC PATCHv4 0/5] ipvs: Use kthreads for stats Julian Anastasov
` (3 preceding siblings ...)
2022-09-20 13:53 ` [RFC PATCHv4 4/5] ipvs: run_estimation should control the kthread tasks Julian Anastasov
@ 2022-09-20 13:53 ` Julian Anastasov
4 siblings, 0 replies; 9+ messages in thread
From: Julian Anastasov @ 2022-09-20 13:53 UTC (permalink / raw)
To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
Just for testing, print the tick time every minute.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
net/netfilter/ipvs/ip_vs_est.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index 38a6c8ab308b..e214aa0b3abe 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -147,7 +147,14 @@ static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
{
struct ip_vs_est_tick_data *td;
int cid;
-
+ u64 ns = 0;
+ static int used_row = -1;
+ static int pass;
+
+ if (used_row < 0)
+ used_row = row;
+ if (row == used_row && !kd->id && !(pass & 31))
+ ns = ktime_get_ns();
rcu_read_lock();
td = rcu_dereference(kd->ticks[row]);
if (!td)
@@ -164,6 +171,16 @@ static void ip_vs_tick_estimation(struct ip_vs_est_kt_data *kd, int row)
out:
rcu_read_unlock();
+ if (row == used_row && !kd->id && !(pass++ & 31)) {
+ static int ncpu;
+
+ ns = ktime_get_ns() - ns;
+ if (!ncpu)
+ ncpu = num_possible_cpus();
+ pr_info("tick time: %lluns for %d CPUs, %d ests, %d chains, chain_max_len=%d\n",
+ (unsigned long long)ns, ncpu, kd->tick_len[row],
+ IPVS_EST_TICK_CHAINS, kd->chain_max_len);
+ }
}
static int ip_vs_estimation_kthread(void *data)
@@ -617,7 +634,7 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
bool is_fifo = false;
s32 min_est = 0;
ktime_t t1, t2;
- s64 diff, val;
+ s64 diff = 0, val;
int retry = 0;
int max = 2;
int ret = 1;
@@ -707,6 +724,8 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
out:
if (is_fifo)
sched_set_normal(current, sysctl_est_nice(ipvs));
+ pr_info("calc: chain_max_len=%d, single est=%dns, diff=%d, retry=%d, ntest=%d\n",
+ max, min_est, (int)diff, retry, ntest);
for (;;) {
est = hlist_entry_safe(chain.first, struct ip_vs_estimator,
list);
@@ -752,6 +771,7 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
int chain_max_len;
int id, row, cid;
bool last, last_td;
+ u64 ns = 0;
int step;
if (!ip_vs_est_calc_limits(ipvs, &chain_max_len))
@@ -780,6 +800,8 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
*/
step = 0;
+ ns = ktime_get_ns();
+
next_kt:
/* Destroy contexts backwards */
id = ipvs->est_kt_count - 1;
@@ -858,6 +880,8 @@ static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
goto walk_chain;
end_dequeue:
+ ns = ktime_get_ns() - ns;
+ pr_info("dequeue: %lluns\n", (unsigned long long)ns);
/* All estimators removed while calculating ? */
if (!ipvs->est_kt_count)
goto unlock;
--
2.37.3
* Re: [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation
2022-09-20 13:53 ` [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation Julian Anastasov
@ 2022-10-01 10:52 ` Jiri Wiesner
2022-10-02 14:12 ` Julian Anastasov
0 siblings, 1 reply; 9+ messages in thread
From: Jiri Wiesner @ 2022-10-01 10:52 UTC (permalink / raw)
To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
Apologies for the late response. I got tied up at work.
On Tue, Sep 20, 2022 at 04:53:29PM +0300, Julian Anastasov wrote:
> +/* Start estimation for stats */
> +int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
> +{
> + struct ip_vs_estimator *est = &stats->est;
> + int ret;
> +
> + /* Get rlimit only from process that adds service, not from
> + * net_init/kthread. Clamp limit depending on est->ktid size.
> + */
> + if (!ipvs->est_max_threads && ipvs->enable)
> + ipvs->est_max_threads = min_t(unsigned long,
> + rlimit(RLIMIT_NPROC), SHRT_MAX);
For example, the user space limit on the number of processes does not hold any useful value on my testing machine:
# ulimit -u
254318
while /proc/sys/kernel/pid_max is 65536. The pid_max variable itself depends on the number of CPUs on the system. I also think that user space limits should not directly determine how many kthreads can be created by the kernel. By design, one fully loaded kthread will take up 12% of the CPU time on one CPU. On account of the CPU usage it does not make sense to set ipvs->est_max_threads to a value higher than a multiple (4 or less) of the number of possible CPUs in the system. I think the ipvs->est_max_threads value should not allow CPUs to get saturated. Also, kthreads computing IPVS rate estimates could be created in each net namespace on the system, which alone makes it possible to saturate all the CPUs on the system because ipvs->est_max_threads does not take other namespaces into account.
As for solutions to this problem, I think it would be easiest to implement global counters in ip_vs_est.c (est_kt_count and est_max_threads) that would be tested for the max number of allocated kthreads in ip_vs_est_add_kthread().
Another possible solution would be to share kthreads among all net namespaces but that would be a step back considering that the current implementation is per net namespace. For the purpose of computing estimates, it does not really matter to which namespace an estimator belongs. This solution is problematic with regards to resource control - cgroups. But from what I have seen, IPVS estimators were always configured in the init net namespace so it would not matter if the kthreads were shared among all net namespaces.
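For illustration, a minimal sketch of the global cap suggested above, assuming a check at kthread-creation time; the helper names and the factor of 4 are only illustrative and not part of the posted patches:
#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/types.h>

/* Global across all net namespaces, as suggested above */
static atomic_t ipvs_est_kt_count = ATOMIC_INIT(0);

/* Hypothetical helper that ip_vs_est_add_kthread() could call before
 * allocating a new kthread context.
 */
static bool ip_vs_est_global_kt_get(void)
{
	int est_max_threads = 4 * num_possible_cpus();

	if (atomic_inc_return(&ipvs_est_kt_count) > est_max_threads) {
		atomic_dec(&ipvs_est_kt_count);
		return false;
	}
	return true;
}

/* Matching release when a kthread context is destroyed */
static void ip_vs_est_global_kt_put(void)
{
	atomic_dec(&ipvs_est_kt_count);
}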
> +
> + est->ktid = -1;
> +
> + /* We prefer this code to be short, kthread 0 will requeue the
> + * estimator to available chain. If tasks are disabled, we
> + * will not allocate much memory, just for kt 0.
> + */
> + ret = 0;
> + if (!ipvs->est_kt_count || !ipvs->est_kt_arr[0])
> + ret = ip_vs_est_add_kthread(ipvs);
> + if (ret >= 0)
> + hlist_add_head(&est->list, &ipvs->est_temp_list);
> + else
> + INIT_HLIST_NODE(&est->list);
> + return ret;
> +}
> +
> +/* Calculate limits for all kthreads */
> +static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
I am not happy about all the dynamic allocation happening in this function, which introduces a reason why the function could fail. A simpler approach would use just the estimators that are currently available on ipvs->est_temp_list and run ip_vs_chain_estimation(&chain) in a loop to reach ntest estimators being processed. The rate estimates would need to be reset after the tests are done. When kthread 0 enters calc phase there may very well be only two estimators on ipvs->est_temp_list. Are there any testing results indicating that newly allocated estimators give different results compared to processing the est_temp_list estimators in a loop?
When est_temp_list estimators are processed in a loop the estimator objects will be cached, possibly even in caches above the last level cache, and the per CPU counters will be cached. Annotated disassembly from perf profiling with the bus-cycles event (tested on the v2 of the patchset) indicates that the majority of time in ip_vs_estimation_kthread() is spent on the first instruction that reads a per CPU counter value, i.e.
hlist_for_each_entry_rcu(e, chain, list) {
	for_each_possible_cpu(i) {
		c = per_cpu_ptr(s->cpustats, i);
		do {
			conns = c->cnt.conns;
the disassembly:
Percent | Source code & Disassembly of kcore for bus-cycles (13731 samples, percent: local period)
: ffffffffc0f34880 <ip_vs_estimation_kthread>:
...
0.91 : ffffffffc0f349ed: cltq
0.00 : ffffffffc0f349ef: mov 0x68(%rbx),%rsi
3.35 : ffffffffc0f349f3: add -0x71856520(,%rax,8),%rsi
1.03 : ffffffffc0f349fb: add (%rsi),%r15
76.52 : ffffffffc0f349fe: add 0x8(%rsi),%r14
4.52 : ffffffffc0f34a02: add 0x10(%rsi),%r13
1.59 : ffffffffc0f34a06: add 0x18(%rsi),%r12
1.44 : ffffffffc0f34a0a: add 0x20(%rsi),%rbp
1.64 : ffffffffc0f34a0e: add $0x1,%ecx
0.47 : ffffffffc0f34a11: xor %r9d,%r9d
0.65 : ffffffffc0f34a14: xor %r8d,%r8d
1.12 : ffffffffc0f34a17: movslq %ecx,%rcx
1.29 : ffffffffc0f34a1a: xor %esi,%esi
0.66 : ffffffffc0f34a1c: mov $0xffffffff8ecd5760,%rdi
0.42 : ffffffffc0f34a23: call 0xffffffff8d75edc0
0.53 : ffffffffc0f34a28: mov -0x3225ec9e(%rip),%edx # 0xffffffff8ecd5d90
0.74 : ffffffffc0f34a2e: mov %eax,%ecx
0.78 : ffffffffc0f34a30: cmp %eax,%edx
0.00 : ffffffffc0f34a32: ja 0xffffffffc0f349ed
The bus-cycles event allows a skid so the high percentage of samples is actually on ffffffffc0f349fb. The performance of instructions reading per CPU counter values strongly depends on node-to-node latency on NUMA machines. The disassembly above is from a test on a 4 NUMA node machine but 2 NUMA node machines give similar results.
Even the currently used solution loads parts of the estimator objects into the cache before it gets to the ip_vs_chain_estimation() run, see comments below. Whether or not est_temp_list estimators can be used depends on whether the node-to-node latency for the per CPU counters on NUMA machines disappears after the first loads.
I ran tests using est_temp_list estimators. I applied this diff over the v4 code:
diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
index e214aa0b3abe..f96fb273a4b3 100644
--- a/net/netfilter/ipvs/ip_vs_est.c
+++ b/net/netfilter/ipvs/ip_vs_est.c
@@ -638,6 +638,11 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
int retry = 0;
int max = 2;
int ret = 1;
+ int nodes = 0;
+
+ hlist_for_each_entry(est, &ipvs->est_temp_list, list)
+ ++nodes;
+ pr_info("calc: nodes %d\n", nodes);
INIT_HLIST_HEAD(&chain);
for (;;) {
@@ -688,7 +693,10 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
}
rcu_read_lock();
t1 = ktime_get();
- ip_vs_chain_estimation(&chain);
+ j = 0;
+ do
+ ip_vs_chain_estimation(&ipvs->est_temp_list);
+ while (++j < ntest / nodes);
t2 = ktime_get();
rcu_read_unlock();
@@ -711,6 +719,8 @@ static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
max = 1;
}
}
+ pr_info("calc: diff %lld ntest %d min_est %d max %d\n",
+ diff, ntest, min_est, max);
/* aim is to test below 100us */
if (diff < 50 * NSEC_PER_USEC)
ntest *= 2;
When using est_temp_list estimators, the kernel log showed:
[ 89.364408][ T493] IPVS: starting estimator thread 0...
[ 89.370467][ T8039] IPVS: calc: nodes 2
[ 89.374824][ T8039] IPVS: calc: diff 4354 ntest 1 min_est 4354 max 21
[ 89.382081][ T8039] IPVS: calc: diff 1125 ntest 2 min_est 562 max 169
[ 89.389329][ T8039] IPVS: calc: diff 2083 ntest 4 min_est 520 max 182
[ 89.396589][ T8039] IPVS: calc: diff 4102 ntest 8 min_est 512 max 185
[ 89.403868][ T8039] IPVS: calc: diff 8381 ntest 16 min_est 512 max 185
[ 89.411288][ T8039] IPVS: calc: diff 16519 ntest 32 min_est 512 max 185
[ 89.418913][ T8039] IPVS: calc: diff 34162 ntest 64 min_est 512 max 185
[ 89.426705][ T8039] IPVS: calc: diff 65121 ntest 128 min_est 508 max 187
[ 89.434238][ T8039] IPVS: calc: chain_max_len=187, single est=508ns, diff=65121, retry=1, ntest=128
[ 89.444494][ T8039] IPVS: dequeue: 492ns
[ 89.448906][ T8039] IPVS: using max 8976 ests per chain, 448800 per kthread
[ 91.308814][ T8039] IPVS: tick time: 5745ns for 64 CPUs, 2 ests, 1 chains, chain_max_len=8976
I added just the second pr_info() to the v4 of the patchset; the kernel log showed:
[ 115.823618][ T491] IPVS: starting estimator thread 0...
[ 115.829696][ T8005] IPVS: calc: diff 1228 ntest 1 min_est 1228 max 77
[ 115.836962][ T8005] IPVS: calc: diff 1391 ntest 2 min_est 695 max 136
[ 115.844220][ T8005] IPVS: calc: diff 2135 ntest 4 min_est 533 max 178
[ 115.851481][ T8005] IPVS: calc: diff 4022 ntest 8 min_est 502 max 189
[ 115.858762][ T8005] IPVS: calc: diff 8017 ntest 16 min_est 501 max 189
[ 115.866185][ T8005] IPVS: calc: diff 15821 ntest 32 min_est 494 max 192
[ 115.873795][ T8005] IPVS: calc: diff 31726 ntest 64 min_est 494 max 192
[ 115.881599][ T8005] IPVS: calc: diff 68796 ntest 128 min_est 494 max 192
[ 115.889133][ T8005] IPVS: calc: chain_max_len=192, single est=494ns, diff=68796, retry=1, ntest=128
[ 115.899363][ T8005] IPVS: dequeue: 245ns
[ 115.903788][ T8005] IPVS: using max 9216 ests per chain, 460800 per kthread
[ 117.767174][ T8005] IPVS: tick time: 6117ns for 64 CPUs, 2 ests, 1 chains, chain_max_len=9216
I rigged the v4 code to iterate in the for loop until krealloc_array() reports an error. This allowed me to record a profile with the bus-cycles event:
[ 4115.532161][ T3537] IPVS: starting estimator thread 0...
[ 4116.969559][ T8126] IPVS: calc: chain_max_len=233, single est=407ns, diff=53822, retry=4095, ntest=128
[ 4117.077102][ T8126] IPVS: dequeue: 760ns
[ 4117.081525][ T8126] IPVS: using max 11184 ests per chain, 559200 per kthread
[ 4119.053120][ T8126] IPVS: tick time: 8406ns for 64 CPUs, 2 ests, 1 chains, chain_max_len=11184
The profile:
# Samples: 6K of event 'bus-cycles'
# Event count (approx.): 35207766
# Overhead Command Shared Object Symbol
26.69% ipvs-e:0:0 [kernel.kallsyms] [k] memset_erms
21.40% ipvs-e:0:0 [kernel.kallsyms] [k] _find_next_bit
11.96% ipvs-e:0:0 [kernel.kallsyms] [k] ip_vs_chain_estimation
6.30% ipvs-e:0:0 [kernel.kallsyms] [k] pcpu_alloc
4.29% ipvs-e:0:0 [kernel.kallsyms] [k] mutex_lock_killable
2.89% ipvs-e:0:0 [kernel.kallsyms] [k] ip_vs_estimation_kthread
2.84% ipvs-e:0:0 [kernel.kallsyms] [k] __slab_free
2.52% ipvs-e:0:0 [kernel.kallsyms] [k] pcpu_next_md_free_region
The disassembly of ip_vs_chain_estimation (not inlined in v4 code):
: ffffffffc0ce3a40 <ip_vs_chain_estimation>:
5.78 : ffffffffc0ce3a87: cltq
0.00 : ffffffffc0ce3a89: mov 0x68(%rbx),%rsi
0.44 : ffffffffc0ce3a8d: add -0x43054520(,%rax,8),%rsi
0.00 : ffffffffc0ce3a95: add (%rsi),%r14
65.20 : ffffffffc0ce3a98: add 0x8(%rsi),%r15
8.89 : ffffffffc0ce3a9c: add 0x10(%rsi),%r13
5.18 : ffffffffc0ce3aa0: add 0x18(%rsi),%r12
6.96 : ffffffffc0ce3aa4: add 0x20(%rsi),%rbp
3.55 : ffffffffc0ce3aa8: add $0x1,%ecx
0.00 : ffffffffc0ce3aab: xor %r9d,%r9d
0.00 : ffffffffc0ce3aae: xor %r8d,%r8d
0.00 : ffffffffc0ce3ab1: movslq %ecx,%rcx
0.30 : ffffffffc0ce3ab4: xor %esi,%esi
0.44 : ffffffffc0ce3ab6: mov $0xffffffffbd4d6420,%rdi
0.44 : ffffffffc0ce3abd: callq 0xffffffffbbf5ef20
0.30 : ffffffffc0ce3ac2: mov -0x380d078(%rip),%edx # 0xffffffffbd4d6a50
0.00 : ffffffffc0ce3ac8: mov %eax,%ecx
0.00 : ffffffffc0ce3aca: cmp %eax,%edx
0.00 : ffffffffc0ce3acc: ja 0xffffffffc0ce3a87
I did the same for my version using est_temp_list estimators:
[ 268.250061][ T3494] IPVS: starting estimator thread 0...
[ 268.256118][ T7983] IPVS: calc: nodes 2
[ 269.656713][ T7983] IPVS: calc: chain_max_len=230, single est=412ns, diff=55492, retry=4095, ntest=128
[ 269.763749][ T7983] IPVS: dequeue: 810ns
[ 269.768171][ T7983] IPVS: using max 11040 ests per chain, 552000 per kthread
[ 271.739763][ T7983] IPVS: tick time: 7376ns for 64 CPUs, 2 ests, 1 chains, chain_max_len=11040
The profile:
# Samples: 6K of event 'bus-cycles'
# Event count (approx.): 34135939
# Overhead Command Shared Object Symbol
26.86% ipvs-e:0:0 [kernel.kallsyms] [k] memset_erms
20.74% ipvs-e:0:0 [kernel.kallsyms] [k] _find_next_bit
12.11% ipvs-e:0:0 [kernel.kallsyms] [k] ip_vs_chain_estimation
6.05% ipvs-e:0:0 [kernel.kallsyms] [k] pcpu_alloc
4.02% ipvs-e:0:0 [kernel.kallsyms] [k] mutex_lock_killable
2.81% ipvs-e:0:0 [kernel.kallsyms] [k] __slab_free
2.63% ipvs-e:0:0 [kernel.kallsyms] [k] ip_vs_estimation_kthread
2.25% ipvs-e:0:0 [kernel.kallsyms] [k] pcpu_next_md_free_region
The disassembly of ip_vs_chain_estimation (not inlined in v4 code):
: ffffffffc075aa40 <ip_vs_chain_estimation>:
4.99 : ffffffffc075aa87: cltq
0.00 : ffffffffc075aa89: mov 0x68(%rbx),%rsi
0.00 : ffffffffc075aa8d: add -0x48054520(,%rax,8),%rsi
0.15 : ffffffffc075aa95: add (%rsi),%r14
65.92 : ffffffffc075aa98: add 0x8(%rsi),%r15
10.11 : ffffffffc075aa9c: add 0x10(%rsi),%r13
6.49 : ffffffffc075aaa0: add 0x18(%rsi),%r12
6.33 : ffffffffc075aaa4: add 0x20(%rsi),%rbp
3.60 : ffffffffc075aaa8: add $0x1,%ecx
0.00 : ffffffffc075aaab: xor %r9d,%r9d
0.00 : ffffffffc075aaae: xor %r8d,%r8d
0.00 : ffffffffc075aab1: movslq %ecx,%rcx
0.15 : ffffffffc075aab4: xor %esi,%esi
0.45 : ffffffffc075aab6: mov $0xffffffffb84d6420,%rdi
0.91 : ffffffffc075aabd: callq 0xffffffffb6f5ef20
0.00 : ffffffffc075aac2: mov -0x8284078(%rip),%edx # 0xffffffffb84d6a50
0.00 : ffffffffc075aac8: mov %eax,%ecx
0.00 : ffffffffc075aaca: cmp %eax,%edx
0.00 : ffffffffc075aacc: ja 0xffffffffc075aa87
In both cases, these are results from a second test. The commands were:
modprobe ip_vs; perf record -e bus-cycles -a sleep 2 & ipvsadm -A -t 10.10.10.1:2000
ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
modprobe ip_vs; perf record -e bus-cycles -a sleep 2 & ipvsadm -A -t 10.10.10.1:2000
The kernel log from the first tests contains a warning printed by krealloc_array() because the requested size exceeds the object size that SLUB is able to allocate.
Both the chain_max_len and the profiles (and instructions taking the most time) from the test using est_temp_list estimators are similar to the test with the v4 code. In other words, there is no observable difference between the test using est_temp_list estimators and allocating new estimators in my tests (the machine has 64 CPUs and 2 NUMA nodes). Allocating new estimators in ip_vs_est_calc_limits() seems unnecessary.
> +{
> + struct ip_vs_stats **arr = NULL, **a, *s;
> + struct ip_vs_estimator *est;
> + int i, j, n = 0, ntest = 1;
> + struct hlist_head chain;
> + bool is_fifo = false;
> + s32 min_est = 0;
> + ktime_t t1, t2;
> + s64 diff, val;
> + int retry = 0;
> + int max = 2;
> + int ret = 1;
> +
> + INIT_HLIST_HEAD(&chain);
> + for (;;) {
> + /* Too much tests? */
> + if (n >= 128)
> + goto out;
> +
> + /* Dequeue old estimators from chain to avoid CPU caching */
> + for (;;) {
> + est = hlist_entry_safe(chain.first,
> + struct ip_vs_estimator,
> + list);
> + if (!est)
> + break;
> + hlist_del_init(&est->list);
Unlinking every estimator seems unnecessary - they are discarded before the function exits.
> + }
> +
> + /* Use only new estimators for test */
> + a = krealloc_array(arr, n + ntest, sizeof(*arr), GFP_KERNEL);
> + if (!a)
> + goto out;
> + arr = a;
> +
> + for (j = 0; j < ntest; j++) {
> + arr[n] = kcalloc(1, sizeof(*arr[n]), GFP_KERNEL);
> + if (!arr[n])
> + goto out;
> + s = arr[n];
> + n++;
> +
> + spin_lock_init(&s->lock);
This statement loads part of the estimator object into the CPU cache.
> + s->cpustats = alloc_percpu(struct ip_vs_cpu_stats);
I am not sure whether part of the allocated object is loaded into the cache of each CPU as a side effect of alloc_percpu(). Assigning to s->cpustats is another store into the estimator object but it probably is the same cache line as s->lock.
> + if (!s->cpustats)
> + goto out;
> + for_each_possible_cpu(i) {
> + struct ip_vs_cpu_stats *cs;
> +
> + cs = per_cpu_ptr(s->cpustats, i);
> + u64_stats_init(&cs->syncp);
This statement is most probably optimized out on 64bit archs so no caching happens here.
> + }
> + hlist_add_head(&s->est.list, &chain);
This statement loads part of the estimator object into the CPU cache. And who knows what the HW prefetcher does because of the accesses to the estimator object.
> + }
> +
> + cond_resched();
> + if (!is_fifo) {
> + is_fifo = true;
> + sched_set_fifo(current);
> + }
> + rcu_read_lock();
I suggest disabling preemption and interrupts on the local CPU. To get the minimal time needed to process an estimator, there is no need for interference from interrupt processing or context switches in this specific part of the code.
> + t1 = ktime_get();
> + ip_vs_chain_estimation(&chain);
> + t2 = ktime_get();
> + rcu_read_unlock();
> +
> + if (!ipvs->enable || kthread_should_stop())
> + goto stop;
> +
> + diff = ktime_to_ns(ktime_sub(t2, t1));
> + if (diff <= 1 || diff >= NSEC_PER_SEC)
What is the reason for the diff <= 1? Is it about the CLOCK_MONOTONIC time source not incrementing?
> + continue;
> + val = diff;
> + do_div(val, ntest);
> + if (!min_est || val < min_est) {
> + min_est = val;
> + /* goal: 95usec per chain */
> + val = 95 * NSEC_PER_USEC;
> + if (val >= min_est) {
> + do_div(val, min_est);
> + max = (int)val;
> + } else {
> + max = 1;
> + }
> + }
> + /* aim is to test below 100us */
> + if (diff < 50 * NSEC_PER_USEC)
> + ntest *= 2;
> + else
> + retry++;
> + /* Do at least 3 large tests to avoid scheduling noise */
> + if (retry >= 3)
> + break;
> + }
> +
> +out:
> + if (is_fifo)
> + sched_set_normal(current, 0);
> + for (;;) {
> + est = hlist_entry_safe(chain.first, struct ip_vs_estimator,
> + list);
> + if (!est)
> + break;
> + hlist_del_init(&est->list);
> + }
> + for (i = 0; i < n; i++) {
> + free_percpu(arr[i]->cpustats);
> + kfree(arr[i]);
> + }
> + kfree(arr);
> + *chain_max_len = max;
> + return ret;
> +
> +stop:
> + ret = 0;
> + goto out;
> +}
> +
> +/* Calculate the parameters and apply them in context of kt #0
> + * ECP: est_calc_phase
> + * ECML: est_chain_max_len
> + * ECP ECML Insert Chain enable Description
> + * ---------------------------------------------------------------------------
> + * 0 0 est_temp_list 0 create kt #0 context
> + * 0 0 est_temp_list 0->1 service added, start kthread #0 task
> + * 0->1 0 est_temp_list 1 kt task #0 started, enters calc phase
> + * 1 0 est_temp_list 1 kt #0: determine est_chain_max_len,
> + * stop tasks, move ests to est_temp_list
> + * and free kd for kthreads 1..last
> + * 1->0 0->N kt chains 1 ests can go to kthreads
> + * 0 N kt chains 1 drain est_temp_list, create new kthread
> + * contexts, start tasks, estimate
> + */
> +static void ip_vs_est_calc_phase(struct netns_ipvs *ipvs)
> +{
> + int genid = atomic_read(&ipvs->est_genid);
> + struct ip_vs_est_tick_data *td;
> + struct ip_vs_est_kt_data *kd;
> + struct ip_vs_estimator *est;
> + struct ip_vs_stats *stats;
> + int chain_max_len;
> + int id, row, cid;
> + bool last, last_td;
> + int step;
> +
> + if (!ip_vs_est_calc_limits(ipvs, &chain_max_len))
> + return;
> +
> + mutex_lock(&__ip_vs_mutex);
> +
> + /* Stop all other tasks, so that we can immediately move the
> + * estimators to est_temp_list without RCU grace period
> + */
> + mutex_lock(&ipvs->est_mutex);
> + for (id = 1; id < ipvs->est_kt_count; id++) {
> + /* netns clean up started, abort */
> + if (!ipvs->enable)
> + goto unlock2;
> + kd = ipvs->est_kt_arr[id];
> + if (!kd)
> + continue;
> + ip_vs_est_kthread_stop(kd);
> + }
> + mutex_unlock(&ipvs->est_mutex);
> +
> + /* Move all estimators to est_temp_list but carefully,
> + * all estimators and kthread data can be released while
> + * we reschedule. Even for kthread 0.
> + */
> + step = 0;
> +
> +next_kt:
> + /* Destroy contexts backwards */
> + id = ipvs->est_kt_count - 1;
> + if (id < 0)
> + goto end_dequeue;
> + kd = ipvs->est_kt_arr[id];
> + if (!kd)
> + goto end_dequeue;
> + /* kt 0 can exist with empty chains */
> + if (!id && kd->est_count <= 1)
> + goto end_dequeue;
> +
> + row = -1;
> +
> +next_row:
> + row++;
> + if (row >= IPVS_EST_NTICKS)
> + goto next_kt;
> + if (!ipvs->enable)
> + goto unlock;
> + td = rcu_dereference_protected(kd->ticks[row], 1);
> + if (!td)
> + goto next_row;
> +
> + cid = 0;
> +
> +walk_chain:
> + if (kthread_should_stop())
> + goto unlock;
> + step++;
> + if (!(step & 63)) {
> + /* Give chance estimators to be added (to est_temp_list)
> + * and deleted (releasing kthread contexts)
> + */
> + mutex_unlock(&__ip_vs_mutex);
> + cond_resched();
> + mutex_lock(&__ip_vs_mutex);
Is there any data backing the decision to cond_resched() here? What non-functional requirements were used to make this design decision?
> +
> + /* Current kt released ? */
> + if (id + 1 != ipvs->est_kt_count)
> + goto next_kt;
> + if (kd != ipvs->est_kt_arr[id])
> + goto end_dequeue;
> + /* Current td released ? */
> + if (td != rcu_dereference_protected(kd->ticks[row], 1))
> + goto next_row;
> + /* No fatal changes on the current kd and td */
> + }
> + est = hlist_entry_safe(td->chains[cid].first, struct ip_vs_estimator,
> + list);
> + if (!est) {
> + cid++;
> + if (cid >= IPVS_EST_TICK_CHAINS)
> + goto next_row;
> + goto walk_chain;
> + }
> + /* We can cheat and increase est_count to protect kt 0 context
> + * from release but we prefer to keep the last estimator
> + */
> + last = kd->est_count <= 1;
> + /* Do not free kt #0 data */
> + if (!id && last)
> + goto end_dequeue;
> + last_td = kd->tick_len[row] <= 1;
> + stats = container_of(est, struct ip_vs_stats, est);
> + ip_vs_stop_estimator(ipvs, stats);
> + /* Tasks are stopped, move without RCU grace period */
> + est->ktid = -1;
> + hlist_add_head(&est->list, &ipvs->est_temp_list);
> + /* kd freed ? */
> + if (last)
> + goto next_kt;
> + /* td freed ? */
> + if (last_td)
> + goto next_row;
> + goto walk_chain;
> +
> +end_dequeue:
> + /* All estimators removed while calculating ? */
> + if (!ipvs->est_kt_count)
> + goto unlock;
> + kd = ipvs->est_kt_arr[0];
> + if (!kd)
> + goto unlock;
> + ipvs->est_chain_max_len = chain_max_len;
> + ip_vs_est_set_params(ipvs, kd);
> +
> + pr_info("using max %d ests per chain, %d per kthread\n",
> + kd->chain_max_len, kd->est_max_count);
> +
> + mutex_lock(&ipvs->est_mutex);
> +
> + /* We completed the calc phase, new calc phase not requested */
> + if (genid == atomic_read(&ipvs->est_genid))
> + ipvs->est_calc_phase = 0;
> +
> +unlock2:
> + mutex_unlock(&ipvs->est_mutex);
> +
> +unlock:
> + mutex_unlock(&__ip_vs_mutex);
> }
--
Jiri Wiesner
SUSE Labs
* Re: [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation
2022-10-01 10:52 ` Jiri Wiesner
@ 2022-10-02 14:12 ` Julian Anastasov
2022-10-04 8:39 ` Jiri Wiesner
0 siblings, 1 reply; 9+ messages in thread
From: Julian Anastasov @ 2022-10-02 14:12 UTC (permalink / raw)
To: Jiri Wiesner; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
Hello,
On Sat, 1 Oct 2022, Jiri Wiesner wrote:
> On Tue, Sep 20, 2022 at 04:53:29PM +0300, Julian Anastasov wrote:
> > +/* Start estimation for stats */
> > +int ip_vs_start_estimator(struct netns_ipvs *ipvs, struct ip_vs_stats *stats)
> > +{
> > + struct ip_vs_estimator *est = &stats->est;
> > + int ret;
> > +
> > + /* Get rlimit only from process that adds service, not from
> > + * net_init/kthread. Clamp limit depending on est->ktid size.
> > + */
> > + if (!ipvs->est_max_threads && ipvs->enable)
> > + ipvs->est_max_threads = min_t(unsigned long,
> > + rlimit(RLIMIT_NPROC), SHRT_MAX);
>
> For example, the user space limit on the number of processes does not hold any useful value on my testing machine:
> # ulimit -u
> 254318
> while /proc/sys/kernel/pid_max is 65536. The pid_max variable itself depends on the number of CPUs on the system. I also think that user space limits should not directly determine how many kthreads can be created by the kernel. By design, one fully loaded kthread will take up 12% of the CPU time on one CPU. On account of the CPU usage it does not make sense to set ipvs->est_max_threads to a value higher than a multiple (4 or less) of the number of possible CPUs in the system. I think the ipvs->est_max_threads value should not allow CPUs to get saturated. Also, kthreads computing IPVS rate estimates could be created in each net namespace on the system, which alone makes it possible to saturate all the CPUs on the system because ipvs->est_max_threads does not take other namespaces into account.
> As for solutions to this problem, I think it would be easiest to implement global counters in ip_vs_est.c (est_kt_count and est_max_threads) that would be tested for the max number of allocated kthreads in ip_vs_est_add_kthread().
> Another possible solution would be to share kthreads among all net namespaces but that would be a step back considering that the current implementation is per net namespace. For the purpose of computing estimates, it does not really matter to which namespace an estimator belongs. This solution is problematic with regards to resource control - cgroups. But from what I have seen, IPVS estimators were always configured in the init net namespace so it would not matter if the kthreads were shared among all net namespaces.
Yes, considering possible cgroups integration I prefer
namespaces to be isolated. So, 4 * cpumask_weight() would be a suitable
ipvs->est_max_threads value. IPVS later can get support for
GENL_UNS_ADMIN_PERM (better netns support) and GFP_KERNEL_ACCOUNT.
In this case, we should somehow control the allocations done
in kthreads.
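As a rough sketch of that clamp in ip_vs_start_estimator(), only the limit changes relative to the quoted code; this is still per netns, so it does not address the cross-namespace concern:
	/* Sketch only: clamp to 4 kthreads per possible CPU instead of
	 * RLIMIT_NPROC.
	 */
	if (!ipvs->est_max_threads && ipvs->enable)
		ipvs->est_max_threads = min_t(unsigned long,
					      4 * cpumask_weight(cpu_possible_mask),
					      SHRT_MAX);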
> > + est->ktid = -1;
> > +
> > + /* We prefer this code to be short, kthread 0 will requeue the
> > + * estimator to available chain. If tasks are disabled, we
> > + * will not allocate much memory, just for kt 0.
> > + */
> > + ret = 0;
> > + if (!ipvs->est_kt_count || !ipvs->est_kt_arr[0])
> > + ret = ip_vs_est_add_kthread(ipvs);
> > + if (ret >= 0)
> > + hlist_add_head(&est->list, &ipvs->est_temp_list);
> > + else
> > + INIT_HLIST_NODE(&est->list);
> > + return ret;
> > +}
> > +
>
> > +/* Calculate limits for all kthreads */
> > +static int ip_vs_est_calc_limits(struct netns_ipvs *ipvs, int *chain_max_len)
>
> I am not happy about all the dynamic allocation happening in this function, which introduces a reason why the function could fail. A simpler approach would use just the estimators that are currently available on ipvs->est_temp_list and run ip_vs_chain_estimation(&chain) in a loop to reach ntest estimators being processed. The rate estimates would need to be reset after the tests are done. When kthread 0 enters calc phase there may very well be only two estimators on ipvs->est_temp_list. Are there any testing results indicating that newly allocated estimators give different results compared to processing the est_temp_list estimators in a loop?
I avoided using the est_temp_list entries because
there can be as few as 2 of them (total + service) and I preferred
tests with more estimators to reduce the effect of rescheduling,
interrupts, etc.
> In both cases, these are results from a second test. The commands were:
> modprobe ip_vs; perf record -e bus-cycles -a sleep 2 & ipvsadm -A -t 10.10.10.1:2000
> ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> modprobe ip_vs; perf record -e bus-cycles -a sleep 2 & ipvsadm -A -t 10.10.10.1:2000
> The kernel log from the first tests contains a warning printed by krealloc_array() because the requested size exceeds the object size that SLUB is able to allocate.
>
> Both the chain_max_len and the profiles (and instructions taking the most time) from the test using est_temp_list estimators are similar to the test with the v4 code. In other words, there is no observable difference between the test using est_temp_list estimators and allocating new estimators in my tests (the machine has 64 CPUs and 2 NUMA nodes). Allocating new estimators in ip_vs_est_calc_limits() seems unnecessary.
OK, so caching effects do not matter. The problem
of using est_temp_list is that ip_vs_chain_estimation()
walks the whole chain. It is less risky to do tests
with an allocated chain of known length, and it was not a
big deal to allocate 128 estimators. If allocation could
fail, we can move 128 entries from est_temp_list to
the test chain and then move them back after the test.
But in any case, if we test with estimators from the
est_temp_list, as we run without any mutex held,
the entries could be deleted while we are testing
them. As a result, we do not know how many estimators
were really tested. More than one test could be needed to be
sure, i.e. the length of the tested temp chain should
not change before/after the test.
And it would be better to call ip_vs_est_calc_limits
after all tasks are stopped and estimators moved to
est_temp_list.
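For illustration, the "borrow from est_temp_list and put back" fallback mentioned above could look roughly like this; a sketch only, and it assumes nothing can unlink the entries concurrently, which is exactly the caveat raised above:
	struct ip_vs_estimator *est;
	struct hlist_node *next;
	HLIST_HEAD(chain);
	int moved = 0;

	/* Borrow up to 128 estimators into a private test chain */
	hlist_for_each_entry_safe(est, next, &ipvs->est_temp_list, list) {
		hlist_del_init(&est->list);
		hlist_add_head(&est->list, &chain);
		if (++moved >= 128)
			break;
	}

	/* ... time ip_vs_chain_estimation(&chain) here ... */

	/* Return the borrowed estimators to est_temp_list */
	hlist_for_each_entry_safe(est, next, &chain, list) {
		hlist_del_init(&est->list);
		hlist_add_head(&est->list, &ipvs->est_temp_list);
	}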
> > + for (;;) {
> > + /* Too much tests? */
> > + if (n >= 128)
> > + goto out;
> > +
> > + /* Dequeue old estimators from chain to avoid CPU caching */
> > + for (;;) {
> > + est = hlist_entry_safe(chain.first,
> > + struct ip_vs_estimator,
> > + list);
> > + if (!est)
> > + break;
> > + hlist_del_init(&est->list);
>
> Unlinking every estimator seems unnecessary - they are discarded before the function exits.
The goal was for tested estimators not to be tested again.
> > + }
> > +
> > + cond_resched();
> > + if (!is_fifo) {
> > + is_fifo = true;
> > + sched_set_fifo(current);
> > + }
> > + rcu_read_lock();
>
> I suggest disabling preemption and interrupts on the local CPU. To get the minimal time need to process an estimator there is no need for interference from interrupt processing or context switches in this specific part of the code.
I preferred not to be so rude to other kthreads in the system.
I hope several tests give a good enough approximation of the
estimation speed.
>
> > + t1 = ktime_get();
> > + ip_vs_chain_estimation(&chain);
> > + t2 = ktime_get();
> > + rcu_read_unlock();
> > +
> > + if (!ipvs->enable || kthread_should_stop())
> > + goto stop;
> > +
> > + diff = ktime_to_ns(ktime_sub(t2, t1));
> > + if (diff <= 1 || diff >= NSEC_PER_SEC)
>
> What is the reason for the diff <= 1? Is it about the CLOCK_MONOTONIC time source not incrementing?
The timer resolution can be low; a longer test should
succeed :)
> > +walk_chain:
> > + if (kthread_should_stop())
> > + goto unlock;
> > + step++;
> > + if (!(step & 63)) {
> > + /* Give chance estimators to be added (to est_temp_list)
> > + * and deleted (releasing kthread contexts)
> > + */
> > + mutex_unlock(&__ip_vs_mutex);
> > + cond_resched();
> > + mutex_lock(&__ip_vs_mutex);
>
> Is there any data backing the decision to cond_resched() here? What non-functional requirement were used to make this design decision?
kt 0 runs in parallel with netlink; we do not want
to delay processes that want to unlink estimators, since
we can be relinking 448800 estimators as in your test.
Regards
--
Julian Anastasov <ja@ssi.bg>
* Re: [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation
2022-10-02 14:12 ` Julian Anastasov
@ 2022-10-04 8:39 ` Jiri Wiesner
0 siblings, 0 replies; 9+ messages in thread
From: Jiri Wiesner @ 2022-10-04 8:39 UTC (permalink / raw)
To: Julian Anastasov; +Cc: Simon Horman, lvs-devel, yunhong-cgl jiang, dust.li
On Sun, Oct 02, 2022 at 05:12:41PM +0300, Julian Anastasov wrote:
> > In both cases, these are results from a second test. The command were:
> > modprobe ip_vs; perf record -e bus-cycles -a sleep 2 & ipvsadm -A -t 10.10.10.1:2000
> > ipvsadm -D -t 10.10.10.1:2000; modprobe -r ip_vs_wlc ip_vs
> > modprobe ip_vs; perf record -e bus-cycles -a sleep 2 & ipvsadm -A -t 10.10.10.1:2000
> > The kernel log from the first tests contains a warning printed by krealloc_array() because the requested size exceeds the object size that SLUB is able to allocate.
> >
> > Both the chain_max_len and the profiles (and instructions taking the most time) from the test using est_temp_list estimators are similar to the test with the v4 code. In other words, there is no observable difference between the test using est_temp_list estimators and allocating new estimators in my tests (the machine has 64 CPUs and 2 NUMA nodes). Allocating new estimators in ip_vs_est_calc_limits() seems unnecessary.
>
> OK, so caching effects do not matter.
I must argue against my own argument: I would not claim exactly that. I would be reluctant to generalize the statement even for modern CPUs manufactured by Intel. Whether or not caching (including fetching a cache line from a different NUMA node) matters depends on a particular CPU implementation and architecture. I only showed that always allocating new estimators and reusing the estimators from the est_temp_list yield similar results. A closer look at the results indicates that the first estimate is almost 4 times larger than the second estimate. My hackish modification caused the ntest 1 run to process 2 estimators while making the algorithm think it was testing only 1 estimator, hence the time diffs obtained from the ntest 1 run and the ntest 2 run should be similar. Apparently, they are not:
> [ 89.364408][ T493] IPVS: starting estimator thread 0...
> [ 89.370467][ T8039] IPVS: calc: nodes 2
> [ 89.374824][ T8039] IPVS: calc: diff 4354 ntest 1 min_est 4354 max 21
> [ 89.382081][ T8039] IPVS: calc: diff 1125 ntest 2 min_est 562 max 169
> [ 89.389329][ T8039] IPVS: calc: diff 2083 ntest 4 min_est 520 max 182
These results could actually be caused by reading cache-cold memory regions. Caching might play a role and the most accurate estimate would be obtained from the very first test. Testing just once and just 1 or 2 estimators (depending on what is available on the est_temp_list) while also switching off interrupts and preemption makes sense to me. The algorithm would be simpler and ip_vs_est_calc_limits() would be done sooner.
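A minimal sketch of that single-shot measurement, assuming it runs in kthread 0 before the other tasks are started; variable names are illustrative:
	unsigned long flags;
	ktime_t t1, t2;
	s64 diff;

	/* Time one pass over est_temp_list with local interrupts off,
	 * which also rules out preemption on this CPU.
	 */
	local_irq_save(flags);
	rcu_read_lock();
	t1 = ktime_get();
	ip_vs_chain_estimation(&ipvs->est_temp_list);
	t2 = ktime_get();
	rcu_read_unlock();
	local_irq_restore(flags);

	diff = ktime_to_ns(ktime_sub(t2, t1));
	/* Per-estimator cost is roughly diff divided by the number of
	 * estimators currently on est_temp_list (1 or 2 here).
	 */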
> The problem
> of using est_temp_list is that ip_vs_chain_estimation()
> walks the whole chain. It is less risky to do tests
> with allocated chain with known length and it was not a
> big deal to allocate 128 estimators. If allocation could
> fail, we can move 128 entries from est_temp_list to
> the test chain and then to move them back after the test.
> But in any case, if we test with estimators from a
> est_temp_list, as we run without any locked mutex
> the entries could be deleted while we are testing
> them. As result, we do not know how many estimators
> were really tested. More than one test could be needed for
> sure, i.e. the length of the tested temp chain should
> not change before/after the test.
I see that leaving the algorithm as it is and only substituting newly allocated estimators with estimators from the est_temp_list brings more problems than it solves.
> And it would be better to call ip_vs_est_calc_limits
> after all tasks are stopped and estimators moved to
> est_temp_list.
>
> > > + for (;;) {
> > > + /* Too much tests? */
> > > + if (n >= 128)
> > > + goto out;
> > > +
> > > + /* Dequeue old estimators from chain to avoid CPU caching */
> > > + for (;;) {
> > > + est = hlist_entry_safe(chain.first,
> > > + struct ip_vs_estimator,
> > > + list);
> > > + if (!est)
> > > + break;
> > > + hlist_del_init(&est->list);
> >
> > Unlinking every estimator seems unnecessary - they are discarded before the function exits.
>
> The goal was tested estimators to not be tested again.
The usual approach is to initialize the head and leave the list as it is (since an array holds the pointers, which will be used for deallocation) or splice it onto a different head.
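In code terms, a sketch of that simpler reset; the arr[] array from the posted patch still holds the pointers for the later free pass:
	/* Drop the whole test chain in one step instead of unlinking
	 * node by node; the estimators remain reachable through arr[].
	 */
	INIT_HLIST_HEAD(&chain);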
> > > + }
> > > +
> > > + cond_resched();
> > > + if (!is_fifo) {
> > > + is_fifo = true;
> > > + sched_set_fifo(current);
> > > + }
> > > + rcu_read_lock();
> >
> > I suggest disabling preemption and interrupts on the local CPU. To get the minimal time need to process an estimator there is no need for interference from interrupt processing or context switches in this specific part of the code.
>
> I preferred not to be so rude to other kthreads in the system.
> I hope several tests give enough approximation for the
> estimation speed.
The way in which the measurement is carried out depends on what value is expected to be measured and how that value is used. There is a catch when it comes to caching, as described above.
1. The value being measured could be the minimal time needed to process an estimator without including the time needed for interrupt processing. As it is written now, the algorithm determines approximately this value despite preemption and interrupts being enabled, which will result in some noise. The result of this approach is that a fully loaded kthread reached more than 50% of CPU utilization on my testing system (2 NUMA nodes, 64 CPUs):
> j=30; time for ((i=0; i < 500000; i++)); do p=$((i % 60000)); if [ $p -eq 0 ]; then j=$((j + 1)); echo 10.10.10.$j; fi; ipvsadm -A -t 10.10.10.$j:$((2000+$p)); done
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 7993 root 20 0 0 0 0 R 52.99 0.000 6:16.24 ipvs-e:0:0
> 12913 root 20 0 0 0 0 I 5.090 0.000 0:15.31 ipvs-e:0:1
The debugging kernel printed while estimators were being added:
> [ 168.667468] IPVS: starting estimator thread 0...
> [ 168.674322] IPVS: calc: chain_max_len=191, single est=496ns, diff=64965, retry=1, ntest=128
> [ 168.684550] IPVS: dequeue: 293ns
> [ 168.688965] IPVS: using max 9168 ests per chain, 458400 per kthread
> [ 170.613320] IPVS: tick time: 1676379ns for 64 CPUs, 1392 ests, 1 chains, chain_max_len=9168
> [ 234.634576] IPVS: tick time: 18958330ns for 64 CPUs, 9168 ests, 1 chains, chain_max_len=9168
> ...
> [ 1770.630740] IPVS: tick time: 19127043ns for 64 CPUs, 9168 ests, 1 chains, chain_max_len=9168
This is more than 4 times the expected CPU utilization - 12%. So, the minimal time needed to process an estimator would need to be applied in a different way - there would have to be other parameters to scale it to determine the chain length that yields 12% CPU utilization. (A rough check of these numbers follows after this list.)
2. The value being measured would include interference from interrupt processing but not from context switching (due to preemption being disabled). In this case, a sliding average could be computed or something along these lines.
3. The value being measured would include interference from interrupt processing and from context switching. I cannot imagine a workable approach to use this.
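As promised above, a rough check of the logged numbers. This is plain userspace arithmetic for illustration only, not kernel code; the 40 ms tick period is the one implied by a 2-second round split into 50 ticks (458400 / 9168):
#include <stdio.h>

int main(void)
{
	/* Values taken from the kernel log above */
	double min_ns_per_est = 496.0;        /* "single est=496ns" */
	double ests_per_chain = 9168.0;       /* "max 9168 ests per chain" */
	double tick_ns_observed = 18958330.0; /* "tick time: 18958330ns" */
	double tick_period_ns = 40e6;         /* 2 s round / 50 ticks */

	printf("expected: %.1f ms per tick (%.0f%% CPU)\n",
	       min_ns_per_est * ests_per_chain / 1e6,
	       100.0 * min_ns_per_est * ests_per_chain / tick_period_ns);
	printf("observed: %.1f ms per tick (%.0f%% CPU)\n",
	       tick_ns_observed / 1e6,
	       100.0 * tick_ns_observed / tick_period_ns);
	printf("real cost per estimator: %.0f ns (%.1fx the minimum)\n",
	       tick_ns_observed / ests_per_chain,
	       tick_ns_observed / ests_per_chain / min_ns_per_est);
	return 0;
}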
> > > +walk_chain:
> > > + if (kthread_should_stop())
> > > + goto unlock;
> > > + step++;
> > > + if (!(step & 63)) {
> > > + /* Give chance estimators to be added (to est_temp_list)
> > > + * and deleted (releasing kthread contexts)
> > > + */
> > > + mutex_unlock(&__ip_vs_mutex);
> > > + cond_resched();
> > > + mutex_lock(&__ip_vs_mutex);
> >
> > Is there any data backing the decision to cond_resched() here? What non-functional requirement were used to make this design decision?
>
> kt 0 runs in parallel with netlink, we do not want
> to delay such processes that want to unlink estimators,
> we can be relinking 448800 estimators as in your test.
I commented out the cond_resched() and the locking statement around it, fully loaded a kthread and this was the result:
> [ 5060.214676] IPVS: starting estimator thread 0...
> [ 5060.222050] IPVS: calc: chain_max_len=144, single est=656ns, diff=91657, retry=1, ntest=128
> [ 5060.318628] IPVS: dequeue: 86284729ns
> [ 5060.323527] IPVS: using max 6912 ests per chain, 345600 per kthread
86 milliseconds is far too long, which justifies the cond_resched() and the additional trouble it brings.
--
Jiri Wiesner
SUSE Labs
Thread overview: 9+ messages
2022-09-20 13:53 [RFC PATCHv4 0/5] ipvs: Use kthreads for stats Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 1/5] ipvs: add rcu protection to stats Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 2/5] ipvs: use kthreads for stats estimation Julian Anastasov
2022-10-01 10:52 ` Jiri Wiesner
2022-10-02 14:12 ` Julian Anastasov
2022-10-04 8:39 ` Jiri Wiesner
2022-09-20 13:53 ` [RFC PATCHv4 3/5] ipvs: add est_cpulist and est_nice sysctl vars Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 4/5] ipvs: run_estimation should control the kthread tasks Julian Anastasov
2022-09-20 13:53 ` [RFC PATCHv4 5/5] ipvs: debug the tick time Julian Anastasov