From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>,
lvs-devel@vger.kernel.org, netfilter-devel@vger.kernel.org,
Dust Li <dust.li@linux.alibaba.com>,
Jiejian Wu <jiejian@linux.alibaba.com>,
rcu@vger.kernel.org
Subject: Re: [PATCHv6 net-next 00/14] ipvs: per-net tables and optimizations
Date: Mon, 24 Nov 2025 22:46:49 +0100
Message-ID: <aSTSSTFT9j5eldzv@calendula>
In-Reply-To: <20251019155711.67609-1-ja@ssi.bg>

Hi Julian,

This is v6 and you have worked hard on this, and I am coming late to
the review... but may I suggest splitting this series?

As I understand it, the series contains initial preparation patches,
including improvements, that could be applied first, before the
per-netns support.

Then, follow up with the initial basic per-netns conversion.

Finally, pursue the more advanced data structures / optimizations.

If this is too extreme or a deal breaker, let me know.

Thanks a lot for your work on IPVS.

On Sun, Oct 19, 2025 at 06:56:57PM +0300, Julian Anastasov wrote:
> Hello,
>
> This patchset targets more netns isolation when IPVS
> is used in large setups and also includes some optimizations.
>
> First patch adds useful wrappers to rculist_bl, the
> hlist_bl methods IPVS will use in the following patches. The other
> patches are IPVS-specific.
>
> The following patches will:
>
> * Convert the global __ip_vs_mutex to per-net service_mutex and
> switch the service tables to be per-net, cowork by Jiejian Wu and
> Dust Li
>
> * Convert some code that walks the service lists to use RCU instead of
> the service_mutex
>
> * We used two tables for services (non-fwmark and fwmark), merge them
> into single svc_table
>
> * The list for unavailable destinations (dest_trash) holds dsts and
> thus dev references causing extra work for the ip_vs_dst_event() dev
> notifier handler. Change this by dropping the reference when dest
> is removed and saved into dest_trash. The dest_trash will need more
> changes to make it light for lookups. TODO.
>
> * On a new connection we can do multiple lookups for services by trying
> different fallback options. Add more counters for service types, so
> that we can avoid unneeded lookups for services.
>
> * Add infrastructure for resizable hash tables based on hlist_bl
> which we will use for services and connections: hlists with
> per-bucket bit lock in the heads. The resizing delays RCU lookups
> on a bucket level with seqcounts which are protected with spin locks.
> The entries keep the table ID and the hash value which allows to
> filter the entries without touching many cache lines and to
> unlink the entries without lookup by keys.
>
> * Change the 256-bucket service hash table to be resizable in the
> range of 4..20 bits depending on the added services and use jhash
> hashing to reduce the collisions.
>
> * Change the global connection table to be per-net and resizable
> in the range of 256..ip_vs_conn_tab_size. As the connections are
> hashed by using remote addresses and ports, use siphash instead
> of jhash for better security.
>
> * As the connection table no longer has a fixed size, show its current
> size to user space
>
> * As the connection table is not global anymore, the no_cport and
> dropentry counters can be per-net
>
> * Make the connection hashing more secure for setups with multiple
> services. Hashing only by remote address and port (client info)
> is not enough. To reduce the possible hash collisions add the
> used virtual address/port (local info) into the hash and as a side
> effect the MASQ connections will be double hashed into the
> hash table to match the traffic from real servers:
> OLD:
> - all methods: c_list node: proto, caddr:cport
> NEW:
> - all methods: hn0 node (dir 0): proto, caddr:cport -> vaddr:vport
> - MASQ method: hn1 node (dir 1): proto, daddr:dport -> caddr:cport
>
> * Add /proc/net/ip_vs_status to show current state of IPVS, per-net
>
> cat /proc/net/ip_vs_status
> Conns: 9401
> Conn buckets: 524288 (19 bits, lfactor -5)
> Conn buckets empty: 505633 (96%)
> Conn buckets len-1: 18322 (98%)
> Conn buckets len-2: 329 (1%)
> Conn buckets len-3: 3 (0%)
> Conn buckets len-4: 1 (0%)
> Services: 12
> Service buckets: 128 (7 bits, lfactor -3)
> Service buckets empty: 116 (90%)
> Service buckets len-1: 12 (100%)
> Stats thread slots: 1 (max 16)
> Stats chain max len: 16
> Stats thread ests: 38400
>
> It shows the table size, the load factor (2^n), the number of empty
> buckets as a percentage of all buckets, and the number of buckets
> with length 1..7, where len-7 catches all len>=7 (zero values are
> not shown). The len-N percentages ignore the empty buckets, so they
> are relative to all non-empty buckets. It shows that a smaller lfactor
> is needed for len-1 buckets to reach ~98%. Only real tests can
> show whether relying on len-1 buckets is the better option, because the
> hash table becomes too large with many connections. And as
> every table uses a random key, the services may not avoid collisions
> in all cases.
>
> * add conn_lfactor and svc_lfactor sysctl vars, so that one can tune
> the connection/service hash table sizing
>
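As a cross-check of the percentages in the sample ip_vs_status output
quoted above, here is a minimal Python sketch. It assumes the
percentages are plain truncating integer divisions, which is inferred
from the sample numbers and not stated in the cover letter. The helper
name bucket_percents is made up for illustration.

```python
def bucket_percents(total_buckets, empty_buckets, len_counts):
    """Return (empty_pct, {chain_length: pct}) for one hash table.

    Assumption: truncating integer division, inferred from the sample.
    """
    # Empty buckets are reported relative to all buckets.
    empty_pct = empty_buckets * 100 // total_buckets
    # len-N buckets are reported relative to the non-empty buckets only.
    non_empty = total_buckets - empty_buckets
    if non_empty == 0:
        return empty_pct, {}
    len_pct = {n: count * 100 // non_empty for n, count in len_counts.items()}
    return empty_pct, len_pct


# Numbers copied from the sample output in the cover letter.
conn = bucket_percents(524288, 505633, {1: 18322, 2: 329, 3: 3, 4: 1})
svc = bucket_percents(128, 116, {1: 12})
print(conn)
print(svc)
```

With truncation, the len-2 line comes out as 1% (329 of 18655 non-empty
buckets), which matches the sample; rounding to nearest would give 2%.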
> Links to downloadable patchset versions:
> v6 (19 Oct 2025):
> https://ja.ssi.bg/tmp/rht_v6.tgz
>
> v5 (16 Sep 2024):
> https://ja.ssi.bg/tmp/rht_v5.tgz
>
> v4 (28 May 2024):
> https://ja.ssi.bg/tmp/rht_v4.tgz
>
> v3 (31 Mar 2024):
> https://ja.ssi.bg/tmp/rht_v3.tgz
>
> v2 (12 Dec 2023):
> https://ja.ssi.bg/tmp/rht_v2.tgz
>
> v1 (15 Aug 2023):
> https://ja.ssi.bg/tmp/rht_v1.tgz
>
> Changes in v6:
> Patch 5:
> * resync
> Patch 8:
> * resync: use READ_ONCE for ipvs->enable
> * resync: use %zu for size_t
> Patch 9:
> * resync: use the new skip_elems value
> * resync: use READ_ONCE for ipvs->enable
> Patch 12:
> * resync: use the new skip_elems value
>
> Changes in v5:
> Patch 6:
> * resync with changes in main tree (6.11)
> Patch 8:
> * resync with changes in main tree (6.11)
> Patch 9:
> * resync with changes in main tree (6.11)
> Patch 14:
> * resync with changes in main tree (6.11)
>
> Changes in v4:
> Patch 14:
> * the load factor parameters will be read-only for unprivileged
> namespaces while we do not account the allocated memory
> Patch 5:
> * resync with changes in main tree
>
> Changes in v3:
> Patch 7:
> * change the sign of the load factor parameter, so that
> 2^lfactor = load/size
> Patch 8:
> * change the sign of the load factor parameter
> * fix 'goto unlock_sem' in svc_resize_work_handler() after the last
> mutex_trylock() call, should be goto unlock_m
> * now cond_resched_rcu() needs to include linux/rcupdate_wait.h
> Patch 9:
> * consider that the sign of the load factor parameter is changed
> Patch 12:
> * consider that the sign of the load factor parameter is changed
> Patch 14:
> * change the sign of the load factor parameters in docs
>
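To make the sign convention concrete: with 2^lfactor = load/size, a
negative lfactor asks for more buckets than entries. The Python sketch
below picks the smallest power-of-two table that meets the target load.
The helper name table_bits, the round-up policy and the upper limit
used for the connection table are assumptions for illustration, not
taken from the patches. They do reproduce the bucket counts in the
sample ip_vs_status output (9401 conns at lfactor -5, 12 services at
lfactor -3).

```python
def table_bits(load, lfactor, min_bits, max_bits):
    """Smallest table size in bits such that load/size <= 2**lfactor.

    Assumptions: power-of-two sizes, round up, clamp to [min_bits, max_bits].
    """
    if lfactor < 0:
        # Negative lfactor: 2**-lfactor buckets per entry.
        target = load << -lfactor
    else:
        # Non-negative lfactor: 2**lfactor entries per bucket, rounded up.
        target = (load + (1 << lfactor) - 1) >> lfactor
    # ceil(log2(target)); yields 0 when target <= 1.
    bits = max(target - 1, 0).bit_length()
    return min(max(bits, min_bits), max_bits)


# 256 buckets (8 bits) is the stated minimum for connections; the
# 20-bit maximum stands in for ip_vs_conn_tab_size and is assumed.
conn_bits = table_bits(9401, -5, 8, 20)
# The service table range of 4..20 bits comes from the cover letter.
svc_bits = table_bits(12, -3, 4, 20)
print(conn_bits, 1 << conn_bits)
print(svc_bits, 1 << svc_bits)
```

Here 9401 * 32 = 300832 rounds up to 2^19 = 524288 buckets, and
12 * 8 = 96 rounds up to 2^7 = 128 buckets.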
> Changes in v2:
> Patch 1:
> * add comments to hlist_bl_for_each_entry_continue_rcu and fix
> sparse warnings
> Patch 9:
> * Simon Kirby reports that backup server crashes if conn_tab is not
> created. Create it just to sync conns before any services are added.
> Patch 11:
> * kernel test robot reported a dropentry_counters problem when
> compiling with !CONFIG_SYSCTL, so it is time to wrap todrop_entry,
> ip_vs_conn_ops_mode and ip_vs_random_dropentry under CONFIG_SYSCTL
> Patch 13:
> * remove extra old_gen assignment at start of ip_vs_status_show()
>
> Jiejian Wu (1):
> ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns
>
> Julian Anastasov (13):
> rculist_bl: add hlist_bl_for_each_entry_continue_rcu
> ipvs: some service readers can use RCU
> ipvs: use single svc table
> ipvs: do not keep dest_dst after dest is removed
> ipvs: use more counters to avoid service lookups
> ipvs: add resizable hash tables
> ipvs: use resizable hash table for services
> ipvs: switch to per-net connection table
> ipvs: show the current conn_tab size to users
> ipvs: no_cport and dropentry counters can be per-net
> ipvs: use more keys for connection hashing
> ipvs: add ip_vs_status info
> ipvs: add conn_lfactor and svc_lfactor sysctl vars
>
> Documentation/networking/ipvs-sysctl.rst | 33 +
> include/linux/rculist_bl.h | 49 +-
> include/net/ip_vs.h | 395 ++++++-
> net/netfilter/ipvs/ip_vs_conn.c | 1052 +++++++++++++-----
> net/netfilter/ipvs/ip_vs_core.c | 177 +++-
> net/netfilter/ipvs/ip_vs_ctl.c | 1232 ++++++++++++++++------
> net/netfilter/ipvs/ip_vs_est.c | 18 +-
> net/netfilter/ipvs/ip_vs_pe_sip.c | 4 +-
> net/netfilter/ipvs/ip_vs_sync.c | 23 +
> net/netfilter/ipvs/ip_vs_xmit.c | 39 +-
> 10 files changed, 2340 insertions(+), 682 deletions(-)
>
> --
> 2.51.0
>
>
>
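For the connection hashing change in the quoted cover letter (hn0 and
hn1 nodes), a rough Python sketch of the keying may help. It is only an
illustration: keyed BLAKE2b stands in for siphash, which the Python
standard library does not expose, and the function names, the key and
the addresses are made up. The field order follows the OLD/NEW listing
in the cover letter.

```python
import hashlib
import struct
from ipaddress import ip_address


def keyed_hash(key, proto, saddr, sport, daddr, dport):
    """Keyed 64-bit hash over (proto, src addr:port -> dst addr:port).

    Stand-in only: keyed BLAKE2b is used here instead of siphash.
    """
    data = struct.pack("!B", proto)
    data += ip_address(saddr).packed + struct.pack("!H", sport)
    data += ip_address(daddr).packed + struct.pack("!H", dport)
    digest = hashlib.blake2b(data, key=key, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def conn_nodes(key, conn, masq):
    """Hash values of the nodes one connection occupies in the table."""
    # hn0 (dir 0): client -> virtual service, used by all methods.
    nodes = [keyed_hash(key, conn["proto"], conn["caddr"], conn["cport"],
                        conn["vaddr"], conn["vport"])]
    if masq:
        # hn1 (dir 1): real server -> client, so replies hash directly.
        nodes.append(keyed_hash(key, conn["proto"], conn["daddr"],
                                conn["dport"], conn["caddr"], conn["cport"]))
    return nodes


key = b"per-table random key"  # hypothetical key
conn = {"proto": 6,
        "caddr": "198.51.100.7", "cport": 40000,   # client
        "vaddr": "203.0.113.1", "vport": 80,       # virtual service
        "daddr": "10.0.0.5", "dport": 8080}        # real server
hn0, hn1 = conn_nodes(key, conn, masq=True)
# A reply from the real server has src=daddr:dport, dst=caddr:cport.
reply_hash = keyed_hash(key, 6, "10.0.0.5", 8080, "198.51.100.7", 40000)
```

Hashing a packet by its own source and destination then finds hn0 for
client traffic and hn1 for MASQ replies, while non-MASQ connections
occupy a single node.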