netfilter-devel.vger.kernel.org archive mirror
From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Julian Anastasov <ja@ssi.bg>
Cc: Simon Horman <horms@verge.net.au>,
	lvs-devel@vger.kernel.org, netfilter-devel@vger.kernel.org,
	Dust Li <dust.li@linux.alibaba.com>,
	Jiejian Wu <jiejian@linux.alibaba.com>,
	rcu@vger.kernel.org
Subject: Re: [PATCHv6 net-next 00/14] ipvs: per-net tables and optimizations
Date: Mon, 24 Nov 2025 22:46:49 +0100	[thread overview]
Message-ID: <aSTSSTFT9j5eldzv@calendula> (raw)
In-Reply-To: <20251019155711.67609-1-ja@ssi.bg>

Hi Julian,

This is v6 and you have worked hard on it, and I am coming late to the
review... but may I suggest splitting this series?

From my understanding, this series starts with preparation patches,
including improvements, that could be applied before the per-netns
support.

Then, follow up with initial basic per-netns conversion.

Finally, pursue more advanced datastructures / optimizations.

If this is too extreme or a deal breaker, let me know.

Thanks a lot for your work on IPVS.

On Sun, Oct 19, 2025 at 06:56:57PM +0300, Julian Anastasov wrote:
> 	Hello,
> 
> 	This patchset targets more netns isolation when IPVS
> is used in large setups and also includes some optimizations.
> 
> 	The first patch adds useful wrappers to rculist_bl, the
> hlist_bl methods that IPVS will use in the following patches. The other
> patches are IPVS-specific.
> 
> 	The following patches will:
> 
> * Convert the global __ip_vs_mutex to a per-net service_mutex and
>   switch the service tables to be per-net, joint work with Jiejian Wu
>   and Dust Li
> 
> * Convert some code that walks the service lists to use RCU instead of
>   the service_mutex
> 
> * We used two tables for services (non-fwmark and fwmark); merge them
>   into a single svc_table
> 
> * The list for unavailable destinations (dest_trash) holds dsts and
>   thus dev references, causing extra work for the ip_vs_dst_event() dev
>   notifier handler. Change this by dropping the reference when a dest
>   is removed and saved into dest_trash. The dest_trash will need more
>   changes to make it light for lookups. TODO.
> 
> * On a new connection we can do multiple service lookups by trying
>   different fallback options. Add more counters for service types, so
>   that we can avoid unneeded service lookups.
> 
> * Add infrastructure for resizable hash tables based on hlist_bl,
>   which we will use for services and connections: hlists with a
>   per-bucket bit lock in the heads. The resizing delays RCU lookups
>   at the bucket level with seqcounts which are protected by spin locks.
>   The entries keep the table ID and the hash value, which allows
>   filtering the entries without touching many cache lines and
>   unlinking the entries without a lookup by keys.
> 
> * Change the 256-bucket service hash table to be resizable in the
>   range of 4..20 bits depending on the added services and use jhash
>   hashing to reduce the collisions.
> 
> * Change the global connection table to be per-net and resizable
>   in the range of 256..ip_vs_conn_tab_size. As the connections are
>   hashed by using remote addresses and ports, use siphash instead
>   of jhash for better security.
> 
> * As the connection table no longer has a fixed size, show its current
>   size to user space
> 
> * As the connection table is not global anymore, the no_cport and
>   dropentry counters can be per-net
> 
> * Make the connection hashing more secure for setups with multiple
>   services. Hashing only by remote address and port (client info)
>   is not enough. To reduce the possible hash collisions, add the
>   used virtual address/port (local info) into the hash; as a side
>   effect, the MASQ connections will be double-hashed into the
>   hash table to match the traffic from real servers:
>     OLD:
>     - all methods: c_list node: proto, caddr:cport
>     NEW:
>     - all methods: hn0 node (dir 0): proto, caddr:cport -> vaddr:vport
>     - MASQ method: hn1 node (dir 1): proto, daddr:dport -> caddr:cport
> 
> * Add /proc/net/ip_vs_status to show current state of IPVS, per-net
> 
> cat /proc/net/ip_vs_status
> Conns:	9401
> Conn buckets:	524288 (19 bits, lfactor -5)
> Conn buckets empty:	505633 (96%)
> Conn buckets len-1:	18322 (98%)
> Conn buckets len-2:	329 (1%)
> Conn buckets len-3:	3 (0%)
> Conn buckets len-4:	1 (0%)
> Services:	12
> Service buckets:	128 (7 bits, lfactor -3)
> Service buckets empty:	116 (90%)
> Service buckets len-1:	12 (100%)
> Stats thread slots:	1 (max 16)
> Stats chain max len:	16
> Stats thread ests:	38400
> 
> It shows the table size, the load factor (2^n), how many buckets are
> empty (as a percentage of all buckets), and the number of buckets
> with length 1..7, where len-7 catches all len>=7 (zero values are
> not shown). The len-N percentages ignore the empty buckets, so they
> are relative among the non-empty buckets. It shows that a smaller
> lfactor is needed for the len-1 buckets to reach ~98%. Only real tests
> can show whether relying on len-1 buckets is a better option, because
> the hash table becomes too large with many connections. And as
> every table uses a random key, the services may not avoid collisions
> in all cases.
> 
> * Add conn_lfactor and svc_lfactor sysctl vars, so that one can tune
>   the connection/service hash table sizing
> 
> Links to downloadable patchset versions:
> v6 (19 Oct 2025):
> https://ja.ssi.bg/tmp/rht_v6.tgz
> 
> v5 (16 Sep 2024):
> https://ja.ssi.bg/tmp/rht_v5.tgz
> 
> v4 (28 May 2024):
> https://ja.ssi.bg/tmp/rht_v4.tgz
> 
> v3 (31 Mar 2024):
> https://ja.ssi.bg/tmp/rht_v3.tgz
> 
> v2 (12 Dec 2023):
> https://ja.ssi.bg/tmp/rht_v2.tgz
> 
> v1 (15 Aug 2023):
> https://ja.ssi.bg/tmp/rht_v1.tgz
> 
> Changes in v6:
> Patch 5:
> * resync
> Patch 8:
> * resync: use READ_ONCE for ipvs->enable
> * resync: use %zu for size_t
> Patch 9:
> * resync: use the new skip_elems value
> * resync: use READ_ONCE for ipvs->enable
> Patch 12:
> * resync: use the new skip_elems value
> 
> Changes in v5:
> Patch 6:
> * resync with changes in main tree (6.11)
> Patch 8:
> * resync with changes in main tree (6.11)
> Patch 9:
> * resync with changes in main tree (6.11)
> Patch 14:
> * resync with changes in main tree (6.11)
> 
> Changes in v4:
> Patch 14:
> * the load factor parameters will be read-only for unprivileged
>   namespaces while we do not account the allocated memory
> Patch 5:
> * resync with changes in main tree
> 
> Changes in v3:
> Patch 7:
> * change the sign of the load factor parameter, so that
>   2^lfactor = load/size
> Patch 8:
> * change the sign of the load factor parameter
> * fix 'goto unlock_sem' in svc_resize_work_handler() after the last
>   mutex_trylock() call, should be goto unlock_m
> * now cond_resched_rcu() needs to include linux/rcupdate_wait.h
> Patch 9:
> * consider that the sign of the load factor parameter is changed
> Patch 12:
> * consider that the sign of the load factor parameter is changed
> Patch 14:
> * change the sign of the load factor parameters in docs
> 
> Changes in v2:
> Patch 1:
> * add comments to hlist_bl_for_each_entry_continue_rcu and fix
>   sparse warnings
> Patch 9:
> * Simon Kirby reports that the backup server crashes if conn_tab is not
>   created. Create it just to sync conns before any services are added.
> Patch 11:
> * kernel test robot reported a dropentry_counters problem when
>   compiling with !CONFIG_SYSCTL, so it is time to wrap todrop_entry,
>   ip_vs_conn_ops_mode and ip_vs_random_dropentry under CONFIG_SYSCTL
> Patch 13:
> * remove extra old_gen assignment at start of ip_vs_status_show()
> 
> Jiejian Wu (1):
>   ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns
> 
> Julian Anastasov (13):
>   rculist_bl: add hlist_bl_for_each_entry_continue_rcu
>   ipvs: some service readers can use RCU
>   ipvs: use single svc table
>   ipvs: do not keep dest_dst after dest is removed
>   ipvs: use more counters to avoid service lookups
>   ipvs: add resizable hash tables
>   ipvs: use resizable hash table for services
>   ipvs: switch to per-net connection table
>   ipvs: show the current conn_tab size to users
>   ipvs: no_cport and dropentry counters can be per-net
>   ipvs: use more keys for connection hashing
>   ipvs: add ip_vs_status info
>   ipvs: add conn_lfactor and svc_lfactor sysctl vars
> 
>  Documentation/networking/ipvs-sysctl.rst |   33 +
>  include/linux/rculist_bl.h               |   49 +-
>  include/net/ip_vs.h                      |  395 ++++++-
>  net/netfilter/ipvs/ip_vs_conn.c          | 1052 +++++++++++++-----
>  net/netfilter/ipvs/ip_vs_core.c          |  177 +++-
>  net/netfilter/ipvs/ip_vs_ctl.c           | 1232 ++++++++++++++++------
>  net/netfilter/ipvs/ip_vs_est.c           |   18 +-
>  net/netfilter/ipvs/ip_vs_pe_sip.c        |    4 +-
>  net/netfilter/ipvs/ip_vs_sync.c          |   23 +
>  net/netfilter/ipvs/ip_vs_xmit.c          |   39 +-
>  10 files changed, 2340 insertions(+), 682 deletions(-)
> 
> -- 
> 2.51.0
> 
> 
> 


Thread overview: 25+ messages
2025-10-19 15:56 [PATCHv6 net-next 00/14] ipvs: per-net tables and optimizations Julian Anastasov
2025-10-19 15:56 ` [PATCHv6 net-next 01/14] rculist_bl: add hlist_bl_for_each_entry_continue_rcu Julian Anastasov
2025-10-23 11:44   ` Florian Westphal
2025-10-23 13:33     ` Julian Anastasov
2025-10-19 15:56 ` [PATCHv6 net-next 02/14] ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns Julian Anastasov
2025-10-19 15:57 ` [PATCHv6 net-next 03/14] ipvs: some service readers can use RCU Julian Anastasov
2025-10-24  2:21   ` Dust Li
2025-11-24 21:00   ` Pablo Neira Ayuso
2025-10-19 15:57 ` [PATCHv6 net-next 04/14] ipvs: use single svc table Julian Anastasov
2025-11-24 21:07   ` Pablo Neira Ayuso
2025-10-19 15:57 ` [PATCHv6 net-next 05/14] ipvs: do not keep dest_dst after dest is removed Julian Anastasov
2025-10-19 15:57 ` [PATCHv6 net-next 06/14] ipvs: use more counters to avoid service lookups Julian Anastasov
2025-10-19 15:57 ` [PATCHv6 net-next 07/14] ipvs: add resizable hash tables Julian Anastasov
2025-11-24 21:16   ` Pablo Neira Ayuso
2025-10-19 15:57 ` [PATCHv6 net-next 08/14] ipvs: use resizable hash table for services Julian Anastasov
2025-10-19 15:57 ` [PATCHv6 net-next 09/14] ipvs: switch to per-net connection table Julian Anastasov
2025-10-19 15:57 ` [PATCHv6 net-next 10/14] ipvs: show the current conn_tab size to users Julian Anastasov
2025-11-24 21:21   ` Pablo Neira Ayuso
2025-10-19 15:57 ` [PATCHv6 net-next 11/14] ipvs: no_cport and dropentry counters can be per-net Julian Anastasov
2025-11-24 21:29   ` Pablo Neira Ayuso
2025-10-19 15:57 ` [PATCHv6 net-next 12/14] ipvs: use more keys for connection hashing Julian Anastasov
2025-10-19 15:57 ` [PATCHv6 net-next 13/14] ipvs: add ip_vs_status info Julian Anastasov
2025-11-24 21:42   ` Pablo Neira Ayuso
2025-10-19 15:57 ` [PATCHv6 net-next 14/14] ipvs: add conn_lfactor and svc_lfactor sysctl vars Julian Anastasov
2025-11-24 21:46 ` Pablo Neira Ayuso [this message]
